...
| * | sched/fair: use min capacity when evaluating idle backup cpus  (Ionela Voinescu, 2023-07-26)

    When we are calculating the impact of placing a task on a specific cpu, we should include the information that a minimum capacity may be imposed upon that cpu, which can change the performance and/or energy cost decisions.

    When choosing an idle backup CPU, favour CPUs that won't end up running at a high OPP due to a min capacity cap imposed by external actors.

    Change-Id: I566623ffb3a7c5b61a23242dcce1cb4147ef8a4a
    Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
    Signed-off-by: Chris Redpath <chris.redpath@arm.com>
| * | sched/fair: use min capacity when evaluating placement energy costs  (Ionela Voinescu, 2023-07-26)

    Add the ability to track the minimum capacity forced onto a sched_group by some external actor. group_max_util returns the highest utilisation inside a sched_group and is used when we are trying to calculate an energy cost estimate for a specific scheduling scenario. Minimum capacities imposed from elsewhere will influence this energy cost, so we should reflect that here.

    Change-Id: Ibd537a6dbe6d67b11cc9e9be18f40fcb2c0f13de
    Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
    Signed-off-by: Chris Redpath <chris.redpath@arm.com>
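A rough illustration of the clamp described above — a minimal sketch, not the literal patch; capacity_min_of() and the bare group walk are assumed names:

        /*
         * Sketch: clamp each CPU's utilisation from below by the minimum
         * capacity cap imposed by external actors, so the energy model
         * prices in the OPP floor. capacity_min_of() is an assumed helper.
         */
        static unsigned long group_max_util(struct sched_group *sg)
        {
                unsigned long max_util = 0;
                int cpu;

                for_each_cpu(cpu, sched_group_cpus(sg))
                        max_util = max(max_util,
                                       max(cpu_util(cpu), capacity_min_of(cpu)));

                return max_util;
        }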
| * | sched/fair: introduce minimum capacity capping sched feature  (Ionela Voinescu, 2023-07-26)

    We have the ability to track the minimum capacity forced onto a CPU by userspace or external actors. This is provided through a minimum frequency scale factor exposed by arch_scale_min_freq_capacity. The use of this information is enabled through the MIN_CAPACITY_CAPPING feature. If not enabled, the minimum frequency scale factor will remain 0 and will not impact energy estimation or scheduling decisions.

    Change-Id: Ibc61f2bf4fddf186695b72b262e602a6e8bfde37
    Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
    Signed-off-by: Chris Redpath <chris.redpath@arm.com>
| * | sched: add arch_scale_min_freq_capacity to track minimum capacity caps  (Ionela Voinescu, 2023-07-26)

    If the minimum capacity of a group is capped by userspace or by internal dependencies not otherwise visible to the scheduler, we need a way to see these caps and integrate them into the energy calculations and task placement decisions we make. Add arch_scale_min_freq_capacity to determine the lowest capacity which a specific cpu can provide under the current set of known constraints.

    Change-Id: Ied4a1dc0982bbf42cb5ea2f27201d4363db59705
    Signed-off-by: Chris Redpath <chris.redpath@arm.com>
    Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
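Given the note above that the scale factor "will remain 0" unless overridden, the default hook presumably looks something like this — a sketch only, mirroring how the existing frequency/cpu invariance hooks are wired:

        #ifndef arch_scale_min_freq_capacity
        static __always_inline
        unsigned long arch_scale_min_freq_capacity(struct sched_domain *sd, int cpu)
        {
                /* No min capacity floor known to the scheduler: report 0,
                 * so energy estimation and placement are unaffected. */
                return 0;
        }
        #endif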
| * | sched/fair: introduce an arch scaling function for max frequency capping  (Dietmar Eggemann, 2023-07-26)

    The max frequency scaling factor is defined as:

        max_freq_scale = policy_max_freq / cpuinfo_max_freq

    To be able to scale the cpu capacity by this factor, introduce a call to the new arch scaling function arch_scale_max_freq_capacity() in update_cpu_capacity() and provide a default implementation which returns SCHED_CAPACITY_SCALE.

    Another subsystem (e.g. cpufreq) can overwrite this default implementation, exactly as for frequency and cpu invariance. It has to be enabled by the arch by defining arch_scale_max_freq_capacity to the actual implementation.

    Change-Id: I266cd1f4c1c82f54b80063c36aa5f7662599dd28
    Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Signed-off-by: Chris Redpath <chris.redpath@arm.com>
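Read literally, the message implies a default hook plus a scaling step in update_cpu_capacity(); a minimal sketch of that shape (the exact placement and signature are assumptions):

        #ifndef arch_scale_max_freq_capacity
        static __always_inline
        unsigned long arch_scale_max_freq_capacity(struct sched_domain *sd, int cpu)
        {
                /* Default: no policy cap, i.e. max_freq_scale == 1 in
                 * SCHED_CAPACITY_SCALE fixed point. */
                return SCHED_CAPACITY_SCALE;
        }
        #endif

        /* In update_cpu_capacity(), scale capacity by the factor (sketch): */
        capacity = (capacity * arch_scale_max_freq_capacity(sd, cpu))
                        >> SCHED_CAPACITY_SHIFT;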
| * | sched/fair: reduce rounding errors in energy computations  (Patrick Bellasi, 2023-07-26)

    The SG's energy is obtained by adding busy and idle contributions, each computed as a fraction of SCHED_CAPACITY_SCALE defined by the SG's utilizations. By scaling each and every contribution as it is computed, we risk accumulating rounding errors, which can result in a non-null energy_delta even in cases where the same total accumulated utilization is merely distributed differently among CPUs.

    To reduce rounding errors, this patch accumulates non-scaled busy/idle energy contributions for each visited SG, and scales each of them just once at the end.

    Change-Id: Idf8367fee0ac11938c6436096f0c1b2d630210d2
    Suggested-by: Joonwoo Park <joonwoop@codeaurora.org>
    Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
    Signed-off-by: Chris Redpath <chris.redpath@arm.com>
    Signed-off-by: Quentin Perret <quentin.perret@arm.com>
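The accumulate-then-scale idea, as a sketch (the iteration and helper names are assumptions; the real code walks the sched domain/group hierarchy):

        unsigned long busy = 0, idle = 0, total;
        struct sched_group *sg;

        /* Accumulate raw (unscaled) contributions across all visited SGs. */
        for_each_visited_sg(sg) {                 /* assumed iteration helper */
                busy += sg_busy_energy_raw(sg);   /* assumed helper */
                idle += sg_idle_energy_raw(sg);   /* assumed helper */
        }

        /* Divide by SCHED_CAPACITY_SCALE only once at the end, so the
         * per-SG truncation of the old code cannot accumulate. */
        total = (busy + idle) >> SCHED_CAPACITY_SHIFT;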
| * | sched/fair: re-factor energy_diff to use a single (extensible) energy_env  (Patrick Bellasi, 2023-07-26)

    The energy_env data structure is used to cache values required by the multiple functions involved in the energy_diff computation. Some of these functions require additional parameters which can easily be embedded into the energy_env itself.

    The current implementation of energy_diff hardcodes the usage of two different energy_env structures to estimate and compare the energy consumption related to a "before" and an "after" CPU. Moreover, it does this energy estimation by walking the SDs/SGs data structures multiple times. A better design can be achieved by using the energy_env structure to support a more efficient and concurrent evaluation of multiple schedule candidates.

    To this purpose, this patch provides a complete re-factoring of the energy_diff implementation to:

    1. use a single energy_env structure for the evaluation of all the candidate CPUs
    2. walk the SDs/SGs just once, thus improving the overall performance of computing the energy estimation for each CPU candidate specified by the single energy_env
    3. simplify the code (at least if you look at the new version and not at this re-factoring patch), thus providing cleaner code to maintain and extend with additional features

    This patch updates all the clients of energy_env to use only the data provided by this structure and an index for one of its CPU candidates. Embedding everything within the energy_env will make it simple to add tracepoints for this new version, which can easily provide a holistic view of how energy_diff evaluated the proposed CPU candidates.

    The new proposed structure, for both "struct energy_env" and the functions using it, is designed to easily accommodate further extensions (e.g. SchedTune filtering) without requiring another big re-factoring of these core functions. Finally, the CPU candidates array embedded into the energy_env also allows seamlessly extending the exploration to multiple candidate CPUs, for example to support the comparison of a spread-vs-packing strategy.

    Change-Id: Ic04ffb6848b2c763cf1788767f22c6872eb12bee
    Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
    Signed-off-by: Chris Redpath <chris.redpath@arm.com>
    [reworked find_new_capacity() and enforced the respect of find_best_target() selection order]
    Signed-off-by: Quentin Perret <quentin.perret@arm.com>
    [@0ctobot: Adapted for kernel.lnx.4.4.r35-rel]
    Signed-off-by: Adam W. Willis <return.of.octobot@gmail.com>
| * | sched/fair: cleanup select_energy_cpu_brute to be more consistent  (Patrick Bellasi, 2023-07-26)

    The current definition of select_energy_cpu_brute is a bit confusing in how it defines the value of target_cpu to be returned as the wakeup CPU for the specified task. This patch cleans up the code by ensuring that we always set target_cpu right before returning it. rcu_read_lock and the check on *sd != NULL are also moved around to be exactly where they are required.

    Change-Id: I70a4b558b3624a13395da1a87ddc0776fd1d6641
    Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
    Signed-off-by: Chris Redpath <chris.redpath@arm.com>
    Signed-off-by: Quentin Perret <quentin.perret@arm.com>
| * | sched/fair: remove capacity tracking from energy_diff  (Patrick Bellasi, 2023-07-26)

    In preparation for the energy_diff refactoring, let's remove all the SchedTune-specific bits which are used to keep track of the capacity variations required by the PESpace filtering. This also removes the energy normalization function and the wrapper of energy_diff which is used to trigger a PESpace filtering via schedtune_accept_deltas(). The remaining code is the "original" energy_diff function, which looks just at the energy variations to compare prev_cpu vs next_cpu.

    Change-Id: I4fb1d1c5ba45a364e6db9ab8044969349aba0307
    Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
    Signed-off-by: Chris Redpath <chris.redpath@arm.com>
    Signed-off-by: Quentin Perret <quentin.perret@arm.com>
| * | sched/fair: remove energy_diff tracepoint in preparation to re-factoring  (Patrick Bellasi, 2023-07-26)

    The format of the energy_diff tracepoint is going to be changed by the following energy_diff refactoring patches. Let's remove it now to start from a clean slate.

    Change-Id: Id4f537ed60d90a7ddcca0a29a49944bfacb85c8c
    Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
    Signed-off-by: Chris Redpath <chris.redpath@arm.com>
    Signed-off-by: Quentin Perret <quentin.perret@arm.com>
| * | sched/fair: use *p to reference task_structs  (Chris Redpath, 2023-07-26)

    This is a simple renaming patch which just aligns with the most common code convention used in fair.c: task_struct pointers are usually named *p.

    Change-Id: Id0769e52b6a271014d89353fdb4be9bb721b5b2f
    Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
    Signed-off-by: Chris Redpath <chris.redpath@arm.com>
    Signed-off-by: Quentin Perret <quentin.perret@arm.com>
* | | msm8996: Fix kernel compilation errors  (Raghuram Subramani, 2024-10-17)

    error: a function declaration without a prototype is deprecated in all versions of C

    Change-Id: Iea020e1a126d23f5c8056807ac9c02a79493153b
* | | Merge remote-tracking branch 'msm8998/lineage-20' into lineage-20  (Raghuram Subramani, 2024-10-17)

    Change-Id: I126075a330f305c85f8fe1b8c9d408f368be95d1
* | | Merge lineage-20 of git@github.com:LineageOS/android_kernel_qcom_msm8998.git into lineage-20  (Davide Garberi, 2023-08-06)
|\| |
    7d11b1a7a11c Revert "sched: cpufreq: Use sched_clock instead of rq_clock when updating schedutil"
    daaa5da96a74 sched: Take irq_sparse lock during the isolation
    217ab2d0ef91 rcu: Speed up calling of RCU tasks callbacks
    997b726bc092 kernel: power: Workaround for sensor ipc message causing high power consume
    b933e4d37bc0 sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
    82d3f23d6dc5 sched/fair: Fix bandwidth timer clock drift condition
    629bfed360f9 kernel: power: qos: remove check for core isolation while cluster LPMs
    891a63210e1d sched/fair: Fix issue where frequency update not skipped
    b775cb29f663 ANDROID: Move schedtune en/dequeue before schedutil update triggers
    ebdb82f7b34a sched/fair: Skip frequency updates if CPU about to idle
    ff383d94478a FROMLIST: sched: Make iowait_boost optional in schedutil
    9539942cb065 FROMLIST: cpufreq: Make iowait boost a policy option
    b65c91c9aa14 ARM: dts: msm: add HW CPU's busy-cost-data for additional freqs
    72f13941085b ARM: dts: msm: fix CPU's idle-cost-data
    ab88411382f7 ARM: dts: msm: fix EM to be monotonically increasing
    83dcbae14782 ARM: dts: msm: Fix EAS idle-cost-data property length
    33d3b17bfdfb ARM: dts: msm: Add msm8998 energy model
    c0fa7577022c sched/walt: Re-add code to allow WALT to function
    d5cd35f38616 FROMGIT: binder: use EINTR for interrupted wait for work
    db74739c86de sched: Don't fail isolation request for an already isolated CPU
    aee7a16e347b sched: WALT: increase WALT minimum window size to 20ms
    4dbe44554792 sched: cpufreq: Use per_cpu_ptr instead of this_cpu_ptr when reporting load
    ef3fb04c7df4 sched: cpufreq: Use sched_clock instead of rq_clock when updating schedutil
    c7128748614a sched/cpupri: Exclude isolated CPUs from the lowest_mask
    6adb092856e8 sched: cpufreq: Limit governor updates to WALT changes alone
    0fa652ee00f5 sched: walt: Correct WALT window size initialization
    41cbb7bc59fb sched: walt: fix window misalignment when HZ=300
    43cbf9d6153d sched/tune: Increase the cgroup limit to 6
    c71b8fffe6b3 drivers: cpuidle: lpm-levels: Fix KW issues with idle state idx < 0
    938e42ca699f drivers: cpuidle: lpm-levels: Correctly check for list empty
    8d8a48aecde5 sched/fair: Fix load_balance() affinity redo path
    eccc8acbe705 sched/fair: Avoid unnecessary active load balance
    0ffdb886996b BACKPORT: sched/core: Fix rules for running on online && !active CPUs
    c9999f04236e sched/core: Allow kthreads to fall back to online && !active cpus
    b9b6bc6ea3c0 sched: Allow migrating kthreads into online but inactive CPUs
    a9314f9d8ad4 sched/fair: Allow load bigger task load balance when nr_running is 2
    c0b317c27d44 pinctrl: qcom: Clear status bit on irq_unmask
    45df1516d04a UPSTREAM: mm: fix misplaced unlock_page in do_wp_page()
    899def5edcd4 UPSTREAM: mm/ksm: Remove reuse_ksm_page()
    46c6fbdd185a BACKPORT: mm: do_wp_page() simplification
    90dccbae4c04 UPSTREAM: mm: reuse only-pte-mapped KSM page in do_wp_page()
    ebf270d24640 sched/fair: vruntime should normalize when switching from fair
    cbe0b37059c9 mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct
    12d40f1995b4 msm: mdss: Fix indentation
    620df03a7229 msm: mdss: Treat polling_en as the bool that it is
    12af218146a6 msm: mdss: add idle state node
    13e661759656 cpuset: Restore tasks affinity while moving across cpusets
    602bf4096dab genirq: Honour IRQ's affinity hint during migration
    9209b5556f6a power: qos: Use effective affinity mask
    f31078b5825f genirq: Introduce effective affinity mask
    58c453484f7e sched/cputime: Mitigate performance regression in times()/clock_gettime()
    400383059868 kernel: time: Add delay after cpu_relax() in tight loops
    1daa7ea39076 pinctrl: qcom: Update irq handle for GPIO pins
    07f7c9961c7c power: smb-lib: Fix mutex acquisition deadlock on PD hard reset
    094b738f46c8 power: qpnp-smb2: Implement battery charging_enabled node
    d6038d6da57f ASoC: msm-pcm-q6-v2: Add dsp buf check
    0d7a6c301af8 qcacld-3.0: Fix OOB in wma_scan_roam.c

    Change-Id: Ia2e189e37daad6e99bdb359d1204d9133a7916f4
| * | Revert "sched: cpufreq: Use sched_clock instead of rq_clock when updating ↵Georg Veichtlbauer2023-07-26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | schedutil" That commit should have changed rq_clock to sched_clock, instead of sched_ktime_clock, which kept schedutil from making correct decisions. This reverts commit ef3fb04c7df43dfa1793e33f764a2581cda96310. Change-Id: Id4118894388c33bf2b2d3d5ee27eb35e82dc4a96
| * | sched: Take irq_sparse lock during the isolation  (Prasad Sodagudi, 2023-07-16)

    irq_migrate_all_off_this_cpu() is used to migrate IRQs, and it checks for all active irqs in the allocated_irqs mask. irq_migrate_all_off_this_cpu() expects the caller to take the irq_sparse lock to avoid race conditions while accessing the allocated_irqs mask variable. Prevent a race between irq alloc/free and irq migration by taking the irq_sparse lock across CPU isolation.

    Change-Id: I9edece1ecea45297c8f6529952d88b3133046467
    Signed-off-by: Prasad Sodagudi <psodagud@codeaurora.org>
| * | rcu: Speed up calling of RCU tasks callbacks  (Steven Rostedt (VMware), 2023-07-16)

    Joel Fernandes found that synchronize_rcu_tasks() was taking a significant amount of time. He demonstrated it with the following test:

        # cd /sys/kernel/tracing
        # while [ 1 ]; do x=1; done &
        # echo '__schedule_bug:traceon' > set_ftrace_filter
        # time echo '!__schedule_bug:traceon' > set_ftrace_filter;

        real    0m1.064s
        user    0m0.000s
        sys     0m0.004s

    It takes a little over a second to perform the synchronize, because there's a loop that waits 1 second at a time for tasks to get through their quiescent points when there's a task that must be waited for.

    After discussion we came up with a simple way to wait for holdouts: increase the wait time with each iteration of the loop, but by no more than a full second. With the new patch we have:

        # time echo '!__schedule_bug:traceon' > set_ftrace_filter;

        real    0m0.131s
        user    0m0.000s
        sys     0m0.004s

    Which drops it down to 13% of the original wait time.

    Link: http://lkml.kernel.org/r/20180523063815.198302-2-joel@joelfernandes.org
    Reported-by: Joel Fernandes (Google) <joel@joelfernandes.org>
    Suggested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
    Change-Id: I40bcecdfdb2a1cdae7195f1d3b107455ea4b26b1
    Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
| * | kernel: power: Workaround for sensor ipc message causing high power consume  (Frank Luo, 2023-07-16)

    Sync from Qcom's document KBA-180725024109. To avoid non-wakeup sensor data breaking the AP sleep flow, notify the sensor subsystem at the start of pm_suspend.

    Bug: 118418963
    Test: measure power consumption after running test case
    Change-Id: I2848230d495e30ac462aef148b3f885103d9c24e
    Signed-off-by: Frank Luo <luofrank@google.com>
| * | sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices  (Dave Chiluk, 2023-07-16)

    commit de53fd7aedb100f03e5d2231cfce0e4993282425 upstream.

    It has been observed that highly-threaded, non-cpu-bound applications running under cpu.cfs_quota_us constraints can hit a high percentage of periods throttled while simultaneously not consuming the allocated amount of quota. This use case is typical of user-interactive, non-cpu-bound applications, such as those running in kubernetes or mesos when run on multiple cpu cores.

    This has been root caused to cpu-local run queues being allocated per-cpu bandwidth slices and then not fully using a slice within the period, at which point the slice and quota expire. This expiration of unused slice results in applications not being able to utilize the quota for which they are allocated.

    The non-expiration of per-cpu slices was recently fixed by 'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")'. Prior to that, it appears this had been broken since at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some cfs_b->quota/period")', which was introduced in v3.16-rc1 in 2014. That added the following conditional, which resulted in slices never being expired:

        if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
                /* extend local deadline, drift is bounded above by 2 ticks */
                cfs_rq->runtime_expires += TICK_NSEC;

    Because this was broken for nearly 5 years, and has recently been fixed and is now being noticed by many users running kubernetes (https://github.com/kubernetes/kubernetes/issues/67577), it is my opinion that the mechanisms around expiring runtime should be removed altogether.

    This allows quota already allocated to per-cpu run-queues to live longer than the period boundary. This allows threads on runqueues that do not use much CPU to continue to use their remaining slice over a longer period of time than cpu.cfs_period_us, and helps prevent the above condition of hitting throttling while also not fully utilizing your cpu quota.

    This theoretically allows a machine to use slightly more than its allotted quota in some periods. The overflow is bounded by the remaining quota left on each per-cpu runqueue, typically no more than min_cfs_rq_runtime=1ms per cpu. For CPU bound tasks this changes nothing, as they should theoretically fully utilize all of their quota in each period. For user-interactive tasks as described above, this provides a much better user/application experience, as their cpu utilization will more closely match the amount they requested when they hit throttling. This means that cpu limits no longer strictly apply per period for non-cpu bound applications, but they are still accurate over longer timeframes.

    This greatly improves performance of high-thread-count, non-cpu-bound applications with low cfs_quota_us allocation on high-core-count machines. In the case of an artificial testcase (10ms/100ms of quota on an 80 CPU machine), this commit resulted in an almost 30x performance improvement, while still maintaining correct cpu quota restrictions. That testcase is available at https://github.com/indeedeng/fibtest.

    Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")
    Change-Id: I7d7a39fb554ec0c31f9381f492165f43c70b3924
    Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Phil Auld <pauld@redhat.com>
    Reviewed-by: Ben Segall <bsegall@google.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: John Hammond <jhammond@indeed.com>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Kyle Anderson <kwa@yelp.com>
    Cc: Gabriel Munos <gmunoz@netflix.com>
    Cc: Peter Oskolkov <posk@posk.io>
    Cc: Cong Wang <xiyou.wangcong@gmail.com>
    Cc: Brendan Gregg <bgregg@netflix.com>
    Link: https://lkml.kernel.org/r/1563900266-19734-2-git-send-email-chiluk+linux@indeed.com
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * | sched/fair: Fix bandwidth timer clock drift condition  (Xunlei Pang, 2023-07-16)

    commit 512ac999d2755d2b7109e996a76b6fb8b888631d upstream.

    I noticed that cgroup task groups constantly get throttled even if they have low CPU usage. This causes some jitter in the response time of some of our business containers when enabling CPU quotas.

    It's very simple to reproduce:

        mkdir /sys/fs/cgroup/cpu/test
        cd /sys/fs/cgroup/cpu/test
        echo 100000 > cpu.cfs_quota_us
        echo $$ > tasks

    then repeat:

        cat cpu.stat | grep nr_throttled   # nr_throttled will increase steadily

    After some analysis, we found that cfs_rq::runtime_remaining will be cleared by expire_cfs_rq_runtime() due to two equal but stale "cfs_{b|q}->runtime_expires" after the period timer is re-armed.

    The current condition used to judge clock drift in expire_cfs_rq_runtime() is wrong: the two runtime_expires are actually the same when clock drift happens, so this condition can never hit.

    The original design was correctly done by this commit:

        a9cf55b28610 ("sched: Expire invalid runtime")

    ... but was changed to the current implementation due to its locking bug.

    This patch introduces another way: it adds a new field to both struct cfs_rq and struct cfs_bandwidth to record the expiration update sequence, and uses them to figure out whether clock drift has happened (true if they are equal).

    Change-Id: I8168fe3b45785643536f289ea823d1a62d9d8ab2
    Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    [alakeshh: backport: Fixed merge conflicts:
     - sched.h: Fix the indentation and order in which the variables are declared to match the coding style of the existing code in 4.14. Struct members of the same type were declared on separate lines in the upstream patch; this has been changed back to multiple members of the same type on one line, e.g. "int a; int b;" -> "int a, b;"]
    Signed-off-by: Alakesh Haloi <alakeshh@amazon.com>
    Reviewed-by: Ben Segall <bsegall@google.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: <stable@vger.kernel.org> # 4.14.x
    Fixes: 51f2176d74ac ("sched/fair: Fix unlocked reads of some cfs_b->quota/period")
    Link: http://lkml.kernel.org/r/20180620101834.24455-1-xlpang@linux.alibaba.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * | kernel: power: qos: remove check for core isolation while cluster LPMs  (Raghavendra Kakarla, 2023-07-16)

    When all cores in a cluster are isolated, the PMQoS latency constraint set by the clock driver to switch the PLL is ignored. The cluster then enters L2PC, and the SPM tries to disable the PLL at the same time as the clock driver tries to switch the PLL from the other cluster, which leads to synchronization issues.

    The fix is to honor the PMQoS request for cluster LPMs even when all cores are isolated.

    Change-Id: I4296e16ef4e9046d1fbe3b7378e9f61a2f11c74d
    Signed-off-by: Raghavendra Kakarla <rkakarla@codeaurora.org>
| * | sched/fair: Fix issue where frequency update not skipped  (Joel Fernandes, 2023-07-16)

    This patch fixes one of the infrequent conditions in commit 54b6baeca500 ("sched/fair: Skip frequency updates if CPU about to idle") where we could have skipped a frequency update. The fix is to use the correct flag which skips freq updates. Note that this is a rare issue (it can show up only during CFS throttling), and even then we just do an additional frequency update, which we were doing anyway before the above patch.

    Bug: 64689959
    Change-Id: I0117442f395cea932ad56617065151bdeb9a3b53
    Signed-off-by: Joel Fernandes <joelaf@google.com>
| * | ANDROID: Move schedtune en/dequeue before schedutil update triggers  (Chris Redpath, 2023-07-16)

    CPU rq util updates happen when rq signals are updated as part of enqueue and dequeue operations. Doing these updates triggers a call to the registered util update handler, which takes schedtune boosting into account. Enqueueing the task in the correct schedtune group after this happens means that we will potentially not see the boost for an entire throttle period.

    Move the enqueue/dequeue operations for schedtune before the signal updates which can trigger OPP changes, as sketched below.

    Change-Id: I4236e6b194bc5daad32ff33067d4be1987996780
    Signed-off-by: Chris Redpath <chris.redpath@arm.com>
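A hedged sketch of the ordering change on the enqueue side (the surrounding body is abbreviated; schedtune_enqueue_task() is the Android helper named here on the assumption it matches this tree):

        static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
        {
                /*
                 * Account the task to its schedtune boost group *before* the
                 * rq signal updates below, so the cpufreq hook fired by those
                 * updates already sees the new boost value.
                 */
                schedtune_enqueue_task(p, cpu_of(rq));

                for_each_sched_entity(se) {
                        /* enqueue_entity() updates rq utilisation and may
                         * trigger the registered util update handler. */
                        enqueue_entity(cfs_rq, se, flags);
                }
                /* ... */
        }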
| * | sched/fair: Skip frequency updates if CPU about to idle  (Joel Fernandes, 2023-07-16)

    If the CPU is about to idle, prevent a frequency update. With this, the number of schedutil governor wakeups is reduced by more than half on a test playing bluetooth audio.

    Test: sugov wakeups drop by more than half when playing music with screen off (476 / 1092)
    Bug: 64689959
    Change-Id: I400026557b4134c0ac77f51c79610a96eb985b4a
    Signed-off-by: Joel Fernandes <joelaf@google.com>
| * | FROMLIST: sched: Make iowait_boost optional in schedutil  (Joel Fernandes, 2023-07-16)

    We should apply the iowait boost only if the cpufreq policy has iowait boost enabled. Also make it a schedutil configuration in sysfs so it can be turned on/off if needed (by default, initialize it to the policy value). For systems that don't need/want it enabled, such as battery-operated arm64 mobile devices, it saves energy when the cpufreq driver policy doesn't have it enabled (details below).

    Here are some results for energy measurements collected running a YouTube video for 30 seconds:

        Before: 8.042533 mWh
        After:  7.948377 mWh

    Energy savings is ~1.2%.

    Bug: 38010527
    Link: https://lkml.org/lkml/2017/5/19/42
    Change-Id: If124076ad0c16ade369253840dedfbf870aff927
    Signed-off-by: Joel Fernandes <joelaf@google.com>
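A minimal sketch of the guard this describes (the tunable field name iowait_boost_enable and its placement on the sugov tunables struct are assumptions, not the literal patch):

        static void sugov_set_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
                                           unsigned int flags)
        {
                struct sugov_policy *sg_policy = sg_cpu->sg_policy;

                /* Respect the per-policy sysfs switch before boosting. */
                if (!sg_policy->tunables->iowait_boost_enable)
                        return;

                if (flags & SCHED_CPUFREQ_IOWAIT)
                        sg_cpu->iowait_boost = sg_cpu->iowait_boost_max;
        }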
| * | FROMLIST: cpufreq: Make iowait boost a policy option  (Joel Fernandes, 2023-07-16)

    Make iowait boost a cpufreq policy option and enable it for the intel_pstate cpufreq driver. Governors like schedutil can use it to determine whether boosting for tasks that wake up with p->in_iowait set is needed.

    Bug: 38010527
    Link: https://lkml.org/lkml/2017/5/19/43
    Change-Id: Icf59e75fbe731dc67abb28fb837f7bb0cd5ec6cc
    Signed-off-by: Joel Fernandes <joelaf@google.com>
| * | ARM: dts: msm: add HW CPU's busy-cost-data for additional freqs  (Andres Oportus, 2023-07-16)

    The initial Energy Model was calculated on a device with fewer available frequencies. This change adds the missing values; note that all performance values had to be updated so they would be re-normalized to 0-1024.

    Bug: 64837462
    Test: YouTube did not have energy regression
    Change-Id: I2b4c62d06e39fe0da524af96568187042664d62a
    Signed-off-by: Andres Oportus <andresoportus@google.com>
| * | ARM: dts: msm: fix CPU's idle-cost-data  (Patrick Bellasi, 2023-07-16)

    CPU idle states are mapped into EAS energy model data structures according to this table:

        cpu_idle_status   idle-cost-data index   meaning                   expected energy cost
        ---------------   --------------------   -----------------------   --------------------
              -1                    0            CPU active                CPU energy > 0
               0                    1            CPU WFI                   CPU energy > 0
               1                    2            CPU off (cluster on)      CPU energy = 0
               2                    3            CPU off (cluster off)     CPU energy = 0

    Change-Id: I4b51bb74cb96c265731f3872c95947474db973ac
    Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
| * | ARM: dts: msm: fix EM to be monotonically increasing  (Patrick Bellasi, 2023-07-16)

    Change-Id: Iad2e3882a2e9d7dbbfd80cf485bbb1f0e664b04f
    Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
| * | ARM: dts: msm: Fix EAS idle-cost-data property length  (Siqi Lin, 2023-07-16)

    We need 4 idle-cost-data entries for CPUs, despite cpu_idle supporting only 3 different idle states. The idle-cost-data property should always be one entry longer than the number of available cpu_idle states, and it has to have the same length for both CLUSTER_COST_N and CPU_COST_N.

    Bug: 37641804
    Change-Id: Ic14a6a1ef4409e81c5adc23575f7d1157d6eadce
    Signed-off-by: Siqi Lin <siqilin@google.com>
| * | ARM: dts: msm: Add msm8998 energy model  (Rashed Abdel-Tawab, 2023-07-16)

    Squash of commits:

        ed6442938f08: Enable EAS in 8998 MTP
        3989a0e22e44: Update Energy Model using Muskie
        922c6f4b9e8b: Added idle-cost-data to energy model and fixed busy-cost-data for big cluster cpus

    Change-Id: I717eb88204f5e28a1afd494dc484895cc749e2fc
| * | sched/walt: Re-add code to allow WALT to function  (Ethan Chen, 2023-07-16)

    Change-Id: Ieb1067c5e276f872ed4c722b7d1fabecbdad87e7
| * | FROMGIT: binder: use EINTR for interrupted wait for work  (Marco Ballesio, 2023-07-16)

    When interrupted by a signal, binder_wait_for_work currently returns -ERESTARTSYS. This error code isn't propagated to user space, but a way to handle interruption due to signals must be provided to code using this API.

    Replace this instance of -ERESTARTSYS with -EINTR, which is propagated to user space.

    Bug: 180989544
    (cherry picked from commit 48f10b7ed0c23e2df7b2c752ad1d3559dad007f9 git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc.git char-misc-testing)
    Signed-off-by: Marco Ballesio <balejs@google.com>
    Signed-off-by: Li Li <dualli@google.com>
    Test: built, booted, interrupted a worker thread within
    Acked-by: Todd Kjos <tkjos@google.com>
    Link: https://lore.kernel.org/r/20210316011630.1121213-3-dualli@chromium.org
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Change-Id: Ie6c7993cab699bc2c1a25a2f9d94b200a1156e5d
| * | sched: Don't fail isolation request for an already isolated CPU  (Pavankumar Kondeti, 2023-07-16)

    When isolating a CPU, a check is performed to see if there is only 1 active CPU in the system; if so, the CPU is not isolated. However, this check is done before testing whether the requested CPU is already isolated. If the requested CPU is already isolated, there is no need to fail the isolation even when there is only 1 active CPU in the system.

    For example, CPUs 0-6 are isolated on an 8 CPU machine. When an isolation request comes for CPU6, which is already isolated, the current code fails the request, thinking we would end up with no active CPU in the system.

    Change-Id: I28fea4ff67ffed82465e5cfa785414069e4a180a
    Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
| * | sched: WALT: increase WALT minimum window size to 20ms  (Joonwoo Park, 2023-07-16)

    Increase the WALT minimum window size to 20ms. 10ms isn't large enough to capture a workload's pattern.

    [beykerykt]: Adapt for HMP
    Change-Id: I4d69577fbfeac2bc23db4ff414939cc51ada30d6
    Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
| * | sched: cpufreq: Use per_cpu_ptr instead of this_cpu_ptr when reporting load  (Vikram Mulukutla, 2023-07-16)

    We need cpufreq_update_util to report load for the CPU corresponding to the rq that is passed in as an argument, rather than the CPU executing cpufreq_update_util.

    Change-Id: I8473f230d40928d5920c614760e96fef12745d5a
    Signed-off-by: Vikram Mulukutla <markivx@codeaurora.org>
| * | sched: cpufreq: Use sched_clock instead of rq_clock when updating schedutil  (Vikram Mulukutla, 2023-07-16)

    rq_clock may not be updated often enough for schedutil or other cpufreq governors to work correctly when it's passed as the timestamp for a load report. Use sched_clock instead.

    [beykerykt]: Switch to sched_ktime_clock()
    Change-Id: I745b727870a31da25f766c2c2f37527f568c20da
    Signed-off-by: Vikram Mulukutla <markivx@codeaurora.org>
| * | sched/cpupri: Exclude isolated CPUs from the lowest_mask  (Pavankumar Kondeti, 2023-07-16)

    cpupri_find() returns, in the lowest_mask, the candidate CPUs running at lower priority than the waking RT task. This mask contains isolated CPUs as well. Since the energy aware CPU selection skips isolated CPUs, no target CPU may be found when all unisolated CPUs are running higher priority RT tasks. In that case, we fall back to the default CPU selection algorithm, which returns an isolated CPU. This decision is reversed by select_task_rq(), which returns an unisolated CPU that is busy with other RT tasks. This RT task packing is desired behavior. However, the RT push mechanism then pushes the packed RT task to an isolated CPU. This can be avoided by excluding isolated CPUs from the lowest_mask returned by cpupri_find().

    Change-Id: I75486b3935caf496a638d0333565beffc47fe249
    Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
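The exclusion itself reduces to a one-line mask operation, roughly (the mask name cpu_isolated_mask is the usual msm convention, assumed here):

        /* Drop isolated CPUs from the candidate mask built by cpupri_find(),
         * so neither wakeup placement nor RT push can pick one. */
        cpumask_andnot(lowest_mask, lowest_mask, cpu_isolated_mask);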
| * | sched: cpufreq: Limit governor updates to WALT changes alone  (Vikram Mulukutla, 2023-07-16)

    It's not necessary to keep reporting load to the governor if it doesn't change within a window. Limit updates to the points where we expect load changes: after window rollover, and when we send updates related to inter-cluster migrations.

    [beykerykt]: Adapt for HMP
    Change-Id: I3232d40f3d54b0b81cfafdcdb99b534df79327bf
    Signed-off-by: Vikram Mulukutla <markivx@codeaurora.org>
| * | sched: walt: Correct WALT window size initialization  (Vikram Mulukutla, 2023-07-16)

    It is preferable that WALT window rollover occurs just before a tick, since the tick is an opportune moment to record a complete window's statistics and report them to the cpu frequency governor. When the window size isn't an integral multiple of TICK_NSEC for the configured CONFIG_HZ, this requirement may be violated. Account for this by reducing the WALT window size to the nearest multiple of TICK_NSEC, as sketched below.

    Commit d368c6faa19b ("sched: walt: fix window misalignment when HZ=300") attempted to do this, but WALT isn't using MIN_SCHED_RAVG_WINDOW as the window size, so that patch was doing nothing. Also, change the type of 'walt_disabled' to bool and warn if an invalid window size causes WALT to be disabled.

    [beykerykt]: Adapt for HMP
    Change-Id: Ie3dcfc21a3df4408254ca1165a355bbe391ed5c7
    Signed-off-by: Vikram Mulukutla <markivx@codeaurora.org>
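A sketch of the alignment arithmetic (walt_ravg_window is an assumed variable name; MIN_SCHED_RAVG_WINDOW and walt_disabled come from the message above):

        /* Round the window down to a whole number of ticks, e.g. with
         * HZ=300, TICK_NSEC=3333333 and a requested 20ms window:
         * (20000000 / 3333333) * 3333333 = 19999998 ns. */
        walt_ravg_window = (walt_ravg_window / TICK_NSEC) * TICK_NSEC;
        if (walt_ravg_window < MIN_SCHED_RAVG_WINDOW) {
                walt_disabled = true;
                WARN(1, "invalid WALT window size, disabling WALT\n");
        }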
| * | sched: walt: fix window misalignment when HZ=300  (Joonwoo Park, 2023-07-16)

    Due to rounding error, the hrtimer tick interval becomes 3333333 ns when HZ=300. Consequently, the tick timestamp nearest to WALT's default 20ms window size is 19999998 ns (3333333 * 6).

    [beykerykt]: Adapt for HMP
    Change-Id: I08f9bd2dbecccbb683e4490d06d8b0da703d3ab2
    Suggested-by: Joel Fernandes <joelaf@google.com>
    Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
| * | sched/tune: Increase the cgroup limit to 6  (Pavankumar Kondeti, 2023-07-16)

    The schedtune cgroup controller allows up to 5 cgroups, including the default/root cgroup. Until now user space was creating only 4 additional cgroups, namely foreground, background, top-app and audio-app. Recently another cgroup called rt was added before the audio-app cgroup. Since the kernel limits the cgroups to 5, the creation of the audio-app cgroup is failing. Fix this by increasing the schedtune cgroup limit to 6.

    Change-Id: I13252a90dba9b8010324eda29b8901cb0b20bc21
    Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
| * | drivers: cpuidle: lpm-levels: Fix KW issues with idle state idx < 0  (Lina Iyer, 2023-07-16)

    Idle state calculations need to return the chosen state as an integer. The chosen state is used as an index into the array and as such cannot be a negative value. Do not return negative errors from the calculations. By default, the state returned will be zero.

    Change-Id: Idb18e933f385cf868fe99fa6a2783f6b8e84c196
    Signed-off-by: Lina Iyer <ilina@codeaurora.org>
| * | drivers: cpuidle: lpm-levels: Correctly check for list empty  (Maulik Shah, 2023-07-16)

    Correctly check for the list empty condition to get the least cluster latency.

    Change-Id: I6584a8d6d77794ca506c994d927467e9c1fefa63
    Signed-off-by: Maulik Shah <mkshah@codeaurora.org>
| * | sched/fair: Fix load_balance() affinity redo path  (Jeffrey Hugo, 2023-07-16)

    If load_balance() fails to migrate any tasks because all tasks were affined, load_balance() removes the source cpu from consideration and attempts to redo the balance among the new subset of cpus.

    There is a bug in this code path where the algorithm considers all active cpus in the system (minus the source that was just masked out). This is not valid for two reasons: some active cpus may not be in the current scheduling domain, and one of the active cpus is dst_cpu. These cpus should not be considered, as we cannot pull load from them. Instead of failing out of load_balance(), we may end up redoing the search with no valid cpus and incorrectly concluding that the domain is balanced. Additionally, if the group_imbalance flag was just set, it may also be incorrectly unset, so the flag will not be seen by other cpus in future load_balance() runs, as that algorithm intends.

    Fix the check by removing cpus not in the current domain and the dst_cpu from consideration, thus limiting the evaluation to valid remaining cpus from which load might be migrated (see the sketch below).

    Co-authored-by: Austin Christ <austinwc@codeaurora.org>
    Co-authored-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Signed-off-by: Jeffrey Hugo <jhugo@codeaurora.org>
    Tested-by: Tyler Baicar <tbaicar@codeaurora.org>
    Change-Id: Ife6701c9c62e7155493d9db9398f08c4474e94b3
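Roughly, the redo path's candidate mask is tightened like this — a sketch of the intent, not the literal diff ('cpus' being load_balance()'s candidate cpumask):

        /* Keep only CPUs in the current sched domain, and never consider
         * the destination CPU as a source to pull load from. */
        cpumask_and(cpus, cpus, sched_domain_span(env->sd));
        cpumask_clear_cpu(env->dst_cpu, cpus);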
| * | sched/fair: Avoid unnecessary active load balance  (Maria Yu, 2023-07-16)

    When finding the busiest group, we avoid load balance if there is only 1 task running on the src cpu. To handle the race when different cpus do a newly-idle load balance at the same time, check the src cpu's nr_running to avoid an unnecessary active load balance.

    An example of the race condition:

    1) cpu2 has 2 tasks, so cpu2 rq->nr_running == 2 and cfs.h_nr_running == 2.
    2) cpu4 and cpu5 do a newly-idle load balance at the same time.
    3) cpu4 and cpu5 both see cpu2's sched_load_balance_sg_stats sum_nr_run == 2, so they both see cpu2 as the busiest rq.
    4) cpu5 successfully migrates a task from cpu2, so cpu2 has only 1 task left: cpu2 rq->nr_running == 1 and cfs.h_nr_running == 1.
    5) cpu4 goes to no_move because cpu4 currently has only 1 task, which is currently running.
    6) cpu4 then checks whether cpu2 needs an active load balance.

    Change-Id: Ia9539a43e9769c4936f06ecfcc11864984c50c29
    Signed-off-by: Maria Yu <aiquny@codeaurora.org>
| * | BACKPORT: sched/core: Fix rules for running on online && !active CPUs  (Peter Zijlstra, 2023-07-16)

    As already enforced by the WARN() in __set_cpus_allowed_ptr(), the rules for running on an online && !active CPU are stricter than just being a kthread: you need to be a per-cpu kthread. If you're not strictly per-CPU, you have better CPUs to run on and don't need the partially booted one to get your work done.

    The exception is to allow smpboot threads to bootstrap the CPU itself and get kernel 'services' initialized before we allow userspace on it.

    Change-Id: I515e873a6e5be0cde7771ecedf56101614300fe2
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Steven Rostedt <rostedt@goodmis.org>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Fixes: 955dbdf4ce87 ("sched: Allow migrating kthreads into online but inactive CPUs")
    Link: http://lkml.kernel.org/r/20170725165821.cejhb7v2s3kecems@hirez.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Backported to 4.4
    Signed-off-by: joshuous <joshuous@gmail.com>
| * | sched/core: Allow kthreads to fall back to online && !active cpus  (Tejun Heo, 2023-07-16)

    During CPU hotplug, CPU_ONLINE callbacks are run while the CPU is online but not active. A CPU_ONLINE callback may create or bind a kthread so that its cpus_allowed mask only allows the CPU which is being brought online. The kthread may start executing before the CPU is made active and can end up in select_fallback_rq(). In such cases, the expected behavior is selecting the CPU which is coming online; however, because select_fallback_rq() only chooses from active CPUs, it determines that the task doesn't have any viable CPU in its allowed mask and ends up overriding it to cpu_possible_mask.

    CPU_ONLINE callbacks should be able to put kthreads on the CPU which is coming online. Update select_fallback_rq() so that it follows cpu_online() rather than cpu_active() for kthreads.

    Reported-by: Gautham R Shenoy <ego@linux.vnet.ibm.com>
    Tested-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
    Change-Id: I562dcc53717b1f2f8324abffb652b91592ba8d5c
    Signed-off-by: Tejun Heo <tj@kernel.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
    Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: kernel-team@fb.com
    Cc: linuxppc-dev@lists.ozlabs.org
    Link: http://lkml.kernel.org/r/20160616193504.GB3262@mtj.duckdns.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
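Inside select_fallback_rq()'s CPU scan, the rule reads roughly like this (a sketch from the description above, not the verbatim backport):

        for_each_cpu(dest_cpu, tsk_cpus_allowed(p)) {
                /* Kthreads may use online && !active CPUs during hotplug;
                 * everything else must stick to active CPUs. */
                if (!(p->flags & PF_KTHREAD) && !cpu_active(dest_cpu))
                        continue;
                if (!cpu_online(dest_cpu))
                        continue;
                goto out;
        }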
| * | sched: Allow migrating kthreads into online but inactive CPUs  (Tejun Heo, 2023-07-16)

    Per-cpu workqueues have been tripping CPU affinity sanity checks while a CPU is being offlined. A per-cpu kworker ends up running on a CPU which isn't its target CPU while the CPU is online but inactive.

    While the scheduler allows kthreads to wake up on an online but inactive CPU, it doesn't allow a running kthread to be migrated to such a CPU, which leads to an odd situation where setting affinity on a sleeping kthread and on a running kthread leads to different results.

    Each mem-reclaim workqueue has one rescuer which guarantees forward progress, and the rescuer needs to bind itself to the CPU which needs help making forward progress; however, due to the above issue, while set_cpus_allowed_ptr() succeeds, the rescuer doesn't end up on the correct CPU if the CPU is in the process of going offline, tripping the sanity check and executing the work item on the wrong CPU.

    This patch updates __migrate_task() so that kthreads can be migrated into an inactive but online CPU.

    Change-Id: I38cc3eb3b2ec5b7034cc72a2bcdd32a549314915
    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reported-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
    Reported-by: Steven Rostedt <rostedt@goodmis.org>
| * | sched/fair: Allow load bigger task load balance when nr_running is 2  (Maria Yu, 2023-07-16)

    When there are only 2 tasks on one cpu and the other task is currently running, allow the bigger-load task to be balanced.

    Change-Id: I489e9624ba010f9293272a67585e8209a786b787
    Signed-off-by: Maria Yu <aiquny@codeaurora.org>