path: root/kernel/sched
Commit message (Author, Age)
...
| | * | | | Merge branch 'linux-linaro-lsk-v4.4' into linux-linaro-lsk-v4.4-android  (Mark Brown, 2016-07-29)
| | |\| | |
| | | * | | sched/fair: Fix cfs_rq avg tracking underflow  (Peter Zijlstra, 2016-07-27)

    commit 8974189222159154c55f24ddad33e3613960521a upstream.

    As per commit:

      b7fa30c9cc48 ("sched/fair: Fix post_init_entity_util_avg() serialization")

    > the code generated from update_cfs_rq_load_avg():
    >
    >     if (atomic_long_read(&cfs_rq->removed_load_avg)) {
    >         s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
    >         sa->load_avg = max_t(long, sa->load_avg - r, 0);
    >         sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
    >         removed_load = 1;
    >     }
    >
    > turns into:
    >
    > ffffffff81087064: 49 8b 85 98 00 00 00  mov    0x98(%r13),%rax
    > ffffffff8108706b: 48 85 c0              test   %rax,%rax
    > ffffffff8108706e: 74 40                 je     ffffffff810870b0 <update_blocked_averages+0xc0>
    > ffffffff81087070: 4c 89 f8              mov    %r15,%rax
    > ffffffff81087073: 49 87 85 98 00 00 00  xchg   %rax,0x98(%r13)
    > ffffffff8108707a: 49 29 45 70           sub    %rax,0x70(%r13)
    > ffffffff8108707e: 4c 89 f9              mov    %r15,%rcx
    > ffffffff81087081: bb 01 00 00 00        mov    $0x1,%ebx
    > ffffffff81087086: 49 83 7d 70 00        cmpq   $0x0,0x70(%r13)
    > ffffffff8108708b: 49 0f 49 4d 70        cmovns 0x70(%r13),%rcx
    >
    > Which you'll note ends up with sa->load_avg -= r in memory at
    > ffffffff8108707a.

    So I _should_ have looked at other unserialized users of ->load_avg, but alas. Luckily nikbor reported a similar /0 from task_h_load() which instantly triggered recollection of this here problem.

    Aside from the intermediate value hitting memory and causing problems, there's another problem: the underflow detection relies on the signed bit. This reduces the effective width of the variables, IOW it's effectively the same as having these variables be of signed type.

    This patch changes to a different means of unsigned underflow detection to not rely on the signed bit. This allows the variables to use the 'full' unsigned range. And it does so with explicit LOAD - STORE to ensure any intermediate value will never be visible in memory, allowing these unserialized loads.

    Note: GCC generates crap code for this, might warrant a look later.

    Note2: I say 'full' above, if we end up at U*_MAX we'll still explode; maybe we should do clamping on add too.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
    Cc: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Mike Galbraith <efault@gmx.de>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Yuyang Du <yuyang.du@intel.com>
    Cc: bsegall@google.com
    Cc: kernel@kyup.com
    Cc: morten.rasmussen@arm.com
    Cc: pjt@google.com
    Cc: steve.muckle@linaro.org
    Fixes: 9d89c257dfb9 ("sched/fair: Rewrite runnable load and utilization average tracking")
    Link: http://lkml.kernel.org/r/20160617091948.GJ30927@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
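The patch itself is not reproduced in this log view; a minimal sketch of the pattern the message describes (an explicit READ_ONCE()/WRITE_ONCE() load-store pair with unsigned wrap detection instead of a sign-bit check) could look like the following, where the macro name is an illustrative stand-in rather than the exact upstream helper:

        /* Illustrative only: clamp an unsigned subtraction at zero without
         * relying on the sign bit, and without letting an underflowed
         * intermediate value become visible in memory. */
        #define sub_nonneg(_ptr, _val) do {                             \
                typeof(_ptr) __p = (_ptr);                              \
                typeof(*__p) __old = READ_ONCE(*__p);                   \
                typeof(*__p) __res = __old - (_val);                    \
                                                                        \
                if (__res > __old)      /* unsigned wrap => underflow */\
                        __res = 0;                                      \
                WRITE_ONCE(*__p, __res);                                \
        } while (0)

        /* usage, e.g. in update_cfs_rq_load_avg():
         *   sub_nonneg(&sa->load_avg, r);
         *   sub_nonneg(&sa->load_sum, r * LOAD_AVG_MAX);
         */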
| | * | | | Merge branch 'linux-linaro-lsk-v4.4' into linux-linaro-lsk-v4.4-android  (Alex Shi, 2016-06-27)
| | |\| | |
| | | * | | sched: panic on corrupted stack end  (Jann Horn, 2016-06-24)

    commit 29d6455178a09e1dc340380c582b13356227e8df upstream.

    Until now, hitting this BUG_ON caused a recursive oops (because oops handling involves do_exit(), which calls into the scheduler, which in turn raises an oops), which caused stuff below the stack to be overwritten until a panic happened (e.g. via an oops in interrupt context, caused by the overwritten CPU index in the thread_info). Just panic directly.

    Signed-off-by: Jann Horn <jannh@google.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| | * | | | Merge branch 'linux-linaro-lsk-v4.4' into linux-linaro-lsk-v4.4-android  (Alex Shi, 2016-06-02)
| | |\| | |
| | | * | | sched/loadavg: Fix loadavg artifacts on fully idle and on fully loaded systems  (Vik Heyndrickx, 2016-06-01)

    commit 20878232c52329f92423d27a60e48b6a6389e0dd upstream.

    Systems show a minimal load average of 0.00, 0.01, 0.05 even when they have no load at all. Uptime and /proc/loadavg on all systems with kernels released during the last five years up until kernel version 4.6-rc5, show a 5- and 15-minute minimum loadavg of 0.01 and 0.05 respectively. This should be 0.00 on idle systems, but the way the kernel calculates this value prevents it from getting lower than the mentioned values.

    Likewise but not as obviously noticeable, a fully loaded system with no processes waiting, shows a maximum 1/5/15 loadavg of 1.00, 0.99, 0.95 (multiplied by number of cores).

    Once the (old) load becomes 93 or higher, it mathematically can never get lower than 93, even when the active (load) remains 0 forever. This results in the strange 0.00, 0.01, 0.05 uptime values on idle systems. Note: 93/2048 = 0.0454..., which rounds up to 0.05.

    It is not correct to add a 0.5 rounding (=1024/2048) here, since the result from this function is fed back into the next iteration again, so the result of that +0.5 rounding value then gets multiplied by (2048-2037), and then rounded again, so there is a virtual "ghost" load created, next to the old and active load terms.

    By changing the way the internally kept value is rounded, that internal value equivalent now can reach 0.00 on idle, and 1.00 on full load. Upon increasing load, the internally kept load value is rounded up, when the load is decreasing, the load value is rounded down.

    The modified code was tested on nohz=off and nohz kernels. It was tested on vanilla kernel 4.6-rc5 and on centos 7.1 kernel 3.10.0-327. It was tested on single, dual, and octal cores system. It was tested on virtual hosts and bare hardware. No unwanted effects have been observed, and the problems that the patch intended to fix were indeed gone.

    Tested-by: Damien Wyart <damien.wyart@free.fr>
    Signed-off-by: Vik Heyndrickx <vik.heyndrickx@veribox.net>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Doug Smythies <dsmythies@telus.net>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Mike Galbraith <efault@gmx.de>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Fixes: 0f004f5a696a ("sched: Cure more NO_HZ load average woes")
    Link: http://lkml.kernel.org/r/e8d32bff-d544-7748-72b5-3c86cc71f09f@veribox.net
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
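A hedged sketch of the asymmetric rounding described above: round the fixed-point result up only while load is rising, and let truncation round it down while it is falling, so the internal value can decay all the way to 0.00 on idle and reach 1.00 on full load. FIXED_1 is the 2048 fixed-point unit mentioned in the note; the function below is an illustration, not a verbatim copy of the patch:

        #define FIXED_1 (1 << 11)       /* 2048, the fixed-point unit */

        /* load and active are fixed-point (FIXED_1) values,
         * exp is the per-interval decay factor. */
        static unsigned long calc_load_sketch(unsigned long load,
                                              unsigned long exp,
                                              unsigned long active)
        {
                unsigned long newload;

                newload = load * exp + active * (FIXED_1 - exp);
                if (active >= load)
                        newload += FIXED_1 - 1;  /* rising: round up */
                /* falling: truncation rounds down, so idle load reaches 0 */

                return newload / FIXED_1;
        }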
| | * | | | Merge branch 'linux-linaro-lsk-v4.4' into linux-linaro-lsk-v4.4-android  (Alex Shi, 2016-05-12)
| | |\| | |
| | * | | | Merge branch 'linux-linaro-lsk-v4.4' into linux-linaro-lsk-v4.4-android  (Alex Shi, 2016-04-13)
| | |\ \ \ \
| | * \ \ \ \ Merge branch 'linux-linaro-lsk-v4.4' into linux-linaro-lsk-v4.4-android  (Alex Shi, 2016-02-29)
| | |\ \ \ \ \ | | | |_|_|/ / | | |/| | | |
* | | | | | | Merge "sched/hmp: Automatically add children threads to colocation group"  (Linux Build Service Account, 2016-11-02)
|\ \ \ \ \ \ \
| * | | | | | | sched/hmp: Automatically add children threads to colocation group  (Syed Rameez Mustafa, 2016-10-27)
| |/ / / / / /

    When sched_enable_thread_grouping is turned on, the scheduler needs to ensure that any pre-existing children of a task get added to the co-location group. Upon removal from the co-location group, however, the scheduler does not check for the thread grouping flag because userspace cannot ensure correct behavior. Therefore, as a precautionary measure to avoid memory leaks, the scheduler has to forcefully remove children from the group regardless of the flag setting.

    While at it, also make group management a lot simpler. Without these simplifications, we can end up in extremely complicated locking scenarios where ensuring the correct order to avoid deadlocks is near impossible.

    Change-Id: I4c13601b0fded6de9d8f897c6d471c6a40c90e4d
    Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
* | | | | | | Merge "sched/hmp: Disable interrupts when resetting all task stats"  (Linux Build Service Account, 2016-10-31)
|\ \ \ \ \ \ \
| * | | | | | | sched/hmp: Disable interrupts when resetting all task stats  (Syed Rameez Mustafa, 2016-10-28)
| |/ / / / / /

    Taking the pi_lock without disabling interrupts in reset_all_task_stats() is problematic, in that an interrupt can end up waking a task which in turn needs the pi_lock again, causing a deadlock. Disable interrupts along with taking the lock to avoid this problem.

    Change-Id: If27cb2bb3fcaafa5c8435f3c2e0e4be9b8f1e987
    Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
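As a minimal illustration of the fix described above (assuming p is the task_struct whose stats are being reset), taking pi_lock with the irqsave variant keeps an interrupt on this CPU from waking a task that would need the same lock:

        unsigned long flags;

        /* IRQs stay off while pi_lock is held, so a wakeup from interrupt
         * context cannot recurse into pi_lock on this CPU and deadlock. */
        raw_spin_lock_irqsave(&p->pi_lock, flags);
        /* ... reset the task's stats ... */
        raw_spin_unlock_irqrestore(&p->pi_lock, flags);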
* | | | | | | Merge "sched: Fix compilation issue with reset_hmp_stats"  (Linux Build Service Account, 2016-10-27)
|\ \ \ \ \ \ \ | |/ / / / / / |/| | | | | |
| * | | | | | sched: Fix compilation issue with reset_hmp_stats  (Olav Haugan, 2016-10-25)

    reset_hmp_stats was moved to another file, and when CONFIG_CFS_BANDWIDTH is enabled there is still code referencing it in the original file, causing a compilation error.

    Change-Id: Iab7fc8551b628c443ce751026b06c5ff4ebba39a
    Signed-off-by: Olav Haugan <ohaugan@codeaurora.org>
| * | | | | | sched/fair: Fix compilation issue  (Olav Haugan, 2016-10-25)

    Code does not compile with CONFIG_CFS_BANDWIDTH.

    Change-Id: Idb74e9df4fcb55085ac869f5ba273cef4a3eb9eb
    Signed-off-by: Olav Haugan <ohaugan@codeaurora.org>
* | | | | | | sched: Set curr/prev_window_cpu pointers to NULL in sched_exit()  (Syed Rameez Mustafa, 2016-10-24)
|/ / / / / /

    trace_sched_update_task_ravg relies on NULL pointers to ensure that it doesn't access them. Make sure that when a task exits, these pointers are set to NULL. Otherwise any call to update_task_ravg() between sched_exit() and releasing the task structure will access bogus pointers. In some cases those memory locations are unmapped and cause a kernel panic.

    Change-Id: I9eebb4fb35aca2c8424bfb29ae9d833650dc5ad4
    Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
* | | | | | Merge "sched/core_ctl: Move header file to global location"  (Linux Build Service Account, 2016-10-20)
|\ \ \ \ \ \
| * | | | | | sched/core_ctl: Move header file to global location  (Olav Haugan, 2016-10-18)

    Move the header file of core control to the standard linux include directory to allow other entities to include this file.

    Change-Id: I2ddb8b3b96063be3c6a6cb6bc333998e007f9de7
    Signed-off-by: Olav Haugan <ohaugan@codeaurora.org>
| * | | | | | core_ctl: Add refcounting to boost api  (Olav Haugan, 2016-10-18)

    More than one client may call the core_ctl_set_boost api. Add support for this. Also add a new trace event that is emitted when this api is called.

    Change-Id: Iad0a9fc45f1ce87433995e8e549bfca80e8b9cb2
    Signed-off-by: Olav Haugan <ohaugan@codeaurora.org>
* | | | | | | Merge "sched: don't bias towards waker cluster when sched_boost is set"  (Linux Build Service Account, 2016-10-20)
|\ \ \ \ \ \ \
| * | | | | | | sched: don't bias towards waker cluster when sched_boost is set  (Joonwoo Park, 2016-10-19)

    When sched_boost is set, the scheduler needs to place tasks on the least loaded CPU or on a performance CPU for better performance.

    Change-Id: I41512b4af9cd56712a241c114583b0021d1395d2
    Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
* | | | | | | | Merge "sched/hmp: Fix range checking for target load"  (Linux Build Service Account, 2016-10-19)
|\ \ \ \ \ \ \ \
| * | | | | | | | sched/hmp: Fix range checking for target load  (Olav Haugan, 2016-10-19)
| | |/ / / / / /

    The range check for target load is incorrect. Fix this. This is only a sanity check to catch badly specified target loads.

    Change-Id: Ia90d020f5e0bdf37c600661a1c246dab5b637b3b
    Signed-off-by: Olav Haugan <ohaugan@codeaurora.org>
* | | | | | | | Merge "sched: Add multiple load reporting policies for cpu frequency"  (Linux Build Service Account, 2016-10-19)
|\ \ \ \ \ \ \ \
| * | | | | | | | sched: Add multiple load reporting policies for cpu frequency  (Syed Rameez Mustafa, 2016-10-17)

    The previous patches in this series introduce the mechanics of CPU load tracking without fixups for intra-cluster migration, and top task load tracking. Add a tunable that dictates which of the above needs to be considered when reporting load to the governor. The default policy is to take the maximum of the CPU load and top task load.

    Change-Id: Ie585a11ed774b929910d04c41471db3a2a102ec5
    Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
* | | | | | | | | Merge "sched: Optimize the next top task search logic upon task migration"  (Linux Build Service Account, 2016-10-19)
|\| | | | | | | | | |_|/ / / / / / |/| | | | | | |
| * | | | | | | sched: Optimize the next top task search logic upon task migration  (Syed Rameez Mustafa, 2016-10-17)

    find_next_top_index() is responsible for finding the second top task on a CPU when the top task migrates away from that CPU. This operation is expensive as we need to iterate the entire array of top tasks to find the second top task.

    Optimize this by introducing bitmaps for tracking top task indices. There are two bitmaps; one for the previous window and one for the current window. Each bit in a bitmap tracks whether the corresponding bucket in the top task hashmap has a non-zero refcount. The bit is set when the refcount becomes non-zero and is cleared when it becomes zero. Finding the second top task upon migration is then simply a matter of finding the highest set bit in the bitmap.

    Change-Id: Ibafaf66eed756b0328704dfaa89c17ab0d84e359
    Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
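A rough sketch of the bookkeeping described above, with the bucket count, field names and helpers being illustrative assumptions: a bucket's bit is set on its 0 -> 1 refcount transition, cleared on 1 -> 0, and the next top task is found with a highest-set-bit lookup instead of a full array scan:

        #define NUM_LOAD_INDICES 1000   /* assumed number of top-task buckets */

        struct top_task_stats {
                DECLARE_BITMAP(top_tasks_bitmap, NUM_LOAD_INDICES);
                u8 top_tasks[NUM_LOAD_INDICES];  /* per-bucket refcounts */
        };

        static void top_task_inc(struct top_task_stats *s, int index)
        {
                if (s->top_tasks[index]++ == 0)          /* 0 -> 1 transition */
                        __set_bit(index, s->top_tasks_bitmap);
        }

        static void top_task_dec(struct top_task_stats *s, int index)
        {
                if (--s->top_tasks[index] == 0)          /* 1 -> 0 transition */
                        __clear_bit(index, s->top_tasks_bitmap);
        }

        /* Second top task after the top task migrates away: simply the
         * highest set bit that remains in the bitmap. */
        static int next_top_index(struct top_task_stats *s)
        {
                return find_last_bit(s->top_tasks_bitmap, NUM_LOAD_INDICES);
        }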
| * | | | | | | sched: Add the mechanics of top task tracking for frequency guidance  (Syed Rameez Mustafa, 2016-10-17)

    The previous patches in this rewrite of scheduler guided frequency selection reintroduce the part-picture problem that we addressed in our initial implementation: when tasks migrate across CPUs within a cluster, we end up losing the complete picture of the sequential nature of the workload.

    This patch aims to solve that problem slightly differently. We track the top task on every CPU within a window. The top task is defined as the task that runs the most in a given window. This enhances our ability to detect the sequential nature of workloads. A single migrating task executing for an entire window will cause 100% load to be reported for frequency guidance instead of the maximum footprint left on any individual CPU in the task's trail. There are cases that this new approach does not address, namely cases where the sum of two or more tasks accurately reflects the true sequential nature of the workload. Future optimizations might aim to tackle that problem.

    To track top tasks, we first realize that there is no strict need to maintain the task struct itself as long as we know the load exerted by the top task. We also realize that to maintain top tasks on every CPU we have to track the execution of every single task that runs during the window. The load associated with a task needs to be migrated when the task migrates from one CPU to another. When the top task migrates away, we need to locate the second top task and so on.

    Given the above realizations, we use hashmaps to track top task load both for the current and the previous window. This hashmap is implemented as an array of fixed size. The key of the hashmap is given by task_execution_time_in_a_window / array_size. The size of the array (number of buckets in the hashmap) dictates the load granularity of each bucket. The value stored in each bucket is a refcount of all the tasks that executed long enough to be in that bucket.

    This approach has a few benefits. Firstly, any top task stats update now takes O(1) time. While task migration is also O(1), it does still involve going through up to the size of the array to find the second top task. Further patches will aim to optimize this behavior. Secondly, and more importantly, not having to store the task struct itself saves a lot of memory usage in that 1) there is no need to retrieve task structs later causing cache misses and 2) we don't have to unnecessarily hold up task memory for up to 2 full windows by calling get_task_struct() after a task exits.

    Change-Id: I004dba474f41590db7d3f40d9deafe86e71359ac
    Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
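A simplified sketch of the bucketing scheme described above (bucket index = execution time in the window divided by the bucket granularity, bucket value = refcount); the window size, bucket count and helper name are assumptions for illustration only:

        /* Assumed example: a 20 ms window split into 1000 buckets gives a
         * 20 us load granularity per bucket. */
        #define WINDOW_SIZE_NS   20000000ULL
        #define NUM_LOAD_INDICES 1000
        #define BUCKET_SIZE_NS   (WINDOW_SIZE_NS / NUM_LOAD_INDICES)

        static int load_to_index(u64 runtime_ns)
        {
                u64 index = runtime_ns / BUCKET_SIZE_NS;

                if (index >= NUM_LOAD_INDICES)
                        index = NUM_LOAD_INDICES - 1;
                return (int)index;
        }

        /* Updating the top-task stats for a task that ran runtime_ns in the
         * current window is then an O(1) refcount bump on one bucket:
         *   curr_table[load_to_index(runtime_ns)]++;
         */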
| * | | | | | | sched: Enhance the scheduler migration load fixup feature  (Syed Rameez Mustafa, 2016-10-17)

    In the current frequency guidance implementation the scheduler migrates task load from the source CPU to the destination CPU when a task migrates. The underlying assumption is that a task will stay on the destination CPU following the migration. Hence a CPU's load should reflect the sum of all tasks that last ran on that CPU prior to window expiration, even if these tasks executed on some other CPU in that window prior to being migrated.

    However, given the ubiquitous nature of migrations the above assumption is flawed, causing the scheduler to often add up load on a single CPU that in reality ran concurrently on multiple CPUs and will continue to run concurrently in subsequent windows. This leads to load over-reporting on a single CPU, which in turn causes CPU frequency to be higher than necessary.

    This is the first patch in a series of patches that attempts to change how load fixups are done upon migration to prevent load over-reporting. In this patch, we stop doing migration fixups for intra-cluster migrations. Inter-cluster migration fixups are still retained.

    In order to achieve the above, we make use of the per CPU footprint of each task introduced in the previous patch. Upon inter-cluster migration, we go through every CPU in the source cluster to subtract the migrating task's contribution to the busy time on each one of those CPUs. The sum of the contributions is then added to the destination CPU, allowing it to ramp up to the appropriate frequency for that task.

    Subtracting load from each of the source CPUs is not trivial, however, as it would require all runqueue locks to be held. To get around this we introduce a deferred load subtraction mechanism whereby subtracting load from each of the source CPUs is deferred until an opportune moment. This opportune moment is when the governor comes asking the scheduler for load. At that time, all necessary runqueue locks are already held.

    There are a few cases to consider when doing deferred subtraction. Since we are not holding all runqueue locks, other CPUs in the source cluster can be in a different window than the source CPU where the task is migrating from.

    Case 1: Other CPU in the source cluster is in the same window. No special consideration.

    Case 2: Other CPU in the source cluster is ahead by 1 window. In this case, we will be doing redundant updates to the subtraction load for the prev window. There is no way to avoid this redundant update though, without holding the rq lock.

    Case 3: Other CPU in the source cluster is trailing by 1 window. In this case, we might end up overwriting old data for that CPU. But this is not a problem as when the other CPU calls update_task_ravg() it will move to the same window. This relies on maintaining synchronized windows between CPUs, which is true today.

    Finally, we must deal with frequency aggregation. When frequency aggregation is in effect, there is little point in dealing with per CPU footprints since the load of all related tasks has to be reported on a single CPU. Therefore when a task enters a related group we clear out all per CPU contributions and add them to the task CPU's cpu_time struct. From that point onwards we stop managing per CPU contributions upon inter-cluster migrations since that work is redundant. Finally, when a task exits a related group we must walk every CPU and reset all CPU contributions. We then set the task CPU contribution to the respective curr/prev sum values and add that sum to the task CPU rq runnable sum.

    Change-Id: I1f8d596e6c930f3f6f00e24109ddbe8b121f8d6b
    Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
| * | | | | | | sched: Add per CPU load tracking for each task  (Syed Rameez Mustafa, 2016-10-17)
| |/ / / / / /

    Keeping track of the load footprint of each task on every CPU that it executed on gives the scheduler much more flexibility in terms of the number of frequency guidance policies. These new fields will be used in subsequent patches as we alter the load fixup mechanism upon task migration. We still need to maintain the curr/prev_window sums as they will also be required in subsequent patches as we start to track top tasks based on cumulative load.

    Also, we need to call init_new_task_load() for the idle task. This is an existing harmless bug, as load tracking for the idle task is irrelevant. However, in this patch we are adding pointers to the ravg structure. These pointers have to be initialized even for the idle task.

    Finally, move init_new_task_load() to sched_fork(). This was always the more appropriate place; however, following the introduction of new pointers in the ravg struct, this is necessary to avoid races with functions such as reset_all_task_stats().

    Change-Id: Ib584372eb539706da4319973314e54dae04e5934
    Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
* / / / / / / sched/fair: Fix issue with trace flag not being set properly  (Olav Haugan, 2016-10-17)
|/ / / / / /

    During scheduler boost the sched_task_load ftrace event might not log the correct flag value. Ensure that the flag is always initialized with the selected cluster information.

    Change-Id: Ia986d0fbc512c8e9ed1b5fb5b2ac4bc564cc4ba9
    Signed-off-by: Olav Haugan <ohaugan@codeaurora.org>
* | | | | | Merge "sched/cgroup: Fix/cleanup cgroup teardown/init"  (Linux Build Service Account, 2016-10-13)
|\ \ \ \ \ \
| * | | | | | sched/cgroup: Fix/cleanup cgroup teardown/init  (Peter Zijlstra, 2016-10-05)

    The CPU controller hasn't kept up with the various changes in the whole cgroup initialization / destruction sequence, and commit:

      2e91fa7f6d45 ("cgroup: keep zombies associated with their original cgroups")

    caused it to explode.

    The reason for this is that zombies do not inhibit css_offline() from being called, but do stall css_released(). Now we tear down the cfs_rq structures on css_offline() but zombies can run after that, leading to use-after-free issues.

    The solution is to move the tear-down to css_released(), which guarantees nobody (including no zombies) is still using our cgroup.

    Furthermore, a few simple cleanups are possible too. There doesn't appear to be any point to us using css_online() (anymore?) so fold that in css_alloc(). And since cgroup code guarantees an RCU grace period between css_released() and css_free() we can forgo using call_rcu() and free the stuff immediately.

    Change-Id: I51af3d4f0e5dd1c9df6375cce4bb933f67f1022e
    Suggested-by: Tejun Heo <tj@kernel.org>
    Reported-by: Kazuki Yamaguchi <k@rhe.jp>
    Reported-by: Niklas Cassel <niklas.cassel@axis.com>
    Tested-by: Niklas Cassel <niklas.cassel@axis.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Acked-by: Tejun Heo <tj@kernel.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Fixes: 2e91fa7f6d45 ("cgroup: keep zombies associated with their original cgroups")
    Link: http://lkml.kernel.org/r/20160316152245.GY6344@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Git-commit: 2f5177f0fd7e531b26d54633be62d1d4cb94621c
    Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
    Signed-off-by: Satya Durga Srinivasu Prabhala <satyap@codeaurora.org>
| * | | | | | sched/cgroup: Fix cgroup entity load tracking tear-down  (Peter Zijlstra, 2016-10-05)

    When a cgroup's CPU runqueue is destroyed, it should remove its remaining load accounting from its parent cgroup.

    The current site for doing so is unsuited because it's far too late and unordered against other cgroup removal (->css_free() will be, but we're also in an RCU callback).

    Put it in the ->css_offline() callback, which is the start of cgroup destruction, right after the group has been made unavailable to userspace. The ->css_offline() callbacks are called in hierarchical order after the following v4.4 commit:

      aa226ff4a1ce ("cgroup: make sure a parent css isn't offlined before its children")

    Change-Id: Ice7cbd71d9e545da84d61686aa46c7213607bb9d
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Christian Borntraeger <borntraeger@de.ibm.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Li Zefan <lizefan@huawei.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Link: http://lkml.kernel.org/r/20160121212416.GL6357@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Git-commit: 6fe1f348b3dd1f700f9630562b7d38afd6949568
    Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
    Signed-off-by: Satya Durga Srinivasu Prabhala <satyap@codeaurora.org>
* | | | | | | Merge "sched: bucketize CPU c-state levels"  (Linux Build Service Account, 2016-10-13)
|\ \ \ \ \ \ \
| * | | | | | | sched: bucketize CPU c-state levels  (Joonwoo Park, 2016-10-12)

    The c-state aware scheduler takes note of the wakeup latency of each c-state level to determine whether to pack a task or wake up an LPM CPU. But it doesn't distinguish between small and large deltas, as it would be inefficient for the scheduler to do so on its critical path.

    Disregard wakeup-latency differences of less than 64 us between different c-state levels. This reduces unnecessary task packing.

    CRs-fixed: 1074879
    Change-Id: Ib0cadbd390d1a0b6da3e39c98010cedb43e5bf60
    Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
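A small sketch of the bucketing this describes, with the helper name assumed for illustration: comparing exit latencies at 64 us granularity makes c-state levels whose latencies differ by less than 64 us look equivalent to the placement logic:

        /* Compare idle states at 64 us granularity; latencies that land in
         * the same bucket are treated as equal, so tiny deltas no longer
         * influence task placement. */
        static inline int cstate_latency_bucket(unsigned int exit_latency_us)
        {
                return exit_latency_us >> 6;    /* divide by 64 us */
        }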
* | | | | | | | Merge "sched: use wakeup latency as c-state determinant"  (Linux Build Service Account, 2016-10-13)
|\| | | | | | |
| * | | | | | | sched: use wakeup latency as c-state determinant  (Joonwoo Park, 2016-10-12)

    At present, the c-state aware scheduler uses the raw c-state index number as its determinant and avoids task placement on deeper c-state CPUs at the cost of latency. However, there are CPUs offering comparable wake-up latency at different c-state levels, and the wake-up latency of each c-state level is already being fed to the scheduler.

    Hence use wakeup_latency as the c-state determinant instead of the raw c-state index, to avoid unnecessary task packing where that is doable.

    CRs-fixed: 1074879
    Change-Id: If927f84f6c8ba719716d99669e5d1f1b19aaacbe
    Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
* | | | | | | | sched/tune: Remove redundant checks for NULL css  (Syed Rameez Mustafa, 2016-10-12)
|/ / / / / /

    The check for NULL css is redundant as upper layers are already making sure that css cannot be NULL. Remove this check. It helps to silence static analysis errors as well.

    Change-Id: I64585ff8cceb307904e20ff788e52eb05c000e1f
    Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
* | | | | | | sched: Add cgroup attach functionality to the tune controller  (Syed Rameez Mustafa, 2016-10-10)

    This is required to allow tasks to freely move between cgroups associated with the tune controller.

    Change-Id: I1f39b957462034586edc2fdc0a35488b314e9c8c
    Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
* | | | | | | sched: Update the number of tune groups to 5  (Syed Rameez Mustafa, 2016-10-10)

    The schedtune controller will mimic the cpusets controller configuration for now. For that we need to make 4 groups in addition to the root group present by default.

    Change-Id: I082f1e4e4ebf863e623cf66ee127eac70a3e2716
    Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
* | | | | | | sched/tune: add initial support for CGroups based boosting  (Patrick Bellasi, 2016-10-10)

    To support task performance boosting, the usage of a single knob has the advantage of being a simple solution, both from the implementation and the usability standpoint. However, on a real system it can be difficult to identify a single value for the knob which fits the needs of multiple different tasks. For example, some kernel threads and/or user-space background services should be better managed the "standard" way while we still want to be able to boost the performance of specific workloads.

    In order to improve the flexibility of the task boosting mechanism, this patch is the first of a small series which extends the previous implementation to introduce "per task group" support. This first patch introduces just the basic CGroups support: a new "schedtune" CGroups controller is added which allows configuring different boost values for different groups of tasks.

    To keep the implementation simple but still effective for a boosting strategy, the new controller:

    1. allows only a two-layer hierarchy
    2. supports only a limited number of boost groups

    A two-layer hierarchy allows each task to be placed either:

    a) in the root control group, thus being subject to a system-wide boosting value
    b) in a child of the root group, thus being subject to the specific boost value defined by that "boost group"

    The limited number of "boost groups" supported is mainly motivated by the observation that in a real system it could be useful to have only a few classes of tasks which deserve different treatment. For example, background vs foreground or interactive vs low-priority. As an additional benefit, a limited number of boost groups also allows a simpler implementation, especially for the code required to compute the boost value for CPUs which have runnable tasks belonging to different boost groups.

    Change-Id: I1304e33a8440bfdad9c8bcf8129ff390216f2e32
    cc: Tejun Heo <tj@kernel.org>
    cc: Li Zefan <lizefan@huawei.com>
    cc: Johannes Weiner <hannes@cmpxchg.org>
    cc: Ingo Molnar <mingo@redhat.com>
    cc: Peter Zijlstra <peterz@infradead.org>
    Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
    Git-commit: 13001f47c9a610705219700af4636386b647e231
    Git-repo: https://android.googlesource.com/kernel/common
    Signed-off-by: Vikram Mulukutla <markivx@codeaurora.org>
* | | | | | | Merge "sched/tune: add sysctl interface to define a boost value"  (Linux Build Service Account, 2016-10-06)
|\ \ \ \ \ \ \
| * | | | | | | sched/tune: add sysctl interface to define a boost value  (Patrick Bellasi, 2016-10-05)

    The current (CFS) scheduler implementation does not allow boosting task performance by running tasks at a higher OPP than the minimum required to meet their workload demands.

    To support task performance boosting, the scheduler should provide a "knob" which allows tuning how much the system is going to be optimised for energy efficiency vs performance.

    This patch is the first of a series which provides a simple interface to define a tuning knob. One system-wide "boost" tunable is exposed via:

      /proc/sys/kernel/sched_cfs_boost

    which can be configured in the range [0..100], to define a percentage where:

    - 0% boost requires operating in "standard" mode by scheduling tasks at the minimum capacities required by the workload demand
    - 100% boost requires pushing task performance to the maximum, "regardless" of the incurred energy consumption

    A boost value in between these two boundaries is used to bias the power/performance trade-off; the higher the boost value, the more the scheduler is biased toward performance boosting instead of energy efficiency.

    Change-Id: I59a41725e2d8f9238a61dfb0c909071b53560fc0
    cc: Ingo Molnar <mingo@redhat.com>
    cc: Peter Zijlstra <peterz@infradead.org>
    Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
    Git-commit: 63c8fad2b06805ef88f1220551289f0a3c3529f1
    Git-repo: https://source.codeaurora.org/quic/la/kernel/msm-4.4
    Signed-off-by: Vikram Mulukutla <markivx@codeaurora.org>
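A hedged sketch of what a [0..100] clamped sysctl knob of this kind typically looks like; the variable and table names are illustrative assumptions, not the exact upstream definitions:

        /* Illustrative sysctl entry clamped to [0..100] via
         * proc_dointvec_minmax(); shows up as
         * /proc/sys/kernel/sched_cfs_boost. */
        static int sysctl_sched_cfs_boost;
        static int min_boost = 0, max_boost = 100;

        static struct ctl_table sched_boost_table[] = {
                {
                        .procname     = "sched_cfs_boost",
                        .data         = &sysctl_sched_cfs_boost,
                        .maxlen       = sizeof(int),
                        .mode         = 0644,
                        .proc_handler = proc_dointvec_minmax,
                        .extra1       = &min_boost,
                        .extra2       = &max_boost,
                },
                { }
        };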
| * | | | | | | sched: Initialize HMP stats inside init_sd_lb_stats()  (Syed Rameez Mustafa, 2016-10-05)
| |/ / / / / /

    This ensures that the load balancer always works correctly even without compiler optimizations.

    Change-Id: I36408ae65833b624401e60edfb50c19cc061d7bf
    Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
* | | | | | | Merge "sched: Fix a division by zero bug in scale_exec_time()"  (Linux Build Service Account, 2016-10-06)
|\ \ \ \ \ \ \
| * | | | | | | sched: Fix a division by zero bug in scale_exec_time()  (Pavankumar Kondeti, 2016-10-01)

    When cycle_counter is used to estimate the frequency, calling update_task_ravg() twice on the same task without refreshing the wallclock results in a division by zero bug. Add a safety check in update_task_ravg() to prevent this.

    The above bug is hit from __schedule() when next == prev. There is no need to call update_task_ravg() twice for PUT_PREV_TASK and PICK_NEXT_TASK events for the same task. Calling update_task_ravg() with the TASK_UPDATE event is sufficient.

    Change-Id: Ib3af9004f2462618c535b8195377bedb584d0261
    Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
* | | | | | | | Merge "sched: Fix integer overflow in sched_update_nr_prod()"  (Linux Build Service Account, 2016-10-05)
|\ \ \ \ \ \ \ \
| * | | | | | | | sched: Fix integer overflow in sched_update_nr_prod()  (Pavankumar Kondeti, 2016-10-04)
| | |/ / / / / /

    An "int" type is used to hold the time difference between successive updates to nr_run in sched_update_nr_prod(). This can result in overflow if the function is called ~2.15 sec after it was called before. The most probable scenarios are when a CPU is idle and hotplugged. But as we update the last_time of all possible CPUs in sched_get_nr_running_avg() periodically from a deferrable timer context (core_ctl module), this overflow is observed only when the system is completely idle for a long time. When this overflow happens we hit a BUG_ON() in sched_get_nr_running_avg().

    Use a "u64" type instead of "int" for holding the time difference, and add an additional BUG_ON() to catch the instances where sched_clock() returns a backward value.

    Change-Id: I284abb5889ceb8cf9cc689c79ed69422a0e74986
    Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
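A small illustration of the overflow and the fix described above (variable names assumed): a signed 32-bit nanosecond delta wraps after roughly 2.15 seconds, while a u64 delta combined with a sanity check on a backward clock does not:

        u64 curr_time = sched_clock();          /* nanoseconds */
        u64 diff;

        BUG_ON(curr_time < last_time);          /* catch a backward clock */
        diff = curr_time - last_time;           /* safe in u64 */

        /* The old code effectively did:
         *   int diff = (int)(curr_time - last_time);
         * which overflows once the gap exceeds INT_MAX ns (~2.147 s),
         * e.g. after a long fully idle period. */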