| author | Srivatsa Vaddagiri <vatsa@codeaurora.org> | 2014-10-31 16:04:00 -0700 |
|---|---|---|
| committer | David Keitel <dkeitel@codeaurora.org> | 2016-03-23 20:01:12 -0700 |
| commit | 72b7c5d36c6440cad49659190867865648bdaa00 | |
| tree | b834ec8f534a58eff0ad957d3e3a3d5ba7c27837 | |
| parent | 588055e8c73e8ef0c8c23ab5db45453aa49be665 | |
sched: Provide knob to prefer mostly_idle over idle cpus

sysctl_sched_prefer_idle lets the scheduler bias selection of
idle cpus over mostly idle cpus for tasks. This knob could be
useful to control balance between power and performance.

Change-Id: Ide6eef684ef94ac8b9927f53c220ccf94976fe67
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>

| -rw-r--r-- | Documentation/scheduler/sched-hmp.txt | 237 |
| -rw-r--r-- | include/linux/sched/sysctl.h | 1 |
| -rw-r--r-- | kernel/sched/fair.c | 75 |
| -rw-r--r-- | kernel/sysctl.c | 7 |

4 files changed, 121 insertions, 199 deletions
diff --git a/Documentation/scheduler/sched-hmp.txt b/Documentation/scheduler/sched-hmp.txt
index ecbbaec5372c..8a813a3ebef4 100644
--- a/Documentation/scheduler/sched-hmp.txt
+++ b/Documentation/scheduler/sched-hmp.txt
@@ -641,192 +641,63 @@ select_best_cpu(), represents the heart of the HMP scheduling
 algorithm described in this document.
 
 The behavior of select_best_cpu() differs depending on whether the
-task being placed is a small task or not.
+task being placed is a small task or not and the value of the sched_prefer_idle
+tunable.
 
 --- Wakeup Logic a Non-Small Task "p"
 
- The following is evaluated for every online CPU i which task p may run on:
-
-     |
-     |                       task doesn't fit, but
-     |                       is this CPU a good
-     V                       fallback candidate?
-+---------------+             +-------------+            +--------+
-| does task fit |------------>| is CPU      |----------->| ignore |
-| on CPU        |     no      | mostly idle |     no     | cpu    |
-+---------------+             +-------------+            +--------+
-     |                               |
-     | yes                           | yes
-     |                               |         +--------------------------+
-     |                               --------->| load < min_fallback_load |
-     |                                         +--------------------------+
-     |                                            |                 |
-     |                                            | yes             | no
-     |                                            V                 V
-     |                          +-----------------------+   +------------+
-     |                          | fallback_idle_cpu = i |   | ignore cpu |
-     | task fits, prefer        +-----------------------+   +------------+
-     | mostly idle CPUs                     |                      |
-     | or non-max capacity                  V                      V
-     | CPUs that won't hit              next CPU               next CPU
-     | spill threshold
-     V
-+---------------------+    task does not meet load requirements
-| CPU mostly idle ||  |    no     +------------+
-| (!max_capacity &&   |---------->| ignore cpu |----> next CPU
-| !(p causes spill))  |           +------------+
-+---------------------+
-     |
-     | yes
-     |
-     |
-     |
-     | is CPU in a lower power band
-     V than previously seen min cost CPU           CPU in a lower power band
-+---------------------+                            than previously seen min,
-| cost(p, i) is       |    yes    +----------------------------+  override
-| > band_limit % less |---------->| best_cpu = i               |  previously
-| than current min    |           | min_cost = cost(p,i)       |  seen min_load
-+---------------------+           | min_load = load(i)         |  CPU
-     |                            +----------------------------+
-     | no                                       |
-     |                                          ---------> next CPU
-     |
-     |
-     |
-     | does CPU have lower load than            CPU has lower load than
-     V previously seen min_load CPU             previously seen lowest load
-+--------------------+           yes            +-----------------+
-| load(i) < min_load |------------------------->| best_cpu = i    |
-+--------------------+                          | min_load = load |
-     |                                          +-----------------+
-     | no                                                |
-     |                                                   |
-     | if load is tied with lowest previously            |
-     | seen lowest load, is power cost less              |
-     V                                                   |
-+------------------------+                               |
-| load(i) == min_load && |   yes   +--------------+      |
-| cost(p, i) < min_cost  |-------->| best_cpu = i |      |
-+------------------------+         +--------------+      |
-     |                                    |              |
-     | no                                 |             /
-     \_____________________________       |  __________/
-                                   \      | /
-                                    |     | |
-                                    V     V V           if power cost of this
-                              +----------------------+  CPU is lower than
-                              | cost(p,i) < min_cost |  current min, update
-                              +----------------------+  min_cost
-                                    |           |
-                                    | yes       | no
-                                    |           ----------> next CPU
-                                    V
-                              +----------------------+
-                              | min_cost = cost(p,i) |-------> next CPU
-                              +----------------------+
-
-Once this flow chart has been evaluated for every online CPU the task
-may run on, if a "best_cpu" was found, it is returned. If a best_cpu
-was not found but a fallback_idle_cpu was found, then the
-fallback_idle_cpu is returned. Finally, if no best_cpu or
-fallback_idle cpu was found, then the task's previous CPU is returned.
-
-Phew! Fortunately, all of that can be summarized relatively easily. The
-order of CPU preference for a non-small task is the following:
-
- 1. The least-loaded CPU the task is allowed to run on in the lowest
-    power band where the task will fit and where the placement will
-    not result in cpu exceeding spill level. When there is a tie of
-    two cpus at same load, their CPU with the lowest power cost is
-    chosen.
-
- 2. The least-loaded mostly idle CPU that the task is allowed to run
-    on where the task won't fit (since there was no CPU where the
-    task would fit).
-
- 3. The CPU which the task last ran on.
+The order of CPU preference for a non-small task when sched_prefer_idle = 1 is
+the following:
+
+ 1. The shallowest-cstate idle CPU in the lowest-power cluster which can fit
+    the task. Where there is a tie of two CPUs with the same load, the CPU with
+    the lowest power cost is chosen.
+
+ 2. The least-loaded CPU the task is allowed to run on in the lowest power band
+    where the task will fit and where the placement will not result in cpu
+    exceeding spill level. When there is a tie of two CPUs at same load, the
+    CPU with the lowest power cost is chosen.
+
+ 3. The least-loaded mostly idle CPU that the task is allowed to run on where
+    the task won't fit (since there was no CPU where the task would fit).
+
+ 4. The CPU which the task last ran on.
+
+The order of CPU preference for a non-small task when sched_prefer_idle = 0
+is the following:
+
+ 1. The least-loaded non-idle mostly idle CPU the task is allowed to run on in
+    the lowest power band where the task will fit. When there is a tie of two
+    CPUs at same load, the CPU with the lowest power cost is chosen.
+
+ 2. The shallowest-cstate idle CPU in the lowest-power cluster which can fit
+    the task. Where there is a tie of two CPUs with the same load, the CPU with
+    the lowest power cost is chosen.
+
+ 3. The least-loaded CPU the task is allowed to run on in the lowest power band
+    where the task will fit and where the placement will not result in the CPU
+    exceeding spill level. When there is a tie of two CPUs at the same load,
+    the CPU with the lowest power cost is chosen.
+
+ 4. The least-loaded mostly idle CPU that the task is allowed to run on where
+    the task won't fit (since there was no CPU where the task would fit).
+
+ 5. The CPU which the task last ran on.
 
 --- Wakeup Logic a Small Task "p"
 
-The online CPUs the task is allowed to run on are scanned and the
-lowest power CPU is found. This is marked as the min_cost_cpu.
-
-If the minimum cost CPU is mostly idle but not idle, that CPU is
-immediately chosen.
-
-If the minimum cost CPU is idle or not mostly idle, then the following
-will be evaluated for every online CPU i the task is allowed to run
-on:
-     | is CPU i in higher power band      is this CPU lower power than
-     V than min_cost_cpu?                 best fallback CPU seen
-+---------------------+                +-----------------------+
-| cost(p, i) is       |      yes       | cost(p,i) <           |  no  +--------+
-| > band_limit % more |--------------->| min_fallback_cpu_cost |----->| ignore |
-| than min_cost_cpu   |                +-----------------------+      | cpu    |
-+---------------------+                            |                  +--------+
-     |                                             | yes                   |
-     | no                                          |                       V
-     |                                             |                   next cpu
-     |              is this CPU                    V
-     V              idle          +-----------------------------------+
-+-----------------+ yes           | best_fallback_cpu = i             |
-| cpu cstate > 0? |-----------    | min_fallback_cpu_cost = cost(p,i) |
-+-----------------+          |    +-----------------------------------+
-     |                       |                     |
-     | no                    |                     \------> next CPU
-     |                       |  is this CPU
-     | is this CPU           |  the shallowest
-     V mostly idle           |  idle CPU seen
-+--------------+   +----------------------+  no  +--------+
-| cpu i        |   | cstate < min_cstate? |----->| ignore |
-| mostly idle? |   +----------------------+      | cpu    |--> next cpu
-+--------------+                    |            +--------+
-     |    |                         | yes
-     | no | yes                     |
-     |    |       +--------------+  |     +---------------------+
-     |    \------>| return cpu i |  ----->| min_cstate_cpu = i  |
-     |            +--------------+        | min_cstate = cstate |
-     |                                    +---------------------+
-     |                                               |
-     | will task not cross spill                     |
-     | threshold, and is this the                    |
-     V least loaded busy CPU we've seen              |
-+-------------------------+                          \-----> next cpu
-| !(p causes spill) &&    |  no   +--------+
-| load(i) < min_busy_load |------>| ignore |---> next cpu
-+-------------------------+       | cpu    |
-     |                            +--------+
-     | yes
-     V
-+----------------------+
-| best_busy_cpu = i    |
-| min_busy_load = load |--------> next cpu
-+----------------------+
-
-Note that the process of evaluating the flow chart for every online
-CPU the task may run on could be interrupted if a mostly idle CPU is
-found in the lowest power band. Such a CPU will be selected
-immediately by the algorithm. Otherwise, once the flow chart has been
-evaluated for every online CPU the task is allowed to run on, a CPU is
-selected from the candidates. If one or more idle CPUs exist in the
-lowest power band then the one in the shallowest C-state is
-returned. If not, then the least loaded CPU in the lowest power band
-which would not exceed its spill threshold by accepting the task is
-selected, assuming it exists. If none of the former possibilities
-exist, the most power-efficient CPU outside the lowest power band is
-selected.
-
-Phew! But once again this can all be summarized. The order of CPU
-preference for a small task is the following:
+The order of CPU preference for a small task is the following:
 
  1. The lowest-power CPU, if it is not idle but is mostly idle.
- 2. A non-idle CPU in the lowest power band which is mostly idle. The
-    first such CPU found is selected.
- 3. An idle CPU in the lowest power band that is in the least shallow
-    C-state.
- 4. The least busy CPU in the lowest power band where adding the task
-    will not result in exceeding the spill threshold.
+
+ 2. A non-idle CPU in the lowest power band which is mostly idle. The first
+    such CPU found is selected.
+
+ 3. An idle CPU in the lowest power band that is in the least shallow C-state.
+
+ 4. The least busy CPU in the lowest power band where adding the task will not
+    result in exceeding the spill threshold.
+
  5. The most power-efficient CPU outside of the lowest power band.
 
 *** 5.3 Scheduler Tick
@@ -1369,6 +1240,16 @@ longer eligible to be seen as mostly idle. This will affect the task
 placement logic described above, causing the scheduler to try and
 steer tasks away from the CPU.
 
+** 7.23 sched_prefer_idle
+
+Appears at: /proc/sys/kernel/sched_prefer_idle
+
+Default value: 1
+
+Non-small tasks will prefer to wake up on idle CPUs if this tunable is set to 1.
+If the tunable is set to 0, non-small tasks will prefer to wake up on mostly
+idle CPUs which are not completely idle, increasing task packing behavior.
+
 =========================
 8. HMP SCHEDULER TRACE POINTS
 =========================
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index a400b155bb47..25bdacde2d83 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -65,6 +65,7 @@ extern unsigned int sysctl_sched_small_task_pct;
 extern unsigned int sysctl_sched_upmigrate_pct;
 extern unsigned int sysctl_sched_downmigrate_pct;
 extern int sysctl_sched_upmigrate_min_nice;
+extern unsigned int sysctl_sched_prefer_idle;
 extern unsigned int sysctl_sched_powerband_limit_pct;
 extern unsigned int sysctl_sched_boost;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bd4f9fc66950..0c9533a2854a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2634,6 +2634,13 @@ unsigned int __read_mostly sysctl_sched_downmigrate_pct = 60;
 int __read_mostly sysctl_sched_upmigrate_min_nice = 15;
 
 /*
+ * Tunable to govern scheduler wakeup placement CPU selection
+ * preference. If set, the scheduler chooses to wake up a task
+ * on an idle CPU.
+ */
+unsigned int __read_mostly sysctl_sched_prefer_idle = 1;
+
+/*
  * Scheduler boost is a mechanism to temporarily place tasks on CPUs
  * with higher capacity than those where a task would have normally
  * ended up with their load characteristics. Any entity enabling
@@ -3166,13 +3173,15 @@ static int select_packing_target(struct task_struct *p, int best_cpu)
 /* return cheapest cpu that can fit this task */
 static int select_best_cpu(struct task_struct *p, int target, int reason)
 {
-	int i, best_cpu = -1, fallback_idle_cpu = -1;
+	int i, best_cpu = -1, fallback_idle_cpu = -1, min_cstate_cpu = -1;
 	int prev_cpu = task_cpu(p);
 	int cpu_cost, min_cost = INT_MAX;
+	int min_idle_cost = INT_MAX, min_busy_cost = INT_MAX;
 	u64 load, min_load = ULLONG_MAX, min_fallback_load = ULLONG_MAX;
 	int small_task = is_small_task(p);
 	int boost = sched_boost();
 	int cstate, min_cstate = INT_MAX;
+	int prefer_idle = reason ? 1 : sysctl_sched_prefer_idle;
 
 	trace_sched_task_load(p, small_task, boost, reason);
 
@@ -3225,43 +3234,67 @@ static int select_best_cpu(struct task_struct *p, int target, int reason)
 		 * overrides load and C-state.
 		 */
 		if (power_delta_exceeded(cpu_cost, min_cost)) {
-			if (cpu_cost < min_cost) {
-				min_load = load;
-				min_cost = cpu_cost;
+			if (cpu_cost > min_cost)
+				continue;
+
+			min_cost = cpu_cost;
+			min_load = ULLONG_MAX;
+			min_cstate = INT_MAX;
+			min_cstate_cpu = -1;
+			best_cpu = -1;
+		}
+
+		/*
+		 * Partition CPUs based on whether they are completely idle
+		 * or not. For completely idle CPUs we choose the one in
+		 * the lowest C-state and then break ties with power cost
+		 */
+		if (idle_cpu(i)) {
+			if (cstate > min_cstate)
+				continue;
+
+			if (cstate < min_cstate) {
+				min_idle_cost = cpu_cost;
 				min_cstate = cstate;
-				best_cpu = i;
+				min_cstate_cpu = i;
+				continue;
+			}
+
+			if (cpu_cost < min_idle_cost) {
+				min_idle_cost = cpu_cost;
+				min_cstate_cpu = i;
 			}
 			continue;
 		}
 
-		/* After power band, load is prioritized next. */
+		/*
+		 * For CPUs that are not completely idle, pick one with the
+		 * lowest load and break ties with power cost
+		 */
+		if (load > min_load)
+			continue;
+
 		if (load < min_load) {
 			min_load = load;
-			min_cost = cpu_cost;
-			min_cstate = cstate;
+			min_busy_cost = cpu_cost;
 			best_cpu = i;
 			continue;
 		}
-		if (load > min_load)
-			continue;
 
 		/*
 		 * The load is equal to the previous selected CPU.
-		 * This will most often occur when deciding between
-		 * idle CPUs. Power cost is prioritized after load,
-		 * followed by cstate.
+		 * This is rare but when it does happen opt for the
+		 * more power efficient CPU option.
 		 */
-		if (cpu_cost < min_cost) {
-			min_cost = cpu_cost;
-			min_cstate = cstate;
-			best_cpu = i;
-			continue;
-		}
-		if (cpu_cost == min_cost && cstate < min_cstate) {
-			min_cstate = cstate;
+		if (cpu_cost < min_busy_cost) {
+			min_busy_cost = cpu_cost;
 			best_cpu = i;
 		}
 	}
+
+	if (min_cstate_cpu >= 0 && (prefer_idle ||
+		!(best_cpu >= 0 && mostly_idle_cpu(best_cpu))))
+		best_cpu = min_cstate_cpu;
 done:
 	if (best_cpu < 0) {
 		if (unlikely(fallback_idle_cpu < 0))
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 738b154269ea..1465fb869657 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -408,6 +408,13 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= proc_dointvec,
 	},
 	{
+		.procname	= "sched_prefer_idle",
+		.data		= &sysctl_sched_prefer_idle,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+	{
 		.procname	= "sched_init_task_load",
 		.data		= &sysctl_sched_init_task_load_pct,
 		.maxlen		= sizeof(unsigned int),
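
The selection rule added by the kernel/sched/fair.c hunk can be modelled outside the kernel. What follows is a minimal sketch under stated assumptions, not the kernel code: `struct cpu_snapshot` and `pick_cpu()` are hypothetical names, the per-CPU values are passed in by the caller instead of being read from run queues, and the power-band, task-fit and spill checks that select_best_cpu() performs earlier in the loop are left out. It keeps only the two partitions (completely idle CPUs ranked by C-state, other CPUs ranked by load, both with power cost as tie-break) and the final prefer_idle decision.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical per-CPU snapshot; the kernel derives these from rq state. */
struct cpu_snapshot {
	bool idle;               /* stands in for idle_cpu(i) */
	bool mostly_idle;        /* stands in for mostly_idle_cpu(i) */
	int cstate;              /* lower value = shallower idle state */
	int cost;                /* power cost of placing the task here */
	unsigned long long load; /* current CPU load */
};

/*
 * Simplified model of the post-patch scan: returns the chosen CPU index,
 * or -1 if no candidate exists. Power-band and fit filtering are omitted.
 */
static int pick_cpu(const struct cpu_snapshot *cpus, int nr_cpus,
		    int prefer_idle)
{
	int best_cpu = -1, min_cstate_cpu = -1;
	int min_cstate = INT_MAX;
	int min_idle_cost = INT_MAX, min_busy_cost = INT_MAX;
	unsigned long long min_load = ULLONG_MAX;

	for (int i = 0; i < nr_cpus; i++) {
		const struct cpu_snapshot *c = &cpus[i];

		if (c->idle) {
			/* Idle CPUs: shallowest C-state, then lowest cost. */
			if (c->cstate < min_cstate ||
			    (c->cstate == min_cstate &&
			     c->cost < min_idle_cost)) {
				min_cstate = c->cstate;
				min_idle_cost = c->cost;
				min_cstate_cpu = i;
			}
			continue;
		}

		/* Non-idle CPUs: lowest load, then lowest cost. */
		if (c->load < min_load ||
		    (c->load == min_load && c->cost < min_busy_cost)) {
			min_load = c->load;
			min_busy_cost = c->cost;
			best_cpu = i;
		}
	}

	/*
	 * Same condition as the patch: take the idle candidate when idle
	 * CPUs are preferred, or when the busy candidate is missing or is
	 * not mostly idle.
	 */
	if (min_cstate_cpu >= 0 &&
	    (prefer_idle || !(best_cpu >= 0 && cpus[best_cpu].mostly_idle)))
		best_cpu = min_cstate_cpu;

	return best_cpu;
}
```

In this model, a mix of idle CPUs and busy but mostly idle CPUs yields the shallowest idle CPU when prefer_idle is 1 and the least-loaded mostly idle CPU when it is 0. In the patch itself, a non-zero `reason` forces prefer_idle to 1 regardless of the sysctl value.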
