author     Srivatsa Vaddagiri <vatsa@codeaurora.org>  2014-10-31 16:04:00 -0700
committer  David Keitel <dkeitel@codeaurora.org>  2016-03-23 20:01:12 -0700
commit     72b7c5d36c6440cad49659190867865648bdaa00
tree       b834ec8f534a58eff0ad957d3e3a3d5ba7c27837
parent     588055e8c73e8ef0c8c23ab5db45453aa49be665
sched: Provide knob to prefer mostly_idle over idle cpus
sysctl_sched_prefer_idle lets the scheduler bias selection of idle cpus over
mostly idle cpus for tasks. This knob could be useful to control balance
between power and performance.

Change-Id: Ide6eef684ef94ac8b9927f53c220ccf94976fe67
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
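
The final choice this patch adds to select_best_cpu() can be illustrated outside the kernel. The following is a minimal user-space sketch, not the kernel code: struct cpu_info and pick_cpu() are hypothetical names invented for illustration, and the sketch assumes every candidate already fits the task and sits in the same power band, so the fit, spill, and power-band checks are omitted.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical, simplified view of one candidate CPU (not a kernel type). */
struct cpu_info {
	bool idle;               /* completely idle */
	bool mostly_idle;        /* counts as "mostly idle" for placement */
	int cstate;              /* idle depth, lower means shallower */
	int cost;                /* power cost of running the task here */
	unsigned long long load; /* current load on the CPU */
};

/* Return the index of the chosen CPU, or -1 if there is no candidate. */
static int pick_cpu(const struct cpu_info *cpus, int n, int prefer_idle)
{
	int best_cpu = -1, min_cstate_cpu = -1;
	int min_cstate = INT_MAX;
	int min_idle_cost = INT_MAX, min_busy_cost = INT_MAX;
	unsigned long long min_load = ULLONG_MAX;

	for (int i = 0; i < n; i++) {
		const struct cpu_info *c = &cpus[i];

		if (c->idle) {
			/* Idle CPUs: shallowest C-state, ties broken by cost. */
			if (c->cstate < min_cstate ||
			    (c->cstate == min_cstate &&
			     c->cost < min_idle_cost)) {
				min_cstate = c->cstate;
				min_idle_cost = c->cost;
				min_cstate_cpu = i;
			}
			continue;
		}

		/* Busy CPUs: lowest load, ties broken by cost. */
		if (c->load < min_load ||
		    (c->load == min_load && c->cost < min_busy_cost)) {
			min_load = c->load;
			min_busy_cost = c->cost;
			best_cpu = i;
		}
	}

	/*
	 * Take the idle candidate when idle CPUs are preferred, or when
	 * the best busy candidate is missing or not mostly idle.
	 */
	if (min_cstate_cpu >= 0 &&
	    (prefer_idle || !(best_cpu >= 0 && cpus[best_cpu].mostly_idle)))
		best_cpu = min_cstate_cpu;

	return best_cpu;
}
```

Given one idle CPU and one lightly loaded, mostly idle CPU, pick_cpu() returns the idle CPU when prefer_idle is 1 and the mostly idle CPU when it is 0, which is the packing behavior the knob is meant to control.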
-rw-r--r--  Documentation/scheduler/sched-hmp.txt  237
-rw-r--r--  include/linux/sched/sysctl.h             1
-rw-r--r--  kernel/sched/fair.c                     75
-rw-r--r--  kernel/sysctl.c                          7
4 files changed, 121 insertions, 199 deletions
diff --git a/Documentation/scheduler/sched-hmp.txt b/Documentation/scheduler/sched-hmp.txt
index ecbbaec5372c..8a813a3ebef4 100644
--- a/Documentation/scheduler/sched-hmp.txt
+++ b/Documentation/scheduler/sched-hmp.txt
@@ -641,192 +641,63 @@ select_best_cpu(), represents the heart of the HMP scheduling
algorithm described in this document.
The behavior of select_best_cpu() differs depending on whether the
-task being placed is a small task or not.
+task being placed is a small task or not, and on the value of the
+sched_prefer_idle tunable.
--- Wakeup Logic a Non-Small Task "p"
- The following is evaluated for every online CPU i which task p may run on:
-
- |
- | task doesn't fit, but
- | is this CPU a good
- V fallback candidate?
-+---------------+ +-------------+ +--------+
-| does task fit |------------>| is CPU |----------->| ignore |
-| on CPU | no | mostly idle | no | cpu |
-+---------------+ +-------------+ +--------+
- | |
- | yes | yes
- | | +--------------------------+
- | --------->| load < min_fallback_load |
- | +--------------------------+
- | | |
- | | yes | no
- | V V
- | +-----------------------+ +------------+
- | | fallback_idle_cpu = i | | ignore cpu |
- | task fits, prefer +-----------------------+ +------------+
- | mostly idle CPUs | |
- | or non-max capacity V V
- | CPUs that won't hit next CPU next CPU
- | spill threshold
- V
-+---------------------+ task does not meet load requirements
-| CPU mostly idle || | no +------------+
-| (!max_capacity && |---------->| ignore cpu |----> next CPU
-| !(p causes spill)) | +------------+
-+---------------------+
- |
- | yes
- |
- |
- |
- | is CPU in a lower power band
- V than previously seen min cost CPU CPU in a lower power band
-+---------------------+ than previously seen min,
-| cost(p, i) is | yes +----------------------------+ override
-| > band_limit % less |---------->| best_cpu = i | previously
-| than current min | | min_cost = cost(p,i) | seen min_load
-+---------------------+ | min_load = load(i) | CPU
- | +----------------------------+
- | no |
- | ---------> next CPU
- |
- |
- |
- | does CPU have lower load than CPU has lower load than
- V previously seen min_load CPU previously seen lowest load
-+--------------------+ yes +-----------------+
-| load(i) < min_load |------------------------->| best_cpu = i |
-+--------------------+ | min_load = load |
- | +-----------------+
- | no |
- | |
- | if load is tied with lowest previously |
- | seen lowest load, is power cost less |
- V |
-+------------------------+ |
-| load(i) == min_load && | yes +--------------+ |
-| cost(p, i) < min_cost |-------->| best_cpu = i | |
-+------------------------+ +--------------+ |
- | | |
- | no | /
- \_____________________________ | __________/
- \ | /
- | | |
- V V V if power cost of this
- +----------------------+ CPU is lower than
- | cost(p,i) < min_cost | current min, update
- +----------------------+ min_cost
- | |
- | yes | no
- | ----------> next CPU
- V
- +----------------------+
- | min_cost = cost(p,i) |-------> next CPU
- +----------------------+
-
-Once this flow chart has been evaluated for every online CPU the task
-may run on, if a "best_cpu" was found, it is returned. If a best_cpu
-was not found but a fallback_idle_cpu was found, then the
-fallback_idle_cpu is returned. Finally, if no best_cpu or
-fallback_idle cpu was found, then the task's previous CPU is returned.
-
-Phew! Fortunately, all of that can be summarized relatively easily. The
-order of CPU preference for a non-small task is the following:
-
- 1. The least-loaded CPU the task is allowed to run on in the lowest
- power band where the task will fit and where the placement will
- not result in cpu exceeding spill level. When there is a tie of
- two cpus at same load, their CPU with the lowest power cost is
- chosen.
-
- 2. The least-loaded mostly idle CPU that the task is allowed to run
- on where the task won't fit (since there was no CPU where the
- task would fit).
-
- 3. The CPU which the task last ran on.
+The order of CPU preference for a non-small task when sched_prefer_idle = 1 is
+the following:
+
+ 1. The shallowest-cstate idle CPU in the lowest-power cluster which can fit
+ the task. Where there is a tie of two CPUs in the same C-state, the CPU with
+ the lowest power cost is chosen.
+
+ 2. The least-loaded CPU the task is allowed to run on in the lowest power band
+ where the task will fit and where the placement will not result in the CPU
+ exceeding spill level. When there is a tie of two CPUs at the same load,
+ the CPU with the lowest power cost is chosen.
+
+ 3. The least-loaded mostly idle CPU that the task is allowed to run on where
+ the task won't fit (since there was no CPU where the task would fit).
+
+ 4. The CPU which the task last ran on.
+
+The order of CPU preference for a non-small task when sched_prefer_idle = 0
+is the following:
+
+ 1. The least-loaded mostly idle but not fully idle CPU the task may run on in
+ the lowest power band where the task will fit. When there is a tie of two
+ CPUs at the same load, the CPU with the lowest power cost is chosen.
+
+ 2. The shallowest-cstate idle CPU in the lowest-power cluster which can fit
+ the task. Where there is a tie of two CPUs in the same C-state, the CPU with
+ the lowest power cost is chosen.
+
+ 3. The least-loaded CPU the task is allowed to run on in the lowest power band
+ where the task will fit and where the placement will not result in the CPU
+ exceeding spill level. When there is a tie of two CPUs at the same load,
+ the CPU with the lowest power cost is chosen.
+
+ 4. The least-loaded mostly idle CPU that the task is allowed to run on where
+ the task won't fit (since there was no CPU where the task would fit).
+
+ 5. The CPU which the task last ran on.
--- Wakeup Logic a Small Task "p"
-The online CPUs the task is allowed to run on are scanned and the
-lowest power CPU is found. This is marked as the min_cost_cpu.
-
-If the minimum cost CPU is mostly idle but not idle, that CPU is
-immediately chosen.
-
-If the minimum cost CPU is idle or not mostly idle, then the following
-will be evaluated for every online CPU i the task is allowed to run
-on:
- | is CPU i in higher power band is this CPU lower power than
- V than min_cost_cpu? best fallback CPU seen
-+---------------------+ +-----------------------+
-| cost(p, i) is | yes | cost(p,i) < | no +--------+
-| > band_limit % more |--------------->| min_fallback_cpu_cost |----->| ignore |
-| than min_cost_cpu | +-----------------------+ | cpu |
-+---------------------+ | +--------+
- | | yes |
- | no | V
- | | next cpu
- | is this CPU V
- V idle +-----------------------------------+
-+-----------------+ yes | best_fallback_cpu = i |
-| cpu cstate > 0? |----------- | min_fallback_cpu_cost = cost(p,i) |
-+-----------------+ | +-----------------------------------+
- | | |
- | no | \------> next CPU
- | | is this CPU
- | is this CPU | the shallowest
- V mostly idle | idle CPU seen
-+--------------+ +----------------------+ no +--------+
-| cpu i | | cstate < min_cstate? |----->| ignore |
-| mostly idle? | +----------------------+ | cpu |--> next cpu
-+--------------+ | +--------+
- | | | yes
- | no | yes |
- | | +--------------+ | +---------------------+
- | \------>| return cpu i | ----->| min_cstate_cpu = i |
- | +--------------+ | min_cstate = cstate |
- | +---------------------+
- | |
- | will task not cross spill |
- | threshold, and is this the |
- V least loaded busy CPU we've seen |
-+-------------------------+ \-----> next cpu
-| !(p causes spill) && | no +--------+
-| load(i) < min_busy_load |------>| ignore |---> next cpu
-+-------------------------+ | cpu |
- | +--------+
- | yes
- V
-+----------------------+
-| best_busy_cpu = i |
-| min_busy_load = load |--------> next cpu
-+----------------------+
-
-Note that the process of evaluating the flow chart for every online
-CPU the task may run on could be interrupted if a mostly idle CPU is
-found in the lowest power band. Such a CPU will be selected
-immediately by the algorithm. Otherwise, once the flow chart has been
-evaluated for every online CPU the task is allowed to run on, a CPU is
-selected from the candidates. If one or more idle CPUs exist in the
-lowest power band then the one in the shallowest C-state is
-returned. If not, then the least loaded CPU in the lowest power band
-which would not exceed its spill threshold by accepting the task is
-selected, assuming it exists. If none of the former possibilities
-exist, the most power-efficient CPU outside the lowest power band is
-selected.
-
-Phew! But once again this can all be summarized. The order of CPU
-preference for a small task is the following:
+The order of CPU preference for a small task is the following:
1. The lowest-power CPU, if it is not idle but is mostly idle.
- 2. A non-idle CPU in the lowest power band which is mostly idle. The
- first such CPU found is selected.
- 3. An idle CPU in the lowest power band that is in the least shallow
- C-state.
- 4. The least busy CPU in the lowest power band where adding the task
- will not result in exceeding the spill threshold.
+
+ 2. A non-idle CPU in the lowest power band which is mostly idle. The first
+ such CPU found is selected.
+
+ 3. An idle CPU in the lowest power band that is in the shallowest C-state.
+
+ 4. The least busy CPU in the lowest power band where adding the task will not
+ result in exceeding the spill threshold.
+
5. The most power-efficient CPU outside of the lowest power band.
*** 5.3 Scheduler Tick
@@ -1369,6 +1240,16 @@ longer eligible to be seen as mostly idle. This will affect the task placement
logic described above, causing the scheduler to try and steer tasks away from
the CPU.
+** 7.23 sched_prefer_idle
+
+Appears at: /proc/sys/kernel/sched_prefer_idle
+
+Default value: 1
+
+Non-small tasks will prefer to wake up on idle CPUs if this tunable is set to 1.
+If the tunable is set to 0, non-small tasks will prefer to wake up on mostly
+idle CPUs which are not completely idle, increasing task packing behavior.
+
=========================
8. HMP SCHEDULER TRACE POINTS
=========================
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index a400b155bb47..25bdacde2d83 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -65,6 +65,7 @@ extern unsigned int sysctl_sched_small_task_pct;
extern unsigned int sysctl_sched_upmigrate_pct;
extern unsigned int sysctl_sched_downmigrate_pct;
extern int sysctl_sched_upmigrate_min_nice;
+extern unsigned int sysctl_sched_prefer_idle;
extern unsigned int sysctl_sched_powerband_limit_pct;
extern unsigned int sysctl_sched_boost;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bd4f9fc66950..0c9533a2854a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2634,6 +2634,13 @@ unsigned int __read_mostly sysctl_sched_downmigrate_pct = 60;
int __read_mostly sysctl_sched_upmigrate_min_nice = 15;
/*
+ * Tunable to govern scheduler wakeup placement CPU selection
+ * preference. If set, the scheduler chooses to wake up a task
+ * on an idle CPU.
+ */
+unsigned int __read_mostly sysctl_sched_prefer_idle = 1;
+
+/*
* Scheduler boost is a mechanism to temporarily place tasks on CPUs
* with higher capacity than those where a task would have normally
* ended up with their load characteristics. Any entity enabling
@@ -3166,13 +3173,15 @@ static int select_packing_target(struct task_struct *p, int best_cpu)
/* return cheapest cpu that can fit this task */
static int select_best_cpu(struct task_struct *p, int target, int reason)
{
- int i, best_cpu = -1, fallback_idle_cpu = -1;
+ int i, best_cpu = -1, fallback_idle_cpu = -1, min_cstate_cpu = -1;
int prev_cpu = task_cpu(p);
int cpu_cost, min_cost = INT_MAX;
+ int min_idle_cost = INT_MAX, min_busy_cost = INT_MAX;
u64 load, min_load = ULLONG_MAX, min_fallback_load = ULLONG_MAX;
int small_task = is_small_task(p);
int boost = sched_boost();
int cstate, min_cstate = INT_MAX;
+ int prefer_idle = reason ? 1 : sysctl_sched_prefer_idle;
trace_sched_task_load(p, small_task, boost, reason);
@@ -3225,43 +3234,67 @@ static int select_best_cpu(struct task_struct *p, int target, int reason)
* overrides load and C-state.
*/
if (power_delta_exceeded(cpu_cost, min_cost)) {
- if (cpu_cost < min_cost) {
- min_load = load;
- min_cost = cpu_cost;
+ if (cpu_cost > min_cost)
+ continue;
+
+ min_cost = cpu_cost;
+ min_load = ULLONG_MAX;
+ min_cstate = INT_MAX;
+ min_cstate_cpu = -1;
+ best_cpu = -1;
+ }
+
+ /*
+ * Partition CPUs based on whether they are completely idle
+ * or not. For completely idle CPUs we choose the one in
+ * the lowest C-state and then break ties with power cost
+ */
+ if (idle_cpu(i)) {
+ if (cstate > min_cstate)
+ continue;
+
+ if (cstate < min_cstate) {
+ min_idle_cost = cpu_cost;
min_cstate = cstate;
- best_cpu = i;
+ min_cstate_cpu = i;
+ continue;
+ }
+
+ if (cpu_cost < min_idle_cost) {
+ min_idle_cost = cpu_cost;
+ min_cstate_cpu = i;
}
continue;
}
- /* After power band, load is prioritized next. */
+ /*
+ * For CPUs that are not completely idle, pick one with the
+ * lowest load and break ties with power cost
+ */
+ if (load > min_load)
+ continue;
+
if (load < min_load) {
min_load = load;
- min_cost = cpu_cost;
- min_cstate = cstate;
+ min_busy_cost = cpu_cost;
best_cpu = i;
continue;
}
- if (load > min_load)
- continue;
/*
* The load is equal to the previous selected CPU.
- * This will most often occur when deciding between
- * idle CPUs. Power cost is prioritized after load,
- * followed by cstate.
+ * This is rare but when it does happen opt for the
+ * more power efficient CPU option.
*/
- if (cpu_cost < min_cost) {
- min_cost = cpu_cost;
- min_cstate = cstate;
- best_cpu = i;
- continue;
- }
- if (cpu_cost == min_cost && cstate < min_cstate) {
- min_cstate = cstate;
+ if (cpu_cost < min_busy_cost) {
+ min_busy_cost = cpu_cost;
best_cpu = i;
}
}
+
+ if (min_cstate_cpu >= 0 && (prefer_idle ||
+ !(best_cpu >= 0 && mostly_idle_cpu(best_cpu))))
+ best_cpu = min_cstate_cpu;
done:
if (best_cpu < 0) {
if (unlikely(fallback_idle_cpu < 0))
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 738b154269ea..1465fb869657 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -408,6 +408,13 @@ static struct ctl_table kern_table[] = {
.proc_handler = proc_dointvec,
},
{
+ .procname = "sched_prefer_idle",
+ .data = &sysctl_sched_prefer_idle,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+ {
.procname = "sched_init_task_load",
.data = &sysctl_sched_init_task_load_pct,
.maxlen = sizeof(unsigned int),