Commit message log
This is already done a few lines up.
Change-Id: Ifeda223d728bfc4e107407418b11303f7819e277
The usage of `now` in __refill_cfs_bandwidth_runtime() was removed in
b933e4d37bc023d27c7394626669bae0a201da52 ("sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices").
Remove the variable and the call that sets it.
Change-Id: I1e98ccd2c2298d269b9b447a6e5c79d61518c66e
The implementation is utterly broken, resulting in all processes being
allowed to move tasks between sets (as long as they have access to the
"tasks" attribute), and upstream is heading towards checking only the
capability anyway, so let's get rid of this code.
BUG=b:31790445,chromium:647994
TEST=Boot android container, examine logcat
Change-Id: I2f780a5992c34e52a8f2d0b3557fc9d490da2779
Signed-off-by: Dmitry Torokhov <dtor@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/394967
Reviewed-by: Ricky Zhou <rickyz@chromium.org>
Reviewed-by: John Stultz <john.stultz@linaro.org>
This reverts commit f39d0b496aa4e13cbcf23bdf3c22d356bd783b9e.
It is no longer applicable; the cgroup allow_attach hook was ditched upstream.
Change-Id: Ib370171e93a5cb0c9993065b98de2073c905229c
Currently, eBPF only understands BPF_JGT (>), BPF_JGE (>=),
BPF_JSGT (s>), BPF_JSGE (s>=) instructions, this means that
particularly *JLT/*JLE counterparts involving immediates need
to be rewritten from e.g. X < [IMM] by swapping arguments into
[IMM] > X, meaning the immediate first needs to be loaded
into a register Y := [IMM], so that we can then compare with
Y > X. Note that the destination operand is always required to
be a register.
This has the downside of unnecessarily increased register
pressure, meaning complex programs would need to spill other
registers temporarily to the stack in order to obtain an unused
register for the [IMM]. Loading to registers will thus also
affect state pruning since we need to account for that register
use and potentially those registers that had to be spilled/filled
again. As a consequence slightly more stack space might have
been used due to spilling, and BPF programs are a bit longer
due to extra code involving the register load and potentially
required spill/fills.
Thus, add BPF_JLT (<), BPF_JLE (<=), BPF_JSLT (s<), BPF_JSLE (s<=)
counterparts to the eBPF instruction set. Modifying LLVM to
remove the NegateCC() workaround in a PoC patch at [1] and
allowing it to also emit the new instructions resulted in
cilium's BPF programs that are injected into the fast-path to
have a reduced program length in the range of 2-3% (e.g.
accumulated main and tail call sections from one of the object
file reduced from 4864 to 4729 insns), reduced complexity in
the range of 10-30% (e.g. accumulated sections reduced in one
of the cases from 116432 to 88428 insns), and reduced stack
usage in the range of 1-5% (e.g. accumulated sections from one
of the object files reduced from 824 to 784b).
The modification for LLVM will be incorporated in a backwards
compatible way. Plan is for LLVM to have i) a target specific
option to offer a possibility to explicitly enable the extension
by the user (as we have with -m target specific extensions today
for various CPU insns), and ii) have the kernel checked for
presence of the extensions and enable them transparently when
the user is selecting more aggressive options such as -march=native
in a bpf target context. (Other frontends generating BPF byte
code, e.g. ply can probe the kernel directly for its code
generation.)
[1] https://github.com/borkmann/llvm/tree/bpf-insns
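As a purely illustrative aside (not part of this patch), the effect can be
sketched with the kernel's BPF instruction macros; the register and immediate
choices below are arbitrary:
  #include <linux/filter.h>

  static const struct bpf_insn jlt_demo[] = {
          /* Without BPF_JLT, "r1 < 42" must be rewritten as "42 > r1",
           * which costs a scratch register for the immediate:
           */
          BPF_MOV64_IMM(BPF_REG_9, 42),                  /* r9 = 42            */
          BPF_JMP_REG(BPF_JGT, BPF_REG_9, BPF_REG_1, 2), /* if r9 > r1 goto +2 */

          /* With BPF_JLT the immediate can be compared against directly: */
          BPF_JMP_IMM(BPF_JLT, BPF_REG_1, 42, 2),        /* if r1 < 42 goto +2 */
  };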
Change-Id: Ic56500aaeaf5f3ebdfda094ad6ef4666c82e18c5
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Free up the BPF_JMP | BPF_CALL | BPF_X opcode to be used by an actual
indirect call by register and use a kernel-internal opcode to
mark the call instruction into the bpf_tail_call() helper.
Change-Id: I1a45b8e3c13848c9689ce288d4862935ede97fa7
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove the dummy bpf_jit_compile() stubs for eBPF JITs and make
that a single __weak function in the core that can be overridden
similarly to the eBPF one. Also remove stale pr_err() mentions
of bpf_jit_compile.
Change-Id: Iac221c09e9ae0879acdd7064d710c4f7cb8f478d
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
[ Upstream commit ef9188bcc6ca1d8a2ad83e826b548e6820721061 ]
To prepare for supporting an asynchronous tracer_init_tracefs initcall,
avoid calling create_trace_option_files before __update_tracer_options.
Otherwise, create_trace_option_files will show a warning because
some tracers in the trace_types list are already in tr->topts.
For example, hwlat_tracer calls register_tracer in a late_initcall,
and since global_trace.dir is already created in tracing_init_dentry,
hwlat_tracer will be put into tr->topts.
Then, if __update_tracer_options is executed after hwlat_tracer is
registered, create_trace_option_files finds that hwlat_tracer is
already in tr->topts.
Link: https://lkml.kernel.org/r/20220426122407.17042-2-mark-pk.tsai@mediatek.com
Link: https://lore.kernel.org/lkml/20220322133339.GA32582@xsang-OptiPlex-9020/
Reported-by: kernel test robot <oliver.sang@intel.com>
Change-Id: Icaa5bcb097b421d2459b082644f84031e89ac88b
Signed-off-by: Mark-PK Tsai <mark-pk.tsai@mediatek.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 6015b1aca1a233379625385feb01dd014aca60b5 ]
The getaffinity() system call uses 'cpumask_size()' to decide how big
the CPU mask is - so far so good. It is indeed the allocation size of a
cpumask.
But the code also assumes that the whole allocation is initialized
without actually doing so itself. That's wrong, because we might have
fixed-size allocations (making copying and clearing more efficient), but
not all of it is then necessarily used if 'nr_cpu_ids' is smaller.
Having checked other users of 'cpumask_size()', they all seem to be ok,
either using it purely for the allocation size, or explicitly zeroing
the cpumask before using the size in bytes to copy it.
See for example the ublk_ctrl_get_queue_affinity() function that uses
the proper 'zalloc_cpumask_var()' to make sure that the whole mask is
cleared, whether the storage is on the stack or if it was an external
allocation.
Fix this by just zeroing the allocation before using it. Do the same
for the compat version of sched_getaffinity(), which had the same logic.
Also, for consistency, make sched_getaffinity() use 'cpumask_bits()' to
access the bits. For a cpumask_var_t, it ends up being a pointer to the
same data either way, but it's just a good idea to treat it like you
would a 'cpumask_t'. The compat case already did that.
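The pattern described above can be sketched as follows (a hedged illustration,
not the exact upstream hunk; the wrapper name is made up):
  #include <linux/cpumask.h>
  #include <linux/gfp.h>
  #include <linux/sched.h>
  #include <linux/uaccess.h>

  /* Zero the whole cpumask allocation up front, fill in at most nr_cpu_ids
   * bits, then copy the full cpumask_size() bytes out to userspace.
   */
  static long example_getaffinity(struct task_struct *p,
                                  unsigned long __user *user_mask_ptr)
  {
          cpumask_var_t mask;
          long ret = 0;

          if (!zalloc_cpumask_var(&mask, GFP_KERNEL))   /* allocate and clear */
                  return -ENOMEM;

          cpumask_copy(mask, &p->cpus_allowed);         /* sets <= nr_cpu_ids bits */

          if (copy_to_user(user_mask_ptr, cpumask_bits(mask), cpumask_size()))
                  ret = -EFAULT;

          free_cpumask_var(mask);
          return ret;
  }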
Reported-by: Ryan Roberts <ryan.roberts@arm.com>
Link: https://lore.kernel.org/lkml/7d026744-6bd6-6827-0471-b5e8eae0be3f@arm.com/
Cc: Yury Norov <yury.norov@gmail.com>
Change-Id: I60139451bd9e9a4b687f0f2097ac1b2813758c45
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Ulrich Hecht <uli+cip@fpond.eu>
CAF did not use WALT on msm-4.4 kernels and left out important WALT
bits.
Change-Id: If5de7100e010f299bd7b2c62720ff309a98c569d
Commit 6d5adb184946 ("sched: Restore previous implementation of check_for_migration()")
reverted parts of an upstream commit including the
"EAS scheduler implementation" of check_for_migration()
in favor of the HMP implementation, as the former breaks the latter.
However, without HMP we do want the former.
Hence add both and select between them based on CONFIG_SCHED_HMP.
Note that CONFIG_SMP is a precondition for CONFIG_SCHED_HMP, so the
guard in the header uses the former.
Change-Id: Iac0b462a38b35d1670d56ba58fee532a957c60b3
The code using this was moved to notify_migration but the query and the
variable were not removed.
Fixes: 375d7195fc257 ("sched: move out migration notification out of spinlock")
Change-Id: I75569443db2a55510c8279993e94b3e1a0ed562c
commit c1ac03af6ed45d05786c219d102f37eb44880f28 upstream.
print_trace_line may overflow the seq_file buffer. If the event is not
consumed, the while loop keeps peeking at this event, causing an infinite loop.
Link: https://lkml.kernel.org/r/20221129113009.182425-1-yangjihong1@huawei.com
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: stable@vger.kernel.org
Fixes: 088b1e427dbba ("ftrace: pipe fixes")
Change-Id: Ic7f41111b7c7944e9f64b010a3c1ff095c757b9d
Signed-off-by: Yang Jihong <yangjihong1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ulrich Hecht <uli+cip@fpond.eu>
[ Upstream commit eecb91b9f98d6427d4af5fdb8f108f52572a39e7 ]
Kmemleak report a leak in graph_trace_open():
unreferenced object 0xffff0040b95f4a00 (size 128):
comm "cat", pid 204981, jiffies 4301155872 (age 99771.964s)
hex dump (first 32 bytes):
e0 05 e7 b4 ab 7d 00 00 0b 00 01 00 00 00 00 00 .....}..........
f4 00 01 10 00 a0 ff ff 00 00 00 00 65 00 10 00 ............e...
backtrace:
[<000000005db27c8b>] kmem_cache_alloc_trace+0x348/0x5f0
[<000000007df90faa>] graph_trace_open+0xb0/0x344
[<00000000737524cd>] __tracing_open+0x450/0xb10
[<0000000098043327>] tracing_open+0x1a0/0x2a0
[<00000000291c3876>] do_dentry_open+0x3c0/0xdc0
[<000000004015bcd6>] vfs_open+0x98/0xd0
[<000000002b5f60c9>] do_open+0x520/0x8d0
[<00000000376c7820>] path_openat+0x1c0/0x3e0
[<00000000336a54b5>] do_filp_open+0x14c/0x324
[<000000002802df13>] do_sys_openat2+0x2c4/0x530
[<0000000094eea458>] __arm64_sys_openat+0x130/0x1c4
[<00000000a71d7881>] el0_svc_common.constprop.0+0xfc/0x394
[<00000000313647bf>] do_el0_svc+0xac/0xec
[<000000002ef1c651>] el0_svc+0x20/0x30
[<000000002fd4692a>] el0_sync_handler+0xb0/0xb4
[<000000000c309c35>] el0_sync+0x160/0x180
The root cause is described as follows:
__tracing_open() { // 1. File 'trace' is being opened;
...
*iter->trace = *tr->current_trace; // 2. Tracer 'function_graph' is
// currently set;
...
iter->trace->open(iter); // 3. Call graph_trace_open() here,
// and memory are allocated in it;
...
}
s_start() { // 4. The opened file is being read;
...
*iter->trace = *tr->current_trace; // 5. If tracer is switched to
// 'nop' or others, then memory
// in step 3 are leaked!!!
...
}
To fix it, in s_start(), close the tracer before switching, then reopen the
new tracer after switching. Also, some tracers like 'wakeup' may not update
'iter->private' in some cases when reopened, so it should be cleared
to avoid being mistakenly closed again.
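A hedged sketch of the switch-over in s_start() (field and callback names
follow kernel/trace/trace.c; the 'wakeup' iter->private clearing mentioned
above is omitted):
  if (unlikely(tr->current_trace &&
               iter->trace->name != tr->current_trace->name)) {
          /* Close iter->trace before switching to the new current tracer */
          if (iter->trace->close)
                  iter->trace->close(iter);
          *iter->trace = *tr->current_trace;
          /* Reopen the new current tracer */
          if (iter->trace->open)
                  iter->trace->open(iter);
  }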
Link: https://lore.kernel.org/linux-trace-kernel/20230817125539.1646321-1-zhengyejian1@huawei.com
Fixes: d7350c3f4569 ("tracing/core: make the read callbacks reentrants")
Change-Id: I1265ac4538ca529fb02b4cce76a4aa835f90b128
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Ulrich Hecht <uli@kernel.org>
[ Upstream commit 7acf3a127bb7c65ff39099afd78960e77b2ca5de ]
Booting the kernel with 'trace_buf_size=1' gives a warning at
boot during the ftrace selftests:
[ 0.892809] Running postponed tracer tests:
[ 0.892893] Testing tracer function:
[ 0.901899] Callback from call_rcu_tasks_trace() invoked.
[ 0.983829] Callback from call_rcu_tasks_rude() invoked.
[ 1.072003] .. bad ring buffer .. corrupted trace buffer ..
[ 1.091944] Callback from call_rcu_tasks() invoked.
[ 1.097695] PASSED
[ 1.097701] Testing dynamic ftrace: .. filter failed count=0 ..FAILED!
[ 1.353474] ------------[ cut here ]------------
[ 1.353478] WARNING: CPU: 0 PID: 1 at kernel/trace/trace.c:1951 run_tracer_selftest+0x13c/0x1b0
Therefore enforce a minimum of 4096 bytes to make the selftest pass.
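A hedged sketch of the clamp described above; the parameter-parsing details
are illustrative rather than the exact upstream diff:
  #include <linux/init.h>
  #include <linux/kernel.h>

  static unsigned long trace_buf_size;

  static int __init set_buf_size(char *str)
  {
          unsigned long buf_size;

          if (kstrtoul(str, 0, &buf_size))
                  return 0;

          /* Enforce the 4096-byte floor so the selftests get a usable buffer. */
          trace_buf_size = max(4096UL, buf_size);
          return 1;
  }
  __setup("trace_buf_size=", set_buf_size);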
Link: https://lkml.kernel.org/r/20220214134456.1751749-1-svens@linux.ibm.com
Change-Id: I796709bd9ccaefb463da0aca49f53a475b5f57a0
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 3203ce39ac0b2a57a84382ec184c7d4a0bede175 ]
The kernel parameter "tp_printk_stop_on_boot" starts with "tp_printk" which is
the same as another kernel parameter "tp_printk". If "tp_printk" setup is
called before the "tp_printk_stop_on_boot", it will override the latter
and keep it from being set.
This is similar to other kernel parameter issues, such as:
Commit 745a600cf1a6 ("um: console: Ignore console= option")
or init/do_mounts.c:45 (setup function of "ro" kernel param)
Fix it by checking for a "_" right after the "tp_printk" and, if one
exists, not processing the parameter.
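A minimal sketch of the check described above, assuming a __setup("tp_printk", ...)
handler that is handed the remainder of the matched parameter string:
  #include <linux/init.h>
  #include <linux/string.h>

  static int tracepoint_printk;

  static int __init set_tracepoint_printk(char *str)
  {
          /* "tp_printk_stop_on_boot" also matches the "tp_printk" prefix;
           * ignore it here so its own setup handler still runs.
           */
          if (*str == '_')
                  return 0;

          if (strcmp(str, "=1") == 0 || strcmp(str, "") == 0)
                  tracepoint_printk = 1;
          return 1;
  }
  __setup("tp_printk", set_tracepoint_printk);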
Link: https://lkml.kernel.org/r/20220208195421.969326-1-jsyoo5b@gmail.com
Change-Id: I997201f383889388231894b1881ead904e322b14
Signed-off-by: JaeSang Yoo <jsyoo5b@gmail.com>
[ Fixed up change log and added space after if condition ]
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit f596da3efaf4130ff61cd029558845808df9bf99 ]
When the blk_classic option is enabled, non-blktrace events must be
filtered out. Otherwise, events of other types are output in the blktrace
classic format, which is unexpected.
The problem can be triggered in the following ways:
# echo 1 > /sys/kernel/debug/tracing/options/blk_classic
# echo 1 > /sys/kernel/debug/tracing/events/enable
# echo blk > /sys/kernel/debug/tracing/current_tracer
# cat /sys/kernel/debug/tracing/trace_pipe
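A hedged sketch of the filtering described above, using the symbol names from
kernel/trace/blktrace.c (not necessarily the exact upstream hunk):
  static enum print_line_t blk_tracer_print_line(struct trace_iterator *iter)
  {
          /* Only real blktrace events may be rendered in the classic format;
           * anything else falls back to the default output path.
           */
          if (iter->ent->type != TRACE_BLK ||
              !(blk_tracer_flags.val & TRACE_BLK_OPT_CLASSIC))
                  return TRACE_TYPE_UNHANDLED;

          return print_one_line(iter, true);
  }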
Fixes: c71a89615411 ("blktrace: add ftrace plugin")
Change-Id: Ia8eda6f8123933ae2b8219fc781d417be8711647
Signed-off-by: Yang Jihong <yangjihong1@huawei.com>
Link: https://lore.kernel.org/r/20221122040410.85113-1-yangjihong1@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Ulrich Hecht <uli+cip@fpond.eu>
Similar to `dec_cfs_rq_hmp_stats` vs `walt_dec_cfs_cumulative_runnable_avg`,
we need to call `walt_dec_cumulative_runnable_avg` where `dec_rq_hmp_stats`
is called.
This corresponds to the `walt_inc_cfs_cumulative_runnable_avg` call in `enqueue_task_fair`.
Based on 4e29a6c5f98f9694d5ad01a4e7899aad157f8d49 ("sched: Add missing WALT code")
Fixes: c0fa7577022c4169e1aaaf1bd9e04f63d285beb2 ("sched/walt: Re-add code to allow WALT to function")
Change-Id: If2b291e1e509ba300d7f4b698afe73a72b273604
This is trivial to do:
- add flags argument to simple_rename()
- check that flags contains nothing other than RENAME_NOREPLACE
- assign simple_rename() to .rename2 instead of .rename
Filesystems converted:
hugetlbfs, ramfs, bpf.
Debugfs uses simple_rename() to implement debugfs_rename(), which is for
debugfs instances to rename files internally, not for userspace filesystem
access. For this case pass zero flags to simple_rename().
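A hedged sketch of the shape of this conversion; the inode_operations instance
below is hypothetical and the helper body is elided:
  #include <linux/fs.h>

  int simple_rename(struct inode *old_dir, struct dentry *old_dentry,
                    struct inode *new_dir, struct dentry *new_dentry,
                    unsigned int flags)
  {
          /* The generic helper only knows how to handle RENAME_NOREPLACE. */
          if (flags & ~RENAME_NOREPLACE)
                  return -EINVAL;

          /* ... unchanged simple_rename() body ... */
          return 0;
  }

  /* Callers move the helper from .rename to .rename2 (hypothetical example): */
  static const struct inode_operations example_dir_inode_operations = {
          .rename2        = simple_rename,
  };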
Change-Id: I1a46ece3b40b05c9f18fd13b98062d2a959b76a0
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Alexei Starovoitov <ast@kernel.org>
If the next_freq field of struct sugov_policy is set to UINT_MAX,
it shouldn't be used for updating the CPU frequency (this is a
special "invalid" value), but after commit b7eaf1aab9f8 (cpufreq:
schedutil: Avoid reducing frequency of busy CPUs prematurely) it
may be passed as the new frequency to sugov_update_commit() in
sugov_update_single().
Fix that by adding an extra check for the special UINT_MAX value
of next_freq to sugov_update_single().
Fixes: b7eaf1aab9f8 (cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely)
Reported-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: 4.12+ <stable@vger.kernel.org> # 4.12+
Change-Id: Idf4ebe9e912f983598255167d2065e47562ab83d
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit 97739501f207efe33145b918817f305b822987f8)
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
The schedutil driver sets sg_policy->next_freq to UINT_MAX on certain
occasions to discard the cached value of next freq:
- In sugov_start(), when the schedutil governor is started for a group
of CPUs.
- And whenever we need to force a freq update before rate-limit
duration, which happens when:
- there is an update in cpufreq policy limits.
- Or when the utilization of DL scheduling class increases.
In return, get_next_freq() doesn't return a cached next_freq value but
recalculates the next frequency instead.
But having a special meaning for a particular value of frequency makes the
code less readable and error prone. We recently fixed a bug where the
UINT_MAX value was considered as a valid frequency in
sugov_update_single().
All we need is a flag which can be used to discard the value of
sg_policy->next_freq, and we already have need_freq_update for that. Let's
reuse it instead of setting next_freq to UINT_MAX.
Change-Id: Ia37ef416d5ecac11fe0c6a2be7e21fdbca708a1a
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Yaroslav Furman <yaro330@gmail.com> - backported to 4.4
Currently there is a chance of a schedutil cpufreq update request being
dropped if there is a pending update request. This pending request can
be delayed if there is a scheduling delay of the irq_work and the wake
up of the schedutil governor kthread.
A very bad scenario is when a schedutil request has just been made,
such as one to reduce the CPU frequency; then a newer request to increase
the CPU frequency (even an urgent sched deadline frequency increase request)
can be dropped, even though the rate limits suggest that it's OK to
process a request. This is because of the way the work_in_progress flag
is used.
This patch improves the situation by allowing new requests to happen
even though the old one is still being processed. Note that in this
approach, if an irq_work was already issued, we just update next_freq
and don't bother to queue another request so there's no extra work being
done to make this happen.
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Change-Id: I7b6e19971b2ce3bd8e8336a5a4bc1acb920493b5
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Yaroslav Furman <yaro330@gmail.com> - backport to 4.4
A more energy efficient update of the IO wait boosting mechanism has
been introduced in:
commit a5a0809 ("cpufreq: schedutil: Make iowait boost more energy
efficient")
where the boost value is expected to be:
- doubled at each successive wakeup from IO
starting from the minimum frequency supported by a CPU
- reset when a CPU is not updated for more than one tick
by either disabling the IO wait boost or resetting its value to the
minimum frequency if this new update requires an IO boost.
This approach is supposed to "ignore" boosting for sporadic wakeups from
IO, while still getting the frequency boosted to the maximum to benefit
long sequences of wakeups from IO operations.
However, these assumptions are not always satisfied.
For example, when an IO boosted CPU enters idle for more than one tick
and then wakes up after an IO wait, since in sugov_set_iowait_boost() we
first check the IOWAIT flag, we keep doubling the iowait boost instead
of restarting from the minimum frequency value.
This misbehavior could happen mainly on non-shared frequency domains,
thus defeating the energy efficiency optimization, but it can also
happen on shared frequency domain systems.
Let's fix this issue in sugov_set_iowait_boost() by:
- first checking the IO wait boost reset conditions
to reset the boost value if necessary
- then applying the correct IO boost value
if required by the caller
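A simplified, hedged sketch of the re-ordered helper; field names
(iowait_boost, iowait_boost_max, last_update) follow the mainline schedutil
code and are assumptions for this backport, and details such as the pending
flag are omitted:
  static void sugov_set_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
                                     unsigned int flags)
  {
          /* 1) Check the reset condition first: a CPU that has not been
           *    updated for more than one tick has its boost cleared.
           */
          if (sg_cpu->iowait_boost &&
              time - sg_cpu->last_update > TICK_NSEC)
                  sg_cpu->iowait_boost = 0;

          /* 2) Only then apply the boost requested by the caller. */
          if (flags & SCHED_CPUFREQ_IOWAIT) {
                  if (sg_cpu->iowait_boost)
                          sg_cpu->iowait_boost <<= 1;   /* successive IO wakeups */
                  else
                          sg_cpu->iowait_boost = sg_cpu->sg_policy->policy->min;

                  sg_cpu->iowait_boost = min(sg_cpu->iowait_boost,
                                             sg_cpu->iowait_boost_max);
          }
  }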
Fixes: a5a0809 (cpufreq: schedutil: Make iowait boost more energy
efficient)
Reported-by: Viresh Kumar <viresh.kumar@linaro.org>
Change-Id: I196b5c464cd43164807c12b2dbc2e5d814bf1d33
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Yaroslav Furman <yaro330@gmail.com> - backport to 4.4
When lower capacity CPUs are load balancing and considering to pull
something from a higher capacity group, we should not pull tasks from a
cpu with only one task running as this is guaranteed to impede progress
for that task. If there is more than one task running, load balance in
the higher capacity group would have already made any possible moves to
resolve imbalance and we should make better use of system compute
capacity by moving a task if we still have more than one running.
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
[from https://lore.kernel.org/lkml/1530699470-29808-11-git-send-email-morten.rasmussen@arm.com/]
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Change-Id: Ib86570abdd453a51be885b086c8d80be2773a6f2
Signed-off-by: Adam W. Willis <return.of.octobot@gmail.com>
Change-Id: Icd8f2cde6cac1b7bb07e54b8e6989c65eea4e4a5
joshuous: Adapted to work with CAF's "softirq: defer softirq processing
to ksoftirqd if CPU is busy with RT" commit.
ajaivasudeve: adapted for the commit "softirq: Don't defer all softirq during RT task"
We're finding audio glitches caused by audio-producing RT tasks
that are either interrupted to handle softirqs or that are
scheduled onto CPUs that are handling softirqs.
In a previous patch, we attempted to catch many cases of the
latter problem, but it's clear that we are still losing
significant numbers of races in some apps.
This patch attempts to address both problems:
1. It prohibits handling softirqs when interrupting
an RT task, by delaying the softirq to the ksoftirqd
thread.
2. It attempts to reduce the most common windows in which
we lose the race between scheduling an RT task on a remote
core and starting to handle softirqs on that core.
We still lose some races, but we lose significantly fewer.
(And we don't want to introduce any heavyweight forms
of synchronization on these paths.)
Bug: 64912585
Change-Id: Ida89a903be0f1965552dd0e84e67ef1d3158c7d8
Signed-off-by: John Dias <joaodias@google.com>
Signed-off-by: joshuous <joshuous@gmail.com>
Signed-off-by: ajaivasudeve <ajaivasudeve@gmail.com>
For effective interplay between RT and fair tasks. Enables sched_fifo
for UI and Render tasks. Critical for improving user experience.
bug: 24503801
bug: 30377696
Change-Id: I2a210c567c3f5c7edbdd7674244822f848e6d0cf
Signed-off-by: Srinath Sridharan <srinathsr@google.com>
(cherry picked from commit dfe0f16b6fd3a694173c5c62cf825643eef184a3)
With a shared policy in place, when one of the CPUs in the policy is
hotplugged out and then brought back online, sugov_stop and
sugov_start are called in order.
sugov_stop removes utilization hooks for each CPU in the policy and
does nothing else in the for_each_cpu loop. sugov_start on the other
hand iterates through the CPUs in the policy and re-initializes the
per-cpu structure _and_ adds the utilization hook. This implies that
the scheduler is allowed to invoke a CPU's utilization update hook
when the rest of the per-cpu structures have yet to be re-inited.
Apart from some strange values in tracepoints this doesn't cause a
problem, but if we do end up accessing a pointer from the per-cpu
sugov_cpu structure somewhere in the sugov_update_shared path,
we will likely see crashes since the memset for another CPU in the
policy is free to race with sugov_update_shared from the CPU that is
ready to go. So let's fix this now to first init all per-cpu
structures, and then add the per-cpu utilization update hooks all at
once.
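A hedged sketch of the two-pass start path described above (shaped after
mainline sugov_start(); the wrapper name is made up, while the hook and
helper names are the real schedutil/cpufreq ones):
  static void example_sugov_start_two_pass(struct cpufreq_policy *policy,
                                           struct sugov_policy *sg_policy)
  {
          unsigned int cpu;

          /* Pass 1: fully (re)initialize every per-CPU structure first. */
          for_each_cpu(cpu, policy->cpus) {
                  struct sugov_cpu *sg_cpu = &per_cpu(sugov_cpu, cpu);

                  memset(sg_cpu, 0, sizeof(*sg_cpu));
                  sg_cpu->sg_policy = sg_policy;
          }

          /* Pass 2: only now let the scheduler invoke the update hooks. */
          for_each_cpu(cpu, policy->cpus) {
                  struct sugov_cpu *sg_cpu = &per_cpu(sugov_cpu, cpu);

                  cpufreq_add_update_util_hook(cpu, &sg_cpu->update_util,
                                               policy_is_shared(policy) ?
                                                          sugov_update_shared :
                                                          sugov_update_single);
          }
  }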
Change-Id: I399e0e159b3db3ae3258843c9231f92312fe18ef
Signed-off-by: Vikram Mulukutla <markivx@codeaurora.org>
Currently when all the related CPUs from a policy go offline or the
governor is switched, cpufreq framework calls sugov_exit() that
frees the governor tunables. When any of the related CPUs comes back
online or governor is switched back to schedutil sugov_init() gets
called which allocates a fresh set of tunables that are set to
default values. This can cause the userspace settings to those
tunables to be lost across governor switches or when an entire
cluster is hotplugged out.
To prevent this, save the tunable values on governor exit. Restore
these values to the newly allocated tunables on governor init.
Change-Id: I671d4d0e1a4e63e948bfddb0005367df33c0c249
Signed-off-by: Rohit Gupta <rohgup@codeaurora.org>
[Caching and restoring different tunables.]
Signed-off-by: joshuous <joshuous@gmail.com>
Change-Id: I852ae2d23f10c9337e7057a47adcc46fe0623c6a
Signed-off-by: joshuous <joshuous@gmail.com>
Return immediately upon memory allocation failure.
Change-Id: I30947d55f0f4abd55c51e42912a0762df57cbc1d
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
* Render - added back missing header
When tasks come and go from a runqueue quickly, this can lead to boost
being applied and removed quickly which sometimes means we cannot raise
the CPU frequency again when we need to (due to the rate limit on
frequency updates). This has proved to be a particular issue for RT tasks
and alternative methods have been used in the past to work around it.
This is an attempt to solve the issue for all task classes and cpufreq
governors by introducing a generic mechanism in schedtune to retain
the max boost level from task enqueue for a minimum period - defined
here as 50ms. This timeout was determined experimentally and is not
configurable.
A sched_feat guards the application of this to tasks - in the default
configuration, boost holding is only applied to tasks which have the RT
policy. Change SCHEDTUNE_BOOST_HOLD_ALL to true to apply it to all
tasks regardless of class.
It works like so:
Every task enqueue (in an allowed class) stores a cpu-local timestamp.
If the task is not a member of an allowed class (all or RT depending
upon feature selection), the timestamp is not updated.
The boost group will stay active regardless of tasks present until
50ms beyond the last timestamp stored. We also store the timestamp
of the active boost group to avoid unnecessarily revisiting the boost
groups when checking the CPU boost level.
If the timestamp is more than 50ms in the past when we check boost then
we re-evaluate the boost groups for that CPU, taking into account the
timestamps associated with each group.
Idea based on rt-boost-retention patches from Joel.
Change-Id: I52cc2d2e82d1c5aa03550378c8836764f41630c1
Suggested-by: Joel Fernandes <joelaf@google.com>
Reviewed-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: RenderBroken <zkennedy87@gmail.com>
The current implementation of the energy-aware wake-up path relies on
find_best_target() to select an ordered list of CPU candidates for task
placement. The first candidate of the list saving energy is then chosen,
disregarding all the others to avoid the overhead of an expensive
energy_diff.
With the recent refactoring of select_energy_cpu_idx(), the cost of
exploring multiple CPUs has been reduced, hence offering the opportunity
to select the most energy-efficient candidate at a lower cost. This commit
seizes this opportunity by allowing select_energy_cpu_idx()'s behaviour
to be changed so that it ignores the order of CPUs returned by find_best_target()
and picks the best candidate energy-wise.
As this functionality is still considered experimental, it is hidden
behind a sched_feature named FBT_STRICT_ORDER (like the equivalent
feature in Android 4.14) which defaults to true, hence keeping the
current behaviour by default.
Change-Id: I0cb833bfec1a4a053eddaff1652c0b6cad554f97
Suggested-by: Patrick Bellasi <patrick.bellasi@arm.com>
Suggested-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
We are using an incorrect index while initializing the nrg_delta for
the previous CPU in select_energy_cpu_idx(). This initialization
itself is not needed, as the nrg_delta for the previous CPU is already
initialized to 0 while preparing the energy_env struct.
Change-Id: Iee4e2c62f904050d2680a0a1df646d4d515c62cc
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
Most walt variables in hot code paths are __read_mostly and grouped
together in the .data section, including these variables that show
up as frequently accessed in gem5 simulation: walt_ravg_window,
walt_disabled, walt_account_wait_time, and
walt_freq_account_wait_time.
The exception is walt_ktime_suspended, which is also accessed in many
of the same hot paths. It is also almost entirely accessed by reads.
Move it to __read_mostly in hopes of keeping it in the same cache line
as the other hot data.
Change-Id: I8c9e4ee84e5a0328b943752ee9ed47d4e006e7de
Signed-off-by: Todd Poynor <toddpoynor@google.com>
When the normal utilization value for all the CPUs in a policy is 0, the
current aggregation scheme of using a > check will result in the aggregated
utilization being 0 and the max value being 1.
This is not a problem for upstream code. However, since we also use other
statistics provided by WALT to update the aggregated utilization value
across all CPUs in a policy, we can end up with a non-zero aggregated
utilization while the max remains as 1.
Then when get_next_freq() tries to compute the frequency using:
max-frequency * 1.25 * (util / max)
it ends up with a frequency that is greater than max-frequency. So the
policy frequency jumps to max-frequency.
By changing the aggregation check to >=, we make sure that the max value is
updated with something reported by the scheduler for a CPU in that policy.
With the correct max, we can continue using the WALT specific statistics
without spurious jumps to max-frequency.
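A hedged sketch of the aggregation loop with the >= check, shaped after the
mainline sugov_next_freq_shared(); the exact code in this tree may differ:
  unsigned long util = 0, max = 1;
  unsigned int j;

  for_each_cpu(j, policy->cpus) {
          struct sugov_cpu *j_sg_cpu = &per_cpu(sugov_cpu, j);
          unsigned long j_util = j_sg_cpu->util;
          unsigned long j_max = j_sg_cpu->max;

          /* ">=" (previously ">"): even when every j_util is 0, max is
           * taken from a CPU in this policy rather than staying at the
           * initial placeholder of 1, so util/max cannot exceed 1.
           */
          if (j_util * max >= j_max * util) {
                  util = j_util;
                  max = j_max;
          }
  }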
Change-Id: I14996cd796191192ea112f949dc42450782105f7
Signed-off-by: Saravana Kannan <skannan@codeaurora.org>
Change-Id: Ibbcb281c2f048e2af0ded0b1cbbbedcc49b29e45
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
schedtune sets 0 as the floor for calculating CPU max boost, so
negative schedtune.boost values do not affect CPU frequency
decisions. Remove this restriction to allow userspace to apply
negative boosts when appropriate.
Also change treatment of the root boost group to match the other
groups, so it only affects the CPU boost if it has runnable tasks on
the CPU.
Test: set all boosts negative; sched_boost_cpu trace events show
negative CPU margins.
Change-Id: I89f3470299aef96a18797c105f02ebc8f367b5e1
Signed-off-by: Connor O'Brien <connoro@google.com>
In commit ac087abe1358 ("Merge android-msm-8998-4.4-common into
android-msm-muskie-4.4"),
.allow_attach = subsys_cgroup_allow_attach,
was dropped in the merge.
This patch brings back allow_attach,
but with the marlin-3.18 behavior of allowing all cgroup changes
rather than the subsys_cgroup_allow_attach behavior of
requiring CAP_SYS_NICE.
Bug: 36592053
Change-Id: Iaa51597b49a955fd5709ca504e968ea19a9ca8f5
Signed-off-by: Andres Oportus <andresoportus@google.com>
Signed-off-by: Andrew Chant <achant@google.com>
When we are calculating what the impact of placing a task on a specific
cpu is, we should include the information that there might be a minimum
capacity imposed upon that cpu which could change the performance and/or
energy cost decisions.
When choosing an active target CPU, favour CPUs that won't end up
running at a high OPP due to a min capacity cap imposed by external
actors.
Change-Id: Ibc3302304345b63107f172b1fc3ffdabc19aa9d4
Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
When we are calculating what the impact of placing a task on a specific
cpu is, we should include the information that there might be a minimum
capacity imposed upon that cpu which could change the performance and/or
energy cost decisions.
When choosing an idle backup CPU, favour CPUs that won't end up
running at a high OPP due to a min capacity cap imposed by external
actors.
Change-Id: I566623ffb3a7c5b61a23242dcce1cb4147ef8a4a
Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Add the ability to track the minimum capacity forced onto a sched_group
by some external actor.
group_max_util returns the highest utilisation inside a sched_group
and is used when we are trying to calculate an energy cost estimate
for a specific scheduling scenario. Minimum capacities imposed from
elsewhere will influence this energy cost so we should reflect it
here.
Change-Id: Ibd537a6dbe6d67b11cc9e9be18f40fcb2c0f13de
Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
We have the ability to track minimum capacity forced onto a CPU by
userspace or external actors. This is provided through a minimum
frequency scale factor exposed by arch_scale_min_freq_capacity.
The use of this information is enabled through the MIN_CAPACITY_CAPPING
feature. If not enabled, the minimum frequency scale factor will
remain 0 and it will not impact energy estimation or scheduling
decisions.
Change-Id: Ibc61f2bf4fddf186695b72b262e602a6e8bfde37
Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
If the minimum capacity of a group is capped by userspace or internal
dependencies which are not otherwise visible to the scheduler, we need
a way to see these and integrate this information into the energy
calculations and task placement decisions we make.
Add arch_scale_min_freq_capacity to determine the lowest capacity which
a specific cpu can provide under the current set of known constraints.
Change-Id: Ied4a1dc0982bbf42cb5ea2f27201d4363db59705
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
The max frequency scaling factor is defined as:
max_freq_scale = policy_max_freq / cpuinfo_max_freq
To be able to scale the cpu capacity by this factor introduce a call to
the new arch scaling function arch_scale_max_freq_capacity() in
update_cpu_capacity() and provide a default implementation which returns
SCHED_CAPACITY_SCALE.
Another subsystem (e.g. cpufreq) can override this default implementation,
exactly as for frequency and cpu invariance. It has to be enabled by the
arch by defining arch_scale_max_freq_capacity to the actual
implementation.
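A hedged sketch of the default hook and of the scale factor described above;
the signature is an assumption mirroring arch_scale_freq_capacity():
  #ifndef arch_scale_max_freq_capacity
  static inline unsigned long
  arch_scale_max_freq_capacity(struct sched_domain *sd, int cpu)
  {
          /* Default: no policy cap, i.e. max_freq_scale == 1.0 */
          return SCHED_CAPACITY_SCALE;
  }
  #endif

  /* A cpufreq-side override would maintain, per CPU, something like:
   *   max_freq_scale = (policy->max << SCHED_CAPACITY_SHIFT)
   *                    / policy->cpuinfo.max_freq;
   */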
Change-Id: I266cd1f4c1c82f54b80063c36aa5f7662599dd28
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
The SG's energy is obtained by adding busy and idle contributions which
are computed by considering a proper fraction of the
SCHED_CAPACITY_SCALE defined by the SG's utilizations.
By scaling each and every contribution computed we risk accumulating
rounding errors which can result in a non-null energy_delta even in
cases when the same total accumulated utilization is differently
distributed among different CPUs.
To reduce rounding errors, this patch accumulates non-scaled busy/idle
energy contributions for each visited SG, and scales each of them just
once at the end.
Change-Id: Idf8367fee0ac11938c6436096f0c1b2d630210d2
Suggested-by: Joonwoo Park <joonwoop@codeaurora.org>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
The energy_env data structure is used to cache values required by multiple
different functions involved in energy_diff computation. Some of these
functions require additional parameters which can be easily embedded
into the energy_env itself. The current implementation of energy_diff
hardcodes the usage of two different energy_env structures to estimate and
compare the energy consumption related to a "before" and an "after" CPU.
Moreover, it does this energy estimation by walking multiple times the
SDs/SGs data structures.
A better design can be envisioned by better using the energy_env structure
to support a more efficient and concurrent evaluation of multiple schedule
candidates. To this purpose, this patch provides a complete re-factoring
of the energy_diff implementation to:
1. use a single energy_env structure for the evaluation of all the
candidate CPUs
2. walk just one time the SDs/SGs, thus improving the overall performance
to compute the energy estimation for each CPU candidate specified by
the single used energy_env
3. simplify the code (at least if you look at the new version and not at
this re-factoring patch), thus providing cleaner code to maintain
and extend for additional features
This patch updates all the clients of energy_env to use only the data
provided by this structure and an index for one of its CPUs candidates.
Embedding everything within the energy env will make it simple to add
tracepoints for this new version, which can easily provide a holistic
view of how energy_diff evaluated the proposed CPU candidates.
The new proposed structure, for both "struct energy_env" and the functions
using it, is designed in such a way to easily accommodate additional
further extensions (e.g. SchedTune filtering) without requiring an
additional big re-factoring of these core functions.
Finally, the usage of a CPU candidates array, embedded into the
energy_env structure, also allows seamlessly extending the exploration of
multiple candidate CPUs, for example to support the comparison of a
spread-vs-packing strategy.
Change-Id: Ic04ffb6848b2c763cf1788767f22c6872eb12bee
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
[reworked find_new_capacity() and enforced the respect of
find_best_target() selection order]
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
[@0ctobot: Adapted for kernel.lnx.4.4.r35-rel]
Signed-off-by: Adam W. Willis <return.of.octobot@gmail.com>
The current definition of select_energy_cpu_brute is a bit confusing in
the definition of the value for the target_cpu to be returned as wakeup
CPU for the specified task.
This cleans up the code by ensuring that we always set target_cpu right
before returning it. The rcu_read_lock and the check on *sd != NULL are also
moved around to be exactly where they are required.
Change-Id: I70a4b558b3624a13395da1a87ddc0776fd1d6641
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
In preparation for the energy_diff refactoring, let's remove all the
SchedTune specific bits which are used to keep track of the capacity
variations required by the PESpace filtering.
This also removes the energy_normalization function and the wrapper of
energy_diff which is used to trigger the PESpace filtering via
schedtune_accept_deltas().
The remaining code is the "original" energy_diff function which
looks just at the energy variations to compare prev_cpu vs next_cpu.
Change-Id: I4fb1d1c5ba45a364e6db9ab8044969349aba0307
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
The format of the energy_diff tracepoint is going to be changed by the
following energy_diff refactoring patches. Let's remove it now to start from
a clean slate.
Change-Id: Id4f537ed60d90a7ddcca0a29a49944bfacb85c8c
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>