android_kernel_zuk_msm8996.git

	Commit message	Author	Age
*	Merge remote-tracking branch 'msm8998/lineage-20' into lineage-20	Raghuram Subramani	2024-10-17
\|	Change-Id: I126075a330f305c85f8fe1b8c9d408f368be95d1
*	Merge lineage-20 of git@github.com:LineageOS/android_kernel_qcom_msm8998.git into lineage-20	Davide Garberi	2023-08-06
\| \|	7d11b1a7a11c Revert "sched: cpufreq: Use sched_clock instead of rq_clock when updating schedutil"
\| \|	daaa5da96a74 sched: Take irq_sparse lock during the isolation
\| \|	217ab2d0ef91 rcu: Speed up calling of RCU tasks callbacks
\| \|	997b726bc092 kernel: power: Workaround for sensor ipc message causing high power consume
\| \|	b933e4d37bc0 sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
\| \|	82d3f23d6dc5 sched/fair: Fix bandwidth timer clock drift condition
\| \|	629bfed360f9 kernel: power: qos: remove check for core isolation while cluster LPMs
\| \|	891a63210e1d sched/fair: Fix issue where frequency update not skipped
\| \|	b775cb29f663 ANDROID: Move schedtune en/dequeue before schedutil update triggers
\| \|	ebdb82f7b34a sched/fair: Skip frequency updates if CPU about to idle
\| \|	ff383d94478a FROMLIST: sched: Make iowait_boost optional in schedutil
\| \|	9539942cb065 FROMLIST: cpufreq: Make iowait boost a policy option
\| \|	b65c91c9aa14 ARM: dts: msm: add HW CPU's busy-cost-data for additional freqs
\| \|	72f13941085b ARM: dts: msm: fix CPU's idle-cost-data
\| \|	ab88411382f7 ARM: dts: msm: fix EM to be monotonically increasing
\| \|	83dcbae14782 ARM: dts: msm: Fix EAS idle-cost-data property length
\| \|	33d3b17bfdfb ARM: dts: msm: Add msm8998 energy model
\| \|	c0fa7577022c sched/walt: Re-add code to allow WALT to function
\| \|	d5cd35f38616 FROMGIT: binder: use EINTR for interrupted wait for work
\| \|	db74739c86de sched: Don't fail isolation request for an already isolated CPU
\| \|	aee7a16e347b sched: WALT: increase WALT minimum window size to 20ms
\| \|	4dbe44554792 sched: cpufreq: Use per_cpu_ptr instead of this_cpu_ptr when reporting load
\| \|	ef3fb04c7df4 sched: cpufreq: Use sched_clock instead of rq_clock when updating schedutil
\| \|	c7128748614a sched/cpupri: Exclude isolated CPUs from the lowest_mask
\| \|	6adb092856e8 sched: cpufreq: Limit governor updates to WALT changes alone
\| \|	0fa652ee00f5 sched: walt: Correct WALT window size initialization
\| \|	41cbb7bc59fb sched: walt: fix window misalignment when HZ=300
\| \|	43cbf9d6153d sched/tune: Increase the cgroup limit to 6
\| \|	c71b8fffe6b3 drivers: cpuidle: lpm-levels: Fix KW issues with idle state idx < 0
\| \|	938e42ca699f drivers: cpuidle: lpm-levels: Correctly check for list empty
\| \|	8d8a48aecde5 sched/fair: Fix load_balance() affinity redo path
\| \|	eccc8acbe705 sched/fair: Avoid unnecessary active load balance
\| \|	0ffdb886996b BACKPORT: sched/core: Fix rules for running on online && !active CPUs
\| \|	c9999f04236e sched/core: Allow kthreads to fall back to online && !active cpus
\| \|	b9b6bc6ea3c0 sched: Allow migrating kthreads into online but inactive CPUs
\| \|	a9314f9d8ad4 sched/fair: Allow load bigger task load balance when nr_running is 2
\| \|	c0b317c27d44 pinctrl: qcom: Clear status bit on irq_unmask
\| \|	45df1516d04a UPSTREAM: mm: fix misplaced unlock_page in do_wp_page()
\| \|	899def5edcd4 UPSTREAM: mm/ksm: Remove reuse_ksm_page()
\| \|	46c6fbdd185a BACKPORT: mm: do_wp_page() simplification
\| \|	90dccbae4c04 UPSTREAM: mm: reuse only-pte-mapped KSM page in do_wp_page()
\| \|	ebf270d24640 sched/fair: vruntime should normalize when switching from fair
\| \|	cbe0b37059c9 mm: introduce arg_lock to protect arg_start\|end and env_start\|end in mm_struct
\| \|	12d40f1995b4 msm: mdss: Fix indentation
\| \|	620df03a7229 msm: mdss: Treat polling_en as the bool that it is
\| \|	12af218146a6 msm: mdss: add idle state node
\| \|	13e661759656 cpuset: Restore tasks affinity while moving across cpusets
\| \|	602bf4096dab genirq: Honour IRQ's affinity hint during migration
\| \|	9209b5556f6a power: qos: Use effective affinity mask
\| \|	f31078b5825f genirq: Introduce effective affinity mask
\| \|	58c453484f7e sched/cputime: Mitigate performance regression in times()/clock_gettime()
\| \|	400383059868 kernel: time: Add delay after cpu_relax() in tight loops
\| \|	1daa7ea39076 pinctrl: qcom: Update irq handle for GPIO pins
\| \|	07f7c9961c7c power: smb-lib: Fix mutex acquisition deadlock on PD hard reset
\| \|	094b738f46c8 power: qpnp-smb2: Implement battery charging_enabled node
\| \|	d6038d6da57f ASoC: msm-pcm-q6-v2: Add dsp buf check
\| \|	0d7a6c301af8 qcacld-3.0: Fix OOB in wma_scan_roam.c
\| \|	Change-Id: Ia2e189e37daad6e99bdb359d1204d9133a7916f4
\| *	Revert "sched: cpufreq: Use sched_clock instead of rq_clock when updating schedutil"	Georg Veichtlbauer	2023-07-26
\| \|	That commit should have changed rq_clock to sched_clock, instead of sched_ktime_clock, which kept schedutil from making correct decisions.
\| \|	This reverts commit ef3fb04c7df43dfa1793e33f764a2581cda96310.
\| \|	Change-Id: Id4118894388c33bf2b2d3d5ee27eb35e82dc4a96
\| *	sched: Take irq_sparse lock during the isolation	Prasad Sodagudi	2023-07-16
\| \|	irq_migrate_all_off_this_cpu() is used to migrate IRQs and this function checks for all active irq in the allocated_irqs mask. irq_migrate_all_off_this_cpu() expects the caller to take irq_sparse lock to avoid race conditions while accessing allocated_irqs mask variable. Prevent a race between irq alloc/free and irq migration by adding irq_sparse lock across CPU isolation.
\| \|	Change-Id: I9edece1ecea45297c8f6529952d88b3133046467
\| \|	Signed-off-by: Prasad Sodagudi <psodagud@codeaurora.org>
\| *	rcu: Speed up calling of RCU tasks callbacks	Steven Rostedt (VMware)	2023-07-16
\| \|	Joel Fernandes found that the synchronize_rcu_tasks() was taking a significant amount of time. He demonstrated it with the following test:
\| \|	  # cd /sys/kernel/tracing
\| \|	  # while [ 1 ]; do x=1; done &
\| \|	  # echo '__schedule_bug:traceon' > set_ftrace_filter
\| \|	  # time echo '!__schedule_bug:traceon' > set_ftrace_filter;
\| \|	  real 0m1.064s
\| \|	  user 0m0.000s
\| \|	  sys 0m0.004s
\| \|	Where it takes a little over a second to perform the synchronize, because there's a loop that waits 1 second at a time for tasks to get through their quiescent points when there's a task that must be waited for.
\| \|	After discussion we came up with a simple way to wait for holdouts but increase the time for each iteration of the loop but no more than a full second.
\| \|	With the new patch we have:
\| \|	  # time echo '!__schedule_bug:traceon' > set_ftrace_filter;
\| \|	  real 0m0.131s
\| \|	  user 0m0.000s
\| \|	  sys 0m0.004s
\| \|	Which drops it down to 13% of what the original wait time was.
\| \|	Link: http://lkml.kernel.org/r/20180523063815.198302-2-joel@joelfernandes.org
\| \|	Reported-by: Joel Fernandes (Google) <joel@joelfernandes.org>
\| \|	Suggested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
\| \|	Change-Id: I40bcecdfdb2a1cdae7195f1d3b107455ea4b26b1
\| \|	Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
\| \|	Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
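The wait strategy described in this commit (wait for holdouts with a delay that grows each iteration but never exceeds one second) can be sketched as follows. This is an illustrative Python model, not the kernel code: the names `holdout_delays` and `first_delays`, the `hz` value of 1000, and the starting fraction of 10 are all assumptions made for the example.

```python
from itertools import islice


def holdout_delays(hz=1000, start_fraction=10):
    """Yield successive wait times in jiffies.

    The first wait is hz/start_fraction; each later wait divides hz by a
    smaller number until the divisor reaches 1, so the wait is capped at
    one full second (hz jiffies).
    """
    fraction = start_fraction
    while True:
        yield hz // fraction
        if fraction > 1:
            fraction -= 1


def first_delays(count, hz=1000):
    """Return the first `count` wait times as a list."""
    return list(islice(holdout_delays(hz), count))


if __name__ == "__main__":
    print(first_delays(12))
```

A holdout that clears quickly is noticed after a short first wait instead of a full second, while long-running holdouts are still polled no more often than once per second.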
\| *	kernel: power: Workaround for sensor ipc message causing high power consume	Frank Luo	2023-07-16
\| \|	Sync from Qcom's document KBA-180725024109.
\| \|	To keep non-wakeup type sensor data from breaking the AP sleep flow, notify the sensor subsystem at the very start of pm_suspend.
\| \|	Bug: 118418963
\| \|	Test: measure power consumption after running test case
\| \|	Change-Id: I2848230d495e30ac462aef148b3f885103d9c24e
\| \|	Signed-off-by: Frank Luo <luofrank@google.com>
\| *	sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices	Dave Chiluk	2023-07-16
\| \|	commit de53fd7aedb100f03e5d2231cfce0e4993282425 upstream.
\| \|	It has been observed, that highly-threaded, non-cpu-bound applications running under cpu.cfs_quota_us constraints can hit a high percentage of periods throttled while simultaneously not consuming the allocated amount of quota. This use case is typical of user-interactive non-cpu bound applications, such as those running in kubernetes or mesos when run on multiple cpu cores.
\| \|	This has been root caused to cpu-local run queue being allocated per cpu bandwidth slices, and then not fully using that slice within the period. At which point the slice and quota expires. This expiration of unused slice results in applications not being able to utilize the quota for which they are allocated.
\| \|	The non-expiration of per-cpu slices was recently fixed by 'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")'. Prior to that it appears that this had been broken since at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That added the following conditional which resulted in slices never being expired.
\| \|	  if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
\| \|	  	/* extend local deadline, drift is bounded above by 2 ticks */
\| \|	  	cfs_rq->runtime_expires += TICK_NSEC;
\| \|	Because this was broken for nearly 5 years, and has recently been fixed and is now being noticed by many users running kubernetes (https://github.com/kubernetes/kubernetes/issues/67577) it is my opinion that the mechanisms around expiring runtime should be removed altogether.
\| \|	This allows quota already allocated to per-cpu run-queues to live longer than the period boundary. This allows threads on runqueues that do not use much CPU to continue to use their remaining slice over a longer period of time than cpu.cfs_period_us. However, this helps prevent the above condition of hitting throttling while also not fully utilizing your cpu quota.
\| \|	This theoretically allows a machine to use slightly more than its allotted quota in some periods. This overflow would be bounded by the remaining quota left on each per-cpu runqueue. This is typically no more than min_cfs_rq_runtime=1ms per cpu. For CPU bound tasks this will change nothing, as they should theoretically fully utilize all of their quota in each period. For user-interactive tasks as described above this provides a much better user/application experience as their cpu utilization will more closely match the amount they requested when they hit throttling. This means that cpu limits no longer strictly apply per period for non-cpu bound applications, but that they are still accurate over longer timeframes.
\| \|	This greatly improves performance of high-thread-count, non-cpu bound applications with low cfs_quota_us allocation on high-core-count machines. In the case of an artificial testcase (10ms/100ms of quota on 80 CPU machine), this commit resulted in almost 30x performance improvement, while still maintaining correct cpu quota restrictions. That testcase is available at https://github.com/indeedeng/fibtest.
\| \|	Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")
\| \|	Change-Id: I7d7a39fb554ec0c31f9381f492165f43c70b3924
\| \|	Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com>
\| \|	Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
\| \|	Reviewed-by: Phil Auld <pauld@redhat.com>
\| \|	Reviewed-by: Ben Segall <bsegall@google.com>
\| \|	Cc: Ingo Molnar <mingo@redhat.com>
\| \|	Cc: John Hammond <jhammond@indeed.com>
\| \|	Cc: Jonathan Corbet <corbet@lwn.net>
\| \|	Cc: Kyle Anderson <kwa@yelp.com>
\| \|	Cc: Gabriel Munos <gmunoz@netflix.com>
\| \|	Cc: Peter Oskolkov <posk@posk.io>
\| \|	Cc: Cong Wang <xiyou.wangcong@gmail.com>
\| \|	Cc: Brendan Gregg <bgregg@netflix.com>
\| \|	Link: https://lkml.kernel.org/r/1563900266-19734-2-git-send-email-chiluk+linux@indeed.com
\| \|	Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
\| *	sched/fair: Fix bandwidth timer clock drift condition	Xunlei Pang	2023-07-16
\| \|	commit 512ac999d2755d2b7109e996a76b6fb8b888631d upstream.
\| \|	I noticed that cgroup task groups constantly get throttled even if they have low CPU usage, this causes some jitters on the response time to some of our business containers when enabling CPU quotas.
\| \|	It's very simple to reproduce:
\| \|	  mkdir /sys/fs/cgroup/cpu/test
\| \|	  cd /sys/fs/cgroup/cpu/test
\| \|	  echo 100000 > cpu.cfs_quota_us
\| \|	  echo $$ > tasks
\| \|	then repeat:
\| \|	  cat cpu.stat \| grep nr_throttled # nr_throttled will increase steadily
\| \|	After some analysis, we found that cfs_rq::runtime_remaining will be cleared by expire_cfs_rq_runtime() due to two equal but stale "cfs_{b\|q}->runtime_expires" after period timer is re-armed.
\| \|	The current condition to judge clock drift in expire_cfs_rq_runtime() is wrong, the two runtime_expires are actually the same when clock drift happens, so this condition can never hit. The original design was correctly done by this commit:
\| \|	  a9cf55b28610 ("sched: Expire invalid runtime")
\| \|	... but was changed to be the current implementation due to its locking bug.
\| \|	This patch introduces another way, it adds a new field in both structures cfs_rq and cfs_bandwidth to record the expiration update sequence, and uses them to figure out if clock drift happens (true if they are equal).
\| \|	Change-Id: I8168fe3b45785643536f289ea823d1a62d9d8ab2
\| \|	Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
\| \|	Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
\| \|	[alakeshh: backport: Fixed merge conflicts:
\| \|	 - sched.h: Fix the indentation and order in which the variables are declared to match with coding style of the existing code in 4.14
\| \|	   Struct members of same type were declared in separate lines in upstream patch which has been changed back to having multiple members of same type in the same line.
\| \|	   e.g. int a; int b; -> int a, b; ]
\| \|	Signed-off-by: Alakesh Haloi <alakeshh@amazon.com>
\| \|	Reviewed-by: Ben Segall <bsegall@google.com>
\| \|	Cc: Linus Torvalds <torvalds@linux-foundation.org>
\| \|	Cc: Peter Zijlstra <peterz@infradead.org>
\| \|	Cc: Thomas Gleixner <tglx@linutronix.de>
\| \|	Cc: <stable@vger.kernel.org> # 4.14.x
\| \|	Fixes: 51f2176d74ac ("sched/fair: Fix unlocked reads of some cfs_b->quota/period")
\| \|	Link: http://lkml.kernel.org/r/20180620101834.24455-1-xlpang@linux.alibaba.com
\| \|	Signed-off-by: Ingo Molnar <mingo@kernel.org>
\| \|	Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
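The sequence-number check described in this commit can be modeled roughly as below. This is a hedged Python sketch based only on the commit text: the classes `Bandwidth` and `RunQueue` stand in for cfs_bandwidth and cfs_rq, the field name `expires_seq` and the `TICK_NSEC` value are assumptions for illustration, and the real kernel function may differ in detail.

```python
from dataclasses import dataclass

TICK_NSEC = 1_000_000  # assumed tick length for the example


@dataclass
class Bandwidth:  # stands in for struct cfs_bandwidth
    runtime_expires: int
    expires_seq: int


@dataclass
class RunQueue:  # stands in for struct cfs_rq
    runtime_expires: int
    expires_seq: int
    runtime_remaining: int


def expire_runtime(rq, bw, now):
    """Expire local runtime unless the apparent expiry is only clock drift."""
    if now < rq.runtime_expires:
        return  # local deadline is still ahead of the clock
    if rq.runtime_remaining < 0:
        return
    if rq.expires_seq == bw.expires_seq:
        # Same expiration sequence: the global deadline has not moved on,
        # so this is clock drift. Extend the local deadline by one tick.
        rq.runtime_expires += TICK_NSEC
    else:
        # The global period was refreshed since this slice was handed out,
        # so the local runtime really has expired.
        rq.runtime_remaining = 0
```

Comparing update sequences rather than the two runtime_expires timestamps avoids the case the commit describes, where both timestamps are equal but stale.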
\| *	kernel: power: qos: remove check for core isolation while cluster LPMs	Raghavendra Kakarla	2023-07-16
\| \|	Since all cores in a cluster are in isolation, PMQoS latency constraint set by clock driver to switch PLL is ignored. So, Cluster enter to L2PC and SPM is trying to disable the PLL and at same time clock driver trying to switch the PLL from other cluster which leads to the synchronization issues.
\| \|	Fix is although all cores are in isolation, honor PMQoS request for cluster LPMs.
\| \|	Change-Id: I4296e16ef4e9046d1fbe3b7378e9f61a2f11c74d
\| \|	Signed-off-by: Raghavendra Kakarla <rkakarla@codeaurora.org>
\| *	sched/fair: Fix issue where frequency update not skipped	Joel Fernandes	2023-07-16
\| \|	This patch fixes one of the infrequent conditions in commit 54b6baeca500 ("sched/fair: Skip frequency updates if CPU about to idle") where we could have skipped a frequency update. The fix is to use the correct flag which skips freq updates.
\| \|	Note that this is a rare issue (can show up only during CFS throttling) and even then we just do an additional frequency update which we were doing anyway before the above patch.
\| \|	Bug: 64689959
\| \|	Change-Id: I0117442f395cea932ad56617065151bdeb9a3b53
\| \|	Signed-off-by: Joel Fernandes <joelaf@google.com>
\| *	ANDROID: Move schedtune en/dequeue before schedutil update triggers	Chris Redpath	2023-07-16
\| \|	CPU rq util updates happen when rq signals are updated as part of enqueue and dequeue operations. Doing these updates triggers a call to the registered util update handler, which takes schedtune boosting into account. Enqueueing the task in the correct schedtune group after this happens means that we will potentially not see the boost for an entire throttle period.
\| \|	Move the enqueue/dequeue operations for schedtune before the signal updates which can trigger OPP changes.
\| \|	Change-Id: I4236e6b194bc5daad32ff33067d4be1987996780
\| \|	Signed-off-by: Chris Redpath <chris.redpath@arm.com>
\| *	sched/fair: Skip frequency updates if CPU about to idle	Joel Fernandes	2023-07-16
\| \|	If CPU is about to idle, prevent a frequency update. With this change, the number of schedutil governor wake ups is reduced by more than half on a test playing bluetooth audio.
\| \|	Test: sugov wake ups drop by more than half when playing music with screen off (476 / 1092)
\| \|	Bug: 64689959
\| \|	Change-Id: I400026557b4134c0ac77f51c79610a96eb985b4a
\| \|	Signed-off-by: Joel Fernandes <joelaf@google.com>
\| *	FROMLIST: sched: Make iowait_boost optional in schedutil	Joel Fernandes	2023-07-16
\| \|	We should apply the iowait boost only if cpufreq policy has iowait boost enabled. Also make it a schedutil configuration from sysfs so it can be turned on/off if needed (by default initialize it to the policy value).
\| \|	For systems that don't need/want it enabled, such as those on arm64 based mobile devices that are battery operated, it saves energy when the cpufreq driver policy doesn't have it enabled (details below):
\| \|	Here are some results for energy measurements collected running a YouTube video for 30 seconds:
\| \|	  Before: 8.042533 mWh
\| \|	  After: 7.948377 mWh
\| \|	  Energy savings is ~1.2%
\| \|	Bug: 38010527
\| \|	Link: https://lkml.org/lkml/2017/5/19/42
\| \|	Change-Id: If124076ad0c16ade369253840dedfbf870aff927
\| \|	Signed-off-by: Joel Fernandes <joelaf@google.com>
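The ~1.2% figure follows directly from the two measurements quoted in the commit message. A small Python check, where the helper name `energy_savings_percent` is made up for this example:

```python
def energy_savings_percent(before_mwh, after_mwh):
    """Relative energy reduction, as a percentage of the 'before' value."""
    return (before_mwh - after_mwh) / before_mwh * 100.0


# Measurements quoted in the commit message (30 seconds of video playback).
savings = energy_savings_percent(8.042533, 7.948377)

if __name__ == "__main__":
    print(f"{savings:.1f}%")
```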
\| *	sched/walt: Re-add code to allow WALT to function	Ethan Chen	2023-07-16
\| \|	Change-Id: Ieb1067c5e276f872ed4c722b7d1fabecbdad87e7
\| *	sched: Don't fail isolation request for an already isolated CPU	Pavankumar Kondeti	2023-07-16
\| \|	When isolating a CPU, a check is performed to see if there is only 1 active CPU in the system. If that is the case, the CPU is not isolated. However this check is done before testing if the requested CPU is already isolated or not. If the requested CPU is already isolated, there is no need to fail the isolation even when there is only 1 active CPU in the system.
\| \|	For example, 0-6 CPUs are isolated on a 8 CPU machine. When an isolation request comes for CPU6, which is already isolated, the current code fails the request, thinking we would end up with no active CPU in the system.
\| \|	Change-Id: I28fea4ff67ffed82465e5cfa785414069e4a180a
\| \|	Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
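The reordering of the two checks can be illustrated with a small model. This is a sketch in Python under stated assumptions: `isolate_cpu`, its arguments, and the 0 / -1 return values are hypothetical and do not mirror the kernel function's signature or error codes.

```python
def isolate_cpu(cpu, isolated, num_cpus):
    """Isolate `cpu`; return 0 on success, -1 if it is the last active CPU.

    `isolated` is a set of already isolated CPU numbers and is updated
    in place.
    """
    # Check for "already isolated" first, so a repeated request never fails.
    if cpu in isolated:
        return 0
    # Only now refuse if this request would leave no active CPU.
    active = num_cpus - len(isolated)
    if active == 1:
        return -1
    isolated.add(cpu)
    return 0


if __name__ == "__main__":
    isolated = set(range(7))             # CPUs 0-6 isolated on an 8-CPU machine
    print(isolate_cpu(6, isolated, 8))   # repeated request for CPU6
    print(isolate_cpu(7, isolated, 8))   # request for the last active CPU
```

With the checks in the opposite order, the request for CPU6 in the commit's example would hit the "only 1 active CPU" test first and fail.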
\| *	sched: WALT: increase WALT minimum window size to 20ms	Joonwoo Park	2023-07-16
\| \|	Increase WALT minimum window size to 20ms. 10ms isn't large enough to capture a workload's pattern.
\| \|	[beykerykt]: Adapt for HMP
\| \|	Change-Id: I4d69577fbfeac2bc23db4ff414939cc51ada30d6
\| \|	Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
\| *	sched: cpufreq: Use per_cpu_ptr instead of this_cpu_ptr when reporting load	Vikram Mulukutla	2023-07-16
\| \|	We need cpufreq_update_util to report load for the CPU corresponding to the rq that is passed in as an argument, rather than the CPU executing cpufreq_update_util.
\| \|	Change-Id: I8473f230d40928d5920c614760e96fef12745d5a
\| \|	Signed-off-by: Vikram Mulukutla <markivx@codeaurora.org>
\| *	sched: cpufreq: Use sched_clock instead of rq_clock when updating schedutil	Vikram Mulukutla	2023-07-16
\| \|	rq_clock may not be updated often enough for schedutil or other cpufreq governors to work correctly when it's passed as the timestamp for a load report. Use sched_clock instead.
\| \|	[beykerykt]: Switch to sched_ktime_clock()
\| \|	Change-Id: I745b727870a31da25f766c2c2f37527f568c20da
\| \|	Signed-off-by: Vikram Mulukutla <markivx@codeaurora.org>
\| *	sched/cpupri: Exclude isolated CPUs from the lowest_mask	Pavankumar Kondeti	2023-07-16
\| \|	The cpupri_find() returns the candidate CPUs which are running lower priority than the waking RT task in the lowest_mask. This contains isolated CPUs as well. Since the energy aware CPU selection skips isolated CPUs, no target CPU may be found if all unisolated CPUs are running higher priority RT tasks. In which case, we fallback to the default CPU selection algorithm and returns an isolated CPU. This decision is reversed by select_task_rq() and returns an unisolated CPU that is busy with other RT tasks. This RT task packing is desired behavior. However, RT push mechanism pushes the packed RT task to an isolated CPU.
\| \|	This can be avoided by excluding isolated CPUs from the lowest_mask returned by cpupri_find().
\| \|	Change-Id: I75486b3935caf496a638d0333565beffc47fe249
\| \|	Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
\| *	sched: cpufreq: Limit governor updates to WALT changes alone	Vikram Mulukutla	2023-07-16
\| \|	It's not necessary to keep reporting load to the governor if it doesn't change in a window. Limit updates to when we expect load changes - after window rollover and when we send updates related to intercluster migrations.
\| \|	[beykerykt]: Adapt for HMP
\| \|	Change-Id: I3232d40f3d54b0b81cfafdcdb99b534df79327bf
\| \|	Signed-off-by: Vikram Mulukutla <markivx@codeaurora.org>
\| *	sched: walt: Correct WALT window size initialization	Vikram Mulukutla	2023-07-16
\| \|	It is preferable that WALT window rollover occurs just before a tick, since the tick is an opportune moment to record a complete window's statistics, as well as report those stats to the cpu frequency governor. When CONFIG_HZ results in a TICK_NSEC that isn't an integral number, this requirement may be violated. Account for this by reducing the WALT window size to the nearest multiple of TICK_NSEC.
\| \|	Commit d368c6faa19b ("sched: walt: fix window misalignment when HZ=300") attempted to do this but WALT isn't using MIN_SCHED_RAVG_WINDOW as the window size and the patch was doing nothing.
\| \|	Also, change the type of 'walt_disabled' to bool and warn if an invalid window size causes WALT to be disabled.
\| \|	[beykerykt]: Adapt for HMP
\| \|	Change-Id: Ie3dcfc21a3df4408254ca1165a355bbe391ed5c7
\| \|	Signed-off-by: Vikram Mulukutla <markivx@codeaurora.org>
\| *	sched: walt: fix window misalignment when HZ=300	Joonwoo Park	2023-07-16
\| \|	Due to rounding error hrtimer tick interval becomes 3333333 ns when HZ=300. Consequently the tick time stamp nearest to the WALT's default window size 20ms will be also 19999998 (3333333 * 6).
\| \|	[beykerykt]: Adapt for HMP
\| \|	Change-Id: I08f9bd2dbecccbb683e4490d06d8b0da703d3ab2
\| \|	Suggested-by: Joel Fernandes <joelaf@google.com>
\| \|	Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
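The numbers in this commit message can be reproduced with integer arithmetic. The Python sketch below is illustrative only: the helper names `tick_nsec` and `align_window` are invented for the example, and rounding the window down to a whole number of ticks is an assumption about how the alignment is done.

```python
NSEC_PER_SEC = 1_000_000_000


def tick_nsec(hz):
    """Tick length in whole nanoseconds; the fractional part is lost."""
    return NSEC_PER_SEC // hz


def align_window(window_ns, hz):
    """Shrink a window to a whole number of ticks (rounding down)."""
    tick = tick_nsec(hz)
    return (window_ns // tick) * tick


if __name__ == "__main__":
    print(tick_nsec(300))                 # tick interval at HZ=300
    print(align_window(20_000_000, 300))  # 20 ms window aligned to ticks
    print(align_window(20_000_000, 100))  # HZ=100 divides evenly
```

At HZ=300 the tick is 3333333 ns, so six ticks cover 19999998 ns rather than the full 20000000 ns, which is the misalignment the commit addresses.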
\| *	sched/tune: Increase the cgroup limit to 6	Pavankumar Kondeti	2023-07-16
\| \|	The schedtune cgroup controller allows up to 5 cgroups including the default/root cgroup. Until now the user space is creating only 4 additional cgroups namely, foreground, background, top-app and audio-app. Recently another cgroup called rt is created before the audio-app cgroup. Since kernel limits the cgroups to 5, the creation of audio-app cgroup is failing. Fix this by increasing the schedtune cgroup controller cgroup limit to 6.
\| \|	Change-Id: I13252a90dba9b8010324eda29b8901cb0b20bc21
\| \|	Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
\| *	sched/fair: Fix load_balance() affinity redo path	Jeffrey Hugo	2023-07-16
\| \|	If load_balance() fails to migrate any tasks because all tasks were affined, load_balance() removes the source cpu from consideration and attempts to redo and balance among the new subset of cpus.
\| \|	There is a bug in this code path where the algorithm considers all active cpus in the system (minus the source that was just masked out). This is not valid for two reasons: some active cpus may not be in the current scheduling domain and one of the active cpus is dst_cpu. These cpus should not be considered, as we cannot pull load from them.
\| \|	Instead of failing out of load_balance(), we may end up redoing the search with no valid cpus and incorrectly concluding the domain is balanced. Additionally, if the group_imbalance flag was just set, it may also be incorrectly unset, thus the flag will not be seen by other cpus in future load_balance() runs as that algorithm intends.
\| \|	Fix the check by removing cpus not in the current domain and the dst_cpu from consideration, thus limiting the evaluation to valid remaining cpus from which load might be migrated.
\| \|	Co-authored-by: Austin Christ <austinwc@codeaurora.org>
\| \|	Co-authored-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
\| \|	Signed-off-by: Jeffrey Hugo <jhugo@codeaurora.org>
\| \|	Tested-by: Tyler Baicar <tbaicar@codeaurora.org>
\| \|	Change-Id: Ife6701c9c62e7155493d9db9398f08c4474e94b3
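The difference between the buggy and the fixed redo check can be shown with plain sets. This Python sketch follows the wording of the commit message only; the function names and the set-based representation of cpumasks are assumptions, and the actual kernel check may be expressed differently.

```python
def buggy_redo_candidates(active_cpus, excluded):
    """Old behaviour: every active CPU except the masked-out sources."""
    return set(active_cpus) - set(excluded)


def fixed_redo_candidates(domain_span, active_cpus, dst_cpu, excluded):
    """Fixed behaviour: only CPUs in this domain, without dst_cpu."""
    return (set(domain_span) & set(active_cpus)) - {dst_cpu} - set(excluded)


if __name__ == "__main__":
    active = {0, 1, 2, 3, 4, 5}
    domain = {0, 1}      # the scheduling domain being balanced
    dst_cpu = 0          # the CPU pulling load
    excluded = {1}       # source CPU whose tasks were all affined

    # Old check: still sees candidates, so it redoes a pointless search.
    print(buggy_redo_candidates(active, excluded))
    # Fixed check: nothing left to pull from, so no redo.
    print(fixed_redo_candidates(domain, active, dst_cpu, excluded))
```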
\| *	sched/fair: Avoid unnecessary active load balance	Maria Yu	2023-07-16
\| \|	When find busiest group, it will avoid load balance if it is only 1 task running on src cpu. Consider race when different cpus do newly idle load balance at the same time, check src cpu nr_running to avoid unnecessary active load balance again.
\| \|	See the race condition example here:
\| \|	1) cpu2 have 2 tasks, so cpu2 rq->nr_running == 2 and cfs.h_nr_running ==2.
\| \|	2) cpu4 and cpu5 doing newly idle load balance at the same time.
\| \|	3) cpu4 and cpu5 both see cpu2 sched_load_balance_sg_stats sum_nr_run=2 so they are both see cpu2 as the busiest rq.
\| \|	4) cpu5 did a success migration task from cpu2, so cpu2 only have 1 task left, cpu2 rq->nr_running == 1 and cfs.h_nr_running ==1.
\| \|	5) cpu4 surely goes to no_move because currently cpu4 only have 1 task which is currently running.
\| \|	6) and then cpu4 goes here to check if cpu2 need active load balance.
\| \|	Change-Id: Ia9539a43e9769c4936f06ecfcc11864984c50c29
\| \|	Signed-off-by: Maria Yu <aiquny@codeaurora.org>
\| *	BACKPORT: sched/core: Fix rules for running on online && !active CPUs	Peter Zijlstra	2023-07-16
\| \|	As already enforced by the WARN() in __set_cpus_allowed_ptr(), the rules for running on an online && !active CPU are stricter than just being a kthread, you need to be a per-cpu kthread.
\| \|	If you're not strictly per-CPU, you have better CPUs to run on and don't need the partially booted one to get your work done.
\| \|	The exception is to allow smpboot threads to bootstrap the CPU itself and get kernel 'services' initialized before we allow userspace on it.
\| \|	Change-Id: I515e873a6e5be0cde7771ecedf56101614300fe2
\| \|	Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
\| \|	Cc: Linus Torvalds <torvalds@linux-foundation.org>
\| \|	Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
\| \|	Cc: Peter Zijlstra <peterz@infradead.org>
\| \|	Cc: Steven Rostedt <rostedt@goodmis.org>
\| \|	Cc: Tejun Heo <tj@kernel.org>
\| \|	Cc: Thomas Gleixner <tglx@linutronix.de>
\| \|	Fixes: 955dbdf4ce87 ("sched: Allow migrating kthreads into online but inactive CPUs")
\| \|	Link: http://lkml.kernel.org/r/20170725165821.cejhb7v2s3kecems@hirez.programming.kicks-ass.net
\| \|	Signed-off-by: Ingo Molnar <mingo@kernel.org>
\| \|	Backported to 4.4
\| \|	Signed-off-by: joshuous <joshuous@gmail.com>
\| *	sched/core: Allow kthreads to fall back to online && !active cpus	Tejun Heo	2023-07-16
\| \|	During CPU hotplug, CPU_ONLINE callbacks are run while the CPU is online but not active. A CPU_ONLINE callback may create or bind a kthread so that its cpus_allowed mask only allows the CPU which is being brought online. The kthread may start executing before the CPU is made active and can end up in select_fallback_rq().
\| \|	In such cases, the expected behavior is selecting the CPU which is coming online; however, because select_fallback_rq() only chooses from active CPUs, it determines that the task doesn't have any viable CPU in its allowed mask and ends up overriding it to cpu_possible_mask.
\| \|	CPU_ONLINE callbacks should be able to put kthreads on the CPU which is coming online. Update select_fallback_rq() so that it follows cpu_online() rather than cpu_active() for kthreads.
\| \|	Reported-by: Gautham R Shenoy <ego@linux.vnet.ibm.com>
\| \|	Tested-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
\| \|	Change-Id: I562dcc53717b1f2f8324abffb652b91592ba8d5c
\| \|	Signed-off-by: Tejun Heo <tj@kernel.org>
\| \|	Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
\| \|	Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
\| \|	Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
\| \|	Cc: Linus Torvalds <torvalds@linux-foundation.org>
\| \|	Cc: Michael Ellerman <mpe@ellerman.id.au>
\| \|	Cc: Peter Zijlstra <peterz@infradead.org>
\| \|	Cc: Thomas Gleixner <tglx@linutronix.de>
\| \|	Cc: kernel-team@fb.com
\| \|	Cc: linuxppc-dev@lists.ozlabs.org
\| \|	Link: http://lkml.kernel.org/r/20160616193504.GB3262@mtj.duckdns.org
\| \|	Signed-off-by: Ingo Molnar <mingo@kernel.org>
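The rule change for select_fallback_rq() can be pictured with a small model. The following Python sketch is an illustration under assumptions: `fallback_candidates` and its set-based arguments are hypothetical, and it models only the candidate-selection rule described in this commit message (kthreads follow the online mask, other tasks follow the active mask).

```python
def fallback_candidates(allowed, online, active, is_kthread):
    """CPUs a task may fall back to, as a sorted list.

    Kernel threads may use any online CPU in their allowed mask; other
    tasks are limited to active CPUs.
    """
    usable = online if is_kthread else active
    return sorted(set(allowed) & set(usable))


if __name__ == "__main__":
    online = {0, 1, 2, 3}
    active = {0, 1, 2}   # CPU3 is coming online but is not active yet
    allowed = {3}        # kthread bound to the CPU being brought up

    print(fallback_candidates(allowed, online, active, is_kthread=True))
    print(fallback_candidates(allowed, online, active, is_kthread=False))
```

An empty result corresponds to the situation the commit describes, where the task appears to have no viable CPU and its mask is overridden.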
\| *	sched: Allow migrating kthreads into online but inactive CPUs	Tejun Heo	2023-07-16
\| \|	Per-cpu workqueues have been tripping CPU affinity sanity checks while a CPU is being offlined. A per-cpu kworker ends up running on a CPU which isn't its target CPU while the CPU is online but inactive.
\| \|	While the scheduler allows kthreads to wake up on an online but inactive CPU, it doesn't allow a running kthread to be migrated to such a CPU, which leads to an odd situation where setting affinity on a sleeping and running kthread leads to different results.
\| \|	Each mem-reclaim workqueue has one rescuer which guarantees forward progress and the rescuer needs to bind itself to the CPU which needs help in making forward progress; however, due to the above issue, while set_cpus_allowed_ptr() succeeds, the rescuer doesn't end up on the correct CPU if the CPU is in the process of going offline, tripping the sanity check and executing the work item on the wrong CPU.
\| \|	This patch updates __migrate_task() so that kthreads can be migrated into an inactive but online CPU.
\| \|	Change-Id: I38cc3eb3b2ec5b7034cc72a2bcdd32a549314915
\| \|	Signed-off-by: Tejun Heo <tj@kernel.org>
\| \|	Reported-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
\| \|	Reported-by: Steven Rostedt <rostedt@goodmis.org>
\| *	sched/fair: Allow load bigger task load balance when nr_running is 2	Maria Yu	2023-07-16
\| \|	When there are only 2 tasks on a CPU and the other task is currently running, allow the task with the bigger load to be load-balanced.

Change-Id: I489e9624ba010f9293272a67585e8209a786b787
Signed-off-by: Maria Yu <aiquny@codeaurora.org>
\| *	sched/fair: vruntime should normalize when switching from fair	John Dias	2023-07-16
\| \|	When rt_mutex_setprio changes a task's scheduling class to RT, we're seeing cases where the task's vruntime is not updated correctly upon return to the fair class. Specifically, the following is being observed:

- task is deactivated while still in the fair class
- task is boosted to RT via rt_mutex_setprio, which changes the task to RT and calls check_class_changed.
- check_class_changed leads to detach_task_cfs_rq, at which point the vruntime_normalized check sees that the task's state is TASK_WAKING, which results in skipping the subtraction of the rq's min_vruntime from the task's vruntime
- later, when the prio is deboosted and the task is moved back to the fair class, the fair rq's min_vruntime is added to the task's vruntime, even though it wasn't subtracted earlier.

The immediate result is inflation of the task's vruntime, giving it lower priority (starving it if there's enough available work). The longer-term effect is inflation of all vruntimes because the task's vruntime becomes the rq's min_vruntime when the higher priority tasks go idle. That leads to a vicious cycle, where the vruntime inflation repeatedly doubles.

The change here is to detect when vruntime_normalized is being called when the task is waking but is waking in another class, and to conclude that this is a case where vruntime has not been normalized.

Bug: 80502612
Change-Id: If0bb02eb16939ca5e91ef282b7f9119ff68622c4
Signed-off-by: John Dias <joaodias@google.com>
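The sequence above can be modelled in a few lines of user-space C. This is only a sketch built from the description in the commit message, not the kernel code: `struct task_model`, `vruntime_is_normalized()`, `detach_from_fair()` and `attach_to_fair()` are hypothetical stand-ins, and the check on the scheduling class is an assumption about how "waking in another class" is detected.

```c
#include <stdbool.h>
#include <stdint.h>

enum sched_class_model { CLASS_FAIR, CLASS_RT };

struct task_model {
    uint64_t vruntime;
    bool waking;                /* models state == TASK_WAKING */
    enum sched_class_model cls; /* class at the time of the check */
};

/*
 * Hypothetical stand-in for the fixed check: a waking task counts as
 * already normalized only if it is waking in the fair class.
 */
bool vruntime_is_normalized(const struct task_model *t)
{
    return t->waking && t->cls == CLASS_FAIR;
}

/* Leaving the fair class: subtract min_vruntime unless already normalized. */
void detach_from_fair(struct task_model *t, uint64_t min_vruntime)
{
    if (!vruntime_is_normalized(t))
        t->vruntime -= min_vruntime;
}

/* Returning to the fair class: min_vruntime is always added back. */
void attach_to_fair(struct task_model *t, uint64_t min_vruntime)
{
    t->vruntime += min_vruntime;
}
```

In this model a task that is waking while already switched to RT has min_vruntime subtracted on detach, so adding it back on return to the fair class restores the original value instead of inflating it.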
\| *	mm: introduce arg_lock to protect arg_start\|end and env_start\|end in mm_struct	Yang Shi	2023-07-16
\| \|	mmap_sem is on the hot path of the kernel and is very contended, but it is abused too. It is used to protect arg_start\|end and env_start\|end when reading /proc/$PID/cmdline and /proc/$PID/environ, but that doesn't make sense since those proc files just expect to read 4 values atomically and are not related to VM; they could be set to arbitrary values by C/R.

And the mmap_sem contention may cause unexpected issues like the one below:

INFO: task ps:14018 blocked for more than 120 seconds.
      Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ps D 0 14018 1 0x00000004
Call Trace:
  schedule+0x36/0x80
  rwsem_down_read_failed+0xf0/0x150
  call_rwsem_down_read_failed+0x18/0x30
  down_read+0x20/0x40
  proc_pid_cmdline_read+0xd9/0x4e0
  __vfs_read+0x37/0x150
  vfs_read+0x96/0x130
  SyS_read+0x55/0xc0
  entry_SYSCALL_64_fastpath+0x1a/0xc5

Both Alexey Dobriyan and Michal Hocko suggested using a dedicated lock for them to mitigate the abuse of mmap_sem.

So, introduce a new spinlock in mm_struct to protect concurrent access to arg_start\|end, env_start\|end and others, and replace the write-side mmap_sem acquisition with a read-side one, which still protects against the race between prctl and sys_brk that might break check_data_rlimit() and makes prctl friendlier to other VM operations.

This patch just eliminates the abuse of mmap_sem, but it can't resolve the above hung task warning completely since the later access_remote_vm() call needs to acquire mmap_sem. The mmap_sem scalability issue will be solved in the future.

Change-Id: Ifa8f001ee2fc4f0ce60c18e771cebcf8a1f0943e
[yang.shi@linux.alibaba.com: add comment about mmap_sem and arg_lock]
Link: http://lkml.kernel.org/r/1524077799-80690-1-git-send-email-yang.shi@linux.alibaba.com
Link: http://lkml.kernel.org/r/1523730291-109696-1-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mateusz Guzik <mguzik@redhat.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Git-commit: 88aa7cc688d48ddd84558b41d5905a0db9535c4b
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Signed-off-by: Srinivas Ramana <sramana@codeaurora.org>
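The idea of guarding the four values with a small dedicated lock, rather than a heavily contended one, can be illustrated in user-space C. This is a sketch under stated assumptions: `struct mm_model`, `struct arg_snapshot`, `read_args()` and `write_args()` are hypothetical names, and a C11 `atomic_flag` spin loop stands in for the kernel spinlock.

```c
#include <stdatomic.h>

/* Model of the four fields plus their dedicated lock. */
struct mm_model {
    atomic_flag arg_lock; /* stands in for the new spinlock */
    unsigned long arg_start, arg_end, env_start, env_end;
};

struct arg_snapshot {
    unsigned long arg_start, arg_end, env_start, env_end;
};

static void model_lock(atomic_flag *l)
{
    while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
        ; /* spin until the flag is released */
}

static void model_unlock(atomic_flag *l)
{
    atomic_flag_clear_explicit(l, memory_order_release);
}

/* Reader: copy all four values under the lock so they are consistent. */
struct arg_snapshot read_args(struct mm_model *mm)
{
    struct arg_snapshot s;

    model_lock(&mm->arg_lock);
    s.arg_start = mm->arg_start;
    s.arg_end = mm->arg_end;
    s.env_start = mm->env_start;
    s.env_end = mm->env_end;
    model_unlock(&mm->arg_lock);
    return s;
}

/* Writer: update all four values under the same lock. */
void write_args(struct mm_model *mm, const struct arg_snapshot *s)
{
    model_lock(&mm->arg_lock);
    mm->arg_start = s->arg_start;
    mm->arg_end = s->arg_end;
    mm->env_start = s->env_start;
    mm->env_end = s->env_end;
    model_unlock(&mm->arg_lock);
}
```

The critical sections only copy four words, which is why a spinlock is a reasonable fit here; the reader never has to wait behind unrelated address-space operations.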
\| *	cpuset: Restore tasks affinity while moving across cpusets	Pavankumar Kondeti	2023-07-16
\| \|	When tasks move across cpusets, the current affinity settings are lost. Cache the task affinity and restore it during cpuset migration. The restore happens only when the cached affinity is a subset of the current cpuset settings.

Change-Id: I6c2ec1d5e3d994e176926d94b9e0cc92418020cc
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
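The subset condition can be expressed compactly with bitmasks. The following is an illustrative sketch only: `cpumask_model_t` and `mask_after_cpuset_move()` are hypothetical names, and treating an empty cached mask as "nothing cached" is an assumption made for the example.

```c
/* Simplified model: each bit of a mask represents one CPU. */
typedef unsigned int cpumask_model_t;

/*
 * Returns the affinity a task would get after moving into a cpuset:
 * the cached affinity if it is a non-empty subset of the cpuset's
 * CPUs, otherwise the cpuset's full CPU mask.
 */
cpumask_model_t mask_after_cpuset_move(cpumask_model_t cached_affinity,
                                       cpumask_model_t cpuset_cpus)
{
    /* Subset test: no cached bit may lie outside the cpuset. */
    if (cached_affinity != 0 && (cached_affinity & ~cpuset_cpus) == 0)
        return cached_affinity;

    return cpuset_cpus;
}
```

For example, a cached mask of CPUs 0-1 survives a move into a cpuset covering CPUs 0-3, while a cached mask of CPUs 4-5 does not and the task takes the cpuset's mask instead.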
\| *	genirq: Honour IRQ's affinity hint during migration	Pavankumar Kondeti	2023-07-16
\| \|	An IRQ's affinity is broken during hotplug/isolation when there are no online and un-isolated CPUs in the current affinity mask. In that case, an online and un-isolated CPU from the irq_default_affinity mask (i.e. /proc/irq/default_smp_affinity) is used as the current affinity. However, individual IRQs can have their affinity hint set via the irq_set_affinity_hint() API. When such a hint is available, use it instead of irq_default_affinity, which is a system-level setting.

Change-Id: I53a537582ec4e1aed0c59b49f4fd5b6ca7c0c332
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
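The fallback order described here can be sketched with bitmasks. This is a hypothetical user-space model, not the genirq code: `pick_irq_affinity()` and its parameters are invented for illustration, a hint of 0 models "no hint set", and falling back to the default mask when the hint contains no usable CPU is an assumption.

```c
/* Simplified model: each bit of a mask represents one CPU. */
typedef unsigned int cpumask_model_t;

/*
 * usable = CPUs that are online and not isolated.
 * Returns the set of CPUs the IRQ may be moved to.
 */
cpumask_model_t pick_irq_affinity(cpumask_model_t current_mask,
                                  cpumask_model_t hint,
                                  cpumask_model_t default_mask,
                                  cpumask_model_t usable)
{
    if (current_mask & usable)
        return current_mask;      /* affinity is not broken, keep it */

    if (hint & usable)
        return hint & usable;     /* prefer the per-IRQ hint */

    return default_mask & usable; /* system-wide fallback */
}
```

The per-IRQ hint is consulted before the system-wide default, so an IRQ that a driver steered towards particular CPUs stays near them when its current CPUs go away.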
\| *	power: qos: Use effective affinity mask	Pavankumar Kondeti	2023-07-16
\| \|	A PM_QOS_REQ_AFFINE_IRQ request is supposed to apply the QoS vote to the CPU(s) on which the attached interrupt arrives. Currently the QoS vote is applied to all the CPUs present in the IRQ affinity mask, i.e. desc->irq_data.common->affinity. However, some chips configure only a single CPU from this affinity mask to receive the IRQ. This information is present in the effective affinity mask of an IRQ. Start using it so that a QoS vote is not applied to CPUs that are present in the affinity mask but never receive the IRQ.

Change-Id: If26aa23bebe4a7d07ffedb5ff833ccdb4f4fb6ea
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
\| *	genirq: Introduce effective affinity mask	Thomas Gleixner	2023-07-16
\| \|	There is currently no way to evaluate the effective affinity mask of a given interrupt. Many irq chips allow only a single target CPU or a subset of CPUs in the affinity mask.

Updating the mask at the time of setting the affinity to the subset would be counterproductive because information for cpu hotplug about assigned interrupt affinities gets lost. On CPU hotplug it's also pointless to force migrate an interrupt which is not effectively targeted at the CPU. But currently the information is not available.

Provide a separate mask to be updated by the irq_chip->irq_set_affinity() implementations. Implement the read-only proc files so the user can see the effective mask as well w/o trying to deduce it from /proc/interrupts.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Link: http://lkml.kernel.org/r/20170619235446.247834245@linutronix.de
Change-Id: Ibeec0031edb532d52cb411286f785aec160d6139
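The distinction between the requested affinity and the effective affinity can be shown with a small model. This is an illustrative sketch, not the genirq implementation: `struct irq_model` and `single_target_set_affinity()` are hypothetical, and choosing the lowest usable CPU as the single target is an assumption made for the example.

```c
/* Model of an interrupt with both masks; each bit represents one CPU. */
struct irq_model {
    unsigned int affinity;  /* what was requested, kept for hotplug */
    unsigned int effective; /* where the chip actually delivers */
};

/*
 * Hypothetical set_affinity callback for a chip that can target only
 * one CPU: the requested mask is stored unchanged, and the effective
 * mask records the single CPU that was really programmed.
 * Returns 0 on success, -1 if no requested CPU is online.
 */
int single_target_set_affinity(struct irq_model *irq,
                               unsigned int requested,
                               unsigned int online)
{
    unsigned int usable = requested & online;

    if (usable == 0)
        return -1;

    irq->affinity = requested;
    irq->effective = usable & (~usable + 1u); /* isolate lowest set bit */
    return 0;
}
```

Keeping the two masks separate preserves the full request for hotplug decisions, while consumers that care about where the interrupt really arrives can read the effective mask.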
\| *	sched/cputime: Mitigate performance regression in times()/clock_gettime()	Giovanni Gherdovich	2023-07-16
\| \|	Commit:

  6e998916dfe3 ("sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency")

fixed a problem whereby clock_nanosleep() followed by clock_gettime() could allow a task to wake early. It addressed the problem by calling the scheduling class's update_curr() when the cputimer starts.

Said change induced a considerable performance regression on the syscalls times() and clock_gettime(CLOCK_PROCESS_CPUTIME_ID).
There are some debuggers and applications that monitor their own performance that accidentally depend on the performance of these specific calls.

This patch mitigates the performance loss by prefetching data in the CPU cache, as stalls due to cache misses appear to be where most time is spent in our benchmarks.

Here are the performance gains of this patch over v4.7-rc7 on a Sandy Bridge box with 32 logical cores and 2 NUMA nodes. The test is repeated with a variable number of threads, from 2 to 4*num_cpus; the results are in seconds and correspond to the average of 10 runs; the percentage gain is computed with (before-after)/before so a positive value is an improvement (it's faster). The improvement varies between a few percent for 5-20 threads and more than 10% for 2 or >20 threads.

pound_clock_gettime:

    threads    4.7-rc7    patched 4.7-rc7
    [num]      [secs]     [secs (percent)]
          2      3.48       3.06 ( 11.83%)
          5      3.33       3.25 (  2.40%)
          8      3.37       3.26 (  3.30%)
         12      3.32       3.37 ( -1.60%)
         21      4.01       3.90 (  2.74%)
         30      3.63       3.36 (  7.41%)
         48      3.71       3.11 ( 16.27%)
         79      3.75       3.16 ( 15.74%)
        110      3.81       3.25 ( 14.80%)
        128      3.88       3.31 ( 14.76%)

pound_times:

    threads    4.7-rc7    patched 4.7-rc7
    [num]      [secs]     [secs (percent)]
          2      3.65       3.25 ( 11.03%)
          5      3.45       3.17 (  7.92%)
          8      3.52       3.22 (  8.69%)
         12      3.29       3.36 ( -2.04%)
         21      4.07       3.92 (  3.78%)
         30      3.87       3.40 ( 12.17%)
         48      3.79       3.16 ( 16.61%)
         79      3.88       3.28 ( 15.42%)
        110      3.90       3.38 ( 13.35%)
        128      4.00       3.38 ( 15.45%)

pound_times and pound_clock_gettime are two benchmarks included in the MMTests framework. They launch a given number of threads which repeatedly call times() or clock_gettime().
The results above can be reproduced by cloning MMTests from github.com and running the "poundtime" workload:

  $ git clone https://github.com/gormanm/mmtests.git
  $ cd mmtests
  $ cp configs/config-global-dhp__workload_poundtime config
  $ ./run-mmtests.sh --run-monitor $(uname -r)

The above will run "poundtime" measuring the kernel currently running on the machine; once a new kernel is installed and the machine rebooted, running again

  $ cd mmtests
  $ ./run-mmtests.sh --run-monitor $(uname -r)

will produce results to compare with. A comparison table will be output with:

  $ cd mmtests/work/log
  $ ../../compare-kernels.sh

The table will contain a lot of entries; grepping for "Amean" (as in "arithmetic mean") will give the tables presented above. The source code for the two benchmarks is reported at the end of this changelog for clarity.

The cache misses addressed by this patch were found using a combination of `perf top`, `perf record` and `perf annotate`. The incriminated lines were found to be

  struct sched_entity *curr = cfs_rq->curr;

and

  delta_exec = now - curr->exec_start;

in the function update_curr() from kernel/sched/fair.c. This patch prefetches the data from memory just before update_curr is called in the relevant execution path.
A comparison of the total number of cycles before and after the patch follows; the data is obtained using `perf stat -r 10 -ddd <program>` running over the same sequence of number of threads used above (a positive gain is an improvement):

    threads   cycles before              cycles after               gain
          2    19,699,563,964 +-1.19%     17,358,917,517 +-1.85%   11.88%
          5    47,401,089,566 +-2.96%     45,103,730,829 +-0.97%    4.85%
          8    80,923,501,004 +-3.01%     71,419,385,977 +-0.77%   11.74%
         12   112,326,485,473 +-0.47%    110,371,524,403 +-0.47%    1.74%
         21   193,455,574,299 +-0.72%    180,120,667,904 +-0.36%    6.89%
         30   315,073,519,013 +-1.64%    271,222,225,950 +-1.29%   13.92%
         48   321,969,515,332 +-1.48%    273,353,977,321 +-1.16%   15.10%
         79   337,866,003,422 +-0.97%    289,462,481,538 +-1.05%   14.33%
        110   338,712,691,920 +-0.78%    290,574,233,170 +-0.77%   14.21%
        128   348,384,794,006 +-0.50%    292,691,648,206 +-0.66%   15.99%

A comparison of cache miss vs total cache loads ratios, before and after the patch (again from the `perf stat -r 10 -ddd <program>` tables):

    threads   L1 misses/total*100   L1 misses/total*100   gain
              before                after
          2     7.43 +-4.90%          7.36 +-4.70%         0.94%
          5    13.09 +-4.74%         13.52 +-3.73%        -3.28%
          8    13.79 +-5.61%         12.90 +-3.27%         6.45%
         12    11.57 +-2.44%          8.71 +-1.40%        24.72%
         21    12.39 +-3.92%          9.97 +-1.84%        19.53%
         30    13.91 +-2.53%         11.73 +-2.28%        15.67%
         48    13.71 +-1.59%         12.32 +-1.97%        10.14%
         79    14.44 +-0.66%         13.40 +-1.06%         7.20%
        110    15.86 +-0.50%         14.46 +-0.59%         8.83%
        128    16.51 +-0.32%         15.06 +-0.78%         8.78%

As a final note, the following shows the evolution of performance figures in the "poundtime" benchmark and pinpoints commit 6e998916dfe3 ("sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency") as a major source of degradation, mostly unaddressed to this day (figures expressed in seconds).
pound_clock_gettime:

    threads   parent of       6e998916dfe3        4.7-rc7
              6e998916dfe3    itself
          2     2.23          3.68 ( -64.56%)     3.48 (-55.48%)
          5     2.83          3.78 ( -33.42%)     3.33 (-17.43%)
          8     2.84          4.31 ( -52.12%)     3.37 (-18.76%)
         12     3.09          3.61 ( -16.74%)     3.32 ( -7.17%)
         21     3.14          4.63 ( -47.36%)     4.01 (-27.71%)
         30     3.28          5.75 ( -75.37%)     3.63 (-10.80%)
         48     3.02          6.05 (-100.56%)     3.71 (-22.99%)
         79     2.88          6.30 (-118.90%)     3.75 (-30.26%)
        110     2.95          6.46 (-119.00%)     3.81 (-29.24%)
        128     3.05          6.42 (-110.08%)     3.88 (-27.04%)

pound_times:

    threads   parent of       6e998916dfe3        4.7-rc7
              6e998916dfe3    itself
          2     2.27          3.73 ( -64.71%)     3.65 (-61.14%)
          5     2.78          3.77 ( -35.56%)     3.45 (-23.98%)
          8     2.79          4.41 ( -57.71%)     3.52 (-26.05%)
         12     3.02          3.56 ( -17.94%)     3.29 ( -9.08%)
         21     3.10          4.61 ( -48.74%)     4.07 (-31.34%)
         30     3.33          5.75 ( -72.53%)     3.87 (-16.01%)
         48     2.96          6.06 (-105.04%)     3.79 (-28.10%)
         79     2.88          6.24 (-116.83%)     3.88 (-34.81%)
        110     2.98          6.37 (-114.08%)     3.90 (-31.12%)
        128     3.10          6.35 (-104.61%)     4.00 (-28.87%)

The source code of the two benchmarks follows. To compile the two:

  NR_THREADS=42
  for FILE in pound_times pound_clock_gettime; do
      gcc -O2 -DNUM_THREADS=$NR_THREADS $FILE.c -o $FILE -lrt -lpthread
  done

==== BEGIN pound_times.c ====

#include <stdio.h>
#include <pthread.h>
#include <sys/times.h>

struct tms start;

/* Each thread calls times() repeatedly and checks that utime never decreases. */
void *pound (void *threadid)
{
  struct tms end;
  int oldutime = 0;
  int utime;
  int i;
  for (i = 0; i < 5000000 / NUM_THREADS; i++) {
          times(&end);
          utime = ((int)end.tms_utime - (int)start.tms_utime);
          if (oldutime > utime) {
            printf("utime decreased, was %d, now %d!\n", oldutime, utime);
          }
          oldutime = utime;
  }
  pthread_exit(NULL);
}

int main()
{
  pthread_t th[NUM_THREADS];
  long i;
  times(&start);
  for (i = 0; i < NUM_THREADS; i++) {
          pthread_create (&th[i], NULL, pound, (void *)i);
  }
  pthread_exit(NULL);
  return 0;
}

==== END pound_times.c ====

==== BEGIN pound_clock_gettime.c ====

#include <stdio.h>
#include <pthread.h>
#include <time.h>
#include <sys/types.h>

/* Each thread reads the process CPU-time clock repeatedly. */
void *pound (void *threadid)
{
  struct timespec ts;
  int rc, i;
  unsigned long prev = 0, this = 0;
  for (i = 0; i < 5000000 / NUM_THREADS; i++) {
          rc = clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
          if (rc < 0)
                  perror("clock_gettime");
          this = (ts.tv_sec * 1000000000) + ts.tv_nsec;
          if (0 && this < prev)
                  printf("%lu ns timewarp at iteration %d\n", prev - this, i);
          prev = this;
  }
  pthread_exit(NULL);
}

int main()
{
  pthread_t th[NUM_THREADS];
  long rc, i;
  pid_t pgid;
  for (i = 0; i < NUM_THREADS; i++) {
          rc = pthread_create(&th[i], NULL, pound, (void *)i);
          if (rc < 0)
                  perror("pthread_create");
  }
  pthread_exit(NULL);
  return 0;
}

==== END pound_clock_gettime.c ====

Suggested-by: Mike Galbraith <mgalbraith@suse.de>
Change-Id: Iad82d9f31c92e50e1e3b1339892512526ceb0acf
Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1470385316-15027-2-git-send-email-ggherdovich@suse.cz
Signed-off-by: Ingo Molnar <mingo@kernel.org>
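The prefetching technique this changelog describes can be sketched in user-space C. This is a simplified model, not the kernel patch: `struct entity_model`, `struct rq_model`, `prefetch_curr_exec_start_model()` and `update_curr_model()` are hypothetical names, and `__builtin_prefetch()` (a GCC/Clang builtin) is used here in place of the kernel's own prefetch helper.

```c
#include <stdint.h>

/* Minimal stand-ins for the scheduler entity and its run queue. */
struct entity_model {
    uint64_t exec_start;
    uint64_t sum_exec_runtime;
};

struct rq_model {
    struct entity_model *curr;
};

/*
 * Issue a prefetch for the field that the update is about to read,
 * so the cache line is (hopefully) already loaded when it is needed.
 */
static inline void prefetch_curr_exec_start_model(const struct rq_model *rq)
{
    const struct entity_model *curr = rq->curr;

    if (curr)
        __builtin_prefetch(&curr->exec_start);
}

/* Models the two loads the changelog identifies as cache-miss sites. */
uint64_t update_curr_model(struct rq_model *rq, uint64_t now)
{
    struct entity_model *curr = rq->curr;
    uint64_t delta_exec;

    if (!curr)
        return 0;

    delta_exec = now - curr->exec_start;
    curr->exec_start = now;
    curr->sum_exec_runtime += delta_exec;
    return delta_exec;
}
```

A caller would invoke the prefetch helper early, do any unrelated work such as taking a lock, and only then call the update function. The prefetch is a hint, so the result of the update is the same with or without it; only the latency changes.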
\| *	kernel: time: Add delay after cpu_relax() in tight loops	Prasad Sodagudi	2023-07-16
\| \|	Tight loops of spin_lock_irqsave() and spin_unlock_irqrestore() in timer and hrtimer are causing scheduling delays. Add a delay of a few nanoseconds after cpu_relax() in the timer/hrtimer tight loops.

Change-Id: Iaa0ab92da93f7b245b1d922b6edca2bebdc0fbce
Signed-off-by: Prasad Sodagudi <psodagud@codeaurora.org>
* \|	Merge lineage-20 of git@github.com:LineageOS/android_kernel_qcom_msm8998.git ↵	Davide Garberi	2023-08-03
\| \|	into lineage-20

1a4b80f8f201 ANDROID: arch:arm64: Increase kernel command line size
7c253f7aa663 of: reserved_mem: increase max number reserved regions
df4dbf557503 msm: camera: Fix indentations
2fc4a156d15d msm: camera: Fix code flow when populating CAM_V_CUSTOM1
687bcb61f125 ALSA: control: use counting semaphore as write lock for ELEM_WRITE operation
75cf9e8c1b1c ALSA: control: Fix memory corruption risk in snd_ctl_elem_read
76cf3b5e53df ALSA: control: code refactoring for ELEM_READ/ELEM_WRITE operations
e9af212f9685 ALSA: pcm: Move rwsem lock inside snd_ctl_elem_read to prevent UAF
95fc4fff573f msm: kgsl: Make sure that pool pages don't have any extra references
59ceabe0d242 msm: kgsl: Use dma_buf_get() to get dma_buf structure
d1f19956d6b9 ANDROID: usb: f_accessory: Check buffer size when initialised via composite
2d3ce4f7a366 kbuild: handle libs-y archives separately from built-in.o archives
65dc3fbd1593 kbuild: thin archives use P option to ar
362c7b73bac8 kbuild: thin archives for multi-y targets
43076241b514 kbuild: thin archives final link close --whole-archives option
aa04fc78256d kbuild: minor improvement for thin archives build
f5896747cda6 Merge tag 'LA.UM.7.2.c25-07700-sdm660.0' of https://git.codelinaro.org/clo/la/platform/vendor/qcom-opensource/wlan/qcacld-3.0 into android13-4.4-msm8998
321ac077ee7e qcacld-3.0: Fix out-of-bounds in tx_stats
42be8e4cbf13 BACKPORT: usb: gadget: rndis: prevent integer overflow in rndis_set_response()
b490a85b5945 FROMGIT: arm64: fix oops in concurrently setting insn_emulation sysctls
7ed7084b34a9 FROMLIST: binder: fix UAF of ref->proc caused by race condition
e31f087fb864 ANDROID: selinux: modify RTM_GETNEIGH{TBL}
80675d431434 UPSTREAM: usb: gadget: clear related members when goto fail
fb6adfb00108 UPSTREAM: usb: gadget: don't release an existing dev->buf
e4a8dd12424e UPSTREAM: USB: gadget: validate interface OS descriptor requests
8f0a947317e0 UPSTREAM: usb: gadget: rndis: check size of RNDIS_MSG_SET command
1541758765ff ion: Do not 'put' ION handle until after its final use
03b4b3cd8d30 Merge tag 'LA.UM.7.2.c25-07000-sdm660.0' of https://git.codelinaro.org/clo/la/platform/vendor/qcom-opensource/wlan/qcacld-3.0 into android13-4.4-msm8998
7dbda95466d5 Merge tag 'LA.UM.8.4.c25-06600-8x98.0' of https://git.codelinaro.org/clo/la/kernel/msm-4.4 into android13-4.4-msm8998
369119e5df4e cert host tools: Stop complaining about deprecated OpenSSL functions
f8e30a0f9a17 fixup! BACKPORT: treewide: Fix function prototypes for module_param_call()
4fa5045f3dc9 arm64/efi: Mark __efistub_stext_offset as an absolute symbol explicitly
bcd9668da77f arm64: kernel: do not need to reset UAO on exception entry
c4ddd677f7e3 Kbuild: do not emit debug info for assembly with LLVM_IAS=1
1b880b6e19f8 qcacld-3.0: Add time slice duty cycle in wifi_interface_info
fd24be2b22a1 qcacmn: Add time slice duty cycle attribute into QCA vendor command
d719c1c825f8 qcacld-3.0: Use field-by-field assignment for FW stats
fb5eb3bda2d9 ext4: enable quota enforcement based on mount options
cd40d7f301de ext4: adds project ID support
360e2f3d18b8 ext4: add project quota support
c31ac2be1594 drivers: qcacld-3.0: Remove in_compat_syscall() redefinition
6735c13a269d arm64: link with -z norelro regardless of CONFIG_RELOCATABLE
99962aab3433 arm64: relocatable: fix inconsistencies in linker script and options
24bd8cc5e6bb arm64: prevent regressions in compressed kernel image size when upgrading to binutils 2.27
93bb4c2392a2 arm64: kernel: force ET_DYN ELF type for CONFIG_RELOCATABLE=y
a54bbb725ccb arm64: build with baremetal linker target instead of Linux when available
c5805c604a9b arm64: add endianness option to LDFLAGS instead of LD
ab6052788f60 arm64: Set UTS_MACHINE in the Makefile
c3330429b2c6 kbuild: clear LDFLAGS in the top Makefile
f33c1532bd61 kbuild: use HOSTLDFLAGS for single .c executables
38b7db363a96 BACKPORT: arm64: Change .weak to SYM_FUNC_START_WEAK_PI for arch/arm64/lib/mem.S
716cb63e81d9 BACKPORT: crypto: arm64/aes-ce-cipher - move assembler code to .S file
7dfbaee16432 BACKPORT: arm64: Remove reference to asm/opcodes.h
531ee8624d17 BACKPORT: arm64: kprobe: protect/rename few definitions to be reused by uprobe
08d83c997b0c BACKPORT: arm64: Delete the space separator in __emit_inst
e3951152dc2d BACKPORT: arm64: Get rid of asm/opcodes.h
255820c0f301 BACKPORT: arm64: Fix minor issues with the dcache_by_line_op macro
21bb344a664b BACKPORT: crypto: arm64/aes-modes - get rid of literal load of addend vector
26d5a53c6e0d BACKPORT: arm64: vdso: remove commas between macro name and arguments
78bff1f77c9d BACKPORT: kbuild: support LLVM=1 to switch the default tools to Clang/LLVM
6634f9f63efe BACKPORT: kbuild: replace AS=clang with LLVM_IAS=1
b891e8fdc466 BACKPORT: Documentation/llvm: fix the name of llvm-size
75d6fa8368a8 BACKPORT: Documentation/llvm: add documentation on building w/ Clang/LLVM
95b0a5e52f2a BACKPORT: ANDROID: ftrace: fix function type mismatches
7da9c2138ec8 BACKPORT: ANDROID: fs: logfs: fix filler function type
d6d5a4b28ad0 BACKPORT: ANDROID: fs: gfs2: fix filler function type
9b194a470db5 BACKPORT: ANDROID: fs: exofs: fix filler function type
7a45ac4bfb49 BACKPORT: ANDROID: fs: afs: fix filler function type
4099e1b281e5 BACKPORT: drivers/perf: arm_pmu: fix function type mismatch
af7b738882f7 BACKPORT: dummycon: fix function types
1b0b55a36dbe BACKPORT: fs: nfs: fix filler function type
a58a0e30e20a BACKPORT: mm: fix filler function type mismatch
829e9226a8c0 BACKPORT: mm: fix drain_local_pages function type
865ef61b4da8 BACKPORT: vfs: pass type instead of fn to do_{loop,iter}_readv_writev()
08d2f8e7ba8e BACKPORT: module: Do not paper over type mismatches in module_param_call()
ea467f6c33e4 BACKPORT: treewide: Fix function prototypes for module_param_call()
d131459e6b8b BACKPORT: module: Prepare to convert all module_param_call() prototypes
6f52abadf006 BACKPORT: kbuild: fix --gc-sections
bf7540ffce44 BACKPORT: kbuild: record needed exported symbols for modules
c49d2545e437 BACKPORT: kbuild: Allow to specify composite modules with modname-m
427d0fc67dc1 BACKPORT: kbuild: add arch specific post-link Makefile
69f8a31838a3 BACKPORT: arm64: add a workaround for GNU gold with ARM64_MODULE_PLTS
ba3368756abf BACKPORT: arm64: explicitly pass --no-fix-cortex-a53-843419 to GNU gold
6dacd7e737fb BACKPORT: arm64: errata: Pass --fix-cortex-a53-843419 to ld if workaround enabled
d2787c21f2b5 BACKPORT: kbuild: add __ld-ifversion and linker-specific macros
2d471de60bb4 BACKPORT: kbuild: add ld-name macro
06280a90d845 BACKPORT: arm64: keep .altinstructions and .altinstr_replacement
eb0ad3ae07f9 BACKPORT: kbuild: add __cc-ifversion and compiler-specific variants
3d01e1eba86b BACKPORT: FROMLIST: kbuild: add clang-version.sh
18dd378ab563 BACKPORT: FROMLIST: kbuild: fix LD_DEAD_CODE_DATA_ELIMINATION
aabbc122b1de BACKPORT: kbuild: thin archives make default for all archs
756d47e345fc BACKPORT: kbuild: allow archs to select link dead code/data elimination
723ab99e48a7 BACKPORT: kbuild: allow architectures to use thin archives instead of ld -r
0b77ec583772 drivers/usb/serial/console.c: remove superfluous serial->port condition
6488cb478f04 drivers/firmware/efi/libstub.c: prevent a relocation
dba4259216a0 UPSTREAM: pidfd: fix a poll race when setting exit_state
baab6e33b07b BACKPORT: arch: wire-up pidfd_open()
5d2e9e4f8630 BACKPORT: pid: add pidfd_open()
f8396a127daf UPSTREAM: pidfd: add polling support
f4c358582254 UPSTREAM: signal: improve comments
5500316dc8d8 UPSTREAM: fork: do not release lock that wasn't taken
fc7d707593e3 BACKPORT: signal: support CLONE_PIDFD with pidfd_send_signal
f044fa00d72a BACKPORT: clone: add CLONE_PIDFD
f20fc1c548f2 UPSTREAM: Make anon_inodes unconditional
de80525cd462 UPSTREAM: signal: use fdget() since we don't allow O_PATH
229e1bdd624e UPSTREAM: signal: don't silently convert SI_USER signals to non-current pidfd
ada02e996b52 BACKPORT: signal: add pidfd_send_signal() syscall
828857678c5c compat: add in_compat_syscall to ask whether we're in a compat syscall
e7aede4896c0 bpf: Add new cgroup attach type to enable sock modifications
9ed75228b09c ebpf: allow bpf_get_current_uid_gid_proto also for networking
c5aa3963b4ae bpf: fix overflow in prog accounting
c46a001439fc bpf: Make sure mac_header was set before using it
8aed99185615 bpf: Enlarge offset check value to INT_MAX in bpf_skb_{load,store}_bytes
b0a638335ba6 bpf: avoid false sharing of map refcount with max_entries
1f21605e373c net: remove hlist_nulls_add_tail_rcu()
9ce369b09dbb udp: get rid of SLAB_DESTROY_BY_RCU allocations
070f539fb5d7 udp: no longer use SLAB_DESTROY_BY_RCU
a32d2ea857c5 inet: refactor inet[6]_lookup functions to take skb
fcf3e7bc7203 soreuseport: fix initialization race
df03c8cf024a soreuseport: Fix TCP listener hash collision
bd8b9f50c9d3 inet: Fix missing return value in inet6_hash
bae331196dd0 soreuseport: fast reuseport TCP socket selection
4ada2ed73da0 inet: create IPv6-equivalent inet_hash function
73f609838475 sock: struct proto hash function may error
e3b32750621b cgroup: Fix sock_cgroup_data on big-endian.
69dabcedd4b9 selinux: always allow mounting submounts
17d6ddebcc49 userns: Don't fail follow_automount based on s_user_ns
cbd08255e6f8 fs: Better permission checking for submounts
3a9ace719251 mnt: Move the FS_USERNS_MOUNT check into sget_userns
af53549b43c5 locks: sprinkle some tracepoints around the file locking code
07dbbc84aa34 locks: rename __posix_lock_file to posix_lock_inode
400cbe93d180 autofs: Fix automounts by using current_real_cred()->uid
7903280ee07a fs: Call d_automount with the filesystems creds
b87fb50ff1cd UPSTREAM: kernfs: Check KERNFS_HAS_RELEASE before calling kernfs_release_file()
c9c596de3e52 UPSTREAM: kernfs: fix locking around kernfs_ops->release() callback
2172eaf5a901 UPSTREAM: cgroup, bpf: remove unnecessary #include
dc81f3963dde kernfs: kernfs_sop_show_path: don't return 0 after seq_dentry call
ce9a52e20897 cgroup: Make rebind_subsystems() disable v2 controllers all at once
ce5e3aa14c39 cgroup: fix sock_cgroup_data initialization on earlier compilers
94a70ef24da9 samples/bpf: fix bpf_perf_event_output prototype
c1920272278e net: gso: Fix skb_segment splat when splitting gso_size mangled skb having linear-headed frag_list
d7707635776b sk_buff: allow segmenting based on frag sizes
924bbacea75e ip_tunnel, bpf: ip_tunnel_info_opts_{get, set} depends on CONFIG_INET
0e9008d618f4 bpf: udp: ipv6: Avoid running reuseport's bpf_prog from __udp6_lib_err
01b437940f5e soreuseport: add compat case for setsockopt SO_ATTACH_REUSEPORT_CBPF
421fbf04bf2c soreuseport: change consume_skb to kfree_skb in error case
1ab50514c430 ipv6: Fix SO_REUSEPORT UDP socket with implicit sk_ipv6only
f3dfd61c502d soreuseport: fix ordering for mixed v4/v6 sockets
245ee3c90795 soreuseport: fix NULL ptr dereference SO_REUSEPORT after bind
113fb209854a bpf: do not blindly change rlimit in reuseport net selftest
985253ef27d2 bpf: fix rlimit in reuseport net selftest
ae61334510be soreuseport: Fix reuseport_bpf testcase on 32bit architectures
6efa24da01a5 udp: fix potential infinite loop in SO_REUSEPORT logic
66df70c6605d soreuseport: BPF selection functional test for TCP
fe161031b8a8 soreuseport: pass skb to secondary UDP socket lookup
9223919efdf2 soreuseport: BPF selection functional test
2090ed790dbb soreuseport: fix mem leak in reuseport_add_sock()
67887f6ac3f1 Merge "diag: Ensure dci entry is valid before sending the packet"
e41c0da23b38 diag: Prevent out of bound write while sending dci pkt to remote
e1085d1ef39b diag: Ensure dci entry is valid before sending the packet
16802e80ecb5 Merge "ion: Fix integer overflow in msm_ion_custom_ioctl"
57146f83f388 ion: Fix integer overflow in msm_ion_custom_ioctl
6fc2001969fe diag: Use valid data_source for a valid token
0c6dbf858a98 qcacld-3.0: Avoid OOB read in dot11f_unpack_assoc_response
f07caca0c485 qcacld-3.0: Fix array OOB for duplicate rate
5a359aba0364 msm: kgsl: Remove 'fd' dependency to get dma_buf handle
da8317596949 msm: kgsl: Fix gpuaddr_in_range() to check upper bound
2ed91a98d8b4 msm: adsprpc: Handle UAF in fastrpc debugfs read
2967159ad303 msm: kgsl: Add a sysfs node to control performance counter reads
e392a84f25f5 msm: kgsl: Perform cache flush on the pages obtained using get_user_pages()
28b45f75d2ee soc: qcom: hab: Add sanity check for payload_count
885caec7690f Merge "futex: Fix inode life-time issue"
0f57701d2643 Merge "futex: Handle faults correctly for PI futexes"
7d7eb450c333 Merge "futex: Rework inconsistent rt_mutex/futex_q state"
124ebd87ef2f msm: kgsl: Fix out of bound write in adreno_profile_submit_time
228bbfb25032 futex: Fix inode life-time issue
7075ca6a22b3 futex: Handle faults correctly for PI futexes
a436b73e9032 futex: Simplify fixup_pi_state_owner()
11b99dbe3221 futex: Use pi_state_update_owner() in put_pi_state()
f34484030550 rtmutex: Remove unused argument from rt_mutex_proxy_unlock()
079d1c90b3c3 futex: Provide and use pi_state_update_owner()
3b51e24eb17b futex: Replace pointless printk in fixup_owner()
0eac5c2583a1 futex: Avoid violating the 10th rule of futex
6d6ed38b7d10 futex: Rework inconsistent rt_mutex/futex_q state
3c8f7dfd59b5 futex: Remove rt_mutex_deadlock_account_*()
9c870a329520 futex,rt_mutex: Provide futex specific rt_mutex API
7504736e8725 msm: adsprpc: Handle UAF in process shell memory
994e5922a0c2 Disable TRACER Check to improve Camera Performance
8fb3f17b3ad1 msm: kgsl: Deregister gpu address on memdesc_sg_virt failure
13aa628efdca Merge "crypto: Fix possible stack out-of-bound error"
92e777451003 Merge "msm: kgsl: Correct the refcount on current process PID."
9ca218394ed4 Merge "msm: kgsl: Compare pid pointer instead of TGID for a new process"
7eed1f2e0f43 Merge "qcom,max-freq-level change for trial"
6afb5eb98e36 crypto: Fix possible stack out-of-bound error
8b5ba278ed4b msm: kgsl: Correct the refcount on current process PID.
4150552fac96 msm: kgsl: Compare pid pointer instead of TGID for a new process
c272102c0793 qcom,max-freq-level change for trial
854ef3ce73f5 msm: kgsl: Protect the memdesc->gpuaddr in SVM use cases.
79c8161aeac9 msm: kgsl: Stop using memdesc->usermem.

Change-Id: Iea7db1362c3cd18e36f243411e773a9054f6a445
\| *	BACKPORT: ANDROID: ftrace: fix function type mismatches	Sami Tolvanen	2022-10-28
\| \|	This change fixes indirect call mismatches with function and function graph tracing, which trip Control-Flow Integrity (CFI) checking.

Bug: 79510107
Bug: 67506682
Change-Id: I5de08c113fb970ffefedce93c58e0161f22c7ca2
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
(cherry picked from commit c2f9bce9fee8e31e0500c501076f73db7791d8e9)
Signed-off-by: Dan Aloni <daloni@magicleap.com>
Signed-off-by: Davide Garberi <dade.garberi@gmail.com>
\| *	UPSTREAM: pidfd: fix a poll race when setting exit_state	Suren Baghdasaryan	2022-10-28
\| \|	There is a race between reading task->exit_state in pidfd_poll and writing it after do_notify_parent calls do_notify_pidfd. Expected sequence of events is:

CPU 0                            CPU 1
------------------------------------------------
exit_notify
  do_notify_parent
    do_notify_pidfd
  tsk->exit_state = EXIT_DEAD
                                 pidfd_poll
                                   if (tsk->exit_state)

However nothing prevents the following sequence:

CPU 0                            CPU 1
------------------------------------------------
exit_notify
  do_notify_parent
    do_notify_pidfd
                                 pidfd_poll
                                   if (tsk->exit_state)
  tsk->exit_state = EXIT_DEAD

This causes a polling task to wait forever, since poll blocks because exit_state is 0 and the waiting task is not notified again. A stress test continuously doing pidfd poll and process exits uncovered this bug.

To fix it, we make sure that the task's exit_state is always set before calling do_notify_pidfd.

Fixes: b53b0b9d9a6 ("pidfd: add polling support")
Cc: kernel-team@android.com
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Link: https://lore.kernel.org/r/20190717172100.261204-1-joel@joelfernandes.org
[christian@brauner.io: adapt commit message and drop unneeded changes from wait_task_zombie]
Signed-off-by: Christian Brauner <christian@brauner.io>
(cherry picked from commit b191d6491be67cef2b3fa83015561caca1394ab9)

Bug: 135608568
Test: test program using syscall(__NR_sys_pidfd_open,..) and poll()
Change-Id: I043e54c9b69f25de88f6f19ae167920af8532de2
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
\| *	BACKPORT: pid: add pidfd_open()	Christian Brauner	2022-10-28
\| \|	This adds the pidfd_open() syscall. It allows a caller to retrieve pollable pidfds for a process which did not get created via CLONE_PIDFD, i.e. for a process that is created via traditional fork()/clone() calls that is only referenced by a PID:

	int pidfd = pidfd_open(1234, 0);
	ret = pidfd_send_signal(pidfd, SIGSTOP, NULL, 0);

With the introduction of pidfds through CLONE_PIDFD it is possible to create pidfds at process creation time. However, a lot of processes get created with traditional PID-based calls such as fork() or clone() (without CLONE_PIDFD). For these processes a caller can currently not create a pollable pidfd. This is a problem for Android's low memory killer (LMK) and service managers such as systemd. Both are examples of tools that want to make use of pidfds to get reliable notification of process exit for non-parents (pidfd polling) and race-free signal sending (pidfd_send_signal()). They intend to switch to this API for process supervision/management as soon as possible. Having no way to get pollable pidfds from PID-only processes is one of the biggest blockers for them in adopting this api. With pidfd_open() making it possible to retrieve pidfds for PID-based processes we enable them to adopt this api.

In line with Arnd's recent changes to consolidate syscall numbers across architectures, I have added the pidfd_open() syscall to all architectures at the same time.

Signed-off-by: Christian Brauner <christian@brauner.io>
Reviewed-by: David Howells <dhowells@redhat.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jann Horn <jannh@google.com>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-api@vger.kernel.org
(cherry picked from commit 32fcb426ec001cb6d5a4a195091a8486ea77e2df)

Conflicts:
	kernel/pid.c

(1. Replaced PIDTYPE_TGID with PIDTYPE_PID and thread_group_leader() check in pidfd_open() call)

Bug: 135608568
Test: test program using syscall(__NR_sys_pidfd_open,..) and poll()
Change-Id: I52a93a73722d7f7754dae05f63b94b4ca4a71a75
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: electimon <electimon@gmail.com>
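The two calls quoted in the pidfd_open() message can be sketched from userspace roughly as follows. This is a minimal illustration, not code from the patch. It goes through syscall(2) because older C libraries ship no wrappers, and it assumes the unified syscall numbers 424 and 434 that most architectures use upstream (a backported tree may wire them differently). The helper names sys_pidfd_open(), sys_pidfd_send_signal() and signal_by_pid() are hypothetical.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumption: unified syscall numbers, used only if the headers lack them. */
#ifndef SYS_pidfd_send_signal
#define SYS_pidfd_send_signal 424
#endif
#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif

/* Hypothetical thin wrappers around the raw syscalls. */
static int sys_pidfd_open(pid_t pid, unsigned int flags)
{
	return (int)syscall(SYS_pidfd_open, pid, flags);
}

static int sys_pidfd_send_signal(int pidfd, int sig, siginfo_t *info,
				 unsigned int flags)
{
	return (int)syscall(SYS_pidfd_send_signal, pidfd, sig, info, flags);
}

/*
 * Open a pidfd for a PID-only process and signal it through the fd.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int signal_by_pid(pid_t pid, int sig)
{
	int pidfd, ret, saved_errno;

	pidfd = sys_pidfd_open(pid, 0);
	if (pidfd < 0)
		return -1;

	/* NULL info: the kernel fills in the siginfo, as kill(2) would. */
	ret = sys_pidfd_send_signal(pidfd, sig, NULL, 0);

	/* close() may clobber errno, so preserve the syscall's value. */
	saved_errno = errno;
	close(pidfd);
	errno = saved_errno;
	return ret;
}
```

Once the pidfd is open, the signal cannot land on a recycled PID: if the original process has exited, the send fails instead of hitting an unrelated process.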
\| *	UPSTREAM: pidfd: add polling support	Joel Fernandes (Google)	2022-10-28
\| \|	This patch adds polling support to pidfd.

Android low memory killer (LMK) needs to know when a process dies once it is sent the kill signal. It does so by checking for the existence of /proc/pid which is both racy and slow. For example, a PID may be reused between when LMK sends a kill signal and when it checks for existence of the PID, so the wrong process is possibly checked for existence.

Using the polling support, LMK will be able to get notified when a process exits in a race-free and fast way, and the LMK is allowed to do other things (such as polling on other fds) while waiting for the process being killed to die.

For notification to polling processes, we follow the same existing mechanism in the kernel used when the parent of the task group is to be notified of a child's death (do_notify_parent). This is precisely when the tasks waiting on a poll of pidfd are also awakened in this patch.

We have decided to include the waitqueue in struct pid for the following reasons:
1. The wait queue has to survive for the lifetime of the poll. Including it in task_struct would not be an option in this case because the task can be reaped and destroyed before the poll returns.
2. Including the waitqueue in struct pid means that during de_thread(), the new thread group leader automatically gets the new waitqueue/pid even though its task_struct is different.

Appropriate test cases are added in the second patch to provide coverage of all the cases the patch is handling.

Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Daniel Colascione <dancol@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Tim Murray <timmurray@google.com>
Cc: Jonathan Kowalski <bl0pbl33p@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: kernel-team@android.com
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Co-developed-by: Daniel Colascione <dancol@google.com>
Signed-off-by: Daniel Colascione <dancol@google.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Christian Brauner <christian@brauner.io>
(cherry picked from commit b53b0b9d9a613c418057f6cb921c2f40a6f78c24)

Bug: 135608568
Test: test program using syscall(__NR_sys_pidfd_open,..) and poll()
Change-Id: I02f259d2875bec46b198d580edfbb067f077084e
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
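From userspace, the notification described in the polling patch amounts to a poll() on the pidfd: the descriptor becomes readable (POLLIN) once the process has exited. A minimal sketch, assuming the caller already holds a pidfd (for example from pidfd_open() or CLONE_PIDFD). pidfd_wait_exit() is a hypothetical helper name, not part of the patch.

```c
#include <errno.h>
#include <poll.h>

/*
 * Wait until the process behind pidfd has exited.
 * timeout_ms follows poll(2): -1 blocks, 0 returns immediately.
 * Returns 1 if the fd became readable (process exited), 0 on timeout,
 * -1 on error (errno is set by poll() when poll() itself failed).
 */
static int pidfd_wait_exit(int pidfd, int timeout_ms)
{
	struct pollfd pfd = { .fd = pidfd, .events = POLLIN };
	int ret;

	/* Retry if a signal interrupts us; note the timeout restarts. */
	do {
		ret = poll(&pfd, 1, timeout_ms);
	} while (ret < 0 && errno == EINTR);

	if (ret < 0)
		return -1;
	if (ret == 0)
		return 0;
	return (pfd.revents & POLLIN) ? 1 : -1;
}
```

Because this is an ordinary poll(), a supervisor such as LMK could put the pidfd into the same pollfd array as its other descriptors instead of dedicating a thread to each process it kills.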
\| *	UPSTREAM: signal: improve comments	Christian Brauner	2022-10-28
\| \|	Improve the comments for pidfd_send_signal(). First, the comment still referred to a file descriptor for a process as a "task file descriptor" which stems from way back at the beginning of the discussion. Replace this with "pidfd" for consistency. Second, the wording for the explanation of the arguments to the syscall was a bit inconsistent, e.g. some used the past tense, some used the present tense. Make the wording more consistent.

Signed-off-by: Christian Brauner <christian@brauner.io>
(cherry picked from commit c732327f04a3818f35fa97d07b1d64d31b691d78)

Bug: 135608568
Test: test program using syscall(__NR_sys_pidfd_open,..) and poll()
Change-Id: I06c6bdd1dddaeb8ac75a78dd21f9cdd0dc139a4c
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
\| *	UPSTREAM: fork: do not release lock that wasn't taken	Christian Brauner	2022-10-28
\| \|	Avoid calling cgroup_threadgroup_change_end() without having called cgroup_threadgroup_change_begin() first.

During process creation we need to check whether the cgroup we are in allows us to fork. To perform this check the cgroup needs to guard itself against threadgroup changes and takes a lock. Prior to CLONE_PIDFD the cleanup target "bad_fork_free_pid" would also need to call cgroup_threadgroup_change_end() because said lock had already been taken. However, this is not the case anymore with the addition of CLONE_PIDFD. We are now allocating a pidfd before we check whether the cgroup we're in can fork and thus prior to taking the lock. So when copy_process() fails at the right step it would release a lock we haven't taken.

This bug is not even very subtle to be honest. It's just not very clear from the naming of cgroup_threadgroup_change_{begin,end}() that a lock is taken.
Here's the relevant splat: entry_SYSENTER_compat+0x70/0x7f arch/x86/entry/entry_64_compat.S:139 RIP: 0023:0xf7fec849 Code: 85 d2 74 02 89 0a 5b 5d c3 8b 04 24 c3 8b 14 24 c3 8b 3c 24 c3 90 90 90 90 90 90 90 90 90 90 90 90 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 eb 0d 90 90 90 90 90 90 90 90 90 90 90 90 RSP: 002b:00000000ffed5a8c EFLAGS: 00000246 ORIG_RAX: 0000000000000078 RAX: ffffffffffffffda RBX: 0000000000003ffc RCX: 0000000000000000 RDX: 00000000200005c0 RSI: 0000000000000000 RDI: 0000000000000000 RBP: 0000000000000012 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 ------------[ cut here ]------------ DEBUG_LOCKS_WARN_ON(depth <= 0) WARNING: CPU: 1 PID: 7744 at kernel/locking/lockdep.c:4052 __lock_release kernel/locking/lockdep.c:4052 [inline] WARNING: CPU: 1 PID: 7744 at kernel/locking/lockdep.c:4052 lock_release+0x667/0xa00 kernel/locking/lockdep.c:4321 Kernel panic - not syncing: panic_on_warn set ... 
CPU: 1 PID: 7744 Comm: syz-executor007 Not tainted 5.1.0+ #4 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x172/0x1f0 lib/dump_stack.c:113 panic+0x2cb/0x65c kernel/panic.c:214 __warn.cold+0x20/0x45 kernel/panic.c:566 report_bug+0x263/0x2b0 lib/bug.c:186 fixup_bug arch/x86/kernel/traps.c:179 [inline] fixup_bug arch/x86/kernel/traps.c:174 [inline] do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:272 do_invalid_op+0x37/0x50 arch/x86/kernel/traps.c:291 invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:972 RIP: 0010:__lock_release kernel/locking/lockdep.c:4052 [inline] RIP: 0010:lock_release+0x667/0xa00 kernel/locking/lockdep.c:4321 Code: 0f 85 a0 03 00 00 8b 35 77 66 08 08 85 f6 75 23 48 c7 c6 a0 55 6b 87 48 c7 c7 40 25 6b 87 4c 89 85 70 ff ff ff e8 b7 a9 eb ff <0f> 0b 4c 8b 85 70 ff ff ff 4c 89 ea 4c 89 e6 4c 89 c7 e8 52 63 ff RSP: 0018:ffff888094117b48 EFLAGS: 00010086 RAX: 0000000000000000 RBX: 1ffff11012822f6f RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffffffff815af236 RDI: ffffed1012822f5b RBP: ffff888094117c00 R08: ffff888092bfc400 R09: fffffbfff113301d R10: fffffbfff113301c R11: ffffffff889980e3 R12: ffffffff8a451df8 R13: ffffffff8142e71f R14: ffffffff8a44cc80 R15: ffff888094117bd8 percpu_up_read.constprop.0+0xcb/0x110 include/linux/percpu-rwsem.h:92 cgroup_threadgroup_change_end include/linux/cgroup-defs.h:712 [inline] copy_process.part.0+0x47ff/0x6710 kernel/fork.c:2222 copy_process kernel/fork.c:1772 [inline] _do_fork+0x25d/0xfd0 kernel/fork.c:2338 __do_compat_sys_x86_clone arch/x86/ia32/sys_ia32.c:240 [inline] __se_compat_sys_x86_clone arch/x86/ia32/sys_ia32.c:236 [inline] __ia32_compat_sys_x86_clone+0xbc/0x140 arch/x86/ia32/sys_ia32.c:236 do_syscall_32_irqs_on arch/x86/entry/common.c:334 [inline] do_fast_syscall_32+0x281/0xd54 arch/x86/entry/common.c:405 entry_SYSENTER_compat+0x70/0x7f arch/x86/entry/entry_64_compat.S:139 RIP: 0023:0xf7fec849 
Code: 85 d2 74 02 89 0a 5b 5d c3 8b 04 24 c3 8b 14 24 c3 8b 3c 24 c3 90 90 90 90 90 90 90 90 90 90 90 90 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 eb 0d 90 90 90 90 90 90 90 90 90 90 90 90 RSP: 002b:00000000ffed5a8c EFLAGS: 00000246 ORIG_RAX: 0000000000000078 RAX: ffffffffffffffda RBX: 0000000000003ffc RCX: 0000000000000000 RDX: 00000000200005c0 RSI: 0000000000000000 RDI: 0000000000000000 RBP: 0000000000000012 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 Kernel Offset: disabled Rebooting in 86400 seconds.. Reported-and-tested-by: syzbot+3286e58549edc479faae@syzkaller.appspotmail.com Fixes: b3e583825266 ("clone: add CLONE_PIDFD") Signed-off-by: Christian Brauner <christian@brauner.io> (cherry picked from commit c3b7112df86b769927a60a6d7175988ca3d60f09) Bug: 135608568 Test: test program using syscall(__NR_sys_pidfd_open,..) and poll() Change-Id: Ib9ecb1e5c0c6e2d062b89c25109ec571570eb497 Signed-off-by: Suren Baghdasaryan <surenb@google.com>
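The bug pattern in the fork fix is easier to see in miniature. The toy is not kernel code: toy_copy_process(), its fail_step parameter, the labels, and the lock_depth counter are hypothetical stand-ins for copy_process(), the failing step, the cleanup targets, and the threadgroup lock. It shows the corrected ordering, where a failure before the lock is taken jumps past the unlock.

```c
/* Hypothetical stand-in for the lock behind cgroup_threadgroup_change_begin(). */
static int lock_depth;

static void toy_lock(void)   { lock_depth++; }
static void toy_unlock(void) { lock_depth--; }

/*
 * fail_step: 0 = succeed,
 *            1 = fail while allocating the pidfd (lock not yet taken),
 *            2 = fail the cgroup check (lock held).
 * Returns 0 on success, -1 on failure. lock_depth must be 0 afterwards.
 */
static int toy_copy_process(int fail_step)
{
	/* The pidfd is allocated first, before any lock is taken. */
	if (fail_step == 1)
		goto out_free_pid;

	toy_lock();
	if (fail_step == 2)
		goto out_unlock;

	toy_unlock();
	return 0;

out_unlock:
	/* Only paths that actually took the lock may land here. */
	toy_unlock();
out_free_pid:
	/* The pid would be released here; no lock is held on this path. */
	return -1;
}
```

If the first failure jumped to out_unlock instead, lock_depth would drop to -1, which is the userspace analogue of the DEBUG_LOCKS_WARN_ON(depth <= 0) warning in the splat.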
\| *	BACKPORT: signal: support CLONE_PIDFD with pidfd_send_signal	Christian Brauner	2022-10-28
\| \|	Let pidfd_send_signal() use pidfds retrieved via CLONE_PIDFD. With this patch pidfd_send_signal() becomes independent of procfs. This fulfills the request made when we merged the pidfd_send_signal() patchset. The pidfd_send_signal() syscall is now always available, allowing it to be used by users without procfs mounted or even users without procfs support compiled into the kernel.

Signed-off-by: Christian Brauner <christian@brauner.io>
Co-developed-by: Jann Horn <jannh@google.com>
Signed-off-by: Jann Horn <jannh@google.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: David Howells <dhowells@redhat.com>
Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
(cherry picked from commit 2151ad1b067275730de1b38c7257478cae47d29e)

Conflicts:
	kernel/sys_ni.c

(1. Replaced COND_SYSCALL with cond_syscall.)

Bug: 135608568
Test: test program using syscall(__NR_sys_pidfd_open,..) and poll()
Change-Id: I621fe6547397e0e68c560d7da60ef7715deb290c
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
\| *	BACKPORT: clone: add CLONE_PIDFD	Christian Brauner	2022-10-28
\| \|	This patchset makes it possible to retrieve pid file descriptors at process creation time by introducing the new flag CLONE_PIDFD to the clone() system call. Linus originally suggested to implement this as a new flag to clone() instead of making it a separate system call. As spotted by Linus, there is exactly one bit for clone() left.

CLONE_PIDFD creates file descriptors based on the anonymous inode implementation in the kernel that will also be used to implement the new mount api. They serve as a simple opaque handle on pids. Logically, this makes it possible to interpret a pidfd differently, narrowing or widening the scope of various operations (e.g. signal sending). Thus, a pidfd can refer not just to a tgid, but also to a tid, or in theory - given appropriate flag arguments in relevant syscalls - a process group or session. A pidfd does not represent a privilege. This does not imply it cannot ever be that way but for now this is not the case.

A pidfd comes with additional information in fdinfo if the kernel supports procfs. The fdinfo file contains the pid of the process in the caller's pid namespace in the same format as the procfs status file, i.e. "Pid:\t%d".

As suggested by Oleg, with CLONE_PIDFD the pidfd is returned in the parent_tidptr argument of clone. This has the advantage that we can give back the associated pid and the pidfd at the same time.

To remove worries about missing metadata access this patchset comes with a sample program that illustrates how a combination of CLONE_PIDFD and pidfd_send_signal() can be used to gain race-free access to process metadata through /proc/<pid>.
The sample program can easily be translated into a helper that would be suitable for inclusion in libc so that users don't have to worry about writing it themselves. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Christian Brauner <christian@brauner.io> Co-developed-by: Jann Horn <jannh@google.com> Signed-off-by: Jann Horn <jannh@google.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Kees Cook <keescook@chromium.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: David Howells <dhowells@redhat.com> Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com> Cc: Andy Lutomirsky <luto@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> (cherry picked from commit b3e5838252665ee4cfa76b82bdf1198dca81e5be) Conflicts: kernel/fork.c (1. Replaced proc_pid_ns() with its direct implementation.) Bug: 135608568 Test: test program using syscall(__NR_sys_pidfd_open,..) and poll() Change-Id: I3c804a92faea686e5bf7f99df893fe3a5d87ddf7 Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: electimon <electimon@gmail.com>
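The fdinfo format mentioned in the CLONE_PIDFD message ("Pid:\t%d") is simple enough to parse by hand. The sketch is an illustration only: parse_fdinfo_pid() is a hypothetical helper, and it assumes the caller has already read the pidfd's fdinfo file (under /proc/self/fdinfo/) into a NUL-terminated buffer.

```c
#include <stdio.h>
#include <string.h>

/*
 * Scan fdinfo text line by line for a line starting with "Pid:".
 * Returns the pid on that line, or -1 if no such line is present.
 */
static int parse_fdinfo_pid(const char *text)
{
	const char *line = text;

	while (line && *line) {
		int pid;

		/* Literal "Pid:" must match at the start of the line. */
		if (sscanf(line, "Pid:\t%d", &pid) == 1)
			return pid;

		/* Advance to the character after the next newline, if any. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}
```

A caller would typically compare the parsed value with the pid it got back from clone() before trusting anything it then reads from /proc/<pid>.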
\| *	UPSTREAM: signal: use fdget() since we don't allow O_PATH	Christian Brauner	2022-10-28
\| \|	As stated in the original commit for pidfd_send_signal() we don't allow signaling processes through O_PATH file descriptors since it is semantically equivalent to a write on the pidfd.

We already correctly error out right now and return EBADF if an O_PATH fd is passed. This is because we use file->f_op to detect whether a pidfd is passed and O_PATH fds have their file->f_op set to empty_fops in do_dentry_open() and thus fail the test. Thus, there is no regression. It's just semantically correct to use fdget() and return an error right from there instead of taking a reference and returning an error later.

Signed-off-by: Christian Brauner <christian@brauner.io>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jann Horn <jann@thejh.net>
Cc: David Howells <dhowells@redhat.com>
Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 738a7832d21e3d911fcddab98ce260b79010b461)

Bug: 135608568
Test: test program using syscall(__NR_pidfd_send_signal,..) to send SIGKILL
Change-Id: Id52eaadf9da371fb2d9caae4df49627760de7229
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
\| *	UPSTREAM: signal: don't silently convert SI_USER signals to non-current pidfd	Jann Horn	2022-10-28
\| \|	The current sys_pidfd_send_signal() silently turns signals with explicit SI_USER context that are sent to non-current tasks into signals with kernel-generated siginfo. This is unlike do_rt_sigqueueinfo(), which returns -EPERM in this case. If a user actually wants to send a signal with kernel-provided siginfo, they can do that with

	pidfd_send_signal(pidfd, sig, NULL, 0);

so allowing this case is unnecessary.

Instead of silently replacing the siginfo, just bail out with an error; this is consistent with other interfaces and avoids special-casing behavior based on security checks.

Fixes: 3eb39f47934f ("signal: add pidfd_send_signal() syscall")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Christian Brauner <christian@brauner.io>
(cherry picked from commit 556a888a14afe27164191955618990fb3ccc9aad)

Bug: 135608568
Test: test program using syscall(__NR_pidfd_send_signal,..) to send SIGKILL
Change-Id: I493af671b82c43bff1425ee24550d2fb9aa6d961
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
\| *	BACKPORT: signal: add pidfd_send_signal() syscall	Christian Brauner	2022-10-28
\| \|	The kill() syscall operates on process identifiers (pid). After a process has exited its pid can be reused by another process.
If a caller sends a signal to a reused pid it will end up signaling the wrong process. This issue has often surfaced and there has been a push to address this problem [1].

This patch uses file descriptors (fd) from /proc/<pid> as stable handles on struct pid. Even if a pid is recycled the handle will not change. The fd can be used to send signals to the process it refers to. Thus, the new syscall pidfd_send_signal() is introduced to solve this problem. Instead of pids it operates on process fds (pidfd).

/* prototype and argument */
long pidfd_send_signal(int pidfd, int sig, siginfo_t *info, unsigned int flags);

/* syscall number 424 */
The syscall number was chosen to be 424 to align with Arnd's rework in his y2038 to minimize merge conflicts (cf. [25]).

In addition to the pidfd and signal argument it takes an additional siginfo_t and flags argument. If the siginfo_t argument is NULL then pidfd_send_signal() is equivalent to kill(<positive-pid>, <signal>). If it is not NULL pidfd_send_signal() is equivalent to rt_sigqueueinfo(). The flags argument is added to allow for future extensions of this syscall. It currently needs to be passed as 0. Failing to do so will cause EINVAL.

/* pidfd_send_signal() replaces multiple pid-based syscalls */
The pidfd_send_signal() syscall currently takes on the job of rt_sigqueueinfo(2) and parts of the functionality of kill(2), namely, when a positive pid is passed to kill(2). It will however be possible to also replace tgkill(2) and rt_tgsigqueueinfo(2) if this syscall is extended.

/* sending signals to threads (tid) and process groups (pgid) */
Specifically, the pidfd_send_signal() syscall does currently not operate on process groups or threads. This is left for future extensions. In order to extend the syscall to allow sending signals to threads and process groups, appropriately named flags (e.g. PIDFD_TYPE_PGID, and PIDFD_TYPE_TID) should be added.
This implies that the flags argument will determine what is signaled and not the file descriptor itself. Put in other words, grouping in this api is a property of the flags argument not a property of the file descriptor (cf. [13]). Clarification for this has been requested by Eric (cf. [19]).
When appropriate extensions through the flags argument are added then pidfd_send_signal() can additionally replace the part of kill(2) which operates on process groups as well as the tgkill(2) and rt_tgsigqueueinfo(2) syscalls. How such an extension could be implemented has been very roughly sketched in [14], [15], and [16]. However, this should not be taken as a commitment to a particular implementation. There might be better ways to do it. Right now this is intentionally left out to keep this patchset as simple as possible (cf. [4]).

/* naming */
The syscall had various names throughout iterations of this patchset:
- procfd_signal()
- procfd_send_signal()
- taskfd_send_signal()
In the last round of reviews it was pointed out that, given that the flags argument decides the scope of the signal instead of different types of fds, it might make sense to either settle for "procfd_" or "pidfd_" as prefix. The community was willing to accept either (cf. [17] and [18]). Given that one developer expressed strong preference for the "pidfd_" prefix (cf. [13]) and with other developers less opinionated about the name we should settle for "pidfd_" to avoid further bikeshedding.

The "_send_signal" suffix was chosen to reflect the fact that the syscall takes on the job of multiple syscalls. It is therefore intentional that the name is not reminiscent of either kill(2) or rt_sigqueueinfo(2). Not the former because it might imply that pidfd_send_signal() is a replacement for kill(2), and not the latter because it is a hassle to remember the correct spelling - especially for non-native speakers - and because it is not descriptive enough of what the syscall actually does.
The name "pidfd_send_signal" makes it very clear that its job is to send signals.

/* zombies */
Zombies can be signaled just as any other process. No special error will be reported since a zombie state is an unreliable state (cf. [3]). However, this can be added as an extension through the @flags argument if the need ever arises.

/* cross-namespace signals */
The patch currently enforces that the signaler and signalee either are in the same pid namespace or that the signaler's pid namespace is an ancestor of the signalee's pid namespace. This is done for the sake of simplicity and because it is unclear to what values certain members of struct siginfo_t would need to be set to (cf. [5], [6]).

/* compat syscalls */
It became clear that we would like to avoid adding compat syscalls (cf. [7]). The compat syscall handling is now done in kernel/signal.c itself by adding __copy_siginfo_from_user_generic() which lets us avoid compat syscalls (cf. [8]). It should be noted that the addition of __copy_siginfo_from_user_any() is caused by a bug in the original implementation of rt_sigqueueinfo(2) (cf. [12]). With upcoming rework for syscall handling things might improve significantly (cf. [11]) and __copy_siginfo_from_user_any() will not gain any additional callers.

/* testing */
This patch was tested on x64 and x86.

/* userspace usage */
An asciinema recording for the basic functionality can be found under [9].
With this patch a process can be killed via:
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

static inline int do_pidfd_send_signal(int pidfd, int sig, siginfo_t *info,
                                       unsigned int flags)
{
#ifdef __NR_pidfd_send_signal
        return syscall(__NR_pidfd_send_signal, pidfd, sig, info, flags);
#else
        /* Report the failure through errno, as syscall() itself would. */
        errno = ENOSYS;
        return -1;
#endif
}

int main(int argc, char *argv[])
{
        int fd, ret, saved_errno, sig;

        if (argc < 3)
                exit(EXIT_FAILURE);

        /* argv[1] is a /proc/<pid> directory; the open fd is the pidfd. */
        fd = open(argv[1], O_DIRECTORY | O_CLOEXEC);
        if (fd < 0) {
                printf("%s - Failed to open \"%s\"\n", strerror(errno), argv[1]);
                exit(EXIT_FAILURE);
        }

        sig = atoi(argv[2]);

        printf("Sending signal %d to process %s\n", sig, argv[1]);
        ret = do_pidfd_send_signal(fd, sig, NULL, 0);

        /* close() may clobber errno, so preserve it for the error message. */
        saved_errno = errno;
        close(fd);
        errno = saved_errno;

        if (ret < 0) {
                printf("%s - Failed to send signal %d to process %s\n",
                       strerror(errno), sig, argv[1]);
                exit(EXIT_FAILURE);
        }

        exit(EXIT_SUCCESS);
}
/* Q&A
 * Given that it seems the same questions get asked again by people who are
 * late to the party it makes sense to add a Q&A section to the commit
 * message so it's hopefully easier to avoid duplicate threads.
 *
 * For the sake of progress please consider these arguments settled unless
 * there is a new point that desperately needs to be addressed. Please make
 * sure to check the links to the threads in this commit message whether
 * this has not already been covered.
 */
Q-01: (Florian Weimer [20], Andrew Morton [21]) What happens when the target process has exited?
A-01: Sending the signal will fail with ESRCH (cf. [22]).
Q-02: (Andrew Morton [21]) Is the task_struct pinned by the fd?
A-02: No. A reference to struct pid is kept. struct pid - as far as I understand - was created exactly for the reason to not require to pin struct task_struct (cf. [22]).
Q-03: (Andrew Morton [21]) Does the entire procfs directory remain visible? Just one entry within it?
A-03: The same thing that happens right now when you hold a file descriptor to /proc/<pid> open (cf. [22]).
Q-04: (Andrew Morton [21]) Does the pid remain reserved?
A-04: No. This patchset guarantees a stable handle, not that pids are not recycled (cf. [22]).
Q-05: (Andrew Morton [21]) Do attempts to signal that fd return errors?
A-05: See {Q,A}-01.
Q-06: (Andrew Morton [22]) Is there a cleaner way of obtaining the fd? Another syscall perhaps.
A-06: Userspace can already trivially retrieve file descriptors from procfs so this is something that we will need to support anyway. Hence, there's no immediate need to add another syscall just to make pidfd_send_signal() not dependent on the presence of procfs. However, adding a syscall to get such file descriptors is planned for a future patchset (cf. [22]).
Q-07: (Andrew Morton [21] and others) This fd-for-a-process sounds like a handy thing and people may well think up other uses for it in the future, probably unrelated to signals. Are the code and the interface designed to permit such future applications?
A-07: Yes (cf. [22]).
Q-08: (Andrew Morton [21] and others) Now I think about it, why a new syscall? This thing is looking rather like an ioctl?
A-08: This has been extensively discussed. It was agreed that a syscall is preferred for a variety of reasons. Here are just a few taken from prior threads. Syscalls are safer than ioctl()s especially when signaling to fds. Processes are a core kernel concept so a syscall seems more appropriate. The layout of the syscall with its four arguments would require the addition of a custom struct for the ioctl() thereby causing at least the same amount or even more complexity for userspace than a simple syscall. The new syscall will replace multiple other pid-based syscalls (see description above).
The file-descriptors-for-processes concept introduced with this syscall will be extended with other syscalls in the future. See also [22], [23] and various other threads already linked in here.
Q-09: (Florian Weimer [24]) What happens if you use the new interface with an O_PATH descriptor?
A-09: pidfds opened as O_PATH fds cannot be used to send signals to a process (cf. [2]). Signaling processes through pidfds is the equivalent of writing to a file. Thus, this is not an operation that operates "purely at the file descriptor level" as required by the open(2) manpage. See also [4].
/* References */
[1]: https://lore.kernel.org/lkml/20181029221037.87724-1-dancol@google.com/
[2]: https://lore.kernel.org/lkml/874lbtjvtd.fsf@oldenburg2.str.redhat.com/
[3]: https://lore.kernel.org/lkml/20181204132604.aspfupwjgjx6fhva@brauner.io/
[4]: https://lore.kernel.org/lkml/20181203180224.fkvw4kajtbvru2ku@brauner.io/
[5]: https://lore.kernel.org/lkml/20181121213946.GA10795@mail.hallyn.com/
[6]: https://lore.kernel.org/lkml/20181120103111.etlqp7zop34v6nv4@brauner.io/
[7]: https://lore.kernel.org/lkml/36323361-90BD-41AF-AB5B-EE0D7BA02C21@amacapital.net/
[8]: https://lore.kernel.org/lkml/87tvjxp8pc.fsf@xmission.com/
[9]: https://asciinema.org/a/IQjuCHew6bnq1cr78yuMv16cy
[11]: https://lore.kernel.org/lkml/F53D6D38-3521-4C20-9034-5AF447DF62FF@amacapital.net/
[12]: https://lore.kernel.org/lkml/87zhtjn8ck.fsf@xmission.com/
[13]: https://lore.kernel.org/lkml/871s6u9z6u.fsf@xmission.com/
[14]: https://lore.kernel.org/lkml/20181206231742.xxi4ghn24z4h2qki@brauner.io/
[15]: https://lore.kernel.org/lkml/20181207003124.GA11160@mail.hallyn.com/
[16]: https://lore.kernel.org/lkml/20181207015423.4miorx43l3qhppfz@brauner.io/
[17]: https://lore.kernel.org/lkml/CAGXu5jL8PciZAXvOvCeCU3wKUEB_dU-O3q0tDw4uB_ojMvDEew@mail.gmail.com/
[18]: https://lore.kernel.org/lkml/20181206222746.GB9224@mail.hallyn.com/
[19]: https://lore.kernel.org/lkml/20181208054059.19813-1-christian@brauner.io/
[20]: https://lore.kernel.org/lkml/8736rebl9s.fsf@oldenburg.str.redhat.com/
[21]: https://lore.kernel.org/lkml/20181228152012.dbf0508c2508138efc5f2bbe@linux-foundation.org/
[22]: https://lore.kernel.org/lkml/20181228233725.722tdfgijxcssg76@brauner.io/
[23]: https://lwn.net/Articles/773459/
[24]: https://lore.kernel.org/lkml/8736rebl9s.fsf@oldenburg.str.redhat.com/
[25]: https://lore.kernel.org/lkml/CAK8P3a0ej9NcJM8wXNPbcGUyOUZYX+VLoDFdbenW3s3114oQZw@mail.gmail.com/
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Jann Horn <jannh@google.com>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Christian Brauner <christian@brauner.io>
Reviewed-by: Tycho Andersen <tycho@tycho.ws>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: David Howells <dhowells@redhat.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Serge Hallyn <serge@hallyn.com>
Acked-by: Aleksa Sarai <cyphar@cyphar.com>
(cherry picked from commit 3eb39f47934f9d5a3027fe00d906a45fe3a15fad)
Conflicts:
arch/x86/entry/syscalls/syscall_32.tbl - trivial manual merge
arch/x86/entry/syscalls/syscall_64.tbl - trivial manual merge
include/linux/proc_fs.h - trivial manual merge
include/linux/syscalls.h - trivial manual merge
include/uapi/asm-generic/unistd.h - trivial manual merge
kernel/signal.c - struct kernel_siginfo does not exist in 4.14
kernel/sys_ni.c - cond_syscall is used instead of COND_SYSCALL
arch/x86/entry/syscalls/syscall_32.tbl
arch/x86/entry/syscalls/syscall_64.tbl
(1. manual merges because of 4.14 differences 2. change prepare_kill_siginfo() to use struct siginfo instead of kernel_siginfo 3. use copy_from_user() instead of copy_siginfo_from_user() in copy_siginfo_from_user_any() 4. replaced COND_SYSCALL with cond_syscall 5. 
Removed __ia32_sys_pidfd_send_signal in arch/x86/entry/syscalls/syscall_32.tbl. 6. Replaced __x64_sys_pidfd_send_signal with sys_pidfd_send_signal in arch/x86/entry/syscalls/syscall_64.tbl.) Bug: 135608568 Test: test program using syscall(__NR_pidfd_send_signal,..) to send SIGKILL Change-Id: I34da11c63ac8cafb0353d9af24c820cef519ec27 Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: electimon <electimon@gmail.com>
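A-01 and A-04 in the message above say that the fd is a stable handle: once the target has exited and been reaped, signaling through the fd fails instead of hitting a recycled pid. The following is a minimal sketch of how that could be observed from userspace. The helper names pidfd_send_signal_raw() and signal_reaped_child() are hypothetical, and the sketch assumes procfs is mounted at /proc. On a kernel carrying this patch the expected result is ESRCH; on other kernels the call may fail differently, for example with ENOSYS.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Raw wrapper; fails with ENOSYS if the headers lack the syscall number. */
int pidfd_send_signal_raw(int pidfd, int sig, siginfo_t *info,
			  unsigned int flags)
{
#ifdef __NR_pidfd_send_signal
	return (int)syscall(__NR_pidfd_send_signal, pidfd, sig, info, flags);
#else
	(void)pidfd; (void)sig; (void)info; (void)flags;
	errno = ENOSYS;
	return -1;
#endif
}

/*
 * Hypothetical helper: opens /proc/<pid> for a child, lets the child die
 * and be reaped, then probes it with signal 0 through the fd.
 * Returns 0 if the kernel accepted the signal, otherwise the errno value.
 */
int signal_reaped_child(void)
{
	char path[64];
	int fd, ret, err;
	pid_t pid = fork();

	if (pid < 0)
		return errno;
	if (pid == 0) {
		pause();	/* child: wait until the parent kills us */
		_exit(0);
	}

	/* Take the handle while the child is still alive. */
	snprintf(path, sizeof(path), "/proc/%d", (int)pid);
	fd = open(path, O_DIRECTORY | O_CLOEXEC);
	err = errno;

	/* Kill and reap the child so it is neither running nor a zombie. */
	kill(pid, SIGKILL);
	waitpid(pid, NULL, 0);

	if (fd < 0)
		return err;

	ret = pidfd_send_signal_raw(fd, 0, NULL, 0);
	err = ret < 0 ? errno : 0;
	close(fd);
	return err;
}
```

A caller could compare the return value of signal_reaped_child() against ESRCH to confirm the behaviour described in A-01. The waitpid() call matters: before the child is reaped it is a zombie, and the message above states that zombies can still be signaled.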
\| *	bpf: Add new cgroup attach type to enable sock modifications	David Ahern	2022-10-28
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Add new cgroup based program type, BPF_PROG_TYPE_CGROUP_SOCK. Similar to BPF_PROG_TYPE_CGROUP_SKB, programs can be attached to a cgroup and run any time a process in the cgroup opens an AF_INET or AF_INET6 socket. Currently only sk_bound_dev_if is exported to userspace for modification by a bpf program. This allows a cgroup to be configured such that AF_INET{6} sockets opened by processes are automatically bound to a specific device. In turn, this enables the running of programs that do not support SO_BINDTODEVICE in a specific VRF context / L3 domain. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Change-Id: I96a6f6f8f650c494d8c173dbb42580a25698368e
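The binding behaviour described above could be expressed as a small BPF program along the following lines. This is a sketch under assumptions, not the code from the commit: VRF_IFINDEX is a hypothetical placeholder for the ifindex of the target device, the "cgroup/sock" section name follows the libbpf convention for this program type, and the program sees the exported field as bound_dev_if in struct bpf_sock from <linux/bpf.h>.

```c
#include <linux/bpf.h>

/* Place a symbol in a named ELF section so a loader can find it. */
#define SEC(name) __attribute__((section(name), used))

/* Hypothetical ifindex of the VRF device; look up the real value at runtime. */
#define VRF_IFINDEX 4

/*
 * Runs when a process in the attached cgroup creates an AF_INET or
 * AF_INET6 socket. Binds the new socket to the chosen device.
 */
SEC("cgroup/sock")
int bind_sock_to_vrf(struct bpf_sock *sk)
{
	sk->bound_dev_if = VRF_IFINDEX;

	return 1;	/* 1 lets socket creation proceed, 0 would reject it */
}

char _license[] SEC("license") = "GPL";
```

Such an object would typically be built with clang for the bpf target and attached to a cgroup with the BPF_CGROUP_INET_SOCK_CREATE attach type that this commit introduces. Processes in that cgroup then get sockets bound to the device without calling SO_BINDTODEVICE themselves.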