path: root/include/linux
* Merge remote-tracking branch 'msm8998/lineage-20' into lineage-20 (Raghuram Subramani, 2024-10-17)
    Change-Id: I126075a330f305c85f8fe1b8c9d408f368be95d1
* Merge lineage-20 of git@github.com:LineageOS/android_kernel_qcom_msm8998.git ↵ (Davide Garberi, 2023-08-06)
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | into lineage-20 7d11b1a7a11c Revert "sched: cpufreq: Use sched_clock instead of rq_clock when updating schedutil" daaa5da96a74 sched: Take irq_sparse lock during the isolation 217ab2d0ef91 rcu: Speed up calling of RCU tasks callbacks 997b726bc092 kernel: power: Workaround for sensor ipc message causing high power consume b933e4d37bc0 sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices 82d3f23d6dc5 sched/fair: Fix bandwidth timer clock drift condition 629bfed360f9 kernel: power: qos: remove check for core isolation while cluster LPMs 891a63210e1d sched/fair: Fix issue where frequency update not skipped b775cb29f663 ANDROID: Move schedtune en/dequeue before schedutil update triggers ebdb82f7b34a sched/fair: Skip frequency updates if CPU about to idle ff383d94478a FROMLIST: sched: Make iowait_boost optional in schedutil 9539942cb065 FROMLIST: cpufreq: Make iowait boost a policy option b65c91c9aa14 ARM: dts: msm: add HW CPU's busy-cost-data for additional freqs 72f13941085b ARM: dts: msm: fix CPU's idle-cost-data ab88411382f7 ARM: dts: msm: fix EM to be monotonically increasing 83dcbae14782 ARM: dts: msm: Fix EAS idle-cost-data property length 33d3b17bfdfb ARM: dts: msm: Add msm8998 energy model c0fa7577022c sched/walt: Re-add code to allow WALT to function d5cd35f38616 FROMGIT: binder: use EINTR for interrupted wait for work db74739c86de sched: Don't fail isolation request for an already isolated CPU aee7a16e347b sched: WALT: increase WALT minimum window size to 20ms 4dbe44554792 sched: cpufreq: Use per_cpu_ptr instead of this_cpu_ptr when reporting load ef3fb04c7df4 sched: cpufreq: Use sched_clock instead of rq_clock when updating schedutil c7128748614a sched/cpupri: Exclude isolated CPUs from the lowest_mask 6adb092856e8 sched: cpufreq: Limit governor updates to WALT changes alone 0fa652ee00f5 sched: walt: Correct WALT window size initialization 41cbb7bc59fb sched: walt: fix window misalignment when HZ=300 43cbf9d6153d sched/tune: Increase the cgroup limit to 6 c71b8fffe6b3 drivers: cpuidle: lpm-levels: Fix KW issues with idle state idx < 0 938e42ca699f drivers: cpuidle: lpm-levels: Correctly check for list empty 8d8a48aecde5 sched/fair: Fix load_balance() affinity redo path eccc8acbe705 sched/fair: Avoid unnecessary active load balance 0ffdb886996b BACKPORT: sched/core: Fix rules for running on online && !active CPUs c9999f04236e sched/core: Allow kthreads to fall back to online && !active cpus b9b6bc6ea3c0 sched: Allow migrating kthreads into online but inactive CPUs a9314f9d8ad4 sched/fair: Allow load bigger task load balance when nr_running is 2 c0b317c27d44 pinctrl: qcom: Clear status bit on irq_unmask 45df1516d04a UPSTREAM: mm: fix misplaced unlock_page in do_wp_page() 899def5edcd4 UPSTREAM: mm/ksm: Remove reuse_ksm_page() 46c6fbdd185a BACKPORT: mm: do_wp_page() simplification 90dccbae4c04 UPSTREAM: mm: reuse only-pte-mapped KSM page in do_wp_page() ebf270d24640 sched/fair: vruntime should normalize when switching from fair cbe0b37059c9 mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct 12d40f1995b4 msm: mdss: Fix indentation 620df03a7229 msm: mdss: Treat polling_en as the bool that it is 12af218146a6 msm: mdss: add idle state node 13e661759656 cpuset: Restore tasks 
affinity while moving across cpusets 602bf4096dab genirq: Honour IRQ's affinity hint during migration 9209b5556f6a power: qos: Use effective affinity mask f31078b5825f genirq: Introduce effective affinity mask 58c453484f7e sched/cputime: Mitigate performance regression in times()/clock_gettime() 400383059868 kernel: time: Add delay after cpu_relax() in tight loops 1daa7ea39076 pinctrl: qcom: Update irq handle for GPIO pins 07f7c9961c7c power: smb-lib: Fix mutex acquisition deadlock on PD hard reset 094b738f46c8 power: qpnp-smb2: Implement battery charging_enabled node d6038d6da57f ASoC: msm-pcm-q6-v2: Add dsp buf check 0d7a6c301af8 qcacld-3.0: Fix OOB in wma_scan_roam.c Change-Id: Ia2e189e37daad6e99bdb359d1204d9133a7916f4
| * FROMLIST: cpufreq: Make iowait boost a policy option (Joel Fernandes, 2023-07-16)
    Make iowait boost a cpufreq policy option and enable it for the intel_pstate
    cpufreq driver. Governors like schedutil can use it to determine if boosting
    for tasks that wake up with p->in_iowait set is needed.

    Bug: 38010527
    Link: https://lkml.org/lkml/2017/5/19/43
    Change-Id: Icf59e75fbe731dc67abb28fb837f7bb0cd5ec6cc
    Signed-off-by: Joel Fernandes <joelaf@google.com>
| * sched/walt: Re-add code to allow WALT to function (Ethan Chen, 2023-07-16)
    Change-Id: Ieb1067c5e276f872ed4c722b7d1fabecbdad87e7
| * sched: cpufreq: Limit governor updates to WALT changes alone (Vikram Mulukutla, 2023-07-16)
    It's not necessary to keep reporting load to the governor if it doesn't change
    in a window. Limit updates to when we expect load changes - after window
    rollover and when we send updates related to intercluster migrations.

    [beykerykt]: Adapt for HMP

    Change-Id: I3232d40f3d54b0b81cfafdcdb99b534df79327bf
    Signed-off-by: Vikram Mulukutla <markivx@codeaurora.org>
| * UPSTREAM: mm/ksm: Remove reuse_ksm_page() (Peter Xu, 2023-07-16)
    Remove the function as the last reference has gone away with the do_wp_page()
    changes.

    Signed-off-by: Peter Xu <peterx@redhat.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Change-Id: Ie4da88791abe9407157566854b2db9b94c0c962f
| * UPSTREAM: mm: reuse only-pte-mapped KSM page in do_wp_page() (Kirill Tkhai, 2023-07-16)
    Add an optimization for KSM pages almost in the same way that we have for
    ordinary anonymous pages. If there is a write fault in a page which is mapped
    to an only pte and is not related to the swap cache, the page may be reused
    without copying its content.

    [ Note that we do not consider PageSwapCache() pages at least for now, since we
    don't want to complicate __get_ksm_page(), which has a nice optimization based
    on this (for the migration case). Currently it is spinning on PageSwapCache()
    pages, waiting for when they have unfrozen counters (i.e., for the migration to
    finish). But we don't want to make it also spin on swap cache pages, which we
    try to reuse, since there is not a very high probability to reuse them. So, for
    now we do not consider PageSwapCache() pages at all. ]

    So in reuse_ksm_page() we check for 1) PageSwapCache() and 2) page_stable_node(),
    to skip a page which KSM is currently trying to link to the stable tree. Then we
    do page_ref_freeze() to prohibit KSM from merging one more page into the page we
    are reusing. After that, nobody can refer to the page being reused: KSM skips
    !PageSwapCache() pages with zero refcount, and the protection against all other
    participants is the same as for reused ordinary anon pages: pte lock, page lock
    and mmap_sem.

    [akpm@linux-foundation.org: replace BUG_ON()s with WARN_ON()s]
    Link: http://lkml.kernel.org/r/154471491016.31352.1168978849911555609.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
    Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
    Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Christian Koenig <christian.koenig@amd.com>
    Cc: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Huang Ying <ying.huang@intel.com>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Change-Id: If32387b1f7c36f0e12fcbb0926bf1b67886ec594
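    As a simplified sketch (not the upstream code) of the reuse test described
    above: skip swap-cache pages and pages KSM has not yet linked to the stable
    tree, then freeze the refcount so KSM cannot merge another page in.
    page_stable_node() is a KSM-internal helper, and try_reuse_ksm_page() is a
    hypothetical name used only for illustration.

        #include <linux/mm_types.h>
        #include <linux/page-flags.h>
        #include <linux/page_ref.h>

        /* Hypothetical, simplified version of the reuse check described above. */
        static bool try_reuse_ksm_page(struct page *page)
        {
                /* skip swap-cache pages and pages still being linked to the stable tree */
                if (PageSwapCache(page) || !page_stable_node(page))
                        return false;

                /* freeze the refcount so KSM cannot merge another page into this one */
                if (!page_ref_freeze(page, 1))
                        return false;

                /* ... the caller remaps the page writable, then ... */
                page_ref_unfreeze(page, 1);
                return true;
        }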
| * mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct (Yang Shi, 2023-07-16)
    mmap_sem is on the hot path of the kernel, and it is very contended, but it is
    abused too. It is used to protect arg_start|end and env_start|end when reading
    /proc/$PID/cmdline and /proc/$PID/environ, but that doesn't make sense since
    those proc files just expect to read 4 values atomically; they are not related
    to VM and could be set to arbitrary values by C/R. And the mmap_sem contention
    may cause unexpected issues like the one below:

        INFO: task ps:14018 blocked for more than 120 seconds.
        Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1
        "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        ps D 0 14018 1 0x00000004
        Call Trace:
         schedule+0x36/0x80
         rwsem_down_read_failed+0xf0/0x150
         call_rwsem_down_read_failed+0x18/0x30
         down_read+0x20/0x40
         proc_pid_cmdline_read+0xd9/0x4e0
         __vfs_read+0x37/0x150
         vfs_read+0x96/0x130
         SyS_read+0x55/0xc0
         entry_SYSCALL_64_fastpath+0x1a/0xc5

    Both Alexey Dobriyan and Michal Hocko suggested using a dedicated lock for them
    to mitigate the abuse of mmap_sem. So, introduce a new spinlock in mm_struct to
    protect the concurrent access to arg_start|end, env_start|end and others, as
    well as replace the write mmap_sem with read to protect the race condition
    between prctl and sys_brk which might break check_data_rlimit(), and make prctl
    more friendly to other VM operations.

    This patch just eliminates the abuse of mmap_sem, but it can't resolve the above
    hung task warning completely since the later access_remote_vm() call still needs
    to acquire mmap_sem. The mmap_sem scalability issue will be solved in the future.

    Change-Id: Ifa8f001ee2fc4f0ce60c18e771cebcf8a1f0943e
    [yang.shi@linux.alibaba.com: add comment about mmap_sem and arg_lock]
    Link: http://lkml.kernel.org/r/1524077799-80690-1-git-send-email-yang.shi@linux.alibaba.com
    Link: http://lkml.kernel.org/r/1523730291-109696-1-git-send-email-yang.shi@linux.alibaba.com
    Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
    Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Cc: Alexey Dobriyan <adobriyan@gmail.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mateusz Guzik <mguzik@redhat.com>
    Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Git-commit: 88aa7cc688d48ddd84558b41d5905a0db9535c4b
    Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
    Signed-off-by: Srinivas Ramana <sramana@codeaurora.org>
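    A minimal sketch of the resulting locking rule, assuming the 4.x proc code
    paths; get_cmdline_range() is an illustrative helper, not a function added by
    the patch.

        #include <linux/mm_types.h>
        #include <linux/spinlock.h>

        /* Illustrative helper: read the cmdline range under the new arg_lock
         * instead of taking mmap_sem just for these four values. */
        static void get_cmdline_range(struct mm_struct *mm,
                                      unsigned long *start, unsigned long *end)
        {
                spin_lock(&mm->arg_lock);
                *start = mm->arg_start;
                *end   = mm->arg_end;
                spin_unlock(&mm->arg_lock);
                /* walking the actual user pages still requires mmap_sem */
        }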
| * cpuset: Restore tasks affinity while moving across cpusets (Pavankumar Kondeti, 2023-07-16)
    When tasks move across cpusets, the current affinity settings are lost. Cache
    the task affinity and restore it during cpuset migration. The restoring happens
    only when the cached affinity is a subset of the current cpuset settings.

    Change-Id: I6c2ec1d5e3d994e176926d94b9e0cc92418020cc
    Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
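    A rough sketch of the restore rule described above; the saved-mask bookkeeping
    and the helper name are illustrative, not the actual patch.

        #include <linux/cpumask.h>
        #include <linux/sched.h>

        /* Illustrative: apply the cached affinity only if it fits inside the CPUs
         * allowed by the destination cpuset, otherwise fall back to the cpuset's
         * own mask. */
        static void restore_task_affinity(struct task_struct *p,
                                          const struct cpumask *cached,
                                          const struct cpumask *cpuset_cpus)
        {
                if (cpumask_subset(cached, cpuset_cpus))
                        set_cpus_allowed_ptr(p, cached);
                else
                        set_cpus_allowed_ptr(p, cpuset_cpus);
        }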
| * genirq: Introduce effective affinity mask (Thomas Gleixner, 2023-07-16)
    There is currently no way to evaluate the effective affinity mask of a given
    interrupt. Many irq chips allow only a single target CPU or a subset of CPUs in
    the affinity mask. Updating the mask at the time of setting the affinity to the
    subset would be counterproductive because information for cpu hotplug about
    assigned interrupt affinities gets lost. On CPU hotplug it's also pointless to
    force migrate an interrupt which is not effectively targeted at the CPU. But
    currently the information is not available.

    Provide a separate mask to be updated by the irq_chip->irq_set_affinity()
    implementations. Implement the read-only proc files so the user can see the
    effective mask as well w/o trying to deduce it from /proc/interrupts.

    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Marc Zyngier <marc.zyngier@arm.com>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Cc: Keith Busch <keith.busch@intel.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Christoph Hellwig <hch@lst.de>
    Link: http://lkml.kernel.org/r/20170619235446.247834245@linutronix.de
    Change-Id: Ibeec0031edb532d52cb411286f785aec160d6139
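    A sketch of how an irq chip driver is expected to use the new mask, assuming
    the companion helper irq_data_update_effective_affinity() from the same
    upstream series; the driver callback itself is hypothetical.

        #include <linux/irq.h>

        /* Hypothetical irq_set_affinity() implementation: the hardware can only
         * target one CPU, so record that CPU in the effective affinity mask
         * rather than shrinking the user-requested mask. */
        static int demo_irq_set_affinity(struct irq_data *d,
                                         const struct cpumask *mask, bool force)
        {
                unsigned int cpu = cpumask_first(mask);

                /* ... program the interrupt controller to route to 'cpu' ... */

                irq_data_update_effective_affinity(d, cpumask_of(cpu));
                return IRQ_SET_MASK_OK;
        }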
| * kernel: time: Add delay after cpu_relax() in tight loops (Prasad Sodagudi, 2023-07-16)
    Tight loops of spin_lock_irqsave() and spin_unlock_irqrestore() in timer and
    hrtimer are causing scheduling delays. Add a delay of a few nanoseconds after
    cpu_relax() in the timer/hrtimer tight loops.

    Change-Id: Iaa0ab92da93f7b245b1d922b6edca2bebdc0fbce
    Signed-off-by: Prasad Sodagudi <psodagud@codeaurora.org>
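    A sketch of the pattern with an illustrative wait loop and back-off value; the
    actual patch touches the timer/hrtimer lock-retry loops.

        #include <asm/processor.h>      /* cpu_relax() */
        #include <linux/delay.h>

        /* Illustrative tight wait loop: cpu_relax() alone keeps the CPU hammering
         * the cacheline, so back off for a short, tuning-dependent delay. */
        static void wait_for_flag(volatile int *flag)
        {
                while (!*flag) {
                        cpu_relax();
                        ndelay(100);    /* illustrative value */
                }
        }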
* | Merge lineage-20 of git@github.com:LineageOS/android_kernel_qcom_msm8998.git ↵ (Davide Garberi, 2023-08-03)
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | into lineage-20 1a4b80f8f201 ANDROID: arch:arm64: Increase kernel command line size 7c253f7aa663 of: reserved_mem: increase max number reserved regions df4dbf557503 msm: camera: Fix indentations 2fc4a156d15d msm: camera: Fix code flow when populating CAM_V_CUSTOM1 687bcb61f125 ALSA: control: use counting semaphore as write lock for ELEM_WRITE operation 75cf9e8c1b1c ALSA: control: Fix memory corruption risk in snd_ctl_elem_read 76cf3b5e53df ALSA: control: code refactoring for ELEM_READ/ELEM_WRITE operations e9af212f9685 ALSA: pcm: Move rwsem lock inside snd_ctl_elem_read to prevent UAF 95fc4fff573f msm: kgsl: Make sure that pool pages don't have any extra references 59ceabe0d242 msm: kgsl: Use dma_buf_get() to get dma_buf structure d1f19956d6b9 ANDROID: usb: f_accessory: Check buffer size when initialised via composite 2d3ce4f7a366 kbuild: handle libs-y archives separately from built-in.o archives 65dc3fbd1593 kbuild: thin archives use P option to ar 362c7b73bac8 kbuild: thin archives for multi-y targets 43076241b514 kbuild: thin archives final link close --whole-archives option aa04fc78256d kbuild: minor improvement for thin archives build f5896747cda6 Merge tag 'LA.UM.7.2.c25-07700-sdm660.0' of https://git.codelinaro.org/clo/la/platform/vendor/qcom-opensource/wlan/qcacld-3.0 into android13-4.4-msm8998 321ac077ee7e qcacld-3.0: Fix out-of-bounds in tx_stats 42be8e4cbf13 BACKPORT: usb: gadget: rndis: prevent integer overflow in rndis_set_response() b490a85b5945 FROMGIT: arm64: fix oops in concurrently setting insn_emulation sysctls 7ed7084b34a9 FROMLIST: binder: fix UAF of ref->proc caused by race condition e31f087fb864 ANDROID: selinux: modify RTM_GETNEIGH{TBL} 80675d431434 UPSTREAM: usb: gadget: clear related members when goto fail fb6adfb00108 UPSTREAM: usb: gadget: don't release an existing dev->buf e4a8dd12424e UPSTREAM: USB: gadget: validate interface OS descriptor requests 8f0a947317e0 UPSTREAM: usb: gadget: rndis: check size of RNDIS_MSG_SET command 1541758765ff ion: Do not 'put' ION handle until after its final use 03b4b3cd8d30 Merge tag 'LA.UM.7.2.c25-07000-sdm660.0' of https://git.codelinaro.org/clo/la/platform/vendor/qcom-opensource/wlan/qcacld-3.0 into android13-4.4-msm8998 7dbda95466d5 Merge tag 'LA.UM.8.4.c25-06600-8x98.0' of https://git.codelinaro.org/clo/la/kernel/msm-4.4 into android13-4.4-msm8998 369119e5df4e cert host tools: Stop complaining about deprecated OpenSSL functions f8e30a0f9a17 fixup! 
BACKPORT: treewide: Fix function prototypes for module_param_call() 4fa5045f3dc9 arm64/efi: Mark __efistub_stext_offset as an absolute symbol explicitly bcd9668da77f arm64: kernel: do not need to reset UAO on exception entry c4ddd677f7e3 Kbuild: do not emit debug info for assembly with LLVM_IAS=1 1b880b6e19f8 qcacld-3.0: Add time slice duty cycle in wifi_interface_info fd24be2b22a1 qcacmn: Add time slice duty cycle attribute into QCA vendor command d719c1c825f8 qcacld-3.0: Use field-by-field assignment for FW stats fb5eb3bda2d9 ext4: enable quota enforcement based on mount options cd40d7f301de ext4: adds project ID support 360e2f3d18b8 ext4: add project quota support c31ac2be1594 drivers: qcacld-3.0: Remove in_compat_syscall() redefinition 6735c13a269d arm64: link with -z norelro regardless of CONFIG_RELOCATABLE 99962aab3433 arm64: relocatable: fix inconsistencies in linker script and options 24bd8cc5e6bb arm64: prevent regressions in compressed kernel image size when upgrading to binutils 2.27 93bb4c2392a2 arm64: kernel: force ET_DYN ELF type for CONFIG_RELOCATABLE=y a54bbb725ccb arm64: build with baremetal linker target instead of Linux when available c5805c604a9b arm64: add endianness option to LDFLAGS instead of LD ab6052788f60 arm64: Set UTS_MACHINE in the Makefile c3330429b2c6 kbuild: clear LDFLAGS in the top Makefile f33c1532bd61 kbuild: use HOSTLDFLAGS for single .c executables 38b7db363a96 BACKPORT: arm64: Change .weak to SYM_FUNC_START_WEAK_PI for arch/arm64/lib/mem*.S 716cb63e81d9 BACKPORT: crypto: arm64/aes-ce-cipher - move assembler code to .S file 7dfbaee16432 BACKPORT: arm64: Remove reference to asm/opcodes.h 531ee8624d17 BACKPORT: arm64: kprobe: protect/rename few definitions to be reused by uprobe 08d83c997b0c BACKPORT: arm64: Delete the space separator in __emit_inst e3951152dc2d BACKPORT: arm64: Get rid of asm/opcodes.h 255820c0f301 BACKPORT: arm64: Fix minor issues with the dcache_by_line_op macro 21bb344a664b BACKPORT: crypto: arm64/aes-modes - get rid of literal load of addend vector 26d5a53c6e0d BACKPORT: arm64: vdso: remove commas between macro name and arguments 78bff1f77c9d BACKPORT: kbuild: support LLVM=1 to switch the default tools to Clang/LLVM 6634f9f63efe BACKPORT: kbuild: replace AS=clang with LLVM_IAS=1 b891e8fdc466 BACKPORT: Documentation/llvm: fix the name of llvm-size 75d6fa8368a8 BACKPORT: Documentation/llvm: add documentation on building w/ Clang/LLVM 95b0a5e52f2a BACKPORT: ANDROID: ftrace: fix function type mismatches 7da9c2138ec8 BACKPORT: ANDROID: fs: logfs: fix filler function type d6d5a4b28ad0 BACKPORT: ANDROID: fs: gfs2: fix filler function type 9b194a470db5 BACKPORT: ANDROID: fs: exofs: fix filler function type 7a45ac4bfb49 BACKPORT: ANDROID: fs: afs: fix filler function type 4099e1b281e5 BACKPORT: drivers/perf: arm_pmu: fix function type mismatch af7b738882f7 BACKPORT: dummycon: fix function types 1b0b55a36dbe BACKPORT: fs: nfs: fix filler function type a58a0e30e20a BACKPORT: mm: fix filler function type mismatch 829e9226a8c0 BACKPORT: mm: fix drain_local_pages function type 865ef61b4da8 BACKPORT: vfs: pass type instead of fn to do_{loop,iter}_readv_writev() 08d2f8e7ba8e BACKPORT: module: Do not paper over type mismatches in module_param_call() ea467f6c33e4 BACKPORT: treewide: Fix function prototypes for module_param_call() d131459e6b8b BACKPORT: module: Prepare to convert all module_param_call() prototypes 6f52abadf006 BACKPORT: kbuild: fix --gc-sections bf7540ffce44 BACKPORT: kbuild: record needed exported symbols for modules c49d2545e437 
BACKPORT: kbuild: Allow to specify composite modules with modname-m 427d0fc67dc1 BACKPORT: kbuild: add arch specific post-link Makefile 69f8a31838a3 BACKPORT: arm64: add a workaround for GNU gold with ARM64_MODULE_PLTS ba3368756abf BACKPORT: arm64: explicitly pass --no-fix-cortex-a53-843419 to GNU gold 6dacd7e737fb BACKPORT: arm64: errata: Pass --fix-cortex-a53-843419 to ld if workaround enabled d2787c21f2b5 BACKPORT: kbuild: add __ld-ifversion and linker-specific macros 2d471de60bb4 BACKPORT: kbuild: add ld-name macro 06280a90d845 BACKPORT: arm64: keep .altinstructions and .altinstr_replacement eb0ad3ae07f9 BACKPORT: kbuild: add __cc-ifversion and compiler-specific variants 3d01e1eba86b BACKPORT: FROMLIST: kbuild: add clang-version.sh 18dd378ab563 BACKPORT: FROMLIST: kbuild: fix LD_DEAD_CODE_DATA_ELIMINATION aabbc122b1de BACKPORT: kbuild: thin archives make default for all archs 756d47e345fc BACKPORT: kbuild: allow archs to select link dead code/data elimination 723ab99e48a7 BACKPORT: kbuild: allow architectures to use thin archives instead of ld -r 0b77ec583772 drivers/usb/serial/console.c: remove superfluous serial->port condition 6488cb478f04 drivers/firmware/efi/libstub.c: prevent a relocation dba4259216a0 UPSTREAM: pidfd: fix a poll race when setting exit_state baab6e33b07b BACKPORT: arch: wire-up pidfd_open() 5d2e9e4f8630 BACKPORT: pid: add pidfd_open() f8396a127daf UPSTREAM: pidfd: add polling support f4c358582254 UPSTREAM: signal: improve comments 5500316dc8d8 UPSTREAM: fork: do not release lock that wasn't taken fc7d707593e3 BACKPORT: signal: support CLONE_PIDFD with pidfd_send_signal f044fa00d72a BACKPORT: clone: add CLONE_PIDFD f20fc1c548f2 UPSTREAM: Make anon_inodes unconditional de80525cd462 UPSTREAM: signal: use fdget() since we don't allow O_PATH 229e1bdd624e UPSTREAM: signal: don't silently convert SI_USER signals to non-current pidfd ada02e996b52 BACKPORT: signal: add pidfd_send_signal() syscall 828857678c5c compat: add in_compat_syscall to ask whether we're in a compat syscall e7aede4896c0 bpf: Add new cgroup attach type to enable sock modifications 9ed75228b09c ebpf: allow bpf_get_current_uid_gid_proto also for networking c5aa3963b4ae bpf: fix overflow in prog accounting c46a001439fc bpf: Make sure mac_header was set before using it 8aed99185615 bpf: Enlarge offset check value to INT_MAX in bpf_skb_{load,store}_bytes b0a638335ba6 bpf: avoid false sharing of map refcount with max_entries 1f21605e373c net: remove hlist_nulls_add_tail_rcu() 9ce369b09dbb udp: get rid of SLAB_DESTROY_BY_RCU allocations 070f539fb5d7 udp: no longer use SLAB_DESTROY_BY_RCU a32d2ea857c5 inet: refactor inet[6]_lookup functions to take skb fcf3e7bc7203 soreuseport: fix initialization race df03c8cf024a soreuseport: Fix TCP listener hash collision bd8b9f50c9d3 inet: Fix missing return value in inet6_hash bae331196dd0 soreuseport: fast reuseport TCP socket selection 4ada2ed73da0 inet: create IPv6-equivalent inet_hash function 73f609838475 sock: struct proto hash function may error e3b32750621b cgroup: Fix sock_cgroup_data on big-endian. 
69dabcedd4b9 selinux: always allow mounting submounts 17d6ddebcc49 userns: Don't fail follow_automount based on s_user_ns cbd08255e6f8 fs: Better permission checking for submounts 3a9ace719251 mnt: Move the FS_USERNS_MOUNT check into sget_userns af53549b43c5 locks: sprinkle some tracepoints around the file locking code 07dbbc84aa34 locks: rename __posix_lock_file to posix_lock_inode 400cbe93d180 autofs: Fix automounts by using current_real_cred()->uid 7903280ee07a fs: Call d_automount with the filesystems creds b87fb50ff1cd UPSTREAM: kernfs: Check KERNFS_HAS_RELEASE before calling kernfs_release_file() c9c596de3e52 UPSTREAM: kernfs: fix locking around kernfs_ops->release() callback 2172eaf5a901 UPSTREAM: cgroup, bpf: remove unnecessary #include dc81f3963dde kernfs: kernfs_sop_show_path: don't return 0 after seq_dentry call ce9a52e20897 cgroup: Make rebind_subsystems() disable v2 controllers all at once ce5e3aa14c39 cgroup: fix sock_cgroup_data initialization on earlier compilers 94a70ef24da9 samples/bpf: fix bpf_perf_event_output prototype c1920272278e net: gso: Fix skb_segment splat when splitting gso_size mangled skb having linear-headed frag_list d7707635776b sk_buff: allow segmenting based on frag sizes 924bbacea75e ip_tunnel, bpf: ip_tunnel_info_opts_{get, set} depends on CONFIG_INET 0e9008d618f4 bpf: udp: ipv6: Avoid running reuseport's bpf_prog from __udp6_lib_err 01b437940f5e soreuseport: add compat case for setsockopt SO_ATTACH_REUSEPORT_CBPF 421fbf04bf2c soreuseport: change consume_skb to kfree_skb in error case 1ab50514c430 ipv6: Fix SO_REUSEPORT UDP socket with implicit sk_ipv6only f3dfd61c502d soreuseport: fix ordering for mixed v4/v6 sockets 245ee3c90795 soreuseport: fix NULL ptr dereference SO_REUSEPORT after bind 113fb209854a bpf: do not blindly change rlimit in reuseport net selftest 985253ef27d2 bpf: fix rlimit in reuseport net selftest ae61334510be soreuseport: Fix reuseport_bpf testcase on 32bit architectures 6efa24da01a5 udp: fix potential infinite loop in SO_REUSEPORT logic 66df70c6605d soreuseport: BPF selection functional test for TCP fe161031b8a8 soreuseport: pass skb to secondary UDP socket lookup 9223919efdf2 soreuseport: BPF selection functional test 2090ed790dbb soreuseport: fix mem leak in reuseport_add_sock() 67887f6ac3f1 Merge "diag: Ensure dci entry is valid before sending the packet" e41c0da23b38 diag: Prevent out of bound write while sending dci pkt to remote e1085d1ef39b diag: Ensure dci entry is valid before sending the packet 16802e80ecb5 Merge "ion: Fix integer overflow in msm_ion_custom_ioctl" 57146f83f388 ion: Fix integer overflow in msm_ion_custom_ioctl 6fc2001969fe diag: Use valid data_source for a valid token 0c6dbf858a98 qcacld-3.0: Avoid OOB read in dot11f_unpack_assoc_response f07caca0c485 qcacld-3.0: Fix array OOB for duplicate rate 5a359aba0364 msm: kgsl: Remove 'fd' dependency to get dma_buf handle da8317596949 msm: kgsl: Fix gpuaddr_in_range() to check upper bound 2ed91a98d8b4 msm: adsprpc: Handle UAF in fastrpc debugfs read 2967159ad303 msm: kgsl: Add a sysfs node to control performance counter reads e392a84f25f5 msm: kgsl: Perform cache flush on the pages obtained using get_user_pages() 28b45f75d2ee soc: qcom: hab: Add sanity check for payload_count 885caec7690f Merge "futex: Fix inode life-time issue" 0f57701d2643 Merge "futex: Handle faults correctly for PI futexes" 7d7eb450c333 Merge "futex: Rework inconsistent rt_mutex/futex_q state" 124ebd87ef2f msm: kgsl: Fix out of bound write in adreno_profile_submit_time 228bbfb25032 futex: Fix 
inode life-time issue 7075ca6a22b3 futex: Handle faults correctly for PI futexes a436b73e9032 futex: Simplify fixup_pi_state_owner() 11b99dbe3221 futex: Use pi_state_update_owner() in put_pi_state() f34484030550 rtmutex: Remove unused argument from rt_mutex_proxy_unlock() 079d1c90b3c3 futex: Provide and use pi_state_update_owner() 3b51e24eb17b futex: Replace pointless printk in fixup_owner() 0eac5c2583a1 futex: Avoid violating the 10th rule of futex 6d6ed38b7d10 futex: Rework inconsistent rt_mutex/futex_q state 3c8f7dfd59b5 futex: Remove rt_mutex_deadlock_account_*() 9c870a329520 futex,rt_mutex: Provide futex specific rt_mutex API 7504736e8725 msm: adsprpc: Handle UAF in process shell memory 994e5922a0c2 Disable TRACER Check to improve Camera Performance 8fb3f17b3ad1 msm: kgsl: Deregister gpu address on memdesc_sg_virt failure 13aa628efdca Merge "crypto: Fix possible stack out-of-bound error" 92e777451003 Merge "msm: kgsl: Correct the refcount on current process PID." 9ca218394ed4 Merge "msm: kgsl: Compare pid pointer instead of TGID for a new process" 7eed1f2e0f43 Merge "qcom,max-freq-level change for trial" 6afb5eb98e36 crypto: Fix possible stack out-of-bound error 8b5ba278ed4b msm: kgsl: Correct the refcount on current process PID. 4150552fac96 msm: kgsl: Compare pid pointer instead of TGID for a new process c272102c0793 qcom,max-freq-level change for trial 854ef3ce73f5 msm: kgsl: Protect the memdesc->gpuaddr in SVM use cases. 79c8161aeac9 msm: kgsl: Stop using memdesc->usermem. Change-Id: Iea7db1362c3cd18e36f243411e773a9054f6a445
| * BACKPORT: ANDROID: ftrace: fix function type mismatches (Sami Tolvanen, 2022-10-28)
    This change fixes indirect call mismatches with function and function graph
    tracing, which trip Control-Flow Integrity (CFI) checking.

    Bug: 79510107
    Bug: 67506682
    Change-Id: I5de08c113fb970ffefedce93c58e0161f22c7ca2
    Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
    (cherry picked from commit c2f9bce9fee8e31e0500c501076f73db7791d8e9)
    Signed-off-by: Dan Aloni <daloni@magicleap.com>
    Signed-off-by: Davide Garberi <dade.garberi@gmail.com>
| * BACKPORT: mm: fix filler function type mismatch (Sami Tolvanen, 2022-10-28)
    Bug: 67506682
    Change-Id: I6f615164ccd86b407540ada9bbcb39d910395db9
    Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
    (cherry picked from commit 4fd840d1743308b2ef470534523009dd99b3ce2b)
    Signed-off-by: Dan Aloni <daloni@magicleap.com>
    Signed-off-by: Davide Garberi <dade.garberi@gmail.com>
| * BACKPORT: mm: fix drain_local_pages function type (Sami Tolvanen, 2022-10-28)
    Bug: 67506682
    Change-Id: I6ca80f521c880589efe45dc467d494051daae015
    Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
    (cherry picked from commit 97d5fd27f7af6b36d775f56a1c78a026686aa407)
    Signed-off-by: Dan Aloni <daloni@magicleap.com>
    Signed-off-by: Davide Garberi <dade.garberi@gmail.com>
| * BACKPORT: module: Do not paper over type mismatches in module_param_call() (Kees Cook, 2022-10-28)
    The module_param_call() macro was explicitly casting the .set and .get function
    prototypes away. This can lead to hard-to-find type mismatches. Now that all the
    function prototypes have been fixed tree-wide, we can drop these casts, and use
    named initializers too.

    Signed-off-by: Kees Cook <keescook@chromium.org>
    Signed-off-by: Jessica Yu <jeyu@kernel.org>
    Bug: 67506682
    Change-Id: I439c8b4b9f0108ac357267bbc396a63baec2b242
    (cherry picked from commit ece1996a21eeb344b49200e627c6660111009c10)
    Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
    (cherry picked from commit cb214f0c4c105ceab5456556425ecec74e2ac3c1)
    Signed-off-by: Dan Aloni <daloni@magicleap.com>
    Signed-off-by: Davide Garberi <dade.garberi@gmail.com>
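    A sketch of what the conversion buys: with the real prototypes taking
    const struct kernel_param *, module_param_call() needs no casts and the
    handlers can simply forward to the stock param_set_int()/param_get_int().
    The parameter and handler names here are illustrative.

        #include <linux/moduleparam.h>

        static int demo_val;

        /* .set/.get now use the real kernel_param prototypes - no casts needed. */
        static int demo_set(const char *val, const struct kernel_param *kp)
        {
                return param_set_int(val, kp);
        }

        static int demo_get(char *buffer, const struct kernel_param *kp)
        {
                return param_get_int(buffer, kp);
        }

        module_param_call(demo_val, demo_set, demo_get, &demo_val, 0644);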
| * BACKPORT: module: Prepare to convert all module_param_call() prototypes (Kees Cook, 2022-10-28)
    After actually converting all module_param_call() function prototypes, we no
    longer need to do a tricky sizeof(func(thing)) type-check. Remove it.

    Signed-off-by: Kees Cook <keescook@chromium.org>
    Signed-off-by: Jessica Yu <jeyu@kernel.org>
    Bug: 67506682
    Change-Id: Ie20dbd09634c7cbef499c81bf2dbfd762ad0058a
    (cherry picked from commit b2f270e8747387335d80428c576118e7d87f69cc)
    Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
    (cherry picked from commit 38cbecf2ae55478a2046ef4fc93b9d9147fbb170)
    Signed-off-by: Dan Aloni <daloni@magicleap.com>
    Signed-off-by: Davide Garberi <dade.garberi@gmail.com>
| * BACKPORT: kbuild: allow archs to select link dead code/data elimination (Nicholas Piggin, 2022-10-28)
    Introduce LD_DEAD_CODE_DATA_ELIMINATION option for architectures to select to
    build with -ffunction-sections, -fdata-sections, and link with --gc-sections.
    It requires some work (documented) to ensure all unreferenced entrypoints are
    live, and requires toolchain and build verification, so it is made a per-arch
    option for now.

    On a random powerpc64le build, this yields a significant size saving; it boots
    and runs fine, but there is a lot I haven't tested as yet, so these savings may
    be reduced if there are bugs in the link.

        text      data     bss      dec       filename
        11169741  1180744  1923176  14273661  vmlinux
        10445269  1004127  1919707  13369103  vmlinux.dce

    ~700K text, ~170K data, 6% removed from kernel image size.

    Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
    Signed-off-by: Michal Marek <mmarek@suse.com>
    (cherry-pick from b67067f1176df6ee727450546b58704e4b588563)
    Change-Id: I81b63489605bc2f146498d0bb0e1cc5b7adab8a0
    Signed-off-by: Dan Aloni <daloni@magicleap.com>
    Signed-off-by: Davide Garberi <dade.garberi@gmail.com>
| * BACKPORT: pid: add pidfd_open() (Christian Brauner, 2022-10-28)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This adds the pidfd_open() syscall. It allows a caller to retrieve pollable pidfds for a process which did not get created via CLONE_PIDFD, i.e. for a process that is created via traditional fork()/clone() calls that is only referenced by a PID: int pidfd = pidfd_open(1234, 0); ret = pidfd_send_signal(pidfd, SIGSTOP, NULL, 0); With the introduction of pidfds through CLONE_PIDFD it is possible to created pidfds at process creation time. However, a lot of processes get created with traditional PID-based calls such as fork() or clone() (without CLONE_PIDFD). For these processes a caller can currently not create a pollable pidfd. This is a problem for Android's low memory killer (LMK) and service managers such as systemd. Both are examples of tools that want to make use of pidfds to get reliable notification of process exit for non-parents (pidfd polling) and race-free signal sending (pidfd_send_signal()). They intend to switch to this API for process supervision/management as soon as possible. Having no way to get pollable pidfds from PID-only processes is one of the biggest blockers for them in adopting this api. With pidfd_open() making it possible to retrieve pidfds for PID-based processes we enable them to adopt this api. In line with Arnd's recent changes to consolidate syscall numbers across architectures, I have added the pidfd_open() syscall to all architectures at the same time. Signed-off-by: Christian Brauner <christian@brauner.io> Reviewed-by: David Howells <dhowells@redhat.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Kees Cook <keescook@chromium.org> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Jann Horn <jannh@google.com> Cc: Andy Lutomirsky <luto@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-api@vger.kernel.org (cherry picked from commit 32fcb426ec001cb6d5a4a195091a8486ea77e2df) Conflicts: kernel/pid.c (1. Replaced PIDTYPE_TGID with PIDTYPE_PID and thread_group_leader() check in pidfd_open() call) Bug: 135608568 Test: test program using syscall(__NR_sys_pidfd_open,..) and poll() Change-Id: I52a93a73722d7f7754dae05f63b94b4ca4a71a75 Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: electimon <electimon@gmail.com>
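    The usage shown in the commit message, expanded into a compilable userspace
    sketch; it assumes the kernel headers in use expose __NR_pidfd_open and
    __NR_pidfd_send_signal (on this backport the numbers come from the patched
    uapi headers), and the target PID is illustrative.

        #define _GNU_SOURCE
        #include <signal.h>
        #include <stdio.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        int main(void)
        {
                /* open a pollable pidfd for an existing, PID-only process */
                int pidfd = syscall(__NR_pidfd_open, 1234, 0);
                if (pidfd < 0) {
                        perror("pidfd_open");
                        return 1;
                }

                /* signal it race-free through the pidfd */
                if (syscall(__NR_pidfd_send_signal, pidfd, SIGSTOP, NULL, 0) < 0)
                        perror("pidfd_send_signal");

                close(pidfd);
                return 0;
        }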
| * UPSTREAM: pidfd: add polling support (Joel Fernandes (Google), 2022-10-28)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds polling support to pidfd. Android low memory killer (LMK) needs to know when a process dies once it is sent the kill signal. It does so by checking for the existence of /proc/pid which is both racy and slow. For example, if a PID is reused between when LMK sends a kill signal and checks for existence of the PID, since the wrong PID is now possibly checked for existence. Using the polling support, LMK will be able to get notified when a process exists in race-free and fast way, and allows the LMK to do other things (such as by polling on other fds) while awaiting the process being killed to die. For notification to polling processes, we follow the same existing mechanism in the kernel used when the parent of the task group is to be notified of a child's death (do_notify_parent). This is precisely when the tasks waiting on a poll of pidfd are also awakened in this patch. We have decided to include the waitqueue in struct pid for the following reasons: 1. The wait queue has to survive for the lifetime of the poll. Including it in task_struct would not be option in this case because the task can be reaped and destroyed before the poll returns. 2. By including the struct pid for the waitqueue means that during de_thread(), the new thread group leader automatically gets the new waitqueue/pid even though its task_struct is different. Appropriate test cases are added in the second patch to provide coverage of all the cases the patch is handling. Cc: Andy Lutomirski <luto@amacapital.net> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Daniel Colascione <dancol@google.com> Cc: Jann Horn <jannh@google.com> Cc: Tim Murray <timmurray@google.com> Cc: Jonathan Kowalski <bl0pbl33p@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Kees Cook <keescook@chromium.org> Cc: David Howells <dhowells@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: kernel-team@android.com Reviewed-by: Oleg Nesterov <oleg@redhat.com> Co-developed-by: Daniel Colascione <dancol@google.com> Signed-off-by: Daniel Colascione <dancol@google.com> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Christian Brauner <christian@brauner.io> (cherry picked from commit b53b0b9d9a613c418057f6cb921c2f40a6f78c24) Bug: 135608568 Test: test program using syscall(__NR_sys_pidfd_open,..) and poll() Change-Id: I02f259d2875bec46b198d580edfbb067f077084e Signed-off-by: Suren Baghdasaryan <surenb@google.com>
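    A userspace sketch of the LMK-style usage described above: instead of polling
    /proc/<pid>, poll the pidfd and treat readability as process exit. It assumes
    __NR_pidfd_open is available; wait_for_exit() is an illustrative helper.

        #define _GNU_SOURCE
        #include <errno.h>
        #include <poll.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        static int wait_for_exit(pid_t pid)
        {
                struct pollfd pfd;
                int pidfd = syscall(__NR_pidfd_open, pid, 0);

                if (pidfd < 0)
                        return -1;

                pfd.fd = pidfd;
                pfd.events = POLLIN;    /* becomes readable once the process exits */

                while (poll(&pfd, 1, -1) < 0 && errno == EINTR)
                        ;

                close(pidfd);
                return 0;
        }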
| * BACKPORT: clone: add CLONE_PIDFD (Christian Brauner, 2022-10-28)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patchset makes it possible to retrieve pid file descriptors at process creation time by introducing the new flag CLONE_PIDFD to the clone() system call. Linus originally suggested to implement this as a new flag to clone() instead of making it a separate system call. As spotted by Linus, there is exactly one bit for clone() left. CLONE_PIDFD creates file descriptors based on the anonymous inode implementation in the kernel that will also be used to implement the new mount api. They serve as a simple opaque handle on pids. Logically, this makes it possible to interpret a pidfd differently, narrowing or widening the scope of various operations (e.g. signal sending). Thus, a pidfd cannot just refer to a tgid, but also a tid, or in theory - given appropriate flag arguments in relevant syscalls - a process group or session. A pidfd does not represent a privilege. This does not imply it cannot ever be that way but for now this is not the case. A pidfd comes with additional information in fdinfo if the kernel supports procfs. The fdinfo file contains the pid of the process in the callers pid namespace in the same format as the procfs status file, i.e. "Pid:\t%d". As suggested by Oleg, with CLONE_PIDFD the pidfd is returned in the parent_tidptr argument of clone. This has the advantage that we can give back the associated pid and the pidfd at the same time. To remove worries about missing metadata access this patchset comes with a sample program that illustrates how a combination of CLONE_PIDFD, and pidfd_send_signal() can be used to gain race-free access to process metadata through /proc/<pid>. The sample program can easily be translated into a helper that would be suitable for inclusion in libc so that users don't have to worry about writing it themselves. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Christian Brauner <christian@brauner.io> Co-developed-by: Jann Horn <jannh@google.com> Signed-off-by: Jann Horn <jannh@google.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Kees Cook <keescook@chromium.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: David Howells <dhowells@redhat.com> Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com> Cc: Andy Lutomirsky <luto@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> (cherry picked from commit b3e5838252665ee4cfa76b82bdf1198dca81e5be) Conflicts: kernel/fork.c (1. Replaced proc_pid_ns() with its direct implementation.) Bug: 135608568 Test: test program using syscall(__NR_sys_pidfd_open,..) and poll() Change-Id: I3c804a92faea686e5bf7f99df893fe3a5d87ddf7 Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: electimon <electimon@gmail.com>
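    A userspace sketch of CLONE_PIDFD as described above: the pidfd comes back
    through the parent_tid pointer of raw clone(). The raw clone() argument order
    is architecture specific (the first three arguments shown match x86_64 and
    arm64), and the CLONE_PIDFD fallback value is taken from the patch.

        #define _GNU_SOURCE
        #include <sched.h>
        #include <signal.h>
        #include <stdio.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        #ifndef CLONE_PIDFD
        #define CLONE_PIDFD 0x00001000  /* may be missing from older uapi headers */
        #endif

        int main(void)
        {
                int pidfd = -1;
                /* raw clone(): flags, child stack, parent_tid (receives the pidfd), ... */
                long pid = syscall(__NR_clone, CLONE_PIDFD | SIGCHLD, NULL, &pidfd, NULL, NULL);

                if (pid == 0)
                        _exit(0);                       /* child */
                if (pid > 0) {
                        printf("child %ld, pidfd %d\n", pid, pidfd);
                        close(pidfd);
                }
                return pid < 0;
        }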
| * BACKPORT: signal: add pidfd_send_signal() syscall (Christian Brauner, 2022-10-28)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The kill() syscall operates on process identifiers (pid). After a process has exited its pid can be reused by another process. If a caller sends a signal to a reused pid it will end up signaling the wrong process. This issue has often surfaced and there has been a push to address this problem [1]. This patch uses file descriptors (fd) from proc/<pid> as stable handles on struct pid. Even if a pid is recycled the handle will not change. The fd can be used to send signals to the process it refers to. Thus, the new syscall pidfd_send_signal() is introduced to solve this problem. Instead of pids it operates on process fds (pidfd). /* prototype and argument /* long pidfd_send_signal(int pidfd, int sig, siginfo_t *info, unsigned int flags); /* syscall number 424 */ The syscall number was chosen to be 424 to align with Arnd's rework in his y2038 to minimize merge conflicts (cf. [25]). In addition to the pidfd and signal argument it takes an additional siginfo_t and flags argument. If the siginfo_t argument is NULL then pidfd_send_signal() is equivalent to kill(<positive-pid>, <signal>). If it is not NULL pidfd_send_signal() is equivalent to rt_sigqueueinfo(). The flags argument is added to allow for future extensions of this syscall. It currently needs to be passed as 0. Failing to do so will cause EINVAL. /* pidfd_send_signal() replaces multiple pid-based syscalls */ The pidfd_send_signal() syscall currently takes on the job of rt_sigqueueinfo(2) and parts of the functionality of kill(2), Namely, when a positive pid is passed to kill(2). It will however be possible to also replace tgkill(2) and rt_tgsigqueueinfo(2) if this syscall is extended. /* sending signals to threads (tid) and process groups (pgid) */ Specifically, the pidfd_send_signal() syscall does currently not operate on process groups or threads. This is left for future extensions. In order to extend the syscall to allow sending signal to threads and process groups appropriately named flags (e.g. PIDFD_TYPE_PGID, and PIDFD_TYPE_TID) should be added. This implies that the flags argument will determine what is signaled and not the file descriptor itself. Put in other words, grouping in this api is a property of the flags argument not a property of the file descriptor (cf. [13]). Clarification for this has been requested by Eric (cf. [19]). 
When appropriate extensions through the flags argument are added then pidfd_send_signal() can additionally replace the part of kill(2) which operates on process groups as well as the tgkill(2) and rt_tgsigqueueinfo(2) syscalls. How such an extension could be implemented has been very roughly sketched in [14], [15], and [16]. However, this should not be taken as a commitment to a particular implementation. There might be better ways to do it. Right now this is intentionally left out to keep this patchset as simple as possible (cf. [4]). /* naming */ The syscall had various names throughout iterations of this patchset: - procfd_signal() - procfd_send_signal() - taskfd_send_signal() In the last round of reviews it was pointed out that given that if the flags argument decides the scope of the signal instead of different types of fds it might make sense to either settle for "procfd_" or "pidfd_" as prefix. The community was willing to accept either (cf. [17] and [18]). Given that one developer expressed strong preference for the "pidfd_" prefix (cf. [13]) and with other developers less opinionated about the name we should settle for "pidfd_" to avoid further bikeshedding. The "_send_signal" suffix was chosen to reflect the fact that the syscall takes on the job of multiple syscalls. It is therefore intentional that the name is not reminiscent of neither kill(2) nor rt_sigqueueinfo(2). Not the fomer because it might imply that pidfd_send_signal() is a replacement for kill(2), and not the latter because it is a hassle to remember the correct spelling - especially for non-native speakers - and because it is not descriptive enough of what the syscall actually does. The name "pidfd_send_signal" makes it very clear that its job is to send signals. /* zombies */ Zombies can be signaled just as any other process. No special error will be reported since a zombie state is an unreliable state (cf. [3]). However, this can be added as an extension through the @flags argument if the need ever arises. /* cross-namespace signals */ The patch currently enforces that the signaler and signalee either are in the same pid namespace or that the signaler's pid namespace is an ancestor of the signalee's pid namespace. This is done for the sake of simplicity and because it is unclear to what values certain members of struct siginfo_t would need to be set to (cf. [5], [6]). /* compat syscalls */ It became clear that we would like to avoid adding compat syscalls (cf. [7]). The compat syscall handling is now done in kernel/signal.c itself by adding __copy_siginfo_from_user_generic() which lets us avoid compat syscalls (cf. [8]). It should be noted that the addition of __copy_siginfo_from_user_any() is caused by a bug in the original implementation of rt_sigqueueinfo(2) (cf. 12). With upcoming rework for syscall handling things might improve significantly (cf. [11]) and __copy_siginfo_from_user_any() will not gain any additional callers. /* testing */ This patch was tested on x64 and x86. /* userspace usage */ An asciinema recording for the basic functionality can be found under [9]. 
With this patch a process can be killed via: #define _GNU_SOURCE #include <errno.h> #include <fcntl.h> #include <signal.h> #include <stdio.h> #include <stdlib.h> #include <string.h> #include <sys/stat.h> #include <sys/syscall.h> #include <sys/types.h> #include <unistd.h> static inline int do_pidfd_send_signal(int pidfd, int sig, siginfo_t *info, unsigned int flags) { #ifdef __NR_pidfd_send_signal return syscall(__NR_pidfd_send_signal, pidfd, sig, info, flags); #else return -ENOSYS; #endif } int main(int argc, char *argv[]) { int fd, ret, saved_errno, sig; if (argc < 3) exit(EXIT_FAILURE); fd = open(argv[1], O_DIRECTORY | O_CLOEXEC); if (fd < 0) { printf("%s - Failed to open \"%s\"\n", strerror(errno), argv[1]); exit(EXIT_FAILURE); } sig = atoi(argv[2]); printf("Sending signal %d to process %s\n", sig, argv[1]); ret = do_pidfd_send_signal(fd, sig, NULL, 0); saved_errno = errno; close(fd); errno = saved_errno; if (ret < 0) { printf("%s - Failed to send signal %d to process %s\n", strerror(errno), sig, argv[1]); exit(EXIT_FAILURE); } exit(EXIT_SUCCESS); } /* Q&A * Given that it seems the same questions get asked again by people who are * late to the party it makes sense to add a Q&A section to the commit * message so it's hopefully easier to avoid duplicate threads. * * For the sake of progress please consider these arguments settled unless * there is a new point that desperately needs to be addressed. Please make * sure to check the links to the threads in this commit message whether * this has not already been covered. */ Q-01: (Florian Weimer [20], Andrew Morton [21]) What happens when the target process has exited? A-01: Sending the signal will fail with ESRCH (cf. [22]). Q-02: (Andrew Morton [21]) Is the task_struct pinned by the fd? A-02: No. A reference to struct pid is kept. struct pid - as far as I understand - was created exactly for the reason to not require to pin struct task_struct (cf. [22]). Q-03: (Andrew Morton [21]) Does the entire procfs directory remain visible? Just one entry within it? A-03: The same thing that happens right now when you hold a file descriptor to /proc/<pid> open (cf. [22]). Q-04: (Andrew Morton [21]) Does the pid remain reserved? A-04: No. This patchset guarantees a stable handle not that pids are not recycled (cf. [22]). Q-05: (Andrew Morton [21]) Do attempts to signal that fd return errors? A-05: See {Q,A}-01. Q-06: (Andrew Morton [22]) Is there a cleaner way of obtaining the fd? Another syscall perhaps. A-06: Userspace can already trivially retrieve file descriptors from procfs so this is something that we will need to support anyway. Hence, there's no immediate need to add another syscalls just to make pidfd_send_signal() not dependent on the presence of procfs. However, adding a syscalls to get such file descriptors is planned for a future patchset (cf. [22]). Q-07: (Andrew Morton [21] and others) This fd-for-a-process sounds like a handy thing and people may well think up other uses for it in the future, probably unrelated to signals. Are the code and the interface designed to permit such future applications? A-07: Yes (cf. [22]). Q-08: (Andrew Morton [21] and others) Now I think about it, why a new syscall? This thing is looking rather like an ioctl? A-08: This has been extensively discussed. It was agreed that a syscall is preferred for a variety or reasons. Here are just a few taken from prior threads. Syscalls are safer than ioctl()s especially when signaling to fds. Processes are a core kernel concept so a syscall seems more appropriate. 
The layout of the syscall with its four arguments would require the addition of a custom struct for the ioctl() thereby causing at least the same amount or even more complexity for userspace than a simple syscall. The new syscall will replace multiple other pid-based syscalls (see description above). The file-descriptors-for-processes concept introduced with this syscall will be extended with other syscalls in the future. See also [22], [23] and various other threads already linked in here. Q-09: (Florian Weimer [24]) What happens if you use the new interface with an O_PATH descriptor? A-09: pidfds opened as O_PATH fds cannot be used to send signals to a process (cf. [2]). Signaling processes through pidfds is the equivalent of writing to a file. Thus, this is not an operation that operates "purely at the file descriptor level" as required by the open(2) manpage. See also [4]. /* References */ [1]: https://lore.kernel.org/lkml/20181029221037.87724-1-dancol@google.com/ [2]: https://lore.kernel.org/lkml/874lbtjvtd.fsf@oldenburg2.str.redhat.com/ [3]: https://lore.kernel.org/lkml/20181204132604.aspfupwjgjx6fhva@brauner.io/ [4]: https://lore.kernel.org/lkml/20181203180224.fkvw4kajtbvru2ku@brauner.io/ [5]: https://lore.kernel.org/lkml/20181121213946.GA10795@mail.hallyn.com/ [6]: https://lore.kernel.org/lkml/20181120103111.etlqp7zop34v6nv4@brauner.io/ [7]: https://lore.kernel.org/lkml/36323361-90BD-41AF-AB5B-EE0D7BA02C21@amacapital.net/ [8]: https://lore.kernel.org/lkml/87tvjxp8pc.fsf@xmission.com/ [9]: https://asciinema.org/a/IQjuCHew6bnq1cr78yuMv16cy [11]: https://lore.kernel.org/lkml/F53D6D38-3521-4C20-9034-5AF447DF62FF@amacapital.net/ [12]: https://lore.kernel.org/lkml/87zhtjn8ck.fsf@xmission.com/ [13]: https://lore.kernel.org/lkml/871s6u9z6u.fsf@xmission.com/ [14]: https://lore.kernel.org/lkml/20181206231742.xxi4ghn24z4h2qki@brauner.io/ [15]: https://lore.kernel.org/lkml/20181207003124.GA11160@mail.hallyn.com/ [16]: https://lore.kernel.org/lkml/20181207015423.4miorx43l3qhppfz@brauner.io/ [17]: https://lore.kernel.org/lkml/CAGXu5jL8PciZAXvOvCeCU3wKUEB_dU-O3q0tDw4uB_ojMvDEew@mail.gmail.com/ [18]: https://lore.kernel.org/lkml/20181206222746.GB9224@mail.hallyn.com/ [19]: https://lore.kernel.org/lkml/20181208054059.19813-1-christian@brauner.io/ [20]: https://lore.kernel.org/lkml/8736rebl9s.fsf@oldenburg.str.redhat.com/ [21]: https://lore.kernel.org/lkml/20181228152012.dbf0508c2508138efc5f2bbe@linux-foundation.org/ [22]: https://lore.kernel.org/lkml/20181228233725.722tdfgijxcssg76@brauner.io/ [23]: https://lwn.net/Articles/773459/ [24]: https://lore.kernel.org/lkml/8736rebl9s.fsf@oldenburg.str.redhat.com/ [25]: https://lore.kernel.org/lkml/CAK8P3a0ej9NcJM8wXNPbcGUyOUZYX+VLoDFdbenW3s3114oQZw@mail.gmail.com/ Cc: "Eric W. 
Biederman" <ebiederm@xmission.com> Cc: Jann Horn <jannh@google.com> Cc: Andy Lutomirsky <luto@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Florian Weimer <fweimer@redhat.com> Signed-off-by: Christian Brauner <christian@brauner.io> Reviewed-by: Tycho Andersen <tycho@tycho.ws> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: David Howells <dhowells@redhat.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Serge Hallyn <serge@hallyn.com> Acked-by: Aleksa Sarai <cyphar@cyphar.com> (cherry picked from commit 3eb39f47934f9d5a3027fe00d906a45fe3a15fad) Conflicts: arch/x86/entry/syscalls/syscall_32.tbl - trivial manual merge arch/x86/entry/syscalls/syscall_64.tbl - trivial manual merge include/linux/proc_fs.h - trivial manual merge include/linux/syscalls.h - trivial manual merge include/uapi/asm-generic/unistd.h - trivial manual merge kernel/signal.c - struct kernel_siginfo does not exist in 4.14 kernel/sys_ni.c - cond_syscall is used instead of COND_SYSCALL arch/x86/entry/syscalls/syscall_32.tbl arch/x86/entry/syscalls/syscall_64.tbl (1. manual merges because of 4.14 differences 2. change prepare_kill_siginfo() to use struct siginfo instead of kernel_siginfo 3. use copy_from_user() instead of copy_siginfo_from_user() in copy_siginfo_from_user_any() 4. replaced COND_SYSCALL with cond_syscall 5. Removed __ia32_sys_pidfd_send_signal in arch/x86/entry/syscalls/syscall_32.tbl. 6. Replaced __x64_sys_pidfd_send_signal with sys_pidfd_send_signal in arch/x86/entry/syscalls/syscall_64.tbl.) Bug: 135608568 Test: test program using syscall(__NR_pidfd_send_signal,..) to send SIGKILL Change-Id: I34da11c63ac8cafb0353d9af24c820cef519ec27 Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: electimon <electimon@gmail.com>
| * compat: add in_compat_syscall to ask whether we're in a compat syscall (Andy Lutomirski, 2022-10-28)
    A lot of code currently abuses is_compat_task to determine this.

    Signed-off-by: Andy Lutomirski <luto@kernel.org>
    Cc: "David S. Miller" <davem@davemloft.net>
    Cc: "H. Peter Anvin" <hpa@zytor.com>
    Cc: "Theodore Ts'o" <tytso@mit.edu>
    Cc: Andreas Dilger <adilger.kernel@dilger.ca>
    Cc: Clemens Ladisch <clemens@ladisch.de>
    Cc: David Airlie <airlied@linux.ie>
    Cc: David Herrmann <dh.herrmann@googlemail.com>
    Cc: David Miller <davem@davemloft.net>
    Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
    Cc: Eric Paris <eparis@redhat.com>
    Cc: Herbert Xu <herbert@gondor.apana.org.au>
    Cc: Ingo Molnar <mingo@elte.hu>
    Acked-by: Jiri Kosina <jkosina@suse.cz>
    Cc: Matt Fleming <matt@codeblueprint.co.uk>
    Cc: Neil Horman <nhorman@tuxdriver.com>
    Cc: Oded Gabbay <oded.gabbay@gmail.com>
    Cc: Oleg Drokin <oleg.drokin@intel.com>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Paul Moore <paul@paul-moore.com>
    Cc: Sam Ravnborg <sam@ravnborg.org>
    Cc: Steffen Klassert <steffen.klassert@secunet.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Vlad Yasevich <vyasevich@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Change-Id: Ic4c37d15c1d72d1f0a90c7f5091d696140fc2a4c
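    A sketch of the intended replacement, with an illustrative ioctl-style helper:
    the decision concerns the syscall ABI in use, so in_compat_syscall() is asked
    rather than is_compat_task().

        #include <linux/compat.h>
        #include <linux/uaccess.h>

        /* Illustrative: copy a user-supplied long whose width depends on the
         * syscall ABI, keyed off in_compat_syscall() instead of is_compat_task(). */
        static int demo_copy_ulong(void __user *arg, unsigned long *out)
        {
                if (in_compat_syscall()) {
                        compat_ulong_t v;

                        if (copy_from_user(&v, arg, sizeof(v)))
                                return -EFAULT;
                        *out = v;
                        return 0;
                }
                return copy_from_user(out, arg, sizeof(*out)) ? -EFAULT : 0;
        }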
| * bpf: Add new cgroup attach type to enable sock modificationsDavid Ahern2022-10-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add new cgroup based program type, BPF_PROG_TYPE_CGROUP_SOCK. Similar to BPF_PROG_TYPE_CGROUP_SKB programs can be attached to a cgroup and run any time a process in the cgroup opens an AF_INET or AF_INET6 socket. Currently only sk_bound_dev_if is exported to userspace for modification by a bpf program. This allows a cgroup to be configured such that AF_INET{6} sockets opened by processes are automatically bound to a specific device. In turn, this enables the running of programs that do not support SO_BINDTODEVICE in a specific VRF context / L3 domain. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Change-Id: I96a6f6f8f650c494d8c173dbb42580a25698368e
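A sketch of what such a program can look like (the section name, license handling and the ifindex value are illustrative; a loader would create it as BPF_PROG_TYPE_CGROUP_SOCK and attach it to a cgroup with the new attach type):

#include <linux/bpf.h>

#ifndef SEC
#define SEC(name) __attribute__((section(name), used))
#endif

SEC("cgroup/sock")
int bind_to_vrf(struct bpf_sock *sk)
{
	/* Assumed ifindex of the VRF / L3 master device on this system. */
	sk->bound_dev_if = 4;
	return 1;	/* non-zero: allow the socket to be created */
}

char _license[] SEC("license") = "GPL";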
| * bpf: fix overflow in prog accountingDaniel Borkmann2022-10-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 5ccb071e97fbd9ffe623a0d3977cc6d013bee93c upstream. Commit aaac3ba95e4c ("bpf: charge user for creation of BPF maps and programs") made a wrong assumption of charging against prog->pages. Unlike map->pages, prog->pages are still subject to change when we need to expand the program through bpf_prog_realloc(). This can for example happen during verification stage when we need to expand and rewrite parts of the program. Should the required space cross a page boundary, then prog->pages is not the same anymore as its original value that we used to bpf_prog_charge_memlock() on. Thus, we'll hit a wrap-around during bpf_prog_uncharge_memlock() when prog is freed eventually. I noticed this that despite having unlimited memlock, programs suddenly refused to load with EPERM error due to insufficient memlock. There are two ways to fix this issue. One would be to add a cached variable to struct bpf_prog that takes a snapshot of prog->pages at the time of charging. The other approach is to also account for resizes. I chose to go with the latter for a couple of reasons: i) We want accounting rather to be more accurate instead of further fooling limits, ii) adding yet another page counter on struct bpf_prog would also be a waste just for this purpose. We also do want to charge as early as possible to avoid going into the verifier just to find out later on that we crossed limits. The only place that needs to be fixed is bpf_prog_realloc(), since only here we expand the program, so we try to account for the needed delta and should we fail, call-sites check for outcome anyway. On cBPF to eBPF migrations, we don't grab a reference to the user as they are charged differently. With that in place, my test case worked fine. Fixes: aaac3ba95e4c ("bpf: charge user for creation of BPF maps and programs") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> [Quentin: backport to 4.9: Adjust context in bpf.h ] Signed-off-by: Quentin Monnet <quentin@isovalent.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Kevin F. Haggerty <haggertk@lineageos.org> Change-Id: I4b31ee38eaf8618cf8c89e44aaff02cf70542440
| * bpf: avoid false sharing of map refcount with max_entriesDaniel Borkmann2022-10-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ upstream commit be95a845cc4402272994ce290e3ad928aff06cb9 ] In addition to commit b2157399cc98 ("bpf: prevent out-of-bounds speculation") also change the layout of struct bpf_map such that false sharing of fast-path members like max_entries is avoided when the maps reference counter is altered. Therefore enforce them to be placed into separate cachelines. pahole dump after change: struct bpf_map { const struct bpf_map_ops * ops; /* 0 8 */ struct bpf_map * inner_map_meta; /* 8 8 */ void * security; /* 16 8 */ enum bpf_map_type map_type; /* 24 4 */ u32 key_size; /* 28 4 */ u32 value_size; /* 32 4 */ u32 max_entries; /* 36 4 */ u32 map_flags; /* 40 4 */ u32 pages; /* 44 4 */ u32 id; /* 48 4 */ int numa_node; /* 52 4 */ bool unpriv_array; /* 56 1 */ /* XXX 7 bytes hole, try to pack */ /* --- cacheline 1 boundary (64 bytes) --- */ struct user_struct * user; /* 64 8 */ atomic_t refcnt; /* 72 4 */ atomic_t usercnt; /* 76 4 */ struct work_struct work; /* 80 32 */ char name[16]; /* 112 16 */ /* --- cacheline 2 boundary (128 bytes) --- */ /* size: 128, cachelines: 2, members: 17 */ /* sum members: 121, holes: 1, sum holes: 7 */ }; Now all entries in the first cacheline are read only throughout the life time of the map, set up once during map creation. Overall struct size and number of cachelines doesn't change from the reordering. struct bpf_map is usually first member and embedded in map structs in specific map implementations, so also avoid those members to sit at the end where it could potentially share the cacheline with first map values e.g. in the array since remote CPUs could trigger map updates just as well for those (easily dirtying members like max_entries intentionally as well) while having subsequent values in cache. Quoting from Google's Project Zero blog [1]: Additionally, at least on the Intel machine on which this was tested, bouncing modified cache lines between cores is slow, apparently because the MESI protocol is used for cache coherence [8]. Changing the reference counter of an eBPF array on one physical CPU core causes the cache line containing the reference counter to be bounced over to that CPU core, making reads of the reference counter on all other CPU cores slow until the changed reference counter has been written back to memory. Because the length and the reference counter of an eBPF array are stored in the same cache line, this also means that changing the reference counter on one physical CPU core causes reads of the eBPF array's length to be slow on other physical CPU cores (intentional false sharing). While this doesn't 'control' the out-of-bounds speculation through masking the index as in commit b2157399cc98, triggering a manipulation of the map's reference counter is really trivial, so lets not allow to easily affect max_entries from it. Splitting to separate cachelines also generally makes sense from a performance perspective anyway in that fast-path won't have a cache miss if the map gets pinned, reused in other progs, etc out of control path, thus also avoids unintentional false sharing. 
[1] https://googleprojectzero.blogspot.ch/2018/01/reading-privileged-memory-with-side.html Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Change-Id: Id113157c85bdad735e2b10ceaf40eabb24f10130
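An illustrative reduction of the layout trick (this is not the real bpf_map definition): keep the set-once, read-mostly fields on the first cacheline and force the mutable counters onto the next one, so reference-count churn on one CPU cannot keep evicting max_entries on the others.

#include <linux/cache.h>
#include <linux/types.h>

struct hot_object_ops;

struct hot_object {
	/* cacheline 0: read-only after creation */
	const struct hot_object_ops	*ops;
	u32				max_entries;
	u32				flags;

	/* cacheline 1: written on every get/put */
	atomic_t			refcnt ____cacheline_aligned;
	atomic_t			usercnt;
};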
| * net: remove hlist_nulls_add_tail_rcu()Eric Dumazet2022-10-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit d7efc6c11b277d9d80b99b1334a78bfe7d7edf10 ] Alexander Potapenko reported use of uninitialized memory [1] This happens when inserting a request socket into TCP ehash, in __sk_nulls_add_node_rcu(), since sk_reuseport is not initialized. Bug was added by commit d894ba18d4e4 ("soreuseport: fix ordering for mixed v4/v6 sockets") Note that d296ba60d8e2 ("soreuseport: Resolve merge conflict for v4/v6 ordering fix") missed the opportunity to get rid of hlist_nulls_add_tail_rcu() : Both UDP sockets and TCP/DCCP listeners no longer use __sk_nulls_add_node_rcu() for their hash insertion. Since all other sockets have unique 4-tuple, the reuseport status has no special meaning, so we can always use hlist_nulls_add_head_rcu() for them and save few cycles/instructions. [1] ================================================================== BUG: KMSAN: use of uninitialized memory in inet_ehash_insert+0xd40/0x1050 CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.13.0+ #3288 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 Call Trace:  <IRQ>  __dump_stack lib/dump_stack.c:16  dump_stack+0x185/0x1d0 lib/dump_stack.c:52  kmsan_report+0x13f/0x1c0 mm/kmsan/kmsan.c:1016  __msan_warning_32+0x69/0xb0 mm/kmsan/kmsan_instr.c:766  __sk_nulls_add_node_rcu ./include/net/sock.h:684  inet_ehash_insert+0xd40/0x1050 net/ipv4/inet_hashtables.c:413  reqsk_queue_hash_req net/ipv4/inet_connection_sock.c:754  inet_csk_reqsk_queue_hash_add+0x1cc/0x300 net/ipv4/inet_connection_sock.c:765  tcp_conn_request+0x31e7/0x36f0 net/ipv4/tcp_input.c:6414  tcp_v4_conn_request+0x16d/0x220 net/ipv4/tcp_ipv4.c:1314  tcp_rcv_state_process+0x42a/0x7210 net/ipv4/tcp_input.c:5917  tcp_v4_do_rcv+0xa6a/0xcd0 net/ipv4/tcp_ipv4.c:1483  tcp_v4_rcv+0x3de0/0x4ab0 net/ipv4/tcp_ipv4.c:1763  ip_local_deliver_finish+0x6bb/0xcb0 net/ipv4/ip_input.c:216  NF_HOOK ./include/linux/netfilter.h:248  ip_local_deliver+0x3fa/0x480 net/ipv4/ip_input.c:257  dst_input ./include/net/dst.h:477  ip_rcv_finish+0x6fb/0x1540 net/ipv4/ip_input.c:397  NF_HOOK ./include/linux/netfilter.h:248  ip_rcv+0x10f6/0x15c0 net/ipv4/ip_input.c:488  __netif_receive_skb_core+0x36f6/0x3f60 net/core/dev.c:4298  __netif_receive_skb net/core/dev.c:4336  netif_receive_skb_internal+0x63c/0x19c0 net/core/dev.c:4497  napi_skb_finish net/core/dev.c:4858  napi_gro_receive+0x629/0xa50 net/core/dev.c:4889  e1000_receive_skb drivers/net/ethernet/intel/e1000/e1000_main.c:4018  e1000_clean_rx_irq+0x1492/0x1d30 drivers/net/ethernet/intel/e1000/e1000_main.c:4474  e1000_clean+0x43aa/0x5970 drivers/net/ethernet/intel/e1000/e1000_main.c:3819  napi_poll net/core/dev.c:5500  net_rx_action+0x73c/0x1820 net/core/dev.c:5566  __do_softirq+0x4b4/0x8dd kernel/softirq.c:284  invoke_softirq kernel/softirq.c:364  irq_exit+0x203/0x240 kernel/softirq.c:405  exiting_irq+0xe/0x10 ./arch/x86/include/asm/apic.h:638  do_IRQ+0x15e/0x1a0 arch/x86/kernel/irq.c:263  common_interrupt+0x86/0x86 Fixes: d894ba18d4e4 ("soreuseport: fix ordering for mixed v4/v6 sockets") Fixes: d296ba60d8e2 ("soreuseport: Resolve merge conflict for v4/v6 ordering fix") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Alexander Potapenko <glider@google.com> Acked-by: Craig Gallek <kraig@google.com> 
Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Change-Id: Ief0d2dfa78eb6e63cb5ac7578565bebab247eb07
| * udp: no longer use SLAB_DESTROY_BY_RCUEric Dumazet2022-10-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Tom Herbert would like not touching UDP socket refcnt for encapsulated traffic. For this to happen, we need to use normal RCU rules, with a grace period before freeing a socket. UDP sockets are not short lived in the high usage case, so the added cost of call_rcu() should not be a concern. This actually removes a lot of complexity in UDP stack. Multicast receives no longer need to hold a bucket spinlock. Note that ip early demux still needs to take a reference on the socket. Same remark for functions used by xt_socket and xt_PROXY netfilter modules, but this might be changed later. Performance for a single UDP socket receiving flood traffic from many RX queues/cpus. Simple udp_rx using simple recvfrom() loop : 438 kpps instead of 374 kpps : 17 % increase of the peak rate. v2: Addressed Willem de Bruijn feedback in multicast handling - keep early demux break in __udp4_lib_demux_lookup() Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tom Herbert <tom@herbertland.com> Cc: Willem de Bruijn <willemb@google.com> Tested-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net> Change-Id: I4a8092b7f3adc34bf6f7303d5d23bb3a3fec7a7f
| * cgroup: Fix sock_cgroup_data on big-endian.Cong Wang2022-10-28
| | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 14b032b8f8fce03a546dcf365454bec8c4a58d7d ] In order for no_refcnt and is_data to be the lowest order two bits in the 'val' we have to pad out the bitfield of the u8. Fixes: ad0f75e5f57c ("cgroup: fix cgroup_sk_alloc() for sk_clone_lock()") Reported-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Change-Id: I58b286e24b67cf1bcdfb3d6b4aa95289c92966ad
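Roughly the shape of the fixed structure (reconstructed here as an approximation of the cgroup-defs.h layout, so treat field order as indicative): the explicit padding byte keeps is_data and no_refcnt in the two lowest-order bits of 'val' on both endiannesses.

#include <asm/byteorder.h>
#include <linux/types.h>

struct sock_cgroup_data_example {
	union {
#ifdef __LITTLE_ENDIAN
		struct {
			u8	is_data : 1;
			u8	no_refcnt : 1;
			u8	unused : 6;
			u8	padding;
			u16	prioidx;
			u32	classid;
		} __packed;
#else
		struct {
			u32	classid;
			u16	prioidx;
			u8	padding;
			u8	unused : 6;
			u8	no_refcnt : 1;
			u8	is_data : 1;
		} __packed;
#endif
		u64	val;
	};
};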
| * fs: Better permission checking for submountsEric W. Biederman2022-10-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 93faccbbfa958a9668d3ab4e30f38dd205cee8d8 upstream. To support unprivileged users mounting filesystems two permission checks have to be performed: a test to see if the user allowed to create a mount in the mount namespace, and a test to see if the user is allowed to access the specified filesystem. The automount case is special in that mounting the original filesystem grants permission to mount the sub-filesystems, to any user who happens to stumble across the their mountpoint and satisfies the ordinary filesystem permission checks. Attempting to handle the automount case by using override_creds almost works. It preserves the idea that permission to mount the original filesystem is permission to mount the sub-filesystem. Unfortunately using override_creds messes up the filesystems ordinary permission checks. Solve this by being explicit that a mount is a submount by introducing vfs_submount, and using it where appropriate. vfs_submount uses a new mount internal mount flags MS_SUBMOUNT, to let sget and friends know that a mount is a submount so they can take appropriate action. sget and sget_userns are modified to not perform any permission checks on submounts. follow_automount is modified to stop using override_creds as that has proven problemantic. do_mount is modified to always remove the new MS_SUBMOUNT flag so that we know userspace will never by able to specify it. autofs4 is modified to stop using current_real_cred that was put in there to handle the previous version of submount permission checking. cifs is modified to pass the mountpoint all of the way down to vfs_submount. debugfs is modified to pass the mountpoint all of the way down to trace_automount by adding a new parameter. To make this change easier a new typedef debugfs_automount_t is introduced to capture the type of the debugfs automount function. Fixes: 069d5ac9ae0d ("autofs: Fix automounts by using current_real_cred()->uid") Fixes: aeaa4a79ff6a ("fs: Call d_automount with the filesystems creds") Reviewed-by: Trond Myklebust <trond.myklebust@primarydata.com> Reviewed-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Change-Id: I09cb1f35368fb8dc4a64b5ac5a35c9d2843ef95b
| * UPSTREAM: cgroup, bpf: remove unnecessary #includeAlexei Starovoitov2022-10-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | this #include is unnecessary and brings a whole set of other headers into cgroup-defs.h. Remove it. Fixes: 3007098494be ("cgroup: add support for eBPF programs") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Rami Rosen <roszenrami@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Daniel Mack <daniel@zonque.org> Signed-off-by: David S. Miller <davem@davemloft.net> Change-Id: I3df35d8d3b1261503f9b5bcd90b18c9358f1ac28 (cherry picked from commit b634d30a79ecc2d28e61cbe5b1f4443952f37a8f) Signed-off-by: Amit Pundir <amit.pundir@linaro.org> Change-Id: Ie35435947ea16421a538815bdc4953fd67407de6
| * cgroup: fix sock_cgroup_data initialization on earlier compilersTejun Heo2022-10-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | sock_cgroup_data is a struct containing an anonymous union. sock_cgroup_set_prioidx() and sock_cgroup_set_classid() were initializing a field inside the anonymous union as follows. struct sock_ccgroup_data skcd_buf = { .val = VAL }; While this is fine on more recent compilers, gcc-4.4.7 triggers the following errors. include/linux/cgroup-defs.h: In function ‘sock_cgroup_set_prioidx’: include/linux/cgroup-defs.h:619: error: unknown field ‘val’ specified in initializer include/linux/cgroup-defs.h:619: warning: missing braces around initializer include/linux/cgroup-defs.h:619: warning: (near initialization for ‘skcd_buf.<anonymous>’) This is because .val belongs to the anonymous union nested inside the struct but the initializer is missing the nesting. Fix it by adding an extra pair of braces. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Alaa Hleihel <alaa@dev.mellanox.co.il> Fixes: bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup") Signed-off-by: David S. Miller <davem@davemloft.net> Change-Id: I343edb9b0ffe0cf6836f911b88d031c32c541228
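A self-contained reduction of the compiler issue (illustrative types, plain C11): gcc-4.4 rejects a designated initializer that names a member of an anonymous union unless the extra brace level for that union is present.

#include <stdint.h>

struct skcd_like {
	union {				/* anonymous union */
		uint64_t val;
		struct { uint16_t prioidx; uint16_t classid; } fields;
	};
};

/* gcc-4.4: "unknown field 'val' specified in initializer" */
static struct skcd_like old_style = { .val = 1 };

/* The fix: add one pair of braces for the anonymous union level. */
static struct skcd_like new_style = { { .val = 1 } };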
| * sk_buff: allow segmenting based on frag sizesMarcelo Ricardo Leitner2022-10-28
| | | | | | | | | | | | | | | | | | | | This patch allows segmenting a skb based on its frags sizes instead of based on a fixed value. Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Tested-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Change-Id: Ic278a7f5bbe0f86ef348a63508da6819b0d098aa
| * soreuseport: fix ordering for mixed v4/v6 socketsCraig Gallek2022-10-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With the SO_REUSEPORT socket option, it is possible to create sockets in the AF_INET and AF_INET6 domains which are bound to the same IPv4 address. This is only possible with SO_REUSEPORT and when not using IPV6_V6ONLY on the AF_INET6 sockets. Prior to the commits referenced below, an incoming IPv4 packet would always be routed to a socket of type AF_INET when this mixed-mode was used. After those changes, the same packet would be routed to the most recently bound socket (if this happened to be an AF_INET6 socket, it would have an IPv4 mapped IPv6 address). The change in behavior occurred because the recent SO_REUSEPORT optimizations short-circuit the socket scoring logic as soon as they find a match. They did not take into account the scoring logic that favors AF_INET sockets over AF_INET6 sockets in the event of a tie. To fix this problem, this patch changes the insertion order of AF_INET and AF_INET6 addresses in the TCP and UDP socket lists when the sockets have SO_REUSEPORT set. AF_INET sockets will be inserted at the head of the list and AF_INET6 sockets with SO_REUSEPORT set will always be inserted at the tail of the list. This will force AF_INET sockets to always be considered first. Fixes: e32ea7e74727 ("soreuseport: fast reuseport UDP socket selection") Fixes: 125e80b88687 ("soreuseport: fast reuseport TCP socket selection") Reported-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: Craig Gallek <kraig@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Change-Id: I372800560dce28adb9bb0459b65f9a804c2b2cdc
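A compact userspace sketch of the mixed-mode setup described above (port number arbitrary, error handling elided): both sockets share the port via SO_REUSEPORT and the AF_INET6 socket clears IPV6_V6ONLY, so an incoming IPv4 datagram could match either one.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

static int make_sock(int family)
{
	int one = 1, zero = 0;
	int fd = socket(family, SOCK_DGRAM, 0);

	setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
	if (family == AF_INET6) {
		struct sockaddr_in6 a6 = { .sin6_family = AF_INET6,
					   .sin6_port = htons(5000) };

		setsockopt(fd, IPPROTO_IPV6, IPV6_V6ONLY, &zero, sizeof(zero));
		bind(fd, (struct sockaddr *)&a6, sizeof(a6));
	} else {
		struct sockaddr_in a4 = { .sin_family = AF_INET,
					  .sin_port = htons(5000) };

		bind(fd, (struct sockaddr *)&a4, sizeof(a4));
	}
	return fd;
}

int main(void)
{
	int v6 = make_sock(AF_INET6);	/* bound first */
	int v4 = make_sock(AF_INET);	/* bound second; must still win for IPv4 */

	(void)v6; (void)v4;
	pause();			/* keep both sockets alive */
	return 0;
}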
* | Revert "kernel: Only expose su when daemon is running"lineage-19.1Georg Veichtlbauer2023-05-22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch is no longer necessary because we no longer ship su add-ons, which is what this patch was initially designed for. It now causes another issue that breaks custom root solutions such as Magisk: Magisk switches its worker tmpfs dir to RO instead of RW for safety reasons and happens to satisfy the MS_RDONLY check for the su file, leaving the su file completely inaccessible. This reverts commit 08ff8a2e58eb226015fa68d577121137a7e0953f. Change-Id: If25a9ef7e64c79412948f4619e08faaedb18aa13
* | kernel: Only expose su when daemon is runningTom Marshall2022-07-27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It has been claimed that the PG implementation of 'su' has security vulnerabilities even when disabled. Unfortunately, the people who find these vulnerabilities often like to keep them private so they can profit from exploits while leaving users exposed to malicious hackers. In order to reduce the attack surface for vulnerabilities, it is therefore necessary to make 'su' completely inaccessible when it is not in use (except by the root and system users). Change-Id: I79716c72f74d0b7af34ec3a8054896c6559a181d Signed-off-by: Davide Garberi <dade.garberi@gmail.com>
* | pwm: qpnp-pwm: add api for synchronous enable of pwmsScott Mertz2022-07-27
| | | | | | | | | | Change-Id: Ibe4d8f8f8c9dbb89434cd633a049bd9e216af6f4 Signed-off-by: Davide Garberi <dade.garberi@gmail.com>
* | cclogic: Add some edits from ZUKDavide Garberi2022-07-27
| | | | | | | * Reworked from ZUK sources to make it look a bit better * Also adapted to 4.4 and moved to the new directory Signed-off-by: Davide Garberi <dade.garberi@gmail.com>
* | drivers: add cclogic usb type-c driverFaiz Authar2022-07-27
| | | | | | | | | | | | Change-Id: I783aaf058c42e85712f861369cc85db2d6adab61 Signed-off-by: Faiz Authar <faizauthar@gmail.com> Signed-off-by: Davide Garberi <dade.garberi@gmail.com>
* | rbtree: cache leftmost node internallyDavidlohr Bueso2022-07-27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit cd9e61ed1eebbcd5dfad59475d41ec58d9b64b6a upstream. Patch series "rbtree: Cache leftmost node internally", v4. A series to extending rbtrees to internally cache the leftmost node such that we can have fast overlap check optimization for all interval tree users[1]. The benefits of this series are that: (i) Unify users that do internal leftmost node caching. (ii) Optimize all interval tree users. (iii) Convert at least two new users (epoll and procfs) to the new interface. This patch (of 16): Red-black tree semantics imply that nodes with smaller or greater (or equal for duplicates) keys always be to the left and right, respectively. For the kernel this is extremely evident when considering our rb_first() semantics. Enabling lookups for the smallest node in the tree in O(1) can save a good chunk of cycles in not having to walk down the tree each time. To this end there are a few core users that explicitly do this, such as the scheduler and rtmutexes. There is also the desire for interval trees to have this optimization allowing faster overlap checking. This patch introduces a new 'struct rb_root_cached' which is just the root with a cached pointer to the leftmost node. The reason why the regular rb_root was not extended instead of adding a new structure was that this allows the user to have the choice between memory footprint and actual tree performance. The new wrappers on top of the regular rb_root calls are: - rb_first_cached(cached_root) -- which is a fast replacement for rb_first. - rb_insert_color_cached(node, cached_root, new) - rb_erase_cached(node, cached_root) In addition, augmented cached interfaces are also added for basic insertion and deletion operations; which becomes important for the interval tree changes. With the exception of the inserts, which adds a bool for updating the new leftmost, the interfaces are kept the same. To this end, porting rb users to the cached version becomes really trivial, and keeping current rbtree semantics for users that don't care about the optimization requires zero overhead. Link: http://lkml.kernel.org/r/20170719014603.19029-2-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Harsh Shandilya <harsh@prjkt.io>
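A minimal kernel-style sketch of the new cached interface (the 'item' type is hypothetical): the bool passed to rb_insert_color_cached() tells the rbtree code whether the new node became the leftmost one, and rb_first_cached() then returns it in O(1).

#include <linux/rbtree.h>
#include <linux/types.h>

struct item {
	struct rb_node	node;
	u64		key;
};

static struct rb_root_cached item_tree = RB_ROOT_CACHED;

static void item_insert(struct item *new)
{
	struct rb_node **link = &item_tree.rb_root.rb_node, *parent = NULL;
	bool leftmost = true;

	while (*link) {
		struct item *cur = rb_entry(*link, struct item, node);

		parent = *link;
		if (new->key < cur->key) {
			link = &parent->rb_left;
		} else {
			link = &parent->rb_right;
			leftmost = false;	/* new node is not the leftmost */
		}
	}

	rb_link_node(&new->node, parent, link);
	rb_insert_color_cached(&new->node, &item_tree, leftmost);
}

static struct item *item_first(void)
{
	struct rb_node *n = rb_first_cached(&item_tree);	/* O(1) */

	return n ? rb_entry(n, struct item, node) : NULL;
}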
* | block: do not merge requests without consulting with io schedulerTahsin Erdogan2022-07-27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Before merging a bio into an existing request, io scheduler is called to get its approval first. However, the requests that come from a plug flush may get merged by block layer without consulting with io scheduler. In case of CFQ, this can cause fairness problems. For instance, if a request gets merged into a low weight cgroup's request, high weight cgroup now will depend on low weight cgroup to get scheduled. If high weigt cgroup needs that io request to complete before submitting more requests, then it will also lose its timeslice. Following script demonstrates the problem. Group g1 has a low weight, g2 and g3 have equal high weights but g2's requests are adjacent to g1's requests so they are subject to merging. Due to these merges, g2 gets poor disk time allocation. cat > cfq-merge-repro.sh << "EOF" #!/bin/bash set -e IO_ROOT=/mnt-cgroup/io mkdir -p $IO_ROOT if ! mount | grep -qw $IO_ROOT; then mount -t cgroup none -oblkio $IO_ROOT fi cd $IO_ROOT for i in g1 g2 g3; do if [ -d $i ]; then rmdir $i fi done mkdir g1 && echo 10 > g1/blkio.weight mkdir g2 && echo 495 > g2/blkio.weight mkdir g3 && echo 495 > g3/blkio.weight RUNTIME=10 (echo $BASHPID > g1/cgroup.procs && fio --readonly --name name1 --filename /dev/sdb \ --rw read --size 64k --bs 64k --time_based \ --runtime=$RUNTIME --offset=0k &> /dev/null)& (echo $BASHPID > g2/cgroup.procs && fio --readonly --name name1 --filename /dev/sdb \ --rw read --size 64k --bs 64k --time_based \ --runtime=$RUNTIME --offset=64k &> /dev/null)& (echo $BASHPID > g3/cgroup.procs && fio --readonly --name name1 --filename /dev/sdb \ --rw read --size 64k --bs 64k --time_based \ --runtime=$RUNTIME --offset=256k &> /dev/null)& sleep $((RUNTIME+1)) for i in g1 g2 g3; do echo ---- $i ---- cat $i/blkio.time done EOF # ./cfq-merge-repro.sh ---- g1 ---- 8:16 162 ---- g2 ---- 8:16 165 ---- g3 ---- 8:16 686 After applying the patch: # ./cfq-merge-repro.sh ---- g1 ---- 8:16 90 ---- g2 ---- 8:16 445 ---- g3 ---- 8:16 471 Signed-off-by: Tahsin Erdogan <tahsin@google.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* | BACKPORT: block: use ktime_get_ns() instead of sched_clock() for cfqOmar Sandoval2022-07-27
|/ | | | | | | | | | | | | cfq has some internal fields that use sched_clock() which can trivially use ktime_get_ns() instead. Their timestamp fields in struct request can also use ktime_get_ns(), which resolves the 8 year old comment added by commit 28f4197e5d47 ("block: disable preemption before using sched_clock()"). Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: celtare21 <celtare21@gmail.com> Signed-off-by: Yaroslav Furman <yaro330@gmail.com> - improved comment.
* bpf: fix missing header inclusionZi Shen Lim2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 0fc174dea545 ("ebpf: make internal bpf API independent of CONFIG_BPF_SYSCALL ifdefs") introduced usage of ERR_PTR() in bpf_prog_get(), however did not include linux/err.h. Without this patch, when compiling arm64 BPF without CONFIG_BPF_SYSCALL: ... In file included from arch/arm64/net/bpf_jit_comp.c:21:0: include/linux/bpf.h: In function 'bpf_prog_get': include/linux/bpf.h:235:9: error: implicit declaration of function 'ERR_PTR' [-Werror=implicit-function-declaration] return ERR_PTR(-EOPNOTSUPP); ^ include/linux/bpf.h:235:9: warning: return makes pointer from integer without a cast [-Wint-conversion] In file included from include/linux/rwsem.h:17:0, from include/linux/mm_types.h:10, from include/linux/sched.h:27, from arch/arm64/include/asm/compat.h:25, from arch/arm64/include/asm/stat.h:23, from include/linux/stat.h:5, from include/linux/compat.h:12, from include/linux/filter.h:10, from arch/arm64/net/bpf_jit_comp.c:22: include/linux/err.h: At top level: include/linux/err.h:23:35: error: conflicting types for 'ERR_PTR' static inline void * __must_check ERR_PTR(long error) ^ In file included from arch/arm64/net/bpf_jit_comp.c:21:0: include/linux/bpf.h:235:9: note: previous implicit declaration of 'ERR_PTR' was here return ERR_PTR(-EOPNOTSUPP); ^ ... Fixes: 0fc174dea545 ("ebpf: make internal bpf API independent of CONFIG_BPF_SYSCALL ifdefs") Suggested-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Zi Shen Lim <zlim.lnx@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net> Change-Id: I5b6a63374d8d137f7d1b2e79cf91178c988a68fc
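The fix itself is just the missing include; a sketch of the affected stub, under the assumption that the !CONFIG_BPF_SYSCALL variant looks roughly the way the quoted error output suggests:

/* include/linux/bpf.h */
#include <linux/err.h>		/* provides ERR_PTR() */

#ifndef CONFIG_BPF_SYSCALL
static inline struct bpf_prog *bpf_prog_get(u32 ufd)
{
	return ERR_PTR(-EOPNOTSUPP);
}
#endif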
* BACKPORT: ANDROID: Remove xt_qtaguid module from new kernels.Chenbo Feng2022-04-19
| | | | | | | | | | | | For new devices shipping with a 4.9 kernel, the eBPF replacement should cover all the functionality of xt_qtaguid, and it is now safe to remove this Android-only module from the kernel. Signed-off-by: Chenbo Feng <fengc@google.com> Signed-off-by: Bruno Martins <bgcngm@gmail.com> Bug: 79938294 Test: kernel build Change-Id: I032aecc048f7349f6a0c5192dd381f286fc7e5bf
* security: Add hook to invalidate inode security labelsAndreas Gruenbacher2022-04-19
| | | | | | | | | | | | | Add a hook to invalidate an inode's security label when the cached information becomes invalid. Add the new hook in selinux: set a flag when a security label becomes invalid. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: James Morris <james.l.morris@oracle.com> Acked-by: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: Paul Moore <pmoore@redhat.com>
* security: Make inode argument of inode_getsecid non-constAndreas Gruenbacher2022-04-19
| | | | | | | | | Make the inode argument of the inode_getsecid hook non-const so that we can use it to revalidate invalid security labels. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Acked-by: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: Paul Moore <pmoore@redhat.com>
* security: Make inode argument of inode_getsecurity non-constAndreas Gruenbacher2022-04-19
| | | | | | | | | Make the inode argument of the inode_getsecurity hook non-const so that we can use it to revalidate invalid security labels. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Acked-by: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: Paul Moore <pmoore@redhat.com>
* ANDROID: bpf: validate bpf_func when BPF_JIT is enabled with CFISami Tolvanen2022-04-19
| | | | | | | | | | | | | | | | | With CONFIG_BPF_JIT, the kernel makes indirect calls to dynamically generated code, which the compile-time Control-Flow Integrity (CFI) checking cannot validate. This change adds basic sanity checking to ensure we are jumping to a valid location, which narrows down the attack surface on the stored pointer. In addition, this change adds a weak arch_bpf_jit_check_func function, which architectures that implement BPF JIT can override to perform additional validation, such as verifying that the pointer points to the correct memory region. Bug: 140377409 Change-Id: I8ebac6637ab6bd9db44716b1c742add267298669 Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
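A sketch of how an architecture could override the weak hook described above; the signature and the module-region policy are assumptions for illustration, not a copy of the actual arm64 implementation.

#include <linux/filter.h>

bool arch_bpf_jit_check_func(const struct bpf_prog *prog)
{
	const uintptr_t addr = (uintptr_t)prog->bpf_func;

	if (!prog->jited)
		return true;

	/* Assumed policy: JITed images must live in the arch's module area
	 * (MODULES_VADDR/MODULES_END are arch-provided) and be aligned. */
	return addr >= MODULES_VADDR && addr < MODULES_END &&
	       IS_ALIGNED(addr, sizeof(u32));
}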
* UPSTREAM: bpf: bpf_prog_array_alloc() should return a generic non-rcu pointerRoman Gushchin2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently the return type of the bpf_prog_array_alloc() is struct bpf_prog_array __rcu *, which is not quite correct. Obviously, the returned pointer is a generic pointer, which is valid for an indefinite amount of time and it's not shared with anyone else, so there is no sense in marking it as __rcu. This change eliminate the following sparse warnings: kernel/bpf/core.c:1544:31: warning: incorrect type in return expression (different address spaces) kernel/bpf/core.c:1544:31: expected struct bpf_prog_array [noderef] <asn:4>* kernel/bpf/core.c:1544:31: got void * kernel/bpf/core.c:1548:17: warning: incorrect type in return expression (different address spaces) kernel/bpf/core.c:1548:17: expected struct bpf_prog_array [noderef] <asn:4>* kernel/bpf/core.c:1548:17: got struct bpf_prog_array *<noident> kernel/bpf/core.c:1681:15: warning: incorrect type in assignment (different address spaces) kernel/bpf/core.c:1681:15: expected struct bpf_prog_array *array kernel/bpf/core.c:1681:15: got struct bpf_prog_array [noderef] <asn:4>* Fixes: 324bda9e6c5a ("bpf: multi program support for cgroup+bpf") Signed-off-by: Roman Gushchin <guro@fb.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> (cherry picked from commit d29ab6e1fa21ebc2a8a771015dd9e0e5d4e28dc1) Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: I18f048dfa46935e6c23744879f04a93b2a747233
* bpf: Prevent increasing bpf_jit_limit above maxLorenz Bauer2022-04-19
| | | | | | | | | | | [ Upstream commit fadb7ff1a6c2c565af56b4aacdd086b067eed440 ] Restrict bpf_jit_limit to the maximum supported by the arch's JIT. Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20211014142554.53120-4-lmb@cloudflare.com Signed-off-by: Sasha Levin <sashal@kernel.org>