summaryrefslogtreecommitdiff
path: root/include/uapi/linux (follow)
Commit message (Collapse)AuthorAge
* Merge remote-tracking branch 'msm8998/lineage-20' into lineage-20Raghuram Subramani2024-10-17
| | | | Change-Id: I126075a330f305c85f8fe1b8c9d408f368be95d1
* Merge lineage-20 of git@github.com:LineageOS/android_kernel_qcom_msm8998.git ↵Davide Garberi2023-08-03
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | into lineage-20 1a4b80f8f201 ANDROID: arch:arm64: Increase kernel command line size 7c253f7aa663 of: reserved_mem: increase max number reserved regions df4dbf557503 msm: camera: Fix indentations 2fc4a156d15d msm: camera: Fix code flow when populating CAM_V_CUSTOM1 687bcb61f125 ALSA: control: use counting semaphore as write lock for ELEM_WRITE operation 75cf9e8c1b1c ALSA: control: Fix memory corruption risk in snd_ctl_elem_read 76cf3b5e53df ALSA: control: code refactoring for ELEM_READ/ELEM_WRITE operations e9af212f9685 ALSA: pcm: Move rwsem lock inside snd_ctl_elem_read to prevent UAF 95fc4fff573f msm: kgsl: Make sure that pool pages don't have any extra references 59ceabe0d242 msm: kgsl: Use dma_buf_get() to get dma_buf structure d1f19956d6b9 ANDROID: usb: f_accessory: Check buffer size when initialised via composite 2d3ce4f7a366 kbuild: handle libs-y archives separately from built-in.o archives 65dc3fbd1593 kbuild: thin archives use P option to ar 362c7b73bac8 kbuild: thin archives for multi-y targets 43076241b514 kbuild: thin archives final link close --whole-archives option aa04fc78256d kbuild: minor improvement for thin archives build f5896747cda6 Merge tag 'LA.UM.7.2.c25-07700-sdm660.0' of https://git.codelinaro.org/clo/la/platform/vendor/qcom-opensource/wlan/qcacld-3.0 into android13-4.4-msm8998 321ac077ee7e qcacld-3.0: Fix out-of-bounds in tx_stats 42be8e4cbf13 BACKPORT: usb: gadget: rndis: prevent integer overflow in rndis_set_response() b490a85b5945 FROMGIT: arm64: fix oops in concurrently setting insn_emulation sysctls 7ed7084b34a9 FROMLIST: binder: fix UAF of ref->proc caused by race condition e31f087fb864 ANDROID: selinux: modify RTM_GETNEIGH{TBL} 80675d431434 UPSTREAM: usb: gadget: clear related members when goto fail fb6adfb00108 UPSTREAM: usb: gadget: don't release an existing dev->buf e4a8dd12424e UPSTREAM: USB: gadget: validate interface OS descriptor requests 8f0a947317e0 UPSTREAM: usb: gadget: rndis: check size of RNDIS_MSG_SET command 1541758765ff ion: Do not 'put' ION handle until after its final use 03b4b3cd8d30 Merge tag 'LA.UM.7.2.c25-07000-sdm660.0' of https://git.codelinaro.org/clo/la/platform/vendor/qcom-opensource/wlan/qcacld-3.0 into android13-4.4-msm8998 7dbda95466d5 Merge tag 'LA.UM.8.4.c25-06600-8x98.0' of https://git.codelinaro.org/clo/la/kernel/msm-4.4 into android13-4.4-msm8998 369119e5df4e cert host tools: Stop complaining about deprecated OpenSSL functions f8e30a0f9a17 fixup! BACKPORT: treewide: Fix function prototypes for module_param_call() 4fa5045f3dc9 arm64/efi: Mark __efistub_stext_offset as an absolute symbol explicitly bcd9668da77f arm64: kernel: do not need to reset UAO on exception entry c4ddd677f7e3 Kbuild: do not emit debug info for assembly with LLVM_IAS=1 1b880b6e19f8 qcacld-3.0: Add time slice duty cycle in wifi_interface_info fd24be2b22a1 qcacmn: Add time slice duty cycle attribute into QCA vendor command d719c1c825f8 qcacld-3.0: Use field-by-field assignment for FW stats fb5eb3bda2d9 ext4: enable quota enforcement based on mount options cd40d7f301de ext4: adds project ID support 360e2f3d18b8 ext4: add project quota support c31ac2be1594 drivers: qcacld-3.0: Remove in_compat_syscall() redefinition 6735c13a269d arm64: link with -z norelro regardless of CONFIG_RELOCATABLE 99962aab3433 arm64: relocatable: fix inconsistencies in linker script and options 24bd8cc5e6bb arm64: prevent regressions in compressed kernel image size when upgrading to binutils 2.27 93bb4c2392a2 arm64: kernel: force ET_DYN ELF type for CONFIG_RELOCATABLE=y a54bbb725ccb arm64: build with baremetal linker target instead of Linux when available c5805c604a9b arm64: add endianness option to LDFLAGS instead of LD ab6052788f60 arm64: Set UTS_MACHINE in the Makefile c3330429b2c6 kbuild: clear LDFLAGS in the top Makefile f33c1532bd61 kbuild: use HOSTLDFLAGS for single .c executables 38b7db363a96 BACKPORT: arm64: Change .weak to SYM_FUNC_START_WEAK_PI for arch/arm64/lib/mem*.S 716cb63e81d9 BACKPORT: crypto: arm64/aes-ce-cipher - move assembler code to .S file 7dfbaee16432 BACKPORT: arm64: Remove reference to asm/opcodes.h 531ee8624d17 BACKPORT: arm64: kprobe: protect/rename few definitions to be reused by uprobe 08d83c997b0c BACKPORT: arm64: Delete the space separator in __emit_inst e3951152dc2d BACKPORT: arm64: Get rid of asm/opcodes.h 255820c0f301 BACKPORT: arm64: Fix minor issues with the dcache_by_line_op macro 21bb344a664b BACKPORT: crypto: arm64/aes-modes - get rid of literal load of addend vector 26d5a53c6e0d BACKPORT: arm64: vdso: remove commas between macro name and arguments 78bff1f77c9d BACKPORT: kbuild: support LLVM=1 to switch the default tools to Clang/LLVM 6634f9f63efe BACKPORT: kbuild: replace AS=clang with LLVM_IAS=1 b891e8fdc466 BACKPORT: Documentation/llvm: fix the name of llvm-size 75d6fa8368a8 BACKPORT: Documentation/llvm: add documentation on building w/ Clang/LLVM 95b0a5e52f2a BACKPORT: ANDROID: ftrace: fix function type mismatches 7da9c2138ec8 BACKPORT: ANDROID: fs: logfs: fix filler function type d6d5a4b28ad0 BACKPORT: ANDROID: fs: gfs2: fix filler function type 9b194a470db5 BACKPORT: ANDROID: fs: exofs: fix filler function type 7a45ac4bfb49 BACKPORT: ANDROID: fs: afs: fix filler function type 4099e1b281e5 BACKPORT: drivers/perf: arm_pmu: fix function type mismatch af7b738882f7 BACKPORT: dummycon: fix function types 1b0b55a36dbe BACKPORT: fs: nfs: fix filler function type a58a0e30e20a BACKPORT: mm: fix filler function type mismatch 829e9226a8c0 BACKPORT: mm: fix drain_local_pages function type 865ef61b4da8 BACKPORT: vfs: pass type instead of fn to do_{loop,iter}_readv_writev() 08d2f8e7ba8e BACKPORT: module: Do not paper over type mismatches in module_param_call() ea467f6c33e4 BACKPORT: treewide: Fix function prototypes for module_param_call() d131459e6b8b BACKPORT: module: Prepare to convert all module_param_call() prototypes 6f52abadf006 BACKPORT: kbuild: fix --gc-sections bf7540ffce44 BACKPORT: kbuild: record needed exported symbols for modules c49d2545e437 BACKPORT: kbuild: Allow to specify composite modules with modname-m 427d0fc67dc1 BACKPORT: kbuild: add arch specific post-link Makefile 69f8a31838a3 BACKPORT: arm64: add a workaround for GNU gold with ARM64_MODULE_PLTS ba3368756abf BACKPORT: arm64: explicitly pass --no-fix-cortex-a53-843419 to GNU gold 6dacd7e737fb BACKPORT: arm64: errata: Pass --fix-cortex-a53-843419 to ld if workaround enabled d2787c21f2b5 BACKPORT: kbuild: add __ld-ifversion and linker-specific macros 2d471de60bb4 BACKPORT: kbuild: add ld-name macro 06280a90d845 BACKPORT: arm64: keep .altinstructions and .altinstr_replacement eb0ad3ae07f9 BACKPORT: kbuild: add __cc-ifversion and compiler-specific variants 3d01e1eba86b BACKPORT: FROMLIST: kbuild: add clang-version.sh 18dd378ab563 BACKPORT: FROMLIST: kbuild: fix LD_DEAD_CODE_DATA_ELIMINATION aabbc122b1de BACKPORT: kbuild: thin archives make default for all archs 756d47e345fc BACKPORT: kbuild: allow archs to select link dead code/data elimination 723ab99e48a7 BACKPORT: kbuild: allow architectures to use thin archives instead of ld -r 0b77ec583772 drivers/usb/serial/console.c: remove superfluous serial->port condition 6488cb478f04 drivers/firmware/efi/libstub.c: prevent a relocation dba4259216a0 UPSTREAM: pidfd: fix a poll race when setting exit_state baab6e33b07b BACKPORT: arch: wire-up pidfd_open() 5d2e9e4f8630 BACKPORT: pid: add pidfd_open() f8396a127daf UPSTREAM: pidfd: add polling support f4c358582254 UPSTREAM: signal: improve comments 5500316dc8d8 UPSTREAM: fork: do not release lock that wasn't taken fc7d707593e3 BACKPORT: signal: support CLONE_PIDFD with pidfd_send_signal f044fa00d72a BACKPORT: clone: add CLONE_PIDFD f20fc1c548f2 UPSTREAM: Make anon_inodes unconditional de80525cd462 UPSTREAM: signal: use fdget() since we don't allow O_PATH 229e1bdd624e UPSTREAM: signal: don't silently convert SI_USER signals to non-current pidfd ada02e996b52 BACKPORT: signal: add pidfd_send_signal() syscall 828857678c5c compat: add in_compat_syscall to ask whether we're in a compat syscall e7aede4896c0 bpf: Add new cgroup attach type to enable sock modifications 9ed75228b09c ebpf: allow bpf_get_current_uid_gid_proto also for networking c5aa3963b4ae bpf: fix overflow in prog accounting c46a001439fc bpf: Make sure mac_header was set before using it 8aed99185615 bpf: Enlarge offset check value to INT_MAX in bpf_skb_{load,store}_bytes b0a638335ba6 bpf: avoid false sharing of map refcount with max_entries 1f21605e373c net: remove hlist_nulls_add_tail_rcu() 9ce369b09dbb udp: get rid of SLAB_DESTROY_BY_RCU allocations 070f539fb5d7 udp: no longer use SLAB_DESTROY_BY_RCU a32d2ea857c5 inet: refactor inet[6]_lookup functions to take skb fcf3e7bc7203 soreuseport: fix initialization race df03c8cf024a soreuseport: Fix TCP listener hash collision bd8b9f50c9d3 inet: Fix missing return value in inet6_hash bae331196dd0 soreuseport: fast reuseport TCP socket selection 4ada2ed73da0 inet: create IPv6-equivalent inet_hash function 73f609838475 sock: struct proto hash function may error e3b32750621b cgroup: Fix sock_cgroup_data on big-endian. 69dabcedd4b9 selinux: always allow mounting submounts 17d6ddebcc49 userns: Don't fail follow_automount based on s_user_ns cbd08255e6f8 fs: Better permission checking for submounts 3a9ace719251 mnt: Move the FS_USERNS_MOUNT check into sget_userns af53549b43c5 locks: sprinkle some tracepoints around the file locking code 07dbbc84aa34 locks: rename __posix_lock_file to posix_lock_inode 400cbe93d180 autofs: Fix automounts by using current_real_cred()->uid 7903280ee07a fs: Call d_automount with the filesystems creds b87fb50ff1cd UPSTREAM: kernfs: Check KERNFS_HAS_RELEASE before calling kernfs_release_file() c9c596de3e52 UPSTREAM: kernfs: fix locking around kernfs_ops->release() callback 2172eaf5a901 UPSTREAM: cgroup, bpf: remove unnecessary #include dc81f3963dde kernfs: kernfs_sop_show_path: don't return 0 after seq_dentry call ce9a52e20897 cgroup: Make rebind_subsystems() disable v2 controllers all at once ce5e3aa14c39 cgroup: fix sock_cgroup_data initialization on earlier compilers 94a70ef24da9 samples/bpf: fix bpf_perf_event_output prototype c1920272278e net: gso: Fix skb_segment splat when splitting gso_size mangled skb having linear-headed frag_list d7707635776b sk_buff: allow segmenting based on frag sizes 924bbacea75e ip_tunnel, bpf: ip_tunnel_info_opts_{get, set} depends on CONFIG_INET 0e9008d618f4 bpf: udp: ipv6: Avoid running reuseport's bpf_prog from __udp6_lib_err 01b437940f5e soreuseport: add compat case for setsockopt SO_ATTACH_REUSEPORT_CBPF 421fbf04bf2c soreuseport: change consume_skb to kfree_skb in error case 1ab50514c430 ipv6: Fix SO_REUSEPORT UDP socket with implicit sk_ipv6only f3dfd61c502d soreuseport: fix ordering for mixed v4/v6 sockets 245ee3c90795 soreuseport: fix NULL ptr dereference SO_REUSEPORT after bind 113fb209854a bpf: do not blindly change rlimit in reuseport net selftest 985253ef27d2 bpf: fix rlimit in reuseport net selftest ae61334510be soreuseport: Fix reuseport_bpf testcase on 32bit architectures 6efa24da01a5 udp: fix potential infinite loop in SO_REUSEPORT logic 66df70c6605d soreuseport: BPF selection functional test for TCP fe161031b8a8 soreuseport: pass skb to secondary UDP socket lookup 9223919efdf2 soreuseport: BPF selection functional test 2090ed790dbb soreuseport: fix mem leak in reuseport_add_sock() 67887f6ac3f1 Merge "diag: Ensure dci entry is valid before sending the packet" e41c0da23b38 diag: Prevent out of bound write while sending dci pkt to remote e1085d1ef39b diag: Ensure dci entry is valid before sending the packet 16802e80ecb5 Merge "ion: Fix integer overflow in msm_ion_custom_ioctl" 57146f83f388 ion: Fix integer overflow in msm_ion_custom_ioctl 6fc2001969fe diag: Use valid data_source for a valid token 0c6dbf858a98 qcacld-3.0: Avoid OOB read in dot11f_unpack_assoc_response f07caca0c485 qcacld-3.0: Fix array OOB for duplicate rate 5a359aba0364 msm: kgsl: Remove 'fd' dependency to get dma_buf handle da8317596949 msm: kgsl: Fix gpuaddr_in_range() to check upper bound 2ed91a98d8b4 msm: adsprpc: Handle UAF in fastrpc debugfs read 2967159ad303 msm: kgsl: Add a sysfs node to control performance counter reads e392a84f25f5 msm: kgsl: Perform cache flush on the pages obtained using get_user_pages() 28b45f75d2ee soc: qcom: hab: Add sanity check for payload_count 885caec7690f Merge "futex: Fix inode life-time issue" 0f57701d2643 Merge "futex: Handle faults correctly for PI futexes" 7d7eb450c333 Merge "futex: Rework inconsistent rt_mutex/futex_q state" 124ebd87ef2f msm: kgsl: Fix out of bound write in adreno_profile_submit_time 228bbfb25032 futex: Fix inode life-time issue 7075ca6a22b3 futex: Handle faults correctly for PI futexes a436b73e9032 futex: Simplify fixup_pi_state_owner() 11b99dbe3221 futex: Use pi_state_update_owner() in put_pi_state() f34484030550 rtmutex: Remove unused argument from rt_mutex_proxy_unlock() 079d1c90b3c3 futex: Provide and use pi_state_update_owner() 3b51e24eb17b futex: Replace pointless printk in fixup_owner() 0eac5c2583a1 futex: Avoid violating the 10th rule of futex 6d6ed38b7d10 futex: Rework inconsistent rt_mutex/futex_q state 3c8f7dfd59b5 futex: Remove rt_mutex_deadlock_account_*() 9c870a329520 futex,rt_mutex: Provide futex specific rt_mutex API 7504736e8725 msm: adsprpc: Handle UAF in process shell memory 994e5922a0c2 Disable TRACER Check to improve Camera Performance 8fb3f17b3ad1 msm: kgsl: Deregister gpu address on memdesc_sg_virt failure 13aa628efdca Merge "crypto: Fix possible stack out-of-bound error" 92e777451003 Merge "msm: kgsl: Correct the refcount on current process PID." 9ca218394ed4 Merge "msm: kgsl: Compare pid pointer instead of TGID for a new process" 7eed1f2e0f43 Merge "qcom,max-freq-level change for trial" 6afb5eb98e36 crypto: Fix possible stack out-of-bound error 8b5ba278ed4b msm: kgsl: Correct the refcount on current process PID. 4150552fac96 msm: kgsl: Compare pid pointer instead of TGID for a new process c272102c0793 qcom,max-freq-level change for trial 854ef3ce73f5 msm: kgsl: Protect the memdesc->gpuaddr in SVM use cases. 79c8161aeac9 msm: kgsl: Stop using memdesc->usermem. Change-Id: Iea7db1362c3cd18e36f243411e773a9054f6a445
| * BACKPORT: clone: add CLONE_PIDFDChristian Brauner2022-10-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patchset makes it possible to retrieve pid file descriptors at process creation time by introducing the new flag CLONE_PIDFD to the clone() system call. Linus originally suggested to implement this as a new flag to clone() instead of making it a separate system call. As spotted by Linus, there is exactly one bit for clone() left. CLONE_PIDFD creates file descriptors based on the anonymous inode implementation in the kernel that will also be used to implement the new mount api. They serve as a simple opaque handle on pids. Logically, this makes it possible to interpret a pidfd differently, narrowing or widening the scope of various operations (e.g. signal sending). Thus, a pidfd cannot just refer to a tgid, but also a tid, or in theory - given appropriate flag arguments in relevant syscalls - a process group or session. A pidfd does not represent a privilege. This does not imply it cannot ever be that way but for now this is not the case. A pidfd comes with additional information in fdinfo if the kernel supports procfs. The fdinfo file contains the pid of the process in the callers pid namespace in the same format as the procfs status file, i.e. "Pid:\t%d". As suggested by Oleg, with CLONE_PIDFD the pidfd is returned in the parent_tidptr argument of clone. This has the advantage that we can give back the associated pid and the pidfd at the same time. To remove worries about missing metadata access this patchset comes with a sample program that illustrates how a combination of CLONE_PIDFD, and pidfd_send_signal() can be used to gain race-free access to process metadata through /proc/<pid>. The sample program can easily be translated into a helper that would be suitable for inclusion in libc so that users don't have to worry about writing it themselves. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Christian Brauner <christian@brauner.io> Co-developed-by: Jann Horn <jannh@google.com> Signed-off-by: Jann Horn <jannh@google.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Kees Cook <keescook@chromium.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: David Howells <dhowells@redhat.com> Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com> Cc: Andy Lutomirsky <luto@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> (cherry picked from commit b3e5838252665ee4cfa76b82bdf1198dca81e5be) Conflicts: kernel/fork.c (1. Replaced proc_pid_ns() with its direct implementation.) Bug: 135608568 Test: test program using syscall(__NR_sys_pidfd_open,..) and poll() Change-Id: I3c804a92faea686e5bf7f99df893fe3a5d87ddf7 Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: electimon <electimon@gmail.com>
| * bpf: Add new cgroup attach type to enable sock modificationsDavid Ahern2022-10-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add new cgroup based program type, BPF_PROG_TYPE_CGROUP_SOCK. Similar to BPF_PROG_TYPE_CGROUP_SKB programs can be attached to a cgroup and run any time a process in the cgroup opens an AF_INET or AF_INET6 socket. Currently only sk_bound_dev_if is exported to userspace for modification by a bpf program. This allows a cgroup to be configured such that AF_INET{6} sockets opened by processes are automatically bound to a specific device. In turn, this enables the running of programs that do not support SO_BINDTODEVICE in a specific VRF context / L3 domain. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Change-Id: I96a6f6f8f650c494d8c173dbb42580a25698368e
| * fs: Better permission checking for submountsEric W. Biederman2022-10-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 93faccbbfa958a9668d3ab4e30f38dd205cee8d8 upstream. To support unprivileged users mounting filesystems two permission checks have to be performed: a test to see if the user allowed to create a mount in the mount namespace, and a test to see if the user is allowed to access the specified filesystem. The automount case is special in that mounting the original filesystem grants permission to mount the sub-filesystems, to any user who happens to stumble across the their mountpoint and satisfies the ordinary filesystem permission checks. Attempting to handle the automount case by using override_creds almost works. It preserves the idea that permission to mount the original filesystem is permission to mount the sub-filesystem. Unfortunately using override_creds messes up the filesystems ordinary permission checks. Solve this by being explicit that a mount is a submount by introducing vfs_submount, and using it where appropriate. vfs_submount uses a new mount internal mount flags MS_SUBMOUNT, to let sget and friends know that a mount is a submount so they can take appropriate action. sget and sget_userns are modified to not perform any permission checks on submounts. follow_automount is modified to stop using override_creds as that has proven problemantic. do_mount is modified to always remove the new MS_SUBMOUNT flag so that we know userspace will never by able to specify it. autofs4 is modified to stop using current_real_cred that was put in there to handle the previous version of submount permission checking. cifs is modified to pass the mountpoint all of the way down to vfs_submount. debugfs is modified to pass the mountpoint all of the way down to trace_automount by adding a new parameter. To make this change easier a new typedef debugfs_automount_t is introduced to capture the type of the debugfs automount function. Fixes: 069d5ac9ae0d ("autofs: Fix automounts by using current_real_cred()->uid") Fixes: aeaa4a79ff6a ("fs: Call d_automount with the filesystems creds") Reviewed-by: Trond Myklebust <trond.myklebust@primarydata.com> Reviewed-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Change-Id: I09cb1f35368fb8dc4a64b5ac5a35c9d2843ef95b
* | mdss: fbdev: Import Zuk z2 row drivers changesDavide Garberi2022-07-27
| | | | | | | | | | | | | | * This is needed to make the display turn on after turning it off * All of this isn't needed at all on z2_plus Signed-off-by: Davide Garberi <dade.garberi@gmail.com>
* | drivers: import zuk sound changesFaiz Authar2022-07-27
| | | | | | | | | | Signed-off-by: dd3boh <dade.garberi@gmail.com> Signed-off-by: Davide Garberi <dade.garberi@gmail.com>
* | include: add key_nav and gestureCallMESuper2022-07-27
|/ | | | | Change-Id: Icd57ecad17abea47cf1c939cde45ea8f4dc3e503 Signed-off-by: Davide Garberi <dade.garberi@gmail.com>
* bpf: add XDP_TX xdp_action for direct forwardingBrenden Blanco2022-04-19
| | | | | | | | | XDP enabled drivers must transmit received packets back out on the same port they were received on when a program returns this action. Signed-off-by: Brenden Blanco <bblanco@plumgrid.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* bpf, trace: add BPF_F_CURRENT_CPU flag for bpf_perf_event_readDaniel Borkmann2022-04-19
| | | | | | | | | | | | Follow-up commit to 1e33759c788c ("bpf, trace: add BPF_F_CURRENT_CPU flag for bpf_perf_event_output") to add the same functionality into bpf_perf_event_read() helper. The split of index into flags and index component is also safe here, since such large maps are rejected during map allocation time. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* BACKPORT: UPSTREAM: Add a eBPF helper function to retrieve socket uidChenbo Feng2022-04-19
| | | | | | | | | | | | | | | Cherry-pick from commit 6acc5c2910689fc6ee181bf63085c5efff6a42bd Returns the owner uid of the socket inside a sk_buff. This is useful to perform per-UID accounting of network traffic or per-UID packet filtering. The socket need to be a fullsock otherwise overflowuid is returned. Signed-off-by: Chenbo Feng <fengc@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Bug: 30950746 Change-Id: Idc00947ccfdd4e9f2214ffc4178d701cd9ead0ac Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* BACKPORT: UPSTREAM: Add a helper function to get socket cookie in eBPFChenbo Feng2022-04-19
| | | | | | | | | | | | | | | | | | | Cherrypick from commit: 91b8270f2a4d1d9b268de90451cdca63a70052d6 Retrieve the socket cookie generated by sock_gen_cookie() from a sk_buff with a known socket. Generates a new cookie if one was not yet set.If the socket pointer inside sk_buff is NULL, 0 is returned. The helper function coud be useful in monitoring per socket networking traffic statistics and provide a unique socket identifier per namespace. Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Chenbo Feng <fengc@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Bug: 30950746 Change-Id: I95918dcc3ceffb3061495a859d28aee88e3cde3c Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* ANDROID: Fix missing uapi headersChenbo Feng2022-04-19
| | | | | | | | | | | | | Update the missing bpf helper function name in bpf_func_id to keep the uapi header consistent with upstream uapi header because we need the new added bpf helper function bpf get_socket_cookie and get_socket_uid. The patch related to those headers are not backetported since they are not related and backport them will bring in extra confilict. Signed-off-by: Chenbo Feng <fengc@google.com> Bug: 30950746 Change-Id: I2b5fd03799ac5f2e3243ab11a1bccb932f06c312 Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* bpf: add helper to invalidate hashDaniel Borkmann2022-04-19
| | | | | | | | | | | Add a small helper that complements 36bbef52c7eb ("bpf: direct packet write and access for helpers for clsact progs") for invalidating the current skb->hash after mangling on headers via direct packet write. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* BACKPORT: bpf: multi program support for cgroup+bpfAlexei Starovoitov2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | introduce BPF_F_ALLOW_MULTI flag that can be used to attach multiple bpf programs to a cgroup. The difference between three possible flags for BPF_PROG_ATTACH command: - NONE(default): No further bpf programs allowed in the subtree. - BPF_F_ALLOW_OVERRIDE: If a sub-cgroup installs some bpf program, the program in this cgroup yields to sub-cgroup program. - BPF_F_ALLOW_MULTI: If a sub-cgroup installs some bpf program, that cgroup program gets run in addition to the program in this cgroup. NONE and BPF_F_ALLOW_OVERRIDE existed before. This patch doesn't change their behavior. It only clarifies the semantics in relation to new flag. Only one program is allowed to be attached to a cgroup with NONE or BPF_F_ALLOW_OVERRIDE flag. Multiple programs are allowed to be attached to a cgroup with BPF_F_ALLOW_MULTI flag. They are executed in FIFO order (those that were attached first, run first) The programs of sub-cgroup are executed first, then programs of this cgroup and then programs of parent cgroup. All eligible programs are executed regardless of return code from earlier programs. To allow efficient execution of multiple programs attached to a cgroup and to avoid penalizing cgroups without any programs attached introduce 'struct bpf_prog_array' which is RCU protected array of pointers to bpf programs. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <kafai@fb.com> for cgroup bits Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> (cherry picked from commit 324bda9e6c5add86ba2e1066476481c48132aca0) Signed-off-by: Connor O'Brien <connoro@google.com> Bug: 121213201 Bug: 138317270 Test: build & boot cuttlefish Change-Id: I06b71c850b9f3e052b106abab7a4a3add012a3f8 Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* UPSTREAM: netfilter: xt_bpf: Fix XT_BPF_MODE_FD_PINNED mode of 'xt_bpf_info_v1'Shmulik Ladkani2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 2c16d6033264 ("netfilter: xt_bpf: support ebpf") introduced support for attaching an eBPF object by an fd, with the 'bpf_mt_check_v1' ABI expecting the '.fd' to be specified upon each IPT_SO_SET_REPLACE call. However this breaks subsequent iptables calls: # iptables -A INPUT -m bpf --object-pinned /sys/fs/bpf/xxx -j ACCEPT # iptables -A INPUT -s 5.6.7.8 -j ACCEPT iptables: Invalid argument. Run `dmesg' for more information. That's because iptables works by loading existing rules using IPT_SO_GET_ENTRIES to userspace, then issuing IPT_SO_SET_REPLACE with the replacement set. However, the loaded 'xt_bpf_info_v1' has an arbitrary '.fd' number (from the initial "iptables -m bpf" invocation) - so when 2nd invocation occurs, userspace passes a bogus fd number, which leads to 'bpf_mt_check_v1' to fail. One suggested solution [1] was to hack iptables userspace, to perform a "entries fixup" immediatley after IPT_SO_GET_ENTRIES, by opening a new, process-local fd per every 'xt_bpf_info_v1' entry seen. However, in [2] both Pablo Neira Ayuso and Willem de Bruijn suggested to depricate the xt_bpf_info_v1 ABI dealing with pinned ebpf objects. This fix changes the XT_BPF_MODE_FD_PINNED behavior to ignore the given '.fd' and instead perform an in-kernel lookup for the bpf object given the provided '.path'. It also defines an alias for the XT_BPF_MODE_FD_PINNED mode, named XT_BPF_MODE_PATH_PINNED, to better reflect the fact that the user is expected to provide the path of the pinned object. Existing XT_BPF_MODE_FD_ELF behavior (non-pinned fd mode) is preserved. References: [1] https://marc.info/?l=netfilter-devel&m=150564724607440&w=2 [2] https://marc.info/?l=netfilter-devel&m=150575727129880&w=2 Reported-by: Rafael Buchbinder <rafi@rbk.ms> Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Acked-by: Willem de Bruijn <willemb@google.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Chenbo Feng <fengc@google.com> (cherry picked from commit 98589a0998b8b13c4a8fa1ccb0e62751a019faa5) Change-Id: Ia0d15a76823cca3afb38786a3d2c25c13ccf941d Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* UPSTREAM: netfilter: xt_bpf: support ebpfWillem de Bruijn2022-04-19
| | | | | | | | | | | | | | | Add support for attaching an eBPF object by file descriptor. The iptables binary can be called with a path to an elf object or a pinned bpf object. Also pass the mode and path to the kernel to be able to return it later for iptables dump and save. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Chenbo Feng <fengc@google.com> (cherry picked from commit 2c16d60332643e90d4fa244f4a706c454b8c7569) Change-Id: I31b8831a7ffd7c44985ee906ff194c1d934dafbe
* BACKPORT: bpf: Add file mode configuration into bpf mapsChenbo Feng2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | Introduce the map read/write flags to the eBPF syscalls that returns the map fd. The flags is used to set up the file mode when construct a new file descriptor for bpf maps. To not break the backward capability, the f_flags is set to O_RDWR if the flag passed by syscall is 0. Otherwise it should be O_RDONLY or O_WRONLY. When the userspace want to modify or read the map content, it will check the file mode to see if it is allowed to make the change. Signed-off-by: Chenbo Feng <fengc@google.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net> Deleted the file mode configuration code in unsupported map type and removed the file mode check in non-existing helper functions. (cherry-pick from net-next: 6e71b04a82248ccf13a94b85cbc674a9fefe53f5) Bug: 30950746 Change-Id: Icfad20f1abb77f91068d244fb0d87fa40824dd1b Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* BACKPORT: bpf: introduce BPF_F_ALLOW_OVERRIDE flagAlexei Starovoitov2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If BPF_F_ALLOW_OVERRIDE flag is used in BPF_PROG_ATTACH command to the given cgroup the descendent cgroup will be able to override effective bpf program that was inherited from this cgroup. By default it's not passed, therefore override is disallowed. Examples: 1. prog X attached to /A with default prog Y fails to attach to /A/B and /A/B/C Everything under /A runs prog X 2. prog X attached to /A with allow_override. prog Y fails to attach to /A/B with default (non-override) prog M attached to /A/B with allow_override. Everything under /A/B runs prog M only. 3. prog X attached to /A with allow_override. prog Y fails to attach to /A with default. The user has to detach first to switch the mode. In the future this behavior may be extended with a chain of non-overridable programs. Also fix the bug where detach from cgroup where nothing is attached was not throwing error. Return ENOENT in such case. Add several testcases and adjust libbpf. Fixes: 3007098494be ("cgroup: add support for eBPF programs") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Daniel Mack <daniel@zonque.org> Signed-off-by: David S. Miller <davem@davemloft.net> Fixes: Change-Id: I3df35d8d3b1261503f9b5bcd90b18c9358f1ac28        ("cgroup: add support for eBPF programs") [AmitP: Refactored original patch for android-4.9 where libbpf sources are in samples/bpf/ and test_cgrp2_attach2, test_cgrp2_sock, and test_cgrp2_sock2 sample tests do not exist.] (cherry picked from commit 7f677633379b4abb3281cdbe7e7006f049305c03) Signed-off-by: Amit Pundir <amit.pundir@linaro.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* UPSTREAM: bpf: add new prog type for cgroup socket filteringDaniel Mack2022-04-19
| | | | | | | | | | | | | | | | | | Cherry-pick from commit 0e33661de493db325435d565a4a722120ae4cbf3 This program type is similar to BPF_PROG_TYPE_SOCKET_FILTER, except that it does not allow BPF_LD_[ABS|IND] instructions and hooks up the bpf_skb_load_bytes() helper. Programs of this type will be attached to cgroups for network filtering and accounting. Signed-off-by: Daniel Mack <daniel@zonque.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Bug: 30950746 Change-Id: I7b9e063d5d7a91da80917c6d353a60b877133752 Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* UPSTREAM: bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commandsDaniel Mack2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Cherry-pick from commit f4324551489e8781d838f941b7aee4208e52e8bf Extend the bpf(2) syscall by two new commands, BPF_PROG_ATTACH and BPF_PROG_DETACH which allow attaching and detaching eBPF programs to a target. On the API level, the target could be anything that has an fd in userspace, hence the name of the field in union bpf_attr is called 'target_fd'. When called with BPF_ATTACH_TYPE_CGROUP_INET_{E,IN}GRESS, the target is expected to be a valid file descriptor of a cgroup v2 directory which has the bpf controller enabled. These are the only use-cases implemented by this patch at this point, but more can be added. If a program of the given type already exists in the given cgroup, the program is swapped automically, so userspace does not have to drop an existing program first before installing a new one, which would otherwise leave a gap in which no program is attached. For more information on the propagation logic to subcgroups, please refer to the bpf cgroup controller implementation. The API is guarded by CAP_NET_ADMIN. Signed-off-by: Daniel Mack <daniel@zonque.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Bug: 30950746 Change-Id: Iab156859332166835d51e1e6f64e5cb8b81870f2 Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* sched: new clone flag CLONE_NEWCGROUP for cgroup namespaceAditya Kali2022-04-19
| | | | | | | | | CLONE_NEWCGROUP will be used to create new cgroup namespace. Signed-off-by: Aditya Kali <adityakali@google.com> Signed-off-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* bpf: direct packet write and access for helpers for clsact progsDaniel Borkmann2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This work implements direct packet access for helpers and direct packet write in a similar fashion as already available for XDP types via commits 4acf6c0b84c9 ("bpf: enable direct packet data write for xdp progs") and 6841de8b0d03 ("bpf: allow helpers access the packet directly"), and as a complementary feature to the already available direct packet read for tc (cls/act) programs. For enabling this, we need to introduce two helpers, bpf_skb_pull_data() and bpf_csum_update(). The first is generally needed for both, read and write, because they would otherwise only be limited to the current linear skb head. Usually, when the data_end test fails, programs just bail out, or, in the direct read case, use bpf_skb_load_bytes() as an alternative to overcome this limitation. If such data sits in non-linear parts, we can just pull them in once with the new helper, retest and eventually access them. At the same time, this also makes sure the skb is uncloned, which is, of course, a necessary condition for direct write. As this needs to be an invariant for the write part only, the verifier detects writes and adds a prologue that is calling bpf_skb_pull_data() to effectively unclone the skb from the very beginning in case it is indeed cloned. The heuristic makes use of a similar trick that was done in 233577a22089 ("net: filter: constify detection of pkt_type_offset"). This comes at zero cost for other programs that do not use the direct write feature. Should a program use this feature only sparsely and has read access for the most parts with, for example, drop return codes, then such write action can be delegated to a tail called program for mitigating this cost of potential uncloning to a late point in time where it would have been paid similarly with the bpf_skb_store_bytes() as well. Advantage of direct write is that the writes are inlined whereas the helper cannot make any length assumptions and thus needs to generate a call to memcpy() also for small sizes, as well as cost of helper call itself with sanity checks are avoided. Plus, when direct read is already used, we don't need to cache or perform rechecks on the data boundaries (due to verifier invalidating previous checks for helpers that change skb->data), so more complex programs using rewrites can benefit from switching to direct read plus write. For direct packet access to helpers, we save the otherwise needed copy into a temp struct sitting on stack memory when use-case allows. Both facilities are enabled via may_access_direct_pkt_data() in verifier. For now, we limit this to map helpers and csum_diff, and can successively enable other helpers where we find it makes sense. Helpers that definitely cannot be allowed for this are those part of bpf_helper_changes_skb_data() since they can change underlying data, and those that write into memory as this could happen for packet typed args when still cloned. bpf_csum_update() helper accommodates for the fact that we need to fixup checksum_complete when using direct write instead of bpf_skb_store_bytes(), meaning the programs can use available helpers like bpf_csum_diff(), and implement csum_add(), csum_sub(), csum_block_add(), csum_block_sub() equivalents in eBPF together with the new helper. A usage example will be provided for iproute2's examples/bpf/ directory. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* bpf: introduce BPF_PROG_TYPE_PERF_EVENT program typeAlexei Starovoitov2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduce BPF_PROG_TYPE_PERF_EVENT programs that can be attached to HW and SW perf events (PERF_TYPE_HARDWARE and PERF_TYPE_SOFTWARE correspondingly in uapi/linux/perf_event.h) The program visible context meta structure is struct bpf_perf_event_data { struct pt_regs regs; __u64 sample_period; }; which is accessible directly from the program: int bpf_prog(struct bpf_perf_event_data *ctx) { ... ctx->sample_period ... ... ctx->regs.ip ... } The bpf verifier rewrites the accesses into kernel internal struct bpf_perf_event_data_kern which allows changing struct perf_sample_data without affecting bpf programs. New fields can be added to the end of struct bpf_perf_event_data in the future. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* bpf: add bpf_skb_change_tail helperDaniel Borkmann2022-04-19
| | | | | | | | | | | | | | | | | | | | This work adds a bpf_skb_change_tail() helper for tc BPF programs. The basic idea is to expand or shrink the skb in a controlled manner. The eBPF program can then rewrite the rest via helpers like bpf_skb_store_bytes(), bpf_lX_csum_replace() and others rather than passing a raw buffer for writing here. bpf_skb_change_tail() is really a slow path helper and intended for replies with f.e. ICMP control messages. Concept is similar to other helpers like bpf_skb_change_proto() helper to keep the helper without protocol specifics and let the BPF program mangle the remaining parts. A flags field has been added and is reserved for now should we extend the helper in future. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* bpf: fix bpf_skb_in_cgroup helper namingDaniel Borkmann2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | While hashing out BPF's current_task_under_cgroup helper bits, it came to discussion that the skb_in_cgroup helper name was suboptimally chosen. Tejun says: So, I think in_cgroup should mean that the object is in that particular cgroup while under_cgroup in the subhierarchy of that cgroup. Let's rename the other subhierarchy test to under too. I think that'd be a lot less confusing going forward. [...] It's more intuitive and gives us the room to implement the real "in" test if ever necessary in the future. Since this touches uapi bits, we need to change this as long as v4.8 is not yet officially released. Thus, change the helper enum and rename related bits. Fixes: 4a482f34afcc ("cgroup: bpf: Add bpf_skb_in_cgroup_proto") Reference: http://patchwork.ozlabs.org/patch/658500/ Suggested-by: Sargun Dhillon <sargun@sargun.me> Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* bpf: Add bpf_current_task_under_cgroup helperSargun Dhillon2022-04-19
| | | | | | | | | | | | | | | | | | | | This adds a bpf helper that's similar to the skb_in_cgroup helper to check whether the probe is currently executing in the context of a specific subset of the cgroupsv2 hierarchy. It does this based on membership test for a cgroup arraymap. It is invalid to call this in an interrupt, and it'll return an error. The helper is primarily to be used in debugging activities for containers, where you may have multiple programs running in a given top-level "container". Signed-off-by: Sargun Dhillon <sargun@sargun.me> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Tejun Heo <tj@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* bpf: Add bpf_probe_write_user BPF helper to be called in tracersSargun Dhillon2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | This allows user memory to be written to during the course of a kprobe. It shouldn't be used to implement any kind of security mechanism because of TOC-TOU attacks, but rather to debug, divert, and manipulate execution of semi-cooperative processes. Although it uses probe_kernel_write, we limit the address space the probe can write into by checking the space with access_ok. We do this as opposed to calling copy_to_user directly, in order to avoid sleeping. In addition we ensure the threads's current fs / segment is USER_DS and the thread isn't exiting nor a kernel thread. Given this feature is meant for experiments, and it has a risk of crashing the system, and running programs, we print a warning on when a proglet that attempts to use this helper is installed, along with the pid and process name. Signed-off-by: Sargun Dhillon <sargun@sargun.me> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* bpf: introduce bpf_get_current_task() helperAlexei Starovoitov2022-04-19
| | | | | | | | | | | | | | | over time there were multiple requests to access different data structures and fields of task_struct current, so finally add the helper to access 'current' as-is. Tracing bpf programs will do the rest of walking the pointers via bpf_probe_read(). Note that current can be null and bpf program has to deal it with, but even dumb passing null into bpf_probe_read() is still safe. Suggested-by: Brendan Gregg <brendan.d.gregg@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* bpf: add bpf_get_hash_recalc helperDaniel Borkmann2022-04-19
| | | | | | | | | | | | | | | If skb_clear_hash() was invoked due to mangling of relevant headers and BPF program needs skb->hash later on, we can add a helper to trigger hash recalculation via bpf_get_hash_recalc(). The helper will return the newly retrieved hash directly, but later access can also be done via skb context again through skb->hash directly (inline) without needing to call the helper once more. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* bpf: add XDP prog type for early driver filterBrenden Blanco2022-04-19
| | | | | | | | | | | | | | | | | | Add a new bpf prog type that is intended to run in early stages of the packet rx path. Only minimal packet metadata will be available, hence a new context type, struct xdp_md, is exposed to userspace. So far only expose the packet start and end pointers, and only in read mode. An XDP program must return one of the well known enum values, all other return codes are reserved for future use. Unfortunately, this restriction is hard to enforce at verification time, so take the approach of warning at runtime when such programs are encountered. Out of bounds return codes should alias to XDP_ABORTED. Signed-off-by: Brenden Blanco <bblanco@plumgrid.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* bpf: avoid stack copy and use skb ctx for event outputDaniel Borkmann2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This work addresses a couple of issues bpf_skb_event_output() helper currently has: i) We need two copies instead of just a single one for the skb data when it should be part of a sample. The data can be non-linear and thus needs to be extracted via bpf_skb_load_bytes() helper first, and then copied once again into the ring buffer slot. ii) Since bpf_skb_load_bytes() currently needs to be used first, the helper needs to see a constant size on the passed stack buffer to make sure BPF verifier can do sanity checks on it during verification time. Thus, just passing skb->len (or any other non-constant value) wouldn't work, but changing bpf_skb_load_bytes() is also not the proper solution, since the two copies are generally still needed. iii) bpf_skb_load_bytes() is just for rather small buffers like headers, since they need to sit on the limited BPF stack anyway. Instead of working around in bpf_skb_load_bytes(), this work improves the bpf_skb_event_output() helper to address all 3 at once. We can make use of the passed in skb context that we have in the helper anyway, and use some of the reserved flag bits as a length argument. The helper will use the new __output_custom() facility from perf side with bpf_skb_copy() as callback helper to walk and extract the data. It will pass the data for setup to bpf_event_output(), which generates and pushes the raw record with an additional frag part. The linear data used in the first frag of the record serves as programmatically defined meta data passed along with the appended sample. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* cgroup: bpf: Add bpf_skb_in_cgroup_protoMartin KaFai Lau2022-04-19
| | | | | | | | | | | | | | | | | | | | | Adds a bpf helper, bpf_skb_in_cgroup, to decide if a skb->sk belongs to a descendant of a cgroup2. It is similar to the feature added in netfilter: commit c38c4597e4bf ("netfilter: implement xt_cgroup cgroup2 path match") The user is expected to populate a BPF_MAP_TYPE_CGROUP_ARRAY which will be used by the bpf_skb_in_cgroup. Modifications to the bpf verifier is to ensure BPF_MAP_TYPE_CGROUP_ARRAY and bpf_skb_in_cgroup() are always used together. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Cc: Alexei Starovoitov <ast@fb.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Tejun Heo <tj@kernel.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* bpf: add bpf_skb_change_type helperDaniel Borkmann2022-04-19
| | | | | | | | | | | | | | | | | This work adds a helper for changing skb->pkt_type in a controlled way. We only allow a subset of possible values and can extend that in future should other use cases come up. Doing this as a helper has the advantage that errors can be handeled gracefully and thus helper kept extensible. It's a write counterpart to pkt_type member we can already read from struct __sk_buff context. Major use case is to change incoming skbs to PACKET_HOST in a programmatic way instead of having to recirculate via redirect(..., BPF_F_INGRESS), for example. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* bpf: add bpf_skb_change_proto helperDaniel Borkmann2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds a minimal helper for doing the groundwork of changing the skb->protocol in a controlled way. Currently supported is v4 to v6 and vice versa transitions, which allows f.e. for a minimal, static nat64 implementation where applications in containers that still require IPv4 can be transparently operated in an IPv6-only environment. For example, host facing veth of the container can transparently do the transitions in a programmatic way with the help of clsact qdisc and cls_bpf. Idea is to separate concerns for keeping complexity of the helper lower, which means that the programs utilize bpf_skb_change_proto(), bpf_skb_store_bytes() and bpf_lX_csum_replace() to get the job done, instead of doing everything in a single helper (and thus partially duplicating helper functionality). Also, bpf_skb_change_proto() shouldn't need to deal with raw packet data as this is done by other helpers. bpf_skb_proto_6_to_4() and bpf_skb_proto_4_to_6() unclone the skb to operate on a private one, push or pop additionally required header space and migrate the gso/gro meta data from the shared info. We do mark the gso type as dodgy so that headers are checked and segs recalculated by the gso/gro engine. The gso_size target is adapted as well. The flags argument added is currently reserved and can be used for future extensions. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* cgroup: bpf: Add BPF_MAP_TYPE_CGROUP_ARRAYMartin KaFai Lau2022-04-19
| | | | | | | | | | | | | | Add a BPF_MAP_TYPE_CGROUP_ARRAY and its bpf_map_ops's implementations. To update an element, the caller is expected to obtain a cgroup2 backed fd by open(cgroup2_dir) and then update the array with that fd. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Cc: Alexei Starovoitov <ast@fb.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Tejun Heo <tj@kernel.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* perf core: Per event callchain limitArnaldo Carvalho de Melo2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Additionally to being able to control the system wide maximum depth via /proc/sys/kernel/perf_event_max_stack, now we are able to ask for different depths per event, using perf_event_attr.sample_max_stack for that. This uses an u16 hole at the end of perf_event_attr, that, when perf_event_attr.sample_type has the PERF_SAMPLE_CALLCHAIN, if sample_max_stack is zero, means use perf_event_max_stack, otherwise it'll be bounds checked under callchain_mutex. Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Brendan Gregg <brendan.d.gregg@gmail.com> Cc: David Ahern <dsahern@gmail.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: He Kuang <hekuang@huawei.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Milian Wolff <milian.wolff@kdab.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: Wang Nan <wangnan0@huawei.com> Cc: Zefan Li <lizefan@huawei.com> Link: http://lkml.kernel.org/n/tip-kolmn1yo40p7jhswxwrc7rrd@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* bpf, trace: add BPF_F_CURRENT_CPU flag for bpf_perf_event_outputDaniel Borkmann2022-04-19
| | | | | | | | | | | | Add a BPF_F_CURRENT_CPU flag to optimize the use-case where user space has per-CPU ring buffers and the eBPF program pushes the data into the current CPU's ring buffer which saves us an extra helper function call in eBPF. Also, make sure to properly reserve the remaining flags which are not used. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* bpf: direct packet accessAlexei Starovoitov2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Extended BPF carried over two instructions from classic to access packet data: LD_ABS and LD_IND. They're highly optimized in JITs, but due to their design they have to do length check for every access. When BPF is processing 20M packets per second single LD_ABS after JIT is consuming 3% cpu. Hence the need to optimize it further by amortizing the cost of 'off < skb_headlen' over multiple packet accesses. One option is to introduce two new eBPF instructions LD_ABS_DW and LD_IND_DW with similar usage as skb_header_pointer(). The kernel part for interpreter and x64 JIT was implemented in [1], but such new insns behave like old ld_abs and abort the program with 'return 0' if access is beyond linear data. Such hidden control flow is hard to workaround plus changing JITs and rolling out new llvm is incovenient. Therefore allow cls_bpf/act_bpf program access skb->data directly: int bpf_prog(struct __sk_buff *skb) { struct iphdr *ip; if (skb->data + sizeof(struct iphdr) + ETH_HLEN > skb->data_end) /* packet too small */ return 0; ip = skb->data + ETH_HLEN; /* access IP header fields with direct loads */ if (ip->version != 4 || ip->saddr == 0x7f000001) return 1; [...] } This solution avoids introduction of new instructions. llvm stays the same and all JITs stay the same, but verifier has to work extra hard to prove safety of the above program. For XDP the direct store instructions can be allowed as well. The skb->data is NET_IP_ALIGNED, so for common cases the verifier can check the alignment. The complex packet parsers where packet pointer is adjusted incrementally cannot be tracked for alignment, so allow byte access in such cases and misaligned access on architectures that define efficient_unaligned_access [1] https://git.kernel.org/cgit/linux/kernel/git/ast/bpf.git/?h=ld_abs_dw Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* bpf: make padding in bpf_tunnel_key explicitDaniel Borkmann2022-04-19
| | | | | | | | | | | | | | | | | | | Make the 2 byte padding in struct bpf_tunnel_key between tunnel_ttl and tunnel_label members explicit. No issue has been observed, and gcc/llvm does padding for the old struct already, where tunnel_label was not yet present, so the current code works, but since it's part of uapi, make sure we don't introduce holes in structs. Therefore, add tunnel_ext that we can use generically in future (f.e. to flag OAM messages for backends, etc). Also add the offset to the compat tests to be sure should some compilers not padd the tail of the old version of bpf_tunnel_key. Fixes: 4018ab1875e0 ("bpf: support flow label for bpf_skb_{set, get}_tunnel_key") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* bpf: support flow label for bpf_skb_{set, get}_tunnel_keyDaniel Borkmann2022-04-19
| | | | | | | | | | | | | | This patch extends bpf_tunnel_key with a tunnel_label member, that maps to ip_tunnel_key's label so underlying backends like vxlan and geneve can propagate the label to udp_tunnel6_xmit_skb(), where it's being set in the IPv6 header. It allows for having 20 more bits to encode/decode flow related meta information programmatically. Tested with vxlan and geneve. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* bpf: support for access to tunnel optionsDaniel Borkmann2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After eBPF being able to programmatically access/manage tunnel key meta data via commit d3aa45ce6b94 ("bpf: add helpers to access tunnel metadata") and more recently also for IPv6 through c6c33454072f ("bpf: support ipv6 for bpf_skb_{set,get}_tunnel_key"), this work adds two complementary helpers to generically access their auxiliary tunnel options. Geneve and vxlan support this facility. For geneve, TLVs can be pushed, and for the vxlan case its GBP extension. I.e. setting tunnel key for geneve case only makes sense, if we can also read/write TLVs into it. In the GBP case, it provides the flexibility to easily map the group policy ID in combination with other helpers or maps. I chose to model this as two separate helpers, bpf_skb_{set,get}_tunnel_opt(), for a couple of reasons. bpf_skb_{set,get}_tunnel_key() is already rather complex by itself, and there may be cases for tunnel key backends where tunnel options are not always needed. If we would have integrated this into bpf_skb_{set,get}_tunnel_key() nevertheless, we are very limited with remaining helper arguments, so keeping compatibility on structs in case of passing in a flat buffer gets more cumbersome. Separating both also allows for more flexibility and future extensibility, f.e. options could be fed directly from a map, etc. Moreover, change geneve's xmit path to test only for info->options_len instead of TUNNEL_GENEVE_OPT flag. This makes it more consistent with vxlan's xmit path and allows for avoiding to specify a protocol flag in the API on xmit, so it can be protocol agnostic. Having info->options_len is enough information that is needed. Tested with vxlan and geneve. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* bpf: allow to propagate df in bpf_skb_set_tunnel_keyDaniel Borkmann2022-04-19
| | | | | | | | | | | | Added by 9a628224a61b ("ip_tunnel: Add dont fragment flag."), allow to feed df flag into tunneling facilities (currently supported on TX by vxlan, geneve and gre) as a hint from eBPF's bpf_skb_set_tunnel_key() helper. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* bpf: add flags to bpf_skb_store_bytes for clearing hashDaniel Borkmann2022-04-19
| | | | | | | | | | | | When overwriting parts of the packet with bpf_skb_store_bytes() that were fed previously into skb->hash calculation, we should clear the current hash with skb_clear_hash(), so that a next skb_get_hash() call can determine the correct hash related to this skb. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* bpf: fix csum setting for bpf_set_tunnel_keyDaniel Borkmann2022-04-19
| | | | | | | | | | | | | | | | | | | The fix in 35e2d1152b22 ("tunnels: Allow IPv6 UDP checksums to be correctly controlled.") changed behavior for bpf_set_tunnel_key() when in use with IPv6 and thus uncovered a bug that TUNNEL_CSUM needed to be set but wasn't. As a result, the stack dropped ingress vxlan IPv6 packets, that have been sent via eBPF through collect meta data mode due to checksum now being zero. Since after LCO, we enable IPv4 checksum by default, so make that analogous and only provide a flag BPF_F_ZERO_CSUM_TX for the user to turn it off in IPv4 case. Fixes: 35e2d1152b22 ("tunnels: Allow IPv6 UDP checksums to be correctly controlled.") Fixes: c6c33454072f ("bpf: support ipv6 for bpf_skb_{set,get}_tunnel_key") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* bpf: fix csum update in bpf_l4_csum_replace helper for udpDaniel Borkmann2022-04-19
| | | | | | | | | | | | | | When using this helper for updating UDP checksums, we need to extend this in order to write CSUM_MANGLED_0 for csum computations that result into 0 as sum. Reason we need this is because packets with a checksum could otherwise become incorrectly marked as a packet without a checksum. Likewise, if the user indicates BPF_F_MARK_MANGLED_0, then we should not turn packets without a checksum into ones with a checksum. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* bpf: add generic bpf_csum_diff helperDaniel Borkmann2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For L4 checksums, we currently have bpf_l4_csum_replace() helper. It's currently limited to handle 2 and 4 byte changes in a header and feeds the from/to into inet_proto_csum_replace{2,4}() helpers of the kernel. When working with IPv6, for example, this makes it rather cumbersome to deal with, similarly when editing larger parts of a header. Instead, extend the API in a more generic way: For bpf_l4_csum_replace(), add a case for header field mask of 0 to change the checksum at a given offset through inet_proto_csum_replace_by_diff(), and provide a helper bpf_csum_diff() that can generically calculate a from/to diff for arbitrary amounts of data. This can be used in multiple ways: for the bpf_l4_csum_replace() only part, this even provides us with the option to insert precalculated diffs from user space f.e. from a map, or from bpf_csum_diff() during runtime. bpf_csum_diff() has a optional from/to stack buffer input, so we can calculate a diff by using a scratchbuffer for scenarios where we're inserting (from is NULL), removing (to is NULL) or diffing (from/to buffers don't need to be of equal size) data. Also, bpf_csum_diff() allows to feed a previous csum into csum_partial(), so the function can also be cascaded. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* perf, bpf: allow bpf programs attach to tracepointsAlexei Starovoitov2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | introduce BPF_PROG_TYPE_TRACEPOINT program type and allow it to be attached to the perf tracepoint handler, which will copy the arguments into the per-cpu buffer and pass it to the bpf program as its first argument. The layout of the fields can be discovered by doing 'cat /sys/kernel/debug/tracing/events/sched/sched_switch/format' prior to the compilation of the program with exception that first 8 bytes are reserved and not accessible to the program. This area is used to store the pointer to 'struct pt_regs' which some of the bpf helpers will use: +---------+ | 8 bytes | hidden 'struct pt_regs *' (inaccessible to bpf program) +---------+ | N bytes | static tracepoint fields defined in tracepoint/format (bpf readonly) +---------+ | dynamic | __dynamic_array bytes of tracepoint (inaccessible to bpf yet) +---------+ Not that all of the fields are already dumped to user space via perf ring buffer and broken application access it directly without consulting tracepoint/format. Same rule applies here: static tracepoint fields should only be accessed in a format defined in tracepoint/format. The order of fields and field sizes are not an ABI. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* bpf: pre-allocate hash map elementsAlexei Starovoitov2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If kprobe is placed on spin_unlock then calling kmalloc/kfree from bpf programs is not safe, since the following dead lock is possible: kfree->spin_lock(kmem_cache_node->lock)...spin_unlock->kprobe-> bpf_prog->map_update->kmalloc->spin_lock(of the same kmem_cache_node->lock) and deadlocks. The following solutions were considered and some implemented, but eventually discarded - kmem_cache_create for every map - add recursion check to slow-path of slub - use reserved memory in bpf_map_update for in_irq or in preempt_disabled - kmalloc via irq_work At the end pre-allocation of all map elements turned out to be the simplest solution and since the user is charged upfront for all the memory, such pre-allocation doesn't affect the user space visible behavior. Since it's impossible to tell whether kprobe is triggered in a safe location from kmalloc point of view, use pre-allocation by default and introduce new BPF_F_NO_PREALLOC flag. While testing of per-cpu hash maps it was discovered that alloc_percpu(GFP_ATOMIC) has odd corner cases and often fails to allocate memory even when 90% of it is free. The pre-allocation of per-cpu hash elements solves this problem as well. Turned out that bpf_map_update() quickly followed by bpf_map_lookup()+bpf_map_delete() is very common pattern used in many of iovisor/bcc/tools, so there is additional benefit of pre-allocation, since such use cases are must faster. Since all hash map elements are now pre-allocated we can remove atomic increment of htab->count and save few more cycles. Also add bpf_map_precharge_memlock() to check rlimit_memlock early to avoid large malloc/free done by users who don't have sufficient limits. Pre-allocation is done with vmalloc and alloc/free is done via percpu_freelist. Here are performance numbers for different pre-allocation algorithms that were implemented, but discarded in favor of percpu_freelist: 1 cpu: pcpu_ida 2.1M pcpu_ida nolock 2.3M bt 2.4M kmalloc 1.8M hlist+spinlock 2.3M pcpu_freelist 2.6M 4 cpu: pcpu_ida 1.5M pcpu_ida nolock 1.8M bt w/smp_align 1.7M bt no/smp_align 1.1M kmalloc 0.7M hlist+spinlock 0.2M pcpu_freelist 2.0M 8 cpu: pcpu_ida 0.7M bt w/smp_align 0.8M kmalloc 0.4M pcpu_freelist 1.5M 32 cpu: kmalloc 0.13M pcpu_freelist 0.49M pcpu_ida nolock is a modified percpu_ida algorithm without percpu_ida_cpu locks and without cross-cpu tag stealing. It's faster than existing percpu_ida, but not as fast as pcpu_freelist. bt is a variant of block/blk-mq-tag.c simlified and customized for bpf use case. bt w/smp_align is using cache line for every 'long' (similar to blk-mq-tag). bt no/smp_align allocates 'long' bitmasks continuously to save memory. It's comparable to percpu_ida and in some cases faster, but slower than percpu_freelist hlist+spinlock is the simplest free list with single spinlock. As expeceted it has very bad scaling in SMP. kmalloc is existing implementation which is still available via BPF_F_NO_PREALLOC flag. It's significantly slower in single cpu and in 8 cpu setup it's 3 times slower than pre-allocation with pcpu_freelist, but saves memory, so in cases where map->max_entries can be large and number of map update/delete per second is low, it may make sense to use it. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* bpf: introduce BPF_MAP_TYPE_STACK_TRACEAlexei Starovoitov2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | add new map type to store stack traces and corresponding helper bpf_get_stackid(ctx, map, flags) - walk user or kernel stack and return id @ctx: struct pt_regs* @map: pointer to stack_trace map @flags: bits 0-7 - numer of stack frames to skip bit 8 - collect user stack instead of kernel bit 9 - compare stacks by hash only bit 10 - if two different stacks hash into the same stackid discard old other bits - reserved Return: >= 0 stackid on success or negative error stackid is a 32-bit integer handle that can be further combined with other data (including other stackid) and used as a key into maps. Userspace will access stackmap using standard lookup/delete syscall commands to retrieve full stack trace for given stackid. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>