summaryrefslogtreecommitdiff
path: root/include/uapi/linux (unfollow)
Commit message (Collapse)Author
2024-10-17Merge remote-tracking branch 'msm8998/lineage-20' into lineage-20Raghuram Subramani
Change-Id: I126075a330f305c85f8fe1b8c9d408f368be95d1
2024-02-19input: Drop INPUT_PROP_NO_DUMMY_RELEASE bitHan Wang
* INPUT_PROP_NO_DUMMY_RELEASE definition in this kernel collides with INPUT_PROP_ACCELEROMETER definition in bionic and upstream kernel. As a result, Android recognizes normal input devices like accelerometers and causes strange behaviors. There are no references to this bit in userspace and it is not in 4.9+ kernels, so let's drop this CAF jank. Change-Id: Id9b4ec8d31470e663f533249c4bc4b9e94fd38be
2024-01-10bpf: add BPF_J{LT,LE,SLT,SLE} instructionsDaniel Borkmann
Currently, eBPF only understands BPF_JGT (>), BPF_JGE (>=), BPF_JSGT (s>), BPF_JSGE (s>=) instructions, this means that particularly *JLT/*JLE counterparts involving immediates need to be rewritten from e.g. X < [IMM] by swapping arguments into [IMM] > X, meaning the immediate first is required to be loaded into a register Y := [IMM], such that then we can compare with Y > X. Note that the destination operand is always required to be a register. This has the downside of having unnecessarily increased register pressure, meaning complex program would need to spill other registers temporarily to stack in order to obtain an unused register for the [IMM]. Loading to registers will thus also affect state pruning since we need to account for that register use and potentially those registers that had to be spilled/filled again. As a consequence slightly more stack space might have been used due to spilling, and BPF programs are a bit longer due to extra code involving the register load and potentially required spill/fills. Thus, add BPF_JLT (<), BPF_JLE (<=), BPF_JSLT (s<), BPF_JSLE (s<=) counterparts to the eBPF instruction set. Modifying LLVM to remove the NegateCC() workaround in a PoC patch at [1] and allowing it to also emit the new instructions resulted in cilium's BPF programs that are injected into the fast-path to have a reduced program length in the range of 2-3% (e.g. accumulated main and tail call sections from one of the object file reduced from 4864 to 4729 insns), reduced complexity in the range of 10-30% (e.g. accumulated sections reduced in one of the cases from 116432 to 88428 insns), and reduced stack usage in the range of 1-5% (e.g. accumulated sections from one of the object files reduced from 824 to 784b). The modification for LLVM will be incorporated in a backwards compatible way. Plan is for LLVM to have i) a target specific option to offer a possibility to explicitly enable the extension by the user (as we have with -m target specific extensions today for various CPU insns), and ii) have the kernel checked for presence of the extensions and enable them transparently when the user is selecting more aggressive options such as -march=native in a bpf target context. (Other frontends generating BPF byte code, e.g. ply can probe the kernel directly for its code generation.) [1] https://github.com/borkmann/llvm/tree/bpf-insns Change-Id: Ic56500aaeaf5f3ebdfda094ad6ef4666c82e18c5 Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-28BACKPORT: clone: add CLONE_PIDFDChristian Brauner
This patchset makes it possible to retrieve pid file descriptors at process creation time by introducing the new flag CLONE_PIDFD to the clone() system call. Linus originally suggested to implement this as a new flag to clone() instead of making it a separate system call. As spotted by Linus, there is exactly one bit for clone() left. CLONE_PIDFD creates file descriptors based on the anonymous inode implementation in the kernel that will also be used to implement the new mount api. They serve as a simple opaque handle on pids. Logically, this makes it possible to interpret a pidfd differently, narrowing or widening the scope of various operations (e.g. signal sending). Thus, a pidfd cannot just refer to a tgid, but also a tid, or in theory - given appropriate flag arguments in relevant syscalls - a process group or session. A pidfd does not represent a privilege. This does not imply it cannot ever be that way but for now this is not the case. A pidfd comes with additional information in fdinfo if the kernel supports procfs. The fdinfo file contains the pid of the process in the callers pid namespace in the same format as the procfs status file, i.e. "Pid:\t%d". As suggested by Oleg, with CLONE_PIDFD the pidfd is returned in the parent_tidptr argument of clone. This has the advantage that we can give back the associated pid and the pidfd at the same time. To remove worries about missing metadata access this patchset comes with a sample program that illustrates how a combination of CLONE_PIDFD, and pidfd_send_signal() can be used to gain race-free access to process metadata through /proc/<pid>. The sample program can easily be translated into a helper that would be suitable for inclusion in libc so that users don't have to worry about writing it themselves. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Christian Brauner <christian@brauner.io> Co-developed-by: Jann Horn <jannh@google.com> Signed-off-by: Jann Horn <jannh@google.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Kees Cook <keescook@chromium.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: David Howells <dhowells@redhat.com> Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com> Cc: Andy Lutomirsky <luto@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> (cherry picked from commit b3e5838252665ee4cfa76b82bdf1198dca81e5be) Conflicts: kernel/fork.c (1. Replaced proc_pid_ns() with its direct implementation.) Bug: 135608568 Test: test program using syscall(__NR_sys_pidfd_open,..) and poll() Change-Id: I3c804a92faea686e5bf7f99df893fe3a5d87ddf7 Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: electimon <electimon@gmail.com>
2022-10-28bpf: Add new cgroup attach type to enable sock modificationsDavid Ahern
Add new cgroup based program type, BPF_PROG_TYPE_CGROUP_SOCK. Similar to BPF_PROG_TYPE_CGROUP_SKB programs can be attached to a cgroup and run any time a process in the cgroup opens an AF_INET or AF_INET6 socket. Currently only sk_bound_dev_if is exported to userspace for modification by a bpf program. This allows a cgroup to be configured such that AF_INET{6} sockets opened by processes are automatically bound to a specific device. In turn, this enables the running of programs that do not support SO_BINDTODEVICE in a specific VRF context / L3 domain. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Change-Id: I96a6f6f8f650c494d8c173dbb42580a25698368e
2022-10-28fs: Better permission checking for submountsEric W. Biederman
commit 93faccbbfa958a9668d3ab4e30f38dd205cee8d8 upstream. To support unprivileged users mounting filesystems two permission checks have to be performed: a test to see if the user allowed to create a mount in the mount namespace, and a test to see if the user is allowed to access the specified filesystem. The automount case is special in that mounting the original filesystem grants permission to mount the sub-filesystems, to any user who happens to stumble across the their mountpoint and satisfies the ordinary filesystem permission checks. Attempting to handle the automount case by using override_creds almost works. It preserves the idea that permission to mount the original filesystem is permission to mount the sub-filesystem. Unfortunately using override_creds messes up the filesystems ordinary permission checks. Solve this by being explicit that a mount is a submount by introducing vfs_submount, and using it where appropriate. vfs_submount uses a new mount internal mount flags MS_SUBMOUNT, to let sget and friends know that a mount is a submount so they can take appropriate action. sget and sget_userns are modified to not perform any permission checks on submounts. follow_automount is modified to stop using override_creds as that has proven problemantic. do_mount is modified to always remove the new MS_SUBMOUNT flag so that we know userspace will never by able to specify it. autofs4 is modified to stop using current_real_cred that was put in there to handle the previous version of submount permission checking. cifs is modified to pass the mountpoint all of the way down to vfs_submount. debugfs is modified to pass the mountpoint all of the way down to trace_automount by adding a new parameter. To make this change easier a new typedef debugfs_automount_t is introduced to capture the type of the debugfs automount function. Fixes: 069d5ac9ae0d ("autofs: Fix automounts by using current_real_cred()->uid") Fixes: aeaa4a79ff6a ("fs: Call d_automount with the filesystems creds") Reviewed-by: Trond Myklebust <trond.myklebust@primarydata.com> Reviewed-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Change-Id: I09cb1f35368fb8dc4a64b5ac5a35c9d2843ef95b
2022-07-27mdss: fbdev: Import Zuk z2 row drivers changesDavide Garberi
* This is needed to make the display turn on after turning it off * All of this isn't needed at all on z2_plus Signed-off-by: Davide Garberi <dade.garberi@gmail.com>
2022-07-27drivers: import zuk sound changesFaiz Authar
Signed-off-by: dd3boh <dade.garberi@gmail.com> Signed-off-by: Davide Garberi <dade.garberi@gmail.com>
2022-07-27include: add key_nav and gestureCallMESuper
Change-Id: Icd57ecad17abea47cf1c939cde45ea8f4dc3e503 Signed-off-by: Davide Garberi <dade.garberi@gmail.com>
2022-04-19bpf: add XDP_TX xdp_action for direct forwardingBrenden Blanco
XDP enabled drivers must transmit received packets back out on the same port they were received on when a program returns this action. Signed-off-by: Brenden Blanco <bblanco@plumgrid.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-19bpf, trace: add BPF_F_CURRENT_CPU flag for bpf_perf_event_readDaniel Borkmann
Follow-up commit to 1e33759c788c ("bpf, trace: add BPF_F_CURRENT_CPU flag for bpf_perf_event_output") to add the same functionality into bpf_perf_event_read() helper. The split of index into flags and index component is also safe here, since such large maps are rejected during map allocation time. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-19BACKPORT: UPSTREAM: Add a eBPF helper function to retrieve socket uidChenbo Feng
Cherry-pick from commit 6acc5c2910689fc6ee181bf63085c5efff6a42bd Returns the owner uid of the socket inside a sk_buff. This is useful to perform per-UID accounting of network traffic or per-UID packet filtering. The socket need to be a fullsock otherwise overflowuid is returned. Signed-off-by: Chenbo Feng <fengc@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Bug: 30950746 Change-Id: Idc00947ccfdd4e9f2214ffc4178d701cd9ead0ac Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-04-19BACKPORT: UPSTREAM: Add a helper function to get socket cookie in eBPFChenbo Feng
Cherrypick from commit: 91b8270f2a4d1d9b268de90451cdca63a70052d6 Retrieve the socket cookie generated by sock_gen_cookie() from a sk_buff with a known socket. Generates a new cookie if one was not yet set.If the socket pointer inside sk_buff is NULL, 0 is returned. The helper function coud be useful in monitoring per socket networking traffic statistics and provide a unique socket identifier per namespace. Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Chenbo Feng <fengc@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Bug: 30950746 Change-Id: I95918dcc3ceffb3061495a859d28aee88e3cde3c Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-04-19ANDROID: Fix missing uapi headersChenbo Feng
Update the missing bpf helper function name in bpf_func_id to keep the uapi header consistent with upstream uapi header because we need the new added bpf helper function bpf get_socket_cookie and get_socket_uid. The patch related to those headers are not backetported since they are not related and backport them will bring in extra confilict. Signed-off-by: Chenbo Feng <fengc@google.com> Bug: 30950746 Change-Id: I2b5fd03799ac5f2e3243ab11a1bccb932f06c312 Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-04-19bpf: add helper to invalidate hashDaniel Borkmann
Add a small helper that complements 36bbef52c7eb ("bpf: direct packet write and access for helpers for clsact progs") for invalidating the current skb->hash after mangling on headers via direct packet write. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-04-19BACKPORT: bpf: multi program support for cgroup+bpfAlexei Starovoitov
introduce BPF_F_ALLOW_MULTI flag that can be used to attach multiple bpf programs to a cgroup. The difference between three possible flags for BPF_PROG_ATTACH command: - NONE(default): No further bpf programs allowed in the subtree. - BPF_F_ALLOW_OVERRIDE: If a sub-cgroup installs some bpf program, the program in this cgroup yields to sub-cgroup program. - BPF_F_ALLOW_MULTI: If a sub-cgroup installs some bpf program, that cgroup program gets run in addition to the program in this cgroup. NONE and BPF_F_ALLOW_OVERRIDE existed before. This patch doesn't change their behavior. It only clarifies the semantics in relation to new flag. Only one program is allowed to be attached to a cgroup with NONE or BPF_F_ALLOW_OVERRIDE flag. Multiple programs are allowed to be attached to a cgroup with BPF_F_ALLOW_MULTI flag. They are executed in FIFO order (those that were attached first, run first) The programs of sub-cgroup are executed first, then programs of this cgroup and then programs of parent cgroup. All eligible programs are executed regardless of return code from earlier programs. To allow efficient execution of multiple programs attached to a cgroup and to avoid penalizing cgroups without any programs attached introduce 'struct bpf_prog_array' which is RCU protected array of pointers to bpf programs. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <kafai@fb.com> for cgroup bits Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> (cherry picked from commit 324bda9e6c5add86ba2e1066476481c48132aca0) Signed-off-by: Connor O'Brien <connoro@google.com> Bug: 121213201 Bug: 138317270 Test: build & boot cuttlefish Change-Id: I06b71c850b9f3e052b106abab7a4a3add012a3f8 Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-04-19UPSTREAM: netfilter: xt_bpf: Fix XT_BPF_MODE_FD_PINNED mode of 'xt_bpf_info_v1'Shmulik Ladkani
Commit 2c16d6033264 ("netfilter: xt_bpf: support ebpf") introduced support for attaching an eBPF object by an fd, with the 'bpf_mt_check_v1' ABI expecting the '.fd' to be specified upon each IPT_SO_SET_REPLACE call. However this breaks subsequent iptables calls: # iptables -A INPUT -m bpf --object-pinned /sys/fs/bpf/xxx -j ACCEPT # iptables -A INPUT -s 5.6.7.8 -j ACCEPT iptables: Invalid argument. Run `dmesg' for more information. That's because iptables works by loading existing rules using IPT_SO_GET_ENTRIES to userspace, then issuing IPT_SO_SET_REPLACE with the replacement set. However, the loaded 'xt_bpf_info_v1' has an arbitrary '.fd' number (from the initial "iptables -m bpf" invocation) - so when 2nd invocation occurs, userspace passes a bogus fd number, which leads to 'bpf_mt_check_v1' to fail. One suggested solution [1] was to hack iptables userspace, to perform a "entries fixup" immediatley after IPT_SO_GET_ENTRIES, by opening a new, process-local fd per every 'xt_bpf_info_v1' entry seen. However, in [2] both Pablo Neira Ayuso and Willem de Bruijn suggested to depricate the xt_bpf_info_v1 ABI dealing with pinned ebpf objects. This fix changes the XT_BPF_MODE_FD_PINNED behavior to ignore the given '.fd' and instead perform an in-kernel lookup for the bpf object given the provided '.path'. It also defines an alias for the XT_BPF_MODE_FD_PINNED mode, named XT_BPF_MODE_PATH_PINNED, to better reflect the fact that the user is expected to provide the path of the pinned object. Existing XT_BPF_MODE_FD_ELF behavior (non-pinned fd mode) is preserved. References: [1] https://marc.info/?l=netfilter-devel&m=150564724607440&w=2 [2] https://marc.info/?l=netfilter-devel&m=150575727129880&w=2 Reported-by: Rafael Buchbinder <rafi@rbk.ms> Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Acked-by: Willem de Bruijn <willemb@google.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Chenbo Feng <fengc@google.com> (cherry picked from commit 98589a0998b8b13c4a8fa1ccb0e62751a019faa5) Change-Id: Ia0d15a76823cca3afb38786a3d2c25c13ccf941d Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-04-19UPSTREAM: netfilter: xt_bpf: support ebpfWillem de Bruijn
Add support for attaching an eBPF object by file descriptor. The iptables binary can be called with a path to an elf object or a pinned bpf object. Also pass the mode and path to the kernel to be able to return it later for iptables dump and save. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Chenbo Feng <fengc@google.com> (cherry picked from commit 2c16d60332643e90d4fa244f4a706c454b8c7569) Change-Id: I31b8831a7ffd7c44985ee906ff194c1d934dafbe
2022-04-19BACKPORT: bpf: Add file mode configuration into bpf mapsChenbo Feng
Introduce the map read/write flags to the eBPF syscalls that returns the map fd. The flags is used to set up the file mode when construct a new file descriptor for bpf maps. To not break the backward capability, the f_flags is set to O_RDWR if the flag passed by syscall is 0. Otherwise it should be O_RDONLY or O_WRONLY. When the userspace want to modify or read the map content, it will check the file mode to see if it is allowed to make the change. Signed-off-by: Chenbo Feng <fengc@google.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net> Deleted the file mode configuration code in unsupported map type and removed the file mode check in non-existing helper functions. (cherry-pick from net-next: 6e71b04a82248ccf13a94b85cbc674a9fefe53f5) Bug: 30950746 Change-Id: Icfad20f1abb77f91068d244fb0d87fa40824dd1b Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-04-19BACKPORT: bpf: introduce BPF_F_ALLOW_OVERRIDE flagAlexei Starovoitov
If BPF_F_ALLOW_OVERRIDE flag is used in BPF_PROG_ATTACH command to the given cgroup the descendent cgroup will be able to override effective bpf program that was inherited from this cgroup. By default it's not passed, therefore override is disallowed. Examples: 1. prog X attached to /A with default prog Y fails to attach to /A/B and /A/B/C Everything under /A runs prog X 2. prog X attached to /A with allow_override. prog Y fails to attach to /A/B with default (non-override) prog M attached to /A/B with allow_override. Everything under /A/B runs prog M only. 3. prog X attached to /A with allow_override. prog Y fails to attach to /A with default. The user has to detach first to switch the mode. In the future this behavior may be extended with a chain of non-overridable programs. Also fix the bug where detach from cgroup where nothing is attached was not throwing error. Return ENOENT in such case. Add several testcases and adjust libbpf. Fixes: 3007098494be ("cgroup: add support for eBPF programs") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Daniel Mack <daniel@zonque.org> Signed-off-by: David S. Miller <davem@davemloft.net> Fixes: Change-Id: I3df35d8d3b1261503f9b5bcd90b18c9358f1ac28        ("cgroup: add support for eBPF programs") [AmitP: Refactored original patch for android-4.9 where libbpf sources are in samples/bpf/ and test_cgrp2_attach2, test_cgrp2_sock, and test_cgrp2_sock2 sample tests do not exist.] (cherry picked from commit 7f677633379b4abb3281cdbe7e7006f049305c03) Signed-off-by: Amit Pundir <amit.pundir@linaro.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-04-19UPSTREAM: bpf: add new prog type for cgroup socket filteringDaniel Mack
Cherry-pick from commit 0e33661de493db325435d565a4a722120ae4cbf3 This program type is similar to BPF_PROG_TYPE_SOCKET_FILTER, except that it does not allow BPF_LD_[ABS|IND] instructions and hooks up the bpf_skb_load_bytes() helper. Programs of this type will be attached to cgroups for network filtering and accounting. Signed-off-by: Daniel Mack <daniel@zonque.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Bug: 30950746 Change-Id: I7b9e063d5d7a91da80917c6d353a60b877133752 Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-04-19UPSTREAM: bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commandsDaniel Mack
Cherry-pick from commit f4324551489e8781d838f941b7aee4208e52e8bf Extend the bpf(2) syscall by two new commands, BPF_PROG_ATTACH and BPF_PROG_DETACH which allow attaching and detaching eBPF programs to a target. On the API level, the target could be anything that has an fd in userspace, hence the name of the field in union bpf_attr is called 'target_fd'. When called with BPF_ATTACH_TYPE_CGROUP_INET_{E,IN}GRESS, the target is expected to be a valid file descriptor of a cgroup v2 directory which has the bpf controller enabled. These are the only use-cases implemented by this patch at this point, but more can be added. If a program of the given type already exists in the given cgroup, the program is swapped automically, so userspace does not have to drop an existing program first before installing a new one, which would otherwise leave a gap in which no program is attached. For more information on the propagation logic to subcgroups, please refer to the bpf cgroup controller implementation. The API is guarded by CAP_NET_ADMIN. Signed-off-by: Daniel Mack <daniel@zonque.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Bug: 30950746 Change-Id: Iab156859332166835d51e1e6f64e5cb8b81870f2 Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-04-19sched: new clone flag CLONE_NEWCGROUP for cgroup namespaceAditya Kali
CLONE_NEWCGROUP will be used to create new cgroup namespace. Signed-off-by: Aditya Kali <adityakali@google.com> Signed-off-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-04-19bpf: direct packet write and access for helpers for clsact progsDaniel Borkmann
This work implements direct packet access for helpers and direct packet write in a similar fashion as already available for XDP types via commits 4acf6c0b84c9 ("bpf: enable direct packet data write for xdp progs") and 6841de8b0d03 ("bpf: allow helpers access the packet directly"), and as a complementary feature to the already available direct packet read for tc (cls/act) programs. For enabling this, we need to introduce two helpers, bpf_skb_pull_data() and bpf_csum_update(). The first is generally needed for both, read and write, because they would otherwise only be limited to the current linear skb head. Usually, when the data_end test fails, programs just bail out, or, in the direct read case, use bpf_skb_load_bytes() as an alternative to overcome this limitation. If such data sits in non-linear parts, we can just pull them in once with the new helper, retest and eventually access them. At the same time, this also makes sure the skb is uncloned, which is, of course, a necessary condition for direct write. As this needs to be an invariant for the write part only, the verifier detects writes and adds a prologue that is calling bpf_skb_pull_data() to effectively unclone the skb from the very beginning in case it is indeed cloned. The heuristic makes use of a similar trick that was done in 233577a22089 ("net: filter: constify detection of pkt_type_offset"). This comes at zero cost for other programs that do not use the direct write feature. Should a program use this feature only sparsely and has read access for the most parts with, for example, drop return codes, then such write action can be delegated to a tail called program for mitigating this cost of potential uncloning to a late point in time where it would have been paid similarly with the bpf_skb_store_bytes() as well. Advantage of direct write is that the writes are inlined whereas the helper cannot make any length assumptions and thus needs to generate a call to memcpy() also for small sizes, as well as cost of helper call itself with sanity checks are avoided. Plus, when direct read is already used, we don't need to cache or perform rechecks on the data boundaries (due to verifier invalidating previous checks for helpers that change skb->data), so more complex programs using rewrites can benefit from switching to direct read plus write. For direct packet access to helpers, we save the otherwise needed copy into a temp struct sitting on stack memory when use-case allows. Both facilities are enabled via may_access_direct_pkt_data() in verifier. For now, we limit this to map helpers and csum_diff, and can successively enable other helpers where we find it makes sense. Helpers that definitely cannot be allowed for this are those part of bpf_helper_changes_skb_data() since they can change underlying data, and those that write into memory as this could happen for packet typed args when still cloned. bpf_csum_update() helper accommodates for the fact that we need to fixup checksum_complete when using direct write instead of bpf_skb_store_bytes(), meaning the programs can use available helpers like bpf_csum_diff(), and implement csum_add(), csum_sub(), csum_block_add(), csum_block_sub() equivalents in eBPF together with the new helper. A usage example will be provided for iproute2's examples/bpf/ directory. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-04-19bpf: introduce BPF_PROG_TYPE_PERF_EVENT program typeAlexei Starovoitov
Introduce BPF_PROG_TYPE_PERF_EVENT programs that can be attached to HW and SW perf events (PERF_TYPE_HARDWARE and PERF_TYPE_SOFTWARE correspondingly in uapi/linux/perf_event.h) The program visible context meta structure is struct bpf_perf_event_data { struct pt_regs regs; __u64 sample_period; }; which is accessible directly from the program: int bpf_prog(struct bpf_perf_event_data *ctx) { ... ctx->sample_period ... ... ctx->regs.ip ... } The bpf verifier rewrites the accesses into kernel internal struct bpf_perf_event_data_kern which allows changing struct perf_sample_data without affecting bpf programs. New fields can be added to the end of struct bpf_perf_event_data in the future. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-04-19bpf: add bpf_skb_change_tail helperDaniel Borkmann
This work adds a bpf_skb_change_tail() helper for tc BPF programs. The basic idea is to expand or shrink the skb in a controlled manner. The eBPF program can then rewrite the rest via helpers like bpf_skb_store_bytes(), bpf_lX_csum_replace() and others rather than passing a raw buffer for writing here. bpf_skb_change_tail() is really a slow path helper and intended for replies with f.e. ICMP control messages. Concept is similar to other helpers like bpf_skb_change_proto() helper to keep the helper without protocol specifics and let the BPF program mangle the remaining parts. A flags field has been added and is reserved for now should we extend the helper in future. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-04-19bpf: fix bpf_skb_in_cgroup helper namingDaniel Borkmann
While hashing out BPF's current_task_under_cgroup helper bits, it came to discussion that the skb_in_cgroup helper name was suboptimally chosen. Tejun says: So, I think in_cgroup should mean that the object is in that particular cgroup while under_cgroup in the subhierarchy of that cgroup. Let's rename the other subhierarchy test to under too. I think that'd be a lot less confusing going forward. [...] It's more intuitive and gives us the room to implement the real "in" test if ever necessary in the future. Since this touches uapi bits, we need to change this as long as v4.8 is not yet officially released. Thus, change the helper enum and rename related bits. Fixes: 4a482f34afcc ("cgroup: bpf: Add bpf_skb_in_cgroup_proto") Reference: http://patchwork.ozlabs.org/patch/658500/ Suggested-by: Sargun Dhillon <sargun@sargun.me> Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-04-19bpf: Add bpf_current_task_under_cgroup helperSargun Dhillon
This adds a bpf helper that's similar to the skb_in_cgroup helper to check whether the probe is currently executing in the context of a specific subset of the cgroupsv2 hierarchy. It does this based on membership test for a cgroup arraymap. It is invalid to call this in an interrupt, and it'll return an error. The helper is primarily to be used in debugging activities for containers, where you may have multiple programs running in a given top-level "container". Signed-off-by: Sargun Dhillon <sargun@sargun.me> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Tejun Heo <tj@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-04-19bpf: Add bpf_probe_write_user BPF helper to be called in tracersSargun Dhillon
This allows user memory to be written to during the course of a kprobe. It shouldn't be used to implement any kind of security mechanism because of TOC-TOU attacks, but rather to debug, divert, and manipulate execution of semi-cooperative processes. Although it uses probe_kernel_write, we limit the address space the probe can write into by checking the space with access_ok. We do this as opposed to calling copy_to_user directly, in order to avoid sleeping. In addition we ensure the threads's current fs / segment is USER_DS and the thread isn't exiting nor a kernel thread. Given this feature is meant for experiments, and it has a risk of crashing the system, and running programs, we print a warning on when a proglet that attempts to use this helper is installed, along with the pid and process name. Signed-off-by: Sargun Dhillon <sargun@sargun.me> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-04-19bpf: introduce bpf_get_current_task() helperAlexei Starovoitov
over time there were multiple requests to access different data structures and fields of task_struct current, so finally add the helper to access 'current' as-is. Tracing bpf programs will do the rest of walking the pointers via bpf_probe_read(). Note that current can be null and bpf program has to deal it with, but even dumb passing null into bpf_probe_read() is still safe. Suggested-by: Brendan Gregg <brendan.d.gregg@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-04-19bpf: add bpf_get_hash_recalc helperDaniel Borkmann
If skb_clear_hash() was invoked due to mangling of relevant headers and BPF program needs skb->hash later on, we can add a helper to trigger hash recalculation via bpf_get_hash_recalc(). The helper will return the newly retrieved hash directly, but later access can also be done via skb context again through skb->hash directly (inline) without needing to call the helper once more. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-04-19bpf: add XDP prog type for early driver filterBrenden Blanco
Add a new bpf prog type that is intended to run in early stages of the packet rx path. Only minimal packet metadata will be available, hence a new context type, struct xdp_md, is exposed to userspace. So far only expose the packet start and end pointers, and only in read mode. An XDP program must return one of the well known enum values, all other return codes are reserved for future use. Unfortunately, this restriction is hard to enforce at verification time, so take the approach of warning at runtime when such programs are encountered. Out of bounds return codes should alias to XDP_ABORTED. Signed-off-by: Brenden Blanco <bblanco@plumgrid.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-04-19bpf: avoid stack copy and use skb ctx for event outputDaniel Borkmann
This work addresses a couple of issues bpf_skb_event_output() helper currently has: i) We need two copies instead of just a single one for the skb data when it should be part of a sample. The data can be non-linear and thus needs to be extracted via bpf_skb_load_bytes() helper first, and then copied once again into the ring buffer slot. ii) Since bpf_skb_load_bytes() currently needs to be used first, the helper needs to see a constant size on the passed stack buffer to make sure BPF verifier can do sanity checks on it during verification time. Thus, just passing skb->len (or any other non-constant value) wouldn't work, but changing bpf_skb_load_bytes() is also not the proper solution, since the two copies are generally still needed. iii) bpf_skb_load_bytes() is just for rather small buffers like headers, since they need to sit on the limited BPF stack anyway. Instead of working around in bpf_skb_load_bytes(), this work improves the bpf_skb_event_output() helper to address all 3 at once. We can make use of the passed in skb context that we have in the helper anyway, and use some of the reserved flag bits as a length argument. The helper will use the new __output_custom() facility from perf side with bpf_skb_copy() as callback helper to walk and extract the data. It will pass the data for setup to bpf_event_output(), which generates and pushes the raw record with an additional frag part. The linear data used in the first frag of the record serves as programmatically defined meta data passed along with the appended sample. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-04-19cgroup: bpf: Add bpf_skb_in_cgroup_protoMartin KaFai Lau
Adds a bpf helper, bpf_skb_in_cgroup, to decide if a skb->sk belongs to a descendant of a cgroup2. It is similar to the feature added in netfilter: commit c38c4597e4bf ("netfilter: implement xt_cgroup cgroup2 path match") The user is expected to populate a BPF_MAP_TYPE_CGROUP_ARRAY which will be used by the bpf_skb_in_cgroup. Modifications to the bpf verifier is to ensure BPF_MAP_TYPE_CGROUP_ARRAY and bpf_skb_in_cgroup() are always used together. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Cc: Alexei Starovoitov <ast@fb.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Tejun Heo <tj@kernel.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-04-19bpf: add bpf_skb_change_type helperDaniel Borkmann
This work adds a helper for changing skb->pkt_type in a controlled way. We only allow a subset of possible values and can extend that in future should other use cases come up. Doing this as a helper has the advantage that errors can be handeled gracefully and thus helper kept extensible. It's a write counterpart to pkt_type member we can already read from struct __sk_buff context. Major use case is to change incoming skbs to PACKET_HOST in a programmatic way instead of having to recirculate via redirect(..., BPF_F_INGRESS), for example. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-04-19bpf: add bpf_skb_change_proto helperDaniel Borkmann
This patch adds a minimal helper for doing the groundwork of changing the skb->protocol in a controlled way. Currently supported is v4 to v6 and vice versa transitions, which allows f.e. for a minimal, static nat64 implementation where applications in containers that still require IPv4 can be transparently operated in an IPv6-only environment. For example, host facing veth of the container can transparently do the transitions in a programmatic way with the help of clsact qdisc and cls_bpf. Idea is to separate concerns for keeping complexity of the helper lower, which means that the programs utilize bpf_skb_change_proto(), bpf_skb_store_bytes() and bpf_lX_csum_replace() to get the job done, instead of doing everything in a single helper (and thus partially duplicating helper functionality). Also, bpf_skb_change_proto() shouldn't need to deal with raw packet data as this is done by other helpers. bpf_skb_proto_6_to_4() and bpf_skb_proto_4_to_6() unclone the skb to operate on a private one, push or pop additionally required header space and migrate the gso/gro meta data from the shared info. We do mark the gso type as dodgy so that headers are checked and segs recalculated by the gso/gro engine. The gso_size target is adapted as well. The flags argument added is currently reserved and can be used for future extensions. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-04-19cgroup: bpf: Add BPF_MAP_TYPE_CGROUP_ARRAYMartin KaFai Lau
Add a BPF_MAP_TYPE_CGROUP_ARRAY and its bpf_map_ops's implementations. To update an element, the caller is expected to obtain a cgroup2 backed fd by open(cgroup2_dir) and then update the array with that fd. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Cc: Alexei Starovoitov <ast@fb.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Tejun Heo <tj@kernel.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-04-19perf core: Per event callchain limitArnaldo Carvalho de Melo
Additionally to being able to control the system wide maximum depth via /proc/sys/kernel/perf_event_max_stack, now we are able to ask for different depths per event, using perf_event_attr.sample_max_stack for that. This uses an u16 hole at the end of perf_event_attr, that, when perf_event_attr.sample_type has the PERF_SAMPLE_CALLCHAIN, if sample_max_stack is zero, means use perf_event_max_stack, otherwise it'll be bounds checked under callchain_mutex. Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Brendan Gregg <brendan.d.gregg@gmail.com> Cc: David Ahern <dsahern@gmail.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: He Kuang <hekuang@huawei.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Milian Wolff <milian.wolff@kdab.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: Wang Nan <wangnan0@huawei.com> Cc: Zefan Li <lizefan@huawei.com> Link: http://lkml.kernel.org/n/tip-kolmn1yo40p7jhswxwrc7rrd@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-04-19bpf, trace: add BPF_F_CURRENT_CPU flag for bpf_perf_event_outputDaniel Borkmann
Add a BPF_F_CURRENT_CPU flag to optimize the use-case where user space has per-CPU ring buffers and the eBPF program pushes the data into the current CPU's ring buffer which saves us an extra helper function call in eBPF. Also, make sure to properly reserve the remaining flags which are not used. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-04-19bpf: direct packet accessAlexei Starovoitov
Extended BPF carried over two instructions from classic to access packet data: LD_ABS and LD_IND. They're highly optimized in JITs, but due to their design they have to do length check for every access. When BPF is processing 20M packets per second single LD_ABS after JIT is consuming 3% cpu. Hence the need to optimize it further by amortizing the cost of 'off < skb_headlen' over multiple packet accesses. One option is to introduce two new eBPF instructions LD_ABS_DW and LD_IND_DW with similar usage as skb_header_pointer(). The kernel part for interpreter and x64 JIT was implemented in [1], but such new insns behave like old ld_abs and abort the program with 'return 0' if access is beyond linear data. Such hidden control flow is hard to workaround plus changing JITs and rolling out new llvm is incovenient. Therefore allow cls_bpf/act_bpf program access skb->data directly: int bpf_prog(struct __sk_buff *skb) { struct iphdr *ip; if (skb->data + sizeof(struct iphdr) + ETH_HLEN > skb->data_end) /* packet too small */ return 0; ip = skb->data + ETH_HLEN; /* access IP header fields with direct loads */ if (ip->version != 4 || ip->saddr == 0x7f000001) return 1; [...] } This solution avoids introduction of new instructions. llvm stays the same and all JITs stay the same, but verifier has to work extra hard to prove safety of the above program. For XDP the direct store instructions can be allowed as well. The skb->data is NET_IP_ALIGNED, so for common cases the verifier can check the alignment. The complex packet parsers where packet pointer is adjusted incrementally cannot be tracked for alignment, so allow byte access in such cases and misaligned access on architectures that define efficient_unaligned_access [1] https://git.kernel.org/cgit/linux/kernel/git/ast/bpf.git/?h=ld_abs_dw Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-04-19bpf: make padding in bpf_tunnel_key explicitDaniel Borkmann
Make the 2 byte padding in struct bpf_tunnel_key between tunnel_ttl and tunnel_label members explicit. No issue has been observed, and gcc/llvm does padding for the old struct already, where tunnel_label was not yet present, so the current code works, but since it's part of uapi, make sure we don't introduce holes in structs. Therefore, add tunnel_ext that we can use generically in future (f.e. to flag OAM messages for backends, etc). Also add the offset to the compat tests to be sure should some compilers not padd the tail of the old version of bpf_tunnel_key. Fixes: 4018ab1875e0 ("bpf: support flow label for bpf_skb_{set, get}_tunnel_key") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-04-19bpf: support flow label for bpf_skb_{set, get}_tunnel_keyDaniel Borkmann
This patch extends bpf_tunnel_key with a tunnel_label member, that maps to ip_tunnel_key's label so underlying backends like vxlan and geneve can propagate the label to udp_tunnel6_xmit_skb(), where it's being set in the IPv6 header. It allows for having 20 more bits to encode/decode flow related meta information programmatically. Tested with vxlan and geneve. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-04-19bpf: support for access to tunnel optionsDaniel Borkmann
After eBPF being able to programmatically access/manage tunnel key meta data via commit d3aa45ce6b94 ("bpf: add helpers to access tunnel metadata") and more recently also for IPv6 through c6c33454072f ("bpf: support ipv6 for bpf_skb_{set,get}_tunnel_key"), this work adds two complementary helpers to generically access their auxiliary tunnel options. Geneve and vxlan support this facility. For geneve, TLVs can be pushed, and for the vxlan case its GBP extension. I.e. setting tunnel key for geneve case only makes sense, if we can also read/write TLVs into it. In the GBP case, it provides the flexibility to easily map the group policy ID in combination with other helpers or maps. I chose to model this as two separate helpers, bpf_skb_{set,get}_tunnel_opt(), for a couple of reasons. bpf_skb_{set,get}_tunnel_key() is already rather complex by itself, and there may be cases for tunnel key backends where tunnel options are not always needed. If we would have integrated this into bpf_skb_{set,get}_tunnel_key() nevertheless, we are very limited with remaining helper arguments, so keeping compatibility on structs in case of passing in a flat buffer gets more cumbersome. Separating both also allows for more flexibility and future extensibility, f.e. options could be fed directly from a map, etc. Moreover, change geneve's xmit path to test only for info->options_len instead of TUNNEL_GENEVE_OPT flag. This makes it more consistent with vxlan's xmit path and allows for avoiding to specify a protocol flag in the API on xmit, so it can be protocol agnostic. Having info->options_len is enough information that is needed. Tested with vxlan and geneve. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-04-19bpf: allow to propagate df in bpf_skb_set_tunnel_keyDaniel Borkmann
Added by 9a628224a61b ("ip_tunnel: Add dont fragment flag."), allow to feed df flag into tunneling facilities (currently supported on TX by vxlan, geneve and gre) as a hint from eBPF's bpf_skb_set_tunnel_key() helper. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-04-19bpf: add flags to bpf_skb_store_bytes for clearing hashDaniel Borkmann
When overwriting parts of the packet with bpf_skb_store_bytes() that were fed previously into skb->hash calculation, we should clear the current hash with skb_clear_hash(), so that a next skb_get_hash() call can determine the correct hash related to this skb. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-04-19bpf: fix csum setting for bpf_set_tunnel_keyDaniel Borkmann
The fix in 35e2d1152b22 ("tunnels: Allow IPv6 UDP checksums to be correctly controlled.") changed behavior for bpf_set_tunnel_key() when in use with IPv6 and thus uncovered a bug that TUNNEL_CSUM needed to be set but wasn't. As a result, the stack dropped ingress vxlan IPv6 packets, that have been sent via eBPF through collect meta data mode due to checksum now being zero. Since after LCO, we enable IPv4 checksum by default, so make that analogous and only provide a flag BPF_F_ZERO_CSUM_TX for the user to turn it off in IPv4 case. Fixes: 35e2d1152b22 ("tunnels: Allow IPv6 UDP checksums to be correctly controlled.") Fixes: c6c33454072f ("bpf: support ipv6 for bpf_skb_{set,get}_tunnel_key") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-04-19bpf: fix csum update in bpf_l4_csum_replace helper for udpDaniel Borkmann
When using this helper for updating UDP checksums, we need to extend this in order to write CSUM_MANGLED_0 for csum computations that result into 0 as sum. Reason we need this is because packets with a checksum could otherwise become incorrectly marked as a packet without a checksum. Likewise, if the user indicates BPF_F_MARK_MANGLED_0, then we should not turn packets without a checksum into ones with a checksum. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-04-19bpf: add generic bpf_csum_diff helperDaniel Borkmann
For L4 checksums, we currently have bpf_l4_csum_replace() helper. It's currently limited to handle 2 and 4 byte changes in a header and feeds the from/to into inet_proto_csum_replace{2,4}() helpers of the kernel. When working with IPv6, for example, this makes it rather cumbersome to deal with, similarly when editing larger parts of a header. Instead, extend the API in a more generic way: For bpf_l4_csum_replace(), add a case for header field mask of 0 to change the checksum at a given offset through inet_proto_csum_replace_by_diff(), and provide a helper bpf_csum_diff() that can generically calculate a from/to diff for arbitrary amounts of data. This can be used in multiple ways: for the bpf_l4_csum_replace() only part, this even provides us with the option to insert precalculated diffs from user space f.e. from a map, or from bpf_csum_diff() during runtime. bpf_csum_diff() has a optional from/to stack buffer input, so we can calculate a diff by using a scratchbuffer for scenarios where we're inserting (from is NULL), removing (to is NULL) or diffing (from/to buffers don't need to be of equal size) data. Also, bpf_csum_diff() allows to feed a previous csum into csum_partial(), so the function can also be cascaded. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-04-19perf, bpf: allow bpf programs attach to tracepointsAlexei Starovoitov
introduce BPF_PROG_TYPE_TRACEPOINT program type and allow it to be attached to the perf tracepoint handler, which will copy the arguments into the per-cpu buffer and pass it to the bpf program as its first argument. The layout of the fields can be discovered by doing 'cat /sys/kernel/debug/tracing/events/sched/sched_switch/format' prior to the compilation of the program with exception that first 8 bytes are reserved and not accessible to the program. This area is used to store the pointer to 'struct pt_regs' which some of the bpf helpers will use: +---------+ | 8 bytes | hidden 'struct pt_regs *' (inaccessible to bpf program) +---------+ | N bytes | static tracepoint fields defined in tracepoint/format (bpf readonly) +---------+ | dynamic | __dynamic_array bytes of tracepoint (inaccessible to bpf yet) +---------+ Not that all of the fields are already dumped to user space via perf ring buffer and broken application access it directly without consulting tracepoint/format. Same rule applies here: static tracepoint fields should only be accessed in a format defined in tracepoint/format. The order of fields and field sizes are not an ABI. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-04-19bpf: pre-allocate hash map elementsAlexei Starovoitov
If kprobe is placed on spin_unlock then calling kmalloc/kfree from bpf programs is not safe, since the following dead lock is possible: kfree->spin_lock(kmem_cache_node->lock)...spin_unlock->kprobe-> bpf_prog->map_update->kmalloc->spin_lock(of the same kmem_cache_node->lock) and deadlocks. The following solutions were considered and some implemented, but eventually discarded - kmem_cache_create for every map - add recursion check to slow-path of slub - use reserved memory in bpf_map_update for in_irq or in preempt_disabled - kmalloc via irq_work At the end pre-allocation of all map elements turned out to be the simplest solution and since the user is charged upfront for all the memory, such pre-allocation doesn't affect the user space visible behavior. Since it's impossible to tell whether kprobe is triggered in a safe location from kmalloc point of view, use pre-allocation by default and introduce new BPF_F_NO_PREALLOC flag. While testing of per-cpu hash maps it was discovered that alloc_percpu(GFP_ATOMIC) has odd corner cases and often fails to allocate memory even when 90% of it is free. The pre-allocation of per-cpu hash elements solves this problem as well. Turned out that bpf_map_update() quickly followed by bpf_map_lookup()+bpf_map_delete() is very common pattern used in many of iovisor/bcc/tools, so there is additional benefit of pre-allocation, since such use cases are must faster. Since all hash map elements are now pre-allocated we can remove atomic increment of htab->count and save few more cycles. Also add bpf_map_precharge_memlock() to check rlimit_memlock early to avoid large malloc/free done by users who don't have sufficient limits. Pre-allocation is done with vmalloc and alloc/free is done via percpu_freelist. Here are performance numbers for different pre-allocation algorithms that were implemented, but discarded in favor of percpu_freelist: 1 cpu: pcpu_ida 2.1M pcpu_ida nolock 2.3M bt 2.4M kmalloc 1.8M hlist+spinlock 2.3M pcpu_freelist 2.6M 4 cpu: pcpu_ida 1.5M pcpu_ida nolock 1.8M bt w/smp_align 1.7M bt no/smp_align 1.1M kmalloc 0.7M hlist+spinlock 0.2M pcpu_freelist 2.0M 8 cpu: pcpu_ida 0.7M bt w/smp_align 0.8M kmalloc 0.4M pcpu_freelist 1.5M 32 cpu: kmalloc 0.13M pcpu_freelist 0.49M pcpu_ida nolock is a modified percpu_ida algorithm without percpu_ida_cpu locks and without cross-cpu tag stealing. It's faster than existing percpu_ida, but not as fast as pcpu_freelist. bt is a variant of block/blk-mq-tag.c simlified and customized for bpf use case. bt w/smp_align is using cache line for every 'long' (similar to blk-mq-tag). bt no/smp_align allocates 'long' bitmasks continuously to save memory. It's comparable to percpu_ida and in some cases faster, but slower than percpu_freelist hlist+spinlock is the simplest free list with single spinlock. As expeceted it has very bad scaling in SMP. kmalloc is existing implementation which is still available via BPF_F_NO_PREALLOC flag. It's significantly slower in single cpu and in 8 cpu setup it's 3 times slower than pre-allocation with pcpu_freelist, but saves memory, so in cases where map->max_entries can be large and number of map update/delete per second is low, it may make sense to use it. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>