summaryrefslogtreecommitdiff
Commit message (Collapse)AuthorAge
...
* | | bpf: refactor fixup_bpf_calls()Alexei Starovoitov2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 79741b3bdec01a8628368fbcfccc7d189ed606cb upstream. reduce indent and make it iterate over instructions similar to convert_ctx_accesses(). Also convert hard BUG_ON into soft verifier error. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net> Cc: Jiri Slaby <jslaby@suse.cz> [Backported to 4.9.y - gregkh] Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | bpf: move fixup_bpf_calls() functionAlexei Starovoitov2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit e245c5c6a5656e4d61aa7bb08e9694fd6e5b2b9d upstream. no functional change. move fixup_bpf_calls() to verifier.c it's being refactored in the next patch Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net> Cc: Jiri Slaby <jslaby@suse.cz> [backported to 4.9 - gregkh] Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | bpf/verifier: Fix states_equal() comparison of pointer and UNKNOWNBen Hutchings2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | An UNKNOWN_VALUE is not supposed to be derived from a pointer, unless pointer leaks are allowed. Therefore, states_equal() must not treat a state with a pointer in a register as "equal" to a state with an UNKNOWN_VALUE in that register. This was fixed differently upstream, but the code around here was largely rewritten in 4.14 by commit f1174f77b50c "bpf/verifier: rework value tracking". The bug can be detected by the bpf/verifier sub-test "pointer/scalar confusion in state equality check (way 1)". Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Cc: Edward Cree <ecree@solarflare.com> Cc: Jann Horn <jannh@google.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | bpf: fix incorrect sign extension in check_alu_op()Jann Horn2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 95a762e2c8c942780948091f8f2a4f32fce1ac6f ] Distinguish between BPF_ALU64|BPF_MOV|BPF_K (load 32-bit immediate, sign-extended to 64-bit) and BPF_ALU|BPF_MOV|BPF_K (load 32-bit immediate, zero-padded to 64-bit); only perform sign extension in the first case. Starting with v4.14, this is exploitable by unprivileged users as long as the unprivileged_bpf_disabled sysctl isn't set. Debian assigned CVE-2017-16995 for this issue. v3: - add CVE number (Ben Hutchings) Fixes: 484611357c19 ("bpf: allow access into map value arrays") Signed-off-by: Jann Horn <jannh@google.com> Acked-by: Edward Cree <ecree@solarflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | bpf: reject out-of-bounds stack pointer calculationJann Horn2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Reject programs that compute wildly out-of-bounds stack pointers. Otherwise, pointers can be computed with an offset that doesn't fit into an `int`, causing security issues in the stack memory access check (as well as signed integer overflow during offset addition). This is a fix specifically for the v4.9 stable tree because the mainline code looks very different at this point. Fixes: 7bca0a9702edf ("bpf: enhance verifier to understand stack pointer arithmetic") Signed-off-by: Jann Horn <jannh@google.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | bpf: fix branch pruning logicAlexei Starovoitov2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit c131187db2d3fa2f8bf32fdf4e9a4ef805168467 ] when the verifier detects that register contains a runtime constant and it's compared with another constant it will prune exploration of the branch that is guaranteed not to be taken at runtime. This is all correct, but malicious program may be constructed in such a way that it always has a constant comparison and the other branch is never taken under any conditions. In this case such path through the program will not be explored by the verifier. It won't be taken at run-time either, but since all instructions are JITed the malicious program may cause JITs to complain about using reserved fields, etc. To fix the issue we have to track the instructions explored by the verifier and sanitize instructions that are dead at run time with NOPs. We cannot reject such dead code, since llvm generates it for valid C code, since it doesn't do as much data flow analysis as the verifier does. Fixes: 17a5267067f3 ("bpf: verifier (add verifier core)") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | bpf: adjust insn_aux_data when patching insnsAlexei Starovoitov2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 8041902dae5299c1f194ba42d14383f734631009 ] convert_ctx_accesses() replaces single bpf instruction with a set of instructions. Adjust corresponding insn_aux_data while patching. It's needed to make sure subsequent 'for(all insn)' loops have matching insn and insn_aux_data. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | bpf: fix lockdep splatEric Dumazet2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 89ad2fa3f043a1e8daae193bcb5fe34d5f8caf28 ] pcpu_freelist_pop() needs the same lockdep awareness than pcpu_freelist_populate() to avoid a false positive. [ INFO: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected ] switchto-defaul/12508 [HC0[0]:SC0[6]:HE0:SE0] is trying to acquire: (&htab->buckets[i].lock){......}, at: [<ffffffff9dc099cb>] __htab_percpu_map_update_elem+0x1cb/0x300 and this task is already holding: (dev_queue->dev->qdisc_class ?: &qdisc_tx_lock#2){+.-...}, at: [<ffffffff9e135848>] __dev_queue_xmit+0 x868/0x1240 which would create a new lock dependency: (dev_queue->dev->qdisc_class ?: &qdisc_tx_lock#2){+.-...} -> (&htab->buckets[i].lock){......} but this new dependency connects a SOFTIRQ-irq-safe lock: (dev_queue->dev->qdisc_class ?: &qdisc_tx_lock#2){+.-...} ... which became SOFTIRQ-irq-safe at: [<ffffffff9db5931b>] __lock_acquire+0x42b/0x1f10 [<ffffffff9db5b32c>] lock_acquire+0xbc/0x1b0 [<ffffffff9da05e38>] _raw_spin_lock+0x38/0x50 [<ffffffff9e135848>] __dev_queue_xmit+0x868/0x1240 [<ffffffff9e136240>] dev_queue_xmit+0x10/0x20 [<ffffffff9e1965d9>] ip_finish_output2+0x439/0x590 [<ffffffff9e197410>] ip_finish_output+0x150/0x2f0 [<ffffffff9e19886d>] ip_output+0x7d/0x260 [<ffffffff9e19789e>] ip_local_out+0x5e/0xe0 [<ffffffff9e197b25>] ip_queue_xmit+0x205/0x620 [<ffffffff9e1b8398>] tcp_transmit_skb+0x5a8/0xcb0 [<ffffffff9e1ba152>] tcp_write_xmit+0x242/0x1070 [<ffffffff9e1baffc>] __tcp_push_pending_frames+0x3c/0xf0 [<ffffffff9e1b3472>] tcp_rcv_established+0x312/0x700 [<ffffffff9e1c1acc>] tcp_v4_do_rcv+0x11c/0x200 [<ffffffff9e1c3dc2>] tcp_v4_rcv+0xaa2/0xc30 [<ffffffff9e191107>] ip_local_deliver_finish+0xa7/0x240 [<ffffffff9e191a36>] ip_local_deliver+0x66/0x200 [<ffffffff9e19137d>] ip_rcv_finish+0xdd/0x560 [<ffffffff9e191e65>] ip_rcv+0x295/0x510 [<ffffffff9e12ff88>] __netif_receive_skb_core+0x988/0x1020 [<ffffffff9e130641>] __netif_receive_skb+0x21/0x70 [<ffffffff9e1306ff>] process_backlog+0x6f/0x230 [<ffffffff9e132129>] net_rx_action+0x229/0x420 [<ffffffff9da07ee8>] __do_softirq+0xd8/0x43d [<ffffffff9e282bcc>] do_softirq_own_stack+0x1c/0x30 [<ffffffff9dafc2f5>] do_softirq+0x55/0x60 [<ffffffff9dafc3a8>] __local_bh_enable_ip+0xa8/0xb0 [<ffffffff9db4c727>] cpu_startup_entry+0x1c7/0x500 [<ffffffff9daab333>] start_secondary+0x113/0x140 to a SOFTIRQ-irq-unsafe lock: (&head->lock){+.+...} ... which became SOFTIRQ-irq-unsafe at: ... [<ffffffff9db5971f>] __lock_acquire+0x82f/0x1f10 [<ffffffff9db5b32c>] lock_acquire+0xbc/0x1b0 [<ffffffff9da05e38>] _raw_spin_lock+0x38/0x50 [<ffffffff9dc0b7fa>] pcpu_freelist_pop+0x7a/0xb0 [<ffffffff9dc08b2c>] htab_map_alloc+0x50c/0x5f0 [<ffffffff9dc00dc5>] SyS_bpf+0x265/0x1200 [<ffffffff9e28195f>] entry_SYSCALL_64_fastpath+0x12/0x17 other info that might help us debug this: Chain exists of: dev_queue->dev->qdisc_class ?: &qdisc_tx_lock#2 --> &htab->buckets[i].lock --> &head->lock Possible interrupt unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&head->lock); local_irq_disable(); lock(dev_queue->dev->qdisc_class ?: &qdisc_tx_lock#2); lock(&htab->buckets[i].lock); <Interrupt> lock(dev_queue->dev->qdisc_class ?: &qdisc_tx_lock#2); *** DEADLOCK *** Fixes: e19494edab82 ("bpf: introduce percpu_freelist") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | UPSTREAM: selinux: bpf: Add addtional check for bpf object file receiveChenbo Feng2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduce a bpf object related check when sending and receiving files through unix domain socket as well as binder. It checks if the receiving process have privilege to read/write the bpf map or use the bpf program. This check is necessary because the bpf maps and programs are using a anonymous inode as their shared inode so the normal way of checking the files and sockets when passing between processes cannot work properly on eBPF object. This check only works when the BPF_SYSCALL is configured. Signed-off-by: Chenbo Feng <fengc@google.com> Acked-by: Stephen Smalley <sds@tycho.nsa.gov> Reviewed-by: James Morris <james.l.morris@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net> (cherry-pick from net-next: f66e448cfda021b0bcd884f26709796fe19c7cc1) Bug: 30950746 Change-Id: I5b2cf4ccb4eab7eda91ddd7091d6aa3e7ed9f2cd Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | UPSTREAM: selinux: bpf: Add selinux check for eBPF syscall operationsChenbo Feng2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Implement the actual checks introduced to eBPF related syscalls. This implementation use the security field inside bpf object to store a sid that identify the bpf object. And when processes try to access the object, selinux will check if processes have the right privileges. The creation of eBPF object are also checked at the general bpf check hook and new cmd introduced to eBPF domain can also be checked there. Signed-off-by: Chenbo Feng <fengc@google.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: James Morris <james.l.morris@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net> (cherry-pick from net-next: ec27c3568a34c7fe5fcf4ac0a354eda77687f7eb) Bug: 30950746 Change-Id: Ifb0cdd4b7d470223b143646b339ba511ac77c156 Signed-off-by: Chatur27 <jasonbright2709@gmail.com> Change-Id: I073b5ebe76a280267289357af2b5d8f3afcaffa4
* | | BACKPORT: security: bpf: Add LSM hooks for bpf object related syscallChenbo Feng2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduce several LSM hooks for the syscalls that will allow the userspace to access to eBPF object such as eBPF programs and eBPF maps. The security check is aimed to enforce a per object security protection for eBPF object so only processes with the right priviliges can read/write to a specific map or use a specific eBPF program. Besides that, a general security hook is added before the multiplexer of bpf syscall to check the cmd and the attribute used for the command. The actual security module can decide which command need to be checked and how the cmd should be checked. Signed-off-by: Chenbo Feng <fengc@google.com> Acked-by: James Morris <james.l.morris@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net> Added the LIST_HEAD_INIT call for security hooks, it nolonger exist in uptream code. (cherry-pick from net-next: afdb09c720b62b8090584c11151d856df330e57d) Bug: 30950746 Change-Id: Ieb3ac74392f531735fc7c949b83346a5f587a77b Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | BACKPORT: bpf: Add file mode configuration into bpf mapsChenbo Feng2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduce the map read/write flags to the eBPF syscalls that returns the map fd. The flags is used to set up the file mode when construct a new file descriptor for bpf maps. To not break the backward capability, the f_flags is set to O_RDWR if the flag passed by syscall is 0. Otherwise it should be O_RDONLY or O_WRONLY. When the userspace want to modify or read the map content, it will check the file mode to see if it is allowed to make the change. Signed-off-by: Chenbo Feng <fengc@google.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net> Deleted the file mode configuration code in unsupported map type and removed the file mode check in non-existing helper functions. (cherry-pick from net-next: 6e71b04a82248ccf13a94b85cbc674a9fefe53f5) Bug: 30950746 Change-Id: Icfad20f1abb77f91068d244fb0d87fa40824dd1b Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | bpf: move bpf_map_show_fdinfo to match upstream locationAnay Wadhera2022-04-19
| | | | | | | | | | | | Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | bpf/verifier: reject BPF_ALU64|BPF_ENDEdward Cree2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit e67b8a685c7c984e834e3181ef4619cd7025a136 ] Neither ___bpf_prog_run nor the JITs accept it. Also adds a new test case. Fixes: 17a5267067f3 ("bpf: verifier (add verifier core)") Signed-off-by: Edward Cree <ecree@solarflare.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | bpf/verifier: fix min/max handling in BPF_SUBEdward Cree2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 9305706c2e808ae59f1eb201867f82f1ddf6d7a6 ] We have to subtract the src max from the dst min, and vice-versa, since (e.g.) the smallest result comes from the largest subtrahend. Fixes: 484611357c19 ("bpf: allow access into map value arrays") Signed-off-by: Edward Cree <ecree@solarflare.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | bpf: fix mixed signed/unsigned derived min/max value boundsDaniel Borkmann2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 4cabc5b186b5427b9ee5a7495172542af105f02b ] Edward reported that there's an issue in min/max value bounds tracking when signed and unsigned compares both provide hints on limits when having unknown variables. E.g. a program such as the following should have been rejected: 0: (7a) *(u64 *)(r10 -8) = 0 1: (bf) r2 = r10 2: (07) r2 += -8 3: (18) r1 = 0xffff8a94cda93400 5: (85) call bpf_map_lookup_elem#1 6: (15) if r0 == 0x0 goto pc+7 R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R10=fp 7: (7a) *(u64 *)(r10 -16) = -8 8: (79) r1 = *(u64 *)(r10 -16) 9: (b7) r2 = -1 10: (2d) if r1 > r2 goto pc+3 R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R1=inv,min_value=0 R2=imm-1,max_value=18446744073709551615,min_align=1 R10=fp 11: (65) if r1 s> 0x1 goto pc+2 R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R1=inv,min_value=0,max_value=1 R2=imm-1,max_value=18446744073709551615,min_align=1 R10=fp 12: (0f) r0 += r1 13: (72) *(u8 *)(r0 +0) = 0 R0=map_value_adj(ks=8,vs=8,id=0),min_value=0,max_value=1 R1=inv,min_value=0,max_value=1 R2=imm-1,max_value=18446744073709551615,min_align=1 R10=fp 14: (b7) r0 = 0 15: (95) exit What happens is that in the first part ... 8: (79) r1 = *(u64 *)(r10 -16) 9: (b7) r2 = -1 10: (2d) if r1 > r2 goto pc+3 ... r1 carries an unsigned value, and is compared as unsigned against a register carrying an immediate. Verifier deduces in reg_set_min_max() that since the compare is unsigned and operation is greater than (>), that in the fall-through/false case, r1's minimum bound must be 0 and maximum bound must be r2. Latter is larger than the bound and thus max value is reset back to being 'invalid' aka BPF_REGISTER_MAX_RANGE. Thus, r1 state is now 'R1=inv,min_value=0'. The subsequent test ... 11: (65) if r1 s> 0x1 goto pc+2 ... is a signed compare of r1 with immediate value 1. Here, verifier deduces in reg_set_min_max() that since the compare is signed this time and operation is greater than (>), that in the fall-through/false case, we can deduce that r1's maximum bound must be 1, meaning with prior test, we result in r1 having the following state: R1=inv,min_value=0,max_value=1. Given that the actual value this holds is -8, the bounds are wrongly deduced. When this is being added to r0 which holds the map_value(_adj) type, then subsequent store access in above case will go through check_mem_access() which invokes check_map_access_adj(), that will then probe whether the map memory is in bounds based on the min_value and max_value as well as access size since the actual unknown value is min_value <= x <= max_value; commit fce366a9dd0d ("bpf, verifier: fix alu ops against map_value{, _adj} register types") provides some more explanation on the semantics. It's worth to note in this context that in the current code, min_value and max_value tracking are used for two things, i) dynamic map value access via check_map_access_adj() and since commit 06c1c049721a ("bpf: allow helpers access to variable memory") ii) also enforced at check_helper_mem_access() when passing a memory address (pointer to packet, map value, stack) and length pair to a helper and the length in this case is an unknown value defining an access range through min_value/max_value in that case. The min_value/max_value tracking is /not/ used in the direct packet access case to track ranges. However, the issue also affects case ii), for example, the following crafted program based on the same principle must be rejected as well: 0: (b7) r2 = 0 1: (bf) r3 = r10 2: (07) r3 += -512 3: (7a) *(u64 *)(r10 -16) = -8 4: (79) r4 = *(u64 *)(r10 -16) 5: (b7) r6 = -1 6: (2d) if r4 > r6 goto pc+5 R1=ctx R2=imm0,min_value=0,max_value=0,min_align=2147483648 R3=fp-512 R4=inv,min_value=0 R6=imm-1,max_value=18446744073709551615,min_align=1 R10=fp 7: (65) if r4 s> 0x1 goto pc+4 R1=ctx R2=imm0,min_value=0,max_value=0,min_align=2147483648 R3=fp-512 R4=inv,min_value=0,max_value=1 R6=imm-1,max_value=18446744073709551615,min_align=1 R10=fp 8: (07) r4 += 1 9: (b7) r5 = 0 10: (6a) *(u16 *)(r10 -512) = 0 11: (85) call bpf_skb_load_bytes#26 12: (b7) r0 = 0 13: (95) exit Meaning, while we initialize the max_value stack slot that the verifier thinks we access in the [1,2] range, in reality we pass -7 as length which is interpreted as u32 in the helper. Thus, this issue is relevant also for the case of helper ranges. Resetting both bounds in check_reg_overflow() in case only one of them exceeds limits is also not enough as similar test can be created that uses values which are within range, thus also here learned min value in r1 is incorrect when mixed with later signed test to create a range: 0: (7a) *(u64 *)(r10 -8) = 0 1: (bf) r2 = r10 2: (07) r2 += -8 3: (18) r1 = 0xffff880ad081fa00 5: (85) call bpf_map_lookup_elem#1 6: (15) if r0 == 0x0 goto pc+7 R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R10=fp 7: (7a) *(u64 *)(r10 -16) = -8 8: (79) r1 = *(u64 *)(r10 -16) 9: (b7) r2 = 2 10: (3d) if r2 >= r1 goto pc+3 R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R1=inv,min_value=3 R2=imm2,min_value=2,max_value=2,min_align=2 R10=fp 11: (65) if r1 s> 0x4 goto pc+2 R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R1=inv,min_value=3,max_value=4 R2=imm2,min_value=2,max_value=2,min_align=2 R10=fp 12: (0f) r0 += r1 13: (72) *(u8 *)(r0 +0) = 0 R0=map_value_adj(ks=8,vs=8,id=0),min_value=3,max_value=4 R1=inv,min_value=3,max_value=4 R2=imm2,min_value=2,max_value=2,min_align=2 R10=fp 14: (b7) r0 = 0 15: (95) exit This leaves us with two options for fixing this: i) to invalidate all prior learned information once we switch signed context, ii) to track min/max signed and unsigned boundaries separately as done in [0]. (Given latter introduces major changes throughout the whole verifier, it's rather net-next material, thus this patch follows option i), meaning we can derive bounds either from only signed tests or only unsigned tests.) There is still the case of adjust_reg_min_max_vals(), where we adjust bounds on ALU operations, meaning programs like the following where boundaries on the reg get mixed in context later on when bounds are merged on the dst reg must get rejected, too: 0: (7a) *(u64 *)(r10 -8) = 0 1: (bf) r2 = r10 2: (07) r2 += -8 3: (18) r1 = 0xffff89b2bf87ce00 5: (85) call bpf_map_lookup_elem#1 6: (15) if r0 == 0x0 goto pc+6 R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R10=fp 7: (7a) *(u64 *)(r10 -16) = -8 8: (79) r1 = *(u64 *)(r10 -16) 9: (b7) r2 = 2 10: (3d) if r2 >= r1 goto pc+2 R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R1=inv,min_value=3 R2=imm2,min_value=2,max_value=2,min_align=2 R10=fp 11: (b7) r7 = 1 12: (65) if r7 s> 0x0 goto pc+2 R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R1=inv,min_value=3 R2=imm2,min_value=2,max_value=2,min_align=2 R7=imm1,max_value=0 R10=fp 13: (b7) r0 = 0 14: (95) exit from 12 to 15: R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R1=inv,min_value=3 R2=imm2,min_value=2,max_value=2,min_align=2 R7=imm1,min_value=1 R10=fp 15: (0f) r7 += r1 16: (65) if r7 s> 0x4 goto pc+2 R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R1=inv,min_value=3 R2=imm2,min_value=2,max_value=2,min_align=2 R7=inv,min_value=4,max_value=4 R10=fp 17: (0f) r0 += r7 18: (72) *(u8 *)(r0 +0) = 0 R0=map_value_adj(ks=8,vs=8,id=0),min_value=4,max_value=4 R1=inv,min_value=3 R2=imm2,min_value=2,max_value=2,min_align=2 R7=inv,min_value=4,max_value=4 R10=fp 19: (b7) r0 = 0 20: (95) exit Meaning, in adjust_reg_min_max_vals() we must also reset range values on the dst when src/dst registers have mixed signed/ unsigned derived min/max value bounds with one unbounded value as otherwise they can be added together deducing false boundaries. Once both boundaries are established from either ALU ops or compare operations w/o mixing signed/unsigned insns, then they can safely be added to other regs also having both boundaries established. Adding regs with one unbounded side to a map value where the bounded side has been learned w/o mixing ops is possible, but the resulting map value won't recover from that, meaning such op is considered invalid on the time of actual access. Invalid bounds are set on the dst reg in case i) src reg, or ii) in case dst reg already had them. The only way to recover would be to perform i) ALU ops but only 'add' is allowed on map value types or ii) comparisons, but these are disallowed on pointers in case they span a range. This is fine as only BPF_JEQ and BPF_JNE may be performed on PTR_TO_MAP_VALUE_OR_NULL registers which potentially turn them into PTR_TO_MAP_VALUE type depending on the branch, so only here min/max value cannot be invalidated for them. In terms of state pruning, value_from_signed is considered as well in states_equal() when dealing with adjusted map values. With regards to breaking existing programs, there is a small risk, but use-cases are rather quite narrow where this could occur and mixing compares probably unlikely. Joint work with Josef and Edward. [0] https://lists.iovisor.org/pipermail/iovisor-dev/2017-June/000822.html Fixes: 484611357c19 ("bpf: allow access into map value arrays") Reported-by: Edward Cree <ecree@solarflare.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | bpf, verifier: fix alu ops against map_value{, _adj} register typesDaniel Borkmann2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit fce366a9dd0ddc47e7ce05611c266e8574a45116 ] While looking into map_value_adj, I noticed that alu operations directly on the map_value() resp. map_value_adj() register (any alu operation on a map_value() register will turn it into a map_value_adj() typed register) are not sufficiently protected against some of the operations. Two non-exhaustive examples are provided that the verifier needs to reject: i) BPF_AND on r0 (map_value_adj): 0: (bf) r2 = r10 1: (07) r2 += -8 2: (7a) *(u64 *)(r2 +0) = 0 3: (18) r1 = 0xbf842a00 5: (85) call bpf_map_lookup_elem#1 6: (15) if r0 == 0x0 goto pc+2 R0=map_value(ks=8,vs=48,id=0),min_value=0,max_value=0 R10=fp 7: (57) r0 &= 8 8: (7a) *(u64 *)(r0 +0) = 22 R0=map_value_adj(ks=8,vs=48,id=0),min_value=0,max_value=8 R10=fp 9: (95) exit from 6 to 9: R0=inv,min_value=0,max_value=0 R10=fp 9: (95) exit processed 10 insns ii) BPF_ADD in 32 bit mode on r0 (map_value_adj): 0: (bf) r2 = r10 1: (07) r2 += -8 2: (7a) *(u64 *)(r2 +0) = 0 3: (18) r1 = 0xc24eee00 5: (85) call bpf_map_lookup_elem#1 6: (15) if r0 == 0x0 goto pc+2 R0=map_value(ks=8,vs=48,id=0),min_value=0,max_value=0 R10=fp 7: (04) (u32) r0 += (u32) 0 8: (7a) *(u64 *)(r0 +0) = 22 R0=map_value_adj(ks=8,vs=48,id=0),min_value=0,max_value=0 R10=fp 9: (95) exit from 6 to 9: R0=inv,min_value=0,max_value=0 R10=fp 9: (95) exit processed 10 insns Issue is, while min_value / max_value boundaries for the access are adjusted appropriately, we change the pointer value in a way that cannot be sufficiently tracked anymore from its origin. Operations like BPF_{AND,OR,DIV,MUL,etc} on a destination register that is PTR_TO_MAP_VALUE{,_ADJ} was probably unintended, in fact, all the test cases coming with 484611357c19 ("bpf: allow access into map value arrays") perform BPF_ADD only on the destination register that is PTR_TO_MAP_VALUE_ADJ. Only for UNKNOWN_VALUE register types such operations make sense, f.e. with unknown memory content fetched initially from a constant offset from the map value memory into a register. That register is then later tested against lower / upper bounds, so that the verifier can then do the tracking of min_value / max_value, and properly check once that UNKNOWN_VALUE register is added to the destination register with type PTR_TO_MAP_VALUE{,_ADJ}. This is also what the original use-case is solving. Note, tracking on what is being added is done through adjust_reg_min_max_vals() and later access to the map value enforced with these boundaries and the given offset from the insn through check_map_access_adj(). Tests will fail for non-root environment due to prohibited pointer arithmetic, in particular in check_alu_op(), we bail out on the is_pointer_value() check on the dst_reg (which is false in root case as we allow for pointer arithmetic via env->allow_ptr_leaks). Similarly to PTR_TO_PACKET, one way to fix it is to restrict the allowed operations on PTR_TO_MAP_VALUE{,_ADJ} registers to 64 bit mode BPF_ADD. The test_verifier suite runs fine after the patch and it also rejects mentioned test cases. Fixes: 484611357c19 ("bpf: allow access into map value arrays") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Josef Bacik <jbacik@fb.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | bpf: adjust verifier heuristicsDaniel Borkmann2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 3c2ce60bdd3d57051bf85615deec04a694473840 ] Current limits with regards to processing program paths do not really reflect today's needs anymore due to programs becoming more complex and verifier smarter, keeping track of more data such as const ALU operations, alignment tracking, spilling of PTR_TO_MAP_VALUE_ADJ registers, and other features allowing for smarter matching of what LLVM generates. This also comes with the side-effect that we result in fewer opportunities to prune search states and thus often need to do more work to prove safety than in the past due to different register states and stack layout where we mismatch. Generally, it's quite hard to determine what caused a sudden increase in complexity, it could be caused by something as trivial as a single branch somewhere at the beginning of the program where LLVM assigned a stack slot that is marked differently throughout other branches and thus causing a mismatch, where verifier then needs to prove safety for the whole rest of the program. Subsequently, programs with even less than half the insn size limit can get rejected. We noticed that while some programs load fine under pre 4.11, they get rejected due to hitting limits on more recent kernels. We saw that in the vast majority of cases (90+%) pruning failed due to register mismatches. In case of stack mismatches, majority of cases failed due to different stack slot types (invalid, spill, misc) rather than differences in spilled registers. This patch makes pruning more aggressive by also adding markers that sit at conditional jumps as well. Currently, we only mark jump targets for pruning. For example in direct packet access, these are usually error paths where we bail out. We found that adding these markers, it can reduce number of processed insns by up to 30%. Another option is to ignore reg->id in probing PTR_TO_MAP_VALUE_OR_NULL registers, which can help pruning slightly as well by up to 7% observed complexity reduction as stand-alone. Meaning, if a previous path with register type PTR_TO_MAP_VALUE_OR_NULL for map X was found to be safe, then in the current state a PTR_TO_MAP_VALUE_OR_NULL register for the same map X must be safe as well. Last but not least the patch also adds a scheduling point and bumps the current limit for instructions to be processed to a more adequate value. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | bpf, verifier: add additional patterns to evaluate_reg_imm_aluJohn Fastabend2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 43188702b3d98d2792969a3377a30957f05695e6 ] Currently the verifier does not track imm across alu operations when the source register is of unknown type. This adds additional pattern matching to catch this and track imm. We've seen LLVM generating this pattern while working on cilium. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | bpf: prevent leaking pointer via xadd on unpriviledgedDaniel Borkmann2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 6bdf6abc56b53103324dfd270a86580306e1a232 upstream. Leaking kernel addresses on unpriviledged is generally disallowed, for example, verifier rejects the following: 0: (b7) r0 = 0 1: (18) r2 = 0xffff897e82304400 3: (7b) *(u64 *)(r1 +48) = r2 R2 leaks addr into ctx Doing pointer arithmetic on them is also forbidden, so that they don't turn into unknown value and then get leaked out. However, there's xadd as a special case, where we don't check the src reg for being a pointer register, e.g. the following will pass: 0: (b7) r0 = 0 1: (7b) *(u64 *)(r1 +48) = r0 2: (18) r2 = 0xffff897e82304400 ; map 4: (db) lock *(u64 *)(r1 +48) += r2 5: (95) exit We could store the pointer into skb->cb, loose the type context, and then read it out from there again to leak it eventually out of a map value. Or more easily in a different variant, too: 0: (bf) r6 = r1 1: (7a) *(u64 *)(r10 -8) = 0 2: (bf) r2 = r10 3: (07) r2 += -8 4: (18) r1 = 0x0 6: (85) call bpf_map_lookup_elem#1 7: (15) if r0 == 0x0 goto pc+3 R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R6=ctx R10=fp 8: (b7) r3 = 0 9: (7b) *(u64 *)(r0 +0) = r3 10: (db) lock *(u64 *)(r0 +0) += r6 11: (b7) r0 = 0 12: (95) exit from 7 to 11: R0=inv,min_value=0,max_value=0 R6=ctx R10=fp 11: (b7) r0 = 0 12: (95) exit Prevent this by checking xadd src reg for pointer types. Also add a couple of test cases related to this. Fixes: 1be7f75d1668 ("bpf: enable non-root eBPF programs") Fixes: 17a5267067f3 ("bpf: verifier (add verifier core)") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | bpf: don't trigger OOM killer under pressure with map allocDaniel Borkmann2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit d407bd25a204bd66b7346dde24bd3d37ef0e0b05 ] This patch adds two helpers, bpf_map_area_alloc() and bpf_map_area_free(), that are to be used for map allocations. Using kmalloc() for very large allocations can cause excessive work within the page allocator, so i) fall back earlier to vmalloc() when the attempt is considered costly anyway, and even more importantly ii) don't trigger OOM killer with any of the allocators. Since this is based on a user space request, for example, when creating maps with element pre-allocation, we really want such requests to fail instead of killing other user space processes. Also, don't spam the kernel log with warnings should any of the allocations fail under pressure. Given that, we can make backend selection in bpf_map_area_alloc() generic, and convert all maps over to use this API for spots with potentially large allocation requests. Note, replacing the one kmalloc_array() is fine as overflow checks happen earlier in htab_map_alloc(), since it must also protect the multiplication for vmalloc() should kmalloc_array() fail. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | FROMLIST: bpf: cgroup skb progs cannot access ld_abs/indDaniel Borkmann2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit fb9a307d11d6 ("bpf: Allow CGROUP_SKB eBPF program to access sk_buff") enabled programs of BPF_PROG_TYPE_CGROUP_SKB type to use ld_abs/ind instructions. However, at this point, we cannot use them, since offsets relative to SKF_LL_OFF will end up pointing skb_mac_header(skb) out of bounds since in the egress path it is not yet set at that point in time, but only after __dev_queue_xmit() did a general reset on the mac header. bpf_internal_load_pointer_neg_helper() will then end up reading data from a wrong offset. BPF_PROG_TYPE_CGROUP_SKB programs can use bpf_skb_load_bytes() already to access packet data, which is also more flexible than the insns carried over from cBPF. Fixes: fb9a307d11d6 ("bpf: Allow CGROUP_SKB eBPF program to access sk_buff") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Cc: Chenbo Feng <fengc@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> (url: http://patchwork.ozlabs.org/patch/771946/) Signed-off-by: Chenbo Feng <fengc@google.com> Bug: 30950746 Change-Id: Ia32ac79d8c0d18f811ec101897284a8b60cb042a Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | FROMLIST: [net-next,v2,2/2] bpf: Remove the capability check for cgroup skb ↵Chenbo Feng2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | eBPF program Currently loading a cgroup skb eBPF program require a CAP_SYS_ADMIN capability while attaching the program to a cgroup only requires the user have CAP_NET_ADMIN privilege. We can escape the capability check when load the program just like socket filter program to make the capability requirement consistent. Change since v1: Change the code style in order to be compliant with checkpatch.pl preference (url: http://patchwork.ozlabs.org/patch/769460/) Signed-off-by: Chenbo Feng <fengc@google.com> Bug: 30950746 Change-Id: Ibe51235127d6f9349b8f563ad31effc061b278ed Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | FROMLIST: [net-next,v2,1/2] bpf: Allow CGROUP_SKB eBPF program to access sk_buffChenbo Feng2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This allows cgroup eBPF program to classify packet based on their protocol or other detail information. Currently program need CAP_NET_ADMIN privilege to attach a cgroup eBPF program, and A process with CAP_NET_ADMIN can already see all packets on the system, for example, by creating an iptables rules that causes the packet to be passed to userspace via NFLOG. (url: http://patchwork.ozlabs.org/patch/769459/) Signed-off-by: Chenbo Feng <fengc@google.com> Bug: 30950746 Change-Id: I11bef84ce26cf8b8f1b89483c32a7fcdd61ae926 Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | UPSTREAM: bpf: cgroup: fix documentation of __cgroup_bpf_update()Daniel Mack2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There's a 'not' missing in one paragraph. Add it. Fixes: 3007098494be ("cgroup: add support for eBPF programs") Signed-off-by: Daniel Mack <daniel@zonque.org> Reported-by: Rami Rosen <roszenrami@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Fixes: Change-Id: I3df35d8d3b1261503f9b5bcd90b18c9358f1ac28        ("cgroup: add support for eBPF programs") (cherry picked from commit 01ae87eab53675cbdabd5c4d727c4a35e397cce0) Signed-off-by: Amit Pundir <amit.pundir@linaro.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | BACKPORT: bpf: introduce BPF_F_ALLOW_OVERRIDE flagAlexei Starovoitov2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If BPF_F_ALLOW_OVERRIDE flag is used in BPF_PROG_ATTACH command to the given cgroup the descendent cgroup will be able to override effective bpf program that was inherited from this cgroup. By default it's not passed, therefore override is disallowed. Examples: 1. prog X attached to /A with default prog Y fails to attach to /A/B and /A/B/C Everything under /A runs prog X 2. prog X attached to /A with allow_override. prog Y fails to attach to /A/B with default (non-override) prog M attached to /A/B with allow_override. Everything under /A/B runs prog M only. 3. prog X attached to /A with allow_override. prog Y fails to attach to /A with default. The user has to detach first to switch the mode. In the future this behavior may be extended with a chain of non-overridable programs. Also fix the bug where detach from cgroup where nothing is attached was not throwing error. Return ENOENT in such case. Add several testcases and adjust libbpf. Fixes: 3007098494be ("cgroup: add support for eBPF programs") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Daniel Mack <daniel@zonque.org> Signed-off-by: David S. Miller <davem@davemloft.net> Fixes: Change-Id: I3df35d8d3b1261503f9b5bcd90b18c9358f1ac28        ("cgroup: add support for eBPF programs") [AmitP: Refactored original patch for android-4.9 where libbpf sources are in samples/bpf/ and test_cgrp2_attach2, test_cgrp2_sock, and test_cgrp2_sock2 sample tests do not exist.] (cherry picked from commit 7f677633379b4abb3281cdbe7e7006f049305c03) Signed-off-by: Amit Pundir <amit.pundir@linaro.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | UPSTREAM: samples: bpf: add userspace example for attaching eBPF programs to ↵Daniel Mack2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | cgroups Cherry-pick from commit d8c5b17f2bc0de09fbbfa14d90e8168163a579e7 Add a simple userpace program to demonstrate the new API to attach eBPF programs to cgroups. This is what it does: * Create arraymap in kernel with 4 byte keys and 8 byte values * Load eBPF program The eBPF program accesses the map passed in to store two pieces of information. The number of invocations of the program, which maps to the number of packets received, is stored to key 0. Key 1 is incremented on each iteration by the number of bytes stored in the skb. * Detach any eBPF program previously attached to the cgroup * Attach the new program to the cgroup using BPF_PROG_ATTACH * Once a second, read map[0] and map[1] to see how many bytes and packets were seen on any socket of tasks in the given cgroup. The program takes a cgroup path as 1st argument, and either "ingress" or "egress" as 2nd. Optionally, "drop" can be passed as 3rd argument, which will make the generated eBPF program return 0 instead of 1, so the kernel will drop the packet. libbpf gained two new wrappers for the new syscall commands. Signed-off-by: Daniel Mack <daniel@zonque.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Bug: 30950746 Change-Id: I011436a755abd62050edd22e47995c166a0bd8a2 Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | UPSTREAM: bpf: add new prog type for cgroup socket filteringDaniel Mack2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Cherry-pick from commit 0e33661de493db325435d565a4a722120ae4cbf3 This program type is similar to BPF_PROG_TYPE_SOCKET_FILTER, except that it does not allow BPF_LD_[ABS|IND] instructions and hooks up the bpf_skb_load_bytes() helper. Programs of this type will be attached to cgroups for network filtering and accounting. Signed-off-by: Daniel Mack <daniel@zonque.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Bug: 30950746 Change-Id: I7b9e063d5d7a91da80917c6d353a60b877133752 Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | BACKPORT: UPSTREAM: bpf: pass sk to helper functionsWillem de Bruijn2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Cherrypick from commit 8f917bba0042f1e3b7693743fbe9782709e936e7 BPF helper functions access socket fields through skb->sk. This is not set in ingress cgroup and socket filters. The association is only made in skb_set_owner_r once the filter has accepted the packet. Sk is available as socket lookup has taken place. Temporarily set skb->sk to sk in these cases. Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Bug: 30950746 Change-Id: Ifcbcbe2ab2882dc79c56f9707be1d6aef08c7fd3 Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | UPSTREAM: bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commandsDaniel Mack2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Cherry-pick from commit f4324551489e8781d838f941b7aee4208e52e8bf Extend the bpf(2) syscall by two new commands, BPF_PROG_ATTACH and BPF_PROG_DETACH which allow attaching and detaching eBPF programs to a target. On the API level, the target could be anything that has an fd in userspace, hence the name of the field in union bpf_attr is called 'target_fd'. When called with BPF_ATTACH_TYPE_CGROUP_INET_{E,IN}GRESS, the target is expected to be a valid file descriptor of a cgroup v2 directory which has the bpf controller enabled. These are the only use-cases implemented by this patch at this point, but more can be added. If a program of the given type already exists in the given cgroup, the program is swapped automically, so userspace does not have to drop an existing program first before installing a new one, which would otherwise leave a gap in which no program is attached. For more information on the propagation logic to subcgroups, please refer to the bpf cgroup controller implementation. The API is guarded by CAP_NET_ADMIN. Signed-off-by: Daniel Mack <daniel@zonque.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Bug: 30950746 Change-Id: Iab156859332166835d51e1e6f64e5cb8b81870f2 Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | kernfs: implement kernfs_walk_and_get()Tejun Heo2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Implement kernfs_walk_and_get() which is similar to kernfs_find_and_get() but can walk a path instead of just a name. v2: Use strlcpy() instead of strlen() + memcpy() as suggested by David. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: David Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | BACKPORT: fs: kernfs: add poll file operationJohannes Weiner2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch series "psi: pressure stall monitors", v3. Android is adopting psi to detect and remedy memory pressure that results in stuttering and decreased responsiveness on mobile devices. Psi gives us the stall information, but because we're dealing with latencies in the millisecond range, periodically reading the pressure files to detect stalls in a timely fashion is not feasible. Psi also doesn't aggregate its averages at a high enough frequency right now. This patch series extends the psi interface such that users can configure sensitive latency thresholds and use poll() and friends to be notified when these are breached. As high-frequency aggregation is costly, it implements an aggregation method that is optimized for fast, short-interval averaging, and makes the aggregation frequency adaptive, such that high-frequency updates only happen while monitored stall events are actively occurring. With these patches applied, Android can monitor for, and ward off, mounting memory shortages before they cause problems for the user. For example, using memory stall monitors in userspace low memory killer daemon (lmkd) we can detect mounting pressure and kill less important processes before device becomes visibly sluggish. In our memory stress testing psi memory monitors produce roughly 10x less false positives compared to vmpressure signals. Having ability to specify multiple triggers for the same psi metric allows other parts of Android framework to monitor memory state of the device and act accordingly. The new interface is straightforward. The user opens one of the pressure files for writing and writes a trigger description into the file descriptor that defines the stall state - some or full, and the maximum stall time over a given window of time. E.g.: /* Signal when stall time exceeds 100ms of a 1s window */ char trigger[] = "full 100000 1000000"; fd = open("/proc/pressure/memory"); write(fd, trigger, sizeof(trigger)); while (poll() >= 0) { ... } close(fd); When the monitored stall state is entered, psi adapts its aggregation frequency according to what the configured time window requires in order to emit event signals in a timely fashion. Once the stalling subsides, aggregation reverts back to normal. The trigger is associated with the open file descriptor. To stop monitoring, the user only needs to close the file descriptor and the trigger is discarded. Patches 1-4 prepare the psi code for polling support. Patch 5 implements the adaptive polling logic, the pressure growth detection optimized for short intervals, and hooks up write() and poll() on the pressure files. The patches were developed in collaboration with Johannes Weiner. This patch (of 5): Kernfs has a standardized poll/notification mechanism for waking all pollers on all fds when a filesystem node changes. To allow polling for custom events, add a .poll callback that can override the default. This is in preparation for pollable cgroup pressure files which have per-fd trigger configurations. Link: http://lkml.kernel.org/r/20190124211518.244221-2-surenb@google.com Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Li Zefan <lizefan@huawei.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> (cherry picked from commit: 147e1a97c4a0bdd43f55a582a9416bb9092563a9) Conflicts: fs/kernfs/file.c include/linux/kernfs.h 1. replaced __poll_t with unsigned int. 2. replaced kernfs_dentry_node() with dentry->d_fsdata 3. replaced EPOLLERR/EPOLLPRI with POLLERR/POLLPRI (values are the same) Bug: 127712811 Test: lmkd in PSI mode Change-Id: Ic2bed334d05aec62f4e695f263893c3057921c55 Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | UPSTREAM: kernfs: add kernfs_ops->open/release() callbacksTejun Heo2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add ->open/release() methods to kernfs_ops. ->open() is called when the file is opened and ->release() when the file is either released or severed. These callbacks can be used, for example, to manage persistent caching objects over multiple seq_file iterations. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Acked-by: Zefan Li <lizefan@huawei.com> (cherry picked from commit 0e67db2f9fe91937e798e3d7d22c50a8438187e1) Bug: 111308141 Test: modified lmkd to use PSI and tested using lmkd_unit_test Change-Id: Id06e9d5c6da1280bcdd4dc86309dcfaf52b8f9a4 Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | kernfs: Add API to generate relative kernfs pathAditya Kali2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The new function kernfs_path_from_node() generates and returns kernfs path of a given kernfs_node relative to a given parent kernfs_node. Signed-off-by: Aditya Kali <adityakali@google.com> Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | sched: new clone flag CLONE_NEWCGROUP for cgroup namespaceAditya Kali2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | CLONE_NEWCGROUP will be used to create new cgroup namespace. Signed-off-by: Aditya Kali <adityakali@google.com> Signed-off-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | UPSTREAM: cgroup: add support for eBPF programsDaniel Mack2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Cherry-pick from commit 3007098494bec614fb55dee7bc0410bb7db5ad18 This patch adds two sets of eBPF program pointers to struct cgroup. One for such that are directly pinned to a cgroup, and one for such that are effective for it. To illustrate the logic behind that, assume the following example cgroup hierarchy. A - B - C \ D - E If only B has a program attached, it will be effective for B, C, D and E. If D then attaches a program itself, that will be effective for both D and E, and the program in B will only affect B and C. Only one program of a given type is effective for a cgroup. Attaching and detaching programs will be done through the bpf(2) syscall. For now, ingress and egress inet socket filtering are the only supported use-cases. Signed-off-by: Daniel Mack <daniel@zonque.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Bug: 30950746 Change-Id: I3df35d8d3b1261503f9b5bcd90b18c9358f1ac28 Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | cgroup: don't online subsystems before cgroup_name/path() are operationalTejun Heo2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 07cd12945551b63ecb1a349d50a6d69d1d6feb4a upstream. While refactoring cgroup creation, a5bca2152036 ("cgroup: factor out cgroup_create() out of cgroup_mkdir()") incorrectly onlined subsystems before the new cgroup is associated with it kernfs_node. This is fine for cgroup proper but cgroup_name/path() depend on the associated kernfs_node and if a subsystem makes the new cgroup_subsys_state visible, which they're allowed to after onlining, it can lead to NULL dereference. The current code performs cgroup creation and subsystem onlining in cgroup_create() and cgroup_mkdir() makes the cgroup and subsystems visible afterwards. There's no reason to online the subsystems early and we can simply drop cgroup_apply_control_enable() call from cgroup_create() so that the subsystems are onlined and made visible at the same time. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Fixes: a5bca2152036 ("cgroup: factor out cgroup_create() out of cgroup_mkdir()") Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | cgroup: fix error handling regressions in proc_cgroup_show() and ↵Tejun Heo2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | cgroup_release_agent() 4c737b41de7f ("cgroup: make cgroup_path() and friends behave in the style of strlcpy()") broke error handling in proc_cgroup_show() and cgroup_release_agent() by not handling negative return values from cgroup_path_ns_locked(). Fix it. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com> Change-Id: I30fefb0167ca977eabbf7cbb77cd2ca230cf25b1
* | | cgroup: fix invalid controller enable rejections with cgroup namespaceTejun Heo2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On the v2 hierarchy, "cgroup.subtree_control" rejects controller enables if the cgroup has processes in it. The enforcement of this logic assumes that the cgroup wouldn't have any css_sets associated with it if there are no tasks in the cgroup, which is no longer true since a79a908fd2b0 ("cgroup: introduce cgroup namespaces"). When a cgroup namespace is created, it pins the css_set of the creating task to use it as the root css_set of the namespace. This extra reference stays as long as the namespace is around and makes "cgroup.subtree_control" think that the namespace root cgroup is not empty even when it is and thus reject controller enables. Fix it by making cgroup_subtree_control() walk and test emptiness of each css_set instead of testing whether the list_head is empty. While at it, update the comment of cgroup_task_count() to indicate that the returned value may be higher than the number of tasks, which has always been true due to temporary references and doesn't break anything. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Evgeny Vereshchagin <evvers@ya.ru> Cc: Serge E. Hallyn <serge.hallyn@ubuntu.com> Cc: Aditya Kali <adityakali@google.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: stable@vger.kernel.org # v4.6+ Fixes: a79a908fd2b0 ("cgroup: introduce cgroup namespaces") Link: https://github.com/systemd/systemd/pull/3589#issuecomment-249089541 Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | kernel: add a helper to get an owning user namespace for a namespaceAndrey Vagin2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Return -EPERM if an owning user namespace is outside of a process current user namespace. v2: In a first version ns_get_owner returned ENOENT for init_user_ns. This special cases was removed from this version. There is nothing outside of init_user_ns, so we can return EPERM. v3: rename ns->get_owner() to ns->owner(). get_* usually means that it grabs a reference. Acked-by: Serge Hallyn <serge@hallyn.com> Signed-off-by: Andrei Vagin <avagin@openvz.org> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | fs: Limit file caps to the user namespace of the super blockSeth Forshee2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Capability sets attached to files must be ignored except in the user namespaces where the mounter is privileged, i.e. s_user_ns and its descendants. Otherwise a vector exists for gaining privileges in namespaces where a user is not already privileged. Add a new helper function, current_in_user_ns(), to test whether a user namespace is the same as or a descendant of another namespace. Use this helper to determine whether a file's capability set should be applied to the caps constructed during exec. --EWB Replaced in_userns with the simpler current_in_userns. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | cgroup: duplicate cgroup reference when cloning socketsJohannes Weiner2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a socket is cloned, the associated sock_cgroup_data is duplicated but not its reference on the cgroup. As a result, the cgroup reference count will underflow when both sockets are destroyed later on. Fixes: bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup") Link: http://lkml.kernel.org/r/20160914194846.11153-2-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Tejun Heo <tj@kernel.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: <stable@vger.kernel.org> [4.5+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | cgroup: make cgroup_path() and friends behave in the style of strlcpy()Tejun Heo2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | cgroup_path() and friends used to format the path from the end and thus the resulting path usually didn't start at the start of the passed in buffer. Also, when the buffer was too small, the partial result was truncated from the head rather than tail and there was no way to tell how long the full path would be. These make the functions less robust and more awkward to use. With recent updates to kernfs_path(), cgroup_path() and friends can be made to behave in strlcpy() style. * cgroup_path(), cgroup_path_ns[_locked]() and task_cgroup_path() now always return the length of the full path. If buffer is too small, it contains nul terminated truncated output. * All users updated accordingly. v2: cgroup_path() usage in kernel/sched/debug.c converted. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Serge Hallyn <serge.hallyn@ubuntu.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com> Change-Id: Ibb5c7544237e0ab6cde03c8bec80b3376917c0a8
* | | cgroupns: Only allow creation of hierarchies in the initial cgroup namespaceEric W. Biederman2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Unprivileged users can't use hierarchies if they create them as they do not have privilieges to the root directory. Which means the only thing a hiearchy created by an unprivileged user is good for is expanding the number of cgroup links in every css_set, which is a DOS attack. We could allow hierarchies to be created in namespaces in the initial user namespace. Unfortunately there is only a single namespace for the names of heirarchies, so that is likely to create more confusion than not. So do the simple thing and restrict hiearchy creation to the initial cgroup namespace. Cc: stable@vger.kernel.org Fixes: a79a908fd2b0 ("cgroup: introduce cgroup namespaces") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | cgroupns: Fix the locking in copy_cgroup_nsEric W. Biederman2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If "clone(CLONE_NEWCGROUP...)" is called it results in a nice lockdep valid splat. In __cgroup_proc_write the lock ordering is: cgroup_mutex -- through cgroup_kn_lock_live cgroup_threadgroup_rwsem In copy_process the guts of clone the lock ordering is: cgroup_threadgroup_rwsem -- through threadgroup_change_begin cgroup_mutex -- through copy_namespaces -- copy_cgroup_ns lockdep reports some a different call chains for the first ordering of cgroup_mutex and cgroup_threadgroup_rwsem but it is harder to trace. This is most definitely deadlock potential under the right circumstances. Fix this by by skipping the cgroup_mutex and making the locking in copy_cgroup_ns mirror the locking in cgroup_post_fork which also runs during fork under the cgroup_threadgroup_rwsem. Cc: stable@vger.kernel.org Fixes: a79a908fd2b0 ("cgroup: introduce cgroup namespaces") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | cgroup: Add cgroup_get_from_fdMartin KaFai Lau2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a helper function to get a cgroup2 from a fd. It will be stored in a bpf array (BPF_MAP_TYPE_CGROUP_ARRAY) which will be introduced in the later patch. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Cc: Alexei Starovoitov <ast@fb.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Tejun Heo <tj@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | cgroup: allow NULL return from ss->css_alloc()Tejun Heo2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | cgroup core expected css_alloc to return an ERR_PTR value on failure and caused NULL deref if it returned NULL. It's an easy mistake to make from an alloc function and there's no ambiguity in what's being indicated. Update css_create() so that it interprets NULL return from css_alloc as -ENOMEM. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | cgroup: remove unnecessary 0 check from css_from_id()Johannes Weiner2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | css_idr allocation starts at 1, so index 0 will never point to an item. css_from_id() currently filters that before asking idr_find(), but idr_find() would also just return NULL, so this is not needed. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | cgroup: fix idr leak for the first cgroup rootJohannes Weiner2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The valid cgroup hierarchy ID range includes 0, so we can't filter for positive numbers when freeing it, or it'll leak the first ID. No big deal, just disruptive when reading the code. The ID is freed during error handling and when the reference count hits zero, so the double-free test is not necessary; remove it. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | cgroup: remove redundant cleanup in css_createWenwei Tao2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When create css failed, before call css_free_rcu_fn, we remove the css id and exit the percpu_ref, but we will do these again in css_free_work_fn, so they are redundant. Especially the css id, that would cause problem if we remove it twice, since it may be assigned to another css after the first remove. tj: This was broken by two commits updating the free path without synchronizing the creation failure path. This can be easily triggered by trying to create more than 64k memory cgroups. Signed-off-by: Wenwei Tao <ww.tao0320@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Fixes: 9a1049da9bd2 ("percpu-refcount: require percpu_ref to be exited explicitly") Fixes: 01e586598b22 ("cgroup: release css->id after css_free") Cc: stable@vger.kernel.org # v3.17+ Signed-off-by: Chatur27 <jasonbright2709@gmail.com>