summaryrefslogtreecommitdiff
Commit message (Collapse)AuthorAge
* soreuseport: fast reuseport TCP socket selectionCraig Gallek2022-10-28
| | | | | | | | | | | | | | | | | This change extends the fast SO_REUSEPORT socket lookup implemented for UDP to TCP. Listener sockets with SO_REUSEPORT and the same receive address are additionally added to an array for faster random access. This means that only a single socket from the group must be found in the listener list before any socket in the group can be used to receive a packet. Previously, every socket in the group needed to be considered before handing off the incoming packet. This feature also exposes the ability to use a BPF program when selecting a socket from a reuseport group. Signed-off-by: Craig Gallek <kraig@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Change-Id: I84cc6952d273fcb1b1dba7cbf31504ea135ecdae
* inet: create IPv6-equivalent inet_hash functionCraig Gallek2022-10-28
| | | | | | | | | | | | | | | | | | | | | | In order to support fast lookups for TCP sockets with SO_REUSEPORT, the function that adds sockets to the listening hash set needs to be able to check receive address equality. Since this equality check is different for IPv4 and IPv6, we will need two different socket hashing functions. This patch adds inet6_hash identical to the existing inet_hash function and updates the appropriate references. A following patch will differentiate the two by passing different comparison functions to __inet_hash. Additionally, in order to use the IPv6 address equality function from inet6_hashtables (which is compiled as a built-in object when IPv6 is enabled) it also needs to be in a built-in object file as well. This moves ipv6_rcv_saddr_equal into inet_hashtables to accomplish this. Signed-off-by: Craig Gallek <kraig@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Change-Id: I3e53baa0fe0a8220dc3ed79b619796daa0f37827
* sock: struct proto hash function may errorCraig Gallek2022-10-28
| | | | | | | | | | | | In order to support fast reuseport lookups in TCP, the hash function defined in struct proto must be capable of returning an error code. This patch changes the function signature of all related hash functions to return an integer and handles or propagates this return value at all call sites. Signed-off-by: Craig Gallek <kraig@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Change-Id: I03f3906db3060ca9a743d7ea0adc4fdce2047da2
* cgroup: Fix sock_cgroup_data on big-endian.Cong Wang2022-10-28
| | | | | | | | | | | | | [ Upstream commit 14b032b8f8fce03a546dcf365454bec8c4a58d7d ] In order for no_refcnt and is_data to be the lowest order two bits in the 'val' we have to pad out the bitfield of the u8. Fixes: ad0f75e5f57c ("cgroup: fix cgroup_sk_alloc() for sk_clone_lock()") Reported-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Change-Id: I58b286e24b67cf1bcdfb3d6b4aa95289c92966ad
* selinux: always allow mounting submountsOndrej Mosnacek2022-10-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 2cbdcb882f97a45f7475c67ac6257bbc16277dfe ] If a superblock has the MS_SUBMOUNT flag set, we should always allow mounting it. These mounts are done automatically by the kernel either as part of mounting some parent mount (e.g. debugfs always mounts tracefs under "tracing" for compatibility) or they are mounted automatically as needed on subdirectory accesses (e.g. NFS crossmnt mounts). Since such automounts are either an implicit consequence of the parent mount (which is already checked) or they can happen during regular accesses (where it doesn't make sense to check against the current task's context), the mount permission check should be skipped for them. Without this patch, attempts to access contents of an automounted directory can cause unexpected SELinux denials. In the current kernel tree, the MS_SUBMOUNT flag is set only via vfs_submount(), which is called only from the following places: - AFS, when automounting special "symlinks" referencing other cells - CIFS, when automounting "referrals" - NFS, when automounting subtrees - debugfs, when automounting tracefs In all cases the submounts are meant to be transparent to the user and it makes sense that if mounting the master is allowed, then so should be the automounts. Note that CAP_SYS_ADMIN capability checking is already skipped for (SB_KERNMOUNT|SB_SUBMOUNT) in: - sget_userns() in fs/super.c: if (!(flags & (SB_KERNMOUNT|SB_SUBMOUNT)) && !(type->fs_flags & FS_USERNS_MOUNT) && !capable(CAP_SYS_ADMIN)) return ERR_PTR(-EPERM); - sget() in fs/super.c: /* Ensure the requestor has permissions over the target filesystem */ if (!(flags & (SB_KERNMOUNT|SB_SUBMOUNT)) && !ns_capable(user_ns, CAP_SYS_ADMIN)) return ERR_PTR(-EPERM); Verified internally on patched RHEL 7.6 with a reproducer using NFS+httpd and selinux-tesuite. Fixes: 93faccbbfa95 ("fs: Better permission checking for submounts") Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Sasha Levin <sashal@kernel.org> Change-Id: Ic9e93767d111b54845c2ca24c4fc10be64e32fb6
* userns: Don't fail follow_automount based on s_user_nsEric W. Biederman2022-10-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit bbc3e471011417598e598707486f5d8814ec9c01 ] When vfs_submount was added the test to limit automounts from filesystems that with s_user_ns != &init_user_ns accidentially left in follow_automount. The test was never about any security concerns and was always about how do we implement this for filesystems whose s_user_ns != &init_user_ns. At the moment this check makes no difference as there are no filesystems that both set FS_USERNS_MOUNT and implement d_automount. Remove this check now while I am thinking about it so there will not be odd booby traps for someone who does want to make this combination work. vfs_submount still needs improvements to allow this combination to work, and vfs_submount contains a check that presents a warning. The autofs4 filesystem could be modified to set FS_USERNS_MOUNT and it would need not work on this code path, as userspace performs the mounts. Fixes: 93faccbbfa95 ("fs: Better permission checking for submounts") Fixes: aeaa4a79ff6a ("fs: Call d_automount with the filesystems creds") Acked-by: Ian Kent <raven@themaw.net> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Change-Id: I1707ab45c9b3b23ba9c06bfb4738fc85b8f9e166
* fs: Better permission checking for submountsEric W. Biederman2022-10-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 93faccbbfa958a9668d3ab4e30f38dd205cee8d8 upstream. To support unprivileged users mounting filesystems two permission checks have to be performed: a test to see if the user allowed to create a mount in the mount namespace, and a test to see if the user is allowed to access the specified filesystem. The automount case is special in that mounting the original filesystem grants permission to mount the sub-filesystems, to any user who happens to stumble across the their mountpoint and satisfies the ordinary filesystem permission checks. Attempting to handle the automount case by using override_creds almost works. It preserves the idea that permission to mount the original filesystem is permission to mount the sub-filesystem. Unfortunately using override_creds messes up the filesystems ordinary permission checks. Solve this by being explicit that a mount is a submount by introducing vfs_submount, and using it where appropriate. vfs_submount uses a new mount internal mount flags MS_SUBMOUNT, to let sget and friends know that a mount is a submount so they can take appropriate action. sget and sget_userns are modified to not perform any permission checks on submounts. follow_automount is modified to stop using override_creds as that has proven problemantic. do_mount is modified to always remove the new MS_SUBMOUNT flag so that we know userspace will never by able to specify it. autofs4 is modified to stop using current_real_cred that was put in there to handle the previous version of submount permission checking. cifs is modified to pass the mountpoint all of the way down to vfs_submount. debugfs is modified to pass the mountpoint all of the way down to trace_automount by adding a new parameter. To make this change easier a new typedef debugfs_automount_t is introduced to capture the type of the debugfs automount function. Fixes: 069d5ac9ae0d ("autofs: Fix automounts by using current_real_cred()->uid") Fixes: aeaa4a79ff6a ("fs: Call d_automount with the filesystems creds") Reviewed-by: Trond Myklebust <trond.myklebust@primarydata.com> Reviewed-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Change-Id: I09cb1f35368fb8dc4a64b5ac5a35c9d2843ef95b
* mnt: Move the FS_USERNS_MOUNT check into sget_usernsEric W. Biederman2022-10-28
| | | | | | | | | | | Allowing a filesystem to be mounted by other than root in the initial user namespace is a filesystem property not a mount namespace property and as such should be checked in filesystem specific code. Move the FS_USERNS_MOUNT test into super.c:sget_userns(). Acked-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Change-Id: I5da9f5ce3e7b85379a771617e3238817b777eab4
* locks: sprinkle some tracepoints around the file locking codeJeff Layton2022-10-28
| | | | | | | | | | Add some tracepoints around the POSIX locking code. These were useful when tracking down problems when handling the race between setlk and close. Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Acked-by: "J. Bruce Fields" <bfields@fieldses.org> Change-Id: I270eda634890d21399ccf939ad6d03b7d201a148
* locks: rename __posix_lock_file to posix_lock_inodeJeff Layton2022-10-28
| | | | | | | | ...a more descriptive name and we can drop the double underscore prefix. Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Acked-by: "J. Bruce Fields" <bfields@fieldses.org> Change-Id: Iafb3bd86e5791d9c36bff3be7a876fa8aeb98afa
* autofs: Fix automounts by using current_real_cred()->uidEric W. Biederman2022-10-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Seth Forshee reports that in 4.8-rcN some automounts are failing because the requesting the automount changed. The relevant call path is: follow_automount() ->d_automount autofs4_d_automount autofs4_mount_wait autofs4_wait In autofs4_wait wq_uid and wq_gid are set to current_uid() and current_gid respectively. With follow_automount now overriding creds uid that we export to userspace changes and that breaks existing setups. To remove the regression set wq_uid and wq_gid from current_real_cred()->uid and current_real_cred()->gid respectively. This restores the current behavior as current->real_cred is identical to current->cred except when override creds are used. Cc: stable@vger.kernel.org Fixes: aeaa4a79ff6a ("fs: Call d_automount with the filesystems creds") Reported-by: Seth Forshee <seth.forshee@canonical.com> Tested-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Change-Id: I3ec133334218ec9bd108b18c92fd852104f56926
* fs: Call d_automount with the filesystems credsEric W. Biederman2022-10-28
| | | | | | | | | | | | | | | | | | | | | | | Seth Forshee reported a mount regression in nfs autmounts with "fs: Add user namespace member to struct super_block". It turns out that the assumption that current->cred is something reasonable during mount while necessary to improve support of unprivileged mounts is wrong in the automount path. To fix the existing filesystems override current->cred with the init_cred before calling d_automount and restore current->cred after d_automount completes. To support unprivileged mounts would require a more nuanced cred selection, so fail on unprivileged mounts for the time being. As none of the filesystems that currently set FS_USERNS_MOUNT implement d_automount this check is only good for preventing future problems. Fixes: 6e4eab577a0c ("fs: Add user namespace member to struct super_block") Tested-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Change-Id: I972485e9da3f2883e4ec9b38da3374e0993b1af6
* UPSTREAM: kernfs: Check KERNFS_HAS_RELEASE before calling kernfs_release_file()Vaibhav Jain2022-10-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Recently started seeing a kernel oops when a module tries removing a memory mapped sysfs bin_attribute. On closer investigation the root cause seems to be kernfs_release_file() trying to call kernfs_op.release() callback that's NULL for such sysfs bin_attributes. The oops occurs when kernfs_release_file() is called from kernfs_drain_open_files() to cleanup any open handles with active memory mappings. The patch fixes this by checking for flag KERNFS_HAS_RELEASE before calling kernfs_release_file() in function kernfs_drain_open_files(). On ppc64-le arch with cxl module the oops back-trace is of the form below: [ 861.381126] Unable to handle kernel paging request for instruction fetch [ 861.381360] Faulting instruction address: 0x00000000 [ 861.381428] Oops: Kernel access of bad area, sig: 11 [#1] .... [ 861.382481] NIP: 0000000000000000 LR: c000000000362c60 CTR: 0000000000000000 .... Call Trace: [c000000f1680b750] [c000000000362c34] kernfs_drain_open_files+0x104/0x1d0 (unreliable) [c000000f1680b790] [c00000000035fa00] __kernfs_remove+0x260/0x2c0 [c000000f1680b820] [c000000000360da0] kernfs_remove_by_name_ns+0x60/0xe0 [c000000f1680b8b0] [c0000000003638f4] sysfs_remove_bin_file+0x24/0x40 [c000000f1680b8d0] [c00000000062a164] device_remove_bin_file+0x24/0x40 [c000000f1680b8f0] [d000000009b7b22c] cxl_sysfs_afu_remove+0x144/0x170 [cxl] [c000000f1680b940] [d000000009b7c7e4] cxl_remove+0x6c/0x1a0 [cxl] [c000000f1680b990] [c00000000052f694] pci_device_remove+0x64/0x110 [c000000f1680b9d0] [c0000000006321d4] device_release_driver_internal+0x1f4/0x2b0 [c000000f1680ba20] [c000000000525cb0] pci_stop_bus_device+0xa0/0xd0 [c000000f1680ba60] [c000000000525e80] pci_stop_and_remove_bus_device+0x20/0x40 [c000000f1680ba90] [c00000000004a6c4] pci_hp_remove_devices+0x84/0xc0 [c000000f1680bad0] [c00000000004a688] pci_hp_remove_devices+0x48/0xc0 [c000000f1680bb10] [c0000000009dfda4] eeh_reset_device+0xb0/0x290 [c000000f1680bbb0] [c000000000032b4c] eeh_handle_normal_event+0x47c/0x530 [c000000f1680bc60] [c000000000032e64] eeh_handle_event+0x174/0x350 [c000000f1680bd10] [c000000000033228] eeh_event_handler+0x1e8/0x1f0 [c000000f1680bdc0] [c0000000000d384c] kthread+0x14c/0x190 [c000000f1680be30] [c00000000000b5a0] ret_from_kernel_thread+0x5c/0xbc Fixes: f83f3c515654 ("kernfs: fix locking around kernfs_ops->release() callback") Signed-off-by: Vaibhav Jain <vaibhav@linux.vnet.ibm.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> (cherry picked from commit 966fa72a716ceafc69de901a31f7cc1f52b35f81) Bug: 111308141 Test: modified lmkd to use PSI and tested using lmkd_unit_test Change-Id: I9ca5cbacd1e204a742e5616e6e101339d8719cdf Signed-off-by: Suren Baghdasaryan <surenb@google.com>
* UPSTREAM: kernfs: fix locking around kernfs_ops->release() callbackTejun Heo2022-10-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | The release callback may be called from two places - file release operation and kernfs open file draining. kernfs_open_file->mutex is used to synchronize the two callsites. This unfortunately leads to possible circular locking because of->mutex is used to protect the usual kernfs operations which may use locking constructs which are held while removing and thus draining kernfs files. @of->mutex is for synchronizing concurrent kernfs access operations and all we need here is synchronization between the releaes and drain paths. As the drain path has to grab kernfs_open_file_mutex anyway, let's use the mutex to synchronize the release operation instead. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-and-tested-by: Tony Lindgren <tony@atomide.com> Fixes: 0e67db2f9fe9 ("kernfs: add kernfs_ops->open/release() callbacks") Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> (cherry picked from commit f83f3c515654474e19c7fc86e3b06564bb5cb4d4) Bug: 111308141 Test: modified lmkd to use PSI and tested using lmkd_unit_test Change-Id: I75253c2aa8924987e9342d94e8bae445d6c8f5be Signed-off-by: Suren Baghdasaryan <surenb@google.com>
* UPSTREAM: cgroup, bpf: remove unnecessary #includeAlexei Starovoitov2022-10-28
| | | | | | | | | | | | | | | | | | this #include is unnecessary and brings whole set of other headers into cgroup-defs.h. Remove it. Fixes: 3007098494be ("cgroup: add support for eBPF programs") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Rami Rosen <roszenrami@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Daniel Mack <daniel@zonque.org> Signed-off-by: David S. Miller <davem@davemloft.net> Fixes: Change-Id: I3df35d8d3b1261503f9b5bcd90b18c9358f1ac28        ("cgroup: add support for eBPF programs") (cherry picked from commit b634d30a79ecc2d28e61cbe5b1f4443952f37a8f) Signed-off-by: Amit Pundir <amit.pundir@linaro.org> Change-Id: Ie35435947ea16421a538815bdc4953fd67407de6
* kernfs: kernfs_sop_show_path: don't return 0 after seq_dentry callSerge E. Hallyn2022-10-28
| | | | | | | | | | | | | | | | | Our caller expects 0 on success, not >0. This fixes a bug in the patch cgroup, kernfs: make mountinfo show properly scoped path for cgroup namespaces where /sys does not show up in mountinfo, breaking criu. Thanks for catching this, Andrei. Reported-by: Andrei Vagin <avagin@gmail.com> Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com> Signed-off-by: Tejun Heo <tj@kernel.org> Change-Id: I3cf5886bf7a77a943a6540c4b224dd0ca805dca6
* cgroup: Make rebind_subsystems() disable v2 controllers all at onceWaiman Long2022-10-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 7ee285395b211cad474b2b989db52666e0430daf ] It was found that the following warning was displayed when remounting controllers from cgroup v2 to v1: [ 8042.997778] WARNING: CPU: 88 PID: 80682 at kernel/cgroup/cgroup.c:3130 cgroup_apply_control_disable+0x158/0x190 : [ 8043.091109] RIP: 0010:cgroup_apply_control_disable+0x158/0x190 [ 8043.096946] Code: ff f6 45 54 01 74 39 48 8d 7d 10 48 c7 c6 e0 46 5a a4 e8 7b 67 33 00 e9 41 ff ff ff 49 8b 84 24 e8 01 00 00 0f b7 40 08 eb 95 <0f> 0b e9 5f ff ff ff 48 83 c4 08 5b 5d 41 5c 41 5d 41 5e 41 5f c3 [ 8043.115692] RSP: 0018:ffffba8a47c23d28 EFLAGS: 00010202 [ 8043.120916] RAX: 0000000000000036 RBX: ffffffffa624ce40 RCX: 000000000000181a [ 8043.128047] RDX: ffffffffa63c43e0 RSI: ffffffffa63c43e0 RDI: ffff9d7284ee1000 [ 8043.135180] RBP: ffff9d72874c5800 R08: ffffffffa624b090 R09: 0000000000000004 [ 8043.142314] R10: ffffffffa624b080 R11: 0000000000002000 R12: ffff9d7284ee1000 [ 8043.149447] R13: ffff9d7284ee1000 R14: ffffffffa624ce70 R15: ffffffffa6269e20 [ 8043.156576] FS: 00007f7747cff740(0000) GS:ffff9d7a5fc00000(0000) knlGS:0000000000000000 [ 8043.164663] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 8043.170409] CR2: 00007f7747e96680 CR3: 0000000887d60001 CR4: 00000000007706e0 [ 8043.177539] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 8043.184673] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 8043.191804] PKRU: 55555554 [ 8043.194517] Call Trace: [ 8043.196970] rebind_subsystems+0x18c/0x470 [ 8043.201070] cgroup_setup_root+0x16c/0x2f0 [ 8043.205177] cgroup1_root_to_use+0x204/0x2a0 [ 8043.209456] cgroup1_get_tree+0x3e/0x120 [ 8043.213384] vfs_get_tree+0x22/0xb0 [ 8043.216883] do_new_mount+0x176/0x2d0 [ 8043.220550] __x64_sys_mount+0x103/0x140 [ 8043.224474] do_syscall_64+0x38/0x90 [ 8043.228063] entry_SYSCALL_64_after_hwframe+0x44/0xae It was caused by the fact that rebind_subsystem() disables controllers to be rebound one by one. If more than one disabled controllers are originally from the default hierarchy, it means that cgroup_apply_control_disable() will be called multiple times for the same default hierarchy. A controller may be killed by css_kill() in the first round. In the second round, the killed controller may not be completely dead yet leading to the warning. To avoid this problem, we collect all the ssid's of controllers that needed to be disabled from the default hierarchy and then disable them in one go instead of one by one. Fixes: 334c3679ec4b ("cgroup: reimplement rebind_subsystems() using cgroup_apply_control() and friends") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org> Change-Id: I62fb64dec4392451fd649d6bdbb8e409858d9513
* cgroup: fix sock_cgroup_data initialization on earlier compilersTejun Heo2022-10-28
| | | | | | | | | | | | | | | | | | | | | | | | | | sock_cgroup_data is a struct containing an anonymous union. sock_cgroup_set_prioidx() and sock_cgroup_set_classid() were initializing a field inside the anonymous union as follows. struct sock_ccgroup_data skcd_buf = { .val = VAL }; While this is fine on more recent compilers, gcc-4.4.7 triggers the following errors. include/linux/cgroup-defs.h: In function ‘sock_cgroup_set_prioidx’: include/linux/cgroup-defs.h:619: error: unknown field ‘val’ specified in initializer include/linux/cgroup-defs.h:619: warning: missing braces around initializer include/linux/cgroup-defs.h:619: warning: (near initialization for ‘skcd_buf.<anonymous>’) This is because .val belongs to the anonymous union nested inside the struct but the initializer is missing the nesting. Fix it by adding an extra pair of braces. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Alaa Hleihel <alaa@dev.mellanox.co.il> Fixes: bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup") Signed-off-by: David S. Miller <davem@davemloft.net> Change-Id: I343edb9b0ffe0cf6836f911b88d031c32c541228
* samples/bpf: fix bpf_perf_event_output prototypeAdam Barth2022-10-28
| | | | | | | | | | | | The commit 555c8a8623a3 ("bpf: avoid stack copy and use skb ctx for event output") started using 20 of initially reserved upper 32-bits of 'flags' argument in bpf_perf_event_output(). Adjust corresponding prototype in samples/bpf/bpf_helpers.h Signed-off-by: Adam Barth <arb@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net> Change-Id: Id88405dfd9b4e539aa7b3724c8b5aa1bfafb534f
* net: gso: Fix skb_segment splat when splitting gso_size mangled skb having ↵Shmulik Ladkani2022-10-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | linear-headed frag_list [ Upstream commit 3dcbdb134f329842a38f0e6797191b885ab00a00 ] Historically, support for frag_list packets entering skb_segment() was limited to frag_list members terminating on exact same gso_size boundaries. This is verified with a BUG_ON since commit 89319d3801d1 ("net: Add frag_list support to skb_segment"), quote: As such we require all frag_list members terminate on exact MSS boundaries. This is checked using BUG_ON. As there should only be one producer in the kernel of such packets, namely GRO, this requirement should not be difficult to maintain. However, since commit 6578171a7ff0 ("bpf: add bpf_skb_change_proto helper"), the "exact MSS boundaries" assumption no longer holds: An eBPF program using bpf_skb_change_proto() DOES modify 'gso_size', but leaves the frag_list members as originally merged by GRO with the original 'gso_size'. Example of such programs are bpf-based NAT46 or NAT64. This lead to a kernel BUG_ON for flows involving: - GRO generating a frag_list skb - bpf program performing bpf_skb_change_proto() or bpf_skb_adjust_room() - skb_segment() of the skb See example BUG_ON reports in [0]. In commit 13acc94eff12 ("net: permit skb_segment on head_frag frag_list skb"), skb_segment() was modified to support the "gso_size mangling" case of a frag_list GRO'ed skb, but *only* for frag_list members having head_frag==true (having a page-fragment head). Alas, GRO packets having frag_list members with a linear kmalloced head (head_frag==false) still hit the BUG_ON. This commit adds support to skb_segment() for a 'head_skb' packet having a frag_list whose members are *non* head_frag, with gso_size mangled, by disabling SG and thus falling-back to copying the data from the given 'head_skb' into the generated segmented skbs - as suggested by Willem de Bruijn [1]. Since this approach involves the penalty of skb_copy_and_csum_bits() when building the segments, care was taken in order to enable this solution only when required: - untrusted gso_size, by testing SKB_GSO_DODGY is set (SKB_GSO_DODGY is set by any gso_size mangling functions in net/core/filter.c) - the frag_list is non empty, its item is a non head_frag, *and* the headlen of the given 'head_skb' does not match the gso_size. [0] https://lore.kernel.org/netdev/20190826170724.25ff616f@pixies/ https://lore.kernel.org/netdev/9265b93f-253d-6b8c-f2b8-4b54eff1835c@fb.com/ [1] https://lore.kernel.org/netdev/CA+FuTSfVsgNDi7c=GUU8nMg2hWxF2SjCNLXetHeVPdnxAW5K-w@mail.gmail.com/ Fixes: 6578171a7ff0 ("bpf: add bpf_skb_change_proto helper") Suggested-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Alexander Duyck <alexander.duyck@gmail.com> Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Change-Id: I8451163a3d6e010b73b628aa4606bf2c1ac98f38
* sk_buff: allow segmenting based on frag sizesMarcelo Ricardo Leitner2022-10-28
| | | | | | | | | | This patch allows segmenting a skb based on its frags sizes instead of based on a fixed value. Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Tested-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Change-Id: Ic278a7f5bbe0f86ef348a63508da6819b0d098aa
* ip_tunnel, bpf: ip_tunnel_info_opts_{get, set} depends on CONFIG_INETDaniel Borkmann2022-10-28
| | | | | | | | | | | | | | | | | | | Helpers like ip_tunnel_info_opts_{get,set}() are only available if CONFIG_INET is set, thus add an empty definition into the header for the !CONFIG_INET case, where already other empty inline helpers are defined. This avoids ifdef kludge inside filter.c, but also vxlan and geneve themself where this facility can only be used with, depend on INET being set. For the !INET case TUNNEL_OPTIONS_PRESENT would never be set in flags. Fixes: 14ca0751c96f ("bpf: support for access to tunnel options") Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Change-Id: I1ab8b99b70fcfb9cf3ab82b2999353316753ef24
* bpf: udp: ipv6: Avoid running reuseport's bpf_prog from __udp6_lib_errMartin KaFai Lau2022-10-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 4ac30c4b3659efac031818c418beb51e630d512d upstream. __udp6_lib_err() may be called when handling icmpv6 message. For example, the icmpv6 toobig(type=2). __udp6_lib_lookup() is then called which may call reuseport_select_sock(). reuseport_select_sock() will call into a bpf_prog (if there is one). reuseport_select_sock() is expecting the skb->data pointing to the transport header (udphdr in this case). For example, run_bpf_filter() is pulling the transport header. However, in the __udp6_lib_err() path, the skb->data is pointing to the ipv6hdr instead of the udphdr. One option is to pull and push the ipv6hdr in __udp6_lib_err(). Instead of doing this, this patch follows how the original commit 538950a1b752 ("soreuseport: setsockopt SO_ATTACH_REUSEPORT_[CE]BPF") was done in IPv4, which has passed a NULL skb pointer to reuseport_select_sock(). Fixes: 538950a1b752 ("soreuseport: setsockopt SO_ATTACH_REUSEPORT_[CE]BPF") Cc: Craig Gallek <kraig@google.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Song Liu <songliubraving@fb.com> Acked-by: Craig Gallek <kraig@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Change-Id: Idfd324dd88f969d844c49a36ca0472889c11a684
* soreuseport: add compat case for setsockopt SO_ATTACH_REUSEPORT_CBPFHelge Deller2022-10-28
| | | | | | | | | | Commit 538950a1b752 ("soreuseport: setsockopt SO_ATTACH_REUSEPORT_[CE]BPF") missed to add the compat case for the SO_ATTACH_REUSEPORT_CBPF option. Signed-off-by: Helge Deller <deller@gmx.de> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net> Change-Id: I395ccc50087cb150a47a4463c500d7a06302f1f6
* soreuseport: change consume_skb to kfree_skb in error caseCraig Gallek2022-10-28
| | | | | | | | | Fixes: 538950a1b752 ("soreuseport: setsockopt SO_ATTACH_REUSEPORT_[CE]BPF") Suggested-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Craig Gallek <kraig@google.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net> Change-Id: I8cb85168af62bee8590c0ede1681044d7c8acb24
* ipv6: Fix SO_REUSEPORT UDP socket with implicit sk_ipv6onlyMartin KaFai Lau2022-10-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 7ece54a60ee2ba7a386308cae73c790bd580589c ] If a sk_v6_rcv_saddr is !IPV6_ADDR_ANY and !IPV6_ADDR_MAPPED, it implicitly implies it is an ipv6only socket. However, in inet6_bind(), this addr_type checking and setting sk->sk_ipv6only to 1 are only done after sk->sk_prot->get_port(sk, snum) has been completed successfully. This inconsistency between sk_v6_rcv_saddr and sk_ipv6only confuses the 'get_port()'. In particular, when binding SO_REUSEPORT UDP sockets, udp_reuseport_add_sock(sk,...) is called. udp_reuseport_add_sock() checks "ipv6_only_sock(sk2) == ipv6_only_sock(sk)" before adding sk to sk2->sk_reuseport_cb. In this case, ipv6_only_sock(sk2) could be 1 while ipv6_only_sock(sk) is still 0 here. The end result is, reuseport_alloc(sk) is called instead of adding sk to the existing sk2->sk_reuseport_cb. It can be reproduced by binding two SO_REUSEPORT UDP sockets on an IPv6 address (!ANY and !MAPPED). Only one of the socket will receive packet. The fix is to set the implicit sk_ipv6only before calling get_port(). The original sk_ipv6only has to be saved such that it can be restored in case get_port() failed. The situation is similar to the inet_reset_saddr(sk) after get_port() has failed. Thanks to Calvin Owens <calvinowens@fb.com> who created an easy reproduction which leads to a fix. Fixes: e32ea7e74727 ("soreuseport: fast reuseport UDP socket selection") Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Change-Id: If8cf2bec6d47a27e0502a8a8392105863c34dab3
* soreuseport: fix ordering for mixed v4/v6 socketsCraig Gallek2022-10-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With the SO_REUSEPORT socket option, it is possible to create sockets in the AF_INET and AF_INET6 domains which are bound to the same IPv4 address. This is only possible with SO_REUSEPORT and when not using IPV6_V6ONLY on the AF_INET6 sockets. Prior to the commits referenced below, an incoming IPv4 packet would always be routed to a socket of type AF_INET when this mixed-mode was used. After those changes, the same packet would be routed to the most recently bound socket (if this happened to be an AF_INET6 socket, it would have an IPv4 mapped IPv6 address). The change in behavior occurred because the recent SO_REUSEPORT optimizations short-circuit the socket scoring logic as soon as they find a match. They did not take into account the scoring logic that favors AF_INET sockets over AF_INET6 sockets in the event of a tie. To fix this problem, this patch changes the insertion order of AF_INET and AF_INET6 addresses in the TCP and UDP socket lists when the sockets have SO_REUSEPORT set. AF_INET sockets will be inserted at the head of the list and AF_INET6 sockets with SO_REUSEPORT set will always be inserted at the tail of the list. This will force AF_INET sockets to always be considered first. Fixes: e32ea7e74727 ("soreuseport: fast reuseport UDP socket selection") Fixes: 125e80b88687 ("soreuseport: fast reuseport TCP socket selection") Reported-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: Craig Gallek <kraig@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Change-Id: I372800560dce28adb9bb0459b65f9a804c2b2cdc
* soreuseport: fix NULL ptr dereference SO_REUSEPORT after bindCraig Gallek2022-10-28
| | | | | | | | | | | | | | | Marc Dionne discovered a NULL pointer dereference when setting SO_REUSEPORT on a socket after it is bound. This patch removes the assumption that at least one socket in the reuseport group is bound with the SO_REUSEPORT option before other bind calls occur. Fixes: e32ea7e74727 ("soreuseport: fast reuseport UDP socket selection") Reported-by: Marc Dionne <marc.c.dionne@gmail.com> Signed-off-by: Craig Gallek <kraig@google.com> Tested-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David S. Miller <davem@davemloft.net> Change-Id: Ie9c2cb3dfae4bf97e3e83756c09a3ed607bcf6f7
* bpf: do not blindly change rlimit in reuseport net selftestEric Dumazet2022-10-28
| | | | | | | | | | | | | | | | [ Upstream commit 262f9d811c7608f1e74258ceecfe1fa213bdf912 ] If the current process has unlimited RLIMIT_MEMLOCK, we should should leave it as is. Fixes: 941ff6f11c02 ("bpf: fix rlimit in reuseport net selftest") Signed-off-by: John Sperbeck <jsperbeck@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Change-Id: I1c54599b3674855909e3888b470f9a75f7f239aa
* bpf: fix rlimit in reuseport net selftestDaniel Borkmann2022-10-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 941ff6f11c020913f5cddf543a9ec63475d7c082 ] Fix two issues in the reuseport_bpf selftests that were reported by Linaro CI: [...] + ./reuseport_bpf ---- IPv4 UDP ---- Testing EBPF mod 10... Reprograming, testing mod 5... ./reuseport_bpf: ebpf error. log: 0: (bf) r6 = r1 1: (20) r0 = *(u32 *)skb[0] 2: (97) r0 %= 10 3: (95) exit processed 4 insns : Operation not permitted + echo FAIL [...] ---- IPv4 TCP ---- Testing EBPF mod 10... ./reuseport_bpf: failed to bind send socket: Address already in use + echo FAIL [...] For the former adjust rlimit since this was the cause of failure for loading the BPF prog, and for the latter add SO_REUSEADDR. Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Link: https://bugs.linaro.org/show_bug.cgi?id=3502 Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Change-Id: I53d6561e6cf326638412508504ff67ca882eaf37
* soreuseport: Fix reuseport_bpf testcase on 32bit architecturesHelge Deller2022-10-28
| | | | | | | | | | | | This fixes the following compiler warnings when compiling the reuseport_bpf testcase on a 32 bit platform: reuseport_bpf.c: In function ‘attach_ebpf’: reuseport_bpf.c:114:15: warning: cast from pointer to integer of ifferent size [-Wpointer-to-int-cast] Signed-off-by: Helge Deller <deller@gmx.de> Signed-off-by: David S. Miller <davem@davemloft.net> Change-Id: I9c907a05fcaf5d63bba82a937d60e1419e722afb
* udp: fix potential infinite loop in SO_REUSEPORT logicEric Dumazet2022-10-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Using a combination of connected and un-connected sockets, Dmitry was able to trigger soft lockups with his fuzzer. The problem is that sockets in the SO_REUSEPORT array might have different scores. Right after sk2=socket(), setsockopt(sk2,...,SO_REUSEPORT, on) and bind(sk2, ...), but _before_ the connect(sk2) is done, sk2 is added into the soreuseport array, with a score which is smaller than the score of first socket sk1 found in hash table (I am speaking of the regular UDP hash table), if sk1 had the connect() done, giving a +8 to its score. hash bucket [X] -> sk1 -> sk2 -> NULL sk1 score = 14 (because it did a connect()) sk2 score = 6 SO_REUSEPORT fast selection is an optimization. If it turns out the score of the selected socket does not match score of first socket, just fallback to old SO_REUSEPORT logic instead of trying to be too smart. Normal SO_REUSEPORT users do not mix different kind of sockets, as this mechanism is used for load balance traffic. Fixes: e32ea7e74727 ("soreuseport: fast reuseport UDP socket selection") Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Craig Gallek <kraigatgoog@gmail.com> Acked-by: Craig Gallek <kraig@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Change-Id: I0b9c6d37496caa0891a79a55963a29c1b79c88bc
* soreuseport: BPF selection functional test for TCPCraig Gallek2022-10-28
| | | | | | | | | | | | | | | Unfortunately the existing test relied on packet payload in order to map incoming packets to sockets. In order to get this to work with TCP, TCP_FASTOPEN needed to be used. Since the fast open path is slightly different than the standard TCP path, I created a second test which sends to reuseport group members based on receiving cpu core id. This will probably serve as a better real-world example use as well. Signed-off-by: Craig Gallek <kraig@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Change-Id: I1f51b4b131ded50d80046c7b1b0822fa273bae68
* soreuseport: pass skb to secondary UDP socket lookupCraig Gallek2022-10-28
| | | | | | | | | | | | | | | | | | | | This socket-lookup path did not pass along the skb in question in my original BPF-based socket selection patch. The skb in the udpN_lib_lookup2 path can be used for BPF-based socket selection just like it is in the 'traditional' udpN_lib_lookup path. udpN_lib_lookup2 kicks in when there are greater than 10 sockets in the same hlist slot. Coincidentally, I chose 10 sockets per reuseport group in my functional test, so the lookup2 path was not excersised. This adds an additional set of tests with 20 sockets. Fixes: 538950a1b752 ("soreuseport: setsockopt SO_ATTACH_REUSEPORT_[CE]BPF") Fixes: 3ca8e4029969 ("soreuseport: BPF selection functional test") Suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Craig Gallek <kraig@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Change-Id: I776c36c23fd6209e9521d9529c47c46667abf3e0
* soreuseport: BPF selection functional testCraig Gallek2022-10-28
| | | | | | | | | | | | | This program will build classic and extended BPF programs and validate the socket selection logic when used with SO_ATTACH_REUSEPORT_CBPF and SO_ATTACH_REUSEPORT_EBPF. It also validates the re-programing flow and several edge cases. Signed-off-by: Craig Gallek <kraig@google.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Change-Id: I72958d6a2a8ece3db2ef0dfc0e8ec33da31e965d
* soreuseport: fix mem leak in reuseport_add_sock()Eric Dumazet2022-10-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 4db428a7c9ab07e08783e0fcdc4ca0f555da0567 ] reuseport_add_sock() needs to deal with attaching a socket having its own sk_reuseport_cb, after a prior setsockopt(SO_ATTACH_REUSEPORT_?BPF) Without this fix, not only a WARN_ONCE() was issued, but we were also leaking memory. Thanks to sysbot and Eric Biggers for providing us nice C repros. ------------[ cut here ]------------ socket already in reuseport group WARNING: CPU: 0 PID: 3496 at net/core/sock_reuseport.c:119   reuseport_add_sock+0x742/0x9b0 net/core/sock_reuseport.c:117 Kernel panic - not syncing: panic_on_warn set ... CPU: 0 PID: 3496 Comm: syzkaller869503 Not tainted 4.15.0-rc6+ #245 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS   Google 01/01/2011 Call Trace:   __dump_stack lib/dump_stack.c:17 [inline]   dump_stack+0x194/0x257 lib/dump_stack.c:53   panic+0x1e4/0x41c kernel/panic.c:183   __warn+0x1dc/0x200 kernel/panic.c:547   report_bug+0x211/0x2d0 lib/bug.c:184   fixup_bug.part.11+0x37/0x80 arch/x86/kernel/traps.c:178   fixup_bug arch/x86/kernel/traps.c:247 [inline]   do_error_trap+0x2d7/0x3e0 arch/x86/kernel/traps.c:296   do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315   invalid_op+0x22/0x40 arch/x86/entry/entry_64.S:1079 Fixes: ef456144da8e ("soreuseport: define reuseport groups") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot+c0ea2226f77a42936bf7@syzkaller.appspotmail.com Acked-by: Craig Gallek <kraig@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Change-Id: I1e50b157ec68e27ece69ef45a544e1901c15dc09
* bpf: fix missing header inclusionZi Shen Lim2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 0fc174dea545 ("ebpf: make internal bpf API independent of CONFIG_BPF_SYSCALL ifdefs") introduced usage of ERR_PTR() in bpf_prog_get(), however did not include linux/err.h. Without this patch, when compiling arm64 BPF without CONFIG_BPF_SYSCALL: ... In file included from arch/arm64/net/bpf_jit_comp.c:21:0: include/linux/bpf.h: In function 'bpf_prog_get': include/linux/bpf.h:235:9: error: implicit declaration of function 'ERR_PTR' [-Werror=implicit-function-declaration] return ERR_PTR(-EOPNOTSUPP); ^ include/linux/bpf.h:235:9: warning: return makes pointer from integer without a cast [-Wint-conversion] In file included from include/linux/rwsem.h:17:0, from include/linux/mm_types.h:10, from include/linux/sched.h:27, from arch/arm64/include/asm/compat.h:25, from arch/arm64/include/asm/stat.h:23, from include/linux/stat.h:5, from include/linux/compat.h:12, from include/linux/filter.h:10, from arch/arm64/net/bpf_jit_comp.c:22: include/linux/err.h: At top level: include/linux/err.h:23:35: error: conflicting types for 'ERR_PTR' static inline void * __must_check ERR_PTR(long error) ^ In file included from arch/arm64/net/bpf_jit_comp.c:21:0: include/linux/bpf.h:235:9: note: previous implicit declaration of 'ERR_PTR' was here return ERR_PTR(-EOPNOTSUPP); ^ ... Fixes: 0fc174dea545 ("ebpf: make internal bpf API independent of CONFIG_BPF_SYSCALL ifdefs") Suggested-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Zi Shen Lim <zlim.lnx@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net> Change-Id: I5b6a63374d8d137f7d1b2e79cf91178c988a68fc
* UPSTREAM: security: selinux: allow per-file labeling for bpffsConnor O'Brien2022-04-19
| | | | | | | | | | | | | | | | Add support for genfscon per-file labeling of bpffs files. This allows for separate permissions for different pinned bpf objects, which may be completely unrelated to each other. Signed-off-by: Connor O'Brien <connoro@google.com> Signed-off-by: Steven Moreland <smoreland@google.com> Acked-by: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: Paul Moore <paul@paul-moore.com> (cherry picked from commit 4ca54d3d3022ce27170b50e4bdecc3a42f05dbdc) [which is v5.6-rc1-10-g4ca54d3d3022 and thus already included in 5.10] Bug: 200440527 Change-Id: I8234b9047f29981b8140bd81bb2ff070b3b0b843 (cherry picked from commit d52ac987ad2ae16ff313d7fb6185bc412cb221a4)
* BACKPORT: ANDROID: Remove xt_qtaguid module from new kernels.Chenbo Feng2022-04-19
| | | | | | | | | | | | For new devices ship with 4.9 kernel, the eBPF replacement should cover all the functionalities of xt_qtaguid and it is safe now to remove this android only module from the kernel. Signed-off-by: Chenbo Feng <fengc@google.com> Signed-off-by: Bruno Martins <bgcngm@gmail.com> Bug: 79938294 Test: kernel build Change-Id: I032aecc048f7349f6a0c5192dd381f286fc7e5bf
* wireguard: compat: udp_tunnel: Account for kernel 4.4 BPF backportBruno Martins2022-04-19
| | | | | | | Function 'udp_tunnel6_xmit_skb' had been redefined in commit 4d5805d. Compat hack must be removed to fix compilation issue. Change-Id: I155b1c45ef57ca2be4fb3f005a5df174fc9041b9
* selinux: distinguish non-init user namespace capability checksStephen Smalley2022-04-19
| | | | | | | | | | | | | | | Distinguish capability checks against a target associated with the init user namespace versus capability checks against a target associated with a non-init user namespace by defining and using separate security classes for the latter. This is needed to support e.g. Chrome usage of user namespaces for the Chrome sandbox without needing to allow Chrome to also exercise capabilities on targets in the init user namespace. Suggested-by: Dan Walsh <dwalsh@redhat.com> Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: Paul Moore <paul@paul-moore.com>
* selinux: check ss_initialized before revalidating an inode labelPaul Moore2022-04-19
| | | | | | | There is no point in trying to revalidate an inode's security label if the security server is not yet initialized. Signed-off-by: Paul Moore <paul@paul-moore.com>
* selinux: delay inode label lookup as long as possiblePaul Moore2022-04-19
| | | | | | | Since looking up an inode's label can result in revalidation, delay the lookup as long as possible to limit the performance impact. Signed-off-by: Paul Moore <paul@paul-moore.com>
* selinux: don't revalidate an inode's label when explicitly setting itPaul Moore2022-04-19
| | | | | | | | There is no point in attempting to revalidate an inode's security label when we are in the process of setting it. Reported-by: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: Paul Moore <paul@paul-moore.com>
* selinux: simply inode label states to INVALID and INITIALIZEDPaul Moore2022-04-19
| | | | | | | | | There really is no need for LABEL_MISSING as we really only care if the inode's label is INVALID or INITIALIZED. Also adjust the revalidate code to reload the label whenever the label is not INITIALIZED so we are less sensitive to label state in the future. Signed-off-by: Paul Moore <paul@paul-moore.com>
* gfs2: Invalid security labels of inodes when they go invalidAndreas Gruenbacher2022-04-19
| | | | | | | | | | | | | | | When gfs2 releases the glock of an inode, it must invalidate all information cached for that inode, including the page cache and acls. Use the new security_inode_invalidate_secctx hook to also invalidate security labels in that case. These items will be reread from disk when needed after reacquiring the glock. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Acked-by: Bob Peterson <rpeterso@redhat.com> Acked-by: Steven Whitehouse <swhiteho@redhat.com> Cc: cluster-devel@redhat.com [PM: fixed spelling errors and description line lengths] Signed-off-by: Paul Moore <pmoore@redhat.com>
* selinux: Inode label revalidation performance fixAndreas Gruenbacher2022-04-19
| | | | | | | | | | | | | Commit 5d226df4 has introduced a performance regression of about 10% in the UnixBench pipe benchmark. It turns out that the call to inode_security in selinux_file_permission can be moved below the zero-mask test and that inode_security_revalidate can be removed entirely, which brings us back to roughly the original performance. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Acked-by: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: Paul Moore <paul@paul-moore.com>
* selinux: Revalidate invalid inode security labelsAndreas Gruenbacher2022-04-19
| | | | | | | | | | | | | When fetching an inode's security label, check if it is still valid, and try reloading it if it is not. Reloading will fail when we are in RCU context which doesn't allow sleeping, or when we can't find a dentry for the inode. (Reloading happens via iop->getxattr which takes a dentry parameter.) When reloading fails, continue using the old, invalid label. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Acked-by: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: Paul Moore <pmoore@redhat.com>
* security: Add hook to invalidate inode security labelsAndreas Gruenbacher2022-04-19
| | | | | | | | | | | | | Add a hook to invalidate an inode's security label when the cached information becomes invalid. Add the new hook in selinux: set a flag when a security label becomes invalid. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: James Morris <james.l.morris@oracle.com> Acked-by: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: Paul Moore <pmoore@redhat.com>
* selinux: Add accessor functions for inode->i_securityAndreas Gruenbacher2022-04-19
| | | | | | | | | | Add functions dentry_security and inode_security for accessing inode->i_security. These functions initially don't do much, but they will later be used to revalidate the security labels when necessary. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Acked-by: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: Paul Moore <pmoore@redhat.com>