summaryrefslogtreecommitdiff
path: root/include/linux (follow)
Commit message (Collapse)AuthorAge
...
* bpf: refactor bpf_prog_get and type check into helperDaniel Borkmann2022-04-19
| | | | | | | | | | | | | | | | | Since bpf_prog_get() and program type check is used in a couple of places, refactor this into a small helper function that we can make use of. Since the non RO prog->aux part is not used in performance critical paths and a program destruction via RCU is rather very unlikley when doing the put, we shouldn't have an issue just doing the bpf_prog_get() + prog->type != type check, but actually not taking the ref at all (due to being in fdget() / fdput() section of the bpf fd) is even cleaner and makes the diff smaller as well, so just go for that. Callsites are changed to make use of the new helper where possible. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* bpf: generally move prog destruction to RCU deferralDaniel Borkmann2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Jann Horn reported following analysis that could potentially result in a very hard to trigger (if not impossible) UAF race, to quote his event timeline: - Set up a process with threads T1, T2 and T3 - Let T1 set up a socket filter F1 that invokes another filter F2 through a BPF map [tail call] - Let T1 trigger the socket filter via a unix domain socket write, don't wait for completion - Let T2 call PERF_EVENT_IOC_SET_BPF with F2, don't wait for completion - Now T2 should be behind bpf_prog_get(), but before bpf_prog_put() - Let T3 close the file descriptor for F2, dropping the reference count of F2 to 2 - At this point, T1 should have looked up F2 from the map, but not finished executing it - Let T3 remove F2 from the BPF map, dropping the reference count of F2 to 1 - Now T2 should call bpf_prog_put() (wrong BPF program type), dropping the reference count of F2 to 0 and scheduling bpf_prog_free_deferred() via schedule_work() - At this point, the BPF program could be freed - BPF execution is still running in a freed BPF program While at PERF_EVENT_IOC_SET_BPF time it's only guaranteed that the perf event fd we're doing the syscall on doesn't disappear from underneath us for whole syscall time, it may not be the case for the bpf fd used as an argument only after we did the put. It needs to be a valid fd pointing to a BPF program at the time of the call to make the bpf_prog_get() and while T2 gets preempted, F2 must have dropped reference to 1 on the other CPU. The fput() from the close() in T3 should also add additionally delay to the reference drop via exit_task_work() when bpf_prog_release() gets called as well as scheduling bpf_prog_free_deferred(). That said, it makes nevertheless sense to move the BPF prog destruction generally after RCU grace period to guarantee that such scenario above, but also others as recently fixed in ceb56070359b ("bpf, perf: delay release of BPF prog after grace period") with regards to tail calls won't happen. Integrating bpf_prog_free_deferred() directly into the RCU callback is not allowed since the invocation might happen from either softirq or process context, so we're not permitted to block. Reviewing all bpf_prog_put() invocations from eBPF side (note, cBPF -> eBPF progs don't use this for their destruction) with call_rcu() look good to me. Since we don't know whether at the time of attaching the program, we're already part of a tail call map, we need to use RCU variant. However, due to this, there won't be severely more stress on the RCU callback queue: situations with above bpf_prog_get() and bpf_prog_put() combo in practice normally won't lead to releases, but even if they would, enough effort/ cycles have to be put into loading a BPF program into the kernel already. Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* bpf, maps: flush own entries on perf map releaseDaniel Borkmann2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The behavior of perf event arrays are quite different from all others as they are tightly coupled to perf event fds, f.e. shown recently by commit e03e7ee34fdd ("perf/bpf: Convert perf_event_array to use struct file") to make refcounting on perf event more robust. A remaining issue that the current code still has is that since additions to the perf event array take a reference on the struct file via perf_event_get() and are only released via fput() (that cleans up the perf event eventually via perf_event_release_kernel()) when the element is either manually removed from the map from user space or automatically when the last reference on the perf event map is dropped. However, this leads us to dangling struct file's when the map gets pinned after the application owning the perf event descriptor exits, and since the struct file reference will in such case only be manually dropped or via pinned file removal, it leads to the perf event living longer than necessary, consuming needlessly resources for that time. Relations between perf event fds and bpf perf event map fds can be rather complex. F.e. maps can act as demuxers among different perf event fds that can possibly be owned by different threads and based on the index selection from the program, events get dispatched to one of the per-cpu fd endpoints. One perf event fd (or, rather a per-cpu set of them) can also live in multiple perf event maps at the same time, listening for events. Also, another requirement is that perf event fds can get closed from application side after they have been attached to the perf event map, so that on exit perf event map will take care of dropping their references eventually. Likewise, when such maps are pinned, the intended behavior is that a user application does bpf_obj_get(), puts its fds in there and on exit when fd is released, they are dropped from the map again, so the map acts rather as connector endpoint. This also makes perf event maps inherently different from program arrays as described in more detail in commit c9da161c6517 ("bpf: fix clearing on persistent program array maps"). To tackle this, map entries are marked by the map struct file that added the element to the map. And when the last reference to that map struct file is released from user space, then the tracked entries are purged from the map. This is okay, because new map struct files instances resp. frontends to the anon inode are provided via bpf_map_new_fd() that is called when we invoke bpf_obj_get_user() for retrieving a pinned map, but also when an initial instance is created via map_create(). The rest is resolved by the vfs layer automatically for us by keeping reference count on the map's struct file. Any concurrent updates on the map slot are fine as well, it just means that perf_event_fd_array_release() needs to delete less of its own entires. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* bpf, maps: extend map_fd_get_ptr argumentsDaniel Borkmann2022-04-19
| | | | | | | | | | | | | | This patch extends map_fd_get_ptr() callback that is used by fd array maps, so that struct file pointer from the related map can be passed in. It's safe to remove map_update_elem() callback for the two maps since this is only allowed from syscall side, but not from eBPF programs for these two map types. Like in per-cpu map case, bpf_fd_array_map_update_elem() needs to be called directly here due to the extra argument. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* bpf, maps: add release callbackDaniel Borkmann2022-04-19
| | | | | | | | | | | Add a release callback for maps that is invoked when the last reference to its struct file is gone and the struct file about to be released by vfs. The handler will be used by fd array maps. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* bpf: fix matching of data/data_end in verifierAlexei Starovoitov2022-04-19
| | | | | | | | | | | | | | | | | The ctx structure passed into bpf programs is different depending on bpf program type. The verifier incorrectly marked ctx->data and ctx->data_end access based on ctx offset only. That caused loads in tracing programs int bpf_prog(struct pt_regs *ctx) { .. ctx->ax .. } to be incorrectly marked as PTR_TO_PACKET which later caused verifier to reject the program that was actually valid in tracing context. Fix this by doing program type specific matching of ctx offsets. Fixes: 969bf05eb3ce ("bpf: direct packet access") Reported-by: Sasha Goldshtein <goldshtn@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* perf core: Per event callchain limitArnaldo Carvalho de Melo2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Additionally to being able to control the system wide maximum depth via /proc/sys/kernel/perf_event_max_stack, now we are able to ask for different depths per event, using perf_event_attr.sample_max_stack for that. This uses an u16 hole at the end of perf_event_attr, that, when perf_event_attr.sample_type has the PERF_SAMPLE_CALLCHAIN, if sample_max_stack is zero, means use perf_event_max_stack, otherwise it'll be bounds checked under callchain_mutex. Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Brendan Gregg <brendan.d.gregg@gmail.com> Cc: David Ahern <dsahern@gmail.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: He Kuang <hekuang@huawei.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Milian Wolff <milian.wolff@kdab.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: Wang Nan <wangnan0@huawei.com> Cc: Zefan Li <lizefan@huawei.com> Link: http://lkml.kernel.org/n/tip-kolmn1yo40p7jhswxwrc7rrd@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* perf core: Pass max stack as a perf_callchain_entry contextArnaldo Carvalho de Melo2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | This makes perf_callchain_{user,kernel}() receive the max stack as context for the perf_callchain_entry, instead of accessing the global sysctl_perf_event_max_stack. Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Brendan Gregg <brendan.d.gregg@gmail.com> Cc: David Ahern <dsahern@gmail.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: He Kuang <hekuang@huawei.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Milian Wolff <milian.wolff@kdab.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: Wang Nan <wangnan0@huawei.com> Cc: Zefan Li <lizefan@huawei.com> Link: http://lkml.kernel.org/n/tip-kolmn1yo40p7jhswxwrc7rrd@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* bpf: add generic constant blinding for use in jitsDaniel Borkmann2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This work adds a generic facility for use from eBPF JIT compilers that allows for further hardening of JIT generated images through blinding constants. In response to the original work on BPF JIT spraying published by Keegan McAllister [1], most BPF JITs were changed to make images read-only and start at a randomized offset in the page, where the rest was filled with trap instructions. We have this nowadays in x86, arm, arm64 and s390 JIT compilers. Additionally, later work also made eBPF interpreter images read only for kernels supporting DEBUG_SET_MODULE_RONX, that is, x86, arm, arm64 and s390 archs as well currently. This is done by default for mentioned JITs when JITing is enabled. Furthermore, we had a generic and configurable constant blinding facility on our todo for quite some time now to further make spraying harder, and first implementation since around netconf 2016. We found that for systems where untrusted users can load cBPF/eBPF code where JIT is enabled, start offset randomization helps a bit to make jumps into crafted payload harder, but in case where larger programs that cross page boundary are injected, we again have some part of the program opcodes at a page start offset. With improved guessing and more reliable payload injection, chances can increase to jump into such payload. Elena Reshetova recently wrote a test case for it [2, 3]. Moreover, eBPF comes with 64 bit constants, which can leave some more room for payloads. Note that for all this, additional bugs in the kernel are still required to make the jump (and of course to guess right, to not jump into a trap) and naturally the JIT must be enabled, which is disabled by default. For helping mitigation, the general idea is to provide an option bpf_jit_harden that admins can tweak along with bpf_jit_enable, so that for cases where JIT should be enabled for performance reasons, the generated image can be further hardened with blinding constants for unpriviledged users (bpf_jit_harden == 1), with trading off performance for these, but not for privileged ones. We also added the option of blinding for all users (bpf_jit_harden == 2), which is quite helpful for testing f.e. with test_bpf.ko. There are no further e.g. hardening levels of bpf_jit_harden switch intended, rationale is to have it dead simple to use as on/off. Since this functionality would need to be duplicated over and over for JIT compilers to use, which are already complex enough, we provide a generic eBPF byte-code level based blinding implementation, which is then just transparently JITed. JIT compilers need to make only a few changes to integrate this facility and can be migrated one by one. This option is for eBPF JITs and will be used in x86, arm64, s390 without too much effort, and soon ppc64 JITs, thus that native eBPF can be blinded as well as cBPF to eBPF migrations, so that both can be covered with a single implementation. The rule for JITs is that bpf_jit_blind_constants() must be called from bpf_int_jit_compile(), and in case blinding is disabled, we follow normally with JITing the passed program. In case blinding is enabled and we fail during the process of blinding itself, we must return with the interpreter. Similarly, in case the JITing process after the blinding failed, we return normally to the interpreter with the non-blinded code. Meaning, interpreter doesn't change in any way and operates on eBPF code as usual. For doing this pre-JIT blinding step, we need to make use of a helper/auxiliary register, here BPF_REG_AX. This is strictly internal to the JIT and not in any way part of the eBPF architecture. Just like in the same way as JITs internally make use of some helper registers when emitting code, only that here the helper register is one abstraction level higher in eBPF bytecode, but nevertheless in JIT phase. That helper register is needed since f.e. manually written program can issue loads to all registers of eBPF architecture. The core concept with the additional register is: blind out all 32 and 64 bit constants by converting BPF_K based instructions into a small sequence from K_VAL into ((RND ^ K_VAL) ^ RND). Therefore, this is transformed into: BPF_REG_AX := (RND ^ K_VAL), BPF_REG_AX ^= RND, and REG <OP> BPF_REG_AX, so actual operation on the target register is translated from BPF_K into BPF_X one that is operating on BPF_REG_AX's content. During rewriting phase when blinding, RND is newly generated via prandom_u32() for each processed instruction. 64 bit loads are split into two 32 bit loads to make translation and patching not too complex. Only basic thing required by JITs is to call the helper bpf_jit_blind_constants()/bpf_jit_prog_release_other() pair, and to map BPF_REG_AX into an unused register. Small bpf_jit_disasm extract from [2] when applied to x86 JIT: echo 0 > /proc/sys/net/core/bpf_jit_harden ffffffffa034f5e9 + <x>: [...] 39: mov $0xa8909090,%eax 3e: mov $0xa8909090,%eax 43: mov $0xa8ff3148,%eax 48: mov $0xa89081b4,%eax 4d: mov $0xa8900bb0,%eax 52: mov $0xa810e0c1,%eax 57: mov $0xa8908eb4,%eax 5c: mov $0xa89020b0,%eax [...] echo 1 > /proc/sys/net/core/bpf_jit_harden ffffffffa034f1e5 + <x>: [...] 39: mov $0xe1192563,%r10d 3f: xor $0x4989b5f3,%r10d 46: mov %r10d,%eax 49: mov $0xb8296d93,%r10d 4f: xor $0x10b9fd03,%r10d 56: mov %r10d,%eax 59: mov $0x8c381146,%r10d 5f: xor $0x24c7200e,%r10d 66: mov %r10d,%eax 69: mov $0xeb2a830e,%r10d 6f: xor $0x43ba02ba,%r10d 76: mov %r10d,%eax 79: mov $0xd9730af,%r10d 7f: xor $0xa5073b1f,%r10d 86: mov %r10d,%eax 89: mov $0x9a45662b,%r10d 8f: xor $0x325586ea,%r10d 96: mov %r10d,%eax [...] As can be seen, original constants that carry payload are hidden when enabled, actual operations are transformed from constant-based to register-based ones, making jumps into constants ineffective. Above extract/example uses single BPF load instruction over and over, but of course all instructions with constants are blinded. Performance wise, JIT with blinding performs a bit slower than just JIT and faster than interpreter case. This is expected, since we still get all the performance benefits from JITing and in normal use-cases not every single instruction needs to be blinded. Summing up all 296 test cases averaged over multiple runs from test_bpf.ko suite, interpreter was 55% slower than JIT only and JIT with blinding was 8% slower than JIT only. Since there are also some extremes in the test suite, I expect for ordinary workloads that the performance for the JIT with blinding case is even closer to JIT only case, f.e. nmap test case from suite has averaged timings in ns 29 (JIT), 35 (+ blinding), and 151 (interpreter). BPF test suite, seccomp test suite, eBPF sample code and various bigger networking eBPF programs have been tested with this and were running fine. For testing purposes, I also adapted interpreter and redirected blinded eBPF image to interpreter and also here all tests pass. [1] http://mainisusuallyafunction.blogspot.com/2012/11/attacking-hardened-linux-systems-with.html [2] https://github.com/01org/jit-spray-poc-for-ksp/ [3] http://www.openwall.com/lists/kernel-hardening/2016/05/03/5 Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Elena Reshetova <elena.reshetova@intel.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* bpf: move bpf_jit_enable declarationDaniel Borkmann2022-04-19
| | | | | | | | | | | Move the bpf_jit_enable declaration to the filter.h file where most other core code is declared, also since we're going to add a second knob there. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* bpf: prepare bpf_int_jit_compile/bpf_prog_select_runtime apisDaniel Borkmann2022-04-19
| | | | | | | | | | | | | | Since the blinding is strictly only called from inside eBPF JITs, we need to change signatures for bpf_int_jit_compile() and bpf_prog_select_runtime() first in order to prepare that the eBPF program we're dealing with can change underneath. Hence, for call sites, we need to return the latest prog. No functional change in this patch. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* bpf: add bpf_patch_insn_single helperDaniel Borkmann2022-04-19
| | | | | | | | | | | | Move the functionality to patch instructions out of the verifier code and into the core as the new bpf_patch_insn_single() helper will be needed later on for blinding as well. No changes in functionality. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* bpf: fix refcnt overflowAlexei Starovoitov2022-04-19
| | | | | | | | | | | | | | | On a system with >32Gbyte of phyiscal memory and infinite RLIMIT_MEMLOCK, the malicious application may overflow 32-bit bpf program refcnt. It's also possible to overflow map refcnt on 1Tb system. Impose 32k hard limit which means that the same bpf program or map cannot be shared by more than 32k processes. Fixes: 1be7f75d1668 ("bpf: enable non-root eBPF programs") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* perf core: Allow setting up max frame stack depth via sysctlArnaldo Carvalho de Melo2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The default remains 127, which is good for most cases, and not even hit most of the time, but then for some cases, as reported by Brendan, 1024+ deep frames are appearing on the radar for things like groovy, ruby. And in some workloads putting a _lower_ cap on this may make sense. One that is per event still needs to be put in place tho. The new file is: # cat /proc/sys/kernel/perf_event_max_stack 127 Chaging it: # echo 256 > /proc/sys/kernel/perf_event_max_stack # cat /proc/sys/kernel/perf_event_max_stack 256 But as soon as there is some event using callchains we get: # echo 512 > /proc/sys/kernel/perf_event_max_stack -bash: echo: write error: Device or resource busy # Because we only allocate the callchain percpu data structures when there is a user, which allows for changing the max easily, its just a matter of having no callchain users at that point. Reported-and-Tested-by: Brendan Gregg <brendan.d.gregg@gmail.com> Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: David Ahern <dsahern@gmail.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: He Kuang <hekuang@huawei.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Milian Wolff <milian.wolff@kdab.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: Wang Nan <wangnan0@huawei.com> Cc: Zefan Li <lizefan@huawei.com> Link: http://lkml.kernel.org/r/20160426002928.GB16708@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Chatur27 <jasonbright2709@gmail.com> Change-Id: Ic34ecdb4cc1e61257a2926062aa23c960dbd3b8f
* perf: generalize perf_callchainAlexei Starovoitov2022-04-19
| | | | | | | | | . avoid walking the stack when there is no room left in the buffer . generalize get_perf_callchain() to be called from bpf helper Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* bpf: add event output helper for notifications/sampling/loggingDaniel Borkmann2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds a new helper for cls/act programs that can push events to user space applications. For networking, this can be f.e. for sampling, debugging, logging purposes or pushing of arbitrary wake-up events. The idea is similar to a43eec304259 ("bpf: introduce bpf_perf_event_output() helper") and 39111695b1b8 ("samples: bpf: add bpf_perf_event_output example"). The eBPF program utilizes a perf event array map that user space populates with fds from perf_event_open(), the eBPF program calls into the helper f.e. as skb_event_output(skb, &my_map, BPF_F_CURRENT_CPU, raw, sizeof(raw)) so that the raw data is pushed into the fd f.e. at the map index of the current CPU. User space can poll/mmap/etc on this and has a data channel for receiving events that can be post-processed. The nice thing is that since the eBPF program and user space application making use of it are tightly coupled, they can define their own arbitrary raw data format and what/when they want to push. While f.e. packet headers could be one part of the meta data that is being pushed, this is not a substitute for things like packet sockets as whole packet is not being pushed and push is only done in a single direction. Intention is more of a generically usable, efficient event pipe to applications. Workflow is that tc can pin the map and applications can attach themselves e.g. after cls/act setup to one or multiple map slots, demuxing is done by the eBPF program. Adding this facility is with minimal effort, it reuses the helper introduced in a43eec304259 ("bpf: introduce bpf_perf_event_output() helper") and we get its functionality for free by overloading its BPF_FUNC_ identifier for cls/act programs, ctx is currently unused, but will be made use of in future. Example will be added to iproute2's BPF example files. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* tun: use socket locks for sk_{attach,detatch}_filterHannes Frederic Sowa2022-04-19
| | | | | | | | | | | | | | | This reverts commit 5a5abb1fa3b05dd ("tun, bpf: fix suspicious RCU usage in tun_{attach, detach}_filter") and replaces it to use lock_sock around sk_{attach,detach}_filter. The checks inside filter.c are updated with lockdep_sock_is_held to check for proper socket locks. It keeps the code cleaner by ensuring that only one lock governs the socket filter instead of two independent locks. Cc: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* bpf, verifier: add ARG_PTR_TO_RAW_STACK typeDaniel Borkmann2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When passing buffers from eBPF stack space into a helper function, we have ARG_PTR_TO_STACK argument type for helpers available. The verifier makes sure that such buffers are initialized, within boundaries, etc. However, the downside with this is that we have a couple of helper functions such as bpf_skb_load_bytes() that fill out the passed buffer in the expected success case anyway, so zero initializing them prior to the helper call is unneeded/wasted instructions in the eBPF program that can be avoided. Therefore, add a new helper function argument type called ARG_PTR_TO_RAW_STACK. The idea is to skip the STACK_MISC check in check_stack_boundary() and color the related stack slots as STACK_MISC after we checked all call arguments. Helper functions using ARG_PTR_TO_RAW_STACK must make sure that every path of the helper function will fill the provided buffer area, so that we cannot leak any uninitialized stack memory. This f.e. means that error paths need to memset() the buffers, but the expected fast-path doesn't have to do this anymore. Since there's no such helper needing more than at most one ARG_PTR_TO_RAW_STACK argument, we can keep it simple and don't need to check for multiple areas. Should in future such a use-case really appear, we have check_raw_mode() that will make sure we implement support for it first. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* bpf: sanitize bpf tracepoint accessAlexei Starovoitov2022-04-19
| | | | | | | | | | | | | during bpf program loading remember the last byte of ctx access and at the time of attaching the program to tracepoint check that the program doesn't access bytes beyond defined in tracepoint fields This also disallows access to __dynamic_array fields, but can be relaxed in the future. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* bpf: support bpf_get_stackid() and bpf_perf_event_output() in tracepoint ↵Alexei Starovoitov2022-04-19
| | | | | | | | | | | | programs needs two wrapper functions to fetch 'struct pt_regs *' to convert tracepoint bpf context into kprobe bpf context to reuse existing helper functions Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* perf: split perf_trace_buf_prepare into alloc and update partsAlexei Starovoitov2022-04-19
| | | | | | | | | | | | | split allows to move expensive update of 'struct trace_entry' to later phase. Repurpose unused 1st argument of perf_tp_event() to indicate event type. While splitting use temp variable 'rctx' instead of '*rctx' to avoid unnecessary loads done by the compiler due to -fno-strict-aliasing Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* bpf: convert stackmap to pre-allocationAlexei Starovoitov2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It was observed that calling bpf_get_stackid() from a kprobe inside slub or from spin_unlock causes similar deadlock as with hashmap, therefore convert stackmap to use pre-allocated memory. The call_rcu is no longer feasible mechanism, since delayed freeing causes bpf_get_stackid() to fail unpredictably when number of actual stacks is significantly less than user requested max_entries. Since elements are no longer freed into slub, we can push elements into freelist immediately and let them be recycled. However the very unlikley race between user space map_lookup() and program-side recycling is possible: cpu0 cpu1 ---- ---- user does lookup(stackidX) starts copying ips into buffer delete(stackidX) calls bpf_get_stackid() which recyles the element and overwrites with new stack trace To avoid user space seeing a partial stack trace consisting of two merged stack traces, do bucket = xchg(, NULL); copy; xchg(,bucket); to preserve consistent stack trace delivery to user space. Now we can move memset(,0) of left-over element value from critical path of bpf_get_stackid() into slow-path of user space lookup. Also disallow lookup() from bpf program, since it's useless and program shouldn't be messing with collected stack trace. Note that similar race between user space lookup and kernel side updates is also present in hashmap, but it's not a new race. bpf programs were always allowed to modify hash and array map elements while user space is copying them. Fixes: d5a3b1f69186 ("bpf: introduce BPF_MAP_TYPE_STACK_TRACE") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* bpf: pre-allocate hash map elementsAlexei Starovoitov2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If kprobe is placed on spin_unlock then calling kmalloc/kfree from bpf programs is not safe, since the following dead lock is possible: kfree->spin_lock(kmem_cache_node->lock)...spin_unlock->kprobe-> bpf_prog->map_update->kmalloc->spin_lock(of the same kmem_cache_node->lock) and deadlocks. The following solutions were considered and some implemented, but eventually discarded - kmem_cache_create for every map - add recursion check to slow-path of slub - use reserved memory in bpf_map_update for in_irq or in preempt_disabled - kmalloc via irq_work At the end pre-allocation of all map elements turned out to be the simplest solution and since the user is charged upfront for all the memory, such pre-allocation doesn't affect the user space visible behavior. Since it's impossible to tell whether kprobe is triggered in a safe location from kmalloc point of view, use pre-allocation by default and introduce new BPF_F_NO_PREALLOC flag. While testing of per-cpu hash maps it was discovered that alloc_percpu(GFP_ATOMIC) has odd corner cases and often fails to allocate memory even when 90% of it is free. The pre-allocation of per-cpu hash elements solves this problem as well. Turned out that bpf_map_update() quickly followed by bpf_map_lookup()+bpf_map_delete() is very common pattern used in many of iovisor/bcc/tools, so there is additional benefit of pre-allocation, since such use cases are must faster. Since all hash map elements are now pre-allocated we can remove atomic increment of htab->count and save few more cycles. Also add bpf_map_precharge_memlock() to check rlimit_memlock early to avoid large malloc/free done by users who don't have sufficient limits. Pre-allocation is done with vmalloc and alloc/free is done via percpu_freelist. Here are performance numbers for different pre-allocation algorithms that were implemented, but discarded in favor of percpu_freelist: 1 cpu: pcpu_ida 2.1M pcpu_ida nolock 2.3M bt 2.4M kmalloc 1.8M hlist+spinlock 2.3M pcpu_freelist 2.6M 4 cpu: pcpu_ida 1.5M pcpu_ida nolock 1.8M bt w/smp_align 1.7M bt no/smp_align 1.1M kmalloc 0.7M hlist+spinlock 0.2M pcpu_freelist 2.0M 8 cpu: pcpu_ida 0.7M bt w/smp_align 0.8M kmalloc 0.4M pcpu_freelist 1.5M 32 cpu: kmalloc 0.13M pcpu_freelist 0.49M pcpu_ida nolock is a modified percpu_ida algorithm without percpu_ida_cpu locks and without cross-cpu tag stealing. It's faster than existing percpu_ida, but not as fast as pcpu_freelist. bt is a variant of block/blk-mq-tag.c simlified and customized for bpf use case. bt w/smp_align is using cache line for every 'long' (similar to blk-mq-tag). bt no/smp_align allocates 'long' bitmasks continuously to save memory. It's comparable to percpu_ida and in some cases faster, but slower than percpu_freelist hlist+spinlock is the simplest free list with single spinlock. As expeceted it has very bad scaling in SMP. kmalloc is existing implementation which is still available via BPF_F_NO_PREALLOC flag. It's significantly slower in single cpu and in 8 cpu setup it's 3 times slower than pre-allocation with pcpu_freelist, but saves memory, so in cases where map->max_entries can be large and number of map update/delete per second is low, it may make sense to use it. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* bpf: prevent kprobe+bpf deadlocksAlexei Starovoitov2022-04-19
| | | | | | | | | | | | | | | | | if kprobe is placed within update or delete hash map helpers that hold bucket spin lock and triggered bpf program is trying to grab the spinlock for the same bucket on the same cpu, it will deadlock. Fix it by extending existing recursion prevention mechanism. Note, map_lookup and other tracing helpers don't have this problem, since they don't hold any locks and don't modify global data. bpf_trace_printk has its own recursive check and ok as well. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* bpf: add new arg_type that allows for 0 sized stack bufferDaniel Borkmann2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, when we pass a buffer from the eBPF stack into a helper function, the function proto indicates argument types as ARG_PTR_TO_STACK and ARG_CONST_STACK_SIZE pair. If R<X> contains the former, then R<X+1> must be of the latter type. Then, verifier checks whether the buffer points into eBPF stack, is initialized, etc. The verifier also guarantees that the constant value passed in R<X+1> is greater than 0, so helper functions don't need to test for it and can always assume a non-NULL initialized buffer as well as non-0 buffer size. This patch adds a new argument types ARG_CONST_STACK_SIZE_OR_ZERO that allows to also pass NULL as R<X> and 0 as R<X+1> into the helper function. Such helper functions, of course, need to be able to handle these cases internally then. Verifier guarantees that either R<X> == NULL && R<X+1> == 0 or R<X> != NULL && R<X+1> != 0 (like the case of ARG_CONST_STACK_SIZE), any other combinations are not possible to load. I went through various options of extending the verifier, and introducing the type ARG_CONST_STACK_SIZE_OR_ZERO seems to have most minimal changes needed to the verifier. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* bpf: introduce BPF_MAP_TYPE_STACK_TRACEAlexei Starovoitov2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | add new map type to store stack traces and corresponding helper bpf_get_stackid(ctx, map, flags) - walk user or kernel stack and return id @ctx: struct pt_regs* @map: pointer to stack_trace map @flags: bits 0-7 - numer of stack frames to skip bit 8 - collect user stack instead of kernel bit 9 - compare stacks by hash only bit 10 - if two different stacks hash into the same stackid discard old other bits - reserved Return: >= 0 stackid on success or negative error stackid is a 32-bit integer handle that can be further combined with other data (including other stackid) and used as a key into maps. Userspace will access stackmap using standard lookup/delete syscall commands to retrieve full stack trace for given stackid. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* soreuseport: setsockopt SO_ATTACH_REUSEPORT_[CE]BPFCraig Gallek2022-04-19
| | | | | | | | | | | | | | | | Expose socket options for setting a classic or extended BPF program for use when selecting sockets in an SO_REUSEPORT group. These options can be used on the first socket to belong to a group before bind or on any socket in the group after bind. This change includes refactoring of the existing sk_filter code to allow reuse of the existing BPF filter validation checks. Signed-off-by: Craig Gallek <kraig@google.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com> Change-Id: I4433dfd5d865bb69e8d3cab55cc68325829e8b49
* bpf: add lookup/update support for per-cpu hash and array mapsAlexei Starovoitov2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | The functions bpf_map_lookup_elem(map, key, value) and bpf_map_update_elem(map, key, value, flags) need to get/set values from all-cpus for per-cpu hash and array maps, so that user space can aggregate/update them as necessary. Example of single counter aggregation in user space: unsigned int nr_cpus = sysconf(_SC_NPROCESSORS_CONF); long values[nr_cpus]; long value = 0; bpf_lookup_elem(fd, key, values); for (i = 0; i < nr_cpus; i++) value += values[i]; The user space must provide round_up(value_size, 8) * nr_cpus array to get/set values, since kernel will use 'long' copy of per-cpu values to try to copy good counters atomically. It's a best-effort, since bpf programs and user space are racing to access the same memory. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* bpf: introduce BPF_MAP_TYPE_PERCPU_ARRAY mapAlexei Starovoitov2022-04-19
| | | | | | | | | | | | | Primary use case is a histogram array of latency where bpf program computes the latency of block requests or other events and stores histogram of latency into array of 64 elements. All cpus are constantly running, so normal increment is not accurate, bpf_xadd causes cache ping-pong and this per-cpu approach allows fastest collision-free counters. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* perf/bpf: Convert perf_event_array to use struct fileAlexei Starovoitov2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | Robustify refcounting. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: Wang Nan <wangnan0@huawei.com> Cc: vince@deater.net Link: http://lkml.kernel.org/r/20160126045947.GA40151@ast-mbp.thefacebook.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* bpf: fix clearing on persistent program array mapsDaniel Borkmann2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, when having map file descriptors pointing to program arrays, there's still the issue that we unconditionally flush program array contents via bpf_fd_array_map_clear() in bpf_map_release(). This happens when such a file descriptor is released and is independent of the map's refcount. Having this flush independent of the refcount is for a reason: there can be arbitrary complex dependency chains among tail calls, also circular ones (direct or indirect, nesting limit determined during runtime), and we need to make sure that the map drops all references to eBPF programs it holds, so that the map's refcount can eventually drop to zero and initiate its freeing. Btw, a walk of the whole dependency graph would not be possible for various reasons, one being complexity and another one inconsistency, i.e. new programs can be added to parts of the graph at any time, so there's no guaranteed consistent state for the time of such a walk. Now, the program array pinning itself works, but the issue is that each derived file descriptor on close would nevertheless call unconditionally into bpf_fd_array_map_clear(). Instead, keep track of users and postpone this flush until the last reference to a user is dropped. As this only concerns a subset of references (f.e. a prog array could hold a program that itself has reference on the prog array holding it, etc), we need to track them separately. Short analysis on the refcounting: on map creation time usercnt will be one, so there's no change in behaviour for bpf_map_release(), if unpinned. If we already fail in map_create(), we are immediately freed, and no file descriptor has been made public yet. In bpf_obj_pin_user(), we need to probe for a possible map in bpf_fd_probe_obj() already with a usercnt reference, so before we drop the reference on the fd with fdput(). Therefore, if actual pinning fails, we need to drop that reference again in bpf_any_put(), otherwise we keep holding it. When last reference drops on the inode, the bpf_any_put() in bpf_evict_inode() will take care of dropping the usercnt again. In the bpf_obj_get_user() case, the bpf_any_get() will grab a reference on the usercnt, still at a time when we have the reference on the path. Should we later on fail to grab a new file descriptor, bpf_any_put() will drop it, otherwise we hold it until bpf_map_release() time. Joint work with Alexei. Fixes: b2197755b263 ("bpf: add support for persistent maps/progs") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* Revert "bpf: fix clearing on persistent program array maps"Anay Wadhera2022-04-19
| | | | | | This reverts commit c9da161c6517ba12154059d3b965c2cbaf16f90f. Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* Revert "bpf: fix refcnt overflow"Anay Wadhera2022-04-19
| | | | | | This reverts commit 3899251bdb9c2b31fc73d4cc132f52d3710101de. Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* Revert "bpf: add bpf_patch_insn_single helper"Anay Wadhera2022-04-19
| | | | | | This reverts commit 087a92287dbae61b4ee1e76d7c20c81710109422. Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* Revert "bpf: prevent out-of-bounds speculation"Anay Wadhera2022-04-19
| | | | | | This reverts commit 9a7fad4c0e215fb1c256fee27c45f9f8bc4364c5. Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* Revert "bpf: avoid false sharing of map refcount with max_entries"Anay Wadhera2022-04-19
| | | | | | This reverts commit 96d9b2338bed553c37f759127d8d18c857449ceb. Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* Revert "bpf: generally move prog destruction to RCU deferral"Anay Wadhera2022-04-19
| | | | | | This reverts commit e25dc63aa366fd0f61d1d9ba67b66f5d75fc4372. Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* Merge remote-tracking branch 'google/common/android-4.4-p' into ↵Michael Bestas2022-02-04
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | lineage-18.1-caf-msm8998 * google/common/android-4.4-p: Linux 4.4.302 Input: i8042 - Fix misplaced backport of "add ASUS Zenbook Flip to noselftest list" KVM: x86: Fix misplaced backport of "work around leak of uninitialized stack contents" Revert "tc358743: fix register i2c_rd/wr function fix" Revert "drm/radeon/ci: disable mclk switching for high refresh rates (v2)" Bluetooth: MGMT: Fix misplaced BT_HS check ipv4: tcp: send zero IPID in SYNACK messages ipv4: raw: lock the socket in raw_bind() hwmon: (lm90) Reduce maximum conversion rate for G781 drm/msm: Fix wrong size calculation net-procfs: show net devices bound packet types ipv4: avoid using shared IP generator for connected sockets net: fix information leakage in /proc/net/ptype ipv6_tunnel: Rate limit warning messages scsi: bnx2fc: Flush destroy_work queue before calling bnx2fc_interface_put() USB: core: Fix hang in usb_kill_urb by adding memory barriers usb-storage: Add unusual-devs entry for VL817 USB-SATA bridge tty: Add support for Brainboxes UC cards. tty: n_gsm: fix SW flow control encoding/handling serial: stm32: fix software flow control transfer PM: wakeup: simplify the output logic of pm_show_wakelocks() udf: Fix NULL ptr deref when converting from inline format udf: Restore i_lenAlloc when inode expansion fails scsi: zfcp: Fix failed recovery on gone remote port with non-NPIV FCP devices s390/hypfs: include z/VM guests with access control group set Bluetooth: refactor malicious adv data check can: bcm: fix UAF of bcm op Linux 4.4.301 drm/i915: Flush TLBs before releasing backing store Linux 4.4.300 lib82596: Fix IRQ check in sni_82596_probe bcmgenet: add WOL IRQ check net_sched: restore "mpu xxx" handling dmaengine: at_xdmac: Fix at_xdmac_lld struct definition dmaengine: at_xdmac: Fix lld view setting dmaengine: at_xdmac: Print debug message after realeasing the lock dmaengine: at_xdmac: Don't start transactions at tx_submit level netns: add schedule point in ops_exit_list() net: axienet: fix number of TX ring slots for available check net: axienet: Wait for PhyRstCmplt after core reset af_unix: annote lockless accesses to unix_tot_inflight & gc_in_progress parisc: pdc_stable: Fix memory leak in pdcs_register_pathentries net/fsl: xgmac_mdio: Fix incorrect iounmap when removing module powerpc/fsl/dts: Enable WA for erratum A-009885 on fman3l MDIO buses ext4: don't use the orphan list when migrating an inode ext4: Fix BUG_ON in ext4_bread when write quota data ext4: set csum seed in tmp inode while migrating to extents ubifs: Error path in ubifs_remount_rw() seems to wrongly free write buffers power: bq25890: Enable continuous conversion for ADC at charging scsi: sr: Don't use GFP_DMA MIPS: Octeon: Fix build errors using clang i2c: designware-pci: Fix to change data types of hcnt and lcnt parameters ALSA: seq: Set upper limit of processed events w1: Misuse of get_user()/put_user() reported by sparse i2c: mpc: Correct I2C reset procedure powerpc/smp: Move setup_profiling_timer() under CONFIG_PROFILING i2c: i801: Don't silently correct invalid transfer size powerpc/btext: add missing of_node_put powerpc/cell: add missing of_node_put powerpc/powernv: add missing of_node_put powerpc/6xx: add missing of_node_put parisc: Avoid calling faulthandler_disabled() twice serial: core: Keep mctrl register state and cached copy in sync serial: pl010: Drop CR register reset on set_termios dm space map common: add bounds check to sm_ll_lookup_bitmap() dm btree: add a defensive bounds check to insert_at() net: mdio: Demote probed message to debug print btrfs: remove BUG_ON(!eie) in find_parent_nodes btrfs: remove BUG_ON() in find_parent_nodes() ACPICA: Executer: Fix the REFCLASS_REFOF case in acpi_ex_opcode_1A_0T_1R() ACPICA: Utilities: Avoid deleting the same object twice in a row um: registers: Rename function names to avoid conflicts and build problems ath9k: Fix out-of-bound memcpy in ath9k_hif_usb_rx_stream usb: hub: Add delay for SuperSpeed hub resume to let links transit to U0 media: saa7146: hexium_gemini: Fix a NULL pointer dereference in hexium_attach() media: igorplugusb: receiver overflow should be reported net: bonding: debug: avoid printing debug logs when bond is not notifying peers iwlwifi: mvm: synchronize with FW after multicast commands media: m920x: don't use stack on USB reads media: saa7146: hexium_orion: Fix a NULL pointer dereference in hexium_attach() floppy: Add max size check for user space request mwifiex: Fix skb_over_panic in mwifiex_usb_recv() HSI: core: Fix return freed object in hsi_new_client media: b2c2: Add missing check in flexcop_pci_isr: usb: gadget: f_fs: Use stream_open() for endpoint files ar5523: Fix null-ptr-deref with unexpected WDCMSG_TARGET_START reply fs: dlm: filter user dlm messages for kernel locks Bluetooth: Fix debugfs entry leak in hci_register_dev() RDMA/cxgb4: Set queue pair state when being queried mips: bcm63xx: add support for clk_set_parent() mips: lantiq: add support for clk_set_parent() misc: lattice-ecp3-config: Fix task hung when firmware load failed ASoC: samsung: idma: Check of ioremap return value dmaengine: pxa/mmp: stop referencing config->slave_id RDMA/core: Let ib_find_gid() continue search even after empty entry char/mwave: Adjust io port register size ALSA: oss: fix compile error when OSS_DEBUG is enabled powerpc/prom_init: Fix improper check of prom_getprop() ALSA: hda: Add missing rwsem around snd_ctl_remove() calls ALSA: PCM: Add missing rwsem around snd_ctl_remove() calls ALSA: jack: Add missing rwsem around snd_ctl_remove() calls ext4: avoid trim error on fs with small groups net: mcs7830: handle usb read errors properly pcmcia: fix setting of kthread task states can: xilinx_can: xcan_probe(): check for error irq can: softing: softing_startstop(): fix set but not used variable warning spi: spi-meson-spifc: Add missing pm_runtime_disable() in meson_spifc_probe ppp: ensure minimum packet size in ppp_write() pcmcia: rsrc_nonstatic: Fix a NULL pointer dereference in nonstatic_find_mem_region() pcmcia: rsrc_nonstatic: Fix a NULL pointer dereference in __nonstatic_find_io_region() usb: ftdi-elan: fix memory leak on device disconnect media: msi001: fix possible null-ptr-deref in msi001_probe() media: saa7146: mxb: Fix a NULL pointer dereference in mxb_attach() media: dib8000: Fix a memleak in dib8000_init() floppy: Fix hang in watchdog when disk is ejected serial: amba-pl011: do not request memory region twice drm/amdgpu: Fix a NULL pointer dereference in amdgpu_connector_lcd_native_mode() arm64: dts: qcom: msm8916: fix MMC controller aliases netfilter: bridge: add support for pppoe filtering tty: serial: atmel: Call dma_async_issue_pending() tty: serial: atmel: Check return code of dmaengine_submit() crypto: qce - fix uaf on qce_ahash_register_one Bluetooth: stop proccessing malicious adv data Bluetooth: cmtp: fix possible panic when cmtp_init_sockets() fails PCI: Add function 1 DMA alias quirk for Marvell 88SE9125 SATA controller can: softing_cs: softingcs_probe(): fix memleak on registration failure media: stk1160: fix control-message timeouts media: pvrusb2: fix control-message timeouts media: dib0700: fix undefined behavior in tuner shutdown media: em28xx: fix control-message timeouts media: mceusb: fix control-message timeouts rtc: cmos: take rtc_lock while reading from CMOS nfc: llcp: fix NULL error pointer dereference on sendmsg() after failed bind() HID: uhid: Fix worker destroying device without any protection rtlwifi: rtl8192cu: Fix WARNING when calling local_irq_restore() with interrupts enabled media: uvcvideo: fix division by zero at stream start drm/i915: Avoid bitwise vs logical OR warning in snb_wm_latency_quirk() can: gs_usb: gs_can_start_xmit(): zero-initialize hf->{flags,reserved} can: gs_usb: fix use of uninitialized variable, detach device on reception of invalid USB data mfd: intel-lpss: Fix too early PM enablement in the ACPI ->probe() USB: Fix "slab-out-of-bounds Write" bug in usb_hcd_poll_rh_status USB: core: Fix bug in resuming hub's handling of wakeup requests Bluetooth: bfusb: fix division by zero in send path Linux 4.4.299 power: reset: ltc2952: Fix use of floating point literals mISDN: change function names to avoid conflicts net: udp: fix alignment problem in udp4_seq_show() ip6_vti: initialize __ip6_tnl_parm struct in vti6_siocdevprivate scsi: libiscsi: Fix UAF in iscsi_conn_get_param()/iscsi_conn_teardown() phonet: refcount leak in pep_sock_accep rndis_host: support Hytera digital radios xfs: map unwritten blocks in XFS_IOC_{ALLOC,FREE}SP just like fallocate sch_qfq: prevent shift-out-of-bounds in qfq_init_qdisc i40e: Fix incorrect netdev's real number of RX/TX queues mac80211: initialize variable have_higher_than_11mbit ieee802154: atusb: fix uninit value in atusb_set_extended_addr Bluetooth: btusb: Apply QCA Rome patches for some ATH3012 models bpf, test: fix ld_abs + vlan push/pop stress test Linux 4.4.298 net: fix use-after-free in tw_timer_handler Input: spaceball - fix parsing of movement data packets Input: appletouch - initialize work before device registration scsi: vmw_pvscsi: Set residual data length conditionally usb: gadget: f_fs: Clear ffs_eventfd in ffs_data_clear. xhci: Fresco FL1100 controller should not have BROKEN_MSI quirk set. uapi: fix linux/nfc.h userspace compilation errors nfc: uapi: use kernel size_t to fix user-space builds selinux: initialize proto variable in selinux_ip_postroute_compat() recordmcount.pl: fix typo in s390 mcount regex platform/x86: apple-gmux: use resource_size() with res Linux 4.4.297 phonet/pep: refuse to enable an unbound pipe hamradio: improve the incomplete fix to avoid NPD hamradio: defer ax25 kfree after unregister_netdev ax25: NPD bug when detaching AX25 device xen/blkfront: fix bug in backported patch ARM: 9169/1: entry: fix Thumb2 bug in iWMMXt exception handling ALSA: drivers: opl3: Fix incorrect use of vp->state ALSA: jack: Check the return value of kstrdup() hwmon: (lm90) Fix usage of CONFIG2 register in detect function drivers: net: smc911x: Check for error irq bonding: fix ad_actor_system option setting to default qlcnic: potential dereference null pointer of rx_queue->page_ring IB/qib: Fix memory leak in qib_user_sdma_queue_pkts() HID: holtek: fix mouse probing can: kvaser_usb: get CAN clock frequency from device net: usb: lan78xx: add Allied Telesis AT29M2-AF Conflicts: drivers/usb/gadget/function/f_fs.c Change-Id: Iabc390c3c9160c7a2864ffe1125d73412ffdb31d
| * Merge 4.4.302 into android-4.4-pGreg Kroah-Hartman2022-02-03
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Changes in 4.4.302 can: bcm: fix UAF of bcm op Bluetooth: refactor malicious adv data check s390/hypfs: include z/VM guests with access control group set scsi: zfcp: Fix failed recovery on gone remote port with non-NPIV FCP devices udf: Restore i_lenAlloc when inode expansion fails udf: Fix NULL ptr deref when converting from inline format PM: wakeup: simplify the output logic of pm_show_wakelocks() serial: stm32: fix software flow control transfer tty: n_gsm: fix SW flow control encoding/handling tty: Add support for Brainboxes UC cards. usb-storage: Add unusual-devs entry for VL817 USB-SATA bridge USB: core: Fix hang in usb_kill_urb by adding memory barriers scsi: bnx2fc: Flush destroy_work queue before calling bnx2fc_interface_put() ipv6_tunnel: Rate limit warning messages net: fix information leakage in /proc/net/ptype ipv4: avoid using shared IP generator for connected sockets net-procfs: show net devices bound packet types drm/msm: Fix wrong size calculation hwmon: (lm90) Reduce maximum conversion rate for G781 ipv4: raw: lock the socket in raw_bind() ipv4: tcp: send zero IPID in SYNACK messages Bluetooth: MGMT: Fix misplaced BT_HS check Revert "drm/radeon/ci: disable mclk switching for high refresh rates (v2)" Revert "tc358743: fix register i2c_rd/wr function fix" KVM: x86: Fix misplaced backport of "work around leak of uninitialized stack contents" Input: i8042 - Fix misplaced backport of "add ASUS Zenbook Flip to noselftest list" Linux 4.4.302 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: I5191d3cb4df0fa8de60170d2fedf4a3c51380fdf
| | * net: fix information leakage in /proc/net/ptypeCongyu Liu2022-02-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 47934e06b65637c88a762d9c98329ae6e3238888 upstream. In one net namespace, after creating a packet socket without binding it to a device, users in other net namespaces can observe the new `packet_type` added by this packet socket by reading `/proc/net/ptype` file. This is minor information leakage as packet socket is namespace aware. Add a net pointer in `packet_type` to keep the net namespace of of corresponding packet socket. In `ptype_seq_show`, this net pointer must be checked when it is not NULL. Fixes: 2feb27dbe00c ("[NETNS]: Minor information leak via /proc/net/ptype file.") Signed-off-by: Congyu Liu <liu3101@purdue.edu> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* | | Merge remote-tracking branch 'common/android-4.4-p' into ↵Michael Bestas2021-12-27
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | lineage-18.1-caf-msm8998 * common/android-4.4-p: Linux 4.4.296 xen/netback: don't queue unlimited number of packages xen/console: harden hvc_xen against event channel storms xen/netfront: harden netfront against event channel storms xen/blkfront: harden blkfront against event channel storms Input: touchscreen - avoid bitwise vs logical OR warning ARM: 8805/2: remove unneeded naked function usage net: lan78xx: Avoid unnecessary self assignment net: systemport: Add global locking for descriptor lifecycle timekeeping: Really make sure wall_to_monotonic isn't positive USB: serial: option: add Telit FN990 compositions PCI/MSI: Clear PCI_MSIX_FLAGS_MASKALL on error USB: gadget: bRequestType is a bitfield, not a enum igbvf: fix double free in `igbvf_probe` soc/tegra: fuse: Fix bitwise vs. logical OR warning nfsd: fix use-after-free due to delegation race dm btree remove: fix use after free in rebalance_children() recordmcount.pl: look for jgnop instruction as well as bcrl on s390 mac80211: send ADDBA requests using the tid/queue of the aggregation session hwmon: (dell-smm) Fix warning on /proc/i8k creation error net: netlink: af_netlink: Prevent empty skb by adding a check on len. i2c: rk3x: Handle a spurious start completion interrupt flag parisc/agp: Annotate parisc agp init functions with __init nfc: fix segfault in nfc_genl_dump_devices_done FROMGIT: USB: gadget: bRequestType is a bitfield, not a enum Linux 4.4.295 irqchip: nvic: Fix offset for Interrupt Priority Offsets irqchip/irq-gic-v3-its.c: Force synchronisation when issuing INVALL iio: accel: kxcjk-1013: Fix possible memory leak in probe and remove iio: itg3200: Call iio_trigger_notify_done() on error iio: ltr501: Don't return error code in trigger handler iio: mma8452: Fix trigger reference couting iio: stk3310: Don't return error code in interrupt handler usb: core: config: fix validation of wMaxPacketValue entries USB: gadget: zero allocate endpoint 0 buffers USB: gadget: detect too-big endpoint 0 requests net/qla3xxx: fix an error code in ql_adapter_up() net, neigh: clear whole pneigh_entry at alloc time net: fec: only clear interrupt of handling queue in fec_enet_rx_queue() net: altera: set a couple error code in probe() net: cdc_ncm: Allow for dwNtbOutMaxSize to be unset or zero block: fix ioprio_get(IOPRIO_WHO_PGRP) vs setuid(2) tracefs: Set all files to the same group ownership as the mount option signalfd: use wake_up_pollfree() binder: use wake_up_pollfree() wait: add wake_up_pollfree() libata: add horkage for ASMedia 1092 can: pch_can: pch_can_rx_normal: fix use after free tracefs: Have new files inherit the ownership of their parent ALSA: pcm: oss: Handle missing errors in snd_pcm_oss_change_params*() ALSA: pcm: oss: Limit the period size to 16MB ALSA: pcm: oss: Fix negative period/buffer sizes ALSA: ctl: Fix copy of updated id with element read/write mm: bdi: initialize bdi_min_ratio when bdi is unregistered nfc: fix potential NULL pointer deref in nfc_genl_dump_ses_done can: sja1000: fix use after free in ems_pcmcia_add_card() HID: check for valid USB device for many HID drivers HID: wacom: fix problems when device is not a valid USB device HID: add USB_HID dependancy on some USB HID drivers HID: add USB_HID dependancy to hid-chicony HID: add USB_HID dependancy to hid-prodikeys HID: add hid_is_usb() function to make it simpler for USB detection HID: introduce hid_is_using_ll_driver UPSTREAM: USB: gadget: zero allocate endpoint 0 buffers UPSTREAM: USB: gadget: detect too-big endpoint 0 requests Linux 4.4.294 serial: pl011: Add ACPI SBSA UART match id tty: serial: msm_serial: Deactivate RX DMA for polling support vgacon: Propagate console boot parameters before calling `vc_resize' parisc: Fix "make install" on newer debian releases siphash: use _unaligned version by default net: qlogic: qlcnic: Fix a NULL pointer dereference in qlcnic_83xx_add_rings() natsemi: xtensa: fix section mismatch warnings fget: check that the fd still exists after getting a ref to it fs: add fget_many() and fput_many() sata_fsl: fix warning in remove_proc_entry when rmmod sata_fsl sata_fsl: fix UAF in sata_fsl_port_stop when rmmod sata_fsl kprobes: Limit max data_size of the kretprobe instances net: ethernet: dec: tulip: de4x5: fix possible array overflows in type3_infoblock() net: tulip: de4x5: fix the problem that the array 'lp->phy[8]' may be out of bound scsi: iscsi: Unblock session then wake up error handler s390/setup: avoid using memblock_enforce_memory_limit platform/x86: thinkpad_acpi: Fix WWAN device disabled issue after S3 deep net: return correct error code hugetlb: take PMD sharing into account when flushing tlb/caches tty: hvc: replace BUG_ON() with negative return value xen/netfront: don't trust the backend response data blindly xen/netfront: disentangle tx_skb_freelist xen/netfront: don't read data from request on the ring page xen/netfront: read response from backend only once xen/blkfront: don't trust the backend response data blindly xen/blkfront: don't take local copy of a request from the ring page xen/blkfront: read response from backend only once xen: sync include/xen/interface/io/ring.h with Xen's newest version shm: extend forced shm destroy to support objects from several IPC nses fuse: release pipe buf after last use fuse: fix page stealing NFC: add NCI_UNREG flag to eliminate the race proc/vmcore: fix clearing user buffer by properly using clear_user() hugetlbfs: flush TLBs correctly after huge_pmd_unshare tracing: Check pid filtering when creating events tcp_cubic: fix spurious Hystart ACK train detections for not-cwnd-limited flows scsi: mpt3sas: Fix kernel panic during drive powercycle test ARM: socfpga: Fix crash with CONFIG_FORTIRY_SOURCE NFSv42: Don't fail clone() unless the OP_CLONE operation failed net: ieee802154: handle iftypes as u32 ASoC: topology: Add missing rwsem around snd_ctl_remove() calls ARM: dts: BCM5301X: Add interrupt properties to GPIO node xen: detect uninitialized xenbus in xenbus_init xen: don't continue xenstore initialization in case of errors staging: rtl8192e: Fix use after free in _rtl92e_pci_disconnect() ALSA: ctxfi: Fix out-of-range access binder: fix test regression due to sender_euid change usb: hub: Fix locking issues with address0_mutex usb: hub: Fix usb enumeration issue due to address0 race USB: serial: option: add Fibocom FM101-GL variants USB: serial: option: add Telit LE910S1 0x9200 composition staging: ion: Prevent incorrect reference counting behavour Change-Id: Iadf9f213915d2a02b27ceb3b2144eac827ade329
| * | Merge 4.4.295 into android-4.4-pGreg Kroah-Hartman2021-12-14
| |\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Changes in 4.4.295 HID: introduce hid_is_using_ll_driver HID: add hid_is_usb() function to make it simpler for USB detection HID: add USB_HID dependancy to hid-prodikeys HID: add USB_HID dependancy to hid-chicony HID: add USB_HID dependancy on some USB HID drivers HID: wacom: fix problems when device is not a valid USB device HID: check for valid USB device for many HID drivers can: sja1000: fix use after free in ems_pcmcia_add_card() nfc: fix potential NULL pointer deref in nfc_genl_dump_ses_done mm: bdi: initialize bdi_min_ratio when bdi is unregistered ALSA: ctl: Fix copy of updated id with element read/write ALSA: pcm: oss: Fix negative period/buffer sizes ALSA: pcm: oss: Limit the period size to 16MB ALSA: pcm: oss: Handle missing errors in snd_pcm_oss_change_params*() tracefs: Have new files inherit the ownership of their parent can: pch_can: pch_can_rx_normal: fix use after free libata: add horkage for ASMedia 1092 wait: add wake_up_pollfree() binder: use wake_up_pollfree() signalfd: use wake_up_pollfree() tracefs: Set all files to the same group ownership as the mount option block: fix ioprio_get(IOPRIO_WHO_PGRP) vs setuid(2) net: cdc_ncm: Allow for dwNtbOutMaxSize to be unset or zero net: altera: set a couple error code in probe() net: fec: only clear interrupt of handling queue in fec_enet_rx_queue() net, neigh: clear whole pneigh_entry at alloc time net/qla3xxx: fix an error code in ql_adapter_up() USB: gadget: detect too-big endpoint 0 requests USB: gadget: zero allocate endpoint 0 buffers usb: core: config: fix validation of wMaxPacketValue entries iio: stk3310: Don't return error code in interrupt handler iio: mma8452: Fix trigger reference couting iio: ltr501: Don't return error code in trigger handler iio: itg3200: Call iio_trigger_notify_done() on error iio: accel: kxcjk-1013: Fix possible memory leak in probe and remove irqchip/irq-gic-v3-its.c: Force synchronisation when issuing INVALL irqchip: nvic: Fix offset for Interrupt Priority Offsets Linux 4.4.295 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: I4810f7e2ebd538739fb0d62f9eade0921a9badfb
| | * wait: add wake_up_pollfree()Eric Biggers2021-12-14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 42288cb44c4b5fff7653bc392b583a2b8bd6a8c0 upstream. Several ->poll() implementations are special in that they use a waitqueue whose lifetime is the current task, rather than the struct file as is normally the case. This is okay for blocking polls, since a blocking poll occurs within one task; however, non-blocking polls require another solution. This solution is for the queue to be cleared before it is freed, using 'wake_up_poll(wq, EPOLLHUP | POLLFREE);'. However, that has a bug: wake_up_poll() calls __wake_up() with nr_exclusive=1. Therefore, if there are multiple "exclusive" waiters, and the wakeup function for the first one returns a positive value, only that one will be called. That's *not* what's needed for POLLFREE; POLLFREE is special in that it really needs to wake up everyone. Considering the three non-blocking poll systems: - io_uring poll doesn't handle POLLFREE at all, so it is broken anyway. - aio poll is unaffected, since it doesn't support exclusive waits. However, that's fragile, as someone could add this feature later. - epoll doesn't appear to be broken by this, since its wakeup function returns 0 when it sees POLLFREE. But this is fragile. Although there is a workaround (see epoll), it's better to define a function which always sends POLLFREE to all waiters. Add such a function. Also make it verify that the queue really becomes empty after all waiters have been woken up. Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20211209010455.42744-2-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| | * HID: add hid_is_usb() function to make it simpler for USB detectionGreg Kroah-Hartman2021-12-14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit f83baa0cb6cfc92ebaf7f9d3a99d7e34f2e77a8a upstream. A number of HID drivers already call hid_is_using_ll_driver() but only for the detection of if this is a USB device or not. Make this more obvious by creating hid_is_usb() and calling the function that way. Also converts the existing hid_is_using_ll_driver() functions to use the new call. Cc: Jiri Kosina <jikos@kernel.org> Cc: Benjamin Tissoires <benjamin.tissoires@redhat.com> Cc: linux-input@vger.kernel.org Cc: stable@vger.kernel.org Tested-by: Benjamin Tissoires <benjamin.tissoires@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Benjamin Tissoires <benjamin.tissoires@redhat.com> Link: https://lore.kernel.org/r/20211201183503.2373082-1-gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| | * HID: introduce hid_is_using_ll_driverJason Gerecke2021-12-14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit fc2237a724a9e448599076d7d23497f51e2f7441 upstream. Although HID itself is transport-agnostic, occasionally a driver may want to interact with the low-level transport that a device is connected through. To do this, we need to know what kind of bus is in use. The first guess may be to look at the 'bus' field of the 'struct hid_device', but this field may be emulated in some cases (e.g. uhid). More ideally, we can check which ll_driver a device is using. This function introduces a 'hid_is_using_ll_driver' function and makes the 'struct hid_ll_driver' of the four most common transports accessible through hid.h. Signed-off-by: Jason Gerecke <jason.gerecke@wacom.com> Acked-By: Benjamin Tissoires <benjamin.tissoires@redhat.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * | Merge 4.4.294 into android-4.4-pGreg Kroah-Hartman2021-12-08
| |\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Changes in 4.4.294 staging: ion: Prevent incorrect reference counting behavour USB: serial: option: add Telit LE910S1 0x9200 composition USB: serial: option: add Fibocom FM101-GL variants usb: hub: Fix usb enumeration issue due to address0 race usb: hub: Fix locking issues with address0_mutex binder: fix test regression due to sender_euid change ALSA: ctxfi: Fix out-of-range access staging: rtl8192e: Fix use after free in _rtl92e_pci_disconnect() xen: don't continue xenstore initialization in case of errors xen: detect uninitialized xenbus in xenbus_init ARM: dts: BCM5301X: Add interrupt properties to GPIO node ASoC: topology: Add missing rwsem around snd_ctl_remove() calls net: ieee802154: handle iftypes as u32 NFSv42: Don't fail clone() unless the OP_CLONE operation failed ARM: socfpga: Fix crash with CONFIG_FORTIRY_SOURCE scsi: mpt3sas: Fix kernel panic during drive powercycle test tcp_cubic: fix spurious Hystart ACK train detections for not-cwnd-limited flows tracing: Check pid filtering when creating events hugetlbfs: flush TLBs correctly after huge_pmd_unshare proc/vmcore: fix clearing user buffer by properly using clear_user() NFC: add NCI_UNREG flag to eliminate the race fuse: fix page stealing fuse: release pipe buf after last use shm: extend forced shm destroy to support objects from several IPC nses xen: sync include/xen/interface/io/ring.h with Xen's newest version xen/blkfront: read response from backend only once xen/blkfront: don't take local copy of a request from the ring page xen/blkfront: don't trust the backend response data blindly xen/netfront: read response from backend only once xen/netfront: don't read data from request on the ring page xen/netfront: disentangle tx_skb_freelist xen/netfront: don't trust the backend response data blindly tty: hvc: replace BUG_ON() with negative return value hugetlb: take PMD sharing into account when flushing tlb/caches net: return correct error code platform/x86: thinkpad_acpi: Fix WWAN device disabled issue after S3 deep s390/setup: avoid using memblock_enforce_memory_limit scsi: iscsi: Unblock session then wake up error handler net: tulip: de4x5: fix the problem that the array 'lp->phy[8]' may be out of bound net: ethernet: dec: tulip: de4x5: fix possible array overflows in type3_infoblock() kprobes: Limit max data_size of the kretprobe instances sata_fsl: fix UAF in sata_fsl_port_stop when rmmod sata_fsl sata_fsl: fix warning in remove_proc_entry when rmmod sata_fsl fs: add fget_many() and fput_many() fget: check that the fd still exists after getting a ref to it natsemi: xtensa: fix section mismatch warnings net: qlogic: qlcnic: Fix a NULL pointer dereference in qlcnic_83xx_add_rings() siphash: use _unaligned version by default parisc: Fix "make install" on newer debian releases vgacon: Propagate console boot parameters before calling `vc_resize' tty: serial: msm_serial: Deactivate RX DMA for polling support serial: pl011: Add ACPI SBSA UART match id Linux 4.4.294 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: Id3cafc33da957a0501bcf61d000025167d552797
| | * siphash: use _unaligned version by defaultArnd Bergmann2021-12-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit f7e5b9bfa6c8820407b64eabc1f29c9a87e8993d upstream. On ARM v6 and later, we define CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS because the ordinary load/store instructions (ldr, ldrh, ldrb) can tolerate any misalignment of the memory address. However, load/store double and load/store multiple instructions (ldrd, ldm) may still only be used on memory addresses that are 32-bit aligned, and so we have to use the CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS macro with care, or we may end up with a severe performance hit due to alignment traps that require fixups by the kernel. Testing shows that this currently happens with clang-13 but not gcc-11. In theory, any compiler version can produce this bug or other problems, as we are dealing with undefined behavior in C99 even on architectures that support this in hardware, see also https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100363. Fortunately, the get_unaligned() accessors do the right thing: when building for ARMv6 or later, the compiler will emit unaligned accesses using the ordinary load/store instructions (but avoid the ones that require 32-bit alignment). When building for older ARM, those accessors will emit the appropriate sequence of ldrb/mov/orr instructions. And on architectures that can truly tolerate any kind of misalignment, the get_unaligned() accessors resolve to the leXX_to_cpup accessors that operate on aligned addresses. Since the compiler will in fact emit ldrd or ldm instructions when building this code for ARM v6 or later, the solution is to use the unaligned accessors unconditionally on architectures where this is known to be fast. The _aligned version of the hash function is however still needed to get the best performance on architectures that cannot do any unaligned access in hardware. This new version avoids the undefined behavior and should produce the fastest hash on all architectures we support. Link: https://lore.kernel.org/linux-arm-kernel/20181008211554.5355-4-ard.biesheuvel@linaro.org/ Link: https://lore.kernel.org/linux-crypto/CAK8P3a2KfmmGDbVHULWevB0hv71P2oi2ZCHEAqT=8dQfa0=cqQ@mail.gmail.com/ Reported-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Fixes: 2c956a60778c ("siphash: add cryptographically secure PRF") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Jason A. Donenfeld <Jason@zx2c4.com> Acked-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| | * fs: add fget_many() and fput_many()Jens Axboe2021-12-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 091141a42e15fe47ada737f3996b317072afcefb upstream. Some uses cases repeatedly get and put references to the same file, but the only exposed interface is doing these one at the time. As each of these entail an atomic inc or dec on a shared structure, that cost can add up. Add fget_many(), which works just like fget(), except it takes an argument for how many references to get on the file. Ditto fput_many(), which can drop an arbitrary number of references to a file. Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| | * kprobes: Limit max data_size of the kretprobe instancesMasami Hiramatsu2021-12-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 6bbfa44116689469267f1a6e3d233b52114139d2 upstream. The 'kprobe::data_size' is unsigned, thus it can not be negative. But if user sets it enough big number (e.g. (size_t)-8), the result of 'data_size + sizeof(struct kretprobe_instance)' becomes smaller than sizeof(struct kretprobe_instance) or zero. In result, the kretprobe_instance are allocated without enough memory, and kretprobe accesses outside of allocated memory. To avoid this issue, introduce a max limitation of the kretprobe::data_size. 4KB per instance should be OK. Link: https://lkml.kernel.org/r/163836995040.432120.10322772773821182925.stgit@devnote2 Cc: stable@vger.kernel.org Fixes: f47cd9b553aa ("kprobes: kretprobe user entry-handler") Reported-by: zhangyue <zhangyue1@kylinos.cn> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| | * shm: extend forced shm destroy to support objects from several IPC nsesAlexander Mikhalitsyn2021-12-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 85b6d24646e4125c591639841169baa98a2da503 upstream. Currently, the exit_shm() function not designed to work properly when task->sysvshm.shm_clist holds shm objects from different IPC namespaces. This is a real pain when sysctl kernel.shm_rmid_forced = 1, because it leads to use-after-free (reproducer exists). This is an attempt to fix the problem by extending exit_shm mechanism to handle shm's destroy from several IPC ns'es. To achieve that we do several things: 1. add a namespace (non-refcounted) pointer to the struct shmid_kernel 2. during new shm object creation (newseg()/shmget syscall) we initialize this pointer by current task IPC ns 3. exit_shm() fully reworked such that it traverses over all shp's in task->sysvshm.shm_clist and gets IPC namespace not from current task as it was before but from shp's object itself, then call shm_destroy(shp, ns). Note: We need to be really careful here, because as it was said before (1), our pointer to IPC ns non-refcnt'ed. To be on the safe side we using special helper get_ipc_ns_not_zero() which allows to get IPC ns refcounter only if IPC ns not in the "state of destruction". Q/A Q: Why can we access shp->ns memory using non-refcounted pointer? A: Because shp object lifetime is always shorther than IPC namespace lifetime, so, if we get shp object from the task->sysvshm.shm_clist while holding task_lock(task) nobody can steal our namespace. Q: Does this patch change semantics of unshare/setns/clone syscalls? A: No. It's just fixes non-covered case when process may leave IPC namespace without getting task->sysvshm.shm_clist list cleaned up. Link: https://lkml.kernel.org/r/67bb03e5-f79c-1815-e2bf-949c67047418@colorfullife.com Link: https://lkml.kernel.org/r/20211109151501.4921-1-manfred@colorfullife.com Fixes: ab602f79915 ("shm: make exit_shm work proportional to task activity") Co-developed-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Andrei Vagin <avagin@gmail.com> Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com> Cc: Vasily Averin <vvs@virtuozzo.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>