summaryrefslogtreecommitdiff
path: root/include/linux/skbuff.h (follow)
Commit message (Collapse)AuthorAge
* sk_buff: allow segmenting based on frag sizesMarcelo Ricardo Leitner2022-10-28
| | | | | | | | | | This patch allows segmenting a skb based on its frags sizes instead of based on a fixed value. Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Tested-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Change-Id: Ic278a7f5bbe0f86ef348a63508da6819b0d098aa
* net: simplify and make pkt_type_ok() available for other usersJamal Hadi Salim2022-04-19
| | | | | | | | Suggested-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* bpf: direct packet write and access for helpers for clsact progsDaniel Borkmann2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This work implements direct packet access for helpers and direct packet write in a similar fashion as already available for XDP types via commits 4acf6c0b84c9 ("bpf: enable direct packet data write for xdp progs") and 6841de8b0d03 ("bpf: allow helpers access the packet directly"), and as a complementary feature to the already available direct packet read for tc (cls/act) programs. For enabling this, we need to introduce two helpers, bpf_skb_pull_data() and bpf_csum_update(). The first is generally needed for both, read and write, because they would otherwise only be limited to the current linear skb head. Usually, when the data_end test fails, programs just bail out, or, in the direct read case, use bpf_skb_load_bytes() as an alternative to overcome this limitation. If such data sits in non-linear parts, we can just pull them in once with the new helper, retest and eventually access them. At the same time, this also makes sure the skb is uncloned, which is, of course, a necessary condition for direct write. As this needs to be an invariant for the write part only, the verifier detects writes and adds a prologue that is calling bpf_skb_pull_data() to effectively unclone the skb from the very beginning in case it is indeed cloned. The heuristic makes use of a similar trick that was done in 233577a22089 ("net: filter: constify detection of pkt_type_offset"). This comes at zero cost for other programs that do not use the direct write feature. Should a program use this feature only sparsely and has read access for the most parts with, for example, drop return codes, then such write action can be delegated to a tail called program for mitigating this cost of potential uncloning to a late point in time where it would have been paid similarly with the bpf_skb_store_bytes() as well. Advantage of direct write is that the writes are inlined whereas the helper cannot make any length assumptions and thus needs to generate a call to memcpy() also for small sizes, as well as cost of helper call itself with sanity checks are avoided. Plus, when direct read is already used, we don't need to cache or perform rechecks on the data boundaries (due to verifier invalidating previous checks for helpers that change skb->data), so more complex programs using rewrites can benefit from switching to direct read plus write. For direct packet access to helpers, we save the otherwise needed copy into a temp struct sitting on stack memory when use-case allows. Both facilities are enabled via may_access_direct_pkt_data() in verifier. For now, we limit this to map helpers and csum_diff, and can successively enable other helpers where we find it makes sense. Helpers that definitely cannot be allowed for this are those part of bpf_helper_changes_skb_data() since they can change underlying data, and those that write into memory as this could happen for packet typed args when still cloned. bpf_csum_update() helper accommodates for the fact that we need to fixup checksum_complete when using direct write instead of bpf_skb_store_bytes(), meaning the programs can use available helpers like bpf_csum_diff(), and implement csum_add(), csum_sub(), csum_block_add(), csum_block_sub() equivalents in eBPF together with the new helper. A usage example will be provided for iproute2's examples/bpf/ directory. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* bpf: fix checksum fixups on bpf_skb_store_bytesDaniel Borkmann2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | bpf_skb_store_bytes() invocations above L2 header need BPF_F_RECOMPUTE_CSUM flag for updates, so that CHECKSUM_COMPLETE will be fixed up along the way. Where we ran into an issue with bpf_skb_store_bytes() is when we did a single-byte update on the IPv6 hoplimit despite using BPF_F_RECOMPUTE_CSUM flag; simple ping via ICMPv6 triggered a hw csum failure as a result. The underlying issue has been tracked down to a buffer alignment issue. Meaning, that csum_partial() computations via skb_postpull_rcsum() and skb_postpush_rcsum() pair invoked had a wrong result since they operated on an odd address for the hoplimit, while other computations were done on an even address. This mix doesn't work as-is with skb_postpull_rcsum(), skb_postpush_rcsum() pair as it always expects at least half-word alignment of input buffers, which is normally the case. Thus, instead of these helpers using csum_sub() and (implicitly) csum_add(), we need to use csum_block_sub(), csum_block_add(), respectively. For unaligned offsets, they rotate the sum to align it to a half-word boundary again, otherwise they work the same as csum_sub() and csum_add(). Adding __skb_postpull_rcsum(), __skb_postpush_rcsum() variants that take the offset as an input and adapting bpf_skb_store_bytes() to them fixes the hw csum failures again. The skb_postpull_rcsum(), skb_postpush_rcsum() helpers use a 0 constant for offset so that the compiler optimizes the offset & 1 test away and generates the same code as with csum_sub()/_add(). Fixes: 608cd71a9c7c ("tc: bpf: generalize pedit action") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* bpf: add bpf_skb_change_tail helperDaniel Borkmann2022-04-19
| | | | | | | | | | | | | | | | | | | | This work adds a bpf_skb_change_tail() helper for tc BPF programs. The basic idea is to expand or shrink the skb in a controlled manner. The eBPF program can then rewrite the rest via helpers like bpf_skb_store_bytes(), bpf_lX_csum_replace() and others rather than passing a raw buffer for writing here. bpf_skb_change_tail() is really a slow path helper and intended for replies with f.e. ICMP control messages. Concept is similar to other helpers like bpf_skb_change_proto() helper to keep the helper without protocol specifics and let the BPF program mangle the remaining parts. A flags field has been added and is reserved for now should we extend the helper in future. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* Merge remote-tracking branch 'common/android-4.4-p' into ↵Michael Bestas2021-10-12
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | lineage-18.1-caf-msm8998 # By Sergey Shtylyov (9) and others # Via Greg Kroah-Hartman * common/android-4.4-p: Linux 4.4.288 libata: Add ATA_HORKAGE_NO_NCQ_ON_ATI for Samsung 860 and 870 SSD. usb: testusb: Fix for showing the connection speed scsi: sd: Free scsi_disk device via put_device() ext2: fix sleeping in atomic bugs on error sparc64: fix pci_iounmap() when CONFIG_PCI is not set xen-netback: correct success/error reporting for the SKB-with-fraglist case af_unix: fix races in sk_peer_pid and sk_peer_cred accesses Linux 4.4.287 Revert "arm64: Mark __stack_chk_guard as __ro_after_init" Linux 4.4.286 cred: allow get_cred() and put_cred() to be given NULL. HID: usbhid: free raw_report buffers in usbhid_stop netfilter: ipset: Fix oversized kvmalloc() calls HID: betop: fix slab-out-of-bounds Write in betop_probe arm64: Extend workaround for erratum 1024718 to all versions of Cortex-A55 EDAC/synopsys: Fix wrong value type assignment for edac_mode ext4: fix potential infinite loop in ext4_dx_readdir() ipack: ipoctal: fix module reference leak ipack: ipoctal: fix missing allocation-failure check ipack: ipoctal: fix tty-registration error handling ipack: ipoctal: fix tty registration race ipack: ipoctal: fix stack information leak e100: fix buffer overrun in e100_get_regs e100: fix length calculation in e100_get_regs_len ipvs: check that ip_vs_conn_tab_bits is between 8 and 20 mac80211: fix use-after-free in CCMP/GCMP RX tty: Fix out-of-bound vmalloc access in imageblit qnx4: work around gcc false positive warning bug spi: Fix tegra20 build with CONFIG_PM=n net: 6pack: Fix tx timeout and slot time alpha: Declare virt_to_phys and virt_to_bus parameter as pointer to volatile arm64: Mark __stack_chk_guard as __ro_after_init parisc: Use absolute_pointer() to define PAGE0 qnx4: avoid stringop-overread errors sparc: avoid stringop-overread errors net: i825xx: Use absolute_pointer for memcpy from fixed memory location compiler.h: Introduce absolute_pointer macro m68k: Double cast io functions to unsigned long blktrace: Fix uaf in blk_trace access after removing by sysfs scsi: iscsi: Adjust iface sysfs attr detection net/mlx4_en: Don't allow aRFS for encapsulated packets net: hso: fix muxed tty registration USB: serial: option: add device id for Foxconn T99W265 USB: serial: option: remove duplicate USB device ID USB: serial: option: add Telit LN920 compositions USB: serial: mos7840: remove duplicated 0xac24 device ID USB: serial: cp210x: add ID for GW Instek GDM-834x Digital Multimeter xen/x86: fix PV trap handling on secondary processors cifs: fix incorrect check for null pointer in header_assemble usb: musb: tusb6010: uninitialized data in tusb_fifo_write_unaligned() usb: gadget: r8a66597: fix a loop in set_feature() Linux 4.4.285 sctp: validate from_addr_param return drm/nouveau/nvkm: Replace -ENOSYS with -ENODEV blk-throttle: fix UAF by deleteing timer in blk_throtl_exit() nilfs2: fix memory leak in nilfs_sysfs_delete_snapshot_group nilfs2: fix memory leak in nilfs_sysfs_create_snapshot_group nilfs2: fix memory leak in nilfs_sysfs_delete_##name##_group nilfs2: fix memory leak in nilfs_sysfs_create_##name##_group nilfs2: fix NULL pointer in nilfs_##name##_attr_release nilfs2: fix memory leak in nilfs_sysfs_create_device_group ceph: lockdep annotations for try_nonblocking_invalidate dmaengine: ioat: depends on !UML parisc: Move pci_dev_is_behind_card_dino to where it is used dmaengine: acpi: Avoid comparison GSI with Linux vIRQ dmaengine: acpi-dma: check for 64-bit MMIO address profiling: fix shift-out-of-bounds bugs prctl: allow to setup brk for et_dyn executables 9p/trans_virtio: Remove sysfs file on probe failure thermal/drivers/exynos: Fix an error code in exynos_tmu_probe() sctp: add param size validation for SCTP_PARAM_SET_PRIMARY sctp: validate chunk size in __rcv_asconf_lookup PM / wakeirq: Fix unbalanced IRQ enable for wakeirq s390/bpf: Fix optimizing out zero-extensions Linux 4.4.284 s390/bpf: Fix 64-bit subtraction of the -0x80000000 constant net: renesas: sh_eth: Fix freeing wrong tx descriptor qlcnic: Remove redundant unlock in qlcnic_pinit_from_rom ARC: export clear_user_page() for modules mtd: rawnand: cafe: Fix a resource leak in the error handling path of 'cafe_nand_probe()' PCI: Sync __pci_register_driver() stub for CONFIG_PCI=n ethtool: Fix an error code in cxgb2.c dt-bindings: mtd: gpmc: Fix the ECC bytes vs. OOB bytes equation x86/mm: Fix kern_addr_valid() to cope with existing but not present entries net/af_unix: fix a data-race in unix_dgram_poll tipc: increase timeout in tipc_sk_enqueue() r6040: Restore MDIO clock frequency after MAC reset net/l2tp: Fix reference count leak in l2tp_udp_recv_core dccp: don't duplicate ccid when cloning dccp sock ptp: dp83640: don't define PAGE0 net-caif: avoid user-triggerable WARN_ON(1) bnx2x: Fix enabling network interfaces without VFs platform/chrome: cros_ec_proto: Send command again when timeout occurs parisc: fix crash with signals and alloca net: fix NULL pointer reference in cipso_v4_doi_free ath9k: fix OOB read ar9300_eeprom_restore_internal parport: remove non-zero check on count Revert "USB: xhci: fix U1/U2 handling for hardware with XHCI_INTEL_HOST quirk set" cifs: fix wrong release in sess_alloc_buffer() failed path mmc: rtsx_pci: Fix long reads when clock is prescaled gfs2: Don't call dlm after protocol is unmounted rpc: fix gss_svc_init cleanup on failure ARM: tegra: tamonten: Fix UART pad setting gpu: drm: amd: amdgpu: amdgpu_i2c: fix possible uninitialized-variable access in amdgpu_i2c_router_select_ddc_port() Bluetooth: skip invalid hci_sync_conn_complete_evt serial: 8250_pci: make setup_port() parameters explicitly unsigned hvsi: don't panic on tty_register_driver failure xtensa: ISS: don't panic in rs_init serial: 8250: Define RX trigger levels for OxSemi 950 devices s390/jump_label: print real address in a case of a jump label bug ipv4: ip_output.c: Fix out-of-bounds warning in ip_copy_addrs() video: fbdev: riva: Error out if 'pixclock' equals zero video: fbdev: kyro: Error out if 'pixclock' equals zero video: fbdev: asiliantfb: Error out if 'pixclock' equals zero bpf/tests: Do not PASS tests without actually testing the result bpf/tests: Fix copy-and-paste error in double word test tty: serial: jsm: hold port lock when reporting modem line changes usb: gadget: u_ether: fix a potential null pointer dereference usb: host: fotg210: fix the actual_length of an iso packet Smack: Fix wrong semantics in smk_access_entry() netlink: Deal with ESRCH error in nlmsg_notify() video: fbdev: kyro: fix a DoS bug by restricting user input iio: dac: ad5624r: Fix incorrect handling of an optional regulator. PCI: Use pci_update_current_state() in pci_enable_device_flags() crypto: mxs-dcp - Use sg_mapping_iter to copy data pinctrl: single: Fix error return code in pcs_parse_bits_in_pinctrl_entry() openrisc: don't printk() unconditionally PCI: Return ~0 data on pciconfig_read() CAP_SYS_ADMIN failure PCI: Restrict ASMedia ASM1062 SATA Max Payload Size Supported ARM: 9105/1: atags_to_fdt: don't warn about stack size libata: add ATA_HORKAGE_NO_NCQ_TRIM for Samsung 860 and 870 SSDs media: rc-loopback: return number of emitters rather than error media: uvc: don't do DMA on stack VMCI: fix NULL pointer dereference when unmapping queue pair power: supply: max17042: handle fails of reading status register xen: fix setting of max_pfn in shared_info PCI/MSI: Skip masking MSI-X on Xen PV rtc: tps65910: Correct driver module alias fbmem: don't allow too huge resolutions clk: kirkwood: Fix a clocking boot regression KVM: x86: Update vCPU's hv_clock before back to guest when tsc_offset is adjusted tty: Fix data race between tiocsti() and flush_to_ldisc() ipv4: make exception cache less predictible bcma: Fix memory leak for internally-handled cores ath6kl: wmi: fix an error code in ath6kl_wmi_sync_point() usb: ehci-orion: Handle errors of clk_prepare_enable() in probe i2c: mt65xx: fix IRQ check CIFS: Fix a potencially linear read overflow mmc: moxart: Fix issue with uninitialized dma_slave_config mmc: dw_mmc: Fix issue with uninitialized dma_slave_config i2c: s3c2410: fix IRQ check i2c: iop3xx: fix deferred probing Bluetooth: add timeout sanity check to hci_inquiry usb: gadget: mv_u3d: request_irq() after initializing UDC usb: phy: tahvo: add IRQ check usb: host: ohci-tmio: add IRQ check Bluetooth: Move shutdown callback before flushing tx and rx queue usb: phy: twl6030: add IRQ checks usb: phy: fsl-usb: add IRQ check usb: gadget: udc: at91: add IRQ check drm/msm/dsi: Fix some reference counted resource leaks Bluetooth: fix repeated calls to sco_sock_kill arm64: dts: exynos: correct GIC CPU interfaces address range on Exynos7 Bluetooth: increase BTNAMSIZ to 21 chars to fix potential buffer overflow PCI: PM: Enable PME if it can be signaled from D3cold i2c: highlander: add IRQ check net: cipso: fix warnings in netlbl_cipsov4_add_std tcp: seq_file: Avoid skipping sk during tcp_seek_last_pos Bluetooth: sco: prevent information leak in sco_conn_defer_accept() media: go7007: remove redundant initialization media: dvb-usb: fix uninit-value in vp702x_read_mac_addr media: dvb-usb: fix uninit-value in dvb_usb_adapter_dvb_init certs: Trigger creation of RSA module signing key if it's not an RSA key m68k: emu: Fix invalid free in nfeth_cleanup() udf_get_extendedattr() had no boundary checks. crypto: qat - fix reuse of completion variable crypto: qat - do not ignore errors from enable_vf2pf_comms() libata: fix ata_host_start() power: supply: max17042_battery: fix typo in MAx17042_TOFF crypto: omap-sham - clear dma flags only after omap_sham_update_dma_stop() crypto: mxs-dcp - Check for DMA mapping errors PCI: Call Max Payload Size-related fixup quirks early x86/reboot: Limit Dell Optiplex 990 quirk to early BIOS versions Revert "btrfs: compression: don't try to compress if we don't have enough pages" mm/page_alloc: speed up the iteration of max_order net: ll_temac: Remove left-over debug message powerpc/boot: Delete unneeded .globl _zimage_start powerpc/module64: Fix comment in R_PPC64_ENTRY handling mm/kmemleak.c: make cond_resched() rate-limiting more efficient s390/disassembler: correct disassembly lines alignment ipv4/icmp: l3mdev: Perform icmp error route lookup on source device routing table (v2) tc358743: fix register i2c_rd/wr function fix PM / wakeirq: Enable dedicated wakeirq for suspend USB: serial: mos7720: improve OOM-handling in read_mos_reg() usb: phy: isp1301: Fix build warning when CONFIG_OF is disabled igmp: Add ip_mc_list lock in ip_check_mc_rcu media: stkwebcam: fix memory leak in stk_camera_probe ath9k: Postpone key cache entry deletion for TXQ frames reference it ath: Modify ath_key_delete() to not need full key entry ath: Export ath_hw_keysetmac() ath9k: Clear key cache explicitly on disabling hardware ath: Use safer key clearing with key cache entries ALSA: pcm: fix divide error in snd_pcm_lib_ioctl ARM: 8918/2: only build return_address() if needed cryptoloop: add a deprecation warning qede: Fix memset corruption ARC: fix allnoconfig build warning xtensa: fix kconfig unmet dependency warning for HAVE_FUTEX_CMPXCHG ext4: fix race writing to an inline_data file while its xattrs are changing Change-Id: I0d3200388e095f977c784cba314b9cc061848c7a
| * net/af_unix: fix a data-race in unix_dgram_pollEric Dumazet2021-09-22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 04f08eb44b5011493d77b602fdec29ff0f5c6cd5 upstream. syzbot reported another data-race in af_unix [1] Lets change __skb_insert() to use WRITE_ONCE() when changing skb head qlen. Also, change unix_dgram_poll() to use lockless version of unix_recvq_full() It is verry possible we can switch all/most unix_recvq_full() to the lockless version, this will be done in a future kernel version. [1] HEAD commit: 8596e589b787732c8346f0482919e83cc9362db1 BUG: KCSAN: data-race in skb_queue_tail / unix_dgram_poll write to 0xffff88814eeb24e0 of 4 bytes by task 25815 on cpu 0: __skb_insert include/linux/skbuff.h:1938 [inline] __skb_queue_before include/linux/skbuff.h:2043 [inline] __skb_queue_tail include/linux/skbuff.h:2076 [inline] skb_queue_tail+0x80/0xa0 net/core/skbuff.c:3264 unix_dgram_sendmsg+0xff2/0x1600 net/unix/af_unix.c:1850 sock_sendmsg_nosec net/socket.c:703 [inline] sock_sendmsg net/socket.c:723 [inline] ____sys_sendmsg+0x360/0x4d0 net/socket.c:2392 ___sys_sendmsg net/socket.c:2446 [inline] __sys_sendmmsg+0x315/0x4b0 net/socket.c:2532 __do_sys_sendmmsg net/socket.c:2561 [inline] __se_sys_sendmmsg net/socket.c:2558 [inline] __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2558 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x3d/0x90 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae read to 0xffff88814eeb24e0 of 4 bytes by task 25834 on cpu 1: skb_queue_len include/linux/skbuff.h:1869 [inline] unix_recvq_full net/unix/af_unix.c:194 [inline] unix_dgram_poll+0x2bc/0x3e0 net/unix/af_unix.c:2777 sock_poll+0x23e/0x260 net/socket.c:1288 vfs_poll include/linux/poll.h:90 [inline] ep_item_poll fs/eventpoll.c:846 [inline] ep_send_events fs/eventpoll.c:1683 [inline] ep_poll fs/eventpoll.c:1798 [inline] do_epoll_wait+0x6ad/0xf00 fs/eventpoll.c:2226 __do_sys_epoll_wait fs/eventpoll.c:2238 [inline] __se_sys_epoll_wait fs/eventpoll.c:2233 [inline] __x64_sys_epoll_wait+0xf6/0x120 fs/eventpoll.c:2233 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x3d/0x90 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae value changed: 0x0000001b -> 0x00000001 Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 25834 Comm: syz-executor.1 Tainted: G W 5.14.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Fixes: 86b18aaa2b5b ("skbuff: fix a data race in skb_queue_len()") Cc: Qian Cai <cai@lca.pw> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* | UPSTREAM: net: skbuff: disambiguate argument and member for ↵Jason A. Donenfeld2020-12-31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | skb_list_walk_safe helper This worked before, because we made all callers name their next pointer "next". But in trying to be more "drop-in" ready, the silliness here is revealed. This commit fixes the problem by making the macro argument and the member use different names. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: David S. Miller <davem@davemloft.net> (cherry picked from commit 5eee7bd7e245914e4e050c413dfe864e31805207) Bug: 152722841 Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: I4778eb05d0ed284543822bd3ba537379c3c19b85
* | UPSTREAM: net: introduce skb_list_walk_safe for skb segment walkingJason A. Donenfeld2020-12-31
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As part of the continual effort to remove direct usage of skb->next and skb->prev, this patch adds a helper for iterating through the singly-linked variant of skb lists, which are used for lists of GSO packet. The name "skb_list_..." has been chosen to match the existing function, "kfree_skb_list, which also operates on these singly-linked lists, and the "..._walk_safe" part is the same idiom as elsewhere in the kernel. This patch removes the helper from wireguard and puts it into linux/skbuff.h, while making it a bit more robust for general usage. In particular, parenthesis are added around the macro argument usage, and it now accounts for trying to iterate through an already-null skb pointer, which will simply run the iteration zero times. This latter enhancement means it can be used to replace both do { ... } while and while (...) open-coded idioms. This should take care of these three possible usages, which match all current methods of iterations. skb_list_walk_safe(segs, skb, next) { ... } skb_list_walk_safe(skb, skb, next) { ... } skb_list_walk_safe(segs, skb, segs) { ... } Gcc appears to generate efficient code for each of these. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: David S. Miller <davem@davemloft.net> (cherry picked from commit dcfea72e79b0aa7a057c8f6024169d86a1bbc84b) Bug: 152722841 Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: Ia8fe1237424161e4de23df4515dbb2275e15e180
* skbuff: fix a data race in skb_queue_len()Qian Cai2020-10-01
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 86b18aaa2b5b5bb48e609cd591b3d2d0fdbe0442 ] sk_buff.qlen can be accessed concurrently as noticed by KCSAN, BUG: KCSAN: data-race in __skb_try_recv_from_queue / unix_dgram_sendmsg read to 0xffff8a1b1d8a81c0 of 4 bytes by task 5371 on cpu 96: unix_dgram_sendmsg+0x9a9/0xb70 include/linux/skbuff.h:1821 net/unix/af_unix.c:1761 ____sys_sendmsg+0x33e/0x370 ___sys_sendmsg+0xa6/0xf0 __sys_sendmsg+0x69/0xf0 __x64_sys_sendmsg+0x51/0x70 do_syscall_64+0x91/0xb47 entry_SYSCALL_64_after_hwframe+0x49/0xbe write to 0xffff8a1b1d8a81c0 of 4 bytes by task 1 on cpu 99: __skb_try_recv_from_queue+0x327/0x410 include/linux/skbuff.h:2029 __skb_try_recv_datagram+0xbe/0x220 unix_dgram_recvmsg+0xee/0x850 ____sys_recvmsg+0x1fb/0x210 ___sys_recvmsg+0xa2/0xf0 __sys_recvmsg+0x66/0xf0 __x64_sys_recvmsg+0x51/0x70 do_syscall_64+0x91/0xb47 entry_SYSCALL_64_after_hwframe+0x49/0xbe Since only the read is operating as lockless, it could introduce a logic bug in unix_recvq_full() due to the load tearing. Fix it by adding a lockless variant of skb_queue_len() and unix_recvq_full() where READ_ONCE() is on the read while WRITE_ONCE() is on the write similar to the commit d7d16a89350a ("net: add skb_queue_empty_lockless()"). Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
* net: add __must_check to skb_put_padto()Eric Dumazet2020-10-01
| | | | | | | | | | | [ Upstream commit 4a009cb04aeca0de60b73f37b102573354214b52 ] skb_put_padto() and __skb_put_padto() callers must check return values or risk use-after-free. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* net/flow_dissector: switch to siphashEric Dumazet2019-11-10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 55667441c84fa5e0911a0aac44fb059c15ba6da2 upstream. UDP IPv6 packets auto flowlabels are using a 32bit secret (static u32 hashrnd in net/core/flow_dissector.c) and apply jhash() over fields known by the receivers. Attackers can easily infer the 32bit secret and use this information to identify a device and/or user, since this 32bit secret is only set at boot time. Really, using jhash() to generate cookies sent on the wire is a serious security concern. Trying to change the rol32(hash, 16) in ip6_make_flowlabel() would be a dead end. Trying to periodically change the secret (like in sch_sfq.c) could change paths taken in the network for long lived flows. Let's switch to siphash, as we did in commit df453700e8d8 ("inet: switch IP ID generator to siphash") Using a cryptographically strong pseudo random function will solve this privacy issue and more generally remove other weak points in the stack. Packet schedulers using skb_get_hash_perturb() benefit from this change. Fixes: b56774163f99 ("ipv6: Enable auto flow labels by default") Fixes: 42240901f7c4 ("ipv6: Implement different admin modes for automatic flow labels") Fixes: 67800f9b1f4e ("ipv6: Call skb_get_hash_flowi6 to get skb->hash in ip6_make_flowlabel") Fixes: cb1ce2ef387b ("ipv6: Implement automatic flow label generation on transmit") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Jonathan Berger <jonathann1@walla.com> Reported-by: Amit Klein <aksecurity@gmail.com> Reported-by: Benny Pinkas <benny@pinkas.net> Cc: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Mahesh Bandewar <maheshb@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* net: create skb_gso_validate_mac_len()Daniel Axtens2019-06-11
| | | | | | | | | | | | | | | | | | | | commit 2b16f048729bf35e6c28a40cbfad07239f9dcd90 upstream. If you take a GSO skb, and split it into packets, will the MAC length (L2 + L3 + L4 headers + payload) of those packets be small enough to fit within a given length? Move skb_gso_mac_seglen() to skbuff.h with other related functions like skb_gso_network_seglen() so we can use it, and then create skb_gso_validate_mac_len to do the full calculation. Signed-off-by: Daniel Axtens <dja@axtens.net> Signed-off-by: David S. Miller <davem@davemloft.net> [bwh: Backported to 4.4: There is no GSO_BY_FRAGS case to handle, so skb_gso_validate_mac_len() becomes a trivial comparison. Put it inline in <linux/skbuff.h>.] Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ip: use rb trees for IP frag queue.Peter Oskolkov2019-02-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit fa0f527358bd900ef92f925878ed6bfbd51305cc upstream. Similar to TCP OOO RX queue, it makes sense to use rb trees to store IP fragments, so that OOO fragments are inserted faster. Tested: - a follow-up patch contains a rather comprehensive ip defrag self-test (functional) - ran neper `udp_stream -c -H <host> -F 100 -l 300 -T 20`: netstat --statistics Ip: 282078937 total packets received 0 forwarded 0 incoming packets discarded 946760 incoming packets delivered 18743456 requests sent out 101 fragments dropped after timeout 282077129 reassemblies required 944952 packets reassembled ok 262734239 packet reassembles failed (The numbers/stats above are somewhat better re: reassemblies vs a kernel without this patchset. More comprehensive performance testing TBD). Reported-by: Jann Horn <jannh@google.com> Reported-by: Juha-Matti Tilli <juha-matti.tilli@iki.fi> Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Peter Oskolkov <posk@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Mao Wenan <maowenan@huawei.com> [bwh: Backported to 4.4: - Keep using frag_kfree_skb() in inet_frag_destroy() - Adjust context] Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friendsEric Dumazet2019-02-08
| | | | | | | | | | | | | | | | | | | | commit 88078d98d1bb085d72af8437707279e203524fa5 upstream. After working on IP defragmentation lately, I found that some large packets defeat CHECKSUM_COMPLETE optimization because of NIC adding zero paddings on the last (small) fragment. While removing the padding with pskb_trim_rcsum(), we set skb->ip_summed to CHECKSUM_NONE, forcing a full csum validation, even if all prior fragments had CHECKSUM_COMPLETE set. We can instead compute the checksum of the part we are trimming, usually smaller than the part we keep. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* net: modify skb_rbtree_purge to return the truesize of all purged skbs.Peter Oskolkov2019-02-08
| | | | | | | | | | | | | | | commit 385114dec8a49b5e5945e77ba7de6356106713f4 upstream. Tested: see the next patch is the series. Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Peter Oskolkov <posk@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Mao Wenan <maowenan@huawei.com> Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* inet: frags: get rid of ipfrag_skb_cb/FRAG_CBEric Dumazet2019-02-08
| | | | | | | | | | | | | | | | | | | | | | | | | | commit bf66337140c64c27fa37222b7abca7e49d63fb57 upstream. ip_defrag uses skb->cb[] to store the fragment offset, and unfortunately this integer is currently in a different cache line than skb->next, meaning that we use two cache lines per skb when finding the insertion point. By aliasing skb->ip_defrag_offset and skb->dev, we pack all the fields in a single cache line and save precious memory bandwidth. Note that after the fast path added by Changli Gao in commit d6bebca92c66 ("fragment: add fast path for in-order fragments") this change wont help the fast path, since we still need to access prev->len (2nd cache line), but will show great benefits when slow path is entered, since we perform a linear scan of a potentially long list. Also, note that this potential long list is an attack vector, we might consider also using an rb-tree there eventually. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* net: Fix usage of pskb_trim_rcsumRoss Lagerwall2019-02-06
| | | | | | | | | | | | | [ Upstream commit 6c57f0458022298e4da1729c67bd33ce41c14e7a ] In certain cases, pskb_trim_rcsum() may change skb pointers. Reinitialize header pointers afterwards to avoid potential use-after-frees. Add a note in the documentation of pskb_trim_rcsum(). Found by KASAN. Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* tcp: use an RB tree for ooo receive queueYaogong Wang2018-10-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 9f5afeae51526b3ad7b7cb21ee8b145ce6ea7a7a ] Over the years, TCP BDP has increased by several orders of magnitude, and some people are considering to reach the 2 Gbytes limit. Even with current window scale limit of 14, ~1 Gbytes maps to ~740,000 MSS. In presence of packet losses (or reorders), TCP stores incoming packets into an out of order queue, and number of skbs sitting there waiting for the missing packets to be received can be in the 10^5 range. Most packets are appended to the tail of this queue, and when packets can finally be transferred to receive queue, we scan the queue from its head. However, in presence of heavy losses, we might have to find an arbitrary point in this queue, involving a linear scan for every incoming packet, throwing away cpu caches. This patch converts it to a RB tree, to get bounded latencies. Yaogong wrote a preliminary patch about 2 years ago. Eric did the rebase, added ofo_last_skb cache, polishing and tests. Tested with network dropping between 1 and 10 % packets, with good success (about 30 % increase of throughput in stress tests) Next step would be to also use an RB tree for the write queue at sender side ;) Signed-off-by: Yaogong Wang <wygivan@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Acked-By: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Mao Wenan <maowenan@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* net: Don't copy pfmemalloc flag in __copy_skb_header()Stefano Brivio2018-07-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 8b7008620b8452728cadead460a36f64ed78c460 ] The pfmemalloc flag indicates that the skb was allocated from the PFMEMALLOC reserves, and the flag is currently copied on skb copy and clone. However, an skb copied from an skb flagged with pfmemalloc wasn't necessarily allocated from PFMEMALLOC reserves, and on the other hand an skb allocated that way might be copied from an skb that wasn't. So we should not copy the flag on skb copy, and rather decide whether to allow an skb to be associated with sockets unrelated to page reclaim depending only on how it was allocated. Move the pfmemalloc flag before headers_start[0] using an existing 1-bit hole, so that __copy_skb_header() doesn't copy it. When cloning, we'll now take care of this flag explicitly, contravening to the warning comment of __skb_clone(). While at it, restore the newline usage introduced by commit b19372273164 ("net: reorganize sk_buff for faster __copy_skb_header()") to visually separate bytes used in bitfields after headers_start[0], that was gone after commit a9e419dc7be6 ("netfilter: merge ctinfo into nfct pointer storage area"), and describe the pfmemalloc flag in the kernel-doc structure comment. This doesn't change the size of sk_buff or cacheline boundaries, but consolidates the 15 bits hole before tc_index into a 2 bytes hole before csum, that could now be filled more easily. Reported-by: Patrick Talbert <ptalbert@redhat.com> Fixes: c93bdd0e03e8 ("netvm: allow skb allocation to use PFMEMALLOC reserves") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* skbuff: return -EMSGSIZE in skb_to_sgvec to prevent overflowJason A. Donenfeld2018-04-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 48a1df65334b74bd7531f932cca5928932abf769 ] This is a defense-in-depth measure in response to bugs like 4d6fa57b4dab ("macsec: avoid heap overflow in skb_to_sgvec"). There's not only a potential overflow of sglist items, but also a stack overflow potential, so we fix this by limiting the amount of recursion this function is allowed to do. Not actually providing a bounded base case is a future disaster that we can easily avoid here. As a small matter of house keeping, we take this opportunity to move the documentation comment over the actual function the documentation is for. While this could be implemented by using an explicit stack of skbuffs, when implementing this, the function complexity increased considerably, and I don't think such complexity and bloat is actually worth it. So, instead I built this and tested it on x86, x86_64, ARM, ARM64, and MIPS, and measured the stack usage there. I also reverted the recent MIPS changes that give it a separate IRQ stack, so that I could experience some worst-case situations. I found that limiting it to 24 layers deep yielded a good stack usage with room for safety, as well as being much deeper than any driver actually ever creates. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: "David S. Miller" <davem@davemloft.net> Cc: David Howells <dhowells@redhat.com> Cc: Sabrina Dubroca <sd@queasysnail.net> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* netfilter/ipvs: clear ipvs_property flag when SKB net namespace changedYe Yin2017-11-24
| | | | | | | | | | | | | | | | | [ Upstream commit 2b5ec1a5f9738ee7bf8f5ec0526e75e00362c48f ] When run ipvs in two different network namespace at the same host, and one ipvs transport network traffic to the other network namespace ipvs. 'ipvs_property' flag will make the second ipvs take no effect. So we should clear 'ipvs_property' when SKB network namespace changed. Fixes: 621e84d6f373 ("dev: introduce skb_scrub_packet()") Signed-off-by: Ye Yin <hustcat@gmail.com> Signed-off-by: Wei Zhou <chouryzhou@gmail.com> Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* net: better skb->sender_cpu and skb->napi_id cohabitationEric Dumazet2017-06-14
| | | | | | | | | | | | | | | | | | | | | | | | | | | commit 52bd2d62ce6758d811edcbd2256eb9ea7f6a56cb upstream. skb->sender_cpu and skb->napi_id share a common storage, and we had various bugs about this. We had to call skb_sender_cpu_clear() in some places to not leave a prior skb->napi_id and fool netdev_pick_tx() As suggested by Alexei, we could split the space so that these errors can not happen. 0 value being reserved as the common (not initialized) value, let's reserve [1 .. NR_CPUS] range for valid sender_cpu, and [NR_CPUS+1 .. ~0U] for valid napi_id. This will allow proper busy polling support over tunnels. Signed-off-by: Eric Dumazet <edumazet@google.com> Suggested-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Cc: Paul Menzel <pmenzel@molgen.mpg.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* net_sched: fix mirrored packets checksumWANG Cong2016-07-27
| | | | | | | | | | | | | | | [ Upstream commit 82a31b9231f02d9c1b7b290a46999d517b0d312a ] Similar to commit 9b368814b336 ("net: fix bridge multicast packet checksum validation") we need to fixup the checksum for CHECKSUM_COMPLETE when pushing skb on RX path. Otherwise we get similar splats. Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Tom Herbert <tom@herbertland.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* packet: Use symmetric hash for PACKET_FANOUT_HASH.David S. Miller2016-07-27
| | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit eb70db8756717b90c01ccc765fdefc4dd969fc74 ] People who use PACKET_FANOUT_HASH want a symmetric hash, meaning that they want packets going in both directions on a flow to hash to the same bucket. The core kernel SKB hash became non-symmetric when the ipv6 flow label and other entities were incorporated into the standard flow hash order to increase entropy. But there are no users of PACKET_FANOUT_HASH who want an assymetric hash, they all want a symmetric one. Therefore, use the flow dissector to compute a flat symmetric hash over only the protocol, addresses and ports. This hash does not get installed into and override the normal skb hash, so this change has no effect whatsoever on the rest of the stack. Reported-by: Eric Leblond <eric@regit.org> Tested-by: Eric Leblond <eric@regit.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* bpf: try harder on clones when writing into skbDaniel Borkmann2016-07-11
| | | | | | | | | | | | | | [ Upstream commit 3697649ff29e0f647565eed04b27a7779c646a22 ] When we're dealing with clones and the area is not writeable, try harder and get a copy via pskb_expand_head(). Replace also other occurences in tc actions with the new skb_try_make_writable(). Reported-by: Ashhad Sheikh <ashhadsheikh394@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* mld, igmp: Fix reserved tailroom calculationBenjamin Poirier2016-04-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 1837b2e2bcd23137766555a63867e649c0b637f0 ] The current reserved_tailroom calculation fails to take hlen and tlen into account. skb: [__hlen__|__data____________|__tlen___|__extra__] ^ ^ head skb_end_offset In this representation, hlen + data + tlen is the size passed to alloc_skb. "extra" is the extra space made available in __alloc_skb because of rounding up by kmalloc. We can reorder the representation like so: [__hlen__|__data____________|__extra__|__tlen___] ^ ^ head skb_end_offset The maximum space available for ip headers and payload without fragmentation is min(mtu, data + extra). Therefore, reserved_tailroom = data + extra + tlen - min(mtu, data + extra) = skb_end_offset - hlen - min(mtu, skb_end_offset - hlen - tlen) = skb_tailroom - min(mtu, skb_tailroom - tlen) ; after skb_reserve(hlen) Compare the second line to the current expression: reserved_tailroom = skb_end_offset - min(mtu, skb_end_offset) and we can see that hlen and tlen are not taken into account. The min() in the third line can be expanded into: if mtu < skb_tailroom - tlen: reserved_tailroom = skb_tailroom - mtu else: reserved_tailroom = tlen Depending on hlen, tlen, mtu and the number of multicast address records, the current code may output skbs that have less tailroom than dev->needed_tailroom or it may output more skbs than needed because not all space available is used. Fixes: 4c672e4b ("ipv6: mld: fix add_grhead skb_over_panic for devs with large MTUs") Signed-off-by: Benjamin Poirier <bpoirier@suse.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* net: fix bridge multicast packet checksum validationLinus Lüssing2016-04-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 9b368814b336b0a1a479135eb2815edbc00efd3c ] We need to update the skb->csum after pulling the skb, otherwise an unnecessary checksum (re)computation can ocure for IGMP/MLD packets in the bridge code. Additionally this fixes the following splats for network devices / bridge ports with support for and enabled RX checksum offloading: [...] [ 43.986968] eth0: hw csum failure [ 43.990344] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.4.0 #2 [ 43.996193] Hardware name: BCM2709 [ 43.999647] [<800204e0>] (unwind_backtrace) from [<8001cf14>] (show_stack+0x10/0x14) [ 44.007432] [<8001cf14>] (show_stack) from [<801ab614>] (dump_stack+0x80/0x90) [ 44.014695] [<801ab614>] (dump_stack) from [<802e4548>] (__skb_checksum_complete+0x6c/0xac) [ 44.023090] [<802e4548>] (__skb_checksum_complete) from [<803a055c>] (ipv6_mc_validate_checksum+0x104/0x178) [ 44.032959] [<803a055c>] (ipv6_mc_validate_checksum) from [<802e111c>] (skb_checksum_trimmed+0x130/0x188) [ 44.042565] [<802e111c>] (skb_checksum_trimmed) from [<803a06e8>] (ipv6_mc_check_mld+0x118/0x338) [ 44.051501] [<803a06e8>] (ipv6_mc_check_mld) from [<803b2c98>] (br_multicast_rcv+0x5dc/0xd00) [ 44.060077] [<803b2c98>] (br_multicast_rcv) from [<803aa510>] (br_handle_frame_finish+0xac/0x51c) [...] Fixes: 9afd85c9e455 ("net: Export IGMP/MLD message validation code") Reported-by: Álvaro Fernández Rojas <noltari@gmail.com> Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* net:Add sysctl_max_skb_fragsHans Westgaard Ry2016-03-03
| | | | | | | | | | | | | | | | | | [ Upstream commit 5f74f82ea34c0da80ea0b49192bb5ea06e063593 ] Devices may have limits on the number of fragments in an skb they support. Current codebase uses a constant as maximum for number of fragments one skb can hold and use. When enabling scatter/gather and running traffic with many small messages the codebase uses the maximum number of fragments and may thereby violate the max for certain devices. The patch introduces a global variable as max number of fragments. Signed-off-by: Hans Westgaard Ry <hans.westgaard.ry@oracle.com> Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* net: preserve IP control block during GSO segmentationKonstantin Khlebnikov2016-01-31
| | | | | | | | | | | | | | | | | [ Upstream commit 9207f9d45b0ad071baa128e846d7e7ed85016df3 ] Skb_gso_segment() uses skb control block during segmentation. This patch adds 32-bytes room for previous control block which will be copied into all resulting segments. This patch fixes kernel crash during fragmenting forwarded packets. Fragmentation requires valid IP CB in skb for clearing ip options. Also patch removes custom save/restore in ovs code, now it's redundant. Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com> Link: http://lkml.kernel.org/r/CALYGNiP-0MZ-FExV2HutTvE9U-QQtkKSoE--KN=JQE5STYsjAA@mail.gmail.com Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* mm, page_alloc: distinguish between being unable to sleep, unwilling to ↵Mel Gorman2015-11-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | sleep and avoiding waking kswapd __GFP_WAIT has been used to identify atomic context in callers that hold spinlocks or are in interrupts. They are expected to be high priority and have access one of two watermarks lower than "min" which can be referred to as the "atomic reserve". __GFP_HIGH users get access to the first lower watermark and can be called the "high priority reserve". Over time, callers had a requirement to not block when fallback options were available. Some have abused __GFP_WAIT leading to a situation where an optimisitic allocation with a fallback option can access atomic reserves. This patch uses __GFP_ATOMIC to identify callers that are truely atomic, cannot sleep and have no alternative. High priority users continue to use __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM to identify callers that want to wake kswapd for background reclaim. __GFP_WAIT is redefined as a caller that is willing to enter direct reclaim and wake kswapd for background reclaim. This patch then converts a number of sites o __GFP_ATOMIC is used by callers that are high priority and have memory pools for those requests. GFP_ATOMIC uses this flag. o Callers that have a limited mempool to guarantee forward progress clear __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall into this category where kswapd will still be woken but atomic reserves are not used as there is a one-entry mempool to guarantee progress. o Callers that are checking if they are non-blocking should use the helper gfpflags_allow_blocking() where possible. This is because checking for __GFP_WAIT as was done historically now can trigger false positives. Some exceptions like dm-crypt.c exist where the code intent is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to flag manipulations. o Callers that built their own GFP flags instead of starting with GFP_KERNEL and friends now also need to specify __GFP_KSWAPD_RECLAIM. The first key hazard to watch out for is callers that removed __GFP_WAIT and was depending on access to atomic reserves for inconspicuous reasons. In some cases it may be appropriate for them to use __GFP_HIGH. The second key hazard is callers that assembled their own combination of GFP flags instead of starting with something like GFP_KERNEL. They may now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless if it's missed in most cases as other activity will wake kswapd. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* tcp: skb_mstamp_after helperYuchung Cheng2015-10-21
| | | | | | | | | a helper to prepare the first main RACK patch. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* skbuff: Fix skb checksum partial check.Pravin B Shelar2015-09-29
| | | | | | | | | | | | | | | | Earlier patch 6ae459bda tried to detect void ckecksum partial skb by comparing pull length to checksum offset. But it does not work for all cases since checksum-offset depends on updates to skb->data. Following patch fixes it by validating checksum start offset after skb-data pointer is updated. Negative value of checksum offset start means there is no need to checksum. Fixes: 6ae459bda ("skbuff: Fix skb checksum flag on skb pull") Reported-by: Andrew Vagin <avagin@odin.com> Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* skbuff: Fix skb checksum flag on skb pullPravin B Shelar2015-09-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | VXLAN device can receive skb with checksum partial. But the checksum offset could be in outer header which is pulled on receive. This results in negative checksum offset for the skb. Such skb can cause the assert failure in skb_checksum_help(). Following patch fixes the bug by setting checksum-none while pulling outer header. Following is the kernel panic msg from old kernel hitting the bug. ------------[ cut here ]------------ kernel BUG at net/core/dev.c:1906! RIP: 0010:[<ffffffff81518034>] skb_checksum_help+0x144/0x150 Call Trace: <IRQ> [<ffffffffa0164c28>] queue_userspace_packet+0x408/0x470 [openvswitch] [<ffffffffa016614d>] ovs_dp_upcall+0x5d/0x60 [openvswitch] [<ffffffffa0166236>] ovs_dp_process_packet_with_key+0xe6/0x100 [openvswitch] [<ffffffffa016629b>] ovs_dp_process_received_packet+0x4b/0x80 [openvswitch] [<ffffffffa016c51a>] ovs_vport_receive+0x2a/0x30 [openvswitch] [<ffffffffa0171383>] vxlan_rcv+0x53/0x60 [openvswitch] [<ffffffffa01734cb>] vxlan_udp_encap_recv+0x8b/0xf0 [openvswitch] [<ffffffff8157addc>] udp_queue_rcv_skb+0x2dc/0x3b0 [<ffffffff8157b56f>] __udp4_lib_rcv+0x1cf/0x6c0 [<ffffffff8157ba7a>] udp_rcv+0x1a/0x20 [<ffffffff8154fdbd>] ip_local_deliver_finish+0xdd/0x280 [<ffffffff81550128>] ip_local_deliver+0x88/0x90 [<ffffffff8154fa7d>] ip_rcv_finish+0x10d/0x370 [<ffffffff81550365>] ip_rcv+0x235/0x300 [<ffffffff8151ba1d>] __netif_receive_skb+0x55d/0x620 [<ffffffff8151c360>] netif_receive_skb+0x80/0x90 [<ffffffff81459935>] virtnet_poll+0x555/0x6f0 [<ffffffff8151cd04>] net_rx_action+0x134/0x290 [<ffffffff810683d8>] __do_softirq+0xa8/0x210 [<ffffffff8162fe6c>] call_softirq+0x1c/0x30 [<ffffffff810161a5>] do_softirq+0x65/0xa0 [<ffffffff810687be>] irq_exit+0x8e/0xb0 [<ffffffff81630733>] do_IRQ+0x63/0xe0 [<ffffffff81625f2e>] common_interrupt+0x6e/0x6e Reported-by: Anupam Chanda <achanda@vmware.com> Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Acked-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* netfilter: bridge: fix routing of bridge frames with call-iptables=1Florian Westphal2015-09-14
| | | | | | | | | | | | | | | | | We can't re-use the physoutdev storage area. 1. When using NFQUEUE in PREROUTING, we attempt to bump a bogus refcnt since nf_bridge->physoutdev is garbage (ipv4/ipv6 address) 2. for same reason, we crash in physdev match in FORWARD or later if skb is routed instead of bridged. This increases nf_bridge_info to 40 bytes, but we have no other choice. Fixes: 72b1e5e4cac7 ("netfilter: bridge: reduce nf_bridge_info to 32 bytes again") Reported-by: Sander Eikelenboom <linux@eikelenboom.it> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* flow_dissector: Use 'const' where possible.David S. Miller2015-09-01
| | | | Signed-off-by: David S. Miller <davem@davemloft.net>
* flow_dissector: Fix function argument ordering dependencyTom Herbert2015-09-01
| | | | | | | | | | | | Commit c6cc1ca7f4d70c ("flowi: Abstract out functions to get flow hash based on flowi") introduced a bug in __skb_set_sw_hash where we require a dependency on evaluating arguments in a function in order. There is no such ordering enforced in C, so this incorrect. This patch fixes that by splitting out the arguments. This bug was found via a compiler warning that keys may be uninitialized. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* flow_dissector: Add flags argument to skb_flow_dissector functionsTom Herbert2015-09-01
| | | | | | | | The flags argument will allow control of the dissection process (for instance whether to parse beyond L3). Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* flowi: Abstract out functions to get flow hash based on flowiTom Herbert2015-09-01
| | | | | | | | | | | Create __get_hash_from_flowi6 and __get_hash_from_flowi4 to get the flow keys and hash based on flowi structures. These are called by __skb_get_hash_flowi6 and __skb_get_hash_flowi4. Also, created get_hash_from_flowi6 and get_hash_from_flowi4 which can be called when just the hash value for a flowi is needed. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* skbuff: Make __skb_set_sw_hash a general functionTom Herbert2015-09-01
| | | | | | | | | | | Move __skb_set_sw_hash to skbuff.h and add __skb_set_hash which is a common method (between __skb_set_sw_hash and skb_set_hash) to set the hash in an skbuff. Also, move skb_clear_hash to be closer to __skb_set_hash. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* flow_dissector: Move skb related functions to skbuff.hTom Herbert2015-09-01
| | | | | | | | | Move the flow dissector functions that are specific to skbuffs into skbuff.h out of flow_dissector.h. This makes flow_dissector.h have no dependencies on skbuff.h. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2015-08-27
|\
| * mm: make page pfmemalloc check more robustMichal Hocko2015-08-21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit c48a11c7ad26 ("netvm: propagate page->pfmemalloc to skb") added checks for page->pfmemalloc to __skb_fill_page_desc(): if (page->pfmemalloc && !page->mapping) skb->pfmemalloc = true; It assumes page->mapping == NULL implies that page->pfmemalloc can be trusted. However, __delete_from_page_cache() can set set page->mapping to NULL and leave page->index value alone. Due to being in union, a non-zero page->index will be interpreted as true page->pfmemalloc. So the assumption is invalid if the networking code can see such a page. And it seems it can. We have encountered this with a NFS over loopback setup when such a page is attached to a new skbuf. There is no copying going on in this case so the page confuses __skb_fill_page_desc which interprets the index as pfmemalloc flag and the network stack drops packets that have been allocated using the reserves unless they are to be queued on sockets handling the swapping which is the case here and that leads to hangs when the nfs client waits for a response from the server which has been dropped and thus never arrive. The struct page is already heavily packed so rather than finding another hole to put it in, let's do a trick instead. We can reuse the index again but define it to an impossible value (-1UL). This is the page index so it should never see the value that large. Replace all direct users of page->pfmemalloc by page_is_pfmemalloc which will hide this nastiness from unspoiled eyes. The information will get lost if somebody wants to use page->index obviously but that was the case before and the original code expected that the information should be persisted somewhere else if that is really needed (e.g. what SLAB and SLUB do). [akpm@linux-foundation.org: fix blooper in slub] Fixes: c48a11c7ad26 ("netvm: propagate page->pfmemalloc to skb") Signed-off-by: Michal Hocko <mhocko@suse.com> Debugged-by: Vlastimil Babka <vbabka@suse.com> Debugged-by: Jiri Bohac <jbohac@suse.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David Miller <davem@davemloft.net> Acked-by: Mel Gorman <mgorman@suse.de> Cc: <stable@vger.kernel.org> [3.6+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2015-08-13
|\| | | | | | | | | | | | | | | | | | | Conflicts: drivers/net/ethernet/cavium/Kconfig The cavium conflict was overlapping dependency changes. Signed-off-by: David S. Miller <davem@davemloft.net>
| * net-timestamp: Update skb_complete_tx_timestamp commentBenjamin Poirier2015-08-10
| | | | | | | | | | | | | | | | | | | | After "62bccb8 net-timestamp: Make the clone operation stand-alone from phy timestamping" the hwtstamps parameter of skb_complete_tx_timestamp() may no longer be NULL. Signed-off-by: Benjamin Poirier <bpoirier@suse.com> Cc: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller2015-08-04
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for net-next, they are: 1) A couple of cleanups for the netfilter core hook from Eric Biederman. 2) Net namespace hook registration, also from Eric. This adds a dependency with the rtnl_lock. This should be fine by now but we have to keep an eye on this because if we ever get the per-subsys nfnl_lock before rtnl we have may problems in the future. But we have room to remove this in the future by propagating the complexity to the clients, by registering hooks for the init netns functions. 3) Update nf_tables to use the new net namespace hook infrastructure, also from Eric. 4) Three patches to refine and to address problems from the new net namespace hook infrastructure. 5) Switch to alternate jumpstack in xtables iff the packet is reentering. This only applies to a very special case, the TEE target, but Eric Dumazet reports that this is slowing down things for everyone else. So let's only switch to the alternate jumpstack if the tee target is in used through a static key. This batch also comes with offline precalculation of the jumpstack based on the callchain depth. From Florian Westphal. 6) Minimal SCTP multihoming support for our conntrack helper, from Michal Kubecek. 7) Reduce nf_bridge_info per skbuff scratchpad area to 32 bytes, from Florian Westphal. 8) Fix several checkpatch errors in bridge netfilter, from Bernhard Thaler. 9) Get rid of useless debug message in ip6t_REJECT, from Subash Abhinov. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * | netfilter: bridge: reduce nf_bridge_info to 32 bytes againFlorian Westphal2015-07-30
| |/ | | | | | | | | | | | | | | | | | | | | | | | | We can use union for most of the temporary cruft (original ipv4/ipv6 address, source mac, physoutdev) since they're used during different stages of br netfilter traversal. Also get rid of the last two ->mask users. Shrinks struct from 48 to 32 on 64bit arch. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* | net: Add functions to get skb->hash based on flow structuresTom Herbert2015-07-31
| | | | | | | | | | | | | | | | | | | | | | Add skb_get_hash_flowi6 and skb_get_hash_flowi4 which derive an sk_buff hash from flowi6 and flowi4 structures respectively. These functions can be called when creating a packet in the output path where the new sk_buff does not yet contain a fully formed packet that is parsable by flow dissector. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | vxlan: Flow based tunnelingThomas Graf2015-07-21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Allows putting a VXLAN device into a new flow-based mode in which skbs with a ip_tunnel_info dst metadata attached will be encapsulated according to the instructions stored in there with the VXLAN device defaults taken into consideration. Similar on the receive side, if the VXLAN_F_COLLECT_METADATA flag is set, the packet processing will populate a ip_tunnel_info struct for each packet received and attach it to the skb using the new metadata dst. The metadata structure will contain the outer header and tunnel header fields which have been stripped off. Layers further up in the stack such as routing, tc or netfitler can later match on these fields and perform forwarding. It is the responsibility of upper layers to ensure that the flag is set if the metadata is needed. The flag limits the additional cost of metadata collecting based on demand. This prepares the VXLAN device to be steered by the routing and other subsystems which allows to support encapsulation for a large number of tunnel endpoints and tunnel ids through a single net_device which improves the scalability. It also allows for OVS to leverage this mode which in turn allows for the removal of the OVS specific VXLAN code. Because the skb is currently scrubed in vxlan_rcv(), the attachment of the new dst metadata is postponed until after scrubing which requires the temporary addition of a new member to vxlan_metadata. This member is removed again in a later commit after the indirect VXLAN receive API has been removed. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: remove skb_frag_add_headJiri Benc2015-07-20
| | | | | | | | | | | | | | It's not used anywhere. Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>