summaryrefslogtreecommitdiff
path: root/Documentation/networking (unfollow)
Commit message (Collapse)Author
2024-10-17Merge remote-tracking branch 'msm8998/lineage-20' into lineage-20Raghuram Subramani
Change-Id: I126075a330f305c85f8fe1b8c9d408f368be95d1
2024-01-10bpf: add BPF_J{LT,LE,SLT,SLE} instructionsDaniel Borkmann
Currently, eBPF only understands BPF_JGT (>), BPF_JGE (>=), BPF_JSGT (s>), BPF_JSGE (s>=) instructions, this means that particularly *JLT/*JLE counterparts involving immediates need to be rewritten from e.g. X < [IMM] by swapping arguments into [IMM] > X, meaning the immediate first is required to be loaded into a register Y := [IMM], such that then we can compare with Y > X. Note that the destination operand is always required to be a register. This has the downside of having unnecessarily increased register pressure, meaning complex program would need to spill other registers temporarily to stack in order to obtain an unused register for the [IMM]. Loading to registers will thus also affect state pruning since we need to account for that register use and potentially those registers that had to be spilled/filled again. As a consequence slightly more stack space might have been used due to spilling, and BPF programs are a bit longer due to extra code involving the register load and potentially required spill/fills. Thus, add BPF_JLT (<), BPF_JLE (<=), BPF_JSLT (s<), BPF_JSLE (s<=) counterparts to the eBPF instruction set. Modifying LLVM to remove the NegateCC() workaround in a PoC patch at [1] and allowing it to also emit the new instructions resulted in cilium's BPF programs that are injected into the fast-path to have a reduced program length in the range of 2-3% (e.g. accumulated main and tail call sections from one of the object file reduced from 4864 to 4729 insns), reduced complexity in the range of 10-30% (e.g. accumulated sections reduced in one of the cases from 116432 to 88428 insns), and reduced stack usage in the range of 1-5% (e.g. accumulated sections from one of the object files reduced from 824 to 784b). The modification for LLVM will be incorporated in a backwards compatible way. Plan is for LLVM to have i) a target specific option to offer a possibility to explicitly enable the extension by the user (as we have with -m target specific extensions today for various CPU insns), and ii) have the kernel checked for presence of the extensions and enable them transparently when the user is selecting more aggressive options such as -march=native in a bpf target context. (Other frontends generating BPF byte code, e.g. ply can probe the kernel directly for its code generation.) [1] https://github.com/borkmann/llvm/tree/bpf-insns Change-Id: Ic56500aaeaf5f3ebdfda094ad6ef4666c82e18c5 Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-29bonding: fix ad_actor_system option setting to defaultFernando Fernandez Mancera
[ Upstream commit 1c15b05baea71a5ff98235783e3e4ad227760876 ] When 802.3ad bond mode is configured the ad_actor_system option is set to "00:00:00:00:00:00". But when trying to set the all-zeroes MAC as actors' system address it was failing with EINVAL. An all-zeroes ethernet address is valid, only multicast addresses are not valid values. Fixes: 171a42c38c6e ("bonding: add netlink support for sys prio, actor sys mac, and port key") Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net> Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com> Link: https://lore.kernel.org/r/20211221111345.2462-1-ffmancera@riseup.net Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-04-08icmp: randomize the global rate limiterEric Dumazet
[ Upstream commit b38e7819cae946e2edf869e604af1e65a5d241c5 ] Keyu Man reported that the ICMP rate limiter could be used by attackers to get useful signal. Details will be provided in an upcoming academic publication. Our solution is to add some noise, so that the attackers no longer can get help from the predictable token bucket limiter. Fixes: 4cdf507d5452 ("icmp: add a global rate limitation") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Keyu Man <kman001@ucr.edu> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Change-Id: I8373925e50fd35fd35d59fd32b8a0e7da9257683 Git-commit: d6c552505c0d1719dda42b4af2def0618bd7bf54 Git-repo: https://android.googlesource.com/kernel/msm Signed-off-by: urevanth <urevanth@codeaurora.org>
2020-10-29icmp: randomize the global rate limiterEric Dumazet
[ Upstream commit b38e7819cae946e2edf869e604af1e65a5d241c5 ] Keyu Man reported that the ICMP rate limiter could be used by attackers to get useful signal. Details will be provided in an upcoming academic publication. Our solution is to add some noise, so that the attackers no longer can get help from the predictable token bucket limiter. Fixes: 4cdf507d5452 ("icmp: add a global rate limitation") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Keyu Man <kman001@ucr.edu> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-27net: l2tp: deprecate PPPOL2TP_MSG_* in favour of L2TP_MSG_*Asbjørn Sloth Tønnesen
commit 47c3e7783be4e142b861d34b5c2e223330b05d8a upstream. PPPOL2TP_MSG_* and L2TP_MSG_* are duplicates, and are being used interchangeably in the kernel, so let's standardize on L2TP_MSG_* internally, and keep PPPOL2TP_MSG_* defined in UAPI for compatibility. Signed-off-by: Asbjoern Sloth Toennesen <asbjorn@asbjorn.st> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Giuliano Procida <gprocida@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-23ipv6: add option to drop unsolicited neighbor advertisementsJohannes Berg
In certain 802.11 wireless deployments, there will be NA proxies that use knowledge of the network to correctly answer requests. To prevent unsolicitd advertisements on the shared medium from being a problem, on such deployments wireless needs to drop them. Enable this by providing an option called "drop_unsolicited_na". Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net> (cherry picked from commit aec215e7aa380fe5f85eb6948766b58bf78cb6c3) Change-Id: Iad429a767a786087b0985632be44932b2e3fd1a8
2019-12-23ipv4: add option to drop gratuitous ARP packetsJohannes Berg
In certain 802.11 wireless deployments, there will be ARP proxies that use knowledge of the network to correctly answer requests. To prevent gratuitous ARP frames on the shared medium from being a problem, on such deployments wireless needs to drop them. Enable this by providing an option called "drop_gratuitous_arp". Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net> (cherry picked from commit 4078228159c9f54cca7347a8bdace29f2abdef65) Change-Id: I8772dbd7471085878f8b4161eb2a056d79b8b232
2019-12-23ipv6: add option to drop unicast encapsulated in L2 multicastJohannes Berg
In order to solve a problem with 802.11, the so-called hole-196 attack, add an option (sysctl) called "drop_unicast_in_l2_multicast" which, if enabled, causes the stack to drop IPv6 unicast packets encapsulated in link-layer multi- or broadcast frames. Such frames can (as an attack) be created by any member of the same wireless network and transmitted as valid encrypted frames since the symmetric key for broadcast frames is shared between all stations. Reviewed-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net> (cherry picked from commit dede82143bf1bbf92ea73a519bb0298b19c56cb9) Change-Id: I76c8f84b53e95c40ad3c2b5adac0ec4964cc920c
2019-12-23ipv4: add option to drop unicast encapsulated in L2 multicastJohannes Berg
In order to solve a problem with 802.11, the so-called hole-196 attack, add an option (sysctl) called "drop_unicast_in_l2_multicast" which, if enabled, causes the stack to drop IPv4 unicast packets encapsulated in link-layer multi- or broadcast frames. Such frames can (as an attack) be created by any member of the same wireless network and transmitted as valid encrypted frames since the symmetric key for broadcast frames is shared between all stations. Additionally, enabling this option provides compliance with a SHOULD clause of RFC 1122. Change-Id: I8de9fa5bdbea0556802f2ee553d0e73c1349213e Reviewed-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-17tcp: add tcp_min_snd_mss sysctlEric Dumazet
commit 5f3e2bf008c2221478101ee72f5cb4654b9fc363 upstream. Some TCP peers announce a very small MSS option in their SYN and/or SYN/ACK messages. This forces the stack to send packets with a very high network/cpu overhead. Linux has enforced a minimal value of 48. Since this value includes the size of TCP options, and that the options can consume up to 40 bytes, this means that each segment can include only 8 bytes of payload. In some cases, it can be useful to increase the minimal value to a saner value. We still let the default to 48 (TCP_MIN_SND_MSS), for compatibility reasons. Note that TCP_MAXSEG socket option enforces a minimal value of (TCP_MIN_MSS). David Miller increased this minimal value in commit c39508d6f118 ("tcp: Make TCP_MAXSEG minimum more correct.") from 64 to 88. We might in the future merge TCP_MIN_SND_MSS and TCP_MIN_MSS. CVE-2019-11479 -- tcp mss hardcoded to 48 Signed-off-by: Eric Dumazet <edumazet@google.com> Suggested-by: Jonathan Looney <jtl@netflix.com> Acked-by: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Tyler Hicks <tyhicks@canonical.com> Cc: Bruce Curtis <brucec@netflix.com> Cc: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-16ipv4: set the tcp_min_rtt_wlen range from 0 to one dayZhangXiaoxu
[ Upstream commit 19fad20d15a6494f47f85d869f00b11343ee5c78 ] There is a UBSAN report as below: UBSAN: Undefined behaviour in net/ipv4/tcp_input.c:2877:56 signed integer overflow: 2147483647 * 1000 cannot be represented in type 'int' CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.1.0-rc4-00058-g582549e #1 Call Trace: <IRQ> dump_stack+0x8c/0xba ubsan_epilogue+0x11/0x60 handle_overflow+0x12d/0x170 ? ttwu_do_wakeup+0x21/0x320 __ubsan_handle_mul_overflow+0x12/0x20 tcp_ack_update_rtt+0x76c/0x780 tcp_clean_rtx_queue+0x499/0x14d0 tcp_ack+0x69e/0x1240 ? __wake_up_sync_key+0x2c/0x50 ? update_group_capacity+0x50/0x680 tcp_rcv_established+0x4e2/0xe10 tcp_v4_do_rcv+0x22b/0x420 tcp_v4_rcv+0xfe8/0x1190 ip_protocol_deliver_rcu+0x36/0x180 ip_local_deliver+0x15b/0x1a0 ip_rcv+0xac/0xd0 __netif_receive_skb_one_core+0x7f/0xb0 __netif_receive_skb+0x33/0xc0 netif_receive_skb_internal+0x84/0x1c0 napi_gro_receive+0x2a0/0x300 receive_buf+0x3d4/0x2350 ? detach_buf_split+0x159/0x390 virtnet_poll+0x198/0x840 ? reweight_entity+0x243/0x4b0 net_rx_action+0x25c/0x770 __do_softirq+0x19b/0x66d irq_exit+0x1eb/0x230 do_IRQ+0x7a/0x150 common_interrupt+0xf/0xf </IRQ> It can be reproduced by: echo 2147483647 > /proc/sys/net/ipv4/tcp_min_rtt_wlen Fixes: f672258391b42 ("tcp: track min RTT using windowed min-filter") Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-20Documentation/network: reword kernel version referenceMark Rustad
It seemed odd to say "since 4.17" in a 4.4 kernel. Consider rewording the reference to indicate where in the stable series it was introduced as well as where it originated. Signed-off-by: Mark Rustad <mrustad@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-02-08inet: frags: break the 2GB limit for frags storageEric Dumazet
commit 3e67f106f619dcfaf6f4e2039599bdb69848c714 upstream. Some users are willing to provision huge amounts of memory to be able to perform reassembly reasonnably well under pressure. Current memory tracking is using one atomic_t and integers. Switch to atomic_long_t so that 64bit arches can use more than 2GB, without any cost for 32bit arches. Note that this patch avoids an overflow error, if high_thresh was set to ~2GB, since this test in inet_frag_alloc() was never true : if (... || frag_mem_limit(nf) > nf->high_thresh) Tested: $ echo 16000000000 >/proc/sys/net/ipv4/ipfrag_high_thresh <frag DDOS> $ grep FRAG /proc/net/sockstat FRAG: inuse 14705885 memory 16000002880 $ nstat -n ; sleep 1 ; nstat | grep Reas IpReasmReqds 3317150 0.0 IpReasmFails 3317112 0.0 Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-08inet: frags: use rhashtables for reassembly unitsEric Dumazet
commit 648700f76b03b7e8149d13cc2bdb3355035258a9 upstream. Some applications still rely on IP fragmentation, and to be fair linux reassembly unit is not working under any serious load. It uses static hash tables of 1024 buckets, and up to 128 items per bucket (!!!) A work queue is supposed to garbage collect items when host is under memory pressure, and doing a hash rebuild, changing seed used in hash computations. This work queue blocks softirqs for up to 25 ms when doing a hash rebuild, occurring every 5 seconds if host is under fire. Then there is the problem of sharing this hash table for all netns. It is time to switch to rhashtables, and allocate one of them per netns to speedup netns dismantle, since this is a critical metric these days. Lookup is now using RCU. A followup patch will even remove the refcount hold/release left from prior implementation and save a couple of atomic operations. Before this patch, 16 cpus (16 RX queue NIC) could not handle more than 1 Mpps frags DDOS. After the patch, I reach 9 Mpps without any tuning, and can use up to 2GB of storage for the fragments (exact number depends on frags being evicted after timeout) $ grep FRAG /proc/net/sockstat FRAG: inuse 1966916 memory 2140004608 A followup patch will change the limits for 64bit arches. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Florian Westphal <fw@strlen.de> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Alexander Aring <alex.aring@gmail.com> Cc: Stefan Schmidt <stefan@osg.samsung.com> Signed-off-by: David S. Miller <davem@davemloft.net> [bwh: Backported to 4.4: adjust context] Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-12-17selftests: Move networking/timestamping from DocumentationShuah Khan
commit 3d2c86e3057995270e08693231039d9d942871f0 upstream. Remove networking from Documentation Makefile to move the test to selftests. Update networking/timestamping Makefile to work under selftests. These tests will not be run as part of selftests suite and will not be included in install targets. They can be built and run separately for now. This is part of the effort to move runnable code from Documentation. Acked-by: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Shuah Khan <shuahkh@osg.samsung.com> [ added to 4.4.y stable to remove a build warning - gregkh] Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-06-13netdev-FAQ: clarify DaveM's position for stable backportsCong Wang
[ Upstream commit 75d4e704fa8d2cf33ff295e5b441317603d7f9fd ] Per discussion with David at netconf 2018, let's clarify DaveM's position of handling stable backports in netdev-FAQ. This is important for people relying on upstream -stable releases. Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-10netfilter: nf_defrag_ipv4: Add sysctl to disable per interfaceSubash Abhinov Kasiviswanathan
Add a sysctl nf_ipv4_defrag_skip to skip defragmentation per interface. This is set 0 to preserve existing behavior (always defrag per interface). This is useful for pure ipv4 forwarding scenarios (without NAT) in conjunction with xfrm. It appears that network stack defrags the packets and then forwards them to xfrm which then encrypts and then later fragments them on a different boundary compared to the source. CRs-Fixed: 2140310 Change-Id: I11956284a9692579274e8626f61cc6432232254c Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
2017-03-31drivers: net: rmnet: Initial implementationSubash Abhinov Kasiviswanathan
RmNet driver provides a transport agnostic MAP (multiplexing and aggregation protocol) support in embedded and bridge modes. Module provides virtual network devices which can be attached to any IP-mode physical device. This will be used to provide all MAP functionality on future hardware in a single consistent location. CRs-Fixed: 2022292 Change-Id: I4dd0f4fcf00bbf9dcbec65cec82436d48a813ecc Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
2017-03-23net: ipv6: Add sysctl for minimum prefix len acceptable in RIOs.Joel Scherpelz
This commit adds a new sysctl accept_ra_rt_info_min_plen that defines the minimum acceptable prefix length of Route Information Options. The new sysctl is intended to be used together with accept_ra_rt_info_max_plen to configure a range of acceptable prefix lengths. It is useful to prevent misconfigurations from unintentionally blackholing too much of the IPv6 address space (e.g., home routers announcing RIOs for fc00::/7, which is incorrect). [backport of net-next bbea124bc99df968011e76eba105fe964a4eceab] Bug: 33333670 Test: net_test passes Signed-off-by: Joel Scherpelz <jscherpelz@google.com> Acked-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22netlink: remove mmapped netlink supportFlorian Westphal
commit d1b4c689d4130bcfd3532680b64db562300716b6 upstream. mmapped netlink has a number of unresolved issues: - TX zerocopy support had to be disabled more than a year ago via commit 4682a0358639b29cf ("netlink: Always copy on mmap TX.") because the content of the mmapped area can change after netlink attribute validation but before message processing. - RX support was implemented mainly to speed up nfqueue dumping packet payload to userspace. However, since commit ae08ce0021087a5d812d2 ("netfilter: nfnetlink_queue: zero copy support") we avoid one copy with the socket-based interface too (via the skb_zerocopy helper). The other problem is that skbs attached to mmaped netlink socket behave different from normal skbs: - they don't have a shinfo area, so all functions that use skb_shinfo() (e.g. skb_clone) cannot be used. - reserving headroom prevents userspace from seeing the content as it expects message to start at skb->head. See for instance commit aa3a022094fa ("netlink: not trim skb for mmaped socket when dump"). - skbs handed e.g. to netlink_ack must have non-NULL skb->sk, else we crash because it needs the sk to check if a tx ring is attached. Also not obvious, leads to non-intuitive bug fixes such as 7c7bdf359 ("netfilter: nfnetlink: use original skbuff when acking batches"). mmaped netlink also didn't play nicely with the skb_zerocopy helper used by nfqueue and openvswitch. Daniel Borkmann fixed this via commit 6bb0fef489f6 ("netlink, mmap: fix edge-case leakages in nf queue zero-copy")' but at the cost of also needing to provide remaining length to the allocation function. nfqueue also has problems when used with mmaped rx netlink: - mmaped netlink doesn't allow use of nfqueue batch verdict messages. Problem is that in the mmap case, the allocation time also determines the ordering in which the frame will be seen by userspace (A allocating before B means that A is located in earlier ring slot, but this also means that B might get a lower sequence number then A since seqno is decided later. To fix this we would need to extend the spinlocked region to also cover the allocation and message setup which isn't desirable. - nfqueue can now be configured to queue large (GSO) skbs to userspace. Queing GSO packets is faster than having to force a software segmentation in the kernel, so this is a desirable option. However, with a mmap based ring one has to use 64kb per ring slot element, else mmap has to fall back to the socket path (NL_MMAP_STATUS_COPY) for all large packets. To use the mmap interface, userspace not only has to probe for mmap netlink support, it also has to implement a recv/socket receive path in order to handle messages that exceed the size of an rx ring element. Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Ken-ichirou MATSUZAWA <chamaken@gmail.com> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Cc: Patrick McHardy <kaber@trash.net> Cc: Thomas Graf <tgraf@suug.ch> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net> Cc: Shi Yuejie <shiyuejie@outlook.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-22net/ipv6/addrconf: IPv6 tethering enhancementTianyi Gou
Added new procfs flag to toggle the automatic addition of prefix routes on a per device basis. The new flag is accept_ra_prefix_route. Defaults to 1 as to not break existing behavior. CRs-Fixed: 435320 Change-Id: If25493890c7531c27f5b2c4855afebbbbf5d072a Acked-by: Harout S. Hedeshian <harouth@qti.qualcomm.com> Signed-off-by: Tianyi Gou <tgou@codeaurora.org> [subashab@codeaurora.org: resolve trivial merge conflicts] Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
2016-03-22net: Fail explicit bind to local reserved portsSubash Abhinov Kasiviswanathan
Reserved ports may have some special use cases which are not suitable for use by general userspace applications. Currently, ports specified in ip_local_reserved_ports will not be returned only in case of automatic port assignment. Add a boolean sysctl flag 'reserved_port_bind'. Default value is 1 which preserves the existing behavior. Setting the value to 0 will prevent userspace applications from binding to these ports even when they are explicitly requested. BUG=20663075 Change-Id: Ib1071ca5bd437cd3c4f71b56147e4858f3b9ebec Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
2016-02-16net: support marking accepting TCP socketsLorenzo Colitti
When using mark-based routing, sockets returned from accept() may need to be marked differently depending on the incoming connection request. This is the case, for example, if different socket marks identify different networks: a listening socket may want to accept connections from all networks, but each connection should be marked with the network that the request came in on, so that subsequent packets are sent on the correct network. This patch adds a sysctl to mark TCP sockets based on the fwmark of the incoming SYN packet. If enabled, and an unmarked socket receives a SYN, then the SYN packet's fwmark is written to the connection's inet_request_sock, and later written back to the accepted socket when the connection is established. If the socket already has a nonzero mark, then the behaviour is the same as it is today, i.e., the listening socket's fwmark is used. Black-box tested using user-mode linux: - IPv4/IPv6 SYN+ACK, FIN, etc. packets are routed based on the mark of the incoming SYN packet. - The socket returned by accept() is marked with the mark of the incoming SYN packet. - Tested with syncookies=1 and syncookies=2. Change-Id: I26bc1eceefd2c588d73b921865ab70e4645ade57 Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
2015-12-03e100.txt: Cleanup license info in kernel docJeff Kirsher
Apparently the e100.txt document contained a "License" section left over from days of old, which does not need to be in the kernel documentation. So clean it up.. CC: John Ronciak <john.ronciak@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com>
2015-11-11can-doc: Add missing semicolon to exampleStefan Tatschner
The example code for CAN_BCM, connect(s, (struct sockaddr *)&addr, sizeof(addr)) lacks a semicolon at the end of the line. This patch adds that missing semicolon to ensure that the given code snippet actually compiles. Signed-off-by: Stefan Tatschner <rumpelsepp@sevenbyte.org> Acked-by: Marc Kleine-Budde <mkl@pengutronix.de> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2015-11-09net: Documentation: Fix default value tcp_limit_output_bytesNiklas Cassel
Commit c39c4c6abb89 ("tcp: double default TSQ output bytes limit") updated default value for tcp_limit_output_bytes Signed-off-by: Niklas Cassel <niklas.cassel@axis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-08bpf: doc: correct arch list for supported eBPF JITYang Shi
aarch64 and s390x support eBPF JIT too, correct document to reflect this and avoid any confusion. Signed-off-by: Yang Shi <yang.shi@linaro.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-30switchdev: Make flood to CPU optionalIdo Schimmel
In certain use cases it is not always desirable for the switch device to flood traffic to CPU port. Instead, only certain packet types (e.g. STP, LACP) should be trapped to it. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-30switchdev: Add support for flood controlIdo Schimmel
Allow devices supporting this feature to control the flooding of unknown unicast traffic, by making switchdev infrastructure propagate this setting to the switch driver. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21tcp: use RACK to detect lossesYuchung Cheng
This patch implements the second half of RACK that uses the the most recent transmit time among all delivered packets to detect losses. tcp_rack_mark_lost() is called upon receiving a dubious ACK. It then checks if an not-yet-sacked packet was sent at least "reo_wnd" prior to the sent time of the most recently delivered. If so the packet is deemed lost. The "reo_wnd" reordering window starts with 1msec for fast loss detection and changes to min-RTT/4 when reordering is observed. We found 1msec accommodates well on tiny degree of reordering (<3 pkts) on faster links. We use min-RTT instead of SRTT because reordering is more of a path property but SRTT can be inflated by self-inflicated congestion. The factor of 4 is borrowed from the delayed early retransmit and seems to work reasonably well. Since RACK is still experimental, it is now used as a supplemental loss detection on top of existing algorithms. It is only effective after the fast recovery starts or after the timeout occurs. The fast recovery is still triggered by FACK and/or dupack threshold instead of RACK. We introduce a new sysctl net.ipv4.tcp_recovery for future experiments of loss recoveries. For now RACK can be disabled by setting it to 0. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21tcp: track min RTT using windowed min-filterYuchung Cheng
Kathleen Nichols' algorithm for tracking the minimum RTT of a data stream over some measurement window. It uses constant space and constant time per update. Yet it almost always delivers the same minimum as an implementation that has to keep all the data in the window. The measurement window is tunable via sysctl.net.ipv4.tcp_min_rtt_wlen with a default value of 5 minutes. The algorithm keeps track of the best, 2nd best & 3rd best min values, maintaining an invariant that the measurement time of the n'th best >= n-1'th best. It also makes sure that the three values are widely separated in the time window since that bounds the worse case error when that data is monotonically increasing over the window. Upon getting a new min, we can forget everything earlier because it has no value - the new min is less than everything else in the window by definition and it's the most recent. So we restart fresh on every new min and overwrites the 2nd & 3rd choices. The same property holds for the 2nd & 3rd best. Therefore we have to maintain two invariants to maximize the information in the samples, one on values (1st.v <= 2nd.v <= 3rd.v) and the other on times (now-win <=1st.t <= 2nd.t <= 3rd.t <= now). These invariants determine the structure of the code The RTT input to the windowed filter is the minimum RTT measured from ACK or SACK, or as the last resort from TCP timestamps. The accessor tcp_min_rtt() returns the minimum RTT seen in the window. ~0U indicates it is not available. The minimum is 1usec even if the true RTT is below that. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-14Revert "ipv4/icmp: redirect messages can use the ingress daddr as source"Paolo Abeni
Revert the commit e2ca690b657f ("ipv4/icmp: redirect messages can use the ingress daddr as source"), which tried to introduce a more suitable behaviour for ICMP redirect messages generated by VRRP routers. However RFC 5798 section 8.1.1 states: The IPv4 source address of an ICMP redirect should be the address that the end-host used when making its next-hop routing decision. while said commit used the generating packet destination address, which do not match the above and in most cases leads to no redirect packets to be generated. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-13net: vrf: Documentation update, ip commandsDavid Ahern
Add ip commands with examples for creating VRF devics, enslaving interfaces and dumping VRF-focused data (address, neighbors, routes). Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12ipv4/icmp: redirect messages can use the ingress daddr as sourcePaolo Abeni
This patch allows configuring how the source address of ICMP redirect messages is selected; by default the old behaviour is retained, while setting icmp_redirects_use_orig_daddr force the usage of the destination address of the packet that caused the redirect. The new behaviour fits closely the RFC 5798 section 8.1.1, and fix the following scenario: Two machines are set up with VRRP to act as routers out of a subnet, they have IPs x.x.x.1/24 and x.x.x.2/24, with VRRP holding on to x.x.x.254/24. If a host in said subnet needs to get an ICMP redirect from the VRRP router, i.e. to reach a destination behind a different gateway, the source IP in the ICMP redirect is chosen as the primary IP on the interface that the packet arrived at, i.e. x.x.x.1 or x.x.x.2. The host will then ignore said redirect, due to RFC 1122 section 3.2.2.2, and will continue to use the wrong next-op. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03switchdev: rename SWITCHDEV_ATTR_* enum values to SWITCHDEV_ATTR_ID_*Jiri Pirko
To be aligned with obj. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Scott Feldman <sfeldma@gmail.com> Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03switchdev: rename SWITCHDEV_OBJ_* enum values to SWITCHDEV_OBJ_ID_*Jiri Pirko
Suggested-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Scott Feldman <sfeldma@gmail.com> Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-29xfrm: Let the flowcache handle its size by default.Steffen Klassert
The xfrm flowcache size is limited by the flowcache limit (4096 * number of online cpus) and the xfrm garbage collector threshold (2 * 32768), whatever is reached first. This means that we can hit the garbage collector limit only on systems with more than 16 cpus. On such systems we simply refuse new allocations if we reach the limit, so new flows are dropped. On syslems with 16 or less cpus, we hit the flowcache limit. In this case, we shrink the flow cache instead of refusing new flows. We increase the xfrm garbage collector threshold to INT_MAX to get the same behaviour, independent of the number of cpus. The xfrm garbage collector threshold can still be set below the flowcache limit to reduce the memory usage of the flowcache. Tested-by: Dan Streetman <dan.streetman@canonical.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2015-09-25l2tp: remove references to modprobe in documentationstephen hemminger
No longer need explicit modprobe's and update to use ip instead of deprecated ifconfig command. Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-24switchdev: introduce transaction item queue for attr_set and obj_addJiri Pirko
Now, the memory allocation in prepare/commit state is done separatelly in each driver (rocker). Introduce the similar mechanism in generic switchdev code, in form of queue. That can be used not only for memory allocations, but also for different items. Abort item destruction is handled as well. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-23switchdev: update documentation on FDB ageing_timeScott Feldman
Signed-off-by: Scott Feldman <sfeldma@gmail.com> Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-18can: Add documentation for CAN FD driver configurationOliver Hartkopp
With Linux 3.15 the infrastructure for CAN FD hardware drivers had been introduced into the kernel. Now the M_CAN driver and the peak_usb driver support CAN FD. Update the documentation to show the latest CAN related configuration options of 'ip' from iproute2 and describe the CAN FD specific options to set the data bitrate and protocol version (ISO/non-ISO). Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Acked-by: Marc Kleine-Budde <mkl@pengutronix.de> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2015-09-17net: Add documentation for VRF deviceDavid Ahern
Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-17ieee802154: docs: fix project name to linux-wpan as well as some typosStefan Schmidt
Signed-off-by: Stefan Schmidt <stefan@osg.samsung.com> Acked-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-09-17ipvs: add sysctl to ignore tunneled packetsAlex Gartrell
This is a way to avoid nasty routing loops when multiple ipvs instances can forward to eachother. Signed-off-by: Alex Gartrell <agartrell@fb.com> Signed-off-by: Simon Horman <horms@verge.net.au>
2015-08-31phy: fixed_phy: Add gpio to determine link up/down.Andrew Lunn
An SFP module may have a link up/down status pin which can be connection to a GPIO line of the host. Add support for reading such an GPIO in the fixed_phy driver. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-31IGMP: Document igmp_link_local_mcast_reportsPhilip Downey
Document the addition of a new sysctl variable which controls the generation of IGMP reports for link local multicast groups in the 224.0.0.X range. IGMP reports for local multicast groups can now be optionally inhibited by setting the value to zero e.g.: echo 0 > /proc/sys/net/ipv4/igmp_link_local_mcast_reports To retain backwards compatibility the previous behaviour is retained by default on system boot or reverted by setting the value back to non-zero. Signed-off-by: Philip Downey <pdowney@brocade.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-25Documentation: networking: dsa: Add Broadcom SF2 documentFlorian Fainelli
Add a document describing the Broadcom Starfigther 2 switch hardware, its specifics, and how the driver is implemented and its specifics. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-25Documentation: networking: add a DSA documentFlorian Fainelli
Describe how the DSA subsystem works, its design principles, limitations, and describe in details how to implement a DSA switch driver. Acked-by: Andrew Lunn <andrew@lunn.ch> Acked-by: Scott Feldman <sfeldma@gmail.com> Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-25tcp: refine pacing rate determinationEric Dumazet
When TCP pacing was added back in linux-3.12, we chose to apply a fixed ratio of 200 % against current rate, to allow probing for optimal throughput even during slow start phase, where cwnd can be doubled every other gRTT. At Google, we found it was better applying a different ratio while in Congestion Avoidance phase. This ratio was set to 120 %. We've used the normal tcp_in_slow_start() helper for a while, then tuned the condition to select the conservative ratio as soon as cwnd >= ssthresh/2 : - After cwnd reduction, it is safer to ramp up more slowly, as we approach optimal cwnd. - Initial ramp up (ssthresh == INFINITY) still allows doubling cwnd every other RTT. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>