summaryrefslogtreecommitdiff
path: root/net/ipv4/tcp_timer.c (follow)
Commit message (Collapse)AuthorAge
* Merge branch 'android-4.4-p' of ↵Michael Bestas2020-03-08
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | https://android.googlesource.com/kernel/common into lineage-17.1-caf-msm8998 This brings LA.UM.8.4.r1-05200-8x98.0 up to date with https://android.googlesource.com/kernel/common/ android-4.4-p at commit: 4db1ebdd40ec0 FROMLIST: HID: nintendo: add nintendo switch controller driver Conflicts: arch/arm64/boot/Makefile arch/arm64/kernel/psci.c arch/x86/configs/x86_64_cuttlefish_defconfig drivers/md/dm.c drivers/of/Kconfig drivers/thermal/thermal_core.c fs/proc/meminfo.c kernel/locking/spinlock_debug.c kernel/time/hrtimer.c net/wireless/util.c Change-Id: I5b5163497b7c6ab8487ffbb2d036e4cda01ed670
| * tcp: fix off-by-one bug on aborting window-probing socketYuchung Cheng2019-12-21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 3976535af0cb9fe34a55f2ffb8d7e6b39a2f8188 ] Previously there is an off-by-one bug on determining when to abort a stalled window-probing socket. This patch fixes that so it is consistent with tcp_write_timeout(). Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
* | Merge android-4.4.182 (9c4ab57) into msm-4.4Srinivasarao P2019-06-18
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | * refs/heads/tmp-9c4ab57 Linux 4.4.182 tcp: enforce tcp_min_snd_mss in tcp_mtu_probing() tcp: add tcp_min_snd_mss sysctl tcp: tcp_fragment() should apply sane memory limits tcp: limit payload size of sacked skbs UPSTREAM: binder: check for overflow when alloc for security context BACKPORT: binder: fix race between munmap() and direct reclaim Change-Id: I4cfb9eb282f54f083631fec4c8161eed42e8ab54 Signed-off-by: Srinivasarao P <spathi@codeaurora.org>
| * tcp: enforce tcp_min_snd_mss in tcp_mtu_probing()Eric Dumazet2019-06-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 967c05aee439e6e5d7d805e195b3a20ef5c433d6 upstream. If mtu probing is enabled tcp_mtu_probing() could very well end up with a too small MSS. Use the new sysctl tcp_min_snd_mss to make sure MSS search is performed in an acceptable range. CVE-2019-11479 -- tcp mss hardcoded to 48 Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Jonathan Lemon <jonathan.lemon@gmail.com> Cc: Jonathan Looney <jtl@netflix.com> Acked-by: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Tyler Hicks <tyhicks@canonical.com> Cc: Bruce Curtis <brucec@netflix.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* | Merge android-4.4.114 (fe09418) into msm-4.4Srinivasarao P2018-02-01
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * refs/heads/tmp-fe09418 Linux 4.4.114 nfsd: auth: Fix gid sorting when rootsquash enabled net: tcp: close sock if net namespace is exiting flow_dissector: properly cap thoff field ipv4: Make neigh lookup keys for loopback/point-to-point devices be INADDR_ANY net: Allow neigh contructor functions ability to modify the primary_key vmxnet3: repair memory leak sctp: return error if the asoc has been peeled off in sctp_wait_for_sndbuf sctp: do not allow the v4 socket to bind a v4mapped v6 address r8169: fix memory corruption on retrieval of hardware statistics. pppoe: take ->needed_headroom of lower device into account on xmit net: qdisc_pkt_len_init() should be more robust tcp: __tcp_hdrlen() helper net: igmp: fix source address check for IGMPv3 reports lan78xx: Fix failure in USB Full Speed ipv6: ip6_make_skb() needs to clear cork.base.dst ipv6: fix udpv6 sendmsg crash caused by too small MTU ipv6: Fix getsockopt() for sockets with default IPV6_AUTOFLOWLABEL dccp: don't restart ccid2_hc_tx_rto_expire() if sk in closed state hrtimer: Reset hrtimer cpu base proper on CPU hotplug x86/microcode/intel: Extend BDW late-loading further with LLC size check eventpoll.h: add missing epoll event masks vsyscall: Fix permissions for emulate mode with KAISER/PTI um: link vmlinux with -no-pie usbip: prevent leaking socket pointer address in messages usbip: fix stub_rx: harden CMD_SUBMIT path to handle malicious input usbip: fix stub_rx: get_pipe() to validate endpoint number usb: usbip: Fix possible deadlocks reported by lockdep Input: trackpoint - force 3 buttons if 0 button is reported Revert "module: Add retpoline tag to VERMAGIC" scsi: libiscsi: fix shifting of DID_REQUEUE host byte fs/fcntl: f_setown, avoid undefined behaviour reiserfs: Don't clear SGID when inheriting ACLs reiserfs: don't preallocate blocks for extended attributes reiserfs: fix race in prealloc discard ext2: Don't clear SGID when inheriting ACLs netfilter: xt_osf: Add missing permission checks netfilter: nfnetlink_cthelper: Add missing permission checks netfilter: fix IS_ERR_VALUE usage netfilter: use fwmark_reflect in nf_send_reset netfilter: nf_conntrack_sip: extend request line validation netfilter: restart search if moved to other chain netfilter: nfnetlink_queue: reject verdict request from different portid netfilter: nf_ct_expect: remove the redundant slash when policy name is empty netfilter: nf_dup_ipv6: set again FLOWI_FLAG_KNOWN_NH at flowi6_flags netfilter: arp_tables: fix invoking 32bit "iptable -P INPUT ACCEPT" failed in 64bit kernel netfilter: x_tables: speed up jump target validation ACPICA: Namespace: fix operand cache leak ACPI / scan: Prefer devices without _HID/_CID for _ADR matching ACPI / processor: Avoid reserving IO regions too early x86/ioapic: Fix incorrect pointers in ioapic_setup_resources() ipc: msg, make msgrcv work with LONG_MIN mm, page_alloc: fix potential false positive in __zone_watermark_ok cma: fix calculation of aligned offset hwpoison, memcg: forcibly uncharge LRU pages mm/mmap.c: do not blow on PROT_NONE MAP_FIXED holes in the stack fs/select: add vmalloc fallback for select(2) mmc: sdhci-of-esdhc: add/remove some quirks according to vendor version PCI: layerscape: Fix MSG TLP drop setting PCI: layerscape: Add "fsl,ls2085a-pcie" compatible ID drivers: base: cacheinfo: fix boot error message when acpi is enabled drivers: base: cacheinfo: fix x86 with CONFIG_OF enabled Prevent timer value 0 for MWAITX timers: Plug locking race vs. timer migration time: Avoid undefined behaviour in ktime_add_safe() PM / sleep: declare __tracedata symbols as char[] rather than char can: af_can: canfd_rcv(): replace WARN_ONCE by pr_warn_once can: af_can: can_rcv(): replace WARN_ONCE by pr_warn_once sched/deadline: Use the revised wakeup rule for suspending constrained dl tasks x86/retpoline: Fill RSB on context switch for affected CPUs x86/cpu/intel: Introduce macros for Intel family numbers x86/microcode/intel: Fix BDW late-loading revision check usbip: Fix potential format overflow in userspace tools usbip: Fix implicit fallthrough warning usbip: prevent vhci_hcd driver from leaking a socket pointer address x86/asm/32: Make sync_core() handle missing CPUID on all 32-bit kernels ANDROID: sched: EAS: check energy_aware() before calling select_energy_cpu_brute() in up-migrate path UPSTREAM: eventpoll.h: add missing epoll event masks ANDROID: xattr: Pass EOPNOTSUPP to permission2 Conflicts: kernel/sched/fair.c Change-Id: I15005cb3bc039f4361d25ed2e22f8175b3d7ca96 Signed-off-by: Srinivasarao P <spathi@codeaurora.org>
| * net: tcp: close sock if net namespace is exitingDan Streetman2018-01-31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 4ee806d51176ba7b8ff1efd81f271d7252e03a1d ] When a tcp socket is closed, if it detects that its net namespace is exiting, close immediately and do not wait for FIN sequence. For normal sockets, a reference is taken to their net namespace, so it will never exit while the socket is open. However, kernel sockets do not take a reference to their net namespace, so it may begin exiting while the kernel socket is still open. In this case if the kernel socket is a tcp socket, it will stay open trying to complete its close sequence. The sock's dst(s) hold a reference to their interface, which are all transferred to the namespace's loopback interface when the real interfaces are taken down. When the namespace tries to take down its loopback interface, it hangs waiting for all references to the loopback interface to release, which results in messages like: unregister_netdevice: waiting for lo to become free. Usage count = 1 These messages continue until the socket finally times out and closes. Since the net namespace cleanup holds the net_mutex while calling its registered pernet callbacks, any new net namespace initialization is blocked until the current net namespace finishes exiting. After this change, the tcp socket notices the exiting net namespace, and closes immediately, releasing its dst(s) and their reference to the loopback interface, which lets the net namespace continue exiting. Link: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1711407 Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=97811 Signed-off-by: Dan Streetman <ddstreet@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* | Merge android-4.4@4b8fc9f (v4.4.82) into msm-4.4Blagovest Kolenichev2017-09-01
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * refs/heads/tmp-4b8fc9f UPSTREAM: locking: avoid passing around 'thread_info' in mutex debugging code ANDROID: arm64: fix undeclared 'init_thread_info' error UPSTREAM: kdb: use task_cpu() instead of task_thread_info()->cpu Linux 4.4.82 net: account for current skb length when deciding about UFO ipv4: Should use consistent conditional judgement for ip fragment in __ip_append_data and ip_finish_output mm/mempool: avoid KASAN marking mempool poison checks as use-after-free KVM: arm/arm64: Handle hva aging while destroying the vm sparc64: Prevent perf from running during super critical sections udp: consistently apply ufo or fragmentation revert "ipv4: Should use consistent conditional judgement for ip fragment in __ip_append_data and ip_finish_output" revert "net: account for current skb length when deciding about UFO" packet: fix tp_reserve race in packet_set_ring net: avoid skb_warn_bad_offload false positives on UFO tcp: fastopen: tcp_connect() must refresh the route net: sched: set xt_tgchk_param par.nft_compat as 0 in ipt_init_target bpf, s390: fix jit branch offset related to ldimm64 net: fix keepalive code vs TCP_FASTOPEN_CONNECT tcp: avoid setting cwnd to invalid ssthresh after cwnd reduction states ANDROID: keychord: Fix for a memory leak in keychord. ANDROID: keychord: Fix races in keychord_write. Use %zu to print resid (size_t). ANDROID: keychord: Fix a slab out-of-bounds read. Linux 4.4.81 workqueue: implicit ordered attribute should be overridable net: account for current skb length when deciding about UFO ipv4: Should use consistent conditional judgement for ip fragment in __ip_append_data and ip_finish_output mm: don't dereference struct page fields of invalid pages signal: protect SIGNAL_UNKILLABLE from unintentional clearing. lib/Kconfig.debug: fix frv build failure mm, slab: make sure that KMALLOC_MAX_SIZE will fit into MAX_ORDER ARM: 8632/1: ftrace: fix syscall name matching virtio_blk: fix panic in initialization error path drm/virtio: fix framebuffer sparse warning scsi: qla2xxx: Get mutex lock before checking optrom_state phy state machine: failsafe leave invalid RUNNING state x86/boot: Add missing declaration of string functions tg3: Fix race condition in tg3_get_stats64(). net: phy: dp83867: fix irq generation sh_eth: R8A7740 supports packet shecksumming wext: handle NULL extra data in iwe_stream_add_point better sparc64: Measure receiver forward progress to avoid send mondo timeout xen-netback: correctly schedule rate-limited queues net: phy: Fix PHY unbind crash net: phy: Correctly process PHY_HALTED in phy_stop_machine() net/mlx5: Fix command bad flow on command entry allocation failure sctp: fix the check for _sctp_walk_params and _sctp_walk_errors sctp: don't dereference ptr before leaving _sctp_walk_{params, errors}() dccp: fix a memleak for dccp_feat_init err process dccp: fix a memleak that dccp_ipv4 doesn't put reqsk properly dccp: fix a memleak that dccp_ipv6 doesn't put reqsk properly net: ethernet: nb8800: Handle all 4 RGMII modes identically ipv6: Don't increase IPSTATS_MIB_FRAGFAILS twice in ip6_fragment() packet: fix use-after-free in prb_retire_rx_blk_timer_expired() openvswitch: fix potential out of bound access in parse_ct mcs7780: Fix initialization when CONFIG_VMAP_STACK is enabled rtnetlink: allocate more memory for dev_set_mac_address() ipv4: initialize fib_trie prior to register_netdev_notifier call. ipv6: avoid overflow of offset in ip6_find_1stfragopt net: Zero terminate ifr_name in dev_ifname(). ipv4: ipv6: initialize treq->txhash in cookie_v[46]_check() saa7164: fix double fetch PCIe access condition drm: rcar-du: fix backport bug f2fs: sanity check checkpoint segno and blkoff media: lirc: LIRC_GET_REC_RESOLUTION should return microseconds mm, mprotect: flush TLB if potentially racing with a parallel reclaim leaving stale TLB entries iser-target: Avoid isert_conn->cm_id dereference in isert_login_recv_done iscsi-target: Fix delayed logout processing greater than SECONDS_FOR_LOGOUT_COMP iscsi-target: Fix initial login PDU asynchronous socket close OOPs iscsi-target: Fix early sk_data_ready LOGIN_FLAGS_READY race iscsi-target: Always wait for kthread_should_stop() before kthread exit target: Avoid mappedlun symlink creation during lun shutdown media: platform: davinci: return -EINVAL for VPFE_CMD_S_CCDC_RAW_PARAMS ioctl ARM: dts: armada-38x: Fix irq type for pca955 ext4: fix overflow caused by missing cast in ext4_resize_fs() ext4: fix SEEK_HOLE/SEEK_DATA for blocksize < pagesize mm/page_alloc: Remove kernel address exposure in free_reserved_area() KVM: async_pf: make rcu irq exit if not triggered from idle task ASoC: do not close shared backend dailink ALSA: hda - Fix speaker output from VAIO VPCL14M1R workqueue: restore WQ_UNBOUND/max_active==1 to be ordered libata: array underflow in ata_find_dev() ANDROID: binder: don't queue async transactions to thread. ANDROID: binder: don't enqueue death notifications to thread todo. ANDROID: binder: call poll_wait() unconditionally. android: configs: move quota-related configs to recommended BACKPORT: arm64: split thread_info from task stack UPSTREAM: arm64: assembler: introduce ldr_this_cpu UPSTREAM: arm64: make cpu number a percpu variable UPSTREAM: arm64: smp: prepare for smp_processor_id() rework BACKPORT: arm64: move sp_el0 and tpidr_el1 into cpu_suspend_ctx UPSTREAM: arm64: prep stack walkers for THREAD_INFO_IN_TASK UPSTREAM: arm64: unexport walk_stackframe UPSTREAM: arm64: traps: simplify die() and __die() UPSTREAM: arm64: factor out current_stack_pointer BACKPORT: arm64: asm-offsets: remove unused definitions UPSTREAM: arm64: thread_info remove stale items UPSTREAM: thread_info: include <current.h> for THREAD_INFO_IN_TASK UPSTREAM: thread_info: factor out restart_block UPSTREAM: kthread: Pin the stack via try_get_task_stack()/put_task_stack() in to_live_kthread() function UPSTREAM: sched/core: Add try_get_task_stack() and put_task_stack() UPSTREAM: sched/core: Allow putting thread_info into task_struct UPSTREAM: printk: when dumping regs, show the stack, not thread_info UPSTREAM: fix up initial thread stack pointer vs thread_info confusion UPSTREAM: Clarify naming of thread info/stack allocators ANDROID: sdcardfs: override credential for ioctl to lower fs Conflicts: android/configs/android-base.cfg arch/arm64/Kconfig arch/arm64/include/asm/suspend.h arch/arm64/kernel/head.S arch/arm64/kernel/smp.c arch/arm64/kernel/suspend.c arch/arm64/kernel/traps.c arch/arm64/mm/proc.S kernel/fork.c sound/soc/soc-pcm.c Change-Id: I273e216c94899a838bbd208391c6cbe20b2bf683 Signed-off-by: Blagovest Kolenichev <bkolenichev@codeaurora.org>
| * net: fix keepalive code vs TCP_FASTOPEN_CONNECTEric Dumazet2017-08-12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 2dda640040876cd8ae646408b69eea40c24f9ae9 ] syzkaller was able to trigger a divide by 0 in TCP stack [1] Issue here is that keepalive timer needs to be updated to not attempt to send a probe if the connection setup was deferred using TCP_FASTOPEN_CONNECT socket option added in linux-4.11 [1] divide error: 0000 [#1] SMP CPU: 18 PID: 0 Comm: swapper/18 Not tainted task: ffff986f62f4b040 ti: ffff986f62fa2000 task.ti: ffff986f62fa2000 RIP: 0010:[<ffffffff8409cc0d>] [<ffffffff8409cc0d>] __tcp_select_window+0x8d/0x160 Call Trace: <IRQ> [<ffffffff8409d951>] tcp_transmit_skb+0x11/0x20 [<ffffffff8409da21>] tcp_xmit_probe_skb+0xc1/0xe0 [<ffffffff840a0ee8>] tcp_write_wakeup+0x68/0x160 [<ffffffff840a151b>] tcp_keepalive_timer+0x17b/0x230 [<ffffffff83b3f799>] call_timer_fn+0x39/0xf0 [<ffffffff83b40797>] run_timer_softirq+0x1d7/0x280 [<ffffffff83a04ddb>] __do_softirq+0xcb/0x257 [<ffffffff83ae03ac>] irq_exit+0x9c/0xb0 [<ffffffff83a04c1a>] smp_apic_timer_interrupt+0x6a/0x80 [<ffffffff83a03eaf>] apic_timer_interrupt+0x7f/0x90 <EOI> [<ffffffff83fed2ea>] ? cpuidle_enter_state+0x13a/0x3b0 [<ffffffff83fed2cd>] ? cpuidle_enter_state+0x11d/0x3b0 Tested: Following packetdrill no longer crashes the kernel `echo 0 >/proc/sys/net/ipv4/tcp_timestamps` // Cache warmup: send a Fast Open cookie request 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 +0 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0 +0 setsockopt(3, SOL_TCP, TCP_FASTOPEN_CONNECT, [1], 4) = 0 +0 connect(3, ..., ...) = -1 EINPROGRESS (Operation is now in progress) +0 > S 0:0(0) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO,nop,nop> +.01 < S. 123:123(0) ack 1 win 14600 <mss 1460,nop,nop,sackOK,nop,wscale 6,FO abcd1234,nop,nop> +0 > . 1:1(0) ack 1 +0 close(3) = 0 +0 > F. 1:1(0) ack 1 +0 < F. 1:1(0) ack 2 win 92 +0 > . 2:2(0) ack 2 +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4 +0 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0 +0 setsockopt(4, SOL_TCP, TCP_FASTOPEN_CONNECT, [1], 4) = 0 +0 setsockopt(4, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0 +.01 connect(4, ..., ...) = 0 +0 setsockopt(4, SOL_TCP, TCP_KEEPIDLE, [5], 4) = 0 +10 close(4) = 0 `echo 1 >/proc/sys/net/ipv4/tcp_timestamps` Fixes: 19f6d3f3c842 ("net/tcp-fastopen: Add new API support") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Cc: Wei Wang <weiwan@google.com> Cc: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* | Merge branch 'android-4.4@c71ad0f' into branch 'msm-4.4'Blagovest Kolenichev2017-04-20
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * refs/heads/tmp-c71ad0f: BACKPORT: arm64: dts: juno: fix cluster sleep state entry latency on all SoC versions staging: android: ashmem: lseek failed due to no FMODE_LSEEK. ANDROID: sdcardfs: update module info ANDROID: sdcardfs: use d_splice_alias ANDROID: sdcardfs: add read_iter/write_iter opeations ANDROID: sdcardfs: fix ->llseek to update upper and lower offset ANDROID: sdcardfs: copy lower inode attributes in ->ioctl ANDROID: sdcardfs: remove unnecessary call to do_munmap Merge 4.4.59 into android-4.4 UPSTREAM: ipv6 addrconf: implement RFC7559 router solicitation backoff android: base-cfg: enable CONFIG_INET_DIAG_DESTROY ANDROID: android-base.cfg: add CONFIG_MODULES option ANDROID: android-base.cfg: add CONFIG_IKCONFIG option ANDROID: android-base.cfg: properly sort the file ANDROID: binder: add hwbinder,vndbinder to BINDER_DEVICES. ANDROID: sort android-recommended.cfg UPSTREAM: config/android: Remove CONFIG_IPV6_PRIVACY UPSTREAM: config: android: set SELinux as default security mode config: android: move device mapper options to recommended ANDROID: ARM64: Allow to choose appended kernel image UPSTREAM: arm64: vdso: constify vm_special_mapping used for aarch32 vectors page UPSTREAM: arm64: vdso: add __init section marker to alloc_vectors_page UPSTREAM: ARM: 8597/1: VDSO: put RO and RO after init objects into proper sections UPSTREAM: arm64: Add support for CLOCK_MONOTONIC_RAW in clock_gettime() vDSO UPSTREAM: arm64: Refactor vDSO time functions UPSTREAM: arm64: fix vdso-offsets.h dependency UPSTREAM: kbuild: drop FORCE from PHONY targets UPSTREAM: mm: add PHYS_PFN, use it in __phys_to_pfn() UPSTREAM: ARM: 8476/1: VDSO: use PTR_ERR_OR_ZERO for vma check Linux 4.4.58 crypto: algif_hash - avoid zero-sized array fbcon: Fix vc attr at deinit serial: 8250_pci: Detach low-level driver during PCI error recovery ACPI / blacklist: Make Dell Latitude 3350 ethernet work ACPI / blacklist: add _REV quirks for Dell Precision 5520 and 3520 uvcvideo: uvc_scan_fallback() for webcams with broken chain s390/zcrypt: Introduce CEX6 toleration block: allow WRITE_SAME commands with the SG_IO ioctl vfio/spapr: Postpone allocation of userspace version of TCE table PCI: Do any VF BAR updates before enabling the BARs PCI: Ignore BAR updates on virtual functions PCI: Update BARs using property bits appropriate for type PCI: Don't update VF BARs while VF memory space is enabled PCI: Decouple IORESOURCE_ROM_ENABLE and PCI_ROM_ADDRESS_ENABLE PCI: Add comments about ROM BAR updating PCI: Remove pci_resource_bar() and pci_iov_resource_bar() PCI: Separate VF BAR updates from standard BAR updates x86/hyperv: Handle unknown NMIs on one CPU when unknown_nmi_panic igb: add i211 to i210 PHY workaround igb: Workaround for igb i210 firmware issue xen: do not re-use pirq number cached in pci device msi msg data xfs: clear _XBF_PAGES from buffers when readahead page USB: usbtmc: add missing endpoint sanity check nl80211: fix dumpit error path RTNL deadlocks xfs: fix up xfs_swap_extent_forks inline extent handling xfs: don't allow di_size with high bit set libceph: don't set weight to IN when OSD is destroyed raid10: increment write counter after bio is split cpufreq: Restore policy min/max limits on CPU online ARM: dts: at91: sama5d2: add dma properties to UART nodes ARM: at91: pm: cpu_idle: switch DDR to power-down mode iommu/vt-d: Fix NULL pointer dereference in device_to_iommu xen/acpi: upload PM state from init-domain to Xen mmc: sdhci: Do not disable interrupts while waiting for clock ext4: mark inode dirty after converting inline directory parport: fix attempt to write duplicate procfiles iio: hid-sensor-trigger: Change get poll value function order to avoid sensor properties losing after resume from S3 iio: adc: ti_am335x_adc: fix fifo overrun recovery mmc: ushc: fix NULL-deref at probe uwb: hwa-rc: fix NULL-deref at probe uwb: i1480-dfu: fix NULL-deref at probe usb: hub: Fix crash after failure to read BOS descriptor usb: musb: cppi41: don't check early-TX-interrupt for Isoch transfer USB: wusbcore: fix NULL-deref at probe USB: idmouse: fix NULL-deref at probe USB: lvtest: fix NULL-deref at probe USB: uss720: fix NULL-deref at probe usb-core: Add LINEAR_FRAME_INTR_BINTERVAL USB quirk usb: gadget: f_uvc: Fix SuperSpeed companion descriptor's wBytesPerInterval ACM gadget: fix endianness in notifications USB: serial: qcserial: add Dell DW5811e USB: serial: option: add Quectel UC15, UC20, EC21, and EC25 modems ALSA: hda - Adding a group of pin definition to fix headset problem ALSA: ctxfi: Fix the incorrect check of dma_set_mask() call ALSA: seq: Fix racy cell insertions during snd_seq_pool_done() Input: sur40 - validate number of endpoints before using them Input: kbtab - validate number of endpoints before using them Input: cm109 - validate number of endpoints before using them Input: yealink - validate number of endpoints before using them Input: hanwang - validate number of endpoints before using them Input: ims-pcu - validate number of endpoints before using them Input: iforce - validate number of endpoints before using them Input: i8042 - add noloop quirk for Dell Embedded Box PC 3000 Input: elan_i2c - add ASUS EeeBook X205TA special touchpad fw tcp: initialize icsk_ack.lrcvtime at session start time socket, bpf: fix sk_filter use after free in sk_clone_lock ipv4: provide stronger user input validation in nl_fib_input() net: bcmgenet: remove bcmgenet_internal_phy_setup() net/mlx5e: Count LRO packets correctly net/mlx5: Increase number of max QPs in default profile net: unix: properly re-increment inflight counter of GC discarded candidates amd-xgbe: Fix jumbo MTU processing on newer hardware net: properly release sk_frag.page net: bcmgenet: Do not suspend PHY if Wake-on-LAN is enabled net/openvswitch: Set the ipv6 source tunnel key address attribute correctly Linux 4.4.57 ext4: fix fencepost in s_first_meta_bg validation percpu: acquire pcpu_lock when updating pcpu_nr_empty_pop_pages gfs2: Avoid alignment hole in struct lm_lockname isdn/gigaset: fix NULL-deref at probe target: Fix VERIFY_16 handling in sbc_parse_cdb scsi: libiscsi: add lock around task lists to fix list corruption regression scsi: lpfc: Add shutdown method for kexec target/pscsi: Fix TYPE_TAPE + TYPE_MEDIMUM_CHANGER export md/raid1/10: fix potential deadlock powerpc/boot: Fix zImage TOC alignment cpufreq: Fix and clean up show_cpuinfo_cur_freq() perf/core: Fix event inheritance on fork() give up on gcc ilog2() constant optimizations kernek/fork.c: allocate idle task for a CPU always on its local node hv_netvsc: use skb_get_hash() instead of a homegrown implementation tpm_tis: Use devm_free_irq not free_irq drm/amdgpu: add missing irq.h include s390/pci: fix use after free in dma_init KVM: PPC: Book3S PR: Fix illegal opcode emulation xen/qspinlock: Don't kick CPU if IRQ is not initialized Drivers: hv: avoid vfree() on crash Drivers: hv: balloon: don't crash when memory is added in non-sorted order pinctrl: cherryview: Do not mask all interrupts in probe ACPI / video: skip evaluating _DOD when it does not exist cxlflash: Increase cmd_per_lun for better throughput crypto: mcryptd - Fix load failure crypto: cryptd - Assign statesize properly crypto: ghash-clmulni - Fix load failure USB: don't free bandwidth_mutex too early usb: core: hub: hub_port_init lock controller instead of bus ANDROID: sdcardfs: Fix style issues in macros ANDROID: sdcardfs: Use seq_puts over seq_printf ANDROID: sdcardfs: Use to kstrout ANDROID: sdcardfs: Use pr_[...] instead of printk ANDROID: sdcardfs: remove unneeded null check ANDROID: sdcardfs: Fix style issues with comments ANDROID: sdcardfs: Fix formatting ANDROID: sdcardfs: correct order of descriptors fix the deadlock in xt_qtaguid when enable DDEBUG net: ipv6: Add sysctl for minimum prefix len acceptable in RIOs. Linux 4.4.56 futex: Add missing error handling to FUTEX_REQUEUE_PI futex: Fix potential use-after-free in FUTEX_REQUEUE_PI x86/perf: Fix CR4.PCE propagation to use active_mm instead of mm x86/kasan: Fix boot with KASAN=y and PROFILE_ANNOTATED_BRANCHES=y fscrypto: lock inode while setting encryption policy fscrypt: fix renaming and linking special files net sched actions: decrement module reference count after table flush. dccp: fix memory leak during tear-down of unsuccessful connection request dccp/tcp: fix routing redirect race bridge: drop netfilter fake rtable unconditionally ipv6: avoid write to a possibly cloned skb ipv6: make ECMP route replacement less greedy mpls: Send route delete notifications when router module is unloaded act_connmark: avoid crashing on malformed nlattrs with null parms uapi: fix linux/packet_diag.h userspace compilation error vrf: Fix use-after-free in vrf_xmit dccp: fix use-after-free in dccp_feat_activate_values net: fix socket refcounting in skb_complete_tx_timestamp() net: fix socket refcounting in skb_complete_wifi_ack() tcp: fix various issues for sockets morphing to listen state dccp: Unlock sock before calling sk_free() net: net_enable_timestamp() can be called from irq contexts net: don't call strlen() on the user buffer in packet_bind_spkt() l2tp: avoid use-after-free caused by l2tp_ip_backlog_recv ipv4: mask tos for input route vti6: return GRE_KEY for vti6 vxlan: correctly validate VXLAN ID against VXLAN_N_VID netlink: remove mmapped netlink support ANDROID: mmc: core: export emmc revision BACKPORT: mmc: core: Export device lifetime information through sysfs ANDROID: android-verity: do not compile as independent module ANDROID: sched: fix duplicate sched_group_energy const specifiers config: disable CONFIG_USELIB and CONFIG_FHANDLE ANDROID: power: align wakeup_sources format ANDROID: dm: android-verity: allow disable dm-verity for Treble VTS uid_sys_stats: change to use rt_mutex ANDROID: vfs: user permission2 in notify_change2 ANDROID: sdcardfs: Fix gid issue ANDROID: sdcardfs: Use tabs instead of spaces in multiuser.h ANDROID: sdcardfs: Remove uninformative prints ANDROID: sdcardfs: move path_put outside of spinlock ANDROID: sdcardfs: Use case insensitive hash function ANDROID: sdcardfs: declare MODULE_ALIAS_FS ANDROID: sdcardfs: Get the blocksize from the lower fs ANDROID: sdcardfs: Use d_invalidate instead of drop_recurisve ANDROID: sdcardfs: Switch to internal case insensitive compare ANDROID: sdcardfs: Use spin_lock_nested ANDROID: sdcardfs: Replace get/put with d_lock ANDROID: sdcardfs: rate limit warning print ANDROID: sdcardfs: Fix case insensitive lookup ANDROID: uid_sys_stats: account for fsync syscalls ANDROID: sched: add a counter to track fsync ANDROID: uid_sys_stats: fix negative write bytes. ANDROID: uid_sys_stats: allow writing same state ANDROID: uid_sys_stats: rename uid_cputime.c to uid_sys_stats.c ANDROID: uid_cputime: add per-uid IO usage accounting DTB: Add EAS compatible Juno Energy model to 'juno.dts' arm64: dts: juno: Add idle-states to device tree ANDROID: Replace spaces by '_' for some android filesystem tracepoints. usb: gadget: f_accessory: Fix for UsbAccessory clean unbind. android: binder: move global binder state into context struct. android: binder: add padding to binder_fd_array_object. binder: use group leader instead of open thread nf: IDLETIMER: Use fullsock when querying uid nf: IDLETIMER: Fix use after free condition during work ANDROID: dm: android-verity: fix table_make_digest() error handling ANDROID: usb: gadget: function: Fix commenting style cpufreq: interactive governor drops bits in time calculation ANDROID: sdcardfs: support direct-IO (DIO) operations ANDROID: sdcardfs: implement vm_ops->page_mkwrite ANDROID: sdcardfs: Don't bother deleting freelist ANDROID: sdcardfs: Add missing path_put ANDROID: sdcardfs: Fix incorrect hash ANDROID: ext4 crypto: Disables zeroing on truncation when there's no key ANDROID: ext4: add a non-reversible key derivation method ANDROID: ext4: allow encrypting filenames using HEH algorithm ANDROID: arm64/crypto: add ARMv8-CE optimized poly_hash algorithm ANDROID: crypto: heh - factor out poly_hash algorithm ANDROID: crypto: heh - Add Hash-Encrypt-Hash (HEH) algorithm ANDROID: crypto: gf128mul - Add ble multiplication functions ANDROID: crypto: gf128mul - Refactor gf128 overflow macros and tables UPSTREAM: crypto: gf128mul - Zero memory when freeing multiplication table ANDROID: crypto: shash - Add crypto_grab_shash() and crypto_spawn_shash_alg() ANDROID: crypto: allow blkcipher walks over ablkcipher data UPSTREAM: arm/arm64: crypto: assure that ECB modes don't require an IV ANDROID: Refactor fs readpage/write tracepoints. ANDROID: export security_path_chown Squashfs: optimize reading uncompressed data Squashfs: implement .readpages() Squashfs: replace buffer_head with BIO Squashfs: refactor page_actor Squashfs: remove the FILE_CACHE option ANDROID: android-recommended.cfg: CONFIG_CPU_SW_DOMAIN_PAN=y FROMLIST: 9p: fix a potential acl leak BACKPORT: posix_acl: Clear SGID bit when setting file permissions UPSTREAM: udp: properly support MSG_PEEK with truncated buffers UPSTREAM: arm64: Allow hw watchpoint of length 3,5,6 and 7 BACKPORT: arm64: hw_breakpoint: Handle inexact watchpoint addresses UPSTREAM: arm64: Allow hw watchpoint at varied offset from base address BACKPORT: hw_breakpoint: Allow watchpoint of length 3,5,6 and 7 ANDROID: sdcardfs: Switch strcasecmp for internal call ANDROID: sdcardfs: switch to full_name_hash and qstr ANDROID: sdcardfs: Add GID Derivation to sdcardfs ANDROID: sdcardfs: Remove redundant operation ANDROID: sdcardfs: add support for user permission isolation ANDROID: sdcardfs: Refactor configfs interface ANDROID: sdcardfs: Allow non-owners to touch ANDROID: binder: fix format specifier for type binder_size_t ANDROID: fs: Export vfs_rmdir2 ANDROID: fs: Export free_fs_struct and set_fs_pwd BACKPORT: Input: xpad - validate USB endpoint count during probe BACKPORT: Input: xpad - fix oops when attaching an unknown Xbox One gamepad ANDROID: mnt: remount should propagate to slaves of slaves ANDROID: sdcardfs: Switch ->d_inode to d_inode() ANDROID: sdcardfs: Fix locking issue with permision fix up ANDROID: sdcardfs: Change magic value ANDROID: sdcardfs: Use per mount permissions ANDROID: sdcardfs: Add gid and mask to private mount data ANDROID: sdcardfs: User new permission2 functions ANDROID: vfs: Add setattr2 for filesystems with per mount permissions ANDROID: vfs: Add permission2 for filesystems with per mount permissions ANDROID: vfs: Allow filesystems to access their private mount data ANDROID: mnt: Add filesystem private data to mount points ANDROID: sdcardfs: Move directory unlock before touch ANDROID: sdcardfs: fix external storage exporting incorrect uid ANDROID: sdcardfs: Added top to sdcardfs_inode_info ANDROID: sdcardfs: Switch package list to RCU ANDROID: sdcardfs: Fix locking for permission fix up ANDROID: sdcardfs: Check for other cases on path lookup ANDROID: sdcardfs: override umask on mkdir and create arm64: kernel: Fix build warning DEBUG: sched/fair: Fix sched_load_avg_cpu events for task_groups DEBUG: sched/fair: Fix missing sched_load_avg_cpu events UPSTREAM: l2tp: fix racy SOCK_ZAPPED flag check in l2tp_ip{,6}_bind() UPSTREAM: packet: fix race condition in packet_set_ring UPSTREAM: netlink: Fix dump skb leak/double free UPSTREAM: net: avoid signed overflows for SO_{SND|RCV}BUFFORCE MIPS: Prevent "restoration" of MSA context in non-MSA kernels net: socket: don't set sk_uid to garbage value in ->setattr() ANDROID: configs: CONFIG_ARM64_SW_TTBR0_PAN=y UPSTREAM: arm64: Disable PAN on uaccess_enable() UPSTREAM: arm64: Enable CONFIG_ARM64_SW_TTBR0_PAN UPSTREAM: arm64: xen: Enable user access before a privcmd hvc call UPSTREAM: arm64: Handle faults caused by inadvertent user access with PAN enabled BACKPORT: arm64: Disable TTBR0_EL1 during normal kernel execution BACKPORT: arm64: Introduce uaccess_{disable,enable} functionality based on TTBR0_EL1 BACKPORT: arm64: Factor out TTBR0_EL1 post-update workaround into a specific asm macro BACKPORT: arm64: Factor out PAN enabling/disabling into separate uaccess_* macros UPSTREAM: arm64: alternative: add auto-nop infrastructure UPSTREAM: arm64: barriers: introduce nops and __nops macros for NOP sequences Revert "FROMLIST: arm64: Factor out PAN enabling/disabling into separate uaccess_* macros" Revert "FROMLIST: arm64: Factor out TTBR0_EL1 post-update workaround into a specific asm macro" Revert "FROMLIST: arm64: Introduce uaccess_{disable,enable} functionality based on TTBR0_EL1" Revert "FROMLIST: arm64: Disable TTBR0_EL1 during normal kernel execution" Revert "FROMLIST: arm64: Handle faults caused by inadvertent user access with PAN enabled" Revert "FROMLIST: arm64: xen: Enable user access before a privcmd hvc call" Revert "FROMLIST: arm64: Enable CONFIG_ARM64_SW_TTBR0_PAN" ANDROID: sched/walt: fix build failure if FAIR_GROUP_SCHED=n ANDROID: trace: net: use %pK for kernel pointers ANDROID: android-base: Enable QUOTA related configs net: ipv4: Don't crash if passing a null sk to ip_rt_update_pmtu. net: inet: Support UID-based routing in IP protocols. net: core: add UID to flows, rules, and routes net: core: Add a UID field to struct sock. Revert "net: core: Support UID-based routing." UPSTREAM: efi/arm64: Don't apply MEMBLOCK_NOMAP to UEFI memory map mapping UPSTREAM: arm64: mm: always take dirty state from new pte in ptep_set_access_flags UPSTREAM: arm64: Implement pmdp_set_access_flags() for hardware AF/DBM UPSTREAM: arm64: Fix typo in the pmdp_huge_get_and_clear() definition UPSTREAM: arm64: enable CONFIG_DEBUG_RODATA by default goldfish: enable CONFIG_INET_DIAG_DESTROY sched/walt: kill {min,max}_capacity sched: fix wrong truncation of walt_avg build: fix build config kernel_dir ANDROID: dm verity: add minimum prefetch size build: add build server configs for goldfish usb: gadget: Fix compilation problem with tx_qlen field Conflicts: android/configs/android-base.cfg arch/arm64/Makefile arch/arm64/include/asm/cpufeature.h arch/arm64/kernel/vdso/gettimeofday.S arch/arm64/mm/cache.S drivers/md/Kconfig drivers/misc/Makefile drivers/mmc/host/sdhci.c drivers/usb/core/hcd.c drivers/usb/gadget/function/u_ether.c fs/sdcardfs/derived_perm.c fs/sdcardfs/file.c fs/sdcardfs/inode.c fs/sdcardfs/lookup.c fs/sdcardfs/main.c fs/sdcardfs/multiuser.h fs/sdcardfs/packagelist.c fs/sdcardfs/sdcardfs.h fs/sdcardfs/super.c include/linux/mmc/card.h include/linux/mmc/mmc.h include/trace/events/android_fs.h include/trace/events/android_fs_template.h drivers/android/binder.c fs/exec.c fs/ext4/crypto_key.c fs/ext4/ext4.h fs/ext4/inline.c fs/ext4/inode.c fs/ext4/readpage.c fs/f2fs/data.c fs/f2fs/inline.c fs/mpage.c include/linux/dcache.h include/trace/events/sched.h include/uapi/linux/ipv6.h net/ipv4/tcp_ipv4.c net/netfilter/xt_IDLETIMER.c Change-Id: Ie345db6a14869fe0aa794aef4b71b5d0d503690b Signed-off-by: Blagovest Kolenichev <bkolenichev@codeaurora.org>
| * tcp: fix various issues for sockets morphing to listen stateEric Dumazet2017-03-22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 02b2faaf0af1d85585f6d6980e286d53612acfc2 ] Dmitry Vyukov reported a divide by 0 triggered by syzkaller, exploiting tcp_disconnect() path that was never really considered and/or used before syzkaller ;) I was not able to reproduce the bug, but it seems issues here are the three possible actions that assumed they would never trigger on a listener. 1) tcp_write_timer_handler 2) tcp_delack_timer_handler 3) MTU reduction Only IPv6 MTU reduction was properly testing TCP_CLOSE and TCP_LISTEN states from tcp_v6_mtu_reduced() Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* | WLAN subsystem: Sysctl support for key TCP/IP parametersRavi Joshi2016-07-15
|/ | | | | | | | | | | | | It has been observed that default values for some of key tcp/ip parameters are affecting the tput/performance of the system. Hence extending configuration capabilities to TCP/Ip stack through sysctl interface Change-Id: I4287e9103769535f43e0934bac08435a524ee6a4 CRs-Fixed: 507581 Signed-off-by: Ravi Joshi <ravij@codeaurora.org> Signed-off-by: Ganesh Babu Kumaravel <kganesh@codeaurora.org> Signed-off-by: Mohit Khanna <mkhannaqca@codeaurora.org>
* tcp: fix Fast Open snmp over-counting bugYuchung Cheng2015-11-20
| | | | | | | | | Fix incrementing TCPFastOpenActiveFailed snmp stats multiple times when the handshake experiences multiple SYN timeouts. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: disable Fast Open on timeouts after handshakeYuchung Cheng2015-11-20
| | | | | | | | | | | | | Some middle-boxes black-hole the data after the Fast Open handshake (https://www.ietf.org/proceedings/94/slides/slides-94-tcpm-13.pdf). The exact reason is unknown. The work-around is to disable Fast Open temporarily after multiple recurring timeouts with few or no data delivered in the established state. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Christoph Paasch <cpaasch@apple.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: change type of alive from int to boolRichard Sailer2015-10-12
| | | | | | | | | | | | | | | The alive parameter of tcp_orphan_retries, indicates whether the connection is assumed alive or not. In the function and all places calling it is used as a boolean value. Therefore this changes the type of alive to bool in the function definition and all calling locations. Since tcp_orphan_tries is a tcp_timer.c local function no change in any other file or header is necessary. Signed-off-by: Richard Sailer <richard@weltraumpflege.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: do not export tcp_init_xmit_timers()Eric Dumazet2015-07-09
| | | | | | | | | | | After commit 900f65d361d3 ("tcp: move duplicate code from tcp_v4_init_sock()/tcp_v6_init_sock()"), we no longer need to export tcp_init_xmit_timers() Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: introduce tcp_under_memory_pressure()Eric Dumazet2015-05-17
| | | | | | | | Introduce an optimized version of sk_under_memory_pressure() for TCP. Our intent is to use it in fast paths. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: add TCPWinProbe and TCPKeepAlive SNMP countersEric Dumazet2015-05-09
| | | | | | | | | | | | | | | | | | Diagnosing problems related to Window Probes has been hard because we lack a counter. TCPWinProbe counts the number of ACK packets a sender has to send at regular intervals to make sure a reverse ACK packet opening back a window had not been lost. TCPKeepAlive counts the number of ACK packets sent to keep TCP flows alive (SO_KEEPALIVE) Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Nandita Dukkipati <nanditad@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: RFC7413 option support for Fast Open clientDaniel Lee2015-04-07
| | | | | | | | | | | | | | | | | | | | | | Fast Open has been using an experimental option with a magic number (RFC6994). This patch makes the client by default use the RFC7413 option (34) to get and send Fast Open cookies. This patch makes the client solicit cookies from a given server first with the RFC7413 option. If that fails to elicit a cookie, then it tries the RFC6994 experimental option. If that also fails, it uses the RFC7413 option on all subsequent connect attempts. If the server returns a Fast Open cookie then the client caches the form of the option that successfully elicited a cookie, and uses that form on later connects when it presents that cookie. The idea is to gradually obsolete the use of experimental options as the servers and clients upgrade, while keeping the interoperability meanwhile. Signed-off-by: Daniel Lee <Longinus00@gmail.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* inet: remove sk_listener parameter from syn_ack_timeout()Eric Dumazet2015-03-23
| | | | | | | | It is not needed, and req->sk_listener points to the listener anyway. request_sock argument can be const. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* inet: get rid of central tcp/dccp listener timerEric Dumazet2015-03-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | One of the major issue for TCP is the SYNACK rtx handling, done by inet_csk_reqsk_queue_prune(), fired by the keepalive timer of a TCP_LISTEN socket. This function runs for awful long times, with socket lock held, meaning that other cpus needing this lock have to spin for hundred of ms. SYNACK are sent in huge bursts, likely to cause severe drops anyway. This model was OK 15 years ago when memory was very tight. We now can afford to have a timer per request sock. Timer invocations no longer need to lock the listener, and can be run from all cpus in parallel. With following patch increasing somaxconn width to 32 bits, I tested a listener with more than 4 million active request sockets, and a steady SYNFLOOD of ~200,000 SYN per second. Host was sending ~830,000 SYNACK per second. This is ~100 times more what we could achieve before this patch. Later, we will get rid of the listener hash and use ehash instead. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: Create probe timer for tcp PMTU as per RFC4821Fan Du2015-03-06
| | | | | | | | | | | | | | | As per RFC4821 7.3. Selecting Probe Size, a probe timer should be armed once probing has converged. Once this timer expired, probing again to take advantage of any path PMTU change. The recommended probing interval is 10 minutes per RFC1981. Probing interval could be sysctled by sysctl_tcp_probe_interval. Eric Dumazet suggested to implement pseudo timer based on 32bits jiffies tcp_time_stamp instead of using classic timer for such rare event. Signed-off-by: Fan Du <fan.du@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: Namespecify TCP PMTU mechanismFan Du2015-02-09
| | | | | | | | | | | | | | | | Packetization Layer Path MTU Discovery works separately beside Path MTU Discovery at IP level, different net namespace has various requirements on which one to chose, e.g., a virutalized container instance would require TCP PMTU to probe an usable effective mtu for underlying tunnel, while the host would employ classical ICMP based PMTU to function. Hence making TCP PMTU mechanism per net namespace to decouple two functionality. Furthermore the probe base MSS should also be configured separately for each namespace. Signed-off-by: Fan Du <fan.du@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Convert LIMIT_NETDEBUG to net_dbg_ratelimitedJoe Perches2014-11-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use the more common dynamic_debug capable net_dbg_ratelimited and remove the LIMIT_NETDEBUG macro. All messages are still ratelimited. Some KERN_<LEVEL> uses are changed to KERN_DEBUG. This may have some negative impact on messages that were emitted at KERN_INFO that are not not enabled at all unless DEBUG is defined or dynamic_debug is enabled. Even so, these messages are now _not_ emitted by default. This also eliminates the use of the net_msg_warn sysctl "/proc/sys/net/core/warnings". For backward compatibility, the sysctl is not removed, but it has no function. The extern declaration of net_msg_warn is removed from sock.h and made static in net/core/sysctl_net_core.c Miscellanea: o Update the sysctl documentation o Remove the embedded uses of pr_fmt o Coalesce format fragments o Realign arguments Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: abort orphan sockets stalling on zero window probesYuchung Cheng2014-10-01
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently we have two different policies for orphan sockets that repeatedly stall on zero window ACKs. If a socket gets a zero window ACK when it is transmitting data, the RTO is used to probe the window. The socket is aborted after roughly tcp_orphan_retries() retries (as in tcp_write_timeout()). But if the socket was idle when it received the zero window ACK, and later wants to send more data, we use the probe timer to probe the window. If the receiver always returns zero window ACKs, icsk_probes keeps getting reset in tcp_ack() and the orphan socket can stall forever until the system reaches the orphan limit (as commented in tcp_probe_timer()). This opens up a simple attack to create lots of hanging orphan sockets to burn the memory and the CPU, as demonstrated in the recent netdev post "TCP connection will hang in FIN_WAIT1 after closing if zero window is advertised." http://www.spinics.net/lists/netdev/msg296539.html This patch follows the design in RTO-based probe: we abort an orphan socket stalling on zero window when the probe timer reaches both the maximum backoff and the maximum RTO. For example, an 100ms RTT connection will timeout after roughly 153 seconds (0.3 + 0.6 + .... + 76.8) if the receiver keeps the window shut. If the orphan socket passes this check, but the system already has too many orphans (as in tcp_out_of_resources()), we still abort it but we'll also send an RST packet as the connection may still be active. In addition, we change TCP_USER_TIMEOUT to cover (life or dead) sockets stalled on zero-window probes. This changes the semantics of TCP_USER_TIMEOUT slightly because it previously only applies when the socket has pending transmission. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Reported-by: Andrey Dmitrov <andrey.dmitrov@oktetlabs.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: avoid possible arithmetic overflowsEric Dumazet2014-09-22
| | | | | | | | | | | | | | icsk_rto is a 32bit field, and icsk_backoff can reach 15 by default, or more if some sysctl (eg tcp_retries2) are changed. Better use 64bit to perform icsk_rto << icsk_backoff operations As Joe Perches suggested, add a helper for this. Yuchung spotted the tcp_v4_err() case. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: remove TCP_SKB_CB(skb)->whenEric Dumazet2014-09-05
| | | | | | | | | | | After commit 740b0f1841f6 ("tcp: switch rtt estimations to usec resolution"), we no longer need to maintain timestamps in two different fields. TCP_SKB_CB(skb)->when can be removed, as same information sits in skb_mstamp.stamp_jiffies Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: reduce spurious retransmits due to transient SACK renegingNeal Cardwell2014-08-05
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This commit reduces spurious retransmits due to apparent SACK reneging by only reacting to SACK reneging that persists for a short delay. When a sequence space hole at snd_una is filled, some TCP receivers send a series of ACKs as they apparently scan their out-of-order queue and cumulatively ACK all the packets that have now been consecutiveyly received. This is essentially misbehavior B in "Misbehaviors in TCP SACK generation" ACM SIGCOMM Computer Communication Review, April 2011, so we suspect that this is from several common OSes (Windows 2000, Windows Server 2003, Windows XP). However, this issue has also been seen in other cases, e.g. the netdev thread "TCP being hoodwinked into spurious retransmissions by lack of timestamps?" from March 2014, where the receiver was thought to be a BSD box. Since snd_una would temporarily be adjacent to a previously SACKed range in these scenarios, this receiver behavior triggered the Linux SACK reneging code path in the sender. This led the sender to clear the SACK scoreboard, enter CA_Loss, and spuriously retransmit (potentially) every packet from the entire write queue at line rate just a few milliseconds before the ACK for each packet arrives at the sender. To avoid such situations, now when a sender sees apparent reneging it does not yet retransmit, but rather adjusts the RTO timer to give the receiver a little time (max(RTT/2, 10ms)) to send us some more ACKs that will restore sanity to the SACK scoreboard. If the reneging persists until this RTO then, as before, we clear the SACK scoreboard and enter CA_Loss. A 10ms delay tolerates a receiver sending such a stream of ACKs at 56Kbit/sec. And to allow for receivers with slower or more congested paths, we wait for at least RTT/2. We validated the resulting max(RTT/2, 10ms) delay formula with a mix of North American and South American Google web server traffic, and found that for ACKs displaying transient reneging: (1) 90% of inter-ACK delays were less than 10ms (2) 99% of inter-ACK delays were less than RTT/2 In tests on Google web servers this commit reduced reneging events by 75%-90% (as measured by the TcpExtTCPSACKReneging counter), without any measurable impact on latency for user HTTP and SPDY requests. Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: snmp stats for Fast Open, SYN rtx, and data pktsYuchung Cheng2014-03-03
| | | | | | | | | | | | | | | | | | | | | | | | Add the following snmp stats: TCPFastOpenActiveFail: Fast Open attempts (SYN/data) failed beacuse the remote does not accept it or the attempts timed out. TCPSynRetrans: number of SYN and SYN/ACK retransmits to break down retransmissions into SYN, fast-retransmits, timeout retransmits, etc. TCPOrigDataSent: number of outgoing packets with original data (excluding retransmission but including data-in-SYN). This counter is different from TcpOutSegs because TcpOutSegs also tracks pure ACKs. TCPOrigDataSent is more useful to track the TCP retransmission rate. Change TCPFastOpenActive to track only successful Fast Opens to be symmetric to TCPFastOpenPassive. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Nandita Dukkipati <nanditad@google.com> Signed-off-by: Lawrence Brakmo <brakmo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: temporarily disable Fast Open on SYN timeoutYuchung Cheng2013-10-29
| | | | | | | | | | | | | | | | | | | | | | | Fast Open currently has a fall back feature to address SYN-data being dropped but it requires the middle-box to pass on regular SYN retry after SYN-data. This is implemented in commit aab487435 ("net-tcp: Fast Open client - detecting SYN-data drops") However some NAT boxes will drop all subsequent packets after first SYN-data and blackholes the entire connections. An example is in commit 356d7d8 "netfilter: nf_conntrack: fix tcp_in_window for Fast Open". The sender should note such incidents and fall back to use the regular TCP handshake on subsequent attempts temporarily as well: after the second SYN timeouts the original Fast Open SYN is most likely lost. When such an event recurs Fast Open is disabled based on the number of recurrences exponentially. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: make lookups simpler and fasterEric Dumazet2013-10-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | TCP listener refactoring, part 4 : To speed up inet lookups, we moved IPv4 addresses from inet to struct sock_common Now is time to do the same for IPv6, because it permits us to have fast lookups for all kind of sockets, including upcoming SYN_RECV. Getting IPv6 addresses in TCP lookups currently requires two extra cache lines, plus a dereference (and memory stall). inet6_sk(sk) does the dereference of inet_sk(__sk)->pinet6 This patch is way bigger than its IPv4 counter part, because for IPv4, we could add aliases (inet_daddr, inet_rcv_saddr), while on IPv6, it's not doable easily. inet6_sk(sk)->daddr becomes sk->sk_v6_daddr inet6_sk(sk)->rcv_saddr becomes sk->sk_v6_rcv_saddr And timewait socket also have tw->tw_v6_daddr & tw->tw_v6_rcv_saddr at the same offset. We get rid of INET6_TW_MATCH() as INET6_MATCH() is now the generic macro. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: refactor F-RTOYuchung Cheng2013-03-21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The patch series refactor the F-RTO feature (RFC4138/5682). This is to simplify the loss recovery processing. Existing F-RTO was developed during the experimental stage (RFC4138) and has many experimental features. It takes a separate code path from the traditional timeout processing by overloading CA_Disorder instead of using CA_Loss state. This complicates CA_Disorder state handling because it's also used for handling dubious ACKs and undos. While the algorithm in the RFC does not change the congestion control, the implementation intercepts congestion control in various places (e.g., frto_cwnd in tcp_ack()). The new code implements newer F-RTO RFC5682 using CA_Loss processing path. F-RTO becomes a small extension in the timeout processing and interfaces with congestion control and Eifel undo modules. It lets congestion control (module) determines how many to send independently. F-RTO only chooses what to send in order to detect spurious retranmission. If timeout is found spurious it invokes existing Eifel undo algorithms like DSACK or TCP timestamp based detection. The first patch removes all F-RTO code except the sysctl_tcp_frto is left for the new implementation. Since CA_EVENT_FRTO is removed, TCP westwood now computes ssthresh on regular timeout CA_EVENT_LOSS event. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: TLP loss detection.Nandita Dukkipati2013-03-12
| | | | | | | | | | | | | | | | | | | | | | | | | This is the second of the TLP patch series; it augments the basic TLP algorithm with a loss detection scheme. This patch implements a mechanism for loss detection when a Tail loss probe retransmission plugs a hole thereby masking packet loss from the sender. The loss detection algorithm relies on counting TLP dupacks as outlined in Sec. 3 of: http://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01 The basic idea is: Sender keeps track of TLP "episode" upon retransmission of a TLP packet. An episode ends when the sender receives an ACK above the SND.NXT (tracked by tlp_high_seq) at the time of the episode. We want to make sure that before the episode ends the sender receives a "TLP dupack", indicating that the TLP retransmission was unnecessary, so there was no loss/hole that needed plugging. If the sender gets no TLP dupack before the end of the episode, then it reduces ssthresh and the congestion window, because the TLP packet arriving at the receiver probably plugged a hole. Signed-off-by: Nandita Dukkipati <nanditad@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: Tail loss probe (TLP)Nandita Dukkipati2013-03-12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch series implement the Tail loss probe (TLP) algorithm described in http://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01. The first patch implements the basic algorithm. TLP's goal is to reduce tail latency of short transactions. It achieves this by converting retransmission timeouts (RTOs) occuring due to tail losses (losses at end of transactions) into fast recovery. TLP transmits one packet in two round-trips when a connection is in Open state and isn't receiving any ACKs. The transmitted packet, aka loss probe, can be either new or a retransmission. When there is tail loss, the ACK from a loss probe triggers FACK/early-retransmit based fast recovery, thus avoiding a costly RTO. In the absence of loss, there is no change in the connection state. PTO stands for probe timeout. It is a timer event indicating that an ACK is overdue and triggers a loss probe packet. The PTO value is set to max(2*SRTT, 10ms) and is adjusted to account for delayed ACK timer when there is only one oustanding packet. TLP Algorithm On transmission of new data in Open state: -> packets_out > 1: schedule PTO in max(2*SRTT, 10ms). -> packets_out == 1: schedule PTO in max(2*RTT, 1.5*RTT + 200ms) -> PTO = min(PTO, RTO) Conditions for scheduling PTO: -> Connection is in Open state. -> Connection is either cwnd limited or no new data to send. -> Number of probes per tail loss episode is limited to one. -> Connection is SACK enabled. When PTO fires: new_segment_exists: -> transmit new segment. -> packets_out++. cwnd remains same. no_new_packet: -> retransmit the last segment. Its ACK triggers FACK or early retransmit based recovery. ACK path: -> rearm RTO at start of ACK processing. -> reschedule PTO if need be. In addition, the patch includes a small variation to the Early Retransmit (ER) algorithm, such that ER and TLP together can in principle recover any N-degree of tail loss through fast recovery. TLP is controlled by the same sysctl as ER, tcp_early_retrans sysctl. tcp_early_retrans==0; disables TLP and ER. ==1; enables RFC5827 ER. ==2; delayed ER. ==3; TLP and delayed ER. [DEFAULT] ==4; TLP only. The TLP patch series have been extensively tested on Google Web servers. It is most effective for short Web trasactions, where it reduced RTOs by 15% and improved HTTP response time (average by 6%, 99th percentile by 10%). The transmitted probes account for <0.5% of the overall transmissions. Signed-off-by: Nandita Dukkipati <nanditad@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2012-11-10
|\ | | | | | | | | | | | | | | | | | | | | Conflicts: drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c Minor conflict between the BCM_CNIC define removal in net-next and a bug fix added to net. Based upon a conflict resolution patch posted by Stephen Rothwell. Signed-off-by: David S. Miller <davem@davemloft.net>
| * tcp: Reject invalid ack_seq to Fast Open socketsJerry Chu2012-10-23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A packet with an invalid ack_seq may cause a TCP Fast Open socket to switch to the unexpected TCP_CLOSING state, triggering a BUG_ON kernel panic. When a FIN packet with an invalid ack_seq# arrives at a socket in the TCP_FIN_WAIT1 state, rather than discarding the packet, the current code will accept the FIN, causing state transition to TCP_CLOSING. This may be a small deviation from RFC793, which seems to say that the packet should be dropped. Unfortunately I did not expect this case for Fast Open hence it will trigger a BUG_ON panic. It turns out there is really nothing bad about a TFO socket going into TCP_CLOSING state so I could just remove the BUG_ON statements. But after some thought I think it's better to treat this case like TCP_SYN_RECV and return a RST to the confused peer who caused the unacceptable ack_seq to be generated in the first place. Signed-off-by: H.K. Jerry Chu <hkchu@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: better retrans tracking for defer-acceptEric Dumazet2012-11-03
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For passive TCP connections using TCP_DEFER_ACCEPT facility, we incorrectly increment req->retrans each time timeout triggers while no SYNACK is sent. SYNACK are not sent for TCP_DEFER_ACCEPT that were established (for which we received the ACK from client). Only the last SYNACK is sent so that we can receive again an ACK from client, to move the req into accept queue. We plan to change this later to avoid the useless retransmit (and potential problem as this SYNACK could be lost) TCP_INFO later gives wrong information to user, claiming imaginary retransmits. Decouple req->retrans field into two independent fields : num_retrans : number of retransmit num_timeout : number of timeouts num_timeout is the counter that is incremented at each timeout, regardless of actual SYNACK being sent or not, and used to compute the exponential timeout. Introduce inet_rtx_syn_ack() helper to increment num_retrans only if ->rtx_syn_ack() succeeded. Use inet_rtx_syn_ack() from tcp_check_req() to increment num_retrans when we re-send a SYNACK in answer to a (retransmitted) SYN. Prior to this patch, we were not counting these retransmits. Change tcp_v[46]_rtx_synack() to increment TCP_MIB_RETRANSSEGS only if a synack packet was successfully queued. Reported-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Julian Anastasov <ja@ssi.bg> Cc: Vijay Subramanian <subramanian.vijay@gmail.com> Cc: Elliott Hughes <enh@google.com> Cc: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: TCP Fast Open Server - support TFO listenersJerry Chu2012-08-31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch builds on top of the previous patch to add the support for TFO listeners. This includes - 1. allocating, properly initializing, and managing the per listener fastopen_queue structure when TFO is enabled 2. changes to the inet_csk_accept code to support TFO. E.g., the request_sock can no longer be freed upon accept(), not until 3WHS finishes 3. allowing a TCP_SYN_RECV socket to properly poll() and sendmsg() if it's a TFO socket 4. properly closing a TFO listener, and a TFO socket before 3WHS finishes 5. supporting TCP_FASTOPEN socket option 6. modifying tcp_check_req() to use to check a TFO socket as well as request_sock 7. supporting TCP's TFO cookie option 8. adding a new SYN-ACK retransmit handler to use the timer directly off the TFO socket rather than the listener socket. Note that TFO server side will not retransmit anything other than SYN-ACK until the 3WHS is completed. The patch also contains an important function "reqsk_fastopen_remove()" to manage the somewhat complex relation between a listener, its request_sock, and the corresponding child socket. See the comment above the function for the detail. Signed-off-by: H.K. Jerry Chu <hkchu@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: fix possible socket refcount problemEric Dumazet2012-08-21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 6f458dfb40 (tcp: improve latencies of timer triggered events) added bug leading to following trace : [ 2866.131281] IPv4: Attempt to release TCP socket in state 1 ffff880019ec0000 [ 2866.131726] [ 2866.132188] ========================= [ 2866.132281] [ BUG: held lock freed! ] [ 2866.132281] 3.6.0-rc1+ #622 Not tainted [ 2866.132281] ------------------------- [ 2866.132281] kworker/0:1/652 is freeing memory ffff880019ec0000-ffff880019ec0a1f, with a lock still held there! [ 2866.132281] (sk_lock-AF_INET-RPC){+.+...}, at: [<ffffffff81903619>] tcp_sendmsg+0x29/0xcc6 [ 2866.132281] 4 locks held by kworker/0:1/652: [ 2866.132281] #0: (rpciod){.+.+.+}, at: [<ffffffff81083567>] process_one_work+0x1de/0x47f [ 2866.132281] #1: ((&task->u.tk_work)){+.+.+.}, at: [<ffffffff81083567>] process_one_work+0x1de/0x47f [ 2866.132281] #2: (sk_lock-AF_INET-RPC){+.+...}, at: [<ffffffff81903619>] tcp_sendmsg+0x29/0xcc6 [ 2866.132281] #3: (&icsk->icsk_retransmit_timer){+.-...}, at: [<ffffffff81078017>] run_timer_softirq+0x1ad/0x35f [ 2866.132281] [ 2866.132281] stack backtrace: [ 2866.132281] Pid: 652, comm: kworker/0:1 Not tainted 3.6.0-rc1+ #622 [ 2866.132281] Call Trace: [ 2866.132281] <IRQ> [<ffffffff810bc527>] debug_check_no_locks_freed+0x112/0x159 [ 2866.132281] [<ffffffff818a0839>] ? __sk_free+0xfd/0x114 [ 2866.132281] [<ffffffff811549fa>] kmem_cache_free+0x6b/0x13a [ 2866.132281] [<ffffffff818a0839>] __sk_free+0xfd/0x114 [ 2866.132281] [<ffffffff818a08c0>] sk_free+0x1c/0x1e [ 2866.132281] [<ffffffff81911e1c>] tcp_write_timer+0x51/0x56 [ 2866.132281] [<ffffffff81078082>] run_timer_softirq+0x218/0x35f [ 2866.132281] [<ffffffff81078017>] ? run_timer_softirq+0x1ad/0x35f [ 2866.132281] [<ffffffff810f5831>] ? rb_commit+0x58/0x85 [ 2866.132281] [<ffffffff81911dcb>] ? tcp_write_timer_handler+0x148/0x148 [ 2866.132281] [<ffffffff81070bd6>] __do_softirq+0xcb/0x1f9 [ 2866.132281] [<ffffffff81a0a00c>] ? _raw_spin_unlock+0x29/0x2e [ 2866.132281] [<ffffffff81a1227c>] call_softirq+0x1c/0x30 [ 2866.132281] [<ffffffff81039f38>] do_softirq+0x4a/0xa6 [ 2866.132281] [<ffffffff81070f2b>] irq_exit+0x51/0xad [ 2866.132281] [<ffffffff81a129cd>] do_IRQ+0x9d/0xb4 [ 2866.132281] [<ffffffff81a0a3ef>] common_interrupt+0x6f/0x6f [ 2866.132281] <EOI> [<ffffffff8109d006>] ? sched_clock_cpu+0x58/0xd1 [ 2866.132281] [<ffffffff81a0a172>] ? _raw_spin_unlock_irqrestore+0x4c/0x56 [ 2866.132281] [<ffffffff81078692>] mod_timer+0x178/0x1a9 [ 2866.132281] [<ffffffff818a00aa>] sk_reset_timer+0x19/0x26 [ 2866.132281] [<ffffffff8190b2cc>] tcp_rearm_rto+0x99/0xa4 [ 2866.132281] [<ffffffff8190dfba>] tcp_event_new_data_sent+0x6e/0x70 [ 2866.132281] [<ffffffff8190f7ea>] tcp_write_xmit+0x7de/0x8e4 [ 2866.132281] [<ffffffff818a565d>] ? __alloc_skb+0xa0/0x1a1 [ 2866.132281] [<ffffffff8190f952>] __tcp_push_pending_frames+0x2e/0x8a [ 2866.132281] [<ffffffff81904122>] tcp_sendmsg+0xb32/0xcc6 [ 2866.132281] [<ffffffff819229c2>] inet_sendmsg+0xaa/0xd5 [ 2866.132281] [<ffffffff81922918>] ? inet_autobind+0x5f/0x5f [ 2866.132281] [<ffffffff810ee7f1>] ? trace_clock_local+0x9/0xb [ 2866.132281] [<ffffffff8189adab>] sock_sendmsg+0xa3/0xc4 [ 2866.132281] [<ffffffff810f5de6>] ? rb_reserve_next_event+0x26f/0x2d5 [ 2866.132281] [<ffffffff8103e6a9>] ? native_sched_clock+0x29/0x6f [ 2866.132281] [<ffffffff8103e6f8>] ? sched_clock+0x9/0xd [ 2866.132281] [<ffffffff810ee7f1>] ? trace_clock_local+0x9/0xb [ 2866.132281] [<ffffffff8189ae03>] kernel_sendmsg+0x37/0x43 [ 2866.132281] [<ffffffff8199ce49>] xs_send_kvec+0x77/0x80 [ 2866.132281] [<ffffffff8199cec1>] xs_sendpages+0x6f/0x1a0 [ 2866.132281] [<ffffffff8107826d>] ? try_to_del_timer_sync+0x55/0x61 [ 2866.132281] [<ffffffff8199d0d2>] xs_tcp_send_request+0x55/0xf1 [ 2866.132281] [<ffffffff8199bb90>] xprt_transmit+0x89/0x1db [ 2866.132281] [<ffffffff81999bcd>] ? call_connect+0x3c/0x3c [ 2866.132281] [<ffffffff81999d92>] call_transmit+0x1c5/0x20e [ 2866.132281] [<ffffffff819a0d55>] __rpc_execute+0x6f/0x225 [ 2866.132281] [<ffffffff81999bcd>] ? call_connect+0x3c/0x3c [ 2866.132281] [<ffffffff819a0f33>] rpc_async_schedule+0x28/0x34 [ 2866.132281] [<ffffffff810835d6>] process_one_work+0x24d/0x47f [ 2866.132281] [<ffffffff81083567>] ? process_one_work+0x1de/0x47f [ 2866.132281] [<ffffffff819a0f0b>] ? __rpc_execute+0x225/0x225 [ 2866.132281] [<ffffffff81083a6d>] worker_thread+0x236/0x317 [ 2866.132281] [<ffffffff81083837>] ? process_scheduled_works+0x2f/0x2f [ 2866.132281] [<ffffffff8108b7b8>] kthread+0x9a/0xa2 [ 2866.132281] [<ffffffff81a12184>] kernel_thread_helper+0x4/0x10 [ 2866.132281] [<ffffffff81a0a4b0>] ? retint_restore_args+0x13/0x13 [ 2866.132281] [<ffffffff8108b71e>] ? __init_kthread_worker+0x5a/0x5a [ 2866.132281] [<ffffffff81a12180>] ? gs_change+0x13/0x13 [ 2866.308506] IPv4: Attempt to release TCP socket in state 1 ffff880019ec0000 [ 2866.309689] ============================================================================= [ 2866.310254] BUG TCP (Not tainted): Object already free [ 2866.310254] ----------------------------------------------------------------------------- [ 2866.310254] The bug comes from the fact that timer set in sk_reset_timer() can run before we actually do the sock_hold(). socket refcount reaches zero and we free the socket too soon. timer handler is not allowed to reduce socket refcnt if socket is owned by the user, or we need to change sk_reset_timer() implementation. We should take a reference on the socket in case TCP_DELACK_TIMER_DEFERRED or TCP_DELACK_TIMER_DEFERRED bit are set in tsq_flags Also fix a typo in tcp_delack_timer(), where TCP_WRITE_TIMER_DEFERRED was used instead of TCP_DELACK_TIMER_DEFERRED. For consistency, use same socket refcount change for TCP_MTU_REDUCED_DEFERRED, even if not fired from a timer. Reported-by: Fengguang Wu <fengguang.wu@intel.com> Tested-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: improve latencies of timer triggered eventsEric Dumazet2012-07-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Modern TCP stack highly depends on tcp_write_timer() having a small latency, but current implementation doesn't exactly meet the expectations. When a timer fires but finds the socket is owned by the user, it rearms itself for an additional delay hoping next run will be more successful. tcp_write_timer() for example uses a 50ms delay for next try, and it defeats many attempts to get predictable TCP behavior in term of latencies. Use the recently introduced tcp_release_cb(), so that the user owning the socket will call various handlers right before socket release. This will permit us to post a followup patch to address the tcp_tso_should_defer() syndrome (some deferred packets have to wait RTO timer to be transmitted, while cwnd should allow us to send them sooner) Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Nandita Dukkipati <nanditad@google.com> Cc: H.K. Jerry Chu <hkchu@google.com> Cc: John Heffner <johnwheffner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: early retransmit: delayed fast retransmitYuchung Cheng2012-05-02
| | | | | | | | | | | | | Implementing the advanced early retransmit (sysctl_tcp_early_retrans==2). Delays the fast retransmit by an interval of RTT/4. We borrow the RTO timer to implement the delay. If we receive another ACK or send a new packet, the timer is cancelled and restored to original RTO value offset by time elapsed. When the delayed-ER timer fires, we enter fast recovery and perform fast retransmit. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: ipv4: Standardize prefixes for message loggingJoe Perches2012-03-12
| | | | | | | | | | | | | | | | Add #define pr_fmt(fmt) as appropriate. Add "IPv4: ", "TCP: ", and "IPsec: " to appropriate files. Standardize on "UDPLite: " for appropriate uses. Some prefixes were previously "UDPLITE: " and "UDP-Lite: ". Add KBUILD_MODNAME ": " to icmp and gre. Remove embedded prefixes as appropriate. Add missing "\n" to pr_info in gre.c. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Disambiguate kernel messageArun Sharma2012-02-01
| | | | | | | | | | | | | | | | | | | | | | | | | Some of our machines were reporting: TCP: too many of orphaned sockets even when the number of orphaned sockets was well below the limit. We print a different message depending on whether we're out of TCP memory or there are too many orphaned sockets. Also move the check out of line and cleanup the messages that were printed. Signed-off-by: Arun Sharma <asharma@fb.com> Suggested-by: Mohan Srinivasan <mohan@fb.com> Cc: netdev@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: David Miller <davem@davemloft.net> Cc: Glauber Costa <glommer@parallels.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: fix assignment of 0/1 to bool variables.Rusty Russell2011-12-19
| | | | | | | | | | | | | | | | | | | | | | | | DaveM said: Please, this kind of stuff rots forever and not using bool properly drives me crazy. Joe Perches <joe@perches.com> gave me the spatch script: @@ bool b; @@ -b = 0 +b = false @@ bool b; @@ -b = 1 +b = true I merely installed coccinelle, read the documentation and took credit. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
* foundations of per-cgroup memory pressure controlling.Glauber Costa2011-12-12
| | | | | | | | | | | | | | | | | This patch replaces all uses of struct sock fields' memory_pressure, memory_allocated, sockets_allocated, and sysctl_mem to acessor macros. Those macros can either receive a socket argument, or a mem_cgroup argument, depending on the context they live in. Since we're only doing a macro wrapping here, no performance impact at all is expected in the case where we don't have cgroups disabled. Signed-off-by: Glauber Costa <glommer@parallels.com> Reviewed-by: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com> CC: David S. Miller <davem@davemloft.net> CC: Eric W. Biederman <ebiederm@xmission.com> CC: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: use IS_ENABLED(CONFIG_IPV6)Eric Dumazet2011-12-11
| | | | | | | Instead of testing defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* TCP: remove TCP_DEBUGFlavio Leitner2011-10-24
| | | | | | | | It was enabled by default and the messages guarded by the define are useful. Signed-off-by: Flavio Leitner <fbl@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: Remove debug macro of TCP_CHECK_TIMERShan Wei2011-02-20
| | | | | | | | | | Now, TCP_CHECK_TIMER is not used for debuging, it does nothing. And, it has been there for several years, maybe 6 years. Remove it to keep code clearer. Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: use correct counters in CA_CWR state tooIlpo Järvinen2010-10-17
| | | | | | | | | | | | | | As CWR is stronger than CA_Disorder state, we can miscount SACK/Reno failure into other timeouts. Not a bad problem as it can happen only due to ECN, FRTO detecting spurious RTO or xmit error which are the only callers of tcp_enter_cwr. And even then losses and RTO must still follow thereafter to actually end up into the relevant code paths. Compile tested. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'master' of ↵David S. Miller2010-10-04
|\ | | | | | | | | | | | | | | master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: net/ipv4/Kconfig net/ipv4/tcp_timer.c
| * net-2.6: SYN retransmits: Add new parameter to retransmits_timed_out()Damian Lukowski2010-09-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Fixes kernel Bugzilla Bug 18952 This patch adds a syn_set parameter to the retransmits_timed_out() routine and updates its callers. If not set, TCP_RTO_MIN is taken as the calculation basis as before. If set, TCP_TIMEOUT_INIT is used instead, so that sysctl_syn_retries represents the actual amount of SYN retransmissions in case no SYNACKs are received when establishing a new connection. Signed-off-by: Damian Lukowski <damian@tvk.rwth-aachen.de> Signed-off-by: David S. Miller <davem@davemloft.net>