summaryrefslogtreecommitdiff
path: root/fs/btrfs/inode.c (follow)
Commit message (Collapse)AuthorAge
* Merge remote-tracking branch 'common/android-4.4-p' into ↵Michael Bestas2021-10-12
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | lineage-18.1-caf-msm8998 # By Sergey Shtylyov (9) and others # Via Greg Kroah-Hartman * common/android-4.4-p: Linux 4.4.288 libata: Add ATA_HORKAGE_NO_NCQ_ON_ATI for Samsung 860 and 870 SSD. usb: testusb: Fix for showing the connection speed scsi: sd: Free scsi_disk device via put_device() ext2: fix sleeping in atomic bugs on error sparc64: fix pci_iounmap() when CONFIG_PCI is not set xen-netback: correct success/error reporting for the SKB-with-fraglist case af_unix: fix races in sk_peer_pid and sk_peer_cred accesses Linux 4.4.287 Revert "arm64: Mark __stack_chk_guard as __ro_after_init" Linux 4.4.286 cred: allow get_cred() and put_cred() to be given NULL. HID: usbhid: free raw_report buffers in usbhid_stop netfilter: ipset: Fix oversized kvmalloc() calls HID: betop: fix slab-out-of-bounds Write in betop_probe arm64: Extend workaround for erratum 1024718 to all versions of Cortex-A55 EDAC/synopsys: Fix wrong value type assignment for edac_mode ext4: fix potential infinite loop in ext4_dx_readdir() ipack: ipoctal: fix module reference leak ipack: ipoctal: fix missing allocation-failure check ipack: ipoctal: fix tty-registration error handling ipack: ipoctal: fix tty registration race ipack: ipoctal: fix stack information leak e100: fix buffer overrun in e100_get_regs e100: fix length calculation in e100_get_regs_len ipvs: check that ip_vs_conn_tab_bits is between 8 and 20 mac80211: fix use-after-free in CCMP/GCMP RX tty: Fix out-of-bound vmalloc access in imageblit qnx4: work around gcc false positive warning bug spi: Fix tegra20 build with CONFIG_PM=n net: 6pack: Fix tx timeout and slot time alpha: Declare virt_to_phys and virt_to_bus parameter as pointer to volatile arm64: Mark __stack_chk_guard as __ro_after_init parisc: Use absolute_pointer() to define PAGE0 qnx4: avoid stringop-overread errors sparc: avoid stringop-overread errors net: i825xx: Use absolute_pointer for memcpy from fixed memory location compiler.h: Introduce absolute_pointer macro m68k: Double cast io functions to unsigned long blktrace: Fix uaf in blk_trace access after removing by sysfs scsi: iscsi: Adjust iface sysfs attr detection net/mlx4_en: Don't allow aRFS for encapsulated packets net: hso: fix muxed tty registration USB: serial: option: add device id for Foxconn T99W265 USB: serial: option: remove duplicate USB device ID USB: serial: option: add Telit LN920 compositions USB: serial: mos7840: remove duplicated 0xac24 device ID USB: serial: cp210x: add ID for GW Instek GDM-834x Digital Multimeter xen/x86: fix PV trap handling on secondary processors cifs: fix incorrect check for null pointer in header_assemble usb: musb: tusb6010: uninitialized data in tusb_fifo_write_unaligned() usb: gadget: r8a66597: fix a loop in set_feature() Linux 4.4.285 sctp: validate from_addr_param return drm/nouveau/nvkm: Replace -ENOSYS with -ENODEV blk-throttle: fix UAF by deleteing timer in blk_throtl_exit() nilfs2: fix memory leak in nilfs_sysfs_delete_snapshot_group nilfs2: fix memory leak in nilfs_sysfs_create_snapshot_group nilfs2: fix memory leak in nilfs_sysfs_delete_##name##_group nilfs2: fix memory leak in nilfs_sysfs_create_##name##_group nilfs2: fix NULL pointer in nilfs_##name##_attr_release nilfs2: fix memory leak in nilfs_sysfs_create_device_group ceph: lockdep annotations for try_nonblocking_invalidate dmaengine: ioat: depends on !UML parisc: Move pci_dev_is_behind_card_dino to where it is used dmaengine: acpi: Avoid comparison GSI with Linux vIRQ dmaengine: acpi-dma: check for 64-bit MMIO address profiling: fix shift-out-of-bounds bugs prctl: allow to setup brk for et_dyn executables 9p/trans_virtio: Remove sysfs file on probe failure thermal/drivers/exynos: Fix an error code in exynos_tmu_probe() sctp: add param size validation for SCTP_PARAM_SET_PRIMARY sctp: validate chunk size in __rcv_asconf_lookup PM / wakeirq: Fix unbalanced IRQ enable for wakeirq s390/bpf: Fix optimizing out zero-extensions Linux 4.4.284 s390/bpf: Fix 64-bit subtraction of the -0x80000000 constant net: renesas: sh_eth: Fix freeing wrong tx descriptor qlcnic: Remove redundant unlock in qlcnic_pinit_from_rom ARC: export clear_user_page() for modules mtd: rawnand: cafe: Fix a resource leak in the error handling path of 'cafe_nand_probe()' PCI: Sync __pci_register_driver() stub for CONFIG_PCI=n ethtool: Fix an error code in cxgb2.c dt-bindings: mtd: gpmc: Fix the ECC bytes vs. OOB bytes equation x86/mm: Fix kern_addr_valid() to cope with existing but not present entries net/af_unix: fix a data-race in unix_dgram_poll tipc: increase timeout in tipc_sk_enqueue() r6040: Restore MDIO clock frequency after MAC reset net/l2tp: Fix reference count leak in l2tp_udp_recv_core dccp: don't duplicate ccid when cloning dccp sock ptp: dp83640: don't define PAGE0 net-caif: avoid user-triggerable WARN_ON(1) bnx2x: Fix enabling network interfaces without VFs platform/chrome: cros_ec_proto: Send command again when timeout occurs parisc: fix crash with signals and alloca net: fix NULL pointer reference in cipso_v4_doi_free ath9k: fix OOB read ar9300_eeprom_restore_internal parport: remove non-zero check on count Revert "USB: xhci: fix U1/U2 handling for hardware with XHCI_INTEL_HOST quirk set" cifs: fix wrong release in sess_alloc_buffer() failed path mmc: rtsx_pci: Fix long reads when clock is prescaled gfs2: Don't call dlm after protocol is unmounted rpc: fix gss_svc_init cleanup on failure ARM: tegra: tamonten: Fix UART pad setting gpu: drm: amd: amdgpu: amdgpu_i2c: fix possible uninitialized-variable access in amdgpu_i2c_router_select_ddc_port() Bluetooth: skip invalid hci_sync_conn_complete_evt serial: 8250_pci: make setup_port() parameters explicitly unsigned hvsi: don't panic on tty_register_driver failure xtensa: ISS: don't panic in rs_init serial: 8250: Define RX trigger levels for OxSemi 950 devices s390/jump_label: print real address in a case of a jump label bug ipv4: ip_output.c: Fix out-of-bounds warning in ip_copy_addrs() video: fbdev: riva: Error out if 'pixclock' equals zero video: fbdev: kyro: Error out if 'pixclock' equals zero video: fbdev: asiliantfb: Error out if 'pixclock' equals zero bpf/tests: Do not PASS tests without actually testing the result bpf/tests: Fix copy-and-paste error in double word test tty: serial: jsm: hold port lock when reporting modem line changes usb: gadget: u_ether: fix a potential null pointer dereference usb: host: fotg210: fix the actual_length of an iso packet Smack: Fix wrong semantics in smk_access_entry() netlink: Deal with ESRCH error in nlmsg_notify() video: fbdev: kyro: fix a DoS bug by restricting user input iio: dac: ad5624r: Fix incorrect handling of an optional regulator. PCI: Use pci_update_current_state() in pci_enable_device_flags() crypto: mxs-dcp - Use sg_mapping_iter to copy data pinctrl: single: Fix error return code in pcs_parse_bits_in_pinctrl_entry() openrisc: don't printk() unconditionally PCI: Return ~0 data on pciconfig_read() CAP_SYS_ADMIN failure PCI: Restrict ASMedia ASM1062 SATA Max Payload Size Supported ARM: 9105/1: atags_to_fdt: don't warn about stack size libata: add ATA_HORKAGE_NO_NCQ_TRIM for Samsung 860 and 870 SSDs media: rc-loopback: return number of emitters rather than error media: uvc: don't do DMA on stack VMCI: fix NULL pointer dereference when unmapping queue pair power: supply: max17042: handle fails of reading status register xen: fix setting of max_pfn in shared_info PCI/MSI: Skip masking MSI-X on Xen PV rtc: tps65910: Correct driver module alias fbmem: don't allow too huge resolutions clk: kirkwood: Fix a clocking boot regression KVM: x86: Update vCPU's hv_clock before back to guest when tsc_offset is adjusted tty: Fix data race between tiocsti() and flush_to_ldisc() ipv4: make exception cache less predictible bcma: Fix memory leak for internally-handled cores ath6kl: wmi: fix an error code in ath6kl_wmi_sync_point() usb: ehci-orion: Handle errors of clk_prepare_enable() in probe i2c: mt65xx: fix IRQ check CIFS: Fix a potencially linear read overflow mmc: moxart: Fix issue with uninitialized dma_slave_config mmc: dw_mmc: Fix issue with uninitialized dma_slave_config i2c: s3c2410: fix IRQ check i2c: iop3xx: fix deferred probing Bluetooth: add timeout sanity check to hci_inquiry usb: gadget: mv_u3d: request_irq() after initializing UDC usb: phy: tahvo: add IRQ check usb: host: ohci-tmio: add IRQ check Bluetooth: Move shutdown callback before flushing tx and rx queue usb: phy: twl6030: add IRQ checks usb: phy: fsl-usb: add IRQ check usb: gadget: udc: at91: add IRQ check drm/msm/dsi: Fix some reference counted resource leaks Bluetooth: fix repeated calls to sco_sock_kill arm64: dts: exynos: correct GIC CPU interfaces address range on Exynos7 Bluetooth: increase BTNAMSIZ to 21 chars to fix potential buffer overflow PCI: PM: Enable PME if it can be signaled from D3cold i2c: highlander: add IRQ check net: cipso: fix warnings in netlbl_cipsov4_add_std tcp: seq_file: Avoid skipping sk during tcp_seek_last_pos Bluetooth: sco: prevent information leak in sco_conn_defer_accept() media: go7007: remove redundant initialization media: dvb-usb: fix uninit-value in vp702x_read_mac_addr media: dvb-usb: fix uninit-value in dvb_usb_adapter_dvb_init certs: Trigger creation of RSA module signing key if it's not an RSA key m68k: emu: Fix invalid free in nfeth_cleanup() udf_get_extendedattr() had no boundary checks. crypto: qat - fix reuse of completion variable crypto: qat - do not ignore errors from enable_vf2pf_comms() libata: fix ata_host_start() power: supply: max17042_battery: fix typo in MAx17042_TOFF crypto: omap-sham - clear dma flags only after omap_sham_update_dma_stop() crypto: mxs-dcp - Check for DMA mapping errors PCI: Call Max Payload Size-related fixup quirks early x86/reboot: Limit Dell Optiplex 990 quirk to early BIOS versions Revert "btrfs: compression: don't try to compress if we don't have enough pages" mm/page_alloc: speed up the iteration of max_order net: ll_temac: Remove left-over debug message powerpc/boot: Delete unneeded .globl _zimage_start powerpc/module64: Fix comment in R_PPC64_ENTRY handling mm/kmemleak.c: make cond_resched() rate-limiting more efficient s390/disassembler: correct disassembly lines alignment ipv4/icmp: l3mdev: Perform icmp error route lookup on source device routing table (v2) tc358743: fix register i2c_rd/wr function fix PM / wakeirq: Enable dedicated wakeirq for suspend USB: serial: mos7720: improve OOM-handling in read_mos_reg() usb: phy: isp1301: Fix build warning when CONFIG_OF is disabled igmp: Add ip_mc_list lock in ip_check_mc_rcu media: stkwebcam: fix memory leak in stk_camera_probe ath9k: Postpone key cache entry deletion for TXQ frames reference it ath: Modify ath_key_delete() to not need full key entry ath: Export ath_hw_keysetmac() ath9k: Clear key cache explicitly on disabling hardware ath: Use safer key clearing with key cache entries ALSA: pcm: fix divide error in snd_pcm_lib_ioctl ARM: 8918/2: only build return_address() if needed cryptoloop: add a deprecation warning qede: Fix memset corruption ARC: fix allnoconfig build warning xtensa: fix kconfig unmet dependency warning for HAVE_FUTEX_CMPXCHG ext4: fix race writing to an inline_data file while its xattrs are changing Change-Id: I0d3200388e095f977c784cba314b9cc061848c7a
| * Revert "btrfs: compression: don't try to compress if we don't have enough pages"Qu Wenruo2021-09-22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 4e9655763b82a91e4c341835bb504a2b1590f984 upstream. This reverts commit f2165627319ffd33a6217275e5690b1ab5c45763. [BUG] It's no longer possible to create compressed inline extent after commit f2165627319f ("btrfs: compression: don't try to compress if we don't have enough pages"). [CAUSE] For compression code, there are several possible reasons we have a range that needs to be compressed while it's no more than one page. - Compressed inline write The data is always smaller than one sector and the test lacks the condition to properly recognize a non-inline extent. - Compressed subpage write For the incoming subpage compressed write support, we require page alignment of the delalloc range. And for 64K page size, we can compress just one page into smaller sectors. For those reasons, the requirement for the data to be more than one page is not correct, and is already causing regression for compressed inline data writeback. The idea of skipping one page to avoid wasting CPU time could be revisited in the future. [FIX] Fix it by reverting the offending commit. Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Link: https://lore.kernel.org/linux-btrfs/afa2742.c084f5d6.17b6b08dffc@tnonline.net Fixes: f2165627319f ("btrfs: compression: don't try to compress if we don't have enough pages") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* | Merge branch 'android-4.4-p' of ↵Michael Bestas2021-08-12
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | https://android.googlesource.com/kernel/common into lineage-18.1-caf-msm8998 # By Pavel Skripkin (6) and others # Via Greg Kroah-Hartman * android-4.4-p: Linux 4.4.278 sis900: Fix missing pci_disable_device() in probe and remove tulip: windbond-840: Fix missing pci_disable_device() in probe and remove net: llc: fix skb_over_panic mlx4: Fix missing error code in mlx4_load_one() tipc: fix sleeping in tipc accept routine netfilter: nft_nat: allow to specify layer 4 protocol NAT only cfg80211: Fix possible memory leak in function cfg80211_bss_update x86/asm: Ensure asm/proto.h can be included stand-alone NIU: fix incorrect error return, missed in previous revert can: esd_usb2: fix memory leak can: ems_usb: fix memory leak can: usb_8dev: fix memory leak ocfs2: issue zeroout to EOF blocks ocfs2: fix zero out valid data ARM: ensure the signal page contains defined contents lib/string.c: add multibyte memset functions ARM: dts: versatile: Fix up interrupt controller node names hfs: add lock nesting notation to hfs_find_init hfs: fix high memory mapping in hfs_bnode_read hfs: add missing clean-up in hfs_fill_super sctp: move 198 addresses from unusable to private scope net/802/garp: fix memleak in garp_request_join() net/802/mrp: fix memleak in mrp_request_join() workqueue: fix UAF in pwq_unbound_release_workfn() af_unix: fix garbage collect vs MSG_PEEK net: split out functions related to registering inflight socket files Linux 4.4.277 btrfs: compression: don't try to compress if we don't have enough pages iio: accel: bma180: Fix BMA25x bandwidth register values iio: accel: bma180: Use explicit member assignment net: bcmgenet: ensure EXT_ENERGY_DET_MASK is clear media: ngene: Fix out-of-bounds bug in ngene_command_config_free_buf() tracing: Fix bug in rb_per_cpu_empty() that might cause deadloop. USB: serial: cp210x: add ID for CEL EM3588 USB ZigBee stick USB: serial: cp210x: fix comments for GE CS1000 USB: serial: option: add support for u-blox LARA-R6 family usb: renesas_usbhs: Fix superfluous irqs happen after usb_pkt_pop() usb: max-3421: Prevent corruption of freed memory USB: usb-storage: Add LaCie Rugged USB3-FW to IGNORE_UAS usb: hub: Disable USB 3 device initiated lpm if exit latency is too high KVM: PPC: Book3S: Fix H_RTAS rets buffer overflow xhci: Fix lost USB 2 remote wake ALSA: sb: Fix potential ABBA deadlock in CSP driver s390/ftrace: fix ftrace_update_ftrace_func implementation proc: Avoid mixing integer types in mem_rw() Revert "USB: quirks: ignore remote wake-up on Fibocom L850-GL LTE modem" scsi: target: Fix protect handling in WRITE SAME(32) scsi: iscsi: Fix iface sysfs attr detection netrom: Decrease sock refcount when sock timers expire net: decnet: Fix sleeping inside in af_decnet net: fix uninit-value in caif_seqpkt_sendmsg s390/bpf: Perform r1 range checking before accessing jit->seen_reg[r1] perf probe-file: Delete namelist in del_events() on the error path perf test bpf: Free obj_buf igb: Check if num of q_vectors is smaller than max before array access iavf: Fix an error handling path in 'iavf_probe()' ipv6: tcp: drop silly ICMPv6 packet too big messages tcp: annotate data races around tp->mtu_info net: validate lwtstate->data before returning from skb_tunnel_info() net: ti: fix UAF in tlan_remove_one net: moxa: fix UAF in moxart_mac_probe net: bcmgenet: Ensure all TX/RX queues DMAs are disabled net: ipv6: fix return value of ip6_skb_dst_mtu x86/fpu: Make init_fpstate correct with optimized XSAVE Revert "memory: fsl_ifc: fix leak of IO mapping on probe failure" sched/fair: Fix CFS bandwidth hrtimer expiry type scsi: aic7xxx: Fix unintentional sign extension issue on left shift of u8 kbuild: mkcompile_h: consider timestamp if KBUILD_BUILD_TIMESTAMP is set thermal/core: Correct function name thermal_zone_device_unregister() ARM: imx: pm-imx5: Fix references to imx5_cpu_suspend_info ARM: dts: imx6: phyFLEX: Fix UART hardware flow control ARM: dts: BCM63xx: Fix NAND nodes names ARM: brcmstb: dts: fix NAND nodes names Change-Id: Id59b93b8704270f45923f262facbadde4c486a15
| * btrfs: compression: don't try to compress if we don't have enough pagesDavid Sterba2021-07-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit f2165627319ffd33a6217275e5690b1ab5c45763 upstream The early check if we should attempt compression does not take into account the number of input pages. It can happen that there's only one page, eg. a tail page after some ranges of the BTRFS_MAX_UNCOMPRESSED have been processed, or an isolated page that won't be converted to an inline extent. The single page would be compressed but a later check would drop it again because the result size must be at least one block shorter than the input. That can never work with just one page. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: David Sterba <dsterba@suse.com> [sudip: adjust context] Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* | Merge branch 'android-4.4-p' of ↵Michael Bestas2020-12-30
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | https://android.googlesource.com/kernel/common into lineage-18.1-caf-msm8998 This brings LA.UM.9.2.r1-01800-SDMxx0.0 up to date with https://android.googlesource.com/kernel/common/ android-4.4-p at commit: 300d539b8e6e2 ANDROID: usb: f_accessory: Wrap '_acc_dev' in get()/put() accessors Conflicts: drivers/usb/gadget/function/f_accessory.c include/linux/spi/spi.h Change-Id: Ifef5bfcb9d92b6d560126f0216369c567476f55d
| * btrfs: fix return value mixup in btrfs_get_extentPavel Machek2020-12-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 881a3a11c2b858fe9b69ef79ac5ee9978a266dc9 upstream btrfs_get_extent() sets variable ret, but out: error path expect error to be in variable err so the error code is lost. Fixes: 6bf9e4bd6a27 ("btrfs: inode: Verify inode mode to avoid NULL pointer dereference") CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Pavel Machek (CIP) <pavel@denx.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> [sudip: adjust context] Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* | Merge branch 'android-4.4-p' of ↵Michael Bestas2020-12-09
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | https://android.googlesource.com/kernel/common into lineage-17.1-caf-msm8998 This brings LA.UM.8.4.r1-06200-8x98.0 up to date with https://android.googlesource.com/kernel/common/ android-4.4-p at commit: 4cb652f2d058e ANDROID: cuttlefish_defconfig: Disable CONFIG_KSM Conflicts: arch/arm64/include/asm/mmu_context.h arch/powerpc/include/asm/uaccess.h drivers/scsi/ufs/ufshcd.c Change-Id: I25e090fc1a5a7d379aa8f681371e9918b3adeda6
| * btrfs: inode: Verify inode mode to avoid NULL pointer dereferenceQu Wenruo2020-12-02
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 6bf9e4bd6a277840d3fe8c5d5d530a1fbd3db592 upstream [BUG] When accessing a file on a crafted image, btrfs can crash in block layer: BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 PGD 136501067 P4D 136501067 PUD 124519067 PMD 0 CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.0.0-rc8-default #252 RIP: 0010:end_bio_extent_readpage+0x144/0x700 Call Trace: <IRQ> blk_update_request+0x8f/0x350 blk_mq_end_request+0x1a/0x120 blk_done_softirq+0x99/0xc0 __do_softirq+0xc7/0x467 irq_exit+0xd1/0xe0 call_function_single_interrupt+0xf/0x20 </IRQ> RIP: 0010:default_idle+0x1e/0x170 [CAUSE] The crafted image has a tricky corruption, the INODE_ITEM has a different type against its parent dir: item 20 key (268 INODE_ITEM 0) itemoff 2808 itemsize 160 generation 13 transid 13 size 1048576 nbytes 1048576 block group 0 mode 121644 links 1 uid 0 gid 0 rdev 0 sequence 9 flags 0x0(none) This mode number 0120000 means it's a symlink. But the dir item think it's still a regular file: item 8 key (264 DIR_INDEX 5) itemoff 3707 itemsize 32 location key (268 INODE_ITEM 0) type FILE transid 13 data_len 0 name_len 2 name: f4 item 40 key (264 DIR_ITEM 51821248) itemoff 1573 itemsize 32 location key (268 INODE_ITEM 0) type FILE transid 13 data_len 0 name_len 2 name: f4 For symlink, we don't set BTRFS_I(inode)->io_tree.ops and leave it empty, as symlink is only designed to have inlined extent, all handled by tree block read. Thus no need to trigger btrfs_submit_bio_hook() for inline file extent. However end_bio_extent_readpage() expects tree->ops populated, as it's reading regular data extent. This causes NULL pointer dereference. [FIX] This patch fixes the problem in two ways: - Verify inode mode against its dir item when looking up inode So in btrfs_lookup_dentry() if we find inode mode mismatch with dir item, we error out so that corrupted inode will not be accessed. - Verify inode mode when getting extent mapping Only regular file should have regular or preallocated extent. If we found regular/preallocated file extent for symlink or the rest, we error out before submitting the read bio. With this fix that crafted image can be rejected gracefully: BTRFS critical (device loop0): inode mode mismatch with dir: inode mode=0121644 btrfs type=7 dir type=1 Reported-by: Yoon Jungyeon <jungyeon@gatech.edu> Link: https://bugzilla.kernel.org/show_bug.cgi?id=202763 Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> [sudip: use original btrfs_inode_type(), btrfs_crit with root->fs_info, ISREG with inode->i_mode and adjust context] Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* | Merge branch 'android-4.4-p' of ↵Michael Bestas2020-07-24
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | https://android.googlesource.com/kernel/common into lineage-17.1-caf-msm8998 This brings LA.UM.8.4.r1-05700-8x98.0 up to date with https://android.googlesource.com/kernel/common/ android-4.4-p at commit: 8476df741c780 BACKPORT: xtables: extend matches and targets with .usersize Conflicts: drivers/usb/gadget/function/f_uac1.c net/netlink/genetlink.c sound/core/compress_offload.c Change-Id: Id7b2fdf3942f1986edec869dcd965df632cc1c5f
| * btrfs: fix data block group relocation failure due to concurrent scrubFilipe Manana2020-07-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 432cd2a10f1c10cead91fe706ff5dc52f06d642a ] When running relocation of a data block group while scrub is running in parallel, it is possible that the relocation will fail and abort the current transaction with an -EINVAL error: [134243.988595] BTRFS info (device sdc): found 14 extents, stage: move data extents [134243.999871] ------------[ cut here ]------------ [134244.000741] BTRFS: Transaction aborted (error -22) [134244.001692] WARNING: CPU: 0 PID: 26954 at fs/btrfs/ctree.c:1071 __btrfs_cow_block+0x6a7/0x790 [btrfs] [134244.003380] Modules linked in: btrfs blake2b_generic xor raid6_pq (...) [134244.012577] CPU: 0 PID: 26954 Comm: btrfs Tainted: G W 5.6.0-rc7-btrfs-next-58 #5 [134244.014162] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 [134244.016184] RIP: 0010:__btrfs_cow_block+0x6a7/0x790 [btrfs] [134244.017151] Code: 48 c7 c7 (...) [134244.020549] RSP: 0018:ffffa41607863888 EFLAGS: 00010286 [134244.021515] RAX: 0000000000000000 RBX: ffff9614bdfe09c8 RCX: 0000000000000000 [134244.022822] RDX: 0000000000000001 RSI: ffffffffb3d63980 RDI: 0000000000000001 [134244.024124] RBP: ffff961589e8c000 R08: 0000000000000000 R09: 0000000000000001 [134244.025424] R10: ffffffffc0ae5955 R11: 0000000000000000 R12: ffff9614bd530d08 [134244.026725] R13: ffff9614ced41b88 R14: ffff9614bdfe2a48 R15: 0000000000000000 [134244.028024] FS: 00007f29b63c08c0(0000) GS:ffff9615ba600000(0000) knlGS:0000000000000000 [134244.029491] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [134244.030560] CR2: 00007f4eb339b000 CR3: 0000000130d6e006 CR4: 00000000003606f0 [134244.031997] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [134244.033153] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [134244.034484] Call Trace: [134244.034984] btrfs_cow_block+0x12b/0x2b0 [btrfs] [134244.035859] do_relocation+0x30b/0x790 [btrfs] [134244.036681] ? do_raw_spin_unlock+0x49/0xc0 [134244.037460] ? _raw_spin_unlock+0x29/0x40 [134244.038235] relocate_tree_blocks+0x37b/0x730 [btrfs] [134244.039245] relocate_block_group+0x388/0x770 [btrfs] [134244.040228] btrfs_relocate_block_group+0x161/0x2e0 [btrfs] [134244.041323] btrfs_relocate_chunk+0x36/0x110 [btrfs] [134244.041345] btrfs_balance+0xc06/0x1860 [btrfs] [134244.043382] ? btrfs_ioctl_balance+0x27c/0x310 [btrfs] [134244.045586] btrfs_ioctl_balance+0x1ed/0x310 [btrfs] [134244.045611] btrfs_ioctl+0x1880/0x3760 [btrfs] [134244.049043] ? do_raw_spin_unlock+0x49/0xc0 [134244.049838] ? _raw_spin_unlock+0x29/0x40 [134244.050587] ? __handle_mm_fault+0x11b3/0x14b0 [134244.051417] ? ksys_ioctl+0x92/0xb0 [134244.052070] ksys_ioctl+0x92/0xb0 [134244.052701] ? trace_hardirqs_off_thunk+0x1a/0x1c [134244.053511] __x64_sys_ioctl+0x16/0x20 [134244.054206] do_syscall_64+0x5c/0x280 [134244.054891] entry_SYSCALL_64_after_hwframe+0x49/0xbe [134244.055819] RIP: 0033:0x7f29b51c9dd7 [134244.056491] Code: 00 00 00 (...) [134244.059767] RSP: 002b:00007ffcccc1dd08 EFLAGS: 00000202 ORIG_RAX: 0000000000000010 [134244.061168] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f29b51c9dd7 [134244.062474] RDX: 00007ffcccc1dda0 RSI: 00000000c4009420 RDI: 0000000000000003 [134244.063771] RBP: 0000000000000003 R08: 00005565cea4b000 R09: 0000000000000000 [134244.065032] R10: 0000000000000541 R11: 0000000000000202 R12: 00007ffcccc2060a [134244.066327] R13: 00007ffcccc1dda0 R14: 0000000000000002 R15: 00007ffcccc1dec0 [134244.067626] irq event stamp: 0 [134244.068202] hardirqs last enabled at (0): [<0000000000000000>] 0x0 [134244.069351] hardirqs last disabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020 [134244.070909] softirqs last enabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020 [134244.072392] softirqs last disabled at (0): [<0000000000000000>] 0x0 [134244.073432] ---[ end trace bd7c03622e0b0a99 ]--- The -EINVAL error comes from the following chain of function calls: __btrfs_cow_block() <-- aborts the transaction btrfs_reloc_cow_block() replace_file_extents() get_new_location() <-- returns -EINVAL When relocating a data block group, for each allocated extent of the block group, we preallocate another extent (at prealloc_file_extent_cluster()), associated with the data relocation inode, and then dirty all its pages. These preallocated extents have, and must have, the same size that extents from the data block group being relocated have. Later before we start the relocation stage that updates pointers (bytenr field of file extent items) to point to the the new extents, we trigger writeback for the data relocation inode. The expectation is that writeback will write the pages to the previously preallocated extents, that it follows the NOCOW path. That is generally the case, however, if a scrub is running it may have turned the block group that contains those extents into RO mode, in which case writeback falls back to the COW path. However in the COW path instead of allocating exactly one extent with the expected size, the allocator may end up allocating several smaller extents due to free space fragmentation - because we tell it at cow_file_range() that the minimum allocation size can match the filesystem's sector size. This later breaks the relocation's expectation that an extent associated to a file extent item in the data relocation inode has the same size as the respective extent pointed by a file extent item in another tree - in this case the extent to which the relocation inode poins to is smaller, causing relocation.c:get_new_location() to return -EINVAL. For example, if we are relocating a data block group X that has a logical address of X and the block group has an extent allocated at the logical address X + 128KiB with a size of 64KiB: 1) At prealloc_file_extent_cluster() we allocate an extent for the data relocation inode with a size of 64KiB and associate it to the file offset 128KiB (X + 128KiB - X) of the data relocation inode. This preallocated extent was allocated at block group Z; 2) A scrub running in parallel turns block group Z into RO mode and starts scrubing its extents; 3) Relocation triggers writeback for the data relocation inode; 4) When running delalloc (btrfs_run_delalloc_range()), we try first the NOCOW path because the data relocation inode has BTRFS_INODE_PREALLOC set in its flags. However, because block group Z is in RO mode, the NOCOW path (run_delalloc_nocow()) falls back into the COW path, by calling cow_file_range(); 5) At cow_file_range(), in the first iteration of the while loop we call btrfs_reserve_extent() to allocate a 64KiB extent and pass it a minimum allocation size of 4KiB (fs_info->sectorsize). Due to free space fragmentation, btrfs_reserve_extent() ends up allocating two extents of 32KiB each, each one on a different iteration of that while loop; 6) Writeback of the data relocation inode completes; 7) Relocation proceeds and ends up at relocation.c:replace_file_extents(), with a leaf which has a file extent item that points to the data extent from block group X, that has a logical address (bytenr) of X + 128KiB and a size of 64KiB. Then it calls get_new_location(), which does a lookup in the data relocation tree for a file extent item starting at offset 128KiB (X + 128KiB - X) and belonging to the data relocation inode. It finds a corresponding file extent item, however that item points to an extent that has a size of 32KiB, which doesn't match the expected size of 64KiB, resuling in -EINVAL being returned from this function and propagated up to __btrfs_cow_block(), which aborts the current transaction. To fix this make sure that at cow_file_range() when we call the allocator we pass it a minimum allocation size corresponding the desired extent size if the inode belongs to the data relocation tree, otherwise pass it the filesystem's sector size as the minimum allocation size. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
| * btrfs: cow_file_range() num_bytes and disk_num_bytes are sameAnand Jain2020-07-09
| | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 3752d22fcea160cc2493e34f5e0e41cdd7fdd921 ] This patch deletes local variable disk_num_bytes as its value is same as num_bytes in the function cow_file_range(). Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
| * btrfs: fix error handling when submitting direct I/O bioOmar Sandoval2020-06-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 6d3113a193e3385c72240096fe397618ecab6e43 ] In btrfs_submit_direct_hook(), if a direct I/O write doesn't span a RAID stripe or chunk, we submit orig_bio without cloning it. In this case, we don't increment pending_bios. Then, if btrfs_submit_dio_bio() fails, we decrement pending_bios to -1, and we never complete orig_bio. Fix it by initializing pending_bios to 1 instead of incrementing later. Fixing this exposes another bug: we put orig_bio prematurely and then put it again from end_io. Fix it by not putting orig_bio. After this change, pending_bios is really more of a reference count, but I'll leave that cleanup separate to keep the fix small. Fixes: e65e15355429 ("btrfs: fix panic caused by direct IO") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* | Merge branch 'android-4.4-p' of ↵Michael Bestas2020-03-08
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | https://android.googlesource.com/kernel/common into lineage-17.1-caf-msm8998 This brings LA.UM.8.4.r1-05200-8x98.0 up to date with https://android.googlesource.com/kernel/common/ android-4.4-p at commit: 4db1ebdd40ec0 FROMLIST: HID: nintendo: add nintendo switch controller driver Conflicts: arch/arm64/boot/Makefile arch/arm64/kernel/psci.c arch/x86/configs/x86_64_cuttlefish_defconfig drivers/md/dm.c drivers/of/Kconfig drivers/thermal/thermal_core.c fs/proc/meminfo.c kernel/locking/spinlock_debug.c kernel/time/hrtimer.c net/wireless/util.c Change-Id: I5b5163497b7c6ab8487ffbb2d036e4cda01ed670
| * btrfs: do not call synchronize_srcu() in inode_tree_delJosef Bacik2020-01-04
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit f72ff01df9cf5db25c76674cac16605992d15467 ] Testing with the new fsstress uncovered a pretty nasty deadlock with lookup and snapshot deletion. Process A unlink -> final iput -> inode_tree_del -> synchronize_srcu(subvol_srcu) Process B btrfs_lookup <- srcu_read_lock() acquired here -> btrfs_iget -> find inode that has I_FREEING set -> __wait_on_freeing_inode() We're holding the srcu_read_lock() while doing the iget in order to make sure our fs root doesn't go away, and then we are waiting for the inode to finish freeing. However because the free'ing process is doing a synchronize_srcu() we deadlock. Fix this by dropping the synchronize_srcu() in inode_tree_del(). We don't need people to stop accessing the fs root at this point, we're only adding our empty root to the dead roots list. A larger much more invasive fix is forthcoming to address how we deal with fs roots, but this fixes the immediate problem. Fixes: 76dda93c6ae2 ("Btrfs: add snapshot/subvolume destroy ioctl") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* | Merge android-4.4.164 (564ce1b) into msm-4.4Srinivasarao P2018-11-21
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * refs/heads/tmp-564ce1b Linux 4.4.164 drm/i915/hdmi: Add HDMI 2.0 audio clock recovery N values drm/dp_mst: Check if primary mstb is null drm/rockchip: Allow driver to be shutdown on reboot/kexec mm: migration: fix migration of huge PMD shared pages hugetlbfs: fix kernel BUG at fs/hugetlbfs/inode.c:444! configfs: replace strncpy with memcpy fuse: fix leaked notify reply rtc: hctosys: Add missing range error reporting sunrpc: correct the computation for page_ptr when truncating mount: Prevent MNT_DETACH from disconnecting locked mounts mount: Don't allow copying MNT_UNBINDABLE|MNT_LOCKED mounts mount: Retest MNT_LOCKED in do_umount ext4: fix buffer leak in __ext4_read_dirblock() on error path ext4: fix buffer leak in ext4_xattr_move_to_block() on error path ext4: release bs.bh before re-using in ext4_xattr_block_find() ext4: fix possible leak of sbi->s_group_desc_leak in error path ext4: avoid possible double brelse() in add_new_gdb() on error path ext4: fix missing cleanup if ext4_alloc_flex_bg_array() fails while resizing ext4: avoid buffer leak in ext4_orphan_add() after prior errors ext4: fix possible inode leak in the retry loop of ext4_resize_fs() ext4: avoid potential extra brelse in setup_new_flex_group_blocks() ext4: add missing brelse() add_new_gdb_meta_bg()'s error path ext4: add missing brelse() in set_flexbg_block_bitmap()'s error path ext4: add missing brelse() update_backups()'s error path clockevents/drivers/i8253: Add support for PIT shutdown quirk Btrfs: fix data corruption due to cloning of eof block arch/alpha, termios: implement BOTHER, IBSHIFT and termios2 termios, tty/tty_baudrate.c: fix buffer overrun mtd: docg3: don't set conflicting BCH_CONST_PARAMS option mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings ocfs2: fix a misuse a of brelse after failing ocfs2_check_dir_entry vhost/scsi: truncate T10 PI iov_iter to prot_bytes mach64: fix image corruption due to reading accelerator registers mach64: fix display corruption on big endian machines libceph: bump CEPH_MSG_MAX_DATA_LEN clk: s2mps11: Fix matching when built as module and DT node contains compatible xtensa: fix boot parameters address translation xtensa: make sure bFLT stack is 16 byte aligned xtensa: add NOTES section to the linker script MIPS: Loongson-3: Fix BRIDGE irq delivery problem MIPS: Loongson-3: Fix CPU UART irq delivery problem bna: ethtool: Avoid reading past end of buffer e1000: fix race condition between e1000_down() and e1000_watchdog e1000: avoid null pointer dereference on invalid stat type mm: do not bug_on on incorrect length in __mm_populate() fs, elf: make sure to page align bss in load_elf_library mm: refuse wrapped vm_brk requests binfmt_elf: fix calculations for bss padding mm, elf: handle vm_brk error fuse: set FR_SENT while locked fuse: fix blocked_waitq wakeup fuse: Fix use-after-free in fuse_dev_do_write() fuse: Fix use-after-free in fuse_dev_do_read() scsi: qla2xxx: Fix incorrect port speed being set for FC adapters cdrom: fix improper type cast, which can leat to information leak. 9p: clear dangling pointers in p9stat_free 9p locks: fix glock.client_id leak in do_lock media: tvp5150: fix width alignment during set_selection() sc16is7xx: Fix for multi-channel stall powerpc/boot: Ensure _zimage_start is a weak symbol MIPS: kexec: Mark CPU offline before disabling local IRQ media: pci: cx23885: handle adding to list failure drm/omap: fix memory barrier bug in DMM driver powerpc/nohash: fix undefined behaviour when testing page size support tty: check name length in tty_find_polling_driver() MD: fix invalid stored role for a disk - try2 btrfs: set max_extent_size properly Btrfs: fix null pointer dereference on compressed write path error btrfs: qgroup: Dirty all qgroups before rescan Btrfs: fix wrong dentries after fsync of file that got its parent replaced btrfs: make sure we create all new block groups btrfs: reset max_extent_size on clear in a bitmap btrfs: wait on caching when putting the bg cache btrfs: don't attempt to trim devices that don't support it btrfs: iterate all devices during trim, instead of fs_devices::alloc_list btrfs: locking: Add extra check in btrfs_init_new_buffer() to avoid deadlock btrfs: Handle owner mismatch gracefully when walking up tree soc/tegra: pmc: Fix child-node lookup arm64: dts: stratix10: Correct System Manager register size Cramfs: fix abad comparison when wrap-arounds occur ext4: avoid running out of journal credits when appending to an inline file media: em28xx: make v4l2-compliance happier by starting sequence on zero media: em28xx: fix input name for Terratec AV 350 media: em28xx: use a default format if TRY_FMT fails xen: fix xen_qlock_wait() kgdboc: Passing ekgdboc to command line causes panic TC: Set DMA masks for devices MIPS: OCTEON: fix out of bounds array access on CN68XX powerpc/msi: Fix compile error on mpc83xx dm ioctl: harden copy_params()'s copy_from_user() from malicious users lockd: fix access beyond unterminated strings in prints nfsd: Fix an Oops in free_session() NFSv4.1: Fix the r/wsize checking genirq: Fix race on spurious interrupt detection printk: Fix panic caused by passing log_buf_len to command line smb3: on kerberos mount if server doesn't specify auth type use krb5 smb3: do not attempt cifs operation in smb3 query info error path smb3: allow stats which track session and share reconnects to be reset w1: omap-hdq: fix missing bus unregister at removal iio: adc: at91: fix wrong channel number in triggered buffer mode iio: adc: at91: fix acking DRDY irq on simple conversions kbuild: fix kernel/bounds.c 'W=1' warning hugetlbfs: dirty pages as they are added to pagecache ima: fix showing large 'violations' or 'runtime_measurements_count' crypto: lrw - Fix out-of bounds access on counter overflow signal/GenWQE: Fix sending of SIGKILL PCI: Add Device IDs for Intel GPU "spurious interrupt" quirk HID: hiddev: fix potential Spectre v1 ext4: initialize retries variable in ext4_da_write_inline_data_begin() gfs2_meta: ->mount() can get NULL dev_name jbd2: fix use after free in jbd2_log_do_checkpoint() libnvdimm: Hold reference on parent while scheduling async init net/ipv4: defensive cipso option parsing xen: make xen_qlock_wait() nestable xen: fix race in xen_qlock_wait() tpm: Restore functionality to xen vtpm driver. xen-swiotlb: use actually allocated size on check physical continuous ALSA: hda: Check the non-cached stream buffers more explicitly dmaengine: dma-jz4780: Return error if not probed from DT signal: Always deliver the kernel's SIGKILL and SIGSTOP to a pid namespace init scsi: lpfc: Correct soft lockup when running mds diagnostics uio: ensure class is registered before devices driver/dma/ioat: Call del_timer_sync() without holding prep_lock usb: chipidea: Prevent unbalanced IRQ disable MD: fix invalid stored role for a disk ext4: fix argument checking in EXT4_IOC_MOVE_EXT tpm: suppress transmit cmd error logs when TPM 1.2 is disabled/deactivated scsi: megaraid_sas: fix a missing-check bug scsi: esp_scsi: Track residual for PIO transfers ath10k: schedule hardware restart if WMI command times out pinctrl: ssbi-gpio: Fix pm8xxx_pin_config_get() to be compliant pinctrl: spmi-mpp: Fix pmic_mpp_config_get() to be compliant pinctrl: qcom: spmi-mpp: Fix drive strength setting ACPI / LPSS: Add alternative ACPI HIDs for Cherry Trail DMA controllers kprobes: Return error if we fail to reuse kprobe instead of BUG_ON() pinctrl: qcom: spmi-mpp: Fix err handling of pmic_mpp_set_mux x86: boot: Fix EFI stub alignment Bluetooth: btbcm: Add entry for BCM4335C0 UART bluetooth mmc: sdhci-pci-o2micro: Add quirk for O2 Micro dev 0x8620 rev 0x01 perf tools: Cleanup trace-event-info 'tdata' leak perf tools: Free temporary 'sys' string in read_event_files() tun: Consistently configure generic netdev params via rtnetlink swim: fix cleanup on setup error ataflop: fix error handling during setup locking/lockdep: Fix debug_locks off performance problem selftests: ftrace: Add synthetic event syntax testcase net: qla3xxx: Remove overflowing shift statement x86/fpu: Remove second definition of fpu in __fpu__restore_sig() sparc: Fix single-pcr perf event counter management. x86/kconfig: Fall back to ticket spinlocks x86/corruption-check: Fix panic in memory_corruption_check() when boot option without value is provided ALSA: ca0106: Disable IZD on SB0570 DAC to fix audio pops ALSA: hda - Add mic quirk for the Lenovo G50-30 (17aa:3905) parisc: Fix map_pages() to not overwrite existing pte entries parisc: Fix address in HPMC IVA ipmi: Fix timer race with module unload pcmcia: Implement CLKRUN protocol disabling for Ricoh bridges jffs2: free jffs2_sb_info through jffs2_kill_sb() hwmon: (pmbus) Fix page count auto-detection. bcache: fix miss key refill->end in writeback ANDROID: zram: set comp_len to PAGE_SIZE when page is huge Conflicts: drivers/hid/usbhid/hiddev.c Change-Id: I42874613e3b4102ef4ed051e1e8ed25b2d4ae7f2 Signed-off-by: Srinivasarao P <spathi@codeaurora.org>
| * Btrfs: fix null pointer dereference on compressed write path errorFilipe Manana2018-11-21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 3527a018c00e5dbada2f9d7ed5576437b6dd5cfb upstream. At inode.c:compress_file_range(), under the "free_pages_out" label, we can end up dereferencing the "pages" pointer when it has a NULL value. This case happens when "start" has a value of 0 and we fail to allocate memory for the "pages" pointer. When that happens we jump to the "cont" label and then enter the "if (start == 0)" branch where we immediately call the cow_file_range_inline() function. If that function returns 0 (success creating an inline extent) or an error (like -ENOMEM for example) we jump to the "free_pages_out" label and then access "pages[i]" leading to a NULL pointer dereference, since "nr_pages" has a value greater than zero at that point. Fix this by setting "nr_pages" to 0 when we fail to allocate memory for the "pages" pointer. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=201119 Fixes: 771ed689d2cd ("Btrfs: Optimize compressed writeback and reads") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* | Merge "Merge android-4.4.139 (7ba5557) into msm-4.4"Linux Build Service Account2018-07-10
|\|
| * Btrfs: fix unexpected cow in run_delalloc_nocowLiu Bo2018-07-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 5811375325420052fcadd944792a416a43072b7f upstream. Fstests generic/475 provides a way to fail metadata reads while checking if checksum exists for the inode inside run_delalloc_nocow(), and csum_exist_in_range() interprets error (-EIO) as inode having checksum and makes its caller enter the cow path. In case of free space inode, this ends up with a warning in cow_file_range(). The same problem applies to btrfs_cross_ref_exist() since it may also read metadata in between. With this, run_delalloc_nocow() bails out when errors occur at the two places. cc: <stable@vger.kernel.org> v2.6.28+ Fixes: 17d217fe970d ("Btrfs: fix nodatasum handling in balancing code") Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* | Revert "do d_instantiate/unlock_new_inode combinations safely"Gustavo Solaira2018-07-03
|/ | | | | | | | This reverts commit 03bb7588942a38623f108b3302c2d1aebb525696. Causes oops with security smack enabled. Change-Id: I14fb2b0841c6b71940bd3f08bd4b49b1d7b039a3 Signed-off-by: Gustavo Solaira <gustavos@codeaurora.org>
* do d_instantiate/unlock_new_inode combinations safelyAl Viro2018-05-30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 1e2e547a93a00ebc21582c06ca3c6cfea2a309ee upstream. For anything NFS-exported we do _not_ want to unlock new inode before it has grown an alias; original set of fixes got the ordering right, but missed the nasty complication in case of lockdep being enabled - unlock_new_inode() does lockdep_annotate_inode_mutex_key(inode) which can only be done before anyone gets a chance to touch ->i_mutex. Unfortunately, flipping the order and doing unlock_new_inode() before d_instantiate() opens a window when mkdir can race with open-by-fhandle on a guessed fhandle, leading to multiple aliases for a directory inode and all the breakage that follows from that. Correct solution: a new primitive (d_instantiate_new()) combining these two in the right order - lockdep annotate, then d_instantiate(), then the rest of unlock_new_inode(). All combinations of d_instantiate() with unlock_new_inode() should be converted to that. Cc: stable@kernel.org # 2.6.29 and later Tested-by: Mike Marshall <hubcap@omnibond.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* Btrfs: fix deadlock in run_delalloc_nocowLiu Bo2018-02-22
| | | | | | | | | | | | | | | commit e89166990f11c3f21e1649d760dd35f9e410321c upstream. @cur_offset is not set back to what it should be (@cow_start) if btrfs_next_leaf() returns something wrong, and the range [cow_start, cur_offset) remains locked forever. cc: <stable@vger.kernel.org> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* btrfs: Handle btrfs_set_extent_delalloc failure in fixup workerNikolay Borisov2018-02-16
| | | | | | | | | | | | | | | | | commit f3038ee3a3f1017a1cbe9907e31fa12d366c5dcb upstream. This function was introduced by 247e743cbe6e ("Btrfs: Use async helpers to deal with pages that have been improperly dirtied") and it didn't do any error handling then. This function might very well fail in ENOMEM situation, yet it's not handled, this could lead to inconsistent state. So let's handle the failure by setting the mapping error bit. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* btrfs: add missing memset while reading compressed inline extentsZygo Blaxell2017-12-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit e1699d2d7bf6e6cce3e1baff19f9dd4595a58664 ] This is a story about 4 distinct (and very old) btrfs bugs. Commit c8b978188c ("Btrfs: Add zlib compression support") added three data corruption bugs for inline extents (bugs #1-3). Commit 93c82d5750 ("Btrfs: zero page past end of inline file items") fixed bug #1: uncompressed inline extents followed by a hole and more extents could get non-zero data in the hole as they were read. The fix was to add a memset in btrfs_get_extent to zero out the hole. Commit 166ae5a418 ("btrfs: fix inline compressed read err corruption") fixed bug #2: compressed inline extents which contained non-zero bytes might be replaced with zero bytes in some cases. This patch removed an unhelpful memset from uncompress_inline, but the case where memset is required was missed. There is also a memset in the decompression code, but this only covers decompressed data that is shorter than the ram_bytes from the extent ref record. This memset doesn't cover the region between the end of the decompressed data and the end of the page. It has also moved around a few times over the years, so there's no single patch to refer to. This patch fixes bug #3: compressed inline extents followed by a hole and more extents could get non-zero data in the hole as they were read (i.e. bug #3 is the same as bug #1, but s/uncompressed/compressed/). The fix is the same: zero out the hole in the compressed case too, by putting a memset back in uncompress_inline, but this time with correct parameters. The last and oldest bug, bug #0, is the cause of the offending inline extent/hole/extent pattern. Bug #0 is a subtle and mostly-harmless quirk of behavior somewhere in the btrfs write code. In a few special cases, an inline extent and hole are allowed to persist where they normally would be combined with later extents in the file. A fast reproducer for bug #0 is presented below. A few offending extents are also created in the wild during large rsync transfers with the -S flag. A Linux kernel build (git checkout; make allyesconfig; make -j8) will produce a handful of offending files as well. Once an offending file is created, it can present different content to userspace each time it is read. Bug #0 is at least 4 and possibly 8 years old. I verified every vX.Y kernel back to v3.5 has this behavior. There are fossil records of this bug's effects in commits all the way back to v2.6.32. I have no reason to believe bug #0 wasn't present at the beginning of btrfs compression support in v2.6.29, but I can't easily test kernels that old to be sure. It is not clear whether bug #0 is worth fixing. A fix would likely require injecting extra reads into currently write-only paths, and most of the exceptional cases caused by bug #0 are already handled now. Whether we like them or not, bug #0's inline extents followed by holes are part of the btrfs de-facto disk format now, and we need to be able to read them without data corruption or an infoleak. So enough about bug #0, let's get back to bug #3 (this patch). An example of on-disk structure leading to data corruption found in the wild: item 61 key (606890 INODE_ITEM 0) itemoff 9662 itemsize 160 inode generation 50 transid 50 size 47424 nbytes 49141 block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0 flags 0x0(none) item 62 key (606890 INODE_REF 603050) itemoff 9642 itemsize 20 inode ref index 3 namelen 10 name: DB_File.so item 63 key (606890 EXTENT_DATA 0) itemoff 8280 itemsize 1362 inline extent data size 1341 ram 4085 compress(zlib) item 64 key (606890 EXTENT_DATA 4096) itemoff 8227 itemsize 53 extent data disk byte 5367308288 nr 20480 extent data offset 0 nr 45056 ram 45056 extent compression(zlib) Different data appears in userspace during each read of the 11 bytes between 4085 and 4096. The extent in item 63 is not long enough to fill the first page of the file, so a memset is required to fill the space between item 63 (ending at 4085) and item 64 (beginning at 4096) with zero. Here is a reproducer from Liu Bo, which demonstrates another method of creating the same inline extent and hole pattern: Using 'page_poison=on' kernel command line (or enable CONFIG_PAGE_POISONING) run the following: # touch foo # chattr +c foo # xfs_io -f -c "pwrite -W 0 1000" foo # xfs_io -f -c "falloc 4 8188" foo # od -x foo # echo 3 >/proc/sys/vm/drop_caches # od -x foo This produce the following on my box: Correct output: file contains 1000 data bytes followed by zeros: 0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd * 0001740 cdcd cdcd cdcd cdcd 0000 0000 0000 0000 0001760 0000 0000 0000 0000 0000 0000 0000 0000 * 0020000 Actual output: the data after the first 1000 bytes will be different each run: 0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd * 0001740 cdcd cdcd cdcd cdcd 6c63 7400 635f 006d 0001760 5f74 6f43 7400 435f 0053 5f74 7363 7400 0002000 435f 0056 5f74 6164 7400 645f 0062 5f74 (...) Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Chris Mason <clm@fb.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* Btrfs: adjust outstanding_extents counter properly when dio write is splitLiu Bo2017-08-06
| | | | | | | | | | | | | | | | | | | | [ Upstream commit c2931667c83ded6504b3857e99cc45b21fa496fb ] Currently how btrfs dio deals with split dio write is not good enough if dio write is split into several segments due to the lack of contiguous space, a large dio write like 'dd bs=1G count=1' can end up with incorrect outstanding_extents counter and endio would complain loudly with an assertion. This fixes the problem by compensating the outstanding_extents counter in inode if a large dio write gets split. Reported-by: Anand Jain <anand.jain@oracle.com> Tested-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* Btrfs: fix truncate down when no_holes feature is enabledLiu Bo2017-07-05
| | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 91298eec05cd8d4e828cf7ee5d4a6334f70cf69a ] For such a file mapping, [0-4k][hole][8k-12k] In NO_HOLES mode, we don't have the [hole] extent any more. Commit c1aa45759e90 ("Btrfs: fix shrinking truncate when the no_holes feature is enabled") fixed disk isize not being updated in NO_HOLES mode when data is not flushed. However, even if data has been flushed, we can still have trouble in updating disk isize since we updated disk isize to 'start' of the last evicted extent. Reviewed-by: Chris Mason <clm@fb.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* btrfs: use correct types for page indices in btrfs_page_exists_in_rangeDavid Sterba2017-06-14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit cc2b702c52094b637a351d7491ac5200331d0445 upstream. Variables start_idx and end_idx are supposed to hold a page index derived from the file offsets. The int type is not the right one though, offsets larger than 1 << 44 will get silently trimmed off the high bits. (1 << 44 is 16TiB) What can go wrong, if start is below the boundary and end gets trimmed: - if there's a page after start, we'll find it (radix_tree_gang_lookup_slot) - the final check "if (page->index <= end_idx)" will unexpectedly fail The function will return false, ie. "there's no page in the range", although there is at least one. btrfs_page_exists_in_range is used to prevent races in: * in hole punching, where we make sure there are not pages in the truncated range, otherwise we'll wait for them to finish and redo truncation, but we're going to replace the pages with holes anyway so the only problem is the intermediate state * lock_extent_direct: we want to make sure there are no pages before we lock and start DIO, to prevent stale data reads For practical occurence of the bug, there are several constaints. The file must be quite large, the affected range must cross the 16TiB boundary and the internal state of the file pages and pending operations must match. Also, we must not have started any ordered data in the range, otherwise we don't even reach the buggy function check. DIO locking tries hard in several places to avoid deadlocks with buffered IO and avoids waiting for ranges. The worst consequence seems to be stale data read. CC: Liu Bo <bo.li.liu@oracle.com> Fixes: fc4adbff823f7 ("btrfs: Drop EXTENT_UPTODATE check in hole punching and direct locking") Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* btrfs: qgroup: Prevent qgroup->reserved from going subzeroGoldwyn Rodrigues2016-11-18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 0b34c261e235a5c74dcf78bd305845bd15fe2b42 upstream. While free'ing qgroup->reserved resources, we much check if the page has not been invalidated by a truncate operation by checking if the page is still dirty before reducing the qgroup resources. Resources in such a case are free'd when the entire extent is released by delayed_ref. This fixes a double accounting while releasing resources in case of truncating a file, reproduced by the following testcase. SCRATCH_DEV=/dev/vdb SCRATCH_MNT=/mnt mkfs.btrfs -f $SCRATCH_DEV mount -t btrfs $SCRATCH_DEV $SCRATCH_MNT cd $SCRATCH_MNT btrfs quota enable $SCRATCH_MNT btrfs subvolume create a btrfs qgroup limit 500m a $SCRATCH_MNT sync for c in {1..15}; do dd if=/dev/zero bs=1M count=40 of=$SCRATCH_MNT/a/file; done sleep 10 sync sleep 5 touch $SCRATCH_MNT/a/newfile echo "Removing file" rm $SCRATCH_MNT/a/file Fixes: b9d0b38928 ("btrfs: Add handler for invalidate page") Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* Btrfs: fix deadlock running delayed iputs at transaction commit timeFilipe Manana2016-03-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit c2d6cb1636d235257086f939a8194ef0bf93af6e upstream. While running a stress test I ran into a deadlock when running the delayed iputs at transaction time, which produced the following report and trace: [ 886.399989] ============================================= [ 886.400871] [ INFO: possible recursive locking detected ] [ 886.401663] 4.4.0-rc6-btrfs-next-18+ #1 Not tainted [ 886.402384] --------------------------------------------- [ 886.403182] fio/8277 is trying to acquire lock: [ 886.403568] (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs] [ 886.403568] [ 886.403568] but task is already holding lock: [ 886.403568] (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs] [ 886.403568] [ 886.403568] other info that might help us debug this: [ 886.403568] Possible unsafe locking scenario: [ 886.403568] [ 886.403568] CPU0 [ 886.403568] ---- [ 886.403568] lock(&fs_info->delayed_iput_sem); [ 886.403568] lock(&fs_info->delayed_iput_sem); [ 886.403568] [ 886.403568] *** DEADLOCK *** [ 886.403568] [ 886.403568] May be due to missing lock nesting notation [ 886.403568] [ 886.403568] 3 locks held by fio/8277: [ 886.403568] #0: (sb_writers#11){.+.+.+}, at: [<ffffffff81174c4c>] __sb_start_write+0x5f/0xb0 [ 886.403568] #1: (&sb->s_type->i_mutex_key#15){+.+.+.}, at: [<ffffffffa054620d>] btrfs_file_write_iter+0x73/0x408 [btrfs] [ 886.403568] #2: (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs] [ 886.403568] [ 886.403568] stack backtrace: [ 886.403568] CPU: 6 PID: 8277 Comm: fio Not tainted 4.4.0-rc6-btrfs-next-18+ #1 [ 886.403568] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014 [ 886.403568] 0000000000000000 ffff88009f80f770 ffffffff8125d4fd ffffffff82af1fc0 [ 886.403568] ffff88009f80f830 ffffffff8108e5f9 0000000200000000 ffff88009fd92290 [ 886.403568] 0000000000000000 ffffffff82af1fc0 ffffffff829cfb01 00042b216d008804 [ 886.403568] Call Trace: [ 886.403568] [<ffffffff8125d4fd>] dump_stack+0x4e/0x79 [ 886.403568] [<ffffffff8108e5f9>] __lock_acquire+0xd42/0xf0b [ 886.403568] [<ffffffff810c22db>] ? __module_address+0xdf/0x108 [ 886.403568] [<ffffffff8108eb77>] lock_acquire+0x10d/0x194 [ 886.403568] [<ffffffff8108eb77>] ? lock_acquire+0x10d/0x194 [ 886.403568] [<ffffffffa0538823>] ? btrfs_run_delayed_iputs+0x36/0xbf [btrfs] [ 886.489542] [<ffffffff8148556b>] down_read+0x3e/0x4d [ 886.489542] [<ffffffffa0538823>] ? btrfs_run_delayed_iputs+0x36/0xbf [btrfs] [ 886.489542] [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs] [ 886.489542] [<ffffffffa0533953>] btrfs_commit_transaction+0x8f5/0x96e [btrfs] [ 886.489542] [<ffffffffa0521d7a>] flush_space+0x435/0x44a [btrfs] [ 886.489542] [<ffffffffa052218b>] ? reserve_metadata_bytes+0x26a/0x384 [btrfs] [ 886.489542] [<ffffffffa05221ae>] reserve_metadata_bytes+0x28d/0x384 [btrfs] [ 886.489542] [<ffffffffa052256c>] ? btrfs_block_rsv_refill+0x58/0x96 [btrfs] [ 886.489542] [<ffffffffa0522584>] btrfs_block_rsv_refill+0x70/0x96 [btrfs] [ 886.489542] [<ffffffffa053d747>] btrfs_evict_inode+0x394/0x55a [btrfs] [ 886.489542] [<ffffffff81188e31>] evict+0xa7/0x15c [ 886.489542] [<ffffffff81189878>] iput+0x1d3/0x266 [ 886.489542] [<ffffffffa053887c>] btrfs_run_delayed_iputs+0x8f/0xbf [btrfs] [ 886.489542] [<ffffffffa0533953>] btrfs_commit_transaction+0x8f5/0x96e [btrfs] [ 886.489542] [<ffffffff81085096>] ? signal_pending_state+0x31/0x31 [ 886.489542] [<ffffffffa0521191>] btrfs_alloc_data_chunk_ondemand+0x1d7/0x288 [btrfs] [ 886.489542] [<ffffffffa0521282>] btrfs_check_data_free_space+0x40/0x59 [btrfs] [ 886.489542] [<ffffffffa05228f5>] btrfs_delalloc_reserve_space+0x1e/0x4e [btrfs] [ 886.489542] [<ffffffffa053620a>] btrfs_direct_IO+0x10c/0x27e [btrfs] [ 886.489542] [<ffffffff8111d9a1>] generic_file_direct_write+0xb3/0x128 [ 886.489542] [<ffffffffa05463c3>] btrfs_file_write_iter+0x229/0x408 [btrfs] [ 886.489542] [<ffffffff8108ae38>] ? __lock_is_held+0x38/0x50 [ 886.489542] [<ffffffff8117279e>] __vfs_write+0x7c/0xa5 [ 886.489542] [<ffffffff81172cda>] vfs_write+0xa0/0xe4 [ 886.489542] [<ffffffff811734cc>] SyS_write+0x50/0x7e [ 886.489542] [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f [ 1081.852335] INFO: task fio:8244 blocked for more than 120 seconds. [ 1081.854348] Not tainted 4.4.0-rc6-btrfs-next-18+ #1 [ 1081.857560] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1081.863227] fio D ffff880213f9bb28 0 8244 8240 0x00000000 [ 1081.868719] ffff880213f9bb28 00ffffff810fc6b0 ffffffff0000000a ffff88023ed55240 [ 1081.872499] ffff880206b5d400 ffff880213f9c000 ffff88020a4d5318 ffff880206b5d400 [ 1081.876834] ffffffff00000001 ffff880206b5d400 ffff880213f9bb40 ffffffff81482ba4 [ 1081.880782] Call Trace: [ 1081.881793] [<ffffffff81482ba4>] schedule+0x7f/0x97 [ 1081.883340] [<ffffffff81485eb5>] rwsem_down_write_failed+0x2d5/0x325 [ 1081.895525] [<ffffffff8108d48d>] ? trace_hardirqs_on_caller+0x16/0x1ab [ 1081.897419] [<ffffffff81269723>] call_rwsem_down_write_failed+0x13/0x20 [ 1081.899251] [<ffffffff81269723>] ? call_rwsem_down_write_failed+0x13/0x20 [ 1081.901063] [<ffffffff81089fae>] ? __down_write_nested.isra.0+0x1f/0x21 [ 1081.902365] [<ffffffff814855bd>] down_write+0x43/0x57 [ 1081.903846] [<ffffffffa05211b0>] ? btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs] [ 1081.906078] [<ffffffffa05211b0>] btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs] [ 1081.908846] [<ffffffff8108d461>] ? mark_held_locks+0x56/0x6c [ 1081.910409] [<ffffffffa0521282>] btrfs_check_data_free_space+0x40/0x59 [btrfs] [ 1081.912482] [<ffffffffa05228f5>] btrfs_delalloc_reserve_space+0x1e/0x4e [btrfs] [ 1081.914597] [<ffffffffa053620a>] btrfs_direct_IO+0x10c/0x27e [btrfs] [ 1081.919037] [<ffffffff8111d9a1>] generic_file_direct_write+0xb3/0x128 [ 1081.920754] [<ffffffffa05463c3>] btrfs_file_write_iter+0x229/0x408 [btrfs] [ 1081.922496] [<ffffffff8108ae38>] ? __lock_is_held+0x38/0x50 [ 1081.923922] [<ffffffff8117279e>] __vfs_write+0x7c/0xa5 [ 1081.925275] [<ffffffff81172cda>] vfs_write+0xa0/0xe4 [ 1081.926584] [<ffffffff811734cc>] SyS_write+0x50/0x7e [ 1081.927968] [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f [ 1081.985293] INFO: lockdep is turned off. [ 1081.986132] INFO: task fio:8249 blocked for more than 120 seconds. [ 1081.987434] Not tainted 4.4.0-rc6-btrfs-next-18+ #1 [ 1081.988534] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1081.990147] fio D ffff880218febbb8 0 8249 8240 0x00000000 [ 1081.991626] ffff880218febbb8 00ffffff81486b8e ffff88020000000b ffff88023ed75240 [ 1081.993258] ffff8802120a9a00 ffff880218fec000 ffff88020a4d5318 ffff8802120a9a00 [ 1081.994850] ffffffff00000001 ffff8802120a9a00 ffff880218febbd0 ffffffff81482ba4 [ 1081.996485] Call Trace: [ 1081.997037] [<ffffffff81482ba4>] schedule+0x7f/0x97 [ 1081.998017] [<ffffffff81485eb5>] rwsem_down_write_failed+0x2d5/0x325 [ 1081.999241] [<ffffffff810852a5>] ? finish_wait+0x6d/0x76 [ 1082.000306] [<ffffffff81269723>] call_rwsem_down_write_failed+0x13/0x20 [ 1082.001533] [<ffffffff81269723>] ? call_rwsem_down_write_failed+0x13/0x20 [ 1082.002776] [<ffffffff81089fae>] ? __down_write_nested.isra.0+0x1f/0x21 [ 1082.003995] [<ffffffff814855bd>] down_write+0x43/0x57 [ 1082.005000] [<ffffffffa05211b0>] ? btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs] [ 1082.007403] [<ffffffffa05211b0>] btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs] [ 1082.008988] [<ffffffffa0545064>] btrfs_fallocate+0x7c1/0xc2f [btrfs] [ 1082.010193] [<ffffffff8108a1ba>] ? percpu_down_read+0x4e/0x77 [ 1082.011280] [<ffffffff81174c4c>] ? __sb_start_write+0x5f/0xb0 [ 1082.012265] [<ffffffff81174c4c>] ? __sb_start_write+0x5f/0xb0 [ 1082.013021] [<ffffffff811712e4>] vfs_fallocate+0x170/0x1ff [ 1082.013738] [<ffffffff81181ebb>] ioctl_preallocate+0x89/0x9b [ 1082.014778] [<ffffffff811822d7>] do_vfs_ioctl+0x40a/0x4ea [ 1082.015778] [<ffffffff81176ea7>] ? SYSC_newfstat+0x25/0x2e [ 1082.016806] [<ffffffff8118b4de>] ? __fget_light+0x4d/0x71 [ 1082.017789] [<ffffffff8118240e>] SyS_ioctl+0x57/0x79 [ 1082.018706] [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f This happens because we can recursively acquire the semaphore fs_info->delayed_iput_sem when attempting to allocate space to satisfy a file write request as shown in the first trace above - when committing a transaction we acquire (down_read) the semaphore before running the delayed iputs, and when running a delayed iput() we can end up calling an inode's eviction handler, which in turn commits another transaction and attempts to acquire (down_read) again the semaphore to run more delayed iput operations. This results in a deadlock because if a task acquires multiple times a semaphore it should invoke down_read_nested() with a different lockdep class for each level of recursion. Fix this by simplifying the implementation and use a mutex instead that is acquired by the cleaner kthread before it runs the delayed iputs instead of always acquiring a semaphore before delayed references are run from anywhere. Fixes: d7c151717a1e (btrfs: Fix NO_SPACE bug caused by delayed-iput) Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* Btrfs: fix transaction handle leak on failure to create hard linkFilipe Manana2016-03-03
| | | | | | | | | | | | | | | commit 271dba4521aed0c37c063548f876b49f5cd64b2e upstream. If we failed to create a hard link we were not always releasing the the transaction handle we got before, resulting in a memory leak and preventing any other tasks from being able to commit the current transaction. Fix this by always releasing our transaction handle. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* Btrfs: fix number of transaction units required to create symlinkFilipe Manana2016-03-03
| | | | | | | | | | | | | commit 9269d12b2d57d9e3d13036bb750762d1110d425c upstream. We weren't accounting for the insertion of an inline extent item for the symlink inode nor that we need to update the parent inode item (through the call to btrfs_add_nondir()). So fix this by including two more transaction units. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* Btrfs: igrab inode in writepageJosef Bacik2016-03-03
| | | | | | | | | | | | | | | | | | commit be7bd730841e69fe8f70120098596f648cd1f3ff upstream. We hit this panic on a few of our boxes this week where we have an ordered_extent with an NULL inode. We do an igrab() of the inode in writepages, but weren't doing it in writepage which can be called directly from the VM on dirty pages. If the inode has been unlinked then we could have I_FREEING set which means igrab() would return NULL and we get this panic. Fix this by trying to igrab in btrfs_writepage, and if it returns NULL then just redirty the page and return AOP_WRITEPAGE_ACTIVATE; so the VM knows it wasn't successful. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* Btrfs: fix direct IO requests not reporting IO error to user spaceFilipe Manana2016-02-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 1636d1d77ef4e01e57f706a4cae3371463896136 upstream. If a bio for a direct IO request fails, we were not setting the error in the parent bio (the main DIO bio), making us not return the error to user space in btrfs_direct_IO(), that is, it made __blockdev_direct_IO() return the number of bytes issued for IO and not the error a bio created and submitted by btrfs_submit_direct() got from the block layer. This essentially happens because when we call: dio_end_io(dio_bio, bio->bi_error); It does not set dio_bio->bi_error to the value of the second argument. So just add this missing assignment in endio callbacks, just as we do in the error path at btrfs_submit_direct() when we fail to clone the dio bio or allocate its private object. This follows the convention of what is done with other similar APIs such as bio_endio() where the caller is responsible for setting the bi_error field in the bio it passes as an argument to bio_endio(). This was detected by the new generic test cases in xfstests: 271, 272, 276 and 278. Which essentially setup a dm error target, then load the error table, do a direct IO write and unload the error table. They expect the write to fail with -EIO, which was not getting reported when testing against btrfs. Fixes: 4246a0b63bd8 ("block: add a bi_error field to struct bio") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* btrfs: properly set the termination value of ctx->pos in readdirDavid Sterba2016-02-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit bc4ef7592f657ae81b017207a1098817126ad4cb upstream. The value of ctx->pos in the last readdir call is supposed to be set to INT_MAX due to 32bit compatibility, unless 'pos' is intentially set to a larger value, then it's LLONG_MAX. There's a report from PaX SIZE_OVERFLOW plugin that "ctx->pos++" overflows (https://forums.grsecurity.net/viewtopic.php?f=1&t=4284), on a 64bit arch, where the value is 0x7fffffffffffffff ie. LLONG_MAX before the increment. We can get to that situation like that: * emit all regular readdir entries * still in the same call to readdir, bump the last pos to INT_MAX * next call to readdir will not emit any entries, but will reach the bump code again, finds pos to be INT_MAX and sets it to LLONG_MAX Normally this is not a problem, but if we call readdir again, we'll find 'pos' set to LLONG_MAX and the unconditional increment will overflow. The report from Victor at (http://thread.gmane.org/gmane.comp.file-systems.btrfs/49500) with debugging print shows that pattern: Overflow: e Overflow: 7fffffff Overflow: 7fffffffffffffff PAX: size overflow detected in function btrfs_real_readdir fs/btrfs/inode.c:5760 cicus.935_282 max, count: 9, decl: pos; num: 0; context: dir_context; CPU: 0 PID: 2630 Comm: polkitd Not tainted 4.2.3-grsec #1 Hardware name: Gigabyte Technology Co., Ltd. H81ND2H/H81ND2H, BIOS F3 08/11/2015 ffffffff81901608 0000000000000000 ffffffff819015e6 ffffc90004973d48 ffffffff81742f0f 0000000000000007 ffffffff81901608 ffffc90004973d78 ffffffff811cb706 0000000000000000 ffff8800d47359e0 ffffc90004973ed8 Call Trace: [<ffffffff81742f0f>] dump_stack+0x4c/0x7f [<ffffffff811cb706>] report_size_overflow+0x36/0x40 [<ffffffff812ef0bc>] btrfs_real_readdir+0x69c/0x6d0 [<ffffffff811dafc8>] iterate_dir+0xa8/0x150 [<ffffffff811e6d8d>] ? __fget_light+0x2d/0x70 [<ffffffff811dba3a>] SyS_getdents+0xba/0x1c0 Overflow: 1a [<ffffffff811db070>] ? iterate_dir+0x150/0x150 [<ffffffff81749b69>] entry_SYSCALL_64_fastpath+0x12/0x83 The jump from 7fffffff to 7fffffffffffffff happens when new dir entries are not yet synced and are processed from the delayed list. Then the code could go to the bump section again even though it might not emit any new dir entries from the delayed list. The fix avoids entering the "bump" section again once we've finished emitting the entries, both for synced and delayed entries. References: https://forums.grsecurity.net/viewtopic.php?f=1&t=4284 Reported-by: Victor <services@swwu.com> Signed-off-by: David Sterba <dsterba@suse.com> Tested-by: Holger Hoffstätte <holger.hoffstaette@googlemail.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* Merge branch 'for-linus-4.4' of ↵Linus Torvalds2015-11-27
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "This has Mark Fasheh's patches to fix quota accounting during subvol deletion, which we've been working on for a while now. The patch is pretty small but it's a key fix. Otherwise it's a random assortment" * 'for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: btrfs: fix balance range usage filters in 4.4-rc btrfs: qgroup: account shared subtree during snapshot delete Btrfs: use btrfs_get_fs_root in resolve_indirect_ref btrfs: qgroup: fix quota disable during rescan Btrfs: fix race between cleaner kthread and space cache writeout Btrfs: fix scrub preventing unused block groups from being deleted Btrfs: fix race between scrub and block group deletion btrfs: fix rcu warning during device replace btrfs: Continue replace when set_block_ro failed btrfs: fix clashing number of the enhanced balance usage filter Btrfs: fix the number of transaction units needed to remove a block group Btrfs: use global reserve when deleting unused block group after ENOSPC Btrfs: tests: checking for NULL instead of IS_ERR() btrfs: fix signed overflows in btrfs_sync_file
| * Btrfs: use global reserve when deleting unused block group after ENOSPCFilipe Manana2015-11-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It's possible to reach a state where the cleaner kthread isn't able to start a transaction to delete an unused block group due to lack of enough free metadata space and due to lack of unallocated device space to allocate a new metadata block group as well. If this happens try to use space from the global block group reserve just like we do for unlink operations, so that we don't reach a permanent state where starting a transaction for filesystem operations (file creation, renames, etc) keeps failing with -ENOSPC. Such an unfortunate state was observed on a machine where over a dozen unused data block groups existed and the cleaner kthread was failing to delete them due to ENOSPC error when attempting to start a transaction, and even running balance with a -dusage=0 filter failed with ENOSPC as well. Also unmounting and mounting again the filesystem didn't help. Allowing the cleaner kthread to use the global block reserve to delete the unused data block groups fixed the problem. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
* | Merge branch 'for-linus-4.4' of ↵Linus Torvalds2015-11-13
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes and cleanups from Chris Mason: "Some of this got cherry-picked from a github repo this week, but I verified the patches. We have three small scrub cleanups and a collection of fixes" * 'for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: btrfs: Use fs_info directly in btrfs_delete_unused_bgs btrfs: Fix lost-data-profile caused by balance bg btrfs: Fix lost-data-profile caused by auto removing bg btrfs: Remove len argument from scrub_find_csum btrfs: Reduce unnecessary arguments in scrub_recheck_block btrfs: Use scrub_checksum_data and scrub_checksum_tree_block for scrub_recheck_block_checksum btrfs: Reset sblock->xxx_error stats before calling scrub_recheck_block_checksum btrfs: scrub: setup all fields for sblock_to_check btrfs: scrub: set error stats when tree block spanning stripes Btrfs: fix race when listing an inode's xattrs Btrfs: fix race leading to BUG_ON when running delalloc for nodatacow Btrfs: fix race leading to incorrect item deletion when dropping extents Btrfs: fix sleeping inside atomic context in qgroup rescan worker Btrfs: fix race waiting for qgroup rescan worker btrfs: qgroup: exit the rescan worker during umount Btrfs: fix extent accounting for partial direct IO writes
| * Btrfs: fix race leading to BUG_ON when running delalloc for nodatacowFilipe Manana2015-11-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If we are using the NO_HOLES feature, we have a tiny time window when running delalloc for a nodatacow inode where we can race with a concurrent link or xattr add operation leading to a BUG_ON. This happens because at run_delalloc_nocow() we end up casting a leaf item of type BTRFS_INODE_[REF|EXTREF]_KEY or of type BTRFS_XATTR_ITEM_KEY to a file extent item (struct btrfs_file_extent_item) and then analyse its extent type field, which won't match any of the expected extent types (values BTRFS_FILE_EXTENT_[REG|PREALLOC|INLINE]) and therefore trigger an explicit BUG_ON(1). The following sequence diagram shows how the race happens when running a no-cow dellaloc range [4K, 8K[ for inode 257 and we have the following neighbour leafs: Leaf X (has N items) Leaf Y [ ... (257 INODE_ITEM 0) (257 INODE_REF 256) ] [ (257 EXTENT_DATA 8192), ... ] slot N - 2 slot N - 1 slot 0 (Note the implicit hole for inode 257 regarding the [0, 8K[ range) CPU 1 CPU 2 run_dealloc_nocow() btrfs_lookup_file_extent() --> searches for a key with value (257 EXTENT_DATA 4096) in the fs/subvol tree --> returns us a path with path->nodes[0] == leaf X and path->slots[0] == N because path->slots[0] is >= btrfs_header_nritems(leaf X), it calls btrfs_next_leaf() btrfs_next_leaf() --> releases the path hard link added to our inode, with key (257 INODE_REF 500) added to the end of leaf X, so leaf X now has N + 1 keys --> searches for the key (257 INODE_REF 256), because it was the last key in leaf X before it released the path, with path->keep_locks set to 1 --> ends up at leaf X again and it verifies that the key (257 INODE_REF 256) is no longer the last key in the leaf, so it returns with path->nodes[0] == leaf X and path->slots[0] == N, pointing to the new item with key (257 INODE_REF 500) the loop iteration of run_dealloc_nocow() does not break out the loop and continues because the key referenced in the path at path->nodes[0] and path->slots[0] is for inode 257, its type is < BTRFS_EXTENT_DATA_KEY and its offset (500) is less then our delalloc range's end (8192) the item pointed by the path, an inode reference item, is (incorrectly) interpreted as a file extent item and we get an invalid extent type, leading to the BUG_ON(1): if (extent_type == BTRFS_FILE_EXTENT_REG || extent_type == BTRFS_FILE_EXTENT_PREALLOC) { (...) } else if (extent_type == BTRFS_FILE_EXTENT_INLINE) { (...) } else { BUG_ON(1) } The same can happen if a xattr is added concurrently and ends up having a key with an offset smaller then the delalloc's range end. So fix this by skipping keys with a type smaller than BTRFS_EXTENT_DATA_KEY. Cc: stable@vger.kernel.org Signed-off-by: Filipe Manana <fdmanana@suse.com>
| * Btrfs: fix extent accounting for partial direct IO writesFilipe Manana2015-11-05
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When doing a write using direct IO we can end up not doing the whole write operation using the direct IO path, in that case we fallback to a buffered write to do the remaining IO. This happens for example if the range we are writing to contains a compressed extent. When we do a partial write and fallback to buffered IO, due to the existence of a compressed extent for example, we end up not adjusting the outstanding extents counter of our inode which ends up getting decremented twice, once by the DIO ordered extent for the partial write and once again by btrfs_direct_IO(), resulting in an arithmetic underflow at extent-tree.c:drop_outstanding_extent(). For example if we have: extents [ prealloc extent ] [ compressed extent ] offsets A B C D E and at the moment our inode's outstanding extents counter is 0, if we do a direct IO write against the range [B, D[ (which has a length smaller than 128Mb), we end up bumping our inode's outstanding extents counter to 1, we create a DIO ordered extent for the range [B, C[ and then fallback to a buffered write for the range [C, D[. The direct IO handler (inode.c:btrfs_direct_IO()) decrements the outstanding extents counter by 1, leaving it with a value of 0, through a call to btrfs_delalloc_release_space() and then shortly after the DIO ordered extent finishes and calls btrfs_delalloc_release_metadata() which ends up to attempt to decrement the inode's outstanding extents counter by 1, resulting in an assertion failure at drop_outstanding_extent() because the operation would result in an arithmetic underflow (0 - 1). This produces the following trace: [125471.336838] BTRFS: assertion failed: BTRFS_I(inode)->outstanding_extents >= num_extents, file: fs/btrfs/extent-tree.c, line: 5526 [125471.338844] ------------[ cut here ]------------ [125471.340745] kernel BUG at fs/btrfs/ctree.h:4173! [125471.340745] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC [125471.340745] Modules linked in: btrfs f2fs xfs libcrc32c dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc acpi_cpufreq psmouse i2c_piix4 parport pcspkr serio_raw microcode processor evdev i2c_core button ext4 crc16 jbd2 mbcache sd_mod sg sr_mod cdrom ata_generic virtio_scsi ata_piix virtio_pci virtio_ring floppy libata virtio e1000 scsi_mod [last unloaded: btrfs] [125471.340745] CPU: 10 PID: 23649 Comm: kworker/u32:1 Tainted: G W 4.3.0-rc5-btrfs-next-17+ #1 [125471.340745] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014 [125471.340745] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs] [125471.340745] task: ffff8804244fcf80 ti: ffff88040a118000 task.ti: ffff88040a118000 [125471.340745] RIP: 0010:[<ffffffffa0550da1>] [<ffffffffa0550da1>] assfail.constprop.46+0x1e/0x20 [btrfs] [125471.340745] RSP: 0018:ffff88040a11bc78 EFLAGS: 00010296 [125471.340745] RAX: 0000000000000075 RBX: 0000000000005000 RCX: 0000000000000000 [125471.340745] RDX: ffffffff81098f93 RSI: ffffffff8147c619 RDI: 00000000ffffffff [125471.340745] RBP: ffff88040a11bc78 R08: 0000000000000001 R09: 0000000000000000 [125471.340745] R10: ffff88040a11bc08 R11: ffffffff81651000 R12: ffff8803efb4a000 [125471.340745] R13: ffff8803efb4a000 R14: 0000000000000000 R15: ffff8802f8e33c88 [125471.340745] FS: 0000000000000000(0000) GS:ffff88043dd40000(0000) knlGS:0000000000000000 [125471.340745] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [125471.340745] CR2: 00007fae7ca86095 CR3: 0000000001a0b000 CR4: 00000000000006e0 [125471.340745] Stack: [125471.340745] ffff88040a11bc88 ffffffffa04ca0cd ffff88040a11bcc8 ffffffffa04ceeb1 [125471.340745] ffff8802f8e33940 ffff8802c93eadb0 ffff8802f8e0bf50 ffff8803efb4a000 [125471.340745] 0000000000000000 ffff8802f8e33c88 ffff88040a11bd38 ffffffffa04eccfa [125471.340745] Call Trace: [125471.340745] [<ffffffffa04ca0cd>] drop_outstanding_extent+0x3d/0x6d [btrfs] [125471.340745] [<ffffffffa04ceeb1>] btrfs_delalloc_release_metadata+0x51/0xdd [btrfs] [125471.340745] [<ffffffffa04eccfa>] btrfs_finish_ordered_io+0x420/0x4eb [btrfs] [125471.340745] [<ffffffffa04ecdda>] finish_ordered_fn+0x15/0x17 [btrfs] [125471.340745] [<ffffffffa050e6e8>] normal_work_helper+0x14c/0x32a [btrfs] [125471.340745] [<ffffffffa050e9c8>] btrfs_endio_write_helper+0x12/0x14 [btrfs] [125471.340745] [<ffffffff81063b23>] process_one_work+0x24a/0x4ac [125471.340745] [<ffffffff81064285>] worker_thread+0x206/0x2c2 [125471.340745] [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb [125471.340745] [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb [125471.340745] [<ffffffff8106904d>] kthread+0xef/0xf7 [125471.340745] [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24 [125471.340745] [<ffffffff8147d10f>] ret_from_fork+0x3f/0x70 [125471.340745] [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24 [125471.340745] Code: a5 55 a0 48 89 e5 e8 42 50 bc e0 0f 0b 55 89 f1 48 c7 c2 f0 a8 55 a0 48 89 fe 31 c0 48 c7 c7 14 aa 55 a0 48 89 e5 e8 22 50 bc e0 <0f> 0b 0f 1f 44 00 00 55 31 c9 ba 18 00 00 00 48 89 e5 41 56 41 [125471.340745] RIP [<ffffffffa0550da1>] assfail.constprop.46+0x1e/0x20 [btrfs] [125471.340745] RSP <ffff88040a11bc78> [125471.539620] ---[ end trace 144259f7838b4aa4 ]--- So fix this by ensuring we adjust the outstanding extents counter when we do the fallback just like we do for the case where the whole write can be done through the direct IO path. We were also adjusting the outstanding extents counter by a constant value of 1, which is incorrect because we were ignorning that we account extents in BTRFS_MAX_EXTENT_SIZE units, o fix that as well. The following test case for fstests reproduces this issue: seq=`basename $0` seqres=$RESULT_DIR/$seq echo "QA output created by $seq" tmp=/tmp/$$ status=1 # failure is the default! trap "_cleanup; exit \$status" 0 1 2 3 15 _cleanup() { rm -f $tmp.* } # get standard environment, filters and checks . ./common/rc . ./common/filter # real QA test starts here _need_to_be_root _supported_fs btrfs _supported_os Linux _require_scratch _require_xfs_io_command "falloc" rm -f $seqres.full _scratch_mkfs >>$seqres.full 2>&1 _scratch_mount "-o compress" # Create a compressed extent covering the range [700K, 800K[. $XFS_IO_PROG -f -s -c "pwrite -S 0xaa -b 100K 700K 100K" \ $SCRATCH_MNT/foo | _filter_xfs_io # Create prealloc extent covering the range [600K, 700K[. $XFS_IO_PROG -c "falloc 600K 100K" $SCRATCH_MNT/foo # Write 80K of data to the range [640K, 720K[ using direct IO. This # range covers both the prealloc extent and the compressed extent. # Because there's a compressed extent in the range we are writing to, # the DIO write code path ends up only writing the first 60k of data, # which goes to the prealloc extent, and then falls back to buffered IO # for writing the remaining 20K of data - because that remaining data # maps to a file range containing a compressed extent. # When falling back to buffered IO, we used to trigger an assertion when # releasing reserved space due to bad accounting of the inode's # outstanding extents counter, which was set to 1 but we ended up # decrementing it by 1 twice, once through the ordered extent for the # 60K of data we wrote using direct IO, and once through the main direct # IO handler (inode.cbtrfs_direct_IO()) because the direct IO write # wrote less than 80K of data (60K). $XFS_IO_PROG -d -c "pwrite -S 0xbb -b 80K 640K 80K" \ $SCRATCH_MNT/foo | _filter_xfs_io # Now similar test as above but for very large write operations. This # triggers special cases for an inode's outstanding extents accounting, # as internally btrfs logically splits extents into 128Mb units. $XFS_IO_PROG -f -s \ -c "pwrite -S 0xaa -b 128M 258M 128M" \ -c "falloc 0 258M" \ $SCRATCH_MNT/bar | _filter_xfs_io $XFS_IO_PROG -d -c "pwrite -S 0xbb -b 256M 3M 256M" $SCRATCH_MNT/bar \ | _filter_xfs_io # Now verify the file contents are correct and that they are the same # even after unmounting and mounting the fs again (or evicting the page # cache). # # For file foo, all bytes in the range [0, 640K[ must have a value of # 0x00, all bytes in the range [640K, 720K[ must have a value of 0xbb # and all bytes in the range [720K, 800K[ must have a value of 0xaa. # # For file bar, all bytes in the range [0, 3M[ must havea value of 0x00, # all bytes in the range [3M, 259M[ must have a value of 0xbb and all # bytes in the range [259M, 386M[ must have a value of 0xaa. # echo "File digests before remounting the file system:" md5sum $SCRATCH_MNT/foo | _filter_scratch md5sum $SCRATCH_MNT/bar | _filter_scratch _scratch_remount echo "File digests after remounting the file system:" md5sum $SCRATCH_MNT/foo | _filter_scratch md5sum $SCRATCH_MNT/bar | _filter_scratch status=0 exit Fixes: e1cbbfa5f5aa ("Btrfs: fix outstanding_extents accounting in DIO") Fixes: 3e05bde8c3c2 ("Btrfs: only adjust outstanding_extents when we do a short write") Signed-off-by: Filipe Manana <fdmanana@suse.com>
* | fs/btrfs/inode.c: remove unnecessary new_valid_dev() checkYaowei Bai2015-11-09
|/ | | | | | | | | | | | | new_valid_dev() always returns 1, so the !new_valid_dev() check is not needed. Remove it. Signed-off-by: Yaowei Bai <bywxiaobai@163.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Chris Mason <clm@fb.com> Cc: Josef Bacik <jbacik@fb.com> Acked-by: David Sterba <dsterba@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* btrfs: qgroup: Fix a race in delayed_ref which leads to abort transQu Wenruo2015-10-26
| | | | | | | | | | | | | | | | | Between btrfs_allocerved_file_extent() and btrfs_add_delayed_qgroup_reserve(), there is a window that delayed_refs are run and delayed ref head maybe freed before btrfs_add_delayed_qgroup_reserve(). This will cause btrfs_dad_delayed_qgroup_reserve() to return -ENOENT, and cause transaction to be aborted. This patch will record qgroup reserve space info into delayed_ref_head at btrfs_add_delayed_ref(), to eliminate the race window. Reported-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
* Btrfs: fix regression running delayed references when using qgroupsFilipe Manana2015-10-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In the kernel 4.2 merge window we had a big changes to the implementation of delayed references and qgroups which made the no_quota field of delayed references not used anymore. More specifically the no_quota field is not used anymore as of: commit 0ed4792af0e8 ("btrfs: qgroup: Switch to new extent-oriented qgroup mechanism.") Leaving the no_quota field actually prevents delayed references from getting merged, which in turn cause the following BUG_ON(), at fs/btrfs/extent-tree.c, to be hit when qgroups are enabled: static int run_delayed_tree_ref(...) { (...) BUG_ON(node->ref_mod != 1); (...) } This happens on a scenario like the following: 1) Ref1 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added. 2) Ref2 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added. It's not merged with Ref1 because Ref1->no_quota != Ref2->no_quota. 3) Ref3 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added. It's not merged with the reference at the tail of the list of refs for bytenr X because the reference at the tail, Ref2 is incompatible due to Ref2->no_quota != Ref3->no_quota. 4) Ref4 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added. It's not merged with the reference at the tail of the list of refs for bytenr X because the reference at the tail, Ref3 is incompatible due to Ref3->no_quota != Ref4->no_quota. 5) We run delayed references, trigger merging of delayed references, through __btrfs_run_delayed_refs() -> btrfs_merge_delayed_refs(). 6) Ref1 and Ref3 are merged as Ref1->no_quota = Ref3->no_quota and all other conditions are satisfied too. So Ref1 gets a ref_mod value of 2. 7) Ref2 and Ref4 are merged as Ref2->no_quota = Ref4->no_quota and all other conditions are satisfied too. So Ref2 gets a ref_mod value of 2. 8) Ref1 and Ref2 aren't merged, because they have different values for their no_quota field. 9) Delayed reference Ref1 is picked for running (select_delayed_ref() always prefers references with an action == BTRFS_ADD_DELAYED_REF). So run_delayed_tree_ref() is called for Ref1 which triggers the BUG_ON because Ref1->red_mod != 1 (equals 2). So fix this by removing the no_quota field, as it's not used anymore as of commit 0ed4792af0e8 ("btrfs: qgroup: Switch to new extent-oriented qgroup mechanism."). The use of no_quota was also buggy in at least two places: 1) At delayed-refs.c:btrfs_add_delayed_tree_ref() - we were setting no_quota to 0 instead of 1 when the following condition was true: is_fstree(ref_root) || !fs_info->quota_enabled 2) At extent-tree.c:__btrfs_inc_extent_ref() - we were attempting to reset a node's no_quota when the condition "!is_fstree(root_objectid) || !root->fs_info->quota_enabled" was true but we did it only in an unused local stack variable, that is, we never reset the no_quota value in the node itself. This fixes the remainder of problems several people have been having when running delayed references, mostly while a balance is running in parallel, on a 4.2+ kernel. Very special thanks to Stéphane Lesimple for helping debugging this issue and testing this fix on his multi terabyte filesystem (which took more than one day to balance alone, plus fsck, etc). Also, this fixes deadlock issue when using the clone ioctl with qgroups enabled, as reported by Elias Probst in the mailing list. The deadlock happens because after calling btrfs_insert_empty_item we have our path holding a write lock on a leaf of the fs/subvol tree and then before releasing the path we called check_ref() which did backref walking, when qgroups are enabled, and tried to read lock the same leaf. The trace for this case is the following: INFO: task systemd-nspawn:6095 blocked for more than 120 seconds. (...) Call Trace: [<ffffffff86999201>] schedule+0x74/0x83 [<ffffffff863ef64c>] btrfs_tree_read_lock+0xc0/0xea [<ffffffff86137ed7>] ? wait_woken+0x74/0x74 [<ffffffff8639f0a7>] btrfs_search_old_slot+0x51a/0x810 [<ffffffff863a129b>] btrfs_next_old_leaf+0xdf/0x3ce [<ffffffff86413a00>] ? ulist_add_merge+0x1b/0x127 [<ffffffff86411688>] __resolve_indirect_refs+0x62a/0x667 [<ffffffff863ef546>] ? btrfs_clear_lock_blocking_rw+0x78/0xbe [<ffffffff864122d3>] find_parent_nodes+0xaf3/0xfc6 [<ffffffff86412838>] __btrfs_find_all_roots+0x92/0xf0 [<ffffffff864128f2>] btrfs_find_all_roots+0x45/0x65 [<ffffffff8639a75b>] ? btrfs_get_tree_mod_seq+0x2b/0x88 [<ffffffff863e852e>] check_ref+0x64/0xc4 [<ffffffff863e9e01>] btrfs_clone+0x66e/0xb5d [<ffffffff863ea77f>] btrfs_ioctl_clone+0x48f/0x5bb [<ffffffff86048a68>] ? native_sched_clock+0x28/0x77 [<ffffffff863ed9b0>] btrfs_ioctl+0xabc/0x25cb (...) The problem goes away by eleminating check_ref(), which no longer is needed as its purpose was to get a value for the no_quota field of a delayed reference (this patch removes the no_quota field as mentioned earlier). Reported-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr> Tested-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr> Reported-by: Elias Probst <mail@eliasprobst.eu> Reported-by: Peter Becker <floyd.net@gmail.com> Reported-by: Malte Schröder <malte@tnxip.de> Reported-by: Derek Dongray <derek@valedon.co.uk> Reported-by: Erkki Seppala <flux-btrfs@inside.org> Cc: stable@vger.kernel.org # 4.2+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
* Merge branch 'allocator-fixes' into for-linus-4.4Chris Mason2015-10-21
|\ | | | | | | Signed-off-by: Chris Mason <clm@fb.com>
| * Btrfs: fix prealloc under heavy fragmentation conditionsJosef Bacik2015-10-21
| | | | | | | | | | | | | | | | | | | | | | | | If we are heavily fragmented we will continually try to prealloc the largest extent size we can every time we call btrfs_reserve_extent. This can be very expensive when we are heavily fragmented, burning lots of CPU cycles and loops through the allocator. So instead notice when we get a smaller chunk from the allocator than what we specified and use this as the new maximum size we try to allocate. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
* | btrfs: qgroup: Check if qgroup reserved space leakedQu Wenruo2015-10-21
| | | | | | | | | | | | | | | | Add check at btrfs_destroy_inode() time to detect qgroup reserved space leak. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
* | btrfs: qgroup: Avoid calling btrfs_free_reserved_data_space in clear_bit_hookQu Wenruo2015-10-21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In clear_bit_hook, qgroup reserved data is already handled quite well, either released by finish_ordered_io or invalidatepage. So calling btrfs_qgroup_free_data() here is completely meaningless, and since btrfs_qgroup_free_data() will lock io_tree, so it can't be called with io_tree lock hold. This patch will add a new function btrfs_free_reserved_data_space_noquota() for clear_bit_hook() to cease the lockdep warning. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
* | btrfs: Add handler for invalidate pageQu Wenruo2015-10-21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For btrfs_invalidatepage() and its variant evict_inode_truncate_page(), there will be pages don't reach disk. In that case, their reserved space won't be release nor freed by finish_ordered_io() nor delayed_ref handler. So we must free their qgroup reserved space, or we will leaking reserved space again. So this will patch will call btrfs_qgroup_free_data() for invalidatepage() and its variant evict_inode_truncate_page(). And due to the nature of new btrfs_qgroup_reserve/free_data() reserved space will only be reserved or freed once, so for pages which are already flushed to disk, their reserved space will be released and freed by delayed_ref handler. Double free won't be a problem. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
* | btrfs: qgroup: Add handler for NOCOW and inlineQu Wenruo2015-10-21
| | | | | | | | | | | | | | | | | | For NOCOW and inline case, there will be no delayed_ref created for them, so we should free their reserved data space at proper time(finish_ordered_io for NOCOW and cow_file_inline for inline). Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
* | btrfs: qgroup: Cleanup old inaccurate facilitiesQu Wenruo2015-10-21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Cleanup the old facilities which use old btrfs_qgroup_reserve() function call, replace them with the newer version, and remove the "__" prefix in them. Also, make btrfs_qgroup_reserve/free() functions private, as they are now only used inside qgroup codes. Now, the whole btrfs qgroup is swithed to use the new reserve facilities. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
* | btrfs: extent-tree: Switch to new delalloc space reserve and releaseQu Wenruo2015-10-21
| | | | | | | | | | | | | | | | | | Use new __btrfs_delalloc_reserve_space() and __btrfs_delalloc_release_space() to reserve and release space for delalloc. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
* | btrfs: delayed_ref: release and free qgroup reserved at proper timingQu Wenruo2015-10-21
|/ | | | | | | | | | | | | | | | | | | | | | | | Qgroup reserved space needs to be released from inode dirty map and get freed at different timing: 1) Release when the metadata is written into tree After corresponding metadata is written into tree, any newer write will be COWed(don't include NOCOW case yet). So we must release its range from inode dirty range map, or we will forget to reserve needed range, causing accounting exceeding the limit. 2) Free reserved bytes when delayed ref is run When delayed refs are run, qgroup accounting will follow soon and turn the reserved bytes into rfer/excl numbers. As run_delayed_refs and qgroup accounting are all done at commit_transaction() time, we are safe to free reserved space in run_delayed_ref time(). With these timing to release/free reserved space, we should be able to resolve the long existing qgroup reserve space leak problem. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>