summaryrefslogtreecommitdiff
path: root/mm/vmalloc.c (follow)
Commit message (Collapse)AuthorAge
* Merge 4.4.226 into android-4.4-pGreg Kroah-Hartman2020-06-03
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Changes in 4.4.226 ax25: fix setsockopt(SO_BINDTODEVICE) net: revert "net: get rid of an signed integer overflow in ip_idents_reserve()" sctp: Start shutdown on association restart if in SHUTDOWN-SENT state and socket is closed net/mlx5: Add command entry handling completion net: sun: fix missing release regions in cas_init_one(). net/mlx4_core: fix a memory leak bug. uapi: fix linux/if_pppol2tp.h userspace compilation errors IB/cma: Fix reference count leak when no ipv4 addresses are set cachefiles: Fix race between read_waiter and read_copier involving op->to_do usb: gadget: legacy: fix redundant initialization warnings cifs: Fix null pointer check in cifs_read Input: usbtouchscreen - add support for BonXeon TP Input: evdev - call input_flush_device() on release(), not flush() Input: xpad - add custom init packet for Xbox One S controllers Input: i8042 - add ThinkPad S230u to i8042 reset list IB/qib: Call kobject_put() when kobject_init_and_add() fails ALSA: hwdep: fix a left shifting 1 by 31 UB bug ALSA: usb-audio: mixer: volume quirk for ESS Technology Asus USB DAC exec: Always set cap_ambient in cap_bprm_set_creds fs/binfmt_elf.c: allocate initialized memory in fill_thread_core_info() include/asm-generic/topology.h: guard cpumask_of_node() macro argument iommu: Fix reference count leak in iommu_group_alloc. parisc: Fix kernel panic in mem_init() x86/dma: Fix max PFN arithmetic overflow on 32 bit systems xfrm: allow to accept packets with ipv6 NEXTHDR_HOP in xfrm_input xfrm: fix a warning in xfrm_policy_insert_list xfrm: fix a NULL-ptr deref in xfrm_local_error vti4: eliminated some duplicate code. ip_vti: receive ipip packet by calling ip_tunnel_rcv netfilter: nft_reject_bridge: enable reject with bridge vlan netfilter: ipset: Fix subcounter update skip netfilter: nf_conntrack_pptp: prevent buffer overflows in debug code qlcnic: fix missing release in qlcnic_83xx_interrupt_test. bonding: Fix reference count leak in bond_sysfs_slave_add. netfilter: nf_conntrack_pptp: fix compilation warning with W=1 build mm: remove VM_BUG_ON(PageSlab()) from page_mapcount() drm/fb-helper: Use proper plane mask for fb cleanup genirq/generic_pending: Do not lose pending affinity update usb: renesas_usbhs: gadget: fix spin_lock_init() for &uep->lock mac80211: fix memory leak net: rtnl_configure_link: fix dev flags changes arg to __dev_notify_flags mm/vmalloc.c: don't dereference possible NULL pointer in __vunmap() asm-prototypes: Clear any CPP defines before declaring the functions sc16is7xx: move label 'err_spi' to correct section drm/msm: Fix possible null dereference on failure of get_pages() printk: help pr_debug and pr_devel to optimize out arguments scsi: zfcp: fix request object use-after-free in send path causing wrong traces Linux 4.4.226 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: Ia5473e84bcdec9d8a045a80a1c683fc1072f8c4f
| * mm/vmalloc.c: don't dereference possible NULL pointer in __vunmap()Liviu Dudau2020-06-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 6ade20327dbb808882888ed8ccded71e93067cf9 upstream. find_vmap_area() can return a NULL pointer and we're going to dereference it without checking it first. Use the existing find_vm_area() function which does exactly what we want and checks for the NULL pointer. Link: http://lkml.kernel.org/r/20181228171009.22269-1-liviu@dudau.co.uk Fixes: f3c01d2f3ade ("mm: vmalloc: avoid racy handling of debugobjects in vunmap") Signed-off-by: Liviu Dudau <liviu@dudau.co.uk> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Chintan Pandya <cpandya@codeaurora.org> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* | Merge 4.4.218 into android-4.4-pGreg Kroah-Hartman2020-04-02
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Changes in 4.4.218 spi: qup: call spi_qup_pm_resume_runtime before suspending powerpc: Include .BTF section ARM: dts: dra7: Add "dma-ranges" property to PCIe RC DT nodes spi/zynqmp: remove entry that causes a cs glitch drm/exynos: dsi: propagate error value and silence meaningless warning drm/exynos: dsi: fix workaround for the legacy clock name altera-stapl: altera_get_note: prevent write beyond end of 'key' USB: Disable LPM on WD19's Realtek Hub usb: quirks: add NO_LPM quirk for RTL8153 based ethernet adapters USB: serial: option: add ME910G1 ECM composition 0x110b usb: host: xhci-plat: add a shutdown USB: serial: pl2303: add device-id for HP LD381 ALSA: line6: Fix endless MIDI read loop ALSA: seq: virmidi: Fix running status after receiving sysex ALSA: seq: oss: Fix running status after receiving sysex ALSA: pcm: oss: Avoid plugin buffer overflow ALSA: pcm: oss: Remove WARNING from snd_pcm_plug_alloc() checks staging: rtl8188eu: Add device id for MERCUSYS MW150US v2 staging/speakup: fix get_word non-space look-ahead intel_th: Fix user-visible error codes rtc: max8907: add missing select REGMAP_IRQ memcg: fix NULL pointer dereference in __mem_cgroup_usage_unregister_event mm: slub: be more careful about the double cmpxchg of freelist mm, slub: prevent kmalloc_node crashes and memory leaks x86/mm: split vmalloc_sync_all() USB: cdc-acm: fix close_delay and closing_wait units in TIOCSSERIAL USB: cdc-acm: fix rounding error in TIOCSSERIAL kbuild: Disable -Wpointer-to-enum-cast futex: Fix inode life-time issue futex: Unbreak futex hashing ALSA: hda/realtek: Fix pop noise on ALC225 arm64: smp: fix smp_send_stop() behaviour Revert "drm/dp_mst: Skip validating ports during destruction, just ref" hsr: fix general protection fault in hsr_addr_is_self() net: dsa: Fix duplicate frames flooded by learning net_sched: cls_route: remove the right filter from hashtable net_sched: keep alloc_hash updated after hash allocation NFC: fdp: Fix a signedness bug in fdp_nci_send_patch() slcan: not call free_netdev before rtnl_unlock in slcan_open vxlan: check return value of gro_cells_init() hsr: use rcu_read_lock() in hsr_get_node_{list/status}() hsr: add restart routine into hsr_get_node_list() hsr: set .netnsok flag vhost: Check docket sk_family instead of call getname IB/ipoib: Do not warn if IPoIB debugfs doesn't exist uapi glibc compat: fix outer guard of net device flags enum KVM: VMX: Do not allow reexecute_instruction() when skipping MMIO instr drivers/hwspinlock: use correct radix tree API net: ipv4: don't let PMTU updates increase route MTU cpupower: avoid multiple definition with gcc -fno-common dt-bindings: net: FMan erratum A050385 scsi: ipr: Fix softlockup when rescanning devices in petitboot mac80211: Do not send mesh HWMP PREQ if HWMP is disabled sxgbe: Fix off by one in samsung driver strncpy size arg i2c: hix5hd2: add missed clk_disable_unprepare in remove perf probe: Do not depend on dwfl_module_addrsym() scripts/dtc: Remove redundant YYLOC global declaration scsi: sd: Fix optimal I/O size for devices that change reported values mac80211: mark station unauthorized before key removal genirq: Fix reference leaks on irq affinity notifiers vti[6]: fix packet tx through bpf_redirect() in XinY cases xfrm: fix uctx len check in verify_sec_ctx_len xfrm: add the missing verify_sec_ctx_len check in xfrm_add_acquire xfrm: policy: Fix doulbe free in xfrm_policy_timer vti6: Fix memory leak of skb if input policy check fails tools: Let O= makes handle a relative path with -C option USB: serial: option: add support for ASKEY WWHC050 USB: serial: option: add BroadMobi BM806U USB: serial: option: add Wistron Neweb D19Q1 USB: cdc-acm: restore capability check order USB: serial: io_edgeport: fix slab-out-of-bounds read in edge_interrupt_callback usb: musb: fix crash with highmen PIO and usbmon media: flexcop-usb: fix endpoint sanity check media: usbtv: fix control-message timeouts staging: rtl8188eu: Add ASUS USB-N10 Nano B1 to device table staging: wlan-ng: fix use-after-free Read in hfa384x_usbin_callback libfs: fix infoleak in simple_attr_read() media: ov519: add missing endpoint sanity checks media: dib0700: fix rc endpoint lookup media: stv06xx: add missing descriptor sanity checks media: xirlink_cit: add missing descriptor sanity checks vt: selection, introduce vc_is_sel vt: ioctl, switch VT_IS_IN_USE and VT_BUSY to inlines vt: switch vt_dont_switch to bool vt: vt_ioctl: remove unnecessary console allocation checks vt: vt_ioctl: fix VT_DISALLOCATE freeing in-use virtual console locking/atomic, kref: Add kref_read() vt: vt_ioctl: fix use-after-free in vt_in_use() bpf: Explicitly memset the bpf_attr structure net: ks8851-ml: Fix IO operations, again perf map: Fix off by one in strncpy() size argument Linux 4.4.218 Change-Id: I8de6cf91805269943a4c08f8b08e6a0b8539c08e Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
| * x86/mm: split vmalloc_sync_all()Joerg Roedel2020-04-02
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 763802b53a427ed3cbd419dbba255c414fdd9e7c upstream. Commit 3f8fd02b1bf1 ("mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy()") introduced a call to vmalloc_sync_all() in the vunmap() code-path. While this change was necessary to maintain correctness on x86-32-pae kernels, it also adds additional cycles for architectures that don't need it. Specifically on x86-64 with CONFIG_VMAP_STACK=y some people reported severe performance regressions in micro-benchmarks because it now also calls the x86-64 implementation of vmalloc_sync_all() on vunmap(). But the vmalloc_sync_all() implementation on x86-64 is only needed for newly created mappings. To avoid the unnecessary work on x86-64 and to gain the performance back, split up vmalloc_sync_all() into two functions: * vmalloc_sync_mappings(), and * vmalloc_sync_unmappings() Most call-sites to vmalloc_sync_all() only care about new mappings being synchronized. The only exception is the new call-site added in the above mentioned commit. Shile Zhang directed us to a report of an 80% regression in reaim throughput. Fixes: 3f8fd02b1bf1 ("mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy()") Reported-by: kernel test robot <oliver.sang@intel.com> Reported-by: Shile Zhang <shile.zhang@linux.alibaba.com> Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Borislav Petkov <bp@suse.de> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> [GHES] Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20191009124418.8286-1-joro@8bytes.org Link: https://lists.01.org/hyperkitty/list/lkp@lists.01.org/thread/4D3JPPHBNOSPFK2KEPC6KGKS6J25AIDB/ Link: http://lkml.kernel.org/r/20191113095530.228959-1-shile.zhang@linux.alibaba.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* | Merge 4.4.190 into android-4.4-pGreg Kroah-Hartman2019-08-25
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Changes in 4.4.190 usb: iowarrior: fix deadlock on disconnect sound: fix a memory leak bug x86/mm: Check for pfn instead of page in vmalloc_sync_one() x86/mm: Sync also unmappings in vmalloc_sync_all() mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy() perf db-export: Fix thread__exec_comm() usb: yurex: Fix use-after-free in yurex_delete can: peak_usb: fix potential double kfree_skb() netfilter: nfnetlink: avoid deadlock due to synchronous request_module iscsi_ibft: make ISCSI_IBFT dependson ACPI instead of ISCSI_IBFT_FIND mac80211: don't warn about CW params when not using them hwmon: (nct6775) Fix register address and added missed tolerance for nct6106 cpufreq/pasemi: fix use-after-free in pas_cpufreq_cpu_init() s390/qdio: add sanity checks to the fast-requeue path ALSA: compress: Fix regression on compressed capture streams ALSA: compress: Prevent bypasses of set_params ALSA: compress: Be more restrictive about when a drain is allowed perf probe: Avoid calling freeing routine multiple times for same pointer ARM: davinci: fix sleep.S build error on ARMv4 scsi: megaraid_sas: fix panic on loading firmware crashdump scsi: ibmvfc: fix WARN_ON during event pool release tty/ldsem, locking/rwsem: Add missing ACQUIRE to read_failed sleep loop perf/core: Fix creating kernel counters for PMUs that override event->cpu can: peak_usb: pcan_usb_pro: Fix info-leaks to USB devices can: peak_usb: pcan_usb_fd: Fix info-leaks to USB devices hwmon: (nct7802) Fix wrong detection of in4 presence ALSA: firewire: fix a memory leak bug mac80211: don't WARN on short WMM parameters from AP SMB3: Fix deadlock in validate negotiate hits reconnect smb3: send CAP_DFS capability during session setup mwifiex: fix 802.11n/WPA detection scsi: mpt3sas: Use 63-bit DMA addressing on SAS35 HBA sh: kernel: hw_breakpoint: Fix missing break in switch statement usb: gadget: f_midi: fail if set_alt fails to allocate requests USB: gadget: f_midi: fixing a possible double-free in f_midi mm/memcontrol.c: fix use after free in mem_cgroup_iter() ALSA: hda - Fix a memory leak bug HID: holtek: test for sanity of intfdata HID: hiddev: avoid opening a disconnected device HID: hiddev: do cleanup in failure of opening a device Input: kbtab - sanity check for endpoint type Input: iforce - add sanity checks net: usb: pegasus: fix improper read if get_registers() fail xen/pciback: remove set but not used variable 'old_state' irqchip/irq-imx-gpcv2: Forward irq type to parent perf header: Fix divide by zero error if f_header.attr_size==0 perf header: Fix use of unitialized value warning libata: zpodd: Fix small read overflow in zpodd_get_mech_type() scsi: hpsa: correct scsi command status issue after reset ata: libahci: do not complain in case of deferred probe kbuild: modpost: handle KBUILD_EXTRA_SYMBOLS only for external modules IB/core: Add mitigation for Spectre V1 ocfs2: remove set but not used variable 'last_hash' asm-generic: fix -Wtype-limits compiler warnings staging: comedi: dt3000: Fix signed integer overflow 'divider * base' staging: comedi: dt3000: Fix rounding up of timer divisor USB: core: Fix races in character device registration and deregistraion usb: cdc-acm: make sure a refcount is taken early enough USB: serial: option: add D-Link DWM-222 device ID USB: serial: option: Add support for ZTE MF871A USB: serial: option: add the BroadMobi BM818 card USB: serial: option: Add Motorola modem UARTs Backport minimal compiler_attributes.h to support GCC 9 include/linux/module.h: copy __init/__exit attrs to init/cleanup_module arm64: compat: Allow single-byte watchpoints on all addresses Input: psmouse - fix build error of multiple definition asm-generic: default BUG_ON(x) to if(x)BUG() scsi: fcoe: Embed fc_rport_priv in fcoe_rport structure RDMA: Directly cast the sockaddr union to sockaddr IB/mlx5: Make coding style more consistent x86/vdso: Remove direct HPET access through the vDSO iommu/amd: Move iommu_init_pci() to .init section x86/boot: Disable the address-of-packed-member compiler warning net/packet: fix race in tpacket_snd() xen/netback: Reset nr_frags before freeing skb net/mlx5e: Only support tx/rx pause setting for port owner sctp: fix the transport error_count check bonding: Add vlan tx offload to hw_enc_features Linux 4.4.190 Change-Id: Ic4094fbac2f9b8f6d4a9b4397e82471f40424332 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
| * mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy()Joerg Roedel2019-08-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 3f8fd02b1bf1d7ba964485a56f2f4b53ae88c167 upstream. On x86-32 with PTI enabled, parts of the kernel page-tables are not shared between processes. This can cause mappings in the vmalloc/ioremap area to persist in some page-tables after the region is unmapped and released. When the region is re-used the processes with the old mappings do not fault in the new mappings but still access the old ones. This causes undefined behavior, in reality often data corruption, kernel oopses and panics and even spontaneous reboots. Fix this problem by activly syncing unmaps in the vmalloc/ioremap area to all page-tables in the system before the regions can be re-used. References: https://bugzilla.suse.com/show_bug.cgi?id=1118689 Fixes: 5d72b4fba40ef ('x86, mm: support huge I/O mapping capability I/F') Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lkml.kernel.org/r/20190719184652.11391-4-joro@8bytes.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* | Merge 4.4.179 into android-4.4-pGreg Kroah-Hartman2019-04-30
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Changes in 4.4.179 arm64: debug: Don't propagate UNKNOWN FAR into si_code for debug signals arm64: debug: Ensure debug handlers check triggering exception level ext4: cleanup bh release code in ext4_ind_remove_space() lib/int_sqrt: optimize initial value compute tty/serial: atmel: Add is_half_duplex helper mm: mempolicy: make mbind() return -EIO when MPOL_MF_STRICT is specified i2c: core-smbus: prevent stack corruption on read I2C_BLOCK_DATA Bluetooth: Fix decrementing reference count twice in releasing socket tty/serial: atmel: RS485 HD w/DMA: enable RX after TX is stopped CIFS: fix POSIX lock leak and invalid ptr deref h8300: use cc-cross-prefix instead of hardcoding h8300-unknown-linux- tracing: kdb: Fix ftdump to not sleep gpio: gpio-omap: fix level interrupt idling sysctl: handle overflow for file-max enic: fix build warning without CONFIG_CPUMASK_OFFSTACK mm/cma.c: cma_declare_contiguous: correct err handling mm/page_ext.c: fix an imbalance with kmemleak mm/vmalloc.c: fix kernel BUG at mm/vmalloc.c:512! mm/slab.c: kmemleak no scan alien caches ocfs2: fix a panic problem caused by o2cb_ctl f2fs: do not use mutex lock in atomic context fs/file.c: initialize init_files.resize_wait cifs: use correct format characters dm thin: add sanity checks to thin-pool and external snapshot creation cifs: Fix NULL pointer dereference of devname fs: fix guard_bio_eod to check for real EOD errors tools lib traceevent: Fix buffer overflow in arg_eval usb: chipidea: Grab the (legacy) USB PHY by phandle first scsi: core: replace GFP_ATOMIC with GFP_KERNEL in scsi_scan.c coresight: etm4x: Add support to enable ETMv4.2 ARM: 8840/1: use a raw_spinlock_t in unwind mmc: omap: fix the maximum timeout setting e1000e: Fix -Wformat-truncation warnings IB/mlx4: Increase the timeout for CM cache scsi: megaraid_sas: return error when create DMA pool failed perf test: Fix failure of 'evsel-tp-sched' test on s390 SoC: imx-sgtl5000: add missing put_device() media: sh_veu: Correct return type for mem2mem buffer helpers media: s5p-jpeg: Correct return type for mem2mem buffer helpers media: s5p-g2d: Correct return type for mem2mem buffer helpers media: mx2_emmaprp: Correct return type for mem2mem buffer helpers leds: lp55xx: fix null deref on firmware load failure kprobes: Prohibit probing on bsearch() ARM: 8833/1: Ensure that NEON code always compiles with Clang ALSA: PCM: check if ops are defined before suspending PCM bcache: fix input overflow to cache set sysfs file io_error_halflife bcache: fix input overflow to sequential_cutoff bcache: improve sysfs_strtoul_clamp() fbdev: fbmem: fix memory access if logo is bigger than the screen cdrom: Fix race condition in cdrom_sysctl_register ASoC: fsl-asoc-card: fix object reference leaks in fsl_asoc_card_probe soc: qcom: gsbi: Fix error handling in gsbi_probe() mt7601u: bump supported EEPROM version ARM: avoid Cortex-A9 livelock on tight dmb loops tty: increase the default flip buffer limit to 2*640K media: mt9m111: set initial frame size other than 0x0 hwrng: virtio - Avoid repeated init of completion soc/tegra: fuse: Fix illegal free of IO base address hpet: Fix missing '=' character in the __setup() code of hpet_mmap_enable dmaengine: imx-dma: fix warning comparison of distinct pointer types netfilter: physdev: relax br_netfilter dependency media: s5p-jpeg: Check for fmt_ver_flag when doing fmt enumeration regulator: act8865: Fix act8600_sudcdc_voltage_ranges setting wlcore: Fix memory leak in case wl12xx_fetch_firmware failure x86/build: Mark per-CPU symbols as absolute explicitly for LLD dmaengine: tegra: avoid overflow of byte tracking drm/dp/mst: Configure no_stop_bit correctly for remote i2c xfers binfmt_elf: switch to new creds when switching to new mm kbuild: clang: choose GCC_TOOLCHAIN_DIR not on LD x86/build: Specify elf_i386 linker emulation explicitly for i386 objects x86: vdso: Use $LD instead of $CC to link x86/vdso: Drop implicit common-page-size linker flag lib/string.c: implement a basic bcmp tty: mark Siemens R3964 line discipline as BROKEN tty: ldisc: add sysctl to prevent autoloading of ldiscs ipv6: Fix dangling pointer when ipv6 fragment ipv6: sit: reset ip header pointer in ipip6_rcv net: rds: force to destroy connection if t_sock is NULL in rds_tcp_kill_sock(). openvswitch: fix flow actions reallocation qmi_wwan: add Olicard 600 sctp: initialize _pad of sockaddr_in before copying to user memory tcp: Ensure DCTCP reacts to losses netns: provide pure entropy for net_hash_mix() net: ethtool: not call vzalloc for zero sized memory request ip6_tunnel: Match to ARPHRD_TUNNEL6 for dev type ALSA: seq: Fix OOB-reads from strlcpy include/linux/bitrev.h: fix constant bitrev ASoC: fsl_esai: fix channel swap issue when stream starts block: do not leak memory in bio_copy_user_iov() genirq: Respect IRQCHIP_SKIP_SET_WAKE in irq_chip_set_wake_parent() ARM: dts: at91: Fix typo in ISC_D0 on PC9 arm64: futex: Fix FUTEX_WAKE_OP atomic ops with non-zero result value xen: Prevent buffer overflow in privcmd ioctl sched/fair: Do not re-read ->h_load_next during hierarchical load calculation xtensa: fix return_address PCI: Add function 1 DMA alias quirk for Marvell 9170 SATA controller perf/core: Restore mmap record type correctly ext4: add missing brelse() in add_new_gdb_meta_bg() ext4: report real fs size after failed resize ALSA: echoaudio: add a check for ioremap_nocache ALSA: sb8: add a check for request_region IB/mlx4: Fix race condition between catas error reset and aliasguid flows mmc: davinci: remove extraneous __init annotation ALSA: opl3: fix mismatch between snd_opl3_drum_switch definition and declaration thermal/int340x_thermal: Add additional UUIDs thermal/int340x_thermal: fix mode setting tools/power turbostat: return the exit status of a command perf top: Fix error handling in cmd_top() perf evsel: Free evsel->counts in perf_evsel__exit() perf tests: Fix a memory leak of cpu_map object in the openat_syscall_event_on_all_cpus test perf tests: Fix a memory leak in test__perf_evsel__tp_sched_test() x86/hpet: Prevent potential NULL pointer dereference x86/cpu/cyrix: Use correct macros for Cyrix calls on Geode processors iommu/vt-d: Check capability before disabling protected memory x86/hw_breakpoints: Make default case in hw_breakpoint_arch_parse() return an error fix incorrect error code mapping for OBJECTID_NOT_FOUND ext4: prohibit fstrim in norecovery mode rsi: improve kernel thread handling to fix kernel panic 9p: do not trust pdu content for stat item size 9p locks: add mount option for lock retry interval f2fs: fix to do sanity check with current segment number serial: uartps: console_setup() can't be placed to init section ARM: samsung: Limit SAMSUNG_PM_CHECK config option to non-Exynos platforms ACPI / SBS: Fix GPE storm on recent MacBookPro's cifs: fallback to older infolevels on findfirst queryinfo retry crypto: sha256/arm - fix crash bug in Thumb2 build crypto: sha512/arm - fix crash bug in Thumb2 build iommu/dmar: Fix buffer overflow during PCI bus notification ARM: 8839/1: kprobe: make patch_lock a raw_spinlock_t appletalk: Fix use-after-free in atalk_proc_exit lib/div64.c: off by one in shift include/linux/swap.h: use offsetof() instead of custom __swapoffset macro tpm/tpm_crb: Avoid unaligned reads in crb_recv() ovl: fix uid/gid when creating over whiteout appletalk: Fix compile regression bonding: fix event handling for stacked bonds net: atm: Fix potential Spectre v1 vulnerabilities net: bridge: multicast: use rcu to access port list from br_multicast_start_querier net: fou: do not use guehdr after iptunnel_pull_offloads in gue_udp_recv tcp: tcp_grow_window() needs to respect tcp_space() ipv4: recompile ip options in ipv4_link_failure ipv4: ensure rcu_read_lock() in ipv4_link_failure() crypto: crypto4xx - properly set IV after de- and encrypt modpost: file2alias: go back to simple devtable lookup modpost: file2alias: check prototype of handler tpm/tpm_i2c_atmel: Return -E2BIG when the transfer is incomplete KVM: x86: Don't clear EFER during SMM transitions for 32-bit vCPU iio/gyro/bmg160: Use millidegrees for temperature scale iio: ad_sigma_delta: select channel when reading register iio: adc: at91: disable adc channel interrupt in timeout case io: accel: kxcjk1013: restore the range after resume. staging: comedi: vmk80xx: Fix use of uninitialized semaphore staging: comedi: vmk80xx: Fix possible double-free of ->usb_rx_buf staging: comedi: ni_usb6501: Fix use of uninitialized mutex staging: comedi: ni_usb6501: Fix possible double-free of ->usb_rx_buf ALSA: core: Fix card races between register and disconnect crypto: x86/poly1305 - fix overflow during partial reduction arm64: futex: Restore oldval initialization to work around buggy compilers x86/kprobes: Verify stack frame on kretprobe kprobes: Mark ftrace mcount handler functions nokprobe kprobes: Fix error check when reusing optimized probes mac80211: do not call driver wake_tx_queue op during reconfig Revert "kbuild: use -Oz instead of -Os when using clang" sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockup device_cgroup: fix RCU imbalance in error case mm/vmstat.c: fix /proc/vmstat format for CONFIG_DEBUG_TLBFLUSH=y CONFIG_SMP=n ALSA: info: Fix racy addition/deletion of nodes Revert "locking/lockdep: Add debug_locks check in __lock_downgrade()" kernel/sysctl.c: fix out-of-bounds access when setting file-max Linux 4.4.179 Change-Id: Ia88dbd6c37250a682098a4a8540672869c6adf42 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
| * mm/vmalloc.c: fix kernel BUG at mm/vmalloc.c:512!Uladzislau Rezki (Sony)2019-04-27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit afd07389d3f4933c7f7817a92fb5e053d59a3182 ] One of the vmalloc stress test case triggers the kernel BUG(): <snip> [60.562151] ------------[ cut here ]------------ [60.562154] kernel BUG at mm/vmalloc.c:512! [60.562206] invalid opcode: 0000 [#1] PREEMPT SMP PTI [60.562247] CPU: 0 PID: 430 Comm: vmalloc_test/0 Not tainted 4.20.0+ #161 [60.562293] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014 [60.562351] RIP: 0010:alloc_vmap_area+0x36f/0x390 <snip> it can happen due to big align request resulting in overflowing of calculated address, i.e. it becomes 0 after ALIGN()'s fixup. Fix it by checking if calculated address is within vstart/vend range. Link: http://lkml.kernel.org/r/20190124115648.9433-2-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Joel Fernandes <joelaf@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Garnier <thgarnie@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* | Merge 4.4.177 into android-4.4-pGreg Kroah-Hartman2019-03-23
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Changes in 4.4.177 ceph: avoid repeatedly adding inode to mdsc->snap_flush_list numa: change get_mempolicy() to use nr_node_ids instead of MAX_NUMNODES KEYS: allow reaching the keys quotas exactly mfd: ti_am335x_tscadc: Use PLATFORM_DEVID_AUTO while registering mfd cells mfd: twl-core: Fix section annotations on {,un}protect_pm_master mfd: db8500-prcmu: Fix some section annotations mfd: ab8500-core: Return zero in get_register_interruptible() mfd: qcom_rpm: write fw_version to CTRL_REG mfd: wm5110: Add missing ASRC rate register mfd: mc13xxx: Fix a missing check of a register-read failure net: hns: Fix use after free identified by SLUB debug MIPS: ath79: Enable OF serial ports in the default config scsi: qla4xxx: check return code of qla4xxx_copy_from_fwddb_param scsi: isci: initialize shost fully before calling scsi_add_host() MIPS: jazz: fix 64bit build isdn: i4l: isdn_tty: Fix some concurrency double-free bugs atm: he: fix sign-extension overflow on large shift leds: lp5523: fix a missing check of return value of lp55xx_read isdn: avm: Fix string plus integer warning from Clang RDMA/srp: Rework SCSI device reset handling KEYS: user: Align the payload buffer KEYS: always initialize keyring_index_key::desc_len batman-adv: fix uninit-value in batadv_interface_tx() net/packet: fix 4gb buffer limit due to overflow check team: avoid complex list operations in team_nl_cmd_options_set() sit: check if IPv6 enabled before calling ip6_err_gen_icmpv6_unreach() net/mlx4_en: Force CHECKSUM_NONE for short ethernet frames ARCv2: Enable unaligned access in early ASM code Revert "bridge: do not add port to router list when receives query with source 0.0.0.0" libceph: handle an empty authorize reply scsi: libsas: Fix rphy phy_identifier for PHYs with end devices attached drm/msm: Unblock writer if reader closes file ASoC: Intel: Haswell/Broadwell: fix setting for .dynamic field ALSA: compress: prevent potential divide by zero bugs thermal: int340x_thermal: Fix a NULL vs IS_ERR() check usb: dwc3: gadget: Fix the uninitialized link_state when udc starts usb: gadget: Potential NULL dereference on allocation error ASoC: dapm: change snprintf to scnprintf for possible overflow ASoC: imx-audmux: change snprintf to scnprintf for possible overflow ARC: fix __ffs return value to avoid build warnings mac80211: fix miscounting of ttl-dropped frames serial: fsl_lpuart: fix maximum acceptable baud rate with over-sampling scsi: csiostor: fix NULL pointer dereference in csio_vport_set_state() net: altera_tse: fix connect_local_phy error path ibmveth: Do not process frames after calling napi_reschedule mac80211: don't initiate TDLS connection if station is not associated to AP cfg80211: extend range deviation for DMG KVM: nSVM: clear events pending from svm_complete_interrupts() when exiting to L1 arm/arm64: KVM: Feed initialized memory to MMIO accesses KVM: arm/arm64: Fix MMIO emulation data handling powerpc: Always initialize input array when calling epapr_hypercall() mmc: spi: Fix card detection during probe mm: enforce min addr even if capable() in expand_downwards() x86/uaccess: Don't leak the AC flag into __put_user() value evaluation USB: serial: option: add Telit ME910 ECM composition USB: serial: cp210x: add ID for Ingenico 3070 USB: serial: ftdi_sio: add ID for Hjelmslund Electronics USB485 cpufreq: Use struct kobj_attribute instead of struct global_attr sockfs: getxattr: Fail with -EOPNOTSUPP for invalid attribute names ncpfs: fix build warning of strncpy isdn: isdn_tty: fix build warning of strncpy staging: lustre: fix buffer overflow of string buffer net-sysfs: Fix mem leak in netdev_register_kobject sky2: Disable MSI on Dell Inspiron 1545 and Gateway P-79 team: Free BPF filter when unregistering netdev bnxt_en: Drop oversize TX packets to prevent errors. net: nfc: Fix NULL dereference on nfc_llcp_build_tlv fails xen-netback: fix occasional leak of grant ref mappings under memory pressure net: Add __icmp_send helper. net: avoid use IPCB in cipso_v4_error net: phy: Micrel KSZ8061: link failure after cable connect x86/CPU/AMD: Set the CPB bit unconditionally on F17h applicom: Fix potential Spectre v1 vulnerabilities MIPS: irq: Allocate accurate order pages for irq stack hugetlbfs: fix races and page leaks during migration netlabel: fix out-of-bounds memory accesses net: dsa: mv88e6xxx: Fix u64 statistics ip6mr: Do not call __IP6_INC_STATS() from preemptible context media: uvcvideo: Fix 'type' check leading to overflow vti4: Fix a ipip packet processing bug in 'IPCOMP' virtual tunnel perf tools: Handle TOPOLOGY headers with no CPU IB/{hfi1, qib}: Fix WC.byte_len calculation for UD_SEND_WITH_IMM ipvs: Fix signed integer overflow when setsockopt timeout iommu/amd: Fix IOMMU page flush when detach device from a domain xtensa: SMP: fix ccount_timer_shutdown xtensa: SMP: fix secondary CPU initialization xtensa: smp_lx200_defconfig: fix vectors clash xtensa: SMP: mark each possible CPU as present xtensa: SMP: limit number of possible CPUs by NR_CPUS net: altera_tse: fix msgdma_tx_completion on non-zero fill_level case net: hns: Fix wrong read accesses via Clause 45 MDIO protocol net: stmmac: dwmac-rk: fix error handling in rk_gmac_powerup() gpio: vf610: Mask all GPIO interrupts nfs: Fix NULL pointer dereference of dev_name scsi: libfc: free skb when receiving invalid flogi resp platform/x86: Fix unmet dependency warning for SAMSUNG_Q10 cifs: fix computation for MAX_SMB2_HDR_SIZE x86/kexec: Don't setup EFI info if EFI runtime is not enabled x86_64: increase stack size for KASAN_EXTRA mm, memory_hotplug: is_mem_section_removable do not pass the end of a zone mm, memory_hotplug: test_pages_in_a_zone do not pass the end of zone fs/drop_caches.c: avoid softlockups in drop_pagecache_sb() autofs: drop dentry reference only when it is never used autofs: fix error return in autofs_fill_super() ARM: pxa: ssp: unneeded to free devm_ allocated data irqchip/mmp: Only touch the PJ4 IRQ & FIQ bits on enable/disable dmaengine: at_xdmac: Fix wrongfull report of a channel as in use dmaengine: dmatest: Abort test in case of mapping error s390/qeth: fix use-after-free in error path perf symbols: Filter out hidden symbols from labels MIPS: Remove function size check in get_frame_info() Input: wacom_serial4 - add support for Wacom ArtPad II tablet Input: elan_i2c - add id for touchpad found in Lenovo s21e-20 iscsi_ibft: Fix missing break in switch statement futex,rt_mutex: Restructure rt_mutex_finish_proxy_lock() ARM: dts: exynos: Add minimal clkout parameters to Exynos3250 PMU Revert "x86/platform/UV: Use efi_runtime_lock to serialise BIOS calls" ARM: dts: exynos: Do not ignore real-world fuse values for thermal zone 0 on Exynos5420 udplite: call proper backlog handlers netfilter: x_tables: enforce nul-terminated table name from getsockopt GET_ENTRIES netfilter: nfnetlink_log: just returns error for unknown command netfilter: nfnetlink_acct: validate NFACCT_FILTER parameters netfilter: nf_conntrack_tcp: Fix stack out of bounds when parsing TCP options KEYS: restrict /proc/keys by credentials at open time l2tp: fix infoleak in l2tp_ip6_recvmsg() net: hsr: fix memory leak in hsr_dev_finalize() net: sit: fix UBSAN Undefined behaviour in check_6rd net/x25: fix use-after-free in x25_device_event() net/x25: reset state in x25_connect() pptp: dst_release sk_dst_cache in pptp_sock_destruct ravb: Decrease TxFIFO depth of Q3 and Q2 to one route: set the deleted fnhe fnhe_daddr to 0 in ip_del_fnhe to fix a race tcp: handle inet_csk_reqsk_queue_add() failures net/mlx4_core: Fix reset flow when in command polling mode net/mlx4_core: Fix qp mtt size calculation net/x25: fix a race in x25_bind() mdio_bus: Fix use-after-free on device_register fails net: Set rtm_table to RT_TABLE_COMPAT for ipv6 for tables > 255 missing barriers in some of unix_sock ->addr and ->path accesses ipvlan: disallow userns cap_net_admin to change global mode/flags vxlan: test dev->flags & IFF_UP before calling gro_cells_receive() vxlan: Fix GRO cells race condition between receive and link delete net/hsr: fix possible crash in add_timer() gro_cells: make sure device is up in gro_cells_receive() tcp/dccp: remove reqsk_put() from inet_child_forget() ALSA: bebob: use more identical mod_alias for Saffire Pro 10 I/O against Liquid Saffire 56 fs/9p: use fscache mutex rather than spinlock It's wrong to add len to sector_nr in raid10 reshape twice media: videobuf2-v4l2: drop WARN_ON in vb2_warn_zero_bytesused() 9p: use inode->i_lock to protect i_size_write() under 32-bit 9p/net: fix memory leak in p9_client_create ASoC: fsl_esai: fix register setting issue in RIGHT_J mode stm class: Fix an endless loop in channel allocation crypto: caam - fixed handling of sg list crypto: ahash - fix another early termination in hash walk gpu: ipu-v3: Fix i.MX51 CSI control registers offset gpu: ipu-v3: Fix CSI offsets for imx53 s390/dasd: fix using offset into zero size array error ARM: OMAP2+: Variable "reg" in function omap4_dsi_mux_pads() could be uninitialized Input: matrix_keypad - use flush_delayed_work() i2c: cadence: Fix the hold bit setting Input: st-keyscan - fix potential zalloc NULL dereference ARM: 8824/1: fix a migrating irq bug when hotplug cpu assoc_array: Fix shortcut creation scsi: libiscsi: Fix race between iscsi_xmit_task and iscsi_complete_task net: systemport: Fix reception of BPDUs pinctrl: meson: meson8b: fix the sdxc_a data 1..3 pins net: mv643xx_eth: disable clk on error path in mv643xx_eth_shared_probe() ASoC: topology: free created components in tplg load error arm64: Relax GIC version check during early boot tmpfs: fix link accounting when a tmpfile is linked in ARC: uacces: remove lp_start, lp_end from clobber list phonet: fix building with clang mac80211_hwsim: propagate genlmsg_reply return code net: set static variable an initial value in atl2_probe() tmpfs: fix uninitialized return value in shmem_link stm class: Prevent division by zero crypto: arm64/aes-ccm - fix logical bug in AAD MAC handling CIFS: Fix read after write for files with read caching tracing: Do not free iter->trace in fail path of tracing_open_pipe() ACPI / device_sysfs: Avoid OF modalias creation for removed device regulator: s2mps11: Fix steps for buck7, buck8 and LDO35 regulator: s2mpa01: Fix step values for some LDOs clocksource/drivers/exynos_mct: Move one-shot check from tick clear to ISR clocksource/drivers/exynos_mct: Clear timer interrupt when shutdown s390/virtio: handle find on invalid queue gracefully scsi: virtio_scsi: don't send sc payload with tmfs scsi: target/iscsi: Avoid iscsit_release_commands_from_conn() deadlock m68k: Add -ffreestanding to CFLAGS btrfs: ensure that a DUP or RAID1 block group has exactly two stripes Btrfs: fix corruption reading shared and compressed extents after hole punching crypto: pcbc - remove bogus memcpy()s with src == dest cpufreq: tegra124: add missing of_node_put() cpufreq: pxa2xx: remove incorrect __init annotation ext4: fix crash during online resizing ext2: Fix underflow in ext2_max_size() clk: ingenic: Fix round_rate misbehaving with non-integer dividers dmaengine: usb-dmac: Make DMAC system sleep callbacks explicit mm/vmalloc: fix size check for remap_vmalloc_range_partial() kernel/sysctl.c: add missing range check in do_proc_dointvec_minmax_conv intel_th: Don't reference unassigned outputs parport_pc: fix find_superio io compare code, should use equal test. i2c: tegra: fix maximum transfer size perf bench: Copy kernel files needed to build mem{cpy,set} x86_64 benchmarks serial: 8250_pci: Fix number of ports for ACCES serial cards serial: 8250_pci: Have ACCES cards that use the four port Pericom PI7C9X7954 chip use the pci_pericom_setup() jbd2: clear dirty flag when revoking a buffer from an older transaction jbd2: fix compile warning when using JBUFFER_TRACE powerpc/32: Clear on-stack exception marker upon exception return powerpc/wii: properly disable use of BATs when requested. powerpc/powernv: Make opal log only readable by root powerpc/83xx: Also save/restore SPRG4-7 during suspend ARM: s3c24xx: Fix boolean expressions in osiris_dvs_notify dm: fix to_sector() for 32bit NFS41: pop some layoutget errors to application perf intel-pt: Fix CYC timestamp calculation after OVF perf auxtrace: Define auxtrace record alignment perf intel-pt: Fix overlap calculation for padding md: Fix failed allocation of md_register_thread NFS: Fix an I/O request leakage in nfs_do_recoalesce NFS: Don't recoalesce on error in nfs_pageio_complete_mirror() nfsd: fix memory corruption caused by readdir nfsd: fix wrong check in write_v4_end_grace() PM / wakeup: Rework wakeup source timer cancellation rcu: Do RCU GP kthread self-wakeup from softirq and interrupt media: uvcvideo: Avoid NULL pointer dereference at the end of streaming drm/radeon/evergreen_cs: fix missing break in switch statement KVM: nVMX: Sign extend displacements of VMX instr's mem operands KVM: nVMX: Ignore limit checks on VMX instructions using flat segments KVM: X86: Fix residual mmio emulation request to userspace Linux 4.4.177 Change-Id: Ia33b88c9634e04612874d79ce4cc166e8aa8096a Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
| * mm/vmalloc: fix size check for remap_vmalloc_range_partial()Roman Penyaev2019-03-23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 401592d2e095947344e10ec0623adbcd58934dd4 upstream. When VM_NO_GUARD is not set area->size includes adjacent guard page, thus for correct size checking get_vm_area_size() should be used, but not area->size. This fixes possible kernel oops when userspace tries to mmap an area on 1 page bigger than was allocated by vmalloc_user() call: the size check inside remap_vmalloc_range_partial() accounts non-existing guard page also, so check successfully passes but vmalloc_to_page() returns NULL (guard page does not physically exist). The following code pattern example should trigger an oops: static int oops_mmap(struct file *file, struct vm_area_struct *vma) { void *mem; mem = vmalloc_user(4096); BUG_ON(!mem); /* Do not care about mem leak */ return remap_vmalloc_range(vma, mem, 0); } And userspace simply mmaps size + PAGE_SIZE: mmap(NULL, 8192, PROT_WRITE|PROT_READ, MAP_PRIVATE, fd, 0); Possible candidates for oops which do not have any explicit size checks: *** drivers/media/usb/stkwebcam/stk-webcam.c: v4l_stk_mmap[789] ret = remap_vmalloc_range(vma, sbuf->buffer, 0); Or the following one: *** drivers/video/fbdev/core/fbmem.c static int fb_mmap(struct file *file, struct vm_area_struct * vma) ... res = fb->fb_mmap(info, vma); Where fb_mmap callback calls remap_vmalloc_range() directly without any explicit checks: *** drivers/video/fbdev/vfb.c static int vfb_mmap(struct fb_info *info, struct vm_area_struct *vma) { return remap_vmalloc_range(vma, (void *)info->fix.smem_start, vma->vm_pgoff); } Link: http://lkml.kernel.org/r/20190103145954.16942-2-rpenyaev@suse.de Signed-off-by: Roman Penyaev <rpenyaev@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Joe Perches <joe@perches.com> Cc: "Luis R. Rodriguez" <mcgrof@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* | Merge 4.4.146 into android-4.4-pGreg Kroah-Hartman2018-08-06
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Changes in 4.4.146 MIPS: Fix off-by-one in pci_resource_to_user() Input: elan_i2c - add ACPI ID for lenovo ideapad 330 Input: i8042 - add Lenovo LaVie Z to the i8042 reset list Input: elan_i2c - add another ACPI ID for Lenovo Ideapad 330-15AST tracing: Fix double free of event_trigger_data tracing: Fix possible double free in event_enable_trigger_func() tracing/kprobes: Fix trace_probe flags on enable_trace_kprobe() failure tracing: Quiet gcc warning about maybe unused link variable xen/netfront: raise max number of slots in xennet_get_responses() ALSA: emu10k1: add error handling for snd_ctl_add ALSA: fm801: add error handling for snd_ctl_add nfsd: fix potential use-after-free in nfsd4_decode_getdeviceinfo mm: vmalloc: avoid racy handling of debugobjects in vunmap mm/slub.c: add __printf verification to slab_err() rtc: ensure rtc_set_alarm fails when alarms are not supported netfilter: ipset: List timing out entries with "timeout 1" instead of zero infiniband: fix a possible use-after-free bug hvc_opal: don't set tb_ticks_per_usec in udbg_init_opal_common() powerpc/64s: Fix compiler store ordering to SLB shadow area RDMA/mad: Convert BUG_ONs to error flows disable loading f2fs module on PAGE_SIZE > 4KB f2fs: fix to don't trigger writeback during recovery usbip: usbip_detach: Fix memory, udev context and udev leak perf/x86/intel/uncore: Correct fixed counter index check in generic code perf/x86/intel/uncore: Correct fixed counter index check for NHM iwlwifi: pcie: fix race in Rx buffer allocator Bluetooth: hci_qca: Fix "Sleep inside atomic section" warning Bluetooth: btusb: Add a new Realtek 8723DE ID 2ff8:b011 ASoC: dpcm: fix BE dai not hw_free and shutdown mfd: cros_ec: Fail early if we cannot identify the EC mwifiex: handle race during mwifiex_usb_disconnect wlcore: sdio: check for valid platform device data before suspend media: videobuf2-core: don't call memop 'finish' when queueing btrfs: add barriers to btrfs_sync_log before log_commit_wait wakeups btrfs: qgroup: Finish rescan when hit the last leaf of extent tree PCI: Prevent sysfs disable of device while driver is attached ath: Add regulatory mapping for FCC3_ETSIC ath: Add regulatory mapping for ETSI8_WORLD ath: Add regulatory mapping for APL13_WORLD ath: Add regulatory mapping for APL2_FCCA ath: Add regulatory mapping for Uganda ath: Add regulatory mapping for Tanzania ath: Add regulatory mapping for Serbia ath: Add regulatory mapping for Bermuda ath: Add regulatory mapping for Bahamas powerpc/32: Add a missing include header powerpc/chrp/time: Make some functions static, add missing header include powerpc/powermac: Add missing prototype for note_bootable_part() powerpc/powermac: Mark variable x as unused powerpc/8xx: fix invalid register expression in head_8xx.S pinctrl: at91-pio4: add missing of_node_put PCI: pciehp: Request control of native hotplug only if supported mwifiex: correct histogram data with appropriate index scsi: ufs: fix exception event handling ALSA: emu10k1: Rate-limit error messages about page errors regulator: pfuze100: add .is_enable() for pfuze100_swb_regulator_ops md: fix NULL dereference of mddev->pers in remove_and_add_spares() media: smiapp: fix timeout checking in smiapp_read_nvm ALSA: usb-audio: Apply rate limit to warning messages in URB complete callback HID: hid-plantronics: Re-resend Update to map button for PTT products drm/radeon: fix mode_valid's return type powerpc/embedded6xx/hlwd-pic: Prevent interrupts from being handled by Starlet HID: i2c-hid: check if device is there before really probing tty: Fix data race in tty_insert_flip_string_fixed_flag dma-iommu: Fix compilation when !CONFIG_IOMMU_DMA media: rcar_jpu: Add missing clk_disable_unprepare() on error in jpu_open() libata: Fix command retry decision media: saa7164: Fix driver name in debug output mtd: rawnand: fsl_ifc: fix FSL NAND driver to read all ONFI parameter pages brcmfmac: Add support for bcm43364 wireless chipset s390/cpum_sf: Add data entry sizes to sampling trailer entry perf: fix invalid bit in diagnostic entry scsi: 3w-9xxx: fix a missing-check bug scsi: 3w-xxxx: fix a missing-check bug scsi: megaraid: silence a static checker bug thermal: exynos: fix setting rising_threshold for Exynos5433 bpf: fix references to free_bpf_prog_info() in comments media: siano: get rid of __le32/__le16 cast warnings drm/atomic: Handling the case when setting old crtc for plane ALSA: hda/ca0132: fix build failure when a local macro is defined memory: tegra: Do not handle spurious interrupts memory: tegra: Apply interrupts mask per SoC drm/gma500: fix psb_intel_lvds_mode_valid()'s return type ipconfig: Correctly initialise ic_nameservers rsi: Fix 'invalid vdd' warning in mmc audit: allow not equal op for audit by executable microblaze: Fix simpleImage format generation usb: hub: Don't wait for connect state at resume for powered-off ports crypto: authencesn - don't leak pointers to authenc keys crypto: authenc - don't leak pointers to authenc keys media: omap3isp: fix unbalanced dma_iommu_mapping scsi: scsi_dh: replace too broad "TP9" string with the exact models scsi: megaraid_sas: Increase timeout by 1 sec for non-RAID fastpath IOs media: si470x: fix __be16 annotations drm: Add DP PSR2 sink enable bit random: mix rdrand with entropy sent in from userspace squashfs: be more careful about metadata corruption ext4: fix inline data updates with checksums enabled ext4: check for allocation block validity with block group locked dmaengine: pxa_dma: remove duplicate const qualifier ASoC: pxa: Fix module autoload for platform drivers ipv4: remove BUG_ON() from fib_compute_spec_dst net: fix amd-xgbe flow-control issue net: lan78xx: fix rx handling before first packet is send xen-netfront: wait xenbus state change when load module manually NET: stmmac: align DMA stuff to largest cache line length tcp: do not force quickack when receiving out-of-order packets tcp: add max_quickacks param to tcp_incr_quickack and tcp_enter_quickack_mode tcp: do not aggressively quick ack after ECN events tcp: refactor tcp_ecn_check_ce to remove sk type cast tcp: add one more quick ack after after ECN events inet: frag: enforce memory limits earlier net: dsa: Do not suspend/resume closed slave_dev netlink: Fix spectre v1 gadget in netlink_create() squashfs: more metadata hardening squashfs: more metadata hardenings can: ems_usb: Fix memory leak on ems_usb_disconnect() net: socket: fix potential spectre v1 gadget in socketcall virtio_balloon: fix another race between migration and ballooning kvm: x86: vmx: fix vpid leak crypto: padlock-aes - Fix Nano workaround data corruption scsi: sg: fix minor memory leak in error path Linux 4.4.146 Change-Id: I7b8ad5e297804f92b3e3a8c5daf8a26ba684029b Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
| * mm: vmalloc: avoid racy handling of debugobjects in vunmapChintan Pandya2018-08-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit f3c01d2f3ade6790db67f80fef60df84424f8964 ] Currently, __vunmap flow is, 1) Release the VM area 2) Free the debug objects corresponding to that vm area. This leave some race window open. 1) Release the VM area 1.5) Some other client gets the same vm area 1.6) This client allocates new debug objects on the same vm area 2) Free the debug objects corresponding to this vm area. Here, we actually free 'other' client's debug objects. Fix this by freeing the debug objects first and then releasing the VM area. Link: http://lkml.kernel.org/r/1523961828-9485-2-git-send-email-cpandya@codeaurora.org Signed-off-by: Chintan Pandya <cpandya@codeaurora.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Byungchul Park <byungchul.park@lge.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Florian Fainelli <f.fainelli@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Laura Abbott <labbott@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Yisheng Xie <xieyisheng1@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* | BACKPORT: mm: coalesce split stringsJoe Perches2017-12-14
|/ | | | | | | | | | | | | | | | | | | | Kernel style prefers a single string over split strings when the string is 'user-visible'. Miscellanea: - Add a missing newline - Realign arguments Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Tejun Heo <tj@kernel.org> [percpu] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Bug: 64145065 (cherry-picked from 756a025f00091918d9d09ca3229defb160b409c0) Change-Id: I377fb1542980c15d2f306924656227ad17b02b5e Signed-off-by: Paul Lawrence <paullawrence@google.com>
* mm: vmalloc: don't remove inexistent guard hole in remove_vm_area()Jerome Marchand2015-11-20
| | | | | | | | | | | | | | | | Commit 71394fe50146 ("mm: vmalloc: add flag preventing guard hole allocation") missed a spot. Currently remove_vm_area() decreases vm->size to "remove" the guard hole page, even when it isn't present. All but one users just free the vm_struct rigth away and never access vm->size anyway. Don't touch the size in remove_vm_area() and have __vunmap() use the proper get_vm_area_size() helper. Signed-off-by: Jerome Marchand <jmarchan@redhat.com> Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: page_alloc: hide some GFP internals and document the bits and flag ↵Mel Gorman2015-11-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | combinations Andrew stated the following We have quite a history of remote parts of the kernel using weird/wrong/inexplicable combinations of __GFP_ flags. I tend to think that this is because we didn't adequately explain the interface. And I don't think that gfp.h really improved much in this area as a result of this patchset. Could you go through it some time and decide if we've adequately documented all this stuff? This patches first moves some GFP flag combinations that are part of the MM internals to mm/internal.h. The rest of the patch documents the __GFP_FOO bits under various headings and then documents the flag combinations. It will not help callers that are brain damaged but the clarity might motivate some fixes and avoid future mistakes. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm, page_alloc: distinguish between being unable to sleep, unwilling to ↵Mel Gorman2015-11-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | sleep and avoiding waking kswapd __GFP_WAIT has been used to identify atomic context in callers that hold spinlocks or are in interrupts. They are expected to be high priority and have access one of two watermarks lower than "min" which can be referred to as the "atomic reserve". __GFP_HIGH users get access to the first lower watermark and can be called the "high priority reserve". Over time, callers had a requirement to not block when fallback options were available. Some have abused __GFP_WAIT leading to a situation where an optimisitic allocation with a fallback option can access atomic reserves. This patch uses __GFP_ATOMIC to identify callers that are truely atomic, cannot sleep and have no alternative. High priority users continue to use __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM to identify callers that want to wake kswapd for background reclaim. __GFP_WAIT is redefined as a caller that is willing to enter direct reclaim and wake kswapd for background reclaim. This patch then converts a number of sites o __GFP_ATOMIC is used by callers that are high priority and have memory pools for those requests. GFP_ATOMIC uses this flag. o Callers that have a limited mempool to guarantee forward progress clear __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall into this category where kswapd will still be woken but atomic reserves are not used as there is a one-entry mempool to guarantee progress. o Callers that are checking if they are non-blocking should use the helper gfpflags_allow_blocking() where possible. This is because checking for __GFP_WAIT as was done historically now can trigger false positives. Some exceptions like dm-crypt.c exist where the code intent is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to flag manipulations. o Callers that built their own GFP flags instead of starting with GFP_KERNEL and friends now also need to specify __GFP_KSWAPD_RECLAIM. The first key hazard to watch out for is callers that removed __GFP_WAIT and was depending on access to atomic reserves for inconspicuous reasons. In some cases it may be appropriate for them to use __GFP_HIGH. The second key hazard is callers that assembled their own combination of GFP flags instead of starting with something like GFP_KERNEL. They may now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless if it's missed in most cases as other activity will wake kswapd. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/vmalloc: use offset_in_page macroAlexander Kuleshov2015-11-05
| | | | | | | | | linux/mm.h provides offset_in_page() macro. Let's use already predefined macro instead of (addr & ~PAGE_MASK). Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: get rid of 'vmalloc_info' from /proc/meminfoLinus Torvalds2015-11-01
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It turns out that at least some versions of glibc end up reading /proc/meminfo at every single startup, because glibc wants to know the amount of memory the machine has. And while that's arguably insane, it's just how things are. And it turns out that it's not all that expensive most of the time, but the vmalloc information statistics (amount of virtual memory used in the vmalloc space, and the biggest remaining chunk) can be rather expensive to compute. The 'get_vmalloc_info()' function actually showed up on my profiles as 4% of the CPU usage of "make test" in the git source repository, because the git tests are lots of very short-lived shell-scripts etc. It turns out that apparently this same silly vmalloc info gathering shows up on the facebook servers too, according to Dave Jones. So it's not just "make test" for git. We had two patches to just cache the information (one by me, one by Ingo) to mitigate this issue, but the whole vmalloc information of of rather dubious value to begin with, and people who *actually* want to know what the situation is wrt the vmalloc area should just look at the much more complete /proc/vmallocinfo instead. In fact, according to my testing - and perhaps more importantly, according to that big search engine in the sky: Google - there is nothing out there that actually cares about those two expensive fields: VmallocUsed and VmallocChunk. So let's try to just remove them entirely. Actually, this just removes the computation and reports the numbers as zero for now, just to try to be minimally intrusive. If this breaks anything, we'll obviously have to re-introduce the code to compute this all and add the caching patches on top. But if given the option, I'd really prefer to just remove this bad idea entirely rather than add even more code to work around our historical mistake that likely nobody really cares about. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/vmalloc: get rid of dirty bitmap inside vmap_block structureRoman Pen2015-04-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In original implementation of vm_map_ram made by Nick Piggin there were two bitmaps: alloc_map and dirty_map. None of them were used as supposed to be: finding a suitable free hole for next allocation in block. vm_map_ram allocates space sequentially in block and on free call marks pages as dirty, so freed space can't be reused anymore. Actually it would be very interesting to know the real meaning of those bitmaps, maybe implementation was incomplete, etc. But long time ago Zhang Yanfei removed alloc_map by these two commits: mm/vmalloc.c: remove dead code in vb_alloc 3fcd76e8028e0be37b02a2002b4f56755daeda06 mm/vmalloc.c: remove alloc_map from vmap_block b8e748b6c32999f221ea4786557b8e7e6c4e4e7a In this patch I replaced dirty_map with two range variables: dirty min and max. These variables store minimum and maximum position of dirty space in a block, since we need only to know the dirty range, not exact position of dirty pages. Why it was made? Several reasons: at first glance it seems that vm_map_ram allocator concerns about fragmentation thus it uses bitmaps for finding free hole, but it is not true. To avoid complexity seems it is better to use something simple, like min or max range values. Secondly, code also becomes simpler, without iteration over bitmap, just comparing values in min and max macros. Thirdly, bitmap occupies up to 1024 bits (4MB is a max size of a block). Here I replaced the whole bitmap with two longs. Finally vm_unmap_aliases should be slightly faster and the whole vmap_block structure occupies less memory. Signed-off-by: Roman Pen <r.peniaev@gmail.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Eric Dumazet <edumazet@google.com> Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: David Rientjes <rientjes@google.com> Cc: WANG Chao <chaowang@redhat.com> Cc: Fabian Frederick <fabf@skynet.be> Cc: Christoph Lameter <cl@linux.com> Cc: Gioh Kim <gioh.kim@lge.com> Cc: Rob Jones <rob.jones@codethink.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/vmalloc: occupy newly allocated vmap block just after allocationRoman Pen2015-04-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previous implementation allocates new vmap block and repeats search of a free block from the very beginning, iterating over the CPU free list. Why it can be better?? 1. Allocation can happen on one CPU, but search can be done on another CPU. In worst case we preallocate amount of vmap blocks which is equal to CPU number on the system. 2. In previous patch I added newly allocated block to the tail of free list to avoid soon exhaustion of virtual space and give a chance to occupy blocks which were allocated long time ago. Thus to find newly allocated block all the search sequence should be repeated, seems it is not efficient. In this patch newly allocated block is occupied right away, address of virtual space is returned to the caller, so there is no any need to repeat the search sequence, allocation job is done. Signed-off-by: Roman Pen <r.peniaev@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Eric Dumazet <edumazet@google.com> Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: David Rientjes <rientjes@google.com> Cc: WANG Chao <chaowang@redhat.com> Cc: Fabian Frederick <fabf@skynet.be> Cc: Christoph Lameter <cl@linux.com> Cc: Gioh Kim <gioh.kim@lge.com> Cc: Rob Jones <rob.jones@codethink.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/vmalloc: fix possible exhaustion of vmalloc space caused by vm_map_ram ↵Roman Pen2015-04-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | allocator Recently I came across high fragmentation of vm_map_ram allocator: vmap_block has free space, but still new blocks continue to appear. Further investigation showed that certain mapping/unmapping sequences can exhaust vmalloc space. On small 32bit systems that's not a big problem, cause purging will be called soon on a first allocation failure (alloc_vmap_area), but on 64bit machines, e.g. x86_64 has 45 bits of vmalloc space, that can be a disaster. 1) I came up with a simple allocation sequence, which exhausts virtual space very quickly: while (iters) { /* Map/unmap big chunk */ vaddr = vm_map_ram(pages, 16, -1, PAGE_KERNEL); vm_unmap_ram(vaddr, 16); /* Map/unmap small chunks. * * -1 for hole, which should be left at the end of each block * to keep it partially used, with some free space available */ for (i = 0; i < (VMAP_BBMAP_BITS - 16) / 8 - 1; i++) { vaddr = vm_map_ram(pages, 8, -1, PAGE_KERNEL); vm_unmap_ram(vaddr, 8); } } The idea behind is simple: 1. We have to map a big chunk, e.g. 16 pages. 2. Then we have to occupy the remaining space with smaller chunks, i.e. 8 pages. At the end small hole should remain to keep block in free list, but do not let big chunk to occupy remaining space. 3. Goto 1 - allocation request of 16 pages can't be completed (only 8 slots are left free in the block in the #2 step), new block will be allocated, all further requests will lay into newly allocated block. To have some measurement numbers for all further tests I setup ftrace and enabled 4 basic calls in a function profile: echo vm_map_ram > /sys/kernel/debug/tracing/set_ftrace_filter; echo alloc_vmap_area >> /sys/kernel/debug/tracing/set_ftrace_filter; echo vm_unmap_ram >> /sys/kernel/debug/tracing/set_ftrace_filter; echo free_vmap_block >> /sys/kernel/debug/tracing/set_ftrace_filter; So for this scenario I got these results: BEFORE (all new blocks are put to the head of a free list) # cat /sys/kernel/debug/tracing/trace_stat/function0 Function Hit Time Avg s^2 -------- --- ---- --- --- vm_map_ram 126000 30683.30 us 0.243 us 30819.36 us vm_unmap_ram 126000 22003.24 us 0.174 us 340.886 us alloc_vmap_area 1000 4132.065 us 4.132 us 0.903 us AFTER (all new blocks are put to the tail of a free list) # cat /sys/kernel/debug/tracing/trace_stat/function0 Function Hit Time Avg s^2 -------- --- ---- --- --- vm_map_ram 126000 28713.13 us 0.227 us 24944.70 us vm_unmap_ram 126000 20403.96 us 0.161 us 1429.872 us alloc_vmap_area 993 3916.795 us 3.944 us 29.370 us free_vmap_block 992 654.157 us 0.659 us 1.273 us SUMMARY: The most interesting numbers in those tables are numbers of block allocations and deallocations: alloc_vmap_area and free_vmap_block calls, which show that before the change blocks were not freed, and virtual space and physical memory (vmap_block structure allocations, etc) were consumed. Average time which were spent in vm_map_ram/vm_unmap_ram became slightly better. That can be explained with a reasonable amount of blocks in a free list, which we need to iterate to find a suitable free block. 2) Another scenario is a random allocation: while (iters) { /* Randomly take number from a range [1..32/64] */ nr = rand(1, VMAP_MAX_ALLOC); vaddr = vm_map_ram(pages, nr, -1, PAGE_KERNEL); vm_unmap_ram(vaddr, nr); } I chose mersenne twister PRNG to generate persistent random state to guarantee that both runs have the same random sequence. For each vm_map_ram call random number from [1..32/64] was taken to represent amount of pages which I do map. I did 10'000 vm_map_ram calls and got these two tables: BEFORE (all new blocks are put to the head of a free list) # cat /sys/kernel/debug/tracing/trace_stat/function0 Function Hit Time Avg s^2 -------- --- ---- --- --- vm_map_ram 10000 10170.01 us 1.017 us 993.609 us vm_unmap_ram 10000 5321.823 us 0.532 us 59.789 us alloc_vmap_area 420 2150.239 us 5.119 us 3.307 us free_vmap_block 37 159.587 us 4.313 us 134.344 us AFTER (all new blocks are put to the tail of a free list) # cat /sys/kernel/debug/tracing/trace_stat/function0 Function Hit Time Avg s^2 -------- --- ---- --- --- vm_map_ram 10000 7745.637 us 0.774 us 395.229 us vm_unmap_ram 10000 5460.573 us 0.546 us 67.187 us alloc_vmap_area 414 2201.650 us 5.317 us 5.591 us free_vmap_block 412 574.421 us 1.394 us 15.138 us SUMMARY: 'BEFORE' table shows, that 420 blocks were allocated and only 37 were freed. Remained 383 blocks are still in a free list, consuming virtual space and physical memory. 'AFTER' table shows, that 414 blocks were allocated and 412 were really freed. 2 blocks remained in a free list. So fragmentation was dramatically reduced. Why? Because when we put newly allocated block to the head, all further requests will occupy new block, regardless remained space in other blocks. In this scenario all requests come randomly. Eventually remained free space will be less than requested size, free list will be iterated and it is possible that nothing will be found there - finally new block will be created. So exhaustion in random scenario happens for the maximum possible allocation size: 32 pages for 32-bit system and 64 pages for 64-bit system. Also average cost of vm_map_ram was reduced from 1.017 us to 0.774 us. Again this can be explained by iteration through smaller list of free blocks. 3) Next simple scenario is a sequential allocation, when the allocation order is increased for each block. This scenario forces allocator to reach maximum amount of partially free blocks in a free list: while (iters) { /* Populate free list with blocks with remaining space */ for (order = 0; order <= ilog2(VMAP_MAX_ALLOC); order++) { nr = VMAP_BBMAP_BITS / (1 << order); /* Leave a hole */ nr -= 1; for (i = 0; i < nr; i++) { vaddr = vm_map_ram(pages, (1 << order), -1, PAGE_KERNEL); vm_unmap_ram(vaddr, (1 << order)); } /* Completely occupy blocks from a free list */ for (order = 0; order <= ilog2(VMAP_MAX_ALLOC); order++) { vaddr = vm_map_ram(pages, (1 << order), -1, PAGE_KERNEL); vm_unmap_ram(vaddr, (1 << order)); } } Results which I got: BEFORE (all new blocks are put to the head of a free list) # cat /sys/kernel/debug/tracing/trace_stat/function0 Function Hit Time Avg s^2 -------- --- ---- --- --- vm_map_ram 2032000 399545.2 us 0.196 us 467123.7 us vm_unmap_ram 2032000 363225.7 us 0.178 us 111405.9 us alloc_vmap_area 7001 30627.76 us 4.374 us 495.755 us free_vmap_block 6993 7011.685 us 1.002 us 159.090 us AFTER (all new blocks are put to the tail of a free list) # cat /sys/kernel/debug/tracing/trace_stat/function0 Function Hit Time Avg s^2 -------- --- ---- --- --- vm_map_ram 2032000 394259.7 us 0.194 us 589395.9 us vm_unmap_ram 2032000 292500.7 us 0.143 us 94181.08 us alloc_vmap_area 7000 31103.11 us 4.443 us 703.225 us free_vmap_block 7000 6750.844 us 0.964 us 119.112 us SUMMARY: No surprises here, almost all numbers are the same. Fixing this fragmentation problem I also did some improvements in a allocation logic of a new vmap block: occupy block immediately and get rid of extra search in a free list. Also I replaced dirty bitmap with min/max dirty range values to make the logic simpler and slightly faster, since two longs comparison costs less, than loop thru bitmap. This patchset raises several questions: Q: Think the problem you comments is already known so that I wrote comments about it as "it could consume lots of address space through fragmentation". Could you tell me about your situation and reason why it should be avoided? Gioh Kim A: Indeed, there was a commit 364376383 which adds explicit comment about fragmentation. But fragmentation which is described in this comment caused by mixing of long-lived and short-lived objects, when a whole block is pinned in memory because some page slots are still in use. But here I am talking about blocks which are free, nobody uses them, and allocator keeps them alive forever, continuously allocating new blocks. Q: I think that if you put newly allocated block to the tail of a free list, below example would results in enormous performance degradation. new block: 1MB (256 pages) while (iters--) { vm_map_ram(3 or something else not dividable for 256) * 85 vm_unmap_ram(3) * 85 } On every iteration, it needs newly allocated block and it is put to the tail of a free list so finding it consumes large amount of time. Joonsoo Kim A: Second patch in current patchset gets rid of extra search in a free list, so new block will be immediately occupied.. Also, the scenario above is impossible, cause vm_map_ram allocates virtual range in orders, i.e. 2^n. I.e. passing 3 to vm_map_ram you will allocate 4 slots in a block and 256 slots (capacity of a block) of course dividable on 4, so block will be completely occupied. But there is a worst case which we can achieve: each free block has a hole equal to order size. The maximum size of allocation is 64 pages for 64-bit system (if you try to map more, original alloc_vmap_area will be called). So the maximum order is 6. That means that worst case, before allocator makes a decision to allocate a new block, is to iterate 7 blocks: HEAD 1st block - has 1 page slot free (order 0) 2nd block - has 2 page slots free (order 1) 3rd block - has 4 page slots free (order 2) 4th block - has 8 page slots free (order 3) 5th block - has 16 page slots free (order 4) 6th block - has 32 page slots free (order 5) 7th block - has 64 page slots free (order 6) TAIL So the worst scenario on 64-bit system is that each CPU queue can have 7 blocks in a free list. This can happen only and only if you allocate blocks increasing the order. (as I did in the function written in the comment of the first patch) This is weird and rare case, but still it is possible. Afterwards you will get 7 blocks in a list. All further requests should be placed in a newly allocated block or some free slots should be found in a free list. Seems it does not look dramatically awful. This patch (of 3): If suitable block can't be found, new block is allocated and put into a head of a free list, so on next iteration this new block will be found first. That's bad, because old blocks in a free list will not get a chance to be fully used, thus fragmentation will grow. Let's consider this simple example: #1 We have one block in a free list which is partially used, and where only one page is free: HEAD |xxxxxxxxx-| TAIL ^ free space for 1 page, order 0 #2 New allocation request of order 1 (2 pages) comes, new block is allocated since we do not have free space to complete this request. New block is put into a head of a free list: HEAD |----------|xxxxxxxxx-| TAIL #3 Two pages were occupied in a new found block: HEAD |xx--------|xxxxxxxxx-| TAIL ^ two pages mapped here #4 New allocation request of order 0 (1 page) comes. Block, which was created on #2 step, is located at the beginning of a free list, so it will be found first: HEAD |xxX-------|xxxxxxxxx-| TAIL ^ ^ page mapped here, but better to use this hole It is obvious, that it is better to complete request of #4 step using the old block, where free space is left, because in other case fragmentation will be highly increased. But fragmentation is not only the case. The worst thing is that I can easily create scenario, when the whole vmalloc space is exhausted by blocks, which are not used, but already dirty and have several free pages. Let's consider this function which execution should be pinned to one CPU: static void exhaust_virtual_space(struct page *pages[16], int iters) { /* Firstly we have to map a big chunk, e.g. 16 pages. * Then we have to occupy the remaining space with smaller * chunks, i.e. 8 pages. At the end small hole should remain. * So at the end of our allocation sequence block looks like * this: * XX big chunk * |XXxxxxxxx-| x small chunk * - hole, which is enough for a small chunk, * but is not enough for a big chunk */ while (iters--) { int i; void *vaddr; /* Map/unmap big chunk */ vaddr = vm_map_ram(pages, 16, -1, PAGE_KERNEL); vm_unmap_ram(vaddr, 16); /* Map/unmap small chunks. * * -1 for hole, which should be left at the end of each block * to keep it partially used, with some free space available */ for (i = 0; i < (VMAP_BBMAP_BITS - 16) / 8 - 1; i++) { vaddr = vm_map_ram(pages, 8, -1, PAGE_KERNEL); vm_unmap_ram(vaddr, 8); } } } On every iteration new block (1MB of vm area in my case) will be allocated and then will be occupied, without attempt to resolve small allocation request using previously allocated blocks in a free list. In case of random allocation (size should be randomly taken from the range [1..64] in 64-bit case or [1..32] in 32-bit case) situation is the same: new blocks continue to appear if maximum possible allocation size (32 or 64) passed to the allocator, because all remaining blocks in a free list do not have enough free space to complete this allocation request. In summary if new blocks are put into the head of a free list eventually virtual space will be exhausted. In current patch I simply put newly allocated block to the tail of a free list, thus reduce fragmentation, giving a chance to resolve allocation request using older blocks with possible holes left. Signed-off-by: Roman Pen <r.peniaev@gmail.com> Cc: Eric Dumazet <edumazet@google.com> Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: David Rientjes <rientjes@google.com> Cc: WANG Chao <chaowang@redhat.com> Cc: Fabian Frederick <fabf@skynet.be> Cc: Christoph Lameter <cl@linux.com> Cc: Gioh Kim <gioh.kim@lge.com> Cc: Rob Jones <rob.jones@codethink.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: change vunmap to tear down huge KVA mappingsToshi Kani2015-04-14
| | | | | | | | | | | | | | | | | | | | Change vunmap_pmd_range() and vunmap_pud_range() to tear down huge KVA mappings when they are set. pud_clear_huge() and pmd_clear_huge() return zero when no-operation is performed, i.e. huge page mapping was not used. These changes are only enabled when CONFIG_HAVE_ARCH_HUGE_VMAP is defined on the architecture. [akpm@linux-foundation.org: use consistent code layout] Signed-off-by: Toshi Kani <toshi.kani@hp.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Robert Elliott <Elliott@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: change __get_vm_area_node() to use fls_long()Toshi Kani2015-04-14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ioremap() and its related interfaces are used to create I/O mappings to memory-mapped I/O devices. The mapping sizes of the traditional I/O devices are relatively small. Non-volatile memory (NVM), however, has many GB and is going to have TB soon. It is not very efficient to create large I/O mappings with 4KB. This patchset extends the ioremap() interfaces to transparently create I/O mappings with huge pages whenever possible. ioremap() continues to use 4KB mappings when a huge page does not fit into a requested range. There is no change necessary to the drivers using ioremap(). A requested physical address must be aligned by a huge page size (1GB or 2MB on x86) for using huge page mapping, though. The kernel huge I/O mapping will improve performance of NVM and other devices with large memory, and reduce the time to create their mappings as well. On x86, MTRRs can override PAT memory types with a 4KB granularity. When using a huge page, MTRRs can override the memory type of the huge page, which may lead a performance penalty. The processor can also behave in an undefined manner if a huge page is mapped to a memory range that MTRRs have mapped with multiple different memory types. Therefore, the mapping code falls back to use a smaller page size toward 4KB when a mapping range is covered by non-WB type of MTRRs. The WB type of MTRRs has no affect on the PAT memory types. The patchset introduces HAVE_ARCH_HUGE_VMAP, which indicates that the arch supports huge KVA mappings for ioremap(). User may specify a new kernel option "nohugeiomap" to disable the huge I/O mapping capability of ioremap() when necessary. Patch 1-4 change common files to support huge I/O mappings. There is no change in the functinalities unless HAVE_ARCH_HUGE_VMAP is defined on the architecture of the system. Patch 5-6 implement the HAVE_ARCH_HUGE_VMAP funcs on x86, and set HAVE_ARCH_HUGE_VMAP on x86. This patch (of 6): __get_vm_area_node() takes unsigned long size, which is a 64-bit value on a 64-bit kernel. However, fls(size) simply ignores the upper 32-bit. Change to use fls_long() to handle the size properly. Signed-off-by: Toshi Kani <toshi.kani@hp.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Robert Elliott <Elliott@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* kasan, module, vmalloc: rework shadow allocation for modulesAndrey Ryabinin2015-03-12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Current approach in handling shadow memory for modules is broken. Shadow memory could be freed only after memory shadow corresponds it is no longer used. vfree() called from interrupt context could use memory its freeing to store 'struct llist_node' in it: void vfree(const void *addr) { ... if (unlikely(in_interrupt())) { struct vfree_deferred *p = this_cpu_ptr(&vfree_deferred); if (llist_add((struct llist_node *)addr, &p->list)) schedule_work(&p->wq); Later this list node used in free_work() which actually frees memory. Currently module_memfree() called in interrupt context will free shadow before freeing module's memory which could provoke kernel crash. So shadow memory should be freed after module's memory. However, such deallocation order could race with kasan_module_alloc() in module_alloc(). Free shadow right before releasing vm area. At this point vfree()'d memory is not used anymore and yet not available for other allocations. New VM_KASAN flag used to indicate that vm area has dynamically allocated shadow memory so kasan frees shadow only if it was previously allocated. Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: vmalloc: pass additional vm_flags to __vmalloc_node_range()Andrey Ryabinin2015-02-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For instrumenting global variables KASan will shadow memory backing memory for modules. So on module loading we will need to allocate memory for shadow and map it at address in shadow that corresponds to the address allocated in module_alloc(). __vmalloc_node_range() could be used for this purpose, except it puts a guard hole after allocated area. Guard hole in shadow memory should be a problem because at some future point we might need to have a shadow memory at address occupied by guard hole. So we could fail to allocate shadow for module_alloc(). Now we have VM_NO_GUARD flag disabling guard page, so we need to pass into __vmalloc_node_range(). Add new parameter 'vm_flags' to __vmalloc_node_range() function. Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Konstantin Serebryany <kcc@google.com> Cc: Dmitry Chernenkov <dmitryc@google.com> Signed-off-by: Andrey Konovalov <adech.fo@gmail.com> Cc: Yuri Gribov <tetra2005@gmail.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Christoph Lameter <cl@linux.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: vmalloc: add flag preventing guard hole allocationAndrey Ryabinin2015-02-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For instrumenting global variables KASan will shadow memory backing memory for modules. So on module loading we will need to allocate memory for shadow and map it at address in shadow that corresponds to the address allocated in module_alloc(). __vmalloc_node_range() could be used for this purpose, except it puts a guard hole after allocated area. Guard hole in shadow memory should be a problem because at some future point we might need to have a shadow memory at address occupied by guard hole. So we could fail to allocate shadow for module_alloc(). Add a new vm_struct flag 'VM_NO_GUARD' indicating that vm area doesn't have a guard hole. Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Konstantin Serebryany <kcc@google.com> Cc: Dmitry Chernenkov <dmitryc@google.com> Signed-off-by: Andrey Konovalov <adech.fo@gmail.com> Cc: Yuri Gribov <tetra2005@gmail.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Christoph Lameter <cl@linux.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/vmalloc.c: fix memory ordering bugDmitry Vyukov2014-12-13
| | | | | | | | | | Read memory barriers must follow the read operations. Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Cc: Eric Dumazet <edumazet@google.com> Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/vmalloc.c: replace printk with pr_warnPintu Kumar2014-12-10
| | | | | | | | | This patch replaces printk(KERN_WARNING..) with pr_warn. Thus it also reduces one line extra because of formatting. Signed-off-by: Pintu Kumar <pintu.k@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/vmalloc.c: use seq_open_private() instead of seq_open()Rob Jones2014-10-09
| | | | | | | | | | | | | | Using seq_open_private() removes boilerplate code from vmalloc_open(). The resultant code is shorter and easier to follow. However, please note that seq_open_private() call kzalloc() rather than kmalloc() which may affect timing due to the memory initialisation overhead. Signed-off-by: Rob Jones <rob.jones@codethink.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/vmalloc.c: clean up map_vm_area third argumentWANG Chao2014-08-06
| | | | | | | | | | | | | | | | | | | | | | | | | | Currently map_vm_area() takes (struct page *** pages) as third argument, and after mapping, it moves (*pages) to point to (*pages + nr_mappped_pages). It looks like this kind of increment is useless to its caller these days. The callers don't care about the increments and actually they're trying to avoid this by passing another copy to map_vm_area(). The caller can always guarantee all the pages can be mapped into vm_area as specified in first argument and the caller only cares about whether map_vm_area() fails or not. This patch cleans up the pointer movement in map_vm_area() and updates its callers accordingly. Signed-off-by: WANG Chao <chaowang@redhat.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm, vmalloc: constify allocation maskDavid Rientjes2014-08-06
| | | | | | | | | | | | | tmp_mask in the __vmalloc_area_node() iteration never changes so it can be moved into function scope and marked with const. This causes the movl and orl to only be done once per call rather than area->nr_pages times. nested_gfp can also be marked const. Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/vmalloc.c: add a schedule point to vmalloc()Eric Dumazet2014-08-06
| | | | | | | | | | | | | | | It is not uncommon on busy servers to get stuck hundred of ms in vmalloc() calls (like file descriptor expansions). Add a cond_resched() to __vmalloc_area_node() to be gentle to other tasks. [akpm@linux-foundation.org: only do it for __GFP_WAIT, per David] Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Hugh Dickins <hughd@google.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* vmalloc: use rcu list iterator to reduce vmap_area_lock contentionJoonsoo Kim2014-08-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Richard Yao reported a month ago that his system have a trouble with vmap_area_lock contention during performance analysis by /proc/meminfo. Andrew asked why his analysis checks /proc/meminfo stressfully, but he didn't answer it. https://lkml.org/lkml/2014/4/10/416 Although I'm not sure that this is right usage or not, there is a solution reducing vmap_area_lock contention with no side-effect. That is just to use rcu list iterator in get_vmalloc_info(). rcu can be used in this function because all RCU protocol is already respected by writers, since Nick Piggin commit db64fe02258f1 ("mm: rewrite vmap layer") back in linux-2.6.28 Specifically : insertions use list_add_rcu(), deletions use list_del_rcu() and kfree_rcu(). Note the rb tree is not used from rcu reader (it would not be safe), only the vmap_area_list has full RCU protection. Note that __purge_vmap_area_lazy() already uses this rcu protection. rcu_read_lock(); list_for_each_entry_rcu(va, &vmap_area_list, list) { if (va->flags & VM_LAZY_FREE) { if (va->va_start < *start) *start = va->va_start; if (va->va_end > *end) *end = va->va_end; nr += (va->va_end - va->va_start) >> PAGE_SHIFT; list_add_tail(&va->purge_list, &valist); va->flags |= VM_LAZY_FREEING; va->flags &= ~VM_LAZY_FREE; } } rcu_read_unlock(); Peter: : While rcu list traversal over the vmap_area_list is safe, this may : arrive at different results than the spinlocked version. The rcu list : traversal version will not be a 'snapshot' of a single, valid instant : of the entire vmap_area_list, but rather a potential amalgam of : different list states. Joonsoo: : Yes, you are right, but I don't think that we should be strict here. : Meminfo is already not a 'snapshot' at specific time. While we try to get : certain stats, the other stats can change. And, although we may arrive at : different results than the spinlocked version, the difference would not be : large and would not make serious side-effect. [edumazet@google.com: add more commit description] Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reported-by: Richard Yao <ryao@gentoo.org> Acked-by: Eric Dumazet <edumazet@google.com> Cc: Peter Hurley <peter@hurleysoftware.com> Cc: Zhang Yanfei <zhangyanfei.yes@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/vmalloc.c: export unmap_kernel_range()Minchan Kim2014-06-04
| | | | | | | | | | | | | | | | | | | | zsmalloc needs exported unmap_kernel_range for building as a module. See https://lkml.org/lkml/2013/1/18/487 I didn't send a patch to make unmap_kernel_range exportable at that time because zram was staging stuff and I thought VM function exporting for staging stuff makes no sense. Now zsmalloc was promoted. If we can't build zsmalloc as module, it means we can't build zram as module, either. Additionally, buddy map_vm_area is already exported so let's export unmap_kernel_range to help his buddy. Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Jerome Marchand <jmarchan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/vmalloc.c: replace seq_printf by seq_putsFabian Frederick2014-06-04
| | | | | | | | Replace seq_printf where possible Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: replace __get_cpu_var uses with this_cpu_ptrChristoph Lameter2014-06-04
| | | | | | | | | | | Replace places where __get_cpu_var() is used for an address calculation with this_cpu_ptr(). Signed-off-by: Christoph Lameter <cl@linux.com> Cc: Tejun Heo <tj@kernel.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/vmalloc.c: enhance vm_map_ram() commentGioh Kim2014-04-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | vm_map_ram() has a fragmentation problem when it cannot purge a chunk(ie, 4M address space) if there is a pinning object in that addresss space. So it could consume all VMALLOC address space easily. We can fix the fragmentation problem by using vmap instead of vm_map_ram() but vmap() is known to be slow compared to vm_map_ram(). Minchan said vm_map_ram is 5 times faster than vmap in his tests. So I thought we should fix fragment problem of vm_map_ram because our proprietary GPU driver has used it heavily. On second thought, it's not an easy because we should reuse freed space for solving the problem and it could make more IPI and bitmap operation for searching hole. It could mitigate API's goal which is very fast mapping. And even fragmentation problem wouldn't show in 64 bit machine. Another option is that the user should separate long-life and short-life object and use vmap for long-life but vm_map_ram for short-life. If we inform the user about the characteristic of vm_map_ram the user can choose one according to the page lifetime. Let's add some notice messages to user. [akpm@linux-foundation.org: tweak comment text] Signed-off-by: Gioh Kim <gioh.kim@lge.com> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: use macros from compiler.h instead of __attribute__((...))Gideon Israel Dsouza2014-04-07
| | | | | | | | | | | | To increase compiler portability there is <linux/compiler.h> which provides convenience macros for various gcc constructs. Eg: __weak for __attribute__((weak)). I've replaced all instances of gcc attributes with the right macro in the memory management (/mm) subsystem. [akpm@linux-foundation.org: while-we're-there consistency tweaks] Signed-off-by: Gideon Israel Dsouza <gidisrael@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Revert "mm/vmalloc: interchage the implementation of vmalloc_to_{pfn,page}"malc2014-01-27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Revert commit ece86e222db4, which was intended as a small performance improvement. Despite the claim that the patch doesn't introduce any functional changes in fact it does. The "no page" path behaves different now. Originally, vmalloc_to_page might return NULL under some conditions, with new implementation it returns pfn_to_page(0) which is not the same as NULL. Simple test shows the difference. test.c #include <linux/kernel.h> #include <linux/module.h> #include <linux/vmalloc.h> #include <linux/mm.h> int __init myi(void) { struct page *p; void *v; v = vmalloc(PAGE_SIZE); /* trigger the "no page" path in vmalloc_to_page*/ vfree(v); p = vmalloc_to_page(v); pr_err("expected val = NULL, returned val = %p", p); return -EBUSY; } void __exit mye(void) { } module_init(myi) module_exit(mye) Before interchange: expected val = NULL, returned val = (null) After interchange: expected val = NULL, returned val = c7ebe000 Signed-off-by: Vladimir Murzin <murzin.v@gmail.com> Cc: Jianyu Zhan <nasa4836@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/vmalloc: interchage the implementation of vmalloc_to_{pfn,page}Jianyu Zhan2014-01-21
| | | | | | | | | | | | | | | | | | | | | | Currently we are implementing vmalloc_to_pfn() as a wrapper around vmalloc_to_page(), which is implemented as follow: 1. walks the page talbes to generates the corresponding pfn, 2. then converts the pfn to struct page, 3. returns it. And vmalloc_to_pfn() re-wraps vmalloc_to_page() to get the pfn. This seems too circuitous, so this patch reverses the way: implement vmalloc_to_page() as a wrapper around vmalloc_to_pfn(). This makes vmalloc_to_pfn() and vmalloc_to_page() slightly more efficient. No functional change. Signed-off-by: Jianyu Zhan <nasa4836@gmail.com> Cc: Vladimir Murzin <murzin.v@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: kmemleak: avoid false negatives on vmalloc'ed objectsCatalin Marinas2013-11-13
| | | | | | | | | | | | | | | | | Commit 248ac0e1943a ("mm/vmalloc: remove guard page from between vmap blocks") had the side effect of making vmap_area.va_end member point to the next vmap_area.va_start. This was creating an artificial reference to vmalloc'ed objects and kmemleak was rarely reporting vmalloc() leaks. This patch marks the vmap_area containing pointers explicitly and reduces the min ref_count to 2 as vm_struct still contains a reference to the vmalloc'ed object. The kmemleak add_scan_area() function has been improved to allow a SIZE_MAX argument covering the rest of the object (for simpler calling sites). Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* revert mm/vmalloc.c: emit the failure message before returnWanpeng Li2013-11-13
| | | | | | | | | | | | | | | Don't warn twice in __vmalloc_area_node and __vmalloc_node_range if __vmalloc_area_node allocation failure. This patch reverts commit 46c001a2753f ("mm/vmalloc.c: emit the failure message before return"). Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/vmalloc: revert "mm/vmalloc.c: check VM_UNINITIALIZED flag in s_show ↵Wanpeng Li2013-11-13
| | | | | | | | | | | | | | | | | | | | | | | instead of show_numa_info" The VM_UNINITIALIZED/VM_UNLIST flag introduced by f5252e009d5b ("mm: avoid null pointer access in vm_struct via /proc/vmallocinfo") is used to avoid accessing the pages field with unallocated page when show_numa_info() is called. This patch moves the check just before show_numa_info in order that some messages still can be dumped via /proc/vmallocinfo. This patch reverts commit d157a55815ff ("mm/vmalloc.c: check VM_UNINITIALIZED flag in s_show instead of show_numa_info"); Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/vmalloc: fix show vmap_area information race with vmap_area tear downWanpeng Li2013-11-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is a race window between vmap_area tear down and show vmap_area information. A B remove_vm_area spin_lock(&vmap_area_lock); va->vm = NULL; va->flags &= ~VM_VM_AREA; spin_unlock(&vmap_area_lock); spin_lock(&vmap_area_lock); if (va->flags & (VM_LAZY_FREE | VM_LAZY_FREEZING)) return 0; if (!(va->flags & VM_VM_AREA)) { seq_printf(m, "0x%pK-0x%pK %7ld vm_map_ram\n", (void *)va->va_start, (void *)va->va_end, va->va_end - va->va_start); return 0; } free_unmap_vmap_area(va); flush_cache_vunmap free_unmap_vmap_area_noflush unmap_vmap_area free_vmap_area_noflush va->flags |= VM_LAZY_FREE The assumption !VM_VM_AREA represents vm_map_ram allocation is introduced by d4033afdf828 ("mm, vmalloc: iterate vmap_area_list, instead of vmlist, in vmallocinfo()"). However, !VM_VM_AREA also represents vmap_area is being tear down in race window mentioned above. This patch fix it by don't dump any information for !VM_VM_AREA case and also remove (VM_LAZY_FREE | VM_LAZY_FREEING) check since they are not possible for !VM_VM_AREA case. Suggested-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/vmalloc: don't set area->caller twiceWanpeng Li2013-11-13
| | | | | | | | | | | | | | The caller address has already been set in set_vmalloc_vm(), there's no need to set it again in __vmalloc_area_node. Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/vmalloc: use NUMA_NO_NODEJianguo Wu2013-11-13
| | | | | | | | | Use more appropriate "if (node == NUMA_NO_NODE)" instead of "if (node < 0)" Signed-off-by: Jianguo Wu <wujianguo@huawei.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/vmalloc: use wrapper function get_vm_area_size to caculate size of vm areaWanpeng Li2013-09-11
| | | | | | | | | | | | | | | | | | | Use wrapper function get_vm_area_size to calculate size of vm area. Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm, vmalloc: use well-defined find_last_bit() funcJoonsoo Kim2013-09-11
| | | | | | | | | | | | | | Our intention in here is to find last_bit within the region to flush. There is well-defined function, find_last_bit() for this purpose and its performance may be slightly better than current implementation. So change it. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm, vmalloc: remove useless variable in vmap_blockJoonsoo Kim2013-09-11
| | | | | | | | | | | vbq in vmap_block isn't used. So remove it. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/vmalloc.c: fix an overflow bug in alloc_vmap_area()Zhang Yanfei2013-07-09
| | | | | | | | | | | | | | | | | | | | | | | | | | When searching a vmap area in the vmalloc space, we use (addr + size - 1) to check if the value is less than addr, which is an overflow. But we assign (addr + size) to vmap_area->va_end. So if we come across the below case: (addr + size - 1) : not overflow (addr + size) : overflow we will assign an overflow value (e.g 0) to vmap_area->va_end, And this will trigger BUG in __insert_vmap_area, causing system panic. So using (addr + size) to check the overflow should be the correct behaviour, not (addr + size - 1). Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Reported-by: Ghennadi Procopciuc <unix140@gmail.com> Tested-by: Daniel Baluta <dbaluta@ixiacom.com> Cc: David Rientjes <rientjes@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>