summaryrefslogtreecommitdiff
path: root/mm/memory-failure.c (follow)
Commit message (Collapse)AuthorAge
* page-flags: define PG_locked behavior on compound pagesKirill A. Shutemov2018-10-01
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | lock_page() must operate on the whole compound page. It doesn't make much sense to lock part of compound page. Change code to use head page's PG_locked, if tail page is passed. This patch also gets rid of custom helper functions -- __set_page_locked() and __clear_page_locked(). They are replaced with helpers generated by __SETPAGEFLAG/__CLEARPAGEFLAG. Tail pages to these helper would trigger VM_BUG_ON(). SLUB uses PG_locked as a bit spin locked. IIUC, tail pages should never appear there. VM_BUG_ON() is added to make sure that this assumption is correct. [akpm@linux-foundation.org: fix fs/cifs/file.c] Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Steve Capper <steve.capper@linaro.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Jérôme Glisse <jglisse@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Change-Id: Ifeeb98c789880ff34b286383568db60e08672205 Git-Commit: 48c935ad88f5be20eb5445a77c171351b1eb5111 Git-Repo: git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
* Merge android-4.4.114 (fe09418) into msm-4.4Srinivasarao P2018-02-01
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * refs/heads/tmp-fe09418 Linux 4.4.114 nfsd: auth: Fix gid sorting when rootsquash enabled net: tcp: close sock if net namespace is exiting flow_dissector: properly cap thoff field ipv4: Make neigh lookup keys for loopback/point-to-point devices be INADDR_ANY net: Allow neigh contructor functions ability to modify the primary_key vmxnet3: repair memory leak sctp: return error if the asoc has been peeled off in sctp_wait_for_sndbuf sctp: do not allow the v4 socket to bind a v4mapped v6 address r8169: fix memory corruption on retrieval of hardware statistics. pppoe: take ->needed_headroom of lower device into account on xmit net: qdisc_pkt_len_init() should be more robust tcp: __tcp_hdrlen() helper net: igmp: fix source address check for IGMPv3 reports lan78xx: Fix failure in USB Full Speed ipv6: ip6_make_skb() needs to clear cork.base.dst ipv6: fix udpv6 sendmsg crash caused by too small MTU ipv6: Fix getsockopt() for sockets with default IPV6_AUTOFLOWLABEL dccp: don't restart ccid2_hc_tx_rto_expire() if sk in closed state hrtimer: Reset hrtimer cpu base proper on CPU hotplug x86/microcode/intel: Extend BDW late-loading further with LLC size check eventpoll.h: add missing epoll event masks vsyscall: Fix permissions for emulate mode with KAISER/PTI um: link vmlinux with -no-pie usbip: prevent leaking socket pointer address in messages usbip: fix stub_rx: harden CMD_SUBMIT path to handle malicious input usbip: fix stub_rx: get_pipe() to validate endpoint number usb: usbip: Fix possible deadlocks reported by lockdep Input: trackpoint - force 3 buttons if 0 button is reported Revert "module: Add retpoline tag to VERMAGIC" scsi: libiscsi: fix shifting of DID_REQUEUE host byte fs/fcntl: f_setown, avoid undefined behaviour reiserfs: Don't clear SGID when inheriting ACLs reiserfs: don't preallocate blocks for extended attributes reiserfs: fix race in prealloc discard ext2: Don't clear SGID when inheriting ACLs netfilter: xt_osf: Add missing permission checks netfilter: nfnetlink_cthelper: Add missing permission checks netfilter: fix IS_ERR_VALUE usage netfilter: use fwmark_reflect in nf_send_reset netfilter: nf_conntrack_sip: extend request line validation netfilter: restart search if moved to other chain netfilter: nfnetlink_queue: reject verdict request from different portid netfilter: nf_ct_expect: remove the redundant slash when policy name is empty netfilter: nf_dup_ipv6: set again FLOWI_FLAG_KNOWN_NH at flowi6_flags netfilter: arp_tables: fix invoking 32bit "iptable -P INPUT ACCEPT" failed in 64bit kernel netfilter: x_tables: speed up jump target validation ACPICA: Namespace: fix operand cache leak ACPI / scan: Prefer devices without _HID/_CID for _ADR matching ACPI / processor: Avoid reserving IO regions too early x86/ioapic: Fix incorrect pointers in ioapic_setup_resources() ipc: msg, make msgrcv work with LONG_MIN mm, page_alloc: fix potential false positive in __zone_watermark_ok cma: fix calculation of aligned offset hwpoison, memcg: forcibly uncharge LRU pages mm/mmap.c: do not blow on PROT_NONE MAP_FIXED holes in the stack fs/select: add vmalloc fallback for select(2) mmc: sdhci-of-esdhc: add/remove some quirks according to vendor version PCI: layerscape: Fix MSG TLP drop setting PCI: layerscape: Add "fsl,ls2085a-pcie" compatible ID drivers: base: cacheinfo: fix boot error message when acpi is enabled drivers: base: cacheinfo: fix x86 with CONFIG_OF enabled Prevent timer value 0 for MWAITX timers: Plug locking race vs. timer migration time: Avoid undefined behaviour in ktime_add_safe() PM / sleep: declare __tracedata symbols as char[] rather than char can: af_can: canfd_rcv(): replace WARN_ONCE by pr_warn_once can: af_can: can_rcv(): replace WARN_ONCE by pr_warn_once sched/deadline: Use the revised wakeup rule for suspending constrained dl tasks x86/retpoline: Fill RSB on context switch for affected CPUs x86/cpu/intel: Introduce macros for Intel family numbers x86/microcode/intel: Fix BDW late-loading revision check usbip: Fix potential format overflow in userspace tools usbip: Fix implicit fallthrough warning usbip: prevent vhci_hcd driver from leaking a socket pointer address x86/asm/32: Make sync_core() handle missing CPUID on all 32-bit kernels ANDROID: sched: EAS: check energy_aware() before calling select_energy_cpu_brute() in up-migrate path UPSTREAM: eventpoll.h: add missing epoll event masks ANDROID: xattr: Pass EOPNOTSUPP to permission2 Conflicts: kernel/sched/fair.c Change-Id: I15005cb3bc039f4361d25ed2e22f8175b3d7ca96 Signed-off-by: Srinivasarao P <spathi@codeaurora.org>
| * hwpoison, memcg: forcibly uncharge LRU pagesMichal Hocko2018-01-31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 18365225f0440d09708ad9daade2ec11275c3df9 upstream. Laurent Dufour has noticed that hwpoinsoned pages are kept charged. In his particular case he has hit a bad_page("page still charged to cgroup") when onlining a hwpoison page. While this looks like something that shouldn't happen in the first place because onlining hwpages and returning them to the page allocator makes only little sense it shows a real problem. hwpoison pages do not get freed usually so we do not uncharge them (at least not since commit 0a31bc97c80c ("mm: memcontrol: rewrite uncharge API")). Each charge pins memcg (since e8ea14cc6ead ("mm: memcontrol: take a css reference for each charged page")) as well and so the mem_cgroup and the associated state will never go away. Fix this leak by forcibly uncharging a LRU hwpoisoned page in delete_from_lru_cache(). We also have to tweak uncharge_list because it cannot rely on zero ref count for these pages. [akpm@linux-foundation.org: coding-style fixes] Fixes: 0a31bc97c80c ("mm: memcontrol: rewrite uncharge API") Link: http://lkml.kernel.org/r/20170502185507.GB19165@dhcp22.suse.cz Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Laurent Dufour <ldufour@linux.vnet.ibm.com> Tested-by: Laurent Dufour <ldufour@linux.vnet.ibm.com> Reviewed-by: Balbir Singh <bsingharora@gmail.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* | Merge branch 'android-4.4@77ddb50' (v4.4.74) into 'msm-4.4'Blagovest Kolenichev2017-06-28
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * refs/heads/tmp-77ddb50: UPSTREAM: usb: gadget: f_fs: avoid out of bounds access on comp_desc Linux 4.4.74 mm: fix new crash in unmapped_area_topdown() Allow stack to grow up to address space limit mm: larger stack guard gap, between vmas alarmtimer: Rate limit periodic intervals MIPS: Fix bnezc/jialc return address calculation usb: dwc3: exynos fix axius clock error path to do cleanup alarmtimer: Prevent overflow of relative timers genirq: Release resources in __setup_irq() error path swap: cond_resched in swap_cgroup_prepare() mm/memory-failure.c: use compound_head() flags for huge pages USB: gadgetfs, dummy-hcd, net2280: fix locking for callbacks usb: xhci: ASMedia ASM1042A chipset need shorts TX quirk drivers/misc/c2port/c2port-duramar2150.c: checking for NULL instead of IS_ERR() usb: r8a66597-hcd: decrease timeout usb: r8a66597-hcd: select a different endpoint on timeout USB: gadget: dummy_hcd: fix hub-descriptor removable fields pvrusb2: reduce stack usage pvr2_eeprom_analyze() usb: core: fix potential memory leak in error path during hcd creation USB: hub: fix SS max number of ports iio: proximity: as3935: recalibrate RCO after resume staging: rtl8188eu: prevent an underflow in rtw_check_beacon_data() mfd: omap-usb-tll: Fix inverted bit use for USB TLL mode x86/mm/32: Set the '__vmalloc_start_set' flag in initmem_init() serial: efm32: Fix parity management in 'efm32_uart_console_get_options()' mac80211: fix IBSS presp allocation size mac80211: fix CSA in IBSS mode mac80211/wpa: use constant time memory comparison for MACs mac80211: don't look at the PM bit of BAR frames vb2: Fix an off by one error in 'vb2_plane_vaddr' cpufreq: conservative: Allow down_threshold to take values from 1 to 10 can: gs_usb: fix memory leak in gs_cmd_reset() configfs: Fix race between create_link and configfs_rmdir UPSTREAM: bpf: don't let ldimm64 leak map addresses on unprivileged BACKPORT: ext4: fix data exposure after a crash ANDROID: sdcardfs: remove dead function open_flags_to_access_mode() ANDROID: android-base.cfg: split out arm64-specific configs Linux 4.4.73 sparc64: make string buffers large enough s390/kvm: do not rely on the ILC on kvm host protection fauls xtensa: don't use linux IRQ #0 tipc: ignore requests when the connection state is not CONNECTED proc: add a schedule point in proc_pid_readdir() romfs: use different way to generate fsid for BLOCK or MTD sctp: sctp_addr_id2transport should verify the addr before looking up assoc r8152: avoid start_xmit to schedule napi when napi is disabled r8152: fix rtl8152_post_reset function r8152: re-schedule napi for tx nfs: Fix "Don't increment lock sequence ID after NFS4ERR_MOVED" ravb: unmap descriptors when freeing rings drm/ast: Fixed system hanged if disable P2A drm/nouveau: Don't enabling polling twice on runtime resume parisc, parport_gsc: Fixes for printk continuation lines net: adaptec: starfire: add checks for dma mapping errors pinctrl: berlin-bg4ct: fix the value for "sd1a" of pin SCRD0_CRD_PRES gianfar: synchronize DMA API usage by free_skb_rx_queue w/ gfar_new_page net/mlx4_core: Avoid command timeouts during VF driver device shutdown drm/nouveau/fence/g84-: protect against concurrent access to semaphore buffers drm/nouveau: prevent userspace from deleting client object ipv6: fix flow labels when the traffic class is non-0 FS-Cache: Initialise stores_lock in netfs cookie fscache: Clear outstanding writes when disabling a cookie fscache: Fix dead object requeue ethtool: do not vzalloc(0) on registers dump log2: make order_base_2() behave correctly on const input value zero kasan: respect /proc/sys/kernel/traceoff_on_warning jump label: pass kbuild_cflags when checking for asm goto support PM / runtime: Avoid false-positive warnings from might_sleep_if() ipv6: Fix IPv6 packet loss in scenarios involving roaming + snooping switches i2c: piix4: Fix request_region size sierra_net: Add support for IPv6 and Dual-Stack Link Sense Indications sierra_net: Skip validating irrelevant fields for IDLE LSIs net: hns: Fix the device being used for dma mapping during TX NET: mkiss: Fix panic NET: Fix /proc/net/arp for AX.25 ipv6: Inhibit IPv4-mapped src address on the wire. ipv6: Handle IPv4-mapped src to in6addr_any dst. net: xilinx_emaclite: fix receive buffer overflow net: xilinx_emaclite: fix freezes due to unordered I/O Call echo service immediately after socket reconnect staging: rtl8192e: rtl92e_fill_tx_desc fix write to mapped out memory. ARM: dts: imx6dl: Fix the VDD_ARM_CAP voltage for 396MHz operation partitions/msdos: FreeBSD UFS2 file systems are not recognized s390/vmem: fix identity mapping usb: gadget: f_fs: Fix possibe deadlock Conflicts: drivers/usb/gadget/function/f_fs.c Change-Id: I23106e9fc2c4f2d0b06acce59b781f6c36487fcc Signed-off-by: Blagovest Kolenichev <bkolenichev@codeaurora.org>
| * mm/memory-failure.c: use compound_head() flags for huge pagesJames Morse2017-06-26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 7258ae5c5a2ce2f5969e8b18b881be40ab55433d upstream. memory_failure() chooses a recovery action function based on the page flags. For huge pages it uses the tail page flags which don't have anything interesting set, resulting in: > Memory failure: 0x9be3b4: Unknown page state > Memory failure: 0x9be3b4: recovery action for unknown page: Failed Instead, save a copy of the head page's flags if this is a huge page, this means if there are no relevant flags for this tail page, we use the head pages flags instead. This results in the me_huge_page() recovery action being called: > Memory failure: 0x9b7969: recovery action for huge page: Delayed For hugepages that have not yet been allocated, this allows the hugepage to be dequeued. Fixes: 524fca1e7356 ("HWPOISON: fix misjudgement of page_action() for errors on mlocked pages") Link: http://lkml.kernel.org/r/20170524130204.21845-1-james.morse@arm.com Signed-off-by: James Morse <james.morse@arm.com> Tested-by: Punit Agrawal <punit.agrawal@arm.com> Acked-by: Punit Agrawal <punit.agrawal@arm.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* | Merge branch 'android-4.4@6fc0573' into branch 'msm-4.4'Blagovest Kolenichev2017-06-19
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * refs/heads/tmp-6fc0573: Linux 4.4.71 xfs: only return -errno or success from attr ->put_listent xfs: in _attrlist_by_handle, copy the cursor back to userspace xfs: fix unaligned access in xfs_btree_visit_blocks xfs: bad assertion for delalloc an extent that start at i_size xfs: fix indlen accounting error on partial delalloc conversion xfs: wait on new inodes during quotaoff dquot release xfs: update ag iterator to support wait on new inodes xfs: support ability to wait on new inodes xfs: fix up quotacheck buffer list error handling xfs: prevent multi-fsb dir readahead from reading random blocks xfs: handle array index overrun in xfs_dir2_leaf_readbuf() xfs: fix over-copying of getbmap parameters from userspace xfs: fix off-by-one on max nr_pages in xfs_find_get_desired_pgoff() xfs: Fix missed holes in SEEK_HOLE implementation mlock: fix mlock count can not decrease in race condition mm/migrate: fix refcount handling when !hugepage_migration_supported() drm/gma500/psb: Actually use VBT mode when it is found slub/memcg: cure the brainless abuse of sysfs attributes ALSA: hda - apply STAC_9200_DELL_M22 quirk for Dell Latitude D430 pcmcia: remove left-over %Z format drm/radeon: Unbreak HPD handling for r600+ drm/radeon/ci: disable mclk switching for high refresh rates (v2) scsi: mpt3sas: Force request partial completion alignment HID: wacom: Have wacom_tpc_irq guard against possible NULL dereference mmc: sdhci-iproc: suppress spurious interrupt with Multiblock read i2c: i2c-tiny-usb: fix buffer not being DMA capable vlan: Fix tcp checksum offloads in Q-in-Q vlans net: phy: marvell: Limit errata to 88m1101 netem: fix skb_orphan_partial() ipv4: add reference counting to metrics sctp: fix ICMP processing if skb is non-linear tcp: avoid fastopen API to be used on AF_UNSPEC virtio-net: enable TSO/checksum offloads for Q-in-Q vlans be2net: Fix offload features for Q-in-Q packets ipv6: fix out of bound writes in __ip6_append_data() bridge: start hello_timer when enabling KERNEL_STP in br_stp_start qmi_wwan: add another Lenovo EM74xx device ID bridge: netlink: check vlan_default_pvid range ipv6: Check ip6_find_1stfragopt() return value properly. ipv6: Prevent overrun when parsing v6 header options net: Improve handling of failures on link and route dumps tcp: eliminate negative reordering in tcp_clean_rtx_queue sctp: do not inherit ipv6_{mc|ac|fl}_list from parent sctp: fix src address selection if using secondary addresses for ipv6 tcp: avoid fragmenting peculiar skbs in SACK s390/qeth: avoid null pointer dereference on OSN s390/qeth: unbreak OSM and OSN support s390/qeth: handle sysfs error during initialization ipv6/dccp: do not inherit ipv6_mc_list from parent dccp/tcp: do not inherit mc_list from parent sparc: Fix -Wstringop-overflow warning android: base-cfg: disable CONFIG_NFS_FS and CONFIG_NFSD schedstats/eas: guard properly to avoid breaking non-smp schedstats users BACKPORT: f2fs: sanity check size of nat and sit cache FROMLIST: f2fs: sanity check checkpoint segno and blkoff sched/tune: don't use schedtune before it is ready sched/fair: use SCHED_CAPACITY_SCALE for energy normalization sched/{fair,tune}: use reciprocal_value to compute boost margin sched/tune: Initialize raw_spin_lock in boosted_groups sched/tune: report when SchedTune has not been initialized sched/tune: fix sched_energy_diff tracepoint sched/tune: increase group count to 5 cpufreq/schedutil: use boosted_cpu_util for PELT to match WALT sched/fair: Fix sched_group_energy() to support per-cpu capacity states sched/fair: discount task contribution to find CPU with lowest utilization sched/fair: ensure utilization signals are synchronized before use sched/fair: remove task util from own cpu when placing waking task trace:sched: Make util_avg in load_avg trace reflect PELT/WALT as used sched/fair: Add eas (& cas) specific rq, sd and task stats sched/core: Fix PELT jump to max OPP upon util increase sched: EAS & 'single cpu per cluster'/cpu hotplug interoperability UPSTREAM: sched/core: Fix group_entity's share update UPSTREAM: sched/fair: Fix calc_cfs_shares() fixed point arithmetics width confusion UPSTREAM: sched/fair: Fix incorrect task group ->load_avg UPSTREAM: sched/fair: Fix effective_load() to consistently use smoothed load UPSTREAM: sched/fair: Propagate asynchrous detach UPSTREAM: sched/fair: Propagate load during synchronous attach/detach UPSTREAM: sched/fair: Fix hierarchical order in rq->leaf_cfs_rq_list BACKPORT: sched/fair: Factorize PELT update UPSTREAM: sched/fair: Factorize attach/detach entity UPSTREAM: sched/fair: Improve PELT stuff some more UPSTREAM: sched/fair: Apply more PELT fixes UPSTREAM: sched/fair: Fix post_init_entity_util_avg() serialization BACKPORT: sched/fair: Initiate a new task's util avg to a bounded value sched/fair: Simplify idle_idx handling in select_idle_sibling() sched/fair: refactor find_best_target() for simplicity sched/fair: Change cpu iteration order in find_best_target() sched/core: Add first cpu w/ max/min orig capacity to root domain sched/core: Remove remnants of commit fd5c98da1a42 sched: Remove sysctl_sched_is_big_little sched/fair: Code !is_big_little path into select_energy_cpu_brute() EAS: sched/fair: Re-integrate 'honor sync wakeups' into wakeup path Fixup!: sched/fair.c: Set SchedTune specific struct energy_env.task sched/fair: Energy-aware wake-up task placement sched/fair: Add energy_diff dead-zone margin sched/fair: Decommission energy_aware_wake_cpu() sched/fair: Do not force want_affine eq. true if EAS is enabled arm64: Set SD_ASYM_CPUCAPACITY sched_domain flag on DIE level UPSTREAM: sched/fair: Fix incorrect comment for capacity_margin UPSTREAM: sched/fair: Avoid pulling tasks from non-overloaded higher capacity groups UPSTREAM: sched/fair: Add per-CPU min capacity to sched_group_capacity UPSTREAM: sched/fair: Consider spare capacity in find_idlest_group() UPSTREAM: sched/fair: Compute task/cpu utilization at wake-up correctly UPSTREAM: sched/fair: Let asymmetric CPU configurations balance at wake-up UPSTREAM: sched/core: Enable SD_BALANCE_WAKE for asymmetric capacity systems UPSTREAM: sched/core: Pass child domain into sd_init() UPSTREAM: sched/core: Introduce SD_ASYM_CPUCAPACITY sched_domain topology flag UPSTREAM: sched/core: Remove unnecessary NULL-pointer check UPSTREAM: sched/fair: Optimize find_idlest_cpu() when there is no choice BACKPORT: sched/fair: Make the use of prev_cpu consistent in the wakeup path UPSTREAM: sched/core: Fix power to capacity renaming in comment Partial Revert: "WIP: sched: Add cpu capacity awareness to wakeup balancing" Revert "WIP: sched: Consider spare cpu capacity at task wake-up" FROM-LIST: cpufreq: schedutil: Redefine the rate_limit_us tunable cpufreq: schedutil: add up/down frequency transition rate limits trace/sched: add rq utilization signal for WALT sched/cpufreq: make schedutil use WALT signal sched: cpufreq: use rt_avg as estimate of required RT CPU capacity cpufreq: schedutil: move slow path from workqueue to SCHED_FIFO task BACKPORT: kthread: allow to cancel kthread work sched/cpufreq: fix tunables for schedfreq governor BACKPORT: cpufreq: schedutil: New governor based on scheduler utilization data sched: backport cpufreq hooks from 4.9-rc4 ANDROID: Kconfig: add depends for UID_SYS_STATS ANDROID: hid: uhid: implement refcount for open and close Revert "ext4: require encryption feature for EXT4_IOC_SET_ENCRYPTION_POLICY" ANDROID: mnt: Fix next_descendent Conflicts: include/trace/events/sched.h kernel/sched/Makefile kernel/sched/core.c kernel/sched/fair.c kernel/sched/sched.h Change-Id: I55318828f2c858e192ac7015bcf2bf0ec5c5b2c5 Signed-off-by: Blagovest Kolenichev <bkolenichev@codeaurora.org>
| * mm/migrate: fix refcount handling when !hugepage_migration_supported()Punit Agrawal2017-06-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 30809f559a0d348c2dfd7ab05e9a451e2384962e upstream. On failing to migrate a page, soft_offline_huge_page() performs the necessary update to the hugepage ref-count. But when !hugepage_migration_supported() , unmap_and_move_hugepage() also decrements the page ref-count for the hugepage. The combined behaviour leaves the ref-count in an inconsistent state. This leads to soft lockups when running the overcommitted hugepage test from mce-tests suite. Soft offlining pfn 0x83ed600 at process virtual address 0x400000000000 soft offline: 0x83ed600: migration failed 1, type 1fffc00000008008 (uptodate|head) INFO: rcu_preempt detected stalls on CPUs/tasks: Tasks blocked on level-0 rcu_node (CPUs 0-7): P2715 (detected by 7, t=5254 jiffies, g=963, c=962, q=321) thugetlb_overco R running task 0 2715 2685 0x00000008 Call trace: dump_backtrace+0x0/0x268 show_stack+0x24/0x30 sched_show_task+0x134/0x180 rcu_print_detail_task_stall_rnp+0x54/0x7c rcu_check_callbacks+0xa74/0xb08 update_process_times+0x34/0x60 tick_sched_handle.isra.7+0x38/0x70 tick_sched_timer+0x4c/0x98 __hrtimer_run_queues+0xc0/0x300 hrtimer_interrupt+0xac/0x228 arch_timer_handler_phys+0x3c/0x50 handle_percpu_devid_irq+0x8c/0x290 generic_handle_irq+0x34/0x50 __handle_domain_irq+0x68/0xc0 gic_handle_irq+0x5c/0xb0 Address this by changing the putback_active_hugepage() in soft_offline_huge_page() to putback_movable_pages(). This only triggers on systems that enable memory failure handling (ARCH_SUPPORTS_MEMORY_FAILURE) but not hugepage migration (!ARCH_ENABLE_HUGEPAGE_MIGRATION). I imagine this wasn't triggered as there aren't many systems running this configuration. [akpm@linux-foundation.org: remove dead comment, per Naoya] Link: http://lkml.kernel.org/r/20170525135146.32011-1-punit.agrawal@arm.com Reported-by: Manoj Iyer <manoj.iyer@canonical.com> Tested-by: Manoj Iyer <manoj.iyer@canonical.com> Suggested-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Punit Agrawal <punit.agrawal@arm.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* | mm: Enhance per process reclaim to consider shared pagesMinchan Kim2016-06-22
|/ | | | | | | | | | | | | | | | | | | | | Some pages could be shared by several processes. (ex, libc) In case of that, it's too bad to reclaim them from the beginnig. This patch causes VM to keep them on memory until last task try to reclaim them so shared pages will be reclaimed only if all of task has gone swapping out. This feature doesn't handle non-linear mapping on ramfs because it's very time-consuming and doesn't make sure of reclaiming and not common. Change-Id: I7e5f34f2e947f5db6d405867fe2ad34863ca40f7 Signed-off-by: Sangseok Lee <sangseok.lee@lge.com> Signed-off-by: Minchan Kim <minchan@kernel.org> Patch-mainline: linux-mm @ 9 May 2013 16:21:27 [vinmenon@codeaurora.org: trivial merge conflict fixes + changes to make the patch work with 3.18 kernel] Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
* mm: soft-offline: check return value in second __get_any_page() callNaoya Horiguchi2016-02-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit d96b339f453997f2f08c52da3f41423be48c978f upstream. I saw the following BUG_ON triggered in a testcase where a process calls madvise(MADV_SOFT_OFFLINE) on thps, along with a background process that calls migratepages command repeatedly (doing ping-pong among different NUMA nodes) for the first process: Soft offlining page 0x60000 at 0x700000600000 __get_any_page: 0x60000 free buddy page page:ffffea0001800000 count:0 mapcount:-127 mapping: (null) index:0x1 flags: 0x1fffc0000000000() page dumped because: VM_BUG_ON_PAGE(atomic_read(&page->_count) == 0) ------------[ cut here ]------------ kernel BUG at /src/linux-dev/include/linux/mm.h:342! invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC Modules linked in: cfg80211 rfkill crc32c_intel serio_raw virtio_balloon i2c_piix4 virtio_blk virtio_net ata_generic pata_acpi CPU: 3 PID: 3035 Comm: test_alloc_gene Tainted: G O 4.4.0-rc8-v4.4-rc8-160107-1501-00000-rc8+ #74 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 task: ffff88007c63d5c0 ti: ffff88007c210000 task.ti: ffff88007c210000 RIP: 0010:[<ffffffff8118998c>] [<ffffffff8118998c>] put_page+0x5c/0x60 RSP: 0018:ffff88007c213e00 EFLAGS: 00010246 Call Trace: put_hwpoison_page+0x4e/0x80 soft_offline_page+0x501/0x520 SyS_madvise+0x6bc/0x6f0 entry_SYSCALL_64_fastpath+0x12/0x6a Code: 8b fc ff ff 5b 5d c3 48 89 df e8 b0 fa ff ff 48 89 df 31 f6 e8 c6 7d ff ff 5b 5d c3 48 c7 c6 08 54 a2 81 48 89 df e8 a4 c5 01 00 <0f> 0b 66 90 66 66 66 66 90 55 48 89 e5 41 55 41 54 53 48 8b 47 RIP [<ffffffff8118998c>] put_page+0x5c/0x60 RSP <ffff88007c213e00> The root cause resides in get_any_page() which retries to get a refcount of the page to be soft-offlined. This function calls put_hwpoison_page(), expecting that the target page is putback to LRU list. But it can be also freed to buddy. So the second check need to care about such case. Fixes: af8fae7c0886 ("mm/memory-failure.c: clean up soft_offline_page()") Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Steve Capper <steve.capper@linaro.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* mm: make compound_head() robustKirill A. Shutemov2015-11-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Hugh has pointed that compound_head() call can be unsafe in some context. There's one example: CPU0 CPU1 isolate_migratepages_block() page_count() compound_head() !!PageTail() == true put_page() tail->first_page = NULL head = tail->first_page alloc_pages(__GFP_COMP) prep_compound_page() tail->first_page = head __SetPageTail(p); !!PageTail() == true <head == NULL dereferencing> The race is pure theoretical. I don't it's possible to trigger it in practice. But who knows. We can fix the race by changing how encode PageTail() and compound_head() within struct page to be able to update them in one shot. The patch introduces page->compound_head into third double word block in front of compound_dtor and compound_order. Bit 0 encodes PageTail() and the rest bits are pointer to head page if bit zero is set. The patch moves page->pmd_huge_pte out of word, just in case if an architecture defines pgtable_t into something what can have the bit 0 set. hugetlb_cgroup uses page->lru.next in the second tail page to store pointer struct hugetlb_cgroup. The patch switch it to use page->private in the second tail page instead. The space is free since ->first_page is removed from the union. The patch also opens possibility to remove HUGETLB_CGROUP_MIN_ORDER limitation, since there's now space in first tail page to store struct hugetlb_cgroup pointer. But that's out of scope of the patch. That means page->compound_head shares storage space with: - page->lru.next; - page->next; - page->rcu_head.next; That's too long list to be absolutely sure, but looks like nobody uses bit 0 of the word. page->rcu_head.next guaranteed[1] to have bit 0 clean as long as we use call_rcu(), call_rcu_bh(), call_rcu_sched(), or call_srcu(). But future call_rcu_lazy() is not allowed as it makes use of the bit and we can get false positive PageTail(). [1] http://lkml.kernel.org/g/20150827163634.GD4029@linux.vnet.ibm.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Christoph Lameter <cl@linux.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: hwpoison: ratelimit messages from unpoison_memory()Naoya Horiguchi2015-11-05
| | | | | | | | | | | | | | | | | Currently kernel prints out results of every single unpoison event, which i= s not necessary because unpoison is purely a testing feature and testers can = get little or no information from lots of lines of unpoison log storm. So this patch ratelimits printk in unpoison_memory(). This patch introduces a file local ratelimit_state, which adds 64 bytes to memory-failure.o. If we apply pr_info_ratelimited() for 8 callsite below, 2= 56 bytes is added, so it's a win. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* hwpoison: use page_cgroup_ino for filtering by memcgVladimir Davydov2015-09-10
| | | | | | | | | | | | | | | | | | | | | | | | | | | Hwpoison allows to filter pages by memory cgroup ino. Currently, it calls try_get_mem_cgroup_from_page to obtain the cgroup from a page and then its ino using cgroup_ino, but now we have a helper method for that, page_cgroup_ino, so use it instead. This patch also loosens the hwpoison memcg filter dependency rules - it makes it depend on CONFIG_MEMCG instead of CONFIG_MEMCG_SWAP, because hwpoison memcg filter does not require anything (nor it used to) from CONFIG_MEMCG_SWAP side. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Greg Thelen <gthelen@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: rename alloc_pages_exact_node() to __alloc_pages_node()Vlastimil Babka2015-09-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | alloc_pages_exact_node() was introduced in commit 6484eb3e2a81 ("page allocator: do not check NUMA node ID when the caller knows the node is valid") as an optimized variant of alloc_pages_node(), that doesn't fallback to current node for nid == NUMA_NO_NODE. Unfortunately the name of the function can easily suggest that the allocation is restricted to the given node and fails otherwise. In truth, the node is only preferred, unless __GFP_THISNODE is passed among the gfp flags. The misleading name has lead to mistakes in the past, see for example commits 5265047ac301 ("mm, thp: really limit transparent hugepage allocation to local node") and b360edb43f8e ("mm, mempolicy: migrate_to_node should only migrate to node"). Another issue with the name is that there's a family of alloc_pages_exact*() functions where 'exact' means exact size (instead of page order), which leads to more confusion. To prevent further mistakes, this patch effectively renames alloc_pages_exact_node() to __alloc_pages_node() to better convey that it's an optimized variant of alloc_pages_node() not intended for general usage. Both functions get described in comments. It has been also considered to really provide a convenience function for allocations restricted to a node, but the major opinion seems to be that __GFP_THISNODE already provides that functionality and we shouldn't duplicate the API needlessly. The number of users would be small anyway. Existing callers of alloc_pages_exact_node() are simply converted to call __alloc_pages_node(), with the exception of sba_alloc_coherent() which open-codes the check for NUMA_NO_NODE, so it is converted to use alloc_pages_node() instead. This means it no longer performs some VM_BUG_ON checks, and since the current check for nid in alloc_pages_node() uses a 'nid < 0' comparison (which includes NUMA_NO_NODE), it may hide wrong values which would be previously exposed. Both differences will be rectified by the next patch. To sum up, this patch makes no functional changes, except temporarily hiding potentially buggy callers. Restricting the checks in alloc_pages_node() is left for the next patch which can in turn expose more existing buggy callers. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Robin Holt <robinmholt@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> Cc: Mel Gorman <mgorman@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Gleb Natapov <gleb@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Cliff Whickman <cpw@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/hwpoison: don't try to unpoison containment-failed pagesNaoya Horiguchi2015-09-08
| | | | | | | | | | | | | | | | | | | | | memory_failure() can be called at any page at any time, which means that we can't eliminate the possibility of containment failure. In such case the best option is to leak the page intentionally (and never touch it later.) We have an unpoison function for testing, and it cannot handle such containment-failed pages, which results in kernel panic (visible with various calltraces.) So this patch suggests that we limit the unpoisonable pages to properly contained pages and ignore any other ones. Testers are recommended to keep in mind that there're un-unpoisonable pages when writing test programs. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Tested-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/hwpoison: fix race between soft_offline_page and unpoison_memoryWanpeng Li2015-09-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Wanpeng Li reported a race between soft_offline_page() and unpoison_memory(), which causes the following kernel panic: BUG: Bad page state in process bash pfn:97000 page:ffffea00025c0000 count:0 mapcount:1 mapping: (null) index:0x7f4fdbe00 flags: 0x1fffff80080048(uptodate|active|swapbacked) page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set bad because of flags: flags: 0x40(active) Modules linked in: snd_hda_codec_hdmi i915 rpcsec_gss_krb5 nfsv4 dns_resolver bnep rfcomm nfsd bluetooth auth_rpcgss nfs_acl nfs rfkill lockd grace sunrpc i2c_algo_bit drm_kms_helper snd_hda_codec_realtek snd_hda_codec_generic drm snd_hda_intel fscache snd_hda_codec x86_pkg_temp_thermal coretemp kvm_intel snd_hda_core snd_hwdep kvm snd_pcm snd_seq_dummy snd_seq_oss crct10dif_pclmul snd_seq_midi crc32_pclmul snd_seq_midi_event ghash_clmulni_intel snd_rawmidi aesni_intel lrw gf128mul snd_seq glue_helper ablk_helper snd_seq_device cryptd fuse snd_timer dcdbas serio_raw mei_me parport_pc snd mei ppdev i2c_core video lp soundcore parport lpc_ich shpchp mfd_core ext4 mbcache jbd2 sd_mod e1000e ahci ptp libahci crc32c_intel libata pps_core CPU: 3 PID: 2211 Comm: bash Not tainted 4.2.0-rc5-mm1+ #45 Hardware name: Dell Inc. OptiPlex 7020/0F5C5X, BIOS A03 01/08/2015 Call Trace: dump_stack+0x48/0x5c bad_page+0xe6/0x140 free_pages_prepare+0x2f9/0x320 ? uncharge_list+0xdd/0x100 free_hot_cold_page+0x40/0x170 __put_single_page+0x20/0x30 put_page+0x25/0x40 unmap_and_move+0x1a6/0x1f0 migrate_pages+0x100/0x1d0 ? kill_procs+0x100/0x100 ? unlock_page+0x6f/0x90 __soft_offline_page+0x127/0x2a0 soft_offline_page+0xa6/0x200 This race is explained like below: CPU0 CPU1 soft_offline_page __soft_offline_page TestSetPageHWPoison unpoison_memory PageHWPoison check (true) TestClearPageHWPoison put_page -> release refcount held by get_hwpoison_page in unpoison_memory put_page -> release refcount held by isolate_lru_page in __soft_offline_page migrate_pages The second put_page() releases refcount held by isolate_lru_page() which will lead to unmap_and_move() releases the last refcount of page and w/ mapcount still 1 since try_to_unmap() is not called if there is only one user map the page. Anyway, the page refcount and mapcount will still mess if the page is mapped by multiple users. This race was introduced by commit 4491f71260 ("mm/memory-failure: set PageHWPoison before migrate_pages()"), which focuses on preventing the reuse of successfully migrated page. Before this commit we prevent the reuse by changing the migratetype to MIGRATE_ISOLATE during soft offlining, which has the following problems, so simply reverting the commit is not a best option: 1) it doesn't eliminate the reuse completely, because set_migratetype_isolate() can fail to set MIGRATE_ISOLATE to the target page if the pageblock of the page contains one or more unmovable pages (i.e. has_unmovable_pages() returns true). 2) the original code changes migratetype to MIGRATE_ISOLATE forcibly, and sets it to MIGRATE_MOVABLE forcibly after soft offline, regardless of the original migratetype state, which could impact other subsystems like memory hotplug or compaction. This patch moves PageSetHWPoison just after put_page() in unmap_and_move(), which closes up the reported race window and minimizes another race window b/w SetPageHWPoison and reallocation (which causes the reuse of soft-offlined page.) The latter race window still exists but it's acceptable, because it's rare and effectively the same as ordinary "containment failure" case even if it happens, so keep the window open is acceptable. Fixes: 4491f71260 ("mm/memory-failure: set PageHWPoison before migrate_pages()") Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reported-by: Wanpeng Li <wanpeng.li@hotmail.com> Tested-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/hwpoison: introduce num_poisoned_pages wrappersNaoya Horiguchi2015-09-08
| | | | | | | | | | num_poisoned_pages counter will be changed outside mm/memory-failure.c by a subsequent patch, so this patch prepares wrappers to manipulate it. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Tested-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/hwpoison: replace most of put_page in memory error handling by ↵Wanpeng Li2015-09-08
| | | | | | | | | | | | put_hwpoison_page Replace most instances of put_page() in memory error handling with put_hwpoison_page(). Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/hwpoison: introduce put_hwpoison_page to put refcount for memory error ↵Wanpeng Li2015-09-08
| | | | | | | | | | | | handling Introduce put_hwpoison_page to put refcount for memory error handling. Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Suggested-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/hwpoison: fix PageHWPoison test/set raceWanpeng Li2015-09-08
| | | | | | | | | | | | | | | | | | | | There is a race between madvise_hwpoison path and memory_failure: CPU0 CPU1 madvise_hwpoison get_user_pages_fast PageHWPoison check (false) memory_failure TestSetPageHWPoison soft_offline_page PageHWPoison check (true) return -EBUSY (without put_page) Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Suggested-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/hwpoison: fix failure to split thp w/ refcount heldWanpeng Li2015-09-08
| | | | | | | | | | | | | THP pages will get a refcount in madvise_hwpoison() w/ MF_COUNT_INCREASED flag, however, the refcount is still held when fail to split THP pages. Fix it by reducing the refcount of THP pages when fail to split THP. Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: export struct mem_cgroupMichal Hocko2015-09-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | mem_cgroup structure is defined in mm/memcontrol.c currently which means that the code outside of this file has to use external API even for trivial access stuff. This patch exports mm_struct with its dependencies and makes some of the exported functions inlines. This even helps to reduce the code size a bit (make defconfig + CONFIG_MEMCG=y) text data bss dec hex filename 12355346 1823792 1089536 15268674 e8fb42 vmlinux.before 12354970 1823792 1089536 15268298 e8f9ca vmlinux.after This is not much (370B) but better than nothing. We also save a function call in some hot paths like callers of mem_cgroup_count_vm_event which is used for accounting. The patch doesn't introduce any functional changes. [vdavykov@parallels.com: inline memcg_kmem_is_active] [vdavykov@parallels.com: do not expose type outside of CONFIG_MEMCG] [akpm@linux-foundation.org: memcontrol.h needs eventfd.h for eventfd_ctx] [akpm@linux-foundation.org: export mem_cgroup_from_task() to modules] Signed-off-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/hwpoison: fix panic due to split huge zero pageWanpeng Li2015-08-14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Bug: ------------[ cut here ]------------ kernel BUG at mm/huge_memory.c:1957! invalid opcode: 0000 [#1] SMP Modules linked in: snd_hda_codec_hdmi i915 rpcsec_gss_krb5 snd_hda_codec_realtek snd_hda_codec_generic nfsv4 dns_re CPU: 2 PID: 2576 Comm: test_huge Not tainted 4.2.0-rc5-mm1+ #27 Hardware name: Dell Inc. OptiPlex 7020/0F5C5X, BIOS A03 01/08/2015 task: ffff880204e3d600 ti: ffff8800db16c000 task.ti: ffff8800db16c000 RIP: split_huge_page_to_list+0xdb/0x120 Call Trace: memory_failure+0x32e/0x7c0 madvise_hwpoison+0x8b/0x160 SyS_madvise+0x40/0x240 ? do_page_fault+0x37/0x90 entry_SYSCALL_64_fastpath+0x12/0x71 Code: ff f0 41 ff 4c 24 30 74 0d 31 c0 48 83 c4 08 5b 41 5c 41 5d c9 c3 4c 89 e7 e8 e2 58 fd ff 48 83 c4 08 31 c0 RIP split_huge_page_to_list+0xdb/0x120 RSP <ffff8800db16fde8> ---[ end trace aee7ce0df8e44076 ]--- Testcase: #define _GNU_SOURCE #include <stdlib.h> #include <stdio.h> #include <sys/mman.h> #include <unistd.h> #include <fcntl.h> #include <sys/types.h> #include <errno.h> #include <string.h> #define MB 1024*1024 int main(void) { char *mem; posix_memalign((void **)&mem, 2 * MB, 200 * MB); madvise(mem, 200 * MB, MADV_HWPOISON); free(mem); return 0; } Huge zero page is allocated if page fault w/o FAULT_FLAG_WRITE flag. The get_user_pages_fast() which called in madvise_hwpoison() will get huge zero page if the page is not allocated before. Huge zero page is a tranparent huge page, however, it is not an anonymous page. memory_failure will split the huge zero page and trigger BUG_ON(is_huge_zero_page(page)); After commit 98ed2b0052e6 ("mm/memory-failure: give up error handling for non-tail-refcounted thp"), memory_failure will not catch non anon thp from madvise_hwpoison path and this bug occur. Fix it by catching non anon thp in memory_failure in order to not split huge zero page in madvise_hwpoison path. After this patch: Injecting memory failure for page 0x202800 at 0x7fd8ae800000 MCE: 0x202800: non anonymous thp [...] [akpm@linux-foundation.org: remove second split, per Wanpeng] Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/hwpoison: fix fail isolate hugetlbfs page w/ refcount heldWanpeng Li2015-08-14
| | | | | | | | | | | | | | | Hugetlbfs pages will get a refcount in get_any_page() or madvise_hwpoison() if soft offlining through madvise. The refcount which is held by the soft offline path should be released if we fail to isolate hugetlbfs pages. Fix it by reducing the refcount for both isolation success and failure. Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: <stable@vger.kernel.org> [3.9+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/hwpoison: fix page refcount of unknown non LRU pageWanpeng Li2015-08-14
| | | | | | | | | | | | | | | After trying to drain pages from pagevec/pageset, we try to get reference count of the page again, however, the reference count of the page is not reduced if the page is still not on LRU list. Fix it by adding the put_page() to drop the page reference which is from __get_any_page(). Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: <stable@vger.kernel.org> [3.9+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/memory-failure: set PageHWPoison before migrate_pages()Naoya Horiguchi2015-08-07
| | | | | | | | | | | | | | | | | | | | Now page freeing code doesn't consider PageHWPoison as a bad page, so by setting it before completing the page containment, we can prevent the error page from being reused just after successful page migration. I added TTU_IGNORE_HWPOISON for try_to_unmap() to make sure that the page table entry is transformed into migration entry, not to hwpoison entry. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Dean Nelson <dnelson@redhat.com> Cc: Tony Luck <tony.luck@intel.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Hugh Dickins <hughd@google.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/memory-failure: give up error handling for non-tail-refcounted thpNaoya Horiguchi2015-08-07
| | | | | | | | | | | | | | | | | "non anonymous thp" case is still racy with freeing thp, which causes panic due to put_page() for refcount-0 page. It seems that closing up this race might be hard (and/or not worth doing,) so let's give up the error handling for this case. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Dean Nelson <dnelson@redhat.com> Cc: Tony Luck <tony.luck@intel.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Hugh Dickins <hughd@google.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/memory-failure: fix race in counting num_poisoned_pagesNaoya Horiguchi2015-08-07
| | | | | | | | | | | | | | | | When memory_failure() is called on a page which are just freed after page migration from soft offlining, the counter num_poisoned_pages is raised twi= ce. So let's fix it with using TestSetPageHWPoison. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Dean Nelson <dnelson@redhat.com> Cc: Tony Luck <tony.luck@intel.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Hugh Dickins <hughd@google.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/memory-failure: unlock_page before put_pageNaoya Horiguchi2015-08-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Recently I addressed a few of hwpoison race problems and the patches are merged on v4.2-rc1. It made progress, but unfortunately some problems still remain due to less coverage of my testing. So I'm trying to fix or avoid them in this series. One point I'm expecting to discuss is that patch 4/5 changes the page flag set to be checked on free time. In current behavior, __PG_HWPOISON is not supposed to be set when the page is freed. I think that there is no strong reason for this behavior, and it causes a problem hard to fix only in error handler side (because __PG_HWPOISON could be set at arbitrary timing.) So I suggest to change it. With this patchset, hwpoison stress testing in official mce-test testsuite (which previously failed) passes. This patch (of 5): In "just unpoisoned" path, we do put_page and then unlock_page, which is a wrong order and causes "freeing locked page" bug. So let's fix it. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Dean Nelson <dnelson@redhat.com> Cc: Tony Luck <tony.luck@intel.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Hugh Dickins <hughd@google.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* tracing: add trace event for memory-failureXie XiuQi2015-06-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | RAS user space tools like rasdaemon which base on trace event, could receive mce error event, but no memory recovery result event. So, I want to add this event to make this scenario complete. This patch add a event at ras group for memory-failure. The output like below: # tracer: nop # # entries-in-buffer/entries-written: 2/2 #P:24 # # _-----=> irqs-off # / _----=> need-resched # | / _---=> hardirq/softirq # || / _--=> preempt-depth # ||| / delay # TASK-PID CPU# |||| TIMESTAMP FUNCTION # | | | |||| | | mce-inject-13150 [001] .... 277.019359: memory_failure_event: pfn 0x19869: recovery action for free buddy page: Delayed [xiexiuqi@huawei.com: fix build error] Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Tony Luck <tony.luck@intel.com> Cc: Chen Gong <gong.chen@linux.intel.com> Cc: Jim Davis <jim.epost@gmail.com> Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memory-failure: change type of action_result's param 3 to enumXie XiuQi2015-06-24
| | | | | | | | | | | | | | Change type of action_result's param 3 to enum for type consistency, and rename mf_outcome to mf_result for clearly. Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Chen Gong <gong.chen@linux.intel.com> Cc: Jim Davis <jim.epost@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memory-failure: export page_type and action resultXie XiuQi2015-06-24
| | | | | | | | | | | | | | | | | Export 'outcome' and 'action_page_type' to mm.h, so we could use this emnus outside. This patch is preparation for adding trace events for memory-failure recovery action. Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Chen Gong <gong.chen@linux.intel.com> Cc: Jim Davis <jim.epost@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/memory-failure: me_huge_page() does nothing for thpNaoya Horiguchi2015-06-24
| | | | | | | | | | | | | | | | | | | memory_failure() is supposed not to handle thp itself, but to split it. But if something were wrong and page_action() were called on thp, me_huge_page() (action routine for hugepages) should be better to take no action, rather than to take wrong action prepared for hugetlb (which triggers BUG_ON().) This change is for potential problems, but makes sense to me because thp is an actively developing feature and this code path can be open in the future. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Tony Luck <tony.luck@intel.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: soft-offline: don't free target page in successful page migrationNaoya Horiguchi2015-06-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Stress testing showed that soft offline events for a process iterating "mmap-pagefault-munmap" loop can trigger VM_BUG_ON(PAGE_FLAGS_CHECK_AT_PREP) in __free_one_page(): Soft offlining page 0x70fe1 at 0x70100008d000 Soft offlining page 0x705fb at 0x70300008d000 page:ffffea0001c3f840 count:0 mapcount:0 mapping: (null) index:0x2 flags: 0x1fffff80800000(hwpoison) page dumped because: VM_BUG_ON_PAGE(page->flags & ((1 << 25) - 1)) ------------[ cut here ]------------ kernel BUG at /src/linux-dev/mm/page_alloc.c:585! invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC Modules linked in: cfg80211 rfkill crc32c_intel microcode ppdev parport_pc pcspkr serio_raw virtio_balloon parport i2c_piix4 virtio_blk virtio_net ata_generic pata_acpi floppy CPU: 3 PID: 1779 Comm: test_base_madv_ Not tainted 4.0.0-v4.0-150511-1451-00009-g82360a3730e6 #139 RIP: free_pcppages_bulk+0x52a/0x6f0 Call Trace: drain_pages_zone+0x3d/0x50 drain_local_pages+0x1d/0x30 on_each_cpu_mask+0x46/0x80 drain_all_pages+0x14b/0x1e0 soft_offline_page+0x432/0x6e0 SyS_madvise+0x73c/0x780 system_call_fastpath+0x12/0x17 Code: ff 89 45 b4 48 8b 45 c0 48 83 b8 a8 00 00 00 00 0f 85 e3 fb ff ff 0f 1f 00 0f 0b 48 8b 7d 90 48 c7 c6 e8 95 a6 81 e8 e6 32 02 00 <0f> 0b 8b 45 cc 49 89 47 30 41 8b 47 18 83 f8 ff 0f 85 10 ff ff RIP [<ffffffff811a806a>] free_pcppages_bulk+0x52a/0x6f0 RSP <ffff88007a117d28> ---[ end trace 53926436e76d1f35 ]--- When soft offline successfully migrates page, the source page is supposed to be freed. But there is a race condition where a source page looks isolated (i.e. the refcount is 0 and the PageHWPoison is set) but somewhat linked to pcplist. Then another soft offline event calls drain_all_pages() and tries to free such hwpoisoned page, which is forbidden. This odd page state seems to happen due to the race between put_page() in putback_lru_page() and __pagevec_lru_add_fn(). But I don't want to play with tweaking drain code as done in commit 9ab3b598d2df "mm: hwpoison: drop lru_add_drain_all() in __soft_offline_page()", or to change page freeing code for this soft offline's purpose. Instead, let's think about the difference between hard offline and soft offline. There is an interesting difference in how to isolate the in-use page between these, that is, hard offline marks PageHWPoison of the target page at first, and doesn't free it by keeping its refcount 1. OTOH, soft offline tries to free the target page then marks PageHWPoison. This difference might be the source of complexity and result in bugs like the above. So making soft offline isolate with keeping refcount can be a solution for this problem. We can pass to page migration code the "reason" which shows the caller, so let's use this more to avoid calling putback_lru_page() when called from soft offline, which effectively does the isolation for soft offline. With this change, target pages of soft offline never be reused without changing migratetype, so this patch also removes the related code. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Tony Luck <tony.luck@intel.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/memory-failure: introduce get_hwpoison_page() for consistent refcount ↵Naoya Horiguchi2015-06-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | handling memory_failure() can run in 2 different mode (specified by MF_COUNT_INCREASED) in page refcount perspective. When MF_COUNT_INCREASED is set, memory_failure() assumes that the caller takes a refcount of the target page. And if cleared, memory_failure() takes it in it's own. In current code, however, refcounting is done differently in each caller. For example, madvise_hwpoison() uses get_user_pages_fast() and hwpoison_inject() uses get_page_unless_zero(). So this inconsistent refcounting causes refcount failure especially for thp tail pages. Typical user visible effects are like memory leak or VM_BUG_ON_PAGE(!page_count(page)) in isolate_lru_page(). To fix this refcounting issue, this patch introduces get_hwpoison_page() to handle thp tail pages in the same manner for each caller of hwpoison code. memory_failure() might fail to split thp and in such case it returns without completing page isolation. This is not good because PageHWPoison on the thp is still set and there's no easy way to unpoison such thps. So this patch try to roll back any action to the thp in "non anonymous thp" case and "thp split failed" case, expecting an MCE(SRAR) generated by later access afterward will properly free such thps. [akpm@linux-foundation.org: fix CONFIG_HWPOISON_INJECT=m] Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Tony Luck <tony.luck@intel.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/memory-failure: split thp earlier in memory error handlingNaoya Horiguchi2015-06-24
| | | | | | | | | | | | | | | | | | | | | | | | | | memory_failure() doesn't handle thp itself at this time and need to split it before doing isolation. Currently thp is split in the middle of hwpoison_user_mappings(), but there're corner cases where memory_failure() wrongly tries to handle thp without splitting. 1) "non anonymous" thp, which is not a normal operating mode of thp, but a memory error could hit a thp before anon_vma is initialized. In such case, split_huge_page() fails and me_huge_page() (intended for hugetlb) is called for thp, which triggers BUG_ON in page_hstate(). 2) !PageLRU case, where hwpoison_user_mappings() returns with SWAP_SUCCESS and the result is the same as case 1. memory_failure() can't avoid splitting, so let's split it more earlier, which also reduces code which are prepared for both of normal page and thp. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Tony Luck <tony.luck@intel.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm, hwpoison: remove obsolete "Notebook" todo listAndi Kleen2015-06-24
| | | | | | | | | | All the items mentioned here have been either addressed, or were not really needed. So just remove the comment. Signed-off-by: Andi Kleen <ak@linux.intel.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm, hwpoison: add comment describing when to add new casesAndi Kleen2015-06-24
| | | | | | | | | | | | Here's another comment fix for hwpoison. It describes the "guiding principle" on when to add new memory error recovery code. Signed-off-by: Andi Kleen <ak@linux.intel.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: soft-offline: fix num_poisoned_pages counting on concurrent eventsNaoya Horiguchi2015-05-05
| | | | | | | | | | | | | | | | If multiple soft offline events hit one free page/hugepage concurrently, soft_offline_page() can handle the free page/hugepage multiple times, which makes num_poisoned_pages counter increased more than once. This patch fixes this wrong counting by checking TestSetPageHWPoison for normal papes and by checking the return value of dequeue_hwpoisoned_huge_page() for hugepages. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Dean Nelson <dnelson@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: <stable@vger.kernel.org> [3.14+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/memory-failure: call shake_page() when error hits thp tail pageNaoya Horiguchi2015-05-05
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently memory_failure() calls shake_page() to sweep pages out from pcplists only when the victim page is 4kB LRU page or thp head page. But we should do this for a thp tail page too. Consider that a memory error hits a thp tail page whose head page is on a pcplist when memory_failure() runs. Then, the current kernel skips shake_pages() part, so hwpoison_user_mappings() returns without calling split_huge_page() nor try_to_unmap() because PageLRU of the thp head is still cleared due to the skip of shake_page(). As a result, me_huge_page() runs for the thp, which is broken behavior. One effect is a leak of the thp. And another is to fail to isolate the memory error, so later access to the error address causes another MCE, which kills the processes which used the thp. This patch fixes this problem by calling shake_page() for thp tail case. Fixes: 385de35722c9 ("thp: allow a hwpoisoned head page to be put back to LRU") Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Acked-by: Dean Nelson <dnelson@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Cc: Jin Dongming <jin.dongming@np.css.fujitsu.com> Cc: <stable@vger.kernel.org> [3.4+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: hugetlb: introduce page_huge_activeNaoya Horiguchi2015-04-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We are not safe from calling isolate_huge_page() on a hugepage concurrently, which can make the victim hugepage in invalid state and results in BUG_ON(). The root problem of this is that we don't have any information on struct page (so easily accessible) about hugepages' activeness. Note that hugepages' activeness means just being linked to hstate->hugepage_activelist, which is not the same as normal pages' activeness represented by PageActive flag. Normal pages are isolated by isolate_lru_page() which prechecks PageLRU before isolation, so let's do similarly for hugetlb with a new paeg_huge_active(). set/clear_page_huge_active() should be called within hugetlb_lock. But hugetlb_cow() and hugetlb_no_page() don't do this, being justified because in these functions set_page_huge_active() is called right after the hugepage is allocated and no other thread tries to isolate it. [akpm@linux-foundation.org: s/PageHugeActive/page_huge_active/, make it return bool] [fengguang.wu@intel.com: set_page_huge_active() can be static] Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Hugh Dickins <hughd@google.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/memory-failure.c: define page types for action_result() in one placeNaoya Horiguchi2015-04-15
| | | | | | | | | | | | | | | | | | | | This cleanup patch moves all strings passed to action_result() into a singl= e array action_page_type so that a reader can easily find which kind of actio= n results are possible. And this patch also fixes the odd lines to be printed out, like "unknown page state page" or "free buddy, 2nd try page". [akpm@linux-foundation.org: rename messages, per David] [akpm@linux-foundation.org: s/DIRTY_UNEVICTABLE_LRU/CLEAN_UNEVICTABLE_LRU', per Andi] Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Cc: Tony Luck <tony.luck@intel.com> Cc: "Xie XiuQi" <xiexiuqi@huawei.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Chen Gong <gong.chen@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: hwpoison: drop lru_add_drain_all() in __soft_offline_page()Naoya Horiguchi2015-02-12
| | | | | | | | | | | | | | | | | | | | | | | | | A race condition starts to be visible in recent mmotm, where a PG_hwpoison flag is set on a migration source page *before* it's back in buddy page poo= l. This is problematic because no page flag is supposed to be set when freeing (see __free_one_page().) So the user-visible effect of this race is that it could trigger the BUG_ON() when soft-offlining is called. The root cause is that we call lru_add_drain_all() to make sure that the page is in buddy, but that doesn't work because this function just schedule= s a work item and doesn't wait its completion. drain_all_pages() does drainin= g directly, so simply dropping lru_add_drain_all() solves this problem. Fixes: f15bdfa802bf ("mm/memory-failure.c: fix memory leak in successful soft offlining") Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Tony Luck <tony.luck@intel.com> Cc: Chen Gong <gong.chen@linux.intel.com> Cc: <stable@vger.kernel.org> [3.11+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* vmscan: per memory cgroup slab shrinkersVladimir Davydov2015-02-12
| | | | | | | | | | | | | | | | | | | | | | | | This patch adds SHRINKER_MEMCG_AWARE flag. If a shrinker has this flag set, it will be called per memory cgroup. The memory cgroup to scan objects from is passed in shrink_control->memcg. If the memory cgroup is NULL, a memcg aware shrinker is supposed to scan objects from the global list. Unaware shrinkers are only called on global pressure with memcg=NULL. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Greg Thelen <gthelen@google.com> Cc: Glauber Costa <glommer@gmail.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: vmscan: invoke slab shrinkers from shrink_zone()Johannes Weiner2014-12-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The slab shrinkers are currently invoked from the zonelist walkers in kswapd, direct reclaim, and zone reclaim, all of which roughly gauge the eligible LRU pages and assemble a nodemask to pass to NUMA-aware shrinkers, which then again have to walk over the nodemask. This is redundant code, extra runtime work, and fairly inaccurate when it comes to the estimation of actually scannable LRU pages. The code duplication will only get worse when making the shrinkers cgroup-aware and requiring them to have out-of-band cgroup hierarchy walks as well. Instead, invoke the shrinkers from shrink_zone(), which is where all reclaimers end up, to avoid this duplication. Take the count for eligible LRU pages out of get_scan_count(), which considers many more factors than just the availability of swap space, like zone_reclaimable_pages() currently does. Accumulate the number over all visited lruvecs to get the per-zone value. Some nodes have multiple zones due to memory addressing restrictions. To avoid putting too much pressure on the shrinkers, only invoke them once for each such node, using the class zone of the allocation as the pivot zone. For now, this integrates the slab shrinking better into the reclaim logic and gets rid of duplicative invocations from kswapd, direct reclaim, and zone reclaim. It also prepares for cgroup-awareness, allowing memcg-capable shrinkers to be added at the lruvec level without much duplication of both code and runtime work. This changes kswapd behavior, which used to invoke the shrinkers for each zone, but with scan ratios gathered from the entire node, resulting in meaningless pressure quantities on multi-zone nodes. Zone reclaim behavior also changes. It used to shrink slabs until the same amount of pages were shrunk as were reclaimed from the LRUs. Now it merely invokes the shrinkers once with the zone's scan ratio, which makes the shrinkers go easier on caches that implement aging and would prefer feeding back pressure from recently used slab objects to unused LRU pages. [vdavydov@parallels.com: assure class zone is populated] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/memory-failure: share the i_mmap_rwsemDavidlohr Bueso2014-12-13
| | | | | | | | | | | | | | | | No brainer conversion: collect_procs_file() only schedules a process for later kill, share the lock, similarly to the anon vma variant. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Acked-by: "Kirill A. Shutemov" <kirill@shutemov.name> Acked-by: Hugh Dickins <hughd@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: use new helper functions around the i_mmap_mutexDavidlohr Bueso2014-12-13
| | | | | | | | | | | | | | | | Convert all open coded mutex_lock/unlock calls to the i_mmap_[lock/unlock]_write() helpers. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: "Kirill A. Shutemov" <kirill@shutemov.name> Acked-by: Hugh Dickins <hughd@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'akpm' (patchbomb from Andrew)Linus Torvalds2014-12-10
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge first patchbomb from Andrew Morton: - a few minor cifs fixes - dma-debug upadtes - ocfs2 - slab - about half of MM - procfs - kernel/exit.c - panic.c tweaks - printk upates - lib/ updates - checkpatch updates - fs/binfmt updates - the drivers/rtc tree - nilfs - kmod fixes - more kernel/exit.c - various other misc tweaks and fixes * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (190 commits) exit: pidns: fix/update the comments in zap_pid_ns_processes() exit: pidns: alloc_pid() leaks pid_namespace if child_reaper is exiting exit: exit_notify: re-use "dead" list to autoreap current exit: reparent: call forget_original_parent() under tasklist_lock exit: reparent: avoid find_new_reaper() if no children exit: reparent: introduce find_alive_thread() exit: reparent: introduce find_child_reaper() exit: reparent: document the ->has_child_subreaper checks exit: reparent: s/while_each_thread/for_each_thread/ in find_new_reaper() exit: reparent: fix the cross-namespace PR_SET_CHILD_SUBREAPER reparenting exit: reparent: fix the dead-parent PR_SET_CHILD_SUBREAPER reparenting exit: proc: don't try to flush /proc/tgid/task/tgid exit: release_task: fix the comment about group leader accounting exit: wait: drop tasklist_lock before psig->c* accounting exit: wait: don't use zombie->real_parent exit: wait: cleanup the ptrace_reparented() checks usermodehelper: kill the kmod_thread_locker logic usermodehelper: don't use CLONE_VFORK for ____call_usermodehelper() fs/hfs/catalog.c: fix comparison bug in hfs_cat_keycmp nilfs2: fix the nilfs_iget() vs. nilfs_new_inode() races ...
| * mm, memory_hotplug/failure: drain single zone pcplistsVlastimil Babka2014-12-10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Memory hotplug and failure mechanisms have several places where pcplists are drained so that pages are returned to the buddy allocator and can be e.g. prepared for offlining. This is always done in the context of a single zone, we can reduce the pcplists drain to the single zone, which is now possible. The change should make memory offlining due to hotremove or failure faster and not disturbing unrelated pcplists anymore. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * mm: introduce single zone pcplists drainVlastimil Babka2014-12-10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The functions for draining per-cpu pages back to buddy allocators currently always operate on all zones. There are however several cases where the drain is only needed in the context of a single zone, and spilling other pcplists is a waste of time both due to the extra spilling and later refilling. This patch introduces new zone pointer parameter to drain_all_pages() and changes the dummy parameter of drain_local_pages() to be also a zone pointer. When NULL is passed, the functions operate on all zones as usual. Passing a specific zone pointer reduces the work to the single zone. All callers are updated to pass the NULL pointer in this patch. Conversion to single zone (where appropriate) is done in further patches. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | RAS, HWPOISON: Fix wrong error recovery statusChen, Gong2014-10-21
|/ | | | | | | | | | | | | | | | | | | | | | | When Uncorrected error happens, if the poisoned page is referenced by more than one user after error recovery, the recovery is not successful. But currently the display result is wrong. Before this patch: MCE 0x44e336: dirty mlocked LRU page recovery: Recovered MCE 0x44e336: dirty mlocked LRU page still referenced by 1 users mce: Memory error not recovered After this patch: MCE 0x44e336: dirty mlocked LRU page recovery: Failed MCE 0x44e336: dirty mlocked LRU page still referenced by 1 users mce: Memory error not recovered Signed-off-by: Chen, Gong <gong.chen@linux.intel.com> Link: http://lkml.kernel.org/r/1406530260-26078-3-git-send-email-gong.chen@linux.intel.com Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de>