summaryrefslogtreecommitdiff
path: root/mm/internal.h (follow)
Commit message (Collapse)AuthorAge
* Merge branch 'android-4.4-p' of ↵Michael Bestas2020-04-14
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | https://android.googlesource.com/kernel/common into lineage-17.1-caf-msm8998 This brings LA.UM.8.4.r1-05300-8x98.0 up to date with https://android.googlesource.com/kernel/common/ android-4.4-p at commit: f9991115f0793 Merge 4.4.219 into android-4.4-p Conflicts: drivers/clk/qcom/clk-rcg2.c drivers/scsi/sd.c drivers/usb/gadget/function/f_fs.c drivers/usb/gadget/function/u_serial.c Change-Id: Ifed3db0ddda828c1697e57e9f73c1b73354bebf7
| * mm: add 'try_get_page()' helper functionLinus Torvalds2020-03-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 88b1a17dfc3ed7728316478fae0f5ad508f50397 upsteam. This is the same as the traditional 'get_page()' function, but instead of unconditionally incrementing the reference count of the page, it only does so if the count was "safe". It returns whether the reference count was incremented (and is marked __must_check, since the caller obviously has to be aware of it). Also like 'get_page()', you can't use this function unless you already had a reference to the page. The intent is that you can use this exactly like get_page(), but in situations where you want to limit the maximum reference count. The code currently does an unconditional WARN_ON_ONCE() if we ever hit the reference count issues (either zero or negative), as a notification that the conditional non-increment actually happened. NOTE! The count access for the "safety" check is inherently racy, but that doesn't matter since the buffer we use is basically half the range of the reference count (ie we look at the sign of the count). Acked-by: Matthew Wilcox <willy@infradead.org> Cc: Jann Horn <jannh@google.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [ 4.4.y backport notes: Srivatsa: - Adapted try_get_page() to match the get_page() implementation in 4.4.y, except for the refcount check. - Added try_get_page_foll() which will be needed in a subsequent patch. ] Signed-off-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu> Signed-off-by: Ajay Kaher <akaher@vmware.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * mm: make page ref count overflow check tighter and more explicitLinus Torvalds2020-03-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit f958d7b528b1b40c44cfda5eabe2d82760d868c3 upsteam. We have a VM_BUG_ON() to check that the page reference count doesn't underflow (or get close to overflow) by checking the sign of the count. That's all fine, but we actually want to allow people to use a "get page ref unless it's already very high" helper function, and we want that one to use the sign of the page ref (without triggering this VM_BUG_ON). Change the VM_BUG_ON to only check for small underflows (or _very_ close to overflowing), and ignore overflows which have strayed into negative territory. Acked-by: Matthew Wilcox <willy@infradead.org> Cc: Jann Horn <jannh@google.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [ 4.4.y backport notes: Ajay: Open-coded atomic refcount access due to missing page_ref_count() helper in 4.4.y Srivatsa: Added overflow check to get_page_foll() and related code. ] Signed-off-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu> Signed-off-by: Ajay Kaher <akaher@vmware.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* | BACKPORT: mm: fix pageblock heuristicTim Murray2019-12-23
| | | | | | | | | | | | | | | | | | | | | | | | The Android-tuned page block heuristic was accidentally reset in an AU drop. Fix the heuristic to avoid unnecessary unmovable pageblock migration over time. bug 30643938 Bug: 63336523 (cherry-picked from commit 3e19bcf7d08713daaaba888b4d13502e06e38e96) Change-Id: I59efcd3934f29982b1c9aeb7b0f18eb17e0934b3 Signed-off-by: John Dias <joaodias@google.com>
* | Merge android-4.4@4b8fc9f (v4.4.82) into msm-4.4Blagovest Kolenichev2017-09-01
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * refs/heads/tmp-4b8fc9f UPSTREAM: locking: avoid passing around 'thread_info' in mutex debugging code ANDROID: arm64: fix undeclared 'init_thread_info' error UPSTREAM: kdb: use task_cpu() instead of task_thread_info()->cpu Linux 4.4.82 net: account for current skb length when deciding about UFO ipv4: Should use consistent conditional judgement for ip fragment in __ip_append_data and ip_finish_output mm/mempool: avoid KASAN marking mempool poison checks as use-after-free KVM: arm/arm64: Handle hva aging while destroying the vm sparc64: Prevent perf from running during super critical sections udp: consistently apply ufo or fragmentation revert "ipv4: Should use consistent conditional judgement for ip fragment in __ip_append_data and ip_finish_output" revert "net: account for current skb length when deciding about UFO" packet: fix tp_reserve race in packet_set_ring net: avoid skb_warn_bad_offload false positives on UFO tcp: fastopen: tcp_connect() must refresh the route net: sched: set xt_tgchk_param par.nft_compat as 0 in ipt_init_target bpf, s390: fix jit branch offset related to ldimm64 net: fix keepalive code vs TCP_FASTOPEN_CONNECT tcp: avoid setting cwnd to invalid ssthresh after cwnd reduction states ANDROID: keychord: Fix for a memory leak in keychord. ANDROID: keychord: Fix races in keychord_write. Use %zu to print resid (size_t). ANDROID: keychord: Fix a slab out-of-bounds read. Linux 4.4.81 workqueue: implicit ordered attribute should be overridable net: account for current skb length when deciding about UFO ipv4: Should use consistent conditional judgement for ip fragment in __ip_append_data and ip_finish_output mm: don't dereference struct page fields of invalid pages signal: protect SIGNAL_UNKILLABLE from unintentional clearing. lib/Kconfig.debug: fix frv build failure mm, slab: make sure that KMALLOC_MAX_SIZE will fit into MAX_ORDER ARM: 8632/1: ftrace: fix syscall name matching virtio_blk: fix panic in initialization error path drm/virtio: fix framebuffer sparse warning scsi: qla2xxx: Get mutex lock before checking optrom_state phy state machine: failsafe leave invalid RUNNING state x86/boot: Add missing declaration of string functions tg3: Fix race condition in tg3_get_stats64(). net: phy: dp83867: fix irq generation sh_eth: R8A7740 supports packet shecksumming wext: handle NULL extra data in iwe_stream_add_point better sparc64: Measure receiver forward progress to avoid send mondo timeout xen-netback: correctly schedule rate-limited queues net: phy: Fix PHY unbind crash net: phy: Correctly process PHY_HALTED in phy_stop_machine() net/mlx5: Fix command bad flow on command entry allocation failure sctp: fix the check for _sctp_walk_params and _sctp_walk_errors sctp: don't dereference ptr before leaving _sctp_walk_{params, errors}() dccp: fix a memleak for dccp_feat_init err process dccp: fix a memleak that dccp_ipv4 doesn't put reqsk properly dccp: fix a memleak that dccp_ipv6 doesn't put reqsk properly net: ethernet: nb8800: Handle all 4 RGMII modes identically ipv6: Don't increase IPSTATS_MIB_FRAGFAILS twice in ip6_fragment() packet: fix use-after-free in prb_retire_rx_blk_timer_expired() openvswitch: fix potential out of bound access in parse_ct mcs7780: Fix initialization when CONFIG_VMAP_STACK is enabled rtnetlink: allocate more memory for dev_set_mac_address() ipv4: initialize fib_trie prior to register_netdev_notifier call. ipv6: avoid overflow of offset in ip6_find_1stfragopt net: Zero terminate ifr_name in dev_ifname(). ipv4: ipv6: initialize treq->txhash in cookie_v[46]_check() saa7164: fix double fetch PCIe access condition drm: rcar-du: fix backport bug f2fs: sanity check checkpoint segno and blkoff media: lirc: LIRC_GET_REC_RESOLUTION should return microseconds mm, mprotect: flush TLB if potentially racing with a parallel reclaim leaving stale TLB entries iser-target: Avoid isert_conn->cm_id dereference in isert_login_recv_done iscsi-target: Fix delayed logout processing greater than SECONDS_FOR_LOGOUT_COMP iscsi-target: Fix initial login PDU asynchronous socket close OOPs iscsi-target: Fix early sk_data_ready LOGIN_FLAGS_READY race iscsi-target: Always wait for kthread_should_stop() before kthread exit target: Avoid mappedlun symlink creation during lun shutdown media: platform: davinci: return -EINVAL for VPFE_CMD_S_CCDC_RAW_PARAMS ioctl ARM: dts: armada-38x: Fix irq type for pca955 ext4: fix overflow caused by missing cast in ext4_resize_fs() ext4: fix SEEK_HOLE/SEEK_DATA for blocksize < pagesize mm/page_alloc: Remove kernel address exposure in free_reserved_area() KVM: async_pf: make rcu irq exit if not triggered from idle task ASoC: do not close shared backend dailink ALSA: hda - Fix speaker output from VAIO VPCL14M1R workqueue: restore WQ_UNBOUND/max_active==1 to be ordered libata: array underflow in ata_find_dev() ANDROID: binder: don't queue async transactions to thread. ANDROID: binder: don't enqueue death notifications to thread todo. ANDROID: binder: call poll_wait() unconditionally. android: configs: move quota-related configs to recommended BACKPORT: arm64: split thread_info from task stack UPSTREAM: arm64: assembler: introduce ldr_this_cpu UPSTREAM: arm64: make cpu number a percpu variable UPSTREAM: arm64: smp: prepare for smp_processor_id() rework BACKPORT: arm64: move sp_el0 and tpidr_el1 into cpu_suspend_ctx UPSTREAM: arm64: prep stack walkers for THREAD_INFO_IN_TASK UPSTREAM: arm64: unexport walk_stackframe UPSTREAM: arm64: traps: simplify die() and __die() UPSTREAM: arm64: factor out current_stack_pointer BACKPORT: arm64: asm-offsets: remove unused definitions UPSTREAM: arm64: thread_info remove stale items UPSTREAM: thread_info: include <current.h> for THREAD_INFO_IN_TASK UPSTREAM: thread_info: factor out restart_block UPSTREAM: kthread: Pin the stack via try_get_task_stack()/put_task_stack() in to_live_kthread() function UPSTREAM: sched/core: Add try_get_task_stack() and put_task_stack() UPSTREAM: sched/core: Allow putting thread_info into task_struct UPSTREAM: printk: when dumping regs, show the stack, not thread_info UPSTREAM: fix up initial thread stack pointer vs thread_info confusion UPSTREAM: Clarify naming of thread info/stack allocators ANDROID: sdcardfs: override credential for ioctl to lower fs Conflicts: android/configs/android-base.cfg arch/arm64/Kconfig arch/arm64/include/asm/suspend.h arch/arm64/kernel/head.S arch/arm64/kernel/smp.c arch/arm64/kernel/suspend.c arch/arm64/kernel/traps.c arch/arm64/mm/proc.S kernel/fork.c sound/soc/soc-pcm.c Change-Id: I273e216c94899a838bbd208391c6cbe20b2bf683 Signed-off-by: Blagovest Kolenichev <bkolenichev@codeaurora.org>
| * mm, mprotect: flush TLB if potentially racing with a parallel reclaim ↵Mel Gorman2017-08-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | leaving stale TLB entries commit 3ea277194daaeaa84ce75180ec7c7a2075027a68 upstream. Stable note for 4.4: The upstream patch patches madvise(MADV_FREE) but 4.4 does not have support for that feature. The changelog is left as-is but the hunk related to madvise is omitted from the backport. Nadav Amit identified a theoritical race between page reclaim and mprotect due to TLB flushes being batched outside of the PTL being held. He described the race as follows: CPU0 CPU1 ---- ---- user accesses memory using RW PTE [PTE now cached in TLB] try_to_unmap_one() ==> ptep_get_and_clear() ==> set_tlb_ubc_flush_pending() mprotect(addr, PROT_READ) ==> change_pte_range() ==> [ PTE non-present - no flush ] user writes using cached RW PTE ... try_to_unmap_flush() The same type of race exists for reads when protecting for PROT_NONE and also exists for operations that can leave an old TLB entry behind such as munmap, mremap and madvise. For some operations like mprotect, it's not necessarily a data integrity issue but it is a correctness issue as there is a window where an mprotect that limits access still allows access. For munmap, it's potentially a data integrity issue although the race is massive as an munmap, mmap and return to userspace must all complete between the window when reclaim drops the PTL and flushes the TLB. However, it's theoritically possible so handle this issue by flushing the mm if reclaim is potentially currently batching TLB flushes. Other instances where a flush is required for a present pte should be ok as either the page lock is held preventing parallel reclaim or a page reference count is elevated preventing a parallel free leading to corruption. In the case of page_mkclean there isn't an obvious path that userspace could take advantage of without using the operations that are guarded by this patch. Other users such as gup as a race with reclaim looks just at PTEs. huge page variants should be ok as they don't race with reclaim. mincore only looks at PTEs. userfault also should be ok as if a parallel reclaim takes place, it will either fault the page back in or read some of the data before the flush occurs triggering a fault. Note that a variant of this patch was acked by Andy Lutomirski but this was for the x86 parts on top of his PCID work which didn't make the 4.13 merge window as expected. His ack is dropped from this version and there will be a follow-on patch on top of PCID that will include his ack. [akpm@linux-foundation.org: tweak comments] [akpm@linux-foundation.org: fix spello] Link: http://lkml.kernel.org/r/20170717155523.emckq2esjro6hf3z@suse.de Reported-by: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Andy Lutomirski <luto@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * mm, sl[au]b: add __GFP_ATOMIC to the GFP reclaim maskMel Gorman2016-08-10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit e838a45f9392a5bd2be1cd3ab0b16ae85857461c upstream. Commit d0164adc89f6 ("mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd") modified __GFP_WAIT to explicitly identify the difference between atomic callers and those that were unwilling to sleep. Later the definition was removed entirely. The GFP_RECLAIM_MASK is the set of flags that affect watermark checking and reclaim behaviour but __GFP_ATOMIC was never added. Without it, atomic users of the slab allocator strip the __GFP_ATOMIC flag and cannot access the page allocator atomic reserves. This patch addresses the problem. The user-visible impact depends on the workload but potentially atomic allocations unnecessarily fail without this path. Link: http://lkml.kernel.org/r/20160610093832.GK2527@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reported-by: Marcin Wojtas <mw@semihalf.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* | mm/page_alloc: introduce post allocation processing on page allocatorJoonsoo Kim2017-07-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch is motivated from Hugh and Vlastimil's concern [1]. There are two ways to get freepage from the allocator. One is using normal memory allocation API and the other is __isolate_free_page() which is internally used for compaction and pageblock isolation. Later usage is rather tricky since it doesn't do whole post allocation processing done by normal API. One problematic thing I already know is that poisoned page would not be checked if it is allocated by __isolate_free_page(). Perhaps, there would be more. We could add more debug logic for allocated page in the future and this separation would cause more problem. I'd like to fix this situation at this time. Solution is simple. This patch commonize some logic for newly allocated page and uses it on all sites. This will solve the problem. [1] http://marc.info/?i=alpine.LSU.2.11.1604270029350.7066%40eggly.anvils%3E Change-Id: I601ec8ce8ee4ab76cd408ff2148dd8c73b959fc2 [iamjoonsoo.kim@lge.com: mm-page_alloc-introduce-post-allocation-processing-on-page-allocator-v3] Link: http://lkml.kernel.org/r/1464230275-25791-7-git-send-email-iamjoonsoo.kim@lge.com Link: http://lkml.kernel.org/r/1466150259-27727-9-git-send-email-iamjoonsoo.kim@lge.com Link: http://lkml.kernel.org/r/1464230275-25791-7-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Minchan Kim <minchan@kernel.org> Cc: Alexander Potapenko <glider@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Git-commit: 46f24fd857b37bb86ddd5d0ac3d194e984dfdf1c Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git [guptap@codeaurora.org: resolve trivial merge conflicts] Signed-off-by: Prakash Gupta <guptap@codeaurora.org>
* | mm, kswapd: replace kswapd compaction with waking up kcompactdVlastimil Babka2017-02-21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Similarly to direct reclaim/compaction, kswapd attempts to combine reclaim and compaction to attempt making memory allocation of given order available. The details differ from direct reclaim e.g. in having high watermark as a goal. The code involved in kswapd's reclaim/compaction decisions has evolved to be quite complex. Testing reveals that it doesn't actually work in at least one scenario, and closer inspection suggests that it could be greatly simplified without compromising on the goal (make high-order page available) or efficiency (don't reclaim too much). The simplification relieas of doing all compaction in kcompactd, which is simply woken up when high watermarks are reached by kswapd's reclaim. The scenario where kswapd compaction doesn't work was found with mmtests test stress-highalloc configured to attempt order-9 allocations without direct reclaim, just waking up kswapd. There was no compaction attempt from kswapd during the whole test. Some added instrumentation shows what happens: - balance_pgdat() sets end_zone to Normal, as it's not balanced - reclaim is attempted on DMA zone, which sets nr_attempted to 99, but it cannot reclaim anything, so sc.nr_reclaimed is 0 - for zones DMA32 and Normal, kswapd_shrink_zone uses testorder=0, so it merely checks if high watermarks were reached for base pages. This is true, so no reclaim is attempted. For DMA, testorder=0 wasn't used, as compaction_suitable() returned COMPACT_SKIPPED - even though the pgdat_needs_compaction flag wasn't set to false, no compaction happens due to the condition sc.nr_reclaimed > nr_attempted being false (as 0 < 99) - priority-- due to nr_reclaimed being 0, repeat until priority reaches 0 pgdat_balanced() is false as only the small zone DMA appears balanced (curiously in that check, watermark appears OK and compaction_suitable() returns COMPACT_PARTIAL, because a lower classzone_idx is used there) Now, even if it was decided that reclaim shouldn't be attempted on the DMA zone, the scenario would be the same, as (sc.nr_reclaimed=0 > nr_attempted=0) is also false. The condition really should use >= as the comment suggests. Then there is a mismatch in the check for setting pgdat_needs_compaction to false using low watermark, while the rest uses high watermark, and who knows what other subtlety. Hopefully this demonstrates that this is unsustainable. Luckily we can simplify this a lot. The reclaim/compaction decisions make sense for direct reclaim scenario, but in kswapd, our primary goal is to reach high watermark in order-0 pages. Afterwards we can attempt compaction just once. Unlike direct reclaim, we don't reclaim extra pages (over the high watermark), the current code already disallows it for good reasons. After this patch, we simply wake up kcompactd to process the pgdat, after we have either succeeded or failed to reach the high watermarks in kswapd, which goes to sleep. We pass kswapd's order and classzone_idx, so kcompactd can apply the same criteria to determine which zones are worth compacting. Note that we use the classzone_idx from wakeup_kswapd(), not balanced_classzone_idx which can include higher zones that kswapd tried to balance too, but didn't consider them in pgdat_balanced(). Since kswapd now cannot create high-order pages itself, we need to adjust how it determines the zones to be balanced. The key element here is adding a "highorder" parameter to zone_balanced, which, when set to false, makes it consider only order-0 watermark instead of the desired higher order (this was done previously by kswapd_shrink_zone(), but not elsewhere). This false is passed for example in pgdat_balanced(). Importantly, wakeup_kswapd() uses true to make sure kswapd and thus kcompactd are woken up for a high-order allocation failure. The last thing is to decide what to do with pageblock_skip bitmap handling. Compaction maintains a pageblock_skip bitmap to record pageblocks where isolation recently failed. This bitmap can be reset by three ways: 1) direct compaction is restarting after going through the full deferred cycle 2) kswapd goes to sleep, and some other direct compaction has previously finished scanning the whole zone and set zone->compact_blockskip_flush. Note that a successful direct compaction clears this flag. 3) compaction was invoked manually via trigger in /proc The case 2) is somewhat fuzzy to begin with, but after introducing kcompactd we should update it. The check for direct compaction in 1), and to set the flush flag in 2) use current_is_kswapd(), which doesn't work for kcompactd. Thus, this patch adds bool direct_compaction to compact_control to use in 2). For the case 1) we remove the check completely - unlike the former kswapd compaction, kcompactd does use the deferred compaction functionality, so flushing tied to restarting from deferred compaction makes sense here. Note that when kswapd goes to sleep, kcompactd is woken up, so it will see the flushed pageblock_skip bits. This is different from when the former kswapd compaction observed the bits and I believe it makes more sense. Kcompactd can afford to be more thorough than a direct compaction trying to limit allocation latency, or kswapd whose primary goal is to reclaim. For testing, I used stress-highalloc configured to do order-9 allocations with GFP_NOWAIT|__GFP_HIGH|__GFP_COMP, so they relied just on kswapd/kcompactd reclaim/compaction (the interfering kernel builds in phases 1 and 2 work as usual): stress-highalloc 4.5-rc1+before 4.5-rc1+after -nodirect -nodirect Success 1 Min 1.00 ( 0.00%) 5.00 (-66.67%) Success 1 Mean 1.40 ( 0.00%) 6.20 (-55.00%) Success 1 Max 2.00 ( 0.00%) 7.00 (-16.67%) Success 2 Min 1.00 ( 0.00%) 5.00 (-66.67%) Success 2 Mean 1.80 ( 0.00%) 6.40 (-52.38%) Success 2 Max 3.00 ( 0.00%) 7.00 (-16.67%) Success 3 Min 34.00 ( 0.00%) 62.00 ( 1.59%) Success 3 Mean 41.80 ( 0.00%) 63.80 ( 1.24%) Success 3 Max 53.00 ( 0.00%) 65.00 ( 2.99%) User 3166.67 3181.09 System 1153.37 1158.25 Elapsed 1768.53 1799.37 4.5-rc1+before 4.5-rc1+after -nodirect -nodirect Direct pages scanned 32938 32797 Kswapd pages scanned 2183166 2202613 Kswapd pages reclaimed 2152359 2143524 Direct pages reclaimed 32735 32545 Percentage direct scans 1% 1% THP fault alloc 579 612 THP collapse alloc 304 316 THP splits 0 0 THP fault fallback 793 778 THP collapse fail 11 16 Compaction stalls 1013 1007 Compaction success 92 67 Compaction failures 920 939 Page migrate success 238457 721374 Page migrate failure 23021 23469 Compaction pages isolated 504695 1479924 Compaction migrate scanned 661390 8812554 Compaction free scanned 13476658 84327916 Compaction cost 262 838 After this patch we see improvements in allocation success rate (especially for phase 3) along with increased compaction activity. The compaction stalls (direct compaction) in the interfering kernel builds (probably THP's) also decreased somewhat thanks to kcompactd activity, yet THP alloc successes improved a bit. Note that elapsed and user time isn't so useful for this benchmark, because of the background interference being unpredictable. It's just to quickly spot some major unexpected differences. System time is somewhat more useful and that didn't increase. Also (after adjusting mmtests' ftrace monitor): Time kswapd awake 2547781 2269241 Time kcompactd awake 0 119253 Time direct compacting 939937 557649 Time kswapd compacting 0 0 Time kcompactd compacting 0 119099 The decrease of overal time spent compacting appears to not match the increased compaction stats. I suspect the tasks get rescheduled and since the ftrace monitor doesn't see that, the reported time is wall time, not CPU time. But arguably direct compactors care about overall latency anyway, whether busy compacting or waiting for CPU doesn't matter. And that latency seems to almost halved. It's also interesting how much time kswapd spent awake just going through all the priorities and failing to even try compacting, over and over. We can also configure stress-highalloc to perform both direct reclaim/compaction and wakeup kswapd/kcompactd, by using GFP_KERNEL|__GFP_HIGH|__GFP_COMP: stress-highalloc 4.5-rc1+before 4.5-rc1+after -direct -direct Success 1 Min 4.00 ( 0.00%) 9.00 (-50.00%) Success 1 Mean 8.00 ( 0.00%) 10.00 (-19.05%) Success 1 Max 12.00 ( 0.00%) 11.00 ( 15.38%) Success 2 Min 4.00 ( 0.00%) 9.00 (-50.00%) Success 2 Mean 8.20 ( 0.00%) 10.00 (-16.28%) Success 2 Max 13.00 ( 0.00%) 11.00 ( 8.33%) Success 3 Min 75.00 ( 0.00%) 74.00 ( 1.33%) Success 3 Mean 75.60 ( 0.00%) 75.20 ( 0.53%) Success 3 Max 77.00 ( 0.00%) 76.00 ( 0.00%) User 3344.73 3246.04 System 1194.24 1172.29 Elapsed 1838.04 1836.76 4.5-rc1+before 4.5-rc1+after -direct -direct Direct pages scanned 125146 120966 Kswapd pages scanned 2119757 2135012 Kswapd pages reclaimed 2073183 2108388 Direct pages reclaimed 124909 120577 Percentage direct scans 5% 5% THP fault alloc 599 652 THP collapse alloc 323 354 THP splits 0 0 THP fault fallback 806 793 THP collapse fail 17 16 Compaction stalls 2457 2025 Compaction success 906 518 Compaction failures 1551 1507 Page migrate success 2031423 2360608 Page migrate failure 32845 40852 Compaction pages isolated 4129761 4802025 Compaction migrate scanned 11996712 21750613 Compaction free scanned 214970969 344372001 Compaction cost 2271 2694 In this scenario, this patch doesn't change the overall success rate as direct compaction already tries all it can. There's however significant reduction in direct compaction stalls (that is, the number of allocations that went into direct compaction). The number of successes (i.e. direct compaction stalls that ended up with successful allocation) is reduced by the same number. This means the offload to kcompactd is working as expected, and direct compaction is reduced either due to detecting contention, or compaction deferred by kcompactd. In the previous version of this patchset there was some apparent reduction of success rate, but the changes in this version (such as using sync compaction only), new baseline kernel, and/or averaging results from 5 executions (my bet), made this go away. Ftrace-based stats seem to roughly agree: Time kswapd awake 2532984 2326824 Time kcompactd awake 0 257916 Time direct compacting 864839 735130 Time kswapd compacting 0 0 Time kcompactd compacting 0 257585 Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Git-commit: accf62422b3a67fce8ce086aa81c8300ddbf42be Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git Change-Id: I5cc784ff0fb1b28bdcfa8d4ef9bd33f585ee4662 Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
* | mm, sl[au]b: add __GFP_ATOMIC to the GFP reclaim maskMel Gorman2016-06-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit d0164adc89f6 ("mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd") modified __GFP_WAIT to explicitly identify the difference between atomic callers and those that were unwilling to sleep. Later the definition was removed entirely. The GFP_RECLAIM_MASK is the set of flags that affect watermark checking and reclaim behaviour but __GFP_ATOMIC was never added. Without it, atomic users of the slab allocator strip the __GFP_ATOMIC flag and cannot access the page allocator atomic reserves. This patch addresses the problem. The user-visible impact depends on the workload but potentially atomic allocations unnecessarily fail without this path. Change-Id: Ieac0932d146f7fd992db9fd834b0e9aa3822f891 Link: http://lkml.kernel.org/r/20160610093832.GK2527@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reported-by: Marcin Wojtas <mw@semihalf.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Cc: <stable@vger.kernel.org> [4.4+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Git-Commit: 843f65ccbb2273430b57ae135ccd26dddee05be7 Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
* | mm: Enhance per process reclaim to consider shared pagesMinchan Kim2016-06-22
|/ | | | | | | | | | | | | | | | | | | | | Some pages could be shared by several processes. (ex, libc) In case of that, it's too bad to reclaim them from the beginnig. This patch causes VM to keep them on memory until last task try to reclaim them so shared pages will be reclaimed only if all of task has gone swapping out. This feature doesn't handle non-linear mapping on ramfs because it's very time-consuming and doesn't make sure of reclaiming and not common. Change-Id: I7e5f34f2e947f5db6d405867fe2ad34863ca40f7 Signed-off-by: Sangseok Lee <sangseok.lee@lge.com> Signed-off-by: Minchan Kim <minchan@kernel.org> Patch-mainline: linux-mm @ 9 May 2013 16:21:27 [vinmenon@codeaurora.org: trivial merge conflict fixes + changes to make the patch work with 3.18 kernel] Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
* mm: use 'unsigned int' for page orderKirill A. Shutemov2015-11-06
| | | | | | | | | | | | | | | | | | | | | | Let's try to be consistent about data type of page order. [sfr@canb.auug.org.au: fix build (type of pageblock_order)] [hughd@google.com: some configs end up with MAX_ORDER and pageblock_order having different types] Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: make compound_head() robustKirill A. Shutemov2015-11-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Hugh has pointed that compound_head() call can be unsafe in some context. There's one example: CPU0 CPU1 isolate_migratepages_block() page_count() compound_head() !!PageTail() == true put_page() tail->first_page = NULL head = tail->first_page alloc_pages(__GFP_COMP) prep_compound_page() tail->first_page = head __SetPageTail(p); !!PageTail() == true <head == NULL dereferencing> The race is pure theoretical. I don't it's possible to trigger it in practice. But who knows. We can fix the race by changing how encode PageTail() and compound_head() within struct page to be able to update them in one shot. The patch introduces page->compound_head into third double word block in front of compound_dtor and compound_order. Bit 0 encodes PageTail() and the rest bits are pointer to head page if bit zero is set. The patch moves page->pmd_huge_pte out of word, just in case if an architecture defines pgtable_t into something what can have the bit 0 set. hugetlb_cgroup uses page->lru.next in the second tail page to store pointer struct hugetlb_cgroup. The patch switch it to use page->private in the second tail page instead. The space is free since ->first_page is removed from the union. The patch also opens possibility to remove HUGETLB_CGROUP_MIN_ORDER limitation, since there's now space in first tail page to store struct hugetlb_cgroup pointer. But that's out of scope of the patch. That means page->compound_head shares storage space with: - page->lru.next; - page->next; - page->rcu_head.next; That's too long list to be absolutely sure, but looks like nobody uses bit 0 of the word. page->rcu_head.next guaranteed[1] to have bit 0 clean as long as we use call_rcu(), call_rcu_bh(), call_rcu_sched(), or call_srcu(). But future call_rcu_lazy() is not allowed as it makes use of the bit and we can get false positive PageTail(). [1] http://lkml.kernel.org/g/20150827163634.GD4029@linux.vnet.ibm.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Christoph Lameter <cl@linux.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: page_alloc: hide some GFP internals and document the bits and flag ↵Mel Gorman2015-11-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | combinations Andrew stated the following We have quite a history of remote parts of the kernel using weird/wrong/inexplicable combinations of __GFP_ flags. I tend to think that this is because we didn't adequately explain the interface. And I don't think that gfp.h really improved much in this area as a result of this patchset. Could you go through it some time and decide if we've adequately documented all this stuff? This patches first moves some GFP flag combinations that are part of the MM internals to mm/internal.h. The rest of the patch documents the __GFP_FOO bits under various headings and then documents the flag combinations. It will not help callers that are brain damaged but the clarity might motivate some fixes and avoid future mistakes. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm, page_alloc: remove unnecessary recalculations for dirty zone balancingMel Gorman2015-11-06
| | | | | | | | | | | | | | | | | | | | File-backed pages that will be immediately written are balanced between zones. This heuristic tries to avoid having a single zone filled with recently dirtied pages but the checks are unnecessarily expensive. Move consider_zone_balanced into the alloc_context instead of checking bitmaps multiple times. The patch also gives the parameter a more meaningful name. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: page migration fix PageMlocked on migrated pagesHugh Dickins2015-11-05
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit e6c509f85455 ("mm: use clear_page_mlock() in page_remove_rmap()") in v3.7 inadvertently made mlock_migrate_page() impotent: page migration unmaps the page from userspace before migrating, and that commit clears PageMlocked on the final unmap, leaving mlock_migrate_page() with nothing to do. Not a serious bug, the next attempt at reclaiming the page would fix it up; but a betrayal of page migration's intent - the new page ought to emerge as PageMlocked. I don't see how to fix it for mlock_migrate_page() itself; but easily fixed in remove_migration_pte(), by calling mlock_vma_page() when the vma is VM_LOCKED - under pte lock as in try_to_unmap_one(). Delete mlock_migrate_page()? Not quite, it does still serve a purpose for migrate_misplaced_transhuge_page(): where we could replace it by a test, clear_page_mlock(), mlock_vma_page() sequence; but would that be an improvement? mlock_migrate_page() is fairly lean, and let's make it leaner by skipping the irq save/restore now clearly not needed. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Rik van Riel <riel@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/compaction: correct to flush migrated pages if pageblock skip happensJoonsoo Kim2015-09-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We cache isolate_start_pfn before entering isolate_migratepages(). If pageblock is skipped in isolate_migratepages() due to whatever reason, cc->migrate_pfn can be far from isolate_start_pfn hence we flush pages that were freed. For example, the following scenario can be possible: - assume order-9 compaction, pageblock order is 9 - start_isolate_pfn is 0x200 - isolate_migratepages() - skip a number of pageblocks - start to isolate from pfn 0x600 - cc->migrate_pfn = 0x620 - return - last_migrated_pfn is set to 0x200 - check flushing condition - current_block_start is set to 0x600 - last_migrated_pfn < current_block_start then do useless flush This wrong flush would not help the performance and success rate so this patch tries to fix it. One simple way to know the exact position where we start to isolate migratable pages is that we cache it in isolate_migratepages() before entering actual isolation. This patch implements that and fixes the problem. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: defer flush of writable TLB entriesMel Gorman2015-09-04
| | | | | | | | | | | | | | | | If a PTE is unmapped and it's dirty then it was writable recently. Due to deferred TLB flushing, it's best to assume a writable TLB cache entry exists. With that assumption, the TLB must be flushed before any IO can start or the page is freed to avoid lost writes or data corruption. This patch defers flushing of potentially writable TLBs as long as possible. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Dave Hansen <dave.hansen@intel.com> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: send one IPI per CPU to TLB flush all entries after unmapping pagesMel Gorman2015-09-04
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | An IPI is sent to flush remote TLBs when a page is unmapped that was potentially accesssed by other CPUs. There are many circumstances where this happens but the obvious one is kswapd reclaiming pages belonging to a running process as kswapd and the task are likely running on separate CPUs. On small machines, this is not a significant problem but as machine gets larger with more cores and more memory, the cost of these IPIs can be high. This patch uses a simple structure that tracks CPUs that potentially have TLB entries for pages being unmapped. When the unmapping is complete, the full TLB is flushed on the assumption that a refill cost is lower than flushing individual entries. Architectures wishing to do this must give the following guarantee. If a clean page is unmapped and not immediately flushed, the architecture must guarantee that a write to that linear address from a CPU with a cached TLB entry will trap a page fault. This is essentially what the kernel already depends on but the window is much larger with this patch applied and is worth highlighting. The architecture should consider whether the cost of the full TLB flush is higher than sending an IPI to flush each individual entry. An additional architecture helper called flush_tlb_local is required. It's a trivial wrapper with some accounting in the x86 case. The impact of this patch depends on the workload as measuring any benefit requires both mapped pages co-located on the LRU and memory pressure. The case with the biggest impact is multiple processes reading mapped pages taken from the vm-scalability test suite. The test case uses NR_CPU readers of mapped files that consume 10*RAM. Linear mapped reader on a 4-node machine with 64G RAM and 48 CPUs 4.2.0-rc1 4.2.0-rc1 vanilla flushfull-v7 Ops lru-file-mmap-read-elapsed 159.62 ( 0.00%) 120.68 ( 24.40%) Ops lru-file-mmap-read-time_range 30.59 ( 0.00%) 2.80 ( 90.85%) Ops lru-file-mmap-read-time_stddv 6.70 ( 0.00%) 0.64 ( 90.38%) 4.2.0-rc1 4.2.0-rc1 vanilla flushfull-v7 User 581.00 611.43 System 5804.93 4111.76 Elapsed 161.03 122.12 This is showing that the readers completed 24.40% faster with 29% less system CPU time. From vmstats, it is known that the vanilla kernel was interrupted roughly 900K times per second during the steady phase of the test and the patched kernel was interrupts 180K times per second. The impact is lower on a single socket machine. 4.2.0-rc1 4.2.0-rc1 vanilla flushfull-v7 Ops lru-file-mmap-read-elapsed 25.33 ( 0.00%) 20.38 ( 19.54%) Ops lru-file-mmap-read-time_range 0.91 ( 0.00%) 1.44 (-58.24%) Ops lru-file-mmap-read-time_stddv 0.28 ( 0.00%) 0.47 (-65.34%) 4.2.0-rc1 4.2.0-rc1 vanilla flushfull-v7 User 58.09 57.64 System 111.82 76.56 Elapsed 27.29 22.55 It's still a noticeable improvement with vmstat showing interrupts went from roughly 500K per second to 45K per second. The patch will have no impact on workloads with no memory pressure or have relatively few mapped pages. It will have an unpredictable impact on the workload running on the CPU being flushed as it'll depend on how many TLB entries need to be refilled and how long that takes. Worst case, the TLB will be completely cleared of active entries when the target PFNs were not resident at all. [sasha.levin@oracle.com: trace tlb flush after disabling preemption in try_to_unmap_flush] Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Dave Hansen <dave.hansen@intel.com> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: meminit: finish initialisation of struct pages before basic setupMel Gorman2015-06-30
| | | | | | | | | | | | | | | | | | | | | Waiman Long reported that 24TB machines hit OOM during basic setup when struct page initialisation was deferred. One approach is to initialise memory on demand but it interferes with page allocator paths. This patch creates dedicated threads to initialise memory before basic setup. It then blocks on a rw_semaphore until completion as a wait_queue and counter is overkill. This may be slower to boot but it's simplier overall and also gets rid of a section mangling which existed so kswapd could do the initialisation. [akpm@linux-foundation.org: include rwsem.h, use DECLARE_RWSEM, fix comment, remove unneeded cast] Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Waiman Long <waiman.long@hp.com Cc: Nathan Zimmer <nzimmer@sgi.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Scott Norton <scott.norton@hp.com> Tested-by: Daniel J Blueman <daniel@numascale.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: meminit: remove mminit_verify_page_linksMel Gorman2015-06-30
| | | | | | | | | | | | | | | | | | | | | | | | | | | mminit_verify_page_links() is an extremely paranoid check that was introduced when memory initialisation was being heavily reworked. Profiles indicated that up to 10% of parallel memory initialisation was spent on checking this for every page. The cost could be reduced but in practice this check only found problems very early during the initialisation rewrite and has found nothing since. This patch removes an expensive unnecessary check. Signed-off-by: Mel Gorman <mgorman@suse.de> Tested-by: Nate Zimmer <nzimmer@sgi.com> Tested-by: Waiman Long <waiman.long@hp.com> Tested-by: Daniel J Blueman <daniel@numascale.com> Acked-by: Pekka Enberg <penberg@kernel.org> Cc: Robin Holt <robinmholt@gmail.com> Cc: Nate Zimmer <nzimmer@sgi.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Waiman Long <waiman.long@hp.com> Cc: Scott Norton <scott.norton@hp.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: meminit: initialise remaining struct pages in parallel with kswapdMel Gorman2015-06-30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Only a subset of struct pages are initialised at the moment. When this patch is applied kswapd initialise the remaining struct pages in parallel. This should boot faster by spreading the work to multiple CPUs and initialising data that is local to the CPU. The user-visible effect on large machines is that free memory will appear to rapidly increase early in the lifetime of the system until kswapd reports that all memory is initialised in the kernel log. Once initialised there should be no other user-visibile effects. Signed-off-by: Mel Gorman <mgorman@suse.de> Tested-by: Nate Zimmer <nzimmer@sgi.com> Tested-by: Waiman Long <waiman.long@hp.com> Tested-by: Daniel J Blueman <daniel@numascale.com> Acked-by: Pekka Enberg <penberg@kernel.org> Cc: Robin Holt <robinmholt@gmail.com> Cc: Nate Zimmer <nzimmer@sgi.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Waiman Long <waiman.long@hp.com> Cc: Scott Norton <scott.norton@hp.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: meminit: initialise a subset of struct pages if ↵Mel Gorman2015-06-30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | CONFIG_DEFERRED_STRUCT_PAGE_INIT is set This patch initalises all low memory struct pages and 2G of the highest zone on each node during memory initialisation if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set. That config option cannot be set but will be available in a later patch. Parallel initialisation of struct page depends on some features from memory hotplug and it is necessary to alter alter section annotations. Signed-off-by: Mel Gorman <mgorman@suse.de> Tested-by: Nate Zimmer <nzimmer@sgi.com> Tested-by: Waiman Long <waiman.long@hp.com> Tested-by: Daniel J Blueman <daniel@numascale.com> Acked-by: Pekka Enberg <penberg@kernel.org> Cc: Robin Holt <robinmholt@gmail.com> Cc: Nate Zimmer <nzimmer@sgi.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Waiman Long <waiman.long@hp.com> Cc: Scott Norton <scott.norton@hp.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: page_alloc: pass PFN to __free_pages_bootmemMel Gorman2015-06-30
| | | | | | | | | | | | | | | | | | | | | | | | | __free_pages_bootmem prepares a page for release to the buddy allocator and assumes that the struct page is initialised. Parallel initialisation of struct pages defers initialisation and __free_pages_bootmem can be called for struct pages that cannot yet map struct page to PFN. This patch passes PFN to __free_pages_bootmem with no other functional change. Signed-off-by: Mel Gorman <mgorman@suse.de> Tested-by: Nate Zimmer <nzimmer@sgi.com> Tested-by: Waiman Long <waiman.long@hp.com> Tested-by: Daniel J Blueman <daniel@numascale.com> Acked-by: Pekka Enberg <penberg@kernel.org> Cc: Robin Holt <robinmholt@gmail.com> Cc: Nate Zimmer <nzimmer@sgi.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Waiman Long <waiman.long@hp.com> Cc: Scott Norton <scott.norton@hp.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: remove rest of ACCESS_ONCE() usagesJason Low2015-04-15
| | | | | | | | | | | | | | | | | We converted some of the usages of ACCESS_ONCE to READ_ONCE in the mm/ tree since it doesn't work reliably on non-scalar types. This patch removes the rest of the usages of ACCESS_ONCE, and use the new READ_ONCE API for the read accesses. This makes things cleaner, instead of using separate/multiple sets of APIs. Signed-off-by: Jason Low <jason.low2@hp.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Davidlohr Bueso <dave@stgolabs.net> Acked-by: Rik van Riel <riel@redhat.com> Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/compaction: enhance compaction finish conditionJoonsoo Kim2015-04-14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Compaction has anti fragmentation algorithm. It is that freepage should be more than pageblock order to finish the compaction if we don't find any freepage in requested migratetype buddy list. This is for mitigating fragmentation, but, there is a lack of migratetype consideration and it is too excessive compared to page allocator's anti fragmentation algorithm. Not considering migratetype would cause premature finish of compaction. For example, if allocation request is for unmovable migratetype, freepage with CMA migratetype doesn't help that allocation and compaction should not be stopped. But, current logic regards this situation as compaction is no longer needed, so finish the compaction. Secondly, condition is too excessive compared to page allocator's logic. We can steal freepage from other migratetype and change pageblock migratetype on more relaxed conditions in page allocator. This is designed to prevent fragmentation and we can use it here. Imposing hard constraint only to the compaction doesn't help much in this case since page allocator would cause fragmentation again. To solve these problems, this patch borrows anti fragmentation logic from page allocator. It will reduce premature compaction finish in some cases and reduce excessive compaction work. stress-highalloc test in mmtests with non movable order 7 allocation shows considerable increase of compaction success rate. Compaction success rate (Compaction success * 100 / Compaction stalls, %) 31.82 : 42.20 I tested it on non-reboot 5 runs stress-highalloc benchmark and found that there is no more degradation on allocation success rate than before. That roughly means that this patch doesn't result in more fragmentations. Vlastimil suggests additional idea that we only test for fallbacks when migration scanner has scanned a whole pageblock. It looked good for fragmentation because chance of stealing increase due to making more free pages in certain pageblock. So, I tested it, but, it results in decreased compaction success rate, roughly 38.00. I guess the reason that if system is low memory condition, watermark check could be failed due to not enough order 0 free page and so, sometimes, we can't reach a fallback check although migrate_pfn is aligned to pageblock_nr_pages. I can insert code to cope with this situation but it makes code more complicated so I don't include his idea at this patch. [akpm@linux-foundation.org: fix CONFIG_CMA=n build] Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: rename __mlock_vma_pages_range() to populate_vma_page_range()Kirill A. Shutemov2015-04-14
| | | | | | | | | | | | | | | | | | | __mlock_vma_pages_range() doesn't necessarily mlock pages. It depends on vma flags. The same codepath is used for MAP_POPULATE. Let's rename __mlock_vma_pages_range() to populate_vma_page_range(). This patch also drops mlock_vma_pages_range() references from documentation. It has gone in cea10a19b797 ("mm: directly use __mlock_vma_pages_range() in find_extend_vma()"). Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/internal.h: don't split printk call in twoRasmus Villemoes2015-02-12
| | | | | | | | | | | | | | | | | | | | | | | All users of mminit_dprintk pass a compile-time constant as level, so this just makes gcc emit a single printk call instead of two. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: David Rientjes <rientjes@google.com> Cc: Vishnu Pratap Singh <vishnu.ps@samsung.com> Cc: Pintu Kumar <pintu.k@samsung.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Li Zefan <lizefan@huawei.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: reduce try_to_compact_pages parametersVlastimil Babka2015-02-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Expand the usage of the struct alloc_context introduced in the previous patch also for calling try_to_compact_pages(), to reduce the number of its parameters. Since the function is in different compilation unit, we need to move alloc_context definition in the shared mm/internal.h header. With this change we get simpler code and small savings of code size and stack usage: add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-27 (-27) function old new delta __alloc_pages_direct_compact 283 256 -27 add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-13 (-13) function old new delta try_to_compact_pages 582 569 -13 Stack usage of __alloc_pages_direct_compact goes from 24 to none (per scripts/checkstack.pl). Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Minchan Kim <minchan@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm, compaction: always update cached scanner positionsVlastimil Babka2014-12-10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Compaction caches the migration and free scanner positions between compaction invocations, so that the whole zone gets eventually scanned and there is no bias towards the initial scanner positions at the beginning/end of the zone. The cached positions are continuously updated as scanners progress and the updating stops as soon as a page is successfully isolated. The reasoning behind this is that a pageblock where isolation succeeded is likely to succeed again in near future and it should be worth revisiting it. However, the downside is that potentially many pages are rescanned without successful isolation. At worst, there might be a page where isolation from LRU succeeds but migration fails (potentially always). So upon encountering this page, cached position would always stop being updated for no good reason. It might have been useful to let such page be rescanned with sync compaction after async one failed, but this is now handled by caching scanner position for async and sync mode separately since commit 35979ef33931 ("mm, compaction: add per-zone migration pfn cache for async compaction"). After this patch, cached positions are updated unconditionally. In stress-highalloc benchmark, this has decreased the numbers of scanned pages by few percent, without affecting allocation success rates. To prevent free scanner from leaving free pages behind after they are returned due to page migration failure, the cached scanner pfn is changed to point to the pageblock of the returned free page with the highest pfn, before leaving compact_zone(). [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Minchan Kim <minchan@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Christoph Lameter <cl@linux.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm, compaction: pass classzone_idx and alloc_flags to watermark checkingVlastimil Babka2014-12-10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Compaction relies on zone watermark checks for decisions such as if it's worth to start compacting in compaction_suitable() or whether compaction should stop in compact_finished(). The watermark checks take classzone_idx and alloc_flags parameters, which are related to the memory allocation request. But from the context of compaction they are currently passed as 0, including the direct compaction which is invoked to satisfy the allocation request, and could therefore know the proper values. The lack of proper values can lead to mismatch between decisions taken during compaction and decisions related to the allocation request. Lack of proper classzone_idx value means that lowmem_reserve is not taken into account. This has manifested (during recent changes to deferred compaction) when DMA zone was used as fallback for preferred Normal zone. compaction_suitable() without proper classzone_idx would think that the watermarks are already satisfied, but watermark check in get_page_from_freelist() would fail. Because of this problem, deferring compaction has extra complexity that can be removed in the following patch. The issue (not confirmed in practice) with missing alloc_flags is opposite in nature. For allocations that include ALLOC_HIGH, ALLOC_HIGHER or ALLOC_CMA in alloc_flags (the last includes all MOVABLE allocations on CMA-enabled systems) the watermark checking in compaction with 0 passed will be stricter than in get_page_from_freelist(). In these cases compaction might be running for a longer time than is really needed. Another issue compaction_suitable() is that the check for "does the zone need compaction at all?" comes only after the check "does the zone have enough free free pages to succeed compaction". The latter considers extra pages for migration and can therefore in some situations fail and return COMPACT_SKIPPED, although the high-order allocation would succeed and we should return COMPACT_PARTIAL. This patch fixes these problems by adding alloc_flags and classzone_idx to struct compact_control and related functions involved in direct compaction and watermark checking. Where possible, all other callers of compaction_suitable() pass proper values where those are known. This is currently limited to classzone_idx, which is sometimes known in kswapd context. However, the direct reclaim callers should_continue_reclaim() and compaction_ready() do not currently know the proper values, so the coordination between reclaim and compaction may still not be as accurate as it could. This can be fixed later, if it's shown to be an issue. Additionaly the checks in compact_suitable() are reordered to address the second issue described above. The effect of this patch should be slightly better high-order allocation success rates and/or less compaction overhead, depending on the type of allocations and presence of CMA. It allows simplifying deferred compaction code in a followup patch. When testing with stress-highalloc, there was some slight improvement (which might be just due to variance) in success rates of non-THP-like allocations. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Minchan Kim <minchan@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Christoph Lameter <cl@linux.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/page_alloc: restrict max order of merging on isolated pageblockJoonsoo Kim2014-11-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Current pageblock isolation logic could isolate each pageblock individually. This causes freepage accounting problem if freepage with pageblock order on isolate pageblock is merged with other freepage on normal pageblock. We can prevent merging by restricting max order of merging to pageblock order if freepage is on isolate pageblock. A side-effect of this change is that there could be non-merged buddy freepage even if finishing pageblock isolation, because undoing pageblock isolation is just to move freepage from isolate buddy list to normal buddy list rather than to consider merging. So, the patch also makes undoing pageblock isolation consider freepage merge. When un-isolation, freepage with more than pageblock order and it's buddy are checked. If they are on normal pageblock, instead of just moving, we isolate the freepage and free it in order to get merged. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Laura Abbott <lauraa@codeaurora.org> Cc: Heesub Shin <heesub.shin@samsung.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Ritesh Harjani <ritesh.list@gmail.com> Cc: Gioh Kim <gioh.kim@lge.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm, compaction: pass gfp mask to compact_controlDavid Rientjes2014-10-09
| | | | | | | | | | | | | | | | | | | | struct compact_control currently converts the gfp mask to a migratetype, but we need the entire gfp mask in a follow-up patch. Pass the entire gfp mask as part of struct compact_control. Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Christoph Lameter <cl@linux.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm, compaction: skip buddy pages by their order in the migrate scannerVlastimil Babka2014-10-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The migration scanner skips PageBuddy pages, but does not consider their order as checking page_order() is generally unsafe without holding the zone->lock, and acquiring the lock just for the check wouldn't be a good tradeoff. Still, this could avoid some iterations over the rest of the buddy page, and if we are careful, the race window between PageBuddy() check and page_order() is small, and the worst thing that can happen is that we skip too much and miss some isolation candidates. This is not that bad, as compaction can already fail for many other reasons like parallel allocations, and those have much larger race window. This patch therefore makes the migration scanner obtain the buddy page order and use it to skip the whole buddy page, if the order appears to be in the valid range. It's important that the page_order() is read only once, so that the value used in the checks and in the pfn calculation is the same. But in theory the compiler can replace the local variable by multiple inlines of page_order(). Therefore, the patch introduces page_order_unsafe() that uses ACCESS_ONCE to prevent this. Testing with stress-highalloc from mmtests shows a 15% reduction in number of pages scanned by migration scanner. The reduction is >60% with __GFP_NO_KSWAPD allocations, along with success rates better by few percent. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Christoph Lameter <cl@linux.com> Cc: Rik van Riel <riel@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm, compaction: khugepaged should not give up due to need_resched()Vlastimil Babka2014-10-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Async compaction aborts when it detects zone lock contention or need_resched() is true. David Rientjes has reported that in practice, most direct async compactions for THP allocation abort due to need_resched(). This means that a second direct compaction is never attempted, which might be OK for a page fault, but khugepaged is intended to attempt a sync compaction in such case and in these cases it won't. This patch replaces "bool contended" in compact_control with an int that distinguishes between aborting due to need_resched() and aborting due to lock contention. This allows propagating the abort through all compaction functions as before, but passing the abort reason up to __alloc_pages_slowpath() which decides when to continue with direct reclaim and another compaction attempt. Another problem is that try_to_compact_pages() did not act upon the reported contention (both need_resched() or lock contention) immediately and would proceed with another zone from the zonelist. When need_resched() is true, that means initializing another zone compaction, only to check again need_resched() in isolate_migratepages() and aborting. For zone lock contention, the unintended consequence is that the lock contended status reported back to the allocator is detrmined from the last zone where compaction was attempted, which is rather arbitrary. This patch fixes the problem in the following way: - async compaction of a zone aborting due to need_resched() or fatal signal pending means that further zones should not be tried. We report COMPACT_CONTENDED_SCHED to the allocator. - aborting zone compaction due to lock contention means we can still try another zone, since it has different set of locks. We report back COMPACT_CONTENDED_LOCK only if *all* zones where compaction was attempted, it was aborted due to lock contention. As a result of these fixes, khugepaged will proceed with second sync compaction as intended, when the preceding async compaction aborted due to need_resched(). Page fault compactions aborting due to need_resched() will spare some cycles previously wasted by initializing another zone compaction only to abort again. Lock contention will be reported only when compaction in all zones aborted due to lock contention, and therefore it's not a good idea to try again after reclaim. In stress-highalloc from mmtests configured to use __GFP_NO_KSWAPD, this has improved number of THP collapse allocations by 10%, which shows positive effect on khugepaged. The benchmark's success rates are unchanged as it is not recognized as khugepaged. Numbers of compact_stall and compact_fail events have however decreased by 20%, with compact_success still a bit improved, which is good. With benchmark configured not to use __GFP_NO_KSWAPD, there is 6% improvement in THP collapse allocations, and only slight improvement in stalls and failures. [akpm@linux-foundation.org: fix warnings] Reported-by: David Rientjes <rientjes@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Minchan Kim <minchan@kernel.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Christoph Lameter <cl@linux.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm, compaction: move pageblock checks up from isolate_migratepages_range()Vlastimil Babka2014-10-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | isolate_migratepages_range() is the main function of the compaction scanner, called either on a single pageblock by isolate_migratepages() during regular compaction, or on an arbitrary range by CMA's __alloc_contig_migrate_range(). It currently perfoms two pageblock-wide compaction suitability checks, and because of the CMA callpath, it tracks if it crossed a pageblock boundary in order to repeat those checks. However, closer inspection shows that those checks are always true for CMA: - isolation_suitable() is true because CMA sets cc->ignore_skip_hint to true - migrate_async_suitable() check is skipped because CMA uses sync compaction We can therefore move the compaction-specific checks to isolate_migratepages() and simplify isolate_migratepages_range(). Furthermore, we can mimic the freepage scanner family of functions, which has isolate_freepages_block() function called both by compaction from isolate_freepages() and by CMA from isolate_freepages_range(), where each use-case adds own specific glue code. This allows further code simplification. Thus, we rename isolate_migratepages_range() to isolate_migratepages_block() and limit its functionality to a single pageblock (or its subset). For CMA, a new different isolate_migratepages_range() is created as a CMA-specific wrapper for the _block() function. The checks specific to compaction are moved to isolate_migratepages(). As part of the unification of these two families of functions, we remove the redundant zone parameter where applicable, since zone pointer is already passed in cc->zone. Furthermore, going back to compact_zone() and compact_finished() when pageblock is found unsuitable (now by isolate_migratepages()) is wasteful - the checks are meant to skip pageblocks quickly. The patch therefore also introduces a simple loop into isolate_migratepages() so that it does not return immediately on failed pageblock checks, but keeps going until isolate_migratepages_range() gets called once. Similarily to isolate_freepages(), the function periodically checks if it needs to reschedule or abort async compaction. [iamjoonsoo.kim@lge.com: fix isolated page counting bug in compaction] Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Minchan Kim <minchan@kernel.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Christoph Lameter <cl@linux.com> Cc: Rik van Riel <riel@redhat.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/internal.h: use nth_pageFabian Frederick2014-08-06
| | | | | | | | | Use nth_page instead of pfn_to_page(page_to_pfn Signed-off-by: Fabian Frederick <fabf@skynet.be> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm, compaction: properly signal and act upon lock and need_sched() contentionVlastimil Babka2014-06-04
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Compaction uses compact_checklock_irqsave() function to periodically check for lock contention and need_resched() to either abort async compaction, or to free the lock, schedule and retake the lock. When aborting, cc->contended is set to signal the contended state to the caller. Two problems have been identified in this mechanism. First, compaction also calls directly cond_resched() in both scanners when no lock is yet taken. This call either does not abort async compaction, or set cc->contended appropriately. This patch introduces a new compact_should_abort() function to achieve both. In isolate_freepages(), the check frequency is reduced to once by SWAP_CLUSTER_MAX pageblocks to match what the migration scanner does in the preliminary page checks. In case a pageblock is found suitable for calling isolate_freepages_block(), the checks within there are done on higher frequency. Second, isolate_freepages() does not check if isolate_freepages_block() aborted due to contention, and advances to the next pageblock. This violates the principle of aborting on contention, and might result in pageblocks not being scanned completely, since the scanning cursor is advanced. This problem has been noticed in the code by Joonsoo Kim when reviewing related patches. This patch makes isolate_freepages_block() check the cc->contended flag and abort. In case isolate_freepages() has already isolated some pages before aborting due to contention, page migration will proceed, which is OK since we do not want to waste the work that has been done, and page migration has own checks for contention. However, we do not want another isolation attempt by either of the scanners, so cc->contended flag check is added also to compaction_alloc() and compact_finished() to make sure compaction is aborted right after the migration. The outcome of the patch should be reduced lock contention by async compaction and lower latencies for higher-order allocations where direct compaction is involved. [akpm@linux-foundation.org: fix typo in comment] Reported-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Christoph Lameter <cl@linux.com> Cc: Rik van Riel <riel@redhat.com> Acked-by: Michal Nazarewicz <mina86@mina86.com> Tested-by: Shawn Guo <shawn.guo@linaro.org> Tested-by: Kevin Hilman <khilman@linaro.org> Tested-by: Stephen Warren <swarren@nvidia.com> Tested-by: Fabio Estevam <fabio.estevam@freescale.com> Cc: David Rientjes <rientjes@google.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: fold mlocked_vma_newpage() into its only call siteJianyu Zhan2014-06-04
| | | | | | | | | | | | | | | | | | In previous commit(mm: use the light version __mod_zone_page_state in mlocked_vma_newpage()) a irq-unsafe __mod_zone_page_state is used. And as suggested by Andrew, to reduce the risks that new call sites incorrectly using mlocked_vma_newpage() without knowing they are adding racing, this patch folds mlocked_vma_newpage() into its only call site, page_add_new_anon_rmap, to make it open-cocded for people to know what is going on. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Jianyu Zhan <nasa4836@gmail.com> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Suggested-by: Hugh Dickins <hughd@google.com> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: use the light version __mod_zone_page_state in mlocked_vma_newpage()Jianyu Zhan2014-06-04
| | | | | | | | | | | | | | | | | | | | | | mlocked_vma_newpage() is called with pte lock held(a spinlock), which implies preemtion disabled, and the vm stat counter is not modified from interrupt context, so we need not use an irq-safe mod_zone_page_state() here, using a light-weight version __mod_zone_page_state() would be OK. This patch also documents __mod_zone_page_state() and some of its callsites. The comment above __mod_zone_page_state() is from Hugh Dickins, and acked by Christoph. Most credits to Hugh and Christoph for the clarification on the usage of the __mod_zone_page_state(). [akpm@linux-foundation.org: coding-style fixes] Suggested-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Jianyu Zhan <nasa4836@gmail.com> Reviewed-by: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm, compaction: embed migration mode in compact_controlDavid Rientjes2014-06-04
| | | | | | | | | | | | | | | | | | | | | | | | | We're going to want to manipulate the migration mode for compaction in the page allocator, and currently compact_control's sync field is only a bool. Currently, we only do MIGRATE_ASYNC or MIGRATE_SYNC_LIGHT compaction depending on the value of this bool. Convert the bool to enum migrate_mode and pass the migration mode in directly. Later, we'll want to avoid MIGRATE_SYNC_LIGHT for thp allocations in the pagefault patch to avoid unnecessary latency. This also alters compaction triggered from sysfs, either for the entire system or for a node, to force MIGRATE_SYNC. [akpm@linux-foundation.org: fix build] [iamjoonsoo.kim@lge.com: use MIGRATE_SYNC in alloc_contig_range()] Signed-off-by: David Rientjes <rientjes@google.com> Suggested-by: Mel Gorman <mgorman@suse.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Greg Thelen <gthelen@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: move get_user_pages()-related code to separate fileKirill A. Shutemov2014-06-04
| | | | | | | | | | | mm/memory.c is overloaded: over 4k lines. get_user_pages() code is pretty much self-contained let's move it to separate file. No other changes made. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/readahead.c: inline ra_submitFabian Frederick2014-04-07
| | | | | | | | | | | | | Commit f9acc8c7b35a ("readahead: sanify file_ra_state names") left ra_submit with a single function call. Move ra_submit to internal.h and inline it to save some stack. Thanks to Andrew Morton for commenting different versions. Signed-off-by: Fabian Frederick <fabf@skynet.be> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: page_alloc: spill to remote nodes before waking kswapdJohannes Weiner2014-04-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On NUMA systems, a node may start thrashing cache or even swap anonymous pages while there are still free pages on remote nodes. This is a result of commits 81c0a2bb515f ("mm: page_alloc: fair zone allocator policy") and fff4068cba48 ("mm: page_alloc: revert NUMA aspect of fair allocation policy"). Before those changes, the allocator would first try all allowed zones, including those on remote nodes, before waking any kswapds. But now, the allocator fastpath doubles as the fairness pass, which in turn can only consider the local node to prevent remote spilling based on exhausted fairness batches alone. Remote nodes are only considered in the slowpath, after the kswapds are woken up. But if remote nodes still have free memory, kswapd should not be woken to rebalance the local node or it may thrash cash or swap prematurely. Fix this by adding one more unfair pass over the zonelist that is allowed to spill to remote nodes after the local fairness pass fails but before entering the slowpath and waking the kswapds. This also gets rid of the GFP_THISNODE exemption from the fairness protocol because the unfair pass is no longer tied to kswapd, which GFP_THISNODE is not allowed to wake up. However, because remote spills can be more frequent now - we prefer them over local kswapd reclaim - the allocation batches on remote nodes could underflow more heavily. When resetting the batches, use atomic_long_read() directly instead of zone_page_state() to calculate the delta as the latter filters negative counter values. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: <stable@kernel.org> [3.12+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/page-writeback.c: do not count anon pages as dirtyable memoryJohannes Weiner2014-01-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | The VM is currently heavily tuned to avoid swapping. Whether that is good or bad is a separate discussion, but as long as the VM won't swap to make room for dirty cache, we can not consider anonymous pages when calculating the amount of dirtyable memory, the baseline to which dirty_background_ratio and dirty_ratio are applied. A simple workload that occupies a significant size (40+%, depending on memory layout, storage speeds etc.) of memory with anon/tmpfs pages and uses the remainder for a streaming writer demonstrates this problem. In that case, the actual cache pages are a small fraction of what is considered dirtyable overall, which results in an relatively large portion of the cache pages to be dirtied. As kswapd starts rotating these, random tasks enter direct reclaim and stall on IO. Only consider free pages and file pages dirtyable. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Tejun Heo <tj@kernel.org> Tested-by: Tejun Heo <tj@kernel.org> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Wu Fengguang <fengguang.wu@intel.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: improve documentation of page_orderMel Gorman2014-01-23
| | | | | | | | | | | | | | | | | Developers occasionally try and optimise PFN scanners by using page_order but miss that in general it requires zone->lock. This has happened twice for compaction.c and rejected both times. This patch clarifies the documentation of page_order and adds a note to compaction.c why page_order is not used. [akpm@linux-foundation.org: tweaks] [lauraa@codeaurora.org: Corrected a page_zone(page)->lock reference] Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rafael Aquini <aquini@redhat.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Laura Abbott <lauraa@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: show message when updating min_free_kbytes in thpHan Pingtian2014-01-23
| | | | | | | | | | | | | | | | | | | | | min_free_kbytes may be raised during THP's initialization. Sometimes, this will change the value which was set by the user. Showing this message will clarify this confusion. Only show this message when changing a value which was set by the user according to Michal Hocko's suggestion. Show the old value of min_free_kbytes according to Dave Hansen's suggestion. This will give user the chance to restore old value of min_free_kbytes. Signed-off-by: Han Pingtian <hanpt@linux.vnet.ibm.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: dump page when hitting a VM_BUG_ON using VM_BUG_ON_PAGESasha Levin2014-01-23
| | | | | | | | | | | | | | | | | | | | Most of the VM_BUG_ON assertions are performed on a page. Usually, when one of these assertions fails we'll get a BUG_ON with a call stack and the registers. I've recently noticed based on the requests to add a small piece of code that dumps the page to various VM_BUG_ON sites that the page dump is quite useful to people debugging issues in mm. This patch adds a VM_BUG_ON_PAGE(cond, page) which beyond doing what VM_BUG_ON() does, also dumps the page before executing the actual BUG_ON. [akpm@linux-foundation.org: fix up includes] Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: thp: __get_page_tail_foll() can use get_huge_page_tail()Oleg Nesterov2014-01-21
| | | | | | | | | | | | | | | Cleanup. Change __get_page_tail_foll() to use get_huge_page_tail() to avoid the code duplication. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Dave Jones <davej@redhat.com> Cc: Darren Hart <dvhart@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Acked-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: tail page refcounting optimization for slab and hugetlbfsAndrea Arcangeli2014-01-21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This skips the _mapcount mangling for slab and hugetlbfs pages. The main trouble in doing this is to guarantee that PageSlab and PageHeadHuge remains constant for all get_page/put_page run on the tail of slab or hugetlbfs compound pages. Otherwise if they're set during get_page but not set during put_page, the _mapcount of the tail page would underflow. PageHeadHuge will remain true until the compound page is released and enters the buddy allocator so it won't risk to change even if the tail page is the last reference left on the page. PG_slab instead is cleared before the slab frees the head page with put_page, so if the tail pin is released after the slab freed the page, we would have a problem. But in the slab case the tail pin cannot be the last reference left on the page. This is because the slab code is free to reuse the compound page after a kfree/kmem_cache_free without having to check if there's any tail pin left. In turn all tail pins must be always released while the head is still pinned by the slab code and so we know PG_slab will be still set too. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com> Cc: Pravin Shelar <pshelar@nicira.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Ben Hutchings <bhutchings@solarflare.com> Cc: Christoph Lameter <cl@linux.com> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>