From 1e01979c8f502ac13e3cdece4f38712c5944e6e8 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Tue, 12 Jul 2011 09:45:34 +0200 Subject: x86, numa: Implement pfn -> nid mapping granularity check SPARSEMEM w/o VMEMMAP and DISCONTIGMEM, both used only on 32bit, use sections array to map pfn to nid which is limited in granularity. If NUMA nodes are laid out such that the mapping cannot be accurate, boot will fail triggering BUG_ON() in mminit_verify_page_links(). On 32bit, it's 512MiB w/ PAE and SPARSEMEM. This seems to have been granular enough until commit 2706a0bf7b (x86, NUMA: Enable CONFIG_AMD_NUMA on 32bit too). Apparently, there is a machine which aligns NUMA nodes to 128MiB and has only AMD NUMA but not SRAT. This led to the following BUG_ON(). On node 0 totalpages: 2096615 DMA zone: 32 pages used for memmap DMA zone: 0 pages reserved DMA zone: 3927 pages, LIFO batch:0 Normal zone: 1740 pages used for memmap Normal zone: 220978 pages, LIFO batch:31 HighMem zone: 16405 pages used for memmap HighMem zone: 1853533 pages, LIFO batch:31 BUG: Int 6: CR2 (null) EDI (null) ESI 00000002 EBP 00000002 ESP c1543ecc EBX f2400000 EDX 00000006 ECX (null) EAX 00000001 err (null) EIP c16209aa CS 00000060 flg 00010002 Stack: f2400000 00220000 f7200800 c1620613 00220000 01000000 04400000 00238000 (null) f7200000 00000002 f7200b58 f7200800 c1620929 000375fe (null) f7200b80 c16395f0 00200a02 f7200a80 (null) 000375fe 00000002 (null) Pid: 0, comm: swapper Not tainted 2.6.39-rc5-00181-g2706a0b #17 Call Trace: [] ? early_fault+0x2e/0x2e [] ? mminit_verify_page_links+0x12/0x42 [] ? memmap_init_zone+0xaf/0x10c [] ? free_area_init_node+0x2b9/0x2e3 [] ? free_area_init_nodes+0x3f2/0x451 [] ? paging_init+0x112/0x118 [] ? setup_arch+0x791/0x82f [] ? start_kernel+0x6a/0x257 This patch implements node_map_pfn_alignment() which determines maximum internode alignment and update numa_register_memblks() to reject NUMA configuration if alignment exceeds the pfn -> nid mapping granularity of the memory model as determined by PAGES_PER_SECTION. This makes the problematic machine boot w/ flatmem by rejecting the NUMA config and provides protection against crazy NUMA configurations. Signed-off-by: Tejun Heo Link: http://lkml.kernel.org/r/20110712074534.GB2872@htj.dyndns.org LKML-Reference: <20110628174613.GP478@escobedo.osrc.amd.com> Reported-and-Tested-by: Hans Rosenfeld Cc: Conny Seidel Signed-off-by: H. Peter Anvin --- include/linux/mm.h | 1 + 1 file changed, 1 insertion(+) (limited to 'include/linux/mm.h') diff --git a/include/linux/mm.h b/include/linux/mm.h index 9670f71d7be9..c70a326b8f26 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1313,6 +1313,7 @@ extern void remove_active_range(unsigned int nid, unsigned long start_pfn, unsigned long end_pfn); extern void remove_all_active_ranges(void); void sort_node_map(void); +unsigned long node_map_pfn_alignment(void); unsigned long __absent_pages_in_range(int nid, unsigned long start_pfn, unsigned long end_pfn); extern unsigned long absent_pages_in_range(unsigned long start_pfn, -- cgit v1.2.3 From e9299f5058595a655c3b207cda9635e28b9197e6 Mon Sep 17 00:00:00 2001 From: Dave Chinner Date: Fri, 8 Jul 2011 14:14:37 +1000 Subject: vmscan: add customisable shrinker batch size For shrinkers that have their own cond_resched* calls, having shrink_slab break the work down into small batches is not paticularly efficient. Add a custom batchsize field to the struct shrinker so that shrinkers can use a larger batch size if they desire. A value of zero (uninitialised) means "use the default", so behaviour is unchanged by this patch. Signed-off-by: Dave Chinner Signed-off-by: Al Viro --- include/linux/mm.h | 1 + 1 file changed, 1 insertion(+) (limited to 'include/linux/mm.h') diff --git a/include/linux/mm.h b/include/linux/mm.h index 9670f71d7be9..9b9777ac726d 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1150,6 +1150,7 @@ struct shrink_control { struct shrinker { int (*shrink)(struct shrinker *, struct shrink_control *sc); int seeks; /* seeks to recreate an obj */ + long batch; /* reclaim batch size, 0 = default */ /* These are for internal use */ struct list_head list; -- cgit v1.2.3 From b0d40c92adafde7c2d81203ce7c1c69275f41140 Mon Sep 17 00:00:00 2001 From: Dave Chinner Date: Fri, 8 Jul 2011 14:14:42 +1000 Subject: superblock: introduce per-sb cache shrinker infrastructure With context based shrinkers, we can implement a per-superblock shrinker that shrinks the caches attached to the superblock. We currently have global shrinkers for the inode and dentry caches that split up into per-superblock operations via a coarse proportioning method that does not batch very well. The global shrinkers also have a dependency - dentries pin inodes - so we have to be very careful about how we register the global shrinkers so that the implicit call order is always correct. With a per-sb shrinker callout, we can encode this dependency directly into the per-sb shrinker, hence avoiding the need for strictly ordering shrinker registrations. We also have no need for any proportioning code for the shrinker subsystem already provides this functionality across all shrinkers. Allowing the shrinker to operate on a single superblock at a time means that we do less superblock list traversals and locking and reclaim should batch more effectively. This should result in less CPU overhead for reclaim and potentially faster reclaim of items from each filesystem. Signed-off-by: Dave Chinner Signed-off-by: Al Viro --- include/linux/mm.h | 40 +--------------------------------------- 1 file changed, 1 insertion(+), 39 deletions(-) (limited to 'include/linux/mm.h') diff --git a/include/linux/mm.h b/include/linux/mm.h index 9b9777ac726d..e3a1a9eec0b1 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -15,6 +15,7 @@ #include #include #include +#include struct mempolicy; struct anon_vma; @@ -1121,45 +1122,6 @@ static inline void sync_mm_rss(struct task_struct *task, struct mm_struct *mm) } #endif -/* - * This struct is used to pass information from page reclaim to the shrinkers. - * We consolidate the values for easier extention later. - */ -struct shrink_control { - gfp_t gfp_mask; - - /* How many slab objects shrinker() should scan and try to reclaim */ - unsigned long nr_to_scan; -}; - -/* - * A callback you can register to apply pressure to ageable caches. - * - * 'sc' is passed shrink_control which includes a count 'nr_to_scan' - * and a 'gfpmask'. It should look through the least-recently-used - * 'nr_to_scan' entries and attempt to free them up. It should return - * the number of objects which remain in the cache. If it returns -1, it means - * it cannot do any scanning at this time (eg. there is a risk of deadlock). - * - * The 'gfpmask' refers to the allocation we are currently trying to - * fulfil. - * - * Note that 'shrink' will be passed nr_to_scan == 0 when the VM is - * querying the cache size, so a fastpath for that case is appropriate. - */ -struct shrinker { - int (*shrink)(struct shrinker *, struct shrink_control *sc); - int seeks; /* seeks to recreate an obj */ - long batch; /* reclaim batch size, 0 = default */ - - /* These are for internal use */ - struct list_head list; - long nr; /* objs pending delete */ -}; -#define DEFAULT_SEEKS 2 /* A good number if you don't know better. */ -extern void register_shrinker(struct shrinker *); -extern void unregister_shrinker(struct shrinker *); - int vma_wants_writenotify(struct vm_area_struct *vma); extern pte_t *__get_locked_pte(struct mm_struct *mm, unsigned long addr, -- cgit v1.2.3 From 33dd4e0ec91138c3d80e790c08a3db47426c81f2 Mon Sep 17 00:00:00 2001 From: Ian Campbell Date: Mon, 25 Jul 2011 17:11:51 -0700 Subject: mm: make some struct page's const These uses are read-only and in a subsequent patch I have a const struct page in my hand... [akpm@linux-foundation.org: fix warnings in lowmem_page_address()] Signed-off-by: Ian Campbell Cc: Rik van Riel Cc: Andrea Arcangeli Cc: Mel Gorman Cc: Michel Lespinasse Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mm.h | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) (limited to 'include/linux/mm.h') diff --git a/include/linux/mm.h b/include/linux/mm.h index 8a45ad22a170..6321e840e21d 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -637,7 +637,7 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma) #define SECTIONS_MASK ((1UL << SECTIONS_WIDTH) - 1) #define ZONEID_MASK ((1UL << ZONEID_SHIFT) - 1) -static inline enum zone_type page_zonenum(struct page *page) +static inline enum zone_type page_zonenum(const struct page *page) { return (page->flags >> ZONES_PGSHIFT) & ZONES_MASK; } @@ -665,15 +665,15 @@ static inline int zone_to_nid(struct zone *zone) } #ifdef NODE_NOT_IN_PAGE_FLAGS -extern int page_to_nid(struct page *page); +extern int page_to_nid(const struct page *page); #else -static inline int page_to_nid(struct page *page) +static inline int page_to_nid(const struct page *page) { return (page->flags >> NODES_PGSHIFT) & NODES_MASK; } #endif -static inline struct zone *page_zone(struct page *page) +static inline struct zone *page_zone(const struct page *page) { return &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)]; } @@ -718,9 +718,9 @@ static inline void set_page_links(struct page *page, enum zone_type zone, */ #include -static __always_inline void *lowmem_page_address(struct page *page) +static __always_inline void *lowmem_page_address(const struct page *page) { - return __va(PFN_PHYS(page_to_pfn(page))); + return __va(PFN_PHYS(page_to_pfn((struct page *)page))); } #if defined(CONFIG_HIGHMEM) && !defined(WANT_PAGE_VIRTUAL) -- cgit v1.2.3 From c27fe4c8942d3ca715986f79cc26f44608d7d9fb Mon Sep 17 00:00:00 2001 From: KOSAKI Motohiro Date: Mon, 25 Jul 2011 17:12:10 -0700 Subject: pagewalk: add locking-rule comments Originally, walk_hugetlb_range() didn't require a caller take any lock. But commit d33b9f45bd ("mm: hugetlb: fix hugepage memory leak in walk_page_range") changed its rule. Because it added find_vma() call in walk_hugetlb_range(). Any locking-rule change commit should write a doc too. [akpm@linux-foundation.org: clarify comment] Signed-off-by: KOSAKI Motohiro Cc: Naoya Horiguchi Cc: Hiroyuki Kamezawa Cc: Andrea Arcangeli Cc: Matt Mackall Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mm.h | 2 ++ 1 file changed, 2 insertions(+) (limited to 'include/linux/mm.h') diff --git a/include/linux/mm.h b/include/linux/mm.h index 6321e840e21d..3c5505a3bc49 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -911,6 +911,8 @@ unsigned long unmap_vmas(struct mmu_gather *tlb, * @pte_entry: if set, called for each non-empty PTE (4th-level) entry * @pte_hole: if set, called for each hole at all levels * @hugetlb_entry: if set, called for each hugetlb entry + * *Caution*: The caller must hold mmap_sem() if @hugetlb_entry + * is used. * * (see walk_page_range for more details) */ -- cgit v1.2.3 From 85821aab39b3403a8b5731812a930b78684d1642 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Mon, 25 Jul 2011 17:12:23 -0700 Subject: mm: truncate functions are in truncate.c Correct comment on truncate_inode_pages*() in linux/mm.h; and remove declaration of page_unuse(), it didn't exist even in 2.2.26 or 2.4.0! Signed-off-by: Hugh Dickins Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mm.h | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) (limited to 'include/linux/mm.h') diff --git a/include/linux/mm.h b/include/linux/mm.h index 3c5505a3bc49..3cccd053850f 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1411,8 +1411,7 @@ extern int do_munmap(struct mm_struct *, unsigned long, size_t); extern unsigned long do_brk(unsigned long, unsigned long); -/* filemap.c */ -extern unsigned long page_unuse(struct page *); +/* truncate.c */ extern void truncate_inode_pages(struct address_space *, loff_t); extern void truncate_inode_pages_range(struct address_space *, loff_t lstart, loff_t lend); -- cgit v1.2.3 From 2efaca927f5cd7ecd0f1554b8f9b6a9a2c329c03 Mon Sep 17 00:00:00 2001 From: Benjamin Herrenschmidt Date: Mon, 25 Jul 2011 17:12:32 -0700 Subject: mm/futex: fix futex writes on archs with SW tracking of dirty & young I haven't reproduced it myself but the fail scenario is that on such machines (notably ARM and some embedded powerpc), if you manage to hit that futex path on a writable page whose dirty bit has gone from the PTE, you'll livelock inside the kernel from what I can tell. It will go in a loop of trying the atomic access, failing, trying gup to "fix it up", getting succcess from gup, go back to the atomic access, failing again because dirty wasn't fixed etc... So I think you essentially hang in the kernel. The scenario is probably rare'ish because affected architecture are embedded and tend to not swap much (if at all) so we probably rarely hit the case where dirty is missing or young is missing, but I think Shan has a piece of SW that can reliably reproduce it using a shared writable mapping & fork or something like that. On archs who use SW tracking of dirty & young, a page without dirty is effectively mapped read-only and a page without young unaccessible in the PTE. Additionally, some architectures might lazily flush the TLB when relaxing write protection (by doing only a local flush), and expect a fault to invalidate the stale entry if it's still present on another processor. The futex code assumes that if the "in_atomic()" access -EFAULT's, it can "fix it up" by causing get_user_pages() which would then be equivalent to taking the fault. However that isn't the case. get_user_pages() will not call handle_mm_fault() in the case where the PTE seems to have the right permissions, regardless of the dirty and young state. It will eventually update those bits ... in the struct page, but not in the PTE. Additionally, it will not handle the lazy TLB flushing that can be required by some architectures in the fault case. Basically, gup is the wrong interface for the job. The patch provides a more appropriate one which boils down to just calling handle_mm_fault() since what we are trying to do is simulate a real page fault. The futex code currently attempts to write to user memory within a pagefault disabled section, and if that fails, tries to fix it up using get_user_pages(). This doesn't work on archs where the dirty and young bits are maintained by software, since they will gate access permission in the TLB, and will not be updated by gup(). In addition, there's an expectation on some archs that a spurious write fault triggers a local TLB flush, and that is missing from the picture as well. I decided that adding those "features" to gup() would be too much for this already too complex function, and instead added a new simpler fixup_user_fault() which is essentially a wrapper around handle_mm_fault() which the futex code can call. [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: fix some nits Darren saw, fiddle comment layout] Signed-off-by: Benjamin Herrenschmidt Reported-by: Shan Hai Tested-by: Shan Hai Cc: David Laight Acked-by: Peter Zijlstra Cc: Darren Hart Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mm.h | 2 ++ 1 file changed, 2 insertions(+) (limited to 'include/linux/mm.h') diff --git a/include/linux/mm.h b/include/linux/mm.h index 3cccd053850f..3172a1c0f08e 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -988,6 +988,8 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, int get_user_pages_fast(unsigned long start, int nr_pages, int write, struct page **pages); struct page *get_dump_page(unsigned long addr); +extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm, + unsigned long address, unsigned int fault_flags); extern int try_to_release_page(struct page * page, gfp_t gfp_mask); extern void do_invalidatepage(struct page *page, unsigned long offset); -- cgit v1.2.3 From ea8f5fb8a71fddaf5f3a17100d3247855701f732 Mon Sep 17 00:00:00 2001 From: Huang Ying Date: Wed, 13 Jul 2011 13:14:27 +0800 Subject: HWPoison: add memory_failure_queue() memory_failure() is the entry point for HWPoison memory error recovery. It must be called in process context. But commonly hardware memory errors are notified via MCE or NMI, so some delayed execution mechanism must be used. In MCE handler, a work queue + ring buffer mechanism is used. In addition to MCE, now APEI (ACPI Platform Error Interface) GHES (Generic Hardware Error Source) can be used to report memory errors too. To add support to APEI GHES memory recovery, a mechanism similar to that of MCE is implemented. memory_failure_queue() is the new entry point that can be called in IRQ context. The next step is to make MCE handler uses this interface too. Signed-off-by: Huang Ying Cc: Andi Kleen Cc: Wu Fengguang Cc: Andrew Morton Signed-off-by: Len Brown --- include/linux/mm.h | 1 + 1 file changed, 1 insertion(+) (limited to 'include/linux/mm.h') diff --git a/include/linux/mm.h b/include/linux/mm.h index 9670f71d7be9..303108eed28c 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1633,6 +1633,7 @@ enum mf_flags { }; extern void memory_failure(unsigned long pfn, int trapno); extern int __memory_failure(unsigned long pfn, int trapno, int flags); +extern void memory_failure_queue(unsigned long pfn, int trapno, int flags); extern int unpoison_memory(unsigned long pfn); extern int sysctl_memory_failure_early_kill; extern int sysctl_memory_failure_recovery; -- cgit v1.2.3