summaryrefslogtreecommitdiff
path: root/include/linux (follow)
Commit message (Collapse)AuthorAge
* ieee802154: correct ieee802154_is_valid_psdu_lenAlexander Aring2015-02-14
| | | | | | | | | This patch corrects the ieee802154_is_valid_psdu_len function that this function also checks on reserved values 6-8 for validation the psdu length. Signed-off-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
* Merge branch 'akpm' (patches from Andrew)Linus Torvalds2015-02-11
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge second set of updates from Andrew Morton: "More of MM" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (83 commits) mm/nommu.c: fix arithmetic overflow in __vm_enough_memory() mm/mmap.c: fix arithmetic overflow in __vm_enough_memory() vmstat: Reduce time interval to stat update on idle cpu mm/page_owner.c: remove unnecessary stack_trace field Documentation/filesystems/proc.txt: describe /proc/<pid>/map_files mm: incorporate read-only pages into transparent huge pages vmstat: do not use deferrable delayed work for vmstat_update mm: more aggressive page stealing for UNMOVABLE allocations mm: always steal split buddies in fallback allocations mm: when stealing freepages, also take pages created by splitting buddy page mincore: apply page table walker on do_mincore() mm: /proc/pid/clear_refs: avoid split_huge_page() mm: pagewalk: fix misbehavior of walk_page_range for vma(VM_PFNMAP) mempolicy: apply page table walker on queue_pages_range() arch/powerpc/mm/subpage-prot.c: use walk->vma and walk_page_vma() memcg: cleanup preparation for page table walk numa_maps: remove numa_maps->vma numa_maps: fix typo in gather_hugetbl_stats pagemap: use walk->vma instead of calling find_vma() clear_refs: remove clear_refs_private->vma and introduce clear_refs_test_walk() ...
| * mm/page_owner.c: remove unnecessary stack_trace fieldSergei Rogachev2015-02-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Page owner uses the page_ext structure to keep meta-information for every page in the system. The structure also contains a field of type 'struct stack_trace', page owner uses this field during invocation of the function save_stack_trace. It is easy to notice that keeping a copy of this structure for every page in the system is very inefficiently in terms of memory. The patch removes this unnecessary field of page_ext and forces page owner to use a stack_trace structure allocated on the stack. [akpm@linux-foundation.org: use struct initializers] Signed-off-by: Sergei Rogachev <rogachevsergei@gmail.com> Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * pagewalk: add walk_page_vma()Naoya Horiguchi2015-02-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduce walk_page_vma(), which is useful for the callers which want to walk over a given vma. It's used by later patches. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * pagewalk: improve vma handlingNaoya Horiguchi2015-02-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Current implementation of page table walker has a fundamental problem in vma handling, which started when we tried to handle vma(VM_HUGETLB). Because it's done in pgd loop, considering vma boundary makes code complicated and bug-prone. From the users viewpoint, some user checks some vma-related condition to determine whether the user really does page walk over the vma. In order to solve these, this patch moves vma check outside pgd loop and introduce a new callback ->test_walk(). Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * mm/pagewalk: remove pgd_entry() and pud_entry()Naoya Horiguchi2015-02-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently no user of page table walker sets ->pgd_entry() or ->pud_entry(), so checking their existence in each loop is just wasting CPU cycle. So let's remove it to reduce overhead. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * mm: gup: kvm use get_user_pages_unlockedAndrea Arcangeli2015-02-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Use the more generic get_user_pages_unlocked which has the additional benefit of passing FAULT_FLAG_ALLOW_RETRY at the very first page fault (which allows the first page fault in an unmapped area to be always able to block indefinitely by being allowed to release the mmap_sem). Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Peter Feiner <pfeiner@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * mm: gup: add __get_user_pages_unlocked to customize gup_flagsAndrea Arcangeli2015-02-11
| | | | | | | | | | | | | | | | | | | | | | | | | | Some callers (like KVM) may want to set the gup_flags like FOLL_HWPOSION to get a proper -EHWPOSION retval instead of -EFAULT to take a more appropriate action if get_user_pages runs into a memory failure. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Peter Feiner <pfeiner@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * mm: gup: add get_user_pages_locked and get_user_pages_unlockedAndrea Arcangeli2015-02-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | FAULT_FOLL_ALLOW_RETRY allows the page fault to drop the mmap_sem for reading to reduce the mmap_sem contention (for writing), like while waiting for I/O completion. The problem is that right now practically no get_user_pages call uses FAULT_FOLL_ALLOW_RETRY, so we're not leveraging that nifty feature. Andres fixed it for the KVM page fault. However get_user_pages_fast remains uncovered, and 99% of other get_user_pages aren't using it either (the only exception being FOLL_NOWAIT in KVM which is really nonblocking and in fact it doesn't even release the mmap_sem). So this patchsets extends the optimization Andres did in the KVM page fault to the whole kernel. It makes most important places (including gup_fast) to use FAULT_FOLL_ALLOW_RETRY to reduce the mmap_sem hold times during I/O. The only few places that remains uncovered are drivers like v4l and other exceptions that tends to work on their own memory and they're not working on random user memory (for example like O_DIRECT that uses gup_fast and is fully covered by this patch). A follow up patch should probably also add a printk_once warning to get_user_pages that should go obsolete and be phased out eventually. The "vmas" parameter of get_user_pages makes it fundamentally incompatible with FAULT_FOLL_ALLOW_RETRY (vmas array becomes meaningless the moment the mmap_sem is released). While this is just an optimization, this becomes an absolute requirement for the userfaultfd feature http://lwn.net/Articles/615086/ . The userfaultfd allows to block the page fault, and in order to do so I need to drop the mmap_sem first. So this patch also ensures that all memory where userfaultfd could be registered by KVM, the very first fault (no matter if it is a regular page fault, or a get_user_pages) always has FAULT_FOLL_ALLOW_RETRY set. Then the userfaultfd blocks and it is waken only when the pagetable is already mapped. The second fault attempt after the wakeup doesn't need FAULT_FOLL_ALLOW_RETRY, so it's ok to retry without it. This patch (of 5): We can leverage the VM_FAULT_RETRY functionality in the page fault paths better by using either get_user_pages_locked or get_user_pages_unlocked. The former allows conversion of get_user_pages invocations that will have to pass a "&locked" parameter to know if the mmap_sem was dropped during the call. Example from: down_read(&mm->mmap_sem); do_something() get_user_pages(tsk, mm, ..., pages, NULL); up_read(&mm->mmap_sem); to: int locked = 1; down_read(&mm->mmap_sem); do_something() get_user_pages_locked(tsk, mm, ..., pages, &locked); if (locked) up_read(&mm->mmap_sem); The latter is suitable only as a drop in replacement of the form: down_read(&mm->mmap_sem); get_user_pages(tsk, mm, ..., pages, NULL); up_read(&mm->mmap_sem); into: get_user_pages_unlocked(tsk, mm, ..., pages); Where tsk, mm, the intermediate "..." paramters and "pages" can be any value as before. Just the last parameter of get_user_pages (vmas) must be NULL for get_user_pages_locked|unlocked to be usable (the latter original form wouldn't have been safe anyway if vmas wasn't null, for the former we just make it explicit by dropping the parameter). If vmas is not NULL these two methods cannot be used. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com> Reviewed-by: Peter Feiner <pfeiner@google.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * mm/mempolicy.c: merge alloc_hugepage_vma to alloc_pages_vmaVlastimil Babka2015-02-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The previous commit ("mm/thp: Allocate transparent hugepages on local node") introduced alloc_hugepage_vma() to mm/mempolicy.c to perform a special policy for THP allocations. The function has the same interface as alloc_pages_vma(), shares a lot of boilerplate code and a long comment. This patch merges the hugepage special case into alloc_pages_vma. The extra if condition should be cheap enough price to pay. We also prevent a (however unlikely) race with parallel mems_allowed update, which could make hugepage allocation restart only within the fallback call to alloc_hugepage_vma() and not reconsider the special rule in alloc_hugepage_vma(). Also by making sure mpol_cond_put(pol) is always called before actual allocation attempt, we can use a single exit path within the function. Also update the comment for missing node parameter and obsolete reference to mm_sem. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * mm/thp: allocate transparent hugepages on local nodeAneesh Kumar K.V2015-02-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This make sure that we try to allocate hugepages from local node if allowed by mempolicy. If we can't, we fallback to small page allocation based on mempolicy. This is based on the observation that allocating pages on local node is more beneficial than allocating hugepages on remote node. With this patch applied we may find transparent huge page allocation failures if the current node doesn't have enough freee hugepages. Before this patch such failures result in us retrying the allocation on other nodes in the numa node mask. [akpm@linux-foundation.org: fix comment, add CONFIG_TRANSPARENT_HUGEPAGE dependency] Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * mm/compaction: add tracepoint to observe behaviour of compaction deferJoonsoo Kim2015-02-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Compaction deferring logic is heavy hammer that block the way to the compaction. It doesn't consider overall system state, so it could prevent user from doing compaction falsely. In other words, even if system has enough range of memory to compact, compaction would be skipped due to compaction deferring logic. This patch add new tracepoint to understand work of deferring logic. This will also help to check compaction success and fail. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * mm/compaction: more trace to understand when/why compaction start/finishJoonsoo Kim2015-02-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It is not well analyzed that when/why compaction start/finish or not. With these new tracepoints, we can know much more about start/finish reason of compaction. I can find following bug with these tracepoint. http://www.spinics.net/lists/linux-mm/msg81582.html Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * mm/compaction: enhance tracepoint output for compaction begin/endJoonsoo Kim2015-02-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We now have tracepoint for begin event of compaction and it prints start position of both scanners, but, tracepoint for end event of compaction doesn't print finish position of both scanners. It'd be also useful to know finish position of both scanners so this patch add it. It will help to find odd behavior or problem on compaction internal logic. And mode is added to both begin/end tracepoint output, since according to mode, compaction behavior is quite different. And lastly, status format is changed to string rather than status number for readability. [akpm@linux-foundation.org: fix sparse warning] Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * mm: account pmd page tables to the processKirill A. Shutemov2015-02-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Dave noticed that unprivileged process can allocate significant amount of memory -- >500 MiB on x86_64 -- and stay unnoticed by oom-killer and memory cgroup. The trick is to allocate a lot of PMD page tables. Linux kernel doesn't account PMD tables to the process, only PTE. The use-cases below use few tricks to allocate a lot of PMD page tables while keeping VmRSS and VmPTE low. oom_score for the process will be 0. #include <errno.h> #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <sys/mman.h> #include <sys/prctl.h> #define PUD_SIZE (1UL << 30) #define PMD_SIZE (1UL << 21) #define NR_PUD 130000 int main(void) { char *addr = NULL; unsigned long i; prctl(PR_SET_THP_DISABLE); for (i = 0; i < NR_PUD ; i++) { addr = mmap(addr + PUD_SIZE, PUD_SIZE, PROT_WRITE|PROT_READ, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0); if (addr == MAP_FAILED) { perror("mmap"); break; } *addr = 'x'; munmap(addr, PMD_SIZE); mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ, MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0); if (addr == MAP_FAILED) perror("re-mmap"), exit(1); } printf("PID %d consumed %lu KiB in PMD page tables\n", getpid(), i * 4096 >> 10); return pause(); } The patch addresses the issue by account PMD tables to the process the same way we account PTE. The main place where PMD tables is accounted is __pmd_alloc() and free_pmd_range(). But there're few corner cases: - HugeTLB can share PMD page tables. The patch handles by accounting the table to all processes who share it. - x86 PAE pre-allocates few PMD tables on fork. - Architectures with FIRST_USER_ADDRESS > 0. We need to adjust sanity check on exit(2). Accounting only happens on configuration where PMD page table's level is present (PMD is not folded). As with nr_ptes we use per-mm counter. The counter value is used to calculate baseline for badness score by oom-killer. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: David Rientjes <rientjes@google.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * oom, PM: make OOM detection in the freezer path racelessMichal Hocko2015-02-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 5695be142e20 ("OOM, PM: OOM killed task shouldn't escape PM suspend") has left a race window when OOM killer manages to note_oom_kill after freeze_processes checks the counter. The race window is quite small and really unlikely and partial solution deemed sufficient at the time of submission. Tejun wasn't happy about this partial solution though and insisted on a full solution. That requires the full OOM and freezer's task freezing exclusion, though. This is done by this patch which introduces oom_sem RW lock and turns oom_killer_disable() into a full OOM barrier. oom_killer_disabled check is moved from the allocation path to the OOM level and we take oom_sem for reading for both the check and the whole OOM invocation. oom_killer_disable() takes oom_sem for writing so it waits for all currently running OOM killer invocations. Then it disable all the further OOMs by setting oom_killer_disabled and checks for any oom victims. Victims are counted via mark_tsk_oom_victim resp. unmark_oom_victim. The last victim wakes up all waiters enqueued by oom_killer_disable(). Therefore this function acts as the full OOM barrier. The page fault path is covered now as well although it was assumed to be safe before. As per Tejun, "We used to have freezing points deep in file system code which may be reacheable from page fault." so it would be better and more robust to not rely on freezing points here. Same applies to the memcg OOM killer. out_of_memory tells the caller whether the OOM was allowed to trigger and the callers are supposed to handle the situation. The page allocation path simply fails the allocation same as before. The page fault path will retry the fault (more on that later) and Sysrq OOM trigger will simply complain to the log. Normally there wouldn't be any unfrozen user tasks after try_to_freeze_tasks so the function will not block. But if there was an OOM killer racing with try_to_freeze_tasks and the OOM victim didn't finish yet then we have to wait for it. This should complete in a finite time, though, because - the victim cannot loop in the page fault handler (it would die on the way out from the exception) - it cannot loop in the page allocator because all the further allocation would fail and __GFP_NOFAIL allocations are not acceptable at this stage - it shouldn't be blocked on any locks held by frozen tasks (try_to_freeze expects lockless context) and kernel threads and work queues are not frozen yet Signed-off-by: Michal Hocko <mhocko@suse.cz> Suggested-by: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * oom: add helpers for setting and clearing TIF_MEMDIEMichal Hocko2015-02-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patchset addresses a race which was described in the changelog for 5695be142e20 ("OOM, PM: OOM killed task shouldn't escape PM suspend"): : PM freezer relies on having all tasks frozen by the time devices are : getting frozen so that no task will touch them while they are getting : frozen. But OOM killer is allowed to kill an already frozen task in order : to handle OOM situtation. In order to protect from late wake ups OOM : killer is disabled after all tasks are frozen. This, however, still keeps : a window open when a killed task didn't manage to die by the time : freeze_processes finishes. The original patch hasn't closed the race window completely because that would require a more complex solution as it can be seen by this patchset. The primary motivation was to close the race condition between OOM killer and PM freezer _completely_. As Tejun pointed out, even though the race condition is unlikely the harder it would be to debug weird bugs deep in the PM freezer when the debugging options are reduced considerably. I can only speculate what might happen when a task is still runnable unexpectedly. On a plus side and as a side effect the oom enable/disable has a better (full barrier) semantic without polluting hot paths. I have tested the series in KVM with 100M RAM: - many small tasks (20M anon mmap) which are triggering OOM continually - s2ram which resumes automatically is triggered in a loop echo processors > /sys/power/pm_test while true do echo mem > /sys/power/state sleep 1s done - simple module which allocates and frees 20M in 8K chunks. If it sees freezing(current) then it tries another round of allocation before calling try_to_freeze - debugging messages of PM stages and OOM killer enable/disable/fail added and unmark_oom_victim is delayed by 1s after it clears TIF_MEMDIE and before it wakes up waiters. - rebased on top of the current mmotm which means some necessary updates in mm/oom_kill.c. mark_tsk_oom_victim is now called under task_lock but I think this should be OK because __thaw_task shouldn't interfere with any locking down wake_up_process. Oleg? As expected there are no OOM killed tasks after oom is disabled and allocations requested by the kernel thread are failing after all the tasks are frozen and OOM disabled. I wasn't able to catch a race where oom_killer_disable would really have to wait but I kinda expected the race is really unlikely. [ 242.609330] Killed process 2992 (mem_eater) total-vm:24412kB, anon-rss:2164kB, file-rss:4kB [ 243.628071] Unmarking 2992 OOM victim. oom_victims: 1 [ 243.636072] (elapsed 2.837 seconds) done. [ 243.641985] Trying to disable OOM killer [ 243.643032] Waiting for concurent OOM victims [ 243.644342] OOM killer disabled [ 243.645447] Freezing remaining freezable tasks ... (elapsed 0.005 seconds) done. [ 243.652983] Suspending console(s) (use no_console_suspend to debug) [ 243.903299] kmem_eater: page allocation failure: order:1, mode:0x204010 [...] [ 243.992600] PM: suspend of devices complete after 336.667 msecs [ 243.993264] PM: late suspend of devices complete after 0.660 msecs [ 243.994713] PM: noirq suspend of devices complete after 1.446 msecs [ 243.994717] ACPI: Preparing to enter system sleep state S3 [ 243.994795] PM: Saving platform NVS memory [ 243.994796] Disabling non-boot CPUs ... The first 2 patches are simple cleanups for OOM. They should go in regardless the rest IMO. Patches 3 and 4 are trivial printk -> pr_info conversion and they should go in ditto. The main patch is the last one and I would appreciate acks from Tejun and Rafael. I think the OOM part should be OK (except for __thaw_task vs. task_lock where a look from Oleg would appreciated) but I am not so sure I haven't screwed anything in the freezer code. I have found several surprises there. This patch (of 5): This patch is just a preparatory and it doesn't introduce any functional change. Note: I am utterly unhappy about lowmemory killer abusing TIF_MEMDIE just to wait for the oom victim and to prevent from new killing. This is just a side effect of the flag. The primary meaning is to give the oom victim access to the memory reserves and that shouldn't be necessary here. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * mm: memcontrol: default hierarchy interface for memoryJohannes Weiner2015-02-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduce the basic control files to account, partition, and limit memory using cgroups in default hierarchy mode. This interface versioning allows us to address fundamental design issues in the existing memory cgroup interface, further explained below. The old interface will be maintained indefinitely, but a clearer model and improved workload performance should encourage existing users to switch over to the new one eventually. The control files are thus: - memory.current shows the current consumption of the cgroup and its descendants, in bytes. - memory.low configures the lower end of the cgroup's expected memory consumption range. The kernel considers memory below that boundary to be a reserve - the minimum that the workload needs in order to make forward progress - and generally avoids reclaiming it, unless there is an imminent risk of entering an OOM situation. - memory.high configures the upper end of the cgroup's expected memory consumption range. A cgroup whose consumption grows beyond this threshold is forced into direct reclaim, to work off the excess and to throttle new allocations heavily, but is generally allowed to continue and the OOM killer is not invoked. - memory.max configures the hard maximum amount of memory that the cgroup is allowed to consume before the OOM killer is invoked. - memory.events shows event counters that indicate how often the cgroup was reclaimed while below memory.low, how often it was forced to reclaim excess beyond memory.high, how often it hit memory.max, and how often it entered OOM due to memory.max. This allows users to identify configuration problems when observing a degradation in workload performance. An overcommitted system will have an increased rate of low boundary breaches, whereas increased rates of high limit breaches, maximum hits, or even OOM situations will indicate internally overcommitted cgroups. For existing users of memory cgroups, the following deviations from the current interface are worth pointing out and explaining: - The original lower boundary, the soft limit, is defined as a limit that is per default unset. As a result, the set of cgroups that global reclaim prefers is opt-in, rather than opt-out. The costs for optimizing these mostly negative lookups are so high that the implementation, despite its enormous size, does not even provide the basic desirable behavior. First off, the soft limit has no hierarchical meaning. All configured groups are organized in a global rbtree and treated like equal peers, regardless where they are located in the hierarchy. This makes subtree delegation impossible. Second, the soft limit reclaim pass is so aggressive that it not just introduces high allocation latencies into the system, but also impacts system performance due to overreclaim, to the point where the feature becomes self-defeating. The memory.low boundary on the other hand is a top-down allocated reserve. A cgroup enjoys reclaim protection when it and all its ancestors are below their low boundaries, which makes delegation of subtrees possible. Secondly, new cgroups have no reserve per default and in the common case most cgroups are eligible for the preferred reclaim pass. This allows the new low boundary to be efficiently implemented with just a minor addition to the generic reclaim code, without the need for out-of-band data structures and reclaim passes. Because the generic reclaim code considers all cgroups except for the ones running low in the preferred first reclaim pass, overreclaim of individual groups is eliminated as well, resulting in much better overall workload performance. - The original high boundary, the hard limit, is defined as a strict limit that can not budge, even if the OOM killer has to be called. But this generally goes against the goal of making the most out of the available memory. The memory consumption of workloads varies during runtime, and that requires users to overcommit. But doing that with a strict upper limit requires either a fairly accurate prediction of the working set size or adding slack to the limit. Since working set size estimation is hard and error prone, and getting it wrong results in OOM kills, most users tend to err on the side of a looser limit and end up wasting precious resources. The memory.high boundary on the other hand can be set much more conservatively. When hit, it throttles allocations by forcing them into direct reclaim to work off the excess, but it never invokes the OOM killer. As a result, a high boundary that is chosen too aggressively will not terminate the processes, but instead it will lead to gradual performance degradation. The user can monitor this and make corrections until the minimal memory footprint that still gives acceptable performance is found. In extreme cases, with many concurrent allocations and a complete breakdown of reclaim progress within the group, the high boundary can be exceeded. But even then it's mostly better to satisfy the allocation from the slack available in other groups or the rest of the system than killing the group. Otherwise, memory.max is there to limit this type of spillover and ultimately contain buggy or even malicious applications. - The original control file names are unwieldy and inconsistent in many different ways. For example, the upper boundary hit count is exported in the memory.failcnt file, but an OOM event count has to be manually counted by listening to memory.oom_control events, and lower boundary / soft limit events have to be counted by first setting a threshold for that value and then counting those events. Also, usage and limit files encode their units in the filename. That makes the filenames very long, even though this is not information that a user needs to be reminded of every time they type out those names. To address these naming issues, as well as to signal clearly that the new interface carries a new configuration model, the naming conventions in it necessarily differ from the old interface. - The original limit files indicate the state of an unset limit with a very high number, and a configured limit can be unset by echoing -1 into those files. But that very high number is implementation and architecture dependent and not very descriptive. And while -1 can be understood as an underflow into the highest possible value, -2 or -10M etc. do not work, so it's not inconsistent. memory.low, memory.high, and memory.max will use the string "infinity" to indicate and set the highest possible value. [akpm@linux-foundation.org: use seq_puts() for basic strings] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * mm: page_counter: pull "-1" handling out of page_counter_memparse()Johannes Weiner2015-02-11
| | | | | | | | | | | | | | | | | | | | | | | | | | The unified hierarchy interface for memory cgroups will no longer use "-1" to mean maximum possible resource value. In preparation for this, make the string an argument and let the caller supply it. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * vmscan: force scan offline memory cgroupsVladimir Davydov2015-02-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since commit b2052564e66d ("mm: memcontrol: continue cache reclaim from offlined groups") pages charged to a memory cgroup are not reparented when the cgroup is removed. Instead, they are supposed to be reclaimed in a regular way, along with pages accounted to online memory cgroups. However, an lruvec of an offline memory cgroup will sooner or later get so small that it will be scanned only at low scan priorities (see get_scan_count()). Therefore, if there are enough reclaimable pages in big lruvecs, pages accounted to offline memory cgroups will never be scanned at all, wasting memory. Fix this by unconditionally forcing scanning dead lruvecs from kswapd. [akpm@linux-foundation.org: fix build] Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * mm: microoptimize zonelist operationsVlastimil Babka2015-02-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | next_zones_zonelist() returns a zoneref pointer, as well as a zone pointer via extra parameter. Since the latter can be trivially obtained by dereferencing the former, the overhead of the extra parameter is unjustified. This patch thus removes the zone parameter from next_zones_zonelist(). Both callers happen to be in the same header file, so it's simple to add the zoneref dereference inline. We save some bytes of code size. add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-105 (-105) function old new delta nr_free_zone_pages 129 115 -14 __alloc_pages_nodemask 2300 2285 -15 get_page_from_freelist 2652 2576 -76 add/remove: 0/0 grow/shrink: 1/0 up/down: 10/0 (10) function old new delta try_to_compact_pages 569 579 +10 Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Minchan Kim <minchan@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * mm: reduce try_to_compact_pages parametersVlastimil Babka2015-02-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Expand the usage of the struct alloc_context introduced in the previous patch also for calling try_to_compact_pages(), to reduce the number of its parameters. Since the function is in different compilation unit, we need to move alloc_context definition in the shared mm/internal.h header. With this change we get simpler code and small savings of code size and stack usage: add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-27 (-27) function old new delta __alloc_pages_direct_compact 283 256 -27 add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-13 (-13) function old new delta try_to_compact_pages 582 569 -13 Stack usage of __alloc_pages_direct_compact goes from 24 to none (per scripts/checkstack.pl). Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Minchan Kim <minchan@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * mm/hugetlb: take page table lock in follow_huge_pmd()Naoya Horiguchi2015-02-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We have a race condition between move_pages() and freeing hugepages, where move_pages() calls follow_page(FOLL_GET) for hugepages internally and tries to get its refcount without preventing concurrent freeing. This race crashes the kernel, so this patch fixes it by moving FOLL_GET code for hugepages into follow_huge_pmd() with taking the page table lock. This patch intentionally removes page==NULL check after pte_page. This is justified because pte_page() never returns NULL for any architectures or configurations. This patch changes the behavior of follow_huge_pmd() for tail pages and then tail pages can be pinned/returned. So the caller must be changed to properly handle the returned tail pages. We could have a choice to add the similar locking to follow_huge_(addr|pud) for consistency, but it's not necessary because currently these functions don't support FOLL_GET flag, so let's leave it for future development. Here is the reproducer: $ cat movepages.c #include <stdio.h> #include <stdlib.h> #include <numaif.h> #define ADDR_INPUT 0x700000000000UL #define HPS 0x200000 #define PS 0x1000 int main(int argc, char *argv[]) { int i; int nr_hp = strtol(argv[1], NULL, 0); int nr_p = nr_hp * HPS / PS; int ret; void **addrs; int *status; int *nodes; pid_t pid; pid = strtol(argv[2], NULL, 0); addrs = malloc(sizeof(char *) * nr_p + 1); status = malloc(sizeof(char *) * nr_p + 1); nodes = malloc(sizeof(char *) * nr_p + 1); while (1) { for (i = 0; i < nr_p; i++) { addrs[i] = (void *)ADDR_INPUT + i * PS; nodes[i] = 1; status[i] = 0; } ret = numa_move_pages(pid, nr_p, addrs, nodes, status, MPOL_MF_MOVE_ALL); if (ret == -1) err("move_pages"); for (i = 0; i < nr_p; i++) { addrs[i] = (void *)ADDR_INPUT + i * PS; nodes[i] = 0; status[i] = 0; } ret = numa_move_pages(pid, nr_p, addrs, nodes, status, MPOL_MF_MOVE_ALL); if (ret == -1) err("move_pages"); } return 0; } $ cat hugepage.c #include <stdio.h> #include <sys/mman.h> #include <string.h> #define ADDR_INPUT 0x700000000000UL #define HPS 0x200000 int main(int argc, char *argv[]) { int nr_hp = strtol(argv[1], NULL, 0); char *p; while (1) { p = mmap((void *)ADDR_INPUT, nr_hp * HPS, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0); if (p != (void *)ADDR_INPUT) { perror("mmap"); break; } memset(p, 0, nr_hp * HPS); munmap(p, nr_hp * HPS); } } $ sysctl vm.nr_hugepages=40 $ ./hugepage 10 & $ ./movepages 10 $(pgrep -f hugepage) Fixes: e632a938d914 ("mm: migrate: add hugepage migration code to move_pages()") Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reported-by: Hugh Dickins <hughd@google.com> Cc: James Hogan <james.hogan@imgtec.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Steve Capper <steve.capper@linaro.org> Cc: <stable@vger.kernel.org> [3.12+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * mm: fix typo of MIGRATE_RESERVE in commentBaoquan He2015-02-11
| | | | | | | | | | | | | | | | Found it when I want to jump to the definition of MIGRATE_RESERVE ctags. Signed-off-by: Baoquan He <bhe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * mm: memcontrol: track move_lock state internallyJohannes Weiner2015-02-11
| | | | | | | | | | | | | | | | | | | | | | | | | | The complexity of memcg page stat synchronization is currently leaking into the callsites, forcing them to keep track of the move_lock state and the IRQ flags. Simplify the API by tracking it in the memcg. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * swap: remove unused mem_cgroup_uncharge_swapcache declarationVladimir Davydov2015-02-11
| | | | | | | | | | | | | | | | | | | | | | The body of this function was removed by commit 0a31bc97c80c ("mm: memcontrol: rewrite uncharge API"). Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * mm:add KPF_ZERO_PAGE flag for /proc/kpageflagsWang, Yalin2015-02-11
| | | | | | | | | | | | | | | | | | | | | | | | | | Add KPF_ZERO_PAGE flag for zero_page, so that userspace processes can detect zero_page in /proc/kpageflags, and then do memory analysis more accurately. Signed-off-by: Yalin Wang <yalin.wang@sonymobile.com> Acked-by: Kirill A. Shutemov <kirill@shutemov.name> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * mm: add VM_BUG_ON_PAGE() to page_mapcount()Wang, Yalin2015-02-11
| | | | | | | | | | | | | | | | | | | | | | Add VM_BUG_ON_PAGE() for slab pages. _mapcount is an union with slab struct in struct page, so we must avoid accessing _mapcount if this page is a slab page. Also remove the unneeded bracket. Signed-off-by: Yalin Wang <yalin.wang@sonymobile.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * mm: add fields for compound destructor and order into struct pageKirill A. Shutemov2015-02-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, we use lru.next/lru.prev plus cast to access or set destructor and order of compound page. Let's replace it with explicit fields in struct page. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Jerome Marchand <jmarchan@redhat.com> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | Merge tag 'powerpc-3.20-1' of ↵Linus Torvalds2015-02-11
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux Pull powerpc updates from Michael Ellerman: - Update of all defconfigs - Addition of a bunch of config options to modernise our defconfigs - Some PS3 updates from Geoff - Optimised memcmp for 64 bit from Anton - Fix for kprobes that allows 'perf probe' to work from Naveen - Several cxl updates from Ian & Ryan - Expanded support for the '24x7' PMU from Cody & Sukadev - Freescale updates from Scott: "Highlights include 8xx optimizations, some more work on datapath device tree content, e300 machine check support, t1040 corenet error reporting, and various cleanups and fixes" * tag 'powerpc-3.20-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux: (102 commits) cxl: Add missing return statement after handling AFU errror cxl: Fail AFU initialisation if an invalid configuration record is found cxl: Export optional AFU configuration record in sysfs powerpc/mm: Warn on flushing tlb page in kernel context powerpc/powernv: Add OPAL soft-poweroff routine powerpc/perf/hv-24x7: Document sysfs event description entries powerpc/perf/hv-gpci: add the remaining gpci requests powerpc/perf/{hv-gpci, hv-common}: generate requests with counters annotated powerpc/perf/hv-24x7: parse catalog and populate sysfs with events perf: define EVENT_DEFINE_RANGE_FORMAT_LITE helper perf: add PMU_EVENT_ATTR_STRING() helper perf: provide sysfs_show for struct perf_pmu_events_attr powerpc/kernel: Avoid initializing device-tree pointer twice powerpc: Remove old compile time disabled syscall tracing code powerpc/kernel: Make syscall_exit a local label cxl: Fix device_node reference counting powerpc/mm: bail out early when flushing TLB page powerpc: defconfigs: add MTD_SPI_NOR (new dependency for M25P80) perf/powerpc: reset event hw state when adding it to the PMU powerpc/qe: Use strlcpy() ...
| * | perf: add PMU_EVENT_ATTR_STRING() helperCody P Schafer2015-02-02
| | | | | | | | | | | | | | | | | | | | | | | | | | | Helper for constructing static struct perf_pmu_events_attr s. Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com> Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Acked-by: Jiri Olsa <jolsa@redhat.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
| * | perf: provide sysfs_show for struct perf_pmu_events_attrCody P Schafer2015-02-02
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | (struct perf_pmu_events_attr) is defined in include/linux/perf_event.h, but the only "show" for it is in x86 and contains x86 specific stuff. Make a generic one for those of us who are just using the event_str. Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com> Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Acked-by: Jiri Olsa <jolsa@redhat.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
* | | Merge tag 'arm64-upstream' of ↵Linus Torvalds2015-02-11
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 updates from Catalin Marinas: "arm64 updates for 3.20: - reimplementation of the virtual remapping of UEFI Runtime Services in a way that is stable across kexec - emulation of the "setend" instruction for 32-bit tasks (user endianness switching trapped in the kernel, SCTLR_EL1.E0E bit set accordingly) - compat_sys_call_table implemented in C (from asm) and made it a constant array together with sys_call_table - export CPU cache information via /sys (like other architectures) - DMA API implementation clean-up in preparation for IOMMU support - macros clean-up for KVM - dropped some unnecessary cache+tlb maintenance - CONFIG_ARM64_CPU_SUSPEND clean-up - defconfig update (CPU_IDLE) The EFI changes going via the arm64 tree have been acked by Matt Fleming. There is also a patch adding sys_*stat64 prototypes to include/linux/syscalls.h, acked by Andrew Morton" * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (47 commits) arm64: compat: Remove incorrect comment in compat_siginfo arm64: Fix section mismatch on alloc_init_p[mu]d() arm64: Avoid breakage caused by .altmacro in fpsimd save/restore macros arm64: mm: use *_sect to check for section maps arm64: drop unnecessary cache+tlb maintenance arm64:mm: free the useless initial page table arm64: Enable CPU_IDLE in defconfig arm64: kernel: remove ARM64_CPU_SUSPEND config option arm64: make sys_call_table const arm64: Remove asm/syscalls.h arm64: Implement the compat_sys_call_table in C syscalls: Declare sys_*stat64 prototypes if __ARCH_WANT_(COMPAT_)STAT64 compat: Declare compat_sys_sigpending and compat_sys_sigprocmask prototypes arm64: uapi: expose our struct ucontext to the uapi headers smp, ARM64: Kill SMP single function call interrupt arm64: Emulate SETEND for AArch32 tasks arm64: Consolidate hotplug notifier for instruction emulation arm64: Track system support for mixed endian EL0 arm64: implement generic IOMMU configuration arm64: Combine coherent and non-coherent swiotlb dma_ops ...
| * | | syscalls: Declare sys_*stat64 prototypes if __ARCH_WANT_(COMPAT_)STAT64Catalin Marinas2015-01-27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, the sys_stat64, sys_fstat64 and sys_lstat64 prototpyes are only declared if BITS_PER_LONG == 32. Following commit 0753f70f07fb (fs: Build sys_stat64() and friends if __ARCH_WANT_COMPAT_STAT64), the implementation of these functions is allowed on 64-bit systems for compat support. The patch changes the condition on the prototype declaration from BITS_PER_LONG == 32 to defined(__ARCH_WANT_STAT64) || defined(__ARCH_WANT_COMPAT_STAT64). In addition, it moves the sys_fstatat64 prototype under the same #if block Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Cc: Arnd Bergmann <arnd@arndb.de>
| * | | compat: Declare compat_sys_sigpending and compat_sys_sigprocmask prototypesCatalin Marinas2015-01-27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | __ARCH_WANT_SYS_SIGPENDING or __ARCH_WANT_SYS_SIGPROGMASK may be defined for compat support but the corresponding prototypes are missing from linux/compat.h. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Cc: Arnd Bergmann <arnd@arndb.de>
| * | | efi: split off remapping code from efi_config_init()Ard Biesheuvel2015-01-12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Split of the remapping code from efi_config_init() so that the caller can perform its own remapping. This is necessary to correctly handle virtually remapped UEFI memory regions under kexec, as efi.systab will have been updated to a virtual address. Acked-by: Matt Fleming <matt.fleming@intel.com> Tested-by: Leif Lindholm <leif.lindholm@linaro.org> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
* | | | Merge branch 'for-linus' of ↵Linus Torvalds2015-02-11
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 updates from Martin Schwidefsky: - The remaining patches for the z13 machine support: kernel build option for z13, the cache synonym avoidance, SMT support, compare-and-delay for spinloops and the CES5S crypto adapater. - The ftrace support for function tracing with the gcc hotpatch option. This touches common code Makefiles, Steven is ok with the changes. - The hypfs file system gets an extension to access diagnose 0x0c data in user space for performance analysis for Linux running under z/VM. - The iucv hvc console gets wildcard spport for the user id filtering. - The cacheinfo code is converted to use the generic infrastructure. - Cleanup and bug fixes. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (42 commits) s390/process: free vx save area when releasing tasks s390/hypfs: Eliminate hypfs interval s390/hypfs: Add diagnose 0c support s390/cacheinfo: don't use smp_processor_id() in preemptible context s390/zcrypt: fixed domain scanning problem (again) s390/smp: increase maximum value of NR_CPUS to 512 s390/jump label: use different nop instruction s390/jump label: add sanity checks s390/mm: correct missing space when reporting user process faults s390/dasd: cleanup profiling s390/dasd: add locking for global_profile access s390/ftrace: hotpatch support for function tracing ftrace: let notrace function attribute disable hotpatching if necessary ftrace: allow architectures to specify ftrace compile options s390: reintroduce diag 44 calls for cpu_relax() s390/zcrypt: Add support for new crypto express (CEX5S) adapter. s390/zcrypt: Number of supported ap domains is not retrievable. s390/spinlock: add compare-and-delay to lock wait loops s390/tape: remove redundant if statement s390/hvc_iucv: add simple wildcard matches to the iucv allow filter ...
| * | | | ftrace: let notrace function attribute disable hotpatching if necessaryHeiko Carstens2015-01-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | gcc supports an s390 specific function attribute called "hotpatch". It can be used to specify the number of halfwords that shall be added before and after a function and which shall be filled with nops for runtime patching. s390 will use the hotpatch attribute for function tracing, therefore make sure that the notrace function attribute either disables the mcount call or in case of hotpatch nop generation. Acked-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
* | | | | Merge tag 'please-pull-pstore' of ↵Linus Torvalds2015-02-11
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux Pull pstore update from Tony Luck: "Miscellaneous fs/pstore fixes" * tag 'please-pull-pstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux: pstore: Fix sprintf format specifier in pstore_dump() pstore: Add pmsg - user-space accessible pstore object pstore: Handle zero-sized prz in series pstore: Remove superfluous memory size check pstore: Use scnprintf() in pstore_mkfile()
| * | | | | pstore: Add pmsg - user-space accessible pstore objectMark Salyzyn2015-01-16
| | |/ / / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A secured user-space accessible pstore object. Writes to /dev/pmsg0 are appended to the buffer, on reboot the persistent contents are available in /sys/fs/pstore/pmsg-ramoops-[ID]. One possible use is syslogd, or other daemon, can write messages, then on reboot provides a means to triage user-space activities leading up to a panic as a companion to the pstore dmesg or console logs. Signed-off-by: Mark Salyzyn <salyzyn@android.com> Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: Tony Luck <tony.luck@intel.com>
* | | | | Merge tag 'nfs-for-3.20-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds2015-02-11
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull NFS client updates from Trond Myklebust: "Highlights incluse: Features: - Removing the forced serialisation of open()/close() calls in NFSv4.x (x>0) makes for a significant performance improvement in metadata intensive workloads. - Full support for the pNFS "flexible files" layout type - Further RPC/RDMA client improvements from Chuck Bugfixes: - Stable fix: NFSv4.1 backchannel calls blocking operations with !TASK_RUNNING - Stable fix: pnfs_generic_pg_init_read/write can be called with lseg == NULL - Stable fix: Fix an Oopsable condition when nsm_mon_unmon is called as part of the namespace cleanup, - Stable fix: Ensure we reference the inode for return-on-close in delegreturn - Use SO_REUSEPORT to ensure that NFSv3 TCP connections can rebind to the same source address/port combination during a disconnect/ reconnect event. This is a requirement imposed by most NFSv3 server duplicate reply cache implementations. Optimisations: - Ask for no NFSv4.1 delegations on OPEN if using O_DIRECT Other: - Add Anna Schumaker as co-maintainer for the NFS client" * tag 'nfs-for-3.20-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (119 commits) SUNRPC: Cleanup to remove xs_tcp_close() pnfs: delete an unintended goto pnfs/flexfiles: Do not dprintk after the free SUNRPC: Fix stupid typo in xs_sock_set_reuseport SUNRPC: Define xs_tcp_fin_timeout only if CONFIG_SUNRPC_DEBUG SUNRPC: Handle connection reset more efficiently. SUNRPC: Remove the redundant XPRT_CONNECTION_CLOSE flag SUNRPC: Make xs_tcp_close() do a socket shutdown rather than a sock_release SUNRPC: Ensure xs_tcp_shutdown() requests a full close of the connection SUNRPC: Cleanup to remove remaining uses of XPRT_CONNECTION_ABORT SUNRPC: Remove TCP socket linger code SUNRPC: Remove TCP client connection reset hack SUNRPC: TCP/UDP always close the old socket before reconnecting SUNRPC: Add helpers to prevent socket create from racing SUNRPC: Ensure xs_reset_transport() resets the close connection flags SUNRPC: Do not clear the source port in xs_reset_transport SUNRPC: Handle EADDRINUSE on connect SUNRPC: Set SO_REUSEPORT socket option for TCP connections NFSv4.1: Fix pnfs_put_lseg races NFSv4.1: pnfs_send_layoutreturn should use GFP_NOFS ...
| * | | | | SUNRPC: Remove the redundant XPRT_CONNECTION_CLOSE flagTrond Myklebust2015-02-09
| | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * | | | | SUNRPC: Cleanup to remove remaining uses of XPRT_CONNECTION_ABORTTrond Myklebust2015-02-09
| | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * | | | | SUNRPC: Remove TCP client connection reset hackTrond Myklebust2015-02-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Instead we rely on SO_REUSEPORT to provide the reconnection semantics that we need for NFSv2/v3. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * | | | | SUNRPC: Add helpers to prevent socket create from racingTrond Myklebust2015-02-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The socket lock is currently held by the task that is requesting the connection be established. While that is efficient in the case where the connection happens quickly, it is racy in the case where it doesn't. What we really want is for the connect helper to be able to block access to the socket while it is being set up. This patch does so by arranging to transfer the socket lock from the task that is requesting the connect attempt, and then releasing that lock once everything is done. This scheme also gives us automatic protection against collisions with the RPC close code, so we can kill the cancel_delayed_work_sync() call in xs_close(). Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * | | | | NFSv4.1: Pin the inode and super block in asynchronous layoutreturnsTrond Myklebust2015-02-05
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If we're sending an asynchronous layoutreturn, then we need to ensure that the inode and the super block remain pinned. Cc: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Reviewed-by: Peng Tao <tao.peng@primarydata.com>
| * | | | | NFSv4.1: Pin the inode and super block in asynchronous layoutcommitTrond Myklebust2015-02-05
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If we're sending an asynchronous layoutcommit, then we need to ensure that the inode and the super block remain pinned. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Reviewed-by: Peng Tao <tao.peng@primarydata.com>
| * | | | | NFSv4.1: Ask for no delegation on OPEN if using O_DIRECTTrond Myklebust2015-02-04
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If we're using NFSv4.1, then we have the ability to let the server know whether or not we believe that returning a delegation as part of our OPEN request would be useful. The feature needs to be used with care, since the client sending the request doesn't necessarily know how other clients are using that file, and how they may be affected by the delegation. For this reason, our initial use of the feature will be to let the server know when the client believes that handing out a delegation would not be useful. The first application for this function is when opening the file using O_DIRECT. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * | | | | SUNRPC: NULL utsname dereference on NFS umount during namespace cleanupTrond Myklebust2015-02-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix an Oopsable condition when nsm_mon_unmon is called as part of the namespace cleanup, which now apparently happens after the utsname has been freed. Link: http://lkml.kernel.org/r/20150125220604.090121ae@neptune.home Reported-by: Bruno Prémont <bonbons@linux-vserver.org> Cc: stable@vger.kernel.org # 3.18 Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * | | | | Merge branch 'flexfiles'Trond Myklebust2015-02-03
| |\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * flexfiles: (53 commits) pnfs: lookup new lseg at lseg boundary nfs41: .init_read and .init_write can be called with valid pg_lseg pnfs: Update documentation on the Layout Drivers pnfs/flexfiles: Add the FlexFile Layout Driver nfs: count DIO good bytes correctly with mirroring nfs41: wait for LAYOUTRETURN before retrying LAYOUTGET nfs: add a helper to set NFS_ODIRECT_RESCHED_WRITES to direct writes nfs41: add NFS_LAYOUT_RETRY_LAYOUTGET to layout header flags nfs/flexfiles: send layoutreturn before freeing lseg nfs41: introduce NFS_LAYOUT_RETURN_BEFORE_CLOSE nfs41: allow async version layoutreturn nfs41: add range to layoutreturn args pnfs: allow LD to ask to resend read through pnfs nfs: add nfs_pgio_current_mirror helper nfs: only reset desc->pg_mirror_idx when mirroring is supported nfs41: add a debug warning if we destroy an unempty layout pnfs: fail comparison when bucket verifier not set nfs: mirroring support for direct io nfs: add mirroring support to pgio layer pnfs: pass ds_commit_idx through the commit path ... Conflicts: fs/nfs/pnfs.c fs/nfs/pnfs.h