summaryrefslogtreecommitdiff
path: root/include/linux (unfollow)
Commit message (Collapse)Author
2015-02-05ACPI: Add interfaces to parse IOAPIC ID for IOAPIC hotplugYinghai Lu
We need to parse APIC ID for IOAPIC registration for IOAPIC hotplug. ACPI _MAT method and MADT table are used to figure out IOAPIC ID, just like parsing CPU APIC ID for CPU hotplug. [ tglx: Fixed docbook comment ] Signed-off-by: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Rafael J. Wysocki <rjw@rjwysocki.net> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Len Brown <lenb@kernel.org> Link: http://lkml.kernel.org/r/1414387308-27148-8-git-send-email-jiang.liu@linux.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2015-02-05PCI: Use common resource list management code instead of private implementationJiang Liu
Use common resource list management data structure and interfaces instead of private implementation. Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com> Acked-by: Will Deacon <will.deacon@arm.com> Acked-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2015-02-05resources: Move struct resource_list_entry from ACPI into resource coreJiang Liu
Currently ACPI, PCI and pnp all implement the same resource list management with different data structure. We need to transfer from one data structure into another when passing resources from one subsystem into another subsystem. So move struct resource_list_entry from ACPI into resource core and rename it as resource_entry, then it could be reused by different subystems and avoid the data structure conversion. Introduce dedicated header file resource_ext.h instead of embedding it into ioport.h to avoid header file inclusion order issues. Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com> Acked-by: Vinod Koul <vinod.koul@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2015-02-03ACPI: Introduce helper function acpi_dev_filter_resource_type()Jiang Liu
Introduce helper function acpi_dev_filter_resource_type(), which may be used by acpi_dev_get_resources() to filer out resource based on resource type. Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2015-02-03ACPI: Add field offset to struct resource_list_entryJiang Liu
Add field offset to struct resource_list_entry to host address space translation offset so it could be used to represent bridge resources. Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2015-02-03ACPI: Return translation offset when parsing ACPI address space resourcesJiang Liu
Change function acpi_dev_resource_address_space() and acpi_dev_resource_ext_address_space() to return address space translation offset. It's based on a patch from Yinghai Lu <yinghai@kernel.org>. Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2015-02-01sched: don't cause task state changes in nested sleep debuggingLinus Torvalds
Commit 8eb23b9f35aa ("sched: Debug nested sleeps") added code to report on nested sleep conditions, which we generally want to avoid because the inner sleeping operation can re-set the thread state to TASK_RUNNING, but that will then cause the outer sleep loop not actually sleep when it calls schedule. However, that's actually valid traditional behavior, with the inner sleep being some fairly rare case (like taking a sleeping lock that normally doesn't actually need to sleep). And the debug code would actually change the state of the task to TASK_RUNNING internally, which makes that kind of traditional and working code not work at all, because now the nested sleep doesn't just sometimes cause the outer one to not block, but will cause it to happen every time. In particular, it will cause the cardbus kernel daemon (pccardd) to basically busy-loop doing scheduling, converting a laptop into a heater, as reported by Bruno Prémont. But there may be other legacy uses of that nested sleep model in other drivers that are also likely to never get converted to the new model. This fixes both cases: - don't set TASK_RUNNING when the nested condition happens (note: even if WARN_ONCE() only _warns_ once, the return value isn't whether the warning happened, but whether the condition for the warning was true. So despite the warning only happening once, the "if (WARN_ON(..))" would trigger for every nested sleep. - in the cases where we knowingly disable the warning by using "sched_annotate_sleep()", don't change the task state (that is used for all core scheduling decisions), instead use '->task_state_change' that is used for the debugging decision itself. (Credit for the second part of the fix goes to Oleg Nesterov: "Can't we avoid this subtle change in behaviour DEBUG_ATOMIC_SLEEP adds?" with the suggested change to use 'task_state_change' as part of the test) Reported-and-bisected-by: Bruno Prémont <bonbons@linux-vserver.org> Tested-by: Rafael J Wysocki <rjw@rjwysocki.net> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de>, Cc: Ilya Dryomov <ilya.dryomov@inktank.com>, Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Hurley <peter@hurleysoftware.com>, Cc: Davidlohr Bueso <dave@stgolabs.net>, Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-01-29vm: add VM_FAULT_SIGSEGV handling supportLinus Torvalds
The core VM already knows about VM_FAULT_SIGBUS, but cannot return a "you should SIGSEGV" error, because the SIGSEGV case was generally handled by the caller - usually the architecture fault handler. That results in lots of duplication - all the architecture fault handlers end up doing very similar "look up vma, check permissions, do retries etc" - but it generally works. However, there are cases where the VM actually wants to SIGSEGV, and applications _expect_ SIGSEGV. In particular, when accessing the stack guard page, libsigsegv expects a SIGSEGV. And it usually got one, because the stack growth is handled by that duplicated architecture fault handler. However, when the generic VM layer started propagating the error return from the stack expansion in commit fee7e49d4514 ("mm: propagate error from stack expansion even for guard page"), that now exposed the existing VM_FAULT_SIGBUS result to user space. And user space really expected SIGSEGV, not SIGBUS. To fix that case, we need to add a VM_FAULT_SIGSEGV, and teach all those duplicate architecture fault handlers about it. They all already have the code to handle SIGSEGV, so it's about just tying that new return value to the existing code, but it's all a bit annoying. This is the mindless minimal patch to do this. A more extensive patch would be to try to gather up the mostly shared fault handling logic into one generic helper routine, and long-term we really should do that cleanup. Just from this patch, you can generally see that most architectures just copied (directly or indirectly) the old x86 way of doing things, but in the meantime that original x86 model has been improved to hold the VM semaphore for shorter times etc and to handle VM_FAULT_RETRY and other "newer" things, so it would be a good idea to bring all those improvements to the generic case and teach other architectures about them too. Reported-and-tested-by: Takashi Iwai <tiwai@suse.de> Tested-by: Jan Engelhardt <jengelh@inai.de> Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> # "s390 still compiles and boots" Cc: linux-arch@vger.kernel.org Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-01-28perf: Tighten (and fix) the grouping conditionPeter Zijlstra
The fix from 9fc81d87420d ("perf: Fix events installation during moving group") was incomplete in that it failed to recognise that creating a group with events for different CPUs is semantically broken -- they cannot be co-scheduled. Furthermore, it leads to real breakage where, when we create an event for CPU Y and then migrate it to form a group on CPU X, the code gets confused where the counter is programmed -- triggered in practice as well by me via the perf fuzzer. Fix this by tightening the rules for creating groups. Only allow grouping of counters that can be co-scheduled in the same context. This means for the same task and/or the same cpu. Fixes: 9fc81d87420d ("perf: Fix events installation during moving group") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20150123125834.090683288@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-01-28quota: Switch ->get_dqblk() and ->set_dqblk() to use bytes as space unitsJan Kara
Currently ->get_dqblk() and ->set_dqblk() use struct fs_disk_quota which tracks space limits and usage in 512-byte blocks. However VFS quotas track usage in bytes (as some filesystems require that) and we need to somehow pass this information. Upto now it wasn't a problem because we didn't do any unit conversion (thus VFS quota routines happily stuck number of bytes into d_bcount field of struct fd_disk_quota). Only if you tried to use Q_XGETQUOTA or Q_XSETQLIM for VFS quotas (or Q_GETQUOTA / Q_SETQUOTA for XFS quotas), you got bogus results. Hardly anyone tried this but reportedly some Samba users hit the problem in practice. So when we want interfaces compatible we need to fix this. We bite the bullet and define another quota structure used for passing information from/to ->get_dqblk()/->set_dqblk. It's somewhat sad we have to have more conversion routines in fs/quota/quota.c and another copying of quota structure slows down getting of quota information by about 2% but it seems cleaner than overloading e.g. units of d_bcount to bytes. CC: stable@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>
2015-01-26printk: add dummy routine for when CONFIG_PRINTK=nPranith Kumar
There are missing dummy routines for log_buf_addr_get() and log_buf_len_get() for when CONFIG_PRINTK is not set causing build failures. This patch adds these dummy routines at the appropriate location. Signed-off-by: Pranith Kumar <bobby.prani@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Reviewed-by: Petr Mladek <pmladek@suse.cz> Acked-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-01-26mm: page_alloc: embed OOM killing naturally into allocation slowpathJohannes Weiner
The OOM killing invocation does a lot of duplicative checks against the task's allocation context. Rework it to take advantage of the existing checks in the allocator slowpath. The OOM killer is invoked when the allocator is unable to reclaim any pages but the allocation has to keep looping. Instead of having a check for __GFP_NORETRY hidden in oom_gfp_allowed(), just move the OOM invocation to the true branch of should_alloc_retry(). The __GFP_FS check from oom_gfp_allowed() can then be moved into the OOM avoidance branch in __alloc_pages_may_oom(), along with the PF_DUMPCORE test. __alloc_pages_may_oom() can then signal to the caller whether the OOM killer was invoked, instead of requiring it to duplicate the order and high_zoneidx checks to guess this when deciding whether to continue. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-01-26i2c: Only include slave support if selectedJean Delvare
Make the slave support depend on CONFIG_I2C_SLAVE. Otherwise it gets included unconditionally, even when it is not needed. I2C bus drivers which implement slave support must select I2C_SLAVE. Signed-off-by: Jean Delvare <jdelvare@suse.de> Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
2015-01-22module: make module_refcount() a signed integer.Rusty Russell
James Bottomley points out that it will be -1 during unload. It's only used for diagnostics, so let's not hide that as it could be a clue as to what's gone wrong. Cc: Jason Wessel <jason.wessel@windriver.com> Acked-and-documention-added-by: James Bottomley <James.Bottomley@HansenPartnership.com> Reviewed-by: Masami Hiramatsu <maasami.hiramatsu.pt@hitachi.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2015-01-20module: remove mod arg from module_free, rename module_memfree().Rusty Russell
Nothing needs the module pointer any more, and the next patch will call it from RCU, where the module itself might no longer exist. Removing the arg is the safest approach. This just codifies the use of the module_alloc/module_free pattern which ftrace and bpf use. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Acked-by: Alexei Starovoitov <ast@kernel.org> Cc: Mikael Starvik <starvik@axis.com> Cc: Jesper Nilsson <jesper.nilsson@axis.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Ley Foon Tan <lftan@altera.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: x86@kernel.org Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: linux-cris-kernel@axis.com Cc: linux-kernel@vger.kernel.org Cc: linux-mips@linux-mips.org Cc: nios2-dev@lists.rocketboards.org Cc: linuxppc-dev@lists.ozlabs.org Cc: sparclinux@vger.kernel.org Cc: netdev@vger.kernel.org
2015-01-20module_arch_freeing_init(): new hook for archs before module->module_init freed.Rusty Russell
Archs have been abusing module_free() to clean up their arch-specific allocations. Since module_free() is also (ab)used by BPF and trace code, let's keep it to simple allocations, and provide a hook called before that. This means that avr32, ia64, parisc and s390 no longer need to implement their own module_free() at all. avr32 doesn't need module_finalize() either. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Haavard Skinnemoen <hskinnemoen@gmail.com> Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Helge Deller <deller@gmx.de> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: linux-kernel@vger.kernel.org Cc: linux-ia64@vger.kernel.org Cc: linux-parisc@vger.kernel.org Cc: linux-s390@vger.kernel.org
2015-01-19libata: allow sata_sil24 to opt-out of tag ordered submissionDan Williams
Ronny reports: https://bugzilla.kernel.org/show_bug.cgi?id=87101 "Since commit 8a4aeec8d "libata/ahci: accommodate tag ordered controllers" the access to the harddisk on the first SATA-port is failing on its first access. The access to the harddisk on the second port is working normal. When reverting the above commit, access to both harddisks is working fine again." Maintain tag ordered submission as the default, but allow sata_sil24 to continue with the old behavior. Cc: <stable@vger.kernel.org> Cc: Tejun Heo <tj@kernel.org> Reported-by: Ronny Hegewald <Ronny.Hegewald@online.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2015-01-16genetlink: synchronize socket closing and family removalJohannes Berg
In addition to the problem Jeff Layton reported, I looked at the code and reproduced the same warning by subscribing and removing the genl family with a socket still open. This is a fairly tricky race which originates in the fact that generic netlink allows the family to go away while sockets are still open - unlike regular netlink which has a module refcount for every open socket so in general this cannot be triggered. Trying to resolve this issue by the obvious locking isn't possible as it will result in deadlocks between unregistration and group unbind notification (which incidentally lockdep doesn't find due to the home grown locking in the netlink table.) To really resolve this, introduce a "closing socket" reference counter (for generic netlink only, as it's the only affected family) in the core netlink code and use that in generic netlink to wait for all the sockets that are being closed at the same time as a generic netlink family is removed. This fixes the race that when a socket is closed, it will should call the unbind, but if the family is removed at the same time the unbind will not find it, leading to the warning. The real problem though is that in this case the unbind could actually find a new family that is registered to have a multicast group with the same ID, and call its mcast_unbind() leading to confusing. Also remove the warning since it would still trigger, but is now no longer a problem. This also moves the code in af_netlink.c to before unreferencing the module to avoid having the same problem in the normal non-genl case. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-16PCI: Add pci_claim_bridge_resource() to clip window if necessaryYinghai Lu
Add pci_claim_bridge_resource() to claim a PCI-PCI bridge window. This is like regular pci_claim_resource(), except that if we fail to claim the window, we check to see if we can reduce the size of the window and try again. This is for scenarios like this: pci_bus 0000:00: root bus resource [mem 0xc0000000-0xffffffff] pci 0000:00:01.0: bridge window [mem 0xbdf00000-0xddefffff 64bit pref] pci 0000:01:00.0: reg 0x10: [mem 0xc0000000-0xcfffffff pref] The 00:01.0 window is illegal: it starts before the host bridge window, so we have to assume the [0xbdf00000-0xbfffffff] region is inaccessible. We can make it legal by clipping it to [mem 0xc0000000-0xddefffff 64bit pref]. Previously we discarded the 00:01.0 window and tried to reassign that part of the hierarchy from scratch. That is a problem because Linux doesn't always assign things optimally. For example, in this case, BIOS put the 01:00.0 device in a prefetchable window below 4GB, but after 5b28541552ef, Linux puts the prefetchable window above 4GB where the 32-bit 01:00.0 device can't use it. Clipping the 00:01.0 window is less intrusive than completely reassigning things and is sufficient to let us use most of the BIOS configuration. Of course, it's possible that devices below 00:01.0 will no longer fit. If that's the case, we'll have to reassign things. But that's a separate problem. [bhelgaas: changelog, split into separate patch] Link: https://bugzilla.kernel.org/show_bug.cgi?id=85491 Reported-by: Marek Kordik <kordikmarek@gmail.com> Fixes: 5b28541552ef ("PCI: Restrict 64-bit prefetchable bridge windows to 64-bit resources") Signed-off-by: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> CC: stable@vger.kernel.org # v3.16+
2015-01-16PCI: Add flag for devices where we can't use bus resetAlex Williamson
Enable a mechanism for devices to quirk that they do not behave when doing a PCI bus reset. We require a modest level of spec compliant behavior in order to do a reset, for instance the device should come out of reset without throwing errors and PCI config space should be accessible after reset. This is too much to ask for some devices. Link: http://lkml.kernel.org/r/20140923210318.498dacbd@dualc.maya.org Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> CC: stable@vger.kernel.org # v3.14+
2015-01-14netdevice: Add missing parentheses in macroBenjamin Poirier
For example, one could conceivably call for_each_netdev_in_bond_rcu(condition ? bond1 : bond2, slave) and get an unexpected result. Signed-off-by: Benjamin Poirier <bpoirier@suse.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-13kernel: Change ASSIGN_ONCE(val, x) to WRITE_ONCE(x, val)Christian Borntraeger
Feedback has shown that WRITE_ONCE(x, val) is easier to use than ASSIGN_ONCE(val,x). There are no in-tree users yet, so lets change it for 3.19. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Davidlohr Bueso <dave@stgolabs.net> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-13net: Corrected the comment describing the ndo operations to reflect the ↵B Viswanath
actual prototype for couple of operations Corrected the comment describing the ndo operations to reflect the actual prototype for couple of operations Signed-off-by: B Viswanath <marichika4@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-12mmc: sdhci: Disable re-tuning for HS400Adrian Hunter
Re-tuning for HS400 mode must be done in HS200 mode. Currently there is no support for that. That needs to be reflected in the code. Specifically, if tuning is executed in HS400 mode then return an error, and do not start the tuning timer if HS200 tuning is being done prior to switching to HS400. Note that periodic re-tuning is not expected to be needed for HS400 but re-tuning is still needed after the host controller has lost power. In the case of suspend/resume that is not necessary because the card is fully re-initialised. That just leaves runtime suspend/resume with no support for HS400 re-tuning. Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
2015-01-09perf: Move task_pt_regs sampling into arch codeAndy Lutomirski
On x86_64, at least, task_pt_regs may be only partially initialized in many contexts, so x86_64 should not use it without extra care from interrupt context, let alone NMI context. This will allow x86_64 to override the logic and will supply some scratch space to use to make a cleaner copy of user regs. Tested-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: chenggang.qcg@taobao.com Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Namhyung Kim <namhyung@gmail.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Jean Pihet <jean.pihet@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Salter <msalter@redhat.com> Cc: Russell King <linux@arm.linux.org.uk> Cc: Will Deacon <will.deacon@arm.com> Cc: linux-arm-kernel@lists.infradead.org Link: http://lkml.kernel.org/r/e431cd4c18c2e1c44c774f10758527fb2d1025c4.1420396372.git.luto@amacapital.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-01-08vfs: renumber FMODE_NONOTIFY and add to uniqueness checkDavid Drysdale
Fix clashing values for O_PATH and FMODE_NONOTIFY on sparc. The clashing O_PATH value was added in commit 5229645bdc35 ("vfs: add nonconflicting values for O_PATH") but this can't be changed as it is user-visible. FMODE_NONOTIFY is only used internally in the kernel, but it is in the same numbering space as the other O_* flags, as indicated by the comment at the top of include/uapi/asm-generic/fcntl.h (and its use in fs/notify/fanotify/fanotify_user.c). So renumber it to avoid the clash. All of this has happened before (commit 12ed2e36c98a: "fanotify: FMODE_NONOTIFY and __O_SYNC in sparc conflict"), and all of this will happen again -- so update the uniqueness check in fcntl_init() to include __FMODE_NONOTIFY. Signed-off-by: David Drysdale <drysdale@google.com> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Jan Kara <jack@suse.cz> Cc: Heinrich Schuchardt <xypron.glpk@gmx.de> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Eric Paris <eparis@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-01-08mm: protect set_page_dirty() from ongoing truncationJohannes Weiner
Tejun, while reviewing the code, spotted the following race condition between the dirtying and truncation of a page: __set_page_dirty_nobuffers() __delete_from_page_cache() if (TestSetPageDirty(page)) page->mapping = NULL if (PageDirty()) dec_zone_page_state(page, NR_FILE_DIRTY); dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE); if (page->mapping) account_page_dirtied(page) __inc_zone_page_state(page, NR_FILE_DIRTY); __inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE); which results in an imbalance of NR_FILE_DIRTY and BDI_RECLAIMABLE. Dirtiers usually lock out truncation, either by holding the page lock directly, or in case of zap_pte_range(), by pinning the mapcount with the page table lock held. The notable exception to this rule, though, is do_wp_page(), for which this race exists. However, do_wp_page() already waits for a locked page to unlock before setting the dirty bit, in order to prevent a race where clear_page_dirty() misses the page bit in the presence of dirty ptes. Upgrade that wait to a fully locked set_page_dirty() to also cover the situation explained above. Afterwards, the code in set_page_dirty() dealing with a truncation race is no longer needed. Remove it. Reported-by: Tejun Heo <tj@kernel.org> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-01-08mm: prevent endless growth of anon_vma hierarchyKonstantin Khlebnikov
Constantly forking task causes unlimited grow of anon_vma chain. Each next child allocates new level of anon_vmas and links vma to all previous levels because pages might be inherited from any level. This patch adds heuristic which decides to reuse existing anon_vma instead of forking new one. It adds counter anon_vma->degree which counts linked vmas and directly descending anon_vmas and reuses anon_vma if counter is lower than two. As a result each anon_vma has either vma or at least two descending anon_vmas. In such trees half of nodes are leafs with alive vmas, thus count of anon_vmas is no more than two times bigger than count of vmas. This heuristic reuses anon_vmas as few as possible because each reuse adds false aliasing among vmas and rmap walker ought to scan more ptes when it searches where page is might be mapped. Link: http://lkml.kernel.org/r/20120816024610.GA5350@evergreen.ssec.wisc.edu Fixes: 5beb49305251 ("mm: change anon_vma linking to fix multi-process server scalability issue") [akpm@linux-foundation.org: fix typo, per Rik] Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com> Reported-by: Daniel Forrest <dan.forrest@ssec.wisc.edu> Tested-by: Michal Hocko <mhocko@suse.cz> Tested-by: Jerome Marchand <jmarchan@redhat.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: <stable@vger.kernel.org> [2.6.34+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-01-08regulator: s2mps11: Fix wrong calculation of register offsetJonghwa Lee
This patch adds missing registers('BUCK7_SW' & 'LDO29_CTRL'). Since BUCK7 has 1 more register (BUCK7_SW) than others, register offset should be added one more for which has bigger address than BUCK7 registers. Fixes: 76b9840b24ae04(regulator: s2mps11: Add support S2MPS13 regulator device) Signed-off-by: Jonghwa Lee <jonghwa3.lee@samsung.com> Signed-off-by: Chanwoo Choi <cw00.choi@samsung.com> Signed-off-by: Mark Brown <broonie@kernel.org> Cc: <stable@vger.kernel.org>
2015-01-08libceph: fix sparse endianness warningsIlya Dryomov
The only real issue is the one in auth_x.c and it came with 3.19-rc1 merge. Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
2015-01-08blk-mq: Allow requests to never expireKeith Busch
Some types of requests may be started that are not gauranteed to ever complete. This adds a request flag that a driver can use so mark the request as such. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-01-08blk-mq: Add helper to abort requeued requestsJens Axboe
Adds a helper function a driver can use to abort requeued requests in case any are pending when h/w queues are being removed. Signed-off-by: Jens Axboe <axboe@fb.com>
2015-01-08blk-mq: Let drivers cancel requeue_workKeith Busch
Kicking requeued requests will start h/w queues in a work_queue, which may alter the driver's requested state to temporarily stop them. This patch exports a method to cancel the q->requeue_work so a driver can be assured stopped h/w queues won't be started up before it is ready. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-01-08blk-mq: Export if requests were startedKeith Busch
Drivers can iterate over all allocated request tags, but their callback needs a way to know if the driver started the request in the first place. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-01-08libata: Whitelist SSDs that are known to properly return zeroes after TRIMMartin K. Petersen
As defined, the DRAT (Deterministic Read After Trim) and RZAT (Return Zero After Trim) flags in the ATA Command Set are unreliable in the sense that they only define what happens if the device successfully executed the DSM TRIM command. TRIM is only advisory, however, and the device is free to silently ignore all or parts of the request. In practice this renders the DRAT and RZAT flags completely useless and because the results are unpredictable we decided to disable discard in MD for 3.18 to avoid the risk of data corruption. Hardware vendors in the real world obviously need better guarantees than what the standards bodies provide. Unfortuntely those guarantees are encoded in product requirements documents rather than somewhere we can key off of them programatically. So we are compelled to disabling discard_zeroes_data for all devices unless we explicitly have data to support whitelisting them. This patch whitelists SSDs from a few of the main vendors. None of the whitelists are based on written guarantees. They are purely based on empirical evidence collected from internal and external users that have tested or qualified these drives in RAID deployments. The whitelist is only meant as a starting point and is by no means comprehensive: - All intel SSD models except for 510 - Micron M5?0/M600 - Samsung SSDs - Seagate SSDs Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Tejun Heo <tj@kernel.org>
2015-01-07time: settimeofday: Validate the values of tv from userSasha Levin
An unvalidated user input is multiplied by a constant, which can result in an undefined behaviour for large values. While this is validated later, we should avoid triggering undefined behaviour. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: stable <stable@vger.kernel.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com> [jstultz: include trivial milisecond->microsecond correction noticed by Andy] Signed-off-by: John Stultz <john.stultz@linaro.org>
2015-01-07blk-mq: get rid of ->cmd_size in the hardware queueJens Axboe
We store it in the tag set, we don't need it in the hardware queue. While removing cmd_size, place ->queue_num further down to avoid a hole on 64-bit archs. It's not used in any fast paths, so we can safely move it. Signed-off-by: Jens Axboe <axboe@fb.com>
2015-01-06mm: propagate error from stack expansion even for guard pageLinus Torvalds
Jay Foad reports that the address sanitizer test (asan) sometimes gets confused by a stack pointer that ends up being outside the stack vma that is reported by /proc/maps. This happens due to an interaction between RLIMIT_STACK and the guard page: when we do the guard page check, we ignore the potential error from the stack expansion, which effectively results in a missing guard page, since the expected stack expansion won't have been done. And since /proc/maps explicitly ignores the guard page (commit d7824370e263: "mm: fix up some user-visible effects of the stack guard page"), the stack pointer ends up being outside the reported stack area. This is the minimal patch: it just propagates the error. It also effectively makes the guard page part of the stack limit, which in turn measn that the actual real stack is one page less than the stack limit. Let's see if anybody notices. We could teach acct_stack_growth() to allow an extra page for a grow-up/grow-down stack in the rlimit test, but I don't want to add more complexity if it isn't needed. Reported-and-tested-by: Jay Foad <jay.foad@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-01-05NFSv4: Cache the NFSv4/v4.1 client owner_id in the struct nfs_clientTrond Myklebust
Ensure that we cache the NFSv4/v4.1 client owner_id so that we can verify it when we're doing trunking detection. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-01-05ACPI / processor: Rename acpi_(un)map_lsapic() to acpi_(un)map_cpu()Hanjun Guo
acpi_map_lsapic() will allocate a logical CPU number and map it to physical CPU id (such as APIC id) for the hot-added CPU, it will also do some mapping for NUMA node id and etc, acpi_unmap_lsapic() will do the reverse. We can see that the name of the function is a little bit confusing and arch (IA64) dependent so rename them as acpi_(un)map_cpu() to make arch agnostic and explicit. Signed-off-by: Hanjun Guo <hanjun.guo@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-12-29mm: get rid of radix tree gfp mask for pagecache_get_pageMichal Hocko
Commit 2457aec63745 ("mm: non-atomically mark page accessed during page cache allocation where possible") has added a separate parameter for specifying gfp mask for radix tree allocations. Not only this is less than optimal from the API point of view because it is error prone, it is also buggy currently because grab_cache_page_write_begin is using GFP_KERNEL for radix tree and if fgp_flags doesn't contain FGP_NOFS (mostly controlled by fs by AOP_FLAG_NOFS flag) but the mapping_gfp_mask has __GFP_FS cleared then the radix tree allocation wouldn't obey the restriction and might recurse into filesystem and cause deadlocks. This is the case for most filesystems unfortunately because only ext4 and gfs2 are using AOP_FLAG_NOFS. Let's simply remove radix_gfp_mask parameter because the allocation context is same for both page cache and for the radix tree. Just make sure that the radix tree gets only the sane subset of the mask (e.g. do not pass __GFP_WRITE). Long term it is more preferable to convert remaining users of AOP_FLAG_NOFS to use mapping_gfp_mask instead and simplify this interface even further. Reported-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-27netlink/genetlink: pass network namespace to bind/unbindJohannes Berg
Netlink families can exist in multiple namespaces, and for the most part multicast subscriptions are per network namespace. Thus it only makes sense to have bind/unbind notifications per network namespace. To achieve this, pass the network namespace of a given client socket to the bind/unbind functions. Also do this in generic netlink, and there also make sure that any bind for multicast groups that only exist in init_net is rejected. This isn't really a problem if it is accepted since a client in a different namespace will never receive any notifications from such a group, but it can confuse the family if not rejected (it's also possible to silently (without telling the family) accept it, but it would also have to be ignored on unbind so families that take any kind of action on bind/unbind won't do unnecessary work for invalid clients like that. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-26net: Generalize ndo_gso_check to ndo_features_checkJesse Gross
GSO isn't the only offload feature with restrictions that potentially can't be expressed with the current features mechanism. Checksum is another although it's a general issue that could in theory apply to anything. Even if it may be possible to implement these restrictions in other ways, it can result in duplicate code or inefficient per-packet behavior. This generalizes ndo_gso_check so that drivers can remove any features that don't make sense for a given packet, similar to netif_skb_features(). It also converts existing driver restrictions to the new format, completing the work that was done to support tunnel protocols since the issues apply to checksums as well. By actually removing features from the set that are used to do offloading, it solves another problem with the existing interface. In these cases, GSO would run with the original set of features and not do anything because it appears that segmentation is not required. CC: Tom Herbert <therbert@google.com> CC: Joe Stringer <joestringer@nicira.com> CC: Eric Dumazet <edumazet@google.com> CC: Hayes Wang <hayeswang@realtek.com> Signed-off-by: Jesse Gross <jesse@nicira.com> Acked-by: Tom Herbert <therbert@google.com> Fixes: 04ffcb255f22 ("net: Add ndo_gso_check") Tested-by: Hayes Wang <hayeswang@realtek.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-23audit: restore AUDIT_LOGINUID unset ABIRichard Guy Briggs
A regression was caused by commit 780a7654cee8: audit: Make testing for a valid loginuid explicit. (which in turn attempted to fix a regression caused by e1760bd) When audit_krule_to_data() fills in the rules to get a listing, there was a missing clause to convert back from AUDIT_LOGINUID_SET to AUDIT_LOGINUID. This broke userspace by not returning the same information that was sent and expected. The rule: auditctl -a exit,never -F auid=-1 gives: auditctl -l LIST_RULES: exit,never f24=0 syscall=all when it should give: LIST_RULES: exit,never auid=-1 (0xffffffff) syscall=all Tag it so that it is reported the same way it was set. Create a new private flags audit_krule field (pflags) to store it that won't interact with the public one from the API. Cc: stable@vger.kernel.org # v3.10-rc1+ Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <pmoore@redhat.com>
2014-12-23phy: phy-ti-pipe3: fix inconsistent enumeration of PCIe gen2 cardsVignesh R
Prior to DRA74x silicon rev 1.1, pcie_pcs register bits 8-15 and bits 16-23 were used to configure RC delay count for phy1 and phy2 respectively. phyid was used as index to distinguish the phys and to configure the delay values appropriately. As of DRA74x silicon rev 1.1, pcie_pcs register definition has changed. Bits 16-23 are used to configure delay values for *both* phy1 and phy2. Hence phyid is no longer required. So, drop id field from ti_pipe3 structure and its subsequent references for configuring pcie_pcs register. Also, pcie_pcs register now needs to be configured with delay value of 0x96 at bit positions 16-23. See register description of CTRL_CORE_PCIE_PCS in ARM572x TRM, SPRUHZ6, October 2014, section 18.5.2.2, table 18-1804. This is needed to ensure Gen2 cards are enumerated consistently. DRA72x silicon behaves same way as DRA74x rev 1.1 as far as this functionality is considered. Test results on DRA74x and DRA72x EVMs: Before patch ------------ DRA74x ES 1.0: Gen1 cards work, Gen2 cards do not work (expected result due to silicon errata) DRA74x ES 1.1: Gen1 cards work, Gen2 cards do not work sometimes due to incorrect programming of register DRA72x: Gen1 cards work, Gen2 cards do not work sometimes due to incorrect programming of register After patch ----------- DRA74x ES 1.0: Gen1 cards work, Gen2 cards do not work (expected result due to silicon errata) DRA74x ES 1.1: Gen1 cards work, Gen2 cards work consistently. DRA72x: Gen1 and Gen2 cards enumerate consistently. Signed-off-by: Vignesh R <vigneshr@ti.com> Signed-off-by: Kishon Vijay Abraham I <kishon@ti.com>
2014-12-20blk-mq: Export freeze/unfreeze functionsKeith Busch
Let drivers prevent entering a queue that isn't available. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-12-19PM: Eliminate CONFIG_PM_RUNTIMERafael J. Wysocki
Having switched over all of the users of CONFIG_PM_RUNTIME to use CONFIG_PM directly, turn the latter into a user-selectable option and drop the former entirely from the tree. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org> Acked-by: Kevin Hilman <khilman@linaro.org>
2014-12-18mm: cma: split cma-reserved in dmesg logPintu Kumar
When the system boots up, in the dmesg logs we can see the memory statistics along with total reserved as below. Memory: 458840k/458840k available, 65448k reserved, 0K highmem When CMA is enabled, still the total reserved memory remains the same. However, the CMA memory is not considered as reserved. But, when we see /proc/meminfo, the CMA memory is part of free memory. This creates confusion. This patch corrects the problem by properly subtracting the CMA reserved memory from the total reserved memory in dmesg logs. Below is the dmesg snapshot from an arm based device with 512MB RAM and 12MB single CMA region. Before this change: Memory: 458840k/458840k available, 65448k reserved, 0K highmem After this change: Memory: 458840k/458840k available, 53160k reserved, 12288k cma-reserved, 0K highmem Signed-off-by: Pintu Kumar <pintu.k@samsung.com> Signed-off-by: Vishnu Pratap Singh <vishnu.ps@samsung.com> Acked-by: Michal Nazarewicz <mina86@mina86.com> Cc: Rafael Aquini <aquini@redhat.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-18kernel: Provide READ_ONCE and ASSIGN_ONCEChristian Borntraeger
ACCESS_ONCE does not work reliably on non-scalar types. For example gcc 4.6 and 4.7 might remove the volatile tag for such accesses during the SRA (scalar replacement of aggregates) step https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58145) Let's provide READ_ONCE/ASSIGN_ONCE that will do all accesses via scalar types as suggested by Linus Torvalds. Accesses larger than the machines word size cannot be guaranteed to be atomic. These macros will use memcpy and emit a build warning. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2014-12-18KVM: move APIC types to arch/x86/Paolo Bonzini
They are not used anymore by IA64, move them away. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>