summaryrefslogtreecommitdiff
path: root/include (unfollow)
Commit message (Collapse)Author
2012-06-08module_param: stop double-calling parameters.Rusty Russell
Commit 026cee0086fe1df4cf74691cf273062cc769617d "params: <level>_initcall-like kernel parameters" set old-style module parameters to level 0. And we call those level 0 calls where we used to, early in start_kernel(). We also loop through the initcall levels and call the levelled module_params before the corresponding initcall. Unfortunately level 0 is early_init(), so we call the standard module_param calls twice. (Turns out most things don't care, but at least ubi.mtd does). Change the level to -1 for standard module_param calls. Reported-by: Benoît Thébaudeau <benoit.thebaudeau@advansee.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Cc: stable@kernel.org
2012-06-07c/r: prctl: add ability to get clear_tid_addressCyrill Gorcunov
Zero is written at clear_tid_address when the process exits. This functionality is used by pthread_join(). We already have sys_set_tid_address() to change this address for the current task but there is no way to obtain it from user space. Without the ability to find this address and dump it we can't restore pthread'ed apps which call pthread_join() once they have been restored. This patch introduces the PR_GET_TID_ADDRESS prctl option which allows the current process to obtain own clear_tid_address. This feature is available iif CONFIG_CHECKPOINT_RESTORE is set. [akpm@linux-foundation.org: fix prctl numbering] Signed-off-by: Andrew Vagin <avagin@openvz.org> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Pedro Alves <palves@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Tejun Heo <tj@kernel.org> Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-07c/r: prctl: update prctl_set_mm_exe_file() after mm->num_exe_file_vmas removalKonstantin Khlebnikov
A fix for commit b32dfe377102 ("c/r: prctl: add ability to set new mm_struct::exe_file"). After removing mm->num_exe_file_vmas kernel keeps mm->exe_file until final mmput(), it never becomes NULL while task is alive. We can check for other mapped files in mm instead of checking mm->num_exe_file_vmas, and mark mm with flag MMF_EXE_FILE_CHANGED in order to forbid second changing of mm->exe_file. Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Kees Cook <keescook@chromium.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Tejun Heo <tj@kernel.org> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-06perf: Limit callchains to 127Arun Sharma
Stack depth of 255 seems excessive, given that copy_from_user_nmi() could be slow. Signed-off-by: Arun Sharma <asharma@fb.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1334961696-19580-3-git-send-email-asharma@fb.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-06sched: Fix domain iterationPeter Zijlstra
Weird topologies can lead to asymmetric domain setups. This needs further consideration since these setups are typically non-minimal too. For now, make it work by adding an extra mask selecting which CPUs are allowed to iterate up. The topology that triggered it is the one from David Rientjes: 10 20 20 30 20 10 20 20 20 20 10 20 30 20 20 10 resulting in boxes that wouldn't even boot. Reported-by: David Rientjes <rientjes@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-3p86l9cuaqnxz7uxsojmz5rm@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-05radix-tree: fix contiguous iteratorKonstantin Khlebnikov
This patch fixes bug in macro radix_tree_for_each_contig(). If radix_tree_next_slot() sees NULL in next slot it returns NULL, but following radix_tree_next_chunk() switches iterating into next chunk. As result iterating becomes non-contiguous and breaks vfs "splice" and all its users. Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Reported-and-bisected-by: Hans de Bruin <jmdebruin@xmsnet.nl> Reported-and-bisected-by: Ondrej Zary <linux@rainbow-software.org> Reported-bisected-and-tested-by: Toralf Förster <toralf.foerster@gmx.de> Link: https://lkml.org/lkml/2012/6/5/64 Cc: stable <stable@vger.kernel.org> # 3.4.x Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-05drm/radeon/kms: add new SI PCI idsAlex Deucher
Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org Signed-off-by: Dave Airlie <airlied@redhat.com>
2012-06-05drm/radeon/kms: add new BTC PCI idsAlex Deucher
Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org Signed-off-by: Dave Airlie <airlied@redhat.com>
2012-06-05drm/radeon/kms: add new Palm, Sumo PCI idsAlex Deucher
Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org Signed-off-by: Dave Airlie <airlied@redhat.com>
2012-06-05drm/radeon/kms: add new Trinity PCI idsAlex Deucher
Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org Signed-off-by: Dave Airlie <airlied@redhat.com>
2012-06-05drm/exynos: fixed size type.Inki Dae
size type of drm_exynos_gem_mmap struct is changed to uint64_t and it adds pad for the struct to be aligned as 64bit. Signed-off-by: Inki Dae <inki.dae@samsung.com> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
2012-06-04i2c: Add generic I2C multiplexer using pinctrl APIStephen Warren
This is useful for SoCs whose I2C module's signals can be routed to different sets of pins at run-time, using the pinctrl API. +-----+ +-----+ | dev | | dev | +------------------------+ +-----+ +-----+ | SoC | | | | /----|------+--------+ | +---+ +------+ | child bus A, on first set of pins | |I2C|---|Pinmux| | | +---+ +------+ | child bus B, on second set of pins | \----|------+--------+--------+ | | | | | +------------------------+ +-----+ +-----+ +-----+ | dev | | dev | | dev | +-----+ +-----+ +-----+ Signed-off-by: Stephen Warren <swarren@nvidia.com> Acked-by: Linus Walleij <linus.walleij@linaro.org> Acked-by: Rob Herring <rob.herring@calxeda.com> Signed-off-by: Wolfram Sang <w.sang@pengutronix.de>
2012-06-04ACPI: fix acpi_bus.h build warnings when ACPI is not enabledLen Brown
introduced in Linux-3.5-rc1 by 66886d6f8c9bcdee3d7fce5796dcffd6b4bc0b48 (ACPI: Add stubs for (un)register_acpi_bus_type) Fix header file warnings when CONFIG_ACPI is not enabled: include/acpi/acpi_bus.h:443:42: warning: 'struct acpi_bus_type' declared inside parameter list include/acpi/acpi_bus.h:443:42: warning: its scope is only this definition or declaration, which is probably not include/acpi/acpi_bus.h:444:44: warning: 'struct acpi_bus_type' declared inside parameter list Signed-off-by: Len Brown <len.brown@intel.com>
2012-06-03Revert "mm: compaction: handle incorrect MIGRATE_UNMOVABLE type pageblocks"Linus Torvalds
This reverts commit 5ceb9ce6fe9462a298bb2cd5c9f1ca6cb80a0199. That commit seems to be the cause of the mm compation list corruption issues that Dave Jones reported. The locking (or rather, absense there-of) is dubious, as is the use of the 'page' variable once it has been found to be outside the pageblock range. So revert it for now, we can re-visit this for 3.6. If we even need to: as Minchan Kim says, "The patch wasn't a bug fix and even test workload was very theoretical". Reported-and-tested-by: Dave Jones <davej@redhat.com> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Cc: Kyungmin Park <kyungmin.park@samsung.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-03vfs: move inode stat information closer togetherLinus Torvalds
The comment above it says "Stat data, not accessed from path walking", but in fact some of inode fields we use for the common stat data was way down at the end of the inode, causing unnecessary cache misses for the common stat operations. The inode structure is pretty big, and this can change padding depending on field width, but at least on the common 64-bit configurations this doesn't change the size. Some of our inode layout has historically been to tro to avoid unnecessary padding fields, but cache locality is at least as important for layout, if not more. Noticed by looking at kernel profiles, and noticing that the "i_blkbits" access stood out like a sore thumb. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-02tty: Revert the tty locking series, it needs more workLinus Torvalds
This reverts the tty layer change to use per-tty locking, because it's not correct yet, and fixing it will require some more deep surgery. The main revert is d29f3ef39be4 ("tty_lock: Localise the lock"), but there are several smaller commits that built upon it, they also get reverted here. The list of reverted commits is: fde86d310886 - tty: add lockdep annotations 8f6576ad476b - tty: fix ldisc lock inversion trace d3ca8b64b97e - pty: Fix lock inversion b1d679afd766 - tty: drop the pty lock during hangup abcefe5fc357 - tty/amiserial: Add missing argument for tty_unlock() fd11b42e3598 - cris: fix missing tty arg in wait_event_interruptible_tty call d29f3ef39be4 - tty_lock: Localise the lock The revert had a trivial conflict in the 68360serial.c staging driver that got removed in the meantime. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-01cipso: handle CIPSO options correctly when NetLabel is disabledPaul Moore
When NetLabel is not enabled, e.g. CONFIG_NETLABEL=n, and the system receives a CIPSO tagged packet it is dropped (cipso_v4_validate() returns non-zero). In most cases this is the correct and desired behavior, however, in the case where we are simply forwarding the traffic, e.g. acting as a network bridge, this becomes a problem. This patch fixes the forwarding problem by providing the basic CIPSO validation code directly in ip_options_compile() without the need for the NetLabel or CIPSO code. The new validation code can not perform any of the CIPSO option label/value verification that cipso_v4_validate() does, but it can verify the basic CIPSO option format. The behavior when NetLabel is enabled is unchanged. Signed-off-by: Paul Moore <pmoore@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-01new helper: signal_delivered()Al Viro
Does block_sigmask() + tracehook_signal_handler(); called when sigframe has been successfully built. All architectures converted to it; block_sigmask() itself is gone now (merged into this one). I'm still not too happy with the signature, but that's a separate story (IMO we need a structure that would contain signal number + siginfo + k_sigaction, so that get_signal_to_deliver() would fill one, signal_delivered(), handle_signal() and probably setup...frame() - take one). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-06-01most of set_current_blocked() callers want SIGKILL/SIGSTOP removed from setAl Viro
Only 3 out of 63 do not. Renamed the current variant to __set_current_blocked(), added set_current_blocked() that will exclude unblockable signals, switched open-coded instances to it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-06-01set_restore_sigmask() is never called without SIGPENDING (and never should be)Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-06-01new helper: sigmask_to_save()Al Viro
replace boilerplate "should we use ->saved_sigmask or ->blocked?" with calls of obvious inlined helper... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-06-01new helper: restore_saved_sigmask()Al Viro
first fruits of ..._restore_sigmask() helpers: now we can take boilerplate "signal didn't have a handler, clear RESTORE_SIGMASK and restore the blocked mask from ->saved_mask" into a common helper. Open-coded instances switched... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-06-01new helpers: {clear,test,test_and_clear}_restore_sigmask()Al Viro
helpers parallel to set_restore_sigmask(), used in the next commits Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-06-01HAVE_RESTORE_SIGMASK is defined on all architectures nowAl Viro
Everyone either defines it in arch thread_info.h or has TIF_RESTORE_SIGMASK and picks default set_restore_sigmask() in linux/thread_info.h. Kill the ifdefs, slap #error in linux/thread_info.h to catch breakage when new ones get merged. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-06-01vfs: retry last component if opening stale dentryMiklos Szeredi
NFS optimizes away d_revalidates for last component of open. This means that open itself can find the dentry stale. This patch allows the filesystem to return EOPENSTALE and the VFS will retry the lookup on just the last component if possible. If the lookup was done using RCU mode, including the last component, then this is not possible since the parent dentry is lost. In this case fall back to non-RCU lookup. Currently this is not used since NFS will always leave RCU mode. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-06-01fs: introduce inode operation ->update_timeJosef Bacik
Btrfs has to make sure we have space to allocate new blocks in order to modify the inode, so updating time can fail. We've gotten around this by having our own file_update_time but this is kind of a pain, and Christoph has indicated he would like to make xfs do something different with atime updates. So introduce ->update_time, where we will deal with i_version an a/m/c time updates and indicate which changes need to be made. The normal version just does what it has always done, updates the time and marks the inode dirty, and then filesystems can choose to do something different. I've gone through all of the users of file_update_time and made them check for errors with the exception of the fault code since it's complicated and I wasn't quite sure what to do there, also Jan is going to be pushing the file time updates into page_mkwrite for those who have it so that should satisfy btrfs and make it not a big deal to check the file_update_time() return code in the generic fault path. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2012-06-01switch aio and shm to do_mmap_pgoff(), make do_mmap() staticAl Viro
after all, 0 bytes and 0 pages is the same thing... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-06-01take security_mmap_file() outside of ->mmap_semAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-05-31c/r: prctl: add ability to set new mm_struct::exe_fileCyrill Gorcunov
When we do restore we would like to have a way to setup a former mm_struct::exe_file so that /proc/pid/exe would point to the original executable file a process had at checkpoint time. For this the PR_SET_MM_EXE_FILE code is introduced. This option takes a file descriptor which will be set as a source for new /proc/$pid/exe symlink. Note it allows to change /proc/$pid/exe if there are no VM_EXECUTABLE vmas present for current process, simply because this feature is a special to C/R and mm::num_exe_file_vmas become meaningless after that. To minimize the amount of transition the /proc/pid/exe symlink might have, this feature is implemented in one-shot manner. Thus once changed the symlink can't be changed again. This should help sysadmins to monitor the symlinks over all process running in a system. In particular one could make a snapshot of processes and ring alarm if there unexpected changes of /proc/pid/exe's in a system. Note -- this feature is available iif CONFIG_CHECKPOINT_RESTORE is set and the caller must have CAP_SYS_RESOURCE capability granted, otherwise the request to change symlink will be rejected. Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Kees Cook <keescook@chromium.org> Cc: Tejun Heo <tj@kernel.org> Cc: Matt Helsley <matthltc@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-31c/r: prctl: extend PR_SET_MM to set up more mm_struct entriesCyrill Gorcunov
During checkpoint we dump whole process memory to a file and the dump includes process stack memory. But among stack data itself, the stack carries additional parameters such as command line arguments, environment data and auxiliary vector. So when we do restore procedure and once we've restored stack data itself we need to setup mm_struct::arg_start/end, env_start/end, so restored process would be able to find command line arguments and environment data it had at checkpoint time. The same applies to auxiliary vector. For this reason additional PR_SET_MM_(ARG_START | ARG_END | ENV_START | ENV_END | AUXV) codes are introduced. Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Acked-by: Kees Cook <keescook@chromium.org> Cc: Tejun Heo <tj@kernel.org> Cc: Andrew Vagin <avagin@openvz.org> Cc: Serge Hallyn <serge.hallyn@canonical.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Vasiliy Kulikov <segoon@openwall.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-31syscalls, x86: add __NR_kcmp syscallCyrill Gorcunov
While doing the checkpoint-restore in the user space one need to determine whether various kernel objects (like mm_struct-s of file_struct-s) are shared between tasks and restore this state. The 2nd step can be solved by using appropriate CLONE_ flags and the unshare syscall, while there's currently no ways for solving the 1st one. One of the ways for checking whether two tasks share e.g. mm_struct is to provide some mm_struct ID of a task to its proc file, but showing such info considered to be not that good for security reasons. Thus after some debates we end up in conclusion that using that named 'comparison' syscall might be the best candidate. So here is it -- __NR_kcmp. It takes up to 5 arguments - the pids of the two tasks (which characteristics should be compared), the comparison type and (in case of comparison of files) two file descriptors. Lookups for pids are done in the caller's PID namespace only. At moment only x86 is supported and tested. [akpm@linux-foundation.org: fix up selftests, warnings] [akpm@linux-foundation.org: include errno.h] [akpm@linux-foundation.org: tweak comment text] Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Andrey Vagin <avagin@openvz.org> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Glauber Costa <glommer@parallels.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Tejun Heo <tj@kernel.org> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Vasiliy Kulikov <segoon@openwall.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Valdis.Kletnieks@vt.edu Cc: Michal Marek <mmarek@suse.cz> Cc: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-31aio/vfs: cleanup of rw_copy_check_uvector() and compat_rw_copy_check_uvector()Christopher Yeoh
A cleanup of rw_copy_check_uvector and compat_rw_copy_check_uvector after changes made to support CMA in an earlier patch. Rather than having an additional check_access parameter to these functions, the first paramater type is overloaded to allow the caller to specify CHECK_IOVEC_ONLY which means check that the contents of the iovec are valid, but do not check the memory that they point to. This is used by process_vm_readv/writev where we need to validate that a iovec passed to the syscall is valid but do not want to check the memory that it points to at this point because it refers to an address space in another process. Signed-off-by: Chris Yeoh <yeohc@au1.ibm.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-31eventfd: change int to __u64 in eventfd_signal()Sha Zhengju
eventfd_ctx->count is an __u64 counter which is allowed to reach ULLONG_MAX. eventfd_write() adds a __u64 value to "count", but the kernel side eventfd_signal() only adds an int value to it. Make them consistent. [akpm@linux-foundation.org: update interface documentation] Signed-off-by: Sha Zhengju <handai.szj@taobao.com> Cc: Davide Libenzi <davidel@xmailserver.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-31rapidio: add DMA engine support for RIO data transfersAlexandre Bounine
Adds DMA Engine framework support into RapidIO subsystem. Uses DMA Engine DMA_SLAVE interface to generate data transfers to/from remote RapidIO target devices. Introduces RapidIO-specific wrapper for prep_slave_sg() interface with an extra parameter to pass target specific information. Uses scatterlist to describe local data buffer. Address flat data buffer on a remote side. Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com> Cc: Dan Williams <dan.j.williams@intel.com> Acked-by: Vinod Koul <vinod.koul@linux.intel.com> Cc: Li Yang <leoli@freescale.com> Cc: Matt Porter <mporter@kernel.crashing.org> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-31mqueue: separate mqueue default value from maximum valueKOSAKI Motohiro
Commit b231cca4381e ("message queues: increase range limits") changed mqueue default value when attr parameter is specified NULL from hard coded value to fs.mqueue.{msg,msgsize}_max sysctl value. This made large side effect. When user need to use two mqueue applications 1) using !NULL attr parameter and it require big message size and 2) using NULL attr parameter and only need small size message, app (1) require to raise fs.mqueue.msgsize_max and app (2) consume large memory size even though it doesn't need. Doug Ledford propsed to switch back it to static hard coded value. However it also has a compatibility problem. Some applications might started depend on the default value is tunable. The solution is to separate default value from maximum value. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Doug Ledford <dledford@redhat.com> Acked-by: Doug Ledford <dledford@redhat.com> Acked-by: Joe Korty <joe.korty@ccur.com> Cc: Amerigo Wang <amwang@redhat.com> Acked-by: Serge E. Hallyn <serue@us.ibm.com> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-31mqueue: revert bump up DFLT_*MAXKOSAKI Motohiro
Mqueue limitation is slightly naieve parameter likes other ipcs because unprivileged user can consume kernel memory by using ipcs. Thus, too aggressive raise bring us security issue. Example, current setting allow evil unprivileged user use 256GB (= 256 * 1024 * 1024*1024) and it's enough large to system will belome unresponsive. Don't do that. Instead, every admin should adjust the knobs for their own systems. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Doug Ledford <dledford@redhat.com> Acked-by: Joe Korty <joe.korty@ccur.com> Cc: Amerigo Wang <amwang@redhat.com> Acked-by: Serge E. Hallyn <serue@us.ibm.com> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-31ipc/mqueue: update maximums for the mqueue subsystemDoug Ledford
Commit b231cca4381e ("message queues: increase range limits") changed the maximum size of a message in a message queue from INT_MAX to 8192*128. Unfortunately, we had customers that relied on a size much larger than 8192*128 on their production systems. After reviewing POSIX, we found that it is silent on the maximum message size. We did find a couple other areas in which it was not silent. Fix up the mqueue maximums so that the customer's system can continue to work, and document both the POSIX and real world requirements in ipc_namespace.h so that we don't have this issue crop back up. Also, commit 9cf18e1dd74cd0 ("ipc: HARD_MSGMAX should be higher not lower on 64bit") fiddled with HARD_MSGMAX without realizing that the number was intentionally in place to limit the msg queue depth to one that was small enough to kmalloc an array of pointers (hence why we divided 128k by sizeof(long)). If we wish to meet POSIX requirements, we have no choice but to change our allocation to a vmalloc instead (at least for the large queue size case). With that, it's possible to increase our allowed maximum to the POSIX requirements (or more if we choose). [sfr@canb.auug.org.au: using vmalloc requires including vmalloc.h] Signed-off-by: Doug Ledford <dledford@redhat.com> Cc: Serge E. Hallyn <serue@us.ibm.com> Cc: Amerigo Wang <amwang@redhat.com> Cc: Joe Korty <joe.korty@ccur.com> Cc: Jiri Slaby <jslaby@suse.cz> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-31ipc/mqueue: switch back to using non-max values on createDoug Ledford
Commit b231cca4381e ("message queues: increase range limits") changed how we create a queue that does not include an attr struct passed to open so that it creates the queue with whatever the maximum values are. However, if the admin has set the maximums to allow flexibility in creating a queue (aka, both a large size and large queue are allowed, but combined they create a queue too large for the RLIMIT_MSGQUEUE of the user), then attempts to create a queue without an attr struct will fail. Switch back to using acceptable defaults regardless of what the maximums are. Note: so far, we only know of a few applications that rely on this behavior (specifically, set the maximums in /proc, then run the application which calls mq_open() without passing in an attr struct, and the application expects the newly created message queue to have the maximum sizes that were set in /proc used on the mq_open() call, and all of those applications that we know of are actually part of regression test suites that were coded to do something like this: for size in 4096 65536 $((1024 * 1024)) $((16 * 1024 * 1024)); do echo $size > /proc/sys/fs/mqueue/msgsize_max mq_open || echo "Error opening mq with size $size" done These test suites that depend on any behavior like this are broken. The concept that programs should rely upon the system wide maximum in order to get their desired results instead of simply using a attr struct to specify what they want is fundamentally unfriendly programming practice for any multi-tasking OS. Fixing this will break those few apps that we know of (and those app authors recognize the brokenness of their code and the need to fix it). However, the following patch "mqueue: separate mqueue default value" allows a workaround in the form of new knobs for the default msg queue creation parameters for any software out there that we don't already know about that might rely on this behavior at the moment. Signed-off-by: Doug Ledford <dledford@redhat.com> Cc: Serge E. Hallyn <serue@us.ibm.com> Cc: Amerigo Wang <amwang@redhat.com> Cc: Joe Korty <joe.korty@ccur.com> Cc: Jiri Slaby <jslaby@suse.cz> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-31ipc/mqueue: cleanup definition names and locationsDoug Ledford
Since commit b231cca4381e ("message queues: increase range limits") on Oct 18, 2008, calls to mq_open() that did not pass in an attribute struct and expected to get default values for the size of the queue and the max message size now get the system wide maximums instead of hardwired defaults like they used to get. This was uncovered when one of the earlier patches in this patch set increased the default system wide maximums at the same time it increased the hard ceiling on the system wide maximums (a customer specifically needed the hard ceiling brought back up, the new ceiling that commit b231cca4381e introduced was too low for their production systems). By increasing the default maximums and not realising they were tied to any attempt to create a message queue without an attribute struct, I had inadvertently made it such that all message queue creation attempts without an attribute struct were failing because the new default maximums would create a queue that exceeded the default rlimit for message queue bytes. As a result, the system wide defaults were brought back down to their previous levels, and the system wide ceilings on the maximums were raised to meet the customer's needs. However, the fact that the no attribute struct behavior of mq_open() could be broken by changing the system wide maximums for message queues was seen as fundamentally broken itself. So we hardwired the no attribute case back like it used to be. But, then we realized that on the very off chance that some piece of software in the wild depended on that behavior, we could work around that issue by adding two new knobs to /proc that allowed setting the defaults for message queues created without an attr struct separately from the system wide maximums. What is not an option IMO is to leave the current behavior in place. No piece of software should ever rely on setting the system wide maximums in order to get a desired message queue. Such a reliance would be so fundamentally multitasking OS unfriendly as to not really be tolerable. Fortunately, we don't know of any software in the wild that uses this except for a regression test program that caught the issue in the first place. If there is though, we have made accommodations with the two new /proc knobs (and that's all the accommodations such fundamentally broken software can be allowed).. This patch: The various defines for minimums and maximums of the sysctl controllable mqueue values are scattered amongst different files and named inconsistently. Move them all into ipc_namespace.h and make them have consistent names. Additionally, make the number of queues per namespace also have a minimum and maximum and use the same sysctl function as the other two settable variables. Signed-off-by: Doug Ledford <dledford@redhat.com> Acked-by: Serge E. Hallyn <serue@us.ibm.com> Cc: Amerigo Wang <amwang@redhat.com> Cc: Joe Korty <joe.korty@ccur.com> Cc: Jiri Slaby <jslaby@suse.cz> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-31kexec: export kexec.h to user spacemaximilian attems
Add userspace definitions, guard all relevant kernel structures. While at it document stuff and remove now useless userspace hint. It is easy to add the relevant system call to respective libc's, but it seems pointless to have to duplicate the data structures. This is based on the kexec-tools headers, with the exception of just using int on return (succes or failure) and using size_t instead of 'unsigned long int' for the number of segments argument of kexec_load(). Signed-off-by: maximilian attems <max@stro.at> Cc: Simon Horman <horms@verge.net.au> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Haren Myneni <hbabu@us.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-31cpu: introduce clear_tasks_mm_cpumask() helperAnton Vorontsov
Many architectures clear tasks' mm_cpumask like this: read_lock(&tasklist_lock); for_each_process(p) { if (p->mm) cpumask_clear_cpu(cpu, mm_cpumask(p->mm)); } read_unlock(&tasklist_lock); Depending on the context, the code above may have several problems, such as: 1. Working with task->mm w/o getting mm or grabing the task lock is dangerous as ->mm might disappear (exit_mm() assigns NULL under task_lock(), so tasklist lock is not enough). 2. Checking for process->mm is not enough because process' main thread may exit or detach its mm via use_mm(), but other threads may still have a valid mm. This patch implements a small helper function that does things correctly, i.e.: 1. We take the task's lock while whe handle its mm (we can't use get_task_mm()/mmput() pair as mmput() might sleep); 2. To catch exited main thread case, we use find_lock_task_mm(), which walks up all threads and returns an appropriate task (with task lock held). Also, Per Peter Zijlstra's idea, now we don't grab tasklist_lock in the new helper, instead we take the rcu read lock. We can do this because the function is called after the cpu is taken down and marked offline, so no new tasks will get this cpu set in their mm mask. Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org> Cc: Richard Weinberger <richard@nod.at> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Mike Frysinger <vapier@gentoo.org> Cc: Paul Mundt <lethal@linux-sh.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-31cred: remove task_is_dead() from __task_cred() validationOleg Nesterov
Commit 8f92054e7ca1 ("CRED: Fix __task_cred()'s lockdep check and banner comment"): add the following validation condition: task->exit_state >= 0 to permit the access if the target task is dead and therefore unable to change its own credentials. OK, but afaics currently this can only help wait_task_zombie() which calls __task_cred() without rcu lock. Remove this validation and change wait_task_zombie() to use task_uid() instead. This means we do rcu_read_lock() only to shut up the lockdep, but we already do the same in, say, wait_task_stopped(). task_is_dead() should die, task->exit_state != 0 means that this task has passed exit_notify(), only do_wait-like code paths should use this. Unfortunately, we can't kill task_is_dead() right now, it has already acquired buggy users in drivers/staging. The fix already exists. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: David Howells <dhowells@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-31kmod: move call_usermodehelper_fns() to .c file and unexport all it's helpersBoaz Harrosh
If we move call_usermodehelper_fns() to kmod.c file and EXPORT_SYMBOL it we can avoid exporting all it's helper functions: call_usermodehelper_setup call_usermodehelper_setfns call_usermodehelper_exec And make all of them static to kmod.c Since the optimizer will see all these as a single call site it will inline them inside call_usermodehelper_fns(). So we loose the call to _fns but gain 3 calls to the helpers. (Not that it matters) Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-31kmod: unexport call_usermodehelper_freeinfo()Boaz Harrosh
call_usermodehelper_freeinfo() is not used outside of kmod.c. So unexport it, and make it static to kmod.c Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-31fat: introduce special inode for managing the FSINFO blockArtem Bityutskiy
This is patchset makes fatfs stop using the VFS '->write_super()' method for writing out the FSINFO block. The final goal is to get rid of the 'sync_supers()' kernel thread. This kernel thread wakes up every 5 seconds (by default) and calls '->write_super()' for all mounted file-systems. And the bad thing is that this is done even if all the superblocks are clean. Moreover, some file-systems do not even need this end they do not register the '->write_super()' method at all (e.g., btrfs). So 'sync_supers()' most often just generates useless wake-ups and wastes power. I am trying to make all file-systems independent of '->write_super()' and plan to remove 'sync_supers()' and '->write_super' completely once there are no more users. The '->write_supers()' method is mostly used by baroque file-systems like hfs, udf, etc. Modern file-systems like btrfs and xfs do not use it. This justifies removing this stuff from VFS completely and make every FS self-manage own superblock. Tested with xfstests. This patch: Preparation for further changes. It introduces a special inode ('fsinfo_inode') in FAT file-system which we'll later use for managing the FSINFO block. Note, this there is already one special inode ('fat_inode') which is used for managing the FAT tables. Introduce new 'MSDOS_FSINFO_INO' constant for this special inode. It is safe to do because FAT file-system does not store inode numbers on the media but generates them run-time. I've also cleaned up the comment to existing 'MSDOS_ROOT_INO' constant, while on it. Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-31vsprintf: further optimize decimal conversionDenys Vlasenko
Previous code was using optimizations which were developed to work well even on narrow-word CPUs (by today's standards). But Linux runs only on 32-bit and wider CPUs. We can use that. First: using 32x32->64 multiply and trivial 32-bit shift, we can correctly divide by 10 much larger numbers, and thus we can print groups of 9 digits instead of groups of 5 digits. Next: there are two algorithms to print larger numbers. One is generic: divide by 1000000000 and repeatedly print groups of (up to) 9 digits. It's conceptually simple, but requires an (unsigned long long) / 1000000000 division. Second algorithm splits 64-bit unsigned long long into 16-bit chunks, manipulates them cleverly and generates groups of 4 decimal digits. It so happens that it does NOT require long long division. If long is > 32 bits, division of 64-bit values is relatively easy, and we will use the first algorithm. If long long is > 64 bits (strange architecture with VERY large long long), second algorithm can't be used, and we again use the first one. Else (if long is 32 bits and long long is 64 bits) we use second one. And third: there is a simple optimization which takes fast path not only for zero as was done before, but for all one-digit numbers. In all tested cases new code is faster than old one, in many cases by 30%, in few cases by more than 50% (for example, on x86-32, conversion of 12345678). Code growth is ~0 in 32-bit case and ~130 bytes in 64-bit case. This patch is based upon an original from Michal Nazarewicz. [akpm@linux-foundation.org: checkpatch fixes] Signed-off-by: Michal Nazarewicz <mina86@mina86.com> Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com> Cc: Douglas W Jones <jones@cs.uiowa.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-31introduce SIZE_MAXXi Wang
ULONG_MAX is often used to check for integer overflow when calculating allocation size. While ULONG_MAX happens to work on most systems, there is no guarantee that `size_t' must be the same size as `long'. This patch introduces SIZE_MAX, the maximum value of `size_t', to improve portability and readability for allocation size validation. Signed-off-by: Xi Wang <xi.wang@gmail.com> Acked-by: Alex Elder <elder@dreamhost.com> Cc: David Airlie <airlied@linux.ie> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-31nfsd4: move rq_flavor into svc_credJ. Bruce Fields
Move the rq_flavor into struct svc_cred, and use it in setclientid and exchange_id comparisons as well. Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2012-05-31nfsd4: move principal name into svc_credJ. Bruce Fields
Instead of keeping the principal name associated with a request in a structure that's private to auth_gss and using an accessor function, move it to svc_cred. Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2012-05-31SUNRPC: new svc_bind() routine introducedStanislav Kinsbursky
This new routine is responsible for service registration in a specified network context. The idea is to separate service creation from per-net operations. Note also: since registering service with svc_bind() can fail, the service will be destroyed and during destruction it will try to unregister itself from rpcbind. In this case unregistration has to be skipped. Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>