From 2ef9481e666b4654159ac9f847e6963809e3c470 Mon Sep 17 00:00:00 2001 From: Jon Mason Date: Mon, 23 Jan 2006 10:58:20 -0600 Subject: [PATCH] powerpc: trivial: modify comments to refer to new location of files This patch removes all self references and fixes references to files in the now defunct arch/ppc64 tree. I think this accomplishes everything wanted, though there might be a few references I missed. Signed-off-by: Jon Mason Signed-off-by: Paul Mackerras --- kernel/auditsc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/auditsc.c b/kernel/auditsc.c index 685c25175d96..3e376202dd48 100644 --- a/kernel/auditsc.c +++ b/kernel/auditsc.c @@ -958,7 +958,7 @@ void audit_syscall_entry(struct task_struct *tsk, int arch, int major, * * i386 no * x86_64 no - * ppc64 yes (see arch/ppc64/kernel/misc.S) + * ppc64 yes (see arch/powerpc/platforms/iseries/misc.S) * * This also happens with vm86 emulation in a non-nested manner * (entries without exits), so this case must be caught. -- cgit v1.2.3 From 1fa44ecad2b86475e038aed81b0bf333fa484f8b Mon Sep 17 00:00:00 2001 From: James Bottomley Date: Thu, 23 Feb 2006 12:43:43 -0600 Subject: [SCSI] add execute_in_process_context() API We have several points in the SCSI stack (primarily for our device functions) where we need to guarantee process context, but (given the place where the last reference was released) we cannot guarantee this. This API gets around the issue by executing the function directly if the caller has process context, but scheduling a workqueue to execute in process context if the caller doesn't have it. 
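The run-now-or-defer behaviour described above can be modelled in userspace. This is only a sketch under stated assumptions: `fake_in_interrupt`, `struct execute_work_model`, `execute_in_process_context_model()`, `run_pending_work()` and `bump()` are hypothetical names invented for illustration, and a one-slot structure stands in for the kernel workqueue. It is not the kernel implementation.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's in_interrupt(); set by the caller. */
static bool fake_in_interrupt;

/* Hypothetical one-slot "workqueue": holds a single deferred call. */
struct execute_work_model {
	void (*fn)(void *data);
	void *data;
	bool pending;
};

/*
 * Run fn immediately when "process context" is available, otherwise
 * store it for later. Returns 0 if executed, 1 if scheduled, which is
 * the return convention documented in the patch.
 */
static int execute_in_process_context_model(void (*fn)(void *data), void *data,
					    struct execute_work_model *ew)
{
	if (!fake_in_interrupt) {
		fn(data);
		return 0;
	}

	ew->fn = fn;
	ew->data = data;
	ew->pending = true;
	return 1;
}

/* Models the worker draining the queue later, in process context. */
static void run_pending_work(struct execute_work_model *ew)
{
	if (ew->pending) {
		ew->pending = false;
		ew->fn(ew->data);
	}
}

/* Example callback: counts how many times it has run. */
static void bump(void *data)
{
	++*(int *)data;
}
```

As in the patch, the caller owns the storage passed as the last argument, so it has to stay valid until the deferred call has run.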
Signed-off-by: James Bottomley --- kernel/workqueue.c | 29 +++++++++++++++++++++++++++++ 1 file changed, 29 insertions(+) (limited to 'kernel') diff --git a/kernel/workqueue.c b/kernel/workqueue.c index b052e2c4c710..e9e464a90376 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -27,6 +27,7 @@ #include #include #include +#include /* * The per-CPU workqueue (if single thread, we always use the first @@ -476,6 +477,34 @@ void cancel_rearming_delayed_work(struct work_struct *work) } EXPORT_SYMBOL(cancel_rearming_delayed_work); +/** + * execute_in_process_context - reliably execute the routine with user context + * @fn: the function to execute + * @data: data to pass to the function + * @ew: guaranteed storage for the execute work structure (must + * be available when the work executes) + * + * Executes the function immediately if process context is available, + * otherwise schedules the function for delayed execution. + * + * Returns: 0 - function was executed + * 1 - function was scheduled for execution + */ +int execute_in_process_context(void (*fn)(void *data), void *data, + struct execute_work *ew) +{ + if (!in_interrupt()) { + fn(data); + return 0; + } + + INIT_WORK(&ew->work, fn, data); + schedule_work(&ew->work); + + return 1; +} +EXPORT_SYMBOL_GPL(execute_in_process_context); + int keventd_up(void) { return keventd_wq != NULL; -- cgit v1.2.3 From f9a3879abf2f1a27c39915e6074b8ff15a24cb55 Mon Sep 17 00:00:00 2001 From: GOTO Masanori Date: Mon, 13 Mar 2006 21:20:44 -0800 Subject: [PATCH] Fix sigaltstack corruption among cloned threads This patch fixes alternate signal stack corruption among cloned threads with CLONE_SIGHAND (and CLONE_VM) for linux-2.6.16-rc6. The value of alternate signal stack is currently inherited after a call of clone(... CLONE_SIGHAND | CLONE_VM). 
But if sigaltstack is set by a parent thread, and multiple cloned child threads (plus the parent thread) then run signal handlers at the same time, some threads may conflict because they share the same alternate signal stack region. Eventually they get SIGSEGV. It's an undesirable race condition. Note that child threads created from NPTL pthread_create() also hit this conflict when the parent thread uses sigaltstack, without my patch. To fix this problem, this patch clears the child threads' sigaltstack information, as exec() does. This behavior follows the SUSv3 specification. In SUSv3, pthread_create() says "The alternate stack shall not be inherited (when new threads are initialized)". It means that sigaltstack should be cleared when sigaltstack memory space is shared by cloned threads with CLONE_SIGHAND. Note that I chose the "if (clone_flags & CLONE_SIGHAND)" line because: - Without the clone_flags test, fork() would no longer inherit sigaltstack. - CLONE_VM is another choice, but then vfork() would not inherit sigaltstack. - CLONE_SIGHAND implies CLONE_VM, and it looks suitable. - CLONE_THREAD is another candidate, and includes CLONE_SIGHAND + CLONE_VM, but this flag has slightly different semantics. I decided to use CLONE_SIGHAND. [ Changed to test for CLONE_VM && !CLONE_VFORK after discussion --Linus ] Signed-off-by: GOTO Masanori Cc: Roland McGrath Cc: Ingo Molnar Acked-by: Linus Torvalds Cc: Ulrich Drepper Cc: Jakub Jelinek Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/fork.c | 6 ++++++ 1 file changed, 6 insertions(+) (limited to 'kernel') diff --git a/kernel/fork.c b/kernel/fork.c index a8eab86de7f1..ccdfbb16c86d 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1061,6 +1061,12 @@ static task_t *copy_process(unsigned long clone_flags, */ p->clear_child_tid = (clone_flags & CLONE_CHILD_CLEARTID) ? 
child_tidptr: NULL; + /* + * sigaltstack should be cleared when sharing the same VM + */ + if ((clone_flags & (CLONE_VM|CLONE_VFORK)) == CLONE_VM) + p->sas_ss_sp = p->sas_ss_size = 0; + /* * Syscall tracing should be turned off in the child regardless * of CLONE_PTRACE. -- cgit v1.2.3 From e0e8eb54d8ae0c4cfd1d297f6351b08a7f635c5f Mon Sep 17 00:00:00 2001 From: "Eric W. Biederman" Date: Thu, 16 Mar 2006 10:31:38 -0700 Subject: [PATCH] unshare: Use rcu_assign_pointer when setting sighand The sighand pointer only needs the rcu_read_lock on the read side. So only depending on task_lock protection when setting this pointer is not enough. We also need a memory barrier to ensure the initialization is seen first. Use rcu_assign_pointer as it does this for us, and clearly documents that we are setting an rcu readable pointer. Signed-off-by: Eric W. Biederman Acked-by: Paul E. McKenney Signed-off-by: Linus Torvalds --- kernel/fork.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/fork.c b/kernel/fork.c index ccdfbb16c86d..46060cb24af0 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1569,7 +1569,7 @@ asmlinkage long sys_unshare(unsigned long unshare_flags) if (new_sigh) { sigh = current->sighand; - current->sighand = new_sigh; + rcu_assign_pointer(current->sighand, new_sigh); new_sigh = sigh; } -- cgit v1.2.3 From 67890d7084085e29c51afa2514036d42643fd3cf Mon Sep 17 00:00:00 2001 From: Christoph Lameter Date: Thu, 16 Mar 2006 23:04:00 -0800 Subject: [PATCH] time_interpolator: add __read_mostly The pointer to the current time interpolator and the current list of time interpolators are typically only changed during bootup. Adding __read_mostly takes them away from possibly hot cachelines. 
Signed-off-by: Christoph Lameter Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/timer.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/timer.c b/kernel/timer.c index bf7c4193b936..2410c18dbeb1 100644 --- a/kernel/timer.c +++ b/kernel/timer.c @@ -1354,8 +1354,8 @@ void __init init_timers(void) #ifdef CONFIG_TIME_INTERPOLATION -struct time_interpolator *time_interpolator; -static struct time_interpolator *time_interpolator_list; +struct time_interpolator *time_interpolator __read_mostly; +static struct time_interpolator *time_interpolator_list __read_mostly; static DEFINE_SPINLOCK(time_interpolator_lock); static inline u64 time_interpolator_get_cycles(unsigned int src) -- cgit v1.2.3 From a0a0c28c1a7109d7955815074c52cac079ab3ba5 Mon Sep 17 00:00:00 2001 From: Roman Zippel Date: Thu, 16 Mar 2006 23:04:01 -0800 Subject: [PATCH] posix-timers: fix requeue accounting when signal is ignored When the posix-timer signal is ignored then the timer is rearmed by the callback function. The requeue pending accounting has to be fixed up else the state might be wrong. Signed-off-by: Roman Zippel Signed-off-by: Thomas Gleixner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/posix-timers.c | 1 + 1 file changed, 1 insertion(+) (limited to 'kernel') diff --git a/kernel/posix-timers.c b/kernel/posix-timers.c index 216f574b5ffb..fa895fc2ecf5 100644 --- a/kernel/posix-timers.c +++ b/kernel/posix-timers.c @@ -353,6 +353,7 @@ static int posix_timer_fn(void *data) hrtimer_forward(&timr->it.real.timer, timr->it.real.interval); ret = HRTIMER_RESTART; + ++timr->it_requeue_pending; } } -- cgit v1.2.3 From 2d61b86775a5676a8fba2ba2f0f869564e35c630 Mon Sep 17 00:00:00 2001 From: Oleg Nesterov Date: Sat, 18 Mar 2006 20:41:10 +0300 Subject: [PATCH] disable unshare(CLONE_VM) for now sys_unshare() does mmput(new_mm). This is not enough if we have mm->core_waiters. 
This patch is a temporary fix for soon to be released 2.6.16. Signed-off-by: Oleg Nesterov [ Checked with Uli: "I'm not planning to use unshare(CLONE_VM). It's not needed for any functionality planned so far. What we (as in Red Hat) need unshare() for now is the filesystem side." ] Signed-off-by: Linus Torvalds --- kernel/fork.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/fork.c b/kernel/fork.c index 46060cb24af0..b373322ca497 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1478,9 +1478,7 @@ static int unshare_vm(unsigned long unshare_flags, struct mm_struct **new_mmp) if ((unshare_flags & CLONE_VM) && (mm && atomic_read(&mm->mm_users) > 1)) { - *new_mmp = dup_mm(current); - if (!*new_mmp) - return -ENOMEM; + return -EINVAL; } return 0; -- cgit v1.2.3 From afc847b7ddcf636e524cf5b0de644bd3a9419a8c Mon Sep 17 00:00:00 2001 From: Al Viro Date: Tue, 28 Feb 2006 12:51:55 -0500 Subject: [PATCH] don't do exit_io_context() until we know we won't be doing any IO testcase: mount /dev/sdb10 /mnt touch /mnt/tmp/b umount /mnt mount /dev/sdb10 /mnt rm /mnt/tmp/b --- kernel/exit.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/exit.c b/kernel/exit.c index 531aadca5530..d1e8d500a7e1 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -807,8 +807,6 @@ fastcall NORET_TYPE void do_exit(long code) panic("Attempted to kill the idle task!"); if (unlikely(tsk->pid == 1)) panic("Attempted to kill init!"); - if (tsk->io_context) - exit_io_context(); if (unlikely(current->ptrace & PT_TRACE_EXIT)) { current->ptrace_message = code; @@ -822,6 +820,8 @@ fastcall NORET_TYPE void do_exit(long code) if (unlikely(tsk->flags & PF_EXITING)) { printk(KERN_ALERT "Fixing recursive fault but reboot is needed!\n"); + if (tsk->io_context) + exit_io_context(); set_current_state(TASK_UNINTERRUPTIBLE); schedule(); } @@ -881,6 +881,9 @@ fastcall NORET_TYPE void do_exit(long code) */ 
mutex_debug_check_no_locks_held(tsk); + if (tsk->io_context) + exit_io_context(); + /* PF_DEAD causes final put_task_struct after we schedule. */ preempt_disable(); BUG_ON(tsk->flags & PF_DEAD); -- cgit v1.2.3 From 7e7f8a036b8e2b2a300df016da5e7128c8a9192e Mon Sep 17 00:00:00 2001 From: Jason Baron Date: Tue, 31 Jan 2006 16:56:28 -0500 Subject: [PATCH] make vm86 call audit_syscall_exit hi, The motivation behind the patch below was to address messages in /var/log/messages such as: Jan 31 10:54:15 mets kernel: audit(:0): major=252 name_count=0: freeing multiple contexts (1) Jan 31 10:54:15 mets kernel: audit(:0): major=113 name_count=0: freeing multiple contexts (2) I can reproduce by running 'get-edid' from: http://john.fremlin.de/programs/linux/read-edid/. These messages come about in the log because the vm86 calls do not exit via the normal system call exit paths and thus do not call 'audit_syscall_exit'. The next system call will then free the context for itself and for the vm86 context, thus generating the above messages. This patch addresses the issue by simply adding a call to 'audit_syscall_exit' from the vm86 code. Besides fixing the above error messages, the patch also now allows vm86 system calls to become auditable. This is useful since strace does not appear to properly record the return values from sys_vm86. I think this patch is also a step in the right direction in terms of cleaning up some core auditing code. If we can correct any other paths that do not properly call the audit exit and entry points, then we can also eliminate the notion of context chaining. I've tested this patch by verifying that the log messages no longer appear, and that the audit records for sys_vm86 appear to be correct. Also, 'read_edid' produces identical output. 
thanks, -Jason Signed-off-by: Jason Baron Signed-off-by: Al Viro --- kernel/auditsc.c | 5 ----- 1 file changed, 5 deletions(-) (limited to 'kernel') diff --git a/kernel/auditsc.c b/kernel/auditsc.c index d7e7e637b92a..cfaa4a277f08 100644 --- a/kernel/auditsc.c +++ b/kernel/auditsc.c @@ -966,11 +966,6 @@ void audit_syscall_entry(struct task_struct *tsk, int arch, int major, if (context->in_syscall) { struct audit_context *newctx; -#if defined(__NR_vm86) && defined(__NR_vm86old) - /* vm86 mode should only be entered once */ - if (major == __NR_vm86 || major == __NR_vm86old) - return; -#endif #if AUDIT_DEBUG printk(KERN_ERR "audit(:%d) pid=%d in syscall=%d;" -- cgit v1.2.3 From b0dd25a8263dde3c30b0d7d72a8bd92d7ba0e3f5 Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Tue, 13 Sep 2005 12:47:11 -0700 Subject: [PATCH] AUDIT: kerneldoc for kernel/audit*.c - add kerneldoc for non-static functions; - don't init static data to 0; - limit lines to < 80 columns; - fix long-format style; - delete whitespace at end of some lines; (chrisw: resend and update to current audit-2.6 tree) Signed-off-by: Randy Dunlap Signed-off-by: Chris Wright Signed-off-by: David Woodhouse --- kernel/audit.c | 134 ++++++++++++++++++++++++++++++++++++++----------- kernel/auditsc.c | 150 ++++++++++++++++++++++++++++++++++++++++++++++++------- 2 files changed, 238 insertions(+), 46 deletions(-) (limited to 'kernel') diff --git a/kernel/audit.c b/kernel/audit.c index 0a813d2883e5..973ca5a9e0d6 100644 --- a/kernel/audit.c +++ b/kernel/audit.c @@ -72,7 +72,7 @@ static int audit_failure = AUDIT_FAIL_PRINTK; * contains the (non-zero) pid. */ int audit_pid; -/* If audit_limit is non-zero, limit the rate of sending audit records +/* If audit_rate_limit is non-zero, limit the rate of sending audit records * to that number per second. This prevents DoS attacks, but results in * audit records being dropped. 
*/ static int audit_rate_limit; @@ -102,7 +102,7 @@ static struct sock *audit_sock; * than AUDIT_MAXFREE are in use, the audit buffer is freed instead of * being placed on the freelist). */ static DEFINE_SPINLOCK(audit_freelist_lock); -static int audit_freelist_count = 0; +static int audit_freelist_count; static LIST_HEAD(audit_freelist); static struct sk_buff_head audit_skb_queue; @@ -186,8 +186,14 @@ static inline int audit_rate_check(void) return retval; } -/* Emit at least 1 message per second, even if audit_rate_check is - * throttling. */ +/** + * audit_log_lost - conditionally log lost audit message event + * @message: the message stating reason for lost audit message + * + * Emit at least 1 message per second, even if audit_rate_check is + * throttling. + * Always increment the lost messages counter. +*/ void audit_log_lost(const char *message) { static unsigned long last_msg = 0; @@ -218,7 +224,6 @@ void audit_log_lost(const char *message) audit_backlog_limit); audit_panic(message); } - } static int audit_set_rate_limit(int limit, uid_t loginuid) @@ -302,6 +307,19 @@ static int kauditd_thread(void *dummy) } } +/** + * audit_send_reply - send an audit reply message via netlink + * @pid: process id to send reply to + * @seq: sequence number + * @type: audit message type + * @done: done (last) flag + * @multi: multi-part message flag + * @payload: payload data + * @size: payload size + * + * Allocates an skb, builds the netlink message, and sends it to the pid. + * No failure notifications. 
+ */ void audit_send_reply(int pid, int seq, int type, int done, int multi, void *payload, int size) { @@ -376,7 +394,8 @@ static int audit_receive_msg(struct sk_buff *skb, struct nlmsghdr *nlh) if (err) return err; - /* As soon as there's any sign of userspace auditd, start kauditd to talk to it */ + /* As soon as there's any sign of userspace auditd, + * start kauditd to talk to it */ if (!kauditd_task) kauditd_task = kthread_run(kauditd_thread, NULL, "kauditd"); if (IS_ERR(kauditd_task)) { @@ -469,9 +488,11 @@ static int audit_receive_msg(struct sk_buff *skb, struct nlmsghdr *nlh) return err < 0 ? err : 0; } -/* Get message from skb (based on rtnetlink_rcv_skb). Each message is +/* + * Get message from skb (based on rtnetlink_rcv_skb). Each message is * processed by audit_receive_msg. Malformed skbs with wrong length are - * discarded silently. */ + * discarded silently. + */ static void audit_receive_skb(struct sk_buff *skb) { int err; @@ -600,7 +621,10 @@ err: return NULL; } -/* Compute a serial number for the audit record. Audit records are +/** + * audit_serial - compute a serial number for the audit record + * + * Compute a serial number for the audit record. Audit records are * written to user-space as soon as they are generated, so a complete * audit record may be written in several pieces. The timestamp of the * record and this serial number are used by the user-space tools to @@ -612,8 +636,8 @@ err: * audit context (for those records that have a context), and emit them * all at syscall exit. However, this could delay the reporting of * significant errors until syscall exit (or never, if the system - * halts). */ - + * halts). + */ unsigned int audit_serial(void) { static spinlock_t serial_lock = SPIN_LOCK_UNLOCKED; @@ -649,6 +673,21 @@ static inline void audit_get_stamp(struct audit_context *ctx, * will be written at syscall exit. If there is no associated task, tsk * should be NULL. 
*/ +/** + * audit_log_start - obtain an audit buffer + * @ctx: audit_context (may be NULL) + * @gfp_mask: type of allocation + * @type: audit message type + * + * Returns audit_buffer pointer on success or NULL on error. + * + * Obtain an audit buffer. This routine does locking to obtain the + * audit buffer, but then no locking is required for calls to + * audit_log_*format. If the task (ctx) is a task that is currently in a + * syscall, then the syscall is marked as auditable and an audit record + * will be written at syscall exit. If there is no associated task, then + * task context (ctx) should be NULL. + */ struct audit_buffer *audit_log_start(struct audit_context *ctx, gfp_t gfp_mask, int type) { @@ -713,6 +752,7 @@ struct audit_buffer *audit_log_start(struct audit_context *ctx, gfp_t gfp_mask, /** * audit_expand - expand skb in the audit buffer * @ab: audit_buffer + * @extra: space to add at tail of the skb * * Returns 0 (no space) on failed expansion, or available space if * successful. @@ -729,10 +769,12 @@ static inline int audit_expand(struct audit_buffer *ab, int extra) return skb_tailroom(skb); } -/* Format an audit message into the audit buffer. If there isn't enough +/* + * Format an audit message into the audit buffer. If there isn't enough * room in the audit buffer, more room will be allocated and vsnprint * will be called a second time. Currently, we assume that a printk - * can't format message larger than 1024 bytes, so we don't either. */ + * can't format message larger than 1024 bytes, so we don't either. + */ static void audit_log_vformat(struct audit_buffer *ab, const char *fmt, va_list args) { @@ -757,7 +799,8 @@ static void audit_log_vformat(struct audit_buffer *ab, const char *fmt, /* The printk buffer is 1024 bytes long, so if we get * here and AUDIT_BUFSIZ is at least 1024, then we can * log everything that printk could have logged. 
*/ - avail = audit_expand(ab, max_t(unsigned, AUDIT_BUFSIZ, 1+len-avail)); + avail = audit_expand(ab, + max_t(unsigned, AUDIT_BUFSIZ, 1+len-avail)); if (!avail) goto out; len = vsnprintf(skb->tail, avail, fmt, args2); @@ -768,8 +811,14 @@ out: return; } -/* Format a message into the audit buffer. All the work is done in - * audit_log_vformat. */ +/** + * audit_log_format - format a message into the audit buffer. + * @ab: audit_buffer + * @fmt: format string + * @...: optional parameters matching @fmt string + * + * All the work is done in audit_log_vformat. + */ void audit_log_format(struct audit_buffer *ab, const char *fmt, ...) { va_list args; @@ -781,9 +830,18 @@ void audit_log_format(struct audit_buffer *ab, const char *fmt, ...) va_end(args); } -/* This function will take the passed buf and convert it into a string of - * ascii hex digits. The new string is placed onto the skb. */ -void audit_log_hex(struct audit_buffer *ab, const unsigned char *buf, +/** + * audit_log_hex - convert a buffer to hex and append it to the audit skb + * @ab: the audit_buffer + * @buf: buffer to convert to hex + * @len: length of @buf to be converted + * + * No return value; failure to expand is silently ignored. + * + * This function will take the passed buf and convert it into a string of + * ascii hex digits. The new string is placed onto the skb. 
+ */ +void audit_log_hex(struct audit_buffer *ab, const unsigned char *buf, size_t len) { int i, avail, new_len; @@ -812,10 +870,16 @@ void audit_log_hex(struct audit_buffer *ab, const unsigned char *buf, skb_put(skb, len << 1); /* new string is twice the old string */ } -/* This code will escape a string that is passed to it if the string - * contains a control character, unprintable character, double quote mark, +/** + * audit_log_untrustedstring - log a string that may contain random characters + * @ab: audit_buffer + * @string: string to be logged + * + * This code will escape a string that is passed to it if the string + * contains a control character, unprintable character, double quote mark, * or a space. Unescaped strings will start and end with a double quote mark. - * Strings that are escaped are printed in hex (2 digits per char). */ + * Strings that are escaped are printed in hex (2 digits per char). + */ void audit_log_untrustedstring(struct audit_buffer *ab, const char *string) { const unsigned char *p = string; @@ -854,10 +918,15 @@ void audit_log_d_path(struct audit_buffer *ab, const char *prefix, kfree(path); } -/* The netlink_* functions cannot be called inside an irq context, so - * the audit buffer is places on a queue and a tasklet is scheduled to +/** + * audit_log_end - end one audit record + * @ab: the audit_buffer + * + * The netlink_* functions cannot be called inside an irq context, so + * the audit buffer is placed on a queue and a tasklet is scheduled to * remove them from the queue outside the irq context. May be called in - * any context. */ + * any context. + */ void audit_log_end(struct audit_buffer *ab) { if (!ab) @@ -878,9 +947,18 @@ void audit_log_end(struct audit_buffer *ab) audit_buffer_free(ab); } -/* Log an audit record. This is a convenience function that calls - * audit_log_start, audit_log_vformat, and audit_log_end. It may be - * called in any context. 
*/ +/** + * audit_log - Log an audit record + * @ctx: audit context + * @gfp_mask: type of allocation + * @type: audit message type + * @fmt: format string to use + * @...: variable parameters matching the format string + * + * This is a convenience function that calls audit_log_start, + * audit_log_vformat, and audit_log_end. It may be called + * in any context. + */ void audit_log(struct audit_context *ctx, gfp_t gfp_mask, int type, const char *fmt, ...) { diff --git a/kernel/auditsc.c b/kernel/auditsc.c index cfaa4a277f08..51a4f58a4d81 100644 --- a/kernel/auditsc.c +++ b/kernel/auditsc.c @@ -330,6 +330,15 @@ static int audit_list_rules(void *_dest) return 0; } +/** + * audit_receive_filter - apply all rules to the specified message type + * @type: audit message type + * @pid: target pid for netlink audit messages + * @uid: target uid for netlink audit messages + * @seq: netlink audit message sequence (serial) number + * @data: payload data + * @loginuid: loginuid of sender + */ int audit_receive_filter(int type, int pid, int uid, int seq, void *data, uid_t loginuid) { @@ -527,7 +536,7 @@ static enum audit_state audit_filter_task(struct task_struct *tsk) /* At syscall entry and exit time, this filter is called if the * audit_state is not low enough that auditing cannot take place, but is * also not high enough that we already know we have to write an audit - * record (i.e., the state is AUDIT_SETUP_CONTEXT or AUDIT_BUILD_CONTEXT). + * record (i.e., the state is AUDIT_SETUP_CONTEXT or AUDIT_BUILD_CONTEXT). */ static enum audit_state audit_filter_syscall(struct task_struct *tsk, struct audit_context *ctx, @@ -721,10 +730,15 @@ static inline struct audit_context *audit_alloc_context(enum audit_state state) return context; } -/* Filter on the task information and allocate a per-task audit context +/** + * audit_alloc - allocate an audit context block for a task + * @tsk: task + * + * Filter on the task information and allocate a per-task audit context * if necessary. 
Doing so turns on system call auditing for the * specified task. This is called from copy_process, so no lock is - * needed. */ + * needed. + */ int audit_alloc(struct task_struct *tsk) { struct audit_context *context; @@ -911,8 +925,12 @@ static void audit_log_exit(struct audit_context *context, gfp_t gfp_mask) } } -/* Free a per-task audit context. Called from copy_process and - * __put_task_struct. */ +/** + * audit_free - free a per-task audit context + * @tsk: task whose audit context block to free + * + * Called from copy_process and __put_task_struct. + */ void audit_free(struct task_struct *tsk) { struct audit_context *context; @@ -934,13 +952,24 @@ void audit_free(struct task_struct *tsk) audit_free_context(context); } -/* Fill in audit context at syscall entry. This only happens if the +/** + * audit_syscall_entry - fill in an audit record at syscall entry + * @tsk: task being audited + * @arch: architecture type + * @major: major syscall type (function) + * @a1: additional syscall register 1 + * @a2: additional syscall register 2 + * @a3: additional syscall register 3 + * @a4: additional syscall register 4 + * + * Fill in audit context at syscall entry. This only happens if the * audit context was created when the task was created and the state or * filters demand the audit context be built. If the state from the * per-task filter or from the per-syscall filter is AUDIT_RECORD_CONTEXT, * then the record will be written at syscall exit time (otherwise, it * will only be written if another part of the kernel requests that it - * be written). */ + * be written). 
+ */ void audit_syscall_entry(struct task_struct *tsk, int arch, int major, unsigned long a1, unsigned long a2, unsigned long a3, unsigned long a4) @@ -950,7 +979,8 @@ void audit_syscall_entry(struct task_struct *tsk, int arch, int major, BUG_ON(!context); - /* This happens only on certain architectures that make system + /* + * This happens only on certain architectures that make system * calls in kernel_thread via the entry.S interface, instead of * with direct calls. (If you are porting to a new * architecture, hitting this condition can indicate that you @@ -1009,11 +1039,18 @@ void audit_syscall_entry(struct task_struct *tsk, int arch, int major, context->auditable = !!(state == AUDIT_RECORD_CONTEXT); } -/* Tear down after system call. If the audit context has been marked as +/** + * audit_syscall_exit - deallocate audit context after a system call + * @tsk: task being audited + * @valid: success/failure flag + * @return_code: syscall return value + * + * Tear down after system call. If the audit context has been marked as * auditable (either because of the AUDIT_RECORD_CONTEXT state from * filtering, or because some other part of the kernel write an audit * message), then write out the syscall information. In call cases, - * free the names stored from getname(). */ + * free the names stored from getname(). + */ void audit_syscall_exit(struct task_struct *tsk, int valid, long return_code) { struct audit_context *context; @@ -1048,7 +1085,13 @@ void audit_syscall_exit(struct task_struct *tsk, int valid, long return_code) put_task_struct(tsk); } -/* Add a name to the list. Called from fs/namei.c:getname(). */ +/** + * audit_getname - add a name to the list + * @name: name to add + * + * Add a name to the list of audit names for this context. + * Called from fs/namei.c:getname(). 
+ */ void audit_getname(const char *name) { struct audit_context *context = current->audit_context; @@ -1077,10 +1120,13 @@ void audit_getname(const char *name) } -/* Intercept a putname request. Called from - * include/linux/fs.h:putname(). If we have stored the name from - * getname in the audit context, then we delay the putname until syscall - * exit. */ +/* audit_putname - intercept a putname request + * @name: name to intercept and delay for putname + * + * If we have stored the name from getname in the audit context, + * then we delay the putname until syscall exit. + * Called from include/linux/fs.h:putname(). + */ void audit_putname(const char *name) { struct audit_context *context = current->audit_context; @@ -1117,8 +1163,14 @@ void audit_putname(const char *name) #endif } -/* Store the inode and device from a lookup. Called from - * fs/namei.c:path_lookup(). */ +/** + * audit_inode - store the inode and device from a lookup + * @name: name being audited + * @inode: inode being audited + * @flags: lookup flags (as used in path_lookup()) + * + * Called from fs/namei.c:path_lookup(). + */ void audit_inode(const char *name, const struct inode *inode, unsigned flags) { int idx; @@ -1154,6 +1206,14 @@ void audit_inode(const char *name, const struct inode *inode, unsigned flags) context->names[idx].rdev = inode->i_rdev; } +/** + * auditsc_get_stamp - get local copies of audit_context values + * @ctx: audit_context for the task + * @t: timespec to store time recorded in the audit_context + * @serial: serial value that is recorded in the audit_context + * + * Also sets the context as auditable. + */ void auditsc_get_stamp(struct audit_context *ctx, struct timespec *t, unsigned int *serial) { @@ -1165,6 +1225,15 @@ void auditsc_get_stamp(struct audit_context *ctx, ctx->auditable = 1; } +/** + * audit_set_loginuid - set a task's audit_context loginuid + * @task: task whose audit context is being modified + * @loginuid: loginuid value + * + * Returns 0. 
+ * + * Called (set) from fs/proc/base.c::proc_loginuid_write(). + */ int audit_set_loginuid(struct task_struct *task, uid_t loginuid) { if (task->audit_context) { @@ -1183,11 +1252,26 @@ int audit_set_loginuid(struct task_struct *task, uid_t loginuid) return 0; } +/** + * audit_get_loginuid - get the loginuid for an audit_context + * @ctx: the audit_context + * + * Returns the context's loginuid or -1 if @ctx is NULL. + */ uid_t audit_get_loginuid(struct audit_context *ctx) { return ctx ? ctx->loginuid : -1; } +/** + * audit_ipc_perms - record audit data for ipc + * @qbytes: msgq bytes + * @uid: msgq user id + * @gid: msgq group id + * @mode: msgq mode (permissions) + * + * Returns 0 for success or NULL context or < 0 on error. + */ int audit_ipc_perms(unsigned long qbytes, uid_t uid, gid_t gid, mode_t mode) { struct audit_aux_data_ipcctl *ax; @@ -1211,6 +1295,13 @@ int audit_ipc_perms(unsigned long qbytes, uid_t uid, gid_t gid, mode_t mode) return 0; } +/** + * audit_socketcall - record audit data for sys_socketcall + * @nargs: number of args + * @args: args array + * + * Returns 0 for success or NULL context or < 0 on error. + */ int audit_socketcall(int nargs, unsigned long *args) { struct audit_aux_data_socketcall *ax; @@ -1232,6 +1323,13 @@ int audit_socketcall(int nargs, unsigned long *args) return 0; } +/** + * audit_sockaddr - record audit data for sys_bind, sys_connect, sys_sendto + * @len: data length in user space + * @a: data address in kernel space + * + * Returns 0 for success or NULL context or < 0 on error. + */ int audit_sockaddr(int len, void *a) { struct audit_aux_data_sockaddr *ax; @@ -1253,6 +1351,15 @@ int audit_sockaddr(int len, void *a) return 0; } +/** + * audit_avc_path - record the granting or denial of permissions + * @dentry: dentry to record + * @mnt: mnt to record + * + * Returns 0 for success or NULL context or < 0 on error. 
+ * + * Called from security/selinux/avc.c::avc_audit() + */ int audit_avc_path(struct dentry *dentry, struct vfsmount *mnt) { struct audit_aux_data_path *ax; @@ -1274,6 +1381,14 @@ int audit_avc_path(struct dentry *dentry, struct vfsmount *mnt) return 0; } +/** + * audit_signal_info - record signal info for shutting down audit subsystem + * @sig: signal value + * @t: task being signaled + * + * If the audit subsystem is being terminated, record the task (pid) + * and uid that is doing that. + */ void audit_signal_info(int sig, struct task_struct *t) { extern pid_t audit_sig_pid; @@ -1290,4 +1405,3 @@ void audit_signal_info(int sig, struct task_struct *t) } } } - -- cgit v1.2.3 From b63862f46547487388e582e8ac9083830d34f058 Mon Sep 17 00:00:00 2001 From: Dustin Kirkland Date: Thu, 3 Nov 2005 15:41:46 +0000 Subject: [PATCH] Filter rule comparators Currently, audit only supports the "=" and "!=" operators in the -F filter rules. This patch reworks the support for "=" and "!=", and adds support for ">", ">=", "<", and "<=". This turned out to be a pretty clean and simple process. I ended up using the high order bits of the "field", as suggested by Steve and Amy. This allowed for no changes whatsoever to the netlink communications. See the documentation within the patch in the include/linux/audit.h area, where there is a table that explains the reasoning of the bitmask assignments clearly. The patch adds a new function, audit_comparator(left, op, right). This function will perform the specified comparison (op, which defaults to "==" for backward compatibility) between two values (left and right). If the negate bit is on, it will negate whatever that result was. This value is returned.
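For illustration only, a minimal user-space sketch of the scheme described above: the operator is carried in the high-order bits of the field word and is masked off before the field itself is examined. The EX_* names and bit values are hypothetical placeholders chosen for this sketch, not the definitions from include/linux/audit.h, and ex_comparator() only mirrors the general shape of audit_comparator().

```c
#include <stdint.h>

/* Hypothetical operator encodings, chosen for this sketch only. */
#define EX_LESS_THAN      0x10000000u
#define EX_GREATER_THAN   0x20000000u
#define EX_NOT_EQUAL      0x30000000u
#define EX_EQUAL          0x40000000u
#define EX_LESS_EQUAL     (EX_LESS_THAN | EX_EQUAL)
#define EX_GREATER_EQUAL  (EX_GREATER_THAN | EX_EQUAL)
#define EX_OPERATORS      0x70000000u	/* mask covering every operator bit */

/* Returns 1 on match, 0 on mismatch, -1 for an unknown operator. */
static int ex_comparator(uint32_t left, uint32_t op, uint32_t right)
{
	switch (op) {
	case EX_EQUAL:
		return left == right;
	case EX_NOT_EQUAL:
		return left != right;
	case EX_LESS_THAN:
		return left < right;
	case EX_LESS_EQUAL:
		return left <= right;
	case EX_GREATER_THAN:
		return left > right;
	case EX_GREATER_EQUAL:
		return left >= right;
	default:
		return -1;
	}
}

/* Strip the operator bits to recover the plain field identifier. */
static uint32_t ex_field_id(uint32_t packed)
{
	return packed & ~EX_OPERATORS;
}

/*
 * Compare 'actual' against 'value' using the operator packed into the
 * field word. A word with no operator bits set is treated as "==",
 * which is how older rules without an operator keep working.
 */
static int ex_match(uint32_t packed, uint32_t actual, uint32_t value)
{
	uint32_t op = packed & EX_OPERATORS;

	if (op == 0)
		op = EX_EQUAL;
	return ex_comparator(actual, op, value);
}
```

With these placeholder values, a rule word of (11 | EX_GREATER_EQUAL) still identifies field 11 once masked, so the existing per-field switch statements need no new cases.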
Signed-off-by: Dustin Kirkland Signed-off-by: David Woodhouse --- kernel/auditsc.c | 117 +++++++++++++++++++++++++++++++++++-------------------- 1 file changed, 75 insertions(+), 42 deletions(-) (limited to 'kernel') diff --git a/kernel/auditsc.c b/kernel/auditsc.c index 51a4f58a4d81..95076fa12202 100644 --- a/kernel/auditsc.c +++ b/kernel/auditsc.c @@ -2,6 +2,7 @@ * Handles all system-call specific auditing features. * * Copyright 2003-2004 Red Hat Inc., Durham, North Carolina. + * Copyright (C) 2005 IBM Corporation * All Rights Reserved. * * This program is free software; you can redistribute it and/or modify @@ -27,6 +28,9 @@ * this file -- see entry.S) is based on a GPL'd patch written by * okir@suse.de and Copyright 2003 SuSE Linux AG. * + * The support of additional filter rules compares (>, <, >=, <=) was + * added by Dustin Kirkland , 2005. + * */ #include @@ -252,6 +256,7 @@ static inline int audit_add_rule(struct audit_rule *rule, struct list_head *list) { struct audit_entry *entry; + int i; /* Do not use the _rcu iterator here, since this is the only * addition routine. 
*/ @@ -261,6 +266,16 @@ static inline int audit_add_rule(struct audit_rule *rule, } } + for (i = 0; i < rule->field_count; i++) { + if (rule->fields[i] & AUDIT_UNUSED_BITS) + return -EINVAL; + if ( rule->fields[i] & AUDIT_NEGATE ) + rule->fields[i] |= AUDIT_NOT_EQUAL; + else if ( (rule->fields[i] & AUDIT_OPERATORS) == 0 ) + rule->fields[i] |= AUDIT_EQUAL; + rule->fields[i] &= (~AUDIT_NEGATE); + } + if (!(entry = kmalloc(sizeof(*entry), GFP_KERNEL))) return -ENOMEM; if (audit_copy_rule(&entry->rule, rule)) { @@ -394,6 +409,26 @@ int audit_receive_filter(int type, int pid, int uid, int seq, void *data, return err; } +static int audit_comparator(const u32 left, const u32 op, const u32 right) +{ + switch (op) { + case AUDIT_EQUAL: + return (left == right); + case AUDIT_NOT_EQUAL: + return (left != right); + case AUDIT_LESS_THAN: + return (left < right); + case AUDIT_LESS_THAN_OR_EQUAL: + return (left <= right); + case AUDIT_GREATER_THAN: + return (left > right); + case AUDIT_GREATER_THAN_OR_EQUAL: + return (left >= right); + default: + return -EINVAL; + } +} + /* Compare a task_struct with an audit_rule. Return 1 on match, 0 * otherwise. 
*/ static int audit_filter_rules(struct task_struct *tsk, @@ -404,62 +439,63 @@ static int audit_filter_rules(struct task_struct *tsk, int i, j; for (i = 0; i < rule->field_count; i++) { - u32 field = rule->fields[i] & ~AUDIT_NEGATE; + u32 field = rule->fields[i] & ~AUDIT_OPERATORS; + u32 op = rule->fields[i] & AUDIT_OPERATORS; u32 value = rule->values[i]; int result = 0; switch (field) { case AUDIT_PID: - result = (tsk->pid == value); + result = audit_comparator(tsk->pid, op, value); break; case AUDIT_UID: - result = (tsk->uid == value); + result = audit_comparator(tsk->uid, op, value); break; case AUDIT_EUID: - result = (tsk->euid == value); + result = audit_comparator(tsk->euid, op, value); break; case AUDIT_SUID: - result = (tsk->suid == value); + result = audit_comparator(tsk->suid, op, value); break; case AUDIT_FSUID: - result = (tsk->fsuid == value); + result = audit_comparator(tsk->fsuid, op, value); break; case AUDIT_GID: - result = (tsk->gid == value); + result = audit_comparator(tsk->gid, op, value); break; case AUDIT_EGID: - result = (tsk->egid == value); + result = audit_comparator(tsk->egid, op, value); break; case AUDIT_SGID: - result = (tsk->sgid == value); + result = audit_comparator(tsk->sgid, op, value); break; case AUDIT_FSGID: - result = (tsk->fsgid == value); + result = audit_comparator(tsk->fsgid, op, value); break; case AUDIT_PERS: - result = (tsk->personality == value); + result = audit_comparator(tsk->personality, op, value); break; case AUDIT_ARCH: - if (ctx) - result = (ctx->arch == value); + if (ctx) + result = audit_comparator(ctx->arch, op, value); break; case AUDIT_EXIT: if (ctx && ctx->return_valid) - result = (ctx->return_code == value); + result = audit_comparator(ctx->return_code, op, value); break; case AUDIT_SUCCESS: if (ctx && ctx->return_valid) { if (value) - result = (ctx->return_valid == AUDITSC_SUCCESS); + result = audit_comparator(ctx->return_valid, op, AUDITSC_SUCCESS); else - result = (ctx->return_valid == 
AUDITSC_FAILURE); + result = audit_comparator(ctx->return_valid, op, AUDITSC_FAILURE); } break; case AUDIT_DEVMAJOR: if (ctx) { for (j = 0; j < ctx->name_count; j++) { - if (MAJOR(ctx->names[j].dev)==value) { + if (audit_comparator(MAJOR(ctx->names[j].dev), op, value)) { ++result; break; } @@ -469,7 +505,7 @@ static int audit_filter_rules(struct task_struct *tsk, case AUDIT_DEVMINOR: if (ctx) { for (j = 0; j < ctx->name_count; j++) { - if (MINOR(ctx->names[j].dev)==value) { + if (audit_comparator(MINOR(ctx->names[j].dev), op, value)) { ++result; break; } @@ -479,7 +515,7 @@ static int audit_filter_rules(struct task_struct *tsk, case AUDIT_INODE: if (ctx) { for (j = 0; j < ctx->name_count; j++) { - if (ctx->names[j].ino == value) { + if (audit_comparator(ctx->names[j].ino, op, value)) { ++result; break; } @@ -489,19 +525,17 @@ static int audit_filter_rules(struct task_struct *tsk, case AUDIT_LOGINUID: result = 0; if (ctx) - result = (ctx->loginuid == value); + result = audit_comparator(ctx->loginuid, op, value); break; case AUDIT_ARG0: case AUDIT_ARG1: case AUDIT_ARG2: case AUDIT_ARG3: if (ctx) - result = (ctx->argv[field-AUDIT_ARG0]==value); + result = audit_comparator(ctx->argv[field-AUDIT_ARG0], op, value); break; } - if (rule->fields[i] & AUDIT_NEGATE) - result = !result; if (!result) return 0; } @@ -550,49 +584,48 @@ static enum audit_state audit_filter_syscall(struct task_struct *tsk, rcu_read_lock(); if (!list_empty(list)) { - int word = AUDIT_WORD(ctx->major); - int bit = AUDIT_BIT(ctx->major); - - list_for_each_entry_rcu(e, list, list) { - if ((e->rule.mask[word] & bit) == bit - && audit_filter_rules(tsk, &e->rule, ctx, &state)) { - rcu_read_unlock(); - return state; - } - } + int word = AUDIT_WORD(ctx->major); + int bit = AUDIT_BIT(ctx->major); + + list_for_each_entry_rcu(e, list, list) { + if ((e->rule.mask[word] & bit) == bit + && audit_filter_rules(tsk, &e->rule, ctx, &state)) { + rcu_read_unlock(); + return state; + } + } } rcu_read_unlock(); return 
AUDIT_BUILD_CONTEXT; } static int audit_filter_user_rules(struct netlink_skb_parms *cb, - struct audit_rule *rule, - enum audit_state *state) + struct audit_rule *rule, + enum audit_state *state) { int i; for (i = 0; i < rule->field_count; i++) { - u32 field = rule->fields[i] & ~AUDIT_NEGATE; + u32 field = rule->fields[i] & ~AUDIT_OPERATORS; + u32 op = rule->fields[i] & AUDIT_OPERATORS; u32 value = rule->values[i]; int result = 0; switch (field) { case AUDIT_PID: - result = (cb->creds.pid == value); + result = audit_comparator(cb->creds.pid, op, value); break; case AUDIT_UID: - result = (cb->creds.uid == value); + result = audit_comparator(cb->creds.uid, op, value); break; case AUDIT_GID: - result = (cb->creds.gid == value); + result = audit_comparator(cb->creds.gid, op, value); break; case AUDIT_LOGINUID: - result = (cb->loginuid == value); + result = audit_comparator(cb->loginuid, op, value); break; } - if (rule->fields[i] & AUDIT_NEGATE) - result = !result; if (!result) return 0; } -- cgit v1.2.3 From 90d526c074ae5db484388da56c399acf892b6c17 Mon Sep 17 00:00:00 2001 From: Steve Grubb Date: Thu, 3 Nov 2005 15:48:08 +0000 Subject: [PATCH] Define new range of userspace messages. The attached patch updates various items for the new user space messages. Please apply. 
Signed-off-by: Steve Grubb Signed-off-by: David Woodhouse --- kernel/audit.c | 2 ++ 1 file changed, 2 insertions(+) (limited to 'kernel') diff --git a/kernel/audit.c b/kernel/audit.c index 973ca5a9e0d6..6d61dd79a605 100644 --- a/kernel/audit.c +++ b/kernel/audit.c @@ -369,6 +369,7 @@ static int audit_netlink_ok(kernel_cap_t eff_cap, u16 msg_type) break; case AUDIT_USER: case AUDIT_FIRST_USER_MSG...AUDIT_LAST_USER_MSG: + case AUDIT_FIRST_USER_MSG2...AUDIT_LAST_USER_MSG2: if (!cap_raised(eff_cap, CAP_AUDIT_WRITE)) err = -EPERM; break; @@ -449,6 +450,7 @@ static int audit_receive_msg(struct sk_buff *skb, struct nlmsghdr *nlh) break; case AUDIT_USER: case AUDIT_FIRST_USER_MSG...AUDIT_LAST_USER_MSG: + case AUDIT_FIRST_USER_MSG2...AUDIT_LAST_USER_MSG2: if (!audit_enabled && msg_type != AUDIT_USER_AVC) return 0; -- cgit v1.2.3 From f38aa94224c5517a40ba56d453779f70d3229803 Mon Sep 17 00:00:00 2001 From: Amy Griffis Date: Thu, 3 Nov 2005 15:57:06 +0000 Subject: [PATCH] Pass dentry, not just name, in fsnotify creation hooks. The audit hooks (to be added shortly) will want to see dentry->d_inode too, not just the name. Signed-off-by: Amy Griffis Signed-off-by: David Woodhouse --- kernel/auditsc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/auditsc.c b/kernel/auditsc.c index 95076fa12202..55ba331757c5 100644 --- a/kernel/auditsc.c +++ b/kernel/auditsc.c @@ -515,7 +515,7 @@ static int audit_filter_rules(struct task_struct *tsk, case AUDIT_INODE: if (ctx) { for (j = 0; j < ctx->name_count; j++) { - if (audit_comparator(ctx->names[j].ino, op, value)) { + if ( audit_comparator(ctx->names[j].ino, op, value)) { ++result; break; } -- cgit v1.2.3 From 73241ccca0f7786933f1d31b3d86f2456549953a Mon Sep 17 00:00:00 2001 From: Amy Griffis Date: Thu, 3 Nov 2005 16:00:25 +0000 Subject: [PATCH] Collect more inode information during syscall processing. This patch augments the collection of inode info during syscall processing. 
It represents part of the functionality that was provided by the auditfs patch included in RHEL4. Specifically, it: - Collects information for target inodes created or removed during syscalls. Previous code only collects information for the target inode's parent. - Adds the audit_inode() hook to syscalls that operate on a file descriptor (e.g. fchown), enabling audit to do inode filtering for these calls. - Modifies filtering code to check audit context for either an inode # or a parent inode # matching a given rule. - Modifies logging to provide inode # for both parent and child. - Protect debug info from NULL audit_names.name. [AV: folded a later typo fix from the same author] Signed-off-by: Amy Griffis Signed-off-by: David Woodhouse Signed-off-by: Al Viro --- kernel/auditsc.c | 142 +++++++++++++++++++++++++++++++++++++++++++++---------- 1 file changed, 118 insertions(+), 24 deletions(-) (limited to 'kernel') diff --git a/kernel/auditsc.c b/kernel/auditsc.c index 55ba331757c5..73f932b7deb6 100644 --- a/kernel/auditsc.c +++ b/kernel/auditsc.c @@ -2,6 +2,7 @@ * Handles all system-call specific auditing features. * * Copyright 2003-2004 Red Hat Inc., Durham, North Carolina. + * Copyright 2005 Hewlett-Packard Development Company, L.P. * Copyright (C) 2005 IBM Corporation * All Rights Reserved. * @@ -31,11 +32,16 @@ * The support of additional filter rules compares (>, <, >=, <=) was * added by Dustin Kirkland , 2005. * + * Modified by Amy Griffis to collect additional + * filesystem information. 
*/ #include #include #include +#include +#include +#include #include #include #include @@ -97,12 +103,12 @@ enum audit_state { struct audit_names { const char *name; unsigned long ino; + unsigned long pino; dev_t dev; umode_t mode; uid_t uid; gid_t gid; dev_t rdev; - unsigned flags; }; struct audit_aux_data { @@ -515,7 +521,8 @@ static int audit_filter_rules(struct task_struct *tsk, case AUDIT_INODE: if (ctx) { for (j = 0; j < ctx->name_count; j++) { - if ( audit_comparator(ctx->names[j].ino, op, value)) { + if (audit_comparator(ctx->names[j].ino, op, value) || + audit_comparator(ctx->names[j].pino, op, value)) { ++result; break; } @@ -696,17 +703,17 @@ static inline void audit_free_names(struct audit_context *context) #if AUDIT_DEBUG == 2 if (context->auditable ||context->put_count + context->ino_count != context->name_count) { - printk(KERN_ERR "audit.c:%d(:%d): major=%d in_syscall=%d" + printk(KERN_ERR "%s:%d(:%d): major=%d in_syscall=%d" " name_count=%d put_count=%d" " ino_count=%d [NOT freeing]\n", - __LINE__, + __FILE__, __LINE__, context->serial, context->major, context->in_syscall, context->name_count, context->put_count, context->ino_count); for (i = 0; i < context->name_count; i++) printk(KERN_ERR "names[%d] = %p = %s\n", i, context->names[i].name, - context->names[i].name); + context->names[i].name ?: "(null)"); dump_stack(); return; } @@ -932,27 +939,34 @@ static void audit_log_exit(struct audit_context *context, gfp_t gfp_mask) } } for (i = 0; i < context->name_count; i++) { + unsigned long ino = context->names[i].ino; + unsigned long pino = context->names[i].pino; + ab = audit_log_start(context, gfp_mask, AUDIT_PATH); if (!ab) continue; /* audit_panic has been called */ audit_log_format(ab, "item=%d", i); - if (context->names[i].name) { - audit_log_format(ab, " name="); + + audit_log_format(ab, " name="); + if (context->names[i].name) audit_log_untrustedstring(ab, context->names[i].name); - } - audit_log_format(ab, " flags=%x\n", 
context->names[i].flags); - - if (context->names[i].ino != (unsigned long)-1) - audit_log_format(ab, " inode=%lu dev=%02x:%02x mode=%#o" - " ouid=%u ogid=%u rdev=%02x:%02x", - context->names[i].ino, - MAJOR(context->names[i].dev), - MINOR(context->names[i].dev), - context->names[i].mode, - context->names[i].uid, - context->names[i].gid, - MAJOR(context->names[i].rdev), + else + audit_log_format(ab, "(null)"); + + if (pino != (unsigned long)-1) + audit_log_format(ab, " parent=%lu", pino); + if (ino != (unsigned long)-1) + audit_log_format(ab, " inode=%lu", ino); + if ((pino != (unsigned long)-1) || (ino != (unsigned long)-1)) + audit_log_format(ab, " dev=%02x:%02x mode=%#o" + " ouid=%u ogid=%u rdev=%02x:%02x", + MAJOR(context->names[i].dev), + MINOR(context->names[i].dev), + context->names[i].mode, + context->names[i].uid, + context->names[i].gid, + MAJOR(context->names[i].rdev), MINOR(context->names[i].rdev)); audit_log_end(ab); } @@ -1174,7 +1188,7 @@ void audit_putname(const char *name) for (i = 0; i < context->name_count; i++) printk(KERN_ERR "name[%d] = %p = %s\n", i, context->names[i].name, - context->names[i].name); + context->names[i].name ?: "(null)"); } #endif __putname(name); @@ -1204,7 +1218,7 @@ void audit_putname(const char *name) * * Called from fs/namei.c:path_lookup(). 
*/ -void audit_inode(const char *name, const struct inode *inode, unsigned flags) +void __audit_inode(const char *name, const struct inode *inode, unsigned flags) { int idx; struct audit_context *context = current->audit_context; @@ -1230,13 +1244,93 @@ void audit_inode(const char *name, const struct inode *inode, unsigned flags) ++context->ino_count; #endif } - context->names[idx].flags = flags; - context->names[idx].ino = inode->i_ino; context->names[idx].dev = inode->i_sb->s_dev; context->names[idx].mode = inode->i_mode; context->names[idx].uid = inode->i_uid; context->names[idx].gid = inode->i_gid; context->names[idx].rdev = inode->i_rdev; + if ((flags & LOOKUP_PARENT) && (strcmp(name, "/") != 0) && + (strcmp(name, ".") != 0)) { + context->names[idx].ino = (unsigned long)-1; + context->names[idx].pino = inode->i_ino; + } else { + context->names[idx].ino = inode->i_ino; + context->names[idx].pino = (unsigned long)-1; + } +} + +/** + * audit_inode_child - collect inode info for created/removed objects + * @dname: inode's dentry name + * @inode: inode being audited + * @pino: inode number of dentry parent + * + * For syscalls that create or remove filesystem objects, audit_inode + * can only collect information for the filesystem object's parent. + * This call updates the audit context with the child's information. + * Syscalls that create a new filesystem object must be hooked after + * the object is created. Syscalls that remove a filesystem object + * must be hooked prior, in order to capture the target inode during + * unsuccessful attempts. 
+ */ +void __audit_inode_child(const char *dname, const struct inode *inode, + unsigned long pino) +{ + int idx; + struct audit_context *context = current->audit_context; + + if (!context->in_syscall) + return; + + /* determine matching parent */ + if (dname) + for (idx = 0; idx < context->name_count; idx++) + if (context->names[idx].pino == pino) { + const char *n; + const char *name = context->names[idx].name; + int dlen = strlen(dname); + int nlen = name ? strlen(name) : 0; + + if (nlen < dlen) + continue; + + /* disregard trailing slashes */ + n = name + nlen - 1; + while ((*n == '/') && (n > name)) + n--; + + /* find last path component */ + n = n - dlen + 1; + if (n < name) + continue; + else if (n > name) { + if (*--n != '/') + continue; + else + n++; + } + + if (strncmp(n, dname, dlen) == 0) + goto update_context; + } + + /* catch-all in case match not found */ + idx = context->name_count++; + context->names[idx].name = NULL; + context->names[idx].pino = pino; +#if AUDIT_DEBUG + context->ino_count++; +#endif + +update_context: + if (inode) { + context->names[idx].ino = inode->i_ino; + context->names[idx].dev = inode->i_sb->s_dev; + context->names[idx].mode = inode->i_mode; + context->names[idx].uid = inode->i_uid; + context->names[idx].gid = inode->i_gid; + context->names[idx].rdev = inode->i_rdev; + } } /** -- cgit v1.2.3 From c8edc80c8b8c397c53f4f659a05b9ea6208029bf Mon Sep 17 00:00:00 2001 From: Dustin Kirkland Date: Thu, 3 Nov 2005 16:12:36 +0000 Subject: [PATCH] Exclude messages by message type - Add a new, 5th filter called "exclude". - And add a new field AUDIT_MSGTYPE. - Define a new function audit_filter_exclude() that takes a message type as input and examines all rules in the filter. It returns '1' if the message is to be excluded, and '0' otherwise. - Call the audit_filter_exclude() function near the top of audit_log_start() just after asserting audit_initialized. 
If the message type is not to be audited, return NULL very early, before doing a lot of work. [combined with followup fix for bug in original patch, Nov 4, same author] [combined with later renaming AUDIT_FILTER_EXCLUDE->AUDIT_FILTER_TYPE and audit_filter_exclude() -> audit_filter_type()] Signed-off-by: Dustin Kirkland Signed-off-by: David Woodhouse Signed-off-by: Al Viro --- kernel/audit.c | 3 +++ kernel/auditsc.c | 35 ++++++++++++++++++++++++++++++++++- 2 files changed, 37 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/audit.c b/kernel/audit.c index 6d61dd79a605..1c3eb1b12bfc 100644 --- a/kernel/audit.c +++ b/kernel/audit.c @@ -702,6 +702,9 @@ struct audit_buffer *audit_log_start(struct audit_context *ctx, gfp_t gfp_mask, if (!audit_initialized) return NULL; + if (unlikely(audit_filter_type(type))) + return NULL; + if (gfp_mask & __GFP_WAIT) reserve = 0; else diff --git a/kernel/auditsc.c b/kernel/auditsc.c index 73f932b7deb6..31917ac730af 100644 --- a/kernel/auditsc.c +++ b/kernel/auditsc.c @@ -187,7 +187,8 @@ static struct list_head audit_filter_list[AUDIT_NR_FILTERS] = { LIST_HEAD_INIT(audit_filter_list[2]), LIST_HEAD_INIT(audit_filter_list[3]), LIST_HEAD_INIT(audit_filter_list[4]), -#if AUDIT_NR_FILTERS != 5 + LIST_HEAD_INIT(audit_filter_list[5]), +#if AUDIT_NR_FILTERS != 6 #error Fix audit_filter_list initialiser #endif }; @@ -663,6 +664,38 @@ int audit_filter_user(struct netlink_skb_parms *cb, int type) return ret; /* Audit by default */ } +int audit_filter_type(int type) +{ + struct audit_entry *e; + int result = 0; + + rcu_read_lock(); + if (list_empty(&audit_filter_list[AUDIT_FILTER_TYPE])) + goto unlock_and_return; + + list_for_each_entry_rcu(e, &audit_filter_list[AUDIT_FILTER_TYPE], + list) { + struct audit_rule *rule = &e->rule; + int i; + for (i = 0; i < rule->field_count; i++) { + u32 field = rule->fields[i] & ~AUDIT_OPERATORS; + u32 op = rule->fields[i] & AUDIT_OPERATORS; + u32 value = rule->values[i]; + if ( field == 
AUDIT_MSGTYPE ) { + result = audit_comparator(type, op, value); + if (!result) + break; + } + } + if (result) + goto unlock_and_return; + } +unlock_and_return: + rcu_read_unlock(); + return result; +} + + /* This should be called with task_lock() held. */ static inline struct audit_context *audit_get_context(struct task_struct *tsk, int return_valid, -- cgit v1.2.3 From 8c8570fb8feef2bc166bee75a85748b25cda22d9 Mon Sep 17 00:00:00 2001 From: Dustin Kirkland Date: Thu, 3 Nov 2005 17:15:16 +0000 Subject: [PATCH] Capture selinux subject/object context information. This patch extends existing audit records with subject/object context information. Audit records associated with filesystem inodes, ipc, and tasks now contain SELinux label information in the field "subj" if the item is performing the action, or in "obj" if the item is the receiver of an action. These labels are collected via hooks in SELinux and appended to the appropriate record in the audit code. This additional information is required for Common Criteria Labeled Security Protection Profile (LSPP). 
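The label-collection helpers added by this patch follow a two-call pattern: ask the hook for the required length with a NULL buffer, allocate that much, then call again to fill the buffer. A minimal user-space sketch of that pattern, assuming a hypothetical ex_get_label() in place of the security_*() hooks, malloc()/free() in place of kmalloc()/kfree(), and placeholder label text:

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical stand-in for a security hook. With buf == NULL it
 * reports how many bytes are needed (including the trailing NUL);
 * otherwise it copies the label into buf. Negative means error.
 */
static long ex_get_label(char *buf, size_t size)
{
	static const char label[] = "user_u:object_r:ex_t"; /* placeholder */

	if (!buf)
		return (long)sizeof(label);
	if (size < sizeof(label))
		return -1;
	memcpy(buf, label, sizeof(label));
	return (long)sizeof(label);
}

/* Returns a freshly allocated label, or NULL on any failure. */
static char *ex_fetch_label(void)
{
	char *ctx;
	long len;

	len = ex_get_label(NULL, 0);		/* first call: size query */
	if (len < 0)
		return NULL;

	ctx = malloc((size_t)len);
	if (!ctx)
		return NULL;

	len = ex_get_label(ctx, (size_t)len);	/* second call: fill buffer */
	if (len < 0) {
		free(ctx);
		return NULL;
	}

	/* The success path returns here and never reaches error cleanup. */
	return ctx;
}
```

The caller owns the returned buffer and releases it with free(), in the same way the audit code releases its stored labels with kfree() when the names and aux records are freed.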
[AV: fixed kmalloc flags use] [folded leak fixes] [folded cleanup from akpm (kfree(NULL)] [folded audit_inode_context() leak fix] [folded akpm's fix for audit_ipc_perm() definition in case of !CONFIG_AUDIT] Signed-off-by: Dustin Kirkland Signed-off-by: David Woodhouse Signed-off-by: Andrew Morton Signed-off-by: Al Viro --- kernel/audit.c | 2 +- kernel/auditsc.c | 142 +++++++++++++++++++++++++++++++++++++++++++++++++++---- 2 files changed, 135 insertions(+), 9 deletions(-) (limited to 'kernel') diff --git a/kernel/audit.c b/kernel/audit.c index 1c3eb1b12bfc..45c123ef77a7 100644 --- a/kernel/audit.c +++ b/kernel/audit.c @@ -142,7 +142,7 @@ static void audit_set_pid(struct audit_buffer *ab, pid_t pid) nlh->nlmsg_pid = pid; } -static void audit_panic(const char *message) +void audit_panic(const char *message) { switch (audit_failure) { diff --git a/kernel/auditsc.c b/kernel/auditsc.c index 31917ac730af..4e2256ec7cf3 100644 --- a/kernel/auditsc.c +++ b/kernel/auditsc.c @@ -34,6 +34,9 @@ * * Modified by Amy Griffis to collect additional * filesystem information. + * + * Subject and object context labeling support added by + * and for LSPP certification compliance. 
*/ #include @@ -53,6 +56,7 @@ #include #include #include +#include /* 0 = no checking 1 = put_count checking @@ -109,6 +113,7 @@ struct audit_names { uid_t uid; gid_t gid; dev_t rdev; + char *ctx; }; struct audit_aux_data { @@ -125,6 +130,7 @@ struct audit_aux_data_ipcctl { uid_t uid; gid_t gid; mode_t mode; + char *ctx; }; struct audit_aux_data_socketcall { @@ -743,10 +749,11 @@ static inline void audit_free_names(struct audit_context *context) context->serial, context->major, context->in_syscall, context->name_count, context->put_count, context->ino_count); - for (i = 0; i < context->name_count; i++) + for (i = 0; i < context->name_count; i++) { printk(KERN_ERR "names[%d] = %p = %s\n", i, context->names[i].name, context->names[i].name ?: "(null)"); + } dump_stack(); return; } @@ -756,9 +763,13 @@ static inline void audit_free_names(struct audit_context *context) context->ino_count = 0; #endif - for (i = 0; i < context->name_count; i++) + for (i = 0; i < context->name_count; i++) { + char *p = context->names[i].ctx; + context->names[i].ctx = NULL; + kfree(p); if (context->names[i].name) __putname(context->names[i].name); + } context->name_count = 0; if (context->pwd) dput(context->pwd); @@ -778,6 +789,12 @@ static inline void audit_free_aux(struct audit_context *context) dput(axi->dentry); mntput(axi->mnt); } + if ( aux->type == AUDIT_IPC ) { + struct audit_aux_data_ipcctl *axi = (void *)aux; + if (axi->ctx) + kfree(axi->ctx); + } + context->aux = aux->next; kfree(aux); } @@ -862,7 +879,38 @@ static inline void audit_free_context(struct audit_context *context) printk(KERN_ERR "audit: freed %d contexts\n", count); } -static void audit_log_task_info(struct audit_buffer *ab) +static void audit_log_task_context(struct audit_buffer *ab, gfp_t gfp_mask) +{ + char *ctx = NULL; + ssize_t len = 0; + + len = security_getprocattr(current, "current", NULL, 0); + if (len < 0) { + if (len != -EINVAL) + goto error_path; + return; + } + + ctx = kmalloc(len, gfp_mask); + if (!ctx) 
{ + goto error_path; + return; + } + + len = security_getprocattr(current, "current", ctx, len); + if (len < 0 ) + goto error_path; + + audit_log_format(ab, " subj=%s", ctx); + +error_path: + if (ctx) + kfree(ctx); + audit_panic("security_getprocattr error in audit_log_task_context"); + return; +} + +static void audit_log_task_info(struct audit_buffer *ab, gfp_t gfp_mask) { char name[sizeof(current->comm)]; struct mm_struct *mm = current->mm; @@ -875,6 +923,10 @@ static void audit_log_task_info(struct audit_buffer *ab) if (!mm) return; + /* + * this is brittle; all callers that pass GFP_ATOMIC will have + * NULL current->mm and we won't get here. + */ down_read(&mm->mmap_sem); vma = mm->mmap; while (vma) { @@ -888,6 +940,7 @@ static void audit_log_task_info(struct audit_buffer *ab) vma = vma->vm_next; } up_read(&mm->mmap_sem); + audit_log_task_context(ab, gfp_mask); } static void audit_log_exit(struct audit_context *context, gfp_t gfp_mask) @@ -923,7 +976,7 @@ static void audit_log_exit(struct audit_context *context, gfp_t gfp_mask) context->gid, context->euid, context->suid, context->fsuid, context->egid, context->sgid, context->fsgid); - audit_log_task_info(ab); + audit_log_task_info(ab, gfp_mask); audit_log_end(ab); for (aux = context->aux; aux; aux = aux->next) { @@ -936,8 +989,8 @@ static void audit_log_exit(struct audit_context *context, gfp_t gfp_mask) case AUDIT_IPC: { struct audit_aux_data_ipcctl *axi = (void *)aux; audit_log_format(ab, - " qbytes=%lx iuid=%u igid=%u mode=%x", - axi->qbytes, axi->uid, axi->gid, axi->mode); + " qbytes=%lx iuid=%u igid=%u mode=%x obj=%s", + axi->qbytes, axi->uid, axi->gid, axi->mode, axi->ctx); break; } case AUDIT_SOCKETCALL: { @@ -1001,6 +1054,11 @@ static void audit_log_exit(struct audit_context *context, gfp_t gfp_mask) context->names[i].gid, MAJOR(context->names[i].rdev), MINOR(context->names[i].rdev)); + if (context->names[i].ctx) { + audit_log_format(ab, " obj=%s", + context->names[i].ctx); + } + audit_log_end(ab); } } 
@@ -1243,6 +1301,39 @@ void audit_putname(const char *name) #endif } +void audit_inode_context(int idx, const struct inode *inode) +{ + struct audit_context *context = current->audit_context; + char *ctx = NULL; + int len = 0; + + if (!security_inode_xattr_getsuffix()) + return; + + len = security_inode_getsecurity(inode, (char *)security_inode_xattr_getsuffix(), NULL, 0, 0); + if (len < 0) + goto error_path; + + ctx = kmalloc(len, GFP_KERNEL); + if (!ctx) + goto error_path; + + len = security_inode_getsecurity(inode, (char *)security_inode_xattr_getsuffix(), ctx, len, 0); + if (len < 0) + goto error_path; + + kfree(context->names[idx].ctx); + context->names[idx].ctx = ctx; + return; + +error_path: + if (ctx) + kfree(ctx); + audit_panic("error in audit_inode_context"); + return; +} + + /** * audit_inode - store the inode and device from a lookup * @name: name being audited @@ -1282,6 +1373,7 @@ void __audit_inode(const char *name, const struct inode *inode, unsigned flags) context->names[idx].uid = inode->i_uid; context->names[idx].gid = inode->i_gid; context->names[idx].rdev = inode->i_rdev; + audit_inode_context(idx, inode); if ((flags & LOOKUP_PARENT) && (strcmp(name, "/") != 0) && (strcmp(name, ".") != 0)) { context->names[idx].ino = (unsigned long)-1; @@ -1363,6 +1455,7 @@ update_context: context->names[idx].uid = inode->i_uid; context->names[idx].gid = inode->i_gid; context->names[idx].rdev = inode->i_rdev; + audit_inode_context(idx, inode); } } @@ -1423,6 +1516,38 @@ uid_t audit_get_loginuid(struct audit_context *ctx) return ctx ? 
ctx->loginuid : -1; } +static char *audit_ipc_context(struct kern_ipc_perm *ipcp) +{ + struct audit_context *context = current->audit_context; + char *ctx = NULL; + int len = 0; + + if (likely(!context)) + return NULL; + + len = security_ipc_getsecurity(ipcp, NULL, 0); + if (len == -EOPNOTSUPP) + goto ret; + if (len < 0) + goto error_path; + + ctx = kmalloc(len, GFP_ATOMIC); + if (!ctx) + goto error_path; + + len = security_ipc_getsecurity(ipcp, ctx, len); + if (len < 0) + goto error_path; + + return ctx; + +error_path: + kfree(ctx); + audit_panic("error in audit_ipc_context"); +ret: + return NULL; +} + /** * audit_ipc_perms - record audit data for ipc * @qbytes: msgq bytes @@ -1432,7 +1557,7 @@ uid_t audit_get_loginuid(struct audit_context *ctx) * * Returns 0 for success or NULL context or < 0 on error. */ -int audit_ipc_perms(unsigned long qbytes, uid_t uid, gid_t gid, mode_t mode) +int audit_ipc_perms(unsigned long qbytes, uid_t uid, gid_t gid, mode_t mode, struct kern_ipc_perm *ipcp) { struct audit_aux_data_ipcctl *ax; struct audit_context *context = current->audit_context; @@ -1440,7 +1565,7 @@ int audit_ipc_perms(unsigned long qbytes, uid_t uid, gid_t gid, mode_t mode) if (likely(!context)) return 0; - ax = kmalloc(sizeof(*ax), GFP_KERNEL); + ax = kmalloc(sizeof(*ax), GFP_ATOMIC); if (!ax) return -ENOMEM; @@ -1448,6 +1573,7 @@ int audit_ipc_perms(unsigned long qbytes, uid_t uid, gid_t gid, mode_t mode) ax->uid = uid; ax->gid = gid; ax->mode = mode; + ax->ctx = audit_ipc_context(ipcp); ax->d.type = AUDIT_IPC; ax->d.next = context->aux; -- cgit v1.2.3 From 7306a0b9b3e2056a616c84841288ca2431a05627 Mon Sep 17 00:00:00 2001 From: Dustin Kirkland Date: Wed, 16 Nov 2005 15:53:13 +0000 Subject: [PATCH] Miscellaneous bug and warning fixes This patch fixes a couple of bugs revealed in new features recently added to -mm1: * fixes warnings due to inconsistent use of const struct inode *inode * fixes a bug that prevents a kernel from booting with audit on, and SELinux off
due to a missing function in security/dummy.c * fixes a bug that throws spurious audit_panic() messages due to a missing return just before an error_path label * some reasonable house cleaning in audit_ipc_context(), audit_inode_context(), and audit_log_task_context() Signed-off-by: Dustin Kirkland Signed-off-by: David Woodhouse --- kernel/auditsc.c | 21 ++++++++++++--------- 1 file changed, 12 insertions(+), 9 deletions(-) (limited to 'kernel') diff --git a/kernel/auditsc.c b/kernel/auditsc.c index 4e2256ec7cf3..4ef14515da35 100644 --- a/kernel/auditsc.c +++ b/kernel/auditsc.c @@ -892,21 +892,20 @@ static void audit_log_task_context(struct audit_buffer *ab, gfp_t gfp_mask) } ctx = kmalloc(len, gfp_mask); - if (!ctx) { + if (!ctx) goto error_path; - return; - } len = security_getprocattr(current, "current", ctx, len); if (len < 0 ) goto error_path; audit_log_format(ab, " subj=%s", ctx); + return; error_path: if (ctx) kfree(ctx); - audit_panic("security_getprocattr error in audit_log_task_context"); + audit_panic("error in audit_log_task_context"); return; } @@ -1304,13 +1303,16 @@ void audit_putname(const char *name) void audit_inode_context(int idx, const struct inode *inode) { struct audit_context *context = current->audit_context; + const char *suffix = security_inode_xattr_getsuffix(); char *ctx = NULL; int len = 0; - if (!security_inode_xattr_getsuffix()) - return; + if (!suffix) + goto ret; - len = security_inode_getsecurity(inode, (char *)security_inode_xattr_getsuffix(), NULL, 0, 0); + len = security_inode_getsecurity(inode, suffix, NULL, 0, 0); + if (len == -EOPNOTSUPP) + goto ret; if (len < 0) goto error_path; @@ -1318,18 +1320,19 @@ void audit_inode_context(int idx, const struct inode *inode) if (!ctx) goto error_path; - len = security_inode_getsecurity(inode, (char *)security_inode_xattr_getsuffix(), ctx, len, 0); + len = security_inode_getsecurity(inode, suffix, ctx, len, 0); if (len < 0) goto error_path; kfree(context->names[idx].ctx); 
context->names[idx].ctx = ctx; - return; + goto ret; error_path: if (ctx) kfree(ctx); audit_panic("error in audit_inode_context"); +ret: return; } -- cgit v1.2.3 From fe7752bab26a9ac0651b695ad4f55659761f68f7 Mon Sep 17 00:00:00 2001 From: David Woodhouse Date: Thu, 15 Dec 2005 18:33:52 +0000 Subject: [PATCH] Fix audit record filtering with !CONFIG_AUDITSYSCALL This fixes the per-user and per-message-type filtering when syscall auditing isn't enabled. [AV: folded followup fix from the same author] Signed-off-by: David Woodhouse Signed-off-by: Al Viro --- kernel/Makefile | 2 +- kernel/audit.c | 1 + kernel/audit.h | 70 ++++++++++ kernel/auditfilter.c | 378 ++++++++++++++++++++++++++++++++++++++++++++++++++ kernel/auditsc.c | 380 +-------------------------------------------------- 5 files changed, 454 insertions(+), 377 deletions(-) create mode 100644 kernel/audit.h create mode 100644 kernel/auditfilter.c (limited to 'kernel') diff --git a/kernel/Makefile b/kernel/Makefile index 4ae0fbde815d..58cf129e5622 100644 --- a/kernel/Makefile +++ b/kernel/Makefile @@ -26,7 +26,7 @@ obj-$(CONFIG_COMPAT) += compat.o obj-$(CONFIG_CPUSETS) += cpuset.o obj-$(CONFIG_IKCONFIG) += configs.o obj-$(CONFIG_STOP_MACHINE) += stop_machine.o -obj-$(CONFIG_AUDIT) += audit.o +obj-$(CONFIG_AUDIT) += audit.o auditfilter.o obj-$(CONFIG_AUDITSYSCALL) += auditsc.o obj-$(CONFIG_KPROBES) += kprobes.o obj-$(CONFIG_SYSFS) += ksysfs.o diff --git a/kernel/audit.c b/kernel/audit.c index 45c123ef77a7..07c5d2bdd38c 100644 --- a/kernel/audit.c +++ b/kernel/audit.c @@ -305,6 +305,7 @@ static int kauditd_thread(void *dummy) remove_wait_queue(&kauditd_wait, &wait); } } + return 0; } /** diff --git a/kernel/audit.h b/kernel/audit.h new file mode 100644 index 000000000000..7643e46daeb2 --- /dev/null +++ b/kernel/audit.h @@ -0,0 +1,70 @@ +/* audit -- definition of audit_context structure and supporting types + * + * Copyright 2003-2004 Red Hat, Inc. + * Copyright 2005 Hewlett-Packard Development Company, L.P. 
+ * Copyright 2005 IBM Corporation + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include +#include + +/* 0 = no checking + 1 = put_count checking + 2 = verbose put_count checking +*/ +#define AUDIT_DEBUG 0 + +/* At task start time, the audit_state is set in the audit_context using + a per-task filter. At syscall entry, the audit_state is augmented by + the syscall filter. */ +enum audit_state { + AUDIT_DISABLED, /* Do not create per-task audit_context. + * No syscall-specific audit records can + * be generated. */ + AUDIT_SETUP_CONTEXT, /* Create the per-task audit_context, + * but don't necessarily fill it in at + * syscall entry time (i.e., filter + * instead). */ + AUDIT_BUILD_CONTEXT, /* Create the per-task audit_context, + * and always fill it in at syscall + * entry time. This makes a full + * syscall record available if some + * other part of the kernel decides it + * should be recorded. */ + AUDIT_RECORD_CONTEXT /* Create the per-task audit_context, + * always fill it in at syscall entry + * time, and always write out the audit + * record at syscall exit time. 
*/ +}; + +/* Rule lists */ +struct audit_entry { + struct list_head list; + struct rcu_head rcu; + struct audit_rule rule; +}; + + +extern int audit_pid; +extern int audit_comparator(const u32 left, const u32 op, const u32 right); + +extern void audit_send_reply(int pid, int seq, int type, + int done, int multi, + void *payload, int size); +extern void audit_log_lost(const char *message); +extern void audit_panic(const char *message); +extern struct semaphore audit_netlink_sem; diff --git a/kernel/auditfilter.c b/kernel/auditfilter.c new file mode 100644 index 000000000000..7f347c360876 --- /dev/null +++ b/kernel/auditfilter.c @@ -0,0 +1,378 @@ +/* auditfilter.c -- filtering of audit events + * + * Copyright 2003-2004 Red Hat, Inc. + * Copyright 2005 Hewlett-Packard Development Company, L.P. + * Copyright 2005 IBM Corporation + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include +#include +#include +#include +#include "audit.h" + +/* There are three lists of rules -- one to search at task creation + * time, one to search at syscall entry time, and another to search at + * syscall exit time. 
*/ +struct list_head audit_filter_list[AUDIT_NR_FILTERS] = { + LIST_HEAD_INIT(audit_filter_list[0]), + LIST_HEAD_INIT(audit_filter_list[1]), + LIST_HEAD_INIT(audit_filter_list[2]), + LIST_HEAD_INIT(audit_filter_list[3]), + LIST_HEAD_INIT(audit_filter_list[4]), + LIST_HEAD_INIT(audit_filter_list[5]), +#if AUDIT_NR_FILTERS != 6 +#error Fix audit_filter_list initialiser +#endif +}; + +/* Copy rule from user-space to kernel-space. Called from + * audit_add_rule during AUDIT_ADD. */ +static inline int audit_copy_rule(struct audit_rule *d, struct audit_rule *s) +{ + int i; + + if (s->action != AUDIT_NEVER + && s->action != AUDIT_POSSIBLE + && s->action != AUDIT_ALWAYS) + return -1; + if (s->field_count < 0 || s->field_count > AUDIT_MAX_FIELDS) + return -1; + if ((s->flags & ~AUDIT_FILTER_PREPEND) >= AUDIT_NR_FILTERS) + return -1; + + d->flags = s->flags; + d->action = s->action; + d->field_count = s->field_count; + for (i = 0; i < d->field_count; i++) { + d->fields[i] = s->fields[i]; + d->values[i] = s->values[i]; + } + for (i = 0; i < AUDIT_BITMASK_SIZE; i++) d->mask[i] = s->mask[i]; + return 0; +} + +/* Check to see if two rules are identical. It is called from + * audit_add_rule during AUDIT_ADD and + * audit_del_rule during AUDIT_DEL. */ +static inline int audit_compare_rule(struct audit_rule *a, struct audit_rule *b) +{ + int i; + + if (a->flags != b->flags) + return 1; + + if (a->action != b->action) + return 1; + + if (a->field_count != b->field_count) + return 1; + + for (i = 0; i < a->field_count; i++) { + if (a->fields[i] != b->fields[i] + || a->values[i] != b->values[i]) + return 1; + } + + for (i = 0; i < AUDIT_BITMASK_SIZE; i++) + if (a->mask[i] != b->mask[i]) + return 1; + + return 0; +} + +/* Note that audit_add_rule and audit_del_rule are called via + * audit_receive() in audit.c, and are protected by + * audit_netlink_sem. 
*/ +static inline int audit_add_rule(struct audit_rule *rule, + struct list_head *list) +{ + struct audit_entry *entry; + int i; + + /* Do not use the _rcu iterator here, since this is the only + * addition routine. */ + list_for_each_entry(entry, list, list) { + if (!audit_compare_rule(rule, &entry->rule)) { + return -EEXIST; + } + } + + for (i = 0; i < rule->field_count; i++) { + if (rule->fields[i] & AUDIT_UNUSED_BITS) + return -EINVAL; + if ( rule->fields[i] & AUDIT_NEGATE ) + rule->fields[i] |= AUDIT_NOT_EQUAL; + else if ( (rule->fields[i] & AUDIT_OPERATORS) == 0 ) + rule->fields[i] |= AUDIT_EQUAL; + rule->fields[i] &= (~AUDIT_NEGATE); + } + + if (!(entry = kmalloc(sizeof(*entry), GFP_KERNEL))) + return -ENOMEM; + if (audit_copy_rule(&entry->rule, rule)) { + kfree(entry); + return -EINVAL; + } + + if (entry->rule.flags & AUDIT_FILTER_PREPEND) { + entry->rule.flags &= ~AUDIT_FILTER_PREPEND; + list_add_rcu(&entry->list, list); + } else { + list_add_tail_rcu(&entry->list, list); + } + + return 0; +} + +static inline void audit_free_rule(struct rcu_head *head) +{ + struct audit_entry *e = container_of(head, struct audit_entry, rcu); + kfree(e); +} + +/* Note that audit_add_rule and audit_del_rule are called via + * audit_receive() in audit.c, and are protected by + * audit_netlink_sem. */ +static inline int audit_del_rule(struct audit_rule *rule, + struct list_head *list) +{ + struct audit_entry *e; + + /* Do not use the _rcu iterator here, since this is the only + * deletion routine. 
*/ + list_for_each_entry(e, list, list) { + if (!audit_compare_rule(rule, &e->rule)) { + list_del_rcu(&e->list); + call_rcu(&e->rcu, audit_free_rule); + return 0; + } + } + return -ENOENT; /* No matching rule */ +} + +static int audit_list_rules(void *_dest) +{ + int pid, seq; + int *dest = _dest; + struct audit_entry *entry; + int i; + + pid = dest[0]; + seq = dest[1]; + kfree(dest); + + down(&audit_netlink_sem); + + /* The *_rcu iterators not needed here because we are + always called with audit_netlink_sem held. */ + for (i=0; i<AUDIT_NR_FILTERS; i++) { + list_for_each_entry(entry, &audit_filter_list[i], list) + audit_send_reply(pid, seq, AUDIT_LIST, 0, 1, + &entry->rule, sizeof(entry->rule)); + } + audit_send_reply(pid, seq, AUDIT_LIST, 1, 1, NULL, 0); + + up(&audit_netlink_sem); + return 0; +} + +/** + * audit_receive_filter - apply all rules to the specified message type + * @type: audit message type + * @pid: target pid for netlink audit messages + * @uid: target uid for netlink audit messages + * @seq: netlink audit message sequence (serial) number + * @data: payload data + * @loginuid: loginuid of sender + */ +int audit_receive_filter(int type, int pid, int uid, int seq, void *data, + uid_t loginuid) +{ + struct task_struct *tsk; + int *dest; + int err = 0; + unsigned listnr; + + switch (type) { + case AUDIT_LIST: + /* We can't just spew out the rules here because we might fill + * the available socket buffer space and deadlock waiting for + * auditctl to read from it...
which isn't ever going to + * happen if we're actually running in the context of auditctl + * trying to _send_ the stuff */ + + dest = kmalloc(2 * sizeof(int), GFP_KERNEL); + if (!dest) + return -ENOMEM; + dest[0] = pid; + dest[1] = seq; + + tsk = kthread_run(audit_list_rules, dest, "audit_list_rules"); + if (IS_ERR(tsk)) { + kfree(dest); + err = PTR_ERR(tsk); + } + break; + case AUDIT_ADD: + listnr = ((struct audit_rule *)data)->flags & ~AUDIT_FILTER_PREPEND; + switch(listnr) { + default: + return -EINVAL; + + case AUDIT_FILTER_USER: + case AUDIT_FILTER_TYPE: +#ifdef CONFIG_AUDITSYSCALL + case AUDIT_FILTER_ENTRY: + case AUDIT_FILTER_EXIT: + case AUDIT_FILTER_TASK: +#endif + ; + } + err = audit_add_rule(data, &audit_filter_list[listnr]); + if (!err) + audit_log(NULL, GFP_KERNEL, AUDIT_CONFIG_CHANGE, + "auid=%u added an audit rule\n", loginuid); + break; + case AUDIT_DEL: + listnr =((struct audit_rule *)data)->flags & ~AUDIT_FILTER_PREPEND; + if (listnr >= AUDIT_NR_FILTERS) + return -EINVAL; + + err = audit_del_rule(data, &audit_filter_list[listnr]); + if (!err) + audit_log(NULL, GFP_KERNEL, AUDIT_CONFIG_CHANGE, + "auid=%u removed an audit rule\n", loginuid); + break; + default: + return -EINVAL; + } + + return err; +} + +int audit_comparator(const u32 left, const u32 op, const u32 right) +{ + switch (op) { + case AUDIT_EQUAL: + return (left == right); + case AUDIT_NOT_EQUAL: + return (left != right); + case AUDIT_LESS_THAN: + return (left < right); + case AUDIT_LESS_THAN_OR_EQUAL: + return (left <= right); + case AUDIT_GREATER_THAN: + return (left > right); + case AUDIT_GREATER_THAN_OR_EQUAL: + return (left >= right); + default: + return -EINVAL; + } +} + + + +static int audit_filter_user_rules(struct netlink_skb_parms *cb, + struct audit_rule *rule, + enum audit_state *state) +{ + int i; + + for (i = 0; i < rule->field_count; i++) { + u32 field = rule->fields[i] & ~AUDIT_OPERATORS; + u32 op = rule->fields[i] & AUDIT_OPERATORS; + u32 value = rule->values[i]; + int 
result = 0; + + switch (field) { + case AUDIT_PID: + result = audit_comparator(cb->creds.pid, op, value); + break; + case AUDIT_UID: + result = audit_comparator(cb->creds.uid, op, value); + break; + case AUDIT_GID: + result = audit_comparator(cb->creds.gid, op, value); + break; + case AUDIT_LOGINUID: + result = audit_comparator(cb->loginuid, op, value); + break; + } + + if (!result) + return 0; + } + switch (rule->action) { + case AUDIT_NEVER: *state = AUDIT_DISABLED; break; + case AUDIT_POSSIBLE: *state = AUDIT_BUILD_CONTEXT; break; + case AUDIT_ALWAYS: *state = AUDIT_RECORD_CONTEXT; break; + } + return 1; +} + +int audit_filter_user(struct netlink_skb_parms *cb, int type) +{ + struct audit_entry *e; + enum audit_state state; + int ret = 1; + + rcu_read_lock(); + list_for_each_entry_rcu(e, &audit_filter_list[AUDIT_FILTER_USER], list) { + if (audit_filter_user_rules(cb, &e->rule, &state)) { + if (state == AUDIT_DISABLED) + ret = 0; + break; + } + } + rcu_read_unlock(); + + return ret; /* Audit by default */ +} + +int audit_filter_type(int type) +{ + struct audit_entry *e; + int result = 0; + + rcu_read_lock(); + if (list_empty(&audit_filter_list[AUDIT_FILTER_TYPE])) + goto unlock_and_return; + + list_for_each_entry_rcu(e, &audit_filter_list[AUDIT_FILTER_TYPE], + list) { + struct audit_rule *rule = &e->rule; + int i; + for (i = 0; i < rule->field_count; i++) { + u32 field = rule->fields[i] & ~AUDIT_OPERATORS; + u32 op = rule->fields[i] & AUDIT_OPERATORS; + u32 value = rule->values[i]; + if ( field == AUDIT_MSGTYPE ) { + result = audit_comparator(type, op, value); + if (!result) + break; + } + } + if (result) + goto unlock_and_return; + } +unlock_and_return: + rcu_read_unlock(); + return result; +} + + diff --git a/kernel/auditsc.c b/kernel/auditsc.c index 4ef14515da35..17719b303638 100644 --- a/kernel/auditsc.c +++ b/kernel/auditsc.c @@ -52,17 +52,15 @@ #include #include #include -#include #include #include #include #include +#include -/* 0 = no checking - 1 = 
put_count checking - 2 = verbose put_count checking -*/ -#define AUDIT_DEBUG 0 +#include "audit.h" + +extern struct list_head audit_filter_list[]; /* No syscall auditing will take place unless audit_enabled != 0. */ extern int audit_enabled; @@ -76,29 +74,6 @@ extern int audit_enabled; * path_lookup. */ #define AUDIT_NAMES_RESERVED 7 -/* At task start time, the audit_state is set in the audit_context using - a per-task filter. At syscall entry, the audit_state is augmented by - the syscall filter. */ -enum audit_state { - AUDIT_DISABLED, /* Do not create per-task audit_context. - * No syscall-specific audit records can - * be generated. */ - AUDIT_SETUP_CONTEXT, /* Create the per-task audit_context, - * but don't necessarily fill it in at - * syscall entry time (i.e., filter - * instead). */ - AUDIT_BUILD_CONTEXT, /* Create the per-task audit_context, - * and always fill it in at syscall - * entry time. This makes a full - * syscall record available if some - * other part of the kernel decides it - * should be recorded. */ - AUDIT_RECORD_CONTEXT /* Create the per-task audit_context, - * always fill it in at syscall entry - * time, and always write out the audit - * record at syscall exit time. */ -}; - /* When fs/namei.c:getname() is called, we store the pointer in name and * we don't let putname() free it (instead we free all of the saved * pointers at syscall exit time). @@ -183,264 +158,6 @@ struct audit_context { #endif }; - /* Public API */ -/* There are three lists of rules -- one to search at task creation - * time, one to search at syscall entry time, and another to search at - * syscall exit time. 
*/ -static struct list_head audit_filter_list[AUDIT_NR_FILTERS] = { - LIST_HEAD_INIT(audit_filter_list[0]), - LIST_HEAD_INIT(audit_filter_list[1]), - LIST_HEAD_INIT(audit_filter_list[2]), - LIST_HEAD_INIT(audit_filter_list[3]), - LIST_HEAD_INIT(audit_filter_list[4]), - LIST_HEAD_INIT(audit_filter_list[5]), -#if AUDIT_NR_FILTERS != 6 -#error Fix audit_filter_list initialiser -#endif -}; - -struct audit_entry { - struct list_head list; - struct rcu_head rcu; - struct audit_rule rule; -}; - -extern int audit_pid; - -/* Copy rule from user-space to kernel-space. Called from - * audit_add_rule during AUDIT_ADD. */ -static inline int audit_copy_rule(struct audit_rule *d, struct audit_rule *s) -{ - int i; - - if (s->action != AUDIT_NEVER - && s->action != AUDIT_POSSIBLE - && s->action != AUDIT_ALWAYS) - return -1; - if (s->field_count < 0 || s->field_count > AUDIT_MAX_FIELDS) - return -1; - if ((s->flags & ~AUDIT_FILTER_PREPEND) >= AUDIT_NR_FILTERS) - return -1; - - d->flags = s->flags; - d->action = s->action; - d->field_count = s->field_count; - for (i = 0; i < d->field_count; i++) { - d->fields[i] = s->fields[i]; - d->values[i] = s->values[i]; - } - for (i = 0; i < AUDIT_BITMASK_SIZE; i++) d->mask[i] = s->mask[i]; - return 0; -} - -/* Check to see if two rules are identical. It is called from - * audit_add_rule during AUDIT_ADD and - * audit_del_rule during AUDIT_DEL. */ -static inline int audit_compare_rule(struct audit_rule *a, struct audit_rule *b) -{ - int i; - - if (a->flags != b->flags) - return 1; - - if (a->action != b->action) - return 1; - - if (a->field_count != b->field_count) - return 1; - - for (i = 0; i < a->field_count; i++) { - if (a->fields[i] != b->fields[i] - || a->values[i] != b->values[i]) - return 1; - } - - for (i = 0; i < AUDIT_BITMASK_SIZE; i++) - if (a->mask[i] != b->mask[i]) - return 1; - - return 0; -} - -/* Note that audit_add_rule and audit_del_rule are called via - * audit_receive() in audit.c, and are protected by - * audit_netlink_sem. 
*/ -static inline int audit_add_rule(struct audit_rule *rule, - struct list_head *list) -{ - struct audit_entry *entry; - int i; - - /* Do not use the _rcu iterator here, since this is the only - * addition routine. */ - list_for_each_entry(entry, list, list) { - if (!audit_compare_rule(rule, &entry->rule)) { - return -EEXIST; - } - } - - for (i = 0; i < rule->field_count; i++) { - if (rule->fields[i] & AUDIT_UNUSED_BITS) - return -EINVAL; - if ( rule->fields[i] & AUDIT_NEGATE ) - rule->fields[i] |= AUDIT_NOT_EQUAL; - else if ( (rule->fields[i] & AUDIT_OPERATORS) == 0 ) - rule->fields[i] |= AUDIT_EQUAL; - rule->fields[i] &= (~AUDIT_NEGATE); - } - - if (!(entry = kmalloc(sizeof(*entry), GFP_KERNEL))) - return -ENOMEM; - if (audit_copy_rule(&entry->rule, rule)) { - kfree(entry); - return -EINVAL; - } - - if (entry->rule.flags & AUDIT_FILTER_PREPEND) { - entry->rule.flags &= ~AUDIT_FILTER_PREPEND; - list_add_rcu(&entry->list, list); - } else { - list_add_tail_rcu(&entry->list, list); - } - - return 0; -} - -static inline void audit_free_rule(struct rcu_head *head) -{ - struct audit_entry *e = container_of(head, struct audit_entry, rcu); - kfree(e); -} - -/* Note that audit_add_rule and audit_del_rule are called via - * audit_receive() in audit.c, and are protected by - * audit_netlink_sem. */ -static inline int audit_del_rule(struct audit_rule *rule, - struct list_head *list) -{ - struct audit_entry *e; - - /* Do not use the _rcu iterator here, since this is the only - * deletion routine. 
*/ - list_for_each_entry(e, list, list) { - if (!audit_compare_rule(rule, &e->rule)) { - list_del_rcu(&e->list); - call_rcu(&e->rcu, audit_free_rule); - return 0; - } - } - return -ENOENT; /* No matching rule */ -} - -static int audit_list_rules(void *_dest) -{ - int pid, seq; - int *dest = _dest; - struct audit_entry *entry; - int i; - - pid = dest[0]; - seq = dest[1]; - kfree(dest); - - down(&audit_netlink_sem); - - /* The *_rcu iterators not needed here because we are - always called with audit_netlink_sem held. */ - for (i=0; i<AUDIT_NR_FILTERS; i++) { - list_for_each_entry(entry, &audit_filter_list[i], list) - audit_send_reply(pid, seq, AUDIT_LIST, 0, 1, - &entry->rule, sizeof(entry->rule)); - } - audit_send_reply(pid, seq, AUDIT_LIST, 1, 1, NULL, 0); - - up(&audit_netlink_sem); - return 0; -} - -/** - * audit_receive_filter - apply all rules to the specified message type - * @type: audit message type - * @pid: target pid for netlink audit messages - * @uid: target uid for netlink audit messages - * @seq: netlink audit message sequence (serial) number - * @data: payload data - * @loginuid: loginuid of sender - */ -int audit_receive_filter(int type, int pid, int uid, int seq, void *data, - uid_t loginuid) -{ - struct task_struct *tsk; - int *dest; - int err = 0; - unsigned listnr; - - switch (type) { - case AUDIT_LIST: - /* We can't just spew out the rules here because we might fill - * the available socket buffer space and deadlock waiting for - * auditctl to read from it...
which isn't ever going to - * happen if we're actually running in the context of auditctl - * trying to _send_ the stuff */ - - dest = kmalloc(2 * sizeof(int), GFP_KERNEL); - if (!dest) - return -ENOMEM; - dest[0] = pid; - dest[1] = seq; - - tsk = kthread_run(audit_list_rules, dest, "audit_list_rules"); - if (IS_ERR(tsk)) { - kfree(dest); - err = PTR_ERR(tsk); - } - break; - case AUDIT_ADD: - listnr =((struct audit_rule *)data)->flags & ~AUDIT_FILTER_PREPEND; - if (listnr >= AUDIT_NR_FILTERS) - return -EINVAL; - - err = audit_add_rule(data, &audit_filter_list[listnr]); - if (!err) - audit_log(NULL, GFP_KERNEL, AUDIT_CONFIG_CHANGE, - "auid=%u added an audit rule\n", loginuid); - break; - case AUDIT_DEL: - listnr =((struct audit_rule *)data)->flags & ~AUDIT_FILTER_PREPEND; - if (listnr >= AUDIT_NR_FILTERS) - return -EINVAL; - - err = audit_del_rule(data, &audit_filter_list[listnr]); - if (!err) - audit_log(NULL, GFP_KERNEL, AUDIT_CONFIG_CHANGE, - "auid=%u removed an audit rule\n", loginuid); - break; - default: - return -EINVAL; - } - - return err; -} - -static int audit_comparator(const u32 left, const u32 op, const u32 right) -{ - switch (op) { - case AUDIT_EQUAL: - return (left == right); - case AUDIT_NOT_EQUAL: - return (left != right); - case AUDIT_LESS_THAN: - return (left < right); - case AUDIT_LESS_THAN_OR_EQUAL: - return (left <= right); - case AUDIT_GREATER_THAN: - return (left > right); - case AUDIT_GREATER_THAN_OR_EQUAL: - return (left >= right); - default: - return -EINVAL; - } -} /* Compare a task_struct with an audit_rule. Return 1 on match, 0 * otherwise. 
*/ @@ -613,95 +330,6 @@ static enum audit_state audit_filter_syscall(struct task_struct *tsk, return AUDIT_BUILD_CONTEXT; } -static int audit_filter_user_rules(struct netlink_skb_parms *cb, - struct audit_rule *rule, - enum audit_state *state) -{ - int i; - - for (i = 0; i < rule->field_count; i++) { - u32 field = rule->fields[i] & ~AUDIT_OPERATORS; - u32 op = rule->fields[i] & AUDIT_OPERATORS; - u32 value = rule->values[i]; - int result = 0; - - switch (field) { - case AUDIT_PID: - result = audit_comparator(cb->creds.pid, op, value); - break; - case AUDIT_UID: - result = audit_comparator(cb->creds.uid, op, value); - break; - case AUDIT_GID: - result = audit_comparator(cb->creds.gid, op, value); - break; - case AUDIT_LOGINUID: - result = audit_comparator(cb->loginuid, op, value); - break; - } - - if (!result) - return 0; - } - switch (rule->action) { - case AUDIT_NEVER: *state = AUDIT_DISABLED; break; - case AUDIT_POSSIBLE: *state = AUDIT_BUILD_CONTEXT; break; - case AUDIT_ALWAYS: *state = AUDIT_RECORD_CONTEXT; break; - } - return 1; -} - -int audit_filter_user(struct netlink_skb_parms *cb, int type) -{ - struct audit_entry *e; - enum audit_state state; - int ret = 1; - - rcu_read_lock(); - list_for_each_entry_rcu(e, &audit_filter_list[AUDIT_FILTER_USER], list) { - if (audit_filter_user_rules(cb, &e->rule, &state)) { - if (state == AUDIT_DISABLED) - ret = 0; - break; - } - } - rcu_read_unlock(); - - return ret; /* Audit by default */ -} - -int audit_filter_type(int type) -{ - struct audit_entry *e; - int result = 0; - - rcu_read_lock(); - if (list_empty(&audit_filter_list[AUDIT_FILTER_TYPE])) - goto unlock_and_return; - - list_for_each_entry_rcu(e, &audit_filter_list[AUDIT_FILTER_TYPE], - list) { - struct audit_rule *rule = &e->rule; - int i; - for (i = 0; i < rule->field_count; i++) { - u32 field = rule->fields[i] & ~AUDIT_OPERATORS; - u32 op = rule->fields[i] & AUDIT_OPERATORS; - u32 value = rule->values[i]; - if ( field == AUDIT_MSGTYPE ) { - result = 
audit_comparator(type, op, value); - if (!result) - break; - } - } - if (result) - goto unlock_and_return; - } -unlock_and_return: - rcu_read_unlock(); - return result; -} - - /* This should be called with task_lock() held. */ static inline struct audit_context *audit_get_context(struct task_struct *tsk, int return_valid, -- cgit v1.2.3 From d884596f44ef5a0bcd8a66405dc04902aeaa6fc7 Mon Sep 17 00:00:00 2001 From: David Woodhouse Date: Fri, 16 Dec 2005 10:48:28 +0000 Subject: [PATCH] Minor cosmetic cleanups to the code moved into auditfilter.c Signed-off-by: David Woodhouse --- kernel/auditfilter.c | 11 ++++------- 1 file changed, 4 insertions(+), 7 deletions(-) (limited to 'kernel') diff --git a/kernel/auditfilter.c b/kernel/auditfilter.c index 7f347c360876..a3a32752f973 100644 --- a/kernel/auditfilter.c +++ b/kernel/auditfilter.c @@ -69,7 +69,7 @@ static inline int audit_copy_rule(struct audit_rule *d, struct audit_rule *s) /* Check to see if two rules are identical. It is called from * audit_add_rule during AUDIT_ADD and * audit_del_rule during AUDIT_DEL. */ -static inline int audit_compare_rule(struct audit_rule *a, struct audit_rule *b) +static int audit_compare_rule(struct audit_rule *a, struct audit_rule *b) { int i; @@ -107,19 +107,18 @@ static inline int audit_add_rule(struct audit_rule *rule, /* Do not use the _rcu iterator here, since this is the only * addition routine. 
*/ list_for_each_entry(entry, list, list) { - if (!audit_compare_rule(rule, &entry->rule)) { + if (!audit_compare_rule(rule, &entry->rule)) return -EEXIST; - } } for (i = 0; i < rule->field_count; i++) { if (rule->fields[i] & AUDIT_UNUSED_BITS) return -EINVAL; - if ( rule->fields[i] & AUDIT_NEGATE ) + if ( rule->fields[i] & AUDIT_NEGATE) rule->fields[i] |= AUDIT_NOT_EQUAL; else if ( (rule->fields[i] & AUDIT_OPERATORS) == 0 ) rule->fields[i] |= AUDIT_EQUAL; - rule->fields[i] &= (~AUDIT_NEGATE); + rule->fields[i] &= ~AUDIT_NEGATE; } if (!(entry = kmalloc(sizeof(*entry), GFP_KERNEL))) @@ -374,5 +373,3 @@ unlock_and_return: rcu_read_unlock(); return result; } - - -- cgit v1.2.3 From 93315ed6dd12dacfc941f9eb8ca0293aadf99793 Mon Sep 17 00:00:00 2001 From: Amy Griffis Date: Tue, 7 Feb 2006 12:05:27 -0500 Subject: [PATCH] audit string fields interface + consumer Updated patch to dynamically allocate audit rule fields in kernel's internal representation. Added unlikely() calls for testing memory allocation result. Amy Griffis wrote: [Wed Jan 11 2006, 02:02:31PM EST] > Modify audit's kernel-userspace interface to allow the specification > of string fields in audit rules. 
> > Signed-off-by: Amy Griffis Signed-off-by: Al Viro (cherry picked from 5ffc4a863f92351b720fe3e9c5cd647accff9e03 commit) --- kernel/audit.c | 19 ++- kernel/audit.h | 23 ++- kernel/auditfilter.c | 467 +++++++++++++++++++++++++++++++++++++++------------ kernel/auditsc.c | 50 +++--- 4 files changed, 418 insertions(+), 141 deletions(-) (limited to 'kernel') diff --git a/kernel/audit.c b/kernel/audit.c index 07c5d2bdd38c..4eb97b62d7fa 100644 --- a/kernel/audit.c +++ b/kernel/audit.c @@ -52,6 +52,7 @@ #include #include +#include #include #include @@ -361,9 +362,12 @@ static int audit_netlink_ok(kernel_cap_t eff_cap, u16 msg_type) switch (msg_type) { case AUDIT_GET: case AUDIT_LIST: + case AUDIT_LIST_RULES: case AUDIT_SET: case AUDIT_ADD: + case AUDIT_ADD_RULE: case AUDIT_DEL: + case AUDIT_DEL_RULE: case AUDIT_SIGNAL_INFO: if (!cap_raised(eff_cap, CAP_AUDIT_CONTROL)) err = -EPERM; @@ -470,12 +474,23 @@ static int audit_receive_msg(struct sk_buff *skb, struct nlmsghdr *nlh) break; case AUDIT_ADD: case AUDIT_DEL: - if (nlh->nlmsg_len < sizeof(struct audit_rule)) + if (nlmsg_len(nlh) < sizeof(struct audit_rule)) return -EINVAL; /* fallthrough */ case AUDIT_LIST: err = audit_receive_filter(nlh->nlmsg_type, NETLINK_CB(skb).pid, - uid, seq, data, loginuid); + uid, seq, data, nlmsg_len(nlh), + loginuid); + break; + case AUDIT_ADD_RULE: + case AUDIT_DEL_RULE: + if (nlmsg_len(nlh) < sizeof(struct audit_rule_data)) + return -EINVAL; + /* fallthrough */ + case AUDIT_LIST_RULES: + err = audit_receive_filter(nlh->nlmsg_type, NETLINK_CB(skb).pid, + uid, seq, data, nlmsg_len(nlh), + loginuid); break; case AUDIT_SIGNAL_INFO: sig_data.uid = audit_sig_uid; diff --git a/kernel/audit.h b/kernel/audit.h index 7643e46daeb2..4b602cdcabef 100644 --- a/kernel/audit.h +++ b/kernel/audit.h @@ -52,10 +52,27 @@ enum audit_state { }; /* Rule lists */ +struct audit_field { + u32 type; + u32 val; + u32 op; +}; + +struct audit_krule { + int vers_ops; + u32 flags; + u32 listnr; + u32 action; + u32 
mask[AUDIT_BITMASK_SIZE]; + u32 buflen; /* for data alloc on list rules */ + u32 field_count; + struct audit_field *fields; +}; + struct audit_entry { - struct list_head list; - struct rcu_head rcu; - struct audit_rule rule; + struct list_head list; + struct rcu_head rcu; + struct audit_krule rule; }; diff --git a/kernel/auditfilter.c b/kernel/auditfilter.c index a3a32752f973..686d514a3518 100644 --- a/kernel/auditfilter.c +++ b/kernel/auditfilter.c @@ -40,52 +40,279 @@ struct list_head audit_filter_list[AUDIT_NR_FILTERS] = { #endif }; -/* Copy rule from user-space to kernel-space. Called from - * audit_add_rule during AUDIT_ADD. */ -static inline int audit_copy_rule(struct audit_rule *d, struct audit_rule *s) +static inline void audit_free_rule(struct audit_entry *e) { + kfree(e->rule.fields); + kfree(e); +} + +static inline void audit_free_rule_rcu(struct rcu_head *head) +{ + struct audit_entry *e = container_of(head, struct audit_entry, rcu); + audit_free_rule(e); +} + +/* Unpack a filter field's string representation from user-space + * buffer. */ +static __attribute__((unused)) char *audit_unpack_string(void **bufp, size_t *remain, size_t len) +{ + char *str; + + if (!*bufp || (len == 0) || (len > *remain)) + return ERR_PTR(-EINVAL); + + /* Of the currently implemented string fields, PATH_MAX + * defines the longest valid length. + */ + if (len > PATH_MAX) + return ERR_PTR(-ENAMETOOLONG); + + str = kmalloc(len + 1, GFP_KERNEL); + if (unlikely(!str)) + return ERR_PTR(-ENOMEM); + + memcpy(str, *bufp, len); + str[len] = 0; + *bufp += len; + *remain -= len; + + return str; +} + +/* Common user-space to kernel rule translation. 
*/ +static inline struct audit_entry *audit_to_entry_common(struct audit_rule *rule) +{ + unsigned listnr; + struct audit_entry *entry; + struct audit_field *fields; + int i, err; + + err = -EINVAL; + listnr = rule->flags & ~AUDIT_FILTER_PREPEND; + switch(listnr) { + default: + goto exit_err; + case AUDIT_FILTER_USER: + case AUDIT_FILTER_TYPE: +#ifdef CONFIG_AUDITSYSCALL + case AUDIT_FILTER_ENTRY: + case AUDIT_FILTER_EXIT: + case AUDIT_FILTER_TASK: +#endif + ; + } + if (rule->action != AUDIT_NEVER && rule->action != AUDIT_POSSIBLE && + rule->action != AUDIT_ALWAYS) + goto exit_err; + if (rule->field_count > AUDIT_MAX_FIELDS) + goto exit_err; + + err = -ENOMEM; + entry = kmalloc(sizeof(*entry), GFP_KERNEL); + if (unlikely(!entry)) + goto exit_err; + fields = kmalloc(sizeof(*fields) * rule->field_count, GFP_KERNEL); + if (unlikely(!fields)) { + kfree(entry); + goto exit_err; + } + + memset(&entry->rule, 0, sizeof(struct audit_krule)); + memset(fields, 0, sizeof(struct audit_field)); + + entry->rule.flags = rule->flags & AUDIT_FILTER_PREPEND; + entry->rule.listnr = listnr; + entry->rule.action = rule->action; + entry->rule.field_count = rule->field_count; + entry->rule.fields = fields; + + for (i = 0; i < AUDIT_BITMASK_SIZE; i++) + entry->rule.mask[i] = rule->mask[i]; + + return entry; + +exit_err: + return ERR_PTR(err); +} + +/* Translate struct audit_rule to kernel's rule respresentation. + * Exists for backward compatibility with userspace. 
*/ +static struct audit_entry *audit_rule_to_entry(struct audit_rule *rule) +{ + struct audit_entry *entry; + int err = 0; int i; - if (s->action != AUDIT_NEVER - && s->action != AUDIT_POSSIBLE - && s->action != AUDIT_ALWAYS) - return -1; - if (s->field_count < 0 || s->field_count > AUDIT_MAX_FIELDS) - return -1; - if ((s->flags & ~AUDIT_FILTER_PREPEND) >= AUDIT_NR_FILTERS) - return -1; - - d->flags = s->flags; - d->action = s->action; - d->field_count = s->field_count; - for (i = 0; i < d->field_count; i++) { - d->fields[i] = s->fields[i]; - d->values[i] = s->values[i]; + entry = audit_to_entry_common(rule); + if (IS_ERR(entry)) + goto exit_nofree; + + for (i = 0; i < rule->field_count; i++) { + struct audit_field *f = &entry->rule.fields[i]; + + if (rule->fields[i] & AUDIT_UNUSED_BITS) { + err = -EINVAL; + goto exit_free; + } + + f->op = rule->fields[i] & (AUDIT_NEGATE|AUDIT_OPERATORS); + f->type = rule->fields[i] & ~(AUDIT_NEGATE|AUDIT_OPERATORS); + f->val = rule->values[i]; + + entry->rule.vers_ops = (f->op & AUDIT_OPERATORS) ? 2 : 1; + if (f->op & AUDIT_NEGATE) + f->op |= AUDIT_NOT_EQUAL; + else if (!(f->op & AUDIT_OPERATORS)) + f->op |= AUDIT_EQUAL; + f->op &= ~AUDIT_NEGATE; } - for (i = 0; i < AUDIT_BITMASK_SIZE; i++) d->mask[i] = s->mask[i]; - return 0; + +exit_nofree: + return entry; + +exit_free: + audit_free_rule(entry); + return ERR_PTR(err); } -/* Check to see if two rules are identical. It is called from - * audit_add_rule during AUDIT_ADD and - * audit_del_rule during AUDIT_DEL. */ -static int audit_compare_rule(struct audit_rule *a, struct audit_rule *b) +/* Translate struct audit_rule_data to kernel's rule respresentation. 
*/ +static struct audit_entry *audit_data_to_entry(struct audit_rule_data *data, + size_t datasz) { + int err = 0; + struct audit_entry *entry; + void *bufp; + /* size_t remain = datasz - sizeof(struct audit_rule_data); */ int i; - if (a->flags != b->flags) - return 1; + entry = audit_to_entry_common((struct audit_rule *)data); + if (IS_ERR(entry)) + goto exit_nofree; - if (a->action != b->action) - return 1; + bufp = data->buf; + entry->rule.vers_ops = 2; + for (i = 0; i < data->field_count; i++) { + struct audit_field *f = &entry->rule.fields[i]; + + err = -EINVAL; + if (!(data->fieldflags[i] & AUDIT_OPERATORS) || + data->fieldflags[i] & ~AUDIT_OPERATORS) + goto exit_free; + + f->op = data->fieldflags[i] & AUDIT_OPERATORS; + f->type = data->fields[i]; + switch(f->type) { + /* call type-specific conversion routines here */ + default: + f->val = data->values[i]; + } + } + +exit_nofree: + return entry; + +exit_free: + audit_free_rule(entry); + return ERR_PTR(err); +} + +/* Pack a filter field's string representation into data block. */ +static inline size_t audit_pack_string(void **bufp, char *str) +{ + size_t len = strlen(str); + + memcpy(*bufp, str, len); + *bufp += len; + + return len; +} + +/* Translate kernel rule respresentation to struct audit_rule. + * Exists for backward compatibility with userspace. 
*/ +static struct audit_rule *audit_krule_to_rule(struct audit_krule *krule) +{ + struct audit_rule *rule; + int i; + + rule = kmalloc(sizeof(*rule), GFP_KERNEL); + if (unlikely(!rule)) + return ERR_PTR(-ENOMEM); + memset(rule, 0, sizeof(*rule)); + + rule->flags = krule->flags | krule->listnr; + rule->action = krule->action; + rule->field_count = krule->field_count; + for (i = 0; i < rule->field_count; i++) { + rule->values[i] = krule->fields[i].val; + rule->fields[i] = krule->fields[i].type; + + if (krule->vers_ops == 1) { + if (krule->fields[i].op & AUDIT_NOT_EQUAL) + rule->fields[i] |= AUDIT_NEGATE; + } else { + rule->fields[i] |= krule->fields[i].op; + } + } + for (i = 0; i < AUDIT_BITMASK_SIZE; i++) rule->mask[i] = krule->mask[i]; + + return rule; +} - if (a->field_count != b->field_count) +/* Translate kernel rule respresentation to struct audit_rule_data. */ +static struct audit_rule_data *audit_krule_to_data(struct audit_krule *krule) +{ + struct audit_rule_data *data; + void *bufp; + int i; + + data = kmalloc(sizeof(*data) + krule->buflen, GFP_KERNEL); + if (unlikely(!data)) + return ERR_PTR(-ENOMEM); + memset(data, 0, sizeof(*data)); + + data->flags = krule->flags | krule->listnr; + data->action = krule->action; + data->field_count = krule->field_count; + bufp = data->buf; + for (i = 0; i < data->field_count; i++) { + struct audit_field *f = &krule->fields[i]; + + data->fields[i] = f->type; + data->fieldflags[i] = f->op; + switch(f->type) { + /* call type-specific conversion routines here */ + default: + data->values[i] = f->val; + } + } + for (i = 0; i < AUDIT_BITMASK_SIZE; i++) data->mask[i] = krule->mask[i]; + + return data; +} + +/* Compare two rules in kernel format. Considered success if rules + * don't match. 
*/ +static int audit_compare_rule(struct audit_krule *a, struct audit_krule *b) +{ + int i; + + if (a->flags != b->flags || + a->listnr != b->listnr || + a->action != b->action || + a->field_count != b->field_count) return 1; for (i = 0; i < a->field_count; i++) { - if (a->fields[i] != b->fields[i] - || a->values[i] != b->values[i]) + if (a->fields[i].type != b->fields[i].type || + a->fields[i].op != b->fields[i].op) return 1; + + switch(a->fields[i].type) { + /* call type-specific comparison routines here */ + default: + if (a->fields[i].val != b->fields[i].val) + return 1; + } } for (i = 0; i < AUDIT_BITMASK_SIZE; i++) @@ -95,41 +322,21 @@ static int audit_compare_rule(struct audit_rule *a, struct audit_rule *b) return 0; } -/* Note that audit_add_rule and audit_del_rule are called via - * audit_receive() in audit.c, and are protected by +/* Add rule to given filterlist if not a duplicate. Protected by * audit_netlink_sem. */ -static inline int audit_add_rule(struct audit_rule *rule, +static inline int audit_add_rule(struct audit_entry *entry, struct list_head *list) { - struct audit_entry *entry; - int i; + struct audit_entry *e; /* Do not use the _rcu iterator here, since this is the only * addition routine. 
*/ - list_for_each_entry(entry, list, list) { - if (!audit_compare_rule(rule, &entry->rule)) + list_for_each_entry(e, list, list) { + if (!audit_compare_rule(&entry->rule, &e->rule)) return -EEXIST; } - for (i = 0; i < rule->field_count; i++) { - if (rule->fields[i] & AUDIT_UNUSED_BITS) - return -EINVAL; - if ( rule->fields[i] & AUDIT_NEGATE) - rule->fields[i] |= AUDIT_NOT_EQUAL; - else if ( (rule->fields[i] & AUDIT_OPERATORS) == 0 ) - rule->fields[i] |= AUDIT_EQUAL; - rule->fields[i] &= ~AUDIT_NEGATE; - } - - if (!(entry = kmalloc(sizeof(*entry), GFP_KERNEL))) - return -ENOMEM; - if (audit_copy_rule(&entry->rule, rule)) { - kfree(entry); - return -EINVAL; - } - if (entry->rule.flags & AUDIT_FILTER_PREPEND) { - entry->rule.flags &= ~AUDIT_FILTER_PREPEND; list_add_rcu(&entry->list, list); } else { list_add_tail_rcu(&entry->list, list); @@ -138,16 +345,9 @@ static inline int audit_add_rule(struct audit_rule *rule, return 0; } -static inline void audit_free_rule(struct rcu_head *head) -{ - struct audit_entry *e = container_of(head, struct audit_entry, rcu); - kfree(e); -} - -/* Note that audit_add_rule and audit_del_rule are called via - * audit_receive() in audit.c, and are protected by +/* Remove an existing rule from filterlist. Protected by * audit_netlink_sem. */ -static inline int audit_del_rule(struct audit_rule *rule, +static inline int audit_del_rule(struct audit_entry *entry, struct list_head *list) { struct audit_entry *e; @@ -155,16 +355,18 @@ static inline int audit_del_rule(struct audit_rule *rule, /* Do not use the _rcu iterator here, since this is the only * deletion routine. */ list_for_each_entry(e, list, list) { - if (!audit_compare_rule(rule, &e->rule)) { + if (!audit_compare_rule(&entry->rule, &e->rule)) { list_del_rcu(&e->list); - call_rcu(&e->rcu, audit_free_rule); + call_rcu(&e->rcu, audit_free_rule_rcu); return 0; } } return -ENOENT; /* No matching rule */ } -static int audit_list_rules(void *_dest) +/* List rules using struct audit_rule. 
Exists for backward + * compatibility with userspace. */ +static int audit_list(void *_dest) { int pid, seq; int *dest = _dest; @@ -180,9 +382,16 @@ static int audit_list_rules(void *_dest) /* The *_rcu iterators not needed here because we are always called with audit_netlink_sem held. */ for (i=0; i<AUDIT_NR_FILTERS; i++) { - list_for_each_entry(entry, &audit_filter_list[i], list) + list_for_each_entry(entry, &audit_filter_list[i], list) { + struct audit_rule *rule; + + rule = audit_krule_to_rule(&entry->rule); + if (unlikely(!rule)) + break; audit_send_reply(pid, seq, AUDIT_LIST, 0, 1, - &entry->rule, sizeof(entry->rule)); + rule, sizeof(*rule)); + kfree(rule); + } } audit_send_reply(pid, seq, AUDIT_LIST, 1, 1, NULL, 0); @@ -190,6 +399,40 @@ static int audit_list_rules(void *_dest) return 0; } +/* List rules using struct audit_rule_data. */ +static int audit_list_rules(void *_dest) +{ + int pid, seq; + int *dest = _dest; + struct audit_entry *e; + int i; + + pid = dest[0]; + seq = dest[1]; + kfree(dest); + + down(&audit_netlink_sem); + + /* The *_rcu iterators not needed here because we are + always called with audit_netlink_sem held. */ + for (i=0; i<AUDIT_NR_FILTERS; i++) { + list_for_each_entry(e, &audit_filter_list[i], list) { + struct audit_rule_data *data; + + data = audit_krule_to_data(&e->rule); + if (unlikely(!data)) + break; + audit_send_reply(pid, seq, AUDIT_LIST_RULES, 0, 1, + data, sizeof(*data)); + kfree(data); + } + } + audit_send_reply(pid, seq, AUDIT_LIST_RULES, 1, 1, NULL, 0); + + up(&audit_netlink_sem); + return 0; +} + /** * audit_receive_filter - apply all rules to the specified message type * @type: audit message type @@ -197,18 +440,20 @@ static int audit_list_rules(void *_dest) * @uid: target uid for netlink audit messages * @seq: netlink audit message sequence (serial) number * @data: payload data + * @datasz: size of payload data * @loginuid: loginuid of sender */ int audit_receive_filter(int type, int pid, int uid, int seq, void *data, - uid_t loginuid) + size_t datasz, uid_t loginuid) { struct task_struct *tsk; int *dest; - int err = 0; - unsigned listnr; + int err = 0; + struct audit_entry *entry; switch (type) { case AUDIT_LIST: + case AUDIT_LIST_RULES: /* We can't just spew out the rules here because we might fill * the available socket buffer space and deadlock waiting for * auditctl
to read from it... which isn't ever going to @@ -221,41 +466,48 @@ int audit_receive_filter(int type, int pid, int uid, int seq, void *data, dest[0] = pid; dest[1] = seq; - tsk = kthread_run(audit_list_rules, dest, "audit_list_rules"); + if (type == AUDIT_LIST) + tsk = kthread_run(audit_list, dest, "audit_list"); + else + tsk = kthread_run(audit_list_rules, dest, + "audit_list_rules"); if (IS_ERR(tsk)) { kfree(dest); err = PTR_ERR(tsk); } break; case AUDIT_ADD: - listnr = ((struct audit_rule *)data)->flags & ~AUDIT_FILTER_PREPEND; - switch(listnr) { - default: - return -EINVAL; - - case AUDIT_FILTER_USER: - case AUDIT_FILTER_TYPE: -#ifdef CONFIG_AUDITSYSCALL - case AUDIT_FILTER_ENTRY: - case AUDIT_FILTER_EXIT: - case AUDIT_FILTER_TASK: -#endif - ; - } - err = audit_add_rule(data, &audit_filter_list[listnr]); + case AUDIT_ADD_RULE: + if (type == AUDIT_ADD) + entry = audit_rule_to_entry(data); + else + entry = audit_data_to_entry(data, datasz); + if (IS_ERR(entry)) + return PTR_ERR(entry); + + err = audit_add_rule(entry, + &audit_filter_list[entry->rule.listnr]); if (!err) audit_log(NULL, GFP_KERNEL, AUDIT_CONFIG_CHANGE, "auid=%u added an audit rule\n", loginuid); + else + audit_free_rule(entry); break; case AUDIT_DEL: - listnr =((struct audit_rule *)data)->flags & ~AUDIT_FILTER_PREPEND; - if (listnr >= AUDIT_NR_FILTERS) - return -EINVAL; - - err = audit_del_rule(data, &audit_filter_list[listnr]); + case AUDIT_DEL_RULE: + if (type == AUDIT_DEL) + entry = audit_rule_to_entry(data); + else + entry = audit_data_to_entry(data, datasz); + if (IS_ERR(entry)) + return PTR_ERR(entry); + + err = audit_del_rule(entry, + &audit_filter_list[entry->rule.listnr]); if (!err) audit_log(NULL, GFP_KERNEL, AUDIT_CONFIG_CHANGE, "auid=%u removed an audit rule\n", loginuid); + audit_free_rule(entry); break; default: return -EINVAL; @@ -287,29 +539,27 @@ int audit_comparator(const u32 left, const u32 op, const u32 right) static int audit_filter_user_rules(struct netlink_skb_parms *cb, - 
struct audit_rule *rule, + struct audit_krule *rule, enum audit_state *state) { int i; for (i = 0; i < rule->field_count; i++) { - u32 field = rule->fields[i] & ~AUDIT_OPERATORS; - u32 op = rule->fields[i] & AUDIT_OPERATORS; - u32 value = rule->values[i]; + struct audit_field *f = &rule->fields[i]; int result = 0; - switch (field) { + switch (f->type) { case AUDIT_PID: - result = audit_comparator(cb->creds.pid, op, value); + result = audit_comparator(cb->creds.pid, f->op, f->val); break; case AUDIT_UID: - result = audit_comparator(cb->creds.uid, op, value); + result = audit_comparator(cb->creds.uid, f->op, f->val); break; case AUDIT_GID: - result = audit_comparator(cb->creds.gid, op, value); + result = audit_comparator(cb->creds.gid, f->op, f->val); break; case AUDIT_LOGINUID: - result = audit_comparator(cb->loginuid, op, value); + result = audit_comparator(cb->loginuid, f->op, f->val); break; } @@ -354,14 +604,11 @@ int audit_filter_type(int type) list_for_each_entry_rcu(e, &audit_filter_list[AUDIT_FILTER_TYPE], list) { - struct audit_rule *rule = &e->rule; int i; - for (i = 0; i < rule->field_count; i++) { - u32 field = rule->fields[i] & ~AUDIT_OPERATORS; - u32 op = rule->fields[i] & AUDIT_OPERATORS; - u32 value = rule->values[i]; - if ( field == AUDIT_MSGTYPE ) { - result = audit_comparator(type, op, value); + for (i = 0; i < e->rule.field_count; i++) { + struct audit_field *f = &e->rule.fields[i]; + if (f->type == AUDIT_MSGTYPE) { + result = audit_comparator(type, f->op, f->val); if (!result) break; } diff --git a/kernel/auditsc.c b/kernel/auditsc.c index 17719b303638..ba0878854777 100644 --- a/kernel/auditsc.c +++ b/kernel/auditsc.c @@ -162,70 +162,68 @@ struct audit_context { /* Compare a task_struct with an audit_rule. Return 1 on match, 0 * otherwise. 
*/ static int audit_filter_rules(struct task_struct *tsk, - struct audit_rule *rule, + struct audit_krule *rule, struct audit_context *ctx, enum audit_state *state) { int i, j; for (i = 0; i < rule->field_count; i++) { - u32 field = rule->fields[i] & ~AUDIT_OPERATORS; - u32 op = rule->fields[i] & AUDIT_OPERATORS; - u32 value = rule->values[i]; + struct audit_field *f = &rule->fields[i]; int result = 0; - switch (field) { + switch (f->type) { case AUDIT_PID: - result = audit_comparator(tsk->pid, op, value); + result = audit_comparator(tsk->pid, f->op, f->val); break; case AUDIT_UID: - result = audit_comparator(tsk->uid, op, value); + result = audit_comparator(tsk->uid, f->op, f->val); break; case AUDIT_EUID: - result = audit_comparator(tsk->euid, op, value); + result = audit_comparator(tsk->euid, f->op, f->val); break; case AUDIT_SUID: - result = audit_comparator(tsk->suid, op, value); + result = audit_comparator(tsk->suid, f->op, f->val); break; case AUDIT_FSUID: - result = audit_comparator(tsk->fsuid, op, value); + result = audit_comparator(tsk->fsuid, f->op, f->val); break; case AUDIT_GID: - result = audit_comparator(tsk->gid, op, value); + result = audit_comparator(tsk->gid, f->op, f->val); break; case AUDIT_EGID: - result = audit_comparator(tsk->egid, op, value); + result = audit_comparator(tsk->egid, f->op, f->val); break; case AUDIT_SGID: - result = audit_comparator(tsk->sgid, op, value); + result = audit_comparator(tsk->sgid, f->op, f->val); break; case AUDIT_FSGID: - result = audit_comparator(tsk->fsgid, op, value); + result = audit_comparator(tsk->fsgid, f->op, f->val); break; case AUDIT_PERS: - result = audit_comparator(tsk->personality, op, value); + result = audit_comparator(tsk->personality, f->op, f->val); break; case AUDIT_ARCH: if (ctx) - result = audit_comparator(ctx->arch, op, value); + result = audit_comparator(ctx->arch, f->op, f->val); break; case AUDIT_EXIT: if (ctx && ctx->return_valid) - result = audit_comparator(ctx->return_code, op, 
value); + result = audit_comparator(ctx->return_code, f->op, f->val); break; case AUDIT_SUCCESS: if (ctx && ctx->return_valid) { - if (value) - result = audit_comparator(ctx->return_valid, op, AUDITSC_SUCCESS); + if (f->val) + result = audit_comparator(ctx->return_valid, f->op, AUDITSC_SUCCESS); else - result = audit_comparator(ctx->return_valid, op, AUDITSC_FAILURE); + result = audit_comparator(ctx->return_valid, f->op, AUDITSC_FAILURE); } break; case AUDIT_DEVMAJOR: if (ctx) { for (j = 0; j < ctx->name_count; j++) { - if (audit_comparator(MAJOR(ctx->names[j].dev), op, value)) { + if (audit_comparator(MAJOR(ctx->names[j].dev), f->op, f->val)) { ++result; break; } @@ -235,7 +233,7 @@ static int audit_filter_rules(struct task_struct *tsk, case AUDIT_DEVMINOR: if (ctx) { for (j = 0; j < ctx->name_count; j++) { - if (audit_comparator(MINOR(ctx->names[j].dev), op, value)) { + if (audit_comparator(MINOR(ctx->names[j].dev), f->op, f->val)) { ++result; break; } @@ -245,8 +243,8 @@ static int audit_filter_rules(struct task_struct *tsk, case AUDIT_INODE: if (ctx) { for (j = 0; j < ctx->name_count; j++) { - if (audit_comparator(ctx->names[j].ino, op, value) || - audit_comparator(ctx->names[j].pino, op, value)) { + if (audit_comparator(ctx->names[j].ino, f->op, f->val) || + audit_comparator(ctx->names[j].pino, f->op, f->val)) { ++result; break; } @@ -256,14 +254,14 @@ static int audit_filter_rules(struct task_struct *tsk, case AUDIT_LOGINUID: result = 0; if (ctx) - result = audit_comparator(ctx->loginuid, op, value); + result = audit_comparator(ctx->loginuid, f->op, f->val); break; case AUDIT_ARG0: case AUDIT_ARG1: case AUDIT_ARG2: case AUDIT_ARG3: if (ctx) - result = audit_comparator(ctx->argv[field-AUDIT_ARG0], op, value); + result = audit_comparator(ctx->argv[f->type-AUDIT_ARG0], f->op, f->val); break; } -- cgit v1.2.3 From 5d3301088f7e412992d9e61cc3604cbdff3090ff Mon Sep 17 00:00:00 2001 From: Steve Grubb Date: Mon, 9 Jan 2006 09:48:17 -0500 Subject: [PATCH] add/remove 
rule update Hi, The following patch adds a little more information to the add/remove rule message emitted by the kernel. Signed-off-by: Steve Grubb Signed-off-by: Al Viro --- kernel/auditfilter.c | 16 +++++++++------- 1 file changed, 9 insertions(+), 7 deletions(-) (limited to 'kernel') diff --git a/kernel/auditfilter.c b/kernel/auditfilter.c index 686d514a3518..35f8fa82bb8b 100644 --- a/kernel/auditfilter.c +++ b/kernel/auditfilter.c @@ -487,10 +487,11 @@ int audit_receive_filter(int type, int pid, int uid, int seq, void *data, err = audit_add_rule(entry, &audit_filter_list[entry->rule.listnr]); - if (!err) - audit_log(NULL, GFP_KERNEL, AUDIT_CONFIG_CHANGE, - "auid=%u added an audit rule\n", loginuid); - else + audit_log(NULL, GFP_KERNEL, AUDIT_CONFIG_CHANGE, + "auid=%u add rule to list=%d res=%d\n", + loginuid, entry->rule.listnr, !err); + + if (err) audit_free_rule(entry); break; case AUDIT_DEL: @@ -504,9 +505,10 @@ int audit_receive_filter(int type, int pid, int uid, int seq, void *data, err = audit_del_rule(entry, &audit_filter_list[entry->rule.listnr]); - if (!err) - audit_log(NULL, GFP_KERNEL, AUDIT_CONFIG_CHANGE, - "auid=%u removed an audit rule\n", loginuid); + audit_log(NULL, GFP_KERNEL, AUDIT_CONFIG_CHANGE, + "auid=%u remove rule from list=%d res=%d\n", + loginuid, entry->rule.listnr, !err); + audit_free_rule(entry); break; default: -- cgit v1.2.3 From a6c043a887a9db32a545539426ddfc8cc2c28f8f Mon Sep 17 00:00:00 2001 From: Steve Grubb Date: Sun, 1 Jan 2006 14:07:00 -0500 Subject: [PATCH] Add tty to syscall audit records Hi, From the RBAC specs: FAU_SAR.1.1 The TSF shall provide the set of authorized RBAC administrators with the capability to read the following audit information from the audit records: (e) The User Session Identifier or Terminal Type A patch adding the tty for all syscalls is included in this email. Please apply.
Signed-off-by: Steve Grubb Signed-off-by: Al Viro --- kernel/auditsc.c | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/auditsc.c b/kernel/auditsc.c index ba0878854777..d3d499272d13 100644 --- a/kernel/auditsc.c +++ b/kernel/auditsc.c @@ -57,6 +57,7 @@ #include #include #include +#include #include "audit.h" @@ -573,6 +574,7 @@ static void audit_log_exit(struct audit_context *context, gfp_t gfp_mask) int i; struct audit_buffer *ab; struct audit_aux_data *aux; + const char *tty; ab = audit_log_start(context, gfp_mask, AUDIT_SYSCALL); if (!ab) @@ -585,11 +587,15 @@ static void audit_log_exit(struct audit_context *context, gfp_t gfp_mask) audit_log_format(ab, " success=%s exit=%ld", (context->return_valid==AUDITSC_SUCCESS)?"yes":"no", context->return_code); + if (current->signal->tty && current->signal->tty->name) + tty = current->signal->tty->name; + else + tty = "(none)"; audit_log_format(ab, " a0=%lx a1=%lx a2=%lx a3=%lx items=%d" " pid=%d auid=%u uid=%u gid=%u" " euid=%u suid=%u fsuid=%u" - " egid=%u sgid=%u fsgid=%u", + " egid=%u sgid=%u fsgid=%u tty=%s", context->argv[0], context->argv[1], context->argv[2], @@ -600,7 +606,7 @@ static void audit_log_exit(struct audit_context *context, gfp_t gfp_mask) context->uid, context->gid, context->euid, context->suid, context->fsuid, - context->egid, context->sgid, context->fsgid); + context->egid, context->sgid, context->fsgid, tty); audit_log_task_info(ab, gfp_mask); audit_log_end(ab); -- cgit v1.2.3 From d9d9ec6e2c45b22282cd36cf92fcb23d504350a8 Mon Sep 17 00:00:00 2001 From: Dustin Kirkland Date: Thu, 16 Feb 2006 13:40:01 -0600 Subject: [PATCH] Fix audit operators Darrel Goeddel initiated a discussion on IRC regarding the possibility of audit_comparator() returning -EINVAL signaling an invalid operator. It is possible when creating the rule to assure that the operator is one of the 6 sane values. 
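As an illustration of the "6 sane values" mentioned above, here is a minimal user-space sketch, not the kernel code. The bit values 1, 2 and 4 for <, > and = are simplified stand-ins for the real AUDIT_LESS_THAN, AUDIT_GREATER_THAN and AUDIT_EQUAL masks, and the names op_is_valid() and compare() are hypothetical. It assumes the operator is validated once, when the rule is created, so the comparator itself only ever has to return 1 or 0.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-ins for the kernel's operator bits (assumed values). */
enum {
	OP_LESS_THAN             = 1,
	OP_GREATER_THAN          = 2,
	OP_EQUAL                 = 4,
	OP_NOT_EQUAL             = OP_LESS_THAN | OP_GREATER_THAN, /* 3 */
	OP_LESS_THAN_OR_EQUAL    = OP_LESS_THAN | OP_EQUAL,        /* 5 */
	OP_GREATER_THAN_OR_EQUAL = OP_GREATER_THAN | OP_EQUAL      /* 6 */
};

/* Hypothetical helper: accept only the six meaningful operators (1..6).
 * 0 (no bits set) and 7 (all bits set) are rejected. */
static int op_is_valid(uint32_t op)
{
	return op >= 1 && op <= 6;
}

/* Hypothetical comparator: returns 1 or 0. Callers are expected to have
 * checked the operator with op_is_valid() when the rule was created. */
static int compare(uint32_t left, uint32_t op, uint32_t right)
{
	switch (op) {
	case OP_EQUAL:                 return left == right;
	case OP_NOT_EQUAL:             return left != right;
	case OP_LESS_THAN:             return left < right;
	case OP_LESS_THAN_OR_EQUAL:    return left <= right;
	case OP_GREATER_THAN:          return left > right;
	case OP_GREATER_THAN_OR_EQUAL: return left >= right;
	}
	/* Unreachable if rules are validated at insertion time. */
	assert(!"invalid operator reached compare()");
	return 0;
}
```

Rejecting bad operators at rule creation keeps the check out of the per-event comparison path, which is the design choice this patch argues for.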
Here's a snip from include/linux/audit.h Note that 0 (nonsense) and 7 (all operators) are not valid values for an operator. ... /* These are the supported operators. * 4 2 1 * = > < * ------- * 0 0 0 0 nonsense * 0 0 1 1 < * 0 1 0 2 > * 0 1 1 3 != * 1 0 0 4 = * 1 0 1 5 <= * 1 1 0 6 >= * 1 1 1 7 all operators */ ... Furthermore, prior to adding these extended operators, flagging the AUDIT_NEGATE bit implied !=, and otherwise == was assumed. The following code forces the operator to be != if the AUDIT_NEGATE bit was flipped on. And if no operator was specified, == is assumed. The only invalid condition is if the AUDIT_NEGATE bit is off and all of the AUDIT_EQUAL, AUDIT_LESS_THAN, and AUDIT_GREATER_THAN bits are on--clearly a nonsensical operator. Now that this is handled at rule insertion time, the default -EINVAL return of audit_comparator() is eliminated such that the function can only return 1 or 0. If this is acceptable, let's get this applied to the current tree. :-Dustin -- Signed-off-by: Al Viro (cherry picked from 9bf0a8e137040f87d1b563336d4194e38fb2ba1a commit) --- kernel/auditfilter.c | 18 ++++++++++++------ 1 file changed, 12 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/auditfilter.c b/kernel/auditfilter.c index 35f8fa82bb8b..b85fd8cce11f 100644 --- a/kernel/auditfilter.c +++ b/kernel/auditfilter.c @@ -160,11 +160,17 @@ static struct audit_entry *audit_rule_to_entry(struct audit_rule *rule) f->val = rule->values[i]; entry->rule.vers_ops = (f->op & AUDIT_OPERATORS) ? 
2 : 1; + + /* Support for legacy operators where + * AUDIT_NEGATE bit signifies != and otherwise assumes == */ if (f->op & AUDIT_NEGATE) - f->op |= AUDIT_NOT_EQUAL; - else if (!(f->op & AUDIT_OPERATORS)) - f->op |= AUDIT_EQUAL; - f->op &= ~AUDIT_NEGATE; + f->op = AUDIT_NOT_EQUAL; + else if (!f->op) + f->op = AUDIT_EQUAL; + else if (f->op == AUDIT_OPERATORS) { + err = -EINVAL; + goto exit_free; + } } exit_nofree: @@ -533,9 +539,9 @@ int audit_comparator(const u32 left, const u32 op, const u32 right) return (left > right); case AUDIT_GREATER_THAN_OR_EQUAL: return (left >= right); - default: - return -EINVAL; } + BUG(); + return 0; } -- cgit v1.2.3 From 4023e020807ea249ae83f0d1d851b4c7cf0afd8a Mon Sep 17 00:00:00 2001 From: Ingo Molnar Date: Tue, 7 Mar 2006 23:51:39 -0800 Subject: [PATCH] simplify audit_free() locking Simplify audit_free()'s locking: no need to lock a task that we are tearing down. [the extra locking also caused false positives in the lock validator] Signed-off-by: Ingo Molnar Cc: David Woodhouse Signed-off-by: Andrew Morton Signed-off-by: Al Viro --- kernel/auditsc.c | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/auditsc.c b/kernel/auditsc.c index d3d499272d13..b613ec89e99c 100644 --- a/kernel/auditsc.c +++ b/kernel/auditsc.c @@ -704,10 +704,14 @@ void audit_free(struct task_struct *tsk) { struct audit_context *context; - task_lock(tsk); + /* + * No need to lock the task - when we execute audit_free() + * then the task has no external references anymore, and + * we are tearing it down. (The locking also confuses + * DEBUG_LOCKDEP - this freeing may occur in softirq + * contexts as well, via RCU.) 
+ */ context = audit_get_context(tsk, 0, 0); - task_unlock(tsk); - if (likely(!context)) return; -- cgit v1.2.3 From 5a0bbce58bb25bd756f7ec437319d6ed2201a18b Mon Sep 17 00:00:00 2001 From: Ingo Molnar Date: Tue, 7 Mar 2006 23:51:38 -0800 Subject: [PATCH] sem2mutex: audit_netlink_sem Semaphore to mutex conversion. The conversion was generated via scripts, and the result was validated automatically via a script as well. Signed-off-by: Ingo Molnar Cc: David Woodhouse Signed-off-by: Andrew Morton Signed-off-by: Al Viro --- kernel/audit.c | 6 +++--- kernel/audit.h | 3 ++- kernel/auditfilter.c | 16 ++++++++-------- 3 files changed, 13 insertions(+), 12 deletions(-) (limited to 'kernel') diff --git a/kernel/audit.c b/kernel/audit.c index 4eb97b62d7fa..6a44e0a7707d 100644 --- a/kernel/audit.c +++ b/kernel/audit.c @@ -114,7 +114,7 @@ static DECLARE_WAIT_QUEUE_HEAD(audit_backlog_wait); /* The netlink socket is only to be read by 1 CPU, which lets us assume * that list additions and deletions never happen simultaneously in * auditsc.c */ -DECLARE_MUTEX(audit_netlink_sem); +DEFINE_MUTEX(audit_netlink_mutex); /* AUDIT_BUFSIZ is the size of the temporary buffer used for formatting * audit records. 
Since printk uses a 1024 byte buffer, this buffer @@ -538,14 +538,14 @@ static void audit_receive(struct sock *sk, int length) struct sk_buff *skb; unsigned int qlen; - down(&audit_netlink_sem); + mutex_lock(&audit_netlink_mutex); for (qlen = skb_queue_len(&sk->sk_receive_queue); qlen; qlen--) { skb = skb_dequeue(&sk->sk_receive_queue); audit_receive_skb(skb); kfree_skb(skb); } - up(&audit_netlink_sem); + mutex_unlock(&audit_netlink_mutex); } diff --git a/kernel/audit.h b/kernel/audit.h index 4b602cdcabef..bc5392076e2b 100644 --- a/kernel/audit.h +++ b/kernel/audit.h @@ -19,6 +19,7 @@ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA */ +#include #include #include @@ -84,4 +85,4 @@ extern void audit_send_reply(int pid, int seq, int type, void *payload, int size); extern void audit_log_lost(const char *message); extern void audit_panic(const char *message); -extern struct semaphore audit_netlink_sem; +extern struct mutex audit_netlink_mutex; diff --git a/kernel/auditfilter.c b/kernel/auditfilter.c index b85fd8cce11f..d3a8539f3a83 100644 --- a/kernel/auditfilter.c +++ b/kernel/auditfilter.c @@ -329,7 +329,7 @@ static int audit_compare_rule(struct audit_krule *a, struct audit_krule *b) } /* Add rule to given filterlist if not a duplicate. Protected by - * audit_netlink_sem. */ + * audit_netlink_mutex. */ static inline int audit_add_rule(struct audit_entry *entry, struct list_head *list) { @@ -352,7 +352,7 @@ static inline int audit_add_rule(struct audit_entry *entry, } /* Remove an existing rule from filterlist. Protected by - * audit_netlink_sem. */ + * audit_netlink_mutex. */ static inline int audit_del_rule(struct audit_entry *entry, struct list_head *list) { @@ -383,10 +383,10 @@ static int audit_list(void *_dest) seq = dest[1]; kfree(dest); - down(&audit_netlink_sem); + mutex_lock(&audit_netlink_mutex); /* The *_rcu iterators not needed here because we are - always called with audit_netlink_sem held. 
*/ + always called with audit_netlink_mutex held. */ for (i=0; i Date: Thu, 9 Mar 2006 00:33:47 +0100 Subject: [PATCH] EXPORT_SYMBOL patch for audit_log, audit_log_start, audit_log_end and audit_format MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Hi, This is a trivial patch that enables the possibility of using some auditing functions within loadable kernel modules (ie. inside a Linux Security Module). _ Make the audit_log_start, audit_log_end, audit_format and audit_log interfaces available to Loadable Kernel Modules, thus making possible the usage of the audit framework inside LSMs, etc. Signed-off-by: > Signed-off-by: Al Viro --- kernel/audit.c | 5 +++++ 1 file changed, 5 insertions(+) (limited to 'kernel') diff --git a/kernel/audit.c b/kernel/audit.c index 6a44e0a7707d..c9345d3e8ada 100644 --- a/kernel/audit.c +++ b/kernel/audit.c @@ -994,3 +994,8 @@ void audit_log(struct audit_context *ctx, gfp_t gfp_mask, int type, audit_log_end(ab); } } + +EXPORT_SYMBOL(audit_log_start); +EXPORT_SYMBOL(audit_log_end); +EXPORT_SYMBOL(audit_log_format); +EXPORT_SYMBOL(audit_log); -- cgit v1.2.3 From 71e1c784b24a026a490b3de01541fc5ee14ebc09 Mon Sep 17 00:00:00 2001 From: Amy Griffis Date: Mon, 6 Mar 2006 22:40:05 -0500 Subject: [PATCH] fix audit_init failure path Make audit_init() failure path handle situations where the audit_panic() action is not AUDIT_FAIL_PANIC (default is AUDIT_FAIL_PRINTK). Other uses of audit_sock are not reached unless audit's netlink message handler is properly registered. Bug noticed by Peter Staubach. 
Signed-off-by: Amy Griffis Signed-off-by: Al Viro --- kernel/audit.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/audit.c b/kernel/audit.c index c9345d3e8ada..04fe2e301b61 100644 --- a/kernel/audit.c +++ b/kernel/audit.c @@ -558,8 +558,9 @@ static int __init audit_init(void) THIS_MODULE); if (!audit_sock) audit_panic("cannot initialize netlink socket"); + else + audit_sock->sk_sndtimeo = MAX_SCHEDULE_TIMEOUT; - audit_sock->sk_sndtimeo = MAX_SCHEDULE_TIMEOUT; skb_queue_head_init(&audit_skb_queue); audit_initialized = 1; audit_enabled = audit_default; -- cgit v1.2.3 From 51107301b629640f9ab76fe23bf385e187b9ac29 Mon Sep 17 00:00:00 2001 From: Jun'ichi Nomura Date: Wed, 15 Mar 2006 08:28:55 -0500 Subject: [PATCH] kobject: fix build error if CONFIG_SYSFS=n Moving uevent_seqnum and uevent_helper to kobject_uevent.c because they are used even if CONFIG_SYSFS=n while kernel/ksysfs.c is built only if CONFIG_SYSFS=y, Signed-off-by: Jun'ichi Nomura Signed-off-by: Greg Kroah-Hartman --- kernel/ksysfs.c | 3 --- 1 file changed, 3 deletions(-) (limited to 'kernel') diff --git a/kernel/ksysfs.c b/kernel/ksysfs.c index d5eeae0fa5bc..f2690ed74530 100644 --- a/kernel/ksysfs.c +++ b/kernel/ksysfs.c @@ -15,9 +15,6 @@ #include #include -u64 uevent_seqnum; -char uevent_helper[UEVENT_HELPER_PATH_LEN] = "/sbin/hotplug"; - #define KERNEL_ATTR_RO(_name) \ static struct subsys_attribute _name##_attr = __ATTR_RO(_name) -- cgit v1.2.3 From 3fd6805f4dfb02bcfb5634972eabad0e790f119a Mon Sep 17 00:00:00 2001 From: Sam Ravnborg Date: Wed, 8 Feb 2006 21:16:45 +0100 Subject: [PATCH] Clean up module.c symbol searching logic Signed-off-by: Sam Ravnborg Signed-off-by: Greg Kroah-Hartman --- kernel/module.c | 73 ++++++++++++++++++++++++++++++++------------------------- 1 file changed, 41 insertions(+), 32 deletions(-) (limited to 'kernel') diff --git a/kernel/module.c b/kernel/module.c index 5aad477ddc79..2a892b20d68f 100644 --- a/kernel/module.c +++ 
b/kernel/module.c @@ -135,6 +135,18 @@ extern const unsigned long __start___kcrctab_gpl[]; #define symversion(base, idx) ((base) ? ((base) + (idx)) : NULL) #endif +/* lookup symbol in given range of kernel_symbols */ +static const struct kernel_symbol *lookup_symbol(const char *name, + const struct kernel_symbol *start, + const struct kernel_symbol *stop) +{ + const struct kernel_symbol *ks = start; + for (; ks < stop; ks++) + if (strcmp(ks->name, name) == 0) + return ks; + return NULL; +} + /* Find a symbol, return value, crc and module which owns it */ static unsigned long __find_symbol(const char *name, struct module **owner, @@ -142,39 +154,41 @@ static unsigned long __find_symbol(const char *name, int gplok) { struct module *mod; - unsigned int i; + const struct kernel_symbol *ks; /* Core kernel first. */ *owner = NULL; - for (i = 0; __start___ksymtab+i < __stop___ksymtab; i++) { - if (strcmp(__start___ksymtab[i].name, name) == 0) { - *crc = symversion(__start___kcrctab, i); - return __start___ksymtab[i].value; - } + ks = lookup_symbol(name, __start___ksymtab, __stop___ksymtab); + if (ks) { + *crc = symversion(__start___kcrctab, (ks - __start___ksymtab)); + return ks->value; } if (gplok) { - for (i = 0; __start___ksymtab_gpl+i<__stop___ksymtab_gpl; i++) - if (strcmp(__start___ksymtab_gpl[i].name, name) == 0) { - *crc = symversion(__start___kcrctab_gpl, i); - return __start___ksymtab_gpl[i].value; - } + ks = lookup_symbol(name, __start___ksymtab_gpl, + __stop___ksymtab_gpl); + if (ks) { + *crc = symversion(__start___kcrctab_gpl, + (ks - __start___ksymtab_gpl)); + return ks->value; + } } /* Now try modules. 
*/ list_for_each_entry(mod, &modules, list) { *owner = mod; - for (i = 0; i < mod->num_syms; i++) - if (strcmp(mod->syms[i].name, name) == 0) { - *crc = symversion(mod->crcs, i); - return mod->syms[i].value; - } + ks = lookup_symbol(name, mod->syms, mod->syms + mod->num_syms); + if (ks) { + *crc = symversion(mod->crcs, (ks - mod->syms)); + return ks->value; + } if (gplok) { - for (i = 0; i < mod->num_gpl_syms; i++) { - if (strcmp(mod->gpl_syms[i].name, name) == 0) { - *crc = symversion(mod->gpl_crcs, i); - return mod->gpl_syms[i].value; - } + ks = lookup_symbol(name, mod->gpl_syms, + mod->gpl_syms + mod->num_gpl_syms); + if (ks) { + *crc = symversion(mod->gpl_crcs, + (ks - mod->gpl_syms)); + return ks->value; } } } @@ -1444,18 +1458,13 @@ static void setup_modinfo(struct module *mod, Elf_Shdr *sechdrs, #ifdef CONFIG_KALLSYMS int is_exported(const char *name, const struct module *mod) { - unsigned int i; - - if (!mod) { - for (i = 0; __start___ksymtab+i < __stop___ksymtab; i++) - if (strcmp(__start___ksymtab[i].name, name) == 0) - return 1; - return 0; - } - for (i = 0; i < mod->num_syms; i++) - if (strcmp(mod->syms[i].name, name) == 0) + if (!mod && lookup_symbol(name, __start___ksymtab, __stop___ksymtab)) + return 1; + else + if (mod && lookup_symbol(name, mod->syms, mod->syms + mod->num_syms)) return 1; - return 0; + else + return 0; } /* As per nm */ -- cgit v1.2.3 From 9f28bb7e1d0188a993403ab39b774785892805e1 Mon Sep 17 00:00:00 2001 From: Greg Kroah-Hartman Date: Mon, 20 Mar 2006 13:17:13 -0800 Subject: [PATCH] add EXPORT_SYMBOL_GPL_FUTURE() This patch adds the ability to mark symbols that will be changed in the future, so that kernel modules that don't include MODULE_LICENSE("GPL") and use the symbols, will be flagged and printed out to the system log.
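The lookup policy described above can be sketched in user space as follows. This is only an illustration under stated assumptions, not the kernel code: the `toy_*` names, the three small tables, and the symbol values are all invented for the example. It assumes three export classes: plain exports resolve for every module, GPL-only exports resolve only when the caller is GPL-compatible (`gplok`), and "GPL in the future" exports still resolve for everyone but produce a warning for non-GPL callers.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's struct kernel_symbol. */
struct toy_symbol {
	const char *name;
	unsigned long value;
};

/* Linear search over [start, stop), in the spirit of lookup_symbol(). */
static const struct toy_symbol *toy_lookup(const char *name,
					   const struct toy_symbol *start,
					   const struct toy_symbol *stop)
{
	for (; start < stop; start++)
		if (strcmp(start->name, name) == 0)
			return start;
	return NULL;
}

/* Invented tables; the values are arbitrary placeholders. */
static const struct toy_symbol toy_plain[] = { { "printk", 0x1000 } };
static const struct toy_symbol toy_gpl[] = { { "synchronize_rcu", 0x2000 } };
static const struct toy_symbol toy_gpl_future[] = { { "call_rcu", 0x3000 } };

#define TOY_END(t) ((t) + sizeof(t) / sizeof((t)[0]))

/* Counts warnings so the behaviour can be checked without parsing stderr. */
static int toy_warnings;

/* Returns the symbol value, or 0 if the caller may not use the symbol. */
static unsigned long toy_find_symbol(const char *name, int gplok)
{
	const struct toy_symbol *ks;

	/* Plain exports are visible to every caller. */
	ks = toy_lookup(name, toy_plain, TOY_END(toy_plain));
	if (ks)
		return ks->value;

	/* GPL-only exports are searched only for GPL-compatible callers. */
	if (gplok) {
		ks = toy_lookup(name, toy_gpl, TOY_END(toy_gpl));
		if (ks)
			return ks->value;
	}

	/* Future-GPL exports still resolve, but non-GPL callers are warned. */
	ks = toy_lookup(name, toy_gpl_future, TOY_END(toy_gpl_future));
	if (ks) {
		if (!gplok) {
			toy_warnings++;
			fprintf(stderr, "Symbol %s is being used by a non-GPL "
				"module, which will not be allowed in the "
				"future\n", name);
		}
		return ks->value;
	}
	return 0;
}
```

With these tables, `toy_find_symbol("call_rcu", 0)` resolves and bumps `toy_warnings`, while `toy_find_symbol("synchronize_rcu", 0)` returns 0 because the GPL-only table is skipped for a non-GPL caller.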
Signed-off-by: Greg Kroah-Hartman --- kernel/module.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 47 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/module.c b/kernel/module.c index 2a892b20d68f..5ca99fbe9f44 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -126,8 +126,11 @@ extern const struct kernel_symbol __start___ksymtab[]; extern const struct kernel_symbol __stop___ksymtab[]; extern const struct kernel_symbol __start___ksymtab_gpl[]; extern const struct kernel_symbol __stop___ksymtab_gpl[]; +extern const struct kernel_symbol __start___ksymtab_gpl_future[]; +extern const struct kernel_symbol __stop___ksymtab_gpl_future[]; extern const unsigned long __start___kcrctab[]; extern const unsigned long __start___kcrctab_gpl[]; +extern const unsigned long __start___kcrctab_gpl_future[]; #ifndef CONFIG_MODVERSIONS #define symversion(base, idx) NULL @@ -172,6 +175,22 @@ static unsigned long __find_symbol(const char *name, return ks->value; } } + ks = lookup_symbol(name, __start___ksymtab_gpl_future, + __stop___ksymtab_gpl_future); + if (ks) { + if (!gplok) { + printk(KERN_WARNING "Symbol %s is being used " + "by a non-GPL module, which will not " + "be allowed in the future\n", name); + printk(KERN_WARNING "Please see the file " + "Documentation/feature-removal-schedule.txt " + "in the kernel source tree for more " + "details.\n"); + } + *crc = symversion(__start___kcrctab_gpl_future, + (ks - __start___ksymtab_gpl_future)); + return ks->value; + } /* Now try modules. 
*/ list_for_each_entry(mod, &modules, list) { @@ -191,6 +210,23 @@ static unsigned long __find_symbol(const char *name, return ks->value; } } + ks = lookup_symbol(name, mod->gpl_future_syms, + (mod->gpl_future_syms + + mod->num_gpl_future_syms)); + if (ks) { + if (!gplok) { + printk(KERN_WARNING "Symbol %s is being used " + "by a non-GPL module, which will not " + "be allowed in the future\n", name); + printk(KERN_WARNING "Please see the file " + "Documentation/feature-removal-schedule.txt " + "in the kernel source tree for more " + "details.\n"); + } + *crc = symversion(mod->gpl_future_crcs, + (ks - mod->gpl_future_syms)); + return ks->value; + } } DEBUGP("Failed to find symbol %s\n", name); return 0; @@ -1546,7 +1582,8 @@ static struct module *load_module(void __user *umod, char *secstrings, *args, *modmagic, *strtab = NULL; unsigned int i, symindex = 0, strindex = 0, setupindex, exindex, exportindex, modindex, obsparmindex, infoindex, gplindex, - crcindex, gplcrcindex, versindex, pcpuindex; + crcindex, gplcrcindex, versindex, pcpuindex, gplfutureindex, + gplfuturecrcindex; long arglen; struct module *mod; long err = 0; @@ -1627,8 +1664,10 @@ static struct module *load_module(void __user *umod, /* Optional sections */ exportindex = find_sec(hdr, sechdrs, secstrings, "__ksymtab"); gplindex = find_sec(hdr, sechdrs, secstrings, "__ksymtab_gpl"); + gplfutureindex = find_sec(hdr, sechdrs, secstrings, "__ksymtab_gpl_future"); crcindex = find_sec(hdr, sechdrs, secstrings, "__kcrctab"); gplcrcindex = find_sec(hdr, sechdrs, secstrings, "__kcrctab_gpl"); + gplfuturecrcindex = find_sec(hdr, sechdrs, secstrings, "__kcrctab_gpl_future"); setupindex = find_sec(hdr, sechdrs, secstrings, "__param"); exindex = find_sec(hdr, sechdrs, secstrings, "__ex_table"); obsparmindex = find_sec(hdr, sechdrs, secstrings, "__obsparm"); @@ -1784,10 +1823,16 @@ static struct module *load_module(void __user *umod, mod->gpl_syms = (void *)sechdrs[gplindex].sh_addr; if (gplcrcindex) mod->gpl_crcs = 
(void *)sechdrs[gplcrcindex].sh_addr; + mod->num_gpl_future_syms = sechdrs[gplfutureindex].sh_size / + sizeof(*mod->gpl_future_syms); + mod->gpl_future_syms = (void *)sechdrs[gplfutureindex].sh_addr; + if (gplfuturecrcindex) + mod->gpl_future_crcs = (void *)sechdrs[gplfuturecrcindex].sh_addr; #ifdef CONFIG_MODVERSIONS if ((mod->num_syms && !crcindex) || - (mod->num_gpl_syms && !gplcrcindex)) { + (mod->num_gpl_syms && !gplcrcindex) || + (mod->num_gpl_future_syms && !gplfuturecrcindex)) { printk(KERN_WARNING "%s: No versions for exported symbols." " Tainting kernel.\n", mod->name); add_taint(TAINT_FORCED_MODULE); -- cgit v1.2.3 From 01ca70dca5c64cb774a8ac2f50bddff21d60169f Mon Sep 17 00:00:00 2001 From: Greg Kroah-Hartman Date: Mon, 20 Mar 2006 13:17:13 -0800 Subject: [PATCH] add EXPORT_SYMBOL_GPL_FUTURE() to RCU subsystem As the RCU symbols are going to be changed to GPL in the near future, let's warn users that this is going to happen. Cc: Paul McKenney Acked-by: Dipankar Sarma Signed-off-by: Greg Kroah-Hartman --- kernel/rcupdate.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/rcupdate.c b/kernel/rcupdate.c index 8cf15a569fcd..fedf5e369755 100644 --- a/kernel/rcupdate.c +++ b/kernel/rcupdate.c @@ -609,7 +609,7 @@ module_param(qlowmark, int, 0); module_param(rsinterval, int, 0); #endif EXPORT_SYMBOL_GPL(rcu_batches_completed); -EXPORT_SYMBOL(call_rcu); /* WARNING: GPL-only in April 2006. */ -EXPORT_SYMBOL(call_rcu_bh); /* WARNING: GPL-only in April 2006. */ +EXPORT_SYMBOL_GPL_FUTURE(call_rcu); /* WARNING: GPL-only in April 2006. */ +EXPORT_SYMBOL_GPL_FUTURE(call_rcu_bh); /* WARNING: GPL-only in April 2006. */ EXPORT_SYMBOL_GPL(synchronize_rcu); -EXPORT_SYMBOL(synchronize_kernel); /* WARNING: GPL-only in April 2006. */ +EXPORT_SYMBOL_GPL_FUTURE(synchronize_kernel); /* WARNING: GPL-only in April 2006.
*/ -- cgit v1.2.3 From 03e88ae1b13dfdc8bbaa59b8198e1ca53aad12ac Mon Sep 17 00:00:00 2001 From: Greg Kroah-Hartman Date: Thu, 16 Feb 2006 13:50:23 -0800 Subject: [PATCH] fix module sysfs files reference counting The module files, refcnt, version, and srcversion did not properly increment the owner's module reference count, allowing the modules to be removed while the files were open, causing oopses. This patch fixes this, and also fixes the problem that the version and srcversion files were not showing up, unless CONFIG_MODULE_UNLOAD was enabled, which is not correct. Cc: Nathan Lynch Signed-off-by: Greg Kroah-Hartman --- kernel/module.c | 77 +++++++++++++++++++++++---------------------------------- kernel/params.c | 10 -------- 2 files changed, 31 insertions(+), 56 deletions(-) (limited to 'kernel') diff --git a/kernel/module.c b/kernel/module.c index 5ca99fbe9f44..77764f22f021 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -429,7 +429,6 @@ static inline void percpu_modcopy(void *pcpudst, const void *src, } #endif /* CONFIG_SMP */ -#ifdef CONFIG_MODULE_UNLOAD #define MODINFO_ATTR(field) \ static void setup_modinfo_##field(struct module *mod, const char *s) \ { \ @@ -461,12 +460,7 @@ static struct module_attribute modinfo_##field = { \ MODINFO_ATTR(version); MODINFO_ATTR(srcversion); -static struct module_attribute *modinfo_attrs[] = { - &modinfo_version, - &modinfo_srcversion, - NULL, -}; - +#ifdef CONFIG_MODULE_UNLOAD /* Init the unload section of the module. 
*/ static void module_unload_init(struct module *mod) { @@ -781,6 +775,15 @@ static inline void module_unload_init(struct module *mod) } #endif /* CONFIG_MODULE_UNLOAD */ +static struct module_attribute *modinfo_attrs[] = { + &modinfo_version, + &modinfo_srcversion, +#ifdef CONFIG_MODULE_UNLOAD + &refcnt, +#endif + NULL, +}; + #ifdef CONFIG_OBSOLETE_MODPARM /* Bounds checking done below */ static int obsparm_copy_string(const char *val, struct kernel_param *kp) @@ -1106,37 +1109,28 @@ static inline void remove_sect_attrs(struct module *mod) } #endif /* CONFIG_KALLSYMS */ - -#ifdef CONFIG_MODULE_UNLOAD -static inline int module_add_refcnt_attr(struct module *mod) -{ - return sysfs_create_file(&mod->mkobj.kobj, &refcnt.attr); -} -static void module_remove_refcnt_attr(struct module *mod) -{ - return sysfs_remove_file(&mod->mkobj.kobj, &refcnt.attr); -} -#else -static inline int module_add_refcnt_attr(struct module *mod) -{ - return 0; -} -static void module_remove_refcnt_attr(struct module *mod) -{ -} -#endif - -#ifdef CONFIG_MODULE_UNLOAD static int module_add_modinfo_attrs(struct module *mod) { struct module_attribute *attr; + struct module_attribute *temp_attr; int error = 0; int i; + mod->modinfo_attrs = kzalloc((sizeof(struct module_attribute) * + (ARRAY_SIZE(modinfo_attrs) + 1)), + GFP_KERNEL); + if (!mod->modinfo_attrs) + return -ENOMEM; + + temp_attr = mod->modinfo_attrs; for (i = 0; (attr = modinfo_attrs[i]) && !error; i++) { if (!attr->test || - (attr->test && attr->test(mod))) - error = sysfs_create_file(&mod->mkobj.kobj,&attr->attr); + (attr->test && attr->test(mod))) { + memcpy(temp_attr, attr, sizeof(*temp_attr)); + temp_attr->attr.owner = mod; + error = sysfs_create_file(&mod->mkobj.kobj,&temp_attr->attr); + ++temp_attr; + } } return error; } @@ -1146,12 +1140,16 @@ static void module_remove_modinfo_attrs(struct module *mod) struct module_attribute *attr; int i; - for (i = 0; (attr = modinfo_attrs[i]); i++) { + for (i = 0; (attr = 
&mod->modinfo_attrs[i]); i++) { + /* pick a field to test for end of list */ + if (!attr->attr.name) + break; sysfs_remove_file(&mod->mkobj.kobj,&attr->attr); - attr->free(mod); + if (attr->free) + attr->free(mod); } + kfree(mod->modinfo_attrs); } -#endif static int mod_sysfs_setup(struct module *mod, struct kernel_param *kparam, @@ -1169,19 +1167,13 @@ static int mod_sysfs_setup(struct module *mod, if (err) goto out; - err = module_add_refcnt_attr(mod); - if (err) - goto out_unreg; - err = module_param_sysfs_setup(mod, kparam, num_params); if (err) goto out_unreg; -#ifdef CONFIG_MODULE_UNLOAD err = module_add_modinfo_attrs(mod); if (err) goto out_unreg; -#endif return 0; @@ -1193,10 +1185,7 @@ out: static void mod_kobject_remove(struct module *mod) { -#ifdef CONFIG_MODULE_UNLOAD module_remove_modinfo_attrs(mod); -#endif - module_remove_refcnt_attr(mod); module_param_sysfs_remove(mod); kobject_unregister(&mod->mkobj.kobj); @@ -1474,7 +1463,6 @@ static char *get_modinfo(Elf_Shdr *sechdrs, return NULL; } -#ifdef CONFIG_MODULE_UNLOAD static void setup_modinfo(struct module *mod, Elf_Shdr *sechdrs, unsigned int infoindex) { @@ -1489,7 +1477,6 @@ static void setup_modinfo(struct module *mod, Elf_Shdr *sechdrs, attr->attr.name)); } } -#endif #ifdef CONFIG_KALLSYMS int is_exported(const char *name, const struct module *mod) @@ -1803,10 +1790,8 @@ static struct module *load_module(void __user *umod, if (strcmp(mod->name, "driverloader") == 0) add_taint(TAINT_PROPRIETARY_MODULE); -#ifdef CONFIG_MODULE_UNLOAD /* Set up MODINFO_ATTR fields */ setup_modinfo(mod, sechdrs, infoindex); -#endif /* Fix up syms, so that st_value is a pointer to location. 
*/ err = simplify_symbols(sechdrs, symindex, strtab, versindex, pcpuindex, diff --git a/kernel/params.c b/kernel/params.c index c76ad25e6a21..a29150582310 100644 --- a/kernel/params.c +++ b/kernel/params.c @@ -638,13 +638,8 @@ static ssize_t module_attr_show(struct kobject *kobj, if (!attribute->show) return -EIO; - if (!try_module_get(mk->mod)) - return -ENODEV; - ret = attribute->show(attribute, mk->mod, buf); - module_put(mk->mod); - return ret; } @@ -662,13 +657,8 @@ static ssize_t module_attr_store(struct kobject *kobj, if (!attribute->store) return -EIO; - if (!try_module_get(mk->mod)) - return -ENODEV; - ret = attribute->store(attribute, mk->mod, buf, len); - module_put(mk->mod); - return ret; } -- cgit v1.2.3 From 9430d58e34ec3861e1ca72f8e49105b227aad327 Mon Sep 17 00:00:00 2001 From: Mike Galbraith Date: Wed, 22 Mar 2006 00:07:33 -0800 Subject: [PATCH] sched: remove sleep_avg multiplier Remove the sleep_avg multiplier. This multiplier was necessary back when we had 10 seconds of dynamic range in sleep_avg, but now that we only have one second, it causes that one second to be compressed down to 100ms in some cases. This is particularly noticeable when compiling a kernel in a slow NFS mount, and I believe it to be a very likely candidate for other recently reported network related interactivity problems. In testing, I can detect no negative impact of this removal. Signed-off-by: Mike Galbraith Acked-by: Ingo Molnar Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/sched.c | 6 ------ 1 file changed, 6 deletions(-) (limited to 'kernel') diff --git a/kernel/sched.c b/kernel/sched.c index 4d46e90f59c3..6b6e0d70eb30 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -706,12 +706,6 @@ static int recalc_task_prio(task_t *p, unsigned long long now) p->sleep_avg = JIFFIES_TO_NS(MAX_SLEEP_AVG - DEF_TIMESLICE); } else { - /* - * The lower the sleep avg a task has the more - * rapidly it will rise with sleep time. 
- */ - sleep_time *= (MAX_BONUS - CURRENT_BONUS(p)) ? : 1; - /* * Tasks waking from uninterruptible sleep are * limited in their sleep_avg rise as they -- cgit v1.2.3 From 06f9d4f94a075285d25253edbf57f2cda07d4ff3 Mon Sep 17 00:00:00 2001 From: "Eric W. Biederman" Date: Wed, 22 Mar 2006 00:07:40 -0800 Subject: [PATCH] unshare: Error if passed unsupported flags A bare bones trivial patch to ensure we always get -EINVAL on the unsupported cases for sys_unshare. If this goes in before 2.6.16 it allows us to be forward compatible with future applications using sys_unshare. Signed-off-by: Eric W. Biederman Cc: JANAK DESAI Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/fork.c | 6 ++++++ 1 file changed, 6 insertions(+) (limited to 'kernel') diff --git a/kernel/fork.c b/kernel/fork.c index b373322ca497..9bd7b65ee418 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1534,6 +1534,12 @@ asmlinkage long sys_unshare(unsigned long unshare_flags) check_unshare_flags(&unshare_flags); + /* Return -EINVAL for all unsupported flags */ + err = -EINVAL; + if (unshare_flags & ~(CLONE_THREAD|CLONE_FS|CLONE_NEWNS|CLONE_SIGHAND| + CLONE_VM|CLONE_FILES|CLONE_SYSVSEM)) + goto bad_unshare_out; + if ((err = unshare_thread(unshare_flags))) goto bad_unshare_out; if ((err = unshare_fs(unshare_flags, &new_fs))) -- cgit v1.2.3 From 78eef01b0fae087c5fadbd85dd4fe2918c3a015f Mon Sep 17 00:00:00 2001 From: Andrew Morton Date: Wed, 22 Mar 2006 00:08:16 -0800 Subject: [PATCH] on_each_cpu(): disable local interrupts When on_each_cpu() runs the callback on other CPUs, it runs with local interrupts disabled. So we should run the function with local interrupts disabled on this CPU, too. And do the same for UP, so the callback is run in the same environment on both UP and SMP. (strictly it should do preempt_disable() too, but I think local_irq_disable is sufficiently equivalent). Also uninlines on_each_cpu().
softirq.c was the most appropriate file I could find, but it doesn't seem to justify creating a new file. Oh, and fix up that comment over (under?) x86's smp_call_function(). It drives me nuts. Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/softirq.c | 20 ++++++++++++++++++++ 1 file changed, 20 insertions(+) (limited to 'kernel') diff --git a/kernel/softirq.c b/kernel/softirq.c index ad3295cdded5..ec8fed42a86f 100644 --- a/kernel/softirq.c +++ b/kernel/softirq.c @@ -16,6 +16,7 @@ #include #include #include +#include #include /* @@ -495,3 +496,22 @@ __init int spawn_ksoftirqd(void) register_cpu_notifier(&cpu_nfb); return 0; } + +#ifdef CONFIG_SMP +/* + * Call a function on all processors + */ +int on_each_cpu(void (*func) (void *info), void *info, int retry, int wait) +{ + int ret = 0; + + preempt_disable(); + ret = smp_call_function(func, info, retry, wait); + local_irq_disable(); + func(info); + local_irq_enable(); + preempt_enable(); + return ret; +} +EXPORT_SYMBOL(on_each_cpu); +#endif -- cgit v1.2.3 From e9028b0ff2bad1954568604dc17725692c8524d6 Mon Sep 17 00:00:00 2001 From: Anton Blanchard Date: Thu, 23 Mar 2006 02:59:20 -0800 Subject: [PATCH] fix scheduler deadlock We have noticed lockups during boot when stress testing kexec on ppc64. Two cpus would deadlock in scheduler code trying to grab already taken spinlocks. The double_rq_lock code uses the address of the runqueue to order the taking of multiple locks. This address is a per cpu variable: if (rq1 < rq2) { spin_lock(&rq1->lock); spin_lock(&rq2->lock); } else { spin_lock(&rq2->lock); spin_lock(&rq1->lock); } On the other hand, the code in wake_sleeping_dependent uses the cpu id order to grab locks: for_each_cpu_mask(i, sibling_map) spin_lock(&cpu_rq(i)->lock); This means we rely on the address of per cpu data increasing as cpu ids increase. While this will be true for the generic percpu implementation it may not be true for arch specific implementations. 
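To make the ordering requirement concrete, here is a small user-space sketch of taking two locks in cpu id order rather than address order. It is an illustration only and rests on assumptions: `struct toy_rq`, the `toy_*` helpers, and the trace array are hypothetical, and POSIX mutexes stand in for the kernel's spinlocks. The trace exists purely so that the acquisition order can be inspected.

```c
#include <pthread.h>

/* Hypothetical stand-in for the kernel's struct runqueue. */
struct toy_rq {
	pthread_mutex_t lock;
	int cpu;		/* id of the CPU owning this runqueue */
};

/* Demonstration aid: cpu ids in the order their locks were taken. */
static int toy_trace[2];
static int toy_ntrace;

static void toy_lock(struct toy_rq *rq)
{
	pthread_mutex_lock(&rq->lock);
	if (toy_ntrace < 2)
		toy_trace[toy_ntrace++] = rq->cpu;
}

/*
 * Lock two runqueues, lowest cpu id first. Because the order depends only
 * on the cpu field, it agrees with any other code that walks CPUs in id
 * order, no matter where the per-cpu data happens to live in memory.
 */
static void toy_double_lock(struct toy_rq *a, struct toy_rq *b)
{
	if (a == b) {
		toy_lock(a);
		return;
	}
	if (a->cpu < b->cpu) {
		toy_lock(a);
		toy_lock(b);
	} else {
		toy_lock(b);
		toy_lock(a);
	}
}

static void toy_double_unlock(struct toy_rq *a, struct toy_rq *b)
{
	pthread_mutex_unlock(&a->lock);
	if (a != b)
		pthread_mutex_unlock(&b->lock);
}
```

If two `toy_rq` objects are laid out so that the higher cpu id sits at the lower address, `toy_double_lock()` still takes cpu 0 before cpu 1 whichever way round the arguments are passed, which is the property an address comparison cannot guarantee.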
One way to solve this is to always take runqueues in cpu id order. To do this we add a cpu variable to the runqueue and check it in the double runqueue locking functions. Signed-off-by: Anton Blanchard Acked-by: Ingo Molnar Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/sched.c | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/sched.c b/kernel/sched.c index 6b6e0d70eb30..a5bd60453eae 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -237,6 +237,7 @@ struct runqueue { task_t *migration_thread; struct list_head migration_queue; + int cpu; #endif #ifdef CONFIG_SCHEDSTATS @@ -1654,6 +1655,9 @@ unsigned long nr_iowait(void) /* * double_rq_lock - safely lock two runqueues * + * We must take them in cpu order to match code in + * dependent_sleeper and wake_dependent_sleeper. + * * Note this does not disable interrupts like task_rq_lock, * you need to do so manually before calling. */ @@ -1665,7 +1669,7 @@ static void double_rq_lock(runqueue_t *rq1, runqueue_t *rq2) spin_lock(&rq1->lock); __acquire(rq2->lock); /* Fake it out ;) */ } else { - if (rq1 < rq2) { + if (rq1->cpu < rq2->cpu) { spin_lock(&rq1->lock); spin_lock(&rq2->lock); } else { @@ -1701,7 +1705,7 @@ static void double_lock_balance(runqueue_t *this_rq, runqueue_t *busiest) __acquires(this_rq->lock) { if (unlikely(!spin_trylock(&busiest->lock))) { - if (busiest < this_rq) { + if (busiest->cpu < this_rq->cpu) { spin_unlock(&this_rq->lock); spin_lock(&busiest->lock); spin_lock(&this_rq->lock); @@ -6029,6 +6033,7 @@ void __init sched_init(void) rq->push_cpu = 0; rq->migration_thread = NULL; INIT_LIST_HEAD(&rq->migration_queue); + rq->cpu = i; #endif atomic_set(&rq->nr_iowait, 0); -- cgit v1.2.3 From 2b322ce210aec74ae0d02938d3a01e29fe079469 Mon Sep 17 00:00:00 2001 From: Andrew Morton Date: Thu, 23 Mar 2006 02:59:58 -0800 Subject: [PATCH] revert "swsusp: fix breakage with swap on lvm" This was a temporary thing for 2.6.16. 
Cc: "Rafael J. Wysocki" Cc: Pavel Machek Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/power/swsusp.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/power/swsusp.c b/kernel/power/swsusp.c index 2d9d08f72f76..4e90905f0e87 100644 --- a/kernel/power/swsusp.c +++ b/kernel/power/swsusp.c @@ -153,11 +153,13 @@ static int swsusp_swap_check(void) /* This is called before saving image */ { int i; + if (!swsusp_resume_device) + return -ENODEV; spin_lock(&swap_lock); for (i = 0; i < MAX_SWAPFILES; i++) { if (!(swap_info[i].flags & SWP_WRITEOK)) continue; - if (!swsusp_resume_device || is_resume_device(swap_info + i)) { + if (is_resume_device(swap_info + i)) { spin_unlock(&swap_lock); root_swap = i; return 0; -- cgit v1.2.3 From f577eb30afdc68233f25d4d82b04102129262365 Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Thu, 23 Mar 2006 02:59:59 -0800 Subject: [PATCH] swsusp: low level interface Introduce the low level interface that can be used for handling the snapshot of the system memory by the in-kernel swap-writing/reading code of swsusp and the userland interface code (to be introduced shortly). Also change the way in which swsusp records the allocated swap pages and, consequently, simplifies the in-kernel swap-writing/reading code (this is necessary for the userland interface too). To this end, it introduces two helper functions in mm/swapfile.c, so that the swsusp code does not refer directly to the swap internals. Signed-off-by: Rafael J. 
Wysocki Acked-by: Pavel Machek Signed-off-by: Adrian Bunk Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/power/disk.c | 12 +- kernel/power/power.h | 26 +- kernel/power/snapshot.c | 326 +++++++++++++++++++++- kernel/power/swsusp.c | 723 +++++++++++++++++------------------------------- 4 files changed, 599 insertions(+), 488 deletions(-) (limited to 'kernel') diff --git a/kernel/power/disk.c b/kernel/power/disk.c index 0b43847dc980..4eb464b71347 100644 --- a/kernel/power/disk.c +++ b/kernel/power/disk.c @@ -26,9 +26,9 @@ extern suspend_disk_method_t pm_disk_mode; extern int swsusp_shrink_memory(void); extern int swsusp_suspend(void); -extern int swsusp_write(struct pbe *pblist, unsigned int nr_pages); +extern int swsusp_write(void); extern int swsusp_check(void); -extern int swsusp_read(struct pbe **pblist_ptr); +extern int swsusp_read(void); extern void swsusp_close(void); extern int swsusp_resume(void); @@ -70,10 +70,6 @@ static void power_down(suspend_disk_method_t mode) while(1); } - -static int in_suspend __nosavedata = 0; - - static inline void platform_finish(void) { if (pm_disk_mode == PM_DISK_PLATFORM) { @@ -145,7 +141,7 @@ int pm_suspend_disk(void) if (in_suspend) { device_resume(); pr_debug("PM: writing image.\n"); - error = swsusp_write(pagedir_nosave, nr_copy_pages); + error = swsusp_write(); if (!error) power_down(pm_disk_mode); else { @@ -216,7 +212,7 @@ static int software_resume(void) pr_debug("PM: Reading swsusp image.\n"); - if ((error = swsusp_read(&pagedir_nosave))) { + if ((error = swsusp_read())) { swsusp_free(); goto Thaw; } diff --git a/kernel/power/power.h b/kernel/power/power.h index 388dba680841..ea7132ed029b 100644 --- a/kernel/power/power.h +++ b/kernel/power/power.h @@ -37,21 +37,31 @@ extern struct subsystem power_subsys; /* References to section boundaries */ extern const void __nosave_begin, __nosave_end; -extern unsigned int nr_copy_pages; extern struct pbe *pagedir_nosave; /* Preferred image size in bytes 
(default 500 MB) */ extern unsigned long image_size; +extern int in_suspend; + extern asmlinkage int swsusp_arch_suspend(void); extern asmlinkage int swsusp_arch_resume(void); extern unsigned int count_data_pages(void); -extern void free_pagedir(struct pbe *pblist); -extern void release_eaten_pages(void); -extern struct pbe *alloc_pagedir(unsigned nr_pages, gfp_t gfp_mask, int safe_needed); extern void swsusp_free(void); -extern int alloc_data_pages(struct pbe *pblist, gfp_t gfp_mask, int safe_needed); -extern unsigned int snapshot_nr_pages(void); -extern struct pbe *snapshot_pblist(void); -extern void snapshot_pblist_set(struct pbe *pblist); + +struct snapshot_handle { + loff_t offset; + unsigned int page; + unsigned int page_offset; + unsigned int prev; + struct pbe *pbe; + void *buffer; + unsigned int buf_offset; +}; + +#define data_of(handle) ((handle).buffer + (handle).buf_offset) + +extern int snapshot_read_next(struct snapshot_handle *handle, size_t count); +extern int snapshot_write_next(struct snapshot_handle *handle, size_t count); +int snapshot_image_loaded(struct snapshot_handle *handle); diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c index 8d5a5986d621..cc349437fb72 100644 --- a/kernel/power/snapshot.c +++ b/kernel/power/snapshot.c @@ -10,6 +10,7 @@ */ +#include #include #include #include @@ -34,7 +35,8 @@ #include "power.h" struct pbe *pagedir_nosave; -unsigned int nr_copy_pages; +static unsigned int nr_copy_pages; +static unsigned int nr_meta_pages; #ifdef CONFIG_HIGHMEM unsigned int count_highmem_pages(void) @@ -235,7 +237,7 @@ static void copy_data_pages(struct pbe *pblist) * free_pagedir - free pages allocated with alloc_pagedir() */ -void free_pagedir(struct pbe *pblist) +static void free_pagedir(struct pbe *pblist) { struct pbe *pbe; @@ -301,7 +303,7 @@ struct eaten_page { static struct eaten_page *eaten_pages = NULL; -void release_eaten_pages(void) +static void release_eaten_pages(void) { struct eaten_page *p, *q; @@ -376,7 
+378,6 @@ struct pbe *alloc_pagedir(unsigned int nr_pages, gfp_t gfp_mask, int safe_needed if (!nr_pages) return NULL; - pr_debug("alloc_pagedir(): nr_pages = %d\n", nr_pages); pblist = alloc_image_page(gfp_mask, safe_needed); /* FIXME: rewrite this ugly loop */ for (pbe = pblist, num = PBES_PER_PAGE; pbe && num < nr_pages; @@ -414,6 +415,9 @@ void swsusp_free(void) } } } + nr_copy_pages = 0; + nr_meta_pages = 0; + pagedir_nosave = NULL; } @@ -437,7 +441,7 @@ static int enough_free_mem(unsigned int nr_pages) (nr_pages + PBES_PER_PAGE - 1) / PBES_PER_PAGE); } -int alloc_data_pages(struct pbe *pblist, gfp_t gfp_mask, int safe_needed) +static int alloc_data_pages(struct pbe *pblist, gfp_t gfp_mask, int safe_needed) { struct pbe *p; @@ -504,7 +508,319 @@ asmlinkage int swsusp_save(void) */ nr_copy_pages = nr_pages; + nr_meta_pages = (nr_pages * sizeof(long) + PAGE_SIZE - 1) >> PAGE_SHIFT; printk("swsusp: critical section/: done (%d pages copied)\n", nr_pages); return 0; } + +static void init_header(struct swsusp_info *info) +{ + memset(info, 0, sizeof(struct swsusp_info)); + info->version_code = LINUX_VERSION_CODE; + info->num_physpages = num_physpages; + memcpy(&info->uts, &system_utsname, sizeof(system_utsname)); + info->cpus = num_online_cpus(); + info->image_pages = nr_copy_pages; + info->pages = nr_copy_pages + nr_meta_pages + 1; +} + +/** + * pack_orig_addresses - the .orig_address fields of the PBEs from the + * list starting at @pbe are stored in the array @buf[] (1 page) + */ + +static inline struct pbe *pack_orig_addresses(unsigned long *buf, struct pbe *pbe) +{ + int j; + + for (j = 0; j < PAGE_SIZE / sizeof(long) && pbe; j++) { + buf[j] = pbe->orig_address; + pbe = pbe->next; + } + if (!pbe) + for (; j < PAGE_SIZE / sizeof(long); j++) + buf[j] = 0; + return pbe; +} + +/** + * snapshot_read_next - used for reading the system memory snapshot. + * + * On the first call to it @handle should point to a zeroed + * snapshot_handle structure. 
The structure gets updated and a pointer + * to it should be passed to this function every next time. + * + * The @count parameter should contain the number of bytes the caller + * wants to read from the snapshot. It must not be zero. + * + * On success the function returns a positive number. Then, the caller + * is allowed to read up to the returned number of bytes from the memory + * location computed by the data_of() macro. The number returned + * may be smaller than @count, but this only happens if the read would + * cross a page boundary otherwise. + * + * The function returns 0 to indicate the end of data stream condition, + * and a negative number is returned on error. In such cases the + * structure pointed to by @handle is not updated and should not be used + * any more. + */ + +int snapshot_read_next(struct snapshot_handle *handle, size_t count) +{ + static unsigned long *buffer; + + if (handle->page > nr_meta_pages + nr_copy_pages) + return 0; + if (!buffer) { + /* This makes the buffer be freed by swsusp_free() */ + buffer = alloc_image_page(GFP_ATOMIC, 0); + if (!buffer) + return -ENOMEM; + } + if (!handle->offset) { + init_header((struct swsusp_info *)buffer); + handle->buffer = buffer; + handle->pbe = pagedir_nosave; + } + if (handle->prev < handle->page) { + if (handle->page <= nr_meta_pages) { + handle->pbe = pack_orig_addresses(buffer, handle->pbe); + if (!handle->pbe) + handle->pbe = pagedir_nosave; + } else { + handle->buffer = (void *)handle->pbe->address; + handle->pbe = handle->pbe->next; + } + handle->prev = handle->page; + } + handle->buf_offset = handle->page_offset; + if (handle->page_offset + count >= PAGE_SIZE) { + count = PAGE_SIZE - handle->page_offset; + handle->page_offset = 0; + handle->page++; + } else { + handle->page_offset += count; + } + handle->offset += count; + return count; +} + +/** + * mark_unsafe_pages - mark the pages that cannot be used for storing + * the image during resume, because they conflict with the pages that 
+ * had been used before suspend + */ + +static int mark_unsafe_pages(struct pbe *pblist) +{ + struct zone *zone; + unsigned long zone_pfn; + struct pbe *p; + + if (!pblist) /* a sanity check */ + return -EINVAL; + + /* Clear page flags */ + for_each_zone (zone) { + for (zone_pfn = 0; zone_pfn < zone->spanned_pages; ++zone_pfn) + if (pfn_valid(zone_pfn + zone->zone_start_pfn)) + ClearPageNosaveFree(pfn_to_page(zone_pfn + + zone->zone_start_pfn)); + } + + /* Mark orig addresses */ + for_each_pbe (p, pblist) { + if (virt_addr_valid(p->orig_address)) + SetPageNosaveFree(virt_to_page(p->orig_address)); + else + return -EFAULT; + } + + return 0; +} + +static void copy_page_backup_list(struct pbe *dst, struct pbe *src) +{ + /* We assume both lists contain the same number of elements */ + while (src) { + dst->orig_address = src->orig_address; + dst = dst->next; + src = src->next; + } +} + +static int check_header(struct swsusp_info *info) +{ + char *reason = NULL; + + if (info->version_code != LINUX_VERSION_CODE) + reason = "kernel version"; + if (info->num_physpages != num_physpages) + reason = "memory size"; + if (strcmp(info->uts.sysname,system_utsname.sysname)) + reason = "system type"; + if (strcmp(info->uts.release,system_utsname.release)) + reason = "kernel release"; + if (strcmp(info->uts.version,system_utsname.version)) + reason = "version"; + if (strcmp(info->uts.machine,system_utsname.machine)) + reason = "machine"; + if (reason) { + printk(KERN_ERR "swsusp: Resume mismatch: %s\n", reason); + return -EPERM; + } + return 0; +} + +/** + * load header - check the image header and copy data from it + */ + +static int load_header(struct snapshot_handle *handle, + struct swsusp_info *info) +{ + int error; + struct pbe *pblist; + + error = check_header(info); + if (!error) { + pblist = alloc_pagedir(info->image_pages, GFP_ATOMIC, 0); + if (!pblist) + return -ENOMEM; + pagedir_nosave = pblist; + handle->pbe = pblist; + nr_copy_pages = info->image_pages; + nr_meta_pages 
= info->pages - info->image_pages - 1; + } + return error; +} + +/** + * unpack_orig_addresses - copy the elements of @buf[] (1 page) to + * the PBEs in the list starting at @pbe + */ + +static inline struct pbe *unpack_orig_addresses(unsigned long *buf, + struct pbe *pbe) +{ + int j; + + for (j = 0; j < PAGE_SIZE / sizeof(long) && pbe; j++) { + pbe->orig_address = buf[j]; + pbe = pbe->next; + } + return pbe; +} + +/** + * create_image - use metadata contained in the PBE list + * pointed to by pagedir_nosave to mark the pages that will + * be overwritten in the process of restoring the system + * memory state from the image and allocate memory for + * the image avoiding these pages + */ + +static int create_image(struct snapshot_handle *handle) +{ + int error = 0; + struct pbe *p, *pblist; + + p = pagedir_nosave; + error = mark_unsafe_pages(p); + if (!error) { + pblist = alloc_pagedir(nr_copy_pages, GFP_ATOMIC, 1); + if (pblist) + copy_page_backup_list(pblist, p); + free_pagedir(p); + if (!pblist) + error = -ENOMEM; + } + if (!error) + error = alloc_data_pages(pblist, GFP_ATOMIC, 1); + if (!error) { + release_eaten_pages(); + pagedir_nosave = pblist; + } else { + pagedir_nosave = NULL; + handle->pbe = NULL; + nr_copy_pages = 0; + nr_meta_pages = 0; + } + return error; +} + +/** + * snapshot_write_next - used for writing the system memory snapshot. + * + * On the first call to it @handle should point to a zeroed + * snapshot_handle structure. The structure gets updated and a pointer + * to it should be passed to this function every next time. + * + * The @count parameter should contain the number of bytes the caller + * wants to write to the image. It must not be zero. + * + * On success the function returns a positive number. Then, the caller + * is allowed to write up to the returned number of bytes to the memory + * location computed by the data_of() macro. 
The number returned + * may be smaller than @count, but this only happens if the write would + * cross a page boundary otherwise. + * + * The function returns 0 to indicate the "end of file" condition, + * and a negative number is returned on error. In such cases the + * structure pointed to by @handle is not updated and should not be used + * any more. + */ + +int snapshot_write_next(struct snapshot_handle *handle, size_t count) +{ + static unsigned long *buffer; + int error = 0; + + if (handle->prev && handle->page > nr_meta_pages + nr_copy_pages) + return 0; + if (!buffer) { + /* This makes the buffer be freed by swsusp_free() */ + buffer = alloc_image_page(GFP_ATOMIC, 0); + if (!buffer) + return -ENOMEM; + } + if (!handle->offset) + handle->buffer = buffer; + if (handle->prev < handle->page) { + if (!handle->prev) { + error = load_header(handle, (struct swsusp_info *)buffer); + if (error) + return error; + } else if (handle->prev <= nr_meta_pages) { + handle->pbe = unpack_orig_addresses(buffer, handle->pbe); + if (!handle->pbe) { + error = create_image(handle); + if (error) + return error; + handle->pbe = pagedir_nosave; + handle->buffer = (void *)handle->pbe->address; + } + } else { + handle->pbe = handle->pbe->next; + handle->buffer = (void *)handle->pbe->address; + } + handle->prev = handle->page; + } + handle->buf_offset = handle->page_offset; + if (handle->page_offset + count >= PAGE_SIZE) { + count = PAGE_SIZE - handle->page_offset; + handle->page_offset = 0; + handle->page++; + } else { + handle->page_offset += count; + } + handle->offset += count; + return count; +} + +int snapshot_image_loaded(struct snapshot_handle *handle) +{ + return !(!handle->pbe || handle->pbe->next || !nr_copy_pages || + handle->page <= nr_meta_pages + nr_copy_pages); +} diff --git a/kernel/power/swsusp.c b/kernel/power/swsusp.c index 4e90905f0e87..457084f50010 100644 --- a/kernel/power/swsusp.c +++ b/kernel/power/swsusp.c @@ -77,6 +77,8 @@ */ unsigned long image_size = 500 * 
1024 * 1024; +int in_suspend __nosavedata = 0; + #ifdef CONFIG_HIGHMEM unsigned int count_highmem_pages(void); int save_highmem(void); @@ -98,8 +100,6 @@ static struct swsusp_header { char sig[10]; } __attribute__((packed, aligned(PAGE_SIZE))) swsusp_header; -static struct swsusp_info swsusp_info; - /* * Saving part... */ @@ -129,255 +129,261 @@ static int mark_swapfiles(swp_entry_t start) return error; } -/* - * Check whether the swap device is the specified resume - * device, irrespective of whether they are specified by - * identical names. - * - * (Thus, device inode aliasing is allowed. You can say /dev/hda4 - * instead of /dev/ide/host0/bus0/target0/lun0/part4 [if using devfs] - * and they'll be considered the same device. This is *necessary* for - * devfs, since the resume code can only recognize the form /dev/hda4, - * but the suspend code would see the long name.) +/** + * swsusp_swap_check - check if the resume device is a swap device + * and get its index (if so) */ -static inline int is_resume_device(const struct swap_info_struct *swap_info) -{ - struct file *file = swap_info->swap_file; - struct inode *inode = file->f_dentry->d_inode; - - return S_ISBLK(inode->i_mode) && - swsusp_resume_device == MKDEV(imajor(inode), iminor(inode)); -} static int swsusp_swap_check(void) /* This is called before saving image */ { - int i; - - if (!swsusp_resume_device) - return -ENODEV; - spin_lock(&swap_lock); - for (i = 0; i < MAX_SWAPFILES; i++) { - if (!(swap_info[i].flags & SWP_WRITEOK)) - continue; - if (is_resume_device(swap_info + i)) { - spin_unlock(&swap_lock); - root_swap = i; - return 0; - } + int res = swap_type_of(swsusp_resume_device); + + if (res >= 0) { + root_swap = res; + return 0; } - spin_unlock(&swap_lock); - return -ENODEV; + return res; } /** - * write_page - Write one page to a fresh swap location. - * @addr: Address we're writing. - * @loc: Place to store the entry we used. 
+ * The bitmap is used for tracing allocated swap pages * - * Allocate a new swap entry and 'sync' it. Note we discard -EIO - * errors. That is an artifact left over from swsusp. It did not - * check the return of rw_swap_page_sync() at all, since most pages - * written back to swap would return -EIO. - * This is a partial improvement, since we will at least return other - * errors, though we need to eventually fix the damn code. + * The entire bitmap consists of a number of bitmap_page + * structures linked with the help of the .next member. + * Thus each page can be allocated individually, so we only + * need to make 0-order memory allocations to create + * the bitmap. */ -static int write_page(unsigned long addr, swp_entry_t *loc) -{ - swp_entry_t entry; - int error = -ENOSPC; - entry = get_swap_page_of_type(root_swap); - if (swp_offset(entry)) { - error = rw_swap_page_sync(WRITE, entry, virt_to_page(addr)); - if (!error || error == -EIO) - *loc = entry; - } - return error; -} +#define BITMAP_PAGE_SIZE (PAGE_SIZE - sizeof(void *)) +#define BITMAP_PAGE_CHUNKS (BITMAP_PAGE_SIZE / sizeof(long)) +#define BITS_PER_CHUNK (sizeof(long) * 8) +#define BITMAP_PAGE_BITS (BITMAP_PAGE_CHUNKS * BITS_PER_CHUNK) + +struct bitmap_page { + unsigned long chunks[BITMAP_PAGE_CHUNKS]; + struct bitmap_page *next; +}; /** - * Swap map-handling functions + * The following functions are used for tracing the allocated + * swap pages, so that they can be freed in case of an error. * - * The swap map is a data structure used for keeping track of each page - * written to the swap. It consists of many swap_map_page structures - * that contain each an array of MAP_PAGE_SIZE swap entries. - * These structures are linked together with the help of either the - * .next (in memory) or the .next_swap (in swap) member. - * - * The swap map is created during suspend. At that time we need to keep - * it in memory, because we have to free all of the allocated swap - * entries if an error occurs. 
The memory needed is preallocated - * so that we know in advance if there's enough of it. - * - * The first swap_map_page structure is filled with the swap entries that - * correspond to the first MAP_PAGE_SIZE data pages written to swap and - * so on. After the all of the data pages have been written, the order - * of the swap_map_page structures in the map is reversed so that they - * can be read from swap in the original order. This causes the data - * pages to be loaded in exactly the same order in which they have been - * saved. - * - * During resume we only need to use one swap_map_page structure - * at a time, which means that we only need to use two memory pages for - * reading the image - one for reading the swap_map_page structures - * and the second for reading the data pages from swap. + * The functions operate on a linked bitmap structure defined + * above */ -#define MAP_PAGE_SIZE ((PAGE_SIZE - sizeof(swp_entry_t) - sizeof(void *)) \ - / sizeof(swp_entry_t)) - -struct swap_map_page { - swp_entry_t entries[MAP_PAGE_SIZE]; - swp_entry_t next_swap; - struct swap_map_page *next; -}; - -static inline void free_swap_map(struct swap_map_page *swap_map) +static void free_bitmap(struct bitmap_page *bitmap) { - struct swap_map_page *swp; + struct bitmap_page *bp; - while (swap_map) { - swp = swap_map->next; - free_page((unsigned long)swap_map); - swap_map = swp; + while (bitmap) { + bp = bitmap->next; + free_page((unsigned long)bitmap); + bitmap = bp; } } -static struct swap_map_page *alloc_swap_map(unsigned int nr_pages) +static struct bitmap_page *alloc_bitmap(unsigned int nr_bits) { - struct swap_map_page *swap_map, *swp; - unsigned n = 0; + struct bitmap_page *bitmap, *bp; + unsigned int n; - if (!nr_pages) + if (!nr_bits) return NULL; - pr_debug("alloc_swap_map(): nr_pages = %d\n", nr_pages); - swap_map = (struct swap_map_page *)get_zeroed_page(GFP_ATOMIC); - swp = swap_map; - for (n = MAP_PAGE_SIZE; n < nr_pages; n += MAP_PAGE_SIZE) { - swp->next = 
(struct swap_map_page *)get_zeroed_page(GFP_ATOMIC); - swp = swp->next; - if (!swp) { - free_swap_map(swap_map); + bitmap = (struct bitmap_page *)get_zeroed_page(GFP_KERNEL); + bp = bitmap; + for (n = BITMAP_PAGE_BITS; n < nr_bits; n += BITMAP_PAGE_BITS) { + bp->next = (struct bitmap_page *)get_zeroed_page(GFP_KERNEL); + bp = bp->next; + if (!bp) { + free_bitmap(bitmap); return NULL; } } - return swap_map; + return bitmap; } -/** - * reverse_swap_map - reverse the order of pages in the swap map - * @swap_map - */ - -static inline struct swap_map_page *reverse_swap_map(struct swap_map_page *swap_map) +static int bitmap_set(struct bitmap_page *bitmap, unsigned long bit) { - struct swap_map_page *prev, *next; - - prev = NULL; - while (swap_map) { - next = swap_map->next; - swap_map->next = prev; - prev = swap_map; - swap_map = next; + unsigned int n; + + n = BITMAP_PAGE_BITS; + while (bitmap && n <= bit) { + n += BITMAP_PAGE_BITS; + bitmap = bitmap->next; } - return prev; + if (!bitmap) + return -EINVAL; + n -= BITMAP_PAGE_BITS; + bit -= n; + n = 0; + while (bit >= BITS_PER_CHUNK) { + bit -= BITS_PER_CHUNK; + n++; + } + bitmap->chunks[n] |= (1UL << bit); + return 0; } -/** - * free_swap_map_entries - free the swap entries allocated to store - * the swap map @swap_map (this is only called in case of an error) - */ -static inline void free_swap_map_entries(struct swap_map_page *swap_map) +static unsigned long alloc_swap_page(int swap, struct bitmap_page *bitmap) { - while (swap_map) { - if (swap_map->next_swap.val) - swap_free(swap_map->next_swap); - swap_map = swap_map->next; + unsigned long offset; + + offset = swp_offset(get_swap_page_of_type(swap)); + if (offset) { + if (bitmap_set(bitmap, offset)) { + swap_free(swp_entry(swap, offset)); + offset = 0; + } } + return offset; } -/** - * save_swap_map - save the swap map used for tracing the data pages - * stored in the swap - */ - -static int save_swap_map(struct swap_map_page *swap_map, swp_entry_t *start) +static 
void free_all_swap_pages(int swap, struct bitmap_page *bitmap) { - swp_entry_t entry = (swp_entry_t){0}; - int error; + unsigned int bit, n; + unsigned long test; - while (swap_map) { - swap_map->next_swap = entry; - if ((error = write_page((unsigned long)swap_map, &entry))) - return error; - swap_map = swap_map->next; + bit = 0; + while (bitmap) { + for (n = 0; n < BITMAP_PAGE_CHUNKS; n++) + for (test = 1UL; test; test <<= 1) { + if (bitmap->chunks[n] & test) + swap_free(swp_entry(swap, bit)); + bit++; + } + bitmap = bitmap->next; } - *start = entry; - return 0; } /** - * free_image_entries - free the swap entries allocated to store - * the image data pages (this is only called in case of an error) + * write_page - Write one page to given swap location. + * @buf: Address we're writing. + * @offset: Offset of the swap page we're writing to. */ -static inline void free_image_entries(struct swap_map_page *swp) +static int write_page(void *buf, unsigned long offset) { - unsigned k; + swp_entry_t entry; + int error = -ENOSPC; - while (swp) { - for (k = 0; k < MAP_PAGE_SIZE; k++) - if (swp->entries[k].val) - swap_free(swp->entries[k]); - swp = swp->next; + if (offset) { + entry = swp_entry(root_swap, offset); + error = rw_swap_page_sync(WRITE, entry, virt_to_page(buf)); } + return error; } +/* + * The swap map is a data structure used for keeping track of each page + * written to a swap partition. It consists of many swap_map_page + * structures that contain each an array of MAP_PAGE_SIZE swap entries. + * These structures are stored on the swap and linked together with the + * help of the .next_swap member. + * + * The swap map is created during suspend. The swap map pages are + * allocated and populated one at a time, so we only need one memory + * page to set up the entire structure. + * + * During resume we also only need to use one swap_map_page structure + * at a time. 
+ */ + +#define MAP_PAGE_ENTRIES (PAGE_SIZE / sizeof(long) - 1) + +struct swap_map_page { + unsigned long entries[MAP_PAGE_ENTRIES]; + unsigned long next_swap; +}; + /** - * The swap_map_handle structure is used for handling the swap map in + * The swap_map_handle structure is used for handling swap in * a file-alike way */ struct swap_map_handle { struct swap_map_page *cur; + unsigned long cur_swap; + struct bitmap_page *bitmap; unsigned int k; }; -static inline void init_swap_map_handle(struct swap_map_handle *handle, - struct swap_map_page *map) +static void release_swap_writer(struct swap_map_handle *handle) { - handle->cur = map; + if (handle->cur) + free_page((unsigned long)handle->cur); + handle->cur = NULL; + if (handle->bitmap) + free_bitmap(handle->bitmap); + handle->bitmap = NULL; +} + +static int get_swap_writer(struct swap_map_handle *handle) +{ + handle->cur = (struct swap_map_page *)get_zeroed_page(GFP_KERNEL); + if (!handle->cur) + return -ENOMEM; + handle->bitmap = alloc_bitmap(count_swap_pages(root_swap, 0)); + if (!handle->bitmap) { + release_swap_writer(handle); + return -ENOMEM; + } + handle->cur_swap = alloc_swap_page(root_swap, handle->bitmap); + if (!handle->cur_swap) { + release_swap_writer(handle); + return -ENOSPC; + } handle->k = 0; + return 0; } -static inline int swap_map_write_page(struct swap_map_handle *handle, - unsigned long addr) +static int swap_write_page(struct swap_map_handle *handle, void *buf) { int error; + unsigned long offset; - error = write_page(addr, handle->cur->entries + handle->k); + if (!handle->cur) + return -EINVAL; + offset = alloc_swap_page(root_swap, handle->bitmap); + error = write_page(buf, offset); if (error) return error; - if (++handle->k >= MAP_PAGE_SIZE) { - handle->cur = handle->cur->next; + handle->cur->entries[handle->k++] = offset; + if (handle->k >= MAP_PAGE_ENTRIES) { + offset = alloc_swap_page(root_swap, handle->bitmap); + if (!offset) + return -ENOSPC; + handle->cur->next_swap = offset; + error 
= write_page(handle->cur, handle->cur_swap); + if (error) + return error; + memset(handle->cur, 0, PAGE_SIZE); + handle->cur_swap = offset; handle->k = 0; } return 0; } +static int flush_swap_writer(struct swap_map_handle *handle) +{ + if (handle->cur && handle->cur_swap) + return write_page(handle->cur, handle->cur_swap); + else + return -EINVAL; +} + /** - * save_image_data - save the data pages pointed to by the PBEs - * from the list @pblist using the swap map handle @handle - * (assume there are @nr_pages data pages to save) + * save_image - save the suspend image data */ -static int save_image_data(struct pbe *pblist, - struct swap_map_handle *handle, - unsigned int nr_pages) +static int save_image(struct swap_map_handle *handle, + struct snapshot_handle *snapshot, + unsigned int nr_pages) { unsigned int m; - struct pbe *p; + int ret; int error = 0; printk("Saving image data pages (%u pages) ... ", nr_pages); @@ -385,98 +391,22 @@ static int save_image_data(struct pbe *pblist, if (!m) m = 1; nr_pages = 0; - for_each_pbe (p, pblist) { - error = swap_map_write_page(handle, p->address); - if (error) - break; - if (!(nr_pages % m)) - printk("\b\b\b\b%3d%%", nr_pages / m); - nr_pages++; - } + do { + ret = snapshot_read_next(snapshot, PAGE_SIZE); + if (ret > 0) { + error = swap_write_page(handle, data_of(*snapshot)); + if (error) + break; + if (!(nr_pages % m)) + printk("\b\b\b\b%3d%%", nr_pages / m); + nr_pages++; + } + } while (ret > 0); if (!error) printk("\b\b\b\bdone\n"); return error; } -static void dump_info(void) -{ - pr_debug(" swsusp: Version: %u\n",swsusp_info.version_code); - pr_debug(" swsusp: Num Pages: %ld\n",swsusp_info.num_physpages); - pr_debug(" swsusp: UTS Sys: %s\n",swsusp_info.uts.sysname); - pr_debug(" swsusp: UTS Node: %s\n",swsusp_info.uts.nodename); - pr_debug(" swsusp: UTS Release: %s\n",swsusp_info.uts.release); - pr_debug(" swsusp: UTS Version: %s\n",swsusp_info.uts.version); - pr_debug(" swsusp: UTS Machine: 
%s\n",swsusp_info.uts.machine); - pr_debug(" swsusp: UTS Domain: %s\n",swsusp_info.uts.domainname); - pr_debug(" swsusp: CPUs: %d\n",swsusp_info.cpus); - pr_debug(" swsusp: Image: %ld Pages\n",swsusp_info.image_pages); - pr_debug(" swsusp: Total: %ld Pages\n", swsusp_info.pages); -} - -static void init_header(unsigned int nr_pages) -{ - memset(&swsusp_info, 0, sizeof(swsusp_info)); - swsusp_info.version_code = LINUX_VERSION_CODE; - swsusp_info.num_physpages = num_physpages; - memcpy(&swsusp_info.uts, &system_utsname, sizeof(system_utsname)); - - swsusp_info.cpus = num_online_cpus(); - swsusp_info.image_pages = nr_pages; - swsusp_info.pages = nr_pages + - ((nr_pages * sizeof(long) + PAGE_SIZE - 1) >> PAGE_SHIFT) + 1; -} - -/** - * pack_orig_addresses - the .orig_address fields of the PBEs from the - * list starting at @pbe are stored in the array @buf[] (1 page) - */ - -static inline struct pbe *pack_orig_addresses(unsigned long *buf, - struct pbe *pbe) -{ - int j; - - for (j = 0; j < PAGE_SIZE / sizeof(long) && pbe; j++) { - buf[j] = pbe->orig_address; - pbe = pbe->next; - } - if (!pbe) - for (; j < PAGE_SIZE / sizeof(long); j++) - buf[j] = 0; - return pbe; -} - -/** - * save_image_metadata - save the .orig_address fields of the PBEs - * from the list @pblist using the swap map handle @handle - */ - -static int save_image_metadata(struct pbe *pblist, - struct swap_map_handle *handle) -{ - unsigned long *buf; - unsigned int n = 0; - struct pbe *p; - int error = 0; - - printk("Saving image metadata ... "); - buf = (unsigned long *)get_zeroed_page(GFP_ATOMIC); - if (!buf) - return -ENOMEM; - p = pblist; - while (p) { - p = pack_orig_addresses(buf, p); - error = swap_map_write_page(handle, (unsigned long)buf); - if (error) - break; - n++; - } - free_page((unsigned long)buf); - if (!error) - printk("done (%u pages saved)\n", n); - return error; -} - /** * enough_swap - Make sure we have enough swap to save the image. 
* @@ -486,8 +416,7 @@ static int save_image_metadata(struct pbe *pblist, static int enough_swap(unsigned int nr_pages) { - unsigned int free_swap = swap_info[root_swap].pages - - swap_info[root_swap].inuse_pages; + unsigned int free_swap = count_swap_pages(root_swap, 1); pr_debug("swsusp: free swap pages: %u\n", free_swap); return free_swap > (nr_pages + PAGES_FOR_IO + @@ -503,57 +432,44 @@ static int enough_swap(unsigned int nr_pages) * correctly, we'll mark system clean, anyway.) */ -int swsusp_write(struct pbe *pblist, unsigned int nr_pages) +int swsusp_write(void) { - struct swap_map_page *swap_map; struct swap_map_handle handle; - swp_entry_t start; + struct snapshot_handle snapshot; + struct swsusp_info *header; + unsigned long start; int error; if ((error = swsusp_swap_check())) { printk(KERN_ERR "swsusp: Cannot find swap device, try swapon -a.\n"); return error; } - if (!enough_swap(nr_pages)) { + memset(&snapshot, 0, sizeof(struct snapshot_handle)); + error = snapshot_read_next(&snapshot, PAGE_SIZE); + if (error < PAGE_SIZE) + return error < 0 ? 
error : -EFAULT; + header = (struct swsusp_info *)data_of(snapshot); + if (!enough_swap(header->pages)) { printk(KERN_ERR "swsusp: Not enough free swap\n"); return -ENOSPC; } - - init_header(nr_pages); - swap_map = alloc_swap_map(swsusp_info.pages); - if (!swap_map) - return -ENOMEM; - init_swap_map_handle(&handle, swap_map); - - error = swap_map_write_page(&handle, (unsigned long)&swsusp_info); - if (!error) - error = save_image_metadata(pblist, &handle); + error = get_swap_writer(&handle); + if (!error) { + start = handle.cur_swap; + error = swap_write_page(&handle, header); + } if (!error) - error = save_image_data(pblist, &handle, nr_pages); - if (error) - goto Free_image_entries; - - swap_map = reverse_swap_map(swap_map); - error = save_swap_map(swap_map, &start); - if (error) - goto Free_map_entries; - - dump_info(); - printk( "S" ); - error = mark_swapfiles(start); - printk( "|\n" ); + error = save_image(&handle, &snapshot, header->pages - 1); + if (!error) { + flush_swap_writer(&handle); + printk("S"); + error = mark_swapfiles(swp_entry(root_swap, start)); + printk("|\n"); + } if (error) - goto Free_map_entries; - -Free_swap_map: - free_swap_map(swap_map); + free_all_swap_pages(root_swap, handle.bitmap); + release_swap_writer(&handle); return error; - -Free_map_entries: - free_swap_map_entries(swap_map); -Free_image_entries: - free_image_entries(swap_map); - goto Free_swap_map; } /** @@ -663,45 +579,6 @@ int swsusp_resume(void) return error; } -/** - * mark_unsafe_pages - mark the pages that cannot be used for storing - * the image during resume, because they conflict with the pages that - * had been used before suspend - */ - -static void mark_unsafe_pages(struct pbe *pblist) -{ - struct zone *zone; - unsigned long zone_pfn; - struct pbe *p; - - if (!pblist) /* a sanity check */ - return; - - /* Clear page flags */ - for_each_zone (zone) { - for (zone_pfn = 0; zone_pfn < zone->spanned_pages; ++zone_pfn) - if (pfn_valid(zone_pfn + zone->zone_start_pfn)) - 
ClearPageNosaveFree(pfn_to_page(zone_pfn + - zone->zone_start_pfn)); - } - - /* Mark orig addresses */ - for_each_pbe (p, pblist) - SetPageNosaveFree(virt_to_page(p->orig_address)); - -} - -static void copy_page_backup_list(struct pbe *dst, struct pbe *src) -{ - /* We assume both lists contain the same number of elements */ - while (src) { - dst->orig_address = src->orig_address; - dst = dst->next; - src = src->next; - } -} - /* * Using bio to read from swap. * This code requires a bit more work than just using buffer heads @@ -779,14 +656,14 @@ static int bio_write_page(pgoff_t page_off, void *page) * in a file-alike way */ -static inline void release_swap_map_reader(struct swap_map_handle *handle) +static void release_swap_reader(struct swap_map_handle *handle) { if (handle->cur) free_page((unsigned long)handle->cur); handle->cur = NULL; } -static inline int get_swap_map_reader(struct swap_map_handle *handle, +static int get_swap_reader(struct swap_map_handle *handle, swp_entry_t start) { int error; @@ -798,149 +675,80 @@ static inline int get_swap_map_reader(struct swap_map_handle *handle, return -ENOMEM; error = bio_read_page(swp_offset(start), handle->cur); if (error) { - release_swap_map_reader(handle); + release_swap_reader(handle); return error; } handle->k = 0; return 0; } -static inline int swap_map_read_page(struct swap_map_handle *handle, void *buf) +static int swap_read_page(struct swap_map_handle *handle, void *buf) { unsigned long offset; int error; if (!handle->cur) return -EINVAL; - offset = swp_offset(handle->cur->entries[handle->k]); + offset = handle->cur->entries[handle->k]; if (!offset) - return -EINVAL; + return -EFAULT; error = bio_read_page(offset, buf); if (error) return error; - if (++handle->k >= MAP_PAGE_SIZE) { + if (++handle->k >= MAP_PAGE_ENTRIES) { handle->k = 0; - offset = swp_offset(handle->cur->next_swap); + offset = handle->cur->next_swap; if (!offset) - release_swap_map_reader(handle); + release_swap_reader(handle); else error 
= bio_read_page(offset, handle->cur); } return error; } -static int check_header(void) -{ - char *reason = NULL; - - dump_info(); - if (swsusp_info.version_code != LINUX_VERSION_CODE) - reason = "kernel version"; - if (swsusp_info.num_physpages != num_physpages) - reason = "memory size"; - if (strcmp(swsusp_info.uts.sysname,system_utsname.sysname)) - reason = "system type"; - if (strcmp(swsusp_info.uts.release,system_utsname.release)) - reason = "kernel release"; - if (strcmp(swsusp_info.uts.version,system_utsname.version)) - reason = "version"; - if (strcmp(swsusp_info.uts.machine,system_utsname.machine)) - reason = "machine"; - if (reason) { - printk(KERN_ERR "swsusp: Resume mismatch: %s\n", reason); - return -EPERM; - } - return 0; -} - /** - * load_image_data - load the image data using the swap map handle - * @handle and store them using the page backup list @pblist + * load_image - load the image using the swap map handle + * @handle and the snapshot handle @snapshot * (assume there are @nr_pages pages to load) */ -static int load_image_data(struct pbe *pblist, - struct swap_map_handle *handle, - unsigned int nr_pages) +static int load_image(struct swap_map_handle *handle, + struct snapshot_handle *snapshot, + unsigned int nr_pages) { - int error; unsigned int m; - struct pbe *p; + int ret; + int error = 0; - if (!pblist) - return -EINVAL; printk("Loading image data pages (%u pages) ... 
", nr_pages); m = nr_pages / 100; if (!m) m = 1; nr_pages = 0; - p = pblist; - while (p) { - error = swap_map_read_page(handle, (void *)p->address); - if (error) - break; - p = p->next; - if (!(nr_pages % m)) - printk("\b\b\b\b%3d%%", nr_pages / m); - nr_pages++; - } + do { + ret = snapshot_write_next(snapshot, PAGE_SIZE); + if (ret > 0) { + error = swap_read_page(handle, data_of(*snapshot)); + if (error) + break; + if (!(nr_pages % m)) + printk("\b\b\b\b%3d%%", nr_pages / m); + nr_pages++; + } + } while (ret > 0); if (!error) printk("\b\b\b\bdone\n"); + if (!snapshot_image_loaded(snapshot)) + error = -ENODATA; return error; } -/** - * unpack_orig_addresses - copy the elements of @buf[] (1 page) to - * the PBEs in the list starting at @pbe - */ - -static inline struct pbe *unpack_orig_addresses(unsigned long *buf, - struct pbe *pbe) -{ - int j; - - for (j = 0; j < PAGE_SIZE / sizeof(long) && pbe; j++) { - pbe->orig_address = buf[j]; - pbe = pbe->next; - } - return pbe; -} - -/** - * load_image_metadata - load the image metadata using the swap map - * handle @handle and put them into the PBEs in the list @pblist - */ - -static int load_image_metadata(struct pbe *pblist, struct swap_map_handle *handle) -{ - struct pbe *p; - unsigned long *buf; - unsigned int n = 0; - int error = 0; - - printk("Loading image metadata ... 
"); - buf = (unsigned long *)get_zeroed_page(GFP_ATOMIC); - if (!buf) - return -ENOMEM; - p = pblist; - while (p) { - error = swap_map_read_page(handle, buf); - if (error) - break; - p = unpack_orig_addresses(buf, p); - n++; - } - free_page((unsigned long)buf); - if (!error) - printk("done (%u pages loaded)\n", n); - return error; -} - -int swsusp_read(struct pbe **pblist_ptr) +int swsusp_read(void) { int error; - struct pbe *p, *pblist; struct swap_map_handle handle; + struct snapshot_handle snapshot; + struct swsusp_info *header; unsigned int nr_pages; if (IS_ERR(resume_bdev)) { @@ -948,38 +756,19 @@ int swsusp_read(struct pbe **pblist_ptr) return PTR_ERR(resume_bdev); } - error = get_swap_map_reader(&handle, swsusp_header.image); + memset(&snapshot, 0, sizeof(struct snapshot_handle)); + error = snapshot_write_next(&snapshot, PAGE_SIZE); + if (error < PAGE_SIZE) + return error < 0 ? error : -EFAULT; + header = (struct swsusp_info *)data_of(snapshot); + error = get_swap_reader(&handle, swsusp_header.image); if (!error) - error = swap_map_read_page(&handle, &swsusp_info); - if (!error) - error = check_header(); - if (error) - return error; - nr_pages = swsusp_info.image_pages; - p = alloc_pagedir(nr_pages, GFP_ATOMIC, 0); - if (!p) - return -ENOMEM; - error = load_image_metadata(p, &handle); + error = swap_read_page(&handle, header); if (!error) { - mark_unsafe_pages(p); - pblist = alloc_pagedir(nr_pages, GFP_ATOMIC, 1); - if (pblist) - copy_page_backup_list(pblist, p); - free_pagedir(p); - if (!pblist) - error = -ENOMEM; - - /* Allocate memory for the image and read the data from swap */ - if (!error) - error = alloc_data_pages(pblist, GFP_ATOMIC, 1); - if (!error) { - release_eaten_pages(); - error = load_image_data(pblist, &handle, nr_pages); - } - if (!error) - *pblist_ptr = pblist; + nr_pages = header->image_pages; + error = load_image(&handle, &snapshot, nr_pages); } - release_swap_map_reader(&handle); + release_swap_reader(&handle); blkdev_put(resume_bdev); 
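Note on the linked bitmap introduced by the patch above: the swap-page bitmap is a chain of page-sized bitmap_page structures, so only order-0 allocations are needed, and setting a bit means walking the chain to the right page first. The following is a minimal userspace sketch of that idea, not the kernel code itself. SKETCH_PAGE_SIZE, calloc()/free() (standing in for get_zeroed_page()/free_page()) and bitmap_count() are assumptions made for illustration; the structure layout and macro names follow the patch.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Assumption: stand-in for the kernel's PAGE_SIZE. */
#define SKETCH_PAGE_SIZE   4096UL

#define BITMAP_PAGE_SIZE   (SKETCH_PAGE_SIZE - sizeof(void *))
#define BITMAP_PAGE_CHUNKS (BITMAP_PAGE_SIZE / sizeof(long))
#define BITS_PER_CHUNK     (sizeof(long) * CHAR_BIT)
#define BITMAP_PAGE_BITS   (BITMAP_PAGE_CHUNKS * BITS_PER_CHUNK)

/* One page worth of bits plus a link to the next page. */
struct bitmap_page {
	unsigned long chunks[BITMAP_PAGE_CHUNKS];
	struct bitmap_page *next;
};

void free_bitmap(struct bitmap_page *bitmap)
{
	while (bitmap) {
		struct bitmap_page *bp = bitmap->next;

		free(bitmap);
		bitmap = bp;
	}
}

/* Allocate enough zeroed pages to hold nr_bits bits. */
struct bitmap_page *alloc_bitmap(unsigned long nr_bits)
{
	struct bitmap_page *head, *bp;
	unsigned long n;

	if (!nr_bits)
		return NULL;
	head = calloc(1, sizeof(*head));
	if (!head)
		return NULL;
	bp = head;
	for (n = BITMAP_PAGE_BITS; n < nr_bits; n += BITMAP_PAGE_BITS) {
		bp->next = calloc(1, sizeof(*bp));
		if (!bp->next) {
			free_bitmap(head);
			return NULL;
		}
		bp = bp->next;
	}
	return head;
}

/* Walk to the page that holds the bit, then set it in its chunk. */
int bitmap_set(struct bitmap_page *bitmap, unsigned long bit)
{
	while (bitmap && bit >= BITMAP_PAGE_BITS) {
		bit -= BITMAP_PAGE_BITS;
		bitmap = bitmap->next;
	}
	if (!bitmap)
		return -EINVAL;	/* bit lies beyond the allocated pages */
	bitmap->chunks[bit / BITS_PER_CHUNK] |= 1UL << (bit % BITS_PER_CHUNK);
	return 0;
}

/*
 * Hypothetical helper: visit every bit in the same order as
 * free_all_swap_pages() does, but count set bits instead of
 * freeing swap entries.
 */
unsigned long bitmap_count(const struct bitmap_page *bitmap)
{
	unsigned long count = 0, test;
	size_t n;

	for (; bitmap; bitmap = bitmap->next)
		for (n = 0; n < BITMAP_PAGE_CHUNKS; n++)
			for (test = 1UL; test; test <<= 1)
				if (bitmap->chunks[n] & test)
					count++;
	return count;
}
```

In the patch, alloc_swap_page() records each allocated swap offset with bitmap_set(), and free_all_swap_pages() performs the same full walk as bitmap_count() here to release every recorded entry when writing the image fails.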
-- cgit v1.2.3 From 61159a314bca6408320c3173c1282c64f5cdaa76 Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Thu, 23 Mar 2006 03:00:00 -0800 Subject: [PATCH] swsusp: separate swap-writing/reading code Move the swap-writing/reading code of swsusp to a separate file. Signed-off-by: Rafael J. Wysocki Acked-by: Pavel Machek Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/power/Makefile | 2 +- kernel/power/power.h | 31 ++- kernel/power/swap.c | 544 +++++++++++++++++++++++++++++++++++++++++++++++ kernel/power/swsusp.c | 568 +------------------------------------------------- 4 files changed, 581 insertions(+), 564 deletions(-) create mode 100644 kernel/power/swap.c (limited to 'kernel') diff --git a/kernel/power/Makefile b/kernel/power/Makefile index 04be7d0d96a7..bb91a0615303 100644 --- a/kernel/power/Makefile +++ b/kernel/power/Makefile @@ -5,7 +5,7 @@ endif obj-y := main.o process.o console.o obj-$(CONFIG_PM_LEGACY) += pm.o -obj-$(CONFIG_SOFTWARE_SUSPEND) += swsusp.o disk.o snapshot.o +obj-$(CONFIG_SOFTWARE_SUSPEND) += swsusp.o disk.o snapshot.o swap.o obj-$(CONFIG_SUSPEND_SMP) += smp.o diff --git a/kernel/power/power.h b/kernel/power/power.h index ea7132ed029b..089c84bed895 100644 --- a/kernel/power/power.h +++ b/kernel/power/power.h @@ -41,8 +41,8 @@ extern struct pbe *pagedir_nosave; /* Preferred image size in bytes (default 500 MB) */ extern unsigned long image_size; - extern int in_suspend; +extern dev_t swsusp_resume_device; extern asmlinkage int swsusp_arch_suspend(void); extern asmlinkage int swsusp_arch_resume(void); @@ -65,3 +65,32 @@ struct snapshot_handle { extern int snapshot_read_next(struct snapshot_handle *handle, size_t count); extern int snapshot_write_next(struct snapshot_handle *handle, size_t count); int snapshot_image_loaded(struct snapshot_handle *handle); + +/** + * The bitmap is used for tracing allocated swap pages + * + * The entire bitmap consists of a number of bitmap_page + * structures linked with the help 
of the .next member. + * Thus each page can be allocated individually, so we only + * need to make 0-order memory allocations to create + * the bitmap. + */ + +#define BITMAP_PAGE_SIZE (PAGE_SIZE - sizeof(void *)) +#define BITMAP_PAGE_CHUNKS (BITMAP_PAGE_SIZE / sizeof(long)) +#define BITS_PER_CHUNK (sizeof(long) * 8) +#define BITMAP_PAGE_BITS (BITMAP_PAGE_CHUNKS * BITS_PER_CHUNK) + +struct bitmap_page { + unsigned long chunks[BITMAP_PAGE_CHUNKS]; + struct bitmap_page *next; +}; + +extern void free_bitmap(struct bitmap_page *bitmap); +extern struct bitmap_page *alloc_bitmap(unsigned int nr_bits); +extern unsigned long alloc_swap_page(int swap, struct bitmap_page *bitmap); +extern void free_all_swap_pages(int swap, struct bitmap_page *bitmap); + +extern int swsusp_shrink_memory(void); +extern int swsusp_suspend(void); +extern int swsusp_resume(void); diff --git a/kernel/power/swap.c b/kernel/power/swap.c new file mode 100644 index 000000000000..9177f3f73a6c --- /dev/null +++ b/kernel/power/swap.c @@ -0,0 +1,544 @@ +/* + * linux/kernel/power/swap.c + * + * This file provides functions for reading the suspend image from + * and writing it to a swap partition. + * + * Copyright (C) 1998,2001-2005 Pavel Machek + * Copyright (C) 2006 Rafael J. Wysocki + * + * This file is released under the GPLv2. + * + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "power.h" + +extern char resume_file[]; + +#define SWSUSP_SIG "S1SUSPEND" + +static struct swsusp_header { + char reserved[PAGE_SIZE - 20 - sizeof(swp_entry_t)]; + swp_entry_t image; + char orig_sig[10]; + char sig[10]; +} __attribute__((packed, aligned(PAGE_SIZE))) swsusp_header; + +/* + * Saving part... 
+ */ + +static unsigned short root_swap = 0xffff; + +static int mark_swapfiles(swp_entry_t start) +{ + int error; + + rw_swap_page_sync(READ, + swp_entry(root_swap, 0), + virt_to_page((unsigned long)&swsusp_header)); + if (!memcmp("SWAP-SPACE",swsusp_header.sig, 10) || + !memcmp("SWAPSPACE2",swsusp_header.sig, 10)) { + memcpy(swsusp_header.orig_sig,swsusp_header.sig, 10); + memcpy(swsusp_header.sig,SWSUSP_SIG, 10); + swsusp_header.image = start; + error = rw_swap_page_sync(WRITE, + swp_entry(root_swap, 0), + virt_to_page((unsigned long) + &swsusp_header)); + } else { + pr_debug("swsusp: Partition is not swap space.\n"); + error = -ENODEV; + } + return error; +} + +/** + * swsusp_swap_check - check if the resume device is a swap device + * and get its index (if so) + */ + +static int swsusp_swap_check(void) /* This is called before saving image */ +{ + int res = swap_type_of(swsusp_resume_device); + + if (res >= 0) { + root_swap = res; + return 0; + } + return res; +} + +/** + * write_page - Write one page to given swap location. + * @buf: Address we're writing. + * @offset: Offset of the swap page we're writing to. + */ + +static int write_page(void *buf, unsigned long offset) +{ + swp_entry_t entry; + int error = -ENOSPC; + + if (offset) { + entry = swp_entry(root_swap, offset); + error = rw_swap_page_sync(WRITE, entry, virt_to_page(buf)); + } + return error; +} + +/* + * The swap map is a data structure used for keeping track of each page + * written to a swap partition. It consists of many swap_map_page + * structures that contain each an array of MAP_PAGE_SIZE swap entries. + * These structures are stored on the swap and linked together with the + * help of the .next_swap member. + * + * The swap map is created during suspend. The swap map pages are + * allocated and populated one at a time, so we only need one memory + * page to set up the entire structure. + * + * During resume we also only need to use one swap_map_page structure + * at a time. 
+ */ + +#define MAP_PAGE_ENTRIES (PAGE_SIZE / sizeof(long) - 1) + +struct swap_map_page { + unsigned long entries[MAP_PAGE_ENTRIES]; + unsigned long next_swap; +}; + +/** + * The swap_map_handle structure is used for handling swap in + * a file-alike way + */ + +struct swap_map_handle { + struct swap_map_page *cur; + unsigned long cur_swap; + struct bitmap_page *bitmap; + unsigned int k; +}; + +static void release_swap_writer(struct swap_map_handle *handle) +{ + if (handle->cur) + free_page((unsigned long)handle->cur); + handle->cur = NULL; + if (handle->bitmap) + free_bitmap(handle->bitmap); + handle->bitmap = NULL; +} + +static int get_swap_writer(struct swap_map_handle *handle) +{ + handle->cur = (struct swap_map_page *)get_zeroed_page(GFP_KERNEL); + if (!handle->cur) + return -ENOMEM; + handle->bitmap = alloc_bitmap(count_swap_pages(root_swap, 0)); + if (!handle->bitmap) { + release_swap_writer(handle); + return -ENOMEM; + } + handle->cur_swap = alloc_swap_page(root_swap, handle->bitmap); + if (!handle->cur_swap) { + release_swap_writer(handle); + return -ENOSPC; + } + handle->k = 0; + return 0; +} + +static int swap_write_page(struct swap_map_handle *handle, void *buf) +{ + int error; + unsigned long offset; + + if (!handle->cur) + return -EINVAL; + offset = alloc_swap_page(root_swap, handle->bitmap); + error = write_page(buf, offset); + if (error) + return error; + handle->cur->entries[handle->k++] = offset; + if (handle->k >= MAP_PAGE_ENTRIES) { + offset = alloc_swap_page(root_swap, handle->bitmap); + if (!offset) + return -ENOSPC; + handle->cur->next_swap = offset; + error = write_page(handle->cur, handle->cur_swap); + if (error) + return error; + memset(handle->cur, 0, PAGE_SIZE); + handle->cur_swap = offset; + handle->k = 0; + } + return 0; +} + +static int flush_swap_writer(struct swap_map_handle *handle) +{ + if (handle->cur && handle->cur_swap) + return write_page(handle->cur, handle->cur_swap); + else + return -EINVAL; +} + +/** + * save_image - save 
the suspend image data + */ + +static int save_image(struct swap_map_handle *handle, + struct snapshot_handle *snapshot, + unsigned int nr_pages) +{ + unsigned int m; + int ret; + int error = 0; + + printk("Saving image data pages (%u pages) ... ", nr_pages); + m = nr_pages / 100; + if (!m) + m = 1; + nr_pages = 0; + do { + ret = snapshot_read_next(snapshot, PAGE_SIZE); + if (ret > 0) { + error = swap_write_page(handle, data_of(*snapshot)); + if (error) + break; + if (!(nr_pages % m)) + printk("\b\b\b\b%3d%%", nr_pages / m); + nr_pages++; + } + } while (ret > 0); + if (!error) + printk("\b\b\b\bdone\n"); + return error; +} + +/** + * enough_swap - Make sure we have enough swap to save the image. + * + * Returns TRUE or FALSE after checking the total amount of swap + * space avaiable from the resume partition. + */ + +static int enough_swap(unsigned int nr_pages) +{ + unsigned int free_swap = count_swap_pages(root_swap, 1); + + pr_debug("swsusp: free swap pages: %u\n", free_swap); + return free_swap > (nr_pages + PAGES_FOR_IO + + (nr_pages + PBES_PER_PAGE - 1) / PBES_PER_PAGE); +} + +/** + * swsusp_write - Write entire image and metadata. + * + * It is important _NOT_ to umount filesystems at this point. We want + * them synced (in case something goes wrong) but we DO not want to mark + * filesystem clean: it is not. (And it does not matter, if we resume + * correctly, we'll mark system clean, anyway.) + */ + +int swsusp_write(void) +{ + struct swap_map_handle handle; + struct snapshot_handle snapshot; + struct swsusp_info *header; + unsigned long start; + int error; + + if ((error = swsusp_swap_check())) { + printk(KERN_ERR "swsusp: Cannot find swap device, try swapon -a.\n"); + return error; + } + memset(&snapshot, 0, sizeof(struct snapshot_handle)); + error = snapshot_read_next(&snapshot, PAGE_SIZE); + if (error < PAGE_SIZE) + return error < 0 ? 
error : -EFAULT; + header = (struct swsusp_info *)data_of(snapshot); + if (!enough_swap(header->pages)) { + printk(KERN_ERR "swsusp: Not enough free swap\n"); + return -ENOSPC; + } + error = get_swap_writer(&handle); + if (!error) { + start = handle.cur_swap; + error = swap_write_page(&handle, header); + } + if (!error) + error = save_image(&handle, &snapshot, header->pages - 1); + if (!error) { + flush_swap_writer(&handle); + printk("S"); + error = mark_swapfiles(swp_entry(root_swap, start)); + printk("|\n"); + } + if (error) + free_all_swap_pages(root_swap, handle.bitmap); + release_swap_writer(&handle); + return error; +} + +/* + * Using bio to read from swap. + * This code requires a bit more work than just using buffer heads + * but, it is the recommended way for 2.5/2.6. + * The following are to signal the beginning and end of I/O. Bios + * finish asynchronously, while we want them to happen synchronously. + * A simple atomic_t, and a wait loop take care of this problem. + */ + +static atomic_t io_done = ATOMIC_INIT(0); + +static int end_io(struct bio *bio, unsigned int num, int err) +{ + if (!test_bit(BIO_UPTODATE, &bio->bi_flags)) + panic("I/O error reading memory image"); + atomic_set(&io_done, 0); + return 0; +} + +static struct block_device *resume_bdev; + +/** + * submit - submit BIO request. + * @rw: READ or WRITE. + * @off physical offset of page. + * @page: page we're reading or writing. + * + * Straight from the textbook - allocate and initialize the bio. + * If we're writing, make sure the page is marked as dirty. + * Then submit it and wait. 
+ */ + +static int submit(int rw, pgoff_t page_off, void *page) +{ + int error = 0; + struct bio *bio; + + bio = bio_alloc(GFP_ATOMIC, 1); + if (!bio) + return -ENOMEM; + bio->bi_sector = page_off * (PAGE_SIZE >> 9); + bio->bi_bdev = resume_bdev; + bio->bi_end_io = end_io; + + if (bio_add_page(bio, virt_to_page(page), PAGE_SIZE, 0) < PAGE_SIZE) { + printk("swsusp: ERROR: adding page to bio at %ld\n",page_off); + error = -EFAULT; + goto Done; + } + + atomic_set(&io_done, 1); + submit_bio(rw | (1 << BIO_RW_SYNC), bio); + while (atomic_read(&io_done)) + yield(); + if (rw == READ) + bio_set_pages_dirty(bio); + Done: + bio_put(bio); + return error; +} + +static int bio_read_page(pgoff_t page_off, void *page) +{ + return submit(READ, page_off, page); +} + +static int bio_write_page(pgoff_t page_off, void *page) +{ + return submit(WRITE, page_off, page); +} + +/** + * The following functions allow us to read data using a swap map + * in a file-alike way + */ + +static void release_swap_reader(struct swap_map_handle *handle) +{ + if (handle->cur) + free_page((unsigned long)handle->cur); + handle->cur = NULL; +} + +static int get_swap_reader(struct swap_map_handle *handle, + swp_entry_t start) +{ + int error; + + if (!swp_offset(start)) + return -EINVAL; + handle->cur = (struct swap_map_page *)get_zeroed_page(GFP_ATOMIC); + if (!handle->cur) + return -ENOMEM; + error = bio_read_page(swp_offset(start), handle->cur); + if (error) { + release_swap_reader(handle); + return error; + } + handle->k = 0; + return 0; +} + +static int swap_read_page(struct swap_map_handle *handle, void *buf) +{ + unsigned long offset; + int error; + + if (!handle->cur) + return -EINVAL; + offset = handle->cur->entries[handle->k]; + if (!offset) + return -EFAULT; + error = bio_read_page(offset, buf); + if (error) + return error; + if (++handle->k >= MAP_PAGE_ENTRIES) { + handle->k = 0; + offset = handle->cur->next_swap; + if (!offset) + release_swap_reader(handle); + else + error = 
bio_read_page(offset, handle->cur); + } + return error; +} + +/** + * load_image - load the image using the swap map handle + * @handle and the snapshot handle @snapshot + * (assume there are @nr_pages pages to load) + */ + +static int load_image(struct swap_map_handle *handle, + struct snapshot_handle *snapshot, + unsigned int nr_pages) +{ + unsigned int m; + int ret; + int error = 0; + + printk("Loading image data pages (%u pages) ... ", nr_pages); + m = nr_pages / 100; + if (!m) + m = 1; + nr_pages = 0; + do { + ret = snapshot_write_next(snapshot, PAGE_SIZE); + if (ret > 0) { + error = swap_read_page(handle, data_of(*snapshot)); + if (error) + break; + if (!(nr_pages % m)) + printk("\b\b\b\b%3d%%", nr_pages / m); + nr_pages++; + } + } while (ret > 0); + if (!error) + printk("\b\b\b\bdone\n"); + if (!snapshot_image_loaded(snapshot)) + error = -ENODATA; + return error; +} + +int swsusp_read(void) +{ + int error; + struct swap_map_handle handle; + struct snapshot_handle snapshot; + struct swsusp_info *header; + + if (IS_ERR(resume_bdev)) { + pr_debug("swsusp: block device not initialised\n"); + return PTR_ERR(resume_bdev); + } + + memset(&snapshot, 0, sizeof(struct snapshot_handle)); + error = snapshot_write_next(&snapshot, PAGE_SIZE); + if (error < PAGE_SIZE) + return error < 0 ? 
error : -EFAULT; + header = (struct swsusp_info *)data_of(snapshot); + error = get_swap_reader(&handle, swsusp_header.image); + if (!error) + error = swap_read_page(&handle, header); + if (!error) + error = load_image(&handle, &snapshot, header->pages - 1); + release_swap_reader(&handle); + + blkdev_put(resume_bdev); + + if (!error) + pr_debug("swsusp: Reading resume file was successful\n"); + else + pr_debug("swsusp: Error %d resuming\n", error); + return error; +} + +/** + * swsusp_check - Check for swsusp signature in the resume device + */ + +int swsusp_check(void) +{ + int error; + + resume_bdev = open_by_devnum(swsusp_resume_device, FMODE_READ); + if (!IS_ERR(resume_bdev)) { + set_blocksize(resume_bdev, PAGE_SIZE); + memset(&swsusp_header, 0, sizeof(swsusp_header)); + if ((error = bio_read_page(0, &swsusp_header))) + return error; + if (!memcmp(SWSUSP_SIG, swsusp_header.sig, 10)) { + memcpy(swsusp_header.sig, swsusp_header.orig_sig, 10); + /* Reset swap signature now */ + error = bio_write_page(0, &swsusp_header); + } else { + return -EINVAL; + } + if (error) + blkdev_put(resume_bdev); + else + pr_debug("swsusp: Signature found, resuming\n"); + } else { + error = PTR_ERR(resume_bdev); + } + + if (error) + pr_debug("swsusp: Error %d check for resume file\n", error); + + return error; +} + +/** + * swsusp_close - close swap device. + */ + +void swsusp_close(void) +{ + if (IS_ERR(resume_bdev)) { + pr_debug("swsusp: block device not initialised\n"); + return; + } + + blkdev_put(resume_bdev); +} diff --git a/kernel/power/swsusp.c b/kernel/power/swsusp.c index 457084f50010..c4016cbbd3e0 100644 --- a/kernel/power/swsusp.c +++ b/kernel/power/swsusp.c @@ -31,41 +31,24 @@ * Fixed runaway init * * Rafael J. Wysocki - * Added the swap map data structure and reworked the handling of swap + * Reworked the freeing of memory and the handling of swap * * More state savers are welcome. Especially for the scsi layer... 
* * For TODOs,FIXMEs also look in Documentation/power/swsusp.txt */ -#include #include #include -#include -#include -#include -#include -#include -#include #include -#include #include #include #include #include -#include -#include #include #include #include #include -#include - -#include -#include -#include -#include -#include #include "power.h" @@ -89,91 +72,15 @@ static int restore_highmem(void) { return 0; } static unsigned int count_highmem_pages(void) { return 0; } #endif -extern char resume_file[]; - -#define SWSUSP_SIG "S1SUSPEND" - -static struct swsusp_header { - char reserved[PAGE_SIZE - 20 - sizeof(swp_entry_t)]; - swp_entry_t image; - char orig_sig[10]; - char sig[10]; -} __attribute__((packed, aligned(PAGE_SIZE))) swsusp_header; - -/* - * Saving part... - */ - -static unsigned short root_swap = 0xffff; - -static int mark_swapfiles(swp_entry_t start) -{ - int error; - - rw_swap_page_sync(READ, - swp_entry(root_swap, 0), - virt_to_page((unsigned long)&swsusp_header)); - if (!memcmp("SWAP-SPACE",swsusp_header.sig, 10) || - !memcmp("SWAPSPACE2",swsusp_header.sig, 10)) { - memcpy(swsusp_header.orig_sig,swsusp_header.sig, 10); - memcpy(swsusp_header.sig,SWSUSP_SIG, 10); - swsusp_header.image = start; - error = rw_swap_page_sync(WRITE, - swp_entry(root_swap, 0), - virt_to_page((unsigned long) - &swsusp_header)); - } else { - pr_debug("swsusp: Partition is not swap space.\n"); - error = -ENODEV; - } - return error; -} - -/** - * swsusp_swap_check - check if the resume device is a swap device - * and get its index (if so) - */ - -static int swsusp_swap_check(void) /* This is called before saving image */ -{ - int res = swap_type_of(swsusp_resume_device); - - if (res >= 0) { - root_swap = res; - return 0; - } - return res; -} - -/** - * The bitmap is used for tracing allocated swap pages - * - * The entire bitmap consists of a number of bitmap_page - * structures linked with the help of the .next member. 
- * Thus each page can be allocated individually, so we only - * need to make 0-order memory allocations to create - * the bitmap. - */ - -#define BITMAP_PAGE_SIZE (PAGE_SIZE - sizeof(void *)) -#define BITMAP_PAGE_CHUNKS (BITMAP_PAGE_SIZE / sizeof(long)) -#define BITS_PER_CHUNK (sizeof(long) * 8) -#define BITMAP_PAGE_BITS (BITMAP_PAGE_CHUNKS * BITS_PER_CHUNK) - -struct bitmap_page { - unsigned long chunks[BITMAP_PAGE_CHUNKS]; - struct bitmap_page *next; -}; - /** * The following functions are used for tracing the allocated * swap pages, so that they can be freed in case of an error. * * The functions operate on a linked bitmap structure defined - * above + * in power.h */ -static void free_bitmap(struct bitmap_page *bitmap) +void free_bitmap(struct bitmap_page *bitmap) { struct bitmap_page *bp; @@ -184,7 +91,7 @@ static void free_bitmap(struct bitmap_page *bitmap) } } -static struct bitmap_page *alloc_bitmap(unsigned int nr_bits) +struct bitmap_page *alloc_bitmap(unsigned int nr_bits) { struct bitmap_page *bitmap, *bp; unsigned int n; @@ -227,7 +134,7 @@ static int bitmap_set(struct bitmap_page *bitmap, unsigned long bit) return 0; } -static unsigned long alloc_swap_page(int swap, struct bitmap_page *bitmap) +unsigned long alloc_swap_page(int swap, struct bitmap_page *bitmap) { unsigned long offset; @@ -241,7 +148,7 @@ static unsigned long alloc_swap_page(int swap, struct bitmap_page *bitmap) return offset; } -static void free_all_swap_pages(int swap, struct bitmap_page *bitmap) +void free_all_swap_pages(int swap, struct bitmap_page *bitmap) { unsigned int bit, n; unsigned long test; @@ -258,220 +165,6 @@ static void free_all_swap_pages(int swap, struct bitmap_page *bitmap) } } -/** - * write_page - Write one page to given swap location. - * @buf: Address we're writing. - * @offset: Offset of the swap page we're writing to. 
- */ - -static int write_page(void *buf, unsigned long offset) -{ - swp_entry_t entry; - int error = -ENOSPC; - - if (offset) { - entry = swp_entry(root_swap, offset); - error = rw_swap_page_sync(WRITE, entry, virt_to_page(buf)); - } - return error; -} - -/* - * The swap map is a data structure used for keeping track of each page - * written to a swap partition. It consists of many swap_map_page - * structures that contain each an array of MAP_PAGE_SIZE swap entries. - * These structures are stored on the swap and linked together with the - * help of the .next_swap member. - * - * The swap map is created during suspend. The swap map pages are - * allocated and populated one at a time, so we only need one memory - * page to set up the entire structure. - * - * During resume we also only need to use one swap_map_page structure - * at a time. - */ - -#define MAP_PAGE_ENTRIES (PAGE_SIZE / sizeof(long) - 1) - -struct swap_map_page { - unsigned long entries[MAP_PAGE_ENTRIES]; - unsigned long next_swap; -}; - -/** - * The swap_map_handle structure is used for handling swap in - * a file-alike way - */ - -struct swap_map_handle { - struct swap_map_page *cur; - unsigned long cur_swap; - struct bitmap_page *bitmap; - unsigned int k; -}; - -static void release_swap_writer(struct swap_map_handle *handle) -{ - if (handle->cur) - free_page((unsigned long)handle->cur); - handle->cur = NULL; - if (handle->bitmap) - free_bitmap(handle->bitmap); - handle->bitmap = NULL; -} - -static int get_swap_writer(struct swap_map_handle *handle) -{ - handle->cur = (struct swap_map_page *)get_zeroed_page(GFP_KERNEL); - if (!handle->cur) - return -ENOMEM; - handle->bitmap = alloc_bitmap(count_swap_pages(root_swap, 0)); - if (!handle->bitmap) { - release_swap_writer(handle); - return -ENOMEM; - } - handle->cur_swap = alloc_swap_page(root_swap, handle->bitmap); - if (!handle->cur_swap) { - release_swap_writer(handle); - return -ENOSPC; - } - handle->k = 0; - return 0; -} - -static int 
swap_write_page(struct swap_map_handle *handle, void *buf) -{ - int error; - unsigned long offset; - - if (!handle->cur) - return -EINVAL; - offset = alloc_swap_page(root_swap, handle->bitmap); - error = write_page(buf, offset); - if (error) - return error; - handle->cur->entries[handle->k++] = offset; - if (handle->k >= MAP_PAGE_ENTRIES) { - offset = alloc_swap_page(root_swap, handle->bitmap); - if (!offset) - return -ENOSPC; - handle->cur->next_swap = offset; - error = write_page(handle->cur, handle->cur_swap); - if (error) - return error; - memset(handle->cur, 0, PAGE_SIZE); - handle->cur_swap = offset; - handle->k = 0; - } - return 0; -} - -static int flush_swap_writer(struct swap_map_handle *handle) -{ - if (handle->cur && handle->cur_swap) - return write_page(handle->cur, handle->cur_swap); - else - return -EINVAL; -} - -/** - * save_image - save the suspend image data - */ - -static int save_image(struct swap_map_handle *handle, - struct snapshot_handle *snapshot, - unsigned int nr_pages) -{ - unsigned int m; - int ret; - int error = 0; - - printk("Saving image data pages (%u pages) ... ", nr_pages); - m = nr_pages / 100; - if (!m) - m = 1; - nr_pages = 0; - do { - ret = snapshot_read_next(snapshot, PAGE_SIZE); - if (ret > 0) { - error = swap_write_page(handle, data_of(*snapshot)); - if (error) - break; - if (!(nr_pages % m)) - printk("\b\b\b\b%3d%%", nr_pages / m); - nr_pages++; - } - } while (ret > 0); - if (!error) - printk("\b\b\b\bdone\n"); - return error; -} - -/** - * enough_swap - Make sure we have enough swap to save the image. - * - * Returns TRUE or FALSE after checking the total amount of swap - * space avaiable from the resume partition. 
- */ - -static int enough_swap(unsigned int nr_pages) -{ - unsigned int free_swap = count_swap_pages(root_swap, 1); - - pr_debug("swsusp: free swap pages: %u\n", free_swap); - return free_swap > (nr_pages + PAGES_FOR_IO + - (nr_pages + PBES_PER_PAGE - 1) / PBES_PER_PAGE); -} - -/** - * swsusp_write - Write entire image and metadata. - * - * It is important _NOT_ to umount filesystems at this point. We want - * them synced (in case something goes wrong) but we DO not want to mark - * filesystem clean: it is not. (And it does not matter, if we resume - * correctly, we'll mark system clean, anyway.) - */ - -int swsusp_write(void) -{ - struct swap_map_handle handle; - struct snapshot_handle snapshot; - struct swsusp_info *header; - unsigned long start; - int error; - - if ((error = swsusp_swap_check())) { - printk(KERN_ERR "swsusp: Cannot find swap device, try swapon -a.\n"); - return error; - } - memset(&snapshot, 0, sizeof(struct snapshot_handle)); - error = snapshot_read_next(&snapshot, PAGE_SIZE); - if (error < PAGE_SIZE) - return error < 0 ? error : -EFAULT; - header = (struct swsusp_info *)data_of(snapshot); - if (!enough_swap(header->pages)) { - printk(KERN_ERR "swsusp: Not enough free swap\n"); - return -ENOSPC; - } - error = get_swap_writer(&handle); - if (!error) { - start = handle.cur_swap; - error = swap_write_page(&handle, header); - } - if (!error) - error = save_image(&handle, &snapshot, header->pages - 1); - if (!error) { - flush_swap_writer(&handle); - printk("S"); - error = mark_swapfiles(swp_entry(root_swap, start)); - printk("|\n"); - } - if (error) - free_all_swap_pages(root_swap, handle.bitmap); - release_swap_writer(&handle); - return error; -} - /** * swsusp_shrink_memory - Try to free as much memory as needed * @@ -578,252 +271,3 @@ int swsusp_resume(void) local_irq_enable(); return error; } - -/* - * Using bio to read from swap. - * This code requires a bit more work than just using buffer heads - * but, it is the recommended way for 2.5/2.6. 
- * The following are to signal the beginning and end of I/O. Bios - * finish asynchronously, while we want them to happen synchronously. - * A simple atomic_t, and a wait loop take care of this problem. - */ - -static atomic_t io_done = ATOMIC_INIT(0); - -static int end_io(struct bio *bio, unsigned int num, int err) -{ - if (!test_bit(BIO_UPTODATE, &bio->bi_flags)) - panic("I/O error reading memory image"); - atomic_set(&io_done, 0); - return 0; -} - -static struct block_device *resume_bdev; - -/** - * submit - submit BIO request. - * @rw: READ or WRITE. - * @off physical offset of page. - * @page: page we're reading or writing. - * - * Straight from the textbook - allocate and initialize the bio. - * If we're writing, make sure the page is marked as dirty. - * Then submit it and wait. - */ - -static int submit(int rw, pgoff_t page_off, void *page) -{ - int error = 0; - struct bio *bio; - - bio = bio_alloc(GFP_ATOMIC, 1); - if (!bio) - return -ENOMEM; - bio->bi_sector = page_off * (PAGE_SIZE >> 9); - bio->bi_bdev = resume_bdev; - bio->bi_end_io = end_io; - - if (bio_add_page(bio, virt_to_page(page), PAGE_SIZE, 0) < PAGE_SIZE) { - printk("swsusp: ERROR: adding page to bio at %ld\n",page_off); - error = -EFAULT; - goto Done; - } - - - atomic_set(&io_done, 1); - submit_bio(rw | (1 << BIO_RW_SYNC), bio); - while (atomic_read(&io_done)) - yield(); - if (rw == READ) - bio_set_pages_dirty(bio); - Done: - bio_put(bio); - return error; -} - -static int bio_read_page(pgoff_t page_off, void *page) -{ - return submit(READ, page_off, page); -} - -static int bio_write_page(pgoff_t page_off, void *page) -{ - return submit(WRITE, page_off, page); -} - -/** - * The following functions allow us to read data using a swap map - * in a file-alike way - */ - -static void release_swap_reader(struct swap_map_handle *handle) -{ - if (handle->cur) - free_page((unsigned long)handle->cur); - handle->cur = NULL; -} - -static int get_swap_reader(struct swap_map_handle *handle, - swp_entry_t 
start) -{ - int error; - - if (!swp_offset(start)) - return -EINVAL; - handle->cur = (struct swap_map_page *)get_zeroed_page(GFP_ATOMIC); - if (!handle->cur) - return -ENOMEM; - error = bio_read_page(swp_offset(start), handle->cur); - if (error) { - release_swap_reader(handle); - return error; - } - handle->k = 0; - return 0; -} - -static int swap_read_page(struct swap_map_handle *handle, void *buf) -{ - unsigned long offset; - int error; - - if (!handle->cur) - return -EINVAL; - offset = handle->cur->entries[handle->k]; - if (!offset) - return -EFAULT; - error = bio_read_page(offset, buf); - if (error) - return error; - if (++handle->k >= MAP_PAGE_ENTRIES) { - handle->k = 0; - offset = handle->cur->next_swap; - if (!offset) - release_swap_reader(handle); - else - error = bio_read_page(offset, handle->cur); - } - return error; -} - -/** - * load_image - load the image using the swap map handle - * @handle and the snapshot handle @snapshot - * (assume there are @nr_pages pages to load) - */ - -static int load_image(struct swap_map_handle *handle, - struct snapshot_handle *snapshot, - unsigned int nr_pages) -{ - unsigned int m; - int ret; - int error = 0; - - printk("Loading image data pages (%u pages) ... 
", nr_pages); - m = nr_pages / 100; - if (!m) - m = 1; - nr_pages = 0; - do { - ret = snapshot_write_next(snapshot, PAGE_SIZE); - if (ret > 0) { - error = swap_read_page(handle, data_of(*snapshot)); - if (error) - break; - if (!(nr_pages % m)) - printk("\b\b\b\b%3d%%", nr_pages / m); - nr_pages++; - } - } while (ret > 0); - if (!error) - printk("\b\b\b\bdone\n"); - if (!snapshot_image_loaded(snapshot)) - error = -ENODATA; - return error; -} - -int swsusp_read(void) -{ - int error; - struct swap_map_handle handle; - struct snapshot_handle snapshot; - struct swsusp_info *header; - unsigned int nr_pages; - - if (IS_ERR(resume_bdev)) { - pr_debug("swsusp: block device not initialised\n"); - return PTR_ERR(resume_bdev); - } - - memset(&snapshot, 0, sizeof(struct snapshot_handle)); - error = snapshot_write_next(&snapshot, PAGE_SIZE); - if (error < PAGE_SIZE) - return error < 0 ? error : -EFAULT; - header = (struct swsusp_info *)data_of(snapshot); - error = get_swap_reader(&handle, swsusp_header.image); - if (!error) - error = swap_read_page(&handle, header); - if (!error) { - nr_pages = header->image_pages; - error = load_image(&handle, &snapshot, nr_pages); - } - release_swap_reader(&handle); - - blkdev_put(resume_bdev); - - if (!error) - pr_debug("swsusp: Reading resume file was successful\n"); - else - pr_debug("swsusp: Error %d resuming\n", error); - return error; -} - -/** - * swsusp_check - Check for swsusp signature in the resume device - */ - -int swsusp_check(void) -{ - int error; - - resume_bdev = open_by_devnum(swsusp_resume_device, FMODE_READ); - if (!IS_ERR(resume_bdev)) { - set_blocksize(resume_bdev, PAGE_SIZE); - memset(&swsusp_header, 0, sizeof(swsusp_header)); - if ((error = bio_read_page(0, &swsusp_header))) - return error; - if (!memcmp(SWSUSP_SIG, swsusp_header.sig, 10)) { - memcpy(swsusp_header.sig, swsusp_header.orig_sig, 10); - /* Reset swap signature now */ - error = bio_write_page(0, &swsusp_header); - } else { - return -EINVAL; - } - if (error) 
- blkdev_put(resume_bdev); - else - pr_debug("swsusp: Signature found, resuming\n"); - } else { - error = PTR_ERR(resume_bdev); - } - - if (error) - pr_debug("swsusp: Error %d check for resume file\n", error); - - return error; -} - -/** - * swsusp_close - close swap device. - */ - -void swsusp_close(void) -{ - if (IS_ERR(resume_bdev)) { - pr_debug("swsusp: block device not initialised\n"); - return; - } - - blkdev_put(resume_bdev); -} -- cgit v1.2.3 From 74c7e2efbe37378026f00ad9e7253796d7b2fc99 Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Thu, 23 Mar 2006 03:00:01 -0800 Subject: [PATCH] kernel/power: move externs to header files Move externs from C source files to header files. Signed-off-by: Randy Dunlap Cc: "Rafael J. Wysocki" Cc: Pavel Machek Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/power/disk.c | 11 ----------- kernel/power/power.h | 6 +++++- 2 files changed, 5 insertions(+), 12 deletions(-) (limited to 'kernel') diff --git a/kernel/power/disk.c b/kernel/power/disk.c index 4eb464b71347..4bd68f482f2b 100644 --- a/kernel/power/disk.c +++ b/kernel/power/disk.c @@ -22,17 +22,6 @@ #include "power.h" -extern suspend_disk_method_t pm_disk_mode; - -extern int swsusp_shrink_memory(void); -extern int swsusp_suspend(void); -extern int swsusp_write(void); -extern int swsusp_check(void); -extern int swsusp_read(void); -extern void swsusp_close(void); -extern int swsusp_resume(void); - - static int noresume = 0; char resume_file[256] = CONFIG_PM_STD_PARTITION; dev_t swsusp_resume_device; diff --git a/kernel/power/power.h b/kernel/power/power.h index 089c84bed895..5d1abffbb9ce 100644 --- a/kernel/power/power.h +++ b/kernel/power/power.h @@ -48,7 +48,6 @@ extern asmlinkage int swsusp_arch_suspend(void); extern asmlinkage int swsusp_arch_resume(void); extern unsigned int count_data_pages(void); -extern void swsusp_free(void); struct snapshot_handle { loff_t offset; @@ -91,6 +90,11 @@ extern struct bitmap_page *alloc_bitmap(unsigned int 
nr_bits); extern unsigned long alloc_swap_page(int swap, struct bitmap_page *bitmap); extern void free_all_swap_pages(int swap, struct bitmap_page *bitmap); +extern int swsusp_check(void); extern int swsusp_shrink_memory(void); +extern void swsusp_free(void); extern int swsusp_suspend(void); extern int swsusp_resume(void); +extern int swsusp_read(void); +extern int swsusp_write(void); +extern void swsusp_close(void); -- cgit v1.2.3 From 543cc27d09643640cbc34189c03a40beb8227aef Mon Sep 17 00:00:00 2001 From: Pavel Machek Date: Thu, 23 Mar 2006 03:00:02 -0800 Subject: [PATCH] swsusp: documentation updates Update suspend-to-RAM documentation with new machines, and make the message printed when processes can't be stopped a little clearer. (In one case, waiting longer actually did help). From: "Rafael J. Wysocki" Warn in the documentation that data may be lost if there are some filesystems mounted from USB devices before suspend. [Thanks to Alan Stern for providing the answer to the question in the Q:-A: part.] Signed-off-by: Pavel Machek Signed-off-by: Rafael J. Wysocki Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/power/process.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/power/process.c b/kernel/power/process.c index 28de118f7a0b..02a1b3a9fa90 100644 --- a/kernel/power/process.c +++ b/kernel/power/process.c @@ -83,7 +83,7 @@ int freeze_processes(void) yield(); /* Yield is okay here */ if (todo && time_after(jiffies, start_time + TIMEOUT)) { printk( "\n" ); - printk(KERN_ERR " stopping tasks failed (%d tasks remaining)\n", todo ); + printk(KERN_ERR " stopping tasks timed out (%d tasks remaining)\n", todo ); break; } } while(todo); -- cgit v1.2.3 From 6e1819d615f24ce0726a7d0bd3dd0152d7b21654 Mon Sep 17 00:00:00 2001 From: "Rafael J. 
Wysocki" Date: Thu, 23 Mar 2006 03:00:03 -0800 Subject: [PATCH] swsusp: userland interface MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit This patch introduces a user space interface for swsusp. The interface is based on a special character device, called the snapshot device, that allows user space processes to perform suspend and resume-related operations with the help of some ioctls and the read()/write() functions. Additionally it allows these processes to allocate free swap pages from a selected swap partition, called the resume partition, so that they know which sectors of the resume partition are available to them. The interface uses the same low-level system memory snapshot-handling functions that are used by the built-in swap-writing/reading code of swsusp. The interface documentation is included in the patch. The patch assumes that the major and minor numbers of the snapshot device will be 10 (i.e. misc device) and 231, the registration of which has already been requested. Signed-off-by: Rafael J. 
Wysocki Acked-by: Pavel Machek Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/power/Makefile | 2 +- kernel/power/power.h | 14 +++ kernel/power/snapshot.c | 9 +- kernel/power/user.c | 301 ++++++++++++++++++++++++++++++++++++++++++++++++ 4 files changed, 321 insertions(+), 5 deletions(-) create mode 100644 kernel/power/user.c (limited to 'kernel') diff --git a/kernel/power/Makefile b/kernel/power/Makefile index bb91a0615303..8d0af3d37a4b 100644 --- a/kernel/power/Makefile +++ b/kernel/power/Makefile @@ -5,7 +5,7 @@ endif obj-y := main.o process.o console.o obj-$(CONFIG_PM_LEGACY) += pm.o -obj-$(CONFIG_SOFTWARE_SUSPEND) += swsusp.o disk.o snapshot.o swap.o +obj-$(CONFIG_SOFTWARE_SUSPEND) += swsusp.o disk.o snapshot.o swap.o user.o obj-$(CONFIG_SUSPEND_SMP) += smp.o diff --git a/kernel/power/power.h b/kernel/power/power.h index 5d1abffbb9ce..42c431c8bdde 100644 --- a/kernel/power/power.h +++ b/kernel/power/power.h @@ -8,6 +8,7 @@ struct swsusp_info { int cpus; unsigned long image_pages; unsigned long pages; + unsigned long size; } __attribute__((aligned(PAGE_SIZE))); @@ -65,6 +66,19 @@ extern int snapshot_read_next(struct snapshot_handle *handle, size_t count); extern int snapshot_write_next(struct snapshot_handle *handle, size_t count); int snapshot_image_loaded(struct snapshot_handle *handle); +#define SNAPSHOT_IOC_MAGIC '3' +#define SNAPSHOT_FREEZE _IO(SNAPSHOT_IOC_MAGIC, 1) +#define SNAPSHOT_UNFREEZE _IO(SNAPSHOT_IOC_MAGIC, 2) +#define SNAPSHOT_ATOMIC_SNAPSHOT _IOW(SNAPSHOT_IOC_MAGIC, 3, void *) +#define SNAPSHOT_ATOMIC_RESTORE _IO(SNAPSHOT_IOC_MAGIC, 4) +#define SNAPSHOT_FREE _IO(SNAPSHOT_IOC_MAGIC, 5) +#define SNAPSHOT_SET_IMAGE_SIZE _IOW(SNAPSHOT_IOC_MAGIC, 6, unsigned long) +#define SNAPSHOT_AVAIL_SWAP _IOR(SNAPSHOT_IOC_MAGIC, 7, void *) +#define SNAPSHOT_GET_SWAP_PAGE _IOR(SNAPSHOT_IOC_MAGIC, 8, void *) +#define SNAPSHOT_FREE_SWAP_PAGES _IO(SNAPSHOT_IOC_MAGIC, 9) +#define SNAPSHOT_SET_SWAP_FILE _IOW(SNAPSHOT_IOC_MAGIC, 10, unsigned int) 
+#define SNAPSHOT_IOC_MAXNR 10 + /** * The bitmap is used for tracing allocated swap pages * diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c index cc349437fb72..0036955357e0 100644 --- a/kernel/power/snapshot.c +++ b/kernel/power/snapshot.c @@ -37,6 +37,7 @@ struct pbe *pagedir_nosave; static unsigned int nr_copy_pages; static unsigned int nr_meta_pages; +static unsigned long *buffer; #ifdef CONFIG_HIGHMEM unsigned int count_highmem_pages(void) @@ -389,7 +390,7 @@ struct pbe *alloc_pagedir(unsigned int nr_pages, gfp_t gfp_mask, int safe_needed free_pagedir(pblist); pblist = NULL; } else - create_pbe_list(pblist, nr_pages); + create_pbe_list(pblist, nr_pages); return pblist; } @@ -418,6 +419,7 @@ void swsusp_free(void) nr_copy_pages = 0; nr_meta_pages = 0; pagedir_nosave = NULL; + buffer = NULL; } @@ -523,6 +525,8 @@ static void init_header(struct swsusp_info *info) info->cpus = num_online_cpus(); info->image_pages = nr_copy_pages; info->pages = nr_copy_pages + nr_meta_pages + 1; + info->size = info->pages; + info->size <<= PAGE_SHIFT; } /** @@ -568,8 +572,6 @@ static inline struct pbe *pack_orig_addresses(unsigned long *buf, struct pbe *pb int snapshot_read_next(struct snapshot_handle *handle, size_t count) { - static unsigned long *buffer; - if (handle->page > nr_meta_pages + nr_copy_pages) return 0; if (!buffer) { @@ -774,7 +776,6 @@ static int create_image(struct snapshot_handle *handle) int snapshot_write_next(struct snapshot_handle *handle, size_t count) { - static unsigned long *buffer; int error = 0; if (handle->prev && handle->page > nr_meta_pages + nr_copy_pages) diff --git a/kernel/power/user.c b/kernel/power/user.c new file mode 100644 index 000000000000..8cabc405ca10 --- /dev/null +++ b/kernel/power/user.c @@ -0,0 +1,301 @@ +/* + * linux/kernel/power/user.c + * + * This file provides the user space interface for software suspend/resume. + * + * Copyright (C) 2006 Rafael J. Wysocki + * + * This file is released under the GPLv2. 
+ * + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#include "power.h" + +#define SNAPSHOT_MINOR 231 + +static struct snapshot_data { + struct snapshot_handle handle; + int swap; + struct bitmap_page *bitmap; + int mode; + char frozen; + char ready; +} snapshot_state; + +static atomic_t device_available = ATOMIC_INIT(1); + +static int snapshot_open(struct inode *inode, struct file *filp) +{ + struct snapshot_data *data; + + if (!atomic_add_unless(&device_available, -1, 0)) + return -EBUSY; + + if ((filp->f_flags & O_ACCMODE) == O_RDWR) + return -ENOSYS; + + nonseekable_open(inode, filp); + data = &snapshot_state; + filp->private_data = data; + memset(&data->handle, 0, sizeof(struct snapshot_handle)); + if ((filp->f_flags & O_ACCMODE) == O_RDONLY) { + data->swap = swsusp_resume_device ? swap_type_of(swsusp_resume_device) : -1; + data->mode = O_RDONLY; + } else { + data->swap = -1; + data->mode = O_WRONLY; + } + data->bitmap = NULL; + data->frozen = 0; + data->ready = 0; + + return 0; +} + +static int snapshot_release(struct inode *inode, struct file *filp) +{ + struct snapshot_data *data; + + swsusp_free(); + data = filp->private_data; + free_all_swap_pages(data->swap, data->bitmap); + free_bitmap(data->bitmap); + if (data->frozen) { + down(&pm_sem); + thaw_processes(); + enable_nonboot_cpus(); + up(&pm_sem); + } + atomic_inc(&device_available); + return 0; +} + +static ssize_t snapshot_read(struct file *filp, char __user *buf, + size_t count, loff_t *offp) +{ + struct snapshot_data *data; + ssize_t res; + + data = filp->private_data; + res = snapshot_read_next(&data->handle, count); + if (res > 0) { + if (copy_to_user(buf, data_of(data->handle), res)) + res = -EFAULT; + else + *offp = data->handle.offset; + } + return res; +} + +static ssize_t snapshot_write(struct file *filp, const char __user *buf, + size_t count, loff_t *offp) +{ + struct snapshot_data *data; + ssize_t res; + + data = 
filp->private_data; + res = snapshot_write_next(&data->handle, count); + if (res > 0) { + if (copy_from_user(data_of(data->handle), buf, res)) + res = -EFAULT; + else + *offp = data->handle.offset; + } + return res; +} + +static int snapshot_ioctl(struct inode *inode, struct file *filp, + unsigned int cmd, unsigned long arg) +{ + int error = 0; + struct snapshot_data *data; + loff_t offset, avail; + + if (_IOC_TYPE(cmd) != SNAPSHOT_IOC_MAGIC) + return -ENOTTY; + if (_IOC_NR(cmd) > SNAPSHOT_IOC_MAXNR) + return -ENOTTY; + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; + + data = filp->private_data; + + switch (cmd) { + + case SNAPSHOT_FREEZE: + if (data->frozen) + break; + sys_sync(); + down(&pm_sem); + pm_prepare_console(); + disable_nonboot_cpus(); + if (freeze_processes()) { + thaw_processes(); + enable_nonboot_cpus(); + pm_restore_console(); + error = -EBUSY; + } + up(&pm_sem); + if (!error) + data->frozen = 1; + break; + + case SNAPSHOT_UNFREEZE: + if (!data->frozen) + break; + down(&pm_sem); + thaw_processes(); + enable_nonboot_cpus(); + pm_restore_console(); + up(&pm_sem); + data->frozen = 0; + break; + + case SNAPSHOT_ATOMIC_SNAPSHOT: + if (data->mode != O_RDONLY || !data->frozen || data->ready) { + error = -EPERM; + break; + } + down(&pm_sem); + /* Free memory before shutting down devices. 
*/ + error = swsusp_shrink_memory(); + if (!error) { + error = device_suspend(PMSG_FREEZE); + if (!error) { + in_suspend = 1; + error = swsusp_suspend(); + device_resume(); + } + } + up(&pm_sem); + if (!error) + error = put_user(in_suspend, (unsigned int __user *)arg); + if (!error) + data->ready = 1; + break; + + case SNAPSHOT_ATOMIC_RESTORE: + if (data->mode != O_WRONLY || !data->frozen || + !snapshot_image_loaded(&data->handle)) { + error = -EPERM; + break; + } + down(&pm_sem); + pm_prepare_console(); + error = device_suspend(PMSG_FREEZE); + if (!error) { + error = swsusp_resume(); + device_resume(); + } + pm_restore_console(); + up(&pm_sem); + break; + + case SNAPSHOT_FREE: + swsusp_free(); + memset(&data->handle, 0, sizeof(struct snapshot_handle)); + data->ready = 0; + break; + + case SNAPSHOT_SET_IMAGE_SIZE: + image_size = arg; + break; + + case SNAPSHOT_AVAIL_SWAP: + avail = count_swap_pages(data->swap, 1); + avail <<= PAGE_SHIFT; + error = put_user(avail, (loff_t __user *)arg); + break; + + case SNAPSHOT_GET_SWAP_PAGE: + if (data->swap < 0 || data->swap >= MAX_SWAPFILES) { + error = -ENODEV; + break; + } + if (!data->bitmap) { + data->bitmap = alloc_bitmap(count_swap_pages(data->swap, 0)); + if (!data->bitmap) { + error = -ENOMEM; + break; + } + } + offset = alloc_swap_page(data->swap, data->bitmap); + if (offset) { + offset <<= PAGE_SHIFT; + error = put_user(offset, (loff_t __user *)arg); + } else { + error = -ENOSPC; + } + break; + + case SNAPSHOT_FREE_SWAP_PAGES: + if (data->swap < 0 || data->swap >= MAX_SWAPFILES) { + error = -ENODEV; + break; + } + free_all_swap_pages(data->swap, data->bitmap); + free_bitmap(data->bitmap); + data->bitmap = NULL; + break; + + case SNAPSHOT_SET_SWAP_FILE: + if (!data->bitmap) { + /* + * User space encodes device types as two-byte values, + * so we need to recode them + */ + if (old_decode_dev(arg)) { + data->swap = swap_type_of(old_decode_dev(arg)); + if (data->swap < 0) + error = -ENODEV; + } else { + data->swap = -1; + 
error = -EINVAL; + } + } else { + error = -EPERM; + } + break; + + default: + error = -ENOTTY; + + } + + return error; +} + +static struct file_operations snapshot_fops = { + .open = snapshot_open, + .release = snapshot_release, + .read = snapshot_read, + .write = snapshot_write, + .llseek = no_llseek, + .ioctl = snapshot_ioctl, +}; + +static struct miscdevice snapshot_device = { + .minor = SNAPSHOT_MINOR, + .name = "snapshot", + .fops = &snapshot_fops, +}; + +static int __init snapshot_device_init(void) +{ + return misc_register(&snapshot_device); +}; + +device_initcall(snapshot_device_init); -- cgit v1.2.3 From 02aaeb9b952f30b1ad6284d5d45be02030f679db Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Thu, 23 Mar 2006 03:00:04 -0800 Subject: [PATCH] swsusp: freeze user space processes first Allow swsusp to freeze processes successfully under heavy load by freezing userspace processes before kernel threads. [Thanks to Nigel Cunningham for suggesting the way to go.] Signed-off-by: Rafael J. 
Wysocki Acked-by: Pavel Machek Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/power/disk.c | 1 - kernel/power/process.c | 61 +++++++++++++++++++++++++++++++++++++------------- kernel/power/user.c | 1 - 3 files changed, 46 insertions(+), 17 deletions(-) (limited to 'kernel') diff --git a/kernel/power/disk.c b/kernel/power/disk.c index 4bd68f482f2b..81d4d982f3f0 100644 --- a/kernel/power/disk.c +++ b/kernel/power/disk.c @@ -72,7 +72,6 @@ static int prepare_processes(void) int error; pm_prepare_console(); - sys_sync(); disable_nonboot_cpus(); if (freeze_processes()) { diff --git a/kernel/power/process.c b/kernel/power/process.c index 02a1b3a9fa90..8ac7c35fad77 100644 --- a/kernel/power/process.c +++ b/kernel/power/process.c @@ -12,11 +12,12 @@ #include #include #include +#include /* * Timeout for stopping processes */ -#define TIMEOUT (6 * HZ) +#define TIMEOUT (20 * HZ) static inline int freezeable(struct task_struct * p) @@ -54,38 +55,62 @@ void refrigerator(void) current->state = save; } +static inline void freeze_process(struct task_struct *p) +{ + unsigned long flags; + + if (!freezing(p)) { + freeze(p); + spin_lock_irqsave(&p->sighand->siglock, flags); + signal_wake_up(p, 0); + spin_unlock_irqrestore(&p->sighand->siglock, flags); + } +} + /* 0 = success, else # of processes that we failed to stop */ int freeze_processes(void) { - int todo; + int todo, nr_user, user_frozen; unsigned long start_time; struct task_struct *g, *p; unsigned long flags; printk( "Stopping tasks: " ); start_time = jiffies; + user_frozen = 0; do { - todo = 0; + nr_user = todo = 0; read_lock(&tasklist_lock); do_each_thread(g, p) { if (!freezeable(p)) continue; if (frozen(p)) continue; - - freeze(p); - spin_lock_irqsave(&p->sighand->siglock, flags); - signal_wake_up(p, 0); - spin_unlock_irqrestore(&p->sighand->siglock, flags); - todo++; + if (p->mm && !(p->flags & PF_BORROWED_MM)) { + /* The task is a user-space one. 
+ * Freeze it unless there's a vfork completion + * pending + */ + if (!p->vfork_done) + freeze_process(p); + nr_user++; + } else { + /* Freeze only if the user space is frozen */ + if (user_frozen) + freeze_process(p); + todo++; + } } while_each_thread(g, p); read_unlock(&tasklist_lock); + todo += nr_user; + if (!user_frozen && !nr_user) { + sys_sync(); + start_time = jiffies; + } + user_frozen = !nr_user; yield(); /* Yield is okay here */ - if (todo && time_after(jiffies, start_time + TIMEOUT)) { - printk( "\n" ); - printk(KERN_ERR " stopping tasks timed out (%d tasks remaining)\n", todo ); + if (todo && time_after(jiffies, start_time + TIMEOUT)) break; - } } while(todo); /* This does not unfreeze processes that are already frozen @@ -94,8 +119,14 @@ int freeze_processes(void) * but it cleans up leftover PF_FREEZE requests. */ if (todo) { + printk( "\n" ); + printk(KERN_ERR " stopping tasks timed out " + "after %d seconds (%d tasks remaining):\n", + TIMEOUT / HZ, todo); read_lock(&tasklist_lock); - do_each_thread(g, p) + do_each_thread(g, p) { + if (freezeable(p) && !frozen(p)) + printk(KERN_ERR " %s\n", p->comm); if (freezing(p)) { pr_debug(" clean up: %s\n", p->comm); p->flags &= ~PF_FREEZE; @@ -103,7 +134,7 @@ int freeze_processes(void) recalc_sigpending_tsk(p); spin_unlock_irqrestore(&p->sighand->siglock, flags); } - while_each_thread(g, p); + } while_each_thread(g, p); read_unlock(&tasklist_lock); return todo; } diff --git a/kernel/power/user.c b/kernel/power/user.c index 8cabc405ca10..a97406b86ef3 100644 --- a/kernel/power/user.c +++ b/kernel/power/user.c @@ -138,7 +138,6 @@ static int snapshot_ioctl(struct inode *inode, struct file *filp, case SNAPSHOT_FREEZE: if (data->frozen) break; - sys_sync(); down(&pm_sem); pm_prepare_console(); disable_nonboot_cpus(); -- cgit v1.2.3 From ce6ed29f3136bc4b3644ecf4091d6390d444f628 Mon Sep 17 00:00:00 2001 From: Pavel Machek Date: Thu, 23 Mar 2006 03:00:05 -0800 Subject: [PATCH] suspend: make progress printing prettier 
Combination of printk/pr_debug led to <7> in the middle of the line, and we printed way too many dots. Signed-off-by: Pavel Machek Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/power/snapshot.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c index 0036955357e0..1b46c2da5a50 100644 --- a/kernel/power/snapshot.c +++ b/kernel/power/snapshot.c @@ -83,7 +83,7 @@ static int save_highmem_zone(struct zone *zone) void *kaddr; unsigned long pfn = zone_pfn + zone->zone_start_pfn; - if (!(pfn%1000)) + if (!(pfn%10000)) printk("."); if (!pfn_valid(pfn)) continue; @@ -122,13 +122,14 @@ int save_highmem(void) struct zone *zone; int res = 0; - pr_debug("swsusp: Saving Highmem\n"); + pr_debug("swsusp: Saving Highmem"); for_each_zone (zone) { if (is_highmem(zone)) res = save_highmem_zone(zone); if (res) return res; } + printk("\n"); return 0; } -- cgit v1.2.3 From fc558a7496bfab3d29a68953b07a95883fdcfbb1 Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Thu, 23 Mar 2006 03:00:05 -0800 Subject: [PATCH] swsusp: finally solve mysqld problem This patch from Pavel moves userland freeze signals handling into more logical place. It now hits even with mysqld running. Signed-off-by: Rafael J. 
Wysocki Signed-off-by: Pavel Machek Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/signal.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/signal.c b/kernel/signal.c index ea154104a00b..dfb09ba5c013 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -1922,6 +1922,8 @@ int get_signal_to_deliver(siginfo_t *info, struct k_sigaction *return_ka, sigset_t *mask = &current->blocked; int signr = 0; + try_to_freeze(); + relock: spin_lock_irq(&current->sighand->siglock); for (;;) { @@ -2307,7 +2309,6 @@ sys_rt_sigtimedwait(const sigset_t __user *uthese, timeout = schedule_timeout_interruptible(timeout); - try_to_freeze(); spin_lock_irq(&current->sighand->siglock); sig = dequeue_signal(current, &these, &info); current->blocked = current->real_blocked; -- cgit v1.2.3 From e4e4d665560c75afb6060cb43bb6738777648ca1 Mon Sep 17 00:00:00 2001 From: Shaohua Li Date: Thu, 23 Mar 2006 03:00:06 -0800 Subject: [PATCH] swsusp: drain high mem pages Highmem could be in pcp list as well. Signed-off-by: Shaohua Li Acked-by: Pavel Machek Cc: "Rafael J. Wysocki" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/power/snapshot.c | 1 + 1 file changed, 1 insertion(+) (limited to 'kernel') diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c index 1b46c2da5a50..c5863d02c89e 100644 --- a/kernel/power/snapshot.c +++ b/kernel/power/snapshot.c @@ -123,6 +123,7 @@ int save_highmem(void) int res = 0; pr_debug("swsusp: Saving Highmem"); + drain_local_pages(); for_each_zone (zone) { if (is_highmem(zone)) res = save_highmem_zone(zone); -- cgit v1.2.3 From 94c188d32996beac00426740974310e32f162c14 Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Thu, 23 Mar 2006 03:00:08 -0800 Subject: [PATCH] swsusp: let userland tools switch console on suspend Remove the console-switching code from the suspend part of the swsusp userland interface and let the userland tools switch the console. Signed-off-by: Rafael J.
Wysocki Acked-by: Pavel Machek Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/power/user.c | 3 --- 1 file changed, 3 deletions(-) (limited to 'kernel') diff --git a/kernel/power/user.c b/kernel/power/user.c index a97406b86ef3..bbd4842104aa 100644 --- a/kernel/power/user.c +++ b/kernel/power/user.c @@ -139,12 +139,10 @@ static int snapshot_ioctl(struct inode *inode, struct file *filp, if (data->frozen) break; down(&pm_sem); - pm_prepare_console(); disable_nonboot_cpus(); if (freeze_processes()) { thaw_processes(); enable_nonboot_cpus(); - pm_restore_console(); error = -EBUSY; } up(&pm_sem); @@ -158,7 +156,6 @@ static int snapshot_ioctl(struct inode *inode, struct file *filp, down(&pm_sem); thaw_processes(); enable_nonboot_cpus(); - pm_restore_console(); up(&pm_sem); data->frozen = 0; break; -- cgit v1.2.3 From 9b238205ba5d79a8a242d7a5ddb82b89e4dc4e48 Mon Sep 17 00:00:00 2001 From: Luca Tettamanti Date: Thu, 23 Mar 2006 03:00:09 -0800 Subject: [PATCH] swsusp: add s2ram ioctl to userland interface Add the SNAPSHOT_S2RAM ioctl to the snapshot device. This ioctl allows a userland application to make the system (previously frozen with the SNAPSHOT_FREEZE ioctl) enter the S3 state without freezing processes and disabling nonboot CPUs for the second time. This will allow us to implement the suspend-to-disk-and-RAM (STDR) functionality in the userland suspend tools. Signed-off-by: Luca Tettamanti Signed-off-by: Rafael J.
Wysocki Cc: Pavel Machek Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/power/main.c | 2 +- kernel/power/power.h | 4 +++- kernel/power/user.c | 36 ++++++++++++++++++++++++++++++++++++ 3 files changed, 40 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/power/main.c b/kernel/power/main.c index 9cb235cba4a9..ee371f50ccaa 100644 --- a/kernel/power/main.c +++ b/kernel/power/main.c @@ -103,7 +103,7 @@ static int suspend_prepare(suspend_state_t state) } -static int suspend_enter(suspend_state_t state) +int suspend_enter(suspend_state_t state) { int error = 0; unsigned long flags; diff --git a/kernel/power/power.h b/kernel/power/power.h index 42c431c8bdde..f06f12f21767 100644 --- a/kernel/power/power.h +++ b/kernel/power/power.h @@ -77,7 +77,8 @@ int snapshot_image_loaded(struct snapshot_handle *handle); #define SNAPSHOT_GET_SWAP_PAGE _IOR(SNAPSHOT_IOC_MAGIC, 8, void *) #define SNAPSHOT_FREE_SWAP_PAGES _IO(SNAPSHOT_IOC_MAGIC, 9) #define SNAPSHOT_SET_SWAP_FILE _IOW(SNAPSHOT_IOC_MAGIC, 10, unsigned int) -#define SNAPSHOT_IOC_MAXNR 10 +#define SNAPSHOT_S2RAM _IO(SNAPSHOT_IOC_MAGIC, 11) +#define SNAPSHOT_IOC_MAXNR 11 /** * The bitmap is used for tracing allocated swap pages @@ -112,3 +113,4 @@ extern int swsusp_resume(void); extern int swsusp_read(void); extern int swsusp_write(void); extern void swsusp_close(void); +extern int suspend_enter(suspend_state_t state); diff --git a/kernel/power/user.c b/kernel/power/user.c index bbd4842104aa..3f1539fbe48a 100644 --- a/kernel/power/user.c +++ b/kernel/power/user.c @@ -266,6 +266,42 @@ static int snapshot_ioctl(struct inode *inode, struct file *filp, } break; + case SNAPSHOT_S2RAM: + if (!data->frozen) { + error = -EPERM; + break; + } + + if (down_trylock(&pm_sem)) { + error = -EBUSY; + break; + } + + if (pm_ops->prepare) { + error = pm_ops->prepare(PM_SUSPEND_MEM); + if (error) + goto OutS3; + } + + /* Put devices to sleep */ + error = device_suspend(PMSG_SUSPEND); + if (error) { + 
printk(KERN_ERR "Failed to suspend some devices.\n"); + } else { + /* Enter S3, system is already frozen */ + suspend_enter(PM_SUSPEND_MEM); + + /* Wake up devices */ + device_resume(); + } + + if (pm_ops->finish) + pm_ops->finish(PM_SUSPEND_MEM); + +OutS3: + up(&pm_sem); + break; + default: error = -ENOTTY; -- cgit v1.2.3 From 0c9e63fd38a2fb2181668a0cdd622a3c23cfd567 Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Thu, 23 Mar 2006 03:00:12 -0800 Subject: [PATCH] Shrinks sizeof(files_struct) and better layout 1) Reduce the size of (struct fdtable) to exactly 64 bytes on 32-bit platforms, lowering kmalloc() allocated space by 50%. 2) Reduce the size of (files_struct), using a special 32-bit (or 64-bit) embedded_fd_set, instead of a 1024-bit fd_set for the close_on_exec_init and open_fds_init fields. This saves some RAM (248 bytes per task) as most tasks don't open more than 32 files. D-Cache footprint for such tasks is also reduced to the minimum. 3) Reduce size of allocated fdset. Currently two full pages are allocated, that is 32768 bits on x86 for example, which is way too much. The minimum is now L1_CACHE_BYTES. UP and SMP should benefit from this patch, because most tasks will touch only one cache line when open()/close() stdin/stdout/stderr (0/1/2), (next_fd, close_on_exec_init, open_fds_init, fd_array[0 ..
2] being in the same cache line) Signed-off-by: Eric Dumazet Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/fork.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/fork.c b/kernel/fork.c index 9bd7b65ee418..c79ae0b19a49 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -607,12 +607,12 @@ static struct files_struct *alloc_files(void) atomic_set(&newf->count, 1); spin_lock_init(&newf->file_lock); + newf->next_fd = 0; fdt = &newf->fdtab; - fdt->next_fd = 0; fdt->max_fds = NR_OPEN_DEFAULT; - fdt->max_fdset = __FD_SETSIZE; - fdt->close_on_exec = &newf->close_on_exec_init; - fdt->open_fds = &newf->open_fds_init; + fdt->max_fdset = EMBEDDED_FD_SET_SIZE; + fdt->close_on_exec = (fd_set *)&newf->close_on_exec_init; + fdt->open_fds = (fd_set *)&newf->open_fds_init; fdt->fd = &newf->fd_array[0]; INIT_RCU_HEAD(&fdt->rcu); fdt->free_files = NULL; -- cgit v1.2.3 From 2dd0ebcd2ab7b18a50c0810ddb45a84316e4ee2e Mon Sep 17 00:00:00 2001 From: Ravikiran G Thirumalai Date: Thu, 23 Mar 2006 03:00:13 -0800 Subject: [PATCH] Avoid taking global tasklist_lock for single threaded process at getrusage() Avoid taking the global tasklist_lock when possible, if a process is single threaded during getrusage(). Any avoidance of tasklist_lock is good for NUMA boxes (and possibly for large SMPs). Thanks to Oleg Nesterov for review and suggestions. Signed-off-by: Nippun Goel Signed-off-by: Ravikiran Thirumalai Signed-off-by: Shai Fultheim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/sys.c | 42 ++++++++++++++++++++++++++++++++++-------- 1 file changed, 34 insertions(+), 8 deletions(-) (limited to 'kernel') diff --git a/kernel/sys.c b/kernel/sys.c index f91218a5463e..4941b9b14b97 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -1677,9 +1677,6 @@ asmlinkage long sys_setrlimit(unsigned int resource, struct rlimit __user *rlim) * a lot simpler!
(Which we're not doing right now because we're not * measuring them yet). * - * This expects to be called with tasklist_lock read-locked or better, - * and the siglock not locked. It may momentarily take the siglock. - * * When sampling multiple threads for RUSAGE_SELF, under SMP we might have * races with threads incrementing their own counters. But since word * reads are atomic, we either get new values or old values and we don't @@ -1687,6 +1684,25 @@ asmlinkage long sys_setrlimit(unsigned int resource, struct rlimit __user *rlim) * the c* fields from p->signal from races with exit.c updating those * fields when reaping, so a sample either gets all the additions of a * given child after it's reaped, or none so this sample is before reaping. + * + * tasklist_lock locking optimisation: + * If we are current and single threaded, we do not need to take the tasklist + * lock or the siglock. No one else can take our signal_struct away, + * no one else can reap the children to update signal->c* counters, and + * no one else can race with the signal-> fields. + * If we do not take the tasklist_lock, the signal-> fields could be read + * out of order while another thread was just exiting. So we place a + * read memory barrier when we avoid the lock. On the writer side, + * write memory barrier is implied in __exit_signal as __exit_signal releases + * the siglock spinlock after updating the signal-> fields. + * + * We don't really need the siglock when we access the non c* fields + * of the signal_struct (for RUSAGE_SELF) even in multithreaded + * case, since we take the tasklist lock for read and the non c* signal-> + * fields are updated only in __exit_signal, which is called with + * tasklist_lock taken for write, hence these two threads cannot execute + * concurrently. 
+ * */ static void k_getrusage(struct task_struct *p, int who, struct rusage *r) @@ -1694,13 +1710,23 @@ static void k_getrusage(struct task_struct *p, int who, struct rusage *r) struct task_struct *t; unsigned long flags; cputime_t utime, stime; + int need_lock = 0; memset((char *) r, 0, sizeof *r); + utime = stime = cputime_zero; - if (unlikely(!p->signal)) - return; + if (p != current || !thread_group_empty(p)) + need_lock = 1; - utime = stime = cputime_zero; + if (need_lock) { + read_lock(&tasklist_lock); + if (unlikely(!p->signal)) { + read_unlock(&tasklist_lock); + return; + } + } else + /* See locking comments above */ + smp_rmb(); switch (who) { case RUSAGE_BOTH: @@ -1740,6 +1766,8 @@ static void k_getrusage(struct task_struct *p, int who, struct rusage *r) BUG(); } + if (need_lock) + read_unlock(&tasklist_lock); cputime_to_timeval(utime, &r->ru_utime); cputime_to_timeval(stime, &r->ru_stime); } @@ -1747,9 +1775,7 @@ static void k_getrusage(struct task_struct *p, int who, struct rusage *r) int getrusage(struct task_struct *p, int who, struct rusage __user *ru) { struct rusage r; - read_lock(&tasklist_lock); k_getrusage(p, who, &r); - read_unlock(&tasklist_lock); return copy_to_user(ru, &r, sizeof(r)) ? -EFAULT : 0; } -- cgit v1.2.3 From 3d3f26a7baaa921a0e790b4c72d20f0de91a5d65 Mon Sep 17 00:00:00 2001 From: Ingo Molnar Date: Thu, 23 Mar 2006 03:00:18 -0800 Subject: [PATCH] kernel/cpuset.c, mutex conversion convert cpuset.c's callback_sem and manage_sem to mutexes. Build and boot tested by Ingo. Build, boot, unit and stress tested by pj. 
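For readers who want to see the shape of the conversion outside the kernel, here is a rough userspace sketch, not the kernel code itself. It assumes POSIX pthread mutexes as a stand-in for the kernel's DEFINE_MUTEX(), mutex_lock() and mutex_unlock(); shared_value, update_value() and read_value() are hypothetical names that do not appear in cpuset.c. It only illustrates the nesting order used for the two locks: manage_mutex first, then callback_mutex.

```c
#include <assert.h>
#include <pthread.h>

/*
 * Userspace stand-ins for the two cpuset locks. In the kernel these
 * are declared with DEFINE_MUTEX(); pthreads is an assumption made
 * only so the sketch can be built outside the kernel.
 */
static pthread_mutex_t manage_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t callback_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical state that needs both locks held to be modified. */
static int shared_value;

/* Writer: take manage_mutex first, then nest callback_mutex. */
static void update_value(int v)
{
	pthread_mutex_lock(&manage_mutex);
	/* checks and memory allocation may happen with only manage_mutex held */
	pthread_mutex_lock(&callback_mutex);
	shared_value = v;
	pthread_mutex_unlock(&callback_mutex);
	pthread_mutex_unlock(&manage_mutex);
}

/* Reader: holding callback_mutex alone gives read-only access. */
static int read_value(void)
{
	int v;

	pthread_mutex_lock(&callback_mutex);
	v = shared_value;
	pthread_mutex_unlock(&callback_mutex);
	return v;
}
```

A caller would use update_value() for modifications and read_value() for short queries. Keeping the reader on the inner lock alone is what lets it stay brief while a writer is still doing its checks under the outer lock.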
Signed-off-by: Ingo Molnar Signed-off-by: Paul Jackson Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/cpuset.c | 212 +++++++++++++++++++++++++++----------------------------- 1 file changed, 103 insertions(+), 109 deletions(-) (limited to 'kernel') diff --git a/kernel/cpuset.c b/kernel/cpuset.c index 12815d3f1a05..c86ee051b734 100644 --- a/kernel/cpuset.c +++ b/kernel/cpuset.c @@ -53,7 +53,7 @@ #include #include -#include +#include #define CPUSET_SUPER_MAGIC 0x27e0eb @@ -168,63 +168,57 @@ static struct vfsmount *cpuset_mount; static struct super_block *cpuset_sb; /* - * We have two global cpuset semaphores below. They can nest. - * It is ok to first take manage_sem, then nest callback_sem. We also + * We have two global cpuset mutexes below. They can nest. + * It is ok to first take manage_mutex, then nest callback_mutex. We also * require taking task_lock() when dereferencing a tasks cpuset pointer. * See "The task_lock() exception", at the end of this comment. * - * A task must hold both semaphores to modify cpusets. If a task - * holds manage_sem, then it blocks others wanting that semaphore, - * ensuring that it is the only task able to also acquire callback_sem + * A task must hold both mutexes to modify cpusets. If a task + * holds manage_mutex, then it blocks others wanting that mutex, + * ensuring that it is the only task able to also acquire callback_mutex * and be able to modify cpusets. It can perform various checks on * the cpuset structure first, knowing nothing will change. It can - * also allocate memory while just holding manage_sem. While it is + * also allocate memory while just holding manage_mutex. While it is * performing these checks, various callback routines can briefly - * acquire callback_sem to query cpusets. Once it is ready to make - * the changes, it takes callback_sem, blocking everyone else. + * acquire callback_mutex to query cpusets. 
Once it is ready to make + * the changes, it takes callback_mutex, blocking everyone else. * * Calls to the kernel memory allocator can not be made while holding - * callback_sem, as that would risk double tripping on callback_sem + * callback_mutex, as that would risk double tripping on callback_mutex * from one of the callbacks into the cpuset code from within * __alloc_pages(). * - * If a task is only holding callback_sem, then it has read-only + * If a task is only holding callback_mutex, then it has read-only * access to cpusets. * * The task_struct fields mems_allowed and mems_generation may only * be accessed in the context of that task, so require no locks. * * Any task can increment and decrement the count field without lock. - * So in general, code holding manage_sem or callback_sem can't rely + * So in general, code holding manage_mutex or callback_mutex can't rely * on the count field not changing. However, if the count goes to - * zero, then only attach_task(), which holds both semaphores, can + * zero, then only attach_task(), which holds both mutexes, can * increment it again. Because a count of zero means that no tasks * are currently attached, therefore there is no way a task attached * to that cpuset can fork (the other way to increment the count). - * So code holding manage_sem or callback_sem can safely assume that + * So code holding manage_mutex or callback_mutex can safely assume that * if the count is zero, it will stay zero. Similarly, if a task - * holds manage_sem or callback_sem on a cpuset with zero count, it + * holds manage_mutex or callback_mutex on a cpuset with zero count, it * knows that the cpuset won't be removed, as cpuset_rmdir() needs - * both of those semaphores. 
- * - * A possible optimization to improve parallelism would be to make - * callback_sem a R/W semaphore (rwsem), allowing the callback routines - * to proceed in parallel, with read access, until the holder of - * manage_sem needed to take this rwsem for exclusive write access - * and modify some cpusets. + * both of those mutexes. * * The cpuset_common_file_write handler for operations that modify - * the cpuset hierarchy holds manage_sem across the entire operation, + * the cpuset hierarchy holds manage_mutex across the entire operation, * single threading all such cpuset modifications across the system. * - * The cpuset_common_file_read() handlers only hold callback_sem across + * The cpuset_common_file_read() handlers only hold callback_mutex across * small pieces of code, such as when reading out possibly multi-word * cpumasks and nodemasks. * * The fork and exit callbacks cpuset_fork() and cpuset_exit(), don't - * (usually) take either semaphore. These are the two most performance + * (usually) take either mutex. These are the two most performance * critical pieces of code here. The exception occurs on cpuset_exit(), - * when a task in a notify_on_release cpuset exits. Then manage_sem + * when a task in a notify_on_release cpuset exits. Then manage_mutex * is taken, and if the cpuset count is zero, a usermode call made * to /sbin/cpuset_release_agent with the name of the cpuset (path * relative to the root of cpuset file system) as the argument. @@ -242,9 +236,9 @@ static struct super_block *cpuset_sb; * * The need for this exception arises from the action of attach_task(), * which overwrites one tasks cpuset pointer with another. It does - * so using both semaphores, however there are several performance + * so using both mutexes, however there are several performance * critical places that need to reference task->cpuset without the - * expense of grabbing a system global semaphore. Therefore except as + * expense of grabbing a system global mutex. 
Therefore except as * noted below, when dereferencing or, as in attach_task(), modifying * a tasks cpuset pointer we use task_lock(), which acts on a spinlock * (task->alloc_lock) already in the task_struct routinely used for @@ -256,8 +250,8 @@ static struct super_block *cpuset_sb; * the routine cpuset_update_task_memory_state(). */ -static DECLARE_MUTEX(manage_sem); -static DECLARE_MUTEX(callback_sem); +static DEFINE_MUTEX(manage_mutex); +static DEFINE_MUTEX(callback_mutex); /* * A couple of forward declarations required, due to cyclic reference loop: @@ -432,7 +426,7 @@ static inline struct cftype *__d_cft(struct dentry *dentry) } /* - * Call with manage_sem held. Writes path of cpuset into buf. + * Call with manage_mutex held. Writes path of cpuset into buf. * Returns 0 on success, -errno on error. */ @@ -484,11 +478,11 @@ static int cpuset_path(const struct cpuset *cs, char *buf, int buflen) * status of the /sbin/cpuset_release_agent task, so no sense holding * our caller up for that. * - * When we had only one cpuset semaphore, we had to call this + * When we had only one cpuset mutex, we had to call this * without holding it, to avoid deadlock when call_usermodehelper() * allocated memory. With two locks, we could now call this while - * holding manage_sem, but we still don't, so as to minimize - * the time manage_sem is held. + * holding manage_mutex, but we still don't, so as to minimize + * the time manage_mutex is held. */ static void cpuset_release_agent(const char *pathbuf) @@ -520,15 +514,15 @@ static void cpuset_release_agent(const char *pathbuf) * cs is notify_on_release() and now both the user count is zero and * the list of children is empty, prepare cpuset path in a kmalloc'd * buffer, to be returned via ppathbuf, so that the caller can invoke - * cpuset_release_agent() with it later on, once manage_sem is dropped. - * Call here with manage_sem held. + * cpuset_release_agent() with it later on, once manage_mutex is dropped. 
+ * Call here with manage_mutex held. * * This check_for_release() routine is responsible for kmalloc'ing * pathbuf. The above cpuset_release_agent() is responsible for * kfree'ing pathbuf. The caller of these routines is responsible * for providing a pathbuf pointer, initialized to NULL, then - * calling check_for_release() with manage_sem held and the address - * of the pathbuf pointer, then dropping manage_sem, then calling + * calling check_for_release() with manage_mutex held and the address + * of the pathbuf pointer, then dropping manage_mutex, then calling * cpuset_release_agent() with pathbuf, as set by check_for_release(). */ @@ -559,7 +553,7 @@ static void check_for_release(struct cpuset *cs, char **ppathbuf) * One way or another, we guarantee to return some non-empty subset * of cpu_online_map. * - * Call with callback_sem held. + * Call with callback_mutex held. */ static void guarantee_online_cpus(const struct cpuset *cs, cpumask_t *pmask) @@ -583,7 +577,7 @@ static void guarantee_online_cpus(const struct cpuset *cs, cpumask_t *pmask) * One way or another, we guarantee to return some non-empty subset * of node_online_map. * - * Call with callback_sem held. + * Call with callback_mutex held. */ static void guarantee_online_mems(const struct cpuset *cs, nodemask_t *pmask) @@ -608,12 +602,12 @@ static void guarantee_online_mems(const struct cpuset *cs, nodemask_t *pmask) * current->cpuset if a task has its memory placement changed. * Do not call this routine if in_interrupt(). * - * Call without callback_sem or task_lock() held. May be called - * with or without manage_sem held. Doesn't need task_lock to guard + * Call without callback_mutex or task_lock() held. May be called + * with or without manage_mutex held. 
Doesn't need task_lock to guard * against another task changing a non-NULL cpuset pointer to NULL, * as that is only done by a task on itself, and if the current task * is here, it is not simultaneously in the exit code NULL'ing its - * cpuset pointer. This routine also might acquire callback_sem and + * cpuset pointer. This routine also might acquire callback_mutex and * current->mm->mmap_sem during call. * * Reading current->cpuset->mems_generation doesn't need task_lock @@ -658,13 +652,13 @@ void cpuset_update_task_memory_state(void) } if (my_cpusets_mem_gen != tsk->cpuset_mems_generation) { - down(&callback_sem); + mutex_lock(&callback_mutex); task_lock(tsk); cs = tsk->cpuset; /* Maybe changed when task not locked */ guarantee_online_mems(cs, &tsk->mems_allowed); tsk->cpuset_mems_generation = cs->mems_generation; task_unlock(tsk); - up(&callback_sem); + mutex_unlock(&callback_mutex); mpol_rebind_task(tsk, &tsk->mems_allowed); } } @@ -674,7 +668,7 @@ void cpuset_update_task_memory_state(void) * * One cpuset is a subset of another if all its allowed CPUs and * Memory Nodes are a subset of the other, and its exclusive flags - * are only set if the other's are set. Call holding manage_sem. + * are only set if the other's are set. Call holding manage_mutex. */ static int is_cpuset_subset(const struct cpuset *p, const struct cpuset *q) @@ -692,7 +686,7 @@ static int is_cpuset_subset(const struct cpuset *p, const struct cpuset *q) * If we replaced the flag and mask values of the current cpuset * (cur) with those values in the trial cpuset (trial), would * our various subset and exclusive rules still be valid? Presumes - * manage_sem held. + * manage_mutex held. * * 'cur' is the address of an actual, in-use cpuset. 
Operations * such as list traversal that depend on the actual address of the @@ -746,7 +740,7 @@ static int validate_change(const struct cpuset *cur, const struct cpuset *trial) * exclusive child cpusets * Build these two partitions by calling partition_sched_domains * - * Call with manage_sem held. May nest a call to the + * Call with manage_mutex held. May nest a call to the * lock_cpu_hotplug()/unlock_cpu_hotplug() pair. */ @@ -792,7 +786,7 @@ static void update_cpu_domains(struct cpuset *cur) } /* - * Call with manage_sem held. May take callback_sem during call. + * Call with manage_mutex held. May take callback_mutex during call. */ static int update_cpumask(struct cpuset *cs, char *buf) @@ -811,9 +805,9 @@ static int update_cpumask(struct cpuset *cs, char *buf) if (retval < 0) return retval; cpus_unchanged = cpus_equal(cs->cpus_allowed, trialcs.cpus_allowed); - down(&callback_sem); + mutex_lock(&callback_mutex); cs->cpus_allowed = trialcs.cpus_allowed; - up(&callback_sem); + mutex_unlock(&callback_mutex); if (is_cpu_exclusive(cs) && !cpus_unchanged) update_cpu_domains(cs); return 0; @@ -827,7 +821,7 @@ static int update_cpumask(struct cpuset *cs, char *buf) * the cpuset is marked 'memory_migrate', migrate the tasks * pages to the new memory. * - * Call with manage_sem held. May take callback_sem during call. + * Call with manage_mutex held. May take callback_mutex during call. * Will take tasklist_lock, scan tasklist for tasks in cpuset cs, * lock each such tasks mm->mmap_sem, scan its vma's and rebind * their mempolicies to the cpusets new mems_allowed. 
@@ -862,11 +856,11 @@ static int update_nodemask(struct cpuset *cs, char *buf) if (retval < 0) goto done; - down(&callback_sem); + mutex_lock(&callback_mutex); cs->mems_allowed = trialcs.mems_allowed; atomic_inc(&cpuset_mems_generation); cs->mems_generation = atomic_read(&cpuset_mems_generation); - up(&callback_sem); + mutex_unlock(&callback_mutex); set_cpuset_being_rebound(cs); /* causes mpol_copy() rebind */ @@ -922,7 +916,7 @@ static int update_nodemask(struct cpuset *cs, char *buf) * tasklist_lock. Forks can happen again now - the mpol_copy() * cpuset_being_rebound check will catch such forks, and rebind * their vma mempolicies too. Because we still hold the global - * cpuset manage_sem, we know that no other rebind effort will + * cpuset manage_mutex, we know that no other rebind effort will * be contending for the global variable cpuset_being_rebound. * It's ok if we rebind the same mm twice; mpol_rebind_mm() * is idempotent. Also migrate pages in each mm to new nodes. @@ -948,7 +942,7 @@ done: } /* - * Call with manage_sem held. + * Call with manage_mutex held. */ static int update_memory_pressure_enabled(struct cpuset *cs, char *buf) @@ -967,7 +961,7 @@ static int update_memory_pressure_enabled(struct cpuset *cs, char *buf) * cs: the cpuset to update * buf: the buffer where we read the 0 or 1 * - * Call with manage_sem held. + * Call with manage_mutex held. 
*/ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs, char *buf) @@ -989,12 +983,12 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs, char *buf) return err; cpu_exclusive_changed = (is_cpu_exclusive(cs) != is_cpu_exclusive(&trialcs)); - down(&callback_sem); + mutex_lock(&callback_mutex); if (turning_on) set_bit(bit, &cs->flags); else clear_bit(bit, &cs->flags); - up(&callback_sem); + mutex_unlock(&callback_mutex); if (cpu_exclusive_changed) update_cpu_domains(cs); @@ -1104,7 +1098,7 @@ static int fmeter_getrate(struct fmeter *fmp) * writing the path of the old cpuset in 'ppathbuf' if it needs to be * notified on release. * - * Call holding manage_sem. May take callback_sem and task_lock of + * Call holding manage_mutex. May take callback_mutex and task_lock of * the task 'pid' during call. */ @@ -1144,13 +1138,13 @@ static int attach_task(struct cpuset *cs, char *pidbuf, char **ppathbuf) get_task_struct(tsk); } - down(&callback_sem); + mutex_lock(&callback_mutex); task_lock(tsk); oldcs = tsk->cpuset; if (!oldcs) { task_unlock(tsk); - up(&callback_sem); + mutex_unlock(&callback_mutex); put_task_struct(tsk); return -ESRCH; } @@ -1164,7 +1158,7 @@ static int attach_task(struct cpuset *cs, char *pidbuf, char **ppathbuf) from = oldcs->mems_allowed; to = cs->mems_allowed; - up(&callback_sem); + mutex_unlock(&callback_mutex); mm = get_task_mm(tsk); if (mm) { @@ -1221,7 +1215,7 @@ static ssize_t cpuset_common_file_write(struct file *file, const char __user *us } buffer[nbytes] = 0; /* nul-terminate */ - down(&manage_sem); + mutex_lock(&manage_mutex); if (is_removed(cs)) { retval = -ENODEV; @@ -1264,7 +1258,7 @@ static ssize_t cpuset_common_file_write(struct file *file, const char __user *us if (retval == 0) retval = nbytes; out2: - up(&manage_sem); + mutex_unlock(&manage_mutex); cpuset_release_agent(pathbuf); out1: kfree(buffer); @@ -1304,9 +1298,9 @@ static int cpuset_sprintf_cpulist(char *page, struct cpuset *cs) { cpumask_t mask; - 
down(&callback_sem); + mutex_lock(&callback_mutex); mask = cs->cpus_allowed; - up(&callback_sem); + mutex_unlock(&callback_mutex); return cpulist_scnprintf(page, PAGE_SIZE, mask); } @@ -1315,9 +1309,9 @@ static int cpuset_sprintf_memlist(char *page, struct cpuset *cs) { nodemask_t mask; - down(&callback_sem); + mutex_lock(&callback_mutex); mask = cs->mems_allowed; - up(&callback_sem); + mutex_unlock(&callback_mutex); return nodelist_scnprintf(page, PAGE_SIZE, mask); } @@ -1598,7 +1592,7 @@ static int pid_array_to_buf(char *buf, int sz, pid_t *a, int npids) * Handle an open on 'tasks' file. Prepare a buffer listing the * process id's of tasks currently attached to the cpuset being opened. * - * Does not require any specific cpuset semaphores, and does not take any. + * Does not require any specific cpuset mutexes, and does not take any. */ static int cpuset_tasks_open(struct inode *unused, struct file *file) { @@ -1754,7 +1748,7 @@ static int cpuset_populate_dir(struct dentry *cs_dentry) * name: name of the new cpuset. Will be strcpy'ed. 
* mode: mode to set on new inode * - * Must be called with the semaphore on the parent inode held + * Must be called with the mutex on the parent inode held */ static long cpuset_create(struct cpuset *parent, const char *name, int mode) @@ -1766,7 +1760,7 @@ static long cpuset_create(struct cpuset *parent, const char *name, int mode) if (!cs) return -ENOMEM; - down(&manage_sem); + mutex_lock(&manage_mutex); cpuset_update_task_memory_state(); cs->flags = 0; if (notify_on_release(parent)) @@ -1782,28 +1776,28 @@ static long cpuset_create(struct cpuset *parent, const char *name, int mode) cs->parent = parent; - down(&callback_sem); + mutex_lock(&callback_mutex); list_add(&cs->sibling, &cs->parent->children); number_of_cpusets++; - up(&callback_sem); + mutex_unlock(&callback_mutex); err = cpuset_create_dir(cs, name, mode); if (err < 0) goto err; /* - * Release manage_sem before cpuset_populate_dir() because it + * Release manage_mutex before cpuset_populate_dir() because it * will down() this new directory's i_mutex and if we race with * another mkdir, we might deadlock. 
*/ - up(&manage_sem); + mutex_unlock(&manage_mutex); err = cpuset_populate_dir(cs->dentry); /* If err < 0, we have a half-filled directory - oh well ;) */ return 0; err: list_del(&cs->sibling); - up(&manage_sem); + mutex_unlock(&manage_mutex); kfree(cs); return err; } @@ -1825,18 +1819,18 @@ static int cpuset_rmdir(struct inode *unused_dir, struct dentry *dentry) /* the vfs holds both inode->i_mutex already */ - down(&manage_sem); + mutex_lock(&manage_mutex); cpuset_update_task_memory_state(); if (atomic_read(&cs->count) > 0) { - up(&manage_sem); + mutex_unlock(&manage_mutex); return -EBUSY; } if (!list_empty(&cs->children)) { - up(&manage_sem); + mutex_unlock(&manage_mutex); return -EBUSY; } parent = cs->parent; - down(&callback_sem); + mutex_lock(&callback_mutex); set_bit(CS_REMOVED, &cs->flags); if (is_cpu_exclusive(cs)) update_cpu_domains(cs); @@ -1848,10 +1842,10 @@ static int cpuset_rmdir(struct inode *unused_dir, struct dentry *dentry) cpuset_d_remove_dir(d); dput(d); number_of_cpusets--; - up(&callback_sem); + mutex_unlock(&callback_mutex); if (list_empty(&parent->children)) check_for_release(parent, &pathbuf); - up(&manage_sem); + mutex_unlock(&manage_mutex); cpuset_release_agent(pathbuf); return 0; } @@ -1960,19 +1954,19 @@ void cpuset_fork(struct task_struct *child) * Description: Detach cpuset from @tsk and release it. * * Note that cpusets marked notify_on_release force every task in - * them to take the global manage_sem semaphore when exiting. + * them to take the global manage_mutex mutex when exiting. * This could impact scaling on very large systems. Be reluctant to * use notify_on_release cpusets where very high task exit scaling * is required on large systems. * * Don't even think about derefencing 'cs' after the cpuset use count - * goes to zero, except inside a critical section guarded by manage_sem - * or callback_sem. 
Otherwise a zero cpuset use count is a license to + * goes to zero, except inside a critical section guarded by manage_mutex + * or callback_mutex. Otherwise a zero cpuset use count is a license to * any other task to nuke the cpuset immediately, via cpuset_rmdir(). * - * This routine has to take manage_sem, not callback_sem, because - * it is holding that semaphore while calling check_for_release(), - * which calls kmalloc(), so can't be called holding callback__sem(). + * This routine has to take manage_mutex, not callback_mutex, because + * it is holding that mutex while calling check_for_release(), + * which calls kmalloc(), so can't be called holding callback_mutex(). * * We don't need to task_lock() this reference to tsk->cpuset, * because tsk is already marked PF_EXITING, so attach_task() won't @@ -2022,10 +2016,10 @@ void cpuset_exit(struct task_struct *tsk) if (notify_on_release(cs)) { char *pathbuf = NULL; - down(&manage_sem); + mutex_lock(&manage_mutex); if (atomic_dec_and_test(&cs->count)) check_for_release(cs, &pathbuf); - up(&manage_sem); + mutex_unlock(&manage_mutex); cpuset_release_agent(pathbuf); } else { atomic_dec(&cs->count); @@ -2046,11 +2040,11 @@ cpumask_t cpuset_cpus_allowed(struct task_struct *tsk) { cpumask_t mask; - down(&callback_sem); + mutex_lock(&callback_mutex); task_lock(tsk); guarantee_online_cpus(tsk->cpuset, &mask); task_unlock(tsk); - up(&callback_sem); + mutex_unlock(&callback_mutex); return mask; } @@ -2074,11 +2068,11 @@ nodemask_t cpuset_mems_allowed(struct task_struct *tsk) { nodemask_t mask; - down(&callback_sem); + mutex_lock(&callback_mutex); task_lock(tsk); guarantee_online_mems(tsk->cpuset, &mask); task_unlock(tsk); - up(&callback_sem); + mutex_unlock(&callback_mutex); return mask; } @@ -2104,7 +2098,7 @@ int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl) /* * nearest_exclusive_ancestor() - Returns the nearest mem_exclusive - * ancestor to the specified cpuset. Call holding callback_sem. 
+ * ancestor to the specified cpuset. Call holding callback_mutex. * If no ancestor is mem_exclusive (an unusual configuration), then * returns the root cpuset. */ @@ -2131,12 +2125,12 @@ static const struct cpuset *nearest_exclusive_ancestor(const struct cpuset *cs) * GFP_KERNEL allocations are not so marked, so can escape to the * nearest mem_exclusive ancestor cpuset. * - * Scanning up parent cpusets requires callback_sem. The __alloc_pages() + * Scanning up parent cpusets requires callback_mutex. The __alloc_pages() * routine only calls here with __GFP_HARDWALL bit _not_ set if * it's a GFP_KERNEL allocation, and all nodes in the current tasks * mems_allowed came up empty on the first pass over the zonelist. * So only GFP_KERNEL allocations, if all nodes in the cpuset are - * short of memory, might require taking the callback_sem semaphore. + * short of memory, might require taking the callback_mutex mutex. * * The first loop over the zonelist in mm/page_alloc.c:__alloc_pages() * calls here with __GFP_HARDWALL always set in gfp_mask, enforcing @@ -2171,31 +2165,31 @@ int __cpuset_zone_allowed(struct zone *z, gfp_t gfp_mask) return 1; /* Not hardwall and node outside mems_allowed: scan up cpusets */ - down(&callback_sem); + mutex_lock(&callback_mutex); task_lock(current); cs = nearest_exclusive_ancestor(current->cpuset); task_unlock(current); allowed = node_isset(node, cs->mems_allowed); - up(&callback_sem); + mutex_unlock(&callback_mutex); return allowed; } /** * cpuset_lock - lock out any changes to cpuset structures * - * The out of memory (oom) code needs to lock down cpusets + * The out of memory (oom) code needs to mutex_lock cpusets * from being changed while it scans the tasklist looking for a - * task in an overlapping cpuset. Expose callback_sem via this + * task in an overlapping cpuset. Expose callback_mutex via this * cpuset_lock() routine, so the oom code can lock it, before * locking the task list. 
The tasklist_lock is a spinlock, so - * must be taken inside callback_sem. + * must be taken inside callback_mutex. */ void cpuset_lock(void) { - down(&callback_sem); + mutex_lock(&callback_mutex); } /** @@ -2206,7 +2200,7 @@ void cpuset_lock(void) void cpuset_unlock(void) { - up(&callback_sem); + mutex_unlock(&callback_mutex); } /** @@ -2218,7 +2212,7 @@ void cpuset_unlock(void) * determine if task @p's memory usage might impact the memory * available to the current task. * - * Call while holding callback_sem. + * Call while holding callback_mutex. **/ int cpuset_excl_nodes_overlap(const struct task_struct *p) @@ -2289,7 +2283,7 @@ void __cpuset_memory_pressure_bump(void) * - Used for /proc/<pid>/cpuset. * - No need to task_lock(tsk) on this tsk->cpuset reference, as it * doesn't really matter if tsk->cpuset changes after we read it, - * and we take manage_sem, keeping attach_task() from changing it + * and we take manage_mutex, keeping attach_task() from changing it * anyway. */ @@ -2305,7 +2299,7 @@ static int proc_cpuset_show(struct seq_file *m, void *v) return -ENOMEM; tsk = m->private; - down(&manage_sem); + mutex_lock(&manage_mutex); cs = tsk->cpuset; if (!cs) { retval = -EINVAL; @@ -2318,7 +2312,7 @@ static int proc_cpuset_show(struct seq_file *m, void *v) seq_puts(m, buf); seq_putc(m, '\n'); out: - up(&manage_sem); + mutex_unlock(&manage_mutex); kfree(buf); return retval; } -- cgit v1.2.3 From 9331b3157c835353dd28efcd80d23563ad226aee Mon Sep 17 00:00:00 2001 From: Ingo Molnar Date: Thu, 23 Mar 2006 03:00:19 -0800 Subject: [PATCH] convert kernel/rcupdate.c:rcu_barrier_sema to mutex Convert kernel/rcupdate's rcu_barrier_sema to mutex. Signed-off-by: Ingo Molnar Acked-by: "Paul E.
McKenney" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/rcupdate.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/kernel/rcupdate.c b/kernel/rcupdate.c index fedf5e369755..af8a2a57e17d 100644 --- a/kernel/rcupdate.c +++ b/kernel/rcupdate.c @@ -47,6 +47,7 @@ #include #include #include +#include /* Definition for rcupdate control block. */ struct rcu_ctrlblk rcu_ctrlblk = { @@ -75,7 +76,7 @@ static int rsinterval = 1000; #endif static atomic_t rcu_barrier_cpu_count; -static struct semaphore rcu_barrier_sema; +static DEFINE_MUTEX(rcu_barrier_mutex); static struct completion rcu_barrier_completion; #ifdef CONFIG_SMP @@ -207,13 +208,13 @@ static void rcu_barrier_func(void *notused) void rcu_barrier(void) { BUG_ON(in_interrupt()); - /* Take cpucontrol semaphore to protect against CPU hotplug */ - down(&rcu_barrier_sema); + /* Take cpucontrol mutex to protect against CPU hotplug */ + mutex_lock(&rcu_barrier_mutex); init_completion(&rcu_barrier_completion); atomic_set(&rcu_barrier_cpu_count, 0); on_each_cpu(rcu_barrier_func, NULL, 0, 1); wait_for_completion(&rcu_barrier_completion); - up(&rcu_barrier_sema); + mutex_unlock(&rcu_barrier_mutex); } EXPORT_SYMBOL_GPL(rcu_barrier); @@ -549,7 +550,6 @@ static struct notifier_block __devinitdata rcu_nb = { */ void __init rcu_init(void) { - sema_init(&rcu_barrier_sema, 1); rcu_cpu_notify(&rcu_nb, CPU_UP_PREPARE, (void *)(long)smp_processor_id()); /* Register notifier for non-boot CPUs */ -- cgit v1.2.3 From 97d1f15b7ef52c1e9c28dc48b454024bb53a5fd2 Mon Sep 17 00:00:00 2001 From: Arjan van de Ven Date: Thu, 23 Mar 2006 03:00:24 -0800 Subject: [PATCH] sem2mutex: kernel/ Semaphore to mutex conversion. The conversion was generated via scripts, and the result was validated automatically via a script as well. 
Signed-off-by: Arjan van de Ven Signed-off-by: Ingo Molnar Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/kthread.c | 7 ++++--- kernel/module.c | 15 ++++++++------- kernel/posix-timers.c | 1 + kernel/power/pm.c | 21 +++++++++++---------- kernel/profile.c | 11 ++++++----- 5 files changed, 30 insertions(+), 25 deletions(-) (limited to 'kernel') diff --git a/kernel/kthread.c b/kernel/kthread.c index e75950a1092c..6a5373868a98 100644 --- a/kernel/kthread.c +++ b/kernel/kthread.c @@ -12,6 +12,7 @@ #include #include #include +#include #include /* @@ -41,7 +42,7 @@ struct kthread_stop_info /* Thread stopping is done by setthing this var: lock serializes * multiple kthread_stop calls. */ -static DECLARE_MUTEX(kthread_stop_lock); +static DEFINE_MUTEX(kthread_stop_lock); static struct kthread_stop_info kthread_stop_info; int kthread_should_stop(void) @@ -173,7 +174,7 @@ int kthread_stop_sem(struct task_struct *k, struct semaphore *s) { int ret; - down(&kthread_stop_lock); + mutex_lock(&kthread_stop_lock); /* It could exit after stop_info.k set, but before wake_up_process. 
*/ get_task_struct(k); @@ -194,7 +195,7 @@ int kthread_stop_sem(struct task_struct *k, struct semaphore *s) wait_for_completion(&kthread_stop_info.done); kthread_stop_info.k = NULL; ret = kthread_stop_info.err; - up(&kthread_stop_lock); + mutex_unlock(&kthread_stop_lock); return ret; } diff --git a/kernel/module.c b/kernel/module.c index 77764f22f021..de6312da6bb5 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -39,6 +39,7 @@ #include #include #include +#include #include #include #include @@ -63,15 +64,15 @@ static DEFINE_SPINLOCK(modlist_lock); static DECLARE_MUTEX(module_mutex); static LIST_HEAD(modules); -static DECLARE_MUTEX(notify_mutex); +static DEFINE_MUTEX(notify_mutex); static struct notifier_block * module_notify_list; int register_module_notifier(struct notifier_block * nb) { int err; - down(&notify_mutex); + mutex_lock(&notify_mutex); err = notifier_chain_register(&module_notify_list, nb); - up(&notify_mutex); + mutex_unlock(&notify_mutex); return err; } EXPORT_SYMBOL(register_module_notifier); @@ -79,9 +80,9 @@ EXPORT_SYMBOL(register_module_notifier); int unregister_module_notifier(struct notifier_block * nb) { int err; - down(&notify_mutex); + mutex_lock(&notify_mutex); err = notifier_chain_unregister(&module_notify_list, nb); - up(&notify_mutex); + mutex_unlock(&notify_mutex); return err; } EXPORT_SYMBOL(unregister_module_notifier); @@ -1989,9 +1990,9 @@ sys_init_module(void __user *umod, /* Drop lock so they can recurse */ up(&module_mutex); - down(&notify_mutex); + mutex_lock(&notify_mutex); notifier_call_chain(&module_notify_list, MODULE_STATE_COMING, mod); - up(&notify_mutex); + mutex_unlock(&notify_mutex); /* Start the module */ if (mod->init != NULL) diff --git a/kernel/posix-timers.c b/kernel/posix-timers.c index fa895fc2ecf5..9944379360b5 100644 --- a/kernel/posix-timers.c +++ b/kernel/posix-timers.c @@ -35,6 +35,7 @@ #include #include #include +#include #include #include diff --git a/kernel/power/pm.c b/kernel/power/pm.c index 33c508e857dd..0f6908cce1dd 100644 ---
a/kernel/power/pm.c +++ b/kernel/power/pm.c @@ -25,6 +25,7 @@ #include #include #include +#include int pm_active; @@ -40,7 +41,7 @@ int pm_active; * until a resume but that will be fine. */ -static DECLARE_MUTEX(pm_devs_lock); +static DEFINE_MUTEX(pm_devs_lock); static LIST_HEAD(pm_devs); /** @@ -67,9 +68,9 @@ struct pm_dev *pm_register(pm_dev_t type, dev->id = id; dev->callback = callback; - down(&pm_devs_lock); + mutex_lock(&pm_devs_lock); list_add(&dev->entry, &pm_devs); - up(&pm_devs_lock); + mutex_unlock(&pm_devs_lock); } return dev; } @@ -85,9 +86,9 @@ struct pm_dev *pm_register(pm_dev_t type, void pm_unregister(struct pm_dev *dev) { if (dev) { - down(&pm_devs_lock); + mutex_lock(&pm_devs_lock); list_del(&dev->entry); - up(&pm_devs_lock); + mutex_unlock(&pm_devs_lock); kfree(dev); } @@ -118,7 +119,7 @@ void pm_unregister_all(pm_callback callback) if (!callback) return; - down(&pm_devs_lock); + mutex_lock(&pm_devs_lock); entry = pm_devs.next; while (entry != &pm_devs) { struct pm_dev *dev = list_entry(entry, struct pm_dev, entry); @@ -126,7 +127,7 @@ void pm_unregister_all(pm_callback callback) if (dev->callback == callback) __pm_unregister(dev); } - up(&pm_devs_lock); + mutex_unlock(&pm_devs_lock); } /** @@ -234,7 +235,7 @@ int pm_send_all(pm_request_t rqst, void *data) { struct list_head *entry; - down(&pm_devs_lock); + mutex_lock(&pm_devs_lock); entry = pm_devs.next; while (entry != &pm_devs) { struct pm_dev *dev = list_entry(entry, struct pm_dev, entry); @@ -246,13 +247,13 @@ int pm_send_all(pm_request_t rqst, void *data) */ if (rqst == PM_SUSPEND) pm_undo_all(dev); - up(&pm_devs_lock); + mutex_unlock(&pm_devs_lock); return status; } } entry = entry->next; } - up(&pm_devs_lock); + mutex_unlock(&pm_devs_lock); return 0; } diff --git a/kernel/profile.c b/kernel/profile.c index f89248e6d704..ad81f799a9b4 100644 --- a/kernel/profile.c +++ b/kernel/profile.c @@ -23,6 +23,7 @@ #include #include #include +#include #include #include @@ -44,7 +45,7 @@ static 
cpumask_t prof_cpu_mask = CPU_MASK_ALL; #ifdef CONFIG_SMP static DEFINE_PER_CPU(struct profile_hit *[2], cpu_profile_hits); static DEFINE_PER_CPU(int, cpu_profile_flip); -static DECLARE_MUTEX(profile_flip_mutex); +static DEFINE_MUTEX(profile_flip_mutex); #endif /* CONFIG_SMP */ static int __init profile_setup(char * str) @@ -243,7 +244,7 @@ static void profile_flip_buffers(void) { int i, j, cpu; - down(&profile_flip_mutex); + mutex_lock(&profile_flip_mutex); j = per_cpu(cpu_profile_flip, get_cpu()); put_cpu(); on_each_cpu(__profile_flip_buffers, NULL, 0, 1); @@ -259,14 +260,14 @@ static void profile_flip_buffers(void) hits[i].hits = hits[i].pc = 0; } } - up(&profile_flip_mutex); + mutex_unlock(&profile_flip_mutex); } static void profile_discard_flip_buffers(void) { int i, cpu; - down(&profile_flip_mutex); + mutex_lock(&profile_flip_mutex); i = per_cpu(cpu_profile_flip, get_cpu()); put_cpu(); on_each_cpu(__profile_flip_buffers, NULL, 0, 1); @@ -274,7 +275,7 @@ static void profile_discard_flip_buffers(void) struct profile_hit *hits = per_cpu(cpu_profile_hits, cpu)[i]; memset(hits, 0, NR_PROFILE_HIT*sizeof(struct profile_hit)); } - up(&profile_flip_mutex); + mutex_unlock(&profile_flip_mutex); } void profile_hit(int type, void *__pc) -- cgit v1.2.3 From 70522e121a521aa09bd0f4e62e1aa68708b798e1 Mon Sep 17 00:00:00 2001 From: Ingo Molnar Date: Thu, 23 Mar 2006 03:00:31 -0800 Subject: [PATCH] sem2mutex: tty Semaphore to mutex conversion. The conversion was generated via scripts, and the result was validated automatically via a script as well. Signed-off-by: Ingo Molnar Cc: Alan Cox Cc: Russell King Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/exit.c | 4 ++-- kernel/sys.c | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/exit.c b/kernel/exit.c index d1e8d500a7e1..8037405e136e 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -345,9 +345,9 @@ void daemonize(const char *name, ...) 
exit_mm(current); set_special_pids(1, 1); - down(&tty_sem); + mutex_lock(&tty_mutex); current->signal->tty = NULL; - up(&tty_sem); + mutex_unlock(&tty_mutex); /* Block and flush all signals */ sigfillset(&blocked); diff --git a/kernel/sys.c b/kernel/sys.c index 4941b9b14b97..c0fcad9f826c 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -1227,7 +1227,7 @@ asmlinkage long sys_setsid(void) struct pid *pid; int err = -EPERM; - down(&tty_sem); + mutex_lock(&tty_mutex); write_lock_irq(&tasklist_lock); pid = find_pid(PIDTYPE_PGID, group_leader->pid); @@ -1241,7 +1241,7 @@ asmlinkage long sys_setsid(void) err = process_group(group_leader); out: write_unlock_irq(&tasklist_lock); - up(&tty_sem); + mutex_unlock(&tty_mutex); return err; } -- cgit v1.2.3 From 7a7d1cf95408863a657035701606b13644c9f55e Mon Sep 17 00:00:00 2001 From: Ingo Molnar Date: Thu, 23 Mar 2006 03:00:35 -0800 Subject: [PATCH] sem2mutex: kprobes Semaphore to mutex conversion. The conversion was generated via scripts, and the result was validated automatically via a script as well. 
Signed-off-by: Ingo Molnar Acked-by: Anil S Keshavamurthy Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/kprobes.c | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) (limited to 'kernel') diff --git a/kernel/kprobes.c b/kernel/kprobes.c index fef1af8a73ce..1fb9f753ef60 100644 --- a/kernel/kprobes.c +++ b/kernel/kprobes.c @@ -48,7 +48,7 @@ static struct hlist_head kprobe_table[KPROBE_TABLE_SIZE]; static struct hlist_head kretprobe_inst_table[KPROBE_TABLE_SIZE]; -DECLARE_MUTEX(kprobe_mutex); /* Protects kprobe_table */ +DEFINE_MUTEX(kprobe_mutex); /* Protects kprobe_table */ DEFINE_SPINLOCK(kretprobe_lock); /* Protects kretprobe_inst_table */ static DEFINE_PER_CPU(struct kprobe *, kprobe_instance) = NULL; @@ -460,7 +460,7 @@ static int __kprobes __register_kprobe(struct kprobe *p, } p->nmissed = 0; - down(&kprobe_mutex); + mutex_lock(&kprobe_mutex); old_p = get_kprobe(p->addr); if (old_p) { ret = register_aggr_kprobe(old_p, p); @@ -477,7 +477,7 @@ static int __kprobes __register_kprobe(struct kprobe *p, arch_arm_kprobe(p); out: - up(&kprobe_mutex); + mutex_unlock(&kprobe_mutex); if (ret && probed_mod) module_put(probed_mod); @@ -496,10 +496,10 @@ void __kprobes unregister_kprobe(struct kprobe *p) struct kprobe *old_p, *list_p; int cleanup_p; - down(&kprobe_mutex); + mutex_lock(&kprobe_mutex); old_p = get_kprobe(p->addr); if (unlikely(!old_p)) { - up(&kprobe_mutex); + mutex_unlock(&kprobe_mutex); return; } if (p != old_p) { @@ -507,7 +507,7 @@ void __kprobes unregister_kprobe(struct kprobe *p) if (list_p == p) /* kprobe p is a valid probe */ goto valid_p; - up(&kprobe_mutex); + mutex_unlock(&kprobe_mutex); return; } valid_p: @@ -523,7 +523,7 @@ valid_p: cleanup_p = 0; } - up(&kprobe_mutex); + mutex_unlock(&kprobe_mutex); synchronize_sched(); if (p->mod_refcounted && -- cgit v1.2.3 From 6389a385114ae358693f213266de6468ea116c77 Mon Sep 17 00:00:00 2001 From: Ashutosh Naik Date: Thu, 23 Mar 2006 03:00:46 -0800 Subject: [PATCH] 
kernel/module.c Semaphore to Mutex Conversion for module_mutex This patch converts the module_mutex semaphore to a mutex. Signed-off-by: Ashutosh Naik Cc: Arjan van de Ven Cc: Ingo Molnar Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/module.c | 38 +++++++++++++++++++------------------- 1 file changed, 19 insertions(+), 19 deletions(-) (limited to 'kernel') diff --git a/kernel/module.c b/kernel/module.c index de6312da6bb5..fb404299082e 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -61,7 +61,7 @@ static DEFINE_SPINLOCK(modlist_lock); /* List of modules, protected by module_mutex AND modlist_lock */ -static DECLARE_MUTEX(module_mutex); +static DEFINE_MUTEX(module_mutex); static LIST_HEAD(modules); static DEFINE_MUTEX(notify_mutex); @@ -602,7 +602,7 @@ static void free_module(struct module *mod); static void wait_for_zero_refcount(struct module *mod) { /* Since we might sleep for some time, drop the semaphore first */ - up(&module_mutex); + mutex_unlock(&module_mutex); for (;;) { DEBUGP("Looking at refcount...\n"); set_current_state(TASK_UNINTERRUPTIBLE); @@ -611,7 +611,7 @@ static void wait_for_zero_refcount(struct module *mod) schedule(); } current->state = TASK_RUNNING; - down(&module_mutex); + mutex_lock(&module_mutex); } asmlinkage long @@ -628,7 +628,7 @@ sys_delete_module(const char __user *name_user, unsigned int flags) return -EFAULT; name[MODULE_NAME_LEN-1] = '\0'; - if (down_interruptible(&module_mutex) != 0) + if (mutex_lock_interruptible(&module_mutex) != 0) return -EINTR; mod = find_module(name); @@ -677,14 +677,14 @@ sys_delete_module(const char __user *name_user, unsigned int flags) /* Final destruction now noone is using it. 
*/ if (mod->exit != NULL) { - up(&module_mutex); + mutex_unlock(&module_mutex); mod->exit(); - down(&module_mutex); + mutex_lock(&module_mutex); } free_module(mod); out: - up(&module_mutex); + mutex_unlock(&module_mutex); return ret; } @@ -1973,13 +1973,13 @@ sys_init_module(void __user *umod, return -EPERM; /* Only one module load at a time, please */ - if (down_interruptible(&module_mutex) != 0) + if (mutex_lock_interruptible(&module_mutex) != 0) return -EINTR; /* Do all the hard work */ mod = load_module(umod, len, uargs); if (IS_ERR(mod)) { - up(&module_mutex); + mutex_unlock(&module_mutex); return PTR_ERR(mod); } @@ -1988,7 +1988,7 @@ sys_init_module(void __user *umod, stop_machine_run(__link_module, mod, NR_CPUS); /* Drop lock so they can recurse */ - up(&module_mutex); + mutex_unlock(&module_mutex); mutex_lock(&notify_mutex); notifier_call_chain(&module_notify_list, MODULE_STATE_COMING, mod); @@ -2007,15 +2007,15 @@ sys_init_module(void __user *umod, mod->name); else { module_put(mod); - down(&module_mutex); + mutex_lock(&module_mutex); free_module(mod); - up(&module_mutex); + mutex_unlock(&module_mutex); } return ret; } /* Now it's a first class citizen! */ - down(&module_mutex); + mutex_lock(&module_mutex); mod->state = MODULE_STATE_LIVE; /* Drop initial reference.
*/ module_put(mod); @@ -2023,7 +2023,7 @@ sys_init_module(void __user *umod, mod->module_init = NULL; mod->init_size = 0; mod->init_text_size = 0; - up(&module_mutex); + mutex_unlock(&module_mutex); return 0; } @@ -2113,7 +2113,7 @@ struct module *module_get_kallsym(unsigned int symnum, { struct module *mod; - down(&module_mutex); + mutex_lock(&module_mutex); list_for_each_entry(mod, &modules, list) { if (symnum < mod->num_symtab) { *value = mod->symtab[symnum].st_value; @@ -2121,12 +2121,12 @@ struct module *module_get_kallsym(unsigned int symnum, strncpy(namebuf, mod->strtab + mod->symtab[symnum].st_name, 127); - up(&module_mutex); + mutex_unlock(&module_mutex); return mod; } symnum -= mod->num_symtab; } - up(&module_mutex); + mutex_unlock(&module_mutex); return NULL; } @@ -2169,7 +2169,7 @@ static void *m_start(struct seq_file *m, loff_t *pos) struct list_head *i; loff_t n = 0; - down(&module_mutex); + mutex_lock(&module_mutex); list_for_each(i, &modules) { if (n++ == *pos) break; @@ -2190,7 +2190,7 @@ static void *m_next(struct seq_file *m, void *p, loff_t *pos) static void m_stop(struct seq_file *m, void *p) { - up(&module_mutex); + mutex_unlock(&module_mutex); } static int m_show(struct seq_file *m, void *p) -- cgit v1.2.3 From a26fd335b481e0bd14f4e7d1f5e7bb1138b1731f Mon Sep 17 00:00:00 2001 From: Oleg Nesterov Date: Thu, 23 Mar 2006 03:00:49 -0800 Subject: [PATCH] sigprocmask: kill unneeded temp var Cleanup, remove unneeded double copying of current->blocked. 
Signed-off-by: Oleg Nesterov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/signal.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/signal.c b/kernel/signal.c index dfb09ba5c013..75f7341b0c39 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -2101,10 +2101,11 @@ long do_no_restart_syscall(struct restart_block *param) int sigprocmask(int how, sigset_t *set, sigset_t *oldset) { int error; - sigset_t old_block; spin_lock_irq(&current->sighand->siglock); - old_block = current->blocked; + if (oldset) + *oldset = current->blocked; + error = 0; switch (how) { case SIG_BLOCK: @@ -2121,8 +2122,7 @@ int sigprocmask(int how, sigset_t *set, sigset_t *oldset) } recalc_sigpending(); spin_unlock_irq(&current->sighand->siglock); - if (oldset) - *oldset = old_block; + return error; } -- cgit v1.2.3 From 91368d73e4b60d577ad171e5bd315b564265fcdb Mon Sep 17 00:00:00 2001 From: Ingo Molnar Date: Thu, 23 Mar 2006 03:00:54 -0800 Subject: [PATCH] make bug messages more consistent Consolidate all kernel bug printouts to begin with the "BUG: " string. Makes it easier to find them in large bootup logs.
Signed-off-by: Ingo Molnar Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/sched.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/sched.c b/kernel/sched.c index a5bd60453eae..7ffaabd64f89 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -2873,7 +2873,7 @@ asmlinkage void __sched schedule(void) */ if (likely(!current->exit_state)) { if (unlikely(in_atomic())) { - printk(KERN_ERR "scheduling while atomic: " + printk(KERN_ERR "BUG: scheduling while atomic: " "%s/0x%08x/%d\n", current->comm, preempt_count(), current->pid); dump_stack(); @@ -6074,7 +6074,7 @@ void __might_sleep(char *file, int line) if (time_before(jiffies, prev_jiffy + HZ) && prev_jiffy) return; prev_jiffy = jiffies; - printk(KERN_ERR "Debug: sleeping function called from invalid" + printk(KERN_ERR "BUG: sleeping function called from invalid" " context at %s:%d\n", file, line); printk("in_atomic():%d, irqs_disabled():%d\n", in_atomic(), irqs_disabled()); -- cgit v1.2.3 From dd287796d608fcdc3fe5e8fdb5bf762a8f1bc32a Mon Sep 17 00:00:00 2001 From: Andrew Morton Date: Thu, 23 Mar 2006 03:00:57 -0800 Subject: [PATCH] pause_on_oops command line option Attempt to fix the problem wherein people's oops reports scroll off the screen due to repeated oopsing or to oopses on other CPUs. If this happens the user can reboot with the `pause_on_oops=' option. It will allow the first oopsing CPU to print an oops record just a single time. Second oopsing attempts, or oopses on other CPUs will cause those CPUs to enter a tight loop until the specified number of seconds have elapsed. The patch implements the infrastructure generically in the expectation that architectures other than x86 will find it useful. 
Cc: Dave Jones Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/panic.c | 97 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 96 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/panic.c b/kernel/panic.c index 126dc43f1c74..acd95adddb93 100644 --- a/kernel/panic.c +++ b/kernel/panic.c @@ -20,10 +20,13 @@ #include #include -int panic_timeout; int panic_on_oops; int tainted; +static int pause_on_oops; +static int pause_on_oops_flag; +static DEFINE_SPINLOCK(pause_on_oops_lock); +int panic_timeout; EXPORT_SYMBOL(panic_timeout); struct notifier_block *panic_notifier_list; @@ -174,3 +177,95 @@ void add_taint(unsigned flag) tainted |= flag; } EXPORT_SYMBOL(add_taint); + +static int __init pause_on_oops_setup(char *str) +{ + pause_on_oops = simple_strtoul(str, NULL, 0); + return 1; +} +__setup("pause_on_oops=", pause_on_oops_setup); + +static void spin_msec(int msecs) +{ + int i; + + for (i = 0; i < msecs; i++) { + touch_nmi_watchdog(); + mdelay(1); + } +} + +/* + * It just happens that oops_enter() and oops_exit() are identically + * implemented... 
+ */ +static void do_oops_enter_exit(void) +{ + unsigned long flags; + static int spin_counter; + + if (!pause_on_oops) + return; + + spin_lock_irqsave(&pause_on_oops_lock, flags); + if (pause_on_oops_flag == 0) { + /* This CPU may now print the oops message */ + pause_on_oops_flag = 1; + } else { + /* We need to stall this CPU */ + if (!spin_counter) { + /* This CPU gets to do the counting */ + spin_counter = pause_on_oops; + do { + spin_unlock(&pause_on_oops_lock); + spin_msec(MSEC_PER_SEC); + spin_lock(&pause_on_oops_lock); + } while (--spin_counter); + pause_on_oops_flag = 0; + } else { + /* This CPU waits for a different one */ + while (spin_counter) { + spin_unlock(&pause_on_oops_lock); + spin_msec(1); + spin_lock(&pause_on_oops_lock); + } + } + } + spin_unlock_irqrestore(&pause_on_oops_lock, flags); +} + +/* + * Return true if the calling CPU is allowed to print oops-related info. This + * is a bit racy.. + */ +int oops_may_print(void) +{ + return pause_on_oops_flag == 0; +} + +/* + * Called when the architecture enters its oops handler, before it prints + * anything. If this is the first CPU to oops, and it's oopsing the first time + * then let it proceed. + * + * This is all enabled by the pause_on_oops kernel boot option. We do all this + * to ensure that oopses don't scroll off the screen. It has the side-effect + * of preventing later-oopsing CPUs from mucking up the display, too. + * + * It turns out that the CPU which is allowed to print ends up pausing for the + * right duration, whereas all the other CPUs pause for twice as long: once in + * oops_enter(), once in oops_exit(). + */ +void oops_enter(void) +{ + do_oops_enter_exit(); +} + +/* + * Called when the architecture exits its oops handler, after printing + * everything. 
+ */ +void oops_exit(void) +{ + do_oops_enter_exit(); +} -- cgit v1.2.3 From ee25e96fcd78837c9f192aa655ce12a88bfd63d4 Mon Sep 17 00:00:00 2001 From: Oleg Nesterov Date: Thu, 23 Mar 2006 03:00:58 -0800 Subject: [PATCH] BUILD_LOCK_OPS: cleanup preempt_disable() usage This patch changes the code from: preempt_disable(); for (;;) { ... preempt_disable(); } to: for (;;) { preempt_disable(); ... } which seems more clean to me and saves a couple of bytes for each function. Signed-off-by: Oleg Nesterov Acked-by: Ingo Molnar Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/spinlock.c | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/kernel/spinlock.c b/kernel/spinlock.c index 0375fcd5921d..d1b810782bc4 100644 --- a/kernel/spinlock.c +++ b/kernel/spinlock.c @@ -179,16 +179,16 @@ EXPORT_SYMBOL(_write_lock); #define BUILD_LOCK_OPS(op, locktype) \ void __lockfunc _##op##_lock(locktype##_t *lock) \ { \ - preempt_disable(); \ for (;;) { \ + preempt_disable(); \ if (likely(_raw_##op##_trylock(lock))) \ break; \ preempt_enable(); \ + \ if (!(lock)->break_lock) \ (lock)->break_lock = 1; \ while (!op##_can_lock(lock) && (lock)->break_lock) \ cpu_relax(); \ - preempt_disable(); \ } \ (lock)->break_lock = 0; \ } \ @@ -199,19 +199,18 @@ unsigned long __lockfunc _##op##_lock_irqsave(locktype##_t *lock) \ { \ unsigned long flags; \ \ - preempt_disable(); \ for (;;) { \ + preempt_disable(); \ local_irq_save(flags); \ if (likely(_raw_##op##_trylock(lock))) \ break; \ local_irq_restore(flags); \ - \ preempt_enable(); \ + \ if (!(lock)->break_lock) \ (lock)->break_lock = 1; \ while (!op##_can_lock(lock) && (lock)->break_lock) \ cpu_relax(); \ - preempt_disable(); \ } \ (lock)->break_lock = 0; \ return flags; \ -- cgit v1.2.3 From 2178426d26661ed6e18a8d6ea0bc05c98d73600d Mon Sep 17 00:00:00 2001 From: Adrian Bunk Date: Thu, 23 Mar 2006 03:01:00 -0800 Subject: [PATCH] kernel/rcupdate.c: make two structs static This patch makes 
two needlessly global structs static. Signed-off-by: Adrian Bunk Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/rcupdate.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/rcupdate.c b/kernel/rcupdate.c index af8a2a57e17d..6df1559b1c02 100644 --- a/kernel/rcupdate.c +++ b/kernel/rcupdate.c @@ -50,13 +50,13 @@ #include /* Definition for rcupdate control block. */ -struct rcu_ctrlblk rcu_ctrlblk = { +static struct rcu_ctrlblk rcu_ctrlblk = { .cur = -300, .completed = -300, .lock = SPIN_LOCK_UNLOCKED, .cpumask = CPU_MASK_NONE, }; -struct rcu_ctrlblk rcu_bh_ctrlblk = { +static struct rcu_ctrlblk rcu_bh_ctrlblk = { .cur = -300, .completed = -300, .lock = SPIN_LOCK_UNLOCKED, -- cgit v1.2.3 From b86ff981a8252d83d6a7719ae09f3a05307e3592 Mon Sep 17 00:00:00 2001 From: Jens Axboe Date: Thu, 23 Mar 2006 19:56:55 +0100 Subject: [PATCH] relay: migrate from relayfs to a generic relay API Original patch from Paul Mundt, sysfs parts removed by me since they were broken. Signed-off-by: Jens Axboe --- kernel/Makefile | 1 + kernel/relay.c | 919 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 920 insertions(+) create mode 100644 kernel/relay.c (limited to 'kernel') diff --git a/kernel/Makefile b/kernel/Makefile index 4ae0fbde815d..aebd7a78984e 100644 --- a/kernel/Makefile +++ b/kernel/Makefile @@ -34,6 +34,7 @@ obj-$(CONFIG_DETECT_SOFTLOCKUP) += softlockup.o obj-$(CONFIG_GENERIC_HARDIRQS) += irq/ obj-$(CONFIG_SECCOMP) += seccomp.o obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o +obj-$(CONFIG_RELAY) += relay.o ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y) # According to Alan Modra , the -fno-omit-frame-pointer is diff --git a/kernel/relay.c b/kernel/relay.c new file mode 100644 index 000000000000..9358e8eb8476 --- /dev/null +++ b/kernel/relay.c @@ -0,0 +1,919 @@ +/* + * Public API and common code for kernel->userspace relay file support. 
+ * + * See Documentation/filesystems/relayfs.txt for an overview of relayfs. + * + * Copyright (C) 2002-2005 - Tom Zanussi (zanussi@us.ibm.com), IBM Corp + * Copyright (C) 1999-2005 - Karim Yaghmour (karim@opersys.com) + * + * Moved to kernel/relay.c by Paul Mundt, 2006. + * + * This file is released under the GPL. + */ +#include +#include +#include +#include +#include +#include +#include +#include + +/* + * close() vm_op implementation for relay file mapping. + */ +static void relay_file_mmap_close(struct vm_area_struct *vma) +{ + struct rchan_buf *buf = vma->vm_private_data; + buf->chan->cb->buf_unmapped(buf, vma->vm_file); +} + +/* + * nopage() vm_op implementation for relay file mapping. + */ +static struct page *relay_buf_nopage(struct vm_area_struct *vma, + unsigned long address, + int *type) +{ + struct page *page; + struct rchan_buf *buf = vma->vm_private_data; + unsigned long offset = address - vma->vm_start; + + if (address > vma->vm_end) + return NOPAGE_SIGBUS; /* Disallow mremap */ + if (!buf) + return NOPAGE_OOM; + + page = vmalloc_to_page(buf->start + offset); + if (!page) + return NOPAGE_OOM; + get_page(page); + + if (type) + *type = VM_FAULT_MINOR; + + return page; +} + +/* + * vm_ops for relay file mappings. + */ +static struct vm_operations_struct relay_file_mmap_ops = { + .nopage = relay_buf_nopage, + .close = relay_file_mmap_close, +}; + +/** + * relay_mmap_buf: - mmap channel buffer to process address space + * @buf: relay channel buffer + * @vma: vm_area_struct describing memory to be mapped + * + * Returns 0 if ok, negative on error + * + * Caller should already have grabbed mmap_sem. 
+ */ +int relay_mmap_buf(struct rchan_buf *buf, struct vm_area_struct *vma) +{ + unsigned long length = vma->vm_end - vma->vm_start; + struct file *filp = vma->vm_file; + + if (!buf) + return -EBADF; + + if (length != (unsigned long)buf->chan->alloc_size) + return -EINVAL; + + vma->vm_ops = &relay_file_mmap_ops; + vma->vm_private_data = buf; + buf->chan->cb->buf_mapped(buf, filp); + + return 0; +} + +/** + * relay_alloc_buf - allocate a channel buffer + * @buf: the buffer struct + * @size: total size of the buffer + * + * Returns a pointer to the resulting buffer, NULL if unsuccessful + */ +static void *relay_alloc_buf(struct rchan_buf *buf, unsigned long size) +{ + void *mem; + unsigned int i, j, n_pages; + + size = PAGE_ALIGN(size); + n_pages = size >> PAGE_SHIFT; + + buf->page_array = kcalloc(n_pages, sizeof(struct page *), GFP_KERNEL); + if (!buf->page_array) + return NULL; + + for (i = 0; i < n_pages; i++) { + buf->page_array[i] = alloc_page(GFP_KERNEL); + if (unlikely(!buf->page_array[i])) + goto depopulate; + } + mem = vmap(buf->page_array, n_pages, VM_MAP, PAGE_KERNEL); + if (!mem) + goto depopulate; + + memset(mem, 0, size); + buf->page_count = n_pages; + return mem; + +depopulate: + for (j = 0; j < i; j++) + __free_page(buf->page_array[j]); + kfree(buf->page_array); + return NULL; +} + +/** + * relay_create_buf - allocate and initialize a channel buffer + * @alloc_size: size of the buffer to allocate + * @n_subbufs: number of sub-buffers in the channel + * + * Returns channel buffer if successful, NULL otherwise + */ +struct rchan_buf *relay_create_buf(struct rchan *chan) +{ + struct rchan_buf *buf = kcalloc(1, sizeof(struct rchan_buf), GFP_KERNEL); + if (!buf) + return NULL; + + buf->padding = kmalloc(chan->n_subbufs * sizeof(size_t *), GFP_KERNEL); + if (!buf->padding) + goto free_buf; + + buf->start = relay_alloc_buf(buf, chan->alloc_size); + if (!buf->start) + goto free_buf; + + buf->chan = chan; + kref_get(&buf->chan->kref); + return buf; + 
+free_buf: + kfree(buf->padding); + kfree(buf); + return NULL; +} + +/** + * relay_destroy_channel - free the channel struct + * + * Should only be called from kref_put(). + */ +void relay_destroy_channel(struct kref *kref) +{ + struct rchan *chan = container_of(kref, struct rchan, kref); + kfree(chan); +} + +/** + * relay_destroy_buf - destroy an rchan_buf struct and associated buffer + * @buf: the buffer struct + */ +void relay_destroy_buf(struct rchan_buf *buf) +{ + struct rchan *chan = buf->chan; + unsigned int i; + + if (likely(buf->start)) { + vunmap(buf->start); + for (i = 0; i < buf->page_count; i++) + __free_page(buf->page_array[i]); + kfree(buf->page_array); + } + kfree(buf->padding); + kfree(buf); + kref_put(&chan->kref, relay_destroy_channel); +} + +/** + * relay_remove_buf - remove a channel buffer + * + * Removes the file from the fileystem, which also frees the + * rchan_buf_struct and the channel buffer. Should only be called from + * kref_put(). + */ +void relay_remove_buf(struct kref *kref) +{ + struct rchan_buf *buf = container_of(kref, struct rchan_buf, kref); + buf->chan->cb->remove_buf_file(buf->dentry); + relay_destroy_buf(buf); +} + +/** + * relay_buf_empty - boolean, is the channel buffer empty? + * @buf: channel buffer + * + * Returns 1 if the buffer is empty, 0 otherwise. + */ +int relay_buf_empty(struct rchan_buf *buf) +{ + return (buf->subbufs_produced - buf->subbufs_consumed) ? 0 : 1; +} +EXPORT_SYMBOL_GPL(relay_buf_empty); + +/** + * relay_buf_full - boolean, is the channel buffer full? + * @buf: channel buffer + * + * Returns 1 if the buffer is full, 0 otherwise. + */ +int relay_buf_full(struct rchan_buf *buf) +{ + size_t ready = buf->subbufs_produced - buf->subbufs_consumed; + return (ready >= buf->chan->n_subbufs) ? 1 : 0; +} +EXPORT_SYMBOL_GPL(relay_buf_full); + +/* + * High-level relay kernel API and associated functions. + */ + +/* + * rchan_callback implementations defining default channel behavior. 
Used + * in place of corresponding NULL values in client callback struct. + */ + +/* + * subbuf_start() default callback. Does nothing. + */ +static int subbuf_start_default_callback (struct rchan_buf *buf, + void *subbuf, + void *prev_subbuf, + size_t prev_padding) +{ + if (relay_buf_full(buf)) + return 0; + + return 1; +} + +/* + * buf_mapped() default callback. Does nothing. + */ +static void buf_mapped_default_callback(struct rchan_buf *buf, + struct file *filp) +{ +} + +/* + * buf_unmapped() default callback. Does nothing. + */ +static void buf_unmapped_default_callback(struct rchan_buf *buf, + struct file *filp) +{ +} + +/* + * create_buf_file_create() default callback. Does nothing. + */ +static struct dentry *create_buf_file_default_callback(const char *filename, + struct dentry *parent, + int mode, + struct rchan_buf *buf, + int *is_global) +{ + return NULL; +} + +/* + * remove_buf_file() default callback. Does nothing. + */ +static int remove_buf_file_default_callback(struct dentry *dentry) +{ + return -EINVAL; +} + +/* relay channel default callbacks */ +static struct rchan_callbacks default_channel_callbacks = { + .subbuf_start = subbuf_start_default_callback, + .buf_mapped = buf_mapped_default_callback, + .buf_unmapped = buf_unmapped_default_callback, + .create_buf_file = create_buf_file_default_callback, + .remove_buf_file = remove_buf_file_default_callback, +}; + +/** + * wakeup_readers - wake up readers waiting on a channel + * @private: the channel buffer + * + * This is the work function used to defer reader waking. The + * reason waking is deferred is that calling directly from write + * causes problems if you're writing from say the scheduler. 
+ */ +static void wakeup_readers(void *private) +{ + struct rchan_buf *buf = private; + wake_up_interruptible(&buf->read_wait); +} + +/** + * __relay_reset - reset a channel buffer + * @buf: the channel buffer + * @init: 1 if this is a first-time initialization + * + * See relay_reset for description of effect. + */ +static inline void __relay_reset(struct rchan_buf *buf, unsigned int init) +{ + size_t i; + + if (init) { + init_waitqueue_head(&buf->read_wait); + kref_init(&buf->kref); + INIT_WORK(&buf->wake_readers, NULL, NULL); + } else { + cancel_delayed_work(&buf->wake_readers); + flush_scheduled_work(); + } + + buf->subbufs_produced = 0; + buf->subbufs_consumed = 0; + buf->bytes_consumed = 0; + buf->finalized = 0; + buf->data = buf->start; + buf->offset = 0; + + for (i = 0; i < buf->chan->n_subbufs; i++) + buf->padding[i] = 0; + + buf->chan->cb->subbuf_start(buf, buf->data, NULL, 0); +} + +/** + * relay_reset - reset the channel + * @chan: the channel + * + * This has the effect of erasing all data from all channel buffers + * and restarting the channel in its initial state. The buffers + * are not freed, so any mappings are still in effect. + * + * NOTE: Care should be taken that the channel isn't actually + * being used by anything when this call is made. + */ +void relay_reset(struct rchan *chan) +{ + unsigned int i; + struct rchan_buf *prev = NULL; + + if (!chan) + return; + + for (i = 0; i < NR_CPUS; i++) { + if (!chan->buf[i] || chan->buf[i] == prev) + break; + __relay_reset(chan->buf[i], 0); + prev = chan->buf[i]; + } +} +EXPORT_SYMBOL_GPL(relay_reset); + +/** + * relay_open_buf - create a new relay channel buffer + * + * Internal - used by relay_open(). 
+ */ +static struct rchan_buf *relay_open_buf(struct rchan *chan, + const char *filename, + struct dentry *parent, + int *is_global) +{ + struct rchan_buf *buf; + struct dentry *dentry; + + if (*is_global) + return chan->buf[0]; + + buf = relay_create_buf(chan); + if (!buf) + return NULL; + + /* Create file in fs */ + dentry = chan->cb->create_buf_file(filename, parent, S_IRUSR, + buf, is_global); + if (!dentry) { + relay_destroy_buf(buf); + return NULL; + } + + buf->dentry = dentry; + __relay_reset(buf, 1); + + return buf; +} + +/** + * relay_close_buf - close a channel buffer + * @buf: channel buffer + * + * Marks the buffer finalized and restores the default callbacks. + * The channel buffer and channel buffer data structure are then freed + * automatically when the last reference is given up. + */ +static inline void relay_close_buf(struct rchan_buf *buf) +{ + buf->finalized = 1; + cancel_delayed_work(&buf->wake_readers); + flush_scheduled_work(); + kref_put(&buf->kref, relay_remove_buf); +} + +static inline void setup_callbacks(struct rchan *chan, + struct rchan_callbacks *cb) +{ + if (!cb) { + chan->cb = &default_channel_callbacks; + return; + } + + if (!cb->subbuf_start) + cb->subbuf_start = subbuf_start_default_callback; + if (!cb->buf_mapped) + cb->buf_mapped = buf_mapped_default_callback; + if (!cb->buf_unmapped) + cb->buf_unmapped = buf_unmapped_default_callback; + if (!cb->create_buf_file) + cb->create_buf_file = create_buf_file_default_callback; + if (!cb->remove_buf_file) + cb->remove_buf_file = remove_buf_file_default_callback; + chan->cb = cb; +} + +/** + * relay_open - create a new relay channel + * @base_filename: base name of files to create + * @parent: dentry of parent directory, NULL for root directory + * @subbuf_size: size of sub-buffers + * @n_subbufs: number of sub-buffers + * @cb: client callback functions + * + * Returns channel pointer if successful, NULL otherwise. 
+ * + * Creates a channel buffer for each cpu using the sizes and + * attributes specified. The created channel buffer files + * will be named base_filename0...base_filenameN-1. File + * permissions will be S_IRUSR. + */ +struct rchan *relay_open(const char *base_filename, + struct dentry *parent, + size_t subbuf_size, + size_t n_subbufs, + struct rchan_callbacks *cb) +{ + unsigned int i; + struct rchan *chan; + char *tmpname; + int is_global = 0; + + if (!base_filename) + return NULL; + + if (!(subbuf_size && n_subbufs)) + return NULL; + + chan = kcalloc(1, sizeof(struct rchan), GFP_KERNEL); + if (!chan) + return NULL; + + chan->version = RELAYFS_CHANNEL_VERSION; + chan->n_subbufs = n_subbufs; + chan->subbuf_size = subbuf_size; + chan->alloc_size = FIX_SIZE(subbuf_size * n_subbufs); + setup_callbacks(chan, cb); + kref_init(&chan->kref); + + tmpname = kmalloc(NAME_MAX + 1, GFP_KERNEL); + if (!tmpname) + goto free_chan; + + for_each_online_cpu(i) { + sprintf(tmpname, "%s%d", base_filename, i); + chan->buf[i] = relay_open_buf(chan, tmpname, parent, + &is_global); + if (!chan->buf[i]) + goto free_bufs; + + chan->buf[i]->cpu = i; + } + + kfree(tmpname); + return chan; + +free_bufs: + for (i = 0; i < NR_CPUS; i++) { + if (!chan->buf[i]) + break; + relay_close_buf(chan->buf[i]); + if (is_global) + break; + } + kfree(tmpname); + +free_chan: + kref_put(&chan->kref, relay_destroy_channel); + return NULL; +} +EXPORT_SYMBOL_GPL(relay_open); + +/** + * relay_switch_subbuf - switch to a new sub-buffer + * @buf: channel buffer + * @length: size of current event + * + * Returns either the length passed in or 0 if full. + * + * Performs sub-buffer-switch tasks such as invoking callbacks, + * updating padding counts, waking up readers, etc. 
+ */ +size_t relay_switch_subbuf(struct rchan_buf *buf, size_t length) +{ + void *old, *new; + size_t old_subbuf, new_subbuf; + + if (unlikely(length > buf->chan->subbuf_size)) + goto toobig; + + if (buf->offset != buf->chan->subbuf_size + 1) { + buf->prev_padding = buf->chan->subbuf_size - buf->offset; + old_subbuf = buf->subbufs_produced % buf->chan->n_subbufs; + buf->padding[old_subbuf] = buf->prev_padding; + buf->subbufs_produced++; + if (waitqueue_active(&buf->read_wait)) { + PREPARE_WORK(&buf->wake_readers, wakeup_readers, buf); + schedule_delayed_work(&buf->wake_readers, 1); + } + } + + old = buf->data; + new_subbuf = buf->subbufs_produced % buf->chan->n_subbufs; + new = buf->start + new_subbuf * buf->chan->subbuf_size; + buf->offset = 0; + if (!buf->chan->cb->subbuf_start(buf, new, old, buf->prev_padding)) { + buf->offset = buf->chan->subbuf_size + 1; + return 0; + } + buf->data = new; + buf->padding[new_subbuf] = 0; + + if (unlikely(length + buf->offset > buf->chan->subbuf_size)) + goto toobig; + + return length; + +toobig: + buf->chan->last_toobig = length; + return 0; +} +EXPORT_SYMBOL_GPL(relay_switch_subbuf); + +/** + * relay_subbufs_consumed - update the buffer's sub-buffers-consumed count + * @chan: the channel + * @cpu: the cpu associated with the channel buffer to update + * @subbufs_consumed: number of sub-buffers to add to current buf's count + * + * Adds to the channel buffer's consumed sub-buffer count. + * subbufs_consumed should be the number of sub-buffers newly consumed, + * not the total consumed. + * + * NOTE: kernel clients don't need to call this function if the channel + * mode is 'overwrite'. 
+ */ +void relay_subbufs_consumed(struct rchan *chan, + unsigned int cpu, + size_t subbufs_consumed) +{ + struct rchan_buf *buf; + + if (!chan) + return; + + if (cpu >= NR_CPUS || !chan->buf[cpu]) + return; + + buf = chan->buf[cpu]; + buf->subbufs_consumed += subbufs_consumed; + if (buf->subbufs_consumed > buf->subbufs_produced) + buf->subbufs_consumed = buf->subbufs_produced; +} +EXPORT_SYMBOL_GPL(relay_subbufs_consumed); + +/** + * relay_close - close the channel + * @chan: the channel + * + * Closes all channel buffers and frees the channel. + */ +void relay_close(struct rchan *chan) +{ + unsigned int i; + struct rchan_buf *prev = NULL; + + if (!chan) + return; + + for (i = 0; i < NR_CPUS; i++) { + if (!chan->buf[i] || chan->buf[i] == prev) + break; + relay_close_buf(chan->buf[i]); + prev = chan->buf[i]; + } + + if (chan->last_toobig) + printk(KERN_WARNING "relay: one or more items not logged " + "[item size (%Zd) > sub-buffer size (%Zd)]\n", + chan->last_toobig, chan->subbuf_size); + + kref_put(&chan->kref, relay_destroy_channel); +} +EXPORT_SYMBOL_GPL(relay_close); + +/** + * relay_flush - close the channel + * @chan: the channel + * + * Flushes all channel buffers i.e. forces buffer switch. + */ +void relay_flush(struct rchan *chan) +{ + unsigned int i; + struct rchan_buf *prev = NULL; + + if (!chan) + return; + + for (i = 0; i < NR_CPUS; i++) { + if (!chan->buf[i] || chan->buf[i] == prev) + break; + relay_switch_subbuf(chan->buf[i], 0); + prev = chan->buf[i]; + } +} +EXPORT_SYMBOL_GPL(relay_flush); + +/** + * relay_file_open - open file op for relay files + * @inode: the inode + * @filp: the file + * + * Increments the channel buffer refcount. 
+ */ +static int relay_file_open(struct inode *inode, struct file *filp) +{ + struct rchan_buf *buf = inode->u.generic_ip; + kref_get(&buf->kref); + filp->private_data = buf; + + return 0; +} + +/** + * relay_file_mmap - mmap file op for relay files + * @filp: the file + * @vma: the vma describing what to map + * + * Calls upon relay_mmap_buf to map the file into user space. + */ +static int relay_file_mmap(struct file *filp, struct vm_area_struct *vma) +{ + struct rchan_buf *buf = filp->private_data; + return relay_mmap_buf(buf, vma); +} + +/** + * relay_file_poll - poll file op for relay files + * @filp: the file + * @wait: poll table + * + * Poll implemention. + */ +static unsigned int relay_file_poll(struct file *filp, poll_table *wait) +{ + unsigned int mask = 0; + struct rchan_buf *buf = filp->private_data; + + if (buf->finalized) + return POLLERR; + + if (filp->f_mode & FMODE_READ) { + poll_wait(filp, &buf->read_wait, wait); + if (!relay_buf_empty(buf)) + mask |= POLLIN | POLLRDNORM; + } + + return mask; +} + +/** + * relay_file_release - release file op for relay files + * @inode: the inode + * @filp: the file + * + * Decrements the channel refcount, as the filesystem is + * no longer using it. 
+ */ +static int relay_file_release(struct inode *inode, struct file *filp) +{ + struct rchan_buf *buf = filp->private_data; + kref_put(&buf->kref, relay_remove_buf); + + return 0; +} + +/** + * relay_file_read_consume - update the consumed count for the buffer + */ +static void relay_file_read_consume(struct rchan_buf *buf, + size_t read_pos, + size_t bytes_consumed) +{ + size_t subbuf_size = buf->chan->subbuf_size; + size_t n_subbufs = buf->chan->n_subbufs; + size_t read_subbuf; + + if (buf->bytes_consumed + bytes_consumed > subbuf_size) { + relay_subbufs_consumed(buf->chan, buf->cpu, 1); + buf->bytes_consumed = 0; + } + + buf->bytes_consumed += bytes_consumed; + read_subbuf = read_pos / buf->chan->subbuf_size; + if (buf->bytes_consumed + buf->padding[read_subbuf] == subbuf_size) { + if ((read_subbuf == buf->subbufs_produced % n_subbufs) && + (buf->offset == subbuf_size)) + return; + relay_subbufs_consumed(buf->chan, buf->cpu, 1); + buf->bytes_consumed = 0; + } +} + +/** + * relay_file_read_avail - boolean, are there unconsumed bytes available? + */ +static int relay_file_read_avail(struct rchan_buf *buf, size_t read_pos) +{ + size_t bytes_produced, bytes_consumed, write_offset; + size_t subbuf_size = buf->chan->subbuf_size; + size_t n_subbufs = buf->chan->n_subbufs; + size_t produced = buf->subbufs_produced % n_subbufs; + size_t consumed = buf->subbufs_consumed % n_subbufs; + + write_offset = buf->offset > subbuf_size ? 
subbuf_size : buf->offset; + + if (consumed > produced) { + if ((produced > n_subbufs) && + (produced + n_subbufs - consumed <= n_subbufs)) + produced += n_subbufs; + } else if (consumed == produced) { + if (buf->offset > subbuf_size) { + produced += n_subbufs; + if (buf->subbufs_produced == buf->subbufs_consumed) + consumed += n_subbufs; + } + } + + if (buf->offset > subbuf_size) + bytes_produced = (produced - 1) * subbuf_size + write_offset; + else + bytes_produced = produced * subbuf_size + write_offset; + bytes_consumed = consumed * subbuf_size + buf->bytes_consumed; + + if (bytes_produced == bytes_consumed) + return 0; + + relay_file_read_consume(buf, read_pos, 0); + + return 1; +} + +/** + * relay_file_read_subbuf_avail - return bytes available in sub-buffer + */ +static size_t relay_file_read_subbuf_avail(size_t read_pos, + struct rchan_buf *buf) +{ + size_t padding, avail = 0; + size_t read_subbuf, read_offset, write_subbuf, write_offset; + size_t subbuf_size = buf->chan->subbuf_size; + + write_subbuf = (buf->data - buf->start) / subbuf_size; + write_offset = buf->offset > subbuf_size ? subbuf_size : buf->offset; + read_subbuf = read_pos / subbuf_size; + read_offset = read_pos % subbuf_size; + padding = buf->padding[read_subbuf]; + + if (read_subbuf == write_subbuf) { + if (read_offset + padding < write_offset) + avail = write_offset - (read_offset + padding); + } else + avail = (subbuf_size - padding) - read_offset; + + return avail; +} + +/** + * relay_file_read_start_pos - find the first available byte to read + * + * If the read_pos is in the middle of padding, return the + * position of the first actually available byte, otherwise + * return the original value. 
+ */ +static size_t relay_file_read_start_pos(size_t read_pos, + struct rchan_buf *buf) +{ + size_t read_subbuf, padding, padding_start, padding_end; + size_t subbuf_size = buf->chan->subbuf_size; + size_t n_subbufs = buf->chan->n_subbufs; + + read_subbuf = read_pos / subbuf_size; + padding = buf->padding[read_subbuf]; + padding_start = (read_subbuf + 1) * subbuf_size - padding; + padding_end = (read_subbuf + 1) * subbuf_size; + if (read_pos >= padding_start && read_pos < padding_end) { + read_subbuf = (read_subbuf + 1) % n_subbufs; + read_pos = read_subbuf * subbuf_size; + } + + return read_pos; +} + +/** + * relay_file_read_end_pos - return the new read position + */ +static size_t relay_file_read_end_pos(struct rchan_buf *buf, + size_t read_pos, + size_t count) +{ + size_t read_subbuf, padding, end_pos; + size_t subbuf_size = buf->chan->subbuf_size; + size_t n_subbufs = buf->chan->n_subbufs; + + read_subbuf = read_pos / subbuf_size; + padding = buf->padding[read_subbuf]; + if (read_pos % subbuf_size + count + padding == subbuf_size) + end_pos = (read_subbuf + 1) * subbuf_size; + else + end_pos = read_pos + count; + if (end_pos >= subbuf_size * n_subbufs) + end_pos = 0; + + return end_pos; +} + +/** + * relay_file_read - read file op for relay files + * @filp: the file + * @buffer: the userspace buffer + * @count: number of bytes to read + * @ppos: position to read from + * + * Reads count bytes or the number of bytes available in the + * current sub-buffer being read, whichever is smaller. 
+ */ +static ssize_t relay_file_read(struct file *filp, + char __user *buffer, + size_t count, + loff_t *ppos) +{ + struct rchan_buf *buf = filp->private_data; + struct inode *inode = filp->f_dentry->d_inode; + size_t read_start, avail; + ssize_t ret = 0; + void *from; + + mutex_lock(&inode->i_mutex); + if(!relay_file_read_avail(buf, *ppos)) + goto out; + + read_start = relay_file_read_start_pos(*ppos, buf); + avail = relay_file_read_subbuf_avail(read_start, buf); + if (!avail) + goto out; + + from = buf->start + read_start; + ret = count = min(count, avail); + if (copy_to_user(buffer, from, count)) { + ret = -EFAULT; + goto out; + } + relay_file_read_consume(buf, read_start, count); + *ppos = relay_file_read_end_pos(buf, read_start, count); +out: + mutex_unlock(&inode->i_mutex); + return ret; +} + +struct file_operations relay_file_operations = { + .open = relay_file_open, + .poll = relay_file_poll, + .mmap = relay_file_mmap, + .read = relay_file_read, + .llseek = no_llseek, + .release = relay_file_release, +}; +EXPORT_SYMBOL_GPL(relay_file_operations); -- cgit v1.2.3 From 221415d76231d9012871e6e6abcbad906c46626a Mon Sep 17 00:00:00 2001 From: Jens Axboe Date: Thu, 23 Mar 2006 19:57:55 +0100 Subject: [PATCH] relay: add sendfile() support Signed-off-by: Jens Axboe --- kernel/relay.c | 144 +++++++++++++++++++++++++++++++++++++++++++++------------ 1 file changed, 115 insertions(+), 29 deletions(-) (limited to 'kernel') diff --git a/kernel/relay.c b/kernel/relay.c index 9358e8eb8476..fefe2b2a7277 100644 --- a/kernel/relay.c +++ b/kernel/relay.c @@ -95,15 +95,16 @@ int relay_mmap_buf(struct rchan_buf *buf, struct vm_area_struct *vma) * @buf: the buffer struct * @size: total size of the buffer * - * Returns a pointer to the resulting buffer, NULL if unsuccessful + * Returns a pointer to the resulting buffer, NULL if unsuccessful. The + * passed in size will get page aligned, if it isn't already. 
*/ -static void *relay_alloc_buf(struct rchan_buf *buf, unsigned long size) +static void *relay_alloc_buf(struct rchan_buf *buf, size_t *size) { void *mem; unsigned int i, j, n_pages; - size = PAGE_ALIGN(size); - n_pages = size >> PAGE_SHIFT; + *size = PAGE_ALIGN(*size); + n_pages = *size >> PAGE_SHIFT; buf->page_array = kcalloc(n_pages, sizeof(struct page *), GFP_KERNEL); if (!buf->page_array) @@ -118,7 +119,7 @@ static void *relay_alloc_buf(struct rchan_buf *buf, unsigned long size) if (!mem) goto depopulate; - memset(mem, 0, size); + memset(mem, 0, *size); buf->page_count = n_pages; return mem; @@ -146,7 +147,7 @@ struct rchan_buf *relay_create_buf(struct rchan *chan) if (!buf->padding) goto free_buf; - buf->start = relay_alloc_buf(buf, chan->alloc_size); + buf->start = relay_alloc_buf(buf, &chan->alloc_size); if (!buf->start) goto free_buf; @@ -543,6 +544,9 @@ size_t relay_switch_subbuf(struct rchan_buf *buf, size_t length) old_subbuf = buf->subbufs_produced % buf->chan->n_subbufs; buf->padding[old_subbuf] = buf->prev_padding; buf->subbufs_produced++; + buf->dentry->d_inode->i_size += buf->chan->subbuf_size - + buf->padding[old_subbuf]; + smp_mb(); if (waitqueue_active(&buf->read_wait)) { PREPARE_WORK(&buf->wake_readers, wakeup_readers, buf); schedule_delayed_work(&buf->wake_readers, 1); @@ -757,37 +761,33 @@ static void relay_file_read_consume(struct rchan_buf *buf, */ static int relay_file_read_avail(struct rchan_buf *buf, size_t read_pos) { - size_t bytes_produced, bytes_consumed, write_offset; size_t subbuf_size = buf->chan->subbuf_size; size_t n_subbufs = buf->chan->n_subbufs; - size_t produced = buf->subbufs_produced % n_subbufs; - size_t consumed = buf->subbufs_consumed % n_subbufs; + size_t produced = buf->subbufs_produced; + size_t consumed = buf->subbufs_consumed; - write_offset = buf->offset > subbuf_size ? 
subbuf_size : buf->offset; + relay_file_read_consume(buf, read_pos, 0); - if (consumed > produced) { - if ((produced > n_subbufs) && - (produced + n_subbufs - consumed <= n_subbufs)) - produced += n_subbufs; - } else if (consumed == produced) { - if (buf->offset > subbuf_size) { - produced += n_subbufs; - if (buf->subbufs_produced == buf->subbufs_consumed) - consumed += n_subbufs; - } + if (unlikely(buf->offset > subbuf_size)) { + if (produced == consumed) + return 0; + return 1; } - if (buf->offset > subbuf_size) - bytes_produced = (produced - 1) * subbuf_size + write_offset; - else - bytes_produced = produced * subbuf_size + write_offset; - bytes_consumed = consumed * subbuf_size + buf->bytes_consumed; - - if (bytes_produced == bytes_consumed) + if (unlikely(produced - consumed >= n_subbufs)) { + consumed = (produced / n_subbufs) * n_subbufs; + buf->subbufs_consumed = consumed; + } + + produced = (produced % n_subbufs) * subbuf_size + buf->offset; + consumed = (consumed % n_subbufs) * subbuf_size + buf->bytes_consumed; + + if (consumed > produced) + produced += n_subbufs * subbuf_size; + + if (consumed == produced) return 0; - relay_file_read_consume(buf, read_pos, 0); - return 1; } @@ -908,6 +908,91 @@ out: return ret; } +static ssize_t relay_file_sendsubbuf(struct file *filp, loff_t *ppos, + size_t count, read_actor_t actor, + void *target) +{ + struct rchan_buf *buf = filp->private_data; + read_descriptor_t desc; + size_t read_start, avail; + unsigned long pidx, poff; + unsigned int subbuf_pages; + ssize_t ret = 0; + + if (!relay_file_read_avail(buf, *ppos)) + return 0; + + read_start = relay_file_read_start_pos(*ppos, buf); + avail = relay_file_read_subbuf_avail(read_start, buf); + if (!avail) + return 0; + + count = min(count, avail); + + desc.written = 0; + desc.count = count; + desc.arg.data = target; + desc.error = 0; + + subbuf_pages = buf->chan->alloc_size >> PAGE_SHIFT; + pidx = (read_start / PAGE_SIZE) % subbuf_pages; + poff = read_start & ~PAGE_MASK; 
+ while (count) { + struct page *p = buf->page_array[pidx]; + unsigned int len; + + len = PAGE_SIZE - poff; + if (len > count) + len = count; + + len = actor(&desc, p, poff, len); + + if (desc.error) { + if (!ret) + ret = desc.error; + break; + } + + count -= len; + ret += len; + poff = 0; + pidx = (pidx + 1) % subbuf_pages; + } + + if (ret > 0) { + relay_file_read_consume(buf, read_start, ret); + *ppos = relay_file_read_end_pos(buf, read_start, ret); + } + + return ret; +} + +static ssize_t relay_file_sendfile(struct file *filp, loff_t *ppos, + size_t count, read_actor_t actor, + void *target) +{ + ssize_t sent = 0, ret = 0; + + if (!count) + return 0; + + mutex_lock(&filp->f_dentry->d_inode->i_mutex); + + do { + ret = relay_file_sendsubbuf(filp, ppos, count, actor, target); + if (ret < 0) { + if (!sent) + sent = ret; + break; + } + count -= ret; + sent += ret; + } while (count && ret); + + mutex_unlock(&filp->f_dentry->d_inode->i_mutex); + return sent; +} + struct file_operations relay_file_operations = { .open = relay_file_open, .poll = relay_file_poll, @@ -915,5 +1000,6 @@ struct file_operations relay_file_operations = { .read = relay_file_read, .llseek = no_llseek, .release = relay_file_release, + .sendfile = relay_file_sendfile, }; EXPORT_SYMBOL_GPL(relay_file_operations); -- cgit v1.2.3 From 6dac40a7ce2483a47b54af07afebeb84131c7228 Mon Sep 17 00:00:00 2001 From: Tom Zanussi Date: Thu, 23 Mar 2006 19:58:45 +0100 Subject: [PATCH] relay: consolidate sendfile() and read() code Signed-off-by: Jens Axboe --- kernel/relay.c | 175 ++++++++++++++++++++++++++++++--------------------------- 1 file changed, 91 insertions(+), 84 deletions(-) (limited to 'kernel') diff --git a/kernel/relay.c b/kernel/relay.c index fefe2b2a7277..33345e73485c 100644 --- a/kernel/relay.c +++ b/kernel/relay.c @@ -866,131 +866,138 @@ static size_t relay_file_read_end_pos(struct rchan_buf *buf, } /** - * relay_file_read - read file op for relay files - * @filp: the file - * @buffer: the 
userspace buffer - * @count: number of bytes to read - * @ppos: position to read from - * - * Reads count bytes or the number of bytes available in the - * current sub-buffer being read, whichever is smaller. + * subbuf_read_actor - read up to one subbuf's worth of data */ -static ssize_t relay_file_read(struct file *filp, - char __user *buffer, - size_t count, - loff_t *ppos) +static int subbuf_read_actor(size_t read_start, + struct rchan_buf *buf, + size_t avail, + read_descriptor_t *desc, + read_actor_t actor) { - struct rchan_buf *buf = filp->private_data; - struct inode *inode = filp->f_dentry->d_inode; - size_t read_start, avail; - ssize_t ret = 0; void *from; - - mutex_lock(&inode->i_mutex); - if(!relay_file_read_avail(buf, *ppos)) - goto out; - - read_start = relay_file_read_start_pos(*ppos, buf); - avail = relay_file_read_subbuf_avail(read_start, buf); - if (!avail) - goto out; + int ret = 0; from = buf->start + read_start; - ret = count = min(count, avail); - if (copy_to_user(buffer, from, count)) { - ret = -EFAULT; - goto out; + ret = avail; + if (copy_to_user(desc->arg.data, from, avail)) { + desc->error = -EFAULT; + ret = 0; } - relay_file_read_consume(buf, read_start, count); - *ppos = relay_file_read_end_pos(buf, read_start, count); -out: - mutex_unlock(&inode->i_mutex); + desc->arg.data += ret; + desc->written += ret; + desc->count -= ret; + return ret; } -static ssize_t relay_file_sendsubbuf(struct file *filp, loff_t *ppos, - size_t count, read_actor_t actor, - void *target) +/** + * subbuf_send_actor - send up to one subbuf's worth of data + */ +static int subbuf_send_actor(size_t read_start, + struct rchan_buf *buf, + size_t avail, + read_descriptor_t *desc, + read_actor_t actor) { - struct rchan_buf *buf = filp->private_data; - read_descriptor_t desc; - size_t read_start, avail; unsigned long pidx, poff; unsigned int subbuf_pages; - ssize_t ret = 0; - - if (!relay_file_read_avail(buf, *ppos)) - return 0; - - read_start = 
relay_file_read_start_pos(*ppos, buf); - avail = relay_file_read_subbuf_avail(read_start, buf); - if (!avail) - return 0; - - count = min(count, avail); - - desc.written = 0; - desc.count = count; - desc.arg.data = target; - desc.error = 0; + int ret = 0; subbuf_pages = buf->chan->alloc_size >> PAGE_SHIFT; pidx = (read_start / PAGE_SIZE) % subbuf_pages; poff = read_start & ~PAGE_MASK; - while (count) { + while (avail) { struct page *p = buf->page_array[pidx]; unsigned int len; len = PAGE_SIZE - poff; - if (len > count) - len = count; + if (len > avail) + len = avail; - len = actor(&desc, p, poff, len); - - if (desc.error) { - if (!ret) - ret = desc.error; + len = actor(desc, p, poff, len); + if (desc->error) break; - } - count -= len; + avail -= len; ret += len; poff = 0; pidx = (pidx + 1) % subbuf_pages; } - if (ret > 0) { - relay_file_read_consume(buf, read_start, ret); - *ppos = relay_file_read_end_pos(buf, read_start, ret); - } - return ret; } -static ssize_t relay_file_sendfile(struct file *filp, loff_t *ppos, - size_t count, read_actor_t actor, - void *target) +typedef int (*subbuf_actor_t) (size_t read_start, + struct rchan_buf *buf, + size_t avail, + read_descriptor_t *desc, + read_actor_t actor); + +/** + * relay_file_read_subbufs - read count bytes, bridging subbuf boundaries + */ +static inline ssize_t relay_file_read_subbufs(struct file *filp, + loff_t *ppos, + size_t count, + subbuf_actor_t subbuf_actor, + read_actor_t actor, + void *target) { - ssize_t sent = 0, ret = 0; + struct rchan_buf *buf = filp->private_data; + size_t read_start, avail; + read_descriptor_t desc; + int ret; if (!count) return 0; - mutex_lock(&filp->f_dentry->d_inode->i_mutex); + desc.written = 0; + desc.count = count; + desc.arg.data = target; + desc.error = 0; + mutex_lock(&filp->f_dentry->d_inode->i_mutex); do { - ret = relay_file_sendsubbuf(filp, ppos, count, actor, target); - if (ret < 0) { - if (!sent) - sent = ret; + if (!relay_file_read_avail(buf, *ppos)) + break; + + 
read_start = relay_file_read_start_pos(*ppos, buf); + avail = relay_file_read_subbuf_avail(read_start, buf); + if (!avail) break; - } - count -= ret; - sent += ret; - } while (count && ret); + avail = min(desc.count, avail); + ret = subbuf_actor(read_start, buf, avail, &desc, actor); + if (desc.error < 0) + break; + + if (ret) { + relay_file_read_consume(buf, read_start, ret); + *ppos = relay_file_read_end_pos(buf, read_start, ret); + } + } while (desc.count && ret); mutex_unlock(&filp->f_dentry->d_inode->i_mutex); - return sent; + + return desc.written; +} + +static ssize_t relay_file_read(struct file *filp, + char __user *buffer, + size_t count, + loff_t *ppos) +{ + return relay_file_read_subbufs(filp, ppos, count, subbuf_read_actor, + NULL, buffer); +} + +static ssize_t relay_file_sendfile(struct file *filp, + loff_t *ppos, + size_t count, + read_actor_t actor, + void *target) +{ + return relay_file_read_subbufs(filp, ppos, count, subbuf_send_actor, + actor, target); } struct file_operations relay_file_operations = { -- cgit v1.2.3 From 2056a782f8e7e65fd4bfd027506b4ce1c5e9ccd4 Mon Sep 17 00:00:00 2001 From: Jens Axboe Date: Thu, 23 Mar 2006 20:00:26 +0100 Subject: [PATCH] Block queue IO tracing support (blktrace) as of 2006-03-23 Signed-off-by: Jens Axboe --- kernel/fork.c | 1 + 1 file changed, 1 insertion(+) (limited to 'kernel') diff --git a/kernel/fork.c b/kernel/fork.c index c79ae0b19a49..c21bae8c93b9 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -181,6 +181,7 @@ static struct task_struct *dup_task_struct(struct task_struct *orig) /* One for us, one for whoever does the "release_task()" (usually parent) */ atomic_set(&tsk->usage,2); atomic_set(&tsk->fs_excl, 0); + tsk->btrace_seq = 0; return tsk; } -- cgit v1.2.3 From 36f574135e36b86bb6ae794bf1d0fce3efa5601f Mon Sep 17 00:00:00 2001 From: Andrew Morton Date: Fri, 24 Mar 2006 03:15:47 -0800 Subject: [PATCH] free_uid() locking improvement Reduce lock hold times in free_uid(). Cc: Ingo Molnar Cc: "Paul E. 
McKenney" Cc: David Howells Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/user.c | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/user.c b/kernel/user.c index d9deae43a9ab..2116642f42c6 100644 --- a/kernel/user.c +++ b/kernel/user.c @@ -105,15 +105,19 @@ void free_uid(struct user_struct *up) { unsigned long flags; + if (!up) + return; + local_irq_save(flags); - if (up && atomic_dec_and_lock(&up->__count, &uidhash_lock)) { + if (atomic_dec_and_lock(&up->__count, &uidhash_lock)) { uid_hash_remove(up); + spin_unlock_irqrestore(&uidhash_lock, flags); key_put(up->uid_keyring); key_put(up->session_keyring); kmem_cache_free(uid_cachep, up); - spin_unlock(&uidhash_lock); + } else { + local_irq_restore(flags); } - local_irq_restore(flags); } struct user_struct * alloc_uid(uid_t uid) -- cgit v1.2.3 From f6ef943813ac3085ece7252ea101d663581219f6 Mon Sep 17 00:00:00 2001 From: Bart Samwel Date: Fri, 24 Mar 2006 03:15:48 -0800 Subject: [PATCH] Represent dirty_*_centisecs as jiffies internally Make that the internal values for: /proc/sys/vm/dirty_writeback_centisecs /proc/sys/vm/dirty_expire_centisecs are stored as jiffies instead of centiseconds. Let the sysctl interface do the conversions with full precision using clock_t_to_jiffies, instead of doing overflow-sensitive on-the-fly conversions every time the values are used. Cons: apparent precision loss if HZ is not a multiple of 100, because of conversion back and forth. This is a common problem for all sysctl values that use proc_dointvec_userhz_jiffies. (There is only one other in-tree use, in net/core/neighbour.c.) 
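As a rough illustration of the precision loss noted above, the following user-space sketch models the centisecond/jiffies round trip with plain integer arithmetic. The sk_ helpers and the fixed USER_HZ of 100 are assumptions made for this example; they are a simplified stand-in for the kernel's clock_t_to_jiffies() and its inverse, not a copy of them.

```c
/* Assumed user-visible tick rate: 100 ticks per second (centiseconds). */
#define SK_USER_HZ 100L

/* Simplified model: centiseconds -> jiffies, truncating integer division. */
static long sk_cs_to_jiffies(long cs, long hz)
{
	return cs * hz / SK_USER_HZ;
}

/* Simplified model: jiffies -> centiseconds, truncating integer division. */
static long sk_jiffies_to_cs(long jiffies, long hz)
{
	return jiffies * SK_USER_HZ / hz;
}

/* What a user would read back after writing 'cs' through such an interface. */
static long sk_round_trip(long cs, long hz)
{
	return sk_jiffies_to_cs(sk_cs_to_jiffies(cs, hz), hz);
}
```

In this model, sk_round_trip(499, 250) yields 498 because 250 is not a multiple of 100, while sk_round_trip(499, 1000) returns 499 unchanged.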
Signed-off-by: Bart Samwel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/sysctl.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 32b48e8ee36e..817ba25517eb 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -742,18 +742,18 @@ static ctl_table vm_table[] = { { .ctl_name = VM_DIRTY_WB_CS, .procname = "dirty_writeback_centisecs", - .data = &dirty_writeback_centisecs, - .maxlen = sizeof(dirty_writeback_centisecs), + .data = &dirty_writeback_interval, + .maxlen = sizeof(dirty_writeback_interval), .mode = 0644, .proc_handler = &dirty_writeback_centisecs_handler, }, { .ctl_name = VM_DIRTY_EXPIRE_CS, .procname = "dirty_expire_centisecs", - .data = &dirty_expire_centisecs, - .maxlen = sizeof(dirty_expire_centisecs), + .data = &dirty_expire_interval, + .maxlen = sizeof(dirty_expire_interval), .mode = 0644, - .proc_handler = &proc_dointvec, + .proc_handler = &proc_dointvec_userhz_jiffies, }, { .ctl_name = VM_NR_PDFLUSH_THREADS, -- cgit v1.2.3 From ed5b43f15a8e86e3ae939b98bc161ee973ecedf2 Mon Sep 17 00:00:00 2001 From: Bart Samwel Date: Fri, 24 Mar 2006 03:15:49 -0800 Subject: [PATCH] Represent laptop_mode as jiffies internally Make that the internal value for /proc/sys/vm/laptop_mode is stored as jiffies instead of seconds. Let the sysctl interface do the conversions, instead of doing on-the-fly conversions every time the value is used. Add a description of the fact that laptop_mode doubles as a flag and a timeout to the comment above the laptop_mode variable. 
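The dual role described here can be pictured with a small user-space model. This is only a sketch: the sk_ names are hypothetical, HZ is passed in as a parameter, and in the kernel the conversion is done by the sysctl handlers rather than by code like this.

```c
/* 0 means laptop mode is off; any other value is a timeout in jiffies. */
static long sk_laptop_mode;

/* Model of a write through the sysctl: seconds are converted once, here. */
static void sk_laptop_mode_write(long seconds, long hz)
{
	sk_laptop_mode = seconds * hz;
}

/* Model of a read through the sysctl: jiffies are converted back to seconds. */
static long sk_laptop_mode_read(long hz)
{
	return sk_laptop_mode / hz;
}

/* Users of the variable can test it as a flag without any conversion. */
static int sk_laptop_mode_enabled(void)
{
	return sk_laptop_mode != 0;
}
```

Code that needs the timeout can then use the stored value directly as a jiffies delay instead of multiplying by HZ at every use.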
Signed-off-by: Bart Samwel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/sysctl.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 817ba25517eb..d13426680d10 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -848,9 +848,8 @@ static ctl_table vm_table[] = { .data = &laptop_mode, .maxlen = sizeof(laptop_mode), .mode = 0644, - .proc_handler = &proc_dointvec, - .strategy = &sysctl_intvec, - .extra1 = &zero, + .proc_handler = &proc_dointvec_jiffies, + .strategy = &sysctl_jiffies, }, { .ctl_name = VM_BLOCK_DUMP, -- cgit v1.2.3 From cba9f33d13a8ca3125b2a30abe2425ce562d8a83 Mon Sep 17 00:00:00 2001 From: Bart Samwel Date: Fri, 24 Mar 2006 03:15:50 -0800 Subject: [PATCH] Range checking in do_proc_dointvec_(userhz_)jiffies_conv When (integer) sysctl values are in either seconds or centiseconds, but represented internally as jiffies, the allowable value range is decreased. This patch adds range checks to the conversion routines. For values in seconds: maximum LONG_MAX / HZ. For values in centiseconds: maximum (LONG_MAX / HZ) * USER_HZ. (BTW, does anyone else feel that an interface in seconds should not be accepting negative values?) Signed-off-by: Bart Samwel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/sysctl.c | 4 ++++ 1 file changed, 4 insertions(+) (limited to 'kernel') diff --git a/kernel/sysctl.c b/kernel/sysctl.c index d13426680d10..e82726faeeff 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -2053,6 +2053,8 @@ static int do_proc_dointvec_jiffies_conv(int *negp, unsigned long *lvalp, int write, void *data) { if (write) { + if (*lvalp > LONG_MAX / HZ) + return 1; *valp = *negp ? 
-(*lvalp*HZ) : (*lvalp*HZ); } else { int val = *valp; @@ -2074,6 +2076,8 @@ static int do_proc_dointvec_userhz_jiffies_conv(int *negp, unsigned long *lvalp, int write, void *data) { if (write) { + if (USER_HZ < HZ && *lvalp > (LONG_MAX / HZ) * USER_HZ) + return 1; *valp = clock_t_to_jiffies(*negp ? -*lvalp : *lvalp); } else { int val = *valp; -- cgit v1.2.3 From caa9ee771de3195ae85ac6f8cb550f53e9ecdd82 Mon Sep 17 00:00:00 2001 From: Oleg Nesterov Date: Fri, 24 Mar 2006 03:15:50 -0800 Subject: [PATCH] rcu_process_callbacks: don't cli() while testing ->nxtlist __rcu_process_callbacks() disables interrupts to protect itself from call_rcu() which adds new entries to ->nxtlist. However we can check "->nxtlist != NULL" with interrupts enabled, we can't get "false positives" because call_rcu() can only change this condition from 0 to 1. Tested with rcutorture.ko. Signed-off-by: Oleg Nesterov Acked-by: Dipankar Sarma Cc: "Paul E. McKenney" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/rcupdate.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/rcupdate.c b/kernel/rcupdate.c index 6df1559b1c02..13458bbaa1be 100644 --- a/kernel/rcupdate.c +++ b/kernel/rcupdate.c @@ -416,8 +416,8 @@ static void __rcu_process_callbacks(struct rcu_ctrlblk *rcp, rdp->curtail = &rdp->curlist; } - local_irq_disable(); if (rdp->nxtlist && !rdp->curlist) { + local_irq_disable(); rdp->curlist = rdp->nxtlist; rdp->curtail = rdp->nxttail; rdp->nxtlist = NULL; @@ -442,9 +442,8 @@ static void __rcu_process_callbacks(struct rcu_ctrlblk *rcp, rcu_start_batch(rcp); spin_unlock(&rcp->lock); } - } else { - local_irq_enable(); } + rcu_check_quiescent_state(rcp, rdp); if (rdp->donelist) rcu_do_batch(rdp); -- cgit v1.2.3 From a4a6198b80cf82eb8160603c98da218d1bd5e104 Mon Sep 17 00:00:00 2001 From: Jan Beulich Date: Fri, 24 Mar 2006 03:15:54 -0800 Subject: [PATCH] tvec_bases too large for per-cpu data With internal Xen-enabled kernels we 
see the kernel's static per-cpu data area exceed the limit of 32k on x86-64, and even native x86-64 kernels get fairly close to that limit. I generally question whether it is reasonable to have data structures several kb in size allocated as per-cpu data when the space there is rather limited. The biggest arch-independent consumer is tvec_bases (over 4k on 32-bit archs, over 8k on 64-bit ones), which now gets converted to use dynamically allocated memory instead. Signed-off-by: Jan Beulich Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/timer.c | 45 ++++++++++++++++++++++++++++++++++----------- 1 file changed, 34 insertions(+), 11 deletions(-) (limited to 'kernel') diff --git a/kernel/timer.c b/kernel/timer.c index 2410c18dbeb1..4427e725ccdd 100644 --- a/kernel/timer.c +++ b/kernel/timer.c @@ -86,7 +86,8 @@ struct tvec_t_base_s { } ____cacheline_aligned_in_smp; typedef struct tvec_t_base_s tvec_base_t; -static DEFINE_PER_CPU(tvec_base_t, tvec_bases); +static DEFINE_PER_CPU(tvec_base_t *, tvec_bases); +static tvec_base_t boot_tvec_bases; static inline void set_running_timer(tvec_base_t *base, struct timer_list *timer) @@ -157,7 +158,7 @@ EXPORT_SYMBOL(__init_timer_base); void fastcall init_timer(struct timer_list *timer) { timer->entry.next = NULL; - timer->base = &per_cpu(tvec_bases, raw_smp_processor_id()).t_base; + timer->base = &per_cpu(tvec_bases, raw_smp_processor_id())->t_base; } EXPORT_SYMBOL(init_timer); @@ -218,7 +219,7 @@ int __mod_timer(struct timer_list *timer, unsigned long expires) ret = 1; } - new_base = &__get_cpu_var(tvec_bases); + new_base = __get_cpu_var(tvec_bases); if (base != &new_base->t_base) { /* @@ -258,7 +259,7 @@ EXPORT_SYMBOL(__mod_timer); */ void add_timer_on(struct timer_list *timer, int cpu) { - tvec_base_t *base = &per_cpu(tvec_bases, cpu); + tvec_base_t *base = per_cpu(tvec_bases, cpu); unsigned long flags; BUG_ON(timer_pending(timer) || !timer->function); @@ -504,7 +505,7 @@ unsigned long 
next_timer_interrupt(void) } hr_expires += jiffies; - base = &__get_cpu_var(tvec_bases); + base = __get_cpu_var(tvec_bases); spin_lock(&base->t_base.lock); expires = base->timer_jiffies + (LONG_MAX >> 1); list = NULL; @@ -901,7 +902,7 @@ EXPORT_SYMBOL(xtime_lock); */ static void run_timer_softirq(struct softirq_action *h) { - tvec_base_t *base = &__get_cpu_var(tvec_bases); + tvec_base_t *base = __get_cpu_var(tvec_bases); hrtimer_run_queues(); if (time_after_eq(jiffies, base->timer_jiffies)) @@ -1256,12 +1257,32 @@ asmlinkage long sys_sysinfo(struct sysinfo __user *info) return 0; } -static void __devinit init_timers_cpu(int cpu) +static int __devinit init_timers_cpu(int cpu) { int j; tvec_base_t *base; - base = &per_cpu(tvec_bases, cpu); + base = per_cpu(tvec_bases, cpu); + if (!base) { + static char boot_done; + + /* + * Cannot do allocation in init_timers as that runs before the + * allocator initializes (and would waste memory if there are + * more possible CPUs than will ever be installed/brought up). 
+ */ + if (boot_done) { + base = kmalloc_node(sizeof(*base), GFP_KERNEL, + cpu_to_node(cpu)); + if (!base) + return -ENOMEM; + memset(base, 0, sizeof(*base)); + } else { + base = &boot_tvec_bases; + boot_done = 1; + } + per_cpu(tvec_bases, cpu) = base; + } spin_lock_init(&base->t_base.lock); for (j = 0; j < TVN_SIZE; j++) { INIT_LIST_HEAD(base->tv5.vec + j); @@ -1273,6 +1294,7 @@ static void __devinit init_timers_cpu(int cpu) INIT_LIST_HEAD(base->tv1.vec + j); base->timer_jiffies = jiffies; + return 0; } #ifdef CONFIG_HOTPLUG_CPU @@ -1295,8 +1317,8 @@ static void __devinit migrate_timers(int cpu) int i; BUG_ON(cpu_online(cpu)); - old_base = &per_cpu(tvec_bases, cpu); - new_base = &get_cpu_var(tvec_bases); + old_base = per_cpu(tvec_bases, cpu); + new_base = get_cpu_var(tvec_bases); local_irq_disable(); spin_lock(&new_base->t_base.lock); @@ -1326,7 +1348,8 @@ static int __devinit timer_cpu_notify(struct notifier_block *self, long cpu = (long)hcpu; switch(action) { case CPU_UP_PREPARE: - init_timers_cpu(cpu); + if (init_timers_cpu(cpu) < 0) + return NOTIFY_BAD; break; #ifdef CONFIG_HOTPLUG_CPU case CPU_DEAD: -- cgit v1.2.3 From 95c3832272fc77ea3e31f6382f82ba17be985cc7 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Fri, 24 Mar 2006 03:15:58 -0800 Subject: [PATCH] rcutorture: tag success/failure line with module parameters A long-running rcutorture test can overflow dmesg, so that the line containing the module parameters is lost. Although it is usually possible to retrieve this information from the log files, it is much better to just tag it onto the final success/failure line so that it may be easily found. This patch does just that. Signed-off-by: "Paul E. 
McKenney" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/rcutorture.c | 23 +++++++++++++++-------- 1 file changed, 15 insertions(+), 8 deletions(-) (limited to 'kernel') diff --git a/kernel/rcutorture.c b/kernel/rcutorture.c index 7712912dbc84..9a1fa8894b95 100644 --- a/kernel/rcutorture.c +++ b/kernel/rcutorture.c @@ -441,6 +441,16 @@ rcu_torture_shuffle(void *arg) return 0; } +static inline void +rcu_torture_print_module_parms(char *tag) +{ + printk(KERN_ALERT TORTURE_FLAG "--- %s: nreaders=%d " + "stat_interval=%d verbose=%d test_no_idle_hz=%d " + "shuffle_interval = %d\n", + tag, nrealreaders, stat_interval, verbose, test_no_idle_hz, + shuffle_interval); +} + static void rcu_torture_cleanup(void) { @@ -483,9 +493,10 @@ rcu_torture_cleanup(void) rcu_barrier(); rcu_torture_stats_print(); /* -After- the stats thread is stopped! */ - printk(KERN_ALERT TORTURE_FLAG - "--- End of test: %s\n", - atomic_read(&n_rcu_torture_error) == 0 ? "SUCCESS" : "FAILURE"); + if (atomic_read(&n_rcu_torture_error)) + rcu_torture_print_module_parms("End of test: FAILURE"); + else + rcu_torture_print_module_parms("End of test: SUCCESS"); } static int @@ -501,11 +512,7 @@ rcu_torture_init(void) nrealreaders = nreaders; else nrealreaders = 2 * num_online_cpus(); - printk(KERN_ALERT TORTURE_FLAG "--- Start of test: nreaders=%d " - "stat_interval=%d verbose=%d test_no_idle_hz=%d " - "shuffle_interval = %d\n", - nrealreaders, stat_interval, verbose, test_no_idle_hz, - shuffle_interval); + rcu_torture_print_module_parms("Start of test"); fullstop = 0; /* Set up the freelist. */ -- cgit v1.2.3 From 7b5b9ef0e17d52c188fe73ea78e884fe67079e6c Mon Sep 17 00:00:00 2001 From: Paul Jackson Date: Fri, 24 Mar 2006 03:16:00 -0800 Subject: [PATCH] cpuset cleanup not not operators Since the test_bit() bit operator is boolean (return 0 or 1), the double not "!!" operations needed to convert a scalar (zero or not zero) to a boolean are not needed. 
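To make the reasoning concrete, here is a user-space sketch that assumes, as the message states, a bit test that already returns 0 or 1. sk_test_bit() and sk_mask_test() are hypothetical helpers written for this example and are not the kernel's test_bit() implementation.

```c
#include <limits.h>

#define SK_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/*
 * Shift the wanted bit down to position 0 and mask it, so the
 * result is already 0 or 1 and needs no further normalization.
 */
static int sk_test_bit(unsigned int nr, const unsigned long *addr)
{
	return (addr[nr / SK_BITS_PER_LONG] >> (nr % SK_BITS_PER_LONG)) & 1UL;
}

/*
 * A raw mask test yields 0 or the mask value itself (e.g. 8 for
 * bit 3), which is the case where "!!" is needed to get 0 or 1.
 */
static int sk_mask_test(unsigned int nr, const unsigned long *addr)
{
	return !!(addr[nr / SK_BITS_PER_LONG] & (1UL << (nr % SK_BITS_PER_LONG)));
}
```

Both helpers agree on every bit; the difference is only where the normalization to 0 or 1 happens.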
Signed-off-by: Paul Jackson Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/cpuset.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/kernel/cpuset.c b/kernel/cpuset.c index c86ee051b734..9f28e1f00185 100644 --- a/kernel/cpuset.c +++ b/kernel/cpuset.c @@ -114,27 +114,27 @@ typedef enum { /* convenient tests for these bits */ static inline int is_cpu_exclusive(const struct cpuset *cs) { - return !!test_bit(CS_CPU_EXCLUSIVE, &cs->flags); + return test_bit(CS_CPU_EXCLUSIVE, &cs->flags); } static inline int is_mem_exclusive(const struct cpuset *cs) { - return !!test_bit(CS_MEM_EXCLUSIVE, &cs->flags); + return test_bit(CS_MEM_EXCLUSIVE, &cs->flags); } static inline int is_removed(const struct cpuset *cs) { - return !!test_bit(CS_REMOVED, &cs->flags); + return test_bit(CS_REMOVED, &cs->flags); } static inline int notify_on_release(const struct cpuset *cs) { - return !!test_bit(CS_NOTIFY_ON_RELEASE, &cs->flags); + return test_bit(CS_NOTIFY_ON_RELEASE, &cs->flags); } static inline int is_memory_migrate(const struct cpuset *cs) { - return !!test_bit(CS_MEMORY_MIGRATE, &cs->flags); + return test_bit(CS_MEMORY_MIGRATE, &cs->flags); } /* -- cgit v1.2.3 From 8a39cc60bfa5a72f32d975729a354daca124f6de Mon Sep 17 00:00:00 2001 From: Paul Jackson Date: Fri, 24 Mar 2006 03:16:01 -0800 Subject: [PATCH] cpuset use combined atomic_inc_return calls Replace pairs of calls to atomic_inc() and atomic_read() with a single call to atomic_inc_return(), saving a few bytes of source and kernel text.
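The two forms can be compared in a user-space sketch. It uses C11 <stdatomic.h> as a stand-in for the kernel's atomic_t API, which is an assumption made for illustration; the sk_ function names are hypothetical. The message only claims a size saving; the comments also note a general property of the combined form.

```c
#include <stdatomic.h>

/*
 * Two-step form: increment, then read. Another thread may
 * increment in between, so the value read is not necessarily
 * the one produced by this call's increment.
 */
static int sk_next_generation_pair(atomic_int *gen)
{
	atomic_fetch_add(gen, 1);
	return atomic_load(gen);
}

/*
 * One-step form: atomic_fetch_add() returns the old value, so
 * old + 1 is exactly the value this increment produced.
 */
static int sk_next_generation_single(atomic_int *gen)
{
	return atomic_fetch_add(gen, 1) + 1;
}
```

With a single thread both helpers return the same sequence of values; they can differ only when increments from other threads interleave.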
Signed-off-by: Paul Jackson Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/cpuset.c | 11 ++++------- 1 file changed, 4 insertions(+), 7 deletions(-) (limited to 'kernel') diff --git a/kernel/cpuset.c b/kernel/cpuset.c index 9f28e1f00185..44d13c246e5c 100644 --- a/kernel/cpuset.c +++ b/kernel/cpuset.c @@ -858,8 +858,7 @@ static int update_nodemask(struct cpuset *cs, char *buf) mutex_lock(&callback_mutex); cs->mems_allowed = trialcs.mems_allowed; - atomic_inc(&cpuset_mems_generation); - cs->mems_generation = atomic_read(&cpuset_mems_generation); + cs->mems_generation = atomic_inc_return(&cpuset_mems_generation); mutex_unlock(&callback_mutex); set_cpuset_being_rebound(cs); /* causes mpol_copy() rebind */ @@ -1770,8 +1769,7 @@ static long cpuset_create(struct cpuset *parent, const char *name, int mode) atomic_set(&cs->count, 0); INIT_LIST_HEAD(&cs->sibling); INIT_LIST_HEAD(&cs->children); - atomic_inc(&cpuset_mems_generation); - cs->mems_generation = atomic_read(&cpuset_mems_generation); + cs->mems_generation = atomic_inc_return(&cpuset_mems_generation); fmeter_init(&cs->fmeter); cs->parent = parent; @@ -1861,7 +1859,7 @@ int __init cpuset_init_early(void) struct task_struct *tsk = current; tsk->cpuset = &top_cpuset; - tsk->cpuset->mems_generation = atomic_read(&cpuset_mems_generation); + tsk->cpuset->mems_generation = atomic_inc_return(&cpuset_mems_generation); return 0; } @@ -1880,8 +1878,7 @@ int __init cpuset_init(void) top_cpuset.mems_allowed = NODE_MASK_ALL; fmeter_init(&top_cpuset.fmeter); - atomic_inc(&cpuset_mems_generation); - top_cpuset.mems_generation = atomic_read(&cpuset_mems_generation); + top_cpuset.mems_generation = atomic_inc_return(&cpuset_mems_generation); init_task.cpuset = &top_cpuset; -- cgit v1.2.3 From 825a46af5ac171f9f41f794a0a00165588ba1589 Mon Sep 17 00:00:00 2001 From: Paul Jackson Date: Fri, 24 Mar 2006 03:16:03 -0800 Subject: [PATCH] cpuset memory spread basic implementation This patch provides the implementation 
and cpuset interface for an alternative memory allocation policy that can be applied to certain kinds of memory allocations, such as the page cache (file system buffers) and some slab caches (such as inode caches).

The policy is called "memory spreading." If enabled, it spreads out these kinds of memory allocations over all the nodes allowed to a task, instead of preferring to place them on the node where the task is executing.

All other kinds of allocations, including anonymous pages for a task's stack and data regions, are not affected by this policy choice, and continue to be allocated preferring the node local to execution, as modified by the NUMA mempolicy.

There are two boolean flag files per cpuset that control where the kernel allocates pages for the file system buffers and related in-kernel data structures. They are called 'memory_spread_page' and 'memory_spread_slab'.

If the per-cpuset boolean flag file 'memory_spread_page' is set, then the kernel will spread the file system buffers (page cache) evenly over all the nodes that the faulting task is allowed to use, instead of preferring to put those pages on the node where the task is running.

If the per-cpuset boolean flag file 'memory_spread_slab' is set, then the kernel will spread some file system related slab caches, such as those for inodes and dentries, evenly over all the nodes that the faulting task is allowed to use, instead of preferring to put those pages on the node where the task is running.

The implementation is simple. Setting the cpuset flags 'memory_spread_page' or 'memory_spread_slab' turns on the per-process flags PF_SPREAD_PAGE or PF_SPREAD_SLAB, respectively, for each task that is in the cpuset or subsequently joins that cpuset. In subsequent patches, the page allocation calls for the affected page cache and slab caches are modified to perform an inline check for these flags, and if set, a call to a new routine cpuset_mem_spread_node() returns the node to prefer for the allocation.
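The flag propagation and the inline check can be modeled in user space roughly as follows. This is a sketch under stated assumptions: the SK_ constants, the sk_ structures, the 32-node bitmask and the round-robin rotor are all hypothetical stand-ins chosen for the example, not the kernel's definitions of PF_SPREAD_PAGE, struct cpuset or cpuset_mem_spread_node().

```c
/* Hypothetical bit values; the real CS_* and PF_* constants differ. */
#define SK_CS_SPREAD_PAGE (1u << 0)
#define SK_CS_SPREAD_SLAB (1u << 1)
#define SK_PF_SPREAD_PAGE (1u << 0)
#define SK_PF_SPREAD_SLAB (1u << 1)
#define SK_MAX_NODES 32

struct sk_cpuset {
	unsigned int flags;		/* SK_CS_* bits */
};

struct sk_task {
	unsigned int flags;		/* SK_PF_* bits */
	unsigned int mems_allowed;	/* bit n set: node n is allowed */
	int rotor;			/* last node handed out */
};

/* Copy the cpuset's spread settings into the task's per-task flags. */
static void sk_update_task(struct sk_task *t, const struct sk_cpuset *cs)
{
	if (cs->flags & SK_CS_SPREAD_PAGE)
		t->flags |= SK_PF_SPREAD_PAGE;
	else
		t->flags &= ~SK_PF_SPREAD_PAGE;
	if (cs->flags & SK_CS_SPREAD_SLAB)
		t->flags |= SK_PF_SPREAD_SLAB;
	else
		t->flags &= ~SK_PF_SPREAD_SLAB;
}

/* Advance the rotor to the next allowed node, wrapping around. */
static int sk_spread_node(struct sk_task *t)
{
	int i;

	for (i = 1; i <= SK_MAX_NODES; i++) {
		int node = (t->rotor + i) % SK_MAX_NODES;

		if (t->mems_allowed & (1u << node)) {
			t->rotor = node;
			return node;
		}
	}
	return -1;	/* no node is allowed */
}

/* Inline check at allocation time: spread if flagged, else stay local. */
static int sk_page_cache_node(struct sk_task *t, int local_node)
{
	if (t->flags & SK_PF_SPREAD_PAGE)
		return sk_spread_node(t);
	return local_node;
}
```

Keeping the decision in a per-task flag means the allocation path only tests one bit before choosing between the local node and the rotor.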
The cpuset_mem_spread_node() routine is also simple. It uses the value of a per-task rotor cpuset_mem_spread_rotor to select the next node in the current task's mems_allowed to prefer for the allocation. This policy can provide substantial improvements for jobs that need to place thread-local data on the corresponding node, but that need to access large file system data sets that need to be spread across the several nodes in the job's cpuset in order to fit. Without this patch, especially for jobs that might have one thread reading in the data set, the memory allocation across the nodes in the job's cpuset can become very uneven. A couple of Copyright year ranges are updated as well. And a couple of email addresses that can be found in the MAINTAINERS file are removed. Signed-off-by: Paul Jackson Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/cpuset.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++++++++---- 1 file changed, 98 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/cpuset.c b/kernel/cpuset.c index 44d13c246e5c..38f18b33de6c 100644 --- a/kernel/cpuset.c +++ b/kernel/cpuset.c @@ -4,15 +4,14 @@ * Processor and Memory placement constraints for sets of tasks. * * Copyright (C) 2003 BULL SA. - * Copyright (C) 2004 Silicon Graphics, Inc. + * Copyright (C) 2004-2006 Silicon Graphics, Inc. * * Portions derived from Patrick Mochel's sysfs code. * sysfs is Copyright (c) 2001-3 Patrick Mochel - * Portions Copyright (c) 2004 Silicon Graphics, Inc. * - * 2003-10-10 Written by Simon Derr + * 2003-10-10 Written by Simon Derr. * 2003-10-22 Updates by Stephen Hemminger. - * 2004 May-July Rework by Paul Jackson + * 2004 May-July Rework by Paul Jackson. * * This file is subject to the terms and conditions of the GNU General Public * License.
See the file COPYING in the main directory of the Linux @@ -108,7 +107,9 @@ typedef enum { CS_MEM_EXCLUSIVE, CS_MEMORY_MIGRATE, CS_REMOVED, - CS_NOTIFY_ON_RELEASE + CS_NOTIFY_ON_RELEASE, + CS_SPREAD_PAGE, + CS_SPREAD_SLAB, } cpuset_flagbits_t; /* convenient tests for these bits */ @@ -137,6 +138,16 @@ static inline int is_memory_migrate(const struct cpuset *cs) return test_bit(CS_MEMORY_MIGRATE, &cs->flags); } +static inline int is_spread_page(const struct cpuset *cs) +{ + return test_bit(CS_SPREAD_PAGE, &cs->flags); +} + +static inline int is_spread_slab(const struct cpuset *cs) +{ + return test_bit(CS_SPREAD_SLAB, &cs->flags); +} + /* * Increment this atomic integer everytime any cpuset changes its * mems_allowed value. Users of cpusets can track this generation @@ -657,6 +668,14 @@ void cpuset_update_task_memory_state(void) cs = tsk->cpuset; /* Maybe changed when task not locked */ guarantee_online_mems(cs, &tsk->mems_allowed); tsk->cpuset_mems_generation = cs->mems_generation; + if (is_spread_page(cs)) + tsk->flags |= PF_SPREAD_PAGE; + else + tsk->flags &= ~PF_SPREAD_PAGE; + if (is_spread_slab(cs)) + tsk->flags |= PF_SPREAD_SLAB; + else + tsk->flags &= ~PF_SPREAD_SLAB; task_unlock(tsk); mutex_unlock(&callback_mutex); mpol_rebind_task(tsk, &tsk->mems_allowed); @@ -956,7 +975,8 @@ static int update_memory_pressure_enabled(struct cpuset *cs, char *buf) /* * update_flag - read a 0 or a 1 in a file and update associated flag * bit: the bit to update (CS_CPU_EXCLUSIVE, CS_MEM_EXCLUSIVE, - * CS_NOTIFY_ON_RELEASE, CS_MEMORY_MIGRATE) + * CS_NOTIFY_ON_RELEASE, CS_MEMORY_MIGRATE, + * CS_SPREAD_PAGE, CS_SPREAD_SLAB) * cs: the cpuset to update * buf: the buffer where we read the 0 or 1 * @@ -1187,6 +1207,8 @@ typedef enum { FILE_NOTIFY_ON_RELEASE, FILE_MEMORY_PRESSURE_ENABLED, FILE_MEMORY_PRESSURE, + FILE_SPREAD_PAGE, + FILE_SPREAD_SLAB, FILE_TASKLIST, } cpuset_filetype_t; @@ -1246,6 +1268,14 @@ static ssize_t cpuset_common_file_write(struct file *file, const char __user 
*us case FILE_MEMORY_PRESSURE: retval = -EACCES; break; + case FILE_SPREAD_PAGE: + retval = update_flag(CS_SPREAD_PAGE, cs, buffer); + cs->mems_generation = atomic_inc_return(&cpuset_mems_generation); + break; + case FILE_SPREAD_SLAB: + retval = update_flag(CS_SPREAD_SLAB, cs, buffer); + cs->mems_generation = atomic_inc_return(&cpuset_mems_generation); + break; case FILE_TASKLIST: retval = attach_task(cs, buffer, &pathbuf); break; @@ -1355,6 +1385,12 @@ static ssize_t cpuset_common_file_read(struct file *file, char __user *buf, case FILE_MEMORY_PRESSURE: s += sprintf(s, "%d", fmeter_getrate(&cs->fmeter)); break; + case FILE_SPREAD_PAGE: + *s++ = is_spread_page(cs) ? '1' : '0'; + break; + case FILE_SPREAD_SLAB: + *s++ = is_spread_slab(cs) ? '1' : '0'; + break; default: retval = -EINVAL; goto out; @@ -1718,6 +1754,16 @@ static struct cftype cft_memory_pressure = { .private = FILE_MEMORY_PRESSURE, }; +static struct cftype cft_spread_page = { + .name = "memory_spread_page", + .private = FILE_SPREAD_PAGE, +}; + +static struct cftype cft_spread_slab = { + .name = "memory_spread_slab", + .private = FILE_SPREAD_SLAB, +}; + static int cpuset_populate_dir(struct dentry *cs_dentry) { int err; @@ -1736,6 +1782,10 @@ static int cpuset_populate_dir(struct dentry *cs_dentry) return err; if ((err = cpuset_add_file(cs_dentry, &cft_memory_pressure)) < 0) return err; + if ((err = cpuset_add_file(cs_dentry, &cft_spread_page)) < 0) + return err; + if ((err = cpuset_add_file(cs_dentry, &cft_spread_slab)) < 0) + return err; if ((err = cpuset_add_file(cs_dentry, &cft_tasks)) < 0) return err; return 0; @@ -1764,6 +1814,10 @@ static long cpuset_create(struct cpuset *parent, const char *name, int mode) cs->flags = 0; if (notify_on_release(parent)) set_bit(CS_NOTIFY_ON_RELEASE, &cs->flags); + if (is_spread_page(parent)) + set_bit(CS_SPREAD_PAGE, &cs->flags); + if (is_spread_slab(parent)) + set_bit(CS_SPREAD_SLAB, &cs->flags); cs->cpus_allowed = CPU_MASK_NONE; cs->mems_allowed = 
NODE_MASK_NONE; atomic_set(&cs->count, 0); @@ -2200,6 +2254,44 @@ void cpuset_unlock(void) mutex_unlock(&callback_mutex); } +/** + * cpuset_mem_spread_node() - On which node to begin search for a page + * + * If a task is marked PF_SPREAD_PAGE or PF_SPREAD_SLAB (as for + * tasks in a cpuset with is_spread_page or is_spread_slab set), + * and if the memory allocation used cpuset_mem_spread_node() + * to determine on which node to start looking, as it will for + * certain page cache or slab cache pages such as used for file + * system buffers and inode caches, then instead of starting on the + * local node to look for a free page, rather spread the starting + * node around the tasks mems_allowed nodes. + * + * We don't have to worry about the returned node being offline + * because "it can't happen", and even if it did, it would be ok. + * + * The routines calling guarantee_online_mems() are careful to + * only set nodes in task->mems_allowed that are online. So it + * should not be possible for the following code to return an + * offline node. But if it did, that would be ok, as this routine + * is not returning the node where the allocation must be, only + * the node where the search should start. The zonelist passed to + * __alloc_pages() will include all nodes. If the slab allocator + * is passed an offline node, it will fall back to the local node. + * See kmem_cache_alloc_node(). + */ + +int cpuset_mem_spread_node(void) +{ + int node; + + node = next_node(current->cpuset_mem_spread_rotor, current->mems_allowed); + if (node == MAX_NUMNODES) + node = first_node(current->mems_allowed); + current->cpuset_mem_spread_rotor = node; + return node; +} +EXPORT_SYMBOL_GPL(cpuset_mem_spread_node); + /** * cpuset_excl_nodes_overlap - Do we overlap @p's mem_exclusive ancestors? * @p: pointer to task_struct of some other task. 
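For readers following the rotor logic in cpuset_mem_spread_node() above, here is a minimal user-space sketch of the same round-robin idea. It is not the kernel code: a uint64_t bitmask stands in for nodemask_t, and MAX_NODES, next_allowed_node() and spread_node() are hypothetical names introduced only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical limit for this sketch; it is not the kernel's MAX_NUMNODES. */
#define MAX_NODES 64

/*
 * Return the lowest allowed node strictly above 'prev',
 * or MAX_NODES if no such node exists.
 */
static int next_allowed_node(int prev, uint64_t allowed)
{
	for (int n = prev + 1; n < MAX_NODES; n++)
		if (allowed & (UINT64_C(1) << n))
			return n;
	return MAX_NODES;
}

/*
 * Advance the per-task rotor to the next allowed node, wrapping
 * around to the first allowed node when the end is reached.
 */
static int spread_node(int *rotor, uint64_t allowed)
{
	int node;

	assert(allowed != 0);	/* at least one node must be allowed */

	node = next_allowed_node(*rotor, allowed);
	if (node == MAX_NODES)
		node = next_allowed_node(-1, allowed);	/* first set bit */
	*rotor = node;
	return node;
}
```

With the rotor starting at 0 and nodes 1, 3 and 4 allowed, successive calls to spread_node() return 1, 3, 4 and then wrap back to 1.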
-- cgit v1.2.3 From c61afb181c649754ea221f104e268cbacfc993e3 Mon Sep 17 00:00:00 2001 From: Paul Jackson Date: Fri, 24 Mar 2006 03:16:08 -0800 Subject: [PATCH] cpuset memory spread slab cache optimizations The hooks in the slab cache allocator code path for support of NUMA mempolicies and cpuset memory spreading are in an important code path. Many systems will use neither feature. This patch optimizes those hooks down to a single check of some bits in the current task's task_struct flags. For non-NUMA systems, this hook and related code is already ifdef'd out. The optimization is done by using another task flag, set if the task is using a non-default NUMA mempolicy. Taking this flag bit along with the PF_SPREAD_PAGE and PF_SPREAD_SLAB flag bits added earlier in this 'cpuset memory spreading' patch set, one can check for the combination of any of these special-case memory placement mechanisms with a single test of the current task's task_struct flags. This patch also tightens up the code, to save a few bytes of kernel text space, and moves some of it out of line. Due to the nested inlines called from multiple places, we were ending up with three copies of this code, which once we get off the main code path (for local node allocation) seems a bit wasteful of instruction memory.
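The "single test" described above can be sketched in plain C as one mask-and-compare. This is only an illustration: the SK_* bit values and the needs_slow_path() helper are hypothetical, and the kernel's real PF_* flag values and call sites are not reproduced here.

```c
#include <stdbool.h>

/* Hypothetical bit values for this sketch; the kernel's PF_* values differ. */
#define SK_MEMPOLICY	0x1u	/* task uses a non-default NUMA mempolicy */
#define SK_SPREAD_PAGE	0x2u	/* spread page cache over allowed nodes */
#define SK_SPREAD_SLAB	0x4u	/* spread slab caches over allowed nodes */

/*
 * One AND against the combined mask answers "is any special memory
 * placement mechanism active?", so the common case (none of them set)
 * costs a single test before taking the fast local-node path.
 */
static bool needs_slow_path(unsigned int task_flags)
{
	return (task_flags &
		(SK_MEMPOLICY | SK_SPREAD_PAGE | SK_SPREAD_SLAB)) != 0;
}
```

Unrelated flag bits do not trigger the slow path, because they are masked out before the comparison.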
Signed-off-by: Paul Jackson Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/fork.c | 1 + 1 file changed, 1 insertion(+) (limited to 'kernel') diff --git a/kernel/fork.c b/kernel/fork.c index c21bae8c93b9..a02063903aaa 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1021,6 +1021,7 @@ static task_t *copy_process(unsigned long clone_flags, p->mempolicy = NULL; goto bad_fork_cleanup_cpuset; } + mpol_fix_fork_child_flag(p); #endif #ifdef CONFIG_DEBUG_MUTEXES -- cgit v1.2.3 From 8488bc359d674baf710992e4b641513ea5ebd212 Mon Sep 17 00:00:00 2001 From: Paul Jackson Date: Fri, 24 Mar 2006 03:16:10 -0800 Subject: [PATCH] cpuset: remove unnecessary NULL check Remove a no longer needed test for NULL cpuset pointer, with a little comment explaining why the test isn't needed. Signed-off-by: Paul Jackson Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/cpuset.c | 18 ++++++------------ 1 file changed, 6 insertions(+), 12 deletions(-) (limited to 'kernel') diff --git a/kernel/cpuset.c b/kernel/cpuset.c index 38f18b33de6c..bc4131141230 100644 --- a/kernel/cpuset.c +++ b/kernel/cpuset.c @@ -2023,7 +2023,7 @@ void cpuset_fork(struct task_struct *child) * because tsk is already marked PF_EXITING, so attach_task() won't * mess with it, or task is a failed fork, never visible to attach_task. * - * Hack: + * the_top_cpuset_hack: * * Set the exiting tasks cpuset to the root cpuset (top_cpuset). * @@ -2062,7 +2062,7 @@ void cpuset_exit(struct task_struct *tsk) struct cpuset *cs; cs = tsk->cpuset; - tsk->cpuset = &top_cpuset; /* Hack - see comment above */ + tsk->cpuset = &top_cpuset; /* the_top_cpuset_hack - see above */ if (notify_on_release(cs)) { char *pathbuf = NULL; @@ -2373,12 +2373,12 @@ void __cpuset_memory_pressure_bump(void) * - No need to task_lock(tsk) on this tsk->cpuset reference, as it * doesn't really matter if tsk->cpuset changes after we read it, * and we take manage_mutex, keeping attach_task() from changing it - * anyway. 
+ * anyway. No need to check that tsk->cpuset != NULL, thanks to + * the_top_cpuset_hack in cpuset_exit(), which sets an exiting tasks + * cpuset to top_cpuset. */ - static int proc_cpuset_show(struct seq_file *m, void *v) { - struct cpuset *cs; struct task_struct *tsk; char *buf; int retval = 0; @@ -2389,13 +2389,7 @@ static int proc_cpuset_show(struct seq_file *m, void *v) tsk = m->private; mutex_lock(&manage_mutex); - cs = tsk->cpuset; - if (!cs) { - retval = -EINVAL; - goto out; - } - - retval = cpuset_path(cs, buf, PAGE_SIZE); + retval = cpuset_path(tsk->cpuset, buf, PAGE_SIZE); if (retval < 0) goto out; seq_puts(m, buf); -- cgit v1.2.3 From 151a44202d097ae8b1bbaa6d8d2f97df30e3cd1e Mon Sep 17 00:00:00 2001 From: Paul Jackson Date: Fri, 24 Mar 2006 03:16:11 -0800 Subject: [PATCH] cpuset: don't need to mark cpuset_mems_generation atomic Drop the atomic_t marking on the cpuset static global cpuset_mems_generation. Since all access to it is guarded by the global manage_mutex, there is no need for further serialization of this value. Signed-off-by: Paul Jackson Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/cpuset.c | 19 +++++++++++-------- 1 file changed, 11 insertions(+), 8 deletions(-) (limited to 'kernel') diff --git a/kernel/cpuset.c b/kernel/cpuset.c index bc4131141230..702928664f42 100644 --- a/kernel/cpuset.c +++ b/kernel/cpuset.c @@ -149,7 +149,7 @@ static inline int is_spread_slab(const struct cpuset *cs) } /* - * Increment this atomic integer everytime any cpuset changes its + * Increment this integer everytime any cpuset changes its * mems_allowed value. Users of cpusets can track this generation * number, and avoid having to lock and reload mems_allowed unless * the cpuset they're using changes generation. 
@@ -163,8 +163,11 @@ static inline int is_spread_slab(const struct cpuset *cs) * on every visit to __alloc_pages(), to efficiently check whether * its current->cpuset->mems_allowed has changed, requiring an update * of its current->mems_allowed. + * + * Since cpuset_mems_generation is guarded by manage_mutex, + * there is no need to mark it atomic. */ -static atomic_t cpuset_mems_generation = ATOMIC_INIT(1); +static int cpuset_mems_generation; static struct cpuset top_cpuset = { .flags = ((1 << CS_CPU_EXCLUSIVE) | (1 << CS_MEM_EXCLUSIVE)), @@ -877,7 +880,7 @@ static int update_nodemask(struct cpuset *cs, char *buf) mutex_lock(&callback_mutex); cs->mems_allowed = trialcs.mems_allowed; - cs->mems_generation = atomic_inc_return(&cpuset_mems_generation); + cs->mems_generation = cpuset_mems_generation++; mutex_unlock(&callback_mutex); set_cpuset_being_rebound(cs); /* causes mpol_copy() rebind */ @@ -1270,11 +1273,11 @@ static ssize_t cpuset_common_file_write(struct file *file, const char __user *us break; case FILE_SPREAD_PAGE: retval = update_flag(CS_SPREAD_PAGE, cs, buffer); - cs->mems_generation = atomic_inc_return(&cpuset_mems_generation); + cs->mems_generation = cpuset_mems_generation++; break; case FILE_SPREAD_SLAB: retval = update_flag(CS_SPREAD_SLAB, cs, buffer); - cs->mems_generation = atomic_inc_return(&cpuset_mems_generation); + cs->mems_generation = cpuset_mems_generation++; break; case FILE_TASKLIST: retval = attach_task(cs, buffer, &pathbuf); @@ -1823,7 +1826,7 @@ static long cpuset_create(struct cpuset *parent, const char *name, int mode) atomic_set(&cs->count, 0); INIT_LIST_HEAD(&cs->sibling); INIT_LIST_HEAD(&cs->children); - cs->mems_generation = atomic_inc_return(&cpuset_mems_generation); + cs->mems_generation = cpuset_mems_generation++; fmeter_init(&cs->fmeter); cs->parent = parent; @@ -1913,7 +1916,7 @@ int __init cpuset_init_early(void) struct task_struct *tsk = current; tsk->cpuset = &top_cpuset; - tsk->cpuset->mems_generation = 
atomic_inc_return(&cpuset_mems_generation); + tsk->cpuset->mems_generation = cpuset_mems_generation++; return 0; } @@ -1932,7 +1935,7 @@ int __init cpuset_init(void) top_cpuset.mems_allowed = NODE_MASK_ALL; fmeter_init(&top_cpuset.fmeter); - top_cpuset.mems_generation = atomic_inc_return(&cpuset_mems_generation); + top_cpuset.mems_generation = cpuset_mems_generation++; init_task.cpuset = &top_cpuset; -- cgit v1.2.3 From 29afd49b72a9b2c26fa8c678bcf3976d0540446b Mon Sep 17 00:00:00 2001 From: Paul Jackson Date: Fri, 24 Mar 2006 03:16:12 -0800 Subject: [PATCH] cpuset: remove useless local variable initialization Remove a useless variable initialization in cpuset __cpuset_zone_allowed(). The local variable 'allowed' is unconditionally set before use, later on in the code, so does not need to be initialized. Not that it seems to matter to the code generated any, as the compiler optimizes out the superfluous assignment anyway. Signed-off-by: Paul Jackson Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/cpuset.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/cpuset.c b/kernel/cpuset.c index 702928664f42..18aea1bd1284 100644 --- a/kernel/cpuset.c +++ b/kernel/cpuset.c @@ -2205,7 +2205,7 @@ int __cpuset_zone_allowed(struct zone *z, gfp_t gfp_mask) { int node; /* node that zone z is on */ const struct cpuset *cs; /* current cpuset ancestors */ - int allowed = 1; /* is allocation in zone z allowed? */ + int allowed; /* is allocation in zone z allowed? */ if (in_interrupt()) return 1; -- cgit v1.2.3 From 2ea1c5392cc8ce249fb661db4f4cdfbbae373a89 Mon Sep 17 00:00:00 2001 From: "John Z. Bohach" Date: Fri, 24 Mar 2006 03:18:19 -0800 Subject: [PATCH] console_setup() depends (wrongly?) on CONFIG_PRINTK It appears that console_setup() code only gets compiled into the kernel if CONFIG_PRINTK is enabled. 
One detrimental side-effect of this is that serial8250_console_setup() never gets invoked when CONFIG_PRINTK is not set, resulting in baud rate not being read/parsed from command line (i.e. console=ttyS0,115200n8 is ignored, at least the baud rate part...) Attached patch moves console_setup() code from inside #ifdef CONFIG_PRINTK to outside (in printk.c), removing dependence on said config. option. Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/printk.c | 76 ++++++++++++++++++++++++++++----------------------------- 1 file changed, 38 insertions(+), 38 deletions(-) (limited to 'kernel') diff --git a/kernel/printk.c b/kernel/printk.c index 13ced0f7828f..8cc19431e74b 100644 --- a/kernel/printk.c +++ b/kernel/printk.c @@ -122,44 +122,6 @@ static char *log_buf = __log_buf; static int log_buf_len = __LOG_BUF_LEN; static unsigned long logged_chars; /* Number of chars produced since last read+clear operation */ -/* - * Setup a list of consoles. Called from init/main.c - */ -static int __init console_setup(char *str) -{ - char name[sizeof(console_cmdline[0].name)]; - char *s, *options; - int idx; - - /* - * Decode str into name, index, options. 
- */ - if (str[0] >= '0' && str[0] <= '9') { - strcpy(name, "ttyS"); - strncpy(name + 4, str, sizeof(name) - 5); - } else - strncpy(name, str, sizeof(name) - 1); - name[sizeof(name) - 1] = 0; - if ((options = strchr(str, ',')) != NULL) - *(options++) = 0; -#ifdef __sparc__ - if (!strcmp(str, "ttya")) - strcpy(name, "ttyS0"); - if (!strcmp(str, "ttyb")) - strcpy(name, "ttyS1"); -#endif - for (s = name; *s; s++) - if ((*s >= '0' && *s <= '9') || *s == ',') - break; - idx = simple_strtoul(s, NULL, 10); - *s = 0; - - add_preferred_console(name, idx, options); - return 1; -} - -__setup("console=", console_setup); - static int __init log_buf_len_setup(char *str) { unsigned long size = memparse(str, &str); @@ -659,6 +621,44 @@ static void call_console_drivers(unsigned long start, unsigned long end) #endif +/* + * Set up a list of consoles. Called from init/main.c + */ +static int __init console_setup(char *str) +{ + char name[sizeof(console_cmdline[0].name)]; + char *s, *options; + int idx; + + /* + * Decode str into name, index, options. + */ + if (str[0] >= '0' && str[0] <= '9') { + strcpy(name, "ttyS"); + strncpy(name + 4, str, sizeof(name) - 5); + } else { + strncpy(name, str, sizeof(name) - 1); + } + name[sizeof(name) - 1] = 0; + if ((options = strchr(str, ',')) != NULL) + *(options++) = 0; +#ifdef __sparc__ + if (!strcmp(str, "ttya")) + strcpy(name, "ttyS0"); + if (!strcmp(str, "ttyb")) + strcpy(name, "ttyS1"); +#endif + for (s = name; *s; s++) + if ((*s >= '0' && *s <= '9') || *s == ',') + break; + idx = simple_strtoul(s, NULL, 10); + *s = 0; + + add_preferred_console(name, idx, options); + return 1; +} +__setup("console=", console_setup); + /** * add_preferred_console - add a device to the list of preferred consoles. 
* @name: device name -- cgit v1.2.3 From ec9e16bacdba1da1ee15dd162384e22df5c87e09 Mon Sep 17 00:00:00 2001 From: Andrew Morton Date: Fri, 24 Mar 2006 03:18:34 -0800 Subject: [PATCH] sys_setrlimit() cleanup - Whitespace cleanups - Make that expression comprehensible. There's a potential logic change here: we do the "is it_prof_expires equal to zero" test after converting it to seconds, rather than doing the comparison between raw cputime_t's. But given that it's in units of seconds anyway, that shouldn't change anything. Cc: Martin Schwidefsky Cc: Ulrich Weigand Cc: Cliff Wickman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/sys.c | 26 +++++++++++++++----------- 1 file changed, 15 insertions(+), 11 deletions(-) (limited to 'kernel') diff --git a/kernel/sys.c b/kernel/sys.c index c0fcad9f826c..9bdf94f3ae29 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -1630,20 +1630,21 @@ asmlinkage long sys_old_getrlimit(unsigned int resource, struct rlimit __user *r asmlinkage long sys_setrlimit(unsigned int resource, struct rlimit __user *rlim) { struct rlimit new_rlim, *old_rlim; + unsigned long it_prof_secs; int retval; if (resource >= RLIM_NLIMITS) return -EINVAL; - if(copy_from_user(&new_rlim, rlim, sizeof(*rlim))) + if (copy_from_user(&new_rlim, rlim, sizeof(*rlim))) return -EFAULT; - if (new_rlim.rlim_cur > new_rlim.rlim_max) - return -EINVAL; + if (new_rlim.rlim_cur > new_rlim.rlim_max) + return -EINVAL; old_rlim = current->signal->rlim + resource; if ((new_rlim.rlim_max > old_rlim->rlim_max) && !capable(CAP_SYS_RESOURCE)) return -EPERM; if (resource == RLIMIT_NOFILE && new_rlim.rlim_max > NR_OPEN) - return -EPERM; + return -EPERM; retval = security_task_setrlimit(resource, &new_rlim); if (retval) @@ -1653,19 +1654,22 @@ asmlinkage long sys_setrlimit(unsigned int resource, struct rlimit __user *rlim) *old_rlim = new_rlim; task_unlock(current->group_leader); - if (resource == RLIMIT_CPU && new_rlim.rlim_cur != RLIM_INFINITY && - 
(cputime_eq(current->signal->it_prof_expires, cputime_zero) || - new_rlim.rlim_cur <= cputime_to_secs( - current->signal->it_prof_expires))) { + if (resource != RLIMIT_CPU) + goto out; + if (new_rlim.rlim_cur == RLIM_INFINITY) + goto out; + + it_prof_secs = cputime_to_secs(current->signal->it_prof_expires); + if (it_prof_secs == 0 || new_rlim.rlim_cur <= it_prof_secs) { cputime_t cputime = secs_to_cputime(new_rlim.rlim_cur); + read_lock(&tasklist_lock); spin_lock_irq(&current->sighand->siglock); - set_process_cpu_timer(current, CPUCLOCK_PROF, - &cputime, NULL); + set_process_cpu_timer(current, CPUCLOCK_PROF, &cputime, NULL); spin_unlock_irq(&current->sighand->siglock); read_unlock(&tasklist_lock); } - +out: return 0; } -- cgit v1.2.3 From e0661111e5441995f7a69dc4336c9f131cb9bc58 Mon Sep 17 00:00:00 2001 From: Andrew Morton Date: Fri, 24 Mar 2006 03:18:35 -0800 Subject: [PATCH] RLIMIT_CPU: fix handling of a zero limit At present the kernel doesn't honour an attempt to set RLIMIT_CPU to zero seconds. But the spec says it should, and that's what 2.4.x does. Fixing this for real would involve some complexity (such as adding a new it-has-been-set flag to the task_struct, and testing that everywhere, instead of overloading the value of it_prof_expires). Given that a 2.4 kernel won't actually send the signal until one second has expired anyway, let's just handle this case by treating the caller's zero-seconds as one second.
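The workaround described above boils down to a one-line clamp. A minimal sketch, assuming a stand-alone helper (effective_cpu_limit_secs() is a hypothetical name, not a kernel function):

```c
/*
 * Zero is reserved internally to mean "no CPU limit was ever set",
 * so a caller's request for a zero-second limit is stored as one
 * second instead. All other values pass through unchanged.
 */
static unsigned long effective_cpu_limit_secs(unsigned long rlim_cur)
{
	return rlim_cur == 0 ? 1 : rlim_cur;
}
```

The trade-off is that a zero limit expires after one second rather than immediately, which the message above argues matches what a 2.4 kernel does in practice.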
Cc: Martin Schwidefsky Cc: Ulrich Weigand Cc: Cliff Wickman Acked-by: Ingo Molnar Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/sys.c | 13 ++++++++++++- 1 file changed, 12 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/sys.c b/kernel/sys.c index 9bdf94f3ae29..9e157e0240d4 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -1661,8 +1661,19 @@ asmlinkage long sys_setrlimit(unsigned int resource, struct rlimit __user *rlim) it_prof_secs = cputime_to_secs(current->signal->it_prof_expires); if (it_prof_secs == 0 || new_rlim.rlim_cur <= it_prof_secs) { - cputime_t cputime = secs_to_cputime(new_rlim.rlim_cur); + unsigned long rlim_cur = new_rlim.rlim_cur; + cputime_t cputime; + if (rlim_cur == 0) { + /* + * The caller is asking for an immediate RLIMIT_CPU + * expiry. But we use the zero value to mean "it was + * never set". So let's cheat and make it one second + * instead + */ + rlim_cur = 1; + } + cputime = secs_to_cputime(rlim_cur); read_lock(&tasklist_lock); spin_lock_irq(&current->sighand->siglock); set_process_cpu_timer(current, CPUCLOCK_PROF, &cputime, NULL); -- cgit v1.2.3 From d3561f78fd379a7110e46c87964ba7aa4120235c Mon Sep 17 00:00:00 2001 From: Andrew Morton Date: Fri, 24 Mar 2006 03:18:36 -0800 Subject: [PATCH] RLIMIT_CPU: document wrong return value Document the fact that setrlimit(RLIMIT_CPU) doesn't return error codes when it should. I don't think we can fix this without a 2.7.x.. Cc: Martin Schwidefsky Cc: Ulrich Weigand Cc: Cliff Wickman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/sys.c | 7 +++++++ 1 file changed, 7 insertions(+) (limited to 'kernel') diff --git a/kernel/sys.c b/kernel/sys.c index 9e157e0240d4..19d058be49d4 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -1656,6 +1656,13 @@ asmlinkage long sys_setrlimit(unsigned int resource, struct rlimit __user *rlim) if (resource != RLIMIT_CPU) goto out; + + /* + * RLIMIT_CPU handling.
Note that the kernel fails to return an error + * code if it rejected the user's attempt to set RLIMIT_CPU. This is a + * very long-standing error, and fixing it now risks breakage of + * applications, so we live with it + */ if (new_rlim.rlim_cur == RLIM_INFINITY) goto out; -- cgit v1.2.3 From 6a4d11c2abc57ed7ca42041e5f68ae4f7f640a81 Mon Sep 17 00:00:00 2001 From: Sergey Vlasov Date: Fri, 24 Mar 2006 03:18:38 -0800 Subject: [PATCH] Fix module refcount leak in __set_personality() If the change of personality does not lead to change of exec domain, __set_personality() returned without releasing the module reference acquired by lookup_exec_domain(). Signed-off-by: Sergey Vlasov Cc: Christoph Hellwig Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/exec_domain.c | 1 + 1 file changed, 1 insertion(+) (limited to 'kernel') diff --git a/kernel/exec_domain.c b/kernel/exec_domain.c index 867d6dbeb574..c01cead2cfd6 100644 --- a/kernel/exec_domain.c +++ b/kernel/exec_domain.c @@ -140,6 +140,7 @@ __set_personality(u_long personality) ep = lookup_exec_domain(personality); if (ep == current_thread_info()->exec_domain) { current->personality = personality; + module_put(ep->module); return 0; } -- cgit v1.2.3 From 6687a97d4041f996f725902d2990e5de6ef5cbe5 Mon Sep 17 00:00:00 2001 From: Ingo Molnar Date: Fri, 24 Mar 2006 03:18:41 -0800 Subject: [PATCH] timer-irq-driven soft-watchdog, cleanups Make the softlockup detector purely timer-interrupt driven, removing softirq-context (timer) dependencies. This means that if the softlockup watchdog triggers, it has truly observed a longer than 10 seconds scheduling delay of a SCHED_FIFO prio 99 task. 
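The 10-second criterion above amounts to a wraparound-safe comparison of tick counters. A minimal user-space sketch, assuming a hypothetical tick rate SK_HZ and hypothetical helper names (tick_after() follows the same idea as the kernel's time_after(), but is not its definition):

```c
#include <stdbool.h>

/* Hypothetical tick rate for this sketch; the kernel's HZ is a build option. */
#define SK_HZ 250UL

/*
 * True if tick count 'a' is later than 'b', even when the unsigned
 * counter has wrapped in between. Relies on the usual two's-complement
 * conversion of the unsigned difference to a signed long.
 */
static bool tick_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/*
 * A soft lockup is reported once more than ten seconds' worth of ticks
 * have passed since the watchdog task last updated its timestamp.
 */
static bool soft_lockup_detected(unsigned long now, unsigned long last_touch)
{
	return tick_after(now, last_touch + 10 * SK_HZ);
}
```

Exactly ten seconds of ticks does not trip the check; one tick more does, including when the counter wraps past zero.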
(the patch also turns off the softlockup detector during the initial bootup phase and does small style fixes) Signed-off-by: Ingo Molnar Signed-off-by: Ingo Molnar Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/softlockup.c | 54 +++++++++++++++++++++++++++++------------------------ kernel/timer.c | 2 +- 2 files changed, 31 insertions(+), 25 deletions(-) (limited to 'kernel') diff --git a/kernel/softlockup.c b/kernel/softlockup.c index c67189a25d52..dd9524fa649a 100644 --- a/kernel/softlockup.c +++ b/kernel/softlockup.c @@ -1,12 +1,11 @@ /* * Detect Soft Lockups * - * started by Ingo Molnar, (C) 2005, Red Hat + * started by Ingo Molnar, Copyright (C) 2005, 2006 Red Hat, Inc. * * this code detects soft lockups: incidents in where on a CPU * the kernel does not reschedule for 10 seconds or more. */ - #include #include #include @@ -17,13 +16,14 @@ static DEFINE_SPINLOCK(print_lock); -static DEFINE_PER_CPU(unsigned long, timestamp) = 0; -static DEFINE_PER_CPU(unsigned long, print_timestamp) = 0; +static DEFINE_PER_CPU(unsigned long, touch_timestamp); +static DEFINE_PER_CPU(unsigned long, print_timestamp); static DEFINE_PER_CPU(struct task_struct *, watchdog_task); static int did_panic = 0; -static int softlock_panic(struct notifier_block *this, unsigned long event, - void *ptr) + +static int +softlock_panic(struct notifier_block *this, unsigned long event, void *ptr) { did_panic = 1; @@ -36,7 +36,7 @@ static struct notifier_block panic_block = { void touch_softlockup_watchdog(void) { - per_cpu(timestamp, raw_smp_processor_id()) = jiffies; + per_cpu(touch_timestamp, raw_smp_processor_id()) = jiffies; } EXPORT_SYMBOL(touch_softlockup_watchdog); @@ -44,25 +44,35 @@ EXPORT_SYMBOL(touch_softlockup_watchdog); * This callback runs from the timer interrupt, and checks * whether the watchdog thread has hung or not: */ -void softlockup_tick(struct pt_regs *regs) +void softlockup_tick(void) { int this_cpu = smp_processor_id(); - unsigned long timestamp = 
per_cpu(timestamp, this_cpu); + unsigned long touch_timestamp = per_cpu(touch_timestamp, this_cpu); - if (per_cpu(print_timestamp, this_cpu) == timestamp) + /* prevent double reports: */ + if (per_cpu(print_timestamp, this_cpu) == touch_timestamp || + did_panic || + !per_cpu(watchdog_task, this_cpu)) return; - /* Do not cause a second panic when there already was one */ - if (did_panic) + /* do not print during early bootup: */ + if (unlikely(system_state != SYSTEM_RUNNING)) { + touch_softlockup_watchdog(); return; + } - if (time_after(jiffies, timestamp + 10*HZ)) { - per_cpu(print_timestamp, this_cpu) = timestamp; + /* Wake up the high-prio watchdog task every second: */ + if (time_after(jiffies, touch_timestamp + HZ)) + wake_up_process(per_cpu(watchdog_task, this_cpu)); + + /* Warn about unreasonable 10+ seconds delays: */ + if (time_after(jiffies, touch_timestamp + 10*HZ)) { + per_cpu(print_timestamp, this_cpu) = touch_timestamp; spin_lock(&print_lock); printk(KERN_ERR "BUG: soft lockup detected on CPU#%d!\n", this_cpu); - show_regs(regs); + dump_stack(); spin_unlock(&print_lock); } } @@ -77,18 +87,16 @@ static int watchdog(void * __bind_cpu) sched_setscheduler(current, SCHED_FIFO, &param); current->flags |= PF_NOFREEZE; - set_current_state(TASK_INTERRUPTIBLE); - /* - * Run briefly once per second - if this gets delayed for - * more than 10 seconds then the debug-printout triggers - * in softlockup_tick(): + * Run briefly once per second to reset the softlockup timestamp. + * If this gets delayed for more than 10 seconds then the + * debug-printout triggers in softlockup_tick().
*/ while (!kthread_should_stop()) { - msleep_interruptible(1000); + set_current_state(TASK_INTERRUPTIBLE); touch_softlockup_watchdog(); + schedule(); } - __set_current_state(TASK_RUNNING); return 0; } @@ -114,7 +122,6 @@ cpu_callback(struct notifier_block *nfb, unsigned long action, void *hcpu) kthread_bind(p, hotcpu); break; case CPU_ONLINE: - wake_up_process(per_cpu(watchdog_task, hotcpu)); break; #ifdef CONFIG_HOTPLUG_CPU @@ -146,4 +153,3 @@ __init void spawn_softlockup_task(void) notifier_chain_register(&panic_notifier_list, &panic_block); } - diff --git a/kernel/timer.c b/kernel/timer.c index 4427e725ccdd..17d956cebcb9 100644 --- a/kernel/timer.c +++ b/kernel/timer.c @@ -915,6 +915,7 @@ static void run_timer_softirq(struct softirq_action *h) void run_local_timers(void) { raise_softirq(TIMER_SOFTIRQ); + softlockup_tick(); } /* @@ -945,7 +946,6 @@ void do_timer(struct pt_regs *regs) /* prevent loading jiffies before storing new jiffies_64 value. */ barrier(); update_times(); - softlockup_tick(regs); } #ifdef __ARCH_WANT_SYS_ALARM -- cgit v1.2.3 From 24277dda3a54aa5e6265487e1a3091e27f3c0c45 Mon Sep 17 00:00:00 2001 From: Davi Arnaut Date: Fri, 24 Mar 2006 03:18:43 -0800 Subject: [PATCH] strndup_user: convert module Change hand-coded userspace string copying to strndup_user. 
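As a rough user-space analogy to what a bounded "duplicate this string" helper does, the sketch below finds the terminator within a limit, allocates, copies, and always terminates the result. It is not strndup_user(): dup_bounded() is a hypothetical name, there is no user/kernel boundary here, and errors are reported through errno and NULL rather than the ERR_PTR-style return that the IS_ERR()/PTR_ERR() calls in the patch below indicate.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/*
 * Duplicate the string at 'src', examining at most 'max_len' bytes
 * (terminating NUL included). Returns a malloc()ed copy, or NULL with
 * errno set to EINVAL (no NUL within the bound) or ENOMEM.
 */
static char *dup_bounded(const char *src, size_t max_len)
{
	const char *nul = memchr(src, '\0', max_len);
	size_t len;
	char *copy;

	if (!nul) {
		errno = EINVAL;		/* string too long for the bound */
		return NULL;
	}
	len = (size_t)(nul - src);

	copy = malloc(len + 1);
	if (!copy) {
		errno = ENOMEM;
		return NULL;
	}
	memcpy(copy, src, len);
	copy[len] = '\0';		/* terminate regardless of source */
	return copy;
}
```

Folding the length check, allocation and copy into one helper removes the separate error paths that the hand-coded version in load_module() had to carry.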
Signed-off-by: Davi Arnaut Cc: David Howells Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/module.c | 19 +++---------------- 1 file changed, 3 insertions(+), 16 deletions(-) (limited to 'kernel') diff --git a/kernel/module.c b/kernel/module.c index fb404299082e..54623c714bba 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -1572,7 +1572,6 @@ static struct module *load_module(void __user *umod, exportindex, modindex, obsparmindex, infoindex, gplindex, crcindex, gplcrcindex, versindex, pcpuindex, gplfutureindex, gplfuturecrcindex; - long arglen; struct module *mod; long err = 0; void *percpu = NULL, *ptr = NULL; /* Stops spurious gcc warning */ @@ -1691,23 +1690,11 @@ static struct module *load_module(void __user *umod, } /* Now copy in args */ - arglen = strlen_user(uargs); - if (!arglen) { - err = -EFAULT; - goto free_hdr; - } - args = kmalloc(arglen, GFP_KERNEL); - if (!args) { - err = -ENOMEM; + args = strndup_user(uargs, ~0UL >> 1); + if (IS_ERR(args)) { + err = PTR_ERR(args); goto free_hdr; } - if (copy_from_user(args, uargs, arglen) != 0) { - err = -EFAULT; - goto free_mod; - } - - /* Userspace could have altered the string after the strlen_user() */ - args[arglen - 1] = '\0'; if (find_module(mod->name)) { err = -EEXIST; -- cgit v1.2.3 From f125b56113be4956867cc9bd098bb99b1b9bb93f Mon Sep 17 00:00:00 2001 From: Andrew Morton Date: Fri, 24 Mar 2006 03:18:44 -0800 Subject: [PATCH] fix build error if CONFIG_SYSFS=n uevent_seqnum and uevent_helper are only defined if CONFIG_HOTPLUG=y, CONFIG_NET=n. (I stole this back from Greg's tree - it makes allnoconfig work). 
Cc: Greg KH Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/ksysfs.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/ksysfs.c b/kernel/ksysfs.c index f2690ed74530..f119e098e67b 100644 --- a/kernel/ksysfs.c +++ b/kernel/ksysfs.c @@ -22,7 +22,7 @@ static struct subsys_attribute _name##_attr = __ATTR_RO(_name) static struct subsys_attribute _name##_attr = \ __ATTR(_name, 0644, _name##_show, _name##_store) -#ifdef CONFIG_HOTPLUG +#if defined(CONFIG_HOTPLUG) && defined(CONFIG_NET) /* current uevent sequence number */ static ssize_t uevent_seqnum_show(struct subsystem *subsys, char *page) { @@ -52,7 +52,7 @@ decl_subsys(kernel, NULL, NULL); EXPORT_SYMBOL_GPL(kernel_subsys); static struct attribute * kernel_attrs[] = { -#ifdef CONFIG_HOTPLUG +#if defined(CONFIG_HOTPLUG) && defined(CONFIG_NET) &uevent_seqnum_attr.attr, &uevent_helper_attr.attr, #endif -- cgit v1.2.3 From 6978c7052f2e22c6c40781cdd4eba5c4bce9a789 Mon Sep 17 00:00:00 2001 From: Eric Sesterhenn Date: Fri, 24 Mar 2006 18:45:21 +0100 Subject: BUG_ON() Conversion in kernel/cpu.c this changes if() BUG(); constructs to BUG_ON() which is cleaner, contains unlikely() and can better optimized away. Signed-off-by: Eric Sesterhenn Signed-off-by: Adrian Bunk --- kernel/cpu.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/cpu.c b/kernel/cpu.c index e882c6babf41..8be22bd80933 100644 --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -223,8 +223,7 @@ int __devinit cpu_up(unsigned int cpu) ret = __cpu_up(cpu); if (ret != 0) goto out_notify; - if (!cpu_online(cpu)) - BUG(); + BUG_ON(!cpu_online(cpu)); /* Now call notifier in preparation. 
*/ notifier_call_chain(&cpu_chain, CPU_ONLINE, hcpu); -- cgit v1.2.3 From 185ae6d7a32721e9062030a9f2d24ed714fa45df Mon Sep 17 00:00:00 2001 From: Andrew Morton Date: Sat, 25 Mar 2006 03:06:32 -0800 Subject: [PATCH] timer irq driven soft watchdog fix I seem to have lost this hunk in yesterday's patch. It brings the coming-online CPU's softlockup timer up to date so we don't get false-positive tripups during CPU hot-add. Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/softlockup.c | 1 + 1 file changed, 1 insertion(+) (limited to 'kernel') diff --git a/kernel/softlockup.c b/kernel/softlockup.c index dd9524fa649a..d9b3d5847ed8 100644 --- a/kernel/softlockup.c +++ b/kernel/softlockup.c @@ -118,6 +118,7 @@ cpu_callback(struct notifier_block *nfb, unsigned long action, void *hcpu) printk("watchdog for %i failed\n", hotcpu); return NOTIFY_BAD; } + per_cpu(touch_timestamp, hotcpu) = jiffies; per_cpu(watchdog_task, hotcpu) = p; kthread_bind(p, hotcpu); break; -- cgit v1.2.3 From c08b8a49100715b20e6f7c997e992428b5e06078 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Sat, 25 Mar 2006 03:06:33 -0800 Subject: [PATCH] sys_alarm() unsigned signed conversion fixup alarm() calls the kernel with an unsigend int timeout in seconds. The value is stored in the tv_sec field of a struct timeval to setup the itimer. The tv_sec field of struct timeval is of type long, which causes the tv_sec value to be negative on 32 bit machines if seconds > INT_MAX. Before the hrtimer merge (pre 2.6.16) such a negative value was converted to the maximum jiffies timeout by the timeval_to_jiffies conversion. It's not clear whether this was intended or just happened to be done by the timeval_to_jiffies code. hrtimers expect a timeval in canonical form and treat a negative timeout as already expired. This breaks the legitimate usage of alarm() with a timeout value > INT_MAX seconds. 
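For illustration, the conversion problem described above can be reproduced in userspace. This is a minimal sketch, not kernel code: int32_t stands in for the 32-bit long tv_sec field, both helper names are hypothetical, and the clamp only mirrors the INT_MAX limit that this patch applies.

```c
#include <limits.h>
#include <stdint.h>

/*
 * Userspace illustration only. int32_t models a 32-bit `long` tv_sec.
 * The out-of-range conversion is implementation-defined in C; common
 * compilers wrap it to a negative value.
 */
static int32_t tv_sec_unclamped(unsigned int seconds)
{
	return (int32_t)seconds;
}

/* Hypothetical helper: limit the value before storing it in tv_sec. */
static int32_t tv_sec_clamped(unsigned int seconds)
{
	if (seconds > INT32_MAX)
		seconds = INT32_MAX;
	return (int32_t)seconds;
}
```

With a request of 3000000000 seconds the unclamped value is negative on typical compilers, which hrtimers would treat as already expired; the clamped value stays at the maximum positive timeout.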
For 32 bit machines it is therefor necessary to limit the internal seconds value to avoid API breakage. Instead of doing this in all implementations of sys_alarm the duplicated sys_alarm code is moved into a common function in itimer.c Signed-off-by: Thomas Gleixner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/itimer.c | 37 +++++++++++++++++++++++++++++++++++++ kernel/timer.c | 14 +------------- 2 files changed, 38 insertions(+), 13 deletions(-) (limited to 'kernel') diff --git a/kernel/itimer.c b/kernel/itimer.c index 379be2f8c84c..a2dc375927d8 100644 --- a/kernel/itimer.c +++ b/kernel/itimer.c @@ -226,6 +226,43 @@ again: return 0; } +/** + * alarm_setitimer - set alarm in seconds + * + * @seconds: number of seconds until alarm + * 0 disables the alarm + * + * Returns the remaining time in seconds of a pending timer or 0 when + * the timer is not active. + * + * On 32 bit machines the seconds value is limited to (INT_MAX/2) to avoid + * negative timeval settings which would cause immediate expiry. + */ +unsigned int alarm_setitimer(unsigned int seconds) +{ + struct itimerval it_new, it_old; + +#if BITS_PER_LONG < 64 + if (seconds > INT_MAX) + seconds = INT_MAX; +#endif + it_new.it_value.tv_sec = seconds; + it_new.it_value.tv_usec = 0; + it_new.it_interval.tv_sec = it_new.it_interval.tv_usec = 0; + + do_setitimer(ITIMER_REAL, &it_new, &it_old); + + /* + * We can't return 0 if we have an alarm pending ... 
And we'd + * better return too much than too little anyway + */ + if ((!it_old.it_value.tv_sec && it_old.it_value.tv_usec) || + it_old.it_value.tv_usec >= 500000) + it_old.it_value.tv_sec++; + + return it_old.it_value.tv_sec; +} + asmlinkage long sys_setitimer(int which, struct itimerval __user *value, struct itimerval __user *ovalue) diff --git a/kernel/timer.c b/kernel/timer.c index 17d956cebcb9..13fa72cac7d8 100644 --- a/kernel/timer.c +++ b/kernel/timer.c @@ -956,19 +956,7 @@ void do_timer(struct pt_regs *regs) */ asmlinkage unsigned long sys_alarm(unsigned int seconds) { - struct itimerval it_new, it_old; - unsigned int oldalarm; - - it_new.it_interval.tv_sec = it_new.it_interval.tv_usec = 0; - it_new.it_value.tv_sec = seconds; - it_new.it_value.tv_usec = 0; - do_setitimer(ITIMER_REAL, &it_new, &it_old); - oldalarm = it_old.it_value.tv_sec; - /* ehhh.. We can't return 0 if we have an alarm pending.. */ - /* And we'd better return too much than too little anyway */ - if ((!oldalarm && it_old.it_value.tv_usec) || it_old.it_value.tv_usec >= 500000) - oldalarm++; - return oldalarm; + return alarm_setitimer(seconds); } #endif -- cgit v1.2.3 From 7d99b7d634d81bb372e03e4561c80430aa4cfac2 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Sat, 25 Mar 2006 03:06:35 -0800 Subject: [PATCH] Validate and sanitze itimer timeval from userspace According to the specification the timevals must be validated and an errorcode -EINVAL returned in case the timevals are not in canonical form. This check was never done in Linux. The pre 2.6.16 code converted invalid timevals silently. Negative timeouts were converted by the timeval_to_jiffies conversion to the maximum timeout. hrtimers and the ktime_t operations expect timevals in canonical form. Otherwise random results might happen on 32 bits machines due to the optimized ktime_add/sub operations. Negative timeouts are treated as already expired. This might break applications which work on pre 2.6.16. 
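As an illustration of the specification's requirement mentioned above (-EINVAL for timevals that are not in canonical form), a strict check might look roughly like the following userspace sketch. The function names and the USEC_PER_SEC_SKETCH constant are hypothetical; this is not the code in this patch, which sanitizes instead of rejecting.

```c
#include <errno.h>
#include <sys/time.h>

#define USEC_PER_SEC_SKETCH 1000000L	/* hypothetical name for this sketch */

/* Canonical form: non-negative seconds, 0 <= microseconds < 1000000. */
static int timeval_canonical(const struct timeval *tv)
{
	return tv->tv_sec >= 0 &&
	       tv->tv_usec >= 0 && tv->tv_usec < USEC_PER_SEC_SKETCH;
}

/* Returns 0 if both timevals are canonical, -EINVAL otherwise. */
static int check_itimerval_strict(const struct itimerval *value)
{
	if (!timeval_canonical(&value->it_value) ||
	    !timeval_canonical(&value->it_interval))
		return -EINVAL;
	return 0;
}
```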
To prevent random behaviour and API breakage the timevals are checked and invalid timevals sanitized in a simliar way as the pre 2.6.16 code did. Invalid timevals are reported with a per boot limited number of kernel messages so applications which use this misfeature can be corrected. After a grace period of one year the sanitizing should be replaced by a correct validation check. This is also documented in Documentation/feature-removal-schedule.txt The validation and sanitizing is done inside do_setitimer so all callers (sys_setitimer, compat_sys_setitimer, osf_setitimer) are catched. Signed-off-by: Thomas Gleixner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/itimer.c | 66 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 66 insertions(+) (limited to 'kernel') diff --git a/kernel/itimer.c b/kernel/itimer.c index a2dc375927d8..680e6b70c872 100644 --- a/kernel/itimer.c +++ b/kernel/itimer.c @@ -143,6 +143,60 @@ int it_real_fn(void *data) return HRTIMER_NORESTART; } +/* + * We do not care about correctness. We just sanitize the values so + * the ktime_t operations which expect normalized values do not + * break. This converts negative values to long timeouts similar to + * the code in kernel versions < 2.6.16 + * + * Print a limited number of warning messages when an invalid timeval + * is detected. + */ +static void fixup_timeval(struct timeval *tv, int interval) +{ + static int warnlimit = 10; + unsigned long tmp; + + if (warnlimit > 0) { + warnlimit--; + printk(KERN_WARNING + "setitimer: %s (pid = %d) provided " + "invalid timeval %s: tv_sec = %ld tv_usec = %ld\n", + current->comm, current->pid, + interval ? 
"it_interval" : "it_value", + tv->tv_sec, (long) tv->tv_usec); + } + + tmp = tv->tv_usec; + if (tmp >= USEC_PER_SEC) { + tv->tv_usec = tmp % USEC_PER_SEC; + tv->tv_sec += tmp / USEC_PER_SEC; + } + + tmp = tv->tv_sec; + if (tmp > LONG_MAX) + tv->tv_sec = LONG_MAX; +} + +/* + * Returns true if the timeval is in canonical form + */ +#define timeval_valid(t) \ + (((t)->tv_sec >= 0) && (((unsigned long) (t)->tv_usec) < USEC_PER_SEC)) + +/* + * Check for invalid timevals, sanitize them and print a limited + * number of warnings. + */ +static void check_itimerval(struct itimerval *value) { + + if (unlikely(!timeval_valid(&value->it_value))) + fixup_timeval(&value->it_value, 0); + + if (unlikely(!timeval_valid(&value->it_interval))) + fixup_timeval(&value->it_interval, 1); +} + int do_setitimer(int which, struct itimerval *value, struct itimerval *ovalue) { struct task_struct *tsk = current; @@ -150,6 +204,18 @@ int do_setitimer(int which, struct itimerval *value, struct itimerval *ovalue) ktime_t expires; cputime_t cval, cinterval, nval, ninterval; + /* + * Validate the timevals in value. + * + * Note: Although the spec requires that invalid values shall + * return -EINVAL, we just fixup the value and print a limited + * number of warnings in order not to break users of this + * historical misfeature. + * + * Scheduled for replacement in March 2007 + */ + check_itimerval(value); + switch (which) { case ITIMER_REAL: again: -- cgit v1.2.3 From 8d3b33f67fdc0fb364a1ef6d8fbbea7c2e4e6c98 Mon Sep 17 00:00:00 2001 From: Rusty Russell Date: Sat, 25 Mar 2006 03:07:05 -0800 Subject: [PATCH] Remove MODULE_PARM MODULE_PARM was actually breaking: recent gcc version optimize them out as unused. It's time to replace the last users, which are generally in the most unloved drivers anyway. 
Signed-off-by: Rusty Russell Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/module.c | 183 ++++------------------------------------------------ kernel/rcutorture.c | 10 +-- 2 files changed, 16 insertions(+), 177 deletions(-) (limited to 'kernel') diff --git a/kernel/module.c b/kernel/module.c index 54623c714bba..ddfe45ac2fd1 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -233,24 +233,6 @@ static unsigned long __find_symbol(const char *name, return 0; } -/* Find a symbol in this elf symbol table */ -static unsigned long find_local_symbol(Elf_Shdr *sechdrs, - unsigned int symindex, - const char *strtab, - const char *name) -{ - unsigned int i; - Elf_Sym *sym = (void *)sechdrs[symindex].sh_addr; - - /* Search (defined) internal symbols first. */ - for (i = 1; i < sechdrs[symindex].sh_size/sizeof(*sym); i++) { - if (sym[i].st_shndx != SHN_UNDEF - && strcmp(name, strtab + sym[i].st_name) == 0) - return sym[i].st_value; - } - return 0; -} - /* Search for module by name: must hold module_mutex. 
*/ static struct module *find_module(const char *name) { @@ -785,139 +767,6 @@ static struct module_attribute *modinfo_attrs[] = { NULL, }; -#ifdef CONFIG_OBSOLETE_MODPARM -/* Bounds checking done below */ -static int obsparm_copy_string(const char *val, struct kernel_param *kp) -{ - strcpy(kp->arg, val); - return 0; -} - -static int set_obsolete(const char *val, struct kernel_param *kp) -{ - unsigned int min, max; - unsigned int size, maxsize; - int dummy; - char *endp; - const char *p; - struct obsolete_modparm *obsparm = kp->arg; - - if (!val) { - printk(KERN_ERR "Parameter %s needs an argument\n", kp->name); - return -EINVAL; - } - - /* type is: [min[-max]]{b,h,i,l,s} */ - p = obsparm->type; - min = simple_strtol(p, &endp, 10); - if (endp == obsparm->type) - min = max = 1; - else if (*endp == '-') { - p = endp+1; - max = simple_strtol(p, &endp, 10); - } else - max = min; - switch (*endp) { - case 'b': - return param_array(kp->name, val, min, max, obsparm->addr, - 1, param_set_byte, &dummy); - case 'h': - return param_array(kp->name, val, min, max, obsparm->addr, - sizeof(short), param_set_short, &dummy); - case 'i': - return param_array(kp->name, val, min, max, obsparm->addr, - sizeof(int), param_set_int, &dummy); - case 'l': - return param_array(kp->name, val, min, max, obsparm->addr, - sizeof(long), param_set_long, &dummy); - case 's': - return param_array(kp->name, val, min, max, obsparm->addr, - sizeof(char *), param_set_charp, &dummy); - - case 'c': - /* Undocumented: 1-5c50 means 1-5 strings of up to 49 chars, - and the decl is "char xxx[5][50];" */ - p = endp+1; - maxsize = simple_strtol(p, &endp, 10); - /* We check lengths here (yes, this is a hack). 
*/ - p = val; - while (p[size = strcspn(p, ",")]) { - if (size >= maxsize) - goto oversize; - p += size+1; - } - if (size >= maxsize) - goto oversize; - return param_array(kp->name, val, min, max, obsparm->addr, - maxsize, obsparm_copy_string, &dummy); - } - printk(KERN_ERR "Unknown obsolete parameter type %s\n", obsparm->type); - return -EINVAL; - oversize: - printk(KERN_ERR - "Parameter %s doesn't fit in %u chars.\n", kp->name, maxsize); - return -EINVAL; -} - -static int obsolete_params(const char *name, - char *args, - struct obsolete_modparm obsparm[], - unsigned int num, - Elf_Shdr *sechdrs, - unsigned int symindex, - const char *strtab) -{ - struct kernel_param *kp; - unsigned int i; - int ret; - - kp = kmalloc(sizeof(kp[0]) * num, GFP_KERNEL); - if (!kp) - return -ENOMEM; - - for (i = 0; i < num; i++) { - char sym_name[128 + sizeof(MODULE_SYMBOL_PREFIX)]; - - snprintf(sym_name, sizeof(sym_name), "%s%s", - MODULE_SYMBOL_PREFIX, obsparm[i].name); - - kp[i].name = obsparm[i].name; - kp[i].perm = 000; - kp[i].set = set_obsolete; - kp[i].get = NULL; - obsparm[i].addr - = (void *)find_local_symbol(sechdrs, symindex, strtab, - sym_name); - if (!obsparm[i].addr) { - printk("%s: falsely claims to have parameter %s\n", - name, obsparm[i].name); - ret = -EINVAL; - goto out; - } - kp[i].arg = &obsparm[i]; - } - - ret = parse_args(name, args, kp, num, NULL); - out: - kfree(kp); - return ret; -} -#else -static int obsolete_params(const char *name, - char *args, - struct obsolete_modparm obsparm[], - unsigned int num, - Elf_Shdr *sechdrs, - unsigned int symindex, - const char *strtab) -{ - if (num != 0) - printk(KERN_WARNING "%s: Ignoring obsolete parameters\n", - name); - return 0; -} -#endif /* CONFIG_OBSOLETE_MODPARM */ - static const char vermagic[] = VERMAGIC_STRING; #ifdef CONFIG_MODVERSIONS @@ -1874,27 +1723,17 @@ static struct module *load_module(void __user *umod, set_fs(old_fs); mod->args = args; - if (obsparmindex) { - err = obsolete_params(mod->name, 
mod->args, - (struct obsolete_modparm *) - sechdrs[obsparmindex].sh_addr, - sechdrs[obsparmindex].sh_size - / sizeof(struct obsolete_modparm), - sechdrs, symindex, - (char *)sechdrs[strindex].sh_addr); - if (setupindex) - printk(KERN_WARNING "%s: Ignoring new-style " - "parameters in presence of obsolete ones\n", - mod->name); - } else { - /* Size of section 0 is 0, so this works well if no params */ - err = parse_args(mod->name, mod->args, - (struct kernel_param *) - sechdrs[setupindex].sh_addr, - sechdrs[setupindex].sh_size - / sizeof(struct kernel_param), - NULL); - } + if (obsparmindex) + printk(KERN_WARNING "%s: Ignoring obsolete parameters\n", + mod->name); + + /* Size of section 0 is 0, so this works well if no params */ + err = parse_args(mod->name, mod->args, + (struct kernel_param *) + sechdrs[setupindex].sh_addr, + sechdrs[setupindex].sh_size + / sizeof(struct kernel_param), + NULL); if (err < 0) goto arch_cleanup; diff --git a/kernel/rcutorture.c b/kernel/rcutorture.c index 9a1fa8894b95..b4b362b5baf5 100644 --- a/kernel/rcutorture.c +++ b/kernel/rcutorture.c @@ -54,15 +54,15 @@ static int verbose; /* Print more debug info. */ static int test_no_idle_hz; /* Test RCU's support for tickless idle CPUs. 
*/ static int shuffle_interval = 5; /* Interval between shuffles (in sec)*/ -MODULE_PARM(nreaders, "i"); +module_param(nreaders, int, 0); MODULE_PARM_DESC(nreaders, "Number of RCU reader threads"); -MODULE_PARM(stat_interval, "i"); +module_param(stat_interval, int, 0); MODULE_PARM_DESC(stat_interval, "Number of seconds between stats printk()s"); -MODULE_PARM(verbose, "i"); +module_param(verbose, bool, 0); MODULE_PARM_DESC(verbose, "Enable verbose debugging printk()s"); -MODULE_PARM(test_no_idle_hz, "i"); +module_param(test_no_idle_hz, bool, 0); MODULE_PARM_DESC(test_no_idle_hz, "Test support for tickless idle CPUs"); -MODULE_PARM(shuffle_interval, "i"); +module_param(shuffle_interval, int, 0); MODULE_PARM_DESC(shuffle_interval, "Number of seconds between shuffles"); #define TORTURE_FLAG "rcutorture: " #define PRINTK_STRING(s) \ -- cgit v1.2.3 From 9871728b756646e0d758a966ba00f2c0ff812817 Mon Sep 17 00:00:00 2001 From: Adrian Bunk Date: Sat, 25 Mar 2006 03:07:06 -0800 Subject: [PATCH] kernel/params.c: make param_array() static param_array() in kernel/params.c can now become static. Signed-off-by: Adrian Bunk Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/params.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/params.c b/kernel/params.c index a29150582310..9de637a5c8bc 100644 --- a/kernel/params.c +++ b/kernel/params.c @@ -265,12 +265,12 @@ int param_get_invbool(char *buffer, struct kernel_param *kp) } /* We cheat here and temporarily mangle the string. 
*/ -int param_array(const char *name, - const char *val, - unsigned int min, unsigned int max, - void *elem, int elemsize, - int (*set)(const char *, struct kernel_param *kp), - int *num) +static int param_array(const char *name, + const char *val, + unsigned int min, unsigned int max, + void *elem, int elemsize, + int (*set)(const char *, struct kernel_param *kp), + int *num) { int ret; struct kernel_param kp; -- cgit v1.2.3 From c777ac5594f772ac760e02c3ac71d067616b579d Mon Sep 17 00:00:00 2001 From: Andrew Morton Date: Sat, 25 Mar 2006 03:07:36 -0800 Subject: [PATCH] irq: uninline migration functions Uninline some massive IRQ migration functions. Put them in the new kernel/irq/migration.c. Cc: Andi Kleen Cc: "Luck, Tony" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/irq/Makefile | 3 +-- kernel/irq/migration.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 53 insertions(+), 2 deletions(-) create mode 100644 kernel/irq/migration.c (limited to 'kernel') diff --git a/kernel/irq/Makefile b/kernel/irq/Makefile index 49378738ff5e..2b33f852be3e 100644 --- a/kernel/irq/Makefile +++ b/kernel/irq/Makefile @@ -1,5 +1,4 @@ -obj-y := handle.o manage.o spurious.o +obj-y := handle.o manage.o spurious.o migration.o obj-$(CONFIG_GENERIC_IRQ_PROBE) += autoprobe.o obj-$(CONFIG_PROC_FS) += proc.o - diff --git a/kernel/irq/migration.c b/kernel/irq/migration.c new file mode 100644 index 000000000000..6bdd03c524c7 --- /dev/null +++ b/kernel/irq/migration.c @@ -0,0 +1,52 @@ +#include + +#if defined(CONFIG_GENERIC_PENDING_IRQ) + +void set_pending_irq(unsigned int irq, cpumask_t mask) +{ + irq_desc_t *desc = irq_desc + irq; + unsigned long flags; + + spin_lock_irqsave(&desc->lock, flags); + desc->move_irq = 1; + pending_irq_cpumask[irq] = mask; + spin_unlock_irqrestore(&desc->lock, flags); +} + +void move_native_irq(int irq) +{ + cpumask_t tmp; + irq_desc_t *desc = irq_descp(irq); + + if (likely (!desc->move_irq)) + return; + + 
desc->move_irq = 0; + + if (likely(cpus_empty(pending_irq_cpumask[irq]))) + return; + + if (!desc->handler->set_affinity) + return; + + /* note - we hold the desc->lock */ + cpus_and(tmp, pending_irq_cpumask[irq], cpu_online_map); + + /* + * If there was a valid mask to work with, please + * do the disable, re-program, enable sequence. + * This is *not* particularly important for level triggered + * but in a edge trigger case, we might be setting rte + * when an active trigger is comming in. This could + * cause some ioapics to mal-function. + * Being paranoid i guess! + */ + if (unlikely(!cpus_empty(tmp))) { + desc->handler->disable(irq); + desc->handler->set_affinity(irq,tmp); + desc->handler->enable(irq); + } + cpus_clear(pending_irq_cpumask[irq]); +} + +#endif -- cgit v1.2.3 From 501f2499b897ca4be68b1acc7a4bc8cf66f5fd24 Mon Sep 17 00:00:00 2001 From: Bryan Holty Date: Sat, 25 Mar 2006 03:07:37 -0800 Subject: [PATCH] IRQ: prevent enabling of previously disabled interrupt This fix prevents re-disabling and enabling of a previously disabled interrupt. On an SMP system with irq balancing enabled; If an interrupt is disabled from within its own interrupt context with disable_irq_nosync and is also earmarked for processor migration, the interrupt is blindly moved to the other processor and enabled without regard for its current "enabled" state. If there is an interrupt pending, it will unexpectedly invoke the irq handler on the new irq owning processor (even though the irq was previously disabled) The more intuitive fix would be to invoke disable_irq_nosync and enable_irq, but since we already have the desc->lock from __do_IRQ, we cannot call them directly. Instead we can use the same logic to disable and enable found in disable_irq_nosync and enable_irq, with regards to the desc->depth. This now prevents a disabled interrupt from being re-disabled, and more importantly prevents a disabled interrupt from being incorrectly enabled on a different processor. 
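The effect of the status check can be modelled in userspace. This is a toy sketch under stated assumptions: struct toy_irq and toy_move_irq() are hypothetical stand-ins for the descriptor and the migration path, and the "hardware" is just a boolean.

```c
#include <stdbool.h>

/* Toy model, not the kernel's irq_desc. */
struct toy_irq {
	bool disabled;	/* software state, like IRQ_DISABLED in desc->status */
	bool masked;	/* what the pretend hardware is doing right now */
	int cpu;	/* CPU the interrupt is routed to */
};

/* Re-route the interrupt without touching an already disabled line. */
static void toy_move_irq(struct toy_irq *irq, int new_cpu)
{
	if (!irq->disabled)
		irq->masked = true;	/* stands in for handler->disable() */

	irq->cpu = new_cpu;		/* stands in for handler->set_affinity() */

	if (!irq->disabled)
		irq->masked = false;	/* stands in for handler->enable() */
}
```

An interrupt that was disabled before the move stays masked afterwards; without the two checks it would come back unmasked on the new CPU.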
Signed-off-by: Bryan Holty Cc: Andi Kleen Cc: "Luck, Tony" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/irq/migration.c | 21 +++++++++++++++++---- 1 file changed, 17 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/irq/migration.c b/kernel/irq/migration.c index 6bdd03c524c7..52a8655fa080 100644 --- a/kernel/irq/migration.c +++ b/kernel/irq/migration.c @@ -18,9 +18,17 @@ void move_native_irq(int irq) cpumask_t tmp; irq_desc_t *desc = irq_descp(irq); - if (likely (!desc->move_irq)) + if (likely(!desc->move_irq)) return; + /* + * Paranoia: cpu-local interrupts shouldn't be calling in here anyway. + */ + if (CHECK_IRQ_PER_CPU(desc->status)) { + WARN_ON(1); + return; + } + desc->move_irq = 0; if (likely(cpus_empty(pending_irq_cpumask[irq]))) @@ -29,7 +37,8 @@ void move_native_irq(int irq) if (!desc->handler->set_affinity) return; - /* note - we hold the desc->lock */ + assert_spin_locked(&desc->lock); + cpus_and(tmp, pending_irq_cpumask[irq], cpu_online_map); /* @@ -42,9 +51,13 @@ void move_native_irq(int irq) * Being paranoid i guess! */ if (unlikely(!cpus_empty(tmp))) { - desc->handler->disable(irq); + if (likely(!(desc->status & IRQ_DISABLED))) + desc->handler->disable(irq); + desc->handler->set_affinity(irq,tmp); - desc->handler->enable(irq); + + if (likely(!(desc->status & IRQ_DISABLED))) + desc->handler->enable(irq); } cpus_clear(pending_irq_cpumask[irq]); } -- cgit v1.2.3 From 12b5989be10011387a9da5dee82e5c0d6f9d02e7 Mon Sep 17 00:00:00 2001 From: Chris Wright Date: Sat, 25 Mar 2006 03:07:41 -0800 Subject: [PATCH] refactor capable() to one implementation, add __capable() helper Move capable() to kernel/capability.c and eliminate duplicate implementations. Add __capable() function which can be used to check for capabiilty of any process. 
Signed-off-by: Chris Wright Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/capability.c | 16 ++++++++++++++++ kernel/sys.c | 12 ------------ 2 files changed, 16 insertions(+), 12 deletions(-) (limited to 'kernel') diff --git a/kernel/capability.c b/kernel/capability.c index bfa3c92e16f2..1a4d8a40d3f9 100644 --- a/kernel/capability.c +++ b/kernel/capability.c @@ -233,3 +233,19 @@ out: return ret; } + +int __capable(struct task_struct *t, int cap) +{ + if (security_capable(t, cap) == 0) { + t->flags |= PF_SUPERPRIV; + return 1; + } + return 0; +} +EXPORT_SYMBOL(__capable); + +int capable(int cap) +{ + return __capable(current, cap); +} +EXPORT_SYMBOL(capable); diff --git a/kernel/sys.c b/kernel/sys.c index 19d058be49d4..421009cedb51 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -224,18 +224,6 @@ int unregister_reboot_notifier(struct notifier_block * nb) EXPORT_SYMBOL(unregister_reboot_notifier); -#ifndef CONFIG_SECURITY -int capable(int cap) -{ - if (cap_raised(current->cap_effective, cap)) { - current->flags |= PF_SUPERPRIV; - return 1; - } - return 0; -} -EXPORT_SYMBOL(capable); -#endif - static int set_one_prio(struct task_struct *p, int niceval, int error) { int no_nice; -- cgit v1.2.3 From 05eeae208d08a05a6980cf2ff61f02843c0955fd Mon Sep 17 00:00:00 2001 From: Andrew Morton Date: Sat, 25 Mar 2006 03:07:48 -0800 Subject: [PATCH] find_task_by_pid() needs tasklist_lock A couple of places are forgetting to take it. The kswapd case is probably unimportant. keventd_create_kthread() was racy. The whole thing is a bit flakey: you start a kernel thread, get its pid from kernel_thread() then look up its task_struct. a) It assumes that pid recycling takes a "long" time. b) We get a task_struct but no reference was taken on it. The owner of the kswapd and kthread task_struct*'s must assume that the new thread won't exit unexpectedly. Because if it does, they're left holding dead memory and any attempt to control or stop that task will crash. 
Cc: Christoph Hellwig
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 kernel/kthread.c | 2 ++
 1 file changed, 2 insertions(+)

(limited to 'kernel')

diff --git a/kernel/kthread.c b/kernel/kthread.c
index 6a5373868a98..c5f3c6613b6d 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -115,7 +115,9 @@ static void keventd_create_kthread(void *_create)
 		create->result = ERR_PTR(pid);
 	} else {
 		wait_for_completion(&create->started);
+		read_lock(&tasklist_lock);
 		create->result = find_task_by_pid(pid);
+		read_unlock(&tasklist_lock);
 	}
 	complete(&create->done);
 }
--
cgit v1.2.3


From 231bed205879236357171e50bd8965e70797ecdc Mon Sep 17 00:00:00 2001
From: Eric Dumazet
Date: Sat, 25 Mar 2006 03:08:00 -0800
Subject: [PATCH] No need to protect current->group_info in sys_getgroups(), in_group_p() and in_egroup_p()

While doing some benchmarks of an Apache/PHP SMP server, I noticed high
oprofile numbers in in_group_p() and _atomic_dec_and_lock().

rank percent
   1 4.8911 % __link_path_walk
   2 4.8503 % __d_lookup
  *3 4.2911 % _atomic_dec_and_lock
   4 3.9307 % __copy_to_user_ll
   5 4.9004 % sysenter_past_esp
  *6 3.3248 % in_group_p

It appears that in_group_p() does an unnecessary

	get_group_info(current->group_info); /* atomic_inc() */
	... /* access current->group_info */
	put_group_info(current->group_info); /* _atomic_dec_and_lock */

It is not necessary to do this, because the current task holds a reference
on its own group_info, and this reference cannot change during the lookup.

This patch deletes the get_group_info()/put_group_info() pair from
sys_getgroups(), in_group_p() and in_egroup_p() functions.
Signed-off-by: Eric Dumazet
Cc: Tim Hockin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 kernel/sys.c | 6 ------
 1 file changed, 6 deletions(-)

(limited to 'kernel')

diff --git a/kernel/sys.c b/kernel/sys.c
index 421009cedb51..119fb0d9e24e 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1421,7 +1421,6 @@ asmlinkage long sys_getgroups(int gidsetsize, gid_t __user *grouplist)
 		return -EINVAL;
 
 	/* no need to grab task_lock here; it cannot change */
-	get_group_info(current->group_info);
 	i = current->group_info->ngroups;
 	if (gidsetsize) {
 		if (i > gidsetsize) {
@@ -1434,7 +1433,6 @@ asmlinkage long sys_getgroups(int gidsetsize, gid_t __user *grouplist)
 		}
 	}
 out:
-	put_group_info(current->group_info);
 	return i;
 }
 
@@ -1475,9 +1473,7 @@ int in_group_p(gid_t grp)
 {
 	int retval = 1;
 	if (grp != current->fsgid) {
-		get_group_info(current->group_info);
 		retval = groups_search(current->group_info, grp);
-		put_group_info(current->group_info);
 	}
 	return retval;
 }
@@ -1488,9 +1484,7 @@ int in_egroup_p(gid_t grp)
 {
 	int retval = 1;
 	if (grp != current->egid) {
-		get_group_info(current->group_info);
 		retval = groups_search(current->group_info, grp);
-		put_group_info(current->group_info);
 	}
 	return retval;
 }
--
cgit v1.2.3


From 34f361ade2fb4a869f6a7714d01c04ce4cfa75d9 Mon Sep 17 00:00:00 2001
From: Ashok Raj
Date: Sat, 25 Mar 2006 03:08:18 -0800
Subject: [PATCH] Check if cpu can be onlined before calling smp_prepare_cpu()

- Moved check for online cpu out of smp_prepare_cpu()

- Moved default declaration of smp_prepare_cpu() to kernel/cpu.c

- Removed lock_cpu_hotplug() from smp_prepare_cpu() to around it, since
  its called from cpu_up() as well now.

- Removed clearing from cpu_present_map during cpu_offline as it breaks
  using cpu_up() directly during a subsequent online operation.
Signed-off-by: Ashok Raj
Cc: Srivatsa Vaddagiri
Cc: "Li, Shaohua"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 kernel/power/smp.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

(limited to 'kernel')

diff --git a/kernel/power/smp.c b/kernel/power/smp.c
index 911fc62b8225..5957312b2d68 100644
--- a/kernel/power/smp.c
+++ b/kernel/power/smp.c
@@ -49,9 +49,7 @@ void enable_nonboot_cpus(void)
 
 	printk("Thawing cpus ...\n");
 	for_each_cpu_mask(cpu, frozen_cpus) {
-		error = smp_prepare_cpu(cpu);
-		if (!error)
-			error = cpu_up(cpu);
+		error = cpu_up(cpu);
 		if (!error) {
 			printk("CPU%d is up\n", cpu);
 			continue;
--
cgit v1.2.3


From d74beb9f33a5f16d2965f11b275e401f225c949d Mon Sep 17 00:00:00 2001
From: Eric Dumazet
Date: Sat, 25 Mar 2006 03:08:19 -0800
Subject: [PATCH] Use unsigned int types for a faster bsearch

This patch avoids arithmetic on 'signed' types that are slower than
'unsigned'. This saves space and cpu cycles.

size of kernel/sys.o before the patch (gcc-3.4.5)

   text    data     bss     dec     hex filename
  10924     252       4   11180    2bac kernel/sys.o

size of kernel/sys.o after the patch

   text    data     bss     dec     hex filename
  10903     252       4   11159    2b97 kernel/sys.o

I noticed that gcc-4.1.0 (from Fedora Core 5) even uses idiv instruction
for (a+b)/2 if a and b are signed.
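For illustration, a standalone userspace sketch of a binary search with unsigned indices follows. The function name sorted_contains() is hypothetical, and the comparison is done directly rather than through a subtraction as in groups_search(). With unsigned operands the halving can compile to a plain shift, whereas signed division by two rounds toward zero and needs extra work.

```c
/*
 * Hypothetical standalone search over a sorted array.
 * Returns 1 if key is present in vals[0..n-1], 0 otherwise.
 */
static int sorted_contains(const unsigned int *vals, unsigned int n,
			   unsigned int key)
{
	unsigned int left = 0, right = n;

	while (left < right) {
		/* unsigned arithmetic: no sign fix-up needed for the halving */
		unsigned int mid = (left + right) / 2;

		if (key > vals[mid])
			left = mid + 1;
		else if (key < vals[mid])
			right = mid;
		else
			return 1;
	}
	return 0;
}
```

The sum left + right cannot overflow for the small group counts discussed here; for arbitrary array sizes, left + (right - left) / 2 would be the safer form.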
Signed-off-by: Eric Dumazet Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/sys.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/sys.c b/kernel/sys.c index 119fb0d9e24e..38bc73ede2ba 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -1363,7 +1363,7 @@ static void groups_sort(struct group_info *group_info) /* a simple bsearch */ int groups_search(struct group_info *group_info, gid_t grp) { - int left, right; + unsigned int left, right; if (!group_info) return 0; @@ -1371,7 +1371,7 @@ int groups_search(struct group_info *group_info, gid_t grp) left = 0; right = group_info->ngroups; while (left < right) { - int mid = (left+right)/2; + unsigned int mid = (left+right)/2; int cmp = grp - GROUP_AT(group_info, mid); if (cmp > 0) left = mid + 1; -- cgit v1.2.3 From f5163427453bc6ca2dd497eeb3e7a30d1c74b487 Mon Sep 17 00:00:00 2001 From: Dimitri Sivanich Date: Sat, 25 Mar 2006 03:08:23 -0800 Subject: [PATCH] Add SA_PERCPU_IRQ flag support Add support for SA_PERCPU_IRQ (only mmtimer.c uses this at this stage). 
Signed-off-by: Dimitri Sivanich Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/irq/manage.c | 23 ++++++++++++++++++----- 1 file changed, 18 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c index 97d5559997d2..6edfcef291e8 100644 --- a/kernel/irq/manage.c +++ b/kernel/irq/manage.c @@ -204,10 +204,14 @@ int setup_irq(unsigned int irq, struct irqaction * new) p = &desc->action; if ((old = *p) != NULL) { /* Can't share interrupts unless both agree to */ - if (!(old->flags & new->flags & SA_SHIRQ)) { - spin_unlock_irqrestore(&desc->lock,flags); - return -EBUSY; - } + if (!(old->flags & new->flags & SA_SHIRQ)) + goto mismatch; + +#if defined(ARCH_HAS_IRQ_PER_CPU) && defined(SA_PERCPU_IRQ) + /* All handlers must agree on per-cpuness */ + if ((old->flags & IRQ_PER_CPU) != (new->flags & IRQ_PER_CPU)) + goto mismatch; +#endif /* add new interrupt at end of irq queue */ do { @@ -218,7 +222,10 @@ int setup_irq(unsigned int irq, struct irqaction * new) } *p = new; - +#if defined(ARCH_HAS_IRQ_PER_CPU) && defined(SA_PERCPU_IRQ) + if (new->flags & SA_PERCPU_IRQ) + desc->status |= IRQ_PER_CPU; +#endif if (!shared) { desc->depth = 0; desc->status &= ~(IRQ_DISABLED | IRQ_AUTODETECT | @@ -236,6 +243,12 @@ int setup_irq(unsigned int irq, struct irqaction * new) register_handler_proc(irq, new); return 0; + +mismatch: + spin_unlock_irqrestore(&desc->lock, flags); + printk(KERN_ERR "%s: irq handler mismatch\n", __FUNCTION__); + dump_stack(); + return -EBUSY; } /** -- cgit v1.2.3 From 5ddcfa878d5b10b0ab94251a4229a8a9daaf93ed Mon Sep 17 00:00:00 2001 From: Roman Zippel Date: Sat, 25 Mar 2006 03:08:28 -0800 Subject: [PATCH] remove pps support This removes the support for pps. It's completely unused within the kernel and is basically in the way for further cleanups. 
It should be easier to readd proper support for it after the rest has been converted to NTP4 (where the pps mechanisms are quite different from NTP3 anyway). Signed-off-by: Roman Zippel Cc: Adrian Bunk Cc: john stultz Cc: Thomas Gleixner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/time.c | 59 ++++++++++++++++------------------------------------------ kernel/timer.c | 13 ++----------- 2 files changed, 18 insertions(+), 54 deletions(-) (limited to 'kernel') diff --git a/kernel/time.c b/kernel/time.c index 804539165d8b..e00a97b77241 100644 --- a/kernel/time.c +++ b/kernel/time.c @@ -202,24 +202,6 @@ asmlinkage long sys_settimeofday(struct timeval __user *tv, return do_sys_settimeofday(tv ? &new_ts : NULL, tz ? &new_tz : NULL); } -long pps_offset; /* pps time offset (us) */ -long pps_jitter = MAXTIME; /* time dispersion (jitter) (us) */ - -long pps_freq; /* frequency offset (scaled ppm) */ -long pps_stabil = MAXFREQ; /* frequency dispersion (scaled ppm) */ - -long pps_valid = PPS_VALID; /* pps signal watchdog counter */ - -int pps_shift = PPS_SHIFT; /* interval duration (s) (shift) */ - -long pps_jitcnt; /* jitter limit exceeded */ -long pps_calcnt; /* calibration intervals */ -long pps_errcnt; /* calibration errors */ -long pps_stbcnt; /* stability limit exceeded */ - -/* hook for a loadable hardpps kernel module */ -void (*hardpps_ptr)(struct timeval *); - /* we call this to notify the arch when the clock is being * controlled. If no such arch routine, do nothing. */ @@ -279,7 +261,7 @@ int do_adjtimex(struct timex *txc) result = -EINVAL; goto leave; } - time_freq = txc->freq - pps_freq; + time_freq = txc->freq; } if (txc->modes & ADJ_MAXERROR) { @@ -312,10 +294,8 @@ int do_adjtimex(struct timex *txc) if ((time_next_adjust = txc->offset) == 0) time_adjust = 0; } - else if ( time_status & (STA_PLL | STA_PPSTIME) ) { - ltemp = (time_status & (STA_PPSTIME | STA_PPSSIGNAL)) == - (STA_PPSTIME | STA_PPSSIGNAL) ? 
- pps_offset : txc->offset; + else if (time_status & STA_PLL) { + ltemp = txc->offset; /* * Scale the phase adjustment and @@ -356,23 +336,14 @@ int do_adjtimex(struct timex *txc) } time_freq = min(time_freq, time_tolerance); time_freq = max(time_freq, -time_tolerance); - } /* STA_PLL || STA_PPSTIME */ + } /* STA_PLL */ } /* txc->modes & ADJ_OFFSET */ if (txc->modes & ADJ_TICK) { tick_usec = txc->tick; tick_nsec = TICK_USEC_TO_NSEC(tick_usec); } } /* txc->modes */ -leave: if ((time_status & (STA_UNSYNC|STA_CLOCKERR)) != 0 - || ((time_status & (STA_PPSFREQ|STA_PPSTIME)) != 0 - && (time_status & STA_PPSSIGNAL) == 0) - /* p. 24, (b) */ - || ((time_status & (STA_PPSTIME|STA_PPSJITTER)) - == (STA_PPSTIME|STA_PPSJITTER)) - /* p. 24, (c) */ - || ((time_status & STA_PPSFREQ) != 0 - && (time_status & (STA_PPSWANDER|STA_PPSERROR)) != 0)) - /* p. 24, (d) */ +leave: if ((time_status & (STA_UNSYNC|STA_CLOCKERR)) != 0) result = TIME_ERROR; if ((txc->modes & ADJ_OFFSET_SINGLESHOT) == ADJ_OFFSET_SINGLESHOT) @@ -380,7 +351,7 @@ leave: if ((time_status & (STA_UNSYNC|STA_CLOCKERR)) != 0 else { txc->offset = shift_right(time_offset, SHIFT_UPDATE); } - txc->freq = time_freq + pps_freq; + txc->freq = time_freq; txc->maxerror = time_maxerror; txc->esterror = time_esterror; txc->status = time_status; @@ -388,14 +359,16 @@ leave: if ((time_status & (STA_UNSYNC|STA_CLOCKERR)) != 0 txc->precision = time_precision; txc->tolerance = time_tolerance; txc->tick = tick_usec; - txc->ppsfreq = pps_freq; - txc->jitter = pps_jitter >> PPS_AVG; - txc->shift = pps_shift; - txc->stabil = pps_stabil; - txc->jitcnt = pps_jitcnt; - txc->calcnt = pps_calcnt; - txc->errcnt = pps_errcnt; - txc->stbcnt = pps_stbcnt; + + /* PPS is not implemented, so these are zero */ + txc->ppsfreq = 0; + txc->jitter = 0; + txc->shift = 0; + txc->stabil = 0; + txc->jitcnt = 0; + txc->calcnt = 0; + txc->errcnt = 0; + txc->stbcnt = 0; write_sequnlock_irq(&xtime_lock); do_gettimeofday(&txc->time); notify_arch_cmos_timer(); diff 
--git a/kernel/timer.c b/kernel/timer.c index 13fa72cac7d8..ab189dd187cb 100644 --- a/kernel/timer.c +++ b/kernel/timer.c @@ -697,18 +697,9 @@ static void second_overflow(void) /* * Compute the frequency estimate and additional phase adjustment due - * to frequency error for the next second. When the PPS signal is - * engaged, gnaw on the watchdog counter and update the frequency - * computed by the pll and the PPS signal. + * to frequency error for the next second. */ - pps_valid++; - if (pps_valid == PPS_VALID) { /* PPS signal lost */ - pps_jitter = MAXTIME; - pps_stabil = MAXFREQ; - time_status &= ~(STA_PPSSIGNAL | STA_PPSJITTER | - STA_PPSWANDER | STA_PPSERROR); - } - ltemp = time_freq + pps_freq; + ltemp = time_freq; time_adj += shift_right(ltemp,(SHIFT_USEC + SHIFT_HZ - SHIFT_SCALE)); #if HZ == 100 -- cgit v1.2.3 From 910dea7fdda22f0ee83d26d459e460c79ed94557 Mon Sep 17 00:00:00 2001 From: Eric Sesterhenn Date: Sun, 26 Mar 2006 18:29:26 +0200 Subject: BUG_ON() Conversion in kernel/fork.c This changes if() BUG(); constructs to BUG_ON(), which is cleaner, contains unlikely() and can be better optimized away. Signed-off-by: Eric Sesterhenn Signed-off-by: Adrian Bunk --- kernel/fork.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/fork.c b/kernel/fork.c index a02063903aaa..d93ab2ba729c 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -769,8 +769,7 @@ int unshare_files(void) struct files_struct *files = current->files; int rc; - if(!files) - BUG(); + BUG_ON(!files); /* This can race but the race causes us to copy when we don't need to and drop the copy */ -- cgit v1.2.3 From cd7b24bb1891a10ee25168a912ff2304a9571d23 Mon Sep 17 00:00:00 2001 From: Ingo Molnar Date: Sun, 26 Mar 2006 01:36:54 -0800 Subject: [PATCH] warn if free_irq() is called from IRQ context Warn if free_irq() is called in IRQ context - free_irq() can execute /proc VFS work, which must not be done in IRQ context.
Signed-off-by: Ingo Molnar Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/irq/manage.c | 1 + 1 file changed, 1 insertion(+) (limited to 'kernel') diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c index 6edfcef291e8..ac766ad573e8 100644 --- a/kernel/irq/manage.c +++ b/kernel/irq/manage.c @@ -271,6 +271,7 @@ void free_irq(unsigned int irq, void *dev_id) struct irqaction **p; unsigned long flags; + WARN_ON(in_interrupt()); if (irq >= NR_IRQS) return; -- cgit v1.2.3 From e655a250d5fc12b6dfe0d436180ba4a3bfffdc9f Mon Sep 17 00:00:00 2001 From: Con Kolivas Date: Sun, 26 Mar 2006 01:37:11 -0800 Subject: [PATCH] swsusp: return correct load_image error If there's an error in load_image() we should return that without checking snapshot_image_loaded. Signed-off-by: Con Kolivas Acked-by: "Rafael J. Wysocki" Cc: Pavel Machek Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/power/swap.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/power/swap.c b/kernel/power/swap.c index 9177f3f73a6c..044b8e0c1025 100644 --- a/kernel/power/swap.c +++ b/kernel/power/swap.c @@ -454,10 +454,11 @@ static int load_image(struct swap_map_handle *handle, nr_pages++; } } while (ret > 0); - if (!error) + if (!error) { printk("\b\b\b\bdone\n"); - if (!snapshot_image_loaded(snapshot)) - error = -ENODATA; + if (!snapshot_image_loaded(snapshot)) + error = -ENODATA; + } return error; } -- cgit v1.2.3 From 3158e9411a66fb98d495ac441c242264f31aaf3e Mon Sep 17 00:00:00 2001 From: Stephen Rothwell Date: Sun, 26 Mar 2006 01:37:29 -0800 Subject: [PATCH] consolidate sys32/compat_adjtimex Create compat_sys_adjtimex and use it in all appropriate places.
Signed-off-by: Stephen Rothwell Cc: Arnd Bergmann Acked-by: Paul Mackerras Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/compat.c | 59 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 59 insertions(+) (limited to 'kernel') diff --git a/kernel/compat.c b/kernel/compat.c index 8c9cd88b6785..b9bdd1271f44 100644 --- a/kernel/compat.c +++ b/kernel/compat.c @@ -21,6 +21,7 @@ #include #include #include +#include #include @@ -898,3 +899,61 @@ asmlinkage long compat_sys_rt_sigsuspend(compat_sigset_t __user *unewset, compat return -ERESTARTNOHAND; } #endif /* __ARCH_WANT_COMPAT_SYS_RT_SIGSUSPEND */ + +asmlinkage long compat_sys_adjtimex(struct compat_timex __user *utp) +{ + struct timex txc; + int ret; + + memset(&txc, 0, sizeof(struct timex)); + + if (!access_ok(VERIFY_READ, utp, sizeof(struct compat_timex)) || + __get_user(txc.modes, &utp->modes) || + __get_user(txc.offset, &utp->offset) || + __get_user(txc.freq, &utp->freq) || + __get_user(txc.maxerror, &utp->maxerror) || + __get_user(txc.esterror, &utp->esterror) || + __get_user(txc.status, &utp->status) || + __get_user(txc.constant, &utp->constant) || + __get_user(txc.precision, &utp->precision) || + __get_user(txc.tolerance, &utp->tolerance) || + __get_user(txc.time.tv_sec, &utp->time.tv_sec) || + __get_user(txc.time.tv_usec, &utp->time.tv_usec) || + __get_user(txc.tick, &utp->tick) || + __get_user(txc.ppsfreq, &utp->ppsfreq) || + __get_user(txc.jitter, &utp->jitter) || + __get_user(txc.shift, &utp->shift) || + __get_user(txc.stabil, &utp->stabil) || + __get_user(txc.jitcnt, &utp->jitcnt) || + __get_user(txc.calcnt, &utp->calcnt) || + __get_user(txc.errcnt, &utp->errcnt) || + __get_user(txc.stbcnt, &utp->stbcnt)) + return -EFAULT; + + ret = do_adjtimex(&txc); + + if (!access_ok(VERIFY_WRITE, utp, sizeof(struct compat_timex)) || + __put_user(txc.modes, &utp->modes) || + __put_user(txc.offset, &utp->offset) || + __put_user(txc.freq, &utp->freq) || + 
__put_user(txc.maxerror, &utp->maxerror) || + __put_user(txc.esterror, &utp->esterror) || + __put_user(txc.status, &utp->status) || + __put_user(txc.constant, &utp->constant) || + __put_user(txc.precision, &utp->precision) || + __put_user(txc.tolerance, &utp->tolerance) || + __put_user(txc.time.tv_sec, &utp->time.tv_sec) || + __put_user(txc.time.tv_usec, &utp->time.tv_usec) || + __put_user(txc.tick, &utp->tick) || + __put_user(txc.ppsfreq, &utp->ppsfreq) || + __put_user(txc.jitter, &utp->jitter) || + __put_user(txc.shift, &utp->shift) || + __put_user(txc.stabil, &utp->stabil) || + __put_user(txc.jitcnt, &utp->jitcnt) || + __put_user(txc.calcnt, &utp->calcnt) || + __put_user(txc.errcnt, &utp->errcnt) || + __put_user(txc.stbcnt, &utp->stbcnt)) + ret = -EFAULT; + + return ret; +} -- cgit v1.2.3 From 92127c7a45d4d167d9b015a5f9de6b41ed66f1d0 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Sun, 26 Mar 2006 01:38:05 -0800 Subject: [PATCH] hrtimers: optimize softirq runqueues The hrtimer softirq is called from the timer softirq every tick. Retrieve the current time from xtime and wall_to_monotonic instead of calling base->get_time() for each timer base. Store the time in the base structure and provide a hook once clock source abstractions are in place and to keep the code open for new base clocks. Based on a patch from: Roman Zippel Signed-off-by: Thomas Gleixner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/hrtimer.c | 28 ++++++++++++++++++++++++++-- 1 file changed, 26 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c index 14bc9cfa6399..b728cc53452b 100644 --- a/kernel/hrtimer.c +++ b/kernel/hrtimer.c @@ -122,6 +122,26 @@ void ktime_get_ts(struct timespec *ts) } EXPORT_SYMBOL_GPL(ktime_get_ts); +/* + * Get the coarse grained time at the softirq based on xtime and + * wall_to_monotonic. 
+ */ +static void hrtimer_get_softirq_time(struct hrtimer_base *base) +{ + ktime_t xtim, tomono; + unsigned long seq; + + do { + seq = read_seqbegin(&xtime_lock); + xtim = timespec_to_ktime(xtime); + tomono = timespec_to_ktime(wall_to_monotonic); + + } while (read_seqretry(&xtime_lock, seq)); + + base[CLOCK_REALTIME].softirq_time = xtim; + base[CLOCK_MONOTONIC].softirq_time = ktime_add(xtim, tomono); +} + /* * Functions and macros which are different for UP/SMP systems are kept in a * single place @@ -586,9 +606,11 @@ int hrtimer_get_res(const clockid_t which_clock, struct timespec *tp) */ static inline void run_hrtimer_queue(struct hrtimer_base *base) { - ktime_t now = base->get_time(); struct rb_node *node; + if (base->get_softirq_time) + base->softirq_time = base->get_softirq_time(); + spin_lock_irq(&base->lock); while ((node = base->first)) { @@ -598,7 +620,7 @@ static inline void run_hrtimer_queue(struct hrtimer_base *base) void *data; timer = rb_entry(node, struct hrtimer, node); - if (now.tv64 <= timer->expires.tv64) + if (base->softirq_time.tv64 <= timer->expires.tv64) break; fn = timer->function; @@ -641,6 +663,8 @@ void hrtimer_run_queues(void) struct hrtimer_base *base = __get_cpu_var(hrtimer_bases); int i; + hrtimer_get_softirq_time(base); + for (i = 0; i < MAX_HRTIMER_BASES; i++) run_hrtimer_queue(&base[i]); } -- cgit v1.2.3 From 44f21475511bbc0135b52c66ad74dcc6a9026da3 Mon Sep 17 00:00:00 2001 From: Roman Zippel Date: Sun, 26 Mar 2006 01:38:06 -0800 Subject: [PATCH] hrtimers: pass current time to hrtimer_forward() Pass current time to hrtimer_forward(). This allows to use the softirq time in the timer base when the forward function is called from the timer callback. Other places pass current time with a call to timer->base->get_time(). 
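The effect of passing the current time in explicitly can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the kernel code: `model_timer` and `model_forward` are hypothetical names, times are plain signed 64-bit nanosecond counts rather than ktime_t, and the clamp of the interval to the clock resolution is left out.

```c
#include <stdint.h>

/*
 * Hypothetical user-space model, NOT the kernel implementation:
 * times are signed 64-bit nanosecond counts instead of ktime_t.
 */
struct model_timer {
	int64_t expires;
};

/*
 * Move t->expires forward in whole intervals until it lies after 'now'.
 * The caller decides where 'now' comes from (for example a cached
 * per-base timestamp or a freshly read clock).
 * Returns the number of intervals added; 'interval' must be > 0.
 */
static unsigned long model_forward(struct model_timer *t, int64_t now,
				   int64_t interval)
{
	int64_t delta = now - t->expires;
	unsigned long orun;

	if (delta < 0)
		return 0;	/* expiry is already in the future */

	/* Whole intervals that fit into the delay, plus one to pass 'now'. */
	orun = (unsigned long)(delta / interval) + 1;
	t->expires += interval * (int64_t)orun;

	return orun;
}
```

Because `now` is a parameter, a caller running inside the timer callback can hand in a timestamp it already has, while other callers read the clock themselves; the forwarding arithmetic is the same in both cases.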
Signed-off-by: Roman Zippel Signed-off-by: Thomas Gleixner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/hrtimer.c | 7 +++---- kernel/itimer.c | 3 ++- kernel/posix-timers.c | 14 ++++++++++---- 3 files changed, 15 insertions(+), 9 deletions(-) (limited to 'kernel') diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c index b728cc53452b..e989c9981a96 100644 --- a/kernel/hrtimer.c +++ b/kernel/hrtimer.c @@ -301,18 +301,17 @@ void unlock_hrtimer_base(const struct hrtimer *timer, unsigned long *flags) * hrtimer_forward - forward the timer expiry * * @timer: hrtimer to forward + * @now: forward past this time * @interval: the interval to forward * * Forward the timer expiry so it will expire in the future. * Returns the number of overruns. */ unsigned long -hrtimer_forward(struct hrtimer *timer, ktime_t interval) +hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval) { unsigned long orun = 1; - ktime_t delta, now; - - now = timer->base->get_time(); + ktime_t delta; delta = ktime_sub(now, timer->expires); diff --git a/kernel/itimer.c b/kernel/itimer.c index 680e6b70c872..af2ec6b4392c 100644 --- a/kernel/itimer.c +++ b/kernel/itimer.c @@ -136,7 +136,8 @@ int it_real_fn(void *data) if (tsk->signal->it_real_incr.tv64 != 0) { hrtimer_forward(&tsk->signal->real_timer, - tsk->signal->it_real_incr); + tsk->signal->real_timer.base->softirq_time, + tsk->signal->it_real_incr); return HRTIMER_RESTART; } diff --git a/kernel/posix-timers.c b/kernel/posix-timers.c index 9944379360b5..255657accf02 100644 --- a/kernel/posix-timers.c +++ b/kernel/posix-timers.c @@ -251,15 +251,18 @@ __initcall(init_posix_timers); static void schedule_next_timer(struct k_itimer *timr) { + struct hrtimer *timer = &timr->it.real.timer; + if (timr->it.real.interval.tv64 == 0) return; - timr->it_overrun += hrtimer_forward(&timr->it.real.timer, + timr->it_overrun += hrtimer_forward(timer, timer->base->get_time(), timr->it.real.interval); + timr->it_overrun_last = 
timr->it_overrun; timr->it_overrun = -1; ++timr->it_requeue_pending; - hrtimer_restart(&timr->it.real.timer); + hrtimer_restart(timer); } /* @@ -334,6 +337,7 @@ EXPORT_SYMBOL_GPL(posix_timer_event); static int posix_timer_fn(void *data) { struct k_itimer *timr = data; + struct hrtimer *timer = &timr->it.real.timer; unsigned long flags; int si_private = 0; int ret = HRTIMER_NORESTART; @@ -351,7 +355,8 @@ static int posix_timer_fn(void *data) */ if (timr->it.real.interval.tv64 != 0) { timr->it_overrun += - hrtimer_forward(&timr->it.real.timer, + hrtimer_forward(timer, + timer->base->softirq_time, timr->it.real.interval); ret = HRTIMER_RESTART; ++timr->it_requeue_pending; @@ -623,7 +628,8 @@ common_timer_get(struct k_itimer *timr, struct itimerspec *cur_setting) if (timr->it_requeue_pending & REQUEUE_PENDING || (timr->it_sigev_notify & ~SIGEV_THREAD_ID) == SIGEV_NONE) { timr->it_overrun += - hrtimer_forward(timer, timr->it.real.interval); + hrtimer_forward(timer, timer->base->get_time(), + timr->it.real.interval); remaining = hrtimer_get_remaining(timer); } calci: -- cgit v1.2.3 From 3b98a5328171cebc867f70484b20bd34948cd7f6 Mon Sep 17 00:00:00 2001 From: Roman Zippel Date: Sun, 26 Mar 2006 01:38:07 -0800 Subject: [PATCH] hrtimers: posix-timer: cleanup common_timer_get() Cleanup common_timer_get() a little. 
Signed-off-by: Roman Zippel Signed-off-by: Thomas Gleixner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/posix-timers.c | 50 ++++++++++++++++++++++++++------------------------ 1 file changed, 26 insertions(+), 24 deletions(-) (limited to 'kernel') diff --git a/kernel/posix-timers.c b/kernel/posix-timers.c index 255657accf02..7c5f44787c8c 100644 --- a/kernel/posix-timers.c +++ b/kernel/posix-timers.c @@ -608,39 +608,41 @@ static struct k_itimer * lock_timer(timer_t timer_id, unsigned long *flags) static void common_timer_get(struct k_itimer *timr, struct itimerspec *cur_setting) { - ktime_t remaining; + ktime_t now, remaining, iv; struct hrtimer *timer = &timr->it.real.timer; memset(cur_setting, 0, sizeof(struct itimerspec)); - remaining = hrtimer_get_remaining(timer); - /* Time left ? or timer pending */ - if (remaining.tv64 > 0 || hrtimer_active(timer)) - goto calci; + iv = timr->it.real.interval; + /* interval timer ? */ - if (timr->it.real.interval.tv64 == 0) + if (iv.tv64) + cur_setting->it_interval = ktime_to_timespec(iv); + else if (!hrtimer_active(timer) && + (timr->it_sigev_notify & ~SIGEV_THREAD_ID) != SIGEV_NONE) return; + + now = timer->base->get_time(); + /* - * When a requeue is pending or this is a SIGEV_NONE timer - * move the expiry time forward by intervals, so expiry is > - * now. + * When a requeue is pending or this is a SIGEV_NONE + * timer move the expiry time forward by intervals, so + * expiry is > now. */ - if (timr->it_requeue_pending & REQUEUE_PENDING || - (timr->it_sigev_notify & ~SIGEV_THREAD_ID) == SIGEV_NONE) { - timr->it_overrun += - hrtimer_forward(timer, timer->base->get_time(), - timr->it.real.interval); - remaining = hrtimer_get_remaining(timer); - } - calci: - /* interval timer ? 
*/ - if (timr->it.real.interval.tv64 != 0) - cur_setting->it_interval = - ktime_to_timespec(timr->it.real.interval); + if (iv.tv64 && (timr->it_requeue_pending & REQUEUE_PENDING || + (timr->it_sigev_notify & ~SIGEV_THREAD_ID) == SIGEV_NONE)) + timr->it_overrun += hrtimer_forward(timer, now, iv); + + remaining = ktime_sub(timer->expires, now); /* Return 0 only, when the timer is expired and not pending */ - if (remaining.tv64 <= 0) - cur_setting->it_value.tv_nsec = 1; - else + if (remaining.tv64 <= 0) { + /* + * A single shot SIGEV_NONE timer must return 0, when + * it is expired ! + */ + if ((timr->it_sigev_notify & ~SIGEV_THREAD_ID) != SIGEV_NONE) + cur_setting->it_value.tv_nsec = 1; + } else cur_setting->it_value = ktime_to_timespec(remaining); } -- cgit v1.2.3 From 432569bb9d9d424d7ffe5b21f8205c55bdd1aaa8 Mon Sep 17 00:00:00 2001 From: Roman Zippel Date: Sun, 26 Mar 2006 01:38:08 -0800 Subject: [PATCH] hrtimers: simplify nanosleep nanosleep is the only user of the expired state, so let it manage this itself, which makes the hrtimer code a bit simpler. The remaining time is also only calculated if requested. 
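The idea of letting the sleeper track expiry itself, instead of keeping an expired state in the timer, can be modelled in user space roughly as follows. This is a sketch only: `model_timer`, `model_sleeper`, `model_wakeup`, `model_sleeper_init` and `model_fire` are hypothetical names, and the real code also wakes the sleeping task and blocks in the scheduler, which is omitted here.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; not the real definitions. */
enum { MODEL_NORESTART = 0, MODEL_RESTART = 1 };

struct model_timer {
	int (*function)(void *data);	/* callback run on expiry */
	void *data;			/* argument handed to the callback */
};

/* The sleeper embeds the timer and owns the 'expired' flag. */
struct model_sleeper {
	struct model_timer timer;
	int expired;
};

/* Expiry callback: record that the timeout was reached, do not re-arm. */
static int model_wakeup(void *data)
{
	struct model_sleeper *s = data;

	s->expired = 1;
	return MODEL_NORESTART;
}

static void model_sleeper_init(struct model_sleeper *s)
{
	s->timer.function = model_wakeup;
	s->timer.data = s;
	s->expired = 0;
}

/* Model of the timer core running one expired timer. */
static int model_fire(struct model_timer *t)
{
	return t->function(t->data);
}
```

With this split the timer core only needs to know whether a timer is queued; whether the sleep ended by timeout or by a signal is read from `expired` by the code that went to sleep.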
Signed-off-by: Roman Zippel Acked-by: Ingo Molnar Acked-by: Thomas Gleixner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/hrtimer.c | 142 ++++++++++++++++++++++++------------------------------- 1 file changed, 62 insertions(+), 80 deletions(-) (limited to 'kernel') diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c index e989c9981a96..59ec50c1e905 100644 --- a/kernel/hrtimer.c +++ b/kernel/hrtimer.c @@ -625,30 +625,20 @@ static inline void run_hrtimer_queue(struct hrtimer_base *base) fn = timer->function; data = timer->data; set_curr_timer(base, timer); - timer->state = HRTIMER_RUNNING; + timer->state = HRTIMER_INACTIVE; __remove_hrtimer(timer, base); spin_unlock_irq(&base->lock); - /* - * fn == NULL is special case for the simplest timer - * variant - wake up process and do not restart: - */ - if (!fn) { - wake_up_process(data); - restart = HRTIMER_NORESTART; - } else - restart = fn(data); + restart = fn(data); spin_lock_irq(&base->lock); /* Another CPU has added back the timer */ - if (timer->state != HRTIMER_RUNNING) + if (timer->state != HRTIMER_INACTIVE) continue; - if (restart == HRTIMER_RESTART) + if (restart != HRTIMER_NORESTART) enqueue_hrtimer(timer, base); - else - timer->state = HRTIMER_EXPIRED; } set_curr_timer(base, NULL); spin_unlock_irq(&base->lock); @@ -672,79 +662,70 @@ void hrtimer_run_queues(void) * Sleep related functions: */ -/** - * schedule_hrtimer - sleep until timeout - * - * @timer: hrtimer variable initialized with the correct clock base - * @mode: timeout value is abs/rel - * - * Make the current task sleep until @timeout is - * elapsed. - * - * You can set the task state as follows - - * - * %TASK_UNINTERRUPTIBLE - at least @timeout is guaranteed to - * pass before the routine returns. The routine will return 0 - * - * %TASK_INTERRUPTIBLE - the routine may return early if a signal is - * delivered to the current task. 
In this case the remaining time - * will be returned - * - * The current task state is guaranteed to be TASK_RUNNING when this - * routine returns. - */ -static ktime_t __sched -schedule_hrtimer(struct hrtimer *timer, const enum hrtimer_mode mode) -{ - /* fn stays NULL, meaning single-shot wakeup: */ - timer->data = current; +struct sleep_hrtimer { + struct hrtimer timer; + struct task_struct *task; + int expired; +}; - hrtimer_start(timer, timer->expires, mode); +static int nanosleep_wakeup(void *data) +{ + struct sleep_hrtimer *t = data; - schedule(); - hrtimer_cancel(timer); + t->expired = 1; + wake_up_process(t->task); - /* Return the remaining time: */ - if (timer->state != HRTIMER_EXPIRED) - return ktime_sub(timer->expires, timer->base->get_time()); - else - return (ktime_t) {.tv64 = 0 }; + return HRTIMER_NORESTART; } -static inline ktime_t __sched -schedule_hrtimer_interruptible(struct hrtimer *timer, - const enum hrtimer_mode mode) +static int __sched do_nanosleep(struct sleep_hrtimer *t, enum hrtimer_mode mode) { - set_current_state(TASK_INTERRUPTIBLE); + t->timer.function = nanosleep_wakeup; + t->timer.data = t; + t->task = current; + t->expired = 0; + + do { + set_current_state(TASK_INTERRUPTIBLE); + hrtimer_start(&t->timer, t->timer.expires, mode); + + schedule(); - return schedule_hrtimer(timer, mode); + if (unlikely(!t->expired)) { + hrtimer_cancel(&t->timer); + mode = HRTIMER_ABS; + } + } while (!t->expired && !signal_pending(current)); + + return t->expired; } static long __sched nanosleep_restart(struct restart_block *restart) { + struct sleep_hrtimer t; struct timespec __user *rmtp; struct timespec tu; - void *rfn_save = restart->fn; - struct hrtimer timer; - ktime_t rem; + ktime_t time; restart->fn = do_no_restart_syscall; - hrtimer_init(&timer, (clockid_t) restart->arg3, HRTIMER_ABS); - - timer.expires.tv64 = ((u64)restart->arg1 << 32) | (u64) restart->arg0; + hrtimer_init(&t.timer, restart->arg3, HRTIMER_ABS); + t.timer.expires.tv64 = 
((u64)restart->arg1 << 32) | (u64) restart->arg0; - rem = schedule_hrtimer_interruptible(&timer, HRTIMER_ABS); - - if (rem.tv64 <= 0) + if (do_nanosleep(&t, HRTIMER_ABS)) return 0; rmtp = (struct timespec __user *) restart->arg2; - tu = ktime_to_timespec(rem); - if (rmtp && copy_to_user(rmtp, &tu, sizeof(tu))) - return -EFAULT; + if (rmtp) { + time = ktime_sub(t.timer.expires, t.timer.base->get_time()); + if (time.tv64 <= 0) + return 0; + tu = ktime_to_timespec(time); + if (copy_to_user(rmtp, &tu, sizeof(tu))) + return -EFAULT; + } - restart->fn = rfn_save; + restart->fn = nanosleep_restart; /* The other values in restart are already filled in */ return -ERESTART_RESTARTBLOCK; @@ -754,33 +735,34 @@ long hrtimer_nanosleep(struct timespec *rqtp, struct timespec __user *rmtp, const enum hrtimer_mode mode, const clockid_t clockid) { struct restart_block *restart; - struct hrtimer timer; + struct sleep_hrtimer t; struct timespec tu; ktime_t rem; - hrtimer_init(&timer, clockid, mode); - - timer.expires = timespec_to_ktime(*rqtp); - - rem = schedule_hrtimer_interruptible(&timer, mode); - if (rem.tv64 <= 0) + hrtimer_init(&t.timer, clockid, mode); + t.timer.expires = timespec_to_ktime(*rqtp); + if (do_nanosleep(&t, mode)) return 0; /* Absolute timers do not update the rmtp value and restart: */ if (mode == HRTIMER_ABS) return -ERESTARTNOHAND; - tu = ktime_to_timespec(rem); - - if (rmtp && copy_to_user(rmtp, &tu, sizeof(tu))) - return -EFAULT; + if (rmtp) { + rem = ktime_sub(t.timer.expires, t.timer.base->get_time()); + if (rem.tv64 <= 0) + return 0; + tu = ktime_to_timespec(rem); + if (copy_to_user(rmtp, &tu, sizeof(tu))) + return -EFAULT; + } restart = &current_thread_info()->restart_block; restart->fn = nanosleep_restart; - restart->arg0 = timer.expires.tv64 & 0xFFFFFFFF; - restart->arg1 = timer.expires.tv64 >> 32; + restart->arg0 = t.timer.expires.tv64 & 0xFFFFFFFF; + restart->arg1 = t.timer.expires.tv64 >> 32; restart->arg2 = (unsigned long) rmtp; - restart->arg3 = (unsigned
long) timer.base->index; + restart->arg3 = (unsigned long) t.timer.base->index; return -ERESTART_RESTARTBLOCK; } -- cgit v1.2.3 From b75f7a51ca75c977d7d77f735d7a7859194eb39e Mon Sep 17 00:00:00 2001 From: Roman Zippel Date: Sun, 26 Mar 2006 01:38:09 -0800 Subject: [PATCH] hrtimers: remove state field Remove the state field and encode this information in the rb_node, similar to normal timers. Signed-off-by: Roman Zippel Acked-by: Ingo Molnar Signed-off-by: Thomas Gleixner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/hrtimer.c | 14 +++++--------- 1 file changed, 5 insertions(+), 9 deletions(-) (limited to 'kernel') diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c index 59ec50c1e905..658d49feedb9 100644 --- a/kernel/hrtimer.c +++ b/kernel/hrtimer.c @@ -374,8 +374,6 @@ static void enqueue_hrtimer(struct hrtimer *timer, struct hrtimer_base *base) rb_link_node(&timer->node, parent, link); rb_insert_color(&timer->node, &base->active); - timer->state = HRTIMER_PENDING; - if (!base->first || timer->expires.tv64 < rb_entry(base->first, struct hrtimer, node)->expires.tv64) base->first = &timer->node; @@ -395,6 +393,7 @@ static void __remove_hrtimer(struct hrtimer *timer, struct hrtimer_base *base) if (base->first == &timer->node) base->first = rb_next(&timer->node); rb_erase(&timer->node, &base->active); + timer->node.rb_parent = HRTIMER_INACTIVE; } /* @@ -405,7 +404,6 @@ remove_hrtimer(struct hrtimer *timer, struct hrtimer_base *base) { if (hrtimer_active(timer)) { __remove_hrtimer(timer, base); - timer->state = HRTIMER_INACTIVE; return 1; } return 0; @@ -579,6 +577,7 @@ void hrtimer_init(struct hrtimer *timer, clockid_t clock_id, clock_id = CLOCK_MONOTONIC; timer->base = &bases[clock_id]; + timer->node.rb_parent = HRTIMER_INACTIVE; } /** @@ -625,7 +624,6 @@ static inline void run_hrtimer_queue(struct hrtimer_base *base) fn = timer->function; data = timer->data; set_curr_timer(base, timer); - timer->state = HRTIMER_INACTIVE; __remove_hrtimer(timer,
base); spin_unlock_irq(&base->lock); @@ -633,12 +631,10 @@ static inline void run_hrtimer_queue(struct hrtimer_base *base) spin_lock_irq(&base->lock); - /* Another CPU has added back the timer */ - if (timer->state != HRTIMER_INACTIVE) - continue; - - if (restart != HRTIMER_NORESTART) + if (restart != HRTIMER_NORESTART) { + BUG_ON(hrtimer_active(timer)); enqueue_hrtimer(timer, base); + } } set_curr_timer(base, NULL); spin_unlock_irq(&base->lock); -- cgit v1.2.3 From df869b630d9d9131c10cf073fb61646048874b2f Mon Sep 17 00:00:00 2001 From: Roman Zippel Date: Sun, 26 Mar 2006 01:38:11 -0800 Subject: [PATCH] hrtimers: remove nsec_t typedef nsec_t predates ktime_t and has mostly been superseded by it. In the few places that are left it's better to make it explicit that we're dealing with 64 bit values here. Signed-off-by: Roman Zippel Acked-by: Thomas Gleixner Acked-by: John Stultz Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/hrtimer.c | 4 ++-- kernel/time.c | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c index 658d49feedb9..44108de4f028 100644 --- a/kernel/hrtimer.c +++ b/kernel/hrtimer.c @@ -266,7 +266,7 @@ ktime_t ktime_add_ns(const ktime_t kt, u64 nsec) /* * Divide a ktime value by a nanosecond value */ -static unsigned long ktime_divns(const ktime_t kt, nsec_t div) +static unsigned long ktime_divns(const ktime_t kt, s64 div) { u64 dclc, inc, dns; int sft = 0; @@ -322,7 +322,7 @@ hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval) interval.tv64 = timer->base->resolution.tv64; if (unlikely(delta.tv64 >= interval.tv64)) { - nsec_t incr = ktime_to_ns(interval); + s64 incr = ktime_to_ns(interval); orun = ktime_divns(delta, incr); timer->expires = ktime_add_ns(timer->expires, incr * orun); diff --git a/kernel/time.c b/kernel/time.c index e00a97b77241..ff8e7019c4c4 100644 --- a/kernel/time.c +++ b/kernel/time.c @@ -610,7 +610,7 @@ void 
set_normalized_timespec(struct timespec *ts, time_t sec, long nsec) * * Returns the timespec representation of the nsec parameter. */ -struct timespec ns_to_timespec(const nsec_t nsec) +struct timespec ns_to_timespec(const s64 nsec) { struct timespec ts; @@ -630,7 +630,7 @@ struct timespec ns_to_timespec(const nsec_t nsec) * * Returns the timeval representation of the nsec parameter. */ -struct timeval ns_to_timeval(const nsec_t nsec) +struct timeval ns_to_timeval(const s64 nsec) { struct timespec ts = ns_to_timespec(nsec); struct timeval tv; -- cgit v1.2.3 From 05cfb614ddbf3181540ce09d44d96486f8ba8d6a Mon Sep 17 00:00:00 2001 From: Roman Zippel Date: Sun, 26 Mar 2006 01:38:12 -0800 Subject: [PATCH] hrtimers: remove data field The nanosleep cleanup allows removing the data field of hrtimer. The callback function can use container_of() to get its own data. Since the hrtimer structure is embedded in other structures anyway, this adds no overhead. Signed-off-by: Roman Zippel Signed-off-by: Thomas Gleixner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/fork.c | 2 +- kernel/hrtimer.c | 12 +++++------- kernel/itimer.c | 15 +++++++-------- kernel/posix-timers.c | 9 ++++----- 4 files changed, 17 insertions(+), 21 deletions(-) (limited to 'kernel') diff --git a/kernel/fork.c b/kernel/fork.c index a02063903aaa..4bd6486aa67d 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -848,7 +848,7 @@ static inline int copy_signal(unsigned long clone_flags, struct task_struct * ts hrtimer_init(&sig->real_timer, CLOCK_MONOTONIC, HRTIMER_REL); sig->it_real_incr.tv64 = 0; sig->real_timer.function = it_real_fn; - sig->real_timer.data = tsk; + sig->tsk = tsk; sig->it_virt_expires = cputime_zero; sig->it_virt_incr = cputime_zero; diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c index 44108de4f028..0237a556eb1f 100644 --- a/kernel/hrtimer.c +++ b/kernel/hrtimer.c @@ -613,21 +613,19 @@ static inline void run_hrtimer_queue(struct hrtimer_base *base) while ((node =
base->first)) { struct hrtimer *timer; - int (*fn)(void *); + int (*fn)(struct hrtimer *); int restart; - void *data; timer = rb_entry(node, struct hrtimer, node); if (base->softirq_time.tv64 <= timer->expires.tv64) break; fn = timer->function; - data = timer->data; set_curr_timer(base, timer); __remove_hrtimer(timer, base); spin_unlock_irq(&base->lock); - restart = fn(data); + restart = fn(timer); spin_lock_irq(&base->lock); @@ -664,9 +662,10 @@ struct sleep_hrtimer { int expired; }; -static int nanosleep_wakeup(void *data) +static int nanosleep_wakeup(struct hrtimer *timer) { - struct sleep_hrtimer *t = data; + struct sleep_hrtimer *t = + container_of(timer, struct sleep_hrtimer, timer); t->expired = 1; wake_up_process(t->task); @@ -677,7 +676,6 @@ static int nanosleep_wakeup(void *data) static int __sched do_nanosleep(struct sleep_hrtimer *t, enum hrtimer_mode mode) { t->timer.function = nanosleep_wakeup; - t->timer.data = t; t->task = current; t->expired = 0; diff --git a/kernel/itimer.c b/kernel/itimer.c index af2ec6b4392c..204ed7939e75 100644 --- a/kernel/itimer.c +++ b/kernel/itimer.c @@ -128,17 +128,16 @@ asmlinkage long sys_getitimer(int which, struct itimerval __user *value) /* * The timer is automagically restarted, when interval != 0 */ -int it_real_fn(void *data) +int it_real_fn(struct hrtimer *timer) { - struct task_struct *tsk = (struct task_struct *) data; + struct signal_struct *sig = + container_of(timer, struct signal_struct, real_timer); - send_group_sig_info(SIGALRM, SEND_SIG_PRIV, tsk); - - if (tsk->signal->it_real_incr.tv64 != 0) { - hrtimer_forward(&tsk->signal->real_timer, - tsk->signal->real_timer.base->softirq_time, - tsk->signal->it_real_incr); + send_group_sig_info(SIGALRM, SEND_SIG_PRIV, sig->tsk); + if (sig->it_real_incr.tv64 != 0) { + hrtimer_forward(timer, timer->base->softirq_time, + sig->it_real_incr); return HRTIMER_RESTART; } return HRTIMER_NORESTART; diff --git a/kernel/posix-timers.c b/kernel/posix-timers.c index 
7c5f44787c8c..ac6dc8744429 100644 --- a/kernel/posix-timers.c +++ b/kernel/posix-timers.c @@ -145,7 +145,7 @@ static int common_timer_set(struct k_itimer *, int, struct itimerspec *, struct itimerspec *); static int common_timer_del(struct k_itimer *timer); -static int posix_timer_fn(void *data); +static int posix_timer_fn(struct hrtimer *data); static struct k_itimer *lock_timer(timer_t timer_id, unsigned long *flags); @@ -334,14 +334,14 @@ EXPORT_SYMBOL_GPL(posix_timer_event); * This code is for CLOCK_REALTIME* and CLOCK_MONOTONIC* timers. */ -static int posix_timer_fn(void *data) +static int posix_timer_fn(struct hrtimer *timer) { - struct k_itimer *timr = data; - struct hrtimer *timer = &timr->it.real.timer; + struct k_itimer *timr; unsigned long flags; int si_private = 0; int ret = HRTIMER_NORESTART; + timr = container_of(timer, struct k_itimer, it.real.timer); spin_lock_irqsave(&timr->it_lock, flags); if (timr->it.real.interval.tv64 != 0) @@ -725,7 +725,6 @@ common_timer_set(struct k_itimer *timr, int flags, mode = flags & TIMER_ABSTIME ? HRTIMER_ABS : HRTIMER_REL; hrtimer_init(&timr->it.real.timer, timr->it_clock, mode); - timr->it.real.timer.data = timr; timr->it.real.timer.function = posix_timer_fn; timer->expires = timespec_to_ktime(new_setting->it_value); -- cgit v1.2.3 From c6fd91f0bdcd294a0ae0ba2b2a7f7456ef4b7144 Mon Sep 17 00:00:00 2001 From: bibo mao Date: Sun, 26 Mar 2006 01:38:20 -0800 Subject: [PATCH] kretprobe instance recycled by parent process When kretprobe probes the schedule() function, if the probed process exits then schedule() will never return, so some kretprobe instances will never be recycled. In this patch the parent process will recycle retprobe instances of the probed function and there will be no memory leak of kretprobe instances. 
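The kernel/kprobes.c hunk of this patch changes which hash bucket kprobe_flush_task() scans: the bucket of the dead task tk rather than that of current. A plausible reading is that the flush now runs from finish_task_switch(), where current is no longer the dying task. The userspace sketch below illustrates only that lookup difference. All names in it (struct task, struct rp_instance, TABLE_SIZE, bucket_of, add_instance, flush_task) are hypothetical stand-ins, not kernel APIs, and the modulo hash is an assumption made for brevity.

```c
#include <assert.h>
#include <stddef.h>

#define TABLE_SIZE 4	/* hypothetical; stands in for the kernel's hash table size */

struct task {
	int pid;
};

struct rp_instance {
	struct task *task;		/* task that entered the probed function */
	struct rp_instance *next;	/* next instance in the same bucket */
	int recycled;			/* set once the instance is handed back */
};

static struct rp_instance *table[TABLE_SIZE];

/* Pick the bucket for a task (toy hash: pid modulo table size). */
static struct rp_instance **bucket_of(const struct task *t)
{
	return &table[(unsigned int)t->pid % TABLE_SIZE];
}

/* Record that task t has an outstanding return-probe instance. */
void add_instance(struct rp_instance *ri, struct task *t)
{
	struct rp_instance **head = bucket_of(t);

	ri->task = t;
	ri->recycled = 0;
	ri->next = *head;
	*head = ri;
}

/*
 * Recycle every instance left behind by the dead task tk and return
 * how many were found. `lookup` selects the bucket to scan: passing
 * the running task models the old code, passing tk models the fix.
 */
int flush_task(const struct task *tk, const struct task *lookup)
{
	struct rp_instance **pp = bucket_of(lookup);
	int recycled = 0;

	while (*pp) {
		struct rp_instance *ri = *pp;

		if (ri->task == tk) {
			*pp = ri->next;		/* unlink from the bucket */
			ri->recycled = 1;
			recycled++;
		} else {
			pp = &ri->next;
		}
	}
	return recycled;
}
```

Scanning the bucket of a different running task finds nothing when the two tasks hash differently, so the instance would stay allocated; scanning the dead task's own bucket recycles it.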
Signed-off-by: bibo mao Cc: Masami Hiramatsu Cc: Prasanna S Panchamukhi Cc: Ananth N Mavinakayanahalli Cc: Anil S Keshavamurthy Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/kprobes.c | 10 +++++----- kernel/sched.c | 9 ++++++++- 2 files changed, 13 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/kprobes.c b/kernel/kprobes.c index 1fb9f753ef60..1156eb0977d0 100644 --- a/kernel/kprobes.c +++ b/kernel/kprobes.c @@ -323,10 +323,10 @@ struct hlist_head __kprobes *kretprobe_inst_table_head(struct task_struct *tsk) } /* - * This function is called from exit_thread or flush_thread when task tk's - * stack is being recycled so that we can recycle any function-return probe - * instances associated with this task. These left over instances represent - * probed functions that have been called but will never return. + * This function is called from finish_task_switch when task tk becomes dead, + * so that we can recycle any function-return probe instances associated + * with this task. These left over instances represent probed functions + * that have been called but will never return. 
*/ void __kprobes kprobe_flush_task(struct task_struct *tk) { @@ -336,7 +336,7 @@ void __kprobes kprobe_flush_task(struct task_struct *tk) unsigned long flags = 0; spin_lock_irqsave(&kretprobe_lock, flags); - head = kretprobe_inst_table_head(current); + head = kretprobe_inst_table_head(tk); hlist_for_each_entry_safe(ri, node, tmp, head, hlist) { if (ri->task == tk) recycle_rp_inst(ri); diff --git a/kernel/sched.c b/kernel/sched.c index 7ffaabd64f89..78acdefeccca 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -49,6 +49,7 @@ #include #include #include +#include #include #include @@ -1546,8 +1547,14 @@ static inline void finish_task_switch(runqueue_t *rq, task_t *prev) finish_lock_switch(rq, prev); if (mm) mmdrop(mm); - if (unlikely(prev_task_flags & PF_DEAD)) + if (unlikely(prev_task_flags & PF_DEAD)) { + /* + * Remove function-return probe instances associated with this + * task and put them back on the free list. + */ + kprobe_flush_task(prev); put_task_struct(prev); + } } /** -- cgit v1.2.3 From 013d3868143387be99bb0373d49452eaa3c55d41 Mon Sep 17 00:00:00 2001 From: Martin Andersson Date: Mon, 27 Mar 2006 01:15:18 -0800 Subject: [PATCH] sched: fix task interactivity calculation

A truncation error in kernel/sched.c is triggered when the nice value is negative. The affected code is used in the TASK_INTERACTIVE macro.

The code is:

#define SCALE(v1,v1_max,v2_max) \
	(v1) * (v2_max) / (v1_max)

which is used in this way:

	SCALE(TASK_NICE(p), 40, MAX_BONUS)

The comments in the code say:

 * This part scales the interactivity limit depending on niceness.
 *
 * We scale it linearly, offset by the INTERACTIVE_DELTA delta.
 * Here are a few examples of different nice levels:
 *
 *  TASK_INTERACTIVE(-20): [1,1,1,1,1,1,1,1,1,0,0]
 *  TASK_INTERACTIVE(-10): [1,1,1,1,1,1,1,0,0,0,0]
 *  TASK_INTERACTIVE(  0): [1,1,1,1,0,0,0,0,0,0,0]
 *  TASK_INTERACTIVE( 10): [1,1,0,0,0,0,0,0,0,0,0]
 *  TASK_INTERACTIVE( 19): [0,0,0,0,0,0,0,0,0,0,0]
 *
 * (the X axis represents the possible -5 ... 0 ... +5 dynamic
 *  priority range a task can explore, a value of '1' means the
 *  task is rated interactive.)

However, the current code does not scale it linearly and the result differs from the given examples. If the mathematical function "floor" is used when the nice value is negative instead of the truncation one gets when using integer division, the result conforms to the documentation.

Output of TASK_INTERACTIVE when using the kernel code:

nice   dynamic priorities
 -20   1 1 1 1 1 1 1 1 1 0 0
 -19   1 1 1 1 1 1 1 1 0 0 0
 -18   1 1 1 1 1 1 1 1 0 0 0
 -17   1 1 1 1 1 1 1 1 0 0 0
 -16   1 1 1 1 1 1 1 1 0 0 0
 -15   1 1 1 1 1 1 1 0 0 0 0
 -14   1 1 1 1 1 1 1 0 0 0 0
 -13   1 1 1 1 1 1 1 0 0 0 0
 -12   1 1 1 1 1 1 1 0 0 0 0
 -11   1 1 1 1 1 1 0 0 0 0 0
 -10   1 1 1 1 1 1 0 0 0 0 0
  -9   1 1 1 1 1 1 0 0 0 0 0
  -8   1 1 1 1 1 1 0 0 0 0 0
  -7   1 1 1 1 1 0 0 0 0 0 0
  -6   1 1 1 1 1 0 0 0 0 0 0
  -5   1 1 1 1 1 0 0 0 0 0 0
  -4   1 1 1 1 1 0 0 0 0 0 0
  -3   1 1 1 1 0 0 0 0 0 0 0
  -2   1 1 1 1 0 0 0 0 0 0 0
  -1   1 1 1 1 0 0 0 0 0 0 0
   0   1 1 1 1 0 0 0 0 0 0 0
   1   1 1 1 1 0 0 0 0 0 0 0
   2   1 1 1 1 0 0 0 0 0 0 0
   3   1 1 1 1 0 0 0 0 0 0 0
   4   1 1 1 0 0 0 0 0 0 0 0
   5   1 1 1 0 0 0 0 0 0 0 0
   6   1 1 1 0 0 0 0 0 0 0 0
   7   1 1 1 0 0 0 0 0 0 0 0
   8   1 1 0 0 0 0 0 0 0 0 0
   9   1 1 0 0 0 0 0 0 0 0 0
  10   1 1 0 0 0 0 0 0 0 0 0
  11   1 1 0 0 0 0 0 0 0 0 0
  12   1 0 0 0 0 0 0 0 0 0 0
  13   1 0 0 0 0 0 0 0 0 0 0
  14   1 0 0 0 0 0 0 0 0 0 0
  15   1 0 0 0 0 0 0 0 0 0 0
  16   0 0 0 0 0 0 0 0 0 0 0
  17   0 0 0 0 0 0 0 0 0 0 0
  18   0 0 0 0 0 0 0 0 0 0 0
  19   0 0 0 0 0 0 0 0 0 0 0

Output of TASK_INTERACTIVE when using "floor":

nice   dynamic priorities
 -20   1 1 1 1 1 1 1 1 1 0 0
 -19   1 1 1 1 1 1 1 1 1 0 0
 -18   1 1 1 1 1 1 1 1 1 0 0
 -17   1 1 1 1 1 1 1 1 1 0 0
 -16   1 1 1 1 1 1 1 1 0 0 0
 -15   1 1 1 1 1 1 1 1 0 0 0
 -14   1 1 1 1 1 1 1 1 0 0 0
 -13   1 1 1 1 1 1 1 1 0 0 0
 -12   1 1 1 1 1 1 1 0 0 0 0
 -11   1 1 1 1 1 1 1 0 0 0 0
 -10   1 1 1 1 1 1 1 0 0 0 0
  -9   1 1 1 1 1 1 1 0 0 0 0
  -8   1 1 1 1 1 1 0 0 0 0 0
  -7   1 1 1 1 1 1 0 0 0 0 0
  -6   1 1 1 1 1 1 0 0 0 0 0
  -5   1 1 1 1 1 1 0 0 0 0 0
  -4   1 1 1 1 1 0 0 0 0 0 0
  -3   1 1 1 1 1 0 0 0 0 0 0
  -2   1 1 1 1 1 0 0 0 0 0 0
  -1   1 1 1 1 1 0 0 0 0 0 0
   0   1 1 1 1 0 0 0 0 0 0 0
   1   1 1 1 1 0 0 0 0 0 0 0
   2   1 1 1 1 0 0 0 0 0 0 0
   3   1 1 1 1 0 0 0 0 0 0 0
   4   1 1 1 0 0 0 0 0 0 0 0
   5   1 1 1 0 0 0 0 0 0 0 0
   6   1 1 1 0 0 0 0 0 0 0 0
   7   1 1 1 0 0 0 0 0 0 0 0
   8   1 1 0 0 0 0 0 0 0 0 0
   9   1 1 0 0 0 0 0 0 0 0 0
  10   1 1 0 0 0 0 0 0 0 0 0
  11   1 1 0 0 0 0 0 0 0 0 0
  12   1 0 0 0 0 0 0 0 0 0 0
  13   1 0 0 0 0 0 0 0 0 0 0
  14   1 0 0 0 0 0 0 0 0 0 0
  15   1 0 0 0 0 0 0 0 0 0 0
  16   0 0 0 0 0 0 0 0 0 0 0
  17   0 0 0 0 0 0 0 0 0 0 0
  18   0 0 0 0 0 0 0 0 0 0 0
  19   0 0 0 0 0 0 0 0 0 0 0

Signed-off-by: Martin Andersson Acked-by: Ingo Molnar Cc: Nick Piggin Cc: Mike Galbraith Cc: Peter Williams Cc: Con Kolivas Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/sched.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/sched.c b/kernel/sched.c index 78acdefeccca..dc599c85a88d 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -145,7 +145,8 @@ (v1) * (v2_max) / (v1_max) #define DELTA(p) \ - (SCALE(TASK_NICE(p), 40, MAX_BONUS) + INTERACTIVE_DELTA) + (SCALE(TASK_NICE(p) + 20, 40, MAX_BONUS) - 20 * MAX_BONUS / 40 + \ + INTERACTIVE_DELTA) #define TASK_INTERACTIVE(p) \ ((p)->prio <= (p)->static_prio - DELTA(p)) -- cgit v1.2.3 From 77e4bfbcf071f795b54862455dce8902b3fc29c2 Mon Sep 17 00:00:00 2001 From: Andreas Mohr Date: Mon, 27 Mar 2006 01:15:20 -0800 Subject: [PATCH] Small schedule() optimization Small schedule() micro-optimization. Signed-off-by: Andreas Mohr Signed-off-by: Ingo Molnar Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/sched.c | 12 +++++------- 1 file changed, 5 insertions(+), 7 deletions(-) (limited to 'kernel') diff --git a/kernel/sched.c b/kernel/sched.c index dc599c85a88d..a96a05d23262 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -2879,13 +2879,11 @@ asmlinkage void __sched schedule(void) * schedule() atomically, we ignore that path for now. * Otherwise, whine if we are scheduling when we should not be.
*/ - if (likely(!current->exit_state)) { - if (unlikely(in_atomic())) { - printk(KERN_ERR "BUG: scheduling while atomic: " - "%s/0x%08x/%d\n", - current->comm, preempt_count(), current->pid); - dump_stack(); - } + if (unlikely(in_atomic() && !current->exit_state)) { + printk(KERN_ERR "BUG: scheduling while atomic: " + "%s/0x%08x/%d\n", + current->comm, preempt_count(), current->pid); + dump_stack(); } profile_hit(SCHED_PROFILING, __builtin_return_address(0)); -- cgit v1.2.3 From 1e9f28fa1eb9773bf65bae08288c6a0a38eef4a7 Mon Sep 17 00:00:00 2001 From: "Siddha, Suresh B" Date: Mon, 27 Mar 2006 01:15:22 -0800 Subject: [PATCH] sched: new sched domain for representing multi-core Add a new sched domain for representing multi-core with shared caches between cores. Consider a dual package system, each package containing two cores, with the last-level cache shared between the cores within a package. If there are two runnable processes, with this appended patch those two processes will be scheduled on different packages. On such systems, with this patch we have observed an 8% performance improvement with the specJBB (2 warehouse) benchmark and a 35% improvement with CFP2000 rate (with 2 users). This new domain will come into play only on multi-core systems with shared caches. On other systems, this sched domain will be removed by the domain degeneration code. This new domain can also be used for implementing a power savings policy (see the OLS 2005 CMP kernel scheduler paper for more details; I will post another patch for the power savings policy soon). Most of the arch/* file changes are for the cpu_coregroup_map() implementation.
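The cpu_power arithmetic that the CONFIG_SCHED_MC branch of this patch introduces can be tried in isolation. What follows is a userspace sketch, not kernel code: core_group_power() and numa_group_power() are hypothetical helper names, and SCHED_LOAD_SCALE is assumed to be 128, since the patch only refers to the symbol.

```c
#include <assert.h>

/* Assumed value; the patch itself only uses the symbol. */
#define SCHED_LOAD_SCALE 128UL

/*
 * Group power in the core domain, following the expression in the
 * CONFIG_SCHED_MC branch: one full CPU plus 10% for every additional
 * CPU in the group (integer arithmetic, so fractions are dropped).
 */
unsigned long core_group_power(unsigned int cpus_in_group)
{
	return SCHED_LOAD_SCALE +
	       (cpus_in_group - 1) * SCHED_LOAD_SCALE / 10;
}

/*
 * With CONFIG_SCHED_MC every physical package is pinned to exactly
 * SCHED_LOAD_SCALE, so the power of a NUMA group becomes a plain sum
 * over its packages.
 */
unsigned long numa_group_power(unsigned int packages)
{
	unsigned long power = 0;
	unsigned int i;

	for (i = 0; i < packages; i++)
		power += SCHED_LOAD_SCALE;
	return power;
}
```

Under the assumed scale, a group of two CPUs gets 128 + 12 = 140, which stays below 2 * SCHED_LOAD_SCALE as the comment in the patch requires for the physical level.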
Signed-off-by: Suresh Siddha Cc: Ingo Molnar Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/sched.c | 73 ++++++++++++++++++++++++++++++++++++++++++++++++++++++---- 1 file changed, 68 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/kernel/sched.c b/kernel/sched.c index a96a05d23262..8a8b71b5751b 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -5574,11 +5574,31 @@ static int cpu_to_cpu_group(int cpu) } #endif +#ifdef CONFIG_SCHED_MC +static DEFINE_PER_CPU(struct sched_domain, core_domains); +static struct sched_group sched_group_core[NR_CPUS]; +#endif + +#if defined(CONFIG_SCHED_MC) && defined(CONFIG_SCHED_SMT) +static int cpu_to_core_group(int cpu) +{ + return first_cpu(cpu_sibling_map[cpu]); +} +#elif defined(CONFIG_SCHED_MC) +static int cpu_to_core_group(int cpu) +{ + return cpu; +} +#endif + static DEFINE_PER_CPU(struct sched_domain, phys_domains); static struct sched_group sched_group_phys[NR_CPUS]; static int cpu_to_phys_group(int cpu) { -#ifdef CONFIG_SCHED_SMT +#if defined(CONFIG_SCHED_MC) + cpumask_t mask = cpu_coregroup_map(cpu); + return first_cpu(mask); +#elif defined(CONFIG_SCHED_SMT) return first_cpu(cpu_sibling_map[cpu]); #else return cpu; @@ -5676,6 +5696,17 @@ void build_sched_domains(const cpumask_t *cpu_map) sd->parent = p; sd->groups = &sched_group_phys[group]; +#ifdef CONFIG_SCHED_MC + p = sd; + sd = &per_cpu(core_domains, i); + group = cpu_to_core_group(i); + *sd = SD_MC_INIT; + sd->span = cpu_coregroup_map(i); + cpus_and(sd->span, sd->span, *cpu_map); + sd->parent = p; + sd->groups = &sched_group_core[group]; +#endif + #ifdef CONFIG_SCHED_SMT p = sd; sd = &per_cpu(cpu_domains, i); @@ -5701,6 +5732,19 @@ void build_sched_domains(const cpumask_t *cpu_map) } #endif +#ifdef CONFIG_SCHED_MC + /* Set up multi-core groups */ + for_each_cpu_mask(i, *cpu_map) { + cpumask_t this_core_map = cpu_coregroup_map(i); + cpus_and(this_core_map, this_core_map, *cpu_map); + if (i != first_cpu(this_core_map)) + continue; 
+ init_sched_build_groups(sched_group_core, this_core_map, + &cpu_to_core_group); + } +#endif + + /* Set up physical groups */ for (i = 0; i < MAX_NUMNODES; i++) { cpumask_t nodemask = node_to_cpumask(i); @@ -5797,11 +5841,31 @@ void build_sched_domains(const cpumask_t *cpu_map) power = SCHED_LOAD_SCALE; sd->groups->cpu_power = power; #endif +#ifdef CONFIG_SCHED_MC + sd = &per_cpu(core_domains, i); + power = SCHED_LOAD_SCALE + (cpus_weight(sd->groups->cpumask)-1) + * SCHED_LOAD_SCALE / 10; + sd->groups->cpu_power = power; + + sd = &per_cpu(phys_domains, i); + /* + * This has to be < 2 * SCHED_LOAD_SCALE + * Lets keep it SCHED_LOAD_SCALE, so that + * while calculating NUMA group's cpu_power + * we can simply do + * numa_group->cpu_power += phys_group->cpu_power; + * + * See "only add power once for each physical pkg" + * comment below + */ + sd->groups->cpu_power = SCHED_LOAD_SCALE; +#else sd = &per_cpu(phys_domains, i); power = SCHED_LOAD_SCALE + SCHED_LOAD_SCALE * (cpus_weight(sd->groups->cpumask)-1) / 10; sd->groups->cpu_power = power; +#endif #ifdef CONFIG_NUMA sd = &per_cpu(allnodes_domains, i); @@ -5823,7 +5887,6 @@ void build_sched_domains(const cpumask_t *cpu_map) next_sg: for_each_cpu_mask(j, sg->cpumask) { struct sched_domain *sd; - int power; sd = &per_cpu(phys_domains, j); if (j != first_cpu(sd->groups->cpumask)) { @@ -5833,10 +5896,8 @@ next_sg: */ continue; } - power = SCHED_LOAD_SCALE + SCHED_LOAD_SCALE * - (cpus_weight(sd->groups->cpumask)-1) / 10; - sg->cpu_power += power; + sg->cpu_power += sd->groups->cpu_power; } sg = sg->next; if (sg != sched_group_nodes[i]) @@ -5849,6 +5910,8 @@ next_sg: struct sched_domain *sd; #ifdef CONFIG_SCHED_SMT sd = &per_cpu(cpu_domains, i); +#elif defined(CONFIG_SCHED_MC) + sd = &per_cpu(core_domains, i); #else sd = &per_cpu(phys_domains, i); #endif -- cgit v1.2.3 From 0806903316d516a3b3851c51cea5c71724d9051d Mon Sep 17 00:00:00 2001 From: "Siddha, Suresh B" Date: Mon, 27 Mar 2006 01:15:23 -0800 Subject: [PATCH] sched: 
fix group power for allnodes_domains Current sched groups power calculation for allnodes_domains is wrong. We should really be using cumulative power of the physical packages in that group (similar to the calculation in node_domains) Signed-off-by: Suresh Siddha Cc: Ingo Molnar Cc: Nick Piggin Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/sched.c | 62 +++++++++++++++++++++++++++------------------------------- 1 file changed, 29 insertions(+), 33 deletions(-) (limited to 'kernel') diff --git a/kernel/sched.c b/kernel/sched.c index 8a8b71b5751b..7854ee516b92 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -5621,6 +5621,32 @@ static int cpu_to_allnodes_group(int cpu) { return cpu_to_node(cpu); } +static void init_numa_sched_groups_power(struct sched_group *group_head) +{ + struct sched_group *sg = group_head; + int j; + + if (!sg) + return; +next_sg: + for_each_cpu_mask(j, sg->cpumask) { + struct sched_domain *sd; + + sd = &per_cpu(phys_domains, j); + if (j != first_cpu(sd->groups->cpumask)) { + /* + * Only add "power" once for each + * physical package. 
+ */ + continue; + } + + sg->cpu_power += sd->groups->cpu_power; + } + sg = sg->next; + if (sg != group_head) + goto next_sg; +} #endif /* @@ -5866,43 +5892,13 @@ void build_sched_domains(const cpumask_t *cpu_map) (cpus_weight(sd->groups->cpumask)-1) / 10; sd->groups->cpu_power = power; #endif - -#ifdef CONFIG_NUMA - sd = &per_cpu(allnodes_domains, i); - if (sd->groups) { - power = SCHED_LOAD_SCALE + SCHED_LOAD_SCALE * - (cpus_weight(sd->groups->cpumask)-1) / 10; - sd->groups->cpu_power = power; - } -#endif } #ifdef CONFIG_NUMA - for (i = 0; i < MAX_NUMNODES; i++) { - struct sched_group *sg = sched_group_nodes[i]; - int j; - - if (sg == NULL) - continue; -next_sg: - for_each_cpu_mask(j, sg->cpumask) { - struct sched_domain *sd; + for (i = 0; i < MAX_NUMNODES; i++) + init_numa_sched_groups_power(sched_group_nodes[i]); - sd = &per_cpu(phys_domains, j); - if (j != first_cpu(sd->groups->cpumask)) { - /* - * Only add "power" once for each - * physical package. - */ - continue; - } - - sg->cpu_power += sd->groups->cpu_power; - } - sg = sg->next; - if (sg != sched_group_nodes[i]) - goto next_sg; - } + init_numa_sched_groups_power(sched_group_allnodes); #endif /* Attach the domains */ -- cgit v1.2.3 From 0771dfefc9e538f077d0b43b6dec19a5a67d0e70 Mon Sep 17 00:00:00 2001 From: Ingo Molnar Date: Mon, 27 Mar 2006 01:16:22 -0800 Subject: [PATCH] lightweight robust futexes: core Add the core infrastructure for robust futexes: structure definitions, the new syscalls and the do_exit() based cleanup mechanism. 
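The do_exit() based cleanup can be modelled in userspace to see what happens to each futex word. The sketch below is an illustration under stated assumptions, not the kernel implementation: it uses C11 atomics in place of the kernel's user-memory accessors, the bit values are the ones found in current <linux/futex.h> and are assumed to apply here, and struct lock_node, mark_owner_dead(), walk_held_locks() and WALK_LIMIT are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Futex word layout; values assumed from current <linux/futex.h>. */
#define FUTEX_WAITERS    0x80000000u
#define FUTEX_OWNER_DIED 0x40000000u
#define FUTEX_TID_MASK   0x3fffffffu

#define WALK_LIMIT 2048	/* hypothetical stand-in for ROBUST_LIST_LIMIT */

/* Simplified list entry: the futex word lives inside the node. */
struct lock_node {
	struct lock_node *next;	/* circular; the last node points at the head */
	_Atomic uint32_t futex;
};

/*
 * If the word is owned by `tid`, set OWNER_DIED with a
 * compare-and-swap loop. Returns 1 when a waiter would have to be
 * woken, 0 otherwise.
 */
int mark_owner_dead(_Atomic uint32_t *word, uint32_t tid)
{
	uint32_t val = atomic_load(word);

	while ((val & FUTEX_TID_MASK) == tid) {
		/* On failure val is refreshed and ownership is re-checked. */
		if (atomic_compare_exchange_weak(word, &val,
						 val | FUTEX_OWNER_DIED))
			return (val & FUTEX_WAITERS) != 0;
	}
	return 0;
}

/*
 * Walk the circular list behind the sentinel `head`, marking every
 * lock held by `tid`. The limit bounds the walk if the list is
 * corrupt or never returns to the head. Returns the wakeup count.
 */
int walk_held_locks(struct lock_node *head, uint32_t tid)
{
	unsigned int limit = WALK_LIMIT;
	struct lock_node *entry;
	int wakeups = 0;

	for (entry = head->next; entry != head; entry = entry->next) {
		wakeups += mark_owner_dead(&entry->futex, tid);
		if (!--limit)
			break;
	}
	return wakeups;
}
```

The real code additionally has to treat every pointer as untrusted user memory and handle one pending lock that may not be on the list yet; the sketch leaves both out.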
Signed-off-by: Ingo Molnar Signed-off-by: Thomas Gleixner Signed-off-by: Arjan van de Ven Acked-by: Ulrich Drepper Cc: Michael Kerrisk Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/exit.c | 3 + kernel/futex.c | 172 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++ kernel/sys_ni.c | 4 ++ 3 files changed, 179 insertions(+) (limited to 'kernel') diff --git a/kernel/exit.c b/kernel/exit.c index 8037405e136e..aecb48ca7370 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -31,6 +31,7 @@ #include #include #include +#include #include #include @@ -852,6 +853,8 @@ fastcall NORET_TYPE void do_exit(long code) exit_itimers(tsk->signal); acct_process(code); } + if (unlikely(tsk->robust_list)) + exit_robust_list(tsk); exit_mm(tsk); exit_sem(tsk); diff --git a/kernel/futex.c b/kernel/futex.c index 5efa2f978032..feb724b2554e 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -8,6 +8,10 @@ * Removed page pinning, fix privately mapped COW pages and other cleanups * (C) Copyright 2003, 2004 Jamie Lokier * + * Robust futex support started by Ingo Molnar + * (C) Copyright 2006 Red Hat Inc, All Rights Reserved + * Thanks to Thomas Gleixner for suggestions, analysis and fixes. + * * Thanks to Ben LaHaise for yelling "hashed waitqueues" loudly * enough at me, Linus for the original (flawed) idea, Matthew * Kirkwood for proof-of-concept implementation. @@ -829,6 +833,174 @@ error: goto out; } +/* + * Support for robust futexes: the kernel cleans up held futexes at + * thread exit time. + * + * Implementation: user-space maintains a per-thread list of locks it + * is holding. Upon do_exit(), the kernel carefully walks this list, + * and marks all locks that are owned by this thread with the + * FUTEX_OWNER_DEAD bit, and wakes up a waiter (if any). The list is + * always manipulated with the lock held, so the list is private and + * per-thread. 
Userspace also maintains a per-thread 'list_op_pending' + * field, to allow the kernel to clean up if the thread dies after + * acquiring the lock, but just before it could have added itself to + * the list. There can only be one such pending lock. + */ + +/** + * sys_set_robust_list - set the robust-futex list head of a task + * @head: pointer to the list-head + * @len: length of the list-head, as userspace expects + */ +asmlinkage long +sys_set_robust_list(struct robust_list_head __user *head, + size_t len) +{ + /* + * The kernel knows only one size for now: + */ + if (unlikely(len != sizeof(*head))) + return -EINVAL; + + current->robust_list = head; + + return 0; +} + +/** + * sys_get_robust_list - get the robust-futex list head of a task + * @pid: pid of the process [zero for current task] + * @head_ptr: pointer to a list-head pointer, the kernel fills it in + * @len_ptr: pointer to a length field, the kernel fills in the header size + */ +asmlinkage long +sys_get_robust_list(int pid, struct robust_list_head __user **head_ptr, + size_t __user *len_ptr) +{ + struct robust_list_head *head; + unsigned long ret; + + if (!pid) + head = current->robust_list; + else { + struct task_struct *p; + + ret = -ESRCH; + read_lock(&tasklist_lock); + p = find_task_by_pid(pid); + if (!p) + goto err_unlock; + ret = -EPERM; + if ((current->euid != p->euid) && (current->euid != p->uid) && + !capable(CAP_SYS_PTRACE)) + goto err_unlock; + head = p->robust_list; + read_unlock(&tasklist_lock); + } + + if (put_user(sizeof(*head), len_ptr)) + return -EFAULT; + return put_user(head, head_ptr); + +err_unlock: + read_unlock(&tasklist_lock); + + return ret; +} + +/* + * Process a futex-list entry, check whether it's owned by the + * dying task, and do notification if so: + */ +int handle_futex_death(unsigned int *uaddr, struct task_struct *curr) +{ + unsigned int futex_val; + +repeat: + if (get_user(futex_val, uaddr)) + return -1; + + if ((futex_val & FUTEX_TID_MASK) == curr->pid) { + /* + * 
Ok, this dying thread is truly holding a futex + * of interest. Set the OWNER_DIED bit atomically + * via cmpxchg, and if the value had FUTEX_WAITERS + * set, wake up a waiter (if any). (We have to do a + * futex_wake() even if OWNER_DIED is already set - + * to handle the rare but possible case of recursive + * thread-death.) The rest of the cleanup is done in + * userspace. + */ + if (futex_atomic_cmpxchg_inuser(uaddr, futex_val, + futex_val | FUTEX_OWNER_DIED) != + futex_val) + goto repeat; + + if (futex_val & FUTEX_WAITERS) + futex_wake((unsigned long)uaddr, 1); + } + return 0; +} + +/* + * Walk curr->robust_list (very carefully, it's a userspace list!) + * and mark any locks found there dead, and notify any waiters. + * + * We silently return on any sign of list-walking problem. + */ +void exit_robust_list(struct task_struct *curr) +{ + struct robust_list_head __user *head = curr->robust_list; + struct robust_list __user *entry, *pending; + unsigned int limit = ROBUST_LIST_LIMIT; + unsigned long futex_offset; + + /* + * Fetch the list head (which was registered earlier, via + * sys_set_robust_list()): + */ + if (get_user(entry, &head->list.next)) + return; + /* + * Fetch the relative futex offset: + */ + if (get_user(futex_offset, &head->futex_offset)) + return; + /* + * Fetch any possibly pending lock-add first, and handle it + * if it exists: + */ + if (get_user(pending, &head->list_op_pending)) + return; + if (pending) + handle_futex_death((void *)pending + futex_offset, curr); + + while (entry != &head->list) { + /* + * A pending lock might already be on the list, so + * dont process it twice: + */ + if (entry != pending) + if (handle_futex_death((void *)entry + futex_offset, + curr)) + return; + + /* + * Fetch the next entry in the list: + */ + if (get_user(entry, &entry->next)) + return; + /* + * Avoid excessively long or circular lists: + */ + if (!--limit) + break; + + cond_resched(); + } +} + long do_futex(unsigned long uaddr, int op, int val, 
unsigned long timeout, unsigned long uaddr2, int val2, int val3) { diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index 1067090db6b1..d82864c4a617 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -42,6 +42,10 @@ cond_syscall(sys_recvmsg); cond_syscall(sys_socketcall); cond_syscall(sys_futex); cond_syscall(compat_sys_futex); +cond_syscall(sys_set_robust_list); +cond_syscall(compat_sys_set_robust_list); +cond_syscall(sys_get_robust_list); +cond_syscall(compat_sys_get_robust_list); cond_syscall(sys_epoll_create); cond_syscall(sys_epoll_ctl); cond_syscall(sys_epoll_wait); -- cgit v1.2.3 From 34f192c6527f20c47ccec239e7d51a27691b93fc Mon Sep 17 00:00:00 2001 From: Ingo Molnar Date: Mon, 27 Mar 2006 01:16:24 -0800 Subject: [PATCH] lightweight robust futexes: compat 32-bit syscall compatibility support. (This patch also moves all futex related compat functionality into kernel/futex_compat.c.) Signed-off-by: Ingo Molnar Signed-off-by: Thomas Gleixner Signed-off-by: Arjan van de Ven Acked-by: Ulrich Drepper Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/Makefile | 3 ++ kernel/compat.c | 23 -------- kernel/exit.c | 5 ++ kernel/futex_compat.c | 142 ++++++++++++++++++++++++++++++++++++++++++++++++++ 4 files changed, 150 insertions(+), 23 deletions(-) create mode 100644 kernel/futex_compat.c (limited to 'kernel') diff --git a/kernel/Makefile b/kernel/Makefile index ff1c11dc12cf..58908f9d156a 100644 --- a/kernel/Makefile +++ b/kernel/Makefile @@ -12,6 +12,9 @@ obj-y = sched.o fork.o exec_domain.o panic.o printk.o profile.o \ obj-$(CONFIG_DEBUG_MUTEXES) += mutex-debug.o obj-$(CONFIG_FUTEX) += futex.o +ifeq ($(CONFIG_COMPAT),y) +obj-$(CONFIG_FUTEX) += futex_compat.o +endif obj-$(CONFIG_GENERIC_ISA_DMA) += dma.o obj-$(CONFIG_SMP) += cpu.o spinlock.o obj-$(CONFIG_DEBUG_SPINLOCK) += spinlock.o diff --git a/kernel/compat.c b/kernel/compat.c index b9bdd1271f44..c1601a84f8d8 100644 --- a/kernel/compat.c +++ b/kernel/compat.c @@ -17,7 +17,6 @@ #include #include 
#include /* for MAX_SCHEDULE_TIMEOUT */ -#include /* for FUTEX_WAIT */ #include #include #include @@ -239,28 +238,6 @@ asmlinkage long compat_sys_sigprocmask(int how, compat_old_sigset_t __user *set, return ret; } -#ifdef CONFIG_FUTEX -asmlinkage long compat_sys_futex(u32 __user *uaddr, int op, int val, - struct compat_timespec __user *utime, u32 __user *uaddr2, - int val3) -{ - struct timespec t; - unsigned long timeout = MAX_SCHEDULE_TIMEOUT; - int val2 = 0; - - if ((op == FUTEX_WAIT) && utime) { - if (get_compat_timespec(&t, utime)) - return -EFAULT; - timeout = timespec_to_jiffies(&t) + 1; - } - if (op >= FUTEX_REQUEUE) - val2 = (int) (unsigned long) utime; - - return do_futex((unsigned long)uaddr, op, val, timeout, - (unsigned long)uaddr2, val2, val3); -} -#endif - asmlinkage long compat_sys_setrlimit(unsigned int resource, struct compat_rlimit __user *rlim) { diff --git a/kernel/exit.c b/kernel/exit.c index aecb48ca7370..a8c7efc7a681 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -32,6 +32,7 @@ #include #include #include +#include #include #include @@ -855,6 +856,10 @@ fastcall NORET_TYPE void do_exit(long code) } if (unlikely(tsk->robust_list)) exit_robust_list(tsk); +#ifdef CONFIG_COMPAT + if (unlikely(tsk->compat_robust_list)) + compat_exit_robust_list(tsk); +#endif exit_mm(tsk); exit_sem(tsk); diff --git a/kernel/futex_compat.c b/kernel/futex_compat.c new file mode 100644 index 000000000000..c153559ef289 --- /dev/null +++ b/kernel/futex_compat.c @@ -0,0 +1,142 @@ +/* + * linux/kernel/futex_compat.c + * + * Futex compatibililty routines. + * + * Copyright 2006, Red Hat, Inc., Ingo Molnar + */ + +#include +#include +#include + +#include + +/* + * Walk curr->robust_list (very carefully, it's a userspace list!) + * and mark any locks found there dead, and notify any waiters. + * + * We silently return on any sign of list-walking problem. 
+ */ +void compat_exit_robust_list(struct task_struct *curr) +{ + struct compat_robust_list_head __user *head = curr->compat_robust_list; + struct robust_list __user *entry, *pending; + compat_uptr_t uentry, upending; + unsigned int limit = ROBUST_LIST_LIMIT; + compat_long_t futex_offset; + + /* + * Fetch the list head (which was registered earlier, via + * sys_set_robust_list()): + */ + if (get_user(uentry, &head->list.next)) + return; + entry = compat_ptr(uentry); + /* + * Fetch the relative futex offset: + */ + if (get_user(futex_offset, &head->futex_offset)) + return; + /* + * Fetch any possibly pending lock-add first, and handle it + * if it exists: + */ + if (get_user(upending, &head->list_op_pending)) + return; + pending = compat_ptr(upending); + if (upending) + handle_futex_death((void *)pending + futex_offset, curr); + + while (compat_ptr(uentry) != &head->list) { + /* + * A pending lock might already be on the list, so + * dont process it twice: + */ + if (entry != pending) + if (handle_futex_death((void *)entry + futex_offset, + curr)) + return; + + /* + * Fetch the next entry in the list: + */ + if (get_user(uentry, (compat_uptr_t *)&entry->next)) + return; + entry = compat_ptr(uentry); + /* + * Avoid excessively long or circular lists: + */ + if (!--limit) + break; + + cond_resched(); + } +} + +asmlinkage long +compat_sys_set_robust_list(struct compat_robust_list_head __user *head, + compat_size_t len) +{ + if (unlikely(len != sizeof(*head))) + return -EINVAL; + + current->compat_robust_list = head; + + return 0; +} + +asmlinkage long +compat_sys_get_robust_list(int pid, compat_uptr_t *head_ptr, + compat_size_t __user *len_ptr) +{ + struct compat_robust_list_head *head; + unsigned long ret; + + if (!pid) + head = current->compat_robust_list; + else { + struct task_struct *p; + + ret = -ESRCH; + read_lock(&tasklist_lock); + p = find_task_by_pid(pid); + if (!p) + goto err_unlock; + ret = -EPERM; + if ((current->euid != p->euid) && (current->euid != 
p->uid) && + !capable(CAP_SYS_PTRACE)) + goto err_unlock; + head = p->compat_robust_list; + read_unlock(&tasklist_lock); + } + + if (put_user(sizeof(*head), len_ptr)) + return -EFAULT; + return put_user(ptr_to_compat(head), head_ptr); + +err_unlock: + read_unlock(&tasklist_lock); + + return ret; +} + +asmlinkage long compat_sys_futex(u32 __user *uaddr, int op, int val, + struct compat_timespec __user *utime, u32 __user *uaddr2, + int val3) +{ + struct timespec t; + unsigned long timeout = MAX_SCHEDULE_TIMEOUT; + int val2 = 0; + + if ((op == FUTEX_WAIT) && utime) { + if (get_compat_timespec(&t, utime)) + return -EFAULT; + timeout = timespec_to_jiffies(&t) + 1; + } + if (op >= FUTEX_REQUEUE) + val2 = (int) (unsigned long) utime; + + return do_futex((unsigned long)uaddr, op, val, timeout, + (unsigned long)uaddr2, val2, val3); +} -- cgit v1.2.3 From 8f17d3a5049d32392b79925c73a0cf99ce6d5af0 Mon Sep 17 00:00:00 2001 From: Ingo Molnar Date: Mon, 27 Mar 2006 01:16:27 -0800 Subject: [PATCH] lightweight robust futexes updates - fix: initialize the robust list(s) to NULL in copy_process. - doc update - cleanup: rename _inuser to _inatomic - __user cleanups and other small cleanups Signed-off-by: Ingo Molnar Cc: Thomas Gleixner Cc: Arjan van de Ven Cc: Ulrich Drepper Cc: Andi Kleen Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/fork.c | 5 ++++- kernel/futex.c | 20 +++++++++----------- kernel/futex_compat.c | 7 +++---- 3 files changed, 16 insertions(+), 16 deletions(-) (limited to 'kernel') diff --git a/kernel/fork.c b/kernel/fork.c index e0a2b449dea6..c49bd193b058 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1061,7 +1061,10 @@ static task_t *copy_process(unsigned long clone_flags, * Clear TID on mm_release()? */ p->clear_child_tid = (clone_flags & CLONE_CHILD_CLEARTID) ? 
child_tidptr: NULL; - + p->robust_list = NULL; +#ifdef CONFIG_COMPAT + p->compat_robust_list = NULL; +#endif /* * sigaltstack should be cleared when sharing the same VM */ diff --git a/kernel/futex.c b/kernel/futex.c index feb724b2554e..9c9b2b6b22dd 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -913,15 +913,15 @@ err_unlock: * Process a futex-list entry, check whether it's owned by the * dying task, and do notification if so: */ -int handle_futex_death(unsigned int *uaddr, struct task_struct *curr) +int handle_futex_death(u32 __user *uaddr, struct task_struct *curr) { - unsigned int futex_val; + u32 uval; -repeat: - if (get_user(futex_val, uaddr)) +retry: + if (get_user(uval, uaddr)) return -1; - if ((futex_val & FUTEX_TID_MASK) == curr->pid) { + if ((uval & FUTEX_TID_MASK) == curr->pid) { /* * Ok, this dying thread is truly holding a futex * of interest. Set the OWNER_DIED bit atomically @@ -932,12 +932,11 @@ repeat: * thread-death.) The rest of the cleanup is done in * userspace. 
*/ - if (futex_atomic_cmpxchg_inuser(uaddr, futex_val, - futex_val | FUTEX_OWNER_DIED) != - futex_val) - goto repeat; + if (futex_atomic_cmpxchg_inatomic(uaddr, uval, + uval | FUTEX_OWNER_DIED) != uval) + goto retry; - if (futex_val & FUTEX_WAITERS) + if (uval & FUTEX_WAITERS) futex_wake((unsigned long)uaddr, 1); } return 0; @@ -985,7 +984,6 @@ void exit_robust_list(struct task_struct *curr) if (handle_futex_death((void *)entry + futex_offset, curr)) return; - /* * Fetch the next entry in the list: */ diff --git a/kernel/futex_compat.c b/kernel/futex_compat.c index c153559ef289..9c077cf9aa84 100644 --- a/kernel/futex_compat.c +++ b/kernel/futex_compat.c @@ -121,9 +121,9 @@ err_unlock: return ret; } -asmlinkage long compat_sys_futex(u32 __user *uaddr, int op, int val, +asmlinkage long compat_sys_futex(u32 __user *uaddr, int op, u32 val, struct compat_timespec __user *utime, u32 __user *uaddr2, - int val3) + u32 val3) { struct timespec t; unsigned long timeout = MAX_SCHEDULE_TIMEOUT; @@ -137,6 +137,5 @@ asmlinkage long compat_sys_futex(u32 __user *uaddr, int op, int val, if (op >= FUTEX_REQUEUE) val2 = (int) (unsigned long) utime; - return do_futex((unsigned long)uaddr, op, val, timeout, - (unsigned long)uaddr2, val2, val3); + return do_futex(uaddr, op, val, timeout, uaddr2, val2, val3); } -- cgit v1.2.3 From e041c683412d5bf44dc2b109053e3b837b71742d Mon Sep 17 00:00:00 2001 From: Alan Stern Date: Mon, 27 Mar 2006 01:16:30 -0800 Subject: [PATCH] Notifier chain update: API changes The kernel's implementation of notifier chains is unsafe. There is no protection against entries being added to or removed from a chain while the chain is in use. 
The issues were discussed in this thread: http://marc.theaimsgroup.com/?l=linux-kernel&m=113018709002036&w=2 We noticed that notifier chains in the kernel fall into two basic usage classes: "Blocking" chains are always called from a process context and the callout routines are allowed to sleep; "Atomic" chains can be called from an atomic context and the callout routines are not allowed to sleep. We decided to codify this distinction and make it part of the API. Therefore this set of patches introduces three new, parallel APIs: one for blocking notifiers, one for atomic notifiers, and one for "raw" notifiers (which is really just the old API under a new name). New kinds of data structures are used for the heads of the chains, and new routines are defined for registration, unregistration, and calling a chain. The three APIs are explained in include/linux/notifier.h and their implementation is in kernel/sys.c. With atomic and blocking chains, the implementation guarantees that the chain links will not be corrupted and that chain callers will not get messed up by entries being added or removed. For raw chains the implementation provides no guarantees at all; users of this API must provide their own protections. (The idea was that situations may come up where the assumptions of the atomic and blocking APIs are not appropriate, so it should be possible for users to handle these things in their own way.) There are some limitations, which should not be too hard to live with. For atomic/blocking chains, registration and unregistration must always be done in a process context since the chain is protected by a mutex/rwsem. Also, a callout routine for a non-raw chain must not try to register or unregister entries on its own chain. (This did happen in a couple of places and the code had to be changed to avoid it.) Since atomic chains may be called from within an NMI handler, they cannot use spinlocks for synchronization. Instead we use RCU. 
The overhead falls almost entirely in the unregister routine, which is okay since unregistration is much less frequent than calling a chain.

Here is the list of chains that we adjusted and their classifications. None of them use the raw API, so for the moment it is only a placeholder.

ATOMIC CHAINS
-------------
arch/i386/kernel/traps.c: i386die_chain
arch/ia64/kernel/traps.c: ia64die_chain
arch/powerpc/kernel/traps.c: powerpc_die_chain
arch/sparc64/kernel/traps.c: sparc64die_chain
arch/x86_64/kernel/traps.c: die_chain
drivers/char/ipmi/ipmi_si_intf.c: xaction_notifier_list
kernel/panic.c: panic_notifier_list
kernel/profile.c: task_free_notifier
net/bluetooth/hci_core.c: hci_notifier
net/ipv4/netfilter/ip_conntrack_core.c: ip_conntrack_chain
net/ipv4/netfilter/ip_conntrack_core.c: ip_conntrack_expect_chain
net/ipv6/addrconf.c: inet6addr_chain
net/netfilter/nf_conntrack_core.c: nf_conntrack_chain
net/netfilter/nf_conntrack_core.c: nf_conntrack_expect_chain
net/netlink/af_netlink.c: netlink_chain

BLOCKING CHAINS
---------------
arch/powerpc/platforms/pseries/reconfig.c: pSeries_reconfig_chain
arch/s390/kernel/process.c: idle_chain
arch/x86_64/kernel/process.c: idle_notifier
drivers/base/memory.c: memory_chain
drivers/cpufreq/cpufreq.c: cpufreq_policy_notifier_list
drivers/cpufreq/cpufreq.c: cpufreq_transition_notifier_list
drivers/macintosh/adb.c: adb_client_list
drivers/macintosh/via-pmu.c: sleep_notifier_list
drivers/macintosh/via-pmu68k.c: sleep_notifier_list
drivers/macintosh/windfarm_core.c: wf_client_list
drivers/usb/core/notify.c: usb_notifier_list
drivers/video/fbmem.c: fb_notifier_list
kernel/cpu.c: cpu_chain
kernel/module.c: module_notify_list
kernel/profile.c: munmap_notifier
kernel/profile.c: task_exit_notifier
kernel/sys.c: reboot_notifier_list
net/core/dev.c: netdev_chain
net/decnet/dn_dev.c: dnaddr_chain
net/ipv4/devinet.c: inetaddr_chain

It's possible that some of these classifications are wrong. If they are, please let us know or submit a patch to fix them.
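As a rough illustration (not part of this patch), the chain semantics shared by all three APIs can be modelled in plain userspace C: entries are kept in descending priority order, and a call stops early when a callback returns a value with NOTIFY_STOP_MASK set. The sketch below is single-threaded and deliberately leaves out the rwsem, spinlock and RCU protection that the patch adds. The NOTIFY_* values mirror include/linux/notifier.h; the names model_chain_register, model_chain_unregister, model_chain_call, record_ok, record_veto and demo_chain are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Return codes, mirroring include/linux/notifier.h. */
#define NOTIFY_DONE      0x0000
#define NOTIFY_OK        0x0001
#define NOTIFY_STOP_MASK 0x8000
#define NOTIFY_BAD       (NOTIFY_STOP_MASK | 0x0002)

struct notifier_block {
	int (*notifier_call)(struct notifier_block *nb, unsigned long val, void *v);
	struct notifier_block *next;
	int priority;
};

/* Hypothetical model: insert n before the first entry of lower priority. */
static int model_chain_register(struct notifier_block **nl,
				struct notifier_block *n)
{
	while (*nl != NULL) {
		if (n->priority > (*nl)->priority)
			break;
		nl = &(*nl)->next;
	}
	n->next = *nl;
	*nl = n;	/* the patch uses rcu_assign_pointer() here */
	return 0;
}

/* Hypothetical model: unlink n, or report -ENOENT if it is not on the chain. */
static int model_chain_unregister(struct notifier_block **nl,
				  struct notifier_block *n)
{
	while (*nl != NULL) {
		if (*nl == n) {
			*nl = n->next;
			return 0;
		}
		nl = &(*nl)->next;
	}
	return -ENOENT;
}

/* Hypothetical model: call each entry in turn, stopping on NOTIFY_STOP_MASK. */
static int model_chain_call(struct notifier_block **nl,
			    unsigned long val, void *v)
{
	int ret = NOTIFY_DONE;
	struct notifier_block *nb = *nl;

	while (nb) {
		ret = nb->notifier_call(nb, val, v);
		if ((ret & NOTIFY_STOP_MASK) == NOTIFY_STOP_MASK)
			break;
		nb = nb->next;
	}
	return ret;
}

/* Log of callback priorities, in the order the callbacks ran. */
static int call_log[8];
static int call_count;

static int record_ok(struct notifier_block *nb, unsigned long val, void *v)
{
	(void)val;
	(void)v;
	call_log[call_count++] = nb->priority;
	return NOTIFY_OK;
}

static int record_veto(struct notifier_block *nb, unsigned long val, void *v)
{
	(void)val;
	(void)v;
	call_log[call_count++] = nb->priority;
	return NOTIFY_BAD;
}

/* Walks through registration order, early stop, and unregistration. */
int demo_chain(void)
{
	struct notifier_block *head = NULL;
	struct notifier_block low  = { .notifier_call = record_ok,   .priority = 0 };
	struct notifier_block high = { .notifier_call = record_ok,   .priority = 10 };
	struct notifier_block veto = { .notifier_call = record_veto, .priority = 5 };

	model_chain_register(&head, &low);
	model_chain_register(&head, &high);

	/* Higher priority runs first, regardless of registration order. */
	call_count = 0;
	assert(model_chain_call(&head, 0, NULL) == NOTIFY_OK);
	assert(call_count == 2 && call_log[0] == 10 && call_log[1] == 0);

	/* A callback returning NOTIFY_BAD halts the chain; "low" is skipped. */
	model_chain_register(&head, &veto);
	call_count = 0;
	assert(model_chain_call(&head, 0, NULL) == NOTIFY_BAD);
	assert(call_count == 2 && call_log[0] == 10 && call_log[1] == 5);

	assert(model_chain_unregister(&head, &veto) == 0);
	assert(model_chain_unregister(&head, &veto) == -ENOENT);
	return 0;
}
```

Calling demo_chain() from a small main() exercises the model. In the kernel itself, callers would go through the blocking_, atomic_ or raw_ wrappers, depending on the calling context.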
Note that any chain that gets called very frequently should be atomic, because the rwsem read-locking used for blocking chains is very likely to incur cache misses on SMP systems. (However, if the chain's callout routines may sleep then the chain cannot be atomic.) The patch set was written by Alan Stern and Chandra Seetharaman, incorporating material written by Keith Owens and suggestions from Paul McKenney and Andrew Morton. [jes@sgi.com: restructure the notifier chain initialization macros] Signed-off-by: Alan Stern Signed-off-by: Chandra Seetharaman Signed-off-by: Jes Sorensen Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/cpu.c | 29 ++--- kernel/module.c | 20 +--- kernel/panic.c | 4 +- kernel/profile.c | 53 +++------ kernel/softlockup.c | 2 +- kernel/sys.c | 327 ++++++++++++++++++++++++++++++++++++++++++---------- 6 files changed, 301 insertions(+), 134 deletions(-) (limited to 'kernel') diff --git a/kernel/cpu.c b/kernel/cpu.c index 8be22bd80933..fe2b8d0bfe4c 100644 --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -18,7 +18,7 @@ /* This protects CPUs going up and down... */ static DECLARE_MUTEX(cpucontrol); -static struct notifier_block *cpu_chain; +static BLOCKING_NOTIFIER_HEAD(cpu_chain); #ifdef CONFIG_HOTPLUG_CPU static struct task_struct *lock_cpu_hotplug_owner; @@ -71,21 +71,13 @@ EXPORT_SYMBOL_GPL(lock_cpu_hotplug_interruptible); /* Need to know about CPUs going up/down? 
*/ int register_cpu_notifier(struct notifier_block *nb) { - int ret; - - if ((ret = lock_cpu_hotplug_interruptible()) != 0) - return ret; - ret = notifier_chain_register(&cpu_chain, nb); - unlock_cpu_hotplug(); - return ret; + return blocking_notifier_chain_register(&cpu_chain, nb); } EXPORT_SYMBOL(register_cpu_notifier); void unregister_cpu_notifier(struct notifier_block *nb) { - lock_cpu_hotplug(); - notifier_chain_unregister(&cpu_chain, nb); - unlock_cpu_hotplug(); + blocking_notifier_chain_unregister(&cpu_chain, nb); } EXPORT_SYMBOL(unregister_cpu_notifier); @@ -141,7 +133,7 @@ int cpu_down(unsigned int cpu) goto out; } - err = notifier_call_chain(&cpu_chain, CPU_DOWN_PREPARE, + err = blocking_notifier_call_chain(&cpu_chain, CPU_DOWN_PREPARE, (void *)(long)cpu); if (err == NOTIFY_BAD) { printk("%s: attempt to take down CPU %u failed\n", @@ -159,7 +151,7 @@ int cpu_down(unsigned int cpu) p = __stop_machine_run(take_cpu_down, NULL, cpu); if (IS_ERR(p)) { /* CPU didn't die: tell everyone. Can't complain. */ - if (notifier_call_chain(&cpu_chain, CPU_DOWN_FAILED, + if (blocking_notifier_call_chain(&cpu_chain, CPU_DOWN_FAILED, (void *)(long)cpu) == NOTIFY_BAD) BUG(); @@ -182,8 +174,8 @@ int cpu_down(unsigned int cpu) put_cpu(); /* CPU is completely dead: tell everyone. Too late to complain. */ - if (notifier_call_chain(&cpu_chain, CPU_DEAD, (void *)(long)cpu) - == NOTIFY_BAD) + if (blocking_notifier_call_chain(&cpu_chain, CPU_DEAD, + (void *)(long)cpu) == NOTIFY_BAD) BUG(); check_for_tasks(cpu); @@ -211,7 +203,7 @@ int __devinit cpu_up(unsigned int cpu) goto out; } - ret = notifier_call_chain(&cpu_chain, CPU_UP_PREPARE, hcpu); + ret = blocking_notifier_call_chain(&cpu_chain, CPU_UP_PREPARE, hcpu); if (ret == NOTIFY_BAD) { printk("%s: attempt to bring up CPU %u failed\n", __FUNCTION__, cpu); @@ -226,11 +218,12 @@ int __devinit cpu_up(unsigned int cpu) BUG_ON(!cpu_online(cpu)); /* Now call notifier in preparation. 
*/ - notifier_call_chain(&cpu_chain, CPU_ONLINE, hcpu); + blocking_notifier_call_chain(&cpu_chain, CPU_ONLINE, hcpu); out_notify: if (ret != 0) - notifier_call_chain(&cpu_chain, CPU_UP_CANCELED, hcpu); + blocking_notifier_call_chain(&cpu_chain, + CPU_UP_CANCELED, hcpu); out: unlock_cpu_hotplug(); return ret; diff --git a/kernel/module.c b/kernel/module.c index ddfe45ac2fd1..4fafd58038a0 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -64,26 +64,17 @@ static DEFINE_SPINLOCK(modlist_lock); static DEFINE_MUTEX(module_mutex); static LIST_HEAD(modules); -static DEFINE_MUTEX(notify_mutex); -static struct notifier_block * module_notify_list; +static BLOCKING_NOTIFIER_HEAD(module_notify_list); int register_module_notifier(struct notifier_block * nb) { - int err; - mutex_lock(&notify_mutex); - err = notifier_chain_register(&module_notify_list, nb); - mutex_unlock(&notify_mutex); - return err; + return blocking_notifier_chain_register(&module_notify_list, nb); } EXPORT_SYMBOL(register_module_notifier); int unregister_module_notifier(struct notifier_block * nb) { - int err; - mutex_lock(&notify_mutex); - err = notifier_chain_unregister(&module_notify_list, nb); - mutex_unlock(&notify_mutex); - return err; + return blocking_notifier_chain_unregister(&module_notify_list, nb); } EXPORT_SYMBOL(unregister_module_notifier); @@ -1816,9 +1807,8 @@ sys_init_module(void __user *umod, /* Drop lock so they can recurse */ mutex_unlock(&module_mutex); - mutex_lock(&notify_mutex); - notifier_call_chain(&module_notify_list, MODULE_STATE_COMING, mod); - mutex_unlock(&notify_mutex); + blocking_notifier_call_chain(&module_notify_list, + MODULE_STATE_COMING, mod); /* Start the module */ if (mod->init != NULL) diff --git a/kernel/panic.c b/kernel/panic.c index acd95adddb93..f895c7c01d5b 100644 --- a/kernel/panic.c +++ b/kernel/panic.c @@ -29,7 +29,7 @@ static DEFINE_SPINLOCK(pause_on_oops_lock); int panic_timeout; EXPORT_SYMBOL(panic_timeout); -struct notifier_block *panic_notifier_list;
+ATOMIC_NOTIFIER_HEAD(panic_notifier_list); EXPORT_SYMBOL(panic_notifier_list); @@ -97,7 +97,7 @@ NORET_TYPE void panic(const char * fmt, ...) smp_send_stop(); #endif - notifier_call_chain(&panic_notifier_list, 0, buf); + atomic_notifier_call_chain(&panic_notifier_list, 0, buf); if (!panic_blink) panic_blink = no_blink; diff --git a/kernel/profile.c b/kernel/profile.c index ad81f799a9b4..5a730fdb1a2c 100644 --- a/kernel/profile.c +++ b/kernel/profile.c @@ -87,72 +87,52 @@ void __init profile_init(void) #ifdef CONFIG_PROFILING -static DECLARE_RWSEM(profile_rwsem); -static DEFINE_RWLOCK(handoff_lock); -static struct notifier_block * task_exit_notifier; -static struct notifier_block * task_free_notifier; -static struct notifier_block * munmap_notifier; +static BLOCKING_NOTIFIER_HEAD(task_exit_notifier); +static ATOMIC_NOTIFIER_HEAD(task_free_notifier); +static BLOCKING_NOTIFIER_HEAD(munmap_notifier); void profile_task_exit(struct task_struct * task) { - down_read(&profile_rwsem); - notifier_call_chain(&task_exit_notifier, 0, task); - up_read(&profile_rwsem); + blocking_notifier_call_chain(&task_exit_notifier, 0, task); } int profile_handoff_task(struct task_struct * task) { int ret; - read_lock(&handoff_lock); - ret = notifier_call_chain(&task_free_notifier, 0, task); - read_unlock(&handoff_lock); + ret = atomic_notifier_call_chain(&task_free_notifier, 0, task); return (ret == NOTIFY_OK) ? 
1 : 0; } void profile_munmap(unsigned long addr) { - down_read(&profile_rwsem); - notifier_call_chain(&munmap_notifier, 0, (void *)addr); - up_read(&profile_rwsem); + blocking_notifier_call_chain(&munmap_notifier, 0, (void *)addr); } int task_handoff_register(struct notifier_block * n) { - int err = -EINVAL; - - write_lock(&handoff_lock); - err = notifier_chain_register(&task_free_notifier, n); - write_unlock(&handoff_lock); - return err; + return atomic_notifier_chain_register(&task_free_notifier, n); } int task_handoff_unregister(struct notifier_block * n) { - int err = -EINVAL; - - write_lock(&handoff_lock); - err = notifier_chain_unregister(&task_free_notifier, n); - write_unlock(&handoff_lock); - return err; + return atomic_notifier_chain_unregister(&task_free_notifier, n); } int profile_event_register(enum profile_type type, struct notifier_block * n) { int err = -EINVAL; - down_write(&profile_rwsem); - switch (type) { case PROFILE_TASK_EXIT: - err = notifier_chain_register(&task_exit_notifier, n); + err = blocking_notifier_chain_register( + &task_exit_notifier, n); break; case PROFILE_MUNMAP: - err = notifier_chain_register(&munmap_notifier, n); + err = blocking_notifier_chain_register( + &munmap_notifier, n); break; } - up_write(&profile_rwsem); - return err; } @@ -161,18 +141,17 @@ int profile_event_unregister(enum profile_type type, struct notifier_block * n) { int err = -EINVAL; - down_write(&profile_rwsem); - switch (type) { case PROFILE_TASK_EXIT: - err = notifier_chain_unregister(&task_exit_notifier, n); + err = blocking_notifier_chain_unregister( + &task_exit_notifier, n); break; case PROFILE_MUNMAP: - err = notifier_chain_unregister(&munmap_notifier, n); + err = blocking_notifier_chain_unregister( + &munmap_notifier, n); break; } - up_write(&profile_rwsem); return err; } diff --git a/kernel/softlockup.c b/kernel/softlockup.c index d9b3d5847ed8..ced91e1ff564 100644 --- a/kernel/softlockup.c +++ b/kernel/softlockup.c @@ -152,5 +152,5 @@ __init void 
spawn_softlockup_task(void) cpu_callback(&cpu_nfb, CPU_ONLINE, cpu); register_cpu_notifier(&cpu_nfb); - notifier_chain_register(&panic_notifier_list, &panic_block); + atomic_notifier_chain_register(&panic_notifier_list, &panic_block); } diff --git a/kernel/sys.c b/kernel/sys.c index 38bc73ede2ba..c93d37f71aef 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -95,99 +95,304 @@ int cad_pid = 1; * and the like. */ -static struct notifier_block *reboot_notifier_list; -static DEFINE_RWLOCK(notifier_lock); +static BLOCKING_NOTIFIER_HEAD(reboot_notifier_list); + +/* + * Notifier chain core routines. The exported routines below + * are layered on top of these, with appropriate locking added. + */ + +static int notifier_chain_register(struct notifier_block **nl, + struct notifier_block *n) +{ + while ((*nl) != NULL) { + if (n->priority > (*nl)->priority) + break; + nl = &((*nl)->next); + } + n->next = *nl; + rcu_assign_pointer(*nl, n); + return 0; +} + +static int notifier_chain_unregister(struct notifier_block **nl, + struct notifier_block *n) +{ + while ((*nl) != NULL) { + if ((*nl) == n) { + rcu_assign_pointer(*nl, n->next); + return 0; + } + nl = &((*nl)->next); + } + return -ENOENT; +} + +static int __kprobes notifier_call_chain(struct notifier_block **nl, + unsigned long val, void *v) +{ + int ret = NOTIFY_DONE; + struct notifier_block *nb; + + nb = rcu_dereference(*nl); + while (nb) { + ret = nb->notifier_call(nb, val, v); + if ((ret & NOTIFY_STOP_MASK) == NOTIFY_STOP_MASK) + break; + nb = rcu_dereference(nb->next); + } + return ret; +} + +/* + * Atomic notifier chain routines. Registration and unregistration + * use a mutex, and call_chain is synchronized by RCU (no locks). 
+ */ /** - * notifier_chain_register - Add notifier to a notifier chain - * @list: Pointer to root list pointer + * atomic_notifier_chain_register - Add notifier to an atomic notifier chain + * @nh: Pointer to head of the atomic notifier chain * @n: New entry in notifier chain * - * Adds a notifier to a notifier chain. + * Adds a notifier to an atomic notifier chain. * * Currently always returns zero. */ + +int atomic_notifier_chain_register(struct atomic_notifier_head *nh, + struct notifier_block *n) +{ + unsigned long flags; + int ret; + + spin_lock_irqsave(&nh->lock, flags); + ret = notifier_chain_register(&nh->head, n); + spin_unlock_irqrestore(&nh->lock, flags); + return ret; +} + +EXPORT_SYMBOL_GPL(atomic_notifier_chain_register); + +/** + * atomic_notifier_chain_unregister - Remove notifier from an atomic notifier chain + * @nh: Pointer to head of the atomic notifier chain + * @n: Entry to remove from notifier chain + * + * Removes a notifier from an atomic notifier chain. + * + * Returns zero on success or %-ENOENT on failure. + */ +int atomic_notifier_chain_unregister(struct atomic_notifier_head *nh, + struct notifier_block *n) +{ + unsigned long flags; + int ret; + + spin_lock_irqsave(&nh->lock, flags); + ret = notifier_chain_unregister(&nh->head, n); + spin_unlock_irqrestore(&nh->lock, flags); + synchronize_rcu(); + return ret; +} + +EXPORT_SYMBOL_GPL(atomic_notifier_chain_unregister); + +/** + * atomic_notifier_call_chain - Call functions in an atomic notifier chain + * @nh: Pointer to head of the atomic notifier chain + * @val: Value passed unmodified to notifier function + * @v: Pointer passed unmodified to notifier function + * + * Calls each function in a notifier chain in turn. The functions + * run in an atomic context, so they must not block. + * This routine uses RCU to synchronize with changes to the chain. 
+ * + * If the return value of the notifier can be and'ed + * with %NOTIFY_STOP_MASK then atomic_notifier_call_chain + * will return immediately, with the return value of + * the notifier function which halted execution. + * Otherwise the return value is the return value + * of the last notifier function called. + */ -int notifier_chain_register(struct notifier_block **list, struct notifier_block *n) +int atomic_notifier_call_chain(struct atomic_notifier_head *nh, + unsigned long val, void *v) { - write_lock(&notifier_lock); - while(*list) - { - if(n->priority > (*list)->priority) - break; - list= &((*list)->next); - } - n->next = *list; - *list=n; - write_unlock(&notifier_lock); - return 0; + int ret; + + rcu_read_lock(); + ret = notifier_call_chain(&nh->head, val, v); + rcu_read_unlock(); + return ret; } -EXPORT_SYMBOL(notifier_chain_register); +EXPORT_SYMBOL_GPL(atomic_notifier_call_chain); + +/* + * Blocking notifier chain routines. All access to the chain is + * synchronized by an rwsem. + */ /** - * notifier_chain_unregister - Remove notifier from a notifier chain - * @nl: Pointer to root list pointer + * blocking_notifier_chain_register - Add notifier to a blocking notifier chain + * @nh: Pointer to head of the blocking notifier chain * @n: New entry in notifier chain * - * Removes a notifier from a notifier chain. + * Adds a notifier to a blocking notifier chain. + * Must be called in process context. * - * Returns zero on success, or %-ENOENT on failure. + * Currently always returns zero.
*/ -int notifier_chain_unregister(struct notifier_block **nl, struct notifier_block *n) +int blocking_notifier_chain_register(struct blocking_notifier_head *nh, + struct notifier_block *n) { - write_lock(&notifier_lock); - while((*nl)!=NULL) - { - if((*nl)==n) - { - *nl=n->next; - write_unlock(&notifier_lock); - return 0; - } - nl=&((*nl)->next); - } - write_unlock(&notifier_lock); - return -ENOENT; + int ret; + + /* + * This code gets used during boot-up, when task switching is + * not yet working and interrupts must remain disabled. At + * such times we must not call down_write(). + */ + if (unlikely(system_state == SYSTEM_BOOTING)) + return notifier_chain_register(&nh->head, n); + + down_write(&nh->rwsem); + ret = notifier_chain_register(&nh->head, n); + up_write(&nh->rwsem); + return ret; } -EXPORT_SYMBOL(notifier_chain_unregister); +EXPORT_SYMBOL_GPL(blocking_notifier_chain_register); /** - * notifier_call_chain - Call functions in a notifier chain - * @n: Pointer to root pointer of notifier chain + * blocking_notifier_chain_unregister - Remove notifier from a blocking notifier chain + * @nh: Pointer to head of the blocking notifier chain + * @n: Entry to remove from notifier chain + * + * Removes a notifier from a blocking notifier chain. + * Must be called from process context. + * + * Returns zero on success or %-ENOENT on failure. + */ +int blocking_notifier_chain_unregister(struct blocking_notifier_head *nh, + struct notifier_block *n) +{ + int ret; + + /* + * This code gets used during boot-up, when task switching is + * not yet working and interrupts must remain disabled. At + * such times we must not call down_write().
+ */ + if (unlikely(system_state == SYSTEM_BOOTING)) + return notifier_chain_unregister(&nh->head, n); + + down_write(&nh->rwsem); + ret = notifier_chain_unregister(&nh->head, n); + up_write(&nh->rwsem); + return ret; +} + +EXPORT_SYMBOL_GPL(blocking_notifier_chain_unregister); + +/** + * blocking_notifier_call_chain - Call functions in a blocking notifier chain + * @nh: Pointer to head of the blocking notifier chain * @val: Value passed unmodified to notifier function * @v: Pointer passed unmodified to notifier function * - * Calls each function in a notifier chain in turn. + * Calls each function in a notifier chain in turn. The functions + * run in a process context, so they are allowed to block. * - * If the return value of the notifier can be and'd - * with %NOTIFY_STOP_MASK, then notifier_call_chain + * If the return value of the notifier can be and'ed + * with %NOTIFY_STOP_MASK then blocking_notifier_call_chain * will return immediately, with the return value of * the notifier function which halted execution. - * Otherwise, the return value is the return value + * Otherwise the return value is the return value * of the last notifier function called. */ -int __kprobes notifier_call_chain(struct notifier_block **n, unsigned long val, void *v) +int blocking_notifier_call_chain(struct blocking_notifier_head *nh, + unsigned long val, void *v) { - int ret=NOTIFY_DONE; - struct notifier_block *nb = *n; + int ret; - while(nb) - { - ret=nb->notifier_call(nb,val,v); - if(ret&NOTIFY_STOP_MASK) - { - return ret; - } - nb=nb->next; - } + down_read(&nh->rwsem); + ret = notifier_call_chain(&nh->head, val, v); + up_read(&nh->rwsem); return ret; } -EXPORT_SYMBOL(notifier_call_chain); +EXPORT_SYMBOL_GPL(blocking_notifier_call_chain); + +/* + * Raw notifier chain routines. There is no protection; + * the caller must provide it. Use at your own risk! 
+ */ + +/** + * raw_notifier_chain_register - Add notifier to a raw notifier chain + * @nh: Pointer to head of the raw notifier chain + * @n: New entry in notifier chain + * + * Adds a notifier to a raw notifier chain. + * All locking must be provided by the caller. + * + * Currently always returns zero. + */ + +int raw_notifier_chain_register(struct raw_notifier_head *nh, + struct notifier_block *n) +{ + return notifier_chain_register(&nh->head, n); +} + +EXPORT_SYMBOL_GPL(raw_notifier_chain_register); + +/** + * raw_notifier_chain_unregister - Remove notifier from a raw notifier chain + * @nh: Pointer to head of the raw notifier chain + * @n: Entry to remove from notifier chain + * + * Removes a notifier from a raw notifier chain. + * All locking must be provided by the caller. + * + * Returns zero on success or %-ENOENT on failure. + */ +int raw_notifier_chain_unregister(struct raw_notifier_head *nh, + struct notifier_block *n) +{ + return notifier_chain_unregister(&nh->head, n); +} + +EXPORT_SYMBOL_GPL(raw_notifier_chain_unregister); + +/** + * raw_notifier_call_chain - Call functions in a raw notifier chain + * @nh: Pointer to head of the raw notifier chain + * @val: Value passed unmodified to notifier function + * @v: Pointer passed unmodified to notifier function + * + * Calls each function in a notifier chain in turn. The functions + * run in an undefined context. + * All locking must be provided by the caller. + * + * If the return value of the notifier can be and'ed + * with %NOTIFY_STOP_MASK then raw_notifier_call_chain + * will return immediately, with the return value of + * the notifier function which halted execution. + * Otherwise the return value is the return value + * of the last notifier function called. 
+ */ + +int raw_notifier_call_chain(struct raw_notifier_head *nh, + unsigned long val, void *v) +{ + return notifier_call_chain(&nh->head, val, v); +} + +EXPORT_SYMBOL_GPL(raw_notifier_call_chain); /** * register_reboot_notifier - Register function to be called at reboot time @@ -196,13 +401,13 @@ EXPORT_SYMBOL(notifier_call_chain); * Registers a function with the list of functions * to be called at reboot time. * - * Currently always returns zero, as notifier_chain_register + * Currently always returns zero, as blocking_notifier_chain_register * always returns zero. */ int register_reboot_notifier(struct notifier_block * nb) { - return notifier_chain_register(&reboot_notifier_list, nb); + return blocking_notifier_chain_register(&reboot_notifier_list, nb); } EXPORT_SYMBOL(register_reboot_notifier); @@ -219,7 +424,7 @@ EXPORT_SYMBOL(register_reboot_notifier); int unregister_reboot_notifier(struct notifier_block * nb) { - return notifier_chain_unregister(&reboot_notifier_list, nb); + return blocking_notifier_chain_unregister(&reboot_notifier_list, nb); } EXPORT_SYMBOL(unregister_reboot_notifier); @@ -380,7 +585,7 @@ EXPORT_SYMBOL_GPL(emergency_restart); void kernel_restart_prepare(char *cmd) { - notifier_call_chain(&reboot_notifier_list, SYS_RESTART, cmd); + blocking_notifier_call_chain(&reboot_notifier_list, SYS_RESTART, cmd); system_state = SYSTEM_RESTART; device_shutdown(); } @@ -430,7 +635,7 @@ EXPORT_SYMBOL_GPL(kernel_kexec); void kernel_shutdown_prepare(enum system_states state) { - notifier_call_chain(&reboot_notifier_list, + blocking_notifier_call_chain(&reboot_notifier_list, (state == SYSTEM_HALT)?SYS_HALT:SYS_POWER_OFF, NULL); system_state = state; device_shutdown(); -- cgit v1.2.3 From f83ca9fe3ee390755f18b4a7780c25ce593b484a Mon Sep 17 00:00:00 2001 From: Andrew Morton Date: Tue, 28 Mar 2006 01:56:20 -0800 Subject: [PATCH] symversion warning fix gcc-4.2: kernel/module.c: In function '__find_symbol': kernel/module.c:158: warning: the address of 
'__start___kcrctab', will always evaluate as 'true' kernel/module.c:165: warning: the address of '__start___kcrctab_gpl', will always evaluate as 'true' kernel/module.c:182: warning: the address of '__start___kcrctab_gpl_future', will always evaluate as 'true' Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/module.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/module.c b/kernel/module.c index 4fafd58038a0..bd088a7c1499 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -127,7 +127,7 @@ extern const unsigned long __start___kcrctab_gpl_future[]; #ifndef CONFIG_MODVERSIONS #define symversion(base, idx) NULL #else -#define symversion(base, idx) ((base) ? ((base) + (idx)) : NULL) +#define symversion(base, idx) ((base != NULL) ? ((base) + (idx)) : NULL) #endif /* lookup symbol in given range of kernel_symbols */ -- cgit v1.2.3 From b9e20a920092eb3840424f85c78852c0433df00d Mon Sep 17 00:00:00 2001 From: Eric Sesterhenn Date: Tue, 28 Mar 2006 01:56:24 -0800 Subject: [PATCH] Change dash2underscore() return value to char Since dash2underscore() just operates on and returns chars, I guess it's safe to change the return value to a char. With my .config, this reduces its size by 5 bytes.
   text    data     bss     dec     hex filename
   4155     152       0    4307    10d3 params.o.orig
   4150     152       0    4302    10ce params.o
Signed-off-by: Eric Sesterhenn Signed-off-by: Alexey Dobriyan Signed-off-by: Adrian Bunk Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/params.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/params.c b/kernel/params.c index 9de637a5c8bc..af43ecdc8d9b 100644 --- a/kernel/params.c +++ b/kernel/params.c @@ -31,7 +31,7 @@ #define DEBUGP(fmt, a...)
#endif -static inline int dash2underscore(char c) +static inline char dash2underscore(char c) { if (c == '-') return '_'; -- cgit v1.2.3 From 0a945022778f100115d0cb6234eb28fc1b15ccaf Mon Sep 17 00:00:00 2001 From: KAMEZAWA Hiroyuki Date: Tue, 28 Mar 2006 01:56:37 -0800 Subject: [PATCH] for_each_possible_cpu: fixes for generic part replaces for_each_cpu with for_each_possible_cpu(). Signed-off-by: KAMEZAWA Hiroyuki Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/rcutorture.c | 4 ++-- kernel/sched.c | 8 ++++---- 2 files changed, 6 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/rcutorture.c b/kernel/rcutorture.c index b4b362b5baf5..8154e7589d12 100644 --- a/kernel/rcutorture.c +++ b/kernel/rcutorture.c @@ -301,7 +301,7 @@ rcu_torture_printk(char *page) long pipesummary[RCU_TORTURE_PIPE_LEN + 1] = { 0 }; long batchsummary[RCU_TORTURE_PIPE_LEN + 1] = { 0 }; - for_each_cpu(cpu) { + for_each_possible_cpu(cpu) { for (i = 0; i < RCU_TORTURE_PIPE_LEN + 1; i++) { pipesummary[i] += per_cpu(rcu_torture_count, cpu)[i]; batchsummary[i] += per_cpu(rcu_torture_batch, cpu)[i]; @@ -535,7 +535,7 @@ rcu_torture_init(void) atomic_set(&n_rcu_torture_error, 0); for (i = 0; i < RCU_TORTURE_PIPE_LEN + 1; i++) atomic_set(&rcu_torture_wcount[i], 0); - for_each_cpu(cpu) { + for_each_possible_cpu(cpu) { for (i = 0; i < RCU_TORTURE_PIPE_LEN + 1; i++) { per_cpu(rcu_torture_count, cpu)[i] = 0; per_cpu(rcu_torture_batch, cpu)[i] = 0; diff --git a/kernel/sched.c b/kernel/sched.c index 7854ee516b92..a9ecac398bb9 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -1625,7 +1625,7 @@ unsigned long nr_uninterruptible(void) { unsigned long i, sum = 0; - for_each_cpu(i) + for_each_possible_cpu(i) sum += cpu_rq(i)->nr_uninterruptible; /* @@ -1642,7 +1642,7 @@ unsigned long long nr_context_switches(void) { unsigned long long i, sum = 0; - for_each_cpu(i) + for_each_possible_cpu(i) sum += cpu_rq(i)->nr_switches; return sum; @@ -1652,7 +1652,7 @@ unsigned long 
nr_iowait(void) { unsigned long i, sum = 0; - for_each_cpu(i) + for_each_possible_cpu(i) sum += atomic_read(&cpu_rq(i)->nr_iowait); return sum; @@ -6080,7 +6080,7 @@ void __init sched_init(void) runqueue_t *rq; int i, j, k; - for_each_cpu(i) { + for_each_possible_cpu(i) { prio_array_t *array; rq = cpu_rq(i); -- cgit v1.2.3 From ec7e15d6486e9d1da1c2f344b670b8388ba7019b Mon Sep 17 00:00:00 2001 From: Andrew Morton Date: Tue, 28 Mar 2006 01:56:55 -0800 Subject: [PATCH] compat_sys_futex() warning fix kernel/futex_compat.c: In function `compat_sys_futex': kernel/futex_compat.c:140: warning: passing arg 1 of `do_futex' makes integer from pointer without a cast kernel/futex_compat.c:140: warning: passing arg 5 of `do_futex' makes integer from pointer without a cast Not sure what Ingo was thinking of here. Put the casts back in. Cc: Ingo Molnar Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/futex_compat.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/futex_compat.c b/kernel/futex_compat.c index 9c077cf9aa84..54274fc85321 100644 --- a/kernel/futex_compat.c +++ b/kernel/futex_compat.c @@ -137,5 +137,6 @@ asmlinkage long compat_sys_futex(u32 __user *uaddr, int op, u32 val, if (op >= FUTEX_REQUEUE) val2 = (int) (unsigned long) utime; - return do_futex(uaddr, op, val, timeout, uaddr2, val2, val3); + return do_futex((unsigned long)uaddr, op, val, timeout, + (unsigned long)uaddr2, val2, val3); } -- cgit v1.2.3 From fef23e7fbb11a0a78cd61935f7056bc2b237995a Mon Sep 17 00:00:00 2001 From: "Eric W. Biederman" Date: Tue, 28 Mar 2006 16:10:58 -0800 Subject: [PATCH] exec: allow init to exec from any thread. After looking at the problem of init calling exec some more I figured out an easy way to make the code work. The actual symptom without this patch is that all threads will die except pid == 1, and the thread calling exec. The thread calling exec will wait forever for pid == 1 to die.
Since pid == 1 does not install a handler for SIGKILL it will never die. This modifies the tests for init from current->pid == 1 to the equivalent current == child_reaper. And then it causes exec in the ugly case to modify child_reaper. The only weird symptom is that you wind up with an init process that doesn't have the oldest start time on the box. Signed-off-by: Eric W. Biederman Cc: Oleg Nesterov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/exit.c | 2 +- kernel/signal.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/exit.c b/kernel/exit.c index a8c7efc7a681..223a8802b665 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -807,7 +807,7 @@ fastcall NORET_TYPE void do_exit(long code) panic("Aiee, killing interrupt handler!"); if (unlikely(!tsk->pid)) panic("Attempted to kill the idle task!"); - if (unlikely(tsk->pid == 1)) + if (unlikely(tsk == child_reaper)) panic("Attempted to kill init!"); if (unlikely(current->ptrace & PT_TRACE_EXIT)) { diff --git a/kernel/signal.c b/kernel/signal.c index 75f7341b0c39..dc8f91bf9f89 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -1990,7 +1990,7 @@ relock: continue; /* Init gets no signals it doesn't want. */ - if (current->pid == 1) + if (current == child_reaper) continue; if (sig_kernel_stop(signr)) { -- cgit v1.2.3 From d73d65293e3e2de7e916a89c8da30be0948afab7 Mon Sep 17 00:00:00 2001 From: "Eric W. Biederman" Date: Tue, 28 Mar 2006 16:11:03 -0800 Subject: [PATCH] pidhash: kill switch_exec_pids switch_exec_pids is only called from de_thread by way of exec, and it is only called when we are exec'ing from a non thread group leader. Currently switch_exec_pids gives the leader the pid of the thread and unhashes and rehashes all of the process groups. The leader is already in the EXIT_DEAD state so no one cares about its pids. The only concern for the leader is that __unhash_process called from release_task will function correctly.
If we don't touch the leader at all we know that __unhash_process will work fine so there is no need to touch the leader. For the task becoming the thread group leader, we just need to give it the pid of the old thread group leader, add it to the task list, and attach it to the session and the process group of the thread group. Currently de_thread is also adding the task to the task list which is just silly. Currently the only caller of __detach_pid besides detach_pid is switch_exec_pids because of the ugly extra work that was being performed. So this patch removes switch_exec_pids because it is doing too much, it is creating an unnecessary special case in pid.c, doing work duplicated in de_thread, and generally obscuring what is going on. The necessary work is added to de_thread, and it seems to be a little clearer there what is going on. Signed-off-by: Eric W. Biederman Cc: Oleg Nesterov Cc: Kirill Korotaev Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/pid.c | 30 ------------------------------ 1 file changed, 30 deletions(-) (limited to 'kernel') diff --git a/kernel/pid.c b/kernel/pid.c index 1acc07246991..7781d9999058 100644 --- a/kernel/pid.c +++ b/kernel/pid.c @@ -217,36 +217,6 @@ task_t *find_task_by_pid_type(int type, int nr) EXPORT_SYMBOL(find_task_by_pid_type); -/* - * This function switches the PIDs if a non-leader thread calls - * sys_execve() - this must be done without releasing the PID. - * (which a detach_pid() would eventually do.)
- */ -void switch_exec_pids(task_t *leader, task_t *thread) -{ - __detach_pid(leader, PIDTYPE_PID); - __detach_pid(leader, PIDTYPE_TGID); - __detach_pid(leader, PIDTYPE_PGID); - __detach_pid(leader, PIDTYPE_SID); - - __detach_pid(thread, PIDTYPE_PID); - __detach_pid(thread, PIDTYPE_TGID); - - leader->pid = leader->tgid = thread->pid; - thread->pid = thread->tgid; - - attach_pid(thread, PIDTYPE_PID, thread->pid); - attach_pid(thread, PIDTYPE_TGID, thread->tgid); - attach_pid(thread, PIDTYPE_PGID, thread->signal->pgrp); - attach_pid(thread, PIDTYPE_SID, thread->signal->session); - list_add_tail(&thread->tasks, &init_task.tasks); - - attach_pid(leader, PIDTYPE_PID, leader->pid); - attach_pid(leader, PIDTYPE_TGID, leader->tgid); - attach_pid(leader, PIDTYPE_PGID, leader->signal->pgrp); - attach_pid(leader, PIDTYPE_SID, leader->signal->session); -} - /* * The pid hash table is scaled according to the amount of memory in the * machine. From a minimum of 16 slots up to 4096 slots at one gigabyte or -- cgit v1.2.3 From d799f03597cabc6112acb518fc8ab4487aa4f953 Mon Sep 17 00:00:00 2001 From: Oleg Nesterov Date: Tue, 28 Mar 2006 16:11:04 -0800 Subject: [PATCH] choose_new_parent: remove unused arg, sanitize exit_state check 'child_reaper' arg is not used in choose_new_parent(). "->exit_state >= EXIT_ZOMBIE" check is a leftover, was valid when EXIT_ZOMBIE lived in ->state var. 
Signed-off-by: Oleg Nesterov Acked-by: Eric Biederman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/exit.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/exit.c b/kernel/exit.c index 223a8802b665..e04a59405e99 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -538,13 +538,13 @@ static void exit_mm(struct task_struct * tsk) mmput(mm); } -static inline void choose_new_parent(task_t *p, task_t *reaper, task_t *child_reaper) +static inline void choose_new_parent(task_t *p, task_t *reaper) { /* * Make sure we're not reparenting to ourselves and that * the parent is not a zombie. */ - BUG_ON(p == reaper || reaper->exit_state >= EXIT_ZOMBIE); + BUG_ON(p == reaper || reaper->exit_state); p->real_parent = reaper; } @@ -645,7 +645,7 @@ static void forget_original_parent(struct task_struct * father, if (father == p->real_parent) { /* reparent with a reaper, real father it's us */ - choose_new_parent(p, reaper, child_reaper); + choose_new_parent(p, reaper); reparent_thread(p, father, 0); } else { /* reparent ptraced task to its real parent */ @@ -666,7 +666,7 @@ static void forget_original_parent(struct task_struct * father, } list_for_each_safe(_p, _n, &father->ptrace_children) { p = list_entry(_p,struct task_struct,ptrace_list); - choose_new_parent(p, reaper, child_reaper); + choose_new_parent(p, reaper); reparent_thread(p, father, 1); } } -- cgit v1.2.3 From 8fafabd86f1b75ed3cc6a6ffbe6c3e53e3d8457d Mon Sep 17 00:00:00 2001 From: Oleg Nesterov Date: Tue, 28 Mar 2006 16:11:05 -0800 Subject: [PATCH] remove add_parent()'s parent argument add_parent(p, parent) is always called with parent == p->parent, and it makes no sense to do it differently. This patch removes this argument. No changes in affected .o files. Signed-off-by: Oleg Nesterov Cc: "Eric W. 
Biederman" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/exit.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/exit.c b/kernel/exit.c index e04a59405e99..df26c33037d2 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -1281,7 +1281,7 @@ bail_ref: /* move to end of parent's list to avoid starvation */ remove_parent(p); - add_parent(p, p->parent); + add_parent(p); write_unlock_irq(&tasklist_lock); -- cgit v1.2.3 From 9b678ece42893b53aae5ed7cb8d7cb261cacb72c Mon Sep 17 00:00:00 2001 From: Oleg Nesterov Date: Tue, 28 Mar 2006 16:11:05 -0800 Subject: [PATCH] don't use REMOVE_LINKS/SET_LINKS for reparenting There are places where the kernel uses REMOVE_LINKS/SET_LINKS while changing a process's ->parent. Use add_parent/remove_parent instead; they don't abuse the global process list. Signed-off-by: Oleg Nesterov Cc: "Eric W. Biederman" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/exit.c | 4 ++-- kernel/ptrace.c | 8 ++++---- 2 files changed, 6 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/exit.c b/kernel/exit.c index df26c33037d2..5b5e8b67680e 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -238,10 +238,10 @@ static void reparent_to_init(void) ptrace_unlink(current); /* Reparent to init */ - REMOVE_LINKS(current); + remove_parent(current); current->parent = child_reaper; current->real_parent = child_reaper; - SET_LINKS(current); + add_parent(current); /* Set the exit signal to SIGCHLD so we signal init on exit */ current->exit_signal = SIGCHLD; diff --git a/kernel/ptrace.c b/kernel/ptrace.c index d95a72c9279d..86a7f6c60cb2 100644 --- a/kernel/ptrace.c +++ b/kernel/ptrace.c @@ -35,9 +35,9 @@ void __ptrace_link(task_t *child, task_t *new_parent) if (child->parent == new_parent) return; list_add(&child->ptrace_list, &child->parent->ptrace_children); - REMOVE_LINKS(child); + remove_parent(child); child->parent = new_parent; - SET_LINKS(child); +
add_parent(child); } /* @@ -77,9 +77,9 @@ void __ptrace_unlink(task_t *child) child->ptrace = 0; if (!list_empty(&child->ptrace_list)) { list_del_init(&child->ptrace_list); - REMOVE_LINKS(child); + remove_parent(child); child->parent = child->real_parent; - SET_LINKS(child); + add_parent(child); } ptrace_untrace(child); -- cgit v1.2.3 From c97d98931ac52ef110b62d9b75c6a6f2bfbc1898 Mon Sep 17 00:00:00 2001 From: Oleg Nesterov Date: Tue, 28 Mar 2006 16:11:06 -0800 Subject: [PATCH] kill SET_LINKS/REMOVE_LINKS Both SET_LINKS() and REMOVE_LINKS() have exactly one caller each, and these callers already check thread_group_leader(). This patch kills these macros, which mix two different things: setting a process's parent and registering it in the init_task.tasks list. Callers are updated to do these actions by hand. Signed-off-by: Oleg Nesterov Cc: "Eric W. Biederman" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/exit.c | 4 +++- kernel/fork.c | 4 +++- 2 files changed, 6 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/exit.c b/kernel/exit.c index 5b5e8b67680e..f436a6bd3fb7 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -54,11 +54,13 @@ static void __unhash_process(struct task_struct *p) if (thread_group_leader(p)) { detach_pid(p, PIDTYPE_PGID); detach_pid(p, PIDTYPE_SID); + + list_del_init(&p->tasks); if (p->pid) __get_cpu_var(process_counts)--; } - REMOVE_LINKS(p); + remove_parent(p); } void release_task(struct task_struct * p) diff --git a/kernel/fork.c b/kernel/fork.c index c49bd193b058..74c67629ee62 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1181,7 +1181,7 @@ static task_t *copy_process(unsigned long clone_flags, */ p->ioprio = current->ioprio; - SET_LINKS(p); + add_parent(p); if (unlikely(p->ptrace & PT_PTRACED)) __ptrace_link(p, current->parent); @@ -1191,6 +1191,8 @@ static task_t *copy_process(unsigned long clone_flags, p->signal->session = current->signal->session; attach_pid(p, PIDTYPE_PGID,
process_group(p)); attach_pid(p, PIDTYPE_SID, p->signal->session); + + list_add_tail(&p->tasks, &init_task.tasks); if (p->pid) __get_cpu_var(process_counts)++; } -- cgit v1.2.3 From 73b9ebfe126a4a886ee46cbab637374d7024668a Mon Sep 17 00:00:00 2001 From: Oleg Nesterov Date: Tue, 28 Mar 2006 16:11:07 -0800 Subject: [PATCH] pidhash: don't count idle threads fork_idle() does unhash_process() just after copy_process(). In contrast, boot_cpu's idle thread explicitly registers itself for each pid_type with nr = 0. copy_process() already checks p->pid != 0 before process_counts++, I think we can just skip attach_pid() calls and job control inits for idle threads and kill unhash_process(). We don't need to clean up ->proc_dentry in fork_idle() because with this patch idle threads are never hashed in kernel/pid.c:pid_hash[]. We don't need to hash pid == 0 in pidmap_init(). free_pidmap() is never called with pid == 0 arg, so it will never be reused. So it is still possible to use pid == 0 in any PIDTYPE_xxx namespace from kernel/pid.c's POV. However with this patch we don't hash pid == 0 for PIDTYPE_PID case. We still have PIDTYPE_PGID/PIDTYPE_SID entries with pid == 0: /sbin/init and kernel threads which don't call daemonize(). Signed-off-by: Oleg Nesterov Cc: "Eric W.
Biederman" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/exit.c | 18 +----------------- kernel/fork.c | 35 ++++++++++++++++++----------------- kernel/pid.c | 10 +--------- 3 files changed, 20 insertions(+), 43 deletions(-) (limited to 'kernel') diff --git a/kernel/exit.c b/kernel/exit.c index f436a6bd3fb7..a94e1c31131b 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -56,8 +56,7 @@ static void __unhash_process(struct task_struct *p) detach_pid(p, PIDTYPE_SID); list_del_init(&p->tasks); - if (p->pid) - __get_cpu_var(process_counts)--; + __get_cpu_var(process_counts)--; } remove_parent(p); @@ -118,21 +117,6 @@ repeat: goto repeat; } -/* we are using it only for SMP init */ - -void unhash_process(struct task_struct *p) -{ - struct dentry *proc_dentry; - - spin_lock(&p->proc_lock); - proc_dentry = proc_pid_unhash(p); - write_lock_irq(&tasklist_lock); - __unhash_process(p); - write_unlock_irq(&tasklist_lock); - spin_unlock(&p->proc_lock); - proc_pid_flush(proc_dentry); -} - /* * This checks not only the pgrp, but falls back on the pid if no * satisfactory pgrp is found. 
I dunno - gdb doesn't work correctly diff --git a/kernel/fork.c b/kernel/fork.c index 74c67629ee62..0c32e28cdc5f 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1181,25 +1181,26 @@ static task_t *copy_process(unsigned long clone_flags, */ p->ioprio = current->ioprio; - add_parent(p); - if (unlikely(p->ptrace & PT_PTRACED)) - __ptrace_link(p, current->parent); - - if (thread_group_leader(p)) { - p->signal->tty = current->signal->tty; - p->signal->pgrp = process_group(current); - p->signal->session = current->signal->session; - attach_pid(p, PIDTYPE_PGID, process_group(p)); - attach_pid(p, PIDTYPE_SID, p->signal->session); - - list_add_tail(&p->tasks, &init_task.tasks); - if (p->pid) + if (likely(p->pid)) { + add_parent(p); + if (unlikely(p->ptrace & PT_PTRACED)) + __ptrace_link(p, current->parent); + + if (thread_group_leader(p)) { + p->signal->tty = current->signal->tty; + p->signal->pgrp = process_group(current); + p->signal->session = current->signal->session; + attach_pid(p, PIDTYPE_PGID, process_group(p)); + attach_pid(p, PIDTYPE_SID, p->signal->session); + + list_add_tail(&p->tasks, &init_task.tasks); __get_cpu_var(process_counts)++; + } + attach_pid(p, PIDTYPE_TGID, p->tgid); + attach_pid(p, PIDTYPE_PID, p->pid); + nr_threads++; } - attach_pid(p, PIDTYPE_TGID, p->tgid); - attach_pid(p, PIDTYPE_PID, p->pid); - nr_threads++; total_forks++; spin_unlock(&current->sighand->siglock); write_unlock_irq(&tasklist_lock); @@ -1263,7 +1264,7 @@ task_t * __devinit fork_idle(int cpu) if (!task) return ERR_PTR(-ENOMEM); init_idle(task, cpu); - unhash_process(task); + return task; } diff --git a/kernel/pid.c b/kernel/pid.c index 7781d9999058..a9f2dfd006d2 100644 --- a/kernel/pid.c +++ b/kernel/pid.c @@ -247,16 +247,8 @@ void __init pidhash_init(void) void __init pidmap_init(void) { - int i; - pidmap_array->page = (void *)get_zeroed_page(GFP_KERNEL); + /* Reserve PID 0.
We never call free_pidmap(0) */ set_bit(0, pidmap_array->page); atomic_dec(&pidmap_array->nr_free); - - /* - * Allocate PID 0, and hash it via all PID types: - */ - - for (i = 0; i < PIDTYPE_MAX; i++) - attach_pid(current, i, 0); } -- cgit v1.2.3 From 6ac781b11ade6e3451f6b460991c8b2b87e58280 Mon Sep 17 00:00:00 2001 From: Oleg Nesterov Date: Tue, 28 Mar 2006 16:11:09 -0800 Subject: [PATCH] reparent_thread: use remove_parent/add_parent Use remove_parent/add_parent instead of open coding. No changes in kernel/exit.o Signed-off-by: Oleg Nesterov Cc: "Eric W. Biederman" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/exit.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/exit.c b/kernel/exit.c index a94e1c31131b..98eec590ecbd 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -555,9 +555,9 @@ static void reparent_thread(task_t *p, task_t *father, int traced) * anyway, so let go of it. */ p->ptrace = 0; - list_del_init(&p->sibling); + remove_parent(p); p->parent = p->real_parent; - list_add_tail(&p->sibling, &p->parent->children); + add_parent(p); /* If we'd notified the old parent about this child's death, * also notify the new parent. -- cgit v1.2.3 From 8292d633add73d40eda1d26089e2fc758944ac7c Mon Sep 17 00:00:00 2001 From: Oleg Nesterov Date: Tue, 28 Mar 2006 16:11:10 -0800 Subject: [PATCH] wait_for_helper: trivial style cleanup Use NULL instead of (... 
*)0 Signed-off-by: Oleg Nesterov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/kmod.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/kmod.c b/kernel/kmod.c index 51a892063aaa..20a997c73c3d 100644 --- a/kernel/kmod.c +++ b/kernel/kmod.c @@ -170,7 +170,7 @@ static int wait_for_helper(void *data) sa.sa.sa_handler = SIG_IGN; sa.sa.sa_flags = 0; siginitset(&sa.sa.sa_mask, sigmask(SIGCHLD)); - do_sigaction(SIGCHLD, &sa, (struct k_sigaction *)0); + do_sigaction(SIGCHLD, &sa, NULL); allow_signal(SIGCHLD); pid = kernel_thread(____call_usermodehelper, sub_info, SIGCHLD); -- cgit v1.2.3 From 1f09f9749cdde4e69f95d62d96d2e03f50b3353c Mon Sep 17 00:00:00 2001 From: Oleg Nesterov Date: Tue, 28 Mar 2006 16:11:11 -0800 Subject: [PATCH] release_task: replace open-coded ptrace_unlink() Use ptrace_unlink() instead of open-coding. No changes in kernel/exit.o Signed-off-by: Oleg Nesterov Cc: "Eric W. Biederman" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/exit.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/exit.c b/kernel/exit.c index 98eec590ecbd..77c35efad88c 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -68,13 +68,12 @@ void release_task(struct task_struct * p) task_t *leader; struct dentry *proc_dentry; -repeat: +repeat: atomic_dec(&p->user->processes); spin_lock(&p->proc_lock); proc_dentry = proc_pid_unhash(p); write_lock_irq(&tasklist_lock); - if (unlikely(p->ptrace)) - __ptrace_unlink(p); + ptrace_unlink(p); BUG_ON(!list_empty(&p->ptrace_list) || !list_empty(&p->ptrace_children)); __exit_signal(p); /* -- cgit v1.2.3 From aa1757f90bea3f598b6e5d04d922a6a60200f1da Mon Sep 17 00:00:00 2001 From: Oleg Nesterov Date: Tue, 28 Mar 2006 16:11:12 -0800 Subject: [PATCH] convert sighand_cache to use SLAB_DESTROY_BY_RCU This patch borrows Hugh's clever 'struct anon_vma' trick.
Without tasklist_lock held we can't trust task->sighand until we have locked it and re-checked that it is still the same. But this means we don't need to defer 'kmem_cache_free(sighand)'. We can return the memory to slab immediately; all we need is to be sure that sighand->siglock can't disappear inside an rcu protected section. To do so we need to initialize ->siglock inside the ctor function, SLAB_DESTROY_BY_RCU does the rest. Signed-off-by: Oleg Nesterov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/fork.c | 21 +++++++++++---------- kernel/signal.c | 2 +- 2 files changed, 12 insertions(+), 11 deletions(-) (limited to 'kernel') diff --git a/kernel/fork.c b/kernel/fork.c index 0c32e28cdc5f..33ffb5bf0dbc 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -786,14 +786,6 @@ int unshare_files(void) EXPORT_SYMBOL(unshare_files); -void sighand_free_cb(struct rcu_head *rhp) -{ - struct sighand_struct *sp; - - sp = container_of(rhp, struct sighand_struct, rcu); - kmem_cache_free(sighand_cachep, sp); -} - static inline int copy_sighand(unsigned long clone_flags, struct task_struct * tsk) { struct sighand_struct *sig; @@ -806,7 +798,6 @@ static inline int copy_sighand(unsigned long clone_flags, struct task_struct * t rcu_assign_pointer(tsk->sighand, sig); if (!sig) return -ENOMEM; - spin_lock_init(&sig->siglock); atomic_set(&sig->count, 1); memcpy(sig->action, current->sighand->action, sizeof(sig->action)); return 0; @@ -1356,11 +1347,21 @@ long do_fork(unsigned long clone_flags, #define ARCH_MIN_MMSTRUCT_ALIGN 0 #endif +static void sighand_ctor(void *data, kmem_cache_t *cachep, unsigned long flags) +{ + struct sighand_struct *sighand = data; + + if ((flags & (SLAB_CTOR_VERIFY | SLAB_CTOR_CONSTRUCTOR)) == + SLAB_CTOR_CONSTRUCTOR) + spin_lock_init(&sighand->siglock); +} + void __init proc_caches_init(void) { sighand_cachep = kmem_cache_create("sighand_cache", sizeof(struct sighand_struct), 0, - SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL, NULL); +
SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_DESTROY_BY_RCU, + sighand_ctor, NULL); signal_cachep = kmem_cache_create("signal_cache", sizeof(struct signal_struct), 0, SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL, NULL); diff --git a/kernel/signal.c b/kernel/signal.c index dc8f91bf9f89..b0b1ca9daa33 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -330,7 +330,7 @@ void __exit_sighand(struct task_struct *tsk) /* Ok, we're done with the signal handlers */ tsk->sighand = NULL; if (atomic_dec_and_test(&sighand->count)) - sighand_free(sighand); + kmem_cache_free(sighand_cachep, sighand); } void exit_sighand(struct task_struct *tsk) -- cgit v1.2.3 From f63ee72e0fb82e504a0489490babc7612c7cd6c2 Mon Sep 17 00:00:00 2001 From: Oleg Nesterov Date: Tue, 28 Mar 2006 16:11:13 -0800 Subject: [PATCH] introduce lock_task_sighand() helper Add lock_task_sighand() helper and convert group_send_sig_info() to use it. Hopefully we will have more users soon. This patch also removes '!sighand->count' and '!p->usage' checks, I think they both are bogus, racy and unneeded (but probably it makes sense to restore them as BUG_ON()s). ->sighand is cleared and its ->count is decremented in release_task() with sighand->siglock held, so it is a bug to have '!p->usage || !->count' after we already locked and verified it is the same. On the other hand, an already dead task without ->sighand can have a non-zero ->usage due to ptrace, for example. If we read the stale value of ->sighand we must see the change after spin_lock(), because that change was done while holding that same old ->sighand.siglock. Signed-off-by: Oleg Nesterov Cc: "Eric W.
Biederman" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/signal.c | 38 ++++++++++++++++++++++++-------------- 1 file changed, 24 insertions(+), 14 deletions(-) (limited to 'kernel') diff --git a/kernel/signal.c b/kernel/signal.c index b0b1ca9daa33..819fa49aa70a 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -1120,27 +1120,37 @@ void zap_other_threads(struct task_struct *p) /* * Must be called under rcu_read_lock() or with tasklist_lock read-held. */ +struct sighand_struct *lock_task_sighand(struct task_struct *tsk, unsigned long *flags) +{ + struct sighand_struct *sighand; + + for (;;) { + sighand = rcu_dereference(tsk->sighand); + if (unlikely(sighand == NULL)) + break; + + spin_lock_irqsave(&sighand->siglock, *flags); + if (likely(sighand == tsk->sighand)) + break; + spin_unlock_irqrestore(&sighand->siglock, *flags); + } + + return sighand; +} + int group_send_sig_info(int sig, struct siginfo *info, struct task_struct *p) { unsigned long flags; - struct sighand_struct *sp; int ret; -retry: ret = check_kill_permission(sig, info, p); - if (!ret && sig && (sp = rcu_dereference(p->sighand))) { - spin_lock_irqsave(&sp->siglock, flags); - if (p->sighand != sp) { - spin_unlock_irqrestore(&sp->siglock, flags); - goto retry; - } - if ((atomic_read(&sp->count) == 0) || - (atomic_read(&p->usage) == 0)) { - spin_unlock_irqrestore(&sp->siglock, flags); - return -ESRCH; + + if (!ret && sig) { + ret = -ESRCH; + if (lock_task_sighand(p, &flags)) { + ret = __group_send_sig_info(sig, info, p); + unlock_task_sighand(p, &flags); } - ret = __group_send_sig_info(sig, info, p); - spin_unlock_irqrestore(&sp->siglock, flags); } return ret; -- cgit v1.2.3 From a9e88e84b5245da0a1dadb6ccca70ae84e93ccf6 Mon Sep 17 00:00:00 2001 From: Oleg Nesterov Date: Tue, 28 Mar 2006 16:11:14 -0800 Subject: [PATCH] introduce sig_needs_tasklist() helper In my opinion this patch cleans up the code. Signed-off-by: Oleg Nesterov Cc: "Eric W. 
Biederman" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/signal.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/signal.c b/kernel/signal.c index 819fa49aa70a..c5b65aa4c2bc 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -147,6 +147,9 @@ static kmem_cache_t *sigqueue_cachep; #define sig_kernel_stop(sig) \ (((sig) < SIGRTMIN) && T(sig, SIG_KERNEL_STOP_MASK)) +#define sig_needs_tasklist(sig) \ + (((sig) < SIGRTMIN) && T(sig, SIG_KERNEL_STOP_MASK | M(SIGCONT))) + #define sig_user_defined(t, signr) \ (((t)->sighand->action[(signr)-1].sa.sa_handler != SIG_DFL) && \ ((t)->sighand->action[(signr)-1].sa.sa_handler != SIG_IGN)) @@ -1199,7 +1202,7 @@ kill_proc_info(int sig, struct siginfo *info, pid_t pid) struct task_struct *p; rcu_read_lock(); - if (unlikely(sig_kernel_stop(sig) || sig == SIGCONT)) { + if (unlikely(sig_needs_tasklist(sig))) { read_lock(&tasklist_lock); acquired_tasklist_lock = 1; } -- cgit v1.2.3 From 7001510d0cbf51ad202dd2d0744f54104285cbb9 Mon Sep 17 00:00:00 2001 From: Oleg Nesterov Date: Tue, 28 Mar 2006 16:11:14 -0800 Subject: [PATCH] copy_process: cleanup bad_fork_cleanup_sighand The only caller of exit_sighand(tsk) is copy_process's error path. We can call __exit_sighand() directly and kill exit_sighand(). This 'tsk' was not yet registered in pid_hash[] or init_task.tasks, it has no external references, nobody can see it, and IF (clone_flags & CLONE_SIGHAND) At least 'current' has a reference to ->sighand, this means atomic_dec_and_test(sighand->count) can't be true. ELSE Nobody can see this ->sighand, this means we can free it without any locking. Signed-off-by: Oleg Nesterov Cc: "Eric W. Biederman" Acked-by: "Paul E. 
McKenney" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/fork.c | 3 ++- kernel/signal.c | 14 -------------- 2 files changed, 2 insertions(+), 15 deletions(-) (limited to 'kernel') diff --git a/kernel/fork.c b/kernel/fork.c index 33ffb5bf0dbc..8a46ad52be8f 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1208,7 +1208,8 @@ bad_fork_cleanup_mm: bad_fork_cleanup_signal: exit_signal(p); bad_fork_cleanup_sighand: - exit_sighand(p); + if (p->sighand) + __exit_sighand(p); bad_fork_cleanup_fs: exit_fs(p); /* blocking */ bad_fork_cleanup_files: diff --git a/kernel/signal.c b/kernel/signal.c index c5b65aa4c2bc..1d7f4463c32d 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -336,20 +336,6 @@ void __exit_sighand(struct task_struct *tsk) kmem_cache_free(sighand_cachep, sighand); } -void exit_sighand(struct task_struct *tsk) -{ - write_lock_irq(&tasklist_lock); - rcu_read_lock(); - if (tsk->sighand != NULL) { - struct sighand_struct *sighand = rcu_dereference(tsk->sighand); - spin_lock(&sighand->siglock); - __exit_sighand(tsk); - spin_unlock(&sighand->siglock); - } - rcu_read_unlock(); - write_unlock_irq(&tasklist_lock); -} - /* * This function expects the tasklist_lock write-locked. */ -- cgit v1.2.3 From 6b3934ef52712ece50605dfc72e55d00c580831a Mon Sep 17 00:00:00 2001 From: Oleg Nesterov Date: Tue, 28 Mar 2006 16:11:16 -0800 Subject: [PATCH] copy_process: cleanup bad_fork_cleanup_signal __exit_signal() does important cleanups atomically under ->siglock. It is also called from copy_process's error path. This is not good, for example we can't move __unhash_process() under ->siglock for that reason. We should not mix these 2 paths, just look at ugly 'if (p->sighand)' under 'bad_fork_cleanup_sighand:' label. For copy_process() case it is sufficient to just backout copy_signal(), nothing more. Again, nobody can see this task yet. For CLONE_THREAD case we just decrement signal->count, otherwise nobody can see this ->signal and we can free it lockless. 
This patch assumes it is safe to do exit_thread_group_keys() without tasklist_lock. Signed-off-by: Oleg Nesterov Cc: "Eric W. Biederman" Acked-by: David Howells Signed-off-by: Adrian Bunk Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/fork.c | 23 +++++++++++++++++++---- kernel/signal.c | 15 +-------------- 2 files changed, 20 insertions(+), 18 deletions(-) (limited to 'kernel') diff --git a/kernel/fork.c b/kernel/fork.c index 8a46ad52be8f..0aff28cdbadd 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -84,7 +84,7 @@ static kmem_cache_t *task_struct_cachep; #endif /* SLAB cache for signal_struct structures (tsk->signal) */ -kmem_cache_t *signal_cachep; +static kmem_cache_t *signal_cachep; /* SLAB cache for sighand_struct structures (tsk->sighand) */ kmem_cache_t *sighand_cachep; @@ -872,6 +872,22 @@ static inline int copy_signal(unsigned long clone_flags, struct task_struct * ts return 0; } +void __cleanup_signal(struct signal_struct *sig) +{ + exit_thread_group_keys(sig); + kmem_cache_free(signal_cachep, sig); +} + +static inline void cleanup_signal(struct task_struct *tsk) +{ + struct signal_struct *sig = tsk->signal; + + atomic_dec(&sig->live); + + if (atomic_dec_and_test(&sig->count)) + __cleanup_signal(sig); +} + static inline void copy_flags(unsigned long clone_flags, struct task_struct *p) { unsigned long new_flags = p->flags; @@ -1206,10 +1222,9 @@ bad_fork_cleanup_mm: if (p->mm) mmput(p->mm); bad_fork_cleanup_signal: - exit_signal(p); + cleanup_signal(p); bad_fork_cleanup_sighand: - if (p->sighand) - __exit_sighand(p); + __exit_sighand(p); bad_fork_cleanup_fs: exit_fs(p); /* blocking */ bad_fork_cleanup_files: diff --git a/kernel/signal.c b/kernel/signal.c index 1d7f4463c32d..54e9ef673e68 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -395,23 +395,10 @@ void __exit_signal(struct task_struct *tsk) clear_tsk_thread_flag(tsk,TIF_SIGPENDING); flush_sigqueue(&tsk->pending); if (sig) { - /* - * We are cleaning up the signal_struct 
here. - */ - exit_thread_group_keys(sig); - kmem_cache_free(signal_cachep, sig); + __cleanup_signal(sig); } } -void exit_signal(struct task_struct *tsk) -{ - atomic_dec(&tsk->signal->live); - - write_lock_irq(&tasklist_lock); - __exit_signal(tsk); - write_unlock_irq(&tasklist_lock); -} - /* * Flush all handlers for a task. */ -- cgit v1.2.3 From 29ff471234d53c7235db287bc52f91884c2977c6 Mon Sep 17 00:00:00 2001 From: Oleg Nesterov Date: Tue, 28 Mar 2006 16:11:17 -0800 Subject: [PATCH] cleanup __exit_signal() This patch factors out duplicated code under 'if' branches. Also, BUG_ON() conversions and whitespace cleanups. Signed-off-by: Oleg Nesterov Cc: "Eric W. Biederman" Acked-by: "Paul E. McKenney" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/signal.c | 31 +++++++++++++++---------------- 1 file changed, 15 insertions(+), 16 deletions(-) (limited to 'kernel') diff --git a/kernel/signal.c b/kernel/signal.c index 54e9ef673e68..ca1fa854e469 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -341,24 +341,20 @@ void __exit_sighand(struct task_struct *tsk) */ void __exit_signal(struct task_struct *tsk) { - struct signal_struct * sig = tsk->signal; - struct sighand_struct * sighand; + struct signal_struct *sig = tsk->signal; + struct sighand_struct *sighand; + + BUG_ON(!sig); + BUG_ON(!atomic_read(&sig->count)); - if (!sig) - BUG(); - if (!atomic_read(&sig->count)) - BUG(); rcu_read_lock(); sighand = rcu_dereference(tsk->sighand); spin_lock(&sighand->siglock); + posix_cpu_timers_exit(tsk); - if (atomic_dec_and_test(&sig->count)) { + if (atomic_dec_and_test(&sig->count)) posix_cpu_timers_exit_group(tsk); - tsk->signal = NULL; - __exit_sighand(tsk); - spin_unlock(&sighand->siglock); - flush_sigqueue(&sig->shared_pending); - } else { + else { /* * If there is any task waiting for the group exit * then notify it: @@ -369,7 +365,6 @@ void __exit_signal(struct task_struct *tsk) } if (tsk == sig->curr_target) sig->curr_target = next_thread(tsk); - 
tsk->signal = NULL; /* * Accumulate here the counters for all threads but the * group leader as they die, so they can be added into @@ -387,14 +382,18 @@ void __exit_signal(struct task_struct *tsk) sig->nvcsw += tsk->nvcsw; sig->nivcsw += tsk->nivcsw; sig->sched_time += tsk->sched_time; - __exit_sighand(tsk); - spin_unlock(&sighand->siglock); - sig = NULL; /* Marker for below. */ + sig = NULL; /* Marker for below. */ } + + tsk->signal = NULL; + __exit_sighand(tsk); + spin_unlock(&sighand->siglock); rcu_read_unlock(); + clear_tsk_thread_flag(tsk,TIF_SIGPENDING); flush_sigqueue(&tsk->pending); if (sig) { + flush_sigqueue(&sig->shared_pending); __cleanup_signal(sig); } } -- cgit v1.2.3 From c81addc9d3a0ebff2155e0cd86f90820ab97147e Mon Sep 17 00:00:00 2001 From: Oleg Nesterov Date: Tue, 28 Mar 2006 16:11:17 -0800 Subject: [PATCH] rename __exit_sighand to cleanup_sighand Cosmetic, rename __exit_sighand to cleanup_sighand and move it close to copy_sighand(). This matches copy_signal/cleanup_signal naming, and I think it is easier to follow. Signed-off-by: Oleg Nesterov Cc: "Eric W. Biederman" Acked-by: "Paul E. 
McKenney" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/fork.c | 12 +++++++++++- kernel/signal.c | 19 ++----------------- 2 files changed, 13 insertions(+), 18 deletions(-) (limited to 'kernel') diff --git a/kernel/fork.c b/kernel/fork.c index 0aff28cdbadd..12cdd9fc9d02 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -803,6 +803,16 @@ static inline int copy_sighand(unsigned long clone_flags, struct task_struct * t return 0; } +void cleanup_sighand(struct task_struct *tsk) +{ + struct sighand_struct * sighand = tsk->sighand; + + /* Ok, we're done with the signal handlers */ + tsk->sighand = NULL; + if (atomic_dec_and_test(&sighand->count)) + kmem_cache_free(sighand_cachep, sighand); +} + static inline int copy_signal(unsigned long clone_flags, struct task_struct * tsk) { struct signal_struct *sig; @@ -1224,7 +1234,7 @@ bad_fork_cleanup_mm: bad_fork_cleanup_signal: cleanup_signal(p); bad_fork_cleanup_sighand: - __exit_sighand(p); + cleanup_sighand(p); bad_fork_cleanup_fs: exit_fs(p); /* blocking */ bad_fork_cleanup_files: diff --git a/kernel/signal.c b/kernel/signal.c index ca1fa854e469..b29c868bd5ee 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -310,9 +310,7 @@ static void flush_sigqueue(struct sigpending *queue) /* * Flush all pending signals for a task. */ - -void -flush_signals(struct task_struct *t) +void flush_signals(struct task_struct *t) { unsigned long flags; @@ -323,19 +321,6 @@ flush_signals(struct task_struct *t) spin_unlock_irqrestore(&t->sighand->siglock, flags); } -/* - * This function expects the tasklist_lock write-locked. - */ -void __exit_sighand(struct task_struct *tsk) -{ - struct sighand_struct * sighand = tsk->sighand; - - /* Ok, we're done with the signal handlers */ - tsk->sighand = NULL; - if (atomic_dec_and_test(&sighand->count)) - kmem_cache_free(sighand_cachep, sighand); -} - /* * This function expects the tasklist_lock write-locked. 
*/ @@ -386,7 +371,7 @@ void __exit_signal(struct task_struct *tsk) } tsk->signal = NULL; - __exit_sighand(tsk); + cleanup_sighand(tsk); spin_unlock(&sighand->siglock); rcu_read_unlock(); -- cgit v1.2.3 From 6a14c5c9da0b4c34b5be783403c54f0396fcfe77 Mon Sep 17 00:00:00 2001 From: Oleg Nesterov Date: Tue, 28 Mar 2006 16:11:18 -0800 Subject: [PATCH] move __exit_signal() to kernel/exit.c __exit_signal() is private to release_task() now. I think it is better to make it static in kernel/exit.c and export flush_sigqueue() instead - this function is much more simple and straightforward. Signed-off-by: Oleg Nesterov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/exit.c | 63 +++++++++++++++++++++++++++++++++++++++++++++++++++++++ kernel/signal.c | 65 +-------------------------------------------------------- 2 files changed, 64 insertions(+), 64 deletions(-) (limited to 'kernel') diff --git a/kernel/exit.c b/kernel/exit.c index 77c35efad88c..3823ec89d7b8 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -29,6 +29,7 @@ #include #include #include +#include #include #include #include @@ -62,6 +63,68 @@ static void __unhash_process(struct task_struct *p) remove_parent(p); } +/* + * This function expects the tasklist_lock write-locked. 
+ */ +static void __exit_signal(struct task_struct *tsk) +{ + struct signal_struct *sig = tsk->signal; + struct sighand_struct *sighand; + + BUG_ON(!sig); + BUG_ON(!atomic_read(&sig->count)); + + rcu_read_lock(); + sighand = rcu_dereference(tsk->sighand); + spin_lock(&sighand->siglock); + + posix_cpu_timers_exit(tsk); + if (atomic_dec_and_test(&sig->count)) + posix_cpu_timers_exit_group(tsk); + else { + /* + * If there is any task waiting for the group exit + * then notify it: + */ + if (sig->group_exit_task && atomic_read(&sig->count) == sig->notify_count) { + wake_up_process(sig->group_exit_task); + sig->group_exit_task = NULL; + } + if (tsk == sig->curr_target) + sig->curr_target = next_thread(tsk); + /* + * Accumulate here the counters for all threads but the + * group leader as they die, so they can be added into + * the process-wide totals when those are taken. + * The group leader stays around as a zombie as long + * as there are other threads. When it gets reaped, + * the exit.c code will add its counts into these totals. + * We won't ever get here for the group leader, since it + * will have been the last reference on the signal_struct. + */ + sig->utime = cputime_add(sig->utime, tsk->utime); + sig->stime = cputime_add(sig->stime, tsk->stime); + sig->min_flt += tsk->min_flt; + sig->maj_flt += tsk->maj_flt; + sig->nvcsw += tsk->nvcsw; + sig->nivcsw += tsk->nivcsw; + sig->sched_time += tsk->sched_time; + sig = NULL; /* Marker for below. 
*/ + } + + tsk->signal = NULL; + cleanup_sighand(tsk); + spin_unlock(&sighand->siglock); + rcu_read_unlock(); + + clear_tsk_thread_flag(tsk,TIF_SIGPENDING); + flush_sigqueue(&tsk->pending); + if (sig) { + flush_sigqueue(&sig->shared_pending); + __cleanup_signal(sig); + } +} + void release_task(struct task_struct * p) { int zap_leader; diff --git a/kernel/signal.c b/kernel/signal.c index b29c868bd5ee..6ea49f742a2f 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -22,7 +22,6 @@ #include #include #include -#include #include #include #include @@ -295,7 +294,7 @@ static void __sigqueue_free(struct sigqueue *q) kmem_cache_free(sigqueue_cachep, q); } -static void flush_sigqueue(struct sigpending *queue) +void flush_sigqueue(struct sigpending *queue) { struct sigqueue *q; @@ -321,68 +320,6 @@ void flush_signals(struct task_struct *t) spin_unlock_irqrestore(&t->sighand->siglock, flags); } -/* - * This function expects the tasklist_lock write-locked. - */ -void __exit_signal(struct task_struct *tsk) -{ - struct signal_struct *sig = tsk->signal; - struct sighand_struct *sighand; - - BUG_ON(!sig); - BUG_ON(!atomic_read(&sig->count)); - - rcu_read_lock(); - sighand = rcu_dereference(tsk->sighand); - spin_lock(&sighand->siglock); - - posix_cpu_timers_exit(tsk); - if (atomic_dec_and_test(&sig->count)) - posix_cpu_timers_exit_group(tsk); - else { - /* - * If there is any task waiting for the group exit - * then notify it: - */ - if (sig->group_exit_task && atomic_read(&sig->count) == sig->notify_count) { - wake_up_process(sig->group_exit_task); - sig->group_exit_task = NULL; - } - if (tsk == sig->curr_target) - sig->curr_target = next_thread(tsk); - /* - * Accumulate here the counters for all threads but the - * group leader as they die, so they can be added into - * the process-wide totals when those are taken. - * The group leader stays around as a zombie as long - * as there are other threads. When it gets reaped, - * the exit.c code will add its counts into these totals. 
- * We won't ever get here for the group leader, since it - * will have been the last reference on the signal_struct. - */ - sig->utime = cputime_add(sig->utime, tsk->utime); - sig->stime = cputime_add(sig->stime, tsk->stime); - sig->min_flt += tsk->min_flt; - sig->maj_flt += tsk->maj_flt; - sig->nvcsw += tsk->nvcsw; - sig->nivcsw += tsk->nivcsw; - sig->sched_time += tsk->sched_time; - sig = NULL; /* Marker for below. */ - } - - tsk->signal = NULL; - cleanup_sighand(tsk); - spin_unlock(&sighand->siglock); - rcu_read_unlock(); - - clear_tsk_thread_flag(tsk,TIF_SIGPENDING); - flush_sigqueue(&tsk->pending); - if (sig) { - flush_sigqueue(&sig->shared_pending); - __cleanup_signal(sig); - } -} - /* * Flush all handlers for a task. */ -- cgit v1.2.3 From 35f5cad8c4bab94ecc5acdc4055df5ea12dc76f8 Mon Sep 17 00:00:00 2001 From: Oleg Nesterov Date: Tue, 28 Mar 2006 16:11:19 -0800 Subject: [PATCH] revert "Optimize sys_times for a single thread process" This patch reverts 'CONFIG_SMP && thread_group_empty()' optimization in sys_times(). The reason is that the next patch breaks memory ordering which is needed for that optimization. tasklist_lock in sys_times() will be eliminated completely by further patch. Signed-off-by: Oleg Nesterov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/exit.c | 6 +---- kernel/sys.c | 86 ++++++++++++++++++----------------------------------------- 2 files changed, 27 insertions(+), 65 deletions(-) (limited to 'kernel') diff --git a/kernel/exit.c b/kernel/exit.c index 3823ec89d7b8..6b2e4cf3e140 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -139,11 +139,7 @@ repeat: ptrace_unlink(p); BUG_ON(!list_empty(&p->ptrace_list) || !list_empty(&p->ptrace_children)); __exit_signal(p); - /* - * Note that the fastpath in sys_times depends on __exit_signal having - * updated the counters before a task is removed from the tasklist of - * the process by __unhash_process. 
- */ + __unhash_process(p); /* diff --git a/kernel/sys.c b/kernel/sys.c index c93d37f71aef..84371fdc660b 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -1202,69 +1202,35 @@ asmlinkage long sys_times(struct tms __user * tbuf) */ if (tbuf) { struct tms tmp; + struct task_struct *tsk = current; + struct task_struct *t; cputime_t utime, stime, cutime, cstime; -#ifdef CONFIG_SMP - if (thread_group_empty(current)) { - /* - * Single thread case without the use of any locks. - * - * We may race with release_task if two threads are - * executing. However, release task first adds up the - * counters (__exit_signal) before removing the task - * from the process tasklist (__unhash_process). - * __exit_signal also acquires and releases the - * siglock which results in the proper memory ordering - * so that the list modifications are always visible - * after the counters have been updated. - * - * If the counters have been updated by the second thread - * but the thread has not yet been removed from the list - * then the other branch will be executing which will - * block on tasklist_lock until the exit handling of the - * other task is finished. - * - * This also implies that the sighand->siglock cannot - * be held by another processor. So we can also - * skip acquiring that lock. 
- */ - utime = cputime_add(current->signal->utime, current->utime); - stime = cputime_add(current->signal->utime, current->stime); - cutime = current->signal->cutime; - cstime = current->signal->cstime; - } else -#endif - { - - /* Process with multiple threads */ - struct task_struct *tsk = current; - struct task_struct *t; - - read_lock(&tasklist_lock); - utime = tsk->signal->utime; - stime = tsk->signal->stime; - t = tsk; - do { - utime = cputime_add(utime, t->utime); - stime = cputime_add(stime, t->stime); - t = next_thread(t); - } while (t != tsk); + read_lock(&tasklist_lock); + utime = tsk->signal->utime; + stime = tsk->signal->stime; + t = tsk; + do { + utime = cputime_add(utime, t->utime); + stime = cputime_add(stime, t->stime); + t = next_thread(t); + } while (t != tsk); + + /* + * While we have tasklist_lock read-locked, no dying thread + * can be updating current->signal->[us]time. Instead, + * we got their counts included in the live thread loop. + * However, another thread can come in right now and + * do a wait call that updates current->signal->c[us]time. + * To make sure we always see that pair updated atomically, + * we take the siglock around fetching them. + */ + spin_lock_irq(&tsk->sighand->siglock); + cutime = tsk->signal->cutime; + cstime = tsk->signal->cstime; + spin_unlock_irq(&tsk->sighand->siglock); + read_unlock(&tasklist_lock); - /* - * While we have tasklist_lock read-locked, no dying thread - * can be updating current->signal->[us]time. Instead, - * we got their counts included in the live thread loop. - * However, another thread can come in right now and - * do a wait call that updates current->signal->c[us]time. - * To make sure we always see that pair updated atomically, - * we take the siglock around fetching them. 
- */ - spin_lock_irq(&tsk->sighand->siglock); - cutime = tsk->signal->cutime; - cstime = tsk->signal->cstime; - spin_unlock_irq(&tsk->sighand->siglock); - read_unlock(&tasklist_lock); - } tmp.tms_utime = cputime_to_clock_t(utime); tmp.tms_stime = cputime_to_clock_t(stime); tmp.tms_cutime = cputime_to_clock_t(cutime); -- cgit v1.2.3 From 5876700cd399112ecfa70df36203c8c6660d84f8 Mon Sep 17 00:00:00 2001 From: Oleg Nesterov Date: Tue, 28 Mar 2006 16:11:20 -0800 Subject: [PATCH] do __unhash_process() under ->siglock This patch moves __unhash_process() call from release_task() to __exit_signal(), so __detach_pid() is called with ->siglock held. This means we don't need tasklist_lock to iterate over thread group anymore: copy_process() was already changed to do attach_pid() under ->siglock. Eric's "pidhash-kill-switch_exec_pids.patch" from -mm changed de_thread() so it doesn't touch PIDTYPE_TGID. NOTE: de_thread() still needs some attention. It still changes task->pid lockless. Taking ->sighand.siglock here allows to do more tasklist_lock removals. Signed-off-by: Oleg Nesterov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/exit.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/exit.c b/kernel/exit.c index 6b2e4cf3e140..44d6c6e3896d 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -112,6 +112,8 @@ static void __exit_signal(struct task_struct *tsk) sig = NULL; /* Marker for below.
*/ } + __unhash_process(tsk); + tsk->signal = NULL; cleanup_sighand(tsk); spin_unlock(&sighand->siglock); @@ -140,8 +142,6 @@ repeat: BUG_ON(!list_empty(&p->ptrace_list) || !list_empty(&p->ptrace_children)); __exit_signal(p); - __unhash_process(p); - /* * If we are the last non-leader member of the thread * group, and the leader is zombie, then notify the -- cgit v1.2.3 From 7d7185c818925ba5fe90efa75840d0b415032774 Mon Sep 17 00:00:00 2001 From: Oleg Nesterov Date: Tue, 28 Mar 2006 16:11:21 -0800 Subject: [PATCH] sys_times: don't take tasklist_lock sys_times: don't take tasklist_lock Signed-off-by: Oleg Nesterov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/sys.c | 13 +------------ 1 file changed, 1 insertion(+), 12 deletions(-) (limited to 'kernel') diff --git a/kernel/sys.c b/kernel/sys.c index 84371fdc660b..7ef7f6054c28 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -1206,7 +1206,7 @@ asmlinkage long sys_times(struct tms __user * tbuf) struct task_struct *t; cputime_t utime, stime, cutime, cstime; - read_lock(&tasklist_lock); + spin_lock_irq(&tsk->sighand->siglock); utime = tsk->signal->utime; stime = tsk->signal->stime; t = tsk; @@ -1216,20 +1216,9 @@ asmlinkage long sys_times(struct tms __user * tbuf) t = next_thread(t); } while (t != tsk); - /* - * While we have tasklist_lock read-locked, no dying thread - * can be updating current->signal->[us]time. Instead, - * we got their counts included in the live thread loop. - * However, another thread can come in right now and - * do a wait call that updates current->signal->c[us]time. - * To make sure we always see that pair updated atomically, - * we take the siglock around fetching them. 
- */ - spin_lock_irq(&tsk->sighand->siglock); cutime = tsk->signal->cutime; cstime = tsk->signal->cstime; spin_unlock_irq(&tsk->sighand->siglock); - read_unlock(&tasklist_lock); tmp.tms_utime = cputime_to_clock_t(utime); tmp.tms_stime = cputime_to_clock_t(stime); -- cgit v1.2.3 From 6108ccd3e2f3012d5eec582e0af4d75e693824da Mon Sep 17 00:00:00 2001 From: Oleg Nesterov Date: Tue, 28 Mar 2006 16:11:22 -0800 Subject: [PATCH] relax sig_needs_tasklist() handle_stop_signal() does not need tasklist_lock for SIG_KERNEL_STOP_MASK signals anymore. Signed-off-by: Oleg Nesterov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/signal.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/signal.c b/kernel/signal.c index 6ea49f742a2f..e99ec2f891a0 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -146,8 +146,7 @@ static kmem_cache_t *sigqueue_cachep; #define sig_kernel_stop(sig) \ (((sig) < SIGRTMIN) && T(sig, SIG_KERNEL_STOP_MASK)) -#define sig_needs_tasklist(sig) \ - (((sig) < SIGRTMIN) && T(sig, SIG_KERNEL_STOP_MASK | M(SIGCONT))) +#define sig_needs_tasklist(sig) ((sig) == SIGCONT) #define sig_user_defined(t, signr) \ (((t)->sighand->action[(signr)-1].sa.sa_handler != SIG_DFL) && \ -- cgit v1.2.3 From a122b341b74c08020f6521b615acca6a692aac79 Mon Sep 17 00:00:00 2001 From: Oleg Nesterov Date: Tue, 28 Mar 2006 16:11:22 -0800 Subject: [PATCH] do_signal_stop: don't take tasklist_lock do_signal_stop() does not need tasklist_lock anymore. So it does not need to do misc re-checks, and we can simplify the code. 
Signed-off-by: Oleg Nesterov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/signal.c | 69 ++++++++++++++------------------------------------------- 1 file changed, 17 insertions(+), 52 deletions(-) (limited to 'kernel') diff --git a/kernel/signal.c b/kernel/signal.c index e99ec2f891a0..e3bdec914626 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -1682,8 +1682,7 @@ out: * Returns nonzero if we've actually stopped and released the siglock. * Returns zero if we didn't stop and still hold the siglock. */ -static int -do_signal_stop(int signr) +static int do_signal_stop(int signr) { struct signal_struct *sig = current->signal; struct sighand_struct *sighand = current->sighand; @@ -1703,7 +1702,6 @@ do_signal_stop(int signr) set_current_state(TASK_STOPPED); if (stop_count == 0) sig->flags = SIGNAL_STOP_STOPPED; - spin_unlock_irq(&sighand->siglock); } else if (thread_group_empty(current)) { /* @@ -1712,71 +1710,38 @@ do_signal_stop(int signr) current->exit_code = current->signal->group_exit_code = signr; set_current_state(TASK_STOPPED); sig->flags = SIGNAL_STOP_STOPPED; - spin_unlock_irq(&sighand->siglock); } else { /* + * (sig->group_stop_count == 0) * There is no group stop already in progress. - * We must initiate one now, but that requires - * dropping siglock to get both the tasklist lock - * and siglock again in the proper order. Note that - * this allows an intervening SIGCONT to be posted. - * We need to check for that and bail out if necessary. + * We must initiate one now. */ struct task_struct *t; - spin_unlock_irq(&sighand->siglock); - - /* signals can be posted during this window */ - - read_lock(&tasklist_lock); - spin_lock_irq(&sighand->siglock); + current->exit_code = signr; + sig->group_exit_code = signr; - if (!likely(sig->flags & SIGNAL_STOP_DEQUEUED)) { + stop_count = 0; + for (t = next_thread(current); t != current; t = next_thread(t)) /* - * Another stop or continue happened while we - * didn't have the lock. 
We can just swallow this - * signal now. If we raced with a SIGCONT, that - * should have just cleared it now. If we raced - * with another processor delivering a stop signal, - * then the SIGCONT that wakes us up should clear it. + * Setting state to TASK_STOPPED for a group + * stop is always done with the siglock held, + * so this check has no races. */ - read_unlock(&tasklist_lock); - return 0; - } - - if (sig->group_stop_count == 0) { - sig->group_exit_code = signr; - stop_count = 0; - for (t = next_thread(current); t != current; - t = next_thread(t)) - /* - * Setting state to TASK_STOPPED for a group - * stop is always done with the siglock held, - * so this check has no races. - */ - if (!t->exit_state && - !(t->state & (TASK_STOPPED|TASK_TRACED))) { - stop_count++; - signal_wake_up(t, 0); - } - sig->group_stop_count = stop_count; - } - else { - /* A race with another thread while unlocked. */ - signr = sig->group_exit_code; - stop_count = --sig->group_stop_count; - } + if (!t->exit_state && + !(t->state & (TASK_STOPPED|TASK_TRACED))) { + stop_count++; + signal_wake_up(t, 0); + } + sig->group_stop_count = stop_count; - current->exit_code = signr; set_current_state(TASK_STOPPED); if (stop_count == 0) sig->flags = SIGNAL_STOP_STOPPED; - - spin_unlock_irq(&sighand->siglock); - read_unlock(&tasklist_lock); } + spin_unlock_irq(&sighand->siglock); finish_stop(stop_count); return 1; } -- cgit v1.2.3 From aacc90944d4b1f2fcec73a8127eb60a3a701ce1c Mon Sep 17 00:00:00 2001 From: Oleg Nesterov Date: Tue, 28 Mar 2006 16:11:23 -0800 Subject: [PATCH] do_group_exit: don't take tasklist_lock do_group_exit() takes tasklist_lock for zap_other_threads(), this is unneeded now. 
Signed-off-by: Oleg Nesterov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/exit.c | 2 -- 1 file changed, 2 deletions(-) (limited to 'kernel') diff --git a/kernel/exit.c b/kernel/exit.c index 44d6c6e3896d..aea23e713cf4 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -985,7 +985,6 @@ do_group_exit(int exit_code) else if (!thread_group_empty(current)) { struct signal_struct *const sig = current->signal; struct sighand_struct *const sighand = current->sighand; - read_lock(&tasklist_lock); spin_lock_irq(&sighand->siglock); if (sig->flags & SIGNAL_GROUP_EXIT) /* Another thread got here before we took the lock. */ @@ -995,7 +994,6 @@ do_group_exit(int exit_code) zap_other_threads(current); } spin_unlock_irq(&sighand->siglock); - read_unlock(&tasklist_lock); } do_exit(exit_code); -- cgit v1.2.3 From 88531f725bd52e37a7be726860e4ff3f09031d89 Mon Sep 17 00:00:00 2001 From: Oleg Nesterov Date: Tue, 28 Mar 2006 16:11:24 -0800 Subject: [PATCH] do_sigaction: don't take tasklist_lock do_sigaction() does not need tasklist_lock anymore, we can simplify the code. 
Signed-off-by: Oleg Nesterov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/signal.c | 22 +++------------------- 1 file changed, 3 insertions(+), 19 deletions(-) (limited to 'kernel') diff --git a/kernel/signal.c b/kernel/signal.c index e3bdec914626..2dfaa5076c31 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -2301,8 +2301,7 @@ sys_rt_sigqueueinfo(int pid, int sig, siginfo_t __user *uinfo) return kill_proc_info(sig, &info, pid); } -int -do_sigaction(int sig, struct k_sigaction *act, struct k_sigaction *oact) +int do_sigaction(int sig, struct k_sigaction *act, struct k_sigaction *oact) { struct k_sigaction *k; sigset_t mask; @@ -2328,6 +2327,7 @@ do_sigaction(int sig, struct k_sigaction *act, struct k_sigaction *oact) if (act) { sigdelsetmask(&act->sa.sa_mask, sigmask(SIGKILL) | sigmask(SIGSTOP)); + *k = *act; /* * POSIX 3.3.1.3: * "Setting a signal action to SIG_IGN for a signal that is @@ -2340,19 +2340,8 @@ do_sigaction(int sig, struct k_sigaction *act, struct k_sigaction *oact) * be discarded, whether or not it is blocked" */ if (act->sa.sa_handler == SIG_IGN || - (act->sa.sa_handler == SIG_DFL && - sig_kernel_ignore(sig))) { - /* - * This is a fairly rare case, so we only take the - * tasklist_lock once we're sure we'll need it. - * Now we must do this little unlock and relock - * dance to maintain the lock hierarchy. 
- */ + (act->sa.sa_handler == SIG_DFL && sig_kernel_ignore(sig))) { struct task_struct *t = current; - spin_unlock_irq(&t->sighand->siglock); - read_lock(&tasklist_lock); - spin_lock_irq(&t->sighand->siglock); - *k = *act; sigemptyset(&mask); sigaddset(&mask, sig); rm_from_queue_full(&mask, &t->signal->shared_pending); @@ -2361,12 +2350,7 @@ do_sigaction(int sig, struct k_sigaction *act, struct k_sigaction *oact) recalc_sigpending_tsk(t); t = next_thread(t); } while (t != current); - spin_unlock_irq(&current->sighand->siglock); - read_unlock(&tasklist_lock); - return 0; } - - *k = *act; } spin_unlock_irq(&current->sighand->siglock); -- cgit v1.2.3 From 47e65328a7b1cdfc4e3102e50d60faf94ebba7d3 Mon Sep 17 00:00:00 2001 From: Oleg Nesterov Date: Tue, 28 Mar 2006 16:11:25 -0800 Subject: [PATCH] pids: kill PIDTYPE_TGID This patch kills PIDTYPE_TGID pid_type thus saving one hash table in kernel/pid.c and speeding up subthreads create/destroy a bit. It is also a preparation for the further tref/pids rework. This patch adds 'struct list_head thread_group' to 'struct task_struct' instead. We don't detach group leader from PIDTYPE_PID namespace until another thread inherits its ->pid == ->tgid, so we are safe wrt premature free_pidmap(->tgid) call. Currently there are no users of find_task_by_pid_type(PIDTYPE_TGID). Should the need arise, we can use find_task_by_pid()->group_leader.
Signed-off-by: Oleg Nesterov Acked-By: Eric Biederman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/exit.c | 10 +--------- kernel/fork.c | 4 +++- 2 files changed, 4 insertions(+), 10 deletions(-) (limited to 'kernel') diff --git a/kernel/exit.c b/kernel/exit.c index aea23e713cf4..22399caf7574 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -51,7 +51,6 @@ static void __unhash_process(struct task_struct *p) { nr_threads--; detach_pid(p, PIDTYPE_PID); - detach_pid(p, PIDTYPE_TGID); if (thread_group_leader(p)) { detach_pid(p, PIDTYPE_PGID); detach_pid(p, PIDTYPE_SID); @@ -59,7 +58,7 @@ static void __unhash_process(struct task_struct *p) list_del_init(&p->tasks); __get_cpu_var(process_counts)--; } - + list_del_rcu(&p->thread_group); remove_parent(p); } @@ -964,13 +963,6 @@ asmlinkage long sys_exit(int error_code) do_exit((error_code&0xff)<<8); } -task_t fastcall *next_thread(const task_t *p) -{ - return pid_task(p->pids[PIDTYPE_TGID].pid_list.next, PIDTYPE_TGID); -} - -EXPORT_SYMBOL(next_thread); - /* * Take down every thread in the group. This is called by fatal signals * as well as by sys_exit_group (below). diff --git a/kernel/fork.c b/kernel/fork.c index 12cdd9fc9d02..bc551efb5fd4 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1112,6 +1112,7 @@ static task_t *copy_process(unsigned long clone_flags, * We dont wake it up yet. 
*/ p->group_leader = p; + INIT_LIST_HEAD(&p->thread_group); INIT_LIST_HEAD(&p->ptrace_children); INIT_LIST_HEAD(&p->ptrace_list); @@ -1165,7 +1166,9 @@ static task_t *copy_process(unsigned long clone_flags, retval = -EAGAIN; goto bad_fork_cleanup_namespace; } + p->group_leader = current->group_leader; + list_add_tail_rcu(&p->thread_group, &p->group_leader->thread_group); if (current->signal->group_stop_count > 0) { /* @@ -1213,7 +1216,6 @@ static task_t *copy_process(unsigned long clone_flags, list_add_tail(&p->tasks, &init_task.tasks); __get_cpu_var(process_counts)++; } - attach_pid(p, PIDTYPE_TGID, p->tgid); attach_pid(p, PIDTYPE_PID, p->pid); nr_threads++; } -- cgit v1.2.3 From 4a2c7a7837da1b91468e50426066d988050e4d56 Mon Sep 17 00:00:00 2001 From: Oleg Nesterov Date: Tue, 28 Mar 2006 16:11:26 -0800 Subject: [PATCH] make fork() atomic wrt pgrp/session signals Eric W. Biederman wrote: > > Ok. SUSV3/Posix is clear, fork is atomic with respect > to signals. Either a signal comes before or after a > fork but not during. (See the rationale section). > http://www.opengroup.org/onlinepubs/000095399/functions/fork.html > > The tasklist_lock does not stop forks from adding to a process > group. The forks stall while the tasklist_lock is held, but a fork > that began before we grabbed the tasklist_lock simply completes > afterwards, and the child does not receive the signal. This also means that SIGSTOP or sig_kernel_coredump() signal can't be delivered to pgrp/session reliably. With this patch copy_process() returns -ERESTARTNOINTR when it detects a pending signal, fork() will be restarted transparently after handling the signals. This patch also deletes now unneeded "group_stop_count > 0" check, copy_process() can no longer succeed while group stop in progress. 
Signed-off-by: Oleg Nesterov Acked-By: Eric Biederman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/fork.c | 37 +++++++++++++++++-------------------- 1 file changed, 17 insertions(+), 20 deletions(-) (limited to 'kernel') diff --git a/kernel/fork.c b/kernel/fork.c index bc551efb5fd4..aa50c848fae7 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1136,16 +1136,6 @@ static task_t *copy_process(unsigned long clone_flags, !cpu_online(task_cpu(p)))) set_task_cpu(p, smp_processor_id()); - /* - * Check for pending SIGKILL! The new thread should not be allowed - * to slip out of an OOM kill. (or normal SIGKILL.) - */ - if (sigismember(&current->pending.signal, SIGKILL)) { - write_unlock_irq(&tasklist_lock); - retval = -EINTR; - goto bad_fork_cleanup_namespace; - } - /* CLONE_PARENT re-uses the old parent */ if (clone_flags & (CLONE_PARENT|CLONE_THREAD)) p->real_parent = current->real_parent; @@ -1154,6 +1144,23 @@ static task_t *copy_process(unsigned long clone_flags, p->parent = p->real_parent; spin_lock(&current->sighand->siglock); + + /* + * Process group and session signals need to be delivered to just the + * parent before the fork or both the parent and the child after the + * fork. Restart if a signal comes in before we add the new process to + * it's process group. + * A fatal signal pending means that current will exit, so the new + * thread can't slip out of an OOM kill (or normal SIGKILL).
+ */ + recalc_sigpending(); + if (signal_pending(current)) { + spin_unlock(&current->sighand->siglock); + write_unlock_irq(&tasklist_lock); + retval = -ERESTARTNOINTR; + goto bad_fork_cleanup_namespace; + } + if (clone_flags & CLONE_THREAD) { /* * Important: if an exit-all has been started then @@ -1170,16 +1177,6 @@ static task_t *copy_process(unsigned long clone_flags, p->group_leader = current->group_leader; list_add_tail_rcu(&p->thread_group, &p->group_leader->thread_group); - if (current->signal->group_stop_count > 0) { - /* - * There is an all-stop in progress for the group. - * We ourselves will stop as soon as we check signals. - * Make the new thread part of that group stop too. - */ - current->signal->group_stop_count++; - set_tsk_thread_flag(p, TIF_SIGPENDING); - } - if (!cputime_eq(current->signal->it_virt_expires, cputime_zero) || !cputime_eq(current->signal->it_prof_expires, -- cgit v1.2.3 From a7e5328a06a2beee3a2bbfaf87ce2a7bbe937de1 Mon Sep 17 00:00:00 2001 From: Oleg Nesterov Date: Tue, 28 Mar 2006 16:11:27 -0800 Subject: [PATCH] cleanup __exit_signal->cleanup_sighand path Move 'tsk->sighand = NULL' from cleanup_sighand() to __exit_signal(). This makes the exit path more understandable and allows us to do cleanup_sighand() outside of ->siglock protected section. Signed-off-by: Oleg Nesterov Cc: "Eric W.
Biederman" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/exit.c | 3 ++- kernel/fork.c | 8 ++------ 2 files changed, 4 insertions(+), 7 deletions(-) (limited to 'kernel') diff --git a/kernel/exit.c b/kernel/exit.c index 22399caf7574..bc0ec674d3f4 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -114,10 +114,11 @@ static void __exit_signal(struct task_struct *tsk) __unhash_process(tsk); tsk->signal = NULL; - cleanup_sighand(tsk); + tsk->sighand = NULL; spin_unlock(&sighand->siglock); rcu_read_unlock(); + __cleanup_sighand(sighand); clear_tsk_thread_flag(tsk,TIF_SIGPENDING); flush_sigqueue(&tsk->pending); if (sig) { diff --git a/kernel/fork.c b/kernel/fork.c index aa50c848fae7..b3f7a1bb5e55 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -803,12 +803,8 @@ static inline int copy_sighand(unsigned long clone_flags, struct task_struct * t return 0; } -void cleanup_sighand(struct task_struct *tsk) +void __cleanup_sighand(struct sighand_struct *sighand) { - struct sighand_struct * sighand = tsk->sighand; - - /* Ok, we're done with the signal handlers */ - tsk->sighand = NULL; if (atomic_dec_and_test(&sighand->count)) kmem_cache_free(sighand_cachep, sighand); } @@ -1233,7 +1229,7 @@ bad_fork_cleanup_mm: bad_fork_cleanup_signal: cleanup_signal(p); bad_fork_cleanup_sighand: - cleanup_sighand(p); + __cleanup_sighand(p->sighand); bad_fork_cleanup_fs: exit_fs(p); /* blocking */ bad_fork_cleanup_files: -- cgit v1.2.3 From dac27f4a09c274db048e80d2511102e540ac9e3a Mon Sep 17 00:00:00 2001 From: Oleg Nesterov Date: Tue, 28 Mar 2006 16:11:28 -0800 Subject: [PATCH] simplify do_signal_stop() do_signal_stop() considers 'thread_group_empty()' as a special case. This was needed to avoid taking tasklist_lock. Since this lock is unneeded any longer, we can remove this special case and simplify the code even more. Also, before this patch, finish_stop() was called with stop_count == -1 for 'thread_group_empty()' case. 
This is not strictly wrong, but confusing and unneeded. Signed-off-by: Oleg Nesterov Cc: john stultz Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/signal.c | 32 ++++++++------------------------ 1 file changed, 8 insertions(+), 24 deletions(-) (limited to 'kernel') diff --git a/kernel/signal.c b/kernel/signal.c index 2dfaa5076c31..efba39626e66 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -1685,8 +1685,7 @@ out: static int do_signal_stop(int signr) { struct signal_struct *sig = current->signal; - struct sighand_struct *sighand = current->sighand; - int stop_count = -1; + int stop_count; if (!likely(sig->flags & SIGNAL_STOP_DEQUEUED)) return 0; @@ -1696,30 +1695,14 @@ static int do_signal_stop(int signr) * There is a group stop in progress. We don't need to * start another one. */ - signr = sig->group_exit_code; stop_count = --sig->group_stop_count; - current->exit_code = signr; - set_current_state(TASK_STOPPED); - if (stop_count == 0) - sig->flags = SIGNAL_STOP_STOPPED; - } - else if (thread_group_empty(current)) { - /* - * Lock must be held through transition to stopped state. - */ - current->exit_code = current->signal->group_exit_code = signr; - set_current_state(TASK_STOPPED); - sig->flags = SIGNAL_STOP_STOPPED; - } - else { + } else { /* - * (sig->group_stop_count == 0) * There is no group stop already in progress. * We must initiate one now. 
*/ struct task_struct *t; - current->exit_code = signr; sig->group_exit_code = signr; stop_count = 0; @@ -1735,13 +1718,14 @@ static int do_signal_stop(int signr) signal_wake_up(t, 0); } sig->group_stop_count = stop_count; - - set_current_state(TASK_STOPPED); - if (stop_count == 0) - sig->flags = SIGNAL_STOP_STOPPED; } - spin_unlock_irq(&sighand->siglock); + if (stop_count == 0) + sig->flags = SIGNAL_STOP_STOPPED; + current->exit_code = sig->group_exit_code; + __set_current_state(TASK_STOPPED); + + spin_unlock_irq(&current->sighand->siglock); finish_stop(stop_count); return 1; } -- cgit v1.2.3 From 883606a7c9e84e206f96c38f88c4bd66df72f51e Mon Sep 17 00:00:00 2001 From: Oleg Nesterov Date: Tue, 28 Mar 2006 16:11:28 -0800 Subject: [PATCH] finish_stop: don't check stop_count < 0 Remove an obscure 'stop_count < 0' check in finish_stop(). The previous patch made this case impossible. Signed-off-by: Oleg Nesterov Cc: john stultz Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/signal.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/signal.c b/kernel/signal.c index efba39626e66..24d5059ab0a9 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -1657,7 +1657,7 @@ finish_stop(int stop_count) * a group stop in progress and we are the last to stop, * report to the parent. When ptraced, every thread reports itself. */ - if (stop_count < 0 || (current->ptrace & PT_PTRACED)) + if (current->ptrace & PT_PTRACED) to_self = 1; else if (stop_count == 0) to_self = 0; -- cgit v1.2.3 From a1d5e21e3e388fb2c16463d007e788b1e41b74f1 Mon Sep 17 00:00:00 2001 From: Oleg Nesterov Date: Tue, 28 Mar 2006 16:11:29 -0800 Subject: [PATCH] do_notify_parent_cldstop: remove 'to_self' param The previous patch has changed callsites of do_notify_parent_cldstop() so that to_self == (->ptrace & PT_PTRACED) always (as it should be). We can remove this parameter now.
Signed-off-by: Oleg Nesterov Cc: john stultz Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/signal.c | 32 +++++++++++--------------------- 1 file changed, 11 insertions(+), 21 deletions(-) (limited to 'kernel') diff --git a/kernel/signal.c b/kernel/signal.c index 24d5059ab0a9..0528a959daa9 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -591,9 +591,7 @@ static int check_kill_permission(int sig, struct siginfo *info, } /* forward decl */ -static void do_notify_parent_cldstop(struct task_struct *tsk, - int to_self, - int why); +static void do_notify_parent_cldstop(struct task_struct *tsk, int why); /* * Handle magic process-wide effects of stop/continue signals. @@ -643,7 +641,7 @@ static void handle_stop_signal(int sig, struct task_struct *p) p->signal->group_stop_count = 0; p->signal->flags = SIGNAL_STOP_CONTINUED; spin_unlock(&p->sighand->siglock); - do_notify_parent_cldstop(p, (p->ptrace & PT_PTRACED), CLD_STOPPED); + do_notify_parent_cldstop(p, CLD_STOPPED); spin_lock(&p->sighand->siglock); } rm_from_queue(SIG_KERNEL_STOP_MASK, &p->signal->shared_pending); @@ -684,7 +682,7 @@ static void handle_stop_signal(int sig, struct task_struct *p) p->signal->flags = SIGNAL_STOP_CONTINUED; p->signal->group_exit_code = 0; spin_unlock(&p->sighand->siglock); - do_notify_parent_cldstop(p, (p->ptrace & PT_PTRACED), CLD_CONTINUED); + do_notify_parent_cldstop(p, CLD_CONTINUED); spin_lock(&p->sighand->siglock); } else { /* @@ -1519,14 +1517,14 @@ void do_notify_parent(struct task_struct *tsk, int sig) spin_unlock_irqrestore(&psig->siglock, flags); } -static void do_notify_parent_cldstop(struct task_struct *tsk, int to_self, int why) +static void do_notify_parent_cldstop(struct task_struct *tsk, int why) { struct siginfo info; unsigned long flags; struct task_struct *parent; struct sighand_struct *sighand; - if (to_self) + if (tsk->ptrace & PT_PTRACED) parent = tsk->parent; else { tsk = tsk->group_leader; @@ -1601,7 +1599,7 @@ static void 
ptrace_stop(int exit_code, int nostop_code, siginfo_t *info) !(current->ptrace & PT_ATTACHED)) && (likely(current->parent->signal != current->signal) || !unlikely(current->signal->flags & SIGNAL_GROUP_EXIT))) { - do_notify_parent_cldstop(current, 1, CLD_TRAPPED); + do_notify_parent_cldstop(current, CLD_TRAPPED); read_unlock(&tasklist_lock); schedule(); } else { @@ -1650,25 +1648,17 @@ void ptrace_notify(int exit_code) static void finish_stop(int stop_count) { - int to_self; - /* * If there are no other threads in the group, or if there is * a group stop in progress and we are the last to stop, * report to the parent. When ptraced, every thread reports itself. */ - if (current->ptrace & PT_PTRACED) - to_self = 1; - else if (stop_count == 0) - to_self = 0; - else - goto out; - - read_lock(&tasklist_lock); - do_notify_parent_cldstop(current, to_self, CLD_STOPPED); - read_unlock(&tasklist_lock); + if (stop_count == 0 || (current->ptrace & PT_PTRACED)) { + read_lock(&tasklist_lock); + do_notify_parent_cldstop(current, CLD_STOPPED); + read_unlock(&tasklist_lock); + } -out: schedule(); /* * Now we don't run again until continued. -- cgit v1.2.3 From 547679087bc60277d11b11631d826895762c4505 Mon Sep 17 00:00:00 2001 From: Oleg Nesterov Date: Tue, 28 Mar 2006 16:11:30 -0800 Subject: [PATCH] send_sigqueue: simplify and fix the race send_sigqueue() checks PF_EXITING, then locks p->sighand->siglock. This is unsafe: 'p' can exit in between and set ->sighand = NULL. The race is theoretical, the window is tiny and irqs are disabled by the caller, so I don't think we need the fix for -stable tree. Convert send_sigqueue() to use lock_task_sighand() helper. Also, delete 'p->flags & PF_EXITING' re-check, it is unneeded and the comment is wrong. 
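The race described above (check a flag, then lock through a pointer that may have gone away in between) can be illustrated with a user-space sketch of the "load the pointer, lock it, re-check the pointer" loop. This is only a model under stated assumptions: toy_task, toy_sighand, toy_lock_task_sighand() and toy_send() are hypothetical names, a C11 atomic_flag stands in for siglock, and the RCU protection that keeps the structure alive in the kernel is not modelled at all.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for sighand_struct and task_struct. */
struct toy_sighand {
	atomic_flag lock;	/* toy spinlock standing in for siglock */
	int pending;		/* stands in for the pending signal queue */
};

struct toy_task {
	_Atomic(struct toy_sighand *) sighand;	/* NULL once the task has exited */
};

static void toy_lock(struct toy_sighand *sh)
{
	while (atomic_flag_test_and_set(&sh->lock))
		;	/* spin until the flag was previously clear */
}

static void toy_unlock(struct toy_sighand *sh)
{
	atomic_flag_clear(&sh->lock);
}

/*
 * Returns the locked sighand, or NULL if the task has none.
 * The pointer is re-read after taking the lock; if it changed in
 * the meantime, the lock is dropped and the loop starts over.
 */
static struct toy_sighand *toy_lock_task_sighand(struct toy_task *t)
{
	assert(t != NULL);
	for (;;) {
		struct toy_sighand *sh = atomic_load(&t->sighand);

		if (sh == NULL)
			return NULL;
		toy_lock(sh);
		if (sh == atomic_load(&t->sighand))
			return sh;
		toy_unlock(sh);
	}
}

/* Queue one "signal"; fails with -1 when the task is already gone. */
static int toy_send(struct toy_task *t)
{
	struct toy_sighand *sh = toy_lock_task_sighand(t);

	if (sh == NULL)
		return -1;
	sh->pending++;
	toy_unlock(sh);
	return 0;
}
```

The point of the sketch is that the "is the task still there" decision is made from the same pointer that is then locked, so no separate PF_EXITING-style check is needed before or after taking the lock.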
Signed-off-by: Oleg Nesterov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/signal.c | 41 ++++------------------------------------- 1 file changed, 4 insertions(+), 37 deletions(-) (limited to 'kernel') diff --git a/kernel/signal.c b/kernel/signal.c index 0528a959daa9..4922928d91f6 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -1309,12 +1309,10 @@ void sigqueue_free(struct sigqueue *q) __sigqueue_free(q); } -int -send_sigqueue(int sig, struct sigqueue *q, struct task_struct *p) +int send_sigqueue(int sig, struct sigqueue *q, struct task_struct *p) { unsigned long flags; int ret = 0; - struct sighand_struct *sh; BUG_ON(!(q->flags & SIGQUEUE_PREALLOC)); @@ -1328,48 +1326,17 @@ send_sigqueue(int sig, struct sigqueue *q, struct task_struct *p) */ rcu_read_lock(); - if (unlikely(p->flags & PF_EXITING)) { + if (!likely(lock_task_sighand(p, &flags))) { ret = -1; goto out_err; } -retry: - sh = rcu_dereference(p->sighand); - - spin_lock_irqsave(&sh->siglock, flags); - if (p->sighand != sh) { - /* We raced with exec() in a multithreaded process... */ - spin_unlock_irqrestore(&sh->siglock, flags); - goto retry; - } - - /* - * We do the check here again to handle the following scenario: - * - * CPU 0 CPU 1 - * send_sigqueue - * check PF_EXITING - * interrupt exit code running - * __exit_signal - * lock sighand->siglock - * unlock sighand->siglock - * lock sh->siglock - * add(tsk->pending) flush_sigqueue(tsk->pending) - * - */ - - if (unlikely(p->flags & PF_EXITING)) { - ret = -1; - goto out; - } - if (unlikely(!list_empty(&q->list))) { /* * If an SI_TIMER entry is already queue just increment * the overrun count. 
*/ - if (q->info.si_code != SI_TIMER) - BUG(); + BUG_ON(q->info.si_code != SI_TIMER); q->info.si_overrun++; goto out; } @@ -1385,7 +1352,7 @@ retry: signal_wake_up(p, sig == SIGKILL); out: - spin_unlock_irqrestore(&sh->siglock, flags); + unlock_task_sighand(p, &flags); out_err: rcu_read_unlock(); -- cgit v1.2.3 From 85b6bce3658a823aa169586fe71ffba0f12ccc71 Mon Sep 17 00:00:00 2001 From: Pavel Machek Date: Fri, 31 Mar 2006 02:30:06 -0800 Subject: [PATCH] Fix suspend with traced tasks strace /bin/bash misbehaves after resume; this fixes it. (akpm: it's scary calling refrigerator() in state TASK_TRACED, but it seems to do the right thing). Signed-off-by: Pavel Machek Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/power/process.c | 3 +-- kernel/signal.c | 1 + 2 files changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/power/process.c b/kernel/power/process.c index 8ac7c35fad77..b2a5f671d6cd 100644 --- a/kernel/power/process.c +++ b/kernel/power/process.c @@ -26,8 +26,7 @@ static inline int freezeable(struct task_struct * p) (p->flags & PF_NOFREEZE) || (p->exit_state == EXIT_ZOMBIE) || (p->exit_state == EXIT_DEAD) || - (p->state == TASK_STOPPED) || - (p->state == TASK_TRACED)) + (p->state == TASK_STOPPED)) return 0; return 1; } diff --git a/kernel/signal.c b/kernel/signal.c index 4922928d91f6..92025b108791 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -1560,6 +1560,7 @@ static void ptrace_stop(int exit_code, int nostop_code, siginfo_t *info) /* Let the debugger run. 
*/ set_current_state(TASK_TRACED); spin_unlock_irq(&current->sighand->siglock); + try_to_freeze(); read_lock(&tasklist_lock); if (likely(current->ptrace & PT_PTRACED) && likely(current->parent != current->real_parent || -- cgit v1.2.3 From 3691c5199e8a4be1c7a91b5ab925db5feb866e19 Mon Sep 17 00:00:00 2001 From: Oleg Nesterov Date: Fri, 31 Mar 2006 02:30:30 -0800 Subject: [PATCH] kill __init_timer_base in favor of boot_tvec_bases Commit a4a6198b80cf82eb8160603c98da218d1bd5e104: [PATCH] tvec_bases too large for per-cpu data introduced "struct tvec_t_base_s boot_tvec_bases" which is visible at compile time. This means we can kill __init_timer_base and move timer_base_s's content into tvec_t_base_s. Signed-off-by: Oleg Nesterov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/timer.c | 84 ++++++++++++++++++++++++---------------------------------- 1 file changed, 35 insertions(+), 49 deletions(-) (limited to 'kernel') diff --git a/kernel/timer.c b/kernel/timer.c index ab189dd187cb..b04dc03b5934 100644 --- a/kernel/timer.c +++ b/kernel/timer.c @@ -54,7 +54,6 @@ EXPORT_SYMBOL(jiffies_64); /* * per-CPU timer vector definitions: */ - #define TVN_BITS (CONFIG_BASE_SMALL ? 4 : 6) #define TVR_BITS (CONFIG_BASE_SMALL ?
6 : 8) #define TVN_SIZE (1 << TVN_BITS) @@ -62,11 +61,6 @@ EXPORT_SYMBOL(jiffies_64); #define TVN_MASK (TVN_SIZE - 1) #define TVR_MASK (TVR_SIZE - 1) -struct timer_base_s { - spinlock_t lock; - struct timer_list *running_timer; -}; - typedef struct tvec_s { struct list_head vec[TVN_SIZE]; } tvec_t; @@ -76,7 +70,8 @@ typedef struct tvec_root_s { } tvec_root_t; struct tvec_t_base_s { - struct timer_base_s t_base; + spinlock_t lock; + struct timer_list *running_timer; unsigned long timer_jiffies; tvec_root_t tv1; tvec_t tv2; @@ -87,13 +82,14 @@ struct tvec_t_base_s { typedef struct tvec_t_base_s tvec_base_t; static DEFINE_PER_CPU(tvec_base_t *, tvec_bases); -static tvec_base_t boot_tvec_bases; +tvec_base_t boot_tvec_bases; +EXPORT_SYMBOL(boot_tvec_bases); static inline void set_running_timer(tvec_base_t *base, struct timer_list *timer) { #ifdef CONFIG_SMP - base->t_base.running_timer = timer; + base->running_timer = timer; #endif } @@ -139,15 +135,6 @@ static void internal_add_timer(tvec_base_t *base, struct timer_list *timer) list_add_tail(&timer->entry, vec); } -typedef struct timer_base_s timer_base_t; -/* - * Used by TIMER_INITIALIZER, we can't use per_cpu(tvec_bases) - * at compile time, and we need timer->base to lock the timer. - */ -timer_base_t __init_timer_base - ____cacheline_aligned_in_smp = { .lock = SPIN_LOCK_UNLOCKED }; -EXPORT_SYMBOL(__init_timer_base); - /*** * init_timer - initialize a timer. 
* @timer: the timer to be initialized @@ -158,7 +145,7 @@ EXPORT_SYMBOL(__init_timer_base); void fastcall init_timer(struct timer_list *timer) { timer->entry.next = NULL; - timer->base = &per_cpu(tvec_bases, raw_smp_processor_id())->t_base; + timer->base = per_cpu(tvec_bases, raw_smp_processor_id()); } EXPORT_SYMBOL(init_timer); @@ -174,7 +161,7 @@ static inline void detach_timer(struct timer_list *timer, } /* - * We are using hashed locking: holding per_cpu(tvec_bases).t_base.lock + * We are using hashed locking: holding per_cpu(tvec_bases).lock * means that all timers which are tied to this base via timer->base are * locked, and the base itself is locked too. * @@ -185,10 +172,10 @@ static inline void detach_timer(struct timer_list *timer, * possible to set timer->base = NULL and drop the lock: the timer remains * locked. */ -static timer_base_t *lock_timer_base(struct timer_list *timer, +static tvec_base_t *lock_timer_base(struct timer_list *timer, unsigned long *flags) { - timer_base_t *base; + tvec_base_t *base; for (;;) { base = timer->base; @@ -205,8 +192,7 @@ static timer_base_t *lock_timer_base(struct timer_list *timer, int __mod_timer(struct timer_list *timer, unsigned long expires) { - timer_base_t *base; - tvec_base_t *new_base; + tvec_base_t *base, *new_base; unsigned long flags; int ret = 0; @@ -221,7 +207,7 @@ int __mod_timer(struct timer_list *timer, unsigned long expires) new_base = __get_cpu_var(tvec_bases); - if (base != &new_base->t_base) { + if (base != new_base) { /* * We are trying to schedule the timer on the local CPU. 
* However we can't change timer's base while it is running, @@ -231,19 +217,19 @@ int __mod_timer(struct timer_list *timer, unsigned long expires) */ if (unlikely(base->running_timer == timer)) { /* The timer remains on a former base */ - new_base = container_of(base, tvec_base_t, t_base); + new_base = base; } else { /* See the comment in lock_timer_base() */ timer->base = NULL; spin_unlock(&base->lock); - spin_lock(&new_base->t_base.lock); - timer->base = &new_base->t_base; + spin_lock(&new_base->lock); + timer->base = new_base; } } timer->expires = expires; internal_add_timer(new_base, timer); - spin_unlock_irqrestore(&new_base->t_base.lock, flags); + spin_unlock_irqrestore(&new_base->lock, flags); return ret; } @@ -263,10 +249,10 @@ void add_timer_on(struct timer_list *timer, int cpu) unsigned long flags; BUG_ON(timer_pending(timer) || !timer->function); - spin_lock_irqsave(&base->t_base.lock, flags); - timer->base = &base->t_base; + spin_lock_irqsave(&base->lock, flags); + timer->base = base; internal_add_timer(base, timer); - spin_unlock_irqrestore(&base->t_base.lock, flags); + spin_unlock_irqrestore(&base->lock, flags); } @@ -319,7 +305,7 @@ EXPORT_SYMBOL(mod_timer); */ int del_timer(struct timer_list *timer) { - timer_base_t *base; + tvec_base_t *base; unsigned long flags; int ret = 0; @@ -346,7 +332,7 @@ EXPORT_SYMBOL(del_timer); */ int try_to_del_timer_sync(struct timer_list *timer) { - timer_base_t *base; + tvec_base_t *base; unsigned long flags; int ret = -1; @@ -410,7 +396,7 @@ static int cascade(tvec_base_t *base, tvec_t *tv, int index) struct timer_list *tmp; tmp = list_entry(curr, struct timer_list, entry); - BUG_ON(tmp->base != &base->t_base); + BUG_ON(tmp->base != base); curr = curr->next; internal_add_timer(base, tmp); } @@ -432,7 +418,7 @@ static inline void __run_timers(tvec_base_t *base) { struct timer_list *timer; - spin_lock_irq(&base->t_base.lock); + spin_lock_irq(&base->lock); while (time_after_eq(jiffies, base->timer_jiffies)) { struct 
list_head work_list = LIST_HEAD_INIT(work_list); struct list_head *head = &work_list; @@ -458,7 +444,7 @@ static inline void __run_timers(tvec_base_t *base) set_running_timer(base, timer); detach_timer(timer, 1); - spin_unlock_irq(&base->t_base.lock); + spin_unlock_irq(&base->lock); { int preempt_count = preempt_count(); fn(data); @@ -471,11 +457,11 @@ static inline void __run_timers(tvec_base_t *base) BUG(); } } - spin_lock_irq(&base->t_base.lock); + spin_lock_irq(&base->lock); } } set_running_timer(base, NULL); - spin_unlock_irq(&base->t_base.lock); + spin_unlock_irq(&base->lock); } #ifdef CONFIG_NO_IDLE_HZ @@ -506,7 +492,7 @@ unsigned long next_timer_interrupt(void) hr_expires += jiffies; base = __get_cpu_var(tvec_bases); - spin_lock(&base->t_base.lock); + spin_lock(&base->lock); expires = base->timer_jiffies + (LONG_MAX >> 1); list = NULL; @@ -554,7 +540,7 @@ found: expires = nte->expires; } } - spin_unlock(&base->t_base.lock); + spin_unlock(&base->lock); if (time_before(hr_expires, expires)) return hr_expires; @@ -1262,7 +1248,7 @@ static int __devinit init_timers_cpu(int cpu) } per_cpu(tvec_bases, cpu) = base; } - spin_lock_init(&base->t_base.lock); + spin_lock_init(&base->lock); for (j = 0; j < TVN_SIZE; j++) { INIT_LIST_HEAD(base->tv5.vec + j); INIT_LIST_HEAD(base->tv4.vec + j); @@ -1284,7 +1270,7 @@ static void migrate_timer_list(tvec_base_t *new_base, struct list_head *head) while (!list_empty(head)) { timer = list_entry(head->next, struct timer_list, entry); detach_timer(timer, 0); - timer->base = &new_base->t_base; + timer->base = new_base; internal_add_timer(new_base, timer); } } @@ -1300,11 +1286,11 @@ static void __devinit migrate_timers(int cpu) new_base = get_cpu_var(tvec_bases); local_irq_disable(); - spin_lock(&new_base->t_base.lock); - spin_lock(&old_base->t_base.lock); + spin_lock(&new_base->lock); + spin_lock(&old_base->lock); + + BUG_ON(old_base->running_timer); - if (old_base->t_base.running_timer) - BUG(); for (i = 0; i < TVR_SIZE; i++) 
migrate_timer_list(new_base, old_base->tv1.vec + i); for (i = 0; i < TVN_SIZE; i++) { @@ -1314,8 +1300,8 @@ static void __devinit migrate_timers(int cpu) migrate_timer_list(new_base, old_base->tv5.vec + i); } - spin_unlock(&old_base->t_base.lock); - spin_unlock(&new_base->t_base.lock); + spin_unlock(&old_base->lock); + spin_unlock(&new_base->lock); local_irq_enable(); put_cpu_var(tvec_bases); } -- cgit v1.2.3 From a2c348fe0117adced11e374329a5ea3f7c43cb41 Mon Sep 17 00:00:00 2001 From: Oleg Nesterov Date: Fri, 31 Mar 2006 02:30:31 -0800 Subject: [PATCH] __mod_timer: simplify ->base changing Since base and new_base are of the same type now, we can save one 'if' branch and simplify the code a bit. Signed-off-by: Oleg Nesterov Acked-by: Ingo Molnar Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/timer.c | 14 ++++++-------- 1 file changed, 6 insertions(+), 8 deletions(-) (limited to 'kernel') diff --git a/kernel/timer.c b/kernel/timer.c index b04dc03b5934..9062a82ee8ec 100644 --- a/kernel/timer.c +++ b/kernel/timer.c @@ -215,21 +215,19 @@ int __mod_timer(struct timer_list *timer, unsigned long expires) * handler yet has not finished. This also guarantees that * the timer is serialized wrt itself. 
*/ - if (unlikely(base->running_timer == timer)) { - /* The timer remains on a former base */ - new_base = base; - } else { + if (likely(base->running_timer != timer)) { /* See the comment in lock_timer_base() */ timer->base = NULL; spin_unlock(&base->lock); - spin_lock(&new_base->lock); - timer->base = new_base; + base = new_base; + spin_lock(&base->lock); + timer->base = base; } } timer->expires = expires; - internal_add_timer(new_base, timer); - spin_unlock_irqrestore(&new_base->lock, flags); + internal_add_timer(base, timer); + spin_unlock_irqrestore(&base->lock, flags); return ret; } -- cgit v1.2.3 From 9b41046cd0ee0a57f849d6e1363f7933e363cca9 Mon Sep 17 00:00:00 2001 From: OGAWA Hirofumi Date: Fri, 31 Mar 2006 02:30:33 -0800 Subject: [PATCH] Don't pass boot parameters to argv_init[] The boot cmdline is parsed in parse_early_param() and parse_args(,unknown_bootoption). And __setup() is used in obsolete_checksetup(). start_kernel() -> parse_args() -> unknown_bootoption() -> obsolete_checksetup() If __setup()'s callback (->setup_func()) returns 1 in obsolete_checksetup(), obsolete_checksetup() thinks a parameter was handled. If ->setup_func() returns 0, obsolete_checksetup() tries other ->setup_func(). If all ->setup_func() that matched a parameter returns 0, a parameter is seted to argv_init[]. Then, when runing /sbin/init or init=app, argv_init[] is passed to the app. If the app doesn't ignore those arguments, it will warning and exit. This patch fixes a wrong usage of it, however fixes obvious one only. Signed-off-by: OGAWA Hirofumi Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/audit.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/audit.c b/kernel/audit.c index 04fe2e301b61..c8ccbd09048f 100644 --- a/kernel/audit.c +++ b/kernel/audit.c @@ -578,7 +578,7 @@ static int __init audit_enable(char *str) audit_initialized ? 
"" : " (after initialization)"); if (audit_initialized) audit_enabled = audit_default; - return 0; + return 1; } __setup("audit=", audit_enable); -- cgit v1.2.3 From bb231fe3a53b2d34c1aef119613816fcb18864b1 Mon Sep 17 00:00:00 2001 From: KaiGai Kohei Date: Fri, 31 Mar 2006 02:30:45 -0800 Subject: [PATCH] Fix pacct bug in multithreading case. I noticed a bug on the process accounting facility. In multi-threading process, some data would be recorded incorrectly when the group_leader dies earlier than one or more threads. The attached patch fixes this problem. See below. 'bugacct' is a test program that create a worker thread after 4 seconds sleeping, then the group_leader dies soon. The worker thread consume CPU/Memory for 6 seconds, then exit. We can estimate 10 seconds as etime and 6 seconds as stime + utime. This is a sample program which the group_leader dies earlier than other threads. The results of same binary execution on different kernel are below. -- accounted records -------------------- | btime | utime | stime | etime | minflt | majflt | comm | original | 13:16:40 | 0.00 | 0.00 | 6.10 | 171 | 0 | bugacct | patched | 13:20:21 | 5.83 | 0.18 | 10.03 | 32776 | 0 | bugacct | (*) bugacct allocates 128MB memory, thus 128MB / 4KB = 32768 of minflt is appropriate. -- Test results in original kernel ------ $ date; time -p ./bugacct Tue Mar 28 13:16:36 JST 2006 <- But pacct said btime is 13:16:40 real 10.11 <- But pacct said etime is 6.10 user 5.96 <- But pacct said utime is 0.00 sys 0.14 <- But pacct said stime is 0.00 $ -- Test results in patched kernel ------- $ date; time -p ./bugacct Tue Mar 28 13:20:21 JST 2006 real 10.04 user 5.83 sys 0.19 $ In the original 2.6.16 kernel, pacct records btime, utime, stime, etime and minflt incorrectly. In my opinion, this problem is caused by an assumption that group_leader dies last. The following section calculates process running time for etime and btime. But it means running time of the thread that dies last, not process. 
The start_time of the first thread in the process (group_leader) should be reduced from uptime to calculate etime and btime correctly. ---- do_acct_process() in kernel/acct.c: /* calculate run_time in nsec*/ do_posix_clock_monotonic_gettime(&uptime); run_time = (u64)uptime.tv_sec*NSEC_PER_SEC + uptime.tv_nsec; run_time -= (u64)current->start_time.tv_sec*NSEC_PER_SEC + current->start_time.tv_nsec; ---- The following section calculates stime and utime of the process. But it might count the utime and stime of the group_leader duplicatly and ignore the utime and stime of the thread dies last, when one or more threads remain after group_leader dead. The ac_utime should be calculated as the sum of the signal->utime and utime of the thread dies last. The ac_stime should be done also. ---- do_acct_process() in kernel/acct.c: jiffies = cputime_to_jiffies(cputime_add(current->group_leader->utime, current->signal->utime)); ac.ac_utime = encode_comp_t(jiffies_to_AHZ(jiffies)); jiffies = cputime_to_jiffies(cputime_add(current->group_leader->stime, current->signal->stime)); ac.ac_stime = encode_comp_t(jiffies_to_AHZ(jiffies)); ---- The part of the minflt/majflt calculation has same problem. This patch solves those problems, I think. 
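The utime part of the argument above can be checked with a small arithmetic model. It is a sketch only: toy_thread, toy_signal and both helpers are hypothetical names, times are in hundredths of a second, and it assumes (as the message implies) that signal->utime holds the CPU time of threads already accounted for when they exited, while the thread that dies last still carries its own time.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; times are in hundredths of a second. */
struct toy_thread {
	unsigned long utime;	/* CPU time still held by this thread */
};

struct toy_signal {
	unsigned long utime;	/* CPU time of threads already accounted for */
};

/* Old calculation: always adds the group leader's own time. */
static unsigned long toy_acct_utime_old(const struct toy_signal *sig,
					const struct toy_thread *leader)
{
	assert(sig != NULL && leader != NULL);
	return leader->utime + sig->utime;
}

/* Patched calculation: adds the time of the thread that dies last. */
static unsigned long toy_acct_utime_new(const struct toy_signal *sig,
					const struct toy_thread *last)
{
	assert(sig != NULL && last != NULL);
	return last->utime + sig->utime;
}
```

With the bugacct scenario from the message (an idle leader, and a worker that burns 5.83 s and dies last), the old formula reports 0 and the patched one reports 583, which matches the utime column of the accounted records table.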
Cc: Ingo Molnar Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/acct.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/acct.c b/kernel/acct.c index 065d8b4e51ef..b327f4d20104 100644 --- a/kernel/acct.c +++ b/kernel/acct.c @@ -449,8 +449,8 @@ static void do_acct_process(long exitcode, struct file *file) /* calculate run_time in nsec*/ do_posix_clock_monotonic_gettime(&uptime); run_time = (u64)uptime.tv_sec*NSEC_PER_SEC + uptime.tv_nsec; - run_time -= (u64)current->start_time.tv_sec*NSEC_PER_SEC - + current->start_time.tv_nsec; + run_time -= (u64)current->group_leader->start_time.tv_sec * NSEC_PER_SEC + + current->group_leader->start_time.tv_nsec; /* convert nsec -> AHZ */ elapsed = nsec_to_AHZ(run_time); #if ACCT_VERSION==3 @@ -469,10 +469,10 @@ static void do_acct_process(long exitcode, struct file *file) #endif do_div(elapsed, AHZ); ac.ac_btime = xtime.tv_sec - elapsed; - jiffies = cputime_to_jiffies(cputime_add(current->group_leader->utime, + jiffies = cputime_to_jiffies(cputime_add(current->utime, current->signal->utime)); ac.ac_utime = encode_comp_t(jiffies_to_AHZ(jiffies)); - jiffies = cputime_to_jiffies(cputime_add(current->group_leader->stime, + jiffies = cputime_to_jiffies(cputime_add(current->stime, current->signal->stime)); ac.ac_stime = encode_comp_t(jiffies_to_AHZ(jiffies)); /* we really need to bite the bullet and change layout */ @@ -522,9 +522,9 @@ static void do_acct_process(long exitcode, struct file *file) ac.ac_io = encode_comp_t(0 /* current->io_usage */); /* %% */ ac.ac_rw = encode_comp_t(ac.ac_io / 1024); ac.ac_minflt = encode_comp_t(current->signal->min_flt + - current->group_leader->min_flt); + current->min_flt); ac.ac_majflt = encode_comp_t(current->signal->maj_flt + - current->group_leader->maj_flt); + current->maj_flt); ac.ac_swaps = encode_comp_t(0); ac.ac_exitcode = exitcode; -- cgit v1.2.3 From 4a01c8d5be628ac20cfd432c21808d76be5813e6 Mon Sep 17 00:00:00 
2001 From: Paul Jackson Date: Fri, 31 Mar 2006 02:30:50 -0800 Subject: [PATCH] cpuset: task_lock comment fix Fix cpuset comment involving case of a tasks cpuset pointer being NULL. Thanks to "the_top_cpuset_hack", this code no longer sees NULL task->cpuset pointers. Signed-off-by: Paul Jackson Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/cpuset.c | 10 ++++------ 1 file changed, 4 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/cpuset.c b/kernel/cpuset.c index 18aea1bd1284..2523a4b6a8c6 100644 --- a/kernel/cpuset.c +++ b/kernel/cpuset.c @@ -616,12 +616,10 @@ static void guarantee_online_mems(const struct cpuset *cs, nodemask_t *pmask) * current->cpuset if a task has its memory placement changed. * Do not call this routine if in_interrupt(). * - * Call without callback_mutex or task_lock() held. May be called - * with or without manage_mutex held. Doesn't need task_lock to guard - * against another task changing a non-NULL cpuset pointer to NULL, - * as that is only done by a task on itself, and if the current task - * is here, it is not simultaneously in the exit code NULL'ing its - * cpuset pointer. This routine also might acquire callback_mutex and + * Call without callback_mutex or task_lock() held. May be + * called with or without manage_mutex held. Thanks in part to + * 'the_top_cpuset_hack', the tasks cpuset pointer will never + * be NULL. This routine also might acquire callback_mutex and * current->mm->mmap_sem during call. * * Reading current->cpuset->mems_generation doesn't need task_lock -- cgit v1.2.3 From 2741a559a01e1ba9bf87285569dc1a104d134ecf Mon Sep 17 00:00:00 2001 From: Paul Jackson Date: Fri, 31 Mar 2006 02:30:51 -0800 Subject: [PATCH] cpuset: unsafe mm reference fix Fix unsafe reference to a tasks mm struct, by moving the reference inside of a convenient nearby properly guarded code block. 
Signed-off-by: Paul Jackson Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/cpuset.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/cpuset.c b/kernel/cpuset.c index 2523a4b6a8c6..bf42381a4195 100644 --- a/kernel/cpuset.c +++ b/kernel/cpuset.c @@ -1183,11 +1183,11 @@ static int attach_task(struct cpuset *cs, char *pidbuf, char **ppathbuf) mm = get_task_mm(tsk); if (mm) { mpol_rebind_mm(mm, &to); + if (is_memory_migrate(cs)) + do_migrate_pages(mm, &from, &to, MPOL_MF_MOVE_ALL); mmput(mm); } - if (is_memory_migrate(cs)) - do_migrate_pages(tsk->mm, &from, &to, MPOL_MF_MOVE_ALL); put_task_struct(tsk); synchronize_rcu(); if (atomic_dec_and_test(&oldcs->count)) -- cgit v1.2.3 From e4e364e865b382f9d99c7fc230ec2ce7df21257a Mon Sep 17 00:00:00 2001 From: Paul Jackson Date: Fri, 31 Mar 2006 02:30:52 -0800 Subject: [PATCH] cpuset: memory migration interaction fix Fix memory migration so that it works regardless of what cpuset the invoking task is in. If a task invoked a memory migration, by doing one of: 1) writing a different nodemask to a cpuset 'mems' file, or 2) writing a tasks pid to a different cpuset's 'tasks' file, where the cpuset had its 'memory_migrate' option turned on, then the allocation of the new pages for the migrated task(s) was constrained by the invoking tasks cpuset. If this task wasn't in a cpuset that allowed the requested memory nodes, the memory migration would happen to some other nodes that were in that invoking tasks cpuset. This was usually surprising and puzzling behaviour: Why didn't the pages move? Why did the pages move -there-? To fix this, temporarilly change the invoking tasks 'mems_allowed' task_struct field to the nodes the migrating tasks is moving to, so that new pages can be allocated there. 
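The fix described above is an override, migrate, restore sequence on the invoking task's allowed-node mask. The following user-space sketch models only that sequence; toy_nodemask, toy_task, toy_alloc_node() and toy_migrate() are hypothetical names, the allocator policy is invented for illustration, and the locking the kernel needs around the two mask updates is left out.

```c
#include <assert.h>

typedef unsigned int toy_nodemask;	/* bit n set => node n is allowed */

/* Hypothetical stand-in for the invoking task. */
struct toy_task {
	toy_nodemask cpuset_mems;	/* nodes of the task's own cpuset */
	toy_nodemask mems_allowed;	/* nodes it may allocate on right now */
};

/*
 * Toy allocator: pick the lowest wanted node the task may use,
 * falling back to any allowed node when none of them is permitted.
 */
static int toy_alloc_node(const struct toy_task *t, toy_nodemask wanted)
{
	toy_nodemask usable = t->mems_allowed & wanted;
	int node = 0;

	assert(t->mems_allowed != 0);
	if (usable == 0)
		usable = t->mems_allowed;
	while (!(usable & (1u << node)))
		node++;
	return node;
}

/*
 * Temporarily allow the target nodes for the invoking task, allocate,
 * then restore the mask from the task's own cpuset.
 */
static int toy_migrate(struct toy_task *invoker, toy_nodemask to)
{
	int node;

	assert(to != 0);
	invoker->mems_allowed = to;
	node = toy_alloc_node(invoker, to);
	invoker->mems_allowed = invoker->cpuset_mems;
	return node;
}
```

Without the temporary override, a task confined to node 0 that asks for pages on node 2 ends up on node 0, which is the "why did the pages move there" surprise from the message; with the override the allocation lands on node 2 and the task's mask is back to its cpuset afterwards.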
Signed-off-by: Paul Jackson Acked-by: Christoph Lameter Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/cpuset.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++++++++----- 1 file changed, 52 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/kernel/cpuset.c b/kernel/cpuset.c index bf42381a4195..72248d1b9e3f 100644 --- a/kernel/cpuset.c +++ b/kernel/cpuset.c @@ -833,6 +833,55 @@ static int update_cpumask(struct cpuset *cs, char *buf) return 0; } +/* + * cpuset_migrate_mm + * + * Migrate memory region from one set of nodes to another. + * + * Temporarilly set tasks mems_allowed to target nodes of migration, + * so that the migration code can allocate pages on these nodes. + * + * Call holding manage_mutex, so our current->cpuset won't change + * during this call, as manage_mutex holds off any attach_task() + * calls. Therefore we don't need to take task_lock around the + * call to guarantee_online_mems(), as we know no one is changing + * our tasks cpuset. + * + * Hold callback_mutex around the two modifications of our tasks + * mems_allowed to synchronize with cpuset_mems_allowed(). + * + * While the mm_struct we are migrating is typically from some + * other task, the task_struct mems_allowed that we are hacking + * is for our current task, which must allocate new pages for that + * migrating memory region. + * + * We call cpuset_update_task_memory_state() before hacking + * our tasks mems_allowed, so that we are assured of being in + * sync with our tasks cpuset, and in particular, callbacks to + * cpuset_update_task_memory_state() from nested page allocations + * won't see any mismatch of our cpuset and task mems_generation + * values, so won't overwrite our hacked tasks mems_allowed + * nodemask. 
+ */ + +static void cpuset_migrate_mm(struct mm_struct *mm, const nodemask_t *from, + const nodemask_t *to) +{ + struct task_struct *tsk = current; + + cpuset_update_task_memory_state(); + + mutex_lock(&callback_mutex); + tsk->mems_allowed = *to; + mutex_unlock(&callback_mutex); + + do_migrate_pages(mm, from, to, MPOL_MF_MOVE_ALL); + + mutex_lock(&callback_mutex); + guarantee_online_mems(tsk->cpuset, &tsk->mems_allowed); + mutex_unlock(&callback_mutex); +} + /* * Handle user request to change the 'mems' memory placement * of a cpuset. Needs to validate the request, update the @@ -945,10 +994,8 @@ static int update_nodemask(struct cpuset *cs, char *buf) struct mm_struct *mm = mmarray[i]; mpol_rebind_mm(mm, &cs->mems_allowed); - if (migrate) { - do_migrate_pages(mm, &oldmem, &cs->mems_allowed, - MPOL_MF_MOVE_ALL); - } + if (migrate) + cpuset_migrate_mm(mm, &oldmem, &cs->mems_allowed); mmput(mm); } @@ -1184,7 +1231,7 @@ static int attach_task(struct cpuset *cs, char *pidbuf, char **ppathbuf) if (mm) { mpol_rebind_mm(mm, &to); if (is_memory_migrate(cs)) - do_migrate_pages(mm, &from, &to, MPOL_MF_MOVE_ALL); + cpuset_migrate_mm(mm, &from, &to); mmput(mm); } -- cgit v1.2.3 From 7529c301165079d0f149d0e54724829e602f8fc0 Mon Sep 17 00:00:00 2001 From: Andrew Morton Date: Fri, 31 Mar 2006 02:30:59 -0800 Subject: [PATCH] modules: permit Dual-MIT/GPL licenses One of the LEDs driver files wants to use this. Probably drivers/mtd/maps/ipaq-flash.c wants to convert as well - right now it'll be tainting the kernel. 
Cc: David Woodhouse Cc: Thomas Gleixner Cc: John Bowler Cc: "'Richard Purdie'" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/module.c | 1 + 1 file changed, 1 insertion(+) (limited to 'kernel') diff --git a/kernel/module.c b/kernel/module.c index bd088a7c1499..d24deb0dbbc9 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -1254,6 +1254,7 @@ static inline int license_is_gpl_compatible(const char *license) || strcmp(license, "GPL v2") == 0 || strcmp(license, "GPL and additional rights") == 0 || strcmp(license, "Dual BSD/GPL") == 0 + || strcmp(license, "Dual MIT/GPL") == 0 || strcmp(license, "Dual MPL/GPL") == 0); } -- cgit v1.2.3 From 00362e33f65f1cb5d15e62ea5509520ce2770360 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Fri, 31 Mar 2006 02:31:17 -0800 Subject: [PATCH] hrtimer: create generic sleeper The removal of the data field in the hrtimer structure enforces the embedding of the timer into another data structure. nanosleep now uses a private implementation of the most common used timer callback function (simple task wakeup). In order to avoid the reimplentation of such functionality all over the place a generic hrtimer_sleeper functionality is created. 
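Because the timer no longer carries a data pointer, the callback has to recover its context from the structure the timer is embedded in. A minimal user-space sketch of that embedding pattern follows; toy_timer, toy_task, toy_sleeper and the toy_container_of macro are hypothetical stand-ins, and "waking" a task is reduced to setting a flag.

```c
#include <assert.h>
#include <stddef.h>

/* offsetof-based equivalent of the kernel's container_of() idea. */
#define toy_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical stand-ins for hrtimer, task_struct and hrtimer_sleeper. */
struct toy_timer {
	int (*function)(struct toy_timer *timer);
};

struct toy_task {
	int woken;
};

struct toy_sleeper {
	struct toy_timer timer;	/* embedded, not pointed to */
	struct toy_task *task;	/* cleared by the callback once it has fired */
};

/* Callback: find the enclosing sleeper and wake its task once. */
static int toy_wakeup(struct toy_timer *timer)
{
	struct toy_sleeper *t = toy_container_of(timer, struct toy_sleeper, timer);
	struct toy_task *task = t->task;

	t->task = NULL;
	if (task)
		task->woken = 1;
	return 0;	/* models "do not restart the timer" */
}

static void toy_init_sleeper(struct toy_sleeper *sl, struct toy_task *task)
{
	assert(sl != NULL);
	sl->timer.function = toy_wakeup;
	sl->task = task;
}
```

A caller can then tell whether the timer fired by checking whether the sleeper's task pointer has been cleared, which avoids a separate "expired" field.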
Signed-off-by: Thomas Gleixner Signed-off-by: Ingo Molnar Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/hrtimer.c | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+) (limited to 'kernel') diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c index 0237a556eb1f..877cdf9678bf 100644 --- a/kernel/hrtimer.c +++ b/kernel/hrtimer.c @@ -656,6 +656,25 @@ void hrtimer_run_queues(void) * Sleep related functions: */ +static int hrtimer_wakeup(struct hrtimer *timer) +{ + struct hrtimer_sleeper *t = + container_of(timer, struct hrtimer_sleeper, timer); + struct task_struct *task = t->task; + + t->task = NULL; + if (task) + wake_up_process(task); + + return HRTIMER_NORESTART; +} + +void hrtimer_init_sleeper(struct hrtimer_sleeper *sl, task_t *task) +{ + sl->timer.function = hrtimer_wakeup; + sl->task = task; +} + struct sleep_hrtimer { struct hrtimer timer; struct task_struct *task; -- cgit v1.2.3 From 669d7868ae414cdb7b7e5df375dc8e4b47f26f6d Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Fri, 31 Mar 2006 02:31:19 -0800 Subject: [PATCH] hrtimer: use generic sleeper for nanosleep Replace the nanosleep private sleeper functionality by the generic hrtimer sleeper. 
Signed-off-by: Thomas Gleixner Signed-off-by: Ingo Molnar Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/hrtimer.c | 39 +++++++++------------------------------ 1 file changed, 9 insertions(+), 30 deletions(-) (limited to 'kernel') diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c index 877cdf9678bf..49cbf7cffebd 100644 --- a/kernel/hrtimer.c +++ b/kernel/hrtimer.c @@ -655,7 +655,6 @@ void hrtimer_run_queues(void) /* * Sleep related functions: */ - static int hrtimer_wakeup(struct hrtimer *timer) { struct hrtimer_sleeper *t = @@ -675,28 +674,9 @@ void hrtimer_init_sleeper(struct hrtimer_sleeper *sl, task_t *task) sl->task = task; } -struct sleep_hrtimer { - struct hrtimer timer; - struct task_struct *task; - int expired; -}; - -static int nanosleep_wakeup(struct hrtimer *timer) +static int __sched do_nanosleep(struct hrtimer_sleeper *t, enum hrtimer_mode mode) { - struct sleep_hrtimer *t = - container_of(timer, struct sleep_hrtimer, timer); - - t->expired = 1; - wake_up_process(t->task); - - return HRTIMER_NORESTART; -} - -static int __sched do_nanosleep(struct sleep_hrtimer *t, enum hrtimer_mode mode) -{ - t->timer.function = nanosleep_wakeup; - t->task = current; - t->expired = 0; + hrtimer_init_sleeper(t, current); do { set_current_state(TASK_INTERRUPTIBLE); @@ -704,18 +684,17 @@ static int __sched do_nanosleep(struct sleep_hrtimer *t, enum hrtimer_mode mode) schedule(); - if (unlikely(!t->expired)) { - hrtimer_cancel(&t->timer); - mode = HRTIMER_ABS; - } - } while (!t->expired && !signal_pending(current)); + hrtimer_cancel(&t->timer); + mode = HRTIMER_ABS; + + } while (t->task && !signal_pending(current)); - return t->expired; + return t->task == NULL; } static long __sched nanosleep_restart(struct restart_block *restart) { - struct sleep_hrtimer t; + struct hrtimer_sleeper t; struct timespec __user *rmtp; struct timespec tu; ktime_t time; @@ -748,7 +727,7 @@ long hrtimer_nanosleep(struct timespec *rqtp, struct timespec __user *rmtp, const 
 enum hrtimer_mode mode, const clockid_t clockid) { struct restart_block *restart; - struct sleep_hrtimer t; + struct hrtimer_sleeper t; struct timespec tu; ktime_t rem; -- cgit v1.2.3 From 3055addadbe9bfb2365006a1c13fd342a8d30d52 Mon Sep 17 00:00:00 2001 From: Dimitri Sivanich Date: Fri, 31 Mar 2006 02:31:20 -0800 Subject: [PATCH] hrtimer: call get_softirq_time() only when necessary in run_hrtimer_queue() It seems that run_hrtimer_queue() is calling get_softirq_time() more often than it needs to. With this patch, it only calls get_softirq_time() if there's a pending timer. Signed-off-by: Dimitri Sivanich Signed-off-by: Thomas Gleixner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/hrtimer.c | 3 +++ 1 file changed, 3 insertions(+) (limited to 'kernel') diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c index 49cbf7cffebd..f181ff4dd32e 100644 --- a/kernel/hrtimer.c +++ b/kernel/hrtimer.c @@ -606,6 +606,9 @@ static inline void run_hrtimer_queue(struct hrtimer_base *base) { struct rb_node *node; + if (!base->first) + return; + if (base->get_softirq_time) base->softirq_time = base->get_softirq_time(); -- cgit v1.2.3 From db1b1fefc2cecbff2e4214062fa8c680cb6e7b7d Mon Sep 17 00:00:00 2001 From: Jack Steiner Date: Fri, 31 Mar 2006 02:31:21 -0800 Subject: [PATCH] sched: reduce overhead of calc_load Currently, count_active_tasks() calls both nr_running() & nr_uninterruptible(). Each of these functions does a "for_each_cpu" & reads values from the runqueue of each cpu. Although this is not a lot of instructions, each runqueue may be located on a different node. Depending on the architecture, a unique TLB entry may be required to access each runqueue. Since there may be more runqueues than cpu TLB entries, a scan of all runqueues can thrash the TLB. Each memory reference incurs a TLB miss & refill. In addition, the runqueue cacheline that contains nr_running & nr_uninterruptible may be evicted from the cache between the two passes.
This causes unnecessary cache misses. Combining nr_running() & nr_uninterruptible() into a single function substantially reduces the TLB & cache misses on large systems. This should have no measurable effect on smaller systems. On a 128p IA64 system running a memory stress workload, the new function reduced the overhead of calc_load() from 605 usec/call to 324 usec/call. Signed-off-by: Jack Steiner Acked-by: Ingo Molnar Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/sched.c | 15 +++++++++++++++ kernel/timer.c | 2 +- 2 files changed, 16 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/sched.c b/kernel/sched.c index a9ecac398bb9..6e52e0adff80 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -1658,6 +1658,21 @@ unsigned long nr_iowait(void) return sum; } +unsigned long nr_active(void) +{ + unsigned long i, running = 0, uninterruptible = 0; + + for_each_online_cpu(i) { + running += cpu_rq(i)->nr_running; + uninterruptible += cpu_rq(i)->nr_uninterruptible; + } + + if (unlikely((long)uninterruptible < 0)) + uninterruptible = 0; + + return running + uninterruptible; +} + #ifdef CONFIG_SMP /* diff --git a/kernel/timer.c b/kernel/timer.c index 9062a82ee8ec..6b812c04737b 100644 --- a/kernel/timer.c +++ b/kernel/timer.c @@ -825,7 +825,7 @@ void update_process_times(int user_tick) */ static unsigned long count_active_tasks(void) { - return (nr_running() + nr_uninterruptible()) * FIXED_1; + return nr_active() * FIXED_1; } /* -- cgit v1.2.3 From 3dee386e14045484a6c41c8f03a263f9d79de740 Mon Sep 17 00:00:00 2001 From: Con Kolivas Date: Fri, 31 Mar 2006 02:31:23 -0800 Subject: [PATCH] sched: cleanup task_activated() The activated flag in task_struct is used to track different sleep types and its usage is somewhat obfuscated. Convert the variable to an enum with more descriptive names without altering the function.
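The conversion can be pictured with a small standalone C sketch. The enumerator names are the ones used in the patch; their order and numeric values, and the helper sleep_type_from_activated(), are assumptions made here purely to show how the old integer values correspond to the new names.

```c
#include <assert.h>

/* Enumerator names come from the patch; the order and the numeric
 * values are an assumption of this sketch. */
enum sleep_type {
	SLEEP_NORMAL,
	SLEEP_NONINTERACTIVE,
	SLEEP_INTERACTIVE,
	SLEEP_INTERRUPTED,
};

/* Hypothetical helper: translate an old 'activated' value (-1, 0, 1, 2). */
static enum sleep_type sleep_type_from_activated(int activated)
{
	switch (activated) {
	case -1:
		return SLEEP_NONINTERACTIVE;
	case 1:
		return SLEEP_INTERACTIVE;
	case 2:
		return SLEEP_INTERRUPTED;
	default:
		return SLEEP_NORMAL;
	}
}

/* Replaces the old, less obvious test 'activated > 0'. */
static int interactive_sleep(enum sleep_type sleep_type)
{
	return sleep_type == SLEEP_INTERACTIVE ||
	       sleep_type == SLEEP_INTERRUPTED;
}

/* Usage: every old comparison has a named equivalent. */
static void sleep_type_demo(void)
{
	assert(sleep_type_from_activated(0) == SLEEP_NORMAL);
	assert(interactive_sleep(sleep_type_from_activated(1)));
	assert(interactive_sleep(sleep_type_from_activated(2)));
	assert(!interactive_sleep(sleep_type_from_activated(-1)));
}
```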
Signed-off-by: Con Kolivas Acked-by: Ingo Molnar Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/sched.c | 24 +++++++++++++++--------- 1 file changed, 15 insertions(+), 9 deletions(-) (limited to 'kernel') diff --git a/kernel/sched.c b/kernel/sched.c index 6e52e0adff80..f55ce5adac55 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -704,7 +704,7 @@ static int recalc_task_prio(task_t *p, unsigned long long now) * prevent them suddenly becoming cpu hogs and starving * other processes. */ - if (p->mm && p->activated != -1 && + if (p->mm && p->sleep_type != SLEEP_NONINTERACTIVE && sleep_time > INTERACTIVE_SLEEP(p)) { p->sleep_avg = JIFFIES_TO_NS(MAX_SLEEP_AVG - DEF_TIMESLICE); @@ -714,7 +714,7 @@ static int recalc_task_prio(task_t *p, unsigned long long now) * limited in their sleep_avg rise as they * are likely to be waiting on I/O */ - if (p->activated == -1 && p->mm) { + if (p->sleep_type == SLEEP_NONINTERACTIVE && p->mm) { if (p->sleep_avg >= INTERACTIVE_SLEEP(p)) sleep_time = 0; else if (p->sleep_avg + sleep_time >= @@ -769,7 +769,7 @@ static void activate_task(task_t *p, runqueue_t *rq, int local) * This checks to make sure it's not an uninterruptible task * that is now waking up. */ - if (!p->activated) { + if (p->sleep_type == SLEEP_NORMAL) { /* * Tasks which were woken up by interrupts (ie. hw events) * are most likely of interactive nature. So we give them @@ -778,13 +778,13 @@ static void activate_task(task_t *p, runqueue_t *rq, int local) * on a CPU, first time around: */ if (in_interrupt()) - p->activated = 2; + p->sleep_type = SLEEP_INTERRUPTED; else { /* * Normal first-time wakeups get a credit too for * on-runqueue time, but it will be weighted down: */ - p->activated = 1; + p->sleep_type = SLEEP_INTERACTIVE; } } p->timestamp = now; @@ -1272,7 +1272,7 @@ out_activate: * Tasks on involuntary sleep don't earn * sleep_avg beyond just interactive state. 
*/ - p->activated = -1; + p->sleep_type = SLEEP_NONINTERACTIVE; } /* @@ -2875,6 +2875,12 @@ EXPORT_SYMBOL(sub_preempt_count); #endif +static inline int interactive_sleep(enum sleep_type sleep_type) +{ + return (sleep_type == SLEEP_INTERACTIVE || + sleep_type == SLEEP_INTERRUPTED); +} + /* * schedule() is the main scheduler function. */ @@ -2998,12 +3004,12 @@ go_idle: queue = array->queue + idx; next = list_entry(queue->next, task_t, run_list); - if (!rt_task(next) && next->activated > 0) { + if (!rt_task(next) && interactive_sleep(next->sleep_type)) { unsigned long long delta = now - next->timestamp; if (unlikely((long long)(now - next->timestamp) < 0)) delta = 0; - if (next->activated == 1) + if (next->sleep_type == SLEEP_INTERACTIVE) delta = delta * (ON_RUNQUEUE_WEIGHT * 128 / 100) / 128; array = next->array; @@ -3016,7 +3022,7 @@ go_idle: } else requeue_task(next, array); } - next->activated = 0; + next->sleep_type = SLEEP_NORMAL; switch_tasks: if (next == rq->idle) schedstat_inc(rq, sched_goidle); -- cgit v1.2.3 From e7c38cb49c6cc05bc11f70d9e9889da1c4a0d37f Mon Sep 17 00:00:00 2001 From: Con Kolivas Date: Fri, 31 Mar 2006 02:31:25 -0800 Subject: [PATCH] sched: make task_noninteractive use sleep_type Alterations to the pipe code in the kernel made it possible for relative starvation to occur with tasks that slept waiting on a pipe getting unfair priority bonuses even if they were otherwise fully cpu bound so the TASK_NONINTERACTIVE flag was introduced which prevented any change to sleep_avg while sleeping waiting on a pipe. This change also leads to the converse though, preventing any priority boost from occurring in truly interactive tasks that wait on pipes. Convert the TASK_NONINTERACTIVE flag to set sleep_type to SLEEP_NONINTERACTIVE which will allow a linear bonus to priority based on sleep time thus allowing interactive tasks to get high priority if they sleep enough. 
Signed-off-by: Con Kolivas Acked-by: Ingo Molnar Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/sched.c | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) (limited to 'kernel') diff --git a/kernel/sched.c b/kernel/sched.c index f55ce5adac55..589e55a42214 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -1273,18 +1273,18 @@ out_activate: * sleep_avg beyond just interactive state. */ p->sleep_type = SLEEP_NONINTERACTIVE; - } + } else /* * Tasks that have marked their sleep as noninteractive get - * woken up without updating their sleep average. (i.e. their - * sleep is handled in a priority-neutral manner, no priority - * boost and no penalty.) + * woken up with their sleep average not weighted in an + * interactive way. */ - if (old_state & TASK_NONINTERACTIVE) - __activate_task(p, rq); - else - activate_task(p, rq, cpu == this_cpu); + if (old_state & TASK_NONINTERACTIVE) + p->sleep_type = SLEEP_NONINTERACTIVE; + + + activate_task(p, rq, cpu == this_cpu); /* * Sync wakeups (i.e. those types of wakeups where the waker * has indicated that it will leave the CPU in short order) -- cgit v1.2.3 From e72ff0bb2c163eb13014ba113701bd42dab382fe Mon Sep 17 00:00:00 2001 From: Con Kolivas Date: Fri, 31 Mar 2006 02:31:26 -0800 Subject: [PATCH] sched: dont decrease idle sleep avg We watch for tasks that sleep extended periods and don't allow one single prolonged sleep period from elevating priority to maximum bonus to prevent cpu bound tasks from getting high priority with single long sleeps. There is a bug in the current code that also penalises tasks that already have high priority. Correct that bug. 
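The bug and its correction come down to the difference between assigning a ceiling and clamping up to it. A minimal sketch in standalone C, where long_sleep_old() and long_sleep_fixed() are hypothetical names and the values are arbitrary units rather than real sleep_avg nanoseconds:

```c
#include <assert.h>

/* Old behaviour: a long sleep sets sleep_avg to the ceiling, which
 * lowers it for tasks that were already above the ceiling. */
static unsigned long long_sleep_old(unsigned long sleep_avg,
				    unsigned long ceiling)
{
	(void)sleep_avg;
	return ceiling;
}

/* Fixed behaviour: only raise sleep_avg up to the ceiling. */
static unsigned long long_sleep_fixed(unsigned long sleep_avg,
				      unsigned long ceiling)
{
	return sleep_avg < ceiling ? ceiling : sleep_avg;
}

/* Usage: a task at 900 with a ceiling of 800. */
static void ceiling_demo(void)
{
	assert(long_sleep_old(900, 800) == 800);	/* penalised */
	assert(long_sleep_fixed(900, 800) == 900);	/* left alone */
	assert(long_sleep_fixed(100, 800) == 800);	/* raised */
}
```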
Signed-off-by: Con Kolivas Acked-by: Ingo Molnar Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/sched.c | 15 ++++++++++----- 1 file changed, 10 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/kernel/sched.c b/kernel/sched.c index 589e55a42214..7b371931114f 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -700,14 +700,19 @@ static int recalc_task_prio(task_t *p, unsigned long long now) if (likely(sleep_time > 0)) { /* * User tasks that sleep a long time are categorised as - * idle and will get just interactive status to stay active & - * prevent them suddenly becoming cpu hogs and starving - * other processes. + * idle. They will only have their sleep_avg increased to a + * level that makes them just interactive priority to stay + * active yet prevent them suddenly becoming cpu hogs and + * starving other processes. */ if (p->mm && p->sleep_type != SLEEP_NONINTERACTIVE && sleep_time > INTERACTIVE_SLEEP(p)) { - p->sleep_avg = JIFFIES_TO_NS(MAX_SLEEP_AVG - - DEF_TIMESLICE); + unsigned long ceiling; + + ceiling = JIFFIES_TO_NS(MAX_SLEEP_AVG - + DEF_TIMESLICE); + if (p->sleep_avg < ceiling) + p->sleep_avg = ceiling; } else { /* * Tasks waking from uninterruptible sleep are -- cgit v1.2.3 From 5138930e6a69f1c7851a82d7cedaa01fad029fcf Mon Sep 17 00:00:00 2001 From: Con Kolivas Date: Fri, 31 Mar 2006 02:31:27 -0800 Subject: [PATCH] sched: include noninteractive sleep in idle detect Tasks waiting in SLEEP_NONINTERACTIVE state can now get to best priority so they need to be included in the idle detection code. 
Signed-off-by: Con Kolivas Acked-by: Ingo Molnar Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/sched.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/sched.c b/kernel/sched.c index 7b371931114f..3055fe806ff7 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -705,8 +705,7 @@ static int recalc_task_prio(task_t *p, unsigned long long now) * active yet prevent them suddenly becoming cpu hogs and * starving other processes. */ - if (p->mm && p->sleep_type != SLEEP_NONINTERACTIVE && - sleep_time > INTERACTIVE_SLEEP(p)) { + if (p->mm && sleep_time > INTERACTIVE_SLEEP(p)) { unsigned long ceiling; ceiling = JIFFIES_TO_NS(MAX_SLEEP_AVG - -- cgit v1.2.3 From 7c4bb1f9b3788309e1159961c606ba0bdf7ed382 Mon Sep 17 00:00:00 2001 From: Con Kolivas Date: Fri, 31 Mar 2006 02:31:29 -0800 Subject: [PATCH] sched: remove on runqueue requeueing On runqueue time is used to elevate priority in schedule(). In the code it currently requeues tasks even if their priority is not elevated, which would end up placing them at the end of their runqueue array effectively delaying them instead of improving their priority. Bug spotted by Mike Galbraith This patch removes this requeueing. 
Signed-off-by: Con Kolivas Acked-by: Ingo Molnar Cc: Mike Galbraith Cc: Nick Piggin Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/sched.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/sched.c b/kernel/sched.c index 3055fe806ff7..73bb4d9ef989 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -3023,8 +3023,7 @@ go_idle: dequeue_task(next, array); next->prio = new_prio; enqueue_task(next, array); - } else - requeue_task(next, array); + } } next->sleep_type = SLEEP_NORMAL; switch_tasks: -- cgit v1.2.3 From d425b274ba83ba4e7746a40446ec0ba3267de51f Mon Sep 17 00:00:00 2001 From: Con Kolivas Date: Fri, 31 Mar 2006 02:31:29 -0800 Subject: [PATCH] sched: activate SCHED BATCH expired To increase the strength of SCHED_BATCH as a scheduling hint we can activate batch tasks on the expired array since by definition they are latency insensitive tasks. Signed-off-by: Con Kolivas Acked-by: Ingo Molnar Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/sched.c | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/sched.c b/kernel/sched.c index 73bb4d9ef989..dd153d6f8a04 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -667,9 +667,13 @@ static int effective_prio(task_t *p) /* * __activate_task - move a task to the runqueue. 
*/ -static inline void __activate_task(task_t *p, runqueue_t *rq) +static void __activate_task(task_t *p, runqueue_t *rq) { - enqueue_task(p, rq->active); + prio_array_t *target = rq->active; + + if (batch_task(p)) + target = rq->expired; + enqueue_task(p, target); rq->nr_running++; } @@ -688,7 +692,7 @@ static int recalc_task_prio(task_t *p, unsigned long long now) unsigned long long __sleep_time = now - p->timestamp; unsigned long sleep_time; - if (unlikely(p->policy == SCHED_BATCH)) + if (batch_task(p)) sleep_time = 0; else { if (__sleep_time > NS_MAX_SLEEP_AVG) -- cgit v1.2.3 From 9741ef964dc8bfeb6520825df9fed8f538c3336e Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Fri, 31 Mar 2006 02:31:32 -0800 Subject: [PATCH] futex: check and validate timevals The futex timeval is not checked for correctness. The change does not break existing applications as the timeval is supplied by glibc (and glibc always passes a correct value), but the glibc-internal tests for this functionality fail. 
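What "checked for correctness" can mean for a timeout is sketched below in standalone C. The kernel helper timespec_valid() used by the patch is defined outside the files shown here, so the exact conditions in this sketch (non-negative seconds, nanoseconds in the range 0 to 999999999) are an assumption; struct ts_sketch and ts_is_valid() are hypothetical names.

```c
#include <assert.h>

#define NSEC_PER_SEC_SKETCH 1000000000L

/* Hypothetical stand-in for struct timespec. */
struct ts_sketch {
	long tv_sec;
	long tv_nsec;
};

/* Assumed validity rule: no negative seconds, and the nanosecond
 * part must be a proper fraction of a second. */
static int ts_is_valid(const struct ts_sketch *ts)
{
	return ts->tv_sec >= 0 &&
	       ts->tv_nsec >= 0 &&
	       ts->tv_nsec < NSEC_PER_SEC_SKETCH;
}

/* Usage: reject bad values before converting them to a timeout. */
static void ts_check_demo(void)
{
	struct ts_sketch ok = { 1, 500000000L };
	struct ts_sketch negative = { -1, 0 };
	struct ts_sketch overflow = { 0, 1000000000L };

	assert(ts_is_valid(&ok));
	assert(!ts_is_valid(&negative));
	assert(!ts_is_valid(&overflow));
}
```

Rejecting such values up front corresponds to the -EINVAL return added by the patch, instead of converting a nonsensical timespec to jiffies.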
Signed-off-by: Thomas Gleixner Signed-off-by: Ingo Molnar Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/futex.c | 4 +++- kernel/futex_compat.c | 4 +++- 2 files changed, 6 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/futex.c b/kernel/futex.c index 9c9b2b6b22dd..5699c512057b 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -1039,9 +1039,11 @@ asmlinkage long sys_futex(u32 __user *uaddr, int op, int val, unsigned long timeout = MAX_SCHEDULE_TIMEOUT; int val2 = 0; - if ((op == FUTEX_WAIT) && utime) { + if (utime && (op == FUTEX_WAIT)) { if (copy_from_user(&t, utime, sizeof(t)) != 0) return -EFAULT; + if (!timespec_valid(&t)) + return -EINVAL; timeout = timespec_to_jiffies(&t) + 1; } /* diff --git a/kernel/futex_compat.c b/kernel/futex_compat.c index 54274fc85321..1ab6a0ea3d14 100644 --- a/kernel/futex_compat.c +++ b/kernel/futex_compat.c @@ -129,9 +129,11 @@ asmlinkage long compat_sys_futex(u32 __user *uaddr, int op, u32 val, unsigned long timeout = MAX_SCHEDULE_TIMEOUT; int val2 = 0; - if ((op == FUTEX_WAIT) && utime) { + if (utime && (op == FUTEX_WAIT)) { if (get_compat_timespec(&t, utime)) return -EFAULT; + if (!timespec_valid(&t)) + return -EINVAL; timeout = timespec_to_jiffies(&t) + 1; } if (op >= FUTEX_REQUEUE) -- cgit v1.2.3 From 390e2ff07712468ce6600a43aa91e897b056ce12 Mon Sep 17 00:00:00 2001 From: "Eric W. Biederman" Date: Fri, 31 Mar 2006 02:31:33 -0800 Subject: [PATCH] Make setsid() more robust The core problem: setsid fails if it is called by init. The effect in 2.6.16 and the earlier kernels that have this problem is that if you do a "ps -j 1" or "ps -ej 1" you will see that init and several of its children have process group and session == 0 instead of process group == session == 1, despite init calling setsid. The reason it fails is that daemonize calls set_special_pids(1,1) on kernel threads that are launched before /sbin/init is called.
The only remaining effect is that current->signal->leader == 0 for init instead of 1, and the setsid call fails. No one has noticed because /sbin/init does not check the return value of setsid. In 2.4, where we don't have the pidhash table and daemonize doesn't exist, setsid actually works for init. I care a lot about pid == 1 not being a special case that we leave broken, because of the container/jail work that I am doing. - Carefully allow init (pid == 1) to call setsid despite the kernel using its session. - Use find_task_by_pid_type instead of find_pid because find_pid taking a pidtype is going away. Signed-off-by: Eric W. Biederman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/sys.c | 19 +++++++++++++++---- 1 file changed, 15 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/sys.c b/kernel/sys.c index 7ef7f6054c28..0b6ec0e7936f 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -1372,18 +1372,29 @@ asmlinkage long sys_getsid(pid_t pid) asmlinkage long sys_setsid(void) { struct task_struct *group_leader = current->group_leader; - struct pid *pid; + pid_t session; int err = -EPERM; mutex_lock(&tty_mutex); write_lock_irq(&tasklist_lock); - pid = find_pid(PIDTYPE_PGID, group_leader->pid); - if (pid) + /* Fail if I am already a session leader */ + if (group_leader->signal->leader) + goto out; + + session = group_leader->pid; + /* Fail if a process group id already exists that equals the + * proposed session id. + * + * Don't check if session id == 1 because kernel threads use this + * session id and so the check will always fail and make it so + * init cannot successfully call setsid.
+ */ + if (session > 1 && find_task_by_pid_type(PIDTYPE_PGID, session)) goto out; group_leader->signal->leader = 1; - __set_special_pids(group_leader->pid, group_leader->pid); + __set_special_pids(session, session); group_leader->signal->tty = NULL; group_leader->signal->tty_old_pgrp = 0; err = process_group(group_leader); -- cgit v1.2.3 From 158d9ebd19280582da172626ad3edda1a626dace Mon Sep 17 00:00:00 2001 From: Andrew Morton Date: Fri, 31 Mar 2006 02:31:34 -0800 Subject: [PATCH] resurrect __put_task_struct This just got nuked in mainline. Bring it back because Eric's patches use it. Cc: "Eric W. Biederman" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/fork.c | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/fork.c b/kernel/fork.c index b3f7a1bb5e55..b1341205be27 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -108,10 +108,8 @@ void free_task(struct task_struct *tsk) } EXPORT_SYMBOL(free_task); -void __put_task_struct_cb(struct rcu_head *rhp) +void __put_task_struct(struct task_struct *tsk) { - struct task_struct *tsk = container_of(rhp, struct task_struct, rcu); - WARN_ON(!(tsk->exit_state & (EXIT_DEAD | EXIT_ZOMBIE))); WARN_ON(atomic_read(&tsk->usage)); WARN_ON(tsk == current); @@ -126,6 +124,12 @@ void __put_task_struct_cb(struct rcu_head *rhp) free_task(tsk); } +void __put_task_struct_cb(struct rcu_head *rhp) +{ + struct task_struct *tsk = container_of(rhp, struct task_struct, rcu); + __put_task_struct(tsk); +} + void __init fork_init(unsigned long mempages) { #ifndef __HAVE_ARCH_TASK_STRUCT_ALLOCATOR -- cgit v1.2.3 From 8c7904a00b06d2ee51149794b619e07369fcf9d4 Mon Sep 17 00:00:00 2001 From: "Eric W. Biederman" Date: Fri, 31 Mar 2006 02:31:37 -0800 Subject: [PATCH] task: RCU protect task->usage A big problem with rcu protected data structures that are also reference counted is that you must jump through several hoops to increase the reference count. 
I think someone finally implemented atomic_inc_not_zero(&count) to automate the common case. Unfortunately this means you must special case the rcu access case.

When data structures are only visible via rcu in a manner that is not determined by the reference count on the object (i.e. tasks are visible until their zombies are reaped) there is a much simpler technique we can employ: simply delaying the decrement of the reference count until the rcu interval is over.

What that means is that the proc code that looks up a task and later wants to sleep can now do:

	rcu_read_lock();
	task = find_task_by_pid(some_pid);
	if (task) {
		get_task_struct(task);
	}
	rcu_read_unlock();

The effect on the rest of the kernel is that put_task_struct becomes cheaper and immediate, and in the case where the task has been reaped it frees the task immediately instead of unnecessarily waiting until the rcu interval is over.

Cleanup of task_struct does not happen when its reference count drops to zero; instead cleanup happens when release_task is called. Tasks can only be looked up via rcu before release_task is called. All rcu protected members of task_struct are freed by release_task.

Therefore we can move call_rcu from put_task_struct into release_task. And we can modify release_task to not immediately release the reference count but instead have it call put_task_struct from the function it gives to call_rcu.

The end result:

- get_task_struct is safe in an rcu context where we have just looked up the task.
- put_task_struct() simplifies into its old pre rcu self.

This reorganization also makes put_task_struct uncallable from modules as it is not exported, but it does not appear to be called from any modules so this should not be an issue, and is trivially fixed.

Signed-off-by: Eric W.
Biederman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/exit.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/exit.c b/kernel/exit.c index bc0ec674d3f4..6c2eeb8f6390 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -127,6 +127,11 @@ static void __exit_signal(struct task_struct *tsk) } } +static void delayed_put_task_struct(struct rcu_head *rhp) +{ + put_task_struct(container_of(rhp, struct task_struct, rcu)); +} + void release_task(struct task_struct * p) { int zap_leader; @@ -168,7 +173,7 @@ repeat: spin_unlock(&p->proc_lock); proc_pid_flush(proc_dentry); release_thread(p); - put_task_struct(p); + call_rcu(&p->rcu, delayed_put_task_struct); p = leader; if (unlikely(zap_leader)) -- cgit v1.2.3 From 92476d7fc0326a409ab1d3864a04093a6be9aca7 Mon Sep 17 00:00:00 2001 From: "Eric W. Biederman" Date: Fri, 31 Mar 2006 02:31:42 -0800 Subject: [PATCH] pidhash: Refactor the pid hash table Simplifies the code, reduces the need for 4 pid hash tables, and makes the code more capable. In the discussions I had with Oleg it was felt that to a large extent the cleanup itself justified the work. Having struct pid dynamically allocated meant we could create the hash table entry when the pid was allocated and free the hash table entry when the pid was freed, instead of playing with the hash lists whenever a process would attach or detach to a process. For myself the fact that it gave what my previous task_ref patch gave for free with simpler code was a big win. The problem is that if you hold a reference to struct task_struct you lock in 10K of low memory. If you do that in a user controllable way like /proc does, with an unprivileged but hostile user space application with typical resource limits of 1000 fds and 100 processes I can trigger the OOM killer by consuming all of low memory with task structs, on a machine with 1GB of low memory.
If I instead hold a reference to struct pid which holds a pointer to my task_struct, I don't suffer from that problem because struct pid is 2 orders of magnitude smaller. In fact struct pid is small enough that most other kernel data structures dwarf it, so simply limiting the number of referring data structures is enough to prevent exhaustion of low memory.

This splits the current struct pid into two structures, struct pid and struct pid_link, and reduces our number of hash tables from PIDTYPE_MAX to just one. struct pid_link is the per process linkage into the hash tables and lives in struct task_struct. struct pid is given an independent lifetime, and holds pointers to each of the pid types.

The independent life of struct pid simplifies attach_pid and detach_pid, because we are always manipulating the list of pids and not the hash table. In addition, giving struct pid an independent life makes the concept much more powerful.

Kernel data structures can now embed a struct pid * instead of a pid_t and not suffer from pid wrap around problems or from keeping unnecessarily large amounts of memory allocated.

Signed-off-by: Eric W.
Biederman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/fork.c | 16 +++-- kernel/pid.c | 212 ++++++++++++++++++++++++++++++++++++++++------------------ 2 files changed, 155 insertions(+), 73 deletions(-) (limited to 'kernel') diff --git a/kernel/fork.c b/kernel/fork.c index b1341205be27..03975d0467f9 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1315,17 +1315,19 @@ long do_fork(unsigned long clone_flags, { struct task_struct *p; int trace = 0; - long pid = alloc_pidmap(); + struct pid *pid = alloc_pid(); + long nr; - if (pid < 0) + if (!pid) return -EAGAIN; + nr = pid->nr; if (unlikely(current->ptrace)) { trace = fork_traceflag (clone_flags); if (trace) clone_flags |= CLONE_PTRACE; } - p = copy_process(clone_flags, stack_start, regs, stack_size, parent_tidptr, child_tidptr, pid); + p = copy_process(clone_flags, stack_start, regs, stack_size, parent_tidptr, child_tidptr, nr); /* * Do this prior waking up the new thread - the thread pointer * might get invalid after that point, if the thread exits quickly. 
@@ -1352,7 +1354,7 @@ long do_fork(unsigned long clone_flags, p->state = TASK_STOPPED; if (unlikely (trace)) { - current->ptrace_message = pid; + current->ptrace_message = nr; ptrace_notify ((trace << 8) | SIGTRAP); } @@ -1362,10 +1364,10 @@ long do_fork(unsigned long clone_flags, ptrace_notify ((PTRACE_EVENT_VFORK_DONE << 8) | SIGTRAP); } } else { - free_pidmap(pid); - pid = PTR_ERR(p); + free_pid(pid); + nr = PTR_ERR(p); } - return pid; + return nr; } #ifndef ARCH_MIN_MMSTRUCT_ALIGN diff --git a/kernel/pid.c b/kernel/pid.c index a9f2dfd006d2..eeb836b65ca4 100644 --- a/kernel/pid.c +++ b/kernel/pid.c @@ -28,8 +28,9 @@ #include #define pid_hashfn(nr) hash_long((unsigned long)nr, pidhash_shift) -static struct hlist_head *pid_hash[PIDTYPE_MAX]; +static struct hlist_head *pid_hash; static int pidhash_shift; +static kmem_cache_t *pid_cachep; int pid_max = PID_MAX_DEFAULT; int last_pid; @@ -60,9 +61,22 @@ typedef struct pidmap { static pidmap_t pidmap_array[PIDMAP_ENTRIES] = { [ 0 ... PIDMAP_ENTRIES-1 ] = { ATOMIC_INIT(BITS_PER_PAGE), NULL } }; +/* + * Note: disable interrupts while the pidmap_lock is held as an + * interrupt might come in and do read_lock(&tasklist_lock). + * + * If we don't disable interrupts there is a nasty deadlock between + * detach_pid()->free_pid() and another cpu that does + * spin_lock(&pidmap_lock) followed by an interrupt routine that does + * read_lock(&tasklist_lock); + * + * After we clean up the tasklist_lock and know there are no + * irq handlers that take it we can leave the interrupts enabled. + * For now it is easier to be safe than to prove it can't happen. 
+ */ static __cacheline_aligned_in_smp DEFINE_SPINLOCK(pidmap_lock); -fastcall void free_pidmap(int pid) +static fastcall void free_pidmap(int pid) { pidmap_t *map = pidmap_array + pid / BITS_PER_PAGE; int offset = pid & BITS_PER_PAGE_MASK; @@ -71,7 +85,7 @@ fastcall void free_pidmap(int pid) atomic_inc(&map->nr_free); } -int alloc_pidmap(void) +static int alloc_pidmap(void) { int i, offset, max_scan, pid, last = last_pid; pidmap_t *map; @@ -89,12 +103,12 @@ int alloc_pidmap(void) * Free the page if someone raced with us * installing it: */ - spin_lock(&pidmap_lock); + spin_lock_irq(&pidmap_lock); if (map->page) free_page(page); else map->page = (void *)page; - spin_unlock(&pidmap_lock); + spin_unlock_irq(&pidmap_lock); if (unlikely(!map->page)) break; } @@ -131,13 +145,73 @@ int alloc_pidmap(void) return -1; } -struct pid * fastcall find_pid(enum pid_type type, int nr) +fastcall void put_pid(struct pid *pid) +{ + if (!pid) + return; + if ((atomic_read(&pid->count) == 1) || + atomic_dec_and_test(&pid->count)) + kmem_cache_free(pid_cachep, pid); +} + +static void delayed_put_pid(struct rcu_head *rhp) +{ + struct pid *pid = container_of(rhp, struct pid, rcu); + put_pid(pid); +} + +fastcall void free_pid(struct pid *pid) +{ + /* We can be called with write_lock_irq(&tasklist_lock) held */ + unsigned long flags; + + spin_lock_irqsave(&pidmap_lock, flags); + hlist_del_rcu(&pid->pid_chain); + spin_unlock_irqrestore(&pidmap_lock, flags); + + free_pidmap(pid->nr); + call_rcu(&pid->rcu, delayed_put_pid); +} + +struct pid *alloc_pid(void) +{ + struct pid *pid; + enum pid_type type; + int nr = -1; + + pid = kmem_cache_alloc(pid_cachep, GFP_KERNEL); + if (!pid) + goto out; + + nr = alloc_pidmap(); + if (nr < 0) + goto out_free; + + atomic_set(&pid->count, 1); + pid->nr = nr; + for (type = 0; type < PIDTYPE_MAX; ++type) + INIT_HLIST_HEAD(&pid->tasks[type]); + + spin_lock_irq(&pidmap_lock); + hlist_add_head_rcu(&pid->pid_chain, &pid_hash[pid_hashfn(pid->nr)]); + 
spin_unlock_irq(&pidmap_lock); + +out: + return pid; + +out_free: + kmem_cache_free(pid_cachep, pid); + pid = NULL; + goto out; +} + +struct pid * fastcall find_pid(int nr) { struct hlist_node *elem; struct pid *pid; hlist_for_each_entry_rcu(pid, elem, - &pid_hash[type][pid_hashfn(nr)], pid_chain) { + &pid_hash[pid_hashfn(nr)], pid_chain) { if (pid->nr == nr) return pid; } @@ -146,77 +220,82 @@ struct pid * fastcall find_pid(enum pid_type type, int nr) int fastcall attach_pid(task_t *task, enum pid_type type, int nr) { - struct pid *pid, *task_pid; - - task_pid = &task->pids[type]; - pid = find_pid(type, nr); - task_pid->nr = nr; - if (pid == NULL) { - INIT_LIST_HEAD(&task_pid->pid_list); - hlist_add_head_rcu(&task_pid->pid_chain, - &pid_hash[type][pid_hashfn(nr)]); - } else { - INIT_HLIST_NODE(&task_pid->pid_chain); - list_add_tail_rcu(&task_pid->pid_list, &pid->pid_list); - } + struct pid_link *link; + struct pid *pid; + + WARN_ON(!task->pid); /* to be removed soon */ + WARN_ON(!nr); /* to be removed soon */ + + link = &task->pids[type]; + link->pid = pid = find_pid(nr); + hlist_add_head_rcu(&link->node, &pid->tasks[type]); return 0; } -static fastcall int __detach_pid(task_t *task, enum pid_type type) +void fastcall detach_pid(task_t *task, enum pid_type type) { - struct pid *pid, *pid_next; - int nr = 0; + struct pid_link *link; + struct pid *pid; + int tmp; - pid = &task->pids[type]; - if (!hlist_unhashed(&pid->pid_chain)) { + link = &task->pids[type]; + pid = link->pid; - if (list_empty(&pid->pid_list)) { - nr = pid->nr; - hlist_del_rcu(&pid->pid_chain); - } else { - pid_next = list_entry(pid->pid_list.next, - struct pid, pid_list); - /* insert next pid from pid_list to hash */ - hlist_replace_rcu(&pid->pid_chain, - &pid_next->pid_chain); - } - } + hlist_del_rcu(&link->node); + link->pid = NULL; - list_del_rcu(&pid->pid_list); - pid->nr = 0; + for (tmp = PIDTYPE_MAX; --tmp >= 0; ) + if (!hlist_empty(&pid->tasks[tmp])) + return; - return nr; + free_pid(pid); } 
-void fastcall detach_pid(task_t *task, enum pid_type type)
+struct task_struct * fastcall pid_task(struct pid *pid, enum pid_type type)
 {
-	int tmp, nr;
+	struct task_struct *result = NULL;
+	if (pid) {
+		struct hlist_node *first;
+		first = rcu_dereference(pid->tasks[type].first);
+		if (first)
+			result = hlist_entry(first, struct task_struct, pids[(type)].node);
+	}
+	return result;
+}
 
-	nr = __detach_pid(task, type);
-	if (!nr)
-		return;
+/*
+ * Must be called under rcu_read_lock() or with tasklist_lock read-held.
+ */
+task_t *find_task_by_pid_type(int type, int nr)
+{
+	return pid_task(find_pid(nr), type);
+}
 
-	for (tmp = PIDTYPE_MAX; --tmp >= 0; )
-		if (tmp != type && find_pid(tmp, nr))
-			return;
+EXPORT_SYMBOL(find_task_by_pid_type);
 
-	free_pidmap(nr);
+struct task_struct *fastcall get_pid_task(struct pid *pid, enum pid_type type)
+{
+	struct task_struct *result;
+	rcu_read_lock();
+	result = pid_task(pid, type);
+	if (result)
+		get_task_struct(result);
+	rcu_read_unlock();
+	return result;
 }
 
-task_t *find_task_by_pid_type(int type, int nr)
+struct pid *find_get_pid(pid_t nr)
 {
 	struct pid *pid;
 
-	pid = find_pid(type, nr);
-	if (!pid)
-		return NULL;
+	rcu_read_lock();
+	pid = get_pid(find_pid(nr));
+	rcu_read_unlock();
 
-	return pid_task(&pid->pid_list, type);
+	return pid;
 }
 
-EXPORT_SYMBOL(find_task_by_pid_type);
-
 /*
  * The pid hash table is scaled according to the amount of memory in the
  * machine. From a minimum of 16 slots up to 4096 slots at one gigabyte or
@@ -224,7 +303,7 @@ EXPORT_SYMBOL(find_task_by_pid_type);
  */
 void __init pidhash_init(void)
 {
-	int i, j, pidhash_size;
+	int i, pidhash_size;
 	unsigned long megabytes = nr_kernel_pages >> (20 - PAGE_SHIFT);
 
 	pidhash_shift = max(4, fls(megabytes * 4));
@@ -233,16 +312,13 @@ void __init pidhash_init(void)
 
 	printk("PID hash table entries: %d (order: %d, %Zd bytes)\n",
 		pidhash_size, pidhash_shift,
-		PIDTYPE_MAX * pidhash_size * sizeof(struct hlist_head));
-
-	for (i = 0; i < PIDTYPE_MAX; i++) {
-		pid_hash[i] = alloc_bootmem(pidhash_size *
-					sizeof(*(pid_hash[i])));
-		if (!pid_hash[i])
-			panic("Could not alloc pidhash!\n");
-		for (j = 0; j < pidhash_size; j++)
-			INIT_HLIST_HEAD(&pid_hash[i][j]);
-	}
+		pidhash_size * sizeof(struct hlist_head));
+
+	pid_hash = alloc_bootmem(pidhash_size * sizeof(*(pid_hash)));
+	if (!pid_hash)
+		panic("Could not alloc pidhash!\n");
+	for (i = 0; i < pidhash_size; i++)
+		INIT_HLIST_HEAD(&pid_hash[i]);
 }
 
 void __init pidmap_init(void)
@@ -251,4 +327,8 @@ void __init pidmap_init(void)
 	/* Reserve PID 0. We never call free_pidmap(0) */
 	set_bit(0, pidmap_array->page);
 	atomic_dec(&pidmap_array->nr_free);
+
+	pid_cachep = kmem_cache_create("pid", sizeof(struct pid),
+					__alignof__(struct pid),
+					SLAB_PANIC, NULL, NULL);
 }
--
cgit v1.2.3

From 428622986858aebddc32d022af65e88b9d2ea8bb Mon Sep 17 00:00:00 2001
From: Kirill Korotaev
Date: Fri, 31 Mar 2006 17:58:46 +0400
Subject: [PATCH] wrong error path in dup_fd() leading to oopses in RCU

Wrong error path in dup_fd() - it should return NULL on error, not an
address of already freed memory :/

Triggered by OpenVZ stress test suite.

What is interesting is that it was causing different oopses in RCU like
below:

Call Trace:
 [] rcu_do_batch+0x2c/0x80
 [] rcu_process_callbacks+0x3d/0x70
 [] tasklet_action+0x73/0xe0
 [] __do_softirq+0x10a/0x130
 [] do_softirq+0x4f/0x60
 =======================
 [] smp_apic_timer_interrupt+0x77/0x110
 [] apic_timer_interrupt+0x1c/0x24
Code: Bad EIP value.
 <0>Kernel panic - not syncing: Fatal exception in interrupt

Signed-Off-By: Pavel Emelianov
Signed-Off-By: Dmitry Mishin
Signed-Off-By: Kirill Korotaev
Signed-Off-By: Linus Torvalds
---
 kernel/fork.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

(limited to 'kernel')

diff --git a/kernel/fork.c b/kernel/fork.c
index 03975d0467f9..3384eb89cb1c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -725,7 +725,7 @@ out_release:
 	free_fdset (new_fdt->open_fds, new_fdt->max_fdset);
 	free_fd_array(new_fdt->fd, new_fdt->max_fds);
 	kmem_cache_free(files_cachep, newf);
-	goto out;
+	return NULL;
 }
 
 static int copy_files(unsigned long clone_flags, struct task_struct * tsk)
--
cgit v1.2.3
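
Illustration (not part of the patch): the dup_fd() fix comes down to one rule. Once the error path has freed the new object, it must return NULL rather than jump to a shared exit label that returns the now-dangling pointer. The user-space sketch below shows the corrected shape under stated assumptions: `struct table`, `alloc_or_fail()` and `dup_table()` are hypothetical stand-ins for `files_struct`, the kernel allocators and `dup_fd()`, and none of them exist in the kernel tree.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for files_struct; not the real kernel layout. */
struct table {
	int *slots;
	size_t nslots;
};

/* Hypothetical allocator wrapper that can be told to fail, so the
 * error path can be exercised deterministically. */
static void *alloc_or_fail(size_t size, int fail)
{
	return fail ? NULL : malloc(size);
}

/*
 * Duplicate 'old'. On failure, free what was allocated so far and
 * return NULL. Jumping to a shared "out: return copy;" label after
 * free(copy) would hand the caller a pointer to freed memory, which
 * is the class of bug the patch above removes.
 */
struct table *dup_table(const struct table *old, int fail_slots)
{
	struct table *copy;

	assert(old != NULL);

	copy = malloc(sizeof(*copy));
	if (!copy)
		return NULL;

	copy->nslots = old->nslots;
	copy->slots = alloc_or_fail(old->nslots * sizeof(*copy->slots),
				    fail_slots);
	if (!copy->slots)
		goto out_release;

	memcpy(copy->slots, old->slots, old->nslots * sizeof(*copy->slots));
	return copy;

out_release:
	free(copy);
	return NULL;	/* not "goto out": copy is already freed */
}
```

A caller that checks only for NULL, as copy_files() does with dup_fd(), is then safe on both paths: it either gets a fully built copy or NULL, never a freed address.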