From 2b144498350860b6ee9dc57ff27a93ad488de5dc Mon Sep 17 00:00:00 2001
From: Srikar Dronamraju
Date: Thu, 9 Feb 2012 14:56:42 +0530
Subject: uprobes, mm, x86: Add the ability to install and remove uprobes breakpoints

Add uprobes support to the core kernel, with x86 support.

This commit adds the kernel facilities; the actual uprobes user-space
ABI and perf probe support come in later commits.

General design:

Uprobes are maintained in an rb-tree indexed by inode and offset (the
offset here is from the start of the mapping). For a unique
(inode, offset) tuple, there can be at most one uprobe in the rb-tree.

Since the (inode, offset) tuple identifies a unique uprobe, more than
one user may be interested in the same uprobe. This provides the
ability to connect multiple 'consumers' to the same uprobe. Each
consumer defines a handler and an optional filter. The 'handler' is
run every time the uprobe is hit, if it matches the 'filter' criteria.

The first consumer of a uprobe causes the breakpoint to be inserted at
the specified address; subsequent consumers are appended to the
existing list of consumers. The breakpoint is removed when the last
consumer unregisters. For all other unregistrations, the consumer is
simply removed from the list of consumers.

Given an inode, we get a list of the mms that have mapped the inode.
We do the actual registration if the mm maps the page where a probe
needs to be inserted/removed. We use a temporary list to walk through
the vmas that map the inode:

 - The number of maps that map the inode is not known before we walk
   the rmap, and it keeps changing.
 - Extending vm_area_struct wasn't recommended; it's a size-critical
   data structure.
 - There can be more than one map of the inode in the same mm.

We add callbacks to the mmap methods to keep an eye on text vmas that
are of interest to uprobes. When a vma of interest is mapped, we
insert the breakpoint at the right address.

A uprobe works by replacing the instruction at the address defined by
(inode, offset) with the arch-specific breakpoint instruction. We save
a copy of the original instruction at the uprobed address. This is
needed for:

 a. executing the instruction out-of-line (xol).
 b. instruction analysis for any subsequent fixups.
 c. restoring the instruction back when the uprobe is unregistered.

We insert or delete a breakpoint instruction, and this breakpoint
instruction is assumed to be the smallest instruction available on the
platform. For fixed-size instruction platforms this is trivially true;
for variable-size instruction platforms the breakpoint instruction is
typically the smallest (often a single byte).

Writing the instruction is done by COWing the page and changing the
instruction during the copy, even though most platforms allow atomic
writes of the breakpoint instruction. This also mirrors the behaviour
of a ptrace() memory write to a PRIVATE file map.

The core worker is derived from KSM's replace_page() logic. In
essence, similar to KSM:

 a. allocate a new page and copy over the contents of the page that
    has the uprobed vaddr
 b. modify the copy and insert the breakpoint at the required address
 c. switch the original page with the copy containing the breakpoint
 d. flush the page tables.

replace_page() is being replicated here because of some minor changes
in the type of pages and also because Hugh Dickins had plans to
improve replace_page() for KSM-specific work.
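To make the consumer model described above concrete, below is a minimal
sketch of how an in-kernel client might attach to a probe point through
the register_uprobe()/unregister_uprobe() calls added by this patch.
The struct uprobe_consumer layout and the handler/filter prototypes
live in the uprobes header, which is not part of this diff, so the
field names, callback signatures and the way the inode/offset are
obtained are illustrative assumptions rather than the exact ABI.

    /*
     * Illustrative sketch only: field names and callback prototypes of
     * struct uprobe_consumer are assumed, not taken from this diff.
     */
    #include <linux/module.h>
    #include <linux/ptrace.h>
    #include <linux/uprobes.h>

    /* Filled in elsewhere, e.g. by resolving a path to its inode. */
    static struct inode *probe_inode;
    static loff_t probe_offset;

    static int my_handler(struct uprobe_consumer *uc, struct pt_regs *regs)
    {
            /* Runs every time the breakpoint at (inode, offset) is hit. */
            pr_info("uprobe hit, ip=%lx\n", instruction_pointer(regs));
            return 0;
    }

    static struct uprobe_consumer my_consumer = {
            .handler = my_handler,  /* run on every hit */
            .filter  = NULL,        /* optional; NULL means no filtering */
    };

    static int __init my_probe_init(void)
    {
            /*
             * The first consumer for this (inode, offset) inserts the
             * breakpoint; later consumers are appended to the list.
             */
            return register_uprobe(probe_inode, probe_offset, &my_consumer);
    }

    static void __exit my_probe_exit(void)
    {
            /* The last consumer to unregister removes the breakpoint. */
            unregister_uprobe(probe_inode, probe_offset, &my_consumer);
    }

    module_init(my_probe_init);
    module_exit(my_probe_exit);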
Instruction analysis on x86 is based on the instruction decoder; it
determines whether an instruction can be probed and which fixups are
necessary after single-stepping. Instruction analysis is done at probe
insertion time so that we avoid having to repeat the same analysis
every time a probe is hit.

A lot of the code here is due to improvements/suggestions/input from
Peter Zijlstra.

Changelog:

(v10):
 - Add code to clear the REX.B prefix as suggested by Denys Vlasenko
   and Masami Hiramatsu.

(v9):
 - Use insn_offset_modrm as suggested by Masami Hiramatsu.

(v7): Handle comments from Peter Zijlstra:
 - Don't take a reference to the inode (expect the inode passed to
   uprobe_register to be sane).
 - Use PTR_ERR to set the return value.
 - No need to take a reference to the inode.
 - Use PTR_ERR to return the error value.
 - register and uprobe_unregister share code.

(v5):
 - Modified del_consumer as per comments from Peter.
 - Drop the reference to the inode before dropping the reference to
   the uprobe.
 - Use i_size_read(inode) instead of inode->i_size.
 - Ensure uprobe->consumers is NULL before __uprobe_unregister() is
   called.
 - Include errno.h as recommended by Stephen Rothwell to fix a build
   issue on sparc defconfig.
 - Remove restrictions while unregistering.
 - Earlier code leaked inode references under some conditions while
   registering/unregistering.
 - Continue the vma-rmap walk even if an intermediate vma doesn't meet
   the requirements.
 - Validate the vma found by find_vma before inserting/removing the
   breakpoint.
 - Call del_consumer under mutex_lock.
 - Use hash locks.
 - Handle mremap.
 - Introduce find_least_offset_node() instead of the close-match logic
   in find_uprobe.
 - Uprobes no longer depends on MM_OWNER; no reference to task_structs
   while inserting/removing a probe.
 - Use read_mapping_page instead of grab_cache_page so that the pages
   have valid content.
 - Pass NULL to get_user_pages for the task parameter.
 - Call SetPageUptodate on the new page allocated in write_opcode.
 - Fix leaking a reference to the new page under certain conditions.
 - Include the instruction decoder if uprobes gets defined.
 - Remove const attributes for the instruction prefix arrays.
 - Use mm_context to know if the application is 32-bit.
Signed-off-by: Srikar Dronamraju Also-written-by: Jim Keniston Reviewed-by: Peter Zijlstra Cc: Oleg Nesterov Cc: Andi Kleen Cc: Christoph Hellwig Cc: Steven Rostedt Cc: Roland McGrath Cc: Masami Hiramatsu Cc: Arnaldo Carvalho de Melo Cc: Anton Arapov Cc: Ananth N Mavinakayanahalli Cc: Stephen Rothwell Cc: Denys Vlasenko Cc: Peter Zijlstra Cc: Linus Torvalds Cc: Andrew Morton Cc: Linux-mm Link: http://lkml.kernel.org/r/20120209092642.GE16600@linux.vnet.ibm.com [ Made various small edits to the commit log ] Signed-off-by: Ingo Molnar --- kernel/Makefile | 1 + kernel/uprobes.c | 976 +++++++++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 977 insertions(+) create mode 100644 kernel/uprobes.c (limited to 'kernel') diff --git a/kernel/Makefile b/kernel/Makefile index 2d9de86b7e76..8609dd3d875a 100644 --- a/kernel/Makefile +++ b/kernel/Makefile @@ -107,6 +107,7 @@ obj-$(CONFIG_USER_RETURN_NOTIFIER) += user-return-notifier.o obj-$(CONFIG_PADATA) += padata.o obj-$(CONFIG_CRASH_DUMP) += crash_dump.o obj-$(CONFIG_JUMP_LABEL) += jump_label.o +obj-$(CONFIG_UPROBES) += uprobes.o $(obj)/configs.o: $(obj)/config_data.h diff --git a/kernel/uprobes.c b/kernel/uprobes.c new file mode 100644 index 000000000000..72e8bb3b52cd --- /dev/null +++ b/kernel/uprobes.c @@ -0,0 +1,976 @@ +/* + * Userspace Probes (UProbes) + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. + * + * Copyright (C) IBM Corporation, 2008-2011 + * Authors: + * Srikar Dronamraju + * Jim Keniston + */ + +#include +#include +#include /* read_mapping_page */ +#include +#include +#include /* anon_vma_prepare */ +#include /* set_pte_at_notify */ +#include /* try_to_free_swap */ +#include + +static struct rb_root uprobes_tree = RB_ROOT; +static DEFINE_SPINLOCK(uprobes_treelock); /* serialize rbtree access */ + +#define UPROBES_HASH_SZ 13 +/* serialize (un)register */ +static struct mutex uprobes_mutex[UPROBES_HASH_SZ]; +#define uprobes_hash(v) (&uprobes_mutex[((unsigned long)(v)) %\ + UPROBES_HASH_SZ]) + +/* serialize uprobe->pending_list */ +static struct mutex uprobes_mmap_mutex[UPROBES_HASH_SZ]; +#define uprobes_mmap_hash(v) (&uprobes_mmap_mutex[((unsigned long)(v)) %\ + UPROBES_HASH_SZ]) + +/* + * uprobe_events allows us to skip the mmap_uprobe if there are no uprobe + * events active at this time. Probably a fine grained per inode count is + * better? + */ +static atomic_t uprobe_events = ATOMIC_INIT(0); + +/* + * Maintain a temporary per vma info that can be used to search if a vma + * has already been handled. This structure is introduced since extending + * vm_area_struct wasnt recommended. 
+ */ +struct vma_info { + struct list_head probe_list; + struct mm_struct *mm; + loff_t vaddr; +}; + +/* + * valid_vma: Verify if the specified vma is an executable vma + * Relax restrictions while unregistering: vm_flags might have + * changed after breakpoint was inserted. + * - is_register: indicates if we are in register context. + * - Return 1 if the specified virtual address is in an + * executable vma. + */ +static bool valid_vma(struct vm_area_struct *vma, bool is_register) +{ + if (!vma->vm_file) + return false; + + if (!is_register) + return true; + + if ((vma->vm_flags & (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)) == + (VM_READ|VM_EXEC)) + return true; + + return false; +} + +static loff_t vma_address(struct vm_area_struct *vma, loff_t offset) +{ + loff_t vaddr; + + vaddr = vma->vm_start + offset; + vaddr -= vma->vm_pgoff << PAGE_SHIFT; + return vaddr; +} + +/** + * __replace_page - replace page in vma by new page. + * based on replace_page in mm/ksm.c + * + * @vma: vma that holds the pte pointing to page + * @page: the cowed page we are replacing by kpage + * @kpage: the modified page we replace page by + * + * Returns 0 on success, -EFAULT on failure. + */ +static int __replace_page(struct vm_area_struct *vma, struct page *page, + struct page *kpage) +{ + struct mm_struct *mm = vma->vm_mm; + pgd_t *pgd; + pud_t *pud; + pmd_t *pmd; + pte_t *ptep; + spinlock_t *ptl; + unsigned long addr; + int err = -EFAULT; + + addr = page_address_in_vma(page, vma); + if (addr == -EFAULT) + goto out; + + pgd = pgd_offset(mm, addr); + if (!pgd_present(*pgd)) + goto out; + + pud = pud_offset(pgd, addr); + if (!pud_present(*pud)) + goto out; + + pmd = pmd_offset(pud, addr); + if (!pmd_present(*pmd)) + goto out; + + ptep = pte_offset_map_lock(mm, pmd, addr, &ptl); + if (!ptep) + goto out; + + get_page(kpage); + page_add_new_anon_rmap(kpage, vma, addr); + + flush_cache_page(vma, addr, pte_pfn(*ptep)); + ptep_clear_flush(vma, addr, ptep); + set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot)); + + page_remove_rmap(page); + if (!page_mapped(page)) + try_to_free_swap(page); + put_page(page); + pte_unmap_unlock(ptep, ptl); + err = 0; + +out: + return err; +} + +/** + * is_bkpt_insn - check if instruction is breakpoint instruction. + * @insn: instruction to be checked. + * Default implementation of is_bkpt_insn + * Returns true if @insn is a breakpoint instruction. + */ +bool __weak is_bkpt_insn(uprobe_opcode_t *insn) +{ + return (*insn == UPROBES_BKPT_INSN); +} + +/* + * NOTE: + * Expect the breakpoint instruction to be the smallest size instruction for + * the architecture. If an arch has variable length instruction and the + * breakpoint instruction is not of the smallest length instruction + * supported by that architecture then we need to modify read_opcode / + * write_opcode accordingly. This would never be a problem for archs that + * have fixed length instructions. + */ + +/* + * write_opcode - write the opcode at a given virtual address. + * @mm: the probed process address space. + * @uprobe: the breakpointing information. + * @vaddr: the virtual address to store the opcode. + * @opcode: opcode to be written at @vaddr. + * + * Called with mm->mmap_sem held (for read and with a reference to + * mm). + * + * For mm @mm, write the opcode at @vaddr. + * Return 0 (success) or a negative errno. 
+ */ +static int write_opcode(struct mm_struct *mm, struct uprobe *uprobe, + unsigned long vaddr, uprobe_opcode_t opcode) +{ + struct page *old_page, *new_page; + struct address_space *mapping; + void *vaddr_old, *vaddr_new; + struct vm_area_struct *vma; + loff_t addr; + int ret; + + /* Read the page with vaddr into memory */ + ret = get_user_pages(NULL, mm, vaddr, 1, 0, 0, &old_page, &vma); + if (ret <= 0) + return ret; + ret = -EINVAL; + + /* + * We are interested in text pages only. Our pages of interest + * should be mapped for read and execute only. We desist from + * adding probes in write mapped pages since the breakpoints + * might end up in the file copy. + */ + if (!valid_vma(vma, is_bkpt_insn(&opcode))) + goto put_out; + + mapping = uprobe->inode->i_mapping; + if (mapping != vma->vm_file->f_mapping) + goto put_out; + + addr = vma_address(vma, uprobe->offset); + if (vaddr != (unsigned long)addr) + goto put_out; + + ret = -ENOMEM; + new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vaddr); + if (!new_page) + goto put_out; + + __SetPageUptodate(new_page); + + /* + * lock page will serialize against do_wp_page()'s + * PageAnon() handling + */ + lock_page(old_page); + /* copy the page now that we've got it stable */ + vaddr_old = kmap_atomic(old_page); + vaddr_new = kmap_atomic(new_page); + + memcpy(vaddr_new, vaddr_old, PAGE_SIZE); + /* poke the new insn in, ASSUMES we don't cross page boundary */ + vaddr &= ~PAGE_MASK; + BUG_ON(vaddr + uprobe_opcode_sz > PAGE_SIZE); + memcpy(vaddr_new + vaddr, &opcode, uprobe_opcode_sz); + + kunmap_atomic(vaddr_new); + kunmap_atomic(vaddr_old); + + ret = anon_vma_prepare(vma); + if (ret) + goto unlock_out; + + lock_page(new_page); + ret = __replace_page(vma, old_page, new_page); + unlock_page(new_page); + +unlock_out: + unlock_page(old_page); + page_cache_release(new_page); + +put_out: + put_page(old_page); /* we did a get_page in the beginning */ + return ret; +} + +/** + * read_opcode - read the opcode at a given virtual address. + * @mm: the probed process address space. + * @vaddr: the virtual address to read the opcode. + * @opcode: location to store the read opcode. + * + * Called with mm->mmap_sem held (for read and with a reference to + * mm. + * + * For mm @mm, read the opcode at @vaddr and store it in @opcode. + * Return 0 (success) or a negative errno. + */ +static int read_opcode(struct mm_struct *mm, unsigned long vaddr, + uprobe_opcode_t *opcode) +{ + struct page *page; + void *vaddr_new; + int ret; + + ret = get_user_pages(NULL, mm, vaddr, 1, 0, 0, &page, NULL); + if (ret <= 0) + return ret; + + lock_page(page); + vaddr_new = kmap_atomic(page); + vaddr &= ~PAGE_MASK; + memcpy(opcode, vaddr_new + vaddr, uprobe_opcode_sz); + kunmap_atomic(vaddr_new); + unlock_page(page); + put_page(page); /* we did a get_user_pages in the beginning */ + return 0; +} + +static int is_bkpt_at_addr(struct mm_struct *mm, unsigned long vaddr) +{ + uprobe_opcode_t opcode; + int result = read_opcode(mm, vaddr, &opcode); + + if (result) + return result; + + if (is_bkpt_insn(&opcode)) + return 1; + + return 0; +} + +/** + * set_bkpt - store breakpoint at a given address. + * @mm: the probed process address space. + * @uprobe: the probepoint information. + * @vaddr: the virtual address to insert the opcode. + * + * For mm @mm, store the breakpoint instruction at @vaddr. + * Return 0 (success) or a negative errno. 
+ */ +int __weak set_bkpt(struct mm_struct *mm, struct uprobe *uprobe, + unsigned long vaddr) +{ + int result = is_bkpt_at_addr(mm, vaddr); + + if (result == 1) + return -EEXIST; + + if (result) + return result; + + return write_opcode(mm, uprobe, vaddr, UPROBES_BKPT_INSN); +} + +/** + * set_orig_insn - Restore the original instruction. + * @mm: the probed process address space. + * @uprobe: the probepoint information. + * @vaddr: the virtual address to insert the opcode. + * @verify: if true, verify existance of breakpoint instruction. + * + * For mm @mm, restore the original opcode (opcode) at @vaddr. + * Return 0 (success) or a negative errno. + */ +int __weak set_orig_insn(struct mm_struct *mm, struct uprobe *uprobe, + unsigned long vaddr, bool verify) +{ + if (verify) { + int result = is_bkpt_at_addr(mm, vaddr); + + if (!result) + return -EINVAL; + + if (result != 1) + return result; + } + return write_opcode(mm, uprobe, vaddr, + *(uprobe_opcode_t *)uprobe->insn); +} + +static int match_uprobe(struct uprobe *l, struct uprobe *r) +{ + if (l->inode < r->inode) + return -1; + if (l->inode > r->inode) + return 1; + else { + if (l->offset < r->offset) + return -1; + + if (l->offset > r->offset) + return 1; + } + + return 0; +} + +static struct uprobe *__find_uprobe(struct inode *inode, loff_t offset) +{ + struct uprobe u = { .inode = inode, .offset = offset }; + struct rb_node *n = uprobes_tree.rb_node; + struct uprobe *uprobe; + int match; + + while (n) { + uprobe = rb_entry(n, struct uprobe, rb_node); + match = match_uprobe(&u, uprobe); + if (!match) { + atomic_inc(&uprobe->ref); + return uprobe; + } + if (match < 0) + n = n->rb_left; + else + n = n->rb_right; + } + return NULL; +} + +/* + * Find a uprobe corresponding to a given inode:offset + * Acquires uprobes_treelock + */ +static struct uprobe *find_uprobe(struct inode *inode, loff_t offset) +{ + struct uprobe *uprobe; + unsigned long flags; + + spin_lock_irqsave(&uprobes_treelock, flags); + uprobe = __find_uprobe(inode, offset); + spin_unlock_irqrestore(&uprobes_treelock, flags); + return uprobe; +} + +static struct uprobe *__insert_uprobe(struct uprobe *uprobe) +{ + struct rb_node **p = &uprobes_tree.rb_node; + struct rb_node *parent = NULL; + struct uprobe *u; + int match; + + while (*p) { + parent = *p; + u = rb_entry(parent, struct uprobe, rb_node); + match = match_uprobe(uprobe, u); + if (!match) { + atomic_inc(&u->ref); + return u; + } + + if (match < 0) + p = &parent->rb_left; + else + p = &parent->rb_right; + + } + u = NULL; + rb_link_node(&uprobe->rb_node, parent, p); + rb_insert_color(&uprobe->rb_node, &uprobes_tree); + /* get access + creation ref */ + atomic_set(&uprobe->ref, 2); + return u; +} + +/* + * Acquires uprobes_treelock. + * Matching uprobe already exists in rbtree; + * increment (access refcount) and return the matching uprobe. + * + * No matching uprobe; insert the uprobe in rb_tree; + * get a double refcount (access + creation) and return NULL. 
+ */ +static struct uprobe *insert_uprobe(struct uprobe *uprobe) +{ + unsigned long flags; + struct uprobe *u; + + spin_lock_irqsave(&uprobes_treelock, flags); + u = __insert_uprobe(uprobe); + spin_unlock_irqrestore(&uprobes_treelock, flags); + return u; +} + +static void put_uprobe(struct uprobe *uprobe) +{ + if (atomic_dec_and_test(&uprobe->ref)) + kfree(uprobe); +} + +static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset) +{ + struct uprobe *uprobe, *cur_uprobe; + + uprobe = kzalloc(sizeof(struct uprobe), GFP_KERNEL); + if (!uprobe) + return NULL; + + uprobe->inode = igrab(inode); + uprobe->offset = offset; + init_rwsem(&uprobe->consumer_rwsem); + INIT_LIST_HEAD(&uprobe->pending_list); + + /* add to uprobes_tree, sorted on inode:offset */ + cur_uprobe = insert_uprobe(uprobe); + + /* a uprobe exists for this inode:offset combination */ + if (cur_uprobe) { + kfree(uprobe); + uprobe = cur_uprobe; + iput(inode); + } else + atomic_inc(&uprobe_events); + return uprobe; +} + +/* Returns the previous consumer */ +static struct uprobe_consumer *add_consumer(struct uprobe *uprobe, + struct uprobe_consumer *consumer) +{ + down_write(&uprobe->consumer_rwsem); + consumer->next = uprobe->consumers; + uprobe->consumers = consumer; + up_write(&uprobe->consumer_rwsem); + return consumer->next; +} + +/* + * For uprobe @uprobe, delete the consumer @consumer. + * Return true if the @consumer is deleted successfully + * or return false. + */ +static bool del_consumer(struct uprobe *uprobe, + struct uprobe_consumer *consumer) +{ + struct uprobe_consumer **con; + bool ret = false; + + down_write(&uprobe->consumer_rwsem); + for (con = &uprobe->consumers; *con; con = &(*con)->next) { + if (*con == consumer) { + *con = consumer->next; + ret = true; + break; + } + } + up_write(&uprobe->consumer_rwsem); + return ret; +} + +static int __copy_insn(struct address_space *mapping, + struct vm_area_struct *vma, char *insn, + unsigned long nbytes, unsigned long offset) +{ + struct file *filp = vma->vm_file; + struct page *page; + void *vaddr; + unsigned long off1; + unsigned long idx; + + if (!filp) + return -EINVAL; + + idx = (unsigned long)(offset >> PAGE_CACHE_SHIFT); + off1 = offset &= ~PAGE_MASK; + + /* + * Ensure that the page that has the original instruction is + * populated and in page-cache. 
+ */ + page = read_mapping_page(mapping, idx, filp); + if (IS_ERR(page)) + return PTR_ERR(page); + + vaddr = kmap_atomic(page); + memcpy(insn, vaddr + off1, nbytes); + kunmap_atomic(vaddr); + page_cache_release(page); + return 0; +} + +static int copy_insn(struct uprobe *uprobe, struct vm_area_struct *vma, + unsigned long addr) +{ + struct address_space *mapping; + int bytes; + unsigned long nbytes; + + addr &= ~PAGE_MASK; + nbytes = PAGE_SIZE - addr; + mapping = uprobe->inode->i_mapping; + + /* Instruction at end of binary; copy only available bytes */ + if (uprobe->offset + MAX_UINSN_BYTES > uprobe->inode->i_size) + bytes = uprobe->inode->i_size - uprobe->offset; + else + bytes = MAX_UINSN_BYTES; + + /* Instruction at the page-boundary; copy bytes in second page */ + if (nbytes < bytes) { + if (__copy_insn(mapping, vma, uprobe->insn + nbytes, + bytes - nbytes, uprobe->offset + nbytes)) + return -ENOMEM; + + bytes = nbytes; + } + return __copy_insn(mapping, vma, uprobe->insn, bytes, uprobe->offset); +} + +static int install_breakpoint(struct mm_struct *mm, struct uprobe *uprobe, + struct vm_area_struct *vma, loff_t vaddr) +{ + unsigned long addr; + int ret; + + /* + * If probe is being deleted, unregister thread could be done with + * the vma-rmap-walk through. Adding a probe now can be fatal since + * nobody will be able to cleanup. Also we could be from fork or + * mremap path, where the probe might have already been inserted. + * Hence behave as if probe already existed. + */ + if (!uprobe->consumers) + return -EEXIST; + + addr = (unsigned long)vaddr; + if (!(uprobe->flags & UPROBES_COPY_INSN)) { + ret = copy_insn(uprobe, vma, addr); + if (ret) + return ret; + + if (is_bkpt_insn((uprobe_opcode_t *)uprobe->insn)) + return -EEXIST; + + ret = analyze_insn(mm, uprobe); + if (ret) + return ret; + + uprobe->flags |= UPROBES_COPY_INSN; + } + ret = set_bkpt(mm, uprobe, addr); + + return ret; +} + +static void remove_breakpoint(struct mm_struct *mm, struct uprobe *uprobe, + loff_t vaddr) +{ + set_orig_insn(mm, uprobe, (unsigned long)vaddr, true); +} + +static void delete_uprobe(struct uprobe *uprobe) +{ + unsigned long flags; + + spin_lock_irqsave(&uprobes_treelock, flags); + rb_erase(&uprobe->rb_node, &uprobes_tree); + spin_unlock_irqrestore(&uprobes_treelock, flags); + iput(uprobe->inode); + put_uprobe(uprobe); + atomic_dec(&uprobe_events); +} + +static struct vma_info *__find_next_vma_info(struct list_head *head, + loff_t offset, struct address_space *mapping, + struct vma_info *vi, bool is_register) +{ + struct prio_tree_iter iter; + struct vm_area_struct *vma; + struct vma_info *tmpvi; + loff_t vaddr; + unsigned long pgoff = offset >> PAGE_SHIFT; + int existing_vma; + + vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) { + if (!valid_vma(vma, is_register)) + continue; + + existing_vma = 0; + vaddr = vma_address(vma, offset); + list_for_each_entry(tmpvi, head, probe_list) { + if (tmpvi->mm == vma->vm_mm && tmpvi->vaddr == vaddr) { + existing_vma = 1; + break; + } + } + + /* + * Another vma needs a probe to be installed. However skip + * installing the probe if the vma is about to be unlinked. + */ + if (!existing_vma && + atomic_inc_not_zero(&vma->vm_mm->mm_users)) { + vi->mm = vma->vm_mm; + vi->vaddr = vaddr; + list_add(&vi->probe_list, head); + return vi; + } + } + return NULL; +} + +/* + * Iterate in the rmap prio tree and find a vma where a probe has not + * yet been inserted. 
+ */ +static struct vma_info *find_next_vma_info(struct list_head *head, + loff_t offset, struct address_space *mapping, + bool is_register) +{ + struct vma_info *vi, *retvi; + vi = kzalloc(sizeof(struct vma_info), GFP_KERNEL); + if (!vi) + return ERR_PTR(-ENOMEM); + + mutex_lock(&mapping->i_mmap_mutex); + retvi = __find_next_vma_info(head, offset, mapping, vi, is_register); + mutex_unlock(&mapping->i_mmap_mutex); + + if (!retvi) + kfree(vi); + return retvi; +} + +static int register_for_each_vma(struct uprobe *uprobe, bool is_register) +{ + struct list_head try_list; + struct vm_area_struct *vma; + struct address_space *mapping; + struct vma_info *vi, *tmpvi; + struct mm_struct *mm; + loff_t vaddr; + int ret = 0; + + mapping = uprobe->inode->i_mapping; + INIT_LIST_HEAD(&try_list); + while ((vi = find_next_vma_info(&try_list, uprobe->offset, + mapping, is_register)) != NULL) { + if (IS_ERR(vi)) { + ret = PTR_ERR(vi); + break; + } + mm = vi->mm; + down_read(&mm->mmap_sem); + vma = find_vma(mm, (unsigned long)vi->vaddr); + if (!vma || !valid_vma(vma, is_register)) { + list_del(&vi->probe_list); + kfree(vi); + up_read(&mm->mmap_sem); + mmput(mm); + continue; + } + vaddr = vma_address(vma, uprobe->offset); + if (vma->vm_file->f_mapping->host != uprobe->inode || + vaddr != vi->vaddr) { + list_del(&vi->probe_list); + kfree(vi); + up_read(&mm->mmap_sem); + mmput(mm); + continue; + } + + if (is_register) + ret = install_breakpoint(mm, uprobe, vma, vi->vaddr); + else + remove_breakpoint(mm, uprobe, vi->vaddr); + + up_read(&mm->mmap_sem); + mmput(mm); + if (is_register) { + if (ret && ret == -EEXIST) + ret = 0; + if (ret) + break; + } + } + list_for_each_entry_safe(vi, tmpvi, &try_list, probe_list) { + list_del(&vi->probe_list); + kfree(vi); + } + return ret; +} + +static int __register_uprobe(struct uprobe *uprobe) +{ + return register_for_each_vma(uprobe, true); +} + +static void __unregister_uprobe(struct uprobe *uprobe) +{ + if (!register_for_each_vma(uprobe, false)) + delete_uprobe(uprobe); + + /* TODO : cant unregister? schedule a worker thread */ +} + +/* + * register_uprobe - register a probe + * @inode: the file in which the probe has to be placed. + * @offset: offset from the start of the file. + * @consumer: information on howto handle the probe.. + * + * Apart from the access refcount, register_uprobe() takes a creation + * refcount (thro alloc_uprobe) if and only if this @uprobe is getting + * inserted into the rbtree (i.e first consumer for a @inode:@offset + * tuple). Creation refcount stops unregister_uprobe from freeing the + * @uprobe even before the register operation is complete. Creation + * refcount is released when the last @consumer for the @uprobe + * unregisters. + * + * Return errno if it cannot successully install probes + * else return 0 (success) + */ +int register_uprobe(struct inode *inode, loff_t offset, + struct uprobe_consumer *consumer) +{ + struct uprobe *uprobe; + int ret = -EINVAL; + + if (!inode || !consumer || consumer->next) + return ret; + + if (offset > i_size_read(inode)) + return ret; + + ret = 0; + mutex_lock(uprobes_hash(inode)); + uprobe = alloc_uprobe(inode, offset); + if (uprobe && !add_consumer(uprobe, consumer)) { + ret = __register_uprobe(uprobe); + if (ret) { + uprobe->consumers = NULL; + __unregister_uprobe(uprobe); + } else + uprobe->flags |= UPROBES_RUN_HANDLER; + } + + mutex_unlock(uprobes_hash(inode)); + put_uprobe(uprobe); + + return ret; +} + +/* + * unregister_uprobe - unregister a already registered probe. 
+ * @inode: the file in which the probe has to be removed. + * @offset: offset from the start of the file. + * @consumer: identify which probe if multiple probes are colocated. + */ +void unregister_uprobe(struct inode *inode, loff_t offset, + struct uprobe_consumer *consumer) +{ + struct uprobe *uprobe = NULL; + + if (!inode || !consumer) + return; + + uprobe = find_uprobe(inode, offset); + if (!uprobe) + return; + + mutex_lock(uprobes_hash(inode)); + if (!del_consumer(uprobe, consumer)) + goto unreg_out; + + if (!uprobe->consumers) { + __unregister_uprobe(uprobe); + uprobe->flags &= ~UPROBES_RUN_HANDLER; + } + +unreg_out: + mutex_unlock(uprobes_hash(inode)); + if (uprobe) + put_uprobe(uprobe); +} + +/* + * Of all the nodes that correspond to the given inode, return the node + * with the least offset. + */ +static struct rb_node *find_least_offset_node(struct inode *inode) +{ + struct uprobe u = { .inode = inode, .offset = 0}; + struct rb_node *n = uprobes_tree.rb_node; + struct rb_node *close_node = NULL; + struct uprobe *uprobe; + int match; + + while (n) { + uprobe = rb_entry(n, struct uprobe, rb_node); + match = match_uprobe(&u, uprobe); + if (uprobe->inode == inode) + close_node = n; + + if (!match) + return close_node; + + if (match < 0) + n = n->rb_left; + else + n = n->rb_right; + } + return close_node; +} + +/* + * For a given inode, build a list of probes that need to be inserted. + */ +static void build_probe_list(struct inode *inode, struct list_head *head) +{ + struct uprobe *uprobe; + struct rb_node *n; + unsigned long flags; + + spin_lock_irqsave(&uprobes_treelock, flags); + n = find_least_offset_node(inode); + for (; n; n = rb_next(n)) { + uprobe = rb_entry(n, struct uprobe, rb_node); + if (uprobe->inode != inode) + break; + + list_add(&uprobe->pending_list, head); + atomic_inc(&uprobe->ref); + } + spin_unlock_irqrestore(&uprobes_treelock, flags); +} + +/* + * Called from mmap_region. + * called with mm->mmap_sem acquired. + * + * Return -ve no if we fail to insert probes and we cannot + * bail-out. + * Return 0 otherwise. i.e : + * - successful insertion of probes + * - (or) no possible probes to be inserted. + * - (or) insertion of probes failed but we can bail-out. 
+ */ +int mmap_uprobe(struct vm_area_struct *vma) +{ + struct list_head tmp_list; + struct uprobe *uprobe, *u; + struct inode *inode; + int ret = 0; + + if (!atomic_read(&uprobe_events) || !valid_vma(vma, true)) + return ret; /* Bail-out */ + + inode = vma->vm_file->f_mapping->host; + if (!inode) + return ret; + + INIT_LIST_HEAD(&tmp_list); + mutex_lock(uprobes_mmap_hash(inode)); + build_probe_list(inode, &tmp_list); + list_for_each_entry_safe(uprobe, u, &tmp_list, pending_list) { + loff_t vaddr; + + list_del(&uprobe->pending_list); + if (!ret) { + vaddr = vma_address(vma, uprobe->offset); + if (vaddr < vma->vm_start || vaddr >= vma->vm_end) { + put_uprobe(uprobe); + continue; + } + ret = install_breakpoint(vma->vm_mm, uprobe, vma, + vaddr); + if (ret == -EEXIST) + ret = 0; + } + put_uprobe(uprobe); + } + + mutex_unlock(uprobes_mmap_hash(inode)); + + return ret; +} + +static int __init init_uprobes(void) +{ + int i; + + for (i = 0; i < UPROBES_HASH_SZ; i++) { + mutex_init(&uprobes_mutex[i]); + mutex_init(&uprobes_mmap_mutex[i]); + } + return 0; +} + +static void __exit exit_uprobes(void) +{ +} + +module_init(init_uprobes); +module_exit(exit_uprobes); -- cgit v1.2.3 From 7b2d81d48a2d8e37efb6ce7b4d5ef58822b30d89 Mon Sep 17 00:00:00 2001 From: Ingo Molnar Date: Fri, 17 Feb 2012 09:27:41 +0100 Subject: uprobes/core: Clean up, refactor and improve the code Make the uprobes code readable to me: - improve the Kconfig text so that a mere mortal gets some idea what CONFIG_UPROBES=y is really about - do trivial renames to standardize around the uprobes_*() namespace - clean up and simplify various code flow details - separate basic blocks of functionality - line break artifact and white space related removal - use standard local varible definition blocks - use vertical spacing to make things more readable - remove unnecessary volatile - restructure comment blocks to make them more uniform and more readable in general Cc: Srikar Dronamraju Cc: Jim Keniston Cc: Peter Zijlstra Cc: Oleg Nesterov Cc: Masami Hiramatsu Cc: Arnaldo Carvalho de Melo Cc: Anton Arapov Cc: Ananth N Mavinakayanahalli Link: http://lkml.kernel.org/n/tip-ewbwhb8o6navvllsauu7k07p@git.kernel.org Signed-off-by: Ingo Molnar --- kernel/uprobes.c | 219 ++++++++++++++++++++++++++++++++----------------------- 1 file changed, 127 insertions(+), 92 deletions(-) (limited to 'kernel') diff --git a/kernel/uprobes.c b/kernel/uprobes.c index 72e8bb3b52cd..884817f1b0d3 100644 --- a/kernel/uprobes.c +++ b/kernel/uprobes.c @@ -1,5 +1,5 @@ /* - * Userspace Probes (UProbes) + * User-space Probes (UProbes) * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by @@ -29,24 +29,26 @@ #include /* anon_vma_prepare */ #include /* set_pte_at_notify */ #include /* try_to_free_swap */ + #include static struct rb_root uprobes_tree = RB_ROOT; + static DEFINE_SPINLOCK(uprobes_treelock); /* serialize rbtree access */ #define UPROBES_HASH_SZ 13 + /* serialize (un)register */ static struct mutex uprobes_mutex[UPROBES_HASH_SZ]; -#define uprobes_hash(v) (&uprobes_mutex[((unsigned long)(v)) %\ - UPROBES_HASH_SZ]) + +#define uprobes_hash(v) (&uprobes_mutex[((unsigned long)(v)) % UPROBES_HASH_SZ]) /* serialize uprobe->pending_list */ static struct mutex uprobes_mmap_mutex[UPROBES_HASH_SZ]; -#define uprobes_mmap_hash(v) (&uprobes_mmap_mutex[((unsigned long)(v)) %\ - UPROBES_HASH_SZ]) +#define uprobes_mmap_hash(v) (&uprobes_mmap_mutex[((unsigned long)(v)) % UPROBES_HASH_SZ]) /* - * 
uprobe_events allows us to skip the mmap_uprobe if there are no uprobe + * uprobe_events allows us to skip the uprobe_mmap if there are no uprobe * events active at this time. Probably a fine grained per inode count is * better? */ @@ -58,9 +60,9 @@ static atomic_t uprobe_events = ATOMIC_INIT(0); * vm_area_struct wasnt recommended. */ struct vma_info { - struct list_head probe_list; - struct mm_struct *mm; - loff_t vaddr; + struct list_head probe_list; + struct mm_struct *mm; + loff_t vaddr; }; /* @@ -79,8 +81,7 @@ static bool valid_vma(struct vm_area_struct *vma, bool is_register) if (!is_register) return true; - if ((vma->vm_flags & (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)) == - (VM_READ|VM_EXEC)) + if ((vma->vm_flags & (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)) == (VM_READ|VM_EXEC)) return true; return false; @@ -92,6 +93,7 @@ static loff_t vma_address(struct vm_area_struct *vma, loff_t offset) vaddr = vma->vm_start + offset; vaddr -= vma->vm_pgoff << PAGE_SHIFT; + return vaddr; } @@ -105,8 +107,7 @@ static loff_t vma_address(struct vm_area_struct *vma, loff_t offset) * * Returns 0 on success, -EFAULT on failure. */ -static int __replace_page(struct vm_area_struct *vma, struct page *page, - struct page *kpage) +static int __replace_page(struct vm_area_struct *vma, struct page *page, struct page *kpage) { struct mm_struct *mm = vma->vm_mm; pgd_t *pgd; @@ -163,7 +164,7 @@ out: */ bool __weak is_bkpt_insn(uprobe_opcode_t *insn) { - return (*insn == UPROBES_BKPT_INSN); + return *insn == UPROBES_BKPT_INSN; } /* @@ -203,6 +204,7 @@ static int write_opcode(struct mm_struct *mm, struct uprobe *uprobe, ret = get_user_pages(NULL, mm, vaddr, 1, 0, 0, &old_page, &vma); if (ret <= 0) return ret; + ret = -EINVAL; /* @@ -239,6 +241,7 @@ static int write_opcode(struct mm_struct *mm, struct uprobe *uprobe, vaddr_new = kmap_atomic(new_page); memcpy(vaddr_new, vaddr_old, PAGE_SIZE); + /* poke the new insn in, ASSUMES we don't cross page boundary */ vaddr &= ~PAGE_MASK; BUG_ON(vaddr + uprobe_opcode_sz > PAGE_SIZE); @@ -260,7 +263,8 @@ unlock_out: page_cache_release(new_page); put_out: - put_page(old_page); /* we did a get_page in the beginning */ + put_page(old_page); + return ret; } @@ -276,8 +280,7 @@ put_out: * For mm @mm, read the opcode at @vaddr and store it in @opcode. * Return 0 (success) or a negative errno. */ -static int read_opcode(struct mm_struct *mm, unsigned long vaddr, - uprobe_opcode_t *opcode) +static int read_opcode(struct mm_struct *mm, unsigned long vaddr, uprobe_opcode_t *opcode) { struct page *page; void *vaddr_new; @@ -293,15 +296,18 @@ static int read_opcode(struct mm_struct *mm, unsigned long vaddr, memcpy(opcode, vaddr_new + vaddr, uprobe_opcode_sz); kunmap_atomic(vaddr_new); unlock_page(page); - put_page(page); /* we did a get_user_pages in the beginning */ + + put_page(page); + return 0; } static int is_bkpt_at_addr(struct mm_struct *mm, unsigned long vaddr) { uprobe_opcode_t opcode; - int result = read_opcode(mm, vaddr, &opcode); + int result; + result = read_opcode(mm, vaddr, &opcode); if (result) return result; @@ -320,11 +326,11 @@ static int is_bkpt_at_addr(struct mm_struct *mm, unsigned long vaddr) * For mm @mm, store the breakpoint instruction at @vaddr. * Return 0 (success) or a negative errno. 
*/ -int __weak set_bkpt(struct mm_struct *mm, struct uprobe *uprobe, - unsigned long vaddr) +int __weak set_bkpt(struct mm_struct *mm, struct uprobe *uprobe, unsigned long vaddr) { - int result = is_bkpt_at_addr(mm, vaddr); + int result; + result = is_bkpt_at_addr(mm, vaddr); if (result == 1) return -EEXIST; @@ -344,35 +350,35 @@ int __weak set_bkpt(struct mm_struct *mm, struct uprobe *uprobe, * For mm @mm, restore the original opcode (opcode) at @vaddr. * Return 0 (success) or a negative errno. */ -int __weak set_orig_insn(struct mm_struct *mm, struct uprobe *uprobe, - unsigned long vaddr, bool verify) +int __weak +set_orig_insn(struct mm_struct *mm, struct uprobe *uprobe, unsigned long vaddr, bool verify) { if (verify) { - int result = is_bkpt_at_addr(mm, vaddr); + int result; + result = is_bkpt_at_addr(mm, vaddr); if (!result) return -EINVAL; if (result != 1) return result; } - return write_opcode(mm, uprobe, vaddr, - *(uprobe_opcode_t *)uprobe->insn); + return write_opcode(mm, uprobe, vaddr, *(uprobe_opcode_t *)uprobe->insn); } static int match_uprobe(struct uprobe *l, struct uprobe *r) { if (l->inode < r->inode) return -1; + if (l->inode > r->inode) return 1; - else { - if (l->offset < r->offset) - return -1; - if (l->offset > r->offset) - return 1; - } + if (l->offset < r->offset) + return -1; + + if (l->offset > r->offset) + return 1; return 0; } @@ -391,6 +397,7 @@ static struct uprobe *__find_uprobe(struct inode *inode, loff_t offset) atomic_inc(&uprobe->ref); return uprobe; } + if (match < 0) n = n->rb_left; else @@ -411,6 +418,7 @@ static struct uprobe *find_uprobe(struct inode *inode, loff_t offset) spin_lock_irqsave(&uprobes_treelock, flags); uprobe = __find_uprobe(inode, offset); spin_unlock_irqrestore(&uprobes_treelock, flags); + return uprobe; } @@ -436,16 +444,18 @@ static struct uprobe *__insert_uprobe(struct uprobe *uprobe) p = &parent->rb_right; } + u = NULL; rb_link_node(&uprobe->rb_node, parent, p); rb_insert_color(&uprobe->rb_node, &uprobes_tree); /* get access + creation ref */ atomic_set(&uprobe->ref, 2); + return u; } /* - * Acquires uprobes_treelock. + * Acquire uprobes_treelock. * Matching uprobe already exists in rbtree; * increment (access refcount) and return the matching uprobe. * @@ -460,6 +470,7 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe) spin_lock_irqsave(&uprobes_treelock, flags); u = __insert_uprobe(uprobe); spin_unlock_irqrestore(&uprobes_treelock, flags); + return u; } @@ -490,19 +501,22 @@ static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset) kfree(uprobe); uprobe = cur_uprobe; iput(inode); - } else + } else { atomic_inc(&uprobe_events); + } + return uprobe; } /* Returns the previous consumer */ -static struct uprobe_consumer *add_consumer(struct uprobe *uprobe, - struct uprobe_consumer *consumer) +static struct uprobe_consumer * +consumer_add(struct uprobe *uprobe, struct uprobe_consumer *consumer) { down_write(&uprobe->consumer_rwsem); consumer->next = uprobe->consumers; uprobe->consumers = consumer; up_write(&uprobe->consumer_rwsem); + return consumer->next; } @@ -511,8 +525,7 @@ static struct uprobe_consumer *add_consumer(struct uprobe *uprobe, * Return true if the @consumer is deleted successfully * or return false. 
*/ -static bool del_consumer(struct uprobe *uprobe, - struct uprobe_consumer *consumer) +static bool consumer_del(struct uprobe *uprobe, struct uprobe_consumer *consumer) { struct uprobe_consumer **con; bool ret = false; @@ -526,6 +539,7 @@ static bool del_consumer(struct uprobe *uprobe, } } up_write(&uprobe->consumer_rwsem); + return ret; } @@ -557,15 +571,15 @@ static int __copy_insn(struct address_space *mapping, memcpy(insn, vaddr + off1, nbytes); kunmap_atomic(vaddr); page_cache_release(page); + return 0; } -static int copy_insn(struct uprobe *uprobe, struct vm_area_struct *vma, - unsigned long addr) +static int copy_insn(struct uprobe *uprobe, struct vm_area_struct *vma, unsigned long addr) { struct address_space *mapping; - int bytes; unsigned long nbytes; + int bytes; addr &= ~PAGE_MASK; nbytes = PAGE_SIZE - addr; @@ -605,6 +619,7 @@ static int install_breakpoint(struct mm_struct *mm, struct uprobe *uprobe, return -EEXIST; addr = (unsigned long)vaddr; + if (!(uprobe->flags & UPROBES_COPY_INSN)) { ret = copy_insn(uprobe, vma, addr); if (ret) @@ -613,7 +628,7 @@ static int install_breakpoint(struct mm_struct *mm, struct uprobe *uprobe, if (is_bkpt_insn((uprobe_opcode_t *)uprobe->insn)) return -EEXIST; - ret = analyze_insn(mm, uprobe); + ret = arch_uprobes_analyze_insn(mm, uprobe); if (ret) return ret; @@ -624,8 +639,7 @@ static int install_breakpoint(struct mm_struct *mm, struct uprobe *uprobe, return ret; } -static void remove_breakpoint(struct mm_struct *mm, struct uprobe *uprobe, - loff_t vaddr) +static void remove_breakpoint(struct mm_struct *mm, struct uprobe *uprobe, loff_t vaddr) { set_orig_insn(mm, uprobe, (unsigned long)vaddr, true); } @@ -649,9 +663,11 @@ static struct vma_info *__find_next_vma_info(struct list_head *head, struct prio_tree_iter iter; struct vm_area_struct *vma; struct vma_info *tmpvi; - loff_t vaddr; - unsigned long pgoff = offset >> PAGE_SHIFT; + unsigned long pgoff; int existing_vma; + loff_t vaddr; + + pgoff = offset >> PAGE_SHIFT; vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) { if (!valid_vma(vma, is_register)) @@ -659,6 +675,7 @@ static struct vma_info *__find_next_vma_info(struct list_head *head, existing_vma = 0; vaddr = vma_address(vma, offset); + list_for_each_entry(tmpvi, head, probe_list) { if (tmpvi->mm == vma->vm_mm && tmpvi->vaddr == vaddr) { existing_vma = 1; @@ -670,14 +687,15 @@ static struct vma_info *__find_next_vma_info(struct list_head *head, * Another vma needs a probe to be installed. However skip * installing the probe if the vma is about to be unlinked. */ - if (!existing_vma && - atomic_inc_not_zero(&vma->vm_mm->mm_users)) { + if (!existing_vma && atomic_inc_not_zero(&vma->vm_mm->mm_users)) { vi->mm = vma->vm_mm; vi->vaddr = vaddr; list_add(&vi->probe_list, head); + return vi; } } + return NULL; } @@ -685,11 +703,12 @@ static struct vma_info *__find_next_vma_info(struct list_head *head, * Iterate in the rmap prio tree and find a vma where a probe has not * yet been inserted. 
*/ -static struct vma_info *find_next_vma_info(struct list_head *head, - loff_t offset, struct address_space *mapping, - bool is_register) +static struct vma_info * +find_next_vma_info(struct list_head *head, loff_t offset, struct address_space *mapping, + bool is_register) { struct vma_info *vi, *retvi; + vi = kzalloc(sizeof(struct vma_info), GFP_KERNEL); if (!vi) return ERR_PTR(-ENOMEM); @@ -700,6 +719,7 @@ static struct vma_info *find_next_vma_info(struct list_head *head, if (!retvi) kfree(vi); + return retvi; } @@ -711,16 +731,23 @@ static int register_for_each_vma(struct uprobe *uprobe, bool is_register) struct vma_info *vi, *tmpvi; struct mm_struct *mm; loff_t vaddr; - int ret = 0; + int ret; mapping = uprobe->inode->i_mapping; INIT_LIST_HEAD(&try_list); - while ((vi = find_next_vma_info(&try_list, uprobe->offset, - mapping, is_register)) != NULL) { + + ret = 0; + + for (;;) { + vi = find_next_vma_info(&try_list, uprobe->offset, mapping, is_register); + if (!vi) + break; + if (IS_ERR(vi)) { ret = PTR_ERR(vi); break; } + mm = vi->mm; down_read(&mm->mmap_sem); vma = find_vma(mm, (unsigned long)vi->vaddr); @@ -755,19 +782,21 @@ static int register_for_each_vma(struct uprobe *uprobe, bool is_register) break; } } + list_for_each_entry_safe(vi, tmpvi, &try_list, probe_list) { list_del(&vi->probe_list); kfree(vi); } + return ret; } -static int __register_uprobe(struct uprobe *uprobe) +static int __uprobe_register(struct uprobe *uprobe) { return register_for_each_vma(uprobe, true); } -static void __unregister_uprobe(struct uprobe *uprobe) +static void __uprobe_unregister(struct uprobe *uprobe) { if (!register_for_each_vma(uprobe, false)) delete_uprobe(uprobe); @@ -776,15 +805,15 @@ static void __unregister_uprobe(struct uprobe *uprobe) } /* - * register_uprobe - register a probe + * uprobe_register - register a probe * @inode: the file in which the probe has to be placed. * @offset: offset from the start of the file. * @consumer: information on howto handle the probe.. * - * Apart from the access refcount, register_uprobe() takes a creation + * Apart from the access refcount, uprobe_register() takes a creation * refcount (thro alloc_uprobe) if and only if this @uprobe is getting * inserted into the rbtree (i.e first consumer for a @inode:@offset - * tuple). Creation refcount stops unregister_uprobe from freeing the + * tuple). Creation refcount stops uprobe_unregister from freeing the * @uprobe even before the register operation is complete. Creation * refcount is released when the last @consumer for the @uprobe * unregisters. 
@@ -792,28 +821,29 @@ static void __unregister_uprobe(struct uprobe *uprobe) * Return errno if it cannot successully install probes * else return 0 (success) */ -int register_uprobe(struct inode *inode, loff_t offset, - struct uprobe_consumer *consumer) +int uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *consumer) { struct uprobe *uprobe; - int ret = -EINVAL; + int ret; if (!inode || !consumer || consumer->next) - return ret; + return -EINVAL; if (offset > i_size_read(inode)) - return ret; + return -EINVAL; ret = 0; mutex_lock(uprobes_hash(inode)); uprobe = alloc_uprobe(inode, offset); - if (uprobe && !add_consumer(uprobe, consumer)) { - ret = __register_uprobe(uprobe); + + if (uprobe && !consumer_add(uprobe, consumer)) { + ret = __uprobe_register(uprobe); if (ret) { uprobe->consumers = NULL; - __unregister_uprobe(uprobe); - } else + __uprobe_unregister(uprobe); + } else { uprobe->flags |= UPROBES_RUN_HANDLER; + } } mutex_unlock(uprobes_hash(inode)); @@ -823,15 +853,14 @@ int register_uprobe(struct inode *inode, loff_t offset, } /* - * unregister_uprobe - unregister a already registered probe. + * uprobe_unregister - unregister a already registered probe. * @inode: the file in which the probe has to be removed. * @offset: offset from the start of the file. * @consumer: identify which probe if multiple probes are colocated. */ -void unregister_uprobe(struct inode *inode, loff_t offset, - struct uprobe_consumer *consumer) +void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *consumer) { - struct uprobe *uprobe = NULL; + struct uprobe *uprobe; if (!inode || !consumer) return; @@ -841,15 +870,14 @@ void unregister_uprobe(struct inode *inode, loff_t offset, return; mutex_lock(uprobes_hash(inode)); - if (!del_consumer(uprobe, consumer)) - goto unreg_out; - if (!uprobe->consumers) { - __unregister_uprobe(uprobe); - uprobe->flags &= ~UPROBES_RUN_HANDLER; + if (consumer_del(uprobe, consumer)) { + if (!uprobe->consumers) { + __uprobe_unregister(uprobe); + uprobe->flags &= ~UPROBES_RUN_HANDLER; + } } -unreg_out: mutex_unlock(uprobes_hash(inode)); if (uprobe) put_uprobe(uprobe); @@ -870,6 +898,7 @@ static struct rb_node *find_least_offset_node(struct inode *inode) while (n) { uprobe = rb_entry(n, struct uprobe, rb_node); match = match_uprobe(&u, uprobe); + if (uprobe->inode == inode) close_node = n; @@ -881,6 +910,7 @@ static struct rb_node *find_least_offset_node(struct inode *inode) else n = n->rb_right; } + return close_node; } @@ -890,11 +920,13 @@ static struct rb_node *find_least_offset_node(struct inode *inode) static void build_probe_list(struct inode *inode, struct list_head *head) { struct uprobe *uprobe; - struct rb_node *n; unsigned long flags; + struct rb_node *n; spin_lock_irqsave(&uprobes_treelock, flags); + n = find_least_offset_node(inode); + for (; n; n = rb_next(n)) { uprobe = rb_entry(n, struct uprobe, rb_node); if (uprobe->inode != inode) @@ -903,6 +935,7 @@ static void build_probe_list(struct inode *inode, struct list_head *head) list_add(&uprobe->pending_list, head); atomic_inc(&uprobe->ref); } + spin_unlock_irqrestore(&uprobes_treelock, flags); } @@ -912,42 +945,44 @@ static void build_probe_list(struct inode *inode, struct list_head *head) * * Return -ve no if we fail to insert probes and we cannot * bail-out. - * Return 0 otherwise. i.e : + * Return 0 otherwise. i.e: + * * - successful insertion of probes * - (or) no possible probes to be inserted. * - (or) insertion of probes failed but we can bail-out. 
*/ -int mmap_uprobe(struct vm_area_struct *vma) +int uprobe_mmap(struct vm_area_struct *vma) { struct list_head tmp_list; struct uprobe *uprobe, *u; struct inode *inode; - int ret = 0; + int ret; if (!atomic_read(&uprobe_events) || !valid_vma(vma, true)) - return ret; /* Bail-out */ + return 0; inode = vma->vm_file->f_mapping->host; if (!inode) - return ret; + return 0; INIT_LIST_HEAD(&tmp_list); mutex_lock(uprobes_mmap_hash(inode)); build_probe_list(inode, &tmp_list); + + ret = 0; + list_for_each_entry_safe(uprobe, u, &tmp_list, pending_list) { loff_t vaddr; list_del(&uprobe->pending_list); if (!ret) { vaddr = vma_address(vma, uprobe->offset); - if (vaddr < vma->vm_start || vaddr >= vma->vm_end) { - put_uprobe(uprobe); - continue; + if (vaddr >= vma->vm_start && vaddr < vma->vm_end) { + ret = install_breakpoint(vma->vm_mm, uprobe, vma, vaddr); + /* Ignore double add: */ + if (ret == -EEXIST) + ret = 0; } - ret = install_breakpoint(vma->vm_mm, uprobe, vma, - vaddr); - if (ret == -EEXIST) - ret = 0; } put_uprobe(uprobe); } -- cgit v1.2.3 From a5f4374a9610fd7286c2164d4e680436727eff71 Mon Sep 17 00:00:00 2001 From: Ingo Molnar Date: Wed, 22 Feb 2012 11:01:49 +0100 Subject: uprobes: Move to kernel/events/ Consolidate the uprobes code under kernel/events/, where the various core kernel event handling routines live. Acked-by: Peter Zijlstra Cc: Srikar Dronamraju Cc: Jim Keniston Cc: Oleg Nesterov Cc: Masami Hiramatsu Cc: Arnaldo Carvalho de Melo Cc: Anton Arapov Cc: Ananth N Mavinakayanahalli Link: http://lkml.kernel.org/n/tip-biuyhhwohxgbp2vzbap5yr8o@git.kernel.org Signed-off-by: Ingo Molnar --- kernel/Makefile | 1 - kernel/events/Makefile | 3 + kernel/events/uprobes.c | 1011 +++++++++++++++++++++++++++++++++++++++++++++++ kernel/uprobes.c | 1011 ----------------------------------------------- 4 files changed, 1014 insertions(+), 1012 deletions(-) create mode 100644 kernel/events/uprobes.c delete mode 100644 kernel/uprobes.c (limited to 'kernel') diff --git a/kernel/Makefile b/kernel/Makefile index 8609dd3d875a..2d9de86b7e76 100644 --- a/kernel/Makefile +++ b/kernel/Makefile @@ -107,7 +107,6 @@ obj-$(CONFIG_USER_RETURN_NOTIFIER) += user-return-notifier.o obj-$(CONFIG_PADATA) += padata.o obj-$(CONFIG_CRASH_DUMP) += crash_dump.o obj-$(CONFIG_JUMP_LABEL) += jump_label.o -obj-$(CONFIG_UPROBES) += uprobes.o $(obj)/configs.o: $(obj)/config_data.h diff --git a/kernel/events/Makefile b/kernel/events/Makefile index 22d901f9caf4..103f5d147b2f 100644 --- a/kernel/events/Makefile +++ b/kernel/events/Makefile @@ -3,4 +3,7 @@ CFLAGS_REMOVE_core.o = -pg endif obj-y := core.o ring_buffer.o callchain.o + obj-$(CONFIG_HAVE_HW_BREAKPOINT) += hw_breakpoint.o +obj-$(CONFIG_UPROBES) += uprobes.o + diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c new file mode 100644 index 000000000000..884817f1b0d3 --- /dev/null +++ b/kernel/events/uprobes.c @@ -0,0 +1,1011 @@ +/* + * User-space Probes (UProbes) + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. + * + * Copyright (C) IBM Corporation, 2008-2011 + * Authors: + * Srikar Dronamraju + * Jim Keniston + */ + +#include +#include +#include /* read_mapping_page */ +#include +#include +#include /* anon_vma_prepare */ +#include /* set_pte_at_notify */ +#include /* try_to_free_swap */ + +#include + +static struct rb_root uprobes_tree = RB_ROOT; + +static DEFINE_SPINLOCK(uprobes_treelock); /* serialize rbtree access */ + +#define UPROBES_HASH_SZ 13 + +/* serialize (un)register */ +static struct mutex uprobes_mutex[UPROBES_HASH_SZ]; + +#define uprobes_hash(v) (&uprobes_mutex[((unsigned long)(v)) % UPROBES_HASH_SZ]) + +/* serialize uprobe->pending_list */ +static struct mutex uprobes_mmap_mutex[UPROBES_HASH_SZ]; +#define uprobes_mmap_hash(v) (&uprobes_mmap_mutex[((unsigned long)(v)) % UPROBES_HASH_SZ]) + +/* + * uprobe_events allows us to skip the uprobe_mmap if there are no uprobe + * events active at this time. Probably a fine grained per inode count is + * better? + */ +static atomic_t uprobe_events = ATOMIC_INIT(0); + +/* + * Maintain a temporary per vma info that can be used to search if a vma + * has already been handled. This structure is introduced since extending + * vm_area_struct wasnt recommended. + */ +struct vma_info { + struct list_head probe_list; + struct mm_struct *mm; + loff_t vaddr; +}; + +/* + * valid_vma: Verify if the specified vma is an executable vma + * Relax restrictions while unregistering: vm_flags might have + * changed after breakpoint was inserted. + * - is_register: indicates if we are in register context. + * - Return 1 if the specified virtual address is in an + * executable vma. + */ +static bool valid_vma(struct vm_area_struct *vma, bool is_register) +{ + if (!vma->vm_file) + return false; + + if (!is_register) + return true; + + if ((vma->vm_flags & (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)) == (VM_READ|VM_EXEC)) + return true; + + return false; +} + +static loff_t vma_address(struct vm_area_struct *vma, loff_t offset) +{ + loff_t vaddr; + + vaddr = vma->vm_start + offset; + vaddr -= vma->vm_pgoff << PAGE_SHIFT; + + return vaddr; +} + +/** + * __replace_page - replace page in vma by new page. + * based on replace_page in mm/ksm.c + * + * @vma: vma that holds the pte pointing to page + * @page: the cowed page we are replacing by kpage + * @kpage: the modified page we replace page by + * + * Returns 0 on success, -EFAULT on failure. 
+ */ +static int __replace_page(struct vm_area_struct *vma, struct page *page, struct page *kpage) +{ + struct mm_struct *mm = vma->vm_mm; + pgd_t *pgd; + pud_t *pud; + pmd_t *pmd; + pte_t *ptep; + spinlock_t *ptl; + unsigned long addr; + int err = -EFAULT; + + addr = page_address_in_vma(page, vma); + if (addr == -EFAULT) + goto out; + + pgd = pgd_offset(mm, addr); + if (!pgd_present(*pgd)) + goto out; + + pud = pud_offset(pgd, addr); + if (!pud_present(*pud)) + goto out; + + pmd = pmd_offset(pud, addr); + if (!pmd_present(*pmd)) + goto out; + + ptep = pte_offset_map_lock(mm, pmd, addr, &ptl); + if (!ptep) + goto out; + + get_page(kpage); + page_add_new_anon_rmap(kpage, vma, addr); + + flush_cache_page(vma, addr, pte_pfn(*ptep)); + ptep_clear_flush(vma, addr, ptep); + set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot)); + + page_remove_rmap(page); + if (!page_mapped(page)) + try_to_free_swap(page); + put_page(page); + pte_unmap_unlock(ptep, ptl); + err = 0; + +out: + return err; +} + +/** + * is_bkpt_insn - check if instruction is breakpoint instruction. + * @insn: instruction to be checked. + * Default implementation of is_bkpt_insn + * Returns true if @insn is a breakpoint instruction. + */ +bool __weak is_bkpt_insn(uprobe_opcode_t *insn) +{ + return *insn == UPROBES_BKPT_INSN; +} + +/* + * NOTE: + * Expect the breakpoint instruction to be the smallest size instruction for + * the architecture. If an arch has variable length instruction and the + * breakpoint instruction is not of the smallest length instruction + * supported by that architecture then we need to modify read_opcode / + * write_opcode accordingly. This would never be a problem for archs that + * have fixed length instructions. + */ + +/* + * write_opcode - write the opcode at a given virtual address. + * @mm: the probed process address space. + * @uprobe: the breakpointing information. + * @vaddr: the virtual address to store the opcode. + * @opcode: opcode to be written at @vaddr. + * + * Called with mm->mmap_sem held (for read and with a reference to + * mm). + * + * For mm @mm, write the opcode at @vaddr. + * Return 0 (success) or a negative errno. + */ +static int write_opcode(struct mm_struct *mm, struct uprobe *uprobe, + unsigned long vaddr, uprobe_opcode_t opcode) +{ + struct page *old_page, *new_page; + struct address_space *mapping; + void *vaddr_old, *vaddr_new; + struct vm_area_struct *vma; + loff_t addr; + int ret; + + /* Read the page with vaddr into memory */ + ret = get_user_pages(NULL, mm, vaddr, 1, 0, 0, &old_page, &vma); + if (ret <= 0) + return ret; + + ret = -EINVAL; + + /* + * We are interested in text pages only. Our pages of interest + * should be mapped for read and execute only. We desist from + * adding probes in write mapped pages since the breakpoints + * might end up in the file copy. 
+ */ + if (!valid_vma(vma, is_bkpt_insn(&opcode))) + goto put_out; + + mapping = uprobe->inode->i_mapping; + if (mapping != vma->vm_file->f_mapping) + goto put_out; + + addr = vma_address(vma, uprobe->offset); + if (vaddr != (unsigned long)addr) + goto put_out; + + ret = -ENOMEM; + new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vaddr); + if (!new_page) + goto put_out; + + __SetPageUptodate(new_page); + + /* + * lock page will serialize against do_wp_page()'s + * PageAnon() handling + */ + lock_page(old_page); + /* copy the page now that we've got it stable */ + vaddr_old = kmap_atomic(old_page); + vaddr_new = kmap_atomic(new_page); + + memcpy(vaddr_new, vaddr_old, PAGE_SIZE); + + /* poke the new insn in, ASSUMES we don't cross page boundary */ + vaddr &= ~PAGE_MASK; + BUG_ON(vaddr + uprobe_opcode_sz > PAGE_SIZE); + memcpy(vaddr_new + vaddr, &opcode, uprobe_opcode_sz); + + kunmap_atomic(vaddr_new); + kunmap_atomic(vaddr_old); + + ret = anon_vma_prepare(vma); + if (ret) + goto unlock_out; + + lock_page(new_page); + ret = __replace_page(vma, old_page, new_page); + unlock_page(new_page); + +unlock_out: + unlock_page(old_page); + page_cache_release(new_page); + +put_out: + put_page(old_page); + + return ret; +} + +/** + * read_opcode - read the opcode at a given virtual address. + * @mm: the probed process address space. + * @vaddr: the virtual address to read the opcode. + * @opcode: location to store the read opcode. + * + * Called with mm->mmap_sem held (for read and with a reference to + * mm. + * + * For mm @mm, read the opcode at @vaddr and store it in @opcode. + * Return 0 (success) or a negative errno. + */ +static int read_opcode(struct mm_struct *mm, unsigned long vaddr, uprobe_opcode_t *opcode) +{ + struct page *page; + void *vaddr_new; + int ret; + + ret = get_user_pages(NULL, mm, vaddr, 1, 0, 0, &page, NULL); + if (ret <= 0) + return ret; + + lock_page(page); + vaddr_new = kmap_atomic(page); + vaddr &= ~PAGE_MASK; + memcpy(opcode, vaddr_new + vaddr, uprobe_opcode_sz); + kunmap_atomic(vaddr_new); + unlock_page(page); + + put_page(page); + + return 0; +} + +static int is_bkpt_at_addr(struct mm_struct *mm, unsigned long vaddr) +{ + uprobe_opcode_t opcode; + int result; + + result = read_opcode(mm, vaddr, &opcode); + if (result) + return result; + + if (is_bkpt_insn(&opcode)) + return 1; + + return 0; +} + +/** + * set_bkpt - store breakpoint at a given address. + * @mm: the probed process address space. + * @uprobe: the probepoint information. + * @vaddr: the virtual address to insert the opcode. + * + * For mm @mm, store the breakpoint instruction at @vaddr. + * Return 0 (success) or a negative errno. + */ +int __weak set_bkpt(struct mm_struct *mm, struct uprobe *uprobe, unsigned long vaddr) +{ + int result; + + result = is_bkpt_at_addr(mm, vaddr); + if (result == 1) + return -EEXIST; + + if (result) + return result; + + return write_opcode(mm, uprobe, vaddr, UPROBES_BKPT_INSN); +} + +/** + * set_orig_insn - Restore the original instruction. + * @mm: the probed process address space. + * @uprobe: the probepoint information. + * @vaddr: the virtual address to insert the opcode. + * @verify: if true, verify existance of breakpoint instruction. + * + * For mm @mm, restore the original opcode (opcode) at @vaddr. + * Return 0 (success) or a negative errno. 
+ */ +int __weak +set_orig_insn(struct mm_struct *mm, struct uprobe *uprobe, unsigned long vaddr, bool verify) +{ + if (verify) { + int result; + + result = is_bkpt_at_addr(mm, vaddr); + if (!result) + return -EINVAL; + + if (result != 1) + return result; + } + return write_opcode(mm, uprobe, vaddr, *(uprobe_opcode_t *)uprobe->insn); +} + +static int match_uprobe(struct uprobe *l, struct uprobe *r) +{ + if (l->inode < r->inode) + return -1; + + if (l->inode > r->inode) + return 1; + + if (l->offset < r->offset) + return -1; + + if (l->offset > r->offset) + return 1; + + return 0; +} + +static struct uprobe *__find_uprobe(struct inode *inode, loff_t offset) +{ + struct uprobe u = { .inode = inode, .offset = offset }; + struct rb_node *n = uprobes_tree.rb_node; + struct uprobe *uprobe; + int match; + + while (n) { + uprobe = rb_entry(n, struct uprobe, rb_node); + match = match_uprobe(&u, uprobe); + if (!match) { + atomic_inc(&uprobe->ref); + return uprobe; + } + + if (match < 0) + n = n->rb_left; + else + n = n->rb_right; + } + return NULL; +} + +/* + * Find a uprobe corresponding to a given inode:offset + * Acquires uprobes_treelock + */ +static struct uprobe *find_uprobe(struct inode *inode, loff_t offset) +{ + struct uprobe *uprobe; + unsigned long flags; + + spin_lock_irqsave(&uprobes_treelock, flags); + uprobe = __find_uprobe(inode, offset); + spin_unlock_irqrestore(&uprobes_treelock, flags); + + return uprobe; +} + +static struct uprobe *__insert_uprobe(struct uprobe *uprobe) +{ + struct rb_node **p = &uprobes_tree.rb_node; + struct rb_node *parent = NULL; + struct uprobe *u; + int match; + + while (*p) { + parent = *p; + u = rb_entry(parent, struct uprobe, rb_node); + match = match_uprobe(uprobe, u); + if (!match) { + atomic_inc(&u->ref); + return u; + } + + if (match < 0) + p = &parent->rb_left; + else + p = &parent->rb_right; + + } + + u = NULL; + rb_link_node(&uprobe->rb_node, parent, p); + rb_insert_color(&uprobe->rb_node, &uprobes_tree); + /* get access + creation ref */ + atomic_set(&uprobe->ref, 2); + + return u; +} + +/* + * Acquire uprobes_treelock. + * Matching uprobe already exists in rbtree; + * increment (access refcount) and return the matching uprobe. + * + * No matching uprobe; insert the uprobe in rb_tree; + * get a double refcount (access + creation) and return NULL. 
+ */ +static struct uprobe *insert_uprobe(struct uprobe *uprobe) +{ + unsigned long flags; + struct uprobe *u; + + spin_lock_irqsave(&uprobes_treelock, flags); + u = __insert_uprobe(uprobe); + spin_unlock_irqrestore(&uprobes_treelock, flags); + + return u; +} + +static void put_uprobe(struct uprobe *uprobe) +{ + if (atomic_dec_and_test(&uprobe->ref)) + kfree(uprobe); +} + +static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset) +{ + struct uprobe *uprobe, *cur_uprobe; + + uprobe = kzalloc(sizeof(struct uprobe), GFP_KERNEL); + if (!uprobe) + return NULL; + + uprobe->inode = igrab(inode); + uprobe->offset = offset; + init_rwsem(&uprobe->consumer_rwsem); + INIT_LIST_HEAD(&uprobe->pending_list); + + /* add to uprobes_tree, sorted on inode:offset */ + cur_uprobe = insert_uprobe(uprobe); + + /* a uprobe exists for this inode:offset combination */ + if (cur_uprobe) { + kfree(uprobe); + uprobe = cur_uprobe; + iput(inode); + } else { + atomic_inc(&uprobe_events); + } + + return uprobe; +} + +/* Returns the previous consumer */ +static struct uprobe_consumer * +consumer_add(struct uprobe *uprobe, struct uprobe_consumer *consumer) +{ + down_write(&uprobe->consumer_rwsem); + consumer->next = uprobe->consumers; + uprobe->consumers = consumer; + up_write(&uprobe->consumer_rwsem); + + return consumer->next; +} + +/* + * For uprobe @uprobe, delete the consumer @consumer. + * Return true if the @consumer is deleted successfully + * or return false. + */ +static bool consumer_del(struct uprobe *uprobe, struct uprobe_consumer *consumer) +{ + struct uprobe_consumer **con; + bool ret = false; + + down_write(&uprobe->consumer_rwsem); + for (con = &uprobe->consumers; *con; con = &(*con)->next) { + if (*con == consumer) { + *con = consumer->next; + ret = true; + break; + } + } + up_write(&uprobe->consumer_rwsem); + + return ret; +} + +static int __copy_insn(struct address_space *mapping, + struct vm_area_struct *vma, char *insn, + unsigned long nbytes, unsigned long offset) +{ + struct file *filp = vma->vm_file; + struct page *page; + void *vaddr; + unsigned long off1; + unsigned long idx; + + if (!filp) + return -EINVAL; + + idx = (unsigned long)(offset >> PAGE_CACHE_SHIFT); + off1 = offset &= ~PAGE_MASK; + + /* + * Ensure that the page that has the original instruction is + * populated and in page-cache. 
+ */ + page = read_mapping_page(mapping, idx, filp); + if (IS_ERR(page)) + return PTR_ERR(page); + + vaddr = kmap_atomic(page); + memcpy(insn, vaddr + off1, nbytes); + kunmap_atomic(vaddr); + page_cache_release(page); + + return 0; +} + +static int copy_insn(struct uprobe *uprobe, struct vm_area_struct *vma, unsigned long addr) +{ + struct address_space *mapping; + unsigned long nbytes; + int bytes; + + addr &= ~PAGE_MASK; + nbytes = PAGE_SIZE - addr; + mapping = uprobe->inode->i_mapping; + + /* Instruction at end of binary; copy only available bytes */ + if (uprobe->offset + MAX_UINSN_BYTES > uprobe->inode->i_size) + bytes = uprobe->inode->i_size - uprobe->offset; + else + bytes = MAX_UINSN_BYTES; + + /* Instruction at the page-boundary; copy bytes in second page */ + if (nbytes < bytes) { + if (__copy_insn(mapping, vma, uprobe->insn + nbytes, + bytes - nbytes, uprobe->offset + nbytes)) + return -ENOMEM; + + bytes = nbytes; + } + return __copy_insn(mapping, vma, uprobe->insn, bytes, uprobe->offset); +} + +static int install_breakpoint(struct mm_struct *mm, struct uprobe *uprobe, + struct vm_area_struct *vma, loff_t vaddr) +{ + unsigned long addr; + int ret; + + /* + * If probe is being deleted, unregister thread could be done with + * the vma-rmap-walk through. Adding a probe now can be fatal since + * nobody will be able to cleanup. Also we could be from fork or + * mremap path, where the probe might have already been inserted. + * Hence behave as if probe already existed. + */ + if (!uprobe->consumers) + return -EEXIST; + + addr = (unsigned long)vaddr; + + if (!(uprobe->flags & UPROBES_COPY_INSN)) { + ret = copy_insn(uprobe, vma, addr); + if (ret) + return ret; + + if (is_bkpt_insn((uprobe_opcode_t *)uprobe->insn)) + return -EEXIST; + + ret = arch_uprobes_analyze_insn(mm, uprobe); + if (ret) + return ret; + + uprobe->flags |= UPROBES_COPY_INSN; + } + ret = set_bkpt(mm, uprobe, addr); + + return ret; +} + +static void remove_breakpoint(struct mm_struct *mm, struct uprobe *uprobe, loff_t vaddr) +{ + set_orig_insn(mm, uprobe, (unsigned long)vaddr, true); +} + +static void delete_uprobe(struct uprobe *uprobe) +{ + unsigned long flags; + + spin_lock_irqsave(&uprobes_treelock, flags); + rb_erase(&uprobe->rb_node, &uprobes_tree); + spin_unlock_irqrestore(&uprobes_treelock, flags); + iput(uprobe->inode); + put_uprobe(uprobe); + atomic_dec(&uprobe_events); +} + +static struct vma_info *__find_next_vma_info(struct list_head *head, + loff_t offset, struct address_space *mapping, + struct vma_info *vi, bool is_register) +{ + struct prio_tree_iter iter; + struct vm_area_struct *vma; + struct vma_info *tmpvi; + unsigned long pgoff; + int existing_vma; + loff_t vaddr; + + pgoff = offset >> PAGE_SHIFT; + + vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) { + if (!valid_vma(vma, is_register)) + continue; + + existing_vma = 0; + vaddr = vma_address(vma, offset); + + list_for_each_entry(tmpvi, head, probe_list) { + if (tmpvi->mm == vma->vm_mm && tmpvi->vaddr == vaddr) { + existing_vma = 1; + break; + } + } + + /* + * Another vma needs a probe to be installed. However skip + * installing the probe if the vma is about to be unlinked. + */ + if (!existing_vma && atomic_inc_not_zero(&vma->vm_mm->mm_users)) { + vi->mm = vma->vm_mm; + vi->vaddr = vaddr; + list_add(&vi->probe_list, head); + + return vi; + } + } + + return NULL; +} + +/* + * Iterate in the rmap prio tree and find a vma where a probe has not + * yet been inserted. 
+ */ +static struct vma_info * +find_next_vma_info(struct list_head *head, loff_t offset, struct address_space *mapping, + bool is_register) +{ + struct vma_info *vi, *retvi; + + vi = kzalloc(sizeof(struct vma_info), GFP_KERNEL); + if (!vi) + return ERR_PTR(-ENOMEM); + + mutex_lock(&mapping->i_mmap_mutex); + retvi = __find_next_vma_info(head, offset, mapping, vi, is_register); + mutex_unlock(&mapping->i_mmap_mutex); + + if (!retvi) + kfree(vi); + + return retvi; +} + +static int register_for_each_vma(struct uprobe *uprobe, bool is_register) +{ + struct list_head try_list; + struct vm_area_struct *vma; + struct address_space *mapping; + struct vma_info *vi, *tmpvi; + struct mm_struct *mm; + loff_t vaddr; + int ret; + + mapping = uprobe->inode->i_mapping; + INIT_LIST_HEAD(&try_list); + + ret = 0; + + for (;;) { + vi = find_next_vma_info(&try_list, uprobe->offset, mapping, is_register); + if (!vi) + break; + + if (IS_ERR(vi)) { + ret = PTR_ERR(vi); + break; + } + + mm = vi->mm; + down_read(&mm->mmap_sem); + vma = find_vma(mm, (unsigned long)vi->vaddr); + if (!vma || !valid_vma(vma, is_register)) { + list_del(&vi->probe_list); + kfree(vi); + up_read(&mm->mmap_sem); + mmput(mm); + continue; + } + vaddr = vma_address(vma, uprobe->offset); + if (vma->vm_file->f_mapping->host != uprobe->inode || + vaddr != vi->vaddr) { + list_del(&vi->probe_list); + kfree(vi); + up_read(&mm->mmap_sem); + mmput(mm); + continue; + } + + if (is_register) + ret = install_breakpoint(mm, uprobe, vma, vi->vaddr); + else + remove_breakpoint(mm, uprobe, vi->vaddr); + + up_read(&mm->mmap_sem); + mmput(mm); + if (is_register) { + if (ret && ret == -EEXIST) + ret = 0; + if (ret) + break; + } + } + + list_for_each_entry_safe(vi, tmpvi, &try_list, probe_list) { + list_del(&vi->probe_list); + kfree(vi); + } + + return ret; +} + +static int __uprobe_register(struct uprobe *uprobe) +{ + return register_for_each_vma(uprobe, true); +} + +static void __uprobe_unregister(struct uprobe *uprobe) +{ + if (!register_for_each_vma(uprobe, false)) + delete_uprobe(uprobe); + + /* TODO : cant unregister? schedule a worker thread */ +} + +/* + * uprobe_register - register a probe + * @inode: the file in which the probe has to be placed. + * @offset: offset from the start of the file. + * @consumer: information on howto handle the probe.. + * + * Apart from the access refcount, uprobe_register() takes a creation + * refcount (thro alloc_uprobe) if and only if this @uprobe is getting + * inserted into the rbtree (i.e first consumer for a @inode:@offset + * tuple). Creation refcount stops uprobe_unregister from freeing the + * @uprobe even before the register operation is complete. Creation + * refcount is released when the last @consumer for the @uprobe + * unregisters. 
+ * + * Return errno if it cannot successully install probes + * else return 0 (success) + */ +int uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *consumer) +{ + struct uprobe *uprobe; + int ret; + + if (!inode || !consumer || consumer->next) + return -EINVAL; + + if (offset > i_size_read(inode)) + return -EINVAL; + + ret = 0; + mutex_lock(uprobes_hash(inode)); + uprobe = alloc_uprobe(inode, offset); + + if (uprobe && !consumer_add(uprobe, consumer)) { + ret = __uprobe_register(uprobe); + if (ret) { + uprobe->consumers = NULL; + __uprobe_unregister(uprobe); + } else { + uprobe->flags |= UPROBES_RUN_HANDLER; + } + } + + mutex_unlock(uprobes_hash(inode)); + put_uprobe(uprobe); + + return ret; +} + +/* + * uprobe_unregister - unregister a already registered probe. + * @inode: the file in which the probe has to be removed. + * @offset: offset from the start of the file. + * @consumer: identify which probe if multiple probes are colocated. + */ +void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *consumer) +{ + struct uprobe *uprobe; + + if (!inode || !consumer) + return; + + uprobe = find_uprobe(inode, offset); + if (!uprobe) + return; + + mutex_lock(uprobes_hash(inode)); + + if (consumer_del(uprobe, consumer)) { + if (!uprobe->consumers) { + __uprobe_unregister(uprobe); + uprobe->flags &= ~UPROBES_RUN_HANDLER; + } + } + + mutex_unlock(uprobes_hash(inode)); + if (uprobe) + put_uprobe(uprobe); +} + +/* + * Of all the nodes that correspond to the given inode, return the node + * with the least offset. + */ +static struct rb_node *find_least_offset_node(struct inode *inode) +{ + struct uprobe u = { .inode = inode, .offset = 0}; + struct rb_node *n = uprobes_tree.rb_node; + struct rb_node *close_node = NULL; + struct uprobe *uprobe; + int match; + + while (n) { + uprobe = rb_entry(n, struct uprobe, rb_node); + match = match_uprobe(&u, uprobe); + + if (uprobe->inode == inode) + close_node = n; + + if (!match) + return close_node; + + if (match < 0) + n = n->rb_left; + else + n = n->rb_right; + } + + return close_node; +} + +/* + * For a given inode, build a list of probes that need to be inserted. + */ +static void build_probe_list(struct inode *inode, struct list_head *head) +{ + struct uprobe *uprobe; + unsigned long flags; + struct rb_node *n; + + spin_lock_irqsave(&uprobes_treelock, flags); + + n = find_least_offset_node(inode); + + for (; n; n = rb_next(n)) { + uprobe = rb_entry(n, struct uprobe, rb_node); + if (uprobe->inode != inode) + break; + + list_add(&uprobe->pending_list, head); + atomic_inc(&uprobe->ref); + } + + spin_unlock_irqrestore(&uprobes_treelock, flags); +} + +/* + * Called from mmap_region. + * called with mm->mmap_sem acquired. + * + * Return -ve no if we fail to insert probes and we cannot + * bail-out. + * Return 0 otherwise. i.e: + * + * - successful insertion of probes + * - (or) no possible probes to be inserted. + * - (or) insertion of probes failed but we can bail-out. 
+ */ +int uprobe_mmap(struct vm_area_struct *vma) +{ + struct list_head tmp_list; + struct uprobe *uprobe, *u; + struct inode *inode; + int ret; + + if (!atomic_read(&uprobe_events) || !valid_vma(vma, true)) + return 0; + + inode = vma->vm_file->f_mapping->host; + if (!inode) + return 0; + + INIT_LIST_HEAD(&tmp_list); + mutex_lock(uprobes_mmap_hash(inode)); + build_probe_list(inode, &tmp_list); + + ret = 0; + + list_for_each_entry_safe(uprobe, u, &tmp_list, pending_list) { + loff_t vaddr; + + list_del(&uprobe->pending_list); + if (!ret) { + vaddr = vma_address(vma, uprobe->offset); + if (vaddr >= vma->vm_start && vaddr < vma->vm_end) { + ret = install_breakpoint(vma->vm_mm, uprobe, vma, vaddr); + /* Ignore double add: */ + if (ret == -EEXIST) + ret = 0; + } + } + put_uprobe(uprobe); + } + + mutex_unlock(uprobes_mmap_hash(inode)); + + return ret; +} + +static int __init init_uprobes(void) +{ + int i; + + for (i = 0; i < UPROBES_HASH_SZ; i++) { + mutex_init(&uprobes_mutex[i]); + mutex_init(&uprobes_mmap_mutex[i]); + } + return 0; +} + +static void __exit exit_uprobes(void) +{ +} + +module_init(init_uprobes); +module_exit(exit_uprobes); diff --git a/kernel/uprobes.c b/kernel/uprobes.c deleted file mode 100644 index 884817f1b0d3..000000000000 --- a/kernel/uprobes.c +++ /dev/null @@ -1,1011 +0,0 @@ -/* - * User-space Probes (UProbes) - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - * - * You should have received a copy of the GNU General Public License - * along with this program; if not, write to the Free Software - * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. - * - * Copyright (C) IBM Corporation, 2008-2011 - * Authors: - * Srikar Dronamraju - * Jim Keniston - */ - -#include -#include -#include /* read_mapping_page */ -#include -#include -#include /* anon_vma_prepare */ -#include /* set_pte_at_notify */ -#include /* try_to_free_swap */ - -#include - -static struct rb_root uprobes_tree = RB_ROOT; - -static DEFINE_SPINLOCK(uprobes_treelock); /* serialize rbtree access */ - -#define UPROBES_HASH_SZ 13 - -/* serialize (un)register */ -static struct mutex uprobes_mutex[UPROBES_HASH_SZ]; - -#define uprobes_hash(v) (&uprobes_mutex[((unsigned long)(v)) % UPROBES_HASH_SZ]) - -/* serialize uprobe->pending_list */ -static struct mutex uprobes_mmap_mutex[UPROBES_HASH_SZ]; -#define uprobes_mmap_hash(v) (&uprobes_mmap_mutex[((unsigned long)(v)) % UPROBES_HASH_SZ]) - -/* - * uprobe_events allows us to skip the uprobe_mmap if there are no uprobe - * events active at this time. Probably a fine grained per inode count is - * better? - */ -static atomic_t uprobe_events = ATOMIC_INIT(0); - -/* - * Maintain a temporary per vma info that can be used to search if a vma - * has already been handled. This structure is introduced since extending - * vm_area_struct wasnt recommended. 
- */ -struct vma_info { - struct list_head probe_list; - struct mm_struct *mm; - loff_t vaddr; -}; - -/* - * valid_vma: Verify if the specified vma is an executable vma - * Relax restrictions while unregistering: vm_flags might have - * changed after breakpoint was inserted. - * - is_register: indicates if we are in register context. - * - Return 1 if the specified virtual address is in an - * executable vma. - */ -static bool valid_vma(struct vm_area_struct *vma, bool is_register) -{ - if (!vma->vm_file) - return false; - - if (!is_register) - return true; - - if ((vma->vm_flags & (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)) == (VM_READ|VM_EXEC)) - return true; - - return false; -} - -static loff_t vma_address(struct vm_area_struct *vma, loff_t offset) -{ - loff_t vaddr; - - vaddr = vma->vm_start + offset; - vaddr -= vma->vm_pgoff << PAGE_SHIFT; - - return vaddr; -} - -/** - * __replace_page - replace page in vma by new page. - * based on replace_page in mm/ksm.c - * - * @vma: vma that holds the pte pointing to page - * @page: the cowed page we are replacing by kpage - * @kpage: the modified page we replace page by - * - * Returns 0 on success, -EFAULT on failure. - */ -static int __replace_page(struct vm_area_struct *vma, struct page *page, struct page *kpage) -{ - struct mm_struct *mm = vma->vm_mm; - pgd_t *pgd; - pud_t *pud; - pmd_t *pmd; - pte_t *ptep; - spinlock_t *ptl; - unsigned long addr; - int err = -EFAULT; - - addr = page_address_in_vma(page, vma); - if (addr == -EFAULT) - goto out; - - pgd = pgd_offset(mm, addr); - if (!pgd_present(*pgd)) - goto out; - - pud = pud_offset(pgd, addr); - if (!pud_present(*pud)) - goto out; - - pmd = pmd_offset(pud, addr); - if (!pmd_present(*pmd)) - goto out; - - ptep = pte_offset_map_lock(mm, pmd, addr, &ptl); - if (!ptep) - goto out; - - get_page(kpage); - page_add_new_anon_rmap(kpage, vma, addr); - - flush_cache_page(vma, addr, pte_pfn(*ptep)); - ptep_clear_flush(vma, addr, ptep); - set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot)); - - page_remove_rmap(page); - if (!page_mapped(page)) - try_to_free_swap(page); - put_page(page); - pte_unmap_unlock(ptep, ptl); - err = 0; - -out: - return err; -} - -/** - * is_bkpt_insn - check if instruction is breakpoint instruction. - * @insn: instruction to be checked. - * Default implementation of is_bkpt_insn - * Returns true if @insn is a breakpoint instruction. - */ -bool __weak is_bkpt_insn(uprobe_opcode_t *insn) -{ - return *insn == UPROBES_BKPT_INSN; -} - -/* - * NOTE: - * Expect the breakpoint instruction to be the smallest size instruction for - * the architecture. If an arch has variable length instruction and the - * breakpoint instruction is not of the smallest length instruction - * supported by that architecture then we need to modify read_opcode / - * write_opcode accordingly. This would never be a problem for archs that - * have fixed length instructions. - */ - -/* - * write_opcode - write the opcode at a given virtual address. - * @mm: the probed process address space. - * @uprobe: the breakpointing information. - * @vaddr: the virtual address to store the opcode. - * @opcode: opcode to be written at @vaddr. - * - * Called with mm->mmap_sem held (for read and with a reference to - * mm). - * - * For mm @mm, write the opcode at @vaddr. - * Return 0 (success) or a negative errno. 
- */ -static int write_opcode(struct mm_struct *mm, struct uprobe *uprobe, - unsigned long vaddr, uprobe_opcode_t opcode) -{ - struct page *old_page, *new_page; - struct address_space *mapping; - void *vaddr_old, *vaddr_new; - struct vm_area_struct *vma; - loff_t addr; - int ret; - - /* Read the page with vaddr into memory */ - ret = get_user_pages(NULL, mm, vaddr, 1, 0, 0, &old_page, &vma); - if (ret <= 0) - return ret; - - ret = -EINVAL; - - /* - * We are interested in text pages only. Our pages of interest - * should be mapped for read and execute only. We desist from - * adding probes in write mapped pages since the breakpoints - * might end up in the file copy. - */ - if (!valid_vma(vma, is_bkpt_insn(&opcode))) - goto put_out; - - mapping = uprobe->inode->i_mapping; - if (mapping != vma->vm_file->f_mapping) - goto put_out; - - addr = vma_address(vma, uprobe->offset); - if (vaddr != (unsigned long)addr) - goto put_out; - - ret = -ENOMEM; - new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vaddr); - if (!new_page) - goto put_out; - - __SetPageUptodate(new_page); - - /* - * lock page will serialize against do_wp_page()'s - * PageAnon() handling - */ - lock_page(old_page); - /* copy the page now that we've got it stable */ - vaddr_old = kmap_atomic(old_page); - vaddr_new = kmap_atomic(new_page); - - memcpy(vaddr_new, vaddr_old, PAGE_SIZE); - - /* poke the new insn in, ASSUMES we don't cross page boundary */ - vaddr &= ~PAGE_MASK; - BUG_ON(vaddr + uprobe_opcode_sz > PAGE_SIZE); - memcpy(vaddr_new + vaddr, &opcode, uprobe_opcode_sz); - - kunmap_atomic(vaddr_new); - kunmap_atomic(vaddr_old); - - ret = anon_vma_prepare(vma); - if (ret) - goto unlock_out; - - lock_page(new_page); - ret = __replace_page(vma, old_page, new_page); - unlock_page(new_page); - -unlock_out: - unlock_page(old_page); - page_cache_release(new_page); - -put_out: - put_page(old_page); - - return ret; -} - -/** - * read_opcode - read the opcode at a given virtual address. - * @mm: the probed process address space. - * @vaddr: the virtual address to read the opcode. - * @opcode: location to store the read opcode. - * - * Called with mm->mmap_sem held (for read and with a reference to - * mm. - * - * For mm @mm, read the opcode at @vaddr and store it in @opcode. - * Return 0 (success) or a negative errno. - */ -static int read_opcode(struct mm_struct *mm, unsigned long vaddr, uprobe_opcode_t *opcode) -{ - struct page *page; - void *vaddr_new; - int ret; - - ret = get_user_pages(NULL, mm, vaddr, 1, 0, 0, &page, NULL); - if (ret <= 0) - return ret; - - lock_page(page); - vaddr_new = kmap_atomic(page); - vaddr &= ~PAGE_MASK; - memcpy(opcode, vaddr_new + vaddr, uprobe_opcode_sz); - kunmap_atomic(vaddr_new); - unlock_page(page); - - put_page(page); - - return 0; -} - -static int is_bkpt_at_addr(struct mm_struct *mm, unsigned long vaddr) -{ - uprobe_opcode_t opcode; - int result; - - result = read_opcode(mm, vaddr, &opcode); - if (result) - return result; - - if (is_bkpt_insn(&opcode)) - return 1; - - return 0; -} - -/** - * set_bkpt - store breakpoint at a given address. - * @mm: the probed process address space. - * @uprobe: the probepoint information. - * @vaddr: the virtual address to insert the opcode. - * - * For mm @mm, store the breakpoint instruction at @vaddr. - * Return 0 (success) or a negative errno. 
- */ -int __weak set_bkpt(struct mm_struct *mm, struct uprobe *uprobe, unsigned long vaddr) -{ - int result; - - result = is_bkpt_at_addr(mm, vaddr); - if (result == 1) - return -EEXIST; - - if (result) - return result; - - return write_opcode(mm, uprobe, vaddr, UPROBES_BKPT_INSN); -} - -/** - * set_orig_insn - Restore the original instruction. - * @mm: the probed process address space. - * @uprobe: the probepoint information. - * @vaddr: the virtual address to insert the opcode. - * @verify: if true, verify existance of breakpoint instruction. - * - * For mm @mm, restore the original opcode (opcode) at @vaddr. - * Return 0 (success) or a negative errno. - */ -int __weak -set_orig_insn(struct mm_struct *mm, struct uprobe *uprobe, unsigned long vaddr, bool verify) -{ - if (verify) { - int result; - - result = is_bkpt_at_addr(mm, vaddr); - if (!result) - return -EINVAL; - - if (result != 1) - return result; - } - return write_opcode(mm, uprobe, vaddr, *(uprobe_opcode_t *)uprobe->insn); -} - -static int match_uprobe(struct uprobe *l, struct uprobe *r) -{ - if (l->inode < r->inode) - return -1; - - if (l->inode > r->inode) - return 1; - - if (l->offset < r->offset) - return -1; - - if (l->offset > r->offset) - return 1; - - return 0; -} - -static struct uprobe *__find_uprobe(struct inode *inode, loff_t offset) -{ - struct uprobe u = { .inode = inode, .offset = offset }; - struct rb_node *n = uprobes_tree.rb_node; - struct uprobe *uprobe; - int match; - - while (n) { - uprobe = rb_entry(n, struct uprobe, rb_node); - match = match_uprobe(&u, uprobe); - if (!match) { - atomic_inc(&uprobe->ref); - return uprobe; - } - - if (match < 0) - n = n->rb_left; - else - n = n->rb_right; - } - return NULL; -} - -/* - * Find a uprobe corresponding to a given inode:offset - * Acquires uprobes_treelock - */ -static struct uprobe *find_uprobe(struct inode *inode, loff_t offset) -{ - struct uprobe *uprobe; - unsigned long flags; - - spin_lock_irqsave(&uprobes_treelock, flags); - uprobe = __find_uprobe(inode, offset); - spin_unlock_irqrestore(&uprobes_treelock, flags); - - return uprobe; -} - -static struct uprobe *__insert_uprobe(struct uprobe *uprobe) -{ - struct rb_node **p = &uprobes_tree.rb_node; - struct rb_node *parent = NULL; - struct uprobe *u; - int match; - - while (*p) { - parent = *p; - u = rb_entry(parent, struct uprobe, rb_node); - match = match_uprobe(uprobe, u); - if (!match) { - atomic_inc(&u->ref); - return u; - } - - if (match < 0) - p = &parent->rb_left; - else - p = &parent->rb_right; - - } - - u = NULL; - rb_link_node(&uprobe->rb_node, parent, p); - rb_insert_color(&uprobe->rb_node, &uprobes_tree); - /* get access + creation ref */ - atomic_set(&uprobe->ref, 2); - - return u; -} - -/* - * Acquire uprobes_treelock. - * Matching uprobe already exists in rbtree; - * increment (access refcount) and return the matching uprobe. - * - * No matching uprobe; insert the uprobe in rb_tree; - * get a double refcount (access + creation) and return NULL. 
- */ -static struct uprobe *insert_uprobe(struct uprobe *uprobe) -{ - unsigned long flags; - struct uprobe *u; - - spin_lock_irqsave(&uprobes_treelock, flags); - u = __insert_uprobe(uprobe); - spin_unlock_irqrestore(&uprobes_treelock, flags); - - return u; -} - -static void put_uprobe(struct uprobe *uprobe) -{ - if (atomic_dec_and_test(&uprobe->ref)) - kfree(uprobe); -} - -static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset) -{ - struct uprobe *uprobe, *cur_uprobe; - - uprobe = kzalloc(sizeof(struct uprobe), GFP_KERNEL); - if (!uprobe) - return NULL; - - uprobe->inode = igrab(inode); - uprobe->offset = offset; - init_rwsem(&uprobe->consumer_rwsem); - INIT_LIST_HEAD(&uprobe->pending_list); - - /* add to uprobes_tree, sorted on inode:offset */ - cur_uprobe = insert_uprobe(uprobe); - - /* a uprobe exists for this inode:offset combination */ - if (cur_uprobe) { - kfree(uprobe); - uprobe = cur_uprobe; - iput(inode); - } else { - atomic_inc(&uprobe_events); - } - - return uprobe; -} - -/* Returns the previous consumer */ -static struct uprobe_consumer * -consumer_add(struct uprobe *uprobe, struct uprobe_consumer *consumer) -{ - down_write(&uprobe->consumer_rwsem); - consumer->next = uprobe->consumers; - uprobe->consumers = consumer; - up_write(&uprobe->consumer_rwsem); - - return consumer->next; -} - -/* - * For uprobe @uprobe, delete the consumer @consumer. - * Return true if the @consumer is deleted successfully - * or return false. - */ -static bool consumer_del(struct uprobe *uprobe, struct uprobe_consumer *consumer) -{ - struct uprobe_consumer **con; - bool ret = false; - - down_write(&uprobe->consumer_rwsem); - for (con = &uprobe->consumers; *con; con = &(*con)->next) { - if (*con == consumer) { - *con = consumer->next; - ret = true; - break; - } - } - up_write(&uprobe->consumer_rwsem); - - return ret; -} - -static int __copy_insn(struct address_space *mapping, - struct vm_area_struct *vma, char *insn, - unsigned long nbytes, unsigned long offset) -{ - struct file *filp = vma->vm_file; - struct page *page; - void *vaddr; - unsigned long off1; - unsigned long idx; - - if (!filp) - return -EINVAL; - - idx = (unsigned long)(offset >> PAGE_CACHE_SHIFT); - off1 = offset &= ~PAGE_MASK; - - /* - * Ensure that the page that has the original instruction is - * populated and in page-cache. 
- */ - page = read_mapping_page(mapping, idx, filp); - if (IS_ERR(page)) - return PTR_ERR(page); - - vaddr = kmap_atomic(page); - memcpy(insn, vaddr + off1, nbytes); - kunmap_atomic(vaddr); - page_cache_release(page); - - return 0; -} - -static int copy_insn(struct uprobe *uprobe, struct vm_area_struct *vma, unsigned long addr) -{ - struct address_space *mapping; - unsigned long nbytes; - int bytes; - - addr &= ~PAGE_MASK; - nbytes = PAGE_SIZE - addr; - mapping = uprobe->inode->i_mapping; - - /* Instruction at end of binary; copy only available bytes */ - if (uprobe->offset + MAX_UINSN_BYTES > uprobe->inode->i_size) - bytes = uprobe->inode->i_size - uprobe->offset; - else - bytes = MAX_UINSN_BYTES; - - /* Instruction at the page-boundary; copy bytes in second page */ - if (nbytes < bytes) { - if (__copy_insn(mapping, vma, uprobe->insn + nbytes, - bytes - nbytes, uprobe->offset + nbytes)) - return -ENOMEM; - - bytes = nbytes; - } - return __copy_insn(mapping, vma, uprobe->insn, bytes, uprobe->offset); -} - -static int install_breakpoint(struct mm_struct *mm, struct uprobe *uprobe, - struct vm_area_struct *vma, loff_t vaddr) -{ - unsigned long addr; - int ret; - - /* - * If probe is being deleted, unregister thread could be done with - * the vma-rmap-walk through. Adding a probe now can be fatal since - * nobody will be able to cleanup. Also we could be from fork or - * mremap path, where the probe might have already been inserted. - * Hence behave as if probe already existed. - */ - if (!uprobe->consumers) - return -EEXIST; - - addr = (unsigned long)vaddr; - - if (!(uprobe->flags & UPROBES_COPY_INSN)) { - ret = copy_insn(uprobe, vma, addr); - if (ret) - return ret; - - if (is_bkpt_insn((uprobe_opcode_t *)uprobe->insn)) - return -EEXIST; - - ret = arch_uprobes_analyze_insn(mm, uprobe); - if (ret) - return ret; - - uprobe->flags |= UPROBES_COPY_INSN; - } - ret = set_bkpt(mm, uprobe, addr); - - return ret; -} - -static void remove_breakpoint(struct mm_struct *mm, struct uprobe *uprobe, loff_t vaddr) -{ - set_orig_insn(mm, uprobe, (unsigned long)vaddr, true); -} - -static void delete_uprobe(struct uprobe *uprobe) -{ - unsigned long flags; - - spin_lock_irqsave(&uprobes_treelock, flags); - rb_erase(&uprobe->rb_node, &uprobes_tree); - spin_unlock_irqrestore(&uprobes_treelock, flags); - iput(uprobe->inode); - put_uprobe(uprobe); - atomic_dec(&uprobe_events); -} - -static struct vma_info *__find_next_vma_info(struct list_head *head, - loff_t offset, struct address_space *mapping, - struct vma_info *vi, bool is_register) -{ - struct prio_tree_iter iter; - struct vm_area_struct *vma; - struct vma_info *tmpvi; - unsigned long pgoff; - int existing_vma; - loff_t vaddr; - - pgoff = offset >> PAGE_SHIFT; - - vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) { - if (!valid_vma(vma, is_register)) - continue; - - existing_vma = 0; - vaddr = vma_address(vma, offset); - - list_for_each_entry(tmpvi, head, probe_list) { - if (tmpvi->mm == vma->vm_mm && tmpvi->vaddr == vaddr) { - existing_vma = 1; - break; - } - } - - /* - * Another vma needs a probe to be installed. However skip - * installing the probe if the vma is about to be unlinked. - */ - if (!existing_vma && atomic_inc_not_zero(&vma->vm_mm->mm_users)) { - vi->mm = vma->vm_mm; - vi->vaddr = vaddr; - list_add(&vi->probe_list, head); - - return vi; - } - } - - return NULL; -} - -/* - * Iterate in the rmap prio tree and find a vma where a probe has not - * yet been inserted. 
- */ -static struct vma_info * -find_next_vma_info(struct list_head *head, loff_t offset, struct address_space *mapping, - bool is_register) -{ - struct vma_info *vi, *retvi; - - vi = kzalloc(sizeof(struct vma_info), GFP_KERNEL); - if (!vi) - return ERR_PTR(-ENOMEM); - - mutex_lock(&mapping->i_mmap_mutex); - retvi = __find_next_vma_info(head, offset, mapping, vi, is_register); - mutex_unlock(&mapping->i_mmap_mutex); - - if (!retvi) - kfree(vi); - - return retvi; -} - -static int register_for_each_vma(struct uprobe *uprobe, bool is_register) -{ - struct list_head try_list; - struct vm_area_struct *vma; - struct address_space *mapping; - struct vma_info *vi, *tmpvi; - struct mm_struct *mm; - loff_t vaddr; - int ret; - - mapping = uprobe->inode->i_mapping; - INIT_LIST_HEAD(&try_list); - - ret = 0; - - for (;;) { - vi = find_next_vma_info(&try_list, uprobe->offset, mapping, is_register); - if (!vi) - break; - - if (IS_ERR(vi)) { - ret = PTR_ERR(vi); - break; - } - - mm = vi->mm; - down_read(&mm->mmap_sem); - vma = find_vma(mm, (unsigned long)vi->vaddr); - if (!vma || !valid_vma(vma, is_register)) { - list_del(&vi->probe_list); - kfree(vi); - up_read(&mm->mmap_sem); - mmput(mm); - continue; - } - vaddr = vma_address(vma, uprobe->offset); - if (vma->vm_file->f_mapping->host != uprobe->inode || - vaddr != vi->vaddr) { - list_del(&vi->probe_list); - kfree(vi); - up_read(&mm->mmap_sem); - mmput(mm); - continue; - } - - if (is_register) - ret = install_breakpoint(mm, uprobe, vma, vi->vaddr); - else - remove_breakpoint(mm, uprobe, vi->vaddr); - - up_read(&mm->mmap_sem); - mmput(mm); - if (is_register) { - if (ret && ret == -EEXIST) - ret = 0; - if (ret) - break; - } - } - - list_for_each_entry_safe(vi, tmpvi, &try_list, probe_list) { - list_del(&vi->probe_list); - kfree(vi); - } - - return ret; -} - -static int __uprobe_register(struct uprobe *uprobe) -{ - return register_for_each_vma(uprobe, true); -} - -static void __uprobe_unregister(struct uprobe *uprobe) -{ - if (!register_for_each_vma(uprobe, false)) - delete_uprobe(uprobe); - - /* TODO : cant unregister? schedule a worker thread */ -} - -/* - * uprobe_register - register a probe - * @inode: the file in which the probe has to be placed. - * @offset: offset from the start of the file. - * @consumer: information on howto handle the probe.. - * - * Apart from the access refcount, uprobe_register() takes a creation - * refcount (thro alloc_uprobe) if and only if this @uprobe is getting - * inserted into the rbtree (i.e first consumer for a @inode:@offset - * tuple). Creation refcount stops uprobe_unregister from freeing the - * @uprobe even before the register operation is complete. Creation - * refcount is released when the last @consumer for the @uprobe - * unregisters. 
- * - * Return errno if it cannot successully install probes - * else return 0 (success) - */ -int uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *consumer) -{ - struct uprobe *uprobe; - int ret; - - if (!inode || !consumer || consumer->next) - return -EINVAL; - - if (offset > i_size_read(inode)) - return -EINVAL; - - ret = 0; - mutex_lock(uprobes_hash(inode)); - uprobe = alloc_uprobe(inode, offset); - - if (uprobe && !consumer_add(uprobe, consumer)) { - ret = __uprobe_register(uprobe); - if (ret) { - uprobe->consumers = NULL; - __uprobe_unregister(uprobe); - } else { - uprobe->flags |= UPROBES_RUN_HANDLER; - } - } - - mutex_unlock(uprobes_hash(inode)); - put_uprobe(uprobe); - - return ret; -} - -/* - * uprobe_unregister - unregister a already registered probe. - * @inode: the file in which the probe has to be removed. - * @offset: offset from the start of the file. - * @consumer: identify which probe if multiple probes are colocated. - */ -void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *consumer) -{ - struct uprobe *uprobe; - - if (!inode || !consumer) - return; - - uprobe = find_uprobe(inode, offset); - if (!uprobe) - return; - - mutex_lock(uprobes_hash(inode)); - - if (consumer_del(uprobe, consumer)) { - if (!uprobe->consumers) { - __uprobe_unregister(uprobe); - uprobe->flags &= ~UPROBES_RUN_HANDLER; - } - } - - mutex_unlock(uprobes_hash(inode)); - if (uprobe) - put_uprobe(uprobe); -} - -/* - * Of all the nodes that correspond to the given inode, return the node - * with the least offset. - */ -static struct rb_node *find_least_offset_node(struct inode *inode) -{ - struct uprobe u = { .inode = inode, .offset = 0}; - struct rb_node *n = uprobes_tree.rb_node; - struct rb_node *close_node = NULL; - struct uprobe *uprobe; - int match; - - while (n) { - uprobe = rb_entry(n, struct uprobe, rb_node); - match = match_uprobe(&u, uprobe); - - if (uprobe->inode == inode) - close_node = n; - - if (!match) - return close_node; - - if (match < 0) - n = n->rb_left; - else - n = n->rb_right; - } - - return close_node; -} - -/* - * For a given inode, build a list of probes that need to be inserted. - */ -static void build_probe_list(struct inode *inode, struct list_head *head) -{ - struct uprobe *uprobe; - unsigned long flags; - struct rb_node *n; - - spin_lock_irqsave(&uprobes_treelock, flags); - - n = find_least_offset_node(inode); - - for (; n; n = rb_next(n)) { - uprobe = rb_entry(n, struct uprobe, rb_node); - if (uprobe->inode != inode) - break; - - list_add(&uprobe->pending_list, head); - atomic_inc(&uprobe->ref); - } - - spin_unlock_irqrestore(&uprobes_treelock, flags); -} - -/* - * Called from mmap_region. - * called with mm->mmap_sem acquired. - * - * Return -ve no if we fail to insert probes and we cannot - * bail-out. - * Return 0 otherwise. i.e: - * - * - successful insertion of probes - * - (or) no possible probes to be inserted. - * - (or) insertion of probes failed but we can bail-out. 
- */ -int uprobe_mmap(struct vm_area_struct *vma) -{ - struct list_head tmp_list; - struct uprobe *uprobe, *u; - struct inode *inode; - int ret; - - if (!atomic_read(&uprobe_events) || !valid_vma(vma, true)) - return 0; - - inode = vma->vm_file->f_mapping->host; - if (!inode) - return 0; - - INIT_LIST_HEAD(&tmp_list); - mutex_lock(uprobes_mmap_hash(inode)); - build_probe_list(inode, &tmp_list); - - ret = 0; - - list_for_each_entry_safe(uprobe, u, &tmp_list, pending_list) { - loff_t vaddr; - - list_del(&uprobe->pending_list); - if (!ret) { - vaddr = vma_address(vma, uprobe->offset); - if (vaddr >= vma->vm_start && vaddr < vma->vm_end) { - ret = install_breakpoint(vma->vm_mm, uprobe, vma, vaddr); - /* Ignore double add: */ - if (ret == -EEXIST) - ret = 0; - } - } - put_uprobe(uprobe); - } - - mutex_unlock(uprobes_mmap_hash(inode)); - - return ret; -} - -static int __init init_uprobes(void) -{ - int i; - - for (i = 0; i < UPROBES_HASH_SZ; i++) { - mutex_init(&uprobes_mutex[i]); - mutex_init(&uprobes_mmap_mutex[i]); - } - return 0; -} - -static void __exit exit_uprobes(void) -{ -} - -module_init(init_uprobes); -module_exit(exit_uprobes); -- cgit v1.2.3 From 96379f60075c75b261328aa7830ef8aa158247ac Mon Sep 17 00:00:00 2001 From: Srikar Dronamraju Date: Wed, 22 Feb 2012 14:45:49 +0530 Subject: uprobes/core: Remove uprobe_opcode_sz uprobe_opcode_sz refers to the smallest instruction size for that architecture. UPROBES_BKPT_INSN_SIZE refers to the size of the breakpoint instruction for that architecture. For now we are assuming that both uprobe_opcode_sz and UPROBES_BKPT_INSN_SIZE are the same for all archs and hence removing uprobe_opcode_sz in favour of UPROBES_BKPT_INSN_SIZE. However if we have to support architectures where the smallest instruction size is different from the size of breakpoint instruction, we may have to re-introduce uprobe_opcode_sz. Signed-off-by: Srikar Dronamraju Cc: Peter Zijlstra Cc: Linus Torvalds Cc: Oleg Nesterov Cc: Christoph Hellwig Cc: Steven Rostedt Cc: Masami Hiramatsu Cc: Anton Arapov Cc: Ananth N Mavinakayanahalli Cc: Jim Keniston Cc: Jiri Olsa Cc: Josh Stone Link: http://lkml.kernel.org/r/20120222091549.15880.67020.sendpatchset@srdronam.in.ibm.com Signed-off-by: Ingo Molnar --- kernel/events/uprobes.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 884817f1b0d3..ee496ad95db3 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -244,8 +244,8 @@ static int write_opcode(struct mm_struct *mm, struct uprobe *uprobe, /* poke the new insn in, ASSUMES we don't cross page boundary */ vaddr &= ~PAGE_MASK; - BUG_ON(vaddr + uprobe_opcode_sz > PAGE_SIZE); - memcpy(vaddr_new + vaddr, &opcode, uprobe_opcode_sz); + BUG_ON(vaddr + UPROBES_BKPT_INSN_SIZE > PAGE_SIZE); + memcpy(vaddr_new + vaddr, &opcode, UPROBES_BKPT_INSN_SIZE); kunmap_atomic(vaddr_new); kunmap_atomic(vaddr_old); @@ -293,7 +293,7 @@ static int read_opcode(struct mm_struct *mm, unsigned long vaddr, uprobe_opcode_ lock_page(page); vaddr_new = kmap_atomic(page); vaddr &= ~PAGE_MASK; - memcpy(opcode, vaddr_new + vaddr, uprobe_opcode_sz); + memcpy(opcode, vaddr_new + vaddr, UPROBES_BKPT_INSN_SIZE); kunmap_atomic(vaddr_new); unlock_page(page); -- cgit v1.2.3 From 3ff54efdfaace9e9b2b7c1959a865be6b91de96c Mon Sep 17 00:00:00 2001 From: Srikar Dronamraju Date: Wed, 22 Feb 2012 14:46:02 +0530 Subject: uprobes/core: Move insn to arch specific structure Few cleanups suggested by Ingo Molnar. 
- Rename struct uprobe_arch_info to struct arch_uprobe. - Move insn from struct uprobe to struct arch_uprobe. - Make arch specific uprobe functions to accept struct arch_uprobe instead of struct uprobe. - Move struct uprobe to kernel/uprobes.c from include/linux/uprobes.h Signed-off-by: Srikar Dronamraju Cc: Peter Zijlstra Cc: Linus Torvalds Cc: Oleg Nesterov Cc: Christoph Hellwig Cc: Steven Rostedt Cc: Masami Hiramatsu Cc: Anton Arapov Cc: Ananth N Mavinakayanahalli Cc: Jim Keniston Cc: Jiri Olsa Cc: Josh Stone Link: http://lkml.kernel.org/r/20120222091602.15880.40249.sendpatchset@srdronam.in.ibm.com [ Made various small improvements ] Signed-off-by: Ingo Molnar --- kernel/events/uprobes.c | 38 ++++++++++++++++++++++++++------------ 1 file changed, 26 insertions(+), 12 deletions(-) (limited to 'kernel') diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index ee496ad95db3..13f1b5909af4 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -65,6 +65,18 @@ struct vma_info { loff_t vaddr; }; +struct uprobe { + struct rb_node rb_node; /* node in the rb tree */ + atomic_t ref; + struct rw_semaphore consumer_rwsem; + struct list_head pending_list; + struct uprobe_consumer *consumers; + struct inode *inode; /* Also hold a ref to inode */ + loff_t offset; + int flags; + struct arch_uprobe arch; +}; + /* * valid_vma: Verify if the specified vma is an executable vma * Relax restrictions while unregistering: vm_flags might have @@ -180,7 +192,7 @@ bool __weak is_bkpt_insn(uprobe_opcode_t *insn) /* * write_opcode - write the opcode at a given virtual address. * @mm: the probed process address space. - * @uprobe: the breakpointing information. + * @arch_uprobe: the breakpointing information. * @vaddr: the virtual address to store the opcode. * @opcode: opcode to be written at @vaddr. * @@ -190,13 +202,14 @@ bool __weak is_bkpt_insn(uprobe_opcode_t *insn) * For mm @mm, write the opcode at @vaddr. * Return 0 (success) or a negative errno. */ -static int write_opcode(struct mm_struct *mm, struct uprobe *uprobe, +static int write_opcode(struct mm_struct *mm, struct arch_uprobe *auprobe, unsigned long vaddr, uprobe_opcode_t opcode) { struct page *old_page, *new_page; struct address_space *mapping; void *vaddr_old, *vaddr_new; struct vm_area_struct *vma; + struct uprobe *uprobe; loff_t addr; int ret; @@ -216,6 +229,7 @@ static int write_opcode(struct mm_struct *mm, struct uprobe *uprobe, if (!valid_vma(vma, is_bkpt_insn(&opcode))) goto put_out; + uprobe = container_of(auprobe, struct uprobe, arch); mapping = uprobe->inode->i_mapping; if (mapping != vma->vm_file->f_mapping) goto put_out; @@ -326,7 +340,7 @@ static int is_bkpt_at_addr(struct mm_struct *mm, unsigned long vaddr) * For mm @mm, store the breakpoint instruction at @vaddr. * Return 0 (success) or a negative errno. */ -int __weak set_bkpt(struct mm_struct *mm, struct uprobe *uprobe, unsigned long vaddr) +int __weak set_bkpt(struct mm_struct *mm, struct arch_uprobe *auprobe, unsigned long vaddr) { int result; @@ -337,7 +351,7 @@ int __weak set_bkpt(struct mm_struct *mm, struct uprobe *uprobe, unsigned long v if (result) return result; - return write_opcode(mm, uprobe, vaddr, UPROBES_BKPT_INSN); + return write_opcode(mm, auprobe, vaddr, UPROBES_BKPT_INSN); } /** @@ -351,7 +365,7 @@ int __weak set_bkpt(struct mm_struct *mm, struct uprobe *uprobe, unsigned long v * Return 0 (success) or a negative errno. 
*/ int __weak -set_orig_insn(struct mm_struct *mm, struct uprobe *uprobe, unsigned long vaddr, bool verify) +set_orig_insn(struct mm_struct *mm, struct arch_uprobe *auprobe, unsigned long vaddr, bool verify) { if (verify) { int result; @@ -363,7 +377,7 @@ set_orig_insn(struct mm_struct *mm, struct uprobe *uprobe, unsigned long vaddr, if (result != 1) return result; } - return write_opcode(mm, uprobe, vaddr, *(uprobe_opcode_t *)uprobe->insn); + return write_opcode(mm, auprobe, vaddr, *(uprobe_opcode_t *)auprobe->insn); } static int match_uprobe(struct uprobe *l, struct uprobe *r) @@ -593,13 +607,13 @@ static int copy_insn(struct uprobe *uprobe, struct vm_area_struct *vma, unsigned /* Instruction at the page-boundary; copy bytes in second page */ if (nbytes < bytes) { - if (__copy_insn(mapping, vma, uprobe->insn + nbytes, + if (__copy_insn(mapping, vma, uprobe->arch.insn + nbytes, bytes - nbytes, uprobe->offset + nbytes)) return -ENOMEM; bytes = nbytes; } - return __copy_insn(mapping, vma, uprobe->insn, bytes, uprobe->offset); + return __copy_insn(mapping, vma, uprobe->arch.insn, bytes, uprobe->offset); } static int install_breakpoint(struct mm_struct *mm, struct uprobe *uprobe, @@ -625,23 +639,23 @@ static int install_breakpoint(struct mm_struct *mm, struct uprobe *uprobe, if (ret) return ret; - if (is_bkpt_insn((uprobe_opcode_t *)uprobe->insn)) + if (is_bkpt_insn((uprobe_opcode_t *)uprobe->arch.insn)) return -EEXIST; - ret = arch_uprobes_analyze_insn(mm, uprobe); + ret = arch_uprobes_analyze_insn(mm, &uprobe->arch); if (ret) return ret; uprobe->flags |= UPROBES_COPY_INSN; } - ret = set_bkpt(mm, uprobe, addr); + ret = set_bkpt(mm, &uprobe->arch, addr); return ret; } static void remove_breakpoint(struct mm_struct *mm, struct uprobe *uprobe, loff_t vaddr) { - set_orig_insn(mm, uprobe, (unsigned long)vaddr, true); + set_orig_insn(mm, &uprobe->arch, (unsigned long)vaddr, true); } static void delete_uprobe(struct uprobe *uprobe) -- cgit v1.2.3 From 35aa621b5ab9d08767f7bc8d209b696df281d715 Mon Sep 17 00:00:00 2001 From: Ingo Molnar Date: Wed, 22 Feb 2012 11:37:29 +0100 Subject: uprobes: Update copyright notices Add Peter Zijlstra's copyright to the uprobes code, whose contributions to the uprobes code are not visible in the Git history, because they were backmerged. Also update existing copyright notices to the year 2012. Acked-by: Srikar Dronamraju Cc: Peter Zijlstra Cc: Ananth N Mavinakayanahalli Cc: Jim Keniston Link: http://lkml.kernel.org/n/tip-vjqxst502pc1efz7ah8cyht4@git.kernel.org Signed-off-by: Ingo Molnar --- kernel/events/uprobes.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 13f1b5909af4..5ce32e3ae9e9 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -15,10 +15,11 @@ * along with this program; if not, write to the Free Software * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. * - * Copyright (C) IBM Corporation, 2008-2011 + * Copyright (C) IBM Corporation, 2008-2012 * Authors: * Srikar Dronamraju * Jim Keniston + * Copyright (C) 2011-2012 Red Hat, Inc., Peter Zijlstra */ #include -- cgit v1.2.3 From 3d48749d93a3dce732dd30a14002ab90ec4355f3 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Mon, 5 Mar 2012 13:15:25 -0800 Subject: block: ioc_task_link() can't fail ioc_task_link() is used to share %current's ioc on clone. If %current->io_context is set, %current is guaranteed to have refcount on the ioc and, thus, ioc_task_link() can't fail. 
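Since the caller is guaranteed to hold a reference, the helper no longer needs a return value at all. The iocontext.h side of the change is not visible in this 'kernel'-limited diff; as a rough sketch only (assuming the io_context layout of that era, with an atomic_long_t refcount and an atomic_t nr_tasks), the reworked helper would look something like:

static inline void ioc_task_link(struct io_context *ioc)
{
	/* %current already holds a reference, so both counts must be positive */
	WARN_ON_ONCE(atomic_long_read(&ioc->refcount) <= 0);
	WARN_ON_ONCE(atomic_read(&ioc->nr_tasks) <= 0);

	atomic_long_inc(&ioc->refcount);
	atomic_inc(&ioc->nr_tasks);
}

With the failure case gone, copy_io() in kernel/fork.c can drop its -ENOMEM error path, as the hunk below shows.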
Replace error checking in ioc_task_link() with WARN_ON_ONCE() and make it just increment refcount and nr_tasks. -v2: Description typo fix (Vivek). Signed-off-by: Tejun Heo Cc: Vivek Goyal Signed-off-by: Jens Axboe --- kernel/fork.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/fork.c b/kernel/fork.c index b77fd559c78e..a1b632713e43 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -901,9 +901,8 @@ static int copy_io(unsigned long clone_flags, struct task_struct *tsk) * Share io context with parent, if CLONE_IO is set */ if (clone_flags & CLONE_IO) { - tsk->io_context = ioc_task_link(ioc); - if (unlikely(!tsk->io_context)) - return -ENOMEM; + ioc_task_link(ioc); + tsk->io_context = ioc; } else if (ioprio_valid(ioc->ioprio)) { new_ioc = get_task_io_context(tsk, GFP_KERNEL, NUMA_NO_NODE); if (unlikely(!new_ioc)) -- cgit v1.2.3 From 900771a483ef28915a48066d7895d8252315607a Mon Sep 17 00:00:00 2001 From: Srikar Dronamraju Date: Mon, 12 Mar 2012 14:55:14 +0530 Subject: uprobes/core: Make macro names consistent Rename macros that refer to individual uprobe to start with UPROBE_ instead of UPROBES_. This is pure cleanup, no functional change intended. Signed-off-by: Srikar Dronamraju Cc: Linus Torvalds Cc: Ananth N Mavinakayanahalli Cc: Jim Keniston Cc: Linux-mm Cc: Oleg Nesterov Cc: Andi Kleen Cc: Christoph Hellwig Cc: Steven Rostedt Cc: Arnaldo Carvalho de Melo Cc: Masami Hiramatsu Cc: Peter Zijlstra Link: http://lkml.kernel.org/r/20120312092514.5379.36595.sendpatchset@srdronam.in.ibm.com Signed-off-by: Ingo Molnar --- kernel/events/uprobes.c | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) (limited to 'kernel') diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 5ce32e3ae9e9..0d36bf3920ba 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -177,7 +177,7 @@ out: */ bool __weak is_bkpt_insn(uprobe_opcode_t *insn) { - return *insn == UPROBES_BKPT_INSN; + return *insn == UPROBE_BKPT_INSN; } /* @@ -259,8 +259,8 @@ static int write_opcode(struct mm_struct *mm, struct arch_uprobe *auprobe, /* poke the new insn in, ASSUMES we don't cross page boundary */ vaddr &= ~PAGE_MASK; - BUG_ON(vaddr + UPROBES_BKPT_INSN_SIZE > PAGE_SIZE); - memcpy(vaddr_new + vaddr, &opcode, UPROBES_BKPT_INSN_SIZE); + BUG_ON(vaddr + UPROBE_BKPT_INSN_SIZE > PAGE_SIZE); + memcpy(vaddr_new + vaddr, &opcode, UPROBE_BKPT_INSN_SIZE); kunmap_atomic(vaddr_new); kunmap_atomic(vaddr_old); @@ -308,7 +308,7 @@ static int read_opcode(struct mm_struct *mm, unsigned long vaddr, uprobe_opcode_ lock_page(page); vaddr_new = kmap_atomic(page); vaddr &= ~PAGE_MASK; - memcpy(opcode, vaddr_new + vaddr, UPROBES_BKPT_INSN_SIZE); + memcpy(opcode, vaddr_new + vaddr, UPROBE_BKPT_INSN_SIZE); kunmap_atomic(vaddr_new); unlock_page(page); @@ -352,7 +352,7 @@ int __weak set_bkpt(struct mm_struct *mm, struct arch_uprobe *auprobe, unsigned if (result) return result; - return write_opcode(mm, auprobe, vaddr, UPROBES_BKPT_INSN); + return write_opcode(mm, auprobe, vaddr, UPROBE_BKPT_INSN); } /** @@ -635,7 +635,7 @@ static int install_breakpoint(struct mm_struct *mm, struct uprobe *uprobe, addr = (unsigned long)vaddr; - if (!(uprobe->flags & UPROBES_COPY_INSN)) { + if (!(uprobe->flags & UPROBE_COPY_INSN)) { ret = copy_insn(uprobe, vma, addr); if (ret) return ret; @@ -647,7 +647,7 @@ static int install_breakpoint(struct mm_struct *mm, struct uprobe *uprobe, if (ret) return ret; - uprobe->flags |= UPROBES_COPY_INSN; + uprobe->flags |= UPROBE_COPY_INSN; } ret 
= set_bkpt(mm, &uprobe->arch, addr); @@ -857,7 +857,7 @@ int uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer * uprobe->consumers = NULL; __uprobe_unregister(uprobe); } else { - uprobe->flags |= UPROBES_RUN_HANDLER; + uprobe->flags |= UPROBE_RUN_HANDLER; } } @@ -889,7 +889,7 @@ void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consume if (consumer_del(uprobe, consumer)) { if (!uprobe->consumers) { __uprobe_unregister(uprobe); - uprobe->flags &= ~UPROBES_RUN_HANDLER; + uprobe->flags &= ~UPROBE_RUN_HANDLER; } } -- cgit v1.2.3 From e3343e6a2819ff5d0dfc4bb5c9fb7f9a4d04da73 Mon Sep 17 00:00:00 2001 From: Srikar Dronamraju Date: Mon, 12 Mar 2012 14:55:30 +0530 Subject: uprobes/core: Make order of function parameters consistent across functions If a function takes struct uprobe or struct arch_uprobe, then it is passed as the first parameter. This is pure cleanup, no functional change intended. Signed-off-by: Srikar Dronamraju Cc: Linus Torvalds Cc: Ananth N Mavinakayanahalli Cc: Jim Keniston Cc: Linux-mm Cc: Oleg Nesterov Cc: Andi Kleen Cc: Christoph Hellwig Cc: Steven Rostedt Cc: Arnaldo Carvalho de Melo Cc: Masami Hiramatsu Cc: Peter Zijlstra Link: http://lkml.kernel.org/r/20120312092530.5379.18394.sendpatchset@srdronam.in.ibm.com Signed-off-by: Ingo Molnar --- kernel/events/uprobes.c | 93 +++++++++++++++++++++++++------------------------ 1 file changed, 48 insertions(+), 45 deletions(-) (limited to 'kernel') diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 0d36bf3920ba..9c5ddff1c8da 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -192,8 +192,8 @@ bool __weak is_bkpt_insn(uprobe_opcode_t *insn) /* * write_opcode - write the opcode at a given virtual address. + * @auprobe: arch breakpointing information. * @mm: the probed process address space. - * @arch_uprobe: the breakpointing information. * @vaddr: the virtual address to store the opcode. * @opcode: opcode to be written at @vaddr. * @@ -203,7 +203,7 @@ bool __weak is_bkpt_insn(uprobe_opcode_t *insn) * For mm @mm, write the opcode at @vaddr. * Return 0 (success) or a negative errno. */ -static int write_opcode(struct mm_struct *mm, struct arch_uprobe *auprobe, +static int write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr, uprobe_opcode_t opcode) { struct page *old_page, *new_page; @@ -334,14 +334,14 @@ static int is_bkpt_at_addr(struct mm_struct *mm, unsigned long vaddr) /** * set_bkpt - store breakpoint at a given address. + * @auprobe: arch specific probepoint information. * @mm: the probed process address space. - * @uprobe: the probepoint information. * @vaddr: the virtual address to insert the opcode. * * For mm @mm, store the breakpoint instruction at @vaddr. * Return 0 (success) or a negative errno. */ -int __weak set_bkpt(struct mm_struct *mm, struct arch_uprobe *auprobe, unsigned long vaddr) +int __weak set_bkpt(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr) { int result; @@ -352,13 +352,13 @@ int __weak set_bkpt(struct mm_struct *mm, struct arch_uprobe *auprobe, unsigned if (result) return result; - return write_opcode(mm, auprobe, vaddr, UPROBE_BKPT_INSN); + return write_opcode(auprobe, mm, vaddr, UPROBE_BKPT_INSN); } /** * set_orig_insn - Restore the original instruction. * @mm: the probed process address space. - * @uprobe: the probepoint information. + * @auprobe: arch specific probepoint information. * @vaddr: the virtual address to insert the opcode. 
* @verify: if true, verify existance of breakpoint instruction. * @@ -366,7 +366,7 @@ int __weak set_bkpt(struct mm_struct *mm, struct arch_uprobe *auprobe, unsigned * Return 0 (success) or a negative errno. */ int __weak -set_orig_insn(struct mm_struct *mm, struct arch_uprobe *auprobe, unsigned long vaddr, bool verify) +set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr, bool verify) { if (verify) { int result; @@ -378,7 +378,7 @@ set_orig_insn(struct mm_struct *mm, struct arch_uprobe *auprobe, unsigned long v if (result != 1) return result; } - return write_opcode(mm, auprobe, vaddr, *(uprobe_opcode_t *)auprobe->insn); + return write_opcode(auprobe, mm, vaddr, *(uprobe_opcode_t *)auprobe->insn); } static int match_uprobe(struct uprobe *l, struct uprobe *r) @@ -525,30 +525,30 @@ static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset) /* Returns the previous consumer */ static struct uprobe_consumer * -consumer_add(struct uprobe *uprobe, struct uprobe_consumer *consumer) +consumer_add(struct uprobe *uprobe, struct uprobe_consumer *uc) { down_write(&uprobe->consumer_rwsem); - consumer->next = uprobe->consumers; - uprobe->consumers = consumer; + uc->next = uprobe->consumers; + uprobe->consumers = uc; up_write(&uprobe->consumer_rwsem); - return consumer->next; + return uc->next; } /* - * For uprobe @uprobe, delete the consumer @consumer. - * Return true if the @consumer is deleted successfully + * For uprobe @uprobe, delete the consumer @uc. + * Return true if the @uc is deleted successfully * or return false. */ -static bool consumer_del(struct uprobe *uprobe, struct uprobe_consumer *consumer) +static bool consumer_del(struct uprobe *uprobe, struct uprobe_consumer *uc) { struct uprobe_consumer **con; bool ret = false; down_write(&uprobe->consumer_rwsem); for (con = &uprobe->consumers; *con; con = &(*con)->next) { - if (*con == consumer) { - *con = consumer->next; + if (*con == uc) { + *con = uc->next; ret = true; break; } @@ -558,8 +558,8 @@ static bool consumer_del(struct uprobe *uprobe, struct uprobe_consumer *consumer return ret; } -static int __copy_insn(struct address_space *mapping, - struct vm_area_struct *vma, char *insn, +static int +__copy_insn(struct address_space *mapping, struct vm_area_struct *vma, char *insn, unsigned long nbytes, unsigned long offset) { struct file *filp = vma->vm_file; @@ -590,7 +590,8 @@ static int __copy_insn(struct address_space *mapping, return 0; } -static int copy_insn(struct uprobe *uprobe, struct vm_area_struct *vma, unsigned long addr) +static int +copy_insn(struct uprobe *uprobe, struct vm_area_struct *vma, unsigned long addr) { struct address_space *mapping; unsigned long nbytes; @@ -617,8 +618,9 @@ static int copy_insn(struct uprobe *uprobe, struct vm_area_struct *vma, unsigned return __copy_insn(mapping, vma, uprobe->arch.insn, bytes, uprobe->offset); } -static int install_breakpoint(struct mm_struct *mm, struct uprobe *uprobe, - struct vm_area_struct *vma, loff_t vaddr) +static int +install_breakpoint(struct uprobe *uprobe, struct mm_struct *mm, + struct vm_area_struct *vma, loff_t vaddr) { unsigned long addr; int ret; @@ -643,20 +645,21 @@ static int install_breakpoint(struct mm_struct *mm, struct uprobe *uprobe, if (is_bkpt_insn((uprobe_opcode_t *)uprobe->arch.insn)) return -EEXIST; - ret = arch_uprobes_analyze_insn(mm, &uprobe->arch); + ret = arch_uprobes_analyze_insn(&uprobe->arch, mm); if (ret) return ret; uprobe->flags |= UPROBE_COPY_INSN; } - ret = set_bkpt(mm, &uprobe->arch, addr); + ret 
= set_bkpt(&uprobe->arch, mm, addr); return ret; } -static void remove_breakpoint(struct mm_struct *mm, struct uprobe *uprobe, loff_t vaddr) +static void +remove_breakpoint(struct uprobe *uprobe, struct mm_struct *mm, loff_t vaddr) { - set_orig_insn(mm, &uprobe->arch, (unsigned long)vaddr, true); + set_orig_insn(&uprobe->arch, mm, (unsigned long)vaddr, true); } static void delete_uprobe(struct uprobe *uprobe) @@ -671,9 +674,9 @@ static void delete_uprobe(struct uprobe *uprobe) atomic_dec(&uprobe_events); } -static struct vma_info *__find_next_vma_info(struct list_head *head, - loff_t offset, struct address_space *mapping, - struct vma_info *vi, bool is_register) +static struct vma_info * +__find_next_vma_info(struct address_space *mapping, struct list_head *head, + struct vma_info *vi, loff_t offset, bool is_register) { struct prio_tree_iter iter; struct vm_area_struct *vma; @@ -719,8 +722,8 @@ static struct vma_info *__find_next_vma_info(struct list_head *head, * yet been inserted. */ static struct vma_info * -find_next_vma_info(struct list_head *head, loff_t offset, struct address_space *mapping, - bool is_register) +find_next_vma_info(struct address_space *mapping, struct list_head *head, + loff_t offset, bool is_register) { struct vma_info *vi, *retvi; @@ -729,7 +732,7 @@ find_next_vma_info(struct list_head *head, loff_t offset, struct address_space * return ERR_PTR(-ENOMEM); mutex_lock(&mapping->i_mmap_mutex); - retvi = __find_next_vma_info(head, offset, mapping, vi, is_register); + retvi = __find_next_vma_info(mapping, head, vi, offset, is_register); mutex_unlock(&mapping->i_mmap_mutex); if (!retvi) @@ -754,7 +757,7 @@ static int register_for_each_vma(struct uprobe *uprobe, bool is_register) ret = 0; for (;;) { - vi = find_next_vma_info(&try_list, uprobe->offset, mapping, is_register); + vi = find_next_vma_info(mapping, &try_list, uprobe->offset, is_register); if (!vi) break; @@ -784,9 +787,9 @@ static int register_for_each_vma(struct uprobe *uprobe, bool is_register) } if (is_register) - ret = install_breakpoint(mm, uprobe, vma, vi->vaddr); + ret = install_breakpoint(uprobe, mm, vma, vi->vaddr); else - remove_breakpoint(mm, uprobe, vi->vaddr); + remove_breakpoint(uprobe, mm, vi->vaddr); up_read(&mm->mmap_sem); mmput(mm); @@ -823,25 +826,25 @@ static void __uprobe_unregister(struct uprobe *uprobe) * uprobe_register - register a probe * @inode: the file in which the probe has to be placed. * @offset: offset from the start of the file. - * @consumer: information on howto handle the probe.. + * @uc: information on howto handle the probe.. * * Apart from the access refcount, uprobe_register() takes a creation * refcount (thro alloc_uprobe) if and only if this @uprobe is getting * inserted into the rbtree (i.e first consumer for a @inode:@offset * tuple). Creation refcount stops uprobe_unregister from freeing the * @uprobe even before the register operation is complete. Creation - * refcount is released when the last @consumer for the @uprobe + * refcount is released when the last @uc for the @uprobe * unregisters. 
* * Return errno if it cannot successully install probes * else return 0 (success) */ -int uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *consumer) +int uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *uc) { struct uprobe *uprobe; int ret; - if (!inode || !consumer || consumer->next) + if (!inode || !uc || uc->next) return -EINVAL; if (offset > i_size_read(inode)) @@ -851,7 +854,7 @@ int uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer * mutex_lock(uprobes_hash(inode)); uprobe = alloc_uprobe(inode, offset); - if (uprobe && !consumer_add(uprobe, consumer)) { + if (uprobe && !consumer_add(uprobe, uc)) { ret = __uprobe_register(uprobe); if (ret) { uprobe->consumers = NULL; @@ -871,13 +874,13 @@ int uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer * * uprobe_unregister - unregister a already registered probe. * @inode: the file in which the probe has to be removed. * @offset: offset from the start of the file. - * @consumer: identify which probe if multiple probes are colocated. + * @uc: identify which probe if multiple probes are colocated. */ -void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *consumer) +void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc) { struct uprobe *uprobe; - if (!inode || !consumer) + if (!inode || !uc) return; uprobe = find_uprobe(inode, offset); @@ -886,7 +889,7 @@ void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consume mutex_lock(uprobes_hash(inode)); - if (consumer_del(uprobe, consumer)) { + if (consumer_del(uprobe, uc)) { if (!uprobe->consumers) { __uprobe_unregister(uprobe); uprobe->flags &= ~UPROBE_RUN_HANDLER; @@ -993,7 +996,7 @@ int uprobe_mmap(struct vm_area_struct *vma) if (!ret) { vaddr = vma_address(vma, uprobe->offset); if (vaddr >= vma->vm_start && vaddr < vma->vm_end) { - ret = install_breakpoint(vma->vm_mm, uprobe, vma, vaddr); + ret = install_breakpoint(uprobe, vma->vm_mm, vma, vaddr); /* Ignore double add: */ if (ret == -EEXIST) ret = 0; -- cgit v1.2.3 From 5cb4ac3a583d4ee18c8682ab857e093c4a0d0895 Mon Sep 17 00:00:00 2001 From: Srikar Dronamraju Date: Mon, 12 Mar 2012 14:55:45 +0530 Subject: uprobes/core: Rename bkpt to swbp bkpt doesnt seem to be a correct abbrevation for breakpoint. Choice was between bp and breakpoint. Since bp can refer to things other than breakpoint, use swbp to refer to breakpoints. This is pure cleanup, no functional change intended. Signed-off-by: Srikar Dronamraju Cc: Linus Torvalds Cc: Ananth N Mavinakayanahalli Cc: Jim Keniston Cc: Linux-mm Cc: Oleg Nesterov Cc: Andi Kleen Cc: Christoph Hellwig Cc: Steven Rostedt Cc: Arnaldo Carvalho de Melo Cc: Masami Hiramatsu Cc: Peter Zijlstra Link: http://lkml.kernel.org/r/20120312092545.5379.91251.sendpatchset@srdronam.in.ibm.com Signed-off-by: Ingo Molnar --- kernel/events/uprobes.c | 34 +++++++++++++++++----------------- 1 file changed, 17 insertions(+), 17 deletions(-) (limited to 'kernel') diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 9c5ddff1c8da..e56e56aa7535 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -170,14 +170,14 @@ out: } /** - * is_bkpt_insn - check if instruction is breakpoint instruction. + * is_swbp_insn - check if instruction is breakpoint instruction. * @insn: instruction to be checked. 
- * Default implementation of is_bkpt_insn + * Default implementation of is_swbp_insn * Returns true if @insn is a breakpoint instruction. */ -bool __weak is_bkpt_insn(uprobe_opcode_t *insn) +bool __weak is_swbp_insn(uprobe_opcode_t *insn) { - return *insn == UPROBE_BKPT_INSN; + return *insn == UPROBE_SWBP_INSN; } /* @@ -227,7 +227,7 @@ static int write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, * adding probes in write mapped pages since the breakpoints * might end up in the file copy. */ - if (!valid_vma(vma, is_bkpt_insn(&opcode))) + if (!valid_vma(vma, is_swbp_insn(&opcode))) goto put_out; uprobe = container_of(auprobe, struct uprobe, arch); @@ -259,8 +259,8 @@ static int write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, /* poke the new insn in, ASSUMES we don't cross page boundary */ vaddr &= ~PAGE_MASK; - BUG_ON(vaddr + UPROBE_BKPT_INSN_SIZE > PAGE_SIZE); - memcpy(vaddr_new + vaddr, &opcode, UPROBE_BKPT_INSN_SIZE); + BUG_ON(vaddr + UPROBE_SWBP_INSN_SIZE > PAGE_SIZE); + memcpy(vaddr_new + vaddr, &opcode, UPROBE_SWBP_INSN_SIZE); kunmap_atomic(vaddr_new); kunmap_atomic(vaddr_old); @@ -308,7 +308,7 @@ static int read_opcode(struct mm_struct *mm, unsigned long vaddr, uprobe_opcode_ lock_page(page); vaddr_new = kmap_atomic(page); vaddr &= ~PAGE_MASK; - memcpy(opcode, vaddr_new + vaddr, UPROBE_BKPT_INSN_SIZE); + memcpy(opcode, vaddr_new + vaddr, UPROBE_SWBP_INSN_SIZE); kunmap_atomic(vaddr_new); unlock_page(page); @@ -317,7 +317,7 @@ static int read_opcode(struct mm_struct *mm, unsigned long vaddr, uprobe_opcode_ return 0; } -static int is_bkpt_at_addr(struct mm_struct *mm, unsigned long vaddr) +static int is_swbp_at_addr(struct mm_struct *mm, unsigned long vaddr) { uprobe_opcode_t opcode; int result; @@ -326,14 +326,14 @@ static int is_bkpt_at_addr(struct mm_struct *mm, unsigned long vaddr) if (result) return result; - if (is_bkpt_insn(&opcode)) + if (is_swbp_insn(&opcode)) return 1; return 0; } /** - * set_bkpt - store breakpoint at a given address. + * set_swbp - store breakpoint at a given address. * @auprobe: arch specific probepoint information. * @mm: the probed process address space. * @vaddr: the virtual address to insert the opcode. @@ -341,18 +341,18 @@ static int is_bkpt_at_addr(struct mm_struct *mm, unsigned long vaddr) * For mm @mm, store the breakpoint instruction at @vaddr. * Return 0 (success) or a negative errno. 
*/ -int __weak set_bkpt(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr) +int __weak set_swbp(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr) { int result; - result = is_bkpt_at_addr(mm, vaddr); + result = is_swbp_at_addr(mm, vaddr); if (result == 1) return -EEXIST; if (result) return result; - return write_opcode(auprobe, mm, vaddr, UPROBE_BKPT_INSN); + return write_opcode(auprobe, mm, vaddr, UPROBE_SWBP_INSN); } /** @@ -371,7 +371,7 @@ set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long v if (verify) { int result; - result = is_bkpt_at_addr(mm, vaddr); + result = is_swbp_at_addr(mm, vaddr); if (!result) return -EINVAL; @@ -642,7 +642,7 @@ install_breakpoint(struct uprobe *uprobe, struct mm_struct *mm, if (ret) return ret; - if (is_bkpt_insn((uprobe_opcode_t *)uprobe->arch.insn)) + if (is_swbp_insn((uprobe_opcode_t *)uprobe->arch.insn)) return -EEXIST; ret = arch_uprobes_analyze_insn(&uprobe->arch, mm); @@ -651,7 +651,7 @@ install_breakpoint(struct uprobe *uprobe, struct mm_struct *mm, uprobe->flags |= UPROBE_COPY_INSN; } - ret = set_bkpt(&uprobe->arch, mm, addr); + ret = set_swbp(&uprobe->arch, mm, addr); return ret; } -- cgit v1.2.3 From 0326f5a94ddea33fa331b2519f4172f4fb387baa Mon Sep 17 00:00:00 2001 From: Srikar Dronamraju Date: Tue, 13 Mar 2012 23:30:11 +0530 Subject: uprobes/core: Handle breakpoint and singlestep exceptions Uprobes uses exception notifiers to get to know if a thread hit a breakpoint or a singlestep exception. When a thread hits a uprobe or is singlestepping post a uprobe hit, the uprobe exception notifier sets its TIF_UPROBE bit, which will then be checked on its return to userspace path (do_notify_resume() ->uprobe_notify_resume()), where the consumers handlers are run (in task context) based on the defined filters. Uprobe hits are thread specific and hence we need to maintain information about if a task hit a uprobe, what uprobe was hit, the slot where the original instruction was copied for xol so that it can be singlestepped with appropriate fixups. In some cases, special care is needed for instructions that are executed out of line (xol). These are architecture specific artefacts, such as handling RIP relative instructions on x86_64. Since the instruction at which the uprobe was inserted is executed out of line, architecture specific fixups are added so that the thread continues normal execution in the presence of a uprobe. Postpone the signals until we execute the probed insn. post_xol() path does a recalc_sigpending() before return to user-mode, this ensures the signal can't be lost. Uprobes relies on DIE_DEBUG notification to notify if a singlestep is complete. Adds x86 specific uprobe exception notifiers and appropriate hooks needed to determine a uprobe hit and subsequent post processing. Add requisite x86 fixups for xol for uprobes. Specific cases needing fixups include relative jumps (x86_64), calls, etc. Where possible, we check and skip singlestepping the breakpointed instructions. For now we skip single byte as well as few multibyte nop instructions. However this can be extended to other instructions too. Credits to Oleg Nesterov for suggestions/patches related to signal, breakpoint, singlestep handling code. 
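For illustration only, a minimal sketch of a hypothetical in-kernel consumer wired into this flow. The uprobe_register()/uprobe_unregister() signatures match the earlier patches in this series; the handler/filter member names and their (consumer, regs) / (consumer, task) signatures are assumed from the way handler_chain() invokes them in this patch. The handler runs in task context from uprobe_notify_resume(), and only for tasks that pass the filter (a NULL filter matches every task).

    #include <linux/kernel.h>
    #include <linux/fs.h>
    #include <linux/sched.h>
    #include <linux/string.h>
    #include <linux/ptrace.h>
    #include <linux/uprobes.h>

    /* Runs in task context from uprobe_notify_resume() after a breakpoint hit. */
    static int example_handler(struct uprobe_consumer *uc, struct pt_regs *regs)
    {
            pr_info("uprobe hit, ip=%lx\n", instruction_pointer(regs));
            return 0;
    }

    /* Optional filter: only handle hits in tasks named "myapp" (name invented). */
    static bool example_filter(struct uprobe_consumer *uc, struct task_struct *task)
    {
            return strcmp(task->comm, "myapp") == 0;
    }

    static struct uprobe_consumer example_consumer = {
            .handler = example_handler,
            .filter  = example_filter,
    };

    /* (inode, offset) identify the probed instruction, as in the uprobes rb-tree. */
    static int example_attach(struct inode *inode, loff_t offset)
    {
            return uprobe_register(inode, offset, &example_consumer);
    }

    static void example_detach(struct inode *inode, loff_t offset)
    {
            uprobe_unregister(inode, offset, &example_consumer);
    }

With the exception notifiers added below in place, a hit on the probed address sets TIF_UPROBE, and example_handler() then runs on the return-to-userspace path before the original instruction is singlestepped out of line.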
Signed-off-by: Srikar Dronamraju Cc: Linus Torvalds Cc: Ananth N Mavinakayanahalli Cc: Jim Keniston Cc: Linux-mm Cc: Oleg Nesterov Cc: Andi Kleen Cc: Christoph Hellwig Cc: Steven Rostedt Cc: Arnaldo Carvalho de Melo Cc: Masami Hiramatsu Cc: Peter Zijlstra Link: http://lkml.kernel.org/r/20120313180011.29771.89027.sendpatchset@srdronam.in.ibm.com [ Performed various cleanliness edits ] Signed-off-by: Ingo Molnar --- kernel/events/uprobes.c | 323 +++++++++++++++++++++++++++++++++++++++++++++++- kernel/fork.c | 4 + kernel/signal.c | 4 + 3 files changed, 327 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index e56e56aa7535..b807d1566b64 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -30,9 +30,12 @@ #include /* anon_vma_prepare */ #include /* set_pte_at_notify */ #include /* try_to_free_swap */ +#include /* user_enable_single_step */ +#include /* notifier mechanism */ #include +static struct srcu_struct uprobes_srcu; static struct rb_root uprobes_tree = RB_ROOT; static DEFINE_SPINLOCK(uprobes_treelock); /* serialize rbtree access */ @@ -486,6 +489,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe) u = __insert_uprobe(uprobe); spin_unlock_irqrestore(&uprobes_treelock, flags); + /* For now assume that the instruction need not be single-stepped */ + uprobe->flags |= UPROBE_SKIP_SSTEP; + return u; } @@ -523,6 +529,21 @@ static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset) return uprobe; } +static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs) +{ + struct uprobe_consumer *uc; + + if (!(uprobe->flags & UPROBE_RUN_HANDLER)) + return; + + down_read(&uprobe->consumer_rwsem); + for (uc = uprobe->consumers; uc; uc = uc->next) { + if (!uc->filter || uc->filter(uc, current)) + uc->handler(uc, regs); + } + up_read(&uprobe->consumer_rwsem); +} + /* Returns the previous consumer */ static struct uprobe_consumer * consumer_add(struct uprobe *uprobe, struct uprobe_consumer *uc) @@ -645,7 +666,7 @@ install_breakpoint(struct uprobe *uprobe, struct mm_struct *mm, if (is_swbp_insn((uprobe_opcode_t *)uprobe->arch.insn)) return -EEXIST; - ret = arch_uprobes_analyze_insn(&uprobe->arch, mm); + ret = arch_uprobe_analyze_insn(&uprobe->arch, mm); if (ret) return ret; @@ -662,10 +683,21 @@ remove_breakpoint(struct uprobe *uprobe, struct mm_struct *mm, loff_t vaddr) set_orig_insn(&uprobe->arch, mm, (unsigned long)vaddr, true); } +/* + * There could be threads that have hit the breakpoint and are entering the + * notifier code and trying to acquire the uprobes_treelock. The thread + * calling delete_uprobe() that is removing the uprobe from the rb_tree can + * race with these threads and might acquire the uprobes_treelock compared + * to some of the breakpoint hit threads. In such a case, the breakpoint + * hit threads will not find the uprobe. The current unregistering thread + * waits till all other threads have hit a breakpoint, to acquire the + * uprobes_treelock before the uprobe is removed from the rbtree. 
+ */ static void delete_uprobe(struct uprobe *uprobe) { unsigned long flags; + synchronize_srcu(&uprobes_srcu); spin_lock_irqsave(&uprobes_treelock, flags); rb_erase(&uprobe->rb_node, &uprobes_tree); spin_unlock_irqrestore(&uprobes_treelock, flags); @@ -1010,6 +1042,288 @@ int uprobe_mmap(struct vm_area_struct *vma) return ret; } +/** + * uprobe_get_swbp_addr - compute address of swbp given post-swbp regs + * @regs: Reflects the saved state of the task after it has hit a breakpoint + * instruction. + * Return the address of the breakpoint instruction. + */ +unsigned long __weak uprobe_get_swbp_addr(struct pt_regs *regs) +{ + return instruction_pointer(regs) - UPROBE_SWBP_INSN_SIZE; +} + +/* + * Called with no locks held. + * Called in context of a exiting or a exec-ing thread. + */ +void uprobe_free_utask(struct task_struct *t) +{ + struct uprobe_task *utask = t->utask; + + if (t->uprobe_srcu_id != -1) + srcu_read_unlock_raw(&uprobes_srcu, t->uprobe_srcu_id); + + if (!utask) + return; + + if (utask->active_uprobe) + put_uprobe(utask->active_uprobe); + + kfree(utask); + t->utask = NULL; +} + +/* + * Called in context of a new clone/fork from copy_process. + */ +void uprobe_copy_process(struct task_struct *t) +{ + t->utask = NULL; + t->uprobe_srcu_id = -1; +} + +/* + * Allocate a uprobe_task object for the task. + * Called when the thread hits a breakpoint for the first time. + * + * Returns: + * - pointer to new uprobe_task on success + * - NULL otherwise + */ +static struct uprobe_task *add_utask(void) +{ + struct uprobe_task *utask; + + utask = kzalloc(sizeof *utask, GFP_KERNEL); + if (unlikely(!utask)) + return NULL; + + utask->active_uprobe = NULL; + current->utask = utask; + return utask; +} + +/* Prepare to single-step probed instruction out of line. */ +static int +pre_ssout(struct uprobe *uprobe, struct pt_regs *regs, unsigned long vaddr) +{ + return -EFAULT; +} + +/* + * If we are singlestepping, then ensure this thread is not connected to + * non-fatal signals until completion of singlestep. When xol insn itself + * triggers the signal, restart the original insn even if the task is + * already SIGKILL'ed (since coredump should report the correct ip). This + * is even more important if the task has a handler for SIGSEGV/etc, The + * _same_ instruction should be repeated again after return from the signal + * handler, and SSTEP can never finish in this case. + */ +bool uprobe_deny_signal(void) +{ + struct task_struct *t = current; + struct uprobe_task *utask = t->utask; + + if (likely(!utask || !utask->active_uprobe)) + return false; + + WARN_ON_ONCE(utask->state != UTASK_SSTEP); + + if (signal_pending(t)) { + spin_lock_irq(&t->sighand->siglock); + clear_tsk_thread_flag(t, TIF_SIGPENDING); + spin_unlock_irq(&t->sighand->siglock); + + if (__fatal_signal_pending(t) || arch_uprobe_xol_was_trapped(t)) { + utask->state = UTASK_SSTEP_TRAPPED; + set_tsk_thread_flag(t, TIF_UPROBE); + set_tsk_thread_flag(t, TIF_NOTIFY_RESUME); + } + } + + return true; +} + +/* + * Avoid singlestepping the original instruction if the original instruction + * is a NOP or can be emulated. + */ +static bool can_skip_sstep(struct uprobe *uprobe, struct pt_regs *regs) +{ + if (arch_uprobe_skip_sstep(&uprobe->arch, regs)) + return true; + + uprobe->flags &= ~UPROBE_SKIP_SSTEP; + return false; +} + +/* + * Run handler and ask thread to singlestep. + * Ensure all non-fatal signals cannot interrupt thread while it singlesteps. 
+ */ +static void handle_swbp(struct pt_regs *regs) +{ + struct vm_area_struct *vma; + struct uprobe_task *utask; + struct uprobe *uprobe; + struct mm_struct *mm; + unsigned long bp_vaddr; + + uprobe = NULL; + bp_vaddr = uprobe_get_swbp_addr(regs); + mm = current->mm; + down_read(&mm->mmap_sem); + vma = find_vma(mm, bp_vaddr); + + if (vma && vma->vm_start <= bp_vaddr && valid_vma(vma, false)) { + struct inode *inode; + loff_t offset; + + inode = vma->vm_file->f_mapping->host; + offset = bp_vaddr - vma->vm_start; + offset += (vma->vm_pgoff << PAGE_SHIFT); + uprobe = find_uprobe(inode, offset); + } + + srcu_read_unlock_raw(&uprobes_srcu, current->uprobe_srcu_id); + current->uprobe_srcu_id = -1; + up_read(&mm->mmap_sem); + + if (!uprobe) { + /* No matching uprobe; signal SIGTRAP. */ + send_sig(SIGTRAP, current, 0); + return; + } + + utask = current->utask; + if (!utask) { + utask = add_utask(); + /* Cannot allocate; re-execute the instruction. */ + if (!utask) + goto cleanup_ret; + } + utask->active_uprobe = uprobe; + handler_chain(uprobe, regs); + if (uprobe->flags & UPROBE_SKIP_SSTEP && can_skip_sstep(uprobe, regs)) + goto cleanup_ret; + + utask->state = UTASK_SSTEP; + if (!pre_ssout(uprobe, regs, bp_vaddr)) { + user_enable_single_step(current); + return; + } + +cleanup_ret: + if (utask) { + utask->active_uprobe = NULL; + utask->state = UTASK_RUNNING; + } + if (uprobe) { + if (!(uprobe->flags & UPROBE_SKIP_SSTEP)) + + /* + * cannot singlestep; cannot skip instruction; + * re-execute the instruction. + */ + instruction_pointer_set(regs, bp_vaddr); + + put_uprobe(uprobe); + } +} + +/* + * Perform required fix-ups and disable singlestep. + * Allow pending signals to take effect. + */ +static void handle_singlestep(struct uprobe_task *utask, struct pt_regs *regs) +{ + struct uprobe *uprobe; + + uprobe = utask->active_uprobe; + if (utask->state == UTASK_SSTEP_ACK) + arch_uprobe_post_xol(&uprobe->arch, regs); + else if (utask->state == UTASK_SSTEP_TRAPPED) + arch_uprobe_abort_xol(&uprobe->arch, regs); + else + WARN_ON_ONCE(1); + + put_uprobe(uprobe); + utask->active_uprobe = NULL; + utask->state = UTASK_RUNNING; + user_disable_single_step(current); + + spin_lock_irq(¤t->sighand->siglock); + recalc_sigpending(); /* see uprobe_deny_signal() */ + spin_unlock_irq(¤t->sighand->siglock); +} + +/* + * On breakpoint hit, breakpoint notifier sets the TIF_UPROBE flag. (and on + * subsequent probe hits on the thread sets the state to UTASK_BP_HIT) and + * allows the thread to return from interrupt. + * + * On singlestep exception, singlestep notifier sets the TIF_UPROBE flag and + * also sets the state to UTASK_SSTEP_ACK and allows the thread to return from + * interrupt. + * + * While returning to userspace, thread notices the TIF_UPROBE flag and calls + * uprobe_notify_resume(). + */ +void uprobe_notify_resume(struct pt_regs *regs) +{ + struct uprobe_task *utask; + + utask = current->utask; + if (!utask || utask->state == UTASK_BP_HIT) + handle_swbp(regs); + else + handle_singlestep(utask, regs); +} + +/* + * uprobe_pre_sstep_notifier gets called from interrupt context as part of + * notifier mechanism. Set TIF_UPROBE flag and indicate breakpoint hit. 
+ */ +int uprobe_pre_sstep_notifier(struct pt_regs *regs) +{ + struct uprobe_task *utask; + + if (!current->mm) + return 0; + + utask = current->utask; + if (utask) + utask->state = UTASK_BP_HIT; + + set_thread_flag(TIF_UPROBE); + current->uprobe_srcu_id = srcu_read_lock_raw(&uprobes_srcu); + + return 1; +} + +/* + * uprobe_post_sstep_notifier gets called in interrupt context as part of notifier + * mechanism. Set TIF_UPROBE flag and indicate completion of singlestep. + */ +int uprobe_post_sstep_notifier(struct pt_regs *regs) +{ + struct uprobe_task *utask = current->utask; + + if (!current->mm || !utask || !utask->active_uprobe) + /* task is currently not uprobed */ + return 0; + + utask->state = UTASK_SSTEP_ACK; + set_thread_flag(TIF_UPROBE); + return 1; +} + +static struct notifier_block uprobe_exception_nb = { + .notifier_call = arch_uprobe_exception_notify, + .priority = INT_MAX-1, /* notified after kprobes, kgdb */ +}; + static int __init init_uprobes(void) { int i; @@ -1018,12 +1332,13 @@ static int __init init_uprobes(void) mutex_init(&uprobes_mutex[i]); mutex_init(&uprobes_mmap_mutex[i]); } - return 0; + init_srcu_struct(&uprobes_srcu); + + return register_die_notifier(&uprobe_exception_nb); } +module_init(init_uprobes); static void __exit exit_uprobes(void) { } - -module_init(init_uprobes); module_exit(exit_uprobes); diff --git a/kernel/fork.c b/kernel/fork.c index e2cd3e2a5ae8..eb7b63334009 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -67,6 +67,7 @@ #include #include #include +#include #include #include @@ -701,6 +702,8 @@ void mm_release(struct task_struct *tsk, struct mm_struct *mm) exit_pi_state_list(tsk); #endif + uprobe_free_utask(tsk); + /* Get rid of any cached register state */ deactivate_mm(tsk, mm); @@ -1295,6 +1298,7 @@ static struct task_struct *copy_process(unsigned long clone_flags, INIT_LIST_HEAD(&p->pi_state_list); p->pi_state_cache = NULL; #endif + uprobe_copy_process(p); /* * sigaltstack should be cleared when sharing the same VM */ diff --git a/kernel/signal.c b/kernel/signal.c index 8511e39813c7..e93ff0a719a0 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -29,6 +29,7 @@ #include #include #include +#include #define CREATE_TRACE_POINTS #include @@ -2192,6 +2193,9 @@ int get_signal_to_deliver(siginfo_t *info, struct k_sigaction *return_ka, struct signal_struct *signal = current->signal; int signr; + if (unlikely(uprobe_deny_signal())) + return 0; + relock: /* * We'll jump back here after any time we were stopped in TASK_STOPPED. -- cgit v1.2.3 From d4b3b6384f98f8692ad0209891ccdbc7e78bbefe Mon Sep 17 00:00:00 2001 From: Srikar Dronamraju Date: Fri, 30 Mar 2012 23:56:31 +0530 Subject: uprobes/core: Allocate XOL slots for uprobes use Uprobes executes the original instruction at a probed location out of line. For this, we allocate a page (per mm) upon the first uprobe hit, in the process user address space, divide it into slots that are used to store the actual instructions to be singlestepped. These slots are known as xol (execution out of line) slots. Care is taken to ensure that the allocation is in an unmapped area as close to the top of the user address space as possible, with appropriate permission settings to keep selinux like frameworks happy. Upon a uprobe hit, a free slot is acquired, and is released after the singlestep completes. Lots of improvements courtesy suggestions/inputs from Peter and Oleg. [ Folded a fix for build issue on powerpc fixed and reported by Stephen Rothwell. 
] Signed-off-by: Srikar Dronamraju Cc: Linus Torvalds Cc: Ananth N Mavinakayanahalli Cc: Jim Keniston Cc: Linux-mm Cc: Oleg Nesterov Cc: Andi Kleen Cc: Christoph Hellwig Cc: Steven Rostedt Cc: Arnaldo Carvalho de Melo Cc: Masami Hiramatsu Cc: Anton Arapov Cc: Peter Zijlstra Link: http://lkml.kernel.org/r/20120330182631.10018.48175.sendpatchset@srdronam.in.ibm.com Signed-off-by: Ingo Molnar --- kernel/events/uprobes.c | 215 ++++++++++++++++++++++++++++++++++++++++++++++++ kernel/fork.c | 2 + 2 files changed, 217 insertions(+) (limited to 'kernel') diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index b807d1566b64..b395edb97f53 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -35,6 +35,9 @@ #include +#define UINSNS_PER_PAGE (PAGE_SIZE/UPROBE_XOL_SLOT_BYTES) +#define MAX_UPROBE_XOL_SLOTS UINSNS_PER_PAGE + static struct srcu_struct uprobes_srcu; static struct rb_root uprobes_tree = RB_ROOT; @@ -1042,6 +1045,213 @@ int uprobe_mmap(struct vm_area_struct *vma) return ret; } +/* Slot allocation for XOL */ +static int xol_add_vma(struct xol_area *area) +{ + struct mm_struct *mm; + int ret; + + area->page = alloc_page(GFP_HIGHUSER); + if (!area->page) + return -ENOMEM; + + ret = -EALREADY; + mm = current->mm; + + down_write(&mm->mmap_sem); + if (mm->uprobes_state.xol_area) + goto fail; + + ret = -ENOMEM; + + /* Try to map as high as possible, this is only a hint. */ + area->vaddr = get_unmapped_area(NULL, TASK_SIZE - PAGE_SIZE, PAGE_SIZE, 0, 0); + if (area->vaddr & ~PAGE_MASK) { + ret = area->vaddr; + goto fail; + } + + ret = install_special_mapping(mm, area->vaddr, PAGE_SIZE, + VM_EXEC|VM_MAYEXEC|VM_DONTCOPY|VM_IO, &area->page); + if (ret) + goto fail; + + smp_wmb(); /* pairs with get_xol_area() */ + mm->uprobes_state.xol_area = area; + ret = 0; + +fail: + up_write(&mm->mmap_sem); + if (ret) + __free_page(area->page); + + return ret; +} + +static struct xol_area *get_xol_area(struct mm_struct *mm) +{ + struct xol_area *area; + + area = mm->uprobes_state.xol_area; + smp_read_barrier_depends(); /* pairs with wmb in xol_add_vma() */ + + return area; +} + +/* + * xol_alloc_area - Allocate process's xol_area. + * This area will be used for storing instructions for execution out of + * line. + * + * Returns the allocated area or NULL. + */ +static struct xol_area *xol_alloc_area(void) +{ + struct xol_area *area; + + area = kzalloc(sizeof(*area), GFP_KERNEL); + if (unlikely(!area)) + return NULL; + + area->bitmap = kzalloc(BITS_TO_LONGS(UINSNS_PER_PAGE) * sizeof(long), GFP_KERNEL); + + if (!area->bitmap) + goto fail; + + init_waitqueue_head(&area->wq); + if (!xol_add_vma(area)) + return area; + +fail: + kfree(area->bitmap); + kfree(area); + + return get_xol_area(current->mm); +} + +/* + * uprobe_clear_state - Free the area allocated for slots. + */ +void uprobe_clear_state(struct mm_struct *mm) +{ + struct xol_area *area = mm->uprobes_state.xol_area; + + if (!area) + return; + + put_page(area->page); + kfree(area->bitmap); + kfree(area); +} + +/* + * uprobe_reset_state - Free the area allocated for slots. + */ +void uprobe_reset_state(struct mm_struct *mm) +{ + mm->uprobes_state.xol_area = NULL; +} + +/* + * - search for a free slot. 
+ */ +static unsigned long xol_take_insn_slot(struct xol_area *area) +{ + unsigned long slot_addr; + int slot_nr; + + do { + slot_nr = find_first_zero_bit(area->bitmap, UINSNS_PER_PAGE); + if (slot_nr < UINSNS_PER_PAGE) { + if (!test_and_set_bit(slot_nr, area->bitmap)) + break; + + slot_nr = UINSNS_PER_PAGE; + continue; + } + wait_event(area->wq, (atomic_read(&area->slot_count) < UINSNS_PER_PAGE)); + } while (slot_nr >= UINSNS_PER_PAGE); + + slot_addr = area->vaddr + (slot_nr * UPROBE_XOL_SLOT_BYTES); + atomic_inc(&area->slot_count); + + return slot_addr; +} + +/* + * xol_get_insn_slot - If was not allocated a slot, then + * allocate a slot. + * Returns the allocated slot address or 0. + */ +static unsigned long xol_get_insn_slot(struct uprobe *uprobe, unsigned long slot_addr) +{ + struct xol_area *area; + unsigned long offset; + void *vaddr; + + area = get_xol_area(current->mm); + if (!area) { + area = xol_alloc_area(); + if (!area) + return 0; + } + current->utask->xol_vaddr = xol_take_insn_slot(area); + + /* + * Initialize the slot if xol_vaddr points to valid + * instruction slot. + */ + if (unlikely(!current->utask->xol_vaddr)) + return 0; + + current->utask->vaddr = slot_addr; + offset = current->utask->xol_vaddr & ~PAGE_MASK; + vaddr = kmap_atomic(area->page); + memcpy(vaddr + offset, uprobe->arch.insn, MAX_UINSN_BYTES); + kunmap_atomic(vaddr); + + return current->utask->xol_vaddr; +} + +/* + * xol_free_insn_slot - If slot was earlier allocated by + * @xol_get_insn_slot(), make the slot available for + * subsequent requests. + */ +static void xol_free_insn_slot(struct task_struct *tsk) +{ + struct xol_area *area; + unsigned long vma_end; + unsigned long slot_addr; + + if (!tsk->mm || !tsk->mm->uprobes_state.xol_area || !tsk->utask) + return; + + slot_addr = tsk->utask->xol_vaddr; + + if (unlikely(!slot_addr || IS_ERR_VALUE(slot_addr))) + return; + + area = tsk->mm->uprobes_state.xol_area; + vma_end = area->vaddr + PAGE_SIZE; + if (area->vaddr <= slot_addr && slot_addr < vma_end) { + unsigned long offset; + int slot_nr; + + offset = slot_addr - area->vaddr; + slot_nr = offset / UPROBE_XOL_SLOT_BYTES; + if (slot_nr >= UINSNS_PER_PAGE) + return; + + clear_bit(slot_nr, area->bitmap); + atomic_dec(&area->slot_count); + if (waitqueue_active(&area->wq)) + wake_up(&area->wq); + + tsk->utask->xol_vaddr = 0; + } +} + /** * uprobe_get_swbp_addr - compute address of swbp given post-swbp regs * @regs: Reflects the saved state of the task after it has hit a breakpoint @@ -1070,6 +1280,7 @@ void uprobe_free_utask(struct task_struct *t) if (utask->active_uprobe) put_uprobe(utask->active_uprobe); + xol_free_insn_slot(t); kfree(utask); t->utask = NULL; } @@ -1108,6 +1319,9 @@ static struct uprobe_task *add_utask(void) static int pre_ssout(struct uprobe *uprobe, struct pt_regs *regs, unsigned long vaddr) { + if (xol_get_insn_slot(uprobe, vaddr) && !arch_uprobe_pre_xol(&uprobe->arch, regs)) + return 0; + return -EFAULT; } @@ -1252,6 +1466,7 @@ static void handle_singlestep(struct uprobe_task *utask, struct pt_regs *regs) utask->active_uprobe = NULL; utask->state = UTASK_RUNNING; user_disable_single_step(current); + xol_free_insn_slot(current); spin_lock_irq(¤t->sighand->siglock); recalc_sigpending(); /* see uprobe_deny_signal() */ diff --git a/kernel/fork.c b/kernel/fork.c index eb7b63334009..3133b9da59d5 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -554,6 +554,7 @@ void mmput(struct mm_struct *mm) might_sleep(); if (atomic_dec_and_test(&mm->mm_users)) { + uprobe_clear_state(mm); exit_aio(mm); 
ksm_exit(mm); khugepaged_exit(mm); /* must run before exit_mmap */ @@ -760,6 +761,7 @@ struct mm_struct *dup_mm(struct task_struct *tsk) #ifdef CONFIG_TRANSPARENT_HUGEPAGE mm->pmd_huge_pte = NULL; #endif + uprobe_reset_state(mm); if (!mm_init(mm, tsk)) goto fail_nomem; -- cgit v1.2.3 From 682968e0c425c60f0dde37977e5beb2b12ddc4cc Mon Sep 17 00:00:00 2001 From: Srikar Dronamraju Date: Fri, 30 Mar 2012 23:56:46 +0530 Subject: uprobes/core: Optimize probe hits with the help of a counter Maintain a per-mm counter: number of uprobes that are inserted on this process address space. This counter can be used at probe hit time to determine if we need a lookup in the uprobes rbtree. Everytime a probe gets inserted successfully, the probe count is incremented and everytime a probe gets removed, the probe count is decremented. The new uprobe_munmap hook ensures the count is correct on a unmap or remap of a region. We expect that once a uprobe_munmap() is called, the vma goes away. So uprobe_unregister() finding a probe to unregister would either mean unmap event hasnt occurred yet or a mmap event on the same executable file occured after a unmap event. Additionally, uprobe_mmap hook now also gets called: a. on every executable vma that is COWed at fork. b. a vma of interest is newly mapped; breakpoint insertion also happens at the required address. On process creation, make sure the probes count in the child is set correctly. Special cases that are taken care include: a. mremap b. VM_DONTCOPY vmas on fork() c. insertion/removal races in the parent during fork(). Signed-off-by: Srikar Dronamraju Cc: Linus Torvalds Cc: Ananth N Mavinakayanahalli Cc: Jim Keniston Cc: Linux-mm Cc: Oleg Nesterov Cc: Andi Kleen Cc: Christoph Hellwig Cc: Steven Rostedt Cc: Arnaldo Carvalho de Melo Cc: Masami Hiramatsu Cc: Anton Arapov Cc: Peter Zijlstra Link: http://lkml.kernel.org/r/20120330182646.10018.85805.sendpatchset@srdronam.in.ibm.com Signed-off-by: Ingo Molnar --- kernel/events/uprobes.c | 119 ++++++++++++++++++++++++++++++++++++++++++++---- kernel/fork.c | 3 ++ 2 files changed, 114 insertions(+), 8 deletions(-) (limited to 'kernel') diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index b395edb97f53..29e881b0137d 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -642,6 +642,29 @@ copy_insn(struct uprobe *uprobe, struct vm_area_struct *vma, unsigned long addr) return __copy_insn(mapping, vma, uprobe->arch.insn, bytes, uprobe->offset); } +/* + * How mm->uprobes_state.count gets updated + * uprobe_mmap() increments the count if + * - it successfully adds a breakpoint. + * - it cannot add a breakpoint, but sees that there is a underlying + * breakpoint (via a is_swbp_at_addr()). + * + * uprobe_munmap() decrements the count if + * - it sees a underlying breakpoint, (via is_swbp_at_addr) + * (Subsequent uprobe_unregister wouldnt find the breakpoint + * unless a uprobe_mmap kicks in, since the old vma would be + * dropped just after uprobe_munmap.) + * + * uprobe_register increments the count if: + * - it successfully adds a breakpoint. + * + * uprobe_unregister decrements the count if: + * - it sees a underlying breakpoint and removes successfully. + * (via is_swbp_at_addr) + * (Subsequent uprobe_munmap wouldnt find the breakpoint + * since there is no underlying breakpoint after the + * breakpoint removal.) 
+ */ static int install_breakpoint(struct uprobe *uprobe, struct mm_struct *mm, struct vm_area_struct *vma, loff_t vaddr) @@ -675,7 +698,19 @@ install_breakpoint(struct uprobe *uprobe, struct mm_struct *mm, uprobe->flags |= UPROBE_COPY_INSN; } + + /* + * Ideally, should be updating the probe count after the breakpoint + * has been successfully inserted. However a thread could hit the + * breakpoint we just inserted even before the probe count is + * incremented. If this is the first breakpoint placed, breakpoint + * notifier might ignore uprobes and pass the trap to the thread. + * Hence increment before and decrement on failure. + */ + atomic_inc(&mm->uprobes_state.count); ret = set_swbp(&uprobe->arch, mm, addr); + if (ret) + atomic_dec(&mm->uprobes_state.count); return ret; } @@ -683,7 +718,8 @@ install_breakpoint(struct uprobe *uprobe, struct mm_struct *mm, static void remove_breakpoint(struct uprobe *uprobe, struct mm_struct *mm, loff_t vaddr) { - set_orig_insn(&uprobe->arch, mm, (unsigned long)vaddr, true); + if (!set_orig_insn(&uprobe->arch, mm, (unsigned long)vaddr, true)) + atomic_dec(&mm->uprobes_state.count); } /* @@ -1009,7 +1045,7 @@ int uprobe_mmap(struct vm_area_struct *vma) struct list_head tmp_list; struct uprobe *uprobe, *u; struct inode *inode; - int ret; + int ret, count; if (!atomic_read(&uprobe_events) || !valid_vma(vma, true)) return 0; @@ -1023,6 +1059,7 @@ int uprobe_mmap(struct vm_area_struct *vma) build_probe_list(inode, &tmp_list); ret = 0; + count = 0; list_for_each_entry_safe(uprobe, u, &tmp_list, pending_list) { loff_t vaddr; @@ -1030,21 +1067,85 @@ int uprobe_mmap(struct vm_area_struct *vma) list_del(&uprobe->pending_list); if (!ret) { vaddr = vma_address(vma, uprobe->offset); - if (vaddr >= vma->vm_start && vaddr < vma->vm_end) { - ret = install_breakpoint(uprobe, vma->vm_mm, vma, vaddr); - /* Ignore double add: */ - if (ret == -EEXIST) - ret = 0; + + if (vaddr < vma->vm_start || vaddr >= vma->vm_end) { + put_uprobe(uprobe); + continue; } + + ret = install_breakpoint(uprobe, vma->vm_mm, vma, vaddr); + + /* Ignore double add: */ + if (ret == -EEXIST) { + ret = 0; + + if (!is_swbp_at_addr(vma->vm_mm, vaddr)) + continue; + + /* + * Unable to insert a breakpoint, but + * breakpoint lies underneath. Increment the + * probe count. + */ + atomic_inc(&vma->vm_mm->uprobes_state.count); + } + + if (!ret) + count++; } put_uprobe(uprobe); } mutex_unlock(uprobes_mmap_hash(inode)); + if (ret) + atomic_sub(count, &vma->vm_mm->uprobes_state.count); + return ret; } +/* + * Called in context of a munmap of a vma. + */ +void uprobe_munmap(struct vm_area_struct *vma) +{ + struct list_head tmp_list; + struct uprobe *uprobe, *u; + struct inode *inode; + + if (!atomic_read(&uprobe_events) || !valid_vma(vma, false)) + return; + + if (!atomic_read(&vma->vm_mm->uprobes_state.count)) + return; + + inode = vma->vm_file->f_mapping->host; + if (!inode) + return; + + INIT_LIST_HEAD(&tmp_list); + mutex_lock(uprobes_mmap_hash(inode)); + build_probe_list(inode, &tmp_list); + + list_for_each_entry_safe(uprobe, u, &tmp_list, pending_list) { + loff_t vaddr; + + list_del(&uprobe->pending_list); + vaddr = vma_address(vma, uprobe->offset); + + if (vaddr >= vma->vm_start && vaddr < vma->vm_end) { + /* + * An unregister could have removed the probe before + * unmap. So check before we decrement the count. 
+ */ + if (is_swbp_at_addr(vma->vm_mm, vaddr) == 1) + atomic_dec(&vma->vm_mm->uprobes_state.count); + } + put_uprobe(uprobe); + } + mutex_unlock(uprobes_mmap_hash(inode)); +} + /* Slot allocation for XOL */ static int xol_add_vma(struct xol_area *area) { @@ -1150,6 +1251,7 @@ void uprobe_clear_state(struct mm_struct *mm) void uprobe_reset_state(struct mm_struct *mm) { mm->uprobes_state.xol_area = NULL; + atomic_set(&mm->uprobes_state.count, 0); } /* @@ -1504,7 +1606,8 @@ int uprobe_pre_sstep_notifier(struct pt_regs *regs) { struct uprobe_task *utask; - if (!current->mm) + if (!current->mm || !atomic_read(¤t->mm->uprobes_state.count)) + /* task is currently not uprobed */ return 0; utask = current->utask; diff --git a/kernel/fork.c b/kernel/fork.c index 3133b9da59d5..26a8f5c25805 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -421,6 +421,9 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm) if (retval) goto out; + + if (file && uprobe_mmap(tmp)) + goto out; } /* a new mm has just been created */ arch_dup_mmap(oldmm, mm); -- cgit v1.2.3 From 8b5a5a9dbca914d1f7d70276024d1525a3c94081 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Sun, 1 Apr 2012 12:09:54 -0700 Subject: cgroup: deprecate remount option changes This patch marks the following features for deprecation. * Rebinding subsys by remount: Never reached useful state - only works on empty hierarchies. * release_agent update by remount: release_agent itself will be replaced with conventional fsnotify notification. v2: Lennart pointed out that "name=" is necessary for mounts w/o any controller attached. Drop "name=" deprecation. Signed-off-by: Tejun Heo Acked-by: Li Zefan Cc: Lennart Poettering --- kernel/cgroup.c | 5 +++++ 1 file changed, 5 insertions(+) (limited to 'kernel') diff --git a/kernel/cgroup.c b/kernel/cgroup.c index ed64ccac67c9..741c120e06c1 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -1294,6 +1294,11 @@ static int cgroup_remount(struct super_block *sb, int *flags, char *data) if (ret) goto out_unlock; + /* See feature-removal-schedule.txt */ + if (opts.subsys_bits != root->actual_subsys_bits || opts.release_agent) + pr_warning("cgroup: option changes via remount are deprecated (pid=%d comm=%s)\n", + task_tgid_nr(current), current->comm); + /* Don't allow flags or name to change at remount */ if (opts.flags != root->flags || (opts.name && strcmp(opts.name, root->name))) { -- cgit v1.2.3 From ff4c8d503e2583b485da46d0ec3dcc29639ab889 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Sun, 1 Apr 2012 12:09:54 -0700 Subject: cgroup: move cgroup_clear_directory() call out of cgroup_populate_dir() cgroup_populate_dir() currently clears all files and then repopulate the directory; however, the clearing part is only useful when it's called from cgroup_remount(). Relocate the invocation to cgroup_remount(). This is to prepare for further cgroup file handling updates. 
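For illustration only, a condensed sketch using invented stub names (not the kernel functions themselves) of the call ordering this change establishes: a newly created cgroup directory starts out empty and only needs populating, whereas a remount can leave stale subsystem files behind and therefore clears before repopulating.

    /* Hypothetical stubs standing in for cgroup_clear_directory()/cgroup_populate_dir(). */
    static void clear_directory(void) { }
    static void populate_dir(void)    { }

    /* mkdir path: the directory starts out empty, populate only. */
    static void on_cgroup_create(void)
    {
            populate_dir();
    }

    /* remount path: enabled subsystems may have changed, drop stale files first. */
    static void on_cgroup_remount(void)
    {
            clear_directory();
            populate_dir();
    }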
Signed-off-by: Tejun Heo Acked-by: Li Zefan --- kernel/cgroup.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/cgroup.c b/kernel/cgroup.c index 741c120e06c1..8f72853131d3 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -1313,7 +1313,8 @@ static int cgroup_remount(struct super_block *sb, int *flags, char *data) goto out_unlock; } - /* (re)populate subsystem files */ + /* clear out any existing files and repopulate subsystem files */ + cgroup_clear_directory(cgrp->dentry); cgroup_populate_dir(cgrp); if (opts.release_agent) @@ -3644,9 +3645,6 @@ static int cgroup_populate_dir(struct cgroup *cgrp) int err; struct cgroup_subsys *ss; - /* First clear out any existing files */ - cgroup_clear_directory(cgrp->dentry); - err = cgroup_add_files(cgrp, NULL, files, ARRAY_SIZE(files)); if (err < 0) return err; -- cgit v1.2.3 From b0ca5a84fc3ad8f75bb2b7ac3dc6a77151cd3906 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Sun, 1 Apr 2012 12:09:54 -0700 Subject: cgroup: build list of all cgroups under a given cgroupfs_root Build a list of all cgroups anchored at cgroupfs_root->allcg_list and going through cgroup->allcg_node. The list is protected by cgroup_mutex and will be used to improve cgroup file handling. Signed-off-by: Tejun Heo Acked-by: Li Zefan --- kernel/cgroup.c | 10 ++++++++++ 1 file changed, 10 insertions(+) (limited to 'kernel') diff --git a/kernel/cgroup.c b/kernel/cgroup.c index 8f72853131d3..db4319c030d0 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -127,6 +127,9 @@ struct cgroupfs_root { /* A list running through the active hierarchies */ struct list_head root_list; + /* All cgroups on this root, cgroup_mutex protected */ + struct list_head allcg_list; + /* Hierarchy-specific flags */ unsigned long flags; @@ -1350,11 +1353,14 @@ static void init_cgroup_housekeeping(struct cgroup *cgrp) static void init_cgroup_root(struct cgroupfs_root *root) { struct cgroup *cgrp = &root->top_cgroup; + INIT_LIST_HEAD(&root->subsys_list); INIT_LIST_HEAD(&root->root_list); + INIT_LIST_HEAD(&root->allcg_list); root->number_of_cgroups = 1; cgrp->root = root; cgrp->top_cgroup = cgrp; + list_add_tail(&cgrp->allcg_node, &root->allcg_list); init_cgroup_housekeeping(cgrp); } @@ -3790,6 +3796,8 @@ static long cgroup_create(struct cgroup *parent, struct dentry *dentry, /* The cgroup directory was pre-locked for us */ BUG_ON(!mutex_is_locked(&cgrp->dentry->d_inode->i_mutex)); + list_add_tail(&cgrp->allcg_node, &root->allcg_list); + err = cgroup_populate_dir(cgrp); /* If err < 0, we have a half-filled directory - oh well ;) */ @@ -3998,6 +4006,8 @@ again: list_del_init(&cgrp->sibling); cgroup_unlock_hierarchy(cgrp->root); + list_del_init(&cgrp->allcg_node); + d = dget(cgrp->dentry); cgroup_d_remove_dir(d); -- cgit v1.2.3 From 8e3f6541d45e1a002801e56a19530a90f775deba Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Sun, 1 Apr 2012 12:09:55 -0700 Subject: cgroup: implement cgroup_add_cftypes() and friends Currently, cgroup directories are populated by subsys->populate() callback explicitly creating files on each cgroup creation. This level of flexibility isn't needed or desirable. It provides largely unused flexibility which call for abuses while severely limiting what the core layer can do through the lack of structure and conventions. Per each cgroup file type, the only distinction that cgroup users is making is whether a cgroup is root or not, which can easily be expressed with flags. This patch introduces cgroup_add_cftypes(). 
These deal with cftypes instead of individual files - controllers indicate that certain types of files exist for certain subsystem. Newly added CFTYPE_*_ON_ROOT flags indicate whether a cftype should be excluded or created only on the root cgroup. cgroup_add_cftypes() can be called any time whether the target subsystem is currently attached or not. cgroup core will create files on the existing cgroups as necessary. Also, cgroup_subsys->base_cftypes is added to ease registration of the base files for the subsystem. If non-NULL on subsys init, the cftypes pointed to by ->base_cftypes are automatically registered on subsys init / load. Further patches will convert the existing users and remove the file based interface. Note that this interface allows dynamic addition of files to an active controller. This will be used for sub-controller modularity and unified hierarchy in the longer term. This patch implements the new mechanism but doesn't apply it to any user. v2: replaced DECLARE_CGROUP_CFTYPES[_COND]() with cgroup_subsys->base_cftypes, which works better for cgroup_subsys which is loaded as module. Signed-off-by: Tejun Heo Acked-by: Li Zefan --- kernel/cgroup.c | 132 +++++++++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 131 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/cgroup.c b/kernel/cgroup.c index db4319c030d0..df8fb82ef350 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -2623,8 +2623,14 @@ int cgroup_add_file(struct cgroup *cgrp, struct dentry *dentry; int error; umode_t mode; - char name[MAX_CGROUP_TYPE_NAMELEN + MAX_CFTYPE_NAME + 2] = { 0 }; + + /* does @cft->flags tell us to skip creation on @cgrp? */ + if ((cft->flags & CFTYPE_NOT_ON_ROOT) && !cgrp->parent) + return 0; + if ((cft->flags & CFTYPE_ONLY_ON_ROOT) && cgrp->parent) + return 0; + if (subsys && !test_bit(ROOT_NOPREFIX, &cgrp->root->flags)) { strcpy(name, subsys->name); strcat(name, "."); @@ -2660,6 +2666,95 @@ int cgroup_add_files(struct cgroup *cgrp, } EXPORT_SYMBOL_GPL(cgroup_add_files); +static DEFINE_MUTEX(cgroup_cft_mutex); + +static void cgroup_cfts_prepare(void) + __acquires(&cgroup_cft_mutex) __acquires(&cgroup_mutex) +{ + /* + * Thanks to the entanglement with vfs inode locking, we can't walk + * the existing cgroups under cgroup_mutex and create files. + * Instead, we increment reference on all cgroups and build list of + * them using @cgrp->cft_q_node. Grab cgroup_cft_mutex to ensure + * exclusive access to the field. + */ + mutex_lock(&cgroup_cft_mutex); + mutex_lock(&cgroup_mutex); +} + +static void cgroup_cfts_commit(struct cgroup_subsys *ss, + const struct cftype *cfts) + __releases(&cgroup_mutex) __releases(&cgroup_cft_mutex) +{ + LIST_HEAD(pending); + struct cgroup *cgrp, *n; + int count = 0; + + while (cfts[count].name[0] != '\0') + count++; + + /* %NULL @cfts indicates abort and don't bother if @ss isn't attached */ + if (cfts && ss->root != &rootnode) { + list_for_each_entry(cgrp, &ss->root->allcg_list, allcg_node) { + dget(cgrp->dentry); + list_add_tail(&cgrp->cft_q_node, &pending); + } + } + + mutex_unlock(&cgroup_mutex); + + /* + * All new cgroups will see @cfts update on @ss->cftsets. Add/rm + * files for all cgroups which were created before. 
+ */ + list_for_each_entry_safe(cgrp, n, &pending, cft_q_node) { + struct inode *inode = cgrp->dentry->d_inode; + + mutex_lock(&inode->i_mutex); + mutex_lock(&cgroup_mutex); + if (!cgroup_is_removed(cgrp)) + cgroup_add_files(cgrp, ss, cfts, count); + mutex_unlock(&cgroup_mutex); + mutex_unlock(&inode->i_mutex); + + list_del_init(&cgrp->cft_q_node); + dput(cgrp->dentry); + } + + mutex_unlock(&cgroup_cft_mutex); +} + +/** + * cgroup_add_cftypes - add an array of cftypes to a subsystem + * @ss: target cgroup subsystem + * @cfts: zero-length name terminated array of cftypes + * + * Register @cfts to @ss. Files described by @cfts are created for all + * existing cgroups to which @ss is attached and all future cgroups will + * have them too. This function can be called anytime whether @ss is + * attached or not. + * + * Returns 0 on successful registration, -errno on failure. Note that this + * function currently returns 0 as long as @cfts registration is successful + * even if some file creation attempts on existing cgroups fail. + */ +int cgroup_add_cftypes(struct cgroup_subsys *ss, const struct cftype *cfts) +{ + struct cftype_set *set; + + set = kzalloc(sizeof(*set), GFP_KERNEL); + if (!set) + return -ENOMEM; + + cgroup_cfts_prepare(); + set->cfts = cfts; + list_add_tail(&set->node, &ss->cftsets); + cgroup_cfts_commit(ss, cfts); + + return 0; +} +EXPORT_SYMBOL_GPL(cgroup_add_cftypes); + /** * cgroup_task_count - count the number of tasks in a cgroup. * @cgrp: the cgroup in question @@ -3660,10 +3755,25 @@ static int cgroup_populate_dir(struct cgroup *cgrp) return err; } + /* process cftsets of each subsystem */ for_each_subsys(cgrp->root, ss) { + struct cftype_set *set; + if (ss->populate && (err = ss->populate(ss, cgrp)) < 0) return err; + + list_for_each_entry(set, &ss->cftsets, node) { + const struct cftype *cft; + + for (cft = set->cfts; cft->name[0] != '\0'; cft++) { + err = cgroup_add_file(cgrp, ss, cft); + if (err) + pr_warning("cgroup_populate_dir: failed to create %s, err=%d\n", + cft->name, err); + } + } } + /* This cgroup is ready now */ for_each_subsys(cgrp->root, ss) { struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id]; @@ -4034,12 +4144,29 @@ again: return 0; } +static void __init_or_module cgroup_init_cftsets(struct cgroup_subsys *ss) +{ + INIT_LIST_HEAD(&ss->cftsets); + + /* + * base_cftset is embedded in subsys itself, no need to worry about + * deregistration. + */ + if (ss->base_cftypes) { + ss->base_cftset.cfts = ss->base_cftypes; + list_add_tail(&ss->base_cftset.node, &ss->cftsets); + } +} + static void __init cgroup_init_subsys(struct cgroup_subsys *ss) { struct cgroup_subsys_state *css; printk(KERN_INFO "Initializing cgroup subsys %s\n", ss->name); + /* init base cftset */ + cgroup_init_cftsets(ss); + /* Create the top cgroup state for this subsystem */ list_add(&ss->sibling, &rootnode.subsys_list); ss->root = &rootnode; @@ -4109,6 +4236,9 @@ int __init_or_module cgroup_load_subsys(struct cgroup_subsys *ss) return 0; } + /* init base cftset */ + cgroup_init_cftsets(ss); + /* * need to register a subsys id before anything else - for example, * init_cgroup_css needs it. -- cgit v1.2.3 From 6e6ff25bd548a0c5bf5163e4f63e2793dcefbdcb Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Sun, 1 Apr 2012 12:09:55 -0700 Subject: cgroup: merge cft_release_agent cftype array into the base files array Now that cftype can express whether a file should only be on root, cft_release_agent can be merged into the base files cftypes array. 
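For illustration only, a hypothetical subsystem's base files array (the ex_* names are invented, and the read_seq_string callback signature is assumed from the cgroup API of this period) showing the pattern this merge relies on: a root-only file sits in the same zero-terminated array as the ordinary per-cgroup files, selected with CFTYPE_ONLY_ON_ROOT instead of a separate cgroup_add_file() call. Such an array would be hooked up through the subsystem's ->base_cftypes pointer or registered with cgroup_add_cftypes(), as introduced in the preceding patches.

    #include <linux/cgroup.h>
    #include <linux/seq_file.h>

    static int ex_show(struct cgroup *cgrp, struct cftype *cft, struct seq_file *m)
    {
            seq_puts(m, "example\n");
            return 0;
    }

    static struct cftype ex_base_files[] = {
            {
                    .name = "value",                /* created in every cgroup */
                    .read_seq_string = ex_show,
            },
            {
                    .name = "global_knob",          /* created only in the root cgroup */
                    .flags = CFTYPE_ONLY_ON_ROOT,
                    .read_seq_string = ex_show,
            },
            { }     /* terminate */
    };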
Signed-off-by: Tejun Heo Acked-by: Li Zefan --- kernel/cgroup.c | 19 +++++++------------ 1 file changed, 7 insertions(+), 12 deletions(-) (limited to 'kernel') diff --git a/kernel/cgroup.c b/kernel/cgroup.c index df8fb82ef350..66c1e3b5c0f2 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -3732,13 +3732,13 @@ static struct cftype files[] = { .read_u64 = cgroup_clone_children_read, .write_u64 = cgroup_clone_children_write, }, -}; - -static struct cftype cft_release_agent = { - .name = "release_agent", - .read_seq_string = cgroup_release_agent_show, - .write_string = cgroup_release_agent_write, - .max_write_len = PATH_MAX, + { + .name = "release_agent", + .flags = CFTYPE_ONLY_ON_ROOT, + .read_seq_string = cgroup_release_agent_show, + .write_string = cgroup_release_agent_write, + .max_write_len = PATH_MAX, + }, }; static int cgroup_populate_dir(struct cgroup *cgrp) @@ -3750,11 +3750,6 @@ static int cgroup_populate_dir(struct cgroup *cgrp) if (err < 0) return err; - if (cgrp == cgrp->top_cgroup) { - if ((err = cgroup_add_file(cgrp, NULL, &cft_release_agent)) < 0) - return err; - } - /* process cftsets of each subsystem */ for_each_subsys(cgrp->root, ss) { struct cftype_set *set; -- cgit v1.2.3 From 4baf6e33251b37f111e21289f8ee71fe4cce236e Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Sun, 1 Apr 2012 12:09:55 -0700 Subject: cgroup: convert all non-memcg controllers to the new cftype interface Convert debug, freezer, cpuset, cpu_cgroup, cpuacct, net_prio, blkio, net_cls and device controllers to use the new cftype based interface. Termination entry is added to cftype arrays and populate callbacks are replaced with cgroup_subsys->base_cftypes initializations. This is functionally identical transformation. There shouldn't be any visible behavior change. memcg is rather special and will be converted separately. Signed-off-by: Tejun Heo Acked-by: Li Zefan Cc: Paul Menage Cc: Ingo Molnar Cc: Peter Zijlstra Cc: "David S. 
Miller" Cc: Vivek Goyal --- kernel/cgroup.c | 10 +++------- kernel/cgroup_freezer.c | 11 +++-------- kernel/cpuset.c | 31 ++++++++++--------------------- kernel/sched/core.c | 16 ++++------------ 4 files changed, 20 insertions(+), 48 deletions(-) (limited to 'kernel') diff --git a/kernel/cgroup.c b/kernel/cgroup.c index 66c1e3b5c0f2..835ae29e4ea2 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -5349,19 +5349,15 @@ static struct cftype debug_files[] = { .name = "releasable", .read_u64 = releasable_read, }, -}; -static int debug_populate(struct cgroup_subsys *ss, struct cgroup *cont) -{ - return cgroup_add_files(cont, ss, debug_files, - ARRAY_SIZE(debug_files)); -} + { } /* terminate */ +}; struct cgroup_subsys debug_subsys = { .name = "debug", .create = debug_create, .destroy = debug_destroy, - .populate = debug_populate, .subsys_id = debug_subsys_id, + .base_cftypes = debug_files, }; #endif /* CONFIG_CGROUP_DEBUG */ diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c index f86e93920b62..3649fc6b3eaa 100644 --- a/kernel/cgroup_freezer.c +++ b/kernel/cgroup_freezer.c @@ -358,24 +358,19 @@ static int freezer_write(struct cgroup *cgroup, static struct cftype files[] = { { .name = "state", + .flags = CFTYPE_NOT_ON_ROOT, .read_seq_string = freezer_read, .write_string = freezer_write, }, + { } /* terminate */ }; -static int freezer_populate(struct cgroup_subsys *ss, struct cgroup *cgroup) -{ - if (!cgroup->parent) - return 0; - return cgroup_add_files(cgroup, ss, files, ARRAY_SIZE(files)); -} - struct cgroup_subsys freezer_subsys = { .name = "freezer", .create = freezer_create, .destroy = freezer_destroy, - .populate = freezer_populate, .subsys_id = freezer_subsys_id, .can_attach = freezer_can_attach, .fork = freezer_fork, + .base_cftypes = files, }; diff --git a/kernel/cpuset.c b/kernel/cpuset.c index b96ad75b7e64..2382683617a3 100644 --- a/kernel/cpuset.c +++ b/kernel/cpuset.c @@ -1765,28 +1765,17 @@ static struct cftype files[] = { .write_u64 = cpuset_write_u64, .private = FILE_SPREAD_SLAB, }, -}; - -static struct cftype cft_memory_pressure_enabled = { - .name = "memory_pressure_enabled", - .read_u64 = cpuset_read_u64, - .write_u64 = cpuset_write_u64, - .private = FILE_MEMORY_PRESSURE_ENABLED, -}; -static int cpuset_populate(struct cgroup_subsys *ss, struct cgroup *cont) -{ - int err; + { + .name = "memory_pressure_enabled", + .flags = CFTYPE_ONLY_ON_ROOT, + .read_u64 = cpuset_read_u64, + .write_u64 = cpuset_write_u64, + .private = FILE_MEMORY_PRESSURE_ENABLED, + }, - err = cgroup_add_files(cont, ss, files, ARRAY_SIZE(files)); - if (err) - return err; - /* memory_pressure_enabled is in root cpuset only */ - if (!cont->parent) - err = cgroup_add_file(cont, ss, - &cft_memory_pressure_enabled); - return err; -} + { } /* terminate */ +}; /* * post_clone() is called during cgroup_create() when the @@ -1887,9 +1876,9 @@ struct cgroup_subsys cpuset_subsys = { .destroy = cpuset_destroy, .can_attach = cpuset_can_attach, .attach = cpuset_attach, - .populate = cpuset_populate, .post_clone = cpuset_post_clone, .subsys_id = cpuset_subsys_id, + .base_cftypes = files, .early_init = 1, }; diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 4603b9d8f30a..afc6d7e71557 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -7970,13 +7970,9 @@ static struct cftype cpu_files[] = { .write_u64 = cpu_rt_period_write_uint, }, #endif + { } /* terminate */ }; -static int cpu_cgroup_populate(struct cgroup_subsys *ss, struct cgroup *cont) -{ - return cgroup_add_files(cont, ss, cpu_files, 
ARRAY_SIZE(cpu_files)); -} - struct cgroup_subsys cpu_cgroup_subsys = { .name = "cpu", .create = cpu_cgroup_create, @@ -7984,8 +7980,8 @@ struct cgroup_subsys cpu_cgroup_subsys = { .can_attach = cpu_cgroup_can_attach, .attach = cpu_cgroup_attach, .exit = cpu_cgroup_exit, - .populate = cpu_cgroup_populate, .subsys_id = cpu_cgroup_subsys_id, + .base_cftypes = cpu_files, .early_init = 1, }; @@ -8170,13 +8166,9 @@ static struct cftype files[] = { .name = "stat", .read_map = cpuacct_stats_show, }, + { } /* terminate */ }; -static int cpuacct_populate(struct cgroup_subsys *ss, struct cgroup *cgrp) -{ - return cgroup_add_files(cgrp, ss, files, ARRAY_SIZE(files)); -} - /* * charge this task's execution time to its accounting group. * @@ -8208,7 +8200,7 @@ struct cgroup_subsys cpuacct_subsys = { .name = "cpuacct", .create = cpuacct_create, .destroy = cpuacct_destroy, - .populate = cpuacct_populate, .subsys_id = cpuacct_subsys_id, + .base_cftypes = files, }; #endif /* CONFIG_CGROUP_CPUACCT */ -- cgit v1.2.3 From db0416b64977cb0f382175608286cc80c7573e38 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Sun, 1 Apr 2012 12:09:55 -0700 Subject: cgroup: remove cgroup_add_file[s]() No controller is using cgroup_add_files[s](). Unexport them, and convert cgroup_add_files() to handle NULL entry terminated array instead of taking count explicitly and continue creation on failure for internal use. Signed-off-by: Tejun Heo Acked-by: Li Zefan --- kernel/cgroup.c | 51 ++++++++++++++++++++------------------------------- 1 file changed, 20 insertions(+), 31 deletions(-) (limited to 'kernel') diff --git a/kernel/cgroup.c b/kernel/cgroup.c index 835ae29e4ea2..fb71a930ec8d 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -2615,9 +2615,8 @@ static umode_t cgroup_file_mode(const struct cftype *cft) return mode; } -int cgroup_add_file(struct cgroup *cgrp, - struct cgroup_subsys *subsys, - const struct cftype *cft) +static int cgroup_add_file(struct cgroup *cgrp, struct cgroup_subsys *subsys, + const struct cftype *cft) { struct dentry *dir = cgrp->dentry; struct dentry *dentry; @@ -2649,22 +2648,23 @@ int cgroup_add_file(struct cgroup *cgrp, error = PTR_ERR(dentry); return error; } -EXPORT_SYMBOL_GPL(cgroup_add_file); -int cgroup_add_files(struct cgroup *cgrp, - struct cgroup_subsys *subsys, - const struct cftype cft[], - int count) +static int cgroup_add_files(struct cgroup *cgrp, struct cgroup_subsys *subsys, + const struct cftype cfts[]) { - int i, err; - for (i = 0; i < count; i++) { - err = cgroup_add_file(cgrp, subsys, &cft[i]); - if (err) - return err; + const struct cftype *cft; + int err, ret = 0; + + for (cft = cfts; cft->name[0] != '\0'; cft++) { + err = cgroup_add_file(cgrp, subsys, cft); + if (err) { + pr_warning("cgroup_add_files: failed to create %s, err=%d\n", + cft->name, err); + ret = err; + } } - return 0; + return ret; } -EXPORT_SYMBOL_GPL(cgroup_add_files); static DEFINE_MUTEX(cgroup_cft_mutex); @@ -2688,10 +2688,6 @@ static void cgroup_cfts_commit(struct cgroup_subsys *ss, { LIST_HEAD(pending); struct cgroup *cgrp, *n; - int count = 0; - - while (cfts[count].name[0] != '\0') - count++; /* %NULL @cfts indicates abort and don't bother if @ss isn't attached */ if (cfts && ss->root != &rootnode) { @@ -2713,7 +2709,7 @@ static void cgroup_cfts_commit(struct cgroup_subsys *ss, mutex_lock(&inode->i_mutex); mutex_lock(&cgroup_mutex); if (!cgroup_is_removed(cgrp)) - cgroup_add_files(cgrp, ss, cfts, count); + cgroup_add_files(cgrp, ss, cfts); mutex_unlock(&cgroup_mutex); mutex_unlock(&inode->i_mutex); @@ 
-3739,6 +3735,7 @@ static struct cftype files[] = { .write_string = cgroup_release_agent_write, .max_write_len = PATH_MAX, }, + { } /* terminate */ }; static int cgroup_populate_dir(struct cgroup *cgrp) @@ -3746,7 +3743,7 @@ static int cgroup_populate_dir(struct cgroup *cgrp) int err; struct cgroup_subsys *ss; - err = cgroup_add_files(cgrp, NULL, files, ARRAY_SIZE(files)); + err = cgroup_add_files(cgrp, NULL, files); if (err < 0) return err; @@ -3757,16 +3754,8 @@ static int cgroup_populate_dir(struct cgroup *cgrp) if (ss->populate && (err = ss->populate(ss, cgrp)) < 0) return err; - list_for_each_entry(set, &ss->cftsets, node) { - const struct cftype *cft; - - for (cft = set->cfts; cft->name[0] != '\0'; cft++) { - err = cgroup_add_file(cgrp, ss, cft); - if (err) - pr_warning("cgroup_populate_dir: failed to create %s, err=%d\n", - cft->name, err); - } - } + list_for_each_entry(set, &ss->cftsets, node) + cgroup_add_files(cgrp, ss, set->cfts); } /* This cgroup is ready now */ -- cgit v1.2.3 From f6ea93723d0049ae5fbbb5292cb10c51d7a80801 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Sun, 1 Apr 2012 12:09:55 -0700 Subject: cgroup: relocate __d_cgrp() and __d_cft() Move the two macros upwards as they'll be used earlier in the file. Signed-off-by: Tejun Heo Acked-by: Li Zefan --- kernel/cgroup.c | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) (limited to 'kernel') diff --git a/kernel/cgroup.c b/kernel/cgroup.c index fb71a930ec8d..2ab29eb381b7 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -282,6 +282,16 @@ list_for_each_entry(_ss, &_root->subsys_list, sibling) #define for_each_active_root(_root) \ list_for_each_entry(_root, &roots, root_list) +static inline struct cgroup *__d_cgrp(struct dentry *dentry) +{ + return dentry->d_fsdata; +} + +static inline struct cftype *__d_cft(struct dentry *dentry) +{ + return dentry->d_fsdata; +} + /* the list of cgroups eligible for automatic release. Protected by * release_list_lock */ static LIST_HEAD(release_list); @@ -1704,16 +1714,6 @@ static struct file_system_type cgroup_fs_type = { static struct kobject *cgroup_kobj; -static inline struct cgroup *__d_cgrp(struct dentry *dentry) -{ - return dentry->d_fsdata; -} - -static inline struct cftype *__d_cft(struct dentry *dentry) -{ - return dentry->d_fsdata; -} - /** * cgroup_path - generate the path of a cgroup * @cgrp: the cgroup in question -- cgit v1.2.3 From 05ef1d7c4a5b6d09cadd2b8e9b3c395363d1a89c Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Sun, 1 Apr 2012 12:09:56 -0700 Subject: cgroup: introduce struct cfent This patch adds cfent (cgroup file entry) which is the association between a cgroup and a file. This is in-cgroup representation of files under a cgroup directory. This simplifies walking walking cgroup files and thus cgroup_clear_directory(), which is now implemented in two parts - cgroup_rm_file() and a loop around it. cgroup_rm_file() will be used to implement cftype removal and cfent is scheduled to serve cgroup specific per-file data (e.g. for sysfs-like "sever" semantics). v2: - cfe was freed from cgroup_rm_file() which led to use-after-free if the file had openers at the time of removal. Moved to cgroup_diput(). - cgroup_clear_directory() triggered WARN_ON_ONCE() if d_subdirs wasn't empty after removing all files. This triggered spuriously if some files were open during directory clearing. Removed. 
v3: - In cgroup_diput(), WARN_ONCE(!list_empty(&cfe->node)) could be spuriously triggered for root cgroups because they don't go through cgroup_clear_directory() on unmount. Don't trigger WARN for root cgroups. Signed-off-by: Tejun Heo Acked-by: Li Zefan Cc: Glauber Costa --- kernel/cgroup.c | 113 ++++++++++++++++++++++++++++++++++++++------------------ 1 file changed, 77 insertions(+), 36 deletions(-) (limited to 'kernel') diff --git a/kernel/cgroup.c b/kernel/cgroup.c index 2ab29eb381b7..7d5d1c927d9d 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -147,6 +147,15 @@ struct cgroupfs_root { */ static struct cgroupfs_root rootnode; +/* + * cgroupfs file entry, pointed to from leaf dentry->d_fsdata. + */ +struct cfent { + struct list_head node; + struct dentry *dentry; + struct cftype *type; +}; + /* * CSS ID -- ID per subsys's Cgroup Subsys State(CSS). used only when * cgroup_subsys->use_id != 0. @@ -287,11 +296,16 @@ static inline struct cgroup *__d_cgrp(struct dentry *dentry) return dentry->d_fsdata; } -static inline struct cftype *__d_cft(struct dentry *dentry) +static inline struct cfent *__d_cfe(struct dentry *dentry) { return dentry->d_fsdata; } +static inline struct cftype *__d_cft(struct dentry *dentry) +{ + return __d_cfe(dentry)->type; +} + /* the list of cgroups eligible for automatic release. Protected by * release_list_lock */ static LIST_HEAD(release_list); @@ -877,6 +891,14 @@ static void cgroup_diput(struct dentry *dentry, struct inode *inode) BUG_ON(!list_empty(&cgrp->pidlists)); kfree_rcu(cgrp, rcu_head); + } else { + struct cfent *cfe = __d_cfe(dentry); + struct cgroup *cgrp = dentry->d_parent->d_fsdata; + + WARN_ONCE(!list_empty(&cfe->node) && + cgrp != &cgrp->root->top_cgroup, + "cfe still linked for %s\n", cfe->type->name); + kfree(cfe); } iput(inode); } @@ -895,34 +917,36 @@ static void remove_dir(struct dentry *d) dput(parent); } -static void cgroup_clear_directory(struct dentry *dentry) -{ - struct list_head *node; - - BUG_ON(!mutex_is_locked(&dentry->d_inode->i_mutex)); - spin_lock(&dentry->d_lock); - node = dentry->d_subdirs.next; - while (node != &dentry->d_subdirs) { - struct dentry *d = list_entry(node, struct dentry, d_u.d_child); - - spin_lock_nested(&d->d_lock, DENTRY_D_LOCK_NESTED); - list_del_init(node); - if (d->d_inode) { - /* This should never be called on a cgroup - * directory with child cgroups */ - BUG_ON(d->d_inode->i_mode & S_IFDIR); - dget_dlock(d); - spin_unlock(&d->d_lock); - spin_unlock(&dentry->d_lock); - d_delete(d); - simple_unlink(dentry->d_inode, d); - dput(d); - spin_lock(&dentry->d_lock); - } else - spin_unlock(&d->d_lock); - node = dentry->d_subdirs.next; +static int cgroup_rm_file(struct cgroup *cgrp, const struct cftype *cft) +{ + struct cfent *cfe; + + lockdep_assert_held(&cgrp->dentry->d_inode->i_mutex); + lockdep_assert_held(&cgroup_mutex); + + list_for_each_entry(cfe, &cgrp->files, node) { + struct dentry *d = cfe->dentry; + + if (cft && cfe->type != cft) + continue; + + dget(d); + d_delete(d); + simple_unlink(d->d_inode, d); + list_del_init(&cfe->node); + dput(d); + + return 0; } - spin_unlock(&dentry->d_lock); + return -ENOENT; +} + +static void cgroup_clear_directory(struct dentry *dir) +{ + struct cgroup *cgrp = __d_cgrp(dir); + + while (!list_empty(&cgrp->files)) + cgroup_rm_file(cgrp, NULL); } /* @@ -1352,6 +1376,7 @@ static void init_cgroup_housekeeping(struct cgroup *cgrp) { INIT_LIST_HEAD(&cgrp->sibling); INIT_LIST_HEAD(&cgrp->children); + INIT_LIST_HEAD(&cgrp->files); INIT_LIST_HEAD(&cgrp->css_sets); 
INIT_LIST_HEAD(&cgrp->release_list); INIT_LIST_HEAD(&cgrp->pidlists); @@ -2619,7 +2644,9 @@ static int cgroup_add_file(struct cgroup *cgrp, struct cgroup_subsys *subsys, const struct cftype *cft) { struct dentry *dir = cgrp->dentry; + struct cgroup *parent = __d_cgrp(dir); struct dentry *dentry; + struct cfent *cfe; int error; umode_t mode; char name[MAX_CGROUP_TYPE_NAMELEN + MAX_CFTYPE_NAME + 2] = { 0 }; @@ -2635,17 +2662,31 @@ static int cgroup_add_file(struct cgroup *cgrp, struct cgroup_subsys *subsys, strcat(name, "."); } strcat(name, cft->name); + BUG_ON(!mutex_is_locked(&dir->d_inode->i_mutex)); + + cfe = kzalloc(sizeof(*cfe), GFP_KERNEL); + if (!cfe) + return -ENOMEM; + dentry = lookup_one_len(name, dir, strlen(name)); - if (!IS_ERR(dentry)) { - mode = cgroup_file_mode(cft); - error = cgroup_create_file(dentry, mode | S_IFREG, - cgrp->root->sb); - if (!error) - dentry->d_fsdata = (void *)cft; - dput(dentry); - } else + if (IS_ERR(dentry)) { error = PTR_ERR(dentry); + goto out; + } + + mode = cgroup_file_mode(cft); + error = cgroup_create_file(dentry, mode | S_IFREG, cgrp->root->sb); + if (!error) { + cfe->type = (void *)cft; + cfe->dentry = dentry; + dentry->d_fsdata = cfe; + list_add_tail(&cfe->node, &parent->files); + cfe = NULL; + } + dput(dentry); +out: + kfree(cfe); return error; } -- cgit v1.2.3 From 79578621b4847afdef48d19a28d00e3b188c37e1 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Sun, 1 Apr 2012 12:09:56 -0700 Subject: cgroup: implement cgroup_rm_cftypes() Implement cgroup_rm_cftypes() which removes an array of cftypes from a subsystem. It can be called whether the target subsys is attached or not. cgroup core will remove the specified file from all existing cgroups. This will be used to improve sub-subsys modularity and will be helpful for unified hierarchy. Signed-off-by: Tejun Heo Acked-by: Li Zefan --- kernel/cgroup.c | 54 ++++++++++++++++++++++++++++++++++++++++++++---------- 1 file changed, 44 insertions(+), 10 deletions(-) (limited to 'kernel') diff --git a/kernel/cgroup.c b/kernel/cgroup.c index 7d5d1c927d9d..21bba7722350 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -2690,17 +2690,20 @@ out: return error; } -static int cgroup_add_files(struct cgroup *cgrp, struct cgroup_subsys *subsys, - const struct cftype cfts[]) +static int cgroup_addrm_files(struct cgroup *cgrp, struct cgroup_subsys *subsys, + const struct cftype cfts[], bool is_add) { const struct cftype *cft; int err, ret = 0; for (cft = cfts; cft->name[0] != '\0'; cft++) { - err = cgroup_add_file(cgrp, subsys, cft); + if (is_add) + err = cgroup_add_file(cgrp, subsys, cft); + else + err = cgroup_rm_file(cgrp, cft); if (err) { - pr_warning("cgroup_add_files: failed to create %s, err=%d\n", - cft->name, err); + pr_warning("cgroup_addrm_files: failed to %s %s, err=%d\n", + is_add ? 
"add" : "remove", cft->name, err); ret = err; } } @@ -2724,7 +2727,7 @@ static void cgroup_cfts_prepare(void) } static void cgroup_cfts_commit(struct cgroup_subsys *ss, - const struct cftype *cfts) + const struct cftype *cfts, bool is_add) __releases(&cgroup_mutex) __releases(&cgroup_cft_mutex) { LIST_HEAD(pending); @@ -2750,7 +2753,7 @@ static void cgroup_cfts_commit(struct cgroup_subsys *ss, mutex_lock(&inode->i_mutex); mutex_lock(&cgroup_mutex); if (!cgroup_is_removed(cgrp)) - cgroup_add_files(cgrp, ss, cfts); + cgroup_addrm_files(cgrp, ss, cfts, is_add); mutex_unlock(&cgroup_mutex); mutex_unlock(&inode->i_mutex); @@ -2786,12 +2789,43 @@ int cgroup_add_cftypes(struct cgroup_subsys *ss, const struct cftype *cfts) cgroup_cfts_prepare(); set->cfts = cfts; list_add_tail(&set->node, &ss->cftsets); - cgroup_cfts_commit(ss, cfts); + cgroup_cfts_commit(ss, cfts, true); return 0; } EXPORT_SYMBOL_GPL(cgroup_add_cftypes); +/** + * cgroup_rm_cftypes - remove an array of cftypes from a subsystem + * @ss: target cgroup subsystem + * @cfts: zero-length name terminated array of cftypes + * + * Unregister @cfts from @ss. Files described by @cfts are removed from + * all existing cgroups to which @ss is attached and all future cgroups + * won't have them either. This function can be called anytime whether @ss + * is attached or not. + * + * Returns 0 on successful unregistration, -ENOENT if @cfts is not + * registered with @ss. + */ +int cgroup_rm_cftypes(struct cgroup_subsys *ss, const struct cftype *cfts) +{ + struct cftype_set *set; + + cgroup_cfts_prepare(); + + list_for_each_entry(set, &ss->cftsets, node) { + if (set->cfts == cfts) { + list_del_init(&set->node); + cgroup_cfts_commit(ss, cfts, false); + return 0; + } + } + + cgroup_cfts_commit(ss, NULL, false); + return -ENOENT; +} + /** * cgroup_task_count - count the number of tasks in a cgroup. * @cgrp: the cgroup in question @@ -3784,7 +3818,7 @@ static int cgroup_populate_dir(struct cgroup *cgrp) int err; struct cgroup_subsys *ss; - err = cgroup_add_files(cgrp, NULL, files); + err = cgroup_addrm_files(cgrp, NULL, files, true); if (err < 0) return err; @@ -3796,7 +3830,7 @@ static int cgroup_populate_dir(struct cgroup *cgrp) return err; list_for_each_entry(set, &ss->cftsets, node) - cgroup_add_files(cgrp, ss, set->cfts); + cgroup_addrm_files(cgrp, ss, set->cfts, true); } /* This cgroup is ready now */ -- cgit v1.2.3 From 28b4c27b8e6bb6d7ff2875281a8484f8898a87ef Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Sun, 1 Apr 2012 12:09:56 -0700 Subject: cgroup: use negative bias on css->refcnt to block css_tryget() When a cgroup is about to be removed, cgroup_clear_css_refs() is called to check and ensure that there are no active css references. This is currently achieved by dropping the refcnt to zero iff it has only the base ref. If all css refs could be dropped to zero, ref clearing is successful and CSS_REMOVED is set on all css. If not, the base ref is restored. While css ref is zero w/o CSS_REMOVED set, any css_tryget() attempt on it busy loops so that they are atomic w.r.t. the whole css ref clearing. This does work but dropping and re-instating the base ref is somewhat hairy and makes it difficult to add more logic to the put path as there are two of them - the regular css_put() and the reversible base ref clearing. This patch updates css ref clearing such that blocking new css_tryget() and putting the base ref are separate operations. 
CSS_DEACT_BIAS, defined as INT_MIN, is added to css->refcnt and css_tryget() busy loops while refcnt is negative. After all css refs are deactivated, if they were all one, ref clearing succeeded and CSS_REMOVED is set and the base ref is put using the regular css_put(); otherwise, CSS_DEACT_BIAS is subtracted from the refcnts and the original postive values are restored. css_refcnt() accessor which always returns the unbiased positive reference counts is added and used to simplify refcnt usages. While at it, relocate and reformat comments in cgroup_has_css_refs(). This separates css->refcnt deactivation and putting the base ref, which enables the next patch to make ref clearing optional. Signed-off-by: Tejun Heo Acked-by: Li Zefan --- kernel/cgroup.c | 119 +++++++++++++++++++++++++++++++++----------------------- 1 file changed, 71 insertions(+), 48 deletions(-) (limited to 'kernel') diff --git a/kernel/cgroup.c b/kernel/cgroup.c index 21bba7722350..2eade5186604 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -63,6 +63,9 @@ #include +/* css deactivation bias, makes css->refcnt negative to deny new trygets */ +#define CSS_DEACT_BIAS INT_MIN + /* * cgroup_mutex is the master lock. Any modification to cgroup or its * hierarchy must be performed while holding it. @@ -251,6 +254,14 @@ int cgroup_lock_is_held(void) EXPORT_SYMBOL_GPL(cgroup_lock_is_held); +/* the current nr of refs, always >= 0 whether @css is deactivated or not */ +static int css_refcnt(struct cgroup_subsys_state *css) +{ + int v = atomic_read(&css->refcnt); + + return v >= 0 ? v : v - CSS_DEACT_BIAS; +} + /* convenient tests for these bits */ inline int cgroup_is_removed(const struct cgroup *cgrp) { @@ -4006,18 +4017,19 @@ static int cgroup_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode) return cgroup_create(c_parent, dentry, mode | S_IFDIR); } +/* + * Check the reference count on each subsystem. Since we already + * established that there are no tasks in the cgroup, if the css refcount + * is also 1, then there should be no outstanding references, so the + * subsystem is safe to destroy. We scan across all subsystems rather than + * using the per-hierarchy linked list of mounted subsystems since we can + * be called via check_for_release() with no synchronization other than + * RCU, and the subsystem linked list isn't RCU-safe. + */ static int cgroup_has_css_refs(struct cgroup *cgrp) { - /* Check the reference count on each subsystem. Since we - * already established that there are no tasks in the - * cgroup, if the css refcount is also 1, then there should - * be no outstanding references, so the subsystem is safe to - * destroy. 
We scan across all subsystems rather than using - * the per-hierarchy linked list of mounted subsystems since - * we can be called via check_for_release() with no - * synchronization other than RCU, and the subsystem linked - * list isn't RCU-safe */ int i; + /* * We won't need to lock the subsys array, because the subsystems * we're concerned about aren't going anywhere since our cgroup root @@ -4026,17 +4038,21 @@ static int cgroup_has_css_refs(struct cgroup *cgrp) for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) { struct cgroup_subsys *ss = subsys[i]; struct cgroup_subsys_state *css; + /* Skip subsystems not present or not in this hierarchy */ if (ss == NULL || ss->root != cgrp->root) continue; + css = cgrp->subsys[ss->subsys_id]; - /* When called from check_for_release() it's possible + /* + * When called from check_for_release() it's possible * that by this point the cgroup has been removed * and the css deleted. But a false-positive doesn't * matter, since it can only happen if the cgroup * has been deleted and hence no longer needs the - * release agent to be called anyway. */ - if (css && (atomic_read(&css->refcnt) > 1)) + * release agent to be called anyway. + */ + if (css && css_refcnt(css) > 1) return 1; } return 0; @@ -4053,44 +4069,37 @@ static int cgroup_clear_css_refs(struct cgroup *cgrp) struct cgroup_subsys *ss; unsigned long flags; bool failed = false; + local_irq_save(flags); + + /* + * Block new css_tryget() by deactivating refcnt. If all refcnts + * were 1 at the moment of deactivation, we succeeded. + */ for_each_subsys(cgrp->root, ss) { struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id]; - int refcnt; - while (1) { - /* We can only remove a CSS with a refcnt==1 */ - refcnt = atomic_read(&css->refcnt); - if (refcnt > 1) { - failed = true; - goto done; - } - BUG_ON(!refcnt); - /* - * Drop the refcnt to 0 while we check other - * subsystems. This will cause any racing - * css_tryget() to spin until we set the - * CSS_REMOVED bits or abort - */ - if (atomic_cmpxchg(&css->refcnt, refcnt, 0) == refcnt) - break; - cpu_relax(); - } + + WARN_ON(atomic_read(&css->refcnt) < 0); + atomic_add(CSS_DEACT_BIAS, &css->refcnt); + failed |= css_refcnt(css) != 1; } - done: + + /* + * If succeeded, set REMOVED and put all the base refs; otherwise, + * restore refcnts to positive values. Either way, all in-progress + * css_tryget() will be released. 
+ */ for_each_subsys(cgrp->root, ss) { struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id]; - if (failed) { - /* - * Restore old refcnt if we previously managed - * to clear it from 1 to 0 - */ - if (!atomic_read(&css->refcnt)) - atomic_set(&css->refcnt, 1); - } else { - /* Commit the fact that the CSS is removed */ + + if (!failed) { set_bit(CSS_REMOVED, &css->flags); + css_put(css); + } else { + atomic_sub(CSS_DEACT_BIAS, &css->refcnt); } } + local_irq_restore(flags); return !failed; } @@ -4887,13 +4896,28 @@ static void check_for_release(struct cgroup *cgrp) } /* Caller must verify that the css is not for root cgroup */ -void __css_put(struct cgroup_subsys_state *css, int count) +bool __css_tryget(struct cgroup_subsys_state *css) +{ + do { + int v = css_refcnt(css); + + if (atomic_cmpxchg(&css->refcnt, v, v + 1) == v) + return true; + cpu_relax(); + } while (!test_bit(CSS_REMOVED, &css->flags)); + + return false; +} +EXPORT_SYMBOL_GPL(__css_tryget); + +/* Caller must verify that the css is not for root cgroup */ +void __css_put(struct cgroup_subsys_state *css) { struct cgroup *cgrp = css->cgroup; - int val; + rcu_read_lock(); - val = atomic_sub_return(count, &css->refcnt); - if (val == 1) { + atomic_dec(&css->refcnt); + if (css_refcnt(css) == 1) { if (notify_on_release(cgrp)) { set_bit(CGRP_RELEASABLE, &cgrp->flags); check_for_release(cgrp); @@ -4901,7 +4925,6 @@ void __css_put(struct cgroup_subsys_state *css, int count) cgroup_wakeup_rmdir_waiter(cgrp); } rcu_read_unlock(); - WARN_ON_ONCE(val < 1); } EXPORT_SYMBOL_GPL(__css_put); @@ -5020,7 +5043,7 @@ unsigned short css_id(struct cgroup_subsys_state *css) * on this or this is under rcu_read_lock(). Once css->id is allocated, * it's unchanged until freed. */ - cssid = rcu_dereference_check(css->id, atomic_read(&css->refcnt)); + cssid = rcu_dereference_check(css->id, css_refcnt(css)); if (cssid) return cssid->id; @@ -5032,7 +5055,7 @@ unsigned short css_depth(struct cgroup_subsys_state *css) { struct css_id *cssid; - cssid = rcu_dereference_check(css->id, atomic_read(&css->refcnt)); + cssid = rcu_dereference_check(css->id, css_refcnt(css)); if (cssid) return cssid->depth; -- cgit v1.2.3 From 48ddbe194623ae089cc0576e60363f2d2e85662a Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Sun, 1 Apr 2012 12:09:56 -0700 Subject: cgroup: make css->refcnt clearing on cgroup removal optional Currently, cgroup removal tries to drain all css references. If there are active css references, the removal logic waits and retries ->pre_detroy() until either all refs drop to zero or removal is cancelled. This semantics is unusual and adds non-trivial complexity to cgroup core and IMHO is fundamentally misguided in that it couples internal implementation details (references to internal data structure) with externally visible operation (rmdir). To userland, this is a behavior peculiarity which is unnecessary and difficult to expect (css refs is otherwise invisible from userland), and, to policy implementations, this is an unnecessary restriction (e.g. blkcg wants to hold css refs for caching purposes but can't as that becomes visible as rmdir hang). Unfortunately, memcg currently depends on ->pre_destroy() retrials and cgroup removal vetoing and can't be immmediately switched to the new behavior. This patch introduces the new behavior of not waiting for css refs to drain and maintains the old behavior for subsystems which have __DEPRECATED_clear_css_refs set. 
Once, memcg is updated, we can drop the code paths for the old behavior as proposed in the following patch. Note that the following patch is incorrect in that dput work item is in cgroup and may lose some of dputs when multiples css's are released back-to-back, and __css_put() triggers check_for_release() when refcnt reaches 0 instead of 1; however, it shows what part can be removed. http://thread.gmane.org/gmane.linux.kernel.containers/22559/focus=75251 Note that, in not-too-distant future, cgroup core will start emitting warning messages for subsys which require the old behavior, so please get moving. Signed-off-by: Tejun Heo Acked-by: Li Zefan Cc: Vivek Goyal Cc: Johannes Weiner Cc: Michal Hocko Cc: Balbir Singh Cc: KAMEZAWA Hiroyuki --- kernel/cgroup.c | 71 +++++++++++++++++++++++++++++++++++++++++++++++++-------- 1 file changed, 62 insertions(+), 9 deletions(-) (limited to 'kernel') diff --git a/kernel/cgroup.c b/kernel/cgroup.c index 2eade5186604..2905977e0f33 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -854,12 +854,17 @@ static int cgroup_call_pre_destroy(struct cgroup *cgrp) struct cgroup_subsys *ss; int ret = 0; - for_each_subsys(cgrp->root, ss) - if (ss->pre_destroy) { - ret = ss->pre_destroy(cgrp); - if (ret) - break; + for_each_subsys(cgrp->root, ss) { + if (!ss->pre_destroy) + continue; + + ret = ss->pre_destroy(cgrp); + if (ret) { + /* ->pre_destroy() failure is being deprecated */ + WARN_ON_ONCE(!ss->__DEPRECATED_clear_css_refs); + break; } + } return ret; } @@ -3859,6 +3864,14 @@ static int cgroup_populate_dir(struct cgroup *cgrp) return 0; } +static void css_dput_fn(struct work_struct *work) +{ + struct cgroup_subsys_state *css = + container_of(work, struct cgroup_subsys_state, dput_work); + + dput(css->cgroup->dentry); +} + static void init_cgroup_css(struct cgroup_subsys_state *css, struct cgroup_subsys *ss, struct cgroup *cgrp) @@ -3871,6 +3884,16 @@ static void init_cgroup_css(struct cgroup_subsys_state *css, set_bit(CSS_ROOT, &css->flags); BUG_ON(cgrp->subsys[ss->subsys_id]); cgrp->subsys[ss->subsys_id] = css; + + /* + * If !clear_css_refs, css holds an extra ref to @cgrp->dentry + * which is put on the last css_put(). dput() requires process + * context, which css_put() may be called without. @css->dput_work + * will be used to invoke dput() asynchronously from css_put(). + */ + INIT_WORK(&css->dput_work, css_dput_fn); + if (ss->__DEPRECATED_clear_css_refs) + set_bit(CSS_CLEAR_CSS_REFS, &css->flags); } static void cgroup_lock_hierarchy(struct cgroupfs_root *root) @@ -3973,6 +3996,11 @@ static long cgroup_create(struct cgroup *parent, struct dentry *dentry, if (err < 0) goto err_remove; + /* If !clear_css_refs, each css holds a ref to the cgroup's dentry */ + for_each_subsys(root, ss) + if (!ss->__DEPRECATED_clear_css_refs) + dget(dentry); + /* The cgroup directory was pre-locked for us */ BUG_ON(!mutex_is_locked(&cgrp->dentry->d_inode->i_mutex)); @@ -4062,8 +4090,24 @@ static int cgroup_has_css_refs(struct cgroup *cgrp) * Atomically mark all (or else none) of the cgroup's CSS objects as * CSS_REMOVED. Return true on success, or false if the cgroup has * busy subsystems. Call with cgroup_mutex held + * + * Depending on whether a subsys has __DEPRECATED_clear_css_refs set or + * not, cgroup removal behaves differently. + * + * If clear is set, css refcnt for the subsystem should be zero before + * cgroup removal can be committed. 
This is implemented by + * CGRP_WAIT_ON_RMDIR and retry logic around ->pre_destroy(), which may be + * called multiple times until all css refcnts reach zero and is allowed to + * veto removal on any invocation. This behavior is deprecated and will be + * removed as soon as the existing user (memcg) is updated. + * + * If clear is not set, each css holds an extra reference to the cgroup's + * dentry and cgroup removal proceeds regardless of css refs. + * ->pre_destroy() will be called at least once and is not allowed to fail. + * On the last put of each css, whenever that may be, the extra dentry ref + * is put so that dentry destruction happens only after all css's are + * released. */ - static int cgroup_clear_css_refs(struct cgroup *cgrp) { struct cgroup_subsys *ss; @@ -4074,14 +4118,17 @@ static int cgroup_clear_css_refs(struct cgroup *cgrp) /* * Block new css_tryget() by deactivating refcnt. If all refcnts - * were 1 at the moment of deactivation, we succeeded. + * for subsystems w/ clear_css_refs set were 1 at the moment of + * deactivation, we succeeded. */ for_each_subsys(cgrp->root, ss) { struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id]; WARN_ON(atomic_read(&css->refcnt) < 0); atomic_add(CSS_DEACT_BIAS, &css->refcnt); - failed |= css_refcnt(css) != 1; + + if (ss->__DEPRECATED_clear_css_refs) + failed |= css_refcnt(css) != 1; } /* @@ -4917,12 +4964,18 @@ void __css_put(struct cgroup_subsys_state *css) rcu_read_lock(); atomic_dec(&css->refcnt); - if (css_refcnt(css) == 1) { + switch (css_refcnt(css)) { + case 1: if (notify_on_release(cgrp)) { set_bit(CGRP_RELEASABLE, &cgrp->flags); check_for_release(cgrp); } cgroup_wakeup_rmdir_waiter(cgrp); + break; + case 0: + if (!test_bit(CSS_CLEAR_CSS_REFS, &css->flags)) + schedule_work(&css->dput_work); + break; } rcu_read_unlock(); } -- cgit v1.2.3 From f7232f0808d9c9899cbecf79296b0293217a27cd Mon Sep 17 00:00:00 2001 From: Lucia Rosculete Date: Sat, 3 Mar 2012 16:18:47 +0200 Subject: semaphore: fix improper comment reference to mutex Replace "mutex" with "semaphore" in down_trylock comment Signed-off-by: Lucia Rosculete Signed-off-by: Jiri Kosina --- kernel/semaphore.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/semaphore.c b/kernel/semaphore.c index 60636a4e25c3..4567fc020fe3 100644 --- a/kernel/semaphore.c +++ b/kernel/semaphore.c @@ -118,7 +118,7 @@ EXPORT_SYMBOL(down_killable); * down_trylock - try to acquire the semaphore, without waiting * @sem: the semaphore to be acquired * - * Try to acquire the semaphore atomically. Returns 0 if the mutex has + * Try to acquire the semaphore atomically. Returns 0 if the semaphore has * been acquired successfully or 1 if it it cannot be acquired. * * NOTE: This return value is inverted from both spin_trylock and -- cgit v1.2.3 From 74ba508f60c3650595297b33a4cdfd02e443194e Mon Sep 17 00:00:00 2001 From: "Eric W. Biederman" Date: Sat, 3 Mar 2012 18:58:11 -0800 Subject: userns: Remove unnecessary cast to struct user_struct when copying cred->user. In struct cred the user member is and has always been declared struct user_struct *user. At most a constant struct cred will have a constant pointer to non-constant user_struct so remove this unnecessary cast. Acked-by: Serge Hallyn Signed-off-by: Eric W. 
Biederman --- kernel/sys.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/sys.c b/kernel/sys.c index e7006eb6c1e4..f7a43514ac65 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -209,7 +209,7 @@ SYSCALL_DEFINE3(setpriority, int, which, int, who, int, niceval) } while_each_pid_thread(pgrp, PIDTYPE_PGID, p); break; case PRIO_USER: - user = (struct user_struct *) cred->user; + user = cred->user; if (!who) who = cred->uid; else if ((who != cred->uid) && @@ -274,7 +274,7 @@ SYSCALL_DEFINE2(getpriority, int, which, int, who) } while_each_pid_thread(pgrp, PIDTYPE_PGID, p); break; case PRIO_USER: - user = (struct user_struct *) cred->user; + user = cred->user; if (!who) who = cred->uid; else if ((who != cred->uid) && -- cgit v1.2.3 From c4a4d603796c727b9555867571f89483be9c565e Mon Sep 17 00:00:00 2001 From: "Eric W. Biederman" Date: Wed, 16 Nov 2011 23:15:31 -0800 Subject: userns: Use cred->user_ns instead of cred->user->user_ns Optimize performance and prepare for the removal of the user_ns reference from user_struct. Remove the slow long walk through cred->user->user_ns and instead go straight to cred->user_ns. Acked-by: Serge Hallyn Signed-off-by: Eric W. Biederman --- kernel/ptrace.c | 4 ++-- kernel/sched/core.c | 2 +- kernel/signal.c | 4 ++-- kernel/sys.c | 8 ++++---- kernel/user_namespace.c | 4 ++-- kernel/utsname.c | 2 +- 6 files changed, 12 insertions(+), 12 deletions(-) (limited to 'kernel') diff --git a/kernel/ptrace.c b/kernel/ptrace.c index ee8d49b9c309..24e0a5a94824 100644 --- a/kernel/ptrace.c +++ b/kernel/ptrace.c @@ -198,7 +198,7 @@ int __ptrace_may_access(struct task_struct *task, unsigned int mode) return 0; rcu_read_lock(); tcred = __task_cred(task); - if (cred->user->user_ns == tcred->user->user_ns && + if (cred->user_ns == tcred->user_ns && (cred->uid == tcred->euid && cred->uid == tcred->suid && cred->uid == tcred->uid && @@ -206,7 +206,7 @@ int __ptrace_may_access(struct task_struct *task, unsigned int mode) cred->gid == tcred->sgid && cred->gid == tcred->gid)) goto ok; - if (ptrace_has_cap(tcred->user->user_ns, mode)) + if (ptrace_has_cap(tcred->user_ns, mode)) goto ok; rcu_read_unlock(); return -EPERM; diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 4603b9d8f30a..96bff855b866 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4042,7 +4042,7 @@ static bool check_same_owner(struct task_struct *p) rcu_read_lock(); pcred = __task_cred(p); - if (cred->user->user_ns == pcred->user->user_ns) + if (cred->user_ns == pcred->user_ns) match = (cred->euid == pcred->euid || cred->euid == pcred->uid); else diff --git a/kernel/signal.c b/kernel/signal.c index 17afcaf582d0..e2c5d84f2dac 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -767,14 +767,14 @@ static int kill_ok_by_cred(struct task_struct *t) const struct cred *cred = current_cred(); const struct cred *tcred = __task_cred(t); - if (cred->user->user_ns == tcred->user->user_ns && + if (cred->user_ns == tcred->user_ns && (cred->euid == tcred->suid || cred->euid == tcred->uid || cred->uid == tcred->suid || cred->uid == tcred->uid)) return 1; - if (ns_capable(tcred->user->user_ns, CAP_KILL)) + if (ns_capable(tcred->user_ns, CAP_KILL)) return 1; return 0; diff --git a/kernel/sys.c b/kernel/sys.c index f7a43514ac65..82d8714bbede 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -133,11 +133,11 @@ static bool set_one_prio_perm(struct task_struct *p) { const struct cred *cred = current_cred(), *pcred = __task_cred(p); - if (pcred->user->user_ns == 
cred->user->user_ns && + if (pcred->user_ns == cred->user_ns && (pcred->uid == cred->euid || pcred->euid == cred->euid)) return true; - if (ns_capable(pcred->user->user_ns, CAP_SYS_NICE)) + if (ns_capable(pcred->user_ns, CAP_SYS_NICE)) return true; return false; } @@ -1498,7 +1498,7 @@ static int check_prlimit_permission(struct task_struct *task) return 0; tcred = __task_cred(task); - if (cred->user->user_ns == tcred->user->user_ns && + if (cred->user_ns == tcred->user_ns && (cred->uid == tcred->euid && cred->uid == tcred->suid && cred->uid == tcred->uid && @@ -1506,7 +1506,7 @@ static int check_prlimit_permission(struct task_struct *task) cred->gid == tcred->sgid && cred->gid == tcred->gid)) return 0; - if (ns_capable(tcred->user->user_ns, CAP_SYS_RESOURCE)) + if (ns_capable(tcred->user_ns, CAP_SYS_RESOURCE)) return 0; return -EPERM; diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c index 3b906e98b1db..f084083a0fd3 100644 --- a/kernel/user_namespace.c +++ b/kernel/user_namespace.c @@ -90,7 +90,7 @@ uid_t user_ns_map_uid(struct user_namespace *to, const struct cred *cred, uid_t { struct user_namespace *tmp; - if (likely(to == cred->user->user_ns)) + if (likely(to == cred->user_ns)) return uid; @@ -112,7 +112,7 @@ gid_t user_ns_map_gid(struct user_namespace *to, const struct cred *cred, gid_t { struct user_namespace *tmp; - if (likely(to == cred->user->user_ns)) + if (likely(to == cred->user_ns)) return gid; /* Is cred->user the creator of the target user_ns diff --git a/kernel/utsname.c b/kernel/utsname.c index 405caf91aad5..679d97a5d3fd 100644 --- a/kernel/utsname.c +++ b/kernel/utsname.c @@ -43,7 +43,7 @@ static struct uts_namespace *clone_uts_ns(struct task_struct *tsk, down_read(&uts_sem); memcpy(&ns->name, &old_ns->name, sizeof(ns->name)); - ns->user_ns = get_user_ns(task_cred_xxx(tsk, user)->user_ns); + ns->user_ns = get_user_ns(task_cred_xxx(tsk, user_ns)); up_read(&uts_sem); return ns; } -- cgit v1.2.3 From 0093ccb68f3753c0ba4d74c89d7e0f444b8d6123 Mon Sep 17 00:00:00 2001 From: "Eric W. Biederman" Date: Wed, 16 Nov 2011 21:52:53 -0800 Subject: cred: Refcount the user_ns pointed to by the cred. struct user_struct will shortly loose it's user_ns reference so make the cred user_ns reference a proper reference complete with reference counting. Acked-by: Serge Hallyn Signed-off-by: Eric W. Biederman --- kernel/cred.c | 8 +++----- kernel/user_namespace.c | 8 +++++--- 2 files changed, 8 insertions(+), 8 deletions(-) (limited to 'kernel') diff --git a/kernel/cred.c b/kernel/cred.c index 97b36eeca4c9..7a0d80669886 100644 --- a/kernel/cred.c +++ b/kernel/cred.c @@ -148,6 +148,7 @@ static void put_cred_rcu(struct rcu_head *rcu) if (cred->group_info) put_group_info(cred->group_info); free_uid(cred->user); + put_user_ns(cred->user_ns); kmem_cache_free(cred_jar, cred); } @@ -303,6 +304,7 @@ struct cred *prepare_creds(void) set_cred_subscribers(new, 0); get_group_info(new->group_info); get_uid(new->user); + get_user_ns(new->user_ns); #ifdef CONFIG_KEYS key_get(new->thread_keyring); @@ -412,11 +414,6 @@ int copy_creds(struct task_struct *p, unsigned long clone_flags) goto error_put; } - /* cache user_ns in cred. 
Doesn't need a refcount because it will - * stay pinned by cred->user - */ - new->user_ns = new->user->user_ns; - #ifdef CONFIG_KEYS /* new threads get their own thread keyrings if their parent already * had one */ @@ -676,6 +673,7 @@ struct cred *prepare_kernel_cred(struct task_struct *daemon) atomic_set(&new->usage, 1); set_cred_subscribers(new, 0); get_uid(new->user); + get_user_ns(new->user_ns); get_group_info(new->group_info); #ifdef CONFIG_KEYS diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c index f084083a0fd3..58bb8781a778 100644 --- a/kernel/user_namespace.c +++ b/kernel/user_namespace.c @@ -24,7 +24,7 @@ static struct kmem_cache *user_ns_cachep __read_mostly; */ int create_user_ns(struct cred *new) { - struct user_namespace *ns; + struct user_namespace *ns, *parent_ns = new->user_ns; struct user_struct *root_user; int n; @@ -57,8 +57,10 @@ int create_user_ns(struct cred *new) #endif /* tgcred will be cleared in our caller bc CLONE_THREAD won't be set */ - /* root_user holds a reference to ns, our reference can be dropped */ - put_user_ns(ns); + /* Leave the reference to our user_ns with the new cred */ + new->user_ns = ns; + + put_user_ns(parent_ns); return 0; } -- cgit v1.2.3 From aeb3ae9da9b50a386b22af786d19b623e8d9f0fa Mon Sep 17 00:00:00 2001 From: "Eric W. Biederman" Date: Wed, 16 Nov 2011 21:59:43 -0800 Subject: userns: Add an explicit reference to the parent user namespace I am about to remove the struct user_namespace reference from struct user_struct. So keep an explicit track of the parent user namespace. Take advantage of this new reference and replace instances of user_ns->creator->user_ns with user_ns->parent. Acked-by: Serge Hallyn Signed-off-by: Eric W. Biederman --- kernel/user_namespace.c | 13 ++++++------- 1 file changed, 6 insertions(+), 7 deletions(-) (limited to 'kernel') diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c index 58bb8781a778..c15e533d6bc5 100644 --- a/kernel/user_namespace.c +++ b/kernel/user_namespace.c @@ -45,6 +45,7 @@ int create_user_ns(struct cred *new) } /* set the new root user in the credentials under preparation */ + ns->parent = parent_ns; ns->creator = new->user; new->user = root_user; new->uid = new->euid = new->suid = new->fsuid = 0; @@ -60,8 +61,6 @@ int create_user_ns(struct cred *new) /* Leave the reference to our user_ns with the new cred */ new->user_ns = ns; - put_user_ns(parent_ns); - return 0; } @@ -72,10 +71,12 @@ int create_user_ns(struct cred *new) */ static void free_user_ns_work(struct work_struct *work) { - struct user_namespace *ns = + struct user_namespace *parent, *ns = container_of(work, struct user_namespace, destroyer); + parent = ns->parent; free_uid(ns->creator); kmem_cache_free(user_ns_cachep, ns); + put_user_ns(parent); } void free_user_ns(struct kref *kref) @@ -99,8 +100,7 @@ uid_t user_ns_map_uid(struct user_namespace *to, const struct cred *cred, uid_t /* Is cred->user the creator of the target user_ns * or the creator of one of it's parents? */ - for ( tmp = to; tmp != &init_user_ns; - tmp = tmp->creator->user_ns ) { + for ( tmp = to; tmp != &init_user_ns; tmp = tmp->parent ) { if (cred->user == tmp->creator) { return (uid_t)0; } @@ -120,8 +120,7 @@ gid_t user_ns_map_gid(struct user_namespace *to, const struct cred *cred, gid_t /* Is cred->user the creator of the target user_ns * or the creator of one of it's parents? 
*/ - for ( tmp = to; tmp != &init_user_ns; - tmp = tmp->creator->user_ns ) { + for ( tmp = to; tmp != &init_user_ns; tmp = tmp->parent ) { if (cred->user == tmp->creator) { return (gid_t)0; } -- cgit v1.2.3 From d0bd6594e286bd6145e04e19e8d3fa2e902cb800 Mon Sep 17 00:00:00 2001 From: "Eric W. Biederman" Date: Wed, 16 Nov 2011 23:20:58 -0800 Subject: userns: Deprecate and rename the user_namespace reference in the user_struct With a user_ns reference in struct cred the only user of the user namespace reference in struct user_struct is to keep the uid hash table alive. The user_namespace reference in struct user_struct will be going away soon, and I have removed all of the references. Rename the field from user_ns to _user_ns so that the compiler can verify nothing follows the user struct to the user namespace anymore. Acked-by: Serge Hallyn Signed-off-by: Eric W. Biederman --- kernel/user.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/user.c b/kernel/user.c index 71dd2363ab0f..d65fec0615a0 100644 --- a/kernel/user.c +++ b/kernel/user.c @@ -58,7 +58,7 @@ struct user_struct root_user = { .files = ATOMIC_INIT(0), .sigpending = ATOMIC_INIT(0), .locked_shm = 0, - .user_ns = &init_user_ns, + ._user_ns = &init_user_ns, }; /* @@ -72,7 +72,7 @@ static void uid_hash_insert(struct user_struct *up, struct hlist_head *hashent) static void uid_hash_remove(struct user_struct *up) { hlist_del_init(&up->uidhash_node); - put_user_ns(up->user_ns); + put_user_ns(up->_user_ns); /* It is safe to free the uid hash table now */ } static struct user_struct *uid_hash_find(uid_t uid, struct hlist_head *hashent) @@ -153,7 +153,7 @@ struct user_struct *alloc_uid(struct user_namespace *ns, uid_t uid) new->uid = uid; atomic_set(&new->__count, 1); - new->user_ns = get_user_ns(ns); + new->_user_ns = get_user_ns(ns); /* * Before adding this, check whether we raced -- cgit v1.2.3 From 973c5914260d75292f71a4729753086b9e863d57 Mon Sep 17 00:00:00 2001 From: "Eric W. Biederman" Date: Thu, 17 Nov 2011 01:59:07 -0800 Subject: userns: Start out with a full set of capabilities. Acked-by: Serge Hallyn Signed-off-by: Eric W. Biederman --- kernel/user_namespace.c | 9 +++++++++ 1 file changed, 9 insertions(+) (limited to 'kernel') diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c index c15e533d6bc5..e216e1e8ce84 100644 --- a/kernel/user_namespace.c +++ b/kernel/user_namespace.c @@ -11,6 +11,7 @@ #include #include #include +#include static struct kmem_cache *user_ns_cachep __read_mostly; @@ -52,6 +53,14 @@ int create_user_ns(struct cred *new) new->gid = new->egid = new->sgid = new->fsgid = 0; put_group_info(new->group_info); new->group_info = get_group_info(&init_groups); + /* Start with the same capabilities as init but useless for doing + * anything as the capabilities are bound to the new user namespace. + */ + new->securebits = SECUREBITS_DEFAULT; + new->cap_inheritable = CAP_EMPTY_SET; + new->cap_permitted = CAP_FULL_SET; + new->cap_effective = CAP_FULL_SET; + new->cap_bset = CAP_FULL_SET; #ifdef CONFIG_KEYS key_put(new->request_key_auth); new->request_key_auth = NULL; -- cgit v1.2.3 From 1a48e2ac034d47ed843081c4523b63c46b46888b Mon Sep 17 00:00:00 2001 From: "Eric W. Biederman" Date: Mon, 14 Nov 2011 16:24:06 -0800 Subject: userns: Replace the hard to write inode_userns with inode_capable. This represents a change in strategy of how to handle user namespaces. 
Instead of tagging everything explicitly with a user namespace and bulking up all of the comparisons of uids and gids in the kernel, all uids and gids in use will have a mapping to a flat kuid and kgid spaces respectively. This allows much more of the existing logic to be preserved and in general allows for faster code. In this new and improved world we allow someone to utiliize capabilities over an inode if the inodes owner mapps into the capabilities holders user namespace and the user has capabilities in their user namespace. Which is simple and efficient. Moving the fs uid comparisons to be comparisons in a flat kuid space follows in later patches, something that is only significant if you are using user namespaces. Acked-by: Serge Hallyn Signed-off-by: Eric W. Biederman --- kernel/capability.c | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+) (limited to 'kernel') diff --git a/kernel/capability.c b/kernel/capability.c index 3f1adb6c6470..cc5f0718215d 100644 --- a/kernel/capability.c +++ b/kernel/capability.c @@ -419,3 +419,22 @@ bool nsown_capable(int cap) { return ns_capable(current_user_ns(), cap); } + +/** + * inode_capable - Check superior capability over inode + * @inode: The inode in question + * @cap: The capability in question + * + * Return true if the current task has the given superior capability + * targeted at it's own user namespace and that the given inode is owned + * by the current user namespace or a child namespace. + * + * Currently inodes can only be owned by the initial user namespace. + * + */ +bool inode_capable(const struct inode *inode, int cap) +{ + struct user_namespace *ns = current_user_ns(); + + return ns_capable(ns, cap) && (ns == &init_user_ns); +} -- cgit v1.2.3 From 7a4e7408c5cadb240e068a662251754a562355e3 Mon Sep 17 00:00:00 2001 From: "Eric W. Biederman" Date: Mon, 14 Nov 2011 14:29:51 -0800 Subject: userns: Add kuid_t and kgid_t and associated infrastructure in uidgid.h Start distinguishing between internal kernel uids and gids and values that userspace can use. This is done by introducing two new types: kuid_t and kgid_t. These types and their associated functions are infrastructure are declared in the new header uidgid.h. Ultimately there will be a different implementation of the mapping functions for use with user namespaces. But to keep it simple we introduce the mapping functions first to separate the meat from the mechanical code conversions. Export overflowuid and overflowgid so we can use from_kuid_munged and from_kgid_munged in modular code. Acked-by: Serge Hallyn Signed-off-by: Eric W. Biederman --- kernel/sys.c | 2 -- 1 file changed, 2 deletions(-) (limited to 'kernel') diff --git a/kernel/sys.c b/kernel/sys.c index 82d8714bbede..71852417cfc8 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -93,10 +93,8 @@ int overflowuid = DEFAULT_OVERFLOWUID; int overflowgid = DEFAULT_OVERFLOWGID; -#ifdef CONFIG_UID16 EXPORT_SYMBOL(overflowuid); EXPORT_SYMBOL(overflowgid); -#endif /* * the same as above, but for filesystems which can only store a 16-bit -- cgit v1.2.3 From 7b44ab978b77a91b327058a0f4db7e6fcdb90b92 Mon Sep 17 00:00:00 2001 From: "Eric W. Biederman" Date: Wed, 16 Nov 2011 23:20:58 -0800 Subject: userns: Disassociate user_struct from the user_namespace. Modify alloc_uid to take a kuid and make the user hash table global. Stop holding a reference to the user namespace in struct user_struct. This simplifies the code and makes the per user accounting not care about which user namespace a uid happens to appear in. 
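As a companion to the description above, here is a minimal userspace sketch of the kuid idea, assuming a toy mapping in which each namespace simply owns a contiguous range of the flat space. kuid_t, make_kuid() and uid_eq() follow the names used in the diffs, but the struct layouts and the mapping are stand-ins invented for the sketch, not the uidgid.h implementation.

/*
 * Userspace sketch of the kuid conversion: raw uids from different user
 * namespaces are first mapped into one flat kernel-internal space, and
 * only the mapped values are compared. The 1:1 range mapping below is
 * an assumption for illustration.
 */
#include <stdbool.h>
#include <stdio.h>

typedef struct { unsigned int val; } kuid_t;	/* flat, kernel-internal uid */

struct user_namespace { unsigned int first; };	/* toy: each ns owns a range */

static kuid_t make_kuid(const struct user_namespace *ns, unsigned int uid)
{
	kuid_t k = { .val = ns->first + uid };	/* toy 1:1 range mapping */
	return k;
}

static bool uid_eq(kuid_t a, kuid_t b)
{
	return a.val == b.val;
}

int main(void)
{
	struct user_namespace init_ns  = { .first = 0 };
	struct user_namespace child_ns = { .first = 100000 };

	kuid_t a = make_kuid(&init_ns, 1000);	/* uid 1000 in the initial ns */
	kuid_t b = make_kuid(&child_ns, 1000);	/* uid 1000 in a child ns */
	kuid_t c = make_kuid(&init_ns, 1000);

	printf("a == b: %s\n", uid_eq(a, b) ? "yes" : "no");	/* no  */
	printf("a == c: %s\n", uid_eq(a, c) ? "yes" : "no");	/* yes */
	return 0;
}

The point of the conversion shows up in the comparisons: two identical raw uids from different namespaces map to different kuids, so callers such as set/getpriority in the diff below can use uid_eq() instead of comparing user_ns pointers and raw uid_t values separately.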
Acked-by: Serge Hallyn Signed-off-by: Eric W. Biederman --- kernel/sys.c | 34 +++++++++++++++++++++++----------- kernel/user.c | 28 +++++++++++++--------------- kernel/user_namespace.c | 6 +----- 3 files changed, 37 insertions(+), 31 deletions(-) (limited to 'kernel') diff --git a/kernel/sys.c b/kernel/sys.c index 71852417cfc8..f0c43b4b6657 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -175,6 +175,8 @@ SYSCALL_DEFINE3(setpriority, int, which, int, who, int, niceval) const struct cred *cred = current_cred(); int error = -EINVAL; struct pid *pgrp; + kuid_t cred_uid; + kuid_t uid; if (which > PRIO_USER || which < PRIO_PROCESS) goto out; @@ -207,18 +209,22 @@ SYSCALL_DEFINE3(setpriority, int, which, int, who, int, niceval) } while_each_pid_thread(pgrp, PIDTYPE_PGID, p); break; case PRIO_USER: + cred_uid = make_kuid(cred->user_ns, cred->uid); + uid = make_kuid(cred->user_ns, who); user = cred->user; if (!who) - who = cred->uid; - else if ((who != cred->uid) && - !(user = find_user(who))) + uid = cred_uid; + else if (!uid_eq(uid, cred_uid) && + !(user = find_user(uid))) goto out_unlock; /* No processes for this user */ do_each_thread(g, p) { - if (__task_cred(p)->uid == who) + const struct cred *tcred = __task_cred(p); + kuid_t tcred_uid = make_kuid(tcred->user_ns, tcred->uid); + if (uid_eq(tcred_uid, uid)) error = set_one_prio(p, niceval, error); } while_each_thread(g, p); - if (who != cred->uid) + if (!uid_eq(uid, cred_uid)) free_uid(user); /* For find_user() */ break; } @@ -242,6 +248,8 @@ SYSCALL_DEFINE2(getpriority, int, which, int, who) const struct cred *cred = current_cred(); long niceval, retval = -ESRCH; struct pid *pgrp; + kuid_t cred_uid; + kuid_t uid; if (which > PRIO_USER || which < PRIO_PROCESS) return -EINVAL; @@ -272,21 +280,25 @@ SYSCALL_DEFINE2(getpriority, int, which, int, who) } while_each_pid_thread(pgrp, PIDTYPE_PGID, p); break; case PRIO_USER: + cred_uid = make_kuid(cred->user_ns, cred->uid); + uid = make_kuid(cred->user_ns, who); user = cred->user; if (!who) - who = cred->uid; - else if ((who != cred->uid) && - !(user = find_user(who))) + uid = cred_uid; + else if (!uid_eq(uid, cred_uid) && + !(user = find_user(uid))) goto out_unlock; /* No processes for this user */ do_each_thread(g, p) { - if (__task_cred(p)->uid == who) { + const struct cred *tcred = __task_cred(p); + kuid_t tcred_uid = make_kuid(tcred->user_ns, tcred->uid); + if (uid_eq(tcred_uid, uid)) { niceval = 20 - task_nice(p); if (niceval > retval) retval = niceval; } } while_each_thread(g, p); - if (who != cred->uid) + if (!uid_eq(uid, cred_uid)) free_uid(user); /* for find_user() */ break; } @@ -629,7 +641,7 @@ static int set_user(struct cred *new) { struct user_struct *new_user; - new_user = alloc_uid(current_user_ns(), new->uid); + new_user = alloc_uid(make_kuid(new->user_ns, new->uid)); if (!new_user) return -EAGAIN; diff --git a/kernel/user.c b/kernel/user.c index d65fec0615a0..025077e54a7c 100644 --- a/kernel/user.c +++ b/kernel/user.c @@ -34,11 +34,14 @@ EXPORT_SYMBOL_GPL(init_user_ns); * when changing user ID's (ie setuid() and friends). */ +#define UIDHASH_BITS (CONFIG_BASE_SMALL ? 
3 : 7) +#define UIDHASH_SZ (1 << UIDHASH_BITS) #define UIDHASH_MASK (UIDHASH_SZ - 1) #define __uidhashfn(uid) (((uid >> UIDHASH_BITS) + uid) & UIDHASH_MASK) -#define uidhashentry(ns, uid) ((ns)->uidhash_table + __uidhashfn((uid))) +#define uidhashentry(uid) (uidhash_table + __uidhashfn((__kuid_val(uid)))) static struct kmem_cache *uid_cachep; +struct hlist_head uidhash_table[UIDHASH_SZ]; /* * The uidhash_lock is mostly taken from process context, but it is @@ -58,7 +61,7 @@ struct user_struct root_user = { .files = ATOMIC_INIT(0), .sigpending = ATOMIC_INIT(0), .locked_shm = 0, - ._user_ns = &init_user_ns, + .uid = GLOBAL_ROOT_UID, }; /* @@ -72,16 +75,15 @@ static void uid_hash_insert(struct user_struct *up, struct hlist_head *hashent) static void uid_hash_remove(struct user_struct *up) { hlist_del_init(&up->uidhash_node); - put_user_ns(up->_user_ns); /* It is safe to free the uid hash table now */ } -static struct user_struct *uid_hash_find(uid_t uid, struct hlist_head *hashent) +static struct user_struct *uid_hash_find(kuid_t uid, struct hlist_head *hashent) { struct user_struct *user; struct hlist_node *h; hlist_for_each_entry(user, h, hashent, uidhash_node) { - if (user->uid == uid) { + if (uid_eq(user->uid, uid)) { atomic_inc(&user->__count); return user; } @@ -110,14 +112,13 @@ static void free_user(struct user_struct *up, unsigned long flags) * * If the user_struct could not be found, return NULL. */ -struct user_struct *find_user(uid_t uid) +struct user_struct *find_user(kuid_t uid) { struct user_struct *ret; unsigned long flags; - struct user_namespace *ns = current_user_ns(); spin_lock_irqsave(&uidhash_lock, flags); - ret = uid_hash_find(uid, uidhashentry(ns, uid)); + ret = uid_hash_find(uid, uidhashentry(uid)); spin_unlock_irqrestore(&uidhash_lock, flags); return ret; } @@ -136,9 +137,9 @@ void free_uid(struct user_struct *up) local_irq_restore(flags); } -struct user_struct *alloc_uid(struct user_namespace *ns, uid_t uid) +struct user_struct *alloc_uid(kuid_t uid) { - struct hlist_head *hashent = uidhashentry(ns, uid); + struct hlist_head *hashent = uidhashentry(uid); struct user_struct *up, *new; spin_lock_irq(&uidhash_lock); @@ -153,8 +154,6 @@ struct user_struct *alloc_uid(struct user_namespace *ns, uid_t uid) new->uid = uid; atomic_set(&new->__count, 1); - new->_user_ns = get_user_ns(ns); - /* * Before adding this, check whether we raced * on adding the same user already.. 
@@ -162,7 +161,6 @@ struct user_struct *alloc_uid(struct user_namespace *ns, uid_t uid) spin_lock_irq(&uidhash_lock); up = uid_hash_find(uid, hashent); if (up) { - put_user_ns(ns); key_put(new->uid_keyring); key_put(new->session_keyring); kmem_cache_free(uid_cachep, new); @@ -187,11 +185,11 @@ static int __init uid_cache_init(void) 0, SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL); for(n = 0; n < UIDHASH_SZ; ++n) - INIT_HLIST_HEAD(init_user_ns.uidhash_table + n); + INIT_HLIST_HEAD(uidhash_table + n); /* Insert the root user immediately (init already runs as root) */ spin_lock_irq(&uidhash_lock); - uid_hash_insert(&root_user, uidhashentry(&init_user_ns, 0)); + uid_hash_insert(&root_user, uidhashentry(GLOBAL_ROOT_UID)); spin_unlock_irq(&uidhash_lock); return 0; diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c index e216e1e8ce84..898e973bd1e8 100644 --- a/kernel/user_namespace.c +++ b/kernel/user_namespace.c @@ -27,7 +27,6 @@ int create_user_ns(struct cred *new) { struct user_namespace *ns, *parent_ns = new->user_ns; struct user_struct *root_user; - int n; ns = kmem_cache_alloc(user_ns_cachep, GFP_KERNEL); if (!ns) @@ -35,11 +34,8 @@ int create_user_ns(struct cred *new) kref_init(&ns->kref); - for (n = 0; n < UIDHASH_SZ; ++n) - INIT_HLIST_HEAD(ns->uidhash_table + n); - /* Alloc new root user. */ - root_user = alloc_uid(ns, 0); + root_user = alloc_uid(make_kuid(ns, 0)); if (!root_user) { kmem_cache_free(user_ns_cachep, ns); return -ENOMEM; -- cgit v1.2.3 From 5d1c0f4a80a6df73395fb3fc2c302510f8f09d36 Mon Sep 17 00:00:00 2001 From: Eric B Munson Date: Sat, 10 Mar 2012 14:37:28 -0500 Subject: watchdog: add check for suspended vm in softlockup detector A suspended VM can cause spurious soft lockup warnings. To avoid these, the watchdog now checks if the kernel knows it was stopped by the host and skips the warning if so. When the watchdog is reset successfully, clear the guest paused flag. Signed-off-by: Eric B Munson Signed-off-by: Marcelo Tosatti Signed-off-by: Avi Kivity --- kernel/watchdog.c | 12 ++++++++++++ 1 file changed, 12 insertions(+) (limited to 'kernel') diff --git a/kernel/watchdog.c b/kernel/watchdog.c index df30ee08bdd4..e5e1d85b8c7c 100644 --- a/kernel/watchdog.c +++ b/kernel/watchdog.c @@ -24,6 +24,7 @@ #include #include +#include #include int watchdog_enabled = 1; @@ -280,6 +281,9 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer) __this_cpu_write(softlockup_touch_sync, false); sched_clock_tick(); } + + /* Clear the guest paused flag on watchdog reset */ + kvm_check_and_clear_guest_paused(); __touch_watchdog(); return HRTIMER_RESTART; } @@ -292,6 +296,14 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer) */ duration = is_softlockup(touch_ts); if (unlikely(duration)) { + /* + * If a virtual machine is stopped by the host it can look to + * the watchdog like a soft lockup, check to see if the host + * stopped the vm before we issue the warning + */ + if (kvm_check_and_clear_guest_paused()) + return HRTIMER_RESTART; + /* only warn once */ if (__this_cpu_read(soft_watchdog_warn) == true) return HRTIMER_RESTART; -- cgit v1.2.3 From 86f82d561864e902c70282b6f17cf590c0f34691 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Tue, 10 Apr 2012 10:16:36 -0700 Subject: cgroup: remove cgroup_subsys->populate() With memcg converted, cgroup_subsys->populate() doesn't have any user left. Remove it. 
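For context, a minimal sketch of how a controller exposes its control files once populate() is gone: it declares a static cftype array and registers it with cgroup_add_cftypes(), and cgroup_populate_dir() then instantiates those files from ss->cftsets, as in the hunk below. The controller name, the handler and the mycg_subsys reference are hypothetical placeholders, not code from this patch:

  #include <linux/cgroup.h>

  /* hypothetical controller; mycg_subsys is assumed to be its cgroup_subsys */
  static u64 mycg_usage_read(struct cgroup *cgrp, struct cftype *cft)
  {
          return 0;                       /* placeholder value */
  }

  static struct cftype mycg_files[] = {
          {
                  .name           = "usage",
                  .read_u64       = mycg_usage_read,
          },
          { }     /* zeroed terminator */
  };

  static int __init mycg_file_init(void)
  {
          /* files show up in every existing and future cgroup directory */
          return cgroup_add_cftypes(&mycg_subsys, mycg_files);
  }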
Signed-off-by: Tejun Heo Acked-by: Li Zefan --- kernel/cgroup.c | 3 --- 1 file changed, 3 deletions(-) (limited to 'kernel') diff --git a/kernel/cgroup.c b/kernel/cgroup.c index 2905977e0f33..b2f203f25ec8 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -3842,9 +3842,6 @@ static int cgroup_populate_dir(struct cgroup *cgrp) for_each_subsys(cgrp->root, ss) { struct cftype_set *set; - if (ss->populate && (err = ss->populate(ss, cgrp)) < 0) - return err; - list_for_each_entry(set, &ss->cftsets, node) cgroup_addrm_files(cgrp, ss, set->cfts, true); } -- cgit v1.2.3 From d3283fb45c06dfd7b1a36da8e6ff39d313f0600c Mon Sep 17 00:00:00 2001 From: Stephen Boyd Date: Mon, 9 Apr 2012 11:00:25 -0700 Subject: trace: Remove unused workqueue tracer This tracer was temporarily removed in 6416669 (workqueue: temporarily remove workqueue tracing, 2010-06-29) but never reinstated after concurrency managed workqueues were completed. For almost two years it hasn't been compilable so it seems nobody is using it. Delete it. Signed-off-by: Stephen Boyd Acked-by: Frederic Weisbecker Signed-off-by: Tejun Heo --- kernel/trace/Makefile | 1 - kernel/trace/trace_workqueue.c | 300 ----------------------------------------- 2 files changed, 301 deletions(-) delete mode 100644 kernel/trace/trace_workqueue.c (limited to 'kernel') diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile index 5f39a07fe5ea..b3afe0e76f79 100644 --- a/kernel/trace/Makefile +++ b/kernel/trace/Makefile @@ -41,7 +41,6 @@ obj-$(CONFIG_STACK_TRACER) += trace_stack.o obj-$(CONFIG_MMIOTRACE) += trace_mmiotrace.o obj-$(CONFIG_FUNCTION_GRAPH_TRACER) += trace_functions_graph.o obj-$(CONFIG_TRACE_BRANCH_PROFILING) += trace_branch.o -obj-$(CONFIG_WORKQUEUE_TRACER) += trace_workqueue.o obj-$(CONFIG_BLK_DEV_IO_TRACE) += blktrace.o ifeq ($(CONFIG_BLOCK),y) obj-$(CONFIG_EVENT_TRACING) += blktrace.o diff --git a/kernel/trace/trace_workqueue.c b/kernel/trace/trace_workqueue.c deleted file mode 100644 index 209b379a4721..000000000000 --- a/kernel/trace/trace_workqueue.c +++ /dev/null @@ -1,300 +0,0 @@ -/* - * Workqueue statistical tracer. - * - * Copyright (C) 2008 Frederic Weisbecker - * - */ - - -#include -#include -#include -#include -#include -#include "trace_stat.h" -#include "trace.h" - - -/* A cpu workqueue thread */ -struct cpu_workqueue_stats { - struct list_head list; - struct kref kref; - int cpu; - pid_t pid; -/* Can be inserted from interrupt or user context, need to be atomic */ - atomic_t inserted; -/* - * Don't need to be atomic, works are serialized in a single workqueue thread - * on a single CPU. - */ - unsigned int executed; -}; - -/* List of workqueue threads on one cpu */ -struct workqueue_global_stats { - struct list_head list; - spinlock_t lock; -}; - -/* Don't need a global lock because allocated before the workqueues, and - * never freed. 
- */ -static DEFINE_PER_CPU(struct workqueue_global_stats, all_workqueue_stat); -#define workqueue_cpu_stat(cpu) (&per_cpu(all_workqueue_stat, cpu)) - -static void cpu_workqueue_stat_free(struct kref *kref) -{ - kfree(container_of(kref, struct cpu_workqueue_stats, kref)); -} - -/* Insertion of a work */ -static void -probe_workqueue_insertion(void *ignore, - struct task_struct *wq_thread, - struct work_struct *work) -{ - int cpu = cpumask_first(&wq_thread->cpus_allowed); - struct cpu_workqueue_stats *node; - unsigned long flags; - - spin_lock_irqsave(&workqueue_cpu_stat(cpu)->lock, flags); - list_for_each_entry(node, &workqueue_cpu_stat(cpu)->list, list) { - if (node->pid == wq_thread->pid) { - atomic_inc(&node->inserted); - goto found; - } - } - pr_debug("trace_workqueue: entry not found\n"); -found: - spin_unlock_irqrestore(&workqueue_cpu_stat(cpu)->lock, flags); -} - -/* Execution of a work */ -static void -probe_workqueue_execution(void *ignore, - struct task_struct *wq_thread, - struct work_struct *work) -{ - int cpu = cpumask_first(&wq_thread->cpus_allowed); - struct cpu_workqueue_stats *node; - unsigned long flags; - - spin_lock_irqsave(&workqueue_cpu_stat(cpu)->lock, flags); - list_for_each_entry(node, &workqueue_cpu_stat(cpu)->list, list) { - if (node->pid == wq_thread->pid) { - node->executed++; - goto found; - } - } - pr_debug("trace_workqueue: entry not found\n"); -found: - spin_unlock_irqrestore(&workqueue_cpu_stat(cpu)->lock, flags); -} - -/* Creation of a cpu workqueue thread */ -static void probe_workqueue_creation(void *ignore, - struct task_struct *wq_thread, int cpu) -{ - struct cpu_workqueue_stats *cws; - unsigned long flags; - - WARN_ON(cpu < 0); - - /* Workqueues are sometimes created in atomic context */ - cws = kzalloc(sizeof(struct cpu_workqueue_stats), GFP_ATOMIC); - if (!cws) { - pr_warning("trace_workqueue: not enough memory\n"); - return; - } - INIT_LIST_HEAD(&cws->list); - kref_init(&cws->kref); - cws->cpu = cpu; - cws->pid = wq_thread->pid; - - spin_lock_irqsave(&workqueue_cpu_stat(cpu)->lock, flags); - list_add_tail(&cws->list, &workqueue_cpu_stat(cpu)->list); - spin_unlock_irqrestore(&workqueue_cpu_stat(cpu)->lock, flags); -} - -/* Destruction of a cpu workqueue thread */ -static void -probe_workqueue_destruction(void *ignore, struct task_struct *wq_thread) -{ - /* Workqueue only execute on one cpu */ - int cpu = cpumask_first(&wq_thread->cpus_allowed); - struct cpu_workqueue_stats *node, *next; - unsigned long flags; - - spin_lock_irqsave(&workqueue_cpu_stat(cpu)->lock, flags); - list_for_each_entry_safe(node, next, &workqueue_cpu_stat(cpu)->list, - list) { - if (node->pid == wq_thread->pid) { - list_del(&node->list); - kref_put(&node->kref, cpu_workqueue_stat_free); - goto found; - } - } - - pr_debug("trace_workqueue: don't find workqueue to destroy\n"); -found: - spin_unlock_irqrestore(&workqueue_cpu_stat(cpu)->lock, flags); - -} - -static struct cpu_workqueue_stats *workqueue_stat_start_cpu(int cpu) -{ - unsigned long flags; - struct cpu_workqueue_stats *ret = NULL; - - - spin_lock_irqsave(&workqueue_cpu_stat(cpu)->lock, flags); - - if (!list_empty(&workqueue_cpu_stat(cpu)->list)) { - ret = list_entry(workqueue_cpu_stat(cpu)->list.next, - struct cpu_workqueue_stats, list); - kref_get(&ret->kref); - } - - spin_unlock_irqrestore(&workqueue_cpu_stat(cpu)->lock, flags); - - return ret; -} - -static void *workqueue_stat_start(struct tracer_stat *trace) -{ - int cpu; - void *ret = NULL; - - for_each_possible_cpu(cpu) { - ret = workqueue_stat_start_cpu(cpu); - 
if (ret) - return ret; - } - return NULL; -} - -static void *workqueue_stat_next(void *prev, int idx) -{ - struct cpu_workqueue_stats *prev_cws = prev; - struct cpu_workqueue_stats *ret; - int cpu = prev_cws->cpu; - unsigned long flags; - - spin_lock_irqsave(&workqueue_cpu_stat(cpu)->lock, flags); - if (list_is_last(&prev_cws->list, &workqueue_cpu_stat(cpu)->list)) { - spin_unlock_irqrestore(&workqueue_cpu_stat(cpu)->lock, flags); - do { - cpu = cpumask_next(cpu, cpu_possible_mask); - if (cpu >= nr_cpu_ids) - return NULL; - } while (!(ret = workqueue_stat_start_cpu(cpu))); - return ret; - } else { - ret = list_entry(prev_cws->list.next, - struct cpu_workqueue_stats, list); - kref_get(&ret->kref); - } - spin_unlock_irqrestore(&workqueue_cpu_stat(cpu)->lock, flags); - - return ret; -} - -static int workqueue_stat_show(struct seq_file *s, void *p) -{ - struct cpu_workqueue_stats *cws = p; - struct pid *pid; - struct task_struct *tsk; - - pid = find_get_pid(cws->pid); - if (pid) { - tsk = get_pid_task(pid, PIDTYPE_PID); - if (tsk) { - seq_printf(s, "%3d %6d %6u %s\n", cws->cpu, - atomic_read(&cws->inserted), cws->executed, - tsk->comm); - put_task_struct(tsk); - } - put_pid(pid); - } - - return 0; -} - -static void workqueue_stat_release(void *stat) -{ - struct cpu_workqueue_stats *node = stat; - - kref_put(&node->kref, cpu_workqueue_stat_free); -} - -static int workqueue_stat_headers(struct seq_file *s) -{ - seq_printf(s, "# CPU INSERTED EXECUTED NAME\n"); - seq_printf(s, "# | | | |\n"); - return 0; -} - -struct tracer_stat workqueue_stats __read_mostly = { - .name = "workqueues", - .stat_start = workqueue_stat_start, - .stat_next = workqueue_stat_next, - .stat_show = workqueue_stat_show, - .stat_release = workqueue_stat_release, - .stat_headers = workqueue_stat_headers -}; - - -int __init stat_workqueue_init(void) -{ - if (register_stat_tracer(&workqueue_stats)) { - pr_warning("Unable to register workqueue stat tracer\n"); - return 1; - } - - return 0; -} -fs_initcall(stat_workqueue_init); - -/* - * Workqueues are created very early, just after pre-smp initcalls. - * So we must register our tracepoints at this stage. - */ -int __init trace_workqueue_early_init(void) -{ - int ret, cpu; - - for_each_possible_cpu(cpu) { - spin_lock_init(&workqueue_cpu_stat(cpu)->lock); - INIT_LIST_HEAD(&workqueue_cpu_stat(cpu)->list); - } - - ret = register_trace_workqueue_insertion(probe_workqueue_insertion, NULL); - if (ret) - goto out; - - ret = register_trace_workqueue_execution(probe_workqueue_execution, NULL); - if (ret) - goto no_insertion; - - ret = register_trace_workqueue_creation(probe_workqueue_creation, NULL); - if (ret) - goto no_execution; - - ret = register_trace_workqueue_destruction(probe_workqueue_destruction, NULL); - if (ret) - goto no_creation; - - return 0; - -no_creation: - unregister_trace_workqueue_creation(probe_workqueue_creation, NULL); -no_execution: - unregister_trace_workqueue_execution(probe_workqueue_execution, NULL); -no_insertion: - unregister_trace_workqueue_insertion(probe_workqueue_insertion, NULL); -out: - pr_warning("trace_workqueue: unable to trace workqueues\n"); - - return 1; -} -early_initcall(trace_workqueue_early_init); -- cgit v1.2.3 From 30059d93b07a034555defbf14d689a279fd7368d Mon Sep 17 00:00:00 2001 From: Srinivas Kandagatla Date: Wed, 11 Apr 2012 05:01:56 -0300 Subject: [media] kernel:kfifo: export __kfifo_max_r symbol kfifo_avail expands to __kfifo_max_r which is not an exported symbol. 
Any kernel module using kfifo_avail will result in build failures because of this. This patch just exports __kfifo_max_r symbol to fix such problems in future. Reported-by: Stephen Rothwell Signed-off-by: Srinivas Kandagatla Acked-by: Stefani Seibold Signed-off-by: Mauro Carvalho Chehab --- kernel/kfifo.c | 1 + 1 file changed, 1 insertion(+) (limited to 'kernel') diff --git a/kernel/kfifo.c b/kernel/kfifo.c index c744b88c44e2..59dcf5b81d24 100644 --- a/kernel/kfifo.c +++ b/kernel/kfifo.c @@ -402,6 +402,7 @@ unsigned int __kfifo_max_r(unsigned int len, size_t recsize) return max; return len; } +EXPORT_SYMBOL(__kfifo_max_r); #define __KFIFO_PEEK(data, out, mask) \ ((data)[(out) & (mask)]) -- cgit v1.2.3 From 8d3d5ada56a692d36a9d55858881147ec10cfeb6 Mon Sep 17 00:00:00 2001 From: Kirill Tkhai Date: Wed, 11 Apr 2012 09:06:04 +0400 Subject: sched_rt: Avoid unnecessary dequeue and enqueue of pushable tasks in set_cpus_allowed_rt() Migration status depends on a difference of weight from 0 and 1. If weight > 1 (<= 1) and old weight <= 1 (> 1) then task becomes pushable (or not pushable). We are not insterested in its exact values, is it 3 or 4, for example. Now if we are changing affinity from a set of 3 cpus to a set of 4, the- task will be dequeued and enqueued sequentially without important difference in comparison with initial state. The only difference is in internal representation of plist queue of pushable tasks and the fact that the task may won't be the first in a sequence of the same priority tasks. But it seems to me it gives nothing. Link: http://lkml.kernel.org/r/273741334120764@web83.yandex.ru Cc: Ingo Molnar Cc: Peter Zijlstra Signed-off-by: Tkhai Kirill Signed-off-by: Steven Rostedt --- kernel/sched/rt.c | 56 ++++++++++++++++++++++++++----------------------------- 1 file changed, 26 insertions(+), 30 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c index b60dad720173..90607a94dd69 100644 --- a/kernel/sched/rt.c +++ b/kernel/sched/rt.c @@ -1803,44 +1803,40 @@ static void task_woken_rt(struct rq *rq, struct task_struct *p) static void set_cpus_allowed_rt(struct task_struct *p, const struct cpumask *new_mask) { - int weight = cpumask_weight(new_mask); + struct rq *rq; + int weight; BUG_ON(!rt_task(p)); - /* - * Update the migration status of the RQ if we have an RT task - * which is running AND changing its weight value. - */ - if (p->on_rq && (weight != p->rt.nr_cpus_allowed)) { - struct rq *rq = task_rq(p); - - if (!task_current(rq, p)) { - /* - * Make sure we dequeue this task from the pushable list - * before going further. It will either remain off of - * the list because we are no longer pushable, or it - * will be requeued. - */ - if (p->rt.nr_cpus_allowed > 1) - dequeue_pushable_task(rq, p); + if (!p->on_rq) + return; - /* - * Requeue if our weight is changing and still > 1 - */ - if (weight > 1) - enqueue_pushable_task(rq, p); + weight = cpumask_weight(new_mask); - } + /* + * Only update if the process changes its state from whether it + * can migrate or not. 
+ */ + if ((p->rt.nr_cpus_allowed > 1) == (weight > 1)) + return; - if ((p->rt.nr_cpus_allowed <= 1) && (weight > 1)) { - rq->rt.rt_nr_migratory++; - } else if ((p->rt.nr_cpus_allowed > 1) && (weight <= 1)) { - BUG_ON(!rq->rt.rt_nr_migratory); - rq->rt.rt_nr_migratory--; - } + rq = task_rq(p); - update_rt_migration(&rq->rt); + /* + * The process used to be able to migrate OR it can now migrate + */ + if (weight <= 1) { + if (!task_current(rq, p)) + dequeue_pushable_task(rq, p); + BUG_ON(!rq->rt.rt_nr_migratory); + rq->rt.rt_nr_migratory--; + } else { + if (!task_current(rq, p)) + enqueue_pushable_task(rq, p); + rq->rt.rt_nr_migratory++; } + + update_rt_migration(&rq->rt); } /* Assumes rq->lock is held */ -- cgit v1.2.3 From 259e5e6c75a910f3b5e656151dc602f53f9d7548 Mon Sep 17 00:00:00 2001 From: Andy Lutomirski Date: Thu, 12 Apr 2012 16:47:50 -0500 Subject: Add PR_{GET,SET}_NO_NEW_PRIVS to prevent execve from granting privs With this change, calling prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) disables privilege granting operations at execve-time. For example, a process will not be able to execute a setuid binary to change their uid or gid if this bit is set. The same is true for file capabilities. Additionally, LSM_UNSAFE_NO_NEW_PRIVS is defined to ensure that LSMs respect the requested behavior. To determine if the NO_NEW_PRIVS bit is set, a task may call prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0); It returns 1 if set and 0 if it is not set. If any of the arguments are non-zero, it will return -1 and set errno to -EINVAL. (PR_SET_NO_NEW_PRIVS behaves similarly.) This functionality is desired for the proposed seccomp filter patch series. By using PR_SET_NO_NEW_PRIVS, it allows a task to modify the system call behavior for itself and its child tasks without being able to impact the behavior of a more privileged task. Another potential use is making certain privileged operations unprivileged. For example, chroot may be considered "safe" if it cannot affect privileged tasks. Note, this patch causes execve to fail when PR_SET_NO_NEW_PRIVS is set and AppArmor is in use. It is fixed in a subsequent patch. Signed-off-by: Andy Lutomirski Signed-off-by: Will Drewry Acked-by: Eric Paris Acked-by: Kees Cook v18: updated change desc v17: using new define values as per 3.4 Signed-off-by: James Morris --- kernel/sys.c | 10 ++++++++++ 1 file changed, 10 insertions(+) (limited to 'kernel') diff --git a/kernel/sys.c b/kernel/sys.c index e7006eb6c1e4..b82568b7d201 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -1979,6 +1979,16 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, error = put_user(me->signal->is_child_subreaper, (int __user *) arg2); break; + case PR_SET_NO_NEW_PRIVS: + if (arg2 != 1 || arg3 || arg4 || arg5) + return -EINVAL; + + current->no_new_privs = 1; + break; + case PR_GET_NO_NEW_PRIVS: + if (arg2 || arg3 || arg4 || arg5) + return -EINVAL; + return current->no_new_privs ? 1 : 0; default: error = -EINVAL; break; -- cgit v1.2.3 From e2cfabdfd075648216f99c2c03821cf3f47c1727 Mon Sep 17 00:00:00 2001 From: Will Drewry Date: Thu, 12 Apr 2012 16:47:57 -0500 Subject: seccomp: add system call filtering using BPF [This patch depends on luto@mit.edu's no_new_privs patch: https://lkml.org/lkml/2012/1/30/264 The whole series including Andrew's patches can be found here: https://github.com/redpig/linux/tree/seccomp Complete diff here: https://github.com/redpig/linux/compare/1dc65fed...seccomp ] This patch adds support for seccomp mode 2. 
Mode 2 introduces the ability for unprivileged processes to install system call filtering policy expressed in terms of a Berkeley Packet Filter (BPF) program. This program will be evaluated in the kernel for each system call the task makes and computes a result based on data in the format of struct seccomp_data. A filter program may be installed by calling: struct sock_fprog fprog = { ... }; ... prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &fprog); The return value of the filter program determines if the system call is allowed to proceed or denied. If the first filter program installed allows prctl(2) calls, then the above call may be made repeatedly by a task to further reduce its access to the kernel. All attached programs must be evaluated before a system call will be allowed to proceed. Filter programs will be inherited across fork/clone and execve. However, if the task attaching the filter is unprivileged (!CAP_SYS_ADMIN) the no_new_privs bit will be set on the task. This ensures that unprivileged tasks cannot attach filters that affect privileged tasks (e.g., setuid binary). There are a number of benefits to this approach. A few of which are as follows: - BPF has been exposed to userland for a long time - BPF optimization (and JIT'ing) are well understood - Userland already knows its ABI: system call numbers and desired arguments - No time-of-check-time-of-use vulnerable data accesses are possible. - system call arguments are loaded on access only to minimize copying required for system call policy decisions. Mode 2 support is restricted to architectures that enable HAVE_ARCH_SECCOMP_FILTER. In this patch, the primary dependency is on syscall_get_arguments(). The full desired scope of this feature will add a few minor additional requirements expressed later in this series. Based on discussion, SECCOMP_RET_ERRNO and SECCOMP_RET_TRACE seem to be the desired additional functionality. No architectures are enabled in this patch. 
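To illustrate the ABI described above, here is a minimal userspace sketch that installs a filter returning an errno for one syscall and allowing everything else. It assumes the SECCOMP_* and PR_SET_NO_NEW_PRIVS constants are available from installed headers (older headers may need local definitions), and it uses SECCOMP_RET_ERRNO, which is only added by a later patch in this series; against this patch alone a filter is limited to SECCOMP_RET_ALLOW and SECCOMP_RET_KILL. A real filter should also check seccomp_data->arch before trusting the syscall number:

  #include <stddef.h>          /* offsetof */
  #include <stdio.h>
  #include <errno.h>
  #include <unistd.h>
  #include <sys/prctl.h>
  #include <sys/syscall.h>
  #include <linux/filter.h>
  #include <linux/seccomp.h>

  int main(void)
  {
          struct sock_filter insns[] = {
                  /* A = seccomp_data.nr (the syscall number) */
                  BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                           offsetof(struct seccomp_data, nr)),
                  /* deny getcwd() with EPERM, allow everything else */
                  BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_getcwd, 0, 1),
                  BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM),
                  BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
          };
          struct sock_fprog prog = {
                  .len = sizeof(insns) / sizeof(insns[0]),
                  .filter = insns,
          };
          char buf[64];

          /* required unless the caller has CAP_SYS_ADMIN */
          if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
                  perror("PR_SET_NO_NEW_PRIVS");
          if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog))
                  perror("PR_SET_SECCOMP");

          if (!getcwd(buf, sizeof(buf)))
                  perror("getcwd");       /* expected to fail with EPERM */
          return 0;
  }

Because attached programs are evaluated on every syscall and inherited across fork/clone and execve, the same prctl() call can be repeated to stack further, more restrictive filters as long as the first filter still allows prctl(2).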
Signed-off-by: Will Drewry Acked-by: Serge Hallyn Reviewed-by: Indan Zupancic Acked-by: Eric Paris Reviewed-by: Kees Cook v18: - rebase to v3.4-rc2 - s/chk/check/ (akpm@linux-foundation.org,jmorris@namei.org) - allocate with GFP_KERNEL|__GFP_NOWARN (indan@nul.nu) - add a comment for get_u32 regarding endianness (akpm@) - fix other typos, style mistakes (akpm@) - added acked-by v17: - properly guard seccomp filter needed headers (leann@ubuntu.com) - tighten return mask to 0x7fff0000 v16: - no change v15: - add a 4 instr penalty when counting a path to account for seccomp_filter size (indan@nul.nu) - drop the max insns to 256KB (indan@nul.nu) - return ENOMEM if the max insns limit has been hit (indan@nul.nu) - move IP checks after args (indan@nul.nu) - drop !user_filter check (indan@nul.nu) - only allow explicit bpf codes (indan@nul.nu) - exit_code -> exit_sig v14: - put/get_seccomp_filter takes struct task_struct (indan@nul.nu,keescook@chromium.org) - adds seccomp_chk_filter and drops general bpf_run/chk_filter user - add seccomp_bpf_load for use by net/core/filter.c - lower max per-process/per-hierarchy: 1MB - moved nnp/capability check prior to allocation (all of the above: indan@nul.nu) v13: - rebase on to 88ebdda6159ffc15699f204c33feb3e431bf9bdc v12: - added a maximum instruction count per path (indan@nul.nu,oleg@redhat.com) - removed copy_seccomp (keescook@chromium.org,indan@nul.nu) - reworded the prctl_set_seccomp comment (indan@nul.nu) v11: - reorder struct seccomp_data to allow future args expansion (hpa@zytor.com) - style clean up, @compat dropped, compat_sock_fprog32 (indan@nul.nu) - do_exit(SIGSYS) (keescook@chromium.org, luto@mit.edu) - pare down Kconfig doc reference. - extra comment clean up v10: - seccomp_data has changed again to be more aesthetically pleasing (hpa@zytor.com) - calling convention is noted in a new u32 field using syscall_get_arch. This allows for cross-calling convention tasks to use seccomp filters. (hpa@zytor.com) - lots of clean up (thanks, Indan!) v9: - n/a v8: - use bpf_chk_filter, bpf_run_filter. update load_fns - Lots of fixes courtesy of indan@nul.nu: -- fix up load behavior, compat fixups, and merge alloc code, -- renamed pc and dropped __packed, use bool compat. -- Added a hidden CONFIG_SECCOMP_FILTER to synthesize non-arch dependencies v7: (massive overhaul thanks to Indan, others) - added CONFIG_HAVE_ARCH_SECCOMP_FILTER - merged into seccomp.c - minimal seccomp_filter.h - no config option (part of seccomp) - no new prctl - doesn't break seccomp on systems without asm/syscall.h (works but arg access always fails) - dropped seccomp_init_task, extra free functions, ... - dropped the no-asm/syscall.h code paths - merges with network sk_run_filter and sk_chk_filter v6: - fix memory leak on attach compat check failure - require no_new_privs || CAP_SYS_ADMIN prior to filter installation. (luto@mit.edu) - s/seccomp_struct_/seccomp_/ for macros/functions (amwang@redhat.com) - cleaned up Kconfig (amwang@redhat.com) - on block, note if the call was compat (so the # means something) v5: - uses syscall_get_arguments (indan@nul.nu,oleg@redhat.com, mcgrathr@chromium.org) - uses union-based arg storage with hi/lo struct to handle endianness. Compromises between the two alternate proposals to minimize extra arg shuffling and account for endianness assuming userspace uses offsetof(). 
(mcgrathr@chromium.org, indan@nul.nu) - update Kconfig description - add include/seccomp_filter.h and add its installation - (naive) on-demand syscall argument loading - drop seccomp_t (eparis@redhat.com) v4: - adjusted prctl to make room for PR_[SG]ET_NO_NEW_PRIVS - now uses current->no_new_privs (luto@mit.edu,torvalds@linux-foundation.com) - assign names to seccomp modes (rdunlap@xenotime.net) - fix style issues (rdunlap@xenotime.net) - reworded Kconfig entry (rdunlap@xenotime.net) v3: - macros to inline (oleg@redhat.com) - init_task behavior fixed (oleg@redhat.com) - drop creator entry and extra NULL check (oleg@redhat.com) - alloc returns -EINVAL on bad sizing (serge.hallyn@canonical.com) - adds tentative use of "always_unprivileged" as per torvalds@linux-foundation.org and luto@mit.edu v2: - (patch 2 only) Signed-off-by: James Morris --- kernel/fork.c | 3 + kernel/seccomp.c | 396 ++++++++++++++++++++++++++++++++++++++++++++++++++++--- kernel/sys.c | 2 +- 3 files changed, 382 insertions(+), 19 deletions(-) (limited to 'kernel') diff --git a/kernel/fork.c b/kernel/fork.c index b9372a0bff18..f7cf6fb107ec 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -34,6 +34,7 @@ #include #include #include +#include #include #include #include @@ -170,6 +171,7 @@ void free_task(struct task_struct *tsk) free_thread_info(tsk->stack); rt_mutex_debug_task_free(tsk); ftrace_graph_exit_task(tsk); + put_seccomp_filter(tsk); free_task_struct(tsk); } EXPORT_SYMBOL(free_task); @@ -1162,6 +1164,7 @@ static struct task_struct *copy_process(unsigned long clone_flags, goto fork_out; ftrace_graph_init_task(p); + get_seccomp_filter(p); rt_mutex_init_task(p); diff --git a/kernel/seccomp.c b/kernel/seccomp.c index e8d76c5895ea..0aeec1960f91 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -3,16 +3,343 @@ * * Copyright 2004-2005 Andrea Arcangeli * - * This defines a simple but solid secure-computing mode. + * Copyright (C) 2012 Google, Inc. + * Will Drewry + * + * This defines a simple but solid secure-computing facility. + * + * Mode 1 uses a fixed list of allowed system calls. + * Mode 2 allows user-defined system call filters in the form + * of Berkeley Packet Filters/Linux Socket Filters. */ +#include #include -#include -#include #include +#include +#include /* #define SECCOMP_DEBUG 1 */ -#define NR_SECCOMP_MODES 1 + +#ifdef CONFIG_SECCOMP_FILTER +#include +#include +#include +#include +#include +#include + +/** + * struct seccomp_filter - container for seccomp BPF programs + * + * @usage: reference count to manage the object lifetime. + * get/put helpers should be used when accessing an instance + * outside of a lifetime-guarded section. In general, this + * is only needed for handling filters shared across tasks. + * @prev: points to a previously installed, or inherited, filter + * @len: the number of instructions in the program + * @insns: the BPF program instructions to evaluate + * + * seccomp_filter objects are organized in a tree linked via the @prev + * pointer. For any task, it appears to be a singly-linked list starting + * with current->seccomp.filter, the most recently attached or inherited filter. + * However, multiple filters may share a @prev node, by way of fork(), which + * results in a unidirectional tree existing in memory. This is similar to + * how namespaces work. + * + * seccomp_filter objects should never be modified after being attached + * to a task_struct (other than @usage). 
+ */ +struct seccomp_filter { + atomic_t usage; + struct seccomp_filter *prev; + unsigned short len; /* Instruction count */ + struct sock_filter insns[]; +}; + +/* Limit any path through the tree to 256KB worth of instructions. */ +#define MAX_INSNS_PER_PATH ((1 << 18) / sizeof(struct sock_filter)) + +static void seccomp_filter_log_failure(int syscall) +{ + int compat = 0; +#ifdef CONFIG_COMPAT + compat = is_compat_task(); +#endif + pr_info("%s[%d]: %ssystem call %d blocked at 0x%lx\n", + current->comm, task_pid_nr(current), + (compat ? "compat " : ""), + syscall, KSTK_EIP(current)); +} + +/** + * get_u32 - returns a u32 offset into data + * @data: a unsigned 64 bit value + * @index: 0 or 1 to return the first or second 32-bits + * + * This inline exists to hide the length of unsigned long. If a 32-bit + * unsigned long is passed in, it will be extended and the top 32-bits will be + * 0. If it is a 64-bit unsigned long, then whatever data is resident will be + * properly returned. + * + * Endianness is explicitly ignored and left for BPF program authors to manage + * as per the specific architecture. + */ +static inline u32 get_u32(u64 data, int index) +{ + return ((u32 *)&data)[index]; +} + +/* Helper for bpf_load below. */ +#define BPF_DATA(_name) offsetof(struct seccomp_data, _name) +/** + * bpf_load: checks and returns a pointer to the requested offset + * @off: offset into struct seccomp_data to load from + * + * Returns the requested 32-bits of data. + * seccomp_check_filter() should assure that @off is 32-bit aligned + * and not out of bounds. Failure to do so is a BUG. + */ +u32 seccomp_bpf_load(int off) +{ + struct pt_regs *regs = task_pt_regs(current); + if (off == BPF_DATA(nr)) + return syscall_get_nr(current, regs); + if (off == BPF_DATA(arch)) + return syscall_get_arch(current, regs); + if (off >= BPF_DATA(args[0]) && off < BPF_DATA(args[6])) { + unsigned long value; + int arg = (off - BPF_DATA(args[0])) / sizeof(u64); + int index = !!(off % sizeof(u64)); + syscall_get_arguments(current, regs, arg, 1, &value); + return get_u32(value, index); + } + if (off == BPF_DATA(instruction_pointer)) + return get_u32(KSTK_EIP(current), 0); + if (off == BPF_DATA(instruction_pointer) + sizeof(u32)) + return get_u32(KSTK_EIP(current), 1); + /* seccomp_check_filter should make this impossible. */ + BUG(); +} + +/** + * seccomp_check_filter - verify seccomp filter code + * @filter: filter to verify + * @flen: length of filter + * + * Takes a previously checked filter (by sk_chk_filter) and + * redirects all filter code that loads struct sk_buff data + * and related data through seccomp_bpf_load. It also + * enforces length and alignment checking of those loads. + * + * Returns 0 if the rule set is legal or -EINVAL if not. + */ +static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen) +{ + int pc; + for (pc = 0; pc < flen; pc++) { + struct sock_filter *ftest = &filter[pc]; + u16 code = ftest->code; + u32 k = ftest->k; + + switch (code) { + case BPF_S_LD_W_ABS: + ftest->code = BPF_S_ANC_SECCOMP_LD_W; + /* 32-bit aligned and not out of bounds. */ + if (k >= sizeof(struct seccomp_data) || k & 3) + return -EINVAL; + continue; + case BPF_S_LD_W_LEN: + ftest->code = BPF_S_LD_IMM; + ftest->k = sizeof(struct seccomp_data); + continue; + case BPF_S_LDX_W_LEN: + ftest->code = BPF_S_LDX_IMM; + ftest->k = sizeof(struct seccomp_data); + continue; + /* Explicitly include allowed calls. 
*/ + case BPF_S_RET_K: + case BPF_S_RET_A: + case BPF_S_ALU_ADD_K: + case BPF_S_ALU_ADD_X: + case BPF_S_ALU_SUB_K: + case BPF_S_ALU_SUB_X: + case BPF_S_ALU_MUL_K: + case BPF_S_ALU_MUL_X: + case BPF_S_ALU_DIV_X: + case BPF_S_ALU_AND_K: + case BPF_S_ALU_AND_X: + case BPF_S_ALU_OR_K: + case BPF_S_ALU_OR_X: + case BPF_S_ALU_LSH_K: + case BPF_S_ALU_LSH_X: + case BPF_S_ALU_RSH_K: + case BPF_S_ALU_RSH_X: + case BPF_S_ALU_NEG: + case BPF_S_LD_IMM: + case BPF_S_LDX_IMM: + case BPF_S_MISC_TAX: + case BPF_S_MISC_TXA: + case BPF_S_ALU_DIV_K: + case BPF_S_LD_MEM: + case BPF_S_LDX_MEM: + case BPF_S_ST: + case BPF_S_STX: + case BPF_S_JMP_JA: + case BPF_S_JMP_JEQ_K: + case BPF_S_JMP_JEQ_X: + case BPF_S_JMP_JGE_K: + case BPF_S_JMP_JGE_X: + case BPF_S_JMP_JGT_K: + case BPF_S_JMP_JGT_X: + case BPF_S_JMP_JSET_K: + case BPF_S_JMP_JSET_X: + continue; + default: + return -EINVAL; + } + } + return 0; +} + +/** + * seccomp_run_filters - evaluates all seccomp filters against @syscall + * @syscall: number of the current system call + * + * Returns valid seccomp BPF response codes. + */ +static u32 seccomp_run_filters(int syscall) +{ + struct seccomp_filter *f; + u32 ret = SECCOMP_RET_KILL; + /* + * All filters in the list are evaluated and the lowest BPF return + * value always takes priority. + */ + for (f = current->seccomp.filter; f; f = f->prev) { + ret = sk_run_filter(NULL, f->insns); + if (ret != SECCOMP_RET_ALLOW) + break; + } + return ret; +} + +/** + * seccomp_attach_filter: Attaches a seccomp filter to current. + * @fprog: BPF program to install + * + * Returns 0 on success or an errno on failure. + */ +static long seccomp_attach_filter(struct sock_fprog *fprog) +{ + struct seccomp_filter *filter; + unsigned long fp_size = fprog->len * sizeof(struct sock_filter); + unsigned long total_insns = fprog->len; + long ret; + + if (fprog->len == 0 || fprog->len > BPF_MAXINSNS) + return -EINVAL; + + for (filter = current->seccomp.filter; filter; filter = filter->prev) + total_insns += filter->len + 4; /* include a 4 instr penalty */ + if (total_insns > MAX_INSNS_PER_PATH) + return -ENOMEM; + + /* + * Installing a seccomp filter requires that the task have + * CAP_SYS_ADMIN in its namespace or be running with no_new_privs. + * This avoids scenarios where unprivileged tasks can affect the + * behavior of privileged children. + */ + if (!current->no_new_privs && + security_capable_noaudit(current_cred(), current_user_ns(), + CAP_SYS_ADMIN) != 0) + return -EACCES; + + /* Allocate a new seccomp_filter */ + filter = kzalloc(sizeof(struct seccomp_filter) + fp_size, + GFP_KERNEL|__GFP_NOWARN); + if (!filter) + return -ENOMEM; + atomic_set(&filter->usage, 1); + filter->len = fprog->len; + + /* Copy the instructions from fprog. */ + ret = -EFAULT; + if (copy_from_user(filter->insns, fprog->filter, fp_size)) + goto fail; + + /* Check and rewrite the fprog via the skb checker */ + ret = sk_chk_filter(filter->insns, filter->len); + if (ret) + goto fail; + + /* Check and rewrite the fprog for seccomp use */ + ret = seccomp_check_filter(filter->insns, filter->len); + if (ret) + goto fail; + + /* + * If there is an existing filter, make it the prev and don't drop its + * task reference. + */ + filter->prev = current->seccomp.filter; + current->seccomp.filter = filter; + return 0; +fail: + kfree(filter); + return ret; +} + +/** + * seccomp_attach_user_filter - attaches a user-supplied sock_fprog + * @user_filter: pointer to the user data containing a sock_fprog. + * + * Returns 0 on success and non-zero otherwise. 
+ */ +long seccomp_attach_user_filter(char __user *user_filter) +{ + struct sock_fprog fprog; + long ret = -EFAULT; + +#ifdef CONFIG_COMPAT + if (is_compat_task()) { + struct compat_sock_fprog fprog32; + if (copy_from_user(&fprog32, user_filter, sizeof(fprog32))) + goto out; + fprog.len = fprog32.len; + fprog.filter = compat_ptr(fprog32.filter); + } else /* falls through to the if below. */ +#endif + if (copy_from_user(&fprog, user_filter, sizeof(fprog))) + goto out; + ret = seccomp_attach_filter(&fprog); +out: + return ret; +} + +/* get_seccomp_filter - increments the reference count of the filter on @tsk */ +void get_seccomp_filter(struct task_struct *tsk) +{ + struct seccomp_filter *orig = tsk->seccomp.filter; + if (!orig) + return; + /* Reference count is bounded by the number of total processes. */ + atomic_inc(&orig->usage); +} + +/* put_seccomp_filter - decrements the ref count of tsk->seccomp.filter */ +void put_seccomp_filter(struct task_struct *tsk) +{ + struct seccomp_filter *orig = tsk->seccomp.filter; + /* Clean up single-reference branches iteratively. */ + while (orig && atomic_dec_and_test(&orig->usage)) { + struct seccomp_filter *freeme = orig; + orig = orig->prev; + kfree(freeme); + } +} +#endif /* CONFIG_SECCOMP_FILTER */ /* * Secure computing mode 1 allows only read/write/exit/sigreturn. @@ -34,10 +361,11 @@ static int mode1_syscalls_32[] = { void __secure_computing(int this_syscall) { int mode = current->seccomp.mode; - int * syscall; + int exit_sig = 0; + int *syscall; switch (mode) { - case 1: + case SECCOMP_MODE_STRICT: syscall = mode1_syscalls; #ifdef CONFIG_COMPAT if (is_compat_task()) @@ -47,7 +375,16 @@ void __secure_computing(int this_syscall) if (*syscall == this_syscall) return; } while (*++syscall); + exit_sig = SIGKILL; break; +#ifdef CONFIG_SECCOMP_FILTER + case SECCOMP_MODE_FILTER: + if (seccomp_run_filters(this_syscall) == SECCOMP_RET_ALLOW) + return; + seccomp_filter_log_failure(this_syscall); + exit_sig = SIGSYS; + break; +#endif default: BUG(); } @@ -56,7 +393,7 @@ void __secure_computing(int this_syscall) dump_stack(); #endif audit_seccomp(this_syscall); - do_exit(SIGKILL); + do_exit(exit_sig); } long prctl_get_seccomp(void) @@ -64,25 +401,48 @@ long prctl_get_seccomp(void) return current->seccomp.mode; } -long prctl_set_seccomp(unsigned long seccomp_mode) +/** + * prctl_set_seccomp: configures current->seccomp.mode + * @seccomp_mode: requested mode to use + * @filter: optional struct sock_fprog for use with SECCOMP_MODE_FILTER + * + * This function may be called repeatedly with a @seccomp_mode of + * SECCOMP_MODE_FILTER to install additional filters. Every filter + * successfully installed will be evaluated (in reverse order) for each system + * call the task makes. + * + * Once current->seccomp.mode is non-zero, it may not be changed. + * + * Returns 0 on success or -EINVAL on failure. 
+ */ +long prctl_set_seccomp(unsigned long seccomp_mode, char __user *filter) { - long ret; + long ret = -EINVAL; - /* can set it only once to be even more secure */ - ret = -EPERM; - if (unlikely(current->seccomp.mode)) + if (current->seccomp.mode && + current->seccomp.mode != seccomp_mode) goto out; - ret = -EINVAL; - if (seccomp_mode && seccomp_mode <= NR_SECCOMP_MODES) { - current->seccomp.mode = seccomp_mode; - set_thread_flag(TIF_SECCOMP); + switch (seccomp_mode) { + case SECCOMP_MODE_STRICT: + ret = 0; #ifdef TIF_NOTSC disable_TSC(); #endif - ret = 0; + break; +#ifdef CONFIG_SECCOMP_FILTER + case SECCOMP_MODE_FILTER: + ret = seccomp_attach_user_filter(filter); + if (ret) + goto out; + break; +#endif + default: + goto out; } - out: + current->seccomp.mode = seccomp_mode; + set_thread_flag(TIF_SECCOMP); +out: return ret; } diff --git a/kernel/sys.c b/kernel/sys.c index b82568b7d201..ba0ae8eea6fb 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -1908,7 +1908,7 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, error = prctl_get_seccomp(); break; case PR_SET_SECCOMP: - error = prctl_set_seccomp(arg2); + error = prctl_set_seccomp(arg2, (char __user *)arg3); break; case PR_GET_TSC: error = GET_TSC_CTL(arg2); -- cgit v1.2.3 From 3dc1c1b2d2ed7507ce8a379814ad75745ff97ebe Mon Sep 17 00:00:00 2001 From: Kees Cook Date: Thu, 12 Apr 2012 16:47:58 -0500 Subject: seccomp: remove duplicated failure logging This consolidates the seccomp filter error logging path and adds more details to the audit log. Signed-off-by: Will Drewry Signed-off-by: Kees Cook Acked-by: Eric Paris v18: make compat= permanent in the record v15: added a return code to the audit_seccomp path by wad@chromium.org (suggested by eparis@redhat.com) v*: original by keescook@chromium.org Signed-off-by: James Morris --- kernel/auditsc.c | 8 ++++++-- kernel/seccomp.c | 15 +-------------- 2 files changed, 7 insertions(+), 16 deletions(-) (limited to 'kernel') diff --git a/kernel/auditsc.c b/kernel/auditsc.c index af1de0f34eae..4b96415527b8 100644 --- a/kernel/auditsc.c +++ b/kernel/auditsc.c @@ -67,6 +67,7 @@ #include #include #include +#include #include "audit.h" @@ -2710,13 +2711,16 @@ void audit_core_dumps(long signr) audit_log_end(ab); } -void __audit_seccomp(unsigned long syscall) +void __audit_seccomp(unsigned long syscall, long signr, int code) { struct audit_buffer *ab; ab = audit_log_start(NULL, GFP_KERNEL, AUDIT_ANOM_ABEND); - audit_log_abend(ab, "seccomp", SIGKILL); + audit_log_abend(ab, "seccomp", signr); audit_log_format(ab, " syscall=%ld", syscall); + audit_log_format(ab, " compat=%d", is_compat_task()); + audit_log_format(ab, " ip=0x%lx", KSTK_EIP(current)); + audit_log_format(ab, " code=0x%x", code); audit_log_end(ab); } diff --git a/kernel/seccomp.c b/kernel/seccomp.c index 0aeec1960f91..0f7c709a523e 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -60,18 +60,6 @@ struct seccomp_filter { /* Limit any path through the tree to 256KB worth of instructions. */ #define MAX_INSNS_PER_PATH ((1 << 18) / sizeof(struct sock_filter)) -static void seccomp_filter_log_failure(int syscall) -{ - int compat = 0; -#ifdef CONFIG_COMPAT - compat = is_compat_task(); -#endif - pr_info("%s[%d]: %ssystem call %d blocked at 0x%lx\n", - current->comm, task_pid_nr(current), - (compat ? 
"compat " : ""), - syscall, KSTK_EIP(current)); -} - /** * get_u32 - returns a u32 offset into data * @data: a unsigned 64 bit value @@ -381,7 +369,6 @@ void __secure_computing(int this_syscall) case SECCOMP_MODE_FILTER: if (seccomp_run_filters(this_syscall) == SECCOMP_RET_ALLOW) return; - seccomp_filter_log_failure(this_syscall); exit_sig = SIGSYS; break; #endif @@ -392,7 +379,7 @@ void __secure_computing(int this_syscall) #ifdef SECCOMP_DEBUG dump_stack(); #endif - audit_seccomp(this_syscall); + audit_seccomp(this_syscall, exit_code, SECCOMP_RET_KILL); do_exit(exit_sig); } -- cgit v1.2.3 From acf3b2c71ed20c53dc69826683417703c2a88059 Mon Sep 17 00:00:00 2001 From: Will Drewry Date: Thu, 12 Apr 2012 16:47:59 -0500 Subject: seccomp: add SECCOMP_RET_ERRNO This change adds the SECCOMP_RET_ERRNO as a valid return value from a seccomp filter. Additionally, it makes the first use of the lower 16-bits for storing a filter-supplied errno. 16-bits is more than enough for the errno-base.h calls. Returning errors instead of immediately terminating processes that violate seccomp policy allow for broader use of this functionality for kernel attack surface reduction. For example, a linux container could maintain a whitelist of pre-existing system calls but drop all new ones with errnos. This would keep a logically static attack surface while providing errnos that may allow for graceful failure without the downside of do_exit() on a bad call. This change also changes the signature of __secure_computing. It appears the only direct caller is the arm entry code and it clobbers any possible return value (register) immediately. Signed-off-by: Will Drewry Acked-by: Serge Hallyn Reviewed-by: Kees Cook Acked-by: Eric Paris v18: - fix up comments and rebase - fix bad var name which was fixed in later revs - remove _int() and just change the __secure_computing signature v16-v17: ... v15: - use audit_seccomp and add a skip label. (eparis@redhat.com) - clean up and pad out return codes (indan@nul.nu) v14: - no change/rebase v13: - rebase on to 88ebdda6159ffc15699f204c33feb3e431bf9bdc v12: - move to WARN_ON if filter is NULL (oleg@redhat.com, luto@mit.edu, keescook@chromium.org) - return immediately for filter==NULL (keescook@chromium.org) - change evaluation to only compare the ACTION so that layered errnos don't result in the lowest one being returned. (keeschook@chromium.org) v11: - check for NULL filter (keescook@chromium.org) v10: - change loaders to fn v9: - n/a v8: - update Kconfig to note new need for syscall_set_return_value. - reordered such that TRAP behavior follows on later. - made the for loop a little less indent-y v7: - introduced Signed-off-by: James Morris --- kernel/seccomp.c | 42 ++++++++++++++++++++++++++++++++---------- 1 file changed, 32 insertions(+), 10 deletions(-) (limited to 'kernel') diff --git a/kernel/seccomp.c b/kernel/seccomp.c index 0f7c709a523e..5f78fb6d2212 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -199,15 +199,20 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen) static u32 seccomp_run_filters(int syscall) { struct seccomp_filter *f; - u32 ret = SECCOMP_RET_KILL; + u32 ret = SECCOMP_RET_ALLOW; + + /* Ensure unexpected behavior doesn't result in failing open. */ + if (WARN_ON(current->seccomp.filter == NULL)) + return SECCOMP_RET_KILL; + /* * All filters in the list are evaluated and the lowest BPF return - * value always takes priority. + * value always takes priority (ignoring the DATA). 
*/ for (f = current->seccomp.filter; f; f = f->prev) { - ret = sk_run_filter(NULL, f->insns); - if (ret != SECCOMP_RET_ALLOW) - break; + u32 cur_ret = sk_run_filter(NULL, f->insns); + if ((cur_ret & SECCOMP_RET_ACTION) < (ret & SECCOMP_RET_ACTION)) + ret = cur_ret; } return ret; } @@ -346,11 +351,13 @@ static int mode1_syscalls_32[] = { }; #endif -void __secure_computing(int this_syscall) +int __secure_computing(int this_syscall) { int mode = current->seccomp.mode; int exit_sig = 0; int *syscall; + u32 ret = SECCOMP_RET_KILL; + int data; switch (mode) { case SECCOMP_MODE_STRICT: @@ -361,14 +368,26 @@ void __secure_computing(int this_syscall) #endif do { if (*syscall == this_syscall) - return; + return 0; } while (*++syscall); exit_sig = SIGKILL; break; #ifdef CONFIG_SECCOMP_FILTER case SECCOMP_MODE_FILTER: - if (seccomp_run_filters(this_syscall) == SECCOMP_RET_ALLOW) - return; + ret = seccomp_run_filters(this_syscall); + data = ret & SECCOMP_RET_DATA; + switch (ret & SECCOMP_RET_ACTION) { + case SECCOMP_RET_ERRNO: + /* Set the low-order 16-bits as a errno. */ + syscall_set_return_value(current, task_pt_regs(current), + -data, 0); + goto skip; + case SECCOMP_RET_ALLOW: + return 0; + case SECCOMP_RET_KILL: + default: + break; + } exit_sig = SIGSYS; break; #endif @@ -379,8 +398,11 @@ void __secure_computing(int this_syscall) #ifdef SECCOMP_DEBUG dump_stack(); #endif - audit_seccomp(this_syscall, exit_code, SECCOMP_RET_KILL); + audit_seccomp(this_syscall, exit_sig, ret); do_exit(exit_sig); +skip: + audit_seccomp(this_syscall, exit_sig, ret); + return -1; } long prctl_get_seccomp(void) -- cgit v1.2.3 From a0727e8ce513fe6890416da960181ceb10fbfae6 Mon Sep 17 00:00:00 2001 From: Will Drewry Date: Thu, 12 Apr 2012 16:48:00 -0500 Subject: signal, x86: add SIGSYS info and make it synchronous. This change enables SIGSYS, defines _sigfields._sigsys, and adds x86 (compat) arch support. _sigsys defines fields which allow a signal handler to receive the triggering system call number, the relevant AUDIT_ARCH_* value for that number, and the address of the callsite. SIGSYS is added to the SYNCHRONOUS_MASK because it is desirable for it to have setup_frame() called for it. The goal is to ensure that ucontext_t reflects the machine state from the time-of-syscall and not from another signal handler. The first consumer of SIGSYS would be seccomp filter. In particular, a filter program could specify a new return value, SECCOMP_RET_TRAP, which would result in the system call being denied and the calling thread signaled. This also means that implementing arch-specific support can be dependent upon HAVE_ARCH_SECCOMP_FILTER. Suggested-by: H. Peter Anvin Signed-off-by: Will Drewry Acked-by: Serge Hallyn Reviewed-by: H. Peter Anvin Acked-by: Eric Paris v18: - added acked by, rebase v17: - rebase and reviewed-by addition v14: - rebase/nochanges v13: - rebase on to 88ebdda6159ffc15699f204c33feb3e431bf9bdc v12: - reworded changelog (oleg@redhat.com) v11: - fix dropped words in the change description - added fallback copy_siginfo support. - added __ARCH_SIGSYS define to allow stepped arch support. 
v10: - first version based on suggestion Signed-off-by: James Morris --- kernel/signal.c | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/signal.c b/kernel/signal.c index 17afcaf582d0..1a006b5d9d9d 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -160,7 +160,7 @@ void recalc_sigpending(void) #define SYNCHRONOUS_MASK \ (sigmask(SIGSEGV) | sigmask(SIGBUS) | sigmask(SIGILL) | \ - sigmask(SIGTRAP) | sigmask(SIGFPE)) + sigmask(SIGTRAP) | sigmask(SIGFPE) | sigmask(SIGSYS)) int next_signal(struct sigpending *pending, sigset_t *mask) { @@ -2706,6 +2706,13 @@ int copy_siginfo_to_user(siginfo_t __user *to, siginfo_t *from) err |= __put_user(from->si_uid, &to->si_uid); err |= __put_user(from->si_ptr, &to->si_ptr); break; +#ifdef __ARCH_SIGSYS + case __SI_SYS: + err |= __put_user(from->si_call_addr, &to->si_call_addr); + err |= __put_user(from->si_syscall, &to->si_syscall); + err |= __put_user(from->si_arch, &to->si_arch); + break; +#endif default: /* this is just in case for now ... */ err |= __put_user(from->si_pid, &to->si_pid); err |= __put_user(from->si_uid, &to->si_uid); -- cgit v1.2.3 From bb6ea4301a1109afdacaee576fedbfcd7152fc86 Mon Sep 17 00:00:00 2001 From: Will Drewry Date: Thu, 12 Apr 2012 16:48:01 -0500 Subject: seccomp: Add SECCOMP_RET_TRAP Adds a new return value to seccomp filters that triggers a SIGSYS to be delivered with the new SYS_SECCOMP si_code. This allows in-process system call emulation, including just specifying an errno or cleanly dumping core, rather than just dying. Suggested-by: Markus Gutschke Suggested-by: Julien Tinnes Signed-off-by: Will Drewry Acked-by: Eric Paris v18: - acked-by, rebase - don't mention secure_computing_int() anymore v15: - use audit_seccomp/skip - pad out error spacing; clean up switch (indan@nul.nu) v14: - n/a v13: - rebase on to 88ebdda6159ffc15699f204c33feb3e431bf9bdc v12: - rebase on to linux-next v11: - clarify the comment (indan@nul.nu) - s/sigtrap/sigsys v10: - use SIGSYS, syscall_get_arch, updates arch/Kconfig note suggested-by (though original suggestion had other behaviors) v9: - changes to SIGILL v8: - clean up based on changes to dependent patches v7: - introduction Signed-off-by: James Morris --- kernel/seccomp.c | 26 ++++++++++++++++++++++++++ 1 file changed, 26 insertions(+) (limited to 'kernel') diff --git a/kernel/seccomp.c b/kernel/seccomp.c index 5f78fb6d2212..9c3830692a08 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -332,6 +332,26 @@ void put_seccomp_filter(struct task_struct *tsk) kfree(freeme); } } + +/** + * seccomp_send_sigsys - signals the task to allow in-process syscall emulation + * @syscall: syscall number to send to userland + * @reason: filter-supplied reason code to send to userland (via si_errno) + * + * Forces a SIGSYS with a code of SYS_SECCOMP and related sigsys info. + */ +static void seccomp_send_sigsys(int syscall, int reason) +{ + struct siginfo info; + memset(&info, 0, sizeof(info)); + info.si_signo = SIGSYS; + info.si_code = SYS_SECCOMP; + info.si_call_addr = (void __user *)KSTK_EIP(current); + info.si_errno = reason; + info.si_arch = syscall_get_arch(current, task_pt_regs(current)); + info.si_syscall = syscall; + force_sig_info(SIGSYS, &info, current); +} #endif /* CONFIG_SECCOMP_FILTER */ /* @@ -382,6 +402,12 @@ int __secure_computing(int this_syscall) syscall_set_return_value(current, task_pt_regs(current), -data, 0); goto skip; + case SECCOMP_RET_TRAP: + /* Show the handler the original registers. 
*/ + syscall_rollback(current, task_pt_regs(current)); + /* Let the filter pass back 16 bits of data. */ + seccomp_send_sigsys(this_syscall, data); + goto skip; case SECCOMP_RET_ALLOW: return 0; case SECCOMP_RET_KILL: -- cgit v1.2.3 From fb0fadf9b213f55ca9368f3edafe51101d5d2deb Mon Sep 17 00:00:00 2001 From: Will Drewry Date: Thu, 12 Apr 2012 16:48:02 -0500 Subject: ptrace,seccomp: Add PTRACE_SECCOMP support This change adds support for a new ptrace option, PTRACE_O_TRACESECCOMP, and a new return value for seccomp BPF programs, SECCOMP_RET_TRACE. When a tracer specifies the PTRACE_O_TRACESECCOMP ptrace option, the tracer will be notified, via PTRACE_EVENT_SECCOMP, for any syscall that results in a BPF program returning SECCOMP_RET_TRACE. The 16-bit SECCOMP_RET_DATA mask of the BPF program return value will be passed as the ptrace_message and may be retrieved using PTRACE_GETEVENTMSG. If the subordinate process is not using seccomp filter, then no system call notifications will occur even if the option is specified. If there is no tracer with PTRACE_O_TRACESECCOMP when SECCOMP_RET_TRACE is returned, the system call will not be executed and an -ENOSYS errno will be returned to userspace. This change adds a dependency on the system call slow path. Any future efforts to use the system call fast path for seccomp filter will need to address this restriction. Signed-off-by: Will Drewry Acked-by: Eric Paris v18: - rebase - comment fatal_signal check - acked-by - drop secure_computing_int comment v17: - ... v16: - update PT_TRACE_MASK to 0xbf4 so that STOP isn't clear on SETOPTIONS call (indan@nul.nu) [note PT_TRACE_MASK disappears in linux-next] v15: - add audit support for non-zero return codes - clean up style (indan@nul.nu) v14: - rebase/nochanges v13: - rebase on to 88ebdda6159ffc15699f204c33feb3e431bf9bdc (Brings back a change to ptrace.c and the masks.) v12: - rebase to linux-next - use ptrace_event and update arch/Kconfig to mention slow-path dependency - drop all tracehook changes and inclusion (oleg@redhat.com) v11: - invert the logic to just make it a PTRACE_SYSCALL accelerator (indan@nul.nu) v10: - moved to PTRACE_O_SECCOMP / PT_TRACE_SECCOMP v9: - n/a v8: - guarded PTRACE_SECCOMP use with an ifdef v7: - introduced Signed-off-by: James Morris --- kernel/seccomp.c | 16 ++++++++++++++++ 1 file changed, 16 insertions(+) (limited to 'kernel') diff --git a/kernel/seccomp.c b/kernel/seccomp.c index 9c3830692a08..d9db6ec46bc9 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -24,6 +24,7 @@ #ifdef CONFIG_SECCOMP_FILTER #include #include +#include #include #include #include @@ -408,6 +409,21 @@ int __secure_computing(int this_syscall) /* Let the filter pass back 16 bits of data. */ seccomp_send_sigsys(this_syscall, data); goto skip; + case SECCOMP_RET_TRACE: + /* Skip these calls if there is no tracer. */ + if (!ptrace_event_enabled(current, PTRACE_EVENT_SECCOMP)) + goto skip; + /* Allow the BPF to provide the event message */ + ptrace_event(PTRACE_EVENT_SECCOMP, data); + /* + * The delivery of a fatal signal during event + * notification may silently skip tracer notification. + * Terminating the task now avoids executing a system + * call that may not be intended. 
+ */ + if (fatal_signal_pending(current)) + break; + return 0; case SECCOMP_RET_ALLOW: return 0; case SECCOMP_RET_KILL: -- cgit v1.2.3 From 7396fa818d6278694a44840f389ddc40a3269a9a Mon Sep 17 00:00:00 2001 From: Srikar Dronamraju Date: Wed, 11 Apr 2012 16:05:16 +0530 Subject: uprobes/core: Make background page replacement logic account for rss_stat counters Background page replacement logic adds a new anonymous page instead of a file backed (while inserting a breakpoint) / anonymous page (while removing a breakpoint). Hence the uprobes logic should take care to update the task->ss_stat counters accordingly. This bug became apparent courtesy of commit c3f0327f8e9d ("mm: add rss counters consistency check"). Signed-off-by: Srikar Dronamraju Cc: Linus Torvalds Cc: Ananth N Mavinakayanahalli Cc: Jim Keniston Cc: Linux-mm Cc: Oleg Nesterov Cc: Andi Kleen Cc: Christoph Hellwig Cc: Steven Rostedt Cc: Arnaldo Carvalho de Melo Cc: Masami Hiramatsu Cc: Anton Arapov Cc: Peter Zijlstra Link: http://lkml.kernel.org/r/20120411103516.23245.2700.sendpatchset@srdronam.in.ibm.com Signed-off-by: Ingo Molnar --- kernel/events/uprobes.c | 5 +++++ 1 file changed, 5 insertions(+) (limited to 'kernel') diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 29e881b0137d..c5caeecea1dc 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -160,6 +160,11 @@ static int __replace_page(struct vm_area_struct *vma, struct page *page, struct get_page(kpage); page_add_new_anon_rmap(kpage, vma, addr); + if (!PageAnon(page)) { + dec_mm_counter(mm, MM_FILEPAGES); + inc_mm_counter(mm, MM_ANONPAGES); + } + flush_cache_page(vma, addr, pte_pfn(*ptep)); ptep_clear_flush(vma, addr, ptep); set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot)); -- cgit v1.2.3 From cbc91f71b51b8335f1fc7ccfca8011f31a717367 Mon Sep 17 00:00:00 2001 From: Srikar Dronamraju Date: Wed, 11 Apr 2012 16:05:27 +0530 Subject: uprobes/core: Decrement uprobe count before the pages are unmapped Uprobes has a callback (uprobe_munmap()) in the unmap path to maintain the uprobes count. In the exit path this callback gets called in unlink_file_vma(). However by the time unlink_file_vma() is called, the pages would have been unmapped (in unmap_vmas()) and the task->rss_stat counts accounted (in zap_pte_range()). If the exiting process has probepoints, uprobe_munmap() checks if the breakpoint instruction was around before decrementing the probe count. This results in a file backed page being reread by uprobe_munmap() and hence it does not find the breakpoint. This patch fixes this problem by moving the callback to unmap_single_vma(). Since unmap_single_vma() may not unmap the complete vma, add start and end parameters to uprobe_munmap(). This bug became apparent courtesy of commit c3f0327f8e9d ("mm: add rss counters consistency check"). 
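A hedged sketch of the relocated call site (simplified; the real hunk lives in mm/memory.c and is outside the kernel/-only diffstat shown here): the notification now covers only the range actually being torn down and runs before the PTEs are zapped, so the breakpoint byte can still be read back:

  #include <linux/mm.h>
  #include <linux/uprobes.h>

  static void unmap_single_vma_sketch(struct mmu_gather *tlb,
                                      struct vm_area_struct *vma,
                                      unsigned long start, unsigned long end)
  {
          start = max(vma->vm_start, start);
          end = min(vma->vm_end, end);
          if (start >= end)
                  return;

          if (vma->vm_file)
                  uprobe_munmap(vma, start, end); /* text pages still mapped here */

          /* ... unmap_page_range() and the rest of the teardown ... */
  }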
Signed-off-by: Srikar Dronamraju Cc: Linus Torvalds Cc: Ananth N Mavinakayanahalli Cc: Jim Keniston Cc: Linux-mm Cc: Oleg Nesterov Cc: Andi Kleen Cc: Christoph Hellwig Cc: Steven Rostedt Cc: Arnaldo Carvalho de Melo Cc: Masami Hiramatsu Cc: Anton Arapov Cc: Peter Zijlstra Link: http://lkml.kernel.org/r/20120411103527.23245.9835.sendpatchset@srdronam.in.ibm.com Signed-off-by: Ingo Molnar --- kernel/events/uprobes.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index c5caeecea1dc..985be4d80fe8 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -1112,7 +1112,7 @@ int uprobe_mmap(struct vm_area_struct *vma) /* * Called in context of a munmap of a vma. */ -void uprobe_munmap(struct vm_area_struct *vma) +void uprobe_munmap(struct vm_area_struct *vma, unsigned long start, unsigned long end) { struct list_head tmp_list; struct uprobe *uprobe, *u; @@ -1138,7 +1138,7 @@ void uprobe_munmap(struct vm_area_struct *vma) list_del(&uprobe->pending_list); vaddr = vma_address(vma, uprobe->offset); - if (vaddr >= vma->vm_start && vaddr < vma->vm_end) { + if (vaddr >= start && vaddr < end) { /* * An unregister could have removed the probe before * unmap. So check before we decrement the count. -- cgit v1.2.3 From f5b2552b4ebbeadcadde1532d7bbd3f850719046 Mon Sep 17 00:00:00 2001 From: Dan Carpenter Date: Fri, 13 Apr 2012 22:06:58 +0300 Subject: workqueue: change BUG_ON() to WARN_ON() This BUG_ON() can be triggered if you call schedule_work() before calling INIT_WORK(). It is a bug definitely, but it's nicer to just print a stack trace and return. Reported-by: Matt Renzelmann Signed-off-by: Dan Carpenter Signed-off-by: Tejun Heo --- kernel/workqueue.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 5abf42f63c08..66ec08de6dac 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -1032,7 +1032,10 @@ static void __queue_work(unsigned int cpu, struct workqueue_struct *wq, cwq = get_cwq(gcwq->cpu, wq); trace_workqueue_queue_work(cpu, cwq, work); - BUG_ON(!list_empty(&work->entry)); + if (WARN_ON(!list_empty(&work->entry))) { + spin_unlock_irqrestore(&gcwq->lock, flags); + return; + } cwq->nr_in_flight[cwq->work_color]++; work_flags = work_color_to_flags(cwq->work_color); -- cgit v1.2.3 From 8156b451f37898d3c3652b4e988a4d62ae16eaac Mon Sep 17 00:00:00 2001 From: Will Drewry Date: Tue, 17 Apr 2012 14:48:58 -0500 Subject: seccomp: fix build warnings when there is no CONFIG_SECCOMP_FILTER If both audit and seccomp filter support are disabled, 'ret' is marked as unused. If just seccomp filter support is disabled, data and skip are considered unused. This change fixes those build warnings. 
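The fix below follows a common C idiom: a variable that only one case needs gets a brace-scoped block inside that case, and a label that only conditionally compiled code jumps to is guarded by the same #ifdef as its goto. A stand-alone illustration of the idiom, using an invented CONFIG_FOO rather than the seccomp code itself:

#include <stdio.h>

/* Build with -DCONFIG_FOO to enable the optional case. */

static int handle(int action)
{
	int ret;

	switch (action) {
	case 0:
		ret = -1;
		break;
#ifdef CONFIG_FOO
	case 1: {
		/* 'data' only exists when the case that uses it does,
		 * so there is no unused-variable warning otherwise. */
		int data = action * 2;

		if (data > 1)
			goto skip;
		ret = data;
		break;
	}
#endif
	default:
		ret = 0;
		break;
	}

	return ret;
#ifdef CONFIG_FOO
skip:
	/* Label compiled out together with its only user. */
	return -2;
#endif
}

int main(void)
{
	printf("handle(0) = %d\n", handle(0));
	return 0;
}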
Reported-by: Stephen Rothwell Signed-off-by: Will Drewry Acked-by: Kees Cook Signed-off-by: James Morris --- kernel/seccomp.c | 13 +++++++++---- 1 file changed, 9 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/seccomp.c b/kernel/seccomp.c index d9db6ec46bc9..ee376beedaf9 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -377,8 +377,7 @@ int __secure_computing(int this_syscall) int mode = current->seccomp.mode; int exit_sig = 0; int *syscall; - u32 ret = SECCOMP_RET_KILL; - int data; + u32 ret; switch (mode) { case SECCOMP_MODE_STRICT: @@ -392,12 +391,15 @@ int __secure_computing(int this_syscall) return 0; } while (*++syscall); exit_sig = SIGKILL; + ret = SECCOMP_RET_KILL; break; #ifdef CONFIG_SECCOMP_FILTER - case SECCOMP_MODE_FILTER: + case SECCOMP_MODE_FILTER: { + int data; ret = seccomp_run_filters(this_syscall); data = ret & SECCOMP_RET_DATA; - switch (ret & SECCOMP_RET_ACTION) { + ret &= SECCOMP_RET_ACTION; + switch (ret) { case SECCOMP_RET_ERRNO: /* Set the low-order 16-bits as a errno. */ syscall_set_return_value(current, task_pt_regs(current), @@ -432,6 +434,7 @@ int __secure_computing(int this_syscall) } exit_sig = SIGSYS; break; + } #endif default: BUG(); @@ -442,8 +445,10 @@ int __secure_computing(int this_syscall) #endif audit_seccomp(this_syscall, exit_sig, ret); do_exit(exit_sig); +#ifdef CONFIG_SECCOMP_FILTER skip: audit_seccomp(this_syscall, exit_sig, ret); +#endif return -1; } -- cgit v1.2.3 From 59bf896406471ac49d124b3e5f4edcafe28e5360 Mon Sep 17 00:00:00 2001 From: Masanari Iida Date: Wed, 18 Apr 2012 00:01:21 +0900 Subject: Fix "the the" in various Kconfig Fix typo "the the" in various Kconfig. Signed-off-by: Masanari Iida Signed-off-by: Jiri Kosina --- kernel/trace/Kconfig | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig index a1d2849f2473..33f768b779c9 100644 --- a/kernel/trace/Kconfig +++ b/kernel/trace/Kconfig @@ -272,7 +272,7 @@ config PROFILE_ANNOTATED_BRANCHES bool "Trace likely/unlikely profiler" select TRACE_BRANCH_PROFILING help - This tracer profiles all the the likely and unlikely macros + This tracer profiles all likely and unlikely macros in the kernel. It will display the results in: /sys/kernel/debug/tracing/trace_stat/branch_annotated -- cgit v1.2.3 From 1c6c69525b40eb76de8adf039409722015927dc3 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Thu, 19 Apr 2012 10:35:17 +0200 Subject: genirq: Reject bogus threaded irq requests Requesting a threaded interrupt without a primary handler and without IRQF_ONESHOT set is dangerous. The core will use the default primary handler for it, which merely wakes the thread. For a level type interrupt this results in an interrupt storm, because the interrupt line is reenabled after the primary handler runs. The device still has the line asserted, which brings us back into the primary handler. While this works for edge type interrupts, we play it safe and reject unconditionally because we can't say for sure which type this interrupt really has. The type flags are unreliable as the underlying chip implementation can override them. And we cannot assume that developers using that interface know what they are doing.
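For driver code that now gets -EINVAL, the accepted pattern is to pass IRQF_ONESHOT whenever the primary handler is NULL. A minimal sketch (the irq number, name and handler body are placeholders, not taken from this patch):

static irqreturn_t my_dev_irq_thread(int irq, void *dev_id)
{
	/* Sleepable work goes here. With IRQF_ONESHOT the line stays
	 * masked until this function returns, so a level-triggered
	 * interrupt cannot storm while the thread runs. */
	return IRQ_HANDLED;
}

static int my_dev_request_irq(unsigned int irq, void *dev)
{
	/* NULL primary handler: the core installs its default one,
	 * which merely wakes the thread; IRQF_ONESHOT is mandatory. */
	return request_threaded_irq(irq, NULL, my_dev_irq_thread,
				    IRQF_ONESHOT, "my_dev", dev);
}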
Signed-off-by: Thomas Gleixner --- kernel/irq/manage.c | 21 +++++++++++++++++++++ 1 file changed, 21 insertions(+) (limited to 'kernel') diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c index 89a3ea82569b..9a35ace38bb1 100644 --- a/kernel/irq/manage.c +++ b/kernel/irq/manage.c @@ -1031,6 +1031,27 @@ __setup_irq(unsigned int irq, struct irq_desc *desc, struct irqaction *new) * all existing action->thread_mask bits. */ new->thread_mask = 1 << ffz(thread_mask); + + } else if (new->handler == irq_default_primary_handler) { + /* + * The interrupt was requested with handler = NULL, so + * we use the default primary handler for it. But it + * does not have the oneshot flag set. In combination + * with level interrupts this is deadly, because the + * default primary handler just wakes the thread, then + * the irq lines is reenabled, but the device still + * has the level irq asserted. Rinse and repeat.... + * + * While this works for edge type interrupts, we play + * it safe and reject unconditionally because we can't + * say for sure which type this interrupt really + * has. The type flags are unreliable as the + * underlying chip implementation can override them. + */ + pr_err("genirq: Threaded irq requested with handler=NULL and !ONESHOT for irq %d\n", + irq); + ret = -EINVAL; + goto out_mask; } if (!shared) { -- cgit v1.2.3 From f5d89470f91f2e67eeaf350c730ae8412c3a98e3 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Thu, 19 Apr 2012 12:06:13 +0200 Subject: genirq: Be more informative on irq type mismatch We require that shared interrupts agree on a few flag settings. Right now we silently return with an error code without giving any hint why we reject it. Make the printout unconditionally and actually useful by printing the flags of the new and the already registered action. Convert all printks to pr_* and use a proper prefix while at it. Signed-off-by: Thomas Gleixner --- kernel/irq/manage.c | 25 ++++++++++--------------- 1 file changed, 10 insertions(+), 15 deletions(-) (limited to 'kernel') diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c index 9a35ace38bb1..585f6381f8e4 100644 --- a/kernel/irq/manage.c +++ b/kernel/irq/manage.c @@ -565,8 +565,8 @@ int __irq_set_trigger(struct irq_desc *desc, unsigned int irq, * IRQF_TRIGGER_* but the PIC does not support multiple * flow-types? */ - pr_debug("No set_type function for IRQ %d (%s)\n", irq, - chip ? (chip->name ? : "unknown") : "unknown"); + pr_debug("genirq: No set_type function for IRQ %d (%s)\n", irq, + chip ? (chip->name ? : "unknown") : "unknown"); return 0; } @@ -600,7 +600,7 @@ int __irq_set_trigger(struct irq_desc *desc, unsigned int irq, ret = 0; break; default: - pr_err("setting trigger mode %lu for irq %u failed (%pF)\n", + pr_err("genirq: Setting trigger mode %lu for irq %u failed (%pF)\n", flags, irq, chip->irq_set_type); } if (unmask) @@ -837,8 +837,7 @@ void exit_irq_thread(void) action = kthread_data(tsk); - printk(KERN_ERR - "exiting task \"%s\" (%d) is an active IRQ thread (irq %d)\n", + pr_err("genirq: exiting task \"%s\" (%d) is an active IRQ thread (irq %d)\n", tsk->comm ? 
tsk->comm : "", tsk->pid, action->irq); desc = irq_to_desc(action->irq); @@ -878,7 +877,6 @@ static int __setup_irq(unsigned int irq, struct irq_desc *desc, struct irqaction *new) { struct irqaction *old, **old_ptr; - const char *old_name = NULL; unsigned long flags, thread_mask = 0; int ret, nested, shared = 0; cpumask_var_t mask; @@ -972,10 +970,8 @@ __setup_irq(unsigned int irq, struct irq_desc *desc, struct irqaction *new) */ if (!((old->flags & new->flags) & IRQF_SHARED) || ((old->flags ^ new->flags) & IRQF_TRIGGER_MASK) || - ((old->flags ^ new->flags) & IRQF_ONESHOT)) { - old_name = old->name; + ((old->flags ^ new->flags) & IRQF_ONESHOT)) goto mismatch; - } /* All handlers must agree on per-cpuness */ if ((old->flags & IRQF_PERCPU) != @@ -1099,7 +1095,7 @@ __setup_irq(unsigned int irq, struct irq_desc *desc, struct irqaction *new) if (nmsk != omsk) /* hope the handler works with current trigger mode */ - pr_warning("IRQ %d uses trigger mode %u; requested %u\n", + pr_warning("genirq: irq %d uses trigger mode %u; requested %u\n", irq, nmsk, omsk); } @@ -1136,14 +1132,13 @@ __setup_irq(unsigned int irq, struct irq_desc *desc, struct irqaction *new) return 0; mismatch: -#ifdef CONFIG_DEBUG_SHIRQ if (!(new->flags & IRQF_PROBE_SHARED)) { - printk(KERN_ERR "IRQ handler type mismatch for IRQ %d\n", irq); - if (old_name) - printk(KERN_ERR "current handler: %s\n", old_name); + pr_err("genirq: Flags mismatch irq %d. %08x (%s) vs. %08x (%s)\n", + irq, new->flags, new->name, old->flags, old->name); +#ifdef CONFIG_DEBUG_SHIRQ dump_stack(); - } #endif + } ret = -EBUSY; out_mask: -- cgit v1.2.3 From d219e2e86a407035303b987e4184ca0b1de53257 Mon Sep 17 00:00:00 2001 From: David Daney Date: Thu, 19 Apr 2012 14:59:56 -0700 Subject: extable: Skip sorting if sorted at build time. If the build program sortextable has already sorted the exception table, don't sort it again. Signed-off-by: David Daney Link: http://lkml.kernel.org/r/1334872799-14589-3-git-send-email-ddaney.cavm@gmail.com Signed-off-by: H. Peter Anvin --- kernel/extable.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/extable.c b/kernel/extable.c index 5339705b8241..fe35a634bf76 100644 --- a/kernel/extable.c +++ b/kernel/extable.c @@ -35,10 +35,16 @@ DEFINE_MUTEX(text_mutex); extern struct exception_table_entry __start___ex_table[]; extern struct exception_table_entry __stop___ex_table[]; +/* Cleared by build time tools if the table is already sorted. */ +u32 __initdata main_extable_sort_needed = 1; + /* Sort the kernel's built-in exception table */ void __init sort_main_extable(void) { - sort_extable(__start___ex_table, __stop___ex_table); + if (main_extable_sort_needed) + sort_extable(__start___ex_table, __stop___ex_table); + else + pr_notice("__ex_table already sorted, skipping sort\n"); } /* Given an address, look for it in the exception tables. */ -- cgit v1.2.3 From 57c498fa5df05e4bd410c101795368439fec7b4e Mon Sep 17 00:00:00 2001 From: John Stultz Date: Fri, 20 Apr 2012 12:31:45 -0700 Subject: alarmtimer: Provide accessor to alarmtimer rtc device The Android alarm interface provides a settime call that sets both the alarmtimer RTC device and CLOCK_REALTIME to the same value. Since there may be multiple rtc devices, provide a hook to access the one the alarmtimer infrastructure is using. 
CC: Colin Cross CC: Thomas Gleixner CC: Android Kernel Team Signed-off-by: John Stultz Signed-off-by: Greg Kroah-Hartman --- kernel/time/alarmtimer.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/time/alarmtimer.c b/kernel/time/alarmtimer.c index 8a538c55fc7b..aa27d391bfc8 100644 --- a/kernel/time/alarmtimer.c +++ b/kernel/time/alarmtimer.c @@ -59,7 +59,7 @@ static DEFINE_SPINLOCK(rtcdev_lock); * If one has not already been chosen, it checks to see if a * functional rtc device is available. */ -static struct rtc_device *alarmtimer_get_rtcdev(void) +struct rtc_device *alarmtimer_get_rtcdev(void) { unsigned long flags; struct rtc_device *ret; @@ -115,7 +115,7 @@ static void alarmtimer_rtc_interface_remove(void) class_interface_unregister(&alarmtimer_rtc_interface); } #else -static inline struct rtc_device *alarmtimer_get_rtcdev(void) +struct rtc_device *alarmtimer_get_rtcdev(void) { return NULL; } -- cgit v1.2.3 From c4c27fbdda4e8ba87806c415b6d15266b07bce4b Mon Sep 17 00:00:00 2001 From: Mike Galbraith Date: Sat, 21 Apr 2012 09:13:46 +0200 Subject: cgroups: disallow attaching kthreadd or PF_THREAD_BOUND threads Allowing kthreadd to be moved to a non-root group makes no sense, it being a global resource, and needlessly leads unsuspecting users toward trouble. 1. An RT workqueue worker thread spawned in a task group with no rt_runtime allocated is not schedulable. Simple user error, but harmful to the box. 2. A worker thread which acquires PF_THREAD_BOUND can never leave a cpuset, rendering the cpuset immortal. Save the user some unexpected trouble, just say no. Signed-off-by: Mike Galbraith Acked-by: Peter Zijlstra Acked-by: Thomas Gleixner Acked-by: Li Zefan Signed-off-by: Tejun Heo --- kernel/cgroup.c | 13 +++++++++++++ 1 file changed, 13 insertions(+) (limited to 'kernel') diff --git a/kernel/cgroup.c b/kernel/cgroup.c index b2f203f25ec8..ad8eae5bb801 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -60,6 +60,7 @@ #include #include #include /* used in cgroup_attach_proc */ +#include #include @@ -2225,6 +2226,18 @@ retry_find_task: if (threadgroup) tsk = tsk->group_leader; + + /* + * Workqueue threads may acquire PF_THREAD_BOUND and become + * trapped in a cpuset, or RT worker may be born in a cgroup + * with no rt_runtime allocated. Just say no. + */ + if (tsk == kthreadd_task || (tsk->flags & PF_THREAD_BOUND)) { + ret = -EINVAL; + rcu_read_unlock(); + goto out_unlock_cgroup; + } + get_task_struct(tsk); rcu_read_unlock(); -- cgit v1.2.3 From 0976dfc1d0cd80a4e9dfaf87bd8744612bde475a Mon Sep 17 00:00:00 2001 From: Stephen Boyd Date: Fri, 20 Apr 2012 17:28:50 -0700 Subject: workqueue: Catch more locking problems with flush_work() If a workqueue is flushed with flush_work() lockdep checking can be circumvented. For example: static DEFINE_MUTEX(mutex); static void my_work(struct work_struct *w) { mutex_lock(&mutex); mutex_unlock(&mutex); } static DECLARE_WORK(work, my_work); static int __init start_test_module(void) { schedule_work(&work); return 0; } module_init(start_test_module); static void __exit stop_test_module(void) { mutex_lock(&mutex); flush_work(&work); mutex_unlock(&mutex); } module_exit(stop_test_module); would not always print a warning when flush_work() was called. 
In this trivial example nothing could go wrong since we are guaranteed module_init() and module_exit() don't run concurrently, but if the work item is scheduled asynchronously we could have a scenario where the work item is running just at the time flush_work() is called, resulting in a classic ABBA locking problem. Add a lockdep hint by acquiring and releasing the work item lockdep_map in flush_work() so that we always catch this potential deadlock scenario. Signed-off-by: Stephen Boyd Reviewed-by: Yong Zhang Signed-off-by: Tejun Heo --- kernel/workqueue.c | 3 +++ 1 file changed, 3 insertions(+) (limited to 'kernel') diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 66ec08de6dac..211eadb23323 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -2509,6 +2509,9 @@ bool flush_work(struct work_struct *work) { struct wq_barrier barr; + lock_map_acquire(&work->lockdep_map); + lock_map_release(&work->lockdep_map); + if (start_flush_work(work, &barr, true)) { wait_for_completion(&barr.done); destroy_work_on_stack(&barr.work); -- cgit v1.2.3 From 07d777fe8c3985bc83428c2866713c2d1b3d4129 Mon Sep 17 00:00:00 2001 From: Steven Rostedt Date: Thu, 22 Sep 2011 14:01:55 -0400 Subject: tracing: Add percpu buffers for trace_printk() Currently, trace_printk() writes into a single buffer to calculate the size and format needed to save the trace. To do this safely in an SMP environment, a spin_lock() is taken to only allow one writer at a time to the buffer. But this could also affect what is being traced, and add synchronization that would not be there otherwise. Ideally, using percpu buffers would be useful, but since trace_printk() is only used in development, having per cpu buffers for something never used is a waste of space. Thus, the trace_bprintk() format section is changed to hold static fmts as well as dynamic ones. Then at boot up, we can check if the section that holds the trace_printk formats is non-empty, and if it does contain something, then we know a trace_printk() has been added to the kernel. At this time the trace_printk per cpu buffers are allocated. A check is also done at module load time in case a module is added that contains a trace_printk(). Once the buffers are allocated, they are never freed. If you use a trace_printk() then you should know what you are doing. A buffer is made for each type of context: normal, softirq, irq and nmi. The context is checked and the appropriate buffer is used. This allows for totally lockless usage of trace_printk(), and it no longer even disables interrupts. Requested-by: Peter Zijlstra Signed-off-by: Steven Rostedt --- kernel/trace/trace.c | 184 ++++++++++++++++++++++++++++++++------------ kernel/trace/trace.h | 2 + kernel/trace/trace_printk.c | 4 + 3 files changed, 141 insertions(+), 49 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index ed7b5d1e12f4..1ab8e35d069b 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -1498,25 +1498,119 @@ static void __trace_userstack(struct trace_array *tr, unsigned long flags) #endif /* CONFIG_STACKTRACE */ +/* created for use with alloc_percpu */ +struct trace_buffer_struct { + char buffer[TRACE_BUF_SIZE]; +}; + +static struct trace_buffer_struct *trace_percpu_buffer; +static struct trace_buffer_struct *trace_percpu_sirq_buffer; +static struct trace_buffer_struct *trace_percpu_irq_buffer; +static struct trace_buffer_struct *trace_percpu_nmi_buffer; + +/* + * The buffer used is dependent on the context.
There is a per cpu + * buffer for normal context, softirq contex, hard irq context and + * for NMI context. Thise allows for lockless recording. + * + * Note, if the buffers failed to be allocated, then this returns NULL + */ +static char *get_trace_buf(void) +{ + struct trace_buffer_struct *percpu_buffer; + struct trace_buffer_struct *buffer; + + /* + * If we have allocated per cpu buffers, then we do not + * need to do any locking. + */ + if (in_nmi()) + percpu_buffer = trace_percpu_nmi_buffer; + else if (in_irq()) + percpu_buffer = trace_percpu_irq_buffer; + else if (in_softirq()) + percpu_buffer = trace_percpu_sirq_buffer; + else + percpu_buffer = trace_percpu_buffer; + + if (!percpu_buffer) + return NULL; + + buffer = per_cpu_ptr(percpu_buffer, smp_processor_id()); + + return buffer->buffer; +} + +static int alloc_percpu_trace_buffer(void) +{ + struct trace_buffer_struct *buffers; + struct trace_buffer_struct *sirq_buffers; + struct trace_buffer_struct *irq_buffers; + struct trace_buffer_struct *nmi_buffers; + + buffers = alloc_percpu(struct trace_buffer_struct); + if (!buffers) + goto err_warn; + + sirq_buffers = alloc_percpu(struct trace_buffer_struct); + if (!sirq_buffers) + goto err_sirq; + + irq_buffers = alloc_percpu(struct trace_buffer_struct); + if (!irq_buffers) + goto err_irq; + + nmi_buffers = alloc_percpu(struct trace_buffer_struct); + if (!nmi_buffers) + goto err_nmi; + + trace_percpu_buffer = buffers; + trace_percpu_sirq_buffer = sirq_buffers; + trace_percpu_irq_buffer = irq_buffers; + trace_percpu_nmi_buffer = nmi_buffers; + + return 0; + + err_nmi: + free_percpu(irq_buffers); + err_irq: + free_percpu(sirq_buffers); + err_sirq: + free_percpu(buffers); + err_warn: + WARN(1, "Could not allocate percpu trace_printk buffer"); + return -ENOMEM; +} + +void trace_printk_init_buffers(void) +{ + static int buffers_allocated; + + if (buffers_allocated) + return; + + if (alloc_percpu_trace_buffer()) + return; + + pr_info("ftrace: Allocated trace_printk buffers\n"); + + buffers_allocated = 1; +} + /** * trace_vbprintk - write binary msg to tracing buffer * */ int trace_vbprintk(unsigned long ip, const char *fmt, va_list args) { - static arch_spinlock_t trace_buf_lock = - (arch_spinlock_t)__ARCH_SPIN_LOCK_UNLOCKED; - static u32 trace_buf[TRACE_BUF_SIZE]; - struct ftrace_event_call *call = &event_bprint; struct ring_buffer_event *event; struct ring_buffer *buffer; struct trace_array *tr = &global_trace; - struct trace_array_cpu *data; struct bprint_entry *entry; unsigned long flags; - int disable; - int cpu, len = 0, size, pc; + char *tbuffer; + int len = 0, size, pc; if (unlikely(tracing_selftest_running || tracing_disabled)) return 0; @@ -1526,43 +1620,36 @@ int trace_vbprintk(unsigned long ip, const char *fmt, va_list args) pc = preempt_count(); preempt_disable_notrace(); - cpu = raw_smp_processor_id(); - data = tr->data[cpu]; - disable = atomic_inc_return(&data->disabled); - if (unlikely(disable != 1)) + tbuffer = get_trace_buf(); + if (!tbuffer) { + len = 0; goto out; + } - /* Lockdep uses trace_printk for lock tracing */ - local_irq_save(flags); - arch_spin_lock(&trace_buf_lock); - len = vbin_printf(trace_buf, TRACE_BUF_SIZE, fmt, args); + len = vbin_printf((u32 *)tbuffer, TRACE_BUF_SIZE/sizeof(int), fmt, args); - if (len > TRACE_BUF_SIZE || len < 0) - goto out_unlock; + if (len > TRACE_BUF_SIZE/sizeof(int) || len < 0) + goto out; + local_save_flags(flags); size = sizeof(*entry) + sizeof(u32) * len; buffer = tr->buffer; event = trace_buffer_lock_reserve(buffer, TRACE_BPRINT, 
size, flags, pc); if (!event) - goto out_unlock; + goto out; entry = ring_buffer_event_data(event); entry->ip = ip; entry->fmt = fmt; - memcpy(entry->buf, trace_buf, sizeof(u32) * len); + memcpy(entry->buf, tbuffer, sizeof(u32) * len); if (!filter_check_discard(call, entry, buffer, event)) { ring_buffer_unlock_commit(buffer, event); ftrace_trace_stack(buffer, flags, 6, pc); } -out_unlock: - arch_spin_unlock(&trace_buf_lock); - local_irq_restore(flags); - out: - atomic_dec_return(&data->disabled); preempt_enable_notrace(); unpause_graph_tracing(); @@ -1588,58 +1675,53 @@ int trace_array_printk(struct trace_array *tr, int trace_array_vprintk(struct trace_array *tr, unsigned long ip, const char *fmt, va_list args) { - static arch_spinlock_t trace_buf_lock = __ARCH_SPIN_LOCK_UNLOCKED; - static char trace_buf[TRACE_BUF_SIZE]; - struct ftrace_event_call *call = &event_print; struct ring_buffer_event *event; struct ring_buffer *buffer; - struct trace_array_cpu *data; - int cpu, len = 0, size, pc; + int len = 0, size, pc; struct print_entry *entry; - unsigned long irq_flags; - int disable; + unsigned long flags; + char *tbuffer; if (tracing_disabled || tracing_selftest_running) return 0; + /* Don't pollute graph traces with trace_vprintk internals */ + pause_graph_tracing(); + pc = preempt_count(); preempt_disable_notrace(); - cpu = raw_smp_processor_id(); - data = tr->data[cpu]; - disable = atomic_inc_return(&data->disabled); - if (unlikely(disable != 1)) + + tbuffer = get_trace_buf(); + if (!tbuffer) { + len = 0; goto out; + } - pause_graph_tracing(); - raw_local_irq_save(irq_flags); - arch_spin_lock(&trace_buf_lock); - len = vsnprintf(trace_buf, TRACE_BUF_SIZE, fmt, args); + len = vsnprintf(tbuffer, TRACE_BUF_SIZE, fmt, args); + if (len > TRACE_BUF_SIZE) + goto out; + local_save_flags(flags); size = sizeof(*entry) + len + 1; buffer = tr->buffer; event = trace_buffer_lock_reserve(buffer, TRACE_PRINT, size, - irq_flags, pc); + flags, pc); if (!event) - goto out_unlock; + goto out; entry = ring_buffer_event_data(event); entry->ip = ip; - memcpy(&entry->buf, trace_buf, len); + memcpy(&entry->buf, tbuffer, len); entry->buf[len] = '\0'; if (!filter_check_discard(call, entry, buffer, event)) { ring_buffer_unlock_commit(buffer, event); - ftrace_trace_stack(buffer, irq_flags, 6, pc); + ftrace_trace_stack(buffer, flags, 6, pc); } - - out_unlock: - arch_spin_unlock(&trace_buf_lock); - raw_local_irq_restore(irq_flags); - unpause_graph_tracing(); out: - atomic_dec_return(&data->disabled); preempt_enable_notrace(); + unpause_graph_tracing(); return len; } @@ -4955,6 +5037,10 @@ __init static int tracer_alloc_buffers(void) if (!alloc_cpumask_var(&tracing_cpumask, GFP_KERNEL)) goto out_free_buffer_mask; + /* Only allocate trace_printk buffers if a trace_printk exists */ + if (__stop___trace_bprintk_fmt != __start___trace_bprintk_fmt) + trace_printk_init_buffers(); + /* To save memory, keep the ring buffer size to its minimum */ if (ring_buffer_expanded) ring_buf_size = trace_buf_size; diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h index 95059f091a24..f9d85504f04b 100644 --- a/kernel/trace/trace.h +++ b/kernel/trace/trace.h @@ -826,6 +826,8 @@ extern struct list_head ftrace_events; extern const char *__start___trace_bprintk_fmt[]; extern const char *__stop___trace_bprintk_fmt[]; +void trace_printk_init_buffers(void); + #undef FTRACE_ENTRY #define FTRACE_ENTRY(call, struct_name, id, tstruct, print, filter) \ extern struct ftrace_event_call \ diff --git a/kernel/trace/trace_printk.c 
b/kernel/trace/trace_printk.c index 6fd4ffd042f9..a9077c1b4ad3 100644 --- a/kernel/trace/trace_printk.c +++ b/kernel/trace/trace_printk.c @@ -51,6 +51,10 @@ void hold_module_trace_bprintk_format(const char **start, const char **end) const char **iter; char *fmt; + /* allocate the trace_printk per cpu buffers */ + if (start != end) + trace_printk_init_buffers(); + mutex_lock(&btrace_mutex); for (iter = start; iter < end; iter++) { struct trace_bprintk_fmt *tb_fmt = lookup_format(*iter); -- cgit v1.2.3 From 5a26c8f0cf1e95106858bb4e23ca6dd14c9b842f Mon Sep 17 00:00:00 2001 From: Dan Carpenter Date: Fri, 20 Apr 2012 09:31:45 +0300 Subject: tracing: Remove an unneeded check in trace_seq_buffer() memcpy() returns a pointer to "buf". Hopefully, it's not NULL here or we would already have Oopsed. Link: http://lkml.kernel.org/r/20120420063145.GA22649@elgon.mountain Cc: Frederic Weisbecker Cc: Eduard - Gabriel Munteanu Signed-off-by: Dan Carpenter Signed-off-by: Steven Rostedt --- kernel/trace/trace.c | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 1ab8e35d069b..bbcde546f9f7 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -629,7 +629,6 @@ ssize_t trace_seq_to_user(struct trace_seq *s, char __user *ubuf, size_t cnt) static ssize_t trace_seq_to_buffer(struct trace_seq *s, void *buf, size_t cnt) { int len; - void *ret; if (s->len <= s->readpos) return -EBUSY; @@ -637,9 +636,7 @@ static ssize_t trace_seq_to_buffer(struct trace_seq *s, void *buf, size_t cnt) len = s->len - s->readpos; if (cnt > len) cnt = len; - ret = memcpy(buf, s->buffer + s->readpos, cnt); - if (!ret) - return -EFAULT; + memcpy(buf, s->buffer + s->readpos, cnt); s->readpos += cnt; return cnt; -- cgit v1.2.3 From 438ced1720b584000a9e8a4349d1f6bb7ee3ad6d Mon Sep 17 00:00:00 2001 From: Vaibhav Nagarnaik Date: Thu, 2 Feb 2012 12:00:41 -0800 Subject: ring-buffer: Add per_cpu ring buffer control files Add a debugfs entry called buffer_size_kb under the per_cpu/ folder for each cpu to control the ring buffer size for each CPU independently. If the global file buffer_size_kb is used to set size, the individual ring buffers will be adjusted to the given size. The global buffer_size_kb will report the common size to maintain backward compatibility. If the buffer_size_kb file under the per_cpu/ directory is used to change buffer size for a specific CPU, only the size of the respective ring buffer is updated. When tracing/buffer_size_kb is read, it reports 'X' to indicate that the sizes of the per_cpu ring buffers are not equivalent.
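With the reworked ring_buffer_resize() below taking a cpu id, callers pick between a global and a per-cpu resize. A minimal sketch of the two call shapes (the sizes and the wrapper function are placeholders; the new signature and RING_BUFFER_ALL_CPUS come from this patch):

static int resize_example(struct trace_array *tr)
{
	int ret;

	/* Resize every per-cpu buffer, as writes to the top-level
	 * buffer_size_kb file do. */
	ret = ring_buffer_resize(tr->buffer, 1024 * 1024,
				 RING_BUFFER_ALL_CPUS);
	if (ret < 0)
		return ret;

	/* Resize only CPU 1; the top-level buffer_size_kb then reads
	 * back "X" because the per-cpu sizes no longer match. */
	return ring_buffer_resize(tr->buffer, 2 * 1024 * 1024, 1);
}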
Link: http://lkml.kernel.org/r/1328212844-11889-1-git-send-email-vnagarnaik@google.com Cc: Frederic Weisbecker Cc: Michael Rubin Cc: David Sharp Cc: Justin Teravest Signed-off-by: Vaibhav Nagarnaik Signed-off-by: Steven Rostedt --- kernel/trace/ring_buffer.c | 248 +++++++++++++++++++++++++-------------------- kernel/trace/trace.c | 190 +++++++++++++++++++++++++++------- kernel/trace/trace.h | 2 +- 3 files changed, 293 insertions(+), 147 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index cf8d11e91efd..2d5eb3320827 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -449,6 +449,7 @@ struct ring_buffer_per_cpu { raw_spinlock_t reader_lock; /* serialize readers */ arch_spinlock_t lock; struct lock_class_key lock_key; + unsigned int nr_pages; struct list_head *pages; struct buffer_page *head_page; /* read from head */ struct buffer_page *tail_page; /* write to tail */ @@ -466,10 +467,12 @@ struct ring_buffer_per_cpu { unsigned long read_bytes; u64 write_stamp; u64 read_stamp; + /* ring buffer pages to update, > 0 to add, < 0 to remove */ + int nr_pages_to_update; + struct list_head new_pages; /* new pages to add */ }; struct ring_buffer { - unsigned pages; unsigned flags; int cpus; atomic_t record_disabled; @@ -963,14 +966,10 @@ static int rb_check_pages(struct ring_buffer_per_cpu *cpu_buffer) return 0; } -static int rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer, - unsigned nr_pages) +static int __rb_allocate_pages(int nr_pages, struct list_head *pages, int cpu) { + int i; struct buffer_page *bpage, *tmp; - LIST_HEAD(pages); - unsigned i; - - WARN_ON(!nr_pages); for (i = 0; i < nr_pages; i++) { struct page *page; @@ -981,15 +980,13 @@ static int rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer, */ bpage = kzalloc_node(ALIGN(sizeof(*bpage), cache_line_size()), GFP_KERNEL | __GFP_NORETRY, - cpu_to_node(cpu_buffer->cpu)); + cpu_to_node(cpu)); if (!bpage) goto free_pages; - rb_check_bpage(cpu_buffer, bpage); + list_add(&bpage->list, pages); - list_add(&bpage->list, &pages); - - page = alloc_pages_node(cpu_to_node(cpu_buffer->cpu), + page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL | __GFP_NORETRY, 0); if (!page) goto free_pages; @@ -997,6 +994,27 @@ static int rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer, rb_init_page(bpage->page); } + return 0; + +free_pages: + list_for_each_entry_safe(bpage, tmp, pages, list) { + list_del_init(&bpage->list); + free_buffer_page(bpage); + } + + return -ENOMEM; +} + +static int rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer, + unsigned nr_pages) +{ + LIST_HEAD(pages); + + WARN_ON(!nr_pages); + + if (__rb_allocate_pages(nr_pages, &pages, cpu_buffer->cpu)) + return -ENOMEM; + /* * The ring buffer page list is a circular list that does not * start and end with a list head. 
All page list items point to @@ -1005,20 +1023,15 @@ static int rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer, cpu_buffer->pages = pages.next; list_del(&pages); + cpu_buffer->nr_pages = nr_pages; + rb_check_pages(cpu_buffer); return 0; - - free_pages: - list_for_each_entry_safe(bpage, tmp, &pages, list) { - list_del_init(&bpage->list); - free_buffer_page(bpage); - } - return -ENOMEM; } static struct ring_buffer_per_cpu * -rb_allocate_cpu_buffer(struct ring_buffer *buffer, int cpu) +rb_allocate_cpu_buffer(struct ring_buffer *buffer, int nr_pages, int cpu) { struct ring_buffer_per_cpu *cpu_buffer; struct buffer_page *bpage; @@ -1052,7 +1065,7 @@ rb_allocate_cpu_buffer(struct ring_buffer *buffer, int cpu) INIT_LIST_HEAD(&cpu_buffer->reader_page->list); - ret = rb_allocate_pages(cpu_buffer, buffer->pages); + ret = rb_allocate_pages(cpu_buffer, nr_pages); if (ret < 0) goto fail_free_reader; @@ -1113,7 +1126,7 @@ struct ring_buffer *__ring_buffer_alloc(unsigned long size, unsigned flags, { struct ring_buffer *buffer; int bsize; - int cpu; + int cpu, nr_pages; /* keep it in its own cache line */ buffer = kzalloc(ALIGN(sizeof(*buffer), cache_line_size()), @@ -1124,14 +1137,14 @@ struct ring_buffer *__ring_buffer_alloc(unsigned long size, unsigned flags, if (!alloc_cpumask_var(&buffer->cpumask, GFP_KERNEL)) goto fail_free_buffer; - buffer->pages = DIV_ROUND_UP(size, BUF_PAGE_SIZE); + nr_pages = DIV_ROUND_UP(size, BUF_PAGE_SIZE); buffer->flags = flags; buffer->clock = trace_clock_local; buffer->reader_lock_key = key; /* need at least two pages */ - if (buffer->pages < 2) - buffer->pages = 2; + if (nr_pages < 2) + nr_pages = 2; /* * In case of non-hotplug cpu, if the ring-buffer is allocated @@ -1154,7 +1167,7 @@ struct ring_buffer *__ring_buffer_alloc(unsigned long size, unsigned flags, for_each_buffer_cpu(buffer, cpu) { buffer->buffers[cpu] = - rb_allocate_cpu_buffer(buffer, cpu); + rb_allocate_cpu_buffer(buffer, nr_pages, cpu); if (!buffer->buffers[cpu]) goto fail_free_buffers; } @@ -1276,6 +1289,18 @@ out: raw_spin_unlock_irq(&cpu_buffer->reader_lock); } +static void update_pages_handler(struct ring_buffer_per_cpu *cpu_buffer) +{ + if (cpu_buffer->nr_pages_to_update > 0) + rb_insert_pages(cpu_buffer, &cpu_buffer->new_pages, + cpu_buffer->nr_pages_to_update); + else + rb_remove_pages(cpu_buffer, -cpu_buffer->nr_pages_to_update); + cpu_buffer->nr_pages += cpu_buffer->nr_pages_to_update; + /* reset this value */ + cpu_buffer->nr_pages_to_update = 0; +} + /** * ring_buffer_resize - resize the ring buffer * @buffer: the buffer to resize. @@ -1285,14 +1310,12 @@ out: * * Returns -1 on failure. */ -int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size) +int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size, + int cpu_id) { struct ring_buffer_per_cpu *cpu_buffer; - unsigned nr_pages, rm_pages, new_pages; - struct buffer_page *bpage, *tmp; - unsigned long buffer_size; - LIST_HEAD(pages); - int i, cpu; + unsigned nr_pages; + int cpu; /* * Always succeed at resizing a non-existent buffer: @@ -1302,15 +1325,11 @@ int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size) size = DIV_ROUND_UP(size, BUF_PAGE_SIZE); size *= BUF_PAGE_SIZE; - buffer_size = buffer->pages * BUF_PAGE_SIZE; /* we need a minimum of two pages */ if (size < BUF_PAGE_SIZE * 2) size = BUF_PAGE_SIZE * 2; - if (size == buffer_size) - return size; - atomic_inc(&buffer->record_disabled); /* Make sure all writers are done with this buffer. 
*/ @@ -1321,68 +1340,56 @@ int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size) nr_pages = DIV_ROUND_UP(size, BUF_PAGE_SIZE); - if (size < buffer_size) { - - /* easy case, just free pages */ - if (RB_WARN_ON(buffer, nr_pages >= buffer->pages)) - goto out_fail; - - rm_pages = buffer->pages - nr_pages; - + if (cpu_id == RING_BUFFER_ALL_CPUS) { + /* calculate the pages to update */ for_each_buffer_cpu(buffer, cpu) { cpu_buffer = buffer->buffers[cpu]; - rb_remove_pages(cpu_buffer, rm_pages); - } - goto out; - } - /* - * This is a bit more difficult. We only want to add pages - * when we can allocate enough for all CPUs. We do this - * by allocating all the pages and storing them on a local - * link list. If we succeed in our allocation, then we - * add these pages to the cpu_buffers. Otherwise we just free - * them all and return -ENOMEM; - */ - if (RB_WARN_ON(buffer, nr_pages <= buffer->pages)) - goto out_fail; + cpu_buffer->nr_pages_to_update = nr_pages - + cpu_buffer->nr_pages; - new_pages = nr_pages - buffer->pages; + /* + * nothing more to do for removing pages or no update + */ + if (cpu_buffer->nr_pages_to_update <= 0) + continue; - for_each_buffer_cpu(buffer, cpu) { - for (i = 0; i < new_pages; i++) { - struct page *page; /* - * __GFP_NORETRY flag makes sure that the allocation - * fails gracefully without invoking oom-killer and - * the system is not destabilized. + * to add pages, make sure all new pages can be + * allocated without receiving ENOMEM */ - bpage = kzalloc_node(ALIGN(sizeof(*bpage), - cache_line_size()), - GFP_KERNEL | __GFP_NORETRY, - cpu_to_node(cpu)); - if (!bpage) - goto free_pages; - list_add(&bpage->list, &pages); - page = alloc_pages_node(cpu_to_node(cpu), - GFP_KERNEL | __GFP_NORETRY, 0); - if (!page) - goto free_pages; - bpage->page = page_address(page); - rb_init_page(bpage->page); + INIT_LIST_HEAD(&cpu_buffer->new_pages); + if (__rb_allocate_pages(cpu_buffer->nr_pages_to_update, + &cpu_buffer->new_pages, cpu)) + /* not enough memory for new pages */ + goto no_mem; } - } - for_each_buffer_cpu(buffer, cpu) { - cpu_buffer = buffer->buffers[cpu]; - rb_insert_pages(cpu_buffer, &pages, new_pages); - } + /* wait for all the updates to complete */ + for_each_buffer_cpu(buffer, cpu) { + cpu_buffer = buffer->buffers[cpu]; + if (cpu_buffer->nr_pages_to_update) { + update_pages_handler(cpu_buffer); + } + } + } else { + cpu_buffer = buffer->buffers[cpu_id]; + if (nr_pages == cpu_buffer->nr_pages) + goto out; - if (RB_WARN_ON(buffer, !list_empty(&pages))) - goto out_fail; + cpu_buffer->nr_pages_to_update = nr_pages - + cpu_buffer->nr_pages; + + INIT_LIST_HEAD(&cpu_buffer->new_pages); + if (cpu_buffer->nr_pages_to_update > 0 && + __rb_allocate_pages(cpu_buffer->nr_pages_to_update, + &cpu_buffer->new_pages, cpu_id)) + goto no_mem; + + update_pages_handler(cpu_buffer); + } out: - buffer->pages = nr_pages; put_online_cpus(); mutex_unlock(&buffer->mutex); @@ -1390,25 +1397,24 @@ int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size) return size; - free_pages: - list_for_each_entry_safe(bpage, tmp, &pages, list) { - list_del_init(&bpage->list); - free_buffer_page(bpage); + no_mem: + for_each_buffer_cpu(buffer, cpu) { + struct buffer_page *bpage, *tmp; + cpu_buffer = buffer->buffers[cpu]; + /* reset this number regardless */ + cpu_buffer->nr_pages_to_update = 0; + if (list_empty(&cpu_buffer->new_pages)) + continue; + list_for_each_entry_safe(bpage, tmp, &cpu_buffer->new_pages, + list) { + list_del_init(&bpage->list); + free_buffer_page(bpage); + 
} } put_online_cpus(); mutex_unlock(&buffer->mutex); atomic_dec(&buffer->record_disabled); return -ENOMEM; - - /* - * Something went totally wrong, and we are too paranoid - * to even clean up the mess. - */ - out_fail: - put_online_cpus(); - mutex_unlock(&buffer->mutex); - atomic_dec(&buffer->record_disabled); - return -1; } EXPORT_SYMBOL_GPL(ring_buffer_resize); @@ -1510,7 +1516,7 @@ rb_set_commit_to_write(struct ring_buffer_per_cpu *cpu_buffer) * assign the commit to the tail. */ again: - max_count = cpu_buffer->buffer->pages * 100; + max_count = cpu_buffer->nr_pages * 100; while (cpu_buffer->commit_page != cpu_buffer->tail_page) { if (RB_WARN_ON(cpu_buffer, !(--max_count))) @@ -3588,9 +3594,18 @@ EXPORT_SYMBOL_GPL(ring_buffer_read); * ring_buffer_size - return the size of the ring buffer (in bytes) * @buffer: The ring buffer. */ -unsigned long ring_buffer_size(struct ring_buffer *buffer) +unsigned long ring_buffer_size(struct ring_buffer *buffer, int cpu) { - return BUF_PAGE_SIZE * buffer->pages; + /* + * Earlier, this method returned + * BUF_PAGE_SIZE * buffer->nr_pages + * Since the nr_pages field is now removed, we have converted this to + * return the per cpu buffer value. + */ + if (!cpumask_test_cpu(cpu, buffer->cpumask)) + return 0; + + return BUF_PAGE_SIZE * buffer->buffers[cpu]->nr_pages; } EXPORT_SYMBOL_GPL(ring_buffer_size); @@ -3765,8 +3780,11 @@ int ring_buffer_swap_cpu(struct ring_buffer *buffer_a, !cpumask_test_cpu(cpu, buffer_b->cpumask)) goto out; + cpu_buffer_a = buffer_a->buffers[cpu]; + cpu_buffer_b = buffer_b->buffers[cpu]; + /* At least make sure the two buffers are somewhat the same */ - if (buffer_a->pages != buffer_b->pages) + if (cpu_buffer_a->nr_pages != cpu_buffer_b->nr_pages) goto out; ret = -EAGAIN; @@ -3780,9 +3798,6 @@ int ring_buffer_swap_cpu(struct ring_buffer *buffer_a, if (atomic_read(&buffer_b->record_disabled)) goto out; - cpu_buffer_a = buffer_a->buffers[cpu]; - cpu_buffer_b = buffer_b->buffers[cpu]; - if (atomic_read(&cpu_buffer_a->record_disabled)) goto out; @@ -4071,6 +4086,8 @@ static int rb_cpu_notify(struct notifier_block *self, struct ring_buffer *buffer = container_of(self, struct ring_buffer, cpu_notify); long cpu = (long)hcpu; + int cpu_i, nr_pages_same; + unsigned int nr_pages; switch (action) { case CPU_UP_PREPARE: @@ -4078,8 +4095,23 @@ static int rb_cpu_notify(struct notifier_block *self, if (cpumask_test_cpu(cpu, buffer->cpumask)) return NOTIFY_OK; + nr_pages = 0; + nr_pages_same = 1; + /* check if all cpu sizes are same */ + for_each_buffer_cpu(buffer, cpu_i) { + /* fill in the size from first enabled cpu */ + if (nr_pages == 0) + nr_pages = buffer->buffers[cpu_i]->nr_pages; + if (nr_pages != buffer->buffers[cpu_i]->nr_pages) { + nr_pages_same = 0; + break; + } + } + /* allocate minimum pages, user can later expand it */ + if (!nr_pages_same) + nr_pages = 2; buffer->buffers[cpu] = - rb_allocate_cpu_buffer(buffer, cpu); + rb_allocate_cpu_buffer(buffer, nr_pages, cpu); if (!buffer->buffers[cpu]) { WARN(1, "failed to allocate ring buffer on CPU %ld\n", cpu); diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index bbcde546f9f7..f11a285ee5bb 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -838,7 +838,8 @@ __acquires(kernel_lock) /* If we expanded the buffers, make sure the max is expanded too */ if (ring_buffer_expanded && type->use_max_tr) - ring_buffer_resize(max_tr.buffer, trace_buf_size); + ring_buffer_resize(max_tr.buffer, trace_buf_size, + RING_BUFFER_ALL_CPUS); /* the test is responsible for initializing 
and enabling */ pr_info("Testing tracer %s: ", type->name); @@ -854,7 +855,8 @@ __acquires(kernel_lock) /* Shrink the max buffer again */ if (ring_buffer_expanded && type->use_max_tr) - ring_buffer_resize(max_tr.buffer, 1); + ring_buffer_resize(max_tr.buffer, 1, + RING_BUFFER_ALL_CPUS); printk(KERN_CONT "PASSED\n"); } @@ -3053,7 +3055,14 @@ int tracer_init(struct tracer *t, struct trace_array *tr) return t->init(tr); } -static int __tracing_resize_ring_buffer(unsigned long size) +static void set_buffer_entries(struct trace_array *tr, unsigned long val) +{ + int cpu; + for_each_tracing_cpu(cpu) + tr->data[cpu]->entries = val; +} + +static int __tracing_resize_ring_buffer(unsigned long size, int cpu) { int ret; @@ -3064,19 +3073,32 @@ static int __tracing_resize_ring_buffer(unsigned long size) */ ring_buffer_expanded = 1; - ret = ring_buffer_resize(global_trace.buffer, size); + ret = ring_buffer_resize(global_trace.buffer, size, cpu); if (ret < 0) return ret; if (!current_trace->use_max_tr) goto out; - ret = ring_buffer_resize(max_tr.buffer, size); + ret = ring_buffer_resize(max_tr.buffer, size, cpu); if (ret < 0) { - int r; + int r = 0; + + if (cpu == RING_BUFFER_ALL_CPUS) { + int i; + for_each_tracing_cpu(i) { + r = ring_buffer_resize(global_trace.buffer, + global_trace.data[i]->entries, + i); + if (r < 0) + break; + } + } else { + r = ring_buffer_resize(global_trace.buffer, + global_trace.data[cpu]->entries, + cpu); + } - r = ring_buffer_resize(global_trace.buffer, - global_trace.entries); if (r < 0) { /* * AARGH! We are left with different @@ -3098,14 +3120,21 @@ static int __tracing_resize_ring_buffer(unsigned long size) return ret; } - max_tr.entries = size; + if (cpu == RING_BUFFER_ALL_CPUS) + set_buffer_entries(&max_tr, size); + else + max_tr.data[cpu]->entries = size; + out: - global_trace.entries = size; + if (cpu == RING_BUFFER_ALL_CPUS) + set_buffer_entries(&global_trace, size); + else + global_trace.data[cpu]->entries = size; return ret; } -static ssize_t tracing_resize_ring_buffer(unsigned long size) +static ssize_t tracing_resize_ring_buffer(unsigned long size, int cpu_id) { int cpu, ret = size; @@ -3121,12 +3150,19 @@ static ssize_t tracing_resize_ring_buffer(unsigned long size) atomic_inc(&max_tr.data[cpu]->disabled); } - if (size != global_trace.entries) - ret = __tracing_resize_ring_buffer(size); + if (cpu_id != RING_BUFFER_ALL_CPUS) { + /* make sure, this cpu is enabled in the mask */ + if (!cpumask_test_cpu(cpu_id, tracing_buffer_mask)) { + ret = -EINVAL; + goto out; + } + } + ret = __tracing_resize_ring_buffer(size, cpu_id); if (ret < 0) ret = -ENOMEM; +out: for_each_tracing_cpu(cpu) { if (global_trace.data[cpu]) atomic_dec(&global_trace.data[cpu]->disabled); @@ -3157,7 +3193,8 @@ int tracing_update_buffers(void) mutex_lock(&trace_types_lock); if (!ring_buffer_expanded) - ret = __tracing_resize_ring_buffer(trace_buf_size); + ret = __tracing_resize_ring_buffer(trace_buf_size, + RING_BUFFER_ALL_CPUS); mutex_unlock(&trace_types_lock); return ret; @@ -3181,7 +3218,8 @@ static int tracing_set_tracer(const char *buf) mutex_lock(&trace_types_lock); if (!ring_buffer_expanded) { - ret = __tracing_resize_ring_buffer(trace_buf_size); + ret = __tracing_resize_ring_buffer(trace_buf_size, + RING_BUFFER_ALL_CPUS); if (ret < 0) goto out; ret = 0; @@ -3207,8 +3245,8 @@ static int tracing_set_tracer(const char *buf) * The max_tr ring buffer has some state (e.g. ring->clock) and * we want preserve it. 
*/ - ring_buffer_resize(max_tr.buffer, 1); - max_tr.entries = 1; + ring_buffer_resize(max_tr.buffer, 1, RING_BUFFER_ALL_CPUS); + set_buffer_entries(&max_tr, 1); } destroy_trace_option_files(topts); @@ -3216,10 +3254,17 @@ static int tracing_set_tracer(const char *buf) topts = create_trace_option_files(current_trace); if (current_trace->use_max_tr) { - ret = ring_buffer_resize(max_tr.buffer, global_trace.entries); - if (ret < 0) - goto out; - max_tr.entries = global_trace.entries; + int cpu; + /* we need to make per cpu buffer sizes equivalent */ + for_each_tracing_cpu(cpu) { + ret = ring_buffer_resize(max_tr.buffer, + global_trace.data[cpu]->entries, + cpu); + if (ret < 0) + goto out; + max_tr.data[cpu]->entries = + global_trace.data[cpu]->entries; + } } if (t->init) { @@ -3721,30 +3766,82 @@ out_err: goto out; } +struct ftrace_entries_info { + struct trace_array *tr; + int cpu; +}; + +static int tracing_entries_open(struct inode *inode, struct file *filp) +{ + struct ftrace_entries_info *info; + + if (tracing_disabled) + return -ENODEV; + + info = kzalloc(sizeof(*info), GFP_KERNEL); + if (!info) + return -ENOMEM; + + info->tr = &global_trace; + info->cpu = (unsigned long)inode->i_private; + + filp->private_data = info; + + return 0; +} + static ssize_t tracing_entries_read(struct file *filp, char __user *ubuf, size_t cnt, loff_t *ppos) { - struct trace_array *tr = filp->private_data; - char buf[96]; - int r; + struct ftrace_entries_info *info = filp->private_data; + struct trace_array *tr = info->tr; + char buf[64]; + int r = 0; + ssize_t ret; mutex_lock(&trace_types_lock); - if (!ring_buffer_expanded) - r = sprintf(buf, "%lu (expanded: %lu)\n", - tr->entries >> 10, - trace_buf_size >> 10); - else - r = sprintf(buf, "%lu\n", tr->entries >> 10); + + if (info->cpu == RING_BUFFER_ALL_CPUS) { + int cpu, buf_size_same; + unsigned long size; + + size = 0; + buf_size_same = 1; + /* check if all cpu sizes are same */ + for_each_tracing_cpu(cpu) { + /* fill in the size from first enabled cpu */ + if (size == 0) + size = tr->data[cpu]->entries; + if (size != tr->data[cpu]->entries) { + buf_size_same = 0; + break; + } + } + + if (buf_size_same) { + if (!ring_buffer_expanded) + r = sprintf(buf, "%lu (expanded: %lu)\n", + size >> 10, + trace_buf_size >> 10); + else + r = sprintf(buf, "%lu\n", size >> 10); + } else + r = sprintf(buf, "X\n"); + } else + r = sprintf(buf, "%lu\n", tr->data[info->cpu]->entries >> 10); + mutex_unlock(&trace_types_lock); - return simple_read_from_buffer(ubuf, cnt, ppos, buf, r); + ret = simple_read_from_buffer(ubuf, cnt, ppos, buf, r); + return ret; } static ssize_t tracing_entries_write(struct file *filp, const char __user *ubuf, size_t cnt, loff_t *ppos) { + struct ftrace_entries_info *info = filp->private_data; unsigned long val; int ret; @@ -3759,7 +3856,7 @@ tracing_entries_write(struct file *filp, const char __user *ubuf, /* value is in KB */ val <<= 10; - ret = tracing_resize_ring_buffer(val); + ret = tracing_resize_ring_buffer(val, info->cpu); if (ret < 0) return ret; @@ -3768,6 +3865,16 @@ tracing_entries_write(struct file *filp, const char __user *ubuf, return cnt; } +static int +tracing_entries_release(struct inode *inode, struct file *filp) +{ + struct ftrace_entries_info *info = filp->private_data; + + kfree(info); + + return 0; +} + static ssize_t tracing_total_entries_read(struct file *filp, char __user *ubuf, size_t cnt, loff_t *ppos) @@ -3779,7 +3886,7 @@ tracing_total_entries_read(struct file *filp, char __user *ubuf, mutex_lock(&trace_types_lock); 
for_each_tracing_cpu(cpu) { - size += tr->entries >> 10; + size += tr->data[cpu]->entries >> 10; if (!ring_buffer_expanded) expanded_size += trace_buf_size >> 10; } @@ -3813,7 +3920,7 @@ tracing_free_buffer_release(struct inode *inode, struct file *filp) if (trace_flags & TRACE_ITER_STOP_ON_FREE) tracing_off(); /* resize the ring buffer to 0 */ - tracing_resize_ring_buffer(0); + tracing_resize_ring_buffer(0, RING_BUFFER_ALL_CPUS); return 0; } @@ -4012,9 +4119,10 @@ static const struct file_operations tracing_pipe_fops = { }; static const struct file_operations tracing_entries_fops = { - .open = tracing_open_generic, + .open = tracing_entries_open, .read = tracing_entries_read, .write = tracing_entries_write, + .release = tracing_entries_release, .llseek = generic_file_llseek, }; @@ -4466,6 +4574,9 @@ static void tracing_init_debugfs_percpu(long cpu) trace_create_file("stats", 0444, d_cpu, (void *) cpu, &tracing_stats_fops); + + trace_create_file("buffer_size_kb", 0444, d_cpu, + (void *) cpu, &tracing_entries_fops); } #ifdef CONFIG_FTRACE_SELFTEST @@ -4795,7 +4906,7 @@ static __init int tracer_init_debugfs(void) (void *) TRACE_PIPE_ALL_CPU, &tracing_pipe_fops); trace_create_file("buffer_size_kb", 0644, d_tracer, - &global_trace, &tracing_entries_fops); + (void *) RING_BUFFER_ALL_CPUS, &tracing_entries_fops); trace_create_file("buffer_total_size_kb", 0444, d_tracer, &global_trace, &tracing_total_entries_fops); @@ -5056,7 +5167,6 @@ __init static int tracer_alloc_buffers(void) WARN_ON(1); goto out_free_cpumask; } - global_trace.entries = ring_buffer_size(global_trace.buffer); if (global_trace.buffer_disabled) tracing_off(); @@ -5069,7 +5179,6 @@ __init static int tracer_alloc_buffers(void) ring_buffer_free(global_trace.buffer); goto out_free_cpumask; } - max_tr.entries = 1; #endif /* Allocate the first page for all buffers */ @@ -5078,6 +5187,11 @@ __init static int tracer_alloc_buffers(void) max_tr.data[i] = &per_cpu(max_tr_data, i); } + set_buffer_entries(&global_trace, ring_buf_size); +#ifdef CONFIG_TRACER_MAX_TRACE + set_buffer_entries(&max_tr, 1); +#endif + trace_init_cmdlines(); register_tracer(&nop_trace); diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h index f9d85504f04b..1c8b7c6f7b3b 100644 --- a/kernel/trace/trace.h +++ b/kernel/trace/trace.h @@ -131,6 +131,7 @@ struct trace_array_cpu { atomic_t disabled; void *buffer_page; /* ring buffer spare */ + unsigned long entries; unsigned long saved_latency; unsigned long critical_start; unsigned long critical_end; @@ -152,7 +153,6 @@ struct trace_array_cpu { */ struct trace_array { struct ring_buffer *buffer; - unsigned long entries; int cpu; int buffer_disabled; cycle_t time_start; -- cgit v1.2.3 From 8932a63d5edb02f714d50c26583152fe0a97a69c Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 19 Apr 2012 12:20:14 -0700 Subject: rcu: Reduce cache-miss initialization latencies for large systems Commit #0209f649 (rcu: limit rcu_node leaf-level fanout) set an upper limit of 16 on the leaf-level fanout for the rcu_node tree. This was needed to reduce lock contention that was induced by the synchronization of scheduling-clock interrupts, which was in turn needed to improve energy efficiency for moderate-sized lightly loaded servers. However, reducing the leaf-level fanout means that there are more leaf-level rcu_node structures in the tree, which in turn means that RCU's grace-period initialization incurs more cache misses. 
This is not a problem on moderate-sized servers with only a few tens of CPUs, but becomes a major source of real-time latency spikes on systems with many hundreds of CPUs. In addition, the workloads running on these large systems tend to be CPU-bound, which eliminates the energy-efficiency advantages of synchronizing scheduling-clock interrupts. Therefore, these systems need maximal values for the rcu_node leaf-level fanout. This commit addresses this problem by introducing a new kernel parameter named RCU_FANOUT_LEAF that directly controls the leaf-level fanout. This parameter defaults to 16 to handle the common case of a moderate sized lightly loaded servers, but may be set higher on larger systems. Reported-by: Mike Galbraith Reported-by: Dimitri Sivanich Signed-off-by: Paul E. McKenney --- kernel/rcutree.c | 2 +- kernel/rcutree.h | 10 +++------- 2 files changed, 4 insertions(+), 8 deletions(-) (limited to 'kernel') diff --git a/kernel/rcutree.c b/kernel/rcutree.c index 1050d6d3922c..780acf8e15e9 100644 --- a/kernel/rcutree.c +++ b/kernel/rcutree.c @@ -2418,7 +2418,7 @@ static void __init rcu_init_levelspread(struct rcu_state *rsp) for (i = NUM_RCU_LVLS - 1; i > 0; i--) rsp->levelspread[i] = CONFIG_RCU_FANOUT; - rsp->levelspread[0] = RCU_FANOUT_LEAF; + rsp->levelspread[0] = CONFIG_RCU_FANOUT_LEAF; } #else /* #ifdef CONFIG_RCU_FANOUT_EXACT */ static void __init rcu_init_levelspread(struct rcu_state *rsp) diff --git a/kernel/rcutree.h b/kernel/rcutree.h index cdd1be0a4072..a905c200405c 100644 --- a/kernel/rcutree.h +++ b/kernel/rcutree.h @@ -29,18 +29,14 @@ #include /* - * Define shape of hierarchy based on NR_CPUS and CONFIG_RCU_FANOUT. + * Define shape of hierarchy based on NR_CPUS, CONFIG_RCU_FANOUT, and + * CONFIG_RCU_FANOUT_LEAF. * In theory, it should be possible to add more levels straightforwardly. * In practice, this did work well going from three levels to four. * Of course, your mileage may vary. */ #define MAX_RCU_LVLS 4 -#if CONFIG_RCU_FANOUT > 16 -#define RCU_FANOUT_LEAF 16 -#else /* #if CONFIG_RCU_FANOUT > 16 */ -#define RCU_FANOUT_LEAF (CONFIG_RCU_FANOUT) -#endif /* #else #if CONFIG_RCU_FANOUT > 16 */ -#define RCU_FANOUT_1 (RCU_FANOUT_LEAF) +#define RCU_FANOUT_1 (CONFIG_RCU_FANOUT_LEAF) #define RCU_FANOUT_2 (RCU_FANOUT_1 * CONFIG_RCU_FANOUT) #define RCU_FANOUT_3 (RCU_FANOUT_2 * CONFIG_RCU_FANOUT) #define RCU_FANOUT_4 (RCU_FANOUT_3 * CONFIG_RCU_FANOUT) -- cgit v1.2.3 From 6d8133919bac4270883b24328500875a49e71b36 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 23 Feb 2012 13:30:16 -0800 Subject: rcu: Document why rcu_blocking_is_gp() is safe The rcu_blocking_is_gp() function tests to see if there is only one online CPU, and if so, synchronize_sched() and friends become no-ops. However, for larger systems, num_online_cpus() scans a large vector, and might be preempted while doing so. While preempted, any number of CPUs might come online and go offline, potentially resulting in num_online_cpus() returning 1 when there never had only been one CPU online. This could result in a too-short RCU grace period, which could in turn result in total failure, except that the only way that the grace period is too short is if there is an RCU read-side critical section spanning it. For RCU-sched and RCU-bh (which are the only cases using rcu_blocking_is_gp()), RCU read-side critical sections have either preemption or bh disabled, which prevents CPUs from going offline. This in turn prevents actual failures from occurring. 
This commit therefore adds a large block comment to rcu_blocking_is_gp() documenting why it is safe. This commit also moves rcu_blocking_is_gp() into kernel/rcutree.c, which should help prevent unwary developers from mistaking it for a generally useful function. Signed-off-by: Paul E. McKenney Signed-off-by: Paul E. McKenney --- kernel/rcutree.c | 32 ++++++++++++++++++++++++++++++++ 1 file changed, 32 insertions(+) (limited to 'kernel') diff --git a/kernel/rcutree.c b/kernel/rcutree.c index 780acf8e15e9..8f6a344306e6 100644 --- a/kernel/rcutree.c +++ b/kernel/rcutree.c @@ -1894,6 +1894,38 @@ void call_rcu_bh(struct rcu_head *head, void (*func)(struct rcu_head *rcu)) } EXPORT_SYMBOL_GPL(call_rcu_bh); +/* + * Because a context switch is a grace period for RCU-sched and RCU-bh, + * any blocking grace-period wait automatically implies a grace period + * if there is only one CPU online at any point time during execution + * of either synchronize_sched() or synchronize_rcu_bh(). It is OK to + * occasionally incorrectly indicate that there are multiple CPUs online + * when there was in fact only one the whole time, as this just adds + * some overhead: RCU still operates correctly. + * + * Of course, sampling num_online_cpus() with preemption enabled can + * give erroneous results if there are concurrent CPU-hotplug operations. + * For example, given a demonic sequence of preemptions in num_online_cpus() + * and CPU-hotplug operations, there could be two or more CPUs online at + * all times, but num_online_cpus() might well return one (or even zero). + * + * However, all such demonic sequences require at least one CPU-offline + * operation. Furthermore, rcu_blocking_is_gp() giving the wrong answer + * is only a problem if there is an RCU read-side critical section executing + * throughout. But RCU-sched and RCU-bh read-side critical sections + * disable either preemption or bh, which prevents a CPU from going offline. + * Therefore, the only way that rcu_blocking_is_gp() can incorrectly return + * that there is only one CPU when in fact there was more than one throughout + * is when there were no RCU readers in the system. If there are no + * RCU readers, the grace period by definition can be of zero length, + * regardless of the number of online CPUs. + */ +static inline int rcu_blocking_is_gp(void) +{ + might_sleep(); /* Check for RCU read-side critical section. */ + return num_online_cpus() <= 1; +} + /** * synchronize_sched - wait until an rcu-sched grace period has elapsed. * -- cgit v1.2.3 From 2fdbb31b662787f78bb78b3e4e18f1a072058ffc Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 23 Feb 2012 15:58:29 -0800 Subject: rcu: Add RCU_FAST_NO_HZ tracing for idle exit Traces of rcu_prep_idle events can be confusing because rcu_cleanup_after_idle() does no tracing. This commit therefore adds this tracing. Signed-off-by: Paul E. McKenney Signed-off-by: Paul E. McKenney --- kernel/rcutree_plugin.h | 1 + 1 file changed, 1 insertion(+) (limited to 'kernel') diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h index c023464816be..1e561ab952a3 100644 --- a/kernel/rcutree_plugin.h +++ b/kernel/rcutree_plugin.h @@ -2085,6 +2085,7 @@ static void rcu_prepare_for_idle_init(int cpu) static void rcu_cleanup_after_idle(int cpu) { hrtimer_cancel(&per_cpu(rcu_idle_gp_timer, cpu)); + trace_rcu_prep_idle("Cleanup after idle"); } /* -- cgit v1.2.3 From 2ee3dc80660ac8285a37e662fd91b2e45c46f06a Mon Sep 17 00:00:00 2001 From: "Paul E. 
McKenney" Date: Thu, 23 Feb 2012 17:13:19 -0800 Subject: rcu: Make RCU_FAST_NO_HZ use timer rather than hrtimer The RCU_FAST_NO_HZ facility uses an hrtimer to wake up a CPU when it is allowed to go into dyntick-idle mode, which is almost always cancelled soon after. This is not what hrtimers are good at, so this commit switches to the timer wheel. Reported-by: Steven Rostedt Signed-off-by: Paul E. McKenney Signed-off-by: Paul E. McKenney --- kernel/rcutree_plugin.h | 40 ++++++++++++---------------------------- 1 file changed, 12 insertions(+), 28 deletions(-) (limited to 'kernel') diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h index 1e561ab952a3..0f007b363dba 100644 --- a/kernel/rcutree_plugin.h +++ b/kernel/rcutree_plugin.h @@ -1980,9 +1980,7 @@ static void rcu_prepare_for_idle(int cpu) static DEFINE_PER_CPU(int, rcu_dyntick_drain); static DEFINE_PER_CPU(unsigned long, rcu_dyntick_holdoff); -static DEFINE_PER_CPU(struct hrtimer, rcu_idle_gp_timer); -static ktime_t rcu_idle_gp_wait; /* If some non-lazy callbacks. */ -static ktime_t rcu_idle_lazy_gp_wait; /* If only lazy callbacks. */ +static DEFINE_PER_CPU(struct timer_list, rcu_idle_gp_timer); /* * Allow the CPU to enter dyntick-idle mode if either: (1) There are no @@ -2051,10 +2049,9 @@ static bool rcu_cpu_has_nonlazy_callbacks(int cpu) * real work is done upon re-entry to idle, or by the next scheduling-clock * interrupt should idle not be re-entered. */ -static enum hrtimer_restart rcu_idle_gp_timer_func(struct hrtimer *hrtp) +static void rcu_idle_gp_timer_func(unsigned long unused) { trace_rcu_prep_idle("Timer"); - return HRTIMER_NORESTART; } /* @@ -2062,19 +2059,8 @@ static enum hrtimer_restart rcu_idle_gp_timer_func(struct hrtimer *hrtp) */ static void rcu_prepare_for_idle_init(int cpu) { - static int firsttime = 1; - struct hrtimer *hrtp = &per_cpu(rcu_idle_gp_timer, cpu); - - hrtimer_init(hrtp, CLOCK_MONOTONIC, HRTIMER_MODE_REL); - hrtp->function = rcu_idle_gp_timer_func; - if (firsttime) { - unsigned int upj = jiffies_to_usecs(RCU_IDLE_GP_DELAY); - - rcu_idle_gp_wait = ns_to_ktime(upj * (u64)1000); - upj = jiffies_to_usecs(RCU_IDLE_LAZY_GP_DELAY); - rcu_idle_lazy_gp_wait = ns_to_ktime(upj * (u64)1000); - firsttime = 0; - } + setup_timer(&per_cpu(rcu_idle_gp_timer, cpu), + rcu_idle_gp_timer_func, 0); } /* @@ -2084,7 +2070,7 @@ static void rcu_prepare_for_idle_init(int cpu) */ static void rcu_cleanup_after_idle(int cpu) { - hrtimer_cancel(&per_cpu(rcu_idle_gp_timer, cpu)); + del_timer(&per_cpu(rcu_idle_gp_timer, cpu)); trace_rcu_prep_idle("Cleanup after idle"); } @@ -2141,11 +2127,11 @@ static void rcu_prepare_for_idle(int cpu) per_cpu(rcu_dyntick_drain, cpu) = 0; per_cpu(rcu_dyntick_holdoff, cpu) = jiffies; if (rcu_cpu_has_nonlazy_callbacks(cpu)) - hrtimer_start(&per_cpu(rcu_idle_gp_timer, cpu), - rcu_idle_gp_wait, HRTIMER_MODE_REL); + mod_timer(&per_cpu(rcu_idle_gp_timer, cpu), + jiffies + RCU_IDLE_GP_DELAY); else - hrtimer_start(&per_cpu(rcu_idle_gp_timer, cpu), - rcu_idle_lazy_gp_wait, HRTIMER_MODE_REL); + mod_timer(&per_cpu(rcu_idle_gp_timer, cpu), + jiffies + RCU_IDLE_LAZY_GP_DELAY); return; /* Nothing more to do immediately. */ } else if (--per_cpu(rcu_dyntick_drain, cpu) <= 0) { /* We have hit the limit, so time to give up. 
*/ @@ -2193,14 +2179,12 @@ static void rcu_prepare_for_idle(int cpu) static void print_cpu_stall_fast_no_hz(char *cp, int cpu) { - struct hrtimer *hrtp = &per_cpu(rcu_idle_gp_timer, cpu); + struct timer_list *tltp = &per_cpu(rcu_idle_gp_timer, cpu); - sprintf(cp, "drain=%d %c timer=%lld", + sprintf(cp, "drain=%d %c timer=%lu", per_cpu(rcu_dyntick_drain, cpu), per_cpu(rcu_dyntick_holdoff, cpu) == jiffies ? 'H' : '.', - hrtimer_active(hrtp) - ? ktime_to_us(hrtimer_get_remaining(hrtp)) - : -1); + timer_pending(tltp) ? tltp->expires - jiffies : -1); } #else /* #ifdef CONFIG_RCU_FAST_NO_HZ */ -- cgit v1.2.3 From c57afe80db4e169135eb675acc2d241e26cc064e Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 28 Feb 2012 11:02:21 -0800 Subject: rcu: Make RCU_FAST_NO_HZ account for pauses out of idle Both Steven Rostedt's new idle-capable trace macros and the RCU_NONIDLE() macro can cause RCU to momentarily pause out of idle without the rest of the system being involved. This can cause rcu_prepare_for_idle() to run through its state machine too quickly, which can in turn result in needless scheduling-clock interrupts. This commit therefore adds code to enable rcu_prepare_for_idle() to distinguish between an initial entry to idle on the one hand (which needs to advance the rcu_prepare_for_idle() state machine) and an idle reentry due to idle-capable trace macros and RCU_NONIDLE() on the other hand (which should avoid advancing the rcu_prepare_for_idle() state machine). Additional state is maintained to allow the timer to be correctly reposted when returning after a momentary pause out of idle, and even more state is maintained to detect when new non-lazy callbacks have been enqueued (which may require re-evaluation of the approach to idleness). Signed-off-by: Paul E. McKenney Signed-off-by: Paul E. McKenney --- kernel/rcutree.c | 2 ++ kernel/rcutree.h | 1 + kernel/rcutree_plugin.h | 57 +++++++++++++++++++++++++++++++++++++++++++++---- 3 files changed, 56 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/rcutree.c b/kernel/rcutree.c index 1050d6d3922c..403306b86e78 100644 --- a/kernel/rcutree.c +++ b/kernel/rcutree.c @@ -1829,6 +1829,8 @@ __call_rcu(struct rcu_head *head, void (*func)(struct rcu_head *rcu), rdp->qlen++; if (lazy) rdp->qlen_lazy++; + else + rcu_idle_count_callbacks_posted(); if (__is_kfree_rcu_offset((unsigned long)func)) trace_rcu_kfree_callback(rsp->name, head, (unsigned long)func, diff --git a/kernel/rcutree.h b/kernel/rcutree.h index cdd1be0a4072..36ca28ecedc6 100644 --- a/kernel/rcutree.h +++ b/kernel/rcutree.h @@ -471,6 +471,7 @@ static void __cpuinit rcu_prepare_kthreads(int cpu); static void rcu_prepare_for_idle_init(int cpu); static void rcu_cleanup_after_idle(int cpu); static void rcu_prepare_for_idle(int cpu); +static void rcu_idle_count_callbacks_posted(void); static void print_cpu_stall_info_begin(void); static void print_cpu_stall_info(struct rcu_state *rsp, int cpu); static void print_cpu_stall_info_end(void); diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h index 0f007b363dba..50c17975d4f4 100644 --- a/kernel/rcutree_plugin.h +++ b/kernel/rcutree_plugin.h @@ -1938,6 +1938,14 @@ static void rcu_prepare_for_idle(int cpu) { } +/* + * Don't bother keeping a running count of the number of RCU callbacks + * posted because CONFIG_RCU_FAST_NO_HZ=n. 
+ */ +static void rcu_idle_count_callbacks_posted(void) +{ +} + #else /* #if !defined(CONFIG_RCU_FAST_NO_HZ) */ /* @@ -1981,6 +1989,10 @@ static void rcu_prepare_for_idle(int cpu) static DEFINE_PER_CPU(int, rcu_dyntick_drain); static DEFINE_PER_CPU(unsigned long, rcu_dyntick_holdoff); static DEFINE_PER_CPU(struct timer_list, rcu_idle_gp_timer); +static DEFINE_PER_CPU(unsigned long, rcu_idle_gp_timer_expires); +static DEFINE_PER_CPU(bool, rcu_idle_first_pass); +static DEFINE_PER_CPU(unsigned long, rcu_nonlazy_posted); +static DEFINE_PER_CPU(unsigned long, rcu_nonlazy_posted_snap); /* * Allow the CPU to enter dyntick-idle mode if either: (1) There are no @@ -1993,6 +2005,8 @@ static DEFINE_PER_CPU(struct timer_list, rcu_idle_gp_timer); */ int rcu_needs_cpu(int cpu) { + /* Flag a new idle sojourn to the idle-entry state machine. */ + per_cpu(rcu_idle_first_pass, cpu) = 1; /* If no callbacks, RCU doesn't need the CPU. */ if (!rcu_cpu_has_callbacks(cpu)) return 0; @@ -2095,6 +2109,26 @@ static void rcu_cleanup_after_idle(int cpu) */ static void rcu_prepare_for_idle(int cpu) { + /* + * If this is an idle re-entry, for example, due to use of + * RCU_NONIDLE() or the new idle-loop tracing API within the idle + * loop, then don't take any state-machine actions, unless the + * momentary exit from idle queued additional non-lazy callbacks. + * Instead, repost the rcu_idle_gp_timer if this CPU has callbacks + * pending. + */ + if (!per_cpu(rcu_idle_first_pass, cpu) && + (per_cpu(rcu_nonlazy_posted, cpu) == + per_cpu(rcu_nonlazy_posted_snap, cpu))) { + if (rcu_cpu_has_callbacks(cpu)) + mod_timer(&per_cpu(rcu_idle_gp_timer, cpu), + per_cpu(rcu_idle_gp_timer_expires, cpu)); + return; + } + per_cpu(rcu_idle_first_pass, cpu) = 0; + per_cpu(rcu_nonlazy_posted_snap, cpu) = + per_cpu(rcu_nonlazy_posted, cpu) - 1; + /* * If there are no callbacks on this CPU, enter dyntick-idle mode. * Also reset state to avoid prejudicing later attempts. @@ -2127,11 +2161,15 @@ static void rcu_prepare_for_idle(int cpu) per_cpu(rcu_dyntick_drain, cpu) = 0; per_cpu(rcu_dyntick_holdoff, cpu) = jiffies; if (rcu_cpu_has_nonlazy_callbacks(cpu)) - mod_timer(&per_cpu(rcu_idle_gp_timer, cpu), - jiffies + RCU_IDLE_GP_DELAY); + per_cpu(rcu_idle_gp_timer_expires, cpu) = + jiffies + RCU_IDLE_GP_DELAY; else - mod_timer(&per_cpu(rcu_idle_gp_timer, cpu), - jiffies + RCU_IDLE_LAZY_GP_DELAY); + per_cpu(rcu_idle_gp_timer_expires, cpu) = + jiffies + RCU_IDLE_LAZY_GP_DELAY; + mod_timer(&per_cpu(rcu_idle_gp_timer, cpu), + per_cpu(rcu_idle_gp_timer_expires, cpu)); + per_cpu(rcu_nonlazy_posted_snap, cpu) = + per_cpu(rcu_nonlazy_posted, cpu); return; /* Nothing more to do immediately. */ } else if (--per_cpu(rcu_dyntick_drain, cpu) <= 0) { /* We have hit the limit, so time to give up. */ @@ -2171,6 +2209,17 @@ static void rcu_prepare_for_idle(int cpu) trace_rcu_prep_idle("Callbacks drained"); } +/* + * Keep a running count of callbacks posted so that rcu_prepare_for_idle() + * can detect when something out of the idle loop posts a callback. + * Of course, it had better do so either from a trace event designed to + * be called from idle or from within RCU_NONIDLE(). + */ +static void rcu_idle_count_callbacks_posted(void) +{ + __this_cpu_add(rcu_nonlazy_posted, 1); +} + #endif /* #else #if !defined(CONFIG_RCU_FAST_NO_HZ) */ #ifdef CONFIG_RCU_CPU_STALL_INFO -- cgit v1.2.3 From 37e377d2823e03528cb64f435d7c0e30b1c668eb Mon Sep 17 00:00:00 2001 From: "Paul E. 
McKenney" Date: Fri, 17 Feb 2012 22:12:18 -0800 Subject: rcu: Fixes to rcutorture error handling and cleanup The rcutorture initialization code ignored the error returns from rcu_torture_onoff_init() and rcu_torture_stall_init(). The rcutorture cleanup code failed to NULL out a number of pointers. These bugs will normally have no effect, but this commit fixes them nevertheless. Signed-off-by: Paul E. McKenney --- kernel/rcutorture.c | 19 ++++++++++++++++--- 1 file changed, 16 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/rcutorture.c b/kernel/rcutorture.c index a89b381a8c6e..1463a0636443 100644 --- a/kernel/rcutorture.c +++ b/kernel/rcutorture.c @@ -1337,6 +1337,7 @@ static void rcutorture_booster_cleanup(int cpu) /* This must be outside of the mutex, otherwise deadlock! */ kthread_stop(t); + boost_tasks[cpu] = NULL; } static int rcutorture_booster_init(int cpu) @@ -1484,13 +1485,15 @@ static void rcu_torture_onoff_cleanup(void) return; VERBOSE_PRINTK_STRING("Stopping rcu_torture_onoff task"); kthread_stop(onoff_task); + onoff_task = NULL; } #else /* #ifdef CONFIG_HOTPLUG_CPU */ -static void +static int rcu_torture_onoff_init(void) { + return 0; } static void rcu_torture_onoff_cleanup(void) @@ -1554,6 +1557,7 @@ static void rcu_torture_stall_cleanup(void) return; VERBOSE_PRINTK_STRING("Stopping rcu_torture_stall_task."); kthread_stop(stall_task); + stall_task = NULL; } static int rcutorture_cpu_notify(struct notifier_block *self, @@ -1665,6 +1669,7 @@ rcu_torture_cleanup(void) VERBOSE_PRINTK_STRING("Stopping rcu_torture_shutdown task"); kthread_stop(shutdown_task); } + shutdown_task = NULL; rcu_torture_onoff_cleanup(); /* Wait for all RCU callbacks to fire. */ @@ -1897,9 +1902,17 @@ rcu_torture_init(void) goto unwind; } } - rcu_torture_onoff_init(); + i = rcu_torture_onoff_init(); + if (i != 0) { + firsterr = i; + goto unwind; + } register_reboot_notifier(&rcutorture_shutdown_nb); - rcu_torture_stall_init(); + i = rcu_torture_stall_init(); + if (i != 0) { + firsterr = i; + goto unwind; + } rcutorture_record_test_transition(); mutex_unlock(&fullstop_mutex); return 0; -- cgit v1.2.3 From 625056b65ee9a8b4caf42136e0c844c15e6ec54f Mon Sep 17 00:00:00 2001 From: Sasha Levin Date: Thu, 15 Mar 2012 17:47:20 -0400 Subject: hung task debugging: Inject NMI when hung and going to panic Send an NMI to all CPUs when a hung task is detected and the hung task code is configured to panic. This gives us a fairly uptodate snapshot of all CPUs in the system. This lets us get stack trace of all CPUs which makes life easier trying to debug a deadlock, and the NMI doesn't change anything since the next step is a kernel panic. Signed-off-by: Sasha Levin Cc: Linus Torvalds Cc: Andrew Morton Link: http://lkml.kernel.org/r/1331848040-1676-1-git-send-email-levinsasha928@gmail.com [ extended the changelog a bit ] Signed-off-by: Ingo Molnar --- kernel/hung_task.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/hung_task.c b/kernel/hung_task.c index c21449f85a2a..6df614912b9d 100644 --- a/kernel/hung_task.c +++ b/kernel/hung_task.c @@ -108,8 +108,10 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout) touch_nmi_watchdog(); - if (sysctl_hung_task_panic) + if (sysctl_hung_task_panic) { + trigger_all_cpu_backtrace(); panic("hung_task: blocked tasks"); + } } /* -- cgit v1.2.3 From 783291e6900292521a3895583785e0c04a56c5b3 Mon Sep 17 00:00:00 2001 From: "Eric W. 
Biederman" Date: Thu, 17 Nov 2011 01:32:59 -0800 Subject: userns: Simplify the user_namespace by making userns->creator a kuid. - Transform userns->creator from a user_struct reference to a simple kuid_t, kgid_t pair. In cap_capable() this allows the check for whether we are the creator of a namespace to become the classic suser-style euid permission check. This allows us to remove the need for a struct cred in the mapping functions and still be able to display the user namespace creator's uid and gid as 0. - Remove the now unnecessary delayed_work in free_user_ns. All that is left for free_user_ns to do is to call kmem_cache_free and put_user_ns. Those functions can be called in any context, so call them directly from free_user_ns, removing the need for delayed work. Acked-by: Serge Hallyn Signed-off-by: Eric W. Biederman --- kernel/user.c | 7 ++++--- kernel/user_namespace.c | 42 ++++++++++++++++++++---------------------- 2 files changed, 24 insertions(+), 25 deletions(-) (limited to 'kernel') diff --git a/kernel/user.c b/kernel/user.c index 025077e54a7c..cff385659175 100644 --- a/kernel/user.c +++ b/kernel/user.c @@ -25,7 +25,8 @@ struct user_namespace init_user_ns = { .kref = { .refcount = ATOMIC_INIT(3), }, - .creator = &root_user, + .owner = GLOBAL_ROOT_UID, + .group = GLOBAL_ROOT_GID, }; EXPORT_SYMBOL_GPL(init_user_ns); @@ -54,9 +55,9 @@ struct hlist_head uidhash_table[UIDHASH_SZ]; */ static DEFINE_SPINLOCK(uidhash_lock); -/* root_user.__count is 2, 1 for init task cred, 1 for init_user_ns->user_ns */ +/* root_user.__count is 1, for init task cred */ struct user_struct root_user = { - .__count = ATOMIC_INIT(2), + .__count = ATOMIC_INIT(1), .processes = ATOMIC_INIT(1), .files = ATOMIC_INIT(0), .sigpending = ATOMIC_INIT(0), diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c index 898e973bd1e8..ed08836558e0 100644 --- a/kernel/user_namespace.c +++ b/kernel/user_namespace.c @@ -27,6 +27,16 @@ int create_user_ns(struct cred *new) { struct user_namespace *ns, *parent_ns = new->user_ns; struct user_struct *root_user; + kuid_t owner = make_kuid(new->user_ns, new->euid); + kgid_t group = make_kgid(new->user_ns, new->egid); + + /* The creator needs a mapping in the parent user namespace + * or else we won't be able to reasonably tell userspace who + * created a user_namespace. + */ + if (!kuid_has_mapping(parent_ns, owner) || + !kgid_has_mapping(parent_ns, group)) + return -EPERM; ns = kmem_cache_alloc(user_ns_cachep, GFP_KERNEL); if (!ns) @@ -43,7 +53,9 @@ int create_user_ns(struct cred *new) /* set the new root user in the credentials under preparation */ ns->parent = parent_ns; - ns->creator = new->user; + ns->owner = owner; + ns->group = group; + free_uid(new->user); new->user = root_user; new->uid = new->euid = new->suid = new->fsuid = 0; new->gid = new->egid = new->sgid = new->fsgid = 0; @@ -63,35 +75,22 @@ int create_user_ns(struct cred *new) #endif /* tgcred will be cleared in our caller bc CLONE_THREAD won't be set */ - /* Leave the reference to our user_ns with the new cred */ + /* Leave the new->user_ns reference with the new user namespace. */ + /* Leave the reference to our user_ns with the new cred. */ new->user_ns = ns; return 0; } -/* - * Deferred destructor for a user namespace. This is required because - * free_user_ns() may be called with uidhash_lock held, but we need to call - * back to free_uid() which will want to take the lock again.
- */ -static void free_user_ns_work(struct work_struct *work) +void free_user_ns(struct kref *kref) { struct user_namespace *parent, *ns = - container_of(work, struct user_namespace, destroyer); + container_of(kref, struct user_namespace, kref); + parent = ns->parent; - free_uid(ns->creator); kmem_cache_free(user_ns_cachep, ns); put_user_ns(parent); } - -void free_user_ns(struct kref *kref) -{ - struct user_namespace *ns = - container_of(kref, struct user_namespace, kref); - - INIT_WORK(&ns->destroyer, free_user_ns_work); - schedule_work(&ns->destroyer); -} EXPORT_SYMBOL(free_user_ns); uid_t user_ns_map_uid(struct user_namespace *to, const struct cred *cred, uid_t uid) @@ -101,12 +100,11 @@ uid_t user_ns_map_uid(struct user_namespace *to, const struct cred *cred, uid_t if (likely(to == cred->user_ns)) return uid; - /* Is cred->user the creator of the target user_ns * or the creator of one of it's parents? */ for ( tmp = to; tmp != &init_user_ns; tmp = tmp->parent ) { - if (cred->user == tmp->creator) { + if (uid_eq(cred->user->uid, tmp->owner)) { return (uid_t)0; } } @@ -126,7 +124,7 @@ gid_t user_ns_map_gid(struct user_namespace *to, const struct cred *cred, gid_t * or the creator of one of it's parents? */ for ( tmp = to; tmp != &init_user_ns; tmp = tmp->parent ) { - if (cred->user == tmp->creator) { + if (uid_eq(cred->user->uid, tmp->owner)) { return (gid_t)0; } } -- cgit v1.2.3 From 22d917d80e842829d0ca0a561967d728eb1d6303 Mon Sep 17 00:00:00 2001 From: "Eric W. Biederman" Date: Thu, 17 Nov 2011 00:11:58 -0800 Subject: userns: Rework the user_namespace adding uid/gid mapping support - Convert the old uid mapping functions into compatibility wrappers - Add a uid/gid mapping layer from user space uids and gids to kernel internal uids and gids that is extent based for simplicity and speed. * Working with number space after mapping uids/gids into their kernel internal version adds only mapping complexity over what we have today, leaving the kernel code easy to understand and test. - Add proc files /proc/self/uid_map and /proc/self/gid_map. These files display the mapping and allow a mapping to be added if a mapping does not exist. - Allow entering the user namespace without a uid or gid mapping. Since we are starting with an existing user, our uids and gids still have global mappings, so they are still valid and useful; they just don't have local mappings. The requirement for things to work is a global uid and gid, so it is odd but perfectly fine not to have a local uid and gid mapping. Not requiring global uid and gid mappings greatly simplifies the logic of setting up the uid and gid mappings by allowing the mappings to be set after the namespace is created, which makes the slight weirdness worth it. - Make the mappings in the initial user namespace to the global uid/gid space explicit. Today it is an identity mapping but in the future we may want to twist this for debugging, similar to what we do with jiffies. - Document the memory ordering requirements of setting the uid and gid mappings. We only allow the mappings to be set once and there are no pointers involved, so the requirements are trivial but a little atypical. Performance: In this scheme, for the permission checks, the performance is expected to stay the same as the actual machine instructions should remain the same. The worst case I could think of is ls -l on a large directory where all of the stat results need to be translated from kuids and kgids to uids and gids.
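To make the cost of that kuid/kgid-to-uid/gid translation concrete: each conversion is a linear scan of a small extent table. The sketch below is illustrative only, not the kernel source; it mirrors the map_id_up() helper added by this patch, and the stand-in structure definitions (including the extent-array size) are assumptions for the example, since the real declarations live in a header outside this kernel-only diff.

#include <stdint.h>

/* Stand-ins mirroring the patch's uid_gid_map/uid_gid_extent (illustrative). */
struct uid_gid_extent { uint32_t first; uint32_t lower_first; uint32_t count; };
struct uid_gid_map { unsigned int nr_extents; struct uid_gid_extent extent[5]; };

/* Simplified map_id_up(): translate a kernel id back to a namespace-local id,
 * or (uint32_t)-1 when no extent covers it, in which case callers such as
 * from_kuid_munged() substitute the overflow id. */
static uint32_t sketch_map_id_up(const struct uid_gid_map *map, uint32_t id)
{
	unsigned int idx;

	for (idx = 0; idx < map->nr_extents; idx++) {
		uint32_t first = map->extent[idx].lower_first;
		uint32_t last = first + map->extent[idx].count - 1;

		if (id >= first && id <= last)
			return (id - first) + map->extent[idx].first;
	}
	return (uint32_t)-1;
}

For example, with a hypothetical /proc/<pid>/uid_map line of "0 100000 65536" (the three columns being first, lower_first, and count), kernel uid 101000 translates back to uid 1000 inside the namespace.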
So I benchmarked that case on my laptop with a dual core hyperthread Intel i5-2520M cpu with 3M of cpu cache. My benchmark consisted of going to single user mode where nothing else was running. On an ext4 filesystem opening 1,000,000 files and looping through all of the files 1000 times and calling fstat on the individuals files. This was to ensure I was benchmarking stat times where the inodes were in the kernels cache, but the inode values were not in the processors cache. My results: v3.4-rc1: ~= 156ns (unmodified v3.4-rc1 with user namespace support disabled) v3.4-rc1-userns-: ~= 155ns (v3.4-rc1 with my user namespace patches and user namespace support disabled) v3.4-rc1-userns+: ~= 164ns (v3.4-rc1 with my user namespace patches and user namespace support enabled) All of the configurations ran in roughly 120ns when I performed tests that ran in the cpu cache. So in summary the performance impact is: 1ns improvement in the worst case with user namespace support compiled out. 8ns aka 5% slowdown in the worst case with user namespace support compiled in. Acked-by: Serge Hallyn Signed-off-by: Eric W. Biederman --- kernel/user.c | 16 ++ kernel/user_namespace.c | 545 ++++++++++++++++++++++++++++++++++++++++++++---- 2 files changed, 520 insertions(+), 41 deletions(-) (limited to 'kernel') diff --git a/kernel/user.c b/kernel/user.c index cff385659175..f9e420e36699 100644 --- a/kernel/user.c +++ b/kernel/user.c @@ -22,6 +22,22 @@ * and 1 for... ? */ struct user_namespace init_user_ns = { + .uid_map = { + .nr_extents = 1, + .extent[0] = { + .first = 0, + .lower_first = 0, + .count = 4294967295, + }, + }, + .gid_map = { + .nr_extents = 1, + .extent[0] = { + .first = 0, + .lower_first = 0, + .count = 4294967295, + }, + }, .kref = { .refcount = ATOMIC_INIT(3), }, diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c index ed08836558e0..7eff867bfac5 100644 --- a/kernel/user_namespace.c +++ b/kernel/user_namespace.c @@ -12,9 +12,19 @@ #include #include #include +#include +#include +#include +#include +#include +#include +#include static struct kmem_cache *user_ns_cachep __read_mostly; +static bool new_idmap_permitted(struct user_namespace *ns, int cap_setid, + struct uid_gid_map *map); + /* * Create a new user namespace, deriving the creator from the user in the * passed credentials, and replacing that user with the new root user for the @@ -26,7 +36,6 @@ static struct kmem_cache *user_ns_cachep __read_mostly; int create_user_ns(struct cred *new) { struct user_namespace *ns, *parent_ns = new->user_ns; - struct user_struct *root_user; kuid_t owner = make_kuid(new->user_ns, new->euid); kgid_t group = make_kgid(new->user_ns, new->egid); @@ -38,29 +47,15 @@ int create_user_ns(struct cred *new) !kgid_has_mapping(parent_ns, group)) return -EPERM; - ns = kmem_cache_alloc(user_ns_cachep, GFP_KERNEL); + ns = kmem_cache_zalloc(user_ns_cachep, GFP_KERNEL); if (!ns) return -ENOMEM; kref_init(&ns->kref); - - /* Alloc new root user. 
*/ - root_user = alloc_uid(make_kuid(ns, 0)); - if (!root_user) { - kmem_cache_free(user_ns_cachep, ns); - return -ENOMEM; - } - - /* set the new root user in the credentials under preparation */ ns->parent = parent_ns; ns->owner = owner; ns->group = group; - free_uid(new->user); - new->user = root_user; - new->uid = new->euid = new->suid = new->fsuid = 0; - new->gid = new->egid = new->sgid = new->fsgid = 0; - put_group_info(new->group_info); - new->group_info = get_group_info(&init_groups); + /* Start with the same capabilities as init but useless for doing * anything as the capabilities are bound to the new user namespace. */ @@ -93,44 +88,512 @@ void free_user_ns(struct kref *kref) } EXPORT_SYMBOL(free_user_ns); -uid_t user_ns_map_uid(struct user_namespace *to, const struct cred *cred, uid_t uid) +static u32 map_id_range_down(struct uid_gid_map *map, u32 id, u32 count) { - struct user_namespace *tmp; + unsigned idx, extents; + u32 first, last, id2; - if (likely(to == cred->user_ns)) - return uid; + id2 = id + count - 1; - /* Is cred->user the creator of the target user_ns - * or the creator of one of it's parents? - */ - for ( tmp = to; tmp != &init_user_ns; tmp = tmp->parent ) { - if (uid_eq(cred->user->uid, tmp->owner)) { - return (uid_t)0; - } + /* Find the matching extent */ + extents = map->nr_extents; + smp_read_barrier_depends(); + for (idx = 0; idx < extents; idx++) { + first = map->extent[idx].first; + last = first + map->extent[idx].count - 1; + if (id >= first && id <= last && + (id2 >= first && id2 <= last)) + break; + } + /* Map the id or note failure */ + if (idx < extents) + id = (id - first) + map->extent[idx].lower_first; + else + id = (u32) -1; + + return id; +} + +static u32 map_id_down(struct uid_gid_map *map, u32 id) +{ + unsigned idx, extents; + u32 first, last; + + /* Find the matching extent */ + extents = map->nr_extents; + smp_read_barrier_depends(); + for (idx = 0; idx < extents; idx++) { + first = map->extent[idx].first; + last = first + map->extent[idx].count - 1; + if (id >= first && id <= last) + break; + } + /* Map the id or note failure */ + if (idx < extents) + id = (id - first) + map->extent[idx].lower_first; + else + id = (u32) -1; + + return id; +} + +static u32 map_id_up(struct uid_gid_map *map, u32 id) +{ + unsigned idx, extents; + u32 first, last; + + /* Find the matching extent */ + extents = map->nr_extents; + smp_read_barrier_depends(); + for (idx = 0; idx < extents; idx++) { + first = map->extent[idx].lower_first; + last = first + map->extent[idx].count - 1; + if (id >= first && id <= last) + break; } + /* Map the id or note failure */ + if (idx < extents) + id = (id - first) + map->extent[idx].first; + else + id = (u32) -1; + + return id; +} + +/** + * make_kuid - Map a user-namespace uid pair into a kuid. + * @ns: User namespace that the uid is in + * @uid: User identifier + * + * Maps a user-namespace uid pair into a kernel internal kuid, + * and returns that kuid. + * + * When there is no mapping defined for the user-namespace uid + * pair INVALID_UID is returned. Callers are expected to test + * for and handle handle INVALID_UID being returned. INVALID_UID + * may be tested for using uid_valid(). + */ +kuid_t make_kuid(struct user_namespace *ns, uid_t uid) +{ + /* Map the uid to a global kernel uid */ + return KUIDT_INIT(map_id_down(&ns->uid_map, uid)); +} +EXPORT_SYMBOL(make_kuid); + +/** + * from_kuid - Create a uid from a kuid user-namespace pair. + * @targ: The user namespace we want a uid in. 
+ * @kuid: The kernel internal uid to start with. + * + * Map @kuid into the user-namespace specified by @targ and + * return the resulting uid. + * + * There is always a mapping into the initial user_namespace. + * + * If @kuid has no mapping in @targ (uid_t)-1 is returned. + */ +uid_t from_kuid(struct user_namespace *targ, kuid_t kuid) +{ + /* Map the uid from a global kernel uid */ + return map_id_up(&targ->uid_map, __kuid_val(kuid)); +} +EXPORT_SYMBOL(from_kuid); + +/** + * from_kuid_munged - Create a uid from a kuid user-namespace pair. + * @targ: The user namespace we want a uid in. + * @kuid: The kernel internal uid to start with. + * + * Map @kuid into the user-namespace specified by @targ and + * return the resulting uid. + * + * There is always a mapping into the initial user_namespace. + * + * Unlike from_kuid from_kuid_munged never fails and always + * returns a valid uid. This makes from_kuid_munged appropriate + * for use in syscalls like stat and getuid where failing the + * system call and failing to provide a valid uid are not an + * options. + * + * If @kuid has no mapping in @targ overflowuid is returned. + */ +uid_t from_kuid_munged(struct user_namespace *targ, kuid_t kuid) +{ + uid_t uid; + uid = from_kuid(targ, kuid); + + if (uid == (uid_t) -1) + uid = overflowuid; + return uid; +} +EXPORT_SYMBOL(from_kuid_munged); + +/** + * make_kgid - Map a user-namespace gid pair into a kgid. + * @ns: User namespace that the gid is in + * @uid: group identifier + * + * Maps a user-namespace gid pair into a kernel internal kgid, + * and returns that kgid. + * + * When there is no mapping defined for the user-namespace gid + * pair INVALID_GID is returned. Callers are expected to test + * for and handle INVALID_GID being returned. INVALID_GID may be + * tested for using gid_valid(). + */ +kgid_t make_kgid(struct user_namespace *ns, gid_t gid) +{ + /* Map the gid to a global kernel gid */ + return KGIDT_INIT(map_id_down(&ns->gid_map, gid)); +} +EXPORT_SYMBOL(make_kgid); + +/** + * from_kgid - Create a gid from a kgid user-namespace pair. + * @targ: The user namespace we want a gid in. + * @kgid: The kernel internal gid to start with. + * + * Map @kgid into the user-namespace specified by @targ and + * return the resulting gid. + * + * There is always a mapping into the initial user_namespace. + * + * If @kgid has no mapping in @targ (gid_t)-1 is returned. + */ +gid_t from_kgid(struct user_namespace *targ, kgid_t kgid) +{ + /* Map the gid from a global kernel gid */ + return map_id_up(&targ->gid_map, __kgid_val(kgid)); +} +EXPORT_SYMBOL(from_kgid); + +/** + * from_kgid_munged - Create a gid from a kgid user-namespace pair. + * @targ: The user namespace we want a gid in. + * @kgid: The kernel internal gid to start with. + * + * Map @kgid into the user-namespace specified by @targ and + * return the resulting gid. + * + * There is always a mapping into the initial user_namespace. + * + * Unlike from_kgid from_kgid_munged never fails and always + * returns a valid gid. This makes from_kgid_munged appropriate + * for use in syscalls like stat and getgid where failing the + * system call and failing to provide a valid gid are not options. + * + * If @kgid has no mapping in @targ overflowgid is returned. 
+ */ +gid_t from_kgid_munged(struct user_namespace *targ, kgid_t kgid) +{ + gid_t gid; + gid = from_kgid(targ, kgid); + + if (gid == (gid_t) -1) + gid = overflowgid; + return gid; +} +EXPORT_SYMBOL(from_kgid_munged); + +static int uid_m_show(struct seq_file *seq, void *v) +{ + struct user_namespace *ns = seq->private; + struct uid_gid_extent *extent = v; + struct user_namespace *lower_ns; + uid_t lower; - /* No useful relationship so no mapping */ - return overflowuid; + lower_ns = current_user_ns(); + if ((lower_ns == ns) && lower_ns->parent) + lower_ns = lower_ns->parent; + + lower = from_kuid(lower_ns, KUIDT_INIT(extent->lower_first)); + + seq_printf(seq, "%10u %10u %10u\n", + extent->first, + lower, + extent->count); + + return 0; } -gid_t user_ns_map_gid(struct user_namespace *to, const struct cred *cred, gid_t gid) +static int gid_m_show(struct seq_file *seq, void *v) { - struct user_namespace *tmp; + struct user_namespace *ns = seq->private; + struct uid_gid_extent *extent = v; + struct user_namespace *lower_ns; + gid_t lower; - if (likely(to == cred->user_ns)) - return gid; + lower_ns = current_user_ns(); + if ((lower_ns == ns) && lower_ns->parent) + lower_ns = lower_ns->parent; - /* Is cred->user the creator of the target user_ns - * or the creator of one of it's parents? + lower = from_kgid(lower_ns, KGIDT_INIT(extent->lower_first)); + + seq_printf(seq, "%10u %10u %10u\n", + extent->first, + lower, + extent->count); + + return 0; +} + +static void *m_start(struct seq_file *seq, loff_t *ppos, struct uid_gid_map *map) +{ + struct uid_gid_extent *extent = NULL; + loff_t pos = *ppos; + + if (pos < map->nr_extents) + extent = &map->extent[pos]; + + return extent; +} + +static void *uid_m_start(struct seq_file *seq, loff_t *ppos) +{ + struct user_namespace *ns = seq->private; + + return m_start(seq, ppos, &ns->uid_map); +} + +static void *gid_m_start(struct seq_file *seq, loff_t *ppos) +{ + struct user_namespace *ns = seq->private; + + return m_start(seq, ppos, &ns->gid_map); +} + +static void *m_next(struct seq_file *seq, void *v, loff_t *pos) +{ + (*pos)++; + return seq->op->start(seq, pos); +} + +static void m_stop(struct seq_file *seq, void *v) +{ + return; +} + +struct seq_operations proc_uid_seq_operations = { + .start = uid_m_start, + .stop = m_stop, + .next = m_next, + .show = uid_m_show, +}; + +struct seq_operations proc_gid_seq_operations = { + .start = gid_m_start, + .stop = m_stop, + .next = m_next, + .show = gid_m_show, +}; + +static DEFINE_MUTEX(id_map_mutex); + +static ssize_t map_write(struct file *file, const char __user *buf, + size_t count, loff_t *ppos, + int cap_setid, + struct uid_gid_map *map, + struct uid_gid_map *parent_map) +{ + struct seq_file *seq = file->private_data; + struct user_namespace *ns = seq->private; + struct uid_gid_map new_map; + unsigned idx; + struct uid_gid_extent *extent, *last = NULL; + unsigned long page = 0; + char *kbuf, *pos, *next_line; + ssize_t ret = -EINVAL; + + /* + * The id_map_mutex serializes all writes to any given map. + * + * Any map is only ever written once. + * + * An id map fits within 1 cache line on most architectures. + * + * On read nothing needs to be done unless you are on an + * architecture with a crazy cache coherency model like alpha. + * + * There is a one time data dependency between reading the + * count of the extents and the values of the extents. The + * desired behavior is to see the values of the extents that + * were written before the count of the extents. 
+ * + * To achieve this smp_wmb() is used on guarantee the write + * order and smp_read_barrier_depends() is guaranteed that we + * don't have crazy architectures returning stale data. + * + */ + mutex_lock(&id_map_mutex); + + ret = -EPERM; + /* Only allow one successful write to the map */ + if (map->nr_extents != 0) + goto out; + + /* Require the appropriate privilege CAP_SETUID or CAP_SETGID + * over the user namespace in order to set the id mapping. */ - for ( tmp = to; tmp != &init_user_ns; tmp = tmp->parent ) { - if (uid_eq(cred->user->uid, tmp->owner)) { - return (gid_t)0; + if (!ns_capable(ns, cap_setid)) + goto out; + + /* Get a buffer */ + ret = -ENOMEM; + page = __get_free_page(GFP_TEMPORARY); + kbuf = (char *) page; + if (!page) + goto out; + + /* Only allow <= page size writes at the beginning of the file */ + ret = -EINVAL; + if ((*ppos != 0) || (count >= PAGE_SIZE)) + goto out; + + /* Slurp in the user data */ + ret = -EFAULT; + if (copy_from_user(kbuf, buf, count)) + goto out; + kbuf[count] = '\0'; + + /* Parse the user data */ + ret = -EINVAL; + pos = kbuf; + new_map.nr_extents = 0; + for (;pos; pos = next_line) { + extent = &new_map.extent[new_map.nr_extents]; + + /* Find the end of line and ensure I don't look past it */ + next_line = strchr(pos, '\n'); + if (next_line) { + *next_line = '\0'; + next_line++; + if (*next_line == '\0') + next_line = NULL; } + + pos = skip_spaces(pos); + extent->first = simple_strtoul(pos, &pos, 10); + if (!isspace(*pos)) + goto out; + + pos = skip_spaces(pos); + extent->lower_first = simple_strtoul(pos, &pos, 10); + if (!isspace(*pos)) + goto out; + + pos = skip_spaces(pos); + extent->count = simple_strtoul(pos, &pos, 10); + if (*pos && !isspace(*pos)) + goto out; + + /* Verify there is not trailing junk on the line */ + pos = skip_spaces(pos); + if (*pos != '\0') + goto out; + + /* Verify we have been given valid starting values */ + if ((extent->first == (u32) -1) || + (extent->lower_first == (u32) -1 )) + goto out; + + /* Verify count is not zero and does not cause the extent to wrap */ + if ((extent->first + extent->count) <= extent->first) + goto out; + if ((extent->lower_first + extent->count) <= extent->lower_first) + goto out; + + /* For now only accept extents that are strictly in order */ + if (last && + (((last->first + last->count) > extent->first) || + ((last->lower_first + last->count) > extent->lower_first))) + goto out; + + new_map.nr_extents++; + last = extent; + + /* Fail if the file contains too many extents */ + if ((new_map.nr_extents == UID_GID_MAP_MAX_EXTENTS) && + (next_line != NULL)) + goto out; } + /* Be very certaint the new map actually exists */ + if (new_map.nr_extents == 0) + goto out; + + ret = -EPERM; + /* Validate the user is allowed to use user id's mapped to. */ + if (!new_idmap_permitted(ns, cap_setid, &new_map)) + goto out; + + /* Map the lower ids from the parent user namespace to the + * kernel global id space. + */ + for (idx = 0; idx < new_map.nr_extents; idx++) { + u32 lower_first; + extent = &new_map.extent[idx]; + + lower_first = map_id_range_down(parent_map, + extent->lower_first, + extent->count); + + /* Fail if we can not map the specified extent to + * the kernel global id space. 
+ */ + if (lower_first == (u32) -1) + goto out; + + extent->lower_first = lower_first; + } + + /* Install the map */ + memcpy(map->extent, new_map.extent, + new_map.nr_extents*sizeof(new_map.extent[0])); + smp_wmb(); + map->nr_extents = new_map.nr_extents; + + *ppos = count; + ret = count; +out: + mutex_unlock(&id_map_mutex); + if (page) + free_page(page); + return ret; +} + +ssize_t proc_uid_map_write(struct file *file, const char __user *buf, size_t size, loff_t *ppos) +{ + struct seq_file *seq = file->private_data; + struct user_namespace *ns = seq->private; + + if (!ns->parent) + return -EPERM; + + return map_write(file, buf, size, ppos, CAP_SETUID, + &ns->uid_map, &ns->parent->uid_map); +} + +ssize_t proc_gid_map_write(struct file *file, const char __user *buf, size_t size, loff_t *ppos) +{ + struct seq_file *seq = file->private_data; + struct user_namespace *ns = seq->private; + + if (!ns->parent) + return -EPERM; + + return map_write(file, buf, size, ppos, CAP_SETGID, + &ns->gid_map, &ns->parent->gid_map); +} + +static bool new_idmap_permitted(struct user_namespace *ns, int cap_setid, + struct uid_gid_map *new_map) +{ + /* Allow the specified ids if we have the appropriate capability + * (CAP_SETUID or CAP_SETGID) over the parent user namespace. + */ + if (ns_capable(ns->parent, cap_setid)) + return true; - /* No useful relationship so no mapping */ - return overflowgid; + return false; } static __init int user_namespaces_init(void) -- cgit v1.2.3 From 8239c25f47d2b318156993b15f33900a86ea5e17 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Fri, 20 Apr 2012 13:05:42 +0000 Subject: smp: Add task_struct argument to __cpu_up() Preparatory patch to make the idle thread allocation for secondary cpus generic. Signed-off-by: Thomas Gleixner Cc: Peter Zijlstra Cc: Rusty Russell Cc: Paul E. McKenney Cc: Srivatsa S. Bhat Cc: Matt Turner Cc: Russell King Cc: Mike Frysinger Cc: Jesper Nilsson Cc: Richard Kuo Cc: Tony Luck Cc: Hirokazu Takata Cc: Ralf Baechle Cc: David Howells Cc: James E.J. Bottomley Cc: Benjamin Herrenschmidt Cc: Martin Schwidefsky Cc: Paul Mundt Cc: David S. Miller Cc: Chris Metcalf Cc: Richard Weinberger Cc: x86@kernel.org Link: http://lkml.kernel.org/r/20120420124556.964170564@linutronix.de --- kernel/cpu.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/cpu.c b/kernel/cpu.c index 2060c6e57027..e711aef0fb3c 100644 --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -309,7 +309,7 @@ static int __cpuinit _cpu_up(unsigned int cpu, int tasks_frozen) } /* Arch-specific enabling code. */ - ret = __cpu_up(cpu); + ret = __cpu_up(cpu, NULL); if (ret != 0) goto out_notify; BUG_ON(!cpu_online(cpu)); -- cgit v1.2.3 From 38498a67aa2cf8c80754b8d304bfacc10bc582b5 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Fri, 20 Apr 2012 13:05:44 +0000 Subject: smp: Add generic smpboot facility Start a new file, which will hold SMP and CPU hotplug related generic infrastructure. Signed-off-by: Thomas Gleixner Cc: Peter Zijlstra Cc: Rusty Russell Cc: Paul E. McKenney Cc: Srivatsa S. Bhat Cc: Matt Turner Cc: Russell King Cc: Mike Frysinger Cc: Jesper Nilsson Cc: Richard Kuo Cc: Tony Luck Cc: Hirokazu Takata Cc: Ralf Baechle Cc: David Howells Cc: James E.J. Bottomley Cc: Benjamin Herrenschmidt Cc: Martin Schwidefsky Cc: Paul Mundt Cc: David S. 
Miller Cc: Chris Metcalf Cc: Richard Weinberger Cc: x86@kernel.org Link: http://lkml.kernel.org/r/20120420124557.035417523@linutronix.de --- kernel/Makefile | 1 + kernel/cpu.c | 8 ++++++++ kernel/smpboot.c | 14 ++++++++++++++ kernel/smpboot.h | 6 ++++++ 4 files changed, 29 insertions(+) create mode 100644 kernel/smpboot.c create mode 100644 kernel/smpboot.h (limited to 'kernel') diff --git a/kernel/Makefile b/kernel/Makefile index cb41b9547c9f..6c07f30fa9b7 100644 --- a/kernel/Makefile +++ b/kernel/Makefile @@ -43,6 +43,7 @@ obj-$(CONFIG_DEBUG_RT_MUTEXES) += rtmutex-debug.o obj-$(CONFIG_RT_MUTEX_TESTER) += rtmutex-tester.o obj-$(CONFIG_GENERIC_ISA_DMA) += dma.o obj-$(CONFIG_SMP) += smp.o +obj-$(CONFIG_SMP) += smpboot.o ifneq ($(CONFIG_SMP),y) obj-y += up.o endif diff --git a/kernel/cpu.c b/kernel/cpu.c index e711aef0fb3c..e58b99ada3d8 100644 --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -17,6 +17,8 @@ #include #include +#include "smpboot.h" + #ifdef CONFIG_SMP /* Serializes the updates to cpu_online_mask, cpu_present_mask */ static DEFINE_MUTEX(cpu_add_remove_lock); @@ -300,6 +302,11 @@ static int __cpuinit _cpu_up(unsigned int cpu, int tasks_frozen) return -EINVAL; cpu_hotplug_begin(); + + ret = smpboot_prepare(cpu); + if (ret) + goto out; + ret = __cpu_notify(CPU_UP_PREPARE | mod, hcpu, -1, &nr_calls); if (ret) { nr_calls--; @@ -320,6 +327,7 @@ static int __cpuinit _cpu_up(unsigned int cpu, int tasks_frozen) out_notify: if (ret != 0) __cpu_notify(CPU_UP_CANCELED | mod, hcpu, nr_calls, NULL); +out: cpu_hotplug_done(); return ret; diff --git a/kernel/smpboot.c b/kernel/smpboot.c new file mode 100644 index 000000000000..6dae6a3d2d59 --- /dev/null +++ b/kernel/smpboot.c @@ -0,0 +1,14 @@ +/* + * Common SMP CPU bringup/teardown functions + */ +#include + +#include "smpboot.h" + +/** + * smpboot_prepare - generic smpboot preparation + */ +int __cpuinit smpboot_prepare(unsigned int cpu) +{ + return 0; +} diff --git a/kernel/smpboot.h b/kernel/smpboot.h new file mode 100644 index 000000000000..d88e77165086 --- /dev/null +++ b/kernel/smpboot.h @@ -0,0 +1,6 @@ +#ifndef SMPBOOT_H +#define SMPBOOT_H + +int smpboot_prepare(unsigned int cpu); + +#endif -- cgit v1.2.3 From 29d5e0476e1c4a513859e7858845ad172f560389 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Fri, 20 Apr 2012 13:05:45 +0000 Subject: smp: Provide generic idle thread allocation All SMP architectures have magic to fork the idle task and to store it for reusage when cpu hotplug is enabled. Provide a generic infrastructure for it. Create/reinit the idle thread for the cpu which is brought up in the generic code and hand the thread pointer to the architecture code via __cpu_up(). Note, that fork_idle() is called via a workqueue, because this guarantees that the idle thread does not get a reference to a user space VM. This can happen when the boot process did not bring up all possible cpus and a later cpu_up() is initiated via the sysfs interface. In that case fork_idle() would be called in the context of the user space task and take a reference on the user space VM. Signed-off-by: Thomas Gleixner Cc: Peter Zijlstra Cc: Rusty Russell Cc: Paul E. McKenney Cc: Srivatsa S. Bhat Cc: Matt Turner Cc: Russell King Cc: Mike Frysinger Cc: Jesper Nilsson Cc: Richard Kuo Cc: Tony Luck Cc: Hirokazu Takata Cc: Ralf Baechle Cc: David Howells Cc: James E.J. Bottomley Cc: Benjamin Herrenschmidt Cc: Martin Schwidefsky Cc: Paul Mundt Cc: David S. 
Miller Cc: Chris Metcalf Cc: Richard Weinberger Cc: x86@kernel.org Acked-by: Venkatesh Pallipadi Link: http://lkml.kernel.org/r/20120420124557.102478630@linutronix.de --- kernel/cpu.c | 2 +- kernel/sched/core.c | 2 ++ kernel/smpboot.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++++++- kernel/smpboot.h | 10 +++++++ 4 files changed, 96 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/cpu.c b/kernel/cpu.c index e58b99ada3d8..05c46bae5e55 100644 --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -316,7 +316,7 @@ static int __cpuinit _cpu_up(unsigned int cpu, int tasks_frozen) } /* Arch-specific enabling code. */ - ret = __cpu_up(cpu, NULL); + ret = __cpu_up(cpu, idle_thread_get(cpu)); if (ret != 0) goto out_notify; BUG_ON(!cpu_online(cpu)); diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 4603b9d8f30a..6a63cde23d03 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -83,6 +83,7 @@ #include "sched.h" #include "../workqueue_sched.h" +#include "../smpboot.h" #define CREATE_TRACE_POINTS #include @@ -7049,6 +7050,7 @@ void __init sched_init(void) /* May be allocated at isolcpus cmdline parse time */ if (cpu_isolated_map == NULL) zalloc_cpumask_var(&cpu_isolated_map, GFP_NOWAIT); + idle_thread_set_boot_cpu(); #endif init_sched_fair_class(); diff --git a/kernel/smpboot.c b/kernel/smpboot.c index 6dae6a3d2d59..ed1576981801 100644 --- a/kernel/smpboot.c +++ b/kernel/smpboot.c @@ -1,14 +1,96 @@ /* * Common SMP CPU bringup/teardown functions */ +#include +#include #include +#include +#include +#include #include "smpboot.h" +#ifdef CONFIG_GENERIC_SMP_IDLE_THREAD +struct create_idle { + struct work_struct work; + struct task_struct *idle; + struct completion done; + unsigned int cpu; +}; + +static void __cpuinit do_fork_idle(struct work_struct *work) +{ + struct create_idle *c = container_of(work, struct create_idle, work); + + c->idle = fork_idle(c->cpu); + complete(&c->done); +} + +static struct task_struct * __cpuinit idle_thread_create(unsigned int cpu) +{ + struct create_idle c_idle = { + .cpu = cpu, + .done = COMPLETION_INITIALIZER_ONSTACK(c_idle.done), + }; + + INIT_WORK_ONSTACK(&c_idle.work, do_fork_idle); + schedule_work(&c_idle.work); + wait_for_completion(&c_idle.done); + destroy_work_on_stack(&c_idle.work); + return c_idle.idle; +} + +/* + * For the hotplug case we keep the task structs around and reuse + * them. + */ +static DEFINE_PER_CPU(struct task_struct *, idle_threads); + +static inline struct task_struct *get_idle_for_cpu(unsigned int cpu) +{ + struct task_struct *tsk = per_cpu(idle_threads, cpu); + + if (!tsk) + return idle_thread_create(cpu); + init_idle(tsk, cpu); + return tsk; +} + +struct task_struct * __cpuinit idle_thread_get(unsigned int cpu) +{ + return per_cpu(idle_threads, cpu); +} + +void __init idle_thread_set_boot_cpu(void) +{ + per_cpu(idle_threads, smp_processor_id()) = current; +} + +/** + * idle_thread_init - Initialize the idle thread for a cpu + * @cpu: The cpu for which the idle thread should be initialized + * + * Creates the thread if it does not exist. 
+ */ +static int __cpuinit idle_thread_init(unsigned int cpu) +{ + struct task_struct *idle = get_idle_for_cpu(cpu); + + if (IS_ERR(idle)) { + printk(KERN_ERR "failed fork for CPU %u\n", cpu); + return PTR_ERR(idle); + } + per_cpu(idle_threads, cpu) = idle; + return 0; +} +#else +static inline int idle_thread_init(unsigned int cpu) { return 0; } +#endif + /** * smpboot_prepare - generic smpboot preparation */ int __cpuinit smpboot_prepare(unsigned int cpu) { - return 0; + return idle_thread_init(cpu); } diff --git a/kernel/smpboot.h b/kernel/smpboot.h index d88e77165086..7943bbbab917 100644 --- a/kernel/smpboot.h +++ b/kernel/smpboot.h @@ -1,6 +1,16 @@ #ifndef SMPBOOT_H #define SMPBOOT_H +struct task_struct; + int smpboot_prepare(unsigned int cpu); +#ifdef CONFIG_GENERIC_SMP_IDLE_THREAD +struct task_struct *idle_thread_get(unsigned int cpu); +void idle_thread_set_boot_cpu(void); +#else +static inline struct task_struct *idle_thread_get(unsigned int cpu) { return NULL; } +static inline void idle_thread_set_boot_cpu(void) { } +#endif + #endif -- cgit v1.2.3 From 33b07b8be7f0e1e8e4184e3473d71f174e4b0641 Mon Sep 17 00:00:00 2001 From: Robert Richter Date: Thu, 5 Apr 2012 18:24:43 +0200 Subject: perf: Use static variant of perf_event_overflow in core.c No need to have an additional function layer. Signed-off-by: Robert Richter Signed-off-by: Peter Zijlstra Link: http://lkml.kernel.org/r/1333643084-26776-4-git-send-email-robert.richter@amd.com Signed-off-by: Ingo Molnar --- kernel/events/core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/events/core.c b/kernel/events/core.c index a6a9ec4cd8f5..9789a56b7d54 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -5350,7 +5350,7 @@ static enum hrtimer_restart perf_swevent_hrtimer(struct hrtimer *hrtimer) if (regs && !perf_exclude_event(event, regs)) { if (!(event->attr.exclude_idle && is_idle_task(current))) - if (perf_event_overflow(event, &data, regs)) + if (__perf_event_overflow(event, 1, &data, regs)) ret = HRTIMER_NORESTART; } -- cgit v1.2.3 From 79b9a75fb703b6a2670e46b9dc495af5bc7029b3 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Mon, 23 Apr 2012 10:06:39 -0700 Subject: rcu: Add warning for RCU_FAST_NO_HZ timer firing RCU_FAST_NO_HZ uses a timer to limit the time that a CPU with callbacks can remain in dyntick-idle mode. This timer is cancelled when the CPU exits idle, and therefore should never fire. However, if the timer were migrated to some other CPU for whatever reason (1) the timer could actually fire and (2) firing on some other CPU would fail to wake up the CPU with callbacks, possibly resulting in sluggishness or a system hang. This commit therfore adds a WARN_ON_ONCE() to the timer handler in order to detect this condition. Signed-off-by: Paul E. McKenney Signed-off-by: Paul E. McKenney --- kernel/rcutree_plugin.h | 1 + 1 file changed, 1 insertion(+) (limited to 'kernel') diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h index 50c17975d4f4..ad61da79b311 100644 --- a/kernel/rcutree_plugin.h +++ b/kernel/rcutree_plugin.h @@ -2065,6 +2065,7 @@ static bool rcu_cpu_has_nonlazy_callbacks(int cpu) */ static void rcu_idle_gp_timer_func(unsigned long unused) { + WARN_ON_ONCE(1); /* Getting here can hang the system... */ trace_rcu_prep_idle("Timer"); } -- cgit v1.2.3 From 048a0e8f5e1d94c01a5fc70f5b2f2fd2f4527326 Mon Sep 17 00:00:00 2001 From: "Paul E. 
McKenney" Date: Thu, 26 Apr 2012 10:52:27 -0700 Subject: timer: Fix mod_timer_pinned() header comment The mod_timer_pinned() header comment states that it prevents timers from being migrated to a different CPU. This is not the case, instead, it ensures that the timer is posted to the current CPU, but does nothing to prevent CPU-hotplug operations from migrating the timer. This commit therefore brings the comment header into alignment with reality. Signed-off-by: Paul E. McKenney Signed-off-by: Paul E. McKenney Acked-by: Steven Rostedt --- kernel/timer.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/timer.c b/kernel/timer.c index a297ffcf888e..837c552fe838 100644 --- a/kernel/timer.c +++ b/kernel/timer.c @@ -861,7 +861,13 @@ EXPORT_SYMBOL(mod_timer); * * mod_timer_pinned() is a way to update the expire field of an * active timer (if the timer is inactive it will be activated) - * and not allow the timer to be migrated to a different CPU. + * and to ensure that the timer is scheduled on the current CPU. + * + * Note that this does not prevent the timer from being migrated + * when the current CPU goes offline. If this is a problem for + * you, use CPU-hotplug notifiers to handle it correctly, for + * example, cancelling the timer when the corresponding CPU goes + * offline. * * mod_timer_pinned(timer, expires) is equivalent to: * -- cgit v1.2.3 From 4d8438f044d8aaac6fbba98316ba484dabea397d Mon Sep 17 00:00:00 2001 From: Frederic Weisbecker Date: Wed, 25 Apr 2012 01:11:35 +0200 Subject: res_counter: Merge res_counter_charge and res_counter_charge_nofail These two functions do almost the same thing and duplicate some code. Merge their implementation into a single common function. res_counter_charge_locked() takes one more parameter but it doesn't seem to be used outside res_counter.c yet anyway. There is no (intended) change in the behaviour. Signed-off-by: Frederic Weisbecker Signed-off-by: Tejun Heo Acked-by: KAMEZAWA Hiroyuki Acked-by: Glauber Costa Acked-by: Kirill A. 
Shutemov Cc: Li Zefan --- kernel/res_counter.c | 73 ++++++++++++++++++++++++---------------------------- 1 file changed, 34 insertions(+), 39 deletions(-) (limited to 'kernel') diff --git a/kernel/res_counter.c b/kernel/res_counter.c index d508363858b3..07a29923aba2 100644 --- a/kernel/res_counter.c +++ b/kernel/res_counter.c @@ -22,75 +22,70 @@ void res_counter_init(struct res_counter *counter, struct res_counter *parent) counter->parent = parent; } -int res_counter_charge_locked(struct res_counter *counter, unsigned long val) +int res_counter_charge_locked(struct res_counter *counter, unsigned long val, + bool force) { + int ret = 0; + if (counter->usage + val > counter->limit) { counter->failcnt++; - return -ENOMEM; + ret = -ENOMEM; + if (!force) + return ret; } counter->usage += val; - if (counter->usage > counter->max_usage) + if (!force && counter->usage > counter->max_usage) counter->max_usage = counter->usage; - return 0; + return ret; } -int res_counter_charge(struct res_counter *counter, unsigned long val, - struct res_counter **limit_fail_at) +static int __res_counter_charge(struct res_counter *counter, unsigned long val, + struct res_counter **limit_fail_at, bool force) { - int ret; + int ret, r; unsigned long flags; struct res_counter *c, *u; + r = ret = 0; *limit_fail_at = NULL; local_irq_save(flags); for (c = counter; c != NULL; c = c->parent) { spin_lock(&c->lock); - ret = res_counter_charge_locked(c, val); + r = res_counter_charge_locked(c, val, force); spin_unlock(&c->lock); - if (ret < 0) { + if (r < 0 && !ret) { + ret = r; *limit_fail_at = c; - goto undo; + if (!force) + break; } } - ret = 0; - goto done; -undo: - for (u = counter; u != c; u = u->parent) { - spin_lock(&u->lock); - res_counter_uncharge_locked(u, val); - spin_unlock(&u->lock); + + if (ret < 0 && !force) { + for (u = counter; u != c; u = u->parent) { + spin_lock(&u->lock); + res_counter_uncharge_locked(u, val); + spin_unlock(&u->lock); + } } -done: local_irq_restore(flags); + return ret; } +int res_counter_charge(struct res_counter *counter, unsigned long val, + struct res_counter **limit_fail_at) +{ + return __res_counter_charge(counter, val, limit_fail_at, false); +} + int res_counter_charge_nofail(struct res_counter *counter, unsigned long val, struct res_counter **limit_fail_at) { - int ret, r; - unsigned long flags; - struct res_counter *c; - - r = ret = 0; - *limit_fail_at = NULL; - local_irq_save(flags); - for (c = counter; c != NULL; c = c->parent) { - spin_lock(&c->lock); - r = res_counter_charge_locked(c, val); - if (r) - c->usage += val; - spin_unlock(&c->lock); - if (r < 0 && ret == 0) { - *limit_fail_at = c; - ret = r; - } - } - local_irq_restore(flags); - - return ret; + return __res_counter_charge(counter, val, limit_fail_at, true); } + void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val) { if (WARN_ON(counter->usage < val)) -- cgit v1.2.3 From 0d4dde1ac9a5af74ac76c6ab90557d1ae7b8f5d8 Mon Sep 17 00:00:00 2001 From: Frederic Weisbecker Date: Wed, 25 Apr 2012 01:11:36 +0200 Subject: res_counter: Account max_usage when calling res_counter_charge_nofail() Updating max_usage is something one would expect when we reach a new maximum usage value even when we do this by forcing through the limit with res_counter_charge_nofail(). (Whether we want to account failcnt when we force through the limit is another debate). Signed-off-by: Frederic Weisbecker Signed-off-by: Tejun Heo Acked-by: KAMEZAWA Hiroyuki Acked-by: Glauber Costa Acked-by: Kirill A. 
Shutemov Cc: Li Zefan --- kernel/res_counter.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/res_counter.c b/kernel/res_counter.c index 07a29923aba2..bebe2b170d49 100644 --- a/kernel/res_counter.c +++ b/kernel/res_counter.c @@ -35,7 +35,7 @@ int res_counter_charge_locked(struct res_counter *counter, unsigned long val, } counter->usage += val; - if (!force && counter->usage > counter->max_usage) + if (counter->usage > counter->max_usage) counter->max_usage = counter->usage; return ret; } -- cgit v1.2.3 From fae4b54f28f034d228fa3bfc98858c698b64e89c Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Mon, 20 Feb 2012 17:51:45 -0800 Subject: rcu: Introduce rcutorture testing for rcu_barrier() Although rcutorture does invoke rcu_barrier() and friends, it cannot really be called a torture test given that it invokes them only once at the end of the test. This commit therefore introduces heavy-duty rcutorture testing for rcu_barrier(), which may be carried out concurrently with normal rcutorture testing. Signed-off-by: Paul E. McKenney Signed-off-by: Paul E. McKenney --- kernel/rcutorture.c | 194 +++++++++++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 186 insertions(+), 8 deletions(-) (limited to 'kernel') diff --git a/kernel/rcutorture.c b/kernel/rcutorture.c index 1463a0636443..8cd262b41499 100644 --- a/kernel/rcutorture.c +++ b/kernel/rcutorture.c @@ -64,6 +64,7 @@ static int irqreader = 1; /* RCU readers from irq (timers). */ static int fqs_duration; /* Duration of bursts (us), 0 to disable. */ static int fqs_holdoff; /* Hold time within burst (us). */ static int fqs_stutter = 3; /* Wait time between bursts (s). */ +static int n_barrier_cbs; /* Number of callbacks to test RCU barriers. */ static int onoff_interval; /* Wait time between CPU hotplugs, 0=disable. */ static int onoff_holdoff; /* Seconds after boot before CPU hotplugs. */ static int shutdown_secs; /* Shutdown time (s). <=0 for no shutdown. */ @@ -96,6 +97,8 @@ module_param(fqs_holdoff, int, 0444); MODULE_PARM_DESC(fqs_holdoff, "Holdoff time within fqs bursts (us)"); module_param(fqs_stutter, int, 0444); MODULE_PARM_DESC(fqs_stutter, "Wait time between fqs bursts (s)"); +module_param(n_barrier_cbs, int, 0444); +MODULE_PARM_DESC(n_barrier_cbs, "# of callbacks/kthreads for barrier testing"); module_param(onoff_interval, int, 0444); MODULE_PARM_DESC(onoff_interval, "Time between CPU hotplugs (s), 0=disable"); module_param(onoff_holdoff, int, 0444); @@ -139,6 +142,8 @@ static struct task_struct *shutdown_task; static struct task_struct *onoff_task; #endif /* #ifdef CONFIG_HOTPLUG_CPU */ static struct task_struct *stall_task; +static struct task_struct **barrier_cbs_tasks; +static struct task_struct *barrier_task; #define RCU_TORTURE_PIPE_LEN 10 @@ -164,6 +169,7 @@ static atomic_t n_rcu_torture_alloc_fail; static atomic_t n_rcu_torture_free; static atomic_t n_rcu_torture_mberror; static atomic_t n_rcu_torture_error; +static long n_rcu_torture_barrier_error; static long n_rcu_torture_boost_ktrerror; static long n_rcu_torture_boost_rterror; static long n_rcu_torture_boost_failure; @@ -173,6 +179,8 @@ static long n_offline_attempts; static long n_offline_successes; static long n_online_attempts; static long n_online_successes; +static long n_barrier_attempts; +static long n_barrier_successes; static struct list_head rcu_torture_removed; static cpumask_var_t shuffle_tmp_mask; @@ -197,6 +205,10 @@ static unsigned long shutdown_time; /* jiffies to system shutdown. 
*/ static unsigned long boost_starttime; /* jiffies of next boost test start. */ DEFINE_MUTEX(boost_mutex); /* protect setting boost_starttime */ /* and boost task create/destroy. */ +static atomic_t barrier_cbs_count; /* Barrier callbacks registered. */ +static atomic_t barrier_cbs_invoked; /* Barrier callbacks invoked. */ +static wait_queue_head_t *barrier_cbs_wq; /* Coordinate barrier testing. */ +static DECLARE_WAIT_QUEUE_HEAD(barrier_wq); /* Mediate rmmod and system shutdown. Concurrent rmmod & shutdown illegal! */ @@ -327,6 +339,7 @@ struct rcu_torture_ops { int (*completed)(void); void (*deferred_free)(struct rcu_torture *p); void (*sync)(void); + void (*call)(struct rcu_head *head, void (*func)(struct rcu_head *rcu)); void (*cb_barrier)(void); void (*fqs)(void); int (*stats)(char *page); @@ -417,6 +430,7 @@ static struct rcu_torture_ops rcu_ops = { .completed = rcu_torture_completed, .deferred_free = rcu_torture_deferred_free, .sync = synchronize_rcu, + .call = call_rcu, .cb_barrier = rcu_barrier, .fqs = rcu_force_quiescent_state, .stats = NULL, @@ -460,6 +474,7 @@ static struct rcu_torture_ops rcu_sync_ops = { .completed = rcu_torture_completed, .deferred_free = rcu_sync_torture_deferred_free, .sync = synchronize_rcu, + .call = NULL, .cb_barrier = NULL, .fqs = rcu_force_quiescent_state, .stats = NULL, @@ -477,6 +492,7 @@ static struct rcu_torture_ops rcu_expedited_ops = { .completed = rcu_no_completed, .deferred_free = rcu_sync_torture_deferred_free, .sync = synchronize_rcu_expedited, + .call = NULL, .cb_barrier = NULL, .fqs = rcu_force_quiescent_state, .stats = NULL, @@ -519,6 +535,7 @@ static struct rcu_torture_ops rcu_bh_ops = { .completed = rcu_bh_torture_completed, .deferred_free = rcu_bh_torture_deferred_free, .sync = synchronize_rcu_bh, + .call = call_rcu_bh, .cb_barrier = rcu_barrier_bh, .fqs = rcu_bh_force_quiescent_state, .stats = NULL, @@ -535,6 +552,7 @@ static struct rcu_torture_ops rcu_bh_sync_ops = { .completed = rcu_bh_torture_completed, .deferred_free = rcu_sync_torture_deferred_free, .sync = synchronize_rcu_bh, + .call = NULL, .cb_barrier = NULL, .fqs = rcu_bh_force_quiescent_state, .stats = NULL, @@ -551,6 +569,7 @@ static struct rcu_torture_ops rcu_bh_expedited_ops = { .completed = rcu_bh_torture_completed, .deferred_free = rcu_sync_torture_deferred_free, .sync = synchronize_rcu_bh_expedited, + .call = NULL, .cb_barrier = NULL, .fqs = rcu_bh_force_quiescent_state, .stats = NULL, @@ -637,6 +656,7 @@ static struct rcu_torture_ops srcu_ops = { .completed = srcu_torture_completed, .deferred_free = rcu_sync_torture_deferred_free, .sync = srcu_torture_synchronize, + .call = NULL, .cb_barrier = NULL, .stats = srcu_torture_stats, .name = "srcu" @@ -661,6 +681,7 @@ static struct rcu_torture_ops srcu_raw_ops = { .completed = srcu_torture_completed, .deferred_free = rcu_sync_torture_deferred_free, .sync = srcu_torture_synchronize, + .call = NULL, .cb_barrier = NULL, .stats = srcu_torture_stats, .name = "srcu_raw" @@ -680,6 +701,7 @@ static struct rcu_torture_ops srcu_expedited_ops = { .completed = srcu_torture_completed, .deferred_free = rcu_sync_torture_deferred_free, .sync = srcu_torture_synchronize_expedited, + .call = NULL, .cb_barrier = NULL, .stats = srcu_torture_stats, .name = "srcu_expedited" @@ -1129,7 +1151,8 @@ rcu_torture_printk(char *page) "rtc: %p ver: %lu tfle: %d rta: %d rtaf: %d rtf: %d " "rtmbe: %d rtbke: %ld rtbre: %ld " "rtbf: %ld rtb: %ld nt: %ld " - "onoff: %ld/%ld:%ld/%ld", + "onoff: %ld/%ld:%ld/%ld " + "barrier: %ld/%ld:%ld", rcu_torture_current, 
rcu_torture_current_version, list_empty(&rcu_torture_freelist), @@ -1145,14 +1168,17 @@ rcu_torture_printk(char *page) n_online_successes, n_online_attempts, n_offline_successes, - n_offline_attempts); + n_offline_attempts, + n_barrier_successes, + n_barrier_attempts, + n_rcu_torture_barrier_error); + cnt += sprintf(&page[cnt], "\n%s%s ", torture_type, TORTURE_FLAG); if (atomic_read(&n_rcu_torture_mberror) != 0 || + n_rcu_torture_barrier_error != 0 || n_rcu_torture_boost_ktrerror != 0 || n_rcu_torture_boost_rterror != 0 || - n_rcu_torture_boost_failure != 0) - cnt += sprintf(&page[cnt], " !!!"); - cnt += sprintf(&page[cnt], "\n%s%s ", torture_type, TORTURE_FLAG); - if (i > 1) { + n_rcu_torture_boost_failure != 0 || + i > 1) { cnt += sprintf(&page[cnt], "!!! "); atomic_inc(&n_rcu_torture_error); WARN_ON_ONCE(1); @@ -1560,6 +1586,151 @@ static void rcu_torture_stall_cleanup(void) stall_task = NULL; } +/* Callback function for RCU barrier testing. */ +void rcu_torture_barrier_cbf(struct rcu_head *rcu) +{ + atomic_inc(&barrier_cbs_invoked); +} + +/* kthread function to register callbacks used to test RCU barriers. */ +static int rcu_torture_barrier_cbs(void *arg) +{ + long myid = (long)arg; + struct rcu_head rcu; + + init_rcu_head_on_stack(&rcu); + VERBOSE_PRINTK_STRING("rcu_torture_barrier_cbs task started"); + set_user_nice(current, 19); + do { + wait_event(barrier_cbs_wq[myid], + atomic_read(&barrier_cbs_count) == n_barrier_cbs || + kthread_should_stop() || + fullstop != FULLSTOP_DONTSTOP); + if (kthread_should_stop() || fullstop != FULLSTOP_DONTSTOP) + break; + cur_ops->call(&rcu, rcu_torture_barrier_cbf); + if (atomic_dec_and_test(&barrier_cbs_count)) + wake_up(&barrier_wq); + } while (!kthread_should_stop() && fullstop == FULLSTOP_DONTSTOP); + VERBOSE_PRINTK_STRING("rcu_torture_barrier_cbs task stopping"); + rcutorture_shutdown_absorb("rcu_torture_barrier_cbs"); + while (!kthread_should_stop()) + schedule_timeout_interruptible(1); + cur_ops->cb_barrier(); + destroy_rcu_head_on_stack(&rcu); + return 0; +} + +/* kthread function to drive and coordinate RCU barrier testing. */ +static int rcu_torture_barrier(void *arg) +{ + int i; + + VERBOSE_PRINTK_STRING("rcu_torture_barrier task starting"); + do { + atomic_set(&barrier_cbs_invoked, 0); + atomic_set(&barrier_cbs_count, n_barrier_cbs); + /* wake_up() path contains the required barriers. */ + for (i = 0; i < n_barrier_cbs; i++) + wake_up(&barrier_cbs_wq[i]); + wait_event(barrier_wq, + atomic_read(&barrier_cbs_count) == 0 || + kthread_should_stop() || + fullstop != FULLSTOP_DONTSTOP); + if (kthread_should_stop() || fullstop != FULLSTOP_DONTSTOP) + break; + n_barrier_attempts++; + cur_ops->cb_barrier(); + if (atomic_read(&barrier_cbs_invoked) != n_barrier_cbs) { + n_rcu_torture_barrier_error++; + WARN_ON_ONCE(1); + } + n_barrier_successes++; + schedule_timeout_interruptible(HZ / 10); + } while (!kthread_should_stop() && fullstop == FULLSTOP_DONTSTOP); + VERBOSE_PRINTK_STRING("rcu_torture_barrier task stopping"); + rcutorture_shutdown_absorb("rcu_torture_barrier_cbs"); + while (!kthread_should_stop()) + schedule_timeout_interruptible(1); + return 0; +} + +/* Initialize RCU barrier testing. 
*/ +static int rcu_torture_barrier_init(void) +{ + int i; + int ret; + + if (n_barrier_cbs == 0) + return 0; + if (cur_ops->call == NULL || cur_ops->cb_barrier == NULL) { + printk(KERN_ALERT "%s" TORTURE_FLAG + " Call or barrier ops missing for %s,\n", + torture_type, cur_ops->name); + printk(KERN_ALERT "%s" TORTURE_FLAG + " RCU barrier testing omitted from run.\n", + torture_type); + return 0; + } + atomic_set(&barrier_cbs_count, 0); + atomic_set(&barrier_cbs_invoked, 0); + barrier_cbs_tasks = + kzalloc(n_barrier_cbs * sizeof(barrier_cbs_tasks[0]), + GFP_KERNEL); + barrier_cbs_wq = + kzalloc(n_barrier_cbs * sizeof(barrier_cbs_wq[0]), + GFP_KERNEL); + if (barrier_cbs_tasks == NULL || barrier_cbs_wq == 0) + return -ENOMEM; + for (i = 0; i < n_barrier_cbs; i++) { + init_waitqueue_head(&barrier_cbs_wq[i]); + barrier_cbs_tasks[i] = kthread_run(rcu_torture_barrier_cbs, + (void *)i, + "rcu_torture_barrier_cbs"); + if (IS_ERR(barrier_cbs_tasks[i])) { + ret = PTR_ERR(barrier_cbs_tasks[i]); + VERBOSE_PRINTK_ERRSTRING("Failed to create rcu_torture_barrier_cbs"); + barrier_cbs_tasks[i] = NULL; + return ret; + } + } + barrier_task = kthread_run(rcu_torture_barrier, NULL, + "rcu_torture_barrier"); + if (IS_ERR(barrier_task)) { + ret = PTR_ERR(barrier_task); + VERBOSE_PRINTK_ERRSTRING("Failed to create rcu_torture_barrier"); + barrier_task = NULL; + } + return 0; +} + +/* Clean up after RCU barrier testing. */ +static void rcu_torture_barrier_cleanup(void) +{ + int i; + + if (barrier_task != NULL) { + VERBOSE_PRINTK_STRING("Stopping rcu_torture_barrier task"); + kthread_stop(barrier_task); + barrier_task = NULL; + } + if (barrier_cbs_tasks != NULL) { + for (i = 0; i < n_barrier_cbs; i++) { + if (barrier_cbs_tasks[i] != NULL) { + VERBOSE_PRINTK_STRING("Stopping rcu_torture_barrier_cbs task"); + kthread_stop(barrier_cbs_tasks[i]); + barrier_cbs_tasks[i] = NULL; + } + } + kfree(barrier_cbs_tasks); + barrier_cbs_tasks = NULL; + } + if (barrier_cbs_wq != NULL) { + kfree(barrier_cbs_wq); + barrier_cbs_wq = NULL; + } +} + static int rcutorture_cpu_notify(struct notifier_block *self, unsigned long action, void *hcpu) { @@ -1602,6 +1773,7 @@ rcu_torture_cleanup(void) fullstop = FULLSTOP_RMMOD; mutex_unlock(&fullstop_mutex); unregister_reboot_notifier(&rcutorture_shutdown_nb); + rcu_torture_barrier_cleanup(); rcu_torture_stall_cleanup(); if (stutter_task) { VERBOSE_PRINTK_STRING("Stopping rcu_torture_stutter task"); @@ -1681,7 +1853,7 @@ rcu_torture_cleanup(void) if (cur_ops->cleanup) cur_ops->cleanup(); - if (atomic_read(&n_rcu_torture_error)) + if (atomic_read(&n_rcu_torture_error) || n_rcu_torture_barrier_error) rcu_torture_print_module_parms(cur_ops, "End of test: FAILURE"); else if (n_online_successes != n_online_attempts || n_offline_successes != n_offline_attempts) @@ -1697,6 +1869,7 @@ rcu_torture_init(void) int i; int cpu; int firsterr = 0; + int retval; static struct rcu_torture_ops *torture_ops[] = { &rcu_ops, &rcu_sync_ops, &rcu_expedited_ops, &rcu_bh_ops, &rcu_bh_sync_ops, &rcu_bh_expedited_ops, @@ -1754,6 +1927,7 @@ rcu_torture_init(void) atomic_set(&n_rcu_torture_free, 0); atomic_set(&n_rcu_torture_mberror, 0); atomic_set(&n_rcu_torture_error, 0); + n_rcu_torture_barrier_error = 0; n_rcu_torture_boost_ktrerror = 0; n_rcu_torture_boost_rterror = 0; n_rcu_torture_boost_failure = 0; @@ -1877,7 +2051,6 @@ rcu_torture_init(void) test_boost_duration = 2; if ((test_boost == 1 && cur_ops->can_boost) || test_boost == 2) { - int retval; boost_starttime = jiffies + test_boost_interval * HZ; 
register_cpu_notifier(&rcutorture_cpu_nb); @@ -1913,6 +2086,11 @@ rcu_torture_init(void) firsterr = i; goto unwind; } + retval = rcu_torture_barrier_init(); + if (retval != 0) { + firsterr = retval; + goto unwind; + } rcutorture_record_test_transition(); mutex_unlock(&fullstop_mutex); return 0; -- cgit v1.2.3 From cef50120b61c2af4ce34bc165e19cad66296f93d Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Sun, 5 Feb 2012 07:42:44 -0800 Subject: rcu: Direct algorithmic SRCU implementation The current implementation of synchronize_srcu_expedited() can cause severe OS jitter due to its use of synchronize_sched(), which in turn invokes try_stop_cpus(), which causes each CPU to be sent an IPI. This can result in severe performance degradation for real-time workloads and especially for short-interation-length HPC workloads. Furthermore, because only one instance of try_stop_cpus() can be making forward progress at a given time, only one instance of synchronize_srcu_expedited() can make forward progress at a time, even if they are all operating on distinct srcu_struct structures. This commit, inspired by an earlier implementation by Peter Zijlstra (https://lkml.org/lkml/2012/1/31/211) and by further offline discussions, takes a strictly algorithmic bits-in-memory approach. This has the disadvantage of requiring one explicit memory-barrier instruction in each of srcu_read_lock() and srcu_read_unlock(), but on the other hand completely dispenses with OS jitter and furthermore allows SRCU to be used freely by CPUs that RCU believes to be idle or offline. The update-side implementation handles the single read-side memory barrier by rechecking the per-CPU counters after summing them and by running through the update-side state machine twice. This implementation has passed moderate rcutorture testing on both x86 and Power. Also updated to use this_cpu_ptr() instead of per_cpu_ptr(), as suggested by Peter Zijlstra. Reported-by: Peter Zijlstra Signed-off-by: Paul E. McKenney Signed-off-by: Paul E. McKenney Acked-by: Peter Zijlstra Reviewed-by: Lai Jiangshan --- kernel/rcutorture.c | 2 +- kernel/srcu.c | 284 ++++++++++++++++++++++++++++++++++------------------ 2 files changed, 190 insertions(+), 96 deletions(-) (limited to 'kernel') diff --git a/kernel/rcutorture.c b/kernel/rcutorture.c index 8cd262b41499..d10b179dea83 100644 --- a/kernel/rcutorture.c +++ b/kernel/rcutorture.c @@ -639,7 +639,7 @@ static int srcu_torture_stats(char *page) cnt += sprintf(&page[cnt], "%s%s per-CPU(idx=%d):", torture_type, TORTURE_FLAG, idx); for_each_possible_cpu(cpu) { - cnt += sprintf(&page[cnt], " %d(%d,%d)", cpu, + cnt += sprintf(&page[cnt], " %d(%lu,%lu)", cpu, per_cpu_ptr(srcu_ctl.per_cpu_ref, cpu)->c[!idx], per_cpu_ptr(srcu_ctl.per_cpu_ref, cpu)->c[idx]); } diff --git a/kernel/srcu.c b/kernel/srcu.c index ba35f3a4a1f4..84c9b97dc3d9 100644 --- a/kernel/srcu.c +++ b/kernel/srcu.c @@ -73,19 +73,102 @@ EXPORT_SYMBOL_GPL(init_srcu_struct); #endif /* #else #ifdef CONFIG_DEBUG_LOCK_ALLOC */ /* - * srcu_readers_active_idx -- returns approximate number of readers - * active on the specified rank of per-CPU counters. + * Returns approximate number of readers active on the specified rank + * of per-CPU counters. Also snapshots each counter's value in the + * corresponding element of sp->snap[] for later use validating + * the sum. 
*/ +static unsigned long srcu_readers_active_idx(struct srcu_struct *sp, int idx) +{ + int cpu; + unsigned long sum = 0; + unsigned long t; -static int srcu_readers_active_idx(struct srcu_struct *sp, int idx) + for_each_possible_cpu(cpu) { + t = ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx]); + sum += t; + sp->snap[cpu] = t; + } + return sum & SRCU_REF_MASK; +} + +/* + * To be called from the update side after an index flip. Returns true + * if the modulo sum of the counters is stably zero, false if there is + * some possibility of non-zero. + */ +static bool srcu_readers_active_idx_check(struct srcu_struct *sp, int idx) { int cpu; - int sum; - sum = 0; + /* + * Note that srcu_readers_active_idx() can incorrectly return + * zero even though there is a pre-existing reader throughout. + * To see this, suppose that task A is in a very long SRCU + * read-side critical section that started on CPU 0, and that + * no other reader exists, so that the modulo sum of the counters + * is equal to one. Then suppose that task B starts executing + * srcu_readers_active_idx(), summing up to CPU 1, and then that + * task C starts reading on CPU 0, so that its increment is not + * summed, but finishes reading on CPU 2, so that its decrement + * -is- summed. Then when task B completes its sum, it will + * incorrectly get zero, despite the fact that task A has been + * in its SRCU read-side critical section the whole time. + * + * We therefore do a validation step should srcu_readers_active_idx() + * return zero. + */ + if (srcu_readers_active_idx(sp, idx) != 0) + return false; + + /* + * Since the caller recently flipped ->completed, we can see at + * most one increment of each CPU's counter from this point + * forward. The reason for this is that the reader CPU must have + * fetched the index before srcu_readers_active_idx checked + * that CPU's counter, but not yet incremented its counter. + * Its eventual counter increment will follow the read in + * srcu_readers_active_idx(), and that increment is immediately + * followed by smp_mb() B. Because smp_mb() D is between + * the ->completed flip and srcu_readers_active_idx()'s read, + * that CPU's subsequent load of ->completed must see the new + * value, and therefore increment the counter in the other rank. + */ + smp_mb(); /* A */ + + /* + * Now, we check the ->snap array that srcu_readers_active_idx() + * filled in from the per-CPU counter values. Since both + * __srcu_read_lock() and __srcu_read_unlock() increment the + * upper bits of the per-CPU counter, an increment/decrement + * pair will change the value of the counter. Since there is + * only one possible increment, the only way to wrap the counter + * is to have a huge number of counter decrements, which requires + * a huge number of tasks and huge SRCU read-side critical-section + * nesting levels, even on 32-bit systems. + * + * All of the ways of confusing the readings require that the scan + * in srcu_readers_active_idx() see the read-side task's decrement, + * but not its increment. However, between that decrement and + * increment are smb_mb() B and C. Either or both of these pair + * with smp_mb() A above to ensure that the scan below will see + * the read-side tasks's increment, thus noting a difference in + * the counter values between the two passes. + * + * Therefore, if srcu_readers_active_idx() returned zero, and + * none of the counters changed, we know that the zero was the + * correct sum. 
+ * + * Of course, it is possible that a task might be delayed + * for a very long time in __srcu_read_lock() after fetching + * the index but before incrementing its counter. This + * possibility will be dealt with in __synchronize_srcu(). + */ for_each_possible_cpu(cpu) - sum += per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx]; - return sum; + if (sp->snap[cpu] != + ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx])) + return false; /* False zero reading! */ + return true; } /** @@ -131,10 +214,11 @@ int __srcu_read_lock(struct srcu_struct *sp) int idx; preempt_disable(); - idx = sp->completed & 0x1; - barrier(); /* ensure compiler looks -once- at sp->completed. */ - per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]++; - srcu_barrier(); /* ensure compiler won't misorder critical section. */ + idx = rcu_dereference_index_check(sp->completed, + rcu_read_lock_sched_held()) & 0x1; + ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) += + SRCU_USAGE_COUNT + 1; + smp_mb(); /* B */ /* Avoid leaking the critical section. */ preempt_enable(); return idx; } @@ -149,8 +233,9 @@ EXPORT_SYMBOL_GPL(__srcu_read_lock); void __srcu_read_unlock(struct srcu_struct *sp, int idx) { preempt_disable(); - srcu_barrier(); /* ensure compiler won't misorder critical section. */ - per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]--; + smp_mb(); /* C */ /* Avoid leaking the critical section. */ + ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) += + SRCU_USAGE_COUNT - 1; preempt_enable(); } EXPORT_SYMBOL_GPL(__srcu_read_unlock); @@ -163,12 +248,65 @@ EXPORT_SYMBOL_GPL(__srcu_read_unlock); * we repeatedly block for 1-millisecond time periods. This approach * has done well in testing, so there is no need for a config parameter. */ -#define SYNCHRONIZE_SRCU_READER_DELAY 10 +#define SYNCHRONIZE_SRCU_READER_DELAY 5 + +/* + * Flip the readers' index by incrementing ->completed, then wait + * until there are no more readers using the counters referenced by + * the old index value. (Recall that the index is the bottom bit + * of ->completed.) + * + * Of course, it is possible that a reader might be delayed for the + * full duration of flip_idx_and_wait() between fetching the + * index and incrementing its counter. This possibility is handled + * by __synchronize_srcu() invoking flip_idx_and_wait() twice. + */ +static void flip_idx_and_wait(struct srcu_struct *sp, bool expedited) +{ + int idx; + int trycount = 0; + + idx = sp->completed++ & 0x1; + + /* + * If a reader fetches the index before the above increment, + * but increments its counter after srcu_readers_active_idx_check() + * sums it, then smp_mb() D will pair with __srcu_read_lock()'s + * smp_mb() B to ensure that the SRCU read-side critical section + * will see any updates that the current task performed before its + * call to synchronize_srcu(), or to synchronize_srcu_expedited(), + * as the case may be. + */ + smp_mb(); /* D */ + + /* + * SRCU read-side critical sections are normally short, so wait + * a small amount of time before possibly blocking. 
+ */ + if (!srcu_readers_active_idx_check(sp, idx)) { + udelay(SYNCHRONIZE_SRCU_READER_DELAY); + while (!srcu_readers_active_idx_check(sp, idx)) { + if (expedited && ++ trycount < 10) + udelay(SYNCHRONIZE_SRCU_READER_DELAY); + else + schedule_timeout_interruptible(1); + } + } + + /* + * The following smp_mb() E pairs with srcu_read_unlock()'s + * smp_mb C to ensure that if srcu_readers_active_idx_check() + * sees srcu_read_unlock()'s counter decrement, then any + * of the current task's subsequent code will happen after + * that SRCU read-side critical section. + */ + smp_mb(); /* E */ +} /* * Helper function for synchronize_srcu() and synchronize_srcu_expedited(). */ -static void __synchronize_srcu(struct srcu_struct *sp, void (*sync_func)(void)) +static void __synchronize_srcu(struct srcu_struct *sp, bool expedited) { int idx; @@ -178,90 +316,53 @@ static void __synchronize_srcu(struct srcu_struct *sp, void (*sync_func)(void)) !lock_is_held(&rcu_sched_lock_map), "Illegal synchronize_srcu() in same-type SRCU (or RCU) read-side critical section"); - idx = sp->completed; + smp_mb(); /* Ensure prior action happens before grace period. */ + idx = ACCESS_ONCE(sp->completed); + smp_mb(); /* Access to ->completed before lock acquisition. */ mutex_lock(&sp->mutex); /* * Check to see if someone else did the work for us while we were - * waiting to acquire the lock. We need -two- advances of + * waiting to acquire the lock. We need -three- advances of * the counter, not just one. If there was but one, we might have * shown up -after- our helper's first synchronize_sched(), thus * having failed to prevent CPU-reordering races with concurrent - * srcu_read_unlock()s on other CPUs (see comment below). So we - * either (1) wait for two or (2) supply the second ourselves. + * srcu_read_unlock()s on other CPUs (see comment below). If there + * was only two, we are guaranteed to have waited through only one + * full index-flip phase. So we either (1) wait for three or + * (2) supply the additional ones we need. */ - if ((sp->completed - idx) >= 2) { + if (sp->completed == idx + 2) + idx = 1; + else if (sp->completed == idx + 3) { mutex_unlock(&sp->mutex); return; - } - - sync_func(); /* Force memory barrier on all CPUs. */ + } else + idx = 0; /* - * The preceding synchronize_sched() ensures that any CPU that - * sees the new value of sp->completed will also see any preceding - * changes to data structures made by this CPU. This prevents - * some other CPU from reordering the accesses in its SRCU - * read-side critical section to precede the corresponding - * srcu_read_lock() -- ensuring that such references will in - * fact be protected. + * If there were no helpers, then we need to do two flips of + * the index. The first flip is required if there are any + * outstanding SRCU readers even if there are no new readers + * running concurrently with the first counter flip. * - * So it is now safe to do the flip. - */ - - idx = sp->completed & 0x1; - sp->completed++; - - sync_func(); /* Force memory barrier on all CPUs. */ - - /* - * At this point, because of the preceding synchronize_sched(), - * all srcu_read_lock() calls using the old counters have completed. - * Their corresponding critical sections might well be still - * executing, but the srcu_read_lock() primitives themselves - * will have finished executing. We initially give readers - * an arbitrarily chosen 10 microseconds to get out of their - * SRCU read-side critical sections, then loop waiting 1/HZ - * seconds per iteration. 
The 10-microsecond value has done - * very well in testing. - */ - - if (srcu_readers_active_idx(sp, idx)) - udelay(SYNCHRONIZE_SRCU_READER_DELAY); - while (srcu_readers_active_idx(sp, idx)) - schedule_timeout_interruptible(1); - - sync_func(); /* Force memory barrier on all CPUs. */ - - /* - * The preceding synchronize_sched() forces all srcu_read_unlock() - * primitives that were executing concurrently with the preceding - * for_each_possible_cpu() loop to have completed by this point. - * More importantly, it also forces the corresponding SRCU read-side - * critical sections to have also completed, and the corresponding - * references to SRCU-protected data items to be dropped. + * The second flip is required when a new reader picks up + * the old value of the index, but does not increment its + * counter until after its counters is summed/rechecked by + * srcu_readers_active_idx_check(). In this case, the current SRCU + * grace period would be OK because the SRCU read-side critical + * section started after this SRCU grace period started, so the + * grace period is not required to wait for the reader. * - * Note: - * - * Despite what you might think at first glance, the - * preceding synchronize_sched() -must- be within the - * critical section ended by the following mutex_unlock(). - * Otherwise, a task taking the early exit can race - * with a srcu_read_unlock(), which might have executed - * just before the preceding srcu_readers_active() check, - * and whose CPU might have reordered the srcu_read_unlock() - * with the preceding critical section. In this case, there - * is nothing preventing the synchronize_sched() task that is - * taking the early exit from freeing a data structure that - * is still being referenced (out of order) by the task - * doing the srcu_read_unlock(). - * - * Alternatively, the comparison with "2" on the early exit - * could be changed to "3", but this increases synchronize_srcu() - * latency for bulk loads. So the current code is preferred. + * However, the next SRCU grace period would be waiting for the + * other set of counters to go to zero, and therefore would not + * wait for the reader, which would be very bad. To avoid this + * bad scenario, we flip and wait twice, clearing out both sets + * of counters. */ - + for (; idx < 2; idx++) + flip_idx_and_wait(sp, expedited); mutex_unlock(&sp->mutex); } @@ -281,7 +382,7 @@ static void __synchronize_srcu(struct srcu_struct *sp, void (*sync_func)(void)) */ void synchronize_srcu(struct srcu_struct *sp) { - __synchronize_srcu(sp, synchronize_sched); + __synchronize_srcu(sp, 0); } EXPORT_SYMBOL_GPL(synchronize_srcu); @@ -289,18 +390,11 @@ EXPORT_SYMBOL_GPL(synchronize_srcu); * synchronize_srcu_expedited - Brute-force SRCU grace period * @sp: srcu_struct with which to synchronize. * - * Wait for an SRCU grace period to elapse, but use a "big hammer" - * approach to force the grace period to end quickly. This consumes - * significant time on all CPUs and is unfriendly to real-time workloads, - * so is thus not recommended for any sort of common-case code. In fact, - * if you are using synchronize_srcu_expedited() in a loop, please - * restructure your code to batch your updates, and then use a single - * synchronize_srcu() instead. + * Wait for an SRCU grace period to elapse, but be more aggressive about + * spinning rather than blocking when waiting. * * Note that it is illegal to call this function while holding any lock - * that is acquired by a CPU-hotplug notifier. 
And yes, it is also illegal - * to call this function from a CPU-hotplug notifier. Failing to observe - * these restriction will result in deadlock. It is also illegal to call + * that is acquired by a CPU-hotplug notifier. It is also illegal to call * synchronize_srcu_expedited() from the corresponding SRCU read-side * critical section; doing so will result in deadlock. However, it is * perfectly legal to call synchronize_srcu_expedited() on one srcu_struct @@ -309,7 +403,7 @@ EXPORT_SYMBOL_GPL(synchronize_srcu); */ void synchronize_srcu_expedited(struct srcu_struct *sp) { - __synchronize_srcu(sp, synchronize_sched_expedited); + __synchronize_srcu(sp, 1); } EXPORT_SYMBOL_GPL(synchronize_srcu_expedited); -- cgit v1.2.3 From 4b7a3e9e32114a09c61995048f055615b5d4c26d Mon Sep 17 00:00:00 2001 From: Lai Jiangshan Date: Wed, 22 Feb 2012 13:06:51 -0800 Subject: rcu: Remove fast check path from __synchronize_srcu() The fastpath in __synchronize_srcu() is designed to handle cases where there are a large number of concurrent calls for the same srcu_struct structure. However, the Linux kernel currently does not use SRCU in this manner, so remove the fastpath checks for simplicity. Signed-off-by: Lai Jiangshan Signed-off-by: Paul E. McKenney --- kernel/srcu.c | 25 +------------------------ 1 file changed, 1 insertion(+), 24 deletions(-) (limited to 'kernel') diff --git a/kernel/srcu.c b/kernel/srcu.c index 84c9b97dc3d9..17e95bcc901c 100644 --- a/kernel/srcu.c +++ b/kernel/srcu.c @@ -308,7 +308,7 @@ static void flip_idx_and_wait(struct srcu_struct *sp, bool expedited) */ static void __synchronize_srcu(struct srcu_struct *sp, bool expedited) { - int idx; + int idx = 0; rcu_lockdep_assert(!lock_is_held(&sp->dep_map) && !lock_is_held(&rcu_bh_lock_map) && @@ -316,31 +316,8 @@ static void __synchronize_srcu(struct srcu_struct *sp, bool expedited) !lock_is_held(&rcu_sched_lock_map), "Illegal synchronize_srcu() in same-type SRCU (or RCU) read-side critical section"); - smp_mb(); /* Ensure prior action happens before grace period. */ - idx = ACCESS_ONCE(sp->completed); - smp_mb(); /* Access to ->completed before lock acquisition. */ mutex_lock(&sp->mutex); - /* - * Check to see if someone else did the work for us while we were - * waiting to acquire the lock. We need -three- advances of - * the counter, not just one. If there was but one, we might have - * shown up -after- our helper's first synchronize_sched(), thus - * having failed to prevent CPU-reordering races with concurrent - * srcu_read_unlock()s on other CPUs (see comment below). If there - * was only two, we are guaranteed to have waited through only one - * full index-flip phase. So we either (1) wait for three or - * (2) supply the additional ones we need. - */ - - if (sp->completed == idx + 2) - idx = 1; - else if (sp->completed == idx + 3) { - mutex_unlock(&sp->mutex); - return; - } else - idx = 0; - /* * If there were no helpers, then we need to do two flips of * the index. The first flip is required if there are any -- cgit v1.2.3 From 440253c17fc4ed41d778492a7fb44dc0d756eccc Mon Sep 17 00:00:00 2001 From: Lai Jiangshan Date: Wed, 22 Feb 2012 13:29:06 -0800 Subject: rcu: Increment upper bit only for srcu_read_lock() The purpose of the upper bit of SRCU's per-CPU counters is to guarantee that no reasonable series of srcu_read_lock() and srcu_read_unlock() operations can return the value of the counter to its original value. 
This guarantee is require only after the index has been switched to the other set of counters, so at most one srcu_read_lock() can affect a given CPU's counter. The number of srcu_read_unlock() operations on a given counter is limited to the number of tasks in the system, which given the Linux kernel's current structure is limited to far less than 2^30 on 32-bit systems and far less than 2^62 on 64-bit systems. (Something about a limited number of bytes in the kernel's address space.) Therefore, if srcu_read_lock() increments the upper bits, then srcu_read_unlock() need not do so. In this case, an srcu_read_lock() and an srcu_read_unlock() will flip the lower bit of the upper field of the counter. An unreasonably large additional number of srcu_read_unlock() operations would be required to return the counter to its initial value, thus preserving the guarantee. This commit takes this approach, which further allows it to shrink the size of the upper field to one bit, making the number of srcu_read_unlock() operations required to return the counter to its initial value even more unreasonable than before. Signed-off-by: Lai Jiangshan Signed-off-by: Paul E. McKenney --- kernel/srcu.c | 19 +++++++++---------- 1 file changed, 9 insertions(+), 10 deletions(-) (limited to 'kernel') diff --git a/kernel/srcu.c b/kernel/srcu.c index 17e95bcc901c..43f1d61e513e 100644 --- a/kernel/srcu.c +++ b/kernel/srcu.c @@ -138,14 +138,14 @@ static bool srcu_readers_active_idx_check(struct srcu_struct *sp, int idx) /* * Now, we check the ->snap array that srcu_readers_active_idx() - * filled in from the per-CPU counter values. Since both - * __srcu_read_lock() and __srcu_read_unlock() increment the - * upper bits of the per-CPU counter, an increment/decrement - * pair will change the value of the counter. Since there is - * only one possible increment, the only way to wrap the counter - * is to have a huge number of counter decrements, which requires - * a huge number of tasks and huge SRCU read-side critical-section - * nesting levels, even on 32-bit systems. + * filled in from the per-CPU counter values. Since + * __srcu_read_lock() increments the upper bits of the per-CPU + * counter, an increment/decrement pair will change the value + * of the counter. Since there is only one possible increment, + * the only way to wrap the counter is to have a huge number of + * counter decrements, which requires a huge number of tasks and + * huge SRCU read-side critical-section nesting levels, even on + * 32-bit systems. * * All of the ways of confusing the readings require that the scan * in srcu_readers_active_idx() see the read-side task's decrement, @@ -234,8 +234,7 @@ void __srcu_read_unlock(struct srcu_struct *sp, int idx) { preempt_disable(); smp_mb(); /* C */ /* Avoid leaking the critical section. */ - ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) += - SRCU_USAGE_COUNT - 1; + ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) -= 1; preempt_enable(); } EXPORT_SYMBOL_GPL(__srcu_read_unlock); -- cgit v1.2.3 From 944ce9af4767ca085d465e4add69df11a8faa9ef Mon Sep 17 00:00:00 2001 From: Lai Jiangshan Date: Wed, 22 Feb 2012 16:43:55 -0800 Subject: rcu: Flip ->completed only once per SRCU grace period This is an optimization of the SRCU grace period. To guard against preempted readers with old values of the counter, it suffices to scan the old counters once more, then flip ->completed only one time. 
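As a rough illustration of that update-side ordering (wait out stragglers on the idle rank, flip ->completed once, then wait out the rank that pre-existing readers were using), here is a minimal single-threaded toy model. It is a sketch only: it drops the per-CPU counters, ->seq[] validation, and memory barriers that the real srcu.c code depends on, and every name in it is invented for the illustration.

	/* Toy model of the flip-once SRCU grace period; not kernel code. */
	#include <stdio.h>

	struct toy_srcu {
		unsigned long completed;
		long c[2];		/* readers per rank; the kernel uses per-CPU counters */
	};

	static int toy_read_lock(struct toy_srcu *sp)
	{
		int idx = sp->completed & 0x1;

		sp->c[idx]++;
		return idx;
	}

	static void toy_read_unlock(struct toy_srcu *sp, int idx)
	{
		sp->c[idx]--;
	}

	static void toy_wait_idx(struct toy_srcu *sp, int idx)
	{
		while (sp->c[idx] != 0)
			;	/* the real wait_idx() delays/blocks and revalidates */
	}

	static void toy_synchronize(struct toy_srcu *sp)
	{
		int busy_idx = sp->completed & 0x1;

		toy_wait_idx(sp, 1 - busy_idx);	/* rescan the old, idle rank */
		sp->completed++;		/* single flip */
		toy_wait_idx(sp, busy_idx);	/* wait for pre-existing readers */
	}

	int main(void)
	{
		struct toy_srcu sp = { 0, { 0, 0 } };
		int idx = toy_read_lock(&sp);

		toy_read_unlock(&sp, idx);	/* reader finishes before the grace period */
		toy_synchronize(&sp);
		printf("completed=%lu (flipped once)\n", sp.completed);
		return 0;
	}
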
The reason this works is that the old readers must have incremented the old set of counters (if they have not yet incremented, then their critical section starts after this grace period, so they may be safely ignored). This commit therefore optimizes the second flip out in favor of a simple rescan. Signed-off-by: Lai Jiangshan Signed-off-by: Paul E. McKenney --- kernel/srcu.c | 92 ++++++++++++++++++++++++++++++++++++----------------------- 1 file changed, 56 insertions(+), 36 deletions(-) (limited to 'kernel') diff --git a/kernel/srcu.c b/kernel/srcu.c index 43f1d61e513e..b6b9ea2eb51c 100644 --- a/kernel/srcu.c +++ b/kernel/srcu.c @@ -249,26 +249,12 @@ EXPORT_SYMBOL_GPL(__srcu_read_unlock); */ #define SYNCHRONIZE_SRCU_READER_DELAY 5 -/* - * Flip the readers' index by incrementing ->completed, then wait - * until there are no more readers using the counters referenced by - * the old index value. (Recall that the index is the bottom bit - * of ->completed.) - * - * Of course, it is possible that a reader might be delayed for the - * full duration of flip_idx_and_wait() between fetching the - * index and incrementing its counter. This possibility is handled - * by __synchronize_srcu() invoking flip_idx_and_wait() twice. - */ -static void flip_idx_and_wait(struct srcu_struct *sp, bool expedited) +static void wait_idx(struct srcu_struct *sp, int idx, bool expedited) { - int idx; int trycount = 0; - idx = sp->completed++ & 0x1; - /* - * If a reader fetches the index before the above increment, + * If a reader fetches the index before the ->completed increment, * but increments its counter after srcu_readers_active_idx_check() * sums it, then smp_mb() D will pair with __srcu_read_lock()'s * smp_mb() B to ensure that the SRCU read-side critical section @@ -298,17 +284,38 @@ static void flip_idx_and_wait(struct srcu_struct *sp, bool expedited) * sees srcu_read_unlock()'s counter decrement, then any * of the current task's subsequent code will happen after * that SRCU read-side critical section. + * + * It also ensures the order between the above waiting and + * the next flipping. */ smp_mb(); /* E */ } +/* + * Flip the readers' index by incrementing ->completed, then wait + * until there are no more readers using the counters referenced by + * the old index value. (Recall that the index is the bottom bit + * of ->completed.) + * + * Of course, it is possible that a reader might be delayed for the + * full duration of flip_idx_and_wait() between fetching the + * index and incrementing its counter. This possibility is handled + * by the next __synchronize_srcu() invoking wait_idx() for such readers + * before starting a new grace period. + */ +static void flip_idx_and_wait(struct srcu_struct *sp, bool expedited) +{ + int idx; + + idx = sp->completed++ & 0x1; + wait_idx(sp, idx, expedited); +} + /* * Helper function for synchronize_srcu() and synchronize_srcu_expedited(). */ static void __synchronize_srcu(struct srcu_struct *sp, bool expedited) { - int idx = 0; - rcu_lockdep_assert(!lock_is_held(&sp->dep_map) && !lock_is_held(&rcu_bh_lock_map) && !lock_is_held(&rcu_lock_map) && @@ -318,27 +325,40 @@ static void __synchronize_srcu(struct srcu_struct *sp, bool expedited) mutex_lock(&sp->mutex); /* - * If there were no helpers, then we need to do two flips of - * the index. The first flip is required if there are any - * outstanding SRCU readers even if there are no new readers - * running concurrently with the first counter flip. 
+ * Suppose that during the previous grace period, a reader + * picked up the old value of the index, but did not increment + * its counter until after the previous instance of + * __synchronize_srcu() did the counter summation and recheck. + * That previous grace period was OK because the reader did + * not start until after the grace period started, so the grace + * period was not obligated to wait for that reader. + * + * However, the current SRCU grace period does have to wait for + * that reader. This is handled by invoking wait_idx() on the + * non-active set of counters (hence sp->completed - 1). Once + * wait_idx() returns, we know that all readers that picked up + * the old value of ->completed and that already incremented their + * counter will have completed. * - * The second flip is required when a new reader picks up - * the old value of the index, but does not increment its - * counter until after its counters is summed/rechecked by - * srcu_readers_active_idx_check(). In this case, the current SRCU - * grace period would be OK because the SRCU read-side critical - * section started after this SRCU grace period started, so the - * grace period is not required to wait for the reader. + * But what about readers that picked up the old value of + * ->completed, but -still- have not managed to increment their + * counter? We do not need to wait for those readers, because + * they will have started their SRCU read-side critical section + * after the current grace period starts. * - * However, the next SRCU grace period would be waiting for the - * other set of counters to go to zero, and therefore would not - * wait for the reader, which would be very bad. To avoid this - * bad scenario, we flip and wait twice, clearing out both sets - * of counters. + * Because it is unlikely that readers will be preempted between + * fetching ->completed and incrementing their counter, wait_idx() + * will normally not need to wait. */ - for (; idx < 2; idx++) - flip_idx_and_wait(sp, expedited); + wait_idx(sp, (sp->completed - 1) & 0x1, expedited); + + /* + * Now that wait_idx() has waited for the really old readers, + * invoke flip_idx_and_wait() to flip the counter and wait + * for current SRCU readers. + */ + flip_idx_and_wait(sp, expedited); + mutex_unlock(&sp->mutex); } -- cgit v1.2.3 From 18108ebfebe9e871d0a9af830baf8f5df69eb5fc Mon Sep 17 00:00:00 2001 From: Lai Jiangshan Date: Mon, 27 Feb 2012 09:28:10 -0800 Subject: rcu: Improve SRCU's wait_idx() comments The safety of SRCU is provided byy wait_idx() rather than flipping. The flipping actually prevents starvation. This commit therefore updates the comments to more accurately and precisely describe what is going on. Signed-off-by: Lai Jiangshan Signed-off-by: Paul E. McKenney --- kernel/srcu.c | 77 ++++++++++++++++++++++++++++------------------------------- 1 file changed, 37 insertions(+), 40 deletions(-) (limited to 'kernel') diff --git a/kernel/srcu.c b/kernel/srcu.c index b6b9ea2eb51c..1fecb4d858ed 100644 --- a/kernel/srcu.c +++ b/kernel/srcu.c @@ -249,6 +249,10 @@ EXPORT_SYMBOL_GPL(__srcu_read_unlock); */ #define SYNCHRONIZE_SRCU_READER_DELAY 5 +/* + * Wait until all pre-existing readers complete. Such readers + * will have used the index specified by "idx". 
+ */ static void wait_idx(struct srcu_struct *sp, int idx, bool expedited) { int trycount = 0; @@ -291,24 +295,9 @@ static void wait_idx(struct srcu_struct *sp, int idx, bool expedited) smp_mb(); /* E */ } -/* - * Flip the readers' index by incrementing ->completed, then wait - * until there are no more readers using the counters referenced by - * the old index value. (Recall that the index is the bottom bit - * of ->completed.) - * - * Of course, it is possible that a reader might be delayed for the - * full duration of flip_idx_and_wait() between fetching the - * index and incrementing its counter. This possibility is handled - * by the next __synchronize_srcu() invoking wait_idx() for such readers - * before starting a new grace period. - */ -static void flip_idx_and_wait(struct srcu_struct *sp, bool expedited) +static void srcu_flip(struct srcu_struct *sp) { - int idx; - - idx = sp->completed++ & 0x1; - wait_idx(sp, idx, expedited); + sp->completed++; } /* @@ -316,6 +305,8 @@ static void flip_idx_and_wait(struct srcu_struct *sp, bool expedited) */ static void __synchronize_srcu(struct srcu_struct *sp, bool expedited) { + int busy_idx; + rcu_lockdep_assert(!lock_is_held(&sp->dep_map) && !lock_is_held(&rcu_bh_lock_map) && !lock_is_held(&rcu_lock_map) && @@ -323,8 +314,28 @@ static void __synchronize_srcu(struct srcu_struct *sp, bool expedited) "Illegal synchronize_srcu() in same-type SRCU (or RCU) read-side critical section"); mutex_lock(&sp->mutex); + busy_idx = sp->completed & 0X1UL; /* + * If we recently flipped the index, there will be some readers + * using idx=0 and others using idx=1. Therefore, two calls to + * wait_idx()s suffice to ensure that all pre-existing readers + * have completed: + * + * __synchronize_srcu() { + * wait_idx(sp, 0, expedited); + * wait_idx(sp, 1, expedited); + * } + * + * Starvation is prevented by the fact that we flip the index. + * While we wait on one index to clear out, almost all new readers + * will be using the other index. The number of new readers using the + * index we are waiting on is sharply bounded by roughly the number + * of CPUs. + * + * How can new readers possibly using the old pre-flip value of + * the index? Consider the following sequence of events: + * * Suppose that during the previous grace period, a reader * picked up the old value of the index, but did not increment * its counter until after the previous instance of @@ -333,31 +344,17 @@ static void __synchronize_srcu(struct srcu_struct *sp, bool expedited) * not start until after the grace period started, so the grace * period was not obligated to wait for that reader. * - * However, the current SRCU grace period does have to wait for - * that reader. This is handled by invoking wait_idx() on the - * non-active set of counters (hence sp->completed - 1). Once - * wait_idx() returns, we know that all readers that picked up - * the old value of ->completed and that already incremented their - * counter will have completed. - * - * But what about readers that picked up the old value of - * ->completed, but -still- have not managed to increment their - * counter? We do not need to wait for those readers, because - * they will have started their SRCU read-side critical section - * after the current grace period starts. - * - * Because it is unlikely that readers will be preempted between - * fetching ->completed and incrementing their counter, wait_idx() - * will normally not need to wait. 
+ * However, this sequence of events is quite improbable, so + * this call to wait_idx(), which waits on really old readers + * describe in this comment above, will almost never need to wait. */ - wait_idx(sp, (sp->completed - 1) & 0x1, expedited); + wait_idx(sp, 1 - busy_idx, expedited); - /* - * Now that wait_idx() has waited for the really old readers, - * invoke flip_idx_and_wait() to flip the counter and wait - * for current SRCU readers. - */ - flip_idx_and_wait(sp, expedited); + /* Flip the index to avoid reader-induced starvation. */ + srcu_flip(sp); + + /* Wait for recent pre-existing readers. */ + wait_idx(sp, busy_idx, expedited); mutex_unlock(&sp->mutex); } -- cgit v1.2.3 From b52ce066c55a6a53cf1f8d71308d74f908e31b99 Mon Sep 17 00:00:00 2001 From: Lai Jiangshan Date: Mon, 27 Feb 2012 09:29:09 -0800 Subject: rcu: Implement a variant of Peter's SRCU algorithm This commit implements a variant of Peter's algorithm, which may be found at https://lkml.org/lkml/2012/2/1/119. o Make the checking lock-free to enable parallel checking. Parallel checking is required when (1) the original checking task is preempted for a long time, (2) sychronize_srcu_expedited() starts during an ongoing SRCU grace period, or (3) we wish to avoid acquiring a lock. o Since the checking is lock-free, we avoid a mutex in state machine for call_srcu(). o Remove the SRCU_REF_MASK and remove the coupling with the flipping. This might allow us to remove the preempt_disable() in future versions, though such removal will need great care because it rescinds the one-old-reader-per-CPU guarantee. o Remove a smp_mb(), simplify the comments and make the smp_mb() pairs more intuitive. Inspired-by: Peter Zijlstra Signed-off-by: Lai Jiangshan Signed-off-by: Paul E. McKenney --- kernel/srcu.c | 149 +++++++++++++++++++++++++++------------------------------- 1 file changed, 69 insertions(+), 80 deletions(-) (limited to 'kernel') diff --git a/kernel/srcu.c b/kernel/srcu.c index 1fecb4d858ed..e0139a274856 100644 --- a/kernel/srcu.c +++ b/kernel/srcu.c @@ -72,11 +72,26 @@ EXPORT_SYMBOL_GPL(init_srcu_struct); #endif /* #else #ifdef CONFIG_DEBUG_LOCK_ALLOC */ +/* + * Returns approximate total of the readers' ->seq[] values for the + * rank of per-CPU counters specified by idx. + */ +static unsigned long srcu_readers_seq_idx(struct srcu_struct *sp, int idx) +{ + int cpu; + unsigned long sum = 0; + unsigned long t; + + for_each_possible_cpu(cpu) { + t = ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->seq[idx]); + sum += t; + } + return sum; +} + /* * Returns approximate number of readers active on the specified rank - * of per-CPU counters. Also snapshots each counter's value in the - * corresponding element of sp->snap[] for later use validating - * the sum. + * of the per-CPU ->c[] counters. */ static unsigned long srcu_readers_active_idx(struct srcu_struct *sp, int idx) { @@ -87,26 +102,45 @@ static unsigned long srcu_readers_active_idx(struct srcu_struct *sp, int idx) for_each_possible_cpu(cpu) { t = ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx]); sum += t; - sp->snap[cpu] = t; } - return sum & SRCU_REF_MASK; + return sum; } /* - * To be called from the update side after an index flip. Returns true - * if the modulo sum of the counters is stably zero, false if there is - * some possibility of non-zero. + * Return true if the number of pre-existing readers is determined to + * be stably zero. 
An example unstable zero can occur if the call + * to srcu_readers_active_idx() misses an __srcu_read_lock() increment, + * but due to task migration, sees the corresponding __srcu_read_unlock() + * decrement. This can happen because srcu_readers_active_idx() takes + * time to sum the array, and might in fact be interrupted or preempted + * partway through the summation. */ static bool srcu_readers_active_idx_check(struct srcu_struct *sp, int idx) { - int cpu; + unsigned long seq; + + seq = srcu_readers_seq_idx(sp, idx); + + /* + * The following smp_mb() A pairs with the smp_mb() B located in + * __srcu_read_lock(). This pairing ensures that if an + * __srcu_read_lock() increments its counter after the summation + * in srcu_readers_active_idx(), then the corresponding SRCU read-side + * critical section will see any changes made prior to the start + * of the current SRCU grace period. + * + * Also, if the above call to srcu_readers_seq_idx() saw the + * increment of ->seq[], then the call to srcu_readers_active_idx() + * must see the increment of ->c[]. + */ + smp_mb(); /* A */ /* * Note that srcu_readers_active_idx() can incorrectly return * zero even though there is a pre-existing reader throughout. * To see this, suppose that task A is in a very long SRCU * read-side critical section that started on CPU 0, and that - * no other reader exists, so that the modulo sum of the counters + * no other reader exists, so that the sum of the counters * is equal to one. Then suppose that task B starts executing * srcu_readers_active_idx(), summing up to CPU 1, and then that * task C starts reading on CPU 0, so that its increment is not @@ -122,53 +156,31 @@ static bool srcu_readers_active_idx_check(struct srcu_struct *sp, int idx) return false; /* - * Since the caller recently flipped ->completed, we can see at - * most one increment of each CPU's counter from this point - * forward. The reason for this is that the reader CPU must have - * fetched the index before srcu_readers_active_idx checked - * that CPU's counter, but not yet incremented its counter. - * Its eventual counter increment will follow the read in - * srcu_readers_active_idx(), and that increment is immediately - * followed by smp_mb() B. Because smp_mb() D is between - * the ->completed flip and srcu_readers_active_idx()'s read, - * that CPU's subsequent load of ->completed must see the new - * value, and therefore increment the counter in the other rank. - */ - smp_mb(); /* A */ - - /* - * Now, we check the ->snap array that srcu_readers_active_idx() - * filled in from the per-CPU counter values. Since - * __srcu_read_lock() increments the upper bits of the per-CPU - * counter, an increment/decrement pair will change the value - * of the counter. Since there is only one possible increment, - * the only way to wrap the counter is to have a huge number of - * counter decrements, which requires a huge number of tasks and - * huge SRCU read-side critical-section nesting levels, even on - * 32-bit systems. + * The remainder of this function is the validation step. + * The following smp_mb() D pairs with the smp_mb() C in + * __srcu_read_unlock(). If the __srcu_read_unlock() was seen + * by srcu_readers_active_idx() above, then any destructive + * operation performed after the grace period will happen after + * the corresponding SRCU read-side critical section. * - * All of the ways of confusing the readings require that the scan - * in srcu_readers_active_idx() see the read-side task's decrement, - * but not its increment. 
However, between that decrement and - * increment are smb_mb() B and C. Either or both of these pair - * with smp_mb() A above to ensure that the scan below will see - * the read-side tasks's increment, thus noting a difference in - * the counter values between the two passes. - * - * Therefore, if srcu_readers_active_idx() returned zero, and - * none of the counters changed, we know that the zero was the - * correct sum. - * - * Of course, it is possible that a task might be delayed - * for a very long time in __srcu_read_lock() after fetching - * the index but before incrementing its counter. This - * possibility will be dealt with in __synchronize_srcu(). + * Note that there can be at most NR_CPUS worth of readers using + * the old index, which is not enough to overflow even a 32-bit + * integer. (Yes, this does mean that systems having more than + * a billion or so CPUs need to be 64-bit systems.) Therefore, + * the sum of the ->seq[] counters cannot possibly overflow. + * Therefore, the only way that the return values of the two + * calls to srcu_readers_seq_idx() can be equal is if there were + * no increments of the corresponding rank of ->seq[] counts + * in the interim. But the missed-increment scenario laid out + * above includes an increment of the ->seq[] counter by + * the corresponding __srcu_read_lock(). Therefore, if this + * scenario occurs, the return values from the two calls to + * srcu_readers_seq_idx() will differ, and thus the validation + * step below suffices. */ - for_each_possible_cpu(cpu) - if (sp->snap[cpu] != - ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx])) - return false; /* False zero reading! */ - return true; + smp_mb(); /* D */ + + return srcu_readers_seq_idx(sp, idx) == seq; } /** @@ -216,9 +228,9 @@ int __srcu_read_lock(struct srcu_struct *sp) preempt_disable(); idx = rcu_dereference_index_check(sp->completed, rcu_read_lock_sched_held()) & 0x1; - ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) += - SRCU_USAGE_COUNT + 1; + ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) += 1; smp_mb(); /* B */ /* Avoid leaking the critical section. */ + ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->seq[idx]) += 1; preempt_enable(); return idx; } @@ -257,17 +269,6 @@ static void wait_idx(struct srcu_struct *sp, int idx, bool expedited) { int trycount = 0; - /* - * If a reader fetches the index before the ->completed increment, - * but increments its counter after srcu_readers_active_idx_check() - * sums it, then smp_mb() D will pair with __srcu_read_lock()'s - * smp_mb() B to ensure that the SRCU read-side critical section - * will see any updates that the current task performed before its - * call to synchronize_srcu(), or to synchronize_srcu_expedited(), - * as the case may be. - */ - smp_mb(); /* D */ - /* * SRCU read-side critical sections are normally short, so wait * a small amount of time before possibly blocking. @@ -281,18 +282,6 @@ static void wait_idx(struct srcu_struct *sp, int idx, bool expedited) schedule_timeout_interruptible(1); } } - - /* - * The following smp_mb() E pairs with srcu_read_unlock()'s - * smp_mb C to ensure that if srcu_readers_active_idx_check() - * sees srcu_read_unlock()'s counter decrement, then any - * of the current task's subsequent code will happen after - * that SRCU read-side critical section. - * - * It also ensures the order between the above waiting and - * the next flipping. 
- */ - smp_mb(); /* E */ } static void srcu_flip(struct srcu_struct *sp) -- cgit v1.2.3 From dc87917501e324701dbfb249def44054b5220187 Mon Sep 17 00:00:00 2001 From: Lai Jiangshan Date: Tue, 6 Mar 2012 17:57:34 +0800 Subject: rcu: Improve srcu_readers_active_idx()'s cache locality Expand the calls to srcu_readers_active_idx() from srcu_readers_active() inline. This change improves cache locality by interating over the CPUs once rather than twice. Signed-off-by: Lai Jiangshan Signed-off-by: Paul E. McKenney --- kernel/srcu.c | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/srcu.c b/kernel/srcu.c index e0139a274856..a43211c92863 100644 --- a/kernel/srcu.c +++ b/kernel/srcu.c @@ -193,7 +193,14 @@ static bool srcu_readers_active_idx_check(struct srcu_struct *sp, int idx) */ static int srcu_readers_active(struct srcu_struct *sp) { - return srcu_readers_active_idx(sp, 0) + srcu_readers_active_idx(sp, 1); + int cpu; + unsigned long sum = 0; + + for_each_possible_cpu(cpu) { + sum += ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[0]); + sum += ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[1]); + } + return sum; } /** -- cgit v1.2.3 From d9792edd7a9a0858a3b1df92cf8beb31e4191e3c Mon Sep 17 00:00:00 2001 From: Lai Jiangshan Date: Mon, 19 Mar 2012 16:12:12 +0800 Subject: rcu: Use single value to handle expedited SRCU grace periods The earlier algorithm used an "expedited" flag combined with a "trycount" counter to differentiate between normal and expedited SRCU grace periods. However, the difference can be encoded into a single counter with a cutoff value and different initial values for expedited and normal SRCU grace periods. This commit makes that change. Signed-off-by: Lai Jiangshan Signed-off-by: Paul E. McKenney Conflicts: kernel/srcu.c --- kernel/srcu.c | 27 ++++++++++++++------------- 1 file changed, 14 insertions(+), 13 deletions(-) (limited to 'kernel') diff --git a/kernel/srcu.c b/kernel/srcu.c index a43211c92863..b9088524935a 100644 --- a/kernel/srcu.c +++ b/kernel/srcu.c @@ -266,16 +266,16 @@ EXPORT_SYMBOL_GPL(__srcu_read_unlock); * we repeatedly block for 1-millisecond time periods. This approach * has done well in testing, so there is no need for a config parameter. */ -#define SYNCHRONIZE_SRCU_READER_DELAY 5 +#define SYNCHRONIZE_SRCU_READER_DELAY 5 +#define SYNCHRONIZE_SRCU_TRYCOUNT 2 +#define SYNCHRONIZE_SRCU_EXP_TRYCOUNT 12 /* * Wait until all pre-existing readers complete. Such readers * will have used the index specified by "idx". */ -static void wait_idx(struct srcu_struct *sp, int idx, bool expedited) +static void wait_idx(struct srcu_struct *sp, int idx, int trycount) { - int trycount = 0; - /* * SRCU read-side critical sections are normally short, so wait * a small amount of time before possibly blocking. @@ -283,9 +283,10 @@ static void wait_idx(struct srcu_struct *sp, int idx, bool expedited) if (!srcu_readers_active_idx_check(sp, idx)) { udelay(SYNCHRONIZE_SRCU_READER_DELAY); while (!srcu_readers_active_idx_check(sp, idx)) { - if (expedited && ++ trycount < 10) + if (trycount > 0) { + trycount--; udelay(SYNCHRONIZE_SRCU_READER_DELAY); - else + } else schedule_timeout_interruptible(1); } } @@ -299,7 +300,7 @@ static void srcu_flip(struct srcu_struct *sp) /* * Helper function for synchronize_srcu() and synchronize_srcu_expedited(). 
*/ -static void __synchronize_srcu(struct srcu_struct *sp, bool expedited) +static void __synchronize_srcu(struct srcu_struct *sp, int trycount) { int busy_idx; @@ -319,8 +320,8 @@ static void __synchronize_srcu(struct srcu_struct *sp, bool expedited) * have completed: * * __synchronize_srcu() { - * wait_idx(sp, 0, expedited); - * wait_idx(sp, 1, expedited); + * wait_idx(sp, 0, trycount); + * wait_idx(sp, 1, trycount); * } * * Starvation is prevented by the fact that we flip the index. @@ -344,13 +345,13 @@ static void __synchronize_srcu(struct srcu_struct *sp, bool expedited) * this call to wait_idx(), which waits on really old readers * describe in this comment above, will almost never need to wait. */ - wait_idx(sp, 1 - busy_idx, expedited); + wait_idx(sp, 1 - busy_idx, trycount); /* Flip the index to avoid reader-induced starvation. */ srcu_flip(sp); /* Wait for recent pre-existing readers. */ - wait_idx(sp, busy_idx, expedited); + wait_idx(sp, busy_idx, trycount); mutex_unlock(&sp->mutex); } @@ -371,7 +372,7 @@ static void __synchronize_srcu(struct srcu_struct *sp, bool expedited) */ void synchronize_srcu(struct srcu_struct *sp) { - __synchronize_srcu(sp, 0); + __synchronize_srcu(sp, SYNCHRONIZE_SRCU_TRYCOUNT); } EXPORT_SYMBOL_GPL(synchronize_srcu); @@ -392,7 +393,7 @@ EXPORT_SYMBOL_GPL(synchronize_srcu); */ void synchronize_srcu_expedited(struct srcu_struct *sp) { - __synchronize_srcu(sp, 1); + __synchronize_srcu(sp, SYNCHRONIZE_SRCU_EXP_TRYCOUNT); } EXPORT_SYMBOL_GPL(synchronize_srcu_expedited); -- cgit v1.2.3 From 931ea9d1a6e06a5e3af03aa4aaaa7c7fd90e163f Mon Sep 17 00:00:00 2001 From: Lai Jiangshan Date: Mon, 19 Mar 2012 16:12:13 +0800 Subject: rcu: Implement per-domain single-threaded call_srcu() state machine This commit implements an SRCU state machine in support of call_srcu(). The state machine is preemptible, light-weight, and single-threaded, minimizing synchronization overhead. In particular, there is no longer any need for synchronize_srcu() to be guarded by a mutex. Expedited processing is handled, at least in the absence of concurrent grace-period operations on that same srcu_struct structure, by having the synchronize_srcu_expedited() thread take on the role of the workqueue thread for one iteration. There is a reasonable probability that a given SRCU callback will be invoked on the same CPU that registered it, however, there is no guarantee. Concurrent SRCU grace-period primitives can cause callbacks to be executed elsewhere, even in absence of CPU-hotplug operations. Callbacks execute in process context, but under the influence of local_bh_disable(), so it is illegal to sleep in an SRCU callback function. Signed-off-by: Lai Jiangshan Acked-by: Peter Zijlstra Signed-off-by: Paul E. McKenney --- kernel/srcu.c | 362 ++++++++++++++++++++++++++++++++++++++++++++++++---------- 1 file changed, 300 insertions(+), 62 deletions(-) (limited to 'kernel') diff --git a/kernel/srcu.c b/kernel/srcu.c index b9088524935a..2095be3318d5 100644 --- a/kernel/srcu.c +++ b/kernel/srcu.c @@ -34,10 +34,77 @@ #include #include +/* + * Initialize an rcu_batch structure to empty. + */ +static inline void rcu_batch_init(struct rcu_batch *b) +{ + b->head = NULL; + b->tail = &b->head; +} + +/* + * Enqueue a callback onto the tail of the specified rcu_batch structure. + */ +static inline void rcu_batch_queue(struct rcu_batch *b, struct rcu_head *head) +{ + *b->tail = head; + b->tail = &head->next; +} + +/* + * Is the specified rcu_batch structure empty? 
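The changelog above notes that SRCU callbacks run in process context under local_bh_disable(), so they must not sleep. A hedged sketch of how a user of the call_srcu() interface introduced by this commit might look, following the usual call_rcu() embedding idiom; struct foo, foo_srcu and foo_free_cb are illustrative names, not part of the patch. The rcu_head is embedded in the protected object so the callback can recover the whole object with container_of() and free it.

        struct foo {
                int data;
                struct rcu_head rcu;    /* embedded so container_of() can recover foo */
        };

        static struct srcu_struct foo_srcu;     /* set up elsewhere with init_srcu_struct() */

        /* Runs once all pre-existing foo_srcu readers have finished.
         * Process context with BHs disabled: kfree() is fine, sleeping is not. */
        static void foo_free_cb(struct rcu_head *head)
        {
                kfree(container_of(head, struct foo, rcu));
        }

        static void foo_release(struct foo *p)
        {
                call_srcu(&foo_srcu, &p->rcu, foo_free_cb);
        }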
+ */ +static inline bool rcu_batch_empty(struct rcu_batch *b) +{ + return b->tail == &b->head; +} + +/* + * Remove the callback at the head of the specified rcu_batch structure + * and return a pointer to it, or return NULL if the structure is empty. + */ +static inline struct rcu_head *rcu_batch_dequeue(struct rcu_batch *b) +{ + struct rcu_head *head; + + if (rcu_batch_empty(b)) + return NULL; + + head = b->head; + b->head = head->next; + if (b->tail == &head->next) + rcu_batch_init(b); + + return head; +} + +/* + * Move all callbacks from the rcu_batch structure specified by "from" to + * the structure specified by "to". + */ +static inline void rcu_batch_move(struct rcu_batch *to, struct rcu_batch *from) +{ + if (!rcu_batch_empty(from)) { + *to->tail = from->head; + to->tail = from->tail; + rcu_batch_init(from); + } +} + +/* single-thread state-machine */ +static void process_srcu(struct work_struct *work); + static int init_srcu_struct_fields(struct srcu_struct *sp) { sp->completed = 0; - mutex_init(&sp->mutex); + spin_lock_init(&sp->queue_lock); + sp->running = false; + rcu_batch_init(&sp->batch_queue); + rcu_batch_init(&sp->batch_check0); + rcu_batch_init(&sp->batch_check1); + rcu_batch_init(&sp->batch_done); + INIT_DELAYED_WORK(&sp->work, process_srcu); sp->per_cpu_ref = alloc_percpu(struct srcu_struct_array); return sp->per_cpu_ref ? 0 : -ENOMEM; } @@ -266,43 +333,86 @@ EXPORT_SYMBOL_GPL(__srcu_read_unlock); * we repeatedly block for 1-millisecond time periods. This approach * has done well in testing, so there is no need for a config parameter. */ -#define SYNCHRONIZE_SRCU_READER_DELAY 5 +#define SRCU_RETRY_CHECK_DELAY 5 #define SYNCHRONIZE_SRCU_TRYCOUNT 2 #define SYNCHRONIZE_SRCU_EXP_TRYCOUNT 12 /* - * Wait until all pre-existing readers complete. Such readers + * @@@ Wait until all pre-existing readers complete. Such readers * will have used the index specified by "idx". + * the caller should ensures the ->completed is not changed while checking + * and idx = (->completed & 1) ^ 1 */ -static void wait_idx(struct srcu_struct *sp, int idx, int trycount) +static bool try_check_zero(struct srcu_struct *sp, int idx, int trycount) { - /* - * SRCU read-side critical sections are normally short, so wait - * a small amount of time before possibly blocking. - */ - if (!srcu_readers_active_idx_check(sp, idx)) { - udelay(SYNCHRONIZE_SRCU_READER_DELAY); - while (!srcu_readers_active_idx_check(sp, idx)) { - if (trycount > 0) { - trycount--; - udelay(SYNCHRONIZE_SRCU_READER_DELAY); - } else - schedule_timeout_interruptible(1); - } + for (;;) { + if (srcu_readers_active_idx_check(sp, idx)) + return true; + if (--trycount <= 0) + return false; + udelay(SRCU_RETRY_CHECK_DELAY); } } +/* + * Increment the ->completed counter so that future SRCU readers will + * use the other rank of the ->c[] and ->seq[] arrays. This allows + * us to wait for pre-existing readers in a starvation-free manner. + */ static void srcu_flip(struct srcu_struct *sp) { sp->completed++; } +/* + * Enqueue an SRCU callback on the specified srcu_struct structure, + * initiating grace-period processing if it is not already running. 
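The rcu_batch helpers above are a singly linked list with a tail pointer that points at the last ->next field (or at ->head when empty), which keeps enqueue, the empty test, and "move everything" at O(1) without a length field. A self-contained user-space model of the same structure, with names changed so it does not clash with the kernel's:

        #include <assert.h>
        #include <stddef.h>

        struct node {
                struct node *next;
        };

        struct batch {
                struct node *head;
                struct node **tail;     /* &head when empty, else &last->next */
        };

        static void batch_init(struct batch *b)
        {
                b->head = NULL;
                b->tail = &b->head;
        }

        static int batch_empty(struct batch *b)
        {
                return b->tail == &b->head;
        }

        static void batch_queue(struct batch *b, struct node *n)
        {
                n->next = NULL;
                *b->tail = n;           /* link after the current last element */
                b->tail = &n->next;     /* tail now points at the new last ->next */
        }

        static struct node *batch_dequeue(struct batch *b)
        {
                struct node *n;

                if (batch_empty(b))
                        return NULL;
                n = b->head;
                b->head = n->next;
                if (b->tail == &n->next)        /* removed the last element */
                        batch_init(b);
                return n;
        }

        static void batch_move(struct batch *to, struct batch *from)
        {
                if (!batch_empty(from)) {
                        *to->tail = from->head;
                        to->tail = from->tail;
                        batch_init(from);
                }
        }

        int main(void)
        {
                struct batch a, b;
                struct node n1, n2;

                batch_init(&a);
                batch_init(&b);
                batch_queue(&a, &n1);
                batch_queue(&a, &n2);
                batch_move(&b, &a);
                assert(batch_empty(&a));
                assert(batch_dequeue(&b) == &n1);
                assert(batch_dequeue(&b) == &n2);
                assert(batch_empty(&b));
                return 0;
        }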
+ */ +void call_srcu(struct srcu_struct *sp, struct rcu_head *head, + void (*func)(struct rcu_head *head)) +{ + unsigned long flags; + + head->next = NULL; + head->func = func; + spin_lock_irqsave(&sp->queue_lock, flags); + rcu_batch_queue(&sp->batch_queue, head); + if (!sp->running) { + sp->running = true; + queue_delayed_work(system_nrt_wq, &sp->work, 0); + } + spin_unlock_irqrestore(&sp->queue_lock, flags); +} +EXPORT_SYMBOL_GPL(call_srcu); + +struct rcu_synchronize { + struct rcu_head head; + struct completion completion; +}; + +/* + * Awaken the corresponding synchronize_srcu() instance now that a + * grace period has elapsed. + */ +static void wakeme_after_rcu(struct rcu_head *head) +{ + struct rcu_synchronize *rcu; + + rcu = container_of(head, struct rcu_synchronize, head); + complete(&rcu->completion); +} + +static void srcu_advance_batches(struct srcu_struct *sp, int trycount); +static void srcu_reschedule(struct srcu_struct *sp); + /* * Helper function for synchronize_srcu() and synchronize_srcu_expedited(). */ static void __synchronize_srcu(struct srcu_struct *sp, int trycount) { - int busy_idx; + struct rcu_synchronize rcu; + struct rcu_head *head = &rcu.head; + bool done = false; rcu_lockdep_assert(!lock_is_held(&sp->dep_map) && !lock_is_held(&rcu_bh_lock_map) && @@ -310,50 +420,32 @@ static void __synchronize_srcu(struct srcu_struct *sp, int trycount) !lock_is_held(&rcu_sched_lock_map), "Illegal synchronize_srcu() in same-type SRCU (or RCU) read-side critical section"); - mutex_lock(&sp->mutex); - busy_idx = sp->completed & 0X1UL; - - /* - * If we recently flipped the index, there will be some readers - * using idx=0 and others using idx=1. Therefore, two calls to - * wait_idx()s suffice to ensure that all pre-existing readers - * have completed: - * - * __synchronize_srcu() { - * wait_idx(sp, 0, trycount); - * wait_idx(sp, 1, trycount); - * } - * - * Starvation is prevented by the fact that we flip the index. - * While we wait on one index to clear out, almost all new readers - * will be using the other index. The number of new readers using the - * index we are waiting on is sharply bounded by roughly the number - * of CPUs. - * - * How can new readers possibly using the old pre-flip value of - * the index? Consider the following sequence of events: - * - * Suppose that during the previous grace period, a reader - * picked up the old value of the index, but did not increment - * its counter until after the previous instance of - * __synchronize_srcu() did the counter summation and recheck. - * That previous grace period was OK because the reader did - * not start until after the grace period started, so the grace - * period was not obligated to wait for that reader. - * - * However, this sequence of events is quite improbable, so - * this call to wait_idx(), which waits on really old readers - * describe in this comment above, will almost never need to wait. - */ - wait_idx(sp, 1 - busy_idx, trycount); - - /* Flip the index to avoid reader-induced starvation. */ - srcu_flip(sp); - - /* Wait for recent pre-existing readers. 
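struct rcu_synchronize plus wakeme_after_rcu() is the standard way of building a blocking primitive on top of an asynchronous callback: embed a completion next to the rcu_head, post the callback, and sleep until it fires. A rough user-space analogue, using a POSIX semaphore in place of struct completion and a thread in place of the grace-period machinery; every name here is invented for the illustration (build with -pthread).

        #include <pthread.h>
        #include <semaphore.h>
        #include <stdio.h>

        struct my_head {
                void (*func)(struct my_head *);
        };

        /* Mirrors struct rcu_synchronize: callback head plus a "completion". */
        struct my_synchronize {
                struct my_head head;
                sem_t done;
        };

        static void wake_me(struct my_head *head)
        {
                /* head is the first member, so a cast works here; the kernel
                 * uses container_of() and does not rely on member order. */
                struct my_synchronize *s = (void *)head;

                sem_post(&s->done);
        }

        /* Stand-in for the grace-period machinery: invokes the callback
         * from another thread. */
        static void *fake_grace_period(void *arg)
        {
                struct my_head *head = arg;

                head->func(head);
                return NULL;
        }

        int main(void)
        {
                struct my_synchronize s = { .head.func = wake_me };
                pthread_t gp;

                sem_init(&s.done, 0, 0);
                pthread_create(&gp, NULL, fake_grace_period, &s.head); /* "call_srcu()" */
                sem_wait(&s.done);                                     /* "wait_for_completion()" */
                pthread_join(gp, NULL);
                puts("grace period done");
                return 0;
        }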
*/ - wait_idx(sp, busy_idx, trycount); + init_completion(&rcu.completion); + + head->next = NULL; + head->func = wakeme_after_rcu; + spin_lock_irq(&sp->queue_lock); + if (!sp->running) { + /* steal the processing owner */ + sp->running = true; + rcu_batch_queue(&sp->batch_check0, head); + spin_unlock_irq(&sp->queue_lock); + + srcu_advance_batches(sp, trycount); + if (!rcu_batch_empty(&sp->batch_done)) { + BUG_ON(sp->batch_done.head != head); + rcu_batch_dequeue(&sp->batch_done); + done = true; + } + /* give the processing owner to work_struct */ + srcu_reschedule(sp); + } else { + rcu_batch_queue(&sp->batch_queue, head); + spin_unlock_irq(&sp->queue_lock); + } - mutex_unlock(&sp->mutex); + if (!done) + wait_for_completion(&rcu.completion); } /** @@ -397,6 +489,15 @@ void synchronize_srcu_expedited(struct srcu_struct *sp) } EXPORT_SYMBOL_GPL(synchronize_srcu_expedited); +/** + * srcu_barrier - Wait until all in-flight call_srcu() callbacks complete. + */ +void srcu_barrier(struct srcu_struct *sp) +{ + synchronize_srcu(sp); +} +EXPORT_SYMBOL_GPL(srcu_barrier); + /** * srcu_batches_completed - return batches completed. * @sp: srcu_struct on which to report batch completion. @@ -404,9 +505,146 @@ EXPORT_SYMBOL_GPL(synchronize_srcu_expedited); * Report the number of batches, correlated with, but not necessarily * precisely the same as, the number of grace periods that have elapsed. */ - long srcu_batches_completed(struct srcu_struct *sp) { return sp->completed; } EXPORT_SYMBOL_GPL(srcu_batches_completed); + +#define SRCU_CALLBACK_BATCH 10 +#define SRCU_INTERVAL 1 + +/* + * Move any new SRCU callbacks to the first stage of the SRCU grace + * period pipeline. + */ +static void srcu_collect_new(struct srcu_struct *sp) +{ + if (!rcu_batch_empty(&sp->batch_queue)) { + spin_lock_irq(&sp->queue_lock); + rcu_batch_move(&sp->batch_check0, &sp->batch_queue); + spin_unlock_irq(&sp->queue_lock); + } +} + +/* + * Core SRCU state machine. Advance callbacks from ->batch_check0 to + * ->batch_check1 and then to ->batch_done as readers drain. + */ +static void srcu_advance_batches(struct srcu_struct *sp, int trycount) +{ + int idx = 1 ^ (sp->completed & 1); + + /* + * Because readers might be delayed for an extended period after + * fetching ->completed for their index, at any point in time there + * might well be readers using both idx=0 and idx=1. We therefore + * need to wait for readers to clear from both index values before + * invoking a callback. + */ + + if (rcu_batch_empty(&sp->batch_check0) && + rcu_batch_empty(&sp->batch_check1)) + return; /* no callbacks need to be advanced */ + + if (!try_check_zero(sp, idx, trycount)) + return; /* failed to advance, will try after SRCU_INTERVAL */ + + /* + * The callbacks in ->batch_check1 have already done with their + * first zero check and flip back when they were enqueued on + * ->batch_check0 in a previous invocation of srcu_advance_batches(). + * (Presumably try_check_zero() returned false during that + * invocation, leaving the callbacks stranded on ->batch_check1.) + * They are therefore ready to invoke, so move them to ->batch_done. + */ + rcu_batch_move(&sp->batch_done, &sp->batch_check1); + + if (rcu_batch_empty(&sp->batch_check0)) + return; /* no callbacks need to be advanced */ + srcu_flip(sp); + + /* + * The callbacks in ->batch_check0 just finished their + * first check zero and flip, so move them to ->batch_check1 + * for future checking on the other idx. 
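One detail of the new __synchronize_srcu() above worth calling out: when the state machine is idle, the synchronous caller "steals the processing owner" and runs a pass inline, which is how synchronize_srcu_expedited() gets low latency without a dedicated thread. A toy model of just that idiom, with invented names and none of the real batch handling:

        #include <pthread.h>
        #include <stdbool.h>
        #include <stdio.h>

        struct engine {
                pthread_mutex_t lock;
                bool running;
                int queued;
        };

        static void do_one_pass(struct engine *e)
        {
                printf("ran one pass inline (queued=%d)\n", e->queued);
        }

        static void schedule_worker(struct engine *e)
        {
                e->running = false;     /* in the real code the workqueue takes over here */
        }

        static void enqueue_request(struct engine *e)
        {
                e->queued++;
        }

        static void submit_sync(struct engine *e)
        {
                pthread_mutex_lock(&e->lock);
                if (!e->running) {
                        e->running = true;              /* steal the processing owner */
                        pthread_mutex_unlock(&e->lock);
                        do_one_pass(e);                 /* expedited path: run inline */
                        schedule_worker(e);             /* hand ownership back */
                } else {
                        enqueue_request(e);             /* worker is active: just queue */
                        pthread_mutex_unlock(&e->lock);
                }
        }

        int main(void)
        {
                struct engine e = { PTHREAD_MUTEX_INITIALIZER, false, 0 };

                submit_sync(&e);        /* idle: runs inline */
                e.running = true;       /* pretend a worker is busy */
                submit_sync(&e);        /* busy: request gets queued */
                printf("queued requests: %d\n", e.queued);
                return 0;
        }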
+ */ + rcu_batch_move(&sp->batch_check1, &sp->batch_check0); + + /* + * SRCU read-side critical sections are normally short, so check + * at least twice in quick succession after a flip. + */ + trycount = trycount < 2 ? 2 : trycount; + if (!try_check_zero(sp, idx^1, trycount)) + return; /* failed to advance, will try after SRCU_INTERVAL */ + + /* + * The callbacks in ->batch_check1 have now waited for all + * pre-existing readers using both idx values. They are therefore + * ready to invoke, so move them to ->batch_done. + */ + rcu_batch_move(&sp->batch_done, &sp->batch_check1); +} + +/* + * Invoke a limited number of SRCU callbacks that have passed through + * their grace period. If there are more to do, SRCU will reschedule + * the workqueue. + */ +static void srcu_invoke_callbacks(struct srcu_struct *sp) +{ + int i; + struct rcu_head *head; + + for (i = 0; i < SRCU_CALLBACK_BATCH; i++) { + head = rcu_batch_dequeue(&sp->batch_done); + if (!head) + break; + local_bh_disable(); + head->func(head); + local_bh_enable(); + } +} + +/* + * Finished one round of SRCU grace period. Start another if there are + * more SRCU callbacks queued, otherwise put SRCU into not-running state. + */ +static void srcu_reschedule(struct srcu_struct *sp) +{ + bool pending = true; + + if (rcu_batch_empty(&sp->batch_done) && + rcu_batch_empty(&sp->batch_check1) && + rcu_batch_empty(&sp->batch_check0) && + rcu_batch_empty(&sp->batch_queue)) { + spin_lock_irq(&sp->queue_lock); + if (rcu_batch_empty(&sp->batch_done) && + rcu_batch_empty(&sp->batch_check1) && + rcu_batch_empty(&sp->batch_check0) && + rcu_batch_empty(&sp->batch_queue)) { + sp->running = false; + pending = false; + } + spin_unlock_irq(&sp->queue_lock); + } + + if (pending) + queue_delayed_work(system_nrt_wq, &sp->work, SRCU_INTERVAL); +} + +/* + * This is the work-queue function that handles SRCU grace periods. + */ +static void process_srcu(struct work_struct *work) +{ + struct srcu_struct *sp; + + sp = container_of(work, struct srcu_struct, work.work); + + srcu_collect_new(sp); + srcu_advance_batches(sp, 1); + srcu_invoke_callbacks(sp); + srcu_reschedule(sp); +} -- cgit v1.2.3 From 9059c94017f748d9e20c3b089188a7abb27f6233 Mon Sep 17 00:00:00 2001 From: Lai Jiangshan Date: Mon, 19 Mar 2012 16:12:14 +0800 Subject: rcu: Add rcutorture test for call_srcu() Add srcu_torture_deferred_free() for srcu_ops so as to test the new call_srcu(). Rename the original srcu_ops to srcu_sync_ops. Signed-off-by: Lai Jiangshan Signed-off-by: Paul E. 
McKenney --- kernel/rcutorture.c | 44 ++++++++++++++++++++++++++++++++++++++++---- 1 file changed, 40 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/rcutorture.c b/kernel/rcutorture.c index d10b179dea83..e66b34ab7555 100644 --- a/kernel/rcutorture.c +++ b/kernel/rcutorture.c @@ -625,6 +625,11 @@ static int srcu_torture_completed(void) return srcu_batches_completed(&srcu_ctl); } +static void srcu_torture_deferred_free(struct rcu_torture *rp) +{ + call_srcu(&srcu_ctl, &rp->rtort_rcu, rcu_torture_cb); +} + static void srcu_torture_synchronize(void) { synchronize_srcu(&srcu_ctl); @@ -654,7 +659,7 @@ static struct rcu_torture_ops srcu_ops = { .read_delay = srcu_read_delay, .readunlock = srcu_torture_read_unlock, .completed = srcu_torture_completed, - .deferred_free = rcu_sync_torture_deferred_free, + .deferred_free = srcu_torture_deferred_free, .sync = srcu_torture_synchronize, .call = NULL, .cb_barrier = NULL, @@ -662,6 +667,21 @@ static struct rcu_torture_ops srcu_ops = { .name = "srcu" }; +static struct rcu_torture_ops srcu_sync_ops = { + .init = srcu_torture_init, + .cleanup = srcu_torture_cleanup, + .readlock = srcu_torture_read_lock, + .read_delay = srcu_read_delay, + .readunlock = srcu_torture_read_unlock, + .completed = srcu_torture_completed, + .deferred_free = rcu_sync_torture_deferred_free, + .sync = srcu_torture_synchronize, + .call = NULL, + .cb_barrier = NULL, + .stats = srcu_torture_stats, + .name = "srcu_sync" +}; + static int srcu_torture_read_lock_raw(void) __acquires(&srcu_ctl) { return srcu_read_lock_raw(&srcu_ctl); @@ -679,7 +699,7 @@ static struct rcu_torture_ops srcu_raw_ops = { .read_delay = srcu_read_delay, .readunlock = srcu_torture_read_unlock_raw, .completed = srcu_torture_completed, - .deferred_free = rcu_sync_torture_deferred_free, + .deferred_free = srcu_torture_deferred_free, .sync = srcu_torture_synchronize, .call = NULL, .cb_barrier = NULL, @@ -687,6 +707,21 @@ static struct rcu_torture_ops srcu_raw_ops = { .name = "srcu_raw" }; +static struct rcu_torture_ops srcu_raw_sync_ops = { + .init = srcu_torture_init, + .cleanup = srcu_torture_cleanup, + .readlock = srcu_torture_read_lock_raw, + .read_delay = srcu_read_delay, + .readunlock = srcu_torture_read_unlock_raw, + .completed = srcu_torture_completed, + .deferred_free = rcu_sync_torture_deferred_free, + .sync = srcu_torture_synchronize, + .call = NULL, + .cb_barrier = NULL, + .stats = srcu_torture_stats, + .name = "srcu_raw_sync" +}; + static void srcu_torture_synchronize_expedited(void) { synchronize_srcu_expedited(&srcu_ctl); @@ -1685,7 +1720,7 @@ static int rcu_torture_barrier_init(void) for (i = 0; i < n_barrier_cbs; i++) { init_waitqueue_head(&barrier_cbs_wq[i]); barrier_cbs_tasks[i] = kthread_run(rcu_torture_barrier_cbs, - (void *)i, + (void *)(long)i, "rcu_torture_barrier_cbs"); if (IS_ERR(barrier_cbs_tasks[i])) { ret = PTR_ERR(barrier_cbs_tasks[i]); @@ -1873,7 +1908,8 @@ rcu_torture_init(void) static struct rcu_torture_ops *torture_ops[] = { &rcu_ops, &rcu_sync_ops, &rcu_expedited_ops, &rcu_bh_ops, &rcu_bh_sync_ops, &rcu_bh_expedited_ops, - &srcu_ops, &srcu_raw_ops, &srcu_expedited_ops, + &srcu_ops, &srcu_sync_ops, &srcu_raw_ops, + &srcu_raw_sync_ops, &srcu_expedited_ops, &sched_ops, &sched_sync_ops, &sched_expedited_ops, }; mutex_lock(&fullstop_mutex); -- cgit v1.2.3 From 9fb48c744ba6a4bf58b666f4e6fdac3008ea1bd4 Mon Sep 17 00:00:00 2001 From: Jim Cromie Date: Fri, 27 Apr 2012 14:30:34 -0600 Subject: params: add 3rd arg to option handler callback signature Add a 3rd arg, named 
"doing", to unknown-options callbacks invoked from parse_args(). The arg is passed as: "Booting kernel" from start_kernel(), initcall_level_names[i] from do_initcall_level(), mod->name from load_module(), via parse_args(), parse_one() parse_args() already has the "name" parameter, which is renamed to "doing" to better reflect current uses 1,2 above. parse_args() passes it to an altered parse_one(), which now passes it down into the unknown option handler callbacks. The mod->name will be needed to handle dyndbg for loadable modules, since params passed by modprobe are not qualified (they do not have a "$modname." prefix), and by the time the unknown-param callback is called, the module name is not otherwise available. Minor tweaks: Add param-name to parse_one's pr_debug(), current message doesnt identify the param being handled, add it. Add a pr_info to print current level and level_name of the initcall, and number of registered initcalls at that level. This adds 7 lines to dmesg output, like: initlevel:6=device, 172 registered initcalls Drop "parameters" from initcall_level_names[], its unhelpful in the pr_info() added above. This array is passed into parse_args() by do_initcall_level(). CC: Rusty Russell Signed-off-by: Jim Cromie Acked-by: Jason Baron Signed-off-by: Greg Kroah-Hartman --- kernel/params.c | 31 +++++++++++++++++-------------- 1 file changed, 17 insertions(+), 14 deletions(-) (limited to 'kernel') diff --git a/kernel/params.c b/kernel/params.c index f37d82631347..b60e2c74b961 100644 --- a/kernel/params.c +++ b/kernel/params.c @@ -85,11 +85,13 @@ bool parameq(const char *a, const char *b) static int parse_one(char *param, char *val, + const char *doing, const struct kernel_param *params, unsigned num_params, s16 min_level, s16 max_level, - int (*handle_unknown)(char *param, char *val)) + int (*handle_unknown)(char *param, char *val, + const char *doing)) { unsigned int i; int err; @@ -104,8 +106,8 @@ static int parse_one(char *param, if (!val && params[i].ops->set != param_set_bool && params[i].ops->set != param_set_bint) return -EINVAL; - pr_debug("They are equal! Calling %p\n", - params[i].ops->set); + pr_debug("handling %s with %p\n", param, + params[i].ops->set); mutex_lock(¶m_lock); err = params[i].ops->set(val, ¶ms[i]); mutex_unlock(¶m_lock); @@ -114,11 +116,11 @@ static int parse_one(char *param, } if (handle_unknown) { - pr_debug("Unknown argument: calling %p\n", handle_unknown); - return handle_unknown(param, val); + pr_debug("doing %s: %s='%s'\n", doing, param, val); + return handle_unknown(param, val, doing); } - pr_debug("Unknown argument `%s'\n", param); + pr_debug("Unknown argument '%s'\n", param); return -ENOENT; } @@ -175,28 +177,29 @@ static char *next_arg(char *args, char **param, char **val) } /* Args looks like "foo=bar,bar2 baz=fuz wiz". 
*/ -int parse_args(const char *name, +int parse_args(const char *doing, char *args, const struct kernel_param *params, unsigned num, s16 min_level, s16 max_level, - int (*unknown)(char *param, char *val)) + int (*unknown)(char *param, char *val, const char *doing)) { char *param, *val; - pr_debug("Parsing ARGS: %s\n", args); - /* Chew leading spaces */ args = skip_spaces(args); + if (args && *args) + pr_debug("doing %s, parsing ARGS: '%s'\n", doing, args); + while (*args) { int ret; int irq_was_disabled; args = next_arg(args, ¶m, &val); irq_was_disabled = irqs_disabled(); - ret = parse_one(param, val, params, num, + ret = parse_one(param, val, doing, params, num, min_level, max_level, unknown); if (irq_was_disabled && !irqs_disabled()) { printk(KERN_WARNING "parse_args(): option '%s' enabled " @@ -205,19 +208,19 @@ int parse_args(const char *name, switch (ret) { case -ENOENT: printk(KERN_ERR "%s: Unknown parameter `%s'\n", - name, param); + doing, param); return ret; case -ENOSPC: printk(KERN_ERR "%s: `%s' too large for parameter `%s'\n", - name, val ?: "", param); + doing, val ?: "", param); return ret; case 0: break; default: printk(KERN_ERR "%s: `%s' invalid for parameter `%s'\n", - name, val ?: "", param); + doing, val ?: "", param); return ret; } } -- cgit v1.2.3 From b48420c1d3019ce8d84fb8e58f4ca86b8e3655b8 Mon Sep 17 00:00:00 2001 From: Jim Cromie Date: Fri, 27 Apr 2012 14:30:35 -0600 Subject: dynamic_debug: make dynamic-debug work for module initialization This introduces a fake module param $module.dyndbg. Its based upon Thomas Renninger's $module.ddebug boot-time debugging patch from https://lkml.org/lkml/2010/9/15/397 The 'fake' module parameter is provided for all modules, whether or not they need it. It is not explicitly added to each module, but is implemented in callbacks invoked from parse_args. For builtin modules, dynamic_debug_init() now directly calls parse_args(..., &ddebug_dyndbg_boot_params_cb), to process the params undeclared in the modules, just after the ddebug tables are processed. While its slightly weird to reprocess the boot params, parse_args() is already called repeatedly by do_initcall_levels(). More importantly, the dyndbg queries (given in ddebug_query or dyndbg params) cannot be activated until after the ddebug tables are ready, and reusing parse_args is cleaner than doing an ad-hoc parse. This reparse would break options like inc_verbosity, but they probably should be params, like verbosity=3. ddebug_dyndbg_boot_params_cb() handles both bare dyndbg (aka: ddebug_query) and module-prefixed dyndbg params, and ignores all other parameters. For example, the following will enable pr_debug()s in 4 builtin modules, in the order given: dyndbg="module params +p; module aio +p" module.dyndbg=+p pci.dyndbg For loadable modules, parse_args() in load_module() calls ddebug_dyndbg_module_params_cb(). This handles bare dyndbg params as passed from modprobe, and errors on other unknown params. Note that modprobe reads /proc/cmdline, so "modprobe foo" grabs all foo.params, strips the "foo.", and passes these to the kernel. ddebug_dyndbg_module_params_cb() is again called for the unknown params; it handles dyndbg, and errors on others. The "doing" arg added previously contains the module name. For non CONFIG_DYNAMIC_DEBUG builds, the stub function accepts and ignores $module.dyndbg params, other unknowns get -ENOENT. If no param value is given (as in pci.dyndbg example above), "+p" is assumed, which enables all pr_debug callsites in the module. 
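The new third argument threads a context string ("Booting kernel", an initcall level name, or mod->name) down to the unknown-option callback, which is what lets the dynamic-debug code described above tell which module a bare "dyndbg" parameter belongs to. A hedged sketch of a handler with the new signature; my_unknown_param_cb is an illustrative name and is not the ddebug_dyndbg_*_cb callback the patches actually add.

        /* Illustrative only: a handler matching the new callback signature
         *   int (*handle_unknown)(char *param, char *val, const char *doing)
         * where "doing" names what is being parsed. */
        static int my_unknown_param_cb(char *param, char *val, const char *doing)
        {
                if (!strcmp(param, "dyndbg")) {
                        pr_debug("%s: dyndbg query '%s'\n", doing, val ? val : "+p");
                        return 0;       /* handled */
                }
                return -ENOENT;         /* let parse_args() report it as unknown */
        }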
The dyndbg fake parameter is not shown in /sys/module/*/parameters, thus it does not use any resources. Changes to it are made via the control file. Also change pr_info in ddebug_exec_queries to vpr_info, no need to see it all the time. Signed-off-by: Jim Cromie CC: Thomas Renninger CC: Rusty Russell Acked-by: Jason Baron Signed-off-by: Greg Kroah-Hartman --- kernel/module.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/module.c b/kernel/module.c index 78ac6ec1e425..a4e60973ca73 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -2953,7 +2953,7 @@ static struct module *load_module(void __user *umod, /* Module is ready to execute: parsing args may do that. */ err = parse_args(mod->name, mod->args, mod->kp, mod->num_kp, - -32768, 32767, NULL); + -32768, 32767, &ddebug_dyndbg_module_param_cb); if (err < 0) goto unlink; -- cgit v1.2.3 From f511fc624642f0bb8cf65aaa28979737514d4746 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 15 Mar 2012 12:16:26 -0700 Subject: rcu: Ensure that RCU_FAST_NO_HZ timers expire on correct CPU Timers are subject to migration, which can lead to the following system-hang scenario when CONFIG_RCU_FAST_NO_HZ=y: 1. CPU 0 executes synchronize_rcu(), which posts an RCU callback. 2. CPU 0 then goes idle. It cannot immediately invoke the callback, but there is nothing RCU needs from ti, so it enters dyntick-idle mode after posting a timer. 3. The timer gets migrated to CPU 1. 4. CPU 0 never wakes up, so the synchronize_rcu() never returns, so the system hangs. This commit fixes this problem by using mod_timer_pinned(), as suggested by Peter Zijlstra, to ensure that the timer is actually posted on the running CPU. Reported-by: Dipankar Sarma Signed-off-by: Paul E. McKenney Signed-off-by: Paul E. McKenney --- kernel/rcutree_plugin.h | 13 ++++++++----- 1 file changed, 8 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h index ad61da79b311..d01e26df55a1 100644 --- a/kernel/rcutree_plugin.h +++ b/kernel/rcutree_plugin.h @@ -2110,6 +2110,8 @@ static void rcu_cleanup_after_idle(int cpu) */ static void rcu_prepare_for_idle(int cpu) { + struct timer_list *tp; + /* * If this is an idle re-entry, for example, due to use of * RCU_NONIDLE() or the new idle-loop tracing API within the idle @@ -2121,9 +2123,10 @@ static void rcu_prepare_for_idle(int cpu) if (!per_cpu(rcu_idle_first_pass, cpu) && (per_cpu(rcu_nonlazy_posted, cpu) == per_cpu(rcu_nonlazy_posted_snap, cpu))) { - if (rcu_cpu_has_callbacks(cpu)) - mod_timer(&per_cpu(rcu_idle_gp_timer, cpu), - per_cpu(rcu_idle_gp_timer_expires, cpu)); + if (rcu_cpu_has_callbacks(cpu)) { + tp = &per_cpu(rcu_idle_gp_timer, cpu); + mod_timer_pinned(tp, per_cpu(rcu_idle_gp_timer_expires, cpu)); + } return; } per_cpu(rcu_idle_first_pass, cpu) = 0; @@ -2167,8 +2170,8 @@ static void rcu_prepare_for_idle(int cpu) else per_cpu(rcu_idle_gp_timer_expires, cpu) = jiffies + RCU_IDLE_LAZY_GP_DELAY; - mod_timer(&per_cpu(rcu_idle_gp_timer, cpu), - per_cpu(rcu_idle_gp_timer_expires, cpu)); + tp = &per_cpu(rcu_idle_gp_timer, cpu); + mod_timer_pinned(tp, per_cpu(rcu_idle_gp_timer_expires, cpu)); per_cpu(rcu_nonlazy_posted_snap, cpu) = per_cpu(rcu_nonlazy_posted, cpu); return; /* Nothing more to do immediately. */ -- cgit v1.2.3 From 5a21d489fd9541a4a66b9a500659abaca1b19a51 Mon Sep 17 00:00:00 2001 From: Bojan Smojver Date: Sun, 29 Apr 2012 22:42:06 +0200 Subject: PM / Hibernate: Hibernate/thaw fixes/improvements 1. 
Do not allocate memory for buffers from emergency pools, unless absolutely required. Do not warn about and do not retry non-essential failed allocations. 2. Do not check the amount of free pages left on every single page write, but wait until one map is completely populated and then check. 3. Set maximum number of pages for read buffering consistently, instead of inadvertently depending on the size of the sector type. 4. Fix copyright line, which I missed when I submitted the hibernation threading patch. 5. Dispense with bit shifting arithmetic to improve readability. 6. Really recalculate the number of pages required to be free after all allocations have been done. 7. Fix calculation of pages required for read buffering. Only count in pages that do not belong to high memory. Signed-off-by: Bojan Smojver Signed-off-by: Rafael J. Wysocki --- kernel/power/swap.c | 62 +++++++++++++++++++++++++++++++++-------------------- 1 file changed, 39 insertions(+), 23 deletions(-) (limited to 'kernel') diff --git a/kernel/power/swap.c b/kernel/power/swap.c index eef311a58a64..11e22c068e8b 100644 --- a/kernel/power/swap.c +++ b/kernel/power/swap.c @@ -6,7 +6,7 @@ * * Copyright (C) 1998,2001-2005 Pavel Machek * Copyright (C) 2006 Rafael J. Wysocki - * Copyright (C) 2010 Bojan Smojver + * Copyright (C) 2010-2012 Bojan Smojver * * This file is released under the GPLv2. * @@ -282,14 +282,17 @@ static int write_page(void *buf, sector_t offset, struct bio **bio_chain) return -ENOSPC; if (bio_chain) { - src = (void *)__get_free_page(__GFP_WAIT | __GFP_HIGH); + src = (void *)__get_free_page(__GFP_WAIT | __GFP_NOWARN | + __GFP_NORETRY); if (src) { copy_page(src, buf); } else { ret = hib_wait_on_bio_chain(bio_chain); /* Free pages */ if (ret) return ret; - src = (void *)__get_free_page(__GFP_WAIT | __GFP_HIGH); + src = (void *)__get_free_page(__GFP_WAIT | + __GFP_NOWARN | + __GFP_NORETRY); if (src) { copy_page(src, buf); } else { @@ -367,12 +370,17 @@ static int swap_write_page(struct swap_map_handle *handle, void *buf, clear_page(handle->cur); handle->cur_swap = offset; handle->k = 0; - } - if (bio_chain && low_free_pages() <= handle->reqd_free_pages) { - error = hib_wait_on_bio_chain(bio_chain); - if (error) - goto out; - handle->reqd_free_pages = reqd_free_pages(); + + if (bio_chain && low_free_pages() <= handle->reqd_free_pages) { + error = hib_wait_on_bio_chain(bio_chain); + if (error) + goto out; + /* + * Recalculate the number of required free pages, to + * make sure we never take more than half. + */ + handle->reqd_free_pages = reqd_free_pages(); + } } out: return error; @@ -419,8 +427,9 @@ static int swap_writer_finish(struct swap_map_handle *handle, /* Maximum number of threads for compression/decompression. */ #define LZO_THREADS 3 -/* Maximum number of pages for read buffering. */ -#define LZO_READ_PAGES (MAP_PAGE_ENTRIES * 8) +/* Minimum/maximum number of pages for read buffering. */ +#define LZO_MIN_RD_PAGES 1024 +#define LZO_MAX_RD_PAGES 8192 /** @@ -630,12 +639,6 @@ static int save_image_lzo(struct swap_map_handle *handle, } } - /* - * Adjust number of free pages after all allocations have been done. - * We don't want to run out of pages when writing. - */ - handle->reqd_free_pages = reqd_free_pages(); - /* * Start the CRC32 thread. */ @@ -657,6 +660,12 @@ static int save_image_lzo(struct swap_map_handle *handle, goto out_clean; } + /* + * Adjust the number of required free pages after all allocations have + * been done. We don't want to run out of pages when writing. 
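The write_page() hunk above is the shape of item 1 in the changelog: attempt the allocation without touching emergency reserves, without warning, and without retrying, and only if that fails wait for the outstanding I/O (which frees pages) before trying once more. A condensed, hedged restatement of the pattern; get_bounce_page() is an invented wrapper name, not a function from the patch.

        /* Try a "cheap" page first; on failure, let in-flight bios complete
         * (freeing their pages) and try exactly once more. */
        static void *get_bounce_page(struct bio **bio_chain, int *err)
        {
                void *page;

                page = (void *)__get_free_page(__GFP_WAIT | __GFP_NOWARN | __GFP_NORETRY);
                if (page)
                        return page;

                *err = hib_wait_on_bio_chain(bio_chain);        /* waits for I/O, frees pages */
                if (*err)
                        return NULL;

                return (void *)__get_free_page(__GFP_WAIT | __GFP_NOWARN | __GFP_NORETRY);
        }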
+ */ + handle->reqd_free_pages = reqd_free_pages(); + printk(KERN_INFO "PM: Using %u thread(s) for compression.\n" "PM: Compressing and saving image data (%u pages) ... ", @@ -1067,7 +1076,7 @@ static int load_image_lzo(struct swap_map_handle *handle, unsigned i, thr, run_threads, nr_threads; unsigned ring = 0, pg = 0, ring_size = 0, have = 0, want, need, asked = 0; - unsigned long read_pages; + unsigned long read_pages = 0; unsigned char **page = NULL; struct dec_data *data = NULL; struct crc_data *crc = NULL; @@ -1079,7 +1088,7 @@ static int load_image_lzo(struct swap_map_handle *handle, nr_threads = num_online_cpus() - 1; nr_threads = clamp_val(nr_threads, 1, LZO_THREADS); - page = vmalloc(sizeof(*page) * LZO_READ_PAGES); + page = vmalloc(sizeof(*page) * LZO_MAX_RD_PAGES); if (!page) { printk(KERN_ERR "PM: Failed to allocate LZO page\n"); ret = -ENOMEM; @@ -1144,15 +1153,22 @@ static int load_image_lzo(struct swap_map_handle *handle, } /* - * Adjust number of pages for read buffering, in case we are short. + * Set the number of pages for read buffering. + * This is complete guesswork, because we'll only know the real + * picture once prepare_image() is called, which is much later on + * during the image load phase. We'll assume the worst case and + * say that none of the image pages are from high memory. */ - read_pages = (nr_free_pages() - snapshot_get_image_size()) >> 1; - read_pages = clamp_val(read_pages, LZO_CMP_PAGES, LZO_READ_PAGES); + if (low_free_pages() > snapshot_get_image_size()) + read_pages = (low_free_pages() - snapshot_get_image_size()) / 2; + read_pages = clamp_val(read_pages, LZO_MIN_RD_PAGES, LZO_MAX_RD_PAGES); for (i = 0; i < read_pages; i++) { page[i] = (void *)__get_free_page(i < LZO_CMP_PAGES ? __GFP_WAIT | __GFP_HIGH : - __GFP_WAIT); + __GFP_WAIT | __GFP_NOWARN | + __GFP_NORETRY); + if (!page[i]) { if (i < LZO_CMP_PAGES) { ring_size = i; -- cgit v1.2.3 From 7483b4a4d9abf9dcf1ffe6e805ead2847ec3264e Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Sun, 29 Apr 2012 22:53:22 +0200 Subject: PM / Sleep: Implement opportunistic sleep, v2 Introduce a mechanism by which the kernel can trigger global transitions to a sleep state chosen by user space if there are no active wakeup sources. It consists of a new sysfs attribute, /sys/power/autosleep, that can be written one of the strings returned by reads from /sys/power/state, an ordered workqueue and a work item carrying out the "suspend" operations. If a string representing the system's sleep state is written to /sys/power/autosleep, the work item triggering transitions to that state is queued up and it requeues itself after every execution until user space writes "off" to /sys/power/autosleep. That work item enables the detection of wakeup events using the functions already defined in drivers/base/power/wakeup.c (with one small modification) and calls either pm_suspend(), or hibernate() to put the system into a sleep state. If a wakeup event is reported while the transition is in progress, it will abort the transition and the "system suspend" work item will be queued up again. Signed-off-by: Rafael J. 
Wysocki Acked-by: Greg Kroah-Hartman Reviewed-by: NeilBrown --- kernel/power/Kconfig | 8 +++ kernel/power/Makefile | 1 + kernel/power/autosleep.c | 123 +++++++++++++++++++++++++++++++++++++++++++++++ kernel/power/main.c | 119 +++++++++++++++++++++++++++++++++++++-------- kernel/power/power.h | 18 +++++++ 5 files changed, 250 insertions(+), 19 deletions(-) create mode 100644 kernel/power/autosleep.c (limited to 'kernel') diff --git a/kernel/power/Kconfig b/kernel/power/Kconfig index deb5461e3216..67947083f842 100644 --- a/kernel/power/Kconfig +++ b/kernel/power/Kconfig @@ -103,6 +103,14 @@ config PM_SLEEP_SMP select HOTPLUG select HOTPLUG_CPU +config PM_AUTOSLEEP + bool "Opportunistic sleep" + depends on PM_SLEEP + default n + ---help--- + Allow the kernel to trigger a system transition into a global sleep + state automatically whenever there are no active wakeup sources. + config PM_RUNTIME bool "Run-time PM core functionality" depends on !IA64_HP_SIM diff --git a/kernel/power/Makefile b/kernel/power/Makefile index 66d808ec5252..010b2f7e148c 100644 --- a/kernel/power/Makefile +++ b/kernel/power/Makefile @@ -9,5 +9,6 @@ obj-$(CONFIG_SUSPEND) += suspend.o obj-$(CONFIG_PM_TEST_SUSPEND) += suspend_test.o obj-$(CONFIG_HIBERNATION) += hibernate.o snapshot.o swap.o user.o \ block_io.o +obj-$(CONFIG_PM_AUTOSLEEP) += autosleep.o obj-$(CONFIG_MAGIC_SYSRQ) += poweroff.o diff --git a/kernel/power/autosleep.c b/kernel/power/autosleep.c new file mode 100644 index 000000000000..42348e3589d3 --- /dev/null +++ b/kernel/power/autosleep.c @@ -0,0 +1,123 @@ +/* + * kernel/power/autosleep.c + * + * Opportunistic sleep support. + * + * Copyright (C) 2012 Rafael J. Wysocki + */ + +#include +#include +#include + +#include "power.h" + +static suspend_state_t autosleep_state; +static struct workqueue_struct *autosleep_wq; +/* + * Note: it is only safe to mutex_lock(&autosleep_lock) if a wakeup_source + * is active, otherwise a deadlock with try_to_suspend() is possible. + * Alternatively mutex_lock_interruptible() can be used. This will then fail + * if an auto_sleep cycle tries to freeze processes. + */ +static DEFINE_MUTEX(autosleep_lock); +static struct wakeup_source *autosleep_ws; + +static void try_to_suspend(struct work_struct *work) +{ + unsigned int initial_count, final_count; + + if (!pm_get_wakeup_count(&initial_count, true)) + goto out; + + mutex_lock(&autosleep_lock); + + if (!pm_save_wakeup_count(initial_count)) { + mutex_unlock(&autosleep_lock); + goto out; + } + + if (autosleep_state == PM_SUSPEND_ON) { + mutex_unlock(&autosleep_lock); + return; + } + if (autosleep_state >= PM_SUSPEND_MAX) + hibernate(); + else + pm_suspend(autosleep_state); + + mutex_unlock(&autosleep_lock); + + if (!pm_get_wakeup_count(&final_count, false)) + goto out; + + /* + * If the wakeup occured for an unknown reason, wait to prevent the + * system from trying to suspend and waking up in a tight loop. 
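The comparison of the wakeup counts before and after the suspend attempt is what keeps autosleep from spinning: if the count did not change, the wakeup had no identifiable source, so the loop backs off before requeueing itself. A toy user-space model of just that handshake; everything here is invented, and the real code uses pm_get_wakeup_count()/pm_save_wakeup_count() as shown above.

        #include <stdio.h>
        #include <unistd.h>

        /* Nothing increments this counter in the toy, so every iteration
         * looks like a wakeup "for an unknown reason" and takes the
         * rate-limiting path. Real drivers would bump it on wakeup events. */
        static unsigned int wakeup_events;

        static void attempt_sleep(void)
        {
                printf("suspending...\n");      /* pm_suspend() or hibernate() would go here */
        }

        static void autosleep_iteration(void)
        {
                unsigned int initial = wakeup_events;   /* snapshot before suspending */
                unsigned int final;

                attempt_sleep();

                final = wakeup_events;
                if (final == initial)
                        sleep(1);       /* no identifiable wakeup source: back off briefly */
        }

        int main(void)
        {
                autosleep_iteration();
                return 0;
        }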
+ */ + if (final_count == initial_count) + schedule_timeout_uninterruptible(HZ / 2); + + out: + queue_up_suspend_work(); +} + +static DECLARE_WORK(suspend_work, try_to_suspend); + +void queue_up_suspend_work(void) +{ + if (!work_pending(&suspend_work) && autosleep_state > PM_SUSPEND_ON) + queue_work(autosleep_wq, &suspend_work); +} + +suspend_state_t pm_autosleep_state(void) +{ + return autosleep_state; +} + +int pm_autosleep_lock(void) +{ + return mutex_lock_interruptible(&autosleep_lock); +} + +void pm_autosleep_unlock(void) +{ + mutex_unlock(&autosleep_lock); +} + +int pm_autosleep_set_state(suspend_state_t state) +{ + +#ifndef CONFIG_HIBERNATION + if (state >= PM_SUSPEND_MAX) + return -EINVAL; +#endif + + __pm_stay_awake(autosleep_ws); + + mutex_lock(&autosleep_lock); + + autosleep_state = state; + + __pm_relax(autosleep_ws); + + if (state > PM_SUSPEND_ON) + queue_up_suspend_work(); + + mutex_unlock(&autosleep_lock); + return 0; +} + +int __init pm_autosleep_init(void) +{ + autosleep_ws = wakeup_source_register("autosleep"); + if (!autosleep_ws) + return -ENOMEM; + + autosleep_wq = alloc_ordered_workqueue("autosleep", 0); + if (autosleep_wq) + return 0; + + wakeup_source_unregister(autosleep_ws); + return -ENOMEM; +} diff --git a/kernel/power/main.c b/kernel/power/main.c index 1c12581f1c62..ba6a5645952d 100644 --- a/kernel/power/main.c +++ b/kernel/power/main.c @@ -269,8 +269,7 @@ static ssize_t state_show(struct kobject *kobj, struct kobj_attribute *attr, return (s - buf); } -static ssize_t state_store(struct kobject *kobj, struct kobj_attribute *attr, - const char *buf, size_t n) +static suspend_state_t decode_state(const char *buf, size_t n) { #ifdef CONFIG_SUSPEND suspend_state_t state = PM_SUSPEND_STANDBY; @@ -278,27 +277,48 @@ static ssize_t state_store(struct kobject *kobj, struct kobj_attribute *attr, #endif char *p; int len; - int error = -EINVAL; p = memchr(buf, '\n', n); len = p ? p - buf : n; - /* First, check if we are requested to hibernate */ - if (len == 4 && !strncmp(buf, "disk", len)) { - error = hibernate(); - goto Exit; - } + /* Check hibernation first. */ + if (len == 4 && !strncmp(buf, "disk", len)) + return PM_SUSPEND_MAX; #ifdef CONFIG_SUSPEND - for (s = &pm_states[state]; state < PM_SUSPEND_MAX; s++, state++) { - if (*s && len == strlen(*s) && !strncmp(buf, *s, len)) { - error = pm_suspend(state); - break; - } - } + for (s = &pm_states[state]; state < PM_SUSPEND_MAX; s++, state++) + if (*s && len == strlen(*s) && !strncmp(buf, *s, len)) + return state; #endif - Exit: + return PM_SUSPEND_ON; +} + +static ssize_t state_store(struct kobject *kobj, struct kobj_attribute *attr, + const char *buf, size_t n) +{ + suspend_state_t state; + int error; + + error = pm_autosleep_lock(); + if (error) + return error; + + if (pm_autosleep_state() > PM_SUSPEND_ON) { + error = -EBUSY; + goto out; + } + + state = decode_state(buf, n); + if (state < PM_SUSPEND_MAX) + error = pm_suspend(state); + else if (state == PM_SUSPEND_MAX) + error = hibernate(); + else + error = -EINVAL; + + out: + pm_autosleep_unlock(); return error ? error : n; } @@ -339,7 +359,8 @@ static ssize_t wakeup_count_show(struct kobject *kobj, { unsigned int val; - return pm_get_wakeup_count(&val) ? sprintf(buf, "%u\n", val) : -EINTR; + return pm_get_wakeup_count(&val, true) ? 
+ sprintf(buf, "%u\n", val) : -EINTR; } static ssize_t wakeup_count_store(struct kobject *kobj, @@ -347,15 +368,69 @@ static ssize_t wakeup_count_store(struct kobject *kobj, const char *buf, size_t n) { unsigned int val; + int error; + + error = pm_autosleep_lock(); + if (error) + return error; + if (pm_autosleep_state() > PM_SUSPEND_ON) { + error = -EBUSY; + goto out; + } + + error = -EINVAL; if (sscanf(buf, "%u", &val) == 1) { if (pm_save_wakeup_count(val)) - return n; + error = n; } - return -EINVAL; + + out: + pm_autosleep_unlock(); + return error; } power_attr(wakeup_count); + +#ifdef CONFIG_PM_AUTOSLEEP +static ssize_t autosleep_show(struct kobject *kobj, + struct kobj_attribute *attr, + char *buf) +{ + suspend_state_t state = pm_autosleep_state(); + + if (state == PM_SUSPEND_ON) + return sprintf(buf, "off\n"); + +#ifdef CONFIG_SUSPEND + if (state < PM_SUSPEND_MAX) + return sprintf(buf, "%s\n", valid_state(state) ? + pm_states[state] : "error"); +#endif +#ifdef CONFIG_HIBERNATION + return sprintf(buf, "disk\n"); +#else + return sprintf(buf, "error"); +#endif +} + +static ssize_t autosleep_store(struct kobject *kobj, + struct kobj_attribute *attr, + const char *buf, size_t n) +{ + suspend_state_t state = decode_state(buf, n); + int error; + + if (state == PM_SUSPEND_ON + && !(strncmp(buf, "off", 3) && strncmp(buf, "off\n", 4))) + return -EINVAL; + + error = pm_autosleep_set_state(state); + return error ? error : n; +} + +power_attr(autosleep); +#endif /* CONFIG_PM_AUTOSLEEP */ #endif /* CONFIG_PM_SLEEP */ #ifdef CONFIG_PM_TRACE @@ -409,6 +484,9 @@ static struct attribute * g[] = { #ifdef CONFIG_PM_SLEEP &pm_async_attr.attr, &wakeup_count_attr.attr, +#ifdef CONFIG_PM_AUTOSLEEP + &autosleep_attr.attr, +#endif #ifdef CONFIG_PM_DEBUG &pm_test_attr.attr, #endif @@ -444,7 +522,10 @@ static int __init pm_init(void) power_kobj = kobject_create_and_add("power", NULL); if (!power_kobj) return -ENOMEM; - return sysfs_create_group(power_kobj, &attr_group); + error = sysfs_create_group(power_kobj, &attr_group); + if (error) + return error; + return pm_autosleep_init(); } core_initcall(pm_init); diff --git a/kernel/power/power.h b/kernel/power/power.h index 98f3622d7407..4cf80fa115d9 100644 --- a/kernel/power/power.h +++ b/kernel/power/power.h @@ -264,3 +264,21 @@ static inline void suspend_thaw_processes(void) { } #endif + +#ifdef CONFIG_PM_AUTOSLEEP + +/* kernel/power/autosleep.c */ +extern int pm_autosleep_init(void); +extern int pm_autosleep_lock(void); +extern void pm_autosleep_unlock(void); +extern suspend_state_t pm_autosleep_state(void); +extern int pm_autosleep_set_state(suspend_state_t state); + +#else /* !CONFIG_PM_AUTOSLEEP */ + +static inline int pm_autosleep_init(void) { return 0; } +static inline int pm_autosleep_lock(void) { return 0; } +static inline void pm_autosleep_unlock(void) {} +static inline suspend_state_t pm_autosleep_state(void) { return PM_SUSPEND_ON; } + +#endif /* !CONFIG_PM_AUTOSLEEP */ -- cgit v1.2.3 From 55850945e872531644f31fefd217d61dd15dcab8 Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Sun, 29 Apr 2012 22:53:32 +0200 Subject: PM / Sleep: Add "prevent autosleep time" statistics to wakeup sources Android uses one wakelock statistics that is only necessary for opportunistic sleep. Namely, the prevent_suspend_time field accumulates the total time the given wakelock has been locked while "automatic suspend" was enabled. Add an analogous field, prevent_sleep_time, to wakeup sources and make it behave in a similar way. Signed-off-by: Rafael J. 
Wysocki Acked-by: Greg Kroah-Hartman --- kernel/power/autosleep.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/power/autosleep.c b/kernel/power/autosleep.c index 42348e3589d3..ca304046d9e2 100644 --- a/kernel/power/autosleep.c +++ b/kernel/power/autosleep.c @@ -101,8 +101,12 @@ int pm_autosleep_set_state(suspend_state_t state) __pm_relax(autosleep_ws); - if (state > PM_SUSPEND_ON) + if (state > PM_SUSPEND_ON) { + pm_wakep_autosleep_enabled(true); queue_up_suspend_work(); + } else { + pm_wakep_autosleep_enabled(false); + } mutex_unlock(&autosleep_lock); return 0; -- cgit v1.2.3 From b86ff9820fd5df69295273b9aa68e58786ffc23f Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Sun, 29 Apr 2012 22:53:42 +0200 Subject: PM / Sleep: Add user space interface for manipulating wakeup sources, v3 Android allows user space to manipulate wakelocks using two sysfs file located in /sys/power/, wake_lock and wake_unlock. Writing a wakelock name and optionally a timeout to the wake_lock file causes the wakelock whose name was written to be acquired (it is created before is necessary), optionally with the given timeout. Writing the name of a wakelock to wake_unlock causes that wakelock to be released. Implement an analogous interface for user space using wakeup sources. Add the /sys/power/wake_lock and /sys/power/wake_unlock files allowing user space to create, activate and deactivate wakeup sources, such that writing a name and optionally a timeout to wake_lock causes the wakeup source of that name to be activated, optionally with the given timeout. If that wakeup source doesn't exist, it will be created and then activated. Writing a name to wake_unlock causes the wakeup source of that name, if there is one, to be deactivated. Wakeup sources created with the help of wake_lock that haven't been used for more than 5 minutes are garbage collected and destroyed. Moreover, there can be only WL_NUMBER_LIMIT wakeup sources created with the help of wake_lock present at a time. The data type used to track wakeup sources created by user space is called "struct wakelock" to indicate the origins of this feature. This version of the patch includes an rbtree manipulation fix from John Stultz. Signed-off-by: Rafael J. Wysocki Acked-by: Greg Kroah-Hartman Reviewed-by: NeilBrown --- kernel/power/Kconfig | 8 ++ kernel/power/Makefile | 1 + kernel/power/main.c | 41 +++++++++ kernel/power/power.h | 9 ++ kernel/power/wakelock.c | 215 ++++++++++++++++++++++++++++++++++++++++++++++++ 5 files changed, 274 insertions(+) create mode 100644 kernel/power/wakelock.c (limited to 'kernel') diff --git a/kernel/power/Kconfig b/kernel/power/Kconfig index 67947083f842..1d534076d33a 100644 --- a/kernel/power/Kconfig +++ b/kernel/power/Kconfig @@ -111,6 +111,14 @@ config PM_AUTOSLEEP Allow the kernel to trigger a system transition into a global sleep state automatically whenever there are no active wakeup sources. +config PM_WAKELOCKS + bool "User space wakeup sources interface" + depends on PM_SLEEP + default n + ---help--- + Allow user space to create, activate and deactivate wakeup source + objects with the help of a sysfs-based interface. 
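Given the interface described in the changelog above, a user-space client writes "name [timeout]" to /sys/power/wake_lock and the bare name to /sys/power/wake_unlock; the timeout is parsed in nanoseconds further down in the patch. A small sketch of such a client, assuming a kernel with this series applied and CONFIG_PM_WAKELOCKS enabled:

        #include <fcntl.h>
        #include <stdio.h>
        #include <string.h>
        #include <unistd.h>

        /* Write a string to a sysfs file; returns 0 on success. */
        static int sysfs_write(const char *path, const char *s)
        {
                int fd = open(path, O_WRONLY);
                ssize_t n;

                if (fd < 0)
                        return -1;
                n = write(fd, s, strlen(s));
                close(fd);
                return n == (ssize_t)strlen(s) ? 0 : -1;
        }

        int main(void)
        {
                /* Hold a wakeup source named "mylock" for 5 seconds (timeout in ns). */
                if (sysfs_write("/sys/power/wake_lock", "mylock 5000000000"))
                        perror("wake_lock");

                /* ... do work that must not race with autosleep ... */

                /* Release it explicitly rather than waiting for the timeout. */
                if (sysfs_write("/sys/power/wake_unlock", "mylock"))
                        perror("wake_unlock");
                return 0;
        }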
+ config PM_RUNTIME bool "Run-time PM core functionality" depends on !IA64_HP_SIM diff --git a/kernel/power/Makefile b/kernel/power/Makefile index 010b2f7e148c..29472bff11ef 100644 --- a/kernel/power/Makefile +++ b/kernel/power/Makefile @@ -10,5 +10,6 @@ obj-$(CONFIG_PM_TEST_SUSPEND) += suspend_test.o obj-$(CONFIG_HIBERNATION) += hibernate.o snapshot.o swap.o user.o \ block_io.o obj-$(CONFIG_PM_AUTOSLEEP) += autosleep.o +obj-$(CONFIG_PM_WAKELOCKS) += wakelock.o obj-$(CONFIG_MAGIC_SYSRQ) += poweroff.o diff --git a/kernel/power/main.c b/kernel/power/main.c index ba6a5645952d..54ec071de337 100644 --- a/kernel/power/main.c +++ b/kernel/power/main.c @@ -431,6 +431,43 @@ static ssize_t autosleep_store(struct kobject *kobj, power_attr(autosleep); #endif /* CONFIG_PM_AUTOSLEEP */ + +#ifdef CONFIG_PM_WAKELOCKS +static ssize_t wake_lock_show(struct kobject *kobj, + struct kobj_attribute *attr, + char *buf) +{ + return pm_show_wakelocks(buf, true); +} + +static ssize_t wake_lock_store(struct kobject *kobj, + struct kobj_attribute *attr, + const char *buf, size_t n) +{ + int error = pm_wake_lock(buf); + return error ? error : n; +} + +power_attr(wake_lock); + +static ssize_t wake_unlock_show(struct kobject *kobj, + struct kobj_attribute *attr, + char *buf) +{ + return pm_show_wakelocks(buf, false); +} + +static ssize_t wake_unlock_store(struct kobject *kobj, + struct kobj_attribute *attr, + const char *buf, size_t n) +{ + int error = pm_wake_unlock(buf); + return error ? error : n; +} + +power_attr(wake_unlock); + +#endif /* CONFIG_PM_WAKELOCKS */ #endif /* CONFIG_PM_SLEEP */ #ifdef CONFIG_PM_TRACE @@ -487,6 +524,10 @@ static struct attribute * g[] = { #ifdef CONFIG_PM_AUTOSLEEP &autosleep_attr.attr, #endif +#ifdef CONFIG_PM_WAKELOCKS + &wake_lock_attr.attr, + &wake_unlock_attr.attr, +#endif #ifdef CONFIG_PM_DEBUG &pm_test_attr.attr, #endif diff --git a/kernel/power/power.h b/kernel/power/power.h index 4cf80fa115d9..b0bd4beaebfe 100644 --- a/kernel/power/power.h +++ b/kernel/power/power.h @@ -282,3 +282,12 @@ static inline void pm_autosleep_unlock(void) {} static inline suspend_state_t pm_autosleep_state(void) { return PM_SUSPEND_ON; } #endif /* !CONFIG_PM_AUTOSLEEP */ + +#ifdef CONFIG_PM_WAKELOCKS + +/* kernel/power/wakelock.c */ +extern ssize_t pm_show_wakelocks(char *buf, bool show_active); +extern int pm_wake_lock(const char *buf); +extern int pm_wake_unlock(const char *buf); + +#endif /* !CONFIG_PM_WAKELOCKS */ diff --git a/kernel/power/wakelock.c b/kernel/power/wakelock.c new file mode 100644 index 000000000000..579700665e8c --- /dev/null +++ b/kernel/power/wakelock.c @@ -0,0 +1,215 @@ +/* + * kernel/power/wakelock.c + * + * User space wakeup sources support. + * + * Copyright (C) 2012 Rafael J. Wysocki + * + * This code is based on the analogous interface allowing user space to + * manipulate wakelocks on Android. 
+ */ + +#include +#include +#include +#include +#include +#include +#include + +#define WL_NUMBER_LIMIT 100 +#define WL_GC_COUNT_MAX 100 +#define WL_GC_TIME_SEC 300 + +static DEFINE_MUTEX(wakelocks_lock); + +struct wakelock { + char *name; + struct rb_node node; + struct wakeup_source ws; + struct list_head lru; +}; + +static struct rb_root wakelocks_tree = RB_ROOT; +static LIST_HEAD(wakelocks_lru_list); +static unsigned int number_of_wakelocks; +static unsigned int wakelocks_gc_count; + +ssize_t pm_show_wakelocks(char *buf, bool show_active) +{ + struct rb_node *node; + struct wakelock *wl; + char *str = buf; + char *end = buf + PAGE_SIZE; + + mutex_lock(&wakelocks_lock); + + for (node = rb_first(&wakelocks_tree); node; node = rb_next(node)) { + wl = rb_entry(node, struct wakelock, node); + if (wl->ws.active == show_active) + str += scnprintf(str, end - str, "%s ", wl->name); + } + if (str > buf) + str--; + + str += scnprintf(str, end - str, "\n"); + + mutex_unlock(&wakelocks_lock); + return (str - buf); +} + +static struct wakelock *wakelock_lookup_add(const char *name, size_t len, + bool add_if_not_found) +{ + struct rb_node **node = &wakelocks_tree.rb_node; + struct rb_node *parent = *node; + struct wakelock *wl; + + while (*node) { + int diff; + + parent = *node; + wl = rb_entry(*node, struct wakelock, node); + diff = strncmp(name, wl->name, len); + if (diff == 0) { + if (wl->name[len]) + diff = -1; + else + return wl; + } + if (diff < 0) + node = &(*node)->rb_left; + else + node = &(*node)->rb_right; + } + if (!add_if_not_found) + return ERR_PTR(-EINVAL); + + if (number_of_wakelocks > WL_NUMBER_LIMIT) + return ERR_PTR(-ENOSPC); + + /* Not found, we have to add a new one. */ + wl = kzalloc(sizeof(*wl), GFP_KERNEL); + if (!wl) + return ERR_PTR(-ENOMEM); + + wl->name = kstrndup(name, len, GFP_KERNEL); + if (!wl->name) { + kfree(wl); + return ERR_PTR(-ENOMEM); + } + wl->ws.name = wl->name; + wakeup_source_add(&wl->ws); + rb_link_node(&wl->node, parent, node); + rb_insert_color(&wl->node, &wakelocks_tree); + list_add(&wl->lru, &wakelocks_lru_list); + number_of_wakelocks++; + return wl; +} + +int pm_wake_lock(const char *buf) +{ + const char *str = buf; + struct wakelock *wl; + u64 timeout_ns = 0; + size_t len; + int ret = 0; + + while (*str && !isspace(*str)) + str++; + + len = str - buf; + if (!len) + return -EINVAL; + + if (*str && *str != '\n') { + /* Find out if there's a valid timeout string appended. 
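A subtle point in wakelock_lookup_add() above: the lookup key is not NUL-terminated (it is the first "len" bytes of whatever was written to the sysfs file, before any timeout), so equality needs strncmp() plus a check that the stored name does not continue past "len". A tiny standalone illustration of that comparison; keycmp() is an invented helper.

        #include <assert.h>
        #include <stddef.h>
        #include <string.h>

        /* Compare a length-delimited key against a NUL-terminated name,
         * the way wakelock_lookup_add() does: <0, 0, >0 like strcmp(). */
        static int keycmp(const char *key, size_t len, const char *name)
        {
                int diff = strncmp(key, name, len);

                if (diff == 0 && name[len] != '\0')
                        diff = -1;      /* "foo" vs "foobar": key sorts first */
                return diff;
        }

        int main(void)
        {
                assert(keycmp("mylock 123", 6, "mylock") == 0);  /* same name, timeout ignored */
                assert(keycmp("mylock 123", 6, "mylocks") < 0);  /* stored name is longer */
                assert(keycmp("mylock 123", 6, "aaaa") > 0);
                return 0;
        }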
*/ + ret = kstrtou64(skip_spaces(str), 10, &timeout_ns); + if (ret) + return -EINVAL; + } + + mutex_lock(&wakelocks_lock); + + wl = wakelock_lookup_add(buf, len, true); + if (IS_ERR(wl)) { + ret = PTR_ERR(wl); + goto out; + } + if (timeout_ns) { + u64 timeout_ms = timeout_ns + NSEC_PER_MSEC - 1; + + do_div(timeout_ms, NSEC_PER_MSEC); + __pm_wakeup_event(&wl->ws, timeout_ms); + } else { + __pm_stay_awake(&wl->ws); + } + + list_move(&wl->lru, &wakelocks_lru_list); + + out: + mutex_unlock(&wakelocks_lock); + return ret; +} + +static void wakelocks_gc(void) +{ + struct wakelock *wl, *aux; + ktime_t now = ktime_get(); + + list_for_each_entry_safe_reverse(wl, aux, &wakelocks_lru_list, lru) { + u64 idle_time_ns; + bool active; + + spin_lock_irq(&wl->ws.lock); + idle_time_ns = ktime_to_ns(ktime_sub(now, wl->ws.last_time)); + active = wl->ws.active; + spin_unlock_irq(&wl->ws.lock); + + if (idle_time_ns < ((u64)WL_GC_TIME_SEC * NSEC_PER_SEC)) + break; + + if (!active) { + wakeup_source_remove(&wl->ws); + rb_erase(&wl->node, &wakelocks_tree); + list_del(&wl->lru); + kfree(wl->name); + kfree(wl); + number_of_wakelocks--; + } + } + wakelocks_gc_count = 0; +} + +int pm_wake_unlock(const char *buf) +{ + struct wakelock *wl; + size_t len; + int ret = 0; + + len = strlen(buf); + if (!len) + return -EINVAL; + + if (buf[len-1] == '\n') + len--; + + if (!len) + return -EINVAL; + + mutex_lock(&wakelocks_lock); + + wl = wakelock_lookup_add(buf, len, false); + if (IS_ERR(wl)) { + ret = PTR_ERR(wl); + goto out; + } + __pm_relax(&wl->ws); + list_move(&wl->lru, &wakelocks_lru_list); + if (++wakelocks_gc_count > WL_GC_COUNT_MAX) + wakelocks_gc(); + + out: + mutex_unlock(&wakelocks_lock); + return ret; +} -- cgit v1.2.3 From 616c310e83b872024271c915c1b9ab505b9efad9 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 27 Mar 2012 16:02:08 -0700 Subject: rcu: Move PREEMPT_RCU preemption to switch_to() invocation Currently, PREEMPT_RCU readers are enqueued upon entry to the scheduler. This is inefficient because enqueuing is required only if there is a context switch, and entry to the scheduler does not guarantee a context switch. The commit therefore moves the enqueuing to immediately precede the call to switch_to() from the scheduler. Signed-off-by: Paul E. McKenney Signed-off-by: Paul E. 
McKenney Tested-by: Linus Torvalds --- kernel/rcutree.c | 1 - kernel/rcutree.h | 1 - kernel/rcutree_plugin.h | 14 +++----------- kernel/sched/core.c | 1 + 4 files changed, 4 insertions(+), 13 deletions(-) (limited to 'kernel') diff --git a/kernel/rcutree.c b/kernel/rcutree.c index 1050d6d3922c..61351505ec78 100644 --- a/kernel/rcutree.c +++ b/kernel/rcutree.c @@ -192,7 +192,6 @@ void rcu_note_context_switch(int cpu) { trace_rcu_utilization("Start context switch"); rcu_sched_qs(cpu); - rcu_preempt_note_context_switch(cpu); trace_rcu_utilization("End context switch"); } EXPORT_SYMBOL_GPL(rcu_note_context_switch); diff --git a/kernel/rcutree.h b/kernel/rcutree.h index cdd1be0a4072..d6b70b08a01a 100644 --- a/kernel/rcutree.h +++ b/kernel/rcutree.h @@ -423,7 +423,6 @@ DECLARE_PER_CPU(char, rcu_cpu_has_work); /* Forward declarations for rcutree_plugin.h */ static void rcu_bootup_announce(void); long rcu_batches_completed(void); -static void rcu_preempt_note_context_switch(int cpu); static int rcu_preempt_blocked_readers_cgp(struct rcu_node *rnp); #ifdef CONFIG_HOTPLUG_CPU static void rcu_report_unblock_qs_rnp(struct rcu_node *rnp, diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h index c023464816be..b1ac22e6fa31 100644 --- a/kernel/rcutree_plugin.h +++ b/kernel/rcutree_plugin.h @@ -153,7 +153,7 @@ static void rcu_preempt_qs(int cpu) * * Caller must disable preemption. */ -static void rcu_preempt_note_context_switch(int cpu) +void rcu_preempt_note_context_switch(void) { struct task_struct *t = current; unsigned long flags; @@ -164,7 +164,7 @@ static void rcu_preempt_note_context_switch(int cpu) (t->rcu_read_unlock_special & RCU_READ_UNLOCK_BLOCKED) == 0) { /* Possibly blocking in an RCU read-side critical section. */ - rdp = per_cpu_ptr(rcu_preempt_state.rda, cpu); + rdp = __this_cpu_ptr(rcu_preempt_state.rda); rnp = rdp->mynode; raw_spin_lock_irqsave(&rnp->lock, flags); t->rcu_read_unlock_special |= RCU_READ_UNLOCK_BLOCKED; @@ -228,7 +228,7 @@ static void rcu_preempt_note_context_switch(int cpu) * means that we continue to block the current grace period. */ local_irq_save(flags); - rcu_preempt_qs(cpu); + rcu_preempt_qs(smp_processor_id()); local_irq_restore(flags); } @@ -1017,14 +1017,6 @@ void rcu_force_quiescent_state(void) } EXPORT_SYMBOL_GPL(rcu_force_quiescent_state); -/* - * Because preemptible RCU does not exist, we never have to check for - * CPUs being in quiescent states. - */ -static void rcu_preempt_note_context_switch(int cpu) -{ -} - /* * Because preemptible RCU does not exist, there are never any preempted * RCU readers. diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 4603b9d8f30a..5d89eb93f7e4 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -2083,6 +2083,7 @@ context_switch(struct rq *rq, struct task_struct *prev, #endif /* Here we just switch the register state and the stack. */ + rcu_switch_from(prev); switch_to(prev, next, prev); barrier(); -- cgit v1.2.3 From 9dd8fb16c36178df2066387d2abd44d8b4dca8c8 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Fri, 13 Apr 2012 12:54:22 -0700 Subject: rcu: Make exit_rcu() more precise and consolidate When running preemptible RCU, if a task exits in an RCU read-side critical section having blocked within that same RCU read-side critical section, the task must be removed from the list of tasks blocking a grace period (perhaps the current grace period, perhaps the next grace period, depending on timing). The exit() path invokes exit_rcu() to do this cleanup. 
However, the current implementation of exit_rcu() needlessly does the cleanup even if the task did not block within the current RCU read-side critical section, which wastes time and needlessly increases the size of the state space. Fix this by only doing the cleanup if the current task is actually on the list of tasks blocking some grace period. While we are at it, consolidate the two identical exit_rcu() functions into a single function. Signed-off-by: Paul E. McKenney Signed-off-by: Paul E. McKenney Tested-by: Linus Torvalds Conflicts: kernel/rcupdate.c --- kernel/rcupdate.c | 28 ++++++++++++++++++++++++++++ kernel/rcutiny_plugin.h | 16 ---------------- kernel/rcutree_plugin.h | 16 ---------------- 3 files changed, 28 insertions(+), 32 deletions(-) (limited to 'kernel') diff --git a/kernel/rcupdate.c b/kernel/rcupdate.c index a86f1741cc27..95cba41ce1e9 100644 --- a/kernel/rcupdate.c +++ b/kernel/rcupdate.c @@ -51,6 +51,34 @@ #include "rcu.h" +#ifdef CONFIG_PREEMPT_RCU + +/* + * Check for a task exiting while in a preemptible-RCU read-side + * critical section, clean up if so. No need to issue warnings, + * as debug_check_no_locks_held() already does this if lockdep + * is enabled. + */ +void exit_rcu(void) +{ + struct task_struct *t = current; + + if (likely(list_empty(¤t->rcu_node_entry))) + return; + t->rcu_read_lock_nesting = 1; + barrier(); + t->rcu_read_unlock_special = RCU_READ_UNLOCK_BLOCKED; + __rcu_read_unlock(); +} + +#else /* #ifdef CONFIG_PREEMPT_RCU */ + +void exit_rcu(void) +{ +} + +#endif /* #else #ifdef CONFIG_PREEMPT_RCU */ + #ifdef CONFIG_DEBUG_LOCK_ALLOC static struct lock_class_key rcu_lock_key; struct lockdep_map rcu_lock_map = diff --git a/kernel/rcutiny_plugin.h b/kernel/rcutiny_plugin.h index 22ecea0dfb62..fc31a2d65100 100644 --- a/kernel/rcutiny_plugin.h +++ b/kernel/rcutiny_plugin.h @@ -851,22 +851,6 @@ int rcu_preempt_needs_cpu(void) return rcu_preempt_ctrlblk.rcb.rcucblist != NULL; } -/* - * Check for a task exiting while in a preemptible -RCU read-side - * critical section, clean up if so. No need to issue warnings, - * as debug_check_no_locks_held() already does this if lockdep - * is enabled. - */ -void exit_rcu(void) -{ - struct task_struct *t = current; - - if (t->rcu_read_lock_nesting == 0) - return; - t->rcu_read_lock_nesting = 1; - __rcu_read_unlock(); -} - #else /* #ifdef CONFIG_TINY_PREEMPT_RCU */ #ifdef CONFIG_RCU_TRACE diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h index b1ac22e6fa31..4936fff820eb 100644 --- a/kernel/rcutree_plugin.h +++ b/kernel/rcutree_plugin.h @@ -969,22 +969,6 @@ static void __init __rcu_init_preempt(void) rcu_init_one(&rcu_preempt_state, &rcu_preempt_data); } -/* - * Check for a task exiting while in a preemptible-RCU read-side - * critical section, clean up if so. No need to issue warnings, - * as debug_check_no_locks_held() already does this if lockdep - * is enabled. - */ -void exit_rcu(void) -{ - struct task_struct *t = current; - - if (t->rcu_read_lock_nesting == 0) - return; - t->rcu_read_lock_nesting = 1; - __rcu_read_unlock(); -} - #else /* #ifdef CONFIG_TREE_PREEMPT_RCU */ static struct rcu_state *rcu_state = &rcu_sched_state; -- cgit v1.2.3 From ae2975bc3476243b45a1e2344236d7920c268f38 Mon Sep 17 00:00:00 2001 From: "Eric W. Biederman" Date: Mon, 14 Nov 2011 15:56:38 -0800 Subject: userns: Convert group_info values from gid_t to kgid_t. As a first step to converting struct cred to be all kuid_t and kgid_t values convert the group values stored in group_info to always be kgid_t values. 
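For orientation, the conversion pattern this and the following patches apply can be reduced to the make_kgid()/from_kgid_munged() round trip sketched below; the two wrapper functions are illustrative only and not part of the patch (only make_kgid(), from_kgid_munged(), gid_valid() and current_user_ns() are real helpers):

#include <linux/cred.h>		/* current_user_ns() */
#include <linux/uidgid.h>	/* kgid_t, make_kgid(), from_kgid_munged() */

static int example_store_gid(gid_t gid, kgid_t *slot)
{
	struct user_namespace *ns = current_user_ns();
	kgid_t kgid = make_kgid(ns, gid);	/* userspace gid -> kernel kgid_t */

	if (!gid_valid(kgid))			/* no mapping in this namespace */
		return -EINVAL;
	*slot = kgid;				/* group_info/struct cred now hold kgid_t */
	return 0;
}

static gid_t example_report_gid(kgid_t kgid)
{
	/* kernel kgid_t -> gid as seen from the caller's namespace */
	return from_kgid_munged(current_user_ns(), kgid);
}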
Unless user namespaces are used this change should have no effect. Acked-by: Serge Hallyn Signed-off-by: Eric W. Biederman --- kernel/groups.c | 48 +++++++++++++++++++++++++----------------------- kernel/uid16.c | 14 ++++++++++++-- 2 files changed, 37 insertions(+), 25 deletions(-) (limited to 'kernel') diff --git a/kernel/groups.c b/kernel/groups.c index 99b53d1eb7ea..84156f2d4c8c 100644 --- a/kernel/groups.c +++ b/kernel/groups.c @@ -31,7 +31,7 @@ struct group_info *groups_alloc(int gidsetsize) group_info->blocks[0] = group_info->small_block; else { for (i = 0; i < nblocks; i++) { - gid_t *b; + kgid_t *b; b = (void *)__get_free_page(GFP_USER); if (!b) goto out_undo_partial_alloc; @@ -66,18 +66,15 @@ EXPORT_SYMBOL(groups_free); static int groups_to_user(gid_t __user *grouplist, const struct group_info *group_info) { + struct user_namespace *user_ns = current_user_ns(); int i; unsigned int count = group_info->ngroups; - for (i = 0; i < group_info->nblocks; i++) { - unsigned int cp_count = min(NGROUPS_PER_BLOCK, count); - unsigned int len = cp_count * sizeof(*grouplist); - - if (copy_to_user(grouplist, group_info->blocks[i], len)) + for (i = 0; i < count; i++) { + gid_t gid; + gid = from_kgid_munged(user_ns, GROUP_AT(group_info, i)); + if (put_user(gid, grouplist+i)) return -EFAULT; - - grouplist += NGROUPS_PER_BLOCK; - count -= cp_count; } return 0; } @@ -86,18 +83,21 @@ static int groups_to_user(gid_t __user *grouplist, static int groups_from_user(struct group_info *group_info, gid_t __user *grouplist) { + struct user_namespace *user_ns = current_user_ns(); int i; unsigned int count = group_info->ngroups; - for (i = 0; i < group_info->nblocks; i++) { - unsigned int cp_count = min(NGROUPS_PER_BLOCK, count); - unsigned int len = cp_count * sizeof(*grouplist); - - if (copy_from_user(group_info->blocks[i], grouplist, len)) + for (i = 0; i < count; i++) { + gid_t gid; + kgid_t kgid; + if (get_user(gid, grouplist+i)) return -EFAULT; - grouplist += NGROUPS_PER_BLOCK; - count -= cp_count; + kgid = make_kgid(user_ns, gid); + if (!gid_valid(kgid)) + return -EINVAL; + + GROUP_AT(group_info, i) = kgid; } return 0; } @@ -117,9 +117,9 @@ static void groups_sort(struct group_info *group_info) for (base = 0; base < max; base++) { int left = base; int right = left + stride; - gid_t tmp = GROUP_AT(group_info, right); + kgid_t tmp = GROUP_AT(group_info, right); - while (left >= 0 && GROUP_AT(group_info, left) > tmp) { + while (left >= 0 && gid_gt(GROUP_AT(group_info, left), tmp)) { GROUP_AT(group_info, right) = GROUP_AT(group_info, left); right = left; @@ -132,7 +132,7 @@ static void groups_sort(struct group_info *group_info) } /* a simple bsearch */ -int groups_search(const struct group_info *group_info, gid_t grp) +int groups_search(const struct group_info *group_info, kgid_t grp) { unsigned int left, right; @@ -143,9 +143,9 @@ int groups_search(const struct group_info *group_info, gid_t grp) right = group_info->ngroups; while (left < right) { unsigned int mid = (left+right)/2; - if (grp > GROUP_AT(group_info, mid)) + if (gid_gt(grp, GROUP_AT(group_info, mid))) left = mid + 1; - else if (grp < GROUP_AT(group_info, mid)) + else if (gid_lt(grp, GROUP_AT(group_info, mid))) right = mid; else return 1; @@ -262,7 +262,8 @@ int in_group_p(gid_t grp) int retval = 1; if (grp != cred->fsgid) - retval = groups_search(cred->group_info, grp); + retval = groups_search(cred->group_info, + make_kgid(cred->user_ns, grp)); return retval; } @@ -274,7 +275,8 @@ int in_egroup_p(gid_t grp) int retval = 1; if (grp != 
cred->egid) - retval = groups_search(cred->group_info, grp); + retval = groups_search(cred->group_info, + make_kgid(cred->user_ns, grp)); return retval; } diff --git a/kernel/uid16.c b/kernel/uid16.c index 51c6e89e8619..e530bc34c4cf 100644 --- a/kernel/uid16.c +++ b/kernel/uid16.c @@ -134,11 +134,14 @@ SYSCALL_DEFINE1(setfsgid16, old_gid_t, gid) static int groups16_to_user(old_gid_t __user *grouplist, struct group_info *group_info) { + struct user_namespace *user_ns = current_user_ns(); int i; old_gid_t group; + kgid_t kgid; for (i = 0; i < group_info->ngroups; i++) { - group = high2lowgid(GROUP_AT(group_info, i)); + kgid = GROUP_AT(group_info, i); + group = high2lowgid(from_kgid_munged(user_ns, kgid)); if (put_user(group, grouplist+i)) return -EFAULT; } @@ -149,13 +152,20 @@ static int groups16_to_user(old_gid_t __user *grouplist, static int groups16_from_user(struct group_info *group_info, old_gid_t __user *grouplist) { + struct user_namespace *user_ns = current_user_ns(); int i; old_gid_t group; + kgid_t kgid; for (i = 0; i < group_info->ngroups; i++) { if (get_user(group, grouplist+i)) return -EFAULT; - GROUP_AT(group_info, i) = low2highgid(group); + + kgid = make_kgid(user_ns, low2highgid(group)); + if (!gid_valid(kgid)) + return -EINVAL; + + GROUP_AT(group_info, i) = kgid; } return 0; -- cgit v1.2.3 From 078de5f706ece36afd73bb4b8283314132d2dfdf Mon Sep 17 00:00:00 2001 From: "Eric W. Biederman" Date: Wed, 8 Feb 2012 07:00:08 -0800 Subject: userns: Store uid and gid values in struct cred with kuid_t and kgid_t types cred.h and a few trivial users of struct cred are changed. The rest of the users of struct cred are left for other patches as there are too many changes to make in one go and leave the change reviewable. If the user namespace is disabled and CONFIG_UIDGID_STRICT_TYPE_CHECKS are disabled the code will contiue to compile and behave correctly. Acked-by: Serge Hallyn Signed-off-by: Eric W. 
Biederman --- kernel/cred.c | 36 ++++++++++++++++++++++-------------- kernel/signal.c | 14 ++++++++------ kernel/sys.c | 26 +++++++++----------------- kernel/user_namespace.c | 4 ++-- 4 files changed, 41 insertions(+), 39 deletions(-) (limited to 'kernel') diff --git a/kernel/cred.c b/kernel/cred.c index 7a0d80669886..eddc5e2e9587 100644 --- a/kernel/cred.c +++ b/kernel/cred.c @@ -49,6 +49,14 @@ struct cred init_cred = { .subscribers = ATOMIC_INIT(2), .magic = CRED_MAGIC, #endif + .uid = GLOBAL_ROOT_UID, + .gid = GLOBAL_ROOT_GID, + .suid = GLOBAL_ROOT_UID, + .sgid = GLOBAL_ROOT_GID, + .euid = GLOBAL_ROOT_UID, + .egid = GLOBAL_ROOT_GID, + .fsuid = GLOBAL_ROOT_UID, + .fsgid = GLOBAL_ROOT_GID, .securebits = SECUREBITS_DEFAULT, .cap_inheritable = CAP_EMPTY_SET, .cap_permitted = CAP_FULL_SET, @@ -488,10 +496,10 @@ int commit_creds(struct cred *new) get_cred(new); /* we will require a ref for the subj creds too */ /* dumpability changes */ - if (old->euid != new->euid || - old->egid != new->egid || - old->fsuid != new->fsuid || - old->fsgid != new->fsgid || + if (!uid_eq(old->euid, new->euid) || + !gid_eq(old->egid, new->egid) || + !uid_eq(old->fsuid, new->fsuid) || + !gid_eq(old->fsgid, new->fsgid) || !cap_issubset(new->cap_permitted, old->cap_permitted)) { if (task->mm) set_dumpable(task->mm, suid_dumpable); @@ -500,9 +508,9 @@ int commit_creds(struct cred *new) } /* alter the thread keyring */ - if (new->fsuid != old->fsuid) + if (!uid_eq(new->fsuid, old->fsuid)) key_fsuid_changed(task); - if (new->fsgid != old->fsgid) + if (!gid_eq(new->fsgid, old->fsgid)) key_fsgid_changed(task); /* do it @@ -519,16 +527,16 @@ int commit_creds(struct cred *new) alter_cred_subscribers(old, -2); /* send notifications */ - if (new->uid != old->uid || - new->euid != old->euid || - new->suid != old->suid || - new->fsuid != old->fsuid) + if (!uid_eq(new->uid, old->uid) || + !uid_eq(new->euid, old->euid) || + !uid_eq(new->suid, old->suid) || + !uid_eq(new->fsuid, old->fsuid)) proc_id_connector(task, PROC_EVENT_UID); - if (new->gid != old->gid || - new->egid != old->egid || - new->sgid != old->sgid || - new->fsgid != old->fsgid) + if (!gid_eq(new->gid, old->gid) || + !gid_eq(new->egid, old->egid) || + !gid_eq(new->sgid, old->sgid) || + !gid_eq(new->fsgid, old->fsgid)) proc_id_connector(task, PROC_EVENT_GID); /* release the old obj and subj refs both */ diff --git a/kernel/signal.c b/kernel/signal.c index e2c5d84f2dac..2734dc965f69 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -1038,8 +1038,10 @@ static inline void userns_fixup_signal_uid(struct siginfo *info, struct task_str if (SI_FROMKERNEL(info)) return; - info->si_uid = user_ns_map_uid(task_cred_xxx(t, user_ns), - current_cred(), info->si_uid); + rcu_read_lock(); + info->si_uid = from_kuid_munged(task_cred_xxx(t, user_ns), + make_kuid(current_user_ns(), info->si_uid)); + rcu_read_unlock(); } #else static inline void userns_fixup_signal_uid(struct siginfo *info, struct task_struct *t) @@ -1106,7 +1108,7 @@ static int __send_signal(int sig, struct siginfo *info, struct task_struct *t, q->info.si_code = SI_USER; q->info.si_pid = task_tgid_nr_ns(current, task_active_pid_ns(t)); - q->info.si_uid = current_uid(); + q->info.si_uid = from_kuid_munged(current_user_ns(), current_uid()); break; case (unsigned long) SEND_SIG_PRIV: q->info.si_signo = sig; @@ -1973,7 +1975,7 @@ static void ptrace_do_notify(int signr, int exit_code, int why) info.si_signo = signr; info.si_code = exit_code; info.si_pid = task_pid_vnr(current); - info.si_uid = current_uid(); + 
info.si_uid = from_kuid_munged(current_user_ns(), current_uid()); /* Let the debugger run. */ ptrace_stop(exit_code, why, 1, &info); @@ -2828,7 +2830,7 @@ SYSCALL_DEFINE2(kill, pid_t, pid, int, sig) info.si_errno = 0; info.si_code = SI_USER; info.si_pid = task_tgid_vnr(current); - info.si_uid = current_uid(); + info.si_uid = from_kuid_munged(current_user_ns(), current_uid()); return kill_something_info(sig, &info, pid); } @@ -2871,7 +2873,7 @@ static int do_tkill(pid_t tgid, pid_t pid, int sig) info.si_errno = 0; info.si_code = SI_TKILL; info.si_pid = task_tgid_vnr(current); - info.si_uid = current_uid(); + info.si_uid = from_kuid_munged(current_user_ns(), current_uid()); return do_send_specific(tgid, pid, sig, &info); } diff --git a/kernel/sys.c b/kernel/sys.c index f0c43b4b6657..39962818c008 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -175,7 +175,6 @@ SYSCALL_DEFINE3(setpriority, int, which, int, who, int, niceval) const struct cred *cred = current_cred(); int error = -EINVAL; struct pid *pgrp; - kuid_t cred_uid; kuid_t uid; if (which > PRIO_USER || which < PRIO_PROCESS) @@ -209,22 +208,19 @@ SYSCALL_DEFINE3(setpriority, int, which, int, who, int, niceval) } while_each_pid_thread(pgrp, PIDTYPE_PGID, p); break; case PRIO_USER: - cred_uid = make_kuid(cred->user_ns, cred->uid); uid = make_kuid(cred->user_ns, who); user = cred->user; if (!who) - uid = cred_uid; - else if (!uid_eq(uid, cred_uid) && + uid = cred->uid; + else if (!uid_eq(uid, cred->uid) && !(user = find_user(uid))) goto out_unlock; /* No processes for this user */ do_each_thread(g, p) { - const struct cred *tcred = __task_cred(p); - kuid_t tcred_uid = make_kuid(tcred->user_ns, tcred->uid); - if (uid_eq(tcred_uid, uid)) + if (uid_eq(task_uid(p), uid)) error = set_one_prio(p, niceval, error); } while_each_thread(g, p); - if (!uid_eq(uid, cred_uid)) + if (!uid_eq(uid, cred->uid)) free_uid(user); /* For find_user() */ break; } @@ -248,7 +244,6 @@ SYSCALL_DEFINE2(getpriority, int, which, int, who) const struct cred *cred = current_cred(); long niceval, retval = -ESRCH; struct pid *pgrp; - kuid_t cred_uid; kuid_t uid; if (which > PRIO_USER || which < PRIO_PROCESS) @@ -280,25 +275,22 @@ SYSCALL_DEFINE2(getpriority, int, which, int, who) } while_each_pid_thread(pgrp, PIDTYPE_PGID, p); break; case PRIO_USER: - cred_uid = make_kuid(cred->user_ns, cred->uid); uid = make_kuid(cred->user_ns, who); user = cred->user; if (!who) - uid = cred_uid; - else if (!uid_eq(uid, cred_uid) && + uid = cred->uid; + else if (!uid_eq(uid, cred->uid) && !(user = find_user(uid))) goto out_unlock; /* No processes for this user */ do_each_thread(g, p) { - const struct cred *tcred = __task_cred(p); - kuid_t tcred_uid = make_kuid(tcred->user_ns, tcred->uid); - if (uid_eq(tcred_uid, uid)) { + if (uid_eq(task_uid(p), uid)) { niceval = 20 - task_nice(p); if (niceval > retval) retval = niceval; } } while_each_thread(g, p); - if (!uid_eq(uid, cred_uid)) + if (!uid_eq(uid, cred->uid)) free_uid(user); /* for find_user() */ break; } @@ -641,7 +633,7 @@ static int set_user(struct cred *new) { struct user_struct *new_user; - new_user = alloc_uid(make_kuid(new->user_ns, new->uid)); + new_user = alloc_uid(new->uid); if (!new_user) return -EAGAIN; diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c index 7eff867bfac5..86602316422d 100644 --- a/kernel/user_namespace.c +++ b/kernel/user_namespace.c @@ -36,8 +36,8 @@ static bool new_idmap_permitted(struct user_namespace *ns, int cap_setid, int create_user_ns(struct cred *new) { struct user_namespace *ns, 
*parent_ns = new->user_ns; - kuid_t owner = make_kuid(new->user_ns, new->euid); - kgid_t group = make_kgid(new->user_ns, new->egid); + kuid_t owner = new->euid; + kgid_t group = new->egid; /* The creator needs a mapping in the parent user namespace * or else we won't be able to reasonably tell userspace who -- cgit v1.2.3 From 76b6db010297d4928ab7b7e7c78dd982f413f0a4 Mon Sep 17 00:00:00 2001 From: "Eric W. Biederman" Date: Wed, 14 Mar 2012 15:24:19 -0700 Subject: userns: Replace user_ns_map_uid and user_ns_map_gid with from_kuid and from_kgid These function are no longer needed replace them with their more useful equivalents. Acked-by: Serge Hallyn Signed-off-by: Eric W. Biederman --- kernel/signal.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/signal.c b/kernel/signal.c index 2734dc965f69..d6303277a640 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -1026,7 +1026,7 @@ static inline int legacy_queue(struct sigpending *signals, int sig) static inline uid_t map_cred_ns(const struct cred *cred, struct user_namespace *ns) { - return user_ns_map_uid(ns, cred, cred->uid); + return from_kuid_munged(ns, cred->uid); } #ifdef CONFIG_USER_NS -- cgit v1.2.3 From 9c806aa06f8e121c6058db8e8073798aa5c4355b Mon Sep 17 00:00:00 2001 From: "Eric W. Biederman" Date: Thu, 2 Feb 2012 18:54:02 -0800 Subject: userns: Convert sched_set_affinity and sched_set_scheduler's permission checks - Compare kuids with uid_eq - kuid are uniuqe across all user namespaces so there is no longer the need for a user_namespace comparison. Acked-by: Serge Hallyn Signed-off-by: Eric W. Biederman --- kernel/sched/core.c | 7 ++----- 1 file changed, 2 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 96bff855b866..b189fecaef90 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4042,11 +4042,8 @@ static bool check_same_owner(struct task_struct *p) rcu_read_lock(); pcred = __task_cred(p); - if (cred->user_ns == pcred->user_ns) - match = (cred->euid == pcred->euid || - cred->euid == pcred->uid); - else - match = false; + match = (uid_eq(cred->euid, pcred->euid) || + uid_eq(cred->euid, pcred->uid)); rcu_read_unlock(); return match; } -- cgit v1.2.3 From a29c33f4e506e1dae7e0985b6328046535becbf8 Mon Sep 17 00:00:00 2001 From: "Eric W. Biederman" Date: Tue, 7 Feb 2012 18:51:01 -0800 Subject: userns: Convert setting and getting uid and gid system calls to use kuid and kgid Convert setregid, setgid, setreuid, setuid, setresuid, getresuid, setresgid, getresgid, setfsuid, setfsgid, getuid, geteuid, getgid, getegid, waitpid, waitid, wait4. Convert userspace uids and gids into kuids and kgids before being placed on struct cred. Convert struct cred kuids and kgids into userspace uids and gids when returning them. Signed-off-by: Eric W. 
Biederman --- kernel/exit.c | 6 +- kernel/sys.c | 216 +++++++++++++++++++++++++++++++++++++++------------------ kernel/timer.c | 8 +-- kernel/uid16.c | 34 +++++---- 4 files changed, 178 insertions(+), 86 deletions(-) (limited to 'kernel') diff --git a/kernel/exit.c b/kernel/exit.c index d8bd3b425fa7..789e3c5777f7 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -1214,7 +1214,7 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p) unsigned long state; int retval, status, traced; pid_t pid = task_pid_vnr(p); - uid_t uid = __task_cred(p)->uid; + uid_t uid = from_kuid_munged(current_user_ns(), __task_cred(p)->uid); struct siginfo __user *infop; if (!likely(wo->wo_flags & WEXITED)) @@ -1427,7 +1427,7 @@ static int wait_task_stopped(struct wait_opts *wo, if (!unlikely(wo->wo_flags & WNOWAIT)) *p_code = 0; - uid = task_uid(p); + uid = from_kuid_munged(current_user_ns(), __task_cred(p)->uid); unlock_sig: spin_unlock_irq(&p->sighand->siglock); if (!exit_code) @@ -1500,7 +1500,7 @@ static int wait_task_continued(struct wait_opts *wo, struct task_struct *p) } if (!unlikely(wo->wo_flags & WNOWAIT)) p->signal->flags &= ~SIGNAL_STOP_CONTINUED; - uid = task_uid(p); + uid = from_kuid_munged(current_user_ns(), __task_cred(p)->uid); spin_unlock_irq(&p->sighand->siglock); pid = task_pid_vnr(p); diff --git a/kernel/sys.c b/kernel/sys.c index 39962818c008..aff09f208eb3 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -555,9 +555,19 @@ void ctrl_alt_del(void) */ SYSCALL_DEFINE2(setregid, gid_t, rgid, gid_t, egid) { + struct user_namespace *ns = current_user_ns(); const struct cred *old; struct cred *new; int retval; + kgid_t krgid, kegid; + + krgid = make_kgid(ns, rgid); + kegid = make_kgid(ns, egid); + + if ((rgid != (gid_t) -1) && !gid_valid(krgid)) + return -EINVAL; + if ((egid != (gid_t) -1) && !gid_valid(kegid)) + return -EINVAL; new = prepare_creds(); if (!new) @@ -566,25 +576,25 @@ SYSCALL_DEFINE2(setregid, gid_t, rgid, gid_t, egid) retval = -EPERM; if (rgid != (gid_t) -1) { - if (old->gid == rgid || - old->egid == rgid || + if (gid_eq(old->gid, krgid) || + gid_eq(old->egid, krgid) || nsown_capable(CAP_SETGID)) - new->gid = rgid; + new->gid = krgid; else goto error; } if (egid != (gid_t) -1) { - if (old->gid == egid || - old->egid == egid || - old->sgid == egid || + if (gid_eq(old->gid, kegid) || + gid_eq(old->egid, kegid) || + gid_eq(old->sgid, kegid) || nsown_capable(CAP_SETGID)) - new->egid = egid; + new->egid = kegid; else goto error; } if (rgid != (gid_t) -1 || - (egid != (gid_t) -1 && egid != old->gid)) + (egid != (gid_t) -1 && !gid_eq(kegid, old->gid))) new->sgid = new->egid; new->fsgid = new->egid; @@ -602,9 +612,15 @@ error: */ SYSCALL_DEFINE1(setgid, gid_t, gid) { + struct user_namespace *ns = current_user_ns(); const struct cred *old; struct cred *new; int retval; + kgid_t kgid; + + kgid = make_kgid(ns, gid); + if (!gid_valid(kgid)) + return -EINVAL; new = prepare_creds(); if (!new) @@ -613,9 +629,9 @@ SYSCALL_DEFINE1(setgid, gid_t, gid) retval = -EPERM; if (nsown_capable(CAP_SETGID)) - new->gid = new->egid = new->sgid = new->fsgid = gid; - else if (gid == old->gid || gid == old->sgid) - new->egid = new->fsgid = gid; + new->gid = new->egid = new->sgid = new->fsgid = kgid; + else if (gid_eq(kgid, old->gid) || gid_eq(kgid, old->sgid)) + new->egid = new->fsgid = kgid; else goto error; @@ -672,9 +688,19 @@ static int set_user(struct cred *new) */ SYSCALL_DEFINE2(setreuid, uid_t, ruid, uid_t, euid) { + struct user_namespace *ns = current_user_ns(); const struct cred *old; 
struct cred *new; int retval; + kuid_t kruid, keuid; + + kruid = make_kuid(ns, ruid); + keuid = make_kuid(ns, euid); + + if ((ruid != (uid_t) -1) && !uid_valid(kruid)) + return -EINVAL; + if ((euid != (uid_t) -1) && !uid_valid(keuid)) + return -EINVAL; new = prepare_creds(); if (!new) @@ -683,29 +709,29 @@ SYSCALL_DEFINE2(setreuid, uid_t, ruid, uid_t, euid) retval = -EPERM; if (ruid != (uid_t) -1) { - new->uid = ruid; - if (old->uid != ruid && - old->euid != ruid && + new->uid = kruid; + if (!uid_eq(old->uid, kruid) && + !uid_eq(old->euid, kruid) && !nsown_capable(CAP_SETUID)) goto error; } if (euid != (uid_t) -1) { - new->euid = euid; - if (old->uid != euid && - old->euid != euid && - old->suid != euid && + new->euid = keuid; + if (!uid_eq(old->uid, keuid) && + !uid_eq(old->euid, keuid) && + !uid_eq(old->suid, keuid) && !nsown_capable(CAP_SETUID)) goto error; } - if (new->uid != old->uid) { + if (!uid_eq(new->uid, old->uid)) { retval = set_user(new); if (retval < 0) goto error; } if (ruid != (uid_t) -1 || - (euid != (uid_t) -1 && euid != old->uid)) + (euid != (uid_t) -1 && !uid_eq(keuid, old->uid))) new->suid = new->euid; new->fsuid = new->euid; @@ -733,9 +759,15 @@ error: */ SYSCALL_DEFINE1(setuid, uid_t, uid) { + struct user_namespace *ns = current_user_ns(); const struct cred *old; struct cred *new; int retval; + kuid_t kuid; + + kuid = make_kuid(ns, uid); + if (!uid_valid(kuid)) + return -EINVAL; new = prepare_creds(); if (!new) @@ -744,17 +776,17 @@ SYSCALL_DEFINE1(setuid, uid_t, uid) retval = -EPERM; if (nsown_capable(CAP_SETUID)) { - new->suid = new->uid = uid; - if (uid != old->uid) { + new->suid = new->uid = kuid; + if (!uid_eq(kuid, old->uid)) { retval = set_user(new); if (retval < 0) goto error; } - } else if (uid != old->uid && uid != new->suid) { + } else if (!uid_eq(kuid, old->uid) && !uid_eq(kuid, new->suid)) { goto error; } - new->fsuid = new->euid = uid; + new->fsuid = new->euid = kuid; retval = security_task_fix_setuid(new, old, LSM_SETID_ID); if (retval < 0) @@ -774,9 +806,24 @@ error: */ SYSCALL_DEFINE3(setresuid, uid_t, ruid, uid_t, euid, uid_t, suid) { + struct user_namespace *ns = current_user_ns(); const struct cred *old; struct cred *new; int retval; + kuid_t kruid, keuid, ksuid; + + kruid = make_kuid(ns, ruid); + keuid = make_kuid(ns, euid); + ksuid = make_kuid(ns, suid); + + if ((ruid != (uid_t) -1) && !uid_valid(kruid)) + return -EINVAL; + + if ((euid != (uid_t) -1) && !uid_valid(keuid)) + return -EINVAL; + + if ((suid != (uid_t) -1) && !uid_valid(ksuid)) + return -EINVAL; new = prepare_creds(); if (!new) @@ -786,29 +833,29 @@ SYSCALL_DEFINE3(setresuid, uid_t, ruid, uid_t, euid, uid_t, suid) retval = -EPERM; if (!nsown_capable(CAP_SETUID)) { - if (ruid != (uid_t) -1 && ruid != old->uid && - ruid != old->euid && ruid != old->suid) + if (ruid != (uid_t) -1 && !uid_eq(kruid, old->uid) && + !uid_eq(kruid, old->euid) && !uid_eq(kruid, old->suid)) goto error; - if (euid != (uid_t) -1 && euid != old->uid && - euid != old->euid && euid != old->suid) + if (euid != (uid_t) -1 && !uid_eq(keuid, old->uid) && + !uid_eq(keuid, old->euid) && !uid_eq(keuid, old->suid)) goto error; - if (suid != (uid_t) -1 && suid != old->uid && - suid != old->euid && suid != old->suid) + if (suid != (uid_t) -1 && !uid_eq(ksuid, old->uid) && + !uid_eq(ksuid, old->euid) && !uid_eq(ksuid, old->suid)) goto error; } if (ruid != (uid_t) -1) { - new->uid = ruid; - if (ruid != old->uid) { + new->uid = kruid; + if (!uid_eq(kruid, old->uid)) { retval = set_user(new); if (retval < 0) goto error; } } if 
(euid != (uid_t) -1) - new->euid = euid; + new->euid = keuid; if (suid != (uid_t) -1) - new->suid = suid; + new->suid = ksuid; new->fsuid = new->euid; retval = security_task_fix_setuid(new, old, LSM_SETID_RES); @@ -822,14 +869,19 @@ error: return retval; } -SYSCALL_DEFINE3(getresuid, uid_t __user *, ruid, uid_t __user *, euid, uid_t __user *, suid) +SYSCALL_DEFINE3(getresuid, uid_t __user *, ruidp, uid_t __user *, euidp, uid_t __user *, suidp) { const struct cred *cred = current_cred(); int retval; + uid_t ruid, euid, suid; + + ruid = from_kuid_munged(cred->user_ns, cred->uid); + euid = from_kuid_munged(cred->user_ns, cred->euid); + suid = from_kuid_munged(cred->user_ns, cred->suid); - if (!(retval = put_user(cred->uid, ruid)) && - !(retval = put_user(cred->euid, euid))) - retval = put_user(cred->suid, suid); + if (!(retval = put_user(ruid, ruidp)) && + !(retval = put_user(euid, euidp))) + retval = put_user(suid, suidp); return retval; } @@ -839,9 +891,22 @@ SYSCALL_DEFINE3(getresuid, uid_t __user *, ruid, uid_t __user *, euid, uid_t __u */ SYSCALL_DEFINE3(setresgid, gid_t, rgid, gid_t, egid, gid_t, sgid) { + struct user_namespace *ns = current_user_ns(); const struct cred *old; struct cred *new; int retval; + kgid_t krgid, kegid, ksgid; + + krgid = make_kgid(ns, rgid); + kegid = make_kgid(ns, egid); + ksgid = make_kgid(ns, sgid); + + if ((rgid != (gid_t) -1) && !gid_valid(krgid)) + return -EINVAL; + if ((egid != (gid_t) -1) && !gid_valid(kegid)) + return -EINVAL; + if ((sgid != (gid_t) -1) && !gid_valid(ksgid)) + return -EINVAL; new = prepare_creds(); if (!new) @@ -850,23 +915,23 @@ SYSCALL_DEFINE3(setresgid, gid_t, rgid, gid_t, egid, gid_t, sgid) retval = -EPERM; if (!nsown_capable(CAP_SETGID)) { - if (rgid != (gid_t) -1 && rgid != old->gid && - rgid != old->egid && rgid != old->sgid) + if (rgid != (gid_t) -1 && !gid_eq(krgid, old->gid) && + !gid_eq(krgid, old->egid) && !gid_eq(krgid, old->sgid)) goto error; - if (egid != (gid_t) -1 && egid != old->gid && - egid != old->egid && egid != old->sgid) + if (egid != (gid_t) -1 && !gid_eq(kegid, old->gid) && + !gid_eq(kegid, old->egid) && !gid_eq(kegid, old->sgid)) goto error; - if (sgid != (gid_t) -1 && sgid != old->gid && - sgid != old->egid && sgid != old->sgid) + if (sgid != (gid_t) -1 && !gid_eq(ksgid, old->gid) && + !gid_eq(ksgid, old->egid) && !gid_eq(ksgid, old->sgid)) goto error; } if (rgid != (gid_t) -1) - new->gid = rgid; + new->gid = krgid; if (egid != (gid_t) -1) - new->egid = egid; + new->egid = kegid; if (sgid != (gid_t) -1) - new->sgid = sgid; + new->sgid = ksgid; new->fsgid = new->egid; return commit_creds(new); @@ -876,14 +941,19 @@ error: return retval; } -SYSCALL_DEFINE3(getresgid, gid_t __user *, rgid, gid_t __user *, egid, gid_t __user *, sgid) +SYSCALL_DEFINE3(getresgid, gid_t __user *, rgidp, gid_t __user *, egidp, gid_t __user *, sgidp) { const struct cred *cred = current_cred(); int retval; + gid_t rgid, egid, sgid; + + rgid = from_kgid_munged(cred->user_ns, cred->gid); + egid = from_kgid_munged(cred->user_ns, cred->egid); + sgid = from_kgid_munged(cred->user_ns, cred->sgid); - if (!(retval = put_user(cred->gid, rgid)) && - !(retval = put_user(cred->egid, egid))) - retval = put_user(cred->sgid, sgid); + if (!(retval = put_user(rgid, rgidp)) && + !(retval = put_user(egid, egidp))) + retval = put_user(sgid, sgidp); return retval; } @@ -900,18 +970,24 @@ SYSCALL_DEFINE1(setfsuid, uid_t, uid) const struct cred *old; struct cred *new; uid_t old_fsuid; + kuid_t kuid; + + old = current_cred(); + old_fsuid = 
from_kuid_munged(old->user_ns, old->fsuid); + + kuid = make_kuid(old->user_ns, uid); + if (!uid_valid(kuid)) + return old_fsuid; new = prepare_creds(); if (!new) - return current_fsuid(); - old = current_cred(); - old_fsuid = old->fsuid; + return old_fsuid; - if (uid == old->uid || uid == old->euid || - uid == old->suid || uid == old->fsuid || + if (uid_eq(kuid, old->uid) || uid_eq(kuid, old->euid) || + uid_eq(kuid, old->suid) || uid_eq(kuid, old->fsuid) || nsown_capable(CAP_SETUID)) { - if (uid != old_fsuid) { - new->fsuid = uid; + if (!uid_eq(kuid, old->fsuid)) { + new->fsuid = kuid; if (security_task_fix_setuid(new, old, LSM_SETID_FS) == 0) goto change_okay; } @@ -933,18 +1009,24 @@ SYSCALL_DEFINE1(setfsgid, gid_t, gid) const struct cred *old; struct cred *new; gid_t old_fsgid; + kgid_t kgid; + + old = current_cred(); + old_fsgid = from_kgid_munged(old->user_ns, old->fsgid); + + kgid = make_kgid(old->user_ns, gid); + if (!gid_valid(kgid)) + return old_fsgid; new = prepare_creds(); if (!new) - return current_fsgid(); - old = current_cred(); - old_fsgid = old->fsgid; + return old_fsgid; - if (gid == old->gid || gid == old->egid || - gid == old->sgid || gid == old->fsgid || + if (gid_eq(kgid, old->gid) || gid_eq(kgid, old->egid) || + gid_eq(kgid, old->sgid) || gid_eq(kgid, old->fsgid) || nsown_capable(CAP_SETGID)) { - if (gid != old_fsgid) { - new->fsgid = gid; + if (!gid_eq(kgid, old->fsgid)) { + new->fsgid = kgid; goto change_okay; } } @@ -1503,10 +1585,10 @@ static int check_prlimit_permission(struct task_struct *task) if (cred->user_ns == tcred->user_ns && (cred->uid == tcred->euid && cred->uid == tcred->suid && - cred->uid == tcred->uid && + cred->uid == tcred->uid && cred->gid == tcred->egid && cred->gid == tcred->sgid && - cred->gid == tcred->gid)) + cred->gid == tcred->gid)) return 0; if (ns_capable(tcred->user_ns, CAP_SYS_RESOURCE)) return 0; diff --git a/kernel/timer.c b/kernel/timer.c index a297ffcf888e..67316cb6a777 100644 --- a/kernel/timer.c +++ b/kernel/timer.c @@ -1427,25 +1427,25 @@ SYSCALL_DEFINE0(getppid) SYSCALL_DEFINE0(getuid) { /* Only we change this so SMP safe */ - return current_uid(); + return from_kuid_munged(current_user_ns(), current_uid()); } SYSCALL_DEFINE0(geteuid) { /* Only we change this so SMP safe */ - return current_euid(); + return from_kuid_munged(current_user_ns(), current_euid()); } SYSCALL_DEFINE0(getgid) { /* Only we change this so SMP safe */ - return current_gid(); + return from_kgid_munged(current_user_ns(), current_gid()); } SYSCALL_DEFINE0(getegid) { /* Only we change this so SMP safe */ - return current_egid(); + return from_kgid_munged(current_user_ns(), current_egid()); } #endif diff --git a/kernel/uid16.c b/kernel/uid16.c index e530bc34c4cf..d7948eb10225 100644 --- a/kernel/uid16.c +++ b/kernel/uid16.c @@ -81,14 +81,19 @@ SYSCALL_DEFINE3(setresuid16, old_uid_t, ruid, old_uid_t, euid, old_uid_t, suid) return ret; } -SYSCALL_DEFINE3(getresuid16, old_uid_t __user *, ruid, old_uid_t __user *, euid, old_uid_t __user *, suid) +SYSCALL_DEFINE3(getresuid16, old_uid_t __user *, ruidp, old_uid_t __user *, euidp, old_uid_t __user *, suidp) { const struct cred *cred = current_cred(); int retval; + old_uid_t ruid, euid, suid; - if (!(retval = put_user(high2lowuid(cred->uid), ruid)) && - !(retval = put_user(high2lowuid(cred->euid), euid))) - retval = put_user(high2lowuid(cred->suid), suid); + ruid = high2lowuid(from_kuid_munged(cred->user_ns, cred->uid)); + euid = high2lowuid(from_kuid_munged(cred->user_ns, cred->euid)); + suid = 
high2lowuid(from_kuid_munged(cred->user_ns, cred->suid)); + + if (!(retval = put_user(ruid, ruidp)) && + !(retval = put_user(euid, euidp))) + retval = put_user(suid, suidp); return retval; } @@ -103,14 +108,19 @@ SYSCALL_DEFINE3(setresgid16, old_gid_t, rgid, old_gid_t, egid, old_gid_t, sgid) } -SYSCALL_DEFINE3(getresgid16, old_gid_t __user *, rgid, old_gid_t __user *, egid, old_gid_t __user *, sgid) +SYSCALL_DEFINE3(getresgid16, old_gid_t __user *, rgidp, old_gid_t __user *, egidp, old_gid_t __user *, sgidp) { const struct cred *cred = current_cred(); int retval; + old_gid_t rgid, egid, sgid; + + rgid = high2lowgid(from_kgid_munged(cred->user_ns, cred->gid)); + egid = high2lowgid(from_kgid_munged(cred->user_ns, cred->egid)); + sgid = high2lowgid(from_kgid_munged(cred->user_ns, cred->sgid)); - if (!(retval = put_user(high2lowgid(cred->gid), rgid)) && - !(retval = put_user(high2lowgid(cred->egid), egid))) - retval = put_user(high2lowgid(cred->sgid), sgid); + if (!(retval = put_user(rgid, rgidp)) && + !(retval = put_user(egid, egidp))) + retval = put_user(sgid, sgidp); return retval; } @@ -221,20 +231,20 @@ SYSCALL_DEFINE2(setgroups16, int, gidsetsize, old_gid_t __user *, grouplist) SYSCALL_DEFINE0(getuid16) { - return high2lowuid(current_uid()); + return high2lowuid(from_kuid_munged(current_user_ns(), current_uid())); } SYSCALL_DEFINE0(geteuid16) { - return high2lowuid(current_euid()); + return high2lowuid(from_kuid_munged(current_user_ns(), current_euid())); } SYSCALL_DEFINE0(getgid16) { - return high2lowgid(current_gid()); + return high2lowgid(from_kgid_munged(current_user_ns(), current_gid())); } SYSCALL_DEFINE0(getegid16) { - return high2lowgid(current_egid()); + return high2lowgid(from_kgid_munged(current_user_ns(), current_egid())); } -- cgit v1.2.3 From 5af662030e5db1a5560fd917250d5d688a6be586 Mon Sep 17 00:00:00 2001 From: "Eric W. Biederman" Date: Sat, 3 Mar 2012 20:21:47 -0800 Subject: userns: Convert ptrace, kill, set_priority permission checks to work with kuids and kgids Update the permission checks to use the new uid_eq and gid_eq helpers and remove the now unnecessary user_ns equality comparison. Acked-by: Serge Hallyn Signed-off-by: Eric W. 
Biederman --- kernel/ptrace.c | 13 ++++++------- kernel/signal.c | 15 ++++++--------- kernel/sys.c | 18 ++++++++---------- 3 files changed, 20 insertions(+), 26 deletions(-) (limited to 'kernel') diff --git a/kernel/ptrace.c b/kernel/ptrace.c index 24e0a5a94824..a232bb59d93f 100644 --- a/kernel/ptrace.c +++ b/kernel/ptrace.c @@ -198,13 +198,12 @@ int __ptrace_may_access(struct task_struct *task, unsigned int mode) return 0; rcu_read_lock(); tcred = __task_cred(task); - if (cred->user_ns == tcred->user_ns && - (cred->uid == tcred->euid && - cred->uid == tcred->suid && - cred->uid == tcred->uid && - cred->gid == tcred->egid && - cred->gid == tcred->sgid && - cred->gid == tcred->gid)) + if (uid_eq(cred->uid, tcred->euid) && + uid_eq(cred->uid, tcred->suid) && + uid_eq(cred->uid, tcred->uid) && + gid_eq(cred->gid, tcred->egid) && + gid_eq(cred->gid, tcred->sgid) && + gid_eq(cred->gid, tcred->gid)) goto ok; if (ptrace_has_cap(tcred->user_ns, mode)) goto ok; diff --git a/kernel/signal.c b/kernel/signal.c index d6303277a640..aef629c65c87 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -767,11 +767,10 @@ static int kill_ok_by_cred(struct task_struct *t) const struct cred *cred = current_cred(); const struct cred *tcred = __task_cred(t); - if (cred->user_ns == tcred->user_ns && - (cred->euid == tcred->suid || - cred->euid == tcred->uid || - cred->uid == tcred->suid || - cred->uid == tcred->uid)) + if (uid_eq(cred->euid, tcred->suid) || + uid_eq(cred->euid, tcred->uid) || + uid_eq(cred->uid, tcred->suid) || + uid_eq(cred->uid, tcred->uid)) return 1; if (ns_capable(tcred->user_ns, CAP_KILL)) @@ -1389,10 +1388,8 @@ static int kill_as_cred_perm(const struct cred *cred, struct task_struct *target) { const struct cred *pcred = __task_cred(target); - if (cred->user_ns != pcred->user_ns) - return 0; - if (cred->euid != pcred->suid && cred->euid != pcred->uid && - cred->uid != pcred->suid && cred->uid != pcred->uid) + if (!uid_eq(cred->euid, pcred->suid) && !uid_eq(cred->euid, pcred->uid) && + !uid_eq(cred->uid, pcred->suid) && !uid_eq(cred->uid, pcred->uid)) return 0; return 1; } diff --git a/kernel/sys.c b/kernel/sys.c index aff09f208eb3..f484077b6b14 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -131,9 +131,8 @@ static bool set_one_prio_perm(struct task_struct *p) { const struct cred *cred = current_cred(), *pcred = __task_cred(p); - if (pcred->user_ns == cred->user_ns && - (pcred->uid == cred->euid || - pcred->euid == cred->euid)) + if (uid_eq(pcred->uid, cred->euid) || + uid_eq(pcred->euid, cred->euid)) return true; if (ns_capable(pcred->user_ns, CAP_SYS_NICE)) return true; @@ -1582,13 +1581,12 @@ static int check_prlimit_permission(struct task_struct *task) return 0; tcred = __task_cred(task); - if (cred->user_ns == tcred->user_ns && - (cred->uid == tcred->euid && - cred->uid == tcred->suid && - cred->uid == tcred->uid && - cred->gid == tcred->egid && - cred->gid == tcred->sgid && - cred->gid == tcred->gid)) + if (uid_eq(cred->uid, tcred->euid) && + uid_eq(cred->uid, tcred->suid) && + uid_eq(cred->uid, tcred->uid) && + gid_eq(cred->gid, tcred->egid) && + gid_eq(cred->gid, tcred->sgid) && + gid_eq(cred->gid, tcred->gid)) return 0; if (ns_capable(tcred->user_ns, CAP_SYS_RESOURCE)) return 0; -- cgit v1.2.3 From 72cda3d1ef24ab0a9a89c15e9776ca737b75f45a Mon Sep 17 00:00:00 2001 From: "Eric W. Biederman" Date: Thu, 9 Feb 2012 09:09:39 -0800 Subject: userns: Convert in_group_p and in_egroup_p to use kgid_t Acked-by: Serge Hallyn Signed-off-by: Eric W. 
Biederman --- kernel/groups.c | 14 ++++++-------- 1 file changed, 6 insertions(+), 8 deletions(-) (limited to 'kernel') diff --git a/kernel/groups.c b/kernel/groups.c index 84156f2d4c8c..6b2588dd04ff 100644 --- a/kernel/groups.c +++ b/kernel/groups.c @@ -256,27 +256,25 @@ SYSCALL_DEFINE2(setgroups, int, gidsetsize, gid_t __user *, grouplist) /* * Check whether we're fsgid/egid or in the supplemental group.. */ -int in_group_p(gid_t grp) +int in_group_p(kgid_t grp) { const struct cred *cred = current_cred(); int retval = 1; - if (grp != cred->fsgid) - retval = groups_search(cred->group_info, - make_kgid(cred->user_ns, grp)); + if (!gid_eq(grp, cred->fsgid)) + retval = groups_search(cred->group_info, grp); return retval; } EXPORT_SYMBOL(in_group_p); -int in_egroup_p(gid_t grp) +int in_egroup_p(kgid_t grp) { const struct cred *cred = current_cred(); int retval = 1; - if (grp != cred->egid) - retval = groups_search(cred->group_info, - make_kgid(cred->user_ns, grp)); + if (!gid_eq(grp, cred->egid)) + retval = groups_search(cred->group_info, grp); return retval; } -- cgit v1.2.3 From 3bb5d2ee396aabaa4e318f17e94d13e2ee0e5a88 Mon Sep 17 00:00:00 2001 From: Suresh Siddha Date: Fri, 20 Apr 2012 17:08:50 -0700 Subject: smp, idle: Allocate idle thread for each possible cpu during boot percpu areas are already allocated during boot for each possible cpu. percpu idle threads can be considered as an extension of the percpu areas, and allocate them for each possible cpu during boot. This will eliminate the need for workqueue based idle thread allocation. In future we can move the idle thread area into the percpu area too. [ tglx: Moved the loop into smpboot.c and added an error check when the init code failed to allocate an idle thread for a cpu which should be onlined ] Signed-off-by: Suresh Siddha Cc: Peter Zijlstra Cc: Rusty Russell Cc: Paul E. McKenney Cc: Srivatsa S. Bhat Cc: Tejun Heo Cc: David Rientjes Cc: venki@google.com Link: http://lkml.kernel.org/r/1334966930.28674.245.camel@sbsiddha-desk.sc.intel.com Signed-off-by: Thomas Gleixner --- kernel/cpu.c | 9 ++++--- kernel/smp.c | 4 ++++ kernel/smpboot.c | 72 +++++++++++++++----------------------------------------- kernel/smpboot.h | 2 ++ 4 files changed, 31 insertions(+), 56 deletions(-) (limited to 'kernel') diff --git a/kernel/cpu.c b/kernel/cpu.c index 05c46bae5e55..0e6353cf147a 100644 --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -297,15 +297,18 @@ static int __cpuinit _cpu_up(unsigned int cpu, int tasks_frozen) int ret, nr_calls = 0; void *hcpu = (void *)(long)cpu; unsigned long mod = tasks_frozen ? CPU_TASKS_FROZEN : 0; + struct task_struct *idle; if (cpu_online(cpu) || !cpu_present(cpu)) return -EINVAL; cpu_hotplug_begin(); - ret = smpboot_prepare(cpu); - if (ret) + idle = idle_thread_get(cpu); + if (IS_ERR(idle)) { + ret = PTR_ERR(idle); goto out; + } ret = __cpu_notify(CPU_UP_PREPARE | mod, hcpu, -1, &nr_calls); if (ret) { @@ -316,7 +319,7 @@ static int __cpuinit _cpu_up(unsigned int cpu, int tasks_frozen) } /* Arch-specific enabling code. 
*/ - ret = __cpu_up(cpu, idle_thread_get(cpu)); + ret = __cpu_up(cpu, idle); if (ret != 0) goto out_notify; BUG_ON(!cpu_online(cpu)); diff --git a/kernel/smp.c b/kernel/smp.c index 2f8b10ecf759..a61294c07f3f 100644 --- a/kernel/smp.c +++ b/kernel/smp.c @@ -13,6 +13,8 @@ #include #include +#include "smpboot.h" + #ifdef CONFIG_USE_GENERIC_SMP_HELPERS static struct { struct list_head queue; @@ -669,6 +671,8 @@ void __init smp_init(void) { unsigned int cpu; + idle_threads_init(); + /* FIXME: This should be done in userspace --RR */ for_each_present_cpu(cpu) { if (num_online_cpus() >= setup_max_cpus) diff --git a/kernel/smpboot.c b/kernel/smpboot.c index ed1576981801..e1a797e028a3 100644 --- a/kernel/smpboot.c +++ b/kernel/smpboot.c @@ -6,64 +6,42 @@ #include #include #include -#include #include "smpboot.h" #ifdef CONFIG_GENERIC_SMP_IDLE_THREAD -struct create_idle { - struct work_struct work; - struct task_struct *idle; - struct completion done; - unsigned int cpu; -}; - -static void __cpuinit do_fork_idle(struct work_struct *work) -{ - struct create_idle *c = container_of(work, struct create_idle, work); - - c->idle = fork_idle(c->cpu); - complete(&c->done); -} - -static struct task_struct * __cpuinit idle_thread_create(unsigned int cpu) -{ - struct create_idle c_idle = { - .cpu = cpu, - .done = COMPLETION_INITIALIZER_ONSTACK(c_idle.done), - }; - - INIT_WORK_ONSTACK(&c_idle.work, do_fork_idle); - schedule_work(&c_idle.work); - wait_for_completion(&c_idle.done); - destroy_work_on_stack(&c_idle.work); - return c_idle.idle; -} - /* * For the hotplug case we keep the task structs around and reuse * them. */ static DEFINE_PER_CPU(struct task_struct *, idle_threads); -static inline struct task_struct *get_idle_for_cpu(unsigned int cpu) +struct task_struct * __cpuinit idle_thread_get(unsigned int cpu) { struct task_struct *tsk = per_cpu(idle_threads, cpu); if (!tsk) - return idle_thread_create(cpu); + return ERR_PTR(-ENOMEM); init_idle(tsk, cpu); return tsk; } -struct task_struct * __cpuinit idle_thread_get(unsigned int cpu) +void __init idle_thread_set_boot_cpu(void) { - return per_cpu(idle_threads, cpu); + per_cpu(idle_threads, smp_processor_id()) = current; } -void __init idle_thread_set_boot_cpu(void) +static inline void idle_init(unsigned int cpu) { - per_cpu(idle_threads, smp_processor_id()) = current; + struct task_struct *tsk = per_cpu(idle_threads, cpu); + + if (!tsk) { + tsk = fork_idle(cpu); + if (IS_ERR(tsk)) + pr_err("SMP: fork_idle() failed for CPU %u\n", cpu); + else + per_cpu(idle_threads, cpu) = tsk; + } } /** @@ -72,25 +50,13 @@ void __init idle_thread_set_boot_cpu(void) * * Creates the thread if it does not exist. 
*/ -static int __cpuinit idle_thread_init(unsigned int cpu) +void __init idle_threads_init(void) { - struct task_struct *idle = get_idle_for_cpu(cpu); + unsigned int cpu; - if (IS_ERR(idle)) { - printk(KERN_ERR "failed fork for CPU %u\n", cpu); - return PTR_ERR(idle); + for_each_possible_cpu(cpu) { + if (cpu != smp_processor_id()) + idle_init(cpu); } - per_cpu(idle_threads, cpu) = idle; - return 0; } -#else -static inline int idle_thread_init(unsigned int cpu) { return 0; } #endif - -/** - * smpboot_prepare - generic smpboot preparation - */ -int __cpuinit smpboot_prepare(unsigned int cpu) -{ - return idle_thread_init(cpu); -} diff --git a/kernel/smpboot.h b/kernel/smpboot.h index 7943bbbab917..4cfbcb8a8362 100644 --- a/kernel/smpboot.h +++ b/kernel/smpboot.h @@ -8,9 +8,11 @@ int smpboot_prepare(unsigned int cpu); #ifdef CONFIG_GENERIC_SMP_IDLE_THREAD struct task_struct *idle_thread_get(unsigned int cpu); void idle_thread_set_boot_cpu(void); +void idle_threads_init(void); #else static inline struct task_struct *idle_thread_get(unsigned int cpu) { return NULL; } static inline void idle_thread_set_boot_cpu(void) { } +static inline void idle_threads_init(unsigned int cpu) { } #endif #endif -- cgit v1.2.3 From 43a18b1e588d1b6a993eedd44dd3154590d9bebd Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Fri, 4 May 2012 12:52:25 +0200 Subject: smp: Fix idle_thread_init() inline stub idle_thread_init() does not have arguments. Reported-by: Ingo Molnar Signed-off-by: Thomas Gleixner --- kernel/smpboot.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/smpboot.h b/kernel/smpboot.h index 4cfbcb8a8362..80c0acfb8472 100644 --- a/kernel/smpboot.h +++ b/kernel/smpboot.h @@ -12,7 +12,7 @@ void idle_threads_init(void); #else static inline struct task_struct *idle_thread_get(unsigned int cpu) { return NULL; } static inline void idle_thread_set_boot_cpu(void) { } -static inline void idle_threads_init(unsigned int cpu) { } +static inline void idle_threads_init(void) { } #endif #endif -- cgit v1.2.3 From d4dc0f90d243fb54cfbca6601c9a7c5a758e437f Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Wed, 25 Apr 2012 12:54:54 +0200 Subject: genirq: Allow check_wakeup_irqs to notice level-triggered interrupts Level triggered interrupts do not cause IRQS_PENDING to be set when they fire while "disabled" as the 'pending' state is always present in the level - they automatically refire where re-enabled. However the IRQS_PENDING flag is also used to abort a suspend cycle - if any 'is_wakeup_set' interrupt is PENDING, check_wakeup_irqs() will cause suspend to abort. Without IRQS_PENDING, suspend won't abort. Consequently, level-triggered interrupts that fire during the 'noirq' phase of suspend do not currently abort suspend. So set IRQS_PENDING even for level triggered interrupts, and make sure to clear the flag in check_irq_resend. 
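To make the interaction concrete, here is a condensed sketch of the suspend-side consumer of IRQS_PENDING, modelled on check_wakeup_irqs() in kernel/irq/pm.c; it is a simplified illustration only, not a replacement for the real hunks below:

#include <linux/irq.h>
#include <linux/interrupt.h>

#include "internals.h"	/* IRQS_PENDING, struct irq_desc::istate */

static int wakeup_irq_pending_sketch(void)
{
	struct irq_desc *desc;
	int irq;

	for_each_irq_desc(irq, desc) {
		/*
		 * A level-triggered wakeup irq that fired while masked
		 * now leaves IRQS_PENDING behind, so the suspend path
		 * notices it and aborts instead of sleeping through it.
		 */
		if (irqd_is_wakeup_set(&desc->irq_data) &&
		    (desc->istate & IRQS_PENDING))
			return -EBUSY;
	}
	return 0;
}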
[ Changelog by courtesy of Neil ] Tested-by: NeilBrown Signed-off-by: Thomas Gleixner --- kernel/irq/chip.c | 4 +++- kernel/irq/resend.c | 7 +++++-- 2 files changed, 8 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c index 6080f6bc8c33..741f83643da6 100644 --- a/kernel/irq/chip.c +++ b/kernel/irq/chip.c @@ -379,8 +379,10 @@ handle_level_irq(unsigned int irq, struct irq_desc *desc) * If its disabled or no action available * keep it masked and get out of here */ - if (unlikely(!desc->action || irqd_irq_disabled(&desc->irq_data))) + if (unlikely(!desc->action || irqd_irq_disabled(&desc->irq_data))) { + desc->istate |= IRQS_PENDING; goto out_unlock; + } handle_irq_event(desc); diff --git a/kernel/irq/resend.c b/kernel/irq/resend.c index 14dd5761e8c9..6454db7b6a4d 100644 --- a/kernel/irq/resend.c +++ b/kernel/irq/resend.c @@ -58,10 +58,13 @@ void check_irq_resend(struct irq_desc *desc, unsigned int irq) /* * We do not resend level type interrupts. Level type * interrupts are resent by hardware when they are still - * active. + * active. Clear the pending bit so suspend/resume does not + * get confused. */ - if (irq_settings_is_level(desc)) + if (irq_settings_is_level(desc)) { + desc->istate &= ~IRQS_PENDING; return; + } if (desc->istate & IRQS_REPLAY) return; if (desc->istate & IRQS_PENDING) { -- cgit v1.2.3 From 9c6079aa1bfcf7e14de10b824779ce39b679bcb8 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Fri, 4 May 2012 17:56:16 +0200 Subject: genirq: Do not consider disabled wakeup irqs If an wakeup interrupt has been disabled before the suspend code disables all interrupts then we have to ignore the pending flag. Otherwise we would abort suspend over and over as nothing clears the pending flag because the interrupt is disabled. Signed-off-by: Thomas Gleixner Cc: NeilBrown --- kernel/irq/pm.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/irq/pm.c b/kernel/irq/pm.c index 15e53b1766a6..cb228bf21760 100644 --- a/kernel/irq/pm.c +++ b/kernel/irq/pm.c @@ -103,8 +103,13 @@ int check_wakeup_irqs(void) int irq; for_each_irq_desc(irq, desc) { + /* + * Only interrupts which are marked as wakeup source + * and have not been disabled before the suspend check + * can abort suspend. + */ if (irqd_is_wakeup_set(&desc->irq_data)) { - if (desc->istate & IRQS_PENDING) + if (desc->depth == 1 && desc->istate & IRQS_PENDING) return -EBUSY; continue; } -- cgit v1.2.3 From 1ef9eaf2bf8901e92bb931875a5621692c8a0b84 Mon Sep 17 00:00:00 2001 From: Jim Cromie Date: Thu, 3 May 2012 11:57:37 -0600 Subject: params.c: fix Smack complaint about parse_args In commit 9fb48c744: "params: add 3rd arg to option handler callback signature", the if-guard added to the pr_debug was overzealous; no callers pass NULL, and existing code above and below the guard assumes as much. Change the if-guard to match, and silence the Smack complaint. 
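In isolation the reasoning reduces to the sketch below; the wrapper function is hypothetical, only skip_spaces() and pr_debug() are real kernel helpers:

#include <linux/string.h>	/* skip_spaces() */
#include <linux/printk.h>	/* pr_debug() */

static void parse_args_guard_sketch(char *args)
{
	/*
	 * The line above the old guard already walks the string, so a
	 * NULL "args" would have oopsed before the test was reached.
	 */
	args = skip_spaces(args);

	if (*args)	/* the "args && ..." half of the old test was dead */
		pr_debug("parsing ARGS: '%s'\n", args);
}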
CC: Dan Carpenter Signed-off-by: Jim Cromie Signed-off-by: Greg Kroah-Hartman --- kernel/params.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/params.c b/kernel/params.c index b60e2c74b961..be78c904b564 100644 --- a/kernel/params.c +++ b/kernel/params.c @@ -190,7 +190,7 @@ int parse_args(const char *doing, /* Chew leading spaces */ args = skip_spaces(args); - if (args && *args) + if (*args) pr_debug("doing %s, parsing ARGS: '%s'\n", doing, args); while (*args) { -- cgit v1.2.3 From b5f3abf950f16fa615dc621e38eec63b2cc67946 Mon Sep 17 00:00:00 2001 From: Jim Cromie Date: Thu, 3 May 2012 18:22:44 -0600 Subject: params: replace printk(KERN_...) with pr_(...) I left 1 printk which uses __FILE__, __LINE__ explicitly, which should not be subject to generic preferences expressed via pr_fmt(). + tweaks suggested by Joe Perches: - add doing to irq-enabled warning, like others. It wont happen often.. - change sysfs failure crit, not just err, make it 1 line in logs. - coalese 2 format fragments into 1 >80 char line cc: Joe Perches Signed-off-by: Jim Cromie Signed-off-by: Greg Kroah-Hartman --- kernel/params.c | 33 ++++++++++++--------------------- 1 file changed, 12 insertions(+), 21 deletions(-) (limited to 'kernel') diff --git a/kernel/params.c b/kernel/params.c index be78c904b564..ed35345be536 100644 --- a/kernel/params.c +++ b/kernel/params.c @@ -201,25 +201,22 @@ int parse_args(const char *doing, irq_was_disabled = irqs_disabled(); ret = parse_one(param, val, doing, params, num, min_level, max_level, unknown); - if (irq_was_disabled && !irqs_disabled()) { - printk(KERN_WARNING "parse_args(): option '%s' enabled " - "irq's!\n", param); - } + if (irq_was_disabled && !irqs_disabled()) + pr_warn("%s: option '%s' enabled irq's!\n", + doing, param); + switch (ret) { case -ENOENT: - printk(KERN_ERR "%s: Unknown parameter `%s'\n", - doing, param); + pr_err("%s: Unknown parameter `%s'\n", doing, param); return ret; case -ENOSPC: - printk(KERN_ERR - "%s: `%s' too large for parameter `%s'\n", + pr_err("%s: `%s' too large for parameter `%s'\n", doing, val ?: "", param); return ret; case 0: break; default: - printk(KERN_ERR - "%s: `%s' invalid for parameter `%s'\n", + pr_err("%s: `%s' invalid for parameter `%s'\n", doing, val ?: "", param); return ret; } @@ -266,8 +263,7 @@ STANDARD_PARAM_DEF(ulong, unsigned long, "%lu", unsigned long, strict_strtoul); int param_set_charp(const char *val, const struct kernel_param *kp) { if (strlen(val) > 1024) { - printk(KERN_ERR "%s: string parameter too long\n", - kp->name); + pr_err("%s: string parameter too long\n", kp->name); return -ENOSPC; } @@ -403,8 +399,7 @@ static int param_array(const char *name, int len; if (*num == max) { - printk(KERN_ERR "%s: can only take %i arguments\n", - name, max); + pr_err("%s: can only take %i arguments\n", name, max); return -EINVAL; } len = strcspn(val, ","); @@ -423,8 +418,7 @@ static int param_array(const char *name, } while (save == ','); if (*num < min) { - printk(KERN_ERR "%s: needs at least %i arguments\n", - name, min); + pr_err("%s: needs at least %i arguments\n", name, min); return -EINVAL; } return 0; @@ -483,7 +477,7 @@ int param_set_copystring(const char *val, const struct kernel_param *kp) const struct kparam_string *kps = kp->str; if (strlen(val)+1 > kps->maxlen) { - printk(KERN_ERR "%s: string doesn't fit in %u chars.\n", + pr_err("%s: string doesn't fit in %u chars.\n", kp->name, kps->maxlen-1); return -ENOSPC; } @@ -753,11 +747,8 @@ static struct module_kobject * 
__init locate_module_kobject(const char *name) #endif if (err) { kobject_put(&mk->kobj); - printk(KERN_ERR - "Module '%s' failed add to sysfs, error number %d\n", + pr_crit("Adding module '%s' to sysfs failed (%d), the system may be unstable.\n", name, err); - printk(KERN_ERR - "The system will be unstable now.\n"); return NULL; } -- cgit v1.2.3 From a4a2eb490e38aaff61eafcb8cde6725ad1be22ab Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Thu, 3 May 2012 09:02:48 +0000 Subject: init_task: Create generic init_task instance All archs define init_task in the same way (except ia64, but there is no particular reason why ia64 cannot use the common version). Create a generic instance so all archs can be converted over. The config switch is temporary and will be removed when all archs are converted over. Signed-off-by: Thomas Gleixner Cc: Benjamin Herrenschmidt Cc: Chen Liqin Cc: Chris Metcalf Cc: Chris Zankel Cc: David Howells Cc: David S. Miller Cc: Geert Uytterhoeven Cc: Guan Xuetao Cc: Haavard Skinnemoen Cc: Hirokazu Takata Cc: James E.J. Bottomley Cc: Jesper Nilsson Cc: Jonas Bonn Cc: Mark Salter Cc: Martin Schwidefsky Cc: Heiko Carstens Cc: Matt Turner Cc: Michal Simek Cc: Mike Frysinger Cc: Paul Mundt Cc: Ralf Baechle Cc: Richard Kuo Cc: Richard Weinberger Cc: Russell King Cc: Yoshinori Sato Link: http://lkml.kernel.org/r/20120503085034.092585287@linutronix.de --- kernel/sched/Makefile | 2 -- 1 file changed, 2 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile index 9a7dd35102a3..173ea52f3af0 100644 --- a/kernel/sched/Makefile +++ b/kernel/sched/Makefile @@ -16,5 +16,3 @@ obj-$(CONFIG_SMP) += cpupri.o obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o obj-$(CONFIG_SCHEDSTATS) += stats.o obj-$(CONFIG_SCHED_DEBUG) += debug.o - - -- cgit v1.2.3 From 040e5bf65e1ee66266bc314c5965518a7c21ff36 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Arve=20Hj=C3=B8nnev=C3=A5g?= Date: Fri, 4 May 2012 00:14:21 +0200 Subject: PM / Sleep: Fix a mistake in a conditional in autosleep_store() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The condition check in autosleep_store() is incorrect and prevents /sys/power/autosleep from working as advertised. Fix that. [rjw: Added the changelog.] Signed-off-by: Arve HjønnevÃ¥g Signed-off-by: Rafael J. Wysocki --- kernel/power/main.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/power/main.c b/kernel/power/main.c index 54ec071de337..428f8a034e96 100644 --- a/kernel/power/main.c +++ b/kernel/power/main.c @@ -422,7 +422,7 @@ static ssize_t autosleep_store(struct kobject *kobj, int error; if (state == PM_SUSPEND_ON - && !(strncmp(buf, "off", 3) && strncmp(buf, "off\n", 4))) + && strcmp(buf, "off") && strcmp(buf, "off\n")) return -EINVAL; error = pm_autosleep_set_state(state); -- cgit v1.2.3 From 3a6b76661da8e92124a813b43607f5bec1a618de Mon Sep 17 00:00:00 2001 From: Srikar Dronamraju Date: Mon, 9 Apr 2012 14:41:33 +0530 Subject: tracing: Modify is_delete, is_return from int to bool is_delete and is_return can take utmost 2 values and are better of being a boolean than a int. There are no functional changes. 
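The two flags simply mirror the leading command character of a probe definition string; a standalone userspace-style sketch of that mapping (illustration only, not kernel code):

#include <stdbool.h>

static int parse_probe_cmd(const char *def, bool *is_return, bool *is_delete)
{
	*is_return = false;
	*is_delete = false;

	switch (def[0]) {
	case 'p':		/* "p:myevent do_fork" - plain kprobe     */
		break;
	case 'r':		/* "r:myevent do_fork" - kretprobe        */
		*is_return = true;
		break;
	case '-':		/* "-:myevent"         - delete the event */
		*is_delete = true;
		break;
	default:
		return -1;	/* must start with 'p', 'r' or '-' */
	}
	return 0;
}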
Signed-off-by: Srikar Dronamraju Acked-by: Steven Rostedt Acked-by: Masami Hiramatsu Cc: Linus Torvalds Cc: Ananth N Mavinakayanahalli Cc: Jim Keniston Cc: Linux-mm Cc: Oleg Nesterov Cc: Andi Kleen Cc: Christoph Hellwig Cc: Arnaldo Carvalho de Melo Cc: Anton Arapov Cc: Peter Zijlstra Link: http://lkml.kernel.org/r/20120409091133.8343.65289.sendpatchset@srdronam.in.ibm.com Signed-off-by: Ingo Molnar --- kernel/trace/trace_kprobe.c | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c index 580a05ec926b..4f935f83cd46 100644 --- a/kernel/trace/trace_kprobe.c +++ b/kernel/trace/trace_kprobe.c @@ -651,7 +651,7 @@ static struct trace_probe *alloc_trace_probe(const char *group, void *addr, const char *symbol, unsigned long offs, - int nargs, int is_return) + int nargs, bool is_return) { struct trace_probe *tp; int ret = -ENOMEM; @@ -944,7 +944,7 @@ static int split_symbol_offset(char *symbol, unsigned long *offset) #define PARAM_MAX_STACK (THREAD_SIZE / sizeof(unsigned long)) static int parse_probe_vars(char *arg, const struct fetch_type *t, - struct fetch_param *f, int is_return) + struct fetch_param *f, bool is_return) { int ret = 0; unsigned long param; @@ -977,7 +977,7 @@ static int parse_probe_vars(char *arg, const struct fetch_type *t, /* Recursive argument parser */ static int __parse_probe_arg(char *arg, const struct fetch_type *t, - struct fetch_param *f, int is_return) + struct fetch_param *f, bool is_return) { int ret = 0; unsigned long param; @@ -1089,7 +1089,7 @@ static int __parse_bitfield_probe_arg(const char *bf, /* String length checking wrapper */ static int parse_probe_arg(char *arg, struct trace_probe *tp, - struct probe_arg *parg, int is_return) + struct probe_arg *parg, bool is_return) { const char *t; int ret; @@ -1162,7 +1162,7 @@ static int create_trace_probe(int argc, char **argv) */ struct trace_probe *tp; int i, ret = 0; - int is_return = 0, is_delete = 0; + bool is_return = false, is_delete = false; char *symbol = NULL, *event = NULL, *group = NULL; char *arg; unsigned long offset = 0; @@ -1171,11 +1171,11 @@ static int create_trace_probe(int argc, char **argv) /* argc must be >= 1 */ if (argv[0][0] == 'p') - is_return = 0; + is_return = false; else if (argv[0][0] == 'r') - is_return = 1; + is_return = true; else if (argv[0][0] == '-') - is_delete = 1; + is_delete = true; else { pr_info("Probe definition must be started with 'p', 'r' or" " '-'.\n"); -- cgit v1.2.3 From 8ab83f56475ec9151645a888dfe1941f4a92091d Mon Sep 17 00:00:00 2001 From: Srikar Dronamraju Date: Mon, 9 Apr 2012 14:41:44 +0530 Subject: tracing: Extract out common code for kprobes/uprobes trace events Move parts of trace_kprobe.c that can be shared with upcoming trace_uprobe.c. Common code to kernel/trace/trace_probe.h and kernel/trace/trace_probe.c. There are no functional changes. 
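[Editor's illustration, not part of the patch: after this split, a probe flavour only has to supply a createfn callback; the shared traceprobe_probes_write()/traceprobe_command() helpers handle buffering, newline handling, comment stripping and argv splitting, mirroring how trace_kprobe.c's probes_write() now calls traceprobe_probes_write() with create_trace_probe() in the diff below. The names demo_create_probe and demo_probes_write are hypothetical.]

#include "trace_probe.h"

/* Hypothetical createfn: receives one already-split probe definition. */
static int demo_create_probe(int argc, char **argv)
{
	/* A real flavour would build and register a probe from argv here. */
	pr_info("demo: got %d token(s), first is '%s'\n", argc, argv[0]);
	return 0;
}

/* Hypothetical debugfs write handler built on the extracted helper. */
static ssize_t demo_probes_write(struct file *file, const char __user *buffer,
				 size_t count, loff_t *ppos)
{
	return traceprobe_probes_write(file, buffer, count, ppos,
				       demo_create_probe);
}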
Signed-off-by: Srikar Dronamraju Acked-by: Steven Rostedt Acked-by: Masami Hiramatsu Cc: Linus Torvalds Cc: Ananth N Mavinakayanahalli Cc: Jim Keniston Cc: Linux-mm Cc: Oleg Nesterov Cc: Andi Kleen Cc: Christoph Hellwig Cc: Arnaldo Carvalho de Melo Cc: Anton Arapov Cc: Peter Zijlstra Link: http://lkml.kernel.org/r/20120409091144.8343.76218.sendpatchset@srdronam.in.ibm.com Signed-off-by: Ingo Molnar --- kernel/trace/Kconfig | 4 + kernel/trace/Makefile | 1 + kernel/trace/trace_kprobe.c | 889 +------------------------------------------- kernel/trace/trace_probe.c | 833 +++++++++++++++++++++++++++++++++++++++++ kernel/trace/trace_probe.h | 160 ++++++++ 5 files changed, 1016 insertions(+), 871 deletions(-) create mode 100644 kernel/trace/trace_probe.c create mode 100644 kernel/trace/trace_probe.h (limited to 'kernel') diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig index a1d2849f2473..ce5a5c54ac8f 100644 --- a/kernel/trace/Kconfig +++ b/kernel/trace/Kconfig @@ -373,6 +373,7 @@ config KPROBE_EVENT depends on HAVE_REGS_AND_STACK_ACCESS_API bool "Enable kprobes-based dynamic events" select TRACING + select PROBE_EVENTS default y help This allows the user to add tracing events (similar to tracepoints) @@ -385,6 +386,9 @@ config KPROBE_EVENT This option is also required by perf-probe subcommand of perf tools. If you want to use perf tools, this option is strongly recommended. +config PROBE_EVENTS + def_bool n + config DYNAMIC_FTRACE bool "enable/disable ftrace tracepoints dynamically" depends on FUNCTION_TRACER diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile index 5f39a07fe5ea..fa10d5ca5ab1 100644 --- a/kernel/trace/Makefile +++ b/kernel/trace/Makefile @@ -61,5 +61,6 @@ endif ifeq ($(CONFIG_TRACING),y) obj-$(CONFIG_KGDB_KDB) += trace_kdb.o endif +obj-$(CONFIG_PROBE_EVENTS) += trace_probe.o libftrace-y := ftrace.o diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c index 4f935f83cd46..f8b777367d8e 100644 --- a/kernel/trace/trace_kprobe.c +++ b/kernel/trace/trace_kprobe.c @@ -19,547 +19,15 @@ #include #include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include - -#include "trace.h" -#include "trace_output.h" - -#define MAX_TRACE_ARGS 128 -#define MAX_ARGSTR_LEN 63 -#define MAX_EVENT_NAME_LEN 64 -#define MAX_STRING_SIZE PATH_MAX -#define KPROBE_EVENT_SYSTEM "kprobes" - -/* Reserved field names */ -#define FIELD_STRING_IP "__probe_ip" -#define FIELD_STRING_RETIP "__probe_ret_ip" -#define FIELD_STRING_FUNC "__probe_func" - -const char *reserved_field_names[] = { - "common_type", - "common_flags", - "common_preempt_count", - "common_pid", - "common_tgid", - FIELD_STRING_IP, - FIELD_STRING_RETIP, - FIELD_STRING_FUNC, -}; - -/* Printing function type */ -typedef int (*print_type_func_t)(struct trace_seq *, const char *, void *, - void *); -#define PRINT_TYPE_FUNC_NAME(type) print_type_##type -#define PRINT_TYPE_FMT_NAME(type) print_type_format_##type - -/* Printing in basic type function template */ -#define DEFINE_BASIC_PRINT_TYPE_FUNC(type, fmt, cast) \ -static __kprobes int PRINT_TYPE_FUNC_NAME(type)(struct trace_seq *s, \ - const char *name, \ - void *data, void *ent)\ -{ \ - return trace_seq_printf(s, " %s=" fmt, name, (cast)*(type *)data);\ -} \ -static const char PRINT_TYPE_FMT_NAME(type)[] = fmt; - -DEFINE_BASIC_PRINT_TYPE_FUNC(u8, "%x", unsigned int) -DEFINE_BASIC_PRINT_TYPE_FUNC(u16, "%x", unsigned int) -DEFINE_BASIC_PRINT_TYPE_FUNC(u32, "%lx", unsigned long) 
-DEFINE_BASIC_PRINT_TYPE_FUNC(u64, "%llx", unsigned long long) -DEFINE_BASIC_PRINT_TYPE_FUNC(s8, "%d", int) -DEFINE_BASIC_PRINT_TYPE_FUNC(s16, "%d", int) -DEFINE_BASIC_PRINT_TYPE_FUNC(s32, "%ld", long) -DEFINE_BASIC_PRINT_TYPE_FUNC(s64, "%lld", long long) - -/* data_rloc: data relative location, compatible with u32 */ -#define make_data_rloc(len, roffs) \ - (((u32)(len) << 16) | ((u32)(roffs) & 0xffff)) -#define get_rloc_len(dl) ((u32)(dl) >> 16) -#define get_rloc_offs(dl) ((u32)(dl) & 0xffff) - -static inline void *get_rloc_data(u32 *dl) -{ - return (u8 *)dl + get_rloc_offs(*dl); -} - -/* For data_loc conversion */ -static inline void *get_loc_data(u32 *dl, void *ent) -{ - return (u8 *)ent + get_rloc_offs(*dl); -} - -/* - * Convert data_rloc to data_loc: - * data_rloc stores the offset from data_rloc itself, but data_loc - * stores the offset from event entry. - */ -#define convert_rloc_to_loc(dl, offs) ((u32)(dl) + (offs)) - -/* For defining macros, define string/string_size types */ -typedef u32 string; -typedef u32 string_size; - -/* Print type function for string type */ -static __kprobes int PRINT_TYPE_FUNC_NAME(string)(struct trace_seq *s, - const char *name, - void *data, void *ent) -{ - int len = *(u32 *)data >> 16; - - if (!len) - return trace_seq_printf(s, " %s=(fault)", name); - else - return trace_seq_printf(s, " %s=\"%s\"", name, - (const char *)get_loc_data(data, ent)); -} -static const char PRINT_TYPE_FMT_NAME(string)[] = "\\\"%s\\\""; - -/* Data fetch function type */ -typedef void (*fetch_func_t)(struct pt_regs *, void *, void *); - -struct fetch_param { - fetch_func_t fn; - void *data; -}; - -static __kprobes void call_fetch(struct fetch_param *fprm, - struct pt_regs *regs, void *dest) -{ - return fprm->fn(regs, fprm->data, dest); -} - -#define FETCH_FUNC_NAME(method, type) fetch_##method##_##type -/* - * Define macro for basic types - we don't need to define s* types, because - * we have to care only about bitwidth at recording time. 
- */ -#define DEFINE_BASIC_FETCH_FUNCS(method) \ -DEFINE_FETCH_##method(u8) \ -DEFINE_FETCH_##method(u16) \ -DEFINE_FETCH_##method(u32) \ -DEFINE_FETCH_##method(u64) - -#define CHECK_FETCH_FUNCS(method, fn) \ - (((FETCH_FUNC_NAME(method, u8) == fn) || \ - (FETCH_FUNC_NAME(method, u16) == fn) || \ - (FETCH_FUNC_NAME(method, u32) == fn) || \ - (FETCH_FUNC_NAME(method, u64) == fn) || \ - (FETCH_FUNC_NAME(method, string) == fn) || \ - (FETCH_FUNC_NAME(method, string_size) == fn)) \ - && (fn != NULL)) - -/* Data fetch function templates */ -#define DEFINE_FETCH_reg(type) \ -static __kprobes void FETCH_FUNC_NAME(reg, type)(struct pt_regs *regs, \ - void *offset, void *dest) \ -{ \ - *(type *)dest = (type)regs_get_register(regs, \ - (unsigned int)((unsigned long)offset)); \ -} -DEFINE_BASIC_FETCH_FUNCS(reg) -/* No string on the register */ -#define fetch_reg_string NULL -#define fetch_reg_string_size NULL - -#define DEFINE_FETCH_stack(type) \ -static __kprobes void FETCH_FUNC_NAME(stack, type)(struct pt_regs *regs,\ - void *offset, void *dest) \ -{ \ - *(type *)dest = (type)regs_get_kernel_stack_nth(regs, \ - (unsigned int)((unsigned long)offset)); \ -} -DEFINE_BASIC_FETCH_FUNCS(stack) -/* No string on the stack entry */ -#define fetch_stack_string NULL -#define fetch_stack_string_size NULL - -#define DEFINE_FETCH_retval(type) \ -static __kprobes void FETCH_FUNC_NAME(retval, type)(struct pt_regs *regs,\ - void *dummy, void *dest) \ -{ \ - *(type *)dest = (type)regs_return_value(regs); \ -} -DEFINE_BASIC_FETCH_FUNCS(retval) -/* No string on the retval */ -#define fetch_retval_string NULL -#define fetch_retval_string_size NULL - -#define DEFINE_FETCH_memory(type) \ -static __kprobes void FETCH_FUNC_NAME(memory, type)(struct pt_regs *regs,\ - void *addr, void *dest) \ -{ \ - type retval; \ - if (probe_kernel_address(addr, retval)) \ - *(type *)dest = 0; \ - else \ - *(type *)dest = retval; \ -} -DEFINE_BASIC_FETCH_FUNCS(memory) -/* - * Fetch a null-terminated string. Caller MUST set *(u32 *)dest with max - * length and relative data location. - */ -static __kprobes void FETCH_FUNC_NAME(memory, string)(struct pt_regs *regs, - void *addr, void *dest) -{ - long ret; - int maxlen = get_rloc_len(*(u32 *)dest); - u8 *dst = get_rloc_data(dest); - u8 *src = addr; - mm_segment_t old_fs = get_fs(); - if (!maxlen) - return; - /* - * Try to get string again, since the string can be changed while - * probing. 
- */ - set_fs(KERNEL_DS); - pagefault_disable(); - do - ret = __copy_from_user_inatomic(dst++, src++, 1); - while (dst[-1] && ret == 0 && src - (u8 *)addr < maxlen); - dst[-1] = '\0'; - pagefault_enable(); - set_fs(old_fs); - - if (ret < 0) { /* Failed to fetch string */ - ((u8 *)get_rloc_data(dest))[0] = '\0'; - *(u32 *)dest = make_data_rloc(0, get_rloc_offs(*(u32 *)dest)); - } else - *(u32 *)dest = make_data_rloc(src - (u8 *)addr, - get_rloc_offs(*(u32 *)dest)); -} -/* Return the length of string -- including null terminal byte */ -static __kprobes void FETCH_FUNC_NAME(memory, string_size)(struct pt_regs *regs, - void *addr, void *dest) -{ - int ret, len = 0; - u8 c; - mm_segment_t old_fs = get_fs(); - - set_fs(KERNEL_DS); - pagefault_disable(); - do { - ret = __copy_from_user_inatomic(&c, (u8 *)addr + len, 1); - len++; - } while (c && ret == 0 && len < MAX_STRING_SIZE); - pagefault_enable(); - set_fs(old_fs); - - if (ret < 0) /* Failed to check the length */ - *(u32 *)dest = 0; - else - *(u32 *)dest = len; -} - -/* Memory fetching by symbol */ -struct symbol_cache { - char *symbol; - long offset; - unsigned long addr; -}; - -static unsigned long update_symbol_cache(struct symbol_cache *sc) -{ - sc->addr = (unsigned long)kallsyms_lookup_name(sc->symbol); - if (sc->addr) - sc->addr += sc->offset; - return sc->addr; -} - -static void free_symbol_cache(struct symbol_cache *sc) -{ - kfree(sc->symbol); - kfree(sc); -} - -static struct symbol_cache *alloc_symbol_cache(const char *sym, long offset) -{ - struct symbol_cache *sc; - - if (!sym || strlen(sym) == 0) - return NULL; - sc = kzalloc(sizeof(struct symbol_cache), GFP_KERNEL); - if (!sc) - return NULL; - - sc->symbol = kstrdup(sym, GFP_KERNEL); - if (!sc->symbol) { - kfree(sc); - return NULL; - } - sc->offset = offset; - update_symbol_cache(sc); - return sc; -} - -#define DEFINE_FETCH_symbol(type) \ -static __kprobes void FETCH_FUNC_NAME(symbol, type)(struct pt_regs *regs,\ - void *data, void *dest) \ -{ \ - struct symbol_cache *sc = data; \ - if (sc->addr) \ - fetch_memory_##type(regs, (void *)sc->addr, dest); \ - else \ - *(type *)dest = 0; \ -} -DEFINE_BASIC_FETCH_FUNCS(symbol) -DEFINE_FETCH_symbol(string) -DEFINE_FETCH_symbol(string_size) - -/* Dereference memory access function */ -struct deref_fetch_param { - struct fetch_param orig; - long offset; -}; - -#define DEFINE_FETCH_deref(type) \ -static __kprobes void FETCH_FUNC_NAME(deref, type)(struct pt_regs *regs,\ - void *data, void *dest) \ -{ \ - struct deref_fetch_param *dprm = data; \ - unsigned long addr; \ - call_fetch(&dprm->orig, regs, &addr); \ - if (addr) { \ - addr += dprm->offset; \ - fetch_memory_##type(regs, (void *)addr, dest); \ - } else \ - *(type *)dest = 0; \ -} -DEFINE_BASIC_FETCH_FUNCS(deref) -DEFINE_FETCH_deref(string) -DEFINE_FETCH_deref(string_size) - -static __kprobes void update_deref_fetch_param(struct deref_fetch_param *data) -{ - if (CHECK_FETCH_FUNCS(deref, data->orig.fn)) - update_deref_fetch_param(data->orig.data); - else if (CHECK_FETCH_FUNCS(symbol, data->orig.fn)) - update_symbol_cache(data->orig.data); -} - -static __kprobes void free_deref_fetch_param(struct deref_fetch_param *data) -{ - if (CHECK_FETCH_FUNCS(deref, data->orig.fn)) - free_deref_fetch_param(data->orig.data); - else if (CHECK_FETCH_FUNCS(symbol, data->orig.fn)) - free_symbol_cache(data->orig.data); - kfree(data); -} - -/* Bitfield fetch function */ -struct bitfield_fetch_param { - struct fetch_param orig; - unsigned char hi_shift; - unsigned char low_shift; -}; +#include 
"trace_probe.h" -#define DEFINE_FETCH_bitfield(type) \ -static __kprobes void FETCH_FUNC_NAME(bitfield, type)(struct pt_regs *regs,\ - void *data, void *dest) \ -{ \ - struct bitfield_fetch_param *bprm = data; \ - type buf = 0; \ - call_fetch(&bprm->orig, regs, &buf); \ - if (buf) { \ - buf <<= bprm->hi_shift; \ - buf >>= bprm->low_shift; \ - } \ - *(type *)dest = buf; \ -} -DEFINE_BASIC_FETCH_FUNCS(bitfield) -#define fetch_bitfield_string NULL -#define fetch_bitfield_string_size NULL - -static __kprobes void -update_bitfield_fetch_param(struct bitfield_fetch_param *data) -{ - /* - * Don't check the bitfield itself, because this must be the - * last fetch function. - */ - if (CHECK_FETCH_FUNCS(deref, data->orig.fn)) - update_deref_fetch_param(data->orig.data); - else if (CHECK_FETCH_FUNCS(symbol, data->orig.fn)) - update_symbol_cache(data->orig.data); -} - -static __kprobes void -free_bitfield_fetch_param(struct bitfield_fetch_param *data) -{ - /* - * Don't check the bitfield itself, because this must be the - * last fetch function. - */ - if (CHECK_FETCH_FUNCS(deref, data->orig.fn)) - free_deref_fetch_param(data->orig.data); - else if (CHECK_FETCH_FUNCS(symbol, data->orig.fn)) - free_symbol_cache(data->orig.data); - kfree(data); -} - -/* Default (unsigned long) fetch type */ -#define __DEFAULT_FETCH_TYPE(t) u##t -#define _DEFAULT_FETCH_TYPE(t) __DEFAULT_FETCH_TYPE(t) -#define DEFAULT_FETCH_TYPE _DEFAULT_FETCH_TYPE(BITS_PER_LONG) -#define DEFAULT_FETCH_TYPE_STR __stringify(DEFAULT_FETCH_TYPE) - -/* Fetch types */ -enum { - FETCH_MTD_reg = 0, - FETCH_MTD_stack, - FETCH_MTD_retval, - FETCH_MTD_memory, - FETCH_MTD_symbol, - FETCH_MTD_deref, - FETCH_MTD_bitfield, - FETCH_MTD_END, -}; - -#define ASSIGN_FETCH_FUNC(method, type) \ - [FETCH_MTD_##method] = FETCH_FUNC_NAME(method, type) - -#define __ASSIGN_FETCH_TYPE(_name, ptype, ftype, _size, sign, _fmttype) \ - {.name = _name, \ - .size = _size, \ - .is_signed = sign, \ - .print = PRINT_TYPE_FUNC_NAME(ptype), \ - .fmt = PRINT_TYPE_FMT_NAME(ptype), \ - .fmttype = _fmttype, \ - .fetch = { \ -ASSIGN_FETCH_FUNC(reg, ftype), \ -ASSIGN_FETCH_FUNC(stack, ftype), \ -ASSIGN_FETCH_FUNC(retval, ftype), \ -ASSIGN_FETCH_FUNC(memory, ftype), \ -ASSIGN_FETCH_FUNC(symbol, ftype), \ -ASSIGN_FETCH_FUNC(deref, ftype), \ -ASSIGN_FETCH_FUNC(bitfield, ftype), \ - } \ - } - -#define ASSIGN_FETCH_TYPE(ptype, ftype, sign) \ - __ASSIGN_FETCH_TYPE(#ptype, ptype, ftype, sizeof(ftype), sign, #ptype) - -#define FETCH_TYPE_STRING 0 -#define FETCH_TYPE_STRSIZE 1 - -/* Fetch type information table */ -static const struct fetch_type { - const char *name; /* Name of type */ - size_t size; /* Byte size of type */ - int is_signed; /* Signed flag */ - print_type_func_t print; /* Print functions */ - const char *fmt; /* Fromat string */ - const char *fmttype; /* Name in format file */ - /* Fetch functions */ - fetch_func_t fetch[FETCH_MTD_END]; -} fetch_type_table[] = { - /* Special types */ - [FETCH_TYPE_STRING] = __ASSIGN_FETCH_TYPE("string", string, string, - sizeof(u32), 1, "__data_loc char[]"), - [FETCH_TYPE_STRSIZE] = __ASSIGN_FETCH_TYPE("string_size", u32, - string_size, sizeof(u32), 0, "u32"), - /* Basic types */ - ASSIGN_FETCH_TYPE(u8, u8, 0), - ASSIGN_FETCH_TYPE(u16, u16, 0), - ASSIGN_FETCH_TYPE(u32, u32, 0), - ASSIGN_FETCH_TYPE(u64, u64, 0), - ASSIGN_FETCH_TYPE(s8, u8, 1), - ASSIGN_FETCH_TYPE(s16, u16, 1), - ASSIGN_FETCH_TYPE(s32, u32, 1), - ASSIGN_FETCH_TYPE(s64, u64, 1), -}; - -static const struct fetch_type *find_fetch_type(const char *type) -{ - int i; - - if (!type) - 
type = DEFAULT_FETCH_TYPE_STR; - - /* Special case: bitfield */ - if (*type == 'b') { - unsigned long bs; - type = strchr(type, '/'); - if (!type) - goto fail; - type++; - if (strict_strtoul(type, 0, &bs)) - goto fail; - switch (bs) { - case 8: - return find_fetch_type("u8"); - case 16: - return find_fetch_type("u16"); - case 32: - return find_fetch_type("u32"); - case 64: - return find_fetch_type("u64"); - default: - goto fail; - } - } - - for (i = 0; i < ARRAY_SIZE(fetch_type_table); i++) - if (strcmp(type, fetch_type_table[i].name) == 0) - return &fetch_type_table[i]; -fail: - return NULL; -} - -/* Special function : only accept unsigned long */ -static __kprobes void fetch_stack_address(struct pt_regs *regs, - void *dummy, void *dest) -{ - *(unsigned long *)dest = kernel_stack_pointer(regs); -} - -static fetch_func_t get_fetch_size_function(const struct fetch_type *type, - fetch_func_t orig_fn) -{ - int i; - - if (type != &fetch_type_table[FETCH_TYPE_STRING]) - return NULL; /* Only string type needs size function */ - for (i = 0; i < FETCH_MTD_END; i++) - if (type->fetch[i] == orig_fn) - return fetch_type_table[FETCH_TYPE_STRSIZE].fetch[i]; - - WARN_ON(1); /* This should not happen */ - return NULL; -} +#define KPROBE_EVENT_SYSTEM "kprobes" /** * Kprobe event core functions */ -struct probe_arg { - struct fetch_param fetch; - struct fetch_param fetch_size; - unsigned int offset; /* Offset from argument entry */ - const char *name; /* Name of this argument */ - const char *comm; /* Command of this argument */ - const struct fetch_type *type; /* Type of this argument */ -}; - -/* Flags for trace_probe */ -#define TP_FLAG_TRACE 1 -#define TP_FLAG_PROFILE 2 -#define TP_FLAG_REGISTERED 4 - struct trace_probe { struct list_head list; struct kretprobe rp; /* Use rp.kp for kprobe use */ @@ -631,18 +99,6 @@ static int kprobe_dispatcher(struct kprobe *kp, struct pt_regs *regs); static int kretprobe_dispatcher(struct kretprobe_instance *ri, struct pt_regs *regs); -/* Check the name is good for event/group/fields */ -static int is_good_name(const char *name) -{ - if (!isalpha(*name) && *name != '_') - return 0; - while (*++name != '\0') { - if (!isalpha(*name) && !isdigit(*name) && *name != '_') - return 0; - } - return 1; -} - /* * Allocate new trace_probe and initialize it (including kprobes). 
*/ @@ -702,34 +158,12 @@ error: return ERR_PTR(ret); } -static void update_probe_arg(struct probe_arg *arg) -{ - if (CHECK_FETCH_FUNCS(bitfield, arg->fetch.fn)) - update_bitfield_fetch_param(arg->fetch.data); - else if (CHECK_FETCH_FUNCS(deref, arg->fetch.fn)) - update_deref_fetch_param(arg->fetch.data); - else if (CHECK_FETCH_FUNCS(symbol, arg->fetch.fn)) - update_symbol_cache(arg->fetch.data); -} - -static void free_probe_arg(struct probe_arg *arg) -{ - if (CHECK_FETCH_FUNCS(bitfield, arg->fetch.fn)) - free_bitfield_fetch_param(arg->fetch.data); - else if (CHECK_FETCH_FUNCS(deref, arg->fetch.fn)) - free_deref_fetch_param(arg->fetch.data); - else if (CHECK_FETCH_FUNCS(symbol, arg->fetch.fn)) - free_symbol_cache(arg->fetch.data); - kfree(arg->name); - kfree(arg->comm); -} - static void free_trace_probe(struct trace_probe *tp) { int i; for (i = 0; i < tp->nr_args; i++) - free_probe_arg(&tp->args[i]); + traceprobe_free_probe_arg(&tp->args[i]); kfree(tp->call.class->system); kfree(tp->call.name); @@ -787,7 +221,7 @@ static int __register_trace_probe(struct trace_probe *tp) return -EINVAL; for (i = 0; i < tp->nr_args; i++) - update_probe_arg(&tp->args[i]); + traceprobe_update_arg(&tp->args[i]); /* Set/clear disabled flag according to tp->flag */ if (trace_probe_is_enabled(tp)) @@ -919,227 +353,6 @@ static struct notifier_block trace_probe_module_nb = { .priority = 1 /* Invoked after kprobe module callback */ }; -/* Split symbol and offset. */ -static int split_symbol_offset(char *symbol, unsigned long *offset) -{ - char *tmp; - int ret; - - if (!offset) - return -EINVAL; - - tmp = strchr(symbol, '+'); - if (tmp) { - /* skip sign because strict_strtol doesn't accept '+' */ - ret = strict_strtoul(tmp + 1, 0, offset); - if (ret) - return ret; - *tmp = '\0'; - } else - *offset = 0; - return 0; -} - -#define PARAM_MAX_ARGS 16 -#define PARAM_MAX_STACK (THREAD_SIZE / sizeof(unsigned long)) - -static int parse_probe_vars(char *arg, const struct fetch_type *t, - struct fetch_param *f, bool is_return) -{ - int ret = 0; - unsigned long param; - - if (strcmp(arg, "retval") == 0) { - if (is_return) - f->fn = t->fetch[FETCH_MTD_retval]; - else - ret = -EINVAL; - } else if (strncmp(arg, "stack", 5) == 0) { - if (arg[5] == '\0') { - if (strcmp(t->name, DEFAULT_FETCH_TYPE_STR) == 0) - f->fn = fetch_stack_address; - else - ret = -EINVAL; - } else if (isdigit(arg[5])) { - ret = strict_strtoul(arg + 5, 10, ¶m); - if (ret || param > PARAM_MAX_STACK) - ret = -EINVAL; - else { - f->fn = t->fetch[FETCH_MTD_stack]; - f->data = (void *)param; - } - } else - ret = -EINVAL; - } else - ret = -EINVAL; - return ret; -} - -/* Recursive argument parser */ -static int __parse_probe_arg(char *arg, const struct fetch_type *t, - struct fetch_param *f, bool is_return) -{ - int ret = 0; - unsigned long param; - long offset; - char *tmp; - - switch (arg[0]) { - case '$': - ret = parse_probe_vars(arg + 1, t, f, is_return); - break; - case '%': /* named register */ - ret = regs_query_register_offset(arg + 1); - if (ret >= 0) { - f->fn = t->fetch[FETCH_MTD_reg]; - f->data = (void *)(unsigned long)ret; - ret = 0; - } - break; - case '@': /* memory or symbol */ - if (isdigit(arg[1])) { - ret = strict_strtoul(arg + 1, 0, ¶m); - if (ret) - break; - f->fn = t->fetch[FETCH_MTD_memory]; - f->data = (void *)param; - } else { - ret = split_symbol_offset(arg + 1, &offset); - if (ret) - break; - f->data = alloc_symbol_cache(arg + 1, offset); - if (f->data) - f->fn = t->fetch[FETCH_MTD_symbol]; - } - break; - case '+': /* deref memory */ - arg++; 
/* Skip '+', because strict_strtol() rejects it. */ - case '-': - tmp = strchr(arg, '('); - if (!tmp) - break; - *tmp = '\0'; - ret = strict_strtol(arg, 0, &offset); - if (ret) - break; - arg = tmp + 1; - tmp = strrchr(arg, ')'); - if (tmp) { - struct deref_fetch_param *dprm; - const struct fetch_type *t2 = find_fetch_type(NULL); - *tmp = '\0'; - dprm = kzalloc(sizeof(struct deref_fetch_param), - GFP_KERNEL); - if (!dprm) - return -ENOMEM; - dprm->offset = offset; - ret = __parse_probe_arg(arg, t2, &dprm->orig, - is_return); - if (ret) - kfree(dprm); - else { - f->fn = t->fetch[FETCH_MTD_deref]; - f->data = (void *)dprm; - } - } - break; - } - if (!ret && !f->fn) { /* Parsed, but do not find fetch method */ - pr_info("%s type has no corresponding fetch method.\n", - t->name); - ret = -EINVAL; - } - return ret; -} - -#define BYTES_TO_BITS(nb) ((BITS_PER_LONG * (nb)) / sizeof(long)) - -/* Bitfield type needs to be parsed into a fetch function */ -static int __parse_bitfield_probe_arg(const char *bf, - const struct fetch_type *t, - struct fetch_param *f) -{ - struct bitfield_fetch_param *bprm; - unsigned long bw, bo; - char *tail; - - if (*bf != 'b') - return 0; - - bprm = kzalloc(sizeof(*bprm), GFP_KERNEL); - if (!bprm) - return -ENOMEM; - bprm->orig = *f; - f->fn = t->fetch[FETCH_MTD_bitfield]; - f->data = (void *)bprm; - - bw = simple_strtoul(bf + 1, &tail, 0); /* Use simple one */ - if (bw == 0 || *tail != '@') - return -EINVAL; - - bf = tail + 1; - bo = simple_strtoul(bf, &tail, 0); - if (tail == bf || *tail != '/') - return -EINVAL; - - bprm->hi_shift = BYTES_TO_BITS(t->size) - (bw + bo); - bprm->low_shift = bprm->hi_shift + bo; - return (BYTES_TO_BITS(t->size) < (bw + bo)) ? -EINVAL : 0; -} - -/* String length checking wrapper */ -static int parse_probe_arg(char *arg, struct trace_probe *tp, - struct probe_arg *parg, bool is_return) -{ - const char *t; - int ret; - - if (strlen(arg) > MAX_ARGSTR_LEN) { - pr_info("Argument is too long.: %s\n", arg); - return -ENOSPC; - } - parg->comm = kstrdup(arg, GFP_KERNEL); - if (!parg->comm) { - pr_info("Failed to allocate memory for command '%s'.\n", arg); - return -ENOMEM; - } - t = strchr(parg->comm, ':'); - if (t) { - arg[t - parg->comm] = '\0'; - t++; - } - parg->type = find_fetch_type(t); - if (!parg->type) { - pr_info("Unsupported type: %s\n", t); - return -EINVAL; - } - parg->offset = tp->size; - tp->size += parg->type->size; - ret = __parse_probe_arg(arg, parg->type, &parg->fetch, is_return); - if (ret >= 0 && t != NULL) - ret = __parse_bitfield_probe_arg(t, parg->type, &parg->fetch); - if (ret >= 0) { - parg->fetch_size.fn = get_fetch_size_function(parg->type, - parg->fetch.fn); - parg->fetch_size.data = parg->fetch.data; - } - return ret; -} - -/* Return 1 if name is reserved or already used by another argument */ -static int conflict_field_name(const char *name, - struct probe_arg *args, int narg) -{ - int i; - for (i = 0; i < ARRAY_SIZE(reserved_field_names); i++) - if (strcmp(reserved_field_names[i], name) == 0) - return 1; - for (i = 0; i < narg; i++) - if (strcmp(args[i].name, name) == 0) - return 1; - return 0; -} - static int create_trace_probe(int argc, char **argv) { /* @@ -1240,7 +453,7 @@ static int create_trace_probe(int argc, char **argv) /* a symbol specified */ symbol = argv[1]; /* TODO: support .init module functions */ - ret = split_symbol_offset(symbol, &offset); + ret = traceprobe_split_symbol_offset(symbol, &offset); if (ret) { pr_info("Failed to parse symbol.\n"); return ret; @@ -1302,7 +515,8 @@ static int 
create_trace_probe(int argc, char **argv) goto error; } - if (conflict_field_name(tp->args[i].name, tp->args, i)) { + if (traceprobe_conflict_field_name(tp->args[i].name, + tp->args, i)) { pr_info("Argument[%d] name '%s' conflicts with " "another field.\n", i, argv[i]); ret = -EINVAL; @@ -1310,7 +524,8 @@ static int create_trace_probe(int argc, char **argv) } /* Parse fetch argument */ - ret = parse_probe_arg(arg, tp, &tp->args[i], is_return); + ret = traceprobe_parse_probe_arg(arg, &tp->size, &tp->args[i], + is_return); if (ret) { pr_info("Parse error at argument[%d]. (%d)\n", i, ret); goto error; @@ -1412,70 +627,11 @@ static int probes_open(struct inode *inode, struct file *file) return seq_open(file, &probes_seq_op); } -static int command_trace_probe(const char *buf) -{ - char **argv; - int argc = 0, ret = 0; - - argv = argv_split(GFP_KERNEL, buf, &argc); - if (!argv) - return -ENOMEM; - - if (argc) - ret = create_trace_probe(argc, argv); - - argv_free(argv); - return ret; -} - -#define WRITE_BUFSIZE 4096 - static ssize_t probes_write(struct file *file, const char __user *buffer, size_t count, loff_t *ppos) { - char *kbuf, *tmp; - int ret; - size_t done; - size_t size; - - kbuf = kmalloc(WRITE_BUFSIZE, GFP_KERNEL); - if (!kbuf) - return -ENOMEM; - - ret = done = 0; - while (done < count) { - size = count - done; - if (size >= WRITE_BUFSIZE) - size = WRITE_BUFSIZE - 1; - if (copy_from_user(kbuf, buffer + done, size)) { - ret = -EFAULT; - goto out; - } - kbuf[size] = '\0'; - tmp = strchr(kbuf, '\n'); - if (tmp) { - *tmp = '\0'; - size = tmp - kbuf + 1; - } else if (done + size < count) { - pr_warning("Line length is too long: " - "Should be less than %d.", WRITE_BUFSIZE); - ret = -EINVAL; - goto out; - } - done += size; - /* Remove comments */ - tmp = strchr(kbuf, '#'); - if (tmp) - *tmp = '\0'; - - ret = command_trace_probe(kbuf); - if (ret) - goto out; - } - ret = done; -out: - kfree(kbuf); - return ret; + return traceprobe_probes_write(file, buffer, count, ppos, + create_trace_probe); } static const struct file_operations kprobe_events_ops = { @@ -1711,16 +867,6 @@ partial: return TRACE_TYPE_PARTIAL_LINE; } -#undef DEFINE_FIELD -#define DEFINE_FIELD(type, item, name, is_signed) \ - do { \ - ret = trace_define_field(event_call, #type, name, \ - offsetof(typeof(field), item), \ - sizeof(field.item), is_signed, \ - FILTER_OTHER); \ - if (ret) \ - return ret; \ - } while (0) static int kprobe_event_define_fields(struct ftrace_event_call *event_call) { @@ -2051,8 +1197,9 @@ static __init int kprobe_trace_self_tests_init(void) pr_info("Testing kprobe tracing: "); - ret = command_trace_probe("p:testprobe kprobe_trace_selftest_target " - "$stack $stack0 +0($stack)"); + ret = traceprobe_command("p:testprobe kprobe_trace_selftest_target " + "$stack $stack0 +0($stack)", + create_trace_probe); if (WARN_ON_ONCE(ret)) { pr_warning("error on probing function entry.\n"); warn++; @@ -2066,8 +1213,8 @@ static __init int kprobe_trace_self_tests_init(void) enable_trace_probe(tp, TP_FLAG_TRACE); } - ret = command_trace_probe("r:testprobe2 kprobe_trace_selftest_target " - "$retval"); + ret = traceprobe_command("r:testprobe2 kprobe_trace_selftest_target " + "$retval", create_trace_probe); if (WARN_ON_ONCE(ret)) { pr_warning("error on probing function return.\n"); warn++; @@ -2101,13 +1248,13 @@ static __init int kprobe_trace_self_tests_init(void) } else disable_trace_probe(tp, TP_FLAG_TRACE); - ret = command_trace_probe("-:testprobe"); + ret = traceprobe_command("-:testprobe", create_trace_probe); if 
(WARN_ON_ONCE(ret)) { pr_warning("error on deleting a probe.\n"); warn++; } - ret = command_trace_probe("-:testprobe2"); + ret = traceprobe_command("-:testprobe2", create_trace_probe); if (WARN_ON_ONCE(ret)) { pr_warning("error on deleting a probe.\n"); warn++; diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c new file mode 100644 index 000000000000..8e526b9286e9 --- /dev/null +++ b/kernel/trace/trace_probe.c @@ -0,0 +1,833 @@ +/* + * Common code for probe-based Dynamic events. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * This code was copied from kernel/trace/trace_kprobe.c written by + * Masami Hiramatsu + * + * Updates to make this generic: + * Copyright (C) IBM Corporation, 2010-2011 + * Author: Srikar Dronamraju + */ + +#include "trace_probe.h" + +const char *reserved_field_names[] = { + "common_type", + "common_flags", + "common_preempt_count", + "common_pid", + "common_tgid", + FIELD_STRING_IP, + FIELD_STRING_RETIP, + FIELD_STRING_FUNC, +}; + +/* Printing function type */ +#define PRINT_TYPE_FUNC_NAME(type) print_type_##type +#define PRINT_TYPE_FMT_NAME(type) print_type_format_##type + +/* Printing in basic type function template */ +#define DEFINE_BASIC_PRINT_TYPE_FUNC(type, fmt, cast) \ +static __kprobes int PRINT_TYPE_FUNC_NAME(type)(struct trace_seq *s, \ + const char *name, \ + void *data, void *ent)\ +{ \ + return trace_seq_printf(s, " %s=" fmt, name, (cast)*(type *)data);\ +} \ +static const char PRINT_TYPE_FMT_NAME(type)[] = fmt; + +DEFINE_BASIC_PRINT_TYPE_FUNC(u8, "%x", unsigned int) +DEFINE_BASIC_PRINT_TYPE_FUNC(u16, "%x", unsigned int) +DEFINE_BASIC_PRINT_TYPE_FUNC(u32, "%lx", unsigned long) +DEFINE_BASIC_PRINT_TYPE_FUNC(u64, "%llx", unsigned long long) +DEFINE_BASIC_PRINT_TYPE_FUNC(s8, "%d", int) +DEFINE_BASIC_PRINT_TYPE_FUNC(s16, "%d", int) +DEFINE_BASIC_PRINT_TYPE_FUNC(s32, "%ld", long) +DEFINE_BASIC_PRINT_TYPE_FUNC(s64, "%lld", long long) + +static inline void *get_rloc_data(u32 *dl) +{ + return (u8 *)dl + get_rloc_offs(*dl); +} + +/* For data_loc conversion */ +static inline void *get_loc_data(u32 *dl, void *ent) +{ + return (u8 *)ent + get_rloc_offs(*dl); +} + +/* For defining macros, define string/string_size types */ +typedef u32 string; +typedef u32 string_size; + +/* Print type function for string type */ +static __kprobes int PRINT_TYPE_FUNC_NAME(string)(struct trace_seq *s, + const char *name, + void *data, void *ent) +{ + int len = *(u32 *)data >> 16; + + if (!len) + return trace_seq_printf(s, " %s=(fault)", name); + else + return trace_seq_printf(s, " %s=\"%s\"", name, + (const char *)get_loc_data(data, ent)); +} + +static const char PRINT_TYPE_FMT_NAME(string)[] = "\\\"%s\\\""; + +#define FETCH_FUNC_NAME(method, type) fetch_##method##_##type +/* + * Define macro for basic types - we don't need to define s* types, because + * we have to care only about bitwidth at recording time. 
+ */ +#define DEFINE_BASIC_FETCH_FUNCS(method) \ +DEFINE_FETCH_##method(u8) \ +DEFINE_FETCH_##method(u16) \ +DEFINE_FETCH_##method(u32) \ +DEFINE_FETCH_##method(u64) + +#define CHECK_FETCH_FUNCS(method, fn) \ + (((FETCH_FUNC_NAME(method, u8) == fn) || \ + (FETCH_FUNC_NAME(method, u16) == fn) || \ + (FETCH_FUNC_NAME(method, u32) == fn) || \ + (FETCH_FUNC_NAME(method, u64) == fn) || \ + (FETCH_FUNC_NAME(method, string) == fn) || \ + (FETCH_FUNC_NAME(method, string_size) == fn)) \ + && (fn != NULL)) + +/* Data fetch function templates */ +#define DEFINE_FETCH_reg(type) \ +static __kprobes void FETCH_FUNC_NAME(reg, type)(struct pt_regs *regs, \ + void *offset, void *dest) \ +{ \ + *(type *)dest = (type)regs_get_register(regs, \ + (unsigned int)((unsigned long)offset)); \ +} +DEFINE_BASIC_FETCH_FUNCS(reg) +/* No string on the register */ +#define fetch_reg_string NULL +#define fetch_reg_string_size NULL + +#define DEFINE_FETCH_stack(type) \ +static __kprobes void FETCH_FUNC_NAME(stack, type)(struct pt_regs *regs,\ + void *offset, void *dest) \ +{ \ + *(type *)dest = (type)regs_get_kernel_stack_nth(regs, \ + (unsigned int)((unsigned long)offset)); \ +} +DEFINE_BASIC_FETCH_FUNCS(stack) +/* No string on the stack entry */ +#define fetch_stack_string NULL +#define fetch_stack_string_size NULL + +#define DEFINE_FETCH_retval(type) \ +static __kprobes void FETCH_FUNC_NAME(retval, type)(struct pt_regs *regs,\ + void *dummy, void *dest) \ +{ \ + *(type *)dest = (type)regs_return_value(regs); \ +} +DEFINE_BASIC_FETCH_FUNCS(retval) +/* No string on the retval */ +#define fetch_retval_string NULL +#define fetch_retval_string_size NULL + +#define DEFINE_FETCH_memory(type) \ +static __kprobes void FETCH_FUNC_NAME(memory, type)(struct pt_regs *regs,\ + void *addr, void *dest) \ +{ \ + type retval; \ + if (probe_kernel_address(addr, retval)) \ + *(type *)dest = 0; \ + else \ + *(type *)dest = retval; \ +} +DEFINE_BASIC_FETCH_FUNCS(memory) +/* + * Fetch a null-terminated string. Caller MUST set *(u32 *)dest with max + * length and relative data location. + */ +static __kprobes void FETCH_FUNC_NAME(memory, string)(struct pt_regs *regs, + void *addr, void *dest) +{ + long ret; + int maxlen = get_rloc_len(*(u32 *)dest); + u8 *dst = get_rloc_data(dest); + u8 *src = addr; + mm_segment_t old_fs = get_fs(); + + if (!maxlen) + return; + + /* + * Try to get string again, since the string can be changed while + * probing. 
+ */ + set_fs(KERNEL_DS); + pagefault_disable(); + + do + ret = __copy_from_user_inatomic(dst++, src++, 1); + while (dst[-1] && ret == 0 && src - (u8 *)addr < maxlen); + + dst[-1] = '\0'; + pagefault_enable(); + set_fs(old_fs); + + if (ret < 0) { /* Failed to fetch string */ + ((u8 *)get_rloc_data(dest))[0] = '\0'; + *(u32 *)dest = make_data_rloc(0, get_rloc_offs(*(u32 *)dest)); + } else { + *(u32 *)dest = make_data_rloc(src - (u8 *)addr, + get_rloc_offs(*(u32 *)dest)); + } +} + +/* Return the length of string -- including null terminal byte */ +static __kprobes void FETCH_FUNC_NAME(memory, string_size)(struct pt_regs *regs, + void *addr, void *dest) +{ + mm_segment_t old_fs; + int ret, len = 0; + u8 c; + + old_fs = get_fs(); + set_fs(KERNEL_DS); + pagefault_disable(); + + do { + ret = __copy_from_user_inatomic(&c, (u8 *)addr + len, 1); + len++; + } while (c && ret == 0 && len < MAX_STRING_SIZE); + + pagefault_enable(); + set_fs(old_fs); + + if (ret < 0) /* Failed to check the length */ + *(u32 *)dest = 0; + else + *(u32 *)dest = len; +} + +/* Memory fetching by symbol */ +struct symbol_cache { + char *symbol; + long offset; + unsigned long addr; +}; + +static unsigned long update_symbol_cache(struct symbol_cache *sc) +{ + sc->addr = (unsigned long)kallsyms_lookup_name(sc->symbol); + + if (sc->addr) + sc->addr += sc->offset; + + return sc->addr; +} + +static void free_symbol_cache(struct symbol_cache *sc) +{ + kfree(sc->symbol); + kfree(sc); +} + +static struct symbol_cache *alloc_symbol_cache(const char *sym, long offset) +{ + struct symbol_cache *sc; + + if (!sym || strlen(sym) == 0) + return NULL; + + sc = kzalloc(sizeof(struct symbol_cache), GFP_KERNEL); + if (!sc) + return NULL; + + sc->symbol = kstrdup(sym, GFP_KERNEL); + if (!sc->symbol) { + kfree(sc); + return NULL; + } + sc->offset = offset; + update_symbol_cache(sc); + + return sc; +} + +#define DEFINE_FETCH_symbol(type) \ +static __kprobes void FETCH_FUNC_NAME(symbol, type)(struct pt_regs *regs,\ + void *data, void *dest) \ +{ \ + struct symbol_cache *sc = data; \ + if (sc->addr) \ + fetch_memory_##type(regs, (void *)sc->addr, dest); \ + else \ + *(type *)dest = 0; \ +} +DEFINE_BASIC_FETCH_FUNCS(symbol) +DEFINE_FETCH_symbol(string) +DEFINE_FETCH_symbol(string_size) + +/* Dereference memory access function */ +struct deref_fetch_param { + struct fetch_param orig; + long offset; +}; + +#define DEFINE_FETCH_deref(type) \ +static __kprobes void FETCH_FUNC_NAME(deref, type)(struct pt_regs *regs,\ + void *data, void *dest) \ +{ \ + struct deref_fetch_param *dprm = data; \ + unsigned long addr; \ + call_fetch(&dprm->orig, regs, &addr); \ + if (addr) { \ + addr += dprm->offset; \ + fetch_memory_##type(regs, (void *)addr, dest); \ + } else \ + *(type *)dest = 0; \ +} +DEFINE_BASIC_FETCH_FUNCS(deref) +DEFINE_FETCH_deref(string) +DEFINE_FETCH_deref(string_size) + +static __kprobes void update_deref_fetch_param(struct deref_fetch_param *data) +{ + if (CHECK_FETCH_FUNCS(deref, data->orig.fn)) + update_deref_fetch_param(data->orig.data); + else if (CHECK_FETCH_FUNCS(symbol, data->orig.fn)) + update_symbol_cache(data->orig.data); +} + +static __kprobes void free_deref_fetch_param(struct deref_fetch_param *data) +{ + if (CHECK_FETCH_FUNCS(deref, data->orig.fn)) + free_deref_fetch_param(data->orig.data); + else if (CHECK_FETCH_FUNCS(symbol, data->orig.fn)) + free_symbol_cache(data->orig.data); + kfree(data); +} + +/* Bitfield fetch function */ +struct bitfield_fetch_param { + struct fetch_param orig; + unsigned char hi_shift; + unsigned char 
low_shift; +}; + +#define DEFINE_FETCH_bitfield(type) \ +static __kprobes void FETCH_FUNC_NAME(bitfield, type)(struct pt_regs *regs,\ + void *data, void *dest) \ +{ \ + struct bitfield_fetch_param *bprm = data; \ + type buf = 0; \ + call_fetch(&bprm->orig, regs, &buf); \ + if (buf) { \ + buf <<= bprm->hi_shift; \ + buf >>= bprm->low_shift; \ + } \ + *(type *)dest = buf; \ +} + +DEFINE_BASIC_FETCH_FUNCS(bitfield) +#define fetch_bitfield_string NULL +#define fetch_bitfield_string_size NULL + +static __kprobes void +update_bitfield_fetch_param(struct bitfield_fetch_param *data) +{ + /* + * Don't check the bitfield itself, because this must be the + * last fetch function. + */ + if (CHECK_FETCH_FUNCS(deref, data->orig.fn)) + update_deref_fetch_param(data->orig.data); + else if (CHECK_FETCH_FUNCS(symbol, data->orig.fn)) + update_symbol_cache(data->orig.data); +} + +static __kprobes void +free_bitfield_fetch_param(struct bitfield_fetch_param *data) +{ + /* + * Don't check the bitfield itself, because this must be the + * last fetch function. + */ + if (CHECK_FETCH_FUNCS(deref, data->orig.fn)) + free_deref_fetch_param(data->orig.data); + else if (CHECK_FETCH_FUNCS(symbol, data->orig.fn)) + free_symbol_cache(data->orig.data); + + kfree(data); +} + +/* Default (unsigned long) fetch type */ +#define __DEFAULT_FETCH_TYPE(t) u##t +#define _DEFAULT_FETCH_TYPE(t) __DEFAULT_FETCH_TYPE(t) +#define DEFAULT_FETCH_TYPE _DEFAULT_FETCH_TYPE(BITS_PER_LONG) +#define DEFAULT_FETCH_TYPE_STR __stringify(DEFAULT_FETCH_TYPE) + +#define ASSIGN_FETCH_FUNC(method, type) \ + [FETCH_MTD_##method] = FETCH_FUNC_NAME(method, type) + +#define __ASSIGN_FETCH_TYPE(_name, ptype, ftype, _size, sign, _fmttype) \ + {.name = _name, \ + .size = _size, \ + .is_signed = sign, \ + .print = PRINT_TYPE_FUNC_NAME(ptype), \ + .fmt = PRINT_TYPE_FMT_NAME(ptype), \ + .fmttype = _fmttype, \ + .fetch = { \ +ASSIGN_FETCH_FUNC(reg, ftype), \ +ASSIGN_FETCH_FUNC(stack, ftype), \ +ASSIGN_FETCH_FUNC(retval, ftype), \ +ASSIGN_FETCH_FUNC(memory, ftype), \ +ASSIGN_FETCH_FUNC(symbol, ftype), \ +ASSIGN_FETCH_FUNC(deref, ftype), \ +ASSIGN_FETCH_FUNC(bitfield, ftype), \ + } \ + } + +#define ASSIGN_FETCH_TYPE(ptype, ftype, sign) \ + __ASSIGN_FETCH_TYPE(#ptype, ptype, ftype, sizeof(ftype), sign, #ptype) + +#define FETCH_TYPE_STRING 0 +#define FETCH_TYPE_STRSIZE 1 + +/* Fetch type information table */ +static const struct fetch_type fetch_type_table[] = { + /* Special types */ + [FETCH_TYPE_STRING] = __ASSIGN_FETCH_TYPE("string", string, string, + sizeof(u32), 1, "__data_loc char[]"), + [FETCH_TYPE_STRSIZE] = __ASSIGN_FETCH_TYPE("string_size", u32, + string_size, sizeof(u32), 0, "u32"), + /* Basic types */ + ASSIGN_FETCH_TYPE(u8, u8, 0), + ASSIGN_FETCH_TYPE(u16, u16, 0), + ASSIGN_FETCH_TYPE(u32, u32, 0), + ASSIGN_FETCH_TYPE(u64, u64, 0), + ASSIGN_FETCH_TYPE(s8, u8, 1), + ASSIGN_FETCH_TYPE(s16, u16, 1), + ASSIGN_FETCH_TYPE(s32, u32, 1), + ASSIGN_FETCH_TYPE(s64, u64, 1), +}; + +static const struct fetch_type *find_fetch_type(const char *type) +{ + int i; + + if (!type) + type = DEFAULT_FETCH_TYPE_STR; + + /* Special case: bitfield */ + if (*type == 'b') { + unsigned long bs; + + type = strchr(type, '/'); + if (!type) + goto fail; + + type++; + if (strict_strtoul(type, 0, &bs)) + goto fail; + + switch (bs) { + case 8: + return find_fetch_type("u8"); + case 16: + return find_fetch_type("u16"); + case 32: + return find_fetch_type("u32"); + case 64: + return find_fetch_type("u64"); + default: + goto fail; + } + } + + for (i = 0; i < ARRAY_SIZE(fetch_type_table); i++) 
+ if (strcmp(type, fetch_type_table[i].name) == 0) + return &fetch_type_table[i]; + +fail: + return NULL; +} + +/* Special function : only accept unsigned long */ +static __kprobes void fetch_stack_address(struct pt_regs *regs, + void *dummy, void *dest) +{ + *(unsigned long *)dest = kernel_stack_pointer(regs); +} + +static fetch_func_t get_fetch_size_function(const struct fetch_type *type, + fetch_func_t orig_fn) +{ + int i; + + if (type != &fetch_type_table[FETCH_TYPE_STRING]) + return NULL; /* Only string type needs size function */ + + for (i = 0; i < FETCH_MTD_END; i++) + if (type->fetch[i] == orig_fn) + return fetch_type_table[FETCH_TYPE_STRSIZE].fetch[i]; + + WARN_ON(1); /* This should not happen */ + + return NULL; +} + +/* Split symbol and offset. */ +int traceprobe_split_symbol_offset(char *symbol, unsigned long *offset) +{ + char *tmp; + int ret; + + if (!offset) + return -EINVAL; + + tmp = strchr(symbol, '+'); + if (tmp) { + /* skip sign because strict_strtol doesn't accept '+' */ + ret = strict_strtoul(tmp + 1, 0, offset); + if (ret) + return ret; + + *tmp = '\0'; + } else + *offset = 0; + + return 0; +} + +#define PARAM_MAX_STACK (THREAD_SIZE / sizeof(unsigned long)) + +static int parse_probe_vars(char *arg, const struct fetch_type *t, + struct fetch_param *f, bool is_return) +{ + int ret = 0; + unsigned long param; + + if (strcmp(arg, "retval") == 0) { + if (is_return) + f->fn = t->fetch[FETCH_MTD_retval]; + else + ret = -EINVAL; + } else if (strncmp(arg, "stack", 5) == 0) { + if (arg[5] == '\0') { + if (strcmp(t->name, DEFAULT_FETCH_TYPE_STR) == 0) + f->fn = fetch_stack_address; + else + ret = -EINVAL; + } else if (isdigit(arg[5])) { + ret = strict_strtoul(arg + 5, 10, ¶m); + if (ret || param > PARAM_MAX_STACK) + ret = -EINVAL; + else { + f->fn = t->fetch[FETCH_MTD_stack]; + f->data = (void *)param; + } + } else + ret = -EINVAL; + } else + ret = -EINVAL; + + return ret; +} + +/* Recursive argument parser */ +static int parse_probe_arg(char *arg, const struct fetch_type *t, + struct fetch_param *f, bool is_return) +{ + unsigned long param; + long offset; + char *tmp; + int ret; + + ret = 0; + switch (arg[0]) { + case '$': + ret = parse_probe_vars(arg + 1, t, f, is_return); + break; + + case '%': /* named register */ + ret = regs_query_register_offset(arg + 1); + if (ret >= 0) { + f->fn = t->fetch[FETCH_MTD_reg]; + f->data = (void *)(unsigned long)ret; + ret = 0; + } + break; + + case '@': /* memory or symbol */ + if (isdigit(arg[1])) { + ret = strict_strtoul(arg + 1, 0, ¶m); + if (ret) + break; + + f->fn = t->fetch[FETCH_MTD_memory]; + f->data = (void *)param; + } else { + ret = traceprobe_split_symbol_offset(arg + 1, &offset); + if (ret) + break; + + f->data = alloc_symbol_cache(arg + 1, offset); + if (f->data) + f->fn = t->fetch[FETCH_MTD_symbol]; + } + break; + + case '+': /* deref memory */ + arg++; /* Skip '+', because strict_strtol() rejects it. 
*/ + case '-': + tmp = strchr(arg, '('); + if (!tmp) + break; + + *tmp = '\0'; + ret = strict_strtol(arg, 0, &offset); + + if (ret) + break; + + arg = tmp + 1; + tmp = strrchr(arg, ')'); + + if (tmp) { + struct deref_fetch_param *dprm; + const struct fetch_type *t2; + + t2 = find_fetch_type(NULL); + *tmp = '\0'; + dprm = kzalloc(sizeof(struct deref_fetch_param), GFP_KERNEL); + + if (!dprm) + return -ENOMEM; + + dprm->offset = offset; + ret = parse_probe_arg(arg, t2, &dprm->orig, is_return); + if (ret) + kfree(dprm); + else { + f->fn = t->fetch[FETCH_MTD_deref]; + f->data = (void *)dprm; + } + } + break; + } + if (!ret && !f->fn) { /* Parsed, but do not find fetch method */ + pr_info("%s type has no corresponding fetch method.\n", t->name); + ret = -EINVAL; + } + + return ret; +} + +#define BYTES_TO_BITS(nb) ((BITS_PER_LONG * (nb)) / sizeof(long)) + +/* Bitfield type needs to be parsed into a fetch function */ +static int __parse_bitfield_probe_arg(const char *bf, + const struct fetch_type *t, + struct fetch_param *f) +{ + struct bitfield_fetch_param *bprm; + unsigned long bw, bo; + char *tail; + + if (*bf != 'b') + return 0; + + bprm = kzalloc(sizeof(*bprm), GFP_KERNEL); + if (!bprm) + return -ENOMEM; + + bprm->orig = *f; + f->fn = t->fetch[FETCH_MTD_bitfield]; + f->data = (void *)bprm; + bw = simple_strtoul(bf + 1, &tail, 0); /* Use simple one */ + + if (bw == 0 || *tail != '@') + return -EINVAL; + + bf = tail + 1; + bo = simple_strtoul(bf, &tail, 0); + + if (tail == bf || *tail != '/') + return -EINVAL; + + bprm->hi_shift = BYTES_TO_BITS(t->size) - (bw + bo); + bprm->low_shift = bprm->hi_shift + bo; + + return (BYTES_TO_BITS(t->size) < (bw + bo)) ? -EINVAL : 0; +} + +/* String length checking wrapper */ +int traceprobe_parse_probe_arg(char *arg, ssize_t *size, + struct probe_arg *parg, bool is_return) +{ + const char *t; + int ret; + + if (strlen(arg) > MAX_ARGSTR_LEN) { + pr_info("Argument is too long.: %s\n", arg); + return -ENOSPC; + } + parg->comm = kstrdup(arg, GFP_KERNEL); + if (!parg->comm) { + pr_info("Failed to allocate memory for command '%s'.\n", arg); + return -ENOMEM; + } + t = strchr(parg->comm, ':'); + if (t) { + arg[t - parg->comm] = '\0'; + t++; + } + parg->type = find_fetch_type(t); + if (!parg->type) { + pr_info("Unsupported type: %s\n", t); + return -EINVAL; + } + parg->offset = *size; + *size += parg->type->size; + ret = parse_probe_arg(arg, parg->type, &parg->fetch, is_return); + + if (ret >= 0 && t != NULL) + ret = __parse_bitfield_probe_arg(t, parg->type, &parg->fetch); + + if (ret >= 0) { + parg->fetch_size.fn = get_fetch_size_function(parg->type, + parg->fetch.fn); + parg->fetch_size.data = parg->fetch.data; + } + + return ret; +} + +/* Return 1 if name is reserved or already used by another argument */ +int traceprobe_conflict_field_name(const char *name, + struct probe_arg *args, int narg) +{ + int i; + + for (i = 0; i < ARRAY_SIZE(reserved_field_names); i++) + if (strcmp(reserved_field_names[i], name) == 0) + return 1; + + for (i = 0; i < narg; i++) + if (strcmp(args[i].name, name) == 0) + return 1; + + return 0; +} + +void traceprobe_update_arg(struct probe_arg *arg) +{ + if (CHECK_FETCH_FUNCS(bitfield, arg->fetch.fn)) + update_bitfield_fetch_param(arg->fetch.data); + else if (CHECK_FETCH_FUNCS(deref, arg->fetch.fn)) + update_deref_fetch_param(arg->fetch.data); + else if (CHECK_FETCH_FUNCS(symbol, arg->fetch.fn)) + update_symbol_cache(arg->fetch.data); +} + +void traceprobe_free_probe_arg(struct probe_arg *arg) +{ + if (CHECK_FETCH_FUNCS(bitfield, 
arg->fetch.fn)) + free_bitfield_fetch_param(arg->fetch.data); + else if (CHECK_FETCH_FUNCS(deref, arg->fetch.fn)) + free_deref_fetch_param(arg->fetch.data); + else if (CHECK_FETCH_FUNCS(symbol, arg->fetch.fn)) + free_symbol_cache(arg->fetch.data); + + kfree(arg->name); + kfree(arg->comm); +} + +int traceprobe_command(const char *buf, int (*createfn)(int, char **)) +{ + char **argv; + int argc, ret; + + argc = 0; + ret = 0; + argv = argv_split(GFP_KERNEL, buf, &argc); + if (!argv) + return -ENOMEM; + + if (argc) + ret = createfn(argc, argv); + + argv_free(argv); + + return ret; +} + +#define WRITE_BUFSIZE 4096 + +ssize_t traceprobe_probes_write(struct file *file, const char __user *buffer, + size_t count, loff_t *ppos, + int (*createfn)(int, char **)) +{ + char *kbuf, *tmp; + int ret = 0; + size_t done = 0; + size_t size; + + kbuf = kmalloc(WRITE_BUFSIZE, GFP_KERNEL); + if (!kbuf) + return -ENOMEM; + + while (done < count) { + size = count - done; + + if (size >= WRITE_BUFSIZE) + size = WRITE_BUFSIZE - 1; + + if (copy_from_user(kbuf, buffer + done, size)) { + ret = -EFAULT; + goto out; + } + kbuf[size] = '\0'; + tmp = strchr(kbuf, '\n'); + + if (tmp) { + *tmp = '\0'; + size = tmp - kbuf + 1; + } else if (done + size < count) { + pr_warning("Line length is too long: " + "Should be less than %d.", WRITE_BUFSIZE); + ret = -EINVAL; + goto out; + } + done += size; + /* Remove comments */ + tmp = strchr(kbuf, '#'); + + if (tmp) + *tmp = '\0'; + + ret = traceprobe_command(kbuf, createfn); + if (ret) + goto out; + } + ret = done; + +out: + kfree(kbuf); + + return ret; +} diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h new file mode 100644 index 000000000000..2df9a18e0252 --- /dev/null +++ b/kernel/trace/trace_probe.h @@ -0,0 +1,160 @@ +/* + * Common header file for probe-based Dynamic events. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * This code was copied from kernel/trace/trace_kprobe.h written by + * Masami Hiramatsu + * + * Updates to make this generic: + * Copyright (C) IBM Corporation, 2010-2011 + * Author: Srikar Dronamraju + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "trace.h" +#include "trace_output.h" + +#define MAX_TRACE_ARGS 128 +#define MAX_ARGSTR_LEN 63 +#define MAX_EVENT_NAME_LEN 64 +#define MAX_STRING_SIZE PATH_MAX + +/* Reserved field names */ +#define FIELD_STRING_IP "__probe_ip" +#define FIELD_STRING_RETIP "__probe_ret_ip" +#define FIELD_STRING_FUNC "__probe_func" + +#undef DEFINE_FIELD +#define DEFINE_FIELD(type, item, name, is_signed) \ + do { \ + ret = trace_define_field(event_call, #type, name, \ + offsetof(typeof(field), item), \ + sizeof(field.item), is_signed, \ + FILTER_OTHER); \ + if (ret) \ + return ret; \ + } while (0) + + +/* Flags for trace_probe */ +#define TP_FLAG_TRACE 1 +#define TP_FLAG_PROFILE 2 +#define TP_FLAG_REGISTERED 4 + + +/* data_rloc: data relative location, compatible with u32 */ +#define make_data_rloc(len, roffs) \ + (((u32)(len) << 16) | ((u32)(roffs) & 0xffff)) +#define get_rloc_len(dl) ((u32)(dl) >> 16) +#define get_rloc_offs(dl) ((u32)(dl) & 0xffff) + +/* + * Convert data_rloc to data_loc: + * data_rloc stores the offset from data_rloc itself, but data_loc + * stores the offset from event entry. + */ +#define convert_rloc_to_loc(dl, offs) ((u32)(dl) + (offs)) + +/* Data fetch function type */ +typedef void (*fetch_func_t)(struct pt_regs *, void *, void *); +/* Printing function type */ +typedef int (*print_type_func_t)(struct trace_seq *, const char *, void *, void *); + +/* Fetch types */ +enum { + FETCH_MTD_reg = 0, + FETCH_MTD_stack, + FETCH_MTD_retval, + FETCH_MTD_memory, + FETCH_MTD_symbol, + FETCH_MTD_deref, + FETCH_MTD_bitfield, + FETCH_MTD_END, +}; + +/* Fetch type information table */ +struct fetch_type { + const char *name; /* Name of type */ + size_t size; /* Byte size of type */ + int is_signed; /* Signed flag */ + print_type_func_t print; /* Print functions */ + const char *fmt; /* Fromat string */ + const char *fmttype; /* Name in format file */ + /* Fetch functions */ + fetch_func_t fetch[FETCH_MTD_END]; +}; + +struct fetch_param { + fetch_func_t fn; + void *data; +}; + +struct probe_arg { + struct fetch_param fetch; + struct fetch_param fetch_size; + unsigned int offset; /* Offset from argument entry */ + const char *name; /* Name of this argument */ + const char *comm; /* Command of this argument */ + const struct fetch_type *type; /* Type of this argument */ +}; + +static inline __kprobes void call_fetch(struct fetch_param *fprm, + struct pt_regs *regs, void *dest) +{ + return fprm->fn(regs, fprm->data, dest); +} + +/* Check the name is good for event/group/fields */ +static inline int is_good_name(const char *name) +{ + if (!isalpha(*name) && *name != '_') + return 0; + while (*++name != '\0') { + if (!isalpha(*name) && !isdigit(*name) && *name != '_') + return 0; + } + return 1; +} + +extern int traceprobe_parse_probe_arg(char *arg, ssize_t *size, + struct probe_arg *parg, bool is_return); + +extern int traceprobe_conflict_field_name(const char *name, + struct probe_arg *args, int narg); + +extern void 
traceprobe_update_arg(struct probe_arg *arg); +extern void traceprobe_free_probe_arg(struct probe_arg *arg); + +extern int traceprobe_split_symbol_offset(char *symbol, unsigned long *offset); + +extern ssize_t traceprobe_probes_write(struct file *file, + const char __user *buffer, size_t count, loff_t *ppos, + int (*createfn)(int, char**)); + +extern int traceprobe_command(const char *buf, int (*createfn)(int, char**)); -- cgit v1.2.3 From f3f096cfedf8113380c56fc855275cc75cd8cf55 Mon Sep 17 00:00:00 2001 From: Srikar Dronamraju Date: Wed, 11 Apr 2012 16:00:43 +0530 Subject: tracing: Provide trace events interface for uprobes Implements trace_event support for uprobes. In its current form it can be used to put probes at a specified offset in a file and dump the required registers when the code flow reaches the probed address. The following example shows how to dump the instruction pointer and %ax a register at the probed text address. Here we are trying to probe zfree in /bin/zsh: # cd /sys/kernel/debug/tracing/ # cat /proc/`pgrep zsh`/maps | grep /bin/zsh | grep r-xp 00400000-0048a000 r-xp 00000000 08:03 130904 /bin/zsh # objdump -T /bin/zsh | grep -w zfree 0000000000446420 g DF .text 0000000000000012 Base zfree # echo 'p /bin/zsh:0x46420 %ip %ax' > uprobe_events # cat uprobe_events p:uprobes/p_zsh_0x46420 /bin/zsh:0x0000000000046420 # echo 1 > events/uprobes/enable # sleep 20 # echo 0 > events/uprobes/enable # cat trace # tracer: nop # # TASK-PID CPU# TIMESTAMP FUNCTION # | | | | | zsh-24842 [006] 258544.995456: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79 zsh-24842 [007] 258545.000270: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79 zsh-24842 [002] 258545.043929: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79 zsh-24842 [004] 258547.046129: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79 Signed-off-by: Srikar Dronamraju Acked-by: Steven Rostedt Acked-by: Masami Hiramatsu Cc: Linus Torvalds Cc: Ananth N Mavinakayanahalli Cc: Jim Keniston Cc: Linux-mm Cc: Oleg Nesterov Cc: Andi Kleen Cc: Christoph Hellwig Cc: Arnaldo Carvalho de Melo Cc: Anton Arapov Cc: Peter Zijlstra Link: http://lkml.kernel.org/r/20120411103043.GB29437@linux.vnet.ibm.com Signed-off-by: Ingo Molnar --- kernel/trace/Kconfig | 16 + kernel/trace/Makefile | 1 + kernel/trace/trace.h | 5 + kernel/trace/trace_kprobe.c | 2 +- kernel/trace/trace_probe.c | 14 +- kernel/trace/trace_probe.h | 3 +- kernel/trace/trace_uprobe.c | 788 ++++++++++++++++++++++++++++++++++++++++++++ 7 files changed, 823 insertions(+), 6 deletions(-) create mode 100644 kernel/trace/trace_uprobe.c (limited to 'kernel') diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig index ce5a5c54ac8f..ea4bff6295fc 100644 --- a/kernel/trace/Kconfig +++ b/kernel/trace/Kconfig @@ -386,6 +386,22 @@ config KPROBE_EVENT This option is also required by perf-probe subcommand of perf tools. If you want to use perf tools, this option is strongly recommended. +config UPROBE_EVENT + bool "Enable uprobes-based dynamic events" + depends on ARCH_SUPPORTS_UPROBES + depends on MMU + select UPROBES + select PROBE_EVENTS + select TRACING + default n + help + This allows the user to add tracing events on top of userspace + dynamic events (similar to tracepoints) on the fly via the trace + events interface. Those events can be inserted wherever uprobes + can probe, and record various registers. + This option is required if you plan to use perf-probe subcommand + of perf tools on user space applications. 
+ config PROBE_EVENTS def_bool n diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile index fa10d5ca5ab1..1734c03e048b 100644 --- a/kernel/trace/Makefile +++ b/kernel/trace/Makefile @@ -62,5 +62,6 @@ ifeq ($(CONFIG_TRACING),y) obj-$(CONFIG_KGDB_KDB) += trace_kdb.o endif obj-$(CONFIG_PROBE_EVENTS) += trace_probe.o +obj-$(CONFIG_UPROBE_EVENT) += trace_uprobe.o libftrace-y := ftrace.o diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h index 95059f091a24..1bcdbec95a11 100644 --- a/kernel/trace/trace.h +++ b/kernel/trace/trace.h @@ -103,6 +103,11 @@ struct kretprobe_trace_entry_head { unsigned long ret_ip; }; +struct uprobe_trace_entry_head { + struct trace_entry ent; + unsigned long ip; +}; + /* * trace_flag_type is an enumeration that holds different * states when a trace occurs. These are: diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c index f8b777367d8e..b31d3d5699fe 100644 --- a/kernel/trace/trace_kprobe.c +++ b/kernel/trace/trace_kprobe.c @@ -525,7 +525,7 @@ static int create_trace_probe(int argc, char **argv) /* Parse fetch argument */ ret = traceprobe_parse_probe_arg(arg, &tp->size, &tp->args[i], - is_return); + is_return, true); if (ret) { pr_info("Parse error at argument[%d]. (%d)\n", i, ret); goto error; diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c index 8e526b9286e9..daa9980153af 100644 --- a/kernel/trace/trace_probe.c +++ b/kernel/trace/trace_probe.c @@ -550,7 +550,7 @@ static int parse_probe_vars(char *arg, const struct fetch_type *t, /* Recursive argument parser */ static int parse_probe_arg(char *arg, const struct fetch_type *t, - struct fetch_param *f, bool is_return) + struct fetch_param *f, bool is_return, bool is_kprobe) { unsigned long param; long offset; @@ -558,6 +558,11 @@ static int parse_probe_arg(char *arg, const struct fetch_type *t, int ret; ret = 0; + + /* Until uprobe_events supports only reg arguments */ + if (!is_kprobe && arg[0] != '%') + return -EINVAL; + switch (arg[0]) { case '$': ret = parse_probe_vars(arg + 1, t, f, is_return); @@ -619,7 +624,8 @@ static int parse_probe_arg(char *arg, const struct fetch_type *t, return -ENOMEM; dprm->offset = offset; - ret = parse_probe_arg(arg, t2, &dprm->orig, is_return); + ret = parse_probe_arg(arg, t2, &dprm->orig, is_return, + is_kprobe); if (ret) kfree(dprm); else { @@ -677,7 +683,7 @@ static int __parse_bitfield_probe_arg(const char *bf, /* String length checking wrapper */ int traceprobe_parse_probe_arg(char *arg, ssize_t *size, - struct probe_arg *parg, bool is_return) + struct probe_arg *parg, bool is_return, bool is_kprobe) { const char *t; int ret; @@ -703,7 +709,7 @@ int traceprobe_parse_probe_arg(char *arg, ssize_t *size, } parg->offset = *size; *size += parg->type->size; - ret = parse_probe_arg(arg, parg->type, &parg->fetch, is_return); + ret = parse_probe_arg(arg, parg->type, &parg->fetch, is_return, is_kprobe); if (ret >= 0 && t != NULL) ret = __parse_bitfield_probe_arg(t, parg->type, &parg->fetch); diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h index 2df9a18e0252..933708677814 100644 --- a/kernel/trace/trace_probe.h +++ b/kernel/trace/trace_probe.h @@ -66,6 +66,7 @@ #define TP_FLAG_TRACE 1 #define TP_FLAG_PROFILE 2 #define TP_FLAG_REGISTERED 4 +#define TP_FLAG_UPROBE 8 /* data_rloc: data relative location, compatible with u32 */ @@ -143,7 +144,7 @@ static inline int is_good_name(const char *name) } extern int traceprobe_parse_probe_arg(char *arg, ssize_t *size, - struct probe_arg *parg, bool is_return); + struct probe_arg 
*parg, bool is_return, bool is_kprobe); extern int traceprobe_conflict_field_name(const char *name, struct probe_arg *args, int narg); diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c new file mode 100644 index 000000000000..2b36ac68549e --- /dev/null +++ b/kernel/trace/trace_uprobe.c @@ -0,0 +1,788 @@ +/* + * uprobes-based tracing events + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * Copyright (C) IBM Corporation, 2010-2012 + * Author: Srikar Dronamraju + */ + +#include +#include +#include +#include + +#include "trace_probe.h" + +#define UPROBE_EVENT_SYSTEM "uprobes" + +/* + * uprobe event core functions + */ +struct trace_uprobe; +struct uprobe_trace_consumer { + struct uprobe_consumer cons; + struct trace_uprobe *tu; +}; + +struct trace_uprobe { + struct list_head list; + struct ftrace_event_class class; + struct ftrace_event_call call; + struct uprobe_trace_consumer *consumer; + struct inode *inode; + char *filename; + unsigned long offset; + unsigned long nhit; + unsigned int flags; /* For TP_FLAG_* */ + ssize_t size; /* trace entry size */ + unsigned int nr_args; + struct probe_arg args[]; +}; + +#define SIZEOF_TRACE_UPROBE(n) \ + (offsetof(struct trace_uprobe, args) + \ + (sizeof(struct probe_arg) * (n))) + +static int register_uprobe_event(struct trace_uprobe *tu); +static void unregister_uprobe_event(struct trace_uprobe *tu); + +static DEFINE_MUTEX(uprobe_lock); +static LIST_HEAD(uprobe_list); + +static int uprobe_dispatcher(struct uprobe_consumer *con, struct pt_regs *regs); + +/* + * Allocate new trace_uprobe and initialize it (including uprobes). 
+ */ +static struct trace_uprobe * +alloc_trace_uprobe(const char *group, const char *event, int nargs) +{ + struct trace_uprobe *tu; + + if (!event || !is_good_name(event)) + return ERR_PTR(-EINVAL); + + if (!group || !is_good_name(group)) + return ERR_PTR(-EINVAL); + + tu = kzalloc(SIZEOF_TRACE_UPROBE(nargs), GFP_KERNEL); + if (!tu) + return ERR_PTR(-ENOMEM); + + tu->call.class = &tu->class; + tu->call.name = kstrdup(event, GFP_KERNEL); + if (!tu->call.name) + goto error; + + tu->class.system = kstrdup(group, GFP_KERNEL); + if (!tu->class.system) + goto error; + + INIT_LIST_HEAD(&tu->list); + return tu; + +error: + kfree(tu->call.name); + kfree(tu); + + return ERR_PTR(-ENOMEM); +} + +static void free_trace_uprobe(struct trace_uprobe *tu) +{ + int i; + + for (i = 0; i < tu->nr_args; i++) + traceprobe_free_probe_arg(&tu->args[i]); + + iput(tu->inode); + kfree(tu->call.class->system); + kfree(tu->call.name); + kfree(tu->filename); + kfree(tu); +} + +static struct trace_uprobe *find_probe_event(const char *event, const char *group) +{ + struct trace_uprobe *tu; + + list_for_each_entry(tu, &uprobe_list, list) + if (strcmp(tu->call.name, event) == 0 && + strcmp(tu->call.class->system, group) == 0) + return tu; + + return NULL; +} + +/* Unregister a trace_uprobe and probe_event: call with locking uprobe_lock */ +static void unregister_trace_uprobe(struct trace_uprobe *tu) +{ + list_del(&tu->list); + unregister_uprobe_event(tu); + free_trace_uprobe(tu); +} + +/* Register a trace_uprobe and probe_event */ +static int register_trace_uprobe(struct trace_uprobe *tu) +{ + struct trace_uprobe *old_tp; + int ret; + + mutex_lock(&uprobe_lock); + + /* register as an event */ + old_tp = find_probe_event(tu->call.name, tu->call.class->system); + if (old_tp) + /* delete old event */ + unregister_trace_uprobe(old_tp); + + ret = register_uprobe_event(tu); + if (ret) { + pr_warning("Failed to register probe event(%d)\n", ret); + goto end; + } + + list_add_tail(&tu->list, &uprobe_list); + +end: + mutex_unlock(&uprobe_lock); + + return ret; +} + +/* + * Argument syntax: + * - Add uprobe: p[:[GRP/]EVENT] PATH:SYMBOL[+offs] [FETCHARGS] + * + * - Remove uprobe: -:[GRP/]EVENT + */ +static int create_trace_uprobe(int argc, char **argv) +{ + struct trace_uprobe *tu; + struct inode *inode; + char *arg, *event, *group, *filename; + char buf[MAX_EVENT_NAME_LEN]; + struct path path; + unsigned long offset; + bool is_delete; + int i, ret; + + inode = NULL; + ret = 0; + is_delete = false; + event = NULL; + group = NULL; + + /* argc must be >= 1 */ + if (argv[0][0] == '-') + is_delete = true; + else if (argv[0][0] != 'p') { + pr_info("Probe definition must be started with 'p', 'r' or" " '-'.\n"); + return -EINVAL; + } + + if (argv[0][1] == ':') { + event = &argv[0][2]; + arg = strchr(event, '/'); + + if (arg) { + group = event; + event = arg + 1; + event[-1] = '\0'; + + if (strlen(group) == 0) { + pr_info("Group name is not specified\n"); + return -EINVAL; + } + } + if (strlen(event) == 0) { + pr_info("Event name is not specified\n"); + return -EINVAL; + } + } + if (!group) + group = UPROBE_EVENT_SYSTEM; + + if (is_delete) { + if (!event) { + pr_info("Delete command needs an event name.\n"); + return -EINVAL; + } + mutex_lock(&uprobe_lock); + tu = find_probe_event(event, group); + + if (!tu) { + mutex_unlock(&uprobe_lock); + pr_info("Event %s/%s doesn't exist.\n", group, event); + return -ENOENT; + } + /* delete an event */ + unregister_trace_uprobe(tu); + mutex_unlock(&uprobe_lock); + return 0; + } + + if (argc < 2) { + 
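/*
 * Illustrative user-space sketch, not taken from the patch:
 * alloc_trace_uprobe() above sizes a single allocation for the structure
 * plus its flexible args[] array via SIZEOF_TRACE_UPROBE(). The same idiom
 * is shown below with simplified stand-in types ("struct probe",
 * "struct arg", SIZEOF_PROBE()) and calloc() standing in for kzalloc().
 */
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

struct arg {
	const char *name;
	unsigned int offset;
};

struct probe {
	unsigned int nr_args;
	struct arg args[];			/* flexible array member */
};

#define SIZEOF_PROBE(n) \
	(offsetof(struct probe, args) + sizeof(struct arg) * (n))

int main(void)
{
	unsigned int nargs = 3;
	struct probe *p = calloc(1, SIZEOF_PROBE(nargs));

	if (!p)
		return 1;
	p->nr_args = nargs;
	printf("allocated %zu bytes for %u args\n",
	       SIZEOF_PROBE(nargs), p->nr_args);
	free(p);
	return 0;
}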
pr_info("Probe point is not specified.\n"); + return -EINVAL; + } + if (isdigit(argv[1][0])) { + pr_info("probe point must be have a filename.\n"); + return -EINVAL; + } + arg = strchr(argv[1], ':'); + if (!arg) + goto fail_address_parse; + + *arg++ = '\0'; + filename = argv[1]; + ret = kern_path(filename, LOOKUP_FOLLOW, &path); + if (ret) + goto fail_address_parse; + + ret = strict_strtoul(arg, 0, &offset); + if (ret) + goto fail_address_parse; + + inode = igrab(path.dentry->d_inode); + + argc -= 2; + argv += 2; + + /* setup a probe */ + if (!event) { + char *tail = strrchr(filename, '/'); + char *ptr; + + ptr = kstrdup((tail ? tail + 1 : filename), GFP_KERNEL); + if (!ptr) { + ret = -ENOMEM; + goto fail_address_parse; + } + + tail = ptr; + ptr = strpbrk(tail, ".-_"); + if (ptr) + *ptr = '\0'; + + snprintf(buf, MAX_EVENT_NAME_LEN, "%c_%s_0x%lx", 'p', tail, offset); + event = buf; + kfree(tail); + } + + tu = alloc_trace_uprobe(group, event, argc); + if (IS_ERR(tu)) { + pr_info("Failed to allocate trace_uprobe.(%d)\n", (int)PTR_ERR(tu)); + ret = PTR_ERR(tu); + goto fail_address_parse; + } + tu->offset = offset; + tu->inode = inode; + tu->filename = kstrdup(filename, GFP_KERNEL); + + if (!tu->filename) { + pr_info("Failed to allocate filename.\n"); + ret = -ENOMEM; + goto error; + } + + /* parse arguments */ + ret = 0; + for (i = 0; i < argc && i < MAX_TRACE_ARGS; i++) { + /* Increment count for freeing args in error case */ + tu->nr_args++; + + /* Parse argument name */ + arg = strchr(argv[i], '='); + if (arg) { + *arg++ = '\0'; + tu->args[i].name = kstrdup(argv[i], GFP_KERNEL); + } else { + arg = argv[i]; + /* If argument name is omitted, set "argN" */ + snprintf(buf, MAX_EVENT_NAME_LEN, "arg%d", i + 1); + tu->args[i].name = kstrdup(buf, GFP_KERNEL); + } + + if (!tu->args[i].name) { + pr_info("Failed to allocate argument[%d] name.\n", i); + ret = -ENOMEM; + goto error; + } + + if (!is_good_name(tu->args[i].name)) { + pr_info("Invalid argument[%d] name: %s\n", i, tu->args[i].name); + ret = -EINVAL; + goto error; + } + + if (traceprobe_conflict_field_name(tu->args[i].name, tu->args, i)) { + pr_info("Argument[%d] name '%s' conflicts with " + "another field.\n", i, argv[i]); + ret = -EINVAL; + goto error; + } + + /* Parse fetch argument */ + ret = traceprobe_parse_probe_arg(arg, &tu->size, &tu->args[i], false, false); + if (ret) { + pr_info("Parse error at argument[%d]. 
(%d)\n", i, ret); + goto error; + } + } + + ret = register_trace_uprobe(tu); + if (ret) + goto error; + return 0; + +error: + free_trace_uprobe(tu); + return ret; + +fail_address_parse: + if (inode) + iput(inode); + + pr_info("Failed to parse address.\n"); + + return ret; +} + +static void cleanup_all_probes(void) +{ + struct trace_uprobe *tu; + + mutex_lock(&uprobe_lock); + while (!list_empty(&uprobe_list)) { + tu = list_entry(uprobe_list.next, struct trace_uprobe, list); + unregister_trace_uprobe(tu); + } + mutex_unlock(&uprobe_lock); +} + +/* Probes listing interfaces */ +static void *probes_seq_start(struct seq_file *m, loff_t *pos) +{ + mutex_lock(&uprobe_lock); + return seq_list_start(&uprobe_list, *pos); +} + +static void *probes_seq_next(struct seq_file *m, void *v, loff_t *pos) +{ + return seq_list_next(v, &uprobe_list, pos); +} + +static void probes_seq_stop(struct seq_file *m, void *v) +{ + mutex_unlock(&uprobe_lock); +} + +static int probes_seq_show(struct seq_file *m, void *v) +{ + struct trace_uprobe *tu = v; + int i; + + seq_printf(m, "p:%s/%s", tu->call.class->system, tu->call.name); + seq_printf(m, " %s:0x%p", tu->filename, (void *)tu->offset); + + for (i = 0; i < tu->nr_args; i++) + seq_printf(m, " %s=%s", tu->args[i].name, tu->args[i].comm); + + seq_printf(m, "\n"); + return 0; +} + +static const struct seq_operations probes_seq_op = { + .start = probes_seq_start, + .next = probes_seq_next, + .stop = probes_seq_stop, + .show = probes_seq_show +}; + +static int probes_open(struct inode *inode, struct file *file) +{ + if ((file->f_mode & FMODE_WRITE) && (file->f_flags & O_TRUNC)) + cleanup_all_probes(); + + return seq_open(file, &probes_seq_op); +} + +static ssize_t probes_write(struct file *file, const char __user *buffer, + size_t count, loff_t *ppos) +{ + return traceprobe_probes_write(file, buffer, count, ppos, create_trace_uprobe); +} + +static const struct file_operations uprobe_events_ops = { + .owner = THIS_MODULE, + .open = probes_open, + .read = seq_read, + .llseek = seq_lseek, + .release = seq_release, + .write = probes_write, +}; + +/* Probes profiling interfaces */ +static int probes_profile_seq_show(struct seq_file *m, void *v) +{ + struct trace_uprobe *tu = v; + + seq_printf(m, " %s %-44s %15lu\n", tu->filename, tu->call.name, tu->nhit); + return 0; +} + +static const struct seq_operations profile_seq_op = { + .start = probes_seq_start, + .next = probes_seq_next, + .stop = probes_seq_stop, + .show = probes_profile_seq_show +}; + +static int profile_open(struct inode *inode, struct file *file) +{ + return seq_open(file, &profile_seq_op); +} + +static const struct file_operations uprobe_profile_ops = { + .owner = THIS_MODULE, + .open = profile_open, + .read = seq_read, + .llseek = seq_lseek, + .release = seq_release, +}; + +/* uprobe handler */ +static void uprobe_trace_func(struct trace_uprobe *tu, struct pt_regs *regs) +{ + struct uprobe_trace_entry_head *entry; + struct ring_buffer_event *event; + struct ring_buffer *buffer; + u8 *data; + int size, i, pc; + unsigned long irq_flags; + struct ftrace_event_call *call = &tu->call; + + tu->nhit++; + + local_save_flags(irq_flags); + pc = preempt_count(); + + size = sizeof(*entry) + tu->size; + + event = trace_current_buffer_lock_reserve(&buffer, call->event.type, + size, irq_flags, pc); + if (!event) + return; + + entry = ring_buffer_event_data(event); + entry->ip = uprobe_get_swbp_addr(task_pt_regs(current)); + data = (u8 *)&entry[1]; + for (i = 0; i < tu->nr_args; i++) + call_fetch(&tu->args[i].fetch, regs, 
data + tu->args[i].offset); + + if (!filter_current_check_discard(buffer, call, entry, event)) + trace_buffer_unlock_commit(buffer, event, irq_flags, pc); +} + +/* Event entry printers */ +static enum print_line_t +print_uprobe_event(struct trace_iterator *iter, int flags, struct trace_event *event) +{ + struct uprobe_trace_entry_head *field; + struct trace_seq *s = &iter->seq; + struct trace_uprobe *tu; + u8 *data; + int i; + + field = (struct uprobe_trace_entry_head *)iter->ent; + tu = container_of(event, struct trace_uprobe, call.event); + + if (!trace_seq_printf(s, "%s: (", tu->call.name)) + goto partial; + + if (!seq_print_ip_sym(s, field->ip, flags | TRACE_ITER_SYM_OFFSET)) + goto partial; + + if (!trace_seq_puts(s, ")")) + goto partial; + + data = (u8 *)&field[1]; + for (i = 0; i < tu->nr_args; i++) { + if (!tu->args[i].type->print(s, tu->args[i].name, + data + tu->args[i].offset, field)) + goto partial; + } + + if (trace_seq_puts(s, "\n")) + return TRACE_TYPE_HANDLED; + +partial: + return TRACE_TYPE_PARTIAL_LINE; +} + +static int probe_event_enable(struct trace_uprobe *tu, int flag) +{ + struct uprobe_trace_consumer *utc; + int ret = 0; + + if (!tu->inode || tu->consumer) + return -EINTR; + + utc = kzalloc(sizeof(struct uprobe_trace_consumer), GFP_KERNEL); + if (!utc) + return -EINTR; + + utc->cons.handler = uprobe_dispatcher; + utc->cons.filter = NULL; + ret = uprobe_register(tu->inode, tu->offset, &utc->cons); + if (ret) { + kfree(utc); + return ret; + } + + tu->flags |= flag; + utc->tu = tu; + tu->consumer = utc; + + return 0; +} + +static void probe_event_disable(struct trace_uprobe *tu, int flag) +{ + if (!tu->inode || !tu->consumer) + return; + + uprobe_unregister(tu->inode, tu->offset, &tu->consumer->cons); + tu->flags &= ~flag; + kfree(tu->consumer); + tu->consumer = NULL; +} + +static int uprobe_event_define_fields(struct ftrace_event_call *event_call) +{ + int ret, i; + struct uprobe_trace_entry_head field; + struct trace_uprobe *tu = (struct trace_uprobe *)event_call->data; + + DEFINE_FIELD(unsigned long, ip, FIELD_STRING_IP, 0); + /* Set argument names as fields */ + for (i = 0; i < tu->nr_args; i++) { + ret = trace_define_field(event_call, tu->args[i].type->fmttype, + tu->args[i].name, + sizeof(field) + tu->args[i].offset, + tu->args[i].type->size, + tu->args[i].type->is_signed, + FILTER_OTHER); + + if (ret) + return ret; + } + return 0; +} + +#define LEN_OR_ZERO (len ? 
len - pos : 0) +static int __set_print_fmt(struct trace_uprobe *tu, char *buf, int len) +{ + const char *fmt, *arg; + int i; + int pos = 0; + + fmt = "(%lx)"; + arg = "REC->" FIELD_STRING_IP; + + /* When len=0, we just calculate the needed length */ + + pos += snprintf(buf + pos, LEN_OR_ZERO, "\"%s", fmt); + + for (i = 0; i < tu->nr_args; i++) { + pos += snprintf(buf + pos, LEN_OR_ZERO, " %s=%s", + tu->args[i].name, tu->args[i].type->fmt); + } + + pos += snprintf(buf + pos, LEN_OR_ZERO, "\", %s", arg); + + for (i = 0; i < tu->nr_args; i++) { + pos += snprintf(buf + pos, LEN_OR_ZERO, ", REC->%s", + tu->args[i].name); + } + + return pos; /* return the length of print_fmt */ +} +#undef LEN_OR_ZERO + +static int set_print_fmt(struct trace_uprobe *tu) +{ + char *print_fmt; + int len; + + /* First: called with 0 length to calculate the needed length */ + len = __set_print_fmt(tu, NULL, 0); + print_fmt = kmalloc(len + 1, GFP_KERNEL); + if (!print_fmt) + return -ENOMEM; + + /* Second: actually write the @print_fmt */ + __set_print_fmt(tu, print_fmt, len + 1); + tu->call.print_fmt = print_fmt; + + return 0; +} + +#ifdef CONFIG_PERF_EVENTS +/* uprobe profile handler */ +static void uprobe_perf_func(struct trace_uprobe *tu, struct pt_regs *regs) +{ + struct ftrace_event_call *call = &tu->call; + struct uprobe_trace_entry_head *entry; + struct hlist_head *head; + u8 *data; + int size, __size, i; + int rctx; + + __size = sizeof(*entry) + tu->size; + size = ALIGN(__size + sizeof(u32), sizeof(u64)); + size -= sizeof(u32); + if (WARN_ONCE(size > PERF_MAX_TRACE_SIZE, "profile buffer not large enough")) + return; + + preempt_disable(); + + entry = perf_trace_buf_prepare(size, call->event.type, regs, &rctx); + if (!entry) + goto out; + + entry->ip = uprobe_get_swbp_addr(task_pt_regs(current)); + data = (u8 *)&entry[1]; + for (i = 0; i < tu->nr_args; i++) + call_fetch(&tu->args[i].fetch, regs, data + tu->args[i].offset); + + head = this_cpu_ptr(call->perf_events); + perf_trace_buf_submit(entry, size, rctx, entry->ip, 1, regs, head); + + out: + preempt_enable(); +} +#endif /* CONFIG_PERF_EVENTS */ + +static +int trace_uprobe_register(struct ftrace_event_call *event, enum trace_reg type, void *data) +{ + struct trace_uprobe *tu = (struct trace_uprobe *)event->data; + + switch (type) { + case TRACE_REG_REGISTER: + return probe_event_enable(tu, TP_FLAG_TRACE); + + case TRACE_REG_UNREGISTER: + probe_event_disable(tu, TP_FLAG_TRACE); + return 0; + +#ifdef CONFIG_PERF_EVENTS + case TRACE_REG_PERF_REGISTER: + return probe_event_enable(tu, TP_FLAG_PROFILE); + + case TRACE_REG_PERF_UNREGISTER: + probe_event_disable(tu, TP_FLAG_PROFILE); + return 0; +#endif + default: + return 0; + } + return 0; +} + +static int uprobe_dispatcher(struct uprobe_consumer *con, struct pt_regs *regs) +{ + struct uprobe_trace_consumer *utc; + struct trace_uprobe *tu; + + utc = container_of(con, struct uprobe_trace_consumer, cons); + tu = utc->tu; + if (!tu || tu->consumer != utc) + return 0; + + if (tu->flags & TP_FLAG_TRACE) + uprobe_trace_func(tu, regs); + +#ifdef CONFIG_PERF_EVENTS + if (tu->flags & TP_FLAG_PROFILE) + uprobe_perf_func(tu, regs); +#endif + return 0; +} + +static struct trace_event_functions uprobe_funcs = { + .trace = print_uprobe_event +}; + +static int register_uprobe_event(struct trace_uprobe *tu) +{ + struct ftrace_event_call *call = &tu->call; + int ret; + + /* Initialize ftrace_event_call */ + INIT_LIST_HEAD(&call->class->fields); + call->event.funcs = &uprobe_funcs; + call->class->define_fields = 
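/*
 * Illustrative user-space sketch, not taken from the patch: set_print_fmt()
 * above uses the common two-pass snprintf() idiom, where the first pass runs
 * with a zero-sized buffer and only measures, and the second pass fills an
 * allocation of exactly that size. build_fmt() is a made-up stand-in for
 * __set_print_fmt(); BUF_OR_NULL is added so the measuring pass never forms
 * a pointer from NULL.
 */
#include <stdio.h>
#include <stdlib.h>

#define BUF_OR_NULL	(len ? buf + pos : NULL)
#define LEN_OR_ZERO	(len ? len - pos : 0)

static int build_fmt(char *buf, int len, int nr_args)
{
	int i, pos = 0;

	pos += snprintf(BUF_OR_NULL, LEN_OR_ZERO, "\"(%%lx)");
	for (i = 0; i < nr_args; i++)
		pos += snprintf(BUF_OR_NULL, LEN_OR_ZERO, " arg%d=%%lx", i + 1);
	pos += snprintf(BUF_OR_NULL, LEN_OR_ZERO, "\", REC->ip");
	for (i = 0; i < nr_args; i++)
		pos += snprintf(BUF_OR_NULL, LEN_OR_ZERO, ", REC->arg%d", i + 1);

	return pos;		/* length needed, excluding the trailing '\0' */
}

int main(void)
{
	int len = build_fmt(NULL, 0, 2);	/* first pass: measure only */
	char *fmt = malloc(len + 1);

	if (!fmt)
		return 1;
	build_fmt(fmt, len + 1, 2);		/* second pass: fill */
	printf("%s\n", fmt);
	free(fmt);
	return 0;
}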
uprobe_event_define_fields; + + if (set_print_fmt(tu) < 0) + return -ENOMEM; + + ret = register_ftrace_event(&call->event); + if (!ret) { + kfree(call->print_fmt); + return -ENODEV; + } + call->flags = 0; + call->class->reg = trace_uprobe_register; + call->data = tu; + ret = trace_add_event_call(call); + + if (ret) { + pr_info("Failed to register uprobe event: %s\n", call->name); + kfree(call->print_fmt); + unregister_ftrace_event(&call->event); + } + + return ret; +} + +static void unregister_uprobe_event(struct trace_uprobe *tu) +{ + /* tu->event is unregistered in trace_remove_event_call() */ + trace_remove_event_call(&tu->call); + kfree(tu->call.print_fmt); + tu->call.print_fmt = NULL; +} + +/* Make a trace interface for controling probe points */ +static __init int init_uprobe_trace(void) +{ + struct dentry *d_tracer; + + d_tracer = tracing_init_dentry(); + if (!d_tracer) + return 0; + + trace_create_file("uprobe_events", 0644, d_tracer, + NULL, &uprobe_events_ops); + /* Profile interface */ + trace_create_file("uprobe_profile", 0444, d_tracer, + NULL, &uprobe_profile_ops); + return 0; +} + +fs_initcall(init_uprobe_trace); -- cgit v1.2.3 From 489a71b029cd94e3b0132795146e8be3a87bf3fa Mon Sep 17 00:00:00 2001 From: Hiroshi Shimamoto Date: Mon, 2 Apr 2012 17:00:44 +0900 Subject: sched: Update documentation and comments Change sched_*.c to sched/*.c in documentation and comments. Signed-off-by: Hiroshi Shimamoto Cc: Peter Zijlstra Link: http://lkml.kernel.org/r/4F795CAC.9080206@ct.jp.nec.com Signed-off-by: Ingo Molnar --- kernel/sched/idle_task.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c index 91b4c957f289..b44d604b35d1 100644 --- a/kernel/sched/idle_task.c +++ b/kernel/sched/idle_task.c @@ -4,7 +4,7 @@ * idle-task scheduling class. * * (NOTE: these are not related to SCHED_IDLE tasks which are - * handled in sched_fair.c) + * handled in sched/fair.c) */ #ifdef CONFIG_SMP -- cgit v1.2.3 From 7ff9554bb578ba02166071d2d487b7fc7d860d62 Mon Sep 17 00:00:00 2001 From: Kay Sievers Date: Thu, 3 May 2012 02:29:13 +0200 Subject: printk: convert byte-buffer to variable-length record buffer - Record-based stream instead of the traditional byte stream buffer. All records carry a 64 bit timestamp, the syslog facility and priority in the record header. - Records consume almost the same amount, sometimes less memory than the traditional byte stream buffer (if printk_time is enabled). The record header is 16 bytes long, plus some padding bytes at the end if needed. The byte-stream buffer needed 3 chars for the syslog prefix, 15 char for the timestamp and a newline. - Buffer management is based on message sequence numbers. When records need to be discarded, the reading heads move on to the next full record. Unlike the byte-stream buffer, no old logged lines get truncated or partly overwritten by new ones. Sequence numbers also allow consumers of the log stream to get notified if any message in the stream they are about to read gets discarded during the time of reading. - Better buffered IO support for KERN_CONT continuation lines, when printk() is called multiple times for a single line. The use of KERN_CONT is now mandatory to use continuation; a few places in the kernel need trivial fixes here. The buffering could possibly be extended to per-cpu variables to allow better thread-safety for multiple printk() invocations for a single line. - Full-featured syslog facility value support. 
Different facilities can tag their messages. All userspace-injected messages enforce a facility value > 0 now, to be able to reliably distinguish them from the kernel-generated messages. Independent subsystems like a baseband processor running its own firmware, or a kernel-related userspace process can use their own unique facility values. Multiple independent log streams can co-exist that way in the same buffer. All share the same global sequence number counter to ensure proper ordering (and interleaving) and to allow the consumers of the log to reliably correlate the events from different facilities. Tested-by: William Douglas Signed-off-by: Kay Sievers Signed-off-by: Greg Kroah-Hartman --- kernel/printk.c | 1014 ++++++++++++++++++++++++++++++++----------------------- 1 file changed, 590 insertions(+), 424 deletions(-) (limited to 'kernel') diff --git a/kernel/printk.c b/kernel/printk.c index b663c2c95d39..74357329550f 100644 --- a/kernel/printk.c +++ b/kernel/printk.c @@ -54,8 +54,6 @@ void asmlinkage __attribute__((weak)) early_printk(const char *fmt, ...) { } -#define __LOG_BUF_LEN (1 << CONFIG_LOG_BUF_SHIFT) - /* printk's without a loglevel use this.. */ #define DEFAULT_MESSAGE_LOGLEVEL CONFIG_DEFAULT_MESSAGE_LOGLEVEL @@ -98,24 +96,6 @@ EXPORT_SYMBOL_GPL(console_drivers); */ static int console_locked, console_suspended; -/* - * logbuf_lock protects log_buf, log_start, log_end, con_start and logged_chars - * It is also used in interesting ways to provide interlocking in - * console_unlock();. - */ -static DEFINE_RAW_SPINLOCK(logbuf_lock); - -#define LOG_BUF_MASK (log_buf_len-1) -#define LOG_BUF(idx) (log_buf[(idx) & LOG_BUF_MASK]) - -/* - * The indices into log_buf are not constrained to log_buf_len - they - * must be masked before subscripting - */ -static unsigned log_start; /* Index into log_buf: next char to be read by syslog() */ -static unsigned con_start; /* Index into log_buf: next char to be sent to consoles */ -static unsigned log_end; /* Index into log_buf: most-recently-written-char + 1 */ - /* * If exclusive_console is non-NULL then only this console is to be printed to. */ @@ -146,12 +126,176 @@ EXPORT_SYMBOL(console_set_on_cmdline); static int console_may_schedule; #ifdef CONFIG_PRINTK +/* + * The printk log buffer consists of a chain of concatenated variable + * length records. Every record starts with a record header, containing + * the overall length of the record. + * + * The heads to the first and last entry in the buffer, as well as the + * sequence numbers of these both entries are maintained when messages + * are stored.. + * + * If the heads indicate available messages, the length in the header + * tells the start next message. A length == 0 for the next message + * indicates a wrap-around to the beginning of the buffer. + * + * Every record carries the monotonic timestamp in microseconds, as well as + * the standard userspace syslog level and syslog facility. The usual + * kernel messages use LOG_KERN; userspace-injected messages always carry + * a matching syslog facility, by default LOG_USER. The origin of every + * message can be reliably determined that way. + * + * The human readable log message directly follows the message header. The + * length of the message text is stored in the header, the stored message + * is not terminated. 
+ * + */ + +struct log { + u64 ts_nsec; /* timestamp in nanoseconds */ + u16 len; /* length of entire record */ + u16 text_len; /* length of text buffer */ + u16 dict_len; /* length of dictionary buffer */ + u16 level; /* syslog level + facility */ +}; + +/* + * The logbuf_lock protects kmsg buffer, indices, counters. It is also + * used in interesting ways to provide interlocking in console_unlock(); + */ +static DEFINE_RAW_SPINLOCK(logbuf_lock); +/* cpu currently holding logbuf_lock */ +static volatile unsigned int logbuf_cpu = UINT_MAX; + +#define LOG_LINE_MAX 1024 + +/* record buffer */ +#define __LOG_BUF_LEN (1 << CONFIG_LOG_BUF_SHIFT) static char __log_buf[__LOG_BUF_LEN]; static char *log_buf = __log_buf; -static int log_buf_len = __LOG_BUF_LEN; -static unsigned logged_chars; /* Number of chars produced since last read+clear operation */ -static int saved_console_loglevel = -1; +static u32 log_buf_len = __LOG_BUF_LEN; + +/* index and sequence number of the first record stored in the buffer */ +static u64 log_first_seq; +static u32 log_first_idx; + +/* index and sequence number of the next record to store in the buffer */ +static u64 log_next_seq; +static u32 log_next_idx; + +/* the next printk record to read after the last 'clear' command */ +static u64 clear_seq; +static u32 clear_idx; + +/* the next printk record to read by syslog(READ) or /proc/kmsg */ +static u64 syslog_seq; +static u32 syslog_idx; + +/* human readable text of the record */ +static char *log_text(const struct log *msg) +{ + return (char *)msg + sizeof(struct log); +} + +/* optional key/value pair dictionary attached to the record */ +static char *log_dict(const struct log *msg) +{ + return (char *)msg + sizeof(struct log) + msg->text_len; +} + +/* get record by index; idx must point to valid msg */ +static struct log *log_from_idx(u32 idx) +{ + struct log *msg = (struct log *)(log_buf + idx); + + /* + * A length == 0 record is the end of buffer marker. Wrap around and + * read the message at the start of the buffer. + */ + if (!msg->len) + return (struct log *)log_buf; + return msg; +} + +/* get next record; idx must point to valid msg */ +static u32 log_next(u32 idx) +{ + struct log *msg = (struct log *)(log_buf + idx); + + /* length == 0 indicates the end of the buffer; wrap */ + /* + * A length == 0 record is the end of buffer marker. Wrap around and + * read the message at the start of the buffer as *this* one, and + * return the one after that. 
+ */ + if (!msg->len) { + msg = (struct log *)log_buf; + return msg->len; + } + return idx + msg->len; +} + +#if !defined(CONFIG_64BIT) || defined(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) +#define LOG_ALIGN 4 +#else +#define LOG_ALIGN 8 +#endif + +/* insert record into the buffer, discard old ones, update heads */ +static void log_store(int facility, int level, + const char *dict, u16 dict_len, + const char *text, u16 text_len) +{ + struct log *msg; + u32 size, pad_len; + + /* number of '\0' padding bytes to next message */ + size = sizeof(struct log) + text_len + dict_len; + pad_len = (-size) & (LOG_ALIGN - 1); + size += pad_len; + + while (log_first_seq < log_next_seq) { + u32 free; + + if (log_next_idx > log_first_idx) + free = max(log_buf_len - log_next_idx, log_first_idx); + else + free = log_first_idx - log_next_idx; + + if (free > size + sizeof(struct log)) + break; + + /* drop old messages until we have enough contiuous space */ + log_first_idx = log_next(log_first_idx); + log_first_seq++; + } + + if (log_next_idx + size + sizeof(struct log) >= log_buf_len) { + /* + * This message + an additional empty header does not fit + * at the end of the buffer. Add an empty header with len == 0 + * to signify a wrap around. + */ + memset(log_buf + log_next_idx, 0, sizeof(struct log)); + log_next_idx = 0; + } + + /* fill message */ + msg = (struct log *)(log_buf + log_next_idx); + memcpy(log_text(msg), text, text_len); + msg->text_len = text_len; + memcpy(log_dict(msg), dict, dict_len); + msg->dict_len = dict_len; + msg->level = (facility << 3) | (level & 7); + msg->ts_nsec = local_clock(); + memset(log_dict(msg) + dict_len, 0, pad_len); + msg->len = sizeof(struct log) + text_len + dict_len + pad_len; + + /* insert message */ + log_next_idx += msg->len; + log_next_seq++; +} #ifdef CONFIG_KEXEC /* @@ -165,9 +309,9 @@ static int saved_console_loglevel = -1; void log_buf_kexec_setup(void) { VMCOREINFO_SYMBOL(log_buf); - VMCOREINFO_SYMBOL(log_end); VMCOREINFO_SYMBOL(log_buf_len); - VMCOREINFO_SYMBOL(logged_chars); + VMCOREINFO_SYMBOL(log_first_idx); + VMCOREINFO_SYMBOL(log_next_idx); } #endif @@ -191,7 +335,6 @@ early_param("log_buf_len", log_buf_len_setup); void __init setup_log_buf(int early) { unsigned long flags; - unsigned start, dest_idx, offset; char *new_log_buf; int free; @@ -219,20 +362,8 @@ void __init setup_log_buf(int early) log_buf_len = new_log_buf_len; log_buf = new_log_buf; new_log_buf_len = 0; - free = __LOG_BUF_LEN - log_end; - - offset = start = min(con_start, log_start); - dest_idx = 0; - while (start != log_end) { - unsigned log_idx_mask = start & (__LOG_BUF_LEN - 1); - - log_buf[dest_idx] = __log_buf[log_idx_mask]; - start++; - dest_idx++; - } - log_start -= offset; - con_start -= offset; - log_end -= offset; + free = __LOG_BUF_LEN - log_next_idx; + memcpy(log_buf, __log_buf, __LOG_BUF_LEN); raw_spin_unlock_irqrestore(&logbuf_lock, flags); pr_info("log_buf_len: %d\n", log_buf_len); @@ -332,11 +463,165 @@ static int check_syslog_permissions(int type, bool from_file) return 0; } +#if defined(CONFIG_PRINTK_TIME) +static bool printk_time = 1; +#else +static bool printk_time; +#endif +module_param_named(time, printk_time, bool, S_IRUGO | S_IWUSR); + +static int syslog_print_line(u32 idx, char *text, size_t size) +{ + struct log *msg; + size_t len; + + msg = log_from_idx(idx); + if (!text) { + /* calculate length only */ + len = 3; + + if (msg->level > 9) + len++; + if (msg->level > 99) + len++; + + if (printk_time) + len += 15; + + len += msg->text_len; + len++; + return len; + } 
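/*
 * Illustrative user-space check, not taken from the patch: two bits of
 * arithmetic in log_store() above are easy to verify in isolation. The
 * padding "(-size) & (LOG_ALIGN - 1)" rounds a record up to the next
 * LOG_ALIGN boundary, and "(facility << 3) | (level & 7)" packs the syslog
 * facility and level into one field. The sizes below are arbitrary and
 * LOG_ALIGN is taken as the 64-bit value of 8.
 */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define LOG_ALIGN 8

int main(void)
{
	uint32_t size = 16 + 11 + 23;		/* header + text + dict bytes */
	uint32_t pad_len = (-size) & (LOG_ALIGN - 1);

	assert((size + pad_len) % LOG_ALIGN == 0);
	printf("size=%u pad=%u total=%u\n",
	       (unsigned)size, (unsigned)pad_len, (unsigned)(size + pad_len));

	/* facility LOG_USER (1), level LOG_ERR (3) */
	uint16_t packed = (1 << 3) | (3 & 7);

	assert((packed >> 3) == 1 && (packed & 7) == 3);
	printf("packed level field = 0x%02x\n", (unsigned)packed);
	return 0;
}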
+ + len = sprintf(text, "<%u>", msg->level); + + if (printk_time) { + unsigned long long t = msg->ts_nsec; + unsigned long rem_ns = do_div(t, 1000000000); + + len += sprintf(text + len, "[%5lu.%06lu] ", + (unsigned long) t, rem_ns / 1000); + } + + if (len + msg->text_len > size) + return -EINVAL; + memcpy(text + len, log_text(msg), msg->text_len); + len += msg->text_len; + text[len++] = '\n'; + return len; +} + +static int syslog_print(char __user *buf, int size) +{ + char *text; + int len; + + text = kmalloc(LOG_LINE_MAX, GFP_KERNEL); + if (!text) + return -ENOMEM; + + raw_spin_lock_irq(&logbuf_lock); + if (syslog_seq < log_first_seq) { + /* messages are gone, move to first one */ + syslog_seq = log_first_seq; + syslog_idx = log_first_idx; + } + len = syslog_print_line(syslog_idx, text, LOG_LINE_MAX); + syslog_idx = log_next(syslog_idx); + syslog_seq++; + raw_spin_unlock_irq(&logbuf_lock); + + if (len > 0 && copy_to_user(buf, text, len)) + len = -EFAULT; + + kfree(text); + return len; +} + +static int syslog_print_all(char __user *buf, int size, bool clear) +{ + char *text; + int len = 0; + + text = kmalloc(LOG_LINE_MAX, GFP_KERNEL); + if (!text) + return -ENOMEM; + + raw_spin_lock_irq(&logbuf_lock); + if (buf) { + u64 next_seq; + u64 seq; + u32 idx; + + if (clear_seq < log_first_seq) { + /* messages are gone, move to first available one */ + clear_seq = log_first_seq; + clear_idx = log_first_idx; + } + + /* + * Find first record that fits, including all following records, + * into the user-provided buffer for this dump. + */ + seq = clear_seq; + idx = clear_idx; + while (seq < log_next_seq) { + len += syslog_print_line(idx, NULL, 0); + idx = log_next(idx); + seq++; + } + seq = clear_seq; + idx = clear_idx; + while (len > size && seq < log_next_seq) { + len -= syslog_print_line(idx, NULL, 0); + idx = log_next(idx); + seq++; + } + + /* last message in this dump */ + next_seq = log_next_seq; + + len = 0; + while (len >= 0 && seq < next_seq) { + int textlen; + + textlen = syslog_print_line(idx, text, LOG_LINE_MAX); + if (textlen < 0) { + len = textlen; + break; + } + idx = log_next(idx); + seq++; + + raw_spin_unlock_irq(&logbuf_lock); + if (copy_to_user(buf + len, text, textlen)) + len = -EFAULT; + else + len += textlen; + raw_spin_lock_irq(&logbuf_lock); + + if (seq < log_first_seq) { + /* messages are gone, move to next one */ + seq = log_first_seq; + idx = log_first_idx; + } + } + } + + if (clear) { + clear_seq = log_next_seq; + clear_idx = log_next_idx; + } + raw_spin_unlock_irq(&logbuf_lock); + + kfree(text); + return len; +} + int do_syslog(int type, char __user *buf, int len, bool from_file) { - unsigned i, j, limit, count; - int do_clear = 0; - char c; + bool clear = false; + static int saved_console_loglevel = -1; int error; error = check_syslog_permissions(type, from_file); @@ -364,28 +649,14 @@ int do_syslog(int type, char __user *buf, int len, bool from_file) goto out; } error = wait_event_interruptible(log_wait, - (log_start - log_end)); + syslog_seq != log_next_seq); if (error) goto out; - i = 0; - raw_spin_lock_irq(&logbuf_lock); - while (!error && (log_start != log_end) && i < len) { - c = LOG_BUF(log_start); - log_start++; - raw_spin_unlock_irq(&logbuf_lock); - error = __put_user(c,buf); - buf++; - i++; - cond_resched(); - raw_spin_lock_irq(&logbuf_lock); - } - raw_spin_unlock_irq(&logbuf_lock); - if (!error) - error = i; + error = syslog_print(buf, len); break; /* Read/clear last kernel messages */ case SYSLOG_ACTION_READ_CLEAR: - do_clear = 1; + clear = true; /* FALL THRU 
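/*
 * Illustrative user-space sketch, not taken from the patch:
 * syslog_print_all() above first sums the length of every record since the
 * last clear, then drops records from the front until the remainder fits
 * the user buffer, so the newest messages are the ones kept. The record
 * lengths and buffer size below are made up.
 */
#include <stdio.h>

int main(void)
{
	int rec_len[] = { 40, 120, 75, 300, 60 };	/* pretend record sizes */
	int nrec = sizeof(rec_len) / sizeof(rec_len[0]);
	int size = 400;					/* user buffer size */
	int first = 0, i, len = 0;

	for (i = 0; i < nrec; i++)			/* total length */
		len += rec_len[i];

	while (len > size && first < nrec)		/* drop oldest records */
		len -= rec_len[first++];

	printf("dump records %d..%d, %d bytes of a %d byte buffer\n",
	       first, nrec - 1, len, size);
	return 0;
}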
*/ /* Read last kernel messages */ case SYSLOG_ACTION_READ_ALL: @@ -399,52 +670,11 @@ int do_syslog(int type, char __user *buf, int len, bool from_file) error = -EFAULT; goto out; } - count = len; - if (count > log_buf_len) - count = log_buf_len; - raw_spin_lock_irq(&logbuf_lock); - if (count > logged_chars) - count = logged_chars; - if (do_clear) - logged_chars = 0; - limit = log_end; - /* - * __put_user() could sleep, and while we sleep - * printk() could overwrite the messages - * we try to copy to user space. Therefore - * the messages are copied in reverse. - */ - for (i = 0; i < count && !error; i++) { - j = limit-1-i; - if (j + log_buf_len < log_end) - break; - c = LOG_BUF(j); - raw_spin_unlock_irq(&logbuf_lock); - error = __put_user(c,&buf[count-1-i]); - cond_resched(); - raw_spin_lock_irq(&logbuf_lock); - } - raw_spin_unlock_irq(&logbuf_lock); - if (error) - break; - error = i; - if (i != count) { - int offset = count-error; - /* buffer overflow during copy, correct user buffer. */ - for (i = 0; i < error; i++) { - if (__get_user(c,&buf[i+offset]) || - __put_user(c,&buf[i])) { - error = -EFAULT; - break; - } - cond_resched(); - } - } + error = syslog_print_all(buf, len, clear); break; /* Clear ring buffer */ case SYSLOG_ACTION_CLEAR: - logged_chars = 0; - break; + syslog_print_all(NULL, 0, true); /* Disable logging to console */ case SYSLOG_ACTION_CONSOLE_OFF: if (saved_console_loglevel == -1) @@ -472,7 +702,33 @@ int do_syslog(int type, char __user *buf, int len, bool from_file) break; /* Number of chars in the log buffer */ case SYSLOG_ACTION_SIZE_UNREAD: - error = log_end - log_start; + raw_spin_lock_irq(&logbuf_lock); + if (syslog_seq < log_first_seq) { + /* messages are gone, move to first one */ + syslog_seq = log_first_seq; + syslog_idx = log_first_idx; + } + if (from_file) { + /* + * Short-cut for poll(/"proc/kmsg") which simply checks + * for pending data, not the size; return the count of + * records, not the length. + */ + error = log_next_idx - syslog_idx; + } else { + u64 seq; + u32 idx; + + error = 0; + seq = syslog_seq; + idx = syslog_idx; + while (seq < log_next_seq) { + error += syslog_print_line(idx, NULL, 0); + idx = log_next(idx); + seq++; + } + } + raw_spin_unlock_irq(&logbuf_lock); break; /* Size of the log buffer */ case SYSLOG_ACTION_SIZE_BUFFER: @@ -501,29 +757,11 @@ void kdb_syslog_data(char *syslog_data[4]) { syslog_data[0] = log_buf; syslog_data[1] = log_buf + log_buf_len; - syslog_data[2] = log_buf + log_end - - (logged_chars < log_buf_len ? 
logged_chars : log_buf_len); - syslog_data[3] = log_buf + log_end; + syslog_data[2] = log_buf + log_first_idx; + syslog_data[3] = log_buf + log_next_idx; } #endif /* CONFIG_KGDB_KDB */ -/* - * Call the console drivers on a range of log_buf - */ -static void __call_console_drivers(unsigned start, unsigned end) -{ - struct console *con; - - for_each_console(con) { - if (exclusive_console && con != exclusive_console) - continue; - if ((con->flags & CON_ENABLED) && con->write && - (cpu_online(smp_processor_id()) || - (con->flags & CON_ANYTIME))) - con->write(con, &LOG_BUF(start), end - start); - } -} - static bool __read_mostly ignore_loglevel; static int __init ignore_loglevel_setup(char *str) @@ -539,143 +777,34 @@ module_param(ignore_loglevel, bool, S_IRUGO | S_IWUSR); MODULE_PARM_DESC(ignore_loglevel, "ignore loglevel setting, to" "print all kernel messages to the console."); -/* - * Write out chars from start to end - 1 inclusive - */ -static void _call_console_drivers(unsigned start, - unsigned end, int msg_log_level) -{ - trace_console(&LOG_BUF(0), start, end, log_buf_len); - - if ((msg_log_level < console_loglevel || ignore_loglevel) && - console_drivers && start != end) { - if ((start & LOG_BUF_MASK) > (end & LOG_BUF_MASK)) { - /* wrapped write */ - __call_console_drivers(start & LOG_BUF_MASK, - log_buf_len); - __call_console_drivers(0, end & LOG_BUF_MASK); - } else { - __call_console_drivers(start, end); - } - } -} - -/* - * Parse the syslog header <[0-9]*>. The decimal value represents 32bit, the - * lower 3 bit are the log level, the rest are the log facility. In case - * userspace passes usual userspace syslog messages to /dev/kmsg or - * /dev/ttyprintk, the log prefix might contain the facility. Printk needs - * to extract the correct log level for in-kernel processing, and not mangle - * the original value. - * - * If a prefix is found, the length of the prefix is returned. If 'level' is - * passed, it will be filled in with the log level without a possible facility - * value. If 'special' is passed, the special printk prefix chars are accepted - * and returned. If no valid header is found, 0 is returned and the passed - * variables are not touched. - */ -static size_t log_prefix(const char *p, unsigned int *level, char *special) -{ - unsigned int lev = 0; - char sp = '\0'; - size_t len; - - if (p[0] != '<' || !p[1]) - return 0; - if (p[2] == '>') { - /* usual single digit level number or special char */ - switch (p[1]) { - case '0' ... '7': - lev = p[1] - '0'; - break; - case 'c': /* KERN_CONT */ - case 'd': /* KERN_DEFAULT */ - sp = p[1]; - break; - default: - return 0; - } - len = 3; - } else { - /* multi digit including the level and facility number */ - char *endp = NULL; - - lev = (simple_strtoul(&p[1], &endp, 10) & 7); - if (endp == NULL || endp[0] != '>') - return 0; - len = (endp + 1) - p; - } - - /* do not accept special char if not asked for */ - if (sp && !special) - return 0; - - if (special) { - *special = sp; - /* return special char, do not touch level */ - if (sp) - return len; - } - - if (level) - *level = lev; - return len; -} - /* * Call the console drivers, asking them to write out * log_buf[start] to log_buf[end - 1]. * The console_lock must be held. 
*/ -static void call_console_drivers(unsigned start, unsigned end) +static void call_console_drivers(int level, const char *text, size_t len) { - unsigned cur_index, start_print; - static int msg_level = -1; - - BUG_ON(((int)(start - end)) > 0); - - cur_index = start; - start_print = start; - while (cur_index != end) { - if (msg_level < 0 && ((end - cur_index) > 2)) { - /* strip log prefix */ - cur_index += log_prefix(&LOG_BUF(cur_index), &msg_level, NULL); - start_print = cur_index; - } - while (cur_index != end) { - char c = LOG_BUF(cur_index); - - cur_index++; - if (c == '\n') { - if (msg_level < 0) { - /* - * printk() has already given us loglevel tags in - * the buffer. This code is here in case the - * log buffer has wrapped right round and scribbled - * on those tags - */ - msg_level = default_message_loglevel; - } - _call_console_drivers(start_print, cur_index, msg_level); - msg_level = -1; - start_print = cur_index; - break; - } - } - } - _call_console_drivers(start_print, end, msg_level); -} + struct console *con; -static void emit_log_char(char c) -{ - LOG_BUF(log_end) = c; - log_end++; - if (log_end - log_start > log_buf_len) - log_start = log_end - log_buf_len; - if (log_end - con_start > log_buf_len) - con_start = log_end - log_buf_len; - if (logged_chars < log_buf_len) - logged_chars++; + trace_console(text, 0, len, len); + + if (level >= console_loglevel && !ignore_loglevel) + return; + if (!console_drivers) + return; + + for_each_console(con) { + if (exclusive_console && con != exclusive_console) + continue; + if (!(con->flags & CON_ENABLED)) + continue; + if (!con->write) + continue; + if (!cpu_online(smp_processor_id()) && + !(con->flags & CON_ANYTIME)) + continue; + con->write(con, text, len); + } } /* @@ -700,16 +829,6 @@ static void zap_locks(void) sema_init(&console_sem, 1); } -#if defined(CONFIG_PRINTK_TIME) -static bool printk_time = 1; -#else -static bool printk_time = 0; -#endif -module_param_named(time, printk_time, bool, S_IRUGO | S_IWUSR); - -static bool always_kmsg_dump; -module_param_named(always_kmsg_dump, always_kmsg_dump, bool, S_IRUGO | S_IWUSR); - /* Check if we have any console registered that can be called early in boot. */ static int have_callable_console(void) { @@ -722,51 +841,6 @@ static int have_callable_console(void) return 0; } -/** - * printk - print a kernel message - * @fmt: format string - * - * This is printk(). It can be called from any context. We want it to work. - * - * We try to grab the console_lock. If we succeed, it's easy - we log the output and - * call the console drivers. If we fail to get the semaphore we place the output - * into the log buffer and return. The current holder of the console_sem will - * notice the new output in console_unlock(); and will send it to the - * consoles before releasing the lock. - * - * One effect of this deferred printing is that code which calls printk() and - * then changes console_loglevel may break. This is because console_loglevel - * is inspected when the actual printing occurs. - * - * See also: - * printf(3) - * - * See the vsnprintf() documentation for format string extensions over C99. - */ - -asmlinkage int printk(const char *fmt, ...) 
-{ - va_list args; - int r; - -#ifdef CONFIG_KGDB_KDB - if (unlikely(kdb_trap_printk)) { - va_start(args, fmt); - r = vkdb_printf(fmt, args); - va_end(args); - return r; - } -#endif - va_start(args, fmt); - r = vprintk(fmt, args); - va_end(args); - - return r; -} - -/* cpu currently holding logbuf_lock */ -static volatile unsigned int printk_cpu = UINT_MAX; - /* * Can we actually use the console at this time on this cpu? * @@ -810,17 +884,12 @@ static int console_trylock_for_printk(unsigned int cpu) retval = 0; } } - printk_cpu = UINT_MAX; + logbuf_cpu = UINT_MAX; if (wake) up(&console_sem); raw_spin_unlock(&logbuf_lock); return retval; } -static const char recursion_bug_msg [] = - KERN_CRIT "BUG: recent printk recursion!\n"; -static int recursion_bug; -static int new_text_line = 1; -static char printk_buf[1024]; int printk_delay_msec __read_mostly; @@ -836,15 +905,22 @@ static inline void printk_delay(void) } } -asmlinkage int vprintk(const char *fmt, va_list args) +asmlinkage int vprintk_emit(int facility, int level, + const char *dict, size_t dictlen, + const char *fmt, va_list args) { - int printed_len = 0; - int current_log_level = default_message_loglevel; + static int recursion_bug; + static char buf[LOG_LINE_MAX]; + static size_t buflen; + static int buflevel; + static char textbuf[LOG_LINE_MAX]; + char *text = textbuf; + size_t textlen; unsigned long flags; int this_cpu; - char *p; - size_t plen; - char special; + bool newline = false; + bool cont = false; + int printed_len = 0; boot_delay_msec(); printk_delay(); @@ -856,7 +932,7 @@ asmlinkage int vprintk(const char *fmt, va_list args) /* * Ouch, printk recursed into itself! */ - if (unlikely(printk_cpu == this_cpu)) { + if (unlikely(logbuf_cpu == this_cpu)) { /* * If a crash is occurring during printk() on this CPU, * then try to get the crash message out but make sure @@ -873,97 +949,92 @@ asmlinkage int vprintk(const char *fmt, va_list args) lockdep_off(); raw_spin_lock(&logbuf_lock); - printk_cpu = this_cpu; + logbuf_cpu = this_cpu; if (recursion_bug) { + static const char recursion_msg[] = + "BUG: recent printk recursion!"; + recursion_bug = 0; - strcpy(printk_buf, recursion_bug_msg); - printed_len = strlen(recursion_bug_msg); + printed_len += strlen(recursion_msg); + /* emit KERN_CRIT message */ + log_store(0, 2, NULL, 0, recursion_msg, printed_len); } - /* Emit the output into the temporary buffer */ - printed_len += vscnprintf(printk_buf + printed_len, - sizeof(printk_buf) - printed_len, fmt, args); - p = printk_buf; + /* + * The printf needs to come first; we need the syslog + * prefix which might be passed-in as a parameter. + */ + textlen = vscnprintf(text, sizeof(textbuf), fmt, args); - /* Read log level and handle special printk prefix */ - plen = log_prefix(p, ¤t_log_level, &special); - if (plen) { - p += plen; + /* mark and strip a trailing newline */ + if (textlen && text[textlen-1] == '\n') { + textlen--; + newline = true; + } - switch (special) { - case 'c': /* Strip KERN_CONT, continue line */ - plen = 0; + /* strip syslog prefix and extract log level or flags */ + if (text[0] == '<' && text[1] && text[2] == '>') { + switch (text[1]) { + case '0' ... 
'7': + if (level == -1) + level = text[1] - '0'; + text += 3; + textlen -= 3; + break; + case 'c': /* KERN_CONT */ + cont = true; + case 'd': /* KERN_DEFAULT */ + text += 3; + textlen -= 3; break; - case 'd': /* Strip KERN_DEFAULT, start new line */ - plen = 0; - default: - if (!new_text_line) { - emit_log_char('\n'); - new_text_line = 1; - } } } - /* - * Copy the output into log_buf. If the caller didn't provide - * the appropriate log prefix, we insert them here - */ - for (; *p; p++) { - if (new_text_line) { - new_text_line = 0; - - if (plen) { - /* Copy original log prefix */ - int i; - - for (i = 0; i < plen; i++) - emit_log_char(printk_buf[i]); - printed_len += plen; - } else { - /* Add log prefix */ - emit_log_char('<'); - emit_log_char(current_log_level + '0'); - emit_log_char('>'); - printed_len += 3; - } + if (buflen && (!cont || dict)) { + /* no continuation; flush existing buffer */ + log_store(facility, buflevel, NULL, 0, buf, buflen); + printed_len += buflen; + buflen = 0; + } - if (printk_time) { - /* Add the current time stamp */ - char tbuf[50], *tp; - unsigned tlen; - unsigned long long t; - unsigned long nanosec_rem; - - t = cpu_clock(printk_cpu); - nanosec_rem = do_div(t, 1000000000); - tlen = sprintf(tbuf, "[%5lu.%06lu] ", - (unsigned long) t, - nanosec_rem / 1000); - - for (tp = tbuf; tp < tbuf + tlen; tp++) - emit_log_char(*tp); - printed_len += tlen; - } + if (buflen == 0) { + /* remember level for first message in the buffer */ + if (level == -1) + buflevel = default_message_loglevel; + else + buflevel = level; + } - if (!*p) - break; - } + if (buflen || !newline) { + /* append to existing buffer, or buffer until next message */ + if (buflen + textlen > sizeof(buf)) + textlen = sizeof(buf) - buflen; + memcpy(buf + buflen, text, textlen); + buflen += textlen; + } - emit_log_char(*p); - if (*p == '\n') - new_text_line = 1; + if (newline) { + /* end of line; flush buffer */ + if (buflen) { + log_store(facility, buflevel, + dict, dictlen, buf, buflen); + printed_len += buflen; + buflen = 0; + } else { + log_store(facility, buflevel, + dict, dictlen, text, textlen); + printed_len += textlen; + } } /* - * Try to acquire and then immediately release the - * console semaphore. The release will do all the - * actual magic (print out buffers, wake up klogd, - * etc). + * Try to acquire and then immediately release the console semaphore. + * The release will print out buffers and wake up /dev/kmsg and syslog() + * users. * - * The console_trylock_for_printk() function - * will release 'logbuf_lock' regardless of whether it - * actually gets the semaphore or not. + * The console_trylock_for_printk() function will release 'logbuf_lock' + * regardless of whether it actually gets the console semaphore or not. */ if (console_trylock_for_printk(this_cpu)) console_unlock(); @@ -974,12 +1045,73 @@ out_restore_irqs: return printed_len; } -EXPORT_SYMBOL(printk); +EXPORT_SYMBOL(vprintk_emit); + +asmlinkage int vprintk(const char *fmt, va_list args) +{ + return vprintk_emit(0, -1, NULL, 0, fmt, args); +} EXPORT_SYMBOL(vprintk); +asmlinkage int printk_emit(int facility, int level, + const char *dict, size_t dictlen, + const char *fmt, ...) +{ + va_list args; + int r; + + va_start(args, fmt); + r = vprintk_emit(facility, level, dict, dictlen, fmt, args); + va_end(args); + + return r; +} +EXPORT_SYMBOL(printk_emit); + +/** + * printk - print a kernel message + * @fmt: format string + * + * This is printk(). It can be called from any context. We want it to work. 
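/*
 * Illustrative user-space sketch, not taken from the patch: vprintk_emit()
 * above strips a leading "<X>" prefix, where a digit '0'-'7' selects the
 * level, 'c' marks a KERN_CONT continuation and 'd' a KERN_DEFAULT line,
 * and anything else is left untouched. parse_prefix() and "struct prefix"
 * are made-up names for this sketch.
 */
#include <stdbool.h>
#include <stdio.h>

struct prefix {
	int level;		/* -1 if the prefix carries no level */
	bool cont;		/* KERN_CONT continuation */
	int skip;		/* prefix bytes to drop from the text */
};

static struct prefix parse_prefix(const char *text)
{
	struct prefix p = { .level = -1, .cont = false, .skip = 0 };

	if (text[0] == '<' && text[1] && text[2] == '>') {
		if (text[1] >= '0' && text[1] <= '7')
			p.level = text[1] - '0';
		else if (text[1] == 'c')
			p.cont = true;
		else if (text[1] != 'd')
			return p;	/* unknown prefix: leave untouched */
		p.skip = 3;
	}
	return p;
}

int main(void)
{
	const char *samples[] = { "<3>oops", "<c> continued", "<d>default", "plain" };
	unsigned int i;

	for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
		struct prefix p = parse_prefix(samples[i]);

		printf("%-14s level=%2d cont=%d text=\"%s\"\n",
		       samples[i], p.level, p.cont, samples[i] + p.skip);
	}
	return 0;
}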
+ * + * We try to grab the console_lock. If we succeed, it's easy - we log the + * output and call the console drivers. If we fail to get the semaphore, we + * place the output into the log buffer and return. The current holder of + * the console_sem will notice the new output in console_unlock(); and will + * send it to the consoles before releasing the lock. + * + * One effect of this deferred printing is that code which calls printk() and + * then changes console_loglevel may break. This is because console_loglevel + * is inspected when the actual printing occurs. + * + * See also: + * printf(3) + * + * See the vsnprintf() documentation for format string extensions over C99. + */ +asmlinkage int printk(const char *fmt, ...) +{ + va_list args; + int r; + +#ifdef CONFIG_KGDB_KDB + if (unlikely(kdb_trap_printk)) { + va_start(args, fmt); + r = vkdb_printf(fmt, args); + va_end(args); + return r; + } +#endif + va_start(args, fmt); + r = vprintk_emit(0, -1, NULL, 0, fmt, args); + va_end(args); + + return r; +} +EXPORT_SYMBOL(printk); #else -static void call_console_drivers(unsigned start, unsigned end) +static void call_console_drivers(int level, const char *text, size_t len) { } @@ -1217,7 +1349,7 @@ int is_console_locked(void) } /* - * Delayed printk facility, for scheduler-internal messages: + * Delayed printk version, for scheduler-internal messages: */ #define PRINTK_BUF_SIZE 512 @@ -1253,6 +1385,10 @@ void wake_up_klogd(void) this_cpu_or(printk_pending, PRINTK_PENDING_WAKEUP); } +/* the next printk record to write to the console */ +static u64 console_seq; +static u32 console_idx; + /** * console_unlock - unlock the console system * @@ -1263,15 +1399,16 @@ void wake_up_klogd(void) * by printk(). If this is the case, console_unlock(); emits * the output prior to releasing the lock. * - * If there is output waiting for klogd, we wake it up. + * If there is output waiting, we wake it /dev/kmsg and syslog() users. * * console_unlock(); may be called from any context. */ void console_unlock(void) { + static u64 seen_seq; unsigned long flags; - unsigned _con_start, _log_end; - unsigned wake_klogd = 0, retry = 0; + bool wake_klogd = false; + bool retry; if (console_suspended) { up(&console_sem); @@ -1281,17 +1418,41 @@ void console_unlock(void) console_may_schedule = 0; again: - for ( ; ; ) { + for (;;) { + struct log *msg; + static char text[LOG_LINE_MAX]; + size_t len; + int level; + raw_spin_lock_irqsave(&logbuf_lock, flags); - wake_klogd |= log_start - log_end; - if (con_start == log_end) - break; /* Nothing to print */ - _con_start = con_start; - _log_end = log_end; - con_start = log_end; /* Flush */ + if (seen_seq != log_next_seq) { + wake_klogd = true; + seen_seq = log_next_seq; + } + + if (console_seq < log_first_seq) { + /* messages are gone, move to first one */ + console_seq = log_first_seq; + console_idx = log_first_idx; + } + + if (console_seq == log_next_seq) + break; + + msg = log_from_idx(console_idx); + level = msg->level & 7; + len = msg->text_len; + if (len+1 >= sizeof(text)) + len = sizeof(text)-1; + memcpy(text, log_text(msg), len); + text[len++] = '\n'; + + console_idx = log_next(console_idx); + console_seq++; raw_spin_unlock(&logbuf_lock); + stop_critical_timings(); /* don't trace print latency */ - call_console_drivers(_con_start, _log_end); + call_console_drivers(level, text, len); start_critical_timings(); local_irq_restore(flags); } @@ -1312,8 +1473,7 @@ again: * flush, no worries. 
*/ raw_spin_lock(&logbuf_lock); - if (con_start != log_end) - retry = 1; + retry = console_seq != log_next_seq; raw_spin_unlock_irqrestore(&logbuf_lock, flags); if (retry && console_trylock()) @@ -1549,7 +1709,8 @@ void register_console(struct console *newcon) * for us. */ raw_spin_lock_irqsave(&logbuf_lock, flags); - con_start = log_start; + console_seq = syslog_seq; + console_idx = syslog_idx; raw_spin_unlock_irqrestore(&logbuf_lock, flags); /* * We're about to replay the log buffer. Only do this to the @@ -1758,6 +1919,9 @@ int kmsg_dump_unregister(struct kmsg_dumper *dumper) } EXPORT_SYMBOL_GPL(kmsg_dump_unregister); +static bool always_kmsg_dump; +module_param_named(always_kmsg_dump, always_kmsg_dump, bool, S_IRUGO | S_IWUSR); + /** * kmsg_dump - dump kernel log to kernel message dumpers. * @reason: the reason (oops, panic etc) for dumping @@ -1767,8 +1931,7 @@ EXPORT_SYMBOL_GPL(kmsg_dump_unregister); */ void kmsg_dump(enum kmsg_dump_reason reason) { - unsigned long end; - unsigned chars; + u64 idx; struct kmsg_dumper *dumper; const char *s1, *s2; unsigned long l1, l2; @@ -1780,24 +1943,27 @@ void kmsg_dump(enum kmsg_dump_reason reason) /* Theoretically, the log could move on after we do this, but there's not a lot we can do about that. The new messages will overwrite the start of what we dump. */ + raw_spin_lock_irqsave(&logbuf_lock, flags); - end = log_end & LOG_BUF_MASK; - chars = logged_chars; - raw_spin_unlock_irqrestore(&logbuf_lock, flags); + if (syslog_seq < log_first_seq) + idx = syslog_idx; + else + idx = log_first_idx; - if (chars > end) { - s1 = log_buf + log_buf_len - chars + end; - l1 = chars - end; + if (idx > log_next_idx) { + s1 = log_buf; + l1 = log_next_idx; - s2 = log_buf; - l2 = end; + s2 = log_buf + idx; + l2 = log_buf_len - idx; } else { s1 = ""; l1 = 0; - s2 = log_buf + end - chars; - l2 = chars; + s2 = log_buf + idx; + l2 = log_next_idx - idx; } + raw_spin_unlock_irqrestore(&logbuf_lock, flags); rcu_read_lock(); list_for_each_entry_rcu(dumper, &dump_list, list) -- cgit v1.2.3 From e11fea92e13fb91c50bacca799a6131c81929986 Mon Sep 17 00:00:00 2001 From: Kay Sievers Date: Thu, 3 May 2012 02:29:41 +0200 Subject: kmsg: export printk records to the /dev/kmsg interface Support for multiple concurrent readers of /dev/kmsg, with read(), seek(), poll() support. Output of message sequence numbers, to allow userspace log consumers to reliably reconnect and reconstruct their state at any given time. After open("/dev/kmsg"), read() always returns *all* buffered records. If only future messages should be read, SEEK_END can be used. In case records get overwritten while /dev/kmsg is held open, or records get faster overwritten than they are read, the next read() will return -EPIPE and the current reading position gets updated to the next available record. The passed sequence numbers allow the log consumer to calculate the amount of lost messages. [root@mop ~]# cat /dev/kmsg 5,0,0;Linux version 3.4.0-rc1+ (kay@mop) (gcc version 4.7.0 20120315 ... 
6,159,423091;ACPI: PCI Root Bridge [PCI0] (domain 0000 [bus 00-ff]) 7,160,424069;pci_root PNP0A03:00: host bridge window [io 0x0000-0x0cf7] (ignored) SUBSYSTEM=acpi DEVICE=+acpi:PNP0A03:00 6,339,5140900;NET: Registered protocol family 10 30,340,5690716;udevd[80]: starting version 181 6,341,6081421;FDC 0 is a S82078B 6,345,6154686;microcode: CPU0 sig=0x623, pf=0x0, revision=0x0 7,346,6156968;sr 1:0:0:0: Attached scsi CD-ROM sr0 SUBSYSTEM=scsi DEVICE=+scsi:1:0:0:0 6,347,6289375;microcode: CPU1 sig=0x623, pf=0x0, revision=0x0 Cc: Karel Zak Tested-by: William Douglas Signed-off-by: Kay Sievers Signed-off-by: Greg Kroah-Hartman --- kernel/printk.c | 313 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 313 insertions(+) (limited to 'kernel') diff --git a/kernel/printk.c b/kernel/printk.c index 74357329550f..1ccc6d986cb3 100644 --- a/kernel/printk.c +++ b/kernel/printk.c @@ -41,6 +41,7 @@ #include #include #include +#include #include @@ -149,6 +150,48 @@ static int console_may_schedule; * length of the message text is stored in the header, the stored message * is not terminated. * + * Optionally, a message can carry a dictionary of properties (key/value pairs), + * to provide userspace with a machine-readable message context. + * + * Examples for well-defined, commonly used property names are: + * DEVICE=b12:8 device identifier + * b12:8 block dev_t + * c127:3 char dev_t + * n8 netdev ifindex + * +sound:card0 subsystem:devname + * SUBSYSTEM=pci driver-core subsystem name + * + * Valid characters in property names are [a-zA-Z0-9.-_]. The plain text value + * follows directly after a '=' character. Every property is terminated by + * a '\0' character. The last property is not terminated. + * + * Example of a message structure: + * 0000 ff 8f 00 00 00 00 00 00 monotonic time in nsec + * 0008 34 00 record is 52 bytes long + * 000a 0b 00 text is 11 bytes long + * 000c 1f 00 dictionary is 23 bytes long + * 000e 03 00 LOG_KERN (facility) LOG_ERR (level) + * 0010 69 74 27 73 20 61 20 6c "it's a l" + * 69 6e 65 "ine" + * 001b 44 45 56 49 43 "DEVIC" + * 45 3d 62 38 3a 32 00 44 "E=b8:2\0D" + * 52 49 56 45 52 3d 62 75 "RIVER=bu" + * 67 "g" + * 0032 00 00 00 padding to next message header + * + * The 'struct log' buffer header must never be directly exported to + * userspace, it is a kernel-private implementation detail that might + * need to be changed in the future, when the requirements change. + * + * /dev/kmsg exports the structured data in the following line format: + * "level,sequnum,timestamp;\n" + * + * The optional key/value pairs are attached as continuation lines starting + * with a space character and terminated by a newline. All possible + * non-prinatable characters are escaped in the "\xff" notation. + * + * Users of the export format should ignore possible additional values + * separated by ',', and find the message after the ';' character. 
*/ struct log { @@ -297,6 +340,276 @@ static void log_store(int facility, int level, log_next_seq++; } +/* /dev/kmsg - userspace message inject/listen interface */ +struct devkmsg_user { + u64 seq; + u32 idx; + struct mutex lock; + char buf[8192]; +}; + +static ssize_t devkmsg_writev(struct kiocb *iocb, const struct iovec *iv, + unsigned long count, loff_t pos) +{ + char *buf, *line; + int i; + int level = default_message_loglevel; + int facility = 1; /* LOG_USER */ + size_t len = iov_length(iv, count); + ssize_t ret = len; + + if (len > LOG_LINE_MAX) + return -EINVAL; + buf = kmalloc(len+1, GFP_KERNEL); + if (buf == NULL) + return -ENOMEM; + + line = buf; + for (i = 0; i < count; i++) { + if (copy_from_user(line, iv[i].iov_base, iv[i].iov_len)) + goto out; + line += iv[i].iov_len; + } + + /* + * Extract and skip the syslog prefix <[0-9]*>. Coming from userspace + * the decimal value represents 32bit, the lower 3 bit are the log + * level, the rest are the log facility. + * + * If no prefix or no userspace facility is specified, we + * enforce LOG_USER, to be able to reliably distinguish + * kernel-generated messages from userspace-injected ones. + */ + line = buf; + if (line[0] == '<') { + char *endp = NULL; + + i = simple_strtoul(line+1, &endp, 10); + if (endp && endp[0] == '>') { + level = i & 7; + if (i >> 3) + facility = i >> 3; + endp++; + len -= endp - line; + line = endp; + } + } + line[len] = '\0'; + + printk_emit(facility, level, NULL, 0, "%s", line); +out: + kfree(buf); + return ret; +} + +static ssize_t devkmsg_read(struct file *file, char __user *buf, + size_t count, loff_t *ppos) +{ + struct devkmsg_user *user = file->private_data; + struct log *msg; + size_t i; + size_t len; + ssize_t ret; + + if (!user) + return -EBADF; + + mutex_lock(&user->lock); + raw_spin_lock(&logbuf_lock); + while (user->seq == log_next_seq) { + if (file->f_flags & O_NONBLOCK) { + ret = -EAGAIN; + raw_spin_unlock(&logbuf_lock); + goto out; + } + + raw_spin_unlock(&logbuf_lock); + ret = wait_event_interruptible(log_wait, + user->seq != log_next_seq); + if (ret) + goto out; + raw_spin_lock(&logbuf_lock); + } + + if (user->seq < log_first_seq) { + /* our last seen message is gone, return error and reset */ + user->idx = log_first_idx; + user->seq = log_first_seq; + ret = -EPIPE; + raw_spin_unlock(&logbuf_lock); + goto out; + } + + msg = log_from_idx(user->idx); + len = sprintf(user->buf, "%u,%llu,%llu;", + msg->level, user->seq, msg->ts_nsec / 1000); + + /* escape non-printable characters */ + for (i = 0; i < msg->text_len; i++) { + char c = log_text(msg)[i]; + + if (c < ' ' || c >= 128) + len += sprintf(user->buf + len, "\\x%02x", c); + else + user->buf[len++] = c; + } + user->buf[len++] = '\n'; + + if (msg->dict_len) { + bool line = true; + + for (i = 0; i < msg->dict_len; i++) { + char c = log_dict(msg)[i]; + + if (line) { + user->buf[len++] = ' '; + line = false; + } + + if (c == '\0') { + user->buf[len++] = '\n'; + line = true; + continue; + } + + if (c < ' ' || c >= 128) { + len += sprintf(user->buf + len, "\\x%02x", c); + continue; + } + + user->buf[len++] = c; + } + user->buf[len++] = '\n'; + } + + user->idx = log_next(user->idx); + user->seq++; + raw_spin_unlock(&logbuf_lock); + + if (len > count) { + ret = -EINVAL; + goto out; + } + + if (copy_to_user(buf, user->buf, len)) { + ret = -EFAULT; + goto out; + } + ret = len; +out: + mutex_unlock(&user->lock); + return ret; +} + +static loff_t devkmsg_llseek(struct file *file, loff_t offset, int whence) +{ + struct devkmsg_user *user = 
file->private_data; + loff_t ret = 0; + + if (!user) + return -EBADF; + if (offset) + return -ESPIPE; + + raw_spin_lock(&logbuf_lock); + switch (whence) { + case SEEK_SET: + /* the first record */ + user->idx = log_first_idx; + user->seq = log_first_seq; + break; + case SEEK_DATA: + /* + * The first record after the last SYSLOG_ACTION_CLEAR, + * like issued by 'dmesg -c'. Reading /dev/kmsg itself + * changes no global state, and does not clear anything. + */ + user->idx = clear_idx; + user->seq = clear_seq; + break; + case SEEK_END: + /* after the last record */ + user->idx = log_next_idx; + user->seq = log_next_seq; + break; + default: + ret = -EINVAL; + } + raw_spin_unlock(&logbuf_lock); + return ret; +} + +static unsigned int devkmsg_poll(struct file *file, poll_table *wait) +{ + struct devkmsg_user *user = file->private_data; + int ret = 0; + + if (!user) + return POLLERR|POLLNVAL; + + poll_wait(file, &log_wait, wait); + + raw_spin_lock(&logbuf_lock); + if (user->seq < log_next_seq) { + /* return error when data has vanished underneath us */ + if (user->seq < log_first_seq) + ret = POLLIN|POLLRDNORM|POLLERR|POLLPRI; + ret = POLLIN|POLLRDNORM; + } + raw_spin_unlock(&logbuf_lock); + + return ret; +} + +static int devkmsg_open(struct inode *inode, struct file *file) +{ + struct devkmsg_user *user; + int err; + + /* write-only does not need any file context */ + if ((file->f_flags & O_ACCMODE) == O_WRONLY) + return 0; + + err = security_syslog(SYSLOG_ACTION_READ_ALL); + if (err) + return err; + + user = kmalloc(sizeof(struct devkmsg_user), GFP_KERNEL); + if (!user) + return -ENOMEM; + + mutex_init(&user->lock); + + raw_spin_lock(&logbuf_lock); + user->idx = log_first_idx; + user->seq = log_first_seq; + raw_spin_unlock(&logbuf_lock); + + file->private_data = user; + return 0; +} + +static int devkmsg_release(struct inode *inode, struct file *file) +{ + struct devkmsg_user *user = file->private_data; + + if (!user) + return 0; + + mutex_destroy(&user->lock); + kfree(user); + return 0; +} + +const struct file_operations kmsg_fops = { + .open = devkmsg_open, + .read = devkmsg_read, + .aio_write = devkmsg_writev, + .llseek = devkmsg_llseek, + .poll = devkmsg_poll, + .release = devkmsg_release, +}; + #ifdef CONFIG_KEXEC /* * This appends the listed symbols to /proc/vmcoreinfo -- cgit v1.2.3 From f37f435f33717dcf15fd4bb422da739da7fc2052 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Mon, 7 May 2012 17:59:48 +0000 Subject: smp: Implement kick_all_cpus_sync() Will replace the misnomed cpu_idle_wait() function which is copied a gazillion times all over arch/* Signed-off-by: Thomas Gleixner Acked-by: Peter Zijlstra Link: http://lkml.kernel.org/r/20120507175652.049316594@linutronix.de --- kernel/smp.c | 23 +++++++++++++++++++++++ 1 file changed, 23 insertions(+) (limited to 'kernel') diff --git a/kernel/smp.c b/kernel/smp.c index a61294c07f3f..d0ae5b24875e 100644 --- a/kernel/smp.c +++ b/kernel/smp.c @@ -795,3 +795,26 @@ void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info), } } EXPORT_SYMBOL(on_each_cpu_cond); + +static void do_nothing(void *unused) +{ +} + +/** + * kick_all_cpus_sync - Force all cpus out of idle + * + * Used to synchronize the update of pm_idle function pointer. It's + * called after the pointer is updated and returns after the dummy + * callback function has been executed on all cpus. The execution of + * the function can only happen on the remote cpus after they have + * left the idle function which had been called via pm_idle function + * pointer. 
So it's guaranteed that nothing uses the previous pointer + anymore. + */ +void kick_all_cpus_sync(void) +{ + /* Make sure the change is visible before we kick the cpus */ + smp_mb(); + smp_call_function(do_nothing, NULL, 1); +} +EXPORT_SYMBOL_GPL(kick_all_cpus_sync); -- cgit v1.2.3 From 6c0a9fa62feb7e9fdefa9720bcc03040c9b0b311 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Sat, 5 May 2012 15:05:40 +0000 Subject: fork: Remove the weak insanity We error out when compiling with gcc4.1.[01] as it miscompiles __weak. The workaround with magic defines is no longer necessary. Make it __weak again. Signed-off-by: Thomas Gleixner Link: http://lkml.kernel.org/r/20120505150141.306358267@linutronix.de --- kernel/fork.c | 8 +------- 1 file changed, 1 insertion(+), 7 deletions(-) (limited to 'kernel') diff --git a/kernel/fork.c b/kernel/fork.c index b9372a0bff18..a79b36e2e912 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -203,13 +203,7 @@ void __put_task_struct(struct task_struct *tsk) } EXPORT_SYMBOL_GPL(__put_task_struct); -/* - * macro override instead of weak attribute alias, to workaround - * gcc 4.1.0 and 4.1.1 bugs with weak attribute and empty functions. - */ -#ifndef arch_task_cache_init -#define arch_task_cache_init() -#endif +void __init __weak arch_task_cache_init(void) { } void __init fork_init(unsigned long mempages) { -- cgit v1.2.3 From 2889f60814e15dea644782597d897cdba943564f Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Sat, 5 May 2012 15:05:41 +0000 Subject: fork: Move thread info gfp flags to header These flags can be useful for extra allocations outside of the core code. Add __GFP_NOTRACK to them, so the archs which have kmemcheck do not have to provide extra allocators just for that reason. Signed-off-by: Thomas Gleixner Link: http://lkml.kernel.org/r/20120505150141.428211694@linutronix.de --- kernel/fork.c | 8 ++------ 1 file changed, 2 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/fork.c b/kernel/fork.c index a79b36e2e912..5d22b9b8cf7b 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -123,12 +123,8 @@ static struct kmem_cache *task_struct_cachep; static struct thread_info *alloc_thread_info_node(struct task_struct *tsk, int node) { -#ifdef CONFIG_DEBUG_STACK_USAGE - gfp_t mask = GFP_KERNEL | __GFP_ZERO; -#else - gfp_t mask = GFP_KERNEL; -#endif - struct page *page = alloc_pages_node(node, mask, THREAD_SIZE_ORDER); + struct page *page = alloc_pages_node(node, THREADINFO_GFP, + THREAD_SIZE_ORDER); return page ? page_address(page) : NULL; } -- cgit v1.2.3 From 41101809a865dd0be1b56eff46c83fad321870b2 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Sat, 5 May 2012 15:05:41 +0000 Subject: fork: Provide weak arch_release_[task_struct|thread_info] functions These functions allow us to move most of the duplicated thread_info allocators to the core code.
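
A rough sketch of the override mechanism, to make the intent concrete (the generic no-op is what this patch adds; the arch-side body is purely hypothetical):

	/* core code: empty __weak default, called unconditionally on free */
	void __weak arch_release_thread_info(struct thread_info *ti) { }

	/* arch code (illustrative only): a strong definition silently replaces it */
	void arch_release_thread_info(struct thread_info *ti)
	{
		/* tear down any arch-private state hanging off this thread_info */
	}

An architecture that needs no cleanup simply provides nothing and inherits the empty default.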
Signed-off-by: Thomas Gleixner Link: http://lkml.kernel.org/r/20120505150141.366461660@linutronix.de --- kernel/fork.c | 21 +++++++++++++++++---- 1 file changed, 17 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/fork.c b/kernel/fork.c index 5d22b9b8cf7b..2dfad0269674 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -112,14 +112,26 @@ int nr_processes(void) } #ifndef __HAVE_ARCH_TASK_STRUCT_ALLOCATOR -# define alloc_task_struct_node(node) \ - kmem_cache_alloc_node(task_struct_cachep, GFP_KERNEL, node) -# define free_task_struct(tsk) \ - kmem_cache_free(task_struct_cachep, (tsk)) static struct kmem_cache *task_struct_cachep; + +static inline struct task_struct *alloc_task_struct_node(int node) +{ + return kmem_cache_alloc_node(task_struct_cachep, GFP_KERNEL, node); +} + +void __weak arch_release_task_struct(struct task_struct *tsk) { } + +static inline void free_task_struct(struct task_struct *tsk) +{ + arch_release_task_struct(tsk); + kmem_cache_free(task_struct_cachep, tsk); +} #endif #ifndef __HAVE_ARCH_THREAD_INFO_ALLOCATOR + +void __weak arch_release_thread_info(struct thread_info *ti) { } + static struct thread_info *alloc_thread_info_node(struct task_struct *tsk, int node) { @@ -131,6 +143,7 @@ static struct thread_info *alloc_thread_info_node(struct task_struct *tsk, static inline void free_thread_info(struct thread_info *ti) { + arch_release_thread_info(ti); free_pages((unsigned long)ti, THREAD_SIZE_ORDER); } #endif -- cgit v1.2.3 From 0d15d74a1ead10673b5b1db66d4c90552769096c Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Sat, 5 May 2012 15:05:41 +0000 Subject: fork: Provide kmemcache based thread_info allocator Several architectures have their own kmemcache based thread allocator because THREAD_SIZE is smaller than PAGE_SIZE. Add it to the core code conditionally on THREAD_SIZE < PAGE_SIZE so the private copies can go. Signed-off-by: Thomas Gleixner Link: http://lkml.kernel.org/r/20120505150141.491002124@linutronix.de --- kernel/fork.c | 27 +++++++++++++++++++++++++++ 1 file changed, 27 insertions(+) (limited to 'kernel') diff --git a/kernel/fork.c b/kernel/fork.c index 2dfad0269674..7590bd6e8dff 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -132,6 +132,11 @@ static inline void free_task_struct(struct task_struct *tsk) void __weak arch_release_thread_info(struct thread_info *ti) { } +/* + * Allocate pages if THREAD_SIZE is >= PAGE_SIZE, otherwise use a + * kmemcache based allocator. 
+ */ +# if THREAD_SIZE >= PAGE_SIZE static struct thread_info *alloc_thread_info_node(struct task_struct *tsk, int node) { @@ -146,6 +151,28 @@ static inline void free_thread_info(struct thread_info *ti) arch_release_thread_info(ti); free_pages((unsigned long)ti, THREAD_SIZE_ORDER); } +# else +static struct kmem_cache *thread_info_cache; + +static struct thread_info *alloc_thread_info_node(struct task_struct *tsk, + int node) +{ + return kmem_cache_alloc_node(thread_info_cache, THREADINFO_GFP, node); +} + +static void free_thread_info(struct thread_info *ti) +{ + arch_release_thread_info(ti); + kmem_cache_free(thread_info_cache, ti); +} + +void thread_info_cache_init(void) +{ + thread_info_cache = kmem_cache_create("thread_info", THREAD_SIZE, + THREAD_SIZE, 0, NULL); + BUG_ON(thread_info_cache == NULL); +} +# endif #endif /* SLAB cache for signal_struct structures (tsk->signal) */ -- cgit v1.2.3 From f5e10287367dcffb5504d19c83e85ca041ca2596 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Sat, 5 May 2012 15:05:48 +0000 Subject: task_allocator: Use config switches instead of magic defines Replace __HAVE_ARCH_TASK_ALLOCATOR and __HAVE_ARCH_THREAD_ALLOCATOR with proper config switches. Signed-off-by: Thomas Gleixner Cc: Sam Ravnborg Cc: Tony Luck Link: http://lkml.kernel.org/r/20120505150142.371309416@linutronix.de --- kernel/fork.c | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/fork.c b/kernel/fork.c index 7590bd6e8dff..a1793e442b20 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -111,7 +111,7 @@ int nr_processes(void) return total; } -#ifndef __HAVE_ARCH_TASK_STRUCT_ALLOCATOR +#ifndef CONFIG_ARCH_TASK_STRUCT_ALLOCATOR static struct kmem_cache *task_struct_cachep; static inline struct task_struct *alloc_task_struct_node(int node) @@ -128,8 +128,7 @@ static inline void free_task_struct(struct task_struct *tsk) } #endif -#ifndef __HAVE_ARCH_THREAD_INFO_ALLOCATOR - +#ifndef CONFIG_ARCH_THREAD_INFO_ALLOCATOR void __weak arch_release_thread_info(struct thread_info *ti) { } /* @@ -243,7 +242,7 @@ void __init __weak arch_task_cache_init(void) { } void __init fork_init(unsigned long mempages) { -#ifndef __HAVE_ARCH_TASK_STRUCT_ALLOCATOR +#ifndef CONFIG_ARCH_TASK_STRUCT_ALLOCATOR #ifndef ARCH_MIN_TASKALIGN #define ARCH_MIN_TASKALIGN L1_CACHE_BYTES #endif -- cgit v1.2.3 From 5fc3249068c1ed87c6fd485f42ced24132405629 Mon Sep 17 00:00:00 2001 From: Kay Sievers Date: Tue, 8 May 2012 13:04:17 +0200 Subject: kmsg: use do_div() to divide 64bit integer On Tue, May 8, 2012 at 10:02 AM, Stephen Rothwell wrote: > kernel/built-in.o: In function `devkmsg_read': > printk.c:(.text+0x27e8): undefined reference to `__udivdi3' > Most probably the "msg->ts_nsec / 1000" since > ts_nsec is a u64 and this is a 32 bit build ... 
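
For reference, a minimal sketch of the idiom the fix below switches to: do_div() divides the 64-bit value in place by a 32-bit divisor (and evaluates to the remainder), so no libgcc __udivdi3 helper is ever needed on 32-bit builds:

	u64 ts_usec = msg->ts_nsec;	/* timestamp in nanoseconds */

	do_div(ts_usec, 1000);		/* in-place divide: ts_usec now holds microseconds */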
Reported-by: Stephen Rothwell Signed-off-by: Kay Sievers Signed-off-by: Greg Kroah-Hartman --- kernel/printk.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/printk.c b/kernel/printk.c index 1ccc6d986cb3..96d4cc892255 100644 --- a/kernel/printk.c +++ b/kernel/printk.c @@ -407,6 +407,7 @@ static ssize_t devkmsg_read(struct file *file, char __user *buf, { struct devkmsg_user *user = file->private_data; struct log *msg; + u64 ts_usec; size_t i; size_t len; ssize_t ret; @@ -441,8 +442,10 @@ static ssize_t devkmsg_read(struct file *file, char __user *buf, } msg = log_from_idx(user->idx); + ts_usec = msg->ts_nsec; + do_div(ts_usec, 1000); len = sprintf(user->buf, "%u,%llu,%llu;", - msg->level, user->seq, msg->ts_nsec / 1000); + msg->level, user->seq, ts_usec); /* escape non-printable characters */ for (i = 0; i < msg->text_len; i++) { -- cgit v1.2.3 From 50e18b94c695644d824381e7574b9c44acc25ffe Mon Sep 17 00:00:00 2001 From: Jiri Olsa Date: Wed, 25 Apr 2012 10:23:39 +0200 Subject: tracing: Use seq_*_private interface for some seq files It's appropriate to use __seq_open_private interface to open some of trace seq files, because it covers all steps we are duplicating in tracing code - zallocating the iterator and setting it as seq_file's private. Using this for following files: trace available_filter_functions enabled_functions Link: http://lkml.kernel.org/r/1335342219-2782-5-git-send-email-jolsa@redhat.com Signed-off-by: Jiri Olsa [ Fixed warnings for: kernel/trace/trace.c: In function '__tracing_open': kernel/trace/trace.c:2418:11: warning: unused variable 'ret' [-Wunused-variable] kernel/trace/trace.c:2417:19: warning: unused variable 'm' [-Wunused-variable] ] Signed-off-by: Steven Rostedt --- kernel/trace/ftrace.c | 44 +++++++++++--------------------------------- kernel/trace/trace.c | 30 +++++------------------------- 2 files changed, 16 insertions(+), 58 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c index 0fa92f677c92..cf81f27ce6c6 100644 --- a/kernel/trace/ftrace.c +++ b/kernel/trace/ftrace.c @@ -2469,57 +2469,35 @@ static int ftrace_avail_open(struct inode *inode, struct file *file) { struct ftrace_iterator *iter; - int ret; if (unlikely(ftrace_disabled)) return -ENODEV; - iter = kzalloc(sizeof(*iter), GFP_KERNEL); - if (!iter) - return -ENOMEM; - - iter->pg = ftrace_pages_start; - iter->ops = &global_ops; - - ret = seq_open(file, &show_ftrace_seq_ops); - if (!ret) { - struct seq_file *m = file->private_data; - - m->private = iter; - } else { - kfree(iter); + iter = __seq_open_private(file, &show_ftrace_seq_ops, sizeof(*iter)); + if (iter) { + iter->pg = ftrace_pages_start; + iter->ops = &global_ops; } - return ret; + return iter ? 0 : -ENOMEM; } static int ftrace_enabled_open(struct inode *inode, struct file *file) { struct ftrace_iterator *iter; - int ret; if (unlikely(ftrace_disabled)) return -ENODEV; - iter = kzalloc(sizeof(*iter), GFP_KERNEL); - if (!iter) - return -ENOMEM; - - iter->pg = ftrace_pages_start; - iter->flags = FTRACE_ITER_ENABLED; - iter->ops = &global_ops; - - ret = seq_open(file, &show_ftrace_seq_ops); - if (!ret) { - struct seq_file *m = file->private_data; - - m->private = iter; - } else { - kfree(iter); + iter = __seq_open_private(file, &show_ftrace_seq_ops, sizeof(*iter)); + if (iter) { + iter->pg = ftrace_pages_start; + iter->flags = FTRACE_ITER_ENABLED; + iter->ops = &global_ops; } - return ret; + return iter ? 
0 : -ENOMEM; } static void ftrace_filter_reset(struct ftrace_hash *hash) diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index f11a285ee5bb..4fb10ef727d3 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -2413,15 +2413,13 @@ static struct trace_iterator * __tracing_open(struct inode *inode, struct file *file) { long cpu_file = (long) inode->i_private; - void *fail_ret = ERR_PTR(-ENOMEM); struct trace_iterator *iter; - struct seq_file *m; - int cpu, ret; + int cpu; if (tracing_disabled) return ERR_PTR(-ENODEV); - iter = kzalloc(sizeof(*iter), GFP_KERNEL); + iter = __seq_open_private(file, &tracer_seq_ops, sizeof(*iter)); if (!iter) return ERR_PTR(-ENOMEM); @@ -2478,32 +2476,15 @@ __tracing_open(struct inode *inode, struct file *file) tracing_iter_reset(iter, cpu); } - ret = seq_open(file, &tracer_seq_ops); - if (ret < 0) { - fail_ret = ERR_PTR(ret); - goto fail_buffer; - } - - m = file->private_data; - m->private = iter; - mutex_unlock(&trace_types_lock); return iter; - fail_buffer: - for_each_tracing_cpu(cpu) { - if (iter->buffer_iter[cpu]) - ring_buffer_read_finish(iter->buffer_iter[cpu]); - } - free_cpumask_var(iter->started); - tracing_start(); fail: mutex_unlock(&trace_types_lock); kfree(iter->trace); - kfree(iter); - - return fail_ret; + seq_release_private(inode, file); + return ERR_PTR(-ENOMEM); } int tracing_open_generic(struct inode *inode, struct file *filp) @@ -2539,11 +2520,10 @@ static int tracing_release(struct inode *inode, struct file *file) tracing_start(); mutex_unlock(&trace_types_lock); - seq_release(inode, file); mutex_destroy(&iter->mutex); free_cpumask_var(iter->started); kfree(iter->trace); - kfree(iter); + seq_release_private(inode, file); return 0; } -- cgit v1.2.3 From 68179686ac67cb08f08b1ef28b860d5ed899f242 Mon Sep 17 00:00:00 2001 From: Steven Rostedt Date: Tue, 8 May 2012 20:57:53 -0400 Subject: tracing: Remove ftrace_disable/enable_cpu() The ftrace_disable_cpu() and ftrace_enable_cpu() functions were needed back before the ring buffer was lockless. Now that the ring buffer is lockless (and has been for some time), these functions serve no purpose, and unnecessarily slow down operations of the tracer. 
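
A condensed before/after view of the pattern being deleted, taken from the hunks below, just to make the simplification explicit:

	/* before: every buffer access was bracketed by a per-cpu disable counter */
	ftrace_disable_cpu();		/* preempt_disable() + __this_cpu_inc(ftrace_cpu_disabled) */
	ring_buffer_reset_cpu(buffer, cpu);
	ftrace_enable_cpu();

	/* after: the lockless ring buffer can be operated on directly */
	ring_buffer_reset_cpu(buffer, cpu);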
Signed-off-by: Steven Rostedt --- kernel/trace/trace.c | 44 ++------------------------------------------ 1 file changed, 2 insertions(+), 42 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 4fb10ef727d3..48ef4960ec90 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -87,18 +87,6 @@ static int tracing_disabled = 1; DEFINE_PER_CPU(int, ftrace_cpu_disabled); -static inline void ftrace_disable_cpu(void) -{ - preempt_disable(); - __this_cpu_inc(ftrace_cpu_disabled); -} - -static inline void ftrace_enable_cpu(void) -{ - __this_cpu_dec(ftrace_cpu_disabled); - preempt_enable(); -} - cpumask_var_t __read_mostly tracing_buffer_mask; /* @@ -748,8 +736,6 @@ update_max_tr_single(struct trace_array *tr, struct task_struct *tsk, int cpu) arch_spin_lock(&ftrace_max_lock); - ftrace_disable_cpu(); - ret = ring_buffer_swap_cpu(max_tr.buffer, tr->buffer, cpu); if (ret == -EBUSY) { @@ -763,8 +749,6 @@ update_max_tr_single(struct trace_array *tr, struct task_struct *tsk, int cpu) "Failed to swap buffers due to commit in progress\n"); } - ftrace_enable_cpu(); - WARN_ON_ONCE(ret && ret != -EAGAIN && ret != -EBUSY); __update_max_tr(tr, tsk, cpu); @@ -916,13 +900,6 @@ out: mutex_unlock(&trace_types_lock); } -static void __tracing_reset(struct ring_buffer *buffer, int cpu) -{ - ftrace_disable_cpu(); - ring_buffer_reset_cpu(buffer, cpu); - ftrace_enable_cpu(); -} - void tracing_reset(struct trace_array *tr, int cpu) { struct ring_buffer *buffer = tr->buffer; @@ -931,7 +908,7 @@ void tracing_reset(struct trace_array *tr, int cpu) /* Make sure all commits have finished */ synchronize_sched(); - __tracing_reset(buffer, cpu); + ring_buffer_reset_cpu(buffer, cpu); ring_buffer_record_enable(buffer); } @@ -949,7 +926,7 @@ void tracing_reset_online_cpus(struct trace_array *tr) tr->time_start = ftrace_now(tr->cpu); for_each_online_cpu(cpu) - __tracing_reset(buffer, cpu); + ring_buffer_reset_cpu(buffer, cpu); ring_buffer_record_enable(buffer); } @@ -1733,14 +1710,9 @@ EXPORT_SYMBOL_GPL(trace_vprintk); static void trace_iterator_increment(struct trace_iterator *iter) { - /* Don't allow ftrace to trace into the ring buffers */ - ftrace_disable_cpu(); - iter->idx++; if (iter->buffer_iter[iter->cpu]) ring_buffer_read(iter->buffer_iter[iter->cpu], NULL); - - ftrace_enable_cpu(); } static struct trace_entry * @@ -1750,17 +1722,12 @@ peek_next_entry(struct trace_iterator *iter, int cpu, u64 *ts, struct ring_buffer_event *event; struct ring_buffer_iter *buf_iter = iter->buffer_iter[cpu]; - /* Don't allow ftrace to trace into the ring buffers */ - ftrace_disable_cpu(); - if (buf_iter) event = ring_buffer_iter_peek(buf_iter, ts); else event = ring_buffer_peek(iter->tr->buffer, cpu, ts, lost_events); - ftrace_enable_cpu(); - if (event) { iter->ent_size = ring_buffer_event_length(event); return ring_buffer_event_data(event); @@ -1850,11 +1817,8 @@ void *trace_find_next_entry_inc(struct trace_iterator *iter) static void trace_consume(struct trace_iterator *iter) { - /* Don't allow ftrace to trace into the ring buffers */ - ftrace_disable_cpu(); ring_buffer_consume(iter->tr->buffer, iter->cpu, &iter->ts, &iter->lost_events); - ftrace_enable_cpu(); } static void *s_next(struct seq_file *m, void *v, loff_t *pos) @@ -1943,16 +1907,12 @@ static void *s_start(struct seq_file *m, loff_t *pos) iter->cpu = 0; iter->idx = -1; - ftrace_disable_cpu(); - if (cpu_file == TRACE_PIPE_ALL_CPU) { for_each_tracing_cpu(cpu) tracing_iter_reset(iter, cpu); } else tracing_iter_reset(iter, cpu_file); 
- ftrace_enable_cpu(); - iter->leftover = 0; for (p = iter; p && l < *pos; p = s_next(m, p, &l)) ; -- cgit v1.2.3 From c82513e513556a04f81aa511cd890acd23349c48 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Thu, 26 Apr 2012 13:12:27 +0200 Subject: sched: Change rq->nr_running to unsigned int Since there's a PID space limit of 30bits (see futex.h:FUTEX_TID_MASK) and allocating that many tasks (assuming a lower bound of 2 pages per task) would still take 8T of memory it seems reasonable to say that unsigned int is sufficient for rq->nr_running. When we do get anywhere near that amount of tasks I suspect other things would go funny, load-balancer load computations would really need to be hoisted to 128bit etc. So save a few bytes and convert rq->nr_running and friends to unsigned int. Suggested-by: Ingo Molnar Signed-off-by: Peter Zijlstra Link: http://lkml.kernel.org/n/tip-y3tvyszjdmbibade5bw8zl81@git.kernel.org Signed-off-by: Ingo Molnar --- kernel/sched/debug.c | 2 +- kernel/sched/fair.c | 8 ++++---- kernel/sched/sched.h | 6 +++--- 3 files changed, 8 insertions(+), 8 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c index 09acaa15161d..31e4f61a1629 100644 --- a/kernel/sched/debug.c +++ b/kernel/sched/debug.c @@ -202,7 +202,7 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq) SPLIT_NS(spread0)); SEQ_printf(m, " .%-30s: %d\n", "nr_spread_over", cfs_rq->nr_spread_over); - SEQ_printf(m, " .%-30s: %ld\n", "nr_running", cfs_rq->nr_running); + SEQ_printf(m, " .%-30s: %d\n", "nr_running", cfs_rq->nr_running); SEQ_printf(m, " .%-30s: %ld\n", "load", cfs_rq->load.weight); #ifdef CONFIG_FAIR_GROUP_SCHED #ifdef CONFIG_SMP diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index e9553640c1c3..678966ca393b 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4447,10 +4447,10 @@ redo: * correctly treated as an imbalance. */ env.flags |= LBF_ALL_PINNED; - env.load_move = imbalance; - env.src_cpu = busiest->cpu; - env.src_rq = busiest; - env.loop_max = min_t(unsigned long, sysctl_sched_nr_migrate, busiest->nr_running); + env.load_move = imbalance; + env.src_cpu = busiest->cpu; + env.src_rq = busiest; + env.loop_max = min(sysctl_sched_nr_migrate, busiest->nr_running); more_balance: local_irq_save(flags); diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index fb3acba4d52e..7282e7b5f4c7 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -201,7 +201,7 @@ struct cfs_bandwidth { }; /* CFS-related fields in a runqueue */ struct cfs_rq { struct load_weight load; - unsigned long nr_running, h_nr_running; + unsigned int nr_running, h_nr_running; u64 exec_clock; u64 min_vruntime; @@ -279,7 +279,7 @@ static inline int rt_bandwidth_enabled(void) /* Real-Time classes' related field in a runqueue: */ struct rt_rq { struct rt_prio_array active; - unsigned long rt_nr_running; + unsigned int rt_nr_running; #if defined CONFIG_SMP || defined CONFIG_RT_GROUP_SCHED struct { int curr; /* highest queued rt task prio */ @@ -353,7 +353,7 @@ struct rq { * nr_running and cpu_load should be in the same cacheline because * remote CPUs use both these fields when doing load calculation. 
*/ - unsigned long nr_running; + unsigned int nr_running; #define CPU_LOAD_IDX_MAX 5 unsigned long cpu_load[CPU_LOAD_IDX_MAX]; unsigned long last_load_update_tick; -- cgit v1.2.3 From c22402a2f76e88b04b7a8b6c0597ad9ba6fd71de Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Fri, 20 Apr 2012 16:57:22 +0200 Subject: sched/fair: Let minimally loaded cpu balance the group Currently we let the leftmost (or first idle) cpu ascend the sched_domain tree and perform load-balancing. The result is that the busiest cpu in the group might be performing this function and pull more load to itself. The next load balance pass will then try to equalize this again. Change this to pick the least loaded cpu to perform higher domain balancing. Signed-off-by: Peter Zijlstra Link: http://lkml.kernel.org/n/tip-v8zlrmgmkne3bkcy9dej1fvm@git.kernel.org Signed-off-by: Ingo Molnar --- kernel/sched/fair.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 678966ca393b..968ffee24721 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3781,7 +3781,8 @@ static inline void update_sg_lb_stats(struct sched_domain *sd, { unsigned long load, max_cpu_load, min_cpu_load, max_nr_running; int i; - unsigned int balance_cpu = -1, first_idle_cpu = 0; + unsigned int balance_cpu = -1; + unsigned long balance_load = ~0UL; unsigned long avg_load_per_task = 0; if (local_group) @@ -3797,12 +3798,11 @@ static inline void update_sg_lb_stats(struct sched_domain *sd, /* Bias balancing toward cpus of our domain */ if (local_group) { - if (idle_cpu(i) && !first_idle_cpu) { - first_idle_cpu = 1; + load = target_load(i, load_idx); + if (load < balance_load || idle_cpu(i)) { + balance_load = load; balance_cpu = i; } - - load = target_load(i, load_idx); } else { load = source_load(i, load_idx); if (load > max_cpu_load) { -- cgit v1.2.3 From 0ce90475dcdbe90affc218e9688c8401e468e84d Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Wed, 25 Apr 2012 00:30:36 +0200 Subject: sched/fair: Add some serialization to the sched_domain load-balance walk Since the sched_domain walk is completely unserialized (!SD_SERIALIZE) it is possible that multiple cpus in the group get elected to do the next level. Avoid this by adding some serialization. 
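
The serialization boils down to electing one cpu per group via a cmpxchg() on a new balance_cpu slot; a condensed sketch of the idea (simplified from the hunks below, with the claim released once the domain walk is finished):

	/* only the first cpu to flip the slot from -1 to balance_cpu may ascend */
	if (balance_cpu != this_cpu ||
	    cmpxchg(&group->balance_cpu, -1, balance_cpu) != -1) {
		*balance = 0;		/* another cpu already owns this level */
		return;
	}

	/* ... and after rebalance_domains() has walked the tree, drop the claim */
	(void)cmpxchg(&sd->groups->balance_cpu, cpu, -1);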
Signed-off-by: Peter Zijlstra Link: http://lkml.kernel.org/n/tip-vqh9ai6s0ewmeakjz80w4qz6@git.kernel.org Signed-off-by: Ingo Molnar --- kernel/sched/core.c | 2 ++ kernel/sched/fair.c | 9 +++++++-- 2 files changed, 9 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 0533a688ce22..6001e5c3b4e4 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6060,6 +6060,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu) sg->sgp = *per_cpu_ptr(sdd->sgp, cpumask_first(sg_span)); atomic_inc(&sg->sgp->ref); + sg->balance_cpu = -1; if (cpumask_test_cpu(cpu, sg_span)) groups = sg; @@ -6135,6 +6136,7 @@ build_sched_groups(struct sched_domain *sd, int cpu) cpumask_clear(sched_group_cpus(sg)); sg->sgp->power = 0; + sg->balance_cpu = -1; for_each_cpu(j, span) { if (get_group(j, sdd, NULL) != group) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 968ffee24721..cf86f74bcac2 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3828,7 +3828,8 @@ static inline void update_sg_lb_stats(struct sched_domain *sd, */ if (local_group) { if (idle != CPU_NEWLY_IDLE) { - if (balance_cpu != this_cpu) { + if (balance_cpu != this_cpu || + cmpxchg(&group->balance_cpu, -1, balance_cpu) != -1) { *balance = 0; return; } @@ -4929,7 +4930,7 @@ static void rebalance_domains(int cpu, enum cpu_idle_type idle) int balance = 1; struct rq *rq = cpu_rq(cpu); unsigned long interval; - struct sched_domain *sd; + struct sched_domain *sd, *last = NULL; /* Earliest time when we have to do rebalance again */ unsigned long next_balance = jiffies + 60*HZ; int update_next_balance = 0; @@ -4939,6 +4940,7 @@ static void rebalance_domains(int cpu, enum cpu_idle_type idle) rcu_read_lock(); for_each_domain(cpu, sd) { + last = sd; if (!(sd->flags & SD_LOAD_BALANCE)) continue; @@ -4983,6 +4985,9 @@ out: if (!balance) break; } + for (sd = last; sd; sd = sd->child) + (void)cmpxchg(&sd->groups->balance_cpu, cpu, -1); + rcu_read_unlock(); /* -- cgit v1.2.3 From bd939f45da24e25e08a8f5c993c50b1afada0fef Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Wed, 2 May 2012 14:20:37 +0200 Subject: sched/fair: Propagate 'struct lb_env' usage into find_busiest_group More function argument passing reduction. Signed-off-by: Peter Zijlstra Link: http://lkml.kernel.org/n/tip-v66ivjfqdiqdso01lqgqx6qf@git.kernel.org Signed-off-by: Ingo Molnar --- kernel/sched/fair.c | 175 ++++++++++++++++++++++++---------------------------- 1 file changed, 82 insertions(+), 93 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index cf86f74bcac2..9bd3366dbb1c 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3082,7 +3082,7 @@ struct lb_env { struct rq *dst_rq; enum cpu_idle_type idle; - long load_move; + long imbalance; unsigned int flags; unsigned int loop; @@ -3218,7 +3218,7 @@ static unsigned long task_h_load(struct task_struct *p); static const unsigned int sched_nr_migrate_break = 32; /* - * move_tasks tries to move up to load_move weighted load from busiest to + * move_tasks tries to move up to imbalance weighted load from busiest to * this_rq, as part of a balancing operation within domain "sd". * Returns 1 if successful and 0 otherwise. 
* @@ -3231,7 +3231,7 @@ static int move_tasks(struct lb_env *env) unsigned long load; int pulled = 0; - if (env->load_move <= 0) + if (env->imbalance <= 0) return 0; while (!list_empty(tasks)) { @@ -3257,7 +3257,7 @@ static int move_tasks(struct lb_env *env) if (sched_feat(LB_MIN) && load < 16 && !env->sd->nr_balance_failed) goto next; - if ((load / 2) > env->load_move) + if ((load / 2) > env->imbalance) goto next; if (!can_migrate_task(p, env)) @@ -3265,7 +3265,7 @@ static int move_tasks(struct lb_env *env) move_task(p, env); pulled++; - env->load_move -= load; + env->imbalance -= load; #ifdef CONFIG_PREEMPT /* @@ -3281,7 +3281,7 @@ static int move_tasks(struct lb_env *env) * We only want to steal up to the prescribed amount of * weighted load. */ - if (env->load_move <= 0) + if (env->imbalance <= 0) break; continue; @@ -3578,10 +3578,9 @@ static inline void update_sd_power_savings_stats(struct sched_group *group, /** * check_power_save_busiest_group - see if there is potential for some power-savings balance + * @env: load balance environment * @sds: Variable containing the statistics of the sched_domain * under consideration. - * @this_cpu: Cpu at which we're currently performing load-balancing. - * @imbalance: Variable to store the imbalance. * * Description: * Check if we have potential to perform some power-savings balance. @@ -3591,8 +3590,8 @@ static inline void update_sd_power_savings_stats(struct sched_group *group, * Returns 1 if there is potential to perform power-savings balance. * Else returns 0. */ -static inline int check_power_save_busiest_group(struct sd_lb_stats *sds, - int this_cpu, unsigned long *imbalance) +static inline +int check_power_save_busiest_group(struct lb_env *env, struct sd_lb_stats *sds) { if (!sds->power_savings_balance) return 0; @@ -3601,7 +3600,7 @@ static inline int check_power_save_busiest_group(struct sd_lb_stats *sds, sds->group_leader == sds->group_min) return 0; - *imbalance = sds->min_load_per_task; + env->imbalance = sds->min_load_per_task; sds->busiest = sds->group_min; return 1; @@ -3620,8 +3619,8 @@ static inline void update_sd_power_savings_stats(struct sched_group *group, return; } -static inline int check_power_save_busiest_group(struct sd_lb_stats *sds, - int this_cpu, unsigned long *imbalance) +static inline +int check_power_save_busiest_group(struct lb_env *env, struct sd_lb_stats *sds) { return 0; } @@ -3765,25 +3764,22 @@ fix_small_capacity(struct sched_domain *sd, struct sched_group *group) * update_sg_lb_stats - Update sched_group's statistics for load balancing. * @sd: The sched_domain whose statistics are to be updated. * @group: sched_group whose statistics are to be updated. - * @this_cpu: Cpu for which load balance is currently performed. - * @idle: Idle status of this_cpu * @load_idx: Load index of sched_domain of this_cpu for load calc. * @local_group: Does group contain this_cpu. * @cpus: Set of cpus considered for load balancing. * @balance: Should we balance. * @sgs: variable to hold the statistics for this group. 
*/ -static inline void update_sg_lb_stats(struct sched_domain *sd, - struct sched_group *group, int this_cpu, - enum cpu_idle_type idle, int load_idx, +static inline void update_sg_lb_stats(struct lb_env *env, + struct sched_group *group, int load_idx, int local_group, const struct cpumask *cpus, int *balance, struct sg_lb_stats *sgs) { unsigned long load, max_cpu_load, min_cpu_load, max_nr_running; - int i; unsigned int balance_cpu = -1; unsigned long balance_load = ~0UL; unsigned long avg_load_per_task = 0; + int i; if (local_group) balance_cpu = group_first_cpu(group); @@ -3827,15 +3823,15 @@ static inline void update_sg_lb_stats(struct sched_domain *sd, * to do the newly idle load balance. */ if (local_group) { - if (idle != CPU_NEWLY_IDLE) { - if (balance_cpu != this_cpu || + if (env->idle != CPU_NEWLY_IDLE) { + if (balance_cpu != env->dst_cpu || cmpxchg(&group->balance_cpu, -1, balance_cpu) != -1) { *balance = 0; return; } - update_group_power(sd, this_cpu); + update_group_power(env->sd, env->dst_cpu); } else if (time_after_eq(jiffies, group->sgp->next_update)) - update_group_power(sd, this_cpu); + update_group_power(env->sd, env->dst_cpu); } /* Adjust by relative CPU power of the group */ @@ -3859,7 +3855,7 @@ static inline void update_sg_lb_stats(struct sched_domain *sd, sgs->group_capacity = DIV_ROUND_CLOSEST(group->sgp->power, SCHED_POWER_SCALE); if (!sgs->group_capacity) - sgs->group_capacity = fix_small_capacity(sd, group); + sgs->group_capacity = fix_small_capacity(env->sd, group); sgs->group_weight = group->group_weight; if (sgs->group_capacity > sgs->sum_nr_running) @@ -3877,11 +3873,10 @@ static inline void update_sg_lb_stats(struct sched_domain *sd, * Determine if @sg is a busier group than the previously selected * busiest group. */ -static bool update_sd_pick_busiest(struct sched_domain *sd, +static bool update_sd_pick_busiest(struct lb_env *env, struct sd_lb_stats *sds, struct sched_group *sg, - struct sg_lb_stats *sgs, - int this_cpu) + struct sg_lb_stats *sgs) { if (sgs->avg_load <= sds->max_load) return false; @@ -3897,8 +3892,8 @@ static bool update_sd_pick_busiest(struct sched_domain *sd, * numbered CPUs in the group, therefore mark all groups * higher than ourself as busy. */ - if ((sd->flags & SD_ASYM_PACKING) && sgs->sum_nr_running && - this_cpu < group_first_cpu(sg)) { + if ((env->sd->flags & SD_ASYM_PACKING) && sgs->sum_nr_running && + env->dst_cpu < group_first_cpu(sg)) { if (!sds->busiest) return true; @@ -3918,28 +3913,28 @@ static bool update_sd_pick_busiest(struct sched_domain *sd, * @balance: Should we balance. * @sds: variable to hold the statistics for this sched_domain. 
*/ -static inline void update_sd_lb_stats(struct sched_domain *sd, int this_cpu, - enum cpu_idle_type idle, const struct cpumask *cpus, - int *balance, struct sd_lb_stats *sds) +static inline void update_sd_lb_stats(struct lb_env *env, + const struct cpumask *cpus, + int *balance, struct sd_lb_stats *sds) { - struct sched_domain *child = sd->child; - struct sched_group *sg = sd->groups; + struct sched_domain *child = env->sd->child; + struct sched_group *sg = env->sd->groups; struct sg_lb_stats sgs; int load_idx, prefer_sibling = 0; if (child && child->flags & SD_PREFER_SIBLING) prefer_sibling = 1; - init_sd_power_savings_stats(sd, sds, idle); - load_idx = get_sd_load_idx(sd, idle); + init_sd_power_savings_stats(env->sd, sds, env->idle); + load_idx = get_sd_load_idx(env->sd, env->idle); do { int local_group; - local_group = cpumask_test_cpu(this_cpu, sched_group_cpus(sg)); + local_group = cpumask_test_cpu(env->dst_cpu, sched_group_cpus(sg)); memset(&sgs, 0, sizeof(sgs)); - update_sg_lb_stats(sd, sg, this_cpu, idle, load_idx, - local_group, cpus, balance, &sgs); + update_sg_lb_stats(env, sg, load_idx, local_group, + cpus, balance, &sgs); if (local_group && !(*balance)) return; @@ -3967,7 +3962,7 @@ static inline void update_sd_lb_stats(struct sched_domain *sd, int this_cpu, sds->this_load_per_task = sgs.sum_weighted_load; sds->this_has_capacity = sgs.group_has_capacity; sds->this_idle_cpus = sgs.idle_cpus; - } else if (update_sd_pick_busiest(sd, sds, sg, &sgs, this_cpu)) { + } else if (update_sd_pick_busiest(env, sds, sg, &sgs)) { sds->max_load = sgs.avg_load; sds->busiest = sg; sds->busiest_nr_running = sgs.sum_nr_running; @@ -3981,7 +3976,7 @@ static inline void update_sd_lb_stats(struct sched_domain *sd, int this_cpu, update_sd_power_savings_stats(sg, sds, local_group, &sgs); sg = sg->next; - } while (sg != sd->groups); + } while (sg != env->sd->groups); } /** @@ -4009,24 +4004,23 @@ static inline void update_sd_lb_stats(struct sched_domain *sd, int this_cpu, * @this_cpu: The cpu at whose sched_domain we're performing load-balance. * @imbalance: returns amount of imbalanced due to packing. */ -static int check_asym_packing(struct sched_domain *sd, - struct sd_lb_stats *sds, - int this_cpu, unsigned long *imbalance) +static int check_asym_packing(struct lb_env *env, struct sd_lb_stats *sds) { int busiest_cpu; - if (!(sd->flags & SD_ASYM_PACKING)) + if (!(env->sd->flags & SD_ASYM_PACKING)) return 0; if (!sds->busiest) return 0; busiest_cpu = group_first_cpu(sds->busiest); - if (this_cpu > busiest_cpu) + if (env->dst_cpu > busiest_cpu) return 0; - *imbalance = DIV_ROUND_CLOSEST(sds->max_load * sds->busiest->sgp->power, - SCHED_POWER_SCALE); + env->imbalance = DIV_ROUND_CLOSEST( + sds->max_load * sds->busiest->sgp->power, SCHED_POWER_SCALE); + return 1; } @@ -4038,8 +4032,8 @@ static int check_asym_packing(struct sched_domain *sd, * @this_cpu: The cpu at whose sched_domain we're performing load-balance. * @imbalance: Variable to store the imbalance. 
*/ -static inline void fix_small_imbalance(struct sd_lb_stats *sds, - int this_cpu, unsigned long *imbalance) +static inline +void fix_small_imbalance(struct lb_env *env, struct sd_lb_stats *sds) { unsigned long tmp, pwr_now = 0, pwr_move = 0; unsigned int imbn = 2; @@ -4050,9 +4044,10 @@ static inline void fix_small_imbalance(struct sd_lb_stats *sds, if (sds->busiest_load_per_task > sds->this_load_per_task) imbn = 1; - } else + } else { sds->this_load_per_task = - cpu_avg_load_per_task(this_cpu); + cpu_avg_load_per_task(env->dst_cpu); + } scaled_busy_load_per_task = sds->busiest_load_per_task * SCHED_POWER_SCALE; @@ -4060,7 +4055,7 @@ static inline void fix_small_imbalance(struct sd_lb_stats *sds, if (sds->max_load - sds->this_load + scaled_busy_load_per_task >= (scaled_busy_load_per_task * imbn)) { - *imbalance = sds->busiest_load_per_task; + env->imbalance = sds->busiest_load_per_task; return; } @@ -4097,18 +4092,16 @@ static inline void fix_small_imbalance(struct sd_lb_stats *sds, /* Move if we gain throughput */ if (pwr_move > pwr_now) - *imbalance = sds->busiest_load_per_task; + env->imbalance = sds->busiest_load_per_task; } /** * calculate_imbalance - Calculate the amount of imbalance present within the * groups of a given sched_domain during load balance. + * @env: load balance environment * @sds: statistics of the sched_domain whose imbalance is to be calculated. - * @this_cpu: Cpu for which currently load balance is being performed. - * @imbalance: The variable to store the imbalance. */ -static inline void calculate_imbalance(struct sd_lb_stats *sds, int this_cpu, - unsigned long *imbalance) +static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *sds) { unsigned long max_pull, load_above_capacity = ~0UL; @@ -4124,8 +4117,8 @@ static inline void calculate_imbalance(struct sd_lb_stats *sds, int this_cpu, * its cpu_power, while calculating max_load..) */ if (sds->max_load < sds->avg_load) { - *imbalance = 0; - return fix_small_imbalance(sds, this_cpu, imbalance); + env->imbalance = 0; + return fix_small_imbalance(env, sds); } if (!sds->group_imb) { @@ -4153,7 +4146,7 @@ static inline void calculate_imbalance(struct sd_lb_stats *sds, int this_cpu, max_pull = min(sds->max_load - sds->avg_load, load_above_capacity); /* How much load to actually move to equalise the imbalance */ - *imbalance = min(max_pull * sds->busiest->sgp->power, + env->imbalance = min(max_pull * sds->busiest->sgp->power, (sds->avg_load - sds->this_load) * sds->this->sgp->power) / SCHED_POWER_SCALE; @@ -4163,8 +4156,8 @@ static inline void calculate_imbalance(struct sd_lb_stats *sds, int this_cpu, * a think about bumping its value to force at least one task to be * moved */ - if (*imbalance < sds->busiest_load_per_task) - return fix_small_imbalance(sds, this_cpu, imbalance); + if (env->imbalance < sds->busiest_load_per_task) + return fix_small_imbalance(env, sds); } @@ -4195,9 +4188,7 @@ static inline void calculate_imbalance(struct sd_lb_stats *sds, int this_cpu, * put to idle by rebalancing its tasks onto our group. */ static struct sched_group * -find_busiest_group(struct sched_domain *sd, int this_cpu, - unsigned long *imbalance, enum cpu_idle_type idle, - const struct cpumask *cpus, int *balance) +find_busiest_group(struct lb_env *env, const struct cpumask *cpus, int *balance) { struct sd_lb_stats sds; @@ -4207,7 +4198,7 @@ find_busiest_group(struct sched_domain *sd, int this_cpu, * Compute the various statistics relavent for load balancing at * this level. 
*/ - update_sd_lb_stats(sd, this_cpu, idle, cpus, balance, &sds); + update_sd_lb_stats(env, cpus, balance, &sds); /* * this_cpu is not the appropriate cpu to perform load balancing at @@ -4216,8 +4207,8 @@ find_busiest_group(struct sched_domain *sd, int this_cpu, if (!(*balance)) goto ret; - if ((idle == CPU_IDLE || idle == CPU_NEWLY_IDLE) && - check_asym_packing(sd, &sds, this_cpu, imbalance)) + if ((env->idle == CPU_IDLE || env->idle == CPU_NEWLY_IDLE) && + check_asym_packing(env, &sds)) return sds.busiest; /* There is no busy sibling group to pull tasks from */ @@ -4235,7 +4226,7 @@ find_busiest_group(struct sched_domain *sd, int this_cpu, goto force_balance; /* SD_BALANCE_NEWIDLE trumps SMP nice when underutilized */ - if (idle == CPU_NEWLY_IDLE && sds.this_has_capacity && + if (env->idle == CPU_NEWLY_IDLE && sds.this_has_capacity && !sds.busiest_has_capacity) goto force_balance; @@ -4253,7 +4244,7 @@ find_busiest_group(struct sched_domain *sd, int this_cpu, if (sds.this_load >= sds.avg_load) goto out_balanced; - if (idle == CPU_IDLE) { + if (env->idle == CPU_IDLE) { /* * This cpu is idle. If the busiest group load doesn't * have more tasks than the number of available cpu's and @@ -4268,13 +4259,13 @@ find_busiest_group(struct sched_domain *sd, int this_cpu, * In the CPU_NEWLY_IDLE, CPU_NOT_IDLE cases, use * imbalance_pct to be conservative. */ - if (100 * sds.max_load <= sd->imbalance_pct * sds.this_load) + if (100 * sds.max_load <= env->sd->imbalance_pct * sds.this_load) goto out_balanced; } force_balance: /* Looks like there is an imbalance. Compute it */ - calculate_imbalance(&sds, this_cpu, imbalance); + calculate_imbalance(env, &sds); return sds.busiest; out_balanced: @@ -4282,20 +4273,19 @@ out_balanced: * There is no obvious imbalance. But check if we can do some balancing * to save power. */ - if (check_power_save_busiest_group(&sds, this_cpu, imbalance)) + if (check_power_save_busiest_group(env, &sds)) return sds.busiest; ret: - *imbalance = 0; + env->imbalance = 0; return NULL; } /* * find_busiest_queue - find the busiest runqueue among the cpus in group. */ -static struct rq * -find_busiest_queue(struct sched_domain *sd, struct sched_group *group, - enum cpu_idle_type idle, unsigned long imbalance, - const struct cpumask *cpus) +static struct rq *find_busiest_queue(struct lb_env *env, + struct sched_group *group, + const struct cpumask *cpus) { struct rq *busiest = NULL, *rq; unsigned long max_load = 0; @@ -4308,7 +4298,7 @@ find_busiest_queue(struct sched_domain *sd, struct sched_group *group, unsigned long wl; if (!capacity) - capacity = fix_small_capacity(sd, group); + capacity = fix_small_capacity(env->sd, group); if (!cpumask_test_cpu(i, cpus)) continue; @@ -4320,7 +4310,7 @@ find_busiest_queue(struct sched_domain *sd, struct sched_group *group, * When comparing with imbalance, use weighted_cpuload() * which is not scaled with the cpu power. */ - if (capacity && rq->nr_running == 1 && wl > imbalance) + if (capacity && rq->nr_running == 1 && wl > env->imbalance) continue; /* @@ -4349,17 +4339,18 @@ find_busiest_queue(struct sched_domain *sd, struct sched_group *group, /* Working cpumask for load_balance and load_balance_newidle. 
*/ DEFINE_PER_CPU(cpumask_var_t, load_balance_tmpmask); -static int need_active_balance(struct sched_domain *sd, int idle, - int busiest_cpu, int this_cpu) +static int need_active_balance(struct lb_env *env) { - if (idle == CPU_NEWLY_IDLE) { + struct sched_domain *sd = env->sd; + + if (env->idle == CPU_NEWLY_IDLE) { /* * ASYM_PACKING needs to force migrate tasks from busy but * higher numbered CPUs in order to pack all tasks in the * lowest numbered CPUs. */ - if ((sd->flags & SD_ASYM_PACKING) && busiest_cpu > this_cpu) + if ((sd->flags & SD_ASYM_PACKING) && env->src_cpu > env->dst_cpu) return 1; /* @@ -4400,7 +4391,6 @@ static int load_balance(int this_cpu, struct rq *this_rq, { int ld_moved, active_balance = 0; struct sched_group *group; - unsigned long imbalance; struct rq *busiest; unsigned long flags; struct cpumask *cpus = __get_cpu_var(load_balance_tmpmask); @@ -4418,8 +4408,7 @@ static int load_balance(int this_cpu, struct rq *this_rq, schedstat_inc(sd, lb_count[idle]); redo: - group = find_busiest_group(sd, this_cpu, &imbalance, idle, - cpus, balance); + group = find_busiest_group(&env, cpus, balance); if (*balance == 0) goto out_balanced; @@ -4429,7 +4418,7 @@ redo: goto out_balanced; } - busiest = find_busiest_queue(sd, group, idle, imbalance, cpus); + busiest = find_busiest_queue(&env, group, cpus); if (!busiest) { schedstat_inc(sd, lb_nobusyq[idle]); goto out_balanced; @@ -4437,7 +4426,7 @@ redo: BUG_ON(busiest == this_rq); - schedstat_add(sd, lb_imbalance[idle], imbalance); + schedstat_add(sd, lb_imbalance[idle], env.imbalance); ld_moved = 0; if (busiest->nr_running > 1) { @@ -4448,7 +4437,6 @@ redo: * correctly treated as an imbalance. */ env.flags |= LBF_ALL_PINNED; - env.load_move = imbalance; env.src_cpu = busiest->cpu; env.src_rq = busiest; env.loop_max = min(sysctl_sched_nr_migrate, busiest->nr_running); @@ -4493,7 +4481,7 @@ more_balance: if (idle != CPU_NEWLY_IDLE) sd->nr_balance_failed++; - if (need_active_balance(sd, idle, cpu_of(busiest), this_cpu)) { + if (need_active_balance(&env)) { raw_spin_lock_irqsave(&busiest->lock, flags); /* don't kick the active_load_balance_cpu_stop, @@ -4520,10 +4508,11 @@ more_balance: } raw_spin_unlock_irqrestore(&busiest->lock, flags); - if (active_balance) + if (active_balance) { stop_one_cpu_nowait(cpu_of(busiest), active_load_balance_cpu_stop, busiest, &busiest->active_balance_work); + } /* * We've kicked active balancing, reset the failure -- cgit v1.2.3 From cb83b629bae0327cf9f44f096adc38d150ceb913 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Tue, 17 Apr 2012 15:49:36 +0200 Subject: sched/numa: Rewrite the CONFIG_NUMA sched domain support The current code groups up to 16 nodes in a level and then puts an ALLNODES domain spanning the entire tree on top of that. This doesn't reflect the numa topology and esp for the smaller not-fully-connected machines out there today this might make a difference. Therefore, build a proper numa topology based on node_distance(). Since there's no fixed numa layers anymore, the static SD_NODE_INIT and SD_ALLNODES_INIT aren't usable anymore, the new code tries to construct something similar and scales some values either on the number of cpus in the domain and/or the node_distance() ratio. Signed-off-by: Peter Zijlstra Cc: Anton Blanchard Cc: Benjamin Herrenschmidt Cc: Chris Metcalf Cc: David Howells Cc: "David S. Miller" Cc: Fenghua Yu Cc: "H. 
Peter Anvin" Cc: Ivan Kokshaysky Cc: linux-alpha@vger.kernel.org Cc: linux-ia64@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: linux-mips@linux-mips.org Cc: linuxppc-dev@lists.ozlabs.org Cc: linux-sh@vger.kernel.org Cc: Matt Turner Cc: Paul Mackerras Cc: Paul Mundt Cc: Ralf Baechle Cc: Richard Henderson Cc: sparclinux@vger.kernel.org Cc: Tony Luck Cc: x86@kernel.org Cc: Dimitri Sivanich Cc: Greg Pearson Cc: KAMEZAWA Hiroyuki Cc: bob.picco@oracle.com Cc: chris.mason@oracle.com Cc: Linus Torvalds Cc: Andrew Morton Link: http://lkml.kernel.org/n/tip-r74n3n8hhuc2ynbrnp3vt954@git.kernel.org Signed-off-by: Ingo Molnar --- kernel/sched/core.c | 280 ++++++++++++++++++++++++++++++++++------------------ 1 file changed, 185 insertions(+), 95 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 6001e5c3b4e4..b4f2096980a3 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -5560,7 +5560,8 @@ static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level, break; } - if (cpumask_intersects(groupmask, sched_group_cpus(group))) { + if (!(sd->flags & SD_OVERLAP) && + cpumask_intersects(groupmask, sched_group_cpus(group))) { printk(KERN_CONT "\n"); printk(KERN_ERR "ERROR: repeated CPUs\n"); break; @@ -5898,92 +5899,6 @@ static int __init isolated_cpu_setup(char *str) __setup("isolcpus=", isolated_cpu_setup); -#ifdef CONFIG_NUMA - -/** - * find_next_best_node - find the next node to include in a sched_domain - * @node: node whose sched_domain we're building - * @used_nodes: nodes already in the sched_domain - * - * Find the next node to include in a given scheduling domain. Simply - * finds the closest node not already in the @used_nodes map. - * - * Should use nodemask_t. - */ -static int find_next_best_node(int node, nodemask_t *used_nodes) -{ - int i, n, val, min_val, best_node = -1; - - min_val = INT_MAX; - - for (i = 0; i < nr_node_ids; i++) { - /* Start at @node */ - n = (node + i) % nr_node_ids; - - if (!nr_cpus_node(n)) - continue; - - /* Skip already used nodes */ - if (node_isset(n, *used_nodes)) - continue; - - /* Simple min distance search */ - val = node_distance(node, n); - - if (val < min_val) { - min_val = val; - best_node = n; - } - } - - if (best_node != -1) - node_set(best_node, *used_nodes); - return best_node; -} - -/** - * sched_domain_node_span - get a cpumask for a node's sched_domain - * @node: node whose cpumask we're constructing - * @span: resulting cpumask - * - * Given a node, construct a good cpumask for its sched_domain to span. It - * should be one that prevents unnecessary balancing, but also spreads tasks - * out optimally. 
- */ -static void sched_domain_node_span(int node, struct cpumask *span) -{ - nodemask_t used_nodes; - int i; - - cpumask_clear(span); - nodes_clear(used_nodes); - - cpumask_or(span, span, cpumask_of_node(node)); - node_set(node, used_nodes); - - for (i = 1; i < SD_NODES_PER_DOMAIN; i++) { - int next_node = find_next_best_node(node, &used_nodes); - if (next_node < 0) - break; - cpumask_or(span, span, cpumask_of_node(next_node)); - } -} - -static const struct cpumask *cpu_node_mask(int cpu) -{ - lockdep_assert_held(&sched_domains_mutex); - - sched_domain_node_span(cpu_to_node(cpu), sched_domains_tmpmask); - - return sched_domains_tmpmask; -} - -static const struct cpumask *cpu_allnodes_mask(int cpu) -{ - return cpu_possible_mask; -} -#endif /* CONFIG_NUMA */ - static const struct cpumask *cpu_cpu_mask(int cpu) { return cpumask_of_node(cpu_to_node(cpu)); @@ -6020,6 +5935,7 @@ struct sched_domain_topology_level { sched_domain_init_f init; sched_domain_mask_f mask; int flags; + int numa_level; struct sd_data data; }; @@ -6213,10 +6129,6 @@ sd_init_##type(struct sched_domain_topology_level *tl, int cpu) \ } SD_INIT_FUNC(CPU) -#ifdef CONFIG_NUMA - SD_INIT_FUNC(ALLNODES) - SD_INIT_FUNC(NODE) -#endif #ifdef CONFIG_SCHED_SMT SD_INIT_FUNC(SIBLING) #endif @@ -6338,15 +6250,191 @@ static struct sched_domain_topology_level default_topology[] = { { sd_init_BOOK, cpu_book_mask, }, #endif { sd_init_CPU, cpu_cpu_mask, }, -#ifdef CONFIG_NUMA - { sd_init_NODE, cpu_node_mask, SDTL_OVERLAP, }, - { sd_init_ALLNODES, cpu_allnodes_mask, }, -#endif { NULL, }, }; static struct sched_domain_topology_level *sched_domain_topology = default_topology; +#ifdef CONFIG_NUMA + +static int sched_domains_numa_levels; +static int sched_domains_numa_scale; +static int *sched_domains_numa_distance; +static struct cpumask ***sched_domains_numa_masks; +static int sched_domains_curr_level; + +static inline unsigned long numa_scale(unsigned long x, int level) +{ + return x * sched_domains_numa_distance[level] / sched_domains_numa_scale; +} + +static inline int sd_local_flags(int level) +{ + if (sched_domains_numa_distance[level] > REMOTE_DISTANCE) + return 0; + + return SD_BALANCE_EXEC | SD_BALANCE_FORK | SD_WAKE_AFFINE; +} + +static struct sched_domain * +sd_numa_init(struct sched_domain_topology_level *tl, int cpu) +{ + struct sched_domain *sd = *per_cpu_ptr(tl->data.sd, cpu); + int level = tl->numa_level; + int sd_weight = cpumask_weight( + sched_domains_numa_masks[level][cpu_to_node(cpu)]); + + *sd = (struct sched_domain){ + .min_interval = sd_weight, + .max_interval = 2*sd_weight, + .busy_factor = 32, + .imbalance_pct = 100 + numa_scale(25, level), + .cache_nice_tries = 2, + .busy_idx = 3, + .idle_idx = 2, + .newidle_idx = 0, + .wake_idx = 0, + .forkexec_idx = 0, + + .flags = 1*SD_LOAD_BALANCE + | 1*SD_BALANCE_NEWIDLE + | 0*SD_BALANCE_EXEC + | 0*SD_BALANCE_FORK + | 0*SD_BALANCE_WAKE + | 0*SD_WAKE_AFFINE + | 0*SD_PREFER_LOCAL + | 0*SD_SHARE_CPUPOWER + | 0*SD_POWERSAVINGS_BALANCE + | 0*SD_SHARE_PKG_RESOURCES + | 1*SD_SERIALIZE + | 0*SD_PREFER_SIBLING + | sd_local_flags(level) + , + .last_balance = jiffies, + .balance_interval = sd_weight, + }; + SD_INIT_NAME(sd, NUMA); + sd->private = &tl->data; + + /* + * Ugly hack to pass state to sd_numa_mask()... 
+ */ + sched_domains_curr_level = tl->numa_level; + + return sd; +} + +static const struct cpumask *sd_numa_mask(int cpu) +{ + return sched_domains_numa_masks[sched_domains_curr_level][cpu_to_node(cpu)]; +} + +static void sched_init_numa(void) +{ + int next_distance, curr_distance = node_distance(0, 0); + struct sched_domain_topology_level *tl; + int level = 0; + int i, j, k; + + sched_domains_numa_scale = curr_distance; + sched_domains_numa_distance = kzalloc(sizeof(int) * nr_node_ids, GFP_KERNEL); + if (!sched_domains_numa_distance) + return; + + /* + * O(nr_nodes^2) deduplicating selection sort -- in order to find the + * unique distances in the node_distance() table. + * + * Assumes node_distance(0,j) includes all distances in + * node_distance(i,j) in order to avoid cubic time. + * + * XXX: could be optimized to O(n log n) by using sort() + */ + next_distance = curr_distance; + for (i = 0; i < nr_node_ids; i++) { + for (j = 0; j < nr_node_ids; j++) { + int distance = node_distance(0, j); + if (distance > curr_distance && + (distance < next_distance || + next_distance == curr_distance)) + next_distance = distance; + } + if (next_distance != curr_distance) { + sched_domains_numa_distance[level++] = next_distance; + sched_domains_numa_levels = level; + curr_distance = next_distance; + } else break; + } + /* + * 'level' contains the number of unique distances, excluding the + * identity distance node_distance(i,i). + * + * The sched_domains_nume_distance[] array includes the actual distance + * numbers. + */ + + sched_domains_numa_masks = kzalloc(sizeof(void *) * level, GFP_KERNEL); + if (!sched_domains_numa_masks) + return; + + /* + * Now for each level, construct a mask per node which contains all + * cpus of nodes that are that many hops away from us. + */ + for (i = 0; i < level; i++) { + sched_domains_numa_masks[i] = + kzalloc(nr_node_ids * sizeof(void *), GFP_KERNEL); + if (!sched_domains_numa_masks[i]) + return; + + for (j = 0; j < nr_node_ids; j++) { + struct cpumask *mask = kzalloc_node(cpumask_size(), GFP_KERNEL, j); + if (!mask) + return; + + sched_domains_numa_masks[i][j] = mask; + + for (k = 0; k < nr_node_ids; k++) { + if (node_distance(cpu_to_node(j), k) > + sched_domains_numa_distance[i]) + continue; + + cpumask_or(mask, mask, cpumask_of_node(k)); + } + } + } + + tl = kzalloc((ARRAY_SIZE(default_topology) + level) * + sizeof(struct sched_domain_topology_level), GFP_KERNEL); + if (!tl) + return; + + /* + * Copy the default topology bits.. + */ + for (i = 0; default_topology[i].init; i++) + tl[i] = default_topology[i]; + + /* + * .. and append 'j' levels of NUMA goodness. 
+ */ + for (j = 0; j < level; i++, j++) { + tl[i] = (struct sched_domain_topology_level){ + .init = sd_numa_init, + .mask = sd_numa_mask, + .flags = SDTL_OVERLAP, + .numa_level = j, + }; + } + + sched_domain_topology = tl; +} +#else +static inline void sched_init_numa(void) +{ +} +#endif /* CONFIG_NUMA */ + static int __sdt_alloc(const struct cpumask *cpu_map) { struct sched_domain_topology_level *tl; @@ -6840,6 +6928,8 @@ void __init sched_init_smp(void) alloc_cpumask_var(&non_isolated_cpus, GFP_KERNEL); alloc_cpumask_var(&fallback_doms, GFP_KERNEL); + sched_init_numa(); + get_online_cpus(); mutex_lock(&sched_domains_mutex); init_sched_domains(cpu_active_mask); -- cgit v1.2.3 From fd0d000b2c34aa43d4e92dcf0dfaeda7e123008a Mon Sep 17 00:00:00 2001 From: Robert Richter Date: Mon, 2 Apr 2012 20:19:08 +0200 Subject: perf: Pass last sampling period to perf_sample_data_init() We always need to pass the last sample period to perf_sample_data_init(), otherwise the event distribution will be wrong. Thus, modifiyng the function interface with the required period as argument. So basically a pattern like this: perf_sample_data_init(&data, ~0ULL); data.period = event->hw.last_period; will now be like that: perf_sample_data_init(&data, ~0ULL, event->hw.last_period); Avoids unininitialized data.period and simplifies code. Signed-off-by: Robert Richter Signed-off-by: Peter Zijlstra Link: http://lkml.kernel.org/r/1333390758-10893-3-git-send-email-robert.richter@amd.com Signed-off-by: Ingo Molnar --- kernel/events/core.c | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/kernel/events/core.c b/kernel/events/core.c index 9789a56b7d54..00c58df9f4e2 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -4957,7 +4957,7 @@ void __perf_sw_event(u32 event_id, u64 nr, struct pt_regs *regs, u64 addr) if (rctx < 0) return; - perf_sample_data_init(&data, addr); + perf_sample_data_init(&data, addr, 0); do_perf_sw_event(PERF_TYPE_SOFTWARE, event_id, nr, &data, regs); @@ -5215,7 +5215,7 @@ void perf_tp_event(u64 addr, u64 count, void *record, int entry_size, .data = record, }; - perf_sample_data_init(&data, addr); + perf_sample_data_init(&data, addr, 0); data.raw = &raw; hlist_for_each_entry_rcu(event, node, head, hlist_entry) { @@ -5318,7 +5318,7 @@ void perf_bp_event(struct perf_event *bp, void *data) struct perf_sample_data sample; struct pt_regs *regs = data; - perf_sample_data_init(&sample, bp->attr.bp_addr); + perf_sample_data_init(&sample, bp->attr.bp_addr, 0); if (!bp->hw.state && !perf_exclude_event(bp, regs)) perf_swevent_event(bp, 1, &sample, regs); @@ -5344,8 +5344,7 @@ static enum hrtimer_restart perf_swevent_hrtimer(struct hrtimer *hrtimer) event->pmu->read(event); - perf_sample_data_init(&data, 0); - data.period = event->hw.last_period; + perf_sample_data_init(&data, 0, event->hw.last_period); regs = get_irq_regs(); if (regs && !perf_exclude_event(event, regs)) { -- cgit v1.2.3 From cb04ff9ac424d0e689d9b612e9f73cb443ab4b7e Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Tue, 8 May 2012 18:56:04 +0200 Subject: sched, perf: Use a single callback into the scheduler We can easily use a single callback for both sched-in and sched-out. This reduces the code footprint in the scheduler path as well as removes the PMU black spot otherwise present between the out and in callback. 
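The diff below is limited to kernel/, so the matching header-side wrapper that prepare_task_switch() now calls is not shown; the following is only an assumed sketch of that wrapper, and the static-key guard name is an assumption rather than something taken from the patch:

    /* Assumed sketch of the include/linux/perf_event.h counterpart; the
     * perf_sched_events guard shown here is an assumption, not part of
     * the kernel/ diff in this commit. */
    static inline void perf_event_task_sched(struct task_struct *prev,
                                             struct task_struct *next)
    {
            /* Jump label keeps the context-switch path cheap when no
             * perf events exist anywhere in the system. */
            if (static_key_false(&perf_sched_events.key))
                    __perf_event_task_sched(prev, next);
    }

Funneling both directions through one call site is what closes the PMU "black spot" the changelog refers to: there is no longer a window between the old sched-out and sched-in callbacks.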
Signed-off-by: Peter Zijlstra Link: http://lkml.kernel.org/n/tip-o56ajxp1edwqg6x9d31wb805@git.kernel.org Signed-off-by: Ingo Molnar --- kernel/events/core.c | 14 ++++++++++---- kernel/sched/core.c | 9 +-------- 2 files changed, 11 insertions(+), 12 deletions(-) (limited to 'kernel') diff --git a/kernel/events/core.c b/kernel/events/core.c index 00c58df9f4e2..e82c7a1face9 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -2039,8 +2039,8 @@ static void perf_event_context_sched_out(struct task_struct *task, int ctxn, * accessing the event control register. If a NMI hits, then it will * not restart the event. */ -void __perf_event_task_sched_out(struct task_struct *task, - struct task_struct *next) +static void __perf_event_task_sched_out(struct task_struct *task, + struct task_struct *next) { int ctxn; @@ -2279,8 +2279,8 @@ static void perf_branch_stack_sched_in(struct task_struct *prev, * accessing the event control register. If a NMI hits, then it will * keep the event running. */ -void __perf_event_task_sched_in(struct task_struct *prev, - struct task_struct *task) +static void __perf_event_task_sched_in(struct task_struct *prev, + struct task_struct *task) { struct perf_event_context *ctx; int ctxn; @@ -2305,6 +2305,12 @@ void __perf_event_task_sched_in(struct task_struct *prev, perf_branch_stack_sched_in(prev, task); } +void __perf_event_task_sched(struct task_struct *prev, struct task_struct *next) +{ + __perf_event_task_sched_out(prev, next); + __perf_event_task_sched_in(prev, next); +} + static u64 perf_calculate_period(struct perf_event *event, u64 nsec, u64 count) { u64 frequency = event->attr.sample_freq; diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 4603b9d8f30a..5c692a0a555d 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -1913,7 +1913,7 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev, struct task_struct *next) { sched_info_switch(prev, next); - perf_event_task_sched_out(prev, next); + perf_event_task_sched(prev, next); fire_sched_out_preempt_notifiers(prev, next); prepare_lock_switch(rq, next); prepare_arch_switch(next); @@ -1956,13 +1956,6 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev) */ prev_state = prev->state; finish_arch_switch(prev); -#ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW - local_irq_disable(); -#endif /* __ARCH_WANT_INTERRUPTS_ON_CTXSW */ - perf_event_task_sched_in(prev, current); -#ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW - local_irq_enable(); -#endif /* __ARCH_WANT_INTERRUPTS_ON_CTXSW */ finish_lock_switch(rq, prev); finish_arch_post_lock_switch(); -- cgit v1.2.3 From 21e52e15666323078b8517a4312712579176b56f Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Mon, 30 Apr 2012 14:16:19 -0700 Subject: rcu: Make RCU_FAST_NO_HZ handle timer migration The current RCU_FAST_NO_HZ assumes that timers do not migrate unless a CPU goes offline, in which case it assumes that the CPU will have to come out of dyntick-idle mode (cancelling the timer) in order to go offline. This is important because when RCU_FAST_NO_HZ permits a CPU to enter dyntick-idle mode despite having RCU callbacks pending, it posts a timer on that CPU to force a wakeup on that CPU. This wakeup ensures that the CPU will eventually handle the end of the grace period, including invoking its RCU callbacks. However, Pascal Chapperon's test setup shows that the timer handler rcu_idle_gp_timer_func() really does get invoked in some cases. 
This is problematic because this can cause the CPU that entered dyntick-idle mode despite still having RCU callbacks pending to remain in dyntick-idle mode indefinitely, which means that its RCU callbacks might never be invoked. This situation can result in grace-period delays or even system hangs, which matches Pascal's observations of slow boot-up and shutdown (https://lkml.org/lkml/2012/4/5/142). See also the bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=806548 This commit therefore causes the "should never be invoked" timer handler rcu_idle_gp_timer_func() to use smp_call_function_single() to wake up the CPU for which the timer was intended, allowing that CPU to invoke its RCU callbacks in a timely manner. Reported-by: Pascal Chapperon Signed-off-by: Paul E. McKenney Signed-off-by: Paul E. McKenney --- kernel/rcutree_plugin.h | 24 +++++++++++++++++++++--- 1 file changed, 21 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h index d01e26df55a1..bbb43cad755e 100644 --- a/kernel/rcutree_plugin.h +++ b/kernel/rcutree_plugin.h @@ -2056,17 +2056,35 @@ static bool rcu_cpu_has_nonlazy_callbacks(int cpu) rcu_preempt_cpu_has_nonlazy_callbacks(cpu); } +/* + * Handler for smp_call_function_single(). The only point of this + * handler is to wake the CPU up, so the handler does only tracing. + */ +void rcu_idle_demigrate(void *unused) +{ + trace_rcu_prep_idle("Demigrate"); +} + /* * Timer handler used to force CPU to start pushing its remaining RCU * callbacks in the case where it entered dyntick-idle mode with callbacks * pending. The hander doesn't really need to do anything because the * real work is done upon re-entry to idle, or by the next scheduling-clock * interrupt should idle not be re-entered. + * + * One special case: the timer gets migrated without awakening the CPU + * on which the timer was scheduled on. In this case, we must wake up + * that CPU. We do so with smp_call_function_single(). */ -static void rcu_idle_gp_timer_func(unsigned long unused) +static void rcu_idle_gp_timer_func(unsigned long cpu_in) { - WARN_ON_ONCE(1); /* Getting here can hang the system... */ + int cpu = (int)cpu_in; + trace_rcu_prep_idle("Timer"); + if (cpu != smp_processor_id()) + smp_call_function_single(cpu, rcu_idle_demigrate, NULL, 0); + else + WARN_ON_ONCE(1); /* Getting here can hang the system... */ } /* @@ -2075,7 +2093,7 @@ static void rcu_idle_gp_timer_func(unsigned long unused) static void rcu_prepare_for_idle_init(int cpu) { setup_timer(&per_cpu(rcu_idle_gp_timer, cpu), - rcu_idle_gp_timer_func, 0); + rcu_idle_gp_timer_func, cpu); } /* -- cgit v1.2.3 From 98248a0e24327bc64eb7518145c44bff7bebebc3 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 3 May 2012 15:38:10 -0700 Subject: rcu: Explicitly initialize RCU_FAST_NO_HZ per-CPU variables The current initialization of the RCU_FAST_NO_HZ per-CPU variables makes needless and fragile assumptions about the initial value of things like the jiffies counter. This commit therefore explicitly initializes all of them that are better started with a non-zero value. It also adds some comments describing the per-CPU state variables. Signed-off-by: Paul E. McKenney Signed-off-by: Paul E. 
McKenney --- kernel/rcutree_plugin.h | 20 ++++++++++++++++---- 1 file changed, 16 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h index bbb43cad755e..7082ea93566f 100644 --- a/kernel/rcutree_plugin.h +++ b/kernel/rcutree_plugin.h @@ -1986,12 +1986,19 @@ static void rcu_idle_count_callbacks_posted(void) #define RCU_IDLE_GP_DELAY 6 /* Roughly one grace period. */ #define RCU_IDLE_LAZY_GP_DELAY (6 * HZ) /* Roughly six seconds. */ +/* Loop counter for rcu_prepare_for_idle(). */ static DEFINE_PER_CPU(int, rcu_dyntick_drain); +/* If rcu_dyntick_holdoff==jiffies, don't try to enter dyntick-idle mode. */ static DEFINE_PER_CPU(unsigned long, rcu_dyntick_holdoff); +/* Timer to awaken the CPU if it enters dyntick-idle mode with callbacks. */ static DEFINE_PER_CPU(struct timer_list, rcu_idle_gp_timer); +/* Scheduled expiry time for rcu_idle_gp_timer to allow reposting. */ static DEFINE_PER_CPU(unsigned long, rcu_idle_gp_timer_expires); +/* Enable special processing on first attempt to enter dyntick-idle mode. */ static DEFINE_PER_CPU(bool, rcu_idle_first_pass); +/* Running count of non-lazy callbacks posted, never decremented. */ static DEFINE_PER_CPU(unsigned long, rcu_nonlazy_posted); +/* Snapshot of rcu_nonlazy_posted to detect meaningful exits from idle. */ static DEFINE_PER_CPU(unsigned long, rcu_nonlazy_posted_snap); /* @@ -2092,8 +2099,11 @@ static void rcu_idle_gp_timer_func(unsigned long cpu_in) */ static void rcu_prepare_for_idle_init(int cpu) { + per_cpu(rcu_dyntick_holdoff, cpu) = jiffies - 1; setup_timer(&per_cpu(rcu_idle_gp_timer, cpu), rcu_idle_gp_timer_func, cpu); + per_cpu(rcu_idle_gp_timer_expires, cpu) = jiffies - 1; + per_cpu(rcu_idle_first_pass, cpu) = 1; } /* @@ -2232,10 +2242,12 @@ static void rcu_prepare_for_idle(int cpu) } /* - * Keep a running count of callbacks posted so that rcu_prepare_for_idle() - * can detect when something out of the idle loop posts a callback. - * Of course, it had better do so either from a trace event designed to - * be called from idle or from within RCU_NONIDLE(). + * Keep a running count of the number of non-lazy callbacks posted + * on this CPU. This running counter (which is never decremented) allows + * rcu_prepare_for_idle() to detect when something out of the idle loop + * posts a callback, even if an equal number of callbacks are invoked. + * Of course, callbacks should only be posted from within a trace event + * designed to be called from idle or from within RCU_NONIDLE(). */ static void rcu_idle_count_callbacks_posted(void) { -- cgit v1.2.3 From b1420f1c8bfc30ecf6380a31d0f686884834b599 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 1 Mar 2012 13:18:08 -0800 Subject: rcu: Make rcu_barrier() less disruptive The rcu_barrier() primitive interrupts each and every CPU, registering a callback on every CPU. Once all of these callbacks have been invoked, rcu_barrier() knows that every callback that was registered before the call to rcu_barrier() has also been invoked. However, there is no point in registering a callback on a CPU that currently has no callbacks, most especially if that CPU is in a deep idle state. This commit therefore makes rcu_barrier() avoid interrupting CPUs that have no callbacks. Doing this requires reworking the handling of orphaned callbacks, otherwise callbacks could slip through rcu_barrier()'s net by being orphaned from a CPU that rcu_barrier() had not yet interrupted to a CPU that rcu_barrier() had already interrupted. 
This reworking was needed anyway to take a first step towards weaning RCU from the CPU_DYING notifier's use of stop_cpu(). Signed-off-by: Paul E. McKenney Signed-off-by: Paul E. McKenney --- kernel/rcutree.c | 295 +++++++++++++++++++++++++++++++++++-------------- kernel/rcutree.h | 11 ++ kernel/rcutree_trace.c | 4 +- 3 files changed, 222 insertions(+), 88 deletions(-) (limited to 'kernel') diff --git a/kernel/rcutree.c b/kernel/rcutree.c index 403306b86e78..e578dd327c64 100644 --- a/kernel/rcutree.c +++ b/kernel/rcutree.c @@ -75,6 +75,8 @@ static struct lock_class_key rcu_node_class[NUM_RCU_LVLS]; .gpnum = -300, \ .completed = -300, \ .onofflock = __RAW_SPIN_LOCK_UNLOCKED(&structname##_state.onofflock), \ + .orphan_nxttail = &structname##_state.orphan_nxtlist, \ + .orphan_donetail = &structname##_state.orphan_donelist, \ .fqslock = __RAW_SPIN_LOCK_UNLOCKED(&structname##_state.fqslock), \ .n_force_qs = 0, \ .n_force_qs_ngp = 0, \ @@ -145,6 +147,13 @@ static void invoke_rcu_callbacks(struct rcu_state *rsp, struct rcu_data *rdp); unsigned long rcutorture_testseq; unsigned long rcutorture_vernum; +/* State information for rcu_barrier() and friends. */ + +static DEFINE_PER_CPU(struct rcu_head, rcu_barrier_head) = {NULL}; +static atomic_t rcu_barrier_cpu_count; +static DEFINE_MUTEX(rcu_barrier_mutex); +static struct completion rcu_barrier_completion; + /* * Return true if an RCU grace period is in progress. The ACCESS_ONCE()s * permit this function to be invoked without holding the root rcu_node @@ -1311,95 +1320,133 @@ rcu_check_quiescent_state(struct rcu_state *rsp, struct rcu_data *rdp) #ifdef CONFIG_HOTPLUG_CPU /* - * Move a dying CPU's RCU callbacks to online CPU's callback list. - * Also record a quiescent state for this CPU for the current grace period. - * Synchronization and interrupt disabling are not required because - * this function executes in stop_machine() context. Therefore, cleanup - * operations that might block must be done later from the CPU_DEAD - * notifier. - * - * Note that the outgoing CPU's bit has already been cleared in the - * cpu_online_mask. This allows us to randomly pick a callback - * destination from the bits set in that mask. + * Send the specified CPU's RCU callbacks to the orphanage. The + * specified CPU must be offline, and the caller must hold the + * ->onofflock. */ -static void rcu_cleanup_dying_cpu(struct rcu_state *rsp) +static void +rcu_send_cbs_to_orphanage(int cpu, struct rcu_state *rsp, + struct rcu_node *rnp, struct rcu_data *rdp) { int i; - unsigned long mask; - int receive_cpu = cpumask_any(cpu_online_mask); - struct rcu_data *rdp = this_cpu_ptr(rsp->rda); - struct rcu_data *receive_rdp = per_cpu_ptr(rsp->rda, receive_cpu); - RCU_TRACE(struct rcu_node *rnp = rdp->mynode); /* For dying CPU. */ - /* First, adjust the counts. */ + /* + * Orphan the callbacks. First adjust the counts. This is safe + * because ->onofflock excludes _rcu_barrier()'s adoption of + * the callbacks, thus no memory barrier is required. + */ if (rdp->nxtlist != NULL) { - receive_rdp->qlen_lazy += rdp->qlen_lazy; - receive_rdp->qlen += rdp->qlen; + rsp->qlen_lazy += rdp->qlen_lazy; + rsp->qlen += rdp->qlen; + rdp->n_cbs_orphaned += rdp->qlen; rdp->qlen_lazy = 0; rdp->qlen = 0; } /* - * Next, move ready-to-invoke callbacks to be invoked on some - * other CPU. These will not be required to pass through another - * grace period: They are done, regardless of CPU. 
+ * Next, move those callbacks still needing a grace period to + * the orphanage, where some other CPU will pick them up. + * Some of the callbacks might have gone partway through a grace + * period, but that is too bad. They get to start over because we + * cannot assume that grace periods are synchronized across CPUs. + * We don't bother updating the ->nxttail[] array yet, instead + * we just reset the whole thing later on. */ - if (rdp->nxtlist != NULL && - rdp->nxttail[RCU_DONE_TAIL] != &rdp->nxtlist) { - struct rcu_head *oldhead; - struct rcu_head **oldtail; - struct rcu_head **newtail; - - oldhead = rdp->nxtlist; - oldtail = receive_rdp->nxttail[RCU_DONE_TAIL]; - rdp->nxtlist = *rdp->nxttail[RCU_DONE_TAIL]; - *rdp->nxttail[RCU_DONE_TAIL] = *oldtail; - *receive_rdp->nxttail[RCU_DONE_TAIL] = oldhead; - newtail = rdp->nxttail[RCU_DONE_TAIL]; - for (i = RCU_DONE_TAIL; i < RCU_NEXT_SIZE; i++) { - if (receive_rdp->nxttail[i] == oldtail) - receive_rdp->nxttail[i] = newtail; - if (rdp->nxttail[i] == newtail) - rdp->nxttail[i] = &rdp->nxtlist; - } + if (*rdp->nxttail[RCU_DONE_TAIL] != NULL) { + *rsp->orphan_nxttail = *rdp->nxttail[RCU_DONE_TAIL]; + rsp->orphan_nxttail = rdp->nxttail[RCU_NEXT_TAIL]; + *rdp->nxttail[RCU_DONE_TAIL] = NULL; } /* - * Finally, put the rest of the callbacks at the end of the list. - * The ones that made it partway through get to start over: We - * cannot assume that grace periods are synchronized across CPUs. - * (We could splice RCU_WAIT_TAIL into RCU_NEXT_READY_TAIL, but - * this does not seem compelling. Not yet, anyway.) + * Then move the ready-to-invoke callbacks to the orphanage, + * where some other CPU will pick them up. These will not be + * required to pass though another grace period: They are done. */ if (rdp->nxtlist != NULL) { - *receive_rdp->nxttail[RCU_NEXT_TAIL] = rdp->nxtlist; - receive_rdp->nxttail[RCU_NEXT_TAIL] = - rdp->nxttail[RCU_NEXT_TAIL]; - receive_rdp->n_cbs_adopted += rdp->qlen; - rdp->n_cbs_orphaned += rdp->qlen; - - rdp->nxtlist = NULL; - for (i = 0; i < RCU_NEXT_SIZE; i++) - rdp->nxttail[i] = &rdp->nxtlist; + *rsp->orphan_donetail = rdp->nxtlist; + rsp->orphan_donetail = rdp->nxttail[RCU_DONE_TAIL]; } + /* Finally, initialize the rcu_data structure's list to empty. */ + rdp->nxtlist = NULL; + for (i = 0; i < RCU_NEXT_SIZE; i++) + rdp->nxttail[i] = &rdp->nxtlist; +} + +/* + * Adopt the RCU callbacks from the specified rcu_state structure's + * orphanage. The caller must hold the ->onofflock. + */ +static void rcu_adopt_orphan_cbs(struct rcu_state *rsp) +{ + int i; + struct rcu_data *rdp = __this_cpu_ptr(rsp->rda); + /* - * Record a quiescent state for the dying CPU. This is safe - * only because we have already cleared out the callbacks. - * (Otherwise, the RCU core might try to schedule the invocation - * of callbacks on this now-offline CPU, which would be bad.) + * If there is an rcu_barrier() operation in progress, then + * only the task doing that operation is permitted to adopt + * callbacks. To do otherwise breaks rcu_barrier() and friends + * by causing them to fail to wait for the callbacks in the + * orphanage. */ - mask = rdp->grpmask; /* rnp->grplo is constant. */ + if (rsp->rcu_barrier_in_progress && + rsp->rcu_barrier_in_progress != current) + return; + + /* Do the accounting first. 
*/ + rdp->qlen_lazy += rsp->qlen_lazy; + rdp->qlen += rsp->qlen; + rdp->n_cbs_adopted += rsp->qlen; + rsp->qlen_lazy = 0; + rsp->qlen = 0; + + /* + * We do not need a memory barrier here because the only way we + * can get here if there is an rcu_barrier() in flight is if + * we are the task doing the rcu_barrier(). + */ + + /* First adopt the ready-to-invoke callbacks. */ + if (rsp->orphan_donelist != NULL) { + *rsp->orphan_donetail = *rdp->nxttail[RCU_DONE_TAIL]; + *rdp->nxttail[RCU_DONE_TAIL] = rsp->orphan_donelist; + for (i = RCU_NEXT_SIZE - 1; i >= RCU_DONE_TAIL; i--) + if (rdp->nxttail[i] == rdp->nxttail[RCU_DONE_TAIL]) + rdp->nxttail[i] = rsp->orphan_donetail; + rsp->orphan_donelist = NULL; + rsp->orphan_donetail = &rsp->orphan_donelist; + } + + /* And then adopt the callbacks that still need a grace period. */ + if (rsp->orphan_nxtlist != NULL) { + *rdp->nxttail[RCU_NEXT_TAIL] = rsp->orphan_nxtlist; + rdp->nxttail[RCU_NEXT_TAIL] = rsp->orphan_nxttail; + rsp->orphan_nxtlist = NULL; + rsp->orphan_nxttail = &rsp->orphan_nxtlist; + } +} + +/* + * Trace the fact that this CPU is going offline. + */ +static void rcu_cleanup_dying_cpu(struct rcu_state *rsp) +{ + RCU_TRACE(unsigned long mask); + RCU_TRACE(struct rcu_data *rdp = this_cpu_ptr(rsp->rda)); + RCU_TRACE(struct rcu_node *rnp = rdp->mynode); + + RCU_TRACE(mask = rdp->grpmask); trace_rcu_grace_period(rsp->name, rnp->gpnum + 1 - !!(rnp->qsmask & mask), "cpuofl"); - rcu_report_qs_rdp(smp_processor_id(), rsp, rdp, rsp->gpnum); - /* Note that rcu_report_qs_rdp() might call trace_rcu_grace_period(). */ } /* * The CPU has been completely removed, and some other CPU is reporting - * this fact from process context. Do the remainder of the cleanup. + * this fact from process context. Do the remainder of the cleanup, + * including orphaning the outgoing CPU's RCU callbacks, and also + * adopting them, if there is no _rcu_barrier() instance running. * There can only be one CPU hotplug operation at a time, so no other * CPU can be attempting to update rcu_cpu_kthread_task. */ @@ -1409,17 +1456,21 @@ static void rcu_cleanup_dead_cpu(int cpu, struct rcu_state *rsp) unsigned long mask; int need_report = 0; struct rcu_data *rdp = per_cpu_ptr(rsp->rda, cpu); - struct rcu_node *rnp = rdp->mynode; /* Outgoing CPU's rnp. */ + struct rcu_node *rnp = rdp->mynode; /* Outgoing CPU's rdp & rnp. */ /* Adjust any no-longer-needed kthreads. */ rcu_stop_cpu_kthread(cpu); rcu_node_kthread_setaffinity(rnp, -1); - /* Remove the dying CPU from the bitmasks in the rcu_node hierarchy. */ + /* Remove the dead CPU from the bitmasks in the rcu_node hierarchy. */ /* Exclude any attempts to start a new grace period. */ raw_spin_lock_irqsave(&rsp->onofflock, flags); + /* Orphan the dead CPU's callbacks, and adopt them if appropriate. */ + rcu_send_cbs_to_orphanage(cpu, rsp, rnp, rdp); + rcu_adopt_orphan_cbs(rsp); + /* Remove the outgoing CPU from the masks in the rcu_node hierarchy. */ mask = rdp->grpmask; /* rnp->grplo is constant. */ do { @@ -1456,6 +1507,10 @@ static void rcu_cleanup_dead_cpu(int cpu, struct rcu_state *rsp) #else /* #ifdef CONFIG_HOTPLUG_CPU */ +static void rcu_adopt_orphan_cbs(struct rcu_state *rsp) +{ +} + static void rcu_cleanup_dying_cpu(struct rcu_state *rsp) { } @@ -1524,9 +1579,6 @@ static void rcu_do_batch(struct rcu_state *rsp, struct rcu_data *rdp) rcu_is_callbacks_kthread()); /* Update count, and requeue any remaining callbacks. 
*/ - rdp->qlen_lazy -= count_lazy; - rdp->qlen -= count; - rdp->n_cbs_invoked += count; if (list != NULL) { *tail = rdp->nxtlist; rdp->nxtlist = list; @@ -1536,6 +1588,10 @@ static void rcu_do_batch(struct rcu_state *rsp, struct rcu_data *rdp) else break; } + smp_mb(); /* List handling before counting for rcu_barrier(). */ + rdp->qlen_lazy -= count_lazy; + rdp->qlen -= count; + rdp->n_cbs_invoked += count; /* Reinstate batch limit if we have worked down the excess. */ if (rdp->blimit == LONG_MAX && rdp->qlen <= qlowmark) @@ -1824,13 +1880,14 @@ __call_rcu(struct rcu_head *head, void (*func)(struct rcu_head *rcu), rdp = this_cpu_ptr(rsp->rda); /* Add the callback to our list. */ - *rdp->nxttail[RCU_NEXT_TAIL] = head; - rdp->nxttail[RCU_NEXT_TAIL] = &head->next; rdp->qlen++; if (lazy) rdp->qlen_lazy++; else rcu_idle_count_callbacks_posted(); + smp_mb(); /* Count before adding callback for rcu_barrier(). */ + *rdp->nxttail[RCU_NEXT_TAIL] = head; + rdp->nxttail[RCU_NEXT_TAIL] = &head->next; if (__is_kfree_rcu_offset((unsigned long)func)) trace_rcu_kfree_callback(rsp->name, head, (unsigned long)func, @@ -2169,11 +2226,10 @@ static int rcu_cpu_has_callbacks(int cpu) rcu_preempt_cpu_has_callbacks(cpu); } -static DEFINE_PER_CPU(struct rcu_head, rcu_barrier_head) = {NULL}; -static atomic_t rcu_barrier_cpu_count; -static DEFINE_MUTEX(rcu_barrier_mutex); -static struct completion rcu_barrier_completion; - +/* + * RCU callback function for _rcu_barrier(). If we are last, wake + * up the task executing _rcu_barrier(). + */ static void rcu_barrier_callback(struct rcu_head *notused) { if (atomic_dec_and_test(&rcu_barrier_cpu_count)) @@ -2203,27 +2259,94 @@ static void _rcu_barrier(struct rcu_state *rsp, void (*call_rcu_func)(struct rcu_head *head, void (*func)(struct rcu_head *head))) { - BUG_ON(in_interrupt()); + int cpu; + unsigned long flags; + struct rcu_data *rdp; + struct rcu_head rh; + + init_rcu_head_on_stack(&rh); + /* Take mutex to serialize concurrent rcu_barrier() requests. */ mutex_lock(&rcu_barrier_mutex); - init_completion(&rcu_barrier_completion); + + smp_mb(); /* Prevent any prior operations from leaking in. */ + /* - * Initialize rcu_barrier_cpu_count to 1, then invoke - * rcu_barrier_func() on each CPU, so that each CPU also has - * incremented rcu_barrier_cpu_count. Only then is it safe to - * decrement rcu_barrier_cpu_count -- otherwise the first CPU - * might complete its grace period before all of the other CPUs - * did their increment, causing this function to return too - * early. Note that on_each_cpu() disables irqs, which prevents - * any CPUs from coming online or going offline until each online - * CPU has queued its RCU-barrier callback. + * Initialize the count to one rather than to zero in order to + * avoid a too-soon return to zero in case of a short grace period + * (or preemption of this task). Also flag this task as doing + * an rcu_barrier(). This will prevent anyone else from adopting + * orphaned callbacks, which could cause otherwise failure if a + * CPU went offline and quickly came back online. To see this, + * consider the following sequence of events: + * + * 1. We cause CPU 0 to post an rcu_barrier_callback() callback. + * 2. CPU 1 goes offline, orphaning its callbacks. + * 3. CPU 0 adopts CPU 1's orphaned callbacks. + * 4. CPU 1 comes back online. + * 5. We cause CPU 1 to post an rcu_barrier_callback() callback. + * 6. Both rcu_barrier_callback() callbacks are invoked, awakening + * us -- but before CPU 1's orphaned callbacks are invoked!!! 
*/ + init_completion(&rcu_barrier_completion); atomic_set(&rcu_barrier_cpu_count, 1); - on_each_cpu(rcu_barrier_func, (void *)call_rcu_func, 1); + raw_spin_lock_irqsave(&rsp->onofflock, flags); + rsp->rcu_barrier_in_progress = current; + raw_spin_unlock_irqrestore(&rsp->onofflock, flags); + + /* + * Force every CPU with callbacks to register a new callback + * that will tell us when all the preceding callbacks have + * been invoked. If an offline CPU has callbacks, wait for + * it to either come back online or to finish orphaning those + * callbacks. + */ + for_each_possible_cpu(cpu) { + preempt_disable(); + rdp = per_cpu_ptr(rsp->rda, cpu); + if (cpu_is_offline(cpu)) { + preempt_enable(); + while (cpu_is_offline(cpu) && ACCESS_ONCE(rdp->qlen)) + schedule_timeout_interruptible(1); + } else if (ACCESS_ONCE(rdp->qlen)) { + smp_call_function_single(cpu, rcu_barrier_func, + (void *)call_rcu_func, 1); + preempt_enable(); + } else { + preempt_enable(); + } + } + + /* + * Now that all online CPUs have rcu_barrier_callback() callbacks + * posted, we can adopt all of the orphaned callbacks and place + * an rcu_barrier_callback() callback after them. When that is done, + * we are guaranteed to have an rcu_barrier_callback() callback + * following every callback that could possibly have been + * registered before _rcu_barrier() was called. + */ + raw_spin_lock_irqsave(&rsp->onofflock, flags); + rcu_adopt_orphan_cbs(rsp); + rsp->rcu_barrier_in_progress = NULL; + raw_spin_unlock_irqrestore(&rsp->onofflock, flags); + atomic_inc(&rcu_barrier_cpu_count); + smp_mb__after_atomic_inc(); /* Ensure atomic_inc() before callback. */ + call_rcu_func(&rh, rcu_barrier_callback); + + /* + * Now that we have an rcu_barrier_callback() callback on each + * CPU, and thus each counted, remove the initial count. + */ if (atomic_dec_and_test(&rcu_barrier_cpu_count)) complete(&rcu_barrier_completion); + + /* Wait for all rcu_barrier_callback() callbacks to be invoked. */ wait_for_completion(&rcu_barrier_completion); + + /* Other rcu_barrier() invocations can now safely proceed. */ mutex_unlock(&rcu_barrier_mutex); + + destroy_rcu_head_on_stack(&rh); } /** diff --git a/kernel/rcutree.h b/kernel/rcutree.h index 36ca28ecedc6..1e49c5685960 100644 --- a/kernel/rcutree.h +++ b/kernel/rcutree.h @@ -371,6 +371,17 @@ struct rcu_state { raw_spinlock_t onofflock; /* exclude on/offline and */ /* starting new GP. */ + struct rcu_head *orphan_nxtlist; /* Orphaned callbacks that */ + /* need a grace period. */ + struct rcu_head **orphan_nxttail; /* Tail of above. */ + struct rcu_head *orphan_donelist; /* Orphaned callbacks that */ + /* are ready to invoke. */ + struct rcu_head **orphan_donetail; /* Tail of above. */ + long qlen_lazy; /* Number of lazy callbacks. */ + long qlen; /* Total number of callbacks. */ + struct task_struct *rcu_barrier_in_progress; + /* Task doing rcu_barrier(), */ + /* or NULL if no barrier. */ raw_spinlock_t fqslock; /* Only one task forcing */ /* quiescent states. 
*/ unsigned long jiffies_force_qs; /* Time at which to invoke */ diff --git a/kernel/rcutree_trace.c b/kernel/rcutree_trace.c index ed459edeff43..d4bc16ddd1d4 100644 --- a/kernel/rcutree_trace.c +++ b/kernel/rcutree_trace.c @@ -271,13 +271,13 @@ static void print_one_rcu_state(struct seq_file *m, struct rcu_state *rsp) gpnum = rsp->gpnum; seq_printf(m, "c=%lu g=%lu s=%d jfq=%ld j=%x " - "nfqs=%lu/nfqsng=%lu(%lu) fqlh=%lu\n", + "nfqs=%lu/nfqsng=%lu(%lu) fqlh=%lu oqlen=%ld/%ld\n", rsp->completed, gpnum, rsp->fqs_state, (long)(rsp->jiffies_force_qs - jiffies), (int)(jiffies & 0xffff), rsp->n_force_qs, rsp->n_force_qs_ngp, rsp->n_force_qs - rsp->n_force_qs_ngp, - rsp->n_force_qs_lh); + rsp->n_force_qs_lh, rsp->qlen_lazy, rsp->qlen); for (rnp = &rsp->node[0]; rnp - &rsp->node[0] < NUM_RCU_NODES; rnp++) { if (rnp->level != level) { seq_puts(m, "\n"); -- cgit v1.2.3 From 7f3a781d6fd81e397c3928c9af33f1fc63232db6 Mon Sep 17 00:00:00 2001 From: Kay Sievers Date: Wed, 9 May 2012 01:37:51 +0200 Subject: printk - fix compilation for CONFIG_PRINTK=n Reported-by: Randy Dunlap Signed-off-by: Kay Sievers Signed-off-by: Greg Kroah-Hartman --- kernel/printk.c | 41 ++++++++++++++++++++++------------------- 1 file changed, 22 insertions(+), 19 deletions(-) (limited to 'kernel') diff --git a/kernel/printk.c b/kernel/printk.c index 96d4cc892255..7b432951f91e 100644 --- a/kernel/printk.c +++ b/kernel/printk.c @@ -126,7 +126,6 @@ EXPORT_SYMBOL(console_set_on_cmdline); /* Flag: console code may call schedule() */ static int console_may_schedule; -#ifdef CONFIG_PRINTK /* * The printk log buffer consists of a chain of concatenated variable * length records. Every record starts with a record header, containing @@ -208,16 +207,9 @@ struct log { */ static DEFINE_RAW_SPINLOCK(logbuf_lock); -/* cpu currently holding logbuf_lock */ -static volatile unsigned int logbuf_cpu = UINT_MAX; - -#define LOG_LINE_MAX 1024 - -/* record buffer */ -#define __LOG_BUF_LEN (1 << CONFIG_LOG_BUF_SHIFT) -static char __log_buf[__LOG_BUF_LEN]; -static char *log_buf = __log_buf; -static u32 log_buf_len = __LOG_BUF_LEN; +/* the next printk record to read by syslog(READ) or /proc/kmsg */ +static u64 syslog_seq; +static u32 syslog_idx; /* index and sequence number of the first record stored in the buffer */ static u64 log_first_seq; @@ -225,15 +217,23 @@ static u32 log_first_idx; /* index and sequence number of the next record to store in the buffer */ static u64 log_next_seq; +#ifdef CONFIG_PRINTK static u32 log_next_idx; /* the next printk record to read after the last 'clear' command */ static u64 clear_seq; static u32 clear_idx; -/* the next printk record to read by syslog(READ) or /proc/kmsg */ -static u64 syslog_seq; -static u32 syslog_idx; +#define LOG_LINE_MAX 1024 + +/* record buffer */ +#define __LOG_BUF_LEN (1 << CONFIG_LOG_BUF_SHIFT) +static char __log_buf[__LOG_BUF_LEN]; +static char *log_buf = __log_buf; +static u32 log_buf_len = __LOG_BUF_LEN; + +/* cpu currently holding logbuf_lock */ +static volatile unsigned int logbuf_cpu = UINT_MAX; /* human readable text of the record */ static char *log_text(const struct log *msg) @@ -1425,13 +1425,16 @@ asmlinkage int printk(const char *fmt, ...) 
return r; } EXPORT_SYMBOL(printk); + #else -static void call_console_drivers(int level, const char *text, size_t len) -{ -} +#define LOG_LINE_MAX 0 +static struct log *log_from_idx(u32 idx) { return NULL; } +static u32 log_next(u32 idx) { return 0; } +static char *log_text(const struct log *msg) { return NULL; } +static void call_console_drivers(int level, const char *text, size_t len) {} -#endif +#endif /* CONFIG_PRINTK */ static int __add_preferred_console(char *name, int idx, char *options, char *brl_options) @@ -1715,7 +1718,7 @@ static u32 console_idx; * by printk(). If this is the case, console_unlock(); emits * the output prior to releasing the lock. * - * If there is output waiting, we wake it /dev/kmsg and syslog() users. + * If there is output waiting, we wake /dev/kmsg and syslog() users. * * console_unlock(); may be called from any context. */ -- cgit v1.2.3 From 5c5d5ca51abd728c8de3be43ffd6bb00f977bfcd Mon Sep 17 00:00:00 2001 From: Kay Sievers Date: Thu, 10 May 2012 04:32:53 +0200 Subject: printk() - do not merge continuation lines of different threads This prevents the merging of printk() continuation lines of different threads, in the case they race against each other. It should properly isolate "atomic" single-line printk() users from continuation users, to make sure the single-line users will never be merged with the racy continuation ones. Cc: Linus Torvalds Cc: Ingo Molnar Cc: Jonathan Corbet Cc: Sasha Levin Signed-off-by: Kay Sievers Signed-off-by: Greg Kroah-Hartman --- kernel/printk.c | 19 ++++++++++--------- 1 file changed, 10 insertions(+), 9 deletions(-) (limited to 'kernel') diff --git a/kernel/printk.c b/kernel/printk.c index 7b432951f91e..301fb0f09fbf 100644 --- a/kernel/printk.c +++ b/kernel/printk.c @@ -1230,12 +1230,13 @@ asmlinkage int vprintk_emit(int facility, int level, static size_t buflen; static int buflevel; static char textbuf[LOG_LINE_MAX]; + static struct task_struct *cont; char *text = textbuf; size_t textlen; unsigned long flags; int this_cpu; bool newline = false; - bool cont = false; + bool prefix = false; int printed_len = 0; boot_delay_msec(); @@ -1295,20 +1296,16 @@ asmlinkage int vprintk_emit(int facility, int level, case '0' ... '7': if (level == -1) level = text[1] - '0'; - text += 3; - textlen -= 3; - break; - case 'c': /* KERN_CONT */ - cont = true; case 'd': /* KERN_DEFAULT */ + prefix = true; + case 'c': /* KERN_CONT */ text += 3; textlen -= 3; - break; } } - if (buflen && (!cont || dict)) { - /* no continuation; flush existing buffer */ + if (buflen && (prefix || dict || cont != current)) { + /* flush existing buffer */ log_store(facility, buflevel, NULL, 0, buf, buflen); printed_len += buflen; buflen = 0; @@ -1342,6 +1339,10 @@ asmlinkage int vprintk_emit(int facility, int level, dict, dictlen, text, textlen); printed_len += textlen; } + cont = NULL; + } else { + /* remember thread which filled the buffer */ + cont = current; } /* -- cgit v1.2.3 From 649e6ee33f73ba1c4f2492c6de9aff2254b540cb Mon Sep 17 00:00:00 2001 From: Kay Sievers Date: Thu, 10 May 2012 04:30:45 +0200 Subject: printk() - restore timestamp printing at console output The output of the timestamps got lost with the conversion of the kmsg buffer to records; restore the old behavior. Document, that CONFIG_PRINTK_TIME now only controls the output of the timestamps in the syslog() system call and on the console, and not the recording of the timestamps. 
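To make the restored behaviour concrete, the minimal userspace sketch below (with a made-up ts_nsec value) reproduces the formatting done by the new prepend_timestamp() helper, turning a nanosecond record timestamp into the usual console prefix:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
            uint64_t ts_nsec = 5123456789ULL;        /* made-up record timestamp */
            unsigned long sec = ts_nsec / 1000000000ULL;
            unsigned long rem_ns = ts_nsec % 1000000000ULL;

            /* Same format as the kernel helper: seconds, then microseconds. */
            printf("[%5lu.%06lu] example message\n", sec, rem_ns / 1000);
            return 0;
    }

This prints "[    5.123456] example message", the prefix that had disappeared from console output after the conversion of the kmsg buffer to records.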
Cc: Joe Perches Cc: Linus Torvalds Cc: Sasha Levin Cc: Ingo Molnar Reported-by: Yinghai Lu Signed-off-by: Kay Sievers Signed-off-by: Greg Kroah-Hartman --- kernel/printk.c | 43 ++++++++++++++++++++++++++----------------- 1 file changed, 26 insertions(+), 17 deletions(-) (limited to 'kernel') diff --git a/kernel/printk.c b/kernel/printk.c index 301fb0f09fbf..572941d7e5f7 100644 --- a/kernel/printk.c +++ b/kernel/printk.c @@ -786,6 +786,22 @@ static bool printk_time; #endif module_param_named(time, printk_time, bool, S_IRUGO | S_IWUSR); +static size_t prepend_timestamp(unsigned long long t, char *buf) +{ + unsigned long rem_ns; + + if (!printk_time) + return 0; + + if (!buf) + return 15; + + rem_ns = do_div(t, 1000000000); + + return sprintf(buf, "[%5lu.%06lu] ", + (unsigned long) t, rem_ns / 1000); +} + static int syslog_print_line(u32 idx, char *text, size_t size) { struct log *msg; @@ -800,9 +816,7 @@ static int syslog_print_line(u32 idx, char *text, size_t size) len++; if (msg->level > 99) len++; - - if (printk_time) - len += 15; + len += prepend_timestamp(0, NULL); len += msg->text_len; len++; @@ -810,15 +824,7 @@ static int syslog_print_line(u32 idx, char *text, size_t size) } len = sprintf(text, "<%u>", msg->level); - - if (printk_time) { - unsigned long long t = msg->ts_nsec; - unsigned long rem_ns = do_div(t, 1000000000); - - len += sprintf(text + len, "[%5lu.%06lu] ", - (unsigned long) t, rem_ns / 1000); - } - + len += prepend_timestamp(msg->ts_nsec, text + len); if (len + msg->text_len > size) return -EINVAL; memcpy(text + len, log_text(msg), msg->text_len); @@ -1741,7 +1747,7 @@ again: for (;;) { struct log *msg; static char text[LOG_LINE_MAX]; - size_t len; + size_t len, l; int level; raw_spin_lock_irqsave(&logbuf_lock, flags); @@ -1761,10 +1767,13 @@ again: msg = log_from_idx(console_idx); level = msg->level & 7; - len = msg->text_len; - if (len+1 >= sizeof(text)) - len = sizeof(text)-1; - memcpy(text, log_text(msg), len); + + len = prepend_timestamp(msg->ts_nsec, text); + l = msg->text_len; + if (len + l + 1 >= sizeof(text)) + l = sizeof(text) - len - 1; + memcpy(text + len, log_text(msg), l); + len += l; text[len++] = '\n'; console_idx = log_next(console_idx); -- cgit v1.2.3 From f8450fca6ecdea38b5a882fdf6cd097e3ec8651c Mon Sep 17 00:00:00 2001 From: Stephen Warren Date: Thu, 10 May 2012 16:14:33 -0600 Subject: printk: correctly align __log_buf __log_buf must be aligned, because a 64-bit value is written directly to it as part of struct log. Alignment of the log entries is typically handled by log_store(), but this only triggers for subsequent entries, not the very first (or wrapped) entries. 
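The failure mode is easiest to see with a simplified sketch (struct and sizes are illustrative, not the kernel's actual struct log): the very first record is written at offset 0 of the buffer, so the 64-bit timestamp in its header is naturally aligned only if the buffer itself is, which is exactly what the added __aligned(LOG_ALIGN) guarantees.

    /* Illustrative only; field names and sizes do not match struct log. */
    struct rec_hdr {
            unsigned long long ts_nsec;     /* 64-bit value stored in place */
            unsigned short len;
    };

    #define LOG_ALIGN 8
    static char log_buf[1 << 17] __attribute__((aligned(LOG_ALIGN)));

    static void store_first_record(unsigned long long now)
    {
            /* Record 0 lands at offset 0: safe only because log_buf itself
             * is aligned; later records get padded up by the store path. */
            struct rec_hdr *hdr = (struct rec_hdr *)log_buf;

            hdr->ts_nsec = now;
            hdr->len = sizeof(*hdr);
    }

On strict-alignment architectures an unaligned 64-bit store here would fault, which is why the alignment has to come from the buffer definition and not from log_store().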
Cc: Kay Sievers Signed-off-by: Stephen Warren Signed-off-by: Greg Kroah-Hartman --- kernel/printk.c | 13 ++++++------- 1 file changed, 6 insertions(+), 7 deletions(-) (limited to 'kernel') diff --git a/kernel/printk.c b/kernel/printk.c index 572941d7e5f7..8b027bdf4606 100644 --- a/kernel/printk.c +++ b/kernel/printk.c @@ -227,8 +227,13 @@ static u32 clear_idx; #define LOG_LINE_MAX 1024 /* record buffer */ +#if !defined(CONFIG_64BIT) || defined(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) +#define LOG_ALIGN 4 +#else +#define LOG_ALIGN 8 +#endif #define __LOG_BUF_LEN (1 << CONFIG_LOG_BUF_SHIFT) -static char __log_buf[__LOG_BUF_LEN]; +static char __log_buf[__LOG_BUF_LEN] __aligned(LOG_ALIGN); static char *log_buf = __log_buf; static u32 log_buf_len = __LOG_BUF_LEN; @@ -279,12 +284,6 @@ static u32 log_next(u32 idx) return idx + msg->len; } -#if !defined(CONFIG_64BIT) || defined(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) -#define LOG_ALIGN 4 -#else -#define LOG_ALIGN 8 -#endif - /* insert record into the buffer, discard old ones, update heads */ static void log_store(int facility, int level, const char *dict, u16 dict_len, -- cgit v1.2.3 From c73893e2ca731b4a81ae59246ab57979aa188777 Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Sat, 5 May 2012 21:57:20 +0200 Subject: PM / Sleep: Make the limit of user space wakeup sources configurable MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Make it possible to configure out the check against the limit of user space wakeup sources for debugging and default Android builds. Signed-off-by: Rafael J. Wysocki Acked-by: Arve HjønnevÃ¥g --- kernel/power/Kconfig | 6 ++++++ kernel/power/wakelock.c | 31 ++++++++++++++++++++++++++----- 2 files changed, 32 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/kernel/power/Kconfig b/kernel/power/Kconfig index 1d534076d33a..08783eda9ce4 100644 --- a/kernel/power/Kconfig +++ b/kernel/power/Kconfig @@ -119,6 +119,12 @@ config PM_WAKELOCKS Allow user space to create, activate and deactivate wakeup source objects with the help of a sysfs-based interface. 
+config PM_WAKELOCKS_LIMIT + int "Maximum number of user space wakeup sources (0 = no limit)" + range 0 100000 + default 100 + depends on PM_WAKELOCKS + config PM_RUNTIME bool "Run-time PM core functionality" depends on !IA64_HP_SIM diff --git a/kernel/power/wakelock.c b/kernel/power/wakelock.c index 579700665e8c..dc34b9d3b7d8 100644 --- a/kernel/power/wakelock.c +++ b/kernel/power/wakelock.c @@ -17,7 +17,6 @@ #include #include -#define WL_NUMBER_LIMIT 100 #define WL_GC_COUNT_MAX 100 #define WL_GC_TIME_SEC 300 @@ -32,7 +31,6 @@ struct wakelock { static struct rb_root wakelocks_tree = RB_ROOT; static LIST_HEAD(wakelocks_lru_list); -static unsigned int number_of_wakelocks; static unsigned int wakelocks_gc_count; ssize_t pm_show_wakelocks(char *buf, bool show_active) @@ -58,6 +56,29 @@ ssize_t pm_show_wakelocks(char *buf, bool show_active) return (str - buf); } +#if CONFIG_PM_WAKELOCKS_LIMIT > 0 +static unsigned int number_of_wakelocks; + +static inline bool wakelocks_limit_exceeded(void) +{ + return number_of_wakelocks > CONFIG_PM_WAKELOCKS_LIMIT; +} + +static inline void increment_wakelocks_number(void) +{ + number_of_wakelocks++; +} + +static inline void decrement_wakelocks_number(void) +{ + number_of_wakelocks--; +} +#else /* CONFIG_PM_WAKELOCKS_LIMIT = 0 */ +static inline bool wakelocks_limit_exceeded(void) { return false; } +static inline void increment_wakelocks_number(void) {} +static inline void decrement_wakelocks_number(void) {} +#endif /* CONFIG_PM_WAKELOCKS_LIMIT */ + static struct wakelock *wakelock_lookup_add(const char *name, size_t len, bool add_if_not_found) { @@ -85,7 +106,7 @@ static struct wakelock *wakelock_lookup_add(const char *name, size_t len, if (!add_if_not_found) return ERR_PTR(-EINVAL); - if (number_of_wakelocks > WL_NUMBER_LIMIT) + if (wakelocks_limit_exceeded()) return ERR_PTR(-ENOSPC); /* Not found, we have to add a new one. */ @@ -103,7 +124,7 @@ static struct wakelock *wakelock_lookup_add(const char *name, size_t len, rb_link_node(&wl->node, parent, node); rb_insert_color(&wl->node, &wakelocks_tree); list_add(&wl->lru, &wakelocks_lru_list); - number_of_wakelocks++; + increment_wakelocks_number(); return wl; } @@ -175,7 +196,7 @@ static void wakelocks_gc(void) list_del(&wl->lru); kfree(wl->name); kfree(wl); - number_of_wakelocks--; + decrement_wakelocks_number(); } } wakelocks_gc_count = 0; -- cgit v1.2.3 From 4e585d25e120f1eae0a3a8bf8f6ebc7692afec18 Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Sat, 5 May 2012 21:57:28 +0200 Subject: PM / Sleep: User space wakeup sources garbage collector Kconfig option MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Make it possible to configure out the user space wakeup sources garbage collector for debugging and default Android builds. Signed-off-by: Rafael J. 
Wysocki Acked-by: Arve HjønnevÃ¥g --- kernel/power/Kconfig | 5 +++ kernel/power/wakelock.c | 101 +++++++++++++++++++++++++++++------------------- 2 files changed, 67 insertions(+), 39 deletions(-) (limited to 'kernel') diff --git a/kernel/power/Kconfig b/kernel/power/Kconfig index 08783eda9ce4..8f9b4eb974e0 100644 --- a/kernel/power/Kconfig +++ b/kernel/power/Kconfig @@ -125,6 +125,11 @@ config PM_WAKELOCKS_LIMIT default 100 depends on PM_WAKELOCKS +config PM_WAKELOCKS_GC + bool "Garbage collector for user space wakeup sources" + depends on PM_WAKELOCKS + default y + config PM_RUNTIME bool "Run-time PM core functionality" depends on !IA64_HP_SIM diff --git a/kernel/power/wakelock.c b/kernel/power/wakelock.c index dc34b9d3b7d8..c8fba3380076 100644 --- a/kernel/power/wakelock.c +++ b/kernel/power/wakelock.c @@ -17,21 +17,18 @@ #include #include -#define WL_GC_COUNT_MAX 100 -#define WL_GC_TIME_SEC 300 - static DEFINE_MUTEX(wakelocks_lock); struct wakelock { char *name; struct rb_node node; struct wakeup_source ws; +#ifdef CONFIG_PM_WAKELOCKS_GC struct list_head lru; +#endif }; static struct rb_root wakelocks_tree = RB_ROOT; -static LIST_HEAD(wakelocks_lru_list); -static unsigned int wakelocks_gc_count; ssize_t pm_show_wakelocks(char *buf, bool show_active) { @@ -79,6 +76,61 @@ static inline void increment_wakelocks_number(void) {} static inline void decrement_wakelocks_number(void) {} #endif /* CONFIG_PM_WAKELOCKS_LIMIT */ +#ifdef CONFIG_PM_WAKELOCKS_GC +#define WL_GC_COUNT_MAX 100 +#define WL_GC_TIME_SEC 300 + +static LIST_HEAD(wakelocks_lru_list); +static unsigned int wakelocks_gc_count; + +static inline void wakelocks_lru_add(struct wakelock *wl) +{ + list_add(&wl->lru, &wakelocks_lru_list); +} + +static inline void wakelocks_lru_most_recent(struct wakelock *wl) +{ + list_move(&wl->lru, &wakelocks_lru_list); +} + +static void wakelocks_gc(void) +{ + struct wakelock *wl, *aux; + ktime_t now; + + if (++wakelocks_gc_count <= WL_GC_COUNT_MAX) + return; + + now = ktime_get(); + list_for_each_entry_safe_reverse(wl, aux, &wakelocks_lru_list, lru) { + u64 idle_time_ns; + bool active; + + spin_lock_irq(&wl->ws.lock); + idle_time_ns = ktime_to_ns(ktime_sub(now, wl->ws.last_time)); + active = wl->ws.active; + spin_unlock_irq(&wl->ws.lock); + + if (idle_time_ns < ((u64)WL_GC_TIME_SEC * NSEC_PER_SEC)) + break; + + if (!active) { + wakeup_source_remove(&wl->ws); + rb_erase(&wl->node, &wakelocks_tree); + list_del(&wl->lru); + kfree(wl->name); + kfree(wl); + decrement_wakelocks_number(); + } + } + wakelocks_gc_count = 0; +} +#else /* !CONFIG_PM_WAKELOCKS_GC */ +static inline void wakelocks_lru_add(struct wakelock *wl) {} +static inline void wakelocks_lru_most_recent(struct wakelock *wl) {} +static inline void wakelocks_gc(void) {} +#endif /* !CONFIG_PM_WAKELOCKS_GC */ + static struct wakelock *wakelock_lookup_add(const char *name, size_t len, bool add_if_not_found) { @@ -123,7 +175,7 @@ static struct wakelock *wakelock_lookup_add(const char *name, size_t len, wakeup_source_add(&wl->ws); rb_link_node(&wl->node, parent, node); rb_insert_color(&wl->node, &wakelocks_tree); - list_add(&wl->lru, &wakelocks_lru_list); + wakelocks_lru_add(wl); increment_wakelocks_number(); return wl; } @@ -166,42 +218,13 @@ int pm_wake_lock(const char *buf) __pm_stay_awake(&wl->ws); } - list_move(&wl->lru, &wakelocks_lru_list); + wakelocks_lru_most_recent(wl); out: mutex_unlock(&wakelocks_lock); return ret; } -static void wakelocks_gc(void) -{ - struct wakelock *wl, *aux; - ktime_t now = ktime_get(); - - 
list_for_each_entry_safe_reverse(wl, aux, &wakelocks_lru_list, lru) { - u64 idle_time_ns; - bool active; - - spin_lock_irq(&wl->ws.lock); - idle_time_ns = ktime_to_ns(ktime_sub(now, wl->ws.last_time)); - active = wl->ws.active; - spin_unlock_irq(&wl->ws.lock); - - if (idle_time_ns < ((u64)WL_GC_TIME_SEC * NSEC_PER_SEC)) - break; - - if (!active) { - wakeup_source_remove(&wl->ws); - rb_erase(&wl->node, &wakelocks_tree); - list_del(&wl->lru); - kfree(wl->name); - kfree(wl); - decrement_wakelocks_number(); - } - } - wakelocks_gc_count = 0; -} - int pm_wake_unlock(const char *buf) { struct wakelock *wl; @@ -226,9 +249,9 @@ int pm_wake_unlock(const char *buf) goto out; } __pm_relax(&wl->ws); - list_move(&wl->lru, &wakelocks_lru_list); - if (++wakelocks_gc_count > WL_GC_COUNT_MAX) - wakelocks_gc(); + + wakelocks_lru_most_recent(wl); + wakelocks_gc(); out: mutex_unlock(&wakelocks_lock); -- cgit v1.2.3 From 1fce677971e29ceaa7c569741fa9c685a7b1052a Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Fri, 11 May 2012 16:36:07 -0700 Subject: printk: add stub for prepend_timestamp() Add a stub for prepend_timestamp() when CONFIG_PRINTK is not enabled. Fixes this build error: kernel/printk.c:1770:3: error: implicit declaration of function 'prepend_timestamp' Cc: Kay Sievers Signed-off-by: Randy Dunlap Signed-off-by: Greg Kroah-Hartman --- kernel/printk.c | 4 ++++ 1 file changed, 4 insertions(+) (limited to 'kernel') diff --git a/kernel/printk.c b/kernel/printk.c index 8b027bdf4606..c42faf97404a 100644 --- a/kernel/printk.c +++ b/kernel/printk.c @@ -1439,6 +1439,10 @@ static struct log *log_from_idx(u32 idx) { return NULL; } static u32 log_next(u32 idx) { return 0; } static char *log_text(const struct log *msg) { return NULL; } static void call_console_drivers(int level, const char *text, size_t len) {} +static size_t prepend_timestamp(unsigned long long t, char *buf) +{ + return 0; +} #endif /* CONFIG_PRINTK */ -- cgit v1.2.3 From dd7d8634e619b715a537402672d1383535ff4c54 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Fri, 11 May 2012 00:56:20 +0200 Subject: sched/numa: Fix the new NUMA topology bits There's no need to convert a node number to a node number by pretending its a cpu number.. Reported-by: Yinghai Lu Reported-and-Tested-by: Greg Pearson Signed-off-by: Peter Zijlstra Link: http://lkml.kernel.org/n/tip-0sqhrht34phowgclj12dgk8h@git.kernel.org Signed-off-by: Ingo Molnar --- kernel/sched/core.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/core.c b/kernel/sched/core.c index b4f2096980a3..0738036fa569 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6395,8 +6395,7 @@ static void sched_init_numa(void) sched_domains_numa_masks[i][j] = mask; for (k = 0; k < nr_node_ids; k++) { - if (node_distance(cpu_to_node(j), k) > - sched_domains_numa_distance[i]) + if (node_distance(j, k) > sched_domains_numa_distance[i]) continue; cpumask_or(mask, mask, cpumask_of_node(k)); -- cgit v1.2.3 From 04f733b4afac5dc93ae9b0a8703c60b87def491e Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Fri, 11 May 2012 00:12:02 +0200 Subject: sched/fair: Revert sched-domain iteration breakage Patches c22402a2f ("sched/fair: Let minimally loaded cpu balance the group") and 0ce90475 ("sched/fair: Add some serialization to the sched_domain load-balance walk") are horribly broken so revert them. 
The problem is that while it sounds good to have the minimally loaded cpu do the pulling of more load, the way we walk the domains there is absolutely no guarantee this cpu will actually get to the domain. In fact its very likely it wont. Therefore the higher up the tree we get, the less likely it is we'll balance at all. The first of mask always walks up, while sucky in that it accumulates load on the first cpu and needs extra passes to spread it out at least guarantees a cpu gets up that far and load-balancing happens at all. Since its now always the first and idle cpus should always be able to balance so they get a task as fast as possible we can also do away with the added serialization. Signed-off-by: Peter Zijlstra Link: http://lkml.kernel.org/n/tip-rpuhs5s56aiv1aw7khv9zkw6@git.kernel.org Signed-off-by: Ingo Molnar --- kernel/sched/core.c | 2 -- kernel/sched/fair.c | 19 +++++++------------ 2 files changed, 7 insertions(+), 14 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 0738036fa569..24922b7ff567 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -5976,7 +5976,6 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu) sg->sgp = *per_cpu_ptr(sdd->sgp, cpumask_first(sg_span)); atomic_inc(&sg->sgp->ref); - sg->balance_cpu = -1; if (cpumask_test_cpu(cpu, sg_span)) groups = sg; @@ -6052,7 +6051,6 @@ build_sched_groups(struct sched_domain *sd, int cpu) cpumask_clear(sched_group_cpus(sg)); sg->sgp->power = 0; - sg->balance_cpu = -1; for_each_cpu(j, span) { if (get_group(j, sdd, NULL) != group) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 9bd3366dbb1c..a259a614b394 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3776,8 +3776,7 @@ static inline void update_sg_lb_stats(struct lb_env *env, int *balance, struct sg_lb_stats *sgs) { unsigned long load, max_cpu_load, min_cpu_load, max_nr_running; - unsigned int balance_cpu = -1; - unsigned long balance_load = ~0UL; + unsigned int balance_cpu = -1, first_idle_cpu = 0; unsigned long avg_load_per_task = 0; int i; @@ -3794,11 +3793,12 @@ static inline void update_sg_lb_stats(struct lb_env *env, /* Bias balancing toward cpus of our domain */ if (local_group) { - load = target_load(i, load_idx); - if (load < balance_load || idle_cpu(i)) { - balance_load = load; + if (idle_cpu(i) && !first_idle_cpu) { + first_idle_cpu = 1; balance_cpu = i; } + + load = target_load(i, load_idx); } else { load = source_load(i, load_idx); if (load > max_cpu_load) { @@ -3824,8 +3824,7 @@ static inline void update_sg_lb_stats(struct lb_env *env, */ if (local_group) { if (env->idle != CPU_NEWLY_IDLE) { - if (balance_cpu != env->dst_cpu || - cmpxchg(&group->balance_cpu, -1, balance_cpu) != -1) { + if (balance_cpu != env->dst_cpu) { *balance = 0; return; } @@ -4919,7 +4918,7 @@ static void rebalance_domains(int cpu, enum cpu_idle_type idle) int balance = 1; struct rq *rq = cpu_rq(cpu); unsigned long interval; - struct sched_domain *sd, *last = NULL; + struct sched_domain *sd; /* Earliest time when we have to do rebalance again */ unsigned long next_balance = jiffies + 60*HZ; int update_next_balance = 0; @@ -4929,7 +4928,6 @@ static void rebalance_domains(int cpu, enum cpu_idle_type idle) rcu_read_lock(); for_each_domain(cpu, sd) { - last = sd; if (!(sd->flags & SD_LOAD_BALANCE)) continue; @@ -4974,9 +4972,6 @@ out: if (!balance) break; } - for (sd = last; sd; sd = sd->child) - (void)cmpxchg(&sd->groups->balance_cpu, cpu, -1); - rcu_read_unlock(); /* -- cgit v1.2.3 From 
870a0bb5d636156502769233d02a0d5791d4366a Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Fri, 11 May 2012 00:26:27 +0200 Subject: sched/numa: Don't scale the imbalance It's far too easy to get ridiculously large imbalance pct when you scale it like that. Use a fixed 125% for now. Signed-off-by: Peter Zijlstra Link: http://lkml.kernel.org/n/tip-zsriaft1dv7hhboyrpvqjy6s@git.kernel.org Signed-off-by: Ingo Molnar --- kernel/sched/core.c | 7 +------ 1 file changed, 1 insertion(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 24922b7ff567..6883d998dc38 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6261,11 +6261,6 @@ static int *sched_domains_numa_distance; static struct cpumask ***sched_domains_numa_masks; static int sched_domains_curr_level; -static inline unsigned long numa_scale(unsigned long x, int level) -{ - return x * sched_domains_numa_distance[level] / sched_domains_numa_scale; -} - static inline int sd_local_flags(int level) { if (sched_domains_numa_distance[level] > REMOTE_DISTANCE) @@ -6286,7 +6281,7 @@ sd_numa_init(struct sched_domain_topology_level *tl, int cpu) .min_interval = sd_weight, .max_interval = 2*sd_weight, .busy_factor = 32, - .imbalance_pct = 100 + numa_scale(25, level), + .imbalance_pct = 125, .cache_nice_tries = 2, .busy_idx = 3, .idle_idx = 2, -- cgit v1.2.3 From 556061b00c9f2fd6a5524b6bde823ef12f299ecf Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Fri, 11 May 2012 17:31:26 +0200 Subject: sched/nohz: Fix rq->cpu_load[] calculations While investigating why the load-balancer did funny I found that the rq->cpu_load[] tables were completely screwy.. a bit more digging revealed that the updates that got through were missing ticks followed by a catchup of 2 ticks. The catchup assumes the cpu was idle during that time (since only nohz can cause missed ticks and the machine is idle etc..) this means that esp. the higher indices were significantly lower than they ought to be. The reason for this is that its not correct to compare against jiffies on every jiffy on any other cpu than the cpu that updates jiffies. This patch cludges around it by only doing the catch-up stuff from nohz_idle_balance() and doing the regular stuff unconditionally from the tick. Signed-off-by: Peter Zijlstra Cc: pjt@google.com Cc: Venkatesh Pallipadi Link: http://lkml.kernel.org/n/tip-tp4kj18xdd5aj4vvj0qg55s2@git.kernel.org Signed-off-by: Ingo Molnar --- kernel/sched/core.c | 53 ++++++++++++++++++++++++++++++++++++++-------------- kernel/sched/fair.c | 2 +- kernel/sched/sched.h | 2 +- 3 files changed, 41 insertions(+), 16 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 6883d998dc38..860fddfb7bb7 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -692,8 +692,6 @@ int tg_nop(struct task_group *tg, void *data) } #endif -void update_cpu_load(struct rq *this_rq); - static void set_load_weight(struct task_struct *p) { int prio = p->static_prio - MAX_RT_PRIO; @@ -2486,22 +2484,13 @@ decay_load_missed(unsigned long load, unsigned long missed_updates, int idx) * scheduler tick (TICK_NSEC). With tickless idle this will not be called * every tick. We fix it up based on jiffies. 
*/ -void update_cpu_load(struct rq *this_rq) +static void __update_cpu_load(struct rq *this_rq, unsigned long this_load, + unsigned long pending_updates) { - unsigned long this_load = this_rq->load.weight; - unsigned long curr_jiffies = jiffies; - unsigned long pending_updates; int i, scale; this_rq->nr_load_updates++; - /* Avoid repeated calls on same jiffy, when moving in and out of idle */ - if (curr_jiffies == this_rq->last_load_update_tick) - return; - - pending_updates = curr_jiffies - this_rq->last_load_update_tick; - this_rq->last_load_update_tick = curr_jiffies; - /* Update our load: */ this_rq->cpu_load[0] = this_load; /* Fasttrack for idx 0 */ for (i = 1, scale = 2; i < CPU_LOAD_IDX_MAX; i++, scale += scale) { @@ -2526,9 +2515,45 @@ void update_cpu_load(struct rq *this_rq) sched_avg_update(this_rq); } +/* + * Called from nohz_idle_balance() to update the load ratings before doing the + * idle balance. + */ +void update_idle_cpu_load(struct rq *this_rq) +{ + unsigned long curr_jiffies = jiffies; + unsigned long load = this_rq->load.weight; + unsigned long pending_updates; + + /* + * Bloody broken means of dealing with nohz, but better than nothing.. + * jiffies is updated by one cpu, another cpu can drift wrt the jiffy + * update and see 0 difference the one time and 2 the next, even though + * we ticked at roughtly the same rate. + * + * Hence we only use this from nohz_idle_balance() and skip this + * nonsense when called from the scheduler_tick() since that's + * guaranteed a stable rate. + */ + if (load || curr_jiffies == this_rq->last_load_update_tick) + return; + + pending_updates = curr_jiffies - this_rq->last_load_update_tick; + this_rq->last_load_update_tick = curr_jiffies; + + __update_cpu_load(this_rq, load, pending_updates); +} + +/* + * Called from scheduler_tick() + */ static void update_cpu_load_active(struct rq *this_rq) { - update_cpu_load(this_rq); + /* + * See the mess in update_idle_cpu_load(). + */ + this_rq->last_load_update_tick = jiffies; + __update_cpu_load(this_rq, this_rq->load.weight, 1); calc_load_account_active(this_rq); } diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index a259a614b394..124e6b6999a7 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5012,7 +5012,7 @@ static void nohz_idle_balance(int this_cpu, enum cpu_idle_type idle) raw_spin_lock_irq(&this_rq->lock); update_rq_clock(this_rq); - update_cpu_load(this_rq); + update_idle_cpu_load(this_rq); raw_spin_unlock_irq(&this_rq->lock); rebalance_domains(balance_cpu, CPU_IDLE); diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 7282e7b5f4c7..ba9dccfd24ce 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -876,7 +876,7 @@ extern void resched_cpu(int cpu); extern struct rt_bandwidth def_rt_bandwidth; extern void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime); -extern void update_cpu_load(struct rq *this_rq); +extern void update_idle_cpu_load(struct rq *this_rq); #ifdef CONFIG_CGROUP_CPUACCT #include -- cgit v1.2.3 From e44bc5c5d00ee9b56dd87db47ed827d52948b9fa Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Fri, 11 May 2012 00:22:12 +0200 Subject: sched/fair: Improve the ->group_imb logic Group imbalance is meant to deal with situations where affinity masks and sched domains don't align well, such as 3 cpus from one group and 6 from another. In this case the domain based balancer will want to put an equal amount of tasks on each side even though they don't have equal cpus. 
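For reference, the relaxed imbalance test this patch ends up with (spelled out in the next paragraphs) can be modelled in user space roughly as follows; the struct, the function name and the load numbers are made up for illustration and this is not the kernel code itself:

/*
 * Toy model of the reworked group_imb check: a group counts as
 * imbalanced only when the per-cpu load spread exceeds one average
 * task *and* the spread in nr_running across the group is > 1.
 */
#include <stdio.h>

struct cpu_stat {
	unsigned long load;		/* weighted load of this cpu */
	unsigned long nr_running;	/* runnable tasks on this cpu */
};

static int group_imbalanced(const struct cpu_stat *cpus, int n)
{
	unsigned long max_load = 0, min_load = ~0UL;
	unsigned long max_nr = 0, min_nr = ~0UL;
	unsigned long sum_load = 0, sum_nr = 0, avg_load_per_task = 0;
	int i;

	for (i = 0; i < n; i++) {
		if (cpus[i].load > max_load)
			max_load = cpus[i].load;
		if (cpus[i].load < min_load)
			min_load = cpus[i].load;
		if (cpus[i].nr_running > max_nr)
			max_nr = cpus[i].nr_running;
		if (cpus[i].nr_running < min_nr)
			min_nr = cpus[i].nr_running;
		sum_load += cpus[i].load;
		sum_nr += cpus[i].nr_running;
	}
	if (sum_nr)
		avg_load_per_task = sum_load / sum_nr;

	return (max_load - min_load) >= avg_load_per_task &&
	       (max_nr - min_nr) > 1;
}

int main(void)
{
	/* equal task weights: five cpus with 2 tasks, one cpu with 3 */
	struct cpu_stat g[] = {
		{ 2048, 2 }, { 2048, 2 }, { 2048, 2 },
		{ 2048, 2 }, { 2048, 2 }, { 3072, 3 },
	};

	printf("group_imb = %d\n", group_imbalanced(g, 6));
	return 0;
}

With equal task weights, six cpus where five run 2 tasks and one runs 3 no longer trips the check, since the nr_running spread across the group is only 1.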
Currently group_imb is set whenever two cpus of a group have a weight difference of at least one avg task and the heaviest cpu has at least two tasks. A group with imbalance set will always be picked as busiest and a balance pass will be forced. The problem is that even if there are no affinity masks this stuff can trigger and cause weird balancing decisions, eg. the observed behaviour was that of 6 cpus, 5 had 2 and 1 had 3 tasks, due to the difference of 1 avg load (they all had the same weight) and nr_running being >1 the group_imbalance logic triggered and did the weird thing of pulling more load instead of trying to move the 1 excess task to the other domain of 6 cpus that had 5 cpu with 2 tasks and 1 cpu with 1 task. Curb the group_imbalance stuff by making the nr_running condition weaker by also tracking the min_nr_running and using the difference in nr_running over the set instead of the absolute max nr_running. Signed-off-by: Peter Zijlstra Link: http://lkml.kernel.org/n/tip-9s7dedozxo8kjsb9kqlrukkf@git.kernel.org Signed-off-by: Ingo Molnar --- kernel/sched/fair.c | 20 ++++++++++++++------ 1 file changed, 14 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 124e6b6999a7..0b42f4487329 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3775,7 +3775,8 @@ static inline void update_sg_lb_stats(struct lb_env *env, int local_group, const struct cpumask *cpus, int *balance, struct sg_lb_stats *sgs) { - unsigned long load, max_cpu_load, min_cpu_load, max_nr_running; + unsigned long nr_running, max_nr_running, min_nr_running; + unsigned long load, max_cpu_load, min_cpu_load; unsigned int balance_cpu = -1, first_idle_cpu = 0; unsigned long avg_load_per_task = 0; int i; @@ -3787,10 +3788,13 @@ static inline void update_sg_lb_stats(struct lb_env *env, max_cpu_load = 0; min_cpu_load = ~0UL; max_nr_running = 0; + min_nr_running = ~0UL; for_each_cpu_and(i, sched_group_cpus(group), cpus) { struct rq *rq = cpu_rq(i); + nr_running = rq->nr_running; + /* Bias balancing toward cpus of our domain */ if (local_group) { if (idle_cpu(i) && !first_idle_cpu) { @@ -3801,16 +3805,19 @@ static inline void update_sg_lb_stats(struct lb_env *env, load = target_load(i, load_idx); } else { load = source_load(i, load_idx); - if (load > max_cpu_load) { + if (load > max_cpu_load) max_cpu_load = load; - max_nr_running = rq->nr_running; - } if (min_cpu_load > load) min_cpu_load = load; + + if (nr_running > max_nr_running) + max_nr_running = nr_running; + if (min_nr_running > nr_running) + min_nr_running = nr_running; } sgs->group_load += load; - sgs->sum_nr_running += rq->nr_running; + sgs->sum_nr_running += nr_running; sgs->sum_weighted_load += weighted_cpuload(i); if (idle_cpu(i)) sgs->idle_cpus++; @@ -3848,7 +3855,8 @@ static inline void update_sg_lb_stats(struct lb_env *env, if (sgs->sum_nr_running) avg_load_per_task = sgs->sum_weighted_load / sgs->sum_nr_running; - if ((max_cpu_load - min_cpu_load) >= avg_load_per_task && max_nr_running > 1) + if ((max_cpu_load - min_cpu_load) >= avg_load_per_task && + (max_nr_running - min_nr_running) > 1) sgs->group_imb = 1; sgs->group_capacity = DIV_ROUND_CLOSEST(group->sgp->power, -- cgit v1.2.3 From 13e099d2f77e1da3e4046860c48d956588633613 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Mon, 14 May 2012 14:34:00 +0200 Subject: sched/debug: Fix printing large integers on 32-bit platforms Some numbers like nr_running and nr_uninterruptible are fundamentally unsigned since its impossible to have a 
negative amount of tasks, yet we still print them as signed to easily recognise the underflow condition. rq->nr_uninterruptible has 'special' accounting and can in fact very easily become negative on a per-cpu basis. It was noted that since the P() macro assumes things are long long and the promotion of unsigned 'int/long' to long long on 32bit doesn't sign extend we print silly large numbers instead of the easier to read signed numbers. Therefore extend the P() macro to not require the sign extention. Reported-by: Diwakar Tundlam Signed-off-by: Peter Zijlstra Link: http://lkml.kernel.org/n/tip-gk5tm8t2n4ix2vkpns42uqqp@git.kernel.org Signed-off-by: Ingo Molnar --- kernel/sched/debug.c | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c index 31e4f61a1629..6f79596e0ea9 100644 --- a/kernel/sched/debug.c +++ b/kernel/sched/debug.c @@ -260,8 +260,14 @@ static void print_cpu(struct seq_file *m, int cpu) SEQ_printf(m, "\ncpu#%d\n", cpu); #endif -#define P(x) \ - SEQ_printf(m, " .%-30s: %Ld\n", #x, (long long)(rq->x)) +#define P(x) \ +do { \ + if (sizeof(rq->x) == 4) \ + SEQ_printf(m, " .%-30s: %ld\n", #x, (long)(rq->x)); \ + else \ + SEQ_printf(m, " .%-30s: %Ld\n", #x, (long long)(rq->x));\ +} while (0) + #define PN(x) \ SEQ_printf(m, " .%-30s: %Ld.%06ld\n", #x, SPLIT_NS(rq->x)) -- cgit v1.2.3 From 3ce9a7c0ac28561567fadedf1a99272e4970f740 Mon Sep 17 00:00:00 2001 From: Kay Sievers Date: Sun, 13 May 2012 23:30:46 +0200 Subject: printk() - restore prefix/timestamp printing for multi-newline strings Calls like: printk("\n *** DEADLOCK ***\n\n"); will print 3 properly indented, separated, syslog + timestamp prefixed lines in the log output. Reported-By: Sasha Levin Signed-off-by: Kay Sievers Signed-off-by: Greg Kroah-Hartman --- kernel/printk.c | 127 +++++++++++++++++++++++++++++++++----------------------- 1 file changed, 76 insertions(+), 51 deletions(-) (limited to 'kernel') diff --git a/kernel/printk.c b/kernel/printk.c index c42faf97404a..915a8be10b5f 100644 --- a/kernel/printk.c +++ b/kernel/printk.c @@ -448,7 +448,7 @@ static ssize_t devkmsg_read(struct file *file, char __user *buf, /* escape non-printable characters */ for (i = 0; i < msg->text_len; i++) { - char c = log_text(msg)[i]; + unsigned char c = log_text(msg)[i]; if (c < ' ' || c >= 128) len += sprintf(user->buf + len, "\\x%02x", c); @@ -461,7 +461,7 @@ static ssize_t devkmsg_read(struct file *file, char __user *buf, bool line = true; for (i = 0; i < msg->dict_len; i++) { - char c = log_dict(msg)[i]; + unsigned char c = log_dict(msg)[i]; if (line) { user->buf[len++] = ' '; @@ -785,56 +785,81 @@ static bool printk_time; #endif module_param_named(time, printk_time, bool, S_IRUGO | S_IWUSR); -static size_t prepend_timestamp(unsigned long long t, char *buf) +static size_t print_prefix(const struct log *msg, bool syslog, char *buf) { - unsigned long rem_ns; + size_t len = 0; - if (!printk_time) - return 0; + if (syslog) { + if (buf) { + len += sprintf(buf, "<%u>", msg->level); + } else { + len += 3; + if (msg->level > 9) + len++; + if (msg->level > 99) + len++; + } + } - if (!buf) - return 15; + if (printk_time) { + if (buf) { + unsigned long long ts = msg->ts_nsec; + unsigned long rem_nsec = do_div(ts, 1000000000); - rem_ns = do_div(t, 1000000000); + len += sprintf(buf + len, "[%5lu.%06lu] ", + (unsigned long) ts, rem_nsec / 1000); + } else { + len += 15; + } + } - return sprintf(buf, "[%5lu.%06lu] ", - (unsigned long) t, rem_ns / 1000); + return 
len; } -static int syslog_print_line(u32 idx, char *text, size_t size) +static size_t msg_print_text(const struct log *msg, bool syslog, + char *buf, size_t size) { - struct log *msg; - size_t len; + const char *text = log_text(msg); + size_t text_size = msg->text_len; + size_t len = 0; - msg = log_from_idx(idx); - if (!text) { - /* calculate length only */ - len = 3; + do { + const char *next = memchr(text, '\n', text_size); + size_t text_len; - if (msg->level > 9) - len++; - if (msg->level > 99) - len++; - len += prepend_timestamp(0, NULL); + if (next) { + text_len = next - text; + next++; + text_size -= next - text; + } else { + text_len = text_size; + } - len += msg->text_len; - len++; - return len; - } + if (buf) { + if (print_prefix(msg, syslog, NULL) + + text_len + 1>= size - len) + break; + + len += print_prefix(msg, syslog, buf + len); + memcpy(buf + len, text, text_len); + len += text_len; + buf[len++] = '\n'; + } else { + /* SYSLOG_ACTION_* buffer size only calculation */ + len += print_prefix(msg, syslog, NULL); + len += text_len + 1; + } + + text = next; + } while (text); - len = sprintf(text, "<%u>", msg->level); - len += prepend_timestamp(msg->ts_nsec, text + len); - if (len + msg->text_len > size) - return -EINVAL; - memcpy(text + len, log_text(msg), msg->text_len); - len += msg->text_len; - text[len++] = '\n'; return len; } static int syslog_print(char __user *buf, int size) { char *text; + struct log *msg; int len; text = kmalloc(LOG_LINE_MAX, GFP_KERNEL); @@ -847,7 +872,8 @@ static int syslog_print(char __user *buf, int size) syslog_seq = log_first_seq; syslog_idx = log_first_idx; } - len = syslog_print_line(syslog_idx, text, LOG_LINE_MAX); + msg = log_from_idx(syslog_idx); + len = msg_print_text(msg, true, text, LOG_LINE_MAX); syslog_idx = log_next(syslog_idx); syslog_seq++; raw_spin_unlock_irq(&logbuf_lock); @@ -887,14 +913,18 @@ static int syslog_print_all(char __user *buf, int size, bool clear) seq = clear_seq; idx = clear_idx; while (seq < log_next_seq) { - len += syslog_print_line(idx, NULL, 0); + struct log *msg = log_from_idx(idx); + + len += msg_print_text(msg, true, NULL, 0); idx = log_next(idx); seq++; } seq = clear_seq; idx = clear_idx; while (len > size && seq < log_next_seq) { - len -= syslog_print_line(idx, NULL, 0); + struct log *msg = log_from_idx(idx); + + len -= msg_print_text(msg, true, NULL, 0); idx = log_next(idx); seq++; } @@ -904,9 +934,10 @@ static int syslog_print_all(char __user *buf, int size, bool clear) len = 0; while (len >= 0 && seq < next_seq) { + struct log *msg = log_from_idx(idx); int textlen; - textlen = syslog_print_line(idx, text, LOG_LINE_MAX); + textlen = msg_print_text(msg, true, text, LOG_LINE_MAX); if (textlen < 0) { len = textlen; break; @@ -1044,7 +1075,9 @@ int do_syslog(int type, char __user *buf, int len, bool from_file) seq = syslog_seq; idx = syslog_idx; while (seq < log_next_seq) { - error += syslog_print_line(idx, NULL, 0); + struct log *msg = log_from_idx(idx); + + error += msg_print_text(msg, true, NULL, 0); idx = log_next(idx); seq++; } @@ -1439,10 +1472,8 @@ static struct log *log_from_idx(u32 idx) { return NULL; } static u32 log_next(u32 idx) { return 0; } static char *log_text(const struct log *msg) { return NULL; } static void call_console_drivers(int level, const char *text, size_t len) {} -static size_t prepend_timestamp(unsigned long long t, char *buf) -{ - return 0; -} +static size_t msg_print_text(const struct log *msg, bool syslog, + char *buf, size_t size) { return 0; } #endif /* CONFIG_PRINTK */ @@ 
-1750,7 +1781,7 @@ again: for (;;) { struct log *msg; static char text[LOG_LINE_MAX]; - size_t len, l; + size_t len; int level; raw_spin_lock_irqsave(&logbuf_lock, flags); @@ -1771,13 +1802,7 @@ again: msg = log_from_idx(console_idx); level = msg->level & 7; - len = prepend_timestamp(msg->ts_nsec, text); - l = msg->text_len; - if (len + l + 1 >= sizeof(text)) - l = sizeof(text) - len - 1; - memcpy(text + len, log_text(msg), l); - len += l; - text[len++] = '\n'; + len = msg_print_text(msg, false, text, sizeof(text)); console_idx = log_next(console_idx); console_seq++; -- cgit v1.2.3 From c313af145b9bc4fb8e8e0c83b8cfc10e1b894a50 Mon Sep 17 00:00:00 2001 From: Kay Sievers Date: Mon, 14 May 2012 20:46:27 +0200 Subject: printk() - isolate KERN_CONT users from ordinary complete lines Arrange the continuation printk() buffering to be fully separated from the ordinary full line users. Limit the exposure to races and wrong printk() line merges to users of continuation only. Ordinary full line users racing against continuation users will no longer affect each other. Multiple continuation users from different threads, racing against each other will not wrongly be merged into a single line, but printed as separate lines. Test output of a kernel module which starts two separate threads which race against each other, one of them printing a single full terminated line: printk("(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)\n"); The other one printing the line, every character separate in a continuation loop: printk("(C"); for (i = 0; i < 58; i++) printk(KERN_CONT "C"); printk(KERN_CONT "C)\n"); Behavior of single and non-thread-aware printk() buffer: # modprobe printk-race printk test init (CC(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA) C(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA) CC(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA) C(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA) CC(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA) C(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA) C(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA) CC(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA) C(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA) C(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA) CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC) (CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC) New behavior with separate and thread-aware continuation buffer: # modprobe printk-race printk test init (AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA) (AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA) (AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA) (CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC) (AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA) (AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA) (AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA) (AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA) (CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC) (CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC) Cc: Linus Torvalds Cc: Joe Perches Cc: Ted Ts'o Cc: Ingo Molnar Cc: Jonathan Corbet Cc: Sasha Levin Signed-off-by: Kay Sievers Signed-off-by: Greg Kroah-Hartman --- kernel/printk.c | 105 ++++++++++++++++++++++++++++++++------------------------ 1 file changed, 61 insertions(+), 44 deletions(-) (limited to 'kernel') diff --git 
a/kernel/printk.c b/kernel/printk.c index 915a8be10b5f..32462d2b364a 100644 --- a/kernel/printk.c +++ b/kernel/printk.c @@ -1264,13 +1264,13 @@ asmlinkage int vprintk_emit(int facility, int level, const char *fmt, va_list args) { static int recursion_bug; - static char buf[LOG_LINE_MAX]; - static size_t buflen; - static int buflevel; + static char cont_buf[LOG_LINE_MAX]; + static size_t cont_len; + static int cont_level; + static struct task_struct *cont_task; static char textbuf[LOG_LINE_MAX]; - static struct task_struct *cont; char *text = textbuf; - size_t textlen; + size_t text_len; unsigned long flags; int this_cpu; bool newline = false; @@ -1320,15 +1320,15 @@ asmlinkage int vprintk_emit(int facility, int level, * The printf needs to come first; we need the syslog * prefix which might be passed-in as a parameter. */ - textlen = vscnprintf(text, sizeof(textbuf), fmt, args); + text_len = vscnprintf(text, sizeof(textbuf), fmt, args); /* mark and strip a trailing newline */ - if (textlen && text[textlen-1] == '\n') { - textlen--; + if (text_len && text[text_len-1] == '\n') { + text_len--; newline = true; } - /* strip syslog prefix and extract log level or flags */ + /* strip syslog prefix and extract log level or control flags */ if (text[0] == '<' && text[1] && text[2] == '>') { switch (text[1]) { case '0' ... '7': @@ -1338,49 +1338,67 @@ asmlinkage int vprintk_emit(int facility, int level, prefix = true; case 'c': /* KERN_CONT */ text += 3; - textlen -= 3; + text_len -= 3; } } - if (buflen && (prefix || dict || cont != current)) { - /* flush existing buffer */ - log_store(facility, buflevel, NULL, 0, buf, buflen); - printed_len += buflen; - buflen = 0; - } + if (level == -1) + level = default_message_loglevel; - if (buflen == 0) { - /* remember level for first message in the buffer */ - if (level == -1) - buflevel = default_message_loglevel; - else - buflevel = level; + if (dict) { + prefix = true; + newline = true; } - if (buflen || !newline) { - /* append to existing buffer, or buffer until next message */ - if (buflen + textlen > sizeof(buf)) - textlen = sizeof(buf) - buflen; - memcpy(buf + buflen, text, textlen); - buflen += textlen; - } + if (!newline) { + if (cont_len && (prefix || cont_task != current)) { + /* + * Flush earlier buffer, which is either from a + * different thread, or when we got a new prefix. + */ + log_store(facility, cont_level, NULL, 0, cont_buf, cont_len); + cont_len = 0; + } - if (newline) { - /* end of line; flush buffer */ - if (buflen) { - log_store(facility, buflevel, - dict, dictlen, buf, buflen); - printed_len += buflen; - buflen = 0; - } else { - log_store(facility, buflevel, - dict, dictlen, text, textlen); - printed_len += textlen; + if (!cont_len) { + cont_level = level; + cont_task = current; } - cont = NULL; + + /* buffer or append to earlier buffer from the same thread */ + if (cont_len + text_len > sizeof(cont_buf)) + text_len = sizeof(cont_buf) - cont_len; + memcpy(cont_buf + cont_len, text, text_len); + cont_len += text_len; } else { - /* remember thread which filled the buffer */ - cont = current; + if (cont_len && cont_task == current) { + if (prefix) { + /* + * New prefix from the same thread; flush. We + * either got no earlier newline, or we race + * with an interrupt. 
+ */ + log_store(facility, cont_level, + NULL, 0, cont_buf, cont_len); + cont_len = 0; + } + + /* append to the earlier buffer and flush */ + if (cont_len + text_len > sizeof(cont_buf)) + text_len = sizeof(cont_buf) - cont_len; + memcpy(cont_buf + cont_len, text, text_len); + cont_len += text_len; + log_store(facility, cont_level, + NULL, 0, cont_buf, cont_len); + cont_len = 0; + cont_task = NULL; + printed_len = cont_len; + } else { + /* ordinary single and terminated line */ + log_store(facility, level, + dict, dictlen, text, text_len); + printed_len = text_len; + } } /* @@ -1470,7 +1488,6 @@ EXPORT_SYMBOL(printk); #define LOG_LINE_MAX 0 static struct log *log_from_idx(u32 idx) { return NULL; } static u32 log_next(u32 idx) { return 0; } -static char *log_text(const struct log *msg) { return NULL; } static void call_console_drivers(int level, const char *text, size_t len) {} static size_t msg_print_text(const struct log *msg, bool syslog, char *buf, size_t size) { return 0; } -- cgit v1.2.3 From 544ecf310f0e7f51fa057ac2a295fc1b3b35a9d3 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Mon, 14 May 2012 15:04:50 -0700 Subject: workqueue: skip nr_running sanity check in worker_enter_idle() if trustee is active worker_enter_idle() has WARN_ON_ONCE() which triggers if nr_running isn't zero when every worker is idle. This can trigger spuriously while a cpu is going down due to the way trustee sets %WORKER_ROGUE and zaps nr_running. It first sets %WORKER_ROGUE on all workers without updating nr_running, releases gcwq->lock, schedules, regrabs gcwq->lock and then zaps nr_running. If the last running worker enters idle inbetween, it would see stale nr_running which hasn't been zapped yet and trigger the WARN_ON_ONCE(). Fix it by performing the sanity check iff the trustee is idle. Signed-off-by: Tejun Heo Reported-by: "Paul E. McKenney" Cc: stable@vger.kernel.org --- kernel/workqueue.c | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 211eadb23323..c36c86cf7900 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -1213,8 +1213,13 @@ static void worker_enter_idle(struct worker *worker) } else wake_up_all(&gcwq->trustee_wait); - /* sanity check nr_running */ - WARN_ON_ONCE(gcwq->nr_workers == gcwq->nr_idle && + /* + * Sanity check nr_running. Because trustee releases gcwq->lock + * between setting %WORKER_ROGUE and zapping nr_running, the + * warning may trigger spuriously. Check iff trustee is idle. + */ + WARN_ON_ONCE(gcwq->trustee_state == TRUSTEE_DONE && + gcwq->nr_workers == gcwq->nr_idle && atomic_read(get_gcwq_nr_running(gcwq->cpu))); } -- cgit v1.2.3 From 4d82a1debbffec129cc387aafa8f40b7bbab3297 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Tue, 15 May 2012 08:06:19 -0700 Subject: lockdep: fix oops in processing workqueue Under memory load, on x86_64, with lockdep enabled, the workqueue's process_one_work() has been seen to oops in __lock_acquire(), barfing on a 0xffffffff00000000 pointer in the lockdep_map's class_cache[]. Because it's permissible to free a work_struct from its callout function, the map used is an onstack copy of the map given in the work_struct: and that copy is made without any locking. 
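The helper the fix below relies on is not part of the kernel/ diff; as a hedged sketch of the idea (copy the map, then clear its class cache so a torn pointer copy can never be dereferenced), here is a small user-space model with illustrative types and a made-up cache size:

/*
 * Toy model of "copy the lockdep_map, then wipe its class cache":
 * the cached class pointers may be updated concurrently, so the
 * on-stack copy must not trust them.
 */
#include <stdio.h>

#define NR_CACHING_CLASSES 2

struct lock_class;

struct lockdep_map_model {
	const char *name;
	struct lock_class *class_cache[NR_CACHING_CLASSES];
};

static void lockdep_copy_map_model(struct lockdep_map_model *to,
				   const struct lockdep_map_model *from)
{
	int i;

	*to = *from;
	/* never trust class pointers copied without locking */
	for (i = 0; i < NR_CACHING_CLASSES; i++)
		to->class_cache[i] = NULL;
}

int main(void)
{
	struct lockdep_map_model work_map = { "work->lockdep_map",
		{ (struct lock_class *)0x1000, NULL } };
	struct lockdep_map_model onstack;

	lockdep_copy_map_model(&onstack, &work_map);
	printf("%s: cache[0]=%p\n", onstack.name,
	       (void *)onstack.class_cache[0]);
	return 0;
}

The kernel helper does the equivalent on struct lockdep_map, so the on-stack copies taken in process_one_work() and call_timer_fn() never depend on a half-copied cache entry.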
Surprisingly, gcc (4.5.1 in Hugh's case) uses "rep movsl" rather than "rep movsq" for that structure copy: which might race with a workqueue user's wait_on_work() doing lock_map_acquire() on the source of the copy, putting a pointer into the class_cache[], but only in time for the top half of that pointer to be copied to the destination map. Boom when process_one_work() subsequently does lock_map_acquire() on its onstack copy of the lockdep_map. Fix this, and a similar instance in call_timer_fn(), with a lockdep_copy_map() function which additionally NULLs the class_cache[]. Note: this oops was actually seen on 3.4-next, where flush_work() newly does the racing lock_map_acquire(); but Tejun points out that 3.4 and earlier are already vulnerable to the same through wait_on_work(). * Patch orginally from Peter. Hugh modified it a bit and wrote the description. Signed-off-by: Peter Zijlstra Reported-by: Hugh Dickins LKML-Reference: Signed-off-by: Tejun Heo --- kernel/timer.c | 4 +++- kernel/workqueue.c | 4 +++- 2 files changed, 6 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/timer.c b/kernel/timer.c index a297ffcf888e..b12385244bb5 100644 --- a/kernel/timer.c +++ b/kernel/timer.c @@ -1102,7 +1102,9 @@ static void call_timer_fn(struct timer_list *timer, void (*fn)(unsigned long), * warnings as well as problems when looking into * timer->lockdep_map, make a copy and use that here. */ - struct lockdep_map lockdep_map = timer->lockdep_map; + struct lockdep_map lockdep_map; + + lockdep_copy_map(&lockdep_map, &timer->lockdep_map); #endif /* * Couple the lock chain with the lock chain at diff --git a/kernel/workqueue.c b/kernel/workqueue.c index c36c86cf7900..9a3128dc67df 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -1818,7 +1818,9 @@ __acquires(&gcwq->lock) * lock freed" warnings as well as problems when looking into * work->lockdep_map, make a copy and use that here. */ - struct lockdep_map lockdep_map = work->lockdep_map; + struct lockdep_map lockdep_map; + + lockdep_copy_map(&lockdep_map, &work->lockdep_map); #endif /* * A single work shouldn't be executed concurrently by -- cgit v1.2.3 From 65cc5a17ad3388f89ddc3d68226a09242656809b Mon Sep 17 00:00:00 2001 From: "Eric W. Biederman" Date: Mon, 12 Mar 2012 13:08:45 -0700 Subject: userns: Teach inode_capable to understand inodes whose uids map to other namespaces. Acked-by: Serge Hallyn Signed-off-by: Eric W. Biederman --- kernel/capability.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/capability.c b/kernel/capability.c index cc5f0718215d..493d97259484 100644 --- a/kernel/capability.c +++ b/kernel/capability.c @@ -429,12 +429,14 @@ bool nsown_capable(int cap) * targeted at it's own user namespace and that the given inode is owned * by the current user namespace or a child namespace. * - * Currently inodes can only be owned by the initial user namespace. + * Currently we check to see if an inode is owned by the current + * user namespace by seeing if the inode's owner maps into the + * current user namespace. * */ bool inode_capable(const struct inode *inode, int cap) { struct user_namespace *ns = current_user_ns(); - return ns_capable(ns, cap) && (ns == &init_user_ns); + return ns_capable(ns, cap) && kuid_has_mapping(ns, inode->i_uid); } -- cgit v1.2.3 From 54ba47edac90091d42e5f97516cad56953576a5a Mon Sep 17 00:00:00 2001 From: "Eric W. 
Biederman" Date: Tue, 13 Mar 2012 16:04:35 -0700 Subject: userns: signal remove unnecessary map_cred_ns map_cred_ns is a light wrapper around from_kuid with the order of the arguments reversed. Replace map_cred_ns with from_kuid and remove map_cred_ns. Acked-by: Serge Hallyn Signed-off-by: Eric W. Biederman --- kernel/signal.c | 20 +++++--------------- 1 file changed, 5 insertions(+), 15 deletions(-) (limited to 'kernel') diff --git a/kernel/signal.c b/kernel/signal.c index aef629c65c87..833ea5166855 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -1019,15 +1019,6 @@ static inline int legacy_queue(struct sigpending *signals, int sig) return (sig < SIGRTMIN) && sigismember(&signals->signal, sig); } -/* - * map the uid in struct cred into user namespace *ns - */ -static inline uid_t map_cred_ns(const struct cred *cred, - struct user_namespace *ns) -{ - return from_kuid_munged(ns, cred->uid); -} - #ifdef CONFIG_USER_NS static inline void userns_fixup_signal_uid(struct siginfo *info, struct task_struct *t) { @@ -1677,8 +1668,8 @@ bool do_notify_parent(struct task_struct *tsk, int sig) */ rcu_read_lock(); info.si_pid = task_pid_nr_ns(tsk, tsk->parent->nsproxy->pid_ns); - info.si_uid = map_cred_ns(__task_cred(tsk), - task_cred_xxx(tsk->parent, user_ns)); + info.si_uid = from_kuid_munged(task_cred_xxx(tsk->parent, user_ns), + task_uid(tsk)); rcu_read_unlock(); info.si_utime = cputime_to_clock_t(tsk->utime + tsk->signal->utime); @@ -1761,8 +1752,7 @@ static void do_notify_parent_cldstop(struct task_struct *tsk, */ rcu_read_lock(); info.si_pid = task_pid_nr_ns(tsk, parent->nsproxy->pid_ns); - info.si_uid = map_cred_ns(__task_cred(tsk), - task_cred_xxx(parent, user_ns)); + info.si_uid = from_kuid_munged(task_cred_xxx(parent, user_ns), task_uid(tsk)); rcu_read_unlock(); info.si_utime = cputime_to_clock_t(tsk->utime); @@ -2180,8 +2170,8 @@ static int ptrace_signal(int signr, siginfo_t *info, info->si_code = SI_USER; rcu_read_lock(); info->si_pid = task_pid_vnr(current->parent); - info->si_uid = map_cred_ns(__task_cred(current->parent), - current_user_ns()); + info->si_uid = from_kuid_munged(current_user_ns(), + task_uid(current->parent)); rcu_read_unlock(); } -- cgit v1.2.3 From 14a590c3f987977d7b09ec926481ee0238c08eee Mon Sep 17 00:00:00 2001 From: "Eric W. Biederman" Date: Mon, 12 Mar 2012 15:44:39 -0700 Subject: userns: Convert cgroup permission checks to use uid_eq Acked-by: Serge Hallyn Signed-off-by: Eric W. Biederman --- kernel/cgroup.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/cgroup.c b/kernel/cgroup.c index ed64ccac67c9..c8329b0c2576 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -2160,9 +2160,9 @@ retry_find_task: * only need to check permissions on one of them. */ tcred = __task_cred(tsk); - if (cred->euid && - cred->euid != tcred->uid && - cred->euid != tcred->suid) { + if (!uid_eq(cred->euid, GLOBAL_ROOT_UID) && + !uid_eq(cred->euid, tcred->uid) && + !uid_eq(cred->euid, tcred->suid)) { rcu_read_unlock(); ret = -EACCES; goto out_unlock_cgroup; -- cgit v1.2.3 From 6edb2a8a385f0cdef51dae37ff23e74d76d8a6ce Mon Sep 17 00:00:00 2001 From: Steven Rostedt Date: Fri, 11 May 2012 23:28:49 -0400 Subject: tracing: Clean up tracing_mark_write() On gcc 4.5 the function tracing_mark_write() would give a warning of page2 being uninitialized. This is due to a bug in gcc because the logic prevents page2 from being used uninitialized, and gcc 4.6+ does not complain (correctly). 
Instead of adding a "unitialized" around page2, which could show a bug later on, I combined page1 and page2 into an array map_pages[]. This binds the two and the two are modified according to nr_pages (what gcc 4.5 seems to ignore). This no longer gives a warning with gcc 4.5 nor with gcc 4.6. Signed-off-by: Steven Rostedt --- kernel/trace/trace.c | 24 +++++++++++------------- 1 file changed, 11 insertions(+), 13 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 48ef4960ec90..d1b3469b62e3 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -3875,14 +3875,14 @@ tracing_mark_write(struct file *filp, const char __user *ubuf, struct print_entry *entry; unsigned long irq_flags; struct page *pages[2]; + void *map_page[2]; int nr_pages = 1; ssize_t written; - void *page1; - void *page2; int offset; int size; int len; int ret; + int i; if (tracing_disabled) return -EINVAL; @@ -3921,9 +3921,8 @@ tracing_mark_write(struct file *filp, const char __user *ubuf, goto out; } - page1 = kmap_atomic(pages[0]); - if (nr_pages == 2) - page2 = kmap_atomic(pages[1]); + for (i = 0; i < nr_pages; i++) + map_page[i] = kmap_atomic(pages[i]); local_save_flags(irq_flags); size = sizeof(*entry) + cnt + 2; /* possible \n added */ @@ -3941,10 +3940,10 @@ tracing_mark_write(struct file *filp, const char __user *ubuf, if (nr_pages == 2) { len = PAGE_SIZE - offset; - memcpy(&entry->buf, page1 + offset, len); - memcpy(&entry->buf[len], page2, cnt - len); + memcpy(&entry->buf, map_page[0] + offset, len); + memcpy(&entry->buf[len], map_page[1], cnt - len); } else - memcpy(&entry->buf, page1 + offset, cnt); + memcpy(&entry->buf, map_page[0] + offset, cnt); if (entry->buf[cnt - 1] != '\n') { entry->buf[cnt] = '\n'; @@ -3959,11 +3958,10 @@ tracing_mark_write(struct file *filp, const char __user *ubuf, *fpos += written; out_unlock: - if (nr_pages == 2) - kunmap_atomic(page2); - kunmap_atomic(page1); - while (nr_pages > 0) - put_page(pages[--nr_pages]); + for (i = 0; i < nr_pages; i++){ + kunmap_atomic(map_page[i]); + put_page(pages[i]); + } out: return written; } -- cgit v1.2.3 From 83f40318dab00e3298a1f6d0b12ac025e84e478d Mon Sep 17 00:00:00 2001 From: Vaibhav Nagarnaik Date: Thu, 3 May 2012 18:59:50 -0700 Subject: ring-buffer: Make removal of ring buffer pages atomic This patch adds the capability to remove pages from a ring buffer without destroying any existing data in it. This is done by removing the pages after the tail page. This makes sure that first all the empty pages in the ring buffer are removed. If the head page is one in the list of pages to be removed, then the page after the removed ones is made the head page. This removes the oldest data from the ring buffer and keeps the latest data around to be read. To do this in a non-racey manner, tracing is stopped for a very short time while the pages to be removed are identified and unlinked from the ring buffer. The pages are freed after the tracing is restarted to minimize the time needed to stop tracing. The context in which the pages from the per-cpu ring buffer are removed runs on the respective CPU. This minimizes the events not traced to only NMI trace contexts. 
Link: http://lkml.kernel.org/r/1336096792-25373-1-git-send-email-vnagarnaik@google.com Cc: Frederic Weisbecker Cc: Ingo Molnar Cc: Laurent Chavey Cc: Justin Teravest Cc: David Sharp Signed-off-by: Vaibhav Nagarnaik Signed-off-by: Steven Rostedt --- kernel/trace/ring_buffer.c | 265 +++++++++++++++++++++++++++++++++++---------- kernel/trace/trace.c | 20 +--- 2 files changed, 209 insertions(+), 76 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 2d5eb3320827..27ac37efb2b0 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -23,6 +23,8 @@ #include #include "trace.h" +static void update_pages_handler(struct work_struct *work); + /* * The ring buffer header is special. We must manually up keep it. */ @@ -470,12 +472,15 @@ struct ring_buffer_per_cpu { /* ring buffer pages to update, > 0 to add, < 0 to remove */ int nr_pages_to_update; struct list_head new_pages; /* new pages to add */ + struct work_struct update_pages_work; + struct completion update_completion; }; struct ring_buffer { unsigned flags; int cpus; atomic_t record_disabled; + atomic_t resize_disabled; cpumask_var_t cpumask; struct lock_class_key *reader_lock_key; @@ -1048,6 +1053,8 @@ rb_allocate_cpu_buffer(struct ring_buffer *buffer, int nr_pages, int cpu) raw_spin_lock_init(&cpu_buffer->reader_lock); lockdep_set_class(&cpu_buffer->reader_lock, buffer->reader_lock_key); cpu_buffer->lock = (arch_spinlock_t)__ARCH_SPIN_LOCK_UNLOCKED; + INIT_WORK(&cpu_buffer->update_pages_work, update_pages_handler); + init_completion(&cpu_buffer->update_completion); bpage = kzalloc_node(ALIGN(sizeof(*bpage), cache_line_size()), GFP_KERNEL, cpu_to_node(cpu)); @@ -1235,32 +1242,123 @@ void ring_buffer_set_clock(struct ring_buffer *buffer, static void rb_reset_cpu(struct ring_buffer_per_cpu *cpu_buffer); +static inline unsigned long rb_page_entries(struct buffer_page *bpage) +{ + return local_read(&bpage->entries) & RB_WRITE_MASK; +} + +static inline unsigned long rb_page_write(struct buffer_page *bpage) +{ + return local_read(&bpage->write) & RB_WRITE_MASK; +} + static void -rb_remove_pages(struct ring_buffer_per_cpu *cpu_buffer, unsigned nr_pages) +rb_remove_pages(struct ring_buffer_per_cpu *cpu_buffer, unsigned int nr_pages) { - struct buffer_page *bpage; - struct list_head *p; - unsigned i; + struct list_head *tail_page, *to_remove, *next_page; + struct buffer_page *to_remove_page, *tmp_iter_page; + struct buffer_page *last_page, *first_page; + unsigned int nr_removed; + unsigned long head_bit; + int page_entries; + + head_bit = 0; raw_spin_lock_irq(&cpu_buffer->reader_lock); - rb_head_page_deactivate(cpu_buffer); + atomic_inc(&cpu_buffer->record_disabled); + /* + * We don't race with the readers since we have acquired the reader + * lock. We also don't race with writers after disabling recording. + * This makes it easy to figure out the first and the last page to be + * removed from the list. We unlink all the pages in between including + * the first and last pages. This is done in a busy loop so that we + * lose the least number of traces. + * The pages are freed after we restart recording and unlock readers. 
+ */ + tail_page = &cpu_buffer->tail_page->list; - for (i = 0; i < nr_pages; i++) { - if (RB_WARN_ON(cpu_buffer, list_empty(cpu_buffer->pages))) - goto out; - p = cpu_buffer->pages->next; - bpage = list_entry(p, struct buffer_page, list); - list_del_init(&bpage->list); - free_buffer_page(bpage); + /* + * tail page might be on reader page, we remove the next page + * from the ring buffer + */ + if (cpu_buffer->tail_page == cpu_buffer->reader_page) + tail_page = rb_list_head(tail_page->next); + to_remove = tail_page; + + /* start of pages to remove */ + first_page = list_entry(rb_list_head(to_remove->next), + struct buffer_page, list); + + for (nr_removed = 0; nr_removed < nr_pages; nr_removed++) { + to_remove = rb_list_head(to_remove)->next; + head_bit |= (unsigned long)to_remove & RB_PAGE_HEAD; } - if (RB_WARN_ON(cpu_buffer, list_empty(cpu_buffer->pages))) - goto out; - rb_reset_cpu(cpu_buffer); - rb_check_pages(cpu_buffer); + next_page = rb_list_head(to_remove)->next; -out: + /* + * Now we remove all pages between tail_page and next_page. + * Make sure that we have head_bit value preserved for the + * next page + */ + tail_page->next = (struct list_head *)((unsigned long)next_page | + head_bit); + next_page = rb_list_head(next_page); + next_page->prev = tail_page; + + /* make sure pages points to a valid page in the ring buffer */ + cpu_buffer->pages = next_page; + + /* update head page */ + if (head_bit) + cpu_buffer->head_page = list_entry(next_page, + struct buffer_page, list); + + /* + * change read pointer to make sure any read iterators reset + * themselves + */ + cpu_buffer->read = 0; + + /* pages are removed, resume tracing and then free the pages */ + atomic_dec(&cpu_buffer->record_disabled); raw_spin_unlock_irq(&cpu_buffer->reader_lock); + + RB_WARN_ON(cpu_buffer, list_empty(cpu_buffer->pages)); + + /* last buffer page to remove */ + last_page = list_entry(rb_list_head(to_remove), struct buffer_page, + list); + tmp_iter_page = first_page; + + do { + to_remove_page = tmp_iter_page; + rb_inc_page(cpu_buffer, &tmp_iter_page); + + /* update the counters */ + page_entries = rb_page_entries(to_remove_page); + if (page_entries) { + /* + * If something was added to this page, it was full + * since it is not the tail page. So we deduct the + * bytes consumed in ring buffer from here. + * No need to update overruns, since this page is + * deleted from ring buffer and its entries are + * already accounted for. 
+ */ + local_sub(BUF_PAGE_SIZE, &cpu_buffer->entries_bytes); + } + + /* + * We have already removed references to this list item, just + * free up the buffer_page and its page + */ + free_buffer_page(to_remove_page); + nr_removed--; + + } while (to_remove_page != last_page); + + RB_WARN_ON(cpu_buffer, nr_removed); } static void @@ -1272,6 +1370,8 @@ rb_insert_pages(struct ring_buffer_per_cpu *cpu_buffer, unsigned i; raw_spin_lock_irq(&cpu_buffer->reader_lock); + /* stop the writers while inserting pages */ + atomic_inc(&cpu_buffer->record_disabled); rb_head_page_deactivate(cpu_buffer); for (i = 0; i < nr_pages; i++) { @@ -1286,19 +1386,27 @@ rb_insert_pages(struct ring_buffer_per_cpu *cpu_buffer, rb_check_pages(cpu_buffer); out: + atomic_dec(&cpu_buffer->record_disabled); raw_spin_unlock_irq(&cpu_buffer->reader_lock); } -static void update_pages_handler(struct ring_buffer_per_cpu *cpu_buffer) +static void rb_update_pages(struct ring_buffer_per_cpu *cpu_buffer) { if (cpu_buffer->nr_pages_to_update > 0) rb_insert_pages(cpu_buffer, &cpu_buffer->new_pages, cpu_buffer->nr_pages_to_update); else rb_remove_pages(cpu_buffer, -cpu_buffer->nr_pages_to_update); + cpu_buffer->nr_pages += cpu_buffer->nr_pages_to_update; - /* reset this value */ - cpu_buffer->nr_pages_to_update = 0; +} + +static void update_pages_handler(struct work_struct *work) +{ + struct ring_buffer_per_cpu *cpu_buffer = container_of(work, + struct ring_buffer_per_cpu, update_pages_work); + rb_update_pages(cpu_buffer); + complete(&cpu_buffer->update_completion); } /** @@ -1308,14 +1416,14 @@ static void update_pages_handler(struct ring_buffer_per_cpu *cpu_buffer) * * Minimum size is 2 * BUF_PAGE_SIZE. * - * Returns -1 on failure. + * Returns 0 on success and < 0 on failure. */ int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size, int cpu_id) { struct ring_buffer_per_cpu *cpu_buffer; unsigned nr_pages; - int cpu; + int cpu, err = 0; /* * Always succeed at resizing a non-existent buffer: @@ -1330,15 +1438,18 @@ int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size, if (size < BUF_PAGE_SIZE * 2) size = BUF_PAGE_SIZE * 2; - atomic_inc(&buffer->record_disabled); + nr_pages = DIV_ROUND_UP(size, BUF_PAGE_SIZE); - /* Make sure all writers are done with this buffer. */ - synchronize_sched(); + /* + * Don't succeed if resizing is disabled, as a reader might be + * manipulating the ring buffer and is expecting a sane state while + * this is true. 
+ */ + if (atomic_read(&buffer->resize_disabled)) + return -EBUSY; + /* prevent another thread from changing buffer sizes */ mutex_lock(&buffer->mutex); - get_online_cpus(); - - nr_pages = DIV_ROUND_UP(size, BUF_PAGE_SIZE); if (cpu_id == RING_BUFFER_ALL_CPUS) { /* calculate the pages to update */ @@ -1347,33 +1458,67 @@ int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size, cpu_buffer->nr_pages_to_update = nr_pages - cpu_buffer->nr_pages; - /* * nothing more to do for removing pages or no update */ if (cpu_buffer->nr_pages_to_update <= 0) continue; - /* * to add pages, make sure all new pages can be * allocated without receiving ENOMEM */ INIT_LIST_HEAD(&cpu_buffer->new_pages); if (__rb_allocate_pages(cpu_buffer->nr_pages_to_update, - &cpu_buffer->new_pages, cpu)) + &cpu_buffer->new_pages, cpu)) { /* not enough memory for new pages */ - goto no_mem; + err = -ENOMEM; + goto out_err; + } + } + + get_online_cpus(); + /* + * Fire off all the required work handlers + * Look out for offline CPUs + */ + for_each_buffer_cpu(buffer, cpu) { + cpu_buffer = buffer->buffers[cpu]; + if (!cpu_buffer->nr_pages_to_update || + !cpu_online(cpu)) + continue; + + schedule_work_on(cpu, &cpu_buffer->update_pages_work); + } + /* + * This loop is for the CPUs that are not online. + * We can't schedule anything on them, but it's not necessary + * since we can change their buffer sizes without any race. + */ + for_each_buffer_cpu(buffer, cpu) { + cpu_buffer = buffer->buffers[cpu]; + if (!cpu_buffer->nr_pages_to_update || + cpu_online(cpu)) + continue; + + rb_update_pages(cpu_buffer); } /* wait for all the updates to complete */ for_each_buffer_cpu(buffer, cpu) { cpu_buffer = buffer->buffers[cpu]; - if (cpu_buffer->nr_pages_to_update) { - update_pages_handler(cpu_buffer); - } + if (!cpu_buffer->nr_pages_to_update || + !cpu_online(cpu)) + continue; + + wait_for_completion(&cpu_buffer->update_completion); + /* reset this value */ + cpu_buffer->nr_pages_to_update = 0; } + + put_online_cpus(); } else { cpu_buffer = buffer->buffers[cpu_id]; + if (nr_pages == cpu_buffer->nr_pages) goto out; @@ -1383,38 +1528,47 @@ int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size, INIT_LIST_HEAD(&cpu_buffer->new_pages); if (cpu_buffer->nr_pages_to_update > 0 && __rb_allocate_pages(cpu_buffer->nr_pages_to_update, - &cpu_buffer->new_pages, cpu_id)) - goto no_mem; + &cpu_buffer->new_pages, cpu_id)) { + err = -ENOMEM; + goto out_err; + } - update_pages_handler(cpu_buffer); + get_online_cpus(); + + if (cpu_online(cpu_id)) { + schedule_work_on(cpu_id, + &cpu_buffer->update_pages_work); + wait_for_completion(&cpu_buffer->update_completion); + } else + rb_update_pages(cpu_buffer); + + put_online_cpus(); + /* reset this value */ + cpu_buffer->nr_pages_to_update = 0; } out: - put_online_cpus(); mutex_unlock(&buffer->mutex); - - atomic_dec(&buffer->record_disabled); - return size; - no_mem: + out_err: for_each_buffer_cpu(buffer, cpu) { struct buffer_page *bpage, *tmp; + cpu_buffer = buffer->buffers[cpu]; - /* reset this number regardless */ cpu_buffer->nr_pages_to_update = 0; + if (list_empty(&cpu_buffer->new_pages)) continue; + list_for_each_entry_safe(bpage, tmp, &cpu_buffer->new_pages, list) { list_del_init(&bpage->list); free_buffer_page(bpage); } } - put_online_cpus(); mutex_unlock(&buffer->mutex); - atomic_dec(&buffer->record_disabled); - return -ENOMEM; + return err; } EXPORT_SYMBOL_GPL(ring_buffer_resize); @@ -1453,21 +1607,11 @@ rb_iter_head_event(struct ring_buffer_iter *iter) return 
__rb_page_index(iter->head_page, iter->head); } -static inline unsigned long rb_page_write(struct buffer_page *bpage) -{ - return local_read(&bpage->write) & RB_WRITE_MASK; -} - static inline unsigned rb_page_commit(struct buffer_page *bpage) { return local_read(&bpage->page->commit); } -static inline unsigned long rb_page_entries(struct buffer_page *bpage) -{ - return local_read(&bpage->entries) & RB_WRITE_MASK; -} - /* Size is determined by what has been committed */ static inline unsigned rb_page_size(struct buffer_page *bpage) { @@ -3492,6 +3636,7 @@ ring_buffer_read_prepare(struct ring_buffer *buffer, int cpu) iter->cpu_buffer = cpu_buffer; + atomic_inc(&buffer->resize_disabled); atomic_inc(&cpu_buffer->record_disabled); return iter; @@ -3555,6 +3700,7 @@ ring_buffer_read_finish(struct ring_buffer_iter *iter) struct ring_buffer_per_cpu *cpu_buffer = iter->cpu_buffer; atomic_dec(&cpu_buffer->record_disabled); + atomic_dec(&cpu_buffer->buffer->resize_disabled); kfree(iter); } EXPORT_SYMBOL_GPL(ring_buffer_read_finish); @@ -3662,8 +3808,12 @@ void ring_buffer_reset_cpu(struct ring_buffer *buffer, int cpu) if (!cpumask_test_cpu(cpu, buffer->cpumask)) return; + atomic_inc(&buffer->resize_disabled); atomic_inc(&cpu_buffer->record_disabled); + /* Make sure all commits have finished */ + synchronize_sched(); + raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags); if (RB_WARN_ON(cpu_buffer, local_read(&cpu_buffer->committing))) @@ -3679,6 +3829,7 @@ void ring_buffer_reset_cpu(struct ring_buffer *buffer, int cpu) raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags); atomic_dec(&cpu_buffer->record_disabled); + atomic_dec(&buffer->resize_disabled); } EXPORT_SYMBOL_GPL(ring_buffer_reset_cpu); diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index d1b3469b62e3..dfbd86cc4876 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -3076,20 +3076,10 @@ static int __tracing_resize_ring_buffer(unsigned long size, int cpu) static ssize_t tracing_resize_ring_buffer(unsigned long size, int cpu_id) { - int cpu, ret = size; + int ret = size; mutex_lock(&trace_types_lock); - tracing_stop(); - - /* disable all cpu buffers */ - for_each_tracing_cpu(cpu) { - if (global_trace.data[cpu]) - atomic_inc(&global_trace.data[cpu]->disabled); - if (max_tr.data[cpu]) - atomic_inc(&max_tr.data[cpu]->disabled); - } - if (cpu_id != RING_BUFFER_ALL_CPUS) { /* make sure, this cpu is enabled in the mask */ if (!cpumask_test_cpu(cpu_id, tracing_buffer_mask)) { @@ -3103,14 +3093,6 @@ static ssize_t tracing_resize_ring_buffer(unsigned long size, int cpu_id) ret = -ENOMEM; out: - for_each_tracing_cpu(cpu) { - if (global_trace.data[cpu]) - atomic_dec(&global_trace.data[cpu]->disabled); - if (max_tr.data[cpu]) - atomic_dec(&max_tr.data[cpu]->disabled); - } - - tracing_start(); mutex_unlock(&trace_types_lock); return ret; -- cgit v1.2.3 From 5040b4b7bcc26a311c799d46f67174bcb20d05dd Mon Sep 17 00:00:00 2001 From: Vaibhav Nagarnaik Date: Thu, 3 May 2012 18:59:51 -0700 Subject: ring-buffer: Make addition of pages in ring buffer atomic This patch adds the capability to add new pages to a ring buffer atomically while write operations are going on. This makes it possible to expand the ring buffer size without reinitializing the ring buffer. The new pages are attached between the head page and its previous page. 
Link: http://lkml.kernel.org/r/1336096792-25373-2-git-send-email-vnagarnaik@google.com Cc: Frederic Weisbecker Cc: Ingo Molnar Cc: Laurent Chavey Cc: Justin Teravest Cc: David Sharp Signed-off-by: Vaibhav Nagarnaik Signed-off-by: Steven Rostedt --- kernel/trace/ring_buffer.c | 102 ++++++++++++++++++++++++++++++++++----------- 1 file changed, 77 insertions(+), 25 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 27ac37efb2b0..d673ef03d16d 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -1252,7 +1252,7 @@ static inline unsigned long rb_page_write(struct buffer_page *bpage) return local_read(&bpage->write) & RB_WRITE_MASK; } -static void +static int rb_remove_pages(struct ring_buffer_per_cpu *cpu_buffer, unsigned int nr_pages) { struct list_head *tail_page, *to_remove, *next_page; @@ -1359,46 +1359,97 @@ rb_remove_pages(struct ring_buffer_per_cpu *cpu_buffer, unsigned int nr_pages) } while (to_remove_page != last_page); RB_WARN_ON(cpu_buffer, nr_removed); + + return nr_removed == 0; } -static void -rb_insert_pages(struct ring_buffer_per_cpu *cpu_buffer, - struct list_head *pages, unsigned nr_pages) +static int +rb_insert_pages(struct ring_buffer_per_cpu *cpu_buffer) { - struct buffer_page *bpage; - struct list_head *p; - unsigned i; + struct list_head *pages = &cpu_buffer->new_pages; + int retries, success; raw_spin_lock_irq(&cpu_buffer->reader_lock); - /* stop the writers while inserting pages */ - atomic_inc(&cpu_buffer->record_disabled); - rb_head_page_deactivate(cpu_buffer); + /* + * We are holding the reader lock, so the reader page won't be swapped + * in the ring buffer. Now we are racing with the writer trying to + * move head page and the tail page. + * We are going to adapt the reader page update process where: + * 1. We first splice the start and end of list of new pages between + * the head page and its previous page. + * 2. We cmpxchg the prev_page->next to point from head page to the + * start of new pages list. + * 3. Finally, we update the head->prev to the end of new list. + * + * We will try this process 10 times, to make sure that we don't keep + * spinning. 
+ */ + retries = 10; + success = 0; + while (retries--) { + struct list_head *head_page, *prev_page, *r; + struct list_head *last_page, *first_page; + struct list_head *head_page_with_bit; - for (i = 0; i < nr_pages; i++) { - if (RB_WARN_ON(cpu_buffer, list_empty(pages))) - goto out; - p = pages->next; - bpage = list_entry(p, struct buffer_page, list); - list_del_init(&bpage->list); - list_add_tail(&bpage->list, cpu_buffer->pages); + head_page = &rb_set_head_page(cpu_buffer)->list; + prev_page = head_page->prev; + + first_page = pages->next; + last_page = pages->prev; + + head_page_with_bit = (struct list_head *) + ((unsigned long)head_page | RB_PAGE_HEAD); + + last_page->next = head_page_with_bit; + first_page->prev = prev_page; + + r = cmpxchg(&prev_page->next, head_page_with_bit, first_page); + + if (r == head_page_with_bit) { + /* + * yay, we replaced the page pointer to our new list, + * now, we just have to update to head page's prev + * pointer to point to end of list + */ + head_page->prev = last_page; + success = 1; + break; + } } - rb_reset_cpu(cpu_buffer); - rb_check_pages(cpu_buffer); -out: - atomic_dec(&cpu_buffer->record_disabled); + if (success) + INIT_LIST_HEAD(pages); + /* + * If we weren't successful in adding in new pages, warn and stop + * tracing + */ + RB_WARN_ON(cpu_buffer, !success); raw_spin_unlock_irq(&cpu_buffer->reader_lock); + + /* free pages if they weren't inserted */ + if (!success) { + struct buffer_page *bpage, *tmp; + list_for_each_entry_safe(bpage, tmp, &cpu_buffer->new_pages, + list) { + list_del_init(&bpage->list); + free_buffer_page(bpage); + } + } + return success; } static void rb_update_pages(struct ring_buffer_per_cpu *cpu_buffer) { + int success; + if (cpu_buffer->nr_pages_to_update > 0) - rb_insert_pages(cpu_buffer, &cpu_buffer->new_pages, - cpu_buffer->nr_pages_to_update); + success = rb_insert_pages(cpu_buffer); else - rb_remove_pages(cpu_buffer, -cpu_buffer->nr_pages_to_update); + success = rb_remove_pages(cpu_buffer, + -cpu_buffer->nr_pages_to_update); - cpu_buffer->nr_pages += cpu_buffer->nr_pages_to_update; + if (success) + cpu_buffer->nr_pages += cpu_buffer->nr_pages_to_update; } static void update_pages_handler(struct work_struct *work) @@ -3772,6 +3823,7 @@ rb_reset_cpu(struct ring_buffer_per_cpu *cpu_buffer) cpu_buffer->commit_page = cpu_buffer->head_page; INIT_LIST_HEAD(&cpu_buffer->reader_page->list); + INIT_LIST_HEAD(&cpu_buffer->new_pages); local_set(&cpu_buffer->reader_page->write, 0); local_set(&cpu_buffer->reader_page->entries, 0); local_set(&cpu_buffer->reader_page->page->commit, 0); -- cgit v1.2.3 From 55ccf3fe3f9a3441731aa79cf42a628fc4ecace9 Mon Sep 17 00:00:00 2001 From: Suresh Siddha Date: Wed, 16 May 2012 15:03:51 -0700 Subject: fork: move the real prepare_to_copy() users to arch_dup_task_struct() Historical prepare_to_copy() is mostly a no-op, duplicated for majority of the architectures and the rest following the x86 model of flushing the extended register state like fpu there. Remove it and use the arch_dup_task_struct() instead. 
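A hypothetical arch-side sketch of the replacement hook; the weak default in kernel/fork.c only copies the task struct, and an architecture that relied on prepare_to_copy() folds that work in here instead:

	#include <linux/sched.h>

	/*
	 * Hypothetical arch override. The weak default simply does *dst = *src;
	 * an architecture that used prepare_to_copy() to flush lazy extended
	 * register state would do that flush here, before the copy.
	 */
	int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
	{
		/* e.g. flush @src's lazy FPU state here, as prepare_to_copy() did */
		*dst = *src;
		return 0;
	}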
Suggested-by: Oleg Nesterov Suggested-by: Linus Torvalds Signed-off-by: Suresh Siddha Link: http://lkml.kernel.org/r/1336692811-30576-1-git-send-email-suresh.b.siddha@intel.com Acked-by: Benjamin Herrenschmidt Cc: David Howells Cc: Koichi Yasutake Cc: Paul Mackerras Cc: Paul Mundt Cc: Chris Zankel Cc: Richard Henderson Cc: Russell King Cc: Haavard Skinnemoen Cc: Mike Frysinger Cc: Mark Salter Cc: Aurelien Jacquiot Cc: Mikael Starvik Cc: Yoshinori Sato Cc: Richard Kuo Cc: Tony Luck Cc: Michal Simek Cc: Ralf Baechle Cc: Jonas Bonn Cc: James E.J. Bottomley Cc: Helge Deller Cc: Martin Schwidefsky Cc: Heiko Carstens Cc: Chen Liqin Cc: Lennox Wu Cc: David S. Miller Cc: Chris Metcalf Cc: Jeff Dike Cc: Richard Weinberger Cc: Guan Xuetao Signed-off-by: H. Peter Anvin --- kernel/fork.c | 2 -- 1 file changed, 2 deletions(-) (limited to 'kernel') diff --git a/kernel/fork.c b/kernel/fork.c index 687a15d56243..7aed746ff47c 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -261,8 +261,6 @@ static struct task_struct *dup_task_struct(struct task_struct *orig) int node = tsk_fork_get_node(orig); int err; - prepare_to_copy(orig); - tsk = alloc_task_struct_node(node); if (!tsk) return NULL; -- cgit v1.2.3 From 659f451ff21315ebfeeb46b9adccee8ce1b52c25 Mon Sep 17 00:00:00 2001 From: Steven Rostedt Date: Mon, 14 May 2012 17:02:33 -0400 Subject: ring-buffer: Add integrity check at end of iter read There use to be ring buffer integrity checks after updating the size of the ring buffer. But now that the ring buffer can modify the size while the system is running, the integrity checks were removed, as they require the ring buffer to be disabed to perform the check. Move the integrity check to the reading of the ring buffer via the iterator reads (the "trace" file). As reading via an iterator requires disabling the ring buffer, it is a perfect place to have it. If the ring buffer happens to be disabled when updating the size, we still perform the integrity check. Cc: Vaibhav Nagarnaik Signed-off-by: Steven Rostedt --- kernel/trace/ring_buffer.c | 29 +++++++++++++++++++++++++++++ 1 file changed, 29 insertions(+) (limited to 'kernel') diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index d673ef03d16d..e0573c523b5c 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -1599,6 +1599,29 @@ int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size, } out: + /* + * The ring buffer resize can happen with the ring buffer + * enabled, so that the update disturbs the tracing as little + * as possible. But if the buffer is disabled, we do not need + * to worry about that, and we can take the time to verify + * that the buffer is not corrupt. + */ + if (atomic_read(&buffer->record_disabled)) { + atomic_inc(&buffer->record_disabled); + /* + * Even though the buffer was disabled, we must make sure + * that it is truly disabled before calling rb_check_pages. + * There could have been a race between checking + * record_disable and incrementing it. + */ + synchronize_sched(); + for_each_buffer_cpu(buffer, cpu) { + cpu_buffer = buffer->buffers[cpu]; + rb_check_pages(cpu_buffer); + } + atomic_dec(&buffer->record_disabled); + } + mutex_unlock(&buffer->mutex); return size; @@ -3750,6 +3773,12 @@ ring_buffer_read_finish(struct ring_buffer_iter *iter) { struct ring_buffer_per_cpu *cpu_buffer = iter->cpu_buffer; + /* + * Ring buffer is disabled from recording, here's a good place + * to check the integrity of the ring buffer. 
+ */ + rb_check_pages(cpu_buffer); + atomic_dec(&cpu_buffer->record_disabled); atomic_dec(&cpu_buffer->buffer->resize_disabled); kfree(iter); -- cgit v1.2.3 From 308f7eeb7882c27c1d7aa783499cb22f3b199718 Mon Sep 17 00:00:00 2001 From: Steven Rostedt Date: Wed, 16 May 2012 19:46:32 -0400 Subject: ring-buffer: Reset head page before running self test When the ring buffer does its consistency test on itself, it removes the head page, runs the tests, and then adds it back to what the "head_page" pointer was. But because the head_page pointer may lack behind the real head page (held by the link list pointer). The reset may be incorrect. Instead, if the head_page exists (it does not on first allocation) reset it back to the real head page before running the consistency tests. Then it will be put back to its original location after the tests are complete. Signed-off-by: Steven Rostedt --- kernel/trace/ring_buffer.c | 4 ++++ 1 file changed, 4 insertions(+) (limited to 'kernel') diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index e0573c523b5c..68388f876d43 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -945,6 +945,10 @@ static int rb_check_pages(struct ring_buffer_per_cpu *cpu_buffer) struct list_head *head = cpu_buffer->pages; struct buffer_page *bpage, *tmp; + /* Reset the head page if it exists */ + if (cpu_buffer->head_page) + rb_set_head_page(cpu_buffer); + rb_head_page_deactivate(cpu_buffer); if (RB_WARN_ON(cpu_buffer, head->next->prev != head)) -- cgit v1.2.3 From 0a3d7ce7e6caa8c39cb5184bd9047a01a40abc2a Mon Sep 17 00:00:00 2001 From: Namhyung Kim Date: Mon, 23 Apr 2012 10:11:57 +0900 Subject: tracing: Check return value of tracing_dentry_percpu() If tracing_dentry_percpu() failed, tracing_init_debugfs_percpu() will try to create each cpu directories on debugfs' root directory as d_percpu is NULL. Link: http://lkml.kernel.org/r/1335143517-2285-1-git-send-email-namhyung.kim@lge.com Cc: Frederic Weisbecker Cc: Ingo Molnar Signed-off-by: Namhyung Kim Signed-off-by: Steven Rostedt --- kernel/trace/trace.c | 3 +++ 1 file changed, 3 insertions(+) (limited to 'kernel') diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index dfbd86cc4876..0ed4df0c6a39 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -4474,6 +4474,9 @@ static void tracing_init_debugfs_percpu(long cpu) struct dentry *d_cpu; char cpu_dir[30]; /* 30 characters should be more than enough */ + if (!d_percpu) + return; + snprintf(cpu_dir, 30, "cpu%ld", cpu); d_cpu = debugfs_create_dir(cpu_dir, d_percpu); if (!d_cpu) { -- cgit v1.2.3 From 71babb2705e2203a64c27ede13ae3508a0d2c16c Mon Sep 17 00:00:00 2001 From: Vaibhav Nagarnaik Date: Thu, 3 May 2012 18:59:52 -0700 Subject: tracing: change CPU ring buffer state from tracing_cpumask According to Documentation/trace/ftrace.txt: tracing_cpumask: This is a mask that lets the user only trace on specified CPUS. The format is a hex string representing the CPUS. The tracing_cpumask currently doesn't affect the tracing state of per-CPU ring buffers. This patch enables/disables CPU recording as its corresponding bit in tracing_cpumask is set/unset. 
Link: http://lkml.kernel.org/r/1336096792-25373-3-git-send-email-vnagarnaik@google.com Cc: Frederic Weisbecker Cc: Ingo Molnar Cc: Laurent Chavey Cc: Justin Teravest Cc: David Sharp Signed-off-by: Vaibhav Nagarnaik Signed-off-by: Steven Rostedt --- kernel/trace/trace.c | 2 ++ 1 file changed, 2 insertions(+) (limited to 'kernel') diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 0ed4df0c6a39..08a08bab57a3 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -2669,10 +2669,12 @@ tracing_cpumask_write(struct file *filp, const char __user *ubuf, if (cpumask_test_cpu(cpu, tracing_cpumask) && !cpumask_test_cpu(cpu, tracing_cpumask_new)) { atomic_inc(&global_trace.data[cpu]->disabled); + ring_buffer_record_disable_cpu(global_trace.buffer, cpu); } if (!cpumask_test_cpu(cpu, tracing_cpumask) && cpumask_test_cpu(cpu, tracing_cpumask_new)) { atomic_dec(&global_trace.data[cpu]->disabled); + ring_buffer_record_enable_cpu(global_trace.buffer, cpu); } } arch_spin_unlock(&ftrace_max_lock); -- cgit v1.2.3 From 9fd49328fc2a1cbfea542bcbcf004b5c81dc495b Mon Sep 17 00:00:00 2001 From: Steven Rostedt Date: Tue, 24 Apr 2012 22:32:06 -0400 Subject: ftrace: Sort all function addresses, not just per page Instead of just sorting the ip's of the functions per ftrace page, sort the entire list before adding them to the ftrace pages. This will allow the bsearch algorithm to be sped up as it can also sort by pages, not just records within a page. Signed-off-by: Steven Rostedt --- kernel/trace/ftrace.c | 34 ++++++++++++++++++++++------------ 1 file changed, 22 insertions(+), 12 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c index cf81f27ce6c6..53ed01ed7aa7 100644 --- a/kernel/trace/ftrace.c +++ b/kernel/trace/ftrace.c @@ -3666,15 +3666,27 @@ static __init int ftrace_init_dyn_debugfs(struct dentry *d_tracer) return 0; } -static void ftrace_swap_recs(void *a, void *b, int size) +static int ftrace_cmp_ips(const void *a, const void *b) { - struct dyn_ftrace *reca = a; - struct dyn_ftrace *recb = b; - struct dyn_ftrace t; + const unsigned long *ipa = a; + const unsigned long *ipb = b; - t = *reca; - *reca = *recb; - *recb = t; + if (*ipa > *ipb) + return 1; + if (*ipa < *ipb) + return -1; + return 0; +} + +static void ftrace_swap_ips(void *a, void *b, int size) +{ + unsigned long *ipa = a; + unsigned long *ipb = b; + unsigned long t; + + t = *ipa; + *ipa = *ipb; + *ipb = t; } static int ftrace_process_locs(struct module *mod, @@ -3693,6 +3705,9 @@ static int ftrace_process_locs(struct module *mod, if (!count) return 0; + sort(start, count, sizeof(*start), + ftrace_cmp_ips, ftrace_swap_ips); + pg = ftrace_allocate_pages(count); if (!pg) return -ENOMEM; @@ -3740,11 +3755,6 @@ static int ftrace_process_locs(struct module *mod, /* These new locations need to be initialized */ ftrace_new_pgs = pg; - /* Make each individual set of pages sorted by ips */ - for (; pg; pg = pg->next) - sort(pg->records, pg->index, sizeof(struct dyn_ftrace), - ftrace_cmp_recs, ftrace_swap_recs); - /* * We only need to disable interrupts on start up * because we are modifying code that an interrupt -- cgit v1.2.3 From 706c81f87f84adbcf1f6553b9e6b69b3e28fc35a Mon Sep 17 00:00:00 2001 From: Steven Rostedt Date: Tue, 24 Apr 2012 23:45:26 -0400 Subject: ftrace: Remove extra helper functions The ftrace_record_ip() and ftrace_alloc_dyn_node() were from the time of the ftrace daemon. Although they were still used, they still make things a bit more complex than necessary. 
Move the code into the one function that uses it, and remove the helper functions. Signed-off-by: Steven Rostedt --- kernel/trace/ftrace.c | 61 ++++++++++++++++++++------------------------------- 1 file changed, 24 insertions(+), 37 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c index 53ed01ed7aa7..e10f9e522c44 100644 --- a/kernel/trace/ftrace.c +++ b/kernel/trace/ftrace.c @@ -1520,35 +1520,6 @@ static void ftrace_hash_rec_enable(struct ftrace_ops *ops, __ftrace_hash_rec_update(ops, filter_hash, 1); } -static struct dyn_ftrace *ftrace_alloc_dyn_node(unsigned long ip) -{ - if (ftrace_pages->index == ftrace_pages->size) { - /* We should have allocated enough */ - if (WARN_ON(!ftrace_pages->next)) - return NULL; - ftrace_pages = ftrace_pages->next; - } - - return &ftrace_pages->records[ftrace_pages->index++]; -} - -static struct dyn_ftrace * -ftrace_record_ip(unsigned long ip) -{ - struct dyn_ftrace *rec; - - if (ftrace_disabled) - return NULL; - - rec = ftrace_alloc_dyn_node(ip); - if (!rec) - return NULL; - - rec->ip = ip; - - return rec; -} - static void print_ip_ins(const char *fmt, unsigned char *p) { int i; @@ -3693,7 +3664,9 @@ static int ftrace_process_locs(struct module *mod, unsigned long *start, unsigned long *end) { + struct ftrace_page *start_pg; struct ftrace_page *pg; + struct dyn_ftrace *rec; unsigned long count; unsigned long *p; unsigned long addr; @@ -3708,8 +3681,8 @@ static int ftrace_process_locs(struct module *mod, sort(start, count, sizeof(*start), ftrace_cmp_ips, ftrace_swap_ips); - pg = ftrace_allocate_pages(count); - if (!pg) + start_pg = ftrace_allocate_pages(count); + if (!start_pg) return -ENOMEM; mutex_lock(&ftrace_lock); @@ -3722,7 +3695,7 @@ static int ftrace_process_locs(struct module *mod, if (!mod) { WARN_ON(ftrace_pages || ftrace_pages_start); /* First initialization */ - ftrace_pages = ftrace_pages_start = pg; + ftrace_pages = ftrace_pages_start = start_pg; } else { if (!ftrace_pages) goto out; @@ -3733,11 +3706,11 @@ static int ftrace_process_locs(struct module *mod, ftrace_pages = ftrace_pages->next; } - ftrace_pages->next = pg; - ftrace_pages = pg; + ftrace_pages->next = start_pg; } p = start; + pg = start_pg; while (p < end) { addr = ftrace_call_adjust(*p++); /* @@ -3748,12 +3721,26 @@ static int ftrace_process_locs(struct module *mod, */ if (!addr) continue; - if (!ftrace_record_ip(addr)) - break; + + if (pg->index == pg->size) { + /* We should have allocated enough */ + if (WARN_ON(!pg->next)) + break; + pg = pg->next; + } + + rec = &pg->records[pg->index++]; + rec->ip = addr; } + /* We should have used all pages */ + WARN_ON(pg->next); + + /* Assign the last page to ftrace_pages */ + ftrace_pages = pg; + /* These new locations need to be initialized */ - ftrace_new_pgs = pg; + ftrace_new_pgs = start_pg; /* * We only need to disable interrupts on start up -- cgit v1.2.3 From 9644302e3315e7e36495d230d5ac7125a316d33e Mon Sep 17 00:00:00 2001 From: Steven Rostedt Date: Wed, 25 Apr 2012 10:14:43 -0400 Subject: ftrace: Speed up search by skipping pages by address As all records in a page of the ftrace table are sorted, we can speed up the search algorithm by checking if the address to look for falls in between the first and last record ip on the page. This speeds up both the ftrace_location() and ftrace_text_reserved() algorithms, as it can skip full pages when the search address is not in them. 
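Since the previous change sorts all addresses before they are split across pages, each page covers a contiguous ip range and can be rejected by looking at its first and last record alone. A standalone sketch of that lookup pattern, with simplified stand-ins for struct ftrace_page and struct dyn_ftrace:

	#include <stdlib.h>

	/* Illustrative record/page layout; simplified stand-ins, not the kernel's. */
	struct rec { unsigned long ip; };

	struct page_of_recs {
		struct page_of_recs *next;
		struct rec *records;	/* sorted by ip within the page */
		int index;		/* number of records in use     */
	};

	static int cmp_rec(const void *a, const void *b)
	{
		const struct rec *key = a, *rec = b;

		if (key->ip < rec->ip)
			return -1;
		if (key->ip > rec->ip)
			return 1;
		return 0;
	}

	/*
	 * Each page holds a contiguous, sorted ip range, so a page can be
	 * skipped by comparing against its first and last record; only the
	 * page whose range contains the address needs a bsearch().
	 */
	static struct rec *find_rec(struct page_of_recs *pages, unsigned long ip)
	{
		struct page_of_recs *pg;
		struct rec key = { .ip = ip };

		for (pg = pages; pg; pg = pg->next) {
			if (ip < pg->records[0].ip ||
			    ip > pg->records[pg->index - 1].ip)
				continue;
			/* page ranges do not overlap, so this page decides it */
			return bsearch(&key, pg->records, pg->index,
				       sizeof(struct rec), cmp_rec);
		}
		return NULL;
	}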
Cc: Masami Hiramatsu Signed-off-by: Steven Rostedt --- kernel/trace/ftrace.c | 22 ++++++++++++++++------ 1 file changed, 16 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c index e10f9e522c44..fc93562a6654 100644 --- a/kernel/trace/ftrace.c +++ b/kernel/trace/ftrace.c @@ -1411,6 +1411,8 @@ int ftrace_location(unsigned long ip) key.ip = ip; for (pg = ftrace_pages_start; pg; pg = pg->next) { + if (ip < pg->records[0].ip || ip > pg->records[pg->index - 1].ip) + continue; rec = bsearch(&key, pg->records, pg->index, sizeof(struct dyn_ftrace), ftrace_cmp_recs); @@ -1571,16 +1573,24 @@ void ftrace_bug(int failed, unsigned long ip) /* Return 1 if the address range is reserved for ftrace */ -int ftrace_text_reserved(void *start, void *end) +int ftrace_text_reserved(void *s, void *e) { struct dyn_ftrace *rec; struct ftrace_page *pg; + unsigned long start = (unsigned long)s; + unsigned long end = (unsigned long)e; + int i; - do_for_each_ftrace_rec(pg, rec) { - if (rec->ip <= (unsigned long)end && - rec->ip + MCOUNT_INSN_SIZE > (unsigned long)start) - return 1; - } while_for_each_ftrace_rec(); + for (pg = ftrace_pages_start; pg; pg = pg->next) { + if (end < pg->records[0].ip || + start >= (pg->records[pg->index - 1].ip + MCOUNT_INSN_SIZE)) + continue; + for (i = 0; i < pg->index; i++) { + rec = &pg->records[i]; + if (rec->ip <= end && rec->ip + MCOUNT_INSN_SIZE > start) + return 1; + } + } return 0; } -- cgit v1.2.3 From a650e02a528ab9d6d6f0b8b57745c32f2a138459 Mon Sep 17 00:00:00 2001 From: Steven Rostedt Date: Wed, 25 Apr 2012 13:48:13 -0400 Subject: ftrace: Consolidate ftrace_location() and ftrace_text_reserved() Both ftrace_location() and ftrace_text_reserved() do basically the same thing. They search to see if an address is in the ftace table (contains an address that may change from nop to call ftrace_caller). The difference is that ftrace_location() searches a single address, but ftrace_text_reserved() searches a range. This also makes the ftrace_text_reserved() faster as it now uses a bsearch() instead of linearly searching all the addresses within a page. Cc: Masami Hiramatsu Signed-off-by: Steven Rostedt --- kernel/trace/ftrace.c | 80 +++++++++++++++++++++++++-------------------------- 1 file changed, 40 insertions(+), 40 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c index fc93562a6654..dd091c84b57f 100644 --- a/kernel/trace/ftrace.c +++ b/kernel/trace/ftrace.c @@ -1383,35 +1383,28 @@ ftrace_ops_test(struct ftrace_ops *ops, unsigned long ip) static int ftrace_cmp_recs(const void *a, const void *b) { - const struct dyn_ftrace *reca = a; - const struct dyn_ftrace *recb = b; + const struct dyn_ftrace *key = a; + const struct dyn_ftrace *rec = b; - if (reca->ip > recb->ip) - return 1; - if (reca->ip < recb->ip) + if (key->flags < rec->ip) return -1; + if (key->ip >= rec->ip + MCOUNT_INSN_SIZE) + return 1; return 0; } -/** - * ftrace_location - return true if the ip giving is a traced location - * @ip: the instruction pointer to check - * - * Returns 1 if @ip given is a pointer to a ftrace location. - * That is, the instruction that is either a NOP or call to - * the function tracer. It checks the ftrace internal tables to - * determine if the address belongs or not. 
- */ -int ftrace_location(unsigned long ip) +static int ftrace_location_range(unsigned long start, unsigned long end) { struct ftrace_page *pg; struct dyn_ftrace *rec; struct dyn_ftrace key; - key.ip = ip; + key.ip = start; + key.flags = end; /* overload flags, as it is unsigned long */ for (pg = ftrace_pages_start; pg; pg = pg->next) { - if (ip < pg->records[0].ip || ip > pg->records[pg->index - 1].ip) + if (end < pg->records[0].ip || + start >= (pg->records[pg->index - 1].ip + MCOUNT_INSN_SIZE)) continue; rec = bsearch(&key, pg->records, pg->index, sizeof(struct dyn_ftrace), @@ -1423,6 +1416,36 @@ int ftrace_location(unsigned long ip) return 0; } +/** + * ftrace_location - return true if the ip giving is a traced location + * @ip: the instruction pointer to check + * + * Returns 1 if @ip given is a pointer to a ftrace location. + * That is, the instruction that is either a NOP or call to + * the function tracer. It checks the ftrace internal tables to + * determine if the address belongs or not. + */ +int ftrace_location(unsigned long ip) +{ + return ftrace_location_range(ip, ip); +} + +/** + * ftrace_text_reserved - return true if range contains an ftrace location + * @start: start of range to search + * @end: end of range to search (inclusive). @end points to the last byte to check. + * + * Returns 1 if @start and @end contains a ftrace location. + * That is, the instruction that is either a NOP or call to + * the function tracer. It checks the ftrace internal tables to + * determine if the address belongs or not. + */ +int ftrace_text_reserved(void *start, void *end) +{ + return ftrace_location_range((unsigned long)start, + (unsigned long)end); +} + static void __ftrace_hash_rec_update(struct ftrace_ops *ops, int filter_hash, bool inc) @@ -1571,29 +1594,6 @@ void ftrace_bug(int failed, unsigned long ip) } } - -/* Return 1 if the address range is reserved for ftrace */ -int ftrace_text_reserved(void *s, void *e) -{ - struct dyn_ftrace *rec; - struct ftrace_page *pg; - unsigned long start = (unsigned long)s; - unsigned long end = (unsigned long)e; - int i; - - for (pg = ftrace_pages_start; pg; pg = pg->next) { - if (end < pg->records[0].ip || - start >= (pg->records[pg->index - 1].ip + MCOUNT_INSN_SIZE)) - continue; - for (i = 0; i < pg->index; i++) { - rec = &pg->records[i]; - if (rec->ip <= end && rec->ip + MCOUNT_INSN_SIZE > start) - return 1; - } - } - return 0; -} - static int ftrace_check_record(struct dyn_ftrace *rec, int enable, int update) { unsigned long flag = 0UL; -- cgit v1.2.3 From f0cf973a224a3e3c1dec3395af3ba01cf14b1ff4 Mon Sep 17 00:00:00 2001 From: Steven Rostedt Date: Wed, 25 Apr 2012 14:39:54 -0400 Subject: ftrace: Return record ip addr for ftrace_location() ftrace_location() is passed an addr, and returns 1 if the addr is on a ftrace nop (or caller to ftrace_caller), and 0 otherwise. To let kprobes know if it should move a breakpoint or not, it must return the actual addr that is the start of the ftrace nop. This way a kprobe placed on the location of a ftrace nop, can instead be placed on the instruction after the nop. Even if the probe addr is on the second or later byte of the nop, it can simply be moved forward. 
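A hypothetical caller on the kprobes side, showing how the new return value is meant to be used; ftrace_location() itself is the real interface, the helper around it is illustrative:

	#include <linux/ftrace.h>

	/*
	 * Hypothetical helper: normalize a requested probe address. If it lands
	 * anywhere inside an ftrace nop, even on a middle byte, ftrace_location()
	 * now hands back the start of that nop, which is where the probe has to go.
	 */
	static unsigned long normalize_probe_addr(unsigned long addr)
	{
		unsigned long ftrace_addr = ftrace_location(addr);

		if (ftrace_addr)
			return ftrace_addr;	/* inside an ftrace site: use its start */

		return addr;			/* ordinary text: leave the address alone */
	}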
Cc: Masami Hiramatsu Signed-off-by: Steven Rostedt --- kernel/trace/ftrace.c | 16 ++++++++++------ 1 file changed, 10 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c index dd091c84b57f..ef0826204840 100644 --- a/kernel/trace/ftrace.c +++ b/kernel/trace/ftrace.c @@ -1393,7 +1393,7 @@ static int ftrace_cmp_recs(const void *a, const void *b) return 0; } -static int ftrace_location_range(unsigned long start, unsigned long end) +static unsigned long ftrace_location_range(unsigned long start, unsigned long end) { struct ftrace_page *pg; struct dyn_ftrace *rec; @@ -1410,7 +1410,7 @@ static int ftrace_location_range(unsigned long start, unsigned long end) sizeof(struct dyn_ftrace), ftrace_cmp_recs); if (rec) - return 1; + return rec->ip; } return 0; @@ -1420,12 +1420,12 @@ static int ftrace_location_range(unsigned long start, unsigned long end) * ftrace_location - return true if the ip giving is a traced location * @ip: the instruction pointer to check * - * Returns 1 if @ip given is a pointer to a ftrace location. + * Returns rec->ip if @ip given is a pointer to a ftrace location. * That is, the instruction that is either a NOP or call to * the function tracer. It checks the ftrace internal tables to * determine if the address belongs or not. */ -int ftrace_location(unsigned long ip) +unsigned long ftrace_location(unsigned long ip) { return ftrace_location_range(ip, ip); } @@ -1442,8 +1442,12 @@ int ftrace_location(unsigned long ip) */ int ftrace_text_reserved(void *start, void *end) { - return ftrace_location_range((unsigned long)start, - (unsigned long)end); + unsigned long ret; + + ret = ftrace_location_range((unsigned long)start, + (unsigned long)end); + + return (int)!!ret; } static void __ftrace_hash_rec_update(struct ftrace_ops *ops, -- cgit v1.2.3 From 8ed3e2cfe40ffe43630fd8efa34fc97c95b4c298 Mon Sep 17 00:00:00 2001 From: Steven Rostedt Date: Thu, 26 Apr 2012 14:59:43 -0400 Subject: ftrace: Make ftrace_modify_all_code() global for archs to use Rename __ftrace_modify_code() to ftrace_modify_all_code() and make it global for all archs to use. This will remove the duplication of code, as archs that can modify code without stop_machine() can use it directly outside of the stop_machine() call. 
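A hypothetical arch hook for an architecture that can patch its own text safely without stop_machine(); with the helper made global, such an arch can drive the whole update directly instead of duplicating the command dispatch:

	#include <linux/ftrace.h>

	/* Hypothetical override of the weak arch hook. */
	void arch_ftrace_update_code(int command)
	{
		/* enter whatever safe code-patching mode the architecture needs... */
		ftrace_modify_all_code(command);
		/* ...and leave it again; no stop_machine() round trip required */
	}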
Signed-off-by: Steven Rostedt --- kernel/trace/ftrace.c | 21 +++++++++++++-------- 1 file changed, 13 insertions(+), 8 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c index ef0826204840..3c345825cc23 100644 --- a/kernel/trace/ftrace.c +++ b/kernel/trace/ftrace.c @@ -1811,22 +1811,27 @@ int __weak ftrace_arch_code_modify_post_process(void) return 0; } -static int __ftrace_modify_code(void *data) +void ftrace_modify_all_code(int command) { - int *command = data; - - if (*command & FTRACE_UPDATE_CALLS) + if (command & FTRACE_UPDATE_CALLS) ftrace_replace_code(1); - else if (*command & FTRACE_DISABLE_CALLS) + else if (command & FTRACE_DISABLE_CALLS) ftrace_replace_code(0); - if (*command & FTRACE_UPDATE_TRACE_FUNC) + if (command & FTRACE_UPDATE_TRACE_FUNC) ftrace_update_ftrace_func(ftrace_trace_function); - if (*command & FTRACE_START_FUNC_RET) + if (command & FTRACE_START_FUNC_RET) ftrace_enable_ftrace_graph_caller(); - else if (*command & FTRACE_STOP_FUNC_RET) + else if (command & FTRACE_STOP_FUNC_RET) ftrace_disable_ftrace_graph_caller(); +} + +static int __ftrace_modify_code(void *data) +{ + int *command = data; + + ftrace_modify_all_code(*command); return 0; } -- cgit v1.2.3 From e4f5d5440bb860a3e8942ca8f7277a7f31798965 Mon Sep 17 00:00:00 2001 From: Steven Rostedt Date: Fri, 27 Apr 2012 09:13:18 -0400 Subject: ftrace/x86: Have x86 ftrace use the ftrace_modify_all_code() To remove duplicate code, have the ftrace arch_ftrace_update_code() use the generic ftrace_modify_all_code(). This requires that the default ftrace_replace_code() becomes a weak function so that an arch may override it. Signed-off-by: Steven Rostedt --- kernel/trace/ftrace.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c index 3c345825cc23..a008663d86c8 100644 --- a/kernel/trace/ftrace.c +++ b/kernel/trace/ftrace.c @@ -1683,7 +1683,7 @@ __ftrace_replace_code(struct dyn_ftrace *rec, int enable) return -1; /* unknow ftrace bug */ } -static void ftrace_replace_code(int update) +void __weak ftrace_replace_code(int enable) { struct dyn_ftrace *rec; struct ftrace_page *pg; @@ -1693,7 +1693,7 @@ static void ftrace_replace_code(int update) return; do_for_each_ftrace_rec(pg, rec) { - failed = __ftrace_replace_code(rec, update); + failed = __ftrace_replace_code(rec, enable); if (failed) { ftrace_bug(failed, rec->ip); /* Stop processing */ -- cgit v1.2.3 From b732d439cb43336cd6d7e804ecb2c81193ef63b0 Mon Sep 17 00:00:00 2001 From: Steven Rostedt Date: Mon, 30 Apr 2012 09:17:03 -0400 Subject: ftrace: Remove selecting FRAME_POINTER with FUNCTION_TRACER The function tracer will enable the -pg option with gcc, which requires that frame pointers. When FRAME_POINTER is defined in the kernel config it adds the gcc option -fno-omit-frame-pointer which causes some problems on some architectures. For those architectures, the FRAME_POINTER select was not set. When FUNCTION_TRACER was selected on these architectures that can not have -fno-omit-frame-pointer, the -pg option is still set. But when FRAME_POINTER is not selected, the kernel config would add the gcc option -fomit-frame-pointer. Adding this option is incompatible with -pg even on archs that do not need frame pointers with -pg. The answer to this was to just not add either -fno-omit-frame-pointer or -fomit-frame-pointer on these archs that want function tracing but do not set FRAME_POINTER. 
As it turns out, for archs that require frame pointers for function tracing, the same can be used. If gcc requires frame pointers with -pg, it will simply add it. The best thing to do is not select FRAME_POINTER when function tracing is selected, and let gcc add it if needed. Only add the -fno-omit-frame-pointer when something else selects FRAME_POINTER, but do not add -fomit-frame-pointer if function tracing is selected. Signed-off-by: Steven Rostedt --- kernel/trace/Kconfig | 1 - 1 file changed, 1 deletion(-) (limited to 'kernel') diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig index a1d2849f2473..d81a1a532994 100644 --- a/kernel/trace/Kconfig +++ b/kernel/trace/Kconfig @@ -141,7 +141,6 @@ if FTRACE config FUNCTION_TRACER bool "Kernel Function Tracer" depends on HAVE_FUNCTION_TRACER - select FRAME_POINTER if !ARM_UNWIND && !PPC && !S390 && !MICROBLAZE select KALLSYMS select GENERIC_TRACER select CONTEXT_SWITCH_TRACER -- cgit v1.2.3 From 8e7fbcbc22c12414bcc9dfdd683637f58fb32759 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Mon, 9 Jan 2012 11:28:35 +0100 Subject: sched: Remove stale power aware scheduling remnants and dysfunctional knobs It's been broken forever (i.e. it's not scheduling in a power aware fashion), as reported by Suresh and others sending patches, and nobody cares enough to fix it properly ... so remove it to make space free for something better. There's various problems with the code as it stands today, first and foremost the user interface which is bound to topology levels and has multiple values per level. This results in a state explosion which the administrator or distro needs to master and almost nobody does. Furthermore large configuration state spaces aren't good, it means the thing doesn't just work right because it's either under so many impossibe to meet constraints, or even if there's an achievable state workloads have to be aware of it precisely and can never meet it for dynamic workloads. So pushing this kind of decision to user-space was a bad idea even with a single knob - it's exponentially worse with knobs on every node of the topology. There is a proposal to replace the user interface with a single 3 state knob: sched_balance_policy := { performance, power, auto } where 'auto' would be the preferred default which looks at things like Battery/AC mode and possible cpufreq state or whatever the hw exposes to show us power use expectations - but there's been no progress on it in the past many months. Aside from that, the actual implementation of the various knobs is known to be broken. There have been sporadic attempts at fixing things but these always stop short of reaching a mergable state. Therefore this wholesale removal with the hopes of spurring people who care to come forward once again and work on a coherent replacement. 
Signed-off-by: Peter Zijlstra Cc: Suresh Siddha Cc: Arjan van de Ven Cc: Vincent Guittot Cc: Vaidyanathan Srinivasan Cc: Linus Torvalds Cc: Andrew Morton Link: http://lkml.kernel.org/r/1326104915.2442.53.camel@twins Signed-off-by: Ingo Molnar --- kernel/sched/core.c | 94 ------------------ kernel/sched/fair.c | 275 +--------------------------------------------------- 2 files changed, 2 insertions(+), 367 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/core.c b/kernel/sched/core.c index bd314d7cd9f8..24ca677b5457 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -5929,8 +5929,6 @@ static const struct cpumask *cpu_cpu_mask(int cpu) return cpumask_of_node(cpu_to_node(cpu)); } -int sched_smt_power_savings = 0, sched_mc_power_savings = 0; - struct sd_data { struct sched_domain **__percpu sd; struct sched_group **__percpu sg; @@ -6322,7 +6320,6 @@ sd_numa_init(struct sched_domain_topology_level *tl, int cpu) | 0*SD_WAKE_AFFINE | 0*SD_PREFER_LOCAL | 0*SD_SHARE_CPUPOWER - | 0*SD_POWERSAVINGS_BALANCE | 0*SD_SHARE_PKG_RESOURCES | 1*SD_SERIALIZE | 0*SD_PREFER_SIBLING @@ -6819,97 +6816,6 @@ match2: mutex_unlock(&sched_domains_mutex); } -#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT) -static void reinit_sched_domains(void) -{ - get_online_cpus(); - - /* Destroy domains first to force the rebuild */ - partition_sched_domains(0, NULL, NULL); - - rebuild_sched_domains(); - put_online_cpus(); -} - -static ssize_t sched_power_savings_store(const char *buf, size_t count, int smt) -{ - unsigned int level = 0; - - if (sscanf(buf, "%u", &level) != 1) - return -EINVAL; - - /* - * level is always be positive so don't check for - * level < POWERSAVINGS_BALANCE_NONE which is 0 - * What happens on 0 or 1 byte write, - * need to check for count as well? - */ - - if (level >= MAX_POWERSAVINGS_BALANCE_LEVELS) - return -EINVAL; - - if (smt) - sched_smt_power_savings = level; - else - sched_mc_power_savings = level; - - reinit_sched_domains(); - - return count; -} - -#ifdef CONFIG_SCHED_MC -static ssize_t sched_mc_power_savings_show(struct device *dev, - struct device_attribute *attr, - char *buf) -{ - return sprintf(buf, "%u\n", sched_mc_power_savings); -} -static ssize_t sched_mc_power_savings_store(struct device *dev, - struct device_attribute *attr, - const char *buf, size_t count) -{ - return sched_power_savings_store(buf, count, 0); -} -static DEVICE_ATTR(sched_mc_power_savings, 0644, - sched_mc_power_savings_show, - sched_mc_power_savings_store); -#endif - -#ifdef CONFIG_SCHED_SMT -static ssize_t sched_smt_power_savings_show(struct device *dev, - struct device_attribute *attr, - char *buf) -{ - return sprintf(buf, "%u\n", sched_smt_power_savings); -} -static ssize_t sched_smt_power_savings_store(struct device *dev, - struct device_attribute *attr, - const char *buf, size_t count) -{ - return sched_power_savings_store(buf, count, 1); -} -static DEVICE_ATTR(sched_smt_power_savings, 0644, - sched_smt_power_savings_show, - sched_smt_power_savings_store); -#endif - -int __init sched_create_sysfs_power_savings_entries(struct device *dev) -{ - int err = 0; - -#ifdef CONFIG_SCHED_SMT - if (smt_capable()) - err = device_create_file(dev, &dev_attr_sched_smt_power_savings); -#endif -#ifdef CONFIG_SCHED_MC - if (!err && mc_capable()) - err = device_create_file(dev, &dev_attr_sched_mc_power_savings); -#endif - return err; -} -#endif /* CONFIG_SCHED_MC || CONFIG_SCHED_SMT */ - /* * Update cpusets according to cpu_active mask. 
If cpusets are * disabled, cpuset_update_active_cpus() becomes a simple wrapper diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 0b42f4487329..940e6d17cf96 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -2721,7 +2721,7 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags) * If power savings logic is enabled for a domain, see if we * are not overloaded, if so, don't balance wider. */ - if (tmp->flags & (SD_POWERSAVINGS_BALANCE|SD_PREFER_LOCAL)) { + if (tmp->flags & (SD_PREFER_LOCAL)) { unsigned long power = 0; unsigned long nr_running = 0; unsigned long capacity; @@ -2734,9 +2734,6 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags) capacity = DIV_ROUND_CLOSEST(power, SCHED_POWER_SCALE); - if (tmp->flags & SD_POWERSAVINGS_BALANCE) - nr_running /= 2; - if (nr_running < capacity) want_sd = 0; } @@ -3435,14 +3432,6 @@ struct sd_lb_stats { unsigned int busiest_group_weight; int group_imb; /* Is there imbalance in this sd */ -#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT) - int power_savings_balance; /* Is powersave balance needed for this sd */ - struct sched_group *group_min; /* Least loaded group in sd */ - struct sched_group *group_leader; /* Group which relieves group_min */ - unsigned long min_load_per_task; /* load_per_task in group_min */ - unsigned long leader_nr_running; /* Nr running of group_leader */ - unsigned long min_nr_running; /* Nr running of group_min */ -#endif }; /* @@ -3486,147 +3475,6 @@ static inline int get_sd_load_idx(struct sched_domain *sd, return load_idx; } - -#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT) -/** - * init_sd_power_savings_stats - Initialize power savings statistics for - * the given sched_domain, during load balancing. - * - * @sd: Sched domain whose power-savings statistics are to be initialized. - * @sds: Variable containing the statistics for sd. - * @idle: Idle status of the CPU at which we're performing load-balancing. - */ -static inline void init_sd_power_savings_stats(struct sched_domain *sd, - struct sd_lb_stats *sds, enum cpu_idle_type idle) -{ - /* - * Busy processors will not participate in power savings - * balance. - */ - if (idle == CPU_NOT_IDLE || !(sd->flags & SD_POWERSAVINGS_BALANCE)) - sds->power_savings_balance = 0; - else { - sds->power_savings_balance = 1; - sds->min_nr_running = ULONG_MAX; - sds->leader_nr_running = 0; - } -} - -/** - * update_sd_power_savings_stats - Update the power saving stats for a - * sched_domain while performing load balancing. - * - * @group: sched_group belonging to the sched_domain under consideration. - * @sds: Variable containing the statistics of the sched_domain - * @local_group: Does group contain the CPU for which we're performing - * load balancing ? - * @sgs: Variable containing the statistics of the group. 
- */ -static inline void update_sd_power_savings_stats(struct sched_group *group, - struct sd_lb_stats *sds, int local_group, struct sg_lb_stats *sgs) -{ - - if (!sds->power_savings_balance) - return; - - /* - * If the local group is idle or completely loaded - * no need to do power savings balance at this domain - */ - if (local_group && (sds->this_nr_running >= sgs->group_capacity || - !sds->this_nr_running)) - sds->power_savings_balance = 0; - - /* - * If a group is already running at full capacity or idle, - * don't include that group in power savings calculations - */ - if (!sds->power_savings_balance || - sgs->sum_nr_running >= sgs->group_capacity || - !sgs->sum_nr_running) - return; - - /* - * Calculate the group which has the least non-idle load. - * This is the group from where we need to pick up the load - * for saving power - */ - if ((sgs->sum_nr_running < sds->min_nr_running) || - (sgs->sum_nr_running == sds->min_nr_running && - group_first_cpu(group) > group_first_cpu(sds->group_min))) { - sds->group_min = group; - sds->min_nr_running = sgs->sum_nr_running; - sds->min_load_per_task = sgs->sum_weighted_load / - sgs->sum_nr_running; - } - - /* - * Calculate the group which is almost near its - * capacity but still has some space to pick up some load - * from other group and save more power - */ - if (sgs->sum_nr_running + 1 > sgs->group_capacity) - return; - - if (sgs->sum_nr_running > sds->leader_nr_running || - (sgs->sum_nr_running == sds->leader_nr_running && - group_first_cpu(group) < group_first_cpu(sds->group_leader))) { - sds->group_leader = group; - sds->leader_nr_running = sgs->sum_nr_running; - } -} - -/** - * check_power_save_busiest_group - see if there is potential for some power-savings balance - * @env: load balance environment - * @sds: Variable containing the statistics of the sched_domain - * under consideration. - * - * Description: - * Check if we have potential to perform some power-savings balance. - * If yes, set the busiest group to be the least loaded group in the - * sched_domain, so that it's CPUs can be put to idle. - * - * Returns 1 if there is potential to perform power-savings balance. - * Else returns 0. 
- */ -static inline -int check_power_save_busiest_group(struct lb_env *env, struct sd_lb_stats *sds) -{ - if (!sds->power_savings_balance) - return 0; - - if (sds->this != sds->group_leader || - sds->group_leader == sds->group_min) - return 0; - - env->imbalance = sds->min_load_per_task; - sds->busiest = sds->group_min; - - return 1; - -} -#else /* CONFIG_SCHED_MC || CONFIG_SCHED_SMT */ -static inline void init_sd_power_savings_stats(struct sched_domain *sd, - struct sd_lb_stats *sds, enum cpu_idle_type idle) -{ - return; -} - -static inline void update_sd_power_savings_stats(struct sched_group *group, - struct sd_lb_stats *sds, int local_group, struct sg_lb_stats *sgs) -{ - return; -} - -static inline -int check_power_save_busiest_group(struct lb_env *env, struct sd_lb_stats *sds) -{ - return 0; -} -#endif /* CONFIG_SCHED_MC || CONFIG_SCHED_SMT */ - - unsigned long default_scale_freq_power(struct sched_domain *sd, int cpu) { return SCHED_POWER_SCALE; @@ -3932,7 +3780,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, if (child && child->flags & SD_PREFER_SIBLING) prefer_sibling = 1; - init_sd_power_savings_stats(env->sd, sds, env->idle); load_idx = get_sd_load_idx(env->sd, env->idle); do { @@ -3981,7 +3828,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, sds->group_imb = sgs.group_imb; } - update_sd_power_savings_stats(sg, sds, local_group, &sgs); sg = sg->next; } while (sg != env->sd->groups); } @@ -4276,12 +4122,6 @@ force_balance: return sds.busiest; out_balanced: - /* - * There is no obvious imbalance. But check if we can do some balancing - * to save power. - */ - if (check_power_save_busiest_group(env, &sds)) - return sds.busiest; ret: env->imbalance = 0; return NULL; @@ -4359,28 +4199,6 @@ static int need_active_balance(struct lb_env *env) */ if ((sd->flags & SD_ASYM_PACKING) && env->src_cpu > env->dst_cpu) return 1; - - /* - * The only task running in a non-idle cpu can be moved to this - * cpu in an attempt to completely freeup the other CPU - * package. - * - * The package power saving logic comes from - * find_busiest_group(). If there are no imbalance, then - * f_b_g() will return NULL. However when sched_mc={1,2} then - * f_b_g() will select a group from which a running task may be - * pulled to this cpu in order to make the other package idle. - * If there is no opportunity to make a package idle and if - * there are no imbalance, then f_b_g() will return NULL and no - * action will be taken in load_balance_newidle(). - * - * Under normal task pull operation due to imbalance, there - * will be more than one task in the source run queue and - * move_tasks() will succeed. ld_moved will be true and this - * active balance code will not be triggered. - */ - if (sched_mc_power_savings < POWERSAVINGS_BALANCE_WAKEUP) - return 0; } return unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2); @@ -4700,104 +4518,15 @@ static struct { unsigned long next_balance; /* in jiffy units */ } nohz ____cacheline_aligned; -#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT) -/** - * lowest_flag_domain - Return lowest sched_domain containing flag. - * @cpu: The cpu whose lowest level of sched domain is to - * be returned. - * @flag: The flag to check for the lowest sched_domain - * for the given cpu. - * - * Returns the lowest sched_domain of a cpu which contains the given flag. 
- */ -static inline struct sched_domain *lowest_flag_domain(int cpu, int flag) -{ - struct sched_domain *sd; - - for_each_domain(cpu, sd) - if (sd->flags & flag) - break; - - return sd; -} - -/** - * for_each_flag_domain - Iterates over sched_domains containing the flag. - * @cpu: The cpu whose domains we're iterating over. - * @sd: variable holding the value of the power_savings_sd - * for cpu. - * @flag: The flag to filter the sched_domains to be iterated. - * - * Iterates over all the scheduler domains for a given cpu that has the 'flag' - * set, starting from the lowest sched_domain to the highest. - */ -#define for_each_flag_domain(cpu, sd, flag) \ - for (sd = lowest_flag_domain(cpu, flag); \ - (sd && (sd->flags & flag)); sd = sd->parent) - -/** - * find_new_ilb - Finds the optimum idle load balancer for nomination. - * @cpu: The cpu which is nominating a new idle_load_balancer. - * - * Returns: Returns the id of the idle load balancer if it exists, - * Else, returns >= nr_cpu_ids. - * - * This algorithm picks the idle load balancer such that it belongs to a - * semi-idle powersavings sched_domain. The idea is to try and avoid - * completely idle packages/cores just for the purpose of idle load balancing - * when there are other idle cpu's which are better suited for that job. - */ -static int find_new_ilb(int cpu) +static inline int find_new_ilb(int call_cpu) { int ilb = cpumask_first(nohz.idle_cpus_mask); - struct sched_group *ilbg; - struct sched_domain *sd; - /* - * Have idle load balancer selection from semi-idle packages only - * when power-aware load balancing is enabled - */ - if (!(sched_smt_power_savings || sched_mc_power_savings)) - goto out_done; - - /* - * Optimize for the case when we have no idle CPUs or only one - * idle CPU. Don't walk the sched_domain hierarchy in such cases - */ - if (cpumask_weight(nohz.idle_cpus_mask) < 2) - goto out_done; - - rcu_read_lock(); - for_each_flag_domain(cpu, sd, SD_POWERSAVINGS_BALANCE) { - ilbg = sd->groups; - - do { - if (ilbg->group_weight != - atomic_read(&ilbg->sgp->nr_busy_cpus)) { - ilb = cpumask_first_and(nohz.idle_cpus_mask, - sched_group_cpus(ilbg)); - goto unlock; - } - - ilbg = ilbg->next; - - } while (ilbg != sd->groups); - } -unlock: - rcu_read_unlock(); - -out_done: if (ilb < nr_cpu_ids && idle_cpu(ilb)) return ilb; return nr_cpu_ids; } -#else /* (CONFIG_SCHED_MC || CONFIG_SCHED_SMT) */ -static inline int find_new_ilb(int call_cpu) -{ - return nr_cpu_ids; -} -#endif /* * Kick a CPU to do the nohz balancing, if it is time for it. We pick the -- cgit v1.2.3 From 8ca937a668d4be0b0c89c4fc9a912a5885ac06fe Mon Sep 17 00:00:00 2001 From: Sasha Levin Date: Thu, 17 May 2012 23:31:39 +0200 Subject: cred: use correct cred accessor with regards to rcu read lock Commit "userns: Convert setting and getting uid and gid system calls to use kuid and kgid has modified the accessors in wait_task_continued() and wait_task_stopped() to use __task_cred() instead of task_uid(). __task_cred() assumes that we're inside a rcu read lock, which is untrue for these two functions. Modify it to use task_uid() instead. Signed-off-by: Sasha Levin Signed-off-by: Eric W. 
Biederman --- kernel/exit.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/exit.c b/kernel/exit.c index 789e3c5777f7..910a0716e17a 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -1427,7 +1427,7 @@ static int wait_task_stopped(struct wait_opts *wo, if (!unlikely(wo->wo_flags & WNOWAIT)) *p_code = 0; - uid = from_kuid_munged(current_user_ns(), __task_cred(p)->uid); + uid = from_kuid_munged(current_user_ns(), task_uid(p)); unlock_sig: spin_unlock_irq(&p->sighand->siglock); if (!exit_code) @@ -1500,7 +1500,7 @@ static int wait_task_continued(struct wait_opts *wo, struct task_struct *p) } if (!unlikely(wo->wo_flags & WNOWAIT)) p->signal->flags &= ~SIGNAL_STOP_CONTINUED; - uid = from_kuid_munged(current_user_ns(), __task_cred(p)->uid); + uid = from_kuid_munged(current_user_ns(), task_uid(p)); spin_unlock_irq(&p->sighand->siglock); pid = task_pid_vnr(p); -- cgit v1.2.3 From 1c2927f18576d65631d8e0ddd19e1d023183222e Mon Sep 17 00:00:00 2001 From: Konstantin Khlebnikov Date: Thu, 10 May 2012 16:20:04 +0400 Subject: sched: Taint kernel with TAINT_WARN after sleep-in-atomic bug Usually sleep-in-atomic bugs are followed by dozens other warnings. This patch should help to figure out original source of problem. Signed-off-by: Konstantin Khlebnikov Signed-off-by: Peter Zijlstra Link: http://lkml.kernel.org/r/20120510122004.4873.12726.stgit@zurg Signed-off-by: Ingo Molnar --- kernel/sched/core.c | 1 + 1 file changed, 1 insertion(+) (limited to 'kernel') diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 24ca677b5457..ab9745f7e115 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3138,6 +3138,7 @@ static noinline void __schedule_bug(struct task_struct *prev) if (irqs_disabled()) print_irqtrace_events(prev); dump_stack(); + add_taint(TAINT_WARN); } /* -- cgit v1.2.3 From 2df83fa4bce421f8176932142f1004adfba0f9dd Mon Sep 17 00:00:00 2001 From: Minho Ban Date: Mon, 14 May 2012 21:45:31 +0200 Subject: PM / Hibernate: Use get_gendisk to verify partition if resume_file is integer format Sometimes resume= parameter comes in integer style (e.g. major:minor) and then name_to_dev_t can not detect partition properly. (especially async device like usb, mmc). This patch calls get_gendisk() if resumewait is true and resume_file is in integer format to work around this problem. Signed-off-by: Minho Ban Signed-off-by: Rafael J. Wysocki --- kernel/power/hibernate.c | 13 +++++++++++++ 1 file changed, 13 insertions(+) (limited to 'kernel') diff --git a/kernel/power/hibernate.c b/kernel/power/hibernate.c index e09dfbfeecee..8b53db38a279 100644 --- a/kernel/power/hibernate.c +++ b/kernel/power/hibernate.c @@ -25,6 +25,8 @@ #include #include #include +#include +#include #include #include "power.h" @@ -722,6 +724,17 @@ static int software_resume(void) /* Check if the device is there */ swsusp_resume_device = name_to_dev_t(resume_file); + + /* + * name_to_dev_t is ineffective to verify parition if resume_file is in + * integer format. (e.g. major:minor) + */ + if (isdigit(resume_file[0]) && resume_wait) { + int partno; + while (!get_gendisk(swsusp_resume_device, &partno)) + msleep(10); + } + if (!swsusp_resume_device) { /* * Some device discovery might still be in progress; we need -- cgit v1.2.3 From 05fdd70d2fe1e34d8b80ec56d6e3272d9293653e Mon Sep 17 00:00:00 2001 From: Vaibhav Nagarnaik Date: Fri, 18 May 2012 13:29:51 -0700 Subject: ring-buffer: Merge separate resize loops There are 2 separate loops to resize cpu buffers that are online and offline. 
Merge them to make the code look better. Also change the name from update_completion to update_done to allow shorter lines. Link: http://lkml.kernel.org/r/1337372991-14783-1-git-send-email-vnagarnaik@google.com Cc: Laurent Chavey Cc: Justin Teravest Cc: David Sharp Signed-off-by: Vaibhav Nagarnaik Signed-off-by: Steven Rostedt --- kernel/trace/ring_buffer.c | 41 +++++++++++++++-------------------------- 1 file changed, 15 insertions(+), 26 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 68388f876d43..6420cda62336 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -473,7 +473,7 @@ struct ring_buffer_per_cpu { int nr_pages_to_update; struct list_head new_pages; /* new pages to add */ struct work_struct update_pages_work; - struct completion update_completion; + struct completion update_done; }; struct ring_buffer { @@ -1058,7 +1058,7 @@ rb_allocate_cpu_buffer(struct ring_buffer *buffer, int nr_pages, int cpu) lockdep_set_class(&cpu_buffer->reader_lock, buffer->reader_lock_key); cpu_buffer->lock = (arch_spinlock_t)__ARCH_SPIN_LOCK_UNLOCKED; INIT_WORK(&cpu_buffer->update_pages_work, update_pages_handler); - init_completion(&cpu_buffer->update_completion); + init_completion(&cpu_buffer->update_done); bpage = kzalloc_node(ALIGN(sizeof(*bpage), cache_line_size()), GFP_KERNEL, cpu_to_node(cpu)); @@ -1461,7 +1461,7 @@ static void update_pages_handler(struct work_struct *work) struct ring_buffer_per_cpu *cpu_buffer = container_of(work, struct ring_buffer_per_cpu, update_pages_work); rb_update_pages(cpu_buffer); - complete(&cpu_buffer->update_completion); + complete(&cpu_buffer->update_done); } /** @@ -1534,39 +1534,29 @@ int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size, get_online_cpus(); /* * Fire off all the required work handlers - * Look out for offline CPUs - */ - for_each_buffer_cpu(buffer, cpu) { - cpu_buffer = buffer->buffers[cpu]; - if (!cpu_buffer->nr_pages_to_update || - !cpu_online(cpu)) - continue; - - schedule_work_on(cpu, &cpu_buffer->update_pages_work); - } - /* - * This loop is for the CPUs that are not online. - * We can't schedule anything on them, but it's not necessary + * We can't schedule on offline CPUs, but it's not necessary * since we can change their buffer sizes without any race. 
*/ for_each_buffer_cpu(buffer, cpu) { cpu_buffer = buffer->buffers[cpu]; - if (!cpu_buffer->nr_pages_to_update || - cpu_online(cpu)) + if (!cpu_buffer->nr_pages_to_update) continue; - rb_update_pages(cpu_buffer); + if (cpu_online(cpu)) + schedule_work_on(cpu, + &cpu_buffer->update_pages_work); + else + rb_update_pages(cpu_buffer); } /* wait for all the updates to complete */ for_each_buffer_cpu(buffer, cpu) { cpu_buffer = buffer->buffers[cpu]; - if (!cpu_buffer->nr_pages_to_update || - !cpu_online(cpu)) + if (!cpu_buffer->nr_pages_to_update) continue; - wait_for_completion(&cpu_buffer->update_completion); - /* reset this value */ + if (cpu_online(cpu)) + wait_for_completion(&cpu_buffer->update_done); cpu_buffer->nr_pages_to_update = 0; } @@ -1593,13 +1583,12 @@ int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size, if (cpu_online(cpu_id)) { schedule_work_on(cpu_id, &cpu_buffer->update_pages_work); - wait_for_completion(&cpu_buffer->update_completion); + wait_for_completion(&cpu_buffer->update_done); } else rb_update_pages(cpu_buffer); - put_online_cpus(); - /* reset this value */ cpu_buffer->nr_pages_to_update = 0; + put_online_cpus(); } out: -- cgit v1.2.3 From a591c73f127505cdbd0aa399a92112a8ddff8730 Mon Sep 17 00:00:00 2001 From: Vaibhav Nagarnaik Date: Thu, 3 May 2012 10:40:34 -0700 Subject: tracing: Fix initial buffer_size_kb state Make sure that the state of buffer_size_kb is initialized correctly and returns actual size of the ring buffer. Link: http://lkml.kernel.org/r/1336066834-1673-1-git-send-email-vnagarnaik@google.com Cc: Frederic Weisbecker Cc: Ingo Molnar Cc: Laurent Chavey Cc: Justin Teravest Cc: David Sharp Signed-off-by: Vaibhav Nagarnaik Signed-off-by: Steven Rostedt --- kernel/trace/trace.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 08a08bab57a3..a44d4c6ca9a8 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -5112,7 +5112,8 @@ __init static int tracer_alloc_buffers(void) max_tr.data[i] = &per_cpu(max_tr_data, i); } - set_buffer_entries(&global_trace, ring_buf_size); + set_buffer_entries(&global_trace, + ring_buffer_size(global_trace.buffer, 0)); #ifdef CONFIG_TRACER_MAX_TRACE set_buffer_entries(&max_tr, 1); #endif -- cgit v1.2.3 From 895b67fd5830ce18a6f1375a7c062fcf84b4b874 Mon Sep 17 00:00:00 2001 From: Richard Weinberger Date: Mon, 7 Nov 2011 09:23:22 +0100 Subject: tracing: Remove kernel_lock annotations The BKL is gone, these annotations are useless. Link: http://lkml.kernel.org/r/1320654202-4433-1-git-send-email-richard@nod.at Signed-off-by: Richard Weinberger Signed-off-by: Steven Rostedt --- kernel/trace/trace.c | 2 -- 1 file changed, 2 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index a44d4c6ca9a8..b9a507c19c4b 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -763,8 +763,6 @@ update_max_tr_single(struct trace_array *tr, struct task_struct *tsk, int cpu) * Register a new plugin tracer. */ int register_tracer(struct tracer *type) -__releases(kernel_lock) -__acquires(kernel_lock) { struct tracer *t; int ret = 0; -- cgit v1.2.3 From 58ee99ada293b5ed971a023304fcfbc1a0ccdb1c Mon Sep 17 00:00:00 2001 From: Paul Mundt Date: Sat, 19 May 2012 15:11:41 +0900 Subject: irqdomain: Support removal of IRQ domains. Now that IRQ domains are being used by modules it's necessary to support removing them, too. This adds a new irq_domain_remove() routine for doing the bulk of the heavy lifting. 
It's left as an exercise to the caller to ensure all mappings have been appropriatey disposed of before attempting to remove the domain. Signed-off-by: Paul Mundt Signed-off-by: Grant Likely --- kernel/irq/irqdomain.c | 61 ++++++++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 59 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c index 0e0ba5f840b2..9cae0b2f509f 100644 --- a/kernel/irq/irqdomain.c +++ b/kernel/irq/irqdomain.c @@ -56,6 +56,12 @@ static struct irq_domain *irq_domain_alloc(struct device_node *of_node, return domain; } +static void irq_domain_free(struct irq_domain *domain) +{ + of_node_put(domain->of_node); + kfree(domain); +} + static void irq_domain_add(struct irq_domain *domain) { mutex_lock(&irq_domain_mutex); @@ -65,6 +71,58 @@ static void irq_domain_add(struct irq_domain *domain) domain->revmap_type, domain); } +/** + * irq_domain_remove() - Remove an irq domain. + * @domain: domain to remove + * + * This routine is used to remove an irq domain. The caller must ensure + * that all mappings within the domain have been disposed of prior to + * use, depending on the revmap type. + */ +void irq_domain_remove(struct irq_domain *domain) +{ + mutex_lock(&irq_domain_mutex); + + switch (domain->revmap_type) { + case IRQ_DOMAIN_MAP_LEGACY: + /* + * Legacy domains don't manage their own irq_desc + * allocations, we expect the caller to handle irq_desc + * freeing on their own. + */ + break; + case IRQ_DOMAIN_MAP_TREE: + /* + * radix_tree_delete() takes care of destroying the root + * node when all entries are removed. Shout if there are + * any mappings left. + */ + WARN_ON(domain->revmap_data.tree.height); + break; + case IRQ_DOMAIN_MAP_LINEAR: + kfree(domain->revmap_data.linear.revmap); + domain->revmap_data.linear.size = 0; + break; + case IRQ_DOMAIN_MAP_NOMAP: + break; + } + + list_del(&domain->link); + + /* + * If the going away domain is the default one, reset it. + */ + if (unlikely(irq_default_domain == domain)) + irq_set_default_host(NULL); + + mutex_unlock(&irq_domain_mutex); + + pr_debug("irq: Removed domain of type %d @0x%p\n", + domain->revmap_type, domain); + + irq_domain_free(domain); +} + static unsigned int irq_domain_legacy_revmap(struct irq_domain *domain, irq_hw_number_t hwirq) { @@ -117,8 +175,7 @@ struct irq_domain *irq_domain_add_legacy(struct device_node *of_node, if (WARN_ON(!irq_data || irq_data->domain)) { mutex_unlock(&irq_domain_mutex); - of_node_put(domain->of_node); - kfree(domain); + irq_domain_free(domain); return NULL; } } -- cgit v1.2.3 From ecd84eb20a08841197d6bbda5961dcf5f4e424fc Mon Sep 17 00:00:00 2001 From: Paul Mundt Date: Sat, 19 May 2012 15:11:42 +0900 Subject: irqdomain: Export remaining public API symbols. modules making use of irq domains at the very least need access to the add/remove/lookup routines, though there's nothing preventing them from using the remainder of the public API, either. The current set of exports seem primarily geared at DT-enabled platforms using DT-backed IRQ domains, where many of the API accesses are hidden away in OF code. The non-DT cases need to do most of this on their own. 
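A minimal, hypothetical module sketch of what these exports enable; the domain size, hwirq numbers and names are illustrative, only the irqdomain calls themselves are the real API:

	#include <linux/module.h>
	#include <linux/irq.h>
	#include <linux/irqdomain.h>

	static struct irq_domain *demo_domain;
	static unsigned int demo_virq;

	static int demo_map(struct irq_domain *d, unsigned int virq,
			    irq_hw_number_t hwirq)
	{
		/* no real irq_chip in this sketch; a driver would pass its own */
		irq_set_chip_and_handler(virq, NULL, handle_simple_irq);
		return 0;
	}

	static const struct irq_domain_ops demo_domain_ops = {
		.map = demo_map,
	};

	static int __init demo_init(void)
	{
		/* 32 hwirqs, no device tree node, no private data */
		demo_domain = irq_domain_add_linear(NULL, 32, &demo_domain_ops, NULL);
		if (!demo_domain)
			return -ENOMEM;

		demo_virq = irq_create_mapping(demo_domain, 0);
		if (!demo_virq) {
			irq_domain_remove(demo_domain);
			return -ENOMEM;
		}
		return 0;
	}

	static void __exit demo_exit(void)
	{
		/* mappings must be disposed of before the domain itself goes away */
		irq_dispose_mapping(demo_virq);
		irq_domain_remove(demo_domain);
	}

	module_init(demo_init);
	module_exit(demo_exit);
	MODULE_LICENSE("GPL");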
Signed-off-by: Paul Mundt Signed-off-by: Grant Likely --- kernel/irq/irqdomain.c | 10 ++++++++++ 1 file changed, 10 insertions(+) (limited to 'kernel') diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c index 9cae0b2f509f..01d4a0de061f 100644 --- a/kernel/irq/irqdomain.c +++ b/kernel/irq/irqdomain.c @@ -122,6 +122,7 @@ void irq_domain_remove(struct irq_domain *domain) irq_domain_free(domain); } +EXPORT_SYMBOL_GPL(irq_domain_remove); static unsigned int irq_domain_legacy_revmap(struct irq_domain *domain, irq_hw_number_t hwirq) @@ -209,6 +210,7 @@ struct irq_domain *irq_domain_add_legacy(struct device_node *of_node, irq_domain_add(domain); return domain; } +EXPORT_SYMBOL_GPL(irq_domain_add_legacy); /** * irq_domain_add_linear() - Allocate and register a legacy revmap irq_domain. @@ -238,6 +240,7 @@ struct irq_domain *irq_domain_add_linear(struct device_node *of_node, irq_domain_add(domain); return domain; } +EXPORT_SYMBOL_GPL(irq_domain_add_linear); struct irq_domain *irq_domain_add_nomap(struct device_node *of_node, unsigned int max_irq, @@ -252,6 +255,7 @@ struct irq_domain *irq_domain_add_nomap(struct device_node *of_node, } return domain; } +EXPORT_SYMBOL_GPL(irq_domain_add_nomap); /** * irq_domain_add_tree() @@ -273,6 +277,7 @@ struct irq_domain *irq_domain_add_tree(struct device_node *of_node, } return domain; } +EXPORT_SYMBOL_GPL(irq_domain_add_tree); /** * irq_find_host() - Locates a domain for a given device node @@ -320,6 +325,7 @@ void irq_set_default_host(struct irq_domain *domain) irq_default_domain = domain; } +EXPORT_SYMBOL_GPL(irq_set_default_host); static int irq_setup_virq(struct irq_domain *domain, unsigned int virq, irq_hw_number_t hwirq) @@ -378,6 +384,7 @@ unsigned int irq_create_direct_mapping(struct irq_domain *domain) return virq; } +EXPORT_SYMBOL_GPL(irq_create_direct_mapping); /** * irq_create_mapping() - Map a hardware interrupt into linux irq space @@ -617,6 +624,7 @@ unsigned int irq_radix_revmap_lookup(struct irq_domain *domain, */ return irq_data ? irq_data->irq : irq_find_mapping(domain, hwirq); } +EXPORT_SYMBOL_GPL(irq_radix_revmap_lookup); /** * irq_radix_revmap_insert() - Insert a hw irq to linux irq number mapping. @@ -641,6 +649,7 @@ void irq_radix_revmap_insert(struct irq_domain *domain, unsigned int virq, mutex_unlock(&revmap_trees_mutex); } } +EXPORT_SYMBOL_GPL(irq_radix_revmap_insert); /** * irq_linear_revmap() - Find a linux irq from a hw irq number. @@ -674,6 +683,7 @@ unsigned int irq_linear_revmap(struct irq_domain *domain, return revmap[hwirq]; } +EXPORT_SYMBOL_GPL(irq_linear_revmap); #ifdef CONFIG_IRQ_DOMAIN_DEBUG static int virq_debug_show(struct seq_file *m, void *private) -- cgit v1.2.3 From 5c5806e50b9fd4ca435acdf0eceddda64da9be5b Mon Sep 17 00:00:00 2001 From: Paul Mundt Date: Sat, 19 May 2012 15:11:43 +0900 Subject: irqdomain: Make irq_domain_simple_map() static. Presently irq_domain_simple_map() isn't labelled as static, but there's no definition for it in the public irqdomain header either. At present all in-tree ->map users have meaningful work to do, and all others are using irq_domain_simple_ops directly. Make it static for now, as it can always be exported and added to the public API later. 
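As a sketch of the distinction drawn here (and one possible shape for the my_irq_ops assumed in the earlier sketch): a driver with real per-interrupt setup supplies its own ->map callback rather than leaning on irq_domain_simple_ops; my_chip is an assumption.

	static int my_map(struct irq_domain *d, unsigned int virq,
			  irq_hw_number_t hwirq)
	{
		/* per-irq setup that irq_domain_simple_map() has no business doing */
		irq_set_chip_and_handler(virq, &my_chip, handle_level_irq);
		irq_set_chip_data(virq, d->host_data);
		return 0;
	}

	static const struct irq_domain_ops my_irq_ops = {
		.map	= my_map,
		.xlate	= irq_domain_xlate_onecell,
	};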
Signed-off-by: Paul Mundt Signed-off-by: Grant Likely --- kernel/irq/irqdomain.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c index 01d4a0de061f..92e06442d74c 100644 --- a/kernel/irq/irqdomain.c +++ b/kernel/irq/irqdomain.c @@ -758,8 +758,8 @@ static int __init irq_debugfs_init(void) __initcall(irq_debugfs_init); #endif /* CONFIG_IRQ_DOMAIN_DEBUG */ -int irq_domain_simple_map(struct irq_domain *d, unsigned int irq, - irq_hw_number_t hwirq) +static int irq_domain_simple_map(struct irq_domain *d, unsigned int irq, + irq_hw_number_t hwirq) { return 0; } -- cgit v1.2.3 From 54a9058860f2b7ed8d2fe5bf7be19bb901866dc2 Mon Sep 17 00:00:00 2001 From: Paul Mundt Date: Sat, 19 May 2012 15:11:47 +0900 Subject: irqdomain: trivial pr_fmt conversion. Convert to pr_fmt before things start to get out of hand and some janitors start getting overly excited. Signed-off-by: Paul Mundt Signed-off-by: Grant Likely --- kernel/irq/irqdomain.c | 32 +++++++++++++++++--------------- 1 file changed, 17 insertions(+), 15 deletions(-) (limited to 'kernel') diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c index 92e06442d74c..9a6e8a8747db 100644 --- a/kernel/irq/irqdomain.c +++ b/kernel/irq/irqdomain.c @@ -1,3 +1,5 @@ +#define pr_fmt(fmt) "irq: " fmt + #include #include #include @@ -67,7 +69,7 @@ static void irq_domain_add(struct irq_domain *domain) mutex_lock(&irq_domain_mutex); list_add(&domain->link, &irq_domain_list); mutex_unlock(&irq_domain_mutex); - pr_debug("irq: Allocated domain of type %d @0x%p\n", + pr_debug("Allocated domain of type %d @0x%p\n", domain->revmap_type, domain); } @@ -117,7 +119,7 @@ void irq_domain_remove(struct irq_domain *domain) mutex_unlock(&irq_domain_mutex); - pr_debug("irq: Removed domain of type %d @0x%p\n", + pr_debug("Removed domain of type %d @0x%p\n", domain->revmap_type, domain); irq_domain_free(domain); @@ -321,7 +323,7 @@ EXPORT_SYMBOL_GPL(irq_find_host); */ void irq_set_default_host(struct irq_domain *domain) { - pr_debug("irq: Default domain set to @0x%p\n", domain); + pr_debug("Default domain set to @0x%p\n", domain); irq_default_domain = domain; } @@ -335,7 +337,7 @@ static int irq_setup_virq(struct irq_domain *domain, unsigned int virq, irq_data->hwirq = hwirq; irq_data->domain = domain; if (domain->ops->map(domain, virq, hwirq)) { - pr_debug("irq: -> mapping failed, freeing\n"); + pr_debug("irq-%i==>hwirq-0x%lx mapping failed\n", virq, hwirq); irq_data->domain = NULL; irq_data->hwirq = 0; return -1; @@ -366,7 +368,7 @@ unsigned int irq_create_direct_mapping(struct irq_domain *domain) virq = irq_alloc_desc_from(1, 0); if (!virq) { - pr_debug("irq: create_direct virq allocation failed\n"); + pr_debug("create_direct virq allocation failed\n"); return 0; } if (virq >= domain->revmap_data.nomap.max_irq) { @@ -375,7 +377,7 @@ unsigned int irq_create_direct_mapping(struct irq_domain *domain) irq_free_desc(virq); return 0; } - pr_debug("irq: create_direct obtained virq %d\n", virq); + pr_debug("create_direct obtained virq %d\n", virq); if (irq_setup_virq(domain, virq, virq)) { irq_free_desc(virq); @@ -402,23 +404,23 @@ unsigned int irq_create_mapping(struct irq_domain *domain, unsigned int hint; int virq; - pr_debug("irq: irq_create_mapping(0x%p, 0x%lx)\n", domain, hwirq); + pr_debug("irq_create_mapping(0x%p, 0x%lx)\n", domain, hwirq); /* Look for default domain if nececssary */ if (domain == NULL) domain = irq_default_domain; if (domain == NULL) { - printk(KERN_WARNING 
"irq_create_mapping called for" - " NULL domain, hwirq=%lx\n", hwirq); + pr_warning("irq_create_mapping called for" + " NULL domain, hwirq=%lx\n", hwirq); WARN_ON(1); return 0; } - pr_debug("irq: -> using domain @%p\n", domain); + pr_debug("-> using domain @%p\n", domain); /* Check if mapping already exists */ virq = irq_find_mapping(domain, hwirq); if (virq) { - pr_debug("irq: -> existing mapping on virq %d\n", virq); + pr_debug("-> existing mapping on virq %d\n", virq); return virq; } @@ -434,7 +436,7 @@ unsigned int irq_create_mapping(struct irq_domain *domain, if (virq <= 0) virq = irq_alloc_desc_from(1, 0); if (virq <= 0) { - pr_debug("irq: -> virq allocation failed\n"); + pr_debug("-> virq allocation failed\n"); return 0; } @@ -444,7 +446,7 @@ unsigned int irq_create_mapping(struct irq_domain *domain, return 0; } - pr_debug("irq: irq %lu on domain %s mapped to virtual irq %u\n", + pr_debug("irq %lu on domain %s mapped to virtual irq %u\n", hwirq, domain->of_node ? domain->of_node->full_name : "null", virq); return virq; @@ -473,8 +475,8 @@ unsigned int irq_create_of_mapping(struct device_node *controller, if (intsize > 0) return intspec[0]; #endif - printk(KERN_WARNING "irq: no irq domain found for %s !\n", - controller->full_name); + pr_warning("no irq domain found for %s !\n", + controller->full_name); return 0; } -- cgit v1.2.3 From a87487e687ceafdc696b8cb1fb27e37cc463586b Mon Sep 17 00:00:00 2001 From: Mark Brown Date: Sat, 19 May 2012 12:15:35 +0100 Subject: irqdomain: Document size parameter of irq_domain_add_linear() Signed-off-by: Mark Brown Signed-off-by: Grant Likely --- kernel/irq/irqdomain.c | 1 + 1 file changed, 1 insertion(+) (limited to 'kernel') diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c index 9a6e8a8747db..41c1564103f1 100644 --- a/kernel/irq/irqdomain.c +++ b/kernel/irq/irqdomain.c @@ -217,6 +217,7 @@ EXPORT_SYMBOL_GPL(irq_domain_add_legacy); /** * irq_domain_add_linear() - Allocate and register a legacy revmap irq_domain. * @of_node: pointer to interrupt controller's device tree node. + * @size: Number of interrupts in the domain. * @ops: map/unmap domain callbacks * @host_data: Controller private data pointer */ -- cgit v1.2.3 From 4b06a81f1daee668fbd6de85557bfb36dd36078f Mon Sep 17 00:00:00 2001 From: "Eric W. Biederman" Date: Sat, 19 May 2012 15:44:06 -0600 Subject: userns: Silence silly gcc warning. On 32bit builds gcc says: kernel/user.c:30:4: warning: this decimal constant is unsigned only in ISO C90 [enabled by default] kernel/user.c:38:4: warning: this decimal constant is unsigned only in ISO C90 [enabled by default] Silence gcc by changing the constant 4294967295 to 4294967295U. Signed-off-by: Eric W. Biederman --- kernel/user.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/user.c b/kernel/user.c index f9e420e36699..b815fefbe76f 100644 --- a/kernel/user.c +++ b/kernel/user.c @@ -27,7 +27,7 @@ struct user_namespace init_user_ns = { .extent[0] = { .first = 0, .lower_first = 0, - .count = 4294967295, + .count = 4294967295U, }, }, .gid_map = { @@ -35,7 +35,7 @@ struct user_namespace init_user_ns = { .extent[0] = { .first = 0, .lower_first = 0, - .count = 4294967295, + .count = 4294967295U, }, }, .kref = { -- cgit v1.2.3 From b5e498ad67863cc877fb05e6a71ce58eda4fb2ca Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Fri, 18 May 2012 09:59:57 +0200 Subject: timers: Provide generic Kconfig switches We really don't want all the arch code defining stuff over and over. 
[ anna-maria: Added missing GENERIC_CMOS_UPDATE switch ] Signed-off-by: Thomas Gleixner Signed-off-by: Anna-Maria Gleixner Cc: Paul Mundt Link: http://lkml.kernel.org/r/1337529587.3208.2.camel@dionysos Acked-by: Sam Ravnborg --- kernel/time/Kconfig | 35 +++++++++++++++++++++++++++++++++++ 1 file changed, 35 insertions(+) (limited to 'kernel') diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig index a20dc8a3c949..f6ebc4ff702a 100644 --- a/kernel/time/Kconfig +++ b/kernel/time/Kconfig @@ -33,3 +33,38 @@ config GENERIC_CLOCKEVENTS_BUILD config GENERIC_CLOCKEVENTS_MIN_ADJUST bool + +# Options selectable by arch Kconfig + +# Watchdog function for clocksources to detect instabilities +config CLOCKSOURCE_WATCHDOG + bool + +# Architecture has extra clocksource data +config ARCH_CLOCKSOURCE_DATA + bool + +# Timekeeping vsyscall support +config GENERIC_TIME_VSYSCALL + bool + +# ktime_t scalar 64bit nsec representation +config KTIME_SCALAR + bool + +# Old style timekeeping +config ARCH_USES_GETTIMEOFFSET + bool + +# The generic clock events infrastructure +config GENERIC_CLOCKEVENTS + bool + +# Clockevents broadcasting infrastructure +config GENERIC_CLOCKEVENTS_BROADCAST + bool + depends on GENERIC_CLOCKEVENTS + +# Generic update of CMOS clock +config GENERIC_CMOS_UPDATE + bool -- cgit v1.2.3 From 875682648b89a3ebc06176d60dc280f810647839 Mon Sep 17 00:00:00 2001 From: Richard Weinberger Date: Tue, 17 Apr 2012 22:37:16 +0200 Subject: irq: Remove irq_chip->release() As it's only user (UML) does no longer need it we can get rid of it. Signed-off-by: Richard Weinberger Reviewed-by: Thomas Gleixner --- kernel/irq/manage.c | 6 ------ 1 file changed, 6 deletions(-) (limited to 'kernel') diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c index 89a3ea82569b..9b7f68a00e5e 100644 --- a/kernel/irq/manage.c +++ b/kernel/irq/manage.c @@ -1204,12 +1204,6 @@ static struct irqaction *__free_irq(unsigned int irq, void *dev_id) /* Found it - now remove it from the list of entries: */ *action_ptr = action->next; - /* Currently used only by UML, might disappear one day: */ -#ifdef CONFIG_IRQ_RELEASE_METHOD - if (desc->irq_data.chip->release) - desc->irq_data.chip->release(irq, dev_id); -#endif - /* If this was the last handler, shut down the IRQ line: */ if (!desc->action) irq_shutdown(desc); -- cgit v1.2.3 From 764e0da14fd7ac2d259d98d34ece0a87d32306c9 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Mon, 21 May 2012 23:16:18 +0200 Subject: timers: Fixup the Kconfig consolidation fallout Sigh, I missed to check which architecture Kconfig files actually include the core Kconfig file. There are a few which did not. So we broke them. Instead of adding the includes to those, we are better off to move the include to init/Kconfig like we did already with irqs and others. This does not change anything for the architectures using the old style periodic timer mode. It just solves the build wreckage there. For those architectures which use the clock events infrastructure it moves the include of the core Kconfig file to "General setup" which is a way more logical place than having it at random locations specified by the architecture specific Kconfigs. 
Reported-by: Ingo Molnar Cc: Anna-Maria Gleixner Signed-off-by: Thomas Gleixner --- kernel/time/Kconfig | 73 ++++++++++++++++++++++++++++++----------------------- 1 file changed, 41 insertions(+), 32 deletions(-) (limited to 'kernel') diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig index f6ebc4ff702a..fd42bd452b75 100644 --- a/kernel/time/Kconfig +++ b/kernel/time/Kconfig @@ -2,38 +2,6 @@ # Timer subsystem related configuration options # -# Core internal switch. Selected by NO_HZ / HIGH_RES_TIMERS. This is -# only related to the tick functionality. Oneshot clockevent devices -# are supported independ of this. -config TICK_ONESHOT - bool - -config NO_HZ - bool "Tickless System (Dynamic Ticks)" - depends on !ARCH_USES_GETTIMEOFFSET && GENERIC_CLOCKEVENTS - select TICK_ONESHOT - help - This option enables a tickless system: timer interrupts will - only trigger on an as-needed basis both when the system is - busy and when the system is idle. - -config HIGH_RES_TIMERS - bool "High Resolution Timer Support" - depends on !ARCH_USES_GETTIMEOFFSET && GENERIC_CLOCKEVENTS - select TICK_ONESHOT - help - This option enables high resolution timer support. If your - hardware is not capable then this option only increases - the size of the kernel image. - -config GENERIC_CLOCKEVENTS_BUILD - bool - default y - depends on GENERIC_CLOCKEVENTS - -config GENERIC_CLOCKEVENTS_MIN_ADJUST - bool - # Options selectable by arch Kconfig # Watchdog function for clocksources to detect instabilities @@ -60,11 +28,52 @@ config ARCH_USES_GETTIMEOFFSET config GENERIC_CLOCKEVENTS bool +# Migration helper. Builds, but does not invoke +config GENERIC_CLOCKEVENTS_BUILD + bool + default y + depends on GENERIC_CLOCKEVENTS + # Clockevents broadcasting infrastructure config GENERIC_CLOCKEVENTS_BROADCAST bool depends on GENERIC_CLOCKEVENTS +# Automatically adjust the min. reprogramming time for +# clock event device +config GENERIC_CLOCKEVENTS_MIN_ADJUST + bool + # Generic update of CMOS clock config GENERIC_CMOS_UPDATE bool + +if GENERIC_CLOCKEVENTS +menu "Timers subsystem" + +# Core internal switch. Selected by NO_HZ / HIGH_RES_TIMERS. This is +# only related to the tick functionality. Oneshot clockevent devices +# are supported independ of this. +config TICK_ONESHOT + bool + +config NO_HZ + bool "Tickless System (Dynamic Ticks)" + depends on !ARCH_USES_GETTIMEOFFSET && GENERIC_CLOCKEVENTS + select TICK_ONESHOT + help + This option enables a tickless system: timer interrupts will + only trigger on an as-needed basis both when the system is + busy and when the system is idle. + +config HIGH_RES_TIMERS + bool "High Resolution Timer Support" + depends on !ARCH_USES_GETTIMEOFFSET && GENERIC_CLOCKEVENTS + select TICK_ONESHOT + help + This option enables high resolution timer support. If your + hardware is not capable then this option only increases + the size of the kernel image. + +endmenu +endif -- cgit v1.2.3 From dd48d708ff3e917f6d6b6c2b696c3f18c019feed Mon Sep 17 00:00:00 2001 From: Richard Cochran Date: Thu, 26 Apr 2012 14:11:32 +0200 Subject: ntp: Correct TAI offset during leap second When repeating a UTC time value during a leap second (when the UTC time should be 23:59:60), the TAI timescale should not stop. The kernel NTP code increments the TAI offset one second too late. This patch fixes the issue by incrementing the offset during the leap second itself. 
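As a purely illustrative aside (not code from the patch): the fix is observable from userspace through adjtimex(2), whose tai field should now step exactly at the 23:59:60 boundary rather than one second later, while the return value walks through the TIME_INS/TIME_OOP/TIME_WAIT states.

	#include <stdio.h>
	#include <sys/timex.h>

	int main(void)
	{
		struct timex tx = { .modes = 0 };	/* read-only query */
		int state = adjtimex(&tx);		/* TIME_OK, TIME_INS, TIME_OOP, ... */

		printf("clock state %d, TAI-UTC offset %d s\n", state, tx.tai);
		return 0;
	}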
Signed-off-by: Richard Cochran Signed-off-by: John Stultz --- kernel/time/ntp.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/time/ntp.c b/kernel/time/ntp.c index f03fd83b170b..e8c867173ae5 100644 --- a/kernel/time/ntp.c +++ b/kernel/time/ntp.c @@ -412,6 +412,7 @@ int second_overflow(unsigned long secs) if (secs % 86400 == 0) { leap = -1; time_state = TIME_OOP; + time_tai++; printk(KERN_NOTICE "Clock: inserting leap second 23:59:60 UTC\n"); } @@ -426,7 +427,6 @@ int second_overflow(unsigned long secs) } break; case TIME_OOP: - time_tai++; time_state = TIME_WAIT; break; -- cgit v1.2.3 From cd5398bed9296d1ddb21630ac17e90cd19a5c62e Mon Sep 17 00:00:00 2001 From: Richard Cochran Date: Fri, 27 Apr 2012 10:12:41 +0200 Subject: ntp: Fix a stale comment and a few stray newlines. Signed-off-by: Richard Cochran Signed-off-by: John Stultz --- kernel/time/ntp.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/time/ntp.c b/kernel/time/ntp.c index e8c867173ae5..70b33abcc7bb 100644 --- a/kernel/time/ntp.c +++ b/kernel/time/ntp.c @@ -473,8 +473,6 @@ int second_overflow(unsigned long secs) << NTP_SCALE_SHIFT; time_adjust = 0; - - out: spin_unlock_irqrestore(&ntp_lock, flags); @@ -559,10 +557,10 @@ static inline void process_adj_status(struct timex *txc, struct timespec *ts) /* only set allowed bits */ time_status &= STA_RONLY; time_status |= txc->status & ~STA_RONLY; - } + /* - * Called with the xtime lock held, so we can access and modify + * Called with ntp_lock held, so we can access and modify * all the global NTP state: */ static inline void process_adjtimex_modes(struct timex *txc, struct timespec *ts) -- cgit v1.2.3 From d239f49d77ad9ffa442e700db3cab06d8b414cd1 Mon Sep 17 00:00:00 2001 From: Richard Cochran Date: Fri, 27 Apr 2012 10:12:42 +0200 Subject: timekeeping: Fix a few minor newline issues. Fix a few minor newline issues. Signed-off-by: Richard Cochran Signed-off-by: John Stultz --- kernel/time/timekeeping.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c index d66b21308f7c..6e46cacf5969 100644 --- a/kernel/time/timekeeping.c +++ b/kernel/time/timekeeping.c @@ -240,7 +240,6 @@ void getnstimeofday(struct timespec *ts) timespec_add_ns(ts, nsecs); } - EXPORT_SYMBOL(getnstimeofday); ktime_t ktime_get(void) @@ -357,8 +356,8 @@ void do_gettimeofday(struct timeval *tv) tv->tv_sec = now.tv_sec; tv->tv_usec = now.tv_nsec/1000; } - EXPORT_SYMBOL(do_gettimeofday); + /** * do_settimeofday - Sets the time of day * @tv: pointer to the timespec variable containing the new time @@ -392,7 +391,6 @@ int do_settimeofday(const struct timespec *tv) return 0; } - EXPORT_SYMBOL(do_settimeofday); -- cgit v1.2.3 From 68f3f16d9ad0f1e28ab3fd0001ab5798c41f15a3 Mon Sep 17 00:00:00 2001 From: Al Viro Date: Mon, 21 May 2012 21:42:32 -0400 Subject: new helper: sigsuspend() guts of saved_sigmask-based sigsuspend/rt_sigsuspend. Takes kernel sigset_t *. Open-coded instances replaced with calling it. 
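Not part of this patch, but a sketch of the intended use on the architecture side: an old-style sigsuspend entry point reduces to building the kernel sigset and calling the new helper (the old_sigset_t calling convention is assumed).

	asmlinkage long sys_sigsuspend(old_sigset_t mask)
	{
		sigset_t blocked;

		siginitset(&blocked, mask);
		return sigsuspend(&blocked);
	}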
Signed-off-by: Al Viro --- kernel/compat.c | 10 +--------- kernel/signal.c | 25 ++++++++++++++++--------- 2 files changed, 17 insertions(+), 18 deletions(-) (limited to 'kernel') diff --git a/kernel/compat.c b/kernel/compat.c index d2c67aa49ae6..c28a306ae05c 100644 --- a/kernel/compat.c +++ b/kernel/compat.c @@ -1073,15 +1073,7 @@ asmlinkage long compat_sys_rt_sigsuspend(compat_sigset_t __user *unewset, compat if (copy_from_user(&newset32, unewset, sizeof(compat_sigset_t))) return -EFAULT; sigset_from_compat(&newset, &newset32); - sigdelsetmask(&newset, sigmask(SIGKILL)|sigmask(SIGSTOP)); - - current->saved_sigmask = current->blocked; - set_current_blocked(&newset); - - current->state = TASK_INTERRUPTIBLE; - schedule(); - set_restore_sigmask(); - return -ERESTARTNOHAND; + return sigsuspend(&newset); } #endif /* __ARCH_WANT_COMPAT_SYS_RT_SIGSUSPEND */ diff --git a/kernel/signal.c b/kernel/signal.c index 17afcaf582d0..3ad220a81619 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -3236,6 +3236,21 @@ SYSCALL_DEFINE0(pause) #endif +#ifdef HAVE_SET_RESTORE_SIGMASK +int sigsuspend(sigset_t *set) +{ + sigdelsetmask(set, sigmask(SIGKILL)|sigmask(SIGSTOP)); + + current->saved_sigmask = current->blocked; + set_current_blocked(set); + + current->state = TASK_INTERRUPTIBLE; + schedule(); + set_restore_sigmask(); + return -ERESTARTNOHAND; +} +#endif + #ifdef __ARCH_WANT_SYS_RT_SIGSUSPEND /** * sys_rt_sigsuspend - replace the signal mask for a value with the @@ -3253,15 +3268,7 @@ SYSCALL_DEFINE2(rt_sigsuspend, sigset_t __user *, unewset, size_t, sigsetsize) if (copy_from_user(&newset, unewset, sizeof(newset))) return -EFAULT; - sigdelsetmask(&newset, sigmask(SIGKILL)|sigmask(SIGSTOP)); - - current->saved_sigmask = current->blocked; - set_current_blocked(&newset); - - current->state = TASK_INTERRUPTIBLE; - schedule(); - set_restore_sigmask(); - return -ERESTARTNOHAND; + return sigsuspend(&newset); } #endif /* __ARCH_WANT_SYS_RT_SIGSUSPEND */ -- cgit v1.2.3 From ef26a5a6eadb7cd0637e1e9e246cd42505b8ec8c Mon Sep 17 00:00:00 2001 From: David Howells Date: Tue, 22 May 2012 15:56:13 +0100 Subject: Guard check in module loader against integer overflow The check: if (len < hdr->e_shoff + hdr->e_shnum * sizeof(Elf_Shdr)) may not work if there's an overflow in the right-hand side of the condition. Signed-off-by: David Howells Signed-off-by: Rusty Russell --- kernel/module.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/module.c b/kernel/module.c index a4e60973ca73..4edbd9c11aca 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -2429,7 +2429,8 @@ static int copy_and_check(struct load_info *info, goto free_hdr; } - if (len < hdr->e_shoff + hdr->e_shnum * sizeof(Elf_Shdr)) { + if (hdr->e_shoff >= len || + hdr->e_shnum * sizeof(Elf_Shdr) > len - hdr->e_shoff) { err = -ENOEXEC; goto free_hdr; } -- cgit v1.2.3 From ab0cce560ef177bdc7a8f73e9962be9d829a7b2c Mon Sep 17 00:00:00 2001 From: Jiri Olsa Date: Wed, 23 May 2012 13:13:02 +0200 Subject: Revert "sched, perf: Use a single callback into the scheduler" This reverts commit cb04ff9ac424 ("sched, perf: Use a single callback into the scheduler"). Before this change was introduced, the process switch worked like this (wrt. 
to perf event schedule): schedule (prev, next) - schedule out all perf events for prev - switch to next - schedule in all perf events for current (next) After the commit, the process switch looks like: schedule (prev, next) - schedule out all perf events for prev - schedule in all perf events for (next) - switch to next The problem is, that after we schedule perf events in, the pmu is enabled and we can receive events even before we make the switch to next - so "current" still being prev process (event SAMPLE data are filled based on the value of the "current" process). Thats exactly what we see for test__PERF_RECORD test. We receive SAMPLES with PID of the process that our tracee is scheduled from. Discussed with Peter Zijlstra: > Bah!, yeah I guess reverting is the right thing for now. Sad > though. > > So by having the two hooks we have a black-spot between them > where we receive no events at all, this black-spot covers the > hand-over of current and we thus don't receive the 'wrong' > events. > > I rather liked we could do away with both that black-spot and > clean up the code a little, but apparently people rely on it. Signed-off-by: Jiri Olsa Acked-by: Peter Zijlstra Cc: acme@redhat.com Cc: paulus@samba.org Cc: cjashfor@linux.vnet.ibm.com Cc: fweisbec@gmail.com Cc: eranian@google.com Link: http://lkml.kernel.org/r/20120523111302.GC1638@m.brq.redhat.com Signed-off-by: Ingo Molnar --- kernel/events/core.c | 14 ++++---------- kernel/sched/core.c | 9 ++++++++- 2 files changed, 12 insertions(+), 11 deletions(-) (limited to 'kernel') diff --git a/kernel/events/core.c b/kernel/events/core.c index 91a445925855..5b06cbbf6931 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -2039,8 +2039,8 @@ static void perf_event_context_sched_out(struct task_struct *task, int ctxn, * accessing the event control register. If a NMI hits, then it will * not restart the event. */ -static void __perf_event_task_sched_out(struct task_struct *task, - struct task_struct *next) +void __perf_event_task_sched_out(struct task_struct *task, + struct task_struct *next) { int ctxn; @@ -2279,8 +2279,8 @@ static void perf_branch_stack_sched_in(struct task_struct *prev, * accessing the event control register. If a NMI hits, then it will * keep the event running. 
*/ -static void __perf_event_task_sched_in(struct task_struct *prev, - struct task_struct *task) +void __perf_event_task_sched_in(struct task_struct *prev, + struct task_struct *task) { struct perf_event_context *ctx; int ctxn; @@ -2305,12 +2305,6 @@ static void __perf_event_task_sched_in(struct task_struct *prev, perf_branch_stack_sched_in(prev, task); } -void __perf_event_task_sched(struct task_struct *prev, struct task_struct *next) -{ - __perf_event_task_sched_out(prev, next); - __perf_event_task_sched_in(prev, next); -} - static u64 perf_calculate_period(struct perf_event *event, u64 nsec, u64 count) { u64 frequency = event->attr.sample_freq; diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 13c38837f2cd..0533a688ce22 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -1913,7 +1913,7 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev, struct task_struct *next) { sched_info_switch(prev, next); - perf_event_task_sched(prev, next); + perf_event_task_sched_out(prev, next); fire_sched_out_preempt_notifiers(prev, next); prepare_lock_switch(rq, next); prepare_arch_switch(next); @@ -1956,6 +1956,13 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev) */ prev_state = prev->state; finish_arch_switch(prev); +#ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW + local_irq_disable(); +#endif /* __ARCH_WANT_INTERRUPTS_ON_CTXSW */ + perf_event_task_sched_in(prev, current); +#ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW + local_irq_enable(); +#endif /* __ARCH_WANT_INTERRUPTS_ON_CTXSW */ finish_lock_switch(rq, prev); finish_arch_post_lock_switch(); -- cgit v1.2.3 From 6a31e1f135d1abfb5137697f889c8cd5d72eb522 Mon Sep 17 00:00:00 2001 From: Steven Rostedt Date: Wed, 23 May 2012 15:35:17 -0400 Subject: ring-buffer: Check for valid buffer before changing size On some machines the number of possible CPUS is not the same as the number of CPUs that is on the machine. Ftrace uses possible_cpus to update the tracing structures but the ring buffer only allocates per cpu buffers for online CPUs when they come up. When the wakeup tracer was enabled in such a case, the ftrace code enabled all possible cpu buffers, but the code in ring_buffer_resize() did not check to see if the buffer in question was allocated. Since boot up CPUs did not match possible CPUs it caused the following crash: BUG: unable to handle kernel NULL pointer dereference at 00000020 IP: [] ring_buffer_resize+0x16a/0x28d *pde = 00000000 Oops: 0000 [#1] PREEMPT SMP Dumping ftrace buffer: (ftrace buffer empty) Modules linked in: [last unloaded: scsi_wait_scan] Pid: 1387, comm: bash Not tainted 3.4.0-test+ #13 /DG965MQ EIP: 0060:[] EFLAGS: 00010217 CPU: 0 EIP is at ring_buffer_resize+0x16a/0x28d EAX: f5a14340 EBX: f6026b80 ECX: 00000ff4 EDX: 00000ff3 ESI: 00000000 EDI: 00000002 EBP: f4275ecc ESP: f4275eb0 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 CR0: 80050033 CR2: 00000020 CR3: 34396000 CR4: 000007d0 DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000 DR6: ffff0ff0 DR7: 00000400 Process bash (pid: 1387, ti=f4274000 task=f4380cb0 task.ti=f4274000) Stack: c109cf9a f6026b98 00000162 00160f68 00000006 00160f68 00000002 f4275ef0 c109d013 f4275ee8 c123b72a c1c0bf00 c1cc81dc 00000005 f4275f98 00000007 f4275f70 c109d0c7 7700000e 75656b61 00000070 f5e90900 f5c4e198 00000301 Call Trace: [] ? tracing_set_tracer+0x115/0x1e9 [] tracing_set_tracer+0x18e/0x1e9 [] ? _copy_from_user+0x30/0x46 [] tracing_set_trace_write+0x59/0x7f [] ? fput+0x18/0x1c6 [] ? security_file_permission+0x27/0x2b [] ? 
rw_verify_area+0xcf/0xf2 [] ? fput+0x18/0x1c6 [] ? tracing_set_tracer+0x1e9/0x1e9 [] vfs_write+0x8b/0xe3 [] ? fget_light+0x30/0x81 [] sys_write+0x42/0x63 [] sysenter_do_call+0x12/0x28 This happens with the latency tracer as the ftrace code updates the saved max buffer via its cpumask and not with a global setting. Adding a check in ring_buffer_resize() to make sure the buffer being resized exists, fixes the problem. Cc: Vaibhav Nagarnaik Signed-off-by: Steven Rostedt --- kernel/trace/ring_buffer.c | 5 +++++ 1 file changed, 5 insertions(+) (limited to 'kernel') diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 6420cda62336..1d0f6a8a0e5e 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -1486,6 +1486,11 @@ int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size, if (!buffer) return size; + /* Make sure the requested buffer exists */ + if (cpu_id != RING_BUFFER_ALL_CPUS && + !cpumask_test_cpu(cpu_id, buffer->cpumask)) + return size; + size = DIV_ROUND_UP(size, BUF_PAGE_SIZE); size *= BUF_PAGE_SIZE; -- cgit v1.2.3 From e73f8959af0439d114847eab5a8a5ce48f1217c4 Mon Sep 17 00:00:00 2001 From: Oleg Nesterov Date: Fri, 11 May 2012 10:59:07 +1000 Subject: task_work_add: generic process-context callbacks Provide a simple mechanism that allows running code in the (nonatomic) context of the arbitrary task. The caller does task_work_add(task, task_work) and this task executes task_work->func() either from do_notify_resume() or from do_exit(). The callback can rely on PF_EXITING to detect the latter case. "struct task_work" can be embedded in another struct, still it has "void *data" to handle the most common/simple case. This allows us to kill the ->replacement_session_keyring hack, and potentially this can have more users. Performance-wise, this adds 2 "unlikely(!hlist_empty())" checks into tracehook_notify_resume() and do_exit(). But at the same time we can remove the "replacement_session_keyring != NULL" checks from arch/*/signal.c and exit_creds(). Note: task_work_add/task_work_run abuses ->pi_lock. This is only because this lock is already used by lookup_pi_state() to synchronize with do_exit() setting PF_EXITING. Fortunately the scope of this lock in task_work.c is really tiny, and the code is unlikely anyway. Signed-off-by: Oleg Nesterov Acked-by: David Howells Cc: Thomas Gleixner Cc: Richard Kuo Cc: Linus Torvalds Cc: Alexander Gordeev Cc: Chris Zankel Cc: David Smith Cc: "Frank Ch. 
Eigler" Cc: Geert Uytterhoeven Cc: Larry Woodman Cc: Peter Zijlstra Cc: Tejun Heo Cc: Ingo Molnar Signed-off-by: Andrew Morton Signed-off-by: Al Viro --- kernel/Makefile | 2 +- kernel/exit.c | 5 +++- kernel/fork.c | 1 + kernel/task_work.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ 4 files changed, 90 insertions(+), 2 deletions(-) create mode 100644 kernel/task_work.c (limited to 'kernel') diff --git a/kernel/Makefile b/kernel/Makefile index 6c07f30fa9b7..bf1034008aca 100644 --- a/kernel/Makefile +++ b/kernel/Makefile @@ -5,7 +5,7 @@ obj-y = fork.o exec_domain.o panic.o printk.o \ cpu.o exit.o itimer.o time.o softirq.o resource.o \ sysctl.o sysctl_binary.o capability.o ptrace.o timer.o user.o \ - signal.o sys.o kmod.o workqueue.o pid.o \ + signal.o sys.o kmod.o workqueue.o pid.o task_work.o \ rcupdate.o extable.o params.o posix-timers.o \ kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o mutex.o \ hrtimer.o rwsem.o nsproxy.o srcu.o semaphore.o \ diff --git a/kernel/exit.c b/kernel/exit.c index 910a0716e17a..3d93325e0b1a 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -946,11 +946,14 @@ void do_exit(long code) exit_signals(tsk); /* sets PF_EXITING */ /* * tsk->flags are checked in the futex code to protect against - * an exiting task cleaning up the robust pi futexes. + * an exiting task cleaning up the robust pi futexes, and in + * task_work_add() to avoid the race with exit_task_work(). */ smp_mb(); raw_spin_unlock_wait(&tsk->pi_lock); + exit_task_work(tsk); + exit_irq_thread(); if (unlikely(in_atomic())) diff --git a/kernel/fork.c b/kernel/fork.c index 05c813dc9ecc..a46db217a589 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1411,6 +1411,7 @@ static struct task_struct *copy_process(unsigned long clone_flags, */ p->group_leader = p; INIT_LIST_HEAD(&p->thread_group); + INIT_HLIST_HEAD(&p->task_works); /* Now that the task is set up, run cgroup callbacks if * necessary. We need to run them before the task is visible diff --git a/kernel/task_work.c b/kernel/task_work.c new file mode 100644 index 000000000000..82d1c794066d --- /dev/null +++ b/kernel/task_work.c @@ -0,0 +1,84 @@ +#include +#include +#include + +int +task_work_add(struct task_struct *task, struct task_work *twork, bool notify) +{ + unsigned long flags; + int err = -ESRCH; + +#ifndef TIF_NOTIFY_RESUME + if (notify) + return -ENOTSUPP; +#endif + /* + * We must not insert the new work if the task has already passed + * exit_task_work(). We rely on do_exit()->raw_spin_unlock_wait() + * and check PF_EXITING under pi_lock. + */ + raw_spin_lock_irqsave(&task->pi_lock, flags); + if (likely(!(task->flags & PF_EXITING))) { + hlist_add_head(&twork->hlist, &task->task_works); + err = 0; + } + raw_spin_unlock_irqrestore(&task->pi_lock, flags); + + /* test_and_set_bit() implies mb(), see tracehook_notify_resume(). 
*/ + if (likely(!err) && notify) + set_notify_resume(task); + return err; +} + +struct task_work * +task_work_cancel(struct task_struct *task, task_work_func_t func) +{ + unsigned long flags; + struct task_work *twork; + struct hlist_node *pos; + + raw_spin_lock_irqsave(&task->pi_lock, flags); + hlist_for_each_entry(twork, pos, &task->task_works, hlist) { + if (twork->func == func) { + hlist_del(&twork->hlist); + goto found; + } + } + twork = NULL; + found: + raw_spin_unlock_irqrestore(&task->pi_lock, flags); + + return twork; +} + +void task_work_run(void) +{ + struct task_struct *task = current; + struct hlist_head task_works; + struct hlist_node *pos; + + raw_spin_lock_irq(&task->pi_lock); + hlist_move_list(&task->task_works, &task_works); + raw_spin_unlock_irq(&task->pi_lock); + + if (unlikely(hlist_empty(&task_works))) + return; + /* + * We use hlist to save the space in task_struct, but we want fifo. + * Find the last entry, the list should be short, then process them + * in reverse order. + */ + for (pos = task_works.first; pos->next; pos = pos->next) + ; + + for (;;) { + struct hlist_node **pprev = pos->pprev; + struct task_work *twork = container_of(pos, struct task_work, + hlist); + twork->func(twork); + + if (pprev == &task_works.first) + break; + pos = container_of(pprev, struct hlist_node, next); + } +} -- cgit v1.2.3 From 4d1d61a6b203d957777d73fcebf19d90b038b5b2 Mon Sep 17 00:00:00 2001 From: Oleg Nesterov Date: Fri, 11 May 2012 10:59:08 +1000 Subject: genirq: reimplement exit_irq_thread() hook via task_work_add() exit_irq_thread() and task->irq_thread are needed to handle the unexpected (and unlikely) exit of irq-thread. We can use task_work instead and make this all private to kernel/irq/manage.c, cleanup plus micro-optimization. 1. rename exit_irq_thread() to irq_thread_dtor(), make it static, and move it up before irq_thread(). 2. change irq_thread() to do task_work_add(irq_thread_dtor) at the start and task_work_cancel() before return. tracehook_notify_resume() can never play with kthreads, only do_exit()->exit_task_work() can call the callback and this is what we want. 3. remove task_struct->irq_thread and the special hook in do_exit(). Signed-off-by: Oleg Nesterov Reviewed-by: Thomas Gleixner Cc: David Howells Cc: Richard Kuo Cc: Linus Torvalds Cc: Alexander Gordeev Cc: Chris Zankel Cc: David Smith Cc: "Frank Ch. 
Eigler" Cc: Geert Uytterhoeven Cc: Larry Woodman Cc: Peter Zijlstra Cc: Tejun Heo Cc: Ingo Molnar Signed-off-by: Andrew Morton Signed-off-by: Al Viro --- kernel/exit.c | 2 -- kernel/irq/manage.c | 68 ++++++++++++++++++++++++++--------------------------- 2 files changed, 33 insertions(+), 37 deletions(-) (limited to 'kernel') diff --git a/kernel/exit.c b/kernel/exit.c index 3d93325e0b1a..3ecd096e5d4d 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -954,8 +954,6 @@ void do_exit(long code) exit_task_work(tsk); - exit_irq_thread(); - if (unlikely(in_atomic())) printk(KERN_INFO "note: %s[%d] exited with preempt_count %d\n", current->comm, task_pid_nr(current), diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c index bb32326afe87..4d1f8f897414 100644 --- a/kernel/irq/manage.c +++ b/kernel/irq/manage.c @@ -14,6 +14,7 @@ #include #include #include +#include #include "internals.h" @@ -773,11 +774,39 @@ static void wake_threads_waitq(struct irq_desc *desc) wake_up(&desc->wait_for_threads); } +static void irq_thread_dtor(struct task_work *unused) +{ + struct task_struct *tsk = current; + struct irq_desc *desc; + struct irqaction *action; + + if (WARN_ON_ONCE(!(current->flags & PF_EXITING))) + return; + + action = kthread_data(tsk); + + pr_err("genirq: exiting task \"%s\" (%d) is an active IRQ thread (irq %d)\n", + tsk->comm ? tsk->comm : "", tsk->pid, action->irq); + + + desc = irq_to_desc(action->irq); + /* + * If IRQTF_RUNTHREAD is set, we need to decrement + * desc->threads_active and wake possible waiters. + */ + if (test_and_clear_bit(IRQTF_RUNTHREAD, &action->thread_flags)) + wake_threads_waitq(desc); + + /* Prevent a stale desc->threads_oneshot */ + irq_finalize_oneshot(desc, action); +} + /* * Interrupt handler thread */ static int irq_thread(void *data) { + struct task_work on_exit_work; static const struct sched_param param = { .sched_priority = MAX_USER_RT_PRIO/2, }; @@ -793,7 +822,9 @@ static int irq_thread(void *data) handler_fn = irq_thread_fn; sched_setscheduler(current, SCHED_FIFO, ¶m); - current->irq_thread = 1; + + init_task_work(&on_exit_work, irq_thread_dtor, NULL); + task_work_add(current, &on_exit_work, false); while (!irq_wait_for_interrupt(action)) { irqreturn_t action_ret; @@ -815,44 +846,11 @@ static int irq_thread(void *data) * cannot touch the oneshot mask at this point anymore as * __setup_irq() might have given out currents thread_mask * again. - * - * Clear irq_thread. Otherwise exit_irq_thread() would make - * fuzz about an active irq thread going into nirvana. */ - current->irq_thread = 0; + task_work_cancel(current, irq_thread_dtor); return 0; } -/* - * Called from do_exit() - */ -void exit_irq_thread(void) -{ - struct task_struct *tsk = current; - struct irq_desc *desc; - struct irqaction *action; - - if (!tsk->irq_thread) - return; - - action = kthread_data(tsk); - - pr_err("genirq: exiting task \"%s\" (%d) is an active IRQ thread (irq %d)\n", - tsk->comm ? tsk->comm : "", tsk->pid, action->irq); - - desc = irq_to_desc(action->irq); - - /* - * If IRQTF_RUNTHREAD is set, we need to decrement - * desc->threads_active and wake possible waiters. 
- */ - if (test_and_clear_bit(IRQTF_RUNTHREAD, &action->thread_flags)) - wake_threads_waitq(desc); - - /* Prevent a stale desc->threads_oneshot */ - irq_finalize_oneshot(desc, action); -} - static void irq_setup_forced_threading(struct irqaction *new) { if (!force_irqthreads) -- cgit v1.2.3 From f23ca335462e3c84f13270b9e65f83936068ec2c Mon Sep 17 00:00:00 2001 From: Oleg Nesterov Date: Fri, 11 May 2012 10:59:09 +1000 Subject: keys: kill task_struct->replacement_session_keyring Kill the no longer used task_struct->replacement_session_keyring, update copy_creds() and exit_creds(). Signed-off-by: Oleg Nesterov Acked-by: David Howells Cc: Thomas Gleixner Cc: Richard Kuo Cc: Linus Torvalds Cc: Alexander Gordeev Cc: Chris Zankel Cc: David Smith Cc: "Frank Ch. Eigler" Cc: Geert Uytterhoeven Cc: Larry Woodman Cc: Peter Zijlstra Cc: Tejun Heo Cc: Ingo Molnar Signed-off-by: Andrew Morton Signed-off-by: Al Viro --- kernel/cred.c | 9 --------- 1 file changed, 9 deletions(-) (limited to 'kernel') diff --git a/kernel/cred.c b/kernel/cred.c index 430557ea488f..de728ac50d82 100644 --- a/kernel/cred.c +++ b/kernel/cred.c @@ -207,13 +207,6 @@ void exit_creds(struct task_struct *tsk) validate_creds(cred); alter_cred_subscribers(cred, -1); put_cred(cred); - - cred = (struct cred *) tsk->replacement_session_keyring; - if (cred) { - tsk->replacement_session_keyring = NULL; - validate_creds(cred); - put_cred(cred); - } } /** @@ -396,8 +389,6 @@ int copy_creds(struct task_struct *p, unsigned long clone_flags) struct cred *new; int ret; - p->replacement_session_keyring = NULL; - if ( #ifdef CONFIG_KEYS !p->cred->thread_keyring && -- cgit v1.2.3 From 31fe62b9586643953f0c0c37a6357dafc69034e2 Mon Sep 17 00:00:00 2001 From: Tim Bird Date: Wed, 23 May 2012 13:33:35 +0000 Subject: mm: add a low limit to alloc_large_system_hash UDP stack needs a minimum hash size value for proper operation and also uses alloc_large_system_hash() for proper NUMA distribution of its hash tables and automatic sizing depending on available system memory. On some low memory situations, udp_table_init() must ignore the alloc_large_system_hash() result and reallocs a bigger memory area. As we cannot easily free old hash table, we leak it and kmemleak can issue a warning. This patch adds a low limit parameter to alloc_large_system_hash() to solve this problem. We then specify UDP_HTABLE_SIZE_MIN for UDP/UDPLite hash table allocation. Reported-by: Mark Asselstine Reported-by: Tim Bird Signed-off-by: Eric Dumazet Cc: Paul Gortmaker Signed-off-by: David S. Miller --- kernel/pid.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/pid.c b/kernel/pid.c index 9f08dfabaf13..e86b291ad834 100644 --- a/kernel/pid.c +++ b/kernel/pid.c @@ -547,7 +547,8 @@ void __init pidhash_init(void) pid_hash = alloc_large_system_hash("PID", sizeof(*pid_hash), 0, 18, HASH_EARLY | HASH_SMALL, - &pidhash_shift, NULL, 4096); + &pidhash_shift, NULL, + 0, 4096); pidhash_size = 1U << pidhash_shift; for (i = 0; i < pidhash_size; i++) -- cgit v1.2.3 From 23812b9d9e497580d38c62ebdc6f308733b0a32a Mon Sep 17 00:00:00 2001 From: Ning Jiang Date: Tue, 22 May 2012 00:19:20 +0800 Subject: genirq: Add IRQS_PENDING for nested and simple irq Every interrupt which is an active wakeup source needs the ability to abort suspend if there is a pending irq. Right now only edge and level irqs can do that. 
| +---------+ | INTC | +---------+ | GPIO_IRQ +------------+ | gpio-exp | +------------+ | | GPIO0_IRQ GPIO1_IRQ In the above diagram, gpio expander has irq number GPIO_IRQ, it is connected with two sub GPIO pins, GPIO0 and GPIO1. During suspend, we set IRQF_NO_SUSPEND for GPIO_IRQ so that gpio expander driver can handle the sub irq GPIO0_IRQ and GPIO1_IRQ, and these two irqs themselves can further be handled by simple or nested irq in some drivers(typically gpio and mfd driver). If they are used as wakeup sources during suspend, we want them to be able to abort suspend too. Setting IRQS_PENDING flag in handle_nested_irq() and handle_simple_irq() when the irq is disabled allows check_wakeup_irqs() to identify such irqs as source for aborting suspend. Signed-off-by: Ning Jiang Cc: rjw@sisk.pl Link: http://lkml.kernel.org/r/CAH3Oq6T905%2B3fkF43NAMMFvJvq7dsk_so6T2vQ8ZJrA5xiU3YA@mail.gmail.com Signed-off-by: Thomas Gleixner --- kernel/irq/chip.c | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c index fc275e4f629b..eebd6d5cfb44 100644 --- a/kernel/irq/chip.c +++ b/kernel/irq/chip.c @@ -275,8 +275,10 @@ void handle_nested_irq(unsigned int irq) kstat_incr_irqs_this_cpu(irq, desc); action = desc->action; - if (unlikely(!action || irqd_irq_disabled(&desc->irq_data))) + if (unlikely(!action || irqd_irq_disabled(&desc->irq_data))) { + desc->istate |= IRQS_PENDING; goto out_unlock; + } irqd_set(&desc->irq_data, IRQD_IRQ_INPROGRESS); raw_spin_unlock_irq(&desc->lock); @@ -324,8 +326,10 @@ handle_simple_irq(unsigned int irq, struct irq_desc *desc) desc->istate &= ~(IRQS_REPLAY | IRQS_WAITING); kstat_incr_irqs_this_cpu(irq, desc); - if (unlikely(!desc->action || irqd_irq_disabled(&desc->irq_data))) + if (unlikely(!desc->action || irqd_irq_disabled(&desc->irq_data))) { + desc->istate |= IRQS_PENDING; goto out_unlock; + } handle_irq_event(desc); -- cgit v1.2.3 From 818b0f3bfb236ae66cac3ff38e86b9e47f24b7aa Mon Sep 17 00:00:00 2001 From: Jiang Liu Date: Fri, 30 Mar 2012 23:11:34 +0800 Subject: genirq: Introduce irq_do_set_affinity() to reduce duplicated code All invocations of chip->irq_set_affinity() are doing the same return value checks. Let them all use a common function. 
[ tglx: removed the silly likely while at it ] Signed-off-by: Jiang Liu Cc: Jiang Liu Cc: Keping Chen Link: http://lkml.kernel.org/r/1333120296-13563-3-git-send-email-jiang.liu@huawei.com Signed-off-by: Thomas Gleixner --- kernel/irq/internals.h | 3 +++ kernel/irq/manage.c | 39 ++++++++++++++++++++++----------------- kernel/irq/migration.c | 13 ++----------- 3 files changed, 27 insertions(+), 28 deletions(-) (limited to 'kernel') diff --git a/kernel/irq/internals.h b/kernel/irq/internals.h index 8e5c56b3b7d9..001fa5bab490 100644 --- a/kernel/irq/internals.h +++ b/kernel/irq/internals.h @@ -101,6 +101,9 @@ extern int irq_select_affinity_usr(unsigned int irq, struct cpumask *mask); extern void irq_set_thread_affinity(struct irq_desc *desc); +extern int irq_do_set_affinity(struct irq_data *data, + const struct cpumask *dest, bool force); + /* Inline functions for support of irq chips on slow busses */ static inline void chip_bus_lock(struct irq_desc *desc) { diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c index bb32326afe87..a1b903380bcf 100644 --- a/kernel/irq/manage.c +++ b/kernel/irq/manage.c @@ -139,6 +139,25 @@ static inline void irq_get_pending(struct cpumask *mask, struct irq_desc *desc) { } #endif +int irq_do_set_affinity(struct irq_data *data, const struct cpumask *mask, + bool force) +{ + struct irq_desc *desc = irq_data_to_desc(data); + struct irq_chip *chip = irq_data_get_irq_chip(data); + int ret; + + ret = chip->irq_set_affinity(data, mask, false); + switch (ret) { + case IRQ_SET_MASK_OK: + cpumask_copy(data->affinity, mask); + case IRQ_SET_MASK_OK_NOCOPY: + irq_set_thread_affinity(desc); + ret = 0; + } + + return ret; +} + int __irq_set_affinity_locked(struct irq_data *data, const struct cpumask *mask) { struct irq_chip *chip = irq_data_get_irq_chip(data); @@ -149,14 +168,7 @@ int __irq_set_affinity_locked(struct irq_data *data, const struct cpumask *mask) return -EINVAL; if (irq_can_move_pcntxt(data)) { - ret = chip->irq_set_affinity(data, mask, false); - switch (ret) { - case IRQ_SET_MASK_OK: - cpumask_copy(data->affinity, mask); - case IRQ_SET_MASK_OK_NOCOPY: - irq_set_thread_affinity(desc); - ret = 0; - } + ret = irq_do_set_affinity(data, mask, false); } else { irqd_set_move_pending(data); irq_copy_pending(desc, mask); @@ -280,9 +292,8 @@ EXPORT_SYMBOL_GPL(irq_set_affinity_notifier); static int setup_affinity(unsigned int irq, struct irq_desc *desc, struct cpumask *mask) { - struct irq_chip *chip = irq_desc_get_chip(desc); struct cpumask *set = irq_default_affinity; - int ret, node = desc->irq_data.node; + int node = desc->irq_data.node; /* Excludes PER_CPU and NO_BALANCE interrupts */ if (!irq_can_set_affinity(irq)) @@ -308,13 +319,7 @@ setup_affinity(unsigned int irq, struct irq_desc *desc, struct cpumask *mask) if (cpumask_intersects(mask, nodemask)) cpumask_and(mask, mask, nodemask); } - ret = chip->irq_set_affinity(&desc->irq_data, mask, false); - switch (ret) { - case IRQ_SET_MASK_OK: - cpumask_copy(desc->irq_data.affinity, mask); - case IRQ_SET_MASK_OK_NOCOPY: - irq_set_thread_affinity(desc); - } + irq_do_set_affinity(&desc->irq_data, mask, false); return 0; } #else diff --git a/kernel/irq/migration.c b/kernel/irq/migration.c index c3c89751b327..ca3f4aaff707 100644 --- a/kernel/irq/migration.c +++ b/kernel/irq/migration.c @@ -42,17 +42,8 @@ void irq_move_masked_irq(struct irq_data *idata) * For correct operation this depends on the caller * masking the irqs. 
*/ - if (likely(cpumask_any_and(desc->pending_mask, cpu_online_mask) - < nr_cpu_ids)) { - int ret = chip->irq_set_affinity(&desc->irq_data, - desc->pending_mask, false); - switch (ret) { - case IRQ_SET_MASK_OK: - cpumask_copy(desc->irq_data.affinity, desc->pending_mask); - case IRQ_SET_MASK_OK_NOCOPY: - irq_set_thread_affinity(desc); - } - } + if (cpumask_any_and(desc->pending_mask, cpu_online_mask) < nr_cpu_ids) + irq_do_set_affinity(&desc->irq_data, desc->pending_mask, false); cpumask_clear(desc->pending_mask); } -- cgit v1.2.3 From ee74d13229fb606353ff56f4927fa93b37e95bbe Mon Sep 17 00:00:00 2001 From: "Srivatsa S. Bhat" Date: Thu, 24 May 2012 20:40:55 +0530 Subject: smpboot, idle: Optimize calls to smp_processor_id() in idle_threads_init() While trying to initialize idle threads for all cpus, idle_threads_init() calls smp_processor_id() in a loop, which is unnecessary. The intent is to initialize idle threads for all non-boot cpus. So just use a variable to note the boot cpu and use it in the loop. Signed-off-by: Srivatsa S. Bhat Cc: suresh.b.siddha@intel.com Cc: venki@google.com Cc: nikunj@linux.vnet.ibm.com Link: http://lkml.kernel.org/r/20120524151055.2549.64309.stgit@srivatsabhat.in.ibm.com Signed-off-by: Thomas Gleixner --- kernel/smpboot.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/smpboot.c b/kernel/smpboot.c index e1a797e028a3..0f2162f808a7 100644 --- a/kernel/smpboot.c +++ b/kernel/smpboot.c @@ -52,10 +52,12 @@ static inline void idle_init(unsigned int cpu) */ void __init idle_threads_init(void) { - unsigned int cpu; + unsigned int cpu, boot_cpu; + + boot_cpu = smp_processor_id(); for_each_possible_cpu(cpu) { - if (cpu != smp_processor_id()) + if (cpu != boot_cpu) idle_init(cpu); } } -- cgit v1.2.3 From 4a70d2d9909b43ed88043b98cabe2c7fbd563021 Mon Sep 17 00:00:00 2001 From: "Srivatsa S. Bhat" Date: Thu, 24 May 2012 20:41:00 +0530 Subject: smpboot, idle: Fix comment mismatch over idle_threads_init() The comment over idle_threads_init() really talks about the functionality of idle_init(). Move that comment to idle_init(), and add a suitable comment over idle_threads_init(). Signed-off-by: Srivatsa S. Bhat Cc: suresh.b.siddha@intel.com Cc: venki@google.com Cc: nikunj@linux.vnet.ibm.com Link: http://lkml.kernel.org/r/20120524151100.2549.66501.stgit@srivatsabhat.in.ibm.com Signed-off-by: Thomas Gleixner --- kernel/smpboot.c | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/smpboot.c b/kernel/smpboot.c index 0f2162f808a7..98f60c5caa1b 100644 --- a/kernel/smpboot.c +++ b/kernel/smpboot.c @@ -31,6 +31,12 @@ void __init idle_thread_set_boot_cpu(void) per_cpu(idle_threads, smp_processor_id()) = current; } +/** + * idle_init - Initialize the idle thread for a cpu + * @cpu: The cpu for which the idle thread should be initialized + * + * Creates the thread if it does not exist. + */ static inline void idle_init(unsigned int cpu) { struct task_struct *tsk = per_cpu(idle_threads, cpu); @@ -45,10 +51,7 @@ static inline void idle_init(unsigned int cpu) } /** - * idle_thread_init - Initialize the idle thread for a cpu - * @cpu: The cpu for which the idle thread should be initialized - * - * Creates the thread if it does not exist. 
+ * idle_threads_init - Initialize idle threads for all cpus */ void __init idle_threads_init(void) { -- cgit v1.2.3 From 5307c9556bc17e3cd26d4e94fc3b2565921834de Mon Sep 17 00:00:00 2001 From: Mike Galbraith Date: Tue, 8 May 2012 12:20:58 +0200 Subject: tick: Add tick skew boot option Let the user decide whether power consumption or jitter is the more important consideration for their machines. Quoting removal commit af5ab277ded04bd9bc6b048c5a2f0e7d70ef0867: "Historically, Linux has tried to make the regular timer tick on the various CPUs not happen at the same time, to avoid contention on xtime_lock. Nowadays, with the tickless kernel, this contention no longer happens since time keeping and updating are done differently. In addition, this skew is actually hurting power consumption in a measurable way on many-core systems." Problems: - Contrary to the above, systems do encounter contention on both xtime_lock and RCU structure locks when the tick is synchronized. - Moderate sized RT systems suffer intolerable jitter due to the tick being synchronized. - SGI reports the same for their large systems. - Fully utilized systems reap no power saving benefit from skew removal, but do suffer from resulting induced lock contention. - 0209f649 rcu: limit rcu_node leaf-level fanout This patch was born to combat lock contention which testing showed to have been _induced by_ skew removal. Skew the tick, contention disappeared virtually completely. Signed-off-by: Mike Galbraith Link: http://lkml.kernel.org/r/1336472458.21924.78.camel@marge.simpson.net Signed-off-by: Thomas Gleixner --- kernel/time/tick-sched.c | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+) (limited to 'kernel') diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index 6a3a5b9ff561..4eddbb5ea9c5 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -814,6 +814,8 @@ static enum hrtimer_restart tick_sched_timer(struct hrtimer *timer) return HRTIMER_RESTART; } +static int sched_skew_tick; + /** * tick_setup_sched_timer - setup the tick emulation timer */ @@ -831,6 +833,14 @@ void tick_setup_sched_timer(void) /* Get the next period (per cpu) */ hrtimer_set_expires(&ts->sched_timer, tick_init_jiffy_update()); + /* Offset the tick to avert xtime_lock contention. */ + if (sched_skew_tick) { + u64 offset = ktime_to_ns(tick_period) >> 1; + do_div(offset, num_possible_cpus()); + offset *= smp_processor_id(); + hrtimer_add_expires_ns(&ts->sched_timer, offset); + } + for (;;) { hrtimer_forward(&ts->sched_timer, now, tick_period); hrtimer_start_expires(&ts->sched_timer, @@ -910,3 +920,11 @@ int tick_check_oneshot_change(int allow_nohz) tick_nohz_switch_to_nohz(); return 0; } + +static int __init skew_tick(char *str) +{ + get_option(&str, &sched_skew_tick); + + return 0; +} +early_param("skew_tick", skew_tick); -- cgit v1.2.3 From e5400321a6f15ce0fe77c8455954f213ef7dcc54 Mon Sep 17 00:00:00 2001 From: Magnus Damm Date: Wed, 9 May 2012 23:39:34 +0900 Subject: clockevents: Make clockevents_config() a global symbol Make clockevents_config() into a global symbol to allow it to be used by compiled-in clockevent drivers. This is needed by drivers that want to update the timer frequency after registration time. 
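For illustration only: with the symbol global, a driver for a oneshot-capable clock_event_device whose input clock rate changes at runtime can re-derive its timing parameters after registration, roughly along these lines (my_ced and new_rate_hz are assumptions):

	static void my_timer_rate_changed(u32 new_rate_hz)
	{
		/* recompute mult/shift and min/max delta for the new rate */
		clockevents_config(&my_ced, new_rate_hz);
	}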
Signed-off-by: Magnus Damm Tested-by: Simon Horman Cc: arnd@arndb.de Cc: johnstul@us.ibm.com Cc: rjw@sisk.pl Cc: lethal@linux-sh.org Cc: gregkh@linuxfoundation.org Cc: olof@lixom.net Cc: Magnus Damm Link: http://lkml.kernel.org/r/20120509143934.27521.46553.sendpatchset@w520 Signed-off-by: Thomas Gleixner --- kernel/time/clockevents.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c index 9cd928f7a7c6..7e1ce012a851 100644 --- a/kernel/time/clockevents.c +++ b/kernel/time/clockevents.c @@ -297,8 +297,7 @@ void clockevents_register_device(struct clock_event_device *dev) } EXPORT_SYMBOL_GPL(clockevents_register_device); -static void clockevents_config(struct clock_event_device *dev, - u32 freq) +void clockevents_config(struct clock_event_device *dev, u32 freq) { u64 sec; -- cgit v1.2.3 From 62cf20b32aee4ae889a2eb40fd41c0eab73de970 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Fri, 25 May 2012 14:08:57 +0200 Subject: tick: Move skew_tick option into the HIGH_RES_TIMER section commit 5307c95 (tick: Add tick skew boot option) broke the !CONFIG_HIGH_RES_TIMERS build. Move the boot option parsing into the CONFIG_HIGH_RES_TIMERS section. Reported-by: Ingo Molnar Signed-off-by: Thomas Gleixner Cc: Mike Galbraith --- kernel/time/tick-sched.c | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) (limited to 'kernel') diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index 4eddbb5ea9c5..efd386667536 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -816,6 +816,14 @@ static enum hrtimer_restart tick_sched_timer(struct hrtimer *timer) static int sched_skew_tick; +static int __init skew_tick(char *str) +{ + get_option(&str, &sched_skew_tick); + + return 0; +} +early_param("skew_tick", skew_tick); + /** * tick_setup_sched_timer - setup the tick emulation timer */ @@ -920,11 +928,3 @@ int tick_check_oneshot_change(int allow_nohz) tick_nohz_switch_to_nohz(); return 0; } - -static int __init skew_tick(char *str) -{ - get_option(&str, &sched_skew_tick); - - return 0; -} -early_param("skew_tick", skew_tick); -- cgit v1.2.3 From fa980ca87d15bb8a1317853f257a505990f3ffde Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Thu, 24 May 2012 08:24:39 -0700 Subject: cgroup: superblock can't be released with active dentries 48ddbe1946 "cgroup: make css->refcnt clearing on cgroup removal optional" allowed a css to linger after the associated cgroup is removed. As a css holds a reference on the cgroup's dentry, it means that cgroup dentries may linger for a while. cgroup_create() does grab an active reference on the superblock to prevent it from going away while there are !root cgroups; however, the reference is put from cgroup_diput() which is invoked on cgroup removal, so cgroup dentries which are removed but persisting due to lingering csses already have released their superblock active refs allowing superblock to be killed while those dentries are around. Given the right condition, this makes cgroup_kill_sb() call kill_litter_super() with dentries with non-zero d_count leading to BUG() in shrink_dcache_for_umount_subtree(). Fix it by adding cgroup_dops->d_release() operation and moving deactivate_super() to it. cgroup_diput() now marks dentry->d_fsdata with itself if superblock should be deactivated and cgroup_d_release() deactivates the superblock on dentry release. 
Signed-off-by: Tejun Heo Reported-by: Sasha Levin Tested-by: Sasha Levin LKML-Reference: Acked-by: Li Zefan --- kernel/cgroup.c | 17 ++++++++++++++--- 1 file changed, 14 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/cgroup.c b/kernel/cgroup.c index ad8eae5bb801..e887b55f1f29 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -896,10 +896,13 @@ static void cgroup_diput(struct dentry *dentry, struct inode *inode) mutex_unlock(&cgroup_mutex); /* - * Drop the active superblock reference that we took when we - * created the cgroup + * We want to drop the active superblock reference from the + * cgroup creation after all the dentry refs are gone - + * kill_sb gets mighty unhappy otherwise. Mark + * dentry->d_fsdata with cgroup_diput() to tell + * cgroup_d_release() to call deactivate_super(). */ - deactivate_super(cgrp->root->sb); + dentry->d_fsdata = cgroup_diput; /* * if we're getting rid of the cgroup, refcount should ensure @@ -925,6 +928,13 @@ static int cgroup_delete(const struct dentry *d) return 1; } +static void cgroup_d_release(struct dentry *dentry) +{ + /* did cgroup_diput() tell me to deactivate super? */ + if (dentry->d_fsdata == cgroup_diput) + deactivate_super(dentry->d_sb); +} + static void remove_dir(struct dentry *d) { struct dentry *parent = dget(d->d_parent); @@ -1532,6 +1542,7 @@ static int cgroup_get_rootdir(struct super_block *sb) static const struct dentry_operations cgroup_dops = { .d_iput = cgroup_diput, .d_delete = cgroup_delete, + .d_release = cgroup_d_release, }; struct inode *inode = -- cgit v1.2.3 From e709ffd6169ccd259eb5874e853303e91e94e829 Mon Sep 17 00:00:00 2001 From: Rik van Riel Date: Tue, 29 May 2012 15:06:18 -0700 Subject: mm: remove swap token code The swap token code no longer fits in with the current VM model. It does not play well with cgroups or the better NUMA placement code in development, since we have only one swap token globally. It also has the potential to mess with scalability of the system, by increasing the number of non-reclaimable pages on the active and inactive anon LRU lists. Last but not least, the swap token code has been broken for a year without complaints, as reported by Konstantin Khlebnikov. This suggests we no longer have much use for it. The days of sub-1G memory systems with heavy use of swap are over. If we ever need thrashing reducing code in the future, we will have to implement something that does scale. 
Signed-off-by: Rik van Riel Cc: Konstantin Khlebnikov Acked-by: Johannes Weiner Cc: Mel Gorman Cc: Hugh Dickins Acked-by: Bob Picco Acked-by: KOSAKI Motohiro Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/fork.c | 9 --------- 1 file changed, 9 deletions(-) (limited to 'kernel') diff --git a/kernel/fork.c b/kernel/fork.c index 47b4e4f379f9..5b13eea2e757 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -614,7 +614,6 @@ void mmput(struct mm_struct *mm) list_del(&mm->mmlist); spin_unlock(&mmlist_lock); } - put_swap_token(mm); if (mm->binfmt) module_put(mm->binfmt->module); mmdrop(mm); @@ -831,10 +830,6 @@ struct mm_struct *dup_mm(struct task_struct *tsk) memcpy(mm, oldmm, sizeof(*mm)); mm_init_cpumask(mm); - /* Initializing for Swap token stuff */ - mm->token_priority = 0; - mm->last_interval = 0; - #ifdef CONFIG_TRANSPARENT_HUGEPAGE mm->pmd_huge_pte = NULL; #endif @@ -913,10 +908,6 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk) goto fail_nomem; good_mm: - /* Initializing for Swap token stuff */ - mm->token_priority = 0; - mm->last_interval = 0; - tsk->mm = mm; tsk->active_mm = mm; return 0; -- cgit v1.2.3 From 7edc8b0ac16cbaed7cb4ea4c6b95ce98d2997e84 Mon Sep 17 00:00:00 2001 From: Siddhesh Poyarekar Date: Tue, 29 May 2012 15:06:22 -0700 Subject: mm/fork: fix overflow in vma length when copying mmap on clone The vma length in dup_mmap is calculated and stored in a unsigned int, which is insufficient and hence overflows for very large maps (beyond 16TB). The following program demonstrates this: #include #include #include #define GIG 1024 * 1024 * 1024L #define EXTENT 16393 int main(void) { int i, r; void *m; char buf[1024]; for (i = 0; i < EXTENT; i++) { m = mmap(NULL, (size_t) 1 * 1024 * 1024 * 1024L, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, 0, 0); if (m == (void *)-1) printf("MMAP Failed: %d\n", m); else printf("%d : MMAP returned %p\n", i, m); r = fork(); if (r == 0) { printf("%d: successed\n", i); return 0; } else if (r < 0) printf("FORK Failed: %d\n", r); else if (r > 0) wait(NULL); } return 0; } Increase the storage size of the result to unsigned long, which is sufficient for storing the difference between addresses. Signed-off-by: Siddhesh Poyarekar Cc: Tejun Heo Cc: Oleg Nesterov Cc: Jens Axboe Cc: Peter Zijlstra Acked-by: Hugh Dickins Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/fork.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/fork.c b/kernel/fork.c index 5b13eea2e757..017fb23d5983 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -386,7 +386,8 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm) } charge = 0; if (mpnt->vm_flags & VM_ACCOUNT) { - unsigned int len = (mpnt->vm_end - mpnt->vm_start) >> PAGE_SHIFT; + unsigned long len; + len = (mpnt->vm_end - mpnt->vm_start) >> PAGE_SHIFT; if (security_vm_enough_memory_mm(oldmm, len)) /* sic */ goto fail_nomem; charge = len; -- cgit v1.2.3 From 91c63734f6908425903aed69c04035592f18d398 Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Tue, 29 May 2012 15:06:24 -0700 Subject: kernel: cgroup: push rcu read locking from css_is_ancestor() to callsite Library functions should not grab locks when the callsites can do it, even if the lock nests like the rcu read-side lock does. Push the rcu_read_lock() from css_is_ancestor() to its single user, mem_cgroup_same_or_subtree() in preparation for another user that may already hold the rcu read-side lock. 
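[ Editorial illustration: what a call site looks like once the locking has moved out of css_is_ancestor(). The wrapper below is a hypothetical stand-in for mem_cgroup_same_or_subtree(), not its actual body. ]

#include <linux/cgroup.h>
#include <linux/rcupdate.h>

static bool foo_css_is_descendant(struct cgroup_subsys_state *child,
				  struct cgroup_subsys_state *root)
{
	bool ret;

	/*
	 * css_is_ancestor() still reads css->id with rcu_dereference(),
	 * but the read-side lock is now the caller's job -- so a caller
	 * that already runs under rcu_read_lock() does not re-take it.
	 */
	rcu_read_lock();
	ret = css_is_ancestor(child, root);
	rcu_read_unlock();
	return ret;
}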
Signed-off-by: Johannes Weiner Cc: Konstantin Khlebnikov Acked-by: KAMEZAWA Hiroyuki Acked-by: Michal Hocko Acked-by: Li Zefan Cc: Li Zefan Cc: Tejun Heo Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/cgroup.c | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) (limited to 'kernel') diff --git a/kernel/cgroup.c b/kernel/cgroup.c index a0c6af34d500..0f3527d6184a 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -5132,7 +5132,7 @@ EXPORT_SYMBOL_GPL(css_depth); * @root: the css supporsed to be an ancestor of the child. * * Returns true if "root" is an ancestor of "child" in its hierarchy. Because - * this function reads css->id, this use rcu_dereference() and rcu_read_lock(). + * this function reads css->id, the caller must hold rcu_read_lock(). * But, considering usual usage, the csses should be valid objects after test. * Assuming that the caller will do some action to the child if this returns * returns true, the caller must take "child";s reference count. @@ -5144,18 +5144,18 @@ bool css_is_ancestor(struct cgroup_subsys_state *child, { struct css_id *child_id; struct css_id *root_id; - bool ret = true; - rcu_read_lock(); child_id = rcu_dereference(child->id); + if (!child_id) + return false; root_id = rcu_dereference(root->id); - if (!child_id - || !root_id - || (child_id->depth < root_id->depth) - || (child_id->stack[root_id->depth] != root_id->id)) - ret = false; - rcu_read_unlock(); - return ret; + if (!root_id) + return false; + if (child_id->depth < root_id->depth) + return false; + if (child_id->stack[root_id->depth] != root_id->id) + return false; + return true; } void free_css_id(struct cgroup_subsys *ss, struct cgroup_subsys_state *css) -- cgit v1.2.3 From 2bb2ba9d51a8044a71a29608d2c4ef8f5b2d57a2 Mon Sep 17 00:00:00 2001 From: Frederic Weisbecker Date: Tue, 29 May 2012 15:07:03 -0700 Subject: rescounters: add res_counter_uncharge_until() When killing a res_counter which is a child of other counter, we need to do res_counter_uncharge(child, xxx) res_counter_charge(parent, xxx) This is not atomic and wastes CPU. This patch adds res_counter_uncharge_until(). This function's uncharge propagates to ancestors until specified res_counter. res_counter_uncharge_until(child, parent, xxx) Now the operation is atomic and efficient. 
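[ Editorial illustration: the child-teardown case the changelog describes, written out with a hypothetical helper name; only res_counter_uncharge_until() and the existing res_counter fields/accessors are assumed from the kernel. ]

#include <linux/res_counter.h>

/* move all of @child's usage up to its parent when @child goes away */
static void foo_reparent_charges(struct res_counter *child)
{
	unsigned long usage = res_counter_read_u64(child, RES_USAGE);

	/*
	 * Uncharge @child but stop before its parent: the charge stays
	 * accounted to the parent and all higher ancestors, and there is
	 * no window where the hierarchy looks uncharged (the old
	 * uncharge-child-then-charge-parent pair had such a window).
	 */
	res_counter_uncharge_until(child, child->parent, usage);
}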
Signed-off-by: Frederic Weisbecker Signed-off-by: KAMEZAWA Hiroyuki Cc: Aneesh Kumar K.V Cc: Michal Hocko Cc: Johannes Weiner Cc: Ying Han Cc: Glauber Costa Reviewed-by: Tejun Heo Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/res_counter.c | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/res_counter.c b/kernel/res_counter.c index bebe2b170d49..ad581aa2369a 100644 --- a/kernel/res_counter.c +++ b/kernel/res_counter.c @@ -94,13 +94,15 @@ void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val) counter->usage -= val; } -void res_counter_uncharge(struct res_counter *counter, unsigned long val) +void res_counter_uncharge_until(struct res_counter *counter, + struct res_counter *top, + unsigned long val) { unsigned long flags; struct res_counter *c; local_irq_save(flags); - for (c = counter; c != NULL; c = c->parent) { + for (c = counter; c != top; c = c->parent) { spin_lock(&c->lock); res_counter_uncharge_locked(c, val); spin_unlock(&c->lock); @@ -108,6 +110,10 @@ void res_counter_uncharge(struct res_counter *counter, unsigned long val) local_irq_restore(flags); } +void res_counter_uncharge(struct res_counter *counter, unsigned long val) +{ + res_counter_uncharge_until(counter, NULL, val); +} static inline unsigned long long * res_counter_member(struct res_counter *counter, int member) -- cgit v1.2.3 From 4796dd200db943e36f876e7029552212e5bbdf33 Mon Sep 17 00:00:00 2001 From: Stephen Boyd Date: Tue, 29 May 2012 15:07:33 -0700 Subject: vsprintf: fix %ps on non symbols when using kallsyms Using %ps in a printk format will sometimes fail silently and print the empty string if the address passed in does not match a symbol that kallsyms knows about. But using %pS will fall back to printing the full address if kallsyms can't find the symbol. Make %ps act the same as %pS by falling back to printing the address. While we're here also make %ps print the module that a symbol comes from so that it matches what %pS already does. Take this simple function for example (in a module): static void test_printk(void) { int test; pr_info("with pS: %pS\n", &test); pr_info("with ps: %ps\n", &test); } Before this patch: with pS: 0xdff7df44 with ps: After this patch: with pS: 0xdff7df44 with ps: 0xdff7df44 Signed-off-by: Stephen Boyd Cc: Ingo Molnar Cc: Peter Zijlstra Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/kallsyms.c | 32 ++++++++++++++++++++++++-------- 1 file changed, 24 insertions(+), 8 deletions(-) (limited to 'kernel') diff --git a/kernel/kallsyms.c b/kernel/kallsyms.c index 079f1d39a8b8..2169feeba529 100644 --- a/kernel/kallsyms.c +++ b/kernel/kallsyms.c @@ -343,7 +343,7 @@ int lookup_symbol_attrs(unsigned long addr, unsigned long *size, /* Look up a kernel symbol and return it in a text buffer. 
*/ static int __sprint_symbol(char *buffer, unsigned long address, - int symbol_offset) + int symbol_offset, int add_offset) { char *modname; const char *name; @@ -358,13 +358,13 @@ static int __sprint_symbol(char *buffer, unsigned long address, if (name != buffer) strcpy(buffer, name); len = strlen(buffer); - buffer += len; offset -= symbol_offset; + if (add_offset) + len += sprintf(buffer + len, "+%#lx/%#lx", offset, size); + if (modname) - len += sprintf(buffer, "+%#lx/%#lx [%s]", offset, size, modname); - else - len += sprintf(buffer, "+%#lx/%#lx", offset, size); + len += sprintf(buffer + len, " [%s]", modname); return len; } @@ -382,11 +382,27 @@ static int __sprint_symbol(char *buffer, unsigned long address, */ int sprint_symbol(char *buffer, unsigned long address) { - return __sprint_symbol(buffer, address, 0); + return __sprint_symbol(buffer, address, 0, 1); } - EXPORT_SYMBOL_GPL(sprint_symbol); +/** + * sprint_symbol_no_offset - Look up a kernel symbol and return it in a text buffer + * @buffer: buffer to be stored + * @address: address to lookup + * + * This function looks up a kernel symbol with @address and stores its name + * and module name to @buffer if possible. If no symbol was found, just saves + * its @address as is. + * + * This function returns the number of bytes stored in @buffer. + */ +int sprint_symbol_no_offset(char *buffer, unsigned long address) +{ + return __sprint_symbol(buffer, address, 0, 0); +} +EXPORT_SYMBOL_GPL(sprint_symbol_no_offset); + /** * sprint_backtrace - Look up a backtrace symbol and return it in a text buffer * @buffer: buffer to be stored @@ -403,7 +419,7 @@ EXPORT_SYMBOL_GPL(sprint_symbol); */ int sprint_backtrace(char *buffer, unsigned long address) { - return __sprint_symbol(buffer, address, -1); + return __sprint_symbol(buffer, address, -1, 1); } /* Look up a kernel symbol and print it to the kernel messages. */ -- cgit v1.2.3 From eea62f831b8030b0eeea8314eed73b6132d1de26 Mon Sep 17 00:00:00 2001 From: Andi Kleen Date: Tue, 8 May 2012 13:32:24 +0930 Subject: brlocks/lglocks: turn into functions lglocks and brlocks are currently generated with some complicated macros in lglock.h. But there's no reason to not just use common utility functions and put all the data into a common data structure. Since there are at least two users it makes sense to share this code in a library. This is also easier maintainable than a macro forest. 
This will also make it later possible to dynamically allocate lglocks and also use them in modules (this would both still need some additional, but now straightforward, code) [akpm@linux-foundation.org: checkpatch fixes] Signed-off-by: Andi Kleen Cc: Al Viro Cc: Rusty Russell Signed-off-by: Andrew Morton Signed-off-by: Rusty Russell Signed-off-by: Al Viro --- kernel/Makefile | 2 +- kernel/lglock.c | 89 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 90 insertions(+), 1 deletion(-) create mode 100644 kernel/lglock.c (limited to 'kernel') diff --git a/kernel/Makefile b/kernel/Makefile index 6c07f30fa9b7..296132c19a57 100644 --- a/kernel/Makefile +++ b/kernel/Makefile @@ -10,7 +10,7 @@ obj-y = fork.o exec_domain.o panic.o printk.o \ kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o mutex.o \ hrtimer.o rwsem.o nsproxy.o srcu.o semaphore.o \ notifier.o ksysfs.o cred.o \ - async.o range.o groups.o + async.o range.o groups.o lglock.o ifdef CONFIG_FUNCTION_TRACER # Do not trace debug files and internal ftrace files diff --git a/kernel/lglock.c b/kernel/lglock.c new file mode 100644 index 000000000000..6535a667a5a7 --- /dev/null +++ b/kernel/lglock.c @@ -0,0 +1,89 @@ +/* See include/linux/lglock.h for description */ +#include +#include +#include +#include + +/* + * Note there is no uninit, so lglocks cannot be defined in + * modules (but it's fine to use them from there) + * Could be added though, just undo lg_lock_init + */ + +void lg_lock_init(struct lglock *lg, char *name) +{ + LOCKDEP_INIT_MAP(&lg->lock_dep_map, name, &lg->lock_key, 0); +} +EXPORT_SYMBOL(lg_lock_init); + +void lg_local_lock(struct lglock *lg) +{ + arch_spinlock_t *lock; + + preempt_disable(); + rwlock_acquire_read(&lg->lock_dep_map, 0, 0, _RET_IP_); + lock = this_cpu_ptr(lg->lock); + arch_spin_lock(lock); +} +EXPORT_SYMBOL(lg_local_lock); + +void lg_local_unlock(struct lglock *lg) +{ + arch_spinlock_t *lock; + + rwlock_release(&lg->lock_dep_map, 1, _RET_IP_); + lock = this_cpu_ptr(lg->lock); + arch_spin_unlock(lock); + preempt_enable(); +} +EXPORT_SYMBOL(lg_local_unlock); + +void lg_local_lock_cpu(struct lglock *lg, int cpu) +{ + arch_spinlock_t *lock; + + preempt_disable(); + rwlock_acquire_read(&lg->lock_dep_map, 0, 0, _RET_IP_); + lock = per_cpu_ptr(lg->lock, cpu); + arch_spin_lock(lock); +} +EXPORT_SYMBOL(lg_local_lock_cpu); + +void lg_local_unlock_cpu(struct lglock *lg, int cpu) +{ + arch_spinlock_t *lock; + + rwlock_release(&lg->lock_dep_map, 1, _RET_IP_); + lock = per_cpu_ptr(lg->lock, cpu); + arch_spin_unlock(lock); + preempt_enable(); +} +EXPORT_SYMBOL(lg_local_unlock_cpu); + +void lg_global_lock(struct lglock *lg) +{ + int i; + + preempt_disable(); + rwlock_acquire(&lg->lock_dep_map, 0, 0, _RET_IP_); + for_each_possible_cpu(i) { + arch_spinlock_t *lock; + lock = per_cpu_ptr(lg->lock, i); + arch_spin_lock(lock); + } +} +EXPORT_SYMBOL(lg_global_lock); + +void lg_global_unlock(struct lglock *lg) +{ + int i; + + rwlock_release(&lg->lock_dep_map, 1, _RET_IP_); + for_each_possible_cpu(i) { + arch_spinlock_t *lock; + lock = per_cpu_ptr(lg->lock, i); + arch_spin_unlock(lock); + } + preempt_enable(); +} +EXPORT_SYMBOL(lg_global_unlock); -- cgit v1.2.3 From 5aaa0b7a2ed5b12692c9ffb5222182bd558d3146 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Thu, 17 May 2012 17:15:29 +0200 Subject: sched/nohz: Fix rq->cpu_load calculations some more Follow up on commit 556061b00 ("sched/nohz: Fix rq->cpu_load[] calculations") since while that fixed the busy case it regressed the mostly idle case. 
Add a callback from the nohz exit to also age the rq->cpu_load[] array. This closes the hole where either there was no nohz load balance pass during the nohz, or there was a 'significant' amount of idle time between the last nohz balance and the nohz exit. So we'll update unconditionally from the tick to not insert any accidental 0 load periods while busy, and we try and catch up from nohz idle balance and nohz exit. Both these are still prone to missing a jiffy, but that has always been the case. Signed-off-by: Peter Zijlstra Cc: pjt@google.com Cc: Venkatesh Pallipadi Link: http://lkml.kernel.org/n/tip-kt0trz0apodbf84ucjfdbr1a@git.kernel.org Signed-off-by: Ingo Molnar --- kernel/sched/core.c | 53 +++++++++++++++++++++++++++++++++++++++--------- kernel/time/tick-sched.c | 1 + 2 files changed, 44 insertions(+), 10 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 39eb6011bc38..75844a8f9aeb 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -2517,25 +2517,32 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load, sched_avg_update(this_rq); } +#ifdef CONFIG_NO_HZ +/* + * There is no sane way to deal with nohz on smp when using jiffies because the + * cpu doing the jiffies update might drift wrt the cpu doing the jiffy reading + * causing off-by-one errors in observed deltas; {0,2} instead of {1,1}. + * + * Therefore we cannot use the delta approach from the regular tick since that + * would seriously skew the load calculation. However we'll make do for those + * updates happening while idle (nohz_idle_balance) or coming out of idle + * (tick_nohz_idle_exit). + * + * This means we might still be one tick off for nohz periods. + */ + /* * Called from nohz_idle_balance() to update the load ratings before doing the * idle balance. */ void update_idle_cpu_load(struct rq *this_rq) { - unsigned long curr_jiffies = jiffies; + unsigned long curr_jiffies = ACCESS_ONCE(jiffies); unsigned long load = this_rq->load.weight; unsigned long pending_updates; /* - * Bloody broken means of dealing with nohz, but better than nothing.. - * jiffies is updated by one cpu, another cpu can drift wrt the jiffy - * update and see 0 difference the one time and 2 the next, even though - * we ticked at roughtly the same rate. - * - * Hence we only use this from nohz_idle_balance() and skip this - * nonsense when called from the scheduler_tick() since that's - * guaranteed a stable rate. + * bail if there's load or we're actually up-to-date. */ if (load || curr_jiffies == this_rq->last_load_update_tick) return; @@ -2546,13 +2553,39 @@ void update_idle_cpu_load(struct rq *this_rq) __update_cpu_load(this_rq, load, pending_updates); } +/* + * Called from tick_nohz_idle_exit() -- try and fix up the ticks we missed. + */ +void update_cpu_load_nohz(void) +{ + struct rq *this_rq = this_rq(); + unsigned long curr_jiffies = ACCESS_ONCE(jiffies); + unsigned long pending_updates; + + if (curr_jiffies == this_rq->last_load_update_tick) + return; + + raw_spin_lock(&this_rq->lock); + pending_updates = curr_jiffies - this_rq->last_load_update_tick; + if (pending_updates) { + this_rq->last_load_update_tick = curr_jiffies; + /* + * We were idle, this means load 0, the current load might be + * !0 due to remote wakeups and the sort. 
+ */ + __update_cpu_load(this_rq, 0, pending_updates); + } + raw_spin_unlock(&this_rq->lock); +} +#endif /* CONFIG_NO_HZ */ + /* * Called from scheduler_tick() */ static void update_cpu_load_active(struct rq *this_rq) { /* - * See the mess in update_idle_cpu_load(). + * See the mess around update_idle_cpu_load() / update_cpu_load_nohz(). */ this_rq->last_load_update_tick = jiffies; __update_cpu_load(this_rq, this_rq->load.weight, 1); diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index 6a3a5b9ff561..0c927cd85345 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -576,6 +576,7 @@ void tick_nohz_idle_exit(void) /* Update jiffies first */ select_nohz_load_balancer(0); tick_do_update_jiffies64(now); + update_cpu_load_nohz(); #ifndef CONFIG_VIRT_CPU_ACCOUNTING /* -- cgit v1.2.3 From 2ea45800d8e1c3c51c45a233d6bd6289a297a386 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Fri, 25 May 2012 09:26:43 +0200 Subject: sched: Don't try allocating memory from offline nodes Allocators don't appreciate it when you try and allocate memory from offline nodes. Reported-and-tested-by: Tony Luck Reported-and-tested-by: Anton Blanchard Signed-off-by: Peter Zijlstra Link: http://lkml.kernel.org/n/tip-epfc1io9whb7o22bcujf31vn@git.kernel.org Signed-off-by: Ingo Molnar --- kernel/sched/core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 75844a8f9aeb..55733616baaa 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6436,7 +6436,7 @@ static void sched_init_numa(void) return; for (j = 0; j < nr_node_ids; j++) { - struct cpumask *mask = kzalloc_node(cpumask_size(), GFP_KERNEL, j); + struct cpumask *mask = kzalloc(cpumask_size(), GFP_KERNEL); if (!mask) return; -- cgit v1.2.3 From 74a5ce20e6eeeb3751340b390e7ac1d1d07bbf55 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Wed, 23 May 2012 18:00:43 +0200 Subject: sched: Fix SD_OVERLAP SD_OVERLAP exists to allow overlapping groups, overlapping groups appear in NUMA topologies that aren't fully connected. The typical result of not fully connected NUMA is that each cpu (or rather node) will have different spans for a particular distance. However due to how sched domains are traversed -- only the first cpu in the mask goes one level up -- the next level only cares about the spans of the cpus that went up. Due to this two things were observed to be broken: - build_overlap_sched_groups() -- since its possible the cpu we're building the groups for exists in multiple (or all) groups, the selection criteria of the first group didn't ensure there was a cpu for which is was true that cpumask_first(span) == cpu. Thus load- balancing would terminate. - update_group_power() -- assumed that the cpu span of the first group of the domain was covered by all groups of the child domain. The above explains why this isn't true, so deal with it. 
Signed-off-by: Peter Zijlstra Cc: David Rientjes Link: http://lkml.kernel.org/r/1337788843.9783.14.camel@laptop Signed-off-by: Ingo Molnar --- kernel/sched/core.c | 7 +++++-- kernel/sched/fair.c | 25 ++++++++++++++++++++----- 2 files changed, 25 insertions(+), 7 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 55733616baaa..3a69374fb427 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6030,11 +6030,14 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu) cpumask_or(covered, covered, sg_span); - sg->sgp = *per_cpu_ptr(sdd->sgp, cpumask_first(sg_span)); + sg->sgp = *per_cpu_ptr(sdd->sgp, i); atomic_inc(&sg->sgp->ref); - if (cpumask_test_cpu(cpu, sg_span)) + if ((!groups && cpumask_test_cpu(cpu, sg_span)) || + cpumask_first(sg_span) == cpu) { + WARN_ON_ONCE(!cpumask_test_cpu(cpu, sg_span)); groups = sg; + } if (!first) first = sg; diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 940e6d17cf96..f0380d4987b3 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3574,11 +3574,26 @@ void update_group_power(struct sched_domain *sd, int cpu) power = 0; - group = child->groups; - do { - power += group->sgp->power; - group = group->next; - } while (group != child->groups); + if (child->flags & SD_OVERLAP) { + /* + * SD_OVERLAP domains cannot assume that child groups + * span the current group. + */ + + for_each_cpu(cpu, sched_group_cpus(sdg)) + power += power_of(cpu); + } else { + /* + * !SD_OVERLAP domains can assume that child groups + * span the current group. + */ + + group = child->groups; + do { + power += group->sgp->power; + group = group->next; + } while (group != child->groups); + } sdg->sgp->power = power; } -- cgit v1.2.3 From b654f7de41b0e3903ee2b51d3b8db77fe52ce728 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Tue, 22 May 2012 14:04:28 +0200 Subject: sched: Make sure to not re-read variables after validation We could re-read rq->rt_avg after we validated it was smaller than total, invalidating the check and resulting in an unintended negative. Signed-off-by: Peter Zijlstra Cc: David Rientjes Link: http://lkml.kernel.org/r/1337688268.9698.29.camel@twins Signed-off-by: Ingo Molnar --- kernel/sched/fair.c | 15 +++++++++++---- 1 file changed, 11 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index f0380d4987b3..2b449a762074 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3503,15 +3503,22 @@ unsigned long __weak arch_scale_smt_power(struct sched_domain *sd, int cpu) unsigned long scale_rt_power(int cpu) { struct rq *rq = cpu_rq(cpu); - u64 total, available; + u64 total, available, age_stamp, avg; - total = sched_avg_period() + (rq->clock - rq->age_stamp); + /* + * Since we're reading these variables without serialization make sure + * we read them once before doing sanity checks on them. 
+ */ + age_stamp = ACCESS_ONCE(rq->age_stamp); + avg = ACCESS_ONCE(rq->rt_avg); + + total = sched_avg_period() + (rq->clock - age_stamp); - if (unlikely(total < rq->rt_avg)) { + if (unlikely(total < avg)) { /* Ensures that power won't end up being negative */ available = 0; } else { - available = total - rq->rt_avg; + available = total - avg; } if (unlikely((s64)total < SCHED_POWER_SCALE)) -- cgit v1.2.3 From 29baa7478ba47d746e3625c91d3b2afbf46b4312 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Mon, 23 Apr 2012 12:11:21 +0200 Subject: sched: Move nr_cpus_allowed out of 'struct sched_rt_entity' Since nr_cpus_allowed is used outside of sched/rt.c and wants to be used outside of there more, move it to a more natural site. Signed-off-by: Peter Zijlstra Link: http://lkml.kernel.org/n/tip-kr61f02y9brwzkh6x53pdptm@git.kernel.org Signed-off-by: Ingo Molnar --- kernel/sched/core.c | 2 +- kernel/sched/fair.c | 2 +- kernel/sched/rt.c | 36 +++++++++++++++++++++--------------- 3 files changed, 23 insertions(+), 17 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 3a69374fb427..70cc36a6073f 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -5015,7 +5015,7 @@ void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask) p->sched_class->set_cpus_allowed(p, new_mask); cpumask_copy(&p->cpus_allowed, new_mask); - p->rt.nr_cpus_allowed = cpumask_weight(new_mask); + p->nr_cpus_allowed = cpumask_weight(new_mask); } /* diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 2b449a762074..b2a2d236f27b 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -2703,7 +2703,7 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags) int want_sd = 1; int sync = wake_flags & WF_SYNC; - if (p->rt.nr_cpus_allowed == 1) + if (p->nr_cpus_allowed == 1) return prev_cpu; if (sd_flag & SD_BALANCE_WAKE) { diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c index c5565c3c515f..295da737b6fe 100644 --- a/kernel/sched/rt.c +++ b/kernel/sched/rt.c @@ -274,13 +274,16 @@ static void update_rt_migration(struct rt_rq *rt_rq) static void inc_rt_migration(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq) { + struct task_struct *p; + if (!rt_entity_is_task(rt_se)) return; + p = rt_task_of(rt_se); rt_rq = &rq_of_rt_rq(rt_rq)->rt; rt_rq->rt_nr_total++; - if (rt_se->nr_cpus_allowed > 1) + if (p->nr_cpus_allowed > 1) rt_rq->rt_nr_migratory++; update_rt_migration(rt_rq); @@ -288,13 +291,16 @@ static void inc_rt_migration(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq) static void dec_rt_migration(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq) { + struct task_struct *p; + if (!rt_entity_is_task(rt_se)) return; + p = rt_task_of(rt_se); rt_rq = &rq_of_rt_rq(rt_rq)->rt; rt_rq->rt_nr_total--; - if (rt_se->nr_cpus_allowed > 1) + if (p->nr_cpus_allowed > 1) rt_rq->rt_nr_migratory--; update_rt_migration(rt_rq); @@ -1161,7 +1167,7 @@ enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags) enqueue_rt_entity(rt_se, flags & ENQUEUE_HEAD); - if (!task_current(rq, p) && p->rt.nr_cpus_allowed > 1) + if (!task_current(rq, p) && p->nr_cpus_allowed > 1) enqueue_pushable_task(rq, p); inc_nr_running(rq); @@ -1225,7 +1231,7 @@ select_task_rq_rt(struct task_struct *p, int sd_flag, int flags) cpu = task_cpu(p); - if (p->rt.nr_cpus_allowed == 1) + if (p->nr_cpus_allowed == 1) goto out; /* For anything but wake ups, just return the task_cpu */ @@ -1260,9 +1266,9 @@ select_task_rq_rt(struct task_struct *p, int sd_flag, int flags) * will have 
to sort it out. */ if (curr && unlikely(rt_task(curr)) && - (curr->rt.nr_cpus_allowed < 2 || + (curr->nr_cpus_allowed < 2 || curr->prio <= p->prio) && - (p->rt.nr_cpus_allowed > 1)) { + (p->nr_cpus_allowed > 1)) { int target = find_lowest_rq(p); if (target != -1) @@ -1276,10 +1282,10 @@ out: static void check_preempt_equal_prio(struct rq *rq, struct task_struct *p) { - if (rq->curr->rt.nr_cpus_allowed == 1) + if (rq->curr->nr_cpus_allowed == 1) return; - if (p->rt.nr_cpus_allowed != 1 + if (p->nr_cpus_allowed != 1 && cpupri_find(&rq->rd->cpupri, p, NULL)) return; @@ -1395,7 +1401,7 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p) * The previous task needs to be made eligible for pushing * if it is still active */ - if (on_rt_rq(&p->rt) && p->rt.nr_cpus_allowed > 1) + if (on_rt_rq(&p->rt) && p->nr_cpus_allowed > 1) enqueue_pushable_task(rq, p); } @@ -1408,7 +1414,7 @@ static int pick_rt_task(struct rq *rq, struct task_struct *p, int cpu) { if (!task_running(rq, p) && (cpu < 0 || cpumask_test_cpu(cpu, tsk_cpus_allowed(p))) && - (p->rt.nr_cpus_allowed > 1)) + (p->nr_cpus_allowed > 1)) return 1; return 0; } @@ -1464,7 +1470,7 @@ static int find_lowest_rq(struct task_struct *task) if (unlikely(!lowest_mask)) return -1; - if (task->rt.nr_cpus_allowed == 1) + if (task->nr_cpus_allowed == 1) return -1; /* No other targets possible */ if (!cpupri_find(&task_rq(task)->rd->cpupri, task, lowest_mask)) @@ -1586,7 +1592,7 @@ static struct task_struct *pick_next_pushable_task(struct rq *rq) BUG_ON(rq->cpu != task_cpu(p)); BUG_ON(task_current(rq, p)); - BUG_ON(p->rt.nr_cpus_allowed <= 1); + BUG_ON(p->nr_cpus_allowed <= 1); BUG_ON(!p->on_rq); BUG_ON(!rt_task(p)); @@ -1793,9 +1799,9 @@ static void task_woken_rt(struct rq *rq, struct task_struct *p) if (!task_running(rq, p) && !test_tsk_need_resched(rq->curr) && has_pushable_tasks(rq) && - p->rt.nr_cpus_allowed > 1 && + p->nr_cpus_allowed > 1 && rt_task(rq->curr) && - (rq->curr->rt.nr_cpus_allowed < 2 || + (rq->curr->nr_cpus_allowed < 2 || rq->curr->prio <= p->prio)) push_rt_tasks(rq); } @@ -1817,7 +1823,7 @@ static void set_cpus_allowed_rt(struct task_struct *p, * Only update if the process changes its state from whether it * can migrate or not. */ - if ((p->rt.nr_cpus_allowed > 1) == (weight > 1)) + if ((p->nr_cpus_allowed > 1) == (weight > 1)) return; rq = task_rq(p); -- cgit v1.2.3 From 454c79999f7eaedcdf4c15c449e43902980cbdf5 Mon Sep 17 00:00:00 2001 From: Colin Cross Date: Wed, 16 May 2012 21:34:23 -0700 Subject: sched/rt: Fix SCHED_RR across cgroups task_tick_rt() has an optimization to only reschedule SCHED_RR tasks if they were the only element on their rq. However, with cgroups a SCHED_RR task could be the only element on its per-cgroup rq but still be competing with other SCHED_RR tasks in its parent's cgroup. In this case, the SCHED_RR task in the child cgroup would never yield at the end of its timeslice. If the child cgroup rt_runtime_us was the same as the parent cgroup rt_runtime_us, the task in the parent cgroup would starve completely. Modify task_tick_rt() to check that the task is the only task on its rq, and that the each of the scheduling entities of its ancestors is also the only entity on its rq. 
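[ Editorial illustration: the new requeue condition pulled out as a standalone predicate, purely to make the "task and all of its ancestors" wording concrete. This helper does not exist in the patch and would only build inside kernel/sched/rt.c, where for_each_sched_rt_entity() is defined. ]

static bool foo_rr_alone_at_every_level(struct task_struct *p)
{
	struct sched_rt_entity *rt_se = &p->rt;

	/*
	 * True only when the task has the rq to itself at every cgroup
	 * level, i.e. there is nothing to round-robin with and no requeue
	 * is needed. If any level has a sibling entity, the task must be
	 * requeued to the tail of that queue so the sibling gets to run.
	 */
	for_each_sched_rt_entity(rt_se) {
		if (rt_se->run_list.prev != rt_se->run_list.next)
			return false;
	}
	return true;
}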
Signed-off-by: Colin Cross Signed-off-by: Peter Zijlstra Link: http://lkml.kernel.org/r/1337229266-15798-1-git-send-email-ccross@android.com Signed-off-by: Ingo Molnar --- kernel/sched/rt.c | 15 ++++++++++----- 1 file changed, 10 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c index 295da737b6fe..2a4e8dffbd6b 100644 --- a/kernel/sched/rt.c +++ b/kernel/sched/rt.c @@ -1985,6 +1985,8 @@ static void watchdog(struct rq *rq, struct task_struct *p) static void task_tick_rt(struct rq *rq, struct task_struct *p, int queued) { + struct sched_rt_entity *rt_se = &p->rt; + update_curr_rt(rq); watchdog(rq, p); @@ -2002,12 +2004,15 @@ static void task_tick_rt(struct rq *rq, struct task_struct *p, int queued) p->rt.time_slice = RR_TIMESLICE; /* - * Requeue to the end of queue if we are not the only element - * on the queue: + * Requeue to the end of queue if we (and all of our ancestors) are the + * only element on the queue */ - if (p->rt.run_list.prev != p->rt.run_list.next) { - requeue_task_rt(rq, p, 0); - set_tsk_need_resched(p); + for_each_sched_rt_entity(rt_se) { + if (rt_se->run_list.prev != rt_se->run_list.next) { + requeue_task_rt(rq, p, 0); + set_tsk_need_resched(p); + return; + } } } -- cgit v1.2.3 From 1292531f6f27af909e713671dd9cc3bcab8114b7 Mon Sep 17 00:00:00 2001 From: Hiroshi Shimamoto Date: Fri, 25 May 2012 15:41:54 +0900 Subject: sched: Make sched_feat_names const The strings sched_feat_names are never changed. Signed-off-by: Hiroshi Shimamoto Signed-off-by: Peter Zijlstra Link: http://lkml.kernel.org/r/4FBF29B2.9030904@ct.jp.nec.com Signed-off-by: Ingo Molnar --- kernel/sched/core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 70cc36a6073f..c1679a098fc7 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -142,7 +142,7 @@ const_debug unsigned int sysctl_sched_features = #define SCHED_FEAT(name, enabled) \ #name , -static __read_mostly char *sched_feat_names[] = { +static const char * const sched_feat_names[] = { #include "features.h" NULL }; -- cgit v1.2.3 From 7997a456ef841bb78eb6f881d7cc2c17c2f9b35e Mon Sep 17 00:00:00 2001 From: Hiroshi Shimamoto Date: Fri, 25 May 2012 15:42:47 +0900 Subject: sched: Remove the last NULL entry from sched_feat_names No need to have the last NULL entry. Signed-off-by: Hiroshi Shimamoto Signed-off-by: Peter Zijlstra Link: http://lkml.kernel.org/r/4FBF29E7.5020805@ct.jp.nec.com Signed-off-by: Ingo Molnar --- kernel/sched/core.c | 1 - 1 file changed, 1 deletion(-) (limited to 'kernel') diff --git a/kernel/sched/core.c b/kernel/sched/core.c index c1679a098fc7..94d598ac5e64 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -144,7 +144,6 @@ const_debug unsigned int sysctl_sched_features = static const char * const sched_feat_names[] = { #include "features.h" - NULL }; #undef SCHED_FEAT -- cgit v1.2.3 From 6a4c96eef42f835734a82c6b512abf9881b7c55d Mon Sep 17 00:00:00 2001 From: Kamalesh Babulal Date: Wed, 23 May 2012 14:44:11 +0530 Subject: sched: Remove NULL assignment of dattr_cur Remove explicit NULL assignment of static pointer dattr_cur from init_sched_domains(). 
Signed-off-by: Kamalesh Babulal Signed-off-by: Peter Zijlstra Link: http://lkml.kernel.org/r/20120523091411.GG5005@linux.vnet.ibm.com Signed-off-by: Ingo Molnar --- kernel/sched/core.c | 1 - 1 file changed, 1 deletion(-) (limited to 'kernel') diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 94d598ac5e64..c46958e26121 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6726,7 +6726,6 @@ static int init_sched_domains(const struct cpumask *cpu_map) if (!doms_cur) doms_cur = &fallback_doms; cpumask_andnot(doms_cur[0], cpu_map, cpu_isolated_map); - dattr_cur = NULL; err = build_sched_domains(doms_cur[0], NULL); register_sched_domain_sysctl(); -- cgit v1.2.3 From cb7225feec627e91d598198996429e9ee6804f8d Mon Sep 17 00:00:00 2001 From: Namhyung Kim Date: Thu, 31 May 2012 14:51:44 +0900 Subject: perf: Remove duplicate invocation on perf_event_for_each The @func callback was invoked twice for group leader when perf_event_for_each() called. It seems the commit 75f937f24bd9 ("perf_counter: Fix ctx->mutex vs counter ->mutex inversion") made the mistake during the change. Signed-off-by: Namhyung Kim Acked-by: Peter Zijlstra Cc: Namhyung Kim Cc: Paul Mackerras Cc: Peter Zijlstra Link: http://lkml.kernel.org/r/1338443506-25009-1-git-send-email-namhyung.kim@lge.com Signed-off-by: Arnaldo Carvalho de Melo --- kernel/events/core.c | 1 - 1 file changed, 1 deletion(-) (limited to 'kernel') diff --git a/kernel/events/core.c b/kernel/events/core.c index 5b06cbbf6931..f85c0154b333 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -3181,7 +3181,6 @@ static void perf_event_for_each(struct perf_event *event, event = event->group_leader; perf_event_for_each_child(event, func); - func(event); list_for_each_entry(sibling, &event->sibling_list, group_entry) perf_event_for_each_child(sibling, func); mutex_unlock(&ctx->mutex); -- cgit v1.2.3 From ee5e5683d8ac3fec876cb6c26792212f773d5898 Mon Sep 17 00:00:00 2001 From: Wei Yang Date: Thu, 31 May 2012 16:26:05 -0700 Subject: kernel/resource.c: correct the comment of allocate_resource() In the comment of allocate_resource(), the explanation of parameter max and min is not correct. Actually, these two parameters are used to specify the range of the resource that will be allocated, not the min/max size that will be allocated. Signed-off-by: Wei Yang Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/resource.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/resource.c b/kernel/resource.c index 7e8ea66a8c01..e1d2b8ee76d5 100644 --- a/kernel/resource.c +++ b/kernel/resource.c @@ -515,8 +515,8 @@ out: * @root: root resource descriptor * @new: resource descriptor desired by caller * @size: requested resource region size - * @min: minimum size to allocate - * @max: maximum size to allocate + * @min: minimum boundary to allocate + * @max: maximum boundary to allocate * @align: alignment requested, in bytes * @alignf: alignment function, optional, called if not NULL * @alignf_data: arbitrary data to pass to the @alignf function -- cgit v1.2.3 From 499eea6bf9c06df3bf4549954aee6fb3427946ed Mon Sep 17 00:00:00 2001 From: Sasikantha babu Date: Thu, 31 May 2012 16:26:07 -0700 Subject: sethostname/setdomainname: notify userspace when there is a change in uts_kern_table sethostname() and setdomainname() notify userspace on failure (without modifying uts_kern_table). Change things so that we only notify userspace on success, when uts_kern_table was actually modified. 
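[ Editorial illustration: a userspace sketch of why notifying only on success matters. It assumes the existing poll() support on /proc/sys/kernel/hostname that uts_proc_notify() drives; error handling is omitted. With this fix, the poll() below wakes up only when the hostname really changed, not when sethostname() failed. ]

#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char buf[65];
	ssize_t n;
	int fd = open("/proc/sys/kernel/hostname", O_RDONLY);
	struct pollfd pfd = { .fd = fd, .events = POLLPRI };

	read(fd, buf, sizeof(buf) - 1);	/* initial read arms the sysctl poll state */
	poll(&pfd, 1, -1);		/* reports POLLERR|POLLPRI after a hostname change */
	lseek(fd, 0, SEEK_SET);
	n = read(fd, buf, sizeof(buf) - 1);
	if (n > 0) {
		buf[n] = '\0';
		printf("hostname changed to: %s", buf);
	}
	close(fd);
	return 0;
}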
Signed-off-by: Sasikantha babu Cc: Paul Gortmaker Cc: Greg Kroah-Hartman Cc: WANG Cong Reviewed-by: Cyrill Gorcunov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/sys.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/sys.c b/kernel/sys.c index 6df42624e454..8b71cef3bf1a 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -1378,8 +1378,8 @@ SYSCALL_DEFINE2(sethostname, char __user *, name, int, len) memcpy(u->nodename, tmp, len); memset(u->nodename + len, 0, sizeof(u->nodename) - len); errno = 0; + uts_proc_notify(UTS_PROC_HOSTNAME); } - uts_proc_notify(UTS_PROC_HOSTNAME); up_write(&uts_sem); return errno; } @@ -1429,8 +1429,8 @@ SYSCALL_DEFINE2(setdomainname, char __user *, name, int, len) memcpy(u->domainname, tmp, len); memset(u->domainname + len, 0, sizeof(u->domainname) - len); errno = 0; + uts_proc_notify(UTS_PROC_DOMAINNAME); } - uts_proc_notify(UTS_PROC_DOMAINNAME); up_write(&uts_sem); return errno; } -- cgit v1.2.3 From 97fd75b7b8e0f4e6d3f06b819c89b2555f626fcf Mon Sep 17 00:00:00 2001 From: Andrew Morton Date: Thu, 31 May 2012 16:26:07 -0700 Subject: kernel/irq/manage.c: use the pr_foo() infrastructure to prefix printks Use the module-wide pr_fmt() mechanism rather than open-coding "genirq: " everywhere. Cc: Thomas Gleixner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/irq/manage.c | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c index bb32326afe87..7c475cd3f6e6 100644 --- a/kernel/irq/manage.c +++ b/kernel/irq/manage.c @@ -7,6 +7,8 @@ * This file contains driver APIs to the irq subsystem. */ +#define pr_fmt(fmt) "genirq: " fmt + #include #include #include @@ -565,7 +567,7 @@ int __irq_set_trigger(struct irq_desc *desc, unsigned int irq, * IRQF_TRIGGER_* but the PIC does not support multiple * flow-types? */ - pr_debug("genirq: No set_type function for IRQ %d (%s)\n", irq, + pr_debug("No set_type function for IRQ %d (%s)\n", irq, chip ? (chip->name ? : "unknown") : "unknown"); return 0; } @@ -600,7 +602,7 @@ int __irq_set_trigger(struct irq_desc *desc, unsigned int irq, ret = 0; break; default: - pr_err("genirq: Setting trigger mode %lu for irq %u failed (%pF)\n", + pr_err("Setting trigger mode %lu for irq %u failed (%pF)\n", flags, irq, chip->irq_set_type); } if (unmask) @@ -837,7 +839,7 @@ void exit_irq_thread(void) action = kthread_data(tsk); - pr_err("genirq: exiting task \"%s\" (%d) is an active IRQ thread (irq %d)\n", + pr_err("exiting task \"%s\" (%d) is an active IRQ thread (irq %d)\n", tsk->comm ? tsk->comm : "", tsk->pid, action->irq); desc = irq_to_desc(action->irq); @@ -1044,7 +1046,7 @@ __setup_irq(unsigned int irq, struct irq_desc *desc, struct irqaction *new) * has. The type flags are unreliable as the * underlying chip implementation can override them. 
*/ - pr_err("genirq: Threaded irq requested with handler=NULL and !ONESHOT for irq %d\n", + pr_err("Threaded irq requested with handler=NULL and !ONESHOT for irq %d\n", irq); ret = -EINVAL; goto out_mask; @@ -1095,7 +1097,7 @@ __setup_irq(unsigned int irq, struct irq_desc *desc, struct irqaction *new) if (nmsk != omsk) /* hope the handler works with current trigger mode */ - pr_warning("genirq: irq %d uses trigger mode %u; requested %u\n", + pr_warning("irq %d uses trigger mode %u; requested %u\n", irq, nmsk, omsk); } @@ -1133,7 +1135,7 @@ __setup_irq(unsigned int irq, struct irq_desc *desc, struct irqaction *new) mismatch: if (!(new->flags & IRQF_PROBE_SHARED)) { - pr_err("genirq: Flags mismatch irq %d. %08x (%s) vs. %08x (%s)\n", + pr_err("Flags mismatch irq %d. %08x (%s) vs. %08x (%s)\n", irq, new->flags, new->name, old->flags, old->name); #ifdef CONFIG_DEBUG_SHIRQ dump_stack(); -- cgit v1.2.3 From d84970bbaf9a09b3fc60c18ee6d59bc9cb4c3b8a Mon Sep 17 00:00:00 2001 From: Nicolas Pitre Date: Thu, 31 May 2012 16:26:07 -0700 Subject: kernel/cpu_pm.c: fix various typos Signed-off-by: Nicolas Pitre Acked-by: Colin Cross Acked-by: Santosh Shilimkar Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/cpu_pm.c | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) (limited to 'kernel') diff --git a/kernel/cpu_pm.c b/kernel/cpu_pm.c index 249152e15308..9656a3c36503 100644 --- a/kernel/cpu_pm.c +++ b/kernel/cpu_pm.c @@ -81,7 +81,7 @@ int cpu_pm_unregister_notifier(struct notifier_block *nb) EXPORT_SYMBOL_GPL(cpu_pm_unregister_notifier); /** - * cpm_pm_enter - CPU low power entry notifier + * cpu_pm_enter - CPU low power entry notifier * * Notifies listeners that a single CPU is entering a low power state that may * cause some blocks in the same power domain as the cpu to reset. @@ -89,7 +89,7 @@ EXPORT_SYMBOL_GPL(cpu_pm_unregister_notifier); * Must be called on the affected CPU with interrupts disabled. Platform is * responsible for ensuring that cpu_pm_enter is not called twice on the same * CPU before cpu_pm_exit is called. Notified drivers can include VFP - * co-processor, interrupt controller and it's PM extensions, local CPU + * co-processor, interrupt controller and its PM extensions, local CPU * timers context save/restore which shouldn't be interrupted. Hence it * must be called with interrupts disabled. * @@ -115,13 +115,13 @@ int cpu_pm_enter(void) EXPORT_SYMBOL_GPL(cpu_pm_enter); /** - * cpm_pm_exit - CPU low power exit notifier + * cpu_pm_exit - CPU low power exit notifier * * Notifies listeners that a single CPU is exiting a low power state that may * have caused some blocks in the same power domain as the cpu to reset. * * Notified drivers can include VFP co-processor, interrupt controller - * and it's PM extensions, local CPU timers context save/restore which + * and its PM extensions, local CPU timers context save/restore which * shouldn't be interrupted. Hence it must be called with interrupts disabled. * * Return conditions are same as __raw_notifier_call_chain. @@ -139,7 +139,7 @@ int cpu_pm_exit(void) EXPORT_SYMBOL_GPL(cpu_pm_exit); /** - * cpm_cluster_pm_enter - CPU cluster low power entry notifier + * cpu_cluster_pm_enter - CPU cluster low power entry notifier * * Notifies listeners that all cpus in a power domain are entering a low power * state that may cause some blocks in the same power domain to reset. 
@@ -147,7 +147,7 @@ EXPORT_SYMBOL_GPL(cpu_pm_exit); * Must be called after cpu_pm_enter has been called on all cpus in the power * domain, and before cpu_pm_exit has been called on any cpu in the power * domain. Notified drivers can include VFP co-processor, interrupt controller - * and it's PM extensions, local CPU timers context save/restore which + * and its PM extensions, local CPU timers context save/restore which * shouldn't be interrupted. Hence it must be called with interrupts disabled. * * Must be called with interrupts disabled. @@ -174,7 +174,7 @@ int cpu_cluster_pm_enter(void) EXPORT_SYMBOL_GPL(cpu_cluster_pm_enter); /** - * cpm_cluster_pm_exit - CPU cluster low power exit notifier + * cpu_cluster_pm_exit - CPU cluster low power exit notifier * * Notifies listeners that all cpus in a power domain are exiting form a * low power state that may have caused some blocks in the same power domain @@ -183,7 +183,7 @@ EXPORT_SYMBOL_GPL(cpu_cluster_pm_enter); * Must be called after cpu_pm_exit has been called on all cpus in the power * domain, and before cpu_pm_exit has been called on any cpu in the power * domain. Notified drivers can include VFP co-processor, interrupt controller - * and it's PM extensions, local CPU timers context save/restore which + * and its PM extensions, local CPU timers context save/restore which * shouldn't be interrupted. Hence it must be called with interrupts disabled. * * Return conditions are same as __raw_notifier_call_chain. -- cgit v1.2.3 From ae3cef7300e9fddc35ad251dd5f27c5b88c8594a Mon Sep 17 00:00:00 2001 From: Boaz Harrosh Date: Thu, 31 May 2012 16:26:14 -0700 Subject: kmod: unexport call_usermodehelper_freeinfo() call_usermodehelper_freeinfo() is not used outside of kmod.c. So unexport it, and make it static to kmod.c Signed-off-by: Boaz Harrosh Cc: Oleg Nesterov Cc: Tetsuo Handa Cc: Ingo Molnar Cc: Peter Zijlstra Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/kmod.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/kmod.c b/kernel/kmod.c index 05698a7415fe..21a0f8e99102 100644 --- a/kernel/kmod.c +++ b/kernel/kmod.c @@ -221,13 +221,12 @@ fail: return 0; } -void call_usermodehelper_freeinfo(struct subprocess_info *info) +static void call_usermodehelper_freeinfo(struct subprocess_info *info) { if (info->cleanup) (*info->cleanup)(info); kfree(info); } -EXPORT_SYMBOL(call_usermodehelper_freeinfo); static void umh_complete(struct subprocess_info *sub_info) { -- cgit v1.2.3 From 81ab6e7b26b453a795d46f2616ed0e31d97f05b9 Mon Sep 17 00:00:00 2001 From: Boaz Harrosh Date: Thu, 31 May 2012 16:26:15 -0700 Subject: kmod: convert two call sites to call_usermodehelper_fns() Both kernel/sys.c && security/keys/request_key.c where inlining the exact same code as call_usermodehelper_fns(); So simply convert these sites to directly use call_usermodehelper_fns(). 
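[ Editorial illustration: what a converted call site looks like with the consolidated helper. The /sbin/foo-agent path, the argv/envp contents and the lack of init/cleanup callbacks are made up; the call_usermodehelper_fns() signature is the one used throughout this series. ]

#include <linux/kmod.h>

static int foo_run_agent(char *event)
{
	char *argv[] = { "/sbin/foo-agent", event, NULL };
	char *envp[] = { "HOME=/", "PATH=/sbin:/usr/sbin:/bin:/usr/bin", NULL };

	/* wait until the helper has been exec'd; no init/cleanup, no data */
	return call_usermodehelper_fns(argv[0], argv, envp, UMH_WAIT_EXEC,
				       NULL, NULL, NULL);
}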
Signed-off-by: Boaz Harrosh Cc: Oleg Nesterov Cc: Tetsuo Handa Cc: Ingo Molnar Cc: Peter Zijlstra Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/sys.c | 19 ++++++++----------- 1 file changed, 8 insertions(+), 11 deletions(-) (limited to 'kernel') diff --git a/kernel/sys.c b/kernel/sys.c index 8b71cef3bf1a..6e81aa7e4688 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -2114,7 +2114,6 @@ int orderly_poweroff(bool force) NULL }; int ret = -ENOMEM; - struct subprocess_info *info; if (argv == NULL) { printk(KERN_WARNING "%s failed to allocate memory for \"%s\"\n", @@ -2122,18 +2121,16 @@ int orderly_poweroff(bool force) goto out; } - info = call_usermodehelper_setup(argv[0], argv, envp, GFP_ATOMIC); - if (info == NULL) { - argv_free(argv); - goto out; - } - - call_usermodehelper_setfns(info, NULL, argv_cleanup, NULL); + ret = call_usermodehelper_fns(argv[0], argv, envp, UMH_NO_WAIT, + NULL, argv_cleanup, NULL); +out: + if (likely(!ret)) + return 0; - ret = call_usermodehelper_exec(info, UMH_NO_WAIT); + if (ret == -ENOMEM) + argv_free(argv); - out: - if (ret && force) { + if (force) { printk(KERN_WARNING "Failed to start orderly shutdown: " "forcing the issue\n"); -- cgit v1.2.3 From 785042f2e275089e22c36b462f6495ce8d91732d Mon Sep 17 00:00:00 2001 From: Boaz Harrosh Date: Thu, 31 May 2012 16:26:15 -0700 Subject: kmod: move call_usermodehelper_fns() to .c file and unexport all it's helpers If we move call_usermodehelper_fns() to kmod.c file and EXPORT_SYMBOL it we can avoid exporting all it's helper functions: call_usermodehelper_setup call_usermodehelper_setfns call_usermodehelper_exec And make all of them static to kmod.c Since the optimizer will see all these as a single call site it will inline them inside call_usermodehelper_fns(). So we loose the call to _fns but gain 3 calls to the helpers. (Not that it matters) Signed-off-by: Boaz Harrosh Cc: Oleg Nesterov Cc: Tetsuo Handa Cc: Ingo Molnar Cc: Peter Zijlstra Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/kmod.c | 25 ++++++++++++++++++++++--- 1 file changed, 22 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/kmod.c b/kernel/kmod.c index 21a0f8e99102..1f596e4de306 100644 --- a/kernel/kmod.c +++ b/kernel/kmod.c @@ -478,6 +478,7 @@ static void helper_unlock(void) * structure. This should be passed to call_usermodehelper_exec to * exec the process and free the structure. */ +static struct subprocess_info *call_usermodehelper_setup(char *path, char **argv, char **envp, gfp_t gfp_mask) { @@ -493,7 +494,6 @@ struct subprocess_info *call_usermodehelper_setup(char *path, char **argv, out: return sub_info; } -EXPORT_SYMBOL(call_usermodehelper_setup); /** * call_usermodehelper_setfns - set a cleanup/init function @@ -511,6 +511,7 @@ EXPORT_SYMBOL(call_usermodehelper_setup); * Function must be runnable in either a process context or the * context in which call_usermodehelper_exec is called. */ +static void call_usermodehelper_setfns(struct subprocess_info *info, int (*init)(struct subprocess_info *info, struct cred *new), void (*cleanup)(struct subprocess_info *info), @@ -520,7 +521,6 @@ void call_usermodehelper_setfns(struct subprocess_info *info, info->init = init; info->data = data; } -EXPORT_SYMBOL(call_usermodehelper_setfns); /** * call_usermodehelper_exec - start a usermode application @@ -534,6 +534,7 @@ EXPORT_SYMBOL(call_usermodehelper_setfns); * asynchronously if wait is not set, and runs as a child of keventd. * (ie. it runs with full root capabilities). 
*/ +static int call_usermodehelper_exec(struct subprocess_info *sub_info, int wait) { DECLARE_COMPLETION_ONSTACK(done); @@ -575,7 +576,25 @@ unlock: helper_unlock(); return retval; } -EXPORT_SYMBOL(call_usermodehelper_exec); + +int call_usermodehelper_fns( + char *path, char **argv, char **envp, int wait, + int (*init)(struct subprocess_info *info, struct cred *new), + void (*cleanup)(struct subprocess_info *), void *data) +{ + struct subprocess_info *info; + gfp_t gfp_mask = (wait == UMH_NO_WAIT) ? GFP_ATOMIC : GFP_KERNEL; + + info = call_usermodehelper_setup(path, argv, envp, gfp_mask); + + if (info == NULL) + return -ENOMEM; + + call_usermodehelper_setfns(info, init, cleanup, data); + + return call_usermodehelper_exec(info, wait); +} +EXPORT_SYMBOL(call_usermodehelper_fns); static int proc_cap_handler(struct ctl_table *table, int write, void __user *buffer, size_t *lenp, loff_t *ppos) -- cgit v1.2.3 From 9b3c98cd663750c33434572ff76ba306505eba5a Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Thu, 31 May 2012 16:26:15 -0700 Subject: kmod.c: fix kernel-doc warning Warning(kernel/kmod.c:419): No description found for parameter 'depth' Signed-off-by: Randy Dunlap Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/kmod.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/kmod.c b/kernel/kmod.c index 1f596e4de306..ff2c7cb86d77 100644 --- a/kernel/kmod.c +++ b/kernel/kmod.c @@ -409,7 +409,7 @@ EXPORT_SYMBOL_GPL(usermodehelper_read_unlock); /** * __usermodehelper_set_disable_depth - Modify usermodehelper_disabled. - * depth: New value to assign to usermodehelper_disabled. + * @depth: New value to assign to usermodehelper_disabled. * * Change the value of usermodehelper_disabled (under umhelper_sem locked for * writing) and wakeup tasks waiting for it to change. -- cgit v1.2.3 From 43e13cc107cf6cd3c15fbe1cef849435c2223d50 Mon Sep 17 00:00:00 2001 From: Oleg Nesterov Date: Thu, 31 May 2012 16:26:16 -0700 Subject: cred: remove task_is_dead() from __task_cred() validation Commit 8f92054e7ca1 ("CRED: Fix __task_cred()'s lockdep check and banner comment"): add the following validation condition: task->exit_state >= 0 to permit the access if the target task is dead and therefore unable to change its own credentials. OK, but afaics currently this can only help wait_task_zombie() which calls __task_cred() without rcu lock. Remove this validation and change wait_task_zombie() to use task_uid() instead. This means we do rcu_read_lock() only to shut up the lockdep, but we already do the same in, say, wait_task_stopped(). task_is_dead() should die, task->exit_state != 0 means that this task has passed exit_notify(), only do_wait-like code paths should use this. Unfortunately, we can't kill task_is_dead() right now, it has already acquired buggy users in drivers/staging. The fix already exists. Signed-off-by: Oleg Nesterov Reviewed-by: "Eric W. Biederman" Acked-by: David Howells Cc: Jiri Olsa Cc: Paul E. 
McKenney Cc: James Morris Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/exit.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/exit.c b/kernel/exit.c index 910a0716e17a..3281493ce7ad 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -1214,7 +1214,7 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p) unsigned long state; int retval, status, traced; pid_t pid = task_pid_vnr(p); - uid_t uid = from_kuid_munged(current_user_ns(), __task_cred(p)->uid); + uid_t uid = from_kuid_munged(current_user_ns(), task_uid(p)); struct siginfo __user *infop; if (!likely(wo->wo_flags & WEXITED)) -- cgit v1.2.3 From 168eeccbc956d2ec083c3a513f7706784ee0dc5f Mon Sep 17 00:00:00 2001 From: Tim Bird Date: Thu, 31 May 2012 16:26:16 -0700 Subject: stack usage: add pid to warning printk in check_stack_usage In embedded systems, sometimes the same program (busybox) is the cause of multiple warnings. Outputting the pid with the program name in the warning printk helps distinguish which instances of a program are using the stack most. This is a small patch, but useful. Signed-off-by: Tim Bird Cc: Oleg Nesterov Cc: Frederic Weisbecker Cc: Andrea Arcangeli Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/exit.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/exit.c b/kernel/exit.c index 3281493ce7ad..6d85655353e9 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -884,9 +884,9 @@ static void check_stack_usage(void) spin_lock(&low_water_lock); if (free < lowest_to_date) { - printk(KERN_WARNING "%s used greatest stack depth: %lu bytes " - "left\n", - current->comm, free); + printk(KERN_WARNING "%s (%d) used greatest stack depth: " + "%lu bytes left\n", + current->comm, task_pid_nr(current), free); lowest_to_date = free; } spin_unlock(&low_water_lock); -- cgit v1.2.3 From f7505d64f2db5da2d7d94873ddf2cd2524847061 Mon Sep 17 00:00:00 2001 From: Konstantin Khlebnikov Date: Thu, 31 May 2012 16:26:21 -0700 Subject: fork: call complete_vfork_done() after clearing child_tid and flushing rss-counters Child should wake up the parent from vfork() only after finishing all operations with shared mm. There is no sense in using CLONE_CHILD_CLEARTID together with CLONE_VFORK, but it looks more accurate now. Signed-off-by: Konstantin Khlebnikov Cc: Oleg Nesterov Cc: Hugh Dickins Cc: KAMEZAWA Hiroyuki Cc: Konstantin Khlebnikov Cc: Markus Trippelsdorf Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/fork.c | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/fork.c b/kernel/fork.c index 017fb23d5983..2254fbf23567 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -787,9 +787,6 @@ void mm_release(struct task_struct *tsk, struct mm_struct *mm) /* Get rid of any cached register state */ deactivate_mm(tsk, mm); - if (tsk->vfork_done) - complete_vfork_done(tsk); - /* * If we're exiting normally, clear a user-space tid field if * requested. We leave this alone when dying by signal, to leave @@ -810,6 +807,13 @@ void mm_release(struct task_struct *tsk, struct mm_struct *mm) } tsk->clear_child_tid = NULL; } + + /* + * All done, finally we can wake up parent and return this mm to him. + * Also kthread_stop() uses this completion for synchronization. 
+ */ + if (tsk->vfork_done) + complete_vfork_done(tsk); } /* -- cgit v1.2.3 From cb79295e20a8088a2fd6a9b3cb5f2d889ec36b4d Mon Sep 17 00:00:00 2001 From: Anton Vorontsov Date: Thu, 31 May 2012 16:26:22 -0700 Subject: cpu: introduce clear_tasks_mm_cpumask() helper Many architectures clear tasks' mm_cpumask like this: read_lock(&tasklist_lock); for_each_process(p) { if (p->mm) cpumask_clear_cpu(cpu, mm_cpumask(p->mm)); } read_unlock(&tasklist_lock); Depending on the context, the code above may have several problems, such as: 1. Working with task->mm w/o getting mm or grabing the task lock is dangerous as ->mm might disappear (exit_mm() assigns NULL under task_lock(), so tasklist lock is not enough). 2. Checking for process->mm is not enough because process' main thread may exit or detach its mm via use_mm(), but other threads may still have a valid mm. This patch implements a small helper function that does things correctly, i.e.: 1. We take the task's lock while whe handle its mm (we can't use get_task_mm()/mmput() pair as mmput() might sleep); 2. To catch exited main thread case, we use find_lock_task_mm(), which walks up all threads and returns an appropriate task (with task lock held). Also, Per Peter Zijlstra's idea, now we don't grab tasklist_lock in the new helper, instead we take the rcu read lock. We can do this because the function is called after the cpu is taken down and marked offline, so no new tasks will get this cpu set in their mm mask. Signed-off-by: Anton Vorontsov Cc: Richard Weinberger Cc: Oleg Nesterov Cc: Peter Zijlstra Cc: Russell King Cc: Benjamin Herrenschmidt Cc: Mike Frysinger Cc: Paul Mundt Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/cpu.c | 26 ++++++++++++++++++++++++++ 1 file changed, 26 insertions(+) (limited to 'kernel') diff --git a/kernel/cpu.c b/kernel/cpu.c index 0e6353cf147a..0575197deb4a 100644 --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -10,6 +10,8 @@ #include #include #include +#include +#include #include #include #include @@ -173,6 +175,30 @@ void __ref unregister_cpu_notifier(struct notifier_block *nb) } EXPORT_SYMBOL(unregister_cpu_notifier); +void clear_tasks_mm_cpumask(int cpu) +{ + struct task_struct *p; + + /* + * This function is called after the cpu is taken down and marked + * offline, so its not like new tasks will ever get this cpu set in + * their mm mask. -- Peter Zijlstra + * Thus, we may use rcu_read_lock() here, instead of grabbing + * full-fledged tasklist_lock. + */ + rcu_read_lock(); + for_each_process(p) { + struct task_struct *t; + + t = find_lock_task_mm(p); + if (!t) + continue; + cpumask_clear_cpu(cpu, mm_cpumask(t->mm)); + task_unlock(t); + } + rcu_read_unlock(); +} + static inline void check_for_tasks(int cpu) { struct task_struct *p; -- cgit v1.2.3 From e4cc2f873ad0833aa5c4aca56bebe15b9603a1e7 Mon Sep 17 00:00:00 2001 From: Anton Vorontsov Date: Thu, 31 May 2012 16:26:26 -0700 Subject: kernel/cpu.c: document clear_tasks_mm_cpumask() Add more comments on clear_tasks_mm_cpumask, plus adds a runtime check: the function is only suitable for offlined CPUs, and if called inappropriately, the kernel should scream aloud. 
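For illustration only, here is a minimal sketch of how an architecture's hotplug teardown path might call the helper; the hook name below is hypothetical and not taken from any in-tree port:

/*
 * Hypothetical arch-side caller (sketch, not part of this patch). The CPU
 * must already be marked offline, otherwise the runtime check added below
 * will fire.
 */
static void arch_cpu_offline_cleanup(unsigned int cpu)
{
	/* replaces the old open-coded for_each_process()/mm_cpumask walk */
	clear_tasks_mm_cpumask(cpu);

	/* arch-specific TLB/cache teardown would follow here */
}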
[akpm@linux-foundation.org: tweak comment: s/walks up/walks/, use 80 cols] Suggested-by: Andrew Morton Suggested-by: Peter Zijlstra Signed-off-by: Anton Vorontsov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/cpu.c | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+) (limited to 'kernel') diff --git a/kernel/cpu.c b/kernel/cpu.c index 0575197deb4a..a4eb5227a19e 100644 --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -13,6 +13,7 @@ #include #include #include +#include #include #include #include @@ -175,6 +176,18 @@ void __ref unregister_cpu_notifier(struct notifier_block *nb) } EXPORT_SYMBOL(unregister_cpu_notifier); +/** + * clear_tasks_mm_cpumask - Safely clear tasks' mm_cpumask for a CPU + * @cpu: a CPU id + * + * This function walks all processes, finds a valid mm struct for each one and + * then clears a corresponding bit in mm's cpumask. While this all sounds + * trivial, there are various non-obvious corner cases, which this function + * tries to solve in a safe manner. + * + * Also note that the function uses a somewhat relaxed locking scheme, so it may + * be called only for an already offlined CPU. + */ void clear_tasks_mm_cpumask(int cpu) { struct task_struct *p; @@ -186,10 +199,15 @@ void clear_tasks_mm_cpumask(int cpu) * Thus, we may use rcu_read_lock() here, instead of grabbing * full-fledged tasklist_lock. */ + WARN_ON(cpu_online(cpu)); rcu_read_lock(); for_each_process(p) { struct task_struct *t; + /* + * Main thread might exit, but other threads may still have + * a valid mm. Find one. + */ t = find_lock_task_mm(p); if (!t) continue; -- cgit v1.2.3 From 3208450488ae724196f1efffc457e4265957c04e Mon Sep 17 00:00:00 2001 From: "Eric W. Biederman" Date: Thu, 31 May 2012 16:26:39 -0700 Subject: pidns: use task_active_pid_ns in do_notify_parent Using task_active_pid_ns is more robust because it works even after we have called exit_namespaces. This change allows us to have parent processes that are zombies. Normally a zombie parent processes is crazy and the last thing you would want to have but in the case of not letting the init process of a pid namespace be reaped until all of it's children are dead and reaped a zombie parent process is exactly what we want. Signed-off-by: Eric W. Biederman Cc: Oleg Nesterov Cc: Pavel Emelyanov Cc: Cyrill Gorcunov Cc: Louis Rilling Cc: Mike Galbraith Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/signal.c | 11 +++++------ 1 file changed, 5 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/signal.c b/kernel/signal.c index f7b418217633..08dfbd748cd2 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -1656,19 +1656,18 @@ bool do_notify_parent(struct task_struct *tsk, int sig) info.si_signo = sig; info.si_errno = 0; /* - * we are under tasklist_lock here so our parent is tied to - * us and cannot exit and release its namespace. + * We are under tasklist_lock here so our parent is tied to + * us and cannot change. * - * the only it can is to switch its nsproxy with sys_unshare, - * bu uncharing pid namespaces is not allowed, so we'll always - * see relevant namespace + * task_active_pid_ns will always return the same pid namespace + * until a task passes through release_task. 
* * write_lock() currently calls preempt_disable() which is the * same as rcu_read_lock(), but according to Oleg, this is not * correct to rely on this */ rcu_read_lock(); - info.si_pid = task_pid_nr_ns(tsk, tsk->parent->nsproxy->pid_ns); + info.si_pid = task_pid_nr_ns(tsk, task_active_pid_ns(tsk->parent)); info.si_uid = from_kuid_munged(task_cred_xxx(tsk->parent, user_ns), task_uid(tsk)); rcu_read_unlock(); -- cgit v1.2.3 From 00c10bc13cdb58447d6bb2a003afad7bd60f5a5f Mon Sep 17 00:00:00 2001 From: "Eric W. Biederman" Date: Thu, 31 May 2012 16:26:40 -0700 Subject: pidns: make killed children autoreap Force SIGCHLD handling to SIG_IGN so that signals are not generated and so that the children autoreap. This increases the parallelize and in general the speed of network namespace shutdown. Note self reaping childrean can exist past zap_pid_ns_processess but they will all be reaped before we allow the pid namespace init task with pid == 1 to be reaped. [akpm@linux-foundation.org: checkpatch fixes] Signed-off-by: Eric W. Biederman Cc: Oleg Nesterov Cc: Pavel Emelyanov Cc: Cyrill Gorcunov Cc: Louis Rilling Cc: Mike Galbraith Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/pid_namespace.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c index 57bc1fd35b3c..fd3c44986191 100644 --- a/kernel/pid_namespace.c +++ b/kernel/pid_namespace.c @@ -149,7 +149,12 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns) { int nr; int rc; - struct task_struct *task; + struct task_struct *task, *me = current; + + /* Ignore SIGCHLD causing any terminated children to autoreap */ + spin_lock_irq(&me->sighand->siglock); + me->sighand->action[SIGCHLD - 1].sa.sa_handler = SIG_IGN; + spin_unlock_irq(&me->sighand->siglock); /* * The last thread in the cgroup-init thread group is terminating. -- cgit v1.2.3 From 98ed57eef9f67dfe541be0bca34660ffc88365b2 Mon Sep 17 00:00:00 2001 From: Cyrill Gorcunov Date: Thu, 31 May 2012 16:26:42 -0700 Subject: sysctl: make kernel.ns_last_pid control dependent on CHECKPOINT_RESTORE For those who doesn't need C/R functionality there is no need to control last pid, ie the pid for the next fork() call. 
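As a usage sketch (not part of this patch): with CONFIG_CHECKPOINT_RESTORE enabled, a restore tool can steer the pid returned by the next fork() through the new sysctl. The helper below is illustrative only and assumes the writer has sufficient privilege in the target pid namespace and that the requested pid is still free:

#include <stdio.h>
#include <sys/types.h>

/* Illustrative only: the allocator hands out the next free pid above the
 * stored value, so write want - 1 to get 'want' on the next fork(). */
static int set_next_pid(pid_t want)
{
	FILE *f = fopen("/proc/sys/kernel/ns_last_pid", "w");

	if (!f)
		return -1;
	fprintf(f, "%d", (int)(want - 1));
	return fclose(f);
}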
Signed-off-by: Cyrill Gorcunov Cc: Pavel Emelyanov Cc: Tejun Heo Cc: Oleg Nesterov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/pid_namespace.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c index fd3c44986191..16b20e38c4a1 100644 --- a/kernel/pid_namespace.c +++ b/kernel/pid_namespace.c @@ -196,6 +196,7 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns) return; } +#ifdef CONFIG_CHECKPOINT_RESTORE static int pid_ns_ctl_handler(struct ctl_table *table, int write, void __user *buffer, size_t *lenp, loff_t *ppos) { @@ -223,8 +224,8 @@ static struct ctl_table pid_ns_ctl_table[] = { }, { } }; - static struct ctl_path kern_path[] = { { .procname = "kernel", }, { } }; +#endif /* CONFIG_CHECKPOINT_RESTORE */ int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd) { @@ -258,7 +259,10 @@ int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd) static __init int pid_namespaces_init(void) { pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC); + +#ifdef CONFIG_CHECKPOINT_RESTORE register_sysctl_paths(kern_path, pid_ns_ctl_table); +#endif return 0; } -- cgit v1.2.3 From d97b46a64674a267bc41c9e16132ee2a98c3347d Mon Sep 17 00:00:00 2001 From: Cyrill Gorcunov Date: Thu, 31 May 2012 16:26:44 -0700 Subject: syscalls, x86: add __NR_kcmp syscall While doing the checkpoint-restore in the user space one need to determine whether various kernel objects (like mm_struct-s of file_struct-s) are shared between tasks and restore this state. The 2nd step can be solved by using appropriate CLONE_ flags and the unshare syscall, while there's currently no ways for solving the 1st one. One of the ways for checking whether two tasks share e.g. mm_struct is to provide some mm_struct ID of a task to its proc file, but showing such info considered to be not that good for security reasons. Thus after some debates we end up in conclusion that using that named 'comparison' syscall might be the best candidate. So here is it -- __NR_kcmp. It takes up to 5 arguments - the pids of the two tasks (which characteristics should be compared), the comparison type and (in case of comparison of files) two file descriptors. Lookups for pids are done in the caller's PID namespace only. At moment only x86 is supported and tested. [akpm@linux-foundation.org: fix up selftests, warnings] [akpm@linux-foundation.org: include errno.h] [akpm@linux-foundation.org: tweak comment text] Signed-off-by: Cyrill Gorcunov Acked-by: "Eric W. Biederman" Cc: Pavel Emelyanov Cc: Andrey Vagin Cc: KOSAKI Motohiro Cc: Ingo Molnar Cc: H. 
Peter Anvin Cc: Thomas Gleixner Cc: Glauber Costa Cc: Andi Kleen Cc: Tejun Heo Cc: Matt Helsley Cc: Pekka Enberg Cc: Eric Dumazet Cc: Vasiliy Kulikov Cc: Alexey Dobriyan Cc: Valdis.Kletnieks@vt.edu Cc: Michal Marek Cc: Frederic Weisbecker Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/Makefile | 3 + kernel/kcmp.c | 196 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++ kernel/sys_ni.c | 3 + 3 files changed, 202 insertions(+) create mode 100644 kernel/kcmp.c (limited to 'kernel') diff --git a/kernel/Makefile b/kernel/Makefile index 6c07f30fa9b7..80be6ca0cc75 100644 --- a/kernel/Makefile +++ b/kernel/Makefile @@ -25,6 +25,9 @@ endif obj-y += sched/ obj-y += power/ +ifeq ($(CONFIG_CHECKPOINT_RESTORE),y) +obj-$(CONFIG_X86) += kcmp.o +endif obj-$(CONFIG_FREEZER) += freezer.o obj-$(CONFIG_PROFILING) += profile.o obj-$(CONFIG_STACKTRACE) += stacktrace.o diff --git a/kernel/kcmp.c b/kernel/kcmp.c new file mode 100644 index 000000000000..30b7b225306c --- /dev/null +++ b/kernel/kcmp.c @@ -0,0 +1,196 @@ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +/* + * We don't expose the real in-memory order of objects for security reasons. + * But still the comparison results should be suitable for sorting. So we + * obfuscate kernel pointers values and compare the production instead. + * + * The obfuscation is done in two steps. First we xor the kernel pointer with + * a random value, which puts pointer into a new position in a reordered space. + * Secondly we multiply the xor production with a large odd random number to + * permute its bits even more (the odd multiplier guarantees that the product + * is unique ever after the high bits are truncated, since any odd number is + * relative prime to 2^n). + * + * Note also that the obfuscation itself is invisible to userspace and if needed + * it can be changed to an alternate scheme. + */ +static unsigned long cookies[KCMP_TYPES][2] __read_mostly; + +static long kptr_obfuscate(long v, int type) +{ + return (v ^ cookies[type][0]) * cookies[type][1]; +} + +/* + * 0 - equal, i.e. v1 = v2 + * 1 - less than, i.e. v1 < v2 + * 2 - greater than, i.e. v1 > v2 + * 3 - not equal but ordering unavailable (reserved for future) + */ +static int kcmp_ptr(void *v1, void *v2, enum kcmp_type type) +{ + long ret; + + ret = kptr_obfuscate((long)v1, type) - kptr_obfuscate((long)v2, type); + + return (ret < 0) | ((ret > 0) << 1); +} + +/* The caller must have pinned the task */ +static struct file * +get_file_raw_ptr(struct task_struct *task, unsigned int idx) +{ + struct file *file = NULL; + + task_lock(task); + rcu_read_lock(); + + if (task->files) + file = fcheck_files(task->files, idx); + + rcu_read_unlock(); + task_unlock(task); + + return file; +} + +static void kcmp_unlock(struct mutex *m1, struct mutex *m2) +{ + if (likely(m2 != m1)) + mutex_unlock(m2); + mutex_unlock(m1); +} + +static int kcmp_lock(struct mutex *m1, struct mutex *m2) +{ + int err; + + if (m2 > m1) + swap(m1, m2); + + err = mutex_lock_killable(m1); + if (!err && likely(m1 != m2)) { + err = mutex_lock_killable_nested(m2, SINGLE_DEPTH_NESTING); + if (err) + mutex_unlock(m1); + } + + return err; +} + +SYSCALL_DEFINE5(kcmp, pid_t, pid1, pid_t, pid2, int, type, + unsigned long, idx1, unsigned long, idx2) +{ + struct task_struct *task1, *task2; + int ret; + + rcu_read_lock(); + + /* + * Tasks are looked up in caller's PID namespace only. 
+ */ + task1 = find_task_by_vpid(pid1); + task2 = find_task_by_vpid(pid2); + if (!task1 || !task2) + goto err_no_task; + + get_task_struct(task1); + get_task_struct(task2); + + rcu_read_unlock(); + + /* + * One should have enough rights to inspect task details. + */ + ret = kcmp_lock(&task1->signal->cred_guard_mutex, + &task2->signal->cred_guard_mutex); + if (ret) + goto err; + if (!ptrace_may_access(task1, PTRACE_MODE_READ) || + !ptrace_may_access(task2, PTRACE_MODE_READ)) { + ret = -EPERM; + goto err_unlock; + } + + switch (type) { + case KCMP_FILE: { + struct file *filp1, *filp2; + + filp1 = get_file_raw_ptr(task1, idx1); + filp2 = get_file_raw_ptr(task2, idx2); + + if (filp1 && filp2) + ret = kcmp_ptr(filp1, filp2, KCMP_FILE); + else + ret = -EBADF; + break; + } + case KCMP_VM: + ret = kcmp_ptr(task1->mm, task2->mm, KCMP_VM); + break; + case KCMP_FILES: + ret = kcmp_ptr(task1->files, task2->files, KCMP_FILES); + break; + case KCMP_FS: + ret = kcmp_ptr(task1->fs, task2->fs, KCMP_FS); + break; + case KCMP_SIGHAND: + ret = kcmp_ptr(task1->sighand, task2->sighand, KCMP_SIGHAND); + break; + case KCMP_IO: + ret = kcmp_ptr(task1->io_context, task2->io_context, KCMP_IO); + break; + case KCMP_SYSVSEM: +#ifdef CONFIG_SYSVIPC + ret = kcmp_ptr(task1->sysvsem.undo_list, + task2->sysvsem.undo_list, + KCMP_SYSVSEM); +#else + ret = -EOPNOTSUPP; +#endif + break; + default: + ret = -EINVAL; + break; + } + +err_unlock: + kcmp_unlock(&task1->signal->cred_guard_mutex, + &task2->signal->cred_guard_mutex); +err: + put_task_struct(task1); + put_task_struct(task2); + + return ret; + +err_no_task: + rcu_read_unlock(); + return -ESRCH; +} + +static __init int kcmp_cookies_init(void) +{ + int i; + + get_random_bytes(cookies, sizeof(cookies)); + + for (i = 0; i < KCMP_TYPES; i++) + cookies[i][1] |= (~(~0UL >> 1) | 1); + + return 0; +} +arch_initcall(kcmp_cookies_init); diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index 47bfa16430d7..dbff751e4086 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -203,3 +203,6 @@ cond_syscall(sys_fanotify_mark); cond_syscall(sys_name_to_handle_at); cond_syscall(sys_open_by_handle_at); cond_syscall(compat_sys_open_by_handle_at); + +/* compare kernel pointers */ +cond_syscall(sys_kcmp); -- cgit v1.2.3 From fe8c7f5cbf91124987106faa3bdf0c8b955c4cf7 Mon Sep 17 00:00:00 2001 From: Cyrill Gorcunov Date: Thu, 31 May 2012 16:26:45 -0700 Subject: c/r: prctl: extend PR_SET_MM to set up more mm_struct entries During checkpoint we dump whole process memory to a file and the dump includes process stack memory. But among stack data itself, the stack carries additional parameters such as command line arguments, environment data and auxiliary vector. So when we do restore procedure and once we've restored stack data itself we need to setup mm_struct::arg_start/end, env_start/end, so restored process would be able to find command line arguments and environment data it had at checkpoint time. The same applies to auxiliary vector. For this reason additional PR_SET_MM_(ARG_START | ARG_END | ENV_START | ENV_END | AUXV) codes are introduced. 
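For illustration, a restore-side sketch (not from this patch) of how the new codes can be used once the checkpointed stack contents have been copied back into place; it assumes CAP_SYS_RESOURCE and CONFIG_CHECKPOINT_RESTORE:

#include <sys/prctl.h>

/* Illustrative only: re-establish the mm_struct markers so that the
 * restored task's /proc/<pid>/cmdline and environment lookups work. */
static int restore_mm_marks(unsigned long arg_start, unsigned long arg_end,
			    unsigned long env_start, unsigned long env_end,
			    unsigned long *auxv, unsigned long auxv_size)
{
	if (prctl(PR_SET_MM, PR_SET_MM_ARG_START, arg_start, 0, 0) ||
	    prctl(PR_SET_MM, PR_SET_MM_ARG_END,   arg_end,   0, 0) ||
	    prctl(PR_SET_MM, PR_SET_MM_ENV_START, env_start, 0, 0) ||
	    prctl(PR_SET_MM, PR_SET_MM_ENV_END,   env_end,   0, 0))
		return -1;

	/* PR_SET_MM_AUXV is the one code that takes a length in arg4 */
	return prctl(PR_SET_MM, PR_SET_MM_AUXV, (unsigned long)auxv,
		     auxv_size, 0);
}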
Signed-off-by: Cyrill Gorcunov Acked-by: Kees Cook Cc: Tejun Heo Cc: Andrew Vagin Cc: Serge Hallyn Cc: Pavel Emelyanov Cc: Vasiliy Kulikov Cc: KAMEZAWA Hiroyuki Cc: Michael Kerrisk Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/sys.c | 134 ++++++++++++++++++++++++++++++++++++----------------------- 1 file changed, 83 insertions(+), 51 deletions(-) (limited to 'kernel') diff --git a/kernel/sys.c b/kernel/sys.c index 6e81aa7e4688..8b544972e46e 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -1784,17 +1784,23 @@ SYSCALL_DEFINE1(umask, int, mask) } #ifdef CONFIG_CHECKPOINT_RESTORE +static bool vma_flags_mismatch(struct vm_area_struct *vma, + unsigned long required, + unsigned long banned) +{ + return (vma->vm_flags & required) != required || + (vma->vm_flags & banned); +} + static int prctl_set_mm(int opt, unsigned long addr, unsigned long arg4, unsigned long arg5) { unsigned long rlim = rlimit(RLIMIT_DATA); - unsigned long vm_req_flags; - unsigned long vm_bad_flags; - struct vm_area_struct *vma; - int error = 0; struct mm_struct *mm = current->mm; + struct vm_area_struct *vma; + int error; - if (arg4 | arg5) + if (arg5 || (arg4 && opt != PR_SET_MM_AUXV)) return -EINVAL; if (!capable(CAP_SYS_RESOURCE)) @@ -1803,58 +1809,23 @@ static int prctl_set_mm(int opt, unsigned long addr, if (addr >= TASK_SIZE) return -EINVAL; + error = -EINVAL; + down_read(&mm->mmap_sem); vma = find_vma(mm, addr); - if (opt != PR_SET_MM_START_BRK && opt != PR_SET_MM_BRK) { - /* It must be existing VMA */ - if (!vma || vma->vm_start > addr) - goto out; - } - - error = -EINVAL; switch (opt) { case PR_SET_MM_START_CODE: + mm->start_code = addr; + break; case PR_SET_MM_END_CODE: - vm_req_flags = VM_READ | VM_EXEC; - vm_bad_flags = VM_WRITE | VM_MAYSHARE; - - if ((vma->vm_flags & vm_req_flags) != vm_req_flags || - (vma->vm_flags & vm_bad_flags)) - goto out; - - if (opt == PR_SET_MM_START_CODE) - mm->start_code = addr; - else - mm->end_code = addr; + mm->end_code = addr; break; - case PR_SET_MM_START_DATA: - case PR_SET_MM_END_DATA: - vm_req_flags = VM_READ | VM_WRITE; - vm_bad_flags = VM_EXEC | VM_MAYSHARE; - - if ((vma->vm_flags & vm_req_flags) != vm_req_flags || - (vma->vm_flags & vm_bad_flags)) - goto out; - - if (opt == PR_SET_MM_START_DATA) - mm->start_data = addr; - else - mm->end_data = addr; + mm->start_data = addr; break; - - case PR_SET_MM_START_STACK: - -#ifdef CONFIG_STACK_GROWSUP - vm_req_flags = VM_READ | VM_WRITE | VM_GROWSUP; -#else - vm_req_flags = VM_READ | VM_WRITE | VM_GROWSDOWN; -#endif - if ((vma->vm_flags & vm_req_flags) != vm_req_flags) - goto out; - - mm->start_stack = addr; + case PR_SET_MM_END_DATA: + mm->end_data = addr; break; case PR_SET_MM_START_BRK: @@ -1881,16 +1852,77 @@ static int prctl_set_mm(int opt, unsigned long addr, mm->brk = addr; break; + /* + * If command line arguments and environment + * are placed somewhere else on stack, we can + * set them up here, ARG_START/END to setup + * command line argumets and ENV_START/END + * for environment. 
+ */ + case PR_SET_MM_START_STACK: + case PR_SET_MM_ARG_START: + case PR_SET_MM_ARG_END: + case PR_SET_MM_ENV_START: + case PR_SET_MM_ENV_END: + if (!vma) { + error = -EFAULT; + goto out; + } +#ifdef CONFIG_STACK_GROWSUP + if (vma_flags_mismatch(vma, VM_READ | VM_WRITE | VM_GROWSUP, 0)) +#else + if (vma_flags_mismatch(vma, VM_READ | VM_WRITE | VM_GROWSDOWN, 0)) +#endif + goto out; + if (opt == PR_SET_MM_START_STACK) + mm->start_stack = addr; + else if (opt == PR_SET_MM_ARG_START) + mm->arg_start = addr; + else if (opt == PR_SET_MM_ARG_END) + mm->arg_end = addr; + else if (opt == PR_SET_MM_ENV_START) + mm->env_start = addr; + else if (opt == PR_SET_MM_ENV_END) + mm->env_end = addr; + break; + + /* + * This doesn't move auxiliary vector itself + * since it's pinned to mm_struct, but allow + * to fill vector with new values. It's up + * to a caller to provide sane values here + * otherwise user space tools which use this + * vector might be unhappy. + */ + case PR_SET_MM_AUXV: { + unsigned long user_auxv[AT_VECTOR_SIZE]; + + if (arg4 > sizeof(user_auxv)) + goto out; + up_read(&mm->mmap_sem); + + if (copy_from_user(user_auxv, (const void __user *)addr, arg4)) + return -EFAULT; + + /* Make sure the last entry is always AT_NULL */ + user_auxv[AT_VECTOR_SIZE - 2] = 0; + user_auxv[AT_VECTOR_SIZE - 1] = 0; + + BUILD_BUG_ON(sizeof(user_auxv) != sizeof(mm->saved_auxv)); + + task_lock(current); + memcpy(mm->saved_auxv, user_auxv, arg4); + task_unlock(current); + + return 0; + } default: - error = -EINVAL; goto out; } error = 0; - out: up_read(&mm->mmap_sem); - return error; } #else /* CONFIG_CHECKPOINT_RESTORE */ -- cgit v1.2.3 From b32dfe377102ce668775f8b6b1461f7ad428f8b6 Mon Sep 17 00:00:00 2001 From: Cyrill Gorcunov Date: Thu, 31 May 2012 16:26:46 -0700 Subject: c/r: prctl: add ability to set new mm_struct::exe_file When we do restore we would like to have a way to setup a former mm_struct::exe_file so that /proc/pid/exe would point to the original executable file a process had at checkpoint time. For this the PR_SET_MM_EXE_FILE code is introduced. This option takes a file descriptor which will be set as a source for new /proc/$pid/exe symlink. Note it allows to change /proc/$pid/exe if there are no VM_EXECUTABLE vmas present for current process, simply because this feature is a special to C/R and mm::num_exe_file_vmas become meaningless after that. To minimize the amount of transition the /proc/pid/exe symlink might have, this feature is implemented in one-shot manner. Thus once changed the symlink can't be changed again. This should help sysadmins to monitor the symlinks over all process running in a system. In particular one could make a snapshot of processes and ring alarm if there unexpected changes of /proc/pid/exe's in a system. Note -- this feature is available iif CONFIG_CHECKPOINT_RESTORE is set and the caller must have CAP_SYS_RESOURCE capability granted, otherwise the request to change symlink will be rejected. 
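A minimal user-space sketch of the intended restore-time call (illustrative, not part of the patch); the descriptor must refer to a regular executable file, the caller needs CAP_SYS_RESOURCE, and the call can succeed at most once per mm:

#include <fcntl.h>
#include <unistd.h>
#include <sys/prctl.h>

/* Illustrative only: make /proc/self/exe point at the executable that was
 * recorded at checkpoint time. */
static int restore_exe_link(const char *path)
{
	int fd = open(path, O_RDONLY);
	int ret;

	if (fd < 0)
		return -1;
	ret = prctl(PR_SET_MM, PR_SET_MM_EXE_FILE, fd, 0, 0);
	close(fd);	/* the kernel holds its own reference on success */
	return ret;
}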
Signed-off-by: Cyrill Gorcunov Reviewed-by: Oleg Nesterov Cc: KOSAKI Motohiro Cc: Pavel Emelyanov Cc: Kees Cook Cc: Tejun Heo Cc: Matt Helsley Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/sys.c | 56 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 56 insertions(+) (limited to 'kernel') diff --git a/kernel/sys.c b/kernel/sys.c index 8b544972e46e..9ff89cb9657a 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -36,6 +36,8 @@ #include #include #include +#include +#include #include #include #include @@ -1792,6 +1794,57 @@ static bool vma_flags_mismatch(struct vm_area_struct *vma, (vma->vm_flags & banned); } +static int prctl_set_mm_exe_file(struct mm_struct *mm, unsigned int fd) +{ + struct file *exe_file; + struct dentry *dentry; + int err; + + /* + * Setting new mm::exe_file is only allowed when no VM_EXECUTABLE vma's + * remain. So perform a quick test first. + */ + if (mm->num_exe_file_vmas) + return -EBUSY; + + exe_file = fget(fd); + if (!exe_file) + return -EBADF; + + dentry = exe_file->f_path.dentry; + + /* + * Because the original mm->exe_file points to executable file, make + * sure that this one is executable as well, to avoid breaking an + * overall picture. + */ + err = -EACCES; + if (!S_ISREG(dentry->d_inode->i_mode) || + exe_file->f_path.mnt->mnt_flags & MNT_NOEXEC) + goto exit; + + err = inode_permission(dentry->d_inode, MAY_EXEC); + if (err) + goto exit; + + /* + * The symlink can be changed only once, just to disallow arbitrary + * transitions malicious software might bring in. This means one + * could make a snapshot over all processes running and monitor + * /proc/pid/exe changes to notice unusual activity if needed. + */ + down_write(&mm->mmap_sem); + if (likely(!mm->exe_file)) + set_mm_exe_file(mm, exe_file); + else + err = -EBUSY; + up_write(&mm->mmap_sem); + +exit: + fput(exe_file); + return err; +} + static int prctl_set_mm(int opt, unsigned long addr, unsigned long arg4, unsigned long arg5) { @@ -1806,6 +1859,9 @@ static int prctl_set_mm(int opt, unsigned long addr, if (!capable(CAP_SYS_RESOURCE)) return -EPERM; + if (opt == PR_SET_MM_EXE_FILE) + return prctl_set_mm_exe_file(mm, (unsigned int)addr); + if (addr >= TASK_SIZE) return -EINVAL; -- cgit v1.2.3 From 754421c8cab1a568be844a7069fe04c1cf6391b8 Mon Sep 17 00:00:00 2001 From: Al Viro Date: Thu, 26 Apr 2012 18:31:00 -0400 Subject: HAVE_RESTORE_SIGMASK is defined on all architectures now Everyone either defines it in arch thread_info.h or has TIF_RESTORE_SIGMASK and picks default set_restore_sigmask() in linux/thread_info.h. Kill the ifdefs, slap #error in linux/thread_info.h to catch breakage when new ones get merged. 
Signed-off-by: Al Viro --- kernel/signal.c | 2 -- 1 file changed, 2 deletions(-) (limited to 'kernel') diff --git a/kernel/signal.c b/kernel/signal.c index 08dfbd748cd2..95a9d9d8122b 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -3235,7 +3235,6 @@ SYSCALL_DEFINE0(pause) #endif -#ifdef HAVE_SET_RESTORE_SIGMASK int sigsuspend(sigset_t *set) { sigdelsetmask(set, sigmask(SIGKILL)|sigmask(SIGSTOP)); @@ -3248,7 +3247,6 @@ int sigsuspend(sigset_t *set) set_restore_sigmask(); return -ERESTARTNOHAND; } -#endif #ifdef __ARCH_WANT_SYS_RT_SIGSUSPEND /** -- cgit v1.2.3 From a610d6e672d6d3723e8da257ad4a8a288a8f2f89 Mon Sep 17 00:00:00 2001 From: Al Viro Date: Mon, 21 May 2012 23:42:15 -0400 Subject: pull clearing RESTORE_SIGMASK into block_sigmask() Signed-off-by: Al Viro --- kernel/signal.c | 6 ++++++ 1 file changed, 6 insertions(+) (limited to 'kernel') diff --git a/kernel/signal.c b/kernel/signal.c index 95a9d9d8122b..b9be7e0fe41a 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -2382,6 +2382,12 @@ void block_sigmask(struct k_sigaction *ka, int signr) { sigset_t blocked; + /* A signal was successfully delivered, and the + saved sigmask was stored on the signal frame, + and will be restored by sigreturn. So we can + simply clear the restore sigmask flag. */ + clear_restore_sigmask(); + sigorsets(&blocked, ¤t->blocked, &ka->sa.sa_mask); if (!(ka->sa.sa_flags & SA_NODEFER)) sigaddset(&blocked, signr); -- cgit v1.2.3 From 77097ae503b170120ab66dd1d547f8577193f91f Mon Sep 17 00:00:00 2001 From: Al Viro Date: Fri, 27 Apr 2012 13:58:59 -0400 Subject: most of set_current_blocked() callers want SIGKILL/SIGSTOP removed from set Only 3 out of 63 do not. Renamed the current variant to __set_current_blocked(), added set_current_blocked() that will exclude unblockable signals, switched open-coded instances to it. Signed-off-by: Al Viro --- kernel/signal.c | 18 ++++++++++++------ 1 file changed, 12 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/signal.c b/kernel/signal.c index b9be7e0fe41a..df8d721a9e6f 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -2524,7 +2524,16 @@ static void __set_task_blocked(struct task_struct *tsk, const sigset_t *newset) * It is wrong to change ->blocked directly, this helper should be used * to ensure the process can't miss a shared signal we are going to block. 
*/ -void set_current_blocked(const sigset_t *newset) +void set_current_blocked(sigset_t *newset) +{ + struct task_struct *tsk = current; + sigdelsetmask(newset, sigmask(SIGKILL) | sigmask(SIGSTOP)); + spin_lock_irq(&tsk->sighand->siglock); + __set_task_blocked(tsk, newset); + spin_unlock_irq(&tsk->sighand->siglock); +} + +void __set_current_blocked(const sigset_t *newset) { struct task_struct *tsk = current; @@ -2564,7 +2573,7 @@ int sigprocmask(int how, sigset_t *set, sigset_t *oldset) return -EINVAL; } - set_current_blocked(&newset); + __set_current_blocked(&newset); return 0; } @@ -3138,7 +3147,7 @@ SYSCALL_DEFINE3(sigprocmask, int, how, old_sigset_t __user *, nset, return -EINVAL; } - set_current_blocked(&new_blocked); + __set_current_blocked(&new_blocked); } if (oset) { @@ -3202,7 +3211,6 @@ SYSCALL_DEFINE1(ssetmask, int, newmask) int old = current->blocked.sig[0]; sigset_t newset; - siginitset(&newset, newmask & ~(sigmask(SIGKILL) | sigmask(SIGSTOP))); set_current_blocked(&newset); return old; @@ -3243,8 +3251,6 @@ SYSCALL_DEFINE0(pause) int sigsuspend(sigset_t *set) { - sigdelsetmask(set, sigmask(SIGKILL)|sigmask(SIGSTOP)); - current->saved_sigmask = current->blocked; set_current_blocked(set); -- cgit v1.2.3 From efee984c27b67e3ebef40410f35671997441b57c Mon Sep 17 00:00:00 2001 From: Al Viro Date: Sat, 28 Apr 2012 02:04:15 -0400 Subject: new helper: signal_delivered() Does block_sigmask() + tracehook_signal_handler(); called when sigframe has been successfully built. All architectures converted to it; block_sigmask() itself is gone now (merged into this one). I'm still not too happy with the signature, but that's a separate story (IMO we need a structure that would contain signal number + siginfo + k_sigaction, so that get_signal_to_deliver() would fill one, signal_delivered(), handle_signal() and probably setup...frame() - take one). Signed-off-by: Al Viro --- kernel/signal.c | 22 +++++++++++++--------- 1 file changed, 13 insertions(+), 9 deletions(-) (limited to 'kernel') diff --git a/kernel/signal.c b/kernel/signal.c index df8d721a9e6f..677102789cf2 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -2368,17 +2368,20 @@ relock: } /** - * block_sigmask - add @ka's signal mask to current->blocked - * @ka: action for @signr - * @signr: signal that has been successfully delivered + * signal_delivered - + * @sig: number of signal being delivered + * @info: siginfo_t of signal being delivered + * @ka: sigaction setting that chose the handler + * @regs: user register state + * @stepping: nonzero if debugger single-step or block-step in use * * This function should be called when a signal has succesfully been - * delivered. It adds the mask of signals for @ka to current->blocked - * so that they are blocked during the execution of the signal - * handler. In addition, @signr will be blocked unless %SA_NODEFER is - * set in @ka->sa.sa_flags. + * delivered. It updates the blocked signals accordingly (@ka->sa.sa_mask + * is always blocked, and the signal itself is blocked unless %SA_NODEFER + * is set in @ka->sa.sa_flags. Tracing is notified. 
*/ -void block_sigmask(struct k_sigaction *ka, int signr) +void signal_delivered(int sig, siginfo_t *info, struct k_sigaction *ka, + struct pt_regs *regs, int stepping) { sigset_t blocked; @@ -2390,8 +2393,9 @@ void block_sigmask(struct k_sigaction *ka, int signr) sigorsets(&blocked, ¤t->blocked, &ka->sa.sa_mask); if (!(ka->sa.sa_flags & SA_NODEFER)) - sigaddset(&blocked, signr); + sigaddset(&blocked, sig); set_current_blocked(&blocked); + tracehook_signal_handler(sig, info, ka, regs, stepping); } /* -- cgit v1.2.3 From fad0c66c4bb836d57a5f125ecd38bed653ca863a Mon Sep 17 00:00:00 2001 From: John Stultz Date: Wed, 30 May 2012 10:54:57 -0700 Subject: timekeeping: Fix CLOCK_MONOTONIC inconsistency during leapsecond Commit 6b43ae8a61 (ntp: Fix leap-second hrtimer livelock) broke the leapsecond update of CLOCK_MONOTONIC. The missing leapsecond update to wall_to_monotonic causes discontinuities in CLOCK_MONOTONIC. Adjust wall_to_monotonic when NTP inserted a leapsecond. Reported-by: Richard Cochran Signed-off-by: John Stultz Tested-by: Richard Cochran Cc: stable@kernel.org Link: http://lkml.kernel.org/r/1338400497-12420-1-git-send-email-john.stultz@linaro.org Signed-off-by: Thomas Gleixner --- kernel/time/timekeeping.c | 2 ++ 1 file changed, 2 insertions(+) (limited to 'kernel') diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c index 6e46cacf5969..6f46a00a1e8a 100644 --- a/kernel/time/timekeeping.c +++ b/kernel/time/timekeeping.c @@ -962,6 +962,7 @@ static cycle_t logarithmic_accumulation(cycle_t offset, int shift) timekeeper.xtime.tv_sec++; leap = second_overflow(timekeeper.xtime.tv_sec); timekeeper.xtime.tv_sec += leap; + timekeeper.wall_to_monotonic.tv_sec -= leap; } /* Accumulate raw time */ @@ -1077,6 +1078,7 @@ static void update_wall_time(void) timekeeper.xtime.tv_sec++; leap = second_overflow(timekeeper.xtime.tv_sec); timekeeper.xtime.tv_sec += leap; + timekeeper.wall_to_monotonic.tv_sec -= leap; } timekeeping_update(false); -- cgit v1.2.3 From 10717dcde10d09f9fcee53a12a4236af1a82b484 Mon Sep 17 00:00:00 2001 From: Alex Shi Date: Wed, 6 Jun 2012 14:52:51 +0800 Subject: sched/numa: Load balance between remote nodes Commit cb83b629b ("sched/numa: Rewrite the CONFIG_NUMA sched domain support") removed the NODE sched domain and started checking if the node distance in SLIT table is farther than REMOTE_DISTANCE, if so, it will lose the load balance chance at exec/fork/wake_affine points. But actually, even the node distance is farther than REMOTE_DISTANCE. Modern CPUs also has QPI like connections, which ensures that memory access is not too slow between nodes. So the above change in behavior on NUMA machine causes a performance regression on various benchmarks: hackbench, tbench, netperf, oltp, etc. This patch will recover the scheduler behavior to old mode on all my Intel platforms: NHM EP/EX, WSM EP, SNB EP/EP4S, and thus fixes the perfromance regressions. 
(all of them just have 2 kinds distance, 10, 21) Signed-off-by: Alex Shi Signed-off-by: Peter Zijlstra Link: http://lkml.kernel.org/r/1338965571-9812-1-git-send-email-alex.shi@intel.com Signed-off-by: Ingo Molnar --- kernel/sched/core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/sched/core.c b/kernel/sched/core.c index c46958e26121..6546083af3e0 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6321,7 +6321,7 @@ static int sched_domains_curr_level; static inline int sd_local_flags(int level) { - if (sched_domains_numa_distance[level] > REMOTE_DISTANCE) + if (sched_domains_numa_distance[level] > RECLAIM_DISTANCE) return 0; return SD_BALANCE_EXEC | SD_BALANCE_FORK | SD_WAKE_AFFINE; -- cgit v1.2.3 From 7f1b43936f0ecad14770634c021cf4a929aec74d Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Thu, 17 May 2012 21:19:46 +0200 Subject: sched/rt: Fix lockdep annotation within find_lock_lowest_rq() Roland Dreier reported spurious, hard to trigger lockdep warnings within the scheduler - without any real lockup. This bit gives us the right clue: > [89945.640512] [] double_lock_balance+0x5a/0x90 > [89945.640568] [] push_rt_task+0xc6/0x290 if you look at that code you'll find the double_lock_balance() in question is the one in find_lock_lowest_rq() [yay for inlining]. Now find_lock_lowest_rq() has a bug.. it fails to use double_unlock_balance() in one exit path, if this results in a retry in push_rt_task() we'll call double_lock_balance() again, at which point we'll run into said lockdep confusion. Reported-by: Roland Dreier Acked-by: Steven Rostedt Signed-off-by: Peter Zijlstra Link: http://lkml.kernel.org/r/1337282386.4281.77.camel@twins Signed-off-by: Ingo Molnar --- kernel/sched/rt.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c index 2a4e8dffbd6b..573e1ca01102 100644 --- a/kernel/sched/rt.c +++ b/kernel/sched/rt.c @@ -1562,7 +1562,7 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq) task_running(rq, task) || !task->on_rq)) { - raw_spin_unlock(&lowest_rq->lock); + double_unlock_balance(rq, lowest_rq); lowest_rq = NULL; break; } -- cgit v1.2.3 From c1174876874dcf8986806e4dad3d7d07af20b439 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Thu, 31 May 2012 14:47:33 +0200 Subject: sched: Fix domain iteration Weird topologies can lead to asymmetric domain setups. This needs further consideration since these setups are typically non-minimal too. For now, make it work by adding an extra mask selecting which CPUs are allowed to iterate up. The topology that triggered it is the one from David Rientjes: 10 20 20 30 20 10 20 20 20 20 10 20 30 20 20 10 resulting in boxes that wouldn't even boot. Reported-by: David Rientjes Signed-off-by: Peter Zijlstra Link: http://lkml.kernel.org/n/tip-3p86l9cuaqnxz7uxsojmz5rm@git.kernel.org Signed-off-by: Ingo Molnar --- kernel/sched/core.c | 64 +++++++++++++++++++++++++++++++++++++++++++++------- kernel/sched/fair.c | 5 ++-- kernel/sched/sched.h | 2 ++ 3 files changed, 61 insertions(+), 10 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 6546083af3e0..781acb91a50a 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -5994,6 +5994,44 @@ struct sched_domain_topology_level { struct sd_data data; }; +/* + * Build an iteration mask that can exclude certain CPUs from the upwards + * domain traversal. 
+ * + * Asymmetric node setups can result in situations where the domain tree is of + * unequal depth, make sure to skip domains that already cover the entire + * range. + * + * In that case build_sched_domains() will have terminated the iteration early + * and our sibling sd spans will be empty. Domains should always include the + * cpu they're built on, so check that. + * + */ +static void build_group_mask(struct sched_domain *sd, struct sched_group *sg) +{ + const struct cpumask *span = sched_domain_span(sd); + struct sd_data *sdd = sd->private; + struct sched_domain *sibling; + int i; + + for_each_cpu(i, span) { + sibling = *per_cpu_ptr(sdd->sd, i); + if (!cpumask_test_cpu(i, sched_domain_span(sibling))) + continue; + + cpumask_set_cpu(i, sched_group_mask(sg)); + } +} + +/* + * Return the canonical balance cpu for this group, this is the first cpu + * of this group that's also in the iteration mask. + */ +int group_balance_cpu(struct sched_group *sg) +{ + return cpumask_first_and(sched_group_cpus(sg), sched_group_mask(sg)); +} + static int build_overlap_sched_groups(struct sched_domain *sd, int cpu) { @@ -6012,6 +6050,12 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu) if (cpumask_test_cpu(i, covered)) continue; + child = *per_cpu_ptr(sdd->sd, i); + + /* See the comment near build_group_mask(). */ + if (!cpumask_test_cpu(i, sched_domain_span(child))) + continue; + sg = kzalloc_node(sizeof(struct sched_group) + cpumask_size(), GFP_KERNEL, cpu_to_node(cpu)); @@ -6019,8 +6063,6 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu) goto fail; sg_span = sched_group_cpus(sg); - - child = *per_cpu_ptr(sdd->sd, i); if (child->child) { child = child->child; cpumask_copy(sg_span, sched_domain_span(child)); @@ -6030,13 +6072,18 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu) cpumask_or(covered, covered, sg_span); sg->sgp = *per_cpu_ptr(sdd->sgp, i); - atomic_inc(&sg->sgp->ref); + if (atomic_inc_return(&sg->sgp->ref) == 1) + build_group_mask(sd, sg); + + /* + * Make sure the first group of this domain contains the + * canonical balance cpu. Otherwise the sched_domain iteration + * breaks. See update_sg_lb_stats(). 
+ */ if ((!groups && cpumask_test_cpu(cpu, sg_span)) || - cpumask_first(sg_span) == cpu) { - WARN_ON_ONCE(!cpumask_test_cpu(cpu, sg_span)); + group_balance_cpu(sg) == cpu) groups = sg; - } if (!first) first = sg; @@ -6109,6 +6156,7 @@ build_sched_groups(struct sched_domain *sd, int cpu) cpumask_clear(sched_group_cpus(sg)); sg->sgp->power = 0; + cpumask_setall(sched_group_mask(sg)); for_each_cpu(j, span) { if (get_group(j, sdd, NULL) != group) @@ -6150,7 +6198,7 @@ static void init_sched_groups_power(int cpu, struct sched_domain *sd) sg = sg->next; } while (sg != sd->groups); - if (cpu != group_first_cpu(sg)) + if (cpu != group_balance_cpu(sg)) return; update_group_power(sd, cpu); @@ -6525,7 +6573,7 @@ static int __sdt_alloc(const struct cpumask *cpu_map) *per_cpu_ptr(sdd->sg, j) = sg; - sgp = kzalloc_node(sizeof(struct sched_group_power), + sgp = kzalloc_node(sizeof(struct sched_group_power) + cpumask_size(), GFP_KERNEL, cpu_to_node(j)); if (!sgp) return -ENOMEM; diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index b2a2d236f27b..54cbaa4e7b37 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3652,7 +3652,7 @@ static inline void update_sg_lb_stats(struct lb_env *env, int i; if (local_group) - balance_cpu = group_first_cpu(group); + balance_cpu = group_balance_cpu(group); /* Tally up the load of all CPUs in the group */ max_cpu_load = 0; @@ -3667,7 +3667,8 @@ static inline void update_sg_lb_stats(struct lb_env *env, /* Bias balancing toward cpus of our domain */ if (local_group) { - if (idle_cpu(i) && !first_idle_cpu) { + if (idle_cpu(i) && !first_idle_cpu && + cpumask_test_cpu(i, sched_group_mask(group))) { first_idle_cpu = 1; balance_cpu = i; } diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index ba9dccfd24ce..6d52cea7f33d 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -526,6 +526,8 @@ static inline struct sched_domain *highest_flag_domain(int cpu, int flag) DECLARE_PER_CPU(struct sched_domain *, sd_llc); DECLARE_PER_CPU(int, sd_llc_id); +extern int group_balance_cpu(struct sched_group *sg); + #endif /* CONFIG_SMP */ #include "stats.h" -- cgit v1.2.3 From c3decf0dfbc95736b7c0ab68fa4e5854c4734da9 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Thu, 31 May 2012 12:05:32 +0200 Subject: sched: Always initialize cpu-power Often when we run into mis-shapen topologies the balance iteration fails to update the cpu power properly and we'll end up in /0 traps. Always initialize the cpu-power to a semi-sane value so that we can at least boot the machine, even if the load-balancer might not function correctly. Signed-off-by: Peter Zijlstra Link: http://lkml.kernel.org/n/tip-3lbhyj25sr169ha7z3qht5na@git.kernel.org Signed-off-by: Ingo Molnar --- kernel/sched/core.c | 13 ++++++++++++- kernel/sched/fair.c | 2 +- 2 files changed, 13 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 781acb91a50a..725ee7c1c8cf 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -5604,7 +5604,12 @@ static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level, break; } - if (!group->sgp->power) { + /* + * Even though we initialize ->power to something semi-sane, + * we leave power_orig unset. This allows us to detect if + * domain iteration is still funny without causing /0 traps. 
+ */ + if (!group->sgp->power_orig) { printk(KERN_CONT "\n"); printk(KERN_ERR "ERROR: domain->cpu_power not " "set\n"); @@ -6075,6 +6080,12 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu) if (atomic_inc_return(&sg->sgp->ref) == 1) build_group_mask(sd, sg); + /* + * Initialize sgp->power such that even if we mess up the + * domains and no possible iteration will get us here, we won't + * die on a /0 trap. + */ + sg->sgp->power = SCHED_POWER_SCALE * cpumask_weight(sg_span); /* * Make sure the first group of this domain contains the diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 54cbaa4e7b37..c9fd6d673d05 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3602,7 +3602,7 @@ void update_group_power(struct sched_domain *sd, int cpu) } while (group != child->groups); } - sdg->sgp->power = power; + sdg->sgp->power_orig = sdg->sgp->power = power; } /* -- cgit v1.2.3 From d039ac60800fe8ed8522ec3b9ca796aaf748c18b Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Thu, 31 May 2012 21:20:16 +0200 Subject: sched: Validate assumptions in sched_init_numa() Add some code to validate assumptions we're making and output warnings if they are not. If this trigger we want to know about it. Signed-off-by: Peter Zijlstra Cc: Alex Shi Link: http://lkml.kernel.org/n/tip-6uc3wk5s9udxtdl9cnku0vtt@git.kernel.org Signed-off-by: Ingo Molnar --- kernel/sched/core.c | 99 +++++++++++++++++++++++++++++++++++++++++++---------- 1 file changed, 80 insertions(+), 19 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 725ee7c1c8cf..2bdd17616437 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -5556,15 +5556,20 @@ static cpumask_var_t sched_domains_tmpmask; /* sched_domains_mutex */ #ifdef CONFIG_SCHED_DEBUG -static __read_mostly int sched_domain_debug_enabled; +static __read_mostly int sched_debug_enabled; -static int __init sched_domain_debug_setup(char *str) +static int __init sched_debug_setup(char *str) { - sched_domain_debug_enabled = 1; + sched_debug_enabled = 1; return 0; } -early_param("sched_debug", sched_domain_debug_setup); +early_param("sched_debug", sched_debug_setup); + +static inline bool sched_debug(void) +{ + return sched_debug_enabled; +} static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level, struct cpumask *groupmask) @@ -5657,7 +5662,7 @@ static void sched_domain_debug(struct sched_domain *sd, int cpu) { int level = 0; - if (!sched_domain_debug_enabled) + if (!sched_debug_enabled) return; if (!sd) { @@ -5678,6 +5683,10 @@ static void sched_domain_debug(struct sched_domain *sd, int cpu) } #else /* !CONFIG_SCHED_DEBUG */ # define sched_domain_debug(sd, cpu) do { } while (0) +static inline bool sched_debug(void) +{ + return false; +} #endif /* CONFIG_SCHED_DEBUG */ static int sd_degenerate(struct sched_domain *sd) @@ -6373,7 +6382,6 @@ static struct sched_domain_topology_level *sched_domain_topology = default_topol #ifdef CONFIG_NUMA static int sched_domains_numa_levels; -static int sched_domains_numa_scale; static int *sched_domains_numa_distance; static struct cpumask ***sched_domains_numa_masks; static int sched_domains_curr_level; @@ -6438,6 +6446,42 @@ static const struct cpumask *sd_numa_mask(int cpu) return sched_domains_numa_masks[sched_domains_curr_level][cpu_to_node(cpu)]; } +static void sched_numa_warn(const char *str) +{ + static int done = false; + int i,j; + + if (done) + return; + + done = true; + + printk(KERN_WARNING "ERROR: %s\n\n", str); + + for (i = 0; i < nr_node_ids; 
i++) { + printk(KERN_WARNING " "); + for (j = 0; j < nr_node_ids; j++) + printk(KERN_CONT "%02d ", node_distance(i,j)); + printk(KERN_CONT "\n"); + } + printk(KERN_WARNING "\n"); +} + +static bool find_numa_distance(int distance) +{ + int i; + + if (distance == node_distance(0, 0)) + return true; + + for (i = 0; i < sched_domains_numa_levels; i++) { + if (sched_domains_numa_distance[i] == distance) + return true; + } + + return false; +} + static void sched_init_numa(void) { int next_distance, curr_distance = node_distance(0, 0); @@ -6445,7 +6489,6 @@ static void sched_init_numa(void) int level = 0; int i, j, k; - sched_domains_numa_scale = curr_distance; sched_domains_numa_distance = kzalloc(sizeof(int) * nr_node_ids, GFP_KERNEL); if (!sched_domains_numa_distance) return; @@ -6456,23 +6499,41 @@ static void sched_init_numa(void) * * Assumes node_distance(0,j) includes all distances in * node_distance(i,j) in order to avoid cubic time. - * - * XXX: could be optimized to O(n log n) by using sort() */ next_distance = curr_distance; for (i = 0; i < nr_node_ids; i++) { for (j = 0; j < nr_node_ids; j++) { - int distance = node_distance(0, j); - if (distance > curr_distance && - (distance < next_distance || - next_distance == curr_distance)) - next_distance = distance; + for (k = 0; k < nr_node_ids; k++) { + int distance = node_distance(i, k); + + if (distance > curr_distance && + (distance < next_distance || + next_distance == curr_distance)) + next_distance = distance; + + /* + * While not a strong assumption it would be nice to know + * about cases where if node A is connected to B, B is not + * equally connected to A. + */ + if (sched_debug() && node_distance(k, i) != distance) + sched_numa_warn("Node-distance not symmetric"); + + if (sched_debug() && i && !find_numa_distance(distance)) + sched_numa_warn("Node-0 not representative"); + } + if (next_distance != curr_distance) { + sched_domains_numa_distance[level++] = next_distance; + sched_domains_numa_levels = level; + curr_distance = next_distance; + } else break; } - if (next_distance != curr_distance) { - sched_domains_numa_distance[level++] = next_distance; - sched_domains_numa_levels = level; - curr_distance = next_distance; - } else break; + + /* + * In case of sched_debug() we verify the above assumption. + */ + if (!sched_debug()) + break; } /* * 'level' contains the number of unique distances, excluding the -- cgit v1.2.3 From a841f8cef4bb124f0f5563314d0beaf2e1249d72 Mon Sep 17 00:00:00 2001 From: Dimitri Sivanich Date: Tue, 5 Jun 2012 13:44:36 -0500 Subject: sched: Fix the relax_domain_level boot parameter It does not get processed because sched_domain_level_max is 0 at the time that setup_relax_domain_level() is run. Simply accept the value as it is, as we don't know the value of sched_domain_level_max until sched domain construction is completed. Fix sched_relax_domain_level in cpuset. The build_sched_domain() routine calls the set_domain_attribute() routine prior to setting the sd->level, however, the set_domain_attribute() routine relies on the sd->level to decide whether idle load balancing will be off/on. 
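As a usage reminder (illustration, not from the patch), the value is passed on the kernel command line, e.g.:

	relax_domain_level=2

With this change the value is accepted as given, since sched_domain_level_max is still 0 when early_param() handlers run; it only takes effect once the sched domains are actually built.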
Signed-off-by: Dimitri Sivanich Signed-off-by: Peter Zijlstra Link: http://lkml.kernel.org/r/20120605184436.GA15668@sgi.com Signed-off-by: Ingo Molnar --- kernel/sched/core.c | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 2bdd17616437..d5594a4268d4 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6268,11 +6268,8 @@ int sched_domain_level_max; static int __init setup_relax_domain_level(char *str) { - unsigned long val; - - val = simple_strtoul(str, NULL, 0); - if (val < sched_domain_level_max) - default_relax_domain_level = val; + if (kstrtoint(str, 0, &default_relax_domain_level)) + pr_warn("Unable to set relax_domain_level\n"); return 1; } @@ -6698,7 +6695,6 @@ struct sched_domain *build_sched_domain(struct sched_domain_topology_level *tl, if (!sd) return child; - set_domain_attribute(sd, attr); cpumask_and(sched_domain_span(sd), cpu_map, tl->mask(cpu)); if (child) { sd->level = child->level + 1; @@ -6706,6 +6702,7 @@ struct sched_domain *build_sched_domain(struct sched_domain_topology_level *tl, child->parent = sd; } sd->child = child; + set_domain_attribute(sd, attr); return sd; } -- cgit v1.2.3 From bafb282df29c1524b1617019adebd6d0c3eb7a47 Mon Sep 17 00:00:00 2001 From: Konstantin Khlebnikov Date: Thu, 7 Jun 2012 14:21:11 -0700 Subject: c/r: prctl: update prctl_set_mm_exe_file() after mm->num_exe_file_vmas removal A fix for commit b32dfe377102 ("c/r: prctl: add ability to set new mm_struct::exe_file"). After removing mm->num_exe_file_vmas kernel keeps mm->exe_file until final mmput(), it never becomes NULL while task is alive. We can check for other mapped files in mm instead of checking mm->num_exe_file_vmas, and mark mm with flag MMF_EXE_FILE_CHANGED in order to forbid second changing of mm->exe_file. Signed-off-by: Konstantin Khlebnikov Reviewed-by: Cyrill Gorcunov Cc: Oleg Nesterov Cc: Matt Helsley Cc: Kees Cook Cc: KOSAKI Motohiro Cc: Tejun Heo Cc: Pavel Emelyanov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/sys.c | 31 +++++++++++++++++++------------ 1 file changed, 19 insertions(+), 12 deletions(-) (limited to 'kernel') diff --git a/kernel/sys.c b/kernel/sys.c index 9ff89cb9657a..54f20fdee93c 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -1796,17 +1796,11 @@ static bool vma_flags_mismatch(struct vm_area_struct *vma, static int prctl_set_mm_exe_file(struct mm_struct *mm, unsigned int fd) { + struct vm_area_struct *vma; struct file *exe_file; struct dentry *dentry; int err; - /* - * Setting new mm::exe_file is only allowed when no VM_EXECUTABLE vma's - * remain. So perform a quick test first. - */ - if (mm->num_exe_file_vmas) - return -EBUSY; - exe_file = fget(fd); if (!exe_file) return -EBADF; @@ -1827,17 +1821,30 @@ static int prctl_set_mm_exe_file(struct mm_struct *mm, unsigned int fd) if (err) goto exit; + down_write(&mm->mmap_sem); + + /* + * Forbid mm->exe_file change if there are mapped other files. + */ + err = -EBUSY; + for (vma = mm->mmap; vma; vma = vma->vm_next) { + if (vma->vm_file && !path_equal(&vma->vm_file->f_path, + &exe_file->f_path)) + goto exit_unlock; + } + /* * The symlink can be changed only once, just to disallow arbitrary * transitions malicious software might bring in. This means one * could make a snapshot over all processes running and monitor * /proc/pid/exe changes to notice unusual activity if needed. 
*/ - down_write(&mm->mmap_sem); - if (likely(!mm->exe_file)) - set_mm_exe_file(mm, exe_file); - else - err = -EBUSY; + err = -EPERM; + if (test_and_set_bit(MMF_EXE_FILE_CHANGED, &mm->flags)) + goto exit_unlock; + + set_mm_exe_file(mm, exe_file); +exit_unlock: up_write(&mm->mmap_sem); exit: -- cgit v1.2.3 From 1ad75b9e16280ca4e2501a629a225319cf2eef2e Mon Sep 17 00:00:00 2001 From: Cyrill Gorcunov Date: Thu, 7 Jun 2012 14:21:11 -0700 Subject: c/r: prctl: add minimal address test to PR_SET_MM Make sure the address being set is greater than mmap_min_addr (as suggested by Kees Cook). Signed-off-by: Cyrill Gorcunov Acked-by: Kees Cook Cc: Serge Hallyn Cc: Tejun Heo Cc: Pavel Emelyanov Cc: KAMEZAWA Hiroyuki Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/sys.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/sys.c b/kernel/sys.c index 54f20fdee93c..19a2c7139960 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -1869,7 +1869,7 @@ static int prctl_set_mm(int opt, unsigned long addr, if (opt == PR_SET_MM_EXE_FILE) return prctl_set_mm_exe_file(mm, (unsigned int)addr); - if (addr >= TASK_SIZE) + if (addr >= TASK_SIZE || addr < mmap_min_addr) return -EINVAL; error = -EINVAL; -- cgit v1.2.3 From 300f786b2683f8bb1ec0afb6e1851183a479c86d Mon Sep 17 00:00:00 2001 From: Cyrill Gorcunov Date: Thu, 7 Jun 2012 14:21:12 -0700 Subject: c/r: prctl: add ability to get clear_tid_address Zero is written at clear_tid_address when the process exits. This functionality is used by pthread_join(). We already have sys_set_tid_address() to change this address for the current task but there is no way to obtain it from user space. Without the ability to find this address and dump it we can't restore pthread'ed apps which call pthread_join() once they have been restored. This patch introduces the PR_GET_TID_ADDRESS prctl option which allows the current process to obtain own clear_tid_address. This feature is available iif CONFIG_CHECKPOINT_RESTORE is set. 
[akpm@linux-foundation.org: fix prctl numbering] Signed-off-by: Andrew Vagin Signed-off-by: Cyrill Gorcunov Cc: Pedro Alves Cc: Oleg Nesterov Cc: Pavel Emelyanov Cc: Tejun Heo Acked-by: Kees Cook Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/sys.c | 13 +++++++++++++ 1 file changed, 13 insertions(+) (limited to 'kernel') diff --git a/kernel/sys.c b/kernel/sys.c index 19a2c7139960..0ec1942ba7ea 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -1988,12 +1988,22 @@ out: up_read(&mm->mmap_sem); return error; } + +static int prctl_get_tid_address(struct task_struct *me, int __user **tid_addr) +{ + return put_user(me->clear_child_tid, tid_addr); +} + #else /* CONFIG_CHECKPOINT_RESTORE */ static int prctl_set_mm(int opt, unsigned long addr, unsigned long arg4, unsigned long arg5) { return -EINVAL; } +static int prctl_get_tid_address(struct task_struct *me, int __user **tid_addr) +{ + return -EINVAL; +} #endif SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, @@ -2131,6 +2141,9 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, else return -EINVAL; break; + case PR_GET_TID_ADDRESS: + error = prctl_get_tid_address(me, (int __user **)arg2); + break; default: return -EINVAL; } -- cgit v1.2.3 From 736f24d5e59d699c6e300c5da7e3bb882eddda67 Mon Sep 17 00:00:00 2001 From: Cyrill Gorcunov Date: Thu, 7 Jun 2012 14:21:12 -0700 Subject: c/r: prctl: drop VMA flags test on PR_SET_MM_ stack data assignment In commit b76437579d13 ("procfs: mark thread stack correctly in proc//maps") the stack allocated via clone() is marked in /proc//maps as [stack:%d] thus it might be out of the former mm->start_stack/end_stack values (and even has some custom VMA flags set). So to be able to restore mm->start_stack/end_stack drop vma flags test, but still require the underlying VMA to exist. As always note this feature is under CONFIG_CHECKPOINT_RESTORE and requires CAP_SYS_RESOURCE to be granted. Signed-off-by: Cyrill Gorcunov Cc: Oleg Nesterov Acked-by: Kees Cook Cc: Pavel Emelyanov Cc: Serge Hallyn Cc: KAMEZAWA Hiroyuki Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/sys.c | 14 -------------- 1 file changed, 14 deletions(-) (limited to 'kernel') diff --git a/kernel/sys.c b/kernel/sys.c index 0ec1942ba7ea..f0ec44dcd415 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -1786,14 +1786,6 @@ SYSCALL_DEFINE1(umask, int, mask) } #ifdef CONFIG_CHECKPOINT_RESTORE -static bool vma_flags_mismatch(struct vm_area_struct *vma, - unsigned long required, - unsigned long banned) -{ - return (vma->vm_flags & required) != required || - (vma->vm_flags & banned); -} - static int prctl_set_mm_exe_file(struct mm_struct *mm, unsigned int fd) { struct vm_area_struct *vma; @@ -1931,12 +1923,6 @@ static int prctl_set_mm(int opt, unsigned long addr, error = -EFAULT; goto out; } -#ifdef CONFIG_STACK_GROWSUP - if (vma_flags_mismatch(vma, VM_READ | VM_WRITE | VM_GROWSUP, 0)) -#else - if (vma_flags_mismatch(vma, VM_READ | VM_WRITE | VM_GROWSDOWN, 0)) -#endif - goto out; if (opt == PR_SET_MM_START_STACK) mm->start_stack = addr; else if (opt == PR_SET_MM_ARG_START) -- cgit v1.2.3 From 40af1bbdca47e5c8a2044039bb78ca8fd8b20f94 Mon Sep 17 00:00:00 2001 From: Konstantin Khlebnikov Date: Thu, 7 Jun 2012 14:21:14 -0700 Subject: mm: correctly synchronize rss-counters at exit/exec mm->rss_stat counters have per-task delta: task->rss_stat. Before changing task->mm pointer the kernel must flush this delta with sync_mm_rss(). 
do_exit() already calls sync_mm_rss() to flush the rss-counters before committing the rss statistics into task->signal->maxrss, taskstats, audit and other stuff. Unfortunately the kernel does this before calling mm_release(), which can call put_user() for processing task->clear_child_tid. So at this point we can trigger page-faults and task->rss_stat becomes non-zero again. As a result mm->rss_stat becomes inconsistent and check_mm() will print something like this: | BUG: Bad rss-counter state mm:ffff88020813c380 idx:1 val:-1 | BUG: Bad rss-counter state mm:ffff88020813c380 idx:2 val:1 This patch moves sync_mm_rss() into mm_release(), and moves mm_release() out of exit_mm() and calls it earlier. After mm_release() there should be no pagefaults. [akpm@linux-foundation.org: tweak comment] Signed-off-by: Konstantin Khlebnikov Reported-by: Markus Trippelsdorf Cc: Hugh Dickins Cc: KAMEZAWA Hiroyuki Cc: Oleg Nesterov Cc: [3.4.x] Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/exit.c | 13 ++++++++----- kernel/fork.c | 8 ++++++++ 2 files changed, 16 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/kernel/exit.c b/kernel/exit.c index 34867cc5b42a..804fb6bb8161 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -423,6 +423,7 @@ void daemonize(const char *name, ...) * user space pages. We don't need them, and if we didn't close them * they would be locked into memory. */ + mm_release(current, current->mm); exit_mm(current); /* * We don't want to get frozen, in case system-wide hibernation @@ -640,7 +641,6 @@ static void exit_mm(struct task_struct * tsk) struct mm_struct *mm = tsk->mm; struct core_state *core_state; - mm_release(tsk, mm); if (!mm) return; /* @@ -960,9 +960,13 @@ void do_exit(long code) preempt_count()); acct_update_integrals(tsk); - /* sync mm's RSS info before statistics gathering */ - if (tsk->mm) - sync_mm_rss(tsk->mm); + + /* Set exit_code before complete_vfork_done() in mm_release() */ + tsk->exit_code = code; + + /* Release mm and sync mm's RSS info before statistics gathering */ + mm_release(tsk, tsk->mm); + group_dead = atomic_dec_and_test(&tsk->signal->live); if (group_dead) { hrtimer_cancel(&tsk->signal->real_timer); @@ -975,7 +979,6 @@ void do_exit(long code) tty_audit_exit(); audit_free(tsk); - tsk->exit_code = code; taskstats_exit(tsk, group_dead); exit_mm(tsk); diff --git a/kernel/fork.c b/kernel/fork.c index ab5211b9e622..0560781c6904 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -619,6 +619,14 @@ void mmput(struct mm_struct *mm) module_put(mm->binfmt->module); mmdrop(mm); } + + /* + * Final rss-counter synchronization. After this point there must be + * no pagefaults into this mm from the current context. Otherwise + * mm->rss_stat will be inconsistent. + */ + if (mm) + sync_mm_rss(mm); } EXPORT_SYMBOL_GPL(mmput); -- cgit v1.2.3 From 48d212a2eecaca2e1875925837ad27b2f43f48a3 Mon Sep 17 00:00:00 2001 From: Linus Torvalds Date: Thu, 7 Jun 2012 17:54:07 -0700 Subject: Revert "mm: correctly synchronize rss-counters at exit/exec" This reverts commit 40af1bbdca47e5c8a2044039bb78ca8fd8b20f94. It's horribly and utterly broken for at least the following reasons: - calling sync_mm_rss() from mmput() is fundamentally wrong, because there's absolutely no reason to believe that the task that does the mmput() always does it on its own VM. Example: fork, ptrace, /proc - you name it. - calling it *after* having done mmdrop() on it is doubly insane, since the mm struct may well be gone now.
- testing mm against NULL before you call it is insane too, since a NULL mm there would have caused oopses long before. .. and those are just the three bugs I found before I decided to give up looking for more and revert it asap. I should have caught it before I even took it, but I trusted Andrew too much. Cc: Konstantin Khlebnikov Cc: Markus Trippelsdorf Cc: Hugh Dickins Cc: KAMEZAWA Hiroyuki Cc: Oleg Nesterov Cc: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/exit.c | 13 +++++-------- kernel/fork.c | 8 -------- 2 files changed, 5 insertions(+), 16 deletions(-) (limited to 'kernel') diff --git a/kernel/exit.c b/kernel/exit.c index 804fb6bb8161..34867cc5b42a 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -423,7 +423,6 @@ void daemonize(const char *name, ...) * user space pages. We don't need them, and if we didn't close them * they would be locked into memory. */ - mm_release(current, current->mm); exit_mm(current); /* * We don't want to get frozen, in case system-wide hibernation @@ -641,6 +640,7 @@ static void exit_mm(struct task_struct * tsk) struct mm_struct *mm = tsk->mm; struct core_state *core_state; + mm_release(tsk, mm); if (!mm) return; /* @@ -960,13 +960,9 @@ void do_exit(long code) preempt_count()); acct_update_integrals(tsk); - - /* Set exit_code before complete_vfork_done() in mm_release() */ - tsk->exit_code = code; - - /* Release mm and sync mm's RSS info before statistics gathering */ - mm_release(tsk, tsk->mm); - + /* sync mm's RSS info before statistics gathering */ + if (tsk->mm) + sync_mm_rss(tsk->mm); group_dead = atomic_dec_and_test(&tsk->signal->live); if (group_dead) { hrtimer_cancel(&tsk->signal->real_timer); @@ -979,6 +975,7 @@ void do_exit(long code) tty_audit_exit(); audit_free(tsk); + tsk->exit_code = code; taskstats_exit(tsk, group_dead); exit_mm(tsk); diff --git a/kernel/fork.c b/kernel/fork.c index 0560781c6904..ab5211b9e622 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -619,14 +619,6 @@ void mmput(struct mm_struct *mm) module_put(mm->binfmt->module); mmdrop(mm); } - - /* - * Final rss-counter synchronization. After this point there must be - * no pagefaults into this mm from the current context. Otherwise - * mm->rss_stat will be inconsistent. - */ - if (mm) - sync_mm_rss(mm); } EXPORT_SYMBOL_GPL(mmput); -- cgit v1.2.3 From cd96891d48a945ca2011fbeceda73813d6286195 Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Fri, 8 Jun 2012 13:18:33 -0700 Subject: sched/fair: fix lots of kernel-doc warnings Fix lots of new kernel-doc warnings in kernel/sched/fair.c: Warning(kernel/sched/fair.c:3625): No description found for parameter 'env' Warning(kernel/sched/fair.c:3625): Excess function parameter 'sd' description in 'update_sg_lb_stats' Warning(kernel/sched/fair.c:3735): No description found for parameter 'env' Warning(kernel/sched/fair.c:3735): Excess function parameter 'sd' description in 'update_sd_pick_busiest' Warning(kernel/sched/fair.c:3735): Excess function parameter 'this_cpu' description in 'update_sd_pick_busiest' ..
more warnings Signed-off-by: Randy Dunlap Cc: Ingo Molnar Cc: Peter Zijlstra Signed-off-by: Linus Torvalds --- kernel/sched/fair.c | 22 ++++++---------------- 1 file changed, 6 insertions(+), 16 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index b2a2d236f27b..d5583f9588e7 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3632,7 +3632,7 @@ fix_small_capacity(struct sched_domain *sd, struct sched_group *group) /** * update_sg_lb_stats - Update sched_group's statistics for load balancing. - * @sd: The sched_domain whose statistics are to be updated. + * @env: The load balancing environment. * @group: sched_group whose statistics are to be updated. * @load_idx: Load index of sched_domain of this_cpu for load calc. * @local_group: Does group contain this_cpu. @@ -3741,11 +3741,10 @@ static inline void update_sg_lb_stats(struct lb_env *env, /** * update_sd_pick_busiest - return 1 on busiest group - * @sd: sched_domain whose statistics are to be checked + * @env: The load balancing environment. * @sds: sched_domain statistics * @sg: sched_group candidate to be checked for being the busiest * @sgs: sched_group statistics - * @this_cpu: the current cpu * * Determine if @sg is a busier group than the previously selected * busiest group. @@ -3783,9 +3782,7 @@ static bool update_sd_pick_busiest(struct lb_env *env, /** * update_sd_lb_stats - Update sched_domain's statistics for load balancing. - * @sd: sched_domain whose statistics are to be updated. - * @this_cpu: Cpu for which load balance is currently performed. - * @idle: Idle status of this_cpu + * @env: The load balancing environment. * @cpus: Set of cpus considered for load balancing. * @balance: Should we balance. * @sds: variable to hold the statistics for this sched_domain. @@ -3874,10 +3871,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, * Returns 1 when packing is required and a task should be moved to * this CPU. The amount of the imbalance is returned in *imbalance. * - * @sd: The sched_domain whose packing is to be checked. + * @env: The load balancing environment. * @sds: Statistics of the sched_domain which is to be packed - * @this_cpu: The cpu at whose sched_domain we're performing load-balance. - * @imbalance: returns amount of imbalanced due to packing. */ static int check_asym_packing(struct lb_env *env, struct sd_lb_stats *sds) { @@ -3903,9 +3898,8 @@ static int check_asym_packing(struct lb_env *env, struct sd_lb_stats *sds) * fix_small_imbalance - Calculate the minor imbalance that exists * amongst the groups of a sched_domain, during * load balancing. + * @env: The load balancing environment. * @sds: Statistics of the sched_domain whose imbalance is to be calculated. - * @this_cpu: The cpu at whose sched_domain we're performing load-balance. - * @imbalance: Variable to store the imbalance. */ static inline void fix_small_imbalance(struct lb_env *env, struct sd_lb_stats *sds) @@ -4048,11 +4042,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s * Also calculates the amount of weighted load which should be moved * to restore balance. * - * @sd: The sched_domain whose busiest group is to be returned. - * @this_cpu: The cpu for which load balancing is currently being performed. - * @imbalance: Variable which stores amount of weighted load which should - * be moved to restore balance/put a group to idle. - * @idle: The idle status of this_cpu. + * @env: The load balancing environment. 
* @cpus: The set of CPUs under consideration for load-balancing. * @balance: Pointer to a variable indicating if this_cpu * is the appropriate cpu to perform load balancing at this_level. -- cgit v1.2.3
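For reference, a hypothetical helper documented in the style these hunks converge on, where a single @env line replaces the removed sd/this_cpu/idle/imbalance parameters (the function name and body are made up for illustration):

/**
 * example_lb_check - sketch of a kernel-doc block that matches its prototype
 * @env: The load balancing environment.
 * @sds: Statistics of the sched_domain being balanced.
 *
 * Each parameter in the prototype gets exactly one @name line, so
 * scripts/kernel-doc reports neither missing nor excess parameters.
 */
static bool example_lb_check(struct lb_env *env, struct sd_lb_stats *sds)
{
	return env != NULL && sds != NULL;
}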