path: root/kernel/workqueue.c
Commit message    Author    Age
* Merge remote-tracking branch 'origin/tmp-917a9a9133a6' into lsk    Runmin Wang    2016-07-12
|\
* tmp-917a9: ARM/vdso: Mark
the vDSO code read-only after init x86/vdso: Mark the vDSO code read-only after init lkdtm: Verify that '__ro_after_init' works correctly arch: Introduce post-init read-only memory x86/mm: Always enable CONFIG_DEBUG_RODATA and remove the Kconfig option mm/init: Add 'rodata=off' boot cmdline parameter to disable read-only kernel mappings asm-generic: Consolidate mark_rodata_ro() Linux 4.4.6 ld-version: Fix awk regex compile failure target: Drop incorrect ABORT_TASK put for completed commands block: don't optimize for non-cloned bio in bio_get_last_bvec() MIPS: smp.c: Fix uninitialised temp_foreign_map MIPS: Fix build error when SMP is used without GIC ovl: fix getcwd() failure after unsuccessful rmdir ovl: copy new uid/gid into overlayfs runtime inode userfaultfd: don't block on the last VM updates at exit time powerpc/powernv: Fix OPAL_CONSOLE_FLUSH prototype and usages powerpc/powernv: Add a kmsg_dumper that flushes console output on panic powerpc: Fix dedotify for binutils >= 2.26 Revert "drm/radeon/pm: adjust display configuration after powerstate" drm/radeon: Fix error handling in radeon_flip_work_func. drm/amdgpu: Fix error handling in amdgpu_flip_work_func. 
Revert "drm/radeon: call hpd_irq_event on resume" x86/mm: Fix slow_virt_to_phys() for X86_PAE again gpu: ipu-v3: Do not bail out on missing optional port nodes mac80211: Fix Public Action frame RX in AP mode mac80211: check PN correctly for GCMP-encrypted fragmented MPDUs mac80211: minstrel_ht: fix a logic error in RTS/CTS handling mac80211: minstrel_ht: set default tx aggregation timeout to 0 mac80211: fix use of uninitialised values in RX aggregation mac80211: minstrel: Change expected throughput unit back to Kbps iwlwifi: mvm: inc pending frames counter also when txing non-sta can: gs_usb: fixed disconnect bug by removing erroneous use of kfree() cfg80211/wext: fix message ordering wext: fix message delay/ordering ovl: fix working on distributed fs as lower layer ovl: ignore lower entries when checking purity of non-directory entries ASoC: wm8958: Fix enum ctl accesses in a wrong type ASoC: wm8994: Fix enum ctl accesses in a wrong type ASoC: samsung: Use IRQ safe spin lock calls ASoC: dapm: Fix ctl value accesses in a wrong type ncpfs: fix a braino in OOM handling in ncp_fill_cache() jffs2: reduce the breakage on recovery from halfway failed rename() dmaengine: at_xdmac: fix residue computation tracing: Fix check for cpu online when event is disabled s390/dasd: fix diag 0x250 inline assembly s390/mm: four page table levels vs. 
fork KVM: MMU: fix reserved bit check for ept=0/CR0.WP=0/CR4.SMEP=1/EFER.NX=0 KVM: MMU: fix ept=0/pte.u=1/pte.w=0/CR0.WP=0/CR4.SMEP=1/EFER.NX=0 combo KVM: PPC: Book3S HV: Sanitize special-purpose register values on guest exit KVM: s390: correct fprs on SIGP (STOP AND) STORE STATUS KVM: VMX: disable PEBS before a guest entry kvm: cap halt polling at exactly halt_poll_ns PCI: Allow a NULL "parent" pointer in pci_bus_assign_domain_nr() ARM: OMAP2+: hwmod: Introduce ti,no-idle dt property ARM: dts: dra7: do not gate cpsw clock due to errata i877 ARM: mvebu: fix overlap of Crypto SRAM with PCIe memory window arm64: account for sparsemem section alignment when choosing vmemmap offset Linux 4.4.5 drm/amdgpu: fix topaz/tonga gmc assignment in 4.4 stable modules: fix longstanding /proc/kallsyms vs module insertion race. drm/i915: refine qemu south bridge detection drm/i915: more virtual south bridge detection block: get the 1st and last bvec via helpers block: check virt boundary in bio_will_gap() drm/amdgpu: Use drm_calloc_large for VM page_tables array thermal: cpu_cooling: fix out of bounds access in time_in_idle i2c: brcmstb: allocate correct amount of memory for regmap ubi: Fix out of bounds write in volume update code cxl: Fix PSL timebase synchronization detection MIPS: traps: Fix SIGFPE information leak from `do_ov' and `do_trap_or_bp' MIPS: scache: Fix scache init with invalid line size. 
USB: serial: option: add support for Quectel UC20 USB: serial: option: add support for Telit LE922 PID 0x1045 USB: qcserial: add Sierra Wireless EM74xx device ID USB: qcserial: add Dell Wireless 5809e Gobi 4G HSPA+ (rev3) USB: cp210x: Add ID for Parrot NMEA GPS Flight Recorder usb: chipidea: otg: change workqueue ci_otg as freezable ALSA: timer: Fix broken compat timer user status ioctl ALSA: hdspm: Fix zero-division ALSA: hdsp: Fix wrong boolean ctl value accesses ALSA: hdspm: Fix wrong boolean ctl value accesses ALSA: seq: oss: Don't drain at closing a client ALSA: pcm: Fix ioctls for X32 ABI ALSA: timer: Fix ioctls for X32 ABI ALSA: rawmidi: Fix ioctls X32 ABI ALSA: hda - Fix mic issues on Acer Aspire E1-472 ALSA: ctl: Fix ioctls for X32 ABI ALSA: usb-audio: Add a quirk for Plantronics DA45 adv7604: fix tx 5v detect regression dmaengine: pxa_dma: fix cyclic transfers Fix directory hardlinks from deleted directories jffs2: Fix page lock / f->sem deadlock Revert "jffs2: Fix lock acquisition order bug in jffs2_write_begin" Btrfs: fix loading of orphan roots leading to BUG_ON pata-rb532-cf: get rid of the irq_to_gpio() call tracing: Do not have 'comm' filter override event 'comm' field ata: ahci: don't mark HotPlugCapable Ports as external/removable PM / sleep / x86: Fix crash on graph trace through x86 suspend arm64: vmemmap: use virtual projection of linear region Adding Intel Lewisburg device IDs for SATA writeback: flush inode cgroup wb switches instead of pinning super_block block: bio: introduce helpers to get the 1st and last bvec libata: Align ata_device's id on a cacheline libata: fix HDIO_GET_32BIT ioctl drm/amdgpu: return from atombios_dp_get_dpcd only when error drm/amdgpu/gfx8: specify which engine to wait before vm flush drm/amdgpu: apply gfx_v8 fixes to gfx_v7 as well drm/amdgpu/pm: update current crtc info after setting the powerstate drm/radeon/pm: update current crtc info after setting the powerstate drm/ast: Fix incorrect register check for DRAM 
width target: Fix WRITE_SAME/DISCARD conversion to linux 512b sectors iommu/vt-d: Use BUS_NOTIFY_REMOVED_DEVICE in hotplug path iommu/amd: Fix boot warning when device 00:00.0 is not iommu covered iommu/amd: Apply workaround for ATS write permission check arm/arm64: KVM: Fix ioctl error handling KVM: x86: fix root cause for missed hardware breakpoints vfio: fix ioctl error handling Fix cifs_uniqueid_to_ino_t() function for s390x CIFS: Fix SMB2+ interim response processing for read requests cifs: fix out-of-bounds access in lease parsing fbcon: set a default value to blink interval kvm: x86: Update tsc multiplier on change. mips/kvm: fix ioctl error handling parisc: Fix ptrace syscall number and return value modification PCI: keystone: Fix MSI code that retrieves struct pcie_port pointer block: Initialize max_dev_sectors to 0 drm/amdgpu: mask out WC from BO on unsupported arches btrfs: async-thread: Fix a use-after-free error for trace btrfs: Fix no_space in write and rm loop Btrfs: fix deadlock running delayed iputs at transaction commit time drivers: sh: Restore legacy clock domain on SuperH platforms use ->d_seq to get coherency between ->d_inode and ->d_flags Linux 4.4.4 iwlwifi: mvm: don't allow sched scans without matches to be started iwlwifi: update and fix 7265 series PCI IDs iwlwifi: pcie: properly configure the debug buffer size for 8000 iwlwifi: dvm: fix WoWLAN security: let security modules use PTRACE_MODE_* with bitmasks IB/cma: Fix RDMA port validation for iWarp x86/irq: Plug vector cleanup race x86/irq: Call irq_force_move_complete with irq descriptor x86/irq: Remove outgoing CPU from vector cleanup mask x86/irq: Remove the cpumask allocation from send_cleanup_vector() x86/irq: Clear move_in_progress before sending cleanup IPI x86/irq: Remove offline cpus from vector cleanup x86/irq: Get rid of code duplication x86/irq: Copy vectormask instead of an AND operation x86/irq: Check vector allocation early x86/irq: Reorganize the search in 
assign_irq_vector x86/irq: Reorganize the return path in assign_irq_vector x86/irq: Do not use apic_chip_data.old_domain as temporary buffer x86/irq: Validate that irq descriptor is still active x86/irq: Fix a race in x86_vector_free_irqs() x86/irq: Call chip->irq_set_affinity in proper context x86/entry/compat: Add missing CLAC to entry_INT80_32 x86/mpx: Fix off-by-one comparison with nr_registers hpfs: don't truncate the file when delete fails do_last(): ELOOP failure exit should be done after leaving RCU mode should_follow_link(): validate ->d_seq after having decided to follow xen/pcifront: Fix mysterious crashes when NUMA locality information was extracted. xen/pciback: Save the number of MSI-X entries to be copied later. xen/pciback: Check PF instead of VF for PCI_COMMAND_MEMORY xen/scsiback: correct frontend counting xen/arm: correctly handle DMA mapping of compound pages ARM: at91/dt: fix typo in sama5d2 pinmux descriptions ARM: OMAP2+: Fix onenand initialization to avoid filesystem corruption do_last(): don't let a bogus return value from ->open() et.al. 
to confuse us kernel/resource.c: fix muxed resource handling in __request_region() sunrpc/cache: fix off-by-one in qword_get() tracing: Fix showing function event in available_events powerpc/eeh: Fix partial hotplug criterion KVM: x86: MMU: fix ubsan index-out-of-range warning KVM: x86: fix conversion of addresses to linear in 32-bit protected mode KVM: x86: fix missed hardware breakpoints KVM: arm/arm64: vgic: Ensure bitmaps are long enough KVM: async_pf: do not warn on page allocation failures of/irq: Fix msi-map calculation for nonzero rid-base NFSv4: Fix a dentry leak on alias use nfs: fix nfs_size_to_loff_t block: fix use-after-free in dio_bio_complete bio: return EINTR if copying to user space got interrupted i2c: i801: Adding Intel Lewisburg support for iTCO phy: core: fix wrong err handle for phy_power_on writeback: keep superblock pinned during cgroup writeback association switches cgroup: make sure a parent css isn't offlined before its children cpuset: make mm migration asynchronous PCI/AER: Flush workqueue on device remove to avoid use-after-free ARCv2: SMP: Emulate IPI to self using software triggered interrupt ARCv2: STAR 9000950267: Handle return from intr to Delay Slot #2 libata: fix sff host state machine locking while polling qla2xxx: Fix stale pointer access. spi: atmel: fix gpio chip-select in case of non-DT platform target: Fix race with SCF_SEND_DELAYED_TAS handling target: Fix remote-port TMR ABORT + se_cmd fabric stop target: Fix TAS handling for multi-session se_node_acls target: Fix LUN_RESET active TMR descriptor handling target: Fix LUN_RESET active I/O handling for ACK_KREF ALSA: hda - Fixing background noise on Dell Inspiron 3162 ALSA: hda - Apply clock gate workaround to Skylake, too Revert "workqueue: make sure delayed work run in local cpu" workqueue: handle NUMA_NO_NODE for unbound pool_workqueue lookup mac80211: Requeue work after scan complete for all VIF types. 
rfkill: fix rfkill_fop_read wait_event usage tick/nohz: Set the correct expiry when switching to nohz/lowres mode perf stat: Do not clean event's private stats cdc-acm:exclude Samsung phone 04e8:685d Revert "Staging: panel: usleep_range is preferred over udelay" Staging: speakup: Fix getting port information sd: Optimal I/O size is in bytes, not sectors libceph: don't spam dmesg with stray reply warnings libceph: use the right footer size when skipping a message libceph: don't bail early from try_read() when skipping a message libceph: fix ceph_msg_revoke() seccomp: always propagate NO_NEW_PRIVS on tsync cpufreq: Fix NULL reference crash while accessing policy->governor_data cpufreq: pxa2xx: fix pxa_cpufreq_change_voltage prototype hwmon: (ads1015) Handle negative conversion values correctly hwmon: (gpio-fan) Remove un-necessary speed_index lookup for thermal hook hwmon: (dell-smm) Blacklist Dell Studio XPS 8000 Thermal: do thermal zone update after a cooling device registered Thermal: handle thermal zone device properly during system sleep Thermal: initialize thermal zone device correctly IB/mlx5: Expose correct maximum number of CQE capacity IB/qib: Support creating qps with GFP_NOIO flag IB/qib: fix mcast detach when qp not attached IB/cm: Fix a recently introduced deadlock dmaengine: dw: disable BLOCK IRQs for non-cyclic xfer dmaengine: at_xdmac: fix resume for cyclic transfers dmaengine: dw: fix cyclic transfer callbacks dmaengine: dw: fix cyclic transfer setup nfit: fix multi-interface dimm handling, acpi6.1 compatibility ACPI / PCI / hotplug: unlock in error path in acpiphp_enable_slot() ACPI: Revert "ACPI / video: Add Dell Inspiron 5737 to the blacklist" ACPI / video: Add disable_backlight_sysfs_if quirk for the Toshiba Satellite R830 ACPI / video: Add disable_backlight_sysfs_if quirk for the Toshiba Portege R700 lib: sw842: select crc32 uapi: update install list after nvme.h rename ideapad-laptop: Add Lenovo Yoga 700 to no_hw_rfkill dmi list 
ideapad-laptop: Add Lenovo ideapad Y700-17ISK to no_hw_rfkill dmi list toshiba_acpi: Fix blank screen at boot if transflective backlight is supported make sure that freeing shmem fast symlinks is RCU-delayed drm/radeon/pm: adjust display configuration after powerstate drm/radeon: Don't hang in radeon_flip_work_func on disabled crtc. (v2) drm: Fix treatment of drm_vblank_offdelay in drm_vblank_on() (v2) drm: Fix drm_vblank_pre/post_modeset regression from Linux 4.4 drm: Prevent vblank counter bumps > 1 with active vblank clients. (v2) drm: No-Op redundant calls to drm_vblank_off() (v2) drm/radeon: use post-decrement in error handling drm/qxl: use kmalloc_array to alloc reloc_info in qxl_process_single_command drm/i915: fix error path in intel_setup_gmbus() drm/i915/dsi: don't pass arbitrary data to sideband drm/i915/dsi: defend gpio table against out of bounds access drm/i915/skl: Don't skip mst encoders in skl_ddi_pll_select() drm/i915: Don't reject primary plane windowing with color keying enabled on SKL+ drm/i915/dp: fall back to 18 bpp when sink capability is unknown drm/i915: Make sure DC writes are coherent on flush. 
drm/i915: Init power domains early in driver load drm/i915: intel_hpd_init(): Fix suspend/resume reprobing drm/i915: Restore inhibiting the load of the default context drm: fix missing reference counting decrease drm/radeon: hold reference to fences in radeon_sa_bo_new drm/radeon: mask out WC from BO on unsupported arches drm: add helper to check for wc memory support drm/radeon: fix DP audio support for APU with DCE4.1 display engine drm/radeon: Add a common function for DFS handling drm/radeon: cleaned up VCO output settings for DP audio drm/radeon: properly byte swap vce firmware setup drm/radeon: clean up fujitsu quirks drm/radeon: Fix "slow" audio over DP on DCE8+ drm/radeon: call hpd_irq_event on resume drm/radeon: Fix off-by-one errors in radeon_vm_bo_set_addr drm/dp/mst: deallocate payload on port destruction drm/dp/mst: Reverse order of MST enable and clearing VC payload table. drm/dp/mst: move GUID storage from mgr, port to only mst branch drm/dp/mst: Calculate MST PBN with 31.32 fixed point drm: Add drm_fixp_from_fraction and drm_fixp2int_ceil drm/dp/mst: fix in RAD element access drm/dp/mst: fix in MSTB RAD initialization drm/dp/mst: always send reply for UP request drm/dp/mst: process broadcast messages correctly drm/nouveau: platform: Fix deferred probe drm/nouveau/disp/dp: ensure sink is powered up before attempting link training drm/nouveau/display: Enable vblank irqs after display engine is on again. drm/nouveau/kms: take mode_config mutex in connector hotplug path drm/amdgpu/pm: adjust display configuration after powerstate drm/amdgpu: Don't hang in amdgpu_flip_work_func on disabled crtc. 
drm/amdgpu: use post-decrement in error handling drm/amdgpu: fix issue with overlapping userptrs drm/amdgpu: hold reference to fences in amdgpu_sa_bo_new (v2) drm/amdgpu: remove unnecessary forward declaration drm/amdgpu: fix s4 resume drm/amdgpu: remove exp hardware support from iceland drm/amdgpu: don't load MEC2 on topaz drm/amdgpu: drop topaz support from gmc8 module drm/amdgpu: pull topaz gmc bits into gmc_v7 drm/amdgpu: The VI specific EXE bit should only apply to GMC v8.0 above drm/amdgpu: iceland use CI based MC IP drm/amdgpu: move gmc7 support out of CIK dependency drm/amdgpu: no need to load MC firmware on fiji drm/amdgpu: fix amdgpu_bo_pin_restricted VRAM placing v2 drm/amdgpu: fix tonga smu resume drm/amdgpu: fix lost sync_to if scheduler is enabled. drm/amdgpu: call hpd_irq_event on resume drm/amdgpu: Fix off-by-one errors in amdgpu_vm_bo_map drm/vmwgfx: respect 'nomodeset' drm/vmwgfx: Fix a width / pitch mismatch on framebuffer updates drm/vmwgfx: Fix an incorrect lock check virtio_pci: fix use after free on release virtio_balloon: fix race between migration and ballooning virtio_balloon: fix race by fill and leak regulator: mt6311: MT6311_REGULATOR needs to select REGMAP_I2C regulator: axp20x: Fix GPIO LDO enable value for AXP22x clk: exynos: use irqsave version of spin_lock to avoid deadlock with irqs cxl: use correct operator when writing pcie config space values sparc64: fix incorrect sign extension in sys_sparc64_personality EDAC, mc_sysfs: Fix freeing bus' name EDAC: Robustify workqueues destruction MIPS: Fix buffer overflow in syscall_get_arguments() MIPS: Fix some missing CONFIG_CPU_MIPSR6 #ifdefs MIPS: hpet: Choose a safe value for the ETIME check MIPS: Loongson-3: Fix SMP_ASK_C0COUNT IPI handler Revert "MIPS: Fix PAGE_MASK definition" cputime: Prevent 32bit overflow in time[val|spec]_to_cputime() time: Avoid signed overflow in timekeeping_get_ns() Bluetooth: 6lowpan: Fix handling of uncompressed IPv6 packets Bluetooth: 6lowpan: Fix kernel 
NULL pointer dereferences Bluetooth: Fix incorrect removing of IRKs Bluetooth: Add support of Toshiba Broadcom based devices Bluetooth: Use continuous scanning when creating LE connections Drivers: hv: vmbus: Fix a Host signaling bug tools: hv: vss: fix the write()'s argument: error -> vss_msg mmc: sdhci: Allow override of get_cd() called from sdhci_request() mmc: sdhci: Allow override of mmc host operations mmc: sdhci-pci: Fix card detect race for Intel BXT/APL mmc: pxamci: fix again read-only gpio detection polarity mmc: sdhci-acpi: Fix card detect race for Intel BXT/APL mmc: mmci: fix an ages old detection error mmc: core: Enable tuning according to the actual timing mmc: sdhci: Fix sdhci_runtime_pm_bus_on/off() mmc: mmc: Fix incorrect use of driver strength switching HS200 and HS400 mmc: sdio: Fix invalid vdd in voltage switch power cycle mmc: sdhci: Fix DMA descriptor with zero data length mmc: sdhci-pci: Do not default to 33 Ohm driver strength for Intel SPT mmc: usdhi6rol0: handle NULL data in timeout clockevents/tcb_clksrc: Prevent disabling an already disabled clock posix-clock: Fix return code on the poll method's error path irqchip/gic-v3-its: Fix double ICC_EOIR write for LPI in EOImode==1 irqchip/atmel-aic: Fix wrong bit operation for IRQ priority irqchip/mxs: Add missing set_handle_irq() irqchip/omap-intc: Add support for spurious irq handling coresight: checking for NULL string in coresight_name_match() dm: fix dm_rq_target_io leak on faults with .request_fn DM w/ blk-mq paths dm snapshot: fix hung bios when copy error occurs dm space map metadata: remove unused variable in brb_pop() tda1004x: only update the frontend properties if locked vb2: fix a regression in poll() behavior for output,streams gspca: ov534/topro: prevent a division by 0 si2157: return -EINVAL if firmware blob is too big media: dvb-core: Don't force CAN_INVERSION_AUTO in oneshot mode rc: sunxi-cir: Initialize the spinlock properly namei: ->d_inode of a pinned dentry is stable only 
for positives mei: validate request value in client notify request ioctl mei: fix fasync return value on error rtlwifi: rtl8723be: Fix module parameter initialization rtlwifi: rtl8188ee: Fix module parameter initialization rtlwifi: rtl8192se: Fix module parameter initialization rtlwifi: rtl8723ae: Fix initialization of module parameters rtlwifi: rtl8192de: Fix incorrect module parameter descriptions rtlwifi: rtl8192ce: Fix handling of module parameters rtlwifi: rtl8192cu: Add missing parameter setup rtlwifi: rtl_pci: Fix kernel panic locks: fix unlock when fcntl_setlk races with a close um: link with -lpthread uml: fix hostfs mknod() uml: flush stdout before forking s390/fpu: signals vs. floating point control register s390/compat: correct restore of high gprs on signal return s390/dasd: fix performance drop s390/dasd: fix refcount for PAV reassignment s390/dasd: prevent incorrect length error under z/VM after PAV changes s390: fix normalization bug in exception table sorting btrfs: initialize the seq counter in struct btrfs_device Btrfs: Initialize btrfs_root->highest_objectid when loading tree root and subvolume roots Btrfs: fix transaction handle leak on failure to create hard link Btrfs: fix number of transaction units required to create symlink Btrfs: send, don't BUG_ON() when an empty symlink is found btrfs: statfs: report zero available if metadata are exhausted Btrfs: igrab inode in writepage Btrfs: add missing brelse when superblock checksum fails KVM: s390: fix memory overwrites when vx is disabled s390/kvm: remove dependency on struct save_area definition clocksource/drivers/vt8500: Increase the minimum delta genirq: Validate action before dereferencing it in handle_irq_event_percpu() mm: numa: quickly fail allocations for NUMA balancing on full nodes mm: thp: fix SMP race condition between THP page fault and MADV_DONTNEED ocfs2: unlock inode if deleting inode from orphan fails drm/i915: shut up gen8+ SDE irq dmesg noise iw_cxgb3: Fix incorrectly 
returning error on success spi: omap2-mcspi: Prevent duplicate gpio_request drivers: android: correct the size of struct binder_uintptr_t for BC_DEAD_BINDER_DONE USB: option: add "4G LTE usb-modem U901" USB: option: add support for SIM7100E USB: cp210x: add IDs for GE B650V3 and B850V3 boards usb: dwc3: Fix assignment of EP transfer resources can: ems_usb: Fix possible tx overflow dm thin: fix race condition when destroying thin pool workqueue bcache: Change refill_dirty() to always scan entire disk if necessary bcache: prevent crash on changing writeback_running bcache: allows use of register in udev to avoid "device_busy" error. bcache: unregister reboot notifier if bcache fails to unregister device bcache: fix a leak in bch_cached_dev_run() bcache: clear BCACHE_DEV_UNLINK_DONE flag when attaching a backing device bcache: Add a cond_resched() call to gc bcache: fix a livelock when we cause a huge number of cache misses lib/ucs2_string: Correct ucs2 -> utf8 conversion efi: Add pstore variables to the deletion whitelist efi: Make efivarfs entries immutable by default efi: Make our variable validation list include the guid efi: Do variable name validation tests in utf8 efi: Use ucs2_as_utf8 in efivarfs instead of open coding a bad version lib/ucs2_string: Add ucs2 -> utf8 helper functions ARM: 8457/1: psci-smp is built only for SMP drm/gma500: Use correct unref in the gem bo create function devm_memremap: Fix error value when memremap failed KVM: s390: fix guest fprs memory leak arm64: errata: Add -mpc-relative-literal-loads to build flags ARM: debug-ll: fix BCM63xx entry for multiplatform ext4: fix bh->b_state corruption sctp: Fix port hash table size computation unix_diag: fix incorrect sign extension in unix_lookup_by_ino tipc: unlock in error path rtnl: RTM_GETNETCONF: fix wrong return value IFF_NO_QUEUE: Fix for drivers not calling ether_setup() tcp/dccp: fix another race at listener dismantle route: check and remove route cache when we get route net_sched fix: 
reclassification needs to consider ether protocol changes pppoe: fix reference counting in PPPoE proxy l2tp: Fix error creating L2TP tunnels net/mlx4_en: Avoid changing dev->features directly in run-time net/mlx4_en: Choose time-stamping shift value according to HW frequency net/mlx4_en: Count HW buffer overrun only once qmi_wwan: add "4G LTE usb-modem U901" tcp: md5: release request socket instead of listener tipc: fix premature addition of node to lookup table af_unix: Guard against other == sk in unix_dgram_sendmsg af_unix: Don't set err in unix_stream_read_generic unless there was an error ipv4: fix memory leaks in ip_cmsg_send() callers bonding: Fix ARP monitor validation bpf: fix branch offset adjustment on backjumps after patching ctx expansion flow_dissector: Fix unaligned access in __skb_flow_dissector when used by eth_get_headlen net: Copy inner L3 and L4 headers as unaligned on GRE TEB sctp: translate network order to host order when users get a hmacid enic: increment devcmd2 result ring in case of timeout tg3: Fix for tg3 transmit queue 0 timed out when too many gso_segs net:Add sysctl_max_skb_frags tcp: do not drop syn_recv on all icmp reports unix: correctly track in-flight fds in sending process user_struct ipv6: fix a lockdep splat ipv6: addrconf: Fix recursive spin lock call ipv6/udp: use sticky pktinfo egress ifindex on connect() ipv6: enforce flowi6_oif usage in ip6_dst_lookup_tail() tcp: beware of alignments in tcp_get_info() switchdev: Require RTNL mutex to be held when sending FDB notifications inet: frag: Always orphan skbs inside ip_defrag() tipc: fix connection abort during subscription cancel net: dsa: fix mv88e6xxx switches sctp: allow setting SCTP_SACK_IMMEDIATELY by the application pptp: fix illegal memory access caused by multiple bind()s af_unix: fix struct pid memory leak tcp: fix NULL deref in tcp_v4_send_ack() lwt: fix rx checksum setting for lwt devices tunneling over ipv6 tunnels: Allow IPv6 UDP checksums to be correctly 
controlled. net: dp83640: Fix tx timestamp overflow handling. gro: Make GRO aware of lightweight tunnels. af_iucv: Validate socket address length in iucv_sock_bind()

Conflicts:
    arch/arm64/Makefile
    arch/arm64/include/asm/cacheflush.h
    drivers/mmc/host/sdhci.c
    drivers/usb/dwc3/ep0.c
    drivers/usb/dwc3/gadget.c
    kernel/module.c
    sound/core/pcm_compat.c

CRs-Fixed: 1010239
Signed-off-by: Runmin Wang <runminw@codeaurora.org>
Change-Id: I41a28636fc9ad91f9d979b191784609476294cdf
| * Revert "workqueue: make sure delayed work run in local cpu"    Tejun Heo    2016-03-03
commit 041bd12e272c53a35c54c13875839bcb98c999ce upstream.

This reverts commit 874bbfe600a660cba9c776b3957b1ce393151b76.

Workqueue used to implicitly guarantee that work items queued without an explicit CPU specified are put on the local CPU. Recent changes in timer broke the guarantee and led to vmstat breakage which was fixed by 176bed1de5bf ("vmstat: explicitly schedule per-cpu work on the CPU we need it to run on").

vmstat is the most likely to expose the issue and it's quite possible that there are other similar problems which are a lot more difficult to trigger. As a preventive measure, 874bbfe600a6 ("workqueue: make sure delayed work run in local cpu") was applied to restore the local CPU guarantee. Unfortunately, the change exposed a bug in timer code which got fixed by 22b886dd1018 ("timers: Use proper base migration in add_timer_on()"). Due to code restructuring, the commit couldn't be backported beyond a certain point and stable kernels which only had 874bbfe600a6 started crashing.

The local CPU guarantee was accidental more than anything else and we want to get rid of it anyway. As, with the vmstat case fixed, 874bbfe600a6 is causing more problems than it's fixing, it has been decided to take the chance and officially break the guarantee by reverting the commit. A debug feature will be added to force foreign CPU assignment to expose cases relying on the guarantee and fixes for the individual cases will be backported to stable as necessary.
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 874bbfe600a6 ("workqueue: make sure delayed work run in local cpu")
Link: http://lkml.kernel.org/g/20160120211926.GJ10810@quack.suse.cz
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
Cc: Daniel Bilik <daniel.bilik@neosystem.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Shaohua Li <shli@fb.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Bilik <daniel.bilik@neosystem.cz>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
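The practical consequence of dropping the local CPU guarantee can be sketched with a small userspace model. This is not kernel code: `MODEL_CPU_UNBOUND`, `struct model_dwork` and `model_target_cpu()` are hypothetical names invented for illustration, and the model only assumes that a delayed work item queued without an explicit CPU ends up on whichever CPU its timer fires on.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sentinel for this model: "no explicit CPU requested". */
#define MODEL_CPU_UNBOUND (-1)

/* A delayed work item reduced to the one field this model needs. */
struct model_dwork {
    int req_cpu; /* explicit CPU, or MODEL_CPU_UNBOUND */
};

/*
 * Pick the CPU a delayed work item runs on once its timer expires.
 * timer_cpu is the CPU on which the timer actually fired; it may
 * differ from the CPU that originally queued the item.
 */
int model_target_cpu(const struct model_dwork *dw, int timer_cpu)
{
    assert(dw != NULL);
    if (dw->req_cpu == MODEL_CPU_UNBOUND)
        return timer_cpu; /* no guarantee: the item follows the timer */
    return dw->req_cpu;   /* an explicit request is honoured */
}
```

In this model an item queued without a CPU lands on the timer's CPU, while an item that names its CPU stays there. That is the pattern the vmstat fix cited above follows: per-cpu work states the CPU it needs, for example through `queue_delayed_work_on()`, rather than relying on where it was queued.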
| * workqueue: handle NUMA_NO_NODE for unbound pool_workqueue lookup    Tejun Heo    2016-03-03
commit d6e022f1d207a161cd88e08ef0371554680ffc46 upstream.

When looking up the pool_workqueue to use for an unbound workqueue, workqueue assumes that the target CPU is always bound to a valid NUMA node. However, currently, when a CPU goes offline, the mapping is destroyed and cpu_to_node() returns NUMA_NO_NODE.

This has always been broken but hasn't triggered often enough before 874bbfe600a6 ("workqueue: make sure delayed work run in local cpu"). After the commit, workqueue forcibly assigns the local CPU for delayed work items without an explicit target CPU to fix a different issue. This widens the window where a CPU can go offline while a delayed work item is pending, causing delayed work items to be dispatched with the target CPU set to an already offlined CPU. The resulting NUMA_NO_NODE mapping makes workqueue try to queue the work item on a NULL pool_workqueue and thus crash.

While 874bbfe600a6 has been reverted for a different reason, making the bug less visible again, it can still happen. Fix it by mapping NUMA_NO_NODE to the default pool_workqueue from unbound_pwq_by_node(). This is a temporary workaround. The long term solution is keeping the CPU -> NODE mapping stable across CPU off/online cycles, which is being worked on.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Len Brown <len.brown@intel.com>
Link: http://lkml.kernel.org/g/1454424264.11183.46.camel@gmail.com
Link: http://lkml.kernel.org/g/1453702100-2597-1-git-send-email-tangchen@cn.fujitsu.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
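The fallback described in this commit can be sketched as a userspace model. This is not the kernel's unbound_pwq_by_node(): the `model_` types, the table size, and the field names `dfl_pwq` and `numa_pwq_tbl` are assumptions made for illustration. The sketch only shows the idea of returning the default pool_workqueue when the node is NUMA_NO_NODE instead of indexing the per-node table.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_NUMA_NO_NODE (-1) /* stand-in for NUMA_NO_NODE */
#define MODEL_MAX_NODES 4       /* arbitrary table size for the model */

/* Stand-in for a pool_workqueue; only its identity matters here. */
struct model_pwq {
    int id;
};

/* Stand-in for an unbound workqueue with a default and per-node pwqs. */
struct model_wq {
    struct model_pwq *dfl_pwq;                       /* default pwq */
    struct model_pwq *numa_pwq_tbl[MODEL_MAX_NODES]; /* per-node pwqs */
};

/*
 * Look up the pwq for a node. A CPU that went offline may report
 * MODEL_NUMA_NO_NODE; fall back to the default pwq in that case
 * rather than indexing the table with an invalid node.
 */
struct model_pwq *model_pwq_by_node(const struct model_wq *wq, int node)
{
    assert(wq != NULL);
    if (node == MODEL_NUMA_NO_NODE)
        return wq->dfl_pwq;
    assert(node >= 0 && node < MODEL_MAX_NODES);
    return wq->numa_pwq_tbl[node];
}
```

The early return keeps the common path unchanged and confines the workaround to one check, which fits the commit's description of it as temporary.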
* | workqueue: implement lockup detector  Tejun Heo  2016-05-05
| |
| | Workqueue stalls can happen from a variety of usage bugs such as missing WQ_MEM_RECLAIM flag or concurrency managed work item indefinitely staying RUNNING. These stalls can be extremely difficult to hunt down because the usual warning mechanisms can't detect workqueue stalls and the internal state is pretty opaque.
| |
| | To alleviate the situation, this patch implements workqueue lockup detector. It periodically monitors all worker_pools and, if any pool failed to make forward progress longer than the threshold duration, triggers warning and dumps workqueue state as follows.
| |
| |   BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 31s!
| |   Showing busy workqueues and worker pools:
| |   workqueue events: flags=0x0
| |     pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=17/256
| |       pending: monkey_wrench_fn, e1000_watchdog, cache_reap, vmstat_shepherd, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, cgroup_release_agent
| |   workqueue events_power_efficient: flags=0x80
| |     pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
| |       pending: check_lifetime, neigh_periodic_work
| |   workqueue cgroup_pidlist_destroy: flags=0x0
| |     pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/1
| |       pending: cgroup_pidlist_destroy_work_fn
| |   ...
| |
| | The detection mechanism is controlled through kernel parameter workqueue.watchdog_thresh and can be updated at runtime through the sysfs module parameter file.
| |
| | v2: Decoupled from softlockup control knobs.
| |
| | CRs-Fixed: 1007459
| | Change-Id: Id7dfbbd2701128a942b1bcac2299e07a66db8657
| | Signed-off-by: Tejun Heo <tj@kernel.org>
| | Acked-by: Don Zickus <dzickus@redhat.com>
| | Cc: Ulrich Obergfell <uobergfe@redhat.com>
| | Cc: Michal Hocko <mhocko@suse.com>
| | Cc: Chris Mason <clm@fb.com>
| | Cc: Andrew Morton <akpm@linux-foundation.org>
| | Git-commit: 82607adcf9cdf40fb7b5331269780c8f70ec6e35
| | Git-repo: git://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git
| | Signed-off-by: Trilok Soni <tsoni@codeaurora.org>
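The periodic check described in this entry can be illustrated with a small userspace model. It is only a sketch: `struct pool_model`, its fields and `count_stalled_pools()` are hypothetical names, time is counted in plain seconds, and treating a zero threshold as "disabled" and skipping pools without pending work are assumptions of this model rather than statements from the commit message.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-pool bookkeeping for this model. */
struct pool_model {
    unsigned long last_progress;  /* time (s) of the last forward progress */
    bool has_pending;             /* pool has queued work items */
};

/*
 * Return how many pools have pending work but made no forward progress
 * for longer than thresh seconds. A thresh of 0 is treated as
 * "detector disabled" (an assumption of this sketch).
 */
static int count_stalled_pools(const struct pool_model *pools, int n,
                               unsigned long now, unsigned long thresh)
{
    int stalled = 0;
    int i;

    if (thresh == 0)
        return 0;

    for (i = 0; i < n; i++) {
        if (!pools[i].has_pending)
            continue;  /* an idle pool cannot be stuck */
        if (now - pools[i].last_progress > thresh)
            stalled++;
    }
    return stalled;
}
```

In the scenario from the dump above, a pool that last made progress 31 seconds ago would be counted as stalled under a 30 second threshold.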
* | kernel/lib: add additional debug capabilites for data corruption  Syed Rameez Mustafa  2016-03-22
|/
| Data corruptions in the kernel often end up in system crashes that are easier to debug closer to the time of detection. Specifically, if we do not panic immediately after lock or list corruptions have been detected, the problem context is lost in the ensuing system mayhem. Add support for allowing system crash immediately after such corruptions are detected. The CONFIG option controls the enabling/disabling of the feature.
|
| Change-Id: I9b2eb62da506a13007acff63e85e9515145909ff
| Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
| [abhimany: minor merge conflict resolution]
| Signed-off-by: Abhimanyu Kapur <abhimany@codeaurora.org>
* Merge branch 'for-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq  Linus Torvalds  2015-11-05
|\
| | Pull workqueue update from Tejun Heo:
| |  "This pull request contains one patch to make an unbound worker pool allocated from the NUMA node containing it if such node exists. As unbound worker pools are node-affine by default, this makes most pools allocated on the right node"
| |
| | * 'for-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
| |   workqueue: Allocate the unbound pool using local node memory
| * workqueue: Allocate the unbound pool using local node memory  Xunlei Pang  2015-10-12
| |
| | Currently, get_unbound_pool() uses kzalloc() to allocate the worker pool. Actually, we can use the right node to do the allocation, achieving local memory access.
| |
| | This patch selects target node first, and uses kzalloc_node() instead.
| |
| | Signed-off-by: Xunlei Pang <pang.xunlei@linaro.org>
| | Signed-off-by: Tejun Heo <tj@kernel.org>
* | workqueue: make sure delayed work run in local cpu  Shaohua Li  2015-09-30
|/
| My system keeps crashing with the message below. vmstat_update() schedules a delayed work on the current CPU and expects the work to run on that CPU. schedule_delayed_work() is expected to make the delayed work run on the local CPU. The problem is that the timer can be migrated with NO_HZ. __queue_work() queues the work in the timer handler, which could run on a CPU other than the one where the delayed work was scheduled. The end result is that the delayed work runs on a different CPU. The patch makes __queue_delayed_work() record the local CPU earlier. With the change, where the timer runs no longer affects where the work runs.
|
| [ 28.010131] ------------[ cut here ]------------
| [ 28.010609] kernel BUG at ../mm/vmstat.c:1392!
| [ 28.011099] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
| [ 28.011860] Modules linked in:
| [ 28.012245] CPU: 0 PID: 289 Comm: kworker/0:3 Tainted: G W4.3.0-rc3+ #634
| [ 28.013065] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140709_153802- 04/01/2014
| [ 28.014160] Workqueue: events vmstat_update
| [ 28.014571] task: ffff880117682580 ti: ffff8800ba428000 task.ti: ffff8800ba428000
| [ 28.015445] RIP: 0010:[<ffffffff8115f921>] [<ffffffff8115f921>]vmstat_update+0x31/0x80
| [ 28.016282] RSP: 0018:ffff8800ba42fd80 EFLAGS: 00010297
| [ 28.016812] RAX: 0000000000000000 RBX: ffff88011a858dc0 RCX:0000000000000000
| [ 28.017585] RDX: ffff880117682580 RSI: ffffffff81f14d8c RDI:ffffffff81f4df8d
| [ 28.018366] RBP: ffff8800ba42fd90 R08: 0000000000000001 R09:0000000000000000
| [ 28.019169] R10: 0000000000000000 R11: 0000000000000121 R12:ffff8800baa9f640
| [ 28.019947] R13: ffff88011a81e340 R14: ffff88011a823700 R15:0000000000000000
| [ 28.020071] FS: 0000000000000000(0000) GS:ffff88011a800000(0000)knlGS:0000000000000000
| [ 28.020071] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
| [ 28.020071] CR2: 00007ff6144b01d0 CR3: 00000000b8e93000 CR4:00000000000006f0
| [ 28.020071] Stack:
| [ 28.020071] ffff88011a858dc0 ffff8800baa9f640 ffff8800ba42fe00ffffffff8106bd88
| [ 28.020071] ffffffff8106bd0b 0000000000000096 0000000000000000ffffffff82f9b1e8
| [ 28.020071] ffffffff829f0b10 0000000000000000 ffffffff81f18460ffff88011a81e340
| [ 28.020071] Call Trace:
| [ 28.020071] [<ffffffff8106bd88>] process_one_work+0x1c8/0x540
| [ 28.020071] [<ffffffff8106bd0b>] ? process_one_work+0x14b/0x540
| [ 28.020071] [<ffffffff8106c214>] worker_thread+0x114/0x460
| [ 28.020071] [<ffffffff8106c100>] ? process_one_work+0x540/0x540
| [ 28.020071] [<ffffffff81071bf8>] kthread+0xf8/0x110
| [ 28.020071] [<ffffffff81071b00>] ?kthread_create_on_node+0x200/0x200
| [ 28.020071] [<ffffffff81a6522f>] ret_from_fork+0x3f/0x70
| [ 28.020071] [<ffffffff81071b00>] ?kthread_create_on_node+0x200/0x200
|
| Signed-off-by: Shaohua Li <shli@fb.com>
| Signed-off-by: Tejun Heo <tj@kernel.org>
| Cc: stable@vger.kernel.org # v2.6.31+
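The difference between resolving the CPU when the timer fires and resolving it when the work is queued can be modelled in a few lines of userspace C. Everything here is a hypothetical illustration: `struct delayed_work_model`, the three functions and the -1 sentinel used for `WORK_CPU_UNBOUND` are inventions of this sketch and do not reflect the kernel's definitions.

```c
#include <assert.h>

#define WORK_CPU_UNBOUND (-1)  /* sentinel for this model only */

struct delayed_work_model {
    int cpu;  /* CPU recorded at queueing time */
};

/* Old behaviour: an unbound request stays unbound until the timer fires. */
static void queue_delayed_old(struct delayed_work_model *dw, int cpu,
                              int local_cpu)
{
    (void)local_cpu;  /* not consulted in the old scheme */
    dw->cpu = cpu;
}

/* Patched behaviour: an unbound request is pinned to the queueing CPU. */
static void queue_delayed_new(struct delayed_work_model *dw, int cpu,
                              int local_cpu)
{
    dw->cpu = (cpu == WORK_CPU_UNBOUND) ? local_cpu : cpu;
}

/*
 * Timer handler: with NO_HZ the timer may fire on timer_cpu, which can
 * differ from the CPU that queued the work. An unbound item ends up on
 * whatever CPU runs the handler.
 */
static int timer_fn_target_cpu(const struct delayed_work_model *dw,
                               int timer_cpu)
{
    return dw->cpu == WORK_CPU_UNBOUND ? timer_cpu : dw->cpu;
}
```

With the old scheme, work queued on CPU 0 whose timer migrates to CPU 2 runs on CPU 2. With the patched scheme it stays on CPU 0, which is what vmstat_update() expects.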
* Merge branch 'for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq  Linus Torvalds  2015-09-02
|\
| | Pull workqueue updates from Tejun Heo:
| |  "Only three trivial changes for workqueue this time - doc, MAINTAINERS and EXPORT_SYMBOL updates"
| |
| | * 'for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
| |   workqueue: fix some docbook warnings
| |   workqueue: Make flush_workqueue() available again to non GPL modules
| |   workqueue: add myself as a dedicated reviwer
| * workqueue: Make flush_workqueue() available again to non GPL modules  Tim Gardner  2015-08-04
| |
| | Commit 37b1ef31a568fc02e53587620226e5f3c66454c8 ("workqueue: move flush_scheduled_work() to workqueue.h") moved the exported non GPL flush_scheduled_work() from a function to an inline wrapper. Unfortunately, it directly calls flush_workqueue() which is a GPL function. This has the effect of changing the licensing requirement for this function and makes it unavailable to non GPL modules.
| |
| | See commit ad7b1f841f8a54c6d61ff181451f55b68175e15a ("workqueue: Make schedule_work() available again to non GPL modules") for precedent.
| |
| | Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
| | Signed-off-by: Tejun Heo <tj@kernel.org>
* | Merge branch 'sched-core-for-linus' of ↵Linus Torvalds2015-08-31
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "The biggest change in this cycle is the rewrite of the main SMP load balancing metric: the CPU load/utilization. The main goal was to make the metric more precise and more representative - see the changelog of this commit for the gory details: 9d89c257dfb9 ("sched/fair: Rewrite runnable load and utilization average tracking") It is done in a way that significantly reduces complexity of the code: 5 files changed, 249 insertions(+), 494 deletions(-) and the performance testing results are encouraging. Nevertheless we need to keep an eye on potential regressions, since this potentially affects every SMP workload in existence. This work comes from Yuyang Du. Other changes: - SCHED_DL updates. (Andrea Parri) - Simplify architecture callbacks by removing finish_arch_switch(). (Peter Zijlstra et al) - cputime accounting: guarantee stime + utime == rtime. (Peter Zijlstra) - optimize idle CPU wakeups some more - inspired by Facebook server loads. (Mike Galbraith) - stop_machine fixes and updates. (Oleg Nesterov) - Introduce the 'trace_sched_waking' tracepoint. (Peter Zijlstra) - sched/numa tweaks. 
(Srikar Dronamraju) - misc fixes and small cleanups" * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 commits) sched/deadline: Fix comment in enqueue_task_dl() sched/deadline: Fix comment in push_dl_tasks() sched: Change the sched_class::set_cpus_allowed() calling context sched: Make sched_class::set_cpus_allowed() unconditional sched: Fix a race between __kthread_bind() and sched_setaffinity() sched: Ensure a task has a non-normalized vruntime when returning back to CFS sched/numa: Fix NUMA_DIRECT topology identification tile: Reorganize _switch_to() sched, sparc32: Update scheduler comments in copy_thread() sched: Remove finish_arch_switch() sched, tile: Remove finish_arch_switch sched, sh: Fold finish_arch_switch() into switch_to() sched, score: Remove finish_arch_switch() sched, avr32: Remove finish_arch_switch() sched, MIPS: Get rid of finish_arch_switch() sched, arm: Remove finish_arch_switch() sched/fair: Clean up load average references sched/fair: Provide runnable_load_avg back to cfs_rq sched/fair: Remove task and group entity load when they are dead sched/fair: Init cfs_rq's sched_entity load average ...
| * | sched: Fix a race between __kthread_bind() and sched_setaffinity()  Peter Zijlstra  2015-08-12
| |/
| | Because sched_setaffinity() checks p->flags & PF_NO_SETAFFINITY without locks, a caller might observe an old value and race with the set_cpus_allowed_ptr() call from __kthread_bind() and effectively undo it:
| |
| |     __kthread_bind()
| |       do_set_cpus_allowed()
| |                                     <SYSCALL>
| |                                       sched_setaffinity()
| |                                         if (p->flags & PF_NO_SETAFFINITY)
| |                                         set_cpus_allowed_ptr()
| |       p->flags |= PF_NO_SETAFFINITY
| |
| | Fix the bug by putting everything under the regular scheduler locks.
| |
| | This also closes a hole in the serialization of task_struct::{nr_,}cpus_allowed.
| |
| | Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
| | Acked-by: Tejun Heo <tj@kernel.org>
| | Cc: Linus Torvalds <torvalds@linux-foundation.org>
| | Cc: Mike Galbraith <efault@gmx.de>
| | Cc: Oleg Nesterov <oleg@redhat.com>
| | Cc: Peter Zijlstra <peterz@infradead.org>
| | Cc: Thomas Gleixner <tglx@linutronix.de>
| | Cc: dedekind1@gmail.com
| | Cc: juri.lelli@arm.com
| | Cc: mgorman@suse.de
| | Cc: riel@redhat.com
| | Cc: rostedt@goodmis.org
| | Link: http://lkml.kernel.org/r/20150515154833.545640346@infradead.org
| | Signed-off-by: Ingo Molnar <mingo@kernel.org>
* / rcu: Rename rcu_lockdep_assert() to RCU_LOCKDEP_WARN()  Paul E. McKenney  2015-07-22
|/
| This commit renames rcu_lockdep_assert() to RCU_LOCKDEP_WARN() for consistency with the WARN() series of macros. This also requires inverting the sense of the conditional, which this commit also does.
|
| Reported-by: Ingo Molnar <mingo@kernel.org>
| Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
| Reviewed-by: Ingo Molnar <mingo@kernel.org>
* Merge tag 'modules-next-for-linus' of ↵Linus Torvalds2015-07-01
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux Pull module updates from Rusty Russell: "Main excitement here is Peter Zijlstra's lockless rbtree optimization to speed module address lookup. He found some abusers of the module lock doing that too. A little bit of parameter work here too; including Dan Streetman's breaking up the big param mutex so writing a parameter can load another module (yeah, really). Unfortunately that broke the usual suspects, !CONFIG_MODULES and !CONFIG_SYSFS, so those fixes were appended too" * tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: (26 commits) modules: only use mod->param_lock if CONFIG_MODULES param: fix module param locks when !CONFIG_SYSFS. rcu: merge fix for Convert ACCESS_ONCE() to READ_ONCE() and WRITE_ONCE() module: add per-module param_lock module: make perm const params: suppress unused variable error, warn once just in case code changes. modules: clarify CONFIG_MODULE_COMPRESS help, suggest 'N'. kernel/module.c: avoid ifdefs for sig_enforce declaration kernel/workqueue.c: remove ifdefs over wq_power_efficient kernel/params.c: export param_ops_bool_enable_only kernel/params.c: generalize bool_enable_only kernel/module.c: use generic module param operaters for sig_enforce kernel/params: constify struct kernel_param_ops uses sysfs: tightened sysfs permission checks module: Rework module_addr_{min,max} module: Use __module_address() for module_address_lookup() module: Make the mod_tree stuff conditional on PERF_EVENTS || TRACING module: Optimize __module_address() using a latched RB-tree rbtree: Implement generic latch_tree seqlock: Introduce raw_read_seqcount_latch() ...
| * kernel/workqueue.c: remove ifdefs over wq_power_efficient  Luis R. Rodriguez  2015-05-28
| |
| | We can avoid an ifdef over wq_power_efficient's declaration by just using IS_ENABLED().
| |
| | Cc: Rusty Russell <rusty@rustcorp.com.au>
| | Cc: Jani Nikula <jani.nikula@intel.com>
| | Cc: Andrew Morton <akpm@linux-foundation.org>
| | Cc: Kees Cook <keescook@chromium.org>
| | Cc: Tejun Heo <tj@kernel.org>
| | Cc: Ingo Molnar <mingo@kernel.org>
| | Cc: linux-kernel@vger.kernel.org
| | Cc: cocci@systeme.lip6.fr
| | Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com>
| | Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
* | workqueue: fix typos in comments  Shailendra Verma  2015-05-29
| |
| | tj: dropped the iff -> if change; iff means "if and only if" and is not a typo. Spotted by Randy Dunlap.
| |
| | Signed-off-by: Shailendra Verma <shailendra.capricorn@gmail.com>
| | Signed-off-by: Tejun Heo <tj@kernel.org>
| | Cc: Randy Dunlap <rdunlap@infradead.org>
* | workqueue: move flush_scheduled_work() to workqueue.h  Lai Jiangshan  2015-05-21
| |
| | flush_scheduled_work() is just a simple call to flush_workqueue().
| |
| | Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
| | Signed-off-by: Tejun Heo <tj@kernel.org>
* | workqueue: remove the lock from wq_sysfs_prep_attrs()  Lai Jiangshan  2015-05-21
| |
| | Reading wq->unbound_attrs requires holding either wq_pool_mutex or wq->mutex, and wq_sysfs_prep_attrs() is called with wq_pool_mutex held, so we don't need to grab wq->mutex here.
| |
| | Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
| | Signed-off-by: Tejun Heo <tj@kernel.org>
* | workqueue: remove the declaration of copy_workqueue_attrs()  Lai Jiangshan  2015-05-21
| |
| | This pre-declaration was unneeded since a previous refactor patch 6ba94429c8e7 ("workqueue: Reorder sysfs code").
| |
| | Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
| | Signed-off-by: Tejun Heo <tj@kernel.org>
* | workqueue: ensure attrs changes are properly synchronized  Lai Jiangshan  2015-05-19
| |
| | Currently, modification of attrs via sysfs is not fully synchronized.
| |
| |     Process A (change cpumask)        |     Process B (change numa affinity)
| |     wq_cpumask_store()                |
| |       wq_sysfs_prep_attrs()           |
| |                                       | apply_workqueue_attrs()
| |       apply_workqueue_attrs()         |
| |
| | As a result, Process B's operation is completely reverted without any notification, which is buggy behavior. So this patch moves wq_sysfs_prep_attrs() under the protection of wq_pool_mutex to ensure attrs changes are properly synchronized.
| |
| | Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
| | Signed-off-by: Tejun Heo <tj@kernel.org>
* | workqueue: separate out and refactor the locking of applying attrs  Lai Jiangshan  2015-05-19
| |
| | Applying attrs requires two locks: get_online_cpus() and wq_pool_mutex, and this code is duplicated in two places (apply_workqueue_attrs() and workqueue_set_unbound_cpumask()). So we separate out this locking code into apply_wqattrs_[un]lock() and do a minor refactor on apply_workqueue_attrs().
| |
| | apply_wqattrs_[un]lock() will also be used by a later patch for ensuring attrs changes are properly synchronized.
| |
| | tj: minor updates to comments
| |
| | Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
| | Signed-off-by: Tejun Heo <tj@kernel.org>
* | workqueue: simplify wq_update_unbound_numa()  Lai Jiangshan  2015-05-18
| |
| | wq_update_unbound_numa() is known to be called with wq_pool_mutex held.
| |
| | However, wq_update_unbound_numa() also acquires wq->mutex before reading wq->unbound_attrs, wq->numa_pwq_tbl[] and wq->dfl_pwq. These fields can now be read with only wq_pool_mutex held, so we simply remove the mutex_lock(&wq->mutex).
| |
| | Without the dependence on mutex_lock(&wq->mutex), the test of wq->unbound_attrs->no_numa can also be moved upward.
| |
| | The old code needed a long comment to describe the stability of @wq->unbound_attrs, which is now also guaranteed by wq_pool_mutex, so that comment is no longer needed.
| |
| | Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
| | Signed-off-by: Tejun Heo <tj@kernel.org>
* | workqueue: wq_pool_mutex protects the attrs-installation  Lai Jiangshan  2015-05-18
| |
| | Currently, wq_pool_mutex doesn't protect the attrs-installation. As a result, ->unbound_attrs, ->numa_pwq_tbl[] and ->dfl_pwq can only be accessed under wq->mutex, which causes some inconveniences. For example, wq_update_unbound_numa() has to acquire wq->mutex before fetching wq->unbound_attrs->no_numa and the old_pwq.
| |
| | attrs-installation is a short operation, so this change will not cause any latency for other operations which also acquire wq_pool_mutex.
| |
| | The only unprotected attrs-installation code is in apply_workqueue_attrs(), so this patch touches code less than comments.
| |
| | It is also a preparation patch for the next several patches, which read wq->unbound_attrs, wq->numa_pwq_tbl[] and wq->dfl_pwq with only wq_pool_mutex held.
| |
| | Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
| | Signed-off-by: Tejun Heo <tj@kernel.org>
* | workqueue: fix a typo  Chen Hanxiao  2015-05-13
| |
| | s/detemined/determined
| |
| | Signed-off-by: Chen Hanxiao <chenhanxiao@cn.fujitsu.com>
| | Signed-off-by: Tejun Heo <tj@kernel.org>
* | workqueue: function name in the comment differs from the real function name  Gong Zhaogang  2015-05-11
| |
| | modify wq_calc_node_mask to wq_calc_node_cpumask
| |
| | Signed-off-by: Gong Zhaogang <gongzhaogang@inspur.com>
| | Signed-off-by: Tejun Heo <tj@kernel.org>
* | workqueue: Allow modifying low level unbound workqueue cpumask  Lai Jiangshan  2015-04-30
| |
| | Allow modifying the low-level unbound workqueue cpumask through sysfs. This is performed by traversing the entire workqueue list and calling apply_wqattrs_prepare() on the unbound workqueues with the new low level mask. Only after all the preparations are done do we commit them all together.
| |
| | Ordered workqueues are ignored by the low level unbound workqueue cpumask; they will be handled in the near future.
| |
| | All the (default & per-node) pwqs are mandatorily controlled by the low level cpumask. If the user configured cpumask doesn't overlap with the low level cpumask, the low level cpumask will be used for the wq instead.
| |
| | The comment of wq_calc_node_cpumask() is updated and explicitly requires that its first argument should be the attrs of the default pwq.
| |
| | The default wq_unbound_cpumask is cpu_possible_mask. The workqueue subsystem doesn't know its best default value; let the system manager or the other subsystem set it when needed.
| |
| | Changed from V8:
| |   merge the calculating code for the attrs of the default pwq together.
| |   minor change the code&comments for saving the user configured attrs.
| |   remove unnecessary list_del().
| |   minor update the comment of wq_calc_node_cpumask().
| |   update the comment of workqueue_set_unbound_cpumask();
| |
| | Cc: Christoph Lameter <cl@linux.com>
| | Cc: Kevin Hilman <khilman@linaro.org>
| | Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
| | Cc: Mike Galbraith <bitbucket@online.de>
| | Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
| | Cc: Tejun Heo <tj@kernel.org>
| | Cc: Viresh Kumar <viresh.kumar@linaro.org>
| | Cc: Frederic Weisbecker <fweisbec@gmail.com>
| | Original-patch-by: Frederic Weisbecker <fweisbec@gmail.com>
| | Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
| | Signed-off-by: Tejun Heo <tj@kernel.org>
* | workqueue: Create low-level unbound workqueues cpumask  Frederic Weisbecker  2015-04-27
| |
| | Create a cpumask that limits the affinity of all unbound workqueues. This cpumask is controlled through a file at the root of the workqueue sysfs directory.
| |
| | It works on a lower-level than the per WQ_SYSFS workqueues cpumask files such that the effective cpumask applied for a given unbound workqueue is the intersection of /sys/devices/virtual/workqueue/$WORKQUEUE/cpumask and the new /sys/devices/virtual/workqueue/cpumask file.
| |
| | This patch implements the basic infrastructure and the read interface. wq_unbound_cpumask is initially set to cpu_possible_mask.
| |
| | Cc: Christoph Lameter <cl@linux.com>
| | Cc: Kevin Hilman <khilman@linaro.org>
| | Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
| | Cc: Mike Galbraith <bitbucket@online.de>
| | Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
| | Cc: Tejun Heo <tj@kernel.org>
| | Cc: Viresh Kumar <viresh.kumar@linaro.org>
| | Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
| | Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
| | Signed-off-by: Tejun Heo <tj@kernel.org>
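The mask arithmetic in this entry and the one before it (intersection of the per-workqueue mask with the low-level mask, falling back to the low-level mask when the two do not overlap) can be sketched with plain bitmasks. This is a userspace illustration only: `cpumask_model` and `effective_cpumask()` are hypothetical names, and a single `unsigned long` stands in for the kernel's cpumask type.

```c
#include <assert.h>

/* One bit per CPU; a stand-in for the kernel cpumask type. */
typedef unsigned long cpumask_model;

/*
 * Effective mask for an unbound workqueue: the intersection of the
 * user-configured per-workqueue mask and the low-level mask. If the
 * two do not overlap, the low-level mask is used instead.
 */
static cpumask_model effective_cpumask(cpumask_model user,
                                       cpumask_model low_level)
{
    cpumask_model m = user & low_level;

    return m ? m : low_level;
}
```

For example, a workqueue mask of CPUs 0-3 (0x0F) combined with a low-level mask of CPUs 2-5 (0x3C) yields CPUs 2-3 (0x0C), while a workqueue mask of CPUs 0-1 (0x03) against a low-level mask of CPUs 4-5 (0x30) falls back to 0x30.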
* | workqueue: split apply_workqueue_attrs() into 3 stages  Lai Jiangshan  2015-04-27
|/
| Currently, apply_workqueue_attrs() includes both pwqs-allocation and pwqs-installation, so when we batch multiple apply_workqueue_attrs() calls as a transaction, we can't ensure that the transaction succeeds or fails as a complete unit.
|
| To solve this, we split apply_workqueue_attrs() into three stages.
| The first stage does the preparation: allocating memory and pwqs.
| The second stage does the attrs-installation and pwqs-installation.
| The third stage frees the allocated memory and (old or unused) pwqs.
|
| As a result, batching multiple apply_workqueue_attrs() calls can succeed or fail as a complete unit:
|   1) do the first stage for all the workqueues as a batch
|   2) commit all of them only when all of the above succeed.
|
| This patch is a preparation for the next patch ("Allow modifying low level unbound workqueue cpumask") which will do multiple apply_workqueue_attrs() calls.
|
| The patch doesn't change functionality except for two minor adjustments:
|   1) free_unbound_pwq() for the error path is removed; we use the heavier version put_pwq_unlocked() instead since the error path is rare. This adjustment simplifies the code.
|   2) the memory-allocation is also moved under wq_pool_mutex. This is needed to avoid further splitting.
|
| tj: minor updates to comments.
|
| Suggested-by: Tejun Heo <tj@kernel.org>
| Cc: Christoph Lameter <cl@linux.com>
| Cc: Kevin Hilman <khilman@linaro.org>
| Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
| Cc: Mike Galbraith <bitbucket@online.de>
| Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
| Cc: Tejun Heo <tj@kernel.org>
| Cc: Viresh Kumar <viresh.kumar@linaro.org>
| Cc: Frederic Weisbecker <fweisbec@gmail.com>
| Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
| Signed-off-by: Tejun Heo <tj@kernel.org>
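The prepare / commit / cleanup split described in this entry is a general all-or-nothing pattern, and a userspace sketch may make the control flow easier to follow. None of the names below come from the kernel: `struct wq_model`, `struct apply_ctx`, `ctx_prepare()`, `ctx_commit()`, `ctx_cleanup()` and `apply_batch()` are hypothetical, a single `int` stands in for the attrs and pwqs, and `fail_at` exists only to simulate an allocation failure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct wq_model {
    int *attrs;  /* currently installed attrs (may be NULL) */
};

struct apply_ctx {
    struct wq_model *wq;
    int *new_attrs;  /* before commit: new attrs; after commit: old attrs */
};

/* Stage 1: allocate everything; no visible state changes yet. */
static bool ctx_prepare(struct apply_ctx *ctx, struct wq_model *wq,
                        int value, bool fail_alloc)
{
    ctx->wq = wq;
    ctx->new_attrs = fail_alloc ? NULL : malloc(sizeof(int));
    if (!ctx->new_attrs)
        return false;
    *ctx->new_attrs = value;
    return true;
}

/* Stage 2: install by swapping pointers; this step cannot fail. */
static void ctx_commit(struct apply_ctx *ctx)
{
    int *old = ctx->wq->attrs;

    ctx->wq->attrs = ctx->new_attrs;
    ctx->new_attrs = old;
}

/* Stage 3: free what the ctx still holds (old attrs, or unused new ones). */
static void ctx_cleanup(struct apply_ctx *ctx)
{
    free(ctx->new_attrs);
    ctx->new_attrs = NULL;
}

/* Apply value to all n workqueues, or to none of them. */
static bool apply_batch(struct wq_model *wqs, int n, int value, int fail_at)
{
    struct apply_ctx *ctxs = calloc(n, sizeof(*ctxs));
    int i, prepared = 0;
    bool ok = true;

    if (!ctxs)
        return false;

    for (i = 0; i < n; i++) {
        if (!ctx_prepare(&ctxs[i], &wqs[i], value, i == fail_at)) {
            ok = false;
            break;
        }
        prepared++;
    }

    if (ok)
        for (i = 0; i < n; i++)
            ctx_commit(&ctxs[i]);

    for (i = 0; i < prepared; i++)
        ctx_cleanup(&ctxs[i]);

    free(ctxs);
    return ok;
}
```

Because nothing is installed until every prepare step has succeeded, a failure partway through the batch leaves all workqueues with their previous attrs, which is the property the commit message is after.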
* workqueue: Reorder sysfs code  Frederic Weisbecker  2015-04-06
|
| The sysfs code usually belongs at the bottom of the file since it deals with high level objects. In the workqueue code it's misplaced, such that we'd need to work around function references to allow the sysfs code to call APIs like apply_workqueue_attrs().
|
| Let's move that block further down in the file, almost to the bottom, and declare workqueue_sysfs_unregister() just before destroy_workqueue(), which references it.
|
| tj: Moved workqueue_sysfs_unregister() forward declaration where other forward declarations are.
|
| Suggested-by: Tejun Heo <tj@kernel.org>
| Cc: Christoph Lameter <cl@linux.com>
| Cc: Kevin Hilman <khilman@linaro.org>
| Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
| Cc: Mike Galbraith <bitbucket@online.de>
| Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
| Cc: Tejun Heo <tj@kernel.org>
| Cc: Viresh Kumar <viresh.kumar@linaro.org>
| Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
| Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
| Signed-off-by: Tejun Heo <tj@kernel.org>
* workqueue: dump workqueues on sysrq-t  Tejun Heo  2015-03-09
|
| Workqueues are used extensively throughout the kernel but sometimes it's difficult to debug stalls involving work items because visibility into its inner workings is fairly limited. Although sysrq-t task dump annotates each active worker task with the information on the work item being executed, it is challenging to find out which work items are pending or delayed on which queues and how pools are being managed.
|
| This patch implements show_workqueue_state() which dumps all busy workqueues and pools and is called from the sysrq-t handler. At the end of sysrq-t dump, something like the following is printed.
|
|   Showing busy workqueues and worker pools:
|   ...
|   workqueue filler_wq: flags=0x0
|     pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=2/256
|       in-flight: 491:filler_workfn, 507:filler_workfn
|     pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
|       in-flight: 501:filler_workfn
|       pending: filler_workfn
|   ...
|   workqueue test_wq: flags=0x8
|     pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/1
|       in-flight: 510(RESCUER):test_workfn BAR(69) BAR(500)
|       delayed: test_workfn1 BAR(492), test_workfn2
|   ...
|   pool 0: cpus=0 node=0 flags=0x0 nice=0 workers=2 manager: 137
|   pool 2: cpus=1 node=0 flags=0x0 nice=0 workers=3 manager: 469
|   pool 3: cpus=1 node=0 flags=0x0 nice=-20 workers=2 idle: 16
|   pool 8: cpus=0-3 flags=0x4 nice=0 workers=2 manager: 62
|
| The above shows that test_wq is executing test_workfn() on pid 510 which is the rescuer and also that there are two tasks 69 and 500 waiting for the work item to finish in flush_work(). As test_wq has max_active of 1, there are two work items for test_workfn1() and test_workfn2() which are delayed till the current work item is finished. In addition, pid 492 is flushing test_workfn1().
|
| The work item for test_workfn() is being executed on pwq of pool 2 which is the normal priority per-cpu pool for CPU 1. The pool has three workers, two of which are executing filler_workfn() for filler_wq and the last one is assuming the manager role trying to create more workers.
|
| This extra workqueue state dump will hopefully help chasing down hangs involving workqueues.
|
| v3: cpulist_pr_cont() replaced with "%*pbl" printf formatting.
|
| v2: As suggested by Andrew, minor formatting change in pr_cont_work(), printk()'s replaced with pr_info()'s, and cpumask printing now uses cpulist_pr_cont().
|
| Signed-off-by: Tejun Heo <tj@kernel.org>
| Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
| Cc: Linus Torvalds <torvalds@linux-foundation.org>
| Cc: Andrew Morton <akpm@linux-foundation.org>
| CC: Ingo Molnar <mingo@redhat.com>
* workqueue: keep track of the flushing task and pool manager  Tejun Heo  2015-03-09
|
| Add wq_barrier->task and worker_pool->manager to keep track of the flushing task and pool manager respectively. These are purely informational and will be used to implement sysrq dump of workqueues.
|
| Signed-off-by: Tejun Heo <tj@kernel.org>
* workqueue: make the workqueues list RCU walkable  Tejun Heo  2015-03-09
|
| The workqueues list is protected by wq_pool_mutex and a workqueue and its subordinate data structures are freed directly on destruction. We want to add the ability to dump workqueues from a sysrq callback, which requires walking all workqueues without grabbing wq_pool_mutex. This patch makes freeing of workqueues RCU protected and makes the workqueues list walkable while holding RCU read lock.
|
| Note that pool_workqueues and pools are already sched-RCU protected. For consistency, workqueues are also protected with sched-RCU.
|
| While at it, reverse the workqueues list so that a workqueue which is created earlier comes before. The order of the list isn't significant functionally but this makes the planned sysrq dump list system workqueues first.
|
| Signed-off-by: Tejun Heo <tj@kernel.org>
* workqueue: fix hang involving racing cancel[_delayed]_work_sync()'s for PREEMPT_NONE  Tejun Heo  2015-03-05
|
| cancel[_delayed]_work_sync() are implemented using __cancel_work_timer() which grabs the PENDING bit using try_to_grab_pending() and then flushes the work item with PENDING set to prevent the on-going execution of the work item from requeueing itself.
|
| try_to_grab_pending() can always grab PENDING bit without blocking except when someone else is doing the above flushing during cancelation. In that case, try_to_grab_pending() returns -ENOENT. In this case, __cancel_work_timer() currently invokes flush_work(). The assumption is that the completion of the work item is what the other canceling task would be waiting for too and thus waiting for the same condition and retrying should allow forward progress without excessive busy looping.
|
| Unfortunately, this doesn't work if preemption is disabled or the latter task has real time priority. Let's say task A just got woken up from flush_work() by the completion of the target work item. If, before task A starts executing, task B gets scheduled and invokes __cancel_work_timer() on the same work item, its try_to_grab_pending() will return -ENOENT as the work item is still being canceled by task A and flush_work() will also immediately return false as the work item is no longer executing. This puts task B in a busy loop possibly preventing task A from executing and clearing the canceling state on the work item leading to a hang.
|
|   task A                        task B                        worker
|
|                                                               executing work
|   __cancel_work_timer()
|     try_to_grab_pending()
|     set work CANCELING
|     flush_work()
|       block for work completion
|                                                               completion, wakes up A
|                                 __cancel_work_timer()
|                                 while (forever) {
|                                   try_to_grab_pending()
|                                     -ENOENT as work is being canceled
|                                   flush_work()
|                                     false as work is no longer executing
|                                 }
|
| This patch removes the possible hang by updating __cancel_work_timer() to explicitly wait for clearing of CANCELING rather than invoking flush_work() after try_to_grab_pending() fails with -ENOENT.
|
| Link: http://lkml.kernel.org/g/20150206171156.GA8942@axis.com
|
| v3: bit_waitqueue() can't be used for work items defined in vmalloc area. Switched to custom wake function which matches the target work item and exclusive wait and wakeup.
|
| v2: v1 used wake_up() on bit_waitqueue() which leads to NULL deref if the target bit waitqueue has wait_bit_queue's on it. Use DEFINE_WAIT_BIT() and __wake_up_bit() instead. Reported by Tomeu Vizoso.
|
| Signed-off-by: Tejun Heo <tj@kernel.org>
| Reported-by: Rabin Vincent <rabin.vincent@axis.com>
| Cc: Tomeu Vizoso <tomeu.vizoso@gmail.com>
| Cc: stable@vger.kernel.org
| Tested-by: Jesper Nilsson <jesper.nilsson@axis.com>
| Tested-by: Rabin Vincent <rabin.vincent@axis.com>
* workqueue: use %*pb[l] to format bitmaps including cpumasks and nodemasks (Tejun Heo, 2015-02-13)

    printk and friends can now format bitmaps using '%*pb[l]'. cpumask and nodemask also provide cpumask_pr_args() and nodemask_pr_args() respectively which can be used to generate the two printf arguments necessary to format the specified cpu/nodemask.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
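As a rough illustration of what a range-list rendering of a bitmap looks like, here is a minimal userspace sketch. It is not the kernel's implementation: format_bitmap_list() and bitmap_test() are hypothetical helpers written for this example, and the output style ("0-3,8") is only meant to approximate the list form that '%*pbl' is described as producing.

```c
#include <limits.h>
#include <stddef.h>
#include <stdio.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical helper: test bit n of a bitmap stored as unsigned long words. */
static int bitmap_test(const unsigned long *bits, int n)
{
	return (int)((bits[n / BITS_PER_WORD] >> (n % BITS_PER_WORD)) & 1UL);
}

/*
 * Hypothetical userspace stand-in for a range-list bitmap format.
 * Writes runs of set bits as "a-b" and single bits as "a", separated
 * by commas, and returns the number of characters written. Output is
 * cut short if buf is too small.
 */
static int format_bitmap_list(char *buf, size_t len,
			      const unsigned long *bits, int nbits)
{
	size_t pos = 0;
	int first = 1;
	int n = 0;

	if (len)
		buf[0] = '\0';

	while (n < nbits && pos < len) {
		int start, end, ret;

		if (!bitmap_test(bits, n)) {
			n++;
			continue;
		}

		/* Extend the run while the next bit is also set. */
		start = n;
		while (n + 1 < nbits && bitmap_test(bits, n + 1))
			n++;
		end = n++;

		if (start == end)
			ret = snprintf(buf + pos, len - pos, "%s%d",
				       first ? "" : ",", start);
		else
			ret = snprintf(buf + pos, len - pos, "%s%d-%d",
				       first ? "" : ",", start, end);

		if (ret < 0 || (size_t)ret >= len - pos)
			break;	/* error or truncated: stop here */

		pos += (size_t)ret;
		first = 0;
	}
	return (int)pos;
}
```

In kernel code the equivalent call would pass the two arguments generated by cpumask_pr_args() or nodemask_pr_args() to a single '%*pb' or '%*pbl' conversion instead of formatting by hand.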
* workqueue: fix subtle pool management issue which can stall whole worker_pool (Tejun Heo, 2015-01-16)

    A worker_pool's forward progress is guaranteed by the fact that the last idle worker assumes the manager role to create more workers and summon the rescuers if creating workers doesn't succeed in a timely manner before proceeding to execute work items.

    This manager role is implemented in manage_workers(), which indicates whether the worker may proceed to work item execution with its return value. This is necessary because multiple workers may contend for the manager role, and, if there already is a manager, others should proceed to work item execution.

    Unfortunately, the function also indicates that the worker may proceed to work item execution if need_to_create_worker() is false at the head of the function. need_to_create_worker() tests the following conditions.

        pending work items && !nr_running && !nr_idle

    The first and third conditions are protected by pool->lock and thus won't change while holding pool->lock; however, nr_running can change asynchronously as other workers block and resume and while it's likely to be zero, as someone woke this worker up in the first place, some other workers could have become runnable in between making it non-zero.

    If this happens, manage_workers() could return false even with zero nr_idle making the worker, the last idle one, proceed to execute work items. If then all workers of the pool end up blocking on a resource which can only be released by a work item which is pending on that pool, the whole pool can deadlock as there's no one to create more workers or summon the rescuers.

    This patch fixes the problem by removing the early exit condition from maybe_create_worker() and making manage_workers() return false iff there's already another manager, which ensures that the last worker doesn't start executing work items.

    We can leave the early exit condition alone and just ignore the return value but the only reason it was put there is because manage_workers() used to perform both creations and destructions of workers and thus the function may be invoked while the pool is trying to reduce the number of workers. Now that manage_workers() is called only when more workers are needed, the only case in which this early exit condition is triggered is a rare race condition, rendering it pointless.

    Tested with simulated workload and modified workqueue code which trigger the pool deadlock reliably without this patch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reported-by: Eric Sandeen <sandeen@sandeen.net>
    Link: http://lkml.kernel.org/g/54B019F4.8030009@sandeen.net
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
    Cc: stable@vger.kernel.org
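The race described above can be sketched with a small userspace model. This is a simplification under stated assumptions, not the kernel code: struct pool_model, manage_workers_old() and manage_workers_new() are hypothetical names, locking is omitted, and worker creation is reduced to incrementing nr_idle.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the worker_pool state discussed above. */
struct pool_model {
	int nr_pending;		/* pending work items (stable under pool->lock) */
	int nr_running;		/* can change asynchronously */
	int nr_idle;		/* idle workers (stable under pool->lock) */
	bool has_manager;	/* another worker already holds the manager role */
};

/* The condition quoted in the commit message. */
static bool need_to_create_worker(const struct pool_model *p)
{
	return p->nr_pending && !p->nr_running && !p->nr_idle;
}

/*
 * Model of the old behavior: a false return lets the caller go on to
 * execute work items, and the early exit returns false even when the
 * caller is the last idle worker.
 */
static bool manage_workers_old(struct pool_model *p)
{
	if (p->has_manager)
		return false;
	if (!need_to_create_worker(p))
		return false;	/* early exit: the problematic path */
	p->nr_idle++;		/* assumption: worker creation succeeded */
	return true;
}

/*
 * Model of the fixed behavior: false is returned only if another
 * manager exists; otherwise a worker is created and the caller rechecks.
 */
static bool manage_workers_new(struct pool_model *p)
{
	if (p->has_manager)
		return false;
	p->nr_idle++;		/* assumption: worker creation succeeded */
	return true;
}
```

With nr_pending = 1, nr_idle = 0 and nr_running having raced to 1, the old model returns false and leaves the pool with no idle worker, while the new model leaves one idle worker behind before the caller proceeds.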
* workqueue: allow rescuer thread to do more work. (NeilBrown, 2014-12-08)

    When there is serious memory pressure, all workers in a pool could be blocked, and a new thread cannot be created because it requires memory allocation.

    In this situation a WQ_MEM_RECLAIM workqueue will wake up the rescuer thread to do some work.

    The rescuer will only handle requests that are already on ->worklist. If max_requests is 1, that means it will handle a single request.

    The rescuer will be woken again in 100ms to handle another max_requests requests.

    I've seen a machine (running a 3.0 based "enterprise" kernel) with thousands of requests queued for xfslogd, which has a max_requests of 1, and is needed for retiring all 'xfs' write requests. When one of the worker pools gets into this state, it progresses extremely slowly and possibly never recovers (only waited an hour or two).

    With this patch we leave a pool_workqueue on mayday list until it is clearly no longer in need of assistance. This allows all requests to be handled in a timely fashion.

    We keep each pool_workqueue on the mayday list until need_to_create_worker() is false, and no work for this workqueue is found in the pool.

    I have tested this in combination with a (hackish) patch which forces all work items to be handled by the rescuer thread. In that context it significantly improves performance. A similar patch for a 3.0 kernel significantly improved performance on a heavy workload.

    Thanks to Jan Kara for some design ideas, and to Dongsu Park for some comments and testing.

    tj: Inverted the lock order between wq_mayday_lock and pool->lock with a preceding patch and simplified this patch. Added comment and updated changelog accordingly. Dongsu spotted missing get_pwq() in the simplified code.

    Cc: Dongsu Park <dongsu.park@profitbricks.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
    Signed-off-by: NeilBrown <neilb@suse.de>
    Signed-off-by: Tejun Heo <tj@kernel.org>
* workqueue: invert the order between pool->lock and wq_mayday_lock (Tejun Heo, 2014-12-08)

    Currently, pool->lock nests inside wq_mayday_lock. There's no inherent reason for this order. The only place where the two locks are held together is pool_mayday_timeout() and it just got decided that way.

    This nesting order turns out to complicate things with the planned rescuer_thread() update. Let's invert them. This doesn't cause any behavior differences.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
    Cc: NeilBrown <neilb@suse.de>
    Cc: Dongsu Park <dongsu.park@profitbricks.com>
* workqueue: cosmetic update in rescuer_thread() (Tejun Heo, 2014-12-04)

    rescuer_thread() caches &rescuer->scheduled in a local variable scheduled for convenience. There's one WARN_ON_ONCE() which was using &rescuer->scheduled directly. Replace it with the local variable.

    This patch causes no functional difference.

    Signed-off-by: Tejun Heo <tj@kernel.org>
* workqueue: Use cond_resched_rcu_qs macro (Joe Lawrence, 2014-10-06)

    Tidy up and use cond_resched_rcu_qs when calling cond_resched and reporting potential quiescent state to RCU. Splitting this change in this way allows easy backporting to -stable for kernel versions not having cond_resched_rcu_qs().

    Signed-off-by: Joe Lawrence <joe.lawrence@stratus.com>
    Acked-by: Tejun Heo <tj@kernel.org>
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
* workqueue: Add quiescent state between work items (Joe Lawrence, 2014-10-06)

    Similar to the stop_machine deadlock scenario on !PREEMPT kernels addressed in b22ce2785d97 "workqueue: cond_resched() after processing each work item", kworker threads requeueing back-to-back with zero jiffy delay can stall RCU. The cond_resched call introduced in that fix will yield only if there are other higher priority tasks to run, so force a quiescent RCU state between work items.

    Signed-off-by: Joe Lawrence <joe.lawrence@stratus.com>
    Link: https://lkml.kernel.org/r/20140926105227.01325697@jlaw-desktop.mno.stratus.com
    Link: https://lkml.kernel.org/r/20140929115445.40221d8e@jlaw-desktop.mno.stratus.com
    Fixes: b22ce2785d97 ("workqueue: cond_resched() after processing each work item")
    Cc: <stable@vger.kernel.org>
    Acked-by: Tejun Heo <tj@kernel.org>
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
* Merge branch 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu (Linus Torvalds, 2014-08-04)
|\
    Pull percpu updates from Tejun Heo:

     - Major reorganization of percpu header files which I think makes things a lot more readable and logical than before.

     - percpu-refcount is updated so that it requires explicit destruction and can be reinitialized if necessary. This was pulled into the block tree to replace the custom percpu refcnting implemented in blk-mq.

     - In the process, percpu and percpu-refcount got cleaned up a bit

    * 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (21 commits)
      percpu-refcount: implement percpu_ref_reinit() and percpu_ref_is_zero()
      percpu-refcount: require percpu_ref to be exited explicitly
      percpu-refcount: use unsigned long for pcpu_count pointer
      percpu-refcount: add helpers for ->percpu_count accesses
      percpu-refcount: one bit is enough for REF_STATUS
      percpu-refcount, aio: use percpu_ref_cancel_init() in ioctx_alloc()
      workqueue: stronger test in process_one_work()
      workqueue: clear POOL_DISASSOCIATED in rebind_workers()
      percpu: Use ALIGN macro instead of hand coding alignment calculation
      percpu: invoke __verify_pcpu_ptr() from the generic part of accessors and operations
      percpu: preffity percpu header files
      percpu: use raw_cpu_*() to define __this_cpu_*()
      percpu: reorder macros in percpu header files
      percpu: move {raw|this}_cpu_*() definitions to include/linux/percpu-defs.h
      percpu: move generic {raw|this}_cpu_*_N() definitions to include/asm-generic/percpu.h
      percpu: only allow sized arch overrides for {raw|this}_cpu_*() ops
      percpu: reorganize include/linux/percpu-defs.h
      percpu: move accessors from include/linux/percpu.h to percpu-defs.h
      percpu: include/asm-generic/percpu.h should contain only arch-overridable parts
      percpu: introduce arch_raw_cpu_ptr()
      ...
| * workqueue: stronger test in process_one_work() (Lai Jiangshan, 2014-06-19)

    After the recent changes, when POOL_DISASSOCIATED is cleared, the running worker's local CPU should be the same as pool->cpu without any exception even during cpu-hotplug. Update the sanity check in process_one_work() accordingly.

    This patch changes "(proposition_A && proposition_B && proposition_C)" to "(proposition_B && proposition_C)", so if the old compound proposition is true, the new one must be true too. Therefore this will not hide any possible bug which can be caught by the old test.

    tj: Minor updates to the description.

    CC: Jason J. Herne <jjherne@linux.vnet.ibm.com>
    CC: Sasha Levin <sasha.levin@oracle.com>
    Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>
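The implication claimed above can be checked exhaustively. The following is a small self-contained sketch, with old_test(), new_test() and new_test_covers_old() as hypothetical names for this illustration; the three booleans stand in for the abstract propositions A, B and C, not for any specific kernel expression.

```c
#include <stdbool.h>

/* Old compound test: fires only when all three propositions hold. */
static bool old_test(bool a, bool b, bool c)
{
	return a && b && c;
}

/* New test: proposition A is dropped, so it fires in a superset of cases. */
static bool new_test(bool b, bool c)
{
	return b && c;
}

/*
 * Walk all 8 truth assignments and confirm that whenever the old test
 * fires, the new test fires as well.
 */
static bool new_test_covers_old(void)
{
	for (int bits = 0; bits < 8; bits++) {
		bool a = bits & 1;
		bool b = bits & 2;
		bool c = bits & 4;

		if (old_test(a, b, c) && !new_test(b, c))
			return false;
	}
	return true;
}
```

The converse does not hold: with A false and B, C true, only the new test fires, which is what makes it the stronger check.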
| * workqueue: clear POOL_DISASSOCIATED in rebind_workers() (Lai Jiangshan, 2014-06-19)

    The commit a9ab775bcadf ("workqueue: directly restore CPU affinity of workers from CPU_ONLINE") moved the pool->lock into rebind_workers() without also moving "pool->flags &= ~POOL_DISASSOCIATED".

    There is nothing wrong with "pool->flags &= ~POOL_DISASSOCIATED" not being moved together, but there isn't any benefit either. We move it into rebind_workers() and achieve these benefits:

    1) Better readability. POOL_DISASSOCIATED is cleared in rebind_workers() as expected.

    2) When POOL_DISASSOCIATED is cleared, we can ensure that all the running workers of the pool are on the local CPU (pool->cpu).

    tj: Cosmetic updates to the code and description.

    Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>
* | Merge branch 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq (Linus Torvalds, 2014-08-04)
|\ \
    Pull workqueue updates from Tejun Heo:
     "Lai has been doing a lot of cleanups of workqueue and kthread_work. No significant behavior change. Just a lot of cleanups all over the place. Some are a bit invasive but overall nothing too dangerous"

    * 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
      kthread_work: remove the unused wait_queue_head
      kthread_work: wake up worker only when the worker is idle
      workqueue: use nr_node_ids instead of wq_numa_tbl_len
      workqueue: remove the misnamed out_unlock label in get_unbound_pool()
      workqueue: remove the stale comment in pwq_unbound_release_workfn()
      workqueue: move rescuer pool detachment to the end
      workqueue: unfold start_worker() into create_worker()
      workqueue: remove @wakeup from worker_set_flags()
      workqueue: remove an unneeded UNBOUND test before waking up the next worker
      workqueue: wake regular worker if need_more_worker() when rescuer leave the pool
      workqueue: alloc struct worker on its local node
      workqueue: reuse the already calculated pwq in try_to_grab_pending()
      workqueue: stronger test in process_one_work()
      workqueue: clear POOL_DISASSOCIATED in rebind_workers()
      workqueue: sanity check pool->cpu in wq_worker_sleeping()
      workqueue: clear leftover flags when detached
      workqueue: remove useless WARN_ON_ONCE()
      workqueue: use schedule_timeout_interruptible() instead of open code
      workqueue: remove the empty check in too_many_workers()
      workqueue: use "pool->cpu < 0" to stand for an unbound pool
| * | workqueue: use nr_node_ids instead of wq_numa_tbl_len (Lai Jiangshan, 2014-07-22)

    They are the same and nr_node_ids is provided by the memory subsystem.

    Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>
| * | workqueue: remove the misnamed out_unlock label in get_unbound_pool() (Lai Jiangshan, 2014-07-22)

    After the locking was moved up to the caller of get_unbound_pool(), the out_unlock label no longer needs to do any unlock operation and its name became misleading, so we just remove this label, and the only use site, "goto out_unlock", is replaced with "return pool".

    Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>
| * | workqueue: remove the stale comment in pwq_unbound_release_workfn() (Lai Jiangshan, 2014-07-22)

    In 75ccf5950f82 ("workqueue: prepare flush_workqueue() for dynamic creation and destruction of unbound pool_workqueues"), a comment about the synchronization for the pwq in pwq_unbound_release_workfn() was added. The comment claimed the flush_mutex wasn't strictly necessary, which was correct at that time because the pwq was protected by workqueue_lock.

    But it is incorrect now: since wq->flush_mutex was renamed to wq->mutex and workqueue_lock was removed, wq->mutex is strictly needed. The comment was not updated when the synchronization was changed.

    This patch removes the incorrect comment and doesn't add any new comment to explain why wq->mutex is needed here, which is obvious, and wq->pwqs_node has the "WQ" notation in its definition, which is a better comment.

    The old commit mentioned above also introduced a comment in link_pwq() about the synchronization. This comment is also removed in this patch since the whole link_pwq() is protected by wq->mutex.

    Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>
| * | workqueue: move rescuer pool detachment to the end (Lai Jiangshan, 2014-07-22)

    In 51697d393922 ("workqueue: use generic attach/detach routine for rescuers"), the rescuer detaches itself from the pool before put_pwq() so that put_unbound_pool() will not destroy the rescuer-attached pool.

    It is unnecessary. worker_detach_from_pool() can be used as the last statement accessing the pool just like the regular workers; put_unbound_pool() will wait for it to detach and then free the pool.

    So we move the worker_detach_from_pool() down, making it coincide with the regular workers.

    tj: Minor description update.

    Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>
| * | workqueue: unfold start_worker() into create_worker() (Lai Jiangshan, 2014-07-22)

    Simply unfold the code of start_worker() into create_worker() and remove the original start_worker() and create_and_start_worker().

    The only trade-off is the introduced overhead that the pool->lock is released and regrabbed after the new worker is started. The overhead is acceptable since the manager is a slow path.

    And because of this new locking behavior, the newly created worker may grab the lock earlier than the manager and go on to process work items. In this case, the recheck of need_to_create_worker() may be true as expected and the manager restarts, which is the correct behavior.

    tj: Minor updates to description and comments.

    Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>
| * | workqueue: remove @wakeup from worker_set_flags() (Lai Jiangshan, 2014-07-22)

    worker_set_flags() has only two callers, each specifying %true and %false for @wakeup. Let's push the wake up to the caller and remove @wakeup from worker_set_flags(). The caller can use the following instead if wakeup is necessary:

        worker_set_flags();
        if (need_more_worker(pool))
            wake_up_worker(pool);

    This makes the code simpler. This patch doesn't introduce behavior changes.

    tj: Updated description and comments.

    Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>
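The caller-side pattern quoted in the message can be sketched as a userspace model. Everything here is a simplification for illustration: the structs, the WORKER_MODEL_NOT_RUNNING flag, the nr_running bookkeeping and mark_not_running_and_wake() are assumptions of this sketch, and the function bodies are stand-ins rather than the kernel's implementations.

```c
#include <stdbool.h>

/* Hypothetical flag meaning "this worker no longer counts as running". */
enum { WORKER_MODEL_NOT_RUNNING = 1u << 0 };

struct pool_model {
	int nr_running;		/* workers currently counted as running */
	int nr_pending;		/* pending work items */
	int nr_wakeups;		/* how many times wake_up_worker() was called */
};

struct worker_model {
	unsigned int flags;
	struct pool_model *pool;
};

/* After the change: only updates flags and the count, never wakes anyone. */
static void worker_set_flags(struct worker_model *w, unsigned int flags)
{
	if ((flags & WORKER_MODEL_NOT_RUNNING) &&
	    !(w->flags & WORKER_MODEL_NOT_RUNNING))
		w->pool->nr_running--;
	w->flags |= flags;
}

/* Stand-in: work is pending and nobody is running to pick it up. */
static bool need_more_worker(const struct pool_model *p)
{
	return p->nr_pending > 0 && p->nr_running == 0;
}

/* Stand-in: just records that a wakeup was requested. */
static void wake_up_worker(struct pool_model *p)
{
	p->nr_wakeups++;
}

/* A caller that needs the wakeup now performs it explicitly. */
static void mark_not_running_and_wake(struct worker_model *w)
{
	worker_set_flags(w, WORKER_MODEL_NOT_RUNNING);
	if (need_more_worker(w->pool))
		wake_up_worker(w->pool);
}
```

A caller that does not need the wakeup simply calls worker_set_flags() alone, which is why the extra parameter becomes unnecessary.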