path: root/kernel

Commit message    Author    Age
* Merge 4.4.103 into android-4.4 (Greg Kroah-Hartman, 2017-11-30)

    Changes in 4.4.103:
      s390: fix transactional execution control register handling
      s390/runtime instrumention: fix possible memory corruption
      s390/disassembler: add missing end marker for e7 table
      s390/disassembler: increase show_code buffer size
      ipv6: only call ip6_route_dev_notify() once for NETDEV_UNREGISTER
      AF_VSOCK: Shrink the area influenced by prepare_to_wait
      vsock: use new wait API for vsock_stream_sendmsg()
      sched: Make resched_cpu() unconditional
      lib/mpi: call cond_resched() from mpi_powm() loop
      x86/decoder: Add new TEST instruction pattern
      ARM: 8722/1: mm: make STRICT_KERNEL_RWX effective for LPAE
      ARM: 8721/1: mm: dump: check hardware RO bit for LPAE
      MIPS: ralink: Fix MT7628 pinmux
      MIPS: ralink: Fix typo in mt7628 pinmux function
      ALSA: hda: Add Raven PCI ID
      dm bufio: fix integer overflow when limiting maximum cache size
      dm: fix race between dm_get_from_kobject() and __dm_destroy()
      MIPS: Fix an n32 core file generation regset support regression
      MIPS: BCM47XX: Fix LED inversion for WRT54GSv1
      autofs: don't fail mount for transient error
      nilfs2: fix race condition that causes file system corruption
      eCryptfs: use after free in ecryptfs_release_messaging()
      bcache: check ca->alloc_thread initialized before wake up it
      isofs: fix timestamps beyond 2027
      NFS: Fix typo in nomigration mount option
      nfs: Fix ugly referral attributes
      nfsd: deal with revoked delegations appropriately
      rtlwifi: rtl8192ee: Fix memory leak when loading firmware
      rtlwifi: fix uninitialized rtlhal->last_suspend_sec time
      ata: fixes kernel crash while tracing ata_eh_link_autopsy event
      ext4: fix interaction between i_size, fallocate, and delalloc after a crash
      ALSA: pcm: update tstamp only if audio_tstamp changed
      ALSA: usb-audio: Add sanity checks to FE parser
      ALSA: usb-audio: Fix potential out-of-bound access at parsing SU
      ALSA: usb-audio: Add sanity checks in v2 clock parsers
      ALSA: timer: Remove kernel warning at compat ioctl error paths
      ALSA: hda/realtek - Fix ALC700 family no sound issue
      fix a page leak in vhost_scsi_iov_to_sgl() error recovery
      fs/9p: Compare qid.path in v9fs_test_inode
      iscsi-target: Fix non-immediate TMR reference leak
      target: Fix QUEUE_FULL + SCSI task attribute handling
      KVM: nVMX: set IDTR and GDTR limits when loading L1 host state
      KVM: SVM: obey guest PAT
      SUNRPC: Fix tracepoint storage issues with svc_recv and svc_rqst_status
      clk: ti: dra7-atl-clock: Fix of_node reference counting
      clk: ti: dra7-atl-clock: fix child-node lookups
      libnvdimm, namespace: fix label initialization to use valid seq numbers
      libnvdimm, namespace: make 'resource' attribute only readable by root
      IB/srpt: Do not accept invalid initiator port names
      IB/srp: Avoid that a cable pull can trigger a kernel crash
      NFC: fix device-allocation error return
      i40e: Use smp_rmb rather than read_barrier_depends
      igb: Use smp_rmb rather than read_barrier_depends
      igbvf: Use smp_rmb rather than read_barrier_depends
      ixgbevf: Use smp_rmb rather than read_barrier_depends
      i40evf: Use smp_rmb rather than read_barrier_depends
      fm10k: Use smp_rmb rather than read_barrier_depends
      ixgbe: Fix skb list corruption on Power systems
      parisc: Fix validity check of pointer size argument in new CAS implementation
      powerpc/signal: Properly handle return value from uprobe_deny_signal()
      media: Don't do DMA on stack for firmware upload in the AS102 driver
      media: rc: check for integer overflow
      cx231xx-cards: fix NULL-deref on missing association descriptor
      media: v4l2-ctrl: Fix flags field on Control events
      sched/rt: Simplify the IPI based RT balancing logic
      fscrypt: lock mutex before checking for bounce page pool
      net/9p: Switch to wait_event_killable()
      PM / OPP: Add missing of_node_put(np)
      e1000e: Fix error path in link detection
      e1000e: Fix return value test
      e1000e: Separate signaling for link check/link up
      RDS: RDMA: return appropriate error on rdma map failures
      PCI: Apply _HPX settings only to relevant devices
      dmaengine: zx: set DMA_CYCLIC cap_mask bit
      net: Allow IP_MULTICAST_IF to set index to L3 slave
      net: 3com: typhoon: typhoon_init_one: make return values more specific
      net: 3com: typhoon: typhoon_init_one: fix incorrect return values
      drm/armada: Fix compile fail
      ath10k: fix incorrect txpower set by P2P_DEVICE interface
      ath10k: ignore configuring the incorrect board_id
      ath10k: fix potential memory leak in ath10k_wmi_tlv_op_pull_fw_stats()
      ath10k: set CTS protection VDEV param only if VDEV is up
      ALSA: hda - Apply ALC269_FIXUP_NO_SHUTUP on HDA_FIXUP_ACT_PROBE
      drm: Apply range restriction after color adjustment when allocation
      mac80211: Remove invalid flag operations in mesh TSF synchronization
      mac80211: Suppress NEW_PEER_CANDIDATE event if no room
      iio: light: fix improper return value
      staging: iio: cdc: fix improper return value
      spi: SPI_FSL_DSPI should depend on HAS_DMA
      netfilter: nft_queue: use raw_smp_processor_id()
      netfilter: nf_tables: fix oob access
      ASoC: rsnd: don't double free kctrl
      btrfs: return the actual error value from from btrfs_uuid_tree_iterate
      ASoC: wm_adsp: Don't overrun firmware file buffer when reading region data
      s390/kbuild: enable modversions for symbols exported from asm
      xen: xenbus driver must not accept invalid transaction ids
      Revert "sctp: do not peel off an assoc from one netns to another one"
      Linux 4.4.103

    Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
| * sched/rt: Simplify the IPI based RT balancing logic (Steven Rostedt (Red Hat), 2017-11-30)

    commit 4bdced5c9a2922521e325896a7bbbf0132c94e56 upstream.

    When a CPU lowers its priority (schedules out a high priority task for a
    lower priority one), a check is made to see if any other CPU has overloaded
    RT tasks (more than one). It checks the rto_mask to determine this and, if
    so, it will request to pull one of those tasks to itself if the non-running
    RT task is of higher priority than the new priority of the next task to run
    on the current CPU.

    When dealing with a large number of CPUs, the original pull logic suffered
    from large lock contention on a single CPU run queue, which caused a huge
    latency across all CPUs. This was caused by only one CPU having overloaded
    RT tasks while a bunch of other CPUs were lowering their priority. To solve
    this issue, commit:

      b6366f048e0c ("sched/rt: Use IPI to trigger RT task push migration instead of pulling")

    changed the way to request a pull. Instead of grabbing the lock of the
    overloaded CPU's runqueue, it simply sent an IPI to that CPU to do the work.

    Although the IPI logic worked very well in removing the large latency build
    up, it still could suffer from a large number of IPIs being sent to a
    single CPU. On an 80 CPU box, I measured over 200us of processing IPIs.
    Worse yet, when I tested this on a 120 CPU box, with a stress test that had
    lots of RT tasks scheduling on all CPUs, it actually triggered the hard
    lockup detector! One CPU had so many IPIs sent to it, and due to the
    restart mechanism that is triggered when the source run queue has a
    priority status change, the CPU spent minutes! processing the IPIs.

    Thinking about this further, I realized there's no reason for each run
    queue to send its own IPI. As all CPUs with overloaded tasks must be
    scanned regardless of whether there's one or many CPUs lowering their
    priority, and because there's no current way to find the CPU with the
    highest priority task that can schedule to one of these CPUs, there really
    only needs to be one IPI being sent around at a time.

    This greatly simplifies the code! The new approach is to have each root
    domain have its own irq work, as the rto_mask is per root domain. The root
    domain has the following fields attached to it:

      rto_push_work  - the irq work to process each CPU set in rto_mask
      rto_lock       - the lock to protect some of the other rto fields
      rto_loop_start - an atomic that keeps contention down on rto_lock;
                       the first CPU scheduling in a lower priority task
                       is the one to kick off the process
      rto_loop_next  - an atomic that gets incremented for each CPU that
                       schedules in a lower priority task
      rto_loop       - a variable protected by rto_lock that is used to
                       compare against rto_loop_next
      rto_cpu        - the CPU to send the next IPI to, also protected by
                       the rto_lock

    When a CPU schedules in a lower priority task and wants to make sure
    overloaded CPUs know about it, it increments rto_loop_next. Then it
    atomically sets rto_loop_start with a cmpxchg. If the old value is not "0",
    then it is done, as another CPU is kicking off the IPI loop. If the old
    value is "0", then it will take the rto_lock to synchronize with a possible
    IPI being sent around to the overloaded CPUs.

    If rto_cpu is greater than or equal to nr_cpu_ids, then there's either no
    IPI being sent around, or one is about to finish. Then rto_cpu is set to
    the first CPU in rto_mask and an IPI is sent to that CPU. If there are no
    CPUs set in rto_mask, then there's nothing to be done.

    When the CPU receives the IPI, it will first try to push any RT tasks that
    are queued on the CPU but can't run because a higher priority RT task is
    currently running on that CPU. Then it takes the rto_lock and looks for the
    next CPU in the rto_mask. If it finds one, it simply sends an IPI to that
    CPU and the process continues.

    If there are no more CPUs in the rto_mask, then rto_loop is compared with
    rto_loop_next. If they match, everything is done and the process is over.
    If they do not match, then a CPU scheduled in a lower priority task while
    the IPI was being passed around, and the process needs to start again. The
    first CPU in rto_mask is sent the IPI.

    This change removes the duplication of work in the IPI logic, and greatly
    lowers the latency caused by the IPIs. This removed the lockup happening on
    the 120 CPU machine. It also simplifies the code tremendously. What else
    could anyone ask for?

    Thanks to Peter Zijlstra for simplifying the rto_loop_start atomic logic
    and supplying me with the rto_start_trylock() and rto_start_unlock() helper
    functions.

    Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Clark Williams <williams@redhat.com>
    Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
    Cc: John Kacur <jkacur@redhat.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Mike Galbraith <efault@gmx.de>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Scott Wood <swood@redhat.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Link: http://lkml.kernel.org/r/20170424114732.1aac6dc4@gandalf.local.home
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
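    A minimal sketch of the "one IPI at a time" start-up path described in this
    message. The field and helper names (rto_loop_next, rto_loop_start,
    rto_lock, rto_cpu, rto_mask, rto_push_work, rto_start_trylock()/
    rto_start_unlock()) come from the commit text; the function body below is
    illustrative, not the exact 4.4 backport:

        /* Sketch only: start (or join) the per-root-domain IPI chain. */
        static inline bool rto_start_trylock(atomic_t *v)
        {
                return !atomic_cmpxchg(v, 0, 1);        /* first caller wins */
        }

        static inline void rto_start_unlock(atomic_t *v)
        {
                atomic_set(v, 0);       /* the real helper uses release ordering */
        }

        static void tell_cpu_to_push(struct rq *rq)
        {
                struct root_domain *rd = rq->rd;
                int cpu = -1;

                /* Every CPU that lowers its priority bumps the loop counter. */
                atomic_inc(&rd->rto_loop_next);

                /* Only the first of them kicks off the IPI loop. */
                if (!rto_start_trylock(&rd->rto_loop_start))
                        return;

                raw_spin_lock(&rd->rto_lock);
                /* rto_cpu >= nr_cpu_ids: no IPI in flight (or one is finishing). */
                if (rd->rto_cpu >= nr_cpu_ids && !cpumask_empty(rd->rto_mask)) {
                        cpu = cpumask_first(rd->rto_mask);
                        rd->rto_cpu = cpu;
                }
                raw_spin_unlock(&rd->rto_lock);

                rto_start_unlock(&rd->rto_loop_start);

                if (cpu >= 0)
                        irq_work_queue_on(&rd->rto_push_work, cpu);
        }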
| * sched: Make resched_cpu() unconditional (Paul E. McKenney, 2017-11-30)

    commit 7c2102e56a3f7d85b5d8f33efbd7aecc1f36fdd8 upstream.

    The current implementation of synchronize_sched_expedited() incorrectly
    assumes that resched_cpu() is unconditional, which it is not. This means
    that synchronize_sched_expedited() can hang when resched_cpu()'s trylock
    fails as follows (analysis by Neeraj Upadhyay):

    o   CPU1 is waiting for expedited wait to complete:

        sync_rcu_exp_select_cpus
          rdp->exp_dynticks_snap & 0x1   // returns 1 for CPU5
          IPI sent to CPU5

        synchronize_sched_expedited_wait
          ret = swait_event_timeout(rsp->expedited_wq,
                                    sync_rcu_preempt_exp_done(rnp_root),
                                    jiffies_stall);

        expmask = 0x20, CPU 5 in idle path (in cpuidle_enter())

    o   CPU5 handles IPI and fails to acquire rq lock.

        Handles IPI
          sync_sched_exp_handler
            resched_cpu
              returns while failing to try lock acquire rq->lock
          need_resched is not set

    o   CPU5 calls rcu_idle_enter() and as need_resched is not set, goes to
        idle (schedule() is not called).

    o   CPU 1 reports RCU stall.

    Given that resched_cpu() is now used only by RCU, this commit fixes the
    assumption by making resched_cpu() unconditional.

    Reported-by: Neeraj Upadhyay <neeraju@codeaurora.org>
    Suggested-by: Neeraj Upadhyay <neeraju@codeaurora.org>
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
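    A hedged sketch of what "unconditional" means here: the trylock-and-bail
    pattern in resched_cpu() is replaced by a plain lock acquisition
    (illustrative, not the verbatim diff):

        void resched_cpu(int cpu)
        {
                struct rq *rq = cpu_rq(cpu);
                unsigned long flags;

                /* Previously: if (!raw_spin_trylock_irqsave(&rq->lock, flags))
                 *                     return;
                 * which is the silent failure RCU tripped over. */
                raw_spin_lock_irqsave(&rq->lock, flags);
                resched_curr(rq);
                raw_spin_unlock_irqrestore(&rq->lock, flags);
        }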
| * bpf: don't let ldimm64 leak map addresses on unprivileged (Daniel Borkmann, 2017-11-21)

    commit 0d0e57697f162da4aa218b5feafe614fb666db07 upstream.

    The patch fixes two things at once:

    1) It checks the env->allow_ptr_leaks and only prints the map address to
       the log if we have the privileges to do so, otherwise it just dumps 0
       as we would when kptr_restrict is enabled on %pK. Given the latter is
       off by default and not every distro sets it, I don't want to rely on
       this, hence the 0 by default for unprivileged.

    2) Printing of ldimm64 in the verifier log is currently broken in that we
       don't print the full immediate, but only the 32 bit part of the first
       insn part for ldimm64. Thus, fix this up as well; it's okay to access,
       since we verified all ldimm64 earlier already (including just
       constants) through replace_map_fd_with_map_ptr().

    Fixes: 1be7f75d1668 ("bpf: enable non-root eBPF programs")
    Fixes: cbd357008604 ("bpf: verifier (add ability to receive verification log)")
    Reported-by: Jann Horn <jannh@google.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Alexei Starovoitov <ast@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    [bwh: Backported to 4.4: s/bpf_verifier_env/verifier_env/]
    Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
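    A rough sketch of the two fixes combined, as they could appear in the
    verifier's instruction printer. The names (insn, env->allow_ptr_leaks,
    verbose()) follow the 4.4-era verifier, but the snippet is illustrative
    rather than the exact patch:

        } else if (insn->code == (BPF_LD | BPF_IMM | BPF_DW)) {
                /* Fix 2: ldimm64 spans two insns; rebuild the full 64-bit
                 * immediate instead of printing only the low half. */
                u64 imm = ((u64)(insn + 1)->imm << 32) | (u32)insn->imm;

                /* Fix 1: only privileged loaders see the real map address. */
                verbose("(%02x) r%d = 0x%llx\n", insn->code, insn->dst_reg,
                        env->allow_ptr_leaks ? imm : 0);
        }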
* | BACKPORT: time: Clean up CLOCK_MONOTONIC_RAW time handling (John Stultz, 2017-11-29)

    (cherry picked from commit fc6eead7c1e2e5376c25d2795d4539fdacbc0648)

    Now that we have fixed the sub-ns handling for CLOCK_MONOTONIC_RAW, remove
    the duplicative tk->raw_time.tv_nsec, which can be stored in
    tk->tkr_raw.xtime_nsec (similarly to how it's handled for monotonic time).

    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Ingo Molnar <mingo@kernel.org>
    Cc: Miroslav Lichvar <mlichvar@redhat.com>
    Cc: Richard Cochran <richardcochran@gmail.com>
    Cc: Prarit Bhargava <prarit@redhat.com>
    Cc: Stephen Boyd <stephen.boyd@linaro.org>
    Cc: Kevin Brodsky <kevin.brodsky@arm.com>
    Cc: Will Deacon <will.deacon@arm.com>
    Cc: Daniel Mentz <danielmentz@google.com>
    Tested-by: Daniel Mentz <danielmentz@google.com>
    Signed-off-by: John Stultz <john.stultz@linaro.org>
    Bug: 20045882
    Bug: 63737556
    Change-Id: I243827d21b08703a09d2d2fe738a9258be224582
* | BACKPORT: time: Fix CLOCK_MONOTONIC_RAW sub-nanosecond accounting (John Stultz, 2017-11-29)

    (cherry picked from commit 3d88d56c5873f6eebe23e05c3da701960146b801)

    Due to how the MONOTONIC_RAW accumulation logic was handled, there is the
    potential for a 1ns discontinuity when we do accumulations. This small
    discontinuity has for the most part gone unnoticed, but since ARM64 enabled
    CLOCK_MONOTONIC_RAW in their vDSO clock_gettime implementation, we've seen
    failures with the inconsistency-check test in kselftest.

    This patch addresses the issue by using the same sub-ns accumulation
    handling that CLOCK_MONOTONIC uses, which avoids the issue for in-kernel
    users.

    Since the ARM64 vDSO implementation has its own clock_gettime calculation
    logic, this patch reduces the frequency of errors, but failures are still
    seen. The ARM64 vDSO will need to be updated to include the sub-nanosecond
    xtime_nsec values in its calculation for this issue to be completely fixed.

    Signed-off-by: John Stultz <john.stultz@linaro.org>
    Tested-by: Daniel Mentz <danielmentz@google.com>
    Cc: Prarit Bhargava <prarit@redhat.com>
    Cc: Kevin Brodsky <kevin.brodsky@arm.com>
    Cc: Richard Cochran <richardcochran@gmail.com>
    Cc: Stephen Boyd <stephen.boyd@linaro.org>
    Cc: Will Deacon <will.deacon@arm.com>
    Cc: "stable #4.8+" <stable@vger.kernel.org>
    Cc: Miroslav Lichvar <mlichvar@redhat.com>
    Link: http://lkml.kernel.org/r/1496965462-20003-3-git-send-email-john.stultz@linaro.org
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Bug: 20045882
    Bug: 63737556
    Change-Id: I6c55dd7685f6bd212c6af9d09c527528e1dd5fa1
* | Merge 4.4.98 into android-4.4 (Greg Kroah-Hartman, 2017-11-15)

    Changes in 4.4.98:
      adv7604: Initialize drive strength to default when using DT
      video: fbdev: pmag-ba-fb: Remove bad `__init' annotation
      PCI: mvebu: Handle changes to the bridge windows while enabled
      xen/netback: set default upper limit of tx/rx queues to 8
      drm: drm_minor_register(): Clean up debugfs on failure
      KVM: PPC: Book 3S: XICS: correct the real mode ICP rejecting counter
      iommu/arm-smmu-v3: Clear prior settings when updating STEs
      powerpc/corenet: explicitly disable the SDHC controller on kmcoge4
      ARM: omap2plus_defconfig: Fix probe errors on UARTs 5 and 6
      crypto: vmx - disable preemption to enable vsx in aes_ctr.c
      iio: trigger: free trigger resource correctly
      phy: increase size of MII_BUS_ID_SIZE and bus_id
      serial: sh-sci: Fix register offsets for the IRDA serial port
      usb: hcd: initialize hcd->flags to 0 when rm hcd
      netfilter: nft_meta: deal with PACKET_LOOPBACK in netdev family
      IPsec: do not ignore crypto err in ah4 input
      Input: mpr121 - handle multiple bits change of status register
      Input: mpr121 - set missing event capability
      IB/ipoib: Change list_del to list_del_init in the tx object
      s390/qeth: issue STARTLAN as first IPA command
      net: dsa: select NET_SWITCHDEV
      platform/x86: hp-wmi: Fix detection for dock and tablet mode
      cdc_ncm: Set NTB format again after altsetting switch for Huawei devices
      KEYS: trusted: sanitize all key material
      KEYS: trusted: fix writing past end of buffer in trusted_read()
      platform/x86: hp-wmi: Fix error value for hp_wmi_tablet_state
      platform/x86: hp-wmi: Do not shadow error values
      x86/uaccess, sched/preempt: Verify access_ok() context
      workqueue: Fix NULL pointer dereference
      crypto: x86/sha1-mb - fix panic due to unaligned access
      KEYS: fix NULL pointer dereference during ASN.1 parsing [ver #2]
      ARM: 8720/1: ensure dump_instr() checks addr_limit
      ALSA: seq: Fix OSS sysex delivery in OSS emulation
      ALSA: seq: Avoid invalid lockdep class warning
      MIPS: microMIPS: Fix incorrect mask in insn_table_MM
      MIPS: Fix CM region target definitions
      MIPS: SMP: Use a completion event to signal CPU up
      MIPS: Fix race on setting and getting cpu_online_mask
      MIPS: SMP: Fix deadlock & online race
      test: firmware_class: report errors properly on failure
      selftests: firmware: add empty string and async tests
      selftests: firmware: send expected errors to /dev/null
      tools: firmware: check for distro fallback udev cancel rule
      MIPS: AR7: Defer registration of GPIO
      MIPS: AR7: Ensure that serial ports are properly set up
      Input: elan_i2c - add ELAN060C to the ACPI table
      drm/vmwgfx: Fix Ubuntu 17.10 Wayland black screen issue
      rbd: use GFP_NOIO for parent stat and data requests
      can: sun4i: handle overrun in RX FIFO
      can: c_can: don't indicate triple sampling support for D_CAN
      x86/oprofile/ppro: Do not use __this_cpu*() in preemptible context
      PKCS#7: fix unitialized boolean 'want'
      Linux 4.4.98

    Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
| * workqueue: Fix NULL pointer dereference (Li Bin, 2017-11-15)

    commit cef572ad9bd7f85035ba8272e5352040e8be0152 upstream.

    When queue_work() is used in irq (not in task context), there is a
    potential case that triggers a NULL pointer dereference:

      worker_thread()
      |-spin_lock_irq()
      |-process_one_work()
      |-worker->current_pwq = pwq
      |-spin_unlock_irq()
      |-worker->current_func(work)
      |-spin_lock_irq()
      |-worker->current_pwq = NULL
      |-spin_unlock_irq()            //interrupt here
      |-irq_handler
      |-__queue_work()               //assuming that the wq is draining
      |-is_chained_work(wq)
      |-current_wq_worker()          //Here, 'current' is the interrupted worker!
      |-current->current_pwq is NULL here!
      |-schedule()

    Avoid it by checking for task context in current_wq_worker(); if we are not
    in task context, we shouldn't use 'current' to check the condition.

    Reported-by: Xiaofei Tan <tanxiaofei@huawei.com>
    Signed-off-by: Li Bin <huawei.libin@huawei.com>
    Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>
    Fixes: 8d03ecfe4718 ("workqueue: reimplement is_chained_work() using current_wq_worker()")
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
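    The fix described above amounts to refusing to treat an interrupted worker
    as "current" when not in task context. Roughly (illustrative sketch;
    in_task()/kthread_data() are as used by the mainline fix, the 4.4 backport
    may spell the context check differently):

        /* kernel/workqueue_internal.h (sketch) */
        static inline struct worker *current_wq_worker(void)
        {
                /* Only trust 'current' when we really are in task context. */
                if (in_task() && (current->flags & PF_WQ_WORKER))
                        return kthread_data(current);
                return NULL;
        }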
* | Revert "ANDROID: sched/rt: schedtune: Add boost retention to RT"Todd Kjos2017-11-08
| | | | | | | | | | | | | | | | This reverts commit d194ba5d712f051ff6c025f3484bb72f219764e3. Reason for revert: Broke some builds. Will fix and resubmit. Change-Id: I4e6fa1562346eda1bbf058f1d5ace5ba6256ce07
* | cpufreq: Drop schedfreq governor (Viresh Kumar, 2017-11-07)

    We all should be using (and improving) the schedutil governor now. Get rid
    of the non-upstream governor.

    Tested on Hikey.

    Change-Id: Ic660756536e5da51952738c3c18b94e31f58cd57
    Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
* | ANDROID: sched/rt: schedtune: Add boost retention to RT (Joel Fernandes, 2017-11-07)

    Boosted RT tasks can be deboosted quickly, which makes boost useless for RT
    tasks and causes lots of glitching. Use timers to prevent de-boosting too
    soon, and wait long enough that the next enqueue happens after a threshold.

    While this could be solved in the governor, this approach has the following
    advantages:

    - It is governor-independent
    - It reduces boost group lock contention for frequent sleepers/wakers
    - It works with schedfreq without any other schedfreq hacks

    Bug: 30210506
    Change-Id: I41788b235586988be446505deb7c0529758a9898
    Signed-off-by: Joel Fernandes <joelaf@google.com>
* | ANDROID: sched/rt: add schedtune accounting (Joel Fernandes, 2017-11-07)

    This patch adds schedtune enqueue/dequeue accounting to the RT scheduling
    class.

    Change-Id: If416e64319d62191f3aedd675d3e9a21fe2102fb
    Signed-off-by: Joel Fernandes <joelaf@google.com>
* | sched: EAS: Fix the calculation of group util in group_idle_state() (Ke Wang, 2017-11-02)

    util_delta is no longer zero in eenv_before, which affects the calculation
    of grp_util in group_idle_state(). Fix it under the new condition.

    Change-Id: Ic3853bb45876a8e388afcbe4e72d25fc42b1d7b0
    Signed-off-by: Ke Wang <ke.wang@spreadtrum.com>
* | sched: EAS: update trg_cpu to backup_cpu if no energy saving for target_cpu (Ke Wang, 2017-11-02)

    If there is no energy saving for target_cpu in the energy_diff()
    calculation, backup_cpu will be set as the new dst_cpu for the next
    calculation. At this point we also need to update trg_cpu to backup_cpu, to
    make sure the subsequent energy_diff() calculation is correct.

    Signed-off-by: Ke Wang <ke.wang@spreadtrum.com>
* | sched: EAS: Fix the condition to distinguish energy before/after (Ke Wang, 2017-11-02)

    Before commit 5f8b3a757d65 ("sched/fair: consider task utilization in
    group_norm_util()"), eenv->util_delta was used to distinguish energy before
    and energy after in sched_group_energy(). After that commit,
    eenv->util_delta can no longer do that. In this commit, use trg_cpu to
    distinguish energy before/after in sched_group_energy().

    Before applying this commit, cap_before/cap_delta are not correct:

      <idle>-0 [001] 147504.608920: sched_energy_diff: pid=7 comm=rcu_preempt
        src_cpu=1 dst_cpu=3 usage_delta=7 nrg_before=250 nrg_after=250
        nrg_diff=0 cap_before=0 cap_after=528 cap_delta=1056 nrg_delta=0
        nrg_payoff=0

    After applying this commit, cap_before/cap_delta return to normal:

      <idle>-0 [001] 220.494011: sched_energy_diff: pid=7 comm=rcu_preempt
        src_cpu=1 dst_cpu=2 usage_delta=3 nrg_before=248 nrg_after=248
        nrg_diff=0 cap_before=528 cap_after=528 cap_delta=0 nrg_delta=0
        nrg_payoff=0

    Signed-off-by: Ke Wang <ke.wang@spreadtrum.com>
* | Merge 4.4.96 into android-4.4 (Greg Kroah-Hartman, 2017-11-02)

    Changes in 4.4.96:
      workqueue: replace pool->manager_arb mutex with a flag
      ALSA: hda/realtek - Add support for ALC236/ALC3204
      ALSA: hda - fix headset mic problem for Dell machines with alc236
      ceph: unlock dangling spinlock in try_flush_caps()
      usb: xhci: Handle error condition in xhci_stop_device()
      spi: uapi: spidev: add missing ioctl header
      fuse: fix READDIRPLUS skipping an entry
      xen/gntdev: avoid out of bounds access in case of partial gntdev_mmap()
      Input: elan_i2c - add ELAN0611 to the ACPI table
      Input: gtco - fix potential out-of-bound access
      assoc_array: Fix a buggy node-splitting case
      scsi: zfcp: fix erp_action use-before-initialize in REC action trace
      scsi: sg: Re-fix off by one in sg_fill_request_table()
      can: sun4i: fix loopback mode
      can: kvaser_usb: Correct return value in printout
      can: kvaser_usb: Ignore CMD_FLUSH_QUEUE_REPLY messages
      regulator: fan53555: fix I2C device ids
      x86/microcode/intel: Disable late loading on model 79
      ecryptfs: fix dereference of NULL user_key_payload
      Revert "drm: bridge: add DT bindings for TI ths8135"
      Linux 4.4.96

    Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
| * workqueue: replace pool->manager_arb mutex with a flag (Tejun Heo, 2017-11-02)

    commit 692b48258dda7c302e777d7d5f4217244478f1f6 upstream.

    Josef reported a HARDIRQ-safe -> HARDIRQ-unsafe lock order detected by
    lockdep:

      [ 1270.472259] WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected
      [ 1270.472783] 4.14.0-rc1-xfstests-12888-g76833e8 #110 Not tainted
      [ 1270.473240] -----------------------------------------------------
      [ 1270.473710] kworker/u5:2/5157 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
      [ 1270.474239]  (&(&lock->wait_lock)->rlock){+.+.}, at: [<ffffffff8da253d2>] __mutex_unlock_slowpath+0xa2/0x280
      [ 1270.474994]
      [ 1270.474994] and this task is already holding:
      [ 1270.475440]  (&pool->lock/1){-.-.}, at: [<ffffffff8d2992f6>] worker_thread+0x366/0x3c0
      [ 1270.476046] which would create a new lock dependency:
      [ 1270.476436]  (&pool->lock/1){-.-.} -> (&(&lock->wait_lock)->rlock){+.+.}
      [ 1270.476949]
      [ 1270.476949] but this new dependency connects a HARDIRQ-irq-safe lock:
      [ 1270.477553]  (&pool->lock/1){-.-.}
      ...
      [ 1270.488900] to a HARDIRQ-irq-unsafe lock:
      [ 1270.489327]  (&(&lock->wait_lock)->rlock){+.+.}
      ...
      [ 1270.494735]  Possible interrupt unsafe locking scenario:
      [ 1270.494735]
      [ 1270.495250]        CPU0                    CPU1
      [ 1270.495600]        ----                    ----
      [ 1270.495947]   lock(&(&lock->wait_lock)->rlock);
      [ 1270.496295]                                local_irq_disable();
      [ 1270.496753]                                lock(&pool->lock/1);
      [ 1270.497205]                                lock(&(&lock->wait_lock)->rlock);
      [ 1270.497744]   <Interrupt>
      [ 1270.497948]     lock(&pool->lock/1);

    which will cause an irq inversion deadlock if the above lock scenario
    happens.

    The root cause of this safe -> unsafe lock order is the
    mutex_unlock(pool->manager_arb) in manage_workers() with pool->lock held.
    Unlocking a mutex while holding an irq spinlock was never safe, and this
    problem has been around forever, but it never got noticed because the mutex
    is usually only trylocked while holding the irq lock, making actual
    failures very unlikely, and the lockdep annotation missed the condition
    until the recent b9c16a0e1f73 ("locking/mutex: Fix lockdep_assert_held()
    fail").

    Using a mutex for pool->manager_arb has always been a bit of a stretch. It
    primarily is a mechanism to arbitrate managership between workers, which
    can easily be done with a pool flag. The only reason it became a mutex is
    that the pool destruction path wants to exclude parallel managing
    operations.

    This patch replaces the mutex with a new pool flag POOL_MANAGER_ACTIVE and
    makes the destruction path wait for the current manager on a wait queue.

    v2: Drop unnecessary flag clearing before pool destruction as suggested by
        Boqun.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reported-by: Josef Bacik <josef@toxicpanda.com>
    Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Boqun Feng <boqun.feng@gmail.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
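    A minimal sketch of the flag-based arbitration described above. The names
    POOL_MANAGER_ACTIVE and the destruction-path wait queue come from the
    commit; the body is illustrative and omits the put_unbound_pool() side of
    the change:

        static bool manage_workers(struct worker *worker)
        {
                struct worker_pool *pool = worker->pool;

                /* Arbitration is now just a pool flag checked under pool->lock. */
                if (pool->flags & POOL_MANAGER_ACTIVE)
                        return false;

                pool->flags |= POOL_MANAGER_ACTIVE;
                pool->manager = worker;

                maybe_create_worker(pool);

                pool->manager = NULL;
                pool->flags &= ~POOL_MANAGER_ACTIVE;
                /* Let a destruction path waiting in put_unbound_pool() proceed. */
                wake_up(&wq_manager_wait);
                return true;
        }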
* | sched: EAS: upmigrate misfit current task (Joonwoo Park, 2017-11-01)

    Upmigrate a misfit current task upon scheduler tick with the stopper.

    We can kick a random (not necessarily big) NOHZ idle CPU when a CPU-bound
    task is in need of upmigration. But that is not efficient, as it needs the
    following unnecessary wakeups:

    1. Busy little CPU A kicks idle CPU B.
    2. B runs the idle balancer and enqueues migration/A.
    3. B goes idle.
    4. A runs migration/A, which enqueues the busy task on B.
    5. B wakes up again.

    This change makes active upmigration more efficient by doing:

    1. Busy little CPU A finds target CPU B upon tick.
    2. CPU A enqueues migration/A.

    Change-Id: Ie865738054ea3296f28e6ba01710635efa7193c0
    [joonwoop: The original version had logic to reserve the CPU. That logic is
     omitted in this version.]
    Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
    Signed-off-by: Vikram Mulukutla <markivx@codeaurora.org>
* | sched: avoid pushing tasks to an offline CPU (Prasad Sodagudi, 2017-11-01)

    Currently active_load_balance_cpu_stop() is run by the cpu stopper and
    pushes running tasks off the busiest CPU onto an idle target CPU. But there
    is no check of whether the target cpu is offline before pushing the tasks.
    With the introduction of active migration in the scheduler tick path (see
    check_for_migration()) there have been instances of attempts to migrate
    tasks to offline CPUs.

    Add a check of whether the target cpu is online, to prevent scheduling onto
    offline CPUs.

    Change-Id: Ib8ac7f8aeabd3ca7365f3eae977075952dab4f21
    Signed-off-by: Prasad Sodagudi <psodagud@codeaurora.org>
    [rameezmustafa@codeaurora.org: Port to msm-3.18]
    Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
    Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
    Signed-off-by: Vikram Mulukutla <markivx@codeaurora.org>
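    The guard itself is a small early-exit near the top of the stopper
    callback; roughly (illustrative placement inside the existing
    active_load_balance_cpu_stop() in kernel/sched/fair.c):

        static int active_load_balance_cpu_stop(void *data)
        {
                struct rq *busiest_rq = data;
                int target_cpu = busiest_rq->push_cpu;

                raw_spin_lock_irq(&busiest_rq->lock);

                /* Bail out if the destination CPU went offline under us. */
                if (!cpu_online(target_cpu))
                        goto out_unlock;

                /* ... existing migration logic unchanged ... */

        out_unlock:
                busiest_rq->active_balance = 0;
                raw_spin_unlock_irq(&busiest_rq->lock);
                return 0;
        }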
* | sched: Extend active balance to accept 'push_task' argument (Srivatsa Vaddagiri, 2017-11-01)

    Active balance currently picks one task to migrate from the busy cpu to a
    chosen cpu (push_cpu). This patch extends active load balance to recognize
    a particular task ('push_task') that needs to be migrated to 'push_cpu'.
    This capability will be leveraged by HMP-aware task placement in a
    subsequent patch.

    Change-Id: If31320111e6cc7044e617b5c3fd6d8e0c0e16952
    Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
    [rameezmustafa@codeaurora.org: Port to msm-3.18]
    Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
* | Revert "sched/core: Warn if ENERGY_AWARE is enabled but data is missing"Joel Fernandes2017-10-30
| | | | | | | | | | | | This reverts commit a21299785a502ca4b3592a0f977aa1202b105260. Change-Id: Idb707c80788d8b6d26d400a59d9b14f854cce89f
* | Revert "sched/core: fix have_sched_energy_data build warning"Joel Fernandes2017-10-30
| | | | | | | | | | | | | | | | This reverts commit a899b9085c8d5d581b214c24dc707466e8cb479f. Reverting for now due to suspend warning issues. Change-Id: I9387297b52552a73d5a253b9e8e4467b366479a5
* | Merge 4.4.95 into android-4.4 (Greg Kroah-Hartman, 2017-10-30)

    Changes in 4.4.95:
      USB: devio: Revert "USB: devio: Don't corrupt user memory"
      USB: core: fix out-of-bounds access bug in usb_get_bos_descriptor()
      USB: serial: metro-usb: add MS7820 device id
      usb: cdc_acm: Add quirk for Elatec TWN3
      usb: quirks: add quirk for WORLDE MINI MIDI keyboard
      usb: hub: Allow reset retry for USB2 devices on connect bounce
      ALSA: usb-audio: Add native DSD support for Pro-Ject Pre Box S2 Digital
      can: gs_usb: fix busy loop if no more TX context is available
      usb: musb: sunxi: Explicitly release USB PHY on exit
      usb: musb: Check for host-mode using is_host_active() on reset interrupt
      can: esd_usb2: Fix can_dlc value for received RTR, frames
      drm/nouveau/bsp/g92: disable by default
      drm/nouveau/mmu: flush tlbs before deleting page tables
      ALSA: seq: Enable 'use' locking in all configurations
      ALSA: hda: Remove superfluous '-' added by printk conversion
      i2c: ismt: Separate I2C block read from SMBus block read
      brcmsmac: make some local variables 'static const' to reduce stack size
      bus: mbus: fix window size calculation for 4GB windows
      clockevents/drivers/cs5535: Improve resilience to spurious interrupts
      rtlwifi: rtl8821ae: Fix connection lost problem
      KEYS: encrypted: fix dereference of NULL user_key_payload
      lib/digsig: fix dereference of NULL user_key_payload
      KEYS: don't let add_key() update an uninstantiated key
      pkcs7: Prevent NULL pointer dereference, since sinfo is not always set.
      parisc: Avoid trashing sr2 and sr3 in LWS code
      parisc: Fix double-word compare and exchange in LWS code on 32-bit kernels
      sched/autogroup: Fix autogroup_move_group() to never skip sched_move_task()
      f2fs crypto: replace some BUG_ON()'s with error checks
      f2fs crypto: add missing locking for keyring_key access
      fscrypt: fix dereference of NULL user_key_payload
      KEYS: Fix race between updating and finding a negative key
      fscrypto: require write access to mount to set encryption policy
      FS-Cache: fix dereference of NULL user_key_payload
      Linux 4.4.95

    Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
| * sched/autogroup: Fix autogroup_move_group() to never skip sched_move_task() (Oleg Nesterov, 2017-10-27)

    commit 18f649ef344127ef6de23a5a4272dbe2fdb73dde upstream.

    The PF_EXITING check in task_wants_autogroup() is no longer needed. Remove
    it, but see the next patch.

    However the comment is correct in that autogroup_move_group() must always
    change task_group() for every thread, so the sysctl_ check is very wrong;
    we can race with cgroups, and even sys_setsid() is not safe because a task
    running with task_group() == ag->tg must participate in refcounting:

        int main(void)
        {
                int sctl = open("/proc/sys/kernel/sched_autogroup_enabled", O_WRONLY);

                assert(sctl > 0);
                if (fork()) {
                        wait(NULL); // destroy the child's ag/tg
                        pause();
                }

                assert(pwrite(sctl, "1\n", 2, 0) == 2);
                assert(setsid() > 0);
                if (fork())
                        pause();

                kill(getppid(), SIGKILL);
                sleep(1);

                // The child has gone, the grandchild runs with kref == 1
                assert(pwrite(sctl, "0\n", 2, 0) == 2);
                assert(setsid() > 0);

                // runs with the freed ag/tg
                for (;;)
                        sleep(1);

                return 0;
        }

    crashes the kernel. It doesn't really need sleep(1); it doesn't matter if
    autogroup_move_group() actually frees the task_group or this happens later.

    Reported-by: Vern Lovejoy <vlovejoy@redhat.com>
    Signed-off-by: Oleg Nesterov <oleg@redhat.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Mike Galbraith <efault@gmx.de>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: hartsjc@redhat.com
    Cc: vbendel@redhat.com
    Link: http://lkml.kernel.org/r/20161114184609.GA15965@redhat.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Signed-off-by: Sumit Semwal <sumit.semwal@linaro.org>
    [sumits: submit to 4.4 LTS, post testing on Hikey]
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* | sched/core: fix have_sched_energy_data build warning (Joel Fernandes, 2017-10-27)

    have_sched_energy_data is defined only for CONFIG_SMP, so declare it only
    with CONFIG_SMP.

    Fixes warning from the Intel build bot:

      tree:     https://android.googlesource.com/kernel/msm android-4.4
      head:     a21299785a502ca4b3592a0f977aa1202b105260
      commit:   a21299785a502ca4b3592a0f977aa1202b105260 [5/5] sched/core:
                Warn if ENERGY_AWARE is enabled but data is missing
      config:   i386-randconfig-x002-201743 (attached as .config)
      compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
      reproduce:
                git checkout a21299785a502ca4b3592a0f977aa1202b105260
                # save the attached .config to linux build tree
                make ARCH=i386

      All warnings (new ones prefixed by >>):

      >> kernel//sched/core.c:94:13: warning: 'have_sched_energy_data' used but never defined
          static bool have_sched_energy_data(void);
                      ^~~~~~~~~~~~~~~~~~~~~~

      vim +/have_sched_energy_data +94 kernel//sched/core.c

          93
        > 94  static bool have_sched_energy_data(void);
          95

    Change-Id: I266b63ece6fb31d2b5b11821a8244e147ba6d3a4
    Signed-off-by: Joel Fernandes <joelaf@google.com>
* | sched/core: Warn if ENERGY_AWARE is enabled but data is missing (Brendan Jackman, 2017-10-27)

    If the EAS energy model is missing or incomplete, i.e. sd_scs is NULL, then
    sched_group_energy() will return -EINVAL on the assumption that it raced
    with a CPU hotplug event. In that case, energy_diff() will return 0 and the
    energy-aware wake path will silently fail to trigger any migrations.

    This case can be triggered by disabling CONFIG_SCHED_MC on existing
    platforms, so that there are no sched_groups with the SD_SHARE_CAP_STATES
    flag and sd_scs is therefore NULL.

    Add checks so that a warning is printed if EAS is ever enabled while the
    necessary data is not present.

    Change-Id: Id233a510b5ad8b7fcecac0b1d789e730bbfc7c4a
    Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
* | sched: walt: Correct WALT window size initialization (Vikram Mulukutla, 2017-10-27)

    It is preferable that WALT window rollover occurs just before a tick, since
    the tick is an opportune moment to record a complete window's statistics,
    as well as to report those stats to the cpu frequency governor. When
    CONFIG_HZ results in a TICK_NSEC that isn't an integral number, this
    requirement may be violated. Account for this by reducing the WALT window
    size to the nearest multiple of TICK_NSEC.

    Commit d368c6faa19b ("sched: walt: fix window misalignment when HZ=300")
    attempted to do this, but WALT isn't using MIN_SCHED_RAVG_WINDOW as the
    window size, so that patch was doing nothing.

    Also, change the type of 'walt_disabled' to bool and warn if an invalid
    window size causes WALT to be disabled.

    Change-Id: Ie3dcfc21a3df4408254ca1165a355bbe391ed5c7
    Signed-off-by: Vikram Mulukutla <markivx@codeaurora.org>
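    The rounding itself is a one-liner; a sketch of the init-time check
    (walt_ravg_window, walt_disabled and MIN_SCHED_RAVG_WINDOW are the symbols
    named in this series, the function name and exact placement in walt.c are
    illustrative):

        /* Round the window down to a whole number of ticks so that rollover
         * lands just before a tick; refuse to run WALT otherwise. */
        static void __init walt_init_window_size(void)
        {
                walt_ravg_window = (walt_ravg_window / TICK_NSEC) * TICK_NSEC;

                if (walt_ravg_window < MIN_SCHED_RAVG_WINDOW) {
                        walt_disabled = true;
                        WARN(1, "WALT disabled: invalid window size %u\n",
                             walt_ravg_window);
                }
        }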
* | FROMLIST: sched/fair: Use wake_q length as a hint for wake_wide (Brendan Jackman, 2017-10-27)

    (from https://patchwork.kernel.org/patch/9895261/)

    This patch adds a parameter to select_task_rq, sibling_count_hint, allowing
    the caller, where it has this information, to inform the sched_class of the
    number of tasks that are being woken up as part of the same event.

    The wake_q mechanism is one case where this information is available.

    select_task_rq_fair can then use the information to detect that it needs to
    widen the search space for task placement in order to avoid overloading the
    last-level cache domain's CPUs.

    * * *

    The reason I am investigating this change is the following use case on ARM
    big.LITTLE (asymmetrical CPU capacity): 1 task per CPU, which all
    repeatedly do X amount of work then pthread_barrier_wait (i.e. sleep until
    the last task finishes its X and hits the barrier). On big.LITTLE, the
    tasks which get a "big" CPU finish faster, and then those CPUs pull over
    the tasks that are still running:

      v CPU v                 ->time->

                  -------------
      0 (big)      11111  /333
                  -------------
      1 (big)      22222 /444|
                  -------------
      2 (LITTLE)   333333/
                  -------------
      3 (LITTLE)   444444/
                  -------------

    Now when task 4 hits the barrier (at |) and wakes the others up, there are
    4 tasks with prev_cpu=<big> and 0 tasks with prev_cpu=<little>. want_affine
    therefore means that we'll only look in CPUs 0 and 1 (sd_llc), so tasks
    will be unnecessarily coscheduled on the bigs until the next load balance,
    something like this:

      v CPU v                 ->time->

                  ------------------------
      0 (big)      11111  /333  31313\33333
                  ------------------------
      1 (big)      22222 /444|424\4444444
                  ------------------------
      2 (LITTLE)   333333/      \222222
                  ------------------------
      3 (LITTLE)   444444/          \1111
                  ------------------------
                                 ^^^ underutilization

    So, I'm trying to get want_affine = 0 for these tasks.

    I don't _think_ any incarnation of the wakee_flips mechanism can help us
    here because which task is waker and which tasks are wakees generally
    changes with each iteration. However pthread_barrier_wait (or more
    accurately FUTEX_WAKE) has the nice property that we know exactly how many
    tasks are being woken, so we can cheat.

    It might be a disadvantage that we "widen" _every_ task that's woken in an
    event, while select_idle_sibling would work fine for the first
    sd_llc_size - 1 tasks.

    IIUC, if wake_affine() behaves correctly this trick wouldn't be necessary
    on SMP systems, so it might be best guarded by the presence of
    SD_ASYM_CPUCAPACITY?

    * * *

    Final note: in order to observe "perfect" behaviour for this use case, I
    also had to disable the TTWU_QUEUE sched feature. Suppose during the wakeup
    above we are working through the work queue and have placed tasks 3 and 2,
    and are about to place task 1:

      v CPU v                 ->time->

                  --------------
      0 (big)      11111  /333 3
                  --------------
      1 (big)      22222 /444|4
                  --------------
      2 (LITTLE)   333333/      2
                  --------------
      3 (LITTLE)   444444/          <- Task 1 should go here
                  --------------

    If TTWU_QUEUE is enabled, we will not yet have enqueued task 2 (having
    instead sent a reschedule IPI) or attached its load to CPU 2. So we are
    likely to also place task 1 on cpu 2.

    Disabling TTWU_QUEUE means that we enqueue task 2 before placing task 1,
    solving this issue. TTWU_QUEUE is there to minimise rq lock contention, and
    I guess that this contention is less of an issue on big.LITTLE systems
    since they have relatively few CPUs, which suggests the trade-off makes
    sense here.

    Change-Id: I2080302839a263e0841a89efea8589ea53bbda9c
    Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
    Signed-off-by: Chris Redpath <chris.redpath@arm.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Josef Bacik <josef@toxicpanda.com>
    Cc: Joel Fernandes <joelaf@google.com>
    Cc: Mike Galbraith <efault@gmx.de>
    Cc: Matt Fleming <matt@codeblueprint.co.uk>
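    A hedged sketch of how the hint could plumb into wake_wide(); the
    sibling_count_hint name comes from the patch description, the threshold
    check is illustrative and the rest mirrors the existing 4.4 heuristic:

        static int wake_wide(struct task_struct *p, int sibling_count_hint)
        {
                unsigned int master = current->wakee_flips;
                unsigned int slave = p->wakee_flips;
                int llc_size = this_cpu_read(sd_llc_size);

                /* A mass wakeup (e.g. wake_q/FUTEX_WAKE) larger than the LLC
                 * domain is treated as "wide" regardless of wakee_flips. */
                if (sibling_count_hint >= llc_size)
                        return 1;

                if (master < slave)
                        swap(master, slave);
                if (slave < llc_size || master < slave * llc_size)
                        return 0;
                return 1;
        }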
* | sched: WALT: account cumulative window demand (Joonwoo Park, 2017-10-27)

    Energy cost estimation has been a long-standing challenge for WALT, because
    WALT guides CPU frequency based on the CPU utilization of the previous
    window. Consequently it is not possible to know a newly waking-up task's
    energy cost until the end of WALT's current window.

    WALT already tracks 'Previous Runnable Sum' (prev_runnable_sum) and
    'Cumulative Runnable Average' (cr_avg). They are designed for CPU frequency
    guidance and task placement, but unfortunately neither is suitable for
    energy cost estimation.

    Using prev_runnable_sum for energy cost calculation would make us account
    CPU and task energy solely based on activity in the previous window, so,
    for example, any task that had no activity in the previous window would be
    accounted as a 'zero energy cost' task. Energy estimation with cr_avg is
    what energy_diff() relies on at present, but cr_avg can only represent an
    instantaneous picture of energy cost: if a CPU was fully occupied for an
    entire WALT window and became idle just before the window boundary, then on
    a wake-up energy_diff() accounts that CPU as a 'zero energy cost' CPU.

    As a result, introduce a new accounting unit, 'Cumulative Window Demand'.
    The cumulative window demand tracks the demands of all the tasks seen in
    the current window, which is neither instantaneous nor actual execution
    time. Because a task's demand represents its estimated scaled execution
    time when it runs a full window, the accumulation of all demands represents
    the predicted CPU load at the end of the window. Thus we can estimate the
    CPU's frequency at the end of the current WALT window with the cumulative
    window demand.

    The use of prev_runnable_sum for CPU frequency guidance and cr_avg for task
    placement is unchanged; this patch only adds an additional statistic.

    Change-Id: I9908c77ead9973a26dea2b36c001c2baf944d4f5
    Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
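    One way to read the description above is as simple add/subtract accounting
    on the runqueue; a rough sketch, where the field and helper names are
    assumptions rather than necessarily those used by the patch:

        /* Track the sum of the demands of all tasks seen in the current window. */
        static inline void walt_inc_cum_window_demand(struct rq *rq,
                                                      struct task_struct *p)
        {
                rq->cum_window_demand += p->ravg.demand;
        }

        static inline void walt_dec_cum_window_demand(struct rq *rq,
                                                      struct task_struct *p)
        {
                rq->cum_window_demand -= p->ravg.demand;
                WARN_ON_ONCE((s64)rq->cum_window_demand < 0);
        }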
* | sched/fair: remove useless variable in find_best_target (Leo Yan, 2017-10-27)

    Patch 5680f23f20c7 ("sched/fair: streamline find_best_target heuristics")
    reworked the find_best_target() function; as a result the variable
    "target_util" is now unused. Remove it.

    Change-Id: I5447062419e5828a49115119984fac6cd37db034
    Signed-off-by: Leo Yan <leo.yan@linaro.org>
* | sched/tune: access schedtune_initialized under CGROUP_SCHEDTUNE (Russ Weight, 2017-10-27)

    schedtune_initialized is protected by CONFIG_CGROUP_SCHEDTUNE, but is being
    used without CONFIG_CGROUP_SCHEDTUNE being defined. Add appropriate ifdefs
    around the usage of schedtune_initialized to avoid a compilation error when
    CONFIG_CGROUP_SCHEDTUNE is not defined.

    Change-Id: Iab79bf053d74db3eeb84c09d71d43b4e39746ed2
    Signed-off-by: Russ Weight <russell.h.weight@intel.com>
    Signed-off-by: Fei Yang <fei.yang@intel.com>
* | sched/fair: consider task utilization in group_max_util() (Patrick Bellasi, 2017-10-27)

    The group_max_util() function is used to compute the maximum utilization
    across the CPUs of a certain energy_env configuration. Its main client is
    the energy_diff() function when it needs to compute the SG capacity for one
    of the before/after scheduling candidates.

    Currently, energy_diff() sets util_delta = 0 when it wants to compute the
    energy corresponding to the scheduling candidate where the task runs on the
    previous CPU. This implies that, for the task waking up on the previous
    CPU, we consider only its blocked load as tracked by the CPU RQ. However,
    for a medium-big task waking up on a CPU that has been idle for a long
    time, this blocked load can already be completely decayed. More generally,
    the current approach is biased towards under-estimating the capacity
    requirements for the "before" scheduling candidate.

    This patch fixes this by:

    - always using cpu_util_wake() to properly get the utilization of a CPU
      without any (partially decayed) contribution of the waking-up task

    - adding the task utilization to cpu_util_wake just for the target cpu

    The "target CPU" is defined by the energy_env to be either the src_cpu or
    the dst_cpu, depending on which scheduling candidate we are considering.

    Finally, since this update removes the last usage of calc_util_delta(),
    that function is now safely removed.

    Change-Id: I20ee1bcf40cee6bf6e265fb2d32ef79061ad6ced
    Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
    Signed-off-by: Chris Redpath <chris.redpath@arm.com>
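    A sketch of the resulting loop, assuming the energy_env fields (task,
    trg_cpu, util_delta, sg_cap) referred to elsewhere in this series;
    illustrative rather than the exact patch:

        static unsigned long group_max_util(struct energy_env *eenv)
        {
                unsigned long max_util = 0;
                unsigned long util;
                int cpu;

                for_each_cpu(cpu, sched_group_cpus(eenv->sg_cap)) {
                        /* Utilization without the waking task's stale contribution. */
                        util = cpu_util_wake(cpu, eenv->task);

                        /* Add the task's utilization only on the candidate CPU. */
                        if (cpu == eenv->trg_cpu)
                                util += eenv->util_delta;

                        max_util = max(max_util, util);
                }

                return max_util;
        }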
* | sched/fair: consider task utilization in group_norm_util() (Chris Redpath, 2017-10-27)

    The group_norm_util() function is used to compute the normalized
    utilization of an SG given a certain energy_env configuration. Its main
    client is the energy_diff() function when it comes to computing the SG
    energy for one of the before/after scheduling candidates.

    Currently, energy_diff() sets util_delta = 0 when it wants to compute the
    energy corresponding to the scheduling candidate where the task runs on the
    previous CPU. This implies that, for the task waking up on the previous
    CPU, we consider only its blocked load as tracked by the CPU RQ. However,
    for a medium-big task waking up on a CPU that has been idle for a long
    time, this blocked load can already be completely decayed. More generally,
    the current approach is biased towards under-estimating the energy
    consumption for the "before" scheduling candidate.

    This patch fixes this by:

    - always using cpu_util_wake() to properly get the utilization of a CPU
      without any (partially decayed) contribution of the waking-up task

    - adding the task utilization to cpu_util_wake just for the target cpu

    The "target CPU" is defined by the energy_env to be either the src_cpu or
    the dst_cpu, depending on which scheduling candidate we are considering.

    This patch also updates the definition of __cpu_norm_util(), which is
    currently called only by group_norm_util(). This simplifies the code by
    using that function just to normalize a specified utilization with respect
    to a given capacity, and allows any dependency of group_norm_util() on
    calc_util_delta() to be completely removed.

    Change-Id: I3b6ec50ce8decb1521faae660e326ab3319d3c82
    Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
    Signed-off-by: Chris Redpath <chris.redpath@arm.com>
* | sched/fair: enforce EAS mode (Patrick Bellasi, 2017-10-27)

    For non latency-sensitive tasks the goal is to optimize for energy
    efficiency. Thus, we should try our best to avoid moving a task onto a CPU
    which would then be marked as overutilized.

    Let's use the capacity_margin metric to verify whether a candidate target
    CPU should be considered, without risking bailing out of EAS mode.

    Change-Id: Ib3697106f4073aedf4a6c6ce42bd5d000fa8c007
    Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
    Signed-off-by: Chris Redpath <chris.redpath@arm.com>
* | sched/fair: ignore backup CPU when not valid (Patrick Bellasi, 2017-10-27)

    find_best_target() can sometimes fail to return a valid backup CPU, either
    because it cannot find one or because it returns prev_cpu as the backup. In
    these cases we should skip the energy_diff() evaluation for the backup CPU.

    Change-Id: I3787dbdfe74122348dd7a7485b88c4679051bd32
    Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
    Signed-off-by: Chris Redpath <chris.redpath@arm.com>
* | sched/fair: trace energy_diff for non boosted tasks (Patrick Bellasi, 2017-10-27)

    On systems where SchedTune is enabled, we do not report an energy diff for
    non-boosted tasks. Let's fix this by always generating an energy_diff
    event, where however:

    - nrg.delta = 0, since we skip energy normalization
    - payoff = nrg.diff, since the payoff is defined just by the energy
      difference

    Change-Id: I9a11ec19b6f56da04147f5ae5b47daf1dd180445
    Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
    Signed-off-by: Chris Redpath <chris.redpath@arm.com>
* | UPSTREAM: sched/fair: Sync task util before slow-path wakeup (Brendan Jackman, 2017-10-27)

    We use task_util() in find_idlest_group() via capacity_spare_wake(). This
    task_util() is updated in wake_cap(). However wake_cap() is not the only
    reason for ending up in find_idlest_group() - we could have been sent there
    by wake_wide(). So explicitly sync the task util with prev_cpu when we are
    about to head to find_idlest_group().

    We could simply do this at the beginning of select_task_rq_fair() (i.e.
    irrespective of whether we're heading to select_idle_sibling() or
    find_idlest_group() & co), but I didn't want to slow down the
    select_idle_sibling() path more than necessary.

    Don't do this during fork balancing; we won't need the task_util and we'd
    just clobber the last_update_time, which is supposed to be 0.

    Change-Id: I935f4bfdfec3e8b914457aac3387ce264d5fd484
    Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Andres Oportus <andresoportus@google.com>
    Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Cc: Joel Fernandes <joelaf@google.com>
    Cc: Josef Bacik <josef@toxicpanda.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Mike Galbraith <efault@gmx.de>
    Cc: Morten Rasmussen <morten.rasmussen@arm.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Vincent Guittot <vincent.guittot@linaro.org>
    Link: http://lkml.kernel.org/r/20170808095519.10077-1-brendan.jackman@arm.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    (cherry picked from commit ea16f0ea6c3d in tip:sched/core)
    Signed-off-by: Chris Redpath <chris.redpath@arm.com>
* | UPSTREAM: sched/fair: Fix usage of find_idlest_group() when the local group is idlest (Brendan Jackman, 2017-10-27)

    find_idlest_group() returns NULL when the local group is idlest. The caller
    then continues the find_idlest_group() search at a lower level of the
    current CPU's sched_domain hierarchy. find_idlest_group_cpu() is not
    consulted and, crucially, @new_cpu is not updated. This means the search is
    pointless and we return @prev_cpu from select_task_rq_fair().

    This is fixed by initialising @new_cpu to @cpu instead of @prev_cpu.

    Change-Id: Ie531f5bb29775952bdc4c148b6e974b2f5f32b7a
    Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Josef Bacik <jbacik@fb.com>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Cc: Josef Bacik <josef@toxicpanda.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Mike Galbraith <efault@gmx.de>
    Cc: Morten Rasmussen <morten.rasmussen@arm.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Link: http://lkml.kernel.org/r/20171005114516.18617-6-brendan.jackman@arm.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    (cherry picked from commit 93f50f90247e in tip:sched/core)
    Signed-off-by: Chris Redpath <chris.redpath@arm.com>
* | UPSTREAM: sched/fair: Fix usage of find_idlest_group() when no groups are allowed (Brendan Jackman, 2017-10-27)

    When 'p' is not allowed on any of the CPUs in the sched_domain, we
    currently return NULL from find_idlest_group(), and pointlessly continue
    the search on lower sched_domain levels (where 'p' is also not allowed)
    before returning prev_cpu regardless (as we have not updated new_cpu).

    Add an explicit check for this case, and add a comment to
    find_idlest_group(). Now when find_idlest_group() returns NULL, it always
    means that the local group is allowed and idlest.

    Change-Id: I5f2648d2f7fb0465677961ecb7473df3d06f0057
    Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Reviewed-by: Josef Bacik <jbacik@fb.com>
    Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Cc: Josef Bacik <josef@toxicpanda.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Mike Galbraith <efault@gmx.de>
    Cc: Morten Rasmussen <morten.rasmussen@arm.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Link: http://lkml.kernel.org/r/20171005114516.18617-5-brendan.jackman@arm.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    (cherry picked from commit 6fee85ccbc76 in tip:sched/core)
    Signed-off-by: Chris Redpath <chris.redpath@arm.com>
* | BACKPORT: sched/fair: Fix find_idlest_group when local group is not allowed  (Brendan Jackman, 2017-10-27)

When the local group is not allowed we do not modify this_*_load from their
initial value of 0. That means that the load checks at the end of
find_idlest_group cause us to incorrectly return NULL. Fixing the initial
values to ULONG_MAX means we will instead return the idlest remote group in
that case.

BACKPORT: Note 4.4 is missing commit 6b94780e45c1 "sched/core: Use load_avg
for selecting idlest group", so we only have to fix this_load instead of
this_runnable_load and this_avg_load.

Change-Id: I41f775b0e7c8f5e675c2780f955bb130a563cba7
Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20171005114516.18617-4-brendan.jackman@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry-picked-from: commit 0d10ab952e99 tip:sched/core)
(backport changes described above)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
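Per the backport note, on 4.4 the fix reduces to a single changed initial
value in find_idlest_group(); roughly:

  -	unsigned long min_load = ULONG_MAX, this_load = 0;
  +	unsigned long min_load = ULONG_MAX, this_load = ULONG_MAX;

With this_load starting at ULONG_MAX, a disallowed (and therefore never
updated) local group can no longer win the final load comparison by accident.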
* | UPSTREAM: sched/fair: Remove unnecessary comparison with -1  (Brendan Jackman, 2017-10-27)

Since commit:

  83a0a96a5f26 ("sched/fair: Leverage the idle state info when choosing the "idlest" cpu")

find_idlest_group_cpu() (formerly find_idlest_cpu) no longer returns -1, so
we can simplify the checking of the return value in find_idlest_cpu().

Change-Id: I98f4b9f178cd93a30408e024e608d36771764c7b
Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20171005114516.18617-3-brendan.jackman@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry-picked-from commit e90381eaecf6 in tip:sched/core)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
* | BACKPORT: sched/fair: Move select_task_rq_fair slow-path into its own function  (Brendan Jackman, 2017-10-27)

In preparation for changes that would otherwise require adding a new level
of indentation to the while(sd) loop, create a new function
find_idlest_cpu() which contains this loop, and rename the existing
find_idlest_cpu() to find_idlest_group_cpu().

Code inside the while(sd) loop is unchanged. @new_cpu is added as a variable
in the new function, with the same initial value as the @new_cpu in
select_task_rq_fair().

Change-Id: I9842308cab00dc9cd6c513fc38c609089a1aaaaf
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20171005114516.18617-2-brendan.jackman@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(reworked for eas/cas schedstats added in Android)
(cherry-picked commit 18bd1b4bd53a from tip:sched/core)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
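A sketch of the resulting helper, close to the upstream shape but simplified
(the EAS/CAS schedstats rework mentioned above and the since-removed -1 check
are omitted, so the android-4.4 version differs in detail):

  static int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p,
                             int cpu, int prev_cpu, int sd_flag)
  {
          int new_cpu = prev_cpu; /* same initial value select_task_rq_fair() used */

          while (sd) {
                  struct sched_group *group;
                  struct sched_domain *tmp;
                  int weight;

                  if (!(sd->flags & sd_flag)) {
                          sd = sd->child;
                          continue;
                  }

                  group = find_idlest_group(sd, p, cpu, sd_flag);
                  if (!group) {
                          /* Local group was idlest: go one level down. */
                          sd = sd->child;
                          continue;
                  }

                  new_cpu = find_idlest_group_cpu(group, p, cpu);
                  if (new_cpu == cpu) {
                          /* Now try balancing at a lower domain level of cpu. */
                          sd = sd->child;
                          continue;
                  }

                  /* Now try balancing at a lower domain level of new_cpu. */
                  cpu = new_cpu;
                  weight = sd->span_weight;
                  sd = NULL;
                  for_each_domain(cpu, tmp) {
                          if (weight <= tmp->span_weight)
                                  break;
                          if (tmp->flags & sd_flag)
                                  sd = tmp;
                  }
          }

          return new_cpu;
  }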
* | UPSTREAM: sched/fair: Force balancing on nohz balance if local group has capacity  (Brendan Jackman, 2017-10-27)

The "goto force_balance" here is intended to mitigate the fact that avg_load
calculations can result in bad placement decisions when priority is
asymmetrical. The original commit that adds it:

  fab476228ba3 ("sched: Force balancing on newidle balance if local group has capacity")

explains:

    Under certain situations, such as a niced down task (i.e. nice = -15) in
    the presence of nr_cpus NICE0 tasks, the niced task lands on a sched
    group and kicks away other tasks because of its large weight. This leads
    to sub-optimal utilization of the machine. Even though the sched group
    has capacity, it does not pull tasks because sds.this_load >>
    sds.max_load, and f_b_g() returns NULL.

A similar but inverted issue also affects ARM big.LITTLE (asymmetrical CPU
capacity) systems - consider 8 always-running, same-priority tasks on a
system with 4 "big" and 4 "little" CPUs. Suppose that 5 of them end up on
the "big" CPUs (which will be represented by one sched_group in the DIE
sched_domain) and 3 on the "little" (the other sched_group in DIE), leaving
one CPU unused. Because the "big" group has a higher group_capacity its
avg_load may not present an imbalance that would cause migrating a task to
the idle "little".

The force_balance case here solves the problem but currently only for
CPU_NEWLY_IDLE balances, which in theory might never happen on the unused
CPU. Including CPU_IDLE in the force_balance case means there's an upper
bound on the time before we can attempt to solve the underutilization:
after DIE's sd->balance_interval has passed the next nohz balance kick will
help us out.

Change-Id: I807ba5cba0ef1b8bbec02cbcd4755fd32af10135
Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170807163900.25180-1-brendan.jackman@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry-picked-from: commit 583ffd99d765 tip:sched/core)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
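The change itself is essentially a widening of one condition in
find_busiest_group(), so that both newly-idle and nohz-idle balances can take
the force_balance path; roughly (upstream shape, shown here only as an
illustration):

  -	if (env->idle == CPU_NEWLY_IDLE && group_has_capacity(env, local) &&
  +	if (env->idle != CPU_NOT_IDLE && group_has_capacity(env, local) &&
   	    busiest->group_no_capacity)
   		goto force_balance;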
* | UPSTREAM: sched/core: Add missing update_rq_clock() call in set_user_nice()  (Peter Zijlstra, 2017-10-27)

Address this rq-clock update bug:

  WARNING: CPU: 30 PID: 195 at ../kernel/sched/sched.h:797 set_next_entity()
  rq->clock_update_flags < RQCF_ACT_SKIP

  Call Trace:
    dump_stack()
    __warn()
    warn_slowpath_fmt()
    set_next_entity()
    ? _raw_spin_lock()
    set_curr_task_fair()
    set_user_nice.part.85()
    set_user_nice()
    create_worker()
    worker_thread()
    kthread()
    ret_from_fork()

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 2fb8d36787affe26f3536c3d8ec094995a48037d)
Change-Id: I53ba056e72820c7fadb3f022e4ee3b821c0de17d
Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
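The pattern for this and the following clock fixes is the same: refresh the rq
clock right after taking the rq lock at the top of the path, before anything
that requeues the task or reads averages. A minimal sketch for set_user_nice(),
assuming the 4.4-style locking (unsigned long flags rather than struct
rq_flags, per the backport notes later in this series):

  void set_user_nice(struct task_struct *p, long nice)
  {
          unsigned long flags;
          struct rq *rq;

          /* ... early-out checks elided ... */

          rq = task_rq_lock(p, &flags);

          /*
           * Added: make sure the rq clock is current before the later
           * dequeue/enqueue and set_curr_task() paths look at it.
           */
          update_rq_clock(rq);

          /* ... existing renice logic, unchanged ... */

          task_rq_unlock(rq, p, &flags);
  }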
* | UPSTREAM: sched/core: Add missing update_rq_clock() call for task_hot()  (Peter Zijlstra, 2017-10-27)

Add the update_rq_clock() call at the top of the callstack instead of at the
bottom where we find it missing; this aids a later effort to minimize the
number of update_rq_clock() calls.

  WARNING: CPU: 30 PID: 194 at ../kernel/sched/sched.h:797 assert_clock_updated()
  rq->clock_update_flags < RQCF_ACT_SKIP

  Call Trace:
    dump_stack()
    __warn()
    warn_slowpath_fmt()
    assert_clock_updated.isra.63.part.64()
    can_migrate_task()
    load_balance()
    pick_next_task_fair()
    __schedule()
    schedule()
    worker_thread()
    kthread()

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 3bed5e2166a5e433bf62162f3cd3c5174d335934)
Change-Id: Ief5070dcce486535334dcb739ee16b989ea9df42
Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
* | UPSTREAM: sched/core: Add missing update_rq_clock() in detach_task_cfs_rq()  (Peter Zijlstra, 2017-10-27)

Instead of adding the update_rq_clock() all the way at the bottom of the
callstack, add one at the top; this aids a later effort to minimize
update_rq_clock() calls.

  WARNING: CPU: 0 PID: 1 at ../kernel/sched/sched.h:797 detach_task_cfs_rq()
  rq->clock_update_flags < RQCF_ACT_SKIP

  Call Trace:
    dump_stack()
    __warn()
    warn_slowpath_fmt()
    detach_task_cfs_rq()
    switched_from_fair()
    __sched_setscheduler()
    _sched_setscheduler()
    sched_set_stop_task()
    cpu_stop_create()
    __smpboot_create_thread.part.2()
    smpboot_register_percpu_thread_cpumask()
    cpu_stop_init()
    do_one_initcall()
    ? print_cpu_info()
    kernel_init_freeable()
    ? rest_init()
    kernel_init()
    ret_from_fork()

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 80f5c1b84baa8180c3c27b7e227429712cd967b6)
Change-Id: Ibffde077d18eabec4c2984158bd9d6d73bd0fb96
Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
* | UPSTREAM: sched/core: Add missing update_rq_clock() in post_init_entity_util_avg()  (Peter Zijlstra, 2017-10-27)

Address this rq-clock update bug:

  WARNING: CPU: 0 PID: 0 at ../kernel/sched/sched.h:797 post_init_entity_util_avg()
  rq->clock_update_flags < RQCF_ACT_SKIP

  Call Trace:
    __warn()
    post_init_entity_util_avg()
    wake_up_new_task()
    _do_fork()
    kernel_thread()
    rest_init()
    start_kernel()

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 4126bad6717336abe5d666440ae15555563ca53f)
Change-Id: Ibe9a73386896377f96483d195e433259218755a5
Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
* | UPSTREAM: sched/core: Fix find_idlest_group() for fork  (Vincent Guittot, 2017-10-27)

During fork, the utilization of a task is initialized once the rq has been
selected, because the current utilization level of the rq is used to set the
utilization of the forked task. As the task's utilization is still 0 at this
step of the fork sequence, it doesn't make sense to look for some spare
capacity that can fit the task's utilization.

Furthermore, I can see perf regressions for the test:

  hackbench -P -g 1

because the least-loaded policy is always bypassed and tasks are not spread
during fork.

With this patch and the fix below, we are back to the same performance as
for v4.8. The fix below is only a temporary one used for the test until a
smarter solution is found, because we can't simply remove the test, which is
useful for other benchmarks:

| @@ -5708,13 +5708,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
|
|	avg_cost = this_sd->avg_scan_cost;
|
| -	/*
| -	 * Due to large variance we need a large fuzz factor; hackbench in
| -	 * particularly is sensitive here.
| -	 */
| -	if ((avg_idle / 512) < avg_cost)
| -		return -1;
| -
|	time = local_clock();
|
|	for_each_cpu_wrap(cpu, sched_domain_span(sd), target, wrap) {

Tested-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: kernellwp@gmail.com
Cc: umgwanakikbuti@gmail.com
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1481216215-24651-2-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit f519a3f1c6b7a990e5aed37a8f853c6ecfdee945)
Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Change-Id: I86cc2ad81af3467c0b2f82b995111f428248baa4
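Conceptually, the fix boils down to bypassing the spare-capacity shortcut in
find_idlest_group() when it is reached via the fork path, since a newly forked
task still has util_avg == 0. A rough sketch against the upstream code
(variable names follow the upstream spare-capacity logic and are assumptions
here; the 4.4 EAS backport may not share them):

  if (sd_flag & SD_BALANCE_FORK)
          goto skip_spare; /* task_util(p) is still 0: spare capacity is meaningless */

  if (this_spare > task_util(p) / 2 &&
      imbalance * this_spare > 100 * most_spare)
          return NULL;
  else if (most_spare > task_util(p) / 2)
          return most_spare_sg;

  skip_spare: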
* | BACKPORT: sched/fair: Fix PELT integrity for new tasks  (Peter Zijlstra, 2017-10-27)

Vincent and Yuyang found another few scenarios in which entity tracking goes
wobbly.

The scenarios are basically due to the fact that new tasks are not
immediately attached and thereby differ from the normal situation -- a task
is always attached to a cfs_rq load average (such that it includes its
blocked contribution) and is explicitly detached/attached on migration to
another cfs_rq.

Scenario 1: switch to fair class

  p->sched_class = fair_class;
  if (queued)
    enqueue_task(p);
      ...
        enqueue_entity()
          enqueue_entity_load_avg()
            migrated = !sa->last_update_time (true)
            if (migrated)
              attach_entity_load_avg()
  check_class_changed()
    switched_from() (!fair)
    switched_to()   (fair)
      switched_to_fair()
        attach_entity_load_avg()

If @p is a new task that hasn't been fair before, it will have
!last_update_time and, per the above, end up in attach_entity_load_avg()
_twice_.

Scenario 2: change between cgroups

  sched_move_group(p)
    if (queued)
      dequeue_task()
    task_move_group_fair()
      detach_task_cfs_rq()
        detach_entity_load_avg()
      set_task_rq()
      attach_task_cfs_rq()
        attach_entity_load_avg()
    if (queued)
      enqueue_task()
        ...
          enqueue_entity()
            enqueue_entity_load_avg()
              migrated = !sa->last_update_time (true)
              if (migrated)
                attach_entity_load_avg()

Similar to scenario 1, if @p is a new task it will have !last_update_time
and we'll end up in attach_entity_load_avg() _twice_. Furthermore, notice
how we do a detach_entity_load_avg() on something that wasn't attached to
begin with.

As stated above, the problem is that the new task isn't yet attached to the
load tracking and thereby violates the invariant assumption.

This patch remedies this by ensuring a new task is indeed properly attached
to the load tracking on creation, through post_init_entity_util_avg().

Of course, this isn't entirely as straightforward as one might think, since
the task is hashed before we call wake_up_new_task() and thus can be poked
at. We avoid this by adding TASK_NEW and teaching cpu_cgroup_can_attach()
to refuse such tasks.

BACKPORT: Complicated by the fact that much of the lines changed by the
original of this commit were then changed by:

  df217913e72e sched/fair: Factorize attach/detach entity <Vincent Guittot>

and then

  d31b1a66cbe0 sched/fair: Factorize PELT update <Vincent Guittot>

which have both already been backported here.

Reported-by: Yuyang Du <yuyang.du@intel.com>
Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 7dc603c9028ea5d4354e0e317e8481df99b06d7e)
Change-Id: Ibc59eb52310a62709d49a744bd5a24e8b97c4ae8
Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
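The cgroup-side guard can be summarised as refusing to move a task that is
still in TASK_NEW, i.e. not yet attached to load tracking. A simplified sketch
of that check inside the per-task loop of cpu_cgroup_can_attach():

  /*
   * Serialise against wake_up_new_task(): a task still in TASK_NEW has
   * not been attached to its cfs_rq yet, so moving it now would trigger
   * a bogus detach/attach pair.
   */
  raw_spin_lock_irq(&task->pi_lock);
  if (task->state == TASK_NEW)
          ret = -EINVAL;
  raw_spin_unlock_irq(&task->pi_lock);

  if (ret)
          break; /* abort the per-task loop and reject the attach */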
* | BACKPORT: sched/cgroup: Fix cpu_cgroup_fork() handling  (Vincent Guittot, 2017-10-27)

A new fair task is detached and attached from/to task_group with:

  cgroup_post_fork()
    ss->fork(child) := cpu_cgroup_fork()
      sched_move_task()
        task_move_group_fair()

Which is wrong, because at this point in fork() the task isn't fully
initialized and it cannot 'move' to another group, because it's not attached
to any group as yet.

In fact, cpu_cgroup_fork() needs a small part of sched_move_task(), so we
can just call this small part directly instead of sched_move_task(). And the
task doesn't really migrate because it is not yet attached, so we need the
following sequence:

  do_fork()
    sched_fork()
      __set_task_cpu()

    cgroup_post_fork()
      set_task_rq() # set task group and runqueue

  wake_up_new_task()
    select_task_rq() can select a new cpu
    __set_task_cpu
    post_init_entity_util_avg
      attach_task_cfs_rq()
    activate_task
      enqueue_task

This patch makes that happen.

BACKPORT: Differences from the original commit:
- Removed use of DEQUEUE_MOVE (which isn't defined in 4.4) in the
  dequeue_task flags
- Replaced "struct rq_flags rf" with "unsigned long flags".

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
[ Added TASK_SET_GROUP to set depth properly. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit ea86cb4b7621e1298a37197005bf0abcc86348d4)
Change-Id: I8126fd923288acf961218431ffd29d6bf6fd8d72
Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
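After this change the fork callback only sets the task's group and runqueue
pointers instead of performing a full move. A sketch of the backported shape,
with rq_flags replaced by unsigned long flags as described above; the helper
and flag names follow the upstream commit, and the exact ->fork() callback
signature in the 4.4 tree may differ:

  static void cpu_cgroup_fork(struct task_struct *task)
  {
          unsigned long flags;
          struct rq *rq;

          rq = task_rq_lock(task, &flags);

          /* Only set the task group and depth; no detach/attach, no migration. */
          sched_change_group(task, TASK_SET_GROUP);

          task_rq_unlock(rq, task, &flags);
  }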