diff options
author | Blagovest Kolenichev <bkolenichev@codeaurora.org> | 2017-06-12 07:30:14 -0700 |
---|---|---|
committer | Blagovest Kolenichev <bkolenichev@codeaurora.org> | 2017-06-19 16:59:55 -0700 |
commit | c5f247dd6d415e5f4b7613d0234a311b81354ee9 (patch) | |
tree | 40fb96973459fa7926aebac00c7422ec770b3853 /kernel/sched | |
parent | 55a25be010f62b574938ef3da38c50738db78cff (diff) | |
parent | 6fc0573f6daffb79bb6ea76daea17b99a3c4526a (diff) |
Merge branch 'android-4.4@6fc0573' into branch 'msm-4.4'
* refs/heads/tmp-6fc0573:
Linux 4.4.71
xfs: only return -errno or success from attr ->put_listent
xfs: in _attrlist_by_handle, copy the cursor back to userspace
xfs: fix unaligned access in xfs_btree_visit_blocks
xfs: bad assertion for delalloc an extent that start at i_size
xfs: fix indlen accounting error on partial delalloc conversion
xfs: wait on new inodes during quotaoff dquot release
xfs: update ag iterator to support wait on new inodes
xfs: support ability to wait on new inodes
xfs: fix up quotacheck buffer list error handling
xfs: prevent multi-fsb dir readahead from reading random blocks
xfs: handle array index overrun in xfs_dir2_leaf_readbuf()
xfs: fix over-copying of getbmap parameters from userspace
xfs: fix off-by-one on max nr_pages in xfs_find_get_desired_pgoff()
xfs: Fix missed holes in SEEK_HOLE implementation
mlock: fix mlock count can not decrease in race condition
mm/migrate: fix refcount handling when !hugepage_migration_supported()
drm/gma500/psb: Actually use VBT mode when it is found
slub/memcg: cure the brainless abuse of sysfs attributes
ALSA: hda - apply STAC_9200_DELL_M22 quirk for Dell Latitude D430
pcmcia: remove left-over %Z format
drm/radeon: Unbreak HPD handling for r600+
drm/radeon/ci: disable mclk switching for high refresh rates (v2)
scsi: mpt3sas: Force request partial completion alignment
HID: wacom: Have wacom_tpc_irq guard against possible NULL dereference
mmc: sdhci-iproc: suppress spurious interrupt with Multiblock read
i2c: i2c-tiny-usb: fix buffer not being DMA capable
vlan: Fix tcp checksum offloads in Q-in-Q vlans
net: phy: marvell: Limit errata to 88m1101
netem: fix skb_orphan_partial()
ipv4: add reference counting to metrics
sctp: fix ICMP processing if skb is non-linear
tcp: avoid fastopen API to be used on AF_UNSPEC
virtio-net: enable TSO/checksum offloads for Q-in-Q vlans
be2net: Fix offload features for Q-in-Q packets
ipv6: fix out of bound writes in __ip6_append_data()
bridge: start hello_timer when enabling KERNEL_STP in br_stp_start
qmi_wwan: add another Lenovo EM74xx device ID
bridge: netlink: check vlan_default_pvid range
ipv6: Check ip6_find_1stfragopt() return value properly.
ipv6: Prevent overrun when parsing v6 header options
net: Improve handling of failures on link and route dumps
tcp: eliminate negative reordering in tcp_clean_rtx_queue
sctp: do not inherit ipv6_{mc|ac|fl}_list from parent
sctp: fix src address selection if using secondary addresses for ipv6
tcp: avoid fragmenting peculiar skbs in SACK
s390/qeth: avoid null pointer dereference on OSN
s390/qeth: unbreak OSM and OSN support
s390/qeth: handle sysfs error during initialization
ipv6/dccp: do not inherit ipv6_mc_list from parent
dccp/tcp: do not inherit mc_list from parent
sparc: Fix -Wstringop-overflow warning
android: base-cfg: disable CONFIG_NFS_FS and CONFIG_NFSD
schedstats/eas: guard properly to avoid breaking non-smp schedstats users
BACKPORT: f2fs: sanity check size of nat and sit cache
FROMLIST: f2fs: sanity check checkpoint segno and blkoff
sched/tune: don't use schedtune before it is ready
sched/fair: use SCHED_CAPACITY_SCALE for energy normalization
sched/{fair,tune}: use reciprocal_value to compute boost margin
sched/tune: Initialize raw_spin_lock in boosted_groups
sched/tune: report when SchedTune has not been initialized
sched/tune: fix sched_energy_diff tracepoint
sched/tune: increase group count to 5
cpufreq/schedutil: use boosted_cpu_util for PELT to match WALT
sched/fair: Fix sched_group_energy() to support per-cpu capacity states
sched/fair: discount task contribution to find CPU with lowest utilization
sched/fair: ensure utilization signals are synchronized before use
sched/fair: remove task util from own cpu when placing waking task
trace:sched: Make util_avg in load_avg trace reflect PELT/WALT as used
sched/fair: Add eas (& cas) specific rq, sd and task stats
sched/core: Fix PELT jump to max OPP upon util increase
sched: EAS & 'single cpu per cluster'/cpu hotplug interoperability
UPSTREAM: sched/core: Fix group_entity's share update
UPSTREAM: sched/fair: Fix calc_cfs_shares() fixed point arithmetics width confusion
UPSTREAM: sched/fair: Fix incorrect task group ->load_avg
UPSTREAM: sched/fair: Fix effective_load() to consistently use smoothed load
UPSTREAM: sched/fair: Propagate asynchrous detach
UPSTREAM: sched/fair: Propagate load during synchronous attach/detach
UPSTREAM: sched/fair: Fix hierarchical order in rq->leaf_cfs_rq_list
BACKPORT: sched/fair: Factorize PELT update
UPSTREAM: sched/fair: Factorize attach/detach entity
UPSTREAM: sched/fair: Improve PELT stuff some more
UPSTREAM: sched/fair: Apply more PELT fixes
UPSTREAM: sched/fair: Fix post_init_entity_util_avg() serialization
BACKPORT: sched/fair: Initiate a new task's util avg to a bounded value
sched/fair: Simplify idle_idx handling in select_idle_sibling()
sched/fair: refactor find_best_target() for simplicity
sched/fair: Change cpu iteration order in find_best_target()
sched/core: Add first cpu w/ max/min orig capacity to root domain
sched/core: Remove remnants of commit fd5c98da1a42
sched: Remove sysctl_sched_is_big_little
sched/fair: Code !is_big_little path into select_energy_cpu_brute()
EAS: sched/fair: Re-integrate 'honor sync wakeups' into wakeup path
Fixup!: sched/fair.c: Set SchedTune specific struct energy_env.task
sched/fair: Energy-aware wake-up task placement
sched/fair: Add energy_diff dead-zone margin
sched/fair: Decommission energy_aware_wake_cpu()
sched/fair: Do not force want_affine eq. true if EAS is enabled
arm64: Set SD_ASYM_CPUCAPACITY sched_domain flag on DIE level
UPSTREAM: sched/fair: Fix incorrect comment for capacity_margin
UPSTREAM: sched/fair: Avoid pulling tasks from non-overloaded higher capacity groups
UPSTREAM: sched/fair: Add per-CPU min capacity to sched_group_capacity
UPSTREAM: sched/fair: Consider spare capacity in find_idlest_group()
UPSTREAM: sched/fair: Compute task/cpu utilization at wake-up correctly
UPSTREAM: sched/fair: Let asymmetric CPU configurations balance at wake-up
UPSTREAM: sched/core: Enable SD_BALANCE_WAKE for asymmetric capacity systems
UPSTREAM: sched/core: Pass child domain into sd_init()
UPSTREAM: sched/core: Introduce SD_ASYM_CPUCAPACITY sched_domain topology flag
UPSTREAM: sched/core: Remove unnecessary NULL-pointer check
UPSTREAM: sched/fair: Optimize find_idlest_cpu() when there is no choice
BACKPORT: sched/fair: Make the use of prev_cpu consistent in the wakeup path
UPSTREAM: sched/core: Fix power to capacity renaming in comment
Partial Revert: "WIP: sched: Add cpu capacity awareness to wakeup balancing"
Revert "WIP: sched: Consider spare cpu capacity at task wake-up"
FROM-LIST: cpufreq: schedutil: Redefine the rate_limit_us tunable
cpufreq: schedutil: add up/down frequency transition rate limits
trace/sched: add rq utilization signal for WALT
sched/cpufreq: make schedutil use WALT signal
sched: cpufreq: use rt_avg as estimate of required RT CPU capacity
cpufreq: schedutil: move slow path from workqueue to SCHED_FIFO task
BACKPORT: kthread: allow to cancel kthread work
sched/cpufreq: fix tunables for schedfreq governor
BACKPORT: cpufreq: schedutil: New governor based on scheduler utilization data
sched: backport cpufreq hooks from 4.9-rc4
ANDROID: Kconfig: add depends for UID_SYS_STATS
ANDROID: hid: uhid: implement refcount for open and close
Revert "ext4: require encryption feature for EXT4_IOC_SET_ENCRYPTION_POLICY"
ANDROID: mnt: Fix next_descendent
Conflicts:
include/trace/events/sched.h
kernel/sched/Makefile
kernel/sched/core.c
kernel/sched/fair.c
kernel/sched/sched.h
Change-Id: I55318828f2c858e192ac7015bcf2bf0ec5c5b2c5
Signed-off-by: Blagovest Kolenichev <bkolenichev@codeaurora.org>
Diffstat (limited to 'kernel/sched')
-rw-r--r-- | kernel/sched/Makefile | 2 | ||||
-rw-r--r-- | kernel/sched/core.c | 78 | ||||
-rw-r--r-- | kernel/sched/cpufreq.c | 63 | ||||
-rw-r--r-- | kernel/sched/cpufreq_sched.c | 220 | ||||
-rw-r--r-- | kernel/sched/cpufreq_schedutil.c | 770 | ||||
-rw-r--r-- | kernel/sched/deadline.c | 3 | ||||
-rw-r--r-- | kernel/sched/debug.c | 26 | ||||
-rw-r--r-- | kernel/sched/fair.c | 1453 | ||||
-rw-r--r-- | kernel/sched/rt.c | 3 | ||||
-rw-r--r-- | kernel/sched/sched.h | 62 | ||||
-rw-r--r-- | kernel/sched/stats.c | 26 | ||||
-rw-r--r-- | kernel/sched/tune.c | 13 |
12 files changed, 2179 insertions, 540 deletions
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile index 308f80ce2e43..a353df46c8e4 100644 --- a/kernel/sched/Makefile +++ b/kernel/sched/Makefile @@ -22,4 +22,6 @@ obj-$(CONFIG_SCHED_DEBUG) += debug.o obj-$(CONFIG_SCHED_TUNE) += tune.o obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o obj-$(CONFIG_SCHED_CORE_CTL) += core_ctl.o +obj-$(CONFIG_CPU_FREQ) += cpufreq.o obj-$(CONFIG_CPU_FREQ_GOV_SCHED) += cpufreq_sched.o +obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 2d2b8834dd3b..0071785e698b 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -2623,9 +2623,9 @@ void wake_up_new_task(struct task_struct *p) */ set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0)); #endif - rq = __task_rq_lock(p); mark_task_starting(p); + post_init_entity_util_avg(&p->se); activate_task(rq, p, ENQUEUE_WAKEUP_NEW); p->on_rq = TASK_ON_RQ_QUEUED; trace_sched_wakeup_new(p); @@ -3154,9 +3154,10 @@ unsigned long sum_capacity_reqs(unsigned long cfs_cap, return total += scr->dl; } +unsigned long boosted_cpu_util(int cpu); static void sched_freq_tick_pelt(int cpu) { - unsigned long cpu_utilization = capacity_max; + unsigned long cpu_utilization = boosted_cpu_util(cpu); unsigned long capacity_curr = capacity_curr_of(cpu); struct sched_capacity_reqs *scr; @@ -6460,9 +6461,6 @@ static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level, if (!(sd->flags & SD_LOAD_BALANCE)) { printk("does not load-balance\n"); - if (sd->parent) - printk(KERN_ERR "ERROR: !SD_LOAD_BALANCE domain" - " has parent"); return -1; } @@ -6555,8 +6553,12 @@ static inline bool sched_debug(void) static int sd_degenerate(struct sched_domain *sd) { - if (cpumask_weight(sched_domain_span(sd)) == 1) - return 1; + if (cpumask_weight(sched_domain_span(sd)) == 1) { + if (sd->groups->sge) + sd->flags &= ~SD_LOAD_BALANCE; + else + return 1; + } /* Following flags need at least 2 groups */ if (sd->flags & (SD_LOAD_BALANCE | @@ -6564,6 +6566,7 @@ static int sd_degenerate(struct sched_domain *sd) SD_BALANCE_FORK | SD_BALANCE_EXEC | SD_SHARE_CPUCAPACITY | + SD_ASYM_CPUCAPACITY | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN | SD_SHARE_CAP_STATES)) { @@ -6595,11 +6598,16 @@ sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent) SD_BALANCE_NEWIDLE | SD_BALANCE_FORK | SD_BALANCE_EXEC | + SD_ASYM_CPUCAPACITY | SD_SHARE_CPUCAPACITY | SD_SHARE_PKG_RESOURCES | SD_PREFER_SIBLING | SD_SHARE_POWERDOMAIN | SD_SHARE_CAP_STATES); + if (parent->groups->sge) { + parent->flags &= ~SD_LOAD_BALANCE; + return 0; + } if (nr_node_ids == 1) pflags &= ~SD_SERIALIZE; } @@ -6680,6 +6688,9 @@ static int init_rootdomain(struct root_domain *rd) goto free_rto_mask; init_max_cpu_capacity(&rd->max_cpu_capacity); + + rd->max_cap_orig_cpu = rd->min_cap_orig_cpu = -1; + return 0; free_rto_mask: @@ -6996,6 +7007,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu) */ sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sg_span); sg->sgc->max_capacity = SCHED_CAPACITY_SCALE; + sg->sgc->min_capacity = SCHED_CAPACITY_SCALE; /* * Make sure the first group of this domain contains the @@ -7291,11 +7303,19 @@ static int sched_domains_curr_level; /* * SD_flags allowed in topology descriptions. * - * SD_SHARE_CPUCAPACITY - describes SMT topologies - * SD_SHARE_PKG_RESOURCES - describes shared caches - * SD_NUMA - describes NUMA topologies - * SD_SHARE_POWERDOMAIN - describes shared power domain - * SD_SHARE_CAP_STATES - describes shared capacity states + * These flags are purely descriptive of the topology and do not prescribe + * behaviour. Behaviour is artificial and mapped in the below sd_init() + * function: + * + * SD_SHARE_CPUCAPACITY - describes SMT topologies + * SD_SHARE_PKG_RESOURCES - describes shared caches + * SD_NUMA - describes NUMA topologies + * SD_SHARE_POWERDOMAIN - describes shared power domain + * SD_ASYM_CPUCAPACITY - describes mixed capacity topologies + * SD_SHARE_CAP_STATES - describes shared capacity states + * + * Odd one out, which beside describing the topology has a quirk also + * prescribes the desired behaviour that goes along with it: * * Odd one out: * SD_ASYM_PACKING - describes SMT quirks @@ -7305,11 +7325,13 @@ static int sched_domains_curr_level; SD_SHARE_PKG_RESOURCES | \ SD_NUMA | \ SD_ASYM_PACKING | \ + SD_ASYM_CPUCAPACITY | \ SD_SHARE_POWERDOMAIN | \ SD_SHARE_CAP_STATES) static struct sched_domain * -sd_init(struct sched_domain_topology_level *tl, int cpu) +sd_init(struct sched_domain_topology_level *tl, + struct sched_domain *child, int cpu) { struct sched_domain *sd = *per_cpu_ptr(tl->data.sd, cpu); int sd_weight, sd_flags = 0; @@ -7361,6 +7383,7 @@ sd_init(struct sched_domain_topology_level *tl, int cpu) .smt_gain = 0, .max_newidle_lb_cost = 0, .next_decay_max_lb_cost = jiffies, + .child = child, #ifdef CONFIG_SCHED_DEBUG .name = tl->name, #endif @@ -7370,6 +7393,13 @@ sd_init(struct sched_domain_topology_level *tl, int cpu) * Convert topological properties into behaviour. */ + if (sd->flags & SD_ASYM_CPUCAPACITY) { + struct sched_domain *t = sd; + + for_each_lower_domain(t) + t->flags |= SD_BALANCE_WAKE; + } + if (sd->flags & SD_SHARE_CPUCAPACITY) { sd->flags |= SD_PREFER_SIBLING; sd->imbalance_pct = 110; @@ -7816,16 +7846,13 @@ struct sched_domain *build_sched_domain(struct sched_domain_topology_level *tl, const struct cpumask *cpu_map, struct sched_domain_attr *attr, struct sched_domain *child, int cpu) { - struct sched_domain *sd = sd_init(tl, cpu); - if (!sd) - return child; + struct sched_domain *sd = sd_init(tl, child, cpu); cpumask_and(sched_domain_span(sd), cpu_map, tl->mask(cpu)); if (child) { sd->level = child->level + 1; sched_domain_level_max = max(sched_domain_level_max, sd->level); child->parent = sd; - sd->child = child; if (!cpumask_subset(sched_domain_span(child), sched_domain_span(sd))) { @@ -7859,7 +7886,6 @@ static int build_sched_domains(const struct cpumask *cpu_map, enum s_alloc alloc_state; struct sched_domain *sd; struct s_data d; - struct rq *rq = NULL; int i, ret = -ENOMEM; alloc_state = __visit_domain_allocation_hell(&d, cpu_map); @@ -7877,8 +7903,6 @@ static int build_sched_domains(const struct cpumask *cpu_map, *per_cpu_ptr(d.sd, i) = sd; if (tl->flags & SDTL_OVERLAP || sched_feat(FORCE_SD_OVERLAP)) sd->flags |= SD_OVERLAP; - if (cpumask_equal(cpu_map, sched_domain_span(sd))) - break; } } @@ -7914,8 +7938,19 @@ static int build_sched_domains(const struct cpumask *cpu_map, /* Attach the domains */ rcu_read_lock(); for_each_cpu(i, cpu_map) { - rq = cpu_rq(i); + int max_cpu = READ_ONCE(d.rd->max_cap_orig_cpu); + int min_cpu = READ_ONCE(d.rd->min_cap_orig_cpu); + + if ((max_cpu < 0) || (cpu_rq(i)->cpu_capacity_orig > + cpu_rq(max_cpu)->cpu_capacity_orig)) + WRITE_ONCE(d.rd->max_cap_orig_cpu, i); + + if ((min_cpu < 0) || (cpu_rq(i)->cpu_capacity_orig < + cpu_rq(min_cpu)->cpu_capacity_orig)) + WRITE_ONCE(d.rd->min_cap_orig_cpu, i); + sd = *per_cpu_ptr(d.sd, i); + cpu_attach_domain(sd, d.rd, i); } rcu_read_unlock(); @@ -8339,6 +8374,7 @@ void __init sched_init(void) #ifdef CONFIG_FAIR_GROUP_SCHED root_task_group.shares = ROOT_TASK_GROUP_LOAD; INIT_LIST_HEAD(&rq->leaf_cfs_rq_list); + rq->tmp_alone_branch = &rq->leaf_cfs_rq_list; /* * How much cpu bandwidth does root_task_group get? * diff --git a/kernel/sched/cpufreq.c b/kernel/sched/cpufreq.c new file mode 100644 index 000000000000..dbc51442ecbc --- /dev/null +++ b/kernel/sched/cpufreq.c @@ -0,0 +1,63 @@ +/* + * Scheduler code and data structures related to cpufreq. + * + * Copyright (C) 2016, Intel Corporation + * Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + */ + +#include "sched.h" + +DEFINE_PER_CPU(struct update_util_data *, cpufreq_update_util_data); + +/** + * cpufreq_add_update_util_hook - Populate the CPU's update_util_data pointer. + * @cpu: The CPU to set the pointer for. + * @data: New pointer value. + * @func: Callback function to set for the CPU. + * + * Set and publish the update_util_data pointer for the given CPU. + * + * The update_util_data pointer of @cpu is set to @data and the callback + * function pointer in the target struct update_util_data is set to @func. + * That function will be called by cpufreq_update_util() from RCU-sched + * read-side critical sections, so it must not sleep. @data will always be + * passed to it as the first argument which allows the function to get to the + * target update_util_data structure and its container. + * + * The update_util_data pointer of @cpu must be NULL when this function is + * called or it will WARN() and return with no effect. + */ +void cpufreq_add_update_util_hook(int cpu, struct update_util_data *data, + void (*func)(struct update_util_data *data, u64 time, + unsigned int flags)) +{ + if (WARN_ON(!data || !func)) + return; + + if (WARN_ON(per_cpu(cpufreq_update_util_data, cpu))) + return; + + data->func = func; + rcu_assign_pointer(per_cpu(cpufreq_update_util_data, cpu), data); +} +EXPORT_SYMBOL_GPL(cpufreq_add_update_util_hook); + +/** + * cpufreq_remove_update_util_hook - Clear the CPU's update_util_data pointer. + * @cpu: The CPU to clear the pointer for. + * + * Clear the update_util_data pointer for the given CPU. + * + * Callers must use RCU-sched callbacks to free any memory that might be + * accessed via the old update_util_data pointer or invoke synchronize_sched() + * right after this function to avoid use-after-free. + */ +void cpufreq_remove_update_util_hook(int cpu) +{ + rcu_assign_pointer(per_cpu(cpufreq_update_util_data, cpu), NULL); +} +EXPORT_SYMBOL_GPL(cpufreq_remove_update_util_hook); diff --git a/kernel/sched/cpufreq_sched.c b/kernel/sched/cpufreq_sched.c index d751bc2d0d6e..f10d9f7d6d07 100644 --- a/kernel/sched/cpufreq_sched.c +++ b/kernel/sched/cpufreq_sched.c @@ -32,6 +32,12 @@ static struct cpufreq_governor cpufreq_gov_sched; static DEFINE_PER_CPU(unsigned long, enabled); DEFINE_PER_CPU(struct sched_capacity_reqs, cpu_sched_capacity_reqs); +struct gov_tunables { + struct gov_attr_set attr_set; + unsigned int up_throttle_nsec; + unsigned int down_throttle_nsec; +}; + /** * gov_data - per-policy data internal to the governor * @up_throttle: next throttling period expiry if increasing OPP @@ -53,8 +59,8 @@ DEFINE_PER_CPU(struct sched_capacity_reqs, cpu_sched_capacity_reqs); struct gov_data { ktime_t up_throttle; ktime_t down_throttle; - unsigned int up_throttle_nsec; - unsigned int down_throttle_nsec; + struct gov_tunables *tunables; + struct list_head tunables_hook; struct task_struct *task; struct irq_work irq_work; unsigned int requested_freq; @@ -71,8 +77,10 @@ static void cpufreq_sched_try_driver_target(struct cpufreq_policy *policy, __cpufreq_driver_target(policy, freq, CPUFREQ_RELATION_L); - gd->up_throttle = ktime_add_ns(ktime_get(), gd->up_throttle_nsec); - gd->down_throttle = ktime_add_ns(ktime_get(), gd->down_throttle_nsec); + gd->up_throttle = ktime_add_ns(ktime_get(), + gd->tunables->up_throttle_nsec); + gd->down_throttle = ktime_add_ns(ktime_get(), + gd->tunables->down_throttle_nsec); up_write(&policy->rwsem); } @@ -262,12 +270,70 @@ static inline void clear_sched_freq(void) static_key_slow_dec(&__sched_freq); } -static struct attribute_group sched_attr_group_gov_pol; -static struct attribute_group *get_sysfs_attr(void) +/* Tunables */ +static struct gov_tunables *global_tunables; + +static inline struct gov_tunables *to_tunables(struct gov_attr_set *attr_set) { - return &sched_attr_group_gov_pol; + return container_of(attr_set, struct gov_tunables, attr_set); } +static ssize_t up_throttle_nsec_show(struct gov_attr_set *attr_set, char *buf) +{ + struct gov_tunables *tunables = to_tunables(attr_set); + + return sprintf(buf, "%u\n", tunables->up_throttle_nsec); +} + +static ssize_t up_throttle_nsec_store(struct gov_attr_set *attr_set, + const char *buf, size_t count) +{ + struct gov_tunables *tunables = to_tunables(attr_set); + int ret; + long unsigned int val; + + ret = kstrtoul(buf, 0, &val); + if (ret < 0) + return ret; + tunables->up_throttle_nsec = val; + return count; +} + +static ssize_t down_throttle_nsec_show(struct gov_attr_set *attr_set, char *buf) +{ + struct gov_tunables *tunables = to_tunables(attr_set); + + return sprintf(buf, "%u\n", tunables->down_throttle_nsec); +} + +static ssize_t down_throttle_nsec_store(struct gov_attr_set *attr_set, + const char *buf, size_t count) +{ + struct gov_tunables *tunables = to_tunables(attr_set); + int ret; + long unsigned int val; + + ret = kstrtoul(buf, 0, &val); + if (ret < 0) + return ret; + tunables->down_throttle_nsec = val; + return count; +} + +static struct governor_attr up_throttle_nsec = __ATTR_RW(up_throttle_nsec); +static struct governor_attr down_throttle_nsec = __ATTR_RW(down_throttle_nsec); + +static struct attribute *schedfreq_attributes[] = { + &up_throttle_nsec.attr, + &down_throttle_nsec.attr, + NULL +}; + +static struct kobj_type tunables_ktype = { + .default_attrs = schedfreq_attributes, + .sysfs_ops = &governor_sysfs_ops, +}; + static int cpufreq_sched_policy_init(struct cpufreq_policy *policy) { struct gov_data *gd; @@ -282,17 +348,39 @@ static int cpufreq_sched_policy_init(struct cpufreq_policy *policy) if (!gd) return -ENOMEM; - gd->up_throttle_nsec = policy->cpuinfo.transition_latency ? - policy->cpuinfo.transition_latency : - THROTTLE_UP_NSEC; - gd->down_throttle_nsec = THROTTLE_DOWN_NSEC; - pr_debug("%s: throttle threshold = %u [ns]\n", - __func__, gd->up_throttle_nsec); - - rc = sysfs_create_group(&policy->kobj, get_sysfs_attr()); - if (rc) { - pr_err("%s: couldn't create sysfs attributes: %d\n", __func__, rc); - goto err; + policy->governor_data = gd; + + if (!global_tunables) { + gd->tunables = kzalloc(sizeof(*gd->tunables), GFP_KERNEL); + if (!gd->tunables) + goto free_gd; + + gd->tunables->up_throttle_nsec = + policy->cpuinfo.transition_latency ? + policy->cpuinfo.transition_latency : + THROTTLE_UP_NSEC; + gd->tunables->down_throttle_nsec = + THROTTLE_DOWN_NSEC; + + rc = kobject_init_and_add(&gd->tunables->attr_set.kobj, + &tunables_ktype, + get_governor_parent_kobj(policy), + "%s", cpufreq_gov_sched.name); + if (rc) + goto free_tunables; + + gov_attr_set_init(&gd->tunables->attr_set, + &gd->tunables_hook); + + pr_debug("%s: throttle_threshold = %u [ns]\n", + __func__, gd->tunables->up_throttle_nsec); + + if (!have_governor_per_policy()) + global_tunables = gd->tunables; + } else { + gd->tunables = global_tunables; + gov_attr_set_get(&global_tunables->attr_set, + &gd->tunables_hook); } policy->governor_data = gd; @@ -304,7 +392,7 @@ static int cpufreq_sched_policy_init(struct cpufreq_policy *policy) if (IS_ERR_OR_NULL(gd->task)) { pr_err("%s: failed to create kschedfreq thread\n", __func__); - goto err; + goto free_tunables; } get_task_struct(gd->task); kthread_bind_mask(gd->task, policy->related_cpus); @@ -316,7 +404,9 @@ static int cpufreq_sched_policy_init(struct cpufreq_policy *policy) return 0; -err: +free_tunables: + kfree(gd->tunables); +free_gd: policy->governor_data = NULL; kfree(gd); return -ENOMEM; @@ -324,6 +414,7 @@ err: static int cpufreq_sched_policy_exit(struct cpufreq_policy *policy) { + unsigned int count; struct gov_data *gd = policy->governor_data; clear_sched_freq(); @@ -332,7 +423,12 @@ static int cpufreq_sched_policy_exit(struct cpufreq_policy *policy) put_task_struct(gd->task); } - sysfs_remove_group(&policy->kobj, get_sysfs_attr()); + count = gov_attr_set_put(&gd->tunables->attr_set, &gd->tunables_hook); + if (!count) { + if (!have_governor_per_policy()) + global_tunables = NULL; + kfree(gd->tunables); + } policy->governor_data = NULL; @@ -394,88 +490,6 @@ static int cpufreq_sched_setup(struct cpufreq_policy *policy, return 0; } -/* Tunables */ -static ssize_t show_up_throttle_nsec(struct gov_data *gd, char *buf) -{ - return sprintf(buf, "%u\n", gd->up_throttle_nsec); -} - -static ssize_t store_up_throttle_nsec(struct gov_data *gd, - const char *buf, size_t count) -{ - int ret; - long unsigned int val; - - ret = kstrtoul(buf, 0, &val); - if (ret < 0) - return ret; - gd->up_throttle_nsec = val; - return count; -} - -static ssize_t show_down_throttle_nsec(struct gov_data *gd, char *buf) -{ - return sprintf(buf, "%u\n", gd->down_throttle_nsec); -} - -static ssize_t store_down_throttle_nsec(struct gov_data *gd, - const char *buf, size_t count) -{ - int ret; - long unsigned int val; - - ret = kstrtoul(buf, 0, &val); - if (ret < 0) - return ret; - gd->down_throttle_nsec = val; - return count; -} - -/* - * Create show/store routines - * - sys: One governor instance for complete SYSTEM - * - pol: One governor instance per struct cpufreq_policy - */ -#define show_gov_pol_sys(file_name) \ -static ssize_t show_##file_name##_gov_pol \ -(struct cpufreq_policy *policy, char *buf) \ -{ \ - return show_##file_name(policy->governor_data, buf); \ -} - -#define store_gov_pol_sys(file_name) \ -static ssize_t store_##file_name##_gov_pol \ -(struct cpufreq_policy *policy, const char *buf, size_t count) \ -{ \ - return store_##file_name(policy->governor_data, buf, count); \ -} - -#define gov_pol_attr_rw(_name) \ - static struct freq_attr _name##_gov_pol = \ - __ATTR(_name, 0644, show_##_name##_gov_pol, store_##_name##_gov_pol) - -#define show_store_gov_pol_sys(file_name) \ - show_gov_pol_sys(file_name); \ - store_gov_pol_sys(file_name) -#define tunable_handlers(file_name) \ - show_gov_pol_sys(file_name); \ - store_gov_pol_sys(file_name); \ - gov_pol_attr_rw(file_name) - -tunable_handlers(down_throttle_nsec); -tunable_handlers(up_throttle_nsec); - -/* Per policy governor instance */ -static struct attribute *sched_attributes_gov_pol[] = { - &up_throttle_nsec_gov_pol.attr, - &down_throttle_nsec_gov_pol.attr, - NULL, -}; - -static struct attribute_group sched_attr_group_gov_pol = { - .attrs = sched_attributes_gov_pol, - .name = "sched", -}; #ifndef CONFIG_CPU_FREQ_DEFAULT_GOV_SCHED static diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c new file mode 100644 index 000000000000..75bfbb336722 --- /dev/null +++ b/kernel/sched/cpufreq_schedutil.c @@ -0,0 +1,770 @@ +/* + * CPUFreq governor based on scheduler-provided CPU utilization data. + * + * Copyright (C) 2016, Intel Corporation + * Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + */ + +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include <linux/cpufreq.h> +#include <linux/kthread.h> +#include <linux/slab.h> +#include <trace/events/power.h> + +#include "sched.h" +#include "tune.h" + +unsigned long boosted_cpu_util(int cpu); + +/* Stub out fast switch routines present on mainline to reduce the backport + * overhead. */ +#define cpufreq_driver_fast_switch(x, y) 0 +#define cpufreq_enable_fast_switch(x) +#define cpufreq_disable_fast_switch(x) +#define LATENCY_MULTIPLIER (1000) +#define SUGOV_KTHREAD_PRIORITY 50 + +struct sugov_tunables { + struct gov_attr_set attr_set; + unsigned int up_rate_limit_us; + unsigned int down_rate_limit_us; +}; + +struct sugov_policy { + struct cpufreq_policy *policy; + + struct sugov_tunables *tunables; + struct list_head tunables_hook; + + raw_spinlock_t update_lock; /* For shared policies */ + u64 last_freq_update_time; + s64 min_rate_limit_ns; + s64 up_rate_delay_ns; + s64 down_rate_delay_ns; + unsigned int next_freq; + + /* The next fields are only needed if fast switch cannot be used. */ + struct irq_work irq_work; + struct kthread_work work; + struct mutex work_lock; + struct kthread_worker worker; + struct task_struct *thread; + bool work_in_progress; + + bool need_freq_update; +}; + +struct sugov_cpu { + struct update_util_data update_util; + struct sugov_policy *sg_policy; + + unsigned int cached_raw_freq; + unsigned long iowait_boost; + unsigned long iowait_boost_max; + u64 last_update; + + /* The fields below are only needed when sharing a policy. */ + unsigned long util; + unsigned long max; + unsigned int flags; +}; + +static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu); + +/************************ Governor internals ***********************/ + +static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time) +{ + s64 delta_ns; + + if (sg_policy->work_in_progress) + return false; + + if (unlikely(sg_policy->need_freq_update)) { + sg_policy->need_freq_update = false; + /* + * This happens when limits change, so forget the previous + * next_freq value and force an update. + */ + sg_policy->next_freq = UINT_MAX; + return true; + } + + delta_ns = time - sg_policy->last_freq_update_time; + + /* No need to recalculate next freq for min_rate_limit_us at least */ + return delta_ns >= sg_policy->min_rate_limit_ns; +} + +static bool sugov_up_down_rate_limit(struct sugov_policy *sg_policy, u64 time, + unsigned int next_freq) +{ + s64 delta_ns; + + delta_ns = time - sg_policy->last_freq_update_time; + + if (next_freq > sg_policy->next_freq && + delta_ns < sg_policy->up_rate_delay_ns) + return true; + + if (next_freq < sg_policy->next_freq && + delta_ns < sg_policy->down_rate_delay_ns) + return true; + + return false; +} + +static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time, + unsigned int next_freq) +{ + struct cpufreq_policy *policy = sg_policy->policy; + + if (sugov_up_down_rate_limit(sg_policy, time, next_freq)) + return; + + if (policy->fast_switch_enabled) { + if (sg_policy->next_freq == next_freq) { + trace_cpu_frequency(policy->cur, smp_processor_id()); + return; + } + sg_policy->next_freq = next_freq; + sg_policy->last_freq_update_time = time; + next_freq = cpufreq_driver_fast_switch(policy, next_freq); + if (next_freq == CPUFREQ_ENTRY_INVALID) + return; + + policy->cur = next_freq; + trace_cpu_frequency(next_freq, smp_processor_id()); + } else if (sg_policy->next_freq != next_freq) { + sg_policy->next_freq = next_freq; + sg_policy->last_freq_update_time = time; + sg_policy->work_in_progress = true; + irq_work_queue(&sg_policy->irq_work); + } +} + +/** + * get_next_freq - Compute a new frequency for a given cpufreq policy. + * @sg_cpu: schedutil cpu object to compute the new frequency for. + * @util: Current CPU utilization. + * @max: CPU capacity. + * + * If the utilization is frequency-invariant, choose the new frequency to be + * proportional to it, that is + * + * next_freq = C * max_freq * util / max + * + * Otherwise, approximate the would-be frequency-invariant utilization by + * util_raw * (curr_freq / max_freq) which leads to + * + * next_freq = C * curr_freq * util_raw / max + * + * Take C = 1.25 for the frequency tipping point at (util / max) = 0.8. + * + * The lowest driver-supported frequency which is equal or greater than the raw + * next_freq (as calculated above) is returned, subject to policy min/max and + * cpufreq driver limitations. + */ +static unsigned int get_next_freq(struct sugov_cpu *sg_cpu, unsigned long util, + unsigned long max) +{ + struct sugov_policy *sg_policy = sg_cpu->sg_policy; + struct cpufreq_policy *policy = sg_policy->policy; + unsigned int freq = arch_scale_freq_invariant() ? + policy->cpuinfo.max_freq : policy->cur; + + freq = (freq + (freq >> 2)) * util / max; + + if (freq == sg_cpu->cached_raw_freq && sg_policy->next_freq != UINT_MAX) + return sg_policy->next_freq; + sg_cpu->cached_raw_freq = freq; + return cpufreq_driver_resolve_freq(policy, freq); +} + +static inline bool use_pelt(void) +{ +#ifdef CONFIG_SCHED_WALT + return (!sysctl_sched_use_walt_cpu_util || walt_disabled); +#else + return true; +#endif +} + +static void sugov_get_util(unsigned long *util, unsigned long *max, u64 time) +{ + int cpu = smp_processor_id(); + struct rq *rq = cpu_rq(cpu); + unsigned long max_cap, rt; + s64 delta; + + max_cap = arch_scale_cpu_capacity(NULL, cpu); + + sched_avg_update(rq); + delta = time - rq->age_stamp; + if (unlikely(delta < 0)) + delta = 0; + rt = div64_u64(rq->rt_avg, sched_avg_period() + delta); + rt = (rt * max_cap) >> SCHED_CAPACITY_SHIFT; + + *util = boosted_cpu_util(cpu); + if (likely(use_pelt())) + *util = min((*util + rt), max_cap); + + *max = max_cap; +} + +static void sugov_set_iowait_boost(struct sugov_cpu *sg_cpu, u64 time, + unsigned int flags) +{ + if (flags & SCHED_CPUFREQ_IOWAIT) { + sg_cpu->iowait_boost = sg_cpu->iowait_boost_max; + } else if (sg_cpu->iowait_boost) { + s64 delta_ns = time - sg_cpu->last_update; + + /* Clear iowait_boost if the CPU apprears to have been idle. */ + if (delta_ns > TICK_NSEC) + sg_cpu->iowait_boost = 0; + } +} + +static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, unsigned long *util, + unsigned long *max) +{ + unsigned long boost_util = sg_cpu->iowait_boost; + unsigned long boost_max = sg_cpu->iowait_boost_max; + + if (!boost_util) + return; + + if (*util * boost_max < *max * boost_util) { + *util = boost_util; + *max = boost_max; + } + sg_cpu->iowait_boost >>= 1; +} + +static void sugov_update_single(struct update_util_data *hook, u64 time, + unsigned int flags) +{ + struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_util); + struct sugov_policy *sg_policy = sg_cpu->sg_policy; + struct cpufreq_policy *policy = sg_policy->policy; + unsigned long util, max; + unsigned int next_f; + + sugov_set_iowait_boost(sg_cpu, time, flags); + sg_cpu->last_update = time; + + if (!sugov_should_update_freq(sg_policy, time)) + return; + + if (flags & SCHED_CPUFREQ_DL) { + next_f = policy->cpuinfo.max_freq; + } else { + sugov_get_util(&util, &max, time); + sugov_iowait_boost(sg_cpu, &util, &max); + next_f = get_next_freq(sg_cpu, util, max); + } + sugov_update_commit(sg_policy, time, next_f); +} + +static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, + unsigned long util, unsigned long max, + unsigned int flags) +{ + struct sugov_policy *sg_policy = sg_cpu->sg_policy; + struct cpufreq_policy *policy = sg_policy->policy; + unsigned int max_f = policy->cpuinfo.max_freq; + u64 last_freq_update_time = sg_policy->last_freq_update_time; + unsigned int j; + + if (flags & SCHED_CPUFREQ_DL) + return max_f; + + sugov_iowait_boost(sg_cpu, &util, &max); + + for_each_cpu(j, policy->cpus) { + struct sugov_cpu *j_sg_cpu; + unsigned long j_util, j_max; + s64 delta_ns; + + if (j == smp_processor_id()) + continue; + + j_sg_cpu = &per_cpu(sugov_cpu, j); + /* + * If the CPU utilization was last updated before the previous + * frequency update and the time elapsed between the last update + * of the CPU utilization and the last frequency update is long + * enough, don't take the CPU into account as it probably is + * idle now (and clear iowait_boost for it). + */ + delta_ns = last_freq_update_time - j_sg_cpu->last_update; + if (delta_ns > TICK_NSEC) { + j_sg_cpu->iowait_boost = 0; + continue; + } + if (j_sg_cpu->flags & SCHED_CPUFREQ_DL) + return max_f; + + j_util = j_sg_cpu->util; + j_max = j_sg_cpu->max; + if (j_util * max > j_max * util) { + util = j_util; + max = j_max; + } + + sugov_iowait_boost(j_sg_cpu, &util, &max); + } + + return get_next_freq(sg_cpu, util, max); +} + +static void sugov_update_shared(struct update_util_data *hook, u64 time, + unsigned int flags) +{ + struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_util); + struct sugov_policy *sg_policy = sg_cpu->sg_policy; + unsigned long util, max; + unsigned int next_f; + + sugov_get_util(&util, &max, time); + + raw_spin_lock(&sg_policy->update_lock); + + sg_cpu->util = util; + sg_cpu->max = max; + sg_cpu->flags = flags; + + sugov_set_iowait_boost(sg_cpu, time, flags); + sg_cpu->last_update = time; + + if (sugov_should_update_freq(sg_policy, time)) { + next_f = sugov_next_freq_shared(sg_cpu, util, max, flags); + sugov_update_commit(sg_policy, time, next_f); + } + + raw_spin_unlock(&sg_policy->update_lock); +} + +static void sugov_work(struct kthread_work *work) +{ + struct sugov_policy *sg_policy = container_of(work, struct sugov_policy, work); + + mutex_lock(&sg_policy->work_lock); + __cpufreq_driver_target(sg_policy->policy, sg_policy->next_freq, + CPUFREQ_RELATION_L); + mutex_unlock(&sg_policy->work_lock); + + sg_policy->work_in_progress = false; +} + +static void sugov_irq_work(struct irq_work *irq_work) +{ + struct sugov_policy *sg_policy; + + sg_policy = container_of(irq_work, struct sugov_policy, irq_work); + + /* + * For Real Time and Deadline tasks, schedutil governor shoots the + * frequency to maximum. And special care must be taken to ensure that + * this kthread doesn't result in that. + * + * This is (mostly) guaranteed by the work_in_progress flag. The flag is + * updated only at the end of the sugov_work() and before that schedutil + * rejects all other frequency scaling requests. + * + * Though there is a very rare case where the RT thread yields right + * after the work_in_progress flag is cleared. The effects of that are + * neglected for now. + */ + queue_kthread_work(&sg_policy->worker, &sg_policy->work); +} + +/************************** sysfs interface ************************/ + +static struct sugov_tunables *global_tunables; +static DEFINE_MUTEX(global_tunables_lock); + +static inline struct sugov_tunables *to_sugov_tunables(struct gov_attr_set *attr_set) +{ + return container_of(attr_set, struct sugov_tunables, attr_set); +} + +static DEFINE_MUTEX(min_rate_lock); + +static void update_min_rate_limit_us(struct sugov_policy *sg_policy) +{ + mutex_lock(&min_rate_lock); + sg_policy->min_rate_limit_ns = min(sg_policy->up_rate_delay_ns, + sg_policy->down_rate_delay_ns); + mutex_unlock(&min_rate_lock); +} + +static ssize_t up_rate_limit_us_show(struct gov_attr_set *attr_set, char *buf) +{ + struct sugov_tunables *tunables = to_sugov_tunables(attr_set); + + return sprintf(buf, "%u\n", tunables->up_rate_limit_us); +} + +static ssize_t down_rate_limit_us_show(struct gov_attr_set *attr_set, char *buf) +{ + struct sugov_tunables *tunables = to_sugov_tunables(attr_set); + + return sprintf(buf, "%u\n", tunables->down_rate_limit_us); +} + +static ssize_t up_rate_limit_us_store(struct gov_attr_set *attr_set, + const char *buf, size_t count) +{ + struct sugov_tunables *tunables = to_sugov_tunables(attr_set); + struct sugov_policy *sg_policy; + unsigned int rate_limit_us; + + if (kstrtouint(buf, 10, &rate_limit_us)) + return -EINVAL; + + tunables->up_rate_limit_us = rate_limit_us; + + list_for_each_entry(sg_policy, &attr_set->policy_list, tunables_hook) { + sg_policy->up_rate_delay_ns = rate_limit_us * NSEC_PER_USEC; + update_min_rate_limit_us(sg_policy); + } + + return count; +} + +static ssize_t down_rate_limit_us_store(struct gov_attr_set *attr_set, + const char *buf, size_t count) +{ + struct sugov_tunables *tunables = to_sugov_tunables(attr_set); + struct sugov_policy *sg_policy; + unsigned int rate_limit_us; + + if (kstrtouint(buf, 10, &rate_limit_us)) + return -EINVAL; + + tunables->down_rate_limit_us = rate_limit_us; + + list_for_each_entry(sg_policy, &attr_set->policy_list, tunables_hook) { + sg_policy->down_rate_delay_ns = rate_limit_us * NSEC_PER_USEC; + update_min_rate_limit_us(sg_policy); + } + + return count; +} + +static struct governor_attr up_rate_limit_us = __ATTR_RW(up_rate_limit_us); +static struct governor_attr down_rate_limit_us = __ATTR_RW(down_rate_limit_us); + +static struct attribute *sugov_attributes[] = { + &up_rate_limit_us.attr, + &down_rate_limit_us.attr, + NULL +}; + +static struct kobj_type sugov_tunables_ktype = { + .default_attrs = sugov_attributes, + .sysfs_ops = &governor_sysfs_ops, +}; + +/********************** cpufreq governor interface *********************/ +#ifndef CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL +static +#endif +struct cpufreq_governor cpufreq_gov_schedutil; + +static struct sugov_policy *sugov_policy_alloc(struct cpufreq_policy *policy) +{ + struct sugov_policy *sg_policy; + + sg_policy = kzalloc(sizeof(*sg_policy), GFP_KERNEL); + if (!sg_policy) + return NULL; + + sg_policy->policy = policy; + init_irq_work(&sg_policy->irq_work, sugov_irq_work); + mutex_init(&sg_policy->work_lock); + raw_spin_lock_init(&sg_policy->update_lock); + return sg_policy; +} + +static void sugov_policy_free(struct sugov_policy *sg_policy) +{ + mutex_destroy(&sg_policy->work_lock); + kfree(sg_policy); +} + +static int sugov_kthread_create(struct sugov_policy *sg_policy) +{ + struct task_struct *thread; + struct sched_param param = { .sched_priority = MAX_USER_RT_PRIO / 2 }; + struct cpufreq_policy *policy = sg_policy->policy; + int ret; + + /* kthread only required for slow path */ + if (policy->fast_switch_enabled) + return 0; + + init_kthread_work(&sg_policy->work, sugov_work); + init_kthread_worker(&sg_policy->worker); + thread = kthread_create(kthread_worker_fn, &sg_policy->worker, + "sugov:%d", + cpumask_first(policy->related_cpus)); + if (IS_ERR(thread)) { + pr_err("failed to create sugov thread: %ld\n", PTR_ERR(thread)); + return PTR_ERR(thread); + } + + ret = sched_setscheduler_nocheck(thread, SCHED_FIFO, ¶m); + if (ret) { + kthread_stop(thread); + pr_warn("%s: failed to set SCHED_FIFO\n", __func__); + return ret; + } + + sg_policy->thread = thread; + kthread_bind_mask(thread, policy->related_cpus); + wake_up_process(thread); + + return 0; +} + +static void sugov_kthread_stop(struct sugov_policy *sg_policy) +{ + /* kthread only required for slow path */ + if (sg_policy->policy->fast_switch_enabled) + return; + + flush_kthread_worker(&sg_policy->worker); + kthread_stop(sg_policy->thread); +} + +static struct sugov_tunables *sugov_tunables_alloc(struct sugov_policy *sg_policy) +{ + struct sugov_tunables *tunables; + + tunables = kzalloc(sizeof(*tunables), GFP_KERNEL); + if (tunables) { + gov_attr_set_init(&tunables->attr_set, &sg_policy->tunables_hook); + if (!have_governor_per_policy()) + global_tunables = tunables; + } + return tunables; +} + +static void sugov_tunables_free(struct sugov_tunables *tunables) +{ + if (!have_governor_per_policy()) + global_tunables = NULL; + + kfree(tunables); +} + +static int sugov_init(struct cpufreq_policy *policy) +{ + struct sugov_policy *sg_policy; + struct sugov_tunables *tunables; + unsigned int lat; + int ret = 0; + + /* State should be equivalent to EXIT */ + if (policy->governor_data) + return -EBUSY; + + sg_policy = sugov_policy_alloc(policy); + if (!sg_policy) + return -ENOMEM; + + ret = sugov_kthread_create(sg_policy); + if (ret) + goto free_sg_policy; + + mutex_lock(&global_tunables_lock); + + if (global_tunables) { + if (WARN_ON(have_governor_per_policy())) { + ret = -EINVAL; + goto stop_kthread; + } + policy->governor_data = sg_policy; + sg_policy->tunables = global_tunables; + + gov_attr_set_get(&global_tunables->attr_set, &sg_policy->tunables_hook); + goto out; + } + + tunables = sugov_tunables_alloc(sg_policy); + if (!tunables) { + ret = -ENOMEM; + goto stop_kthread; + } + + tunables->up_rate_limit_us = LATENCY_MULTIPLIER; + tunables->down_rate_limit_us = LATENCY_MULTIPLIER; + lat = policy->cpuinfo.transition_latency / NSEC_PER_USEC; + if (lat) { + tunables->up_rate_limit_us *= lat; + tunables->down_rate_limit_us *= lat; + } + + policy->governor_data = sg_policy; + sg_policy->tunables = tunables; + + ret = kobject_init_and_add(&tunables->attr_set.kobj, &sugov_tunables_ktype, + get_governor_parent_kobj(policy), "%s", + cpufreq_gov_schedutil.name); + if (ret) + goto fail; + + out: + mutex_unlock(&global_tunables_lock); + + cpufreq_enable_fast_switch(policy); + return 0; + + fail: + policy->governor_data = NULL; + sugov_tunables_free(tunables); + +stop_kthread: + sugov_kthread_stop(sg_policy); + +free_sg_policy: + mutex_unlock(&global_tunables_lock); + + sugov_policy_free(sg_policy); + pr_err("initialization failed (error %d)\n", ret); + return ret; +} + +static int sugov_exit(struct cpufreq_policy *policy) +{ + struct sugov_policy *sg_policy = policy->governor_data; + struct sugov_tunables *tunables = sg_policy->tunables; + unsigned int count; + + cpufreq_disable_fast_switch(policy); + + mutex_lock(&global_tunables_lock); + + count = gov_attr_set_put(&tunables->attr_set, &sg_policy->tunables_hook); + policy->governor_data = NULL; + if (!count) + sugov_tunables_free(tunables); + + mutex_unlock(&global_tunables_lock); + + sugov_kthread_stop(sg_policy); + sugov_policy_free(sg_policy); + + return 0; +} + +static int sugov_start(struct cpufreq_policy *policy) +{ + struct sugov_policy *sg_policy = policy->governor_data; + unsigned int cpu; + + sg_policy->up_rate_delay_ns = + sg_policy->tunables->up_rate_limit_us * NSEC_PER_USEC; + sg_policy->down_rate_delay_ns = + sg_policy->tunables->down_rate_limit_us * NSEC_PER_USEC; + update_min_rate_limit_us(sg_policy); + sg_policy->last_freq_update_time = 0; + sg_policy->next_freq = UINT_MAX; + sg_policy->work_in_progress = false; + sg_policy->need_freq_update = false; + + for_each_cpu(cpu, policy->cpus) { + struct sugov_cpu *sg_cpu = &per_cpu(sugov_cpu, cpu); + + sg_cpu->sg_policy = sg_policy; + if (policy_is_shared(policy)) { + sg_cpu->util = 0; + sg_cpu->max = 0; + sg_cpu->flags = SCHED_CPUFREQ_DL; + sg_cpu->last_update = 0; + sg_cpu->cached_raw_freq = 0; + sg_cpu->iowait_boost = 0; + sg_cpu->iowait_boost_max = policy->cpuinfo.max_freq; + cpufreq_add_update_util_hook(cpu, &sg_cpu->update_util, + sugov_update_shared); + } else { + cpufreq_add_update_util_hook(cpu, &sg_cpu->update_util, + sugov_update_single); + } + } + return 0; +} + +static int sugov_stop(struct cpufreq_policy *policy) +{ + struct sugov_policy *sg_policy = policy->governor_data; + unsigned int cpu; + + for_each_cpu(cpu, policy->cpus) + cpufreq_remove_update_util_hook(cpu); + + synchronize_sched(); + + irq_work_sync(&sg_policy->irq_work); + kthread_cancel_work_sync(&sg_policy->work); + + return 0; +} + +static int sugov_limits(struct cpufreq_policy *policy) +{ + struct sugov_policy *sg_policy = policy->governor_data; + + if (!policy->fast_switch_enabled) { + mutex_lock(&sg_policy->work_lock); + cpufreq_policy_apply_limits(policy); + mutex_unlock(&sg_policy->work_lock); + } + + sg_policy->need_freq_update = true; + + return 0; +} + +static int cpufreq_schedutil_cb(struct cpufreq_policy *policy, + unsigned int event) +{ + switch(event) { + case CPUFREQ_GOV_POLICY_INIT: + return sugov_init(policy); + case CPUFREQ_GOV_POLICY_EXIT: + return sugov_exit(policy); + case CPUFREQ_GOV_START: + return sugov_start(policy); + case CPUFREQ_GOV_STOP: + return sugov_stop(policy); + case CPUFREQ_GOV_LIMITS: + return sugov_limits(policy); + default: + BUG(); + } +} + +#ifndef CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL +static +#endif +struct cpufreq_governor cpufreq_gov_schedutil = { + .name = "schedutil", + .governor = cpufreq_schedutil_cb, + .owner = THIS_MODULE, +}; + +static int __init sugov_register(void) +{ + return cpufreq_register_governor(&cpufreq_gov_schedutil); +} +fs_initcall(sugov_register); diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index 3d55ec89c400..a105e97ab6bf 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -755,6 +755,9 @@ static void update_curr_dl(struct rq *rq) if (unlikely((s64)delta_exec <= 0)) return; + /* kick cpufreq (see the comment in kernel/sched/sched.h). */ + cpufreq_update_this_cpu(rq, SCHED_CPUFREQ_DL); + schedstat_set(curr->se.statistics.exec_max, max(curr->se.statistics.exec_max, delta_exec)); diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c index c8c4272c61d8..ed8e6bb4531b 100644 --- a/kernel/sched/debug.c +++ b/kernel/sched/debug.c @@ -636,6 +636,32 @@ void proc_sched_show_task(struct task_struct *p, struct seq_file *m) P(se.statistics.nr_wakeups_affine_attempts); P(se.statistics.nr_wakeups_passive); P(se.statistics.nr_wakeups_idle); + /* eas */ + /* select_idle_sibling() */ + P(se.statistics.nr_wakeups_sis_attempts); + P(se.statistics.nr_wakeups_sis_idle); + P(se.statistics.nr_wakeups_sis_cache_affine); + P(se.statistics.nr_wakeups_sis_suff_cap); + P(se.statistics.nr_wakeups_sis_idle_cpu); + P(se.statistics.nr_wakeups_sis_count); + /* select_energy_cpu_brute() */ + P(se.statistics.nr_wakeups_secb_attempts); + P(se.statistics.nr_wakeups_secb_sync); + P(se.statistics.nr_wakeups_secb_idle_bt); + P(se.statistics.nr_wakeups_secb_insuff_cap); + P(se.statistics.nr_wakeups_secb_no_nrg_sav); + P(se.statistics.nr_wakeups_secb_nrg_sav); + P(se.statistics.nr_wakeups_secb_count); + /* find_best_target() */ + P(se.statistics.nr_wakeups_fbt_attempts); + P(se.statistics.nr_wakeups_fbt_no_cpu); + P(se.statistics.nr_wakeups_fbt_no_sd); + P(se.statistics.nr_wakeups_fbt_pref_idle); + P(se.statistics.nr_wakeups_fbt_count); + /* cas */ + /* select_task_rq_fair() */ + P(se.statistics.nr_wakeups_cas_attempts); + P(se.statistics.nr_wakeups_cas_count); #if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED) __P(load_avg); diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 3b6038225c17..fc270d4fbd73 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -50,7 +50,6 @@ unsigned int sysctl_sched_latency = 6000000ULL; unsigned int normalized_sysctl_sched_latency = 6000000ULL; -unsigned int sysctl_sched_is_big_little = 0; unsigned int sysctl_sched_sync_hint_enable = 1; unsigned int sysctl_sched_initial_task_util = 0; unsigned int sysctl_sched_cstate_aware = 1; @@ -119,6 +118,12 @@ unsigned int __read_mostly sysctl_sched_shares_window = 10000000UL; unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL; #endif +/* + * The margin used when comparing utilization with CPU capacity: + * util * margin < capacity * 1024 + */ +unsigned int capacity_margin = 1280; /* ~20% */ + static inline void update_load_add(struct load_weight *lw, unsigned long inc) { lw->weight += inc; @@ -294,19 +299,59 @@ static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp) static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq) { if (!cfs_rq->on_list) { + struct rq *rq = rq_of(cfs_rq); + int cpu = cpu_of(rq); /* * Ensure we either appear before our parent (if already * enqueued) or force our parent to appear after us when it is - * enqueued. The fact that we always enqueue bottom-up - * reduces this to two cases. + * enqueued. The fact that we always enqueue bottom-up + * reduces this to two cases and a special case for the root + * cfs_rq. Furthermore, it also means that we will always reset + * tmp_alone_branch either when the branch is connected + * to a tree or when we reach the beg of the tree */ if (cfs_rq->tg->parent && - cfs_rq->tg->parent->cfs_rq[cpu_of(rq_of(cfs_rq))]->on_list) { - list_add_rcu(&cfs_rq->leaf_cfs_rq_list, - &rq_of(cfs_rq)->leaf_cfs_rq_list); - } else { + cfs_rq->tg->parent->cfs_rq[cpu]->on_list) { + /* + * If parent is already on the list, we add the child + * just before. Thanks to circular linked property of + * the list, this means to put the child at the tail + * of the list that starts by parent. + */ + list_add_tail_rcu(&cfs_rq->leaf_cfs_rq_list, + &(cfs_rq->tg->parent->cfs_rq[cpu]->leaf_cfs_rq_list)); + /* + * The branch is now connected to its tree so we can + * reset tmp_alone_branch to the beginning of the + * list. + */ + rq->tmp_alone_branch = &rq->leaf_cfs_rq_list; + } else if (!cfs_rq->tg->parent) { + /* + * cfs rq without parent should be put + * at the tail of the list. + */ list_add_tail_rcu(&cfs_rq->leaf_cfs_rq_list, - &rq_of(cfs_rq)->leaf_cfs_rq_list); + &rq->leaf_cfs_rq_list); + /* + * We have reach the beg of a tree so we can reset + * tmp_alone_branch to the beginning of the list. + */ + rq->tmp_alone_branch = &rq->leaf_cfs_rq_list; + } else { + /* + * The parent has not already been added so we want to + * make sure that it will be put after us. + * tmp_alone_branch points to the beg of the branch + * where we will add parent. + */ + list_add_rcu(&cfs_rq->leaf_cfs_rq_list, + rq->tmp_alone_branch); + /* + * update tmp_alone_branch to points to the new beg + * of the branch + */ + rq->tmp_alone_branch = &cfs_rq->leaf_cfs_rq_list; } cfs_rq->on_list = 1; @@ -664,7 +709,7 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se) } #ifdef CONFIG_SMP -static int select_idle_sibling(struct task_struct *p, int cpu); +static int select_idle_sibling(struct task_struct *p, int prev_cpu, int cpu); static unsigned long task_h_load(struct task_struct *p); /* @@ -688,20 +733,115 @@ void init_entity_runnable_average(struct sched_entity *se) * will definitely be update (after enqueue). */ sa->period_contrib = 1023; - sa->load_avg = scale_load_down(se->load.weight); + /* + * Tasks are intialized with full load to be seen as heavy tasks until + * they get a chance to stabilize to their real load level. + * Group entities are intialized with zero load to reflect the fact that + * nothing has been attached to the task group yet. + */ + if (entity_is_task(se)) + sa->load_avg = scale_load_down(se->load.weight); sa->load_sum = sa->load_avg * LOAD_AVG_MAX; - sa->util_avg = sched_freq() ? - sysctl_sched_initial_task_util : - scale_load_down(SCHED_LOAD_SCALE); - sa->util_sum = sa->util_avg * LOAD_AVG_MAX; + /* + * In previous Android versions, we used to have: + * sa->util_avg = sched_freq() ? + * sysctl_sched_initial_task_util : + * scale_load_down(SCHED_LOAD_SCALE); + * sa->util_sum = sa->util_avg * LOAD_AVG_MAX; + * However, that functionality has been moved to enqueue. + * It is unclear if we should restore this in enqueue. + */ + /* + * At this point, util_avg won't be used in select_task_rq_fair anyway + */ + sa->util_avg = 0; + sa->util_sum = 0; /* when this task enqueue'ed, it will contribute to its cfs_rq's load_avg */ } +static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq); +static void attach_entity_cfs_rq(struct sched_entity *se); + +/* + * With new tasks being created, their initial util_avgs are extrapolated + * based on the cfs_rq's current util_avg: + * + * util_avg = cfs_rq->util_avg / (cfs_rq->load_avg + 1) * se.load.weight + * + * However, in many cases, the above util_avg does not give a desired + * value. Moreover, the sum of the util_avgs may be divergent, such + * as when the series is a harmonic series. + * + * To solve this problem, we also cap the util_avg of successive tasks to + * only 1/2 of the left utilization budget: + * + * util_avg_cap = (1024 - cfs_rq->avg.util_avg) / 2^n + * + * where n denotes the nth task. + * + * For example, a simplest series from the beginning would be like: + * + * task util_avg: 512, 256, 128, 64, 32, 16, 8, ... + * cfs_rq util_avg: 512, 768, 896, 960, 992, 1008, 1016, ... + * + * Finally, that extrapolated util_avg is clamped to the cap (util_avg_cap) + * if util_avg > util_avg_cap. + */ +void post_init_entity_util_avg(struct sched_entity *se) +{ + struct cfs_rq *cfs_rq = cfs_rq_of(se); + struct sched_avg *sa = &se->avg; + long cap = (long)(SCHED_CAPACITY_SCALE - cfs_rq->avg.util_avg) / 2; + + if (cap > 0) { + if (cfs_rq->avg.util_avg != 0) { + sa->util_avg = cfs_rq->avg.util_avg * se->load.weight; + sa->util_avg /= (cfs_rq->avg.load_avg + 1); + + if (sa->util_avg > cap) + sa->util_avg = cap; + } else { + sa->util_avg = cap; + } + /* + * If we wish to restore tuning via setting initial util, + * this is where we should do it. + */ + sa->util_sum = sa->util_avg * LOAD_AVG_MAX; + } + + if (entity_is_task(se)) { + struct task_struct *p = task_of(se); + if (p->sched_class != &fair_sched_class) { + /* + * For !fair tasks do: + * + update_cfs_rq_load_avg(now, cfs_rq, false); + attach_entity_load_avg(cfs_rq, se); + switched_from_fair(rq, p); + * + * such that the next switched_to_fair() has the + * expected state. + */ + se->avg.last_update_time = cfs_rq_clock_task(cfs_rq); + return; + } + } + + attach_entity_cfs_rq(se); +} + #else void init_entity_runnable_average(struct sched_entity *se) { } -#endif +void post_init_entity_util_avg(struct sched_entity *se) +{ +} +static void update_tg_load_avg(struct cfs_rq *cfs_rq, int force) +{ +} +#endif /* CONFIG_SMP */ /* * Update the current task's runtime statistics. @@ -1425,7 +1565,8 @@ balance: * Call select_idle_sibling to maybe find a better one. */ if (!cur) - env->dst_cpu = select_idle_sibling(env->p, env->dst_cpu); + env->dst_cpu = select_idle_sibling(env->p, env->src_cpu, + env->dst_cpu); assign: assigned = true; @@ -2410,28 +2551,22 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se) #ifdef CONFIG_FAIR_GROUP_SCHED # ifdef CONFIG_SMP -static inline long calc_tg_weight(struct task_group *tg, struct cfs_rq *cfs_rq) +static long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg) { - long tg_weight; + long tg_weight, load, shares; /* - * Use this CPU's real-time load instead of the last load contribution - * as the updating of the contribution is delayed, and we will use the - * the real-time load to calc the share. See update_tg_load_avg(). + * This really should be: cfs_rq->avg.load_avg, but instead we use + * cfs_rq->load.weight, which is its upper bound. This helps ramp up + * the shares for small weight interactive tasks. */ - tg_weight = atomic_long_read(&tg->load_avg); - tg_weight -= cfs_rq->tg_load_avg_contrib; - tg_weight += cfs_rq->load.weight; + load = scale_load_down(cfs_rq->load.weight); - return tg_weight; -} - -static long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg) -{ - long tg_weight, load, shares; + tg_weight = atomic_long_read(&tg->load_avg); - tg_weight = calc_tg_weight(tg, cfs_rq); - load = cfs_rq->load.weight; + /* Ensure tg_weight >= load */ + tg_weight -= cfs_rq->tg_load_avg_contrib; + tg_weight += load; shares = (tg->shares * load); if (tg_weight) @@ -2450,6 +2585,7 @@ static inline long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg) return tg->shares; } # endif /* CONFIG_SMP */ + static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, unsigned long weight) { @@ -2468,16 +2604,20 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, static inline int throttled_hierarchy(struct cfs_rq *cfs_rq); -static void update_cfs_shares(struct cfs_rq *cfs_rq) +static void update_cfs_shares(struct sched_entity *se) { + struct cfs_rq *cfs_rq = group_cfs_rq(se); struct task_group *tg; - struct sched_entity *se; long shares; - tg = cfs_rq->tg; - se = tg->se[cpu_of(rq_of(cfs_rq))]; - if (!se || throttled_hierarchy(cfs_rq)) + if (!cfs_rq) + return; + + if (throttled_hierarchy(cfs_rq)) return; + + tg = cfs_rq->tg; + #ifndef CONFIG_SMP if (likely(se->load.weight == tg->shares)) return; @@ -2486,8 +2626,9 @@ static void update_cfs_shares(struct cfs_rq *cfs_rq) reweight_entity(cfs_rq_of(se), se, shares); } + #else /* CONFIG_FAIR_GROUP_SCHED */ -static inline void update_cfs_shares(struct cfs_rq *cfs_rq) +static inline void update_cfs_shares(struct sched_entity *se) { } #endif /* CONFIG_FAIR_GROUP_SCHED */ @@ -3789,25 +3930,262 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa, return decayed; } -#ifdef CONFIG_FAIR_GROUP_SCHED /* - * Updating tg's load_avg is necessary before update_cfs_share (which is done) - * and effective_load (which is not done because it is too costly). + * Signed add and clamp on underflow. + * + * Explicitly do a load-store to ensure the intermediate value never hits + * memory. This allows lockless observations without ever seeing the negative + * values. + */ +#define add_positive(_ptr, _val) do { \ + typeof(_ptr) ptr = (_ptr); \ + typeof(_val) val = (_val); \ + typeof(*ptr) res, var = READ_ONCE(*ptr); \ + \ + res = var + val; \ + \ + if (val < 0 && res > var) \ + res = 0; \ + \ + WRITE_ONCE(*ptr, res); \ +} while (0) + +#ifdef CONFIG_FAIR_GROUP_SCHED +/** + * update_tg_load_avg - update the tg's load avg + * @cfs_rq: the cfs_rq whose avg changed + * @force: update regardless of how small the difference + * + * This function 'ensures': tg->load_avg := \Sum tg->cfs_rq[]->avg.load. + * However, because tg->load_avg is a global value there are performance + * considerations. + * + * In order to avoid having to look at the other cfs_rq's, we use a + * differential update where we store the last value we propagated. This in + * turn allows skipping updates if the differential is 'small'. + * + * Updating tg's load_avg is necessary before update_cfs_share() (which is + * done) and effective_load() (which is not done because it is too costly). */ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force) { long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib; + /* + * No need to update load_avg for root_task_group as it is not used. + */ + if (cfs_rq->tg == &root_task_group) + return; + if (force || abs(delta) > cfs_rq->tg_load_avg_contrib / 64) { atomic_long_add(delta, &cfs_rq->tg->load_avg); cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg; } } +/* + * Called within set_task_rq() right before setting a task's cpu. The + * caller only guarantees p->pi_lock is held; no other assumptions, + * including the state of rq->lock, should be made. + */ +void set_task_rq_fair(struct sched_entity *se, + struct cfs_rq *prev, struct cfs_rq *next) +{ + if (!sched_feat(ATTACH_AGE_LOAD)) + return; + + /* + * We are supposed to update the task to "current" time, then its up to + * date and ready to go to new CPU/cfs_rq. But we have difficulty in + * getting what current time is, so simply throw away the out-of-date + * time. This will result in the wakee task is less decayed, but giving + * the wakee more load sounds not bad. + */ + if (se->avg.last_update_time && prev) { + u64 p_last_update_time; + u64 n_last_update_time; + +#ifndef CONFIG_64BIT + u64 p_last_update_time_copy; + u64 n_last_update_time_copy; + + do { + p_last_update_time_copy = prev->load_last_update_time_copy; + n_last_update_time_copy = next->load_last_update_time_copy; + + smp_rmb(); + + p_last_update_time = prev->avg.last_update_time; + n_last_update_time = next->avg.last_update_time; + + } while (p_last_update_time != p_last_update_time_copy || + n_last_update_time != n_last_update_time_copy); +#else + p_last_update_time = prev->avg.last_update_time; + n_last_update_time = next->avg.last_update_time; +#endif + __update_load_avg(p_last_update_time, cpu_of(rq_of(prev)), + &se->avg, 0, 0, NULL); + se->avg.last_update_time = n_last_update_time; + } +} + +/* Take into account change of utilization of a child task group */ +static inline void +update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se) +{ + struct cfs_rq *gcfs_rq = group_cfs_rq(se); + long delta = gcfs_rq->avg.util_avg - se->avg.util_avg; + + /* Nothing to update */ + if (!delta) + return; + + /* Set new sched_entity's utilization */ + se->avg.util_avg = gcfs_rq->avg.util_avg; + se->avg.util_sum = se->avg.util_avg * LOAD_AVG_MAX; + + /* Update parent cfs_rq utilization */ + add_positive(&cfs_rq->avg.util_avg, delta); + cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * LOAD_AVG_MAX; +} + +/* Take into account change of load of a child task group */ +static inline void +update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se) +{ + struct cfs_rq *gcfs_rq = group_cfs_rq(se); + long delta, load = gcfs_rq->avg.load_avg; + + /* + * If the load of group cfs_rq is null, the load of the + * sched_entity will also be null so we can skip the formula + */ + if (load) { + long tg_load; + + /* Get tg's load and ensure tg_load > 0 */ + tg_load = atomic_long_read(&gcfs_rq->tg->load_avg) + 1; + + /* Ensure tg_load >= load and updated with current load*/ + tg_load -= gcfs_rq->tg_load_avg_contrib; + tg_load += load; + + /* + * We need to compute a correction term in the case that the + * task group is consuming more CPU than a task of equal + * weight. A task with a weight equals to tg->shares will have + * a load less or equal to scale_load_down(tg->shares). + * Similarly, the sched_entities that represent the task group + * at parent level, can't have a load higher than + * scale_load_down(tg->shares). And the Sum of sched_entities' + * load must be <= scale_load_down(tg->shares). + */ + if (tg_load > scale_load_down(gcfs_rq->tg->shares)) { + /* scale gcfs_rq's load into tg's shares*/ + load *= scale_load_down(gcfs_rq->tg->shares); + load /= tg_load; + } + } + + delta = load - se->avg.load_avg; + + /* Nothing to update */ + if (!delta) + return; + + /* Set new sched_entity's load */ + se->avg.load_avg = load; + se->avg.load_sum = se->avg.load_avg * LOAD_AVG_MAX; + + /* Update parent cfs_rq load */ + add_positive(&cfs_rq->avg.load_avg, delta); + cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * LOAD_AVG_MAX; + + /* + * If the sched_entity is already enqueued, we also have to update the + * runnable load avg. + */ + if (se->on_rq) { + /* Update parent cfs_rq runnable_load_avg */ + add_positive(&cfs_rq->runnable_load_avg, delta); + cfs_rq->runnable_load_sum = cfs_rq->runnable_load_avg * LOAD_AVG_MAX; + } +} + +static inline void set_tg_cfs_propagate(struct cfs_rq *cfs_rq) +{ + cfs_rq->propagate_avg = 1; +} + +static inline int test_and_clear_tg_cfs_propagate(struct sched_entity *se) +{ + struct cfs_rq *cfs_rq = group_cfs_rq(se); + + if (!cfs_rq->propagate_avg) + return 0; + + cfs_rq->propagate_avg = 0; + return 1; +} + +/* Update task and its cfs_rq load average */ +static inline int propagate_entity_load_avg(struct sched_entity *se) +{ + struct cfs_rq *cfs_rq; + + if (entity_is_task(se)) + return 0; + + if (!test_and_clear_tg_cfs_propagate(se)) + return 0; + + cfs_rq = cfs_rq_of(se); + + set_tg_cfs_propagate(cfs_rq); + + update_tg_cfs_util(cfs_rq, se); + update_tg_cfs_load(cfs_rq, se); + + return 1; +} + #else /* CONFIG_FAIR_GROUP_SCHED */ + static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force) {} + +static inline int propagate_entity_load_avg(struct sched_entity *se) +{ + return 0; +} + +static inline void set_tg_cfs_propagate(struct cfs_rq *cfs_rq) {} + #endif /* CONFIG_FAIR_GROUP_SCHED */ +static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq) +{ + if (&this_rq()->cfs == cfs_rq) { + /* + * There are a few boundary cases this might miss but it should + * get called often enough that that should (hopefully) not be + * a real problem -- added to that it only calls on the local + * CPU, so if we enqueue remotely we'll miss an update, but + * the next tick/schedule should update. + * + * It will not get called when we go idle, because the idle + * thread is a different class (!fair), nor will the utilization + * number include things like RT tasks. + * + * As is, the util number is not freq-invariant (we'd have to + * implement arch_scale_freq_capacity() for that). + * + * See cpu_util(). + */ + cpufreq_update_util(rq_of(cfs_rq), 0); + } +} + static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq); /* @@ -3827,23 +4205,43 @@ static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq); WRITE_ONCE(*ptr, res); \ } while (0) -/* Group cfs_rq's load_avg is used for task_h_load and update_cfs_share */ -static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq) +/** + * update_cfs_rq_load_avg - update the cfs_rq's load/util averages + * @now: current time, as per cfs_rq_clock_task() + * @cfs_rq: cfs_rq to update + * @update_freq: should we call cfs_rq_util_change() or will the call do so + * + * The cfs_rq avg is the direct sum of all its entities (blocked and runnable) + * avg. The immediate corollary is that all (fair) tasks must be attached, see + * post_init_entity_util_avg(). + * + * cfs_rq->avg is used for task_h_load() and update_cfs_share() for example. + * + * Returns true if the load decayed or we removed load. + * + * Since both these conditions indicate a changed cfs_rq->avg.load we should + * call update_tg_load_avg() when this function returns true. + */ +static inline int +update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq) { struct sched_avg *sa = &cfs_rq->avg; - int decayed, removed = 0; + int decayed, removed = 0, removed_util = 0; if (atomic_long_read(&cfs_rq->removed_load_avg)) { s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0); sub_positive(&sa->load_avg, r); sub_positive(&sa->load_sum, r * LOAD_AVG_MAX); removed = 1; + set_tg_cfs_propagate(cfs_rq); } if (atomic_long_read(&cfs_rq->removed_util_avg)) { long r = atomic_long_xchg(&cfs_rq->removed_util_avg, 0); sub_positive(&sa->util_avg, r); sub_positive(&sa->util_sum, r * LOAD_AVG_MAX); + removed_util = 1; + set_tg_cfs_propagate(cfs_rq); } decayed = __update_load_avg(now, cpu_of(rq_of(cfs_rq)), sa, @@ -3858,68 +4256,89 @@ static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq) if (cfs_rq == &rq_of(cfs_rq)->cfs) trace_sched_load_avg_cpu(cpu_of(rq_of(cfs_rq)), cfs_rq); + if (update_freq && (decayed || removed_util)) + cfs_rq_util_change(cfs_rq); + return decayed || removed; } +/* + * Optional action to be done while updating the load average + */ +#define UPDATE_TG 0x1 +#define SKIP_AGE_LOAD 0x2 + /* Update task and its cfs_rq load average */ -static inline void update_load_avg(struct sched_entity *se, int update_tg) +static inline void update_load_avg(struct sched_entity *se, int flags) { struct cfs_rq *cfs_rq = cfs_rq_of(se); u64 now = cfs_rq_clock_task(cfs_rq); int cpu = cpu_of(rq_of(cfs_rq)); + int decayed; + void *ptr = NULL; /* * Track task load average for carrying it to new CPU after migrated, and * track group sched_entity load average for task_h_load calc in migration */ - __update_load_avg(now, cpu, &se->avg, + if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD)) { + __update_load_avg(now, cpu, &se->avg, se->on_rq * scale_load_down(se->load.weight), cfs_rq->curr == se, NULL); + } - if (update_cfs_rq_load_avg(now, cfs_rq) && update_tg) + decayed = update_cfs_rq_load_avg(now, cfs_rq, true); + decayed |= propagate_entity_load_avg(se); + + if (decayed && (flags & UPDATE_TG)) update_tg_load_avg(cfs_rq, 0); - if (entity_is_task(se)) - trace_sched_load_avg_task(task_of(se), &se->avg); + if (entity_is_task(se)) { +#ifdef CONFIG_SCHED_WALT + ptr = (void *)&(task_of(se)->ravg); +#endif + trace_sched_load_avg_task(task_of(se), &se->avg, ptr); + } } +/** + * attach_entity_load_avg - attach this entity to its cfs_rq load avg + * @cfs_rq: cfs_rq to attach to + * @se: sched_entity to attach + * + * Must call update_cfs_rq_load_avg() before this, since we rely on + * cfs_rq->avg.last_update_time being current. + */ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) { - if (!sched_feat(ATTACH_AGE_LOAD)) - goto skip_aging; - - /* - * If we got migrated (either between CPUs or between cgroups) we'll - * have aged the average right before clearing @last_update_time. - */ - if (se->avg.last_update_time) { - __update_load_avg(cfs_rq->avg.last_update_time, cpu_of(rq_of(cfs_rq)), - &se->avg, 0, 0, NULL); - - /* - * XXX: we could have just aged the entire load away if we've been - * absent from the fair class for too long. - */ - } - -skip_aging: se->avg.last_update_time = cfs_rq->avg.last_update_time; cfs_rq->avg.load_avg += se->avg.load_avg; cfs_rq->avg.load_sum += se->avg.load_sum; cfs_rq->avg.util_avg += se->avg.util_avg; cfs_rq->avg.util_sum += se->avg.util_sum; + set_tg_cfs_propagate(cfs_rq); + + cfs_rq_util_change(cfs_rq); } +/** + * detach_entity_load_avg - detach this entity from its cfs_rq load avg + * @cfs_rq: cfs_rq to detach from + * @se: sched_entity to detach + * + * Must call update_cfs_rq_load_avg() before this, since we rely on + * cfs_rq->avg.last_update_time being current. + */ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) { - __update_load_avg(cfs_rq->avg.last_update_time, cpu_of(rq_of(cfs_rq)), - &se->avg, se->on_rq * scale_load_down(se->load.weight), - cfs_rq->curr == se, NULL); sub_positive(&cfs_rq->avg.load_avg, se->avg.load_avg); sub_positive(&cfs_rq->avg.load_sum, se->avg.load_sum); sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg); sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum); + set_tg_cfs_propagate(cfs_rq); + + cfs_rq_util_change(cfs_rq); } /* Add the load generated by se into cfs_rq's load average */ @@ -3927,34 +4346,20 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) { struct sched_avg *sa = &se->avg; - u64 now = cfs_rq_clock_task(cfs_rq); - int migrated, decayed; - - migrated = !sa->last_update_time; - if (!migrated) { - __update_load_avg(now, cpu_of(rq_of(cfs_rq)), sa, - se->on_rq * scale_load_down(se->load.weight), - cfs_rq->curr == se, NULL); - } - - decayed = update_cfs_rq_load_avg(now, cfs_rq); cfs_rq->runnable_load_avg += sa->load_avg; cfs_rq->runnable_load_sum += sa->load_sum; - if (migrated) + if (!sa->last_update_time) { attach_entity_load_avg(cfs_rq, se); - - if (decayed || migrated) update_tg_load_avg(cfs_rq, 0); + } } /* Remove the runnable load generated by se from cfs_rq's runnable load average */ static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) { - update_load_avg(se, 1); - cfs_rq->runnable_load_avg = max_t(long, cfs_rq->runnable_load_avg - se->avg.load_avg, 0); cfs_rq->runnable_load_sum = @@ -3983,13 +4388,25 @@ static inline u64 cfs_rq_last_update_time(struct cfs_rq *cfs_rq) #endif /* + * Synchronize entity load avg of dequeued entity without locking + * the previous rq. + */ +void sync_entity_load_avg(struct sched_entity *se) +{ + struct cfs_rq *cfs_rq = cfs_rq_of(se); + u64 last_update_time; + + last_update_time = cfs_rq_last_update_time(cfs_rq); + __update_load_avg(last_update_time, cpu_of(rq_of(cfs_rq)), &se->avg, 0, 0, NULL); +} + +/* * Task first catches up with cfs_rq, and then subtract * itself from the cfs_rq (task must be off the queue now). */ void remove_entity_load_avg(struct sched_entity *se) { struct cfs_rq *cfs_rq = cfs_rq_of(se); - u64 last_update_time; /* * Newly created task or never used group entity should not be removed @@ -3998,9 +4415,7 @@ void remove_entity_load_avg(struct sched_entity *se) if (se->avg.last_update_time == 0) return; - last_update_time = cfs_rq_last_update_time(cfs_rq); - - __update_load_avg(last_update_time, cpu_of(rq_of(cfs_rq)), &se->avg, 0, 0, NULL); + sync_entity_load_avg(se); atomic_long_add(se->avg.load_avg, &cfs_rq->removed_load_avg); atomic_long_add(se->avg.util_avg, &cfs_rq->removed_util_avg); } @@ -4037,7 +4452,16 @@ static int idle_balance(struct rq *this_rq); #else /* CONFIG_SMP */ -static inline void update_load_avg(struct sched_entity *se, int update_tg) {} +static inline int +update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq) +{ + return 0; +} + +#define UPDATE_TG 0x0 +#define SKIP_AGE_LOAD 0x0 + +static inline void update_load_avg(struct sched_entity *se, int not_used1){} static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {} static inline void @@ -4186,9 +4610,10 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) * Update run-time statistics of the 'current'. */ update_curr(cfs_rq); + update_load_avg(se, UPDATE_TG); enqueue_entity_load_avg(cfs_rq, se); + update_cfs_shares(se); account_entity_enqueue(cfs_rq, se); - update_cfs_shares(cfs_rq); if (flags & ENQUEUE_WAKEUP) { place_entity(cfs_rq, se, 0); @@ -4261,6 +4686,16 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) * Update run-time statistics of the 'current'. */ update_curr(cfs_rq); + + /* + * When dequeuing a sched_entity, we must: + * - Update loads to have both entity and cfs_rq synced with now. + * - Substract its load from the cfs_rq->runnable_avg. + * - Substract its previous weight from cfs_rq->load.weight. + * - For group entity, update its weight to reflect the new share + * of its group cfs_rq. + */ + update_load_avg(se, UPDATE_TG); dequeue_entity_load_avg(cfs_rq, se); update_stats_dequeue(cfs_rq, se); @@ -4296,7 +4731,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) return_cfs_rq_runtime(cfs_rq); update_min_vruntime(cfs_rq); - update_cfs_shares(cfs_rq); + update_cfs_shares(se); } /* @@ -4351,7 +4786,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se) */ update_stats_wait_end(cfs_rq, se); __dequeue_entity(cfs_rq, se); - update_load_avg(se, 1); + update_load_avg(se, UPDATE_TG); } update_stats_curr_start(cfs_rq, se); @@ -4467,8 +4902,8 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued) /* * Ensure that runnable average is periodically updated. */ - update_load_avg(curr, 1); - update_cfs_shares(cfs_rq); + update_load_avg(curr, UPDATE_TG); + update_cfs_shares(curr); #ifdef CONFIG_SCHED_HRTICK /* @@ -5372,7 +5807,7 @@ static inline void hrtick_update(struct rq *rq) #ifdef CONFIG_SMP static bool cpu_overutilized(int cpu); -static inline unsigned long boosted_cpu_util(int cpu); +unsigned long boosted_cpu_util(int cpu); #else #define boosted_cpu_util(cpu) cpu_util(cpu) #endif @@ -5409,6 +5844,14 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags) int task_wakeup = flags & ENQUEUE_WAKEUP; #endif + /* + * If in_iowait is set, the code below may not trigger any cpufreq + * utilization updates, so do it here explicitly with the IOWAIT flag + * passed. + */ + if (p->in_iowait) + cpufreq_update_this_cpu(rq, SCHED_CPUFREQ_IOWAIT); + for_each_sched_entity(se) { if (se->on_rq) break; @@ -5437,8 +5880,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags) if (cfs_rq_throttled(cfs_rq)) break; - update_load_avg(se, 1); - update_cfs_shares(cfs_rq); + update_load_avg(se, UPDATE_TG); + update_cfs_shares(se); } if (!se) { @@ -5543,8 +5986,8 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) if (cfs_rq_throttled(cfs_rq)) break; - update_load_avg(se, 1); - update_cfs_shares(cfs_rq); + update_load_avg(se, UPDATE_TG); + update_cfs_shares(se); } if (!se) { @@ -6150,15 +6593,7 @@ static int sched_group_energy(struct energy_env *eenv) */ sd = rcu_dereference(per_cpu(sd_scs, cpu)); - if (!sd) - /* - * We most probably raced with hotplug; returning a - * wrong energy estimation is better than entering an - * infinite loop. - */ - return -EINVAL; - - if (sd->parent) + if (sd && sd->parent) sg_shared_cap = sd->parent->groups; for_each_domain(cpu, sd) { @@ -6213,6 +6648,14 @@ static int sched_group_energy(struct energy_env *eenv) } while (sg = sg->next, sg != sd->groups); } + + /* + * If we raced with hotplug and got an sd NULL-pointer; + * returning a wrong energy estimation is better than + * entering an infinite loop. + */ + if (cpumask_test_cpu(cpu, &visit_cpus)) + return -EINVAL; next_cpu: cpumask_clear_cpu(cpu, &visit_cpus); continue; @@ -6239,6 +6682,7 @@ static inline int __energy_diff(struct energy_env *eenv) struct sched_domain *sd; struct sched_group *sg; int sd_cpu = -1, energy_before = 0, energy_after = 0; + int diff, margin; struct energy_env eenv_before = { .util_delta = 0, @@ -6281,12 +6725,22 @@ static inline int __energy_diff(struct energy_env *eenv) eenv->nrg.after = energy_after; eenv->nrg.diff = eenv->nrg.after - eenv->nrg.before; eenv->payoff = 0; - +#ifndef CONFIG_SCHED_TUNE trace_sched_energy_diff(eenv->task, eenv->src_cpu, eenv->dst_cpu, eenv->util_delta, eenv->nrg.before, eenv->nrg.after, eenv->nrg.diff, eenv->cap.before, eenv->cap.after, eenv->cap.delta, eenv->nrg.delta, eenv->payoff); +#endif + /* + * Dead-zone margin preventing too many migrations. + */ + + margin = eenv->nrg.before >> 6; /* ~1.56% */ + + diff = eenv->nrg.after - eenv->nrg.before; + + eenv->nrg.diff = (abs(diff) < margin) ? 0 : eenv->nrg.diff; return eenv->nrg.diff; } @@ -6294,30 +6748,37 @@ static inline int __energy_diff(struct energy_env *eenv) #ifdef CONFIG_SCHED_TUNE struct target_nrg schedtune_target_nrg; - +extern bool schedtune_initialized; /* * System energy normalization - * Returns the normalized value, in the range [0..SCHED_LOAD_SCALE], + * Returns the normalized value, in the range [0..SCHED_CAPACITY_SCALE], * corresponding to the specified energy variation. */ static inline int normalize_energy(int energy_diff) { u32 normalized_nrg; + + /* during early setup, we don't know the extents */ + if (unlikely(!schedtune_initialized)) + return energy_diff < 0 ? -1 : 1 ; + #ifdef CONFIG_SCHED_DEBUG + { int max_delta; /* Check for boundaries */ max_delta = schedtune_target_nrg.max_power; max_delta -= schedtune_target_nrg.min_power; WARN_ON(abs(energy_diff) >= max_delta); + } #endif /* Do scaling using positive numbers to increase the range */ normalized_nrg = (energy_diff < 0) ? -energy_diff : energy_diff; /* Scale by energy magnitude */ - normalized_nrg <<= SCHED_LOAD_SHIFT; + normalized_nrg <<= SCHED_CAPACITY_SHIFT; /* Normalize on max energy for target platform */ normalized_nrg = reciprocal_divide( @@ -6348,6 +6809,12 @@ energy_diff(struct energy_env *eenv) eenv->cap.delta, eenv->task); + trace_sched_energy_diff(eenv->task, + eenv->src_cpu, eenv->dst_cpu, eenv->util_delta, + eenv->nrg.before, eenv->nrg.after, eenv->nrg.diff, + eenv->cap.before, eenv->cap.after, eenv->cap.delta, + eenv->nrg.delta, eenv->payoff); + /* * When SchedTune is enabled, the energy_diff() function will return * the computed energy payoff value. Since the energy_diff() return @@ -6387,18 +6854,18 @@ static int wake_wide(struct task_struct *p) return 1; } -static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync) +static int wake_affine(struct sched_domain *sd, struct task_struct *p, + int prev_cpu, int sync) { s64 this_load, load; s64 this_eff_load, prev_eff_load; - int idx, this_cpu, prev_cpu; + int idx, this_cpu; struct task_group *tg; unsigned long weight; int balanced; idx = sd->wake_idx; this_cpu = smp_processor_id(); - prev_cpu = task_cpu(p); load = source_load(prev_cpu, idx); this_load = target_load(this_cpu, idx); @@ -6458,8 +6925,6 @@ static inline unsigned long task_util(struct task_struct *p) return p->se.avg.util_avg; } -unsigned int capacity_margin = 1280; /* ~20% margin */ - static inline unsigned long boosted_task_util(struct task_struct *task); static inline bool __task_fits(struct task_struct *p, int cpu, int util) @@ -6485,11 +6950,6 @@ static inline bool task_fits_max(struct task_struct *p, int cpu) return __task_fits(p, cpu, 0); } -static inline bool task_fits_spare(struct task_struct *p, int cpu) -{ - return __task_fits(p, cpu, cpu_util(cpu)); -} - static bool cpu_overutilized(int cpu) { return (capacity_of(cpu) * 1024) < (cpu_util(cpu) * capacity_margin); @@ -6497,6 +6957,8 @@ static bool cpu_overutilized(int cpu) #ifdef CONFIG_SCHED_TUNE +struct reciprocal_value schedtune_spc_rdiv; + static long schedtune_margin(unsigned long signal, long boost) { @@ -6507,29 +6969,16 @@ schedtune_margin(unsigned long signal, long boost) * * The Boost (B) value is used to compute a Margin (M) which is * proportional to the complement of the original Signal (S): - * M = B * (SCHED_LOAD_SCALE - S), if B is positive - * M = B * S, if B is negative + * M = B * (SCHED_CAPACITY_SCALE - S) * The obtained M could be used by the caller to "boost" S. */ if (boost >= 0) { - margin = SCHED_LOAD_SCALE - signal; + margin = SCHED_CAPACITY_SCALE - signal; margin *= boost; } else margin = -signal * boost; - /* - * Fast integer division by constant: - * Constant : (C) = 100 - * Precision : 0.1% (P) = 0.1 - * Reference : C * 100 / P (R) = 100000 - * - * Thus: - * Shift bits : ceil(log(R,2)) (S) = 17 - * Mult const : round(2^S/C) (M) = 1311 - * - * - */ - margin *= 1311; - margin >>= 17; + + margin = reciprocal_divide(margin, schedtune_spc_rdiv); if (boost < 0) margin *= -1; @@ -6579,7 +7028,7 @@ schedtune_task_margin(struct task_struct *task) #endif /* CONFIG_SCHED_TUNE */ -static inline unsigned long +unsigned long boosted_cpu_util(int cpu) { unsigned long util = cpu_util(cpu); @@ -6601,6 +7050,13 @@ boosted_task_util(struct task_struct *task) return util + margin; } +static int cpu_util_wake(int cpu, struct task_struct *p); + +static unsigned long capacity_spare_wake(int cpu, struct task_struct *p) +{ + return capacity_orig_of(cpu) - cpu_util_wake(cpu, p); +} + /* * find_idlest_group finds and returns the least busy CPU group within the * domain. @@ -6610,10 +7066,9 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu, int sd_flag) { struct sched_group *idlest = NULL, *group = sd->groups; - struct sched_group *fit_group = NULL, *spare_group = NULL; + struct sched_group *most_spare_sg = NULL; unsigned long min_load = ULONG_MAX, this_load = 0; - unsigned long fit_capacity = ULONG_MAX; - unsigned long max_spare_capacity = capacity_margin - SCHED_LOAD_SCALE; + unsigned long most_spare = 0, this_spare = 0; int load_idx = sd->forkexec_idx; int imbalance = 100 + (sd->imbalance_pct-100)/2; @@ -6621,7 +7076,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, load_idx = sd->wake_idx; do { - unsigned long load, avg_load, spare_capacity; + unsigned long load, avg_load, spare_cap, max_spare_cap; int local_group; int i; @@ -6633,8 +7088,12 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, local_group = cpumask_test_cpu(this_cpu, sched_group_cpus(group)); - /* Tally up the load of all CPUs in the group */ + /* + * Tally up the load of all CPUs in the group and find + * the group containing the CPU with most spare capacity. + */ avg_load = 0; + max_spare_cap = 0; for_each_cpu(i, sched_group_cpus(group)) { /* Bias balancing toward cpus of our domain */ @@ -6645,24 +7104,10 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, avg_load += load; - /* - * Look for most energy-efficient group that can fit - * that can fit the task. - */ - if (capacity_of(i) < fit_capacity && task_fits_spare(p, i)) { - fit_capacity = capacity_of(i); - fit_group = group; - } + spare_cap = capacity_spare_wake(i, p); - /* - * Look for group which has most spare capacity on a - * single cpu. - */ - spare_capacity = capacity_of(i) - cpu_util(i); - if (spare_capacity > max_spare_capacity) { - max_spare_capacity = spare_capacity; - spare_group = group; - } + if (spare_cap > max_spare_cap) + max_spare_cap = spare_cap; } /* Adjust by relative CPU capacity of the group */ @@ -6670,17 +7115,32 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, if (local_group) { this_load = avg_load; - } else if (avg_load < min_load) { - min_load = avg_load; - idlest = group; + this_spare = max_spare_cap; + } else { + if (avg_load < min_load) { + min_load = avg_load; + idlest = group; + } + + if (most_spare < max_spare_cap) { + most_spare = max_spare_cap; + most_spare_sg = group; + } } } while (group = group->next, group != sd->groups); - if (fit_group) - return fit_group; - - if (spare_group) - return spare_group; + /* + * The cross-over point between using spare capacity or least load + * is too conservative for high utilization tasks on partially + * utilized systems if we require spare_capacity > task_util(p), + * so we allow for some task stuffing by using + * spare_capacity > task_util(p)/2. + */ + if (this_spare > task_util(p) / 2 && + imbalance*this_spare > 100*most_spare) + return NULL; + else if (most_spare > task_util(p) / 2) + return most_spare_sg; if (!idlest || 100*this_load < imbalance*min_load) return NULL; @@ -6700,9 +7160,13 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu) int shallowest_idle_cpu = -1; int i; + /* Check if we have any choice: */ + if (group->group_weight == 1) + return cpumask_first(sched_group_cpus(group)); + /* Traverse only the allowed CPUs */ for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) { - if (task_fits_spare(p, i)) { + if (idle_cpu(i)) { struct rq *rq = cpu_rq(i); struct cpuidle_state *idle = idle_get_state(rq); if (idle && idle->exit_latency < min_exit_latency) { @@ -6714,8 +7178,7 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu) min_exit_latency = idle->exit_latency; latest_idle_timestamp = rq->idle_stamp; shallowest_idle_cpu = i; - } else if (idle_cpu(i) && - (!idle || idle->exit_latency == min_exit_latency) && + } else if ((!idle || idle->exit_latency == min_exit_latency) && rq->idle_stamp > latest_idle_timestamp) { /* * If equal or no active idle state, then @@ -6724,13 +7187,6 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu) */ latest_idle_timestamp = rq->idle_stamp; shallowest_idle_cpu = i; - } else if (shallowest_idle_cpu == -1) { - /* - * If we haven't found an idle CPU yet - * pick a non-idle one that can fit the task as - * fallback. - */ - shallowest_idle_cpu = i; } } else if (shallowest_idle_cpu == -1) { load = weighted_cpuload(i); @@ -6747,24 +7203,32 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu) /* * Try and locate an idle CPU in the sched_domain. */ -static int select_idle_sibling(struct task_struct *p, int target) +static int select_idle_sibling(struct task_struct *p, int prev, int target) { struct sched_domain *sd; struct sched_group *sg; - int i = task_cpu(p); - int best_idle = -1; - int best_idle_cstate = -1; - int best_idle_capacity = INT_MAX; + int best_idle_cpu = -1; + int best_idle_cstate = INT_MAX; + unsigned long best_idle_capacity = ULONG_MAX; + + schedstat_inc(p, se.statistics.nr_wakeups_sis_attempts); + schedstat_inc(this_rq(), eas_stats.sis_attempts); if (!sysctl_sched_cstate_aware) { - if (idle_cpu(target)) + if (idle_cpu(target)) { + schedstat_inc(p, se.statistics.nr_wakeups_sis_idle); + schedstat_inc(this_rq(), eas_stats.sis_idle); return target; + } /* * If the prevous cpu is cache affine and idle, don't be stupid. */ - if (i != target && cpus_share_cache(i, target) && idle_cpu(i)) - return i; + if (prev != target && cpus_share_cache(prev, target) && idle_cpu(prev)) { + schedstat_inc(p, se.statistics.nr_wakeups_sis_cache_affine); + schedstat_inc(this_rq(), eas_stats.sis_cache_affine); + return prev; + } } if (!(current->flags & PF_WAKE_UP_IDLE) && @@ -6778,24 +7242,30 @@ static int select_idle_sibling(struct task_struct *p, int target) for_each_lower_domain(sd) { sg = sd->groups; do { + int i; if (!cpumask_intersects(sched_group_cpus(sg), tsk_cpus_allowed(p))) goto next; if (sysctl_sched_cstate_aware) { for_each_cpu_and(i, tsk_cpus_allowed(p), sched_group_cpus(sg)) { - struct rq *rq = cpu_rq(i); - int idle_idx = idle_get_state_idx(rq); + int idle_idx = idle_get_state_idx(cpu_rq(i)); unsigned long new_usage = boosted_task_util(p); unsigned long capacity_orig = capacity_orig_of(i); + if (new_usage > capacity_orig || !idle_cpu(i)) goto next; - if (i == target && new_usage <= capacity_curr_of(target)) + if (i == target && new_usage <= capacity_curr_of(target)) { + schedstat_inc(p, se.statistics.nr_wakeups_sis_suff_cap); + schedstat_inc(this_rq(), eas_stats.sis_suff_cap); + schedstat_inc(sd, eas_stats.sis_suff_cap); return target; + } - if (best_idle < 0 || (idle_idx < best_idle_cstate && capacity_orig <= best_idle_capacity)) { - best_idle = i; + if (idle_idx < best_idle_cstate && + capacity_orig <= best_idle_capacity) { + best_idle_cpu = i; best_idle_cstate = idle_idx; best_idle_capacity = capacity_orig; } @@ -6808,231 +7278,283 @@ static int select_idle_sibling(struct task_struct *p, int target) target = cpumask_first_and(sched_group_cpus(sg), tsk_cpus_allowed(p)); + schedstat_inc(p, se.statistics.nr_wakeups_sis_idle_cpu); + schedstat_inc(this_rq(), eas_stats.sis_idle_cpu); + schedstat_inc(sd, eas_stats.sis_idle_cpu); goto done; } next: sg = sg->next; } while (sg != sd->groups); } - if (best_idle > 0) - target = best_idle; + + if (best_idle_cpu >= 0) + target = best_idle_cpu; done: + schedstat_inc(p, se.statistics.nr_wakeups_sis_count); + schedstat_inc(this_rq(), eas_stats.sis_count); + return target; } -static inline int find_best_target(struct task_struct *p, bool boosted, bool prefer_idle) +/* + * cpu_util_wake: Compute cpu utilization with any contributions from + * the waking task p removed. + */ +static int cpu_util_wake(int cpu, struct task_struct *p) { - int iter_cpu; - int target_cpu = -1; - int target_util = 0; - int backup_capacity = 0; - int best_idle_cpu = -1; - int best_idle_cstate = INT_MAX; - int backup_cpu = -1; - unsigned long task_util_boosted, new_util; - - task_util_boosted = boosted_task_util(p); - for (iter_cpu = 0; iter_cpu < NR_CPUS; iter_cpu++) { - int cur_capacity; - struct rq *rq; - int idle_idx; - - /* - * Iterate from higher cpus for boosted tasks. - */ - int i = boosted ? NR_CPUS-iter_cpu-1 : iter_cpu; - - if (!cpu_online(i) || !cpumask_test_cpu(i, tsk_cpus_allowed(p))) - continue; + unsigned long util, capacity; - /* - * p's blocked utilization is still accounted for on prev_cpu - * so prev_cpu will receive a negative bias due to the double - * accounting. However, the blocked utilization may be zero. - */ - new_util = cpu_util(i) + task_util_boosted; - - /* - * Ensure minimum capacity to grant the required boost. - * The target CPU can be already at a capacity level higher - * than the one required to boost the task. - */ - if (new_util > capacity_orig_of(i)) - continue; +#ifdef CONFIG_SCHED_WALT + /* + * WALT does not decay idle tasks in the same manner + * as PELT, so it makes little sense to subtract task + * utilization from cpu utilization. Instead just use + * cpu_util for this case. + */ + if (!walt_disabled && sysctl_sched_use_walt_cpu_util) + return cpu_util(cpu); +#endif + /* Task has no contribution or is new */ + if (cpu != task_cpu(p) || !p->se.avg.last_update_time) + return cpu_util(cpu); - /* - * Unconditionally favoring tasks that prefer idle cpus to - * improve latency. - */ - if (idle_cpu(i) && prefer_idle) { - if (best_idle_cpu < 0) - best_idle_cpu = i; - continue; - } + capacity = capacity_orig_of(cpu); + util = max_t(long, cpu_util(cpu) - task_util(p), 0); - cur_capacity = capacity_curr_of(i); - rq = cpu_rq(i); - idle_idx = idle_get_state_idx(rq); + return (util >= capacity) ? capacity : util; +} - if (new_util < cur_capacity) { - if (cpu_rq(i)->nr_running) { - if (prefer_idle) { - /* Find a target cpu with highest - * utilization. - */ - if (target_util == 0 || - target_util < new_util) { - target_cpu = i; - target_util = new_util; - } - } else { - /* Find a target cpu with lowest - * utilization. - */ - if (target_util == 0 || - target_util > new_util) { - target_cpu = i; - target_util = new_util; - } - } - } else if (!prefer_idle) { - if (best_idle_cpu < 0 || - (sysctl_sched_cstate_aware && - best_idle_cstate > idle_idx)) { - best_idle_cstate = idle_idx; - best_idle_cpu = i; - } - } - } else if (backup_capacity == 0 || - backup_capacity > cur_capacity) { - // Find a backup cpu with least capacity. - backup_capacity = cur_capacity; - backup_cpu = i; - } - } +static int start_cpu(bool boosted) +{ + struct root_domain *rd = cpu_rq(smp_processor_id())->rd; - if (prefer_idle && best_idle_cpu >= 0) - target_cpu = best_idle_cpu; - else if (target_cpu < 0) - target_cpu = best_idle_cpu >= 0 ? best_idle_cpu : backup_cpu; + RCU_LOCKDEP_WARN(rcu_read_lock_sched_held(), + "sched RCU must be held"); - return target_cpu; + return boosted ? rd->max_cap_orig_cpu : rd->min_cap_orig_cpu; } -static int energy_aware_wake_cpu(struct task_struct *p, int target, int sync) +static inline int find_best_target(struct task_struct *p, bool boosted, bool prefer_idle) { + int target_cpu = -1; + unsigned long target_util = prefer_idle ? ULONG_MAX : 0; + unsigned long backup_capacity = ULONG_MAX; + int best_idle_cpu = -1; + int best_idle_cstate = INT_MAX; + int backup_cpu = -1; + unsigned long min_util = boosted_task_util(p); struct sched_domain *sd; - struct sched_group *sg, *sg_target; - int target_max_cap = INT_MAX; - int target_cpu = task_cpu(p); - unsigned long task_util_boosted, new_util; - int i; + struct sched_group *sg; + int cpu = start_cpu(boosted); - if (sysctl_sched_sync_hint_enable && sync) { - int cpu = smp_processor_id(); - cpumask_t search_cpus; - cpumask_and(&search_cpus, tsk_cpus_allowed(p), cpu_online_mask); - if (cpumask_test_cpu(cpu, &search_cpus)) - return cpu; + schedstat_inc(p, se.statistics.nr_wakeups_fbt_attempts); + schedstat_inc(this_rq(), eas_stats.fbt_attempts); + + if (cpu < 0) { + schedstat_inc(p, se.statistics.nr_wakeups_fbt_no_cpu); + schedstat_inc(this_rq(), eas_stats.fbt_no_cpu); + return target_cpu; } - sd = rcu_dereference(per_cpu(sd_ea, task_cpu(p))); + sd = rcu_dereference(per_cpu(sd_ea, cpu)); - if (!sd) - return target; + if (!sd) { + schedstat_inc(p, se.statistics.nr_wakeups_fbt_no_sd); + schedstat_inc(this_rq(), eas_stats.fbt_no_sd); + return target_cpu; + } sg = sd->groups; - sg_target = sg; - if (sysctl_sched_is_big_little) { + do { + int i; - /* - * Find group with sufficient capacity. We only get here if no cpu is - * overutilized. We may end up overutilizing a cpu by adding the task, - * but that should not be any worse than select_idle_sibling(). - * load_balance() should sort it out later as we get above the tipping - * point. - */ - do { - /* Assuming all cpus are the same in group */ - int max_cap_cpu = group_first_cpu(sg); + for_each_cpu_and(i, tsk_cpus_allowed(p), sched_group_cpus(sg)) { + unsigned long cur_capacity, new_util, wake_util; + unsigned long min_wake_util = ULONG_MAX; - /* - * Assume smaller max capacity means more energy-efficient. - * Ideally we should query the energy model for the right - * answer but it easily ends up in an exhaustive search. - */ - if (capacity_of(max_cap_cpu) < target_max_cap && - task_fits_max(p, max_cap_cpu)) { - sg_target = sg; - target_max_cap = capacity_of(max_cap_cpu); - } - } while (sg = sg->next, sg != sd->groups); + if (!cpu_online(i)) + continue; - task_util_boosted = boosted_task_util(p); - /* Find cpu with sufficient capacity */ - for_each_cpu_and(i, tsk_cpus_allowed(p), sched_group_cpus(sg_target)) { /* * p's blocked utilization is still accounted for on prev_cpu * so prev_cpu will receive a negative bias due to the double * accounting. However, the blocked utilization may be zero. */ - new_util = cpu_util(i) + task_util_boosted; + wake_util = cpu_util_wake(i, p); + new_util = wake_util + task_util(p); /* * Ensure minimum capacity to grant the required boost. * The target CPU can be already at a capacity level higher * than the one required to boost the task. */ + new_util = max(min_util, new_util); + if (new_util > capacity_orig_of(i)) continue; - if (new_util < capacity_curr_of(i)) { - target_cpu = i; - if (cpu_rq(i)->nr_running) - break; + /* + * Unconditionally favoring tasks that prefer idle cpus to + * improve latency. + */ + if (idle_cpu(i) && prefer_idle) { + schedstat_inc(p, se.statistics.nr_wakeups_fbt_pref_idle); + schedstat_inc(this_rq(), eas_stats.fbt_pref_idle); + return i; + } + + cur_capacity = capacity_curr_of(i); + + if (new_util < cur_capacity) { + if (cpu_rq(i)->nr_running) { + /* + * Find a target cpu with the lowest/highest + * utilization if prefer_idle/!prefer_idle. + */ + if (prefer_idle) { + /* Favor the CPU that last ran the task */ + if (new_util > target_util || + wake_util > min_wake_util) + continue; + min_wake_util = wake_util; + target_util = new_util; + target_cpu = i; + } else if (target_util < new_util) { + target_util = new_util; + target_cpu = i; + } + } else if (!prefer_idle) { + int idle_idx = idle_get_state_idx(cpu_rq(i)); + + if (best_idle_cpu < 0 || + (sysctl_sched_cstate_aware && + best_idle_cstate > idle_idx)) { + best_idle_cstate = idle_idx; + best_idle_cpu = i; + } + } + } else if (backup_capacity > cur_capacity) { + /* Find a backup cpu with least capacity. */ + backup_capacity = cur_capacity; + backup_cpu = i; } + } + } while (sg = sg->next, sg != sd->groups); + + if (target_cpu < 0) + target_cpu = best_idle_cpu >= 0 ? best_idle_cpu : backup_cpu; + + if (target_cpu >= 0) { + schedstat_inc(p, se.statistics.nr_wakeups_fbt_count); + schedstat_inc(this_rq(), eas_stats.fbt_count); + } + + return target_cpu; +} + +/* + * Disable WAKE_AFFINE in the case where task @p doesn't fit in the + * capacity of either the waking CPU @cpu or the previous CPU @prev_cpu. + * + * In that case WAKE_AFFINE doesn't make sense and we'll let + * BALANCE_WAKE sort things out. + */ +static int wake_cap(struct task_struct *p, int cpu, int prev_cpu) +{ + long min_cap, max_cap; - /* cpu has capacity at higher OPP, keep it as fallback */ - if (target_cpu == task_cpu(p)) - target_cpu = i; + min_cap = min(capacity_orig_of(prev_cpu), capacity_orig_of(cpu)); + max_cap = cpu_rq(cpu)->rd->max_cpu_capacity.val; + + /* Minimum capacity is close to max, no need to abort wake_affine */ + if (max_cap - min_cap < max_cap >> 3) + return 0; + + /* Bring task utilization in sync with prev_cpu */ + sync_entity_load_avg(&p->se); + + return min_cap * 1024 < task_util(p) * capacity_margin; +} + +static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu, int sync) +{ + struct sched_domain *sd; + int target_cpu = prev_cpu, tmp_target; + bool boosted, prefer_idle; + + schedstat_inc(p, se.statistics.nr_wakeups_secb_attempts); + schedstat_inc(this_rq(), eas_stats.secb_attempts); + + if (sysctl_sched_sync_hint_enable && sync) { + int cpu = smp_processor_id(); + + if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p))) { + schedstat_inc(p, se.statistics.nr_wakeups_secb_sync); + schedstat_inc(this_rq(), eas_stats.secb_sync); + return cpu; } - } else { - /* - * Find a cpu with sufficient capacity - */ + } + + rcu_read_lock(); #ifdef CONFIG_CGROUP_SCHEDTUNE - bool boosted = schedtune_task_boost(p) > 0; - bool prefer_idle = schedtune_prefer_idle(p) > 0; + boosted = schedtune_task_boost(p) > 0; + prefer_idle = schedtune_prefer_idle(p) > 0; #else - bool boosted = 0; - bool prefer_idle = 0; + boosted = get_sysctl_sched_cfs_boost() > 0; + prefer_idle = 0; #endif - int tmp_target = find_best_target(p, boosted, prefer_idle); - if (tmp_target >= 0) { - target_cpu = tmp_target; - if ((boosted || prefer_idle) && idle_cpu(target_cpu)) - return target_cpu; + + sd = rcu_dereference(per_cpu(sd_ea, prev_cpu)); + /* Find a cpu with sufficient capacity */ + tmp_target = find_best_target(p, boosted, prefer_idle); + + if (!sd) + goto unlock; + if (tmp_target >= 0) { + target_cpu = tmp_target; + if ((boosted || prefer_idle) && idle_cpu(target_cpu)) { + schedstat_inc(p, se.statistics.nr_wakeups_secb_idle_bt); + schedstat_inc(this_rq(), eas_stats.secb_idle_bt); + goto unlock; } } - if (target_cpu != task_cpu(p)) { + if (target_cpu != prev_cpu) { struct energy_env eenv = { - .util_delta = task_util(p), - .src_cpu = task_cpu(p), - .dst_cpu = target_cpu, - .task = p, + .util_delta = task_util(p), + .src_cpu = prev_cpu, + .dst_cpu = target_cpu, + .task = p, }; /* Not enough spare capacity on previous cpu */ - if (cpu_overutilized(task_cpu(p))) - return target_cpu; + if (cpu_overutilized(prev_cpu)) { + schedstat_inc(p, se.statistics.nr_wakeups_secb_insuff_cap); + schedstat_inc(this_rq(), eas_stats.secb_insuff_cap); + goto unlock; + } + + if (energy_diff(&eenv) >= 0) { + schedstat_inc(p, se.statistics.nr_wakeups_secb_no_nrg_sav); + schedstat_inc(this_rq(), eas_stats.secb_no_nrg_sav); + target_cpu = prev_cpu; + goto unlock; + } - if (energy_diff(&eenv) >= 0) - return task_cpu(p); + schedstat_inc(p, se.statistics.nr_wakeups_secb_nrg_sav); + schedstat_inc(this_rq(), eas_stats.secb_nrg_sav); + goto unlock; } + schedstat_inc(p, se.statistics.nr_wakeups_secb_count); + schedstat_inc(this_rq(), eas_stats.secb_count); + +unlock: + rcu_read_unlock(); + return target_cpu; } @@ -7061,10 +7583,19 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f return select_best_cpu(p, prev_cpu, 0, sync); #endif - if (sd_flag & SD_BALANCE_WAKE) - want_affine = (!wake_wide(p) && task_fits_max(p, cpu) && - cpumask_test_cpu(cpu, tsk_cpus_allowed(p))) || - energy_aware(); + if (sd_flag & SD_BALANCE_WAKE) { + /* + * do wake_cap unconditionally as it causes task and cpu + * utilization to be synced, and we need that for energy + * aware wakeups + */ + int _wake_cap = wake_cap(p, cpu, prev_cpu); + want_affine = !wake_wide(p) && !_wake_cap + && cpumask_test_cpu(cpu, tsk_cpus_allowed(p)); + } + + if (energy_aware() && !(cpu_rq(prev_cpu)->rd->overutilized)) + return select_energy_cpu_brute(p, prev_cpu, sync); rcu_read_lock(); for_each_domain(cpu, tmp) { @@ -7089,49 +7620,65 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f if (affine_sd) { sd = NULL; /* Prefer wake_affine over balance flags */ - if (cpu != prev_cpu && wake_affine(affine_sd, p, sync)) + if (cpu != prev_cpu && wake_affine(affine_sd, p, prev_cpu, sync)) new_cpu = cpu; } if (!sd) { - if (energy_aware() && !cpu_rq(cpu)->rd->overutilized) - new_cpu = energy_aware_wake_cpu(p, prev_cpu, sync); - else if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */ - new_cpu = select_idle_sibling(p, new_cpu); + if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */ + new_cpu = select_idle_sibling(p, prev_cpu, new_cpu); - } else while (sd) { - struct sched_group *group; - int weight; + } else { + int wu = sd_flag & SD_BALANCE_WAKE; + int cas_cpu = -1; - if (!(sd->flags & sd_flag)) { - sd = sd->child; - continue; + if (wu) { + schedstat_inc(p, se.statistics.nr_wakeups_cas_attempts); + schedstat_inc(this_rq(), eas_stats.cas_attempts); } - group = find_idlest_group(sd, p, cpu, sd_flag); - if (!group) { - sd = sd->child; - continue; - } + while (sd) { + struct sched_group *group; + int weight; - new_cpu = find_idlest_cpu(group, p, cpu); - if (new_cpu == -1 || new_cpu == cpu) { - /* Now try balancing at a lower domain level of cpu */ - sd = sd->child; - continue; + if (wu) + schedstat_inc(sd, eas_stats.cas_attempts); + + if (!(sd->flags & sd_flag)) { + sd = sd->child; + continue; + } + + group = find_idlest_group(sd, p, cpu, sd_flag); + if (!group) { + sd = sd->child; + continue; + } + + new_cpu = find_idlest_cpu(group, p, cpu); + if (new_cpu == -1 || new_cpu == cpu) { + /* Now try balancing at a lower domain level of cpu */ + sd = sd->child; + continue; + } + + /* Now try balancing at a lower domain level of new_cpu */ + cpu = cas_cpu = new_cpu; + weight = sd->span_weight; + sd = NULL; + for_each_domain(cpu, tmp) { + if (weight <= tmp->span_weight) + break; + if (tmp->flags & sd_flag) + sd = tmp; + } + /* while loop will break here if sd == NULL */ } - /* Now try balancing at a lower domain level of new_cpu */ - cpu = new_cpu; - weight = sd->span_weight; - sd = NULL; - for_each_domain(cpu, tmp) { - if (weight <= tmp->span_weight) - break; - if (tmp->flags & sd_flag) - sd = tmp; + if (wu && (cas_cpu >= 0)) { + schedstat_inc(p, se.statistics.nr_wakeups_cas_count); + schedstat_inc(this_rq(), eas_stats.cas_count); } - /* while loop will break here if sd == NULL */ } rcu_read_unlock(); @@ -8134,8 +8681,13 @@ static void update_blocked_averages(int cpu) if (throttled_hierarchy(cfs_rq)) continue; - if (update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq)) + if (update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq, + true)) update_tg_load_avg(cfs_rq, 0); + + /* Propagate pending load changes to the parent */ + if (cfs_rq->tg->se[cpu]) + update_load_avg(cfs_rq->tg->se[cpu], 0); } raw_spin_unlock_irqrestore(&rq->lock, flags); } @@ -8195,7 +8747,7 @@ static inline void update_blocked_averages(int cpu) raw_spin_lock_irqsave(&rq->lock, flags); update_rq_clock(rq); - update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq); + update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq, true); raw_spin_unlock_irqrestore(&rq->lock, flags); } @@ -8433,13 +8985,14 @@ skip_unlock: __attribute__ ((unused)); cpu_rq(cpu)->cpu_capacity = capacity; sdg->sgc->capacity = capacity; sdg->sgc->max_capacity = capacity; + sdg->sgc->min_capacity = capacity; } void update_group_capacity(struct sched_domain *sd, int cpu) { struct sched_domain *child = sd->child; struct sched_group *group, *sdg = sd->groups; - unsigned long capacity, max_capacity; + unsigned long capacity, max_capacity, min_capacity; unsigned long interval; interval = msecs_to_jiffies(sd->balance_interval); @@ -8453,6 +9006,7 @@ void update_group_capacity(struct sched_domain *sd, int cpu) capacity = 0; max_capacity = 0; + min_capacity = ULONG_MAX; if (child->flags & SD_OVERLAP) { /* @@ -8485,6 +9039,7 @@ void update_group_capacity(struct sched_domain *sd, int cpu) } max_capacity = max(capacity, max_capacity); + min_capacity = min(capacity, min_capacity); } } else { /* @@ -8502,6 +9057,7 @@ void update_group_capacity(struct sched_domain *sd, int cpu) if (!cpu_isolated(cpumask_first(cpus))) { capacity += sgc->capacity; max_capacity = max(sgc->max_capacity, max_capacity); + min_capacity = min(sgc->min_capacity, min_capacity); } group = group->next; } while (group != child->groups); @@ -8509,6 +9065,7 @@ void update_group_capacity(struct sched_domain *sd, int cpu) sdg->sgc->capacity = capacity; sdg->sgc->max_capacity = max_capacity; + sdg->sgc->min_capacity = min_capacity; } /* @@ -8790,15 +9347,21 @@ static bool update_sd_pick_busiest(struct lb_env *env, if (sgs->avg_load <= busiest->avg_load) return false; + if (!(env->sd->flags & SD_ASYM_CPUCAPACITY)) + goto asym_packing; + /* - * Candiate sg has no more than one task per cpu and has higher - * per-cpu capacity. No reason to pull tasks to less capable cpus. + * Candidate sg has no more than one task per CPU and + * has higher per-CPU capacity. Migrating tasks to less + * capable CPUs may harm throughput. Maximize throughput, + * power/energy consequences are not considered. */ if (sgs->sum_nr_running <= sgs->group_weight && group_smaller_cpu_capacity(sds->local, sg)) return false; } +asym_packing: /* This is the busiest node in its class. */ if (!(env->sd->flags & SD_ASYM_PACKING)) return true; @@ -8849,6 +9412,9 @@ static inline enum fbq_type fbq_classify_rq(struct rq *rq) } #endif /* CONFIG_NUMA_BALANCING */ +#define lb_sd_parent(sd) \ + (sd->parent && sd->parent->groups != sd->parent->groups->next) + /** * update_sd_lb_stats - Update sched_domain's statistics for load balancing. * @env: The load balancing environment. @@ -8934,7 +9500,7 @@ next_group: env->src_grp_nr_running = sds->busiest_stat.sum_nr_running; - if (!env->sd->parent) { + if (!lb_sd_parent(env->sd)) { /* update overload indicator if we are at root domain */ if (env->dst_rq->rd->overload != overload) env->dst_rq->rd->overload = overload; @@ -9523,7 +10089,7 @@ static int load_balance(int this_cpu, struct rq *this_rq, int *continue_balancing) { int ld_moved = 0, cur_ld_moved, active_balance = 0; - struct sched_domain *sd_parent = sd->parent; + struct sched_domain *sd_parent = lb_sd_parent(sd) ? sd->parent : NULL; struct sched_group *group = NULL; struct rq *busiest = NULL; unsigned long flags; @@ -10807,6 +11373,61 @@ static inline bool vruntime_normalized(struct task_struct *p) return false; } +#ifdef CONFIG_FAIR_GROUP_SCHED +/* + * Propagate the changes of the sched_entity across the tg tree to make it + * visible to the root + */ +static void propagate_entity_cfs_rq(struct sched_entity *se) +{ + struct cfs_rq *cfs_rq; + + /* Start to propagate at parent */ + se = se->parent; + + for_each_sched_entity(se) { + cfs_rq = cfs_rq_of(se); + + if (cfs_rq_throttled(cfs_rq)) + break; + + update_load_avg(se, UPDATE_TG); + } +} +#else +static void propagate_entity_cfs_rq(struct sched_entity *se) { } +#endif + +static void detach_entity_cfs_rq(struct sched_entity *se) +{ + struct cfs_rq *cfs_rq = cfs_rq_of(se); + + /* Catch up with the cfs_rq and remove our load when we leave */ + update_load_avg(se, 0); + detach_entity_load_avg(cfs_rq, se); + update_tg_load_avg(cfs_rq, false); + propagate_entity_cfs_rq(se); +} + +static void attach_entity_cfs_rq(struct sched_entity *se) +{ + struct cfs_rq *cfs_rq = cfs_rq_of(se); + +#ifdef CONFIG_FAIR_GROUP_SCHED + /* + * Since the real-depth could have been changed (only FAIR + * class maintain depth value), reset depth properly. + */ + se->depth = se->parent ? se->parent->depth + 1 : 0; +#endif + + /* Synchronize entity with its cfs_rq */ + update_load_avg(se, sched_feat(ATTACH_AGE_LOAD) ? 0 : SKIP_AGE_LOAD); + attach_entity_load_avg(cfs_rq, se); + update_tg_load_avg(cfs_rq, false); + propagate_entity_cfs_rq(se); +} + static void detach_task_cfs_rq(struct task_struct *p) { struct sched_entity *se = &p->se; @@ -10821,8 +11442,7 @@ static void detach_task_cfs_rq(struct task_struct *p) se->vruntime -= cfs_rq->min_vruntime; } - /* Catch up with the cfs_rq and remove our load when we leave */ - detach_entity_load_avg(cfs_rq, se); + detach_entity_cfs_rq(se); } static void attach_task_cfs_rq(struct task_struct *p) @@ -10830,16 +11450,7 @@ static void attach_task_cfs_rq(struct task_struct *p) struct sched_entity *se = &p->se; struct cfs_rq *cfs_rq = cfs_rq_of(se); -#ifdef CONFIG_FAIR_GROUP_SCHED - /* - * Since the real-depth could have been changed (only FAIR - * class maintain depth value), reset depth properly. - */ - se->depth = se->parent ? se->parent->depth + 1 : 0; -#endif - - /* Synchronize task with its cfs_rq */ - attach_entity_load_avg(cfs_rq, se); + attach_entity_cfs_rq(se); if (!vruntime_normalized(p)) se->vruntime += cfs_rq->min_vruntime; @@ -10893,6 +11504,9 @@ void init_cfs_rq(struct cfs_rq *cfs_rq) cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime; #endif #ifdef CONFIG_SMP +#ifdef CONFIG_FAIR_GROUP_SCHED + cfs_rq->propagate_avg = 0; +#endif atomic_long_set(&cfs_rq->removed_load_avg, 0); atomic_long_set(&cfs_rq->removed_util_avg, 0); #endif @@ -10930,8 +11544,9 @@ void free_fair_sched_group(struct task_group *tg) int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent) { - struct cfs_rq *cfs_rq; struct sched_entity *se; + struct cfs_rq *cfs_rq; + struct rq *rq; int i; tg->cfs_rq = kzalloc(sizeof(cfs_rq) * nr_cpu_ids, GFP_KERNEL); @@ -10946,6 +11561,8 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent) init_cfs_bandwidth(tg_cfs_bandwidth(tg)); for_each_possible_cpu(i) { + rq = cpu_rq(i); + cfs_rq = kzalloc_node(sizeof(struct cfs_rq), GFP_KERNEL, cpu_to_node(i)); if (!cfs_rq) @@ -10959,6 +11576,10 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent) init_cfs_rq(cfs_rq); init_tg_cfs_entry(tg, cfs_rq, se, i, parent->se[i]); init_entity_runnable_average(se); + + raw_spin_lock_irq(&rq->lock); + post_init_entity_util_avg(se); + raw_spin_unlock_irq(&rq->lock); } return 1; @@ -11055,8 +11676,10 @@ int sched_group_set_shares(struct task_group *tg, unsigned long shares) /* Possible calls to update_curr() need rq clock */ update_rq_clock(rq); - for_each_sched_entity(se) - update_cfs_shares(group_cfs_rq(se)); + for_each_sched_entity(se) { + update_load_avg(se, UPDATE_TG); + update_cfs_shares(se); + } raw_spin_unlock_irqrestore(&rq->lock, flags); } diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c index 29345ed74069..fd27394354a1 100644 --- a/kernel/sched/rt.c +++ b/kernel/sched/rt.c @@ -1005,6 +1005,9 @@ static void update_curr_rt(struct rq *rq) if (unlikely((s64)delta_exec <= 0)) return; + /* Kick cpufreq (see the comment in kernel/sched/sched.h). */ + cpufreq_update_this_cpu(rq, SCHED_CPUFREQ_RT); + schedstat_set(curr->se.statistics.exec_max, max(curr->se.statistics.exec_max, delta_exec)); diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index b88f647ea935..a8ad8e71076a 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -465,6 +465,7 @@ struct cfs_rq { unsigned long runnable_load_avg; #ifdef CONFIG_FAIR_GROUP_SCHED unsigned long tg_load_avg_contrib; + unsigned long propagate_avg; #endif atomic_long_t removed_load_avg, removed_util_avg; #ifndef CONFIG_64BIT @@ -651,6 +652,9 @@ struct root_domain { /* Maximum cpu capacity in the system. */ struct max_cpu_capacity max_cpu_capacity; + + /* First cpu with maximum and minimum original capacity */ + int max_cap_orig_cpu, min_cap_orig_cpu; }; extern struct root_domain def_root_domain; @@ -708,6 +712,7 @@ struct rq { #ifdef CONFIG_FAIR_GROUP_SCHED /* list of leaf cfs_rq on this cpu: */ struct list_head leaf_cfs_rq_list; + struct list_head *tmp_alone_branch; #endif /* CONFIG_FAIR_GROUP_SCHED */ /* @@ -827,6 +832,9 @@ struct rq { /* try_to_wake_up() stats */ unsigned int ttwu_count; unsigned int ttwu_local; +#ifdef CONFIG_SMP + struct eas_stats eas_stats; +#endif #endif #ifdef CONFIG_SMP @@ -997,6 +1005,7 @@ struct sched_group_capacity { */ unsigned long capacity; unsigned long max_capacity; /* Max per-cpu capacity in group */ + unsigned long min_capacity; /* Min per-CPU capacity in group */ unsigned long next_update; int imbalance; /* XXX unrelated to capacity but shared group state */ /* @@ -2158,6 +2167,7 @@ extern void init_dl_task_timer(struct sched_dl_entity *dl_se); unsigned long to_ratio(u64 period, u64 runtime); extern void init_entity_runnable_average(struct sched_entity *se); +extern void post_init_entity_util_avg(struct sched_entity *se); static inline void __add_nr_running(struct rq *rq, unsigned count) { @@ -2792,3 +2802,55 @@ static inline u64 irq_time_read(int cpu) } #endif /* CONFIG_64BIT */ #endif /* CONFIG_IRQ_TIME_ACCOUNTING */ + +#ifdef CONFIG_CPU_FREQ +DECLARE_PER_CPU(struct update_util_data *, cpufreq_update_util_data); + +/** + * cpufreq_update_util - Take a note about CPU utilization changes. + * @rq: Runqueue to carry out the update for. + * @flags: Update reason flags. + * + * This function is called by the scheduler on the CPU whose utilization is + * being updated. + * + * It can only be called from RCU-sched read-side critical sections. + * + * The way cpufreq is currently arranged requires it to evaluate the CPU + * performance state (frequency/voltage) on a regular basis to prevent it from + * being stuck in a completely inadequate performance level for too long. + * That is not guaranteed to happen if the updates are only triggered from CFS, + * though, because they may not be coming in if RT or deadline tasks are active + * all the time (or there are RT and DL tasks only). + * + * As a workaround for that issue, this function is called by the RT and DL + * sched classes to trigger extra cpufreq updates to prevent it from stalling, + * but that really is a band-aid. Going forward it should be replaced with + * solutions targeted more specifically at RT and DL tasks. + */ +static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) +{ + struct update_util_data *data; + + data = rcu_dereference_sched(*this_cpu_ptr(&cpufreq_update_util_data)); + if (data) + data->func(data, rq_clock(rq), flags); +} + +static inline void cpufreq_update_this_cpu(struct rq *rq, unsigned int flags) +{ + if (cpu_of(rq) == smp_processor_id()) + cpufreq_update_util(rq, flags); +} +#else +static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {} +static inline void cpufreq_update_this_cpu(struct rq *rq, unsigned int flags) {} +#endif /* CONFIG_CPU_FREQ */ + +#ifdef arch_scale_freq_capacity +#ifndef arch_scale_freq_invariant +#define arch_scale_freq_invariant() (true) +#endif +#else /* arch_scale_freq_capacity */ +#define arch_scale_freq_invariant() (false) +#endif diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c index 87e2c9f0c33e..6d74a7c77c8c 100644 --- a/kernel/sched/stats.c +++ b/kernel/sched/stats.c @@ -12,6 +12,28 @@ */ #define SCHEDSTAT_VERSION 15 +#ifdef CONFIG_SMP +static inline void show_easstat(struct seq_file *seq, struct eas_stats *stats) +{ + /* eas-specific runqueue stats */ + seq_printf(seq, "eas %llu %llu %llu %llu %llu %llu ", + stats->sis_attempts, stats->sis_idle, stats->sis_cache_affine, + stats->sis_suff_cap, stats->sis_idle_cpu, stats->sis_count); + + seq_printf(seq, "%llu %llu %llu %llu %llu %llu %llu ", + stats->secb_attempts, stats->secb_sync, stats->secb_idle_bt, + stats->secb_insuff_cap, stats->secb_no_nrg_sav, + stats->secb_nrg_sav, stats->secb_count); + + seq_printf(seq, "%llu %llu %llu %llu %llu ", + stats->fbt_attempts, stats->fbt_no_cpu, stats->fbt_no_sd, + stats->fbt_pref_idle, stats->fbt_count); + + seq_printf(seq, "%llu %llu\n", + stats->cas_attempts, stats->cas_count); +} +#endif + static int show_schedstat(struct seq_file *seq, void *v) { int cpu; @@ -40,6 +62,8 @@ static int show_schedstat(struct seq_file *seq, void *v) seq_printf(seq, "\n"); #ifdef CONFIG_SMP + show_easstat(seq, &rq->eas_stats); + /* domain-specific stats */ rcu_read_lock(); for_each_domain(cpu, sd) { @@ -66,6 +90,8 @@ static int show_schedstat(struct seq_file *seq, void *v) sd->sbf_count, sd->sbf_balanced, sd->sbf_pushed, sd->ttwu_wake_remote, sd->ttwu_move_affine, sd->ttwu_move_balance); + + show_easstat(seq, &sd->eas_stats); } rcu_read_unlock(); #endif diff --git a/kernel/sched/tune.c b/kernel/sched/tune.c index b0c5fe6d1f3b..a71e94cecdb6 100644 --- a/kernel/sched/tune.c +++ b/kernel/sched/tune.c @@ -12,11 +12,12 @@ #include "tune.h" #ifdef CONFIG_CGROUP_SCHEDTUNE -static bool schedtune_initialized = false; +bool schedtune_initialized = false; #endif unsigned int sysctl_sched_cfs_boost __read_mostly; +extern struct reciprocal_value schedtune_spc_rdiv; extern struct target_nrg schedtune_target_nrg; /* Performance Boost region (B) threshold params */ @@ -675,6 +676,9 @@ int schedtune_task_boost(struct task_struct *p) struct schedtune *st; int task_boost; + if (!unlikely(schedtune_initialized)) + return 0; + /* Get task boost value */ rcu_read_lock(); st = task_schedtune(p); @@ -689,6 +693,9 @@ int schedtune_prefer_idle(struct task_struct *p) struct schedtune *st; int prefer_idle; + if (!unlikely(schedtune_initialized)) + return 0; + /* Get prefer_idle value */ rcu_read_lock(); st = task_schedtune(p); @@ -822,6 +829,7 @@ schedtune_boostgroup_init(struct schedtune *st) bg = &per_cpu(cpu_boost_groups, cpu); bg->group[st->idx].boost = 0; bg->group[st->idx].tasks = 0; + raw_spin_lock_init(&bg->lock); } return 0; @@ -1121,9 +1129,12 @@ schedtune_init(void) pr_info("schedtune: configured to support global boosting only\n"); #endif + schedtune_spc_rdiv = reciprocal_value(100); + return 0; nodata: + pr_warning("schedtune: disabled!\n"); rcu_read_unlock(); return -EINVAL; } |