
[v6,04/16] sched/core: uclamp: Add CPU's clamp buckets refcounting

Message ID 20190115101513.2822-5-patrick.bellasi@arm.com (mailing list archive)
State Not Applicable, archived
Series: Add utilization clamping support

Commit Message

Patrick Bellasi Jan. 15, 2019, 10:15 a.m. UTC
Utilization clamping allows clamping a CPU's utilization within a
[util_min, util_max] range, depending on the set of RUNNABLE tasks on
that CPU. Each task references two "clamp buckets" defining its minimum
and maximum (util_{min,max}) utilization "clamp values". A CPU's clamp
bucket is active if there is at least one RUNNABLE task enqueued on
that CPU refcounting that bucket.

When a task is {en,de}queued {on,from} a CPU, the set of active clamp
buckets on that CPU can change. Since each clamp bucket enforces a
different utilization clamp value, when the set of active clamp buckets
changes, a new "aggregated" clamp value is computed for that CPU.

Clamp values are always MAX aggregated for both util_min and util_max.
This ensures that no task can affect the performance of other
co-scheduled tasks which are more boosted (i.e. with higher util_min
clamp) or less capped (i.e. with higher util_max clamp).

Each task has a:
   task_struct::uclamp[clamp_id]::bucket_id
to track the "bucket index" of the CPU's clamp bucket it refcounts while
enqueued, for each clamp index (clamp_id).

Each CPU's rq has a:
   rq::uclamp[clamp_id]::bucket[bucket_id].tasks
to track how many tasks, currently RUNNABLE on that CPU, refcount each
clamp bucket (bucket_id) of a clamp index (clamp_id).

Each CPU's rq has also a:
   rq::uclamp[clamp_id]::bucket[bucket_id].value
to track the clamp value of each clamp bucket (bucket_id) of a clamp
index (clamp_id).

The unordered array rq::uclamp[clamp_id]::bucket[] is scanned every time
we need to find a new MAX aggregated clamp value for a clamp_id. This
operation is required only when we dequeue the last task of a clamp
bucket tracking the current MAX aggregated clamp value. In these cases,
the CPU is either entering IDLE or going to schedule a less boosted or
more clamped task.
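
In pseudo-code, the scan is a plain MAX reduction over the active buckets
(a sketch of what the uclamp_cpu_update() helper added below implements):

   unsigned int max_value = 0;
   unsigned int bucket_id;

   for (bucket_id = 0; bucket_id < UCLAMP_BUCKETS; ++bucket_id) {
           unsigned int bucket_value;

           /* Inactive buckets (no RUNNABLE tasks) do not constrain the CPU */
           if (!rq->uclamp[clamp_id].bucket[bucket_id].tasks)
                   continue;

           /* Both min and max clamps are MAX aggregated */
           bucket_value = rq->uclamp[clamp_id].bucket[bucket_id].value;
           max_value = max(max_value, bucket_value);
   }
   /* The new aggregated clamp value for this clamp_id on this CPU */
   WRITE_ONCE(rq->uclamp[clamp_id].value, max_value);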

The expected number of different clamp values, configured at build time,
is small enough to fit the full unordered array into a single cache
line. In most use-cases we expect less than 10 different clamp values
for each clamp_id.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>

---
Changes in v6:
 Message-ID: <20181113151127.GA7681@darkstar>
 - use SCHED_WARN_ON() instead of CONFIG_SCHED_DEBUG guarded WARN()s
 - add some better inline documentation to explain per-CPU initializations
 - add some local variables to use library's max() for aggregation on
   bitfield attributes
 Message-ID: <20181112000910.GC3038@worktop>
 - wholesale s/group/bucket/
 Message-ID: <20181111164754.GA3038@worktop>
 - consistently use unary (++/--) operators
 Others:
 - updated from rq::uclamp::group[clamp_id][group_id]
             to rq::uclamp[clamp_id]::bucket[bucket_id]
   which better matches the layout already used for tasks, i.e.
                 p::uclamp[clamp_id]::value
 - use {WRITE,READ}_ONCE() for rq's clamp access
 - update layout of rq::uclamp_cpu to better match that of tasks,
   i.e now access CPU's clamp buckets as:
     rq->uclamp[clamp_id]{.bucket[bucket_id].value}
   which matches:
      p->uclamp[clamp_id]
---
 include/linux/sched.h |   6 ++
 kernel/sched/core.c   | 152 ++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h  |  49 ++++++++++++++
 3 files changed, 207 insertions(+)

Comments

Peter Zijlstra Jan. 21, 2019, 2:59 p.m. UTC | #1
On Tue, Jan 15, 2019 at 10:15:01AM +0000, Patrick Bellasi wrote:
> @@ -835,6 +954,28 @@ static void uclamp_bucket_inc(struct uclamp_se *uc_se, unsigned int clamp_id,
>  	} while (!atomic_long_try_cmpxchg(&uc_maps[bucket_id].adata,
>  					  &uc_map_old.data, uc_map_new.data));
>  
> +	/*
> +	 * Ensure each CPU tracks the correct value for this clamp bucket.
> +	 * This initialization of per-CPU variables is required only when a
> +	 * clamp value is requested for the first time from a slow-path.
> +	 */

I'm confused; why is this needed?

> +	if (unlikely(!uc_map_old.se_count)) {
> +		for_each_possible_cpu(cpu) {
> +			struct uclamp_cpu *uc_cpu =
> +				&cpu_rq(cpu)->uclamp[clamp_id];
> +
> +			/* CPU's tasks count must be 0 for free buckets */
> +			SCHED_WARN_ON(uc_cpu->bucket[bucket_id].tasks);
> +			if (unlikely(uc_cpu->bucket[bucket_id].tasks))
> +				uc_cpu->bucket[bucket_id].tasks = 0;
> +
> +			/* Minimize cache lines invalidation */
> +			if (uc_cpu->bucket[bucket_id].value == bucket_value)
> +				continue;
> +			uc_cpu->bucket[bucket_id].value = bucket_value;
> +		}
> +	}
> +
>  	uc_se->value = clamp_value;
>  	uc_se->bucket_id = bucket_id;
>
Peter Zijlstra Jan. 21, 2019, 3:17 p.m. UTC | #2
On Tue, Jan 15, 2019 at 10:15:01AM +0000, Patrick Bellasi wrote:
> +#ifdef CONFIG_UCLAMP_TASK

> +struct uclamp_bucket {
> +	unsigned long value : bits_per(SCHED_CAPACITY_SCALE);
> +	unsigned long tasks : BITS_PER_LONG - bits_per(SCHED_CAPACITY_SCALE);
> +};

> +struct uclamp_cpu {
> +	unsigned int value;

	/* 4 byte hole */

> +	struct uclamp_bucket bucket[UCLAMP_BUCKETS];
> +};

With the default of 5, this UCLAMP_BUCKETS := 6, so struct uclamp_cpu
ends up being 7 'unsigned long's, or 56 bytes on 64bit (with a 4 byte
hole).

> +#endif /* CONFIG_UCLAMP_TASK */
> +
>  /*
>   * This is the main, per-CPU runqueue data structure.
>   *
> @@ -835,6 +879,11 @@ struct rq {
>  	unsigned long		nr_load_updates;
>  	u64			nr_switches;
>  
> +#ifdef CONFIG_UCLAMP_TASK
> +	/* Utilization clamp values based on CPU's RUNNABLE tasks */
> +	struct uclamp_cpu	uclamp[UCLAMP_CNT] ____cacheline_aligned;

Which makes this 112 bytes with 8 bytes in 2 holes, which is short of 2
64 byte cachelines.

Is that the best layout?

> +#endif
> +
>  	struct cfs_rq		cfs;
>  	struct rt_rq		rt;
>  	struct dl_rq		dl;
> -- 
> 2.19.2
>
Patrick Bellasi Jan. 21, 2019, 3:23 p.m. UTC | #3
On 21-Jan 15:59, Peter Zijlstra wrote:
> On Tue, Jan 15, 2019 at 10:15:01AM +0000, Patrick Bellasi wrote:
> > @@ -835,6 +954,28 @@ static void uclamp_bucket_inc(struct uclamp_se *uc_se, unsigned int clamp_id,
> >  	} while (!atomic_long_try_cmpxchg(&uc_maps[bucket_id].adata,
> >  					  &uc_map_old.data, uc_map_new.data));
> >  
> > +	/*
> > +	 * Ensure each CPU tracks the correct value for this clamp bucket.
> > +	 * This initialization of per-CPU variables is required only when a
> > +	 * clamp value is requested for the first time from a slow-path.
> > +	 */
> 
> I'm confused; why is this needed?

That's a lazy initialization of the per-CPU uclamp data for a given
bucket, i.e. the clamp value assigned to a bucket, which happens only
when new clamp values are requested... usually only at system
boot/configuration time.

For example, let's say we have these buckets mapped to given clamp
values:

 bucket_#0: clamp value: 10% (mapped)
 bucket_#1: clamp value: 20% (mapped)
 bucket_#2: clamp value: 30% (mapped)

and then let's assume all the users of bucket_#1 are "destroyed", i.e.
there are no more tasks, system defaults or cgroups asking for a
20% clamp value. The corresponding bucket will become free:

 bucket_#0: clamp value: 10% (mapped)
 bucket_#1: clamp value: 20% (free)
 bucket_#2: clamp value: 30% (mapped)

If, in the future, we ask for a new clamp value, let's say a task asks
for a 40% clamp value, then we need to map that value into a bucket.
Since bucket_#1 is free we can use it to fill up the hole and keep all
the buckets in use at the beginning of a cache line.

However, since now bucket_#1 tracks a different clamp value (40
instead of 20) we need to walk all the CPUs and update the cached
value:

 bucket_#0: clamp value: 10% (mapped)
 bucket_#1: clamp value: 40% (mapped)
 bucket_#2: clamp value: 30% (mapped)

Is that more clear ?


In the following code:

> 
> > +	if (unlikely(!uc_map_old.se_count)) {

            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This condition is matched by clamp buckets which need the
initialization described above. These are buckets without a client so
far which have been selected to map/track a new clamp value.
That's why we have an unlikely(): quite likely tasks/cgroups will keep
asking for the same (limited number of) clamp values and thus we find
a bucket already properly initialized for them.


> > +		for_each_possible_cpu(cpu) {
> > +			struct uclamp_cpu *uc_cpu =
> > +				&cpu_rq(cpu)->uclamp[clamp_id];
> > +
> > +			/* CPU's tasks count must be 0 for free buckets */
> > +			SCHED_WARN_ON(uc_cpu->bucket[bucket_id].tasks);
> > +			if (unlikely(uc_cpu->bucket[bucket_id].tasks))
> > +				uc_cpu->bucket[bucket_id].tasks = 0;

That's a safety check: we expect that (free) buckets do not refcount
any task. That's one of the conditions for a bucket to be considered
free. Here we do just a sanity check, which is why we use unlikely().
If the check matches there is data corruption, which is reported by
the previous SCHED_WARN_ON and "fixed" by the if branch.

In my tests I ran with s/SCHED_WARN_ON/BUG_ON/ and never hit that bug...
thus the refcounting code should be OK and this check is there just to
be more on the safe side for future changes.

> > +
> > +			/* Minimize cache lines invalidation */
> > +			if (uc_cpu->bucket[bucket_id].value == bucket_value)
> > +				continue;

If by any chance we get a request for a new clamp value which happens
to have been used before... we can skip the initialization to avoid
invalidating cache lines unnecessarily.

> > +			uc_cpu->bucket[bucket_id].value = bucket_value;
> > +		}
> > +	}
> > +
> >  	uc_se->value = clamp_value;
> >  	uc_se->bucket_id = bucket_id;
> >
Patrick Bellasi Jan. 21, 2019, 3:54 p.m. UTC | #4
On 21-Jan 16:17, Peter Zijlstra wrote:
> On Tue, Jan 15, 2019 at 10:15:01AM +0000, Patrick Bellasi wrote:
> > +#ifdef CONFIG_UCLAMP_TASK
> 
> > +struct uclamp_bucket {
> > +	unsigned long value : bits_per(SCHED_CAPACITY_SCALE);
> > +	unsigned long tasks : BITS_PER_LONG - bits_per(SCHED_CAPACITY_SCALE);
> > +};
> 
> > +struct uclamp_cpu {
> > +	unsigned int value;
> 
> 	/* 4 byte hole */
> 
> > +	struct uclamp_bucket bucket[UCLAMP_BUCKETS];
> > +};
> 
> With the default of 5, this UCLAMP_BUCKETS := 6, so struct uclamp_cpu
> ends up being 7 'unsigned long's, or 56 bytes on 64bit (with a 4 byte
> hole).

Yes, that's dimensioned and configured to fit into a single cache line
for all the possible 5 (by default) clamp values of a clamp index
(i.e. min or max util).

> 
> > +#endif /* CONFIG_UCLAMP_TASK */
> > +
> >  /*
> >   * This is the main, per-CPU runqueue data structure.
> >   *
> > @@ -835,6 +879,11 @@ struct rq {
> >  	unsigned long		nr_load_updates;
> >  	u64			nr_switches;
> >  
> > +#ifdef CONFIG_UCLAMP_TASK
> > +	/* Utilization clamp values based on CPU's RUNNABLE tasks */
> > +	struct uclamp_cpu	uclamp[UCLAMP_CNT] ____cacheline_aligned;
> 
> Which makes this 112 bytes with 8 bytes in 2 holes, which is short of 2
> 64 byte cachelines.

Right, we have 2 cache lines where:
- the first $L tracks 5 different util_min values
- the second $L tracks 5 different util_max values

> Is that the best layout?

It changed a few times and that's what I found most reasonable, both
for fitting the default configuration and for code readability.
Notice that we access RQ and SE clamp values with the same pattern,
for example:

   {rq|p}->uclamp[clamp_idx].value

Are you worried about the holes or something else specific ?

> > +#endif
> > +
> >  	struct cfs_rq		cfs;
> >  	struct rt_rq		rt;
> >  	struct dl_rq		dl;
> > -- 
> > 2.19.2
> >
Peter Zijlstra Jan. 21, 2019, 4:12 p.m. UTC | #5
On Mon, Jan 21, 2019 at 03:23:11PM +0000, Patrick Bellasi wrote:
> On 21-Jan 15:59, Peter Zijlstra wrote:
> > On Tue, Jan 15, 2019 at 10:15:01AM +0000, Patrick Bellasi wrote:
> > > @@ -835,6 +954,28 @@ static void uclamp_bucket_inc(struct uclamp_se *uc_se, unsigned int clamp_id,
> > >  	} while (!atomic_long_try_cmpxchg(&uc_maps[bucket_id].adata,
> > >  					  &uc_map_old.data, uc_map_new.data));
> > >  
> > > +	/*
> > > +	 * Ensure each CPU tracks the correct value for this clamp bucket.
> > > +	 * This initialization of per-CPU variables is required only when a
> > > +	 * clamp value is requested for the first time from a slow-path.
> > > +	 */
> > 
> > I'm confused; why is this needed?
> 
> That's a lazy initialization of the per-CPU uclamp data for a given
> bucket, i.e. the clamp value assigned to a bucket, which happens only
> when new clamp values are requested... usually only at system
> boot/configuration time.
> 
> For example, let say we have these buckets mapped to given clamp
> values:
> 
>  bucket_#0: clamp value: 10% (mapped)
>  bucket_#1: clamp value: 20% (mapped)
>  bucket_#2: clamp value: 30% (mapped)
> 
> and then let's assume all the users of bucket_#1 are "destroyed", i.e.
> there are no more tasks, system defaults or cgroups asking for a
> 20% clamp value. The corresponding bucket will become free:
> 
>  bucket_#0: clamp value: 10% (mapped)
>  bucket_#1: clamp value: 20% (free)
>  bucket_#2: clamp value: 30% (mapped)
> 
> If, in the future, we ask for a new clamp value, let say a task ask
> for a 40% clamp value, then we need to map that value into a bucket.
> Since bucket_#1 is free we can use it to fill up the hold and keep all
> the buckets in use at the beginning of a cache line.
> 
> However, since now bucket_#1 tracks a different clamp value (40
> instead of 20) we need to walk all the CPUs and updated the cached
> value:
> 
>  bucket_#0: clamp value: 10% (mapped)
>  bucket_#1: clamp value: 40% (mapped)
>  bucket_#2: clamp value: 30% (mapped)
> 
> Is that more clear ?

Yes, and I realized this a little while after sending this; but I'm not
sure I have an answer to why though.

That is; why isn't the whole thing hard coded to have:

 bucket_n: clamp value: n*UCLAMP_BUCKET_DELTA

We already do that division anyway (clamp_value / UCLAMP_BUCKET_DELTA),
and from that we instantly have the right bucket index. And that allows
us to initialize all this beforehand.
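
For illustration, the hard-coded linear mapping suggested here would look
roughly like this (a sketch only, with hypothetical helper names, not code
from the posted series; UCLAMP_BUCKET_DELTA as defined earlier in the series):

   /*
    * Sketch: bucket_n permanently tracks n * UCLAMP_BUCKET_DELTA, so the
    * bucket index follows directly from the requested clamp value and no
    * lazy per-CPU (re)initialization is ever needed.
    */
   static inline unsigned int uclamp_bucket_id(unsigned int clamp_value)
   {
           return min_t(unsigned int,
                        clamp_value / UCLAMP_BUCKET_DELTA, UCLAMP_BUCKETS - 1);
   }

   static inline unsigned int uclamp_bucket_base_value(unsigned int clamp_value)
   {
           return UCLAMP_BUCKET_DELTA * uclamp_bucket_id(clamp_value);
   }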

> and keep all
> the buckets in use at the beginning of a cache line.

That; is that the rationale for all this? Note that per the defaults
everything is in a single line already.
Patrick Bellasi Jan. 21, 2019, 4:33 p.m. UTC | #6
On 21-Jan 17:12, Peter Zijlstra wrote:
> On Mon, Jan 21, 2019 at 03:23:11PM +0000, Patrick Bellasi wrote:
> > On 21-Jan 15:59, Peter Zijlstra wrote:
> > > On Tue, Jan 15, 2019 at 10:15:01AM +0000, Patrick Bellasi wrote:
> > > > @@ -835,6 +954,28 @@ static void uclamp_bucket_inc(struct uclamp_se *uc_se, unsigned int clamp_id,
> > > >  	} while (!atomic_long_try_cmpxchg(&uc_maps[bucket_id].adata,
> > > >  					  &uc_map_old.data, uc_map_new.data));
> > > >  
> > > > +	/*
> > > > +	 * Ensure each CPU tracks the correct value for this clamp bucket.
> > > > +	 * This initialization of per-CPU variables is required only when a
> > > > +	 * clamp value is requested for the first time from a slow-path.
> > > > +	 */
> > > 
> > > I'm confused; why is this needed?
> > 
> > That's a lazy initialization of the per-CPU uclamp data for a given
> > bucket, i.e. the clamp value assigned to a bucket, which happens only
> > when new clamp values are requested... usually only at system
> > boot/configuration time.
> > 
> > For example, let say we have these buckets mapped to given clamp
> > values:
> > 
> >  bucket_#0: clamp value: 10% (mapped)
> >  bucket_#1: clamp value: 20% (mapped)
> >  bucket_#2: clamp value: 30% (mapped)
> > 
> > and then let's assume all the users of bucket_#1 are "destroyed", i.e.
> > there are no more tasks, system defaults or cgroups asking for a
> > 20% clamp value. The corresponding bucket will become free:
> > 
> >  bucket_#0: clamp value: 10% (mapped)
> >  bucket_#1: clamp value: 20% (free)
> >  bucket_#2: clamp value: 30% (mapped)
> > 
> > If, in the future, we ask for a new clamp value, let say a task ask
> > for a 40% clamp value, then we need to map that value into a bucket.
> > Since bucket_#1 is free we can use it to fill up the hold and keep all
> > the buckets in use at the beginning of a cache line.
> > 
> > However, since now bucket_#1 tracks a different clamp value (40
> > instead of 20) we need to walk all the CPUs and updated the cached
> > value:
> > 
> >  bucket_#0: clamp value: 10% (mapped)
> >  bucket_#1: clamp value: 40% (mapped)
> >  bucket_#2: clamp value: 30% (mapped)
> > 
> > Is that more clear ?
> 
> Yes, and I realized this a little while after sending this; but I'm not
> sure I have an answer to why though.
> 
> That is; why isn't the whole thing hard coded to have:
> 
>  bucket_n: clamp value: n*UCLAMP_BUCKET_DELTA
> 
> We already do that division anyway (clamp_value / UCLAMP_BUCKET_DELTA),
> and from that we instantly have the right bucket index. And that allows
> us to initialize all this beforehand.
> 
> > and keep all
> > the buckets in use at the beginning of a cache line.
> 
> That; is that the rationale for all this? Note that per the defaults
> everything is in a single line already.

Yes, that's because of the loop in:

   dequeue_task()
     uclamp_cpu_dec()
       uclamp_cpu_dec_id()
         uclamp_cpu_update()

where buckets sometimes need to be scanned to find a new max.

Consider also that, with mapping, we can more easily increase the
buckets count to 20 in order to have a finer clamping granularity if
needed, without worrying too much about the performance impact,
especially since we anyway use few different clamp values.

So, I agree that mapping adds (code) complexity but it can also save a
few cycles in the fast path... do you think it's not worth the added
complexity?

TBH I never did proper profiling with/without mapping... I'm just worried
in principle about a loop over 20 entries spanning 4 cache lines. :/

NOTE: the loop currently goes through all the entries anyway,
      but we can later add a guard to bail out once we have covered the
      number of active entries.
Peter Zijlstra Jan. 22, 2019, 9:45 a.m. UTC | #7
On Mon, Jan 21, 2019 at 04:33:38PM +0000, Patrick Bellasi wrote:
> On 21-Jan 17:12, Peter Zijlstra wrote:
> > On Mon, Jan 21, 2019 at 03:23:11PM +0000, Patrick Bellasi wrote:

> > > and keep all
> > > the buckets in use at the beginning of a cache line.
> > 
> > That; is that the rationale for all this? Note that per the defaults
> > everything is in a single line already.
> 
> Yes, that's because of the loop in:
> 
>    dequeue_task()
>      uclamp_cpu_dec()
>        uclamp_cpu_dec_id()
>          uclamp_cpu_update()
> 
> where buckets needs sometimes to be scanned to find a new max.
> 
> Consider also that, with mapping, we can more easily increase the
> buckets count to 20 in order to have a finer clamping granularity if
> needed without warring too much about performance impact especially
> when we use anyway few different clamp values.
> 
> So, I agree that mapping adds (code) complexity but it can also save
> few cycles in the fast path... do you think it's not worth the added
> complexity?

Then maybe split this out in a separate patch? Do the trivial linear
bucket thing first and then do this smarty pants thing on top.

One problem with the scheme is that it doesn't defrag; so if you get a
peak usage, you can still end up with only two active buckets in
different lines.

Also; if it is its own patch, you get a much better view of the
additional complexity and a chance to justify it ;-)

Also; would it make sense to do s/cpu/rq/ on much of this? All this
uclamp_cpu_*() stuff really is per rq and takes rq arguments, so why
does it have cpu in the name... no strong feelings, just noticed it and
thought it is a tad inconsistent.
Peter Zijlstra Jan. 22, 2019, 10:03 a.m. UTC | #8
On Mon, Jan 21, 2019 at 03:54:07PM +0000, Patrick Bellasi wrote:
> On 21-Jan 16:17, Peter Zijlstra wrote:
> > On Tue, Jan 15, 2019 at 10:15:01AM +0000, Patrick Bellasi wrote:
> > > +#ifdef CONFIG_UCLAMP_TASK
> > 
> > > +struct uclamp_bucket {
> > > +	unsigned long value : bits_per(SCHED_CAPACITY_SCALE);
> > > +	unsigned long tasks : BITS_PER_LONG - bits_per(SCHED_CAPACITY_SCALE);
> > > +};
> > 
> > > +struct uclamp_cpu {
> > > +	unsigned int value;
> > 
> > 	/* 4 byte hole */
> > 
> > > +	struct uclamp_bucket bucket[UCLAMP_BUCKETS];
> > > +};
> > 
> > With the default of 5, this UCLAMP_BUCKETS := 6, so struct uclamp_cpu
> > ends up being 7 'unsigned long's, or 56 bytes on 64bit (with a 4 byte
> > hole).
> 
> Yes, that's dimensioned and configured to fit into a single cache line
> for all the possible 5 (by default) clamp values of a clamp index
> (i.e. min or max util).

And I suppose you picked 5 because 20% is a 'nice' number? whereas
16.666% is a bit odd?

> > > +#endif /* CONFIG_UCLAMP_TASK */
> > > +
> > >  /*
> > >   * This is the main, per-CPU runqueue data structure.
> > >   *
> > > @@ -835,6 +879,11 @@ struct rq {
> > >  	unsigned long		nr_load_updates;
> > >  	u64			nr_switches;
> > >  
> > > +#ifdef CONFIG_UCLAMP_TASK
> > > +	/* Utilization clamp values based on CPU's RUNNABLE tasks */
> > > +	struct uclamp_cpu	uclamp[UCLAMP_CNT] ____cacheline_aligned;
> > 
> > Which makes this 112 bytes with 8 bytes in 2 holes, which is short of 2
> > 64 byte cachelines.
> 
> Right, we have 2 cache lines where:
> - the first $L tracks 5 different util_min values
> - the second $L tracks 5 different util_max values

Well, not quite so; if you want that you should put
____cacheline_aligned on struct uclamp_cpu, such that the individual
array entries are each aligned. The above only aligns the whole array,
so the second uclamp_cpu is spread over both lines.

But I think this is actually better, since you have to scan both
min/max anyway, and allowing one to straddle a line you have to touch
anyway allows for using fewer lines in total.

Consider for example the case where UCLAMP_BUCKETS=8, then each
uclamp_cpu would be 9 words or 72 bytes. If you force align the member,
then you end up with 4 lines, whereas now it would be 3.
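
Spelling out the arithmetic (a back-of-the-envelope sketch, assuming 64-bit
longs and 64-byte cache lines):

   sizeof(struct uclamp_bucket)                 =  8  (one unsigned long of bitfields)
   sizeof(struct uclamp_cpu), UCLAMP_BUCKETS=6  = 4 + 4 (hole) + 6*8 = 56
   sizeof(rq->uclamp), UCLAMP_CNT=2             = 2*56 = 112 -> 2 cache lines

   sizeof(struct uclamp_cpu), UCLAMP_BUCKETS=8  = 4 + 4 (hole) + 8*8 = 72
     array only ____cacheline_aligned:   2*72  = 144 -> spans 3 cache lines
     each member ____cacheline_aligned:  2*128 = 256 -> spans 4 cache lines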

> > Is that the best layout?
> 
> It changed few times and that's what I found more reasonable for both
> for fitting the default configuration and also for code readability.
> Notice that we access RQ and SE clamp values with the same patter,
> for example:
> 
>    {rq|p}->uclamp[clamp_idx].value
> 
> Are you worried about the holes or something else specific ?

Not sure; just mostly asking if this was by design or by accident.

One thing I did wonder though; since bucket[0] is counting the tasks
that are unconstrained and its bucket value is basically fixed (0 /
1024), can't we abuse that value field to store uclamp_cpu::value ?

OTOH, doing that might make the code really ugly with all them:

  if (!bucket_id)

exceptions all over the place.
Patrick Bellasi Jan. 22, 2019, 10:31 a.m. UTC | #9
On 22-Jan 10:45, Peter Zijlstra wrote:
> On Mon, Jan 21, 2019 at 04:33:38PM +0000, Patrick Bellasi wrote:
> > On 21-Jan 17:12, Peter Zijlstra wrote:
> > > On Mon, Jan 21, 2019 at 03:23:11PM +0000, Patrick Bellasi wrote:
> 
> > > > and keep all
> > > > the buckets in use at the beginning of a cache line.
> > > 
> > > That; is that the rationale for all this? Note that per the defaults
> > > everything is in a single line already.
> > 
> > Yes, that's because of the loop in:
> > 
> >    dequeue_task()
> >      uclamp_cpu_dec()
> >        uclamp_cpu_dec_id()
> >          uclamp_cpu_update()
> > 
> > where buckets needs sometimes to be scanned to find a new max.
> > 
> > Consider also that, with mapping, we can more easily increase the
> > buckets count to 20 in order to have a finer clamping granularity if
> > needed without warring too much about performance impact especially
> > when we use anyway few different clamp values.
> > 
> > So, I agree that mapping adds (code) complexity but it can also save
> > few cycles in the fast path... do you think it's not worth the added
> > complexity?
> 
> Then maybe split this out in a separate patch? Do the trivial linear
> bucket thing first and then do this smarty pants thing on top.
> 
> One problem with the scheme is that it doesn't defrag; so if you get a
> peak usage, you can still end up with only two active buckets in
> different lines.

You're right, that was saved for a later optimization. :/

Mainly in consideration of the fact that, at least for the main usage
we have in mind on Android, we will likely configure all the required
clamps once and for all at boot time.

> Also; if it is it's own patch, you get a much better view of the
> additional complexity and a chance to justify it ;-)

What about ditching the mapping for the time being and seeing if we
get a real overhead hit in the future?

At that point we will revamp the mapping patch with proper defrag
support as well.

> Also; would it make sense to do s/cpu/rq/ on much of this? All this
> uclamp_cpu_*() stuff really is per rq and takes rq arguments, so why
> does it have cpu in the name... no strong feelings, just noticed it and
> thought is a tad inconsistent.

The idea behind using "cpu" instead of "rq" was that we use those only at
root rq level and the clamps are aggregated per-CPU.

I remember one of the first versions used "cpu" instead of "rq" as a
parameter name and you proposed to change it as an optimization since
we call it from dequeue_task() where we already have a *rq.

... but, since we have those uclamp data within struct rq, I think you
are right: it makes more sense to rename the functions.

Will do it in v7, thanks.
Patrick Bellasi Jan. 22, 2019, 10:53 a.m. UTC | #10
On 22-Jan 11:03, Peter Zijlstra wrote:
> On Mon, Jan 21, 2019 at 03:54:07PM +0000, Patrick Bellasi wrote:
> > On 21-Jan 16:17, Peter Zijlstra wrote:
> > > On Tue, Jan 15, 2019 at 10:15:01AM +0000, Patrick Bellasi wrote:
> > > > +#ifdef CONFIG_UCLAMP_TASK
> > > 
> > > > +struct uclamp_bucket {
> > > > +	unsigned long value : bits_per(SCHED_CAPACITY_SCALE);
> > > > +	unsigned long tasks : BITS_PER_LONG - bits_per(SCHED_CAPACITY_SCALE);
> > > > +};
> > > 
> > > > +struct uclamp_cpu {
> > > > +	unsigned int value;
> > > 
> > > 	/* 4 byte hole */
> > > 
> > > > +	struct uclamp_bucket bucket[UCLAMP_BUCKETS];
> > > > +};
> > > 
> > > With the default of 5, this UCLAMP_BUCKETS := 6, so struct uclamp_cpu
> > > ends up being 7 'unsigned long's, or 56 bytes on 64bit (with a 4 byte
> > > hole).
> > 
> > Yes, that's dimensioned and configured to fit into a single cache line
> > for all the possible 5 (by default) clamp values of a clamp index
> > (i.e. min or max util).
> 
> And I suppose you picked 5 because 20% is a 'nice' number? whereas
> 16./666/% is a bit odd?

Yes, UCLAMP_BUCKETS:=6 gives me 5 20% buckets:

 0-19%, 20-39%, 40-59%, 60-79%, 80-99%
 
plus a 100% bucket to track the max boosted tasks.

Does that make sense?
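
Spelled out (a sketch of the arithmetic, assuming SCHED_CAPACITY_SCALE=1024
and UCLAMP_BUCKET_DELTA = 1024/5 = 204, as defined earlier in the series):

   clamp_value / UCLAMP_BUCKET_DELTA:
      0 ..  203  ->  bucket_id 0   (~ 0-19%)
    204 ..  407  ->  bucket_id 1   (~20-39%)
    408 ..  611  ->  bucket_id 2   (~40-59%)
    612 ..  815  ->  bucket_id 3   (~60-79%)
    816 .. 1019  ->  bucket_id 4   (~80-99%)
   1020 .. 1024  ->  bucket_id 5   (100%, max boosted tasks)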

> > > > +#endif /* CONFIG_UCLAMP_TASK */
> > > > +
> > > >  /*
> > > >   * This is the main, per-CPU runqueue data structure.
> > > >   *
> > > > @@ -835,6 +879,11 @@ struct rq {
> > > >  	unsigned long		nr_load_updates;
> > > >  	u64			nr_switches;
> > > >  
> > > > +#ifdef CONFIG_UCLAMP_TASK
> > > > +	/* Utilization clamp values based on CPU's RUNNABLE tasks */
> > > > +	struct uclamp_cpu	uclamp[UCLAMP_CNT] ____cacheline_aligned;
> > > 
> > > Which makes this 112 bytes with 8 bytes in 2 holes, which is short of 2
> > > 64 byte cachelines.
> > 
> > Right, we have 2 cache lines where:
> > - the first $L tracks 5 different util_min values
> > - the second $L tracks 5 different util_max values
> 
> Well, not quite so, if you want that you should put
> ____cacheline_aligned on struct uclamp_cpu. Such that the individual
> array entries are each aligned, the above only alignes the whole array,
> so the second uclamp_cpu is spread over both lines.

That's true... I considered it more important to save space if we
have a bucket count which can fit in, let's say, 3 cache lines.
... but if you prefer the other way around I'll move it.

> But I think this is actually better, since you have to scan both
> min/max anyway, and allowing one the straddle a line you have to touch
> anyway, allows for using less lines in total.

Right.

> Consider for example the case where UCLAMP_BUCKETS=8, then each
> uclamp_cpu would be 9 words or 72 bytes. If you force align the member,
> then you end up with 4 lines, whereas now it would be 3.

Exactly :)

> > > Is that the best layout?
> > 
> > It changed few times and that's what I found more reasonable for both
> > for fitting the default configuration and also for code readability.
> > Notice that we access RQ and SE clamp values with the same patter,
> > for example:
> > 
> >    {rq|p}->uclamp[clamp_idx].value
> > 
> > Are you worried about the holes or something else specific ?
> 
> Not sure; just mostly asking if this was by design or by accident.
> 
> One thing I did wonder though; since bucket[0] is counting the tasks
> that are unconstrained and it's bucket value is basically fixed (0 /
> 1024), can't we abuse that value field to store uclamp_cpu::value ?

Mmm... should be possible, just worried about adding special cases
which can make the code even more complex than it already is.

.... moreover, if we ditch the mapping, the 1024 will be indexed at
the top of the array... so...

> OTOH, doing that might make the code really ugly with all them:
> 
>   if (!bucket_id)
> 
> exceptions all over the place.

Exactly... I should read all your comments before replying :)

Patch

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4f72f956850f..84294925d006 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -601,6 +601,7 @@  struct sched_dl_entity {
  * @value:		clamp value tracked by a clamp bucket
  * @bucket_id:		the bucket index used by the fast-path
  * @mapped:		the bucket index is valid
+ * @active:		the se is currently refcounted in a CPU's clamp bucket
  *
  * A utilization clamp bucket maps a:
  *   clamp value (value), i.e.
@@ -614,11 +615,16 @@  struct sched_dl_entity {
  *   uclamp_bucket_inc() - for a new clamp value
  * is matched by a:
  *   uclamp_bucket_dec() - for the old clamp value
+ *
+ * The active bit is set whenever a task has got an effective clamp bucket
+ * and value assigned, which can be different from the user requested ones.
+ * This allows to know a task is actually refcounting a CPU's clamp bucket.
  */
 struct uclamp_se {
 	unsigned int value		: bits_per(SCHED_CAPACITY_SCALE);
 	unsigned int bucket_id		: bits_per(UCLAMP_BUCKETS);
 	unsigned int mapped		: 1;
+	unsigned int active		: 1;
 };
 #endif /* CONFIG_UCLAMP_TASK */
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3f87898b13a0..190137cd7b3b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -766,6 +766,124 @@  static inline unsigned int uclamp_bucket_value(unsigned int clamp_value)
 	return UCLAMP_BUCKET_DELTA * (clamp_value / UCLAMP_BUCKET_DELTA);
 }
 
+static inline void uclamp_cpu_update(struct rq *rq, unsigned int clamp_id)
+{
+	unsigned int max_value = 0;
+	unsigned int bucket_id;
+
+	for (bucket_id = 0; bucket_id < UCLAMP_BUCKETS; ++bucket_id) {
+		unsigned int bucket_value;
+
+		if (!rq->uclamp[clamp_id].bucket[bucket_id].tasks)
+			continue;
+
+		/* Both min and max clamps are MAX aggregated */
+		bucket_value = rq->uclamp[clamp_id].bucket[bucket_id].value;
+		max_value = max(max_value, bucket_value);
+		if (max_value >= SCHED_CAPACITY_SCALE)
+			break;
+	}
+	WRITE_ONCE(rq->uclamp[clamp_id].value, max_value);
+}
+
+/*
+ * When a task is enqueued on a CPU's rq, the clamp bucket currently defined by
+ * the task's uclamp::bucket_id is reference counted on that CPU. This also
+ * immediately updates the CPU's clamp value if required.
+ *
+ * Since tasks know their specific value requested from user-space, we track
+ * within each bucket the maximum value for tasks refcounted in that bucket.
+ */
+static inline void uclamp_cpu_inc_id(struct task_struct *p, struct rq *rq,
+				     unsigned int clamp_id)
+{
+	unsigned int cpu_clamp, grp_clamp, tsk_clamp;
+	unsigned int bucket_id;
+
+	if (unlikely(!p->uclamp[clamp_id].mapped))
+		return;
+
+	bucket_id = p->uclamp[clamp_id].bucket_id;
+	p->uclamp[clamp_id].active = true;
+
+	rq->uclamp[clamp_id].bucket[bucket_id].tasks++;
+
+	/* CPU's clamp buckets track the max effective clamp value */
+	tsk_clamp = p->uclamp[clamp_id].value;
+	grp_clamp = rq->uclamp[clamp_id].bucket[bucket_id].value;
+	rq->uclamp[clamp_id].bucket[bucket_id].value = max(grp_clamp, tsk_clamp);
+
+	/* Update CPU clamp value if required */
+	cpu_clamp = READ_ONCE(rq->uclamp[clamp_id].value);
+	WRITE_ONCE(rq->uclamp[clamp_id].value, max(cpu_clamp, tsk_clamp));
+}
+
+/*
+ * When a task is dequeued from a CPU's rq, the CPU's clamp bucket reference
+ * counted by the task is released. If this is the last task reference
+ * counting the CPU's max active clamp value, then the CPU's clamp value is
+ * updated.
+ * Both the tasks reference counter and the CPU's cached clamp values are
+ * expected to be always valid, if we detect they are not we skip the updates,
+ * enforce a consistent state and warn.
+ */
+static inline void uclamp_cpu_dec_id(struct task_struct *p, struct rq *rq,
+				     unsigned int clamp_id)
+{
+	unsigned int clamp_value;
+	unsigned int bucket_id;
+
+	if (unlikely(!p->uclamp[clamp_id].mapped))
+		return;
+
+	bucket_id = p->uclamp[clamp_id].bucket_id;
+	p->uclamp[clamp_id].active = false;
+
+	SCHED_WARN_ON(!rq->uclamp[clamp_id].bucket[bucket_id].tasks);
+	if (likely(rq->uclamp[clamp_id].bucket[bucket_id].tasks))
+		rq->uclamp[clamp_id].bucket[bucket_id].tasks--;
+
+	/* We accept to (possibly) overboost tasks still RUNNABLE */
+	if (likely(rq->uclamp[clamp_id].bucket[bucket_id].tasks))
+		return;
+	clamp_value = rq->uclamp[clamp_id].bucket[bucket_id].value;
+
+	/* The CPU's clamp value is expected to always track the max */
+	SCHED_WARN_ON(clamp_value > rq->uclamp[clamp_id].value);
+
+	if (clamp_value >= READ_ONCE(rq->uclamp[clamp_id].value)) {
+		/*
+		 * Reset CPU's clamp bucket value to its nominal value whenever
+		 * there are no more RUNNABLE tasks refcounting it.
+		 */
+		rq->uclamp[clamp_id].bucket[bucket_id].value =
+			uclamp_maps[clamp_id][bucket_id].value;
+		uclamp_cpu_update(rq, clamp_id);
+	}
+}
+
+static inline void uclamp_cpu_inc(struct rq *rq, struct task_struct *p)
+{
+	unsigned int clamp_id;
+
+	if (unlikely(!p->sched_class->uclamp_enabled))
+		return;
+
+	for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
+		uclamp_cpu_inc_id(p, rq, clamp_id);
+}
+
+static inline void uclamp_cpu_dec(struct rq *rq, struct task_struct *p)
+{
+	unsigned int clamp_id;
+
+	if (unlikely(!p->sched_class->uclamp_enabled))
+		return;
+
+	for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
+		uclamp_cpu_dec_id(p, rq, clamp_id);
+}
+
 static void uclamp_bucket_dec(unsigned int clamp_id, unsigned int bucket_id)
 {
 	union uclamp_map *uc_maps = &uclamp_maps[clamp_id][0];
@@ -798,6 +916,7 @@  static void uclamp_bucket_inc(struct uclamp_se *uc_se, unsigned int clamp_id,
 	unsigned int free_bucket_id;
 	unsigned int bucket_value;
 	unsigned int bucket_id;
+	int cpu;
 
 	bucket_value = uclamp_bucket_value(clamp_value);
 
@@ -835,6 +954,28 @@  static void uclamp_bucket_inc(struct uclamp_se *uc_se, unsigned int clamp_id,
 	} while (!atomic_long_try_cmpxchg(&uc_maps[bucket_id].adata,
 					  &uc_map_old.data, uc_map_new.data));
 
+	/*
+	 * Ensure each CPU tracks the correct value for this clamp bucket.
+	 * This initialization of per-CPU variables is required only when a
+	 * clamp value is requested for the first time from a slow-path.
+	 */
+	if (unlikely(!uc_map_old.se_count)) {
+		for_each_possible_cpu(cpu) {
+			struct uclamp_cpu *uc_cpu =
+				&cpu_rq(cpu)->uclamp[clamp_id];
+
+			/* CPU's tasks count must be 0 for free buckets */
+			SCHED_WARN_ON(uc_cpu->bucket[bucket_id].tasks);
+			if (unlikely(uc_cpu->bucket[bucket_id].tasks))
+				uc_cpu->bucket[bucket_id].tasks = 0;
+
+			/* Minimize cache lines invalidation */
+			if (uc_cpu->bucket[bucket_id].value == bucket_value)
+				continue;
+			uc_cpu->bucket[bucket_id].value = bucket_value;
+		}
+	}
+
 	uc_se->value = clamp_value;
 	uc_se->bucket_id = bucket_id;
 
@@ -907,6 +1048,7 @@  static void uclamp_fork(struct task_struct *p, bool reset)
 			clamp_value = uclamp_none(clamp_id);
 
 		p->uclamp[clamp_id].mapped = false;
+		p->uclamp[clamp_id].active = false;
 		uclamp_bucket_inc(&p->uclamp[clamp_id], clamp_id, clamp_value);
 	}
 }
@@ -915,9 +1057,15 @@  static void __init init_uclamp(void)
 {
 	struct uclamp_se *uc_se;
 	unsigned int clamp_id;
+	int cpu;
 
 	mutex_init(&uclamp_mutex);
 
+	for_each_possible_cpu(cpu) {
+		memset(&cpu_rq(cpu)->uclamp, 0, sizeof(struct uclamp_cpu));
+		cpu_rq(cpu)->uclamp[UCLAMP_MAX].value = uclamp_none(UCLAMP_MAX);
+	}
+
 	memset(uclamp_maps, 0, sizeof(uclamp_maps));
 	for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
 		uc_se = &init_task.uclamp[clamp_id];
@@ -926,6 +1074,8 @@  static void __init init_uclamp(void)
 }
 
 #else /* CONFIG_UCLAMP_TASK */
+static inline void uclamp_cpu_inc(struct rq *rq, struct task_struct *p) { }
+static inline void uclamp_cpu_dec(struct rq *rq, struct task_struct *p) { }
 static inline int __setscheduler_uclamp(struct task_struct *p,
 					const struct sched_attr *attr)
 {
@@ -945,6 +1095,7 @@  static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
 		psi_enqueue(p, flags & ENQUEUE_WAKEUP);
 	}
 
+	uclamp_cpu_inc(rq, p);
 	p->sched_class->enqueue_task(rq, p, flags);
 }
 
@@ -958,6 +1109,7 @@  static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
 		psi_dequeue(p, flags & DEQUEUE_SLEEP);
 	}
 
+	uclamp_cpu_dec(rq, p);
 	p->sched_class->dequeue_task(rq, p, flags);
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a0b238156161..06ff7d890ff6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -797,6 +797,50 @@  extern void rto_push_irq_work_func(struct irq_work *work);
 #endif
 #endif /* CONFIG_SMP */
 
+#ifdef CONFIG_UCLAMP_TASK
+/*
+ * struct uclamp_bucket - Utilization clamp bucket
+ * @value: utilization clamp value for tasks on this clamp bucket
+ * @tasks: number of RUNNABLE tasks on this clamp bucket
+ *
+ * Keep track of how many tasks are RUNNABLE for a given utilization
+ * clamp value.
+ */
+struct uclamp_bucket {
+	unsigned long value : bits_per(SCHED_CAPACITY_SCALE);
+	unsigned long tasks : BITS_PER_LONG - bits_per(SCHED_CAPACITY_SCALE);
+};
+
+/*
+ * struct uclamp_cpu - CPU's utilization clamp
+ * @value: currently active clamp values for a CPU
+ * @bucket: utilization clamp buckets affecting a CPU
+ *
+ * Keep track of RUNNABLE tasks on a CPU to aggregate their clamp values.
+ * A clamp value is affecting a CPU where there is at least one task RUNNABLE
+ * (or actually running) with that value.
+ *
+ * We have up to UCLAMP_CNT possible different clamp values, which are
+ * currently only two: minimum utilization and maximum utilization.
+ *
+ * All utilization clamping values are MAX aggregated, since:
+ * - for util_min: we want to run the CPU at least at the max of the minimum
+ *   utilization required by its currently RUNNABLE tasks.
+ * - for util_max: we want to allow the CPU to run up to the max of the
+ *   maximum utilization allowed by its currently RUNNABLE tasks.
+ *
+ * Since on each system we expect only a limited number of different
+ * utilization clamp values (CONFIG_UCLAMP_BUCKETS_COUNT), we use a simple
+ * array to track the metrics required to compute all the per-CPU utilization
+ * clamp values. The additional slot is used to track the default clamp
+ * values, i.e. no min/max clamping at all.
+ */
+struct uclamp_cpu {
+	unsigned int value;
+	struct uclamp_bucket bucket[UCLAMP_BUCKETS];
+};
+#endif /* CONFIG_UCLAMP_TASK */
+
 /*
  * This is the main, per-CPU runqueue data structure.
  *
@@ -835,6 +879,11 @@  struct rq {
 	unsigned long		nr_load_updates;
 	u64			nr_switches;
 
+#ifdef CONFIG_UCLAMP_TASK
+	/* Utilization clamp values based on CPU's RUNNABLE tasks */
+	struct uclamp_cpu	uclamp[UCLAMP_CNT] ____cacheline_aligned;
+#endif
+
 	struct cfs_rq		cfs;
 	struct rt_rq		rt;
 	struct dl_rq		dl;