[RFC,15/16] sched/fair: Account kthread runtime debt for CFS bandwidth

Message ID 20220106004656.126790-16-daniel.m.jordan@oracle.com (mailing list archive)
State RFC
Delegated to: Herbert Xu
Series padata, vfio, sched: Multithreaded VFIO page pinning

Commit Message

Daniel Jordan Jan. 6, 2022, 12:46 a.m. UTC
As before, helpers in multithreaded jobs don't honor the main thread's
CFS bandwidth limits, which could lead to the group exceeding its quota.

Fix it by having helpers remote charge their CPU time to the main
thread's task group.  A helper calls a pair of new interfaces
cpu_cgroup_remote_begin() and cpu_cgroup_remote_charge() (see function
header comments) to achieve this.

This is just supposed to start a discussion, so it's pretty simple.
Once a kthread has finished a remote charging period with
cpu_cgroup_remote_charge(), its runtime is subtracted from the target
task group's runtime (cfs_bandwidth::runtime) and any remainder is saved
as debt (cfs_bandwidth::debt) to pay off in later periods.
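
For illustration, each period refill then pays down any outstanding debt
before handing out new runtime, as in the fair.c hunk below:

    /* sketch of the debt handling added to __refill_cfs_bandwidth_runtime() */
    if (cfs_b->debt) {
            payment = min(quota, cfs_b->debt);
            cfs_b->debt -= payment;
            quota -= payment;
    }
    cfs_b->runtime += quota;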

Remote charging tasks aren't throttled when the group reaches its quota,
and a task group doesn't run at all until its debt is completely paid,
but these shortcomings can be addressed if the approach ends up being
taken.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
---
 include/linux/sched.h        |  2 +
 include/linux/sched/cgroup.h | 16 ++++++
 kernel/padata.c              | 26 +++++++---
 kernel/sched/core.c          | 39 +++++++++++++++
 kernel/sched/fair.c          | 94 +++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h         |  5 ++
 6 files changed, 174 insertions(+), 8 deletions(-)

Comments

Peter Zijlstra Jan. 11, 2022, 11:58 a.m. UTC | #1
On Wed, Jan 05, 2022 at 07:46:55PM -0500, Daniel Jordan wrote:
> As before, helpers in multithreaded jobs don't honor the main thread's
> CFS bandwidth limits, which could lead to the group exceeding its quota.
> 
> Fix it by having helpers remote charge their CPU time to the main
> thread's task group.  A helper calls a pair of new interfaces
> cpu_cgroup_remote_begin() and cpu_cgroup_remote_charge() (see function
> header comments) to achieve this.
> 
> This is just supposed to start a discussion, so it's pretty simple.
> Once a kthread has finished a remote charging period with
> cpu_cgroup_remote_charge(), its runtime is subtracted from the target
> task group's runtime (cfs_bandwidth::runtime) and any remainder is saved
> as debt (cfs_bandwidth::debt) to pay off in later periods.
> 
> Remote charging tasks aren't throttled when the group reaches its quota,
> and a task group doesn't run at all until its debt is completely paid,
> but these shortcomings can be addressed if the approach ends up being
> taken.
> 

*groan*... and not a single word on why it wouldn't be much better to
simply move the task into the relevant cgroup..
Daniel Jordan Jan. 11, 2022, 4:29 p.m. UTC | #2
On Tue, Jan 11, 2022 at 12:58:53PM +0100, Peter Zijlstra wrote:
> On Wed, Jan 05, 2022 at 07:46:55PM -0500, Daniel Jordan wrote:
> > As before, helpers in multithreaded jobs don't honor the main thread's
> > CFS bandwidth limits, which could lead to the group exceeding its quota.
> > 
> > Fix it by having helpers remote charge their CPU time to the main
> > thread's task group.  A helper calls a pair of new interfaces
> > cpu_cgroup_remote_begin() and cpu_cgroup_remote_charge() (see function
> > header comments) to achieve this.
> > 
> > This is just supposed to start a discussion, so it's pretty simple.
> > Once a kthread has finished a remote charging period with
> > cpu_cgroup_remote_charge(), its runtime is subtracted from the target
> > task group's runtime (cfs_bandwidth::runtime) and any remainder is saved
> > as debt (cfs_bandwidth::debt) to pay off in later periods.
> > 
> > Remote charging tasks aren't throttled when the group reaches its quota,
> > and a task group doesn't run at all until its debt is completely paid,
> > but these shortcomings can be addressed if the approach ends up being
> > taken.
> > 
> 
> *groan*... and not a single word on why it wouldn't be much better to
> simply move the task into the relevant cgroup..

Yes, the cover letter talks about that, I'll quote the relevant part
here.

---

    15  sched/fair: Account kthread runtime debt for CFS bandwidth
    16  sched/fair: Consider kthread debt in cputime

A prototype for remote charging in CFS bandwidth and cpu.stat, described more
in the next section.  It's debatable whether these last two are required for
this series.  Patch 12 caps the number of helper threads started according to
the max effective CPUs allowed by the quota and period of the main thread's
task group.  In practice, I think this hits the sweet spot between complexity
and respecting CFS bandwidth limits so that patch 15 might just be dropped.
For instance, when running qemu with a vfio device, the restriction from patch
12 was enough to avoid the helpers breaching CFS bandwidth limits.  That leaves
patch 16, which on its own seems overkill for all the hunks it would require
from patch 15, so it could be dropped too.

Patch 12 isn't airtight, though, since other tasks running in the task group
alongside the main thread and helpers could still result in overage.  So,
patches 15-16 give an idea of what absolutely correct accounting in the CPU
controller might look like in case there are real situations that want it.


Remote Charging in the CPU Controller
-------------------------------------

CPU-intensive kthreads aren't generally accounted in the CPU controller, so
they escape settings such as weight and bandwidth when they do work on behalf
of a task group.

This problem arises with multithreaded jobs, but is also an issue in other
places.  CPU activity from async memory reclaim (kswapd, cswapd?[5]) should be
accounted to the cgroup that the memory belongs to, and similarly CPU activity
from net rx should be accounted to the task groups that correspond to the
packets being received.  There are also vague complaints from Android[6].

Each use case has its own requirements[7].  In padata and reclaim, the task
group to account to is known ahead of time, but net rx has to spend cycles
processing a packet before its destination task group is known, so any solution
should be able to work without knowing the task group in advance.  Furthermore,
the CPU controller shouldn't throttle reclaim or net rx in real time since both
are doing high priority work.  These make approaches that run kthreads directly
in a task group, like cgroup-aware workqueues[8] or a kernel path for
CLONE_INTO_CGROUP, infeasible.  Running kthreads directly in cgroups also has a
downside for padata because helpers' MAX_NICE priority is "shadowed" by the
priority of the group entities they're running under.

The proposed solution of remote charging can accrue debt to a task group to be
paid off or forgiven later, addressing all these issues.  A kthread calls the
interface

    void cpu_cgroup_remote_begin(struct task_struct *p,
                                 struct cgroup_subsys_state *css);

to begin remote charging to @css, causing @p's current sum_exec_runtime to be
updated and saved.  The @css arg isn't required and can be removed later to
facilitate the unknown cgroup case mentioned above.  Then the kthread calls
another interface

    void cpu_cgroup_remote_charge(struct task_struct *p,
                                  struct cgroup_subsys_state *css);

to account the sum_exec_runtime that @p has used since the first call.
Internally, a new field cfs_bandwidth::debt is added to keep track of unpaid
debt that's only used when the debt exceeds the quota in the current period.
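
A helper kthread then brackets its work with the pair, roughly as the padata
patch in this series does (have_work() and do_chunk_of_work() below are just
stand-ins for the job's actual chunking):

    cpu_cgroup_remote_begin(current, css);
    while (have_work()) {
            do_chunk_of_work();
            /* charge the CPU time used since the previous call */
            cpu_cgroup_remote_charge(current, css);
    }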

Weight-based control isn't implemented for now since padata helpers run at
MAX_NICE and so always yield to anything higher priority, meaning they would
rarely compete with other task groups.

[ We have another use case for remote charging: implementing CFS bandwidth
  control across multiple machines.  This is an entirely different topic
  that deserves its own thread. ]
Tejun Heo Jan. 12, 2022, 8:18 p.m. UTC | #3
Hello,

On Tue, Jan 11, 2022 at 11:29:50AM -0500, Daniel Jordan wrote:
...
> This problem arises with multithreaded jobs, but is also an issue in other
> places.  CPU activity from async memory reclaim (kswapd, cswapd?[5]) should be
> accounted to the cgroup that the memory belongs to, and similarly CPU activity
> from net rx should be accounted to the task groups that correspond to the
> packets being received.  There are also vague complaints from Android[6].

These are pretty big holes in CPU cycle accounting right now and I think
spend-first-and-backcharge is the right solution for most of them given
experiences from other controllers. That said,

> Each use case has its own requirements[7].  In padata and reclaim, the task
> group to account to is known ahead of time, but net rx has to spend cycles
> processing a packet before its destination task group is known, so any solution
> should be able to work without knowing the task group in advance.  Furthermore,
> the CPU controller shouldn't throttle reclaim or net rx in real time since both
> are doing high priority work.  These make approaches that run kthreads directly
> in a task group, like cgroup-aware workqueues[8] or a kernel path for
> CLONE_INTO_CGROUP, infeasible.  Running kthreads directly in cgroups also has a
> downside for padata because helpers' MAX_NICE priority is "shadowed" by the
> priority of the group entities they're running under.
> 
> The proposed solution of remote charging can accrue debt to a task group to be
> paid off or forgiven later, addressing all these issues.  A kthread calls the
> interface
> 
>     void cpu_cgroup_remote_begin(struct task_struct *p,
>                                  struct cgroup_subsys_state *css);
> 
> to begin remote charging to @css, causing @p's current sum_exec_runtime to be
> updated and saved.  The @css arg isn't required and can be removed later to
> facilitate the unknown cgroup case mentioned above.  Then the kthread calls
> another interface
> 
>     void cpu_cgroup_remote_charge(struct task_struct *p,
>                                   struct cgroup_subsys_state *css);
> 
> to account the sum_exec_runtime that @p has used since the first call.
> Internally, a new field cfs_bandwidth::debt is added to keep track of unpaid
> debt that's only used when the debt exceeds the quota in the current period.
> 
> Weight-based control isn't implemented for now since padata helpers run at
> MAX_NICE and so always yield to anything higher priority, meaning they would
> rarely compete with other task groups.

If we're gonna do this, let's please do it right and make weight based
control work too. Otherwise, its usefulness is pretty limited.

Thanks.
Daniel Jordan Jan. 13, 2022, 9:08 p.m. UTC | #4
On Wed, Jan 12, 2022 at 10:18:16AM -1000, Tejun Heo wrote:
> Hello,

Hi, Tejun.

> On Tue, Jan 11, 2022 at 11:29:50AM -0500, Daniel Jordan wrote:
> ...
> > This problem arises with multithreaded jobs, but is also an issue in other
> > places.  CPU activity from async memory reclaim (kswapd, cswapd?[5]) should be
> > accounted to the cgroup that the memory belongs to, and similarly CPU activity
> > from net rx should be accounted to the task groups that correspond to the
> > packets being received.  There are also vague complaints from Android[6].
> 
> These are pretty big holes in CPU cycle accounting right now and I think
> spend-first-and-backcharge is the right solution for most of them given
> experiences from other controllers. That said,
> 
> > Each use case has its own requirements[7].  In padata and reclaim, the task
> > group to account to is known ahead of time, but net rx has to spend cycles
> > processing a packet before its destination task group is known, so any solution
> > should be able to work without knowing the task group in advance.  Furthermore,
> > the CPU controller shouldn't throttle reclaim or net rx in real time since both
> > are doing high priority work.  These make approaches that run kthreads directly
> > in a task group, like cgroup-aware workqueues[8] or a kernel path for
> > CLONE_INTO_CGROUP, infeasible.  Running kthreads directly in cgroups also has a
> > downside for padata because helpers' MAX_NICE priority is "shadowed" by the
> > priority of the group entities they're running under.
> > 
> > The proposed solution of remote charging can accrue debt to a task group to be
> > paid off or forgiven later, addressing all these issues.  A kthread calls the
> > interface
> > 
> >     void cpu_cgroup_remote_begin(struct task_struct *p,
> >                                  struct cgroup_subsys_state *css);
> > 
> > to begin remote charging to @css, causing @p's current sum_exec_runtime to be
> > updated and saved.  The @css arg isn't required and can be removed later to
> > facilitate the unknown cgroup case mentioned above.  Then the kthread calls
> > another interface
> > 
> >     void cpu_cgroup_remote_charge(struct task_struct *p,
> >                                   struct cgroup_subsys_state *css);
> > 
> > to account the sum_exec_runtime that @p has used since the first call.
> > Internally, a new field cfs_bandwidth::debt is added to keep track of unpaid
> > debt that's only used when the debt exceeds the quota in the current period.
> > 
> > Weight-based control isn't implemented for now since padata helpers run at
> > MAX_NICE and so always yield to anything higher priority, meaning they would
> > rarely compete with other task groups.
> 
> If we're gonna do this, let's please do it right and make weight based
> control work too. Otherwise, its usefulness is pretty limited.

Ok, understood.

Doing it as presented is an incremental step and all that's required for
this.  I figured weight could be added later with the first user that
actually needs it.

I did prototype weight too, though, just to see if it was all gonna work
together, so given how the discussion elsewhere in the thread is going,
I might respin the scheduler part of this with another use case and
weight-based control included.

I got this far, do the interface and CFS skeleton seem sane?  Both are
basically unchanged with weight-based control included, the weight parts
are just more code on top.

Thanks for looking.
Daniel Jordan Jan. 13, 2022, 9:11 p.m. UTC | #5
On Thu, Jan 13, 2022 at 04:08:57PM -0500, Daniel Jordan wrote:
> On Wed, Jan 12, 2022 at 10:18:16AM -1000, Tejun Heo wrote:
> > If we're gonna do this, let's please do it right and make weight based
> > control work too. Otherwise, its usefulness is pretty limited.
> 
> Ok, understood.
> 
> Doing it as presented is an incremental step and all that's required for
> this.  I figured weight could be added later with the first user that
> actually needs it.
> 
> I did prototype weight too, though, just to see if it was all gonna work
> together, so given how the discussion elsewhere in the thread is going,
> I might respin the scheduler part of this with another use case and
> weight-based control included.
> 
> I got this far, do the interface and CFS skeleton seem sane?  Both are

s/CFS/CFS bandwidth/

> basically unchanged with weight-based control included, the weight parts
> are just more code on top.
> 
> Thanks for looking.
Peter Zijlstra Jan. 14, 2022, 9:31 a.m. UTC | #6
On Wed, Jan 05, 2022 at 07:46:55PM -0500, Daniel Jordan wrote:
> As before, helpers in multithreaded jobs don't honor the main thread's
> CFS bandwidth limits, which could lead to the group exceeding its quota.
> 
> Fix it by having helpers remote charge their CPU time to the main
> thread's task group.  A helper calls a pair of new interfaces
> cpu_cgroup_remote_begin() and cpu_cgroup_remote_charge() (see function
> header comments) to achieve this.
> 
> This is just supposed to start a discussion, so it's pretty simple.
> Once a kthread has finished a remote charging period with
> cpu_cgroup_remote_charge(), its runtime is subtracted from the target
> task group's runtime (cfs_bandwidth::runtime) and any remainder is saved
> as debt (cfs_bandwidth::debt) to pay off in later periods.
> 
> Remote charging tasks aren't throttled when the group reaches its quota,
> and a task group doesn't run at all until its debt is completely paid,
> but these shortcomings can be addressed if the approach ends up being
> taken.

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 44c452072a1b..3c2d7f245c68 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4655,10 +4655,19 @@ static inline u64 sched_cfs_bandwidth_slice(void)
>   */
>  void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
>  {
> -	if (unlikely(cfs_b->quota == RUNTIME_INF))
> +	u64 quota = cfs_b->quota;
> +	u64 payment;
> +
> +	if (unlikely(quota == RUNTIME_INF))
>  		return;
>  
> -	cfs_b->runtime += cfs_b->quota;
> +	if (cfs_b->debt) {
> +		payment = min(quota, cfs_b->debt);
> +		cfs_b->debt -= payment;
> +		quota -= payment;
> +	}
> +
> +	cfs_b->runtime += quota;
>  	cfs_b->runtime = min(cfs_b->runtime, cfs_b->quota + cfs_b->burst);
>  }

It might be easier to make cfs_bandwidth::runtime an s64 and make it go
negative.

> @@ -5406,6 +5415,32 @@ static void __maybe_unused unthrottle_offline_cfs_rqs(struct rq *rq)
>  	rcu_read_unlock();
>  }
>  
> +static void incur_cfs_debt(struct rq *rq, struct sched_entity *se,
> +			   struct task_group *tg, u64 debt)
> +{
> +	if (!cfs_bandwidth_used())
> +		return;
> +
> +	while (tg != &root_task_group) {
> +		struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
> +
> +		if (cfs_rq->runtime_enabled) {
> +			struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
> +			u64 payment;
> +
> +			raw_spin_lock(&cfs_b->lock);
> +
> +			payment = min(cfs_b->runtime, debt);
> +			cfs_b->runtime -= payment;

At this point it might hit 0 (or go negative if/when you do the above)
and you'll need to throttle the group.

> +			cfs_b->debt += debt - payment;
> +
> +			raw_spin_unlock(&cfs_b->lock);
> +		}
> +
> +		tg = tg->parent;
> +	}
> +}

So part of the problem I have with this is that these external things
can consume all the bandwidth and basically indefinitely starve the
group.

This is doubly so if you're going to account things like softirq network
processing.

Also, why does the whole charging API have a task argument? It either is
current or NULL in case of things like softirq, neither really make
sense as an argument.

Also, by virtue of this being a start-stop annotation interface, the
accrued time might be arbitrarily large and arbitrarily delayed. I'm not
sure that's sensible.

For tasks it might be better to mark the task and have the tick DTRT
instead of later trying to 'migrate' the time.
Peter Zijlstra Jan. 14, 2022, 9:40 a.m. UTC | #7
On Fri, Jan 14, 2022 at 10:31:55AM +0100, Peter Zijlstra wrote:

> Also, by virtue of this being a start-stop annotation interface, the
> accrued time might be arbitrarily large and arbitrarily delayed. I'm not
> sure that's sensible.
> 
> For tasks it might be better to mark the task and have the tick DTRT
> instead of later trying to 'migrate' the time.

Which is then very close to simply sticking the task into the right
cgroup for a limited duration.

You could do a special case sched_move_task(), that takes a css argument
instead of using the current task_css. Then for cgroups it looks like
nothing changes, but the scheduler will DTRT and act like it is in the
target cgroup. Then at the end, simply move it back to task_css.

This obviously doesn't work for SoftIRQ accounting, but that is
'special' anyway. Softirq stuff is not otherwise under scheduler
control and has preemption disabled.
Tejun Heo Jan. 14, 2022, 4:30 p.m. UTC | #8
Hello,

On Fri, Jan 14, 2022 at 10:31:55AM +0100, Peter Zijlstra wrote:
> So part of the problem I have with this is that these external things
> can consume all the bandwidth and basically indefinitely starve the
> group.
> 
> This is doubly so if you're going to account things like softirq network
> processing.

So, anything which is accounted this way should slow down / stall the
originator so that backcharges can't run away. In many cases, these
connections are already there - e.g. if you keep charging socket rx buffers
to a cgroup, the processes in the cgroup will slow down and the backpressure
on the network socket will slow down the incoming packets. Sometimes, the
backpressure propagation needs to be added explicitly - e.g. filesystem
metadata writes can run away because something can keep on issuing
metadata updates without getting throttled on either the memory or io side. So,
for situations like that, we need to add an explicit mechanism to throttle
the originator asynchronously.

Thanks.
Tejun Heo Jan. 14, 2022, 4:38 p.m. UTC | #9
Hello,

On Fri, Jan 14, 2022 at 10:40:06AM +0100, Peter Zijlstra wrote:
> You could do a special case sched_move_task(), that takes a css argument
> instead of using the current task_css. Then for cgroups it looks like
> nothing changes, but the scheduler will DTRT and act like it is in the
> target cgroup. Then at the end, simply move it back to task_css.
> 
> This obviously doesn't work for SoftIRQ accounting, but that is
> 'special' anyway. Softirq stuff is not otherwise under scheduler
> control and has preemption disabled.

So, if this particular use case doesn't fit the backcharge model (I'm not
sure yet), I'd much prefer it to maintain dynamic per-cgroup helper threads
rather than move tasks around dynamically. Nothing else is using migration
this way and we don't even need migration for seeding cgroups w/
CLONE_INTO_CGROUP.
In the future, this should allow further optimizations and likely
simplifications. It'd suck to have an odd exception usage.

Thanks.
Daniel Jordan Jan. 18, 2022, 5:32 p.m. UTC | #10
On Fri, Jan 14, 2022 at 10:31:55AM +0100, Peter Zijlstra wrote:
> On Wed, Jan 05, 2022 at 07:46:55PM -0500, Daniel Jordan wrote:
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 44c452072a1b..3c2d7f245c68 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4655,10 +4655,19 @@ static inline u64 sched_cfs_bandwidth_slice(void)
> >   */
> >  void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
> >  {
> > -	if (unlikely(cfs_b->quota == RUNTIME_INF))
> > +	u64 quota = cfs_b->quota;
> > +	u64 payment;
> > +
> > +	if (unlikely(quota == RUNTIME_INF))
> >  		return;
> >  
> > -	cfs_b->runtime += cfs_b->quota;
> > +	if (cfs_b->debt) {
> > +		payment = min(quota, cfs_b->debt);
> > +		cfs_b->debt -= payment;
> > +		quota -= payment;
> > +	}
> > +
> > +	cfs_b->runtime += quota;
> >  	cfs_b->runtime = min(cfs_b->runtime, cfs_b->quota + cfs_b->burst);
> >  }
> 
> It might be easier to make cfs_bandwidth::runtime an s64 and make it go
> negative.

Yep, nice, no need for a new field in cfs_bandwidth.
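
Something like this, maybe (untested sketch, with cfs_bandwidth::runtime made
s64 so that leftover debt is just negative runtime that the refill pays off
first):

    void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
    {
            if (unlikely(cfs_b->quota == RUNTIME_INF))
                    return;

            /* a negative runtime left by remote charges soaks up new quota */
            cfs_b->runtime += cfs_b->quota;
            cfs_b->runtime = min_t(s64, cfs_b->runtime,
                                   cfs_b->quota + cfs_b->burst);
    }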

> > @@ -5406,6 +5415,32 @@ static void __maybe_unused unthrottle_offline_cfs_rqs(struct rq *rq)
> >  	rcu_read_unlock();
> >  }
> >  
> > +static void incur_cfs_debt(struct rq *rq, struct sched_entity *se,
> > +			   struct task_group *tg, u64 debt)
> > +{
> > +	if (!cfs_bandwidth_used())
> > +		return;
> > +
> > +	while (tg != &root_task_group) {
> > +		struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
> > +
> > +		if (cfs_rq->runtime_enabled) {
> > +			struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
> > +			u64 payment;
> > +
> > +			raw_spin_lock(&cfs_b->lock);
> > +
> > +			payment = min(cfs_b->runtime, debt);
> > +			cfs_b->runtime -= payment;
> 
> At this point it might hit 0 (or go negative if/when you do the above)
> and you'll need to throttle the group.

I might not be following you, but there could be cfs_rq's with local
runtime_remaining, so even if it goes 0 or negative, the group might
still have quota left and so shouldn't be throttled right away.

I was thinking the throttling would happen as normal, when a cfs_rq ran
out of runtime_remaining and failed to refill it from
cfs_bandwidth::runtime.

> > +			cfs_b->debt += debt - payment;
> > +
> > +			raw_spin_unlock(&cfs_b->lock);
> > +		}
> > +
> > +		tg = tg->parent;
> > +	}
> > +}
> 
> So part of the problem I have with this is that these external things
> can consume all the bandwidth and basically indefinitely starve the
> group.
> 
> > This is doubly so if you're going to account things like softirq network
> processing.

Yes.  As Tejun points out, I'll make sure remote charging doesn't run
away.

> Also, why does the whole charging API have a task argument? It either is
> current or NULL in case of things like softirq, neither really make
> sense as an argument.

@task distinguishes between NULL for softirq and current for everybody
else.

It's possible to detect this difference internally though, if that's
what you're saying, so @task can go away.
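
For example, inside the charging functions (assuming in_serving_softirq() is
the right check to make there):

    struct task_struct *p = in_serving_softirq() ? NULL : current;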

> Also, by virtue of this being a start-stop annotation interface, the
> accrued time might be arbitrarily large and arbitrarily delayed. I'm not
> sure that's sensible.

Yes, that is a risk.  With start-stop, users need to be careful to
account often enough and have a "reasonable" upper bound on period
length, however that's defined.  Multithreaded jobs are probably the
worst offender since these threads charge a sizable amount at once
compared to the other use cases.

> For tasks it might be better to mark the task and have the tick DTRT
> instead of later trying to 'migrate' the time.

Ok, I'll try that.  The start-stop approach keeps remote charging from
adding overhead in the tick for non-remote-charging things, far and away
the common case, but I'll see how expensive the tick-based approach is.

Can hide it behind a static branch for systems not using the cpu
controller.
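
Roughly (sketch only; the key and helper names here are made up):

    DEFINE_STATIC_KEY_FALSE(sched_remote_charge_used);

    /* in the tick path */
    if (static_branch_unlikely(&sched_remote_charge_used))
            account_remote_charge(curr);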
Daniel Jordan Jan. 18, 2022, 5:40 p.m. UTC | #11
On Fri, Jan 14, 2022 at 10:40:06AM +0100, Peter Zijlstra wrote:
> On Fri, Jan 14, 2022 at 10:31:55AM +0100, Peter Zijlstra wrote:
> 
> > Also, by virtue of this being a start-stop annotation interface, the
> > accrued time might be arbitrarily large and arbitrarily delayed. I'm not
> > sure that's sensible.
> > 
> > For tasks it might be better to mark the task and have the tick DTRT
> > instead of later trying to 'migrate' the time.
> 
> Which is then very close to simply sticking the task into the right
> cgroup for a limited duration.
> 
> You could do a special case sched_move_task(), that takes a css argument
> instead of using the current task_css. Then for cgroups it looks like
> nothing changes, but the scheduler will DTRT and act like it is in the
> target cgroup. Then at the end, simply move it back to task_css.

Yes, that's one of the things I tried.  Less new code in the scheduler
this way.
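
Something of this shape on the padata side (just a sketch; the css-taking
variant of sched_move_task() is hypothetical):

    old_css = task_get_css(current, cpu_cgrp_id);
    sched_move_task_css(current, target_css);   /* run as if in the target group */
    ret = job->thread_fn(position, end, job->fn_arg);
    sched_move_task_css(current, old_css);      /* move back when finished */
    css_put(old_css);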

> This obviously doesn't work for SoftIRQ accounting, but that is
> 'special' anyway. Softirq stuff is not otherwise under scheduler
> control and has preemption disabled.

It also doesn't work for memory reclaim since that shouldn't be
throttled in real time.  Reclaim and softirq seem to demand something
external, something that doesn't move tasks onto runqueues, so they avoid
getting pushed off the CPU.

Patch

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ec8d07d88641..cc04367d4458 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -487,6 +487,8 @@  struct sched_entity {
 	struct cfs_rq			*my_q;
 	/* cached value of my_q->h_nr_running */
 	unsigned long			runnable_weight;
+	/* sum_exec_runtime at the start of the remote charging period */
+	u64				remote_runtime_begin;
 #endif
 
 #ifdef CONFIG_SMP
diff --git a/include/linux/sched/cgroup.h b/include/linux/sched/cgroup.h
index f89d92e9e015..cb3b7941149f 100644
--- a/include/linux/sched/cgroup.h
+++ b/include/linux/sched/cgroup.h
@@ -5,6 +5,22 @@ 
 #include <linux/cgroup-defs.h>
 #include <linux/cpumask.h>
 
+#ifdef CONFIG_FAIR_GROUP_SCHED
+
+void cpu_cgroup_remote_begin(struct task_struct *p,
+			     struct cgroup_subsys_state *css);
+void cpu_cgroup_remote_charge(struct task_struct *p,
+			      struct cgroup_subsys_state *css);
+
+#else /* CONFIG_FAIR_GROUP_SCHED */
+
+static inline void cpu_cgroup_remote_begin(struct task_struct *p,
+					   struct cgroup_subsys_state *css) {}
+static inline void cpu_cgroup_remote_charge(struct task_struct *p,
+					    struct cgroup_subsys_state *css) {}
+
+#endif /* CONFIG_FAIR_GROUP_SCHED */
+
 #ifdef CONFIG_CFS_BANDWIDTH
 
 int max_cfs_bandwidth_cpus(struct cgroup_subsys_state *css);
diff --git a/kernel/padata.c b/kernel/padata.c
index 52f670a5d6d9..d595f11c2fdd 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -43,6 +43,7 @@ 
 enum padata_work_flags {
 	PADATA_WORK_FINISHED	= 1,
 	PADATA_WORK_UNDO	= 2,
+	PADATA_WORK_MAIN_THR	= 4,
 };
 
 struct padata_work {
@@ -75,6 +76,7 @@  struct padata_mt_job_state {
 #ifdef CONFIG_LOCKDEP
 	struct lockdep_map	lockdep_map;
 #endif
+	struct cgroup_subsys_state *cpu_css;
 };
 
 static void padata_free_pd(struct parallel_data *pd);
@@ -495,6 +497,10 @@  static int padata_mt_helper(void *__pw)
 	struct padata_work *pw = __pw;
 	struct padata_mt_job_state *ps = pw->pw_data;
 	struct padata_mt_job *job = ps->job;
+	bool is_main = pw->pw_flags & PADATA_WORK_MAIN_THR;
+
+	if (!is_main)
+		cpu_cgroup_remote_begin(current, ps->cpu_css);
 
 	spin_lock(&ps->lock);
 
@@ -518,6 +524,10 @@  static int padata_mt_helper(void *__pw)
 		ret = job->thread_fn(position, end, job->fn_arg);
 
 		lock_map_release(&ps->lockdep_map);
+
+		if (!is_main)
+			cpu_cgroup_remote_charge(current, ps->cpu_css);
+
 		spin_lock(&ps->lock);
 
 		if (ret) {
@@ -612,7 +622,6 @@  int padata_do_multithreaded_job(struct padata_mt_job *job,
 {
 	/* In case threads finish at different times. */
 	static const unsigned long load_balance_factor = 4;
-	struct cgroup_subsys_state *cpu_css;
 	struct padata_work my_work, *pw;
 	struct padata_mt_job_state ps;
 	LIST_HEAD(unfinished_works);
@@ -628,18 +637,20 @@  int padata_do_multithreaded_job(struct padata_mt_job *job,
 	req = min(req, current->nr_cpus_allowed);
 
 #ifdef CONFIG_CGROUP_SCHED
+	ps.cpu_css = task_get_css(current, cpu_cgrp_id);
+
 	/*
 	 * Cap threads at the max number of CPUs current's CFS bandwidth
 	 * settings allow.  Keep it simple, don't try to keep this value up to
 	 * date.  The ifdef guards cpu_cgrp_id.
 	 */
-	rcu_read_lock();
-	cpu_css = task_css(current, cpu_cgrp_id);
-	req = min(req, max_cfs_bandwidth_cpus(cpu_css));
-	rcu_read_unlock();
+	req = min(req, max_cfs_bandwidth_cpus(ps.cpu_css));
 #endif
 
 	if (req == 1) {
+#ifdef CONFIG_CGROUP_SCHED
+		css_put(ps.cpu_css);
+#endif
 		/* Single thread, no coordination needed, cut to the chase. */
 		return job->thread_fn(job->start, job->start + job->size,
 				      job->fn_arg);
@@ -687,12 +698,15 @@  int padata_do_multithreaded_job(struct padata_mt_job *job,
 
 	/* Use the current task, which saves starting a kthread. */
 	my_work.pw_data = &ps;
-	my_work.pw_flags = 0;
+	my_work.pw_flags = PADATA_WORK_MAIN_THR;
 	INIT_LIST_HEAD(&my_work.pw_list);
 	padata_mt_helper(&my_work);
 
 	/* Wait for all the helpers to finish. */
 	padata_wait_for_helpers(&ps, &unfinished_works, &finished_works);
+#ifdef CONFIG_CGROUP_SCHED
+	css_put(ps.cpu_css);
+#endif
 
 	if (ps.error && job->undo_fn)
 		padata_undo(&ps, &finished_works, &my_work);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 848c9fec8006..a5e24b6bd7e0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9913,6 +9913,7 @@  static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota,
 	cfs_b->period = ns_to_ktime(period);
 	cfs_b->quota = quota;
 	cfs_b->burst = burst;
+	cfs_b->debt = 0;
 
 	__refill_cfs_bandwidth_runtime(cfs_b);
 
@@ -10181,6 +10182,44 @@  static int cpu_cfs_stat_show(struct seq_file *sf, void *v)
 	return 0;
 }
 #endif /* CONFIG_CFS_BANDWIDTH */
+
+/**
+ * cpu_cgroup_remote_begin - begin charging p's CPU usage to a remote css
+ * @p: the kernel thread whose CPU usage should be accounted
+ * @css: the css to which the CPU usage should be accounted
+ *
+ * Begin charging a kernel thread's CPU usage to a remote (non-root) task group
+ * to account CPU time that the kernel thread spends working on behalf of the
+ * group.  Pair with at least one subsequent call to cpu_cgroup_remote_charge()
+ * to complete the charge.
+ *
+ * Supports CFS bandwidth and cgroup2 CPU accounting stats but not weight-based
+ * control for now.
+ */
+void cpu_cgroup_remote_begin(struct task_struct *p,
+			     struct cgroup_subsys_state *css)
+{
+	if (p->sched_class == &fair_sched_class)
+		cpu_cgroup_remote_begin_fair(p, css_tg(css));
+}
+
+/**
+ * cpu_cgroup_remote_charge - account p's CPU usage to a remote css
+ * @p: the kernel thread whose CPU usage should be accounted
+ * @css: the css to which the CPU usage should be accounted
+ *
+ * Account a kernel thread's CPU usage to a remote (non-root) task group.  Pair
+ * with a previous call to cpu_cgroup_remote_begin() with the same @p and @css.
+ * This may be invoked multiple times after the initial
+ * cpu_cgroup_remote_begin() to account additional CPU usage.
+ */
+void cpu_cgroup_remote_charge(struct task_struct *p,
+			      struct cgroup_subsys_state *css)
+{
+	if (p->sched_class == &fair_sched_class)
+		cpu_cgroup_remote_charge_fair(p, css_tg(css));
+}
+
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
 #ifdef CONFIG_RT_GROUP_SCHED
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 44c452072a1b..3c2d7f245c68 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4655,10 +4655,19 @@  static inline u64 sched_cfs_bandwidth_slice(void)
  */
 void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
 {
-	if (unlikely(cfs_b->quota == RUNTIME_INF))
+	u64 quota = cfs_b->quota;
+	u64 payment;
+
+	if (unlikely(quota == RUNTIME_INF))
 		return;
 
-	cfs_b->runtime += cfs_b->quota;
+	if (cfs_b->debt) {
+		payment = min(quota, cfs_b->debt);
+		cfs_b->debt -= payment;
+		quota -= payment;
+	}
+
+	cfs_b->runtime += quota;
 	cfs_b->runtime = min(cfs_b->runtime, cfs_b->quota + cfs_b->burst);
 }
 
@@ -5406,6 +5415,32 @@  static void __maybe_unused unthrottle_offline_cfs_rqs(struct rq *rq)
 	rcu_read_unlock();
 }
 
+static void incur_cfs_debt(struct rq *rq, struct sched_entity *se,
+			   struct task_group *tg, u64 debt)
+{
+	if (!cfs_bandwidth_used())
+		return;
+
+	while (tg != &root_task_group) {
+		struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
+
+		if (cfs_rq->runtime_enabled) {
+			struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
+			u64 payment;
+
+			raw_spin_lock(&cfs_b->lock);
+
+			payment = min(cfs_b->runtime, debt);
+			cfs_b->runtime -= payment;
+			cfs_b->debt += debt - payment;
+
+			raw_spin_unlock(&cfs_b->lock);
+		}
+
+		tg = tg->parent;
+	}
+}
+
 #else /* CONFIG_CFS_BANDWIDTH */
 
 static inline bool cfs_bandwidth_used(void)
@@ -5448,6 +5483,8 @@  static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
 static inline void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
 static inline void update_runtime_enabled(struct rq *rq) {}
 static inline void unthrottle_offline_cfs_rqs(struct rq *rq) {}
+static inline void incur_cfs_debt(struct rq *rq, struct sched_entity *se,
+				  struct task_group *tg, u64 debt) {}
 
 #endif /* CONFIG_CFS_BANDWIDTH */
 
@@ -11452,6 +11489,59 @@  int sched_group_set_shares(struct task_group *tg, unsigned long shares)
 	mutex_unlock(&shares_mutex);
 	return 0;
 }
+
+#define INCUR_DEBT		1
+
+static void cpu_cgroup_remote(struct task_struct *p, struct task_group *tg,
+			      int flags)
+{
+	struct sched_entity *se = &p->se;
+	struct cfs_rq *cfs_rq;
+	struct rq_flags rf;
+	struct rq *rq;
+
+	/*
+	 * User tasks might change task groups between calls to this function,
+	 * which isn't handled for now, so disallow them.
+	 */
+	if (!(p->flags & PF_KTHREAD))
+		return;
+
+	/* kthreads already run in the root, so no need for remote charging. */
+	if (tg == &root_task_group)
+		return;
+
+	rq = task_rq_lock(p, &rf);
+	update_rq_clock(rq);
+
+	cfs_rq = cfs_rq_of(se);
+	update_curr(cfs_rq);
+
+	if (flags & INCUR_DEBT) {
+		u64 debt = se->sum_exec_runtime - se->remote_runtime_begin;
+
+		if (unlikely((s64)debt <= 0))
+			goto out;
+
+		incur_cfs_debt(rq, se, tg, debt);
+	}
+
+out:
+	se->remote_runtime_begin = se->sum_exec_runtime;
+
+	task_rq_unlock(rq, p, &rf);
+}
+
+void cpu_cgroup_remote_begin_fair(struct task_struct *p, struct task_group *tg)
+{
+	cpu_cgroup_remote(p, tg, 0);
+}
+
+void cpu_cgroup_remote_charge_fair(struct task_struct *p, struct task_group *tg)
+{
+	cpu_cgroup_remote(p, tg, INCUR_DEBT);
+}
+
 #else /* CONFIG_FAIR_GROUP_SCHED */
 
 void free_fair_sched_group(struct task_group *tg) { }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ddefb0419d7a..75dd6f89e295 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -367,6 +367,7 @@  struct cfs_bandwidth {
 	u64			quota;
 	u64			runtime;
 	u64			burst;
+	u64			debt;
 	s64			hierarchical_quota;
 
 	u8			idle;
@@ -472,6 +473,10 @@  extern void free_fair_sched_group(struct task_group *tg);
 extern int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent);
 extern void online_fair_sched_group(struct task_group *tg);
 extern void unregister_fair_sched_group(struct task_group *tg);
+extern void cpu_cgroup_remote_begin_fair(struct task_struct *p,
+					 struct task_group *tg);
+extern void cpu_cgroup_remote_charge_fair(struct task_struct *p,
+					  struct task_group *tg);
 extern void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
 			struct sched_entity *se, int cpu,
 			struct sched_entity *parent);