sched: Don't try to catch up excess steal time.

Message ID 20240806111157.1336532-1-suleiman@google.com (mailing list archive)
State New, archived
Series sched: Don't try to catch up excess steal time.

Commit Message

Suleiman Souhlal Aug. 6, 2024, 11:11 a.m. UTC
When steal time exceeds the measured delta while updating clock_task, we
currently try to catch up the excess in future updates.
However, this results in inaccurate run times for future clock_task
measurements, as they end up being charged additional steal time that did
not actually happen, from the previous excess steal time being paid back.

For example, suppose a task in a VM runs for 10ms and has 15ms of steal
time reported while it runs. clock_task rightly doesn't advance. Then, a
different task runs on the same rq for 10ms without any time stolen.
Because of the current catch-up mechanism, clock_task inaccurately ends
up advancing by only 5ms instead of 10ms, even though no time was
actually stolen during that interval. The second task is charged for less
time than it ran, even though it did nothing to deserve the discount.
In other words, tasks can end up getting more run time than they should
actually get.

So, instead, don't make future updates pay back past excess stolen time.

Signed-off-by: Suleiman Souhlal <suleiman@google.com>
---
 kernel/sched/core.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)
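
To make the example in the commit message concrete, here is a minimal
user-space sketch (an editorial illustration, not part of the patch) that
replays the two 10ms ticks under the old and the new accounting. The
function and variable names are made up for the illustration; values are
in milliseconds rather than the nanoseconds the kernel actually uses.

#include <stdio.h>

typedef unsigned long long u64;

static u64 prev_steal_time_rq;

/*
 * Old behaviour: only the clamped steal is recorded, so the excess
 * (steal beyond delta) stays "owed" and is paid back on later ticks.
 */
static u64 old_clock_advance(u64 delta, u64 steal_clock)
{
	u64 steal = steal_clock - prev_steal_time_rq;

	if (steal > delta)
		steal = delta;

	prev_steal_time_rq += steal;
	return delta - steal;		/* how much clock_task advances */
}

/*
 * Patched behaviour: prev_steal_time_rq always catches up to the raw
 * steal clock, so no excess is carried into future ticks.
 */
static u64 new_clock_advance(u64 delta, u64 steal_clock)
{
	u64 steal = steal_clock - prev_steal_time_rq;

	if (steal > delta)
		steal = delta;

	prev_steal_time_rq = steal_clock;
	return delta - steal;
}

int main(void)
{
	u64 t1, t2;

	/*
	 * Tick 1: a task runs 10ms with 15ms of reported steal.
	 * Tick 2: another task runs 10ms with no further steal.
	 */
	prev_steal_time_rq = 0;
	t1 = old_clock_advance(10, 15);
	t2 = old_clock_advance(10, 15);
	printf("old: tick1 +%llums, tick2 +%llums\n", t1, t2);

	prev_steal_time_rq = 0;
	t1 = new_clock_advance(10, 15);
	t2 = new_clock_advance(10, 15);
	printf("new: tick1 +%llums, tick2 +%llums\n", t1, t2);
	return 0;
}

Run, this prints "old: tick1 +0ms, tick2 +5ms" versus "new: tick1 +0ms,
tick2 +10ms", matching the 5ms-vs-10ms discrepancy described above.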

Comments

Joel Fernandes Aug. 6, 2024, 10:51 p.m. UTC | #1
On Tue, Aug 6, 2024 at 7:13 AM Suleiman Souhlal <suleiman@google.com> wrote:
>
> When steal time exceeds the measured delta when updating clock_task, we
> currently try to catch up the excess in future updates.
> However, this results in inaccurate run times for the future clock_task
> measurements, as they end up getting additional steal time that did not
> actually happen, from the previous excess steal time being paid back.
>
> For example, suppose a task in a VM runs for 10ms and had 15ms of steal
> time reported while it ran. clock_task rightly doesn't advance. Then, a
> different task runs on the same rq for 10ms without any time stolen.
> Because of the current catch up mechanism, clock_sched inaccurately ends
> up advancing by only 5ms instead of 10ms even though there wasn't any
> actual time stolen. The second task is getting charged for less time
> than it ran, even though it didn't deserve it.
> In other words, tasks can end up getting more run time than they should
> actually get.
>
> So, we instead don't make future updates pay back past excess stolen time.
>
> Signed-off-by: Suleiman Souhlal <suleiman@google.com>
> ---
>  kernel/sched/core.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index bcf2c4cc0522..42b37da2bda6 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -728,13 +728,15 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
>  #endif
>  #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
>         if (static_key_false((&paravirt_steal_rq_enabled))) {
> -               steal = paravirt_steal_clock(cpu_of(rq));
> +               u64 prev_steal;
> +
> +               steal = prev_steal = paravirt_steal_clock(cpu_of(rq));
>                 steal -= rq->prev_steal_time_rq;
>
>                 if (unlikely(steal > delta))
>                         steal = delta;
>
> -               rq->prev_steal_time_rq += steal;
> +               rq->prev_steal_time_rq = prev_steal;
>                 delta -= steal;

Makes sense, but wouldn't this patch also do the following: if the vCPU
task is the only one running and has a large steal time, then
sched_tick() will only freeze the clock for a shorter period, and not
give future credits to the vCPU task itself? Maybe it does not matter
(and I probably don't understand the code well enough), but I thought I
would mention it.

I am also not sure whether the purpose of steal time is to credit
individual tasks, or rather all tasks on the runqueue, because the "whole
runqueue" had time stolen. Nowhere in this function are we dealing with
individual tasks; it only deals with the rq itself.

Thoughts?

 - Joel



>         }
>  #endif
> --
> 2.46.0.rc2.264.g509ed76dc8-goog
>
Suleiman Souhlal Aug. 7, 2024, 12:08 a.m. UTC | #2
On Tue, Aug 06, 2024 at 06:51:36PM -0400, Joel Fernandes wrote:
> On Tue, Aug 6, 2024 at 7:13 AM Suleiman Souhlal <suleiman@google.com> wrote:
> >
> > When steal time exceeds the measured delta when updating clock_task, we
> > currently try to catch up the excess in future updates.
> > However, this results in inaccurate run times for the future clock_task
> > measurements, as they end up getting additional steal time that did not
> > actually happen, from the previous excess steal time being paid back.
> >
> > For example, suppose a task in a VM runs for 10ms and had 15ms of steal
> > time reported while it ran. clock_task rightly doesn't advance. Then, a
> > different task runs on the same rq for 10ms without any time stolen.
> > Because of the current catch up mechanism, clock_sched inaccurately ends
> > up advancing by only 5ms instead of 10ms even though there wasn't any
> > actual time stolen. The second task is getting charged for less time
> > than it ran, even though it didn't deserve it.
> > In other words, tasks can end up getting more run time than they should
> > actually get.
> >
> > So, we instead don't make future updates pay back past excess stolen time.
> >
> > Signed-off-by: Suleiman Souhlal <suleiman@google.com>
> > ---
> >  kernel/sched/core.c | 6 ++++--
> >  1 file changed, 4 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index bcf2c4cc0522..42b37da2bda6 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -728,13 +728,15 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
> >  #endif
> >  #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
> >         if (static_key_false((&paravirt_steal_rq_enabled))) {
> > -               steal = paravirt_steal_clock(cpu_of(rq));
> > +               u64 prev_steal;
> > +
> > +               steal = prev_steal = paravirt_steal_clock(cpu_of(rq));
> >                 steal -= rq->prev_steal_time_rq;
> >
> >                 if (unlikely(steal > delta))
> >                         steal = delta;
> >
> > -               rq->prev_steal_time_rq += steal;
> > +               rq->prev_steal_time_rq = prev_steal;
> >                 delta -= steal;
> 
> Makes sense, but wouldn't this patch also do the following: If vCPU
> task is the only one running and has a large steal time, then
> sched_tick() will only freeze the clock for a shorter period, and not
> give future credits to the vCPU task itself?  Maybe it does not matter
> (and I probably don't understand the code enough) but thought I would
> mention.

The patch should still be doing the right thing in that situation:
the clock will be frozen for the whole duration that the task ran, and
delta will be 0.
The current excess amount is not relevant to the future, as far as I can
tell.
The pre-patch code gives the rq extra time that it never measured, and
I don't really see why it should be getting that extra time.

> 
> I am also not sure if the purpose of stealtime is to credit individual
> tasks, or rather all tasks on the runqueue because the "whole
> runqueue" had time stolen.. No where in this function is it dealing
> with individual tasks but rather the rq itself.

This function is used to update clock_task, which *is* relevant to
individual tasks. It is used to calculate how long tasks ran for (and
for load averages).

-- Suleiman
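
For readers less familiar with this path, here is a rough paraphrase of
what update_curr() in kernel/sched/fair.c does with clock_task (a
simplified editorial sketch, not verbatim kernel code): the current task
is charged only for the advance of rq->clock_task, so any steal time
subtracted in update_rq_clock_task() directly shrinks the run time
charged to whichever task is current.

	u64 now = rq_clock_task(rq);		/* clock_task, after steal is subtracted */
	s64 delta_exec = now - curr->exec_start;

	if (delta_exec <= 0)
		return;

	curr->exec_start = now;
	curr->sum_exec_runtime += delta_exec;	/* the task's accounted run time */

So if clock_task is held back by paid-back steal that never actually
happened, delta_exec shrinks for whichever task happens to be running at
that moment, which is the inaccuracy the patch removes.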
Suleiman Souhlal Aug. 20, 2024, 1:46 a.m. UTC | #3
On Tue, Aug 6, 2024 at 8:13 PM Suleiman Souhlal <suleiman@google.com> wrote:
>
> When steal time exceeds the measured delta when updating clock_task, we
> currently try to catch up the excess in future updates.
> However, this results in inaccurate run times for the future clock_task
> measurements, as they end up getting additional steal time that did not
> actually happen, from the previous excess steal time being paid back.
>
> For example, suppose a task in a VM runs for 10ms and had 15ms of steal
> time reported while it ran. clock_task rightly doesn't advance. Then, a
> different task runs on the same rq for 10ms without any time stolen.
> Because of the current catch up mechanism, clock_sched inaccurately ends
> up advancing by only 5ms instead of 10ms even though there wasn't any
> actual time stolen. The second task is getting charged for less time
> than it ran, even though it didn't deserve it.
> In other words, tasks can end up getting more run time than they should
> actually get.
>
> So, we instead don't make future updates pay back past excess stolen time.
>
> Signed-off-by: Suleiman Souhlal <suleiman@google.com>

Gentle ping.

-- Suleiman
Srikar Dronamraju Aug. 20, 2024, 9:45 a.m. UTC | #4
* Suleiman Souhlal <suleiman@google.com> [2024-08-06 20:11:57]:

> When steal time exceeds the measured delta when updating clock_task, we
> currently try to catch up the excess in future updates.
> However, this results in inaccurate run times for the future clock_task
> measurements, as they end up getting additional steal time that did not
> actually happen, from the previous excess steal time being paid back.
> 
> For example, suppose a task in a VM runs for 10ms and had 15ms of steal
> time reported while it ran. clock_task rightly doesn't advance. Then, a
> different task runs on the same rq for 10ms without any time stolen.
> Because of the current catch up mechanism, clock_sched inaccurately ends
> up advancing by only 5ms instead of 10ms even though there wasn't any
> actual time stolen. The second task is getting charged for less time
> than it ran, even though it didn't deserve it.
> In other words, tasks can end up getting more run time than they should
> actually get.
> 
> So, we instead don't make future updates pay back past excess stolen time.
> 
> Signed-off-by: Suleiman Souhlal <suleiman@google.com>
> ---
>  kernel/sched/core.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index bcf2c4cc0522..42b37da2bda6 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -728,13 +728,15 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
>  #endif
>  #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
>  	if (static_key_false((&paravirt_steal_rq_enabled))) {
> -		steal = paravirt_steal_clock(cpu_of(rq));
> +		u64 prev_steal;
> +
> +		steal = prev_steal = paravirt_steal_clock(cpu_of(rq));
>  		steal -= rq->prev_steal_time_rq;
>  
>  		if (unlikely(steal > delta))
>  			steal = delta;
>  
> -		rq->prev_steal_time_rq += steal;
> +		rq->prev_steal_time_rq = prev_steal;
>  		delta -= steal;
>  	}
>  #endif


Agree with the change.

We could probably have achieved the same thing by just moving one line up.
Something like this?

#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
	if (static_key_false((&paravirt_steal_rq_enabled))) {
		steal = paravirt_steal_clock(cpu_of(rq));
		steal -= rq->prev_steal_time_rq;
		rq->prev_steal_time_rq += steal;

		if (unlikely(steal > delta))
			steal = delta;

		delta -= steal;
	}
#endif
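
A quick check with the numbers from the commit message (prev_steal_time_rq
= 0, paravirt steal clock = 15ms, delta = 10ms) shows the reordered version
does the same bookkeeping as the patch:

	steal = 15 - 0 = 15;
	rq->prev_steal_time_rq += 15;	/* 0 + 15 == 15, same as "= prev_steal" */
	steal = min(15, 10) = 10;
	delta = 10 - 10 = 0;		/* clock_task doesn't advance, no excess owed */

Either way prev_steal_time_rq ends up holding the raw paravirt clock value,
so the temporary prev_steal variable is no longer needed.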
Steven Rostedt Aug. 20, 2024, 2:50 p.m. UTC | #5
On Tue, 20 Aug 2024 15:15:55 +0530
Srikar Dronamraju <srikar@linux.ibm.com> wrote:

> * Suleiman Souhlal <suleiman@google.com> [2024-08-06 20:11:57]:
> 
> > When steal time exceeds the measured delta when updating clock_task, we
> > currently try to catch up the excess in future updates.
> > However, this results in inaccurate run times for the future clock_task
> > measurements, as they end up getting additional steal time that did not
> > actually happen, from the previous excess steal time being paid back.
> > 
> > For example, suppose a task in a VM runs for 10ms and had 15ms of steal
> > time reported while it ran. clock_task rightly doesn't advance. Then, a
> > different task runs on the same rq for 10ms without any time stolen.
> > Because of the current catch up mechanism, clock_sched inaccurately ends
> > up advancing by only 5ms instead of 10ms even though there wasn't any
> > actual time stolen. The second task is getting charged for less time
> > than it ran, even though it didn't deserve it.
> > In other words, tasks can end up getting more run time than they should
> > actually get.
> > 
> > So, we instead don't make future updates pay back past excess stolen time.

In other words, if one task had more time stolen from it than it ran,
the excess time is deducted from the next task even though that task ran
for its entire slot?

I'm curious, how does a task get queued on the run queue if 100% of its
time was stolen? That is, how did it get queued if the vCPU wasn't running?


> > 
> > Signed-off-by: Suleiman Souhlal <suleiman@google.com>
> > ---
> >  kernel/sched/core.c | 6 ++++--
> >  1 file changed, 4 insertions(+), 2 deletions(-)
> > 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index bcf2c4cc0522..42b37da2bda6 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -728,13 +728,15 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
> >  #endif
> >  #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
> >  	if (static_key_false((&paravirt_steal_rq_enabled))) {
> > -		steal = paravirt_steal_clock(cpu_of(rq));
> > +		u64 prev_steal;
> > +
> > +		steal = prev_steal = paravirt_steal_clock(cpu_of(rq));
> >  		steal -= rq->prev_steal_time_rq;
> >  
> >  		if (unlikely(steal > delta))
> >  			steal = delta;
> >  
> > -		rq->prev_steal_time_rq += steal;
> > +		rq->prev_steal_time_rq = prev_steal;
> >  		delta -= steal;
> >  	}
> >  #endif  
> 
> 
> Agree with the change.
> 
> Probably, we could have achieved by just moving a line above
> Something like this?
> 
> #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
> 	if (static_key_false((&paravirt_steal_rq_enabled))) {
> 		steal = paravirt_steal_clock(cpu_of(rq));
> 		steal -= rq->prev_steal_time_rq;
> 		rq->prev_steal_time_rq += steal;
> 
> 		if (unlikely(steal > delta))
> 			steal = delta;
> 
> 		delta -= steal;
> 	}
> #endif

Yeah, that is probably a nicer way of doing the same thing.

Suleiman, care to send a v2?

-- Steve
Joel Fernandes Aug. 20, 2024, 3:42 p.m. UTC | #6
On Tue, Aug 20, 2024 at 10:50 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Tue, 20 Aug 2024 15:15:55 +0530
> Srikar Dronamraju <srikar@linux.ibm.com> wrote:
>
> > * Suleiman Souhlal <suleiman@google.com> [2024-08-06 20:11:57]:
> >
> > > When steal time exceeds the measured delta when updating clock_task, we
> > > currently try to catch up the excess in future updates.
> > > However, this results in inaccurate run times for the future clock_task
> > > measurements, as they end up getting additional steal time that did not
> > > actually happen, from the previous excess steal time being paid back.
> > >
> > > For example, suppose a task in a VM runs for 10ms and had 15ms of steal
> > > time reported while it ran. clock_task rightly doesn't advance. Then, a
> > > different task runs on the same rq for 10ms without any time stolen.
> > > Because of the current catch up mechanism, clock_sched inaccurately ends
> > > up advancing by only 5ms instead of 10ms even though there wasn't any
> > > actual time stolen. The second task is getting charged for less time
> > > than it ran, even though it didn't deserve it.
> > > In other words, tasks can end up getting more run time than they should
> > > actually get.
> > >
> > > So, we instead don't make future updates pay back past excess stolen time.
>
> In other words, If one task had more time stolen from it than it had run,
> the excess time is removed from the next task even though it ran for its
> entire slot?
>
> I'm curious, how does a task get queued on the run queue if 100% of it's
> time was stolen? That is, how did it get queued if the vCPU wasn't running?

The scenario seems plausible to me: say several tasks were already
queued on the RQ (an overloaded RQ) before any time stealing happened.
Then the vCPU is preempted, thus stealing time. When the vCPU returns
to execution, the running task runs without clock_task advancing and
then goes to sleep, freeing the RQ for the other tasks that were
previously queued.

Or did I miss something?

thanks,
 - Joel
David Woodhouse Aug. 20, 2024, 3:50 p.m. UTC | #7
On Tue, 2024-08-06 at 20:11 +0900, Suleiman Souhlal wrote:
> When steal time exceeds the measured delta when updating clock_task, we
> currently try to catch up the excess in future updates.
> However, this results in inaccurate run times for the future clock_task
> measurements, as they end up getting additional steal time that did not
> actually happen, from the previous excess steal time being paid back.
> 
> For example, suppose a task in a VM runs for 10ms and had 15ms of steal
> time reported while it ran. clock_task rightly doesn't advance. Then, a
> different task runs on the same rq for 10ms without any time stolen.
> Because of the current catch up mechanism, clock_sched inaccurately ends
> up advancing by only 5ms instead of 10ms even though there wasn't any
> actual time stolen. The second task is getting charged for less time
> than it ran, even though it didn't deserve it.
> In other words, tasks can end up getting more run time than they should
> actually get.
> 
> So, we instead don't make future updates pay back past excess stolen time.

My understanding was that it was done this way for a reason: there is a
lot of jitter between the "run time" (your 10ms example), and the steal
time (15ms). What if 5ms really *did* elapse between the time that
'delta' is calculated, and the call to paravirt_steal_clock()?

By accounting that steal time "in advance" we ensure it isn't lost in
the case where the same process remains running for the next timeslice.

However, that does cause problems when the steal time goes negative
(due to hypervisor bugs). So in 
https://lore.kernel.org/all/20240522001817.619072-22-dwmw2@infradead.org/
I limited the amount of time which would be accounted to a future tick.

> Signed-off-by: Suleiman Souhlal <suleiman@google.com>
> ---
>  kernel/sched/core.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index bcf2c4cc0522..42b37da2bda6 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -728,13 +728,15 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
>  #endif
>  #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
>         if (static_key_false((&paravirt_steal_rq_enabled))) {
> -               steal = paravirt_steal_clock(cpu_of(rq));
> +               u64 prev_steal;
> +
> +               steal = prev_steal = paravirt_steal_clock(cpu_of(rq));
>                 steal -= rq->prev_steal_time_rq;
>  
>                 if (unlikely(steal > delta))
>                         steal = delta;
>  
> -               rq->prev_steal_time_rq += steal;
> +               rq->prev_steal_time_rq = prev_steal;
>                 delta -= steal;
>         }
>  #endif
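
For reference, the carry-over limiting David describes could look roughly
like the sketch below. This is only an illustration of the idea, not the
code from the linked patch; the actual threshold and structure there may
differ. TICK_NSEC is used here as an arbitrary cap on how much excess
steal time may be deferred to later updates.

	if (static_key_false((&paravirt_steal_rq_enabled))) {
		steal = paravirt_steal_clock(cpu_of(rq));
		steal -= rq->prev_steal_time_rq;

		if (unlikely(steal > delta)) {
			u64 excess = steal - delta;

			/*
			 * Defer at most one tick's worth of excess steal
			 * time to future updates; treat anything beyond
			 * that as a reporting glitch and drop it.
			 */
			if (excess > TICK_NSEC)
				rq->prev_steal_time_rq += excess - TICK_NSEC;
			steal = delta;
		}

		rq->prev_steal_time_rq += steal;
		delta -= steal;
	}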
Suleiman Souhlal Aug. 23, 2024, 9:52 p.m. UTC | #8
On Wed, Aug 21, 2024 at 12:51 AM David Woodhouse <dwmw2@infradead.org> wrote:
>
> On Tue, 2024-08-06 at 20:11 +0900, Suleiman Souhlal wrote:
> > When steal time exceeds the measured delta when updating clock_task, we
> > currently try to catch up the excess in future updates.
> > However, this results in inaccurate run times for the future clock_task
> > measurements, as they end up getting additional steal time that did not
> > actually happen, from the previous excess steal time being paid back.
> >
> > For example, suppose a task in a VM runs for 10ms and had 15ms of steal
> > time reported while it ran. clock_task rightly doesn't advance. Then, a
> > different task runs on the same rq for 10ms without any time stolen.
> > Because of the current catch up mechanism, clock_sched inaccurately ends
> > up advancing by only 5ms instead of 10ms even though there wasn't any
> > actual time stolen. The second task is getting charged for less time
> > than it ran, even though it didn't deserve it.
> > In other words, tasks can end up getting more run time than they should
> > actually get.
> >
> > So, we instead don't make future updates pay back past excess stolen time.
>
> My understanding was that it was done this way for a reason: there is a
> lot of jitter between the "run time" (your 10ms example), and the steal
> time (15ms). What if 5ms really *did* elapse between the time that
> 'delta' is calculated, and the call to paravirt_steal_clock()?
>
> By accounting that steal time "in advance" we ensure it isn't lost in
> the case where the same process remains running for the next timeslice.

This is an interesting observation, but I'd argue that even in the scenario
where the same task stays running, the extra 5ms should not be accounted
as steal time:
From what I can tell, update_curr() only updates the task's runtime by the
"delta" that was calculated, so the extra "stolen" time that happened between
the delta measurement and the call to paravirt_steal_clock() does not seem
relevant: it happened at a point where the task was not being counted as
running, so it should not be charged against it.

I am however struggling to express this in a way that is easily understandable
by others.

> However, that does cause problems when the steal time goes negative
> (due to hypervisor bugs). So in
> https://lore.kernel.org/all/20240522001817.619072-22-dwmw2@infradead.org/
> I limited the amount of time which would be accounted to a future tick.

That would definitely be an improvement over the old behavior.

I noticed the issue because I've been trying to make sense of the numbers
while trying to include the time that the host was suspended in steal time in
https://lore.kernel.org/kvm/20240820043543.837914-1-suleiman@google.com/
(But I think the issue can happen even without these changes.)

Thanks,
-- Suleiman

Patch

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bcf2c4cc0522..42b37da2bda6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -728,13 +728,15 @@  static void update_rq_clock_task(struct rq *rq, s64 delta)
 #endif
 #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
 	if (static_key_false((&paravirt_steal_rq_enabled))) {
-		steal = paravirt_steal_clock(cpu_of(rq));
+		u64 prev_steal;
+
+		steal = prev_steal = paravirt_steal_clock(cpu_of(rq));
 		steal -= rq->prev_steal_time_rq;
 
 		if (unlikely(steal > delta))
 			steal = delta;
 
-		rq->prev_steal_time_rq += steal;
+		rq->prev_steal_time_rq = prev_steal;
 		delta -= steal;
 	}
 #endif