[v3] sched: Don't try to catch up excess steal time.

Message ID	20241118043745.1857272-1-suleiman@google.com (mailing list archive)
State	New
Headers	show Received: from mail-yw1-f202.google.com (mail-yw1-f202.google.com [209.85.128.202]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id F22801EB48 for <kvm@vger.kernel.org>; Mon, 18 Nov 2024 04:38:27 +0000 (UTC) Date: Mon, 18 Nov 2024 13:37:45 +0900 Message-Id: <20241118043745.1857272-1-suleiman@google.com> Precedence: bulk Mime-Version: 1.0 Subject: [PATCH v3] sched: Don't try to catch up excess steal time. From: Suleiman Souhlal <suleiman@google.com> To: Ingo Molnar <mingo@redhat.com>, Peter Zijlstra <peterz@infradead.org>, Juri Lelli <juri.lelli@redhat.com>, Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>, Steven Rostedt <rostedt@goodmis.org>, Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>, Valentin Schneider <vschneid@redhat.com>, Paolo Bonzini <pbonzini@redhat.com>, seanjc@google.com, Srikar Dronamraju <srikar@linux.ibm.com>, David Woodhouse <dwmw2@infradead.org>, joelaf@google.com, vineethrp@google.com, linux-kernel@vger.kernel.org, kvm@vger.kernel.org, ssouhlal@freebsd.org, Suleiman Souhlal <suleiman@google.com> Content-Type: text/plain; charset="UTF-8"
Series	[v3] sched: Don't try to catch up excess steal time. \| expand [v3] sched: Don't try to catch up excess steal time.

Suleiman Souhlal Nov. 18, 2024, 4:37 a.m. UTC

When steal time exceeds the measured delta when updating clock_task, we
currently try to catch up the excess in future updates.
However, this results in inaccurate run times for the future things using
clock_task, in some situations, as they end up getting additional steal
time that did not actually happen.
This is because there is a window between reading the elapsed time in
update_rq_clock() and sampling the steal time in update_rq_clock_task().
If the VCPU gets preempted between those two points, any additional
steal time is accounted to the outgoing task even though the calculated
delta did not actually contain any of that "stolen" time.
When this race happens, we can end up with steal time that exceeds the
calculated delta, and the previous code would try to catch up that excess
steal time in future clock updates, which is given to the next,
incoming task, even though it did not actually have any time stolen.

This behavior is particularly bad when steal time can be very long,
which we've seen when trying to extend steal time to contain the duration
that the host was suspended [0]. When this happens, clock_task stays
frozen, during which the running task stays running for the whole
duration, since its run time doesn't increase.
However the race can happen even under normal operation.

Ideally we would read the elapsed cpu time and the steal time atomically,
to prevent this race from happening in the first place, but doing so
is non-trivial.

Since the time between those two points isn't otherwise accounted anywhere,
neither to the outgoing task nor the incoming task (because the "end of
outgoing task" and "start of incoming task" timestamps are the same),
I would argue that the right thing to do is to simply drop any excess steal
time, in order to prevent these issues.

[0] https://lore.kernel.org/kvm/20240820043543.837914-1-suleiman@google.com/

Signed-off-by: Suleiman Souhlal <suleiman@google.com>
---
v3:
- Reword commit message.
- Revert back to v1 code, since it's more understandable.

v2: https://lore.kernel.org/lkml/20240911111522.1110074-1-suleiman@google.com
- Slightly changed to simply moving one line up instead of adding
  new variable.

v1: https://lore.kernel.org/lkml/20240806111157.1336532-1-suleiman@google.com
---
 kernel/sched/core.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

Peter Zijlstra Nov. 18, 2024, 9:34 a.m. UTC | #1

On Mon, Nov 18, 2024 at 01:37:45PM +0900, Suleiman Souhlal wrote:
> When steal time exceeds the measured delta when updating clock_task, we
> currently try to catch up the excess in future updates.
> However, this results in inaccurate run times for the future things using
> clock_task, in some situations, as they end up getting additional steal
> time that did not actually happen.
> This is because there is a window between reading the elapsed time in
> update_rq_clock() and sampling the steal time in update_rq_clock_task().
> If the VCPU gets preempted between those two points, any additional
> steal time is accounted to the outgoing task even though the calculated
> delta did not actually contain any of that "stolen" time.
> When this race happens, we can end up with steal time that exceeds the
> calculated delta, and the previous code would try to catch up that excess
> steal time in future clock updates, which is given to the next,
> incoming task, even though it did not actually have any time stolen.
> 
> This behavior is particularly bad when steal time can be very long,
> which we've seen when trying to extend steal time to contain the duration
> that the host was suspended [0]. When this happens, clock_task stays
> frozen, during which the running task stays running for the whole
> duration, since its run time doesn't increase.
> However the race can happen even under normal operation.
> 
> Ideally we would read the elapsed cpu time and the steal time atomically,
> to prevent this race from happening in the first place, but doing so
> is non-trivial.
> 
> Since the time between those two points isn't otherwise accounted anywhere,
> neither to the outgoing task nor the incoming task (because the "end of
> outgoing task" and "start of incoming task" timestamps are the same),
> I would argue that the right thing to do is to simply drop any excess steal
> time, in order to prevent these issues.
> 
> [0] https://lore.kernel.org/kvm/20240820043543.837914-1-suleiman@google.com/
> 
> Signed-off-by: Suleiman Souhlal <suleiman@google.com>

Right.. uhm.. I don't particularly care much either way. Are other
people with virt clue okay with this?

> ---
>  kernel/sched/core.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index a1c353a62c56..13f70316ef39 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -766,13 +766,15 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
>  #endif
>  #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
>  	if (static_key_false((&paravirt_steal_rq_enabled))) {
> -		steal = paravirt_steal_clock(cpu_of(rq));
> +		u64 prev_steal;
> +
> +		steal = prev_steal = paravirt_steal_clock(cpu_of(rq));
>  		steal -= rq->prev_steal_time_rq;
>  
>  		if (unlikely(steal > delta))
>  			steal = delta;
>  
> -		rq->prev_steal_time_rq += steal;
> +		rq->prev_steal_time_rq = prev_steal;
>  		delta -= steal;
>  	}
>  #endif
> -- 
> 2.47.0.338.g60cca15819-goog
>

David Woodhouse Nov. 18, 2024, 10:14 a.m. UTC | #2

On Mon, 2024-11-18 at 10:34 +0100, Peter Zijlstra wrote:
> On Mon, Nov 18, 2024 at 01:37:45PM +0900, Suleiman Souhlal wrote:
> > When steal time exceeds the measured delta when updating clock_task, we
> > currently try to catch up the excess in future updates.
> > However, this results in inaccurate run times for the future things using
> > clock_task, in some situations, as they end up getting additional steal
> > time that did not actually happen.
> > This is because there is a window between reading the elapsed time in
> > update_rq_clock() and sampling the steal time in update_rq_clock_task().
> > If the VCPU gets preempted between those two points, any additional
> > steal time is accounted to the outgoing task even though the calculated
> > delta did not actually contain any of that "stolen" time.
> > When this race happens, we can end up with steal time that exceeds the
> > calculated delta, and the previous code would try to catch up that excess
> > steal time in future clock updates, which is given to the next,
> > incoming task, even though it did not actually have any time stolen.
> > 
> > This behavior is particularly bad when steal time can be very long,
> > which we've seen when trying to extend steal time to contain the duration
> > that the host was suspended [0]. When this happens, clock_task stays
> > frozen, during which the running task stays running for the whole
> > duration, since its run time doesn't increase.
> > However the race can happen even under normal operation.
> > 
> > Ideally we would read the elapsed cpu time and the steal time atomically,
> > to prevent this race from happening in the first place, but doing so
> > is non-trivial.
> > 
> > Since the time between those two points isn't otherwise accounted anywhere,
> > neither to the outgoing task nor the incoming task (because the "end of
> > outgoing task" and "start of incoming task" timestamps are the same),
> > I would argue that the right thing to do is to simply drop any excess steal
> > time, in order to prevent these issues.
> > 
> > [0] https://lore.kernel.org/kvm/20240820043543.837914-1-suleiman@google.com/
> > 
> > Signed-off-by: Suleiman Souhlal <suleiman@google.com>
> 
> Right.. uhm.. I don't particularly care much either way. Are other
> people with virt clue okay with this?

I'm slightly dubious because now we may systemically lose accounted
steal time where before it was all at least accounted *somewhere*, and
we might reasonably have expected the slight inaccuracies to balance
out over time.

But this *does* fix the main problem I was seeing, that the kernel will
currently just keep attributing steal time to processes *forever* if a
buggy hypervisor lets it step backwards. So I can live with it.

Acked-by: David Woodhouse <dwmw@amazon.co.uk>

Steven Rostedt Nov. 18, 2024, 7:33 p.m. UTC | #3

On Mon, 18 Nov 2024 13:37:45 +0900
Suleiman Souhlal <suleiman@google.com> wrote:

> When steal time exceeds the measured delta when updating clock_task, we
> currently try to catch up the excess in future updates.
> However, this results in inaccurate run times for the future things using
> clock_task, in some situations, as they end up getting additional steal
> time that did not actually happen.
> This is because there is a window between reading the elapsed time in
> update_rq_clock() and sampling the steal time in update_rq_clock_task().
> If the VCPU gets preempted between those two points, any additional
> steal time is accounted to the outgoing task even though the calculated
> delta did not actually contain any of that "stolen" time.
> When this race happens, we can end up with steal time that exceeds the
> calculated delta, and the previous code would try to catch up that excess
> steal time in future clock updates, which is given to the next,
> incoming task, even though it did not actually have any time stolen.
> 
> This behavior is particularly bad when steal time can be very long,
> which we've seen when trying to extend steal time to contain the duration
> that the host was suspended [0]. When this happens, clock_task stays
> frozen, during which the running task stays running for the whole
> duration, since its run time doesn't increase.
> However the race can happen even under normal operation.
> 
> Ideally we would read the elapsed cpu time and the steal time atomically,
> to prevent this race from happening in the first place, but doing so
> is non-trivial.
> 
> Since the time between those two points isn't otherwise accounted anywhere,
> neither to the outgoing task nor the incoming task (because the "end of
> outgoing task" and "start of incoming task" timestamps are the same),
> I would argue that the right thing to do is to simply drop any excess steal
> time, in order to prevent these issues.
> 
> [0] https://lore.kernel.org/kvm/20240820043543.837914-1-suleiman@google.com/
> 
> Signed-off-by: Suleiman Souhlal <suleiman@google.com>
> ---
> v3:
> - Reword commit message.
> - Revert back to v1 code, since it's more understandable.
> 
> v2: https://lore.kernel.org/lkml/20240911111522.1110074-1-suleiman@google.com
> - Slightly changed to simply moving one line up instead of adding
>   new variable.
> 
> v1: https://lore.kernel.org/lkml/20240806111157.1336532-1-suleiman@google.com
> ---
>  kernel/sched/core.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index a1c353a62c56..13f70316ef39 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -766,13 +766,15 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
>  #endif
>  #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
>  	if (static_key_false((&paravirt_steal_rq_enabled))) {
> -		steal = paravirt_steal_clock(cpu_of(rq));
> +		u64 prev_steal;
> +
> +		steal = prev_steal = paravirt_steal_clock(cpu_of(rq));
>  		steal -= rq->prev_steal_time_rq;
>  
>  		if (unlikely(steal > delta))
>  			steal = delta;

So is the problem just the above if statement? That is, delta is already
calculated, but if we get interrupted by the host before steal is
calculated and the time then becomes greater than delta, the time
difference between delta and steal gets pushed off to the next task, right?

-- Steve



>  
> -		rq->prev_steal_time_rq += steal;
> +		rq->prev_steal_time_rq = prev_steal;
>  		delta -= steal;
>  	}
>  #endif

Joel Fernandes Nov. 19, 2024, 12:10 a.m. UTC | #4

On Tue, Nov 19, 2024 at 4:32 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Mon, 18 Nov 2024 13:37:45 +0900
> Suleiman Souhlal <suleiman@google.com> wrote:
>
> > When steal time exceeds the measured delta when updating clock_task, we
> > currently try to catch up the excess in future updates.
> > However, this results in inaccurate run times for the future things using
> > clock_task, in some situations, as they end up getting additional steal
> > time that did not actually happen.
> > This is because there is a window between reading the elapsed time in
> > update_rq_clock() and sampling the steal time in update_rq_clock_task().
> > If the VCPU gets preempted between those two points, any additional
> > steal time is accounted to the outgoing task even though the calculated
> > delta did not actually contain any of that "stolen" time.
> > When this race happens, we can end up with steal time that exceeds the
> > calculated delta, and the previous code would try to catch up that excess
> > steal time in future clock updates, which is given to the next,
> > incoming task, even though it did not actually have any time stolen.
> >
> > This behavior is particularly bad when steal time can be very long,
> > which we've seen when trying to extend steal time to contain the duration
> > that the host was suspended [0]. When this happens, clock_task stays
> > frozen, during which the running task stays running for the whole
> > duration, since its run time doesn't increase.
> > However the race can happen even under normal operation.
> >
> > Ideally we would read the elapsed cpu time and the steal time atomically,
> > to prevent this race from happening in the first place, but doing so
> > is non-trivial.
> >
> > Since the time between those two points isn't otherwise accounted anywhere,
> > neither to the outgoing task nor the incoming task (because the "end of
> > outgoing task" and "start of incoming task" timestamps are the same),
> > I would argue that the right thing to do is to simply drop any excess steal
> > time, in order to prevent these issues.
> >
> > [0] https://lore.kernel.org/kvm/20240820043543.837914-1-suleiman@google.com/
> >
> > Signed-off-by: Suleiman Souhlal <suleiman@google.com>
> > ---
> > v3:
> > - Reword commit message.
> > - Revert back to v1 code, since it's more understandable.
> >
> > v2: https://lore.kernel.org/lkml/20240911111522.1110074-1-suleiman@google.com
> > - Slightly changed to simply moving one line up instead of adding
> >   new variable.
> >
> > v1: https://lore.kernel.org/lkml/20240806111157.1336532-1-suleiman@google.com
> > ---
> >  kernel/sched/core.c | 6 ++++--
> >  1 file changed, 4 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index a1c353a62c56..13f70316ef39 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -766,13 +766,15 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
> >  #endif
> >  #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
> >       if (static_key_false((&paravirt_steal_rq_enabled))) {
> > -             steal = paravirt_steal_clock(cpu_of(rq));
> > +             u64 prev_steal;
> > +
> > +             steal = prev_steal = paravirt_steal_clock(cpu_of(rq));
> >               steal -= rq->prev_steal_time_rq;
> >
> >               if (unlikely(steal > delta))
> >                       steal = delta;
>
> So is the problem just the above if statement? That is, delta is already
> calculated, but if we get interrupted by the host before steal is
> calculated and the time then becomes greater than delta, the time
> difference between delta and steal gets pushed off to the next task, right?

Pretty much.. the steal being capped to delta means the rest of the
steal is pushed off to the future. Instead he discards the remaining
steal after this patch.

Thanks


Joel

Steven Rostedt Nov. 19, 2024, 12:18 a.m. UTC | #5

On Tue, 19 Nov 2024 09:10:41 +0900
Joel Fernandes <joelaf@google.com> wrote:

> > > +++ b/kernel/sched/core.c
> > > @@ -766,13 +766,15 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
> > >  #endif
> > >  #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
> > >       if (static_key_false((&paravirt_steal_rq_enabled))) {
> > > -             steal = paravirt_steal_clock(cpu_of(rq));
> > > +             u64 prev_steal;
> > > +
> > > +             steal = prev_steal = paravirt_steal_clock(cpu_of(rq));
> > >               steal -= rq->prev_steal_time_rq;
> > >
> > >               if (unlikely(steal > delta))
> > >                       steal = delta;  
> >
> > So is the problem just the above if statement? That is, delta is already
> > calculated, but if we get interrupted by the host before steal is
> > calculated and the time then becomes greater than delta, the time
> > difference between delta and steal gets pushed off to the next task, right?  
> 
> Pretty much.. the steal being capped to delta means the rest of the
> steal is pushed off to the future. Instead he discards the remaining
> steal after this patch.

Thanks for confirming. I just wanted to make sure I understand as the
initial change log went into a lot of detail where I sorta got lost ;-)

-- Steve

Joel Fernandes Nov. 19, 2024, 12:22 a.m. UTC | #6

On Tue, Nov 19, 2024 at 9:17 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Tue, 19 Nov 2024 09:10:41 +0900
> Joel Fernandes <joelaf@google.com> wrote:
>
> > > > +++ b/kernel/sched/core.c
> > > > @@ -766,13 +766,15 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
> > > >  #endif
> > > >  #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
> > > >       if (static_key_false((&paravirt_steal_rq_enabled))) {
> > > > -             steal = paravirt_steal_clock(cpu_of(rq));
> > > > +             u64 prev_steal;
> > > > +
> > > > +             steal = prev_steal = paravirt_steal_clock(cpu_of(rq));
> > > >               steal -= rq->prev_steal_time_rq;
> > > >
> > > >               if (unlikely(steal > delta))
> > > >                       steal = delta;
> > >
> > > So is the problem just the above if statement? That is, delta is already
> > > calculated, but if we get interrupted by the host before steal is
> > > calculated and the time then becomes greater than delta, the time
> > > difference between delta and steal gets pushed off to the next task, right?
> >
> > Pretty much.. the steal being capped to delta means the rest of the
> > steal is pushed off to the future. Instead he discards the remaining
> > steal after this patch.
>
> Thanks for confirming. I just wanted to make sure I understand as the
> initial change log went into a lot of detail where I sorta got lost ;-)

No problem!! Glad we're on the same page about the change.

thanks,

 - Joel

[v3] sched: Don't try to catch up excess steal time.

Commit Message

Comments

Patch