Message ID | 20240911111522.1110074-1-suleiman@google.com (mailing list archive)
---|---
State | New, archived
Series | [v2] sched: Don't try to catch up excess steal time.
* Suleiman Souhlal <suleiman@google.com> [2024-09-11 20:15:22]: > When steal time exceeds the measured delta when updating clock_task, we > currently try to catch up the excess in future updates. > However, this results in inaccurate run times for the future things using > clock_task, as they end up getting additional steal time that did not > actually happen. > > For example, suppose a task in a VM runs for 10ms and had 15ms of steal > time reported while it ran. clock_task rightly doesn't advance. Then, a > different taks runs on the same rq for 10ms without any time stolen in > the host. > Because of the current catch up mechanism, clock_sched inaccurately ends > up advancing by only 5ms instead of 10ms even though there wasn't any > actual time stolen. The second task is getting charged for less time > than it ran, even though it didn't deserve it. > This can result in tasks getting more run time than they should actually > get. > > So, we instead don't make future updates pay back past excess stolen time. > > Signed-off-by: Suleiman Souhlal <suleiman@google.com> > --- > v2: > - Slightly changed to simply moving one line up instead of adding > new variable. > > v1: https://lore.kernel.org/lkml/20240806111157.1336532-1-suleiman@google.com > --- After moving the line up, this looks good to me. Reviewed-by: Srikar Dronamraju <srikar@linux.ibm.com>
On Wed, Sep 11, 2024 at 8:16 PM Suleiman Souhlal <suleiman@google.com> wrote: > > When steal time exceeds the measured delta when updating clock_task, we > currently try to catch up the excess in future updates. > However, this results in inaccurate run times for the future things using > clock_task, as they end up getting additional steal time that did not > actually happen. > > For example, suppose a task in a VM runs for 10ms and had 15ms of steal > time reported while it ran. clock_task rightly doesn't advance. Then, a > different taks runs on the same rq for 10ms without any time stolen in > the host. > Because of the current catch up mechanism, clock_sched inaccurately ends > up advancing by only 5ms instead of 10ms even though there wasn't any > actual time stolen. The second task is getting charged for less time > than it ran, even though it didn't deserve it. > This can result in tasks getting more run time than they should actually > get. > > So, we instead don't make future updates pay back past excess stolen time. > > Signed-off-by: Suleiman Souhlal <suleiman@google.com> > --- > v2: > - Slightly changed to simply moving one line up instead of adding > new variable. > > v1: https://lore.kernel.org/lkml/20240806111157.1336532-1-suleiman@google.com > --- > kernel/sched/core.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) Friendly ping. Thanks, -- Suleiman
On Wed, 2024-09-11 at 20:15 +0900, Suleiman Souhlal wrote: > When steal time exceeds the measured delta when updating clock_task, > we > currently try to catch up the excess in future updates. > However, this results in inaccurate run times for the future things > using > clock_task, as they end up getting additional steal time that did not > actually happen. > > For example, suppose a task in a VM runs for 10ms and had 15ms of > steal > time reported while it ran. clock_task rightly doesn't advance. Then, > a > different taks runs on the same rq for 10ms without any time stolen > in > the host. > Because of the current catch up mechanism, clock_sched inaccurately > ends > up advancing by only 5ms instead of 10ms even though there wasn't any > actual time stolen. The second task is getting charged for less time > than it ran, even though it didn't deserve it. > This can result in tasks getting more run time than they should > actually > get. > > So, we instead don't make future updates pay back past excess stolen > time. > > Signed-off-by: Suleiman Souhlal <suleiman@google.com> > --- > v2: > - Slightly changed to simply moving one line up instead of adding > new variable. > > v1: > https://lore.kernel.org/lkml/20240806111157.1336532-1-suleiman@google.com > --- > kernel/sched/core.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index f3951e4a55e5..6c34de8b3fbb 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -730,11 +730,11 @@ static void update_rq_clock_task(struct rq *rq, > s64 delta) > if (static_key_false((¶virt_steal_rq_enabled))) { > steal = paravirt_steal_clock(cpu_of(rq)); > steal -= rq->prev_steal_time_rq; > + rq->prev_steal_time_rq += steal; The above two lines are essentially: steal -= prev; prev += steal; It's like one of those clever ways of exchanging two variables with three XOR operations. 
I don't like it :)

Ultimately, you're just setting rq->prev_steal_time_rq to the latest
value that you just read from paravirt_steal_clock(). And then setting
'steal' to the delta between the new reading and the previous one.

The code above is *far* from obvious. At the very least it wants a
comment, but I'd rather see it refactored so that it doesn't need one.

	u64 abs_steal_now = paravirt_steal_clock(cpu_of(rq));

	steal = abs_steal_now - rq->prev_steal_time_rq;
	rq->prev_steal_time_rq = abs_steal_now;

I'm still not utterly convinced this is the right thing to do, though.
It means you will constantly undermeasure the accounting of steal time
as you discard the excess each time.

The underlying bug here is that we are sampling the steal time and the
time slice at *different* times. This update_rq_clock_task() function
could be called with a calculated 'delta' argument... and then
experience a large amount of steal time *before* calling
paravirt_steal_clock(), which is how we end up in the situation where
the calculated steal time exceeds the running time of the previous
task.

Which task *should* that steal time be accounted to? At the moment it
ends up being accounted to the next task to run — which seems to make
sense to me. In the situation I just described, we can consider the
time stolen in update_rq_clock_task() itself to have been stolen from
the *incoming* task, not the *outgoing* one. But that seems to be what
you're objecting to?

In https://lore.kernel.org/all/20240522001817.619072-22-dwmw2@infradead.org/
I put a limit on the amount of steal time carried forward from one
timeslice to the next, as it was misbehaving when a bad hypervisor
reported negative steal time. But I don't think the limit should be
zero.

Of course, *ideally* we'd be able to sample the time and steal time
*simultaneously*, with a single sched_clock_cpu_and_steal() function so
that we don't have to deal with this slop between readings. Then none
of this would be necessary.

But that seems hard.
On Wed, Sep 25, 2024 at 12:45:55PM +0100, David Woodhouse wrote: > On Wed, 2024-09-11 at 20:15 +0900, Suleiman Souhlal wrote: > > When steal time exceeds the measured delta when updating clock_task, > > we > > currently try to catch up the excess in future updates. > > However, this results in inaccurate run times for the future things > > using > > clock_task, as they end up getting additional steal time that did not > > actually happen. > > > > For example, suppose a task in a VM runs for 10ms and had 15ms of > > steal > > time reported while it ran. clock_task rightly doesn't advance. Then, > > a > > different taks runs on the same rq for 10ms without any time stolen > > in > > the host. > > Because of the current catch up mechanism, clock_sched inaccurately > > ends > > up advancing by only 5ms instead of 10ms even though there wasn't any > > actual time stolen. The second task is getting charged for less time > > than it ran, even though it didn't deserve it. > > This can result in tasks getting more run time than they should > > actually > > get. > > > > So, we instead don't make future updates pay back past excess stolen > > time. > > > > Signed-off-by: Suleiman Souhlal <suleiman@google.com> > > --- > > v2: > > - Slightly changed to simply moving one line up instead of adding > > new variable. 
> > > > v1: > > https://lore.kernel.org/lkml/20240806111157.1336532-1-suleiman@google.com > > --- > > kernel/sched/core.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > > index f3951e4a55e5..6c34de8b3fbb 100644 > > --- a/kernel/sched/core.c > > +++ b/kernel/sched/core.c > > @@ -730,11 +730,11 @@ static void update_rq_clock_task(struct rq *rq, > > s64 delta) > > if (static_key_false((¶virt_steal_rq_enabled))) { > > steal = paravirt_steal_clock(cpu_of(rq)); > > steal -= rq->prev_steal_time_rq; > > + rq->prev_steal_time_rq += steal; > > The above two lines are essentially: > > steal -= prev; > prev += steal; > > It's like one of those clever ways of exchanging two variables with > three XOR operations. I don't like it :) > > Ultimately, you're just setting rq->prev_steal_time_rq to the latest > value that you just read from paravirt_steal_clock(). And then setting > 'steal' to the delta between the new reading and the previous one. > > The code above is *far* from obvious. At the very least it wants a > comment, but I'd rather see it refactored so that it doesn't need one. > > u64 abs_steal_now = paravirt_steal_clock(cpu_of(rq)); > steal = abs_steal_now - rq->prev_steal_time_rq; > rq->prev_steal_time_rq = abs_steal_now; That is what v1 did: https://lore.kernel.org/lkml/20240806111157.1336532-1-suleiman@google.com/ It is also more obvious to me, but the feedback I received was that the way in the current iteration is better. I don't feel strongly about it, and I'd be ok with either version applied. > > I'm still not utterly convinced this is the right thing to do, though. > It means you will constantly undermeasure the accounting of steal time > as you discard the excess each time. > > The underlying bug here is that we are sampling the steal time and the > time slice at *different* times. This update_rq_clock_task() function > could be called with a calculated 'delta' argument... 
> and then experience a large amount of steal time *before* calling
> paravirt_steal_clock(), which is how we end up in the situation where
> the calculated steal time exceeds the running time of the previous
> task.
>
> Which task *should* that steal time be accounted to? At the moment it
> ends up being accounted to the next task to run — which seems to make
> sense to me. In the situation I just described, we can consider the
> time stolen in update_rq_clock_task() itself to have been stolen from
> the *incoming* task, not the *outgoing* one. But that seems to be what
> you're objecting to?

This is a good description of the problem, except that the time stolen
in update_rq_clock_task() itself is actually being stolen from the
outgoing task. This is because we are still trying to calculate how long
it ran for (update_curr()), and time hasn't started ticking for the
incoming task yet. We haven't set the incoming task's exec_start with the
new clock_task time yet.

So, in my opinion, it's wrong to give that time to the incoming task.

> In
> https://lore.kernel.org/all/20240522001817.619072-22-dwmw2@infradead.org/
> I put a limit on the amount of steal time carried forward from one
> timeslice to the next, as it was misbehaving when a bad hypervisor
> reported negative steal time. But I don't think the limit should be
> zero.
>
> Of course, *ideally* we'd be able to sample the time and steal time
> *simultaneously*, with a single sched_clock_cpu_and_steal() function so
> that we don't have to deal with this slop between readings. Then none
> of this would be necessary. But that seems hard.

I agree that that would be ideal.

Thanks,
-- Suleiman
On Wed, 2024-09-25 at 09:24 -0400, Joel Fernandes wrote:
> > In
> > https://lore.kernel.org/all/20240522001817.619072-22-dwmw2@infradead.org/
> > I put a limit on the amount of steal time carried forward from one
> > timeslice to the next, as it was misbehaving when a bad hypervisor
> > reported negative steal time. But I don't think the limit should be
> > zero.
>
> Yea, this is the solution I was thinking but I don’t see any limits
> to steal time in the current code. That is what Suleiman is trying to
> fix.
>
> So why dont we cap the maximum steal time accrued any more (I.e. what
> happened to the code in your patch, was it deleted for another
> reason?).

My patch was never merged. It was part of an exploratory RFC series
working on various KVM clock problems, one of which *fixed* the problem
of steal time going backwards in the hypervisor, and that guest patch
was an attempt to work around that bug in existing KVM. I never did
split it out and send it on its own.

> My impression is your old patch is exactly what we need as Suleiman
> is seeing an issue with suspend causing very very long steal times in
> virtual Machines, but I will let him speak for himself.
On Wed, 2024-09-25 at 13:25 +0000, Suleiman Souhlal wrote: > On Wed, Sep 25, 2024 at 12:45:55PM +0100, David Woodhouse wrote: > > On Wed, 2024-09-11 at 20:15 +0900, Suleiman Souhlal wrote: > > > When steal time exceeds the measured delta when updating clock_task, > > > we > > > currently try to catch up the excess in future updates. > > > However, this results in inaccurate run times for the future things > > > using > > > clock_task, as they end up getting additional steal time that did not > > > actually happen. > > > > > > For example, suppose a task in a VM runs for 10ms and had 15ms of > > > steal > > > time reported while it ran. clock_task rightly doesn't advance. Then, > > > a > > > different taks runs on the same rq for 10ms without any time stolen > > > in > > > the host. > > > Because of the current catch up mechanism, clock_sched inaccurately > > > ends > > > up advancing by only 5ms instead of 10ms even though there wasn't any > > > actual time stolen. The second task is getting charged for less time > > > than it ran, even though it didn't deserve it. > > > This can result in tasks getting more run time than they should > > > actually > > > get. > > > > > > So, we instead don't make future updates pay back past excess stolen > > > time. > > > > > > Signed-off-by: Suleiman Souhlal <suleiman@google.com> > > > --- > > > v2: > > > - Slightly changed to simply moving one line up instead of adding > > > new variable. 
> > > > > > v1: > > > https://lore.kernel.org/lkml/20240806111157.1336532-1-suleiman@google.com > > > --- > > > kernel/sched/core.c | 2 +- > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > > > index f3951e4a55e5..6c34de8b3fbb 100644 > > > --- a/kernel/sched/core.c > > > +++ b/kernel/sched/core.c > > > @@ -730,11 +730,11 @@ static void update_rq_clock_task(struct rq *rq, > > > s64 delta) > > > if (static_key_false((¶virt_steal_rq_enabled))) { > > > steal = paravirt_steal_clock(cpu_of(rq)); > > > steal -= rq->prev_steal_time_rq; > > > + rq->prev_steal_time_rq += steal; > > > > The above two lines are essentially: > > > > steal -= prev; > > prev += steal; > > > > It's like one of those clever ways of exchanging two variables with > > three XOR operations. I don't like it :) > > > > Ultimately, you're just setting rq->prev_steal_time_rq to the latest > > value that you just read from paravirt_steal_clock(). And then setting > > 'steal' to the delta between the new reading and the previous one. > > > > The code above is *far* from obvious. At the very least it wants a > > comment, but I'd rather see it refactored so that it doesn't need one. > > > > u64 abs_steal_now = paravirt_steal_clock(cpu_of(rq)); > > steal = abs_steal_now - rq->prev_steal_time_rq; > > rq->prev_steal_time_rq = abs_steal_now; > > That is what v1 did: > https://lore.kernel.org/lkml/20240806111157.1336532-1-suleiman@google.com/ > > It is also more obvious to me, but the feedback I received was that > the way in the current iteration is better. > > I don't feel strongly about it, and I'd be ok with either version applied. Fair enough. Not really a hill anyone should choose to die on, I suppose. > > > > I'm still not utterly convinced this is the right thing to do, though. > > It means you will constantly undermeasure the accounting of steal time > > as you discard the excess each time. 
> > The underlying bug here is that we are sampling the steal time and the
> > time slice at *different* times. This update_rq_clock_task() function
> > could be called with a calculated 'delta' argument... and then
> > experience a large amount of steal time *before* calling
> > paravirt_steal_clock(), which is how we end up in the situation where
> > the calculated steal time exceeds the running time of the previous
> > task.
> >
> > Which task *should* that steal time be accounted to? At the moment it
> > ends up being accounted to the next task to run — which seems to make
> > sense to me. In the situation I just described, we can consider the
> > time stolen in update_rq_clock_task() itself to have been stolen from
> > the *incoming* task, not the *outgoing* one. But that seems to be what
> > you're objecting to?
>
> This is a good description of the problem, except that the time stolen
> in update_rq_clock_task() itself is actually being stolen from the
> outgoing task. This is because we are still trying to calculate how long
> it ran for (update_curr()), and time hasn't started ticking for the
> incoming task yet. We haven't set the incoming task's exec_start with the
> new clock_task time yet.
>
> So, in my opinion, it's wrong to give that time to the incoming task.

That makes sense. That steal time is actually stolen from *neither*
task, since it's after the 'end' timestamp of the outgoing task, and
before the 'start' timestamp of the incoming task.

So where *should* it be accounted? Or is it actually correct to drop it
completely?

If you can make a coherent case for the fact that dropping it is really
the right thing to do (not *just* that it doesn't belong to the
outgoing task, which is the case you make in your existing commit
message), then I suppose I'm OK with your patch as-is.
On Wed, Sep 25, 2024 at 03:26:56PM +0100, David Woodhouse wrote: > On Wed, 2024-09-25 at 13:25 +0000, Suleiman Souhlal wrote: > > > > > > I'm still not utterly convinced this is the right thing to do, though. > > > It means you will constantly undermeasure the accounting of steal time > > > as you discard the excess each time. > > > > > > The underlying bug here is that we are sampling the steal time and the > > > time slice at *different* times. This update_rq_clock_task() function > > > could be called with a calculated 'delta' argument... and then > > > experience a large amount of steal time *before* calling > > > paravirt_steal_clock(), which is how we end up in the situation where > > > the calculated steal time exceeds the running time of the previous > > > task. > > > > > > Which task *should* that steal time be accounted to? At the moment it > > > ends up being accounted to the next task to run — which seems to make > > > sense to me. In the situation I just described, we can consider the > > > time stolen in update_rq_clock_task() itself to have been stolen from > > > the *incoming* task, not the *outgoing* one. But that seems to be what > > > you're objecting to? > > > > This is a good description of the problem, except that the time stolen > > in update_rq_clock_task() itself is actually being stolen from the > > outgoing task. This is because we are still trying to calculate how long > > it ran for (update_curr()), and time hasn't started ticking for the > > incoming task yet. We haven't set the incoming task's exec_start with the > > new clock_task time yet. > > > > So, in my opinion, it's wrong to give that time to the incoming task. > > That makes sense. That steal time is actually stolen from *neither* > task, since it's after the 'end' timestamp of the outgoing task, and > before the 'start' timestamp of the incoming task. > > So where *should* it be accounted? > > Or is it actually correct to drop it completely? 
>
> If you can make a coherent case for the fact that dropping it is really
> the right thing to do (not *just* that it doesn't belong to the
> outgoing task, which is the case you make in your existing commit
> message), then I suppose I'm OK with your patch as-is.

Yes, that's a good way to put it: The excess steal time isn't actually
being stolen from anyone.

And since it's not being stolen from anyone, isn't the right thing to do
to drop it?

There might still be extra steal time that doesn't exceed the current
'delta' from the race between reading the two values, that would still
be erroneously accounted to the outgoing task, which this patch doesn't
address, but we know that any steal > delta is from this race and should
be dropped.

-- Suleiman
On Wed, 2024-09-25 at 15:15 +0000, Suleiman Souhlal wrote:
> Yes, that's a good way to put it: The excess steal time isn't actually
> being stolen from anyone.
> And since it's not being stolen from anyone, isn't the right thing to do
> to drop it?

It's being stolen from the system, isn't it? Just not any specific
userspace process?

If we have separate "end of outgoing task" and "start of incoming task"
timestamps, surely the time between those two must be accounted
*somewhere*?

> There might still be extra steal time that doesn't exceed the current
> 'delta' from the race between reading the two values, that would still
> be erroneously accounted to the outgoing task, which this patch doesn't
> address, but we know that any steal > delta is from this race and should
> be dropped.

Well that's what we want the atomic paired read for :)
On Wed, Sep 25, 2024 at 04:34:57PM +0100, David Woodhouse wrote:
> On Wed, 2024-09-25 at 15:15 +0000, Suleiman Souhlal wrote:
> > Yes, that's a good way to put it: The excess steal time isn't actually
> > being stolen from anyone.
> > And since it's not being stolen from anyone, isn't the right thing to do
> > to drop it?
>
> It's being stolen from the system, isn't it? Just not any specific
> userspace process?

I guess it depends what you mean by "stolen". I would argue that it's
not stolen from anyone since the time isn't actually counted anywhere.

> If we have separate "end of outgoing task" and "start of incoming task"
> timestamps, surely the time between those two must be accounted
> *somewhere*?

Not exactly. clock_task is essentially frozen until the next
update_rq_clock(), at which point we'll look at how much sched_clock_cpu
advanced and subtract how much steal time advanced. The two things are
done in separate spots (update_rq_clock() and update_rq_clock_task()),
independently (which is where the race is happening). As far as I can
tell, the time between the two isn't really accounted anywhere.

The "end of outgoing task" and "start of incoming task" timestamps
should end up being the same.

> > There might still be extra steal time that doesn't exceed the current
> > 'delta' from the race between reading the two values, that would still
> > be erroneously accounted to the outgoing task, which this patch doesn't
> > address, but we know that any steal > delta is from this race and should
> > be dropped.
>
> Well that's what we want the atomic paired read for :)

Right, but I don't think it's that simple. We aren't only reading memory
but also a clock. It might be possible to address this with a mechanism
like rseq, but that would be a much bigger patch set than the current
one (and I don't think anyone has ever attempted to do rseq for VMs
yet).
(There is also another potential issue I spotted with steal time, that
has to do with reading another VCPU's steal time while it's not running,
but I'll start a separate discussion about that with a different patch
set.)

Thanks for the discussion.

-- Suleiman
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f3951e4a55e5..6c34de8b3fbb 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -730,11 +730,11 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
 	if (static_key_false((&paravirt_steal_rq_enabled))) {
 		steal = paravirt_steal_clock(cpu_of(rq));
 		steal -= rq->prev_steal_time_rq;
+		rq->prev_steal_time_rq += steal;
 
 		if (unlikely(steal > delta))
 			steal = delta;
 
-		rq->prev_steal_time_rq += steal;
 		delta -= steal;
 	}
 #endif
When steal time exceeds the measured delta when updating clock_task, we
currently try to catch up the excess in future updates.
However, this results in inaccurate run times for future users of
clock_task, as they end up getting charged additional steal time that
did not actually happen.

For example, suppose a task in a VM runs for 10ms and had 15ms of steal
time reported while it ran. clock_task rightly doesn't advance. Then, a
different task runs on the same rq for 10ms without any time stolen in
the host.
Because of the current catch up mechanism, clock_task inaccurately ends
up advancing by only 5ms instead of 10ms even though there wasn't any
actual time stolen. The second task is getting charged for less time
than it ran, even though it doesn't deserve the discount.
This can result in tasks getting more run time than they should actually
get.

So, we instead don't make future updates pay back past excess stolen
time.

Signed-off-by: Suleiman Souhlal <suleiman@google.com>
---
v2:
- Slightly changed to simply moving one line up instead of adding
  new variable.

v1: https://lore.kernel.org/lkml/20240806111157.1336532-1-suleiman@google.com
---
 kernel/sched/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)