Message ID | 20240911111522.1110074-1-suleiman@google.com (mailing list archive)
---|---
State | New, archived
Series | [v2] sched: Don't try to catch up excess steal time.
* Suleiman Souhlal <suleiman@google.com> [2024-09-11 20:15:22]: > When steal time exceeds the measured delta when updating clock_task, we > currently try to catch up the excess in future updates. > However, this results in inaccurate run times for the future things using > clock_task, as they end up getting additional steal time that did not > actually happen. > > For example, suppose a task in a VM runs for 10ms and had 15ms of steal > time reported while it ran. clock_task rightly doesn't advance. Then, a > different taks runs on the same rq for 10ms without any time stolen in > the host. > Because of the current catch up mechanism, clock_sched inaccurately ends > up advancing by only 5ms instead of 10ms even though there wasn't any > actual time stolen. The second task is getting charged for less time > than it ran, even though it didn't deserve it. > This can result in tasks getting more run time than they should actually > get. > > So, we instead don't make future updates pay back past excess stolen time. > > Signed-off-by: Suleiman Souhlal <suleiman@google.com> > --- > v2: > - Slightly changed to simply moving one line up instead of adding > new variable. > > v1: https://lore.kernel.org/lkml/20240806111157.1336532-1-suleiman@google.com > --- After moving the line up, this looks good to me. Reviewed-by: Srikar Dronamraju <srikar@linux.ibm.com>
On Wed, Sep 11, 2024 at 8:16 PM Suleiman Souhlal <suleiman@google.com> wrote: > > When steal time exceeds the measured delta when updating clock_task, we > currently try to catch up the excess in future updates. > However, this results in inaccurate run times for the future things using > clock_task, as they end up getting additional steal time that did not > actually happen. > > For example, suppose a task in a VM runs for 10ms and had 15ms of steal > time reported while it ran. clock_task rightly doesn't advance. Then, a > different taks runs on the same rq for 10ms without any time stolen in > the host. > Because of the current catch up mechanism, clock_sched inaccurately ends > up advancing by only 5ms instead of 10ms even though there wasn't any > actual time stolen. The second task is getting charged for less time > than it ran, even though it didn't deserve it. > This can result in tasks getting more run time than they should actually > get. > > So, we instead don't make future updates pay back past excess stolen time. > > Signed-off-by: Suleiman Souhlal <suleiman@google.com> > --- > v2: > - Slightly changed to simply moving one line up instead of adding > new variable. > > v1: https://lore.kernel.org/lkml/20240806111157.1336532-1-suleiman@google.com > --- > kernel/sched/core.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) Friendly ping. Thanks, -- Suleiman
On Wed, 2024-09-11 at 20:15 +0900, Suleiman Souhlal wrote: > When steal time exceeds the measured delta when updating clock_task, > we > currently try to catch up the excess in future updates. > However, this results in inaccurate run times for the future things > using > clock_task, as they end up getting additional steal time that did not > actually happen. > > For example, suppose a task in a VM runs for 10ms and had 15ms of > steal > time reported while it ran. clock_task rightly doesn't advance. Then, > a > different taks runs on the same rq for 10ms without any time stolen > in > the host. > Because of the current catch up mechanism, clock_sched inaccurately > ends > up advancing by only 5ms instead of 10ms even though there wasn't any > actual time stolen. The second task is getting charged for less time > than it ran, even though it didn't deserve it. > This can result in tasks getting more run time than they should > actually > get. > > So, we instead don't make future updates pay back past excess stolen > time. > > Signed-off-by: Suleiman Souhlal <suleiman@google.com> > --- > v2: > - Slightly changed to simply moving one line up instead of adding > new variable. > > v1: > https://lore.kernel.org/lkml/20240806111157.1336532-1-suleiman@google.com > --- > kernel/sched/core.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index f3951e4a55e5..6c34de8b3fbb 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -730,11 +730,11 @@ static void update_rq_clock_task(struct rq *rq, > s64 delta) > if (static_key_false((¶virt_steal_rq_enabled))) { > steal = paravirt_steal_clock(cpu_of(rq)); > steal -= rq->prev_steal_time_rq; > + rq->prev_steal_time_rq += steal; The above two lines are essentially: steal -= prev; prev += steal; It's like one of those clever ways of exchanging two variables with three XOR operations. 
I don't like it :)

Ultimately, you're just setting rq->prev_steal_time_rq to the latest
value that you just read from paravirt_steal_clock(). And then setting
'steal' to the delta between the new reading and the previous one.

The code above is *far* from obvious. At the very least it wants a
comment, but I'd rather see it refactored so that it doesn't need one.

	u64 abs_steal_now = paravirt_steal_clock(cpu_of(rq));

	steal = abs_steal_now - rq->prev_steal_time_rq;
	rq->prev_steal_time_rq = abs_steal_now;

I'm still not utterly convinced this is the right thing to do, though.
It means you will constantly undermeasure the accounting of steal time
as you discard the excess each time.

The underlying bug here is that we are sampling the steal time and the
time slice at *different* times. This update_rq_clock_task() function
could be called with a calculated 'delta' argument... and then
experience a large amount of steal time *before* calling
paravirt_steal_clock(), which is how we end up in the situation where
the calculated steal time exceeds the running time of the previous
task.

Which task *should* that steal time be accounted to? At the moment it
ends up being accounted to the next task to run — which seems to make
sense to me. In the situation I just described, we can consider the
time stolen in update_rq_clock_task() itself to have been stolen from
the *incoming* task, not the *outgoing* one. But that seems to be what
you're objecting to?

In https://lore.kernel.org/all/20240522001817.619072-22-dwmw2@infradead.org/
I put a limit on the amount of steal time carried forward from one
timeslice to the next, as it was misbehaving when a bad hypervisor
reported negative steal time. But I don't think the limit should be
zero.

Of course, *ideally* we'd be able to sample the time and steal time
*simultaneously*, with a single sched_clock_cpu_and_steal() function so
that we don't have to deal with this slop between readings. Then none
of this would be necessary.

But that seems hard.
On Wed, Sep 25, 2024 at 12:45:55PM +0100, David Woodhouse wrote: > On Wed, 2024-09-11 at 20:15 +0900, Suleiman Souhlal wrote: > > When steal time exceeds the measured delta when updating clock_task, > > we > > currently try to catch up the excess in future updates. > > However, this results in inaccurate run times for the future things > > using > > clock_task, as they end up getting additional steal time that did not > > actually happen. > > > > For example, suppose a task in a VM runs for 10ms and had 15ms of > > steal > > time reported while it ran. clock_task rightly doesn't advance. Then, > > a > > different taks runs on the same rq for 10ms without any time stolen > > in > > the host. > > Because of the current catch up mechanism, clock_sched inaccurately > > ends > > up advancing by only 5ms instead of 10ms even though there wasn't any > > actual time stolen. The second task is getting charged for less time > > than it ran, even though it didn't deserve it. > > This can result in tasks getting more run time than they should > > actually > > get. > > > > So, we instead don't make future updates pay back past excess stolen > > time. > > > > Signed-off-by: Suleiman Souhlal <suleiman@google.com> > > --- > > v2: > > - Slightly changed to simply moving one line up instead of adding > > new variable. 
> > > > v1: > > https://lore.kernel.org/lkml/20240806111157.1336532-1-suleiman@google.com > > --- > > kernel/sched/core.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > > index f3951e4a55e5..6c34de8b3fbb 100644 > > --- a/kernel/sched/core.c > > +++ b/kernel/sched/core.c > > @@ -730,11 +730,11 @@ static void update_rq_clock_task(struct rq *rq, > > s64 delta) > > if (static_key_false((¶virt_steal_rq_enabled))) { > > steal = paravirt_steal_clock(cpu_of(rq)); > > steal -= rq->prev_steal_time_rq; > > + rq->prev_steal_time_rq += steal; > > The above two lines are essentially: > > steal -= prev; > prev += steal; > > It's like one of those clever ways of exchanging two variables with > three XOR operations. I don't like it :) > > Ultimately, you're just setting rq->prev_steal_time_rq to the latest > value that you just read from paravirt_steal_clock(). And then setting > 'steal' to the delta between the new reading and the previous one. > > The code above is *far* from obvious. At the very least it wants a > comment, but I'd rather see it refactored so that it doesn't need one. > > u64 abs_steal_now = paravirt_steal_clock(cpu_of(rq)); > steal = abs_steal_now - rq->prev_steal_time_rq; > rq->prev_steal_time_rq = abs_steal_now; That is what v1 did: https://lore.kernel.org/lkml/20240806111157.1336532-1-suleiman@google.com/ It is also more obvious to me, but the feedback I received was that the way in the current iteration is better. I don't feel strongly about it, and I'd be ok with either version applied. > > I'm still not utterly convinced this is the right thing to do, though. > It means you will constantly undermeasure the accounting of steal time > as you discard the excess each time. > > The underlying bug here is that we are sampling the steal time and the > time slice at *different* times. This update_rq_clock_task() function > could be called with a calculated 'delta' argument... 
> and then experience a large amount of steal time *before* calling
> paravirt_steal_clock(), which is how we end up in the situation where
> the calculated steal time exceeds the running time of the previous
> task.
>
> Which task *should* that steal time be accounted to? At the moment it
> ends up being accounted to the next task to run — which seems to make
> sense to me. In the situation I just described, we can consider the
> time stolen in update_rq_clock_task() itself to have been stolen from
> the *incoming* task, not the *outgoing* one. But that seems to be what
> you're objecting to?

This is a good description of the problem, except that the time stolen
in update_rq_clock_task() itself is actually being stolen from the
outgoing task. This is because we are still trying to calculate how long
it ran for (update_curr()), and time hasn't started ticking for the
incoming task yet. We haven't set the incoming task's exec_start with the
new clock_task time yet.

So, in my opinion, it's wrong to give that time to the incoming task.

> In
> https://lore.kernel.org/all/20240522001817.619072-22-dwmw2@infradead.org/
> I put a limit on the amount of steal time carried forward from one
> timeslice to the next, as it was misbehaving when a bad hypervisor
> reported negative steal time. But I don't think the limit should be
> zero.
>
> Of course, *ideally* we'd be able to sample the time and steal time
> *simultaneously*, with a single sched_clock_cpu_and_steal() function so
> that we don't have to deal with this slop between readings. Then none
> of this would be necessary. But that seems hard.

I agree that that would be ideal.

Thanks,
-- Suleiman
On Wed, 2024-09-25 at 09:24 -0400, Joel Fernandes wrote:
> > In
> > https://lore.kernel.org/all/20240522001817.619072-22-dwmw2@infradead.org/
> > I put a limit on the amount of steal time carried forward from one
> > timeslice to the next, as it was misbehaving when a bad hypervisor
> > reported negative steal time. But I don't think the limit should be
> > zero.
>
> Yea, this is the solution I was thinking but I don’t see any limits
> to steal time in the current code. That is what Suleiman is trying to
> fix.
>
> So why dont we cap the maximum steal time accrued any more (I.e. what
> happened to the code in your patch, was it deleted for another
> reason?).

My patch was never merged. It was part of an exploratory RFC series
working on various KVM clock problems, one of which *fixed* the problem
of steal time going backwards in the hypervisor, and that guest patch
was an attempt to work around that bug in existing KVM. I never did
split it out and send it on its own.

> My impression is your old patch is exactly what we need as Suleiman
> is seeing an issue with suspend causing very very long steal times in
> virtual Machines, but I will let him speak for himself.
On Wed, 2024-09-25 at 13:25 +0000, Suleiman Souhlal wrote: > On Wed, Sep 25, 2024 at 12:45:55PM +0100, David Woodhouse wrote: > > On Wed, 2024-09-11 at 20:15 +0900, Suleiman Souhlal wrote: > > > When steal time exceeds the measured delta when updating clock_task, > > > we > > > currently try to catch up the excess in future updates. > > > However, this results in inaccurate run times for the future things > > > using > > > clock_task, as they end up getting additional steal time that did not > > > actually happen. > > > > > > For example, suppose a task in a VM runs for 10ms and had 15ms of > > > steal > > > time reported while it ran. clock_task rightly doesn't advance. Then, > > > a > > > different taks runs on the same rq for 10ms without any time stolen > > > in > > > the host. > > > Because of the current catch up mechanism, clock_sched inaccurately > > > ends > > > up advancing by only 5ms instead of 10ms even though there wasn't any > > > actual time stolen. The second task is getting charged for less time > > > than it ran, even though it didn't deserve it. > > > This can result in tasks getting more run time than they should > > > actually > > > get. > > > > > > So, we instead don't make future updates pay back past excess stolen > > > time. > > > > > > Signed-off-by: Suleiman Souhlal <suleiman@google.com> > > > --- > > > v2: > > > - Slightly changed to simply moving one line up instead of adding > > > new variable. 
> > > > > > v1: > > > https://lore.kernel.org/lkml/20240806111157.1336532-1-suleiman@google.com > > > --- > > > kernel/sched/core.c | 2 +- > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > > > index f3951e4a55e5..6c34de8b3fbb 100644 > > > --- a/kernel/sched/core.c > > > +++ b/kernel/sched/core.c > > > @@ -730,11 +730,11 @@ static void update_rq_clock_task(struct rq *rq, > > > s64 delta) > > > if (static_key_false((¶virt_steal_rq_enabled))) { > > > steal = paravirt_steal_clock(cpu_of(rq)); > > > steal -= rq->prev_steal_time_rq; > > > + rq->prev_steal_time_rq += steal; > > > > The above two lines are essentially: > > > > steal -= prev; > > prev += steal; > > > > It's like one of those clever ways of exchanging two variables with > > three XOR operations. I don't like it :) > > > > Ultimately, you're just setting rq->prev_steal_time_rq to the latest > > value that you just read from paravirt_steal_clock(). And then setting > > 'steal' to the delta between the new reading and the previous one. > > > > The code above is *far* from obvious. At the very least it wants a > > comment, but I'd rather see it refactored so that it doesn't need one. > > > > u64 abs_steal_now = paravirt_steal_clock(cpu_of(rq)); > > steal = abs_steal_now - rq->prev_steal_time_rq; > > rq->prev_steal_time_rq = abs_steal_now; > > That is what v1 did: > https://lore.kernel.org/lkml/20240806111157.1336532-1-suleiman@google.com/ > > It is also more obvious to me, but the feedback I received was that > the way in the current iteration is better. > > I don't feel strongly about it, and I'd be ok with either version applied. Fair enough. Not really a hill anyone should choose to die on, I suppose. > > > > I'm still not utterly convinced this is the right thing to do, though. > > It means you will constantly undermeasure the accounting of steal time > > as you discard the excess each time. 
> > The underlying bug here is that we are sampling the steal time and the
> > time slice at *different* times. This update_rq_clock_task() function
> > could be called with a calculated 'delta' argument... and then
> > experience a large amount of steal time *before* calling
> > paravirt_steal_clock(), which is how we end up in the situation where
> > the calculated steal time exceeds the running time of the previous
> > task.
> >
> > Which task *should* that steal time be accounted to? At the moment it
> > ends up being accounted to the next task to run — which seems to make
> > sense to me. In the situation I just described, we can consider the
> > time stolen in update_rq_clock_task() itself to have been stolen from
> > the *incoming* task, not the *outgoing* one. But that seems to be what
> > you're objecting to?
>
> This is a good description of the problem, except that the time stolen
> in update_rq_clock_task() itself is actually being stolen from the
> outgoing task. This is because we are still trying to calculate how long
> it ran for (update_curr()), and time hasn't started ticking for the
> incoming task yet. We haven't set the incoming task's exec_start with the
> new clock_task time yet.
>
> So, in my opinion, it's wrong to give that time to the incoming task.

That makes sense. That steal time is actually stolen from *neither*
task, since it's after the 'end' timestamp of the outgoing task, and
before the 'start' timestamp of the incoming task.

So where *should* it be accounted? Or is it actually correct to drop it
completely?

If you can make a coherent case for the fact that dropping it is really
the right thing to do (not *just* that it doesn't belong to the
outgoing task, which is the case you make in your existing commit
message), then I suppose I'm OK with your patch as-is.
On Wed, Sep 25, 2024 at 03:26:56PM +0100, David Woodhouse wrote: > On Wed, 2024-09-25 at 13:25 +0000, Suleiman Souhlal wrote: > > > > > > I'm still not utterly convinced this is the right thing to do, though. > > > It means you will constantly undermeasure the accounting of steal time > > > as you discard the excess each time. > > > > > > The underlying bug here is that we are sampling the steal time and the > > > time slice at *different* times. This update_rq_clock_task() function > > > could be called with a calculated 'delta' argument... and then > > > experience a large amount of steal time *before* calling > > > paravirt_steal_clock(), which is how we end up in the situation where > > > the calculated steal time exceeds the running time of the previous > > > task. > > > > > > Which task *should* that steal time be accounted to? At the moment it > > > ends up being accounted to the next task to run — which seems to make > > > sense to me. In the situation I just described, we can consider the > > > time stolen in update_rq_clock_task() itself to have been stolen from > > > the *incoming* task, not the *outgoing* one. But that seems to be what > > > you're objecting to? > > > > This is a good description of the problem, except that the time stolen > > in update_rq_clock_task() itself is actually being stolen from the > > outgoing task. This is because we are still trying to calculate how long > > it ran for (update_curr()), and time hasn't started ticking for the > > incoming task yet. We haven't set the incoming task's exec_start with the > > new clock_task time yet. > > > > So, in my opinion, it's wrong to give that time to the incoming task. > > That makes sense. That steal time is actually stolen from *neither* > task, since it's after the 'end' timestamp of the outgoing task, and > before the 'start' timestamp of the incoming task. > > So where *should* it be accounted? > > Or is it actually correct to drop it completely? 
>
> If you can make a coherent case for the fact that dropping it is really
> the right thing to do (not *just* that it doesn't belong to the
> outgoing task, which is the case you make in your existing commit
> message), then I suppose I'm OK with your patch as-is.

Yes, that's a good way to put it: The excess steal time isn't actually
being stolen from anyone.

And since it's not being stolen from anyone, isn't the right thing to do
to drop it?

There might still be extra steal time that doesn't exceed the current
'delta' from the race between reading the two values, that would still
be erroneously accounted to the outgoing task, which this patch doesn't
address, but we know that any steal > delta is from this race and should
be dropped.

-- Suleiman
On Wed, 2024-09-25 at 15:15 +0000, Suleiman Souhlal wrote:
> Yes, that's a good way to put it: The excess steal time isn't actually
> being stolen from anyone.
> And since it's not being stolen from anyone, isn't the right thing to do
> to drop it?

It's being stolen from the system, isn't it? Just not any specific
userspace process?

If we have separate "end of outgoing task" and "start of incoming task"
timestamps, surely the time between those two must be accounted
*somewhere*?

> There might still be extra steal time that doesn't exceed the current
> 'delta' from the race between reading the two values, that would still
> be erroneously accounted to the outgoing task, which this patch doesn't
> address, but we know that any steal > delta is from this race and should
> be dropped.

Well that's what we want the atomic paired read for :)
On Wed, Sep 25, 2024 at 04:34:57PM +0100, David Woodhouse wrote:
> On Wed, 2024-09-25 at 15:15 +0000, Suleiman Souhlal wrote:
> > Yes, that's a good way to put it: The excess steal time isn't actually
> > being stolen from anyone.
> > And since it's not being stolen from anyone, isn't the right thing to do
> > to drop it?
>
> It's being stolen from the system, isn't it? Just not any specific
> userspace process?

I guess it depends what you mean by "stolen". I would argue that it's
not stolen from anyone since the time isn't actually counted anywhere.

> If we have separate "end of outgoing task" and "start of incoming task"
> timestamps, surely the time between those two must be accounted
> *somewhere*?

Not exactly. clock_task is essentially frozen until the next
update_rq_clock(), at which point we'll look at how much sched_clock_cpu
advanced and subtract how much steal time advanced. The two things are
done in separate spots (update_rq_clock() and update_rq_clock_task()),
independently (which is where the race is happening). As far as I can
tell, the time between the two isn't really accounted anywhere.

The "end of outgoing task" and "start of incoming task" timestamps
should end up being the same.

> > There might still be extra steal time that doesn't exceed the current
> > 'delta' from the race between reading the two values, that would still
> > be erroneously accounted to the outgoing task, which this patch doesn't
> > address, but we know that any steal > delta is from this race and should
> > be dropped.
>
> Well that's what we want the atomic paired read for :)

Right, but I don't think it's that simple. We aren't only reading memory
but also a clock. It might be possible to address this with a mechanism
like rseq, but that would be a much bigger patch set than the current
one (and I don't think anyone has ever attempted to do rseq for VMs
yet).
(There is also another potential issue I spotted with steal time, that
has to do with reading another VCPU's steal time while it's not running,
but I'll start a separate discussion about that with a different patch
set.)

Thanks for the discussion.

-- Suleiman
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f3951e4a55e5..6c34de8b3fbb 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -730,11 +730,11 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
 	if (static_key_false((&paravirt_steal_rq_enabled))) {
 		steal = paravirt_steal_clock(cpu_of(rq));
 		steal -= rq->prev_steal_time_rq;
+		rq->prev_steal_time_rq += steal;
 
 		if (unlikely(steal > delta))
 			steal = delta;
 
-		rq->prev_steal_time_rq += steal;
 		delta -= steal;
 	}
 #endif
When steal time exceeds the measured delta when updating clock_task, we
currently try to catch up the excess in future updates.
However, this results in inaccurate run times for future users of
clock_task, as they end up getting charged additional steal time that
did not actually happen.

For example, suppose a task in a VM runs for 10ms and had 15ms of steal
time reported while it ran. clock_task rightly doesn't advance. Then, a
different task runs on the same rq for 10ms without any time stolen in
the host.
Because of the current catch up mechanism, clock_task inaccurately ends
up advancing by only 5ms instead of 10ms even though there wasn't any
actual time stolen. The second task is getting charged for less time
than it ran, even though it doesn't deserve the discount.
This can result in tasks getting more run time than they should actually
get.

So, we instead don't make future updates pay back past excess stolen
time.

Signed-off-by: Suleiman Souhlal <suleiman@google.com>
---
v2:
- Slightly changed to simply moving one line up instead of adding
  new variable.

v1: https://lore.kernel.org/lkml/20240806111157.1336532-1-suleiman@google.com
---
 kernel/sched/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)