Message ID | 20240806111157.1336532-1-suleiman@google.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | sched: Don't try to catch up excess steal time. | expand |
On Tue, Aug 6, 2024 at 7:13 AM Suleiman Souhlal <suleiman@google.com> wrote: > > When steal time exceeds the measured delta when updating clock_task, we > currently try to catch up the excess in future updates. > However, this results in inaccurate run times for the future clock_task > measurements, as they end up getting additional steal time that did not > actually happen, from the previous excess steal time being paid back. > > For example, suppose a task in a VM runs for 10ms and had 15ms of steal > time reported while it ran. clock_task rightly doesn't advance. Then, a > different task runs on the same rq for 10ms without any time stolen. > Because of the current catch up mechanism, clock_sched inaccurately ends > up advancing by only 5ms instead of 10ms even though there wasn't any > actual time stolen. The second task is getting charged for less time > than it ran, even though it didn't deserve it. > In other words, tasks can end up getting more run time than they should > actually get. > > So, we instead don't make future updates pay back past excess stolen time. > > Signed-off-by: Suleiman Souhlal <suleiman@google.com> > --- > kernel/sched/core.c | 6 ++++-- > 1 file changed, 4 insertions(+), 2 deletions(-) > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index bcf2c4cc0522..42b37da2bda6 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -728,13 +728,15 @@ static void update_rq_clock_task(struct rq *rq, s64 delta) > #endif > #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING > if (static_key_false((¶virt_steal_rq_enabled))) { > - steal = paravirt_steal_clock(cpu_of(rq)); > + u64 prev_steal; > + > + steal = prev_steal = paravirt_steal_clock(cpu_of(rq)); > steal -= rq->prev_steal_time_rq; > > if (unlikely(steal > delta)) > steal = delta; > > - rq->prev_steal_time_rq += steal; > + rq->prev_steal_time_rq = prev_steal; > delta -= steal; Makes sense, but wouldn't this patch also do the following: If vCPU task is the only one running and has a large steal time, then sched_tick() will only freeze the clock for a shorter period, and not give future credits to the vCPU task itself? Maybe it does not matter (and I probably don't understand the code enough) but thought I would mention. I am also not sure if the purpose of stealtime is to credit individual tasks, or rather all tasks on the runqueue because the "whole runqueue" had time stolen.. No where in this function is it dealing with individual tasks but rather the rq itself. Thoughts? - Joel > } > #endif > -- > 2.46.0.rc2.264.g509ed76dc8-goog >
On Tue, Aug 06, 2024 at 06:51:36PM -0400, Joel Fernandes wrote: > On Tue, Aug 6, 2024 at 7:13 AM Suleiman Souhlal <suleiman@google.com> wrote: > > > > When steal time exceeds the measured delta when updating clock_task, we > > currently try to catch up the excess in future updates. > > However, this results in inaccurate run times for the future clock_task > > measurements, as they end up getting additional steal time that did not > > actually happen, from the previous excess steal time being paid back. > > > > For example, suppose a task in a VM runs for 10ms and had 15ms of steal > > time reported while it ran. clock_task rightly doesn't advance. Then, a > > different task runs on the same rq for 10ms without any time stolen. > > Because of the current catch up mechanism, clock_sched inaccurately ends > > up advancing by only 5ms instead of 10ms even though there wasn't any > > actual time stolen. The second task is getting charged for less time > > than it ran, even though it didn't deserve it. > > In other words, tasks can end up getting more run time than they should > > actually get. > > > > So, we instead don't make future updates pay back past excess stolen time. > > > > Signed-off-by: Suleiman Souhlal <suleiman@google.com> > > --- > > kernel/sched/core.c | 6 ++++-- > > 1 file changed, 4 insertions(+), 2 deletions(-) > > > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > > index bcf2c4cc0522..42b37da2bda6 100644 > > --- a/kernel/sched/core.c > > +++ b/kernel/sched/core.c > > @@ -728,13 +728,15 @@ static void update_rq_clock_task(struct rq *rq, s64 delta) > > #endif > > #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING > > if (static_key_false((¶virt_steal_rq_enabled))) { > > - steal = paravirt_steal_clock(cpu_of(rq)); > > + u64 prev_steal; > > + > > + steal = prev_steal = paravirt_steal_clock(cpu_of(rq)); > > steal -= rq->prev_steal_time_rq; > > > > if (unlikely(steal > delta)) > > steal = delta; > > > > - rq->prev_steal_time_rq += steal; > > + rq->prev_steal_time_rq = prev_steal; > > delta -= steal; > > Makes sense, but wouldn't this patch also do the following: If vCPU > task is the only one running and has a large steal time, then > sched_tick() will only freeze the clock for a shorter period, and not > give future credits to the vCPU task itself? Maybe it does not matter > (and I probably don't understand the code enough) but thought I would > mention. The patch should still be doing the right thing in that situation: The clock will be frozen for the whole duration that it ran, and delta will be 0. The current excess amount is not relevant to the future, as far as I can tell. The pre-patch code is giving the rq extra time that it hadn't measured. I don't really see why it should be getting that extra time. > > I am also not sure if the purpose of stealtime is to credit individual > tasks, or rather all tasks on the runqueue because the "whole > runqueue" had time stolen.. No where in this function is it dealing > with individual tasks but rather the rq itself. This function is used to update clock_task, which *is* relevant to individual tasks. It is used to calculate how long tasks ran for (and for load averages). -- Suleiman
On Tue, Aug 6, 2024 at 8:13 PM Suleiman Souhlal <suleiman@google.com> wrote: > > When steal time exceeds the measured delta when updating clock_task, we > currently try to catch up the excess in future updates. > However, this results in inaccurate run times for the future clock_task > measurements, as they end up getting additional steal time that did not > actually happen, from the previous excess steal time being paid back. > > For example, suppose a task in a VM runs for 10ms and had 15ms of steal > time reported while it ran. clock_task rightly doesn't advance. Then, a > different task runs on the same rq for 10ms without any time stolen. > Because of the current catch up mechanism, clock_sched inaccurately ends > up advancing by only 5ms instead of 10ms even though there wasn't any > actual time stolen. The second task is getting charged for less time > than it ran, even though it didn't deserve it. > In other words, tasks can end up getting more run time than they should > actually get. > > So, we instead don't make future updates pay back past excess stolen time. > > Signed-off-by: Suleiman Souhlal <suleiman@google.com> Gentle ping. -- Suleiman
* Suleiman Souhlal <suleiman@google.com> [2024-08-06 20:11:57]: > When steal time exceeds the measured delta when updating clock_task, we > currently try to catch up the excess in future updates. > However, this results in inaccurate run times for the future clock_task > measurements, as they end up getting additional steal time that did not > actually happen, from the previous excess steal time being paid back. > > For example, suppose a task in a VM runs for 10ms and had 15ms of steal > time reported while it ran. clock_task rightly doesn't advance. Then, a > different task runs on the same rq for 10ms without any time stolen. > Because of the current catch up mechanism, clock_sched inaccurately ends > up advancing by only 5ms instead of 10ms even though there wasn't any > actual time stolen. The second task is getting charged for less time > than it ran, even though it didn't deserve it. > In other words, tasks can end up getting more run time than they should > actually get. > > So, we instead don't make future updates pay back past excess stolen time. > > Signed-off-by: Suleiman Souhlal <suleiman@google.com> > --- > kernel/sched/core.c | 6 ++++-- > 1 file changed, 4 insertions(+), 2 deletions(-) > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index bcf2c4cc0522..42b37da2bda6 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -728,13 +728,15 @@ static void update_rq_clock_task(struct rq *rq, s64 delta) > #endif > #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING > if (static_key_false((¶virt_steal_rq_enabled))) { > - steal = paravirt_steal_clock(cpu_of(rq)); > + u64 prev_steal; > + > + steal = prev_steal = paravirt_steal_clock(cpu_of(rq)); > steal -= rq->prev_steal_time_rq; > > if (unlikely(steal > delta)) > steal = delta; > > - rq->prev_steal_time_rq += steal; > + rq->prev_steal_time_rq = prev_steal; > delta -= steal; > } > #endif Agree with the change. Probably, we could have achieved by just moving a line above Something like this? #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING if (static_key_false((¶virt_steal_rq_enabled))) { steal = paravirt_steal_clock(cpu_of(rq)); steal -= rq->prev_steal_time_rq; rq->prev_steal_time_rq += steal; if (unlikely(steal > delta)) steal = delta; delta -= steal; } #endif
On Tue, 20 Aug 2024 15:15:55 +0530 Srikar Dronamraju <srikar@linux.ibm.com> wrote: > * Suleiman Souhlal <suleiman@google.com> [2024-08-06 20:11:57]: > > > When steal time exceeds the measured delta when updating clock_task, we > > currently try to catch up the excess in future updates. > > However, this results in inaccurate run times for the future clock_task > > measurements, as they end up getting additional steal time that did not > > actually happen, from the previous excess steal time being paid back. > > > > For example, suppose a task in a VM runs for 10ms and had 15ms of steal > > time reported while it ran. clock_task rightly doesn't advance. Then, a > > different task runs on the same rq for 10ms without any time stolen. > > Because of the current catch up mechanism, clock_sched inaccurately ends > > up advancing by only 5ms instead of 10ms even though there wasn't any > > actual time stolen. The second task is getting charged for less time > > than it ran, even though it didn't deserve it. > > In other words, tasks can end up getting more run time than they should > > actually get. > > > > So, we instead don't make future updates pay back past excess stolen time. In other words, If one task had more time stolen from it than it had run, the excess time is removed from the next task even though it ran for its entire slot? I'm curious, how does a task get queued on the run queue if 100% of it's time was stolen? That is, how did it get queued if the vCPU wasn't running? > > > > Signed-off-by: Suleiman Souhlal <suleiman@google.com> > > --- > > kernel/sched/core.c | 6 ++++-- > > 1 file changed, 4 insertions(+), 2 deletions(-) > > > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > > index bcf2c4cc0522..42b37da2bda6 100644 > > --- a/kernel/sched/core.c > > +++ b/kernel/sched/core.c > > @@ -728,13 +728,15 @@ static void update_rq_clock_task(struct rq *rq, s64 delta) > > #endif > > #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING > > if (static_key_false((¶virt_steal_rq_enabled))) { > > - steal = paravirt_steal_clock(cpu_of(rq)); > > + u64 prev_steal; > > + > > + steal = prev_steal = paravirt_steal_clock(cpu_of(rq)); > > steal -= rq->prev_steal_time_rq; > > > > if (unlikely(steal > delta)) > > steal = delta; > > > > - rq->prev_steal_time_rq += steal; > > + rq->prev_steal_time_rq = prev_steal; > > delta -= steal; > > } > > #endif > > > Agree with the change. > > Probably, we could have achieved by just moving a line above > Something like this? > > #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING > if (static_key_false((¶virt_steal_rq_enabled))) { > steal = paravirt_steal_clock(cpu_of(rq)); > steal -= rq->prev_steal_time_rq; > rq->prev_steal_time_rq += steal; > > if (unlikely(steal > delta)) > steal = delta; > > delta -= steal; > } > #endif Yeah, that is probably a nicer way of doing the same thing. Suleiman, care to send a v2? -- Steve
On Tue, Aug 20, 2024 at 10:50 AM Steven Rostedt <rostedt@goodmis.org> wrote: > > On Tue, 20 Aug 2024 15:15:55 +0530 > Srikar Dronamraju <srikar@linux.ibm.com> wrote: > > > * Suleiman Souhlal <suleiman@google.com> [2024-08-06 20:11:57]: > > > > > When steal time exceeds the measured delta when updating clock_task, we > > > currently try to catch up the excess in future updates. > > > However, this results in inaccurate run times for the future clock_task > > > measurements, as they end up getting additional steal time that did not > > > actually happen, from the previous excess steal time being paid back. > > > > > > For example, suppose a task in a VM runs for 10ms and had 15ms of steal > > > time reported while it ran. clock_task rightly doesn't advance. Then, a > > > different task runs on the same rq for 10ms without any time stolen. > > > Because of the current catch up mechanism, clock_sched inaccurately ends > > > up advancing by only 5ms instead of 10ms even though there wasn't any > > > actual time stolen. The second task is getting charged for less time > > > than it ran, even though it didn't deserve it. > > > In other words, tasks can end up getting more run time than they should > > > actually get. > > > > > > So, we instead don't make future updates pay back past excess stolen time. > > In other words, If one task had more time stolen from it than it had run, > the excess time is removed from the next task even though it ran for its > entire slot? > > I'm curious, how does a task get queued on the run queue if 100% of it's > time was stolen? That is, how did it get queued if the vCPU wasn't running? The scenario seems plausible to me, like say several tasks were already queued on the RQ (overloaded RQ) before any time stealing happened. Then vCPU is preempted thus stealing time. When vCPU returns to execution, the running task runs without the clock_task advancing and then goes to sleep thus freeing the RQ for other tasks (that were previously queued). Or did I miss something? thanks, - Joel
On Tue, 2024-08-06 at 20:11 +0900, Suleiman Souhlal wrote: > When steal time exceeds the measured delta when updating clock_task, we > currently try to catch up the excess in future updates. > However, this results in inaccurate run times for the future clock_task > measurements, as they end up getting additional steal time that did not > actually happen, from the previous excess steal time being paid back. > > For example, suppose a task in a VM runs for 10ms and had 15ms of steal > time reported while it ran. clock_task rightly doesn't advance. Then, a > different task runs on the same rq for 10ms without any time stolen. > Because of the current catch up mechanism, clock_sched inaccurately ends > up advancing by only 5ms instead of 10ms even though there wasn't any > actual time stolen. The second task is getting charged for less time > than it ran, even though it didn't deserve it. > In other words, tasks can end up getting more run time than they should > actually get. > > So, we instead don't make future updates pay back past excess stolen time. My understanding was that it was done this way for a reason: there is a lot of jitter between the "run time" (your 10ms example), and the steal time (15ms). What if 5ms really *did* elapse between the time that 'delta' is calculated, and the call to paravirt_steal_clock()? By accounting that steal time "in advance" we ensure it isn't lost in the case where the same process remains running for the next timeslice. However, that does cause problems when the steal time goes negative (due to hypervisor bugs). So in https://lore.kernel.org/all/20240522001817.619072-22-dwmw2@infradead.org/ I limited the amount of time which would be accounted to a future tick. > Signed-off-by: Suleiman Souhlal <suleiman@google.com> > --- > kernel/sched/core.c | 6 ++++-- > 1 file changed, 4 insertions(+), 2 deletions(-) > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index bcf2c4cc0522..42b37da2bda6 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -728,13 +728,15 @@ static void update_rq_clock_task(struct rq *rq, s64 delta) > #endif > #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING > if (static_key_false((¶virt_steal_rq_enabled))) { > - steal = paravirt_steal_clock(cpu_of(rq)); > + u64 prev_steal; > + > + steal = prev_steal = paravirt_steal_clock(cpu_of(rq)); > steal -= rq->prev_steal_time_rq; > > if (unlikely(steal > delta)) > steal = delta; > > - rq->prev_steal_time_rq += steal; > + rq->prev_steal_time_rq = prev_steal; > delta -= steal; > } > #endif
On Wed, Aug 21, 2024 at 12:51 AM David Woodhouse <dwmw2@infradead.org> wrote: > > On Tue, 2024-08-06 at 20:11 +0900, Suleiman Souhlal wrote: > > When steal time exceeds the measured delta when updating clock_task, we > > currently try to catch up the excess in future updates. > > However, this results in inaccurate run times for the future clock_task > > measurements, as they end up getting additional steal time that did not > > actually happen, from the previous excess steal time being paid back. > > > > For example, suppose a task in a VM runs for 10ms and had 15ms of steal > > time reported while it ran. clock_task rightly doesn't advance. Then, a > > different task runs on the same rq for 10ms without any time stolen. > > Because of the current catch up mechanism, clock_sched inaccurately ends > > up advancing by only 5ms instead of 10ms even though there wasn't any > > actual time stolen. The second task is getting charged for less time > > than it ran, even though it didn't deserve it. > > In other words, tasks can end up getting more run time than they should > > actually get. > > > > So, we instead don't make future updates pay back past excess stolen time. > > My understanding was that it was done this way for a reason: there is a > lot of jitter between the "run time" (your 10ms example), and the steal > time (15ms). What if 5ms really *did* elapse between the time that > 'delta' is calculated, and the call to paravirt_steal_clock()? > > By accounting that steal time "in advance" we ensure it isn't lost in > the case where the same process remains running for the next timeslice. This is an interesting observation, but I'd argue that even in the scenario where the same task stays running, the extra 5ms should not be accounted for in steal time: From what I can tell, update_curr() is only updating the task's runtime by the "delta" that was calculated, so the extra "stolen" time that happened between the delta measurement and the steal time application does not seem to be relevant. From what I can tell, that stolen time happened at a point where the task is not being counted as running in its run time, so it should not be accounted. I am however struggling to express this in a way that is easily understandable by others. > However, that does cause problems when the steal time goes negative > (due to hypervisor bugs). So in > https://lore.kernel.org/all/20240522001817.619072-22-dwmw2@infradead.org/ > I limited the amount of time which would be accounted to a future tick. That would definitely be an improvement over the old behavior. I noticed the issue because I've been trying to make sense of the numbers while trying to include the time that host was suspended in steal time in https://lore.kernel.org/kvm/20240820043543.837914-1-suleiman@google.com/ (But I think the issue can happen even without these changes.) Thanks, -- Suleiman
diff --git a/kernel/sched/core.c b/kernel/sched/core.c index bcf2c4cc0522..42b37da2bda6 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -728,13 +728,15 @@ static void update_rq_clock_task(struct rq *rq, s64 delta) #endif #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING if (static_key_false((¶virt_steal_rq_enabled))) { - steal = paravirt_steal_clock(cpu_of(rq)); + u64 prev_steal; + + steal = prev_steal = paravirt_steal_clock(cpu_of(rq)); steal -= rq->prev_steal_time_rq; if (unlikely(steal > delta)) steal = delta; - rq->prev_steal_time_rq += steal; + rq->prev_steal_time_rq = prev_steal; delta -= steal; } #endif
When steal time exceeds the measured delta when updating clock_task, we currently try to catch up the excess in future updates. However, this results in inaccurate run times for the future clock_task measurements, as they end up getting additional steal time that did not actually happen, from the previous excess steal time being paid back. For example, suppose a task in a VM runs for 10ms and had 15ms of steal time reported while it ran. clock_task rightly doesn't advance. Then, a different task runs on the same rq for 10ms without any time stolen. Because of the current catch up mechanism, clock_sched inaccurately ends up advancing by only 5ms instead of 10ms even though there wasn't any actual time stolen. The second task is getting charged for less time than it ran, even though it didn't deserve it. In other words, tasks can end up getting more run time than they should actually get. So, we instead don't make future updates pay back past excess stolen time. Signed-off-by: Suleiman Souhlal <suleiman@google.com> --- kernel/sched/core.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-)