
[RFCv5,11/46] sched: Remove blocked load and utilization contributions of dying tasks

Message ID 1436293469-25707-12-git-send-email-morten.rasmussen@arm.com
State RFC

Commit Message

Morten Rasmussen July 7, 2015, 6:23 p.m. UTC
Tasks being dequeued for the last time (state == TASK_DEAD) are dequeued
with the DEQUEUE_SLEEP flag, which causes their load and utilization
contributions to be added to the runqueue blocked load and utilization.
Hence the blocked load and utilization will contain contributions from
tasks that are gone. The issue only exists for the root cfs_rq, as
cgroup_exit() doesn't set DEQUEUE_SLEEP for task group exits.

If runnable+blocked load is to be used as a better estimate for cpu
load, the dead task contributions need to be removed to prevent
load_balance() (idle_balance() in particular) from over-estimating the
cpu load.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c | 2 ++
 1 file changed, 2 insertions(+)
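
As the comments below point out, the "!DEQUEUE_SLEEP" in the hunk is a
typo for "~DEQUEUE_SLEEP". A sketch of the intended change in place (the
surrounding lines are from dequeue_entity()):

	update_curr(cfs_rq);
	/*
	 * A dying task's load/utilization contribution never comes back,
	 * so don't fold it into the blocked sums: dequeue the task as if
	 * it were not sleeping.
	 */
	if (entity_is_task(se) && task_of(se)->state == TASK_DEAD)
		flags &= ~DEQUEUE_SLEEP;	/* clear only the SLEEP bit */
	dequeue_entity_load_avg(cfs_rq, se, flags & DEQUEUE_SLEEP);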

Comments

Leo Yan July 22, 2015, 6:51 a.m. UTC | #1
On Tue, Jul 07, 2015 at 07:23:54PM +0100, Morten Rasmussen wrote:
> [...]
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 775b0c7..fa12ce5 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3217,6 +3217,8 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  	 * Update run-time statistics of the 'current'.
>  	 */
>  	update_curr(cfs_rq);
> +	if (entity_is_task(se) && task_of(se)->state == TASK_DEAD)
> +		flags &= !DEQUEUE_SLEEP;

So flags will be set to zero? Could be replaced by "flags &= ~DEQUEUE_SLEEP"?
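
For illustration, a minimal standalone program showing the difference
(the flag values are invented for the example, they are not the
kernel's):

#include <stdio.h>

#define DEQUEUE_SLEEP	0x2	/* illustrative value only */
#define OTHER_FLAG	0x4	/* hypothetical second flag */

int main(void)
{
	int flags = DEQUEUE_SLEEP | OTHER_FLAG;

	/* Logical NOT: !DEQUEUE_SLEEP evaluates to 0, clearing ALL flags. */
	printf("%#x\n", flags & !DEQUEUE_SLEEP);	/* prints 0 */

	/* Bitwise NOT: ~DEQUEUE_SLEEP clears only the SLEEP bit. */
	printf("%#x\n", flags & ~DEQUEUE_SLEEP);	/* prints 0x4 */

	return 0;
}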

>  	dequeue_entity_load_avg(cfs_rq, se, flags & DEQUEUE_SLEEP);
>  
>  	update_stats_dequeue(cfs_rq, se);
> -- 
> 1.9.1
> 
Morten Rasmussen July 22, 2015, 1:45 p.m. UTC | #2
On Wed, Jul 22, 2015 at 02:51:01PM +0800, Leo Yan wrote:
> On Tue, Jul 07, 2015 at 07:23:54PM +0100, Morten Rasmussen wrote:
> > [...]
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 775b0c7..fa12ce5 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -3217,6 +3217,8 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> >  	 * Update run-time statistics of the 'current'.
> >  	 */
> >  	update_curr(cfs_rq);
> > +	if (entity_is_task(se) && task_of(se)->state == TASK_DEAD)
> > +		flags &= !DEQUEUE_SLEEP;
> 
> So flags will be set to zero? Could be replaced by "flags &= ~DEQUEUE_SLEEP"?

Not could, should :)

I meant to clear the flag, but used the wrong operator. We only have
DEQUEUE_SLEEP and 0 at the moment, so it doesn't matter, but it might
later.

Thanks,
Morten
Peter Zijlstra Aug. 11, 2015, 11:39 a.m. UTC | #3
On Tue, Jul 07, 2015 at 07:23:54PM +0100, Morten Rasmussen wrote:
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 775b0c7..fa12ce5 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3217,6 +3217,8 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  	 * Update run-time statistics of the 'current'.
>  	 */
>  	update_curr(cfs_rq);
> +	if (entity_is_task(se) && task_of(se)->state == TASK_DEAD)
> +		flags &= !DEQUEUE_SLEEP;
>  	dequeue_entity_load_avg(cfs_rq, se, flags & DEQUEUE_SLEEP);
>  
>  	update_stats_dequeue(cfs_rq, se);

I know this is entirely redundant at this point (we took Yuyang's
patches), but this is the wrong way to go about doing this.

You add extra code to the hot dequeue path for something that 'never'
happens. We have the sched_class::task_dead call for that.
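
For reference, the task_dead route would look roughly like this (a
sketch close to what mainline later gained through Yuyang's rework, not
the exact code):

/* kernel/sched/fair.c */
static void task_dead_fair(struct task_struct *p)
{
	/*
	 * Record the dead task's contribution outside the dequeue path;
	 * a later update under rq->lock subtracts it from the cfs_rq
	 * sums.
	 */
	remove_entity_load_avg(&p->se);
}

const struct sched_class fair_sched_class = {
	...
	.task_dead	= task_dead_fair,
	...
};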
Morten Rasmussen Aug. 11, 2015, 2:58 p.m. UTC | #4
On Tue, Aug 11, 2015 at 01:39:27PM +0200, Peter Zijlstra wrote:
> On Tue, Jul 07, 2015 at 07:23:54PM +0100, Morten Rasmussen wrote:
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 775b0c7..fa12ce5 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -3217,6 +3217,8 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> >  	 * Update run-time statistics of the 'current'.
> >  	 */
> >  	update_curr(cfs_rq);
> > +	if (entity_is_task(se) && task_of(se)->state == TASK_DEAD)
> > +		flags &= !DEQUEUE_SLEEP;
> >  	dequeue_entity_load_avg(cfs_rq, se, flags & DEQUEUE_SLEEP);
> >  
> >  	update_stats_dequeue(cfs_rq, se);
> 
> I know this is entirely redundant at this point (we took Yuyang's
> patches), but this is the wrong way to go about doing this.

Yes, I'm still working my way through Yuyang's changes.

> > You add extra code to the hot dequeue path for something that 'never'
> happens. We have the sched_class::task_dead call for that.

I don't mind using sched_class::task_dead() instead. The reason I
didn't go that way is that there we would have to retake the rq->lock,
or mess with cfs_rq::removed_load, instead of simply not adding the
utilization in the first place while we already hold the rq->lock.

Anyway, it is probably redundant by now. I will check Yuyang's code to
see if he already fixed this problem.
Peter Zijlstra Aug. 11, 2015, 5:23 p.m. UTC | #5
On Tue, Aug 11, 2015 at 03:58:48PM +0100, Morten Rasmussen wrote:
> On Tue, Aug 11, 2015 at 01:39:27PM +0200, Peter Zijlstra wrote:

> > You add extra code to the hot dequeue path for something that 'never'
> > happens. We have the sched_class::task_dead call for that.
> 
> I don't mind using sched_class::task_dead() instead. The reason I
> didn't go that way is that there we would have to retake the rq->lock,
> or mess with cfs_rq::removed_load, instead of simply not adding the
> utilization in the first place while we already hold the rq->lock.
> 
> Anyway, it is probably redundant by now. I will check Yuyang's code to
> see if he already fixed this problem.

He did, he used the removed_load stuff, same as migration does.
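
Simplified sketch of that removed_load mechanism (the real code also
syncs the entity's averages to a recent update time first):

/*
 * Dying/migrating task side: rq->lock is not held, so park the
 * entity's contribution in atomic accumulators on the cfs_rq.
 */
void remove_entity_load_avg(struct sched_entity *se)
{
	struct cfs_rq *cfs_rq = cfs_rq_of(se);

	atomic_long_add(se->avg.load_avg, &cfs_rq->removed_load_avg);
	atomic_long_add(se->avg.util_avg, &cfs_rq->removed_util_avg);
}

/*
 * Consumer side, in update_cfs_rq_load_avg() under rq->lock: fold the
 * removed contributions out of the cfs_rq sums.
 */
if (atomic_long_read(&cfs_rq->removed_load_avg)) {
	long r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
	cfs_rq->avg.load_avg = max_t(long, cfs_rq->avg.load_avg - r, 0);
}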
Morten Rasmussen Aug. 12, 2015, 9:08 a.m. UTC | #6
On Tue, Aug 11, 2015 at 07:23:44PM +0200, Peter Zijlstra wrote:
> On Tue, Aug 11, 2015 at 03:58:48PM +0100, Morten Rasmussen wrote:
> > On Tue, Aug 11, 2015 at 01:39:27PM +0200, Peter Zijlstra wrote:
> 
> > > You add extra code to the hot dequeue path for something that 'never'
> > > happens. We have the sched_class::task_dead call for that.
> > 
> > I don't mind using sched_class::task_dead() instead. The reason I
> > didn't go that way is that there we would have to retake the rq->lock,
> > or mess with cfs_rq::removed_load, instead of simply not adding the
> > utilization in the first place while we already hold the rq->lock.
> > 
> > Anyway, it is probably redundant by now. I will check Yuyang's code to
> > see if he already fixed this problem.
> 
> He did, he used the removed_load stuff, same as migration does.

Nice. One less patch to worry about :)

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 775b0c7..fa12ce5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3217,6 +3217,8 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 * Update run-time statistics of the 'current'.
 	 */
 	update_curr(cfs_rq);
+	if (entity_is_task(se) && task_of(se)->state == TASK_DEAD)
+		flags &= !DEQUEUE_SLEEP;
 	dequeue_entity_load_avg(cfs_rq, se, flags & DEQUEUE_SLEEP);
 
 	update_stats_dequeue(cfs_rq, se);