sched: idle: Avoid retaining the tick when it has been stopped

Message ID 2161372.IsD4PDzmmY@aspire.rjw.lan (mailing list archive)
State Mainlined
Delegated to: Rafael Wysocki

Commit Message

Rafael J. Wysocki Aug. 9, 2018, 5:08 p.m. UTC
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

If the tick has been stopped already, but the governor has not asked to
stop it (which it can do sometimes), the idle loop should invoke
tick_nohz_idle_stop_tick(), to let tick_nohz_stop_tick() take care
of this case properly.

Fixes: 554c8aa8ecad (sched: idle: Select idle state before stopping the tick)
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 kernel/sched/idle.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
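
For reference, the effect of the one-line change can be sketched as a small user-space model. This is only an illustration under stated assumptions: the enum and the two helper functions are hypothetical names, not kernel code. Only the condition `stop_tick || tick_nohz_tick_stopped()` is taken from the patch itself.

```c
#include <stdbool.h>

/* Hypothetical labels for the two branches in cpuidle_idle_call(). */
enum tick_path {
	TICK_PATH_STOP,		/* tick_nohz_idle_stop_tick() would be called */
	TICK_PATH_RETAIN,	/* tick_nohz_idle_retain_tick() would be called */
};

/* Before the patch: only the governor's request is taken into account. */
static enum tick_path tick_path_before(bool stop_tick, bool tick_stopped)
{
	(void)tick_stopped;	/* ignored by the old condition */
	return stop_tick ? TICK_PATH_STOP : TICK_PATH_RETAIN;
}

/* After the patch: a tick that is already stopped also takes the stop path. */
static enum tick_path tick_path_after(bool stop_tick, bool tick_stopped)
{
	return (stop_tick || tick_stopped) ? TICK_PATH_STOP : TICK_PATH_RETAIN;
}
```

The two helpers disagree only when stop_tick is false and the tick has been stopped already, which is the case described in the changelog.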

Comments

Leo Yan Aug. 10, 2018, 6:19 a.m. UTC | #1
On Thu, Aug 09, 2018 at 07:08:34PM +0200, Rafael J . Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> If the tick has been stopped already, but the governor has not asked to
> stop it (which it can do sometimes), the idle loop should invoke
> tick_nohz_idle_stop_tick(), to let tick_nohz_stop_tick() take care
> of this case properly.

IMHO, I don't think this patch is going the right way; from the idle loop
side, it needs to provide sane fundamental support, for example, it
can stop or restart the tick at the idle governor's request.  On the
other hand, the idle governors can decide their own policy for how to
use the tick in the idle loop.  This patch seems to mix the two things,
and in the end it may couple the implementation of the idle loop with
the 'menu' governor with respect to sched tick usage.

I still think my patch to restart the tick is valid :)

Thanks,
Leo Yan

> Fixes: 554c8aa8ecad (sched: idle: Select idle state before stopping the tick)
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
>  kernel/sched/idle.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> Index: linux-pm/kernel/sched/idle.c
> ===================================================================
> --- linux-pm.orig/kernel/sched/idle.c
> +++ linux-pm/kernel/sched/idle.c
> @@ -190,7 +190,7 @@ static void cpuidle_idle_call(void)
>  		 */
>  		next_state = cpuidle_select(drv, dev, &stop_tick);
>  
> -		if (stop_tick)
> +		if (stop_tick || tick_nohz_tick_stopped())
>  			tick_nohz_idle_stop_tick();
>  		else
>  			tick_nohz_idle_retain_tick();
>
Rafael J. Wysocki Aug. 10, 2018, 7:15 a.m. UTC | #2
On Fri, Aug 10, 2018 at 8:19 AM,  <leo.yan@linaro.org> wrote:
> On Thu, Aug 09, 2018 at 07:08:34PM +0200, Rafael J . Wysocki wrote:
>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>
>> If the tick has been stopped already, but the governor has not asked to
>> stop it (which it can do sometimes), the idle loop should invoke
>> tick_nohz_idle_stop_tick(), to let tick_nohz_stop_tick() take care
>> of this case properly.
>
> IMHO, I don't think this patch is going the right way;

So we disagree here, quite obviously.

> from the idle loop side, it needs to provide sane fundamental support,
> for example, it can stop or restart the tick at the idle governor's request.

No, if the tick is stopped, restarting it is pointless until we exit
the loop in do_idle().

> On the other hand, the idle governors can decide their own policy for how to
> use the tick in the idle loop.  This patch seems to mix the two things,
> and in the end it may couple the implementation of the idle loop with
> the 'menu' governor with respect to sched tick usage.

I'm not following this, sorry.

> I still think my patch to restart the tick is valid :)

It changes the behavior significantly, though, and it is not clear if
the new behavior is desirable.

The patch here simply fixes a problem while leaving the overall behavior as is.

>> Fixes: 554c8aa8ecad (sched: idle: Select idle state before stopping the tick)
>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>> ---
>>  kernel/sched/idle.c |    2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> Index: linux-pm/kernel/sched/idle.c
>> ===================================================================
>> --- linux-pm.orig/kernel/sched/idle.c
>> +++ linux-pm/kernel/sched/idle.c
>> @@ -190,7 +190,7 @@ static void cpuidle_idle_call(void)
>>                */
>>               next_state = cpuidle_select(drv, dev, &stop_tick);
>>
>> -             if (stop_tick)
>> +             if (stop_tick || tick_nohz_tick_stopped())
>>                       tick_nohz_idle_stop_tick();
>>               else
>>                       tick_nohz_idle_retain_tick();
>>
Frederic Weisbecker Aug. 16, 2018, 1:27 p.m. UTC | #3
On Thu, Aug 09, 2018 at 07:08:34PM +0200, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> If the tick has been stopped already, but the governor has not asked to
> stop it (which it can do sometimes), the idle loop should invoke
> tick_nohz_idle_stop_tick(), to let tick_nohz_stop_tick() take care
> of this case properly.
> 
> Fixes: 554c8aa8ecad (sched: idle: Select idle state before stopping the tick)
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
>  kernel/sched/idle.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> Index: linux-pm/kernel/sched/idle.c
> ===================================================================
> --- linux-pm.orig/kernel/sched/idle.c
> +++ linux-pm/kernel/sched/idle.c
> @@ -190,7 +190,7 @@ static void cpuidle_idle_call(void)
>  		 */
>  		next_state = cpuidle_select(drv, dev, &stop_tick);
>  
> -		if (stop_tick)
> +		if (stop_tick || tick_nohz_tick_stopped())
>  			tick_nohz_idle_stop_tick();
>  		else
>  			tick_nohz_idle_retain_tick();

So what if tick_nohz_idle_stop_tick() sees no timer to schedule and
cancels it, we may remain idle in a shallow state for a long while?

Otherwise we can have something like this:

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index da9455a..408c985 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -806,6 +806,9 @@ static void tick_nohz_stop_tick(struct tick_sched *ts, int cpu)
 static void tick_nohz_retain_tick(struct tick_sched *ts)
 {
 	ts->timer_expires_base = 0;
+
+	if (ts->tick_stopped)
+		tick_nohz_restart(ts, ktime_get());
 }
 
 #ifdef CONFIG_NO_HZ_FULL
Rafael J. Wysocki Aug. 17, 2018, 9:32 a.m. UTC | #4
On Thursday, August 16, 2018 3:27:24 PM CEST Frederic Weisbecker wrote:
> On Thu, Aug 09, 2018 at 07:08:34PM +0200, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > 
> > If the tick has been stopped already, but the governor has not asked to
> > stop it (which it can do sometimes), the idle loop should invoke
> > tick_nohz_idle_stop_tick(), to let tick_nohz_stop_tick() take care
> > of this case properly.
> > 
> > Fixes: 554c8aa8ecad (sched: idle: Select idle state before stopping the tick)
> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > ---
> >  kernel/sched/idle.c |    2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > Index: linux-pm/kernel/sched/idle.c
> > ===================================================================
> > --- linux-pm.orig/kernel/sched/idle.c
> > +++ linux-pm/kernel/sched/idle.c
> > @@ -190,7 +190,7 @@ static void cpuidle_idle_call(void)
> >  		 */
> >  		next_state = cpuidle_select(drv, dev, &stop_tick);
> >  
> > -		if (stop_tick)
> > +		if (stop_tick || tick_nohz_tick_stopped())
> >  			tick_nohz_idle_stop_tick();
> >  		else
> >  			tick_nohz_idle_retain_tick();
> 
> So what if tick_nohz_idle_stop_tick() sees no timer to schedule and
> cancels it, we may remain idle in a shallow state for a long while?

Yes, but the governor is expected to avoid using shallow states when the
tick is stopped already.

> Otherwise we can have something like this:
> 
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index da9455a..408c985 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -806,6 +806,9 @@ static void tick_nohz_stop_tick(struct tick_sched *ts, int cpu)
>  static void tick_nohz_retain_tick(struct tick_sched *ts)
>  {
>  	ts->timer_expires_base = 0;
> +
> +	if (ts->tick_stopped)
> +		tick_nohz_restart(ts, ktime_get());
>  }
>  
>  #ifdef CONFIG_NO_HZ_FULL
> 

We could do that, but my concern with that approach is that we may end up
stopping and starting the tick back and forth without exiting the loop
in do_idle() just because somebody uses a periodic timer behind our
back and the governor gets confused.

Besides, that would be a change in behavior, while the $subject patch
simply fixes a mistake in the original design.

Cheers,
Rafael
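
The back-and-forth concern above can be illustrated with a toy user-space model. This is a sketch under stated assumptions, not kernel code: struct tick_model and all function names are hypothetical, and the model only counts state transitions; it ignores timer reprogramming and everything else tick_nohz_stop_tick() really does.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model of the tick state within one do_idle() loop (not kernel code). */
struct tick_model {
	bool stopped;
	unsigned int stops;	/* running -> stopped transitions */
	unsigned int restarts;	/* stopped -> running transitions */
};

/* $subject patch: once the tick is stopped, keep taking the stop path. */
static void step_keep_stopped(struct tick_model *t, bool stop_tick)
{
	if (stop_tick || t->stopped) {
		if (!t->stopped) {
			t->stopped = true;
			t->stops++;
		}
	}
	/* else: the tick is running and is simply retained */
}

/* Alternative: restart a stopped tick whenever the governor wants it kept. */
static void step_restart_on_retain(struct tick_model *t, bool stop_tick)
{
	if (stop_tick) {
		if (!t->stopped) {
			t->stopped = true;
			t->stops++;
		}
	} else if (t->stopped) {
		t->stopped = false;
		t->restarts++;
	}
}

typedef void (*tick_step_fn)(struct tick_model *, bool);

/* Feed a sequence of governor decisions through one of the two policies. */
static struct tick_model run_idle_loop(tick_step_fn step,
				       const bool *stop_tick, size_t n)
{
	struct tick_model t = { false, 0, 0 };
	size_t i;

	for (i = 0; i < n; i++)
		step(&t, stop_tick[i]);
	return t;
}
```

With a governor that alternates between asking to stop and asking to retain the tick, the first policy stops the tick once, while the second one stops and restarts it on every flip without ever leaving the loop.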
Rafael J. Wysocki Aug. 17, 2018, 10:05 a.m. UTC | #5
On Fri, Aug 17, 2018 at 11:34 AM Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
>
> On Thursday, August 16, 2018 3:27:24 PM CEST Frederic Weisbecker wrote:
> > On Thu, Aug 09, 2018 at 07:08:34PM +0200, Rafael J. Wysocki wrote:
> > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > >
> > > If the tick has been stopped already, but the governor has not asked to
> > > stop it (which it can do sometimes), the idle loop should invoke
> > > tick_nohz_idle_stop_tick(), to let tick_nohz_stop_tick() take care
> > > of this case properly.
> > >
> > > Fixes: 554c8aa8ecad (sched: idle: Select idle state before stopping the tick)
> > > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > ---
> > >  kernel/sched/idle.c |    2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > Index: linux-pm/kernel/sched/idle.c
> > > ===================================================================
> > > --- linux-pm.orig/kernel/sched/idle.c
> > > +++ linux-pm/kernel/sched/idle.c
> > > @@ -190,7 +190,7 @@ static void cpuidle_idle_call(void)
> > >              */
> > >             next_state = cpuidle_select(drv, dev, &stop_tick);
> > >
> > > -           if (stop_tick)
> > > +           if (stop_tick || tick_nohz_tick_stopped())
> > >                     tick_nohz_idle_stop_tick();
> > >             else
> > >                     tick_nohz_idle_retain_tick();
> >
> > So what if tick_nohz_idle_stop_tick() sees no timer to schedule and
> > cancels it, we may remain idle in a shallow state for a long while?
>
> Yes, but the governor is expected to avoid using shallow states when the
> tick is stopped already.
>
> > Otherwise we can have something like this:
> >
> > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> > index da9455a..408c985 100644
> > --- a/kernel/time/tick-sched.c
> > +++ b/kernel/time/tick-sched.c
> > @@ -806,6 +806,9 @@ static void tick_nohz_stop_tick(struct tick_sched *ts, int cpu)
> >  static void tick_nohz_retain_tick(struct tick_sched *ts)
> >  {
> >       ts->timer_expires_base = 0;
> > +
> > +     if (ts->tick_stopped)
> > +             tick_nohz_restart(ts, ktime_get());
> >  }
> >
> >  #ifdef CONFIG_NO_HZ_FULL
> >
>
> We could do that, but my concern with that approach is that we may end up
> stopping and starting the tick back and forth without exiting the loop
> in do_idle() just because somebody uses a periodic timer behind our
> back and the governor gets confused.
>
> Besides, that would be a change in behavior, while the $subject patch
> simply fixes a mistake in the original design.

Anyway, I'm sort of divided here.

We need to do something, this way or another, because the current code
is not strictly correct.

If there are no concerns about the possible extra overhead related to
restarting the tick, I'd just add a tick_nohz_idle_restart_tick() to
the tick_nohz_idle_retain_tick() branch in cpuidle_idle_call() (it
would do what's needed in there without affecting any other places).

Then, of course, governors would not need to worry about leaving the
tick stopped, so menu could be simplified somewhat, which may be a
good thing after all.

Cheers,
Rafael
Frederic Weisbecker Aug. 17, 2018, 2:12 p.m. UTC | #6
On Fri, Aug 17, 2018 at 11:32:07AM +0200, Rafael J. Wysocki wrote:
> On Thursday, August 16, 2018 3:27:24 PM CEST Frederic Weisbecker wrote:
> > On Thu, Aug 09, 2018 at 07:08:34PM +0200, Rafael J. Wysocki wrote:
> > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > 
> > > If the tick has been stopped already, but the governor has not asked to
> > > stop it (which it can do sometimes), the idle loop should invoke
> > > tick_nohz_idle_stop_tick(), to let tick_nohz_stop_tick() take care
> > > of this case properly.
> > > 
> > > Fixes: 554c8aa8ecad (sched: idle: Select idle state before stopping the tick)
> > > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > ---
> > >  kernel/sched/idle.c |    2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > Index: linux-pm/kernel/sched/idle.c
> > > ===================================================================
> > > --- linux-pm.orig/kernel/sched/idle.c
> > > +++ linux-pm/kernel/sched/idle.c
> > > @@ -190,7 +190,7 @@ static void cpuidle_idle_call(void)
> > >  		 */
> > >  		next_state = cpuidle_select(drv, dev, &stop_tick);
> > >  
> > > -		if (stop_tick)
> > > +		if (stop_tick || tick_nohz_tick_stopped())
> > >  			tick_nohz_idle_stop_tick();
> > >  		else
> > >  			tick_nohz_idle_retain_tick();
> > 
> > So what if tick_nohz_idle_stop_tick() sees no timer to schedule and
> > cancels it, we may remain idle in a shallow state for a long while?
> 
> Yes, but the governor is expected to avoid using shallow states when the
> tick is stopped already.

So what kind of sleep do we enter when an idle tick fires and we go
back to idle? Is it always deep?

I believe that ts->tick_stopped == 1 shouldn't be too relevant for the governor.
We can definitely have scenarios where the idle tick is stopped for a long while,
then it fires and schedules the next timer at NOW() + TICK_NSEC (as if the tick
had been restarted). This can even repeat that way for some time, because
ts->tick_stopped == 1 only implies that the tick has been stopped once since
we entered the idle loop. After that we may well have a periodic tick behaviour.
In that case we probably don't want deep idle state. Especially if we have:

              idle_loop() {
                  tick_stop (scheduled several seconds forward)
                  deep_idle_sleep()
                  //several seconds later
                  tick()
                  tick_stop (scheduled TICK_NSEC forward)
                  deep_idle_sleep()
                  tick() {
                      set_need_resched()
                  }
                  exit idle loop
              }

Here the last deep idle state isn't necessary.
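
To put rough numbers on the scenario above, here is a minimal user-space sketch. Everything in it is an assumption made for illustration: the helper name is hypothetical, and the sleep lengths and target residency are example values, not measurements.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Count the sleeps that ended before the deep state's target residency
 * was reached, i.e. the sleeps for which a deep state did not pay off.
 */
static size_t count_short_deep_sleeps(const uint64_t *sleep_ns, size_t n,
				      uint64_t target_residency_ns)
{
	size_t i, count = 0;

	for (i = 0; i < n; i++)
		if (sleep_ns[i] < target_residency_ns)
			count++;
	return count;
}
```

For example, with a first sleep of 3 s, a second sleep of one tick period (4 ms, assuming HZ=250) and an assumed deep-state target residency of 10 ms, the helper reports one sleep that was too short, which corresponds to the last deep_idle_sleep() in the trace.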

> 
> > Otherwise we can have something like this:
> > 
> > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> > index da9455a..408c985 100644
> > --- a/kernel/time/tick-sched.c
> > +++ b/kernel/time/tick-sched.c
> > @@ -806,6 +806,9 @@ static void tick_nohz_stop_tick(struct tick_sched *ts, int cpu)
> >  static void tick_nohz_retain_tick(struct tick_sched *ts)
> >  {
> >  	ts->timer_expires_base = 0;
> > +
> > +	if (ts->tick_stopped)
> > +		tick_nohz_restart(ts, ktime_get());
> >  }
> >  
> >  #ifdef CONFIG_NO_HZ_FULL
> > 
> 
> We could do that, but my concern with that approach is that we may end up
> stopping and starting the tick back and forth without exiting the loop
> in do_idle() just because somebody uses a periodic timer behind our
> back and the governor gets confused.
> 
> Besides, that would be a change in behavior, while the $subject patch
> simply fixes a mistake in the original design.

Ok, let's take the safe approach for now as this is a fix and it should even be
routed to stable.

But then in the longer term, perhaps cpuidle_select() should think that
through.

Thanks.
Rafael J. Wysocki Aug. 18, 2018, 9:57 p.m. UTC | #7
On Fri, Aug 17, 2018 at 4:12 PM Frederic Weisbecker <frederic@kernel.org> wrote:
>
> On Fri, Aug 17, 2018 at 11:32:07AM +0200, Rafael J. Wysocki wrote:
> > On Thursday, August 16, 2018 3:27:24 PM CEST Frederic Weisbecker wrote:
> > > On Thu, Aug 09, 2018 at 07:08:34PM +0200, Rafael J. Wysocki wrote:
> > > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > >
> > > > If the tick has been stopped already, but the governor has not asked to
> > > > stop it (which it can do sometimes), the idle loop should invoke
> > > > tick_nohz_idle_stop_tick(), to let tick_nohz_stop_tick() take care
> > > > of this case properly.
> > > >
> > > > Fixes: 554c8aa8ecad (sched: idle: Select idle state before stopping the tick)
> > > > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > > ---
> > > >  kernel/sched/idle.c |    2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > Index: linux-pm/kernel/sched/idle.c
> > > > ===================================================================
> > > > --- linux-pm.orig/kernel/sched/idle.c
> > > > +++ linux-pm/kernel/sched/idle.c
> > > > @@ -190,7 +190,7 @@ static void cpuidle_idle_call(void)
> > > >            */
> > > >           next_state = cpuidle_select(drv, dev, &stop_tick);
> > > >
> > > > -         if (stop_tick)
> > > > +         if (stop_tick || tick_nohz_tick_stopped())
> > > >                   tick_nohz_idle_stop_tick();
> > > >           else
> > > >                   tick_nohz_idle_retain_tick();
> > >
> > > So what if tick_nohz_idle_stop_tick() sees no timer to schedule and
> > > cancels it, we may remain idle in a shallow state for a long while?
> >
> > Yes, but the governor is expected to avoid using shallow states when the
> > tick is stopped already.
>
> > So what kind of sleep do we enter when an idle tick fires and we go
> back to idle? Is it always deep?

No, it isn't.

The state to select must always fit the time till the closest timer
event and that may be shorter than the tick period.

If there's a non-tick timer to wake the CPU up, we don't need to worry
about restarting the tick, though. :-)

> I believe that ts->tick_stopped == 1 shouldn't be too relevant for the governor.
> We can definitely have scenarios where the idle tick is stopped for a long while,
> then it fires and schedules the next timer at NOW() + TICK_NSEC (as if the tick
> had been restarted). This can even repeat that way for some time, because
> ts->tick_stopped == 1 only implies that the tick has been stopped once since
> we entered the idle loop. After that we may well have a periodic tick behaviour.
> In that case we probably don't want deep idle state. Especially if we have:
>
>               idle_loop() {
>                   tick_stop (scheduled several seconds forward)
>                   deep_idle_sleep()
>                   //several seconds later
>                   tick()
>                   tick_stop (scheduled TICK_NSEC forward)
>                   deep_idle_sleep()
>                   tick() {
>                       set_need_resched()
>                   }
>                   exit idle loop
>               }
>
> Here the last deep idle state isn't necessary.

No, it isn't.

However, that is not relevant for the question of whether or not to
restart the tick before entering the idle state IMO (see the
considerations below).

> >
> > > Otherwise we can have something like this:
> > >
> > > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> > > index da9455a..408c985 100644
> > > --- a/kernel/time/tick-sched.c
> > > +++ b/kernel/time/tick-sched.c
> > > @@ -806,6 +806,9 @@ static void tick_nohz_stop_tick(struct tick_sched *ts, int cpu)
> > >  static void tick_nohz_retain_tick(struct tick_sched *ts)
> > >  {
> > >     ts->timer_expires_base = 0;
> > > +
> > > +   if (ts->tick_stopped)
> > > +           tick_nohz_restart(ts, ktime_get());
> > >  }
> > >
> > >  #ifdef CONFIG_NO_HZ_FULL
> > >
> >
> > We could do that, but my concern with that approach is that we may end up
> > stopping and starting the tick back and forth without exiting the loop
> > in do_idle() just because somebody uses a periodic timer behind our
> > back and the governor gets confused.
> >
> > Besides, that would be a change in behavior, while the $subject patch
> > simply fixes a mistake in the original design.
>
> Ok, let's take the safe approach for now as this is a fix and it should even be
> routed to stable.

Right.  I'll queue up this patch, then.

> But then in the longer term, perhaps cpuidle_select() should think that
> through.

So I have given more consideration to this and my conclusion is that
restarting the tick between cpuidle_select() and call_cpuidle() is a
bad idea.

First off, if need_resched() is "false", the primary reason for
running the tick on the given CPU is not there, so it only might be
useful as a "backup" timer to wake up the CPU from an inadequate idle
state.

Now, in general, there are two reasons for the idle governor (whatever
it is) to select an idle state with a target residency below the tick
period length.  The first reason is when the governor knows that the
closest timer event is going to occur in this time frame, but in that
case (as stated above), it is not necessary to worry about the tick,
because the other timer will trigger soon enough anyway.  The second
reason is when the governor predicts a wakeup which is not by a timer
in this time frame and it is quite arguable what the governor should
do then.  IMO it at least is not unreasonable to throw the prediction
away and still go for the closest timer event in that case (which is
the current approach).

There's more, though.  Restarting the tick between cpuidle_select()
and call_cpuidle() might introduce quite a bit of latency into that
point and that would mess up with the idle state selection (e.g.
selecting a very shallow idle state might not make a lot of sense if
that latency was high enough, because the expected wakeup might very
well take place when the tick was being restarted), so it should
rather be avoided IMO.

Cheers,
Rafael
Leo Yan Aug. 19, 2018, 12:36 a.m. UTC | #8
On Sat, Aug 18, 2018 at 11:57:00PM +0200, Rafael J. Wysocki wrote:

[...]

> > > > Otherwise we can have something like this:
> > > >
> > > > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> > > > index da9455a..408c985 100644
> > > > --- a/kernel/time/tick-sched.c
> > > > +++ b/kernel/time/tick-sched.c
> > > > @@ -806,6 +806,9 @@ static void tick_nohz_stop_tick(struct tick_sched *ts, int cpu)
> > > >  static void tick_nohz_retain_tick(struct tick_sched *ts)
> > > >  {
> > > >     ts->timer_expires_base = 0;
> > > > +
> > > > +   if (ts->tick_stopped)
> > > > +           tick_nohz_restart(ts, ktime_get());
> > > >  }
> > > >
> > > >  #ifdef CONFIG_NO_HZ_FULL
> > > >
> > >
> > > We could do that, but my concern with that approach is that we may end up
> > > stopping and starting the tick back and forth without exiting the loop
> > > in do_idle() just because somebody uses a periodic timer behind our
> > > back and the governor gets confused.
> > >
> > > Besides, that would be a change in behavior, while the $subject patch
> > > simply fixes a mistake in the original design.
> >
> > Ok, let's take the safe approach for now as this is a fix and it should even be
> > routed to stable.
> 
> Right.  I'll queue up this patch, then.
> 
> > But then in the longer term, perhaps cpuidle_select() should think that
> > through.
> 
> So I have given more consideration to this and my conclusion is that
> restarting the tick between cpuidle_select() and call_cpuidle() is a
> bad idea.
> 
> First off, if need_resched() is "false", the primary reason for
> running the tick on the given CPU is not there, so it only might be
> useful as a "backup" timer to wake up the CPU from an inadequate idle
> state.
> 
> Now, in general, there are two reasons for the idle governor (whatever
> it is) to select an idle state with a target residency below the tick
> period length.  The first reason is when the governor knows that the
> closest timer event is going to occur in this time frame, but in that
> case (as stated above), it is not necessary to worry about the tick,
> because the other timer will trigger soon enough anyway.  The second
> reason is when the governor predicts a wakeup which is not by a timer
> in this time frame and it is quite arguable what the governor should
> do then.  IMO it at least is not unreasonable to throw the prediction
> away and still go for the closest timer event in that case (which is
> the current approach).
> 
> There's more, though.  Restarting the tick between cpuidle_select()
> and call_cpuidle() might introduce quite a bit of latency into that
> point and that would mess up with the idle state selection (e.g.
> selecting a very shallow idle state might not make a lot of sense if
> that latency was high enough, because the expected wakeup might very
> well take place when the tick was being restarted), so it should
> rather be avoided IMO.

I don't expect the idle governor to introduce many tick restart
operations.  If there is a close timer event, the idle governor can
trust it to wake up the CPU, so in this case it will not restart the
tick.  If the next timer event is a long way off and the shallow state
selection is caused by other factors (e.g. a typical pattern), then we
need to restart the tick to avoid powernightmares; in this case we can
restart the tick only once, at the beginning of the typical pattern
interrupt events.  Once the typical pattern interrupts stop arriving,
we can rely on the tick to rescue the CPU from the shallow idle state
into a deep one.

Thanks,
Leo Yan
Rafael J. Wysocki Aug. 19, 2018, 7:57 a.m. UTC | #9
On Sun, Aug 19, 2018 at 2:36 AM <leo.yan@linaro.org> wrote:
>
> On Sat, Aug 18, 2018 at 11:57:00PM +0200, Rafael J. Wysocki wrote:
>
> [...]
>
> > > > > Otherwise we can have something like this:
> > > > >
> > > > > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> > > > > index da9455a..408c985 100644
> > > > > --- a/kernel/time/tick-sched.c
> > > > > +++ b/kernel/time/tick-sched.c
> > > > > @@ -806,6 +806,9 @@ static void tick_nohz_stop_tick(struct tick_sched *ts, int cpu)
> > > > >  static void tick_nohz_retain_tick(struct tick_sched *ts)
> > > > >  {
> > > > >     ts->timer_expires_base = 0;
> > > > > +
> > > > > +   if (ts->tick_stopped)
> > > > > +           tick_nohz_restart(ts, ktime_get());
> > > > >  }
> > > > >
> > > > >  #ifdef CONFIG_NO_HZ_FULL
> > > > >
> > > >
> > > > We could do that, but my concern with that approach is that we may end up
> > > > stopping and starting the tick back and forth without exiting the loop
> > > > in do_idle() just because somebody uses a periodic timer behind our
> > > > back and the governor gets confused.
> > > >
> > > > Besides, that would be a change in behavior, while the $subject patch
> > > > simply fixes a mistake in the original design.
> > >
> > > Ok, let's take the safe approach for now as this is a fix and it should even be
> > > routed to stable.
> >
> > Right.  I'll queue up this patch, then.
> >
> > > But then in the longer term, perhaps cpuidle_select() should think that
> > > through.
> >
> > So I have given more consideration to this and my conclusion is that
> > restarting the tick between cpuidle_select() and call_cpuidle() is a
> > bad idea.
> >
> > First off, if need_resched() is "false", the primary reason for
> > running the tick on the given CPU is not there, so it only might be
> > useful as a "backup" timer to wake up the CPU from an inadequate idle
> > state.
> >
> > Now, in general, there are two reasons for the idle governor (whatever
> > it is) to select an idle state with a target residency below the tick
> > period length.  The first reason is when the governor knows that the
> > closest timer event is going to occur in this time frame, but in that
> > case (as stated above), it is not necessary to worry about the tick,
> > because the other timer will trigger soon enough anyway.  The second
> > reason is when the governor predicts a wakeup which is not by a timer
> > in this time frame and it is quite arguable what the governor should
> > do then.  IMO it at least is not unreasonable to throw the prediction
> > away and still go for the closest timer event in that case (which is
> > the current approach).
> >
> > There's more, though.  Restarting the tick between cpuidle_select()
> > and call_cpuidle() might introduce quite a bit of latency into that
> > point and that would mess up with the idle state selection (e.g.
> > selecting a very shallow idle state might not make a lot of sense if
> > that latency was high enough, because the expected wakeup might very
> > well take place when the tick was being restarted), so it should
> > rather be avoided IMO.
>
> I don't expect the idle governor to introduce many tick restart
> operations.  If there is a close timer event, the idle governor can
> trust it to wake up the CPU, so in this case it will not restart the
> tick.  If the next timer event is a long way off and the shallow state
> selection is caused by other factors (e.g. a typical pattern), then we
> need to restart the tick to avoid powernightmares; in this case we can
> restart the tick only once, at the beginning of the typical pattern
> interrupt events.  Once the typical pattern interrupts stop arriving,
> we can rely on the tick to rescue the CPU from the shallow idle state
> into a deep one.

No, we don't need to restart the tick at all.  We just need to require
the governor to disregard "typical patterns" (which are not
timer-induced, mind you) when it knows that the tick has been stopped
already.

Unfortunately, the menu governor cannot distinguish a timer-induced
"typical" pattern from one related to device interrupts, but I don't
really see a reason to worry about the latter when the CPU is idle and
with stopped tick (which means that the workload can tolerate extra
latency from deep idle states anyway).
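
The rule described above (disregard non-timer predictions once the tick has been stopped, and always fit the state to the time till the closest timer event) might look roughly like this in a simplified user-space form. It is a sketch under stated assumptions, not the menu governor: struct idle_state, both function names and any state table passed in are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical idle state description, for illustration only. */
struct idle_state {
	const char *name;
	uint64_t target_residency_ns;
};

/*
 * Expected idle duration: a prediction shorter than the time till the
 * closest timer event is honored only while the tick is still running.
 */
static uint64_t expected_idle_ns(uint64_t next_timer_ns, uint64_t predicted_ns,
				 bool tick_stopped)
{
	if (tick_stopped || predicted_ns > next_timer_ns)
		return next_timer_ns;
	return predicted_ns;
}

/*
 * Pick the deepest state whose target residency fits the expected idle
 * time.  States are assumed to be sorted from shallowest to deepest and
 * state 0 is assumed to be always usable.
 */
static size_t select_state(const struct idle_state *states, size_t n,
			   uint64_t idle_ns)
{
	size_t i, best = 0;

	for (i = 1; i < n; i++)
		if (states[i].target_residency_ns <= idle_ns)
			best = i;
	return best;
}
```

In this model, a short prediction selects a shallow state only while the tick is running.  With the tick stopped, the selection depends solely on the next timer event, so a shallow state is still picked when that event is close, and in that case the timer itself wakes the CPU up.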
Peter Zijlstra Aug. 20, 2018, 9:14 a.m. UTC | #10
On Sat, Aug 18, 2018 at 11:57:00PM +0200, Rafael J. Wysocki wrote:
> So I have given more consideration to this and my conclusion is that
> restarting the tick between cpuidle_select() and call_cpuidle() is a
> bad idea.

Ack, we should only restart the tick once we leave the idle loop.

> First off, if need_resched() is "false", the primary reason for
> running the tick on the given CPU is not there, so it only might be
> useful as a "backup" timer to wake up the CPU from an inadequate idle
> state.

this..

<snip>

> The second
> reason is when the governor predicts a wakeup which is not by a timer
> in this time frame and it is quite arguable what the governor should
> do then.  IMO it at least is not unreasonable to throw the prediction
> away and still go for the closest timer event in that case (which is
> the current approach).

Yes, I think I can agree with that, predictions at that scale are just
not that useful. The primary point of the governor is to stay shallow
when we can, but once we're deep and have disabled the tick and lost
caches, there's really no point anymore. Waking up is going to hurt.

> There's more, though.  Restarting the tick between cpuidle_select()
> and call_cpuidle() might introduce quite a bit of latency into that
> point and that would mess up with the idle state selection (e.g.
> selecting a very shallow idle state might not make a lot of sense if
> that latency was high enough, because the expected wakeup might very
> well take place when the tick was being restarted), so it should
> rather be avoided IMO.

Absolutely, mucking with the tick just because of a hunch is the wrong
thing.

So,

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

for this one.
Frederic Weisbecker Aug. 20, 2018, 2:42 p.m. UTC | #11
On Sat, Aug 18, 2018 at 11:57:00PM +0200, Rafael J. Wysocki wrote:
> On Fri, Aug 17, 2018 at 4:12 PM Frederic Weisbecker <frederic@kernel.org> wrote:
> >
> > On Fri, Aug 17, 2018 at 11:32:07AM +0200, Rafael J. Wysocki wrote:
> > > On Thursday, August 16, 2018 3:27:24 PM CEST Frederic Weisbecker wrote:
> > > > On Thu, Aug 09, 2018 at 07:08:34PM +0200, Rafael J. Wysocki wrote:
> > > > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > > >
> > > > > If the tick has been stopped already, but the governor has not asked to
> > > > > stop it (which it can do sometimes), the idle loop should invoke
> > > > > tick_nohz_idle_stop_tick(), to let tick_nohz_stop_tick() take care
> > > > > of this case properly.
> > > > >
> > > > > Fixes: 554c8aa8ecad (sched: idle: Select idle state before stopping the tick)
> > > > > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > > > ---
> > > > >  kernel/sched/idle.c |    2 +-
> > > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > >
> > > > > Index: linux-pm/kernel/sched/idle.c
> > > > > ===================================================================
> > > > > --- linux-pm.orig/kernel/sched/idle.c
> > > > > +++ linux-pm/kernel/sched/idle.c
> > > > > @@ -190,7 +190,7 @@ static void cpuidle_idle_call(void)
> > > > >            */
> > > > >           next_state = cpuidle_select(drv, dev, &stop_tick);
> > > > >
> > > > > -         if (stop_tick)
> > > > > +         if (stop_tick || tick_nohz_tick_stopped())
> > > > >                   tick_nohz_idle_stop_tick();
> > > > >           else
> > > > >                   tick_nohz_idle_retain_tick();
> > > >
> > > > So what if tick_nohz_idle_stop_tick() sees no timer to schedule and
> > > > cancels it, we may remain idle in a shallow state for a long while?
> > >
> > > Yes, but the governor is expected to avoid using shallow states when the
> > > tick is stopped already.
> >
> > So what kind of sleep do we enter when an idle tick fires and we go
> > back to idle? Is it always deep?
> 
> No, it isn't.
> 
> The state to select must always fit the time till the closest timer
> event and that may be shorter than the tick period.

Ah ok, so that's fine then.

> 
> If there's a non-tick timer to wake the CPU up, we don't need to worry
> about restarting the tick, though. :-)

Ok.

> 
> > I believe that ts->tick_stopped == 1 shouldn't be too relevant for the governor.
> > We can definitely have scenarios where the idle tick is stopped for a long while,
> > then it fires and schedules the next timer at NOW() + TICK_NSEC (as if the tick
> > had been restarted). This can even repeat that way for some time, because
> > ts->tick_stopped == 1 only implies that the tick has been stopped once since
> > we entered the idle loop. After that we may well have a periodic tick behaviour.
> > In that case we probably don't want deep idle state. Especially if we have:
> >
> >               idle_loop() {
> >                   tick_stop (scheduled several seconds forward)
> >                   deep_idle_sleep()
> >                   //several seconds later
> >                   tick()
> >                   tick_stop (scheduled TICK_NSEC forward)
> >                   deep_idle_sleep()
> >                   tick() {
> >                       set_need_resched()
> >                   }
> >                   exit idle loop
> >               }
> >
> > Here the last deep idle state isn't necessary.
> 
> No, it isn't.
> 
> However, that is not relevant for the question of whether or not to
> restart the tick before entering the idle state IMO (see the
> considerations below).

Yes indeed.

> > But then in the longer term, perhaps cpuidle_select() should think that
> > through.
> 
> So I have given more consideration to this and my conclusion is that
> restarting the tick between cpuidle_select() and call_cpuidle() is a
> bad idea.
> 
> First off, if need_resched() is "false", the primary reason for
> running the tick on the given CPU is not there, so it only might be
> useful as a "backup" timer to wake up the CPU from an inadequate idle
> state.
> 
> Now, in general, there are two reasons for the idle governor (whatever
> it is) to select an idle state with a target residency below the tick
> period length.  The first reason is when the governor knows that the
> closest timer event is going to occur in this time frame, but in that
> case (as stated above), it is not necessary to worry about the tick,
> because the other timer will trigger soon enough anyway.  The second
> reason is when the governor predicts a wakeup which is not by a timer
> in this time frame and it is quite arguable what the governor should
> do then.  IMO it at least is not unreasonable to throw the prediction
> away and still go for the closest timer event in that case (which is
> the current approach).

Then in this case, when you say you throw away that prediction, does it
mean you select an idle state that only takes the next timer event into
consideration?

So for example, if we predict a wakeup event TICK_NSEC ahead but the next
timer event is a few seconds away, you're going to select an idle state
according to that "few seconds" ahead next event, right? (Which in
practice is likely to be deep, I guess.)

I guess so, but I just want to be sure I understand you correctly.

> 
> There's more, though.  Restarting the tick between cpuidle_select()
> and call_cpuidle() might introduce quite a bit of latency into that
> point and that would mess up with the idle state selection (e.g.
> selecting a very shallow idle state might not make a lot of sense if
> that latency was high enough, because the expected wakeup might very
> well take place when the tick was being restarted), so it should
> rather be avoided IMO.

Yes indeed.

Thanks.
Rafael J. Wysocki Aug. 20, 2018, 9:21 p.m. UTC | #12
On Mon, Aug 20, 2018 at 4:42 PM Frederic Weisbecker <frederic@kernel.org> wrote:
>
> On Sat, Aug 18, 2018 at 11:57:00PM +0200, Rafael J. Wysocki wrote:
> > On Fri, Aug 17, 2018 at 4:12 PM Frederic Weisbecker <frederic@kernel.org> wrote:

[cut]

> >
> > Now, in general, there are two reasons for the idle governor (whatever
> > it is) to select an idle state with a target residency below the tick
> > period length.  The first reason is when the governor knows that the
> > closest timer event is going to occur in this time frame, but in that
> > case (as stated above), it is not necessary to worry about the tick,
> > because the other timer will trigger soon enough anyway.  The second
> > reason is when the governor predicts a wakeup which is not by a timer
> > in this time frame and it is quite arguable what the governor should
> > do then.  IMO it at least is not unreasonable to throw the prediction
> > away and still go for the closest timer event in that case (which is
> > the current approach).
>
> Then in this case, when you say you throw away that prediction, does it
> mean you select an idle state that only takes the next timer event into
> consideration?

Yes, it does.

> So for example, if we predict a wakeup event TICK_NSEC ahead but the next
> timer event is a few seconds away, you're going to select an idle state
> according to that "few seconds" ahead next event, right? (Which in
> practice is likely to be deep, I guess.)
>
> I guess so, but I just want to be sure I understand you correctly.

More precisely, if the original predicted idle duration is less than
TICK_USEC and the tick has been stopped, the governor takes the time
till the next timer event instead of the predicted value (so
effectively the predicted value is discarded).

If the original predicted idle duration is TICK_USEC or more, the tick
would have been stopped anyway (had it not been stopped already), so
the predicted value may as well be used for idle state selection in the
stopped tick case.
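
As a rough illustration of the rule described above, the following
user-space sketch picks the duration that idle state selection would be
based on.  It is not the menu governor code: the function name, the
parameter names and the SKETCH_TICK_USEC value (which assumes HZ=250) are
hypothetical and only stand in for the predicted idle duration, the time
till the next timer event and TICK_USEC mentioned in this thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for TICK_USEC, assuming HZ=250 (4 ms tick period). */
#define SKETCH_TICK_USEC 4000u

/*
 * Return the idle duration (in microseconds) to base idle state selection on.
 *
 * predicted_us:  the governor's predicted idle duration
 * next_timer_us: the time till the closest timer event
 * tick_stopped:  whether the tick has been stopped already
 */
static uint64_t sketch_idle_duration_us(uint64_t predicted_us,
					uint64_t next_timer_us,
					bool tick_stopped)
{
	/*
	 * A prediction shorter than the tick period cannot rely on the tick
	 * as a "backup" wakeup when the tick is already stopped, so it is
	 * discarded in favor of the time till the next timer event.
	 */
	if (tick_stopped && predicted_us < SKETCH_TICK_USEC)
		return next_timer_us;

	/* Otherwise the prediction is used as is. */
	return predicted_us;
}
```

With the tick stopped, a 1 ms prediction and a timer 3 s away, the sketch
returns the 3 s value, which in practice would likely lead to a deep idle
state being selected.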

Cheers,
Rafael
Tony Lindgren Aug. 21, 2018, 11:21 p.m. UTC | #13
* Rafael J. Wysocki <rjw@rjwysocki.net> [691231 23:00]:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> If the tick has been stopped already, but the governor has not asked to
> stop it (which it can do sometimes), the idle loop should invoke
> tick_nohz_idle_stop_tick(), to let tick_nohz_stop_tick() take care
> of this case properly.
> 
> Fixes: 554c8aa8ecad (sched: idle: Select idle state before stopping the tick)

This patch seems to fix an issue where boot hangs occasionally
on beagleboard-xm with ARM multi_v7_defconfig as reported by
kernelci.org and Mark Brown earlier at [0].

At least so far no boot hangs for me with this fix, so:

Tested-by: Tony Lindgren <tony@atomide.com>

[0] https://www.spinics.net/lists/linux-mmc/msg50480.html

Patch

Index: linux-pm/kernel/sched/idle.c
===================================================================
--- linux-pm.orig/kernel/sched/idle.c
+++ linux-pm/kernel/sched/idle.c
@@ -190,7 +190,7 @@  static void cpuidle_idle_call(void)
 		 */
 		next_state = cpuidle_select(drv, dev, &stop_tick);
 
-		if (stop_tick)
+		if (stop_tick || tick_nohz_tick_stopped())
 			tick_nohz_idle_stop_tick();
 		else
 			tick_nohz_idle_retain_tick();
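
The effect of the one-line change can be sketched in user space as a
comparison of the old and new conditions.  This is only an illustration:
the enum and the function names below are hypothetical, and the two
booleans stand in for the stop_tick value set by cpuidle_select() and the
return value of tick_nohz_tick_stopped().

```c
#include <stdbool.h>

/* Hypothetical labels for the two branches taken in cpuidle_idle_call(). */
enum sketch_tick_action {
	SKETCH_STOP_TICK,	/* tick_nohz_idle_stop_tick() would be called */
	SKETCH_RETAIN_TICK,	/* tick_nohz_idle_retain_tick() would be called */
};

/* Condition before the patch: only the governor's request matters. */
static enum sketch_tick_action sketch_before_patch(bool stop_tick,
						   bool tick_stopped)
{
	(void)tick_stopped;	/* ignored by the old condition */
	return stop_tick ? SKETCH_STOP_TICK : SKETCH_RETAIN_TICK;
}

/* Condition after the patch: an already stopped tick is never "retained". */
static enum sketch_tick_action sketch_after_patch(bool stop_tick,
						  bool tick_stopped)
{
	return (stop_tick || tick_stopped) ? SKETCH_STOP_TICK
					   : SKETCH_RETAIN_TICK;
}
```

The two functions differ only when the governor has not asked to stop the
tick but the tick has been stopped already, which is the case the commit
message describes.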