
cpuidle: Fix the CPU stuck at C0 for 2-3s after PM_QOS back to DEFAULT

Message ID 1407982309-4863-1-git-send-email-chuansheng.liu@intel.com (mailing list archive)
State Not Applicable, archived

Commit Message

Chuansheng Liu Aug. 14, 2014, 2:11 a.m. UTC
We found that sometimes, even after we let PM_QOS go back to DEFAULT,
the CPU stays stuck at C0 for 2-3s and does not do a new, suitable
C-state selection immediately after receiving the IPI interrupt.

The code model is simply like below:
{
	pm_qos_update_request(&pm_qos, C1 - 1);
		< == Here keep all cores at C0
	...;
	pm_qos_update_request(&pm_qos, PM_QOS_DEFAULT_VALUE);
		< == Here some cores still stuck at C0 for 2-3s
}

The reason is that when the PM_QOS request comes back to DEFAULT, an IPI
is sent to wake up the core, but when the core is in the polling idle
state, the IPI cannot break the polling loop.

So here, in the IPI callback interrupt, when the idle task is currently
running we forcibly set the reschedule bit to break the polling loop. For
the other, non-polling idle states the IPI already breaks them directly,
and setting the reschedule bit does no harm to them either.

With this fix, we saved about 30mV power in our android platform.

Signed-off-by: Chuansheng Liu <chuansheng.liu@intel.com>
---
 drivers/cpuidle/cpuidle.c |    8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)
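
For reference, the patch hooks into the cpuidle PM_QOS notifier path, which in the
3.16-era drivers/cpuidle/cpuidle.c looked roughly like the sketch below (paraphrased,
not an exact quote of the file; the exact code may differ slightly):

/* Sketch of the 3.16-era notification path: a CPU_DMA_LATENCY change
 * fires a pm_qos notifier, which IPIs all other CPUs so they can
 * re-evaluate their C-state. */
static void smp_callback(void *v)
{
	/* we already woke the CPU up, nothing more to do */
}

static int cpuidle_latency_notify(struct notifier_block *b,
				  unsigned long l, void *v)
{
	/* broadcast an IPI to every other online CPU */
	smp_call_function(smp_callback, NULL, 1);
	return NOTIFY_OK;
}

static struct notifier_block cpuidle_latency_notifier = {
	.notifier_call = cpuidle_latency_notify,
};

static inline void latency_notifier_init(struct notifier_block *n)
{
	pm_qos_add_notifier(PM_QOS_CPU_DMA_LATENCY, n);
}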

Comments

Peter Zijlstra Aug. 14, 2014, 10:53 a.m. UTC | #1
On Thu, Aug 14, 2014 at 12:29:32PM +0200, Daniel Lezcano wrote:
> Hi Chuansheng,
> 
> On 14 August 2014 04:11, Chuansheng Liu <chuansheng.liu@intel.com> wrote:
> 
> > We found sometimes even after we let PM_QOS back to DEFAULT,
> > the CPU still stuck at C0 for 2-3s, don't do the new suitable C-state
> > selection immediately after received the IPI interrupt.
> >
> > The code model is simply like below:
> > {
> >         pm_qos_update_request(&pm_qos, C1 - 1);
> >                 < == Here keep all cores at C0
> >         ...;
> >         pm_qos_update_request(&pm_qos, PM_QOS_DEFAULT_VALUE);
> >                 < == Here some cores still stuck at C0 for 2-3s
> > }
> >
> > The reason is when pm_qos come back to DEFAULT, there is IPI interrupt to
> > wake up the core, but when core is in poll idle state, the IPI interrupt
> > can not break the polling loop.
> >
> > So here in the IPI callback interrupt, when currently the idle task is
> > running, we need to forcedly set reschedule bit to break the polling loop,
> > as for other non-polling idle state, IPI interrupt can break them directly,
> > and setting reschedule bit has no harm for them too.
> >
> > With this fix, we saved about 30mV power in our android platform.
> >
> > Signed-off-by: Chuansheng Liu <chuansheng.liu@intel.com>
> > ---
> >  drivers/cpuidle/cpuidle.c |    8 +++++++-
> >  1 file changed, 7 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
> > index ee9df5e..9e28a13 100644
> > --- a/drivers/cpuidle/cpuidle.c
> > +++ b/drivers/cpuidle/cpuidle.c
> > @@ -532,7 +532,13 @@ EXPORT_SYMBOL_GPL(cpuidle_register);
> >
> >  static void smp_callback(void *v)
> >  {
> > -       /* we already woke the CPU up, nothing more to do */
> > +       /* we already woke the CPU up, and when the corresponding
> > +        * CPU is at polling idle state, we need to set the sched
> > +        * bit to trigger reselect the new suitable C-state, it
> > +        * will be helpful for power.
> > +       */
> > +       if (is_idle_task(current))
> > +               set_tsk_need_resched(current);
> >
> 
> Mmh, shouldn't we inspect the polling flag instead ? Peter (Cc'ed) did some
> changes around this and I think we should ask its opinion. I am not sure
> this code won't make all cpu to return to the scheduler and go back to the
> idle task.

Yes, this is wrong.. Also cpuidle should not know about this, so this is
very much the wrong place to go fix this. Lemme have a look.
Peter Zijlstra Aug. 14, 2014, 11 a.m. UTC | #2
On Thu, Aug 14, 2014 at 12:29:32PM +0200, Daniel Lezcano wrote:
> Hi Chuansheng,
> 
> On 14 August 2014 04:11, Chuansheng Liu <chuansheng.liu@intel.com> wrote:
> 
> > We found sometimes even after we let PM_QOS back to DEFAULT,
> > the CPU still stuck at C0 for 2-3s, don't do the new suitable C-state
> > selection immediately after received the IPI interrupt.
> >
> > The code model is simply like below:
> > {
> >         pm_qos_update_request(&pm_qos, C1 - 1);
> >                 < == Here keep all cores at C0
> >         ...;
> >         pm_qos_update_request(&pm_qos, PM_QOS_DEFAULT_VALUE);
> >                 < == Here some cores still stuck at C0 for 2-3s
> > }
> >
> > The reason is when pm_qos come back to DEFAULT, there is IPI interrupt to
> > wake up the core, but when core is in poll idle state, the IPI interrupt
> > can not break the polling loop.

So seeing how you're from @intel.com I'm assuming you're using x86 here.

I'm not seeing how this can be possible, MWAIT is interrupted by IPIs
just fine, which means we'll fall out of the cpuidle_enter(), which
means we'll cpuidle_reflect(), and then leave cpuidle_idle_call().

It will indeed not leave the cpu_idle_loop() function and go right back
into cpuidle_idle_call(), but that will then call cpuidle_select() which
should pick a new C state.

So the interrupt _should_ work. If it doesn't you need to explain why.
Daniel Lezcano Aug. 14, 2014, 11:14 a.m. UTC | #3
On 08/14/2014 01:00 PM, Peter Zijlstra wrote:
> On Thu, Aug 14, 2014 at 12:29:32PM +0200, Daniel Lezcano wrote:
>> Hi Chuansheng,
>>
>> On 14 August 2014 04:11, Chuansheng Liu <chuansheng.liu@intel.com> wrote:
>>
>>> We found sometimes even after we let PM_QOS back to DEFAULT,
>>> the CPU still stuck at C0 for 2-3s, don't do the new suitable C-state
>>> selection immediately after received the IPI interrupt.
>>>
>>> The code model is simply like below:
>>> {
>>>          pm_qos_update_request(&pm_qos, C1 - 1);
>>>                  < == Here keep all cores at C0
>>>          ...;
>>>          pm_qos_update_request(&pm_qos, PM_QOS_DEFAULT_VALUE);
>>>                  < == Here some cores still stuck at C0 for 2-3s
>>> }
>>>
>>> The reason is when pm_qos come back to DEFAULT, there is IPI interrupt to
>>> wake up the core, but when core is in poll idle state, the IPI interrupt
>>> can not break the polling loop.
>
> So seeing how you're from @intel.com I'm assuming you're using x86 here.
>
> I'm not seeing how this can be possible, MWAIT is interrupted by IPIs
> just fine, which means we'll fall out of the cpuidle_enter(), which
> means we'll cpuidle_reflect(), and then leave cpuidle_idle_call().
>
> It will indeed not leave the cpu_idle_loop() function and go right back
> into cpuidle_idle_call(), but that will then call cpuidle_select() which
> should pick a new C state.
>
> So the interrupt _should_ work. If it doesn't you need to explain why.

I think the issue is related to the poll_idle state, in 
drivers/cpuidle/driver.c. This state is x86 specific and inserted in the 
cpuidle table as the state 0 (POLL). There is no mwait for this state. 
It is a bit confusing because this state is not listed in the acpi / 
intel idle driver but inserted implicitly at the beginning of the idle 
table by the cpuidle framework when the driver is registered.

static int poll_idle(struct cpuidle_device *dev,
                 struct cpuidle_driver *drv, int index)
{
         local_irq_enable();
         if (!current_set_polling_and_test()) {
                 while (!need_resched())
                         cpu_relax();
         }
         current_clr_polling();

         return index;
}
Chuansheng Liu Aug. 14, 2014, 11:17 a.m. UTC | #4
> -----Original Message-----
> From: Daniel Lezcano [mailto:daniel.lezcano@linaro.org]
> Sent: Thursday, August 14, 2014 7:15 PM
> To: Peter Zijlstra
> Cc: Liu, Chuansheng; Rafael J. Wysocki; linux-pm@vger.kernel.org; LKML; Liu,
> Changcheng; Wang, Xiaoming; Chakravarty, Souvik K
> Subject: Re: [PATCH] cpuidle: Fix the CPU stuck at C0 for 2-3s after PM_QOS
> back to DEFAULT
> 
> On 08/14/2014 01:00 PM, Peter Zijlstra wrote:
> > On Thu, Aug 14, 2014 at 12:29:32PM +0200, Daniel Lezcano wrote:
> >> Hi Chuansheng,
> >>
> >> On 14 August 2014 04:11, Chuansheng Liu <chuansheng.liu@intel.com> wrote:
> >>
> >>> We found sometimes even after we let PM_QOS back to DEFAULT,
> >>> the CPU still stuck at C0 for 2-3s, don't do the new suitable C-state
> >>> selection immediately after received the IPI interrupt.
> >>>
> >>> The code model is simply like below:
> >>> {
> >>>          pm_qos_update_request(&pm_qos, C1 - 1);
> >>>                  < == Here keep all cores at C0
> >>>          ...;
> >>>          pm_qos_update_request(&pm_qos, PM_QOS_DEFAULT_VALUE);
> >>>                  < == Here some cores still stuck at C0 for 2-3s
> >>> }
> >>>
> >>> The reason is when pm_qos come back to DEFAULT, there is IPI interrupt to
> >>> wake up the core, but when core is in poll idle state, the IPI interrupt
> >>> can not break the polling loop.
> >
> > So seeing how you're from @intel.com I'm assuming you're using x86 here.
> >
> > I'm not seeing how this can be possible, MWAIT is interrupted by IPIs
> > just fine, which means we'll fall out of the cpuidle_enter(), which
> > means we'll cpuidle_reflect(), and then leave cpuidle_idle_call().
> >
> > It will indeed not leave the cpu_idle_loop() function and go right back
> > into cpuidle_idle_call(), but that will then call cpuidle_select() which
> > should pick a new C state.
> >
> > So the interrupt _should_ work. If it doesn't you need to explain why.
> 
> I think the issue is related to the poll_idle state, in
> drivers/cpuidle/driver.c. This state is x86 specific and inserted in the
> cpuidle table as the state 0 (POLL). There is no mwait for this state.
> It is a bit confusing because this state is not listed in the acpi /
> intel idle driver but inserted implicitly at the beginning of the idle
> table by the cpuidle framework when the driver is registered.


Yes, I am talking about the poll_idle() function, which does not use mwait.
If we want the reselection to happen immediately, we need to break out of the polling
while loop by setting the reschedule bit; we do not care whether a real reschedule happens afterwards or not.
Chuansheng Liu Aug. 14, 2014, 11:24 a.m. UTC | #5
> -----Original Message-----
> From: Peter Zijlstra [mailto:peterz@infradead.org]
> Sent: Thursday, August 14, 2014 6:54 PM
> To: Daniel Lezcano
> Cc: Liu, Chuansheng; Rafael J. Wysocki; linux-pm@vger.kernel.org; LKML; Liu,
> Changcheng; Wang, Xiaoming; Chakravarty, Souvik K
> Subject: Re: [PATCH] cpuidle: Fix the CPU stuck at C0 for 2-3s after PM_QOS
> back to DEFAULT
> 
> On Thu, Aug 14, 2014 at 12:29:32PM +0200, Daniel Lezcano wrote:
> > Hi Chuansheng,
> >
> > On 14 August 2014 04:11, Chuansheng Liu <chuansheng.liu@intel.com>
> wrote:
> >
> > > We found sometimes even after we let PM_QOS back to DEFAULT,
> > > the CPU still stuck at C0 for 2-3s, don't do the new suitable C-state
> > > selection immediately after received the IPI interrupt.
> > >
> > > The code model is simply like below:
> > > {
> > >         pm_qos_update_request(&pm_qos, C1 - 1);
> > >                 < == Here keep all cores at C0
> > >         ...;
> > >         pm_qos_update_request(&pm_qos, PM_QOS_DEFAULT_VALUE);
> > >                 < == Here some cores still stuck at C0 for 2-3s
> > > }
> > >
> > > The reason is when pm_qos come back to DEFAULT, there is IPI interrupt to
> > > wake up the core, but when core is in poll idle state, the IPI interrupt
> > > can not break the polling loop.
> > >
> > > So here in the IPI callback interrupt, when currently the idle task is
> > > running, we need to forcedly set reschedule bit to break the polling loop,
> > > as for other non-polling idle state, IPI interrupt can break them directly,
> > > and setting reschedule bit has no harm for them too.
> > >
> > > With this fix, we saved about 30mV power in our android platform.
> > >
> > > Signed-off-by: Chuansheng Liu <chuansheng.liu@intel.com>
> > > ---
> > >  drivers/cpuidle/cpuidle.c |    8 +++++++-
> > >  1 file changed, 7 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
> > > index ee9df5e..9e28a13 100644
> > > --- a/drivers/cpuidle/cpuidle.c
> > > +++ b/drivers/cpuidle/cpuidle.c
> > > @@ -532,7 +532,13 @@ EXPORT_SYMBOL_GPL(cpuidle_register);
> > >
> > >  static void smp_callback(void *v)
> > >  {
> > > -       /* we already woke the CPU up, nothing more to do */
> > > +       /* we already woke the CPU up, and when the corresponding
> > > +        * CPU is at polling idle state, we need to set the sched
> > > +        * bit to trigger reselect the new suitable C-state, it
> > > +        * will be helpful for power.
> > > +       */
> > > +       if (is_idle_task(current))
> > > +               set_tsk_need_resched(current);
> > >
> >
> > Mmh, shouldn't we inspect the polling flag instead ? Peter (Cc'ed) did some
> > changes around this and I think we should ask its opinion. I am not sure
> > this code won't make all cpu to return to the scheduler and go back to the
> > idle task.
> 
> Yes, this is wrong.. Also cpuidle should not know about this, so this is
> very much the wrong place to go fix this. Lemme have a look.

If inspecting the polling flag, we can not fix the race between poll_idle and smp_callback,
since in poll_idle(), before set polling flag, if the smp_callback come in, then no resched bit set,
after that, poll_idle() will do the polling action, without reselection immediately, it will bring power
regression here.




Peter Zijlstra Aug. 14, 2014, 12:41 p.m. UTC | #6
On Thu, Aug 14, 2014 at 01:14:49PM +0200, Daniel Lezcano wrote:
> On 08/14/2014 01:00 PM, Peter Zijlstra wrote:
> >On Thu, Aug 14, 2014 at 12:29:32PM +0200, Daniel Lezcano wrote:
> >>Hi Chuansheng,
> >>
> >>On 14 August 2014 04:11, Chuansheng Liu <chuansheng.liu@intel.com> wrote:
> >>
> >>>We found sometimes even after we let PM_QOS back to DEFAULT,
> >>>the CPU still stuck at C0 for 2-3s, don't do the new suitable C-state
> >>>selection immediately after received the IPI interrupt.
> >>>
> >>>The code model is simply like below:
> >>>{
> >>>         pm_qos_update_request(&pm_qos, C1 - 1);
> >>>                 < == Here keep all cores at C0
> >>>         ...;
> >>>         pm_qos_update_request(&pm_qos, PM_QOS_DEFAULT_VALUE);
> >>>                 < == Here some cores still stuck at C0 for 2-3s
> >>>}
> >>>
> >>>The reason is when pm_qos come back to DEFAULT, there is IPI interrupt to
> >>>wake up the core, but when core is in poll idle state, the IPI interrupt
> >>>can not break the polling loop.
> >
> >So seeing how you're from @intel.com I'm assuming you're using x86 here.
> >
> >I'm not seeing how this can be possible, MWAIT is interrupted by IPIs
> >just fine, which means we'll fall out of the cpuidle_enter(), which
> >means we'll cpuidle_reflect(), and then leave cpuidle_idle_call().
> >
> >It will indeed not leave the cpu_idle_loop() function and go right back
> >into cpuidle_idle_call(), but that will then call cpuidle_select() which
> >should pick a new C state.
> >
> >So the interrupt _should_ work. If it doesn't you need to explain why.
> 
> I think the issue is related to the poll_idle state, in
> drivers/cpuidle/driver.c. This state is x86 specific and inserted in the
> cpuidle table as the state 0 (POLL). There is no mwait for this state. It is
> a bit confusing because this state is not listed in the acpi / intel idle
> driver but inserted implicitly at the beginning of the idle table by the
> cpuidle framework when the driver is registered.
> 
> static int poll_idle(struct cpuidle_device *dev,
>                 struct cpuidle_driver *drv, int index)
> {
>         local_irq_enable();
>         if (!current_set_polling_and_test()) {
>                 while (!need_resched())
>                         cpu_relax();
>         }
>         current_clr_polling();
> 
>         return index;
> }

Ah, well, in that case there's a ton more broken than just this.
kick_all_cpus_sync() won't work either, and cpuidle_reflect() pretty
much expects to be called after each interrupt.

Then again, not reflecting properly isn't really a problem, it's not like
not accounting interrupts is going to save power much.
Peter Zijlstra Aug. 14, 2014, 1:13 p.m. UTC | #7
On Thu, Aug 14, 2014 at 11:24:06AM +0000, Liu, Chuansheng wrote:
> If inspecting the polling flag, we can not fix the race between poll_idle and smp_callback,
> since in poll_idle(), before set polling flag, if the smp_callback come in, then no resched bit set,
> after that, poll_idle() will do the polling action, without reselection immediately, it will bring power
> regression here.

-ENOPARSE. Is there a question there?
Daniel Lezcano Aug. 14, 2014, 1:29 p.m. UTC | #8
On 08/14/2014 02:41 PM, Peter Zijlstra wrote:
> On Thu, Aug 14, 2014 at 01:14:49PM +0200, Daniel Lezcano wrote:
>> On 08/14/2014 01:00 PM, Peter Zijlstra wrote:
>>> On Thu, Aug 14, 2014 at 12:29:32PM +0200, Daniel Lezcano wrote:
>>>> Hi Chuansheng,
>>>>
>>>> On 14 August 2014 04:11, Chuansheng Liu <chuansheng.liu@intel.com> wrote:
>>>>
>>>>> We found sometimes even after we let PM_QOS back to DEFAULT,
>>>>> the CPU still stuck at C0 for 2-3s, don't do the new suitable C-state
>>>>> selection immediately after received the IPI interrupt.
>>>>>
>>>>> The code model is simply like below:
>>>>> {
>>>>>          pm_qos_update_request(&pm_qos, C1 - 1);
>>>>>                  < == Here keep all cores at C0
>>>>>          ...;
>>>>>          pm_qos_update_request(&pm_qos, PM_QOS_DEFAULT_VALUE);
>>>>>                  < == Here some cores still stuck at C0 for 2-3s
>>>>> }
>>>>>
>>>>> The reason is when pm_qos come back to DEFAULT, there is IPI interrupt to
>>>>> wake up the core, but when core is in poll idle state, the IPI interrupt
>>>>> can not break the polling loop.
>>>
>>> So seeing how you're from @intel.com I'm assuming you're using x86 here.
>>>
>>> I'm not seeing how this can be possible, MWAIT is interrupted by IPIs
>>> just fine, which means we'll fall out of the cpuidle_enter(), which
>>> means we'll cpuidle_reflect(), and then leave cpuidle_idle_call().
>>>
>>> It will indeed not leave the cpu_idle_loop() function and go right back
>>> into cpuidle_idle_call(), but that will then call cpuidle_select() which
>>> should pick a new C state.
>>>
>>> So the interrupt _should_ work. If it doesn't you need to explain why.
>>
>> I think the issue is related to the poll_idle state, in
>> drivers/cpuidle/driver.c. This state is x86 specific and inserted in the
>> cpuidle table as the state 0 (POLL). There is no mwait for this state. It is
>> a bit confusing because this state is not listed in the acpi / intel idle
>> driver but inserted implicitly at the beginning of the idle table by the
>> cpuidle framework when the driver is registered.
>>
>> static int poll_idle(struct cpuidle_device *dev,
>>                  struct cpuidle_driver *drv, int index)
>> {
>>          local_irq_enable();
>>          if (!current_set_polling_and_test()) {
>>                  while (!need_resched())
>>                          cpu_relax();
>>          }
>>          current_clr_polling();
>>
>>          return index;
>> }
>
> Ah, well, in that case there's a ton more broken than just this.
> kick_all_cpus_sync() won't work either, and cpuidle_reflect() pretty
> much expects to be called after each interrupt.

Agree.

> Then again, not reflecting properly isn't really a problem, its not like
> not accounting interrupts is going to safe power much.

I think the main issue here is to exit the poll_idle loop when an IPI is 
received. IIUC, there is a pm_qos user, perhaps a driver (Chuansheng can 
give more details), setting a very short latency, so the cpuidle 
framework chooses a shallow state like poll_idle, and then the driver 
sets a bigger latency, leading to an IPI to wake all the CPUs. As the 
CPUs are in poll_idle, they don't exit until an event makes them leave 
the need_resched() loop (a reschedule or whatever). This situation can 
leave the CPUs spinning in that loop for several seconds while we are 
expecting them to exit poll_idle and enter a deeper idle state, thus 
consuming extra energy.
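
A hypothetical driver-side usage that would produce exactly this pattern (the request
object name and the latency values below are made up for illustration; the pm_qos calls
themselves are the standard kernel API):

#include <linux/pm_qos.h>

/* Illustrative pm_qos user as described above. */
static struct pm_qos_request example_latency_req;

static void example_start_low_latency(void)
{
	/* Demand ~0us wakeup latency: the governor can then only pick
	 * state 0 (POLL), so every idle CPU sits in poll_idle(). */
	pm_qos_add_request(&example_latency_req, PM_QOS_CPU_DMA_LATENCY, 0);
}

static void example_stop_low_latency(void)
{
	/* Relax the constraint again: this fires the pm_qos notifier,
	 * which IPIs the other CPUs -- but a CPU spinning in poll_idle()
	 * does not break out of its loop on that IPI alone. */
	pm_qos_update_request(&example_latency_req, PM_QOS_DEFAULT_VALUE);
}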
Chuansheng Liu Aug. 14, 2014, 1:57 p.m. UTC | #9
> -----Original Message-----
> From: Daniel Lezcano [mailto:daniel.lezcano@linaro.org]
> Sent: Thursday, August 14, 2014 9:30 PM
> To: Peter Zijlstra
> Cc: Liu, Chuansheng; Rafael J. Wysocki; linux-pm@vger.kernel.org; LKML; Liu,
> Changcheng; Wang, Xiaoming; Chakravarty, Souvik K
> Subject: Re: [PATCH] cpuidle: Fix the CPU stuck at C0 for 2-3s after PM_QOS
> back to DEFAULT
> 
> I think the main issue here is to exit the poll_idle loop when an IPI is
> received. IIUC, there is a pm_qos user, perhaps a driver (Chuansheng can
> give more details), setting a very short latency, so the cpuidle
> framework choose a shallow state like the poll_idle and then the driver
> sets a bigger latency, leading to the IPI to wake all the cpus. As the
> CPUs are in the poll_idle, they don't exit until an event make them to
> exit the need_resched() loop (reschedule or whatever). This situation
> can let the CPUs to stand in the infinite loop several seconds while we
> are expecting them to exit the poll_idle and enter a deeper idle state,
> thus with an extra energy consumption.
> 


Exactly, there is no functional error here. But not entering the deeper C-state brings extra power
consumption; in some MP3-standby scenarios even 10% of the power can be saved.

And that is the aim of this patch.
Chuansheng Liu Aug. 14, 2014, 2:10 p.m. UTC | #10
> -----Original Message-----
> From: Peter Zijlstra [mailto:peterz@infradead.org]
> Sent: Thursday, August 14, 2014 9:13 PM
> To: Liu, Chuansheng
> Cc: Daniel Lezcano; Rafael J. Wysocki; linux-pm@vger.kernel.org; LKML; Liu,
> Changcheng; Wang, Xiaoming; Chakravarty, Souvik K
> Subject: Re: [PATCH] cpuidle: Fix the CPU stuck at C0 for 2-3s after PM_QOS
> back to DEFAULT
> 
> On Thu, Aug 14, 2014 at 11:24:06AM +0000, Liu, Chuansheng wrote:
> > If inspecting the polling flag, we can not fix the race between poll_idle and
> smp_callback,
> > since in poll_idle(), before set polling flag, if the smp_callback come in, then
> no resched bit set,
> > after that, poll_idle() will do the polling action, without reselection
> immediately, it will bring power
> > regression here.
> 
> -ENOPARSE. Is there a question there?

Lezcano suggested inspecting the polling flag; the code would then look like below:
smp_callback() {
if (polling_flag)
  set_resched_bit;
}

And the poll_idle code is like below:
static int poll_idle(struct cpuidle_device *dev,
                struct cpuidle_driver *drv, int index)
{
        local_irq_enable();
        if (!current_set_polling_and_test()) {
                while (!need_resched())
                        cpu_relax();
        }   
        current_clr_polling();

        return index;
}

The race is:
Idle task:
poll_idle
   local_irq_enable()
<== IPI arrives here; the callback sees that the polling flag is not set yet, so it does nothing;
Control returns to poll_idle(), which then stays in the polling loop for a while instead of
breaking out immediately to let the governor reselect the right C-state.
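
In other words, annotated against the poll_idle() code quoted above, the window is
(annotations only; the logic is unchanged):

/* poll_idle() as above, with comments marking the race window described. */
static int poll_idle(struct cpuidle_device *dev,
		     struct cpuidle_driver *drv, int index)
{
	local_irq_enable();
	/* <-- If the smp_callback() IPI lands here, the polling flag is
	 *     not set yet, so a flag-checking callback would do nothing. */
	if (!current_set_polling_and_test()) {
		while (!need_resched())
			/* ...and the CPU then spins here until something
			 * else sets TIF_NEED_RESCHED. */
			cpu_relax();
	}
	current_clr_polling();

	return index;
}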






Daniel Lezcano Aug. 14, 2014, 2:17 p.m. UTC | #11
On 08/14/2014 04:10 PM, Liu, Chuansheng wrote:
>
>
>> -----Original Message-----
>> From: Peter Zijlstra [mailto:peterz@infradead.org]
>> Sent: Thursday, August 14, 2014 9:13 PM
>> To: Liu, Chuansheng
>> Cc: Daniel Lezcano; Rafael J. Wysocki; linux-pm@vger.kernel.org; LKML; Liu,
>> Changcheng; Wang, Xiaoming; Chakravarty, Souvik K
>> Subject: Re: [PATCH] cpuidle: Fix the CPU stuck at C0 for 2-3s after PM_QOS
>> back to DEFAULT
>>
>> On Thu, Aug 14, 2014 at 11:24:06AM +0000, Liu, Chuansheng wrote:
>>> If inspecting the polling flag, we can not fix the race between poll_idle and
>> smp_callback,
>>> since in poll_idle(), before set polling flag, if the smp_callback come in, then
>> no resched bit set,
>>> after that, poll_idle() will do the polling action, without reselection
>> immediately, it will bring power
>>> regression here.
>>
>> -ENOPARSE. Is there a question there?
>
> Lezcano suggest to inspect the polling flag, then code is like below:
> smp_callback() {
> if (polling_flag)
>    set_resched_bit;
> }
>
> And the poll_idle code is like below:
> static int poll_idle(struct cpuidle_device *dev,
>                  struct cpuidle_driver *drv, int index)
> {
>          local_irq_enable();
>          if (!current_set_polling_and_test()) {
>                  while (!need_resched())

Or alternatively, something like:

	while (!need_resched() || kickme) {
		...
	}
			

smp_callback()
{
	kickme = 1;
}

kickme is a percpu variable and set to zero when exiting the 'enter' 
callback.

So we don't mess with the polling flag, which is already a bit tricky.

This patch is very straightforward to illustrate the idea.

>                          cpu_relax();
>          }
>          current_clr_polling();
>
>          return index;
> }
>
> The race is:
> Idle task:
> poll_idle
>    local_irq_enable()
> <== IPI interrupt coming, check the polling flag is not set yet, do nothing;
> Come back to poll_idle, it will stay in the poll loop for a while, instead break
> it immediately to let governor reselect the right C-state.
>
Chuansheng Liu Aug. 14, 2014, 2:26 p.m. UTC | #12
> -----Original Message-----
> From: Daniel Lezcano [mailto:daniel.lezcano@linaro.org]
> Sent: Thursday, August 14, 2014 10:17 PM
> To: Liu, Chuansheng; Peter Zijlstra
> Cc: Rafael J. Wysocki; linux-pm@vger.kernel.org; LKML; Liu, Changcheng;
> Wang, Xiaoming; Chakravarty, Souvik K
> Subject: Re: [PATCH] cpuidle: Fix the CPU stuck at C0 for 2-3s after PM_QOS
> back to DEFAULT
> 
> On 08/14/2014 04:10 PM, Liu, Chuansheng wrote:
> >
> >> -----Original Message-----
> >> From: Peter Zijlstra [mailto:peterz@infradead.org]
> >> Sent: Thursday, August 14, 2014 9:13 PM
> >> To: Liu, Chuansheng
> >> Cc: Daniel Lezcano; Rafael J. Wysocki; linux-pm@vger.kernel.org; LKML; Liu,
> >> Changcheng; Wang, Xiaoming; Chakravarty, Souvik K
> >> Subject: Re: [PATCH] cpuidle: Fix the CPU stuck at C0 for 2-3s after PM_QOS
> >> back to DEFAULT
> >>
> >> On Thu, Aug 14, 2014 at 11:24:06AM +0000, Liu, Chuansheng wrote:
> >>> If inspecting the polling flag, we can not fix the race between poll_idle and smp_callback,
> >>> since in poll_idle(), before set polling flag, if the smp_callback come in, then no resched bit set,
> >>> after that, poll_idle() will do the polling action, without reselection immediately, it will bring power
> >>> regression here.
> >>
> >> -ENOPARSE. Is there a question there?
> >
> > Lezcano suggest to inspect the polling flag, then code is like below:
> > smp_callback() {
> > if (polling_flag)
> >    set_resched_bit;
> > }
> >
> > And the poll_idle code is like below:
> > static int poll_idle(struct cpuidle_device *dev,
> >                  struct cpuidle_driver *drv, int index)
> > {
> >          local_irq_enable();
> >          if (!current_set_polling_and_test()) {
> >                  while (!need_resched())
> 
> Or alternatively, something like:
> 
> 	while (!need_resched() || kickme) {
> 		...
> 	}
> 
> smp_callback()
> {
> 	kickme = 1;
> }
> 
> kickme is a percpu variable and set to zero when exiting the 'enter'
> callback.
> 
> So we don't mess with the polling flag, which is already a bit tricky.
> 
> This patch is very straightforward to illustrate the idea.
> 
> >                          cpu_relax();
> >          }
> >          current_clr_polling();
> >
> >          return index;
> > }
> >

Thanks Lezcano, the new kickme flag sounds like it makes things simpler;
I will try to send a new patch for review. :)
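
For illustration, a rough sketch of how the per-CPU kickme idea could be fleshed out
(hypothetical, not the patch that was eventually sent; the loop here exits when either
need_resched() or the per-CPU flag fires):

/* Hypothetical sketch of the "kickme" per-CPU flag. */
static DEFINE_PER_CPU(bool, kickme);

static void smp_callback(void *v)
{
	/* Ask the polling idle loop on this CPU to bail out so the
	 * governor can pick a new C-state. */
	this_cpu_write(kickme, true);
}

static int poll_idle(struct cpuidle_device *dev,
		     struct cpuidle_driver *drv, int index)
{
	local_irq_enable();
	if (!current_set_polling_and_test()) {
		while (!need_resched() && !this_cpu_read(kickme))
			cpu_relax();
	}
	current_clr_polling();
	this_cpu_write(kickme, false);	/* reset when leaving the state */

	return index;
}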
Andy Lutomirski Aug. 14, 2014, 9:12 p.m. UTC | #13
On 08/14/2014 04:14 AM, Daniel Lezcano wrote:
> On 08/14/2014 01:00 PM, Peter Zijlstra wrote:
>>
>> So seeing how you're from @intel.com I'm assuming you're using x86 here.
>>
>> I'm not seeing how this can be possible, MWAIT is interrupted by IPIs
>> just fine, which means we'll fall out of the cpuidle_enter(), which
>> means we'll cpuidle_reflect(), and then leave cpuidle_idle_call().
>>
>> It will indeed not leave the cpu_idle_loop() function and go right back
>> into cpuidle_idle_call(), but that will then call cpuidle_select() which
>> should pick a new C state.
>>
>> So the interrupt _should_ work. If it doesn't you need to explain why.
> 
> I think the issue is related to the poll_idle state, in
> drivers/cpuidle/driver.c. This state is x86 specific and inserted in the
> cpuidle table as the state 0 (POLL). There is no mwait for this state.
> It is a bit confusing because this state is not listed in the acpi /
> intel idle driver but inserted implicitly at the beginning of the idle
> table by the cpuidle framework when the driver is registered.
> 
> static int poll_idle(struct cpuidle_device *dev,
>                 struct cpuidle_driver *drv, int index)
> {
>         local_irq_enable();
>         if (!current_set_polling_and_test()) {
>                 while (!need_resched())
>                         cpu_relax();
>         }
>         current_clr_polling();
> 
>         return index;
> }

As the most recent person to have modified this function, and as an
avowed hater of pointless IPIs, let me ask a rather different question:
why are you sending IPIs at all?  As of Linux 3.16, poll_idle actually
supports the polling idle interface :)

Can't you just do:

if (set_nr_if_polling(rq->idle)) {
	trace_sched_wake_idle_without_ipi(cpu);
} else {
	spin_lock_irqsave(&rq->lock, flags);
	if (rq->curr == rq->idle)
		smp_send_reschedule(cpu);
	// else the CPU wasn't idle; nothing to do
	raw_spin_unlock_irqrestore(&rq->lock, flags);
}

In the common case (wake from C0, i.e. polling idle), this will skip the
IPI entirely unless you race with idle entry/exit, saving a few more
precious electrons and all of the latency involved in poking the APIC
registers.

--Andy

P.S. "30mV" in the patch description is presumably a typo.
Peter Zijlstra Aug. 14, 2014, 9:16 p.m. UTC | #14
On Thu, Aug 14, 2014 at 02:12:27PM -0700, Andy Lutomirski wrote:
> On 08/14/2014 04:14 AM, Daniel Lezcano wrote:
> > On 08/14/2014 01:00 PM, Peter Zijlstra wrote:
> >>
> >> So seeing how you're from @intel.com I'm assuming you're using x86 here.
> >>
> >> I'm not seeing how this can be possible, MWAIT is interrupted by IPIs
> >> just fine, which means we'll fall out of the cpuidle_enter(), which
> >> means we'll cpuidle_reflect(), and then leave cpuidle_idle_call().
> >>
> >> It will indeed not leave the cpu_idle_loop() function and go right back
> >> into cpuidle_idle_call(), but that will then call cpuidle_select() which
> >> should pick a new C state.
> >>
> >> So the interrupt _should_ work. If it doesn't you need to explain why.
> > 
> > I think the issue is related to the poll_idle state, in
> > drivers/cpuidle/driver.c. This state is x86 specific and inserted in the
> > cpuidle table as the state 0 (POLL). There is no mwait for this state.
> > It is a bit confusing because this state is not listed in the acpi /
> > intel idle driver but inserted implicitly at the beginning of the idle
> > table by the cpuidle framework when the driver is registered.
> > 
> > static int poll_idle(struct cpuidle_device *dev,
> >                 struct cpuidle_driver *drv, int index)
> > {
> >         local_irq_enable();
> >         if (!current_set_polling_and_test()) {
> >                 while (!need_resched())
> >                         cpu_relax();
> >         }
> >         current_clr_polling();
> > 
> >         return index;
> > }
> 
> As the most recent person to have modified this function, and as an
> avowed hater of pointless IPIs, let me ask a rather different question:
> why are you sending IPIs at all?  As of Linux 3.16, poll_idle actually
> supports the polling idle interface :)
> 
> Can't you just do:
> 
> if (set_nr_if_polling(rq->idle)) {
> 	trace_sched_wake_idle_without_ipi(cpu);
> } else {
> 	spin_lock_irqsave(&rq->lock, flags);
> 	if (rq->curr == rq->idle)
> 		smp_send_reschedule(cpu);
> 	// else the CPU wasn't idle; nothing to do
> 	raw_spin_unlock_irqrestore(&rq->lock, flags);
> }
> 
> In the common case (wake from C0, i.e. polling idle), this will skip the
> IPI entirely unless you race with idle entry/exit, saving a few more
> precious electrons and all of the latency involved in poking the APIC
> registers.

They could and they probably should, but that logic should _not_ live in
the cpuidle driver.

And as stated elsewhere in the thread; they also need to fix their
kick_all_cpus_sync() usage, because that's similarly wrecked.
Andy Lutomirski Aug. 14, 2014, 9:22 p.m. UTC | #15
On Thu, Aug 14, 2014 at 2:16 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Aug 14, 2014 at 02:12:27PM -0700, Andy Lutomirski wrote:
>> On 08/14/2014 04:14 AM, Daniel Lezcano wrote:
>> > On 08/14/2014 01:00 PM, Peter Zijlstra wrote:
>> >>
>> >> So seeing how you're from @intel.com I'm assuming you're using x86 here.
>> >>
>> >> I'm not seeing how this can be possible, MWAIT is interrupted by IPIs
>> >> just fine, which means we'll fall out of the cpuidle_enter(), which
>> >> means we'll cpuidle_reflect(), and then leave cpuidle_idle_call().
>> >>
>> >> It will indeed not leave the cpu_idle_loop() function and go right back
>> >> into cpuidle_idle_call(), but that will then call cpuidle_select() which
>> >> should pick a new C state.
>> >>
>> >> So the interrupt _should_ work. If it doesn't you need to explain why.
>> >
>> > I think the issue is related to the poll_idle state, in
>> > drivers/cpuidle/driver.c. This state is x86 specific and inserted in the
>> > cpuidle table as the state 0 (POLL). There is no mwait for this state.
>> > It is a bit confusing because this state is not listed in the acpi /
>> > intel idle driver but inserted implicitly at the beginning of the idle
>> > table by the cpuidle framework when the driver is registered.
>> >
>> > static int poll_idle(struct cpuidle_device *dev,
>> >                 struct cpuidle_driver *drv, int index)
>> > {
>> >         local_irq_enable();
>> >         if (!current_set_polling_and_test()) {
>> >                 while (!need_resched())
>> >                         cpu_relax();
>> >         }
>> >         current_clr_polling();
>> >
>> >         return index;
>> > }
>>
>> As the most recent person to have modified this function, and as an
>> avowed hater of pointless IPIs, let me ask a rather different question:
>> why are you sending IPIs at all?  As of Linux 3.16, poll_idle actually
>> supports the polling idle interface :)
>>
>> Can't you just do:
>>
>> if (set_nr_if_polling(rq->idle)) {
>>       trace_sched_wake_idle_without_ipi(cpu);
>> } else {
>>       spin_lock_irqsave(&rq->lock, flags);
>>       if (rq->curr == rq->idle)
>>               smp_send_reschedule(cpu);
>>       // else the CPU wasn't idle; nothing to do
>>       raw_spin_unlock_irqrestore(&rq->lock, flags);
>> }
>>
>> In the common case (wake from C0, i.e. polling idle), this will skip the
>> IPI entirely unless you race with idle entry/exit, saving a few more
>> precious electrons and all of the latency involved in poking the APIC
>> registers.
>
> They could and they probably should, but that logic should _not_ live in
> the cpuidle driver.

Sure.  My point is that fixing the IPI handler is, I think, totally
bogus, because the IPI API isn't the right way to do this at all.

It would be straightforward to add a new function wake_if_idle(int
cpu) to sched/core.c.

--Andy
Chuansheng Liu Aug. 15, 2014, 1:21 a.m. UTC | #16
> -----Original Message-----
> From: Andy Lutomirski [mailto:luto@amacapital.net]
> Sent: Friday, August 15, 2014 5:23 AM
> To: Peter Zijlstra
> Cc: Daniel Lezcano; Liu, Chuansheng; Rafael J. Wysocki;
> linux-pm@vger.kernel.org; LKML; Liu, Changcheng; Wang, Xiaoming;
> Chakravarty, Souvik K
> Subject: Re: [PATCH] cpuidle: Fix the CPU stuck at C0 for 2-3s after PM_QOS
> back to DEFAULT
> 
> Sure.  My point is that fixing the IPI handler is, I think, totally
> bogus, because the IPI API isn't the right way to do this at all.
> 
> It would be straightforward to add a new function wake_if_idle(int
> cpu) to sched/core.c.
> 
Thanks Andy and Peter's suggestion, it will save some IPI things in case the cores are not
in idle.

There is one similar API in sched/core.c wake_up_idle_cpu(),
then just need add one new common smp API:

smp_wake_up_cpus() {
for_each_online_cpu()
  wake_up_idle_cpu();
}

Will try one patch for it.
Andy Lutomirski Aug. 15, 2014, 1:27 a.m. UTC | #17
On Thu, Aug 14, 2014 at 6:21 PM, Liu, Chuansheng
<chuansheng.liu@intel.com> wrote:
>
>
>> -----Original Message-----
>> From: Andy Lutomirski [mailto:luto@amacapital.net]
>> Sent: Friday, August 15, 2014 5:23 AM
>> To: Peter Zijlstra
>> Cc: Daniel Lezcano; Liu, Chuansheng; Rafael J. Wysocki;
>> linux-pm@vger.kernel.org; LKML; Liu, Changcheng; Wang, Xiaoming;
>> Chakravarty, Souvik K
>> Subject: Re: [PATCH] cpuidle: Fix the CPU stuck at C0 for 2-3s after PM_QOS
>> back to DEFAULT
>>
>> On Thu, Aug 14, 2014 at 2:16 PM, Peter Zijlstra <peterz@infradead.org>
>> wrote:
>> > On Thu, Aug 14, 2014 at 02:12:27PM -0700, Andy Lutomirski wrote:
>> >> On 08/14/2014 04:14 AM, Daniel Lezcano wrote:
>> >> > On 08/14/2014 01:00 PM, Peter Zijlstra wrote:
>> >> >>
>> >> >> So seeing how you're from @intel.com I'm assuming you're using x86
>> here.
>> >> >>
>> >> >> I'm not seeing how this can be possible, MWAIT is interrupted by IPIs
>> >> >> just fine, which means we'll fall out of the cpuidle_enter(), which
>> >> >> means we'll cpuidle_reflect(), and then leave cpuidle_idle_call().
>> >> >>
>> >> >> It will indeed not leave the cpu_idle_loop() function and go right back
>> >> >> into cpuidle_idle_call(), but that will then call cpuidle_select() which
>> >> >> should pick a new C state.
>> >> >>
>> >> >> So the interrupt _should_ work. If it doesn't you need to explain why.
>> >> >
>> >> > I think the issue is related to the poll_idle state, in
>> >> > drivers/cpuidle/driver.c. This state is x86 specific and inserted in the
>> >> > cpuidle table as the state 0 (POLL). There is no mwait for this state.
>> >> > It is a bit confusing because this state is not listed in the acpi /
>> >> > intel idle driver but inserted implicitly at the beginning of the idle
>> >> > table by the cpuidle framework when the driver is registered.
>> >> >
>> >> > static int poll_idle(struct cpuidle_device *dev,
>> >> >                 struct cpuidle_driver *drv, int index)
>> >> > {
>> >> >         local_irq_enable();
>> >> >         if (!current_set_polling_and_test()) {
>> >> >                 while (!need_resched())
>> >> >                         cpu_relax();
>> >> >         }
>> >> >         current_clr_polling();
>> >> >
>> >> >         return index;
>> >> > }
>> >>
>> >> As the most recent person to have modified this function, and as an
>> >> avowed hater of pointless IPIs, let me ask a rather different question:
>> >> why are you sending IPIs at all?  As of Linux 3.16, poll_idle actually
>> >> supports the polling idle interface :)
>> >>
>> >> Can't you just do:
>> >>
>> >> if (set_nr_if_polling(rq->idle)) {
>> >>       trace_sched_wake_idle_without_ipi(cpu);
>> >> } else {
>> >>       spin_lock_irqsave(&rq->lock, flags);
>> >>       if (rq->curr == rq->idle)
>> >>               smp_send_reschedule(cpu);
>> >>       // else the CPU wasn't idle; nothing to do
>> >>       raw_spin_unlock_irqrestore(&rq->lock, flags);
>> >> }
>> >>
>> >> In the common case (wake from C0, i.e. polling idle), this will skip the
>> >> IPI entirely unless you race with idle entry/exit, saving a few more
>> >> precious electrons and all of the latency involved in poking the APIC
>> >> registers.
>> >
>> > They could and they probably should, but that logic should _not_ live in
>> > the cpuidle driver.
>>
>> Sure.  My point is that fixing the IPI handler is, I think, totally
>> bogus, because the IPI API isn't the right way to do this at all.
>>
>> It would be straightforward to add a new function wake_if_idle(int
>> cpu) to sched/core.c.
>>
> Thanks Andy and Peter's suggestion, it will save some IPI things in case the cores are not
> in idle.

This isn't quite right.  Using the polling interface correctly will
save IPIs in case the core *is* idle.  But, given that you are trying
to upgrade the chosen idle state, I don't think you need to kick
non-idle CPUs at all, and my example contains that optimization.

Presumably the function should be named something like wake_up_if_idle.

>
> There is one similar API in sched/core.c wake_up_idle_cpu(),
> then just need add one new common smp API:
>
> smp_wake_up_cpus() {
> for_each_online_cpu()
>   wake_up_idle_cpu();
> }
>
> Will try one patch for it.

This will have lots of extra overhead if the cpu is *not* idle.  I
think my example will be a lot more efficient.

--Andy
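
As a reference point, the wake_up_if_idle() helper Andy describes could look roughly
like the sketch below in kernel/sched/core.c (along the lines of his earlier snippet,
not a tested implementation; it assumes the scheduler-internal definitions from
kernel/sched/sched.h):

/* Sketch of a wake_up_if_idle(cpu) helper. */
void wake_up_if_idle(int cpu)
{
	struct rq *rq = cpu_rq(cpu);
	unsigned long flags;

	if (!is_idle_task(rq->curr))
		return;		/* CPU is busy; nothing to upgrade */

	if (set_nr_if_polling(rq->idle)) {
		/* Idle task is polling: setting TIF_NEED_RESCHED is
		 * enough, no IPI needed. */
		trace_sched_wake_idle_without_ipi(cpu);
	} else {
		raw_spin_lock_irqsave(&rq->lock, flags);
		if (is_idle_task(rq->curr))
			smp_send_reschedule(cpu);
		/* else the CPU stopped being idle meanwhile; nothing to do */
		raw_spin_unlock_irqrestore(&rq->lock, flags);
	}
}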

Patch

diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index ee9df5e..9e28a13 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -532,7 +532,13 @@  EXPORT_SYMBOL_GPL(cpuidle_register);
 
 static void smp_callback(void *v)
 {
-	/* we already woke the CPU up, nothing more to do */
+	/* we already woke the CPU up, and when the corresponding
+	 * CPU is at polling idle state, we need to set the sched
+	 * bit to trigger reselect the new suitable C-state, it
+	 * will be helpful for power.
+	*/
+	if (is_idle_task(current))
+		set_tsk_need_resched(current);
 }
 
 /*