cpu/hotplug: handle unbalanced hotplug enable/disable

Message ID	CAJFUiJizA=ZxXh5BNj-eL6xsVrNEbTnd0Z5yePPDxAR8YjGibw@mail.gmail.com (mailing list archive)
State	Not Applicable, archived
Headers	show Return-Path: <linux-pm-owner@kernel.org> MIME-Version: 1.0 In-Reply-To: <alpine.DEB.2.11.1605021007310.3692@nanos> References: <1461214567-3356-1-git-send-email-lianwei.wang@gmail.com> <20160421105042.GI3408@twins.programming.kicks-ass.net> <CAJFUiJih1YMast1S2z4_sfVaLwsPpHyRrS9tAhwTket2KbRDxA@mail.gmail.com> <alpine.DEB.2.11.1604221834430.3941@nanos> <CAJFUiJg61i16JdC4LiQOXmD9Rrg8VDf7BEWs124dFTkbd4Rg-g@mail.gmail.com> <alpine.DEB.2.11.1604251018090.3941@nanos> <CAJFUiJjSvzv65sHApE5Gcp9oWA-j5j7JLb6b0=1eETrA4uQLkQ@mail.gmail.com> <alpine.DEB.2.11.1604271216400.3941@nanos> <CAJFUiJir_zvQDXub=1zwPKJoZm_AnWNpGXuSUiCeHAEwEEUCMQ@mail.gmail.com> <alpine.DEB.2.11.1604280813570.3941@nanos> <CAJFUiJgUDSYE+6MG3riVyFjhhqckeY_3nEqer52tkUOy_yBeOw@mail.gmail.com> <alpine.DEB.2.11.1604290240560.3718@nanos> <CAJFUiJgSDwtpXWPcGu704=5tggOntmyfk=muOtDLRcSUJQc8Rg@mail.gmail.com> <alpine.DEB.2.11.1605021007310.3692@nanos> From: Lianwei Wang <lianwei.wang@gmail.com> Date: Wed, 4 May 2016 00:23:52 -0700 Message-ID: <CAJFUiJizA=ZxXh5BNj-eL6xsVrNEbTnd0Z5yePPDxAR8YjGibw@mail.gmail.com> Subject: Re: [PATCH] cpu/hotplug: handle unbalanced hotplug enable/disable To: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org>, oleg@redhat.com, Ingo Molnar <mingo@kernel.org>, linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org Content-Type: text/plain; charset=UTF-8 Sender: linux-pm-owner@vger.kernel.org Precedence: bulk

Lianwei Wang May 4, 2016, 7:23 a.m. UTC

On Mon, May 2, 2016 at 1:11 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Fri, 29 Apr 2016, Lianwei Wang wrote:
>> On Thu, Apr 28, 2016 at 5:44 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>> > Wrong. This is the symptom. The root cause is in #1. Therefor you are trying
>> > to fix the symptom and not the root cause
>> >
>> I don't understand why you keep saying that the issue is in the pm
>> notifier callback. As I told you, the pm notifier return an error(or
>> NOTIFY_BAD) on purpose to abort the suspend process. This is work as
>> design. Any driver can abort the suspend process if it is not ready to
>> suspend.
>
> Right. That's not the issue. The issue is that as a consequence we end up with
> an unbalanced count. So how do we end up with an unbalanced count? That's what
> needs to be fixed and not worked around.
>
In this example, the unbalanced count is caused by the
cpu_hotplug_pm_callback pm notifier callback function. We can add a
variable to avoid the unbalanced call of cpu_hotplug_enable ,e.g.

Please let me know if you like to fix it in this way.

But actually I think we don't need to add a new variable to check if
the cpu_hotplug_disable() is called or not. We already have a disable
counter which can be used to check if the cpu_hotpug_disable is called
or not, as my original patch do in cpu_hotplug_enable() function.
Maybe the reset comments and reset cpu_hotplug_disabled to 0 operation
confuse you. I should check it firstly and do nothing if it is already
0.
e.g.
+static void _cpu_hotplug_enable(void)
+{
+       if (WARN(!cpu_hotplug_disabled, "Unbalanced cpu hotplug enable\n"))
+               return;
+
+       cpu_hotplug_disabled--;
+}

I like to fix it in the cpu_hotplug_enable because it is a public
kernel API and fix in it can prevent any other unbalanced calling. I
will update the patch.

>> > I completely understand that you are tyring to put the cart before the horse.
>> No. Your understanding is wrong.
>
> My understanding is very correct. We have a situation which leads to an
> unbalanced count. Instead of fixing that, you fix up the unbalanced count.
>
Yes, that's right. We are on the same page. The only difference is
that where/how to fix it. See my two solutions above and let me know
which one you like?
> Thanks,
>
>         tglx
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Thomas Gleixner May 5, 2016, 12:13 p.m. UTC | #1

On Wed, 4 May 2016, Lianwei Wang wrote:
> In this example, the unbalanced count is caused by the
> cpu_hotplug_pm_callback pm notifier callback function.

I doubt that.

> We can add a variable to avoid the unbalanced call of cpu_hotplug_enable
> ,e.g.

> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index 3e3f6e49eabb..aa6694f0e9d3 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -1140,16 +1140,21 @@ static int
>  cpu_hotplug_pm_callback(struct notifier_block *nb,
>                         unsigned long action, void *ptr)
>  {
> +       static int disabled;
> +
>         switch (action) {
> 
>         case PM_SUSPEND_PREPARE:
>         case PM_HIBERNATION_PREPARE:
>                 cpu_hotplug_disable();
> +               disabled = 1;
>                 break;
> 
>         case PM_POST_SUSPEND:
>         case PM_POST_HIBERNATION:
> -               cpu_hotplug_enable();
> +               if (disabled)
> +                       cpu_hotplug_enable();
> +               disabled = 0;
>                 break;
> 
>         default:
> 
> Please let me know if you like to fix it in this way.

So you are moving the work around one step down w/o providing any reasonable
explanation how this asymetric call of that callback can happen.

Can you eventually come up with a coherent explanation of the problem down to
the root cause or are we going to play this "move the workaround one step
down" game for another 10 rounds?
 
> +static void _cpu_hotplug_enable(void)
> +{
> +       if (WARN(!cpu_hotplug_disabled, "Unbalanced cpu hotplug enable\n"))
> +               return;
> +
> +       cpu_hotplug_disabled--;
> +}
> 
> I like to fix it in the cpu_hotplug_enable because it is a public

You CANNOT fix it there. The problem is the call site and NOT
cpu_hotplug_enable(). Can you finally accept this?

> kernel API and fix in it can prevent any other unbalanced calling. I

It cannot prevent any unbalanced calls. It mitigates the issue, but that's a
different problem.

We can discuss that seperately after fixing the offending call site.

Thanks,

	tglx
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Lianwei Wang May 6, 2016, 7:06 a.m. UTC | #2

On Thu, May 5, 2016 at 5:13 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Wed, 4 May 2016, Lianwei Wang wrote:
>> In this example, the unbalanced count is caused by the
>> cpu_hotplug_pm_callback pm notifier callback function.
>
> I doubt that.
>
>> We can add a variable to avoid the unbalanced call of cpu_hotplug_enable
>> ,e.g.
>
>> diff --git a/kernel/cpu.c b/kernel/cpu.c
>> index 3e3f6e49eabb..aa6694f0e9d3 100644
>> --- a/kernel/cpu.c
>> +++ b/kernel/cpu.c
>> @@ -1140,16 +1140,21 @@ static int
>>  cpu_hotplug_pm_callback(struct notifier_block *nb,
>>                         unsigned long action, void *ptr)
>>  {
>> +       static int disabled;
>> +
>>         switch (action) {
>>
>>         case PM_SUSPEND_PREPARE:
>>         case PM_HIBERNATION_PREPARE:
>>                 cpu_hotplug_disable();
>> +               disabled = 1;
>>                 break;
>>
>>         case PM_POST_SUSPEND:
>>         case PM_POST_HIBERNATION:
>> -               cpu_hotplug_enable();
>> +               if (disabled)
>> +                       cpu_hotplug_enable();
>> +               disabled = 0;
>>                 break;
>>
>>         default:
>>
>> Please let me know if you like to fix it in this way.
>
> So you are moving the work around one step down w/o providing any reasonable
> explanation how this asymetric call of that callback can happen.
>
> Can you eventually come up with a coherent explanation of the problem down to
> the root cause or are we going to play this "move the workaround one step
> down" game for another 10 rounds?
>
Do you agree that any driver can abort the suspend process by
returning an error or NOTIFY_BAD if it is not ready to suspend?
I have explain it and I also copied the example code that abort
suspend by returning an error or NOTIFY_BAD in the pm notifier
callback function.
The cpu_hotplug_disable and cpu_hotplug_enable are called in one of
the PM notifier callback. And they are called from two difference
place.
Below is how it happened:
  pm_suspend
    |--enter_state
        |--suspend_prepare
            |--pm_notifier_call_chain(PM_SUSPEND_PREPARE)
            |    |--call_back_1
            |    |--call_back_..
            |    |--call_back_n ===> return NOTIFY_BAD to abort call chain and
            |    |                                suspend process here
            |    |--cpu_hotplug_pm_callback()
            |    |   |--cpu_hotplug_disable  =====> remember it is not
called yet
            |    |--call_back_..
            |
            |--pm_notifier_call_chain(PM_POST_SUSPEND)
            |    |--call_back_1
            |    |--call_back_..
            |    |--call_back_n
            |    |--cpu_hotplug_pm_callback()
            |    |   |--cpu_hotplug_enable  =====> Here it is unbalanced called
            |    |--call_back_..
            |
So, keep in mind that for pm notifier call chain, the
PM_SUSPEND_PREPARE notifier and PM_POST_SUSPEND notifier is not always
paired called. Sometimes for a driver's pm notifier callback, the
PM_POST_SUSPEND is called without PM_SUSPEND_PREPARE.

>> +static void _cpu_hotplug_enable(void)
>> +{
>> +       if (WARN(!cpu_hotplug_disabled, "Unbalanced cpu hotplug enable\n"))
>> +               return;
>> +
>> +       cpu_hotplug_disabled--;
>> +}
>>
>> I like to fix it in the cpu_hotplug_enable because it is a public
>
> You CANNOT fix it there. The problem is the call site and NOT
> cpu_hotplug_enable(). Can you finally accept this?
I know what you mean. But why let the driver unconditionally do
"--cpu_hotplug_disabled" without any checking. It should do nothing if
it detect a unbalanced enable. A good example for unbalanced checking
from enable_irq is here:
http://lxr.free-electrons.com/source/kernel/irq/manage.c#L512

It's the cpu hotplug driver's responsibility to guarantee that the cpu
hotplug always working well even others failed do something to it. And
the driver can check and handle it itself, why not let the driver to
handle it and make the cpu hotplug driver more strong?
>
>> kernel API and fix in it can prevent any other unbalanced calling.
>
> It cannot prevent any unbalanced calls. It mitigates the issue, but that's a
> different problem.
It did not migrate the issue. It give a warning message to log the
unbalanced issue and it also make sure the cpu hotplug continue to
work well even someone do an unbalanced call. It is a good checking as
the enable_irq/disable_irq do. There are some other unbalanced
checking in kernel too. All make sure the kernel has a better
stability.
>
> We can discuss that seperately after fixing the offending call site.
>
> Thanks,
>
>         tglx
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Thomas Gleixner May 6, 2016, 7:18 a.m. UTC | #3

On Fri, 6 May 2016, Lianwei Wang wrote:
> On Thu, May 5, 2016 at 5:13 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > Can you eventually come up with a coherent explanation of the problem down to
> > the root cause or are we going to play this "move the workaround one step
> > down" game for another 10 rounds?
> >
> Do you agree that any driver can abort the suspend process by
> returning an error or NOTIFY_BAD if it is not ready to suspend?
> I have explain it and I also copied the example code that abort
> suspend by returning an error or NOTIFY_BAD in the pm notifier
> callback function.

I don't need copied example code which does not tell me what the real problem
is.

> The cpu_hotplug_disable and cpu_hotplug_enable are called in one of
> the PM notifier callback. And they are called from two difference
> place.
> Below is how it happened:
>   pm_suspend
>     |--enter_state
>         |--suspend_prepare
>             |--pm_notifier_call_chain(PM_SUSPEND_PREPARE)
>             |    |--call_back_1
>             |    |--call_back_..
>             |    |--call_back_n ===> return NOTIFY_BAD to abort call chain and
>             |    |                                suspend process here
>             |    |--cpu_hotplug_pm_callback()
>             |    |   |--cpu_hotplug_disable  =====> remember it is not
> called yet
>             |    |--call_back_..
>             |
>             |--pm_notifier_call_chain(PM_POST_SUSPEND)
>             |    |--call_back_1
>             |    |--call_back_..
>             |    |--call_back_n
>             |    |--cpu_hotplug_pm_callback()
>             |    |   |--cpu_hotplug_enable  =====> Here it is unbalanced called
>             |    |--call_back_..
>             |
> So, keep in mind that for pm notifier call chain, the
> PM_SUSPEND_PREPARE notifier and PM_POST_SUSPEND notifier is not always
> paired called. Sometimes for a driver's pm notifier callback, the
> PM_POST_SUSPEND is called without PM_SUSPEND_PREPARE.

So that is the real problem: cpu_hotplug_pm_callback(PM_POST_SUSPEND) can be
called w/o a previous call to cpu_hotplug_pm_callback(PM_SUSPEND_PREPARE).
 
> > It cannot prevent any unbalanced calls. It mitigates the issue, but that's a
> > different problem.
> It did not migrate the issue. It give a warning message to log the
> unbalanced issue and it also make sure the cpu hotplug continue to
> work well even someone do an unbalanced call. It is a good checking as
> the enable_irq/disable_irq do. There are some other unbalanced
> checking in kernel too. All make sure the kernel has a better
> stability.

I'm not opposed to do that and I said so several times. But I said as well,
that we do not add this without fixing the problem which made you write that
patch in the first place.

So we have a proper explanation for the real problem now, but we have no
fix. 

And again: Your patch is NOT a fix. Simply because it will emit a warning
everytime the above happens. And that's wrong because the abort is a
legitimate scenario.

So please come up with a sensible fix for the suspend abort issue and then we
can add the balance check/fixup to the hotplug_disable/enable() code.

Thanks,

	tglx


--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Lianwei Wang May 12, 2016, 8:06 a.m. UTC | #4

I have come up a patch to make the pm notifier called symmetrically
and currently being tested. I will send it out after pass the test.

On Fri, May 6, 2016 at 12:18 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Fri, 6 May 2016, Lianwei Wang wrote:
>> On Thu, May 5, 2016 at 5:13 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>> > Can you eventually come up with a coherent explanation of the problem down to
>> > the root cause or are we going to play this "move the workaround one step
>> > down" game for another 10 rounds?
>> >
>> Do you agree that any driver can abort the suspend process by
>> returning an error or NOTIFY_BAD if it is not ready to suspend?
>> I have explain it and I also copied the example code that abort
>> suspend by returning an error or NOTIFY_BAD in the pm notifier
>> callback function.
>
> I don't need copied example code which does not tell me what the real problem
> is.
>
>> The cpu_hotplug_disable and cpu_hotplug_enable are called in one of
>> the PM notifier callback. And they are called from two difference
>> place.
>> Below is how it happened:
>>   pm_suspend
>>     |--enter_state
>>         |--suspend_prepare
>>             |--pm_notifier_call_chain(PM_SUSPEND_PREPARE)
>>             |    |--call_back_1
>>             |    |--call_back_..
>>             |    |--call_back_n ===> return NOTIFY_BAD to abort call chain and
>>             |    |                                suspend process here
>>             |    |--cpu_hotplug_pm_callback()
>>             |    |   |--cpu_hotplug_disable  =====> remember it is not
>> called yet
>>             |    |--call_back_..
>>             |
>>             |--pm_notifier_call_chain(PM_POST_SUSPEND)
>>             |    |--call_back_1
>>             |    |--call_back_..
>>             |    |--call_back_n
>>             |    |--cpu_hotplug_pm_callback()
>>             |    |   |--cpu_hotplug_enable  =====> Here it is unbalanced called
>>             |    |--call_back_..
>>             |
>> So, keep in mind that for pm notifier call chain, the
>> PM_SUSPEND_PREPARE notifier and PM_POST_SUSPEND notifier is not always
>> paired called. Sometimes for a driver's pm notifier callback, the
>> PM_POST_SUSPEND is called without PM_SUSPEND_PREPARE.
>
> So that is the real problem: cpu_hotplug_pm_callback(PM_POST_SUSPEND) can be
> called w/o a previous call to cpu_hotplug_pm_callback(PM_SUSPEND_PREPARE).
>
>> > It cannot prevent any unbalanced calls. It mitigates the issue, but that's a
>> > different problem.
>> It did not migrate the issue. It give a warning message to log the
>> unbalanced issue and it also make sure the cpu hotplug continue to
>> work well even someone do an unbalanced call. It is a good checking as
>> the enable_irq/disable_irq do. There are some other unbalanced
>> checking in kernel too. All make sure the kernel has a better
>> stability.
>
> I'm not opposed to do that and I said so several times. But I said as well,
> that we do not add this without fixing the problem which made you write that
> patch in the first place.
>
> So we have a proper explanation for the real problem now, but we have no
> fix.
>
> And again: Your patch is NOT a fix. Simply because it will emit a warning
> everytime the above happens. And that's wrong because the abort is a
> legitimate scenario.
>
> So please come up with a sensible fix for the suspend abort issue and then we
> can add the balance check/fixup to the hotplug_disable/enable() code.
>
> Thanks,
>
>         tglx
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Lianwei Wang June 7, 2016, 5:38 a.m. UTC | #5

On Thu, May 12, 2016 at 1:06 AM, Lianwei Wang <lianwei.wang@gmail.com> wrote:
> I have come up a patch to make the pm notifier called symmetrically
> and currently being tested. I will send it out after pass the test.
>
> On Fri, May 6, 2016 at 12:18 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>> On Fri, 6 May 2016, Lianwei Wang wrote:
>>> On Thu, May 5, 2016 at 5:13 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>> > Can you eventually come up with a coherent explanation of the problem down to
>>> > the root cause or are we going to play this "move the workaround one step
>>> > down" game for another 10 rounds?
>>> >
>>> Do you agree that any driver can abort the suspend process by
>>> returning an error or NOTIFY_BAD if it is not ready to suspend?
>>> I have explain it and I also copied the example code that abort
>>> suspend by returning an error or NOTIFY_BAD in the pm notifier
>>> callback function.
>>
>> I don't need copied example code which does not tell me what the real problem
>> is.
>>
>>> The cpu_hotplug_disable and cpu_hotplug_enable are called in one of
>>> the PM notifier callback. And they are called from two difference
>>> place.
>>> Below is how it happened:
>>>   pm_suspend
>>>     |--enter_state
>>>         |--suspend_prepare
>>>             |--pm_notifier_call_chain(PM_SUSPEND_PREPARE)
>>>             |    |--call_back_1
>>>             |    |--call_back_..
>>>             |    |--call_back_n ===> return NOTIFY_BAD to abort call chain and
>>>             |    |                                suspend process here
>>>             |    |--cpu_hotplug_pm_callback()
>>>             |    |   |--cpu_hotplug_disable  =====> remember it is not
>>> called yet
>>>             |    |--call_back_..
>>>             |
>>>             |--pm_notifier_call_chain(PM_POST_SUSPEND)
>>>             |    |--call_back_1
>>>             |    |--call_back_..
>>>             |    |--call_back_n
>>>             |    |--cpu_hotplug_pm_callback()
>>>             |    |   |--cpu_hotplug_enable  =====> Here it is unbalanced called
>>>             |    |--call_back_..
>>>             |
>>> So, keep in mind that for pm notifier call chain, the
>>> PM_SUSPEND_PREPARE notifier and PM_POST_SUSPEND notifier is not always
>>> paired called. Sometimes for a driver's pm notifier callback, the
>>> PM_POST_SUSPEND is called without PM_SUSPEND_PREPARE.
>>
>> So that is the real problem: cpu_hotplug_pm_callback(PM_POST_SUSPEND) can be
>> called w/o a previous call to cpu_hotplug_pm_callback(PM_SUSPEND_PREPARE).
>>
>>> > It cannot prevent any unbalanced calls. It mitigates the issue, but that's a
>>> > different problem.
>>> It did not migrate the issue. It give a warning message to log the
>>> unbalanced issue and it also make sure the cpu hotplug continue to
>>> work well even someone do an unbalanced call. It is a good checking as
>>> the enable_irq/disable_irq do. There are some other unbalanced
>>> checking in kernel too. All make sure the kernel has a better
>>> stability.
>>
>> I'm not opposed to do that and I said so several times. But I said as well,
>> that we do not add this without fixing the problem which made you write that
>> patch in the first place.
>>
>> So we have a proper explanation for the real problem now, but we have no
>> fix.
>>
>> And again: Your patch is NOT a fix. Simply because it will emit a warning
>> everytime the above happens. And that's wrong because the abort is a
>> legitimate scenario.
>>
>> So please come up with a sensible fix for the suspend abort issue and then we
>> can add the balance check/fixup to the hotplug_disable/enable() code.
>>
>> Thanks,
>>
>>         tglx
>>
>>

I think this is the best solution to fix this unbalanced issue for now.
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

cpu/hotplug: handle unbalanced hotplug enable/disable

Commit Message

Comments

Patch