hwmon: (coretemp) Handle frozen hotplug state correctly

Message ID alpine.DEB.2.20.1705101624460.1979@nanos (mailing list archive)
State Accepted

Commit Message

Thomas Gleixner May 10, 2017, 2:30 p.m. UTC
The recent conversion to the hotplug state machine missed that the original
hotplug notifiers did not execute in the frozen state, which is used on
suspend and resume.

This does not matter on single socket machines, but on multi socket systems
this breaks when the device for a non-boot socket is removed when the last
CPU of that socket is brought offline. The device removal locks up the
machine hard w/o any debug output.

Prevent executing the hotplug callbacks when cpuhp_tasks_frozen is true.

Thanks to Tommi for providing debug information patiently while I failed to
spot the obvious.

Fixes: e00ca5df37ad ("hwmon: (coretemp) Convert to hotplug state machine")
Reported-by: Tommi Rantala <tt.rantala@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 drivers/hwmon/coretemp.c |   14 ++++++++++++++
 1 file changed, 14 insertions(+)

--
To unsubscribe from this list: send the line "unsubscribe linux-hwmon" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
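
The behaviour described in the commit message can be modelled in user space.
The following is only a sketch under stated assumptions, not the driver code:
cpuhp_tasks_frozen is a plain bool standing in for the kernel variable, and
struct socket_model, model_cpu_online(), model_cpu_offline(), model_boot() and
model_suspend_resume() are hypothetical names invented for illustration. The
per-socket counter is a simplification; the real driver's bookkeeping is not
part of this patch and is not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_SOCKETS	2
#define CPUS_PER_SOCKET	2
#define NR_MODEL_CPUS	(NR_SOCKETS * CPUS_PER_SOCKET)

/* Stand-in for the kernel flag that is set while suspend/resume runs. */
bool cpuhp_tasks_frozen;

/* Hypothetical per-socket state, one "device" per socket. */
struct socket_model {
	int online_cpus;	/* CPUs of this socket seen by the callbacks */
	bool device_present;	/* models the per-socket platform device */
	int device_removals;	/* how often the device was torn down */
};

struct socket_model sockets[NR_SOCKETS];

static struct socket_model *cpu_to_socket(unsigned int cpu)
{
	assert(cpu < NR_MODEL_CPUS);
	return &sockets[cpu / CPUS_PER_SOCKET];
}

int model_cpu_online(unsigned int cpu)
{
	struct socket_model *s = cpu_to_socket(cpu);

	/* Skip on resume: the offline callback did not run on suspend. */
	if (cpuhp_tasks_frozen)
		return 0;

	/* First CPU of a socket creates the device. */
	if (s->online_cpus++ == 0)
		s->device_present = true;
	return 0;
}

int model_cpu_offline(unsigned int cpu)
{
	struct socket_model *s = cpu_to_socket(cpu);

	/* Skip on suspend: device removal is what must not happen here. */
	if (cpuhp_tasks_frozen)
		return 0;

	if (!s->device_present)
		return 0;

	/* Last CPU of a socket removes the device. */
	if (--s->online_cpus == 0) {
		s->device_present = false;
		s->device_removals++;
	}
	return 0;
}

/* Reset the model and bring every CPU online, as at boot. */
void model_boot(void)
{
	unsigned int cpu;

	cpuhp_tasks_frozen = false;
	for (cpu = 0; cpu < NR_SOCKETS; cpu++)
		sockets[cpu] = (struct socket_model){0};
	for (cpu = 0; cpu < NR_MODEL_CPUS; cpu++)
		model_cpu_online(cpu);
}

/* Take all non-boot CPUs down and up again with the frozen flag set. */
void model_suspend_resume(void)
{
	unsigned int cpu;

	cpuhp_tasks_frozen = true;
	for (cpu = NR_MODEL_CPUS - 1; cpu >= 1; cpu--)
		model_cpu_offline(cpu);
	for (cpu = 1; cpu < NR_MODEL_CPUS; cpu++)
		model_cpu_online(cpu);
	cpuhp_tasks_frozen = false;
}
```

In this model, calling model_boot() followed by model_suspend_resume() leaves
the device of the non-boot socket untouched, while a regular offline of CPUs 2
and 3 afterwards still removes it exactly once.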

Comments

Tommi Rantala May 10, 2017, 7:16 p.m. UTC | #1
2017-05-10 17:30 GMT+03:00 Thomas Gleixner <tglx@linutronix.de>:
> The recent conversion to the hotplug state machine missed that the original
> hotplug notifiers did not execute in the frozen state, which is used on
> suspend and resume.
>
> This does not matter on single socket machines, but on multi socket systems
> this breaks when the device for a non-boot socket is removed when the last
> CPU of that socket is brought offline. The device removal locks up the
> machine hard w/o any debug output.
>
> Prevent executing the hotplug callbacks when cpuhp_tasks_frozen is true.
>
> Thanks to Tommi for providing debug information patiently while I failed to
> spot the obvious.
>
> Fixes: e00ca5df37ad ("hwmon: (coretemp) Convert to hotplug state machine")
> Reported-by: Tommi Rantala <tt.rantala@gmail.com>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Many thanks, I can confirm that it works well!

-Tommi

> ---
>  drivers/hwmon/coretemp.c |   14 ++++++++++++++
>  1 file changed, 14 insertions(+)
>
> --- a/drivers/hwmon/coretemp.c
> +++ b/drivers/hwmon/coretemp.c
> @@ -605,6 +605,13 @@ static int coretemp_cpu_online(unsigned
>         struct platform_data *pdata;
>
>         /*
> +        * Don't execute this on resume as the offline callback did
> +        * not get executed on suspend.
> +        */
> +       if (cpuhp_tasks_frozen)
> +               return 0;
> +
> +       /*
>          * CPUID.06H.EAX[0] indicates whether the CPU has thermal
>          * sensors. We check this bit only, all the early CPUs
>          * without thermal sensors will be filtered out.
> @@ -654,6 +661,13 @@ static int coretemp_cpu_offline(unsigned
>         struct temp_data *tdata;
>         int indx, target;
>
> +       /*
> +        * Don't execute this on suspend as the device remove locks
> +        * up the machine.
> +        */
> +       if (cpuhp_tasks_frozen)
> +               return 0;
> +
>         /* If the physical CPU device does not exist, just return */
>         if (!pdev)
>                 return 0;
Guenter Roeck May 10, 2017, 8:09 p.m. UTC | #2
On Wed, May 10, 2017 at 04:30:12PM +0200, Thomas Gleixner wrote:
> The recent conversion to the hotplug state machine missed that the original
> hotplug notifiers did not execute in the frozen state, which is used on
> suspend and resume.
> 
> This does not matter on single socket machines, but on multi socket systems
> this breaks when the device for a non-boot socket is removed when the last
> CPU of that socket is brought offline. The device removal locks up the
> machine hard w/o any debug output.
> 
> Prevent executing the hotplug callbacks when cpuhp_tasks_frozen is true.
> 
> Thanks to Tommi for providing debug information patiently while I failed to
> spot the obvious.
> 
> Fixes: e00ca5df37ad ("hwmon: (coretemp) Convert to hotplug state machine")
> Reported-by: Tommi Rantala <tt.rantala@gmail.com>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Applied, and thanks a lot for fixing the problem!

Guenter

> ---
>  drivers/hwmon/coretemp.c |   14 ++++++++++++++
>  1 file changed, 14 insertions(+)
> 
> --- a/drivers/hwmon/coretemp.c
> +++ b/drivers/hwmon/coretemp.c
> @@ -605,6 +605,13 @@ static int coretemp_cpu_online(unsigned
>  	struct platform_data *pdata;
>  
>  	/*
> +	 * Don't execute this on resume as the offline callback did
> +	 * not get executed on suspend.
> +	 */
> +	if (cpuhp_tasks_frozen)
> +		return 0;
> +
> +	/*
>  	 * CPUID.06H.EAX[0] indicates whether the CPU has thermal
>  	 * sensors. We check this bit only, all the early CPUs
>  	 * without thermal sensors will be filtered out.
> @@ -654,6 +661,13 @@ static int coretemp_cpu_offline(unsigned
>  	struct temp_data *tdata;
>  	int indx, target;
>  
> +	/*
> +	 * Don't execute this on suspend as the device remove locks
> +	 * up the machine.
> +	 */
> +	if (cpuhp_tasks_frozen)
> +		return 0;
> +
>  	/* If the physical CPU device does not exist, just return */
>  	if (!pdev)
>  		return 0;
Guenter Roeck May 10, 2017, 8:09 p.m. UTC | #3
On Wed, May 10, 2017 at 10:16:33PM +0300, Tommi Rantala wrote:
> 2017-05-10 17:30 GMT+03:00 Thomas Gleixner <tglx@linutronix.de>:
> > The recent conversion to the hotplug state machine missed that the original
> > hotplug notifiers did not execute in the frozen state, which is used on
> > suspend and resume.
> >
> > This does not matter on single socket machines, but on multi socket systems
> > this breaks when the device for a non-boot socket is removed when the last
> > CPU of that socket is brought offline. The device removal locks up the
> > machine hard w/o any debug output.
> >
> > Prevent executing the hotplug callbacks when cpuhp_tasks_frozen is true.
> >
> > Thanks to Tommi for providing debug information patiently while I failed to
> > spot the obvious.
> >
> > Fixes: e00ca5df37ad ("hwmon: (coretemp) Convert to hotplug state machine")
> > Reported-by: Tommi Rantala <tt.rantala@gmail.com>
> > Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> 
> Many thanks, I can confirm that it works well!
> 
Ok if I add your Tested-by: ?

Thanks,
Guenter

> -Tommi
> 
> > ---
> >  drivers/hwmon/coretemp.c |   14 ++++++++++++++
> >  1 file changed, 14 insertions(+)
> >
> > --- a/drivers/hwmon/coretemp.c
> > +++ b/drivers/hwmon/coretemp.c
> > @@ -605,6 +605,13 @@ static int coretemp_cpu_online(unsigned
> >         struct platform_data *pdata;
> >
> >         /*
> > +        * Don't execute this on resume as the offline callback did
> > +        * not get executed on suspend.
> > +        */
> > +       if (cpuhp_tasks_frozen)
> > +               return 0;
> > +
> > +       /*
> >          * CPUID.06H.EAX[0] indicates whether the CPU has thermal
> >          * sensors. We check this bit only, all the early CPUs
> >          * without thermal sensors will be filtered out.
> > @@ -654,6 +661,13 @@ static int coretemp_cpu_offline(unsigned
> >         struct temp_data *tdata;
> >         int indx, target;
> >
> > +       /*
> > +        * Don't execute this on suspend as the device remove locks
> > +        * up the machine.
> > +        */
> > +       if (cpuhp_tasks_frozen)
> > +               return 0;
> > +
> >         /* If the physical CPU device does not exist, just return */
> >         if (!pdev)
> >                 return 0;
Tommi Rantala May 11, 2017, 5:57 a.m. UTC | #4
2017-05-10 23:09 GMT+03:00 Guenter Roeck <linux@roeck-us.net>:
> On Wed, May 10, 2017 at 10:16:33PM +0300, Tommi Rantala wrote:
>> 2017-05-10 17:30 GMT+03:00 Thomas Gleixner <tglx@linutronix.de>:
>> > The recent conversion to the hotplug state machine missed that the original
>> > hotplug notifiers did not execute in the frozen state, which is used on
>> > suspend and resume.
>> >
>> > This does not matter on single socket machines, but on multi socket systems
>> > this breaks when the device for a non-boot socket is removed when the last
>> > CPU of that socket is brought offline. The device removal locks up the
>> > machine hard w/o any debug output.
>> >
>> > Prevent executing the hotplug callbacks when cpuhp_tasks_frozen is true.
>> >
>> > Thanks to Tommi for providing debug information patiently while I failed to
>> > spot the obvious.
>> >
>> > Fixes: e00ca5df37ad ("hwmon: (coretemp) Convert to hotplug state machine")
>> > Reported-by: Tommi Rantala <tt.rantala@gmail.com>
>> > Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
>>
>> Many thanks, I can confirm that it works well!
>>
> Ok if I add your Tested-by: ?

Sure!

Tested-by: Tommi Rantala <tt.rantala@gmail.com>

> Thanks,
> Guenter
>
>> -Tommi
>>
>> > ---
>> >  drivers/hwmon/coretemp.c |   14 ++++++++++++++
>> >  1 file changed, 14 insertions(+)
>> >
>> > --- a/drivers/hwmon/coretemp.c
>> > +++ b/drivers/hwmon/coretemp.c
>> > @@ -605,6 +605,13 @@ static int coretemp_cpu_online(unsigned
>> >         struct platform_data *pdata;
>> >
>> >         /*
>> > +        * Don't execute this on resume as the offline callback did
>> > +        * not get executed on suspend.
>> > +        */
>> > +       if (cpuhp_tasks_frozen)
>> > +               return 0;
>> > +
>> > +       /*
>> >          * CPUID.06H.EAX[0] indicates whether the CPU has thermal
>> >          * sensors. We check this bit only, all the early CPUs
>> >          * without thermal sensors will be filtered out.
>> > @@ -654,6 +661,13 @@ static int coretemp_cpu_offline(unsigned
>> >         struct temp_data *tdata;
>> >         int indx, target;
>> >
>> > +       /*
>> > +        * Don't execute this on suspend as the device remove locks
>> > +        * up the machine.
>> > +        */
>> > +       if (cpuhp_tasks_frozen)
>> > +               return 0;
>> > +
>> >         /* If the physical CPU device does not exist, just return */
>> >         if (!pdev)
>> >                 return 0;
Patch

--- a/drivers/hwmon/coretemp.c
+++ b/drivers/hwmon/coretemp.c
@@ -605,6 +605,13 @@  static int coretemp_cpu_online(unsigned
 	struct platform_data *pdata;
 
 	/*
+	 * Don't execute this on resume as the offline callback did
+	 * not get executed on suspend.
+	 */
+	if (cpuhp_tasks_frozen)
+		return 0;
+
+	/*
 	 * CPUID.06H.EAX[0] indicates whether the CPU has thermal
 	 * sensors. We check this bit only, all the early CPUs
 	 * without thermal sensors will be filtered out.
@@ -654,6 +661,13 @@  static int coretemp_cpu_offline(unsigned
 	struct temp_data *tdata;
 	int indx, target;
 
+	/*
+	 * Don't execute this on suspend as the device remove locks
+	 * up the machine.
+	 */
+	if (cpuhp_tasks_frozen)
+		return 0;
+
 	/* If the physical CPU device does not exist, just return */
 	if (!pdev)
 		return 0;
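
The two hunks above add the same check to both callbacks. A small user-space
sketch can illustrate why the pair has to stay balanced. Everything in it is an
assumption made for illustration: run_frozen_cycle() and struct cycle_result
are hypothetical names, the counter-based bookkeeping is a simplification of
whatever the driver really tracks, and a single non-boot socket with two CPUs
is modelled. The whole cycle is treated as running with cpuhp_tasks_frozen set.

```c
#include <stdbool.h>

#define MODEL_SOCKET_CPUS 2

/* Outcome of one suspend/resume cycle for the modelled non-boot socket. */
struct cycle_result {
	int removals_while_frozen;		/* device teardowns during suspend */
	bool device_present_after_resume;	/* is the device still there? */
	int online_cpus_after_resume;		/* callback's view of the CPU count */
};

/*
 * Run one frozen offline/online cycle. guard_online and guard_offline
 * select whether the respective callback returns early when frozen,
 * as the patch makes both of them do.
 */
struct cycle_result run_frozen_cycle(bool guard_online, bool guard_offline)
{
	struct cycle_result res = { 0, false, 0 };
	int online_cpus = MODEL_SOCKET_CPUS;	/* state after a normal boot */
	bool device_present = true;
	int i;

	/* Suspend: every CPU of the socket goes offline. */
	for (i = 0; i < MODEL_SOCKET_CPUS; i++) {
		if (guard_offline)
			continue;	/* early return in the offline callback */
		if (!device_present)
			continue;
		if (--online_cpus == 0) {
			/* In the report, this removal locked up the machine. */
			device_present = false;
			res.removals_while_frozen++;
		}
	}

	/* Resume: every CPU of the socket comes back online. */
	for (i = 0; i < MODEL_SOCKET_CPUS; i++) {
		if (guard_online)
			continue;	/* early return in the online callback */
		if (online_cpus++ == 0)
			device_present = true;
	}

	res.device_present_after_resume = device_present;
	res.online_cpus_after_resume = online_cpus;
	return res;
}
```

With both guards enabled the model sees no removal and an unchanged CPU count.
Guarding only the online side still removes the device during suspend and never
recreates it; guarding only the offline side counts the socket's CPUs twice.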