x86/mce/therm_throt: Handle case where throttle_active_work() is called on behalf of an offline CPU

Message ID 20200222162432.497201-1-srinivas.pandruvada@linux.intel.com (mailing list archive)
State New, archived
Series x86/mce/therm_throt: Handle case where throttle_active_work() is called on behalf of an offline CPU

Commit Message

Srinivas Pandruvada Feb. 22, 2020, 4:24 p.m. UTC
During a CPU hotplug test with CONFIG_PREEMPTION and CONFIG_DEBUG_PREEMPT
enabled, Chris reported this error:

BUG: using smp_processor_id() in preemptible [00000000] code: kworker/1:0/17
caller is throttle_active_work+0x12/0x280

Here throttle_active_work() is a workqueue callback scheduled with
schedule_delayed_work_on(). Under normal conditions this use of
smp_processor_id() does not trigger the warning, because the debug check
is satisfied by "current->nr_cpus_allowed == 1" for a worker bound to a
single CPU. But when the target CPU goes offline, the workqueue becomes
unbound. The callback can then run on a different CPU, and the warning is
printed for the use of smp_processor_id() in preemptible context.

When the callback is not running on the target CPU, simply return. This
is done by adding a cpu field to the _thermal_state struct and matching
it against the current CPU id.

Once the work callback is running, prevent the CPU from going offline so
that the log bits are checked and cleared on the correct CPU. Also use
get_cpu() to obtain the current CPU id and disable preemption until
processing is finished.

Fixes: f6656208f04e ("x86/mce/therm_throt: Optimize notifications of thermal throttle")
Reported-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/mce/therm_throt.c | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)
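
For reference, the warning above comes from the debug check behind
smp_processor_id(), check_preemption_disabled() in lib/smp_processor_id.c.
The sketch below is a simplified, hedged paraphrase of that logic, not the
exact upstream code:

	/* Simplified sketch of check_preemption_disabled(), paraphrased */
	if (likely(preempt_count()))
		goto out;		/* preemption disabled: safe */

	if (irqs_disabled())
		goto out;		/* hard interrupts off: safe */

	/* A kworker pinned to a single CPU passes this check ... */
	if (current->nr_cpus_allowed == 1)
		goto out;

	/*
	 * ... but once the target CPU is offline, the worker is no longer
	 * bound to a single CPU, nr_cpus_allowed > 1, and the "BUG: using
	 * smp_processor_id() in preemptible" message above is printed.
	 */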

Comments

Chris Wilson Feb. 22, 2020, 4:53 p.m. UTC | #1
Quoting Srinivas Pandruvada (2020-02-22 16:24:32)
> During a CPU hotplug test with CONFIG_PREEMPTION and CONFIG_DEBUG_PREEMPT
> enabled, Chris reported this error:
> 
> BUG: using smp_processor_id() in preemptible [00000000] code: kworker/1:0/17
> caller is throttle_active_work+0x12/0x280
> 
> Here throttle_active_work() is a workqueue callback scheduled with
> schedule_delayed_work_on(). Under normal conditions this use of
> smp_processor_id() does not trigger the warning, because the debug check
> is satisfied by "current->nr_cpus_allowed == 1" for a worker bound to a
> single CPU. But when the target CPU goes offline, the workqueue becomes
> unbound. The callback can then run on a different CPU, and the warning is
> printed for the use of smp_processor_id() in preemptible context.
> 
> When the callback is not running on the target CPU, simply return. This
> is done by adding a cpu field to the _thermal_state struct and matching
> it against the current CPU id.
> 
> Once the work callback is running, prevent the CPU from going offline so
> that the log bits are checked and cleared on the correct CPU. Also use
> get_cpu() to obtain the current CPU id and disable preemption until
> processing is finished.
> 
> Fixes: f6656208f04e ("x86/mce/therm_throt: Optimize notifications of thermal throttle")
> Reported-by: Chris Wilson <chris@chris-wilson.co.uk>
> Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
> Reviewed-by: Tony Luck <tony.luck@intel.com>

I've pushed the patch to our CI, but it's not a frequent occurrence, so
it may be some time before I can state a Tested-by with any confidence.
-Chris
Borislav Petkov Feb. 22, 2020, 5:51 p.m. UTC | #2
On Sat, Feb 22, 2020 at 08:24:32AM -0800, Srinivas Pandruvada wrote:
> During a CPU hotplug test with CONFIG_PREEMPTION and CONFIG_DEBUG_PREEMPT
> enabled, Chris reported this error:
> 
> BUG: using smp_processor_id() in preemptible [00000000] code: kworker/1:0/17
> caller is throttle_active_work+0x12/0x280
> 
> Here throttle_active_work() is a workqueue callback scheduled with
> schedule_delayed_work_on(). Under normal conditions this use of
> smp_processor_id() does not trigger the warning, because the debug check
> is satisfied by "current->nr_cpus_allowed == 1" for a worker bound to a
> single CPU. But when the target CPU goes offline, the workqueue becomes
> unbound. The callback can then run on a different CPU, and the warning is
> printed for the use of smp_processor_id() in preemptible context.

So what's wrong with simply doing:

	if (cpu_is_offline(this_cpu))
		return;

?

You don't need to run the callback on an offlined CPU anyway...
Srinivas Pandruvada Feb. 23, 2020, 12:25 a.m. UTC | #3
On Sat, 2020-02-22 at 18:51 +0100, Borislav Petkov wrote:
> On Sat, Feb 22, 2020 at 08:24:32AM -0800, Srinivas Pandruvada wrote:
> > During a CPU hotplug test with CONFIG_PREEMPTION and
> > CONFIG_DEBUG_PREEMPT enabled, Chris reported this error:
> > 
> > BUG: using smp_processor_id() in preemptible [00000000] code:
> > kworker/1:0/17
> > caller is throttle_active_work+0x12/0x280
> > 
> > Here throttle_active_work() is a workqueue callback scheduled with
> > schedule_delayed_work_on(). Under normal conditions this use of
> > smp_processor_id() does not trigger the warning, because the debug
> > check is satisfied by "current->nr_cpus_allowed == 1" for a worker
> > bound to a single CPU. But when the target CPU goes offline, the
> > workqueue becomes unbound. The callback can then run on a different
> > CPU, and the warning is printed for the use of smp_processor_id() in
> > preemptible context.
> 
> So what's wrong with simply doing:
> 
> 	if (cpu_is_offline(this_cpu))
> 		return;
> 
> ?
> 
If the condition is false, what prevents the CPU from going offline
before the next statement executes, with the callback then running on
another CPU? It will not cause any error or crash, but in rare
circumstances it may print a premature warning/normal message based on
the current CPU's state.

So I can submit something like this:

diff --git a/arch/x86/kernel/cpu/mce/therm_throt.c b/arch/x86/kernel/cpu/mce/therm_throt.c
index b38010b541d6..7aa7c9d1df2a 100644
--- a/arch/x86/kernel/cpu/mce/therm_throt.c
+++ b/arch/x86/kernel/cpu/mce/therm_throt.c
@@ -239,11 +239,14 @@ static void throttle_active_work(struct work_struct *work)
 {
 	struct _thermal_state *state = container_of(to_delayed_work(work),
 						struct _thermal_state, therm_work);
-	unsigned int i, avg, this_cpu = smp_processor_id();
+	unsigned int i, avg, this_cpu = state->cpu;
 	u64 now = get_jiffies_64();
 	bool hot;
 	u8 temp;
 
+	if (cpu_is_offline(this_cpu))
+		return;
+
 	get_therm_status(state->level, &hot, &temp);
 	/* temperature value is offset from the max so lesser means hotter */
 	if (!hot && temp > state->baseline_temp) {

Thanks,
Srinivas

> You don't need to run the callback on an offlined CPU anyway...
>
Borislav Petkov Feb. 24, 2020, 12:55 p.m. UTC | #4
On Sat, Feb 22, 2020 at 04:25:59PM -0800, Srinivas Pandruvada wrote:
> If the condition is false, what prevents the CPU from going offline
> before the next statement executes, with the callback then running on
> another CPU? It will not cause any error or crash, but in rare
> circumstances it may print a premature warning/normal message based on
> the current CPU's state.

Why? An offline CPU is an offline CPU.

Btw, I'm asking whether you can do the simpler thing *instead* of your
patch. You basically don't run the workqueue callback on offlined CPUs:

	get_online_cpus();

	if (cpu_is_offline(smp_processor_id()))
		goto out;

	...


out:
	put_online_cpus();

Hmm?
Thomas Gleixner Feb. 24, 2020, 4:01 p.m. UTC | #5
Borislav Petkov <bp@alien8.de> writes:

> On Sat, Feb 22, 2020 at 04:25:59PM -0800, Srinivas Pandruvada wrote:
>> If the condition is false, what prevents the CPU from going offline
>> before the next statement executes, with the callback then running on
>> another CPU? It will not cause any error or crash, but in rare
>> circumstances it may print a premature warning/normal message based on
>> the current CPU's state.
>
> Why? An offline CPU is an offline CPU.
>
> Btw, I'm asking whether you can do the simpler thing *instead* of your
> patch. You basically don't run the workqueue callback on offlined CPUs:
>
> 	get_online_cpus();
>
> 	if (cpu_is_offline(smp_processor_id()))
> 		goto out;
>
> 	...
>
>
> out:
> 	put_online_cpus();

Which is wrong as well. Trying to "fix" it in the work queue callback is
papering over the root cause.

Why is any work scheduled on an outgoing CPU after this CPU executed
thermal_throttle_offline()?

When thermal_throttle_offline() is invoked, the CPU-bound workqueues are
still functional and thermal_throttle_offline() cancels outstanding
work.

So no, please fix the root cause not the symptom.

Thanks,

        tglx
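
For context, thermal_throttle_online() and thermal_throttle_offline() are
this driver's CPU hotplug callbacks. As a hedged sketch, such a callback
pair is typically registered roughly as below; the state constant and the
name string are illustrative assumptions, not necessarily what
therm_throt.c uses:

	/*
	 * Sketch of hotplug callback registration. The startup callback
	 * runs on each CPU coming online; the teardown callback runs on
	 * each CPU going offline, while that CPU's bound workqueues are
	 * still functional.
	 */
	ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "x86/therm:online",
				thermal_throttle_online,
				thermal_throttle_offline);
	if (ret < 0)
		return ret;

Work that is still running or re-arms itself when the teardown callback
tries to cancel it can survive the cancel and later run on an unbound
worker, which is the ordering problem being discussed here.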
Thomas Gleixner Feb. 24, 2020, 7:25 p.m. UTC | #6
Thomas Gleixner <tglx@linutronix.de> writes:
> Which is wrong as well. Trying to "fix" it in the work queue callback is
> papering over the root cause.
>
> Why is any work scheduled on an outgoing CPU after this CPU executed
> thermal_throttle_offline()?
>
> When thermal_throttle_offline() is invoked, the CPU-bound workqueues
> are still functional and thermal_throttle_offline() cancels
> outstanding work.
>
> So no, please fix the root cause not the symptom.

And if you look at thermal_throttle_online() then you'll notice that it
is asymmetric vs. thermal_throttle_offline().

Also you want to do cancel_delayed_work_sync() and not just
cancel_delayed_work(), because only the former guarantees that the work
is neither queued nor still running; the latter does not take running or
self-requeueing work into account.

Something like the untested patch below.

Thanks,

        tglx
---
--- a/arch/x86/kernel/cpu/mce/therm_throt.c
+++ b/arch/x86/kernel/cpu/mce/therm_throt.c
@@ -487,8 +487,13 @@ static int thermal_throttle_offline(unsi
 	struct thermal_state *state = &per_cpu(thermal_state, cpu);
 	struct device *dev = get_cpu_device(cpu);
+	u32 l;
 
-	cancel_delayed_work(&state->package_throttle.therm_work);
-	cancel_delayed_work(&state->core_throttle.therm_work);
+	/* Mask the thermal vector before draining evtl. pending work */
+	l = apic_read(APIC_LVTTHMR);
+	apic_write(APIC_LVTTHMR, l | APIC_LVT_MASKED);
+
+	cancel_delayed_work_sync(&state->package_throttle.therm_work);
+	cancel_delayed_work_sync(&state->core_throttle.therm_work);
 
 	state->package_throttle.rate_control_active = false;
 	state->core_throttle.rate_control_active = false;
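
The _sync variant matters here because throttle_active_work() re-arms
itself. Below is a minimal, hypothetical sketch of such a self-requeueing
delayed work (the names are invented for illustration):

	/* Hypothetical self-requeueing delayed work, for illustration */
	static void my_work_fn(struct work_struct *work);
	static DECLARE_DELAYED_WORK(my_dwork, my_work_fn);

	static void my_work_fn(struct work_struct *work)
	{
		struct delayed_work *dwork = to_delayed_work(work);

		/* ... do the periodic check ... */

		/* Re-arm: the callback queues itself again */
		schedule_delayed_work(dwork, HZ);
	}

	/*
	 * cancel_delayed_work(&my_dwork) only removes a pending instance;
	 * a concurrently running my_work_fn() can still re-arm itself
	 * afterwards. cancel_delayed_work_sync(&my_dwork) also waits for a
	 * running instance to finish and defeats self-requeueing, so
	 * nothing remains queued once it returns.
	 */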
Pandruvada, Srinivas Feb. 24, 2020, 9:05 p.m. UTC | #7
On Mon, 2020-02-24 at 20:25 +0100, Thomas Gleixner wrote:
> Thomas Gleixner <tglx@linutronix.de> writes:
> > Which is wrong as well. Trying to "fix" it in the work queue
> > callback is papering over the root cause.
> > 
> > Why is any work scheduled on an outgoing CPU after this CPU executed
> > thermal_throttle_offline()?
> > 
> > When thermal_throttle_offline() is invoked, the CPU-bound workqueues
> > are still functional and thermal_throttle_offline() cancels
> > outstanding work.
> > 
> > So no, please fix the root cause not the symptom.
> 
> And if you look at thermal_throttle_online() then you'll notice that it
> is asymmetric vs. thermal_throttle_offline().
> 
> Also you want to do cancel_delayed_work_sync() and not just
> cancel_delayed_work(), because only the former guarantees that the work
> is neither queued nor still running; the latter does not take running
> or self-requeueing work into account.
> 
> Something like the untested patch below.
Thanks Thomas. I am testing this patch.

-Srinivas

> 
> Thanks,
> 
>         tglx
> ---
> --- a/arch/x86/kernel/cpu/mce/therm_throt.c
> +++ b/arch/x86/kernel/cpu/mce/therm_throt.c
> @@ -487,8 +487,13 @@ static int thermal_throttle_offline(unsi
>  	struct thermal_state *state = &per_cpu(thermal_state, cpu);
>  	struct device *dev = get_cpu_device(cpu);
> +	u32 l;
>  
> -	cancel_delayed_work(&state->package_throttle.therm_work);
> -	cancel_delayed_work(&state->core_throttle.therm_work);
> +	/* Mask the thermal vector before draining evtl. pending work */
> +	l = apic_read(APIC_LVTTHMR);
> +	apic_write(APIC_LVTTHMR, l | APIC_LVT_MASKED);
> +
> +	cancel_delayed_work_sync(&state->package_throttle.therm_work);
> +	cancel_delayed_work_sync(&state->core_throttle.therm_work);
>  
>  	state->package_throttle.rate_control_active = false;
>  	state->core_throttle.rate_control_active = false;
Srinivas Pandruvada Feb. 25, 2020, 9:46 a.m. UTC | #8
On Mon, 2020-02-24 at 20:25 +0100, Thomas Gleixner wrote:
> Thomas Gleixner <tglx@linutronix.de> writes:
> > Which is wrong as well. Trying to "fix" it in the work queue
> > callback is papering over the root cause.
> > 
> > Why is any work scheduled on an outgoing CPU after this CPU executed
> > thermal_throttle_offline()?
> > 
> > When thermal_throttle_offline() is invoked, the CPU-bound workqueues
> > are still functional and thermal_throttle_offline() cancels
> > outstanding work.
> > 
> > So no, please fix the root cause not the symptom.
> 
> And if you look at thermal_throttle_online() then you'll notice that it
> is asymmetric vs. thermal_throttle_offline().
> 
> Also you want to do cancel_delayed_work_sync() and not just
> cancel_delayed_work(), because only the former guarantees that the work
> is neither queued nor still running; the latter does not take running
> or self-requeueing work into account.
> 
> Something like the untested patch below.
I tested this patch. After simulating 20 million thermal interrupts and
running an online/offline test for 12+ hours, I don't see the issue.

So this change fixes the issue.

I can send the change on your behalf, or you can add
Tested-by: Pandruvada, Srinivas <srinivas.pandruvada@linux.intel.com>

Thanks,
Srinivas

> 
> Thanks,
> 
>         tglx
> ---
> --- a/arch/x86/kernel/cpu/mce/therm_throt.c
> +++ b/arch/x86/kernel/cpu/mce/therm_throt.c
> @@ -487,8 +487,13 @@ static int thermal_throttle_offline(unsi
>  	struct thermal_state *state = &per_cpu(thermal_state, cpu);
>  	struct device *dev = get_cpu_device(cpu);
> +	u32 l;
>  
> -	cancel_delayed_work(&state->package_throttle.therm_work);
> -	cancel_delayed_work(&state->core_throttle.therm_work);
> +	/* Mask the thermal vector before draining evtl. pending work */
> +	l = apic_read(APIC_LVTTHMR);
> +	apic_write(APIC_LVTTHMR, l | APIC_LVT_MASKED);
> +
> +	cancel_delayed_work_sync(&state->package_throttle.therm_work);
> +	cancel_delayed_work_sync(&state->core_throttle.therm_work);
>  
>  	state->package_throttle.rate_control_active = false;
>  	state->core_throttle.rate_control_active = false;

Patch

diff --git a/arch/x86/kernel/cpu/mce/therm_throt.c b/arch/x86/kernel/cpu/mce/therm_throt.c
index 58b4ee3cda77..4dab8a4558f9 100644
--- a/arch/x86/kernel/cpu/mce/therm_throt.c
+++ b/arch/x86/kernel/cpu/mce/therm_throt.c
@@ -61,6 +61,7 @@ 
  * @new_event:			Stores the last high/low status of the
  *				THERM_STATUS_PROCHOT or
  *				THERM_STATUS_POWER_LIMIT.
+ * @cpu:			CPU id for this instance.
  * @level:			Stores whether this _thermal_state instance is
  *				for a CORE level or for PACKAGE level.
  * @sample_index:		Index for storing the next sample in the buffer
@@ -86,6 +87,7 @@  struct _thermal_state {
 	unsigned long		total_time_ms;
 	bool			rate_control_active;
 	bool			new_event;
+	int			cpu;
 	u8			level;
 	u8			sample_index;
 	u8			sample_count;
@@ -239,11 +241,19 @@  static void __maybe_unused throttle_active_work(struct work_struct *work)
 {
 	struct _thermal_state *state = container_of(to_delayed_work(work),
 						struct _thermal_state, therm_work);
-	unsigned int i, avg, this_cpu = smp_processor_id();
+	unsigned int i, avg, this_cpu;
 	u64 now = get_jiffies_64();
 	bool hot;
 	u8 temp;
 
+	get_online_cpus();
+	this_cpu = get_cpu();
+
+	if (state->cpu != this_cpu) {
+		state->rate_control_active = false;
+		goto end;
+	}
+
 	get_therm_status(state->level, &hot, &temp);
 	/* temperature value is offset from the max so lesser means hotter */
 	if (!hot && temp > state->baseline_temp) {
@@ -254,7 +264,7 @@  static void __maybe_unused throttle_active_work(struct work_struct *work)
 				state->count);
 
 		state->rate_control_active = false;
-		return;
+		goto end;
 	}
 
 	if (time_before64(now, state->next_check) &&
@@ -296,6 +306,10 @@  static void __maybe_unused throttle_active_work(struct work_struct *work)
 re_arm:
 	clear_therm_status_log(state->level);
 	schedule_delayed_work_on(this_cpu, &state->therm_work, THERM_THROT_POLL_INTERVAL);
+
+end:
+	put_cpu();
+	put_online_cpus();
 }
 
 /***
@@ -359,6 +373,7 @@  static void therm_throt_process(bool new_event, int event, int level)
 
 		state->baseline_temp = temp;
 		state->last_interrupt_time = now;
+		state->cpu = this_cpu;
 		schedule_delayed_work_on(this_cpu, &state->therm_work, THERM_THROT_POLL_INTERVAL);
 	} else if (old_event && state->last_interrupt_time) {
 		unsigned long throttle_time;