diff mbox series

[V2,1/1] tick: broadcast-hrtimer: Fix a race in bc_set_next

Message ID 20190925142029.13648-2-balasubramani_vivekanandan@mentor.com (mailing list archive)
State Superseded
Delegated to: Geert Uytterhoeven
Headers show
Series [V2,1/1] tick: broadcast-hrtimer: Fix a race in bc_set_next | expand

Commit Message

Balasubramani Vivekanandan Sept. 25, 2019, 2:20 p.m. UTC
When a cpu requests broadcasting, before starting the tick broadcast
hrtimer, bc_set_next() checks if the timer callback (bc_handler) is
active using hrtimer_try_to_cancel(). But hrtimer_try_to_cancel() does
not provide the required synchronization when the callback is active on
other core.
The callback could have already executed
tick_handle_oneshot_broadcast() and could have also returned. But still
there is a small time window where the hrtimer_try_to_cancel() returns
-1. In that case bc_set_next() returns without doing anything, but the
next_event of the tick broadcast clock device is already set to a
timeout value.

In the race condition diagram below, CPU #1 is running the timer
callback and CPU #2 is entering idle state and so calls bc_set_next().

In the worst case, the next_event will contain an expiry time, but the
hrtimer will not be started which happens when the racing callback
returns HRTIMER_NORESTART. The hrtimer might never recover if all
further requests from the CPUs to subscribe to tick broadcast have
timeout greater than the next_event of tick broadcast clock device. This
leads to cascading of failures and finally noticed as rcu stall
warnings as shown in [1]

Here is a depiction of the race condition

CPU #1 (Running timer callback)                   CPU #2 (Enter idle
                                                  and subscribe to
                                                  tick broadcast)
---------------------                             ---------------------

__run_hrtimer()                                   tick_broadcast_enter()

  bc_handler()                                      __tick_broadcast_oneshot_control()

    tick_handle_oneshot_broadcast()

      raw_spin_lock(&tick_broadcast_lock);

      dev->next_event = KTIME_MAX;                  //wait for tick_broadcast_lock
      //next_event for tick broadcast clock
      set to KTIME_MAX since no other cores
      subscribed to tick broadcasting

      raw_spin_unlock(&tick_broadcast_lock);

    if (dev->next_event == KTIME_MAX)
      return HRTIMER_NORESTART
    // callback function exits without
       restarting the hrtimer                      //tick_broadcast_lock acquired
                                                   raw_spin_lock(&tick_broadcast_lock);

                                                   tick_broadcast_set_event()

                                                     clockevents_program_event()

                                                       dev->next_event = expires;

                                                       bc_set_next()

                                                         hrtimer_try_to_cancel()
                                                         //returns -1 since the timer
                                                         callback is active. Exits without
                                                         restarting the timer
  cpu_base->running = NULL;

Since it is now allowed to start the hrtimer from the callback, there is
no need for the try to cancel logic. All that is now removed.

[1]: rcu stall warnings noticed before this patch

[   26.477514] INFO: rcu_preempt detected stalls on CPUs/tasks:
[   26.483197]  4-...: (0 ticks this GP) idle=718/0/0 softirq=1436/1436 fqs=0
[   26.490171]  (detected by 0, t=1755 jiffies, g=778, c=777, q=10243)
[   26.496456] Task dump for CPU 4:
[   26.499688] swapper/4       R  running task        0     0      1 0x00000020
[   26.506756] Call trace:
[   26.509221] [<ffff000008086214>] __switch_to+0x94/0xd8
[   26.514378] [<ffff000008ed9000>] bp_hardening_data+0x0/0x10
[   26.519964] rcu_preempt kthread starved for 1762 jiffies! g778 c777 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x402 ->cpu=4
[   26.530245] rcu_preempt     I    0     8      2 0x00000020
[   26.535742] Call trace:
[   26.538192] [<ffff000008086214>] __switch_to+0x94/0xd8
[   26.543344] [<ffff000008a6365c>] __schedule+0x274/0x940
[   26.548578] [<ffff000008a63d68>] schedule+0x40/0xa8
[   26.553467] [<ffff000008a6847c>] schedule_timeout+0x94/0x430
[   26.559141] [<ffff00000813cb04>] rcu_gp_kthread+0x76c/0x1068
[   26.564813] [<ffff0000080e1610>] kthread+0x138/0x140
[   26.569787] [<ffff000008084f74>] ret_from_fork+0x10/0x1c
[   30.481515] INFO: rcu_sched detected stalls on CPUs/tasks:
[   30.487030]  4-...: (0 ticks this GP) idle=718/0/0 softirq=1436/1436 fqs=0
[   30.494004]  (detected by 1, t=1755 jiffies, g=-24, c=-25, q=45)
[   30.500028] Task dump for CPU 4:
[   30.503261] swapper/4       R  running task        0     0      1 0x00000020
[   30.510330] Call trace:
[   30.512796] [<ffff000008086214>] __switch_to+0x94/0xd8
[   30.517953] [<ffff000008ed9000>] bp_hardening_data+0x0/0x10
[   30.523541] rcu_sched kthread starved for 1762 jiffies! g18446744073709551592 c18446744073709551591 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x402 ->cpu=4
[   30.536608] rcu_sched       I    0     9      2 0x00000020
[   30.542105] Call trace:
[   30.544554] [<ffff000008086214>] __switch_to+0x94/0xd8
[   30.549707] [<ffff000008a6365c>] __schedule+0x274/0x940
[   30.554942] [<ffff000008a63d68>] schedule+0x40/0xa8
[   30.559830] [<ffff000008a6847c>] schedule_timeout+0x94/0x430
[   30.565504] [<ffff00000813cb04>] rcu_gp_kthread+0x76c/0x1068
[   30.571176] [<ffff0000080e1610>] kthread+0x138/0x140
[   30.576149] [<ffff000008084f74>] ret_from_fork+0x10/0x1c

Signed-off-by: Balasubramani Vivekanandan <balasubramani_vivekanandan@mentor.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/time/tick-broadcast-hrtimer.c | 53 +++++++++++++---------------
 1 file changed, 24 insertions(+), 29 deletions(-)

Comments

Thomas Gleixner Sept. 26, 2019, 10:01 a.m. UTC | #1
On Wed, 25 Sep 2019, Balasubramani Vivekanandan wrote:
> 
> Since it is now allowed to start the hrtimer from the callback, there is

Is now allowed? 

> no need for the try to cancel logic. All that is now removed.

Sure I can see that it is removed from the patch, but why and why is it
correct?

> [1]: rcu stall warnings noticed before this patch
> 
> [   26.477514] INFO: rcu_preempt detected stalls on CPUs/tasks:

<SNIP>

I which way is this backtrace giving any useful information about the
problem?

> 
> Signed-off-by: Balasubramani Vivekanandan <balasubramani_vivekanandan@mentor.com>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Interesting. You claim authorship on that patch and then you put my SOB
after yours just because you feel entitled to do so?

I surely appreciate the time you spent to analyze the problem and I
wouldn't have said anything if you just would have done the right thing:

 1) Write a changelog which explains why the change is actually correct

 2) Not wreckage the formatting which I intentionally did for readability
    sake

 3) Add 'Originally-by: Thomas Gleixner' or at least having the courtesy to
    mention that this is not your work.

Thanks,

	tglx
Balasubramani Vivekanandan Sept. 26, 2019, 1:51 p.m. UTC | #2
> I which way is this backtrace giving any useful information about the problem?

The intention of me including the callstack was to help anyone notice
this commit while searching using the prints from the callstack. I have
removed it if it is creating noise.

> Interesting. You claim authorship on that patch and then you put my
> SOB after yours just because you feel entitled to do so?

Apologies for that. My limited experience with sharing patches upstream
for review and fixing it, was the reason for that. I have included the
"Originally-by" in the updated patch shared.

> 1) Write a changelog which explains why the change is actually correct
Done. Please check the update. Let me know if I need to add more
details.

> 2) Not wreckage the formatting which I intentionally did for
> readability sake
Done.

> 3) Add 'Originally-by: Thomas Gleixner' or at least having the courtesy to
>    mention that this is not your work.
Done.

Regards,
Bala
diff mbox series

Patch

diff --git a/kernel/time/tick-broadcast-hrtimer.c b/kernel/time/tick-broadcast-hrtimer.c
index c1f5bb590b5e..2fb16d49939a 100644
--- a/kernel/time/tick-broadcast-hrtimer.c
+++ b/kernel/time/tick-broadcast-hrtimer.c
@@ -42,39 +42,38 @@  static int bc_shutdown(struct clock_event_device *evt)
  */
 static int bc_set_next(ktime_t expires, struct clock_event_device *bc)
 {
-	int bc_moved;
 	/*
-	 * We try to cancel the timer first. If the callback is on
-	 * flight on some other cpu then we let it handle it. If we
-	 * were able to cancel the timer nothing can rearm it as we
-	 * own broadcast_lock.
+	 * This is called either from enter/exit idle code or from the
+	 * broadcast handler. In all cases tick_broadcast_lock is held.
 	 *
-	 * However we can also be called from the event handler of
-	 * ce_broadcast_hrtimer itself when it expires. We cannot
-	 * restart the timer because we are in the callback, but we
-	 * can set the expiry time and let the callback return
-	 * HRTIMER_RESTART.
+	 * hrtimer_cancel() cannot be called here neither from the
+	 * broadcast handler nor from the enter/exit idle code. The idle
+	 * code can run into the problem described in bc_shutdown() and the
+	 * broadcast handler cannot wait for itself to complete for obvious
+	 * reasons.
 	 *
-	 * Since we are in the idle loop at this point and because
-	 * hrtimer_{start/cancel} functions call into tracing,
-	 * calls to these functions must be bound within RCU_NONIDLE.
+	 * Each caller tries to arm the hrtimer on its own CPU, but if the
+	 * handler is currently running hrtimer_start() does not move
+	 * it. It keeps it on the CPU on which it is assigned at the
+	 * moment.
 	 */
 	RCU_NONIDLE(
 		{
-			bc_moved = hrtimer_try_to_cancel(&bctimer) >= 0;
-			if (bc_moved) {
-				hrtimer_start(&bctimer, expires,
-					      HRTIMER_MODE_ABS_PINNED_HARD);
-			}
+			hrtimer_start(&bctimer, expires,
+				      HRTIMER_MODE_ABS_PINNED_HARD);
+			/*
+			 * The core tick broadcast mode expects bc->bound_on to
+			 * be set correctly to prevent a CPU which has the
+			 * broadcast hrtimer armed from going deep idle.
+			 *
+			 * As tick_broadcast_lock is held, nothing can change
+			 * the cpu base which was just established in
+			 * hrtimer_start() above. So the below access is safe
+			 * even without holding the hrtimer base lock.
+			 */
+			bc->bound_on = bctimer.base->cpu_base->cpu;
 		}
 	);
-
-	if (bc_moved) {
-		/* Bind the "device" to the cpu */
-		bc->bound_on = smp_processor_id();
-	} else if (bc->bound_on == smp_processor_id()) {
-		hrtimer_set_expires(&bctimer, expires);
-	}
 	return 0;
 }
 
@@ -100,10 +99,6 @@  static enum hrtimer_restart bc_handler(struct hrtimer *t)
 {
 	ce_broadcast_hrtimer.event_handler(&ce_broadcast_hrtimer);
 
-	if (clockevent_state_oneshot(&ce_broadcast_hrtimer))
-		if (ce_broadcast_hrtimer.next_event != KTIME_MAX)
-			return HRTIMER_RESTART;
-
 	return HRTIMER_NORESTART;
 }