diff mbox series

mailbox: cancel timer before starting it

Message ID 20200923123916.1115962-1-jbrunet@baylibre.com (mailing list archive)
State New, archived
Headers show
Series mailbox: cancel timer before starting it | expand

Commit Message

Jerome Brunet Sept. 23, 2020, 12:39 p.m. UTC
If the txdone is done by polling, it is possible for msg_submit() to start
the timer while txdone_hrtimer() callback is running. If the timer needs
recheduling, it could already be enqueued by the time hrtimer_forward_now()
is called, leading hrtimer to loudly complain.

WARNING: CPU: 3 PID: 74 at kernel/time/hrtimer.c:932 hrtimer_forward+0xc4/0x110
CPU: 3 PID: 74 Comm: kworker/u8:1 Not tainted 5.9.0-rc2-00236-gd3520067d01c-dirty #5
Hardware name: Libre Computer AML-S805X-AC (DT)
Workqueue: events_freezable_power_ thermal_zone_device_check
pstate: 20000085 (nzCv daIf -PAN -UAO BTYPE=--)
pc : hrtimer_forward+0xc4/0x110
lr : txdone_hrtimer+0xf8/0x118
[...]

Canceling the timer before starting it ensure that the timer callback is
not running when the timer is started, solving this race condition.

Fixes: 0cc67945ea59 ("mailbox: switch to hrtimer for tx_complete polling")
Reported-by: Da Xue <da@libre.computer>
Signed-off-by: Jerome Brunet <jbrunet@baylibre.com>
---
 drivers/mailbox/mailbox.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

Comments

Kevin Hilman Sept. 24, 2020, 4:50 p.m. UTC | #1
Jerome Brunet <jbrunet@baylibre.com> writes:

> If the txdone is done by polling, it is possible for msg_submit() to start
> the timer while txdone_hrtimer() callback is running. If the timer needs
> recheduling, it could already be enqueued by the time hrtimer_forward_now()
> is called, leading hrtimer to loudly complain.
>
> WARNING: CPU: 3 PID: 74 at kernel/time/hrtimer.c:932 hrtimer_forward+0xc4/0x110
> CPU: 3 PID: 74 Comm: kworker/u8:1 Not tainted 5.9.0-rc2-00236-gd3520067d01c-dirty #5
> Hardware name: Libre Computer AML-S805X-AC (DT)
> Workqueue: events_freezable_power_ thermal_zone_device_check
> pstate: 20000085 (nzCv daIf -PAN -UAO BTYPE=--)
> pc : hrtimer_forward+0xc4/0x110
> lr : txdone_hrtimer+0xf8/0x118
> [...]
>
> Canceling the timer before starting it ensure that the timer callback is
> not running when the timer is started, solving this race condition.
>
> Fixes: 0cc67945ea59 ("mailbox: switch to hrtimer for tx_complete polling")
> Reported-by: Da Xue <da@libre.computer>
> Signed-off-by: Jerome Brunet <jbrunet@baylibre.com>

Reviewed-by: Kevin Hilman <khilman@baylibre.com>

Kevin
Ionela Voinescu Oct. 15, 2020, 1:46 p.m. UTC | #2
Hi guys,

On Wednesday 23 Sep 2020 at 14:39:16 (+0200), Jerome Brunet wrote:
> If the txdone is done by polling, it is possible for msg_submit() to start
> the timer while txdone_hrtimer() callback is running. If the timer needs
> recheduling, it could already be enqueued by the time hrtimer_forward_now()
> is called, leading hrtimer to loudly complain.
> 
> WARNING: CPU: 3 PID: 74 at kernel/time/hrtimer.c:932 hrtimer_forward+0xc4/0x110
> CPU: 3 PID: 74 Comm: kworker/u8:1 Not tainted 5.9.0-rc2-00236-gd3520067d01c-dirty #5
> Hardware name: Libre Computer AML-S805X-AC (DT)
> Workqueue: events_freezable_power_ thermal_zone_device_check
> pstate: 20000085 (nzCv daIf -PAN -UAO BTYPE=--)
> pc : hrtimer_forward+0xc4/0x110
> lr : txdone_hrtimer+0xf8/0x118
> [...]
> 
> Canceling the timer before starting it ensure that the timer callback is
> not running when the timer is started, solving this race condition.
> 
> Fixes: 0cc67945ea59 ("mailbox: switch to hrtimer for tx_complete polling")
> Reported-by: Da Xue <da@libre.computer>
> Signed-off-by: Jerome Brunet <jbrunet@baylibre.com>
> ---
>  drivers/mailbox/mailbox.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/mailbox/mailbox.c b/drivers/mailbox/mailbox.c
> index 0b821a5b2db8..34f9ab01caef 100644
> --- a/drivers/mailbox/mailbox.c
> +++ b/drivers/mailbox/mailbox.c
> @@ -82,9 +82,13 @@ static void msg_submit(struct mbox_chan *chan)
>  exit:
>  	spin_unlock_irqrestore(&chan->lock, flags);
>  
> -	if (!err && (chan->txdone_method & TXDONE_BY_POLL))
> -		/* kick start the timer immediately to avoid delays */
> +	if (!err && (chan->txdone_method & TXDONE_BY_POLL)) {
> +		/* Disable the timer if already active ... */
> +		hrtimer_cancel(&chan->mbox->poll_hrt);
> +
> +		/* ... and kick start it immediately to avoid delays */
>  		hrtimer_start(&chan->mbox->poll_hrt, 0, HRTIMER_MODE_REL);
> +	}
>  }
>  
>  static void tx_tick(struct mbox_chan *chan, int r)

I've tracked a regression back to this commit. Details to reproduce:


 - HEAD: (linux-next)
   * 62c04453381e  Jerome Brunet   3 weeks ago mailbox: cancel timer before starting it

 - Platform: arm64 Juno R0 and Juno R2 [1]

 - Partial log:
	[    0.000000] Booting Linux on physical CPU 0x0000000100 [0x410fd030]
	[    0.000000] Linux version 5.9.0-rc8-01722-g62c04453381e () (aarch64-none-linux-gnu-gcc (GNU Toolchain for the A-profile Architecture 9.2-2019.12 (arm-9.10)) 9.2.1 20191025, GNU ld (GNU Toolchain for the A-profile Architecture 9.2-2019.12 (arm-9.10)) 2.33.1.20191209) #175 SMP PREEMPT Thu Oct 15 14:17:41 BST 2020
	[    0.000000] Machine model: ARM Juno development board (r0)
	[..]
	[    1.714340] mhu 2b1f0000.mhu: ARM MHU Mailbox registered
	[    1.722768] NET: Registered protocol family 17
	[    1.727364] 9pnet: Installing 9P2000 support
	[    1.731689] Key type dns_resolver registered
	[    1.735474] usb 1-1: new high-speed USB device number 2 using ehci-platform
	[    1.736407] registered taskstats version 1
	[    1.747061] Loading compiled-in X.509 certificates
	[    1.755885] scpi_protocol scpi: SCP Protocol 1.2 Firmware 1.21.0 version
	[    1.770484] cpu cpu0: EM: created perf domain
	[    1.778505] cpu cpu1: EM: created perf domain
	[    1.807449] scpi_clocks scpi:clocks: failed to register clock 'pxlclk'
	[    1.897593] hub 1-1:1.0: USB hub found
	[    1.901656] hub 1-1:1.0: 4 ports detected
	[    2.559453] atkbd serio0: keyboard reset failed on 1c060000.kmi
	[   22.787431] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
	[   22.793536] rcu:     1-...0: (1 ticks this GP) idle=222/1/0x4000000000000002 softirq=63/64 fqs=2626
	[   22.802421]  (detected by 2, t=5255 jiffies, g=-991, q=9)
	[   22.807823] Task dump for CPU 1:
	[   22.811049] task:swapper/1       state:R  running task     stack:    0 pid:    0 ppid:     1 flags:0x0000002a
	[   22.820980] Call trace:
	[   22.823429]  __switch_to+0x138/0x198
	[   23.583444] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 1-... } 5443 jiffies s: 49 root: 0x2/.
	[   23.593995] rcu: blocking rcu_node structures:
	[   23.598449] Task dump for CPU 1:
	[   23.601680] task:swapper/1       state:R  running task     stack:    0 pid:    0 ppid:     1 flags:0x0000002a
	[   23.611619] Call trace:
	[   23.614064]  __switch_to+0x138/0x198
	[   85.807430] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
	[   85.813534] rcu:     1-...0: (1 ticks this GP) idle=222/1/0x4000000000000002 softirq=63/64 fqs=10502
	[   85.822506]  (detected by 2, t=21009 jiffies, g=-991, q=9)
	[   85.827994] Task dump for CPU 1:
	[   85.831220] task:swapper/1       state:R  running task     stack:    0 pid:    0 ppid:     1 flags:0x0000002a
	[   85.841150] Call trace:
	[   85.843596]  __switch_to+0x138/0x198
	[   87.071446] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 1-... } 21315 jiffies s: 49 root: 0x2/.
	[   87.082088] rcu: blocking rcu_node structures:
	[   87.086540] Task dump for CPU 1:
	[   87.089773] task:swapper/1       state:R  running task     stack:    0 pid:    0 ppid:     1 flags:0x0000002a
	[   87.099708] Call trace:
	[   87.102155]  __switch_to+0x138/0x198
	[  148.827442] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
	[  148.833565] rcu:     1-...0: (1 ticks this GP) idle=222/1/0x4000000000000002 softirq=63/64 fqs=18377
	[  148.842543]  (detected by 4, t=36762 jiffies, g=-991, q=9)
	[  148.848037] Task dump for CPU 1:
	[  148.851268] task:swapper/1       state:R  running task     stack:    0 pid:    0 ppid:     1 flags:0x0000002a
	[  148.861207] Call trace:
	[  148.863663]  __switch_to+0x138/0x198
	[  150.559443] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 1-... } 37187 jiffies s: 49 root: 0x2/.
	[  150.570082] rcu: blocking rcu_node structures:
	[  150.574535] Task dump for CPU 1:
	[  150.577767] task:swapper/1       state:R  running task     stack:    0 pid:    0 ppid:     1 flags:0x0000002a
	[  150.587705] Call trace:
	[  150.590151]  __switch_to+0x138/0x198

 - Commit working as expected:
   * 558e4c36ec9f  Krzysztof Kozlowski     7 weeks ago     maiblox: mediatek: Fix handling of platform_get_irq() error


[1] https://developer.arm.com/tools-and-software/development-boards/juno-development-board

Thank you,
Ionela.

> -- 
> 2.25.4
> 
> 
> _______________________________________________
> linux-amlogic mailing list
> linux-amlogic@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-amlogic
Jerome Brunet Oct. 15, 2020, 1:58 p.m. UTC | #3
On Thu 15 Oct 2020 at 15:46, Ionela Voinescu <ionela.voinescu@arm.com> wrote:

> Hi guys,
>
> On Wednesday 23 Sep 2020 at 14:39:16 (+0200), Jerome Brunet wrote:
>> If the txdone is done by polling, it is possible for msg_submit() to start
>> the timer while txdone_hrtimer() callback is running. If the timer needs
>> recheduling, it could already be enqueued by the time hrtimer_forward_now()
>> is called, leading hrtimer to loudly complain.
>> 
>> WARNING: CPU: 3 PID: 74 at kernel/time/hrtimer.c:932 hrtimer_forward+0xc4/0x110
>> CPU: 3 PID: 74 Comm: kworker/u8:1 Not tainted 5.9.0-rc2-00236-gd3520067d01c-dirty #5
>> Hardware name: Libre Computer AML-S805X-AC (DT)
>> Workqueue: events_freezable_power_ thermal_zone_device_check
>> pstate: 20000085 (nzCv daIf -PAN -UAO BTYPE=--)
>> pc : hrtimer_forward+0xc4/0x110
>> lr : txdone_hrtimer+0xf8/0x118
>> [...]
>> 
>> Canceling the timer before starting it ensure that the timer callback is
>> not running when the timer is started, solving this race condition.
>> 
>> Fixes: 0cc67945ea59 ("mailbox: switch to hrtimer for tx_complete polling")
>> Reported-by: Da Xue <da@libre.computer>
>> Signed-off-by: Jerome Brunet <jbrunet@baylibre.com>
>> ---
>>  drivers/mailbox/mailbox.c | 8 ++++++--
>>  1 file changed, 6 insertions(+), 2 deletions(-)
>> 
>> diff --git a/drivers/mailbox/mailbox.c b/drivers/mailbox/mailbox.c
>> index 0b821a5b2db8..34f9ab01caef 100644
>> --- a/drivers/mailbox/mailbox.c
>> +++ b/drivers/mailbox/mailbox.c
>> @@ -82,9 +82,13 @@ static void msg_submit(struct mbox_chan *chan)
>>  exit:
>>  	spin_unlock_irqrestore(&chan->lock, flags);
>>  
>> -	if (!err && (chan->txdone_method & TXDONE_BY_POLL))
>> -		/* kick start the timer immediately to avoid delays */
>> +	if (!err && (chan->txdone_method & TXDONE_BY_POLL)) {
>> +		/* Disable the timer if already active ... */
>> +		hrtimer_cancel(&chan->mbox->poll_hrt);
>> +
>> +		/* ... and kick start it immediately to avoid delays */
>>  		hrtimer_start(&chan->mbox->poll_hrt, 0, HRTIMER_MODE_REL);
>> +	}
>>  }
>>  
>>  static void tx_tick(struct mbox_chan *chan, int r)
>
> I've tracked a regression back to this commit. Details to reproduce:

Hi Ionela,

I don't have access to your platform and I don't get what is going on
from the log below.

Could you please give us a bit more details about what is going on ?

All this patch does is add hrtimer_cancel().
* It is needed if the timer had already been started, which is
  appropriate AFAIU
* It is a NO-OP is the timer is not active.

>
>
>  - HEAD: (linux-next)
>    * 62c04453381e  Jerome Brunet   3 weeks ago mailbox: cancel timer before starting it
>
>  - Platform: arm64 Juno R0 and Juno R2 [1]
>
>  - Partial log:
> 	[    0.000000] Booting Linux on physical CPU 0x0000000100 [0x410fd030]
> 	[    0.000000] Linux version 5.9.0-rc8-01722-g62c04453381e () (aarch64-none-linux-gnu-gcc (GNU Toolchain for the A-profile Architecture 9.2-2019.12 (arm-9.10)) 9.2.1 20191025, GNU ld (GNU Toolchain for the A-profile Architecture 9.2-2019.12 (arm-9.10)) 2.33.1.20191209) #175 SMP PREEMPT Thu Oct 15 14:17:41 BST 2020
> 	[    0.000000] Machine model: ARM Juno development board (r0)
> 	[..]
> 	[    1.714340] mhu 2b1f0000.mhu: ARM MHU Mailbox registered
> 	[    1.722768] NET: Registered protocol family 17
> 	[    1.727364] 9pnet: Installing 9P2000 support
> 	[    1.731689] Key type dns_resolver registered
> 	[    1.735474] usb 1-1: new high-speed USB device number 2 using ehci-platform
> 	[    1.736407] registered taskstats version 1
> 	[    1.747061] Loading compiled-in X.509 certificates
> 	[    1.755885] scpi_protocol scpi: SCP Protocol 1.2 Firmware 1.21.0 version
> 	[    1.770484] cpu cpu0: EM: created perf domain
> 	[    1.778505] cpu cpu1: EM: created perf domain
> 	[    1.807449] scpi_clocks scpi:clocks: failed to register clock 'pxlclk'
> 	[    1.897593] hub 1-1:1.0: USB hub found
> 	[    1.901656] hub 1-1:1.0: 4 ports detected
> 	[    2.559453] atkbd serio0: keyboard reset failed on 1c060000.kmi
> 	[   22.787431] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> 	[   22.793536] rcu:     1-...0: (1 ticks this GP) idle=222/1/0x4000000000000002 softirq=63/64 fqs=2626
> 	[   22.802421]  (detected by 2, t=5255 jiffies, g=-991, q=9)
> 	[   22.807823] Task dump for CPU 1:
> 	[   22.811049] task:swapper/1       state:R  running task     stack:    0 pid:    0 ppid:     1 flags:0x0000002a
> 	[   22.820980] Call trace:
> 	[   22.823429]  __switch_to+0x138/0x198
> 	[   23.583444] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 1-... } 5443 jiffies s: 49 root: 0x2/.
> 	[   23.593995] rcu: blocking rcu_node structures:
> 	[   23.598449] Task dump for CPU 1:
> 	[   23.601680] task:swapper/1       state:R  running task     stack:    0 pid:    0 ppid:     1 flags:0x0000002a
> 	[   23.611619] Call trace:
> 	[   23.614064]  __switch_to+0x138/0x198
> 	[   85.807430] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> 	[   85.813534] rcu:     1-...0: (1 ticks this GP) idle=222/1/0x4000000000000002 softirq=63/64 fqs=10502
> 	[   85.822506]  (detected by 2, t=21009 jiffies, g=-991, q=9)
> 	[   85.827994] Task dump for CPU 1:
> 	[   85.831220] task:swapper/1       state:R  running task     stack:    0 pid:    0 ppid:     1 flags:0x0000002a
> 	[   85.841150] Call trace:
> 	[   85.843596]  __switch_to+0x138/0x198
> 	[   87.071446] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 1-... } 21315 jiffies s: 49 root: 0x2/.
> 	[   87.082088] rcu: blocking rcu_node structures:
> 	[   87.086540] Task dump for CPU 1:
> 	[   87.089773] task:swapper/1       state:R  running task     stack:    0 pid:    0 ppid:     1 flags:0x0000002a
> 	[   87.099708] Call trace:
> 	[   87.102155]  __switch_to+0x138/0x198
> 	[  148.827442] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> 	[  148.833565] rcu:     1-...0: (1 ticks this GP) idle=222/1/0x4000000000000002 softirq=63/64 fqs=18377
> 	[  148.842543]  (detected by 4, t=36762 jiffies, g=-991, q=9)
> 	[  148.848037] Task dump for CPU 1:
> 	[  148.851268] task:swapper/1       state:R  running task     stack:    0 pid:    0 ppid:     1 flags:0x0000002a
> 	[  148.861207] Call trace:
> 	[  148.863663]  __switch_to+0x138/0x198
> 	[  150.559443] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 1-... } 37187 jiffies s: 49 root: 0x2/.
> 	[  150.570082] rcu: blocking rcu_node structures:
> 	[  150.574535] Task dump for CPU 1:
> 	[  150.577767] task:swapper/1       state:R  running task     stack:    0 pid:    0 ppid:     1 flags:0x0000002a
> 	[  150.587705] Call trace:
> 	[  150.590151]  __switch_to+0x138/0x198
>
>  - Commit working as expected:
>    * 558e4c36ec9f  Krzysztof Kozlowski     7 weeks ago     maiblox: mediatek: Fix handling of platform_get_irq() error
>
>
> [1] https://developer.arm.com/tools-and-software/development-boards/juno-development-board
>
> Thank you,
> Ionela.
>
>> -- 
>> 2.25.4
>> 
>> 
>> _______________________________________________
>> linux-amlogic mailing list
>> linux-amlogic@lists.infradead.org
>> http://lists.infradead.org/mailman/listinfo/linux-amlogic
Ionela Voinescu Oct. 15, 2020, 2:29 p.m. UTC | #4
Hi Jerome,

On Thursday 15 Oct 2020 at 15:58:30 (+0200), Jerome Brunet wrote:
> 
> On Thu 15 Oct 2020 at 15:46, Ionela Voinescu <ionela.voinescu@arm.com> wrote:
> 
> > Hi guys,
> >
> > On Wednesday 23 Sep 2020 at 14:39:16 (+0200), Jerome Brunet wrote:
> >> If the txdone is done by polling, it is possible for msg_submit() to start
> >> the timer while txdone_hrtimer() callback is running. If the timer needs
> >> recheduling, it could already be enqueued by the time hrtimer_forward_now()
> >> is called, leading hrtimer to loudly complain.
> >> 
> >> WARNING: CPU: 3 PID: 74 at kernel/time/hrtimer.c:932 hrtimer_forward+0xc4/0x110
> >> CPU: 3 PID: 74 Comm: kworker/u8:1 Not tainted 5.9.0-rc2-00236-gd3520067d01c-dirty #5
> >> Hardware name: Libre Computer AML-S805X-AC (DT)
> >> Workqueue: events_freezable_power_ thermal_zone_device_check
> >> pstate: 20000085 (nzCv daIf -PAN -UAO BTYPE=--)
> >> pc : hrtimer_forward+0xc4/0x110
> >> lr : txdone_hrtimer+0xf8/0x118
> >> [...]
> >> 
> >> Canceling the timer before starting it ensure that the timer callback is
> >> not running when the timer is started, solving this race condition.
> >> 
> >> Fixes: 0cc67945ea59 ("mailbox: switch to hrtimer for tx_complete polling")
> >> Reported-by: Da Xue <da@libre.computer>
> >> Signed-off-by: Jerome Brunet <jbrunet@baylibre.com>
> >> ---
> >>  drivers/mailbox/mailbox.c | 8 ++++++--
> >>  1 file changed, 6 insertions(+), 2 deletions(-)
> >> 
> >> diff --git a/drivers/mailbox/mailbox.c b/drivers/mailbox/mailbox.c
> >> index 0b821a5b2db8..34f9ab01caef 100644
> >> --- a/drivers/mailbox/mailbox.c
> >> +++ b/drivers/mailbox/mailbox.c
> >> @@ -82,9 +82,13 @@ static void msg_submit(struct mbox_chan *chan)
> >>  exit:
> >>  	spin_unlock_irqrestore(&chan->lock, flags);
> >>  
> >> -	if (!err && (chan->txdone_method & TXDONE_BY_POLL))
> >> -		/* kick start the timer immediately to avoid delays */
> >> +	if (!err && (chan->txdone_method & TXDONE_BY_POLL)) {
> >> +		/* Disable the timer if already active ... */
> >> +		hrtimer_cancel(&chan->mbox->poll_hrt);
> >> +
> >> +		/* ... and kick start it immediately to avoid delays */
> >>  		hrtimer_start(&chan->mbox->poll_hrt, 0, HRTIMER_MODE_REL);
> >> +	}
> >>  }
> >>  
> >>  static void tx_tick(struct mbox_chan *chan, int r)
> >
> > I've tracked a regression back to this commit. Details to reproduce:
> 
> Hi Ionela,
> 
> I don't have access to your platform and I don't get what is going on
> from the log below.
> 
> Could you please give us a bit more details about what is going on ?
> 

I'm not familiar with the mailbox subsystem, so the best I can do right
now is to add Sudeep to Cc, in case this conflicts in some way with the
ARM MHU patches [1].

In the meantime I'll get some traces and get more familiar with the
code.

[1]
https://lore.kernel.org/linux-arm-kernel/20201009113129.uqw5zrqztjgw6vga@bogus/

Hope it helps,
Ionela.

> All this patch does is add hrtimer_cancel().
> * It is needed if the timer had already been started, which is
>   appropriate AFAIU
> * It is a NO-OP is the timer is not active.
> 
> >
> >
> >  - HEAD: (linux-next)
> >    * 62c04453381e  Jerome Brunet   3 weeks ago mailbox: cancel timer before starting it
> >
> >  - Platform: arm64 Juno R0 and Juno R2 [1]
> >
> >  - Partial log:
> > 	[    0.000000] Booting Linux on physical CPU 0x0000000100 [0x410fd030]
> > 	[    0.000000] Linux version 5.9.0-rc8-01722-g62c04453381e () (aarch64-none-linux-gnu-gcc (GNU Toolchain for the A-profile Architecture 9.2-2019.12 (arm-9.10)) 9.2.1 20191025, GNU ld (GNU Toolchain for the A-profile Architecture 9.2-2019.12 (arm-9.10)) 2.33.1.20191209) #175 SMP PREEMPT Thu Oct 15 14:17:41 BST 2020
> > 	[    0.000000] Machine model: ARM Juno development board (r0)
> > 	[..]
> > 	[    1.714340] mhu 2b1f0000.mhu: ARM MHU Mailbox registered
> > 	[    1.722768] NET: Registered protocol family 17
> > 	[    1.727364] 9pnet: Installing 9P2000 support
> > 	[    1.731689] Key type dns_resolver registered
> > 	[    1.735474] usb 1-1: new high-speed USB device number 2 using ehci-platform
> > 	[    1.736407] registered taskstats version 1
> > 	[    1.747061] Loading compiled-in X.509 certificates
> > 	[    1.755885] scpi_protocol scpi: SCP Protocol 1.2 Firmware 1.21.0 version
> > 	[    1.770484] cpu cpu0: EM: created perf domain
> > 	[    1.778505] cpu cpu1: EM: created perf domain
> > 	[    1.807449] scpi_clocks scpi:clocks: failed to register clock 'pxlclk'
> > 	[    1.897593] hub 1-1:1.0: USB hub found
> > 	[    1.901656] hub 1-1:1.0: 4 ports detected
> > 	[    2.559453] atkbd serio0: keyboard reset failed on 1c060000.kmi
> > 	[   22.787431] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> > 	[   22.793536] rcu:     1-...0: (1 ticks this GP) idle=222/1/0x4000000000000002 softirq=63/64 fqs=2626
> > 	[   22.802421]  (detected by 2, t=5255 jiffies, g=-991, q=9)
> > 	[   22.807823] Task dump for CPU 1:
> > 	[   22.811049] task:swapper/1       state:R  running task     stack:    0 pid:    0 ppid:     1 flags:0x0000002a
> > 	[   22.820980] Call trace:
> > 	[   22.823429]  __switch_to+0x138/0x198
> > 	[   23.583444] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 1-... } 5443 jiffies s: 49 root: 0x2/.
> > 	[   23.593995] rcu: blocking rcu_node structures:
> > 	[   23.598449] Task dump for CPU 1:
> > 	[   23.601680] task:swapper/1       state:R  running task     stack:    0 pid:    0 ppid:     1 flags:0x0000002a
> > 	[   23.611619] Call trace:
> > 	[   23.614064]  __switch_to+0x138/0x198
> > 	[   85.807430] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> > 	[   85.813534] rcu:     1-...0: (1 ticks this GP) idle=222/1/0x4000000000000002 softirq=63/64 fqs=10502
> > 	[   85.822506]  (detected by 2, t=21009 jiffies, g=-991, q=9)
> > 	[   85.827994] Task dump for CPU 1:
> > 	[   85.831220] task:swapper/1       state:R  running task     stack:    0 pid:    0 ppid:     1 flags:0x0000002a
> > 	[   85.841150] Call trace:
> > 	[   85.843596]  __switch_to+0x138/0x198
> > 	[   87.071446] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 1-... } 21315 jiffies s: 49 root: 0x2/.
> > 	[   87.082088] rcu: blocking rcu_node structures:
> > 	[   87.086540] Task dump for CPU 1:
> > 	[   87.089773] task:swapper/1       state:R  running task     stack:    0 pid:    0 ppid:     1 flags:0x0000002a
> > 	[   87.099708] Call trace:
> > 	[   87.102155]  __switch_to+0x138/0x198
> > 	[  148.827442] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> > 	[  148.833565] rcu:     1-...0: (1 ticks this GP) idle=222/1/0x4000000000000002 softirq=63/64 fqs=18377
> > 	[  148.842543]  (detected by 4, t=36762 jiffies, g=-991, q=9)
> > 	[  148.848037] Task dump for CPU 1:
> > 	[  148.851268] task:swapper/1       state:R  running task     stack:    0 pid:    0 ppid:     1 flags:0x0000002a
> > 	[  148.861207] Call trace:
> > 	[  148.863663]  __switch_to+0x138/0x198
> > 	[  150.559443] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 1-... } 37187 jiffies s: 49 root: 0x2/.
> > 	[  150.570082] rcu: blocking rcu_node structures:
> > 	[  150.574535] Task dump for CPU 1:
> > 	[  150.577767] task:swapper/1       state:R  running task     stack:    0 pid:    0 ppid:     1 flags:0x0000002a
> > 	[  150.587705] Call trace:
> > 	[  150.590151]  __switch_to+0x138/0x198
> >
> >  - Commit working as expected:
> >    * 558e4c36ec9f  Krzysztof Kozlowski     7 weeks ago     maiblox: mediatek: Fix handling of platform_get_irq() error
> >
> >
> > [1] https://developer.arm.com/tools-and-software/development-boards/juno-development-board
> >
> > Thank you,
> > Ionela.
> >
> >> -- 
> >> 2.25.4
> >> 
> >> 
> >> _______________________________________________
> >> linux-amlogic mailing list
> >> linux-amlogic@lists.infradead.org
> >> http://lists.infradead.org/mailman/listinfo/linux-amlogic
>
Jassi Brar Oct. 15, 2020, 6:45 p.m. UTC | #5
On Thu, Oct 15, 2020 at 8:58 AM Jerome Brunet <jbrunet@baylibre.com> wrote:
>
>
> On Thu 15 Oct 2020 at 15:46, Ionela Voinescu <ionela.voinescu@arm.com> wrote:
>
> > Hi guys,
> >
> > On Wednesday 23 Sep 2020 at 14:39:16 (+0200), Jerome Brunet wrote:
> >> If the txdone is done by polling, it is possible for msg_submit() to start
> >> the timer while txdone_hrtimer() callback is running. If the timer needs
> >> recheduling, it could already be enqueued by the time hrtimer_forward_now()
> >> is called, leading hrtimer to loudly complain.
> >>
> >> WARNING: CPU: 3 PID: 74 at kernel/time/hrtimer.c:932 hrtimer_forward+0xc4/0x110
> >> CPU: 3 PID: 74 Comm: kworker/u8:1 Not tainted 5.9.0-rc2-00236-gd3520067d01c-dirty #5
> >> Hardware name: Libre Computer AML-S805X-AC (DT)
> >> Workqueue: events_freezable_power_ thermal_zone_device_check
> >> pstate: 20000085 (nzCv daIf -PAN -UAO BTYPE=--)
> >> pc : hrtimer_forward+0xc4/0x110
> >> lr : txdone_hrtimer+0xf8/0x118
> >> [...]
> >>
> >> Canceling the timer before starting it ensure that the timer callback is
> >> not running when the timer is started, solving this race condition.
> >>
> >> Fixes: 0cc67945ea59 ("mailbox: switch to hrtimer for tx_complete polling")
> >> Reported-by: Da Xue <da@libre.computer>
> >> Signed-off-by: Jerome Brunet <jbrunet@baylibre.com>
> >> ---
> >>  drivers/mailbox/mailbox.c | 8 ++++++--
> >>  1 file changed, 6 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/drivers/mailbox/mailbox.c b/drivers/mailbox/mailbox.c
> >> index 0b821a5b2db8..34f9ab01caef 100644
> >> --- a/drivers/mailbox/mailbox.c
> >> +++ b/drivers/mailbox/mailbox.c
> >> @@ -82,9 +82,13 @@ static void msg_submit(struct mbox_chan *chan)
> >>  exit:
> >>      spin_unlock_irqrestore(&chan->lock, flags);
> >>
> >> -    if (!err && (chan->txdone_method & TXDONE_BY_POLL))
> >> -            /* kick start the timer immediately to avoid delays */
> >> +    if (!err && (chan->txdone_method & TXDONE_BY_POLL)) {
> >> +            /* Disable the timer if already active ... */
> >> +            hrtimer_cancel(&chan->mbox->poll_hrt);
> >> +
> >> +            /* ... and kick start it immediately to avoid delays */
> >>              hrtimer_start(&chan->mbox->poll_hrt, 0, HRTIMER_MODE_REL);
> >> +    }
> >>  }
> >>
> >>  static void tx_tick(struct mbox_chan *chan, int r)
> >
> > I've tracked a regression back to this commit. Details to reproduce:
>
> Hi Ionela,
>
> I don't have access to your platform and I don't get what is going on
> from the log below.
>
> Could you please give us a bit more details about what is going on ?
>
> All this patch does is add hrtimer_cancel().
> * It is needed if the timer had already been started, which is
>   appropriate AFAIU
> * It is a NO-OP is the timer is not active.
>
Can you please try using hrtimer_try_to_cancel() instead ?

thanks
Sudeep Holla Oct. 16, 2020, 8:44 a.m. UTC | #6
On Thu, Oct 15, 2020 at 03:29:35PM +0100, Ionela Voinescu wrote:
> Hi Jerome,
> 
> On Thursday 15 Oct 2020 at 15:58:30 (+0200), Jerome Brunet wrote:
> > 
> > On Thu 15 Oct 2020 at 15:46, Ionela Voinescu <ionela.voinescu@arm.com> wrote:
> > 
> > > Hi guys,
> > >
> > > On Wednesday 23 Sep 2020 at 14:39:16 (+0200), Jerome Brunet wrote:
> > >> If the txdone is done by polling, it is possible for msg_submit() to start
> > >> the timer while txdone_hrtimer() callback is running. If the timer needs
> > >> recheduling, it could already be enqueued by the time hrtimer_forward_now()
> > >> is called, leading hrtimer to loudly complain.
> > >> 
> > >> WARNING: CPU: 3 PID: 74 at kernel/time/hrtimer.c:932 hrtimer_forward+0xc4/0x110
> > >> CPU: 3 PID: 74 Comm: kworker/u8:1 Not tainted 5.9.0-rc2-00236-gd3520067d01c-dirty #5
> > >> Hardware name: Libre Computer AML-S805X-AC (DT)
> > >> Workqueue: events_freezable_power_ thermal_zone_device_check
> > >> pstate: 20000085 (nzCv daIf -PAN -UAO BTYPE=--)
> > >> pc : hrtimer_forward+0xc4/0x110
> > >> lr : txdone_hrtimer+0xf8/0x118
> > >> [...]
> > >> 
> > >> Canceling the timer before starting it ensure that the timer callback is
> > >> not running when the timer is started, solving this race condition.
> > >> 
> > >> Fixes: 0cc67945ea59 ("mailbox: switch to hrtimer for tx_complete polling")
> > >> Reported-by: Da Xue <da@libre.computer>
> > >> Signed-off-by: Jerome Brunet <jbrunet@baylibre.com>
> > >> ---
> > >>  drivers/mailbox/mailbox.c | 8 ++++++--
> > >>  1 file changed, 6 insertions(+), 2 deletions(-)
> > >> 
> > >> diff --git a/drivers/mailbox/mailbox.c b/drivers/mailbox/mailbox.c
> > >> index 0b821a5b2db8..34f9ab01caef 100644
> > >> --- a/drivers/mailbox/mailbox.c
> > >> +++ b/drivers/mailbox/mailbox.c
> > >> @@ -82,9 +82,13 @@ static void msg_submit(struct mbox_chan *chan)
> > >>  exit:
> > >>  	spin_unlock_irqrestore(&chan->lock, flags);
> > >>  
> > >> -	if (!err && (chan->txdone_method & TXDONE_BY_POLL))
> > >> -		/* kick start the timer immediately to avoid delays */
> > >> +	if (!err && (chan->txdone_method & TXDONE_BY_POLL)) {
> > >> +		/* Disable the timer if already active ... */
> > >> +		hrtimer_cancel(&chan->mbox->poll_hrt);
> > >> +
> > >> +		/* ... and kick start it immediately to avoid delays */
> > >>  		hrtimer_start(&chan->mbox->poll_hrt, 0, HRTIMER_MODE_REL);
> > >> +	}
> > >>  }
> > >>  
> > >>  static void tx_tick(struct mbox_chan *chan, int r)
> > >
> > > I've tracked a regression back to this commit. Details to reproduce:
> > 
> > Hi Ionela,
> > 
> > I don't have access to your platform and I don't get what is going on
> > from the log below.
> > 
> > Could you please give us a bit more details about what is going on ?
> > 
> 
> I'm not familiar with the mailbox subsystem, so the best I can do right
> now is to add Sudeep to Cc, in case this conflicts in some way with the
> ARM MHU patches [1].
>

Not it can't be doorbell driver as we use SCPI(old firmware) with upstream
MHU driver as is limiting the number of channels to be used.

> In the meantime I'll get some traces and get more familiar with the
> code.
>

I will try that too.
Ionela Voinescu Oct. 16, 2020, 8:52 a.m. UTC | #7
On Thursday 15 Oct 2020 at 13:45:54 (-0500), Jassi Brar wrote:
[..]
> > >> --- a/drivers/mailbox/mailbox.c
> > >> +++ b/drivers/mailbox/mailbox.c
> > >> @@ -82,9 +82,13 @@ static void msg_submit(struct mbox_chan *chan)
> > >>  exit:
> > >>      spin_unlock_irqrestore(&chan->lock, flags);
> > >>
> > >> -    if (!err && (chan->txdone_method & TXDONE_BY_POLL))
> > >> -            /* kick start the timer immediately to avoid delays */
> > >> +    if (!err && (chan->txdone_method & TXDONE_BY_POLL)) {
> > >> +            /* Disable the timer if already active ... */
> > >> +            hrtimer_cancel(&chan->mbox->poll_hrt);
> > >> +
> > >> +            /* ... and kick start it immediately to avoid delays */
> > >>              hrtimer_start(&chan->mbox->poll_hrt, 0, HRTIMER_MODE_REL);
> > >> +    }
> > >>  }
> > >>
> > >>  static void tx_tick(struct mbox_chan *chan, int r)
> > >
> > > I've tracked a regression back to this commit. Details to reproduce:
> >
> > Hi Ionela,
> >
> > I don't have access to your platform and I don't get what is going on
> > from the log below.
> >
> > Could you please give us a bit more details about what is going on ?
> >
> > All this patch does is add hrtimer_cancel().
> > * It is needed if the timer had already been started, which is
> >   appropriate AFAIU
> > * It is a NO-OP is the timer is not active.
> >
> Can you please try using hrtimer_try_to_cancel() instead ?
> 

Yes, using hrtimer_try_to_cancel() instead works for me. But doesn't
this limit how effective this change is? AFAIU, this will possibly only
reduce the chances for the race condition, but not solve it.

Thanks,
Ionela.

> thanks
Jerome Brunet Oct. 16, 2020, 9 a.m. UTC | #8
On Fri 16 Oct 2020 at 10:52, Ionela Voinescu <ionela.voinescu@arm.com> wrote:

> On Thursday 15 Oct 2020 at 13:45:54 (-0500), Jassi Brar wrote:
> [..]
>> > >> --- a/drivers/mailbox/mailbox.c
>> > >> +++ b/drivers/mailbox/mailbox.c
>> > >> @@ -82,9 +82,13 @@ static void msg_submit(struct mbox_chan *chan)
>> > >>  exit:
>> > >>      spin_unlock_irqrestore(&chan->lock, flags);
>> > >>
>> > >> -    if (!err && (chan->txdone_method & TXDONE_BY_POLL))
>> > >> -            /* kick start the timer immediately to avoid delays */
>> > >> +    if (!err && (chan->txdone_method & TXDONE_BY_POLL)) {
>> > >> +            /* Disable the timer if already active ... */
>> > >> +            hrtimer_cancel(&chan->mbox->poll_hrt);
>> > >> +
>> > >> +            /* ... and kick start it immediately to avoid delays */
>> > >>              hrtimer_start(&chan->mbox->poll_hrt, 0, HRTIMER_MODE_REL);
>> > >> +    }
>> > >>  }
>> > >>
>> > >>  static void tx_tick(struct mbox_chan *chan, int r)
>> > >
>> > > I've tracked a regression back to this commit. Details to reproduce:
>> >
>> > Hi Ionela,
>> >
>> > I don't have access to your platform and I don't get what is going on
>> > from the log below.
>> >
>> > Could you please give us a bit more details about what is going on ?
>> >
>> > All this patch does is add hrtimer_cancel().
>> > * It is needed if the timer had already been started, which is
>> >   appropriate AFAIU
>> > * It is a NO-OP is the timer is not active.
>> >
>> Can you please try using hrtimer_try_to_cancel() instead ?
>> 
>
> Yes, using hrtimer_try_to_cancel() instead works for me. But doesn't
> this limit how effective this change is? AFAIU, this will possibly only
> reduce the chances for the race condition, but not solve it.
>

It is also my understanding, hrtimer_try_to_cancel() would remove a
timer which as not already started but would return withtout doing
anything if the callback is already running ... which is the original
problem


> Thanks,
> Ionela.
>
>> thanks
Jerome Brunet Oct. 16, 2020, 9:02 a.m. UTC | #9
On Fri 16 Oct 2020 at 10:44, Sudeep Holla <sudeep.holla@arm.com> wrote:

> On Thu, Oct 15, 2020 at 03:29:35PM +0100, Ionela Voinescu wrote:
>> Hi Jerome,
>> 
>> On Thursday 15 Oct 2020 at 15:58:30 (+0200), Jerome Brunet wrote:
>> > 
>> > On Thu 15 Oct 2020 at 15:46, Ionela Voinescu <ionela.voinescu@arm.com> wrote:
>> > 
>> > > Hi guys,
>> > >
>> > > On Wednesday 23 Sep 2020 at 14:39:16 (+0200), Jerome Brunet wrote:
>> > >> If the txdone is done by polling, it is possible for msg_submit() to start
>> > >> the timer while txdone_hrtimer() callback is running. If the timer needs
>> > >> recheduling, it could already be enqueued by the time hrtimer_forward_now()
>> > >> is called, leading hrtimer to loudly complain.
>> > >> 
>> > >> WARNING: CPU: 3 PID: 74 at kernel/time/hrtimer.c:932 hrtimer_forward+0xc4/0x110
>> > >> CPU: 3 PID: 74 Comm: kworker/u8:1 Not tainted 5.9.0-rc2-00236-gd3520067d01c-dirty #5
>> > >> Hardware name: Libre Computer AML-S805X-AC (DT)
>> > >> Workqueue: events_freezable_power_ thermal_zone_device_check
>> > >> pstate: 20000085 (nzCv daIf -PAN -UAO BTYPE=--)
>> > >> pc : hrtimer_forward+0xc4/0x110
>> > >> lr : txdone_hrtimer+0xf8/0x118
>> > >> [...]
>> > >> 
>> > >> Canceling the timer before starting it ensure that the timer callback is
>> > >> not running when the timer is started, solving this race condition.
>> > >> 
>> > >> Fixes: 0cc67945ea59 ("mailbox: switch to hrtimer for tx_complete polling")
>> > >> Reported-by: Da Xue <da@libre.computer>
>> > >> Signed-off-by: Jerome Brunet <jbrunet@baylibre.com>
>> > >> ---
>> > >>  drivers/mailbox/mailbox.c | 8 ++++++--
>> > >>  1 file changed, 6 insertions(+), 2 deletions(-)
>> > >> 
>> > >> diff --git a/drivers/mailbox/mailbox.c b/drivers/mailbox/mailbox.c
>> > >> index 0b821a5b2db8..34f9ab01caef 100644
>> > >> --- a/drivers/mailbox/mailbox.c
>> > >> +++ b/drivers/mailbox/mailbox.c
>> > >> @@ -82,9 +82,13 @@ static void msg_submit(struct mbox_chan *chan)
>> > >>  exit:
>> > >>  	spin_unlock_irqrestore(&chan->lock, flags);
>> > >>  
>> > >> -	if (!err && (chan->txdone_method & TXDONE_BY_POLL))
>> > >> -		/* kick start the timer immediately to avoid delays */
>> > >> +	if (!err && (chan->txdone_method & TXDONE_BY_POLL)) {
>> > >> +		/* Disable the timer if already active ... */
>> > >> +		hrtimer_cancel(&chan->mbox->poll_hrt);
>> > >> +
>> > >> +		/* ... and kick start it immediately to avoid delays */
>> > >>  		hrtimer_start(&chan->mbox->poll_hrt, 0, HRTIMER_MODE_REL);
>> > >> +	}
>> > >>  }
>> > >>  
>> > >>  static void tx_tick(struct mbox_chan *chan, int r)
>> > >
>> > > I've tracked a regression back to this commit. Details to reproduce:
>> > 
>> > Hi Ionela,
>> > 
>> > I don't have access to your platform and I don't get what is going on
>> > from the log below.
>> > 
>> > Could you please give us a bit more details about what is going on ?
>> > 
>> 
>> I'm not familiar with the mailbox subsystem, so the best I can do right
>> now is to add Sudeep to Cc, in case this conflicts in some way with the
>> ARM MHU patches [1].
>>
>
> Not it can't be doorbell driver as we use SCPI(old firmware) with upstream
> MHU driver as is limiting the number of channels to be used.
>
>> In the meantime I'll get some traces and get more familiar with the
>> code.
>>
>
> I will try that too.

BTW, this issue was originally reported on amlogic platforms which also
use arm,mhu mailbox driver.
Sudeep Holla Oct. 16, 2020, 9:34 a.m. UTC | #10
On Fri, Oct 16, 2020 at 11:02:02AM +0200, Jerome Brunet wrote:
> 
> On Fri 16 Oct 2020 at 10:44, Sudeep Holla <sudeep.holla@arm.com> wrote:
> 
> > On Thu, Oct 15, 2020 at 03:29:35PM +0100, Ionela Voinescu wrote:
> >> Hi Jerome,
> >> 
> >> On Thursday 15 Oct 2020 at 15:58:30 (+0200), Jerome Brunet wrote:
> >> > 
> >> > On Thu 15 Oct 2020 at 15:46, Ionela Voinescu <ionela.voinescu@arm.com> wrote:
> >> > 
> >> > > Hi guys,
> >> > >
> >> > > On Wednesday 23 Sep 2020 at 14:39:16 (+0200), Jerome Brunet wrote:
> >> > >> If the txdone is done by polling, it is possible for msg_submit() to start
> >> > >> the timer while txdone_hrtimer() callback is running. If the timer needs
> >> > >> recheduling, it could already be enqueued by the time hrtimer_forward_now()
> >> > >> is called, leading hrtimer to loudly complain.
> >> > >> 
> >> > >> WARNING: CPU: 3 PID: 74 at kernel/time/hrtimer.c:932 hrtimer_forward+0xc4/0x110
> >> > >> CPU: 3 PID: 74 Comm: kworker/u8:1 Not tainted 5.9.0-rc2-00236-gd3520067d01c-dirty #5
> >> > >> Hardware name: Libre Computer AML-S805X-AC (DT)
> >> > >> Workqueue: events_freezable_power_ thermal_zone_device_check
> >> > >> pstate: 20000085 (nzCv daIf -PAN -UAO BTYPE=--)
> >> > >> pc : hrtimer_forward+0xc4/0x110
> >> > >> lr : txdone_hrtimer+0xf8/0x118
> >> > >> [...]
> >> > >> 
> >> > >> Canceling the timer before starting it ensure that the timer callback is
> >> > >> not running when the timer is started, solving this race condition.
> >> > >> 
> >> > >> Fixes: 0cc67945ea59 ("mailbox: switch to hrtimer for tx_complete polling")
> >> > >> Reported-by: Da Xue <da@libre.computer>
> >> > >> Signed-off-by: Jerome Brunet <jbrunet@baylibre.com>
> >> > >> ---
> >> > >>  drivers/mailbox/mailbox.c | 8 ++++++--
> >> > >>  1 file changed, 6 insertions(+), 2 deletions(-)
> >> > >> 
> >> > >> diff --git a/drivers/mailbox/mailbox.c b/drivers/mailbox/mailbox.c
> >> > >> index 0b821a5b2db8..34f9ab01caef 100644
> >> > >> --- a/drivers/mailbox/mailbox.c
> >> > >> +++ b/drivers/mailbox/mailbox.c
> >> > >> @@ -82,9 +82,13 @@ static void msg_submit(struct mbox_chan *chan)
> >> > >>  exit:
> >> > >>  	spin_unlock_irqrestore(&chan->lock, flags);
> >> > >>  
> >> > >> -	if (!err && (chan->txdone_method & TXDONE_BY_POLL))
> >> > >> -		/* kick start the timer immediately to avoid delays */
> >> > >> +	if (!err && (chan->txdone_method & TXDONE_BY_POLL)) {
> >> > >> +		/* Disable the timer if already active ... */
> >> > >> +		hrtimer_cancel(&chan->mbox->poll_hrt);
> >> > >> +
> >> > >> +		/* ... and kick start it immediately to avoid delays */
> >> > >>  		hrtimer_start(&chan->mbox->poll_hrt, 0, HRTIMER_MODE_REL);
> >> > >> +	}
> >> > >>  }
> >> > >>  
> >> > >>  static void tx_tick(struct mbox_chan *chan, int r)
> >> > >
> >> > > I've tracked a regression back to this commit. Details to reproduce:
> >> > 
> >> > Hi Ionela,
> >> > 
> >> > I don't have access to your platform and I don't get what is going on
> >> > from the log below.
> >> > 
> >> > Could you please give us a bit more details about what is going on ?
> >> > 
> >> 
> >> I'm not familiar with the mailbox subsystem, so the best I can do right
> >> now is to add Sudeep to Cc, in case this conflicts in some way with the
> >> ARM MHU patches [1].
> >>
> >
> > Not it can't be doorbell driver as we use SCPI(old firmware) with upstream
> > MHU driver as is limiting the number of channels to be used.
> >
> >> In the meantime I'll get some traces and get more familiar with the
> >> code.
> >>
> >
> > I will try that too.
> 
> BTW, this issue was originally reported on amlogic platforms which also
> use arm,mhu mailbox driver.
> 

OK. Anyway just noticed that hrtimer_cancel uses  hrtimer_try_to_cancel
and hrtimer_cancel_wait_running. The latter is just cpu_relax() if
PREEMPT_RT=n, so you may still have issue if the hrtimer is still active
or restarts in the meantime.
Sudeep Holla Oct. 16, 2020, 11:47 a.m. UTC | #11
On Fri, Oct 16, 2020 at 10:34:21AM +0100, Sudeep Holla wrote:
> On Fri, Oct 16, 2020 at 11:02:02AM +0200, Jerome Brunet wrote:
> > 
> > On Fri 16 Oct 2020 at 10:44, Sudeep Holla <sudeep.holla@arm.com> wrote:
> > 
> > > On Thu, Oct 15, 2020 at 03:29:35PM +0100, Ionela Voinescu wrote:
> > >> Hi Jerome,
> > >> 
> > >> On Thursday 15 Oct 2020 at 15:58:30 (+0200), Jerome Brunet wrote:
> > >> > 
> > >> > On Thu 15 Oct 2020 at 15:46, Ionela Voinescu <ionela.voinescu@arm.com> wrote:
> > >> > 
> > >> > > Hi guys,
> > >> > >
> > >> > > On Wednesday 23 Sep 2020 at 14:39:16 (+0200), Jerome Brunet wrote:
> > >> > >> If the txdone is done by polling, it is possible for msg_submit() to start
> > >> > >> the timer while txdone_hrtimer() callback is running. If the timer needs
> > >> > >> recheduling, it could already be enqueued by the time hrtimer_forward_now()
> > >> > >> is called, leading hrtimer to loudly complain.
> > >> > >> 
> > >> > >> WARNING: CPU: 3 PID: 74 at kernel/time/hrtimer.c:932 hrtimer_forward+0xc4/0x110
> > >> > >> CPU: 3 PID: 74 Comm: kworker/u8:1 Not tainted 5.9.0-rc2-00236-gd3520067d01c-dirty #5
> > >> > >> Hardware name: Libre Computer AML-S805X-AC (DT)
> > >> > >> Workqueue: events_freezable_power_ thermal_zone_device_check
> > >> > >> pstate: 20000085 (nzCv daIf -PAN -UAO BTYPE=--)
> > >> > >> pc : hrtimer_forward+0xc4/0x110
> > >> > >> lr : txdone_hrtimer+0xf8/0x118
> > >> > >> [...]
> > >> > >> 
> > >> > >> Canceling the timer before starting it ensure that the timer callback is
> > >> > >> not running when the timer is started, solving this race condition.
> > >> > >> 
> > >> > >> Fixes: 0cc67945ea59 ("mailbox: switch to hrtimer for tx_complete polling")
> > >> > >> Reported-by: Da Xue <da@libre.computer>
> > >> > >> Signed-off-by: Jerome Brunet <jbrunet@baylibre.com>
> > >> > >> ---
> > >> > >>  drivers/mailbox/mailbox.c | 8 ++++++--
> > >> > >>  1 file changed, 6 insertions(+), 2 deletions(-)
> > >> > >> 
> > >> > >> diff --git a/drivers/mailbox/mailbox.c b/drivers/mailbox/mailbox.c
> > >> > >> index 0b821a5b2db8..34f9ab01caef 100644
> > >> > >> --- a/drivers/mailbox/mailbox.c
> > >> > >> +++ b/drivers/mailbox/mailbox.c
> > >> > >> @@ -82,9 +82,13 @@ static void msg_submit(struct mbox_chan *chan)
> > >> > >>  exit:
> > >> > >>  	spin_unlock_irqrestore(&chan->lock, flags);
> > >> > >>  
> > >> > >> -	if (!err && (chan->txdone_method & TXDONE_BY_POLL))
> > >> > >> -		/* kick start the timer immediately to avoid delays */
> > >> > >> +	if (!err && (chan->txdone_method & TXDONE_BY_POLL)) {
> > >> > >> +		/* Disable the timer if already active ... */
> > >> > >> +		hrtimer_cancel(&chan->mbox->poll_hrt);
> > >> > >> +
> > >> > >> +		/* ... and kick start it immediately to avoid delays */
> > >> > >>  		hrtimer_start(&chan->mbox->poll_hrt, 0, HRTIMER_MODE_REL);
> > >> > >> +	}
> > >> > >>  }
> > >> > >>  
> > >> > >>  static void tx_tick(struct mbox_chan *chan, int r)
> > >> > >
> > >> > > I've tracked a regression back to this commit. Details to reproduce:
> > >> > 
> > >> > Hi Ionela,
> > >> > 
> > >> > I don't have access to your platform and I don't get what is going on
> > >> > from the log below.
> > >> > 
> > >> > Could you please give us a bit more details about what is going on ?
> > >> > 
> > >> 
> > >> I'm not familiar with the mailbox subsystem, so the best I can do right
> > >> now is to add Sudeep to Cc, in case this conflicts in some way with the
> > >> ARM MHU patches [1].
> > >>
> > >
> > > Not it can't be doorbell driver as we use SCPI(old firmware) with upstream
> > > MHU driver as is limiting the number of channels to be used.
> > >
> > >> In the meantime I'll get some traces and get more familiar with the
> > >> code.
> > >>
> > >
> > > I will try that too.
> > 
> > BTW, this issue was originally reported on amlogic platforms which also
> > use arm,mhu mailbox driver.
> > 
> 
> OK. Anyway just noticed that hrtimer_cancel uses  hrtimer_try_to_cancel
> and hrtimer_cancel_wait_running. The latter is just cpu_relax() if
> PREEMPT_RT=n, so you may still have issue if the hrtimer is still active
> or restarts in the meantime. 
>

Scratch that, I failed to see the loop in hrtimer_cancel earlier.
Sudeep Holla Oct. 16, 2020, 11:52 a.m. UTC | #12
On Fri, Oct 16, 2020 at 12:32 PM Ionela Voinescu
<ionela.voinescu@arm.com> wrote:
>
> On Thursday 15 Oct 2020 at 13:45:54 (-0500), Jassi Brar wrote:
> [..]
> > > >> --- a/drivers/mailbox/mailbox.c
> > > >> +++ b/drivers/mailbox/mailbox.c
> > > >> @@ -82,9 +82,13 @@ static void msg_submit(struct mbox_chan *chan)
> > > >>  exit:
> > > >>      spin_unlock_irqrestore(&chan->lock, flags);
> > > >>
> > > >> -    if (!err && (chan->txdone_method & TXDONE_BY_POLL))
> > > >> -            /* kick start the timer immediately to avoid delays */
> > > >> +    if (!err && (chan->txdone_method & TXDONE_BY_POLL)) {
> > > >> +            /* Disable the timer if already active ... */
> > > >> +            hrtimer_cancel(&chan->mbox->poll_hrt);
> > > >> +
> > > >> +            /* ... and kick start it immediately to avoid delays */
> > > >>              hrtimer_start(&chan->mbox->poll_hrt, 0, HRTIMER_MODE_REL);
> > > >> +    }
> > > >>  }
> > > >>
> > > >>  static void tx_tick(struct mbox_chan *chan, int r)
> > > >
> > > > I've tracked a regression back to this commit. Details to reproduce:
> > >
> > > Hi Ionela,
> > >
> > > I don't have access to your platform and I don't get what is going on
> > > from the log below.
> > >
> > > Could you please give us a bit more details about what is going on ?
> > >
> > > All this patch does is add hrtimer_cancel().
> > > * It is needed if the timer had already been started, which is
> > >   appropriate AFAIU
> > > * It is a NO-OP is the timer is not active.
> > >
> > Can you please try using hrtimer_try_to_cancel() instead ?
> >
>
> Yes, using hrtimer_try_to_cancel() instead works for me. But doesn't
> this limit how effective this change is? AFAIU, this will possibly only
> reduce the chances for the race condition, but not solve it.
>

Indeed, I tried the same and got a lot of -ETIME failures. hrtimer_cancel
uses hrtimer_try_to_cancel in loop until it succeeds so that the following
hrtimer_start can be done w/o any pending timers.
Sudeep Holla Oct. 16, 2020, 1:52 p.m. UTC | #13
On Fri, Oct 16, 2020 at 1:15 PM Jerome Brunet <jbrunet@baylibre.com> wrote:
>
>
> On Fri 16 Oct 2020 at 10:52, Ionela Voinescu <ionela.voinescu@arm.com> wrote:
>
> > On Thursday 15 Oct 2020 at 13:45:54 (-0500), Jassi Brar wrote:
> > [..]
> >> > >> --- a/drivers/mailbox/mailbox.c
> >> > >> +++ b/drivers/mailbox/mailbox.c
> >> > >> @@ -82,9 +82,13 @@ static void msg_submit(struct mbox_chan *chan)
> >> > >>  exit:
> >> > >>      spin_unlock_irqrestore(&chan->lock, flags);
> >> > >>
> >> > >> -    if (!err && (chan->txdone_method & TXDONE_BY_POLL))
> >> > >> -            /* kick start the timer immediately to avoid delays */
> >> > >> +    if (!err && (chan->txdone_method & TXDONE_BY_POLL)) {
> >> > >> +            /* Disable the timer if already active ... */
> >> > >> +            hrtimer_cancel(&chan->mbox->poll_hrt);
> >> > >> +
> >> > >> +            /* ... and kick start it immediately to avoid delays */
> >> > >>              hrtimer_start(&chan->mbox->poll_hrt, 0, HRTIMER_MODE_REL);
> >> > >> +    }
> >> > >>  }
> >> > >>
> >> > >>  static void tx_tick(struct mbox_chan *chan, int r)
> >> > >
> >> > > I've tracked a regression back to this commit. Details to reproduce:
> >> >
> >> > Hi Ionela,
> >> >
> >> > I don't have access to your platform and I don't get what is going on
> >> > from the log below.
> >> >
> >> > Could you please give us a bit more details about what is going on ?
> >> >
> >> > All this patch does is add hrtimer_cancel().
> >> > * It is needed if the timer had already been started, which is
> >> >   appropriate AFAIU
> >> > * It is a NO-OP is the timer is not active.
> >> >
> >> Can you please try using hrtimer_try_to_cancel() instead ?
> >>
> >
> > Yes, using hrtimer_try_to_cancel() instead works for me. But doesn't
> > this limit how effective this change is? AFAIU, this will possibly only
> > reduce the chances for the race condition, but not solve it.
> >
>
> It is also my understanding, hrtimer_try_to_cancel() would remove a
> timer which as not already started but would return withtout doing
> anything if the callback is already running ... which is the original
> problem
>


There seem to be some races. It always hangs in the hrtimer_cancel.
Logging some extra messages makes it progress for a while and finally
get stuck in the loop. I wonder if there is a race between cancel
and handler execution.
Jassi Brar Oct. 16, 2020, 5:33 p.m. UTC | #14
On Fri, Oct 16, 2020 at 4:00 AM Jerome Brunet <jbrunet@baylibre.com> wrote:
>
>
> On Fri 16 Oct 2020 at 10:52, Ionela Voinescu <ionela.voinescu@arm.com> wrote:
>
> > On Thursday 15 Oct 2020 at 13:45:54 (-0500), Jassi Brar wrote:
> > [..]
> >> > >> --- a/drivers/mailbox/mailbox.c
> >> > >> +++ b/drivers/mailbox/mailbox.c
> >> > >> @@ -82,9 +82,13 @@ static void msg_submit(struct mbox_chan *chan)
> >> > >>  exit:
> >> > >>      spin_unlock_irqrestore(&chan->lock, flags);
> >> > >>
> >> > >> -    if (!err && (chan->txdone_method & TXDONE_BY_POLL))
> >> > >> -            /* kick start the timer immediately to avoid delays */
> >> > >> +    if (!err && (chan->txdone_method & TXDONE_BY_POLL)) {
> >> > >> +            /* Disable the timer if already active ... */
> >> > >> +            hrtimer_cancel(&chan->mbox->poll_hrt);
> >> > >> +
> >> > >> +            /* ... and kick start it immediately to avoid delays */
> >> > >>              hrtimer_start(&chan->mbox->poll_hrt, 0, HRTIMER_MODE_REL);
> >> > >> +    }
> >> > >>  }
> >> > >>
> >> > >>  static void tx_tick(struct mbox_chan *chan, int r)
> >> > >
> >> > > I've tracked a regression back to this commit. Details to reproduce:
> >> >
> >> > Hi Ionela,
> >> >
> >> > I don't have access to your platform and I don't get what is going on
> >> > from the log below.
> >> >
> >> > Could you please give us a bit more details about what is going on ?
> >> >
> >> > All this patch does is add hrtimer_cancel().
> >> > * It is needed if the timer had already been started, which is
> >> >   appropriate AFAIU
> >> > * It is a NO-OP is the timer is not active.
> >> >
> >> Can you please try using hrtimer_try_to_cancel() instead ?
> >>
> >
> > Yes, using hrtimer_try_to_cancel() instead works for me. But doesn't
> > this limit how effective this change is? AFAIU, this will possibly only
> > reduce the chances for the race condition, but not solve it.
> >
>
> It is also my understanding, hrtimer_try_to_cancel() would remove a
> timer which as not already started but would return withtout doing
> anything if the callback is already running ... which is the original
> problem
>
If we are running in the callback path, hrtimer_try_to_cancel will
return -1, in which case we could skip hrtimer_start.
Anyways, I think simply checking for hrtimer_active should effect the same.
I have submitted a patch, of course not tested.

Thanks
Jerome Brunet Oct. 16, 2020, 6:54 p.m. UTC | #15
On Fri 16 Oct 2020 at 19:33, Jassi Brar <jassisinghbrar@gmail.com> wrote:

> On Fri, Oct 16, 2020 at 4:00 AM Jerome Brunet <jbrunet@baylibre.com> wrote:
>>
>>
>> On Fri 16 Oct 2020 at 10:52, Ionela Voinescu <ionela.voinescu@arm.com> wrote:
>>
>> > On Thursday 15 Oct 2020 at 13:45:54 (-0500), Jassi Brar wrote:
>> > [..]
>> >> > >> --- a/drivers/mailbox/mailbox.c
>> >> > >> +++ b/drivers/mailbox/mailbox.c
>> >> > >> @@ -82,9 +82,13 @@ static void msg_submit(struct mbox_chan *chan)
>> >> > >>  exit:
>> >> > >>      spin_unlock_irqrestore(&chan->lock, flags);
>> >> > >>
>> >> > >> -    if (!err && (chan->txdone_method & TXDONE_BY_POLL))
>> >> > >> -            /* kick start the timer immediately to avoid delays */
>> >> > >> +    if (!err && (chan->txdone_method & TXDONE_BY_POLL)) {
>> >> > >> +            /* Disable the timer if already active ... */
>> >> > >> +            hrtimer_cancel(&chan->mbox->poll_hrt);
>> >> > >> +
>> >> > >> +            /* ... and kick start it immediately to avoid delays */
>> >> > >>              hrtimer_start(&chan->mbox->poll_hrt, 0, HRTIMER_MODE_REL);
>> >> > >> +    }
>> >> > >>  }
>> >> > >>
>> >> > >>  static void tx_tick(struct mbox_chan *chan, int r)
>> >> > >
>> >> > > I've tracked a regression back to this commit. Details to reproduce:
>> >> >
>> >> > Hi Ionela,
>> >> >
>> >> > I don't have access to your platform and I don't get what is going on
>> >> > from the log below.
>> >> >
>> >> > Could you please give us a bit more details about what is going on ?
>> >> >
>> >> > All this patch does is add hrtimer_cancel().
>> >> > * It is needed if the timer had already been started, which is
>> >> >   appropriate AFAIU
>> >> > * It is a NO-OP is the timer is not active.
>> >> >
>> >> Can you please try using hrtimer_try_to_cancel() instead ?
>> >>
>> >
>> > Yes, using hrtimer_try_to_cancel() instead works for me. But doesn't
>> > this limit how effective this change is? AFAIU, this will possibly only
>> > reduce the chances for the race condition, but not solve it.
>> >
>>
>> It is also my understanding, hrtimer_try_to_cancel() would remove a
>> timer which as not already started but would return withtout doing
>> anything if the callback is already running ... which is the original
>> problem
>>
> If we are running in the callback path, hrtimer_try_to_cancel will
> return -1, in which case we could skip hrtimer_start.
> Anyways, I think simply checking for hrtimer_active should effect the same.
> I have submitted a patch, of course not tested.

Yes it sloves this race but ...

If a race is possible between a timer callback rescheduling itself (which
is not that uncommon) and another thread trying to cancel it, maybe
there is something worth fixing in hrtimer ? Also, mailbox calls
hrtimer_cancel() in unregister ... are we confident this would work ?

Any fix is by me - yours avoid killing and restarting the timer :) but
it feels like we are working around an issue that might bite us back
later on.

>
> Thanks
Jassi Brar Oct. 16, 2020, 7:27 p.m. UTC | #16
On Fri, Oct 16, 2020 at 1:54 PM Jerome Brunet <jbrunet@baylibre.com> wrote:
>
>
> On Fri 16 Oct 2020 at 19:33, Jassi Brar <jassisinghbrar@gmail.com> wrote:
>
> > On Fri, Oct 16, 2020 at 4:00 AM Jerome Brunet <jbrunet@baylibre.com> wrote:
> >>
> >>
> >> On Fri 16 Oct 2020 at 10:52, Ionela Voinescu <ionela.voinescu@arm.com> wrote:
> >>
> >> > On Thursday 15 Oct 2020 at 13:45:54 (-0500), Jassi Brar wrote:
> >> > [..]
> >> >> > >> --- a/drivers/mailbox/mailbox.c
> >> >> > >> +++ b/drivers/mailbox/mailbox.c
> >> >> > >> @@ -82,9 +82,13 @@ static void msg_submit(struct mbox_chan *chan)
> >> >> > >>  exit:
> >> >> > >>      spin_unlock_irqrestore(&chan->lock, flags);
> >> >> > >>
> >> >> > >> -    if (!err && (chan->txdone_method & TXDONE_BY_POLL))
> >> >> > >> -            /* kick start the timer immediately to avoid delays */
> >> >> > >> +    if (!err && (chan->txdone_method & TXDONE_BY_POLL)) {
> >> >> > >> +            /* Disable the timer if already active ... */
> >> >> > >> +            hrtimer_cancel(&chan->mbox->poll_hrt);
> >> >> > >> +
> >> >> > >> +            /* ... and kick start it immediately to avoid delays */
> >> >> > >>              hrtimer_start(&chan->mbox->poll_hrt, 0, HRTIMER_MODE_REL);
> >> >> > >> +    }
> >> >> > >>  }
> >> >> > >>
> >> >> > >>  static void tx_tick(struct mbox_chan *chan, int r)
> >> >> > >
> >> >> > > I've tracked a regression back to this commit. Details to reproduce:
> >> >> >
> >> >> > Hi Ionela,
> >> >> >
> >> >> > I don't have access to your platform and I don't get what is going on
> >> >> > from the log below.
> >> >> >
> >> >> > Could you please give us a bit more details about what is going on ?
> >> >> >
> >> >> > All this patch does is add hrtimer_cancel().
> >> >> > * It is needed if the timer had already been started, which is
> >> >> >   appropriate AFAIU
> >> >> > * It is a NO-OP is the timer is not active.
> >> >> >
> >> >> Can you please try using hrtimer_try_to_cancel() instead ?
> >> >>
> >> >
> >> > Yes, using hrtimer_try_to_cancel() instead works for me. But doesn't
> >> > this limit how effective this change is? AFAIU, this will possibly only
> >> > reduce the chances for the race condition, but not solve it.
> >> >
> >>
> >> It is also my understanding, hrtimer_try_to_cancel() would remove a
> >> timer which as not already started but would return withtout doing
> >> anything if the callback is already running ... which is the original
> >> problem
> >>
> > If we are running in the callback path, hrtimer_try_to_cancel will
> > return -1, in which case we could skip hrtimer_start.
> > Anyways, I think simply checking for hrtimer_active should effect the same.
> > I have submitted a patch, of course not tested.
>
> Yes it sloves this race but ...
>
Thanks for confirmation.

> If a race is possible between a timer callback rescheduling itself (which
> is not that uncommon) and another thread trying to cancel it
>
In our case, we should not be cancelling+restarting the timer in the
first place, because txdone_hrtimer will take care of it via
hrtimer_forward_now.

>, maybe
> there is something worth fixing in hrtimer ? Also, mailbox calls
> hrtimer_cancel() in unregister ... are we confident this would work ?
>
Yes. After unregister() every channel is supposed to die and so must
its resources.

-jassi
Da Xue July 19, 2022, 7:08 a.m. UTC | #17
On Fri, Oct 16, 2020 at 3:27 PM Jassi Brar <jassisinghbrar@gmail.com> wrote:
>
> On Fri, Oct 16, 2020 at 1:54 PM Jerome Brunet <jbrunet@baylibre.com> wrote:
> >
> >
> > On Fri 16 Oct 2020 at 19:33, Jassi Brar <jassisinghbrar@gmail.com> wrote:
> >
> > > On Fri, Oct 16, 2020 at 4:00 AM Jerome Brunet <jbrunet@baylibre.com> wrote:
> > >>
> > >>
> > >> On Fri 16 Oct 2020 at 10:52, Ionela Voinescu <ionela.voinescu@arm.com> wrote:
> > >>
> > >> > On Thursday 15 Oct 2020 at 13:45:54 (-0500), Jassi Brar wrote:
> > >> > [..]
> > >> >> > >> --- a/drivers/mailbox/mailbox.c
> > >> >> > >> +++ b/drivers/mailbox/mailbox.c
> > >> >> > >> @@ -82,9 +82,13 @@ static void msg_submit(struct mbox_chan *chan)
> > >> >> > >>  exit:
> > >> >> > >>      spin_unlock_irqrestore(&chan->lock, flags);
> > >> >> > >>
> > >> >> > >> -    if (!err && (chan->txdone_method & TXDONE_BY_POLL))
> > >> >> > >> -            /* kick start the timer immediately to avoid delays */
> > >> >> > >> +    if (!err && (chan->txdone_method & TXDONE_BY_POLL)) {
> > >> >> > >> +            /* Disable the timer if already active ... */
> > >> >> > >> +            hrtimer_cancel(&chan->mbox->poll_hrt);
> > >> >> > >> +
> > >> >> > >> +            /* ... and kick start it immediately to avoid delays */
> > >> >> > >>              hrtimer_start(&chan->mbox->poll_hrt, 0, HRTIMER_MODE_REL);
> > >> >> > >> +    }
> > >> >> > >>  }
> > >> >> > >>
> > >> >> > >>  static void tx_tick(struct mbox_chan *chan, int r)
> > >> >> > >
> > >> >> > > I've tracked a regression back to this commit. Details to reproduce:
> > >> >> >
> > >> >> > Hi Ionela,
> > >> >> >
> > >> >> > I don't have access to your platform and I don't get what is going on
> > >> >> > from the log below.
> > >> >> >
> > >> >> > Could you please give us a bit more details about what is going on ?
> > >> >> >
> > >> >> > All this patch does is add hrtimer_cancel().
> > >> >> > * It is needed if the timer had already been started, which is
> > >> >> >   appropriate AFAIU
> > >> >> > * It is a NO-OP is the timer is not active.
> > >> >> >
> > >> >> Can you please try using hrtimer_try_to_cancel() instead ?
> > >> >>
> > >> >
> > >> > Yes, using hrtimer_try_to_cancel() instead works for me. But doesn't
> > >> > this limit how effective this change is? AFAIU, this will possibly only
> > >> > reduce the chances for the race condition, but not solve it.
> > >> >
> > >>
> > >> It is also my understanding, hrtimer_try_to_cancel() would remove a
> > >> timer which as not already started but would return withtout doing
> > >> anything if the callback is already running ... which is the original
> > >> problem
> > >>
> > > If we are running in the callback path, hrtimer_try_to_cancel will
> > > return -1, in which case we could skip hrtimer_start.
> > > Anyways, I think simply checking for hrtimer_active should effect the same.
> > > I have submitted a patch, of course not tested.
> >
> > Yes it sloves this race but ...
> >
> Thanks for confirmation.
>
> > If a race is possible between a timer callback rescheduling itself (which
> > is not that uncommon) and another thread trying to cancel it
> >
> In our case, we should not be cancelling+restarting the timer in the
> first place, because txdone_hrtimer will take care of it via
> hrtimer_forward_now.
>
> >, maybe
> > there is something worth fixing in hrtimer ? Also, mailbox calls
> > hrtimer_cancel() in unregister ... are we confident this would work ?
> >
> Yes. After unregister() every channel is supposed to die and so must
> its resources.
>
> -jassi

Hi,

I'm running into an issue where I get "rcu: INFO: rcu_preempt detected
stalls on CPUs/tasks" when booting Linux distributions once every 5 or
so reboots. The system sticks on systemd starting for ~20 seconds
before rcu_preempt prints a message. This only happens on Amlogic (I
tested S805X/S905X/S905D) and I do not get this problem on my
Allwinner or Rockchip boards with the same kernel and image. I have
tried 5.15, 5.18, and 5.19. I dumped a bunch of logs here from the
three kernel versions but the content is the same.
https://github.com/libre-computer-project/libretech-linux/issues/5

+ mount -n -o move /sys /root/sys
+ mount -n -o move /proc /root/proc
+ exec run-init /root /sbin/init
[   26.485689] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[   26.486217] rcu:     Tasks blocked on level-0 rcu_node (CPUs 0-3): P1
[   26.492332]  (detected by 3, t=5252 jiffies, g=-315, q=546)
[   26.497846] task:run-init        state:R  running task     stack:
 0 pid:    1 ppid:     0 flags:0x00000004
[   26.507719] Call trace:
[   26.510063]  dump_backtrace.part.6+0xdc/0xe8
[   26.514306]  show_stack+0x18/0x68
[   26.517574]  sched_show_task+0x154/0x198
[   26.521461]  rcu_sched_clock_irq+0xd10/0xdd8
[   26.525690]  update_process_times+0x9c/0xd0
[   26.529829]  tick_sched_handle.isra.19+0x34/0x58
[   26.534405]  tick_sched_timer+0x4c/0xa8
[   26.538192]  __hrtimer_run_queues+0x120/0x1b8
[   26.542510]  hrtimer_interrupt+0xd8/0x238
[   26.546474]  arch_timer_handler_phys+0x2c/0x50
[   26.550877]  handle_percpu_devid_irq+0x84/0x138
[   26.555363]  generic_handle_domain_irq+0x2c/0x48
[   26.559935]  gic_handle_irq+0xa4/0xc0
[   26.563549]  call_on_irq_stack+0x2c/0x58
[   26.567432]  do_interrupt_handler+0x80/0x88
[   26.571574]  el1_interrupt+0x38/0x70
[   26.575105]  el1h_64_irq_handler+0x18/0x28
[   26.582437]  el1h_64_irq+0x64/0x68
[   26.585712]  xas_descend+0x0/0x88
[   26.588947]  xas_find+0x190/0x1d8
[   26.592182]  find_lock_entries+0x7c/0x280
[   26.596121]  truncate_inode_pages_range+0xa8/0x490
[   26.600843]  truncate_inode_pages_final+0x58/0x78
[   26.605453]  evict+0x144/0x150
[   26.608471]  iput+0x128/0x190
[   26.611442]  do_unlinkat+0x17c/0x2a8
[   26.614817]  __arm64_sys_unlinkat+0x40/0x88
[   26.618928]  invoke_syscall+0x44/0x100
[   26.622599]  el0_svc_common.constprop.3+0x6c/0xf0
[   26.627237]  do_el0_svc+0x24/0x88
[   26.630473]  el0_svc+0x20/0x60
[   26.633447]  el0t_64_sync_handler+0x90/0xb8
[   26.637562]  el0t_64_sync+0x170/0x174
[   33.757714] VCC_CARD: disabling

I have turned on systemd debug and it does not print anything. The
message is the same every time and I rebuilt the filesystem multiple
times with both ext4 and btrfs roots across the devices.

I'm not sure if this is an extension or corner case of the origin
hr_timer problem. Any ideas?

Best,
Da
diff mbox series

Patch

diff --git a/drivers/mailbox/mailbox.c b/drivers/mailbox/mailbox.c
index 0b821a5b2db8..34f9ab01caef 100644
--- a/drivers/mailbox/mailbox.c
+++ b/drivers/mailbox/mailbox.c
@@ -82,9 +82,13 @@  static void msg_submit(struct mbox_chan *chan)
 exit:
 	spin_unlock_irqrestore(&chan->lock, flags);
 
-	if (!err && (chan->txdone_method & TXDONE_BY_POLL))
-		/* kick start the timer immediately to avoid delays */
+	if (!err && (chan->txdone_method & TXDONE_BY_POLL)) {
+		/* Disable the timer if already active ... */
+		hrtimer_cancel(&chan->mbox->poll_hrt);
+
+		/* ... and kick start it immediately to avoid delays */
 		hrtimer_start(&chan->mbox->poll_hrt, 0, HRTIMER_MODE_REL);
+	}
 }
 
 static void tx_tick(struct mbox_chan *chan, int r)