[GIT,PULL] RCU changes for v6.9

Message ID: ZetHwrCb0KXE0xFI@tardis
State: Accepted
Commit: e5a3878c947ceef7b6ab68fdc093f3848059842c
Series: [GIT,PULL] RCU changes for v6.9

Pull-request

git://git.kernel.org/pub/scm/linux/kernel/git/boqun/linux.git tags/rcu.next.v6.9

Message

Boqun Feng March 8, 2024, 5:15 p.m. UTC
Hi Linus,

Please pull the following RCU changes for v6.9:

The following changes since commit 41bccc98fb7931d63d03f326a746ac4d429c1dd3:

  Linux 6.8-rc2 (2024-01-28 17:01:12 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/boqun/linux.git tags/rcu.next.v6.9

for you to fetch changes up to 3add00be5fe5810d7aa5ec3af8b6a245ef33144b:

  Merge branches 'rcu-doc.2024.02.14a', 'rcu-nocb.2024.02.14a', 'rcu-exp.2024.02.14a', 'rcu-tasks.2024.02.26a' and 'rcu-misc.2024.02.14a' into rcu.2024.02.26a (2024-02-26 17:37:25 -0800)


Two merge conflicts were detected by linux-next:

* https://lore.kernel.org/lkml/20240226135745.12ac854d@canb.auug.org.au/
* https://lore.kernel.org/lkml/20240227125522.2bdbe6be@canb.auug.org.au/

These conflict resolutions from linux-next look good to me, and I have
also made my own resolutions available on branch merge/rcu.2024.02.27a
for your reference.


Some highlights of the changes:

* Eliminates deadlocks involving do_exit() and RCU tasks, by Paul:
  Instead of SRCU read-side critical sections, a per-CPU list is now
  used in do_exit() for scanning yet-to-exit tasks.

* Fixes a deadlock due to the dependency between workqueue and RCU
  expedited grace periods, reported by Anna-Maria Behnsen and Thomas
  Gleixner and fixed by Frederic: RCU expedited now always uses its own
  kthread worker instead of a workqueue (see the sketch below).
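
For context, the shape of that fix is roughly as follows. This is a
minimal sketch, not the actual kernel/rcu/tree_exp.h code, and the
names are illustrative:

    #include <linux/err.h>
    #include <linux/kthread.h>

    /* A dedicated worker, so expedited grace periods no longer
     * depend on workqueue forward progress.
     */
    static struct kthread_worker *exp_gp_kworker;
    static struct kthread_work exp_gp_work;

    static void exp_gp_work_fn(struct kthread_work *work)
    {
            /* ... drive the expedited grace period machinery ... */
    }

    static int __init exp_gp_kworker_init(void)
    {
            exp_gp_kworker = kthread_create_worker(0, "rcu_exp_gp_kworker");
            if (IS_ERR(exp_gp_kworker))
                    return PTR_ERR(exp_gp_kworker);
            kthread_init_work(&exp_gp_work, exp_gp_work_fn);
            return 0;
    }

    static void exp_gp_kick(void)
    {
            /* Previously something like queue_work(wq, ...): */
            kthread_queue_work(exp_gp_kworker, &exp_gp_work);
    }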


Regards,
Boqun


----------------------------------------------------------------
RCU pull request for v6.9

This pull request contains the following branches:

rcu-doc.2024.02.14a: Documentation updates.

rcu-nocb.2024.02.14a: RCU NOCB updates, code cleanups, unnecessary
        barrier removals and minor bug fixes.

rcu-exp.2024.02.14a: RCU expedited updates, fixing a circular dependency
        between workqueue and RCU expedited callback handling.

rcu-tasks.2024.02.26a: RCU tasks updates, avoiding deadlocks in do_exit()
        when calling synchronize_rcu_tasks() while holding a mutex,
        maintaining real-time response in rcu_tasks_postscan(), and a
        minor fix for the tasks trace quiescence check.

rcu-misc.2024.02.14a: Miscellaneous updates: comment and readability
        improvements, a boot-time parameter for lazy RCU (see the
        example below) and rcutorture improvements.
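
        For example, to turn lazy RCU off at boot, assuming the
        parameter name added by this branch (see
        Documentation/admin-guide/kernel-parameters.txt):

            rcutree.enable_rcu_lazy=0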

----------------------------------------------------------------
Boqun Feng (1):
      Merge branches 'rcu-doc.2024.02.14a', 'rcu-nocb.2024.02.14a', 'rcu-exp.2024.02.14a', 'rcu-tasks.2024.02.26a' and 'rcu-misc.2024.02.14a' into rcu.2024.02.26a

Frederic Weisbecker (13):
      rcu/nocb: Remove needless LOAD-ACQUIRE
      rcu/nocb: Remove needless full barrier after callback advancing
      rcu/nocb: Make IRQs disablement symmetric
      rcu/nocb: Re-arrange call_rcu() NOCB specific code
      rcu/exp: Remove full barrier upon main thread wakeup
      rcu/exp: Fix RCU expedited parallel grace period kworker allocation failure recovery
      rcu/exp: Handle RCU expedited grace period kworker allocation failure
      rcu: s/boost_kthread_mutex/kthread_mutex
      rcu/exp: Move expedited kthread worker creation functions above rcutree_prepare_cpu()
      rcu/exp: Make parallel exp gp kworker per rcu node
      rcu/exp: Handle parallel exp gp kworkers affinity
      rcu/exp: Remove rcu_par_gp_wq
      rcu: Rename jiffies_till_flush to jiffies_lazy_flush

Joel Fernandes (Google) (1):
      srcu: Improve comments about acceleration leak

Onkarnath (1):
      rcu/sync: remove un-used rcu_sync_enter_start function

Paul E. McKenney (16):
      doc: Spinlocks are implied RCU readers
      doc: Make whatisRCU.rst note that spinlocks are RCU readers
      doc: Make checklist.rst note that spinlocks are implied RCU readers
      doc: Add CONFIG_RCU_STRICT_GRACE_PERIOD to checklist.rst
      doc: Add EARLY flag to early-parsed kernel boot parameters
      context_tracking: Fix kerneldoc headers for __ct_user_{enter,exit}()
      doc: Clarify use of slab constructors and SLAB_TYPESAFE_BY_RCU
      doc: Update checklist.rst discussion of callback execution
      rcutorture: Suppress rtort_pipe_count warnings until after stalls
      rcu-tasks: Repair RCU Tasks Trace quiescence check
      rcu-tasks: Add data to eliminate RCU-tasks/do_exit() deadlocks
      rcu-tasks: Initialize callback lists at rcu_init() time
      rcu-tasks: Initialize data to eliminate RCU-tasks/do_exit() deadlocks
      rcu-tasks: Maintain lists to eliminate RCU-tasks/do_exit() deadlocks
      rcu-tasks: Eliminate deadlocks involving do_exit() and RCU tasks
      rcu-tasks: Maintain real-time response in rcu_tasks_postscan()

Qais Yousef (1):
      rcu: Provide a boot time parameter to control lazy RCU

Zqiang (2):
      rcu/nocb: Fix WARN_ON_ONCE() in the rcu_nocb_bypass_lock()
      rcu/nocb: Check rdp_gp->nocb_timer in __call_rcu_nocb_wake()

 Documentation/RCU/checklist.rst                 |  32 +-
 Documentation/RCU/rcu_dereference.rst           |   5 +-
 Documentation/RCU/whatisRCU.rst                 |  19 +-
 Documentation/admin-guide/kernel-parameters.rst |   1 +
 Documentation/admin-guide/kernel-parameters.txt | 489 ++++++++++++------------
 include/linux/rcu_sync.h                        |   1 -
 include/linux/rcupdate.h                        |   4 +-
 include/linux/sched.h                           |   2 +
 init/init_task.c                                |   1 +
 kernel/context_tracking.c                       |   4 +
 kernel/fork.c                                   |   1 +
 kernel/rcu/Kconfig                              |  13 +
 kernel/rcu/rcu.h                                |  19 +-
 kernel/rcu/rcuscale.c                           |   6 +-
 kernel/rcu/rcutorture.c                         |  13 +-
 kernel/rcu/srcutree.c                           |  24 +-
 kernel/rcu/sync.c                               |  16 -
 kernel/rcu/tasks.h                              | 135 +++++--
 kernel/rcu/tiny.c                               |   1 +
 kernel/rcu/tree.c                               | 237 ++++++++----
 kernel/rcu/tree.h                               |  20 +-
 kernel/rcu/tree_exp.h                           |  83 +---
 kernel/rcu/tree_nocb.h                          |  69 ++--
 kernel/rcu/tree_plugin.h                        |  52 +--
 24 files changed, 687 insertions(+), 560 deletions(-)

Comments

pr-tracker-bot@kernel.org March 11, 2024, 7:43 p.m. UTC | #1
The pull request you sent on Fri, 8 Mar 2024 09:15:46 -0800:

> git://git.kernel.org/pub/scm/linux/kernel/git/boqun/linux.git tags/rcu.next.v6.9

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/e5a3878c947ceef7b6ab68fdc093f3848059842c

Thank you!
Florian Fainelli March 12, 2024, 8:32 p.m. UTC | #2
Hi Boqun,

On 3/8/24 09:15, Boqun Feng wrote:
> Hi Linus,
> 
> Please pull the following RCU changes for v6.9:
> 
> [...]

At least one device in my test farm (ARM 32-bit) has consistently shown
a very long boot, and some others are intermittently affected. On most
of my devices it looks like this:

[    2.450351] bcmgenet f0480000.ethernet: GENET 5.0 EPHY: 0x0000
[    2.547562] ata1: SATA link down (SStatus 0 SControl 300)
[  162.107264] unimac-mdio unimac-mdio.0: Broadcom UniMAC MDIO bus

This gets flagged by my boot script as a boot failure, since we exceeded
the 30-second timeout given to boot a kernel to a prompt.

It has been somewhat difficult to get a reliable bisection going, but
what I am sure of is that e5a3878c947ceef7b6ab68fdc093f3848059842c~1
does not expose the problem for 10 consecutive boots, while I *might*
see it at e5a3878c947ceef7b6ab68fdc093f3848059842c and beyond.

Any clues what is going on here?

Thanks!
Frederic Weisbecker March 12, 2024, 9:01 p.m. UTC | #3
On Tue, Mar 12, 2024 at 01:32:03PM -0700, Florian Fainelli wrote:
> Hi Boqun,
> 
> On 3/8/24 09:15, Boqun Feng wrote:
> > [...]
> 
> At least one device in my test farm (ARM 32-bit) has consistently shown a
> very long boot, and some others are intermittently affected. On most of
> my devices it looks like this:
> 
> [    2.450351] bcmgenet f0480000.ethernet: GENET 5.0 EPHY: 0x0000
> [    2.547562] ata1: SATA link down (SStatus 0 SControl 300)
> [  162.107264] unimac-mdio unimac-mdio.0: Broadcom UniMAC MDIO bus
> 
> This gets flagged by my boot script as a boot failure, since we exceeded
> the 30-second timeout given to boot a kernel to a prompt.
> 
> It has been somewhat difficult to get a reliable bisection going, but
> what I am sure of is that e5a3878c947ceef7b6ab68fdc093f3848059842c~1 does
> not expose the problem for 10 consecutive boots, while I *might* see it at
> e5a3878c947ceef7b6ab68fdc093f3848059842c and beyond.
> 
> Any clues what is going on here?

I understand it doesn't reliably reproduce. But is there any chance you could
test that alone?

     0bb11a372fc8 (rcu-tasks: Maintain real-time response in rcu_tasks_postscan())

And if it reproduces there, try a bisection?

git bisect bad 0bb11a372fc8d7006b4d0f42a2882939747bdbff
git bisect good 41bccc98fb7931d63d03f326a746ac4d429c1dd3

Thanks.
Boqun Feng March 12, 2024, 9:07 p.m. UTC | #4
On Tue, Mar 12, 2024 at 01:32:03PM -0700, Florian Fainelli wrote:
> Hi Boqun,
> 
> On 3/8/24 09:15, Boqun Feng wrote:
> > [...]
> 
> At least one device in my test farm (ARM 32-bit) has consistently shown a
> very long boot, and some others are intermittently affected.
> [...]
> Any clues what is going on here?
> 

Could you share the config file and your kernel parameters? Also could
you share the whole log? Thanks!

Regards,
Boqun

Paul E. McKenney March 12, 2024, 9:15 p.m. UTC | #5
On Tue, Mar 12, 2024 at 10:01:00PM +0100, Frederic Weisbecker wrote:
> On Tue, Mar 12, 2024 at 01:32:03PM -0700, Florian Fainelli wrote:
> > [...]
> 
> I understand it doesn't reliably reproduce. But is there any chance you could
> test that alone?
> 
>      0bb11a372fc8 (rcu-tasks: Maintain real-time response in rcu_tasks_postscan())
> 
> And if it reproduces there, try a bisection?
> 
> git bisect bad 0bb11a372fc8d7006b4d0f42a2882939747bdbff
> git bisect good 41bccc98fb7931d63d03f326a746ac4d429c1dd3

Good point!

Is this kernel built with CONFIG_PREEMPT_RT=y and CONFIG_TASKS_RCU=y?
Either way, is there also a large number of tasks exiting at this point
during boot?

							Thanx, Paul
Florian Fainelli March 12, 2024, 9:34 p.m. UTC | #6
On 3/12/24 14:07, Boqun Feng wrote:
> On Tue, Mar 12, 2024 at 01:32:03PM -0700, Florian Fainelli wrote:
>> [...]
> 
> Could you share the config file and your kernel parameters? Also could
> you share the whole log? Thanks!

Here is the configuration file:

https://gist.github.com/ffainelli/f3fd38752a186cee0cb8719089c0b6b8

and here is a log where this fails:

https://gist.github.com/ffainelli/ed08a2b3e853f59343786ebd20364fc8
Florian Fainelli March 12, 2024, 9:35 p.m. UTC | #7
On 3/12/24 14:15, Paul E. McKenney wrote:
> On Tue, Mar 12, 2024 at 10:01:00PM +0100, Frederic Weisbecker wrote:
>> On Tue, Mar 12, 2024 at 01:32:03PM -0700, Florian Fainelli wrote:
>>> [...]
>>
>> I understand it doesn't reliably reproduce. But is there any chance you could
>> test that alone?
>>
>>       0bb11a372fc8 (rcu-tasks: Maintain real-time response in rcu_tasks_postscan())
>>
>> And if it reproduces there, try a bisection?
>>
>> git bisect bad 0bb11a372fc8d7006b4d0f42a2882939747bdbff
>> git bisect good 41bccc98fb7931d63d03f326a746ac4d429c1dd3
> 
> Good point!
> 
> Is this kernel built with CONFIG_PREEMPT_RT=y and CONFIG_TASKS_RCU=y?
> Either way, is there also a large number of tasks exiting at this point
> during boot?

Neither option is enabled, and this is still during boot, so there are
no user-space tasks running yet. I cannot think of too many reasons why
this would fail, and there is no driver on the probe deferral list that
I can see.
Linus Torvalds March 12, 2024, 9:44 p.m. UTC | #8
On Tue, 12 Mar 2024 at 14:34, Florian Fainelli <f.fainelli@gmail.com> wrote:
>
> and here is a log where this fails:
>
> https://gist.github.com/ffainelli/ed08a2b3e853f59343786ebd20364fc8

You could try the 'initcall_debug' kernel command line.

It will make the above *much* noisier, but it might - thanks to all
the new noise - show exactly *what* is being crazy slow to initialize.

Because right now it's just radio silence in between those

  [    1.926435] bcmgenet f0480000.ethernet: GENET 5.0 EPHY: 0x0000
  [  162.148135] unimac-mdio unimac-mdio.0: Broadcom UniMAC MDIO bus

things, and that's presumably because some random initcall there just
takes forever to time out.

             Linus
Florian Fainelli March 12, 2024, 10:05 p.m. UTC | #9
On 3/12/24 14:01, Frederic Weisbecker wrote:
> On Tue, Mar 12, 2024 at 01:32:03PM -0700, Florian Fainelli wrote:
>> [...]
> 
> I understand it doesn't reliably reproduce. But is there any chance you could
> test that alone?
> 
>       0bb11a372fc8 (rcu-tasks: Maintain real-time response in rcu_tasks_postscan())
> 
> And if it reproduces there, try a bisection?
> 
> git bisect bad 0bb11a372fc8d7006b4d0f42a2882939747bdbff
> git bisect good 41bccc98fb7931d63d03f326a746ac4d429c1dd3

None of them failed, unfortunately. I need to narrow down the window
between v6.8 and 65d287c7eb1d14e0f4d56f19cec30d97fc7e8f66.

Thanks!
Boqun Feng March 12, 2024, 11:48 p.m. UTC | #10
On Tue, Mar 12, 2024 at 02:44:14PM -0700, Linus Torvalds wrote:
> On Tue, 12 Mar 2024 at 14:34, Florian Fainelli <f.fainelli@gmail.com> wrote:
> >
> > and here is a log where this fails:
> >
> > https://gist.github.com/ffainelli/ed08a2b3e853f59343786ebd20364fc8
> 
> You could try the 'initcall_debug' kernel command line.
> 

Right, that'll be helpful.

Besides, I took a look at the config Florian shared: no TASKS_RCU,
RCU_LAZY or RCU nocb is enabled. So probably the only remaining changes
in the PR are around RCU expedited. Florian, could you see if you can
build and test with CONFIG_PROVE_RCU=y (you need to select
CONFIG_PROVE_LOCKING for that)? That'll call synchronize_rcu() +
synchronize_rcu_expedited() before and after we switch
rcu_scheduler_active, and it may provide more information. Thanks!
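
A minimal fragment on top of your config would be (assuming PROVE_RCU
is still simply def_bool PROVE_LOCKING in kernel/rcu/Kconfig.debug):

    CONFIG_DEBUG_KERNEL=y
    CONFIG_PROVE_LOCKING=y
    CONFIG_PROVE_RCU=y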

Regards,
Boqun

Joel Fernandes March 13, 2024, 4:01 p.m. UTC | #11
Hello Florian,

On Tue, Mar 12, 2024 at 7:48 PM Boqun Feng <boqun.feng@gmail.com> wrote:
>
> On Tue, Mar 12, 2024 at 02:44:14PM -0700, Linus Torvalds wrote:
> > [...]
>
> Right, that'll be helpful.
>
> Besides, I took a look at the config Florian shared: no TASKS_RCU,
> RCU_LAZY or RCU nocb is enabled. So probably the only remaining changes
> in the PR are around RCU expedited. Florian, could you see if you can
> build and test with CONFIG_PROVE_RCU=y (you need to select
> CONFIG_PROVE_LOCKING for that)? That'll call synchronize_rcu() +
> synchronize_rcu_expedited() before and after we switch
> rcu_scheduler_active, and it may provide more information. Thanks!

Adding to everyone's suggestions, could you also try booting with
"rcupdate.rcu_normal=1" ? This will disable expedited RCU and help us
further confirm that it is indeed expedited RCU (and then we can look
into fixing that).
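
For background, synchronize_rcu_expedited() honors that switch early
on, roughly like this (paraphrased and heavily trimmed from
kernel/rcu/tree_exp.h):

    #include <linux/rcupdate.h>

    void synchronize_rcu_expedited(void)
    {
            /* rcupdate.rcu_normal=1 makes rcu_gp_is_normal() true,
             * so callers fall back to a normal grace period.
             */
            if (rcu_gp_is_normal()) {
                    wait_rcu_gp(call_rcu_hurry);
                    return;
            }
            /* ... otherwise run the expedited machinery ... */
    }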

Also, there are 2 additional users of expedited RCU in this release
that I noticed:

78c3253f27e5 ("net: use synchronize_rcu_expedited in cleanup_net()")
1ebb85f9c03d ("netfilter: conntrack: expedite rcu in
nf_conntrack_cleanup_net_list")

Could you also try reverting those patches and see if the issue goes
away?
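
For example, something like (a hypothetical invocation; swap the order
if the first revert does not apply cleanly):

    git revert 78c3253f27e5 1ebb85f9c03d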

thanks,

 - Joel

Florian Fainelli March 13, 2024, 9:30 p.m. UTC | #12
Adding Anna-Maria and Russell,

On 3/13/24 09:01, Joel Fernandes wrote:
> Hello Florian,
> 
> On Tue, Mar 12, 2024 at 7:48 PM Boqun Feng <boqun.feng@gmail.com> wrote:
>> [...]
> 
> Adding to everyone's suggestions, could you also try booting with
> "rcupdate.rcu_normal=1" ? This will disable expedited RCU and help us
> further confirm that it is indeed expedited RCU (and then we can look
> into fixing that).

Booting with "rcupdate.rcu_normal=1" did not make any difference here,
so this is looking less and less RCU-related; see below.

> 
> Also, there are 2 additional users of expedited RCU in this release
> that I noticed:
> 
> 78c3253f27e5 ("net: use synchronize_rcu_expedited in cleanup_net()")
> 1ebb85f9c03d ("netfilter: conntrack: expedite rcu in
> nf_conntrack_cleanup_net_list")
> 
> Could you also try reverting those patches and see if the issue goes
> away?

Sorry, I had been chasing another regression; with that one down, I
could finally get back to this one.

I will try to provide multiple answers for the sake of everyone having
the same context. Responding specifically to Linus' suggestion to use
"initcall_debug", this is what it gave me:

[    6.970669] ata1: SATA link down (SStatus 0 SControl 300)
[  166.136366] probe of unimac-mdio-0:01 returned 0 after 159216218 usecs
[  166.142931] unimac-mdio unimac-mdio.0: Broadcom UniMAC MDIO bus
[  166.148900] probe of unimac-mdio.0 returned 0 after 159243553 usecs
[  166.155820] probe of f0480000.ethernet returned 0 after 159258794 usecs
[  166.166427] ehci-brcm f0b00300.ehci_v2: EHCI Host Controller

I also got another occurrence when resuming from suspend to DRAM:

[   22.570667] brcmstb-dpfe 9932000.dpfe-cpu: PM: calling 
platform_pm_resume+0x0/0x54 @ 1574, parent: rdb
[  181.643809] brcmstb-dpfe 9932000.dpfe-cpu: PM: 
platform_pm_resume+0x0/0x54 returned 0 after 159073134 usecs

and also with the PCIe root complex driver:

[   18.266279] brcm-pcie f0460000.pcie: PM: calling 
brcm_pcie_resume_noirq+0x0/0x164 @ 1597, parent: platform
[  177.457219] brcm-pcie f0460000.pcie: clkreq-mode set to default
[  177.457225] brcm-pcie f0460000.pcie: link up, 2.5 GT/s PCIe x1 (!SSC)
[  177.457231] brcm-pcie f0460000.pcie: PM: 
brcm_pcie_resume_noirq+0x0/0x164 returned 0 after 159190939 usecs
[  177.457257] pcieport 0000:00:00.0: PM: calling 
pci_pm_resume_noirq+0x0/0x160 @ 33, parent: pci0000:00

Surprisingly, those drivers consistently reproduce the failures I am
seeing, so at least this gave me a clue as to where the problem is.

There were no changes to drivers/net/ethernet/broadcom/genet/, and the
two changes done to drivers/net/mdio/mdio-bcm-unimac.c are correct; in
particular the read_poll_timeout() conversion is correct and we
properly break out of the loop (see the sketch below). The initial
delay looked like a good culprit for a little while, but it is not used
on the affected platforms, because we instead provide a callback and
have an interrupt to signal the completion of an MDIO operation, so
unimac_mdio_poll() is not used at all. Finally,
drivers/memory/brcmstb_dpfe.c also received a single change, which is
not functional here (the .remove function changed to return void).
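
For reference, that conversion follows the usual read_poll_timeout()
pattern, roughly like this (a sketch with illustrative register/bit
names, not the driver's exact code):

    #include <linux/bits.h>
    #include <linux/io.h>
    #include <linux/iopoll.h>

    #define MDIO_CMD        0x00            /* illustrative offset */
    #define MDIO_START_BUSY BIT(29)         /* illustrative busy bit */

    static int unimac_mdio_wait(void __iomem *base)
    {
            u32 reg;

            /* Sleep ~1ms between reads, give up after 100ms. */
            return read_poll_timeout(readl, reg, !(reg & MDIO_START_BUSY),
                                     1000, 100000, false, base + MDIO_CMD);
    }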

I went back to a manual bisection and this time I believe that I have a 
more plausible candidate with:

7ee988770326fca440472200c3eb58935fe712f6 ("timers: Implement the 
hierarchical pull model")

7ee988770326fca440472200c3eb58935fe712f6~1 is stable, and 
7ee988770326fca440472200c3eb58935fe712f6 consistently reproduces this 
long boot delay on multiple platforms, so we finally have something here.

This seems to be very specific to the ARM 32-bit architecture, and 
booting the same system with an ARM64 kernel does not expose the 
problem, even though the timer hardware is the same (ARM v7/v8 
architected timer).

Let me know if there is any debugging that you want me to add.

Thanks to everyone for your help so far, this is much appreciated!
--
Florian
Russell King (Oracle) March 13, 2024, 9:59 p.m. UTC | #13
On Wed, Mar 13, 2024 at 02:30:43PM -0700, Florian Fainelli wrote:
> [...]
> I went back to a manual bisection and this time I believe that I have a more
> plausible candidate with:
> 
> 7ee988770326fca440472200c3eb58935fe712f6 ("timers: Implement the
> hierarchical pull model")

I haven't yet understood the code there, or how it interacts with
arch code, but one thing that immediately jumps out at me is this:

"    As long as a CPU is busy it expires both local and global timers. When a
    CPU goes idle it arms for the first expiring local timer."

So are local timers "armed" when they are enqueued while the CPU is
"busy" during initialisation, and will they expire, and will that
expiry be delivered in a timely manner?

If not, this commit is basically broken, and would be the cause of the
issue you are seeing. For the mdio case, we're talking about 2ms
polling. For the dpfe case, it looks like we're talking about 1ms
sleeps. I'm guessing that these end up being local timers.

Looking at pcie-brcmstb, there's a 100ms msleep(), and then a polling
for link up every 5ms - if the link was down and we msleep(5) I wonder
if that's triggering the same issue.

Why that would manifest itself on 32-bit but not 64-bit Arm, I can't
say. I would imagine that the same hardware timer driver is being used
(it may be worth checking the DT). The same should be true for the
interrupt driver as well. There have been no changes in that code.

The last straw I can attempt to grasp at is that maybe this has
something to do with an inappropriate data type being used - maybe
somewhere in the timer code that the blamed commit changes, a 32-bit
type is too small?
Florian Fainelli March 13, 2024, 10:04 p.m. UTC | #14
On 3/13/24 14:59, Russell King (Oracle) wrote:
> On Wed, Mar 13, 2024 at 02:30:43PM -0700, Florian Fainelli wrote:
>> [...]
> 
> [...]
> 
> Why that would manifest itself on 32-bit but not 64-bit Arm, I can't
> say. I would imagine that the same hardware timer driver is being used
> (it may be worth checking the DT). The same should be true for the
> interrupt driver as well. There have been no changes in that code.

I just had it happen with ARM64. I was plagued by:

https://lore.kernel.org/lkml/87wmqrjg8n.fsf@somnus/T/

and my earlier bisections somehow did not have ARM64 fail, so I thought
it was immune, but it fails with about the same failure rate as ARM 32-bit.

Russell King (Oracle) March 13, 2024, 10:49 p.m. UTC | #15
On Wed, Mar 13, 2024 at 03:04:26PM -0700, Florian Fainelli wrote:
> On 3/13/24 14:59, Russell King (Oracle) wrote:
> > [...]
> 
> I just had it happen with ARM64. I was plagued by:
> 
> https://lore.kernel.org/lkml/87wmqrjg8n.fsf@somnus/T/
> 
> and my earlier bisections somehow did not have ARM64 fail, so I thought it
> was immune, but it fails with about the same failure rate as ARM 32-bit.

Okay, so if that's two architectures failing, it's probably not an
architecture-specific issue. It could still be a timer/interrupt driver
issue, though. It would be nice to have a reproducer on x86, which
would confirm it.

Do you know how frequently the failure happens?
Frederic Weisbecker March 13, 2024, 10:52 p.m. UTC | #16
On Wed, Mar 13, 2024 at 03:04:26PM -0700, Florian Fainelli wrote:
> On 3/13/24 14:59, Russell King (Oracle) wrote:
> > [...]
> 
> I just had it happen with ARM64. I was plagued by:
> 
> https://lore.kernel.org/lkml/87wmqrjg8n.fsf@somnus/T/
> 
> and my earlier bisections somehow did not have ARM64 fail, so I thought it
> was immune, but it fails with about the same failure rate as ARM 32-bit.

Can you please boot with:

    trace_event=timer_migration,timer_start,timer_expire_entry,timer_cancel

and apply the following patch, then give us the resulting output from dmesg?

diff --git a/init/main.c b/init/main.c
index d002f30f7f24..f3d548919868 100644
--- a/init/main.c
+++ b/init/main.c
@@ -1233,13 +1233,24 @@ int __init_or_module do_one_initcall(initcall_t fn)
 	int count = preempt_count();
 	char msgbuf[64];
 	int ret;
+	long start;
 
 	if (initcall_blacklisted(fn))
 		return -EPERM;
 
 	do_trace_initcall_start(fn);
+	start = READ_ONCE(jiffies);
 	ret = fn();
 	do_trace_initcall_finish(fn, ret);
+	if (READ_ONCE(jiffies) - start > HZ * 20) {
+		static bool warned;
+
+		if (!warned) {
+			warned = 1;
+			ftrace_dump(DUMP_ALL);
+		}
+	}
+
 
 	msgbuf[0] = 0;
Florian Fainelli March 13, 2024, 11:29 p.m. UTC | #17
On 3/13/24 15:49, Russell King (Oracle) wrote:
> On Wed, Mar 13, 2024 at 03:04:26PM -0700, Florian Fainelli wrote:
>> On 3/13/24 14:59, Russell King (Oracle) wrote:
>>> On Wed, Mar 13, 2024 at 02:30:43PM -0700, Florian Fainelli wrote:
>>>> I will try to provide multiple answers for the sake of everyone having the
>>>> same context. Responding to Linus' specifically and his suggestion to use
>>>> "initcall_debug", this is what it gave me:
>>>>
>>>> [    6.970669] ata1: SATA link down (SStatus 0 SControl 300)
>>>> [  166.136366] probe of unimac-mdio-0:01 returned 0 after 159216218 usecs
>>>> [  166.142931] unimac-mdio unimac-mdio.0: Broadcom UniMAC MDIO bus
>>>> [  166.148900] probe of unimac-mdio.0 returned 0 after 159243553 usecs
>>>> [  166.155820] probe of f0480000.ethernet returned 0 after 159258794 usecs
>>>> [  166.166427] ehci-brcm f0b00300.ehci_v2: EHCI Host Controller
>>>>
>>>> Also got another occurrence happening resuming from suspend to DRAM with:
>>>>
>>>> [   22.570667] brcmstb-dpfe 9932000.dpfe-cpu: PM: calling
>>>> platform_pm_resume+0x0/0x54 @ 1574, parent: rdb
>>>> [  181.643809] brcmstb-dpfe 9932000.dpfe-cpu: PM:
>>>> platform_pm_resume+0x0/0x54 returned 0 after 159073134 usecs
>>>>
>>>> and also with the PCIe root complex driver:
>>>>
>>>> [   18.266279] brcm-pcie f0460000.pcie: PM: calling
>>>> brcm_pcie_resume_noirq+0x0/0x164 @ 1597, parent: platform
>>>> [  177.457219] brcm-pcie f0460000.pcie: clkreq-mode set to default
>>>> [  177.457225] brcm-pcie f0460000.pcie: link up, 2.5 GT/s PCIe x1 (!SSC)
>>>> [  177.457231] brcm-pcie f0460000.pcie: PM: brcm_pcie_resume_noirq+0x0/0x164
>>>> returned 0 after 159190939 usecs
>>>> [  177.457257] pcieport 0000:00:00.0: PM: calling
>>>> pci_pm_resume_noirq+0x0/0x160 @ 33, parent: pci0000:00
>>>>
>>>> Surprisingly, those drivers are consistently reproducing the failures I am
>>>> seeing, so at least this gave me a clue as to where the problem is.
>>>>
>>>> There were no changes to drivers/net/ethernet/broadcom/genet/, the two
>>>> changes done to drivers/net/mdio/mdio-bcm-unimac.c are correct; in
>>>> particular, the read_poll_timeout() conversion is correct and we properly
>>>> break out of the loop. The initial delay looked like a good culprit for a
>>>> little while, but it is not used on the affected platforms because instead
>>>> we provide a callback and we have an interrupt to signal the completion of
>>>> an MDIO operation, therefore unimac_mdio_poll() is not used at all.
>>>> Finally, drivers/memory/brcmstb_dpfe.c also received a single change which
>>>> is not functional here (the .remove function changed to return void).
>>>>
>>>> I went back to a manual bisection and this time I believe that I have a more
>>>> plausible candidate with:
>>>>
>>>> 7ee988770326fca440472200c3eb58935fe712f6 ("timers: Implement the
>>>> hierarchical pull model")
>>>
>>> I haven't understood the code there yet, and how it would interact with
>>> arch code, but one thing that immediately jumps out to me is this:
>>>
>>> "    As long as a CPU is busy it expires both local and global timers. When a
>>>       CPU goes idle it arms for the first expiring local timer."
>>>
>>> So are local timers "armed" when they are enqueued while the cpu is
>>> "busy" during initialisation, and will they expire, and will that
>>> expiry be delivered in a timely manner?
>>>
>>> If not, this commit is basically broken, and would be the cause of the
>>> issue you are seeing. For the mdio case, we're talking about 2ms
>>> polling. For the dpfe case, it looks like we're talking about 1ms
>>> sleeps. I'm guessing that these end up being local timers.
>>>
>>> Looking at pcie-brcmstb, there's a 100ms msleep(), and then a polling
>>> for link up every 5ms - if the link was down and we msleep(5) I wonder
>>> if that's triggering the same issue.
>>>
>>> Why that would manifest itself on 32-bit but not 64-bit Arm, I can't
>>> say. I would imagine that the same hardware timer driver is being used
>>> (may be worth checking DT.) The same should be true for the interrupt
>>> driver as well. There have been no changes in that code.
>>
>> I just had it happen with ARM64. I was plagued by:
>>
>> https://lore.kernel.org/lkml/87wmqrjg8n.fsf@somnus/T/
>>
>> and my earlier bisections somehow did not have ARM64 fail, so I thought it
>> was immune, but it fails with about the same failure rate as ARM 32-bit.
> 
> Okay, so if that's two architectures failing, it's probably not an
> architecture-specific issue. It could still be a timer/interrupt driver
> issue though. It would be nice to have a reproducer on x86 which would
> confirm it.
> 
> Do you know the frequency that the failure happens?

On this specific commit 7ee988770326fca440472200c3eb58935fe712f6, there 
is a 100% failure for at least 3 devices out of the 16 that are running 
the test.
Linus Torvalds March 14, 2024, 1:15 a.m. UTC | #18
On Wed, 13 Mar 2024 at 16:29, Florian Fainelli <f.fainelli@gmail.com> wrote:
>
> On this specific commit 7ee988770326fca440472200c3eb58935fe712f6, there
> is a 100% failure for at least 3 devices out of the 16 that are running
> the test.

Hmm.  I have no idea what is going on, and the unimac-mdio probe
function (one of the things that seem to take forever on your setup)
looks fairly simple.

There doesn't even seem to be any timers involved.

That said - one of the things it does is

  unimac_mdio_probe ->
    unimac_mdio_clk_set ->
      clk_prepare_enable

and maybe that's a pattern, because you report that
brcm_pcie_resume_noirq is another problem spot (on resume).

And guess what brcm_pcie_resume_noirq() does?

Yup. clk_prepare_enable().

So I'm wondering if there's some interaction with some clock driver?
That might explain why it shows up on some arm platforms but not
elsewhere.

I may be barking *entirely* up the wrong tree, though. I was just
looking at that unimac probe and going "there's absolutely _nothing_
timer-related here" and that clk thing looked like it might at least
have _some_ relevance.

            Linus
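
For context, the probe-time pattern being pointed at looks roughly like
this. A hedged sketch: example_probe() and the NULL clock id are made up
for illustration, while devm_clk_get_optional(), clk_prepare_enable()
and the error helpers are the real clk/driver API.

#include <linux/clk.h>
#include <linux/err.h>
#include <linux/platform_device.h>

static int example_probe(struct platform_device *pdev)
{
	/* On the affected boards the clock provider is clk-scmi, i.e. firmware */
	struct clk *clk = devm_clk_get_optional(&pdev->dev, NULL);

	if (IS_ERR(clk))
		return PTR_ERR(clk);

	/*
	 * clk_prepare_enable() may sleep in the prepare step. If the
	 * provider has to wait for a firmware reply, any sleep or
	 * timeout on that path depends on timers firing promptly,
	 * which is where a stalled timer would surface as a probe
	 * that takes minutes instead of milliseconds.
	 */
	return clk_prepare_enable(clk);
}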
Florian Fainelli March 14, 2024, 1:22 a.m. UTC | #19
On 3/13/2024 6:15 PM, Linus Torvalds wrote:
> [...]
> 
> So I'm wondering if there's some interaction with some clock driver?
> That might explain why it shows up on some arm platforms but not
> elsewhere.
> 
> I may be barking *entirely* up the wrong tree, though. I was just
> looking at that unimac probe and going "there's absolutely _nothing_
> timer-related here" and that clk thing looked like it might at least
> have _some_ relevance.

FWIW, we use the clk-scmi.c driver, and the implementation of the SCMI
platform/server resides in the ARM EL3 trusted firmware, which also has
not changed. Ultimately this results in an ARM SMC call made to the
firmware after having posted an SCMI message in a shared memory region.
None of that has changed or is new, but it does also require me to look
at drivers/firmware/arm_scmi/ for possible changes.

Thanks!
Florian Fainelli March 14, 2024, 3:44 a.m. UTC | #20
On 3/13/2024 3:52 PM, Frederic Weisbecker wrote:
> On Wed, Mar 13, 2024 at 03:04:26PM -0700, Florian Fainelli wrote:
>> [...]
> 
> Can you please boot with:
> 
>      trace_event=timer_migration,timer_start,timer_expire_entry,timer_cancel
> 
> And add the following and give us the resulting output in dmesg?

Here are two logs from two different systems that exposed the problem on 
boot:

https://gist.github.com/ffainelli/f0834c52ef6320c9216d879ca29a4b81
https://gist.github.com/ffainelli/dc838883edb925a77d8eb34c0fe95be0

thanks!
Boqun Feng March 14, 2024, 5:12 a.m. UTC | #21
On Wed, Mar 13, 2024 at 08:44:07PM -0700, Florian Fainelli wrote:
> 
> 
> On 3/13/2024 3:52 PM, Frederic Weisbecker wrote:
> > [...]
> > 
> > Can you please boot with:
> > 
> >      trace_event=timer_migration,timer_start,timer_expire_entry,timer_cancel
> > 
> > And add the following and give us the resulting output in dmesg?
> 
> Here are two logs from two different systems that exposed the problem on
> boot:
> 

I found a pattern here, maybe related.

> https://gist.github.com/ffainelli/f0834c52ef6320c9216d879ca29a4b81

	[  163.244130] kworker/-31        3D.... 44007us : timer_start: timer=b089b886 function=delayed_work_timer_fn expires=4294672340 [timeout=5000] bucket_expiry=4294672384 cpu=3 idx=192 flags=D|I

^^^ this timer was supposed to expire after around 5000 jiffies (i.e. 5
seconds, since HZ=1000), but it expired way late (160 seconds later).

	[  163.261034] kworker/-31        3d.... 44012us : timer_start: timer=394b309f function=delayed_work_timer_fn expires=4294787991 [timeout=120651] bucket_expiry=4294791168 cpu=3 idx=277 flags=I
	...
	[  221.630578]   <idle>-0         1..s.. 3287405us : timer_expire_entry: timer=7e541f87 function=process_timeout now=4294670584 baseclk=4294670584
	[  221.643475]   <idle>-0         0d.s.. 162361292us : timer_cancel: timer=95703ccd
	[  221.650896]   <idle>-0         0..s.. 162361292us : timer_expire_entry: timer=95703ccd function=process_timeout now=4294829657 baseclk=4294670587

baseclk and now have a huge delta.
	...
	[  222.394471]   <idle>-0         3dns.. 162499420us : timer_cancel: timer=b089b886
	[  222.401893] kworker/-47        1d.... 162499420us : timer_start: timer=11162242 function=delayed_work_timer_fn expires=4294830997 [timeout=1201] bucket_expiry=4294831040 cpu=1 idx=175 flags=D|P|I
	[  222.419317]   <idle>-0         3dns.. 162499421us : timer_expire_entry: timer=b089b886 function=delayed_work_timer_fn now=4294829796 baseclk=4294672384

and the same pattern for the following:

> https://gist.github.com/ffainelli/dc838883edb925a77d8eb34c0fe95be0

	[  162.855009] kworker/-31        3D.... 34849us : timer_start: timer=c48c9b47 function=delayed_work_timer_fn expires=4294672330 [timeout=5000] bucket_expiry=4294672384 cpu=3 idx=192 flags=D|I
	[  162.871903] kworker/-31        3d.... 34854us : timer_start: timer=c1ab7696 function=delayed_work_timer_fn expires=4294787991 [timeout=120661] bucket_expiry=4294791168 cpu=3 idx=277 flags=I
	...
	[  218.724285]   <idle>-0         2..s.. 2938198us : timer_expire_entry: timer=f3c19512 function=process_timeout now=4294670234 baseclk=4294670234
	[  218.737175]   <idle>-0         0d.s.. 162011087us : timer_cancel: timer=829432ce
	[  218.744589]   <idle>-0         0d.s.. 162011088us : timer_expire_entry: timer=829432ce function=delayed_work_timer_fn now=4294829306 baseclk=4294670400
	...
	[  219.171101]   <idle>-0         3dns.. 162025222us : timer_cancel: timer=c48c9b47
	[  219.178516]   <idle>-0         3dns.. 162025222us : timer_expire_entry: timer=c48c9b47 function=delayed_work_timer_fn now=4294829321 baseclk=4294672384

Regards,
Boqun

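As a sanity check on the delta above: with HZ=1000, now=4294829657 minus
baseclk=4294670587 is 159070 jiffies, i.e. roughly the 159 second stall
that the initcall_debug output reported. A minimal sketch of the
wrap-safe math (late_by_ms() is a made-up helper; time_after() and
jiffies_to_msecs() are the real kernel primitives):

#include <linux/jiffies.h>

/* How late did a timer fire, in milliseconds? Wrap-safe as long as
 * the real delta stays below ULONG_MAX/2 jiffies (sketch). */
static unsigned int late_by_ms(unsigned long now, unsigned long expiry)
{
	if (time_after(now, expiry))
		return jiffies_to_msecs(now - expiry);
	return 0;
}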
Boqun Feng March 14, 2024, 6:33 a.m. UTC | #22
On Wed, Mar 13, 2024 at 10:12:10PM -0700, Boqun Feng wrote:
> On Wed, Mar 13, 2024 at 08:44:07PM -0700, Florian Fainelli wrote:
> > 
> > 
> > On 3/13/2024 3:52 PM, Frederic Weisbecker wrote:
> > > [...]
> > 
> > Here are two logs from two different systems that exposed the problem on
> > boot:
> > 
> 
> I found a pattern here, maybe related.
> 
> > https://gist.github.com/ffainelli/f0834c52ef6320c9216d879ca29a4b81
> 
> 	[  163.244130] kworker/-31        3D.... 44007us : timer_start: timer=b089b886 function=delayed_work_timer_fn expires=4294672340 [timeout=5000] bucket_expiry=4294672384 cpu=3 idx=192 flags=D|I
> 
> ^^^ this timer was supposed to expire after around 5000 jiffies (i.e. 5
> seconds, since HZ=1000), but it expired way late (160 seconds later).
> 

Hmm... this is more noise, since it's a deferrable timer...

> 	[  163.261034] kworker/-31        3d.... 44012us : timer_start: timer=394b309f function=delayed_work_timer_fn expires=4294787991 [timeout=120651] bucket_expiry=4294791168 cpu=3 idx=277 flags=I
> 	...
> 	[  221.630578]   <idle>-0         1..s.. 3287405us : timer_expire_entry: timer=7e541f87 function=process_timeout now=4294670584 baseclk=4294670584
> 	[  221.643475]   <idle>-0         0d.s.. 162361292us : timer_cancel: timer=95703ccd
> 	[  221.650896]   <idle>-0         0..s.. 162361292us : timer_expire_entry: timer=95703ccd function=process_timeout now=4294829657 baseclk=4294670587
> 

but here:

	[  221.555265] kworker/-44        0d.... 3279414us : timer_start: timer=95703ccd function=process_timeout expires=4294670586 [timeout=10] bucket_expiry=4294670587 cpu=0 idx =59 flags=

this is a normal timer.

	[  221.571298] rcu_sche-15        3d.... 3279417us : timer_start: timer=7e541f87 function=process_timeout expires=4294670579 [timeout=3] bucket_expiry=4294670580 cpu=3 idx= 52 flags=
	[  221.587241]   <idle>-0         1d.s.. 3283405us : timer_cancel: timer=7e541f87
	[  221.594488]   <idle>-0         1..s.. 3283407us : timer_expire_entry: timer=7e541f87 function=process_timeout now=4294670580 baseclk=4294670580
	[  221.607388] rcu_sche-15        3d.... 3283416us : timer_start: timer=7e541f87 function=process_timeout expires=4294670583 [timeout=3] bucket_expiry=4294670584 cpu=3 idx= 56 flags=
	[  221.623331]   <idle>-0         1d.s.. 3287404us : timer_cancel: timer=7e541f87
	[  221.630578]   <idle>-0         1..s.. 3287405us : timer_expire_entry: timer=7e541f87 function=process_timeout now=4294670584 baseclk=4294670584
	[  221.643475]   <idle>-0         0d.s.. 162361292us : timer_cancel: timer=95703ccd
	[  221.650896]   <idle>-0         0..s.. 162361292us : timer_expire_entry: timer=95703ccd function=process_timeout now=4294829657 baseclk=4294670587

which got fired here.

	[  221.663967]   <idle>-0         0dns.. 162361296us : timer_cancel: timer=d03eaa1d
	[  221.671388]   <idle>-0         0.ns.. 162361297us : timer_expire_entry: timer=d03eaa1d function=process_timeout now=4294829657 baseclk=4294670856

And it looks to me like CPU 0 is the tick_do_timer_cpu CPU, since it's the
first one that got timers after a long wait and was doing a few catch-ups.
Now the problem is why CPU 0 didn't program its hardware timer to expire
at 4294670587, i.e. the earliest timer.

Regards,
Boqun

[...]
Thomas Gleixner March 14, 2024, 9:03 a.m. UTC | #23
On Wed, Mar 13 2024 at 21:59, Russell King (Oracle) wrote:
> On Wed, Mar 13, 2024 at 02:30:43PM -0700, Florian Fainelli wrote:
> I haven't understood the code there yet, and how it would interact with
> arch code, but one thing that immediately jumps out to me is this:
>
> "    As long as a CPU is busy it expires both local and global timers. When a
>     CPU goes idle it arms for the first expiring local timer."
>
> So are local timers "armed" when they are enqueued while the cpu is
> "busy" during initialisation, and will they expire, and will that
> expiry be delivered in a timely manner?

The local timers are timers which are pinned on a CPU. Global ones do
not care about the CPU they expire on.

So if the CPU goes idle then it arms for the first local timer. If the
first global timer expires after the first local, nothing to see. If it
expires before, then that timer is queued in the migration hierarchy and
some other busy CPU will take care of it. If the last CPU goes idle then
it has to arm for the first timer (its own local, global and the
hierarchy). So it's all covered in theory, but there seems to be an
issue somewhere. We'll figure it out.

Thanks,

        tglx
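
A conceptual sketch of the arming rule described above; every name here
is hypothetical and not the actual timer migration API (only min3()
from <linux/minmax.h> is real):

#include <linux/minmax.h>

/*
 * Hypothetical illustration: a CPU that is not the last one going
 * idle only arms its hardware timer for the first local timer; a
 * remaining busy CPU (the "migrator") expires global timers. The
 * last CPU to go idle has nobody to hand off to, so it must cover
 * its own local and global timers plus the migration hierarchy.
 */
static unsigned long first_wakeup(bool last_cpu_going_idle,
				  unsigned long first_local,
				  unsigned long first_global,
				  unsigned long first_hierarchy)
{
	if (!last_cpu_going_idle)
		return first_local;

	return min3(first_local, first_global, first_hierarchy);
}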
Thomas Gleixner March 14, 2024, 9:11 a.m. UTC | #24
On Wed, Mar 13 2024 at 20:44, Florian Fainelli wrote:
> On 3/13/2024 3:52 PM, Frederic Weisbecker wrote:
>> Can you please boot with:
>> 
>>      trace_event=timer_migration,timer_start,timer_expire_entry,timer_cancel
>> 
>> And add the following and give us the resulting output in dmesg?
>
> Here are two logs from two different systems that exposed the problem on 
> boot:
>
> https://gist.github.com/ffainelli/f0834c52ef6320c9216d879ca29a4b81
> https://gist.github.com/ffainelli/dc838883edb925a77d8eb34c0fe95be0

[    0.003042] Failed to enable trace event: timer_migration

The event option should be:

      trace_event=timer_migration:*,timer_start,timer_expire_entry,timer_cancel

Thanks,

        tglx
Thomas Gleixner March 14, 2024, 9:32 a.m. UTC | #25
On Wed, Mar 13 2024 at 23:33, Boqun Feng wrote:
> but here:
>
> 	[  221.555265] kworker/-44        0d.... 3279414us : timer_start: timer=95703ccd function=process_timeout expires=4294670586 [timeout=10] bucket_expiry=4294670587 cpu=0 idx=59 flags=
>
> this is a normal timer.
>
> 	[  221.571298] rcu_sche-15        3d.... 3279417us : timer_start: timer=7e541f87 function=process_timeout expires=4294670579 [timeout=3] bucket_expiry=4294670580 cpu=3 idx=52 flags=
> 	[  221.587241]   <idle>-0         1d.s.. 3283405us : timer_cancel: timer=7e541f87
> 	[  221.594488]   <idle>-0         1..s.. 3283407us : timer_expire_entry: timer=7e541f87 function=process_timeout now=4294670580 baseclk=4294670580
> 	[  221.607388] rcu_sche-15        3d.... 3283416us : timer_start: timer=7e541f87 function=process_timeout expires=4294670583 [timeout=3] bucket_expiry=4294670584 cpu=3 idx=56 flags=
> 	[  221.623331]   <idle>-0         1d.s.. 3287404us : timer_cancel: timer=7e541f87
> 	[  221.630578]   <idle>-0         1..s.. 3287405us : timer_expire_entry: timer=7e541f87 function=process_timeout now=4294670584 baseclk=4294670584
> 	[  221.643475]   <idle>-0         0d.s.. 162361292us : timer_cancel: timer=95703ccd
> 	[  221.650896]   <idle>-0         0..s.. 162361292us : timer_expire_entry: timer=95703ccd function=process_timeout now=4294829657 baseclk=4294670587
>
> which got fired here.
>
> 	[  221.663967]   <idle>-0         0dns.. 162361296us : timer_cancel: timer=d03eaa1d
> 	[  221.671388]   <idle>-0         0.ns.. 162361297us : timer_expire_entry: timer=d03eaa1d function=process_timeout now=4294829657 baseclk=4294670856
>
> And looks to me CPU 0 is the tick_do_timer_cpu CPU, since it's the first
> one that got timers after a long wait and was doing a few catch-ups. Now
> the problem is why CPU 0 didn't program its hardware timer to expire at
> 4294670587? I.e. the earliest timer.

That's not the problem. It's not clear whether CPU0 was the last one
going idle. It doesn't look like it:

> 	[  221.623331]   <idle>-0         1d.s.. 3287404us : timer_cancel: timer=7e541f87
> 	[  221.630578]   <idle>-0         1..s.. 3287405us : timer_expire_entry: timer=7e541f87 function=process_timeout now=4294670584 baseclk=4294670584

The timer in question is supposed to wake up 2 ticks later, so CPU1
should have armed it. But without the migration trace points enabled,
this is all guesswork.

Thanks,

        tglx
Frederic Weisbecker March 14, 2024, 10:41 a.m. UTC | #26
On Wed, Mar 13, 2024 at 08:44:07PM -0700, Florian Fainelli wrote:
> 
> 
> On 3/13/2024 3:52 PM, Frederic Weisbecker wrote:
> > [...]
> > 
> > Can you please boot with:
> > 
> >      trace_event=timer_migration,timer_start,timer_expire_entry,timer_cancel
> > 
> > And add the following and give us the resulting output in dmesg?
> 
> Here are two logs from two different systems that exposed the problem on
> boot:
> 
> https://gist.github.com/ffainelli/f0834c52ef6320c9216d879ca29a4b81
> https://gist.github.com/ffainelli/dc838883edb925a77d8eb34c0fe95be0

Thanks! Unfortunately like Thomas pointed out, I'm missing the timer_migration
events. My fault, can you retry with this syntax?

	trace_event=timer_migration:*,timer_start,timer_expire_entry,timer_cancel

Though it's fairly possible that timer migration is not enabled at this point
as it's a late initcall. But we better not miss its traces otherwise.

Florian Fainelli March 14, 2024, 6:35 p.m. UTC | #27
On 3/14/24 03:41, Frederic Weisbecker wrote:
> On Wed, Mar 13, 2024 at 08:44:07PM -0700, Florian Fainelli wrote:
>> [...]
> 
> Thanks! Unfortunately like Thomas pointed out, I'm missing the timer_migration
> events. My fault, can you retry with this syntax?
> 
> 	trace_event=timer_migration:*,timer_start,timer_expire_entry,timer_cancel
> 
> Though it's fairly possible that timer migration is not enabled at this point
> as it's a late initcall. But we better not miss its traces otherwise.

Here is another log with timer_migration:

https://gist.github.com/ffainelli/237a5f9928850d6d8900d1f36da45aee
Boqun Feng March 14, 2024, 6:51 p.m. UTC | #28
On Thu, Mar 14, 2024 at 11:35:23AM -0700, Florian Fainelli wrote:
> On 3/14/24 03:41, Frederic Weisbecker wrote:
> > [...]
> > 
> > Thanks! Unfortunately like Thomas pointed out, I'm missing the timer_migration
> > events. My fault, can you retry with this syntax?
> > 
> > 	trace_event=timer_migration:*,timer_start,timer_expire_entry,timer_cancel
> > 
> > Though it's fairly possible that timer migration is not enabled at this point
> > as it's a late initcall. But we better not miss its traces otherwise.
> 
> Here is another log with timer_migration:
> 
> https://gist.github.com/ffainelli/237a5f9928850d6d8900d1f36da45aee

FWIW, the trace point is still not enabled:

[    0.000000] Failed to enable trace event: timer_migration:*

you need this commit in master:

	36e40df35d2c "timer_migration: Add tracepoints"

, which is one commit later than 7ee988770326 AFAICT

Regards,
Boqun

Florian Fainelli March 14, 2024, 7:09 p.m. UTC | #29
On 3/14/24 11:51, Boqun Feng wrote:
> On Thu, Mar 14, 2024 at 11:35:23AM -0700, Florian Fainelli wrote:
>> [...]
>>
>> Here is another log with timer_migration:
>>
>> https://gist.github.com/ffainelli/237a5f9928850d6d8900d1f36da45aee
> 
> FWIW, the trace point is still not enabled:
> 
> [    0.000000] Failed to enable trace event: timer_migration:*
> 
> you need this commit in master:
> 
> 	36e40df35d2c "timer_migration: Add tracepoints"
> 
> , which is one commit later than 7ee988770326 AFAICT

Argh, thanks Boqun, here is a new capture:

https://gist.github.com/ffainelli/cb562c1a60ef8e0e69e7d42143c48e8f

this one does include the tmigr events. Thanks!
Thomas Gleixner March 14, 2024, 8:45 p.m. UTC | #30
On Thu, Mar 14 2024 at 12:09, Florian Fainelli wrote:
> On 3/14/24 11:51, Boqun Feng wrote:
>>>> 	trace_event=timer_migration:*,timer_start,timer_expire_entry,timer_cancel
>>>>
>>>> Though it's quite possible that timer migration is not enabled at this point
>>>> as it's a late initcall. But we'd better not miss its traces otherwise.
>>>
>>> Here is another log with timer_migration:
>>>
>>> https://gist.github.com/ffainelli/237a5f9928850d6d8900d1f36da45aee
>> 
>> FWIW, the tracepoint is still not enabled:
>> 
>> [    0.000000] Failed to enable trace event: timer_migration:*
>> 
>> You need this commit in master:
>> 
>> 	36e40df35d2c "timer_migration: Add tracepoints"
>> 
>> , which is one commit later than 7ee988770326 AFAICT
>
> Argh, thanks Boqun, here is a new capture:
>
> https://gist.github.com/ffainelli/cb562c1a60ef8e0e69e7d42143c48e8f
>
> this one does include the tmigr events. Thanks!

You need 8ca1836769d758e4fbf5851bb81e181c52193f5d too.
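
To double-check which events actually got enabled, you can read the
enabled set back after boot (assuming tracefs is mounted in the usual
place):

	# cat /sys/kernel/tracing/set_event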
Thomas Gleixner March 14, 2024, 9:21 p.m. UTC | #31
On Thu, Mar 14 2024 at 21:45, Thomas Gleixner wrote:
> On Thu, Mar 14 2024 at 12:09, Florian Fainelli wrote:
>> https://gist.github.com/ffainelli/cb562c1a60ef8e0e69e7d42143c48e8f
>>
>> this one does include the tmigr events. Thanks!
>
> You need 8ca1836769d758e4fbf5851bb81e181c52193f5d too.

So from the above trace it's clear where it goes south:

[  236.318158]   <idle>-0         3..s.. 2928466us : tmigr_handle_remote: group=aecb05cb lvl=0
[  236.326526]   <idle>-0         3d.s.. 2928467us : tmigr_handle_remote_cpu: cpu=0 parent=aecb05cb wakeup=9223372036854775807
[  236.357809]   <idle>-0         3d.s.. 2928469us : tmigr_update_events: child=00000000 group=aecb05cb group_lvl=0 child_active=0 group_active=8 nextevt=3103000000 next_expiry=2934000000 child_evt_expiry=0 child_evtcpu=0

[  236.377222]   <idle>-0         0dn... 2928471us : tmigr_cpu_active: cpu=0 parent=aecb05cb wakeup=9223372036854775807
[  236.387765]   <idle>-0         0dn... 2928471us : tmigr_group_set_cpu_active: group=aecb05cb lvl=0 numa=0 active=9 migrator=8 parent=00000000 childmask=1

[  236.401526]   <idle>-0         0d.... 2928477us : tmigr_update_events: child=00000000 group=aecb05cb group_lvl=0 child_active=0 group_active=8 nextevt=3103000000 next_expiry=2934000000 child_evt_expiry=0 child_evtcpu=0
[  236.420940]   <idle>-0         0d.... 2928478us : tmigr_group_set_cpu_inactive: group=aecb05cb lvl=0 numa=0 active=8 migrator=8 parent=00000000 childmask=1
[  236.434874]   <idle>-0         0d.... 2928478us : tmigr_cpu_idle: cpu=0 parent=aecb05cb nextevt=3103000000 wakeup=9223372036854775807

[  236.446896]   <idle>-0         3d.... 2929469us : tmigr_group_set_cpu_inactive: group=aecb05cb lvl=0 numa=0 active=0 migrator=ff parent=00000000 childmask=8
[  236.460916]   <idle>-0         3d.... 2929470us : tmigr_cpu_idle: cpu=3 parent=aecb05cb nextevt=9223372036854775807 wakeup=9223372036854775807
[  236.473721]   <idle>-0         3d.... 2934471us : tmigr_cpu_new_timer_idle: cpu=3 parent=aecb05cb nextevt=9223372036854775807 wakeup=9223372036854775807

CPU3 is the last active CPU and goes idle. So it should take care of the
pending events, but it does not.

This is the next trace entry where CPU0 magically comes back to life.

[  236.487393]   <idle>-0         0d.s.. 162001359us : timer_cancel: timer=8c725d84

8ca1836769d758e4fbf5851bb81e181c52193f5d is related, but does not fully
explain the failure. I haven't yet spotted where this goes into lala land.
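
Just to spell out the invariant I expect here, a single-level toy model
(made-up names, nothing like the real tmigr hierarchy):

	#include <stdio.h>
	#include <stdint.h>

	#define KTIME_MAX INT64_MAX

	struct toy_group {
		unsigned int active_mask;	/* one bit per active CPU */
		int64_t next_global;		/* earliest queued global event */
	};

	/* CPU @cpu goes idle; returns the wakeup it has to arm for */
	static int64_t toy_cpu_idle(struct toy_group *g, int cpu,
				    int64_t next_local)
	{
		g->active_mask &= ~(1U << cpu);

		/* Someone else is still up and acts as migrator */
		if (g->active_mask)
			return next_local;

		/* Last CPU down: it must cover the group's global event too */
		return g->next_global < next_local ? g->next_global : next_local;
	}

	int main(void)
	{
		/* CPUs 0 and 3 active (mask 0x9), global event pending */
		struct toy_group g = { .active_mask = 0x9,
				       .next_global = 3103000000LL };

		/* CPU0 idles first: CPU3 is still up, local only is fine */
		printf("cpu0: %lld\n", (long long)toy_cpu_idle(&g, 0, KTIME_MAX));

		/* CPU3 idles last: it must arm for 3103000000, but the trace
		   shows nextevt=9223372036854775807 (KTIME_MAX) instead */
		printf("cpu3: %lld\n", (long long)toy_cpu_idle(&g, 3, KTIME_MAX));
		return 0;
	}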
Florian Fainelli March 14, 2024, 9:53 p.m. UTC | #32
On 3/14/24 14:21, Thomas Gleixner wrote:
> On Thu, Mar 14 2024 at 21:45, Thomas Gleixner wrote:
>> On Thu, Mar 14 2024 at 12:09, Florian Fainelli wrote:
>>> https://gist.github.com/ffainelli/cb562c1a60ef8e0e69e7d42143c48e8f
>>>
>>> this one does include the tmigr events. Thanks!
>>
>> You need 8ca1836769d758e4fbf5851bb81e181c52193f5d too.
> 
> So from the above trace it's clear where it goes south:
> 
> [  236.318158]   <idle>-0         3..s.. 2928466us : tmigr_handle_remote: group=aecb05cb lvl=0
> [  236.326526]   <idle>-0         3d.s.. 2928467us : tmigr_handle_remote_cpu: cpu=0 parent=aecb05cb wakeup=9223372036854775807
> [  236.357809]   <idle>-0         3d.s.. 2928469us : tmigr_update_events: child=00000000 group=aecb05cb group_lvl=0 child_active=0 group_active=8 nextevt=3103000000 next_expiry=2934000000 child_evt_expiry=0 child_evtcpu=0
> 
> [  236.377222]   <idle>-0         0dn... 2928471us : tmigr_cpu_active: cpu=0 parent=aecb05cb wakeup=9223372036854775807
> [  236.387765]   <idle>-0         0dn... 2928471us : tmigr_group_set_cpu_active: group=aecb05cb lvl=0 numa=0 active=9 migrator=8 parent=00000000 childmask=1
> 
> [  236.401526]   <idle>-0         0d.... 2928477us : tmigr_update_events: child=00000000 group=aecb05cb group_lvl=0 child_active=0 group_active=8 nextevt=3103000000 next_expiry=2934000000 child_evt_expiry=0 child_evtcpu=0
> [  236.420940]   <idle>-0         0d.... 2928478us : tmigr_group_set_cpu_inactive: group=aecb05cb lvl=0 numa=0 active=8 migrator=8 parent=00000000 childmask=1
> [  236.434874]   <idle>-0         0d.... 2928478us : tmigr_cpu_idle: cpu=0 parent=aecb05cb nextevt=3103000000 wakeup=9223372036854775807
> 
> [  236.446896]   <idle>-0         3d.... 2929469us : tmigr_group_set_cpu_inactive: group=aecb05cb lvl=0 numa=0 active=0 migrator=ff parent=00000000 childmask=8
> [  236.460916]   <idle>-0         3d.... 2929470us : tmigr_cpu_idle: cpu=3 parent=aecb05cb nextevt=9223372036854775807 wakeup=9223372036854775807
> [  236.473721]   <idle>-0         3d.... 2934471us : tmigr_cpu_new_timer_idle: cpu=3 parent=aecb05cb nextevt=9223372036854775807 wakeup=9223372036854775807
> 
> CPU3 is the last active CPU and goes idle. So it should take care of the
> pending events, but it does not.
> 
> This is the next trace entry where CPU0 magically comes back to life.
> 
> [  236.487393]   <idle>-0         0d.s.. 162001359us : timer_cancel: timer=8c725d84
> 
> 8ca1836769d758e4fbf5851bb81e181c52193f5d is related, but does not fully
> explain the failure. I haven't yet spotted where this goes into lala land.

It was a lot harder to generate the same issue on cold boot against
8ca1836769d758e4fbf5851bb81e181c52193f5d; however, just like against
36e40df35d2c1891fe58241640c7c95de4aa739b, it would happen when resuming
from suspend to DRAM, whereby the CPU core(s) lost their power and had
to be re-initialized. Eventually I got a cold boot log:

https://gist.github.com/ffainelli/b5684585c78518a5492cbbf1c7dce16e

Does the consistent ~159s mean anything?
Thomas Gleixner March 14, 2024, 9:58 p.m. UTC | #33
On Thu, Mar 14 2024 at 22:21, Thomas Gleixner wrote:

> On Thu, Mar 14 2024 at 21:45, Thomas Gleixner wrote:
>> On Thu, Mar 14 2024 at 12:09, Florian Fainelli wrote:
>>> https://gist.github.com/ffainelli/cb562c1a60ef8e0e69e7d42143c48e8f
>>>
>>> this one does include the tmigr events. Thanks!
>>
>> You need 8ca1836769d758e4fbf5851bb81e181c52193f5d too.
>
> So from the above trace it's clear where it goes south:
>
> [  236.318158]   <idle>-0         3..s.. 2928466us : tmigr_handle_remote: group=aecb05cb lvl=0
> [  236.326526]   <idle>-0         3d.s.. 2928467us : tmigr_handle_remote_cpu: cpu=0 parent=aecb05cb wakeup=9223372036854775807
> [  236.357809]   <idle>-0         3d.s.. 2928469us : tmigr_update_events: child=00000000 group=aecb05cb group_lvl=0 child_active=0 group_active=8 nextevt=3103000000 next_expiry=2934000000 child_evt_expiry=0 child_evtcpu=0
>
> [  236.377222]   <idle>-0         0dn... 2928471us : tmigr_cpu_active: cpu=0 parent=aecb05cb wakeup=9223372036854775807
> [  236.387765]   <idle>-0         0dn... 2928471us : tmigr_group_set_cpu_active: group=aecb05cb lvl=0 numa=0 active=9 migrator=8 parent=00000000 childmask=1
>
> [  236.401526]   <idle>-0         0d.... 2928477us : tmigr_update_events: child=00000000 group=aecb05cb group_lvl=0 child_active=0 group_active=8 nextevt=3103000000 next_expiry=2934000000 child_evt_expiry=0 child_evtcpu=0
> [  236.420940]   <idle>-0         0d.... 2928478us : tmigr_group_set_cpu_inactive: group=aecb05cb lvl=0 numa=0 active=8 migrator=8 parent=00000000 childmask=1
> [  236.434874]   <idle>-0         0d.... 2928478us : tmigr_cpu_idle: cpu=0 parent=aecb05cb nextevt=3103000000 wakeup=9223372036854775807
>
> [  236.446896]   <idle>-0         3d.... 2929469us : tmigr_group_set_cpu_inactive: group=aecb05cb lvl=0 numa=0 active=0 migrator=ff parent=00000000 childmask=8
> [  236.460916]   <idle>-0         3d.... 2929470us : tmigr_cpu_idle: cpu=3 parent=aecb05cb nextevt=9223372036854775807 wakeup=9223372036854775807
> [  236.473721]   <idle>-0         3d.... 2934471us : tmigr_cpu_new_timer_idle: cpu=3 parent=aecb05cb nextevt=9223372036854775807 wakeup=9223372036854775807
>
> CPU3 is the last active CPU and goes idle. So it should take care of the
> pending events, but it does not.
>
> This is the next trace entry where CPU0 magically comes back to life.
>
> [  236.487393]   <idle>-0         0d.s.. 162001359us : timer_cancel: timer=8c725d84
>
> 8ca1836769d758e4fbf5851bb81e181c52193f5d is related, but does not fully
> explain the failure. I haven't yet spotted where this goes into lala land.

The first expiring pending global timer is this one:

[  236.112089] kworker/-1198      1d.... 2923481us : timer_start: timer=ffd09a61 function=process_timeout expires=4294670230 [timeout=10] bucket_expiry=4294670231 cpu=1 idx=23 flags=

Bucket expiry is the relevant number: 4294670231 (jiffies)

The last timer expiry happened about 1ms before the failure:

[  236.344917]   <idle>-0         3..s.. 2928468us : timer_expire_entry: timer=2f09af03 function=process_timeout now=4294670225 baseclk=4294670225

Jiffies at that point: 4294670225

AFAICT this is a HZ=1000 configuration.

4294670231 - 4294670225 = 6 jiffies, i.e. 6ms at HZ=1000

So the relevant wakeup should be around 2928468us + 6000us = 2934468us

which is damned close to the point where the last active CPU vanishes into idle:

> [  236.446896]   <idle>-0         3d.... 2929469us : tmigr_group_set_cpu_inactive: group=aecb05cb lvl=0 numa=0 active=0 migrator=ff parent=00000000 childmask=8
> [  236.460916]   <idle>-0         3d.... 2929470us : tmigr_cpu_idle: cpu=3 parent=aecb05cb nextevt=9223372036854775807 wakeup=9223372036854775807
> [  236.473721]   <idle>-0         3d.... 2934471us : tmigr_cpu_new_timer_idle: cpu=3 parent=aecb05cb nextevt=9223372036854775807 wakeup=9223372036854775807

Unless someone beats me to it, I'm going to continue staring at that
code tomorrow as my brain has started to melt by now.

Thanks,

        tglx
Boqun Feng March 14, 2024, 10:05 p.m. UTC | #34
On Thu, Mar 14, 2024 at 10:21:21PM +0100, Thomas Gleixner wrote:
> On Thu, Mar 14 2024 at 21:45, Thomas Gleixner wrote:
> > On Thu, Mar 14 2024 at 12:09, Florian Fainelli wrote:
> >> https://gist.github.com/ffainelli/cb562c1a60ef8e0e69e7d42143c48e8f
> >>
> >> this one does include the tmigr events. Thanks!
> >
> > You need 8ca1836769d758e4fbf5851bb81e181c52193f5d too.
> 
> So from the above trace it's clear where it goes south:
> 
> [  236.318158]   <idle>-0         3..s.. 2928466us : tmigr_handle_remote: group=aecb05cb lvl=0
> [  236.326526]   <idle>-0         3d.s.. 2928467us : tmigr_handle_remote_cpu: cpu=0 parent=aecb05cb wakeup=9223372036854775807
> [  236.357809]   <idle>-0         3d.s.. 2928469us : tmigr_update_events: child=00000000 group=aecb05cb group_lvl=0 child_active=0 group_active=8 nextevt=3103000000 next_expiry=2934000000 child_evt_expiry=0 child_evtcpu=0
> 
> [  236.377222]   <idle>-0         0dn... 2928471us : tmigr_cpu_active: cpu=0 parent=aecb05cb wakeup=9223372036854775807
> [  236.387765]   <idle>-0         0dn... 2928471us : tmigr_group_set_cpu_active: group=aecb05cb lvl=0 numa=0 active=9 migrator=8 parent=00000000 childmask=1
> 
> [  236.401526]   <idle>-0         0d.... 2928477us : tmigr_update_events: child=00000000 group=aecb05cb group_lvl=0 child_active=0 group_active=8 nextevt=3103000000 next_expiry=2934000000 child_evt_expiry=0 child_evtcpu=0
> [  236.420940]   <idle>-0         0d.... 2928478us : tmigr_group_set_cpu_inactive: group=aecb05cb lvl=0 numa=0 active=8 migrator=8 parent=00000000 childmask=1
> [  236.434874]   <idle>-0         0d.... 2928478us : tmigr_cpu_idle: cpu=0 parent=aecb05cb nextevt=3103000000 wakeup=9223372036854775807
> 
> [  236.446896]   <idle>-0         3d.... 2929469us : tmigr_group_set_cpu_inactive: group=aecb05cb lvl=0 numa=0 active=0 migrator=ff parent=00000000 childmask=8
> [  236.460916]   <idle>-0         3d.... 2929470us : tmigr_cpu_idle: cpu=3 parent=aecb05cb nextevt=9223372036854775807 wakeup=9223372036854775807
> [  236.473721]   <idle>-0         3d.... 2934471us : tmigr_cpu_new_timer_idle: cpu=3 parent=aecb05cb nextevt=9223372036854775807 wakeup=9223372036854775807
> 
> CPU3 is the last active CPU and goes idle. So it should take care of the
> pending events, but it does not.
> 

I notice CPU3 didn't have its own non-deferrable timer queued (local or
global), so could the following happen?

	timer_base_try_to_set_idle():
	  __get_next_timer_interrupt():
	    fetch_next_timer_interrupt():
	      // nextevt_local == nextevt_global == basej + NEXT_TIMER_MAX_DELTA
	      // tevt->local == tevt->global == KTIME_MAX
	    timer_use_tmigr():
	      tmigr_cpu_deactivate():
	        __tmigr_cpu_deactivate():
		  // tmc->cpuevt.ignore untouched still == true
		  walk_groups(&tmigr_inactive_up, ...):
		    tmigr_inactive_up():
		      data->remote = true;
		      tmigr_update_events():
		        if (child) { // child is NULL
			  ...
			} else {
			  first_childevt = evt = data->evt;

			  if (evt->ignore && !remote)
			    return true; // no remote tick is picked.
			  ...
			}
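
If so, the net effect on the last CPU would be something like this toy
version of the early return (made-up names, obviously not the real
code):

	#include <stdio.h>
	#include <stdint.h>
	#include <stdbool.h>

	#define KTIME_MAX INT64_MAX

	struct toy_evt { bool ignore; int64_t expires; };

	/* the group's earliest queued event, here CPU0's timer */
	static int64_t group_next = 3103000000LL;

	static int64_t toy_deactivate(struct toy_evt *own, bool remote)
	{
		/* suspected early return: the group's pending event never
		   makes it into the departing migrator's wakeup */
		if (own->ignore && !remote)
			return KTIME_MAX;

		return group_next < own->expires ? group_next : own->expires;
	}

	int main(void)
	{
		/* CPU3 has nothing queued, so cpuevt.ignore stays true */
		struct toy_evt cpu3 = { .ignore = true, .expires = KTIME_MAX };

		/* prints 9223372036854775807, matching the nextevt in the
		   tmigr_cpu_idle trace entry for cpu=3 */
		printf("cpu3 wakeup: %lld\n",
		       (long long)toy_deactivate(&cpu3, false));
		return 0;
	}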

Regards,
Boqun

> This is the next trace entry where CPU0 magically comes back to life.
> 
> [  236.487393]   <idle>-0         0d.s.. 162001359us : timer_cancel: timer=8c725d84
> 
> 8ca1836769d758e4fbf5851bb81e181c52193f5d is related, but does not fully
> explain the failure. I haven't yet spotted where this goes into lala land.
Boqun Feng March 14, 2024, 10:10 p.m. UTC | #35
On Thu, Mar 14, 2024 at 03:05:53PM -0700, Boqun Feng wrote:
> On Thu, Mar 14, 2024 at 10:21:21PM +0100, Thomas Gleixner wrote:
> > On Thu, Mar 14 2024 at 21:45, Thomas Gleixner wrote:
> > > On Thu, Mar 14 2024 at 12:09, Florian Fainelli wrote:
> > >> https://gist.github.com/ffainelli/cb562c1a60ef8e0e69e7d42143c48e8f
> > >>
> > >> this one does include the tmigr events. Thanks!
> > >
> > > You need 8ca1836769d758e4fbf5851bb81e181c52193f5d too.
> > 
> > So from the above trace it's clear where it goes south:
> > 
> > [  236.318158]   <idle>-0         3..s.. 2928466us : tmigr_handle_remote: group=aecb05cb lvl=0
> > [  236.326526]   <idle>-0         3d.s.. 2928467us : tmigr_handle_remote_cpu: cpu=0 parent=aecb05cb wakeup=9223372036854775807
> > [  236.357809]   <idle>-0         3d.s.. 2928469us : tmigr_update_events: child=00000000 group=aecb05cb group_lvl=0 child_active=0 group_active=8 nextevt=3103000000 next_expiry=2934000000 child_evt_expiry=0 child_evtcpu=0
> > 
> > [  236.377222]   <idle>-0         0dn... 2928471us : tmigr_cpu_active: cpu=0 parent=aecb05cb wakeup=9223372036854775807
> > [  236.387765]   <idle>-0         0dn... 2928471us : tmigr_group_set_cpu_active: group=aecb05cb lvl=0 numa=0 active=9 migrator=8 parent=00000000 childmask=1
> > 
> > [  236.401526]   <idle>-0         0d.... 2928477us : tmigr_update_events: child=00000000 group=aecb05cb group_lvl=0 child_active=0 group_active=8 nextevt=3103000000 next_expiry=2934000000 child_evt_expiry=0 child_evtcpu=0
> > [  236.420940]   <idle>-0         0d.... 2928478us : tmigr_group_set_cpu_inactive: group=aecb05cb lvl=0 numa=0 active=8 migrator=8 parent=00000000 childmask=1
> > [  236.434874]   <idle>-0         0d.... 2928478us : tmigr_cpu_idle: cpu=0 parent=aecb05cb nextevt=3103000000 wakeup=9223372036854775807
> > 
> > [  236.446896]   <idle>-0         3d.... 2929469us : tmigr_group_set_cpu_inactive: group=aecb05cb lvl=0 numa=0 active=0 migrator=ff parent=00000000 childmask=8
> > [  236.460916]   <idle>-0         3d.... 2929470us : tmigr_cpu_idle: cpu=3 parent=aecb05cb nextevt=9223372036854775807 wakeup=9223372036854775807
> > [  236.473721]   <idle>-0         3d.... 2934471us : tmigr_cpu_new_timer_idle: cpu=3 parent=aecb05cb nextevt=9223372036854775807 wakeup=9223372036854775807
> > 
> > CPU3 is the last active CPU and goes idle. So it should take care of the
> > pending events, but it does not.
> > 
> 
> I notice CPU3 didn't have its own non-deferrable timer queued (local or
> global), so could the following happen?
> 
> 	timer_base_try_to_set_idle():
> 	  __get_next_timer_interrupt():
> 	    fetch_next_timer_interrupt():
> 	      // nextevt_local == nextevt_global == basej + NEXT_TIMER_MAX_DELTA
> 	      // tevt->local == tevt->global == KTIME_MAX
> 	    timer_use_tmigr():
> 	      tmigr_cpu_deactivate():
> 	        __tmigr_cpu_deactivate():
> 		  // tmc->cpuevt.ignore untouched still == true
> 		  walk_groups(&tmigr_inactive_up, ...):
> 		    tmigr_inactive_up():
> 		      data->remote = true;

this line is

		data->remote = false;

in the real code, i.e. the evt->ignore && !remote early return above is
actually taken.

> 		      tmigr_update_events():
> 		        if (child) { // child is NULL
> 			  ...
> 			} else {
> 			  first_childevt = evt = data->evt;
> 
> 			  if (evt->ignore && !remote)
> 			    return true; // no remote tick is picked.
> 			  ...
> 			}
> 
> Regards,
> Boqun
> 
> > This is the next trace entry where CPU0 magically comes back to life.
> > 
> > [  236.487393]   <idle>-0         0d.s.. 162001359us : timer_cancel: timer=8c725d84
> > 
> > 8ca1836769d758e4fbf5851bb81e181c52193f5d is related, but does not fully
> > explain the failure. I haven't yet spotted where this goes into lala land.
Thomas Gleixner March 14, 2024, 10:51 p.m. UTC | #36
On Thu, Mar 14 2024 at 14:53, Florian Fainelli wrote:
> On 3/14/24 14:21, Thomas Gleixner wrote:
>> 8ca1836769d758e4fbf5851bb81e181c52193f5d is related, but does not fully
>> explain the failure. I haven't yet spotted where this goes into lala land.
>
> It was a lot harder to generate the same issue on cold boot against
> 8ca1836769d758e4fbf5851bb81e181c52193f5d;

That's good, as it points in exactly the right direction as far as I can
tell from the data we have, but I might be wrong in the end.

> however, just like against 36e40df35d2c1891fe58241640c7c95de4aa739b,
> it would happen when resuming from suspend to DRAM, whereby the CPU
> core(s) lost their power and had to be re-initialized. Eventually I got
> a cold boot log:
>
> https://gist.github.com/ffainelli/b5684585c78518a5492cbbf1c7dce16e

The picture is similar to the one before:

<idle>-0         2d.... 3016627us : tmigr_update_events: child=00000000 group=6b74d49d group_lvl=0 child_active=0 group_active=1 nextevt=3023000000 next_expiry=3023000000 child_evt_expiry=0 child_evtcpu=0
<idle>-0         2d.... 3016628us : tmigr_group_set_cpu_inactive: group=6b74d49d lvl=0 numa=0 active=1 migrator=1 parent=00000000 childmask=4
<idle>-0         2d.... 3016629us : tmigr_cpu_idle: cpu=2 parent=6b74d49d nextevt=3023000000 wakeup=9223372036854775807

<idle>-0         0d.... 3016684us : tmigr_group_set_cpu_inactive: group=6b74d49d lvl=0 numa=0 active=0 migrator=ff parent=00000000 childmask=1
<idle>-0         0d.... 3016685us : tmigr_cpu_idle: cpu=0 parent=6b74d49d nextevt=9223372036854775807 wakeup=9223372036854775807
<idle>-0         0d.... 3024623us : tmigr_cpu_new_timer_idle: cpu=0 parent=6b74d49d nextevt=9223372036854775807 wakeup=9223372036854775807

<idle>-0         1d.s.. 162080532us : timer_cancel: timer=2e281df7

Just a different CPU this time.

The first expiring timer:

kcompact-42        1d.... 2552891us : timer_start: timer=2e281df7 function=process_timeout expires=4294670348 [timeout=500] bucket_expiry=4294670352 cpu=1 idx=66 flags=

Last expiry before going south:

   <idle>-0         1..s.. 3006620us : timer_expire_entry: timer=6f47b280 function=process_timeout now=4294670302 baseclk=4294670302

4294670352 - 4294670302 = 50

3006620us + 50*1000us = 3056620us

So the previous observation of hitting exactly the point where the last
CPU goes idle does not hold up.

What really bothers me is:

<idle>-0         2d.... 3016629us : tmigr_cpu_idle: cpu=2 parent=6b74d49d nextevt=3023000000 wakeup=9223372036854775807

which still has a pending event (nextevt=3023000000, i.e. 3023000us) that
falls between these two events:

<idle>-0         0d.... 3016685us : tmigr_cpu_idle: cpu=0 parent=6b74d49d nextevt=9223372036854775807 wakeup=9223372036854775807
<idle>-0         0d.... 3024623us : tmigr_cpu_new_timer_idle: cpu=0 parent=6b74d49d nextevt=9223372036854775807 wakeup=9223372036854775807

But that event is before the next expiring timer. Confused, but Frederic
just told me on IRC he's on to something.

> Does the consistent ~159s mean anything?

I don't think so. It might be a limitation of the clockevent device,
the maximum sleep time restriction, or some other unrelated event
(device interrupt, hrtimer) which happens to pop up after this time for
some reason.
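
(For instance, a 32bit clockevent counting at 27MHz, if that's what this
SoC uses, wraps after 2^32 / 27000000 ~= 159s.)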

But it's definitely not relevant to the problem itself. It's just the
thing which brings the machine back to life. Otherwise it might sit
there forever.

Thanks,

        tglx