[v8,0/6] rcu: Add RCU stall diagnosis information

Message ID 20221119092508.1766-1-thunder.leizhen@huawei.com (mailing list archive)

Message

Leizhen (ThunderTown) Nov. 19, 2022, 9:25 a.m. UTC
v7 --> v8:
1. Call jiffies_to_msecs() instead of jiffies64_to_msecs().
2. Mention in the Kconfig help text that rcupdate.rcu_cpu_stall_cputime
   overrides the CONFIG_RCU_CPU_STALL_CPUTIME behaviour.
3. Fix a "make htmldocs" warning by changing "|...|" to ":...:".
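
The override described in item 2 can be pictured with a small user-space
sketch. This is only a model under stated assumptions: the macro
MODEL_STALL_CPUTIME_DEFAULT and the function effective_stall_cputime() are
hypothetical names, and the kernel handles the parameter through its own
parameter machinery rather than by scanning the command line by hand.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the build-time CONFIG_RCU_CPU_STALL_CPUTIME default. */
#ifndef MODEL_STALL_CPUTIME_DEFAULT
#define MODEL_STALL_CPUTIME_DEFAULT 0
#endif

/*
 * Return 1 if the extra stall information would be printed, else 0.
 * The build-time default applies unless the (modelled) command line
 * carries rcupdate.rcu_cpu_stall_cputime=<n>, which then wins.
 * Simplification: a plain substring search, not a real tokenizer.
 */
int effective_stall_cputime(const char *cmdline)
{
	static const char key[] = "rcupdate.rcu_cpu_stall_cputime=";
	const char *p = cmdline ? strstr(cmdline, key) : NULL;

	if (!p)
		return MODEL_STALL_CPUTIME_DEFAULT;
	return strtol(p + sizeof(key) - 1, NULL, 10) != 0;
}
```

With the default left at 0, a command line without the parameter keeps the
feature off, and one containing rcupdate.rcu_cpu_stall_cputime=1 turns it on.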

v6 --> v7:
1. Use kcpustat_field() to obtain the cputime.
2. Make the output start with "\t" to match other related prints.
3. Align the output of the last line of RCU stall.

v5 --> v6:
1. When there are multiple consecutive RCU stalls, correctly handle the
   values of the second and subsequent sampling periods. Update the comments
   and documentation.
   Thanks to Robert Elliott for testing.
2. Change "rcu stall" to "RCU stall".

v4 --> v5:
1. Resolve a git am conflict. No code change.

v3 --> v4:
1. Rename rcu_cpu_stall_deep_debug to rcu_cpu_stall_cputime.

v2 --> v3:
1. Fix the return type of kstat_cpu_irqs_sum()
2. Add Kconfig option CONFIG_RCU_CPU_STALL_DEEP_DEBUG and boot parameter
   rcupdate.rcu_cpu_stall_deep_debug.
3. Add comments and normalize local variable names.

v1 --> v2:
1. Fix a bug: if the RCU stall is detected by another CPU,
   kcpustat_this_cpu cannot be used.
@@ -451,7 +451,7 @@ static void print_cpu_stat_info(int cpu)
        if (r->gp_seq != rdp->gp_seq)
                return;

-       cpustat = kcpustat_this_cpu->cpustat;
+       cpustat = kcpustat_cpu(cpu).cpustat;
2. Move the start point of statistics from rcu_stall_kick_kthreads() to
   rcu_implicit_dynticks_qs(), removing the dependency on irq_work.
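
The bug fixed in item 1 above can be illustrated with a user-space model of
per-CPU counters. Everything here is a hypothetical stand-in (per_cpu_stat,
current_cpu, report_buggy(), report_fixed()); it only shows why a "this CPU"
accessor returns the wrong data when the stall is reported by a CPU other
than the stalled one.

```c
#define MODEL_NR_CPUS 4

/* Hypothetical stand-in for one CPU's accumulated cputime counters. */
struct model_cpu_stat {
	unsigned long long system_ns;
};

static struct model_cpu_stat per_cpu_stat[MODEL_NR_CPUS];
static int current_cpu;	/* models the CPU running the stall detector */

/* Buggy variant: always reads the detecting CPU's counters. */
unsigned long long report_buggy(int stalled_cpu)
{
	(void)stalled_cpu;	/* ignored, which is the bug */
	return per_cpu_stat[current_cpu].system_ns;
}

/* Fixed variant: reads the counters of the CPU that is actually stalled. */
unsigned long long report_fixed(int stalled_cpu)
{
	return per_cpu_stat[stalled_cpu].system_ns;
}
```

If CPU 0 detects a stall on CPU 2, report_buggy(2) returns CPU 0's counters,
while report_fixed(2) returns CPU 2's, which mirrors the switch from
kcpustat_this_cpu to kcpustat_cpu(cpu) in the hunk above.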

v1:
In some extreme cases, such as an I/O pressure test, CPU usage may reach
100% and cause an RCU stall. In this case, the printed information about
current is not useful. Displaying the number of hard interrupts, soft
interrupts, and context switches generated within half of the CPU stall
timeout, together with the CPU time they used, helps us make a general
judgment. In other cases, we can make a preliminary determination of
whether an infinite loop is running with local_irq, local_bh, or preempt
disabled.
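
As a rough illustration of the idea, the sketch below takes a snapshot of
three event counters and computes how many events arrived during the
sampling window. The struct and function names are hypothetical, and the
mapping from a zero count to a likely cause is an assumed first-guess
heuristic, not the logic of the patch itself.

```c
/* Hypothetical snapshot of the three event counters for one CPU. */
struct stall_counts {
	unsigned long hardirqs;
	unsigned long softirqs;
	unsigned long csw;	/* context switches */
};

/* Events seen since the snapshot; unsigned subtraction tolerates wraparound. */
struct stall_counts stall_counts_delta(struct stall_counts snap,
				       struct stall_counts now)
{
	struct stall_counts d = {
		now.hardirqs - snap.hardirqs,
		now.softirqs - snap.softirqs,
		now.csw - snap.csw,
	};
	return d;
}

/* Assumed heuristic: a counter that never moved hints at what was disabled. */
const char *stall_hint(struct stall_counts d)
{
	if (d.hardirqs == 0)
		return "no hardirqs: possibly looping with local_irq disabled";
	if (d.softirqs == 0)
		return "no softirqs: possibly looping with local_bh disabled";
	if (d.csw == 0)
		return "no context switches: possibly looping with preempt disabled";
	return "events still arriving: check the CPU time usage instead";
}
```

The checks are ordered from the most to the least restrictive condition,
because a CPU that takes no hard interrupts would normally show no soft
interrupts or context switches either.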

Zhen Lei (6):
  genirq: Fix the return type of kstat_cpu_irqs_sum()
  sched: Add helper kstat_cpu_softirqs_sum()
  sched: Add helper nr_context_switches_cpu()
  rcu: Add RCU stall diagnosis information
  doc: Document CONFIG_RCU_CPU_STALL_CPUTIME=y stall information
  rcu: Align the output of RCU stall

 Documentation/RCU/stallwarn.rst               | 88 +++++++++++++++++++
 .../admin-guide/kernel-parameters.txt         |  6 ++
 include/linux/kernel_stat.h                   | 14 ++-
 kernel/rcu/Kconfig.debug                      | 13 +++
 kernel/rcu/rcu.h                              |  1 +
 kernel/rcu/tree.c                             | 18 ++++
 kernel/rcu/tree.h                             | 19 ++++
 kernel/rcu/tree_stall.h                       | 35 +++++++-
 kernel/rcu/update.c                           |  2 +
 kernel/sched/core.c                           |  5 ++
 10 files changed, 198 insertions(+), 3 deletions(-)

Comments

Paul E. McKenney Nov. 21, 2022, 10:29 p.m. UTC | #1
Queued for further review and testing, thank you!

I did the usual wordsmithing, so please check to see if I messed
something up.

							Thanx, Paul

Leizhen (ThunderTown) Nov. 22, 2022, 2:14 a.m. UTC | #2
On 2022/11/22 6:29, Paul E. McKenney wrote:
> Queued for further review and testing, thank you!
> 
> I did the usual wordsmithing, so please check to see if I messed
> something up.

Thanks for your help.

Sorry, I think I missed a word in the commit message of patch 5/6.

This commit documents the additional RCU CPU stall warning output produced
- by kernels built with CONFIG_RCU_CPU_STALL_CPUTIME=y.
+ by kernels built with CONFIG_RCU_CPU_STALL_CPUTIME=y or booted with
+ rcupdate.rcu_cpu_stall_cputime=1.

Frederic Weisbecker Nov. 22, 2022, 11:59 a.m. UTC | #3
On Sat, Nov 19, 2022 at 05:25:02PM +0800, Zhen Lei wrote:
> Zhen Lei (6):
>   genirq: Fix the return type of kstat_cpu_irqs_sum()
>   sched: Add helper kstat_cpu_softirqs_sum()
>   sched: Add helper nr_context_switches_cpu()
>   rcu: Add RCU stall diagnosis information
>   doc: Document CONFIG_RCU_CPU_STALL_CPUTIME=y stall information
>   rcu: Align the output of RCU stall

Reviewed-by: Frederic Weisbecker <frederic@kernel.org>

Thanks!

Paul E. McKenney Nov. 23, 2022, 12:26 a.m. UTC | #4
On Tue, Nov 22, 2022 at 10:14:04AM +0800, Leizhen (ThunderTown) wrote:
> Thanks for your help.
> 
> Sorry, I think I missed a word in the commit message of patch 5/6.
> 
> This commit documents the additional RCU CPU stall warning output produced
> - by kernels built with CONFIG_RCU_CPU_STALL_CPUTIME=y.
> + by kernels built with CONFIG_RCU_CPU_STALL_CPUTIME=y or booted with
> + rcupdate.rcu_cpu_stall_cputime=1.

Good eyes, fixed, thank you!

							Thanx, Paul

Paul E. McKenney Nov. 23, 2022, 12:27 a.m. UTC | #5
On Tue, Nov 22, 2022 at 12:59:28PM +0100, Frederic Weisbecker wrote:
> Reviewed-by: Frederic Weisbecker <frederic@kernel.org>

Applied, thank you!

							Thanx, Paul
