Message ID | 20221111130709.247-6-thunder.leizhen@huawei.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | rcu: Add RCU stall diagnosis information | expand |
On Fri, Nov 11, 2022 at 09:07:08PM +0800, Zhen Lei wrote: > +1. A CPU looping with interrupts disabled.:: > + > + rcu: hardirqs softirqs csw/system > + rcu: number: 0 0 0 > +65;6003;1c rcu: cputime: 0 0 0 ==> 2500(ms) > + > + Because interrupts have been disabled throughout the measurement > + interval, there are no interrupts and no context switches. > + Furthermore, because CPU time consumption was measured using interrupt > + handlers, the system CPU consumption is misleadingly measured as zero. > + This scenario will normally also have "(0 ticks this GP)" printed on > + this CPU's summary line. > + > +2. A CPU looping with bottom halves disabled. > + > + This is similar to the previous example, but with non-zero number of > + and CPU time consumed by hard interrupts, along with non-zero CPU > + time consumed by in-kernel execution.:: > + > + rcu: hardirqs softirqs csw/system > + rcu: number: 624 0 0 > + rcu: cputime: 49 0 2446 ==> 2500(ms) > + > + The fact that there are zero softirqs gives a hint that these were > + disabled, perhaps via local_bh_disable(). It is of course possible > + that there were no softirqs, perhaps because all events that would > + result in softirq execution are confined to other CPUs. In this case, > + the diagnosis should continue as shown in the next example. > + > +3. A CPU looping with preemption disabled. > + > + Here, only the number of context switches is zero.:: > + > + rcu: hardirqs softirqs csw/system > + rcu: number: 624 45 0 > + rcu: cputime: 69 1 2425 ==> 2500(ms) > + > + This situation hints that the stalled CPU was looping with preemption > + disabled. > + > +4. No looping, but massive hard and soft interrupts.:: > + > + rcu: hardirqs softirqs csw/system > + rcu: number: xx xx 0 > + rcu: cputime: xx xx 0 ==> 2500(ms) > + > + Here, the number and CPU time of hard interrupts are all non-zero, > + but the number of context switches and the in-kernel CPU time consumed > + are zero. The number and cputime of soft interrupts will usually be > + non-zero, but could be zero, for example, if the CPU was spinning > + within a single hard interrupt handler. > + > + If this type of RCU CPU stall warning can be reproduced, you can > + narrow it down by looking at /proc/interrupts or by writing code to > + trace each interrupt, for example, by referring to show_interrupts(). One last question I have. Usually all these informations can be deduced by just looking at the stacktrace that comes along an RCU stall report. So on which kind of situation the stacktrace is not enough? Thanks.
On 2022/11/17 6:55, Frederic Weisbecker wrote: > On Fri, Nov 11, 2022 at 09:07:08PM +0800, Zhen Lei wrote: >> +1. A CPU looping with interrupts disabled.:: >> + >> + rcu: hardirqs softirqs csw/system >> + rcu: number: 0 0 0 >> +65;6003;1c rcu: cputime: 0 0 0 ==> 2500(ms) >> + >> + Because interrupts have been disabled throughout the measurement >> + interval, there are no interrupts and no context switches. >> + Furthermore, because CPU time consumption was measured using interrupt >> + handlers, the system CPU consumption is misleadingly measured as zero. >> + This scenario will normally also have "(0 ticks this GP)" printed on >> + this CPU's summary line. >> + >> +2. A CPU looping with bottom halves disabled. >> + >> + This is similar to the previous example, but with non-zero number of >> + and CPU time consumed by hard interrupts, along with non-zero CPU >> + time consumed by in-kernel execution.:: >> + >> + rcu: hardirqs softirqs csw/system >> + rcu: number: 624 0 0 >> + rcu: cputime: 49 0 2446 ==> 2500(ms) >> + >> + The fact that there are zero softirqs gives a hint that these were >> + disabled, perhaps via local_bh_disable(). It is of course possible >> + that there were no softirqs, perhaps because all events that would >> + result in softirq execution are confined to other CPUs. In this case, >> + the diagnosis should continue as shown in the next example. >> + >> +3. A CPU looping with preemption disabled. >> + >> + Here, only the number of context switches is zero.:: >> + >> + rcu: hardirqs softirqs csw/system >> + rcu: number: 624 45 0 >> + rcu: cputime: 69 1 2425 ==> 2500(ms) >> + >> + This situation hints that the stalled CPU was looping with preemption >> + disabled. >> + >> +4. No looping, but massive hard and soft interrupts.:: >> + >> + rcu: hardirqs softirqs csw/system >> + rcu: number: xx xx 0 >> + rcu: cputime: xx xx 0 ==> 2500(ms) >> + >> + Here, the number and CPU time of hard interrupts are all non-zero, >> + but the number of context switches and the in-kernel CPU time consumed >> + are zero. The number and cputime of soft interrupts will usually be >> + non-zero, but could be zero, for example, if the CPU was spinning >> + within a single hard interrupt handler. >> + >> + If this type of RCU CPU stall warning can be reproduced, you can >> + narrow it down by looking at /proc/interrupts or by writing code to >> + trace each interrupt, for example, by referring to show_interrupts(). > > One last question I have. Usually all these informations can be deduced by > just looking at the stacktrace that comes along an RCU stall report. So on > which kind of situation the stacktrace is not enough? Interrupt storm. > > Thanks. > . >
On Thu, Nov 17, 2022 at 10:03:17AM +0800, Leizhen (ThunderTown) wrote: > On 2022/11/17 6:55, Frederic Weisbecker wrote: > > On Fri, Nov 11, 2022 at 09:07:08PM +0800, Zhen Lei wrote: > > One last question I have. Usually all these informations can be deduced by > > just looking at the stacktrace that comes along an RCU stall report. So on > > which kind of situation the stacktrace is not enough? > > Interrupt storm. Now that makes sense :) Thanks. > > > > > Thanks. > > . > > > > -- > Regards, > Zhen Lei
diff --git a/Documentation/RCU/stallwarn.rst b/Documentation/RCU/stallwarn.rst index dfa4db8c0931eaf..5e24e849290a286 100644 --- a/Documentation/RCU/stallwarn.rst +++ b/Documentation/RCU/stallwarn.rst @@ -390,3 +390,91 @@ for example, "P3421". It is entirely possible to see stall warnings from normal and from expedited grace periods at about the same time during the same run. + +RCU_CPU_STALL_CPUTIME +===================== + +In kernels built with CONFIG_RCU_CPU_STALL_CPUTIME=y or booted with +rcupdate.rcu_cpu_stall_cputime=1, the following additional information +is supplied with each RCU CPU stall warning:: + +rcu: hardirqs softirqs csw/system +rcu: number: 624 45 0 +rcu: cputime: 69 1 2425 ==> 2500(ms) + +These statistics are collected during the sampling period. The values +in row "number:" are the number of hard interrupts, number of soft +interrupts, and number of context switches on the stalled CPU. The +first three values in row "cputime:" indicate the CPU time in +milliseconds consumed by hard interrupts, soft interrupts, and tasks +on the stalled CPU. The last number is the measurement interval, again +in milliseconds. Because user-mode tasks normally do not cause RCU CPU +stalls, these tasks are typically kernel tasks, which is why only the +system CPU time are considered. + +The sampling period is shown as follows: +|<------------first timeout---------->|<-----second timeout----->| +|<--half timeout-->|<--half timeout-->| | +| |<--first period-->| | +| |<-----------second sampling period---------->| +| | | | +| sampling time point 1st-stall 2nd-stall + + +The following describes four typical scenarios: + +1. A CPU looping with interrupts disabled.:: + + rcu: hardirqs softirqs csw/system + rcu: number: 0 0 0 + rcu: cputime: 0 0 0 ==> 2500(ms) + + Because interrupts have been disabled throughout the measurement + interval, there are no interrupts and no context switches. + Furthermore, because CPU time consumption was measured using interrupt + handlers, the system CPU consumption is misleadingly measured as zero. + This scenario will normally also have "(0 ticks this GP)" printed on + this CPU's summary line. + +2. A CPU looping with bottom halves disabled. + + This is similar to the previous example, but with non-zero number of + and CPU time consumed by hard interrupts, along with non-zero CPU + time consumed by in-kernel execution.:: + + rcu: hardirqs softirqs csw/system + rcu: number: 624 0 0 + rcu: cputime: 49 0 2446 ==> 2500(ms) + + The fact that there are zero softirqs gives a hint that these were + disabled, perhaps via local_bh_disable(). It is of course possible + that there were no softirqs, perhaps because all events that would + result in softirq execution are confined to other CPUs. In this case, + the diagnosis should continue as shown in the next example. + +3. A CPU looping with preemption disabled. + + Here, only the number of context switches is zero.:: + + rcu: hardirqs softirqs csw/system + rcu: number: 624 45 0 + rcu: cputime: 69 1 2425 ==> 2500(ms) + + This situation hints that the stalled CPU was looping with preemption + disabled. + +4. No looping, but massive hard and soft interrupts.:: + + rcu: hardirqs softirqs csw/system + rcu: number: xx xx 0 + rcu: cputime: xx xx 0 ==> 2500(ms) + + Here, the number and CPU time of hard interrupts are all non-zero, + but the number of context switches and the in-kernel CPU time consumed + are zero. The number and cputime of soft interrupts will usually be + non-zero, but could be zero, for example, if the CPU was spinning + within a single hard interrupt handler. + + If this type of RCU CPU stall warning can be reproduced, you can + narrow it down by looking at /proc/interrupts or by writing code to + trace each interrupt, for example, by referring to show_interrupts().