Message ID | 1553865808-30604-1-git-send-email-yuzenghui@huawei.com (mailing list archive) |
---|---|
State | RFC |
Headers | show |
Series | [RFC] irqchip/gic-v3: Make gic_handle_irq() notrace | expand |
Hi Zenghui, On 29/03/2019 13:23, Zenghui Yu wrote: > Enable pseudo NMI together with function_graph tracer, will lead > the system to a hang. This is easy to reproduce, > > 1) Set "irqchip.gicv3_pseudo_nmi=1" on the kernel command line > 2) echo function_graph > /sys/kernel/debug/tracing/current_tracer > > This patch (RFC) set gic_handle_irq() as notrace and it seems works > fine now. But I have no idea about what the issue is exactly, and > you can regard this patch as a report then :) > > Can someone give a look at it and provide some explanations ? > > Thanks! > > Cc: Julien Thierry <julien.thierry@arm.com> > Cc: Steven Rostedt <rostedt@goodmis.org> > Signed-off-by: Zenghui Yu <yuzenghui@huawei.com> > --- > drivers/irqchip/irq-gic-v3.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c > index 15e55d3..8d0c25f 100644 > --- a/drivers/irqchip/irq-gic-v3.c > +++ b/drivers/irqchip/irq-gic-v3.c > @@ -487,7 +487,7 @@ static inline void gic_handle_nmi(u32 irqnr, struct pt_regs *regs) > gic_deactivate_unhandled(irqnr); > } > > -static asmlinkage void __exception_irq_entry gic_handle_irq(struct pt_regs *regs) > +static asmlinkage notrace void __exception_irq_entry gic_handle_irq(struct pt_regs *regs) > { > u32 irqnr; > > That's interesting. Do you have any out of tree patch that actually makes use of the pseudo-NMI feature? Without those patches, the behaviour should stay unchanged. On the other hand, if you can generate pseudo-NMIs, you could end-up tracing gic_handle_irq whilst being inside the tracing code with interrupts being notionally disabled (and that could be pretty bad). So, patches or no patches? Thanks, M.
On Fri, 29 Mar 2019 13:58:40 +0000 Marc Zyngier <marc.zyngier@arm.com> wrote: > On the other hand, if you can generate pseudo-NMIs, you could end-up > tracing gic_handle_irq whilst being inside the tracing code with > interrupts being notionally disabled (and that could be pretty bad). Actually, that should still be safe. The tracing code is expected to trace NMIs. Now the entry of an NMI can be an issue because of the way tracing is enabled. But this would also cause function tracing to crash, which was not stated. Does function tracing have the same issue, or is this just with function_graph tracing? This is because a breakpoint is added to all the places that are going to be traced so that the "nops" at the beginning of function calls are going to be converted to calls to the tracer. Now that means a breakpoint is being added at the beginning of gic_handle_irq(). I don't know how this pseudo NMI works, but we have notrace set for do_NMI because breakpoints at the entry (before it can fix things) causes issues. But this still doesn't make sense because the gic_handle_irq() call doesn't fix things up either, so other functions that are traced by gic_handle_irq() should blow up too, which appears (by the patch) not to be the case. -- Steve
Hi Marc, On 2019/3/29 21:58, Marc Zyngier wrote: > Hi Zenghui, > > On 29/03/2019 13:23, Zenghui Yu wrote: >> Enable pseudo NMI together with function_graph tracer, will lead >> the system to a hang. This is easy to reproduce, >> >> 1) Set "irqchip.gicv3_pseudo_nmi=1" on the kernel command line >> 2) echo function_graph > /sys/kernel/debug/tracing/current_tracer >> >> This patch (RFC) set gic_handle_irq() as notrace and it seems works >> fine now. But I have no idea about what the issue is exactly, and >> you can regard this patch as a report then :) >> >> Can someone give a look at it and provide some explanations ? >> >> Thanks! >> >> Cc: Julien Thierry <julien.thierry@arm.com> >> Cc: Steven Rostedt <rostedt@goodmis.org> >> Signed-off-by: Zenghui Yu <yuzenghui@huawei.com> >> --- >> drivers/irqchip/irq-gic-v3.c | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c >> index 15e55d3..8d0c25f 100644 >> --- a/drivers/irqchip/irq-gic-v3.c >> +++ b/drivers/irqchip/irq-gic-v3.c >> @@ -487,7 +487,7 @@ static inline void gic_handle_nmi(u32 irqnr, struct pt_regs *regs) >> gic_deactivate_unhandled(irqnr); >> } >> >> -static asmlinkage void __exception_irq_entry gic_handle_irq(struct pt_regs *regs) >> +static asmlinkage notrace void __exception_irq_entry gic_handle_irq(struct pt_regs *regs) >> { >> u32 irqnr; >> >> > > > That's interesting. Do you have any out of tree patch that actually > makes use of the pseudo-NMI feature? Without those patches, the > behaviour should stay unchanged. I am at commit 1a9df9e29c2afecf6e3089442d429b377279ca3c. No more patches, and this is the most confusing. Just out of curiosity, I wanted to run Julien's "Use NMI for perf interrupt" patch (posted on the mailing list), so I have to enable NMI first. That said, with 1) Select Kernel Feature -> Support for NMI-like interrupts 2) Set "irqchip.gicv3_pseudo_nmi=1" on the kernel command line 3) No pseudo-NMIs have been generated at all and this issue was hit. thanks, zenghui > > On the other hand, if you can generate pseudo-NMIs, you could end-up > tracing gic_handle_irq whilst being inside the tracing code with > interrupts being notionally disabled (and that could be pretty bad). > > So, patches or no patches? > > Thanks, > > M. >
Hi Steven, On 2019/3/29 22:53, Steven Rostedt wrote: > On Fri, 29 Mar 2019 13:58:40 +0000 > Marc Zyngier <marc.zyngier@arm.com> wrote: > >> On the other hand, if you can generate pseudo-NMIs, you could end-up >> tracing gic_handle_irq whilst being inside the tracing code with >> interrupts being notionally disabled (and that could be pretty bad). > > Actually, that should still be safe. The tracing code is expected to > trace NMIs. > > Now the entry of an NMI can be an issue because of the way tracing is > enabled. But this would also cause function tracing to crash, which was > not stated. Does function tracing have the same issue, or is this just > with function_graph tracing? This is just with function_graph tracing. Function tracing works fine. > > This is because a breakpoint is added to all the places that are going > to be traced so that the "nops" at the beginning of function calls are > going to be converted to calls to the tracer. Now that means a > breakpoint is being added at the beginning of gic_handle_irq(). I don't > know how this pseudo NMI works, but we have notrace set for do_NMI > because breakpoints at the entry (before it can fix things) causes > issues. But this still doesn't make sense because the gic_handle_irq() > call doesn't fix things up either, so other functions that are traced > by gic_handle_irq() should blow up too, which appears (by the patch) > not to be the case. Thanks for your explanation. I will read the code around further, and get back to you if there's any discovery. thanks, zenghui > > -- Steve > > . >
On Fri, 29 Mar 2019 23:35:59 +0800 Zenghui Yu <yuzenghui@huawei.com> wrote: > Hi Steven, > > On 2019/3/29 22:53, Steven Rostedt wrote: > > On Fri, 29 Mar 2019 13:58:40 +0000 > > Marc Zyngier <marc.zyngier@arm.com> wrote: > > > >> On the other hand, if you can generate pseudo-NMIs, you could end-up > >> tracing gic_handle_irq whilst being inside the tracing code with > >> interrupts being notionally disabled (and that could be pretty bad). > > > > Actually, that should still be safe. The tracing code is expected to > > trace NMIs. > > > > Now the entry of an NMI can be an issue because of the way tracing is > > enabled. But this would also cause function tracing to crash, which was > > not stated. Does function tracing have the same issue, or is this just > > with function_graph tracing? > > This is just with function_graph tracing. Function tracing works fine. Then we can rule out the break point issue. > > > > > This is because a breakpoint is added to all the places that are going > > to be traced so that the "nops" at the beginning of function calls are > > going to be converted to calls to the tracer. Now that means a > > breakpoint is being added at the beginning of gic_handle_irq(). I don't > > know how this pseudo NMI works, but we have notrace set for do_NMI > > because breakpoints at the entry (before it can fix things) causes > > issues. But this still doesn't make sense because the gic_handle_irq() > > call doesn't fix things up either, so other functions that are traced > > by gic_handle_irq() should blow up too, which appears (by the patch) > > not to be the case. > > Thanks for your explanation. I will read the code around further, and > get back to you if there's any discovery. Well, if it only affects function_graph tracing, then we don't need to investigate the break point issue. If it was the break point issue, it would cause function tracing to break too. Also, which architecture is this for? ARM or ARM64? -- Steve
On Fri, Mar 29, 2019 at 11:46 PM Steven Rostedt <rostedt@goodmis.org> wrote: > > On Fri, 29 Mar 2019 23:35:59 +0800 > Zenghui Yu <yuzenghui@huawei.com> wrote: > > > Hi Steven, > > > > On 2019/3/29 22:53, Steven Rostedt wrote: > > > On Fri, 29 Mar 2019 13:58:40 +0000 > > > Marc Zyngier <marc.zyngier@arm.com> wrote: > > > > > >> On the other hand, if you can generate pseudo-NMIs, you could end-up > > >> tracing gic_handle_irq whilst being inside the tracing code with > > >> interrupts being notionally disabled (and that could be pretty bad). > > > > > > Actually, that should still be safe. The tracing code is expected to > > > trace NMIs. > > > > > > Now the entry of an NMI can be an issue because of the way tracing is > > > enabled. But this would also cause function tracing to crash, which was > > > not stated. Does function tracing have the same issue, or is this just > > > with function_graph tracing? > > > > This is just with function_graph tracing. Function tracing works fine. > > Then we can rule out the break point issue. > > > > > > > > > This is because a breakpoint is added to all the places that are going > > > to be traced so that the "nops" at the beginning of function calls are > > > going to be converted to calls to the tracer. Now that means a > > > breakpoint is being added at the beginning of gic_handle_irq(). I don't > > > know how this pseudo NMI works, but we have notrace set for do_NMI > > > because breakpoints at the entry (before it can fix things) causes > > > issues. But this still doesn't make sense because the gic_handle_irq() > > > call doesn't fix things up either, so other functions that are traced > > > by gic_handle_irq() should blow up too, which appears (by the patch) > > > not to be the case. > > > > Thanks for your explanation. I will read the code around further, and > > get back to you if there's any discovery. > > Well, if it only affects function_graph tracing, then we don't need to > investigate the break point issue. If it was the break point issue, it > would cause function tracing to break too. OK. > > Also, which architecture is this for? ARM or ARM64? on ARM64. thanks, zenghui
Hi, On 29/03/2019 15:02, Zenghui Yu wrote: > Hi Marc, > > On 2019/3/29 21:58, Marc Zyngier wrote: >> Hi Zenghui, >> >> On 29/03/2019 13:23, Zenghui Yu wrote: >>> Enable pseudo NMI together with function_graph tracer, will lead >>> the system to a hang. This is easy to reproduce, >>> >>> 1) Set "irqchip.gicv3_pseudo_nmi=1" on the kernel command line >>> 2) echo function_graph > /sys/kernel/debug/tracing/current_tracer >>> >>> This patch (RFC) set gic_handle_irq() as notrace and it seems works >>> fine now. But I have no idea about what the issue is exactly, and >>> you can regard this patch as a report then :) >>> >>> Can someone give a look at it and provide some explanations ? >>> Thanks for testing this and reporting the issue. >>> Thanks! >>> >>> Cc: Julien Thierry <julien.thierry@arm.com> >>> Cc: Steven Rostedt <rostedt@goodmis.org> >>> Signed-off-by: Zenghui Yu <yuzenghui@huawei.com> >>> --- >>> drivers/irqchip/irq-gic-v3.c | 2 +- >>> 1 file changed, 1 insertion(+), 1 deletion(-) >>> >>> diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c >>> index 15e55d3..8d0c25f 100644 >>> --- a/drivers/irqchip/irq-gic-v3.c >>> +++ b/drivers/irqchip/irq-gic-v3.c >>> @@ -487,7 +487,7 @@ static inline void gic_handle_nmi(u32 irqnr, >>> struct pt_regs *regs) >>> gic_deactivate_unhandled(irqnr); >>> } >>> -static asmlinkage void __exception_irq_entry gic_handle_irq(struct >>> pt_regs *regs) >>> +static asmlinkage notrace void __exception_irq_entry >>> gic_handle_irq(struct pt_regs *regs) >>> { >>> u32 irqnr; >>> >> >> >> That's interesting. Do you have any out of tree patch that actually >> makes use of the pseudo-NMI feature? Without those patches, the >> behaviour should stay unchanged. > > I am at commit 1a9df9e29c2afecf6e3089442d429b377279ca3c. No more > patches, and this is the most confusing. Just out of curiosity, I > wanted to run Julien's "Use NMI for perf interrupt" patch (posted > on the mailing list), so I have to enable NMI first. > > That said, with > 1) Select Kernel Feature -> Support for NMI-like interrupts > 2) Set "irqchip.gicv3_pseudo_nmi=1" on the kernel command line > 3) No pseudo-NMIs have been generated at all > and this issue was hit. > I finally found out what happens. When using interrupt priority masking, at the begining of gic_handle_irq(), we are in this awkward state where we still have the I bit set and PMR unmasked (this is because we cannot ack the interrupt that was signaled if it has lower priority than the current priority mask). To try and keep things simple, we decided that local_irq_*() would only deal with PMR (when using priority masking). With one additional case being that, if PSR.I is set when saving flags, we'd represent it in the form of a value of PMR (i.e. GIC_PRIO_IRQOFF), so that irqs_disabled() and such would still accurately state that no interrupt could happen in those sections. The assumption was that in the few sections were we are having the PSR.I set, we wouldn't care about having interrupts disabled with PSR.I or PMR. And now that assumption appears to be wrong: trace_graph_entry(), called at the begining of gic_handle_irq() when enabling the tracer, does use local_irq_save/restore(). The save operation returns flags GIC_PRIO_IRQOFF (because PSR.I is set when we enter gic_handle_irq() ). The restore operation will then proceed to mask PMR, once we get back to gic_handle_irq() we can't acknowledge the interrupt we just received... The function tracer does not appear to save/restore flags on function entry (I saw save/restore operations in the stack tracer but for some reason couldn't get them to trigger the issue). To confirm this, I checked with the following diff (which is not a fix, it is better to mark gic_handle_irq() as notrace if I don't find something more suitable). Cheers,
Hi Julien, On 2019/4/2 22:00, Julien Thierry wrote: I meet this issue by coincidence before too. > I finally found out what happens. > > When using interrupt priority masking, at the begining of > gic_handle_irq(), we are in this awkward state where we still have the I > bit set and PMR unmasked (this is because we cannot ack the interrupt > that was signaled if it has lower priority than the current priority mask). > > To try and keep things simple, we decided that local_irq_*() would only > deal with PMR (when using priority masking). With one additional case > being that, if PSR.I is set when saving flags, we'd represent it in the > form of a value of PMR (i.e. GIC_PRIO_IRQOFF), so that irqs_disabled() > and such would still accurately state that no interrupt could happen in > those sections. The assumption was that in the few sections were we are > having the PSR.I set, we wouldn't care about having interrupts disabled > with PSR.I or PMR. And now that assumption appears to be wrong: I am confused of the logic of the save and restore here, why can't we do the save & restore exactly? > trace_graph_entry(), called at the begining of gic_handle_irq() when > enabling the tracer, does use local_irq_save/restore(). The save > operation returns flags GIC_PRIO_IRQOFF (because PSR.I is set when we > enter gic_handle_irq() ). The restore operation will then proceed to > mask PMR, once we get back to gic_handle_irq() we can't acknowledge the > interrupt we just received... > > The function tracer does not appear to save/restore flags on function > entry (I saw save/restore operations in the stack tracer but for some > reason couldn't get them to trigger the issue). > > To confirm this, I checked with the following diff (which is not a fix, > it is better to mark gic_handle_irq() as notrace if I don't find > something more suitable). > I am convinced a notrace marking here will walkaround this issue,but i am afraid there is still some trouble like this else. I send my modification to solve this issue just now, it is tested ok so far, but i am not particularly sure, could you please have a look? Thanks, Wei
diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c index 15e55d3..8d0c25f 100644 --- a/drivers/irqchip/irq-gic-v3.c +++ b/drivers/irqchip/irq-gic-v3.c @@ -487,7 +487,7 @@ static inline void gic_handle_nmi(u32 irqnr, struct pt_regs *regs) gic_deactivate_unhandled(irqnr); } -static asmlinkage void __exception_irq_entry gic_handle_irq(struct pt_regs *regs) +static asmlinkage notrace void __exception_irq_entry gic_handle_irq(struct pt_regs *regs) { u32 irqnr;
Enable pseudo NMI together with function_graph tracer, will lead the system to a hang. This is easy to reproduce, 1) Set "irqchip.gicv3_pseudo_nmi=1" on the kernel command line 2) echo function_graph > /sys/kernel/debug/tracing/current_tracer This patch (RFC) set gic_handle_irq() as notrace and it seems works fine now. But I have no idea about what the issue is exactly, and you can regard this patch as a report then :) Can someone give a look at it and provide some explanations ? Thanks! Cc: Julien Thierry <julien.thierry@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Zenghui Yu <yuzenghui@huawei.com> --- drivers/irqchip/irq-gic-v3.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)