Message ID | 20230419225604.21204-1-dianders@chromium.org (mailing list archive) |
---|---|
Headers | show |
Series | arm64: Add framework to turn an IPI as NMI | expand |
Hi, On Wed, Apr 19, 2023 at 3:57 PM Douglas Anderson <dianders@chromium.org> wrote: > > This is an attempt to resurrect Sumit's old patch series [1] that > allowed us to use the arm64 pseudo-NMI to get backtraces of CPUs and > also to round up CPUs in kdb/kgdb. The last post from Sumit that I > could find was v7, so I called this series v8. I haven't copied all of > his old changelongs here, but you can find them from the link. > > Since v7, I have: > * Addressed the small amount of feedback that was there for v7. > * Rebased. > * Added a new patch that prevents us from spamming the logs with idle > tasks. > * Added an extra patch to gracefully fall back to regular IPIs if > pseudo-NMIs aren't there. > > Since there appear to be a few different patches series related to > being able to use NMIs to get stack traces of crashed systems, let me > try to organize them to the best of my understanding: > > a) This series. On its own, a) will (among other things) enable stack > traces of all running processes with the soft lockup detector if > you've enabled the sysctl "kernel.softlockup_all_cpu_backtrace". On > its own, a) doesn't give a hard lockup detector. > > b) A different recently-posted series [2] that adds a hard lockup > detector based on perf. On its own, b) gives a stack crawl of the > locked up CPU but no stack crawls of other CPUs (even if they're > locked too). Together with a) + b) we get everything (full lockup > detect, full ability to get stack crawls). > > c) The old Android "buddy" hard lockup detector [3] that I'm > considering trying to upstream. If b) lands then I believe c) would > be redundant (at least for arm64). c) on its own is really only > useful on arm64 for platforms that can print CPU_DBGPCSR somehow > (see [4]). a) + c) is roughly as good as a) + b). > > [1] https://lore.kernel.org/linux-arm-kernel/1604317487-14543-1-git-send-email-sumit.garg@linaro.org/ > [2] https://lore.kernel.org/linux-arm-kernel/20220903093415.15850-1-lecopzer.chen@mediatek.com/ > [3] https://issuetracker.google.com/172213097 > [4] https://issuetracker.google.com/172213129 > > Changes in v8: > - dynamic_ipi_setup() and dynamic_ipi_teardown() no longer take cpu param > - dynamic_ipi_setup() and dynamic_ipi_teardown() no longer take cpu param > - Add loongarch support, too > - Removed "#ifdef CONFIG_SMP" since arm64 is always SMP > - "Tag the arm64 idle functions as __cpuidle" new for v8 > - "Provide a stub kgdb_nmicallback() if !CONFIG_KGDB" new for v8 > - "Fallback to a regular IPI if NMI isn't enabled" new for v8 > > Douglas Anderson (3): > arm64: idle: Tag the arm64 idle functions as __cpuidle > kgdb: Provide a stub kgdb_nmicallback() if !CONFIG_KGDB > arm64: ipi_nmi: Fallback to a regular IPI if NMI isn't enabled > > Sumit Garg (7): > arm64: Add framework to turn IPI as NMI > irqchip/gic-v3: Enable support for SGIs to act as NMIs > arm64: smp: Assign and setup an IPI as NMI > nmi: backtrace: Allow runtime arch specific override > arm64: ipi_nmi: Add support for NMI backtrace > kgdb: Expose default CPUs roundup fallback mechanism > arm64: kgdb: Roundup cpus using IPI as NMI > > arch/arm/include/asm/irq.h | 2 +- > arch/arm/kernel/smp.c | 3 +- > arch/arm64/include/asm/irq.h | 4 ++ > arch/arm64/include/asm/nmi.h | 17 +++++ > arch/arm64/kernel/Makefile | 2 +- > arch/arm64/kernel/idle.c | 4 +- > arch/arm64/kernel/ipi_nmi.c | 103 +++++++++++++++++++++++++++++++ > arch/arm64/kernel/kgdb.c | 18 ++++++ > arch/arm64/kernel/smp.c | 8 +++ > arch/loongarch/include/asm/irq.h | 2 +- > arch/loongarch/kernel/process.c | 3 +- > arch/mips/include/asm/irq.h | 2 +- > arch/mips/kernel/process.c | 3 +- > arch/powerpc/include/asm/nmi.h | 2 +- > arch/powerpc/kernel/stacktrace.c | 3 +- > arch/sparc/include/asm/irq_64.h | 2 +- > arch/sparc/kernel/process_64.c | 4 +- > arch/x86/include/asm/irq.h | 2 +- > arch/x86/kernel/apic/hw_nmi.c | 3 +- > drivers/irqchip/irq-gic-v3.c | 29 ++++++--- > include/linux/kgdb.h | 13 ++++ > include/linux/nmi.h | 12 ++-- > kernel/debug/debug_core.c | 8 ++- > 23 files changed, 217 insertions(+), 32 deletions(-) It's been 3 weeks and I haven't heard a peep on this series. That means nobody has any objections and it's all good to land, right? Right? :-P Seriously, though, I really think we should figure out how to get this landed. There's obviously interest from several different parties and I'm chomping at the bit waiting to spin this series according to your wishes. I also don't think there's anything super scary/ugly here. IMO the ideal situation is that folks are OK with the idea presented in the last patch in the series ("arm64: ipi_nmi: Fallback to a regular IPI if NMI isn't enabled") and then I can re-spin this series to be _much_ simpler. That being said, I'm also OK with dropping that patch and starting the discussion for how to land the rest of the patches in the series. Please let me know! -Doug
On Wed, May 10, 2023 at 08:28:17AM -0700, Doug Anderson wrote: > Hi, Hi Doug, > On Wed, Apr 19, 2023 at 3:57 PM Douglas Anderson <dianders@chromium.org> wrote: > > This is an attempt to resurrect Sumit's old patch series [1] that > > allowed us to use the arm64 pseudo-NMI to get backtraces of CPUs and > > also to round up CPUs in kdb/kgdb. The last post from Sumit that I > > could find was v7, so I called this series v8. I haven't copied all of > > his old changelongs here, but you can find them from the link. > > > > Since v7, I have: > > * Addressed the small amount of feedback that was there for v7. > > * Rebased. > > * Added a new patch that prevents us from spamming the logs with idle > > tasks. > > * Added an extra patch to gracefully fall back to regular IPIs if > > pseudo-NMIs aren't there. > > > > Since there appear to be a few different patches series related to > > being able to use NMIs to get stack traces of crashed systems, let me > > try to organize them to the best of my understanding: > > > > a) This series. On its own, a) will (among other things) enable stack > > traces of all running processes with the soft lockup detector if > > you've enabled the sysctl "kernel.softlockup_all_cpu_backtrace". On > > its own, a) doesn't give a hard lockup detector. > > > > b) A different recently-posted series [2] that adds a hard lockup > > detector based on perf. On its own, b) gives a stack crawl of the > > locked up CPU but no stack crawls of other CPUs (even if they're > > locked too). Together with a) + b) we get everything (full lockup > > detect, full ability to get stack crawls). > > > > c) The old Android "buddy" hard lockup detector [3] that I'm > > considering trying to upstream. If b) lands then I believe c) would > > be redundant (at least for arm64). c) on its own is really only > > useful on arm64 for platforms that can print CPU_DBGPCSR somehow > > (see [4]). a) + c) is roughly as good as a) + b). > It's been 3 weeks and I haven't heard a peep on this series. That > means nobody has any objections and it's all good to land, right? > Right? :-P FWIW, there are still longstanding soundness issues in the arm64 pseudo-NMI support (and fixing that requires an overhaul of our DAIF / IRQ flag management, which I've been chipping away at for a number of releases), so I hadn't looked at this in detail yet because the foundations are still somewhat dodgy. I appreciate that this has been around for a while, and it's on my queue to look at. Thanks, Mark.
Hi, On Wed, May 10, 2023 at 9:30 AM Mark Rutland <mark.rutland@arm.com> wrote: > > On Wed, May 10, 2023 at 08:28:17AM -0700, Doug Anderson wrote: > > Hi, > > Hi Doug, > > > On Wed, Apr 19, 2023 at 3:57 PM Douglas Anderson <dianders@chromium.org> wrote: > > > This is an attempt to resurrect Sumit's old patch series [1] that > > > allowed us to use the arm64 pseudo-NMI to get backtraces of CPUs and > > > also to round up CPUs in kdb/kgdb. The last post from Sumit that I > > > could find was v7, so I called this series v8. I haven't copied all of > > > his old changelongs here, but you can find them from the link. > > > > > > Since v7, I have: > > > * Addressed the small amount of feedback that was there for v7. > > > * Rebased. > > > * Added a new patch that prevents us from spamming the logs with idle > > > tasks. > > > * Added an extra patch to gracefully fall back to regular IPIs if > > > pseudo-NMIs aren't there. > > > > > > Since there appear to be a few different patches series related to > > > being able to use NMIs to get stack traces of crashed systems, let me > > > try to organize them to the best of my understanding: > > > > > > a) This series. On its own, a) will (among other things) enable stack > > > traces of all running processes with the soft lockup detector if > > > you've enabled the sysctl "kernel.softlockup_all_cpu_backtrace". On > > > its own, a) doesn't give a hard lockup detector. > > > > > > b) A different recently-posted series [2] that adds a hard lockup > > > detector based on perf. On its own, b) gives a stack crawl of the > > > locked up CPU but no stack crawls of other CPUs (even if they're > > > locked too). Together with a) + b) we get everything (full lockup > > > detect, full ability to get stack crawls). > > > > > > c) The old Android "buddy" hard lockup detector [3] that I'm > > > considering trying to upstream. If b) lands then I believe c) would > > > be redundant (at least for arm64). c) on its own is really only > > > useful on arm64 for platforms that can print CPU_DBGPCSR somehow > > > (see [4]). a) + c) is roughly as good as a) + b). > > > It's been 3 weeks and I haven't heard a peep on this series. That > > means nobody has any objections and it's all good to land, right? > > Right? :-P > > FWIW, there are still longstanding soundness issues in the arm64 pseudo-NMI > support (and fixing that requires an overhaul of our DAIF / IRQ flag > management, which I've been chipping away at for a number of releases), so I > hadn't looked at this in detail yet because the foundations are still somewhat > dodgy. > > I appreciate that this has been around for a while, and it's on my queue to > look at. Ah, thanks for the heads up! We've been thinking about turning this on in production in ChromeOS because it will help us track down a whole class of field-generated crash reports that are otherwise opaque to us. It sounds as if maybe that's not a good idea quite yet? Do you have any idea of how much farther along this needs to go? ...of course, we've also run into issues with Mediatek devices because they don't save/restore GICR registers properly [1]. In theory, we might be able to work around that in the kernel. In any case, even if there are bugs that would prevent turning this on for production, it still seems like we could still land this series. It simply wouldn't do anything until someone turned on pseudo NMIs, which wouldn't happen till the kinks are worked out. ...actually, I guess I should say that if all the patches of the current series do land then it actually _would_ still do something, even without pseudo-NMI. Assuming the last patch looks OK, it would at least start falling back to using regular IPIs to do backtraces. That wouldn't get backtraces on hard locked up CPUs but it would be better than what we have today where we don't get any backtraces. This would get arm64 on par with arm32... [1] https://issuetracker.google.com/281831288
On Wed, 10 May 2023 at 22:20, Doug Anderson <dianders@chromium.org> wrote: > > Hi, > > On Wed, May 10, 2023 at 9:30 AM Mark Rutland <mark.rutland@arm.com> wrote: > > > > On Wed, May 10, 2023 at 08:28:17AM -0700, Doug Anderson wrote: > > > Hi, > > > > Hi Doug, > > > > > On Wed, Apr 19, 2023 at 3:57 PM Douglas Anderson <dianders@chromium.org> wrote: > > > > This is an attempt to resurrect Sumit's old patch series [1] that > > > > allowed us to use the arm64 pseudo-NMI to get backtraces of CPUs and > > > > also to round up CPUs in kdb/kgdb. The last post from Sumit that I > > > > could find was v7, so I called this series v8. I haven't copied all of > > > > his old changelongs here, but you can find them from the link. > > > > Thanks Doug for picking up this work and for all your additions/improvements. > > > > Since v7, I have: > > > > * Addressed the small amount of feedback that was there for v7. > > > > * Rebased. > > > > * Added a new patch that prevents us from spamming the logs with idle > > > > tasks. > > > > * Added an extra patch to gracefully fall back to regular IPIs if > > > > pseudo-NMIs aren't there. > > > > > > > > Since there appear to be a few different patches series related to > > > > being able to use NMIs to get stack traces of crashed systems, let me > > > > try to organize them to the best of my understanding: > > > > > > > > a) This series. On its own, a) will (among other things) enable stack > > > > traces of all running processes with the soft lockup detector if > > > > you've enabled the sysctl "kernel.softlockup_all_cpu_backtrace". On > > > > its own, a) doesn't give a hard lockup detector. > > > > > > > > b) A different recently-posted series [2] that adds a hard lockup > > > > detector based on perf. On its own, b) gives a stack crawl of the > > > > locked up CPU but no stack crawls of other CPUs (even if they're > > > > locked too). Together with a) + b) we get everything (full lockup > > > > detect, full ability to get stack crawls). > > > > > > > > c) The old Android "buddy" hard lockup detector [3] that I'm > > > > considering trying to upstream. If b) lands then I believe c) would > > > > be redundant (at least for arm64). c) on its own is really only > > > > useful on arm64 for platforms that can print CPU_DBGPCSR somehow > > > > (see [4]). a) + c) is roughly as good as a) + b). > > > > > It's been 3 weeks and I haven't heard a peep on this series. That > > > means nobody has any objections and it's all good to land, right? > > > Right? :-P For me it was months waiting without any feedback. So I think you are lucky :) or atleast better than me at poking arm64 maintainers. > > > > FWIW, there are still longstanding soundness issues in the arm64 pseudo-NMI > > support (and fixing that requires an overhaul of our DAIF / IRQ flag > > management, which I've been chipping away at for a number of releases), so I > > hadn't looked at this in detail yet because the foundations are still somewhat > > dodgy. > > > > I appreciate that this has been around for a while, and it's on my queue to > > look at. > > Ah, thanks for the heads up! We've been thinking about turning this on > in production in ChromeOS because it will help us track down a whole > class of field-generated crash reports that are otherwise opaque to > us. It sounds as if maybe that's not a good idea quite yet? Do you > have any idea of how much farther along this needs to go? ...of > course, we've also run into issues with Mediatek devices because they > don't save/restore GICR registers properly [1]. In theory, we might be > able to work around that in the kernel. > > In any case, even if there are bugs that would prevent turning this on > for production, it still seems like we could still land this series. > It simply wouldn't do anything until someone turned on pseudo NMIs, > which wouldn't happen till the kinks are worked out. I agree here. We should be able to make the foundations robust later on. IMHO, until we turn on features surrounding pseudo NMIs, I am not sure how we can have true confidence in the underlying robustness. -Sumit > > ...actually, I guess I should say that if all the patches of the > current series do land then it actually _would_ still do something, > even without pseudo-NMI. Assuming the last patch looks OK, it would at > least start falling back to using regular IPIs to do backtraces. That > wouldn't get backtraces on hard locked up CPUs but it would be better > than what we have today where we don't get any backtraces. This would > get arm64 on par with arm32... > > [1] https://issuetracker.google.com/281831288
Hi, On Wed, May 10, 2023 at 9:42 AM Doug Anderson <dianders@chromium.org> wrote: > > Hi, > > On Wed, May 10, 2023 at 9:30 AM Mark Rutland <mark.rutland@arm.com> wrote: > > > > On Wed, May 10, 2023 at 08:28:17AM -0700, Doug Anderson wrote: > > > Hi, > > > > Hi Doug, > > > > > On Wed, Apr 19, 2023 at 3:57 PM Douglas Anderson <dianders@chromium.org> wrote: > > > > This is an attempt to resurrect Sumit's old patch series [1] that > > > > allowed us to use the arm64 pseudo-NMI to get backtraces of CPUs and > > > > also to round up CPUs in kdb/kgdb. The last post from Sumit that I > > > > could find was v7, so I called this series v8. I haven't copied all of > > > > his old changelongs here, but you can find them from the link. > > > > > > > > Since v7, I have: > > > > * Addressed the small amount of feedback that was there for v7. > > > > * Rebased. > > > > * Added a new patch that prevents us from spamming the logs with idle > > > > tasks. > > > > * Added an extra patch to gracefully fall back to regular IPIs if > > > > pseudo-NMIs aren't there. > > > > > > > > Since there appear to be a few different patches series related to > > > > being able to use NMIs to get stack traces of crashed systems, let me > > > > try to organize them to the best of my understanding: > > > > > > > > a) This series. On its own, a) will (among other things) enable stack > > > > traces of all running processes with the soft lockup detector if > > > > you've enabled the sysctl "kernel.softlockup_all_cpu_backtrace". On > > > > its own, a) doesn't give a hard lockup detector. > > > > > > > > b) A different recently-posted series [2] that adds a hard lockup > > > > detector based on perf. On its own, b) gives a stack crawl of the > > > > locked up CPU but no stack crawls of other CPUs (even if they're > > > > locked too). Together with a) + b) we get everything (full lockup > > > > detect, full ability to get stack crawls). > > > > > > > > c) The old Android "buddy" hard lockup detector [3] that I'm > > > > considering trying to upstream. If b) lands then I believe c) would > > > > be redundant (at least for arm64). c) on its own is really only > > > > useful on arm64 for platforms that can print CPU_DBGPCSR somehow > > > > (see [4]). a) + c) is roughly as good as a) + b). > > > > > It's been 3 weeks and I haven't heard a peep on this series. That > > > means nobody has any objections and it's all good to land, right? > > > Right? :-P > > > > FWIW, there are still longstanding soundness issues in the arm64 pseudo-NMI > > support (and fixing that requires an overhaul of our DAIF / IRQ flag > > management, which I've been chipping away at for a number of releases), so I > > hadn't looked at this in detail yet because the foundations are still somewhat > > dodgy. > > > > I appreciate that this has been around for a while, and it's on my queue to > > look at. > > Ah, thanks for the heads up! We've been thinking about turning this on > in production in ChromeOS because it will help us track down a whole > class of field-generated crash reports that are otherwise opaque to > us. It sounds as if maybe that's not a good idea quite yet? Do you > have any idea of how much farther along this needs to go? I'm still very interested in this topic. Do you have anything concrete that is broken? Your reply gives me the vibe that there have been a bunch of bugs found recently and, while there are no known issues, you're worried that there might be something lingering still. Is that correct, or do you have something concrete that's broken? Is this anything others could help with? I could make an attempt, or we might be able to rope some of the others interested in this topic and more GIC-knowledgeable to help? I still have a goal for turning this on for production but obviously don't want to make the device crashier because of it. Also: if this really has known bugs, should we put a big WARN_ON splat if anyone tries to turn on pseudo NMIs? Without that, it feels like it's a bit of a footgun. > ...of > course, we've also run into issues with Mediatek devices because they > don't save/restore GICR registers properly [1]. In theory, we might be > able to work around that in the kernel. To followup with this, we've formulated a plan for dealing with Mediatek Chromebooks. I doubt anyone is truly interested, but if anyone is the details are in the public Google bug tracker [1]. None of that should block lading the NMI backtrace series. > In any case, even if there are bugs that would prevent turning this on > for production, it still seems like we could still land this series. > It simply wouldn't do anything until someone turned on pseudo NMIs, > which wouldn't happen till the kinks are worked out. > > ...actually, I guess I should say that if all the patches of the > current series do land then it actually _would_ still do something, > even without pseudo-NMI. Assuming the last patch looks OK, it would at > least start falling back to using regular IPIs to do backtraces. That > wouldn't get backtraces on hard locked up CPUs but it would be better > than what we have today where we don't get any backtraces. This would > get arm64 on par with arm32... The more I thought about it, the more I liked the idea of going full hog on the idea in patch #10. I've posted v9 [2] where I've really embraced the idea of falling back to a regular IPI if NMI isn't available. Hopefully it looks keen. [1] https://issuetracker.google.com/281831288 [2] https://lore.kernel.org/r/20230601213440.2488667-1-dianders@chromium.org -Doug