mbox series

[RFC,0/5] Introduce NMI aware serial drivers

Message ID 1595333413-30052-1-git-send-email-sumit.garg@linaro.org (mailing list archive)
Headers show
Series Introduce NMI aware serial drivers | expand

Message

Sumit Garg July 21, 2020, 12:10 p.m. UTC
Make it possible for UARTs to trigger magic sysrq from an NMI. With the
advent of pseudo NMIs on arm64 it became quite generic to request serial
device interrupt as an NMI rather than IRQ. And having NMI driven serial
RX will allow us to trigger magic sysrq as an NMI and hence drop into
kernel debugger in NMI context.

The major use-case is to add NMI debugging capabilities to the kernel
in order to debug scenarios such as:
- Primary CPU is stuck in deadlock with interrupts disabled and hence
  doesn't honor serial device interrupt. So having magic sysrq triggered
  as an NMI is helpful for debugging.
- Always enabled NMI based magic sysrq irrespective of whether the serial
  TTY port is active or not.

Currently there is an existing kgdb NMI serial driver which provides
partial implementation in upstream to have a separate ttyNMI0 port but
that remained in silos with the serial core/drivers which made it a bit
odd to enable using serial device interrupt and hence remained unused. It
seems to be clearly intended to avoid almost all custom NMI changes to
the UART driver.

But this patch-set allows the serial core/drivers to be NMI aware which
in turn provides NMI debugging capabilities via magic sysrq and hence
there is no specific reason to keep this special driver. So remove it
instead.

Approach:
---------

The overall idea is to intercept serial RX characters in NMI context, if
those are specific to magic sysrq then allow corresponding handler to run
in NMI context. Otherwise, defer all other RX and TX operations onto IRQ
work queue in order to run those in normal interrupt context.

This approach is demonstrated using amba-pl011 driver.

Patch-wise description:
-----------------------

Patch #1 prepares magic sysrq handler to be NMI aware.
Patch #2 adds NMI framework to serial core.
Patch #3 and #4 demonstrates NMI aware uart port using amba-pl011 driver.
Patch #5 removes kgdb NMI serial driver.

Goal of this RFC:
-----------------

My main reason for sharing this as an RFC is to help decide whether or
not to continue with this approach. The next step for me would to port
the work to a system with an 8250 UART.

Usage:
------

This RFC has been developed on top of 5.8-rc3 and if anyone is interested
to give this a try on QEMU, just enable following config options
additional to arm64 defconfig:

CONFIG_KGDB=y
CONFIG_KGDB_KDB=y
CONFIG_ARM64_PSEUDO_NMI=y

Qemu command line to test:

$ qemu-system-aarch64 -nographic -machine virt,gic-version=3 -cpu cortex-a57 \
  -smp 2 -kernel arch/arm64/boot/Image -append 'console=ttyAMA0,38400 \
  keep_bootcon root=/dev/vda2 irqchip.gicv3_pseudo_nmi=1 kgdboc=ttyAMA0' \
  -initrd rootfs-arm64.cpio.gz

NMI entry into kgdb via sysrq:
- Ctrl a + b + g

Reference:
----------

For more details about NMI/FIQ debugger, refer to this blog post [1].

[1] https://www.linaro.org/blog/debugging-arm-kernels-using-nmifiq/

I do look forward to your comments and feedback.

Sumit Garg (5):
  tty/sysrq: Make sysrq handler NMI aware
  serial: core: Add framework to allow NMI aware serial drivers
  serial: amba-pl011: Re-order APIs definition
  serial: amba-pl011: Enable NMI aware uart port
  serial: Remove KGDB NMI serial driver

 drivers/tty/serial/Kconfig       |  19 --
 drivers/tty/serial/Makefile      |   1 -
 drivers/tty/serial/amba-pl011.c  | 232 +++++++++++++++++-------
 drivers/tty/serial/kgdb_nmi.c    | 383 ---------------------------------------
 drivers/tty/serial/kgdboc.c      |   8 -
 drivers/tty/serial/serial_core.c | 120 +++++++++++-
 drivers/tty/sysrq.c              |  33 +++-
 include/linux/kgdb.h             |  10 -
 include/linux/serial_core.h      |  67 +++++++
 include/linux/sysrq.h            |   1 +
 kernel/debug/debug_core.c        |   1 +
 11 files changed, 386 insertions(+), 489 deletions(-)
 delete mode 100644 drivers/tty/serial/kgdb_nmi.c

Comments

Sumit Garg Aug. 11, 2020, 1:50 p.m. UTC | #1
On Tue, 21 Jul 2020 at 17:40, Sumit Garg <sumit.garg@linaro.org> wrote:
>
> Make it possible for UARTs to trigger magic sysrq from an NMI. With the
> advent of pseudo NMIs on arm64 it became quite generic to request serial
> device interrupt as an NMI rather than IRQ. And having NMI driven serial
> RX will allow us to trigger magic sysrq as an NMI and hence drop into
> kernel debugger in NMI context.
>
> The major use-case is to add NMI debugging capabilities to the kernel
> in order to debug scenarios such as:
> - Primary CPU is stuck in deadlock with interrupts disabled and hence
>   doesn't honor serial device interrupt. So having magic sysrq triggered
>   as an NMI is helpful for debugging.
> - Always enabled NMI based magic sysrq irrespective of whether the serial
>   TTY port is active or not.
>
> Currently there is an existing kgdb NMI serial driver which provides
> partial implementation in upstream to have a separate ttyNMI0 port but
> that remained in silos with the serial core/drivers which made it a bit
> odd to enable using serial device interrupt and hence remained unused. It
> seems to be clearly intended to avoid almost all custom NMI changes to
> the UART driver.
>
> But this patch-set allows the serial core/drivers to be NMI aware which
> in turn provides NMI debugging capabilities via magic sysrq and hence
> there is no specific reason to keep this special driver. So remove it
> instead.
>
> Approach:
> ---------
>
> The overall idea is to intercept serial RX characters in NMI context, if
> those are specific to magic sysrq then allow corresponding handler to run
> in NMI context. Otherwise, defer all other RX and TX operations onto IRQ
> work queue in order to run those in normal interrupt context.
>
> This approach is demonstrated using amba-pl011 driver.
>
> Patch-wise description:
> -----------------------
>
> Patch #1 prepares magic sysrq handler to be NMI aware.
> Patch #2 adds NMI framework to serial core.
> Patch #3 and #4 demonstrates NMI aware uart port using amba-pl011 driver.
> Patch #5 removes kgdb NMI serial driver.
>
> Goal of this RFC:
> -----------------
>
> My main reason for sharing this as an RFC is to help decide whether or
> not to continue with this approach. The next step for me would to port
> the work to a system with an 8250 UART.
>

A gentle reminder to seek feedback on this series.

-Sumit

> Usage:
> ------
>
> This RFC has been developed on top of 5.8-rc3 and if anyone is interested
> to give this a try on QEMU, just enable following config options
> additional to arm64 defconfig:
>
> CONFIG_KGDB=y
> CONFIG_KGDB_KDB=y
> CONFIG_ARM64_PSEUDO_NMI=y
>
> Qemu command line to test:
>
> $ qemu-system-aarch64 -nographic -machine virt,gic-version=3 -cpu cortex-a57 \
>   -smp 2 -kernel arch/arm64/boot/Image -append 'console=ttyAMA0,38400 \
>   keep_bootcon root=/dev/vda2 irqchip.gicv3_pseudo_nmi=1 kgdboc=ttyAMA0' \
>   -initrd rootfs-arm64.cpio.gz
>
> NMI entry into kgdb via sysrq:
> - Ctrl a + b + g
>
> Reference:
> ----------
>
> For more details about NMI/FIQ debugger, refer to this blog post [1].
>
> [1] https://www.linaro.org/blog/debugging-arm-kernels-using-nmifiq/
>
> I do look forward to your comments and feedback.
>
> Sumit Garg (5):
>   tty/sysrq: Make sysrq handler NMI aware
>   serial: core: Add framework to allow NMI aware serial drivers
>   serial: amba-pl011: Re-order APIs definition
>   serial: amba-pl011: Enable NMI aware uart port
>   serial: Remove KGDB NMI serial driver
>
>  drivers/tty/serial/Kconfig       |  19 --
>  drivers/tty/serial/Makefile      |   1 -
>  drivers/tty/serial/amba-pl011.c  | 232 +++++++++++++++++-------
>  drivers/tty/serial/kgdb_nmi.c    | 383 ---------------------------------------
>  drivers/tty/serial/kgdboc.c      |   8 -
>  drivers/tty/serial/serial_core.c | 120 +++++++++++-
>  drivers/tty/sysrq.c              |  33 +++-
>  include/linux/kgdb.h             |  10 -
>  include/linux/serial_core.h      |  67 +++++++
>  include/linux/sysrq.h            |   1 +
>  kernel/debug/debug_core.c        |   1 +
>  11 files changed, 386 insertions(+), 489 deletions(-)
>  delete mode 100644 drivers/tty/serial/kgdb_nmi.c
>
> --
> 2.7.4
>
Greg KH Aug. 11, 2020, 1:58 p.m. UTC | #2
On Tue, Aug 11, 2020 at 07:20:26PM +0530, Sumit Garg wrote:
> On Tue, 21 Jul 2020 at 17:40, Sumit Garg <sumit.garg@linaro.org> wrote:
> >
> > Make it possible for UARTs to trigger magic sysrq from an NMI. With the
> > advent of pseudo NMIs on arm64 it became quite generic to request serial
> > device interrupt as an NMI rather than IRQ. And having NMI driven serial
> > RX will allow us to trigger magic sysrq as an NMI and hence drop into
> > kernel debugger in NMI context.
> >
> > The major use-case is to add NMI debugging capabilities to the kernel
> > in order to debug scenarios such as:
> > - Primary CPU is stuck in deadlock with interrupts disabled and hence
> >   doesn't honor serial device interrupt. So having magic sysrq triggered
> >   as an NMI is helpful for debugging.
> > - Always enabled NMI based magic sysrq irrespective of whether the serial
> >   TTY port is active or not.
> >
> > Currently there is an existing kgdb NMI serial driver which provides
> > partial implementation in upstream to have a separate ttyNMI0 port but
> > that remained in silos with the serial core/drivers which made it a bit
> > odd to enable using serial device interrupt and hence remained unused. It
> > seems to be clearly intended to avoid almost all custom NMI changes to
> > the UART driver.
> >
> > But this patch-set allows the serial core/drivers to be NMI aware which
> > in turn provides NMI debugging capabilities via magic sysrq and hence
> > there is no specific reason to keep this special driver. So remove it
> > instead.
> >
> > Approach:
> > ---------
> >
> > The overall idea is to intercept serial RX characters in NMI context, if
> > those are specific to magic sysrq then allow corresponding handler to run
> > in NMI context. Otherwise, defer all other RX and TX operations onto IRQ
> > work queue in order to run those in normal interrupt context.
> >
> > This approach is demonstrated using amba-pl011 driver.
> >
> > Patch-wise description:
> > -----------------------
> >
> > Patch #1 prepares magic sysrq handler to be NMI aware.
> > Patch #2 adds NMI framework to serial core.
> > Patch #3 and #4 demonstrates NMI aware uart port using amba-pl011 driver.
> > Patch #5 removes kgdb NMI serial driver.
> >
> > Goal of this RFC:
> > -----------------
> >
> > My main reason for sharing this as an RFC is to help decide whether or
> > not to continue with this approach. The next step for me would to port
> > the work to a system with an 8250 UART.
> >
> 
> A gentle reminder to seek feedback on this series.

It's the middle of the merge window, and I can't do anything.

Also, I almost never review RFC patches as I have have way too many
patches that people think are "right" to review first...

I suggest you work to flesh this out first and submit something that you
feels works properly.

good luck!

greg k-h
Sumit Garg Aug. 11, 2020, 2:29 p.m. UTC | #3
Hi Greg,

Thanks for your comments.

On Tue, 11 Aug 2020 at 19:27, Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
>
> On Tue, Aug 11, 2020 at 07:20:26PM +0530, Sumit Garg wrote:
> > On Tue, 21 Jul 2020 at 17:40, Sumit Garg <sumit.garg@linaro.org> wrote:
> > >
> > > Make it possible for UARTs to trigger magic sysrq from an NMI. With the
> > > advent of pseudo NMIs on arm64 it became quite generic to request serial
> > > device interrupt as an NMI rather than IRQ. And having NMI driven serial
> > > RX will allow us to trigger magic sysrq as an NMI and hence drop into
> > > kernel debugger in NMI context.
> > >
> > > The major use-case is to add NMI debugging capabilities to the kernel
> > > in order to debug scenarios such as:
> > > - Primary CPU is stuck in deadlock with interrupts disabled and hence
> > >   doesn't honor serial device interrupt. So having magic sysrq triggered
> > >   as an NMI is helpful for debugging.
> > > - Always enabled NMI based magic sysrq irrespective of whether the serial
> > >   TTY port is active or not.
> > >
> > > Currently there is an existing kgdb NMI serial driver which provides
> > > partial implementation in upstream to have a separate ttyNMI0 port but
> > > that remained in silos with the serial core/drivers which made it a bit
> > > odd to enable using serial device interrupt and hence remained unused. It
> > > seems to be clearly intended to avoid almost all custom NMI changes to
> > > the UART driver.
> > >
> > > But this patch-set allows the serial core/drivers to be NMI aware which
> > > in turn provides NMI debugging capabilities via magic sysrq and hence
> > > there is no specific reason to keep this special driver. So remove it
> > > instead.
> > >
> > > Approach:
> > > ---------
> > >
> > > The overall idea is to intercept serial RX characters in NMI context, if
> > > those are specific to magic sysrq then allow corresponding handler to run
> > > in NMI context. Otherwise, defer all other RX and TX operations onto IRQ
> > > work queue in order to run those in normal interrupt context.
> > >
> > > This approach is demonstrated using amba-pl011 driver.
> > >
> > > Patch-wise description:
> > > -----------------------
> > >
> > > Patch #1 prepares magic sysrq handler to be NMI aware.
> > > Patch #2 adds NMI framework to serial core.
> > > Patch #3 and #4 demonstrates NMI aware uart port using amba-pl011 driver.
> > > Patch #5 removes kgdb NMI serial driver.
> > >
> > > Goal of this RFC:
> > > -----------------
> > >
> > > My main reason for sharing this as an RFC is to help decide whether or
> > > not to continue with this approach. The next step for me would to port
> > > the work to a system with an 8250 UART.
> > >
> >
> > A gentle reminder to seek feedback on this series.
>
> It's the middle of the merge window, and I can't do anything.
>
> Also, I almost never review RFC patches as I have have way too many
> patches that people think are "right" to review first...
>

Okay, I understand and I can definitely wait for your feedback.

> I suggest you work to flesh this out first and submit something that you
> feels works properly.
>

IIUC, in order to make this approach substantial I need to make it
work with 8250 UART (major serial driver), correct? As currently it
works properly for amba-pl011 driver.

> good luck!
>

Thanks.

-Sumit

> greg k-h
Greg KH Aug. 11, 2020, 2:58 p.m. UTC | #4
On Tue, Aug 11, 2020 at 07:59:24PM +0530, Sumit Garg wrote:
> Hi Greg,
> 
> Thanks for your comments.
> 
> On Tue, 11 Aug 2020 at 19:27, Greg Kroah-Hartman
> <gregkh@linuxfoundation.org> wrote:
> >
> > On Tue, Aug 11, 2020 at 07:20:26PM +0530, Sumit Garg wrote:
> > > On Tue, 21 Jul 2020 at 17:40, Sumit Garg <sumit.garg@linaro.org> wrote:
> > > >
> > > > Make it possible for UARTs to trigger magic sysrq from an NMI. With the
> > > > advent of pseudo NMIs on arm64 it became quite generic to request serial
> > > > device interrupt as an NMI rather than IRQ. And having NMI driven serial
> > > > RX will allow us to trigger magic sysrq as an NMI and hence drop into
> > > > kernel debugger in NMI context.
> > > >
> > > > The major use-case is to add NMI debugging capabilities to the kernel
> > > > in order to debug scenarios such as:
> > > > - Primary CPU is stuck in deadlock with interrupts disabled and hence
> > > >   doesn't honor serial device interrupt. So having magic sysrq triggered
> > > >   as an NMI is helpful for debugging.
> > > > - Always enabled NMI based magic sysrq irrespective of whether the serial
> > > >   TTY port is active or not.
> > > >
> > > > Currently there is an existing kgdb NMI serial driver which provides
> > > > partial implementation in upstream to have a separate ttyNMI0 port but
> > > > that remained in silos with the serial core/drivers which made it a bit
> > > > odd to enable using serial device interrupt and hence remained unused. It
> > > > seems to be clearly intended to avoid almost all custom NMI changes to
> > > > the UART driver.
> > > >
> > > > But this patch-set allows the serial core/drivers to be NMI aware which
> > > > in turn provides NMI debugging capabilities via magic sysrq and hence
> > > > there is no specific reason to keep this special driver. So remove it
> > > > instead.
> > > >
> > > > Approach:
> > > > ---------
> > > >
> > > > The overall idea is to intercept serial RX characters in NMI context, if
> > > > those are specific to magic sysrq then allow corresponding handler to run
> > > > in NMI context. Otherwise, defer all other RX and TX operations onto IRQ
> > > > work queue in order to run those in normal interrupt context.
> > > >
> > > > This approach is demonstrated using amba-pl011 driver.
> > > >
> > > > Patch-wise description:
> > > > -----------------------
> > > >
> > > > Patch #1 prepares magic sysrq handler to be NMI aware.
> > > > Patch #2 adds NMI framework to serial core.
> > > > Patch #3 and #4 demonstrates NMI aware uart port using amba-pl011 driver.
> > > > Patch #5 removes kgdb NMI serial driver.
> > > >
> > > > Goal of this RFC:
> > > > -----------------
> > > >
> > > > My main reason for sharing this as an RFC is to help decide whether or
> > > > not to continue with this approach. The next step for me would to port
> > > > the work to a system with an 8250 UART.
> > > >
> > >
> > > A gentle reminder to seek feedback on this series.
> >
> > It's the middle of the merge window, and I can't do anything.
> >
> > Also, I almost never review RFC patches as I have have way too many
> > patches that people think are "right" to review first...
> >
> 
> Okay, I understand and I can definitely wait for your feedback.

My feedback here is this:

> > I suggest you work to flesh this out first and submit something that you
> > feels works properly.

:)

> IIUC, in order to make this approach substantial I need to make it
> work with 8250 UART (major serial driver), correct? As currently it
> works properly for amba-pl011 driver.

Yes, try to do that, or better yet, make it work with all serial drivers
automatically.

thanks,

greg k-h
Doug Anderson Aug. 11, 2020, 5:15 p.m. UTC | #5
Hi,

On Tue, Aug 11, 2020 at 7:58 AM Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
>
> On Tue, Aug 11, 2020 at 07:59:24PM +0530, Sumit Garg wrote:
> > Hi Greg,
> >
> > Thanks for your comments.
> >
> > On Tue, 11 Aug 2020 at 19:27, Greg Kroah-Hartman
> > <gregkh@linuxfoundation.org> wrote:
> > >
> > > On Tue, Aug 11, 2020 at 07:20:26PM +0530, Sumit Garg wrote:
> > > > On Tue, 21 Jul 2020 at 17:40, Sumit Garg <sumit.garg@linaro.org> wrote:
> > > > >
> > > > > Make it possible for UARTs to trigger magic sysrq from an NMI. With the
> > > > > advent of pseudo NMIs on arm64 it became quite generic to request serial
> > > > > device interrupt as an NMI rather than IRQ. And having NMI driven serial
> > > > > RX will allow us to trigger magic sysrq as an NMI and hence drop into
> > > > > kernel debugger in NMI context.
> > > > >
> > > > > The major use-case is to add NMI debugging capabilities to the kernel
> > > > > in order to debug scenarios such as:
> > > > > - Primary CPU is stuck in deadlock with interrupts disabled and hence
> > > > >   doesn't honor serial device interrupt. So having magic sysrq triggered
> > > > >   as an NMI is helpful for debugging.
> > > > > - Always enabled NMI based magic sysrq irrespective of whether the serial
> > > > >   TTY port is active or not.
> > > > >
> > > > > Currently there is an existing kgdb NMI serial driver which provides
> > > > > partial implementation in upstream to have a separate ttyNMI0 port but
> > > > > that remained in silos with the serial core/drivers which made it a bit
> > > > > odd to enable using serial device interrupt and hence remained unused. It
> > > > > seems to be clearly intended to avoid almost all custom NMI changes to
> > > > > the UART driver.
> > > > >
> > > > > But this patch-set allows the serial core/drivers to be NMI aware which
> > > > > in turn provides NMI debugging capabilities via magic sysrq and hence
> > > > > there is no specific reason to keep this special driver. So remove it
> > > > > instead.
> > > > >
> > > > > Approach:
> > > > > ---------
> > > > >
> > > > > The overall idea is to intercept serial RX characters in NMI context, if
> > > > > those are specific to magic sysrq then allow corresponding handler to run
> > > > > in NMI context. Otherwise, defer all other RX and TX operations onto IRQ
> > > > > work queue in order to run those in normal interrupt context.
> > > > >
> > > > > This approach is demonstrated using amba-pl011 driver.
> > > > >
> > > > > Patch-wise description:
> > > > > -----------------------
> > > > >
> > > > > Patch #1 prepares magic sysrq handler to be NMI aware.
> > > > > Patch #2 adds NMI framework to serial core.
> > > > > Patch #3 and #4 demonstrates NMI aware uart port using amba-pl011 driver.
> > > > > Patch #5 removes kgdb NMI serial driver.
> > > > >
> > > > > Goal of this RFC:
> > > > > -----------------
> > > > >
> > > > > My main reason for sharing this as an RFC is to help decide whether or
> > > > > not to continue with this approach. The next step for me would to port
> > > > > the work to a system with an 8250 UART.
> > > > >
> > > >
> > > > A gentle reminder to seek feedback on this series.

It's been on my list for a while.  I started it Friday but ran out of
time.  This week hasn't been going as smoothly as I hoped but I'll
prioritize this since it's been too long.


> > > It's the middle of the merge window, and I can't do anything.
> > >
> > > Also, I almost never review RFC patches as I have have way too many
> > > patches that people think are "right" to review first...
> > >
> >
> > Okay, I understand and I can definitely wait for your feedback.
>
> My feedback here is this:
>
> > > I suggest you work to flesh this out first and submit something that you
> > > feels works properly.
>
> :)
>
> > IIUC, in order to make this approach substantial I need to make it
> > work with 8250 UART (major serial driver), correct? As currently it
> > works properly for amba-pl011 driver.
>
> Yes, try to do that, or better yet, make it work with all serial drivers
> automatically.

A bit of early feedback...

Although I'm not sure we can do Greg's "make it work everywhere
automatically", it's possible you could get half of your patch done
automatically.  Specifically, your patch really does two things:

a) It leaves the serial port "active" all the time to look for sysrq.
In other words even if there is no serial client it's always reading
the port looking for characters.  IMO this concept should be separated
out from the NMI concept and _could_ automatically work for all serial
drivers.  You'd just need something in the serial core that acted like
a default client if nobody else opened the serial port.  The nice
thing here is that we go through all the normal code paths and don't
need special cases in the driver.

b) It enables NMI for your particular serial driver.  This seems like
it'd be hard to do automatically because you can't do the same things
at NMI that you could do in a normal interrupt handler.

NOTE: to me, a) is more important than b) (though it'd be nice to have
both).  This would be especially true the earlier you could make a)
work since the main time when an "agetty" isn't running on my serial
port to read characters is during bootup.

Why is b) less important to me? Sure, it would let you drop into the
debugger in the case where the CPU handling serial port interrupts is
hung with IRQs disabled, but it _woudln't_ let you drop into the
debugger in the case where a different CPU is hung with IRQs disabled.
To get that we need NMI roundup (which, I know, you are also working
on for arm64).  ...and, if we've got NMI roundup, presumably we can
find our way into the debugger by either moving the serial interrupt
to a different CPU ahead of time or using some type of lockup detector
(which I know you are also working on for arm64).


One last bit of feedback is that I noticed that you didn't try to
implement the old "knock" functionality of the old NMI driver that's
being deleted.  That is: your new patches don't provide an alternate
way to drop into the debugger for systems where BREAK isn't hooked up.
That's not a hard requirement, but I was kinda hoping for it since I
have some systems that haven't routed BREAK properly.  ;-)


I'll try to get some more detailed feedback in the next few days.

-Doug
Sumit Garg Aug. 12, 2020, 5:48 a.m. UTC | #6
On Tue, 11 Aug 2020 at 20:28, Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
>
> On Tue, Aug 11, 2020 at 07:59:24PM +0530, Sumit Garg wrote:
> > Hi Greg,
> >
> > Thanks for your comments.
> >
> > On Tue, 11 Aug 2020 at 19:27, Greg Kroah-Hartman
> > <gregkh@linuxfoundation.org> wrote:
> > >
> > > On Tue, Aug 11, 2020 at 07:20:26PM +0530, Sumit Garg wrote:
> > > > On Tue, 21 Jul 2020 at 17:40, Sumit Garg <sumit.garg@linaro.org> wrote:
> > > > >
> > > > > Make it possible for UARTs to trigger magic sysrq from an NMI. With the
> > > > > advent of pseudo NMIs on arm64 it became quite generic to request serial
> > > > > device interrupt as an NMI rather than IRQ. And having NMI driven serial
> > > > > RX will allow us to trigger magic sysrq as an NMI and hence drop into
> > > > > kernel debugger in NMI context.
> > > > >
> > > > > The major use-case is to add NMI debugging capabilities to the kernel
> > > > > in order to debug scenarios such as:
> > > > > - Primary CPU is stuck in deadlock with interrupts disabled and hence
> > > > >   doesn't honor serial device interrupt. So having magic sysrq triggered
> > > > >   as an NMI is helpful for debugging.
> > > > > - Always enabled NMI based magic sysrq irrespective of whether the serial
> > > > >   TTY port is active or not.
> > > > >
> > > > > Currently there is an existing kgdb NMI serial driver which provides
> > > > > partial implementation in upstream to have a separate ttyNMI0 port but
> > > > > that remained in silos with the serial core/drivers which made it a bit
> > > > > odd to enable using serial device interrupt and hence remained unused. It
> > > > > seems to be clearly intended to avoid almost all custom NMI changes to
> > > > > the UART driver.
> > > > >
> > > > > But this patch-set allows the serial core/drivers to be NMI aware which
> > > > > in turn provides NMI debugging capabilities via magic sysrq and hence
> > > > > there is no specific reason to keep this special driver. So remove it
> > > > > instead.
> > > > >
> > > > > Approach:
> > > > > ---------
> > > > >
> > > > > The overall idea is to intercept serial RX characters in NMI context, if
> > > > > those are specific to magic sysrq then allow corresponding handler to run
> > > > > in NMI context. Otherwise, defer all other RX and TX operations onto IRQ
> > > > > work queue in order to run those in normal interrupt context.
> > > > >
> > > > > This approach is demonstrated using amba-pl011 driver.
> > > > >
> > > > > Patch-wise description:
> > > > > -----------------------
> > > > >
> > > > > Patch #1 prepares magic sysrq handler to be NMI aware.
> > > > > Patch #2 adds NMI framework to serial core.
> > > > > Patch #3 and #4 demonstrates NMI aware uart port using amba-pl011 driver.
> > > > > Patch #5 removes kgdb NMI serial driver.
> > > > >
> > > > > Goal of this RFC:
> > > > > -----------------
> > > > >
> > > > > My main reason for sharing this as an RFC is to help decide whether or
> > > > > not to continue with this approach. The next step for me would to port
> > > > > the work to a system with an 8250 UART.
> > > > >
> > > >
> > > > A gentle reminder to seek feedback on this series.
> > >
> > > It's the middle of the merge window, and I can't do anything.
> > >
> > > Also, I almost never review RFC patches as I have have way too many
> > > patches that people think are "right" to review first...
> > >
> >
> > Okay, I understand and I can definitely wait for your feedback.
>
> My feedback here is this:
>
> > > I suggest you work to flesh this out first and submit something that you
> > > feels works properly.
>
> :)
>
> > IIUC, in order to make this approach substantial I need to make it
> > work with 8250 UART (major serial driver), correct? As currently it
> > works properly for amba-pl011 driver.
>
> Yes, try to do that, or better yet, make it work with all serial drivers
> automatically.

I would like to make serial drivers work automatically but
unfortunately the interrupt request/ handling code is pretty specific
to the corresponding serial driver.

BTW, I will look for ways how we can make it much easier for serial
drivers to adapt.

-Sumit

>
> thanks,
>
> greg k-h
Sumit Garg Aug. 12, 2020, 2:52 p.m. UTC | #7
Hi Doug,

On Tue, 11 Aug 2020 at 22:46, Doug Anderson <dianders@chromium.org> wrote:
>
> Hi,
>
> On Tue, Aug 11, 2020 at 7:58 AM Greg Kroah-Hartman
> <gregkh@linuxfoundation.org> wrote:
> >
> > On Tue, Aug 11, 2020 at 07:59:24PM +0530, Sumit Garg wrote:
> > > Hi Greg,
> > >
> > > Thanks for your comments.
> > >
> > > On Tue, 11 Aug 2020 at 19:27, Greg Kroah-Hartman
> > > <gregkh@linuxfoundation.org> wrote:
> > > >
> > > > On Tue, Aug 11, 2020 at 07:20:26PM +0530, Sumit Garg wrote:
> > > > > On Tue, 21 Jul 2020 at 17:40, Sumit Garg <sumit.garg@linaro.org> wrote:
> > > > > >
> > > > > > Make it possible for UARTs to trigger magic sysrq from an NMI. With the
> > > > > > advent of pseudo NMIs on arm64 it became quite generic to request serial
> > > > > > device interrupt as an NMI rather than IRQ. And having NMI driven serial
> > > > > > RX will allow us to trigger magic sysrq as an NMI and hence drop into
> > > > > > kernel debugger in NMI context.
> > > > > >
> > > > > > The major use-case is to add NMI debugging capabilities to the kernel
> > > > > > in order to debug scenarios such as:
> > > > > > - Primary CPU is stuck in deadlock with interrupts disabled and hence
> > > > > >   doesn't honor serial device interrupt. So having magic sysrq triggered
> > > > > >   as an NMI is helpful for debugging.
> > > > > > - Always enabled NMI based magic sysrq irrespective of whether the serial
> > > > > >   TTY port is active or not.
> > > > > >
> > > > > > Currently there is an existing kgdb NMI serial driver which provides
> > > > > > partial implementation in upstream to have a separate ttyNMI0 port but
> > > > > > that remained in silos with the serial core/drivers which made it a bit
> > > > > > odd to enable using serial device interrupt and hence remained unused. It
> > > > > > seems to be clearly intended to avoid almost all custom NMI changes to
> > > > > > the UART driver.
> > > > > >
> > > > > > But this patch-set allows the serial core/drivers to be NMI aware which
> > > > > > in turn provides NMI debugging capabilities via magic sysrq and hence
> > > > > > there is no specific reason to keep this special driver. So remove it
> > > > > > instead.
> > > > > >
> > > > > > Approach:
> > > > > > ---------
> > > > > >
> > > > > > The overall idea is to intercept serial RX characters in NMI context, if
> > > > > > those are specific to magic sysrq then allow corresponding handler to run
> > > > > > in NMI context. Otherwise, defer all other RX and TX operations onto IRQ
> > > > > > work queue in order to run those in normal interrupt context.
> > > > > >
> > > > > > This approach is demonstrated using amba-pl011 driver.
> > > > > >
> > > > > > Patch-wise description:
> > > > > > -----------------------
> > > > > >
> > > > > > Patch #1 prepares magic sysrq handler to be NMI aware.
> > > > > > Patch #2 adds NMI framework to serial core.
> > > > > > Patch #3 and #4 demonstrates NMI aware uart port using amba-pl011 driver.
> > > > > > Patch #5 removes kgdb NMI serial driver.
> > > > > >
> > > > > > Goal of this RFC:
> > > > > > -----------------
> > > > > >
> > > > > > My main reason for sharing this as an RFC is to help decide whether or
> > > > > > not to continue with this approach. The next step for me would to port
> > > > > > the work to a system with an 8250 UART.
> > > > > >
> > > > >
> > > > > A gentle reminder to seek feedback on this series.
>
> It's been on my list for a while.  I started it Friday but ran out of
> time.  This week hasn't been going as smoothly as I hoped but I'll
> prioritize this since it's been too long.
>

No worries and thanks for your feedback.

>
> > > > It's the middle of the merge window, and I can't do anything.
> > > >
> > > > Also, I almost never review RFC patches as I have have way too many
> > > > patches that people think are "right" to review first...
> > > >
> > >
> > > Okay, I understand and I can definitely wait for your feedback.
> >
> > My feedback here is this:
> >
> > > > I suggest you work to flesh this out first and submit something that you
> > > > feels works properly.
> >
> > :)
> >
> > > IIUC, in order to make this approach substantial I need to make it
> > > work with 8250 UART (major serial driver), correct? As currently it
> > > works properly for amba-pl011 driver.
> >
> > Yes, try to do that, or better yet, make it work with all serial drivers
> > automatically.
>
> A bit of early feedback...
>
> Although I'm not sure we can do Greg's "make it work everywhere
> automatically", it's possible you could get half of your patch done
> automatically.  Specifically, your patch really does two things:
>
> a) It leaves the serial port "active" all the time to look for sysrq.
> In other words even if there is no serial client it's always reading
> the port looking for characters.  IMO this concept should be separated
> out from the NMI concept and _could_ automatically work for all serial
> drivers.  You'd just need something in the serial core that acted like
> a default client if nobody else opened the serial port.  The nice
> thing here is that we go through all the normal code paths and don't
> need special cases in the driver.

Okay, will try to explore this option to have default serial port
client. Would this client be active in normal serial operation or only
active when we have kgdb active? One drawback I see for normal
operation could be power management as if user is not using serial
port and would like to disable corresponding clock in order to reduce
power consumption.

>
> b) It enables NMI for your particular serial driver.  This seems like
> it'd be hard to do automatically because you can't do the same things
> at NMI that you could do in a normal interrupt handler.

Agree.

>
> NOTE: to me, a) is more important than b) (though it'd be nice to have
> both).  This would be especially true the earlier you could make a)
> work since the main time when an "agetty" isn't running on my serial
> port to read characters is during bootup.
>
> Why is b) less important to me? Sure, it would let you drop into the
> debugger in the case where the CPU handling serial port interrupts is
> hung with IRQs disabled, but it _woudln't_ let you drop into the
> debugger in the case where a different CPU is hung with IRQs disabled.
> To get that we need NMI roundup (which, I know, you are also working
> on for arm64).  ...and, if we've got NMI roundup, presumably we can
> find our way into the debugger by either moving the serial interrupt
> to a different CPU ahead of time or using some type of lockup detector
> (which I know you are also working on for arm64).
>

Thanks for sharing your preferences. I will try to get a) sorted out first.

Overall I agree with your approaches to debug hard-lockup scenarios
but they might not be so trivial for kernel engineers who doesn't
posses kernel debugging experience as you do. :)

And I still think NMI aware magic sysrq is useful for scenarios such as:
- Try to get system information during hard-lockup rather than just
panic via hard-lockup detection.
- Do normal start/stop debugger activity on a core which was stuck in
hard-lockup.
- Random boot freezes which are not easily reproducible.

>
> One last bit of feedback is that I noticed that you didn't try to
> implement the old "knock" functionality of the old NMI driver that's
> being deleted.  That is: your new patches don't provide an alternate
> way to drop into the debugger for systems where BREAK isn't hooked up.
> That's not a hard requirement, but I was kinda hoping for it since I
> have some systems that haven't routed BREAK properly.  ;-)
>

Yeah, this is on my TODO list to have a kgdb "knock" functionality to
be implemented via a common hook in serial core.

>
> I'll try to get some more detailed feedback in the next few days.

Thanks. I do look forward to your feedback.

-Sumit

>
> -Doug
Doug Anderson Aug. 12, 2020, 3:27 p.m. UTC | #8
Hi,

On Wed, Aug 12, 2020 at 7:53 AM Sumit Garg <sumit.garg@linaro.org> wrote:
>
> Hi Doug,
>
> On Tue, 11 Aug 2020 at 22:46, Doug Anderson <dianders@chromium.org> wrote:
> >
> > Hi,
> >
> > On Tue, Aug 11, 2020 at 7:58 AM Greg Kroah-Hartman
> > <gregkh@linuxfoundation.org> wrote:
> > >
> > > On Tue, Aug 11, 2020 at 07:59:24PM +0530, Sumit Garg wrote:
> > > > Hi Greg,
> > > >
> > > > Thanks for your comments.
> > > >
> > > > On Tue, 11 Aug 2020 at 19:27, Greg Kroah-Hartman
> > > > <gregkh@linuxfoundation.org> wrote:
> > > > >
> > > > > On Tue, Aug 11, 2020 at 07:20:26PM +0530, Sumit Garg wrote:
> > > > > > On Tue, 21 Jul 2020 at 17:40, Sumit Garg <sumit.garg@linaro.org> wrote:
> > > > > > >
> > > > > > > Make it possible for UARTs to trigger magic sysrq from an NMI. With the
> > > > > > > advent of pseudo NMIs on arm64 it became quite generic to request serial
> > > > > > > device interrupt as an NMI rather than IRQ. And having NMI driven serial
> > > > > > > RX will allow us to trigger magic sysrq as an NMI and hence drop into
> > > > > > > kernel debugger in NMI context.
> > > > > > >
> > > > > > > The major use-case is to add NMI debugging capabilities to the kernel
> > > > > > > in order to debug scenarios such as:
> > > > > > > - Primary CPU is stuck in deadlock with interrupts disabled and hence
> > > > > > >   doesn't honor serial device interrupt. So having magic sysrq triggered
> > > > > > >   as an NMI is helpful for debugging.
> > > > > > > - Always enabled NMI based magic sysrq irrespective of whether the serial
> > > > > > >   TTY port is active or not.
> > > > > > >
> > > > > > > Currently there is an existing kgdb NMI serial driver which provides
> > > > > > > partial implementation in upstream to have a separate ttyNMI0 port but
> > > > > > > that remained in silos with the serial core/drivers which made it a bit
> > > > > > > odd to enable using serial device interrupt and hence remained unused. It
> > > > > > > seems to be clearly intended to avoid almost all custom NMI changes to
> > > > > > > the UART driver.
> > > > > > >
> > > > > > > But this patch-set allows the serial core/drivers to be NMI aware which
> > > > > > > in turn provides NMI debugging capabilities via magic sysrq and hence
> > > > > > > there is no specific reason to keep this special driver. So remove it
> > > > > > > instead.
> > > > > > >
> > > > > > > Approach:
> > > > > > > ---------
> > > > > > >
> > > > > > > The overall idea is to intercept serial RX characters in NMI context, if
> > > > > > > those are specific to magic sysrq then allow corresponding handler to run
> > > > > > > in NMI context. Otherwise, defer all other RX and TX operations onto IRQ
> > > > > > > work queue in order to run those in normal interrupt context.
> > > > > > >
> > > > > > > This approach is demonstrated using amba-pl011 driver.
> > > > > > >
> > > > > > > Patch-wise description:
> > > > > > > -----------------------
> > > > > > >
> > > > > > > Patch #1 prepares magic sysrq handler to be NMI aware.
> > > > > > > Patch #2 adds NMI framework to serial core.
> > > > > > > Patch #3 and #4 demonstrates NMI aware uart port using amba-pl011 driver.
> > > > > > > Patch #5 removes kgdb NMI serial driver.
> > > > > > >
> > > > > > > Goal of this RFC:
> > > > > > > -----------------
> > > > > > >
> > > > > > > My main reason for sharing this as an RFC is to help decide whether or
> > > > > > > not to continue with this approach. The next step for me would to port
> > > > > > > the work to a system with an 8250 UART.
> > > > > > >
> > > > > >
> > > > > > A gentle reminder to seek feedback on this series.
> >
> > It's been on my list for a while.  I started it Friday but ran out of
> > time.  This week hasn't been going as smoothly as I hoped but I'll
> > prioritize this since it's been too long.
> >
>
> No worries and thanks for your feedback.
>
> >
> > > > > It's the middle of the merge window, and I can't do anything.
> > > > >
> > > > > Also, I almost never review RFC patches as I have have way too many
> > > > > patches that people think are "right" to review first...
> > > > >
> > > >
> > > > Okay, I understand and I can definitely wait for your feedback.
> > >
> > > My feedback here is this:
> > >
> > > > > I suggest you work to flesh this out first and submit something that you
> > > > > feels works properly.
> > >
> > > :)
> > >
> > > > IIUC, in order to make this approach substantial I need to make it
> > > > work with 8250 UART (major serial driver), correct? As currently it
> > > > works properly for amba-pl011 driver.
> > >
> > > Yes, try to do that, or better yet, make it work with all serial drivers
> > > automatically.
> >
> > A bit of early feedback...
> >
> > Although I'm not sure we can do Greg's "make it work everywhere
> > automatically", it's possible you could get half of your patch done
> > automatically.  Specifically, your patch really does two things:
> >
> > a) It leaves the serial port "active" all the time to look for sysrq.
> > In other words even if there is no serial client it's always reading
> > the port looking for characters.  IMO this concept should be separated
> > out from the NMI concept and _could_ automatically work for all serial
> > drivers.  You'd just need something in the serial core that acted like
> > a default client if nobody else opened the serial port.  The nice
> > thing here is that we go through all the normal code paths and don't
> > need special cases in the driver.
>
> Okay, will try to explore this option to have default serial port
> client. Would this client be active in normal serial operation or only
> active when we have kgdb active? One drawback I see for normal
> operation could be power management as if user is not using serial
> port and would like to disable corresponding clock in order to reduce
> power consumption.

If I could pick the ideal, I'd say we'd do it any time the console is
configured for that port and magic sysrq is enabled.  Presumably if
they're already choosing to output kernel log messages to the serial
port and they've enabled magic sysrq they're in a state where they'd
be OK with the extra power of also listening for characters?


> > b) It enables NMI for your particular serial driver.  This seems like
> > it'd be hard to do automatically because you can't do the same things
> > at NMI that you could do in a normal interrupt handler.
>
> Agree.
>
> >
> > NOTE: to me, a) is more important than b) (though it'd be nice to have
> > both).  This would be especially true the earlier you could make a)
> > work since the main time when an "agetty" isn't running on my serial
> > port to read characters is during bootup.
> >
> > Why is b) less important to me? Sure, it would let you drop into the
> > debugger in the case where the CPU handling serial port interrupts is
> > hung with IRQs disabled, but it _woudln't_ let you drop into the
> > debugger in the case where a different CPU is hung with IRQs disabled.
> > To get that we need NMI roundup (which, I know, you are also working
> > on for arm64).  ...and, if we've got NMI roundup, presumably we can
> > find our way into the debugger by either moving the serial interrupt
> > to a different CPU ahead of time or using some type of lockup detector
> > (which I know you are also working on for arm64).
> >
>
> Thanks for sharing your preferences. I will try to get a) sorted out first.
>
> Overall I agree with your approaches to debug hard-lockup scenarios
> but they might not be so trivial for kernel engineers who doesn't
> posses kernel debugging experience as you do. :)
>
> And I still think NMI aware magic sysrq is useful for scenarios such as:
> - Try to get system information during hard-lockup rather than just
> panic via hard-lockup detection.
> - Do normal start/stop debugger activity on a core which was stuck in
> hard-lockup.
> - Random boot freezes which are not easily reproducible.

Don't get me wrong.  Having sysrq from NMI seems like a good feature
to me.  That being said, it will require non-trivial changes to each
serial driver to support it and that means that not all serial drivers
will support it.  It also starts requiring knowledge of how NMIs work
(what's allowed in NMI mode / not allowed / how to avoid races) for
authors of serial drivers.  I have a bit of a worry that the benefit
won't outweigh the extra complexity, but I guess time will tell.  One
last worry is that I assume that most people testing (and even
automated testing labs) will either always enable NMI or won't enable
NMI.  That means that everyone will be only testing one codepath or
the other and (given the complexity) the non-tested codepath will
break.

Hrm.  Along the lines of the above, though: almost no modern systems
are uniprocessor.  That means that even if one CPU is stuck with IRQs
off it's fairly likely that some other CPU is OK.  Presumably you'd
get almost as much benefit as your patch but with more done
automatically if you could figure out how to detect that the serial
interrupt isn't being serviced and re-route it to a different CPU.
...or possibly you could use some variant of the hard lockup detector
and move all interrupts off a locked up CPU?  You could make this an
option that's "default Y" when kgdb is turned on or something?


> > One last bit of feedback is that I noticed that you didn't try to
> > implement the old "knock" functionality of the old NMI driver that's
> > being deleted.  That is: your new patches don't provide an alternate
> > way to drop into the debugger for systems where BREAK isn't hooked up.
> > That's not a hard requirement, but I was kinda hoping for it since I
> > have some systems that haven't routed BREAK properly.  ;-)
> >
>
> Yeah, this is on my TODO list to have a kgdb "knock" functionality to
> be implemented via a common hook in serial core.
>
> >
> > I'll try to get some more detailed feedback in the next few days.
>
> Thanks. I do look forward to your feedback.
>
> -Sumit
>
> >
> > -Doug
Doug Anderson Aug. 13, 2020, 12:08 a.m. UTC | #9
Hi,


On Wed, Aug 12, 2020 at 8:27 AM Doug Anderson <dianders@chromium.org> wrote:
>
> Hi,
>
> On Wed, Aug 12, 2020 at 7:53 AM Sumit Garg <sumit.garg@linaro.org> wrote:
> >
> > Hi Doug,
> >
> > On Tue, 11 Aug 2020 at 22:46, Doug Anderson <dianders@chromium.org> wrote:
> > >
> > > Hi,
> > >
> > > On Tue, Aug 11, 2020 at 7:58 AM Greg Kroah-Hartman
> > > <gregkh@linuxfoundation.org> wrote:
> > > >
> > > > On Tue, Aug 11, 2020 at 07:59:24PM +0530, Sumit Garg wrote:
> > > > > Hi Greg,
> > > > >
> > > > > Thanks for your comments.
> > > > >
> > > > > On Tue, 11 Aug 2020 at 19:27, Greg Kroah-Hartman
> > > > > <gregkh@linuxfoundation.org> wrote:
> > > > > >
> > > > > > On Tue, Aug 11, 2020 at 07:20:26PM +0530, Sumit Garg wrote:
> > > > > > > On Tue, 21 Jul 2020 at 17:40, Sumit Garg <sumit.garg@linaro.org> wrote:
> > > > > > > >
> > > > > > > > Make it possible for UARTs to trigger magic sysrq from an NMI. With the
> > > > > > > > advent of pseudo NMIs on arm64 it became quite generic to request serial
> > > > > > > > device interrupt as an NMI rather than IRQ. And having NMI driven serial
> > > > > > > > RX will allow us to trigger magic sysrq as an NMI and hence drop into
> > > > > > > > kernel debugger in NMI context.
> > > > > > > >
> > > > > > > > The major use-case is to add NMI debugging capabilities to the kernel
> > > > > > > > in order to debug scenarios such as:
> > > > > > > > - Primary CPU is stuck in deadlock with interrupts disabled and hence
> > > > > > > >   doesn't honor serial device interrupt. So having magic sysrq triggered
> > > > > > > >   as an NMI is helpful for debugging.
> > > > > > > > - Always enabled NMI based magic sysrq irrespective of whether the serial
> > > > > > > >   TTY port is active or not.
> > > > > > > >
> > > > > > > > Currently there is an existing kgdb NMI serial driver which provides
> > > > > > > > partial implementation in upstream to have a separate ttyNMI0 port but
> > > > > > > > that remained in silos with the serial core/drivers which made it a bit
> > > > > > > > odd to enable using serial device interrupt and hence remained unused. It
> > > > > > > > seems to be clearly intended to avoid almost all custom NMI changes to
> > > > > > > > the UART driver.
> > > > > > > >
> > > > > > > > But this patch-set allows the serial core/drivers to be NMI aware which
> > > > > > > > in turn provides NMI debugging capabilities via magic sysrq and hence
> > > > > > > > there is no specific reason to keep this special driver. So remove it
> > > > > > > > instead.
> > > > > > > >
> > > > > > > > Approach:
> > > > > > > > ---------
> > > > > > > >
> > > > > > > > The overall idea is to intercept serial RX characters in NMI context, if
> > > > > > > > those are specific to magic sysrq then allow corresponding handler to run
> > > > > > > > in NMI context. Otherwise, defer all other RX and TX operations onto IRQ
> > > > > > > > work queue in order to run those in normal interrupt context.
> > > > > > > >
> > > > > > > > This approach is demonstrated using amba-pl011 driver.
> > > > > > > >
> > > > > > > > Patch-wise description:
> > > > > > > > -----------------------
> > > > > > > >
> > > > > > > > Patch #1 prepares magic sysrq handler to be NMI aware.
> > > > > > > > Patch #2 adds NMI framework to serial core.
> > > > > > > > Patch #3 and #4 demonstrates NMI aware uart port using amba-pl011 driver.
> > > > > > > > Patch #5 removes kgdb NMI serial driver.
> > > > > > > >
> > > > > > > > Goal of this RFC:
> > > > > > > > -----------------
> > > > > > > >
> > > > > > > > My main reason for sharing this as an RFC is to help decide whether or
> > > > > > > > not to continue with this approach. The next step for me would to port
> > > > > > > > the work to a system with an 8250 UART.
> > > > > > > >
> > > > > > >
> > > > > > > A gentle reminder to seek feedback on this series.
> > >
> > > It's been on my list for a while.  I started it Friday but ran out of
> > > time.  This week hasn't been going as smoothly as I hoped but I'll
> > > prioritize this since it's been too long.
> > >
> >
> > No worries and thanks for your feedback.
> >
> > >
> > > > > > It's the middle of the merge window, and I can't do anything.
> > > > > >
> > > > > > Also, I almost never review RFC patches as I have have way too many
> > > > > > patches that people think are "right" to review first...
> > > > > >
> > > > >
> > > > > Okay, I understand and I can definitely wait for your feedback.
> > > >
> > > > My feedback here is this:
> > > >
> > > > > > I suggest you work to flesh this out first and submit something that you
> > > > > > feels works properly.
> > > >
> > > > :)
> > > >
> > > > > IIUC, in order to make this approach substantial I need to make it
> > > > > work with 8250 UART (major serial driver), correct? As currently it
> > > > > works properly for amba-pl011 driver.
> > > >
> > > > Yes, try to do that, or better yet, make it work with all serial drivers
> > > > automatically.
> > >
> > > A bit of early feedback...
> > >
> > > Although I'm not sure we can do Greg's "make it work everywhere
> > > automatically", it's possible you could get half of your patch done
> > > automatically.  Specifically, your patch really does two things:
> > >
> > > a) It leaves the serial port "active" all the time to look for sysrq.
> > > In other words even if there is no serial client it's always reading
> > > the port looking for characters.  IMO this concept should be separated
> > > out from the NMI concept and _could_ automatically work for all serial
> > > drivers.  You'd just need something in the serial core that acted like
> > > a default client if nobody else opened the serial port.  The nice
> > > thing here is that we go through all the normal code paths and don't
> > > need special cases in the driver.
> >
> > Okay, will try to explore this option to have default serial port
> > client. Would this client be active in normal serial operation or only
> > active when we have kgdb active? One drawback I see for normal
> > operation could be power management as if user is not using serial
> > port and would like to disable corresponding clock in order to reduce
> > power consumption.
>
> If I could pick the ideal, I'd say we'd do it any time the console is
> configured for that port and magic sysrq is enabled.  Presumably if
> they're already choosing to output kernel log messages to the serial
> port and they've enabled magic sysrq they're in a state where they'd
> be OK with the extra power of also listening for characters?
>
>
> > > b) It enables NMI for your particular serial driver.  This seems like
> > > it'd be hard to do automatically because you can't do the same things
> > > at NMI that you could do in a normal interrupt handler.
> >
> > Agree.
> >
> > >
> > > NOTE: to me, a) is more important than b) (though it'd be nice to have
> > > both).  This would be especially true the earlier you could make a)
> > > work since the main time when an "agetty" isn't running on my serial
> > > port to read characters is during bootup.
> > >
> > > Why is b) less important to me? Sure, it would let you drop into the
> > > debugger in the case where the CPU handling serial port interrupts is
> > > hung with IRQs disabled, but it _woudln't_ let you drop into the
> > > debugger in the case where a different CPU is hung with IRQs disabled.
> > > To get that we need NMI roundup (which, I know, you are also working
> > > on for arm64).  ...and, if we've got NMI roundup, presumably we can
> > > find our way into the debugger by either moving the serial interrupt
> > > to a different CPU ahead of time or using some type of lockup detector
> > > (which I know you are also working on for arm64).
> > >
> >
> > Thanks for sharing your preferences. I will try to get a) sorted out first.
> >
> > Overall I agree with your approaches to debug hard-lockup scenarios
> > but they might not be so trivial for kernel engineers who doesn't
> > posses kernel debugging experience as you do. :)
> >
> > And I still think NMI aware magic sysrq is useful for scenarios such as:
> > - Try to get system information during hard-lockup rather than just
> > panic via hard-lockup detection.
> > - Do normal start/stop debugger activity on a core which was stuck in
> > hard-lockup.
> > - Random boot freezes which are not easily reproducible.
>
> Don't get me wrong.  Having sysrq from NMI seems like a good feature
> to me.  That being said, it will require non-trivial changes to each
> serial driver to support it and that means that not all serial drivers
> will support it.  It also starts requiring knowledge of how NMIs work
> (what's allowed in NMI mode / not allowed / how to avoid races) for
> authors of serial drivers.  I have a bit of a worry that the benefit
> won't outweigh the extra complexity, but I guess time will tell.  One
> last worry is that I assume that most people testing (and even
> automated testing labs) will either always enable NMI or won't enable
> NMI.  That means that everyone will be only testing one codepath or
> the other and (given the complexity) the non-tested codepath will
> break.
>
> Hrm.  Along the lines of the above, though: almost no modern systems
> are uniprocessor.  That means that even if one CPU is stuck with IRQs
> off it's fairly likely that some other CPU is OK.  Presumably you'd
> get almost as much benefit as your patch but with more done
> automatically if you could figure out how to detect that the serial
> interrupt isn't being serviced and re-route it to a different CPU.
> ...or possibly you could use some variant of the hard lockup detector
> and move all interrupts off a locked up CPU?  You could make this an
> option that's "default Y" when kgdb is turned on or something?

One other idea occurred to me that's maybe simpler.  You could in
theory just poll the serial port periodically to accomplish.  It would
actually probably even work to call the normal serial port interrupt
routine from any random CPU.  On many serial drivers the entire
interrupt handler is wrapped with:

spin_lock_irqsave(&uap->port.lock, flags);
...
spin_unlock_irqrestore(&uap->port.lock, flags);

And a few (the ones I was involved in fixing) have the similar pattern
of using uart_unlock_and_check_sysrq().

Any serial drivers following this pattern could have their interrupt
routine called periodically just to poll for characters and it'd be
fine, right?  ...and having it take a second before a sysrq comes in
this case is probably not the end of the world?


One nice benefit of this is that it would actually work _better_ on
SMP systems for any sysrqs that aren't NMI safe.  Specifically with
your patch series those would be queued with irq_work_queue() which
means they'd be blocked if the CPU processing the NMI is stuck with
IRQs disabled.  With the polling mechanism they'd nicely just run on a
different CPU.


> > > One last bit of feedback is that I noticed that you didn't try to
> > > implement the old "knock" functionality of the old NMI driver that's
> > > being deleted.  That is: your new patches don't provide an alternate
> > > way to drop into the debugger for systems where BREAK isn't hooked up.
> > > That's not a hard requirement, but I was kinda hoping for it since I
> > > have some systems that haven't routed BREAK properly.  ;-)
> > >
> >
> > Yeah, this is on my TODO list to have a kgdb "knock" functionality to
> > be implemented via a common hook in serial core.
> >
> > >
> > > I'll try to get some more detailed feedback in the next few days.
> >
> > Thanks. I do look forward to your feedback.
> >
> > -Sumit
> >
> > >
> > > -Doug
Sumit Garg Aug. 13, 2020, 9:25 a.m. UTC | #10
On Thu, 13 Aug 2020 at 05:38, Doug Anderson <dianders@chromium.org> wrote:
>
> Hi,
>
>
> On Wed, Aug 12, 2020 at 8:27 AM Doug Anderson <dianders@chromium.org> wrote:
> >
> > Hi,
> >
> > On Wed, Aug 12, 2020 at 7:53 AM Sumit Garg <sumit.garg@linaro.org> wrote:
> > >
> > > Hi Doug,
> > >
> > > On Tue, 11 Aug 2020 at 22:46, Doug Anderson <dianders@chromium.org> wrote:
> > > >
> > > > Hi,
> > > >
> > > > On Tue, Aug 11, 2020 at 7:58 AM Greg Kroah-Hartman
> > > > <gregkh@linuxfoundation.org> wrote:
> > > > >
> > > > > On Tue, Aug 11, 2020 at 07:59:24PM +0530, Sumit Garg wrote:
> > > > > > Hi Greg,
> > > > > >
> > > > > > Thanks for your comments.
> > > > > >
> > > > > > On Tue, 11 Aug 2020 at 19:27, Greg Kroah-Hartman
> > > > > > <gregkh@linuxfoundation.org> wrote:
> > > > > > >
> > > > > > > On Tue, Aug 11, 2020 at 07:20:26PM +0530, Sumit Garg wrote:
> > > > > > > > On Tue, 21 Jul 2020 at 17:40, Sumit Garg <sumit.garg@linaro.org> wrote:
> > > > > > > > >
> > > > > > > > > Make it possible for UARTs to trigger magic sysrq from an NMI. With the
> > > > > > > > > advent of pseudo NMIs on arm64 it became quite generic to request serial
> > > > > > > > > device interrupt as an NMI rather than IRQ. And having NMI driven serial
> > > > > > > > > RX will allow us to trigger magic sysrq as an NMI and hence drop into
> > > > > > > > > kernel debugger in NMI context.
> > > > > > > > >
> > > > > > > > > The major use-case is to add NMI debugging capabilities to the kernel
> > > > > > > > > in order to debug scenarios such as:
> > > > > > > > > - Primary CPU is stuck in deadlock with interrupts disabled and hence
> > > > > > > > >   doesn't honor serial device interrupt. So having magic sysrq triggered
> > > > > > > > >   as an NMI is helpful for debugging.
> > > > > > > > > - Always enabled NMI based magic sysrq irrespective of whether the serial
> > > > > > > > >   TTY port is active or not.
> > > > > > > > >
> > > > > > > > > Currently there is an existing kgdb NMI serial driver which provides
> > > > > > > > > partial implementation in upstream to have a separate ttyNMI0 port but
> > > > > > > > > that remained in silos with the serial core/drivers which made it a bit
> > > > > > > > > odd to enable using serial device interrupt and hence remained unused. It
> > > > > > > > > seems to be clearly intended to avoid almost all custom NMI changes to
> > > > > > > > > the UART driver.
> > > > > > > > >
> > > > > > > > > But this patch-set allows the serial core/drivers to be NMI aware which
> > > > > > > > > in turn provides NMI debugging capabilities via magic sysrq and hence
> > > > > > > > > there is no specific reason to keep this special driver. So remove it
> > > > > > > > > instead.
> > > > > > > > >
> > > > > > > > > Approach:
> > > > > > > > > ---------
> > > > > > > > >
> > > > > > > > > The overall idea is to intercept serial RX characters in NMI context, if
> > > > > > > > > those are specific to magic sysrq then allow corresponding handler to run
> > > > > > > > > in NMI context. Otherwise, defer all other RX and TX operations onto IRQ
> > > > > > > > > work queue in order to run those in normal interrupt context.
> > > > > > > > >
> > > > > > > > > This approach is demonstrated using amba-pl011 driver.
> > > > > > > > >
> > > > > > > > > Patch-wise description:
> > > > > > > > > -----------------------
> > > > > > > > >
> > > > > > > > > Patch #1 prepares magic sysrq handler to be NMI aware.
> > > > > > > > > Patch #2 adds NMI framework to serial core.
> > > > > > > > > Patch #3 and #4 demonstrates NMI aware uart port using amba-pl011 driver.
> > > > > > > > > Patch #5 removes kgdb NMI serial driver.
> > > > > > > > >
> > > > > > > > > Goal of this RFC:
> > > > > > > > > -----------------
> > > > > > > > >
> > > > > > > > > My main reason for sharing this as an RFC is to help decide whether or
> > > > > > > > > not to continue with this approach. The next step for me would to port
> > > > > > > > > the work to a system with an 8250 UART.
> > > > > > > > >
> > > > > > > >
> > > > > > > > A gentle reminder to seek feedback on this series.
> > > >
> > > > It's been on my list for a while.  I started it Friday but ran out of
> > > > time.  This week hasn't been going as smoothly as I hoped but I'll
> > > > prioritize this since it's been too long.
> > > >
> > >
> > > No worries and thanks for your feedback.
> > >
> > > >
> > > > > > > It's the middle of the merge window, and I can't do anything.
> > > > > > >
> > > > > > > Also, I almost never review RFC patches as I have have way too many
> > > > > > > patches that people think are "right" to review first...
> > > > > > >
> > > > > >
> > > > > > Okay, I understand and I can definitely wait for your feedback.
> > > > >
> > > > > My feedback here is this:
> > > > >
> > > > > > > I suggest you work to flesh this out first and submit something that you
> > > > > > > feels works properly.
> > > > >
> > > > > :)
> > > > >
> > > > > > IIUC, in order to make this approach substantial I need to make it
> > > > > > work with 8250 UART (major serial driver), correct? As currently it
> > > > > > works properly for amba-pl011 driver.
> > > > >
> > > > > Yes, try to do that, or better yet, make it work with all serial drivers
> > > > > automatically.
> > > >
> > > > A bit of early feedback...
> > > >
> > > > Although I'm not sure we can do Greg's "make it work everywhere
> > > > automatically", it's possible you could get half of your patch done
> > > > automatically.  Specifically, your patch really does two things:
> > > >
> > > > a) It leaves the serial port "active" all the time to look for sysrq.
> > > > In other words even if there is no serial client it's always reading
> > > > the port looking for characters.  IMO this concept should be separated
> > > > out from the NMI concept and _could_ automatically work for all serial
> > > > drivers.  You'd just need something in the serial core that acted like
> > > > a default client if nobody else opened the serial port.  The nice
> > > > thing here is that we go through all the normal code paths and don't
> > > > need special cases in the driver.
> > >
> > > Okay, will try to explore this option to have default serial port
> > > client. Would this client be active in normal serial operation or only
> > > active when we have kgdb active? One drawback I see for normal
> > > operation could be power management as if user is not using serial
> > > port and would like to disable corresponding clock in order to reduce
> > > power consumption.
> >
> > If I could pick the ideal, I'd say we'd do it any time the console is
> > configured for that port and magic sysrq is enabled.  Presumably if
> > they're already choosing to output kernel log messages to the serial
> > port and they've enabled magic sysrq they're in a state where they'd
> > be OK with the extra power of also listening for characters?
> >

Okay, sounds reasonable to me.

> >
> > > > b) It enables NMI for your particular serial driver.  This seems like
> > > > it'd be hard to do automatically because you can't do the same things
> > > > at NMI that you could do in a normal interrupt handler.
> > >
> > > Agree.
> > >
> > > >
> > > > NOTE: to me, a) is more important than b) (though it'd be nice to have
> > > > both).  This would be especially true the earlier you could make a)
> > > > work since the main time when an "agetty" isn't running on my serial
> > > > port to read characters is during bootup.
> > > >
> > > > Why is b) less important to me? Sure, it would let you drop into the
> > > > debugger in the case where the CPU handling serial port interrupts is
> > > > hung with IRQs disabled, but it _woudln't_ let you drop into the
> > > > debugger in the case where a different CPU is hung with IRQs disabled.
> > > > To get that we need NMI roundup (which, I know, you are also working
> > > > on for arm64).  ...and, if we've got NMI roundup, presumably we can
> > > > find our way into the debugger by either moving the serial interrupt
> > > > to a different CPU ahead of time or using some type of lockup detector
> > > > (which I know you are also working on for arm64).
> > > >
> > >
> > > Thanks for sharing your preferences. I will try to get a) sorted out first.
> > >
> > > Overall I agree with your approaches to debug hard-lockup scenarios
> > > but they might not be so trivial for kernel engineers who doesn't
> > > posses kernel debugging experience as you do. :)
> > >
> > > And I still think NMI aware magic sysrq is useful for scenarios such as:
> > > - Try to get system information during hard-lockup rather than just
> > > panic via hard-lockup detection.
> > > - Do normal start/stop debugger activity on a core which was stuck in
> > > hard-lockup.
> > > - Random boot freezes which are not easily reproducible.
> >
> > Don't get me wrong.  Having sysrq from NMI seems like a good feature
> > to me.

Yeah I understand what you meant but I was just trying to highlight
additional benefits that sysrq from NMI provides.

> > That being said, it will require non-trivial changes to each
> > serial driver to support it and that means that not all serial drivers
> > will support it.  It also starts requiring knowledge of how NMIs work
> > (what's allowed in NMI mode / not allowed / how to avoid races) for
> > authors of serial drivers.  I have a bit of a worry that the benefit
> > won't outweigh the extra complexity, but I guess time will tell.

Yes I understand these concerns as well. That's why I have tried to
defer almost all of the work to the IRQ work queue apart from
essential parsing of RX chars in NMI mode. Also, I tried to keep most
of this code common in serial core.

> > One
> > last worry is that I assume that most people testing (and even
> > automated testing labs) will either always enable NMI or won't enable
> > NMI.  That means that everyone will be only testing one codepath or
> > the other and (given the complexity) the non-tested codepath will
> > break.
> >

The current patch-set only makes this NMI to work when debugger (kgdb)
is enabled which I think is mostly suitable for development
environments. So most people testing will involve existing IRQ mode
only.

However, it's very much possible to make NMI mode as default for a
particular serial driver if the underlying irqchip supports it but it
depends if we really see any production level usage of NMI debug
feature.

> > Hrm.  Along the lines of the above, though: almost no modern systems
> > are uniprocessor.  That means that even if one CPU is stuck with IRQs
> > off it's fairly likely that some other CPU is OK.  Presumably you'd
> > get almost as much benefit as your patch but with more done
> > automatically if you could figure out how to detect that the serial
> > interrupt isn't being serviced and re-route it to a different CPU.
> > ...or possibly you could use some variant of the hard lockup detector
> > and move all interrupts off a locked up CPU?  You could make this an
> > option that's "default Y" when kgdb is turned on or something?

Yes we can reroute serial interrupts but what I meant to say is that
we can at most get a backtrace of CPU stuck in hard-lockup whereas
with NMI debugger entry on hard-lockup CPU, we can do normal
start/stop/single-step debugging on that CPU.

>
> One other idea occurred to me that's maybe simpler.  You could in
> theory just poll the serial port periodically to accomplish.  It would
> actually probably even work to call the normal serial port interrupt
> routine from any random CPU.  On many serial drivers the entire
> interrupt handler is wrapped with:
>
> spin_lock_irqsave(&uap->port.lock, flags);
> ...
> spin_unlock_irqrestore(&uap->port.lock, flags);
>
> And a few (the ones I was involved in fixing) have the similar pattern
> of using uart_unlock_and_check_sysrq().
>
> Any serial drivers following this pattern could have their interrupt
> routine called periodically just to poll for characters and it'd be
> fine, right?  ...and having it take a second before a sysrq comes in
> this case is probably not the end of the world?
>

Are you proposing to have complete RX operation in polling mode with
RX interrupt disabled (eg. using a kernel thread)?

>
> One nice benefit of this is that it would actually work _better_ on
> SMP systems for any sysrqs that aren't NMI safe.  Specifically with
> your patch series those would be queued with irq_work_queue() which
> means they'd be blocked if the CPU processing the NMI is stuck with
> IRQs disabled.

Yes, the sysrq handlers which aren't NMI safe will behave similarly to
existing IRQ based sysrq handlers.

> With the polling mechanism they'd nicely just run on a
> different CPU.

It looks like polling would cause much CPU overhead. So I am not sure
if that is the preferred approach.

-Sumit

>
>
> > > > One last bit of feedback is that I noticed that you didn't try to
> > > > implement the old "knock" functionality of the old NMI driver that's
> > > > being deleted.  That is: your new patches don't provide an alternate
> > > > way to drop into the debugger for systems where BREAK isn't hooked up.
> > > > That's not a hard requirement, but I was kinda hoping for it since I
> > > > have some systems that haven't routed BREAK properly.  ;-)
> > > >
> > >
> > > Yeah, this is on my TODO list to have a kgdb "knock" functionality to
> > > be implemented via a common hook in serial core.
> > >
> > > >
> > > > I'll try to get some more detailed feedback in the next few days.
> > >
> > > Thanks. I do look forward to your feedback.
> > >
> > > -Sumit
> > >
> > > >
> > > > -Doug
Daniel Thompson Aug. 13, 2020, 10:17 a.m. UTC | #11
On Thu, Aug 13, 2020 at 02:55:12PM +0530, Sumit Garg wrote:
> On Thu, 13 Aug 2020 at 05:38, Doug Anderson <dianders@chromium.org> wrote:
> > On Wed, Aug 12, 2020 at 8:27 AM Doug Anderson <dianders@chromium.org> wrote:
> > > One
> > > last worry is that I assume that most people testing (and even
> > > automated testing labs) will either always enable NMI or won't enable
> > > NMI.  That means that everyone will be only testing one codepath or
> > > the other and (given the complexity) the non-tested codepath will
> > > break.
> > >
> 
> The current patch-set only makes this NMI to work when debugger (kgdb)
> is enabled which I think is mostly suitable for development
> environments. So most people testing will involve existing IRQ mode
> only.
> 
> However, it's very much possible to make NMI mode as default for a
> particular serial driver if the underlying irqchip supports it but it
> depends if we really see any production level usage of NMI debug
> feature.

The effect of this patch is not to make kgdb work from NMI it is to make
(some) SysRqs work from NMI. I think that only allowing it to deploy for
kgdb users is a mistake.

Having it deploy automatically for kgdb users might be OK but it seems
sensible to make this feature available for other users too.

Daniel.
Doug Anderson Aug. 13, 2020, 3:26 p.m. UTC | #12
Hi,

On Thu, Aug 13, 2020 at 2:25 AM Sumit Garg <sumit.garg@linaro.org> wrote:
>
> > One other idea occurred to me that's maybe simpler.  You could in
> > theory just poll the serial port periodically to accomplish.  It would
> > actually probably even work to call the normal serial port interrupt
> > routine from any random CPU.  On many serial drivers the entire
> > interrupt handler is wrapped with:
> >
> > spin_lock_irqsave(&uap->port.lock, flags);
> > ...
> > spin_unlock_irqrestore(&uap->port.lock, flags);
> >
> > And a few (the ones I was involved in fixing) have the similar pattern
> > of using uart_unlock_and_check_sysrq().
> >
> > Any serial drivers following this pattern could have their interrupt
> > routine called periodically just to poll for characters and it'd be
> > fine, right?  ...and having it take a second before a sysrq comes in
> > this case is probably not the end of the world?
> >
>
> Are you proposing to have complete RX operation in polling mode with
> RX interrupt disabled (eg. using a kernel thread)?

No, I'm suggesting a hybrid approach.  Leave the interrupts enabled as
usual, but _also_ poll every 500 ms or 1 second (maybe make it
configurable?).  In some serial drivers (ones that hold the lock for
the whole interrupt routine) this polling function could actually be
the same as the normal interrupt handler so it'd be trivially easy to
implement and maintain.

NOTE: This is not the same type of polling that kgdb does today.  The
existing polling is really only intended to work when we're dropped
into the debugger.  This would be more like a "poll_irq" type function
that would do all the normal work the interrupt did and is really just
there in the case that the CPU that the interrupt is routed to is
locked up.


> > One nice benefit of this is that it would actually work _better_ on
> > SMP systems for any sysrqs that aren't NMI safe.  Specifically with
> > your patch series those would be queued with irq_work_queue() which
> > means they'd be blocked if the CPU processing the NMI is stuck with
> > IRQs disabled.
>
> Yes, the sysrq handlers which aren't NMI safe will behave similarly to
> existing IRQ based sysrq handlers.
>
> > With the polling mechanism they'd nicely just run on a
> > different CPU.
>
> It looks like polling would cause much CPU overhead. So I am not sure
> if that is the preferred approach.

Maybe now it's clearer that there should be almost no overhead.  When
dealing with a SYSRQ it's fine if there's a bit of a delay before it's
processed, so polling every 1 second is probably fine.

-Doug
Sumit Garg Aug. 14, 2020, 12:06 p.m. UTC | #13
On Thu, 13 Aug 2020 at 15:47, Daniel Thompson
<daniel.thompson@linaro.org> wrote:
>
> On Thu, Aug 13, 2020 at 02:55:12PM +0530, Sumit Garg wrote:
> > On Thu, 13 Aug 2020 at 05:38, Doug Anderson <dianders@chromium.org> wrote:
> > > On Wed, Aug 12, 2020 at 8:27 AM Doug Anderson <dianders@chromium.org> wrote:
> > > > One
> > > > last worry is that I assume that most people testing (and even
> > > > automated testing labs) will either always enable NMI or won't enable
> > > > NMI.  That means that everyone will be only testing one codepath or
> > > > the other and (given the complexity) the non-tested codepath will
> > > > break.
> > > >
> >
> > The current patch-set only makes this NMI to work when debugger (kgdb)
> > is enabled which I think is mostly suitable for development
> > environments. So most people testing will involve existing IRQ mode
> > only.
> >
> > However, it's very much possible to make NMI mode as default for a
> > particular serial driver if the underlying irqchip supports it but it
> > depends if we really see any production level usage of NMI debug
> > feature.
>
> The effect of this patch is not to make kgdb work from NMI it is to make
> (some) SysRqs work from NMI. I think that only allowing it to deploy for
> kgdb users is a mistake.
>
> Having it deploy automatically for kgdb users might be OK but it seems
> sensible to make this feature available for other users too.

I think I wasn't clear enough in my prior reply. Actually I meant to
say that this patch-set enables NMI support for a particular serial
driver via ".poll_init()" interface and the only current user of that
interface is kgdb.

So if there are other users interested in this feature, they can use
".poll_init()" interface as well to enable it.

-Sumit

>
> Daniel.
Sumit Garg Aug. 14, 2020, 12:50 p.m. UTC | #14
On Thu, 13 Aug 2020 at 20:56, Doug Anderson <dianders@chromium.org> wrote:
>
> Hi,
>
> On Thu, Aug 13, 2020 at 2:25 AM Sumit Garg <sumit.garg@linaro.org> wrote:
> >
> > > One other idea occurred to me that's maybe simpler.  You could in
> > > theory just poll the serial port periodically to accomplish.  It would
> > > actually probably even work to call the normal serial port interrupt
> > > routine from any random CPU.  On many serial drivers the entire
> > > interrupt handler is wrapped with:
> > >
> > > spin_lock_irqsave(&uap->port.lock, flags);
> > > ...
> > > spin_unlock_irqrestore(&uap->port.lock, flags);
> > >
> > > And a few (the ones I was involved in fixing) have the similar pattern
> > > of using uart_unlock_and_check_sysrq().
> > >
> > > Any serial drivers following this pattern could have their interrupt
> > > routine called periodically just to poll for characters and it'd be
> > > fine, right?  ...and having it take a second before a sysrq comes in
> > > this case is probably not the end of the world?
> > >
> >
> > Are you proposing to have complete RX operation in polling mode with
> > RX interrupt disabled (eg. using a kernel thread)?
>
> No, I'm suggesting a hybrid approach.  Leave the interrupts enabled as
> usual, but _also_ poll every 500 ms or 1 second (maybe make it
> configurable?).  In some serial drivers (ones that hold the lock for
> the whole interrupt routine) this polling function could actually be
> the same as the normal interrupt handler so it'd be trivially easy to
> implement and maintain.
>
> NOTE: This is not the same type of polling that kgdb does today.  The
> existing polling is really only intended to work when we're dropped
> into the debugger.  This would be more like a "poll_irq" type function
> that would do all the normal work the interrupt did and is really just
> there in the case that the CPU that the interrupt is routed to is
> locked up.
>

Your idea sounds interesting. I think where we are reaching is to have
an ever active listener to serial port that can be scheduled to any
random active CPU. And to keep its CPU overhead negligible, it can
sleep and only wake-up and listen once every 500 ms or 1 second
(configurable).

I will try to think more about it and probably give it a try with a PoC.

-Sumit

>
> > > One nice benefit of this is that it would actually work _better_ on
> > > SMP systems for any sysrqs that aren't NMI safe.  Specifically with
> > > your patch series those would be queued with irq_work_queue() which
> > > means they'd be blocked if the CPU processing the NMI is stuck with
> > > IRQs disabled.
> >
> > Yes, the sysrq handlers which aren't NMI safe will behave similarly to
> > existing IRQ based sysrq handlers.
> >
> > > With the polling mechanism they'd nicely just run on a
> > > different CPU.
> >
> > It looks like polling would cause much CPU overhead. So I am not sure
> > if that is the preferred approach.
>
> Maybe now it's clearer that there should be almost no overhead.  When
> dealing with a SYSRQ it's fine if there's a bit of a delay before it's
> processed, so polling every 1 second is probably fine.
>
> -Doug
Daniel Thompson Aug. 14, 2020, 2:18 p.m. UTC | #15
On Fri, Aug 14, 2020 at 05:36:36PM +0530, Sumit Garg wrote:
> On Thu, 13 Aug 2020 at 15:47, Daniel Thompson
> <daniel.thompson@linaro.org> wrote:
> >
> > On Thu, Aug 13, 2020 at 02:55:12PM +0530, Sumit Garg wrote:
> > > On Thu, 13 Aug 2020 at 05:38, Doug Anderson <dianders@chromium.org> wrote:
> > > > On Wed, Aug 12, 2020 at 8:27 AM Doug Anderson <dianders@chromium.org> wrote:
> > > > > One
> > > > > last worry is that I assume that most people testing (and even
> > > > > automated testing labs) will either always enable NMI or won't enable
> > > > > NMI.  That means that everyone will be only testing one codepath or
> > > > > the other and (given the complexity) the non-tested codepath will
> > > > > break.
> > > > >
> > >
> > > The current patch-set only makes this NMI to work when debugger (kgdb)
> > > is enabled which I think is mostly suitable for development
> > > environments. So most people testing will involve existing IRQ mode
> > > only.
> > >
> > > However, it's very much possible to make NMI mode as default for a
> > > particular serial driver if the underlying irqchip supports it but it
> > > depends if we really see any production level usage of NMI debug
> > > feature.
> >
> > The effect of this patch is not to make kgdb work from NMI it is to make
> > (some) SysRqs work from NMI. I think that only allowing it to deploy for
> > kgdb users is a mistake.
> >
> > Having it deploy automatically for kgdb users might be OK but it seems
> > sensible to make this feature available for other users too.
> 
> I think I wasn't clear enough in my prior reply. Actually I meant to
> say that this patch-set enables NMI support for a particular serial
> driver via ".poll_init()" interface and the only current user of that
> interface is kgdb.
> 
> So if there are other users interested in this feature, they can use
> ".poll_init()" interface as well to enable it.

Huh?

We appear to speaking interchangably about users (people who sit in
front of the machine and want a stack trace) and sub-systems ;-).

I don't think other SysRq commands have quite such a direct relationship
between the sub-system and the sysrq command. For example who are you
expecting to call .poll_init() if a user wants to use the SysRq to
provoke a stack trace?


Daniel.
Sumit Garg Aug. 17, 2020, 5:12 a.m. UTC | #16
On Fri, 14 Aug 2020 at 19:48, Daniel Thompson
<daniel.thompson@linaro.org> wrote:
>
> On Fri, Aug 14, 2020 at 05:36:36PM +0530, Sumit Garg wrote:
> > On Thu, 13 Aug 2020 at 15:47, Daniel Thompson
> > <daniel.thompson@linaro.org> wrote:
> > >
> > > On Thu, Aug 13, 2020 at 02:55:12PM +0530, Sumit Garg wrote:
> > > > On Thu, 13 Aug 2020 at 05:38, Doug Anderson <dianders@chromium.org> wrote:
> > > > > On Wed, Aug 12, 2020 at 8:27 AM Doug Anderson <dianders@chromium.org> wrote:
> > > > > > One
> > > > > > last worry is that I assume that most people testing (and even
> > > > > > automated testing labs) will either always enable NMI or won't enable
> > > > > > NMI.  That means that everyone will be only testing one codepath or
> > > > > > the other and (given the complexity) the non-tested codepath will
> > > > > > break.
> > > > > >
> > > >
> > > > The current patch-set only makes this NMI to work when debugger (kgdb)
> > > > is enabled which I think is mostly suitable for development
> > > > environments. So most people testing will involve existing IRQ mode
> > > > only.
> > > >
> > > > However, it's very much possible to make NMI mode as default for a
> > > > particular serial driver if the underlying irqchip supports it but it
> > > > depends if we really see any production level usage of NMI debug
> > > > feature.
> > >
> > > The effect of this patch is not to make kgdb work from NMI it is to make
> > > (some) SysRqs work from NMI. I think that only allowing it to deploy for
> > > kgdb users is a mistake.
> > >
> > > Having it deploy automatically for kgdb users might be OK but it seems
> > > sensible to make this feature available for other users too.
> >
> > I think I wasn't clear enough in my prior reply. Actually I meant to
> > say that this patch-set enables NMI support for a particular serial
> > driver via ".poll_init()" interface and the only current user of that
> > interface is kgdb.
> >
> > So if there are other users interested in this feature, they can use
> > ".poll_init()" interface as well to enable it.
>
> Huh?
>
> We appear to speaking interchangably about users (people who sit in
> front of the machine and want a stack trace) and sub-systems ;-).
>
> I don't think other SysRq commands have quite such a direct relationship
> between the sub-system and the sysrq command. For example who are you
> expecting to call .poll_init() if a user wants to use the SysRq to
> provoke a stack trace?
>

Ah, I see. So you meant to provide a user-space interface to
dynamically enable/disable NMI debug, correct? It will require IRQ <->
NMI switching at runtime which should be doable safely.

-Sumit

>
> Daniel.
Daniel Thompson Aug. 17, 2020, 9:28 a.m. UTC | #17
On Mon, Aug 17, 2020 at 10:42:43AM +0530, Sumit Garg wrote:
> On Fri, 14 Aug 2020 at 19:48, Daniel Thompson
> <daniel.thompson@linaro.org> wrote:
> >
> > On Fri, Aug 14, 2020 at 05:36:36PM +0530, Sumit Garg wrote:
> > > On Thu, 13 Aug 2020 at 15:47, Daniel Thompson
> > > <daniel.thompson@linaro.org> wrote:
> > > >
> > > > On Thu, Aug 13, 2020 at 02:55:12PM +0530, Sumit Garg wrote:
> > > > > On Thu, 13 Aug 2020 at 05:38, Doug Anderson <dianders@chromium.org> wrote:
> > > > > > On Wed, Aug 12, 2020 at 8:27 AM Doug Anderson <dianders@chromium.org> wrote:
> > > > > > > One
> > > > > > > last worry is that I assume that most people testing (and even
> > > > > > > automated testing labs) will either always enable NMI or won't enable
> > > > > > > NMI.  That means that everyone will be only testing one codepath or
> > > > > > > the other and (given the complexity) the non-tested codepath will
> > > > > > > break.
> > > > > > >
> > > > >
> > > > > The current patch-set only makes this NMI to work when debugger (kgdb)
> > > > > is enabled which I think is mostly suitable for development
> > > > > environments. So most people testing will involve existing IRQ mode
> > > > > only.
> > > > >
> > > > > However, it's very much possible to make NMI mode as default for a
> > > > > particular serial driver if the underlying irqchip supports it but it
> > > > > depends if we really see any production level usage of NMI debug
> > > > > feature.
> > > >
> > > > The effect of this patch is not to make kgdb work from NMI it is to make
> > > > (some) SysRqs work from NMI. I think that only allowing it to deploy for
> > > > kgdb users is a mistake.
> > > >
> > > > Having it deploy automatically for kgdb users might be OK but it seems
> > > > sensible to make this feature available for other users too.
> > >
> > > I think I wasn't clear enough in my prior reply. Actually I meant to
> > > say that this patch-set enables NMI support for a particular serial
> > > driver via ".poll_init()" interface and the only current user of that
> > > interface is kgdb.
> > >
> > > So if there are other users interested in this feature, they can use
> > > ".poll_init()" interface as well to enable it.
> >
> > Huh?
> >
> > We appear to speaking interchangably about users (people who sit in
> > front of the machine and want a stack trace) and sub-systems ;-).
> >
> > I don't think other SysRq commands have quite such a direct relationship
> > between the sub-system and the sysrq command. For example who are you
> > expecting to call .poll_init() if a user wants to use the SysRq to
> > provoke a stack trace?
> >
> 
> Ah, I see. So you meant to provide a user-space interface to
> dynamically enable/disable NMI debug, correct? It will require IRQ <->
> NMI switching at runtime which should be doable safely.

I haven't given much thought to the exact mechanism, though I would
perhaps have started by thinking about a module parameter).

From an RFC point of view, I simple think this feature is potentially
useful on systems without kgdb (which, let's be honest, are firmly in
the majority) so making .poll_init() the only way to activate it is a
mistake.


Daniel.
Sumit Garg Aug. 17, 2020, 2:12 p.m. UTC | #18
On Mon, 17 Aug 2020 at 14:58, Daniel Thompson
<daniel.thompson@linaro.org> wrote:
>
> On Mon, Aug 17, 2020 at 10:42:43AM +0530, Sumit Garg wrote:
> > On Fri, 14 Aug 2020 at 19:48, Daniel Thompson
> > <daniel.thompson@linaro.org> wrote:
> > >
> > > On Fri, Aug 14, 2020 at 05:36:36PM +0530, Sumit Garg wrote:
> > > > On Thu, 13 Aug 2020 at 15:47, Daniel Thompson
> > > > <daniel.thompson@linaro.org> wrote:
> > > > >
> > > > > On Thu, Aug 13, 2020 at 02:55:12PM +0530, Sumit Garg wrote:
> > > > > > On Thu, 13 Aug 2020 at 05:38, Doug Anderson <dianders@chromium.org> wrote:
> > > > > > > On Wed, Aug 12, 2020 at 8:27 AM Doug Anderson <dianders@chromium.org> wrote:
> > > > > > > > One
> > > > > > > > last worry is that I assume that most people testing (and even
> > > > > > > > automated testing labs) will either always enable NMI or won't enable
> > > > > > > > NMI.  That means that everyone will be only testing one codepath or
> > > > > > > > the other and (given the complexity) the non-tested codepath will
> > > > > > > > break.
> > > > > > > >
> > > > > >
> > > > > > The current patch-set only makes this NMI to work when debugger (kgdb)
> > > > > > is enabled which I think is mostly suitable for development
> > > > > > environments. So most people testing will involve existing IRQ mode
> > > > > > only.
> > > > > >
> > > > > > However, it's very much possible to make NMI mode as default for a
> > > > > > particular serial driver if the underlying irqchip supports it but it
> > > > > > depends if we really see any production level usage of NMI debug
> > > > > > feature.
> > > > >
> > > > > The effect of this patch is not to make kgdb work from NMI it is to make
> > > > > (some) SysRqs work from NMI. I think that only allowing it to deploy for
> > > > > kgdb users is a mistake.
> > > > >
> > > > > Having it deploy automatically for kgdb users might be OK but it seems
> > > > > sensible to make this feature available for other users too.
> > > >
> > > > I think I wasn't clear enough in my prior reply. Actually I meant to
> > > > say that this patch-set enables NMI support for a particular serial
> > > > driver via ".poll_init()" interface and the only current user of that
> > > > interface is kgdb.
> > > >
> > > > So if there are other users interested in this feature, they can use
> > > > ".poll_init()" interface as well to enable it.
> > >
> > > Huh?
> > >
> > > We appear to speaking interchangably about users (people who sit in
> > > front of the machine and want a stack trace) and sub-systems ;-).
> > >
> > > I don't think other SysRq commands have quite such a direct relationship
> > > between the sub-system and the sysrq command. For example who are you
> > > expecting to call .poll_init() if a user wants to use the SysRq to
> > > provoke a stack trace?
> > >
> >
> > Ah, I see. So you meant to provide a user-space interface to
> > dynamically enable/disable NMI debug, correct? It will require IRQ <->
> > NMI switching at runtime which should be doable safely.
>
> I haven't given much thought to the exact mechanism, though I would
> perhaps have started by thinking about a module parameter).
>
> From an RFC point of view, I simple think this feature is potentially
> useful on systems without kgdb (which, let's be honest, are firmly in
> the majority) so making .poll_init() the only way to activate it is a
> mistake.
>

Makes sense, will add a module parameter to enable this feature during
boot as well.

-Sumit

>
> Daniel.