diff mbox series

[1/3] serial: qcom-geni: fix hard lockup on buffer flush

Message ID 20240624133135.7445-2-johan+linaro@kernel.org (mailing list archive)
State Superseded
Headers show
Series serial: qcom-geni: fix lockups | expand

Commit Message

Johan Hovold June 24, 2024, 1:31 p.m. UTC
The Qualcomm GENI serial driver does not handle buffer flushing and used
to print garbage characters when the circular buffer was cleared. Since
commit 1788cf6a91d9 ("tty: serial: switch from circ_buf to kfifo") this
instead results in a lockup due to qcom_geni_serial_send_chunk_fifo()
spinning indefinitely in the interrupt handler.

This is easily triggered by interrupting a command such as dmesg in a
serial console but can also happen when stopping a serial getty on
reboot.

Fix the immediate issue by printing NUL characters until the current TX
command has been completed.

Fixes: 1788cf6a91d9 ("tty: serial: switch from circ_buf to kfifo")
Reported-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
---
 drivers/tty/serial/qcom_geni_serial.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Doug Anderson June 24, 2024, 5:39 p.m. UTC | #1
Hi,

On Mon, Jun 24, 2024 at 6:31 AM Johan Hovold <johan+linaro@kernel.org> wrote:
>
> The Qualcomm GENI serial driver does not handle buffer flushing and used
> to print garbage characters when the circular buffer was cleared. Since
> commit 1788cf6a91d9 ("tty: serial: switch from circ_buf to kfifo") this
> instead results in a lockup due to qcom_geni_serial_send_chunk_fifo()
> spinning indefinitely in the interrupt handler.
>
> This is easily triggered by interrupting a command such as dmesg in a
> serial console but can also happen when stopping a serial getty on
> reboot.
>
> Fix the immediate issue by printing NUL characters until the current TX
> command has been completed.
>
> Fixes: 1788cf6a91d9 ("tty: serial: switch from circ_buf to kfifo")
> Reported-by: Douglas Anderson <dianders@chromium.org>
> Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
> ---
>  drivers/tty/serial/qcom_geni_serial.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

I don't love this, though it's better than a hard lockup. I will note
that it doesn't exactly restore the old behavior which would have
(most likely) continued to output data that had previously been in the
FIFO but that had been cancelled.

...actually, if we're looking for a short term fix that mimics the old
behavior more closely, what would you think about having a
driver-local buffer that we fill when we kick off the transfer. Then
the data can't go away from underneath us. It's an extra copy, but
it's just a memory-to-memory copy which is much faster than the MMIO
copy we'll eventually need to do anyway... This local buffer would
essentially act as a larger FIFO.

You could choose the local buffer size to balance being able to cancel
quickly vs. using the FIFO efficiently.

-Doug
Doug Anderson June 24, 2024, 8:45 p.m. UTC | #2
Hi,

On Mon, Jun 24, 2024 at 10:39 AM Doug Anderson <dianders@chromium.org> wrote:
>
> Hi,
>
> On Mon, Jun 24, 2024 at 6:31 AM Johan Hovold <johan+linaro@kernel.org> wrote:
> >
> > The Qualcomm GENI serial driver does not handle buffer flushing and used
> > to print garbage characters when the circular buffer was cleared. Since
> > commit 1788cf6a91d9 ("tty: serial: switch from circ_buf to kfifo") this
> > instead results in a lockup due to qcom_geni_serial_send_chunk_fifo()
> > spinning indefinitely in the interrupt handler.
> >
> > This is easily triggered by interrupting a command such as dmesg in a
> > serial console but can also happen when stopping a serial getty on
> > reboot.
> >
> > Fix the immediate issue by printing NUL characters until the current TX
> > command has been completed.
> >
> > Fixes: 1788cf6a91d9 ("tty: serial: switch from circ_buf to kfifo")
> > Reported-by: Douglas Anderson <dianders@chromium.org>
> > Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
> > ---
> >  drivers/tty/serial/qcom_geni_serial.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
>
> I don't love this, though it's better than a hard lockup. I will note
> that it doesn't exactly restore the old behavior which would have
> (most likely) continued to output data that had previously been in the
> FIFO but that had been cancelled.
>
> ...actually, if we're looking for a short term fix that mimics the old
> behavior more closely, what would you think about having a
> driver-local buffer that we fill when we kick off the transfer. Then
> the data can't go away from underneath us. It's an extra copy, but
> it's just a memory-to-memory copy which is much faster than the MMIO
> copy we'll eventually need to do anyway... This local buffer would
> essentially act as a larger FIFO.
>
> You could choose the local buffer size to balance being able to cancel
> quickly vs. using the FIFO efficiently.

Also: if we're looking at quick/easy to land and just fix the hard
lockup, I'd vote for this (I can send a real patch, though I'm about
to go on vacation):

--

@@ -904,8 +904,8 @@ static void qcom_geni_serial_handle_tx_fifo(struct
uart_port *uport,
                goto out_write_wakeup;

        if (!port->tx_remaining) {
-               qcom_geni_serial_setup_tx(uport, pending);
-               port->tx_remaining = pending;
+               port->tx_remaining = min(avail, pending);
+               qcom_geni_serial_setup_tx(uport, port->tx_remaining);

                irq_en = readl(uport->membase + SE_GENI_M_IRQ_EN);
                if (!(irq_en & M_TX_FIFO_WATERMARK_EN))

--

That will fix the hard lockup, is short and sweet, and also doesn't
end up outputting NUL bytes.

I measured time with that. I've been testing with a file I created
called "alphabet.txt" that just contains the letters A-Z repeated 3
times followed by a "\n", over and over again. I think gmail will kill
me with word wrapping, but basically:

ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZ
ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZ
ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZ
ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZ
...
...

FWIW:

head -200 /var/alphabet.txt  | wc
    200     200   15800

Before my patch I ran `time head -200 /var/alphabet.txt` and I got:

real    0m1.386s

After my patch I ran the same thing and got:

real    0m1.409s

So it's slower, but that's not 25% slower. I get 1.7% slower:

In [6]: (1.409 - 1.386) / 1.386 * 100
Out[6]: 1.659451659451669

IMO that seems like a fine slowdown in order to avoid printing NUL bytes.
Johan Hovold June 25, 2024, 2:40 p.m. UTC | #3
On Mon, Jun 24, 2024 at 10:39:07AM -0700, Doug Anderson wrote:
> On Mon, Jun 24, 2024 at 6:31 AM Johan Hovold <johan+linaro@kernel.org> wrote:
> >
> > The Qualcomm GENI serial driver does not handle buffer flushing and used
> > to print garbage characters when the circular buffer was cleared. Since
> > commit 1788cf6a91d9 ("tty: serial: switch from circ_buf to kfifo") this
> > instead results in a lockup due to qcom_geni_serial_send_chunk_fifo()
> > spinning indefinitely in the interrupt handler.
> >
> > This is easily triggered by interrupting a command such as dmesg in a
> > serial console but can also happen when stopping a serial getty on
> > reboot.
> >
> > Fix the immediate issue by printing NUL characters until the current TX
> > command has been completed.
> >
> > Fixes: 1788cf6a91d9 ("tty: serial: switch from circ_buf to kfifo")
> > Reported-by: Douglas Anderson <dianders@chromium.org>
> > Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
> > ---
> >  drivers/tty/serial/qcom_geni_serial.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> I don't love this, though it's better than a hard lockup. I will note
> that it doesn't exactly restore the old behavior which would have
> (most likely) continued to output data that had previously been in the
> FIFO but that had been cancelled.

Ah, yes, you're right. I went back and compared with 6.9 and the effect
was indeed (often) that the machine felt sluggish when you hit ctrl-c to
interrupt something like dmesg and the driver would continue to print up
to 4k characters after that (e.g. 350 ms at 115200).

The idea here was to fix the lockup regression separately and then have
the third patch address the buffer flush failure, which could also be
backported without depending on the kfifo conversion.

But running with this series since yesterday, I realise there are still
some unresolved interaction with the console code, which can now trigger
a soft (instead of hard) lockup on reboot...

> ...actually, if we're looking for a short term fix that mimics the old
> behavior more closely, what would you think about having a
> driver-local buffer that we fill when we kick off the transfer. Then
> the data can't go away from underneath us. It's an extra copy, but
> it's just a memory-to-memory copy which is much faster than the MMIO
> copy we'll eventually need to do anyway... This local buffer would
> essentially act as a larger FIFO.

The idea did cross my mind, three levels of fifo...

> You could choose the local buffer size to balance being able to cancel
> quickly vs. using the FIFO efficiently.

Yeah, perhaps adding a smaller driver kfifo would work, but not sure how
clean it would be to implement.

Johan
Johan Hovold June 25, 2024, 2:53 p.m. UTC | #4
On Mon, Jun 24, 2024 at 01:45:17PM -0700, Doug Anderson wrote:

> Also: if we're looking at quick/easy to land and just fix the hard
> lockup, I'd vote for this (I can send a real patch, though I'm about
> to go on vacation):
> 
> --
> 
> @@ -904,8 +904,8 @@ static void qcom_geni_serial_handle_tx_fifo(struct
> uart_port *uport,
>                 goto out_write_wakeup;
> 
>         if (!port->tx_remaining) {
> -               qcom_geni_serial_setup_tx(uport, pending);
> -               port->tx_remaining = pending;
> +               port->tx_remaining = min(avail, pending);
> +               qcom_geni_serial_setup_tx(uport, port->tx_remaining);
> 
>                 irq_en = readl(uport->membase + SE_GENI_M_IRQ_EN);
>                 if (!(irq_en & M_TX_FIFO_WATERMARK_EN))
> 
> --
> 
> That will fix the hard lockup, is short and sweet, and also doesn't
> end up outputting NUL bytes.

Yeah, this might be a good stop gap even if performance suffers.

> I measured time with that. I've been testing with a file I created
> called "alphabet.txt" that just contains the letters A-Z repeated 3
> times followed by a "\n", over and over again. I think gmail will kill
> me with word wrapping, but basically:

> head -200 /var/alphabet.txt  | wc
>     200     200   15800
> 
> Before my patch I ran `time head -200 /var/alphabet.txt` and I got:
> 
> real    0m1.386s
> 
> After my patch I ran the same thing and got:
> 
> real    0m1.409s
> 
> So it's slower, but that's not 25% slower. I get 1.7% slower:
> 
> In [6]: (1.409 - 1.386) / 1.386 * 100
> Out[6]: 1.659451659451669
> 
> IMO that seems like a fine slowdown in order to avoid printing NUL bytes.

With my 500K dmesg file test I see a similar performance drop as with
your full series even if seems to behave slightly better (e.g. 20% drop
instead of 24%). 

Johan
Doug Anderson June 25, 2024, 4:27 p.m. UTC | #5
Hi,

On Tue, Jun 25, 2024 at 7:53 AM Johan Hovold <johan@kernel.org> wrote:
>
> On Mon, Jun 24, 2024 at 01:45:17PM -0700, Doug Anderson wrote:
>
> > Also: if we're looking at quick/easy to land and just fix the hard
> > lockup, I'd vote for this (I can send a real patch, though I'm about
> > to go on vacation):
> >
> > --
> >
> > @@ -904,8 +904,8 @@ static void qcom_geni_serial_handle_tx_fifo(struct
> > uart_port *uport,
> >                 goto out_write_wakeup;
> >
> >         if (!port->tx_remaining) {
> > -               qcom_geni_serial_setup_tx(uport, pending);
> > -               port->tx_remaining = pending;
> > +               port->tx_remaining = min(avail, pending);
> > +               qcom_geni_serial_setup_tx(uport, port->tx_remaining);
> >
> >                 irq_en = readl(uport->membase + SE_GENI_M_IRQ_EN);
> >                 if (!(irq_en & M_TX_FIFO_WATERMARK_EN))
> >
> > --
> >
> > That will fix the hard lockup, is short and sweet, and also doesn't
> > end up outputting NUL bytes.
>
> Yeah, this might be a good stop gap even if performance suffers.

I've officially posted this as:

https://lore.kernel.org/r/20240625092440.1.Icf914852be911b95aefa9d798b6f1cd1a180f985@changeid

I realized that I didn't need to re-calculate "chunk" so the patch is
very slightly different than I posted above but should be effectively
the same.

-Doug
Johan Hovold July 4, 2024, 9:59 a.m. UTC | #6
On Tue, Jun 25, 2024 at 04:40:36PM +0200, Johan Hovold wrote:
> On Mon, Jun 24, 2024 at 10:39:07AM -0700, Doug Anderson wrote:
> > On Mon, Jun 24, 2024 at 6:31 AM Johan Hovold <johan+linaro@kernel.org> wrote:
> > >
> > > The Qualcomm GENI serial driver does not handle buffer flushing and used
> > > to print garbage characters when the circular buffer was cleared. Since
> > > commit 1788cf6a91d9 ("tty: serial: switch from circ_buf to kfifo") this
> > > instead results in a lockup due to qcom_geni_serial_send_chunk_fifo()
> > > spinning indefinitely in the interrupt handler.
> > >
> > > This is easily triggered by interrupting a command such as dmesg in a
> > > serial console but can also happen when stopping a serial getty on
> > > reboot.
> > >
> > > Fix the immediate issue by printing NUL characters until the current TX
> > > command has been completed.

> > I don't love this, though it's better than a hard lockup. I will note
> > that it doesn't exactly restore the old behavior which would have
> > (most likely) continued to output data that had previously been in the
> > FIFO but that had been cancelled.
> 
> Ah, yes, you're right. I went back and compared with 6.9 and the effect
> was indeed (often) that the machine felt sluggish when you hit ctrl-c to
> interrupt something like dmesg and the driver would continue to print up
> to 4k characters after that (e.g. 350 ms at 115200).
> 
> The idea here was to fix the lockup regression separately and then have
> the third patch address the buffer flush failure, which could also be
> backported without depending on the kfifo conversion.
> 
> But running with this series since yesterday, I realise there are still
> some unresolved interaction with the console code, which can now trigger
> a soft (instead of hard) lockup on reboot...

I've reworked my series to avoid the remaining lockup, which was due to
v1 not handling some cases where cancelling a command left stale data in
the fifo.

I've also reordered the patches to avoid printing NUL characters as an
intermediate fix.

Johan
diff mbox series

Patch

diff --git a/drivers/tty/serial/qcom_geni_serial.c b/drivers/tty/serial/qcom_geni_serial.c
index 2bd25afe0d92..1d5d6045879a 100644
--- a/drivers/tty/serial/qcom_geni_serial.c
+++ b/drivers/tty/serial/qcom_geni_serial.c
@@ -862,7 +862,7 @@  static void qcom_geni_serial_send_chunk_fifo(struct uart_port *uport,
 		memset(buf, 0, sizeof(buf));
 		tx_bytes = min(remaining, BYTES_PER_FIFO_WORD);
 
-		tx_bytes = uart_fifo_out(uport, buf, tx_bytes);
+		uart_fifo_out(uport, buf, tx_bytes);
 
 		iowrite32_rep(uport->membase + SE_GENI_TX_FIFOn, buf, 1);