[v4,0/8] serial: qcom-geni: Overhaul TX handling to fix crashes/hangs

Message ID: 20240610222515.3023730-1-dianders@chromium.org

Message

Doug Anderson June 10, 2024, 10:24 p.m. UTC
While trying to reproduce -EBUSY errors that our lab was getting in
suspend/resume testing, I ended up finding a whole pile of problems
with the Qualcomm GENI serial driver. I've posted a fix for the -EBUSY
issue separately [1]. This series is fixing all of the Qualcomm GENI
problems that I found.

As far as I can tell most of the problems have been in the Qualcomm
GENI serial driver since inception, but it can be noted that the
behavior got worse with the new kfifo changes. Previously when the OS
took data out of the circular queue we'd just spit stale data onto the
serial port. Now we'll hard lockup. :-P

I've tried to break this series up as much as possible to make it
easier to understand but the final patch is still a lot of change at
once. Hopefully it's OK.

[1] https://lore.kernel.org/r/20240530084841.v2.1.I2395e66cf70c6e67d774c56943825c289b9c13e4@changeid

Changes in v4:
- Add GP_LENGTH field definition.
- Fix indentation.
- GENMASK(31, 0) -> GP_LENGTH.
- Use uart_fifo_timeout_ms() for timeout.
- tty: serial: Add uart_fifo_timeout_ms()

Changes in v3:
- 0xffffffff => GENMASK(31, 0)
- Reword commit message.
- Use uart_fifo_timeout() for timeout.

Changes in v2:
- Totally rework / rename patch to handle suspend while active xfer
- serial: qcom-geni: Fix arg types for qcom_geni_serial_poll_bit()
- serial: qcom-geni: Fix the timeout in qcom_geni_serial_poll_bit()
- serial: qcom-geni: Introduce qcom_geni_serial_poll_bitfield()
- serial: qcom-geni: Just set the watermark level once
- serial: qcom-geni: Rework TX in FIFO mode to fix hangs/lockups
- soc: qcom: geni-se: Add GP_LENGTH/IRQ_EN_SET/IRQ_EN_CLEAR registers

Douglas Anderson (8):
  soc: qcom: geni-se: Add GP_LENGTH/IRQ_EN_SET/IRQ_EN_CLEAR registers
  tty: serial: Add uart_fifo_timeout_ms()
  serial: qcom-geni: Fix the timeout in qcom_geni_serial_poll_bit()
  serial: qcom-geni: Fix arg types for qcom_geni_serial_poll_bit()
  serial: qcom-geni: Introduce qcom_geni_serial_poll_bitfield()
  serial: qcom-geni: Just set the watermark level once
  serial: qcom-geni: Fix suspend while active UART xfer
  serial: qcom-geni: Rework TX in FIFO mode to fix hangs/lockups

 drivers/tty/serial/qcom_geni_serial.c | 322 +++++++++++++++-----------
 include/linux/serial_core.h           |  15 +-
 include/linux/soc/qcom/geni-se.h      |   9 +
 3 files changed, 206 insertions(+), 140 deletions(-)

Comments

Konrad Dybcio June 18, 2024, 10:19 a.m. UTC | #1
On 6/11/24 00:24, Douglas Anderson wrote:
> 
> While trying to reproduce -EBUSY errors that our lab was getting in
> suspend/resume testing, I ended up finding a whole pile of problems
> with the Qualcomm GENI serial driver. I've posted a fix for the -EBUSY
> issue separately [1]. This series is fixing all of the Qualcomm GENI
> problems that I found.
> 
> As far as I can tell most of the problems have been in the Qualcomm
> GENI serial driver since inception, but it can be noted that the
> behavior got worse with the new kfifo changes. Previously when the OS
> took data out of the circular queue we'd just spit stale data onto the
> serial port. Now we'll hard lockup. :-P
> 
> I've tried to break this series up as much as possible to make it
> easier to understand but the final patch is still a lot of change at
> once. Hopefully it's OK.

Tested-by: Konrad Dybcio <konrad.dybcio@linaro.org>

Konrad

Neil Armstrong June 19, 2024, 8:25 a.m. UTC | #2
On 11/06/2024 00:24, Douglas Anderson wrote:
> 
> While trying to reproduce -EBUSY errors that our lab was getting in
> suspend/resume testing, I ended up finding a whole pile of problems
> with the Qualcomm GENI serial driver. I've posted a fix for the -EBUSY
> issue separately [1]. This series is fixing all of the Qualcomm GENI
> problems that I found.
> 
> As far as I can tell most of the problems have been in the Qualcomm
> GENI serial driver since inception, but it can be noted that the
> behavior got worse with the new kfifo changes. Previously when the OS
> took data out of the circular queue we'd just spit stale data onto the
> serial port. Now we'll hard lockup. :-P
> 
> I've tried to break this series up as much as possible to make it
> easier to understand but the final patch is still a lot of change at
> once. Hopefully it's OK.
> 
> [1] https://lore.kernel.org/r/20240530084841.v2.1.I2395e66cf70c6e67d774c56943825c289b9c13e4@changeid
> 
> Changes in v4:
> - Add GP_LENGTH field definition.
> - Fix indentation.
> - GENMASK(31, 0) -> GP_LENGTH.
> - Use uart_fifo_timeout_ms() for timeout.
> - tty: serial: Add uart_fifo_timeout_ms()
> 
> Changes in v3:
> - 0xffffffff => GENMASK(31, 0)
> - Reword commit message.
> - Use uart_fifo_timeout() for timeout.
> 
> Changes in v2:
> - Totally rework / rename patch to handle suspend while active xfer
> - serial: qcom-geni: Fix arg types for qcom_geni_serial_poll_bit()
> - serial: qcom-geni: Fix the timeout in qcom_geni_serial_poll_bit()
> - serial: qcom-geni: Introduce qcom_geni_serial_poll_bitfield()
> - serial: qcom-geni: Just set the watermark level once
> - serial: qcom-geni: Rework TX in FIFO mode to fix hangs/lockups
> - soc: qcom: geni-se: Add GP_LENGTH/IRQ_EN_SET/IRQ_EN_CLEAR registers
> 
> Douglas Anderson (8):
>    soc: qcom: geni-se: Add GP_LENGTH/IRQ_EN_SET/IRQ_EN_CLEAR registers
>    tty: serial: Add uart_fifo_timeout_ms()
>    serial: qcom-geni: Fix the timeout in qcom_geni_serial_poll_bit()
>    serial: qcom-geni: Fix arg types for qcom_geni_serial_poll_bit()
>    serial: qcom-geni: Introduce qcom_geni_serial_poll_bitfield()
>    serial: qcom-geni: Just set the watermark level once
>    serial: qcom-geni: Fix suspend while active UART xfer
>    serial: qcom-geni: Rework TX in FIFO mode to fix hangs/lockups
> 
>   drivers/tty/serial/qcom_geni_serial.c | 322 +++++++++++++++-----------
>   include/linux/serial_core.h           |  15 +-
>   include/linux/soc/qcom/geni-se.h      |   9 +
>   3 files changed, 206 insertions(+), 140 deletions(-)
> 

Indeed, no more lockup when killing a process on the serial debug console.

Tested-by: Neil Armstrong <neil.armstrong@linaro.org> # on SM8650-HDK

Thanks!
Neil

Johan Hovold June 19, 2024, 8:50 a.m. UTC | #3
Hi Doug,

and sorry about the late feedback on this (was out of office last
week).

On Mon, Jun 10, 2024 at 03:24:18PM -0700, Douglas Anderson wrote:
> 
> While trying to reproduce -EBUSY errors that our lab was getting in
> suspend/resume testing, I ended up finding a whole pile of problems
> with the Qualcomm GENI serial driver. I've posted a fix for the -EBUSY
> issue separately [1]. This series is fixing all of the Qualcomm GENI
> problems that I found.
> 
> As far as I can tell most of the problems have been in the Qualcomm
> GENI serial driver since inception, but it can be noted that the
> behavior got worse with the new kfifo changes. Previously when the OS
> took data out of the circular queue we'd just spit stale data onto the
> serial port. Now we'll hard lockup. :-P

Thanks for taking a stab at this. This is indeed a known issue that has
been on my ever-growing TODO list for over a year now. I worked around a
related regression with:

	9aff74cc4e9e ("serial: qcom-geni: fix console shutdown hang")

but noticed that the underlying bug can still easily be triggered, for
example, using software flow control in a serial console.

With 6.10-rc1 I started hitting this hang on every reboot. I was booting
the new x1e80100 so wasn't sure at first what caused it, but after
triggering the hang by interrupting a dmesg command I remembered the
broken serial driver, and indeed your (v2) series fixed the regression,
which was also present on sc8280xp.

I did run a quick benchmark this morning to see if there was any
significant performance penalty and I am seeing a 26% slowdown (e.g.
catting 544 kB takes 68 instead of 54 seconds at 115200 baud).

I've had a feeling that boot was slower with the series applied, but I
haven't verified that (just printing dmesg takes an extra second,
though).

Correctness first, of course, but perhaps something can be done about
that too.

I'll comment on the individual patches as well, but for now:

Tested-by: Johan Hovold <johan+linaro@kernel.org>

(I did a quick test with Bluetooth / DMA as well.)

Johan

Nícolas F. R. A. Prado June 20, 2024, 11:13 p.m. UTC | #4
On Mon, Jun 10, 2024 at 03:24:18PM -0700, Douglas Anderson wrote:
> 
> While trying to reproduce -EBUSY errors that our lab was getting in
> suspend/resume testing, I ended up finding a whole pile of problems
> with the Qualcomm GENI serial driver. I've posted a fix for the -EBUSY
> issue separately [1]. This series is fixing all of the Qualcomm GENI
> problems that I found.
> 
> As far as I can tell most of the problems have been in the Qualcomm
> GENI serial driver since inception, but it can be noted that the
> behavior got worse with the new kfifo changes. Previously when the OS
> took data out of the circular queue we'd just spit stale data onto the
> serial port. Now we'll hard lockup. :-P
> 
> I've tried to break this series up as much as possible to make it
> easier to understand but the final patch is still a lot of change at
> once. Hopefully it's OK.
> 
> [1] https://lore.kernel.org/r/20240530084841.v2.1.I2395e66cf70c6e67d774c56943825c289b9c13e4@changeid

Hi,

we've experienced issues with missing kernel messages in the serial output
on the sc7180-based platforms in our lab for a while now.

I've just run a batch of jobs that simply boot and write some messages to
/dev/kmsg on sc7180-trogdor-lazor-limozeen. Before the patch, in 18 out of
20 runs the first message would be missing from the logs, causing the test
to fail. After the patch all 20 runs passed. So this is a clear fix, and I'm
very happy to say goodbye to this issue. Thank you!

Tested-by: Nícolas F. R. A. Prado <nfraprado@collabora.com>

FTR, this is the issue ticket in KernelCI:
https://github.com/kernelci/kernelci-project/issues/380

Thanks,
Nícolas