Message ID | 20240610152420.v4.7.I0f81a5baa37d368f291c96ee4830abca337e3c87@changeid (mailing list archive) |
---|---|
State | Not Applicable, archived |
Series | serial: qcom-geni: Overhaul TX handling to fix crashes/hangs |
On 6/11/24 00:24, Douglas Anderson wrote:
> On devices using Qualcomm's GENI UART it is possible to get the UART
> stuck such that it no longer outputs data. Specifically, logging in
> via an agetty on the debug serial port (which was _not_ used for
> kernel console) and running:
>   cat /var/log/messages
> ...and then (via an SSH session) forcing a few suspend/resume cycles
> causes the UART to stop transmitting.
>
> The root of the problems was with qcom_geni_serial_stop_tx_fifo()
> which is called as part of the suspend process. Specific problems with
> that function:
> - When an in-progress "tx" command is cancelled it doesn't appear to
>   fully drain the FIFO. That meant qcom_geni_serial_tx_empty()
>   continued to report that the FIFO wasn't empty. The
>   qcom_geni_serial_start_tx_fifo() function didn't re-enable
>   interrupts in this case so the driver would never start transferring
>   again.
> - When the driver cancelled the current "tx" command it forgot to
>   zero out "tx_remaining". This confused logic elsewhere in the
>   driver.
> - From experimentation, it appears that cancelling the "tx" command
>   could drop some of the queued up bytes.
>
> While qcom_geni_serial_stop_tx_fifo() could be fixed to drain the FIFO
> and shut things down properly, stop_tx() isn't supposed to be a slow
> function. It is run with local interrupts off and is documented to
> stop transmitting "as soon as possible". Change the function to just
> stop new bytes from being queued. In order to make this work, change
> qcom_geni_serial_start_tx_fifo() to remove some conditions. It's
> always safe to enable the watermark interrupt and the IRQ handler will
> disable it if it's not needed.
>
> For system suspend the queue still needs to be drained. Failure to do
> so means that the hardware won't provide new interrupts until a
> "cancel" command is sent. Add draining logic (fixing the issues noted
> above) at suspend time.
>
> NOTE: It would be ideal if qcom_geni_serial_stop_tx_fifo() could
> "pause" the transmitter right away. There is no obvious way to do this
> in the docs and experimentation didn't find any tricks either, so
> stopping TX "as soon as possible" isn't very soon but is the best
> possible.
>
> Fixes: c4f528795d1a ("tty: serial: msm_geni_serial: Add serial driver support for GENI based QUP")
> Signed-off-by: Douglas Anderson <dianders@chromium.org>
> ---

This all looks good in my eyes, with the assumption that sending an
ABORT can't somehow be rejected by the hardware. I wouldn't normally
think of that, but GENI is peculiar at times.

Reviewed-by: Konrad Dybcio <konrad.dybcio@linaro.org>

Konrad
On Mon, Jun 10, 2024 at 03:24:25PM -0700, Douglas Anderson wrote:
> On devices using Qualcomm's GENI UART it is possible to get the UART
> stuck such that it no longer outputs data. Specifically, logging in
> via an agetty on the debug serial port (which was _not_ used for
> kernel console) and running:
>   cat /var/log/messages
> ...and then (via an SSH session) forcing a few suspend/resume cycles
> causes the UART to stop transmitting.

An easier way to trigger this old bug is to just run a command like
dmesg and hit ctrl-s in a serial console to stop tx. Interrupting the
command or hitting ctrl-q to restart tx then triggers the soft lockup.

> The root of the problems was with qcom_geni_serial_stop_tx_fifo()
> which is called as part of the suspend process. Specific problems with
> that function:
> - When an in-progress "tx" command is cancelled it doesn't appear to
>   fully drain the FIFO. That meant qcom_geni_serial_tx_empty()
>   continued to report that the FIFO wasn't empty. The
>   qcom_geni_serial_start_tx_fifo() function didn't re-enable
>   interrupts in this case so the driver would never start transferring
>   again.
> - When the driver cancelled the current "tx" command it forgot to
>   zero out "tx_remaining". This confused logic elsewhere in the
>   driver.
> - From experimentation, it appears that cancelling the "tx" command
>   could drop some of the queued up bytes.
>
> While qcom_geni_serial_stop_tx_fifo() could be fixed to drain the FIFO
> and shut things down properly, stop_tx() isn't supposed to be a slow
> function. It is run with local interrupts off and is documented to
> stop transmitting "as soon as possible". Change the function to just
> stop new bytes from being queued. In order to make this work, change
> qcom_geni_serial_start_tx_fifo() to remove some conditions. It's
> always safe to enable the watermark interrupt and the IRQ handler will
> disable it if it's not needed.
>
> For system suspend the queue still needs to be drained. Failure to do
> so means that the hardware won't provide new interrupts until a
> "cancel" command is sent. Add draining logic (fixing the issues noted
> above) at suspend time.

So I spent the better part of the weekend looking at this driver and
this is one of the bits I worry about with your approach, as relying on
draining anything won't work with hardware flow control.

Cancelling commands can result in a stalled TX in a number of ways and
there's still at least one that you don't handle. If you end up with
data in the FIFO, the watermark interrupt may never fire when you try
to restart tx.

I'm leaning towards fixing the immediate hard lockup regression
separately and then we can address the older bugs and rework the driver
without having to rush things.

I've prepared a minimal three patch series which fixes most of the
discussed issues (hard and soft lockup and garbage characters) and that
should be backportable as well.

Currently, the diffstat is just:

 drivers/tty/serial/qcom_geni_serial.c | 36 +++++++++++++++++++++++++-----------
 1 file changed, 25 insertions(+), 11 deletions(-)

Fixing the hard lockup 6.10-rc1 regression is just a single line.

Johan
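To illustrate the flow-control concern raised above: a drain loop can only make progress while the remote end allows the UART to transmit. What follows is a minimal userspace sketch in C, not driver code. The names model_fifo, model_tick() and model_drain() are hypothetical, and "one byte leaves the FIFO per poll step" is an assumed simplification rather than a description of the GENI hardware.

```c
#include <stdbool.h>

/* Hypothetical model of a TX FIFO behind hardware flow control. */
struct model_fifo {
	unsigned int level;	/* bytes currently sitting in the FIFO */
	bool cts_asserted;	/* true if the peer allows transmission */
};

/* One poll step: a byte leaves the FIFO only if the peer allows it. */
static void model_tick(struct model_fifo *f)
{
	if (f->cts_asserted && f->level > 0)
		f->level--;
}

/*
 * Bounded drain: poll at most "budget" times and report whether the
 * FIFO actually emptied. With CTS deasserted the loop always runs out
 * of budget and the bytes stay where they are.
 */
static bool model_drain(struct model_fifo *f, unsigned int budget)
{
	unsigned int i;

	for (i = 0; i < budget && f->level > 0; i++)
		model_tick(f);

	return f->level == 0;
}
```

In this model a FIFO holding 8 bytes drains within a budget of 100 polls when cts_asserted is true, while the same call with cts_asserted false returns false and leaves all 8 bytes queued, which is the situation where a later cancel would have to deal with leftover data.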
On Mon, Jun 24, 2024 at 02:12:04PM +0200, Johan Hovold wrote:
> I've prepared a minimal three patch series which fixes most of the
> discussed issues (hard and soft lockup and garbage characters) and that
> should be backportable as well.
>
> Currently, the diffstat is just:
>
>  drivers/tty/serial/qcom_geni_serial.c | 36 +++++++++++++++++++++++++-----------
>  1 file changed, 25 insertions(+), 11 deletions(-)
>
> Fixing the hard lockup 6.10-rc1 regression is just a single line.

For the record, I've posted the series here:

https://lore.kernel.org/lkml/20240624133135.7445-1-johan+linaro@kernel.org/

Johan
Hi,

On Mon, Jun 24, 2024 at 5:12 AM Johan Hovold <johan@kernel.org> wrote:
>
> On Mon, Jun 10, 2024 at 03:24:25PM -0700, Douglas Anderson wrote:
> > On devices using Qualcomm's GENI UART it is possible to get the UART
> > stuck such that it no longer outputs data. Specifically, logging in
> > via an agetty on the debug serial port (which was _not_ used for
> > kernel console) and running:
> >   cat /var/log/messages
> > ...and then (via an SSH session) forcing a few suspend/resume cycles
> > causes the UART to stop transmitting.
>
> An easier way to trigger this old bug is to just run a command like
> dmesg and hit ctrl-s in a serial console to stop tx. Interrupting the
> command or hitting ctrl-q to restart tx then triggers the soft lockup.
>
> > The root of the problems was with qcom_geni_serial_stop_tx_fifo()
> > which is called as part of the suspend process. Specific problems with
> > that function:
> > - When an in-progress "tx" command is cancelled it doesn't appear to
> >   fully drain the FIFO. That meant qcom_geni_serial_tx_empty()
> >   continued to report that the FIFO wasn't empty. The
> >   qcom_geni_serial_start_tx_fifo() function didn't re-enable
> >   interrupts in this case so the driver would never start transferring
> >   again.
> > - When the driver cancelled the current "tx" command it forgot to
> >   zero out "tx_remaining". This confused logic elsewhere in the
> >   driver.
> > - From experimentation, it appears that cancelling the "tx" command
> >   could drop some of the queued up bytes.
> >
> > While qcom_geni_serial_stop_tx_fifo() could be fixed to drain the FIFO
> > and shut things down properly, stop_tx() isn't supposed to be a slow
> > function. It is run with local interrupts off and is documented to
> > stop transmitting "as soon as possible". Change the function to just
> > stop new bytes from being queued. In order to make this work, change
> > qcom_geni_serial_start_tx_fifo() to remove some conditions. It's
> > always safe to enable the watermark interrupt and the IRQ handler will
> > disable it if it's not needed.
> >
> > For system suspend the queue still needs to be drained. Failure to do
> > so means that the hardware won't provide new interrupts until a
> > "cancel" command is sent. Add draining logic (fixing the issues noted
> > above) at suspend time.
>
> So I spent the better part of the weekend looking at this driver and
> this is one of the bits I worry about with your approach, as relying on
> draining anything won't work with hardware flow control.
>
> Cancelling commands can result in a stalled TX in a number of ways and
> there's still at least one that you don't handle. If you end up with
> data in the FIFO, the watermark interrupt may never fire when you try
> to restart tx.

Ah, that's a good call. Right now it doesn't really happen since
people tend to hook up the debug UART without flow control lines (as
far as I've seen), but it's good to make sure it works.

> I'm leaning towards fixing the immediate hard lockup regression
> separately and then we can address the older bugs and rework the driver
> without having to rush things.

Yeah, that's fair. I've responded to your patch with a
counter-proposal to fix the hard lockup regression, but I agree that
should take priority.

> I've prepared a minimal three patch series which fixes most of the
> discussed issues (hard and soft lockup and garbage characters) and that
> should be backportable as well.
>
> Currently, the diffstat is just:
>
>  drivers/tty/serial/qcom_geni_serial.c | 36 +++++++++++++++++++++++++-----------
>  1 file changed, 25 insertions(+), 11 deletions(-)

I'll respond in more depth to your patches, but I suspect that your
patch series won't fix the issues that Nícolas reported [1]. I also
tested and your patch series doesn't fix the kdb issue talked about in
my patch #8. Part of my reworking of stuff also changed the way that
the console and the polling commands worked since they were pretty
broken. Your series doesn't touch them.

We'll probably need something in-between taking advantage of some of
the stuff you figured out with "cancel" but also doing a bigger rework
than you did.

[1] https://lore.kernel.org/r/46f57349-1217-4594-85b2-84fa3a365c0c@notapiano
On Mon, Jun 24, 2024 at 01:58:34PM -0700, Doug Anderson wrote:
> On Mon, Jun 24, 2024 at 5:12 AM Johan Hovold <johan@kernel.org> wrote:

> > I'm leaning towards fixing the immediate hard lockup regression
> > separately and then we can address the older bugs and rework the driver
> > without having to rush things.
>
> Yeah, that's fair. I've responded to your patch with a
> counter-proposal to fix the hard lockup regression, but I agree that
> should take priority.
>
> > I've prepared a minimal three patch series which fixes most of the
> > discussed issues (hard and soft lockup and garbage characters) and that
> > should be backportable as well.
> >
> > Currently, the diffstat is just:
> >
> >  drivers/tty/serial/qcom_geni_serial.c | 36 +++++++++++++++++++++++++-----------
> >  1 file changed, 25 insertions(+), 11 deletions(-)
>
> I'll respond in more depth to your patches, but I suspect that your
> patch series won't fix the issues that Nícolas reported [1]. I also
> tested and your patch series doesn't fix the kdb issue talked about in
> my patch #8. Part of my reworking of stuff also changed the way that
> the console and the polling commands worked since they were pretty
> broken. Your series doesn't touch them.

Right, I never claimed to fix all the issues, only some of the most
obvious and severe ones.

> We'll probably need something in-between taking advantage of some of
> the stuff you figured out with "cancel" but also doing a bigger rework
> than you did.

Quite likely. My intention was to try to find minimal fixes for
individual issues, which could also be backported, before doing a larger
rework if that turns out to be necessary (and which can also be done in
more than one way, e.g. using 16-byte fifos).

Johan
diff --git a/drivers/tty/serial/qcom_geni_serial.c b/drivers/tty/serial/qcom_geni_serial.c
index 132669a2da34..1a66424f0f5f 100644
--- a/drivers/tty/serial/qcom_geni_serial.c
+++ b/drivers/tty/serial/qcom_geni_serial.c
@@ -130,6 +130,7 @@ struct qcom_geni_serial_port {
 	bool brk;
 
 	unsigned int tx_remaining;
+	unsigned int tx_total;
 	int wakeup_irq;
 	bool rx_tx_swap;
 	bool cts_rts_swap;
@@ -310,11 +311,14 @@ static bool qcom_geni_serial_poll_bit(struct uart_port *uport,
 
 static void qcom_geni_serial_setup_tx(struct uart_port *uport, u32 xmit_size)
 {
+	struct qcom_geni_serial_port *port = to_dev_port(uport);
 	u32 m_cmd;
 
 	writel(xmit_size, uport->membase + SE_UART_TX_TRANS_LEN);
 	m_cmd = UART_START_TX << M_OPCODE_SHFT;
 	writel(m_cmd, uport->membase + SE_GENI_M_CMD0);
+
+	port->tx_total = xmit_size;
 }
 
 static void qcom_geni_serial_poll_tx_done(struct uart_port *uport)
@@ -334,6 +338,64 @@ static void qcom_geni_serial_poll_tx_done(struct uart_port *uport)
 	writel(irq_clear, uport->membase + SE_GENI_M_IRQ_CLEAR);
 }
 
+static void qcom_geni_serial_drain_tx_fifo(struct uart_port *uport)
+{
+	struct qcom_geni_serial_port *port = to_dev_port(uport);
+
+	/*
+	 * If the main sequencer is inactive it means that the TX command has
+	 * been completed and all bytes have been sent. Nothing to do in that
+	 * case.
+	 */
+	if (!qcom_geni_serial_main_active(uport))
+		return;
+
+	/*
+	 * Wait until the FIFO has been drained. We've already taken bytes out
+	 * of the higher level queue in qcom_geni_serial_send_chunk_fifo() so
+	 * if we don't drain the FIFO but send the "cancel" below they seem to
+	 * get lost.
+	 */
+	qcom_geni_serial_poll_bitfield(uport, SE_GENI_M_GP_LENGTH, GP_LENGTH,
+				       port->tx_total - port->tx_remaining);
+
+	/*
+	 * If clearing the FIFO made us inactive then we're done--no need for
+	 * a cancel.
+	 */
+	if (!qcom_geni_serial_main_active(uport))
+		return;
+
+	/*
+	 * Cancel the current command. After this the main sequencer will
+	 * stop reporting that it's active and we'll have to start a new
+	 * transfer command.
+	 *
+	 * If we skip doing this cancel and then continue with a system
+	 * suspend while there's an active command in the main sequencer
+	 * then after resume time we won't get any more interrupts on the
+	 * main sequencer until we send the cancel.
+	 */
+	geni_se_cancel_m_cmd(&port->se);
+	if (!qcom_geni_serial_poll_bit(uport, SE_GENI_M_IRQ_STATUS,
+				       M_CMD_CANCEL_EN, true)) {
+		/* The cancel failed; try an abort as a fallback. */
+		geni_se_abort_m_cmd(&port->se);
+		qcom_geni_serial_poll_bit(uport, SE_GENI_M_IRQ_STATUS,
+					  M_CMD_ABORT_EN, true);
+		writel(M_CMD_ABORT_EN, uport->membase + SE_GENI_M_IRQ_CLEAR);
+	}
+	writel(M_CMD_CANCEL_EN, uport->membase + SE_GENI_M_IRQ_CLEAR);
+
+	/*
+	 * We've cancelled the current command. "tx_remaining" stores how
+	 * many bytes are left to finish in the current command so we know
+	 * when to start a new command. Since the command was cancelled we
+	 * need to zero "tx_remaining".
+	 */
+	port->tx_remaining = 0;
+}
+
 static void qcom_geni_serial_abort_rx(struct uart_port *uport)
 {
 	u32 irq_clear = S_CMD_DONE_EN | S_CMD_ABORT_EN;
@@ -654,37 +716,18 @@ static void qcom_geni_serial_start_tx_fifo(struct uart_port *uport)
 {
 	u32 irq_en;
 
-	if (qcom_geni_serial_main_active(uport) ||
-	    !qcom_geni_serial_tx_empty(uport))
-		return;
-
 	irq_en = readl(uport->membase + SE_GENI_M_IRQ_EN);
 	irq_en |= M_TX_FIFO_WATERMARK_EN | M_CMD_DONE_EN;
-
 	writel(irq_en, uport->membase + SE_GENI_M_IRQ_EN);
 }
 
 static void qcom_geni_serial_stop_tx_fifo(struct uart_port *uport)
 {
 	u32 irq_en;
-	struct qcom_geni_serial_port *port = to_dev_port(uport);
 
 	irq_en = readl(uport->membase + SE_GENI_M_IRQ_EN);
 	irq_en &= ~(M_CMD_DONE_EN | M_TX_FIFO_WATERMARK_EN);
 	writel(irq_en, uport->membase + SE_GENI_M_IRQ_EN);
-	/* Possible stop tx is called multiple times. */
-	if (!qcom_geni_serial_main_active(uport))
-		return;
-
-	geni_se_cancel_m_cmd(&port->se);
-	if (!qcom_geni_serial_poll_bit(uport, SE_GENI_M_IRQ_STATUS,
-				       M_CMD_CANCEL_EN, true)) {
-		geni_se_abort_m_cmd(&port->se);
-		qcom_geni_serial_poll_bit(uport, SE_GENI_M_IRQ_STATUS,
-					  M_CMD_ABORT_EN, true);
-		writel(M_CMD_ABORT_EN, uport->membase + SE_GENI_M_IRQ_CLEAR);
-	}
-	writel(M_CMD_CANCEL_EN, uport->membase + SE_GENI_M_IRQ_CLEAR);
 }
 
 static void qcom_geni_serial_handle_rx_fifo(struct uart_port *uport, bool drop)
@@ -1066,7 +1109,15 @@ static int setup_fifos(struct qcom_geni_serial_port *port)
 }
 
 
-static void qcom_geni_serial_shutdown(struct uart_port *uport)
+static void qcom_geni_serial_shutdown_dma(struct uart_port *uport)
+{
+	disable_irq(uport->irq);
+
+	qcom_geni_serial_stop_tx(uport);
+	qcom_geni_serial_stop_rx(uport);
+}
+
+static void qcom_geni_serial_shutdown_fifo(struct uart_port *uport)
 {
 	disable_irq(uport->irq);
 
@@ -1075,6 +1126,8 @@ static void qcom_geni_serial_shutdown(struct uart_port *uport)
 
 	qcom_geni_serial_stop_tx(uport);
 	qcom_geni_serial_stop_rx(uport);
+
+	qcom_geni_serial_drain_tx_fifo(uport);
 }
 
 static int qcom_geni_serial_port_setup(struct uart_port *uport)
@@ -1532,7 +1585,7 @@ static const struct uart_ops qcom_geni_console_pops = {
 	.startup = qcom_geni_serial_startup,
 	.request_port = qcom_geni_serial_request_port,
 	.config_port = qcom_geni_serial_config_port,
-	.shutdown = qcom_geni_serial_shutdown,
+	.shutdown = qcom_geni_serial_shutdown_fifo,
 	.type = qcom_geni_serial_get_type,
 	.set_mctrl = qcom_geni_serial_set_mctrl,
 	.get_mctrl = qcom_geni_serial_get_mctrl,
@@ -1554,7 +1607,7 @@ static const struct uart_ops qcom_geni_uart_pops = {
 	.startup = qcom_geni_serial_startup,
 	.request_port = qcom_geni_serial_request_port,
 	.config_port = qcom_geni_serial_config_port,
-	.shutdown = qcom_geni_serial_shutdown,
+	.shutdown = qcom_geni_serial_shutdown_dma,
 	.type = qcom_geni_serial_get_type,
 	.set_mctrl = qcom_geni_serial_set_mctrl,
 	.get_mctrl = qcom_geni_serial_get_mctrl,
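The drain step in qcom_geni_serial_drain_tx_fifo() polls SE_GENI_M_GP_LENGTH until it equals tx_total - tx_remaining. The arithmetic can be sketched in plain userspace C. This is an illustration only: model_port, model_drain_target(), model_poll_drained() and model_hw_one_byte_per_poll() are hypothetical names, the register read is replaced by a callback, and the reading that tx_remaining counts bytes of the current command not yet handed to the FIFO is an assumption based on the comments in the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two counters the patch relies on. */
struct model_port {
	uint32_t tx_total;	/* length of the current TX command */
	uint32_t tx_remaining;	/* assumed: bytes not yet written to the FIFO */
};

/* Bytes the hardware must report as sent before the FIFO is drained. */
static uint32_t model_drain_target(const struct model_port *p)
{
	return p->tx_total - p->tx_remaining;
}

/*
 * Poll a "bytes sent" counter up to max_polls times. read_sent() models
 * the register read; returns true once the counter equals the target.
 */
static bool model_poll_drained(const struct model_port *p,
			       uint32_t (*read_sent)(void *ctx), void *ctx,
			       unsigned int max_polls)
{
	uint32_t target = model_drain_target(p);
	unsigned int i;

	for (i = 0; i < max_polls; i++) {
		if (read_sent(ctx) == target)
			return true;
	}
	return false;
}

/* Example counter model: one more byte has been sent after every read. */
static uint32_t model_hw_one_byte_per_poll(void *ctx)
{
	uint32_t *sent = ctx;

	return (*sent)++;
}
```

With tx_total = 64 and tx_remaining = 48 the target is 16 bytes. Starting from a counter value of 10, the poll succeeds on the seventh read, whereas a budget of three polls gives up early. A bounded poll of this kind can therefore fail, and the patch follows the poll with a check of whether the main sequencer is still active before sending the cancel.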
On devices using Qualcomm's GENI UART it is possible to get the UART
stuck such that it no longer outputs data. Specifically, logging in
via an agetty on the debug serial port (which was _not_ used for
kernel console) and running:
  cat /var/log/messages
...and then (via an SSH session) forcing a few suspend/resume cycles
causes the UART to stop transmitting.

The root of the problems was with qcom_geni_serial_stop_tx_fifo()
which is called as part of the suspend process. Specific problems with
that function:
- When an in-progress "tx" command is cancelled it doesn't appear to
  fully drain the FIFO. That meant qcom_geni_serial_tx_empty()
  continued to report that the FIFO wasn't empty. The
  qcom_geni_serial_start_tx_fifo() function didn't re-enable
  interrupts in this case so the driver would never start transferring
  again.
- When the driver cancelled the current "tx" command it forgot to
  zero out "tx_remaining". This confused logic elsewhere in the
  driver.
- From experimentation, it appears that cancelling the "tx" command
  could drop some of the queued up bytes.

While qcom_geni_serial_stop_tx_fifo() could be fixed to drain the FIFO
and shut things down properly, stop_tx() isn't supposed to be a slow
function. It is run with local interrupts off and is documented to
stop transmitting "as soon as possible". Change the function to just
stop new bytes from being queued. In order to make this work, change
qcom_geni_serial_start_tx_fifo() to remove some conditions. It's
always safe to enable the watermark interrupt and the IRQ handler will
disable it if it's not needed.

For system suspend the queue still needs to be drained. Failure to do
so means that the hardware won't provide new interrupts until a
"cancel" command is sent. Add draining logic (fixing the issues noted
above) at suspend time.

NOTE: It would be ideal if qcom_geni_serial_stop_tx_fifo() could
"pause" the transmitter right away. There is no obvious way to do this
in the docs and experimentation didn't find any tricks either, so
stopping TX "as soon as possible" isn't very soon but is the best
possible.

Fixes: c4f528795d1a ("tty: serial: msm_geni_serial: Add serial driver support for GENI based QUP")
Signed-off-by: Douglas Anderson <dianders@chromium.org>
---
There are still a number of problems with GENI UART after this but
I've kept this change separate to make it easier to understand.
Specifically on mainline just hitting "Ctrl-C" after dumping
/var/log/messages to the serial port hangs things after the kfifo
changes. Those issues will be addressed in future patches.

It should also be noted that the "Fixes" tag here is a bit of a
swag. I haven't gone and tested on ancient code, but at least the
problems exist on kernel 5.15 and much of the code touched here has
been here since the beginning, or at least for as long as the driver
has been stable.

Changes in v4:
- Fix indentation.
- GENMASK(31, 0) -> GP_LENGTH.

Changes in v3:
- 0xffffffff => GENMASK(31, 0)
- Reword commit message.

Changes in v2:
- Totally rework / rename patch to handle suspend while active xfer

 drivers/tty/serial/qcom_geni_serial.c | 97 +++++++++++++++++++++------
 1 file changed, 75 insertions(+), 22 deletions(-)
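The split of responsibilities described in the commit message (stop_tx only masks interrupts, start_tx always unmasks them, and the IRQ handler masks the watermark interrupt again when there is nothing to send) can be sketched as a small userspace model. This is not the driver's code: model_uart and the model_*() functions are hypothetical, and the FIFO is reduced to a byte counter.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names exist in the driver. */
struct model_uart {
	bool watermark_irq_en;	/* models the watermark bit in the IRQ mask */
	size_t queued;		/* bytes waiting in the software queue */
	size_t fifo;		/* bytes sitting in the hardware FIFO */
	size_t fifo_depth;	/* capacity of the hardware FIFO */
};

/* start_tx: unconditionally unmask the watermark interrupt. */
static void model_start_tx(struct model_uart *u)
{
	u->watermark_irq_en = true;
}

/* stop_tx: only stop new bytes from being queued; no cancel, no drain. */
static void model_stop_tx(struct model_uart *u)
{
	u->watermark_irq_en = false;
}

/*
 * Watermark IRQ handler: move queued bytes into the FIFO, or mask the
 * interrupt again if there is nothing to send. Returns bytes moved.
 */
static size_t model_handle_watermark_irq(struct model_uart *u)
{
	size_t room, chunk;

	if (!u->watermark_irq_en)
		return 0;

	if (u->queued == 0) {
		u->watermark_irq_en = false;
		return 0;
	}

	room = u->fifo_depth - u->fifo;
	chunk = u->queued < room ? u->queued : room;
	u->queued -= chunk;
	u->fifo += chunk;
	return chunk;
}
```

In this model, calling model_start_tx() with an empty queue is harmless because the first interrupt simply masks itself again, and model_stop_tx() leaves both the queue and the FIFO contents untouched, so no bytes are dropped by stopping.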