Message ID | 20201216144114.v2.2.Ibade998ed587e070388b4bf58801f1107a40eb53@changeid (mailing list archive) |
---|---|
State | Superseded |
Series | [v2,1/4] spi: spi-geni-qcom: Fix geni_spi_isr() NULL dereference in timeout case |
Quoting Douglas Anderson (2020-12-16 14:41:50)
> If we got a timeout when trying to send an abort command then it means
> that we just got 3 timeouts in a row:
>
> 1. The original timeout that caused handle_fifo_timeout() to be
>    called.
> 2. A one second timeout waiting for the cancel command to finish.
> 3. A one second timeout waiting for the abort command to finish.
>
> SPI is clocked by the controller, so nothing (aside from a hardware
> fault or a totally broken sequencer) should be causing the actual
> commands to fail in hardware. However, even though the hardware
> itself is not expected to fail (and it'd be hard to predict how we
> should handle things if it did), it's easy to hit the timeout case by
> simply blocking our interrupt handler from running for a long period
> of time. Obviously the system is in pretty bad shape if an interrupt
> handler is blocked for > 2 seconds, but there are certainly bugs (even
> bugs in other unrelated drivers) that can make this happen.
>
> Let's make things a bit more robust against this case. If we fail to
> abort we'll set a flag and then we'll block all future transfers until
> we have no more interrupts pending.

Why can't we forcibly roll the ball forward and clear the irq if it's a
cancel/abort that's pending? Basically tell the hardware that we
understand it did the job and canceled things out but our sad little CPU
didn't run that irq handler yet. Here have a cookie and get ready for
the next transfer.

	if (M_CMD_CANCEL_EN || M_CMD_ABORT_EN) /* but not the other irqs like CMD_DONE or refill fifos */
		writel(M_CMD_CANCEL_EN | M_CMD_ABORT_EN, se->base + SE_GENI_M_IRQ_CLEAR);

This would let us limp along and try to send another transfer in the
case that we timed out but the transfer went through by servicing our
own interrupt.

>
> Fixes: 561de45f72bd ("spi: spi-geni-qcom: Add SPI driver support for GENI based QUP")
> Signed-off-by: Douglas Anderson <dianders@chromium.org>
> ---
>
> Changes in v2:
> - Make this just about the failed abort.
>
>  drivers/spi/spi-geni-qcom.c | 56 +++++++++++++++++++++++++++++++++++--
>  1 file changed, 54 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/spi/spi-geni-qcom.c b/drivers/spi/spi-geni-qcom.c
> index bf55abbd39f1..d988463e606f 100644
> --- a/drivers/spi/spi-geni-qcom.c
> +++ b/drivers/spi/spi-geni-qcom.c
> @@ -83,6 +83,7 @@ struct spi_geni_master {
>  	spinlock_t lock;
>  	int irq;
>  	bool cs_flag;
> +	bool abort_failed;
>  };
>
>  static int get_spi_clk_cfg(unsigned int speed_hz,
> @@ -141,8 +142,46 @@ static void handle_fifo_timeout(struct spi_master *spi,
>  	spin_unlock_irq(&mas->lock);
>
>  	time_left = wait_for_completion_timeout(&mas->abort_done, HZ);
> -	if (!time_left)
> +	if (!time_left) {
>  		dev_err(mas->dev, "Failed to cancel/abort m_cmd\n");
> +
> +		/*
> +		 * No need for a lock since SPI core has a lock and we never
> +		 * access this from an interrupt.
> +		 */
> +		mas->abort_failed = true;
> +	}
> +}
> +
> +static bool spi_geni_is_abort_still_pending(struct spi_geni_master *mas)
> +{
> +	struct geni_se *se = &mas->se;
> +	u32 m_irq, m_irq_en;
> +
> +	if (!mas->abort_failed)
> +		return false;
> +
> +	/*
> +	 * The only known case where a transfer times out and then a cancel
> +	 * times out then an abort times out is if something is blocking our
> +	 * interrupt handler from running. Avoid starting any new transfers
> +	 * until that sorts itself out.
> +	 */
> +	m_irq = readl(se->base + SE_GENI_M_IRQ_STATUS);
> +	m_irq_en = readl(se->base + SE_GENI_M_IRQ_EN);

I suppose this could race with the irq handler. Maybe we should grab the
irq lock around the register reads so we can synchronize with the irq
handler and save a fail?

> +	if (m_irq & m_irq_en) {
> +		dev_err(mas->dev, "Interrupts pending after abort: %#010x\n",
> +			m_irq & m_irq_en);
> +		return true;
> +	}
> +
Hi,

On Wed, Dec 16, 2020 at 8:21 PM Stephen Boyd <swboyd@chromium.org> wrote:
>
> Quoting Douglas Anderson (2020-12-16 14:41:50)
> > If we got a timeout when trying to send an abort command then it means
> > that we just got 3 timeouts in a row:
> >
> > 1. The original timeout that caused handle_fifo_timeout() to be
> >    called.
> > 2. A one second timeout waiting for the cancel command to finish.
> > 3. A one second timeout waiting for the abort command to finish.
> >
> > SPI is clocked by the controller, so nothing (aside from a hardware
> > fault or a totally broken sequencer) should be causing the actual
> > commands to fail in hardware. However, even though the hardware
> > itself is not expected to fail (and it'd be hard to predict how we
> > should handle things if it did), it's easy to hit the timeout case by
> > simply blocking our interrupt handler from running for a long period
> > of time. Obviously the system is in pretty bad shape if an interrupt
> > handler is blocked for > 2 seconds, but there are certainly bugs (even
> > bugs in other unrelated drivers) that can make this happen.
> >
> > Let's make things a bit more robust against this case. If we fail to
> > abort we'll set a flag and then we'll block all future transfers until
> > we have no more interrupts pending.
>
> Why can't we forcibly roll the ball forward and clear the irq if it's a
> cancel/abort that's pending? Basically tell the hardware that we
> understand it did the job and canceled things out but our sad little CPU
> didn't run that irq handler yet. Here have a cookie and get ready for
> the next transfer.
>
>	if (M_CMD_CANCEL_EN || M_CMD_ABORT_EN) /* but not the other irqs like CMD_DONE or refill fifos */
>		writel(M_CMD_CANCEL_EN | M_CMD_ABORT_EN, se->base + SE_GENI_M_IRQ_CLEAR);
>
> This would let us limp along and try to send another transfer in the
> case that we timed out but the transfer went through by servicing our
> own interrupt.

A few problems:

1. The cancel and abort are commands and they generate a "done"
interrupt along with their "cancel" and/or "abort". Clearing the
cancel/abort without the done will leave things in a much more
confused state.

2. If we timed out all the way down then there's probably _also_
interrupts for the previous transfer still pending. Those would also
need to be cleared. ...and we'd need to disable watermarks / read
pending data.

3. Even if we tried to solve all that, we're probably still in
terrible shape. Sure, we could try to start another transfer, but if
the previous one failed because the interrupt handler was blocked then
the next one is just going to fail too so all the extra complexity we
just added to handle this was likely wasted.

The whole fact that you need to send little packets to the sequencer
(and wait for an interrupt to tell you that it got your packet) for
every last thing really doesn't work super well and it's just
especially bad for chip select.

> > +static bool spi_geni_is_abort_still_pending(struct spi_geni_master *mas)
> > +{
> > +	struct geni_se *se = &mas->se;
> > +	u32 m_irq, m_irq_en;
> > +
> > +	if (!mas->abort_failed)
> > +		return false;
> > +
> > +	/*
> > +	 * The only known case where a transfer times out and then a cancel
> > +	 * times out then an abort times out is if something is blocking our
> > +	 * interrupt handler from running. Avoid starting any new transfers
> > +	 * until that sorts itself out.
> > +	 */
> > +	m_irq = readl(se->base + SE_GENI_M_IRQ_STATUS);
> > +	m_irq_en = readl(se->base + SE_GENI_M_IRQ_EN);
>
> I suppose this could race with the irq handler. Maybe we should grab the
> irq lock around the register reads so we can synchronize with the irq
> handler and save a fail?

I don't _think_ it'll matter a whole lot but it also won't hurt, so sure.

-Doug
Quoting Doug Anderson (2020-12-17 13:45:18)
>
> On Wed, Dec 16, 2020 at 8:21 PM Stephen Boyd <swboyd@chromium.org> wrote:
> >
> >	if (M_CMD_CANCEL_EN || M_CMD_ABORT_EN) /* but not the other irqs like CMD_DONE or refill fifos */
> >		writel(M_CMD_CANCEL_EN | M_CMD_ABORT_EN, se->base + SE_GENI_M_IRQ_CLEAR);
> >
> > This would let us limp along and try to send another transfer in the
> > case that we timed out but the transfer went through by servicing our
> > own interrupt.
>
> A few problems:
>
> 1. The cancel and abort are commands and they generate a "done"
> interrupt along with their "cancel" and/or "abort". Clearing the
> cancel/abort without the done will leave things in a much more
> confused state.

Ah I didn't know that the DONE bit was set even for cancel/abort, but
that makes sense given that it's a sequencer and everything that goes
into the sequencer eventually gets "DONE". I agree that if the DONE bit
is left hanging around that really confuses stuff, so best to ignore it
and figure out why interrupt latencies are leading to this problem.
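To make the point about the leftover "done" interrupt concrete, here is a minimal userspace sketch. It assumes the IRQ clear register behaves as write-1-to-clear and that a completed abort raises both the abort and the done bit, as described above. The bit positions, `struct fake_se` and every `fake_*` / `clear_cancel_abort_only()` helper are hypothetical stand-ins for illustration, not the driver's definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative bit positions only; treat them as assumptions. */
#define M_CMD_DONE_EN   (1u << 0)
#define M_CMD_CANCEL_EN (1u << 4)
#define M_CMD_ABORT_EN  (1u << 5)

/* Hypothetical stand-in for the serial engine's main IRQ status register. */
struct fake_se {
	uint32_t m_irq_status;
};

/* Model a write-1-to-clear register: every bit set in 'mask' is cleared. */
static void fake_irq_clear(struct fake_se *se, uint32_t mask)
{
	se->m_irq_status &= ~mask;
}

/* Model the sequencer finishing an abort: it raises ABORT and DONE together. */
static void fake_abort_completes(struct fake_se *se)
{
	se->m_irq_status |= M_CMD_ABORT_EN | M_CMD_DONE_EN;
}

/*
 * The proposed shortcut: acknowledge only cancel/abort. Note that the check
 * is against the status bits; the constants alone would always be non-zero.
 * Returns whatever is still pending afterwards.
 */
static uint32_t clear_cancel_abort_only(struct fake_se *se)
{
	if (se->m_irq_status & (M_CMD_CANCEL_EN | M_CMD_ABORT_EN))
		fake_irq_clear(se, M_CMD_CANCEL_EN | M_CMD_ABORT_EN);
	return se->m_irq_status;
}
```

In this model, calling `fake_abort_completes()` and then `clear_cancel_abort_only()` still leaves `M_CMD_DONE_EN` pending, which is the stale state a later transfer could misread as its own completion.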
diff --git a/drivers/spi/spi-geni-qcom.c b/drivers/spi/spi-geni-qcom.c
index bf55abbd39f1..d988463e606f 100644
--- a/drivers/spi/spi-geni-qcom.c
+++ b/drivers/spi/spi-geni-qcom.c
@@ -83,6 +83,7 @@ struct spi_geni_master {
 	spinlock_t lock;
 	int irq;
 	bool cs_flag;
+	bool abort_failed;
 };
 
 static int get_spi_clk_cfg(unsigned int speed_hz,
@@ -141,8 +142,46 @@ static void handle_fifo_timeout(struct spi_master *spi,
 	spin_unlock_irq(&mas->lock);
 
 	time_left = wait_for_completion_timeout(&mas->abort_done, HZ);
-	if (!time_left)
+	if (!time_left) {
 		dev_err(mas->dev, "Failed to cancel/abort m_cmd\n");
+
+		/*
+		 * No need for a lock since SPI core has a lock and we never
+		 * access this from an interrupt.
+		 */
+		mas->abort_failed = true;
+	}
+}
+
+static bool spi_geni_is_abort_still_pending(struct spi_geni_master *mas)
+{
+	struct geni_se *se = &mas->se;
+	u32 m_irq, m_irq_en;
+
+	if (!mas->abort_failed)
+		return false;
+
+	/*
+	 * The only known case where a transfer times out and then a cancel
+	 * times out then an abort times out is if something is blocking our
+	 * interrupt handler from running. Avoid starting any new transfers
+	 * until that sorts itself out.
+	 */
+	m_irq = readl(se->base + SE_GENI_M_IRQ_STATUS);
+	m_irq_en = readl(se->base + SE_GENI_M_IRQ_EN);
+	if (m_irq & m_irq_en) {
+		dev_err(mas->dev, "Interrupts pending after abort: %#010x\n",
+			m_irq & m_irq_en);
+		return true;
+	}
+
+	/*
+	 * If we're here the problem resolved itself so no need to check more
+	 * on future transfers.
+	 */
+	mas->abort_failed = false;
+
+	return false;
 }
 
 static void spi_geni_set_cs(struct spi_device *slv, bool set_flag)
@@ -158,9 +197,15 @@ static void spi_geni_set_cs(struct spi_device *slv, bool set_flag)
 	if (set_flag == mas->cs_flag)
 		return;
 
+	pm_runtime_get_sync(mas->dev);
+
+	if (spi_geni_is_abort_still_pending(mas)) {
+		dev_err(mas->dev, "Can't set chip select\n");
+		goto exit;
+	}
+
 	mas->cs_flag = set_flag;
 
-	pm_runtime_get_sync(mas->dev);
 	spin_lock_irq(&mas->lock);
 	reinit_completion(&mas->cs_done);
 	if (set_flag)
@@ -173,6 +218,7 @@ static void spi_geni_set_cs(struct spi_device *slv, bool set_flag)
 	if (!time_left)
 		handle_fifo_timeout(spi, NULL);
 
+exit:
 	pm_runtime_put(mas->dev);
 }
 
@@ -280,6 +326,9 @@ static int spi_geni_prepare_message(struct spi_master *spi,
 	int ret;
 	struct spi_geni_master *mas = spi_master_get_devdata(spi);
 
+	if (spi_geni_is_abort_still_pending(mas))
+		return -EBUSY;
+
 	ret = setup_fifo_params(spi_msg->spi, spi);
 	if (ret)
 		dev_err(mas->dev, "Couldn't select mode %d\n", ret);
@@ -509,6 +558,9 @@ static int spi_geni_transfer_one(struct spi_master *spi,
 {
 	struct spi_geni_master *mas = spi_master_get_devdata(spi);
 
+	if (spi_geni_is_abort_still_pending(mas))
+		return -EBUSY;
+
 	/* Terminate and return success for 0 byte length transfer */
 	if (!xfer->len)
 		return 0;
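The gating behaviour the diff adds can be tried outside the kernel with a small model. The sketch below is not the driver code: `struct fake_geni`, `abort_still_pending()`, `fake_transfer_one()` and `fake_isr()` are hypothetical names, and plain struct fields stand in for the `SE_GENI_M_IRQ_STATUS` and `SE_GENI_M_IRQ_EN` register reads. It only aims to show that a transfer is refused with `-EBUSY` while an enabled interrupt is still pending, and that the flag clears itself once nothing is pending.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model of the driver state; not the kernel structs. */
struct fake_geni {
	uint32_t m_irq_status; /* stands in for SE_GENI_M_IRQ_STATUS */
	uint32_t m_irq_en;     /* stands in for SE_GENI_M_IRQ_EN */
	bool abort_failed;     /* set when the abort wait timed out */
};

/* Mirrors the shape of the check in the diff: only enabled IRQs count. */
static bool abort_still_pending(struct fake_geni *g)
{
	if (!g->abort_failed)
		return false;

	if (g->m_irq_status & g->m_irq_en)
		return true;

	/* Nothing enabled is pending any more, so stop checking. */
	g->abort_failed = false;
	return false;
}

/* Stand-in for a transfer entry point that is gated on the flag. */
static int fake_transfer_one(struct fake_geni *g)
{
	if (abort_still_pending(g))
		return -EBUSY;
	return 0;
}

/* Model the delayed interrupt handler finally running and acking everything. */
static void fake_isr(struct fake_geni *g)
{
	g->m_irq_status = 0;
}
```

In this model a pending bit that is not enabled does not block transfers, because only `m_irq_status & m_irq_en` is tested.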
If we got a timeout when trying to send an abort command then it means
that we just got 3 timeouts in a row:

1. The original timeout that caused handle_fifo_timeout() to be
   called.
2. A one second timeout waiting for the cancel command to finish.
3. A one second timeout waiting for the abort command to finish.

SPI is clocked by the controller, so nothing (aside from a hardware
fault or a totally broken sequencer) should be causing the actual
commands to fail in hardware. However, even though the hardware
itself is not expected to fail (and it'd be hard to predict how we
should handle things if it did), it's easy to hit the timeout case by
simply blocking our interrupt handler from running for a long period
of time. Obviously the system is in pretty bad shape if an interrupt
handler is blocked for > 2 seconds, but there are certainly bugs (even
bugs in other unrelated drivers) that can make this happen.

Let's make things a bit more robust against this case. If we fail to
abort we'll set a flag and then we'll block all future transfers until
we have no more interrupts pending.

Fixes: 561de45f72bd ("spi: spi-geni-qcom: Add SPI driver support for GENI based QUP")
Signed-off-by: Douglas Anderson <dianders@chromium.org>
---

Changes in v2:
- Make this just about the failed abort.

 drivers/spi/spi-geni-qcom.c | 56 +++++++++++++++++++++++++++++++++++--
 1 file changed, 54 insertions(+), 2 deletions(-)
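The chain of timeouts described in the commit message can be sketched as a small userspace model. It assumes, as the message states, that the cancel and abort waits are one second each. `struct fake_ctrl`, `fake_wait()` and `fake_handle_fifo_timeout()` are hypothetical names, and `fake_wait()` only imitates the "returns 0 on timeout" convention of `wait_for_completion_timeout()`. Step 1, the original transfer timeout, is what leads to the call in the first place.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the controller state; not the kernel structure. */
struct fake_ctrl {
	bool isr_blocked;       /* the interrupt handler cannot run */
	bool abort_failed;      /* set when even the abort wait times out */
	unsigned int waited_ms; /* total time spent waiting in this model */
};

/*
 * Stand-in for a completion wait: returns 0 on timeout, non-zero if the
 * (simulated) interrupt handler signalled completion in time.
 */
static unsigned int fake_wait(struct fake_ctrl *c, unsigned int timeout_ms)
{
	if (c->isr_blocked) {
		c->waited_ms += timeout_ms;
		return 0;
	}
	return timeout_ms;
}

/* Called after the original transfer timeout (step 1). */
static void fake_handle_fifo_timeout(struct fake_ctrl *c)
{
	/* Step 2: send cancel and wait one second for it to finish. */
	if (fake_wait(c, 1000))
		return;

	/* Step 3: cancel timed out, so send abort and wait one second. */
	if (!fake_wait(c, 1000))
		c->abort_failed = true;
}
```

With `isr_blocked` set, the model spends 2000 ms in the two waits before `abort_failed` is raised, which matches the "blocked for > 2 seconds" remark above; with the handler running, the cancel completes and the flag stays clear.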