Message ID | 20200317133653.v2.1.I752ebdcfd5e8bf0de06d66e767b8974932b3620e@changeid (mailing list archive)
---|---
State | New, archived
Series | [v2] spi: spi-geni-qcom: Speculative fix of "nobody cared" about interrupt
Quoting Douglas Anderson (2020-03-17 13:37:06) > Every once in a while (like once in a few months on a device) people > have seen warnings on devices using spi-geni-qcom like: > > irq ...: nobody cared (try booting with the "irqpoll" option) > > ...where the interrupt number listed matches up with "spi_geni" in > /proc/interrupts. > > You can get "nobody cared" if the interrupt handler returns IRQ_NONE. > Once you get into this state the driver will just stop working. > > Auditing the code makes me believe that it's probably not so good > checking "cur_mcmd" in the interrupt handler not under spinlock. > Let's move it to be under spinlock. Looking more closely at the code, > it looks as if there are some other places that could be under > spinlock, so let's add. It looks as if the original code was assuming > that by the time that the interrupt handler got called that the write > to "cur_mcmd" (to make it not CMD_NONE anymore) would make it to the > processor handling the interrupt. Perhaps with weakly ordered memory > this sometimes (though very rarely) happened. > > Let's also add a warning (with the IRQ status) in the case that we > ever end up getting an interrupt when we think we shouldn't. This > would give us a better chance to debug if this patch doesn't help the > issue. We'll also try our best to clear the interrupt to hopefully > get us out of this state. > > Patch is marked "speculative" because I have no way to reproduce this > problem, so I'm just hoping this fixes it. Weakly ordered memory > makes my brain hurt. It could be that. It could also be the poor design of geni_se_init() and how it enables many interrupts that this driver doesn't look to handle? Why do we allow the wrapper to enable all those bits in M_COMMON_GENI_M_IRQ_EN and S_COMMON_GENI_S_IRQ_EN if nobody is going to handle them? 
> > Fixes: 561de45f72bd ("spi: spi-geni-qcom: Add SPI driver support for GENI based QUP") > Signed-off-by: Douglas Anderson <dianders@chromium.org> > --- > > Changes in v2: > - Detect true spurious interrupt. > - Still return IRQ_NONE for state machine mismatch, but print warn. > > drivers/spi/spi-geni-qcom.c | 35 ++++++++++++++++++++++++++++++----- > 1 file changed, 30 insertions(+), 5 deletions(-) > > diff --git a/drivers/spi/spi-geni-qcom.c b/drivers/spi/spi-geni-qcom.c > index c3972424af71..92e51ccb427f 100644 > --- a/drivers/spi/spi-geni-qcom.c > +++ b/drivers/spi/spi-geni-qcom.c > @@ -151,16 +151,19 @@ static void spi_geni_set_cs(struct spi_device *slv, bool set_flag) > struct geni_se *se = &mas->se; > unsigned long time_left; > > - reinit_completion(&mas->xfer_done); > pm_runtime_get_sync(mas->dev); > if (!(slv->mode & SPI_CS_HIGH)) > set_flag = !set_flag; > > + spin_lock_irq(&mas->lock); Why is this spin_lock_irq() vs. spin_lock_irqsave()? This isn't possible to be called from somewhere that hasn't changed irq flags? > + reinit_completion(&mas->xfer_done); > + > mas->cur_mcmd = CMD_CS; > if (set_flag) > geni_se_setup_m_cmd(se, SPI_CS_ASSERT, 0); > else > geni_se_setup_m_cmd(se, SPI_CS_DEASSERT, 0); > + spin_unlock_irq(&mas->lock); This will force on interrupts if they were masked. 
> > time_left = wait_for_completion_timeout(&mas->xfer_done, HZ); > if (!time_left) > @@ -307,6 +310,8 @@ static void setup_fifo_xfer(struct spi_transfer *xfer, > u32 spi_tx_cfg, len; > struct geni_se *se = &mas->se; > > + spin_lock_irq(&mas->lock); > + > spi_tx_cfg = readl(se->base + SE_SPI_TRANS_CFG); > if (xfer->bits_per_word != mas->cur_bits_per_word) { > spi_setup_word_len(mas, mode, xfer->bits_per_word); > @@ -376,6 +381,8 @@ static void setup_fifo_xfer(struct spi_transfer *xfer, > */ > if (m_cmd & SPI_TX_ONLY) > writel(mas->tx_wm, se->base + SE_GENI_TX_WATERMARK_REG); > + > + spin_unlock_irq(&mas->lock); > } > > static int spi_geni_transfer_one(struct spi_master *spi, > @@ -478,13 +485,29 @@ static irqreturn_t geni_spi_isr(int irq, void *data) > struct geni_se *se = &mas->se; > u32 m_irq; > unsigned long flags; > - > - if (mas->cur_mcmd == CMD_NONE) > - return IRQ_NONE; > + irqreturn_t ret = IRQ_HANDLED; > > spin_lock_irqsave(&mas->lock, flags); > m_irq = readl(se->base + SE_GENI_M_IRQ_STATUS); Does this read need to be inside the lock? > > + /* Check for spurious interrupt */ > + if (!m_irq) { > + ret = IRQ_NONE; > + goto exit; I ask because it could be simplified by reading the status and then immediately returning IRQ_NONE if no bits are set without having to do the lock/unlock dance. And does writing 0 to the irq clear register do anything? > + } > + > + /* > + * If we got a real interrupt but software state machine thinks > + * we were idle then give a warning. We'll do our best to clear > + * the interrupt but we'll still return IRQ_NONE. If this keeps > + * happening the system will eventually disable us. 
> + */ > + if (mas->cur_mcmd == CMD_NONE) { > + pr_warn("Unexpected SPI interrupt: %#010x\n", m_irq); > + ret = IRQ_NONE; > + goto exit; > + } > + > if ((m_irq & M_RX_FIFO_WATERMARK_EN) || (m_irq & M_RX_FIFO_LAST_EN)) > geni_spi_handle_rx(mas); > > @@ -523,9 +546,11 @@ static irqreturn_t geni_spi_isr(int irq, void *data) > complete(&mas->xfer_done); > } > > +exit: > writel(m_irq, se->base + SE_GENI_M_IRQ_CLEAR); > spin_unlock_irqrestore(&mas->lock, flags); > - return IRQ_HANDLED; > + > + return ret; > } > > static int spi_geni_probe(struct platform_device *pdev)
Hi, On Tue, Mar 17, 2020 at 2:36 PM Stephen Boyd <swboyd@chromium.org> wrote: > > > Patch is marked "speculative" because I have no way to reproduce this > > problem, so I'm just hoping this fixes it. Weakly ordered memory > > makes my brain hurt. > > It could be that. It could also be the poor design of geni_se_init() and > how it enables many interrupts that this driver doesn't look to handle? > Why do we allow the wrapper to enable all those bits in > M_COMMON_GENI_M_IRQ_EN and S_COMMON_GENI_S_IRQ_EN if nobody is going to > handle them? It is possible it's related to an interrupt that we don't handle. However: * IMO having the locking in place is safer anyway. At some point I read that advice that trying to reason about weakly ordered memory was simply too hard for the average person (even the average kernel developer). In 99% of the cases you could just use a lock so it's super clear and the performance difference is near zero. * Most of the cases I saw of the "nobody cared" for geni-spi was on a mostly idle system (presumably still doing periodic SPI transactions to the EC, though). It seems weird that one of these other interrupts would suddenly fire. It seems more likely that we just happened to win a race of some sort. If nothing else it will suddenly become very obvious after my patch lands because I'll print out the status. That all being said if someone wants to submit a patch to clean up which interrupts are enabled I'd support it. > > @@ -151,16 +151,19 @@ static void spi_geni_set_cs(struct spi_device *slv, bool set_flag) > > struct geni_se *se = &mas->se; > > unsigned long time_left; > > > > - reinit_completion(&mas->xfer_done); > > pm_runtime_get_sync(mas->dev); > > if (!(slv->mode & SPI_CS_HIGH)) > > set_flag = !set_flag; > > > > + spin_lock_irq(&mas->lock); > > Why is this spin_lock_irq() vs. spin_lock_irqsave()? This isn't possible > to be called from somewhere that hasn't changed irq flags? See below. 
> > + reinit_completion(&mas->xfer_done); > > + > > mas->cur_mcmd = CMD_CS; > > if (set_flag) > > geni_se_setup_m_cmd(se, SPI_CS_ASSERT, 0); > > else > > geni_se_setup_m_cmd(se, SPI_CS_DEASSERT, 0); > > + spin_unlock_irq(&mas->lock); > > This will force on interrupts if they were masked. I'll change it if you want, but in this function there is already a call to "wait_for_completion_timeout". That's not gonna be too happy if this function is ever called with interrupts already masked. Also in this function is pm_runtime_get_sync() which in many cases will sleep (I think we'll end up in geni_se_clks_on() which calls clk_bulk_prepare_enable()). In cases where you know for sure that interrupts aren't masked, spin_lock_irq() and spin_unlock_irq() are fine and that's what they're for, no? > > time_left = wait_for_completion_timeout(&mas->xfer_done, HZ); > > if (!time_left) > > @@ -307,6 +310,8 @@ static void setup_fifo_xfer(struct spi_transfer *xfer, > > u32 spi_tx_cfg, len; > > struct geni_se *se = &mas->se; > > > > + spin_lock_irq(&mas->lock); ...and just to answer the same question for here: setup_fifo_xfer() is called from spi_geni_transfer_one() which is our "transfer_one" function. We don't happen to block anywhere in these functions, but I'm nearly certain you are allowed to block in them. We actually return a positive number to indicate to the SPI core that we're not doing the blocking ourselves but since the SPI core can't know we were going to do that it has to assume we might block. > > @@ -478,13 +485,29 @@ static irqreturn_t geni_spi_isr(int irq, void *data) > > struct geni_se *se = &mas->se; > > u32 m_irq; > > unsigned long flags; > > - > > - if (mas->cur_mcmd == CMD_NONE) > > - return IRQ_NONE; > > + irqreturn_t ret = IRQ_HANDLED; > > > > spin_lock_irqsave(&mas->lock, flags); Ironically the above could probably just be "spin_lock" since this is our interrupt handler. ;-) I'll just leave it alone though since what's there now doesn't hurt. 
> > m_irq = readl(se->base + SE_GENI_M_IRQ_STATUS); > > Does this read need to be inside the lock? Probably not. Discussion below. > > + /* Check for spurious interrupt */ > > + if (!m_irq) { > > + ret = IRQ_NONE; > > + goto exit; > > I ask because it could be simplified by reading the status and then > immediately returning IRQ_NONE if no bits are set without having to do > the lock/unlock dance. And does writing 0 to the irq clear register do > anything? Sure. I'll move it if you want. It felt nicer to just keep the whole thing under lock so I didn't have to think about whether it mattered. ...and anyone else looking at it didn't need to think about if it mattered, too. It is very easy to say that it doesn't _hurt_ to have it under lock other than having one extra memory read under lock. ...and presumably the case where the optimization matters is incredibly rare (a spurious interrupt) and we just spent a whole lot of cycles calling the interrupt handler to begin with for this spurious interrupt. I would have a hard time believing that a write of 0 to a "write 1 to clear" register would have any impact. It would be a pretty bad hardware design... So I guess in summary, I'm not planning to spin for any of these things unless you really insist or you say I'm wrong about something above or someone else says my opinions are the wrong opinions. -Doug
Quoting Doug Anderson (2020-03-17 15:08:45) > Hi, > > On Tue, Mar 17, 2020 at 2:36 PM Stephen Boyd <swboyd@chromium.org> wrote: > > > > > Patch is marked "speculative" because I have no way to reproduce this > > > problem, so I'm just hoping this fixes it. Weakly ordered memory > > > makes my brain hurt. > > > > It could be that. It could also be the poor design of geni_se_init() and > > how it enables many interrupts that this driver doesn't look to handle? > > Why do we allow the wrapper to enable all those bits in > > M_COMMON_GENI_M_IRQ_EN and S_COMMON_GENI_S_IRQ_EN if nobody is going to > > handle them? > > It is possible it's related to an interrupt that we don't handle. However: > > * IMO having the locking in place is safer anyway. At some point I > read that advice that trying to reason about weakly ordered memory was > simply too hard for the average person (even the average kernel > developer). In 99% of the cases you could just use a lock so it's > super clear and the performance difference is near zero. > > * Most of the cases I saw of the "nobody cared" for geni-spi was on a > mostly idle system (presumably still doing periodic SPI transactions > to the EC, though). It seems weird that one of these other interrupts > would suddenly fire. It seems more likely that we just happened to > win a race of some sort. Sure, I'm mostly interested in understanding what that race is. I think it is something like cur_mcmd being stale when it's tested in the irq handler because the IO access triggers the irq before mas->cur_mcmd can be updated due to out of order execution. Nothing stops spi_geni_set_cs() from being reordered like this, especially because there isn't any sort of barrier besides the compiler barrier of asm volatile inside geni_se_setup_m_cmd(): CPU0 (device write overtakes CPU) CPU1 ---- ---- mas->cur_mcmd = CMD_NONE; geni_se_setup_m_cmd(...) 
geni_spi_isr() <tests cur_mcmd and it's still NONE> mas->cur_mcmd = CMD_CS; > > If nothing else it will suddenly become very obvious after my patch > lands because I'll print out the status. Fair enough. > > > That all being said if someone wants to submit a patch to clean up > which interrupts are enabled I'd support it. Great! > > > > > @@ -151,16 +151,19 @@ static void spi_geni_set_cs(struct spi_device *slv, bool set_flag) > > > struct geni_se *se = &mas->se; > > > unsigned long time_left; > > > > > > - reinit_completion(&mas->xfer_done); > > > pm_runtime_get_sync(mas->dev); > > > if (!(slv->mode & SPI_CS_HIGH)) > > > set_flag = !set_flag; > > > > > > + spin_lock_irq(&mas->lock); > > > > Why is this spin_lock_irq() vs. spin_lock_irqsave()? This isn't possible > > to be called from somewhere that hasn't changed irq flags? > > See below. > > > > > + reinit_completion(&mas->xfer_done); > > > + > > > mas->cur_mcmd = CMD_CS; > > > if (set_flag) > > > geni_se_setup_m_cmd(se, SPI_CS_ASSERT, 0); > > > else > > > geni_se_setup_m_cmd(se, SPI_CS_DEASSERT, 0); > > > + spin_unlock_irq(&mas->lock); > > > > This will force on interrupts if they were masked. > > I'll change it if you want, but in this function there is already a > call to "wait_for_completion_timeout". That's not gonna be too happy > if this function is ever called with interrupts already masked. Also > in this function is pm_runtime_get_sync() which in many cases will > sleep (I think we'll end up in geni_se_clks_on() which calls > clk_bulk_prepare_enable()). > > In cases where you know for sure that interrupts aren't masked, > spin_lock_irq() and spin_unlock_irq() are fine and that's what they're > for, no? 
> > > > > time_left = wait_for_completion_timeout(&mas->xfer_done, HZ); > > > if (!time_left) > > > @@ -307,6 +310,8 @@ static void setup_fifo_xfer(struct spi_transfer *xfer, > > > u32 spi_tx_cfg, len; > > > struct geni_se *se = &mas->se; > > > > > > + spin_lock_irq(&mas->lock); > > ...and just to answer the same question for here: setup_fifo_xfer() is > called from spi_geni_transfer_one() which is our "transfer_one" > function. We don't happen to block anywhere in these functions, but > I'm nearly certain you are allowed to block in them. We actually > return a positive number to indicate to the SPI core that we're not > doing the blocking ourselves but since the SPI core can't know we were > going to do that it has to assume we might block. Sounds good to not use irq save/restore here. Thanks for confirming. > > > > > @@ -478,13 +485,29 @@ static irqreturn_t geni_spi_isr(int irq, void *data) > > > struct geni_se *se = &mas->se; > > > u32 m_irq; > > > unsigned long flags; > > > - > > > - if (mas->cur_mcmd == CMD_NONE) > > > - return IRQ_NONE; > > > + irqreturn_t ret = IRQ_HANDLED; > > > > > > spin_lock_irqsave(&mas->lock, flags); > > Ironically the above could probably just be "spin_lock" since this is > our interrupt handler. ;-) I'll just leave it alone though since > what's there now doesn't hurt. Yes that's another thing that can be done. > > > > > m_irq = readl(se->base + SE_GENI_M_IRQ_STATUS); > > > > Does this read need to be inside the lock? > > Probably not. Discussion below. > > > > > + /* Check for spurious interrupt */ > > > + if (!m_irq) { > > > + ret = IRQ_NONE; > > > + goto exit; > > > > I ask because it could be simplified by reading the status and then > > immediately returning IRQ_NONE if no bits are set without having to do > > the lock/unlock dance. And does writing 0 to the irq clear register do > > anything? > > Sure. I'll move it if you want. 
It felt nicer to just keep the whole > thing under lock so I didn't have to think about whether it mattered. > ...and anyone else looking at it didn't need to think about if it > mattered, too. It is very easy to say that it doesn't _hurt_ to > have it under lock other than having one extra memory read under lock. > ...and presumably the case where the optimization matters is > incredibly rare (a spurious interrupt) and we just spent a whole lot > of cycles calling the interrupt handler to begin with for this > spurious interrupt. > > I would have a hard time believing that a write of 0 to a "write 1 to > clear" register would have any impact. It would be a pretty bad > hardware design... This is a writel to a device so it may take some time for the memory barrier to drain any other writes with the full memory barrier in there. Not sure if we care about getting super high performance here but that's a concern. > > > So I guess in summary, I'm not planning to spin for any of these > things unless you really insist or you say I'm wrong about something > above or someone else says my opinions are the wrong opinions. > Why do we need such large locking areas? Why can't we remove the enum software tracking stuff and look at mas->cur_xfer to figure out if a cs change is happening or not? I suppose mas->cur_xfer could be stale, so we need to make sure that is modified and checked under a lock, and then mas->rx_rem_bytes/tx_rem_bytes also need to be under a lock because they're modified inside and outside the irq handler, but otherwise I don't see any need to hold the lock over the register reads/writes. Most other things are synchronized with the completions or in higher layers of the SPI core. I do wonder if we need to hold the lock again when we update the rx/tx remaining bytes counter but I can't convince myself that the irq can run on another CPU in parallel with the CPU that ran the irq first. 
That seems impossible given that the interrupt would have to come again and the irq controller would need to send it to a different CPU at the same time. We should only need to grab the lock to make sure that cur_xfer and remaining bytes is published by the CPU triggering the rx/tx transfers and then assume the irq handler code is synchronous with itself. ------8<----- diff --git a/drivers/spi/spi-geni-qcom.c b/drivers/spi/spi-geni-qcom.c index f0ca7f5ae714..ac7da801a63d 100644 --- a/drivers/spi/spi-geni-qcom.c +++ b/drivers/spi/spi-geni-qcom.c @@ -64,13 +64,6 @@ #define TIMESTAMP_AFTER BIT(3) #define POST_CMD_DELAY BIT(4) -enum spi_m_cmd_opcode { - CMD_NONE, - CMD_XFER, - CMD_CS, - CMD_CANCEL, -}; - struct spi_geni_master { struct geni_se se; struct device *dev; @@ -84,8 +77,7 @@ struct spi_geni_master { const struct spi_transfer *cur_xfer; struct completion xfer_done; unsigned int oversampling; - spinlock_t lock; - enum spi_m_cmd_opcode cur_mcmd; + spinlock_t lock; /* Protects cur_xfer, {tx,rx}_rem_bytes */ int irq; }; @@ -123,23 +115,18 @@ static void handle_fifo_timeout(struct spi_master *spi, struct spi_message *msg) { struct spi_geni_master *mas = spi_master_get_devdata(spi); - unsigned long time_left, flags; + unsigned long time_left; struct geni_se *se = &mas->se; - spin_lock_irqsave(&mas->lock, flags); reinit_completion(&mas->xfer_done); - mas->cur_mcmd = CMD_CANCEL; geni_se_cancel_m_cmd(se); writel(0, se->base + SE_GENI_TX_WATERMARK_REG); - spin_unlock_irqrestore(&mas->lock, flags); time_left = wait_for_completion_timeout(&mas->xfer_done, HZ); if (time_left) return; - spin_lock_irqsave(&mas->lock, flags); reinit_completion(&mas->xfer_done); geni_se_abort_m_cmd(se); - spin_unlock_irqrestore(&mas->lock, flags); time_left = wait_for_completion_timeout(&mas->xfer_done, HZ); if (!time_left) dev_err(mas->dev, "Failed to cancel/abort m_cmd\n"); @@ -157,7 +144,6 @@ static void spi_geni_set_cs(struct spi_device *slv, bool set_flag) if (!(slv->mode & SPI_CS_HIGH)) 
set_flag = !set_flag; - mas->cur_mcmd = CMD_CS; if (set_flag) geni_se_setup_m_cmd(se, SPI_CS_ASSERT, 0); else @@ -307,6 +293,7 @@ static void setup_fifo_xfer(struct spi_transfer *xfer, u32 m_cmd = 0; u32 spi_tx_cfg, len; struct geni_se *se = &mas->se; + unsigned int rx_len = 0, tx_len = 0; spi_tx_cfg = readl(se->base + SE_SPI_TRANS_CFG); if (xfer->bits_per_word != mas->cur_bits_per_word) { @@ -339,14 +326,14 @@ static void setup_fifo_xfer(struct spi_transfer *xfer, writel(m_clk_cfg, se->base + GENI_SER_M_CLK_CFG); } - mas->tx_rem_bytes = 0; - mas->rx_rem_bytes = 0; - if (xfer->tx_buf && xfer->rx_buf) - m_cmd = SPI_FULL_DUPLEX; - else if (xfer->tx_buf) - m_cmd = SPI_TX_ONLY; - else if (xfer->rx_buf) - m_cmd = SPI_RX_ONLY; + if (xfer->tx_buf) { + m_cmd |= SPI_TX_ONLY; + tx_len = xfer->len; + } + if (xfer->rx_buf) { + m_cmd |= SPI_RX_ONLY; + rx_len = xfer->len; + } spi_tx_cfg &= ~CS_TOGGLE; @@ -356,18 +343,17 @@ static void setup_fifo_xfer(struct spi_transfer *xfer, len = xfer->len / (mas->cur_bits_per_word / BITS_PER_BYTE + 1); len &= TRANS_LEN_MSK; + spin_lock_irq(&mas->lock); mas->cur_xfer = xfer; - if (m_cmd & SPI_TX_ONLY) { - mas->tx_rem_bytes = xfer->len; - writel(len, se->base + SE_SPI_TX_TRANS_LEN); - } + mas->tx_rem_bytes = tx_len; + mas->rx_rem_bytes = rx_len; + spin_unlock_irq(&mas->lock); - if (m_cmd & SPI_RX_ONLY) { + if (m_cmd & SPI_TX_ONLY) + writel(len, se->base + SE_SPI_TX_TRANS_LEN); + if (m_cmd & SPI_RX_ONLY) writel(len, se->base + SE_SPI_RX_TRANS_LEN); - mas->rx_rem_bytes = xfer->len; - } writel(spi_tx_cfg, se->base + SE_SPI_TRANS_CFG); - mas->cur_mcmd = CMD_XFER; geni_se_setup_m_cmd(se, m_cmd, FRAGMENTATION); /* @@ -416,10 +402,12 @@ static void geni_spi_handle_tx(struct spi_geni_master *mas) unsigned int i = 0; max_bytes = (mas->tx_fifo_depth - mas->tx_wm) * bytes_per_fifo_word; + spin_lock(&mas->lock); if (mas->tx_rem_bytes < max_bytes) max_bytes = mas->tx_rem_bytes; tx_buf = mas->cur_xfer->tx_buf + mas->cur_xfer->len - mas->tx_rem_bytes; + 
spin_unlock(&mas->lock); while (i < max_bytes) { unsigned int j; unsigned int bytes_to_write; @@ -454,10 +442,13 @@ static void geni_spi_handle_rx(struct spi_geni_master *mas) if (rx_last_byte_valid && rx_last_byte_valid < 4) rx_bytes -= bytes_per_fifo_word - rx_last_byte_valid; } + spin_lock(&mas->lock); if (mas->rx_rem_bytes < rx_bytes) rx_bytes = mas->rx_rem_bytes; rx_buf = mas->cur_xfer->rx_buf + mas->cur_xfer->len - mas->rx_rem_bytes; + spin_unlock(&mas->lock); + while (i < rx_bytes) { u32 fifo_word = 0; u8 *fifo_byte = (u8 *)&fifo_word; @@ -478,13 +469,11 @@ static irqreturn_t geni_spi_isr(int irq, void *data) struct spi_geni_master *mas = spi_master_get_devdata(spi); struct geni_se *se = &mas->se; u32 m_irq; - unsigned long flags; + bool cs_change = false; - if (mas->cur_mcmd == CMD_NONE) - return IRQ_NONE; - - spin_lock_irqsave(&mas->lock, flags); m_irq = readl(se->base + SE_GENI_M_IRQ_STATUS); + if (!m_irq) + return IRQ_NONE; if ((m_irq & M_RX_FIFO_WATERMARK_EN) || (m_irq & M_RX_FIFO_LAST_EN)) geni_spi_handle_rx(mas); @@ -493,39 +482,48 @@ static irqreturn_t geni_spi_isr(int irq, void *data) geni_spi_handle_tx(mas); if (m_irq & M_CMD_DONE_EN) { - if (mas->cur_mcmd == CMD_XFER) + spin_lock(&mas->lock); + if (mas->cur_xfer) { spi_finalize_current_transfer(spi); - else if (mas->cur_mcmd == CMD_CS) - complete(&mas->xfer_done); - mas->cur_mcmd = CMD_NONE; - /* - * If this happens, then a CMD_DONE came before all the Tx - * buffer bytes were sent out. This is unusual, log this - * condition and disable the WM interrupt to prevent the - * system from stalling due an interrupt storm. - * If this happens when all Rx bytes haven't been received, log - * the condition. - * The only known time this can happen is if bits_per_word != 8 - * and some registers that expect xfer lengths in num spi_words - * weren't written correctly. - */ - if (mas->tx_rem_bytes) { - writel(0, se->base + SE_GENI_TX_WATERMARK_REG); - dev_err(mas->dev, "Premature done. 
tx_rem = %d bpw%d\n", - mas->tx_rem_bytes, mas->cur_bits_per_word); + mas->cur_xfer = NULL; + /* + * If this happens, then a CMD_DONE came before all the + * Tx buffer bytes were sent out. This is unusual, log + * this condition and disable the WM interrupt to + * prevent the system from stalling due an interrupt + * storm. + * + * If this happens when all Rx bytes haven't been + * received, log the condition. + * + * The only known time this can happen is if + * bits_per_word != 8 and some registers that expect + * xfer lengths in num spi_words weren't written + * correctly. + */ + if (mas->tx_rem_bytes) { + writel(0, se->base + SE_GENI_TX_WATERMARK_REG); + dev_err(mas->dev, "Premature done. tx_rem = %d bpw%d\n", + mas->tx_rem_bytes, + mas->cur_bits_per_word); + } + if (mas->rx_rem_bytes) { + dev_err(mas->dev, "Premature done. rx_rem = %d bpw%d\n", + mas->rx_rem_bytes, + mas->cur_bits_per_word); + } + } else { + cs_change = true; } - if (mas->rx_rem_bytes) - dev_err(mas->dev, "Premature done. rx_rem = %d bpw%d\n", - mas->rx_rem_bytes, mas->cur_bits_per_word); + spin_unlock(&mas->lock); } - if ((m_irq & M_CMD_CANCEL_EN) || (m_irq & M_CMD_ABORT_EN)) { - mas->cur_mcmd = CMD_NONE; + /* Wake up spi_geni_set_cs() or handle_fifo_timeout() */ + if (cs_change || (m_irq & M_CMD_CANCEL_EN) || (m_irq & M_CMD_ABORT_EN)) complete(&mas->xfer_done); - } writel(m_irq, se->base + SE_GENI_M_IRQ_CLEAR); - spin_unlock_irqrestore(&mas->lock, flags); + return IRQ_HANDLED; }
Hi, On Wed, Mar 18, 2020 at 1:04 AM Stephen Boyd <swboyd@chromium.org> wrote: > > > * Most of the cases I saw of the "nobody cared" for geni-spi was on a > > mostly idle system (presumably still doing periodic SPI transactions > > to the EC, though). It seems weird that one of these other interrupts > > would suddenly fire. It seems more likely that we just happened to > > win a race of some sort. > > Sure, I'm mostly interested in understanding what that race is. I think > it is something like cur_mcmd being stale when it's tested in the irq > handler because the IO access triggers the irq before mas->cur_mcmd can > be updated due to out of order execution. Nothing stops > spi_geni_set_cs() from being reordered like this, especially because > there isn't any sort of barrier besides the compiler barrier of asm > volatile inside geni_set_setup_m_cmd(): > > CPU0 (device write overtakes CPU) CPU1 > ---- ---- > mas->cur_mcmd = CMD_NONE; > geni_se_setup_m_cmd(...) > geni_spi_isr() > <tests cur_mcmd and it's still NONE> > mas->cur_mcmd = CMD_CS; This is my assumption about what must have happened, yes. I have no proof, though. > > > > + /* Check for spurious interrupt */ > > > > + if (!m_irq) { > > > > + ret = IRQ_NONE; > > > > + goto exit; > > > > > > I ask because it could be simplified by reading the status and then > > > immediately returning IRQ_NONE if no bits are set without having to do > > > the lock/unlock dance. And does writing 0 to the irq clear register do > > > anything? > > > > Sure. I'll move it if you want. It felt nicer to just keep the whole > > thing under lock so I didn't have to think about whether it mattered. > > ...and anyone else looking at it didn't need to think about if it > > mattered, too. It it is very easy to say that it doesn't _hurt_ to > > have it under lock other than having one extra memory read under lock. 
> > ...and presumably the case where the optimization matters is > > incredibly rare (a spurious interrupt) and we just spent a whole lot > > of cycles calling the interrupt handler to begin with for this > > spurious interrupt. > > > > I would have a hard time believing that a write of 0 to a "write 1 to > > clear" register would have any impact. It would be a pretty bad > > hardware design... > > This is a writel to a device so it may take some time for the memory > barrier to drain any other writes with the full memory barrier in there. > Not sure if we care about getting super high performance here but that's > a concern. Right, but I guess the spurious interrupt case wasn't one I'm optimizing for. ...and if it's a real problem we should figure out how to avoid so many spurious interrupts. Presumably all the overhead that the Linux kernel did to call our interrupt handler in the first place is much more serious. > > So I guess in summary, I'm not planning to spin for any of these > > things unless you really insist or you say I'm wrong about something > > above or someone else says my opinions are the wrong opinions. > > > > Why do we need such large locking areas? We probably don't, but it's unlikely to matter at all and the large locking areas mean we don't have to spend time thinking about it. It's super clear that the locking is safe. In general there shouldn't be tons of downside to have large locking areas because there shouldn't be contention for this lock. As you said, much of this is synchronized at higher levels, so mostly we're just worried about synchronizing with our interrupt handler and one would hope the interrupt handler wasn't firing at times we don't expect. ...but I guess there is the downside that local interrupts are fully disabled for this whole large locking area and that does feel like something that might be worth optimizing for... In theory we _could_ just have a lock around updating 'cur_mcmd' and kicking off the transfer. 
That would make sure that 'cur_mcmd' was definitely updated by the time the interrupt fired. It feels at least a little awkward to me, though, because there is more shared state between the interrupt handler and the functions kicking off the transfers and we're really just using the lock as a poor man's memory barrier, but maybe it's OK. I can give a shot at this approach if it feels viable to you. > Why can't we remove the enum > software tracking stuff and look at mas->cur_xfer to figure out if a cs > change is happening or not? I suppose mas->cur_xfer could be stale, so > we need to make sure that is modified and checked under a lock, and then > mas->rx_rem_bytes/tx_rem_bytes also need to be under a lock because > they're modified inside and outside the irq handler, but otherwise I > don't see any need to hold the lock over the register reads/writes. Most > other things are synchronized with the completions or in higher layers > of the SPI core. In general it's a big-ish change to code that's largely already tested and one that (IMO) makes readability slightly worse. It can be worth it to make big changes for big gain, but I don't see it here. It might have a slight performance boost, but seems like premature optimization since there's no evidence that the existing performance (even after my big locking patch) is a problem. If we want to go forward with your suggestion, I'd have to spend more time thinking about the cancel/abort cases, especially, which I'm not sure how to test super well. Specifically, does cancel/abort cause the "done" bit to be set? If so your new code will finalize the current transfer, but the old code didn't. I have no idea if this matters. You also removed the spinlock from handle_fifo_timeout() and I don't know if (for instance) it was important to clear the TX watermark register after firing off the abort command but before the interrupt handler could run. 
It also doesn't necessarily appear that reinit_completion() has any safety, so is there some chance that the interrupt could fire and we'd still have the re-init pending on our side, then clear "done" back to zero after the interrupt handler completed us? ...one extra thing to think about with all this is that (I think) handle_fifo_timeout() will be called if someone kicks off a SPI transfer and then hits "Ctrl-C" partway. In this case spi_transfer_wait() will return -EINTR and I think that will change the chip select and then call "handle_err". I have no idea if that works well today, but I have a feeling it will be harder to get right without an extra state variable. > I do wonder if we need to hold the lock again when we update the rx/tx > remaining bytes counter but I can't convince myself that the irq can run > on another CPU in parallel with the CPU that ran the irq first. That > seems impossible given that the interrupt would have to come again and > the irq controller would need to send it to a different CPU at the same > time. We should only need to grab the lock to make sure that cur_xfer > and remaining bytes is published by the CPU triggering the rx/tx > transfers and then assume the irq handler code is synchronous with > itself. At least needs a comment? ...but even with a comment, what exactly are you gaining here by making me and all future readers of the code think about this? One of your main goals here is to eliminate the need to have the spin lock locked for the whole of the IRQ handler, right? If we're in the interrupt handler, local interrupts are already off, right? That means that the only thing you can gain (in terms of speed) by decreasing the number of lines "under lock" is that you could allow someone else trying to get the same lock to run sooner. As you've pointed out, there's already a lot of synchronization at higher layers, and many of the arguments you make are assuming that the other code isn't running anyway. 
So this overall seems like trying to shorten the amount of time the
lock is held just for the sheer principle of having spin locks held
for the fewest lines of code, but not for any real gain?

To get into micro-optimizations: having fine-grained locking in the
interrupt handler will also lead to more than one call to spin_lock()
per IRQ.  I put traces in and I often saw "rx" and "cmd_done" fall
together and each take the lock separately in your code.  In general,
isn't reducing the number of calls to spin_lock() (which I think has
some overhead) more important?

> -	if ((m_irq & M_CMD_CANCEL_EN) || (m_irq & M_CMD_ABORT_EN)) {
> -		mas->cur_mcmd = CMD_NONE;
> +	/* Wake up spi_geni_set_cs() or handle_fifo_timeout() */
> +	if (cs_change || (m_irq & M_CMD_CANCEL_EN) || (m_irq & M_CMD_ABORT_EN))
> 		complete(&mas->xfer_done);
> -	}
>
> 	writel(m_irq, se->base + SE_GENI_M_IRQ_CLEAR);
> -	spin_unlock_irqrestore(&mas->lock, flags);
> +

It doesn't feel legit to release the spinlock before this writel().
Once you complete(), can't (in theory) another transfer kick off on
another CPU and then cause an interrupt to fire?  You'd be clearing
someone else's interrupt, perhaps?  Maybe this actually happens in
handle_fifo_timeout(), which (after your patch) waits for the
completion and then immediately (with no locking) fires off an abort.

As an aside, and unrelated to your change, I never like it when
interrupt handlers clear their IRQ bits at the end anyway because it
always feels like a race.  The only time to clear at the end is if you
have "latched level" interrupts (interrupts that are latched but will
continually re-assert unless the thing that caused them goes away).
I have no idea if these bits are "latched level", but it feels
unlikely.

NOTE: I didn't do a complete review of your patch because (I think) I
already found a few issues and am still not convinced it's a good use
of time to fully re-architect the control flow of this driver.
If you still think this is something we need to do, I will try to make
time to work through all the corner cases in my head and try my best
to make sure there are no weird memory-ordering issues, but I can't
promise I won't miss anything.

-Doug
Hi Douglas,

url:    https://github.com/0day-ci/linux/commits/Douglas-Anderson/spi-spi-geni-qcom-Speculative-fix-of-nobody-cared-about-interrupt/20200318-043933
base:   https://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi.git for-next

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>

smatch warnings:
drivers/spi/spi-geni-qcom.c:385 setup_fifo_xfer() warn: inconsistent returns 'irq'.
drivers/spi/spi-geni-qcom.c:385 setup_fifo_xfer() warn: inconsistent returns 'mas->lock'.

# https://github.com/0day-ci/linux/commit/365ef891fdac5e58b1f621b0b0d57608ffafeb2b
git remote add linux-review https://github.com/0day-ci/linux
git remote update linux-review
git checkout 365ef891fdac5e58b1f621b0b0d57608ffafeb2b
vim +/irq +385 drivers/spi/spi-geni-qcom.c

561de45f72bd5f Girish Mahadevan 2018-10-03  305  static void setup_fifo_xfer(struct spi_transfer *xfer,
561de45f72bd5f Girish Mahadevan 2018-10-03  306  				struct spi_geni_master *mas,
561de45f72bd5f Girish Mahadevan 2018-10-03  307  				u16 mode, struct spi_master *spi)
561de45f72bd5f Girish Mahadevan 2018-10-03  308  {
561de45f72bd5f Girish Mahadevan 2018-10-03  309  	u32 m_cmd = 0;
561de45f72bd5f Girish Mahadevan 2018-10-03  310  	u32 spi_tx_cfg, len;
561de45f72bd5f Girish Mahadevan 2018-10-03  311  	struct geni_se *se = &mas->se;
561de45f72bd5f Girish Mahadevan 2018-10-03  312  
365ef891fdac5e Douglas Anderson 2020-03-17  313  	spin_lock_irq(&mas->lock);
365ef891fdac5e Douglas Anderson 2020-03-17  314  
561de45f72bd5f Girish Mahadevan 2018-10-03  315  	spi_tx_cfg = readl(se->base + SE_SPI_TRANS_CFG);
561de45f72bd5f Girish Mahadevan 2018-10-03  316  	if (xfer->bits_per_word != mas->cur_bits_per_word) {
561de45f72bd5f Girish Mahadevan 2018-10-03  317  		spi_setup_word_len(mas, mode, xfer->bits_per_word);
561de45f72bd5f Girish Mahadevan 2018-10-03  318  		mas->cur_bits_per_word = xfer->bits_per_word;
561de45f72bd5f Girish Mahadevan 2018-10-03  319  	}
561de45f72bd5f Girish Mahadevan 2018-10-03  320  
561de45f72bd5f Girish Mahadevan 2018-10-03  321  	/* Speed and bits per word can be overridden per transfer */
561de45f72bd5f Girish Mahadevan 2018-10-03  322  	if (xfer->speed_hz != mas->cur_speed_hz) {
561de45f72bd5f Girish Mahadevan 2018-10-03  323  		int ret;
561de45f72bd5f Girish Mahadevan 2018-10-03  324  		u32 clk_sel, m_clk_cfg;
561de45f72bd5f Girish Mahadevan 2018-10-03  325  		unsigned int idx, div;
561de45f72bd5f Girish Mahadevan 2018-10-03  326  
561de45f72bd5f Girish Mahadevan 2018-10-03  327  		ret = get_spi_clk_cfg(xfer->speed_hz, mas, &idx, &div);
561de45f72bd5f Girish Mahadevan 2018-10-03  328  		if (ret) {
561de45f72bd5f Girish Mahadevan 2018-10-03  329  			dev_err(mas->dev, "Err setting clks:%d\n", ret);
561de45f72bd5f Girish Mahadevan 2018-10-03  330  			return;

Needs to drop the lock before returning.

561de45f72bd5f Girish Mahadevan 2018-10-03  331  		}
561de45f72bd5f Girish Mahadevan 2018-10-03  332  		/*
561de45f72bd5f Girish Mahadevan 2018-10-03  333  		 * SPI core clock gets configured with the requested frequency
561de45f72bd5f Girish Mahadevan 2018-10-03  334  		 * or the frequency closer to the requested frequency.
561de45f72bd5f Girish Mahadevan 2018-10-03  335  		 * For that reason requested frequency is stored in the
561de45f72bd5f Girish Mahadevan 2018-10-03  336  		 * cur_speed_hz and referred in the consecutive transfer instead
561de45f72bd5f Girish Mahadevan 2018-10-03  337  		 * of calling clk_get_rate() API.
561de45f72bd5f Girish Mahadevan 2018-10-03  338  		 */
561de45f72bd5f Girish Mahadevan 2018-10-03  339  		mas->cur_speed_hz = xfer->speed_hz;
561de45f72bd5f Girish Mahadevan 2018-10-03  340  		clk_sel = idx & CLK_SEL_MSK;
561de45f72bd5f Girish Mahadevan 2018-10-03  341  		m_clk_cfg = (div << CLK_DIV_SHFT) | SER_CLK_EN;
561de45f72bd5f Girish Mahadevan 2018-10-03  342  		writel(clk_sel, se->base + SE_GENI_CLK_SEL);
561de45f72bd5f Girish Mahadevan 2018-10-03  343  		writel(m_clk_cfg, se->base + GENI_SER_M_CLK_CFG);
561de45f72bd5f Girish Mahadevan 2018-10-03  344  	}
561de45f72bd5f Girish Mahadevan 2018-10-03  345  
561de45f72bd5f Girish Mahadevan 2018-10-03  346  	mas->tx_rem_bytes = 0;
561de45f72bd5f Girish Mahadevan 2018-10-03  347  	mas->rx_rem_bytes = 0;
561de45f72bd5f Girish Mahadevan 2018-10-03  348  	if (xfer->tx_buf && xfer->rx_buf)
561de45f72bd5f Girish Mahadevan 2018-10-03  349  		m_cmd = SPI_FULL_DUPLEX;
561de45f72bd5f Girish Mahadevan 2018-10-03  350  	else if (xfer->tx_buf)
561de45f72bd5f Girish Mahadevan 2018-10-03  351  		m_cmd = SPI_TX_ONLY;
561de45f72bd5f Girish Mahadevan 2018-10-03  352  	else if (xfer->rx_buf)
561de45f72bd5f Girish Mahadevan 2018-10-03  353  		m_cmd = SPI_RX_ONLY;
561de45f72bd5f Girish Mahadevan 2018-10-03  354  
561de45f72bd5f Girish Mahadevan 2018-10-03  355  	spi_tx_cfg &= ~CS_TOGGLE;
561de45f72bd5f Girish Mahadevan 2018-10-03  356  
561de45f72bd5f Girish Mahadevan 2018-10-03  357  	if (!(mas->cur_bits_per_word % MIN_WORD_LEN))
561de45f72bd5f Girish Mahadevan 2018-10-03  358  		len = xfer->len * BITS_PER_BYTE / mas->cur_bits_per_word;
561de45f72bd5f Girish Mahadevan 2018-10-03  359  	else
561de45f72bd5f Girish Mahadevan 2018-10-03  360  		len = xfer->len / (mas->cur_bits_per_word / BITS_PER_BYTE + 1);
561de45f72bd5f Girish Mahadevan 2018-10-03  361  	len &= TRANS_LEN_MSK;
561de45f72bd5f Girish Mahadevan 2018-10-03  362  
561de45f72bd5f Girish Mahadevan 2018-10-03  363  	mas->cur_xfer = xfer;
561de45f72bd5f Girish Mahadevan 2018-10-03  364  	if (m_cmd & SPI_TX_ONLY) {
561de45f72bd5f Girish Mahadevan 2018-10-03  365  		mas->tx_rem_bytes = xfer->len;
561de45f72bd5f Girish Mahadevan 2018-10-03  366  		writel(len, se->base + SE_SPI_TX_TRANS_LEN);
561de45f72bd5f Girish Mahadevan 2018-10-03  367  	}
561de45f72bd5f Girish Mahadevan 2018-10-03  368  
561de45f72bd5f Girish Mahadevan 2018-10-03  369  	if (m_cmd & SPI_RX_ONLY) {
561de45f72bd5f Girish Mahadevan 2018-10-03  370  		writel(len, se->base + SE_SPI_RX_TRANS_LEN);
561de45f72bd5f Girish Mahadevan 2018-10-03  371  		mas->rx_rem_bytes = xfer->len;
561de45f72bd5f Girish Mahadevan 2018-10-03  372  	}
561de45f72bd5f Girish Mahadevan 2018-10-03  373  	writel(spi_tx_cfg, se->base + SE_SPI_TRANS_CFG);
561de45f72bd5f Girish Mahadevan 2018-10-03  374  	mas->cur_mcmd = CMD_XFER;
561de45f72bd5f Girish Mahadevan 2018-10-03  375  	geni_se_setup_m_cmd(se, m_cmd, FRAGMENTATION);
561de45f72bd5f Girish Mahadevan 2018-10-03  376  
561de45f72bd5f Girish Mahadevan 2018-10-03  377  	/*
561de45f72bd5f Girish Mahadevan 2018-10-03  378  	 * TX_WATERMARK_REG should be set after SPI configuration and
561de45f72bd5f Girish Mahadevan 2018-10-03  379  	 * setting up GENI SE engine, as driver starts data transfer
561de45f72bd5f Girish Mahadevan 2018-10-03  380  	 * for the watermark interrupt.
561de45f72bd5f Girish Mahadevan 2018-10-03  381  	 */
561de45f72bd5f Girish Mahadevan 2018-10-03  382  	if (m_cmd & SPI_TX_ONLY)
561de45f72bd5f Girish Mahadevan 2018-10-03  383  		writel(mas->tx_wm, se->base + SE_GENI_TX_WATERMARK_REG);
365ef891fdac5e Douglas Anderson 2020-03-17  384  
365ef891fdac5e Douglas Anderson 2020-03-17 @385  	spin_unlock_irq(&mas->lock);
561de45f72bd5f Girish Mahadevan 2018-10-03  386  }

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
Hi,

On Mon, Mar 23, 2020 at 4:08 AM Dan Carpenter <dan.carpenter@oracle.com> wrote:
>
> 561de45f72bd5f Girish Mahadevan 2018-10-03  327  		ret = get_spi_clk_cfg(xfer->speed_hz, mas, &idx, &div);
> 561de45f72bd5f Girish Mahadevan 2018-10-03  328  		if (ret) {
> 561de45f72bd5f Girish Mahadevan 2018-10-03  329  			dev_err(mas->dev, "Err setting clks:%d\n", ret);
> 561de45f72bd5f Girish Mahadevan 2018-10-03  330  			return;
>
> Needs to drop the lock before returning.

Oops, thanks for catching.  I will wait before spinning a v3 until
there is some clarity about whether folks want to do something more
like Stephen suggested or whether I should continue with my strategy.
At the moment I'm still in favor of keeping w/ my strategy and seeing
if I can reduce the amount of time with interrupts disabled in
setup_fifo_xfer(), maybe just grabbing the lock around the start of
the transfer to keep the state machine in sync with the kickoff...

-Doug
diff --git a/drivers/spi/spi-geni-qcom.c b/drivers/spi/spi-geni-qcom.c
index c3972424af71..92e51ccb427f 100644
--- a/drivers/spi/spi-geni-qcom.c
+++ b/drivers/spi/spi-geni-qcom.c
@@ -151,16 +151,19 @@ static void spi_geni_set_cs(struct spi_device *slv, bool set_flag)
 	struct geni_se *se = &mas->se;
 	unsigned long time_left;
 
-	reinit_completion(&mas->xfer_done);
 	pm_runtime_get_sync(mas->dev);
 	if (!(slv->mode & SPI_CS_HIGH))
 		set_flag = !set_flag;
 
+	spin_lock_irq(&mas->lock);
+	reinit_completion(&mas->xfer_done);
+
 	mas->cur_mcmd = CMD_CS;
 	if (set_flag)
 		geni_se_setup_m_cmd(se, SPI_CS_ASSERT, 0);
 	else
 		geni_se_setup_m_cmd(se, SPI_CS_DEASSERT, 0);
+	spin_unlock_irq(&mas->lock);
 
 	time_left = wait_for_completion_timeout(&mas->xfer_done, HZ);
 	if (!time_left)
@@ -307,6 +310,8 @@ static void setup_fifo_xfer(struct spi_transfer *xfer,
 	u32 spi_tx_cfg, len;
 	struct geni_se *se = &mas->se;
 
+	spin_lock_irq(&mas->lock);
+
 	spi_tx_cfg = readl(se->base + SE_SPI_TRANS_CFG);
 	if (xfer->bits_per_word != mas->cur_bits_per_word) {
 		spi_setup_word_len(mas, mode, xfer->bits_per_word);
@@ -376,6 +381,8 @@ static void setup_fifo_xfer(struct spi_transfer *xfer,
 	 */
 	if (m_cmd & SPI_TX_ONLY)
 		writel(mas->tx_wm, se->base + SE_GENI_TX_WATERMARK_REG);
+
+	spin_unlock_irq(&mas->lock);
 }
 
 static int spi_geni_transfer_one(struct spi_master *spi,
@@ -478,13 +485,29 @@ static irqreturn_t geni_spi_isr(int irq, void *data)
 	struct geni_se *se = &mas->se;
 	u32 m_irq;
 	unsigned long flags;
-
-	if (mas->cur_mcmd == CMD_NONE)
-		return IRQ_NONE;
+	irqreturn_t ret = IRQ_HANDLED;
 
 	spin_lock_irqsave(&mas->lock, flags);
 	m_irq = readl(se->base + SE_GENI_M_IRQ_STATUS);
 
+	/* Check for spurious interrupt */
+	if (!m_irq) {
+		ret = IRQ_NONE;
+		goto exit;
+	}
+
+	/*
+	 * If we got a real interrupt but software state machine thinks
+	 * we were idle then give a warning.  We'll do our best to clear
+	 * the interrupt but we'll still return IRQ_NONE.  If this keeps
+	 * happening the system will eventually disable us.
+	 */
+	if (mas->cur_mcmd == CMD_NONE) {
+		pr_warn("Unexpected SPI interrupt: %#010x\n", m_irq);
+		ret = IRQ_NONE;
+		goto exit;
+	}
+
 	if ((m_irq & M_RX_FIFO_WATERMARK_EN) || (m_irq & M_RX_FIFO_LAST_EN))
 		geni_spi_handle_rx(mas);
 
@@ -523,9 +546,11 @@ static irqreturn_t geni_spi_isr(int irq, void *data)
 		complete(&mas->xfer_done);
 	}
 
+exit:
 	writel(m_irq, se->base + SE_GENI_M_IRQ_CLEAR);
 	spin_unlock_irqrestore(&mas->lock, flags);
-	return IRQ_HANDLED;
+
+	return ret;
 }
 
 static int spi_geni_probe(struct platform_device *pdev)
Every once in a while (like once in a few months on a device) people
have seen warnings on devices using spi-geni-qcom like:

  irq ...: nobody cared (try booting with the "irqpoll" option)

...where the interrupt number listed matches up with "spi_geni" in
/proc/interrupts.

You can get "nobody cared" if the interrupt handler returns IRQ_NONE.
Once you get into this state the driver will just stop working.

Auditing the code makes me believe that it's probably not safe to
check "cur_mcmd" in the interrupt handler without holding the
spinlock.  Let's move the check under the spinlock.  Looking more
closely at the code, it looks as if there are some other places that
should be under the spinlock too, so let's add those.  It looks as if
the original code was assuming that, by the time the interrupt
handler got called, the write to "cur_mcmd" (to make it not CMD_NONE
anymore) would have made it to the processor handling the interrupt.
Perhaps with weakly ordered memory that assumption sometimes (though
very rarely) failed.

Let's also add a warning (with the IRQ status) in the case that we
ever end up getting an interrupt when we think we shouldn't.  This
would give us a better chance to debug if this patch doesn't help the
issue.  We'll also try our best to clear the interrupt to hopefully
get us out of this state.

The patch is marked "speculative" because I have no way to reproduce
this problem, so I'm just hoping this fixes it.  Weakly ordered memory
makes my brain hurt.

Fixes: 561de45f72bd ("spi: spi-geni-qcom: Add SPI driver support for GENI based QUP")
Signed-off-by: Douglas Anderson <dianders@chromium.org>
---
Changes in v2:
- Detect true spurious interrupt.
- Still return IRQ_NONE for state machine mismatch, but print warn.

 drivers/spi/spi-geni-qcom.c | 35 ++++++++++++++++++++++++++++++-----
 1 file changed, 30 insertions(+), 5 deletions(-)