diff mbox series

spi: spi-geni-qcom: Speculative fix of "nobody cared" about interrupt

Message ID 20200316151939.1.I752ebdcfd5e8bf0de06d66e767b8974932b3620e@changeid (mailing list archive)
State Superseded
Headers show
Series spi: spi-geni-qcom: Speculative fix of "nobody cared" about interrupt | expand

Commit Message

Doug Anderson March 16, 2020, 10:20 p.m. UTC
Every once in a while (like once in a few months on a device) people
have seen warnings on devices using spi-geni-qcom like:

  irq ...: nobody cared (try booting with the "irqpoll" option)

...where the interrupt number listed matches up with "spi_geni" in
/proc/interrupts.

You can get "nobody cared" if the interrupt handler returns IRQ_NONE.
Once you get into this state the driver will just stop working.

Auditing the code makes me believe that it's probably not so good
checking "cur_mcmd" in the interrupt handler not under spinlock.
Let's move it to be under spinlock.  While we're at it, let's not
return IRQ_NONE.  IRQ_NONE is supposed to check the _hardware_ status
registers and only be returned if your hardware says that there's no
interrupt present.  It's not supposed to check software state.  In any
case, the whole point of IRQ_NONE is if two separate devices are
trying to share an interrupt line, which just isn't true for anyone
using geni.

Looking more closely at the code, it looks as if there are some other
places that could be under spinlock, so let's add.  It looks as if the
original code was assuming that by the time that the interrupt handler
got called that the write to "cur_mcmd" (to make it not CMD_NONE
anymore) would make it to the processor handling the interrupt.
Perhaps with weakly ordered memory this sometimes (though very rarely)
happened.

Patch is marked "speculative" because I have no way to reproduce this
problem, so I'm just hoping this fixes it.  Weakly ordered memory
makes my brain hurt.

Fixes: 561de45f72bd ("spi: spi-geni-qcom: Add SPI driver support for GENI based QUP")
Signed-off-by: Douglas Anderson <dianders@chromium.org>
---

 drivers/spi/spi-geni-qcom.c | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

Comments

Mark Brown March 17, 2020, 12:10 p.m. UTC | #1
On Mon, Mar 16, 2020 at 03:20:01PM -0700, Douglas Anderson wrote:

> +	/*
> +	 * We don't expect to hit this, but if we do we should try our best
> +	 * to clear the interrupts and return so we don't just get called
> +	 * again.
> +	 */
> +	if (mas->cur_mcmd == CMD_NONE)
> +		goto exit;
> +

Does this mean that there was an actual concrete message of type
CMD_NONE or does it mean that there was no message waiting?  If there
was no message then isn't the interrupt spurious?
Doug Anderson March 17, 2020, 3:12 p.m. UTC | #2
Hi,

On Tue, Mar 17, 2020 at 5:10 AM Mark Brown <broonie@kernel.org> wrote:
>
> On Mon, Mar 16, 2020 at 03:20:01PM -0700, Douglas Anderson wrote:
>
> > +     /*
> > +      * We don't expect to hit this, but if we do we should try our best
> > +      * to clear the interrupts and return so we don't just get called
> > +      * again.
> > +      */
> > +     if (mas->cur_mcmd == CMD_NONE)
> > +             goto exit;
> > +
>
> Does this mean that there was an actual concrete message of type
> CMD_NONE or does it mean that there was no message waiting?  If there
> was no message then isn't the interrupt spurious?

There is no message of type "CMD_NONE".  The "cur_mcmd" field is
basically where in the software state machine we're at:

* CMD_NONE - Software thinks that the controller should be idle.
* CMD_XFER - Software has started a transfer.
* CMD_CS - Software has started a chip select change.
* CMD_CANCEL - Software sent a "cancel".

...so certainly if we see "cur_mcmd == CMD_NONE" in the interrupt
handler we're in an unexpected situation.  We don't expect interrupts
while idle.  I wouldn't necessarily say it was a spurious interrupt,
though.  To say that I'd rather look at the result of this line in the
IRQ handler:

    m_irq = readl(se->base + SE_GENI_M_IRQ_STATUS);

...if that line returns 0 then I would be willing to say it is a
spurious interrupt.


So there is really more than one issue at hand, I guess.

A) Why did we get an interrupt when we had "cur_mcmd == CMD_NONE"?
IMO this is due to weakly ordered memory and not enough locking.

B) If we do see an interrupt when "cur_mcmd == CMD_NONE" (even after
we fix the locking), what should we do?  IMO we should still try to
Ack it.  I can add a "pr_warn()" if it's helpful?

C) Do we care to try to detect spurious interrupts (by checking
SE_GENI_M_IRQ_STATUS) and return IRQ_NONE?  Right now a spurious
interrupt will be harmless because all of the logic in geni_spi_isr()
doesn't do anything if SE_GENI_M_IRQ_STATUS has no bits set.  ...but
it will still return IRQ_HANDLED.  I can't imagine anyone ever putting
this device on a shared interrupt, but if it's important I can detect
this and return IRQ_NONE in this case in a v2 of this patch.

-Doug
Mark Brown March 17, 2020, 4:44 p.m. UTC | #3
On Tue, Mar 17, 2020 at 08:12:30AM -0700, Doug Anderson wrote:
> On Tue, Mar 17, 2020 at 5:10 AM Mark Brown <broonie@kernel.org> wrote:
> > On Mon, Mar 16, 2020 at 03:20:01PM -0700, Douglas Anderson wrote:

> >
> > Does this mean that there was an actual concrete message of type
> > CMD_NONE or does it mean that there was no message waiting?  If there
> > was no message then isn't the interrupt spurious?

> There is no message of type "CMD_NONE".  The "cur_mcmd" field is
> basically where in the software state machine we're at:

...

> ...so certainly if we see "cur_mcmd == CMD_NONE" in the interrupt
> handler we're in an unexpected situation.  We don't expect interrupts
> while idle.  I wouldn't necessarily say it was a spurious interrupt,
> though.  To say that I'd rather look at the result of this line in the
> IRQ handler:

>     m_irq = readl(se->base + SE_GENI_M_IRQ_STATUS);

> ...if that line returns 0 then I would be willing to say it is a
> spurious interrupt.

Right, that should return IRQ_NONE if there's nothing actually asserted.
I think you're right about the state machine though.

> C) Do we care to try to detect spurious interrupts (by checking
> SE_GENI_M_IRQ_STATUS) and return IRQ_NONE?  Right now a spurious
> interrupt will be harmless because all of the logic in geni_spi_isr()
> doesn't do anything if SE_GENI_M_IRQ_STATUS has no bits set.  ...but
> it will still return IRQ_HANDLED.  I can't imagine anyone ever putting
> this device on a shared interrupt, but if it's important I can detect
> this and return IRQ_NONE in this case in a v2 of this patch.

It's a good idea to return IRQ_NONE not just for the shared interrupt
case but also because it lets the error handling code in the genirq core
do it's thing and detect bugs - as seems to have happened here.  This
one was fairly benign but you can also see things like interrupts that
are constantly asserted by the hardware where we end up spinning in
interrupt handlers.
diff mbox series

Patch

diff --git a/drivers/spi/spi-geni-qcom.c b/drivers/spi/spi-geni-qcom.c
index c3972424af71..51290a3fd203 100644
--- a/drivers/spi/spi-geni-qcom.c
+++ b/drivers/spi/spi-geni-qcom.c
@@ -151,16 +151,19 @@  static void spi_geni_set_cs(struct spi_device *slv, bool set_flag)
 	struct geni_se *se = &mas->se;
 	unsigned long time_left;
 
-	reinit_completion(&mas->xfer_done);
 	pm_runtime_get_sync(mas->dev);
 	if (!(slv->mode & SPI_CS_HIGH))
 		set_flag = !set_flag;
 
+	spin_lock_irq(&mas->lock);
+	reinit_completion(&mas->xfer_done);
+
 	mas->cur_mcmd = CMD_CS;
 	if (set_flag)
 		geni_se_setup_m_cmd(se, SPI_CS_ASSERT, 0);
 	else
 		geni_se_setup_m_cmd(se, SPI_CS_DEASSERT, 0);
+	spin_unlock_irq(&mas->lock);
 
 	time_left = wait_for_completion_timeout(&mas->xfer_done, HZ);
 	if (!time_left)
@@ -307,6 +310,8 @@  static void setup_fifo_xfer(struct spi_transfer *xfer,
 	u32 spi_tx_cfg, len;
 	struct geni_se *se = &mas->se;
 
+	spin_lock_irq(&mas->lock);
+
 	spi_tx_cfg = readl(se->base + SE_SPI_TRANS_CFG);
 	if (xfer->bits_per_word != mas->cur_bits_per_word) {
 		spi_setup_word_len(mas, mode, xfer->bits_per_word);
@@ -376,6 +381,8 @@  static void setup_fifo_xfer(struct spi_transfer *xfer,
 	 */
 	if (m_cmd & SPI_TX_ONLY)
 		writel(mas->tx_wm, se->base + SE_GENI_TX_WATERMARK_REG);
+
+	spin_unlock_irq(&mas->lock);
 }
 
 static int spi_geni_transfer_one(struct spi_master *spi,
@@ -479,12 +486,17 @@  static irqreturn_t geni_spi_isr(int irq, void *data)
 	u32 m_irq;
 	unsigned long flags;
 
-	if (mas->cur_mcmd == CMD_NONE)
-		return IRQ_NONE;
-
 	spin_lock_irqsave(&mas->lock, flags);
 	m_irq = readl(se->base + SE_GENI_M_IRQ_STATUS);
 
+	/*
+	 * We don't expect to hit this, but if we do we should try our best
+	 * to clear the interrupts and return so we don't just get called
+	 * again.
+	 */
+	if (mas->cur_mcmd == CMD_NONE)
+		goto exit;
+
 	if ((m_irq & M_RX_FIFO_WATERMARK_EN) || (m_irq & M_RX_FIFO_LAST_EN))
 		geni_spi_handle_rx(mas);
 
@@ -523,6 +535,7 @@  static irqreturn_t geni_spi_isr(int irq, void *data)
 		complete(&mas->xfer_done);
 	}
 
+exit:
 	writel(m_irq, se->base + SE_GENI_M_IRQ_CLEAR);
 	spin_unlock_irqrestore(&mas->lock, flags);
 	return IRQ_HANDLED;