Message ID | 20231030162114.3603829-2-mike.isely@cobaltdigital.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Fix error-leg bugs / misbehaviors in i2c-bcm2835 driver. | expand |
Hi Mike, > When the driver detects a timeout, there's no guarantee that the ISR > would have fired. Thus after a timeout, it's the foreground that > becomes responsible to reset the hardware state machine. The change > here just duplicates what is already implemented in the ISR. Is this a fix? What failing here? Can we have a feedback from Florian, Ray or Scott here? ... > if (!time_left) { > + /* Since we can't trust the ISR to have cleaned up, do the > + * full cleanup here... */ Please use the /* * comment * comment */ format > bcm2835_i2c_writel(i2c_dev, BCM2835_I2C_C, > BCM2835_I2C_C_CLEAR); > + bcm2835_i2c_writel(i2c_dev, BCM2835_I2C_S, BCM2835_I2C_S_CLKT | > + BCM2835_I2C_S_ERR | BCM2835_I2C_S_DONE); I'm not sure this is really making any difference though. How have you tested this? Have you tried reading those registers before and understand what went wrong? Andi
[Forward to Dave and Phil] Am 30.10.23 um 17:21 schrieb mike.isely@cobaltdigital.com: > From: Mike Isely <mike.isely@cobaltdigital.com> > > When the driver detects a timeout, there's no guarantee that the ISR > would have fired. Thus after a timeout, it's the foreground that > becomes responsible to reset the hardware state machine. The change > here just duplicates what is already implemented in the ISR. > > Signed-off-by: Mike Isely <isely@pobox.com> > --- > drivers/i2c/busses/i2c-bcm2835.c | 4 ++++ > 1 file changed, 4 insertions(+) > > diff --git a/drivers/i2c/busses/i2c-bcm2835.c b/drivers/i2c/busses/i2c-bcm2835.c > index 8ce6d3f49551..96de875394e1 100644 > --- a/drivers/i2c/busses/i2c-bcm2835.c > +++ b/drivers/i2c/busses/i2c-bcm2835.c > @@ -345,42 +345,46 @@ static irqreturn_t bcm2835_i2c_isr(int this_irq, void *data) > static int bcm2835_i2c_xfer(struct i2c_adapter *adap, struct i2c_msg msgs[], > int num) > { > struct bcm2835_i2c_dev *i2c_dev = i2c_get_adapdata(adap); > unsigned long time_left; > int i; > > for (i = 0; i < (num - 1); i++) > if (msgs[i].flags & I2C_M_RD) { > dev_warn_once(i2c_dev->dev, > "only one read message supported, has to be last\n"); > return -EOPNOTSUPP; > } > > i2c_dev->curr_msg = msgs; > i2c_dev->num_msgs = num; > reinit_completion(&i2c_dev->completion); > > bcm2835_i2c_start_transfer(i2c_dev); > > time_left = wait_for_completion_timeout(&i2c_dev->completion, > adap->timeout); > > bcm2835_i2c_finish_transfer(i2c_dev); > > if (!time_left) { > + /* Since we can't trust the ISR to have cleaned up, do the > + * full cleanup here... */ > bcm2835_i2c_writel(i2c_dev, BCM2835_I2C_C, > BCM2835_I2C_C_CLEAR); > + bcm2835_i2c_writel(i2c_dev, BCM2835_I2C_S, BCM2835_I2C_S_CLKT | > + BCM2835_I2C_S_ERR | BCM2835_I2C_S_DONE); > dev_err(i2c_dev->dev, "i2c transfer timed out\n"); > return -ETIMEDOUT; > } > > if (!i2c_dev->msg_err) > return num; > > dev_dbg(i2c_dev->dev, "i2c transfer failed: %x\n", i2c_dev->msg_err); > > if (i2c_dev->msg_err & BCM2835_I2C_S_ERR) > return -EREMOTEIO; > > return -EIO; > }
Hi Mike & Andi On Tue, 31 Oct 2023 at 11:44, Andi Shyti <andi.shyti@kernel.org> wrote: > > Hi Mike, > > > When the driver detects a timeout, there's no guarantee that the ISR > > would have fired. Thus after a timeout, it's the foreground that > > becomes responsible to reset the hardware state machine. The change > > here just duplicates what is already implemented in the ISR. > > Is this a fix? What failing here? > > Can we have a feedback from Florian, Ray or Scott here? > > ... > > > if (!time_left) { > > + /* Since we can't trust the ISR to have cleaned up, do the > > + * full cleanup here... */ > > Please use the > > /* > * comment > * comment > */ > > format > > > bcm2835_i2c_writel(i2c_dev, BCM2835_I2C_C, > > BCM2835_I2C_C_CLEAR); > > + bcm2835_i2c_writel(i2c_dev, BCM2835_I2C_S, BCM2835_I2C_S_CLKT | > > + BCM2835_I2C_S_ERR | BCM2835_I2C_S_DONE); > > I'm not sure this is really making any difference though. How > have you tested this? > > Have you tried reading those registers before and understand what > went wrong? I guess we may have a race hazard here. The completion has timed out. The write to I2C_C will flush the FIFOs and mask the interrupts out, so if the transaction can complete at that point then the ISR won't handle clearing the status flags. As the flags are "write 1 to reset", they could still be set for the next transfer. It'd be down to the exact timing of hitting I2C_C_CLEAR (to clear the FIFOs and disable the interrupts) and when that terminates the transfer. Ensuring the status bits are cleared will do no harm, but I'm not convinced that there is an issue in the first place that needs fixing. Can you give any more detail of when you've seen this fail? Dave > Andi > > > _______________________________________________ > linux-rpi-kernel mailing list > linux-rpi-kernel@lists.infradead.org > http://lists.infradead.org/mailman/listinfo/linux-rpi-kernel
Response & details below... On Tue, 31 Oct 2023, Dave Stevenson wrote: > Hi Mike & Andi > > On Tue, 31 Oct 2023 at 11:44, Andi Shyti <andi.shyti@kernel.org> wrote: > > > > Hi Mike, > > > > > When the driver detects a timeout, there's no guarantee that the ISR > > > would have fired. Thus after a timeout, it's the foreground that > > > becomes responsible to reset the hardware state machine. The change > > > here just duplicates what is already implemented in the ISR. > > > > Is this a fix? What failing here? > > > > Can we have a feedback from Florian, Ray or Scott here? > > > > ... > > > > > if (!time_left) { > > > + /* Since we can't trust the ISR to have cleaned up, do the > > > + * full cleanup here... */ > > > > Please use the > > > > /* > > * comment > > * comment > > */ > > > > format > > > > > bcm2835_i2c_writel(i2c_dev, BCM2835_I2C_C, > > > BCM2835_I2C_C_CLEAR); > > > + bcm2835_i2c_writel(i2c_dev, BCM2835_I2C_S, BCM2835_I2C_S_CLKT | > > > + BCM2835_I2C_S_ERR | BCM2835_I2C_S_DONE); > > > > I'm not sure this is really making any difference though. How > > have you tested this? > > > > Have you tried reading those registers before and understand what > > went wrong? > > I guess we may have a race hazard here. > The completion has timed out. The write to I2C_C will flush the FIFOs > and mask the interrupts out, so if the transaction can complete at > that point then the ISR won't handle clearing the status flags. As the > flags are "write 1 to reset", they could still be set for the next > transfer. Yes, that's the problem with the state cleanup that the first patch fixes. In addition to clearing stuff in the control register (which was already there), one must also clear out pending status bits in the status register, i.e. ERR, DONE, and CLKT, which is what happens here. Realize that if the status bits are not cleaned up, then these bits will lead the ISR astray in the next transfer... Note that the comment I added is one statement ABOVE where I added code, since the two statements together define the full cleanup. So I'm talking about adding cleanup for the status bits as the control register is already handled in the first statement here. > It'd be down to the exact timing of hitting I2C_C_CLEAR (to clear the > FIFOs and disable the interrupts) and when that terminates the > transfer. Ensuring the status bits are cleared will do no harm, but > I'm not convinced that there is an issue in the first place that needs > fixing. Can you give any more detail of when you've seen this fail? See further down. But basically, if a timeout occurs, then the control register won't be cleared at all since that is happening in the ISR - which never fires. The fix clones the cleanup into the timeout handler as well - just a single line to clear the control register same as in the ISR. This *definitely* fixes a problem because without it the next transfer will start with the hardware in a non-idle state (status bits still set), cascading to more errors when the ISR next fires and goes sideways due to the erroneous bits. > > Dave There are 2 bugs here. (Actually there were 3 but I noticed a separate fix for the third one involving anintermittent incorrect error code on slave select errors is already merged in 6.5.9 so I dropped that from my patch set). 1. If the driver detects a timeout, the hardware isn't fully cleaned up. This can cause the subsequent transaction to be fouled - this is the actual symptom which started me down this rabbit hole in the first place. (The subsequent transaction was an EEPROM access which caused our logic to conclude it was corrupted and then re-initializing it, destroying the rest of its contents...) But I have seen cases where this behavior leads to an infinite loop of slave select errors, probably again because of the leftover hardware state after the timeout. The first patch fixes that by applying the same cleanup in the timeout handling as that which would have happened in the ISR. 2. There is a race going on, and I believe it's in the controller silicon. It's the cause for the driver-detected timeouts that I have seen here. This is the reason for the very lengthy comment for that patch. The race happens in the hardware (1) which triggers the first TX interrupt vs (2) the hardware that detects a slave select error. Both should generate an interrupt, but each is controlled by a separate interrupt enable. Things go awry if the slave select error wins the race and INTD is off. This is apparently because when the slave select error implicitly causes DONE to be set, which then masks the TX interrupt - and if INTD isn't also set then NO interrupt occurs at all, everything hangs, and you get the timeout. One of the tests I did was to add instrumention in the timeout detection to log the hardware state at that moment and it clearly shows INTD clear, ERR set, DONE set, INTT set, TXW set, and *no* ISR had ever fired. I had also instrumented a counter in the ISR itself so I could positively prove if the ISR had been fired. The fix for this is for INTD to be enabled at all times when a transaction is underway, which is what this patch does. According to my reading of the datasheet and extensive empirical tests, having INTD on whenever the transaction is active is otherwise benign. And with that fix, I have yet to see another I2C driver timeout take place. Not a single one. I also suspect that unless there's another bug in the controller silicon, then with this INTD fix you'll probably never see I2C timeouts ever happen again. But I'd want to keep the timeout detection there "just in case" and it's good hygiene anyway to ensure the hardware can't ever hang the driver. The first patch, the timeout cleanup, should prevent any subsequent foulups should the timeout ever happen again. The second patch, the race condition fix involving INTD, prevents the timeouts from happening in the first place. -Mike isely@pobox.com > > > Andi > > > > > > _______________________________________________ > > linux-rpi-kernel mailing list > > linux-rpi-kernel@lists.infradead.org > > http://lists.infradead.org/mailman/listinfo/linux-rpi-kernel >
diff --git a/drivers/i2c/busses/i2c-bcm2835.c b/drivers/i2c/busses/i2c-bcm2835.c index 8ce6d3f49551..96de875394e1 100644 --- a/drivers/i2c/busses/i2c-bcm2835.c +++ b/drivers/i2c/busses/i2c-bcm2835.c @@ -345,42 +345,46 @@ static irqreturn_t bcm2835_i2c_isr(int this_irq, void *data) static int bcm2835_i2c_xfer(struct i2c_adapter *adap, struct i2c_msg msgs[], int num) { struct bcm2835_i2c_dev *i2c_dev = i2c_get_adapdata(adap); unsigned long time_left; int i; for (i = 0; i < (num - 1); i++) if (msgs[i].flags & I2C_M_RD) { dev_warn_once(i2c_dev->dev, "only one read message supported, has to be last\n"); return -EOPNOTSUPP; } i2c_dev->curr_msg = msgs; i2c_dev->num_msgs = num; reinit_completion(&i2c_dev->completion); bcm2835_i2c_start_transfer(i2c_dev); time_left = wait_for_completion_timeout(&i2c_dev->completion, adap->timeout); bcm2835_i2c_finish_transfer(i2c_dev); if (!time_left) { + /* Since we can't trust the ISR to have cleaned up, do the + * full cleanup here... */ bcm2835_i2c_writel(i2c_dev, BCM2835_I2C_C, BCM2835_I2C_C_CLEAR); + bcm2835_i2c_writel(i2c_dev, BCM2835_I2C_S, BCM2835_I2C_S_CLKT | + BCM2835_I2C_S_ERR | BCM2835_I2C_S_DONE); dev_err(i2c_dev->dev, "i2c transfer timed out\n"); return -ETIMEDOUT; } if (!i2c_dev->msg_err) return num; dev_dbg(i2c_dev->dev, "i2c transfer failed: %x\n", i2c_dev->msg_err); if (i2c_dev->msg_err & BCM2835_I2C_S_ERR) return -EREMOTEIO; return -EIO; }