diff mbox series

[1/2,i2c-bcm2835] Fully clean up hardware state machine after a timeout

Message ID 20231030162114.3603829-2-mike.isely@cobaltdigital.com (mailing list archive)
State New, archived
Headers show
Series Fix error-leg bugs / misbehaviors in i2c-bcm2835 driver. | expand

Commit Message

mike.isely@cobaltdigital.com Oct. 30, 2023, 4:21 p.m. UTC
From: Mike Isely <mike.isely@cobaltdigital.com>

When the driver detects a timeout, there's no guarantee that the ISR
would have fired.  Thus after a timeout, it's the foreground that
becomes responsible to reset the hardware state machine.  The change
here just duplicates what is already implemented in the ISR.

Signed-off-by: Mike Isely <isely@pobox.com>
---
 drivers/i2c/busses/i2c-bcm2835.c | 4 ++++
 1 file changed, 4 insertions(+)

Comments

Andi Shyti Oct. 31, 2023, 11:43 a.m. UTC | #1
Hi Mike,

> When the driver detects a timeout, there's no guarantee that the ISR
> would have fired.  Thus after a timeout, it's the foreground that
> becomes responsible to reset the hardware state machine.  The change
> here just duplicates what is already implemented in the ISR.

Is this a fix? What failing here?

Can we have a feedback from Florian, Ray or Scott here?

...

>  	if (!time_left) {
> +		/* Since we can't trust the ISR to have cleaned up, do the
> +		 * full cleanup here... */

Please use the

	/*
	 * comment
	 * comment
	 */

format

>  		bcm2835_i2c_writel(i2c_dev, BCM2835_I2C_C,
>  				   BCM2835_I2C_C_CLEAR);
> +		bcm2835_i2c_writel(i2c_dev, BCM2835_I2C_S, BCM2835_I2C_S_CLKT |
> +				   BCM2835_I2C_S_ERR | BCM2835_I2C_S_DONE);

I'm not sure this is really making any difference though. How
have you tested this?

Have you tried reading those registers before and understand what
went wrong?

Andi
Stefan Wahren Oct. 31, 2023, 12:36 p.m. UTC | #2
[Forward to Dave and Phil]

Am 30.10.23 um 17:21 schrieb mike.isely@cobaltdigital.com:
> From: Mike Isely <mike.isely@cobaltdigital.com>
>
> When the driver detects a timeout, there's no guarantee that the ISR
> would have fired.  Thus after a timeout, it's the foreground that
> becomes responsible to reset the hardware state machine.  The change
> here just duplicates what is already implemented in the ISR.
>
> Signed-off-by: Mike Isely <isely@pobox.com>
> ---
>   drivers/i2c/busses/i2c-bcm2835.c | 4 ++++
>   1 file changed, 4 insertions(+)
>
> diff --git a/drivers/i2c/busses/i2c-bcm2835.c b/drivers/i2c/busses/i2c-bcm2835.c
> index 8ce6d3f49551..96de875394e1 100644
> --- a/drivers/i2c/busses/i2c-bcm2835.c
> +++ b/drivers/i2c/busses/i2c-bcm2835.c
> @@ -345,42 +345,46 @@ static irqreturn_t bcm2835_i2c_isr(int this_irq, void *data)
>   static int bcm2835_i2c_xfer(struct i2c_adapter *adap, struct i2c_msg msgs[],
>   			    int num)
>   {
>   	struct bcm2835_i2c_dev *i2c_dev = i2c_get_adapdata(adap);
>   	unsigned long time_left;
>   	int i;
>
>   	for (i = 0; i < (num - 1); i++)
>   		if (msgs[i].flags & I2C_M_RD) {
>   			dev_warn_once(i2c_dev->dev,
>   				      "only one read message supported, has to be last\n");
>   			return -EOPNOTSUPP;
>   		}
>
>   	i2c_dev->curr_msg = msgs;
>   	i2c_dev->num_msgs = num;
>   	reinit_completion(&i2c_dev->completion);
>
>   	bcm2835_i2c_start_transfer(i2c_dev);
>
>   	time_left = wait_for_completion_timeout(&i2c_dev->completion,
>   						adap->timeout);
>
>   	bcm2835_i2c_finish_transfer(i2c_dev);
>
>   	if (!time_left) {
> +		/* Since we can't trust the ISR to have cleaned up, do the
> +		 * full cleanup here... */
>   		bcm2835_i2c_writel(i2c_dev, BCM2835_I2C_C,
>   				   BCM2835_I2C_C_CLEAR);
> +		bcm2835_i2c_writel(i2c_dev, BCM2835_I2C_S, BCM2835_I2C_S_CLKT |
> +				   BCM2835_I2C_S_ERR | BCM2835_I2C_S_DONE);
>   		dev_err(i2c_dev->dev, "i2c transfer timed out\n");
>   		return -ETIMEDOUT;
>   	}
>
>   	if (!i2c_dev->msg_err)
>   		return num;
>
>   	dev_dbg(i2c_dev->dev, "i2c transfer failed: %x\n", i2c_dev->msg_err);
>
>   	if (i2c_dev->msg_err & BCM2835_I2C_S_ERR)
>   		return -EREMOTEIO;
>
>   	return -EIO;
>   }
Dave Stevenson Oct. 31, 2023, 3:14 p.m. UTC | #3
Hi Mike & Andi

On Tue, 31 Oct 2023 at 11:44, Andi Shyti <andi.shyti@kernel.org> wrote:
>
> Hi Mike,
>
> > When the driver detects a timeout, there's no guarantee that the ISR
> > would have fired.  Thus after a timeout, it's the foreground that
> > becomes responsible to reset the hardware state machine.  The change
> > here just duplicates what is already implemented in the ISR.
>
> Is this a fix? What failing here?
>
> Can we have a feedback from Florian, Ray or Scott here?
>
> ...
>
> >       if (!time_left) {
> > +             /* Since we can't trust the ISR to have cleaned up, do the
> > +              * full cleanup here... */
>
> Please use the
>
>         /*
>          * comment
>          * comment
>          */
>
> format
>
> >               bcm2835_i2c_writel(i2c_dev, BCM2835_I2C_C,
> >                                  BCM2835_I2C_C_CLEAR);
> > +             bcm2835_i2c_writel(i2c_dev, BCM2835_I2C_S, BCM2835_I2C_S_CLKT |
> > +                                BCM2835_I2C_S_ERR | BCM2835_I2C_S_DONE);
>
> I'm not sure this is really making any difference though. How
> have you tested this?
>
> Have you tried reading those registers before and understand what
> went wrong?

I guess we may have a race hazard here.
The completion has timed out. The write to I2C_C will flush the FIFOs
and mask the interrupts out, so if the transaction can complete at
that point then the ISR won't handle clearing the status flags. As the
flags are "write 1 to reset", they could still be set for the next
transfer.
It'd be down to the exact timing of hitting I2C_C_CLEAR (to clear the
FIFOs and disable the interrupts) and when that terminates the
transfer. Ensuring the status bits are cleared will do no harm, but
I'm not convinced that there is an issue in the first place that needs
fixing. Can you give any more detail of when you've seen this fail?

  Dave

> Andi
>
>
> _______________________________________________
> linux-rpi-kernel mailing list
> linux-rpi-kernel@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-rpi-kernel
Mike Isely Oct. 31, 2023, 4:26 p.m. UTC | #4
Response & details below...

On Tue, 31 Oct 2023, Dave Stevenson wrote:

> Hi Mike & Andi
> 
> On Tue, 31 Oct 2023 at 11:44, Andi Shyti <andi.shyti@kernel.org> wrote:
> >
> > Hi Mike,
> >
> > > When the driver detects a timeout, there's no guarantee that the ISR
> > > would have fired.  Thus after a timeout, it's the foreground that
> > > becomes responsible to reset the hardware state machine.  The change
> > > here just duplicates what is already implemented in the ISR.
> >
> > Is this a fix? What failing here?
> >
> > Can we have a feedback from Florian, Ray or Scott here?
> >
> > ...
> >
> > >       if (!time_left) {
> > > +             /* Since we can't trust the ISR to have cleaned up, do the
> > > +              * full cleanup here... */
> >
> > Please use the
> >
> >         /*
> >          * comment
> >          * comment
> >          */
> >
> > format
> >
> > >               bcm2835_i2c_writel(i2c_dev, BCM2835_I2C_C,
> > >                                  BCM2835_I2C_C_CLEAR);
> > > +             bcm2835_i2c_writel(i2c_dev, BCM2835_I2C_S, BCM2835_I2C_S_CLKT |
> > > +                                BCM2835_I2C_S_ERR | BCM2835_I2C_S_DONE);
> >
> > I'm not sure this is really making any difference though. How
> > have you tested this?
> >
> > Have you tried reading those registers before and understand what
> > went wrong?
> 
> I guess we may have a race hazard here.

> The completion has timed out. The write to I2C_C will flush the FIFOs
> and mask the interrupts out, so if the transaction can complete at
> that point then the ISR won't handle clearing the status flags. As the
> flags are "write 1 to reset", they could still be set for the next
> transfer.

Yes, that's the problem with the state cleanup that the first patch 
fixes.  In addition to clearing stuff in the control register (which was 
already there), one must also clear out pending status bits in the 
status register, i.e. ERR, DONE, and CLKT, which is what happens here.  
Realize that if the status bits are not cleaned up, then these bits will 
lead the ISR astray in the next transfer...

Note that the comment I added is one statement ABOVE where I added code, 
since the two statements together define the full cleanup.  So I'm 
talking about adding cleanup for the status bits as the control register 
is already handled in the first statement here.


> It'd be down to the exact timing of hitting I2C_C_CLEAR (to clear the
> FIFOs and disable the interrupts) and when that terminates the
> transfer. Ensuring the status bits are cleared will do no harm, but
> I'm not convinced that there is an issue in the first place that needs
> fixing. Can you give any more detail of when you've seen this fail?

See further down.  But basically, if a timeout occurs, then the control 
register won't be cleared at all since that is happening in the ISR - 
which never fires.  The fix clones the cleanup into the timeout handler 
as well - just a single line to clear the control register same as in 
the ISR.

This *definitely* fixes a problem because without it the next transfer 
will start with the hardware in a non-idle state (status bits still 
set), cascading to more errors when the ISR next fires and goes sideways 
due to the erroneous bits.


> 
>   Dave

There are 2 bugs here.  (Actually there were 3 but I noticed a separate 
fix for the third one involving anintermittent incorrect error code on 
slave select errors is already merged in 6.5.9 so I dropped that from my 
patch set).

1. If the driver detects a timeout, the hardware isn't fully cleaned up.  
This can cause the subsequent transaction to be fouled - this is the 
actual symptom which started me down this rabbit hole in the first 
place.  (The subsequent transaction was an EEPROM access which caused 
our logic to conclude it was corrupted and then re-initializing it, 
destroying the rest of its contents...)  But I have seen cases where 
this behavior leads to an infinite loop of slave select errors, probably 
again because of the leftover hardware state after the timeout.  The 
first patch fixes that by applying the same cleanup in the timeout 
handling as that which would have happened in the ISR.

2. There is a race going on, and I believe it's in the controller 
silicon.  It's the cause for the driver-detected timeouts that I have 
seen here.  This is the reason for the very lengthy comment for that 
patch.  The race happens in the hardware (1) which triggers the first TX 
interrupt vs (2) the hardware that detects a slave select error.  Both 
should generate an interrupt, but each is controlled by a separate 
interrupt enable.  Things go awry if the slave select error wins the 
race and INTD is off.  This is apparently because when the slave select 
error implicitly causes DONE to be set, which then masks the TX 
interrupt - and if INTD isn't also set then NO interrupt occurs at all, 
everything hangs, and you get the timeout.  One of the tests I did was 
to add instrumention in the timeout detection to log the hardware state 
at that moment and it clearly shows INTD clear, ERR set, DONE set, INTT 
set, TXW set, and *no* ISR had ever fired.  I had also instrumented a 
counter in the ISR itself so I could positively prove if the ISR had 
been fired.  The fix for this is for INTD to be enabled at all times 
when a transaction is underway, which is what this patch does.  
According to my reading of the datasheet and extensive empirical tests, 
having INTD on whenever the transaction is active is otherwise benign.  
And with that fix, I have yet to see another I2C driver timeout take 
place.  Not a single one.

I also suspect that unless there's another bug in the controller 
silicon, then with this INTD fix you'll probably never see I2C timeouts 
ever happen again.  But I'd want to keep the timeout detection there 
"just in case" and it's good hygiene anyway to ensure the hardware can't 
ever hang the driver.

The first patch, the timeout cleanup, should prevent any subsequent 
foulups should the timeout ever happen again.  The second patch, the 
race condition fix involving INTD, prevents the timeouts from happening 
in the first place.

  -Mike
   isely@pobox.com


> 
> > Andi
> >
> >
> > _______________________________________________
> > linux-rpi-kernel mailing list
> > linux-rpi-kernel@lists.infradead.org
> > http://lists.infradead.org/mailman/listinfo/linux-rpi-kernel
>
diff mbox series

Patch

diff --git a/drivers/i2c/busses/i2c-bcm2835.c b/drivers/i2c/busses/i2c-bcm2835.c
index 8ce6d3f49551..96de875394e1 100644
--- a/drivers/i2c/busses/i2c-bcm2835.c
+++ b/drivers/i2c/busses/i2c-bcm2835.c
@@ -345,42 +345,46 @@  static irqreturn_t bcm2835_i2c_isr(int this_irq, void *data)
 static int bcm2835_i2c_xfer(struct i2c_adapter *adap, struct i2c_msg msgs[],
 			    int num)
 {
 	struct bcm2835_i2c_dev *i2c_dev = i2c_get_adapdata(adap);
 	unsigned long time_left;
 	int i;
 
 	for (i = 0; i < (num - 1); i++)
 		if (msgs[i].flags & I2C_M_RD) {
 			dev_warn_once(i2c_dev->dev,
 				      "only one read message supported, has to be last\n");
 			return -EOPNOTSUPP;
 		}
 
 	i2c_dev->curr_msg = msgs;
 	i2c_dev->num_msgs = num;
 	reinit_completion(&i2c_dev->completion);
 
 	bcm2835_i2c_start_transfer(i2c_dev);
 
 	time_left = wait_for_completion_timeout(&i2c_dev->completion,
 						adap->timeout);
 
 	bcm2835_i2c_finish_transfer(i2c_dev);
 
 	if (!time_left) {
+		/* Since we can't trust the ISR to have cleaned up, do the
+		 * full cleanup here... */
 		bcm2835_i2c_writel(i2c_dev, BCM2835_I2C_C,
 				   BCM2835_I2C_C_CLEAR);
+		bcm2835_i2c_writel(i2c_dev, BCM2835_I2C_S, BCM2835_I2C_S_CLKT |
+				   BCM2835_I2C_S_ERR | BCM2835_I2C_S_DONE);
 		dev_err(i2c_dev->dev, "i2c transfer timed out\n");
 		return -ETIMEDOUT;
 	}
 
 	if (!i2c_dev->msg_err)
 		return num;
 
 	dev_dbg(i2c_dev->dev, "i2c transfer failed: %x\n", i2c_dev->msg_err);
 
 	if (i2c_dev->msg_err & BCM2835_I2C_S_ERR)
 		return -EREMOTEIO;
 
 	return -EIO;
 }