diff mbox

Sluggish AT91 I2C driver causes SMBus timeouts

Message ID eec1202a-f612-c3ae-3f86-c999e7286316@axentia.se (mailing list archive)
State Superseded
Headers show

Commit Message

Peter Rosin Oct. 12, 2017, 2:32 p.m. UTC
On 2017-10-12 13:35, Peter Rosin wrote:
> Hi!
> 
> I have encountered an "interesting" bug. It silently corrupts data
> and is generally nasty...
> 
> On an I2C bus, driven by the at91 driver and DMA (an Atmel
> sama5d31 chip), I have an 256 byte eeprom (NXP SE97BTP). I'm using
> Linux v4.13.
> 
> The at24 driver for the eeprom detects that the I2C adapter is
> capable of I2C transactions and selects that over SMBus. Reads
> are done in 128 byte chunks. However, sometimes there is some
> kind of event that disturbs the transactions such that the very
> last bit och the very last byte (and the following NACK and STOP)
> of such chunks are delayed for a long time (the latest incident
> shows 85ms on the scope). That is too long for the eeprom which
> is expecting SMBus and times out after about 30 ms. When the
> eeprom times out, it just releases the data line so that it is
> pulled up high. The I2C driver does not notice this, and when it
> finally gets going, it reads a one for the last bit instead of
> the expected zero. Since it is the last byte of the read, a NACK
> is expected and since the eeprom has timed out the NACK is there.
> And the STOP condition also looks normal (expected since it is
> generated by the driver itself). So, the driver has not noticed
> anything funny. But the data is corrupted.
> 
> I can work around this by disabling the SMBus timeout in the eeprom
> with:
> 
> 	i2cset -f 0 0x18 0x22 0x8100
> 
> But that is done on a different I2C address (the eeprom is on
> address 0x50), since the chip is a combined temperature sensor and
> eeprom, and the SMBus timeout bit is of course in a temperature
> sensor register.
> 
> HOWEVER, I fail to see how this is limited to my case with this
> eeprom. Any SMBus chip with a timeout will suffer the same fate.
> The real bug is that this happens without the driver noticing it.
> And why is there a 85ms delay in the middle of the last byte?
> Sure, I can see why there might be a delay before finishing up
> with a STOP condition or between bytes if there needs to be some
> DMA setup at some interval, but after the 7th bit of a byte?
> 
> For a lot of transactions on the I2C bus there is no delay before
> the last bit. And most of the time there is no delay for the
> eeprom reads either; the delay only occurs when it feels like it.
> 
> This does not feel good at all.

I added some traces to i2c-at91.c and, AFAIU, it's the call to
at91_twi_read_data_dma_callback that sometimes arrives later than
desired. Once the callback runs, the transfer completes swiftly.

After reading the comments in that driver I suppose the HW holds
on to the last data-bit until it knows whether to ACK or NACK in
the following bit.

But given this, it is questionable if this driver/HW combo can
claim support for SMBus. But then again, I expect many things
suffer from similar scheduling delays (presumably that's what's
going on) so this is probably not special to i2c-at91.c...

Since this is probably a very generic problem and I just happened
to hit it for the eeprom, I wonder if it would be ok to add a
workaround, as below, to the temperature sensor driver part of this
chip? (with suitable comments, defines for the constants etc -
setting the 0x0080 bit in reg 0x22 disables the SMBus timeout)

Cheers,
Peter

--
To unsubscribe from this list: send the line "unsubscribe linux-hwmon" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Guenter Roeck Oct. 12, 2017, 2:42 p.m. UTC | #1
On 10/12/2017 07:32 AM, Peter Rosin wrote:
> On 2017-10-12 13:35, Peter Rosin wrote:
>> Hi!
>>
>> I have encountered an "interesting" bug. It silently corrupts data
>> and is generally nasty...
>>
>> On an I2C bus, driven by the at91 driver and DMA (an Atmel
>> sama5d31 chip), I have an 256 byte eeprom (NXP SE97BTP). I'm using
>> Linux v4.13.
>>
>> The at24 driver for the eeprom detects that the I2C adapter is
>> capable of I2C transactions and selects that over SMBus. Reads
>> are done in 128 byte chunks. However, sometimes there is some
>> kind of event that disturbs the transactions such that the very
>> last bit och the very last byte (and the following NACK and STOP)
>> of such chunks are delayed for a long time (the latest incident
>> shows 85ms on the scope). That is too long for the eeprom which
>> is expecting SMBus and times out after about 30 ms. When the
>> eeprom times out, it just releases the data line so that it is
>> pulled up high. The I2C driver does not notice this, and when it
>> finally gets going, it reads a one for the last bit instead of
>> the expected zero. Since it is the last byte of the read, a NACK
>> is expected and since the eeprom has timed out the NACK is there.
>> And the STOP condition also looks normal (expected since it is
>> generated by the driver itself). So, the driver has not noticed
>> anything funny. But the data is corrupted.
>>
>> I can work around this by disabling the SMBus timeout in the eeprom
>> with:
>>
>> 	i2cset -f 0 0x18 0x22 0x8100
>>
>> But that is done on a different I2C address (the eeprom is on
>> address 0x50), since the chip is a combined temperature sensor and
>> eeprom, and the SMBus timeout bit is of course in a temperature
>> sensor register.
>>
>> HOWEVER, I fail to see how this is limited to my case with this
>> eeprom. Any SMBus chip with a timeout will suffer the same fate.
>> The real bug is that this happens without the driver noticing it.
>> And why is there a 85ms delay in the middle of the last byte?
>> Sure, I can see why there might be a delay before finishing up
>> with a STOP condition or between bytes if there needs to be some
>> DMA setup at some interval, but after the 7th bit of a byte?
>>
>> For a lot of transactions on the I2C bus there is no delay before
>> the last bit. And most of the time there is no delay for the
>> eeprom reads either; the delay only occurs when it feels like it.
>>
>> This does not feel good at all.
> 
> I added some traces to i2c-at91.c and, AFAIU, it's the call to
> at91_twi_read_data_dma_callback that sometimes arrives later than
> desired. Once the callback runs, the transfer completes swiftly.
> 
> After reading the comments in that driver I suppose the HW holds
> on to the last data-bit until it knows whether to ACK or NACK in
> the following bit.
> 
> But given this, it is questionable if this driver/HW combo can
> claim support for SMBus. But then again, I expect many things
> suffer from similar scheduling delays (presumably that's what's
> going on) so this is probably not special to i2c-at91.c...
> 
> Since this is probably a very generic problem and I just happened
> to hit it for the eeprom, I wonder if it would be ok to add a
> workaround, as below, to the temperature sensor driver part of this
> chip? (with suitable comments, defines for the constants etc -
> setting the 0x0080 bit in reg 0x22 disables the SMBus timeout)
> 
> Cheers,
> Peter
> 
> diff --git a/drivers/hwmon/jc42.c b/drivers/hwmon/jc42.c
> index 1bf22eff0b08..3e72bd8e06d1 100644
> --- a/drivers/hwmon/jc42.c
> +++ b/drivers/hwmon/jc42.c
> @@ -416,6 +416,13 @@ static int jc42_detect(struct i2c_client *client, struct i2c_board_info *info)
>   	if ((cap & 0xff00) || (config & 0xf800))
>   		return -ENODEV;
>   
> +	if (manid == NXP_MANID && (devid & SE97_DEVID_MASK) == SE97_DEVID) {
> +		int smbus = i2c_smbus_read_word_swapped(client, 0x22);
> +		if (smbus < 0)
> +			return -ENODEV;
> +		i2c_smbus_write_word_swapped(client, 0x22, smbus | 0x0080);
> +	}
> +

Outch. Not like that; this would affect every board with this chip, not just this one.
We would need something like a DT property to do that (smbus-timeout-disable is used
in other drivers).

.. and definitely not in the detect function. This would have to be done in probe.

Guenter
--
To unsubscribe from this list: send the line "unsubscribe linux-hwmon" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Peter Rosin Oct. 12, 2017, 9:19 p.m. UTC | #2
On 2017-10-12 16:42, Guenter Roeck wrote:
> On 10/12/2017 07:32 AM, Peter Rosin wrote:
>> On 2017-10-12 13:35, Peter Rosin wrote:
>>> Hi!
>>>
>>> I have encountered an "interesting" bug. It silently corrupts data
>>> and is generally nasty...
>>>
>>> On an I2C bus, driven by the at91 driver and DMA (an Atmel
>>> sama5d31 chip), I have an 256 byte eeprom (NXP SE97BTP). I'm using
>>> Linux v4.13.
>>>
>>> The at24 driver for the eeprom detects that the I2C adapter is
>>> capable of I2C transactions and selects that over SMBus. Reads
>>> are done in 128 byte chunks. However, sometimes there is some
>>> kind of event that disturbs the transactions such that the very
>>> last bit och the very last byte (and the following NACK and STOP)
>>> of such chunks are delayed for a long time (the latest incident
>>> shows 85ms on the scope). That is too long for the eeprom which
>>> is expecting SMBus and times out after about 30 ms. When the
>>> eeprom times out, it just releases the data line so that it is
>>> pulled up high. The I2C driver does not notice this, and when it
>>> finally gets going, it reads a one for the last bit instead of
>>> the expected zero. Since it is the last byte of the read, a NACK
>>> is expected and since the eeprom has timed out the NACK is there.
>>> And the STOP condition also looks normal (expected since it is
>>> generated by the driver itself). So, the driver has not noticed
>>> anything funny. But the data is corrupted.
>>>
>>> I can work around this by disabling the SMBus timeout in the eeprom
>>> with:
>>>
>>> 	i2cset -f 0 0x18 0x22 0x8100
>>>
>>> But that is done on a different I2C address (the eeprom is on
>>> address 0x50), since the chip is a combined temperature sensor and
>>> eeprom, and the SMBus timeout bit is of course in a temperature
>>> sensor register.
>>>
>>> HOWEVER, I fail to see how this is limited to my case with this
>>> eeprom. Any SMBus chip with a timeout will suffer the same fate.
>>> The real bug is that this happens without the driver noticing it.
>>> And why is there a 85ms delay in the middle of the last byte?
>>> Sure, I can see why there might be a delay before finishing up
>>> with a STOP condition or between bytes if there needs to be some
>>> DMA setup at some interval, but after the 7th bit of a byte?
>>>
>>> For a lot of transactions on the I2C bus there is no delay before
>>> the last bit. And most of the time there is no delay for the
>>> eeprom reads either; the delay only occurs when it feels like it.
>>>
>>> This does not feel good at all.
>>
>> I added some traces to i2c-at91.c and, AFAIU, it's the call to
>> at91_twi_read_data_dma_callback that sometimes arrives later than
>> desired. Once the callback runs, the transfer completes swiftly.
>>
>> After reading the comments in that driver I suppose the HW holds
>> on to the last data-bit until it knows whether to ACK or NACK in
>> the following bit.
>>
>> But given this, it is questionable if this driver/HW combo can
>> claim support for SMBus. But then again, I expect many things
>> suffer from similar scheduling delays (presumably that's what's
>> going on) so this is probably not special to i2c-at91.c...
>>
>> Since this is probably a very generic problem and I just happened
>> to hit it for the eeprom, I wonder if it would be ok to add a
>> workaround, as below, to the temperature sensor driver part of this
>> chip? (with suitable comments, defines for the constants etc -
>> setting the 0x0080 bit in reg 0x22 disables the SMBus timeout)
>>
>> Cheers,
>> Peter
>>
>> diff --git a/drivers/hwmon/jc42.c b/drivers/hwmon/jc42.c
>> index 1bf22eff0b08..3e72bd8e06d1 100644
>> --- a/drivers/hwmon/jc42.c
>> +++ b/drivers/hwmon/jc42.c
>> @@ -416,6 +416,13 @@ static int jc42_detect(struct i2c_client *client, struct i2c_board_info *info)
>>   	if ((cap & 0xff00) || (config & 0xf800))
>>   		return -ENODEV;
>>   
>> +	if (manid == NXP_MANID && (devid & SE97_DEVID_MASK) == SE97_DEVID) {
>> +		int smbus = i2c_smbus_read_word_swapped(client, 0x22);
>> +		if (smbus < 0)
>> +			return -ENODEV;
>> +		i2c_smbus_write_word_swapped(client, 0x22, smbus | 0x0080);
>> +	}
>> +
> 
> Outch. Not like that; this would affect every board with this chip, not just this one.

That was kind of the intention. Silent corruption is nasty. But ok, let's
do the opt-in thing.

> We would need something like a DT property to do that (smbus-timeout-disable is used
> in other drivers).
> 
> .. and definitely not in the detect function. This would have to be done in probe.

Arrgh, I did test it, but not properly. The patch passed my tests
because I didn't power-cycle the board and the already disabled timeout
stayed in place for the next boot. I just assumed jc42_detect was
called from probe and it was convenient to have the manufacturer etc
already at hand. Silly silly.

I'll send a proper patch tomorrow with these things taken care of...

Cheers,
peda
--
To unsubscribe from this list: send the line "unsubscribe linux-hwmon" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Guenter Roeck Oct. 13, 2017, 1:27 a.m. UTC | #3
On 10/12/2017 02:19 PM, Peter Rosin wrote:
> On 2017-10-12 16:42, Guenter Roeck wrote:
>> On 10/12/2017 07:32 AM, Peter Rosin wrote:
>>> On 2017-10-12 13:35, Peter Rosin wrote:
>>>> Hi!
>>>>
>>>> I have encountered an "interesting" bug. It silently corrupts data
>>>> and is generally nasty...
>>>>
>>>> On an I2C bus, driven by the at91 driver and DMA (an Atmel
>>>> sama5d31 chip), I have an 256 byte eeprom (NXP SE97BTP). I'm using
>>>> Linux v4.13.
>>>>
>>>> The at24 driver for the eeprom detects that the I2C adapter is
>>>> capable of I2C transactions and selects that over SMBus. Reads
>>>> are done in 128 byte chunks. However, sometimes there is some
>>>> kind of event that disturbs the transactions such that the very
>>>> last bit och the very last byte (and the following NACK and STOP)
>>>> of such chunks are delayed for a long time (the latest incident
>>>> shows 85ms on the scope). That is too long for the eeprom which
>>>> is expecting SMBus and times out after about 30 ms. When the
>>>> eeprom times out, it just releases the data line so that it is
>>>> pulled up high. The I2C driver does not notice this, and when it
>>>> finally gets going, it reads a one for the last bit instead of
>>>> the expected zero. Since it is the last byte of the read, a NACK
>>>> is expected and since the eeprom has timed out the NACK is there.
>>>> And the STOP condition also looks normal (expected since it is
>>>> generated by the driver itself). So, the driver has not noticed
>>>> anything funny. But the data is corrupted.
>>>>
>>>> I can work around this by disabling the SMBus timeout in the eeprom
>>>> with:
>>>>
>>>> 	i2cset -f 0 0x18 0x22 0x8100
>>>>
>>>> But that is done on a different I2C address (the eeprom is on
>>>> address 0x50), since the chip is a combined temperature sensor and
>>>> eeprom, and the SMBus timeout bit is of course in a temperature
>>>> sensor register.
>>>>
>>>> HOWEVER, I fail to see how this is limited to my case with this
>>>> eeprom. Any SMBus chip with a timeout will suffer the same fate.
>>>> The real bug is that this happens without the driver noticing it.
>>>> And why is there a 85ms delay in the middle of the last byte?
>>>> Sure, I can see why there might be a delay before finishing up
>>>> with a STOP condition or between bytes if there needs to be some
>>>> DMA setup at some interval, but after the 7th bit of a byte?
>>>>
>>>> For a lot of transactions on the I2C bus there is no delay before
>>>> the last bit. And most of the time there is no delay for the
>>>> eeprom reads either; the delay only occurs when it feels like it.
>>>>
>>>> This does not feel good at all.
>>>
>>> I added some traces to i2c-at91.c and, AFAIU, it's the call to
>>> at91_twi_read_data_dma_callback that sometimes arrives later than
>>> desired. Once the callback runs, the transfer completes swiftly.
>>>
>>> After reading the comments in that driver I suppose the HW holds
>>> on to the last data-bit until it knows whether to ACK or NACK in
>>> the following bit.
>>>
>>> But given this, it is questionable if this driver/HW combo can
>>> claim support for SMBus. But then again, I expect many things
>>> suffer from similar scheduling delays (presumably that's what's
>>> going on) so this is probably not special to i2c-at91.c...
>>>
>>> Since this is probably a very generic problem and I just happened
>>> to hit it for the eeprom, I wonder if it would be ok to add a
>>> workaround, as below, to the temperature sensor driver part of this
>>> chip? (with suitable comments, defines for the constants etc -
>>> setting the 0x0080 bit in reg 0x22 disables the SMBus timeout)
>>>
>>> Cheers,
>>> Peter
>>>
>>> diff --git a/drivers/hwmon/jc42.c b/drivers/hwmon/jc42.c
>>> index 1bf22eff0b08..3e72bd8e06d1 100644
>>> --- a/drivers/hwmon/jc42.c
>>> +++ b/drivers/hwmon/jc42.c
>>> @@ -416,6 +416,13 @@ static int jc42_detect(struct i2c_client *client, struct i2c_board_info *info)
>>>    	if ((cap & 0xff00) || (config & 0xf800))
>>>    		return -ENODEV;
>>>    
>>> +	if (manid == NXP_MANID && (devid & SE97_DEVID_MASK) == SE97_DEVID) {
>>> +		int smbus = i2c_smbus_read_word_swapped(client, 0x22);
>>> +		if (smbus < 0)
>>> +			return -ENODEV;
>>> +		i2c_smbus_write_word_swapped(client, 0x22, smbus | 0x0080);
>>> +	}
>>> +
>>
>> Outch. Not like that; this would affect every board with this chip, not just this one.
> 
> That was kind of the intention. Silent corruption is nasty. But ok, let's
> do the opt-in thing.
> 

Maybe, but you only have your point of view in mind: You are effectively
arguing that SMBus timeout should be unconditionally disabled for the SE97.
In reality, the problem is due to the i2c controller driver, or maybe
the i2c controller hardware. So, strictly speaking, there should be a
controller-based property stating that SMBus timeouts should be disabled
for this controller because the controller (or maybe its driver) can not
handle it. After all, the chip is just an innocent victim.

Guenter

>> We would need something like a DT property to do that (smbus-timeout-disable is used
>> in other drivers).
>>
>> .. and definitely not in the detect function. This would have to be done in probe.
> 
> Arrgh, I did test it, but not properly. The patch passed my tests
> because I didn't power-cycle the board and the already disabled timeout
> stayed in place for the next boot. I just assumed jc42_detect was
> called from probe and it was convenient to have the manufacturer etc
> already at hand. Silly silly.
> 
> I'll send a proper patch tomorrow with these things taken care of...
> 
> Cheers,
> peda
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-hwmon" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/hwmon/jc42.c b/drivers/hwmon/jc42.c
index 1bf22eff0b08..3e72bd8e06d1 100644
--- a/drivers/hwmon/jc42.c
+++ b/drivers/hwmon/jc42.c
@@ -416,6 +416,13 @@  static int jc42_detect(struct i2c_client *client, struct i2c_board_info *info)
 	if ((cap & 0xff00) || (config & 0xf800))
 		return -ENODEV;
 
+	if (manid == NXP_MANID && (devid & SE97_DEVID_MASK) == SE97_DEVID) {
+		int smbus = i2c_smbus_read_word_swapped(client, 0x22);
+		if (smbus < 0)
+			return -ENODEV;
+		i2c_smbus_write_word_swapped(client, 0x22, smbus | 0x0080);
+	}
+
 	for (i = 0; i < ARRAY_SIZE(jc42_chips); i++) {
 		struct jc42_chips *chip = &jc42_chips[i];
 		if (manid == chip->manid &&