i2c: rk3x: Increase wait timeout to 1 second
diff mbox

Message ID 1430430247-9632-1-git-send-email-dianders@chromium.org
State New, archived
Headers show

Commit Message

Doug Anderson April 30, 2015, 9:44 p.m. UTC
While it's not sensible for an i2c command to _actually_ need more
than 200ms to complete, let's increase the timeout anyway.  Why?  It
turns out that if you've got a large number of printks going out to a
serial console, interrupts on a CPU can be disabled for hundreds of
milliseconds. That's not a great situation to be in to start with
(maybe we should put a cap in vprintk_emit()) but it's pretty annoying
to start seeing unexplained i2c timeouts.

A normal system shouldn't see i2c timeouts anyway, so increasing the
timeout should help people debugging without hurting other people
excessively.

Signed-off-by: Doug Anderson <dianders@chromium.org>
---
 drivers/i2c/busses/i2c-rk3x.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Caesar Wang May 1, 2015, 3:40 a.m. UTC | #1
? 2015?05?01? 05:44, Doug Anderson ??:
> While it's not sensible for an i2c command to _actually_ need more
> than 200ms to complete, let's increase the timeout anyway.  Why?  It
> turns out that if you've got a large number of printks going out to a
> serial console, interrupts on a CPU can be disabled for hundreds of
> milliseconds. That's not a great situation to be in to start with
> (maybe we should put a cap in vprintk_emit()) but it's pretty annoying
> to start seeing unexplained i2c timeouts.
>
> A normal system shouldn't see i2c timeouts anyway, so increasing the
> timeout should help people debugging without hurting other people
> excessively.
>
> Signed-off-by: Doug Anderson <dianders@chromium.org>
> ---
>   drivers/i2c/busses/i2c-rk3x.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/i2c/busses/i2c-rk3x.c b/drivers/i2c/busses/i2c-rk3x.c
> index 019d542..72e97e30 100644
> --- a/drivers/i2c/busses/i2c-rk3x.c
> +++ b/drivers/i2c/busses/i2c-rk3x.c
> @@ -72,7 +72,7 @@ enum {
>   #define REG_INT_ALL       0x7f
>   
>   /* Constants */
> -#define WAIT_TIMEOUT      200 /* ms */
> +#define WAIT_TIMEOUT      1000 /* ms */
Yeah, verified on veyron device.

Tested-by: Caesar Wang <wxt@rock-chips.com>



Thanks.

Caesar
>   #define DEFAULT_SCL_RATE  (100 * 1000) /* Hz */
>   
>   enum rk3x_i2c_state {
Caesar Wang May 1, 2015, 3:42 a.m. UTC | #2
? 2015?05?01? 05:44, Doug Anderson ??:
> While it's not sensible for an i2c command to _actually_ need more
> than 200ms to complete, let's increase the timeout anyway.  Why?  It
> turns out that if you've got a large number of printks going out to a
> serial console, interrupts on a CPU can be disabled for hundreds of
> milliseconds. That's not a great situation to be in to start with
> (maybe we should put a cap in vprintk_emit()) but it's pretty annoying
> to start seeing unexplained i2c timeouts.
>
> A normal system shouldn't see i2c timeouts anyway, so increasing the
> timeout should help people debugging without hurting other people
> excessively.
>
> Signed-off-by: Doug Anderson <dianders@chromium.org>
> ---
>   drivers/i2c/busses/i2c-rk3x.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/i2c/busses/i2c-rk3x.c b/drivers/i2c/busses/i2c-rk3x.c
> index 019d542..72e97e30 100644
> --- a/drivers/i2c/busses/i2c-rk3x.c
> +++ b/drivers/i2c/busses/i2c-rk3x.c
> @@ -72,7 +72,7 @@ enum {
>   #define REG_INT_ALL       0x7f
>   
>   /* Constants */
> -#define WAIT_TIMEOUT      200 /* ms */
> +#define WAIT_TIMEOUT      1000 /* ms */
Yeah,verified on veyron device.

Tested-by: Caesar Wang <wxt@rock-chips.com>

Thanks.
Caesar

>   #define DEFAULT_SCL_RATE  (100 * 1000) /* Hz */
>   
>   enum rk3x_i2c_state {
Uwe Kleine-König May 4, 2015, 8:33 a.m. UTC | #3
On Thu, Apr 30, 2015 at 02:44:07PM -0700, Doug Anderson wrote:
> While it's not sensible for an i2c command to _actually_ need more
> than 200ms to complete, let's increase the timeout anyway.  Why?  It
> turns out that if you've got a large number of printks going out to a
> serial console, interrupts on a CPU can be disabled for hundreds of
> milliseconds. That's not a great situation to be in to start with
> (maybe we should put a cap in vprintk_emit()) but it's pretty annoying
> to start seeing unexplained i2c timeouts.
> 
> A normal system shouldn't see i2c timeouts anyway, so increasing the
> timeout should help people debugging without hurting other people
> excessively.
Hmm, correct me if I'm wrong: You say that the following can happen:

	rk3x_i2c_xfer calls wait_event_timeout and blocks
	schedule ... disable_irqs ... xfer complete ... do some work ... enable_irqs
	control back to i2c driver after timeout elapsed
	wait_event_timeout returned 0

The documentation of wait_event_timeout tells:

 * Returns:
 * 0 if the @condition evaluated to %false after the @timeout elapsed,
 * 1 if the @condition evaluated to %true after the @timeout elapsed,
 * or the remaining jiffies (at least 1) if the @condition evaluated
 * to %true before the @timeout elapsed.

Where is the misunderstanding?

Best regards
Uwe
Doug Anderson May 4, 2015, 3:11 p.m. UTC | #4
Uwe,

On Mon, May 4, 2015 at 1:33 AM, Uwe Kleine-König
<u.kleine-koenig@pengutronix.de> wrote:
> On Thu, Apr 30, 2015 at 02:44:07PM -0700, Doug Anderson wrote:
>> While it's not sensible for an i2c command to _actually_ need more
>> than 200ms to complete, let's increase the timeout anyway.  Why?  It
>> turns out that if you've got a large number of printks going out to a
>> serial console, interrupts on a CPU can be disabled for hundreds of
>> milliseconds. That's not a great situation to be in to start with
>> (maybe we should put a cap in vprintk_emit()) but it's pretty annoying
>> to start seeing unexplained i2c timeouts.
>>
>> A normal system shouldn't see i2c timeouts anyway, so increasing the
>> timeout should help people debugging without hurting other people
>> excessively.
> Hmm, correct me if I'm wrong: You say that the following can happen:
>
>         rk3x_i2c_xfer calls wait_event_timeout and blocks
>         schedule ... disable_irqs ... xfer complete ... do some work ... enable_irqs
>         control back to i2c driver after timeout elapsed
>         wait_event_timeout returned 0
>
> The documentation of wait_event_timeout tells:
>
>  * Returns:
>  * 0 if the @condition evaluated to %false after the @timeout elapsed,
>  * 1 if the @condition evaluated to %true after the @timeout elapsed,
>  * or the remaining jiffies (at least 1) if the @condition evaluated
>  * to %true before the @timeout elapsed.
>
> Where is the misunderstanding?

Thank you for looking at this!  I will clarify by giving explicit CPU
numbers (this issue can only happen in SMP, I think):

1. CPU1 is running rk3x_i2c_xfer()

2. CPU0 calls vprintk_emit(), which disables all IRQs on CPU0.

3. I2C interrupt is ready but is set to only run on CPU0, where IRQs
are disabled.

4. CPU1 timeout expires.  I2C interrupt is still ready, but CPU0 is
still sitting in the same vprintk_emit()

5. CPU1 sees that no interrupt happened in 200ms, so timeout.


-Doug
Uwe Kleine-König May 4, 2015, 3:24 p.m. UTC | #5
Hello Doug,

On Mon, May 04, 2015 at 08:11:10AM -0700, Doug Anderson wrote:
> On Mon, May 4, 2015 at 1:33 AM, Uwe Kleine-König
> <u.kleine-koenig@pengutronix.de> wrote:
> > On Thu, Apr 30, 2015 at 02:44:07PM -0700, Doug Anderson wrote:
> >> While it's not sensible for an i2c command to _actually_ need more
> >> than 200ms to complete, let's increase the timeout anyway.  Why?  It
> >> turns out that if you've got a large number of printks going out to a
> >> serial console, interrupts on a CPU can be disabled for hundreds of
> >> milliseconds. That's not a great situation to be in to start with
> >> (maybe we should put a cap in vprintk_emit()) but it's pretty annoying
> >> to start seeing unexplained i2c timeouts.
> >>
> >> A normal system shouldn't see i2c timeouts anyway, so increasing the
> >> timeout should help people debugging without hurting other people
> >> excessively.
> > Hmm, correct me if I'm wrong: You say that the following can happen:
> >
> >         rk3x_i2c_xfer calls wait_event_timeout and blocks
> >         schedule ... disable_irqs ... xfer complete ... do some work ... enable_irqs
> >         control back to i2c driver after timeout elapsed
> >         wait_event_timeout returned 0
> >
> > The documentation of wait_event_timeout tells:
> >
> >  * Returns:
> >  * 0 if the @condition evaluated to %false after the @timeout elapsed,
> >  * 1 if the @condition evaluated to %true after the @timeout elapsed,
> >  * or the remaining jiffies (at least 1) if the @condition evaluated
> >  * to %true before the @timeout elapsed.
> >
> > Where is the misunderstanding?
> 
> Thank you for looking at this!  I will clarify by giving explicit CPU
> numbers (this issue can only happen in SMP, I think):
> 
> 1. CPU1 is running rk3x_i2c_xfer()
> 
> 2. CPU0 calls vprintk_emit(), which disables all IRQs on CPU0.
> 
> 3. I2C interrupt is ready but is set to only run on CPU0, where IRQs
> are disabled.
Why does this irq only trigger on cpu0? Assuming this is correct, the
more robust change would be to detect this situation after 200ms instead
of waiting 1s to work around this issue.

> 4. CPU1 timeout expires.  I2C interrupt is still ready, but CPU0 is
> still sitting in the same vprintk_emit()
> 
> 5. CPU1 sees that no interrupt happened in 200ms, so timeout.

Best regards
Uwe
Doug Anderson May 4, 2015, 4:38 p.m. UTC | #6
Hi,

On Mon, May 4, 2015 at 8:24 AM, Uwe Kleine-König
<u.kleine-koenig@pengutronix.de> wrote:
>> Thank you for looking at this!  I will clarify by giving explicit CPU
>> numbers (this issue can only happen in SMP, I think):
>>
>> 1. CPU1 is running rk3x_i2c_xfer()
>>
>> 2. CPU0 calls vprintk_emit(), which disables all IRQs on CPU0.
>>
>> 3. I2C interrupt is ready but is set to only run on CPU0, where IRQs
>> are disabled.
> Why does this irq only trigger on cpu0?

So I've never done the research myself on the interrupt architecture
on ARM, but certainly on all ARM boards I've worked on recently all
IRQs (except those local to just one core) get routed to CPU0.  I've
been told that there are some userspace programs that try to balance
things out a little bit, but even in those cases each IRQ is assigned
a single CPU and if that CPU has its interrupts off then the IRQ won't
be dynamically rerouted.  I think I remember someone telling me that
there was also extra complexity around what happens when CPUs get
taken offline...

A quick search shows some discussion from 2011 at
<http://comments.gmane.org/gmane.linux.ports.arm.kernel/102251>.

Given that lots of smart people have looked at this and our interrupts
are still all going to CPU0, it's not something I'm going to try to
solve right now...


> Assuming this is correct, the
> more robust change would be to detect this situation after 200ms instead
> of waiting 1s to work around this issue.

Detect in what way?  You mean add code to detect that the CPU that's
assigned our interrupt has been stalled for 200ms?  ...and what do I
do in that case?

I suppose I could add code that reads the I2C interrupt status and
notices that although the I2C controller claims that it should have an
interrupt by now but we never saw it go off.  I could then give it
more time.  Is that what you're looking for?  We'd still want to
timeout eventually since there could be some other bug in the system
that's causing the interrupt not to ever go off...

That adds a bunch of extra complexity, though.  Is there a use case
where a timeout of 1 second poses a problem for you that would justify
the extra code?  I'd presume you're thinking of a case where a timeout
would be expected in a case other than a bug in the i2c driver or a
hardware bug in the i2c controller where 200ms is an acceptable
timeout but 1 second is far too long.  Note: if we are using the
timeout to detect a bug of some sort then I'd imagine that 1 second
might be actually better (a slightly longer delay makes it more
obvious that something bad is happening).


One other thing occurs to me: having a longer than 200ms delay may
actually be a correctness thing anyway.  Technically i2c devices are
allowed to clock stretch "indefinitely", so transfers be quite long
and still be "legal".  In practice a REALLY long clock stretch signals
a problem somewhere so we really do need some timeout, but 200ms may
be too short.  For instance, some docs of the bq27541 battery gas
gauge claim that it can clock stretch (in extreme cases) for 144ms.
While this is still less than 200ms, it does seem prudent to give a
little more leeway before seeing a timeout.

-Doug
Uwe Kleine-König May 5, 2015, 1:10 p.m. UTC | #7
Hello,

On Mon, May 04, 2015 at 09:38:28AM -0700, Doug Anderson wrote:
> On Mon, May 4, 2015 at 8:24 AM, Uwe Kleine-König
> <u.kleine-koenig@pengutronix.de> wrote:
> >> Thank you for looking at this!  I will clarify by giving explicit CPU
> >> numbers (this issue can only happen in SMP, I think):
> >>
> >> 1. CPU1 is running rk3x_i2c_xfer()
> >>
> >> 2. CPU0 calls vprintk_emit(), which disables all IRQs on CPU0.
> >>
> >> 3. I2C interrupt is ready but is set to only run on CPU0, where IRQs
> >> are disabled.
> > Why does this irq only trigger on cpu0?
> 
> So I've never done the research myself on the interrupt architecture
> on ARM, but certainly on all ARM boards I've worked on recently all
> IRQs (except those local to just one core) get routed to CPU0.  I've
> been told that there are some userspace programs that try to balance
> things out a little bit, but even in those cases each IRQ is assigned
> a single CPU and if that CPU has its interrupts off then the IRQ won't
> be dynamically rerouted.  I think I remember someone telling me that
> there was also extra complexity around what happens when CPUs get
> taken offline...
> 
> A quick search shows some discussion from 2011 at
> <http://comments.gmane.org/gmane.linux.ports.arm.kernel/102251>.
> 
> Given that lots of smart people have looked at this and our interrupts
> are still all going to CPU0, it's not something I'm going to try to
> solve right now...
researching a bit myself it seems you're right here. It would be nice to
mention this in the commit log though.
 
> > Assuming this is correct, the
> > more robust change would be to detect this situation after 200ms instead
> > of waiting 1s to work around this issue.
> 
> Detect in what way?  You mean add code to detect that the CPU that's
> assigned our interrupt has been stalled for 200ms?  ...and what do I
> do in that case?
> 
> I suppose I could add code that reads the I2C interrupt status and
> notices that although the I2C controller claims that it should have an
> interrupt by now but we never saw it go off.  I could then give it
> more time.  Is that what you're looking for?  We'd still want to
> timeout eventually since there could be some other bug in the system
> that's causing the interrupt not to ever go off...
Hmm, probably that's overkill. (Better spend the time fixing printk not
to disable irqs that long :-)

> That adds a bunch of extra complexity, though.  Is there a use case
> where a timeout of 1 second poses a problem for you that would justify
> the extra code?  I'd presume you're thinking of a case where a timeout
> would be expected in a case other than a bug in the i2c driver or a
> hardware bug in the i2c controller where 200ms is an acceptable
> timeout but 1 second is far too long.  Note: if we are using the
> timeout to detect a bug of some sort then I'd imagine that 1 second
> might be actually better (a slightly longer delay makes it more
> obvious that something bad is happening).
> 
> 
> One other thing occurs to me: having a longer than 200ms delay may
> actually be a correctness thing anyway.  Technically i2c devices are
> allowed to clock stretch "indefinitely", so transfers be quite long
> and still be "legal".  In practice a REALLY long clock stretch signals
> a problem somewhere so we really do need some timeout, but 200ms may
> be too short.  For instance, some docs of the bq27541 battery gas
> gauge claim that it can clock stretch (in extreme cases) for 144ms.
> While this is still less than 200ms, it does seem prudent to give a
> little more leeway before seeing a timeout.
If you rework your commit log to describe the failing situation more
accurately that should be fine.

Best regards
Uwe

Patch
diff mbox

diff --git a/drivers/i2c/busses/i2c-rk3x.c b/drivers/i2c/busses/i2c-rk3x.c
index 019d542..72e97e30 100644
--- a/drivers/i2c/busses/i2c-rk3x.c
+++ b/drivers/i2c/busses/i2c-rk3x.c
@@ -72,7 +72,7 @@  enum {
 #define REG_INT_ALL       0x7f
 
 /* Constants */
-#define WAIT_TIMEOUT      200 /* ms */
+#define WAIT_TIMEOUT      1000 /* ms */
 #define DEFAULT_SCL_RATE  (100 * 1000) /* Hz */
 
 enum rk3x_i2c_state {