spi-rspi I/O errors

Message ID	alpine.DEB.2.02.1401072116370.27579@ayla.of.borg (mailing list archive)
State	RFC
Headers	show Return-Path: <linux-sh-owner@kernel.org> Date: Tue, 7 Jan 2014 21:27:18 +0100 (CET) From: Geert Uytterhoeven <geert@linux-m68k.org> To: linux-spi@vger.kernel.org, linux-sh@vger.kernel.org Subject: spi-rspi I/O errors Message-ID: <alpine.DEB.2.02.1401072116370.27579@ayla.of.borg> User-Agent: Alpine 2.02 (DEB 1266 2009-07-14) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-sh-owner@vger.kernel.org Precedence: bulk

Geert Uytterhoeven Jan. 7, 2014, 8:27 p.m. UTC

I was regularly getting I/O errors when using the Renesas RSPI/QSPI
driver on r8a7791:

    m25p80 spi0.0: error -110 reading SR

Until I applied the following patch, which re-reads RSPI_SPSR on a time-out,
and continues if the condition has become true:


Now it prints from time to time:

    *** rspi->spsr = 0x20, real spsr = 0xa0, wait_mask = 0x80 ***

which shows that rspi->spsr (as set from the interrupt handler) didn't
have bit 7 set, while RSPI_SPSR does have bit 7 set.

So this looks like a race condition in the interrupt handling.

I didn't notice any data corruption after the patch.

Gr{oetje,eeting}s,

						Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
							    -- Linus Torvalds
--
To unsubscribe from this list: send the line "unsubscribe linux-sh" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Laurent Pinchart Jan. 8, 2014, 12:27 a.m. UTC | #1

Hi Geert,

On Tuesday 07 January 2014 21:27:18 Geert Uytterhoeven wrote:
> I was regularly getting I/O errors when using the Renesas RSPI/QSPI
> driver on r8a7791:
> 
>     m25p80 spi0.0: error -110 reading SR
> 
> Until I applied the following patch, which re-reads RSPI_SPSR on a time-out,
> and continues if the condition has become true:
> 
> diff --git a/drivers/spi/spi-rspi.c b/drivers/spi/spi-rspi.c
> index 4b31d89e8568..e63e30c500da 100644
> --- a/drivers/spi/spi-rspi.c
> +++ b/drivers/spi/spi-rspi.c
> @@ -442,8 +442,13 @@ static int rspi_wait_for_interrupt(struct rspi_data
> *rspi, u8 wait_mask, rspi->spsr = rspi_read8(rspi, RSPI_SPSR);
>  	rspi_enable_irq(rspi, enable_bit);
>  	ret = wait_event_timeout(rspi->wait, rspi->spsr & wait_mask, HZ);
> -	if (ret == 0 && !(rspi->spsr & wait_mask))
> -		return -ETIMEDOUT;
> +	if (ret == 0 && !(rspi->spsr & wait_mask)) {
> +		u8 spsr = rspi_read8(rspi, RSPI_SPSR);
> +		printk("*** rspi->spsr = 0x%02x, real spsr = 0x%02x, wait_mask = 
0x%02x
> ***\n",
> +		       rspi->spsr, spsr, wait_mask);
> +		if (!(spsr & wait_mask))
> +			return -ETIMEDOUT;
> +	}
> 
>  	return 0;
>  }
> 
> Now it prints from time to time:
> 
>     *** rspi->spsr = 0x20, real spsr = 0xa0, wait_mask = 0x80 ***
> 
> which shows that rspi->spsr (as set from the interrupt handler) didn't
> have bit 7 set, while RSPI_SPSR does have bit 7 set.
> 
> So this looks like a race condition in the interrupt handling.

What happens if you print rspi->spsr in the interrupt handler ? Does it have 
bit 7 set ?

> I didn't notice any data corruption after the patch.

Simon Horman Jan. 8, 2014, 12:28 a.m. UTC | #2

On Tue, Jan 07, 2014 at 09:27:18PM +0100, Geert Uytterhoeven wrote:
> I was regularly getting I/O errors when using the Renesas RSPI/QSPI
> driver on r8a7791:
> 
>     m25p80 spi0.0: error -110 reading SR
> 
> Until I applied the following patch, which re-reads RSPI_SPSR on a time-out,
> and continues if the condition has become true:
> 
> diff --git a/drivers/spi/spi-rspi.c b/drivers/spi/spi-rspi.c
> index 4b31d89e8568..e63e30c500da 100644
> --- a/drivers/spi/spi-rspi.c
> +++ b/drivers/spi/spi-rspi.c
> @@ -442,8 +442,13 @@ static int rspi_wait_for_interrupt(struct rspi_data *rspi, u8 wait_mask,
>  	rspi->spsr = rspi_read8(rspi, RSPI_SPSR);
>  	rspi_enable_irq(rspi, enable_bit);
>  	ret = wait_event_timeout(rspi->wait, rspi->spsr & wait_mask, HZ);
> -	if (ret == 0 && !(rspi->spsr & wait_mask))
> -		return -ETIMEDOUT;
> +	if (ret == 0 && !(rspi->spsr & wait_mask)) {
> +		u8 spsr = rspi_read8(rspi, RSPI_SPSR);
> +		printk("*** rspi->spsr = 0x%02x, real spsr = 0x%02x, wait_mask = 0x%02x ***\n",
> +		       rspi->spsr, spsr, wait_mask);
> +		if (!(spsr & wait_mask))
> +			return -ETIMEDOUT;

I'm entirely unsure if this is the right approach,
but if it is should rspi->spsr = spsr go somewhere around here?

> +	}
>  
>  	return 0;
>  }
> 
> Now it prints from time to time:
> 
>     *** rspi->spsr = 0x20, real spsr = 0xa0, wait_mask = 0x80 ***
> 
> which shows that rspi->spsr (as set from the interrupt handler) didn't
> have bit 7 set, while RSPI_SPSR does have bit 7 set.
> 
> So this looks like a race condition in the interrupt handling.
> 
> I didn't notice any data corruption after the patch.
> 
> Gr{oetje,eeting}s,
> 
> 						Geert
> 
> --
> Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org
> 
> In personal conversations with technical people, I call myself a hacker. But
> when I'm talking to journalists I just say "programmer" or something like that.
> 							    -- Linus Torvalds
> --
> To unsubscribe from this list: send the line "unsubscribe linux-sh" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-sh" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Geert Uytterhoeven Jan. 8, 2014, 8:28 a.m. UTC | #3

Hi Laurent,

On Wed, Jan 8, 2014 at 1:27 AM, Laurent Pinchart
<laurent.pinchart@ideasonboard.com> wrote:
> On Tuesday 07 January 2014 21:27:18 Geert Uytterhoeven wrote:
>> I was regularly getting I/O errors when using the Renesas RSPI/QSPI
>> driver on r8a7791:
>>
>>     m25p80 spi0.0: error -110 reading SR
>>
>> Until I applied the following patch, which re-reads RSPI_SPSR on a time-out,
>> and continues if the condition has become true:
>>
>> diff --git a/drivers/spi/spi-rspi.c b/drivers/spi/spi-rspi.c
>> index 4b31d89e8568..e63e30c500da 100644
>> --- a/drivers/spi/spi-rspi.c
>> +++ b/drivers/spi/spi-rspi.c
>> @@ -442,8 +442,13 @@ static int rspi_wait_for_interrupt(struct rspi_data
>> *rspi, u8 wait_mask, rspi->spsr = rspi_read8(rspi, RSPI_SPSR);
>>       rspi_enable_irq(rspi, enable_bit);
>>       ret = wait_event_timeout(rspi->wait, rspi->spsr & wait_mask, HZ);
>> -     if (ret == 0 && !(rspi->spsr & wait_mask))
>> -             return -ETIMEDOUT;
>> +     if (ret == 0 && !(rspi->spsr & wait_mask)) {
>> +             u8 spsr = rspi_read8(rspi, RSPI_SPSR);
>> +             printk("*** rspi->spsr = 0x%02x, real spsr = 0x%02x, wait_mask =
> 0x%02x
>> ***\n",
>> +                    rspi->spsr, spsr, wait_mask);
>> +             if (!(spsr & wait_mask))
>> +                     return -ETIMEDOUT;
>> +     }
>>
>>       return 0;
>>  }
>>
>> Now it prints from time to time:
>>
>>     *** rspi->spsr = 0x20, real spsr = 0xa0, wait_mask = 0x80 ***
>>
>> which shows that rspi->spsr (as set from the interrupt handler) didn't
>> have bit 7 set, while RSPI_SPSR does have bit 7 set.
>>
>> So this looks like a race condition in the interrupt handling.
>
> What happens if you print rspi->spsr in the interrupt handler ? Does it have
> bit 7 set ?

I haven't tried that yet. The driver is extremely interrupt-heavy (O(n),
with n the number of bytes transfered), so adding a printk() as-is won't be
a good idea.

Will think a bit more about a better approach...

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds
--
To unsubscribe from this list: send the line "unsubscribe linux-sh" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Laurent Pinchart Jan. 8, 2014, 8:41 a.m. UTC | #4

Hi Geert,

On Wednesday 08 January 2014 09:28:59 Geert Uytterhoeven wrote:
> On Wed, Jan 8, 2014 at 1:27 AM, Laurent Pinchart wrote:
> > On Tuesday 07 January 2014 21:27:18 Geert Uytterhoeven wrote:
> >> I was regularly getting I/O errors when using the Renesas RSPI/QSPI
> >> 
> >> driver on r8a7791:
> >>     m25p80 spi0.0: error -110 reading SR
> >> 
> >> Until I applied the following patch, which re-reads RSPI_SPSR on a
> >> time-out, and continues if the condition has become true:
> >> 
> >> diff --git a/drivers/spi/spi-rspi.c b/drivers/spi/spi-rspi.c
> >> index 4b31d89e8568..e63e30c500da 100644
> >> --- a/drivers/spi/spi-rspi.c
> >> +++ b/drivers/spi/spi-rspi.c
> >> @@ -442,8 +442,13 @@ static int rspi_wait_for_interrupt(struct rspi_data
> >> *rspi, u8 wait_mask, rspi->spsr = rspi_read8(rspi, RSPI_SPSR);
> >> 
> >>       rspi_enable_irq(rspi, enable_bit);
> >>       ret = wait_event_timeout(rspi->wait, rspi->spsr & wait_mask, HZ);
> >> 
> >> -     if (ret == 0 && !(rspi->spsr & wait_mask))
> >> -             return -ETIMEDOUT;
> >> +     if (ret == 0 && !(rspi->spsr & wait_mask)) {
> >> +             u8 spsr = rspi_read8(rspi, RSPI_SPSR);
> >> +             printk("*** rspi->spsr = 0x%02x, real spsr = 0x%02x,
> >> wait_mask => 
> > 0x%02x
> > 
> >> ***\n",
> >> +                    rspi->spsr, spsr, wait_mask);
> >> +             if (!(spsr & wait_mask))
> >> +                     return -ETIMEDOUT;
> >> +     }
> >> 
> >>       return 0;
> >>  
> >>  }
> >> 
> >> Now it prints from time to time:
> >>     *** rspi->spsr = 0x20, real spsr = 0xa0, wait_mask = 0x80 ***
> >> 
> >> which shows that rspi->spsr (as set from the interrupt handler) didn't
> >> have bit 7 set, while RSPI_SPSR does have bit 7 set.
> >> 
> >> So this looks like a race condition in the interrupt handling.
> > 
> > What happens if you print rspi->spsr in the interrupt handler ? Does it
> > have bit 7 set ?
> 
> I haven't tried that yet. The driver is extremely interrupt-heavy (O(n),
> with n the number of bytes transfered), so adding a printk() as-is won't be
> a good idea.
>
> Will think a bit more about a better approach...

Just limit it to the first 10 interrupts then :-)

What I'd like to know is whether the interrupt is trigerred before bit 7 gets 
set by the hardware, or if the rspi_wait_for_interrupt() function gets a stale 
value of rspi->spsr.

Geert Uytterhoeven Jan. 8, 2014, 1:13 p.m. UTC | #5

On Wed, Jan 8, 2014 at 9:41 AM, Laurent Pinchart
<laurent.pinchart@ideasonboard.com> wrote:
> On Wednesday 08 January 2014 09:28:59 Geert Uytterhoeven wrote:
>> On Wed, Jan 8, 2014 at 1:27 AM, Laurent Pinchart wrote:
>> > On Tuesday 07 January 2014 21:27:18 Geert Uytterhoeven wrote:
>> >> I was regularly getting I/O errors when using the Renesas RSPI/QSPI
>> >>
>> >> driver on r8a7791:
>> >>     m25p80 spi0.0: error -110 reading SR
>> >>
>> >> Until I applied the following patch, which re-reads RSPI_SPSR on a
>> >> time-out, and continues if the condition has become true:
>> >>
>> >> diff --git a/drivers/spi/spi-rspi.c b/drivers/spi/spi-rspi.c
>> >> index 4b31d89e8568..e63e30c500da 100644
>> >> --- a/drivers/spi/spi-rspi.c
>> >> +++ b/drivers/spi/spi-rspi.c
>> >> @@ -442,8 +442,13 @@ static int rspi_wait_for_interrupt(struct rspi_data
>> >> *rspi, u8 wait_mask, rspi->spsr = rspi_read8(rspi, RSPI_SPSR);
>> >>
>> >>       rspi_enable_irq(rspi, enable_bit);
>> >>       ret = wait_event_timeout(rspi->wait, rspi->spsr & wait_mask, HZ);
>> >>
>> >> -     if (ret == 0 && !(rspi->spsr & wait_mask))
>> >> -             return -ETIMEDOUT;
>> >> +     if (ret == 0 && !(rspi->spsr & wait_mask)) {
>> >> +             u8 spsr = rspi_read8(rspi, RSPI_SPSR);
>> >> +             printk("*** rspi->spsr = 0x%02x, real spsr = 0x%02x,
>> >> wait_mask =>
>> > 0x%02x
>> >
>> >> ***\n",
>> >> +                    rspi->spsr, spsr, wait_mask);
>> >> +             if (!(spsr & wait_mask))
>> >> +                     return -ETIMEDOUT;
>> >> +     }
>> >>
>> >>       return 0;
>> >>
>> >>  }
>> >>
>> >> Now it prints from time to time:
>> >>     *** rspi->spsr = 0x20, real spsr = 0xa0, wait_mask = 0x80 ***
>> >>
>> >> which shows that rspi->spsr (as set from the interrupt handler) didn't
>> >> have bit 7 set, while RSPI_SPSR does have bit 7 set.
>> >>
>> >> So this looks like a race condition in the interrupt handling.
>> >
>> > What happens if you print rspi->spsr in the interrupt handler ? Does it
>> > have bit 7 set ?
>>
>> I haven't tried that yet. The driver is extremely interrupt-heavy (O(n),
>> with n the number of bytes transfered), so adding a printk() as-is won't be
>> a good idea.
>>
>> Will think a bit more about a better approach...
>
> Just limit it to the first 10 interrupts then :-)

That assumes the issue happens during the first 10 interrupts, which is
usually not the case ;-)

> What I'd like to know is whether the interrupt is trigerred before bit 7 gets
> set by the hardware, or if the rspi_wait_for_interrupt() function gets a stale
> value of rspi->spsr.

It seems to get a stale value. I ended up doing the printk() anyway, and
when reading using small buffer sizes, the issues is more likely to happen.

If I add a second copy of rspi->spsr, and make that copy volatile OR add
a few calls to smp_mb() around the places where it's written and read,
the copy has the right value inside rspi_wait_for_interrupt(), BUT only if
I also print the spsr value inside rspi_irq(). Note that the original rspi->spsr
still has the stale value.

I'd expect wake_up() and wait_event_timeout() to have the right memory
barriers (cfr. Documentation/memory-barriers.txt), so I don't have to add
my own. And why does it need the extra printk()? What extra
synchronization does that give?

Note that the interrupt handler always runs on CPU core 0,
while rspi_wait_for_interrupt() can run on core 0 or 1.
Is there a cache-coherency issue between the two CPU cores?

Am I missing something?

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds
--
To unsubscribe from this list: send the line "unsubscribe linux-sh" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Magnus Damm Jan. 9, 2014, 3:26 a.m. UTC | #6

On Wed, Jan 8, 2014 at 10:13 PM, Geert Uytterhoeven
<geert@linux-m68k.org> wrote:
> On Wed, Jan 8, 2014 at 9:41 AM, Laurent Pinchart
> <laurent.pinchart@ideasonboard.com> wrote:
>> On Wednesday 08 January 2014 09:28:59 Geert Uytterhoeven wrote:
>>> On Wed, Jan 8, 2014 at 1:27 AM, Laurent Pinchart wrote:
>>> > On Tuesday 07 January 2014 21:27:18 Geert Uytterhoeven wrote:
>>> >> I was regularly getting I/O errors when using the Renesas RSPI/QSPI
>>> >>
>>> >> driver on r8a7791:
>>> >>     m25p80 spi0.0: error -110 reading SR
>>> >>
>>> >> Until I applied the following patch, which re-reads RSPI_SPSR on a
>>> >> time-out, and continues if the condition has become true:
>>> >>
>>> >> diff --git a/drivers/spi/spi-rspi.c b/drivers/spi/spi-rspi.c
>>> >> index 4b31d89e8568..e63e30c500da 100644
>>> >> --- a/drivers/spi/spi-rspi.c
>>> >> +++ b/drivers/spi/spi-rspi.c
>>> >> @@ -442,8 +442,13 @@ static int rspi_wait_for_interrupt(struct rspi_data
>>> >> *rspi, u8 wait_mask, rspi->spsr = rspi_read8(rspi, RSPI_SPSR);
>>> >>
>>> >>       rspi_enable_irq(rspi, enable_bit);
>>> >>       ret = wait_event_timeout(rspi->wait, rspi->spsr & wait_mask, HZ);
>>> >>
>>> >> -     if (ret == 0 && !(rspi->spsr & wait_mask))
>>> >> -             return -ETIMEDOUT;
>>> >> +     if (ret == 0 && !(rspi->spsr & wait_mask)) {
>>> >> +             u8 spsr = rspi_read8(rspi, RSPI_SPSR);
>>> >> +             printk("*** rspi->spsr = 0x%02x, real spsr = 0x%02x,
>>> >> wait_mask =>
>>> > 0x%02x
>>> >
>>> >> ***\n",
>>> >> +                    rspi->spsr, spsr, wait_mask);
>>> >> +             if (!(spsr & wait_mask))
>>> >> +                     return -ETIMEDOUT;
>>> >> +     }
>>> >>
>>> >>       return 0;
>>> >>
>>> >>  }
>>> >>
>>> >> Now it prints from time to time:
>>> >>     *** rspi->spsr = 0x20, real spsr = 0xa0, wait_mask = 0x80 ***
>>> >>
>>> >> which shows that rspi->spsr (as set from the interrupt handler) didn't
>>> >> have bit 7 set, while RSPI_SPSR does have bit 7 set.
>>> >>
>>> >> So this looks like a race condition in the interrupt handling.
>>> >
>>> > What happens if you print rspi->spsr in the interrupt handler ? Does it
>>> > have bit 7 set ?
>>>
>>> I haven't tried that yet. The driver is extremely interrupt-heavy (O(n),
>>> with n the number of bytes transfered), so adding a printk() as-is won't be
>>> a good idea.
>>>
>>> Will think a bit more about a better approach...
>>
>> Just limit it to the first 10 interrupts then :-)
>
> That assumes the issue happens during the first 10 interrupts, which is
> usually not the case ;-)
>
>> What I'd like to know is whether the interrupt is trigerred before bit 7 gets
>> set by the hardware, or if the rspi_wait_for_interrupt() function gets a stale
>> value of rspi->spsr.
>
> It seems to get a stale value. I ended up doing the printk() anyway, and
> when reading using small buffer sizes, the issues is more likely to happen.
>
> If I add a second copy of rspi->spsr, and make that copy volatile OR add
> a few calls to smp_mb() around the places where it's written and read,
> the copy has the right value inside rspi_wait_for_interrupt(), BUT only if
> I also print the spsr value inside rspi_irq(). Note that the original rspi->spsr
> still has the stale value.
>
> I'd expect wake_up() and wait_event_timeout() to have the right memory
> barriers (cfr. Documentation/memory-barriers.txt), so I don't have to add
> my own. And why does it need the extra printk()? What extra
> synchronization does that give?
>
> Note that the interrupt handler always runs on CPU core 0,
> while rspi_wait_for_interrupt() can run on core 0 or 1.
> Is there a cache-coherency issue between the two CPU cores?
>
> Am I missing something?

What happens if you run on a uniprocessor system or in case of R-Car
disable SMP and run with a single CPU core?

Cheers,

/ magnus
--
To unsubscribe from this list: send the line "unsubscribe linux-sh" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Geert Uytterhoeven Jan. 9, 2014, 9:48 a.m. UTC | #7

Hi Magnus,

On Thu, Jan 9, 2014 at 4:26 AM, Magnus Damm <magnus.damm@gmail.com> wrote:
>>> What I'd like to know is whether the interrupt is trigerred before bit 7 gets
>>> set by the hardware, or if the rspi_wait_for_interrupt() function gets a stale
>>> value of rspi->spsr.
>>
>> It seems to get a stale value. I ended up doing the printk() anyway, and
>> when reading using small buffer sizes, the issues is more likely to happen.
>>
>> If I add a second copy of rspi->spsr, and make that copy volatile OR add
>> a few calls to smp_mb() around the places where it's written and read,
>> the copy has the right value inside rspi_wait_for_interrupt(), BUT only if
>> I also print the spsr value inside rspi_irq(). Note that the original rspi->spsr
>> still has the stale value.
>>
>> I'd expect wake_up() and wait_event_timeout() to have the right memory
>> barriers (cfr. Documentation/memory-barriers.txt), so I don't have to add
>> my own. And why does it need the extra printk()? What extra
>> synchronization does that give?
>>
>> Note that the interrupt handler always runs on CPU core 0,
>> while rspi_wait_for_interrupt() can run on core 0 or 1.
>> Is there a cache-coherency issue between the two CPU cores?
>>
>> Am I missing something?
>
> What happens if you run on a uniprocessor system or in case of R-Car
> disable SMP and run with a single CPU core?

As expected, I can't get it to fail in UP-mode.

In SMP-mode, it fails in ca. 1% of the transfers. It seems to happen when
switching between the write and the read part of a transfer, so the problem
is less likely to happen when using large buffer sizes, as less transfers are
involved for reading the same amount of data.

If it fails while reading the JEDEC ID from the SPI FLASH during boot up,
you end up without an MTD device.
In other cases, the error is recoverable by reading again (from userspace).

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds
--
To unsubscribe from this list: send the line "unsubscribe linux-sh" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Laurent Pinchart Jan. 9, 2014, 6:13 p.m. UTC | #8

Hi Geert,

On Wednesday 08 January 2014 14:13:27 Geert Uytterhoeven wrote:
> On Wed, Jan 8, 2014 at 9:41 AM, Laurent Pinchart wrote:
> > On Wednesday 08 January 2014 09:28:59 Geert Uytterhoeven wrote:
> >> On Wed, Jan 8, 2014 at 1:27 AM, Laurent Pinchart wrote:
> >> > On Tuesday 07 January 2014 21:27:18 Geert Uytterhoeven wrote:
> >> >> I was regularly getting I/O errors when using the Renesas RSPI/QSPI
> >> >> 
> >> >> driver on r8a7791:
> >> >>     m25p80 spi0.0: error -110 reading SR
> >> >> 
> >> >> Until I applied the following patch, which re-reads RSPI_SPSR on a
> >> >> time-out, and continues if the condition has become true:
> >> >> 
> >> >> diff --git a/drivers/spi/spi-rspi.c b/drivers/spi/spi-rspi.c
> >> >> index 4b31d89e8568..e63e30c500da 100644
> >> >> --- a/drivers/spi/spi-rspi.c
> >> >> +++ b/drivers/spi/spi-rspi.c
> >> >> @@ -442,8 +442,13 @@ static int rspi_wait_for_interrupt(struct
> >> >> rspi_data
> >> >> *rspi, u8 wait_mask, rspi->spsr = rspi_read8(rspi, RSPI_SPSR);
> >> >> 
> >> >>       rspi_enable_irq(rspi, enable_bit);
> >> >>       ret = wait_event_timeout(rspi->wait, rspi->spsr & wait_mask,
> >> >>       HZ);
> >> >> 
> >> >> -     if (ret == 0 && !(rspi->spsr & wait_mask))
> >> >> -             return -ETIMEDOUT;
> >> >> +     if (ret == 0 && !(rspi->spsr & wait_mask)) {
> >> >> +             u8 spsr = rspi_read8(rspi, RSPI_SPSR);
> >> >> +             printk("*** rspi->spsr = 0x%02x, real spsr = 0x%02x,
> >> >> wait_mask => 0x%02x ***\n",
> >> >> +                    rspi->spsr, spsr, wait_mask);
> >> >> +             if (!(spsr & wait_mask))
> >> >> +                     return -ETIMEDOUT;
> >> >> +     }
> >> >> 
> >> >>       return 0;
> >> >>  }
> >> >> 
> >> >> Now it prints from time to time:
> >> >>     *** rspi->spsr = 0x20, real spsr = 0xa0, wait_mask = 0x80 ***
> >> >> 
> >> >> which shows that rspi->spsr (as set from the interrupt handler) didn't
> >> >> have bit 7 set, while RSPI_SPSR does have bit 7 set.
> >> >> 
> >> >> So this looks like a race condition in the interrupt handling.
> >> > 
> >> > What happens if you print rspi->spsr in the interrupt handler ? Does it
> >> > have bit 7 set ?
> >> 
> >> I haven't tried that yet. The driver is extremely interrupt-heavy (O(n),
> >> with n the number of bytes transfered), so adding a printk() as-is won't
> >> be a good idea.
> >> 
> >> Will think a bit more about a better approach...
> > 
> > Just limit it to the first 10 interrupts then :-)
> 
> That assumes the issue happens during the first 10 interrupts, which is
> usually not the case ;-)
> 
> > What I'd like to know is whether the interrupt is trigerred before bit 7
> > gets set by the hardware, or if the rspi_wait_for_interrupt() function
> > gets a stale value of rspi->spsr.
> 
> It seems to get a stale value. I ended up doing the printk() anyway, and
> when reading using small buffer sizes, the issues is more likely to happen.
> 
> If I add a second copy of rspi->spsr, and make that copy volatile OR add
> a few calls to smp_mb() around the places where it's written and read,
> the copy has the right value inside rspi_wait_for_interrupt(), BUT only if
> I also print the spsr value inside rspi_irq(). Note that the original
> rspi->spsr still has the stale value.

Do I understand it correctly that the SPSR value read from the register in 
rspi_irq() is always correct on the first read, and that only the rspi->spsr 
value stored in memory and read in rspi_wait_for_interrupt() is wrong ?

> I'd expect wake_up() and wait_event_timeout() to have the right memory
> barriers (cfr. Documentation/memory-barriers.txt), so I don't have to add
> my own. And why does it need the extra printk()? What extra synchronization
> does that give?

Have you tried adding explicit memory barriers without any extra printk ?

> Note that the interrupt handler always runs on CPU core 0, while
> rspi_wait_for_interrupt() can run on core 0 or 1. Is there a cache-coherency
> issue between the two CPU cores?

I'd be very surprised if that was the case, as there should be lots of other 
breakages, but let's not rule that out too fast. Magnus should be able to help 
you more than I can regarding that.

> Am I missing something?

Laurent Pinchart Jan. 9, 2014, 6:14 p.m. UTC | #9

Hi Geert,

On Thursday 09 January 2014 19:13:04 Laurent Pinchart wrote:
> On Wednesday 08 January 2014 14:13:27 Geert Uytterhoeven wrote:
> > On Wed, Jan 8, 2014 at 9:41 AM, Laurent Pinchart wrote:
> > > On Wednesday 08 January 2014 09:28:59 Geert Uytterhoeven wrote:
> > >> On Wed, Jan 8, 2014 at 1:27 AM, Laurent Pinchart wrote:
> > >> > On Tuesday 07 January 2014 21:27:18 Geert Uytterhoeven wrote:
> > >> >> I was regularly getting I/O errors when using the Renesas RSPI/QSPI
> > >> >> 
> > >> >> driver on r8a7791:
> > >> >>     m25p80 spi0.0: error -110 reading SR
> > >> >> 
> > >> >> Until I applied the following patch, which re-reads RSPI_SPSR on a
> > >> >> time-out, and continues if the condition has become true:
> > >> >> 
> > >> >> diff --git a/drivers/spi/spi-rspi.c b/drivers/spi/spi-rspi.c
> > >> >> index 4b31d89e8568..e63e30c500da 100644
> > >> >> --- a/drivers/spi/spi-rspi.c
> > >> >> +++ b/drivers/spi/spi-rspi.c
> > >> >> @@ -442,8 +442,13 @@ static int rspi_wait_for_interrupt(struct
> > >> >> rspi_data
> > >> >> *rspi, u8 wait_mask, rspi->spsr = rspi_read8(rspi, RSPI_SPSR);
> > >> >> 
> > >> >>       rspi_enable_irq(rspi, enable_bit);
> > >> >>       ret = wait_event_timeout(rspi->wait, rspi->spsr & wait_mask,
> > >> >>       HZ);
> > >> >> 
> > >> >> -     if (ret == 0 && !(rspi->spsr & wait_mask))
> > >> >> -             return -ETIMEDOUT;
> > >> >> +     if (ret == 0 && !(rspi->spsr & wait_mask)) {
> > >> >> +             u8 spsr = rspi_read8(rspi, RSPI_SPSR);
> > >> >> +             printk("*** rspi->spsr = 0x%02x, real spsr = 0x%02x,
> > >> >> wait_mask => 0x%02x ***\n",
> > >> >> +                    rspi->spsr, spsr, wait_mask);
> > >> >> +             if (!(spsr & wait_mask))
> > >> >> +                     return -ETIMEDOUT;
> > >> >> +     }
> > >> >> 
> > >> >>       return 0;
> > >> >>  
> > >> >>  }
> > >> >> 
> > >> >> Now it prints from time to time:
> > >> >>     *** rspi->spsr = 0x20, real spsr = 0xa0, wait_mask = 0x80 ***
> > >> >> 
> > >> >> which shows that rspi->spsr (as set from the interrupt handler)
> > >> >> didn't
> > >> >> have bit 7 set, while RSPI_SPSR does have bit 7 set.
> > >> >> 
> > >> >> So this looks like a race condition in the interrupt handling.
> > >> > 
> > >> > What happens if you print rspi->spsr in the interrupt handler ? Does
> > >> > it
> > >> > have bit 7 set ?
> > >> 
> > >> I haven't tried that yet. The driver is extremely interrupt-heavy
> > >> (O(n),
> > >> with n the number of bytes transfered), so adding a printk() as-is
> > >> won't
> > >> be a good idea.
> > >> 
> > >> Will think a bit more about a better approach...
> > > 
> > > Just limit it to the first 10 interrupts then :-)
> > 
> > That assumes the issue happens during the first 10 interrupts, which is
> > usually not the case ;-)
> > 
> > > What I'd like to know is whether the interrupt is trigerred before bit 7
> > > gets set by the hardware, or if the rspi_wait_for_interrupt() function
> > > gets a stale value of rspi->spsr.
> > 
> > It seems to get a stale value. I ended up doing the printk() anyway, and
> > when reading using small buffer sizes, the issues is more likely to
> > happen.
> > 
> > If I add a second copy of rspi->spsr, and make that copy volatile OR add
> > a few calls to smp_mb() around the places where it's written and read,
> > the copy has the right value inside rspi_wait_for_interrupt(), BUT only if
> > I also print the spsr value inside rspi_irq(). Note that the original
> > rspi->spsr still has the stale value.
> 
> Do I understand it correctly that the SPSR value read from the register in
> rspi_irq() is always correct on the first read, and that only the rspi->spsr
> value stored in memory and read in rspi_wait_for_interrupt() is wrong ?
> > I'd expect wake_up() and wait_event_timeout() to have the right memory
> > barriers (cfr. Documentation/memory-barriers.txt), so I don't have to add
> > my own. And why does it need the extra printk()? What extra
> > synchronization
> > does that give?
> 
> Have you tried adding explicit memory barriers without any extra printk ?
> 
> > Note that the interrupt handler always runs on CPU core 0, while
> > rspi_wait_for_interrupt() can run on core 0 or 1. Is there a
> > cache-coherency issue between the two CPU cores?
> 
> I'd be very surprised if that was the case, as there should be lots of other
> breakages, but let's not rule that out too fast. Magnus should be able to
> help you more than I can regarding that.

Could you verify that the SMP bit is set in ACTLR 
(http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0388e/CIHHDDHJ.html) 
?

> > Am I missing something?

Geert Uytterhoeven Jan. 10, 2014, 9:20 a.m. UTC | #10

Hi Laurent,

On Thu, Jan 9, 2014 at 7:14 PM, Laurent Pinchart
<laurent.pinchart@ideasonboard.com> wrote:
>> > It seems to get a stale value. I ended up doing the printk() anyway, and
>> > when reading using small buffer sizes, the issues is more likely to
>> > happen.
>> >
>> > If I add a second copy of rspi->spsr, and make that copy volatile OR add
>> > a few calls to smp_mb() around the places where it's written and read,
>> > the copy has the right value inside rspi_wait_for_interrupt(), BUT only if
>> > I also print the spsr value inside rspi_irq(). Note that the original
>> > rspi->spsr still has the stale value.
>>
>> Do I understand it correctly that the SPSR value read from the register in
>> rspi_irq() is always correct on the first read, and that only the rspi->spsr
>> value stored in memory and read in rspi_wait_for_interrupt() is wrong ?
>> > I'd expect wake_up() and wait_event_timeout() to have the right memory
>> > barriers (cfr. Documentation/memory-barriers.txt), so I don't have to add
>> > my own. And why does it need the extra printk()? What extra
>> > synchronization
>> > does that give?
>>
>> Have you tried adding explicit memory barriers without any extra printk ?

Yes, cfr. "add a few calls to smp_mb()" above.

>> > Note that the interrupt handler always runs on CPU core 0, while
>> > rspi_wait_for_interrupt() can run on core 0 or 1. Is there a
>> > cache-coherency issue between the two CPU cores?
>>
>> I'd be very surprised if that was the case, as there should be lots of other
>> breakages, but let's not rule that out too fast. Magnus should be able to
>> help you more than I can regarding that.
>
> Could you verify that the SMP bit is set in ACTLR
> (http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0388e/CIHHDDHJ.html)

ACTLR = 0x000000c0 on CPU 0
ACTLR = 0x00000040 on CPU 1

0xc0 = EXCL | SMP
0x40 = SMP

The different EXCL value seems strange to me, especially considering the comment
about the matching cache controller configuration:

[7] EXCL
Exclusive cache bit.
The exclusive cache configuration does not permit data to reside in L1
and L2 at the same time. The exclusive cache configuration provides
support for only caching data on an eviction from L1 when the inner
cache attributes are Write-Back, Cacheable and allocated in L1. Ensure
that your cache controller is also configured for exclusive caching.
0 = disabled. This is the reset value.
1 = enabled.

CCing the list with the more ARM-knowledgeable people...

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds
--
To unsubscribe from this list: send the line "unsubscribe linux-sh" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Geert Uytterhoeven Jan. 13, 2014, 10:19 a.m. UTC | #11

On Fri, Jan 10, 2014 at 10:20 AM, Geert Uytterhoeven
<geert@linux-m68k.org> wrote:
> On Thu, Jan 9, 2014 at 7:14 PM, Laurent Pinchart
> <laurent.pinchart@ideasonboard.com> wrote:
>>> > It seems to get a stale value. I ended up doing the printk() anyway, and
>>> > when reading using small buffer sizes, the issues is more likely to
>>> > happen.
>>> >
>>> > If I add a second copy of rspi->spsr, and make that copy volatile OR add
>>> > a few calls to smp_mb() around the places where it's written and read,
>>> > the copy has the right value inside rspi_wait_for_interrupt(), BUT only if
>>> > I also print the spsr value inside rspi_irq(). Note that the original
>>> > rspi->spsr still has the stale value.
>>>
>>> Do I understand it correctly that the SPSR value read from the register in
>>> rspi_irq() is always correct on the first read, and that only the rspi->spsr
>>> value stored in memory and read in rspi_wait_for_interrupt() is wrong ?
>>> > I'd expect wake_up() and wait_event_timeout() to have the right memory
>>> > barriers (cfr. Documentation/memory-barriers.txt), so I don't have to add
>>> > my own. And why does it need the extra printk()? What extra
>>> > synchronization
>>> > does that give?
>>>
>>> Have you tried adding explicit memory barriers without any extra printk ?
>
> Yes, cfr. "add a few calls to smp_mb()" above.
>
>>> > Note that the interrupt handler always runs on CPU core 0, while
>>> > rspi_wait_for_interrupt() can run on core 0 or 1. Is there a
>>> > cache-coherency issue between the two CPU cores?
>>>
>>> I'd be very surprised if that was the case, as there should be lots of other
>>> breakages, but let's not rule that out too fast. Magnus should be able to
>>> help you more than I can regarding that.
>>
>> Could you verify that the SMP bit is set in ACTLR
>> (http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0388e/CIHHDDHJ.html)
>
> ACTLR = 0x000000c0 on CPU 0
> ACTLR = 0x00000040 on CPU 1
>
> 0xc0 = EXCL | SMP
> 0x40 = SMP
>
> The different EXCL value seems strange to me, especially considering the comment
> about the matching cache controller configuration:

I tried enabling the EXCL bit for ACTLR on CPU 1, but it doesn't make
a difference.

One other thing: As I was playing with m25p80, I had #define DEBUG at the
top of drivers/mtd/devices/m25p80.c. This doesn't cause much kernel output,
just one message before every read:

    spi0.0: m25p80_read from 0x00108600, len 512

However, without the #define DEBUG, it's _much_ more difficult to trigger
the issue.

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds
--
To unsubscribe from this list: send the line "unsubscribe linux-sh" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

spi-rspi I/O errors

Commit Message

Comments

Patch