mbox series

[v3,0/7] NXP DSPI bugfixes and support for LS1028A

Message ID 20200310125542.5939-1-olteanv@gmail.com (mailing list archive)
Headers show
Series NXP DSPI bugfixes and support for LS1028A | expand

Message

Vladimir Oltean March 10, 2020, 12:55 p.m. UTC
From: Vladimir Oltean <vladimir.oltean@nxp.com>

This series addresses a few issues that were missed during the previous
series "[PATCH 00/12] TCFQ to XSPI migration for NXP DSPI driver", on
SoCs other than LS1021A and LS1043A. DMA mode has been completely broken
by that series, and XSPI mode never worked on little-endian controllers.

Then it introduces support for the LS1028A chip, whose compatible has
recently been documented here:

https://lore.kernel.org/linux-devicetree/20200218171418.18297-1-michael@walle.cc/

The device tree for the LS1028A SoC is extended with DMA channels
definition, such that even though the default operating mode is XSPI,
one can simply change DSPI_XSPI_MODE to DSPI_DMA_MODE in the
devtype_data structure of the driver and use that instead.

I don't expect the "fixes" patches to reach very far down the stable
pipe, since there has been pretty heavy refactoring in this driver.

For testing, benchmarking and debugging, the mikroBUS connector on the
LS1028A-RDB is made available via spidev.

Vladimir Oltean (7):
  spi: spi-fsl-dspi: Don't access reserved fields in SPI_MCR
  spi: spi-fsl-dspi: Avoid use-after-free in interrupt mode
  spi: spi-fsl-dspi: Fix little endian access to PUSHR CMD and TXDATA
  spi: spi-fsl-dspi: Fix bits-per-word acceleration in DMA mode
  spi: spi-fsl-dspi: Add support for LS1028A
  arm64: dts: ls1028a: Specify the DMA channels for the DSPI controllers
  arm64: dts: ls1028a-rdb: Add a spidev node for the mikroBUS

 .../boot/dts/freescale/fsl-ls1028a-rdb.dts    |  14 ++
 .../arm64/boot/dts/freescale/fsl-ls1028a.dtsi |   6 +
 drivers/spi/spi-fsl-dspi.c                    | 188 +++++++++++-------
 3 files changed, 134 insertions(+), 74 deletions(-)

Comments

Michael Walle March 10, 2020, 2:11 p.m. UTC | #1
Hi Vladimir,

Am 2020-03-10 13:55, schrieb Vladimir Oltean:
> From: Vladimir Oltean <vladimir.oltean@nxp.com>
> 
> This series addresses a few issues that were missed during the previous
> series "[PATCH 00/12] TCFQ to XSPI migration for NXP DSPI driver", on
> SoCs other than LS1021A and LS1043A. DMA mode has been completely 
> broken
> by that series, and XSPI mode never worked on little-endian 
> controllers.
> 
> Then it introduces support for the LS1028A chip, whose compatible has
> recently been documented here:
> 
> https://lore.kernel.org/linux-devicetree/20200218171418.18297-1-michael@walle.cc/
> 
> The device tree for the LS1028A SoC is extended with DMA channels
> definition, such that even though the default operating mode is XSPI,
> one can simply change DSPI_XSPI_MODE to DSPI_DMA_MODE in the
> devtype_data structure of the driver and use that instead.
> 
> I don't expect the "fixes" patches to reach very far down the stable
> pipe, since there has been pretty heavy refactoring in this driver.
> 
> For testing, benchmarking and debugging, the mikroBUS connector on the
> LS1028A-RDB is made available via spidev.


XSPI mode, while now I cannot reproduce the kernel oops anymore, I've 
found two
other problems (1), (2). Which are likely the same underlying problem. 
DMA mode
works "better" now, still one problem (3).

(1) It seems like the first write/read/erase after the aborted 
instruction
don't get through:

# hexdump -C /dev/mtd0
00000000  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  
|................|
*
[  627.914654] fsl-dspi 2120000.spi: Waiting for transfer to complete 
failed!
^C[  627.921649] spi_master spi1: failed to transfer one message from 
queue

#
# echo huhu > /dev/mtd0
# hexdump -C /dev/mtd0
00000000  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  
|................|
*
hexdump: /dev/mtd0: Input/output error
003df000
# echo huhu > /dev/mtd0
# hexdump -C /dev/mtd0
00000000  68 75 68 75 0a ff ff ff  ff ff ff ff ff ff ff ff  
|huhu............|
00000010  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  
|................|
*
[  642.738905] fsl-dspi 2120000.spi: Waiting for transfer to complete 
failed!
^C[  642.745832] spi_master spi1: failed to transfer one message from 
queue
#
# flash_erase /dev/mtd0 0 1
Erasing 4 Kibyte @ 0 -- 100 % complete
#
# hexdump -C /dev/mtd0
00000000  68 75 68 75 0a ff ff ff  ff ff ff ff ff ff ff ff  
|huhu............|
00000010  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  
|................|
*
hexdump: /dev/mtd0: Input/output error
0023d000
# hexdump -C /dev/mtd0
00000000  68 75 68 75 0a ff ff ff  ff ff ff ff ff ff ff ff  
|huhu............|
00000010  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  
|................|
*

(2) Also, reading the flash, every second time there is (reproducibly) 
an
IO error:

# hexdump -C /dev/mtd0
00000000  68 75 68 75 0a ff ff ff  ff ff ff ff ff ff ff ff  
|huhu............|
00000010  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  
|................|
*
01000000
# hexdump -C /dev/mtd0
00000000  68 75 68 75 0a ff ff ff  ff ff ff ff ff ff ff ff  
|huhu............|
00000010  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  
|................|
*
hexdump: /dev/mtd0: Input/output error
00dc0000
# hexdump -C /dev/mtd0
00000000  68 75 68 75 0a ff ff ff  ff ff ff ff ff ff ff ff  
|huhu............|
00000010  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  
|................|
*
01000000
# hexdump -C /dev/mtd0
00000000  68 75 68 75 0a ff ff ff  ff ff ff ff ff ff ff ff  
|huhu............|
00000010  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  
|................|
*
hexdump: /dev/mtd0: Input/output error
00e6a000

(3) Depening on the content length there is also an IO error. Funny 
enough,
the content is still written to the SPI flash.

# echo -n 1 > /dev/mtd10
# echo -n 12 > /dev/mtd10
# echo -n 123 > /dev/mtd10
# echo -n 1234 > /dev/mtd10
# echo -n 12345 > /dev/mtd10
sh: write error: Input/output error
# echo -n 123456 > /dev/mtd10
# echo -n 1234567 > /dev/mtd10
sh: write error: Input/output error
# echo -n 12345678 > /dev/mtd10
# echo -n 123456789 > /dev/mtd10
# echo -n 1234567890 > /dev/mtd10
# echo -n 12345678901 > /dev/mtd10
# echo -n 123456789012 > /dev/mtd10
# echo -n 1234567890123 > /dev/mtd10
sh: write error: Input/output error
# echo -n 12345678901234 > /dev/mtd10
# echo -n 123456789012345 > /dev/mtd10
sh: write error: Input/output error
# echo -n 1234567890123456 > /dev/mtd10
# echo -n 12345678901234567 > /dev/mtd10
# echo -n 123456789012345678 > /dev/mtd10


# flash_erase /dev/mtd10 0 1
Erasing 4 Kibyte @ 0 -- 100 % complete
# hexdump -C /dev/mtd10
00000000  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  
|................|
*
^C
# echo -n 12345 > /dev/mtd10
sh: write error: Input/output error
# hexdump -C /dev/mtd10
00000000  31 32 33 34 35 ff ff ff  ff ff ff ff ff ff ff ff  
|12345...........|
00000010  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  
|................|
*
^C


-michael

> 
> Vladimir Oltean (7):
>   spi: spi-fsl-dspi: Don't access reserved fields in SPI_MCR
>   spi: spi-fsl-dspi: Avoid use-after-free in interrupt mode
>   spi: spi-fsl-dspi: Fix little endian access to PUSHR CMD and TXDATA
>   spi: spi-fsl-dspi: Fix bits-per-word acceleration in DMA mode
>   spi: spi-fsl-dspi: Add support for LS1028A
>   arm64: dts: ls1028a: Specify the DMA channels for the DSPI 
> controllers
>   arm64: dts: ls1028a-rdb: Add a spidev node for the mikroBUS
> 
>  .../boot/dts/freescale/fsl-ls1028a-rdb.dts    |  14 ++
>  .../arm64/boot/dts/freescale/fsl-ls1028a.dtsi |   6 +
>  drivers/spi/spi-fsl-dspi.c                    | 188 +++++++++++-------
>  3 files changed, 134 insertions(+), 74 deletions(-)
Vladimir Oltean March 10, 2020, 2:56 p.m. UTC | #2
Hi Michael,

On Tue, 10 Mar 2020 at 16:11, Michael Walle <michael@walle.cc> wrote:
>

>
> XSPI mode, while now I cannot reproduce the kernel oops anymore, I've
> found two
> other problems (1), (2). Which are likely the same underlying problem.
> DMA mode
> works "better" now, still one problem (3).
>
> (1) It seems like the first write/read/erase after the aborted
> instruction
> don't get through:
>
> # hexdump -C /dev/mtd0
> 00000000  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff
> |................|
> *
> [  627.914654] fsl-dspi 2120000.spi: Waiting for transfer to complete
> failed!
> ^C[  627.921649] spi_master spi1: failed to transfer one message from
> queue
>
> #
> # echo huhu > /dev/mtd0
> # hexdump -C /dev/mtd0
> 00000000  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff
> |................|
> *
> hexdump: /dev/mtd0: Input/output error
> 003df000
> # echo huhu > /dev/mtd0
> # hexdump -C /dev/mtd0
> 00000000  68 75 68 75 0a ff ff ff  ff ff ff ff ff ff ff ff
> |huhu............|
> 00000010  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff
> |................|
> *
> [  642.738905] fsl-dspi 2120000.spi: Waiting for transfer to complete
> failed!
> ^C[  642.745832] spi_master spi1: failed to transfer one message from
> queue
> #
> # flash_erase /dev/mtd0 0 1
> Erasing 4 Kibyte @ 0 -- 100 % complete
> #
> # hexdump -C /dev/mtd0
> 00000000  68 75 68 75 0a ff ff ff  ff ff ff ff ff ff ff ff
> |huhu............|
> 00000010  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff
> |................|
> *
> hexdump: /dev/mtd0: Input/output error
> 0023d000
> # hexdump -C /dev/mtd0
> 00000000  68 75 68 75 0a ff ff ff  ff ff ff ff ff ff ff ff
> |huhu............|
> 00000010  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff
> |................|
> *
>
> (2) Also, reading the flash, every second time there is (reproducibly)
> an
> IO error:
>
> # hexdump -C /dev/mtd0
> 00000000  68 75 68 75 0a ff ff ff  ff ff ff ff ff ff ff ff
> |huhu............|
> 00000010  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff
> |................|
> *
> 01000000
> # hexdump -C /dev/mtd0
> 00000000  68 75 68 75 0a ff ff ff  ff ff ff ff ff ff ff ff
> |huhu............|
> 00000010  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff
> |................|
> *
> hexdump: /dev/mtd0: Input/output error
> 00dc0000
> # hexdump -C /dev/mtd0
> 00000000  68 75 68 75 0a ff ff ff  ff ff ff ff ff ff ff ff
> |huhu............|
> 00000010  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff
> |................|
> *
> 01000000
> # hexdump -C /dev/mtd0
> 00000000  68 75 68 75 0a ff ff ff  ff ff ff ff ff ff ff ff
> |huhu............|
> 00000010  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff
> |................|
> *
> hexdump: /dev/mtd0: Input/output error
> 00e6a000
>

Just to be clear, issue 2 is seen only after you abort another
transaction, right?

> (3) Depening on the content length there is also an IO error. Funny
> enough,
> the content is still written to the SPI flash.
>
> # echo -n 1 > /dev/mtd10
> # echo -n 12 > /dev/mtd10
> # echo -n 123 > /dev/mtd10
> # echo -n 1234 > /dev/mtd10
> # echo -n 12345 > /dev/mtd10
> sh: write error: Input/output error
> # echo -n 123456 > /dev/mtd10
> # echo -n 1234567 > /dev/mtd10
> sh: write error: Input/output error
> # echo -n 12345678 > /dev/mtd10
> # echo -n 123456789 > /dev/mtd10
> # echo -n 1234567890 > /dev/mtd10
> # echo -n 12345678901 > /dev/mtd10
> # echo -n 123456789012 > /dev/mtd10
> # echo -n 1234567890123 > /dev/mtd10
> sh: write error: Input/output error
> # echo -n 12345678901234 > /dev/mtd10
> # echo -n 123456789012345 > /dev/mtd10
> sh: write error: Input/output error
> # echo -n 1234567890123456 > /dev/mtd10
> # echo -n 12345678901234567 > /dev/mtd10
> # echo -n 123456789012345678 > /dev/mtd10
>
>
> # flash_erase /dev/mtd10 0 1
> Erasing 4 Kibyte @ 0 -- 100 % complete
> # hexdump -C /dev/mtd10
> 00000000  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff
> |................|
> *
> ^C
> # echo -n 12345 > /dev/mtd10
> sh: write error: Input/output error
> # hexdump -C /dev/mtd10
> 00000000  31 32 33 34 35 ff ff ff  ff ff ff ff ff ff ff ff
> |12345...........|
> 00000010  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff
> |................|
> *
> ^C
>

For this one, I think the reported message->actual_length is incorrect
in dspi_dma_xfer, which makes spi-mem scream.

>
> -michael
>
> >
> > Vladimir Oltean (7):
> >   spi: spi-fsl-dspi: Don't access reserved fields in SPI_MCR
> >   spi: spi-fsl-dspi: Avoid use-after-free in interrupt mode
> >   spi: spi-fsl-dspi: Fix little endian access to PUSHR CMD and TXDATA
> >   spi: spi-fsl-dspi: Fix bits-per-word acceleration in DMA mode
> >   spi: spi-fsl-dspi: Add support for LS1028A
> >   arm64: dts: ls1028a: Specify the DMA channels for the DSPI
> > controllers
> >   arm64: dts: ls1028a-rdb: Add a spidev node for the mikroBUS
> >
> >  .../boot/dts/freescale/fsl-ls1028a-rdb.dts    |  14 ++
> >  .../arm64/boot/dts/freescale/fsl-ls1028a.dtsi |   6 +
> >  drivers/spi/spi-fsl-dspi.c                    | 188 +++++++++++-------
> >  3 files changed, 134 insertions(+), 74 deletions(-)

Thanks,
-Vladimir
Michael Walle March 10, 2020, 3:22 p.m. UTC | #3
Hi Vladimir,

Am 2020-03-10 15:56, schrieb Vladimir Oltean:
>> (2) Also, reading the flash, every second time there is (reproducibly)
>> an
>> IO error:
>> 
>> # hexdump -C /dev/mtd0
>> 00000000  68 75 68 75 0a ff ff ff  ff ff ff ff ff ff ff ff
>> |huhu............|
>> 00000010  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff
>> |................|
>> *
>> 01000000
>> # hexdump -C /dev/mtd0
>> 00000000  68 75 68 75 0a ff ff ff  ff ff ff ff ff ff ff ff
>> |huhu............|
>> 00000010  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff
>> |................|
>> *
>> hexdump: /dev/mtd0: Input/output error
>> 00dc0000
>> # hexdump -C /dev/mtd0
>> 00000000  68 75 68 75 0a ff ff ff  ff ff ff ff ff ff ff ff
>> |huhu............|
>> 00000010  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff
>> |................|
>> *
>> 01000000
>> # hexdump -C /dev/mtd0
>> 00000000  68 75 68 75 0a ff ff ff  ff ff ff ff ff ff ff ff
>> |huhu............|
>> 00000010  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff
>> |................|
>> *
>> hexdump: /dev/mtd0: Input/output error
>> 00e6a000
>> 
> 
> Just to be clear, issue 2 is seen only after you abort another
> transaction, right?

No, just normal uninterrupted reading. Just tried it right after
reboot. Doesn't seem to be every second time though, just random
which makes me wonder if that is another problem now. Also the
last successful reading is random.

buildroot login: root
# hexdump -C /dev/mtd0
00000000  31 32 33 34 35 ff ff ff  ff ff ff ff ff ff ff ff  
|12345...........|
00000010  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  
|................|
*
[   32.359156] random: crng init done
01000000
# hexdump -C /dev/mtd0
00000000  31 32 33 34 35 ff ff ff  ff ff ff ff ff ff ff ff  
|12345...........|
00000010  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  
|................|
*
01000000
# hexdump -C /dev/mtd0
00000000  31 32 33 34 35 ff ff ff  ff ff ff ff ff ff ff ff  
|12345...........|
00000010  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  
|................|
*
hexdump: /dev/mtd0: Input/output error
00166000
# hexdump -C /dev/mtd0
00000000  31 32 33 34 35 ff ff ff  ff ff ff ff ff ff ff ff  
|12345...........|
00000010  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  
|................|
*
hexdump: /dev/mtd0: Input/output error
00c68000
# hexdump -C /dev/mtd0
00000000  31 32 33 34 35 ff ff ff  ff ff ff ff ff ff ff ff  
|12345...........|
00000010  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  
|................|
*
hexdump: /dev/mtd0: Input/output error
00243000
#

-michael
Michael Walle March 13, 2020, 4:07 p.m. UTC | #4
Am 2020-03-10 16:22, schrieb Michael Walle:
> Hi Vladimir,
> 
> Am 2020-03-10 15:56, schrieb Vladimir Oltean:
>>> (2) Also, reading the flash, every second time there is 
>>> (reproducibly)
>>> an
>>> IO error:
>>> 
>>> # hexdump -C /dev/mtd0
>>> 00000000  68 75 68 75 0a ff ff ff  ff ff ff ff ff ff ff ff
>>> |huhu............|
>>> 00000010  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff
>>> |................|
>>> *
>>> 01000000
>>> # hexdump -C /dev/mtd0
>>> 00000000  68 75 68 75 0a ff ff ff  ff ff ff ff ff ff ff ff
>>> |huhu............|
>>> 00000010  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff
>>> |................|
>>> *
>>> hexdump: /dev/mtd0: Input/output error
>>> 00dc0000
>>> # hexdump -C /dev/mtd0
>>> 00000000  68 75 68 75 0a ff ff ff  ff ff ff ff ff ff ff ff
>>> |huhu............|
>>> 00000010  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff
>>> |................|
>>> *
>>> 01000000
>>> # hexdump -C /dev/mtd0
>>> 00000000  68 75 68 75 0a ff ff ff  ff ff ff ff ff ff ff ff
>>> |huhu............|
>>> 00000010  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff
>>> |................|
>>> *
>>> hexdump: /dev/mtd0: Input/output error
>>> 00e6a000
>>> 
>> 
>> Just to be clear, issue 2 is seen only after you abort another
>> transaction, right?
> 
> No, just normal uninterrupted reading. Just tried it right after
> reboot. Doesn't seem to be every second time though, just random
> which makes me wonder if that is another problem now. Also the
> last successful reading is random.


Ok I guess I know what the root cause is. This is an extract of
the current code:

> static int dspi_transfer_one_message(struct spi_controller *ctlr,
> 				     struct spi_message *message)
> {
> ..
> 	/* Kick off the interrupt train */
> 	dspi_fifo_write(dspi);
> 
> 	status = wait_event_interruptible(dspi->waitq,
> 					  dspi->waitflags);
> 	dspi->waitflags = 0;
> ..
> }
> 
> static int dspi_rxtx(struct fsl_dspi *dspi)
> {
> 	dspi_fifo_read(dspi);
> 
> 	if (!dspi->len)
> 		/* Success! */
> 		return 0;
> 
> 	dspi_fifo_write(dspi);
> 
> 	return -EINPROGRESS;
> }

dspi_rxtx() is used in the ISR. Both dspi_fifo_write() and dspi_rxtx()
access shared data like, dspi->words_in_flight. In the EIO error case
the following bytes_sent is -1, because dspi->words_in_flight is -1.

> /* Update total number of bytes that were transferred */
> bytes_sent = dspi->words_in_flight * dspi->oper_word_size;

words_in_flight is always -1 after dspi_fifo_read() was called. In
the error case, the ISR kicks in right in the middle of the execution
of dspi_fifo_write() in dspi_transfer_one_message().

> static void dspi_fifo_write(struct fsl_dspi *dspi)
> {
> ..
> 	if (dspi->devtype_data->trans_mode == DSPI_EOQ_MODE)
> 		dspi_eoq_fifo_write(dspi);
> 	 else
> 		dspi_xspi_fifo_write(dspi);

Now if the ISR is executed right here..

> 
> 	/* Update total number of bytes that were transferred */
> 	bytes_sent = dspi->words_in_flight * dspi->oper_word_size;

.. words_in_flight might be -1.

> 	msg->actual_length += bytes_sent;

and bytes_sent is negative. And this causes an IO error because
the returned overall message length doesn't match.

> 	dspi->progress += bytes_sent / DIV_ROUND_UP(xfer->bits_per_word, 8);
> ..
> }

I could not reproduce the issue with the following patch. I don't
know if I got the locking correct though or if there is a better
way to go.


diff --git a/drivers/spi/spi-fsl-dspi.c b/drivers/spi/spi-fsl-dspi.c
index 8b16de9ed382..578fedeb16a0 100644
--- a/drivers/spi/spi-fsl-dspi.c
+++ b/drivers/spi/spi-fsl-dspi.c
@@ -224,6 +224,7 @@ struct fsl_dspi {
         u16                                     tx_cmd;
         const struct fsl_dspi_devtype_data      *devtype_data;

+       spinlock_t lock;
         wait_queue_head_t                       waitq;
         u32                                     waitflags;

@@ -873,14 +874,20 @@ static void dspi_fifo_write(struct fsl_dspi *dspi)

  static int dspi_rxtx(struct fsl_dspi *dspi)
  {
+       unsigned long flags;
+
+       spin_lock_irqsave(&dspi->lock, flags);
         dspi_fifo_read(dspi);

-       if (!dspi->len)
+       if (!dspi->len) {
                 /* Success! */
+               spin_unlock_irqrestore(&dspi->lock, flags);
                 return 0;
+       }

         dspi_fifo_write(dspi);

+       spin_unlock_irqrestore(&dspi->lock, flags);
         return -EINPROGRESS;
  }

@@ -950,7 +957,9 @@ static int dspi_transfer_one_message(struct 
spi_controller *ctlr,
         struct fsl_dspi *dspi = spi_controller_get_devdata(ctlr);
         struct spi_device *spi = message->spi;
         struct spi_transfer *transfer;
+       unsigned long flags;
         int status = 0;
+       int i = 0;

         if (dspi->irq)
                 dspi_enable_interrupts(dspi, true);
@@ -1009,7 +1018,9 @@ static int dspi_transfer_one_message(struct 
spi_controller *ctlr,
                                 goto out;
                 } else if (dspi->irq) {
                         /* Kick off the interrupt train */
+                       spin_lock_irqsave(&dspi->lock, flags);
                         dspi_fifo_write(dspi);
+                       spin_unlock_irqrestore(&dspi->lock, flags);

                         status = wait_event_interruptible(dspi->waitq,
                                                           
dspi->waitflags);
@@ -1301,6 +1312,7 @@ static int dspi_probe(struct platform_device 
*pdev)
         ctlr->cleanup = dspi_cleanup;
         ctlr->slave_abort = dspi_slave_abort;
         ctlr->mode_bits = SPI_CPOL | SPI_CPHA | SPI_LSB_FIRST;
+       spin_lock_init(&dspi->lock);

         pdata = dev_get_platdata(&pdev->dev);
         if (pdata) {



-michael
Vladimir Oltean March 13, 2020, 4:37 p.m. UTC | #5
Hi Michael,

On Fri, 13 Mar 2020 at 18:07, Michael Walle <michael@walle.cc> wrote:
>
> Am 2020-03-10 16:22, schrieb Michael Walle:
> > Hi Vladimir,
> >
> > Am 2020-03-10 15:56, schrieb Vladimir Oltean:
> >>> (2) Also, reading the flash, every second time there is
> >>> (reproducibly)
> >>> an
> >>> IO error:
> >>>
> >>> # hexdump -C /dev/mtd0
> >>> 00000000  68 75 68 75 0a ff ff ff  ff ff ff ff ff ff ff ff
> >>> |huhu............|
> >>> 00000010  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff
> >>> |................|
> >>> *
> >>> 01000000
> >>> # hexdump -C /dev/mtd0
> >>> 00000000  68 75 68 75 0a ff ff ff  ff ff ff ff ff ff ff ff
> >>> |huhu............|
> >>> 00000010  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff
> >>> |................|
> >>> *
> >>> hexdump: /dev/mtd0: Input/output error
> >>> 00dc0000
> >>> # hexdump -C /dev/mtd0
> >>> 00000000  68 75 68 75 0a ff ff ff  ff ff ff ff ff ff ff ff
> >>> |huhu............|
> >>> 00000010  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff
> >>> |................|
> >>> *
> >>> 01000000
> >>> # hexdump -C /dev/mtd0
> >>> 00000000  68 75 68 75 0a ff ff ff  ff ff ff ff ff ff ff ff
> >>> |huhu............|
> >>> 00000010  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff
> >>> |................|
> >>> *
> >>> hexdump: /dev/mtd0: Input/output error
> >>> 00e6a000
> >>>
> >>
> >> Just to be clear, issue 2 is seen only after you abort another
> >> transaction, right?
> >
> > No, just normal uninterrupted reading. Just tried it right after
> > reboot. Doesn't seem to be every second time though, just random
> > which makes me wonder if that is another problem now. Also the
> > last successful reading is random.
>
>
> Ok I guess I know what the root cause is. This is an extract of
> the current code:
>
> > static int dspi_transfer_one_message(struct spi_controller *ctlr,
> >                                    struct spi_message *message)
> > {
> > ..
> >       /* Kick off the interrupt train */
> >       dspi_fifo_write(dspi);
> >
> >       status = wait_event_interruptible(dspi->waitq,
> >                                         dspi->waitflags);
> >       dspi->waitflags = 0;
> > ..
> > }
> >
> > static int dspi_rxtx(struct fsl_dspi *dspi)
> > {
> >       dspi_fifo_read(dspi);
> >
> >       if (!dspi->len)
> >               /* Success! */
> >               return 0;
> >
> >       dspi_fifo_write(dspi);
> >
> >       return -EINPROGRESS;
> > }
>
> dspi_rxtx() is used in the ISR. Both dspi_fifo_write() and dspi_rxtx()
> access shared data like, dspi->words_in_flight. In the EIO error case
> the following bytes_sent is -1, because dspi->words_in_flight is -1.
>
> > /* Update total number of bytes that were transferred */
> > bytes_sent = dspi->words_in_flight * dspi->oper_word_size;
>
> words_in_flight is always -1 after dspi_fifo_read() was called. In
> the error case, the ISR kicks in right in the middle of the execution
> of dspi_fifo_write() in dspi_transfer_one_message().
>
> > static void dspi_fifo_write(struct fsl_dspi *dspi)
> > {
> > ..
> >       if (dspi->devtype_data->trans_mode == DSPI_EOQ_MODE)
> >               dspi_eoq_fifo_write(dspi);
> >        else
> >               dspi_xspi_fifo_write(dspi);
>
> Now if the ISR is executed right here..
>
> >
> >       /* Update total number of bytes that were transferred */
> >       bytes_sent = dspi->words_in_flight * dspi->oper_word_size;
>
> .. words_in_flight might be -1.
>
> >       msg->actual_length += bytes_sent;
>
> and bytes_sent is negative. And this causes an IO error because
> the returned overall message length doesn't match.
>
> >       dspi->progress += bytes_sent / DIV_ROUND_UP(xfer->bits_per_word, 8);
> > ..
> > }
>
> I could not reproduce the issue with the following patch. I don't
> know if I got the locking correct though or if there is a better
> way to go.
>
>
> diff --git a/drivers/spi/spi-fsl-dspi.c b/drivers/spi/spi-fsl-dspi.c
> index 8b16de9ed382..578fedeb16a0 100644
> --- a/drivers/spi/spi-fsl-dspi.c
> +++ b/drivers/spi/spi-fsl-dspi.c
> @@ -224,6 +224,7 @@ struct fsl_dspi {
>          u16                                     tx_cmd;
>          const struct fsl_dspi_devtype_data      *devtype_data;
>
> +       spinlock_t lock;
>          wait_queue_head_t                       waitq;
>          u32                                     waitflags;
>
> @@ -873,14 +874,20 @@ static void dspi_fifo_write(struct fsl_dspi *dspi)
>
>   static int dspi_rxtx(struct fsl_dspi *dspi)
>   {
> +       unsigned long flags;
> +
> +       spin_lock_irqsave(&dspi->lock, flags);
>          dspi_fifo_read(dspi);
>
> -       if (!dspi->len)
> +       if (!dspi->len) {
>                  /* Success! */
> +               spin_unlock_irqrestore(&dspi->lock, flags);
>                  return 0;
> +       }
>
>          dspi_fifo_write(dspi);
>
> +       spin_unlock_irqrestore(&dspi->lock, flags);
>          return -EINPROGRESS;
>   }
>
> @@ -950,7 +957,9 @@ static int dspi_transfer_one_message(struct
> spi_controller *ctlr,
>          struct fsl_dspi *dspi = spi_controller_get_devdata(ctlr);
>          struct spi_device *spi = message->spi;
>          struct spi_transfer *transfer;
> +       unsigned long flags;
>          int status = 0;
> +       int i = 0;
>
>          if (dspi->irq)
>                  dspi_enable_interrupts(dspi, true);
> @@ -1009,7 +1018,9 @@ static int dspi_transfer_one_message(struct
> spi_controller *ctlr,
>                                  goto out;
>                  } else if (dspi->irq) {
>                          /* Kick off the interrupt train */
> +                       spin_lock_irqsave(&dspi->lock, flags);
>                          dspi_fifo_write(dspi);
> +                       spin_unlock_irqrestore(&dspi->lock, flags);
>
>                          status = wait_event_interruptible(dspi->waitq,
>
> dspi->waitflags);
> @@ -1301,6 +1312,7 @@ static int dspi_probe(struct platform_device
> *pdev)
>          ctlr->cleanup = dspi_cleanup;
>          ctlr->slave_abort = dspi_slave_abort;
>          ctlr->mode_bits = SPI_CPOL | SPI_CPHA | SPI_LSB_FIRST;
> +       spin_lock_init(&dspi->lock);
>
>          pdata = dev_get_platdata(&pdev->dev);
>          if (pdata) {
>
>
>
> -michael

Thanks for taking such a close look. I haven't had the time to follow up.
Indeed, the ISR, and therefore dspi_fifo_read, can execute before
dspi->words_in_flight was populated correctly. And bad things will
happen in that case.
But I wouldn't introduce a spin lock that disables interrupts on the
local CPU just for that - it's too complicated for this driver.
I would just keep the SPI interrupt quiesced via SPI_RSER and enable
it only once it's safe, aka after updating dspi->words_in_flight.

Thanks,
-Vladimir
Michael Walle March 13, 2020, 4:53 p.m. UTC | #6
Am 2020-03-13 17:37, schrieb Vladimir Oltean:
> Hi Michael,
> 
> On Fri, 13 Mar 2020 at 18:07, Michael Walle <michael@walle.cc> wrote:
>> 
>> Am 2020-03-10 16:22, schrieb Michael Walle:
>> > Hi Vladimir,
>> >
>> > Am 2020-03-10 15:56, schrieb Vladimir Oltean:
>> >>> (2) Also, reading the flash, every second time there is
>> >>> (reproducibly)
>> >>> an
>> >>> IO error:
>> >>>
>> >>> # hexdump -C /dev/mtd0
>> >>> 00000000  68 75 68 75 0a ff ff ff  ff ff ff ff ff ff ff ff
>> >>> |huhu............|
>> >>> 00000010  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff
>> >>> |................|
>> >>> *
>> >>> 01000000
>> >>> # hexdump -C /dev/mtd0
>> >>> 00000000  68 75 68 75 0a ff ff ff  ff ff ff ff ff ff ff ff
>> >>> |huhu............|
>> >>> 00000010  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff
>> >>> |................|
>> >>> *
>> >>> hexdump: /dev/mtd0: Input/output error
>> >>> 00dc0000
>> >>> # hexdump -C /dev/mtd0
>> >>> 00000000  68 75 68 75 0a ff ff ff  ff ff ff ff ff ff ff ff
>> >>> |huhu............|
>> >>> 00000010  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff
>> >>> |................|
>> >>> *
>> >>> 01000000
>> >>> # hexdump -C /dev/mtd0
>> >>> 00000000  68 75 68 75 0a ff ff ff  ff ff ff ff ff ff ff ff
>> >>> |huhu............|
>> >>> 00000010  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff
>> >>> |................|
>> >>> *
>> >>> hexdump: /dev/mtd0: Input/output error
>> >>> 00e6a000
>> >>>
>> >>
>> >> Just to be clear, issue 2 is seen only after you abort another
>> >> transaction, right?
>> >
>> > No, just normal uninterrupted reading. Just tried it right after
>> > reboot. Doesn't seem to be every second time though, just random
>> > which makes me wonder if that is another problem now. Also the
>> > last successful reading is random.
>> 
>> 
>> Ok I guess I know what the root cause is. This is an extract of
>> the current code:
>> 
>> > static int dspi_transfer_one_message(struct spi_controller *ctlr,
>> >                                    struct spi_message *message)
>> > {
>> > ..
>> >       /* Kick off the interrupt train */
>> >       dspi_fifo_write(dspi);
>> >
>> >       status = wait_event_interruptible(dspi->waitq,
>> >                                         dspi->waitflags);
>> >       dspi->waitflags = 0;
>> > ..
>> > }
>> >
>> > static int dspi_rxtx(struct fsl_dspi *dspi)
>> > {
>> >       dspi_fifo_read(dspi);
>> >
>> >       if (!dspi->len)
>> >               /* Success! */
>> >               return 0;
>> >
>> >       dspi_fifo_write(dspi);
>> >
>> >       return -EINPROGRESS;
>> > }
>> 
>> dspi_rxtx() is used in the ISR. Both dspi_fifo_write() and dspi_rxtx()
>> access shared data like, dspi->words_in_flight. In the EIO error case
>> the following bytes_sent is -1, because dspi->words_in_flight is -1.
>> 
>> > /* Update total number of bytes that were transferred */
>> > bytes_sent = dspi->words_in_flight * dspi->oper_word_size;
>> 
>> words_in_flight is always -1 after dspi_fifo_read() was called. In
>> the error case, the ISR kicks in right in the middle of the execution
>> of dspi_fifo_write() in dspi_transfer_one_message().
>> 
>> > static void dspi_fifo_write(struct fsl_dspi *dspi)
>> > {
>> > ..
>> >       if (dspi->devtype_data->trans_mode == DSPI_EOQ_MODE)
>> >               dspi_eoq_fifo_write(dspi);
>> >        else
>> >               dspi_xspi_fifo_write(dspi);
>> 
>> Now if the ISR is executed right here..
>> 
>> >
>> >       /* Update total number of bytes that were transferred */
>> >       bytes_sent = dspi->words_in_flight * dspi->oper_word_size;
>> 
>> .. words_in_flight might be -1.
>> 
>> >       msg->actual_length += bytes_sent;
>> 
>> and bytes_sent is negative. And this causes an IO error because
>> the returned overall message length doesn't match.
>> 
>> >       dspi->progress += bytes_sent / DIV_ROUND_UP(xfer->bits_per_word, 8);
>> > ..
>> > }
>> 
>> I could not reproduce the issue with the following patch. I don't
>> know if I got the locking correct though or if there is a better
>> way to go.
>> 
>> 
>> diff --git a/drivers/spi/spi-fsl-dspi.c b/drivers/spi/spi-fsl-dspi.c
>> index 8b16de9ed382..578fedeb16a0 100644
>> --- a/drivers/spi/spi-fsl-dspi.c
>> +++ b/drivers/spi/spi-fsl-dspi.c
>> @@ -224,6 +224,7 @@ struct fsl_dspi {
>>          u16                                     tx_cmd;
>>          const struct fsl_dspi_devtype_data      *devtype_data;
>> 
>> +       spinlock_t lock;
>>          wait_queue_head_t                       waitq;
>>          u32                                     waitflags;
>> 
>> @@ -873,14 +874,20 @@ static void dspi_fifo_write(struct fsl_dspi 
>> *dspi)
>> 
>>   static int dspi_rxtx(struct fsl_dspi *dspi)
>>   {
>> +       unsigned long flags;
>> +
>> +       spin_lock_irqsave(&dspi->lock, flags);
>>          dspi_fifo_read(dspi);
>> 
>> -       if (!dspi->len)
>> +       if (!dspi->len) {
>>                  /* Success! */
>> +               spin_unlock_irqrestore(&dspi->lock, flags);
>>                  return 0;
>> +       }
>> 
>>          dspi_fifo_write(dspi);
>> 
>> +       spin_unlock_irqrestore(&dspi->lock, flags);
>>          return -EINPROGRESS;
>>   }
>> 
>> @@ -950,7 +957,9 @@ static int dspi_transfer_one_message(struct
>> spi_controller *ctlr,
>>          struct fsl_dspi *dspi = spi_controller_get_devdata(ctlr);
>>          struct spi_device *spi = message->spi;
>>          struct spi_transfer *transfer;
>> +       unsigned long flags;
>>          int status = 0;
>> +       int i = 0;
>> 
>>          if (dspi->irq)
>>                  dspi_enable_interrupts(dspi, true);
>> @@ -1009,7 +1018,9 @@ static int dspi_transfer_one_message(struct
>> spi_controller *ctlr,
>>                                  goto out;
>>                  } else if (dspi->irq) {
>>                          /* Kick off the interrupt train */
>> +                       spin_lock_irqsave(&dspi->lock, flags);
>>                          dspi_fifo_write(dspi);
>> +                       spin_unlock_irqrestore(&dspi->lock, flags);
>> 
>>                          status = 
>> wait_event_interruptible(dspi->waitq,
>> 
>> dspi->waitflags);
>> @@ -1301,6 +1312,7 @@ static int dspi_probe(struct platform_device
>> *pdev)
>>          ctlr->cleanup = dspi_cleanup;
>>          ctlr->slave_abort = dspi_slave_abort;
>>          ctlr->mode_bits = SPI_CPOL | SPI_CPHA | SPI_LSB_FIRST;
>> +       spin_lock_init(&dspi->lock);
>> 
>>          pdata = dev_get_platdata(&pdev->dev);
>>          if (pdata) {
>> 
>> 
>> 
>> -michael
> 
> Thanks for taking such a close look. I haven't had the time to follow 
> up.
> Indeed, the ISR, and therefore dspi_fifo_read, can execute before
> dspi->words_in_flight was populated correctly. And bad things will
> happen in that case.
> But I wouldn't introduce a spin lock that disables interrupts on the
> local CPU just for that - it's too complicated for this driver.

Sure. It was just a quick test whether the problem actually goes away.

> I would just keep the SPI interrupt quiesced via SPI_RSER and enable
> it only once it's safe, aka after updating dspi->words_in_flight.

I didn't want to move the interrupt_enable() around. I leave this up to
you ;)

-michael