diff mbox series

[RFC] spi: Take the SPI IO-mutex in the spi_setup() method

Message ID 20201117094517.5654-1-Sergey.Semin@baikalelectronics.ru (mailing list archive)
State Accepted
Commit 4fae3a58ab59d8a286864d61fe1846283a0316f2
Headers show
Series [RFC] spi: Take the SPI IO-mutex in the spi_setup() method | expand

Commit Message

Serge Semin Nov. 17, 2020, 9:45 a.m. UTC
I've discovered that due to the recent commit 49d7d695ca4b ("spi: dw:
Explicitly de-assert CS on SPI transfer completion") a concurrent usage of
the spidev devices with different chip-selects causes the "SPI transfer
timed out" error. The root cause of the problem has turned to be in a race
condition of the SPI-transfer execution procedure and the spi_setup()
method being called at the same time. In particular in calling the
spi_set_cs(false) while there is an SPI-transfer being executed. In my
case due to the commit cited above all CSs get to be switched off by
calling the spi_setup() for /dev/spidev0.1 while there is an concurrent
SPI-transfer execution performed on /dev/spidev0.0. Of course a situation
of the spi_setup() being called while there is an SPI-transfer being
executed for two different SPI peripheral devices of the same controller
may happen not only for the spidev driver, but for instance for MMC SPI +
some another device, or spi_setup() being called from an SPI-peripheral
probe method while some other device has already been probed and is being
used by a corresponding driver...

Of course I could have provided a fix affecting the DW APB SSI driver
only, for instance, by creating a mutual exclusive access to the set_cs
callback and setting/clearing only the bit responsible for the
corresponding chip-select. But after a short research I've discovered that
the problem most likely affects a lot of the other drivers:
- drivers/spi/spi-sun4i.c - RMW the chip-select register;
- drivers/spi/spi-rockchip.c - RMW the chip-select register;
- drivers/spi/spi-qup.c - RMW a generic force-CS flag in a CSR.
- drivers/spi/spi-sifive.c - set a generic CS-mode flag in a CSR.
- drivers/spi/spi-bcm63xx-hsspi.c - uses an internal mutex to serialize
  the bus config changes, but still isn't protected from the race
  condition described above;
- drivers/spi/spi-geni-qcom.c - RMW a chip-select internal flag and set the
  CS state in HW;
- drivers/spi/spi-orion.c - RMW a chip-select register;
- drivers/spi/spi-cadence.c - RMW a chip-select register;
- drivers/spi/spi-armada-3700.c - RMW a chip-select register;
- drivers/spi/spi-lantiq-ssc.c - overwrites the chip-select register;
- drivers/spi/spi-sun6i.c - RMW a chip-select register;
- drivers/spi/spi-synquacer.c - RMW a chip-select register;
- drivers/spi/spi-altera.c - directly sets the chip-select state;
- drivers/spi/spi-omap2-mcspi.c - RMW an internally cached CS state and
  writes it to HW;
- drivers/spi/spi-mt65xx.c - RMW some CSR;
- drivers/spi/spi-jcore.c - directly sets the chip-selects state;
- drivers/spi/spi-mt7621.c - RMW a chip-select register;

I could have missed some drivers, but a scale of the problem is obvious.
As you can see most of the drivers perform an unprotected
Read-modify-write chip-select register modification in the set_cs callback.
Seeing the spi_setup() function is calling the spi_set_cs() and it can be
executed concurrently with SPI-transfers exec procedure, which also calls
spi_set_cs() in the SPI core spi_transfer_one_message() method, the race
condition of the register modification turns to be obvious.

To sum up the problem denoted above affects each driver for a controller
having more than one chip-select lane and which:
1) performs the RMW to some CS-related register with no serialization;
2) directly disables any CS on spi_set_cs(dev, false).
* the later is the case of the DW APB SSI driver.

The controllers which equipped with a single CS theoretically can also
experience the problem, but in practice will not since normally the
spi_setup() isn't called concurrently with the SPI-transfers executed on
the same SPI peripheral device.

In order to generically fix the denoted bug I'd suggest to serialize an
access to the controller IO by taking the IO mutex in the spi_setup()
callback. The mutex is held while there is an SPI communication going on
on the SPI-bus of the corresponding SPI-controller. So calling the
spi_setup() method and disabling/updating the CS state within it would be
safe while there is no any SPI-transfers being executed. Also note I
suppose it would be safer to protect the spi_controller->setup() callback
invocation too, seeing some of the SPI-controller drivers update a HW
state in there.

Fixes: 49d7d695ca4b ("spi: dw: Explicitly de-assert CS on SPI transfer completion")
Signed-off-by: Serge Semin <Sergey.Semin@baikalelectronics.ru>
---
 drivers/spi/spi.c | 5 +++++
 1 file changed, 5 insertions(+)

Comments

Andy Shevchenko Nov. 17, 2020, 10:56 a.m. UTC | #1
On Tue, Nov 17, 2020 at 11:45 AM Serge Semin
<Sergey.Semin@baikalelectronics.ru> wrote:
>
> I've discovered that due to the recent commit 49d7d695ca4b ("spi: dw:
> Explicitly de-assert CS on SPI transfer completion") a concurrent usage of
> the spidev devices with different chip-selects causes the "SPI transfer
> timed out" error.

I'll read this later...

> +       mutex_lock(&spi->controller->io_mutex);
> +
>         if (spi->controller->setup)
>                 status = spi->controller->setup(spi);
>
>         if (spi->controller->auto_runtime_pm && spi->controller->set_cs) {
>                 status = pm_runtime_get_sync(spi->controller->dev.parent);

I didn't check what this lock is protecting, but have you checked all
PM runtime callbacks if they are not taking the lock. When you call PM
runtime functions with 'sync' it may include a lot of work, some of
which may sleep (not a problem for mutex) and may take arbitrary locks
(might be a deadlock in case of trying the same lock).

>                 if (status < 0) {
> +                       mutex_unlock(&spi->controller->io_mutex);
>                         pm_runtime_put_noidle(spi->controller->dev.parent);
>                         dev_err(&spi->controller->dev, "Failed to power device: %d\n",
>                                 status);
> @@ -3354,6 +3357,8 @@ int spi_setup(struct spi_device *spi)
>                 spi_set_cs(spi, false);
>         }
>
> +       mutex_unlock(&spi->controller->io_mutex);
> +
Serge Semin Nov. 17, 2020, 11:28 a.m. UTC | #2
On Tue, Nov 17, 2020 at 12:56:44PM +0200, Andy Shevchenko wrote:
> On Tue, Nov 17, 2020 at 11:45 AM Serge Semin
> <Sergey.Semin@baikalelectronics.ru> wrote:
> >
> > I've discovered that due to the recent commit 49d7d695ca4b ("spi: dw:
> > Explicitly de-assert CS on SPI transfer completion") a concurrent usage of
> > the spidev devices with different chip-selects causes the "SPI transfer
> > timed out" error.
> 
> I'll read this later...
> 
> > +       mutex_lock(&spi->controller->io_mutex);
> > +
> >         if (spi->controller->setup)
> >                 status = spi->controller->setup(spi);
> >
> >         if (spi->controller->auto_runtime_pm && spi->controller->set_cs) {
> >                 status = pm_runtime_get_sync(spi->controller->dev.parent);
> 

> I didn't check what this lock is protecting,

It is used to protect the SPI io operations. So it's locked only
during the SPI memory operations and the SPI-message execution. That's
the time when the core toggles the controller chip-selects by calling
the spi_set_cs() method and the set_cs callback.

> but have you checked all
> PM runtime callbacks if they are not taking the lock. When you call PM
> runtime functions with 'sync' it may include a lot of work, some of
> which may sleep (not a problem for mutex) and may take arbitrary locks
> (might be a deadlock in case of trying the same lock).

Yeah, I understand that. Simple grepping hasn't showed anyone else but
the SPI-core using it. So unless the controllers PM methods also call
spi_setup() or request SPI-transfers, there shouldn't be a deadlock.
Moreover as I can see from the __spi_pump_messages() method the
IO-mutex is locked during the sync-suffixed PM-methods invocation.
AFAICS locking io_mutex around the PM-methods here shouldn't cause
problems. But of course testing it in various platforms/controllers is
always welcome.

-Sergey

> 
> >                 if (status < 0) {
> > +                       mutex_unlock(&spi->controller->io_mutex);
> >                         pm_runtime_put_noidle(spi->controller->dev.parent);
> >                         dev_err(&spi->controller->dev, "Failed to power device: %d\n",
> >                                 status);
> > @@ -3354,6 +3357,8 @@ int spi_setup(struct spi_device *spi)
> >                 spi_set_cs(spi, false);
> >         }
> >
> > +       mutex_unlock(&spi->controller->io_mutex);
> > +
> 
> 
> -- 
> With Best Regards,
> Andy Shevchenko
Mark Brown Nov. 18, 2020, 1:16 p.m. UTC | #3
On Tue, Nov 17, 2020 at 12:45:17PM +0300, Serge Semin wrote:

> method being called at the same time. In particular in calling the
> spi_set_cs(false) while there is an SPI-transfer being executed. In my
> case due to the commit cited above all CSs get to be switched off by
> calling the spi_setup() for /dev/spidev0.1 while there is an concurrent
> SPI-transfer execution performed on /dev/spidev0.0. Of course a situation
> of the spi_setup() being called while there is an SPI-transfer being
> executed for two different SPI peripheral devices of the same controller
> may happen not only for the spidev driver, but for instance for MMC SPI +
> some another device, or spi_setup() being called from an SPI-peripheral
> probe method while some other device has already been probed and is being
> used by a corresponding driver...

It's documented that a driver's spi_setup() operation is supposed to
support being able to be called concurrently with other transfers, see
spi-summary.rst.

> Of course I could have provided a fix affecting the DW APB SSI driver
> only, for instance, by creating a mutual exclusive access to the set_cs
> callback and setting/clearing only the bit responsible for the
> corresponding chip-select. But after a short research I've discovered that
> the problem most likely affects a lot of the other drivers:

Yeah, problems with it are very common as the documentation has noted
since forever.  IIRC there was some problem triggered by trying to force
it to be serialised but I can't remember what it was.
Serge Semin Nov. 18, 2020, 4:29 p.m. UTC | #4
On Wed, Nov 18, 2020 at 01:16:04PM +0000, Mark Brown wrote:
> On Tue, Nov 17, 2020 at 12:45:17PM +0300, Serge Semin wrote:
> 
> > method being called at the same time. In particular in calling the
> > spi_set_cs(false) while there is an SPI-transfer being executed. In my
> > case due to the commit cited above all CSs get to be switched off by
> > calling the spi_setup() for /dev/spidev0.1 while there is an concurrent
> > SPI-transfer execution performed on /dev/spidev0.0. Of course a situation
> > of the spi_setup() being called while there is an SPI-transfer being
> > executed for two different SPI peripheral devices of the same controller
> > may happen not only for the spidev driver, but for instance for MMC SPI +
> > some another device, or spi_setup() being called from an SPI-peripheral
> > probe method while some other device has already been probed and is being
> > used by a corresponding driver...
> 
> It's documented that a driver's spi_setup() operation is supposed to
> support being able to be called concurrently with other transfers, see
> spi-summary.rst.
> 
> > Of course I could have provided a fix affecting the DW APB SSI driver
> > only, for instance, by creating a mutual exclusive access to the set_cs
> > callback and setting/clearing only the bit responsible for the
> > corresponding chip-select. But after a short research I've discovered that
> > the problem most likely affects a lot of the other drivers:
> 
> Yeah, problems with it are very common as the documentation has noted
> since forever.  IIRC there was some problem triggered by trying to force
> it to be serialised but I can't remember what it was.

Does it mean nack for this patch from you? So you suggest to fix the controller
driver instead, right? If so the best solution would be to just lock the
IO mutex in the set_cs callback of the DW APB SSI driver...

-Sergey
Mark Brown Nov. 19, 2020, 6:43 p.m. UTC | #5
On Wed, Nov 18, 2020 at 07:29:31PM +0300, Serge Semin wrote:
> On Wed, Nov 18, 2020 at 01:16:04PM +0000, Mark Brown wrote:

> > Yeah, problems with it are very common as the documentation has noted
> > since forever.  IIRC there was some problem triggered by trying to force
> > it to be serialised but I can't remember what it was.

> Does it mean nack for this patch from you? So you suggest to fix the controller
> driver instead, right? If so the best solution would be to just lock the
> IO mutex in the set_cs callback of the DW APB SSI driver...

I'm not 100% clear what the original issue was, given that this is a
constant source of errors in drivers it seems like it should be better
to change the core but since I don't know why we have this the way it is
it's hard to tell what special cases we might have that could explode if
we try to do so.  I *think* the main issue is things that don't actually
have separate per device registers trying to configure the single set of
controler registers shared by all devices in which case the locking is
fine and helps with this specific case where it's a read/modify/write
operation on per device stuff and this makes sense.
Geert Uytterhoeven Nov. 19, 2020, 7:30 p.m. UTC | #6
Hi Mark,

On Thu, Nov 19, 2020 at 7:45 PM Mark Brown <broonie@kernel.org> wrote:
> On Wed, Nov 18, 2020 at 07:29:31PM +0300, Serge Semin wrote:
> > On Wed, Nov 18, 2020 at 01:16:04PM +0000, Mark Brown wrote:
> > > Yeah, problems with it are very common as the documentation has noted
> > > since forever.  IIRC there was some problem triggered by trying to force
> > > it to be serialised but I can't remember what it was.
>
> > Does it mean nack for this patch from you? So you suggest to fix the controller
> > driver instead, right? If so the best solution would be to just lock the
> > IO mutex in the set_cs callback of the DW APB SSI driver...
>
> I'm not 100% clear what the original issue was, given that this is a
> constant source of errors in drivers it seems like it should be better
> to change the core but since I don't know why we have this the way it is
> it's hard to tell what special cases we might have that could explode if
> we try to do so.  I *think* the main issue is things that don't actually
> have separate per device registers trying to configure the single set of
> controler registers shared by all devices in which case the locking is
> fine and helps with this specific case where it's a read/modify/write
> operation on per device stuff and this makes sense.

It's also an issue on SPI controllers with a single native chipselect,
tricking the driver author into believing that writing to registers
during .setup() is not an issue.  Until an integrator starts using
cs-gpios...

Gr{oetje,eeting}s,

                        Geert
Mark Brown Nov. 20, 2020, 5:17 p.m. UTC | #7
On Tue, Nov 17, 2020 at 12:45:17PM +0300, Serge Semin wrote:

> Of course I could have provided a fix affecting the DW APB SSI driver
> only, for instance, by creating a mutual exclusive access to the set_cs
> callback and setting/clearing only the bit responsible for the
> corresponding chip-select. But after a short research I've discovered that

I think the driver needs a fix anyway for the case where there's a mix
of devices with standard and inverted chip selects, it assumes they all
have the same polarity AFAICT.
Serge Semin Nov. 20, 2020, 5:26 p.m. UTC | #8
On Fri, Nov 20, 2020 at 05:17:48PM +0000, Mark Brown wrote:
> On Tue, Nov 17, 2020 at 12:45:17PM +0300, Serge Semin wrote:
> 
> > Of course I could have provided a fix affecting the DW APB SSI driver
> > only, for instance, by creating a mutual exclusive access to the set_cs
> > callback and setting/clearing only the bit responsible for the
> > corresponding chip-select. But after a short research I've discovered that
> 

> I think the driver needs a fix anyway for the case where there's a mix
> of devices with standard and inverted chip selects, it assumes they all
> have the same polarity AFAICT.

No. The polarity inversion isn't supported by the DW APB SSI
controller. Native chip-select is always active-low, while a
corresponding peripheral is activated by setting a bit in the SER
register (Slave Enable Register). So as long as the dw_spi_set_cs()
callback isn't called while there is a SPI-message execution going on
it shall be save do preserve the current version of the method.

-Sergey
Mark Brown Nov. 20, 2020, 9:35 p.m. UTC | #9
On Tue, 17 Nov 2020 12:45:17 +0300, Serge Semin wrote:
> I've discovered that due to the recent commit 49d7d695ca4b ("spi: dw:
> Explicitly de-assert CS on SPI transfer completion") a concurrent usage of
> the spidev devices with different chip-selects causes the "SPI transfer
> timed out" error. The root cause of the problem has turned to be in a race
> condition of the SPI-transfer execution procedure and the spi_setup()
> method being called at the same time. In particular in calling the
> spi_set_cs(false) while there is an SPI-transfer being executed. In my
> case due to the commit cited above all CSs get to be switched off by
> calling the spi_setup() for /dev/spidev0.1 while there is an concurrent
> SPI-transfer execution performed on /dev/spidev0.0. Of course a situation
> of the spi_setup() being called while there is an SPI-transfer being
> executed for two different SPI peripheral devices of the same controller
> may happen not only for the spidev driver, but for instance for MMC SPI +
> some another device, or spi_setup() being called from an SPI-peripheral
> probe method while some other device has already been probed and is being
> used by a corresponding driver...
> 
> [...]

Applied to

   https://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi.git for-next

Thanks!

[1/1] spi: Take the SPI IO-mutex in the spi_setup() method
      commit: 4fae3a58ab59d8a286864d61fe1846283a0316f2

All being well this means that it will be integrated into the linux-next
tree (usually sometime in the next 24 hours) and sent to Linus during
the next merge window (or sooner if it is a bug fix), however if
problems are discovered then the patch may be dropped or reverted.

You may get further e-mails resulting from automated or manual testing
and review of the tree, please engage with people reporting problems and
send followup patches addressing any issues that are reported if needed.

If any updates are required or you are submitting further changes they
should be sent as incremental updates against current git, existing
patches will not be replaced.

Please add any relevant lists and maintainers to the CCs when replying
to this mail.

Thanks,
Mark
diff mbox series

Patch

diff --git a/drivers/spi/spi.c b/drivers/spi/spi.c
index 0cab239d8e7f..353fa178e39b 100644
--- a/drivers/spi/spi.c
+++ b/drivers/spi/spi.c
@@ -3327,12 +3327,15 @@  int spi_setup(struct spi_device *spi)
 	if (!spi->max_speed_hz)
 		spi->max_speed_hz = spi->controller->max_speed_hz;
 
+	mutex_lock(&spi->controller->io_mutex);
+
 	if (spi->controller->setup)
 		status = spi->controller->setup(spi);
 
 	if (spi->controller->auto_runtime_pm && spi->controller->set_cs) {
 		status = pm_runtime_get_sync(spi->controller->dev.parent);
 		if (status < 0) {
+			mutex_unlock(&spi->controller->io_mutex);
 			pm_runtime_put_noidle(spi->controller->dev.parent);
 			dev_err(&spi->controller->dev, "Failed to power device: %d\n",
 				status);
@@ -3354,6 +3357,8 @@  int spi_setup(struct spi_device *spi)
 		spi_set_cs(spi, false);
 	}
 
+	mutex_unlock(&spi->controller->io_mutex);
+
 	if (spi->rt && !spi->controller->rt) {
 		spi->controller->rt = true;
 		spi_set_thread_rt(spi->controller);