diff mbox series

arm64: io: Relax implicit barriers in default I/O accessors

Message ID 20190729170518.14271-1-will@kernel.org (mailing list archive)
State Mainlined
Commit 22ec71615d824f4f11d38d0e55a88d8956b7e45f
Headers show
Series arm64: io: Relax implicit barriers in default I/O accessors | expand

Commit Message

Will Deacon July 29, 2019, 5:05 p.m. UTC
From: Will Deacon <will.deacon@arm.com>

The arm64 implementation of the default I/O accessors requires barrier
instructions to satisfy the memory ordering requirements documented in
memory-barriers.txt [1], which are largely derived from the behaviour of
I/O accesses on x86.

Of particular interest are the requirements that a write to a device
must be ordered against prior writes to memory, and a read from a device
must be ordered against subsequent reads from memory. We satisfy these
requirements using various flavours of DSB: the most expensive barrier
we have, since it implies completion of prior accesses. This was deemed
necessary when we first implemented the accessors, since accesses to
different endpoints could propagate independently and therefore the only
way to enforce order is to rely on completion guarantees [2].

Since then, the Armv8 memory model has been retrospectively strengthened
to require "other-multi-copy atomicity", a property that requires memory
accesses from an observer to become visible to all other observers
simultaneously [3]. In other words, propagation of accesses is limited
to transitioning from locally observed to globally observed. It recently
became apparent that this change also has a subtle impact on our I/O
accessors for shared peripherals, allowing us to use the cheaper DMB
instruction instead.

As a concrete example, consider the following:

	memcpy(dma_buffer, data, bufsz);
	writel(DMA_START, dev->ctrl_reg);

A DMB ST instruction between the final write to the DMA buffer and the
write to the control register will ensure that the writes to the DMA
buffer are observed before the write to the control register by all
observers. Put another way, if an observer can see the write to the
control register, it can also see the writes to memory. This has always
been the case and is not sufficient to provide the ordering required by
Linux, since there is no guarantee that the master interface of the
DMA-capable device has observed either of the accesses. However, in an
other-multi-copy atomic world, we can infer two things:

  1. A write arriving at an endpoint shared between multiple CPUs is
     visible to all CPUs

  2. A write that is visible to all CPUs is also visible to all other
     observers in the shareability domain

Pieced together, this allows us to use DMB OSHST for our default I/O
write accessors and DMB OSHLD for our default I/O read accessors (the
outer-shareability is for handling non-cacheable mappings) for shared
devices. Memory-mapped, DMA-capable peripherals that are private to a
CPU (i.e. inaccessible to other CPUs) still require the DSB, however
these are few and far between and typically require special treatment
anyway which is outside of the scope of the portable driver API (e.g.
GIC, page-table walker, SPE profiler).

Note that our mandatory barriers remain as DSBs, since there are cases
where they are used to flush the store buffer of the CPU, e.g. when
publishing page table updates to the SMMU.

[1] https://git.kernel.org/linus/4614bbdee357
[2] https://www.youtube.com/watch?v=i6DayghhA8Q
[3] https://www.cl.cam.ac.uk/~pes20/armv8-mca/

Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/arm64/include/asm/io.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

Catalin Marinas Aug. 5, 2019, 10:39 a.m. UTC | #1
On Mon, Jul 29, 2019 at 06:05:18PM +0100, Will Deacon wrote:
> As a concrete example, consider the following:
> 
> 	memcpy(dma_buffer, data, bufsz);
> 	writel(DMA_START, dev->ctrl_reg);
> 
> A DMB ST instruction between the final write to the DMA buffer and the
> write to the control register will ensure that the writes to the DMA
> buffer are observed before the write to the control register by all
> observers. Put another way, if an observer can see the write to the
> control register, it can also see the writes to memory.

I think one of the counter arguments here were that a device does not
"observe" the write to the control register as that's not a master
access (by the device). Do you mean that if another CPU (not the device)
can observe the writel(), it would have also observed the write to the
DMA buffer (assuming the DMB)? Since the device is also an observer of
the DMA buffer accesses, the multi-copy atomicity ensures that the
device is also seeing the buffer updates following a DMB.

If I got this right, I'm fine with the patch ;).

> This has always
> been the case and is not sufficient to provide the ordering required by
> Linux, since there is no guarantee that the master interface of the
> DMA-capable device has observed either of the accesses. However, in an
> other-multi-copy atomic world, we can infer two things:
> 
>   1. A write arriving at an endpoint shared between multiple CPUs is
>      visible to all CPUs
> 
>   2. A write that is visible to all CPUs is also visible to all other
>      observers in the shareability domain
> 
> Pieced together, this allows us to use DMB OSHST for our default I/O
> write accessors and DMB OSHLD for our default I/O read accessors (the
> outer-shareability is for handling non-cacheable mappings) for shared
> devices. Memory-mapped, DMA-capable peripherals that are private to a
> CPU (i.e. inaccessible to other CPUs) still require the DSB, however
> these are few and far between and typically require special treatment
> anyway which is outside of the scope of the portable driver API (e.g.
> GIC, page-table walker, SPE profiler).

I think there is another class of devices which are not CPU private
(USB, network). The buffer here is on-chip and the CPU can't do much
other than issuing a DSB (and even this may not be sufficient). The
multi-copy atomicity rule would work between CPUs here but not
necessarily for the device. Not sure they rely on the barrier in
writel(), I guess we can wait and fix them with the mandatory barriers
afterwards. In the meantime:

Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Will Deacon Aug. 5, 2019, 11:35 a.m. UTC | #2
On Mon, Aug 05, 2019 at 11:39:05AM +0100, Catalin Marinas wrote:
> On Mon, Jul 29, 2019 at 06:05:18PM +0100, Will Deacon wrote:
> > As a concrete example, consider the following:
> > 
> > 	memcpy(dma_buffer, data, bufsz);
> > 	writel(DMA_START, dev->ctrl_reg);
> > 
> > A DMB ST instruction between the final write to the DMA buffer and the
> > write to the control register will ensure that the writes to the DMA
> > buffer are observed before the write to the control register by all
> > observers. Put another way, if an observer can see the write to the
> > control register, it can also see the writes to memory.
> 
> I think one of the counter arguments here were that a device does not
> "observe" the write to the control register as that's not a master
> access (by the device). Do you mean that if another CPU (not the device)
> can observe the writel(), it would have also observed the write to the
> DMA buffer (assuming the DMB)? Since the device is also an observer of
> the DMA buffer accesses, the multi-copy atomicity ensures that the
> device is also seeing the buffer updates following a DMB.

Yes, that's right.

> > This has always
> > been the case and is not sufficient to provide the ordering required by
> > Linux, since there is no guarantee that the master interface of the
> > DMA-capable device has observed either of the accesses. However, in an
> > other-multi-copy atomic world, we can infer two things:
> > 
> >   1. A write arriving at an endpoint shared between multiple CPUs is
> >      visible to all CPUs
> > 
> >   2. A write that is visible to all CPUs is also visible to all other
> >      observers in the shareability domain
> > 
> > Pieced together, this allows us to use DMB OSHST for our default I/O
> > write accessors and DMB OSHLD for our default I/O read accessors (the
> > outer-shareability is for handling non-cacheable mappings) for shared
> > devices. Memory-mapped, DMA-capable peripherals that are private to a
> > CPU (i.e. inaccessible to other CPUs) still require the DSB, however
> > these are few and far between and typically require special treatment
> > anyway which is outside of the scope of the portable driver API (e.g.
> > GIC, page-table walker, SPE profiler).
> 
> I think there is another class of devices which are not CPU private
> (USB, network). The buffer here is on-chip and the CPU can't do much
> other than issuing a DSB (and even this may not be sufficient). The
> multi-copy atomicity rule would work between CPUs here but not
> necessarily for the device. Not sure they rely on the barrier in
> writel(), I guess we can wait and fix them with the mandatory barriers
> afterwards. In the meantime:
> 
> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>

Thanks! I don't think that the on-chip case is too bad: either the device
observes updates like the CPUs (which would be necessary in order to
guarantee coherence with the CPU caches), or the buffer is really part of
the peripheral and mapped non-cacheable, so DMB would work for endpoint
ordering. I suppose you could imagine a magic, device-specific dance to
ensure visibility, but if that involves things like MMIO registers and
read-backs then you'll need mandatory barriers anyway.

Will
Dave Martin Aug. 5, 2019, 3:39 p.m. UTC | #3
On Mon, Aug 05, 2019 at 12:35:04PM +0100, Will Deacon wrote:
> On Mon, Aug 05, 2019 at 11:39:05AM +0100, Catalin Marinas wrote:
> > On Mon, Jul 29, 2019 at 06:05:18PM +0100, Will Deacon wrote:
> > > As a concrete example, consider the following:
> > > 
> > > 	memcpy(dma_buffer, data, bufsz);
> > > 	writel(DMA_START, dev->ctrl_reg);
> > > 
> > > A DMB ST instruction between the final write to the DMA buffer and the
> > > write to the control register will ensure that the writes to the DMA
> > > buffer are observed before the write to the control register by all
> > > observers. Put another way, if an observer can see the write to the
> > > control register, it can also see the writes to memory.
> > 
> > I think one of the counter arguments here were that a device does not
> > "observe" the write to the control register as that's not a master
> > access (by the device). Do you mean that if another CPU (not the device)
> > can observe the writel(), it would have also observed the write to the
> > DMA buffer (assuming the DMB)? Since the device is also an observer of
> > the DMA buffer accesses, the multi-copy atomicity ensures that the
> > device is also seeing the buffer updates following a DMB.
> 
> Yes, that's right.
> 
> > > This has always
> > > been the case and is not sufficient to provide the ordering required by
> > > Linux, since there is no guarantee that the master interface of the
> > > DMA-capable device has observed either of the accesses. However, in an
> > > other-multi-copy atomic world, we can infer two things:
> > > 
> > >   1. A write arriving at an endpoint shared between multiple CPUs is
> > >      visible to all CPUs
> > > 
> > >   2. A write that is visible to all CPUs is also visible to all other
> > >      observers in the shareability domain
> > > 
> > > Pieced together, this allows us to use DMB OSHST for our default I/O
> > > write accessors and DMB OSHLD for our default I/O read accessors (the
> > > outer-shareability is for handling non-cacheable mappings) for shared
> > > devices. Memory-mapped, DMA-capable peripherals that are private to a
> > > CPU (i.e. inaccessible to other CPUs) still require the DSB, however
> > > these are few and far between and typically require special treatment
> > > anyway which is outside of the scope of the portable driver API (e.g.
> > > GIC, page-table walker, SPE profiler).

[...]

I think there may be something missing from the argument:

One supposes some kind of causal dependency between the writel() and
any dependent read done by "the device".  This is entirely up to the
device implementation, but sanity seems to require that this depencency
is at least as strong as an address dependency, so that the device
mustn't speculatively read dma_buffer in advance of receiving the
DMA_START command etc.  If not, you probably have a badly-designed or
broken device and you deserve to have to carry ugly workarounds in
your drivers.

The multi-copy-atomicity requirement effectively rules out some
bus topologies: if a device masters directly onto a bus that doesn't
share a common ancestor and shareability domain with the bus its slave
interface is connected to, there would be no single place to resolve
the DMB (unless some explicit logic were added to handle that somehow).

It's reasonable to assume that the hardware is somewhat sane for the
default I/O accessors: for weird hardware, drivers would have to work
around it explicitly with extra synchronisation but we wouldn't expect
this to be common.

The per-CPU device case is one where there may be an explicitly weird
topology, but this only applies to a few specific devices and we can
work around those as appropriate.

Does that makes sense?  This might be "obvious", so I'm not sure we
need to write anything.  Just checking my understanding.

Cheers
---Dave
diff mbox series

Patch

diff --git a/arch/arm64/include/asm/io.h b/arch/arm64/include/asm/io.h
index 7ed92626949d..b0a3d5b85d4f 100644
--- a/arch/arm64/include/asm/io.h
+++ b/arch/arm64/include/asm/io.h
@@ -97,7 +97,7 @@  static inline u64 __raw_readq(const volatile void __iomem *addr)
 ({									\
 	unsigned long tmp;						\
 									\
-	rmb();								\
+	dma_rmb();								\
 									\
 	/*								\
 	 * Create a dummy control dependency from the IO read to any	\
@@ -111,7 +111,7 @@  static inline u64 __raw_readq(const volatile void __iomem *addr)
 })
 
 #define __io_par(v)		__iormb(v)
-#define __iowmb()		wmb()
+#define __iowmb()		dma_wmb()
 
 /*
  * Relaxed I/O memory access primitives. These follow the Device memory