Message ID | 20190729170518.14271-1-will@kernel.org (mailing list archive) |
---|---|
State | Mainlined |
Commit | 22ec71615d824f4f11d38d0e55a88d8956b7e45f |
Headers | show |
Series | arm64: io: Relax implicit barriers in default I/O accessors | expand |
On Mon, Jul 29, 2019 at 06:05:18PM +0100, Will Deacon wrote: > As a concrete example, consider the following: > > memcpy(dma_buffer, data, bufsz); > writel(DMA_START, dev->ctrl_reg); > > A DMB ST instruction between the final write to the DMA buffer and the > write to the control register will ensure that the writes to the DMA > buffer are observed before the write to the control register by all > observers. Put another way, if an observer can see the write to the > control register, it can also see the writes to memory. I think one of the counter arguments here were that a device does not "observe" the write to the control register as that's not a master access (by the device). Do you mean that if another CPU (not the device) can observe the writel(), it would have also observed the write to the DMA buffer (assuming the DMB)? Since the device is also an observer of the DMA buffer accesses, the multi-copy atomicity ensures that the device is also seeing the buffer updates following a DMB. If I got this right, I'm fine with the patch ;). > This has always > been the case and is not sufficient to provide the ordering required by > Linux, since there is no guarantee that the master interface of the > DMA-capable device has observed either of the accesses. However, in an > other-multi-copy atomic world, we can infer two things: > > 1. A write arriving at an endpoint shared between multiple CPUs is > visible to all CPUs > > 2. A write that is visible to all CPUs is also visible to all other > observers in the shareability domain > > Pieced together, this allows us to use DMB OSHST for our default I/O > write accessors and DMB OSHLD for our default I/O read accessors (the > outer-shareability is for handling non-cacheable mappings) for shared > devices. Memory-mapped, DMA-capable peripherals that are private to a > CPU (i.e. inaccessible to other CPUs) still require the DSB, however > these are few and far between and typically require special treatment > anyway which is outside of the scope of the portable driver API (e.g. > GIC, page-table walker, SPE profiler). I think there is another class of devices which are not CPU private (USB, network). The buffer here is on-chip and the CPU can't do much other than issuing a DSB (and even this may not be sufficient). The multi-copy atomicity rule would work between CPUs here but not necessarily for the device. Not sure they rely on the barrier in writel(), I guess we can wait and fix them with the mandatory barriers afterwards. In the meantime: Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
On Mon, Aug 05, 2019 at 11:39:05AM +0100, Catalin Marinas wrote: > On Mon, Jul 29, 2019 at 06:05:18PM +0100, Will Deacon wrote: > > As a concrete example, consider the following: > > > > memcpy(dma_buffer, data, bufsz); > > writel(DMA_START, dev->ctrl_reg); > > > > A DMB ST instruction between the final write to the DMA buffer and the > > write to the control register will ensure that the writes to the DMA > > buffer are observed before the write to the control register by all > > observers. Put another way, if an observer can see the write to the > > control register, it can also see the writes to memory. > > I think one of the counter arguments here were that a device does not > "observe" the write to the control register as that's not a master > access (by the device). Do you mean that if another CPU (not the device) > can observe the writel(), it would have also observed the write to the > DMA buffer (assuming the DMB)? Since the device is also an observer of > the DMA buffer accesses, the multi-copy atomicity ensures that the > device is also seeing the buffer updates following a DMB. Yes, that's right. > > This has always > > been the case and is not sufficient to provide the ordering required by > > Linux, since there is no guarantee that the master interface of the > > DMA-capable device has observed either of the accesses. However, in an > > other-multi-copy atomic world, we can infer two things: > > > > 1. A write arriving at an endpoint shared between multiple CPUs is > > visible to all CPUs > > > > 2. A write that is visible to all CPUs is also visible to all other > > observers in the shareability domain > > > > Pieced together, this allows us to use DMB OSHST for our default I/O > > write accessors and DMB OSHLD for our default I/O read accessors (the > > outer-shareability is for handling non-cacheable mappings) for shared > > devices. Memory-mapped, DMA-capable peripherals that are private to a > > CPU (i.e. inaccessible to other CPUs) still require the DSB, however > > these are few and far between and typically require special treatment > > anyway which is outside of the scope of the portable driver API (e.g. > > GIC, page-table walker, SPE profiler). > > I think there is another class of devices which are not CPU private > (USB, network). The buffer here is on-chip and the CPU can't do much > other than issuing a DSB (and even this may not be sufficient). The > multi-copy atomicity rule would work between CPUs here but not > necessarily for the device. Not sure they rely on the barrier in > writel(), I guess we can wait and fix them with the mandatory barriers > afterwards. In the meantime: > > Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Thanks! I don't think that the on-chip case is too bad: either the device observes updates like the CPUs (which would be necessary in order to guarantee coherence with the CPU caches), or the buffer is really part of the peripheral and mapped non-cacheable, so DMB would work for endpoint ordering. I suppose you could imagine a magic, device-specific dance to ensure visibility, but if that involves things like MMIO registers and read-backs then you'll need mandatory barriers anyway. Will
On Mon, Aug 05, 2019 at 12:35:04PM +0100, Will Deacon wrote: > On Mon, Aug 05, 2019 at 11:39:05AM +0100, Catalin Marinas wrote: > > On Mon, Jul 29, 2019 at 06:05:18PM +0100, Will Deacon wrote: > > > As a concrete example, consider the following: > > > > > > memcpy(dma_buffer, data, bufsz); > > > writel(DMA_START, dev->ctrl_reg); > > > > > > A DMB ST instruction between the final write to the DMA buffer and the > > > write to the control register will ensure that the writes to the DMA > > > buffer are observed before the write to the control register by all > > > observers. Put another way, if an observer can see the write to the > > > control register, it can also see the writes to memory. > > > > I think one of the counter arguments here were that a device does not > > "observe" the write to the control register as that's not a master > > access (by the device). Do you mean that if another CPU (not the device) > > can observe the writel(), it would have also observed the write to the > > DMA buffer (assuming the DMB)? Since the device is also an observer of > > the DMA buffer accesses, the multi-copy atomicity ensures that the > > device is also seeing the buffer updates following a DMB. > > Yes, that's right. > > > > This has always > > > been the case and is not sufficient to provide the ordering required by > > > Linux, since there is no guarantee that the master interface of the > > > DMA-capable device has observed either of the accesses. However, in an > > > other-multi-copy atomic world, we can infer two things: > > > > > > 1. A write arriving at an endpoint shared between multiple CPUs is > > > visible to all CPUs > > > > > > 2. A write that is visible to all CPUs is also visible to all other > > > observers in the shareability domain > > > > > > Pieced together, this allows us to use DMB OSHST for our default I/O > > > write accessors and DMB OSHLD for our default I/O read accessors (the > > > outer-shareability is for handling non-cacheable mappings) for shared > > > devices. Memory-mapped, DMA-capable peripherals that are private to a > > > CPU (i.e. inaccessible to other CPUs) still require the DSB, however > > > these are few and far between and typically require special treatment > > > anyway which is outside of the scope of the portable driver API (e.g. > > > GIC, page-table walker, SPE profiler). [...] I think there may be something missing from the argument: One supposes some kind of causal dependency between the writel() and any dependent read done by "the device". This is entirely up to the device implementation, but sanity seems to require that this depencency is at least as strong as an address dependency, so that the device mustn't speculatively read dma_buffer in advance of receiving the DMA_START command etc. If not, you probably have a badly-designed or broken device and you deserve to have to carry ugly workarounds in your drivers. The multi-copy-atomicity requirement effectively rules out some bus topologies: if a device masters directly onto a bus that doesn't share a common ancestor and shareability domain with the bus its slave interface is connected to, there would be no single place to resolve the DMB (unless some explicit logic were added to handle that somehow). It's reasonable to assume that the hardware is somewhat sane for the default I/O accessors: for weird hardware, drivers would have to work around it explicitly with extra synchronisation but we wouldn't expect this to be common. The per-CPU device case is one where there may be an explicitly weird topology, but this only applies to a few specific devices and we can work around those as appropriate. Does that makes sense? This might be "obvious", so I'm not sure we need to write anything. Just checking my understanding. Cheers ---Dave
diff --git a/arch/arm64/include/asm/io.h b/arch/arm64/include/asm/io.h index 7ed92626949d..b0a3d5b85d4f 100644 --- a/arch/arm64/include/asm/io.h +++ b/arch/arm64/include/asm/io.h @@ -97,7 +97,7 @@ static inline u64 __raw_readq(const volatile void __iomem *addr) ({ \ unsigned long tmp; \ \ - rmb(); \ + dma_rmb(); \ \ /* \ * Create a dummy control dependency from the IO read to any \ @@ -111,7 +111,7 @@ static inline u64 __raw_readq(const volatile void __iomem *addr) }) #define __io_par(v) __iormb(v) -#define __iowmb() wmb() +#define __iowmb() dma_wmb() /* * Relaxed I/O memory access primitives. These follow the Device memory