diff mbox

RFC on writel and writel_relaxed

Message ID 20180327110258.GF2464@arm.com (mailing list archive)
State Not Applicable
Headers show

Commit Message

Will Deacon March 27, 2018, 11:02 a.m. UTC
On Tue, Mar 27, 2018 at 12:53:49PM +0200, Arnd Bergmann wrote:
> On Tue, Mar 27, 2018 at 12:09 PM, Will Deacon <will.deacon@arm.com> wrote:
> > On Tue, Mar 27, 2018 at 12:05:06PM +0200, Arnd Bergmann wrote:
> >> > -
> >> > -     See Documentation/DMA-API.txt for more information on consistent memory.
> >> > +     can see it now has ownership.  Note that, when using writel(), a prior
> >> > +     wmb() is not needed to guarantee that the cache coherent memory writes
> >> > +     have completed before writing to the cache incoherent MMIO region.
> >> > +     If this ordering between incoherent MMIO and coherent memory regions
> 
> One more thing: I think the term "incoherent MMIO" is a bit confusing, I'd
> prefer just "MMIO" here. At least I don't have the faintest clue what the
> difference between "coherent MMIO" and "incoherent MMIO" would be ;-)

Yes, you're right. I was just following the terminology that's already used
here, but actually that seems not be used anywhere else in the document!
I'll kill it.

> >> > +     is not required, writel_relaxed() can be used instead and is significantly
> >> > +     cheaper on some weakly-ordered architectures.
> >>
> >> I think that's a great improvement, but I'm a bit worried about recommending
> >> writel_relaxed() too much: I've seen a lot of drivers that just always use
> >> writel_relaxed() over write(), and some of them get that wrong when they
> >> don't understand the difference but end up using DMA without explicit
> >> barriers anyway.
> >>
> >> Also, having an architecture-independent driver use wmb()+writel_relaxed()
> >> ends up being more expensive than just using write(). Not sure how to
> >> best phrase it though.
> >
> > Perhaps I add reword that with a simple example to say:
> >
> >   If this ordering between incoherent MMIO and coherent memory regions
> >   is not required (e.g. in a sequence of accesses all to the MMIO region)
> >   [...]
> >
> > since that seems to be the usual case where the _relaxed accessors help.
> 
> That still doesn't quite capture what I'd like driver writes to do: in essence
> I would recommend them to use writel() all the time, except in performance
> critical code that has been shown to be correct and has a comment to explain
> why _relaxed() is ok in that particular function.
> 
> Maybe it can just be rephrased to warn against the use of writel_relaxed()
> here, and explain the difference that way:
> 
>      can see it now has ownership.  Note that, when using writel(), a prior
>      wmb() is not needed to guarantee that the cache coherent memory writes
>      have completed before writing to the cache incoherent MMIO region.
>      The cheaper writel_relaxed() does not guarantee the DMA to be visible
>      to the device and must not be used here.

Fair enough. I'd rather people used _relaxed by default, but I have to admit
that it will probably just result in them getting things wrong. Just a tiny
bit of wordsmithing brings this to:




If you're happy with that, I'll send it as a proper patch.

Cheers,

Will
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Arnd Bergmann March 27, 2018, 11:05 a.m. UTC | #1
On Tue, Mar 27, 2018 at 1:02 PM, Will Deacon <will.deacon@arm.com> wrote:
> On Tue, Mar 27, 2018 at 12:53:49PM +0200, Arnd Bergmann wrote:
>> On Tue, Mar 27, 2018 at 12:09 PM, Will Deacon <will.deacon@arm.com> wrote:
>> > On Tue, Mar 27, 2018 at 12:05:06PM +0200, Arnd Bergmann wrote:

> diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> index a863009849a3..3247547d1c36 100644
> --- a/Documentation/memory-barriers.txt
> +++ b/Documentation/memory-barriers.txt
> @@ -1909,9 +1909,6 @@ There are some more advanced barrier functions:
>                 /* assign ownership */
>                 desc->status = DEVICE_OWN;
>
> -               /* force memory to sync before notifying device via MMIO */
> -               wmb();
> -
>                 /* notify device of new descriptors */
>                 writel(DESC_NOTIFY, doorbell);
>         }
> @@ -1919,11 +1916,15 @@ There are some more advanced barrier functions:
>       The dma_rmb() allows us guarantee the device has released ownership
>       before we read the data from the descriptor, and the dma_wmb() allows
>       us to guarantee the data is written to the descriptor before the device
> -     can see it now has ownership.  The wmb() is needed to guarantee that the
> -     cache coherent memory writes have completed before attempting a write to
> -     the cache incoherent MMIO region.
> -
> -     See Documentation/DMA-API.txt for more information on consistent memory.
> +     can see it now has ownership.  Note that, when using writel(), a prior
> +     wmb() is not needed to guarantee that the cache coherent memory writes
> +     have completed before writing to the MMIO region.  The cheaper
> +     writel_relaxed() does not provide this guarantee and must not be used
> +     here.
> +
> +     See the subsection "Kernel I/O barrier effects" for more information on
> +     relaxed I/O accessors and the Documentation/DMA-API.txt file for more
> +     information on consistent memory.
>
>
>  MMIO WRITE BARRIER
>
>
> If you're happy with that, I'll send it as a proper patch.

Looks good to me, thanks!

       Arnd
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Benjamin Herrenschmidt March 27, 2018, 11:25 a.m. UTC | #2
On Tue, 2018-03-27 at 12:02 +0100, Will Deacon wrote:
>       can see it now has ownership.  Note that, when using writel(), a prior
> >       wmb() is not needed to guarantee that the cache coherent memory writes
> >       have completed before writing to the cache incoherent MMIO region.
> >       The cheaper writel_relaxed() does not guarantee the DMA to be visible
> >       to the device and must not be used here.
> 
> Fair enough. I'd rather people used _relaxed by default, but I have to admit
> that it will probably just result in them getting things wrong. Just a tiny
> bit of wordsmithing brings this to:

I prefer people using writel() by default for the simple reason that
99% of writels out there are configuration stuff for which the
performance difference doesn't matter, and people will just get it
wrong.

Let's focus on the rare fast path for optimisation.
> 
> diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> index a863009849a3..3247547d1c36 100644
> --- a/Documentation/memory-barriers.txt
> +++ b/Documentation/memory-barriers.txt
> @@ -1909,9 +1909,6 @@ There are some more advanced barrier functions:
>                 /* assign ownership */
>                 desc->status = DEVICE_OWN;
>  
> -               /* force memory to sync before notifying device via MMIO */
> -               wmb();
> -
>                 /* notify device of new descriptors */
>                 writel(DESC_NOTIFY, doorbell);
>         }
> @@ -1919,11 +1916,15 @@ There are some more advanced barrier functions:
>       The dma_rmb() allows us guarantee the device has released ownership
>       before we read the data from the descriptor, and the dma_wmb() allows
>       us to guarantee the data is written to the descriptor before the device
> -     can see it now has ownership.  The wmb() is needed to guarantee that the
> -     cache coherent memory writes have completed before attempting a write to
> -     the cache incoherent MMIO region.
> -
> -     See Documentation/DMA-API.txt for more information on consistent memory.
> +     can see it now has ownership.  Note that, when using writel(), a prior
> +     wmb() is not needed to guarantee that the cache coherent memory writes
> +     have completed before writing to the MMIO region.  The cheaper
> +     writel_relaxed() does not provide this guarantee and must not be used
> +     here.
> +
> +     See the subsection "Kernel I/O barrier effects" for more information on
> +     relaxed I/O accessors and the Documentation/DMA-API.txt file for more
> +     information on consistent memory.
>  
>  
>  MMIO WRITE BARRIER
> 
> 
> If you're happy with that, I'll send it as a proper patch.
> 
> Cheers,
> 
> Will
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Laight March 27, 2018, 1:20 p.m. UTC | #3
> Fair enough. I'd rather people used _relaxed by default, but I have to admit
> that it will probably just result in them getting things wrong...

Certainly requiring the driver writes use explicit barriers should make
them understand when and why they are needed - and then put in the correct ones.

The problem I've had is that I have a good idea which barriers are needed
but find that readl/writel seem to contain a lot of extra ones.
Maybe the are required in some places, but the extra synchronising
instructions could easily have measureable performance effects on
hot paths.

Drivers are likely to contain sequences like:
	read_io
	if (...) return
	write_mem
	...
	write_mem
	barrier
	write_mem
	barrier
	write_io
for things like ring updates.
Where the 'mem' might actually be in io space.
In such sequences not all the synchronising instructions are needed.
I'm not at all sure it is easy to get the right set.

	David




--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sinan Kaya March 27, 2018, 1:46 p.m. UTC | #4
On 3/27/2018 7:02 AM, Will Deacon wrote:
> -     See Documentation/DMA-API.txt for more information on consistent memory.
> +     can see it now has ownership.  Note that, when using writel(), a prior
> +     wmb() is not needed to guarantee that the cache coherent memory writes
> +     have completed before writing to the MMIO region.  The cheaper
> +     writel_relaxed() does not provide this guarantee and must not be used
> +     here.

Can we say the same thing for iowrite32() and iowrite32be(). I also see wmb()
in front of these.
Will Deacon March 27, 2018, 2:36 p.m. UTC | #5
On Tue, Mar 27, 2018 at 09:46:51AM -0400, Sinan Kaya wrote:
> On 3/27/2018 7:02 AM, Will Deacon wrote:
> > -     See Documentation/DMA-API.txt for more information on consistent memory.
> > +     can see it now has ownership.  Note that, when using writel(), a prior
> > +     wmb() is not needed to guarantee that the cache coherent memory writes
> > +     have completed before writing to the MMIO region.  The cheaper
> > +     writel_relaxed() does not provide this guarantee and must not be used
> > +     here.
> 
> Can we say the same thing for iowrite32() and iowrite32be(). I also see wmb()
> in front of these.

I don't think so. My reading of memory-barriers.txt says that writeX might
expand to outX, and outX is not ordered with respect to other types of
memory.

Will
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Benjamin Herrenschmidt March 27, 2018, 9:24 p.m. UTC | #6
On Tue, 2018-03-27 at 09:46 -0400, Sinan Kaya wrote:
> On 3/27/2018 7:02 AM, Will Deacon wrote:
> > -     See Documentation/DMA-API.txt for more information on consistent memory.
> > +     can see it now has ownership.  Note that, when using writel(), a prior
> > +     wmb() is not needed to guarantee that the cache coherent memory writes
> > +     have completed before writing to the MMIO region.  The cheaper
> > +     writel_relaxed() does not provide this guarantee and must not be used
> > +     here.
> 
> Can we say the same thing for iowrite32() and iowrite32be(). I also see wmb()
> in front of these.

Yes, they should have the same semantics as writel

Cheers,
Ben.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Benjamin Herrenschmidt March 27, 2018, 9:29 p.m. UTC | #7
On Tue, 2018-03-27 at 15:36 +0100, Will Deacon wrote:
> > Can we say the same thing for iowrite32() and iowrite32be(). I also see wmb()
> > in front of these.
> 
> I don't think so. My reading of memory-barriers.txt says that writeX might
> expand to outX, and outX is not ordered with respect to other types of
> memory.

Ugh ?

My understanding of HW at least is the exact opposite. outX is *more*
ordered if anything, than any other accessors. IO space is completely
synchronous, non posted and ordered afaik.

Cheers,
Ben.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Will Deacon March 28, 2018, 8:53 a.m. UTC | #8
On Wed, Mar 28, 2018 at 08:29:45AM +1100, Benjamin Herrenschmidt wrote:
> On Tue, 2018-03-27 at 15:36 +0100, Will Deacon wrote:
> > > Can we say the same thing for iowrite32() and iowrite32be(). I also see wmb()
> > > in front of these.
> > 
> > I don't think so. My reading of memory-barriers.txt says that writeX might
> > expand to outX, and outX is not ordered with respect to other types of
> > memory.
> 
> Ugh ?
> 
> My understanding of HW at least is the exact opposite. outX is *more*
> ordered if anything, than any other accessors. IO space is completely
> synchronous, non posted and ordered afaik.

I'm just going by memory-barriers.txt:


 (*) inX(), outX():

     [...]

     They are guaranteed to be fully ordered with respect to each other.

     They are not guaranteed to be fully ordered with respect to other types of
     memory and I/O operation.


For arm/arm64 these end up behaving exactly the same as readX/writeX, but
I'm nervous about changing the documentation without understanding why it's
like it is currently. Maybe another ia64 thing?.

Will
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Laight March 28, 2018, 9 a.m. UTC | #9
From: Will Deacon
> Sent: 28 March 2018 09:54
...
> > > I don't think so. My reading of memory-barriers.txt says that writeX might
> > > expand to outX, and outX is not ordered with respect to other types of
> > > memory.
> >
> > Ugh ?
> >
> > My understanding of HW at least is the exact opposite. outX is *more*
> > ordered if anything, than any other accessors. IO space is completely
> > synchronous, non posted and ordered afaik.
> 
> I'm just going by memory-barriers.txt:
> 
> 
>  (*) inX(), outX():
> 
>      [...]
> 
>      They are guaranteed to be fully ordered with respect to each other.
> 
>      They are not guaranteed to be fully ordered with respect to other types of
>      memory and I/O operation.

A long time ago there was a document from Intel that said that inb/outb weren't
necessarily synchronised wrt memory accesses.
(Might be P-pro era).
However no processors actually behaved that way and more recent docs
say that inb/outb are fully ordered.

	David

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Will Deacon March 28, 2018, 9:09 a.m. UTC | #10
On Wed, Mar 28, 2018 at 09:00:01AM +0000, David Laight wrote:
> From: Will Deacon
> > Sent: 28 March 2018 09:54
> ...
> > > > I don't think so. My reading of memory-barriers.txt says that writeX might
> > > > expand to outX, and outX is not ordered with respect to other types of
> > > > memory.
> > >
> > > Ugh ?
> > >
> > > My understanding of HW at least is the exact opposite. outX is *more*
> > > ordered if anything, than any other accessors. IO space is completely
> > > synchronous, non posted and ordered afaik.
> > 
> > I'm just going by memory-barriers.txt:
> > 
> > 
> >  (*) inX(), outX():
> > 
> >      [...]
> > 
> >      They are guaranteed to be fully ordered with respect to each other.
> > 
> >      They are not guaranteed to be fully ordered with respect to other types of
> >      memory and I/O operation.
> 
> A long time ago there was a document from Intel that said that inb/outb weren't
> necessarily synchronised wrt memory accesses.
> (Might be P-pro era).
> However no processors actually behaved that way and more recent docs
> say that inb/outb are fully ordered.

Thank you, David! I'll write another patch fixing this up and hopefully
we'll soon have one making writeX/readX much clearer.

Will
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Benjamin Herrenschmidt March 28, 2018, 9:50 a.m. UTC | #11
On Wed, 2018-03-28 at 09:53 +0100, Will Deacon wrote:
> For arm/arm64 these end up behaving exactly the same as readX/writeX, but
> I'm nervous about changing the documentation without understanding why it's
> like it is currently. Maybe another ia64 thing?.

I doubt it ... the Intel ancestry here would make me think they are
completely ordered there too.

powerpc and ARM can't quite make them synchronous I think, but at least
they should have the same semantics as writel.

Cheers,
Ben.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Arnd Bergmann March 28, 2018, 9:55 a.m. UTC | #12
On Wed, Mar 28, 2018 at 11:50 AM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
> On Wed, 2018-03-28 at 09:53 +0100, Will Deacon wrote:
>> For arm/arm64 these end up behaving exactly the same as readX/writeX, but
>> I'm nervous about changing the documentation without understanding why it's
>> like it is currently. Maybe another ia64 thing?.
>
> I doubt it ... the Intel ancestry here would make me think they are
> completely ordered there too.
>
> powerpc and ARM can't quite make them synchronous I think, but at least
> they should have the same semantics as writel.

One thing that ARM does IIRC is that it only guarantees to order writel() within
one device, and the memory mapped PCI I/O space window almost certainly
counts as a separate device to the CPU.

In the absence of an enforced global synchronization during an I/O port
access, that means writel() and outb() can be reordered before they arrive
at a device in theory. Again, this rarely matters in practice, but I think it
makes sense to document the less strict behavior here, given that we have
common hardware that can't provide x86 compatible semantics.

       Arnd
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Benjamin Herrenschmidt March 28, 2018, 9:56 a.m. UTC | #13
On Wed, 2018-03-28 at 10:09 +0100, Will Deacon wrote:
> On Wed, Mar 28, 2018 at 09:00:01AM +0000, David Laight wrote:
> > From: Will Deacon
> > > Sent: 28 March 2018 09:54
> > 
> > ...
> > > > > I don't think so. My reading of memory-barriers.txt says that writeX might
> > > > > expand to outX, and outX is not ordered with respect to other types of
> > > > > memory.
> > > > 
> > > > Ugh ?
> > > > 
> > > > My understanding of HW at least is the exact opposite. outX is *more*
> > > > ordered if anything, than any other accessors. IO space is completely
> > > > synchronous, non posted and ordered afaik.
> > > 
> > > I'm just going by memory-barriers.txt:
> > > 
> > > 
> > >  (*) inX(), outX():
> > > 
> > >      [...]
> > > 
> > >      They are guaranteed to be fully ordered with respect to each other.
> > > 
> > >      They are not guaranteed to be fully ordered with respect to other types of
> > >      memory and I/O operation.
> > 
> > A long time ago there was a document from Intel that said that inb/outb weren't
> > necessarily synchronised wrt memory accesses.
> > (Might be P-pro era).
> > However no processors actually behaved that way and more recent docs
> > say that inb/outb are fully ordered.
> 
> Thank you, David! I'll write another patch fixing this up and hopefully
> we'll soon have one making writeX/readX much clearer.

Thanks for doing the grunt work Will ! :-)

Cheers,
Ben.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Benjamin Herrenschmidt March 28, 2018, 10:01 a.m. UTC | #14
On Wed, 2018-03-28 at 11:55 +0200, Arnd Bergmann wrote:
> > powerpc and ARM can't quite make them synchronous I think, but at least
> > they should have the same semantics as writel.
> 
> One thing that ARM does IIRC is that it only guarantees to order writel() within
> one device, and the memory mapped PCI I/O space window almost certainly
> counts as a separate device to the CPU.

That sounds bogus.

> In the absence of an enforced global synchronization during an I/O port
> access, that means writel() and outb() can be reordered before they arrive
> at a device in theory. Again, this rarely matters in practice, but I think it
> makes sense to document the less strict behavior here, given that we have
> common hardware that can't provide x86 compatible semantics.

Can't you put some kind of super heavy handed barrier in inX/outX ?
These things are never going to be performance sensitive anyway...

Cheers,
Ben.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Will Deacon March 28, 2018, 10:13 a.m. UTC | #15
On Wed, Mar 28, 2018 at 09:01:27PM +1100, Benjamin Herrenschmidt wrote:
> On Wed, 2018-03-28 at 11:55 +0200, Arnd Bergmann wrote:
> > > powerpc and ARM can't quite make them synchronous I think, but at least
> > > they should have the same semantics as writel.
> > 
> > One thing that ARM does IIRC is that it only guarantees to order writel() within
> > one device, and the memory mapped PCI I/O space window almost certainly
> > counts as a separate device to the CPU.
> 
> That sounds bogus.

To elaborate, if you do the following on arm:

	writel(DEVICE_FOO);
	writel(DEVICE_BAR);

we generally cannot guarantee in which order those accesses will hit the
devices even if we add every barrier under the sun. You'd need something
in between, specific to DEVICE_FOO (probably a read-back) to really push
the first write out. This doesn't sound like it would be that uncommon to
me.

On the other hand:

	writel(DEVICE_FOO);
	writel(DEVICE_FOO);

is obviously ordered and also things like:

	writel(DEVICE_FOO_IN_PCI_MEM_SPACE);
	writel(DEVICE_BAR_IN_SAME_PCI_MEM_SPACE);

are ordered up to the PCI host bridge, because that's really the "device"
here.

Will
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe March 28, 2018, 4:57 p.m. UTC | #16
On Wed, Mar 28, 2018 at 11:13:45AM +0100, Will Deacon wrote:
> On Wed, Mar 28, 2018 at 09:01:27PM +1100, Benjamin Herrenschmidt wrote:
> > On Wed, 2018-03-28 at 11:55 +0200, Arnd Bergmann wrote:
> > > > powerpc and ARM can't quite make them synchronous I think, but at least
> > > > they should have the same semantics as writel.
> > > 
> > > One thing that ARM does IIRC is that it only guarantees to order writel() within
> > > one device, and the memory mapped PCI I/O space window almost certainly
> > > counts as a separate device to the CPU.
> > 
> > That sounds bogus.
> 
> To elaborate, if you do the following on arm:
> 
> 	writel(DEVICE_FOO);
> 	writel(DEVICE_BAR);
> 
> we generally cannot guarantee in which order those accesses will hit the
> devices even if we add every barrier under the sun. You'd need something
> in between, specific to DEVICE_FOO (probably a read-back) to really push
> the first write out. This doesn't sound like it would be that uncommon to
> me.

The PCI posted write does not require the above to execute 'in order'
only that any bus segment shared by the two devices have the writes
issued in CPU order. ie at a shared PCI root port for instance.

If I recall this is very similar to the ordering that ARM's on-chip
AXI interconnect is supposed to provide.. So I'd be very surprised if
a modern ARM64 has an meaningful difference from x86 here.

When talking about ordering between the devices, the relevant question
is what happens if the writel(DEVICE_BAR) triggers DEVICE_BAR to DMA
from the DEVICE_FOO. 'ordered' means that in this case
writel(DEVICE_FOO) must be presented to FOO before anything generated
by BAR.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Will Deacon March 29, 2018, 9:19 a.m. UTC | #17
On Wed, Mar 28, 2018 at 10:57:32AM -0600, Jason Gunthorpe wrote:
> On Wed, Mar 28, 2018 at 11:13:45AM +0100, Will Deacon wrote:
> > On Wed, Mar 28, 2018 at 09:01:27PM +1100, Benjamin Herrenschmidt wrote:
> > > On Wed, 2018-03-28 at 11:55 +0200, Arnd Bergmann wrote:
> > > > > powerpc and ARM can't quite make them synchronous I think, but at least
> > > > > they should have the same semantics as writel.
> > > > 
> > > > One thing that ARM does IIRC is that it only guarantees to order writel() within
> > > > one device, and the memory mapped PCI I/O space window almost certainly
> > > > counts as a separate device to the CPU.
> > > 
> > > That sounds bogus.
> > 
> > To elaborate, if you do the following on arm:
> > 
> > 	writel(DEVICE_FOO);
> > 	writel(DEVICE_BAR);
> > 
> > we generally cannot guarantee in which order those accesses will hit the
> > devices even if we add every barrier under the sun. You'd need something
> > in between, specific to DEVICE_FOO (probably a read-back) to really push
> > the first write out. This doesn't sound like it would be that uncommon to
> > me.
> 
> The PCI posted write does not require the above to execute 'in order'
> only that any bus segment shared by the two devices have the writes
> issued in CPU order. ie at a shared PCI root port for instance.
> 
> If I recall this is very similar to the ordering that ARM's on-chip
> AXI interconnect is supposed to provide.. So I'd be very surprised if
> a modern ARM64 has an meaningful difference from x86 here.

From the architectural perspective, writes to different "peripherals" are
not ordered with respect to each other. The first writel will complete once
it gets its write acknowledgement, but this may not necessarily come from
the endpoint -- it could come from an intermediate buffer past the point of
serialisation (i.e. the write will then be ordered with respect to other
accesses to that same endpoint). The PCI root port would look like one
peripheral here.

> When talking about ordering between the devices, the relevant question
> is what happens if the writel(DEVICE_BAR) triggers DEVICE_BAR to DMA
> from the DEVICE_FOO. 'ordered' means that in this case
> writel(DEVICE_FOO) must be presented to FOO before anything generated
> by BAR.

Yes, and that isn't the case for arm because the writes can still be
buffered.

Will
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe March 29, 2018, 2:45 p.m. UTC | #18
On Thu, Mar 29, 2018 at 10:19:41AM +0100, Will Deacon wrote:
> On Wed, Mar 28, 2018 at 10:57:32AM -0600, Jason Gunthorpe wrote:
> > On Wed, Mar 28, 2018 at 11:13:45AM +0100, Will Deacon wrote:
> > > On Wed, Mar 28, 2018 at 09:01:27PM +1100, Benjamin Herrenschmidt wrote:
> > > > On Wed, 2018-03-28 at 11:55 +0200, Arnd Bergmann wrote:
> > > > > > powerpc and ARM can't quite make them synchronous I think, but at least
> > > > > > they should have the same semantics as writel.
> > > > > 
> > > > > One thing that ARM does IIRC is that it only guarantees to order writel() within
> > > > > one device, and the memory mapped PCI I/O space window almost certainly
> > > > > counts as a separate device to the CPU.
> > > > 
> > > > That sounds bogus.
> > > 
> > > To elaborate, if you do the following on arm:
> > > 
> > > 	writel(DEVICE_FOO);
> > > 	writel(DEVICE_BAR);
> > > 
> > > we generally cannot guarantee in which order those accesses will hit the
> > > devices even if we add every barrier under the sun. You'd need something
> > > in between, specific to DEVICE_FOO (probably a read-back) to really push
> > > the first write out. This doesn't sound like it would be that uncommon to
> > > me.
> > 
> > The PCI posted write does not require the above to execute 'in order'
> > only that any bus segment shared by the two devices have the writes
> > issued in CPU order. ie at a shared PCI root port for instance.
> > 
> > If I recall this is very similar to the ordering that ARM's on-chip
> > AXI interconnect is supposed to provide.. So I'd be very surprised if
> > a modern ARM64 has an meaningful difference from x86 here.
> 
> From the architectural perspective, writes to different "peripherals" are
> not ordered with respect to each other. The first writel will complete once
> it gets its write acknowledgement, but this may not necessarily come from
> the endpoint -- it could come from an intermediate buffer past the point of
> serialisation (i.e. the write will then be ordered with respect to other
> accesses to that same endpoint). The PCI root port would look like one
> peripheral here.

That is basically the same as PCI - PCI has no write ACK, so all
writes are buffered by the PCI interconnect and complete in some
undefined temporal order when multiple end points are involved.

This does not seem very different from what happens in x86..

> > When talking about ordering between the devices, the relevant question
> > is what happens if the writel(DEVICE_BAR) triggers DEVICE_BAR to DMA
> > from the DEVICE_FOO. 'ordered' means that in this case
> > writel(DEVICE_FOO) must be presented to FOO before anything generated
> > by BAR.
> 
> Yes, and that isn't the case for arm because the writes can still be
> buffered.

The statement is not about buffering, or temporal completion order, or
the order of acks returning to the CPU. It is about pure transaction
ordering inside the interconnect.

Can write BAR -> FOO pass write CPU -> FOO?

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Laight March 29, 2018, 2:58 p.m. UTC | #19
From: Jason Gunthorpe
> Sent: 29 March 2018 15:45
...
> > > When talking about ordering between the devices, the relevant question
> > > is what happens if the writel(DEVICE_BAR) triggers DEVICE_BAR to DMA
> > > from the DEVICE_FOO. 'ordered' means that in this case
> > > writel(DEVICE_FOO) must be presented to FOO before anything generated
> > > by BAR.
> >
> > Yes, and that isn't the case for arm because the writes can still be
> > buffered.
> 
> The statement is not about buffering, or temporal completion order, or
> the order of acks returning to the CPU. It is about pure transaction
> ordering inside the interconnect.
> 
> Can write BAR -> FOO pass write CPU -> FOO?

Almost certainly.
The first cpu write can almost certainly be 'stalled' at the shared PCIe bridge.
The second cpu write then completes (to a different target).
That target then issues a peer to peer transfer that reaches the shared bridge.
I doubt the order of the transactions is guaranteed when it becomes 'un-stalled'.

Of course, these are peer to peer transfers, and strange ones at that.
Normally you'd not be doing peer to peer transfers that access 'memory'
the cpu has just written to.

Requiring extra barriers in this case, or different functions for WC accesses
shouldn't really be an issue.

Even requiring a barrier between a write to dma coherent memory and a write
that starts dma isn't really onerous.
Even if it is a nop on all current architectures it is a good comment in the code.
It could even have a 'dev' argument.

	David

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe March 29, 2018, 4:40 p.m. UTC | #20
On Thu, Mar 29, 2018 at 02:58:34PM +0000, David Laight wrote:
> From: Jason Gunthorpe
> > Sent: 29 March 2018 15:45
> ...
> > > > When talking about ordering between the devices, the relevant question
> > > > is what happens if the writel(DEVICE_BAR) triggers DEVICE_BAR to DMA
> > > > from the DEVICE_FOO. 'ordered' means that in this case
> > > > writel(DEVICE_FOO) must be presented to FOO before anything generated
> > > > by BAR.
> > >
> > > Yes, and that isn't the case for arm because the writes can still be
> > > buffered.
> > 
> > The statement is not about buffering, or temporal completion order, or
> > the order of acks returning to the CPU. It is about pure transaction
> > ordering inside the interconnect.
> > 
> > Can write BAR -> FOO pass write CPU -> FOO?
> 
> Almost certainly.
> The first cpu write can almost certainly be 'stalled' at the shared PCIe bridge.
> The second cpu write then completes (to a different target).
> That target then issues a peer to peer transfer that reaches the shared bridge.
> I doubt the order of the transactions is guaranteed when it becomes 'un-stalled'.

The PCI spec has very strong wording on ordering that covers this
case. Stalled bridges have to follow the ordering rules, and posted
writes cannot pass other posted writes.

Since in PCI all three transactions:
 CPU -> FOO
 CPU -> BAR
 BAR -> FOO

Must traverse a shared bus segment, they must be placed on that bus in
the above order, and the bridge(s) toward FOO must preserve this
order.

ARM's AXI has similar rules, I just can't recall the tiny details
right now :)

> Of course, these are peer to peer transfers, and strange ones at that.
> Normally you'd not be doing peer to peer transfers that access 'memory'
> the cpu has just written to.

It is the best situation I can think of where order of completion to
different devices would matter to a generic Linux driver..

.. And there are patches circulating right now for NVMe that enable
exactly this kind of transfer, and rely on these kind of semantics, so
it is a relevant detail :)

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index a863009849a3..3247547d1c36 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -1909,9 +1909,6 @@  There are some more advanced barrier functions:
 		/* assign ownership */
 		desc->status = DEVICE_OWN;
 
-		/* force memory to sync before notifying device via MMIO */
-		wmb();
-
 		/* notify device of new descriptors */
 		writel(DESC_NOTIFY, doorbell);
 	}
@@ -1919,11 +1916,15 @@  There are some more advanced barrier functions:
      The dma_rmb() allows us guarantee the device has released ownership
      before we read the data from the descriptor, and the dma_wmb() allows
      us to guarantee the data is written to the descriptor before the device
-     can see it now has ownership.  The wmb() is needed to guarantee that the
-     cache coherent memory writes have completed before attempting a write to
-     the cache incoherent MMIO region.
-
-     See Documentation/DMA-API.txt for more information on consistent memory.
+     can see it now has ownership.  Note that, when using writel(), a prior
+     wmb() is not needed to guarantee that the cache coherent memory writes
+     have completed before writing to the MMIO region.  The cheaper
+     writel_relaxed() does not provide this guarantee and must not be used
+     here.
+
+     See the subsection "Kernel I/O barrier effects" for more information on
+     relaxed I/O accessors and the Documentation/DMA-API.txt file for more
+     information on consistent memory.
 
 
 MMIO WRITE BARRIER