
[RFC] libibverbs: add ARM64 memory barrier macros

Message ID 20160518220302.81260E09E9@smtp.ogc.us (mailing list archive)
State Superseded

Commit Message

Steve Wise May 18, 2016, 9:16 p.m. UTC
The default generic barriers are not correct for ARM64. This results in
data corruption.  The correct macros are lifted from the linux kernel.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
---

I wonder why the linux kernel doesn't export these?  Also, if the hw
platform is unknown, I don't think libibverbs should pick a default
implementation that might cause data corruption.  Rather, I think it
should just fail a compile on that platform.

Thoughts?

---
 include/infiniband/arch.h |    9 +++++++++
 1 files changed, 9 insertions(+), 0 deletions(-)

Comments

Jason Gunthorpe May 18, 2016, 10:28 p.m. UTC | #1
On Wed, May 18, 2016 at 02:16:39PM -0700, Steve Wise wrote:
> The default generic barriers are not correct for ARM64. This results in
> data corruption.  The correct macros are lifted from the linux kernel.
> 
> Signed-off-by: Steve Wise <swise@opengridcomputing.com>
> 
> I wonder why the linux kernel doesn't export these?  Also, if the hw
> platform is unknown, I don't think libibverbs should pick a default
> implementation that might cause data corruption.  Rather, I think it
> should just fail a compile on that platform.

These days, in user space this sort of stuff should be done following
the C11 atomic memory ordering model and not by trying to shoe-horn in
the kernel model. Then the compiler takes care of things properly.

This is using calls like atomic_store, atomic_load and
atomic_thread_fence to create the same sort of barriers.

You could probably implement the Xmbs() with variations on
atomic_thread_fence ??

Jason
Steve Wise May 19, 2016, 4:02 p.m. UTC | #2
> On Wed, May 18, 2016 at 02:16:39PM -0700, Steve Wise wrote:
> > The default generic barriers are not correct for ARM64. This results in
> > data corruption.  The correct macros are lifted from the linux kernel.
> >
> > Signed-off-by: Steve Wise <swise@opengridcomputing.com>
> >
> > I wonder why the linux kernel doesn't export these?  Also, if the hw
> > platform is unknown, I don't think libibverbs should pick a default
> > implementation that might cause data corruption.  Rather, I think it
> > should just fail a compile on that platform.
> 
> These days, in user space this sort of stuff should be done following
> the C11 atomic memory ordering model and not by trying to shoe-horn in
> the kernel model. Then the compiler takes care of things properly.
> 
> This is using calls like atomic_store, atomic_load and
> atomic_thread_fence to create the same sort of barriers.
> 
> You could probably implement the Xmbs() with variations on
> atomic_thread_fence ??

Looking at the documentation, it is not clear to me which parameter value passed
to atomic_thread_fence() maps to each of the mb services.  Is it correct to
think that if I get it right, the resulting assembly should be what we currently
have for the mb services?

Or perhaps someone else can provide guidance?

Steve.

Steve Wise May 19, 2016, 4:35 p.m. UTC | #3
> > These days, in user space this sort of stuff should be done following
> > the C11 atomic memory ordering model and not by trying to shoe-horn in
> > the kernel model. Then the compiler takes care of things properly.
> >
> > This is using calls like atomic_store, atomic_load and
> > atomic_thread_fence to create the same sort of barriers.
> >
> > You could probably implement the Xmbs() with variations on
> > atomic_thread_fence ??
> 
> Looking at the documentation, it is not clear to me which parameter value passed
> to atomic_thread_fence() maps to each of the mb services.  Is it correct to
> think that if I get it right, the resulting assembly should be what we currently
> have for the mb services?
> 
> Or perhaps someone else can provide guidance?
> 

I created a simple program to call __atomic_thread_fence() with the various
parameters, using the x64 architecture.  Only __ATOMIC_SEQ_CST seems to add any
code (mfence) when the binary is inspected with 'objdump -lS'.  The rest are
no-ops.  I'm not sure this will work for libibverbs.

Here is the program, compiled with -g -O0:

-----
[root@stevo3 ~]# cc -g -O0  atomic_thread_fence.c   -o atomic_thread_fence
[root@stevo3 ~]# cat atomic_thread_fence.c
#include <stdio.h>

#define mb() __atomic_thread_fence(__ATOMIC_SEQ_CST)

int main(int argc, char *argv[])
{
        int a;

        a = 1;
        printf("ATOMIC_RELAXED\n");
        __atomic_thread_fence(__ATOMIC_RELAXED);
        a = 2;
        printf("ATOMIC_CONSUME\n");
        __atomic_thread_fence(__ATOMIC_CONSUME);
        a = 3;
        printf("ATOMIC_RELEASE\n");
        __atomic_thread_fence(__ATOMIC_RELEASE);
        a = 4;
        printf("ATOMIC_ACQ_REL\n");
        __atomic_thread_fence(__ATOMIC_ACQ_REL);
        a = 5;
        printf("ATOMIC_SEQ_CST\n");
        __atomic_thread_fence(__ATOMIC_SEQ_CST);
        printf("a=%d\n", a);
}
-----

Here is the objdump snippet:

-----
int main(int argc, char *argv[])
{
  400580:       55                      push   %rbp
  400581:       48 89 e5                mov    %rsp,%rbp
  400584:       48 83 ec 20             sub    $0x20,%rsp
  400588:       89 7d ec                mov    %edi,-0x14(%rbp)
  40058b:       48 89 75 e0             mov    %rsi,-0x20(%rbp)
/root/atomic_thread_fence.c:9
        int a;

        a = 1;
  40058f:       c7 45 fc 01 00 00 00    movl   $0x1,-0x4(%rbp)
/root/atomic_thread_fence.c:10
        printf("ATOMIC_RELAXED\n");
  400596:       bf 90 06 40 00          mov    $0x400690,%edi
  40059b:       e8 b0 fe ff ff          callq  400450 <puts@plt>
/root/atomic_thread_fence.c:12
        __atomic_thread_fence(__ATOMIC_RELAXED);
        a = 2;
  4005a0:       c7 45 fc 02 00 00 00    movl   $0x2,-0x4(%rbp)
/root/atomic_thread_fence.c:13
        printf("ATOMIC_CONSUME\n");
  4005a7:       bf 9f 06 40 00          mov    $0x40069f,%edi
  4005ac:       e8 9f fe ff ff          callq  400450 <puts@plt>
/root/atomic_thread_fence.c:15
        __atomic_thread_fence(__ATOMIC_CONSUME);
        a = 3;
  4005b1:       c7 45 fc 03 00 00 00    movl   $0x3,-0x4(%rbp)
/root/atomic_thread_fence.c:16
        printf("ATOMIC_RELEASE\n");
  4005b8:       bf ae 06 40 00          mov    $0x4006ae,%edi
  4005bd:       e8 8e fe ff ff          callq  400450 <puts@plt>
/root/atomic_thread_fence.c:18
        __atomic_thread_fence(__ATOMIC_RELEASE);
        a = 4;
  4005c2:       c7 45 fc 04 00 00 00    movl   $0x4,-0x4(%rbp)
/root/atomic_thread_fence.c:19
        printf("ATOMIC_ACQ_REL\n");
  4005c9:       bf bd 06 40 00          mov    $0x4006bd,%edi
  4005ce:       e8 7d fe ff ff          callq  400450 <puts@plt>
/root/atomic_thread_fence.c:21
        __atomic_thread_fence(__ATOMIC_ACQ_REL);
        a = 5;
  4005d3:       c7 45 fc 05 00 00 00    movl   $0x5,-0x4(%rbp)
/root/atomic_thread_fence.c:22
        printf("ATOMIC_SEQ_CST\n");
  4005da:       bf cc 06 40 00          mov    $0x4006cc,%edi
  4005df:       e8 6c fe ff ff          callq  400450 <puts@plt>
/root/atomic_thread_fence.c:23
        __atomic_thread_fence(__ATOMIC_SEQ_CST);
  4005e4:       0f ae f0                mfence
/root/atomic_thread_fence.c:24
        printf("a=%d\n", a);
  4005e7:       8b 45 fc                mov    -0x4(%rbp),%eax
  4005ea:       89 c6                   mov    %eax,%esi
  4005ec:       bf db 06 40 00          mov    $0x4006db,%edi
  4005f1:       b8 00 00 00 00          mov    $0x0,%eax
  4005f6:       e8 65 fe ff ff          callq  400460 <printf@plt>
/root/atomic_thread_fence.c:25
}
  4005fb:       c9                      leaveq
  4005fc:       c3                      retq
  4005fd:       0f 1f 00                nopl   (%rax)
-----


Jason Gunthorpe May 19, 2016, 6:05 p.m. UTC | #4
On Thu, May 19, 2016 at 11:02:36AM -0500, Steve Wise wrote:

> Looking at the documentation, it is not clear to me which parameter value passed
> to atomic_thread_fence() maps to each of the mb services.  Is it correct to
> think that if I get it right, the resulting assembly should be what we currently
> have for the mb services?

You can review

https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html

Hurm.

So, I'd say libibverbs is kinda broken here, as it is using a model
that is different from the kernel model. Sigh, it actually looks like
kinda a mess because provider libraries are abusing the primitives for
lots of different things. :(

This is the mapping that matches the definitions of the barriers:

 wmb can be memory_order_release
 rmb can be memory_order_acquire
 mb can be memory_order_seq_cst

On x86 this is similar to what we have today, but slightly less
explicit than the kernel model. Seems OK.

On ARMv8 this would be DMB, slightly weaker than the
kernel's DSB (my read suggests this is OK for PCI vs system memory
on coherent ARM?)

PPC will be lwsync not sync (not OK)

ia64 will both use mf for mb, but different otherwise (not OK)

No idea about the others.

C11 atomics are only specified for system memory to system memory;
they are undefined when working with device MMIO memory - this is why
some of the above are 'not OK'.

You could provide the above as the new default, maybe even use it on
some of the archs, like armv8/x86 if you think the dmb is OK.

An even better suggestion is to try and clean up some of this mess by
being more explicit:

 /* Guarantee all system memory writes are visible to the device
    before value's write is seen at the device */
 mmio_barrier_write32(void *ptr, uint32_t value);

Most everything else can then safely use C11 atomics as it is not
working with device memory.

But that is too hard with all our providers sprinkled around in so
many git trees :(

Jason
Steve Wise May 19, 2016, 6:54 p.m. UTC | #5
> -----Original Message-----
> From: linux-rdma-owner@vger.kernel.org [mailto:linux-rdma-
> owner@vger.kernel.org] On Behalf Of Jason Gunthorpe
> Sent: Thursday, May 19, 2016 1:06 PM
> To: Steve Wise
> Cc: dledford@redhat.com; linux-rdma@vger.kernel.org
> Subject: Re: [PATCH RFC] libibverbs: add ARM64 memory barrier macros
> 
> On Thu, May 19, 2016 at 11:02:36AM -0500, Steve Wise wrote:
> 
> > Looking at the documentation, it is not clear to me which parameter value passed
> > to atomic_thread_fence() maps to each of the mb services.  Is it correct to
> > think that if I get it right, the resulting assembly should be what we currently
> > have for the mb services?
> 
> You can review
> 
> https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html
> 
> Hurm.
> 
> So, I'd say libibverbs is kinda broken here, as it is using a model
> that is different from the kernel model. Sigh, it actually looks like
> kinda a mess because provider libraries are abusing the primitives for
> lots of different things. :(

How so?

> 
> This is the mapping that matches the definitions of the barriers:
> 
>  wmb can be memory_order_release
>  rmb can be memory_order_acquire

From my compile experiments the above two turned out to be no-ops for x64.  Is
that correct (assuming system memory)?

>  mb can be memory_order_seq_cst
>

What about the wc_wmb?
 
> On x86 this is similar to what we have today, but slightly less
> explicit than the kernel model. Seems OK.
> 
> On ARMv8 this would be DMB, slightly weaker than the
> kernel's DSB (my read suggests this is OK for PCI vs system memory
> on coherent ARM?)
> 
> PPC will be lwsync not sync (not OK)
> 
> ia64 will both use mf for mb, but different otherwise (not OK)
> 
> No idea about the others.
> 
> C11 atomics are only specified for system memory to system memory;
> they are undefined when working with device MMIO memory - this is why
> some of the above are 'not OK'.
> 
> You could provide the above as the new default, maybe even use it on
> some of the archs, like armv8/x86 if you think the dmb is OK.
> 
> An even better suggestion is to try and clean up some of this mess by
> being more explicit:
> 
>  /* Guarantee all system memory writes are visible to the device
>     before value's write is seen at the device */
>  mmio_barrier_write32(void *ptr, uint32_t value);
> 

How would the above be implemented?

> Most everything else can then safely use C11 atomics as it is not
> working with device memory.
> 

One case I wonder about is the write-combining path.   Some devices provide a
per-QP "slot" of device memory that can be used to send down small work
requests.  The host copies the work request from host memory into that device
memory hoping the CPU will do write combining and make the entire, say 64B, work
request a single PCIE transaction/cmd.   HW triggers on this transaction and
thus no db write is needed and the hw doesn't need to fetch the WR from host
memory.  Currently we use wc_wmb() to fence this, but it seems that will need
some mmio_barrier() operation as well.

> But that is too hard with all our providers sprinkled around in so
> many git trees :(
>

Changing all this scares me. :)  The barriers in the provider libs have been
very carefully placed, most likely as a result of seeing barrier problems
under load.   I would hate to break any of this in the
name of "cleaning it up".  Also, the code in question is in the data-transfer
fast paths, and adding unneeded barriers would be bad too...

I think adding support for arm64 should be orthogonal to this cleanup.  The
barriers I added in the RFC patch indeed resolved a data corruption issue and we
would like to see them in libibverbs soon.

Steve.
 

Jason Gunthorpe May 19, 2016, 7:28 p.m. UTC | #6
On Thu, May 19, 2016 at 01:54:33PM -0500, Steve Wise wrote:
> > So, I'd say libibverbs is kinda broken here, as it is using a model
> > that is different from the kernel model. Sigh, it actually looks like
> > kinda a mess because provider libraries are abusing the primitives for
> > lots of different things. :(
> 
> How so?

The kernel model uses very heavyweight barriers for wmb, rmb and mb, and
then provides a host of weaker barrier options.

verbs is the opposite on x86: wmb is weak and wc_wmb is the stronger
version.

> >  wmb can be memory_order_release
> >  rmb can be memory_order_acquire
> 
> From my compile experiments the above two turned out to be no-ops for x64.  Is
> that correct (assuming system memory)?

Certainly yes for system memory.

For system mem vs device mmio? No. C11 does not contemplate that case.
On x86 these are the same things (eg libibverbs uses nop for
barriers), on other arches they are not.

It isn't clear to me if the much stronger kernel barriers are needed
for system vs device. On x86 those might only be needed for really old
CPUs, or maybe libibverbs is just totally wrong here... I don't know.

Same question for arm on dsb vs dmb.

> >  mb can be memory_order_seq_cst
 
> What about the wc_wmb?

wc_wmb seems to be used as some kind of flushing operation to get data
out of the write-combining write buffer promptly. Terrible name.

> >  /* Guarantee all system memory writes are visible to the device
> >     before value's write is seen at the device */
> >  mmio_barrier_write32(void *ptr, uint32_t value);

> How would the above be implemented?

It would have to use the correct asm for each arch. But at least we
can now speak clearly about what the expected behavior is and identify
the correct asm for each arch instead of this crazy catch-all that wmb
has become for verbs.

> > Most everything else can then safely use C11 atomics as it is not
> > working with device memory.
> 
> One case I wonder about is the write-combining path.   Some devices provide a
> per-QP "slot" of device memory that can be used to send down small work
> requests.  The host copies the work request from host memory into that device
> memory hoping the CPU will do write combining and make the entire, say 64B, work
> request a single PCIE transaction/cmd.

Yes, we probably should have had a 'memcpy to wc' verbs helper that
did the best possible wc copy and prompt flush rather than the mess
with wc_wmb and open coded memcpy. IIRC there is some way to do this
with xmm non temporal cache line stores that is even better???

> thus no db write is needed and the hw doesn't need to fetch the WR from host
> memory.  Currently we use wc_wmb() to fence this, but it seems that will need
> some mmio_barrier() operation as well.

This isn't a fence, it is a flush. It is needed to get the data out of
the write buffer promptly - intel is always strongly ordered so no
fencing is needed here.

> Changing all this scares me. :)  The barriers in the provider libs have been
> very carefully placed, most likely as a result of

You should probably just go ahead with your patch, it is a bigger mess
than I imagined at first :(

> seeing barrier problems under load.   I would hate to break any of this in the
> name of "cleaning it up".  Also, the code in question is in the data-transfer
> fast paths, and adding unneeded barriers would be bad too...

Well, therein is the problem.

It isn't clear what wmb is supposed to do, and it isn't the same as
the kernel wmb. So what should you implement for ARMv8? The kernel
version will work, but it is overkill in many places.

The truth is, none of these wmbs really do anything on x86 except
cause a compiler fence. Other arches are different, particularly ARM &
PPC tend to need correct fencing or it just doesn't work (as you
noticed).

Due to the mess this is in, I'd have a healthy skepticism about this.
x86 tests none of this!

If you want to optimize ARM then you need to use the correct, weaker
barriers when appropriate and not just use wmb as a big hammer.

Jason
Bart Van Assche May 19, 2016, 8:14 p.m. UTC | #7
On 05/19/2016 11:05 AM, Jason Gunthorpe wrote:
> This is the mapping that matches the definitions of the barriers:
>
>   wmb can be memory_order_release
>   rmb can be memory_order_acquire
>   mb can be memory_order_seq_cst

This is not correct. Acquire and release, at least as defined by
Gharachorloo in 1991, have a different meaning from the mentioned barrier
instructions. Acquire and release are labels that can be applied to load
and store instructions respectively. Acquire semantics prevent 
reordering of a load-acquire instruction with any read or write 
operation which follows it in program order. Release semantics prevent 
reordering of a store-release with any read or write operation which 
precedes it in program order. Figure 2 in 
https://courses.engr.illinois.edu/cs533/sp2012/reading_list/2b.pdf 
illustrates this clearly.

I have not yet analyzed the meaning of acquire and release labels in the 
Itanium instruction set nor in the C11 memory consistency model closely 
but I think that acquire and release have the same meaning in these 
contexts as what has been explained above.

Bart.
Jason Gunthorpe May 19, 2016, 8:57 p.m. UTC | #8
On Thu, May 19, 2016 at 01:14:59PM -0700, Bart Van Assche wrote:
> On 05/19/2016 11:05 AM, Jason Gunthorpe wrote:
> >This is the mapping that matches the definitions of the barriers:
> >
> >  wmb can be memory_order_release
> >  rmb can be memory_order_acquire
> >  mb can be memory_order_seq_cst
> 
> This is not correct. Acquire and release, at least as defined by
> Gharachorloo in 1991, have another meaning than the mentioned barrier
> instructions. Acquire and release are labels that can be applied to load and
> store instructions respectively.

C11 uses a common language for the coherency requirement in multiple
contexts:
- Load/store
- fence (ie a barrier)
- exchanges
- bit set/test

In a load/store context the ordering is associated with a single
memory op, like your paper describes. In the fence context there is no
memory op; the ordering acts on all memory ops.

Jason
Gabriele Svelto May 20, 2016, 9:44 a.m. UTC | #9
On 19/05/2016 21:28, Jason Gunthorpe wrote:
> Yes, we probably should have had a 'memcpy to wc' verbs helper that
> did the best possible wc copy and prompt flush rather than the mess
> with wc_wmb and open coded memcpy. IIRC there is some way to do this
> with xmm non temporal cache line stores that is even better???

Non-temporal stores will go straight to memory IIRC. With many modern
HCAs being able to read data directly out of the last-level cache of the
processor, they could end up being a suboptimal choice.

 Gabriele
Jason Gunthorpe May 24, 2016, 5:22 p.m. UTC | #10
On Fri, May 20, 2016 at 11:44:14AM +0200, Gabriele Svelto wrote:
> On 19/05/2016 21:28, Jason Gunthorpe wrote:
> > Yes, we probably should have had a 'memcpy to wc' verbs helper that
> > did the best possible wc copy and prompt flush rather than the mess
> > with wc_wmb and open coded memcpy. IIRC there is some way to do this
> > with xmm non temporal cache line stores that is even better???
> 
> Non-temporal stores will go straight to memory IIRC. With many modern
> HCAs being able to read data directly out of the last-level cache of the
> processor, they could end up being a suboptimal choice.

We are not talking about a memory-backed write for DMA, but about batching an
MMIO write into a 64-byte burst. Non-temporal stores would seem to be
the best way to do that?

Jason

Patch

diff --git a/include/infiniband/arch.h b/include/infiniband/arch.h
index bc1738a..71f02f8 100644
--- a/include/infiniband/arch.h
+++ b/include/infiniband/arch.h
@@ -122,6 +122,15 @@  static inline uint64_t ntohll(uint64_t x) { return x; }
 #define wmb()	mb()					/* for s390x */
 #define wc_wmb() wmb()					/* for s390x */
 
+#elif defined(__aarch64__)
+
+#define dsb(opt)	asm volatile("dsb " #opt : : : "memory")
+
+#define mb()	dsb(sy)
+#define rmb()	dsb(ld)
+#define wmb()	dsb(st)
+#define wc_wmb() wmb()
+
 #else
 
 #warning No architecture specific defines found.  Using generic implementation.