Message ID | 20160518220302.81260E09E9@smtp.ogc.us (mailing list archive)
---|---
State | Superseded
On Wed, May 18, 2016 at 02:16:39PM -0700, Steve Wise wrote:
> The default generic barriers are not correct for ARM64. This results in
> data corruption. The correct macros are lifted from the linux kernel.
>
> Signed-off-by: Steve Wise <swise@opengridcomputing.com>
>
> I wonder why the linux kernel doesn't export these? Also, if the hw
> platform is unknown, I don't think libibverbs should pick a default
> implementation that might cause data corruption. Rather, I think it
> should just fail a compile on that platform.

These days, in user space this sort of stuff should be done following
the C11 atomic memory ordering model and not by trying to shoe-horn in
the kernel model. Then the compiler takes care of things properly.

This means using calls like atomic_store, atomic_load and
atomic_thread_fence to create the same sort of barriers.

You could probably implement the Xmbs() with variations on
atomic_thread_fence ??

Jason
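A minimal sketch of that idea, assuming C11 <stdatomic.h> (the exact
memory_order choices are worked out further down the thread):

-----
#include <stdatomic.h>

/* Hypothetical fence-based barriers -- illustrative only, not the
 * current libibverbs definitions.  C11 fences order access to normal
 * system memory; whether they are strong enough for MMIO is the open
 * question in this thread. */
#define mb()  atomic_thread_fence(memory_order_seq_cst)
#define rmb() atomic_thread_fence(memory_order_acquire)
#define wmb() atomic_thread_fence(memory_order_release)
-----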
> On Wed, May 18, 2016 at 02:16:39PM -0700, Steve Wise wrote:
> > The default generic barriers are not correct for ARM64. This results in
> > data corruption. The correct macros are lifted from the linux kernel.
> >
> > Signed-off-by: Steve Wise <swise@opengridcomputing.com>
> >
> > I wonder why the linux kernel doesn't export these? Also, if the hw
> > platform is unknown, I don't think libibverbs should pick a default
> > implementation that might cause data corruption. Rather, I think it
> > should just fail a compile on that platform.
>
> These days, in user space this sort of stuff should be done following
> the C11 atomic memory ordering model and not by trying to shoe-horn in
> the kernel model. Then the compiler takes care of things properly.
>
> This means using calls like atomic_store, atomic_load and
> atomic_thread_fence to create the same sort of barriers.
>
> You could probably implement the Xmbs() with variations on
> atomic_thread_fence ??

Looking at the documentation, it is not clear to me which parameter value passed
to atomic_thread_fence() maps to each of the mb services. Is it correct to
think that if I get it right, the resulting assembly should be what we currently
have for the mb services?

Or perhaps someone else can provide guidance?

Steve.
> > These days, in user space this sort of stuff should be done following
> > the C11 atomic memory ordering model and not by trying to shoe-horn in
> > the kernel model. Then the compiler takes care of things properly.
> >
> > This means using calls like atomic_store, atomic_load and
> > atomic_thread_fence to create the same sort of barriers.
> >
> > You could probably implement the Xmbs() with variations on
> > atomic_thread_fence ??
>
> Looking at the documentation, it is not clear to me which parameter value passed
> to atomic_thread_fence() maps to each of the mb services. Is it correct to
> think that if I get it right, the resulting assembly should be what we currently
> have for the mb services?
>
> Or perhaps someone else can provide guidance?

I created a simple program to call __atomic_thread_fence() with the various
parameters, using the x64 architecture. Only __ATOMIC_SEQ_CST seems to add
any code (mfence) when the binary is inspected with 'objdump -lS'. The rest
are no-ops. I'm not sure this will work for libibverbs.

Here is the program, compiled with -g -O0:

-----
[root@stevo3 ~]# cc -g -O0 atomic_thread_fence.c -o atomic_thread_fence
[root@stevo3 ~]# cat atomic_thread_fence.c
#include <stdio.h>

#define mb() __atomic_thread_fence(__ATOMIC_SEQ_CST)

int main(int argc, char *argv[])
{
	int a;

	a = 1;
	printf("ATOMIC_RELAXED\n");
	__atomic_thread_fence(__ATOMIC_RELAXED);
	a = 2;
	printf("ATOMIC_CONSUME\n");
	__atomic_thread_fence(__ATOMIC_CONSUME);
	a = 3;
	printf("ATOMIC_RELEASE\n");
	__atomic_thread_fence(__ATOMIC_RELEASE);
	a = 4;
	printf("ATOMIC_ACQ_REL\n");
	__atomic_thread_fence(__ATOMIC_ACQ_REL);
	a = 5;
	printf("ATOMIC_SEQ_CST\n");
	__atomic_thread_fence(__ATOMIC_SEQ_CST);
	printf("a=%d\n", a);
}
-----

Here is the objdump snippet:

-----
int main(int argc, char *argv[])
{
  400580:	55                   	push   %rbp
  400581:	48 89 e5             	mov    %rsp,%rbp
  400584:	48 83 ec 20          	sub    $0x20,%rsp
  400588:	89 7d ec             	mov    %edi,-0x14(%rbp)
  40058b:	48 89 75 e0          	mov    %rsi,-0x20(%rbp)
/root/atomic_thread_fence.c:9
	int a;

	a = 1;
  40058f:	c7 45 fc 01 00 00 00 	movl   $0x1,-0x4(%rbp)
/root/atomic_thread_fence.c:10
	printf("ATOMIC_RELAXED\n");
  400596:	bf 90 06 40 00       	mov    $0x400690,%edi
  40059b:	e8 b0 fe ff ff       	callq  400450 <puts@plt>
/root/atomic_thread_fence.c:12
	__atomic_thread_fence(__ATOMIC_RELAXED);
	a = 2;
  4005a0:	c7 45 fc 02 00 00 00 	movl   $0x2,-0x4(%rbp)
/root/atomic_thread_fence.c:13
	printf("ATOMIC_CONSUME\n");
  4005a7:	bf 9f 06 40 00       	mov    $0x40069f,%edi
  4005ac:	e8 9f fe ff ff       	callq  400450 <puts@plt>
/root/atomic_thread_fence.c:15
	__atomic_thread_fence(__ATOMIC_CONSUME);
	a = 3;
  4005b1:	c7 45 fc 03 00 00 00 	movl   $0x3,-0x4(%rbp)
/root/atomic_thread_fence.c:16
	printf("ATOMIC_RELEASE\n");
  4005b8:	bf ae 06 40 00       	mov    $0x4006ae,%edi
  4005bd:	e8 8e fe ff ff       	callq  400450 <puts@plt>
/root/atomic_thread_fence.c:18
	__atomic_thread_fence(__ATOMIC_RELEASE);
	a = 4;
  4005c2:	c7 45 fc 04 00 00 00 	movl   $0x4,-0x4(%rbp)
/root/atomic_thread_fence.c:19
	printf("ATOMIC_ACQ_REL\n");
  4005c9:	bf bd 06 40 00       	mov    $0x4006bd,%edi
  4005ce:	e8 7d fe ff ff       	callq  400450 <puts@plt>
/root/atomic_thread_fence.c:21
	__atomic_thread_fence(__ATOMIC_ACQ_REL);
	a = 5;
  4005d3:	c7 45 fc 05 00 00 00 	movl   $0x5,-0x4(%rbp)
/root/atomic_thread_fence.c:22
	printf("ATOMIC_SEQ_CST\n");
  4005da:	bf cc 06 40 00       	mov    $0x4006cc,%edi
  4005df:	e8 6c fe ff ff       	callq  400450 <puts@plt>
/root/atomic_thread_fence.c:23
	__atomic_thread_fence(__ATOMIC_SEQ_CST);
  4005e4:	0f ae f0             	mfence
/root/atomic_thread_fence.c:24
	printf("a=%d\n", a);
  4005e7:	8b 45 fc             	mov    -0x4(%rbp),%eax
  4005ea:	89 c6                	mov    %eax,%esi
  4005ec:	bf db 06 40 00       	mov    $0x4006db,%edi
  4005f1:	b8 00 00 00 00       	mov    $0x0,%eax
  4005f6:	e8 65 fe ff ff       	callq  400460 <printf@plt>
/root/atomic_thread_fence.c:25
}
  4005fb:	c9                   	leaveq
  4005fc:	c3                   	retq
  4005fd:	0f 1f 00             	nopl   (%rax)
-----
On Thu, May 19, 2016 at 11:02:36AM -0500, Steve Wise wrote:
> Looking at the documentation, it is not clear to me which parameter value passed
> to atomic_thread_fence() maps to each of the mb services. Is it correct to
> think that if I get it right, the resulting assembly should be what we currently
> have for the mb services?

You can review

  https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html

Hurm.

So, I'd say libibverbs is kinda broken here, as it is using a model
that is different from the kernel model. Sigh, it actually looks like
kinda a mess because provider libraries are abusing the primitives for
lots of different things. :(

This is the mapping that matches the definitions of the barriers:

  wmb can be memory_order_release
  rmb can be memory_order_acquire
  mb  can be memory_order_seq_cst

On x86 this is similar to what we have today, but slightly less
explicit than the kernel model. Seems OK.

On ARMv8 this would be DMB, slightly weaker than the kernel's DSB.
(My read suggests this is OK for PCI vs system memory on coherent
ARM?)

PPC will be lwsync, not sync (not OK).

ia64 will use mf for mb in both models, but differs otherwise (not OK).

No idea about the others.

C11 atomics are only specified for system memory to system memory;
they are undefined when working with device MMIO memory - this is why
some of the above are 'not OK'.

You could provide the above as the new default, maybe even use it on
some of the archs, like armv8/x86, if you think the dmb is OK.

An even better suggestion is to try and clean up some of this mess by
being more explicit:

  /* Guarantee all system memory writes are visible by the device
     before value's write is seen at the device */
  mmio_barrier_write32(void *ptr, uint32_t value);

Most everything else can then safely use C11 atomics as it is not
working with device memory.

But that is too hard with all our providers sprinkled around in so
many git trees :(

Jason
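One possible shape for such a helper, sketched with arch-specific asm
since C11 fences do not cover device memory (hypothetical code,
untested; barrier choices per arch are assumptions, not settled facts):

-----
#include <stdint.h>

/* Hypothetical sketch, not an existing libibverbs API: make all prior
 * system memory writes visible to the device before the MMIO write of
 * 'value' is seen at the device. */
static inline void mmio_barrier_write32(void *ptr, uint32_t value)
{
#if defined(__x86_64__)
	/* x86 stores are strongly ordered; sfence is conservative and
	 * also drains any write-combining buffers */
	asm volatile("sfence" ::: "memory");
#elif defined(__aarch64__)
	/* outer-shareable store barrier; dsb(st) would be the stronger
	 * kernel-style choice */
	asm volatile("dmb oshst" ::: "memory");
#elif defined(__powerpc__) || defined(__powerpc64__)
	/* heavyweight sync, not lwsync, since device memory is involved */
	asm volatile("sync" ::: "memory");
#else
#error "no mmio_barrier_write32 for this architecture"
#endif
	*(volatile uint32_t *)ptr = value;
}
-----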
> -----Original Message-----
> From: linux-rdma-owner@vger.kernel.org [mailto:linux-rdma-
> owner@vger.kernel.org] On Behalf Of Jason Gunthorpe
> Sent: Thursday, May 19, 2016 1:06 PM
> To: Steve Wise
> Cc: dledford@redhat.com; linux-rdma@vger.kernel.org
> Subject: Re: [PATCH RFC] libibverbs: add ARM64 memory barrier macros
>
> So, I'd say libibverbs is kinda broken here, as it is using a model
> that is different from the kernel model. Sigh, it actually looks like
> kinda a mess because provider libraries are abusing the primitives for
> lots of different things. :(

How so?

> This is the mapping that matches the definitions of the barriers:
>
>   wmb can be memory_order_release
>   rmb can be memory_order_acquire

From my compile experiments, the above two turned out to be no-ops for
x64. Is that correct (assuming system memory)?

>   mb  can be memory_order_seq_cst

What about wc_wmb?

> On x86 this is similar to what we have today, but slightly less
> explicit than the kernel model. Seems OK.
>
> On ARMv8 this would be DMB, slightly weaker than the kernel's DSB.
> (My read suggests this is OK for PCI vs system memory on coherent
> ARM?)
>
> PPC will be lwsync, not sync (not OK).
>
> ia64 will use mf for mb in both models, but differs otherwise (not OK).
>
> No idea about the others.
>
> C11 atomics are only specified for system memory to system memory;
> they are undefined when working with device MMIO memory - this is why
> some of the above are 'not OK'.
>
> You could provide the above as the new default, maybe even use it on
> some of the archs, like armv8/x86, if you think the dmb is OK.
>
> An even better suggestion is to try and clean up some of this mess by
> being more explicit:
>
>   /* Guarantee all system memory writes are visible by the device
>      before value's write is seen at the device */
>   mmio_barrier_write32(void *ptr, uint32_t value);

How would the above be implemented?

> Most everything else can then safely use C11 atomics as it is not
> working with device memory.

One case I wonder about is the write-combining path. Some devices provide
a per-QP "slot" of device memory that can be used to send down small work
requests. The host copies the work request from host memory into that
device memory, hoping the CPU will do write combining and make the entire,
say 64B, work request a single PCIe transaction/command. HW triggers on
this transaction, so no doorbell write is needed and the hw doesn't need
to fetch the WR from host memory. Currently we use wc_wmb() to fence this,
but it seems that will need some mmio_barrier() operation as well.

> But that is too hard with all our providers sprinkled around in so
> many git trees :(

Changing all this scares me. :) The barriers in the provider libs have
been very carefully placed, most likely as the result of seeing barrier
problems under load. I would hate to break any of this in the name of
"cleaning it up". Also, the code in question is in the data-transfer
fast paths, and adding unneeded barriers would be bad too...

I think adding support for arm64 should be orthogonal to this cleanup.
The barriers I added in the RFC patch indeed resolved a data corruption
issue, and we would like to see them in libibverbs soon.

Steve.
On Thu, May 19, 2016 at 01:54:33PM -0500, Steve Wise wrote:
> > So, I'd say libibverbs is kinda broken here, as it is using a model
> > that is different from the kernel model. Sigh, it actually looks like
> > kinda a mess because provider libraries are abusing the primitives for
> > lots of different things. :(
>
> How so?

The kernel model uses very heavy weight barriers for wmb, rmb, mb and
then provides a host of weaker barrier options. verbs is the opposite:
on x86, wmb is weak and wc_wmb is the stronger version.

> >   wmb can be memory_order_release
> >   rmb can be memory_order_acquire
>
> From my compile experiments, the above two turned out to be no-ops for
> x64. Is that correct (assuming system memory)?

Certainly yes for system memory. For system mem vs device mmio? No.
C11 does not contemplate that case.

On x86 these are the same things (eg libibverbs uses nop for barriers),
on other arches they are not.

It isn't clear to me if the much stronger kernel barriers are needed
for system vs device. On x86 those might only be needed for really old
CPUs, or maybe libibverbs is just totally wrong here.. I don't know.
Same question for arm on dsb vs dmb.

> >   mb  can be memory_order_seq_cst
>
> What about wc_wmb?

wc_wmb seems to be used as some kind of flushing operation to get data
out of the write combining write buffer promptly. Terrible name.

> >   /* Guarantee all system memory writes are visible by the device
> >      before value's write is seen at the device */
> >   mmio_barrier_write32(void *ptr, uint32_t value);
>
> How would the above be implemented?

It would have to use the correct asm for each arch. But at least we
can now speak clearly about what the expected behavior is and identify
the correct asm for each arch, instead of this crazy catch-all that
wmb has become for verbs.

> > Most everything else can then safely use C11 atomics as it is not
> > working with device memory.
>
> One case I wonder about is the write-combining path. Some devices provide
> a per-QP "slot" of device memory that can be used to send down small work
> requests. The host copies the work request from host memory into that
> device memory, hoping the CPU will do write combining and make the entire,
> say 64B, work request a single PCIe transaction/command.

Yes, we probably should have had a 'memcpy to wc' verbs helper that
did the best possible wc copy and prompt flush, rather than the mess
with wc_wmb and open coded memcpy. IIRC there is some way to do this
with xmm non temporal cache line stores that is even better???

> HW triggers on this transaction, so no doorbell write is needed and
> the hw doesn't need to fetch the WR from host memory. Currently we use
> wc_wmb() to fence this, but it seems that will need some
> mmio_barrier() operation as well.

This isn't a fence, it is a flush. It is needed to get the data out of
the write buffer promptly - intel is always strongly ordered so no
fencing is needed here.

> Changing all this scares me. :) The barriers in the provider libs have
> been very carefully placed, most likely as the result of seeing barrier
> problems under load.

You should probably just go ahead with your patch, it is a bigger mess
than I imagined at first :(

> I would hate to break any of this in the name of "cleaning it up".
> Also, the code in question is in the data-transfer fast paths, and
> adding unneeded barriers would be bad too...

Well, therein is the problem. It isn't clear what wmb is supposed to
do, and it isn't the same as the kernel wmb. So what should you
implement for ARMv8?

The kernel version will work, but it is overkill in many places. The
truth is, none of these wmbs really do anything on x86 except cause a
compiler fence. Other arches are different; particularly ARM & PPC
tend to need correct fencing or it just doesn't work (as you noticed).

Due to the mess this is in, I'd have a healthy skepticism about
this. x86 tests none of this!

If you want to optimize ARM then you need to use the correct, weaker
barriers when appropriate and not just use wmb as a big hammer.

Jason
On 05/19/2016 11:05 AM, Jason Gunthorpe wrote:
> This is the mapping that matches the definitions of the barriers:
>
>   wmb can be memory_order_release
>   rmb can be memory_order_acquire
>   mb  can be memory_order_seq_cst

This is not correct. Acquire and release, at least as defined by
Gharachorloo in 1991, have another meaning than the mentioned barrier
instructions. Acquire and release are labels that can be applied to
load and store instructions respectively. Acquire semantics prevent
reordering of a load-acquire instruction with any read or write
operation which follows it in program order. Release semantics prevent
reordering of a store-release with any read or write operation which
precedes it in program order. Figure 2 in
https://courses.engr.illinois.edu/cs533/sp2012/reading_list/2b.pdf
illustrates this clearly.

I have not yet analyzed the meaning of acquire and release labels in
the Itanium instruction set nor in the C11 memory consistency model
closely, but I think that acquire and release have the same meaning in
these contexts as what has been explained above.

Bart.
On Thu, May 19, 2016 at 01:14:59PM -0700, Bart Van Assche wrote:
> On 05/19/2016 11:05 AM, Jason Gunthorpe wrote:
> > This is the mapping that matches the definitions of the barriers:
> >
> >   wmb can be memory_order_release
> >   rmb can be memory_order_acquire
> >   mb  can be memory_order_seq_cst
>
> This is not correct. Acquire and release, at least as defined by
> Gharachorloo in 1991, have another meaning than the mentioned barrier
> instructions. Acquire and release are labels that can be applied to
> load and store instructions respectively.

C11 uses a common language for the coherency requirement in multiple
contexts:

 - Load/store
 - fence (ie a barrier)
 - exchanges
 - bit set/test

In a load/store context the ordering is associated with a single
memory op, like your paper describes. In the fence context there is no
memory op; the ordering acts on all memory ops.

Jason
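A small C11 example of that distinction, using a hypothetical
flag/data pair (sketch only):

-----
#include <stdatomic.h>

extern atomic_int flag;	/* hypothetical producer/consumer pair */
extern int data;

int read_with_label(void)
{
	/* acquire as a label: the ordering is attached to this one load */
	if (atomic_load_explicit(&flag, memory_order_acquire))
		return data;
	return -1;
}

int read_with_fence(void)
{
	/* acquire as a fence: a relaxed load followed by a standalone
	 * fence that orders all surrounding memory operations */
	if (atomic_load_explicit(&flag, memory_order_relaxed)) {
		atomic_thread_fence(memory_order_acquire);
		return data;
	}
	return -1;
}
-----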
On 19/05/2016 21:28, Jason Gunthorpe wrote:
> Yes, we probably should have had a 'memcpy to wc' verbs helper that
> did the best possible wc copy and prompt flush, rather than the mess
> with wc_wmb and open coded memcpy. IIRC there is some way to do this
> with xmm non temporal cache line stores that is even better???

Non-temporal stores will go straight to memory IIRC. With many modern
HCAs being able to read data directly out of the last-level cache of
the processor, they could end up being a suboptimal choice.

Gabriele
On Fri, May 20, 2016 at 11:44:14AM +0200, Gabriele Svelto wrote:
> On 19/05/2016 21:28, Jason Gunthorpe wrote:
> > Yes, we probably should have had a 'memcpy to wc' verbs helper that
> > did the best possible wc copy and prompt flush, rather than the mess
> > with wc_wmb and open coded memcpy. IIRC there is some way to do this
> > with xmm non temporal cache line stores that is even better???
>
> Non-temporal stores will go straight to memory IIRC. With many modern
> HCAs being able to read data directly out of the last-level cache of
> the processor, they could end up being a suboptimal choice.

We are not talking about a memory-backed write for DMA, but batching
an MMIO write into a 64 byte burst. Non-temporal stores would seem to
be the best way to do that?

Jason
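A rough sketch of what such a 'memcpy to wc' helper might look like on
x86-64, using SSE2 non-temporal stores (hypothetical helper name,
untested; wc_dst must be 16-byte aligned):

-----
#include <emmintrin.h>	/* SSE2 intrinsics */

/* Copy a 64-byte work request into a WC-mapped device slot using
 * non-temporal stores, then flush so the burst leaves the WC buffer
 * promptly.  Hypothetical helper -- not an existing verbs API. */
static inline void memcpy64_to_wc(void *wc_dst, const void *src)
{
	const __m128i *s = (const __m128i *)src;
	__m128i *d = (__m128i *)wc_dst;

	_mm_stream_si128(&d[0], _mm_loadu_si128(&s[0]));
	_mm_stream_si128(&d[1], _mm_loadu_si128(&s[1]));
	_mm_stream_si128(&d[2], _mm_loadu_si128(&s[2]));
	_mm_stream_si128(&d[3], _mm_loadu_si128(&s[3]));
	_mm_sfence();	/* push the combined 64B burst out to the device */
}
-----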
diff --git a/include/infiniband/arch.h b/include/infiniband/arch.h
index bc1738a..71f02f8 100644
--- a/include/infiniband/arch.h
+++ b/include/infiniband/arch.h
@@ -122,6 +122,15 @@ static inline uint64_t ntohll(uint64_t x) { return x; }
 #define wmb() mb()	 /* for s390x */
 #define wc_wmb() wmb()	 /* for s390x */
 
+#elif defined(__aarch64__)
+
+#define dsb(opt)	asm volatile("dsb " #opt : : : "memory")
+
+#define mb()	dsb(sy)
+#define rmb()	dsb(ld)
+#define wmb()	dsb(st)
+#define wc_wmb()	wmb()
+
 #else
 #warning No architecture specific defines found. Using generic implementation.
The default generic barriers are not correct for ARM64. This results in
data corruption. The correct macros are lifted from the linux kernel.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
---

I wonder why the linux kernel doesn't export these? Also, if the hw
platform is unknown, I don't think libibverbs should pick a default
implementation that might cause data corruption. Rather, I think it
should just fail a compile on that platform.

Thoughts?

---
 include/infiniband/arch.h |    9 +++++++++
 1 files changed, 9 insertions(+), 0 deletions(-)