| Message ID | alpine.LRH.2.02.1802131648030.31130@file01.intranet.prod.int.rdu2.redhat.com (mailing list archive) |
|---|---|
| State | Awaiting Upstream, archived |
| Delegated to: | Mike Snitzer |
On Tue, Feb 13, 2018 at 2:00 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>
>
> On Fri, 8 Dec 2017, Dan Williams wrote:
>
>> > > > when we write to
>> > > > persistent memory using cached write instructions and use dax_flush
>> > > > afterwards to flush cache for the affected range, the performance is about
>> > > > 350MB/s. It is practically unusable - worse than low-end SSDs.
>> > > >
>> > > > On the other hand, the movnti instruction can sustain performance of one
>> > > > 8-byte write per clock cycle. We don't have to flush cache afterwards, the
>> > > > only thing that must be done is to flush the write-combining buffer with
>> > > > the sfence instruction. Movnti has much better throughput than dax_flush.
>> > >
>> > > What about memcpy_flushcache?
>> >
>> > but
>> >
>> > - using memcpy_flushcache is overkill if we need just one or two 8-byte
>> > writes to the metadata area. Why not use movnti directly?
>> >
>>
>> The driver performs so many 8-byte moves that the cost of the
>> memcpy_flushcache() function call significantly eats into your
>> performance?
>
> I've measured it on Skylake i7-6700 - and the dm-writecache driver has 2%
> lower throughput when it uses memcpy_flushcache() to update its metadata
> instead of explicitly coded "movnti" instructions.
>
> I've created this patch - it doesn't change API in any way, but it
> optimizes memcpy_flushcache for 4, 8 and 16-byte writes (that is what my
> driver mostly uses). With this patch, I can remove the explicit "asm"
> statements from my driver. Would you consider committing this patch to the
> kernel?
>
> Mikulas
>
>

Yes, this looks good to me. You can send it to the x86 folks with my:

Reviewed-by: Dan Williams <dan.j.williams@intel.com>

...or let me know and I can chase it through the -tip tree. Either way
works for me.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
On Tue, 13 Feb 2018, Dan Williams wrote:

> On Tue, Feb 13, 2018 at 2:00 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> >
> >
> > On Fri, 8 Dec 2017, Dan Williams wrote:
> >
> >> > > > when we write to
> >> > > > persistent memory using cached write instructions and use dax_flush
> >> > > > afterwards to flush cache for the affected range, the performance is about
> >> > > > 350MB/s. It is practically unusable - worse than low-end SSDs.
> >> > > >
> >> > > > On the other hand, the movnti instruction can sustain performance of one
> >> > > > 8-byte write per clock cycle. We don't have to flush cache afterwards, the
> >> > > > only thing that must be done is to flush the write-combining buffer with
> >> > > > the sfence instruction. Movnti has much better throughput than dax_flush.
> >> > >
> >> > > What about memcpy_flushcache?
> >> >
> >> > but
> >> >
> >> > - using memcpy_flushcache is overkill if we need just one or two 8-byte
> >> > writes to the metadata area. Why not use movnti directly?
> >> >
> >>
> >> The driver performs so many 8-byte moves that the cost of the
> >> memcpy_flushcache() function call significantly eats into your
> >> performance?
> >
> > I've measured it on Skylake i7-6700 - and the dm-writecache driver has 2%
> > lower throughput when it uses memcpy_flushcache() to update its metadata
> > instead of explicitly coded "movnti" instructions.
> >
> > I've created this patch - it doesn't change API in any way, but it
> > optimizes memcpy_flushcache for 4, 8 and 16-byte writes (that is what my
> > driver mostly uses). With this patch, I can remove the explicit "asm"
> > statements from my driver. Would you consider committing this patch to the
> > kernel?
> >
> > Mikulas
> >
> >
>
> Yes, this looks good to me. You can send it to the x86 folks with my:
>
> Reviewed-by: Dan Williams <dan.j.williams@intel.com>
>
> ...or let me know and I can chase it through the -tip tree. Either way
> works for me.

If you have access to some tree that will be merged, you can commit the
patch there.

Mikulas
On Tue, Feb 13, 2018 at 5:24 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>
>
> On Tue, 13 Feb 2018, Dan Williams wrote:
>
>> On Tue, Feb 13, 2018 at 2:00 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>> >
>> >
>> > On Fri, 8 Dec 2017, Dan Williams wrote:
>> >
>> >> > > > when we write to
>> >> > > > persistent memory using cached write instructions and use dax_flush
>> >> > > > afterwards to flush cache for the affected range, the performance is about
>> >> > > > 350MB/s. It is practically unusable - worse than low-end SSDs.
>> >> > > >
>> >> > > > On the other hand, the movnti instruction can sustain performance of one
>> >> > > > 8-byte write per clock cycle. We don't have to flush cache afterwards, the
>> >> > > > only thing that must be done is to flush the write-combining buffer with
>> >> > > > the sfence instruction. Movnti has much better throughput than dax_flush.
>> >> > >
>> >> > > What about memcpy_flushcache?
>> >> >
>> >> > but
>> >> >
>> >> > - using memcpy_flushcache is overkill if we need just one or two 8-byte
>> >> > writes to the metadata area. Why not use movnti directly?
>> >> >
>> >>
>> >> The driver performs so many 8-byte moves that the cost of the
>> >> memcpy_flushcache() function call significantly eats into your
>> >> performance?
>> >
>> > I've measured it on Skylake i7-6700 - and the dm-writecache driver has 2%
>> > lower throughput when it uses memcpy_flushcache() to update its metadata
>> > instead of explicitly coded "movnti" instructions.
>> >
>> > I've created this patch - it doesn't change API in any way, but it
>> > optimizes memcpy_flushcache for 4, 8 and 16-byte writes (that is what my
>> > driver mostly uses). With this patch, I can remove the explicit "asm"
>> > statements from my driver. Would you consider committing this patch to the
>> > kernel?
>> >
>> > Mikulas
>> >
>> >
>>
>> Yes, this looks good to me. You can send it to the x86 folks with my:
>>
>> Reviewed-by: Dan Williams <dan.j.williams@intel.com>
>>
>> ...or let me know and I can chase it through the -tip tree. Either way
>> works for me.
>
> If you have access to some tree that will be merged, you can commit the
> patch there.

Sure, no worries, I'll take it from here.
Index: linux-2.6/arch/x86/include/asm/string_64.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/string_64.h	2018-01-31 11:06:19.953577699 -0500
+++ linux-2.6/arch/x86/include/asm/string_64.h	2018-02-13 12:31:06.506810497 -0500
@@ -147,7 +147,25 @@ memcpy_mcsafe(void *dst, const void *src
 
 #ifdef CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE
 #define __HAVE_ARCH_MEMCPY_FLUSHCACHE 1
-void memcpy_flushcache(void *dst, const void *src, size_t cnt);
+void __memcpy_flushcache(void *dst, const void *src, size_t cnt);
+static __always_inline void memcpy_flushcache(void *dst, const void *src, size_t cnt)
+{
+	if (__builtin_constant_p(cnt)) {
+		switch (cnt) {
+		case 4:
+			asm ("movntil %1, %0" : "=m"(*(u32 *)dst) : "r"(*(u32 *)src));
+			return;
+		case 8:
+			asm ("movntiq %1, %0" : "=m"(*(u64 *)dst) : "r"(*(u64 *)src));
+			return;
+		case 16:
+			asm ("movntiq %1, %0" : "=m"(*(u64 *)dst) : "r"(*(u64 *)src));
+			asm ("movntiq %1, %0" : "=m"(*(u64 *)(dst + 8)) : "r"(*(u64 *)(src + 8)));
+			return;
+		}
+	}
+	__memcpy_flushcache(dst, src, cnt);
+}
 #endif
 
 #endif /* __KERNEL__ */
Index: linux-2.6/arch/x86/lib/usercopy_64.c
===================================================================
--- linux-2.6.orig/arch/x86/lib/usercopy_64.c	2018-01-31 11:06:19.988577678 -0500
+++ linux-2.6/arch/x86/lib/usercopy_64.c	2018-02-13 11:56:40.249154414 -0500
@@ -133,7 +133,7 @@ long __copy_user_flushcache(void *dst, c
 	return rc;
 }
 
-void memcpy_flushcache(void *_dst, const void *_src, size_t size)
+void __memcpy_flushcache(void *_dst, const void *_src, size_t size)
 {
 	unsigned long dest = (unsigned long) _dst;
 	unsigned long source = (unsigned long) _src;
@@ -196,14 +196,14 @@ void memcpy_flushcache(void *_dst, const
 		clean_cache_range((void *) dest, size);
 	}
 }
-EXPORT_SYMBOL_GPL(memcpy_flushcache);
+EXPORT_SYMBOL_GPL(__memcpy_flushcache);
 
 void memcpy_page_flushcache(char *to, struct page *page, size_t offset,
 		size_t len)
 {
 	char *from = kmap_atomic(page);
 
-	memcpy_flushcache(to, from + offset, len);
+	__memcpy_flushcache(to, from + offset, len);
 	kunmap_atomic(from);
 }
 #endif
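
For readers who want to try the dispatch idea outside the kernel, here is a rough userspace sketch of the same pattern: an always-inline wrapper that emits non-temporal stores when the size is a compile-time constant of 4, 8 or 16 bytes, and otherwise calls an out-of-line helper. The names nt_copy(), nt_copy_slow() and nt_drain() are hypothetical and not part of the patch. The slow path is a deliberately simplified stand-in (memcpy plus a cache-line flush), not the kernel's __memcpy_flushcache(), and the 64-byte cache-line size is an assumption. On targets other than x86-64 the sketch degrades to a plain memcpy.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__)
#include <emmintrin.h>	/* SSE2: _mm_stream_si32/si64, _mm_clflush, _mm_sfence */
#define NT_AVAILABLE 1
#else
#define NT_AVAILABLE 0
#endif

/*
 * Hypothetical out-of-line fallback for arbitrary sizes.
 * Assumption: 64-byte cache lines. This only stands in for the kernel's
 * __memcpy_flushcache(); it does not reproduce its algorithm.
 */
static void nt_copy_slow(void *dst, const void *src, size_t cnt)
{
	memcpy(dst, src, cnt);
#if NT_AVAILABLE
	uintptr_t p = (uintptr_t)dst & ~(uintptr_t)63;
	uintptr_t end = (uintptr_t)dst + cnt;

	/* Write back every cache line touched by the copy. */
	for (; p < end; p += 64)
		_mm_clflush((const void *)p);
#endif
}

/*
 * Inline wrapper: when cnt is a compile-time constant of 4, 8 or 16,
 * emit movnti directly and skip the function call.
 */
static inline __attribute__((always_inline))
void nt_copy(void *dst, const void *src, size_t cnt)
{
#if NT_AVAILABLE
	if (__builtin_constant_p(cnt)) {
		int32_t v32;
		int64_t v64;

		switch (cnt) {
		case 4:
			memcpy(&v32, src, 4);
			_mm_stream_si32((int *)dst, v32);
			return;
		case 8:
			memcpy(&v64, src, 8);
			_mm_stream_si64((long long *)dst, v64);
			return;
		case 16:
			memcpy(&v64, src, 8);
			_mm_stream_si64((long long *)dst, v64);
			memcpy(&v64, (const char *)src + 8, 8);
			_mm_stream_si64((long long *)((char *)dst + 8), v64);
			return;
		}
	}
#endif
	nt_copy_slow(dst, src, cnt);
}

/* Drain the write-combining buffers; call once after a batch of stores. */
static inline void nt_drain(void)
{
#if NT_AVAILABLE
	_mm_sfence();
#endif
}
```

A caller would issue something like nt_copy(&entry->seq, &seq, sizeof(seq)) for each small metadata field (entry and seq being made-up names) and then a single nt_drain() before relying on the data having left the write-combining buffers. Whether the fast path is actually taken depends on the compiler proving the size constant, which typically needs optimization enabled; without it the wrapper simply falls through to the slow path.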