
[2/2] dm-writecache

Message ID alpine.LRH.2.02.1802131648030.31130@file01.intranet.prod.int.rdu2.redhat.com (mailing list archive)
State Awaiting Upstream, archived
Delegated to: Mike Snitzer

Commit Message

Mikulas Patocka Feb. 13, 2018, 10 p.m. UTC
On Fri, 8 Dec 2017, Dan Williams wrote:

> > > > when we write to
> > > > persistent memory using cached write instructions and use dax_flush
> > > > afterwards to flush cache for the affected range, the performance is about
> > > > 350MB/s. It is practically unusable - worse than low-end SSDs.
> > > >
> > > > On the other hand, the movnti instruction can sustain performance of one
> > > > 8-byte write per clock cycle. We don't have to flush cache afterwards, the
> > > > only thing that must be done is to flush the write-combining buffer with
> > > > the sfence instruction. Movnti has much better throughput than dax_flush.
> > >
> > > What about memcpy_flushcache?
> >
> > but
> >
> > - using memcpy_flushcache is overkill if we need just one or two 8-byte
> > writes to the metadata area. Why not use movnti directly?
> >
> 
> The driver performs so many 8-byte moves that the cost of the
> memcpy_flushcache() function call significantly eats into your
> performance?

I've measured it on Skylake i7-6700 - and the dm-writecache driver has 2% 
lower throughput when it uses memcpy_flushcache() to update its metadata 
instead of explicitly coded "movnti" instructions.

I've created this patch - it doesn't change the API in any way, but it 
optimizes memcpy_flushcache for 4, 8 and 16-byte writes (which is what my 
driver mostly uses). With this patch, I can remove the explicit "asm" 
statements from my driver. Would you consider committing this patch to the 
kernel?

Mikulas




x86: optimize memcpy_flushcache

I use memcpy_flushcache in my persistent memory driver for metadata
updates and it turns out that the overhead of memcpy_flushcache causes 2%
performance degradation compared to "movnti" instruction explicitly coded
using inline assembler.

This patch recognizes memcpy_flushcache calls with constant short length
and turns them into inline assembler - so that I don't have to use inline
assembler in the driver.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 arch/x86/include/asm/string_64.h |   20 +++++++++++++++++++-
 arch/x86/lib/usercopy_64.c       |    6 +++---
 2 files changed, 22 insertions(+), 4 deletions(-)


--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Comments

Dan Williams Feb. 13, 2018, 10:07 p.m. UTC | #1
On Tue, Feb 13, 2018 at 2:00 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>
>
> On Fri, 8 Dec 2017, Dan Williams wrote:
>
>> > > > when we write to
>> > > > persistent memory using cached write instructions and use dax_flush
>> > > > afterwards to flush cache for the affected range, the performance is about
>> > > > 350MB/s. It is practically unusable - worse than low-end SSDs.
>> > > >
>> > > > On the other hand, the movnti instruction can sustain performance of one
>> > > > 8-byte write per clock cycle. We don't have to flush cache afterwards, the
>> > > > only thing that must be done is to flush the write-combining buffer with
>> > > > the sfence instruction. Movnti has much better throughput than dax_flush.
>> > >
>> > > What about memcpy_flushcache?
>> >
>> > but
>> >
>> > - using memcpy_flushcache is overkill if we need just one or two 8-byte
>> > writes to the metadata area. Why not use movnti directly?
>> >
>>
>> The driver performs so many 8-byte moves that the cost of the
>> memcpy_flushcache() function call significantly eats into your
>> performance?
>
> I've measured it on Skylake i7-6700 - and the dm-writecache driver has 2%
> lower throughput when it uses memcpy_flushcache() to update its metadata
> instead of explicitly coded "movnti" instructions.
>
> I've created this patch - it doesn't change the API in any way, but it
> optimizes memcpy_flushcache for 4, 8 and 16-byte writes (which is what my
> driver mostly uses). With this patch, I can remove the explicit "asm"
> statements from my driver. Would you consider committing this patch to the
> kernel?
>
> Mikulas
>
>

Yes, this looks good to me. You can send it to the x86 folks with my:

Reviewed-by: Dan Williams <dan.j.williams@intel.com>

...or let me know and I can chase it through the -tip tree. Either way
works for me.

Mikulas Patocka Feb. 14, 2018, 1:24 a.m. UTC | #2
On Tue, 13 Feb 2018, Dan Williams wrote:

> On Tue, Feb 13, 2018 at 2:00 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> >
> >
> > On Fri, 8 Dec 2017, Dan Williams wrote:
> >
> >> > > > when we write to
> >> > > > persistent memory using cached write instructions and use dax_flush
> >> > > > afterwards to flush cache for the affected range, the performance is about
> >> > > > 350MB/s. It is practically unusable - worse than low-end SSDs.
> >> > > >
> >> > > > On the other hand, the movnti instruction can sustain performance of one
> >> > > > 8-byte write per clock cycle. We don't have to flush cache afterwards, the
> >> > > > only thing that must be done is to flush the write-combining buffer with
> >> > > > the sfence instruction. Movnti has much better throughput than dax_flush.
> >> > >
> >> > > What about memcpy_flushcache?
> >> >
> >> > but
> >> >
> >> > - using memcpy_flushcache is overkill if we need just one or two 8-byte
> >> > writes to the metadata area. Why not use movnti directly?
> >> >
> >>
> >> The driver performs so many 8-byte moves that the cost of the
> >> memcpy_flushcache() function call significantly eats into your
> >> performance?
> >
> > I've measured it on Skylake i7-6700 - and the dm-writecache driver has 2%
> > lower throughput when it uses memcpy_flushcache() to update its metadata
> > instead of explicitly coded "movnti" instructions.
> >
> > I've created this patch - it doesn't change the API in any way, but it
> > optimizes memcpy_flushcache for 4, 8 and 16-byte writes (which is what my
> > driver mostly uses). With this patch, I can remove the explicit "asm"
> > statements from my driver. Would you consider committing this patch to the
> > kernel?
> >
> > Mikulas
> >
> >
> 
> Yes, this looks good to me. You can send it to the x86 folks with my:
> 
> Reviewed-by: Dan Williams <dan.j.williams@intel.com>
> 
> ...or let me know and I can chase it through the -tip tree. Either way
> works for me.

If you have access to some tree that will be merged, you can commit the 
patch there.

Mikulas

Dan Williams Feb. 14, 2018, 1:36 a.m. UTC | #3
On Tue, Feb 13, 2018 at 5:24 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>
>
> On Tue, 13 Feb 2018, Dan Williams wrote:
>
>> On Tue, Feb 13, 2018 at 2:00 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>> >
>> >
>> > On Fri, 8 Dec 2017, Dan Williams wrote:
>> >
>> >> > > > when we write to
>> >> > > > persistent memory using cached write instructions and use dax_flush
>> >> > > > afterwards to flush cache for the affected range, the performance is about
>> >> > > > 350MB/s. It is practically unusable - worse than low-end SSDs.
>> >> > > >
>> >> > > > On the other hand, the movnti instruction can sustain performance of one
>> >> > > > 8-byte write per clock cycle. We don't have to flush cache afterwards, the
>> >> > > > only thing that must be done is to flush the write-combining buffer with
>> >> > > > the sfence instruction. Movnti has much better throughput than dax_flush.
>> >> > >
>> >> > > What about memcpy_flushcache?
>> >> >
>> >> > but
>> >> >
>> >> > - using memcpy_flushcache is overkill if we need just one or two 8-byte
>> >> > writes to the metadata area. Why not use movnti directly?
>> >> >
>> >>
>> >> The driver performs so many 8-byte moves that the cost of the
>> >> memcpy_flushcache() function call significantly eats into your
>> >> performance?
>> >
>> > I've measured it on Skylake i7-6700 - and the dm-writecache driver has 2%
>> > lower throughput when it uses memcpy_flushcache() to update its metadata
>> > instead of explicitly coded "movnti" instructions.
>> >
>> > I've created this patch - it doesn't change the API in any way, but it
>> > optimizes memcpy_flushcache for 4, 8 and 16-byte writes (which is what my
>> > driver mostly uses). With this patch, I can remove the explicit "asm"
>> > statements from my driver. Would you consider committing this patch to the
>> > kernel?
>> >
>> > Mikulas
>> >
>> >
>>
>> Yes, this looks good to me. You can send it to the x86 folks with my:
>>
>> Reviewed-by: Dan Williams <dan.j.williams@intel.com>
>>
>> ...or let me know and I can chase it through the -tip tree. Either way
>> works for me.
>
> If you have access to some tree that will be merged, you can commit the
> patch there.

Sure, no worries, I'll take it from here.


Patch

Index: linux-2.6/arch/x86/include/asm/string_64.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/string_64.h	2018-01-31 11:06:19.953577699 -0500
+++ linux-2.6/arch/x86/include/asm/string_64.h	2018-02-13 12:31:06.506810497 -0500
@@ -147,7 +147,25 @@  memcpy_mcsafe(void *dst, const void *src
 
 #ifdef CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE
 #define __HAVE_ARCH_MEMCPY_FLUSHCACHE 1
-void memcpy_flushcache(void *dst, const void *src, size_t cnt);
+void __memcpy_flushcache(void *dst, const void *src, size_t cnt);
+static __always_inline void memcpy_flushcache(void *dst, const void *src, size_t cnt)
+{
+	if (__builtin_constant_p(cnt)) {
+		switch (cnt) {
+			case 4:
+				asm ("movntil %1, %0" : "=m"(*(u32 *)dst) : "r"(*(u32 *)src));
+				return;
+			case 8:
+				asm ("movntiq %1, %0" : "=m"(*(u64 *)dst) : "r"(*(u64 *)src));
+				return;
+			case 16:
+				asm ("movntiq %1, %0" : "=m"(*(u64 *)dst) : "r"(*(u64 *)src));
+				asm ("movntiq %1, %0" : "=m"(*(u64 *)(dst + 8)) : "r"(*(u64 *)(src + 8)));
+				return;
+		}
+	}
+	__memcpy_flushcache(dst, src, cnt);
+}
 #endif
 
 #endif /* __KERNEL__ */
Index: linux-2.6/arch/x86/lib/usercopy_64.c
===================================================================
--- linux-2.6.orig/arch/x86/lib/usercopy_64.c	2018-01-31 11:06:19.988577678 -0500
+++ linux-2.6/arch/x86/lib/usercopy_64.c	2018-02-13 11:56:40.249154414 -0500
@@ -133,7 +133,7 @@  long __copy_user_flushcache(void *dst, c
 	return rc;
 }
 
-void memcpy_flushcache(void *_dst, const void *_src, size_t size)
+void __memcpy_flushcache(void *_dst, const void *_src, size_t size)
 {
 	unsigned long dest = (unsigned long) _dst;
 	unsigned long source = (unsigned long) _src;
@@ -196,14 +196,14 @@  void memcpy_flushcache(void *_dst, const
 		clean_cache_range((void *) dest, size);
 	}
 }
-EXPORT_SYMBOL_GPL(memcpy_flushcache);
+EXPORT_SYMBOL_GPL(__memcpy_flushcache);
 
 void memcpy_page_flushcache(char *to, struct page *page, size_t offset,
 		size_t len)
 {
 	char *from = kmap_atomic(page);
 
-	memcpy_flushcache(to, from + offset, len);
+	__memcpy_flushcache(to, from + offset, len);
 	kunmap_atomic(from);
 }
 #endif