memcpy_flushcache: use cache flushing for larger lengths

Message ID	alpine.LRH.2.02.2003291625590.32108@file01.intranet.prod.int.rdu2.redhat.com (mailing list archive)
State	Superseded
Headers
Return-Path: <SRS0=SoN4=5O=lists.01.org=linux-nvdimm-bounces@kernel.org>
DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 2104C207FF
Received-SPF: Pass (mailfrom) identity=mailfrom; client-ip=63.128.21.74;
 helo=us-smtp-delivery-74.mimecast.com; envelope-from=mpatocka@redhat.com;
 receiver=<UNKNOWN>
Date: Sun, 29 Mar 2020 16:28:36 -0400 (EDT)
From: Mikulas Patocka <mpatocka@redhat.com>
To: Dan Williams <dan.j.williams@intel.com>,
        Vishal Verma <vishal.l.verma@intel.com>,
        Dave Jiang <dave.jiang@intel.com>, Ira Weiny <ira.weiny@intel.com>,
        Mike Snitzer <msnitzer@redhat.com>
Subject: [PATCH] memcpy_flushcache: use cache flushing for larger lengths
Message-ID:
 <alpine.LRH.2.02.2003291625590.32108@file01.intranet.prod.int.rdu2.redhat.com>
User-Agent: Alpine 2.02 (LRH 1266 2009-07-14)
MIME-Version: 1.0
Message-ID-Hash: 7NXHXPSAIJYOTWX5EISPKL3MRTZEEZKY
CC: linux-nvdimm@lists.01.org, dm-devel@redhat.com
Precedence: list
Archived-At:
 <https://lists.01.org/hyperkitty/list/linux-nvdimm@lists.01.org/message/7NXHXPSAIJYOTWX5EISPKL3MRTZEEZKY/>
Content-Type: TEXT/PLAIN; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Series	memcpy_flushcache: use cache flushing for larger lengths

Commit Message

Mikulas Patocka March 29, 2020, 8:28 p.m. UTC

I tested dm-writecache performance on a machine with Optane nvdimm and it
turned out that for larger writes, cached stores + cache flushing perform
better than non-temporal stores. This is the throughput of dm-writecache
measured with this command:
dd if=/dev/zero of=/dev/mapper/wc bs=64 oflag=direct

block size	512		1024		2048		4096
movnti		496 MB/s	642 MB/s	725 MB/s	744 MB/s
clflushopt	373 MB/s	688 MB/s	1.1 GB/s	1.2 GB/s

We can see that for smaller blocks, movnti performs better, but for larger
blocks, clflushopt has better performance.
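
The comparison in the table can be restated as a ratio per block size. The
following is only an illustrative sketch: the helper names are hypothetical,
and it assumes that "1.1 GB/s" and "1.2 GB/s" mean 1100 and 1200 MB/s.

```c
#include <assert.h>
#include <stddef.h>

/* Throughput figures copied from the table above, in MB/s.
 * Assumption: 1.1 GB/s and 1.2 GB/s are taken as 1100 and 1200 MB/s. */
static const unsigned block_size[]   = { 512, 1024, 2048, 4096 };
static const double movnti_mbs[]     = { 496.0, 642.0, 725.0, 744.0 };
static const double clflushopt_mbs[] = { 373.0, 688.0, 1100.0, 1200.0 };

/* A ratio above 1.0 means clflushopt was faster than movnti. */
static double speedup(double movnti, double clflushopt)
{
	assert(movnti > 0.0);
	return clflushopt / movnti;
}

/* Smallest measured block size at which clflushopt wins, or 0 if none. */
static unsigned first_winning_block_size(void)
{
	size_t n = sizeof(block_size) / sizeof(block_size[0]);

	for (size_t i = 0; i < n; i++)
		if (speedup(movnti_mbs[i], clflushopt_mbs[i]) > 1.0)
			return block_size[i];
	return 0;
}
```

With these figures the ratio is below 1.0 only in the 512 column, so the
crossover lies somewhere between 512 and 1024.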

This patch changes the function __memcpy_flushcache accordingly, so that
for size >= 768 it performs cached stores followed by cache flushing. Note
that we must not take the new branch if the CPU doesn't have clflushopt: in
that case, the kernel would fall back to the "clflush" instruction, which
has very bad performance.
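
The selection rule described above can be sketched in plain C. This is a
hypothetical userspace model, not the kernel code: choose_copy_path, enum
copy_path and FLUSH_THRESHOLD are illustrative names, and the has_clflushopt
flag stands in for static_cpu_has(X86_FEATURE_CLFLUSHOPT).

```c
#include <stdbool.h>
#include <stddef.h>

/* Threshold taken from the commit message: size >= 768. */
#define FLUSH_THRESHOLD 768

enum copy_path {
	PATH_MOVNTI,        /* non-temporal stores (the existing loops) */
	PATH_CACHED_FLUSH,  /* cached stores followed by clflushopt */
};

/* Hypothetical helper: pick the copy strategy for a given length. */
static enum copy_path choose_copy_path(size_t size, bool has_clflushopt)
{
	/* Without clflushopt the flush would be a plain clflush, so the
	 * cached path is never chosen on such CPUs. */
	if (has_clflushopt && size >= FLUSH_THRESHOLD)
		return PATH_CACHED_FLUSH;
	return PATH_MOVNTI;
}
```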

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 arch/x86/lib/usercopy_64.c |   36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

Index: linux-2.6/arch/x86/lib/usercopy_64.c
===================================================================
--- linux-2.6.orig/arch/x86/lib/usercopy_64.c	2020-03-24 15:15:36.644945091 -0400
+++ linux-2.6/arch/x86/lib/usercopy_64.c	2020-03-29 13:16:49.937011736 -0400
@@ -152,6 +152,42 @@  void __memcpy_flushcache(void *_dst, con
 			return;
 	}
 
+	if (static_cpu_has(X86_FEATURE_CLFLUSHOPT) && size >= 768) {
+		while (!IS_ALIGNED(dest, 64)) {
+			asm("movq    (%0), %%r8\n"
+			    "movnti  %%r8,   (%1)\n"
+			    :: "r" (source), "r" (dest)
+			    : "memory", "r8");
+			dest += 8;
+			source += 8;
+			size -= 8;
+		}
+		do {
+			asm("movq    (%0), %%r8\n"
+			    "movq   8(%0), %%r9\n"
+			    "movq  16(%0), %%r10\n"
+			    "movq  24(%0), %%r11\n"
+			    "movq    %%r8,   (%1)\n"
+			    "movq    %%r9,  8(%1)\n"
+			    "movq   %%r10, 16(%1)\n"
+			    "movq   %%r11, 24(%1)\n"
+			    "movq  32(%0), %%r8\n"
+			    "movq  40(%0), %%r9\n"
+			    "movq  48(%0), %%r10\n"
+			    "movq  56(%0), %%r11\n"
+			    "movq    %%r8, 32(%1)\n"
+			    "movq    %%r9, 40(%1)\n"
+			    "movq   %%r10, 48(%1)\n"
+			    "movq   %%r11, 56(%1)\n"
+			    :: "r" (source), "r" (dest)
+			    : "memory", "r8", "r9", "r10", "r11");
+			clflushopt((void *)dest);
+			dest += 64;
+			source += 64;
+			size -= 64;
+		} while (size >= 64);
+	}
+
 	/* 4x8 movnti loop */
 	while (size >= 32) {
 		asm("movq    (%0), %%r8\n"
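
As a rough illustration of how the added branch divides a copy, the sketch
below models only the pointer arithmetic, with no stores or flushes. The
struct and function names are hypothetical, and it assumes (as the
surrounding function arranges before this point) that dest is already 8-byte
aligned when the branch is entered.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct flush_split {
	size_t head;  /* 8-byte movnti stores until dest is 64-byte aligned */
	size_t body;  /* whole 64-byte lines, each followed by clflushopt */
	size_t tail;  /* remainder left for the existing movnti loops */
};

/* Hypothetical model of the new branch for size >= 768. */
static struct flush_split split_flush_copy(uintptr_t dest, size_t size)
{
	struct flush_split s = { 0, 0, 0 };

	assert(dest % 8 == 0);  /* assumption: aligned by earlier code */
	assert(size >= 768);    /* the branch is only taken above this size */

	while (dest % 64 != 0) {
		dest += 8;
		size -= 8;
		s.head += 8;
	}
	do {
		dest += 64;
		size -= 64;
		s.body += 64;
	} while (size >= 64);

	s.tail = size;
	return s;
}
```

For example, a 768-byte copy to an address that is 8 bytes past a 64-byte
boundary is split into 56 bytes of head, 704 bytes of flushed cache lines,
and an 8-byte tail.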
