vfs: Optimize dedupe comparison

Message ID	20210715141309.38443-1-nborisov@suse.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-fsdevel-owner@kernel.org> From: Nikolay Borisov <nborisov@suse.com> To: linux-fsdevel@vger.kernel.org Cc: viro@zeniv.linux.org.uk, david@fromorbit.com, djwong@kernel.org, Nikolay Borisov <nborisov@suse.com> Subject: [PATCH] vfs: Optimize dedupe comparison Date: Thu, 15 Jul 2021 17:13:09 +0300 Message-Id: <20210715141309.38443-1-nborisov@suse.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	vfs: Optimize dedupe comparison \| expand vfs: Optimize dedupe comparison

Nikolay Borisov July 15, 2021, 2:13 p.m. UTC

Currently the comparison method vfs_dedupe_file_range_compare utilizes
is a plain memcmp. This effectively means the code is doing byte-by-byte
comparison. Instead, the code could do word-sized comparison without
adverse effect on performance, provided that the comparison's length is
at least as big as the native word size, as well as resulting memory
addresses are properly aligned.

On a workload consisting of running duperemove (a userspace program
doing deduplication of duplicated extents) on a fully-duplicated
dataset, consisting of 80g spread among 20k 4m files I get the following
results:

		Unpatched:		Patched:
real		21m45.275s		21m14.445s
user		0m0.986s		0m0.933s
sys		1m30.734s		1m8.900s (-25%)

Notable changes in the perf profiles:
 .... omitted for brevity ....
     0.29%     +1.01%  [kernel.vmlinux]         [k] vfs_dedupe_file_range_compare.constprop.0
    23.62%             [kernel.vmlinux]         [k] memcmp
 .... omitted for brevity ....

The memcmp is being eliminated altogether and instead is replaced by the
newly introduced loop in vfs_dedupe_file_range_compare, hence the
increase of cycles spent there by 1%.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
---
 fs/remap_range.c | 31 +++++++++++++++++++++++++++++--
 1 file changed, 29 insertions(+), 2 deletions(-)

--
2.25.1

Matthew Wilcox July 15, 2021, 2:30 p.m. UTC | #1

On Thu, Jul 15, 2021 at 05:13:09PM +0300, Nikolay Borisov wrote:
> Currently the comparison method vfs_dedupe_file_range_compare utilizes
> is a plain memcmp. This effectively means the code is doing byte-by-byte
> comparison. Instead, the code could do word-sized comparison without
> adverse effect on performance, provided that the comparison's length is
> at least as big as the native word size, as well as resulting memory
> addresses are properly aligned.

Sounds to me like somebody hasn't optimised memcmp() very well ...
is this x86-64?

> @@ -256,9 +257,35 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
>  		flush_dcache_page(src_page);
>  		flush_dcache_page(dest_page);
> 
> -		if (memcmp(src_addr + src_poff, dest_addr + dest_poff, cmp_len))
> -			same = false;
> 
> +		if (!IS_ALIGNED((unsigned long)(src_addr + src_poff), block_size) ||
> +		    !IS_ALIGNED((unsigned long)(dest_addr + dest_poff), block_size) ||
> +		    cmp_len < block_size) {

Can this even happen?  Surely we can only dedup on a block boundary and
blocks are required to be a power of two and at least 512 bytes in size?

> +			if (memcmp(src_addr + src_poff, dest_addr + dest_poff,
> +				   cmp_len))
> +				same = false;
> +		} else {
> +			int i;
> +			size_t blocks = cmp_len / block_size;
> +			loff_t rem_len = cmp_len - (blocks * block_size);
> +			unsigned long *src = src_addr + src_poff;
> +			unsigned long *dst = dest_addr + src_poff;
> +
> +			for (i = 0; i < blocks; i++) {
> +				if (src[i] - dst[i]) {
> +					same = false;
> +					goto finished;
> +				}
> +			}
> +
> +			if (rem_len) {
> +				src_addr += src_poff + (blocks * block_size);
> +				dest_addr += dest_poff + (blocks * block_size);
> +				if (memcmp(src_addr, dest_addr, rem_len))
> +					same = false;
> +			}
> +		}
> +finished:
>  		kunmap_atomic(dest_addr);
>  		kunmap_atomic(src_addr);
>  unlock:
> --
> 2.25.1
>

Nikolay Borisov July 15, 2021, 2:44 p.m. UTC | #2

On 15.07.21 г. 17:30, Matthew Wilcox wrote:
> On Thu, Jul 15, 2021 at 05:13:09PM +0300, Nikolay Borisov wrote:
>> Currently the comparison method vfs_dedupe_file_range_compare utilizes
>> is a plain memcmp. This effectively means the code is doing byte-by-byte
>> comparison. Instead, the code could do word-sized comparison without
>> adverse effect on performance, provided that the comparison's length is
>> at least as big as the native word size, as well as resulting memory
>> addresses are properly aligned.
> 
> Sounds to me like somebody hasn't optimised memcmp() very well ...
> is this x86-64?
> 

That was my first impression, here's the profile:

       │    Disassembly of section .text:
       │
       │    ffffffff815c6f60 <memcmp>:
       │    memcmp():
       │      test   %rdx,%rdx
       │    ↓ je     22
       │      xor    %ecx,%ecx
       │    ↓ jmp    12
 49.32 │ 9:   add    $0x1,%rcx
  0.03 │      cmp    %rcx,%rdx
 11.82 │    ↓ je     21
  0.01 │12:   movzbl (%rdi,%rcx,1),%eax
 38.19 │      movzbl (%rsi,%rcx,1),%r8d
  0.59 │      sub    %r8d,%eax
  0.04 │    ↑ je     9
       │    ← retq
       │21: ← retq
       │22:   xor    %eax,%eax
       │    ← retq


It's indeed on x86-64 and according to the sources it's using
__builtin_memcmp according to arch/x86/boot/string.h

>> @@ -256,9 +257,35 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
>>  		flush_dcache_page(src_page);
>>  		flush_dcache_page(dest_page);
>>
>> -		if (memcmp(src_addr + src_poff, dest_addr + dest_poff, cmp_len))
>> -			same = false;
>>
>> +		if (!IS_ALIGNED((unsigned long)(src_addr + src_poff), block_size) ||
>> +		    !IS_ALIGNED((unsigned long)(dest_addr + dest_poff), block_size) ||
>> +		    cmp_len < block_size) {
> 
> Can this even happen?  Surely we can only dedup on a block boundary and
> blocks are required to be a power of two and at least 512 bytes in size?

I was wondering the same thing, but AFAICS it seems to be possible i.e
if userspace spaces bad offsets, while all kinds of internal fs
synchronization ops are going to be performed on aligned offsets, that
doesn't mean the original ones, passed from userspace are themselves
aligned explicitly.
> 
>> +			if (memcmp(src_addr + src_poff, dest_addr + dest_poff,
>> +				   cmp_len))
>> +				same = false;
>> +		} else {
>> +			int i;
>> +			size_t blocks = cmp_len / block_size;
>> +			loff_t rem_len = cmp_len - (blocks * block_size);
>> +			unsigned long *src = src_addr + src_poff;
>> +			unsigned long *dst = dest_addr + src_poff;
>> +
>> +			for (i = 0; i < blocks; i++) {
>> +				if (src[i] - dst[i]) {
>> +					same = false;
>> +					goto finished;
>> +				}
>> +			}
>> +
>> +			if (rem_len) {
>> +				src_addr += src_poff + (blocks * block_size);
>> +				dest_addr += dest_poff + (blocks * block_size);
>> +				if (memcmp(src_addr, dest_addr, rem_len))
>> +					same = false;
>> +			}
>> +		}
>> +finished:
>>  		kunmap_atomic(dest_addr);
>>  		kunmap_atomic(src_addr);
>>  unlock:
>> --
>> 2.25.1
>>
>

Matthew Wilcox July 15, 2021, 3:09 p.m. UTC | #3

On Thu, Jul 15, 2021 at 05:44:15PM +0300, Nikolay Borisov wrote:
> That was my first impression, here's the profile:
> 
>        │    Disassembly of section .text:
>        │
>        │    ffffffff815c6f60 <memcmp>:
>        │    memcmp():
>        │      test   %rdx,%rdx
>        │    ↓ je     22
>        │      xor    %ecx,%ecx
>        │    ↓ jmp    12
>  49.32 │ 9:   add    $0x1,%rcx
>   0.03 │      cmp    %rcx,%rdx
>  11.82 │    ↓ je     21
>   0.01 │12:   movzbl (%rdi,%rcx,1),%eax
>  38.19 │      movzbl (%rsi,%rcx,1),%r8d
>   0.59 │      sub    %r8d,%eax
>   0.04 │    ↑ je     9

That looks like a byte loop to me ...

> It's indeed on x86-64 and according to the sources it's using
> __builtin_memcmp according to arch/x86/boot/string.h

I think the 'boot' part of that path might indicate that it's not what's
actually being used by the kernel.

$ git grep __HAVE_ARCH_MEMCMP
arch/arc/include/asm/string.h:#define __HAVE_ARCH_MEMCMP
arch/arm64/include/asm/string.h:#define __HAVE_ARCH_MEMCMP
arch/csky/abiv2/inc/abi/string.h:#define __HAVE_ARCH_MEMCMP
arch/powerpc/include/asm/string.h:#define __HAVE_ARCH_MEMCMP
arch/s390/include/asm/string.h:#define __HAVE_ARCH_MEMCMP       /* arch function */
arch/s390/lib/string.c:#ifdef __HAVE_ARCH_MEMCMP
arch/s390/purgatory/string.c:#define __HAVE_ARCH_MEMCMP /* arch function */
arch/sparc/include/asm/string.h:#define __HAVE_ARCH_MEMCMP
include/linux/string.h:#ifndef __HAVE_ARCH_MEMCMP
lib/string.c:#ifndef __HAVE_ARCH_MEMCMP

So I think x86-64 is using the stupid one.

> > Can this even happen?  Surely we can only dedup on a block boundary and
> > blocks are required to be a power of two and at least 512 bytes in size?
> 
> I was wondering the same thing, but AFAICS it seems to be possible i.e
> if userspace spaces bad offsets, while all kinds of internal fs
> synchronization ops are going to be performed on aligned offsets, that
> doesn't mean the original ones, passed from userspace are themselves
> aligned explicitly.

Ah, I thought it'd be failed before we got to this point.

But honestly, I think x86-64 needs to be fixed to either use
__builtin_memcmp() or to have a nicely written custom memcmp().  I
tried to find the gcc implementation of __builtin_memcmp() on
x86-64, but I can't.

Dave Chinner July 15, 2021, 10:33 p.m. UTC | #4

On Thu, Jul 15, 2021 at 04:09:06PM +0100, Matthew Wilcox wrote:
> On Thu, Jul 15, 2021 at 05:44:15PM +0300, Nikolay Borisov wrote:
> > I was wondering the same thing, but AFAICS it seems to be possible i.e
> > if userspace spaces bad offsets, while all kinds of internal fs
> > synchronization ops are going to be performed on aligned offsets, that
> > doesn't mean the original ones, passed from userspace are themselves
> > aligned explicitly.
> 
> Ah, I thought it'd be failed before we got to this point.
> 
> But honestly, I think x86-64 needs to be fixed to either use
> __builtin_memcmp() or to have a nicely written custom memcmp().  I
> tried to find the gcc implementation of __builtin_memcmp() on
> x86-64, but I can't.

Yup, this. memcmp() is widley used in hot paths through all the
filesystem code and the rest of the kernel. We should fix the
generic infrastructure problem, not play whack-a-mole to with custom
one-off fixes that avoid the problem just where it shows up in some
profile...

Cheers,

Dave.

Nikolay Borisov July 16, 2021, 12:10 p.m. UTC | #5

On 15.07.21 г. 18:09, Matthew Wilcox wrote:
> On Thu, Jul 15, 2021 at 05:44:15PM +0300, Nikolay Borisov wrote:
>> That was my first impression, here's the profile:
>>
>>        │    Disassembly of section .text:
>>        │
>>        │    ffffffff815c6f60 <memcmp>:
>>        │    memcmp():
>>        │      test   %rdx,%rdx
>>        │    ↓ je     22
>>        │      xor    %ecx,%ecx
>>        │    ↓ jmp    12
>>  49.32 │ 9:   add    $0x1,%rcx
>>   0.03 │      cmp    %rcx,%rdx
>>  11.82 │    ↓ je     21
>>   0.01 │12:   movzbl (%rdi,%rcx,1),%eax
>>  38.19 │      movzbl (%rsi,%rcx,1),%r8d
>>   0.59 │      sub    %r8d,%eax
>>   0.04 │    ↑ je     9
> 
> That looks like a byte loop to me ...
> 
>> It's indeed on x86-64 and according to the sources it's using
>> __builtin_memcmp according to arch/x86/boot/string.h
> 
> I think the 'boot' part of that path might indicate that it's not what's
> actually being used by the kernel.
> 
> $ git grep __HAVE_ARCH_MEMCMP
> arch/arc/include/asm/string.h:#define __HAVE_ARCH_MEMCMP
> arch/arm64/include/asm/string.h:#define __HAVE_ARCH_MEMCMP
> arch/csky/abiv2/inc/abi/string.h:#define __HAVE_ARCH_MEMCMP
> arch/powerpc/include/asm/string.h:#define __HAVE_ARCH_MEMCMP
> arch/s390/include/asm/string.h:#define __HAVE_ARCH_MEMCMP       /* arch function */
> arch/s390/lib/string.c:#ifdef __HAVE_ARCH_MEMCMP
> arch/s390/purgatory/string.c:#define __HAVE_ARCH_MEMCMP /* arch function */
> arch/sparc/include/asm/string.h:#define __HAVE_ARCH_MEMCMP
> include/linux/string.h:#ifndef __HAVE_ARCH_MEMCMP
> lib/string.c:#ifndef __HAVE_ARCH_MEMCMP
> 
> So I think x86-64 is using the stupid one.
> 
>>> Can this even happen?  Surely we can only dedup on a block boundary and
>>> blocks are required to be a power of two and at least 512 bytes in size?
>>
>> I was wondering the same thing, but AFAICS it seems to be possible i.e
>> if userspace spaces bad offsets, while all kinds of internal fs
>> synchronization ops are going to be performed on aligned offsets, that
>> doesn't mean the original ones, passed from userspace are themselves
>> aligned explicitly.
> 
> Ah, I thought it'd be failed before we got to this point.
> 
> But honestly, I think x86-64 needs to be fixed to either use
> __builtin_memcmp() or to have a nicely written custom memcmp().  I
> tried to find the gcc implementation of __builtin_memcmp() on
> x86-64, but I can't.

__builtin_memcmp is a no go due to this function being an ifunc [0]
 and is being resolved to a call to memcmp which causes link => build
failures. So what remains is to either patch particular sites or as Dave
suggested have a generic optimized implementation.

glibc's implementation [1] seems straightforward enough to be
convertible to kernel style. However it would need definitive proof it
actually improves performance in a variety of scenarios.

[0] https://sourceware.org/glibc/wiki/GNU_IFUNC
[1] https://sourceware.org/git/?p=glibc.git;a=blob;f=string/memcmp.c

>

Nikolay Borisov July 20, 2021, 2:58 p.m. UTC | #6

On 16.07.21 г. 1:33, Dave Chinner wrote:
> On Thu, Jul 15, 2021 at 04:09:06PM +0100, Matthew Wilcox wrote:
>> On Thu, Jul 15, 2021 at 05:44:15PM +0300, Nikolay Borisov wrote:
>>> I was wondering the same thing, but AFAICS it seems to be possible i.e
>>> if userspace spaces bad offsets, while all kinds of internal fs
>>> synchronization ops are going to be performed on aligned offsets, that
>>> doesn't mean the original ones, passed from userspace are themselves
>>> aligned explicitly.
>>
>> Ah, I thought it'd be failed before we got to this point.
>>
>> But honestly, I think x86-64 needs to be fixed to either use
>> __builtin_memcmp() or to have a nicely written custom memcmp().  I
>> tried to find the gcc implementation of __builtin_memcmp() on
>> x86-64, but I can't.
> 
> Yup, this. memcmp() is widley used in hot paths through all the
> filesystem code and the rest of the kernel. We should fix the
> generic infrastructure problem, not play whack-a-mole to with custom
> one-off fixes that avoid the problem just where it shows up in some
> profile...

I ported glibc's implementation of memcmp to the kernel and after
running the same workload I get the same performance as with the basic
memcmp implementation of doing byte comparison ...

> 
> Cheers,
> 
> Dave.
>

Matthew Wilcox July 20, 2021, 3:12 p.m. UTC | #7

On Tue, Jul 20, 2021 at 05:58:39PM +0300, Nikolay Borisov wrote:
> 
> 
> On 16.07.21 г. 1:33, Dave Chinner wrote:
> > On Thu, Jul 15, 2021 at 04:09:06PM +0100, Matthew Wilcox wrote:
> >> On Thu, Jul 15, 2021 at 05:44:15PM +0300, Nikolay Borisov wrote:
> >>> I was wondering the same thing, but AFAICS it seems to be possible i.e
> >>> if userspace spaces bad offsets, while all kinds of internal fs
> >>> synchronization ops are going to be performed on aligned offsets, that
> >>> doesn't mean the original ones, passed from userspace are themselves
> >>> aligned explicitly.
> >>
> >> Ah, I thought it'd be failed before we got to this point.
> >>
> >> But honestly, I think x86-64 needs to be fixed to either use
> >> __builtin_memcmp() or to have a nicely written custom memcmp().  I
> >> tried to find the gcc implementation of __builtin_memcmp() on
> >> x86-64, but I can't.
> > 
> > Yup, this. memcmp() is widley used in hot paths through all the
> > filesystem code and the rest of the kernel. We should fix the
> > generic infrastructure problem, not play whack-a-mole to with custom
> > one-off fixes that avoid the problem just where it shows up in some
> > profile...
> 
> I ported glibc's implementation of memcmp to the kernel and after
> running the same workload I get the same performance as with the basic
> memcmp implementation of doing byte comparison ...

That's bizarre because the glibc memcmp that you pointed to earlier
basically does what your open-coded solution did.  Is it possible
you have a bug in one of the tests and it's falling back to the byte
loop?

Specifically for the dedup case, we only need the optimisation that

	if ((p1 | p2 | length) & 7)
		... do the byte loop ...
	... do the long-based comparison ...

so another possibility is that memcmp is doing too many tests.

vfs: Optimize dedupe comparison

Commit Message

Comments

Patch