
[02/14] tmpfs: fix regressions from wider use of ZERO_PAGE

Message ID 20220415021328.7D31EC385A1@smtp.kernel.org (mailing list archive)
State New
Series [01/14] MAINTAINERS: Broadcom internal lists aren't maintainers

Commit Message

Andrew Morton April 15, 2022, 2:13 a.m. UTC
From: Hugh Dickins <hughd@google.com>
Subject: tmpfs: fix regressions from wider use of ZERO_PAGE

Chuck Lever reported fsx-based xfstests generic 075 091 112 127 failing
when 5.18-rc1 NFS server exports tmpfs: bisected to recent tmpfs change.

Whilst nfsd_splice_action() does contain some questionable handling of
repeated pages, and Chuck was able to work around there, history from
Mark Hemment makes clear that there might be similar dangers elsewhere:
it was not a good idea for me to pass ZERO_PAGE down to unknown actors.

Revert shmem_file_read_iter() to using ZERO_PAGE for holes only when
iter_is_iovec(); in other cases, use the more natural iov_iter_zero()
instead of copy_page_to_iter().  We would use iov_iter_zero() throughout,
but the x86 clear_user() is not nearly so well optimized as copy to user
(dd of 1T sparse tmpfs file takes 57 seconds rather than 44 seconds).

And now pagecache_init() does not need to SetPageUptodate(ZERO_PAGE(0)):
which had caused boot failure on arm noMMU STM32F7 and STM32H7 boards.

Link: https://lkml.kernel.org/r/9a978571-8648-e830-5735-1f4748ce2e30@google.com
Fixes: 56a8c8eb1eaf ("tmpfs: do not allocate pages on read")
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Patrice CHOTARD <patrice.chotard@foss.st.com>
Reported-by: Chuck Lever III <chuck.lever@oracle.com>
Tested-by: Chuck Lever III <chuck.lever@oracle.com>
Cc: Mark Hemment <markhemm@googlemail.com>
Cc: Patrice CHOTARD <patrice.chotard@foss.st.com>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Lukas Czerner <lczerner@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/filemap.c |    6 ------
 mm/shmem.c   |   31 ++++++++++++++++++++-----------
 2 files changed, 20 insertions(+), 17 deletions(-)
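
In outline, the hole-handling branch this patch adds to shmem_file_read_iter()
behaves like the sketch below, condensed from the full diff at the bottom of
this page. The helper name shmem_read_hole() and the stripped-down signature
are illustrative only; the real change lives inside the read loop of
mm/shmem.c and also covers the page-cache-hit case.

#include <linux/mm.h>	/* ZERO_PAGE() */
#include <linux/uio.h>	/* iov_iter helpers */

/* Illustrative condensation, not the kernel's actual function. */
static size_t shmem_read_hole(struct iov_iter *to, size_t nr,
			      unsigned long offset)
{
	if (iter_is_iovec(to))
		/*
		 * Plain read(): copying the zero page is still faster
		 * than clear_user() on x86, so keep using ZERO_PAGE(0).
		 */
		return copy_page_to_iter(ZERO_PAGE(0), offset, nr, to);

	/*
	 * splice() and other destinations can be confused by seeing the
	 * same page submitted repeatedly, so zero the iterator directly.
	 */
	return iov_iter_zero(nr, to);
}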

Comments

Linus Torvalds April 15, 2022, 10:10 p.m. UTC | #1
On Thu, Apr 14, 2022 at 7:13 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> Revert shmem_file_read_iter() to using ZERO_PAGE for holes only when
> iter_is_iovec(); in other cases, use the more natural iov_iter_zero()
> instead of copy_page_to_iter().  We would use iov_iter_zero() throughout,
> but the x86 clear_user() is not nearly so well optimized as copy to user
> (dd of 1T sparse tmpfs file takes 57 seconds rather than 44 seconds).

Ugh.

I've applied this patch, but honestly, the proper course of action
should just be to improve on clear_user().

If it really is important enough that we should care about that
performance, then we just should fix clear_user().

It's a very odd special thing right now (at least on x86-64) using
some strange handcrafted inline asm code.

I assume that 'rep stosb' is the fastest way to clear things on modern
CPU's that have FSRM, and then we have the usual fallbacks (ie ERMS ->
"rep stos" except for small areas, and probably that "store zeros by
hand" for older CPUs).

Adding PeterZ and Borislav (who seem to be the last ones to have
worked on the copy and clear_page stuff respectively) and the x86
maintainers in case somebody gets the urge to just fix this.

Because memory clearing should be faster than copying, and the thing
that makes copying fast is that FSRM and ERMS logic (the whole
"manually unrolled copy" is hopefully mostly a thing of the past and
we can consider it legacy)

             Linus
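
For reference, the features mentioned above are ordinary CPUID bits that are
easy to inspect from userspace: ERMS is CPUID.(EAX=7,ECX=0):EBX bit 9 and FSRM
is bit 4 of the same leaf's EDX, while "rep_good" is a Linux-defined synthetic
flag rather than a CPUID bit. A minimal check, assuming GCC/Clang's <cpuid.h>:

#include <cpuid.h>
#include <stdio.h>

int main(void)
{
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
		return 1;

	printf("ERMS (Enhanced REP MOVSB/STOSB): %u\n", (ebx >> 9) & 1);
	printf("FSRM (Fast Short REP MOV):       %u\n", (edx >> 4) & 1);
	return 0;
}

Of the parts that come up in this thread, the Intel E8400 used for the
benchmark below has neither bit and relies on rep_good alone, while Ice Lake
and Zen 3 report FSRM.
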
Matthew Wilcox April 15, 2022, 10:21 p.m. UTC | #2
On Fri, Apr 15, 2022 at 03:10:51PM -0700, Linus Torvalds wrote:
> On Thu, Apr 14, 2022 at 7:13 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > Revert shmem_file_read_iter() to using ZERO_PAGE for holes only when
> > iter_is_iovec(); in other cases, use the more natural iov_iter_zero()
> > instead of copy_page_to_iter().  We would use iov_iter_zero() throughout,
> > but the x86 clear_user() is not nearly so well optimized as copy to user
> > (dd of 1T sparse tmpfs file takes 57 seconds rather than 44 seconds).
> 
> Ugh.
> 
> I've applied this patch, but honestly, the proper course of action
> should just be to improve on clear_user().
> 
> If it really is important enough that we should care about that
> performance, then we just should fix clear_user().
> 
> It's a very odd special thing right now (at least on x86-64) using
> some strange handcrafted inline asm code.
> 
> I assume that 'rep stosb' is the fastest way to clear things on modern
> CPU's that have FSRM, and then we have the usual fallbacks (ie ERMS ->
> "rep stos" except for small areas, and probably that "store zeros by
> hand" for older CPUs).
> 
> Adding PeterZ and Borislav (who seem to be the last ones to have
> worked on the copy and clear_page stuff respectively) and the x86
> maintainers in case somebody gets the urge to just fix this.

Perhaps the x86 maintainers would like to start from
https://lore.kernel.org/lkml/20210523180423.108087-1-sneves@dei.uc.pt/
instead of pushing that work off on the submitter.
Hugh Dickins April 15, 2022, 10:41 p.m. UTC | #3
On Fri, 15 Apr 2022, Linus Torvalds wrote:
> On Thu, Apr 14, 2022 at 7:13 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > Revert shmem_file_read_iter() to using ZERO_PAGE for holes only when
> > iter_is_iovec(); in other cases, use the more natural iov_iter_zero()
> > instead of copy_page_to_iter().  We would use iov_iter_zero() throughout,
> > but the x86 clear_user() is not nearly so well optimized as copy to user
> > (dd of 1T sparse tmpfs file takes 57 seconds rather than 44 seconds).
> 
> Ugh.
> 
> I've applied this patch,

Phew, thanks.

> but honestly, the proper course of action
> should just be to improve on clear_user().

You'll find no disagreement here: we've all been saying the same.
It's just that that work is yet to be done (or yet to be accepted).

> 
> If it really is important enough that we should care about that
> performance, then we just should fix clear_user().
> 
> It's a very odd special thing right now (at least on x86-64) using
> some strange handcrafted inline asm code.
> 
> I assume that 'rep stosb' is the fastest way to clear things on modern
> CPU's that have FSRM, and then we have the usual fallbacks (ie ERMS ->
> "rep stos" except for small areas, and probably that "store zeros by
> hand" for older CPUs).
> 
> Adding PeterZ and Borislav (who seem to be the last ones to have
> worked on the copy and clear_page stuff respectively) and the x86
> maintainers in case somebody gets the urge to just fix this.

Yes, it was exactly Borislav and PeterZ whom I first approached too,
link 3 in the commit message of the patch that this one is fixing,
https://lore.kernel.org/lkml/2f5ca5e4-e250-a41c-11fb-a7f4ebc7e1c9@google.com/

Borislav wants a thorough good patch, and I don't blame him for that!

Hugh

> 
> Because memory clearing should be faster than copying, and the thing
> that makes copying fast is that FSRM and ERMS logic (the whole
> "manually unrolled copy" is hopefully mostly a thing of the past and
> we can consider it legacy)
> 
>              Linus
Mark Hemment April 16, 2022, 2:07 p.m. UTC | #4
On Sat, 16 Apr 2022, Borislav Petkov wrote:

> On Fri, Apr 15, 2022 at 03:10:51PM -0700, Linus Torvalds wrote:
> > Adding PeterZ and Borislav (who seem to be the last ones to have
> > worked on the copy and clear_page stuff respectively) and the x86
> > maintainers in case somebody gets the urge to just fix this.
> 
> I guess if enough people ask and keep asking, some people at least try
> to move...
> 
> > Because memory clearing should be faster than copying, and the thing
> > that makes copying fast is that FSRM and ERMS logic (the whole
> > "manually unrolled copy" is hopefully mostly a thing of the past and
> > we can consider it legacy)
> 
> So I did give it a look and it seems to me, if we want to do the
> alternatives thing here, it will have to look something like
> arch/x86/lib/copy_user_64.S.
> 
> I.e., the current __clear_user() will have to become the "handle_tail"
> thing there which deals with uncopied rest-bytes at the end and the new
> fsrm/erms/rep_good variants will then be alternative_call_2 or _3.
> 
> The fsrm thing will have only the handle_tail thing at the end when size
> != 0.
> 
> The others - erms and rep_good - will have to check for sizes smaller
> than, say a cacheline, and for those call the handle_tail thing directly
> instead of going into a REP loop. The current __clear_user() is still a
> lot better than that copy_user_generic_unrolled() abomination. And it's
> not like old CPUs would get any perf penalty - they'll simply use the
> same code.
> 
> And then you need the labels for _ASM_EXTABLE_UA() exception handling.
> 
> Anyway, something along those lines.
> 
> And then we'll need to benchmark this on a bunch of current machines to
> make sure there's no funny surprises, perf-wise.
> 
> I can get cracking on this but I would advise people not to hold their
> breaths. :)
> 
> Unless someone has a better idea or is itching to get hands dirty
> her-/himself.

I've done a skeleton implementation of alternative __clear_user() based on 
CPU features.
It has three versions of __clear_user();
o __clear_user_original() - similar to the 'standard' __clear_user()
o __clear_user_rep_good() - using rep stos{qb} when CPU has 'rep_good'
o __clear_user_erms() - using 'rep stosb' when CPU has 'erms'

Not claiming the implementation is ideal, but might be a useful starting 
point for someone.
Patch is against 5.18.0-rc2.
Only basic sanity testing done.

Simple performance testing done for large sizes, on a system (Intel E8400) 
which has rep_good but not erms;
# dd if=/dev/zero of=/dev/null bs=16384 count=10000
o *_original() - ~14.2GB/s.  Same as the 'standard' __clear_user().
o *_rep_good() - same throughput as *_original().
o *_erms()     - ~12.2GB/s (expected on a system without erms).

No performance testing done for zeroing small sizes.

Cheers,
Mark

Signed-off-by: Mark Hemment <markhemm@googlemail.com>
---
 arch/x86/include/asm/asm.h        |  39 +++++++++++++++
 arch/x86/include/asm/uaccess_64.h |  36 ++++++++++++++
 arch/x86/lib/clear_page_64.S      | 100 ++++++++++++++++++++++++++++++++++++++
 arch/x86/lib/usercopy_64.c        |  32 ------------
 4 files changed, 175 insertions(+), 32 deletions(-)

diff --git a/arch/x86/include/asm/asm.h b/arch/x86/include/asm/asm.h
index fbcfec4dc4cc..373ed6be7a8d 100644
--- a/arch/x86/include/asm/asm.h
+++ b/arch/x86/include/asm/asm.h
@@ -132,6 +132,35 @@
 /* Exception table entry */
 #ifdef __ASSEMBLY__
 
+# define UNDEFINE_EXTABLE_TYPE_REG \
+	.purgem extable_type_reg ;
+
+# define DEFINE_EXTABLE_TYPE_REG \
+	.macro extable_type_reg type:req reg:req ;			\
+	.set .Lfound, 0	;						\
+	.set .Lregnr, 0 ;						\
+	.irp rs,rax,rcx,rdx,rbx,rsp,rbp,rsi,rdi,r8,r9,r10,r11,r12,r13,	\
+	     r14,r15 ;							\
+	.ifc \reg, %\rs ;						\
+	.set .Lfound, .Lfound+1 ;					\
+	.long \type + (.Lregnr << 8) ;					\
+	.endif ;							\
+	.set .Lregnr, .Lregnr+1 ;					\
+	.endr ;								\
+	.set .Lregnr, 0 ;						\
+	.irp rs,eax,ecx,edx,ebx,esp,ebp,esi,edi,r8d,r9d,r10d,r11d,r12d, \
+	     r13d,r14d,r15d ;						\
+	.ifc \reg, %\rs ;						\
+	.set .Lfound, .Lfound+1 ;					\
+	.long \type + (.Lregnr << 8) ;					\
+	.endif ;							\
+	.set .Lregnr, .Lregnr+1 ;					\
+	.endr ;								\
+	.if (.Lfound != 1) ;						\
+	.error "extable_type_reg: bad register argument" ;		\
+	.endif ;							\
+	.endm ;
+
 # define _ASM_EXTABLE_TYPE(from, to, type)			\
 	.pushsection "__ex_table","a" ;				\
 	.balign 4 ;						\
@@ -140,6 +169,16 @@
 	.long type ;						\
 	.popsection
 
+# define _ASM_EXTABLE_TYPE_REG(from, to, type1, reg1)		\
+	.pushsection "__ex_table","a" ;				\
+	.balign 4 ;						\
+	.long (from) - . ;					\
+	.long (to) - . ;					\
+	DEFINE_EXTABLE_TYPE_REG					\
+	extable_type_reg reg=reg1, type=type1 ;			\
+	UNDEFINE_EXTABLE_TYPE_REG				\
+	.popsection
+
 # ifdef CONFIG_KPROBES
 #  define _ASM_NOKPROBE(entry)					\
 	.pushsection "_kprobe_blacklist","aw" ;			\
diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index 45697e04d771..6a4995e4cfae 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -79,4 +79,40 @@ __copy_from_user_flushcache(void *dst, const void __user *src, unsigned size)
 	kasan_check_write(dst, size);
 	return __copy_user_flushcache(dst, src, size);
 }
+
+/*
+ * Zero Userspace.
+ */
+
+__must_check unsigned long
+clear_user_original(void __user *addr, unsigned long len);
+__must_check unsigned long
+clear_user_rep_good(void __user *addr, unsigned long len);
+__must_check unsigned long
+clear_user_erms(void __user *addr, unsigned long len);
+
+static __always_inline __must_check unsigned long
+___clear_user(void __user *addr, unsigned long len)
+{
+	unsigned long	ret;
+
+	/*
+	 * No memory constraint because it doesn't change any memory gcc
+	 * knows about.
+	 */
+
+	might_fault();
+	alternative_call_2(
+		clear_user_original,
+		clear_user_rep_good,
+		X86_FEATURE_REP_GOOD,
+		clear_user_erms,
+		X86_FEATURE_ERMS,
+		ASM_OUTPUT2("=a" (ret), "=D" (addr), "=c" (len)),
+		"1" (addr), "2" (len)
+		: "%rdx", "cc");
+	return ret;
+}
+
+#define __clear_user(d, n)	___clear_user(d, n)
 #endif /* _ASM_X86_UACCESS_64_H */
diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
index fe59b8ac4fcc..abe1f44ea422 100644
--- a/arch/x86/lib/clear_page_64.S
+++ b/arch/x86/lib/clear_page_64.S
@@ -1,5 +1,7 @@
 /* SPDX-License-Identifier: GPL-2.0-only */
 #include <linux/linkage.h>
+#include <asm/asm.h>
+#include <asm/smap.h>
 #include <asm/export.h>
 
 /*
@@ -50,3 +52,101 @@ SYM_FUNC_START(clear_page_erms)
 	RET
 SYM_FUNC_END(clear_page_erms)
 EXPORT_SYMBOL_GPL(clear_page_erms)
+
+/*
+ * Default clear user-space.
+ * Input:
+ * rdi destination
+ * rcx count
+ *
+ * Output:
+ * rax uncopied bytes or 0 if successful.
+ */
+
+SYM_FUNC_START(clear_user_original)
+	ASM_STAC
+	movq %rcx,%rax
+	shrq $3,%rcx
+	andq $7,%rax
+	testq %rcx,%rcx
+	jz 1f
+
+	.p2align 4
+0:	movq $0,(%rdi)
+	leaq 8(%rdi),%rdi
+	decq %rcx
+	jnz   0b
+
+1:	movq %rax,%rcx
+	testq %rcx,%rcx
+	jz 3f
+
+2:	movb $0,(%rdi)
+	incq %rdi
+	decl %ecx
+	jnz  2b
+
+3:	ASM_CLAC
+	movq %rcx,%rax
+	RET
+
+	_ASM_EXTABLE_TYPE_REG(0b, 3b, EX_TYPE_UCOPY_LEN8, %rax)
+	_ASM_EXTABLE_UA(2b, 3b)
+SYM_FUNC_END(clear_user_original)
+EXPORT_SYMBOL(clear_user_original)
+
+/*
+ * Alternative clear user-space when CPU feature X86_FEATURE_REP_GOOD is
+ * present.
+ * Input:
+ * rdi destination
+ * rcx count
+ *
+ * Output:
+ * rax uncopied bytes or 0 if successful.
+ */
+
+SYM_FUNC_START(clear_user_rep_good)
+	ASM_STAC
+	movq %rcx,%rdx
+	xorq %rax,%rax
+	shrq $3,%rcx
+	andq $7,%rdx
+
+0:	rep stosq
+	movq %rdx,%rcx
+
+1:	rep stosb
+
+3:	ASM_CLAC
+	movq %rcx,%rax
+	RET
+
+	_ASM_EXTABLE_TYPE_REG(0b, 3b, EX_TYPE_UCOPY_LEN8, %rdx)
+	_ASM_EXTABLE_UA(1b, 3b)
+SYM_FUNC_END(clear_user_rep_good)
+EXPORT_SYMBOL(clear_user_rep_good)
+
+/*
+ * Alternative clear user-space when CPU feature X86_FEATURE_ERMS is present.
+ * Input:
+ * rdi destination
+ * rcx count
+ *
+ * Output:
+ * rax uncopied bytes or 0 if successful.
+ */
+
+SYM_FUNC_START(clear_user_erms)
+	xorq %rax,%rax
+	ASM_STAC
+
+0:	rep stosb
+
+3:	ASM_CLAC
+	movq %rcx,%rax
+	RET
+
+	_ASM_EXTABLE_UA(0b, 3b)
+SYM_FUNC_END(clear_user_erms)
+EXPORT_SYMBOL(clear_user_erms)
diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
index 0402a749f3a0..3a2872c9c4a9 100644
--- a/arch/x86/lib/usercopy_64.c
+++ b/arch/x86/lib/usercopy_64.c
@@ -14,38 +14,6 @@
  * Zero Userspace
  */
 
-unsigned long __clear_user(void __user *addr, unsigned long size)
-{
-	long __d0;
-	might_fault();
-	/* no memory constraint because it doesn't change any memory gcc knows
-	   about */
-	stac();
-	asm volatile(
-		"	testq  %[size8],%[size8]\n"
-		"	jz     4f\n"
-		"	.align 16\n"
-		"0:	movq $0,(%[dst])\n"
-		"	addq   $8,%[dst]\n"
-		"	decl %%ecx ; jnz   0b\n"
-		"4:	movq  %[size1],%%rcx\n"
-		"	testl %%ecx,%%ecx\n"
-		"	jz     2f\n"
-		"1:	movb   $0,(%[dst])\n"
-		"	incq   %[dst]\n"
-		"	decl %%ecx ; jnz  1b\n"
-		"2:\n"
-
-		_ASM_EXTABLE_TYPE_REG(0b, 2b, EX_TYPE_UCOPY_LEN8, %[size1])
-		_ASM_EXTABLE_UA(1b, 2b)
-
-		: [size8] "=&c"(size), [dst] "=&D" (__d0)
-		: [size1] "r"(size & 7), "[size8]" (size / 8), "[dst]"(addr));
-	clac();
-	return size;
-}
-EXPORT_SYMBOL(__clear_user);
-
 unsigned long clear_user(void __user *to, unsigned long n)
 {
 	if (access_ok(to, n))
Linus Torvalds April 16, 2022, 5:42 p.m. UTC | #5
On Sat, Apr 16, 2022 at 10:28 AM Borislav Petkov <bp@alien8.de> wrote:
>
> you also need a _fsrm() one which checks X86_FEATURE_FSRM. That one
> should simply do rep; stosb regardless of the size. For that you can
> define an alternative_call_3 similar to how the _2 variant is defined.

Honestly, my personal preference would be that with FSRM, we'd have an
alternative that looks something like

    asm volatile(
        "1:"
        ALTERNATIVE("call __stosb_user", "rep movsb", X86_FEATURE_FSRM)
        "2:"
       _ASM_EXTABLE_UA(1b, 2b)
        :"=c" (count), "=D" (dest),ASM_CALL_CONSTRAINT
        :"0" (count), "1" (dest), "a" (0)
        :"memory");

iow, the 'rep stosb' case would be inline.

Note that the above would have a few things to look out for:

 - special 'stosb' calling convention:

     %rax/%rcx/%rdx as inputs
     %rcx as "bytes not copied" return value
     %rdi can be clobbered

   so the actual functions would look a bit odd and would need to
save/restore some registers, but they'd basically just emulate "rep
stosb".

 - since the whole point is that the "rep stosb" is inlined, it also
means that the "call __stosb_user" is done within the STAC/CLAC
region, so objtool would have to be taught that's ok

but wouldn't it be lovely if we could start moving towards a model
where we can just inline 'memset' and 'memcpy' like this?

NOTE! The above asm has not been tested. I wrote it in this email. I'm
sure I messed something up.

               Linus
Linus Torvalds April 17, 2022, 8:56 p.m. UTC | #6
On Sun, Apr 17, 2022 at 12:41 PM Borislav Petkov <bp@alien8.de> wrote:
>
> Anyway, more playing with this later to make sure it really does what it
> should.

I think the special calling conventions have tripped you up:

>  SYM_FUNC_START(clear_user_original)
> -       ASM_STAC
>         movq %rcx,%rax
>         shrq $3,%rcx
>         andq $7,%rax
> @@ -86,7 +84,7 @@ SYM_FUNC_START(clear_user_original)
>         decl %ecx
>         jnz  2b
>
> -3:     ASM_CLAC
> +3:
>         movq %rcx,%rax
>         RET

That 'movq %rcx,%rax' can't be right. The caller expects it to be zero
on input and stay zero on output.

But I think "xorl %eax,%eax" is good, since %eax was used as a
temporary in that function.

And the comment above the function should be fixed too.

>  SYM_FUNC_START(clear_user_rep_good)
> -       ASM_STAC
>         movq %rcx,%rdx
> -       xorq %rax,%rax
>         shrq $3,%rcx
>         andq $7,%rdx
>
> @@ -118,7 +113,7 @@ SYM_FUNC_START(clear_user_rep_good)
>
>  1:     rep stosb
>
> -3:     ASM_CLAC
> +3:
>         movq %rcx,%rax
>         RET

Same issue here.

Probably nothing notices, since %rcx *does* end up containing the
right value, and it's _likely_ that the compiler doesn't actually end
up re-using the zero value in %rax after the inline asm (that this bug
has corrupted), but if the compiler ever goes "Oh, I put zero in %rax,
so I'll just use that afterwards", this is going to blow up very
spectacularly and be very hard to debug.

> @@ -135,15 +130,13 @@ EXPORT_SYMBOL(clear_user_rep_good)
>   *
>   * Output:
>   * rax uncopied bytes or 0 if successful.
> + *
> + * XXX: check for small sizes and call the original version.
> + * Benchmark it first though.
>   */
> -
>  SYM_FUNC_START(clear_user_erms)
> -       xorq %rax,%rax
> -       ASM_STAC
> -
>  0:     rep stosb
> -
> -3:     ASM_CLAC
> +3:
>         movq %rcx,%rax
>         RET

.. and one more time.

Also, I do think that the rep_good and erms cases should probably
check for small copies, and use the clear_user_original thing for %rcx
< 64 or something like that.

It's what we do on the memcpy side - and more importantly, it's the
only difference between "erms" and FSRM. If the ERMS code doesn't do
anything different for small copies, why have it at all?

But other than these small issues, it looks good to me.

                  Linus
Linus Torvalds April 18, 2022, 5:10 p.m. UTC | #7
On Mon, Apr 18, 2022 at 3:15 AM Borislav Petkov <bp@alien8.de> wrote:
>
> Yah, wanted to singlestep that whole asm anyway to make sure it is good.
> And just started going through it - I think it can be even optimized a
> bit to use %rax for the rest bytes and decrement it into 0 eventually.

Ugh. If you do this, you need to have a big comment about how that
%rcx value gets fixed up with EX_TYPE_UCOPY_LEN (which basically ends
up doing "%rcx = %rcx*8+%rax" in ex_handler_ucopy_len() for the
exception case).

Plus you need to explain the xorl here:

> 3:
>         xorl %eax,%eax
>         RET

because with your re-written function it *looks* like %eax is already
zero, so - once again - this is about how the exception cases work and
get here magically.

So that needs some big comment about "if an exception happens, we jump
to label '3', and the exception handler will fix up %rcx, but we'll
need to clear %rax".

Or something like that.

But yes, that does look like it will work, it's just really subtle how
%rcx is zero for the 'rest bytes', and %rax is the byte count
remaining in addition.

                  Linus
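
To spell out the arithmetic being discussed: at the point of a fault in the
qword loop, %rcx holds the qwords still to be cleared and %rax the trailing
byte count, and the EX_TYPE_UCOPY_LEN8 fixup folds them back into the single
"bytes not cleared" value the caller expects. A purely illustrative
restatement in C, not the kernel's exception-handler code:

/* e.g. 3 qwords and 5 trailing bytes left => 29 bytes uncleared */
static unsigned long uncleared_bytes(unsigned long qwords_left,
				     unsigned long rest_bytes)
{
	return qwords_left * 8 + rest_bytes;
}
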
Linus Torvalds April 19, 2022, 4:41 p.m. UTC | #8
On Tue, Apr 19, 2022 at 2:17 AM Borislav Petkov <bp@alien8.de> wrote:
>
> Yap, and I reused your text and expanded it. You made me look at that
> crazy DEFINE_EXTABLE_TYPE_REG macro finally so that I know what it does
> in detail.
>
> So I have the below now, it boots in the guest so it must be perfect.

This looks fine to me.

Although honestly, I'd be even happier without those fancy exception
table tricks. I actually think things would be more legible if we had
explicit error return points that did the

err8:
        shrq $3,%rcx
        addq %rax,%rcx
err1:
        xorl %eax,%eax
        RET

things explicitly.

That's perhaps especially true since this whole thing now added a new
- and even more complex - error case with that _ASM_EXTABLE_TYPE_REG.

But I'm ok with the complex version too, I guess.

                   Linus
Linus Torvalds April 21, 2022, 5:22 p.m. UTC | #9
On Thu, Apr 21, 2022 at 9:50 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> where that "read(..) = 16" is the important part. It correctly figured
> out that it can only do 16 bytes (ok, 17, but we've always allowed the
> user accessor functions to block).

Bad choice of words - by "block" I meant "doing the accesses in
blocks", in this case 64-bit words.

Obviously the user accessors _also_ "block" in the sense of having to
wait for page faults and IO.

I think 'copy_{to,from}_user()' actually does go to the effort to try
to do byte-exact results, though.

In particular, see copy_user_handle_tail in arch/x86/lib/copy_user_64.S.

But I think that we long ago ended up deciding it really wasn't worth
doing it, and x86 ends up just going to unnecessary lengths for this
case.

               Linus
Linus Torvalds April 24, 2022, 7:54 p.m. UTC | #10
On Sun, Apr 24, 2022 at 12:37 PM Borislav Petkov <bp@alien8.de> wrote:
>
> You could give me some more details but AFAIU, you mean, that
> fallback to byte-sized reads is unnecessary and I can get rid of
> copy_user_handle_tail? Because that would be a nice cleanup.

Yeah, we already don't have that fallback in many other places.

For example: traditionally we implemented fixed-size small
copy_to/from_user() with get/put_user().

We don't do that any more, but see commit 4b842e4e25b1 ("x86: get rid
of small constant size cases in raw_copy_{to,from}_user()") and
realize how it historically never did the byte-loop fallback.

The "clear_user()" case is another example of something that was never
byte-exact.

And last time we discussed this, Al was looking at making it
byte-exact, and I'm pretty sure he noted that other architectures
already didn't always do it.

Let me go try to find it.

> Anyway, I ran your short prog and it all looks like you predicted it:

Well, this actually shows a bug.

> fsrm:
> ----
> read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 65536) = 17

The above is good.

> erms:
> -----
> read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 65536) = 17

This is good too: erms should do small copies with the expanded case,
but since it *thinks* it's a large copy, it will just use "rep movsb"
and be byte-exact.

> rep_good:
> ---------
> read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 65536) = 16

This is good: it starts off with "rep movsq", does two iterations, and
then fails on the third word, so it only succeeds for 16 bytes.

> original:
> ---------
> read(3, strace: umoven: short read (17 < 33) @0x7f3ff61d5fef
> 0x7f3ff61d5fef, 65536)          = 3586

This is broken.

And I'm *not* talking about the strace warning. The strace warning is
actually just a *result* of the kernel bug.

Look at that return value. It returns '3586'. That's just pure garbage.

So that 'original' routine simply returns the wrong value.

I suspect it's a %rax vs %rcx confusion again, but with your "patch on
top of earlier patch" I didn't go and sort it out.

                   Linus
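
The "short prog" driving these results is not included in the thread; a
minimal reconstruction that produces this kind of strace output is sketched
below. The choice of /dev/zero, whose read() path ends up in clear_user(),
and the 17-byte offset before an unmapped page are assumptions, not details
taken from the thread.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	long pg = sysconf(_SC_PAGESIZE);
	char *map;
	int fd;

	/* Two anonymous pages; make the second one inaccessible so the
	 * kernel faults partway through zeroing the user buffer. */
	map = mmap(NULL, 2 * pg, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (map == MAP_FAILED || mprotect(map + pg, pg, PROT_NONE))
		return 1;

	fd = open("/dev/zero", O_RDONLY);
	if (fd < 0)
		return 1;

	/* Buffer starts 17 bytes before the protected page, as in the
	 * strace output above. */
	printf("read() = %zd\n", read(fd, map + pg - 17, 65536));
	return 0;
}

Run under strace, a byte-exact clear_user() gives read() = 17, a
qword-granular one gives 16, and anything else (like the 3586 above) points
at a broken return value.
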
Linus Torvalds April 24, 2022, 8:24 p.m. UTC | #11
On Sun, Apr 24, 2022 at 12:54 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> And last time we discussed this, Al was looking at making it
> byte-exact, and I'm pretty sure he noted that other architectures
> already didn't always do it.
>
> Let me go try to find it.

Hmnm. I may have mis-remembered the details. The thread I was thinking
of was this:

   https://lore.kernel.org/all/20200719031733.GI2786714@ZenIV.linux.org.uk/

and while Al was arguing for not enforcing the exact byte count, he
still suggested that it must make *some* progress. But note the whole

 "There are two problems with that.  First of all, the promise was
  bogus - there are architectures where it is simply not true.  E.g. ppc
  (or alpha, or sparc, or...)  can have copy_from_user() called with source
  one word prior to an unmapped page, fetch that word, fail to fetch the next
  one and bugger off without doing any stores"

ie it's simply never been the case in general, and Al mentions ppc,
alpha and sparc as examples of architectures where it has not been
true.

(arm and arm64, btw, do seem to have the "try harder" byte copy loop
at the end, like x86 does).

And that's when I argued that we should just accept that the byte
exact behavior simply has never been reality, and we shouldn't even
try to make it be reality.

NOTE! We almost certainly do want to have some limit of how much off
we can be, though. I do *not* think we can just unroll the loop a ton,
and say "hey, we're doing copies in chunks of 16 words, so now we're
off by up to 128 bytes".

I'd suggest making it clear that being "off" by a word is fine,
because that's the natural thing for any architecture that needs to do
a "load low/high word" followed by "store aligned word" due to not
handling unaligned faults well (eg the whole traditional RISC thing).

And yes, I think it's actually somewhat detrimental to our test
coverage that x86 does the byte-exact thing, because it means that
*if* we have any code that depends on it, it will just happen to work
on x86, but then fail on architectures that don't get nearly the same
test coverage.

              Linus
Linus Torvalds April 27, 2022, 1:29 a.m. UTC | #12
On Tue, Apr 26, 2022 at 5:14 PM Borislav Petkov <bp@alien8.de> wrote:
>
> So when we enter the function, we shift %rcx to get the number of
> qword-sized quantities to zero:
>
> SYM_FUNC_START(clear_user_original)
>         mov %rcx,%rax
>         shr $3,%rcx             # qwords        <---

Yes.

But that's what we do for "rep stosq" too, for all the same reasons.

> but when we encounter the fault here, we return *%rcx* - not %rcx << 3
> - latter being the *bytes* leftover which we *actually* need to return
> when we encounter the #PF.

Yes, but:

> So, we need to shift back when we fault during the qword-sized zeroing,
> i.e., full function below, see label 3 there.

No.

The problem is that you're using the wrong exception type.

Thanks for posting the whole thing, because that makes it much more obvious.

You have the exception table entries switched.

You should have

         _ASM_EXTABLE_TYPE_REG(0b, 3b, EX_TYPE_UCOPY_LEN8, %rax)
        _ASM_EXTABLE_UA(2b, 3b)

and not need that label '4' at all.

Note how that "_ASM_EXTABLE_TYPE_REG" thing is literally designed to do

   %rcx = 8*%rcx+%rax

in the exception handler.

Of course, to get that regular

        _ASM_EXTABLE_UA(2b, 3b)

to work, you need to have the final byte count in %rcx, not in %rax so
that means that the "now do the rest of the bytes" case should have
done something like

        movl %eax,%ecx
2:      movb $0,(%rdi)
        inc %rdi
        decl %ecx
        jnz  2b

instead.

Yeah, yeah, you could also use that _ASM_EXTABLE_TYPE_REG thing for
the second exception point, and keep %rcx as zero, and keep it in
%eax, and depend on that whole "%rcx = 8*%rcx+%rax" fixing it up, and
knowing that if an exception does *not* happen, %rcx will be zero from
the word-size loop.

But that really seems much too subtle for me - why not just keep
things in %rcx, and try to make this look as much as possible like the
"rep stosq + rep stosb" case?

And finally: I still think that those fancy exception table things are
*much* too fancy, and much too subtle, and much too complicated.

So I'd actually prefer to get rid of them entirely, and make the code
use the "no register changes" exception, and make the exception
handler do a proper site-specific fixup. At that point, you can get
rid of all the "mask bits early" logic, get rid of all the extraneous
'test' instructions, and make it all look something like below.

NOTE! I've intentionally kept the %eax thing using 32-bit instructions
- smaller encoding, and only the low three bits matter, so why
move/mask full 64 bits?

NOTE2! Entirely untested. But I tried to make the normal code do
minimal work, and then fix things up in the exception case more. So it
just keeps the original count in the 32 bits in %eax until it wants to
test it, and then uses the 'andl' to both mask and test. And the
exception case knows that, so it masks there too.

I dunno. But I really think that whole new _ASM_EXTABLE_TYPE_REG and
EX_TYPE_UCOPY_LEN8 was unnecessary.

               Linus

SYM_FUNC_START(clear_user_original)
        movl %ecx,%eax
        shrq $3,%rcx             # qwords
        jz .Lrest_bytes

        # do the qwords first
        .p2align 4
.Lqwords:
        movq $0,(%rdi)
        lea 8(%rdi),%rdi
        dec %rcx
        jnz .Lqwords

.Lrest_bytes:
        andl $7,%eax             # rest bytes
        jz .Lexit

        # now do the rest bytes
.Lbytes:
        movb $0,(%rdi)
        inc %rdi
        decl %eax
        jnz .Lbytes

.Lexit:
        xorl %eax,%eax
        RET

.Lqwords_exception:
        # convert qwords back into bytes to return to caller
        shlq $3, %rcx
        andl $7, %eax
        addq %rax,%rcx
        jmp .Lexit

.Lbytes_exception:
        movl %eax,%ecx
        jmp .Lexit

        _ASM_EXTABLE_UA(.Lqwords, .Lqwords_exception)
        _ASM_EXTABLE_UA(.Lbytes, .Lbytes_exception)
Linus Torvalds April 27, 2022, 4 p.m. UTC | #13
On Wed, Apr 27, 2022 at 3:41 AM Borislav Petkov <bp@alien8.de> wrote:
>
> Any particular reason for the explicit "q" suffix? The register already
> denotes the opsize and the generated opcode is the same.

No, I just added the 'q' prefix to distinguish it from the 'l'
instruction working on the same register one line earlier.

And then I wasn't consistent throughout, because it was really just
about me thinking about that "here we can use 32-bit ops, and here we
are working with the full possible 64-bit range".

So take that code as what it is - an untested and slightly edited
version of the code you sent me, meant for "how about like this"
purposes rather than anything else.

                   Linus
Linus Torvalds May 4, 2022, 7:22 p.m. UTC | #14
On Wed, May 4, 2022 at 11:56 AM Borislav Petkov <bp@alien8.de> wrote:
>
> Just to update folks here: I haven't forgotten about this - Mel and I
> are running some benchmarks first and staring at results to see whether
> all the hoopla is even worth it.

Side note: the "do FSRM inline" would likely be a really good thing
for "copy_to_user()", more so than the silly "clear_user()" that we
realistically do almost nowhere.

I doubt you can find "clear_user()" outside of benchmarks (but hey,
people do odd things).

But "copy_to_user()" is everywhere, and the I$ advantage of inlining
it might be noticeable on some real loads.

I remember some git profiles having copy_to_user very high due to
fstat(), for example - cp_new_stat64 and friends.

Of course, I haven't profiled git in ages, but I doubt that has
changed. Many of those kinds of loads are all about name lookup and
stat (basic things like "make" would be that too, if it weren't for
the fact that it spends a _lot_ of its time in user space string
handling).

The inlining advantage would obviously only show up on CPUs that
actually do FSRM. Which I think is currently only Ice Lake. I don't
have access to one.

                Linus
Linus Torvalds May 4, 2022, 8:40 p.m. UTC | #15
On Wed, May 4, 2022 at 1:18 PM Borislav Petkov <bp@alien8.de> wrote:
>
> Zen3 has FSRM.

Sadly, I'm on Zen2 with my 3970X, and the Zen 3 threadrippers seem to
be basically impossible to get.

> So below's the git test suite with clear_user on Zen3. It creates a lot
> of processes so we get to clear_user a bunch and that's the inlined rep
> movsb.

Oh, the clear_user() in the ELF loader? I wouldn't have expected that
to be noticeable.

Now, clear_page() is *very* noticeable, but that has its own special
magic and doesn't use clear_user().

Maybe there is some other clear_user() case I never thought of. My dim
memories of profiles definitely had copy_to_user, clear_page and
copy_page being some of the top copy loops.

                   Linus
Linus Torvalds May 4, 2022, 9:09 p.m. UTC | #16
On Wed, May 4, 2022 at 2:01 PM Borislav Petkov <bp@alien8.de> wrote:
>
> I could try to do a perf probe or whatever fancy new thing we do now on
> clear_user to get some numbers of how many times it gets called during
> the benchmark run. Or do you wanna know the callers too?

One of the non-performance reasons I like inlined memcpy is actually
that when you do a regular 'perf record' run, the cost of the memcpy
gets associated with the call-site.

Which is universally what I want for those things. I used to love our
inlined spinlocks for the same reason back when we did them.

Yeah, yeah, you can do it with callchain magic, but then you get it
all - and I really consider memcpy/memset to be a special case.
Normally I want the "oh, that leaf function is expensive", but not for
memcpy and memset (and not for spinlocks, but we'll never go back to
the old trivial spinlocks)

I don't tend to particularly care about "how many times has this been
called" kind of trace profiles. It's the actual expense in CPU cycles
I tend to care about.

That said, I cared deeply about those kinds of CPU profiles when I was
working with Al on the RCU path lookup code and looking for where the
problem spots were.

That was years ago.

I haven't really done serious profiling work for a while (which is
just as well, because it's one of the things that went backwards when
I switched to the Zen 2 threadripper for my main machine)

                  Linus
Linus Torvalds May 10, 2022, 5:17 p.m. UTC | #17
On Tue, May 10, 2022 at 2:31 AM Borislav Petkov <bp@alien8.de> wrote:
>
> > I haven't really done serious profiling work for a while (which is
> > just as well, because it's one of the things that went backwards when
> > I switched to the Zen 2 threadripper for my main machine)
>
> Because of the not as advanced perf support there? Any pain points I can
> forward?

It's not anything fancy, and it's not anything new - you've been cc'd
on me talking about it before.

As mentioned, I don't actually do anything fancy with profiling - I
basically almost always just want to do a simple

    perf record -e cycles:pp

so that I get reasonable instruction attribution for what the
expensive part actually is (where "actually is" is obviously always
just an approximation, I'm not claiming anything else - but I just
don't want to have to try to figure out some huge instruction skid
issue). And then (because I only tend to care about the kernel, and
don't care about _who_ is doing things), I just do

    perf report --sort=dso,symbol

and start looking at the kernel side of things. I then occasionally
enable -g, but I hate doing it, and it's usually because I see "oh
damn, some spinlock slowpath, let's see what the callers are" just to
figure out which spinlock it is.

VERY rudimentary, in other words. It's the "I don't know where the
time is going, so let's find out".

And that simple thing doesn't work, because Zen 2 doesn't like the
per-thread profiling. So I can be root and use '-a', and it works
fine, except I don't want to do things as root just for profiling.
Plus I don't actually want to see the ACPI "CPU idle" things.

I'm told the issue is that Zen 2 is not stable with IBS (aka "PEBS for
AMD"), which is all kinds of sad, but there it is.

As also mentioned, it's not actually a huge deal for me, because all I
do is read email and do "git pull". And the times when I used
profiling to find things git could need improvement on (usually
pathname lookup for "git diff") are long long gone.

               Linus
Linus Torvalds May 10, 2022, 5:28 p.m. UTC | #18
On Tue, May 10, 2022 at 2:31 AM Borislav Petkov <bp@alien8.de> wrote:
>
> clear_user_original:
> Amean: 9219.71 (Sum: 6340154910, samples: 687674)
>
> fsrm:
> Amean: 8030.63 (Sum: 5522277720, samples: 687652)

Well, that's pretty conclusive.

I'm obviously very happy with fsrm. I've been pushing for that thing
for probably over two decades by now, because I absolutely detest
uarch optimizations for memset/memcpy that can never be done well in
software anyway (because it depends not just on cache organization,
but on cache sizes and dynamic cache hit/miss behavior of the load).

And one of the things I always wanted to do was to just have
memcpy/memset entirely inlined.

In fact, if you go back to the 0.01 linux kernel sources, you'll see
that they only compile with my bastardized version of gcc-1.40,
because I made the compiler inline those things with 'rep movs/stos',
and there was no other implementation of memcpy/memset at all.

That was a bit optimistic at the time, but here we are, 30+ years
later and it is finally looking possible, at least on some uarchs.

               Linus

Patch

--- a/mm/filemap.c~tmpfs-fix-regressions-from-wider-use-of-zero_page
+++ a/mm/filemap.c
@@ -1063,12 +1063,6 @@  void __init pagecache_init(void)
 		init_waitqueue_head(&folio_wait_table[i]);
 
 	page_writeback_init();
-
-	/*
-	 * tmpfs uses the ZERO_PAGE for reading holes: it is up-to-date,
-	 * and splice's page_cache_pipe_buf_confirm() needs to see that.
-	 */
-	SetPageUptodate(ZERO_PAGE(0));
 }
 
 /*
--- a/mm/shmem.c~tmpfs-fix-regressions-from-wider-use-of-zero_page
+++ a/mm/shmem.c
@@ -2513,7 +2513,6 @@  static ssize_t shmem_file_read_iter(stru
 		pgoff_t end_index;
 		unsigned long nr, ret;
 		loff_t i_size = i_size_read(inode);
-		bool got_page;
 
 		end_index = i_size >> PAGE_SHIFT;
 		if (index > end_index)
@@ -2570,24 +2569,34 @@  static ssize_t shmem_file_read_iter(stru
 			 */
 			if (!offset)
 				mark_page_accessed(page);
-			got_page = true;
+			/*
+			 * Ok, we have the page, and it's up-to-date, so
+			 * now we can copy it to user space...
+			 */
+			ret = copy_page_to_iter(page, offset, nr, to);
+			put_page(page);
+
+		} else if (iter_is_iovec(to)) {
+			/*
+			 * Copy to user tends to be so well optimized, but
+			 * clear_user() not so much, that it is noticeably
+			 * faster to copy the zero page instead of clearing.
+			 */
+			ret = copy_page_to_iter(ZERO_PAGE(0), offset, nr, to);
 		} else {
-			page = ZERO_PAGE(0);
-			got_page = false;
+			/*
+			 * But submitting the same page twice in a row to
+			 * splice() - or others? - can result in confusion:
+			 * so don't attempt that optimization on pipes etc.
+			 */
+			ret = iov_iter_zero(nr, to);
 		}
 
-		/*
-		 * Ok, we have the page, and it's up-to-date, so
-		 * now we can copy it to user space...
-		 */
-		ret = copy_page_to_iter(page, offset, nr, to);
 		retval += ret;
 		offset += ret;
 		index += offset >> PAGE_SHIFT;
 		offset &= ~PAGE_MASK;
 
-		if (got_page)
-			put_page(page);
 		if (!iov_iter_count(to))
 			break;
 		if (ret < nr) {