
[v7,05/19] iov_iter: Introduce fault_in_iov_iter_writeable

Message ID 20210827164926.1726765-6-agruenba@redhat.com (mailing list archive)
State New, archived
Series gfs2: Fix mmap + page fault deadlocks

Commit Message

Andreas Gruenbacher Aug. 27, 2021, 4:49 p.m. UTC
Introduce a new fault_in_iov_iter_writeable helper for safely faulting
in an iterator for writing.  Uses get_user_pages() to fault in the pages
without actually writing to them, which would be destructive.

We'll use fault_in_iov_iter_writeable in gfs2 once we've determined that
the iterator passed to .read_iter isn't in memory.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 include/linux/pagemap.h |  1 +
 include/linux/uio.h     |  1 +
 lib/iov_iter.c          | 39 +++++++++++++++++++++++++
 mm/gup.c                | 63 +++++++++++++++++++++++++++++++++++++++++
 4 files changed, 104 insertions(+)
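
The patch body itself is not reproduced on this page. As a rough, illustrative
sketch of the approach the commit message describes (helper name, GUP flavour
and return convention here are assumptions, not the actual patch):

/*
 * Illustrative sketch only: fault a user range in for writing by pinning
 * the pages with FOLL_WRITE, so nothing is stored into what may be the
 * destination of a read.  Returns the number of bytes not faulted in,
 * 0 on complete success.
 */
static size_t fault_in_user_writeable_sketch(unsigned long start, size_t len)
{
	unsigned long end = start + len;

	while (start < end) {
		struct page *page;

		if (get_user_pages_fast(start, 1, FOLL_WRITE, &page) <= 0)
			return end - start;	/* rest of the range not faulted in */
		put_page(page);
		start = (start & PAGE_MASK) + PAGE_SIZE;
	}
	return 0;
}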

Comments

Al Viro Aug. 27, 2021, 6:49 p.m. UTC | #1
On Fri, Aug 27, 2021 at 06:49:12PM +0200, Andreas Gruenbacher wrote:
> Introduce a new fault_in_iov_iter_writeable helper for safely faulting
> in an iterator for writing.  Uses get_user_pages() to fault in the pages
> without actually writing to them, which would be destructive.
> 
> We'll use fault_in_iov_iter_writeable in gfs2 once we've determined that
> the iterator passed to .read_iter isn't in memory.

Again, the calling conventions are wrong.  Make it success/failure or
0/-EFAULT.  And it's inconsistent for iovec and non-iovec cases as it is.
Linus Torvalds Aug. 27, 2021, 7:05 p.m. UTC | #2
On Fri, Aug 27, 2021 at 11:52 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Again, the calling conventions are wrong.  Make it success/failure or
> 0/-EFAULT.  And it's inconsistent for iovec and non-iovec cases as it is.

Al, the 0/-EFAULT thing DOES NOT WORK.

The whole "success vs failure" model is broken.

Because "success" for some people is "everything worked".

And for other people it is "at least _part_ of it worked".

So no, 0/-EFAULT fundamentally cannot work, because the return needs
to be able to handle that ternary situation (ie "nothing" vs
"something" vs "everything").

This is *literally* the exact same thing that we have for
copy_to/from_user(). And Andreas' solution (based on my suggestion) is
the exact same one that we have had for that code since basically day
#1.

The whole "0/-EFAULT" is simpler, yes. And it's what
"{get|put}_user()" uses, yes. And it's more common to a lot of other
functions that return zero or an error.

But see above. People *need* that ternary result, and "bytes/pages
uncopied" is not only the traditional one we use elsewhere in similar
situations, it's the one that has the easiest error tests for existing
users (because zero remains "everything worked").

Andreas originally had that "how many bytes/pages succeeded" return
value instead, and yes, that's also ternary. But it means that now the
common "complete success" test ends up being a lot uglier, and the
semantics of the function changes completely where "0" no longer means
success, and that messes up much more.
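
A minimal sketch of the resulting caller pattern (hypothetical helper and
caller names, purely to illustrate the convention being argued for):

/*
 * With a copy_to/from_user()-style "bytes not faulted in" return,
 * all three outcomes are distinguishable and complete success is
 * still a plain zero check.
 */
static ssize_t usable_bytes(char __user *ubuf, size_t len)
{
	size_t not_faulted = fault_in_writeable(ubuf, len);	/* hypothetical */

	if (not_faulted == len)
		return -EFAULT;		/* "nothing": no progress is possible */
	return len - not_faulted;	/* "something" ... "everything" (== len) */
}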

So I really think you are barking entirely up the wrong tree.

If there is any inconsistency, maybe we should make _more_ cases use
that "how many bytes/pages not copied" logic, but in a lot of cases
you don't actually need the ternary decision value.

So the inconsistency is EXACTLY the same as the one we have always had
for get|put_user() vs copy_to|from_user(), and it exists for the EXACT
same reason.

IOW, please explain how you'd solve the ternary problem without making
the code a lot uglier.

              Linus
Al Viro Aug. 27, 2021, 7:23 p.m. UTC | #3
On Fri, Aug 27, 2021 at 12:05:32PM -0700, Linus Torvalds wrote:

> But see above. People *need* that ternary result, and "bytes/pages
> uncopied" is not only the traditional one we use elsewhere in similar
> situations, it's the one that has the easiest error tests for existing
> users (because zero remains "everything worked").

Could you show the cases where "partial copy, so it's OK" behaviour would
break anything?  

For that you would need the case where
	* partial fault-in is currently rejected by the check
	* checks downstream from there (for failing copy-in/copy-out) would
be either missing or would not be handled correctly in case of partial
fault-in or would slow a fast path down.

I don't see any such cases and I would be very surprised if such existed.
If you see any, please describe them - I could be wrong.  And I would
like to take a good look at any such case and see how well it handles
a possible short copy after full fault-in.
Linus Torvalds Aug. 27, 2021, 7:33 p.m. UTC | #4
On Fri, Aug 27, 2021 at 12:23 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Could you show the cases where "partial copy, so it's OK" behaviour would
> break anything?

Absolutely.

For example, it would cause an infinite loop in
restore_fpregs_from_user() if the "buf" argument is a situation where
the first page is fine, but the next page is not.

Why? Because __restore_fpregs_from_user() would take a fault, but then
fault_in_pages_readable() (renamed) would succeed, so you'd just do
that "retry" forever and ever.

Probably there are a number of other places too. That was literally
the *first* place I looked at.
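
Paraphrased shape of that path (not the actual arch/x86/kernel/fpu/signal.c
code; helper names are illustrative):

/*
 * __restore_sketch() faults on the second page of buf and can never
 * succeed.  If the fault-in helper treated "the first page is fine" as
 * success, the retry below would spin forever without making progress.
 */
static int restore_sketch(const char __user *buf, size_t size)
{
	int ret;

retry:
	ret = __restore_sketch(buf);		/* hypothetical, faults on page 2 */
	if (ret == -EFAULT) {
		if (!fault_in_pages_readable(buf, size))	/* partial counted as success */
			goto retry;				/* ...so no progress, ever */
		return -EFAULT;
	}
	return ret;
}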

Seriously. The current semantics are "check the whole area".

THOSE MUST NOT CHANGE.

              Linus
Al Viro Aug. 27, 2021, 7:37 p.m. UTC | #5
On Fri, Aug 27, 2021 at 12:33:00PM -0700, Linus Torvalds wrote:
> On Fri, Aug 27, 2021 at 12:23 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > Could you show the cases where "partial copy, so it's OK" behaviour would
> > break anything?
> 
> Absolutely.
> 
> For example, it would cause an infinite loop in
> restore_fpregs_from_user() if the "buf" argument is a situation where
> the first page is fine, but the next page is not.
> 
> Why? Because __restore_fpregs_from_user() would take a fault, but then
> fault_in_pages_readable() (renamed) would succeed, so you'd just do
> that "retry" forever and ever.
> 
> Probably there are a number of other places too. That was literally
> the *first* place I looked at.

OK...

Let me dig out the notes from the last time I looked through that area
and grep around a bit.  Should be about an hour or two.
Al Viro Aug. 27, 2021, 9:48 p.m. UTC | #6
On Fri, Aug 27, 2021 at 07:37:25PM +0000, Al Viro wrote:
> On Fri, Aug 27, 2021 at 12:33:00PM -0700, Linus Torvalds wrote:
> > On Fri, Aug 27, 2021 at 12:23 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> > >
> > > Could you show the cases where "partial copy, so it's OK" behaviour would
> > > break anything?
> > 
> > Absolutely.
> > 
> > For example, it would cause an infinite loop in
> > restore_fpregs_from_user() if the "buf" argument is a situation where
> > the first page is fine, but the next page is not.
> > 
> > Why? Because __restore_fpregs_from_user() would take a fault, but then
> > fault_in_pages_readable() (renamed) would succeed, so you'd just do
> > that "retry" forever and ever.
> > 
> > Probably there are a number of other places too. That was literally
> > the *first* place I looked at.
> 
> OK...
> 
> Let me dig out the notes from the last time I looked through that area
> and grep around a bit.  Should be about an hour or two.

OK, I've dug it out and rechecked the current mainline.

Call trees:

fault_in_pages_readable()
	kvm_use_magic_page()

Broken, as per mpe.  Relevant part (see <87eeeqa7ng.fsf@mpe.ellerman.id.au> in
your mailbox back in early May for the full story):
|The current code is confused, ie. broken.
...
|We want to check that the mapping succeeded, that the address is
|readable (& writeable as well actually).
...
|diff --git a/arch/powerpc/kernel/kvm.c b/arch/powerpc/kernel/kvm.c
...
|-       if (!fault_in_pages_readable((const char *)KVM_MAGIC_PAGE, sizeof(u32))) {
|+       if (get_kernel_nofault(c, (const char *)KVM_MAGIC_PAGE)) {

	[ppc32]swapcontext()
	[ppc32]debug_setcontext()
	[ppc64]swapcontext()

Same situation in all three - it's going to kill the process if copy-in
fails, so it tries to be gentler about it and treats fault-in failures
as -EFAULT from syscall.  AFAICS, it's pointless, but I would like
comments from ppc folks.  Note that bogus *contents* of the
struct ucontext passed by the user are almost certainly going to end up
with a segfault; trying to catch the cases when a bogus address happens
to point someplace unreadable is rather useless in that situation.

	restore_fpregs_from_user()
The one you've caught; hadn't been there last time I'd checked (back in
April).  Its counterpart in copy_fpstate_to_sigframe() had been, though.

	armada_gem_pwrite_ioctl()
Pointless, along with the access_ok() there - it does copy_from_user()
on that area shortly afterwards and failure of either is not a fast path.
	copy_page_from_iter_iovec()
Will do the right thing on short copy of any kind; we are fine with either
semantics.
	iov_iter_fault_in_readable()
		generic_perform_write()
Any short copy that had not led to progress (== rejected by ->write_end())
will lead to next chunk shortened accordingly, so ->write_begin() would be
asked to prepare for the amount we expect to be able to copy; ->write_end()
should be fine with that.  Failure to copy anything at all (possible due to
eviction on memory pressure, etc.) leads to retry of the same chunk as the
last time, and that's where we rely on fault-in rejecting "nothing could be
faulted in" case.  That one is fine with partial fault-in reported as success.
		f2fs_file_write_iter()
Odd prealloc-related stuff.  AFAICS, from the correctness POV either variant
of semantics would do, but I'm not sure if either is the right match
for what they are trying to do there.
		fuse_fill_write_pages()
Similar to generic_perform_write() situation, only simpler (no ->write_end()
counterpart there).  All we care about is failure if nothing could be faulted
in.
		btrfs_buffered_write()
Again, similar to generic_perform_write().  More convoluted (after a short
copy it switches to going page-by-page and getting destination pages uptodate,
which will be equivalent to ->write_end() always accepting everything it's
given from that point on), but it's the same "we care only about failure
to fault in the first page" situation.
		ntfs_perform_write()
Another generic_perform_write() analogue.  Same situation wrt fault-in
semantics.
		iomap_write_actor()
Another generic_perform_write() relative.  Same situation.


fault_in_pages_writeable()
        copy_fpstate_to_sigframe()
Same kind of "retry everything from scratch on short copy" as in the other
fpu/signal.c case.
	[btrfs]search_ioctl()
Broken with memory poisoning, for either variant of semantics.  Same for
arm64 sub-page permission differences, I think.
	copy_page_to_iter_iovec()
Will do the right thing on short copy of any kind; we are fine with either
semantics.

So we have 3 callers where we want all-or-nothing semantics - two in
arch/x86/kernel/fpu/signal.c and one in btrfs.  HWPOISON will be a problem
for all 3, AFAICS...

IOW, it looks like we have two different things mixed here - one that wants
to try and fault stuff in, with callers caring only about having _something_
faulted in (most of the users) and one that wants to make sure we *can* do
stores or loads on each byte in the affected area.

Just accessing a byte in each page really won't suffice for the second kind.
Neither will g-u-p use, unless we teach it about HWPOISON and other fun
beasts...  Looks like we want that thing to be a separate primitive; for
btrfs I'd probably replace fault_in_pages_writeable() with clear_user()
as a quick fix for now...

Comments?
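
For reference, a much-simplified paraphrase of the generic_perform_write()-style
loop analysed above (helper names are illustrative, not the exact mm/filemap.c
code):

/*
 * A short copy only shortens the next attempt, so fault-in is relied on
 * purely to turn "nothing at all could be faulted in" into -EFAULT;
 * partial fault-in reported as success is harmless for this pattern.
 */
static ssize_t buffered_write_sketch(struct iov_iter *i, loff_t pos)
{
	ssize_t written = 0;

	while (iov_iter_count(i)) {
		size_t bytes = min_t(size_t, PAGE_SIZE, iov_iter_count(i));
		size_t copied, accepted;
again:
		if (fault_in_iter_readable_sketch(i, bytes))	/* fails only if nothing */
			return written ? written : -EFAULT;

		copied = copy_chunk_atomic_sketch(pos, bytes, i);	/* may be short */
		accepted = write_end_sketch(pos, bytes, copied);	/* may reject a short copy */
		if (unlikely(!accepted)) {
			if (copied)
				bytes = copied;	/* shorten the retried chunk */
			goto again;
		}
		pos += accepted;
		written += accepted;
	}
	return written;
}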
Al Viro Aug. 27, 2021, 9:57 p.m. UTC | #7
On Fri, Aug 27, 2021 at 09:48:55PM +0000, Al Viro wrote:

> 	[btrfs]search_ioctl()
> Broken with memory poisoning, for either variant of semantics.  Same for
> arm64 sub-page permission differences, I think.


> So we have 3 callers where we want all-or-nothing semantics - two in
> arch/x86/kernel/fpu/signal.c and one in btrfs.  HWPOISON will be a problem
> for all 3, AFAICS...
> 
> IOW, it looks like we have two different things mixed here - one that wants
> to try and fault stuff in, with callers caring only about having _something_
> faulted in (most of the users) and one that wants to make sure we *can* do
> stores or loads on each byte in the affected area.
> 
> Just accessing a byte in each page really won't suffice for the second kind.
> Neither will g-u-p use, unless we teach it about HWPOISON and other fun
> beasts...  Looks like we want that thing to be a separate primitive; for
> btrfs I'd probably replace fault_in_pages_writeable() with clear_user()
> as a quick fix for now...
> 
> Comments?

Wait a sec...  Wasn't HWPOISON a per-page thing?  arm64 definitely does have
smaller-than-page areas with different permissions, so btrfs search_ioctl()
has a problem there, but arch/x86/kernel/fpu/signal.c doesn't have to deal
with that...

Sigh...  I really need more coffee...
Luck, Tony Aug. 27, 2021, 11:22 p.m. UTC | #8
On Fri, Aug 27, 2021 at 09:57:10PM +0000, Al Viro wrote:
> On Fri, Aug 27, 2021 at 09:48:55PM +0000, Al Viro wrote:
> 
> > 	[btrfs]search_ioctl()
> > Broken with memory poisoning, for either variant of semantics.  Same for
> > arm64 sub-page permission differences, I think.
> 
> 
> > So we have 3 callers where we want all-or-nothing semantics - two in
> > arch/x86/kernel/fpu/signal.c and one in btrfs.  HWPOISON will be a problem
> > for all 3, AFAICS...
> > 
> > IOW, it looks like we have two different things mixed here - one that wants
> > to try and fault stuff in, with callers caring only about having _something_
> > faulted in (most of the users) and one that wants to make sure we *can* do
> > stores or loads on each byte in the affected area.
> > 
> > Just accessing a byte in each page really won't suffice for the second kind.
> > Neither will g-u-p use, unless we teach it about HWPOISON and other fun
> > beasts...  Looks like we want that thing to be a separate primitive; for
> > btrfs I'd probably replace fault_in_pages_writeable() with clear_user()
> > as a quick fix for now...
> > 
> > Comments?
> 
> Wait a sec...  Wasn't HWPOISON a per-page thing?  arm64 definitely does have
> smaller-than-page areas with different permissions, so btrfs search_ioctl()
> has a problem there, but arch/x86/kernel/fpu/signal.c doesn't have to deal
> with that...
> 
> Sigh...  I really need more coffee...

On Intel poison is tracked at the cache line granularity. Linux
inflates that to per-page (because it can only take a whole page away).
For faults triggered in ring3 this is pretty much the same thing because
mm/memory_failure.c unmaps the page ... so while you see a #MC on first
access, you get #PF when you retry. The x86 fault handler sees a magic
signature in the page table and sends a SIGBUS.

But it's all different if the #MC is triggered from ring0. The machine
check handler can't unmap the page. It just schedules task_work to do
the unmap when next returning to the user.

But if your kernel code loops and tries again without a return to user,
then you get another #MC.

-Tony
Luck, Tony Aug. 28, 2021, 2:20 a.m. UTC | #9
> But if your kernel code loops and tries again without a return to user,
> then you get another #MC.

I've been trying to push this patch:

https://lore.kernel.org/linux-edac/20210818002942.1607544-1-tony.luck@intel.com/

which turns the infinite loops of machine checks into a panic.

-Tony
Al Viro Aug. 28, 2021, 7:28 p.m. UTC | #10
AFAICS, a48b73eca4ce "btrfs: fix potential deadlock in the search ioctl"
has introduced a bug at least on arm64.

Relevant bits: in search_ioctl() we have
        while (1) {
                ret = fault_in_pages_writeable(ubuf + sk_offset,
                                               *buf_size - sk_offset);
                if (ret)
                        break;

                ret = btrfs_search_forward(root, &key, path, sk->min_transid);
                if (ret != 0) {
                        if (ret > 0)
                                ret = 0;
                        goto err;
                }
                ret = copy_to_sk(path, &key, sk, buf_size, ubuf,
                                 &sk_offset, &num_found);
                btrfs_release_path(path);
                if (ret)
                        break;

        }
and in copy_to_sk() -
                sh.objectid = key->objectid;
                sh.offset = key->offset;
                sh.type = key->type;
                sh.len = item_len;
                sh.transid = found_transid;

                /*
                 * Copy search result header. If we fault then loop again so we
                 * can fault in the pages and -EFAULT there if there's a
                 * problem. Otherwise we'll fault and then copy the buffer in
                 * properly this next time through
                 */
                if (copy_to_user_nofault(ubuf + *sk_offset, &sh, sizeof(sh))) {
                        ret = 0;
                        goto out;
                }
with sk_offset left unchanged if the very first copy_to_user_nofault() fails.

Now, consider a situation on arm64 where ubuf points to the beginning of a page,
ubuf[0] can be accessed, but ubuf[16] cannot (possible with MTE, AFAICS).  We do
fault_in_pages_writeable(), which succeeds.  When we get to copy_to_user_nofault()
we fail as soon as it gets past the first 16 bytes.  And we repeat everything from
scratch, with no progress made, since short copies are treated as "discard and
repeat" here.

Am I misreading what's going on there?
Thomas Gleixner Aug. 28, 2021, 9:47 p.m. UTC | #11
On Fri, Aug 27 2021 at 16:22, Tony Luck wrote:
> On Fri, Aug 27, 2021 at 09:57:10PM +0000, Al Viro wrote:
>> On Fri, Aug 27, 2021 at 09:48:55PM +0000, Al Viro wrote:
>> 
>> > 	[btrfs]search_ioctl()
>> > Broken with memory poisoning, for either variant of semantics.  Same for
>> > arm64 sub-page permission differences, I think.
>> 
>> 
>> > So we have 3 callers where we want all-or-nothing semantics - two in
>> > arch/x86/kernel/fpu/signal.c and one in btrfs.  HWPOISON will be a problem
>> > for all 3, AFAICS...
>> > 
>> > IOW, it looks like we have two different things mixed here - one that wants
>> > to try and fault stuff in, with callers caring only about having _something_
>> > faulted in (most of the users) and one that wants to make sure we *can* do
>> > stores or loads on each byte in the affected area.
>> > 
>> > Just accessing a byte in each page really won't suffice for the second kind.
>> > Neither will g-u-p use, unless we teach it about HWPOISON and other fun
>> > beasts...  Looks like we want that thing to be a separate primitive; for
>> > btrfs I'd probably replace fault_in_pages_writeable() with clear_user()
>> > as a quick fix for now...
>> > 
>> > Comments?
>> 
>> Wait a sec...  Wasn't HWPOISON a per-page thing?  arm64 definitely does have
>> smaller-than-page areas with different permissions, so btrfs search_ioctl()
>> has a problem there, but arch/x86/kernel/fpu/signal.c doesn't have to deal
>> with that...
>> 
>> Sigh...  I really need more coffee...
>
> On Intel poison is tracked at the cache line granularity. Linux
> inflates that to per-page (because it can only take a whole page away).
> For faults triggered in ring3 this is pretty much the same thing because
> mm/memory_failure.c unmaps the page ... so while you see a #MC on first
> access, you get #PF when you retry. The x86 fault handler sees a magic
> signature in the page table and sends a SIGBUS.
>
> But it's all different if the #MC is triggered from ring0. The machine
> check handler can't unmap the page. It just schedules task_work to do
> the unmap when next returning to the user.
>
> But if your kernel code loops and tries again without a return to user,
> then you get another #MC.

But that's not the case for restore_fpregs_from_user() when it hits #MC.

restore_fpregs_from_user()
  ...
  ret = __restore_fpregs_from_user(buf, xrestore, fx_only)
  
  /* Try to handle #PF, but anything else is fatal. */
  if (ret != -EFAULT)
     return -EINVAL;

Now let's look at __restore_fpregs_from_user()

__restore_fpregs_from_user()
   return $FPUVARIANT_rstor_from_user_sigframe()

which all end up in user_insn(). user_insn() returns 0 or the negated
trap number, which results in -EFAULT for #PF, but for #MC the negated
trap number is -18 i.e. != -EFAULT. IOW, there is no endless loop.
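
(For reference, the arithmetic that makes this work: X86_TRAP_PF is vector 14
and EFAULT is errno 14, so -X86_TRAP_PF compares equal to -EFAULT, while
X86_TRAP_MC is vector 18 and does not.)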

This used to be a problem before commit:

  aee8c67a4faa ("x86/fpu: Return proper error codes from user access functions")

and as the changelog says the initial reason for this was #GP going into
the fault path, but I'm pretty sure that I also discussed the #MC angle with
Borislav back then. Should have added some more comments there
obviously.

Thanks,

        tglx
Al Viro Aug. 28, 2021, 10:04 p.m. UTC | #12
On Sat, Aug 28, 2021 at 11:47:03PM +0200, Thomas Gleixner wrote:

>   /* Try to handle #PF, but anything else is fatal. */
>   if (ret != -EFAULT)
>      return -EINVAL;

> which all end up in user_insn(). user_insn() returns 0 or the negated
> trap number, which results in -EFAULT for #PF, but for #MC the negated
> trap number is -18 i.e. != -EFAULT. IOW, there is no endless loop.
> 
> This used to be a problem before commit:
> 
>   aee8c67a4faa ("x86/fpu: Return proper error codes from user access functions")
> 
> and as the changelog says the initial reason for this was #GP going into
> the fault path, but I'm pretty sure that I also discussed the #MC angle with
> Borislav back then. Should have added some more comments there
> obviously.

... or at least have that check spelled

	if (ret != -X86_TRAP_PF)
		return -EINVAL;

Unless I'm misreading your explanation, that is...
Al Viro Aug. 28, 2021, 10:11 p.m. UTC | #13
On Sat, Aug 28, 2021 at 10:04:41PM +0000, Al Viro wrote:
> On Sat, Aug 28, 2021 at 11:47:03PM +0200, Thomas Gleixner wrote:
> 
> >   /* Try to handle #PF, but anything else is fatal. */
> >   if (ret != -EFAULT)
> >      return -EINVAL;
> 
> > which all end up in user_insn(). user_insn() returns 0 or the negated
> > trap number, which results in -EFAULT for #PF, but for #MC the negated
> > trap number is -18 i.e. != -EFAULT. IOW, there is no endless loop.
> > 
> > This used to be a problem before commit:
> > 
> >   aee8c67a4faa ("x86/fpu: Return proper error codes from user access functions")
> > 
> > and as the changelog says the initial reason for this was #GP going into
> > the fault path, but I'm pretty sure that I also discussed the #MC angle with
> > Borislav back then. Should have added some more comments there
> > obviously.
> 
> ... or at least have that check spelled
> 
> 	if (ret != -X86_TRAP_PF)
> 		return -EINVAL;
> 
> Unless I'm misreading your explanation, that is...

BTW, is #MC triggered on stores to a poisoned cacheline?  Existence of CLZERO
would seem to argue against that...
Al Viro Aug. 28, 2021, 10:19 p.m. UTC | #14
On Sat, Aug 28, 2021 at 10:11:36PM +0000, Al Viro wrote:
> On Sat, Aug 28, 2021 at 10:04:41PM +0000, Al Viro wrote:
> > On Sat, Aug 28, 2021 at 11:47:03PM +0200, Thomas Gleixner wrote:
> > 
> > >   /* Try to handle #PF, but anything else is fatal. */
> > >   if (ret != -EFAULT)
> > >      return -EINVAL;
> > 
> > > which all end up in user_insn(). user_insn() returns 0 or the negated
> > > trap number, which results in -EFAULT for #PF, but for #MC the negated
> > > trap number is -18 i.e. != -EFAULT. IOW, there is no endless loop.
> > > 
> > > This used to be a problem before commit:
> > > 
> > >   aee8c67a4faa ("x86/fpu: Return proper error codes from user access functions")
> > > 
> > > and as the changelog says the initial reason for this was #GP going into
> > > the fault path, but I'm pretty sure that I also discussed the #MC angle with
> > > Borislav back then. Should have added some more comments there
> > > obviously.
> > 
> > ... or at least have that check spelled
> > 
> > 	if (ret != -X86_TRAP_PF)
> > 		return -EINVAL;
> > 
> > Unless I'm misreading your explanation, that is...
> 
> BTW, is #MC triggered on stores to a poisoned cacheline?  Existence of CLZERO
> would seem to argue against that...

How about taking __clear_user() out of copy_fpregs_to_sigframe()
and replacing the call of fault_in_pages_writeable() with
	if (!clear_user(buf_fx, fpu_user_xstate_size))
		goto retry;
	return -EFAULT;
in the caller?
Luck, Tony Aug. 28, 2021, 10:20 p.m. UTC | #15
On Sat, Aug 28, 2021 at 3:12 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> BTW, is #MC triggered on stores to a poisoned cacheline?  Existence of CLZERO
> would seem to argue against that...

No #MC on stores. Just on loads. Note that you can't clear poison
state with a series of small writes to the cache line. But a single
64-byte store might do it (architects didn't want to guarantee that
it would work when I asked about avx512 stores to clear poison
many years ago).

-Tony
Thomas Gleixner Aug. 28, 2021, 10:23 p.m. UTC | #16
On Sat, Aug 28 2021 at 22:04, Al Viro wrote:

> On Sat, Aug 28, 2021 at 11:47:03PM +0200, Thomas Gleixner wrote:
>
>>   /* Try to handle #PF, but anything else is fatal. */
>>   if (ret != -EFAULT)
>>      return -EINVAL;
>
>> which all end up in user_insn(). user_insn() returns 0 or the negated
>> trap number, which results in -EFAULT for #PF, but for #MC the negated
>> trap number is -18 i.e. != -EFAULT. IOW, there is no endless loop.
>> 
>> This used to be a problem before commit:
>> 
>>   aee8c67a4faa ("x86/fpu: Return proper error codes from user access functions")
>> 
>> and as the changelog says the initial reason for this was #GP going into
>> the fault path, but I'm pretty sure that I also discussed the #MC angle with
>> Borislav back then. Should have added some more comments there
>> obviously.
>
> ... or at least have that check spelled
>
> 	if (ret != -X86_TRAP_PF)
> 		return -EINVAL;
>
> Unless I'm misreading your explanation, that is...

Yes, that makes a lot of sense.
Al Viro Aug. 28, 2021, 10:51 p.m. UTC | #17
On Sat, Aug 28, 2021 at 10:19:04PM +0000, Al Viro wrote:

> How about taking __clear_user() out of copy_fpregs_to_sigframe()
> and replacing the call of fault_in_pages_writeable() with
> 	if (!clear_user(buf_fx, fpu_user_xstate_size))
> 		goto retry;
> 	return -EFAULT;
> in the caller?

Something like this (completely untested)

Lift __clear_user() out of copy_fpregs_to_sigframe(), do not confuse EFAULT with
X86_TRAP_PF, don't bother with fault_in_pages_writeable() (pointless, since now
__clear_user() on error is not under pagefault_disable()).  And don't bother
with retries on anything other than #PF...

diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index 5a18694a89b2..71c6621a262f 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -17,6 +17,7 @@
 #include <linux/mm.h>
 
 #include <asm/user.h>
+#include <asm/trapnr.h>
 #include <asm/fpu/api.h>
 #include <asm/fpu/xstate.h>
 #include <asm/fpu/xcr.h>
@@ -345,7 +346,7 @@ static inline int xsave_to_user_sigframe(struct xregs_state __user *buf)
 	 */
 	err = __clear_user(&buf->header, sizeof(buf->header));
 	if (unlikely(err))
-		return -EFAULT;
+		return -X86_TRAP_PF;
 
 	stac();
 	XSTATE_OP(XSAVE, buf, lmask, hmask, err);
diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c
index 445c57c9c539..611b9ed9c06b 100644
--- a/arch/x86/kernel/fpu/signal.c
+++ b/arch/x86/kernel/fpu/signal.c
@@ -135,18 +135,12 @@ static inline int save_xstate_epilog(void __user *buf, int ia32_frame)
 
 static inline int copy_fpregs_to_sigframe(struct xregs_state __user *buf)
 {
-	int err;
-
 	if (use_xsave())
-		err = xsave_to_user_sigframe(buf);
-	else if (use_fxsr())
-		err = fxsave_to_user_sigframe((struct fxregs_state __user *) buf);
+		return xsave_to_user_sigframe(buf);
+	if (use_fxsr())
+		return fxsave_to_user_sigframe((struct fxregs_state __user *) buf);
 	else
-		err = fnsave_to_user_sigframe((struct fregs_state __user *) buf);
-
-	if (unlikely(err) && __clear_user(buf, fpu_user_xstate_size))
-		err = -EFAULT;
-	return err;
+		return fnsave_to_user_sigframe((struct fregs_state __user *) buf);
 }
 
 /*
@@ -205,9 +199,10 @@ int copy_fpstate_to_sigframe(void __user *buf, void __user *buf_fx, int size)
 	fpregs_unlock();
 
 	if (ret) {
-		if (!fault_in_pages_writeable(buf_fx, fpu_user_xstate_size))
+		if (!__clear_user(buf_fx, fpu_user_xstate_size) &&
+		    ret == -X86_TRAP_PF)
 			goto retry;
-		return -EFAULT;
+		return -1;
 	}
 
 	/* Save the fsave header for the 32-bit frames. */
@@ -275,7 +270,7 @@ static int restore_fpregs_from_user(void __user *buf, u64 xrestore,
 		fpregs_unlock();
 
 		/* Try to handle #PF, but anything else is fatal. */
-		if (ret != -EFAULT)
+		if (ret != -X86_TRAP_PF)
 			return -EINVAL;
 
 		ret = fault_in_pages_readable(buf, size);
Al Viro Aug. 29, 2021, 12:58 a.m. UTC | #18
On Fri, Aug 27, 2021 at 09:48:55PM +0000, Al Viro wrote:

> So we have 3 callers where we want all-or-nothing semantics - two in
> arch/x86/kernel/fpu/signal.c and one in btrfs.  HWPOISON will be a problem
> for all 3, AFAICS...
> 
> IOW, it looks like we have two different things mixed here - one that wants
> to try and fault stuff in, with callers caring only about having _something_
> faulted in (most of the users) and one that wants to make sure we *can* do
> stores or loads on each byte in the affected area.
> 
> Just accessing a byte in each page really won't suffice for the second kind.
> Neither will g-u-p use, unless we teach it about HWPOISON and other fun
> beasts...  Looks like we want that thing to be a separate primitive; for
> btrfs I'd probably replace fault_in_pages_writeable() with clear_user()
> as a quick fix for now...

Looks like out of these 3 we have
	* x86 restoring FPU state on sigreturn: correct, if somewhat obfuscated;
HWPOISON is not an issue.  We want full fault-in there (1 or 2 pages)
	* x86 saving FPU state into sigframe: not really needed; we do
__clear_user() on any error anyway, and taking it into the caller past the
pagefault_enable() will serve just fine instead of fault-in of the same
for write.
	* btrfs search_ioctl(): HWPOISON is not an issue (no #MC on stores),
but arm64 side of the things very likely is a problem with MTE; there we
can have successful store in some bytes in a page with faults on stores
elsewhere in it.  With such setups that thing will loop indefinitely.
And unlike x86 FPU handling, btrfs is arch-independent.

IOW, unless I'm misreading the situation, we have one caller where "all or
nothing" semantics is correct and needed, several where fault-in is pointless,
one where the current use of fault-in is actively wrong (ppc kvm, patch from
ppc folks exists), another place where neither semantics is right (btrfs on
arm64) and a bunch where "can we access at least the first byte?" should be
fine...
Matthew Wilcox (Oracle) Aug. 29, 2021, 1:40 a.m. UTC | #19
On Sat, Aug 28, 2021 at 03:20:58PM -0700, Tony Luck wrote:
> On Sat, Aug 28, 2021 at 3:12 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> > BTW, is #MC triggered on stores to a poisoned cacheline?  Existence of CLZERO
> > would seem to argue against that...
> 
> No #MC on stores. Just on loads. Note that you can't clear poison
> state with a series of small writes to the cache line. But a single
> 64-byte store might do it (architects didn't want to guarantee that
> it would work when I asked about avx512 stores to clear poison
> many years ago).

Dave Jiang thinks MOVDIR64B clears poison.

http://archive.lwn.net:8080/linux-kernel/157617505636.42350.1170110675242558018.stgit@djiang5-desk3.ch.intel.com/
Thomas Gleixner Aug. 29, 2021, 6:44 p.m. UTC | #20
On Sat, Aug 28 2021 at 22:51, Al Viro wrote:
> @@ -345,7 +346,7 @@ static inline int xsave_to_user_sigframe(struct xregs_state __user *buf)
>  	 */
>  	err = __clear_user(&buf->header, sizeof(buf->header));
>  	if (unlikely(err))
> -		return -EFAULT;
> +		return -X86_TRAP_PF;

This clear_user can be lifted into copy_fpstate_to_sigframe(). Something
like the below.
  
Thanks,

        tglx
---
--- a/arch/x86/kernel/fpu/signal.c
+++ b/arch/x86/kernel/fpu/signal.c
@@ -135,18 +135,12 @@ static inline int save_xstate_epilog(voi
 
 static inline int copy_fpregs_to_sigframe(struct xregs_state __user *buf)
 {
-	int err;
-
 	if (use_xsave())
-		err = xsave_to_user_sigframe(buf);
-	else if (use_fxsr())
-		err = fxsave_to_user_sigframe((struct fxregs_state __user *) buf);
+		return xsave_to_user_sigframe(buf);
+	if (use_fxsr())
+		return fxsave_to_user_sigframe((struct fxregs_state __user *) buf);
 	else
-		err = fnsave_to_user_sigframe((struct fregs_state __user *) buf);
-
-	if (unlikely(err) && __clear_user(buf, fpu_user_xstate_size))
-		err = -EFAULT;
-	return err;
+		return fnsave_to_user_sigframe((struct fregs_state __user *) buf);
 }
 
 /*
@@ -188,6 +182,16 @@ int copy_fpstate_to_sigframe(void __user
 
 	if (!access_ok(buf, size))
 		return -EACCES;
+
+	if (use_xsave()) {
+		/*
+		 * Clear the xsave header first, so that reserved fields are
+		 * initialized to zero.
+		 */
+		ret = __clear_user(&buf->header, sizeof(buf->header));
+		if (unlikely(ret))
+			return ret;
+	}
 retry:
 	/*
 	 * Load the FPU registers if they are not valid for the current task.
@@ -205,9 +209,10 @@ int copy_fpstate_to_sigframe(void __user
 	fpregs_unlock();
 
 	if (ret) {
-		if (!fault_in_pages_writeable(buf_fx, fpu_user_xstate_size))
+		if (!__clear_user(buf_fx, fpu_user_xstate_size) &&
+		    ret == -X86_TRAP_PF)
 			goto retry;
-		return -EFAULT;
+		return -1;
 	}
 
 	/* Save the fsave header for the 32-bit frames. */
@@ -275,7 +280,7 @@ static int restore_fpregs_from_user(void
 		fpregs_unlock();
 
 		/* Try to handle #PF, but anything else is fatal. */
-		if (ret != -EFAULT)
+		if (ret != -X86_TRAP_PF)
 			return -EINVAL;
 
 		ret = fault_in_pages_readable(buf, size);
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -323,9 +323,12 @@ static inline void os_xrstor(struct xreg
  * We don't use modified optimization because xrstor/xrstors might track
  * a different application.
  *
- * We don't use compacted format xsave area for
- * backward compatibility for old applications which don't understand
- * compacted format of xsave area.
+ * We don't use compacted format xsave area for backward compatibility for
+ * old applications which don't understand the compacted format of the
+ * xsave area.
+ *
+ * The caller has to zero buf::header before calling this because XSAVE*
+ * does not touch them.
  */
 static inline int xsave_to_user_sigframe(struct xregs_state __user *buf)
 {
@@ -339,14 +342,6 @@ static inline int xsave_to_user_sigframe
 	u32 hmask = mask >> 32;
 	int err;
 
-	/*
-	 * Clear the xsave header first, so that reserved fields are
-	 * initialized to zero.
-	 */
-	err = __clear_user(&buf->header, sizeof(buf->header));
-	if (unlikely(err))
-		return -EFAULT;
-
 	stac();
 	XSTATE_OP(XSAVE, buf, lmask, hmask, err);
 	clac();
Al Viro Aug. 29, 2021, 7:46 p.m. UTC | #21
On Sun, Aug 29, 2021 at 08:44:04PM +0200, Thomas Gleixner wrote:
> On Sat, Aug 28 2021 at 22:51, Al Viro wrote:
> > @@ -345,7 +346,7 @@ static inline int xsave_to_user_sigframe(struct xregs_state __user *buf)
> >  	 */
> >  	err = __clear_user(&buf->header, sizeof(buf->header));
> >  	if (unlikely(err))
> > -		return -EFAULT;
> > +		return -X86_TRAP_PF;
> 
> This clear_user can be lifted into copy_fpstate_to_sigframe(). Something
> like the below.

Hmm...  This mixing of -X86_TRAP_... with -E... looks like it's asking for
trouble in general.  Might be worth making e.g. fpu__restore_sig() (and
its callers) return bool, seeing that we only check for 0/non-zero in
there.
Thomas Gleixner Aug. 29, 2021, 7:51 p.m. UTC | #22
On Sun, Aug 29 2021 at 19:46, Al Viro wrote:

> On Sun, Aug 29, 2021 at 08:44:04PM +0200, Thomas Gleixner wrote:
>> On Sat, Aug 28 2021 at 22:51, Al Viro wrote:
>> > @@ -345,7 +346,7 @@ static inline int xsave_to_user_sigframe(struct xregs_state __user *buf)
>> >  	 */
>> >  	err = __clear_user(&buf->header, sizeof(buf->header));
>> >  	if (unlikely(err))
>> > -		return -EFAULT;
>> > +		return -X86_TRAP_PF;
>> 
>> This clear_user can be lifted into copy_fpstate_to_sigframe(). Something
>> like the below.
>
> Hmm...  This mixing of -X86_TRAP_... with -E... looks like it's asking for
> trouble in general.  Might be worth making e.g. fpu__restore_sig() (and
> its callers) return bool, seeing that we only check for 0/non-zero in
> there.

Let me fix that.
Luck, Tony Aug. 30, 2021, 3:41 p.m. UTC | #23
>> No #MC on stores. Just on loads. Note that you can't clear poison
>> state with a series of small writes to the cache line. But a single
>> 64-byte store might do it (architects didn't want to guarantee that
>> it would work when I asked about avx512 stores to clear poison
>> many years ago).
>
> Dave Jiang thinks MOVDIR64B clears poison.
>
> http://archive.lwn.net:8080/linux-kernel/157617505636.42350.1170110675242558018.stgit@djiang5-desk3.ch.intel.com/

MOVDIR64B has some explicit guarantees (does a write-back invalidate if the target is already
in the cache) that a 64-byte avx512 write doesn't.

Of course it would stop working if some future CPU were to have a longer than 64 bytes cache line.

-Tony
Catalin Marinas Aug. 31, 2021, 1:54 p.m. UTC | #24
On Sat, Aug 28, 2021 at 08:28:17PM +0100, Al Viro wrote:
> 	AFAICS, a48b73eca4ce "btrfs: fix potential deadlock in the search ioctl"
> has introduced a bug at least on arm64.
> 
> Relevant bits: in search_ioctl() we have
>         while (1) {
>                 ret = fault_in_pages_writeable(ubuf + sk_offset,
>                                                *buf_size - sk_offset);
>                 if (ret)
>                         break;
> 
>                 ret = btrfs_search_forward(root, &key, path, sk->min_transid);
>                 if (ret != 0) {
>                         if (ret > 0)
>                                 ret = 0;
>                         goto err;
>                 }
>                 ret = copy_to_sk(path, &key, sk, buf_size, ubuf,
>                                  &sk_offset, &num_found);
>                 btrfs_release_path(path);
>                 if (ret)
>                         break;
> 
>         }
> and in copy_to_sk() -
>                 sh.objectid = key->objectid;
>                 sh.offset = key->offset;
>                 sh.type = key->type;
>                 sh.len = item_len;
>                 sh.transid = found_transid;
> 
>                 /*
>                  * Copy search result header. If we fault then loop again so we
>                  * can fault in the pages and -EFAULT there if there's a
>                  * problem. Otherwise we'll fault and then copy the buffer in
>                  * properly this next time through
>                  */
>                 if (copy_to_user_nofault(ubuf + *sk_offset, &sh, sizeof(sh))) {
>                         ret = 0;
>                         goto out;
>                 }
> with sk_offset left unchanged if the very first copy_to_user_nofault() fails.
> 
> Now, consider a situation on arm64 where ubuf points to the beginning of page,
> ubuf[0] can be accessed, but ubuf[16] can not (possible with MTE, AFAICS).  We do
> fault_in_pages_writeable(), which succeeds.  When we get to copy_to_user_nofault()
> we fail as soon as it gets past the first 16 bytes.  And we repeat everything from
> scratch, with no progress made, since short copies are treated as "discard and
> repeat" here.

So if copy_to_user_nofault() returns -EFAULT, copy_to_sk() returns 0
(following commit a48b73eca4ce). I think you are right, search_ioctl()
can get into an infinite loop attempting to write to user if the
architecture can trigger faults at smaller granularity than the page
boundary. fault_in_pages_writeable() won't fix it if ubuf[0] is
writable and doesn't trigger an MTE tag check fault.

An arm64-specific workaround would be for pagefault_disable() to disable
tag checking. It's a pretty big hammer, weakening the out of bounds
access detection of MTE. My preference would be a fix in the btrfs code.

A btrfs option would be for copy_to_sk() to return an indication of
where the fault occurred and get fault_in_pages_writeable() to check
that location, even if the copying would restart from an earlier offset
(this requires open-coding copy_to_user_nofault()). An attempt below,
untested and does not cover read_extent_buffer_to_user_nofault():

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 0ba98e08a029..9e74ba1c955d 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2079,6 +2079,7 @@ static noinline int copy_to_sk(struct btrfs_path *path,
 			       size_t *buf_size,
 			       char __user *ubuf,
 			       unsigned long *sk_offset,
+			       unsigned long *fault_offset,
 			       int *num_found)
 {
 	u64 found_transid;
@@ -2143,7 +2144,11 @@ static noinline int copy_to_sk(struct btrfs_path *path,
 		 * problem. Otherwise we'll fault and then copy the buffer in
 		 * properly this next time through
 		 */
-		if (copy_to_user_nofault(ubuf + *sk_offset, &sh, sizeof(sh))) {
+		pagefault_disable();
+		ret = __copy_to_user_inatomic(ubuf + *sk_offset, &sh, sizeof(sh));
+		pagefault_enable();
+		*fault_offset = *sk_offset + sizeof(sh) - ret;
+		if (ret) {
 			ret = 0;
 			goto out;
 		}
@@ -2218,6 +2223,7 @@ static noinline int search_ioctl(struct inode *inode,
 	int ret;
 	int num_found = 0;
 	unsigned long sk_offset = 0;
+	unsigned long fault_offset = 0;

 	if (*buf_size < sizeof(struct btrfs_ioctl_search_header)) {
 		*buf_size = sizeof(struct btrfs_ioctl_search_header);
@@ -2244,8 +2250,8 @@ static noinline int search_ioctl(struct inode *inode,
 	key.offset = sk->min_offset;

 	while (1) {
-		ret = fault_in_pages_writeable(ubuf + sk_offset,
-					       *buf_size - sk_offset);
+		ret = fault_in_pages_writeable(ubuf + fault_offset,
+					       *buf_size - fault_offset);
 		if (ret)
 			break;

@@ -2256,7 +2262,7 @@ static noinline int search_ioctl(struct inode *inode,
 			goto err;
 		}
 		ret = copy_to_sk(path, &key, sk, buf_size, ubuf,
-				 &sk_offset, &num_found);
+				 &sk_offset, &fault_offset, &num_found);
 		btrfs_release_path(path);
 		if (ret)
 			break;
Al Viro Aug. 31, 2021, 3:28 p.m. UTC | #25
On Tue, Aug 31, 2021 at 02:54:50PM +0100, Catalin Marinas wrote:

> An arm64-specific workaround would be for pagefault_disable() to disable
> tag checking. It's a pretty big hammer, weakening the out of bounds
> access detection of MTE. My preference would be a fix in the btrfs code.
> 
> A btrfs option would be for copy_to_sk() to return an indication of
> where the fault occurred and get fault_in_pages_writeable() to check
> that location, even if the copying would restart from an earlier offset
> (this requires open-coding copy_to_user_nofault()). An attempt below,
> untested and does not cover read_extent_buffer_to_user_nofault():

Umm...  There's another copy_to_user_nofault() call in the same function
(same story, AFAICS).

Can't say I'm fond of their ABI, but then I guess it could've been worse -
iterating over btree, running a user-supplied chunk of INTERCAL over it,
with all details of internal representation cast in stone by that exposure...
Catalin Marinas Aug. 31, 2021, 4:01 p.m. UTC | #26
On Tue, Aug 31, 2021 at 03:28:57PM +0000, Al Viro wrote:
> On Tue, Aug 31, 2021 at 02:54:50PM +0100, Catalin Marinas wrote:
> 
> > An arm64-specific workaround would be for pagefault_disable() to disable
> > tag checking. It's a pretty big hammer, weakening the out of bounds
> > access detection of MTE. My preference would be a fix in the btrfs code.
> > 
> > A btrfs option would be for copy_to_sk() to return an indication of
> > where the fault occurred and get fault_in_pages_writeable() to check
> > that location, even if the copying would restart from an earlier offset
> > (this requires open-coding copy_to_user_nofault()). An attempt below,
> > untested and does not cover read_extent_buffer_to_user_nofault():
> 
> Umm...  There's another copy_to_user_nofault() call in the same function
> (same story, AFAICS).

Yeah, I was too lazy to do it all and I don't have a setup to test the
patch quickly either. BTW, my hack is missing an access_ok() check.

I wonder whether copy_{to,from}_user_nofault() could actually return the
number of bytes left to copy, just like their non-nofault counterparts.
These are only used in a few places, so fairly easy to change. If we go
for a btrfs fix along the lines of my diff, it saves us from duplicating
the copy_to_user_nofault() code.
Catalin Marinas Oct. 11, 2021, 5:37 p.m. UTC | #27
On Tue, Aug 31, 2021 at 03:28:57PM +0000, Al Viro wrote:
> On Tue, Aug 31, 2021 at 02:54:50PM +0100, Catalin Marinas wrote:
> > An arm64-specific workaround would be for pagefault_disable() to disable
> > tag checking. It's a pretty big hammer, weakening the out of bounds
> > access detection of MTE. My preference would be a fix in the btrfs code.
> > 
> > A btrfs option would be for copy_to_sk() to return an indication of
> > where the fault occurred and get fault_in_pages_writeable() to check
> > that location, even if the copying would restart from an earlier offset
> > (this requires open-coding copy_to_user_nofault()). An attempt below,
> > untested and does not cover read_extent_buffer_to_user_nofault():
> 
> Umm...  There's another copy_to_user_nofault() call in the same function
> (same story, AFAICS).

I cleaned up this patch [1] but I realised it still doesn't solve it.
The arm64 __copy_to_user_inatomic(), while ensuring progress if called
in a loop, does not guarantee precise copy to the fault position. The
copy_to_sk(), after returning an error, starts again from the previous
sizeof(sh) boundary rather than from where the __copy_to_user_inatomic()
stopped. So it can get stuck attempting to copy the same search header.
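
(Concretely, assuming a 32-byte search header and a sub-page tag-check fault
16 bytes into the destination: the copy stops short, *sk_offset stays on the
previous header boundary, and the next iteration re-attempts the same header
and faults at the same spot.)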

An ugly fix is to fall back to byte by byte copying so that we can
attempt the actual fault address in fault_in_pages_writeable().

If the sh being recreated in copy_to_sk() is the same on the retried
iteration, we could use an *sk_offset that is not a multiple of
sizeof(sh) in order to have progress. But it's not clear to me whether
the data being copied can change once btrfs_release_path() is called.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/log/?h=devel/btrfs-fix
Linus Torvalds Oct. 11, 2021, 7:15 p.m. UTC | #28
On Mon, Oct 11, 2021 at 10:38 AM Catalin Marinas
<catalin.marinas@arm.com> wrote:
>
> I cleaned up this patch [1] but I realised it still doesn't solve it.
> The arm64 __copy_to_user_inatomic(), while ensuring progress if called
> in a loop, does not guarantee precise copy to the fault position.

That should be ok. We've always allowed the user copy to return early
if it does word copies and hits a page crosser that causes a fault.

Any user then has the choice of:

 - partial copies are bad

 - partial copies are handled and then you retry from the place
copy_to_user() failed at

and in that second case, the next time around, you'll get the fault
immediately (or you'll make some more progress - maybe the user copy
loop did something different just because the length and/or alignment
was different).

If you get the fault immediately, that's -EFAULT.

And if you make some more progress, it's again up to the caller to
rinse and repeat.

End result: user copy functions do not have to report errors exactly.
It is the caller that has to handle the situation.

Most importantly: "exact error or not" doesn't actually _matter_ to
the caller. If the caller does the right thing for an exact error, it
will do the right thing for an inexact one too. See above.
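
A minimal sketch of that second choice (hypothetical caller, not the btrfs
code):

/*
 * Handle partial copies: continue from wherever the user copy stopped;
 * a retry that makes no progress at all becomes -EFAULT.
 */
static int copy_out_all(void __user *dst, const void *src, size_t len)
{
	while (len) {
		size_t left = copy_to_user(dst, src, len);	/* may stop early */

		if (left == len)
			return -EFAULT;		/* faulted immediately, no progress */

		dst += len - left;		/* retry from where it failed */
		src += len - left;
		len = left;
	}
	return 0;
}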

> The copy_to_sk(), after returning an error, starts again from the previous
> sizeof(sh) boundary rather than from where the __copy_to_user_inatomic()
> stopped. So it can get stuck attempting to copy the same search header.

That seems to be purely a copy_to_sk() bug.

Or rather, it looks like a bug in the caller. copy_to_sk() itself does

                if (copy_to_user_nofault(ubuf + *sk_offset, &sh, sizeof(sh))) {
                        ret = 0;
                        goto out;
                }

and the comment says

         *  0: all items from this leaf copied, continue with next

but that comment is then obviously not actually true in that it's not
"continue with next" at all.

But this is all very much a bug in the btrfs
search_ioctl()/copy_to_sk() code: it simply doesn't do the proper
thing for a partial result.

Because no, "just retry the whole thing" is by definition not the proper thing.

That said, I think that if we can have faults at non-page-aligned
boundaries, then we just need to make fault_in_pages_writeable() check
non-page boundaries.

> An ugly fix is to fall back to byte by byte copying so that we can
> attempt the actual fault address in fault_in_pages_writeable().

No, changing the user copy machinery is wrong. The caller really has
to do the right thing with partial results.

And I think we need to make fault_in_pages_writeable() match the
actual faulting cases - maybe remove the "pages" part of the name?

That would fix the btrfs code - it's not doing the right thing as-is,
but it's "close enough' to right that I think fixing
fault_in_pages_writeable() should fix it.

No?

             Linus
Catalin Marinas Oct. 11, 2021, 9:08 p.m. UTC | #29
On Mon, Oct 11, 2021 at 12:15:43PM -0700, Linus Torvalds wrote:
> On Mon, Oct 11, 2021 at 10:38 AM Catalin Marinas
> <catalin.marinas@arm.com> wrote:
> > I cleaned up this patch [1] but I realised it still doesn't solve it.
> > The arm64 __copy_to_user_inatomic(), while ensuring progress if called
> > in a loop, does not guarantee precise copy to the fault position.
> 
> That should be ok. We've always allowed the user copy to return early
> if it does word copies and hits a page crosser that causes a fault.
> 
> Any user then has the choice of:
> 
>  - partial copies are bad
> 
>  - partial copies are handled and then you retry from the place
> copy_to_user() failed at
> 
> and in that second case, the next time around, you'll get the fault
> immediately (or you'll make some more progress - maybe the user copy
> loop did something different just because the length and/or alignment
> was different).
> 
> If you get the fault immediately, that's -EFAULT.
> 
> And if you make some more progress, it's again up to the caller to
> rinse and repeat.
> 
> End result: user copy functions do not have to report errors exactly.
> It is the caller that has to handle the situation.
> 
> Most importantly: "exact error or not" doesn't actually _matter_ to
> the caller. If the caller does the right thing for an exact error, it
> will do the right thing for an inexact one too. See above.

Yes, that's my expectation (though fixed fairly recently in the arm64
user copy routines).

> > The copy_to_sk(), after returning an error, starts again from the previous
> > sizeof(sh) boundary rather than from where the __copy_to_user_inatomic()
> > stopped. So it can get stuck attempting to copy the same search header.
> 
> That seems to be purely a copy_to_sk() bug.
> 
> Or rather, it looks like a bug in the caller. copy_to_sk() itself does
> 
>                 if (copy_to_user_nofault(ubuf + *sk_offset, &sh, sizeof(sh))) {
>                         ret = 0;
>                         goto out;
>                 }
> 
> and the comment says
> 
>          *  0: all items from this leaf copied, continue with next
> 
> but that comment is then obviously not actually true in that it's not
> "continue with next" at all.

The comments were correct before commit a48b73eca4ce ("btrfs: fix
potential deadlock in the search ioctl") which introduced the
potentially infinite loop.

Something like this would make the comments match (I think):

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index cc61813213d8..1debf6a124e8 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2161,7 +2161,7 @@ static noinline int copy_to_sk(struct btrfs_path *path,
 		 * properly this next time through
 		 */
 		if (copy_to_user_nofault(ubuf + *sk_offset, &sh, sizeof(sh))) {
-			ret = 0;
+			ret = -EFAULT;
 			goto out;
 		}

@@ -2175,7 +2175,7 @@ static noinline int copy_to_sk(struct btrfs_path *path,
 			 */
 			if (read_extent_buffer_to_user_nofault(leaf, up,
 						item_off, item_len)) {
-				ret = 0;
+				ret = -EFAULT;
 				*sk_offset -= sizeof(sh);
 				goto out;
 			}
@@ -2260,12 +2260,8 @@ static noinline int search_ioctl(struct inode *inode,
 	key.type = sk->min_type;
 	key.offset = sk->min_offset;

-	while (1) {
-		ret = fault_in_pages_writeable(ubuf + sk_offset,
-					       *buf_size - sk_offset);
-		if (ret)
-			break;
-
+	ret = fault_in_pages_writeable(ubuf, *buf_size);
+	while (ret == 0) {
 		ret = btrfs_search_forward(root, &key, path, sk->min_transid);
 		if (ret != 0) {
 			if (ret > 0)
@@ -2275,9 +2271,14 @@ static noinline int search_ioctl(struct inode *inode,
 		ret = copy_to_sk(path, &key, sk, buf_size, ubuf,
 				 &sk_offset, &num_found);
 		btrfs_release_path(path);
-		if (ret)
-			break;

+		/*
+		 * Fault in copy_to_sk(), attempt to bring the page in after
+		 * releasing the locks and retry.
+		 */
+		if (ret == -EFAULT)
+			ret = fault_in_pages_writeable(ubuf + sk_offset,
+				       sizeof(struct btrfs_ioctl_search_header));
 	}
 	if (ret > 0)
 		ret = 0;

> > An ugly fix is to fall back to byte by byte copying so that we can
> > attempt the actual fault address in fault_in_pages_writeable().
> 
> No, changing the user copy machinery is wrong. The caller really has
> to do the right thing with partial results.
> 
> And I think we need to make fault_in_pages_writeable() match the
> actual faulting cases - maybe remove the "pages" part of the name?

Ah, good point. Without removing "pages" from the name (too late over
here to grep the kernel), something like below:

diff --git a/arch/arm64/include/asm/page-def.h b/arch/arm64/include/asm/page-def.h
index 2403f7b4cdbf..3768ac4a6610 100644
--- a/arch/arm64/include/asm/page-def.h
+++ b/arch/arm64/include/asm/page-def.h
@@ -15,4 +15,9 @@
 #define PAGE_SIZE		(_AC(1, UL) << PAGE_SHIFT)
 #define PAGE_MASK		(~(PAGE_SIZE-1))
 
+#ifdef CONFIG_ARM64_MTE
+#define FAULT_GRANULE_SIZE	(16)
+#define FAULT_GRANULE_MASK	(~(FAULT_GRANULE_SIZE-1))
+#endif
+
 #endif /* __ASM_PAGE_DEF_H */
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 62db6b0176b9..7aef732e4fa7 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -16,6 +16,11 @@
 #include <linux/hardirq.h> /* for in_interrupt() */
 #include <linux/hugetlb_inline.h>
 
+#ifndef FAULT_GRANULE_SIZE
+#define FAULT_GRANULE_SIZE	PAGE_SIZE
+#define FAULT_GRANULE_MASK	PAGE_MASK
+#endif
+
 struct pagevec;
 
 static inline bool mapping_empty(struct address_space *mapping)
@@ -751,12 +756,12 @@ static inline int fault_in_pages_writeable(char __user *uaddr, size_t size)
 	do {
 		if (unlikely(__put_user(0, uaddr) != 0))
 			return -EFAULT;
-		uaddr += PAGE_SIZE;
+		uaddr += FAULT_GRANULE_SIZE;
 	} while (uaddr <= end);
 
-	/* Check whether the range spilled into the next page. */
-	if (((unsigned long)uaddr & PAGE_MASK) ==
-			((unsigned long)end & PAGE_MASK))
+	/* Check whether the range spilled into the next granule. */
+	if (((unsigned long)uaddr & FAULT_GRANULE_MASK) ==
+			((unsigned long)end & FAULT_GRANULE_MASK))
 		return __put_user(0, end);
 
 	return 0;

If this looks in the right direction, I'll do some proper patches
tomorrow.
Linus Torvalds Oct. 11, 2021, 11:59 p.m. UTC | #30
On Mon, Oct 11, 2021 at 2:08 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> +#ifdef CONFIG_ARM64_MTE
> +#define FAULT_GRANULE_SIZE     (16)
> +#define FAULT_GRANULE_MASK     (~(FAULT_GRANULE_SIZE-1))

[...]

> If this looks in the right direction, I'll do some proper patches
> tomorrow.

Looks fine to me. It's going to be quite expensive and bad for caches, though.

That said, fault_in_writable() is _supposed_ to all be for the slow
path when things go south and the normal path didn't work out, so I
think it's fine.

I do wonder how the sub-page granularity works. Is it sufficient to
just read from it? Because then a _slightly_ better option might be to
do one write per page (to catch page table writability) and then one
read per "granule" (to catch pointer coloring or cache poisoning
issues)?
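
A rough sketch of that variant (hypothetical, not merged code; FAULT_GRANULE_*
as in the diff above):

/*
 * One store per page catches lack of write permission in the page
 * tables; one load per sub-page granule catches MTE-style colouring
 * faults without dirtying every granule.
 */
static int fault_in_writeable_variant(char __user *uaddr, size_t size)
{
	char __user *end = uaddr + size - 1;
	char __user *p = uaddr;

	if (!size)
		return 0;

	while (p <= end) {
		if (unlikely(__put_user(0, p)))
			return -EFAULT;
		p = (char __user *)(((unsigned long)p & PAGE_MASK) + PAGE_SIZE);
	}

	p = uaddr;
	while (p <= end) {
		char c;

		if (unlikely(__get_user(c, p)))
			return -EFAULT;
		p = (char __user *)(((unsigned long)p & FAULT_GRANULE_MASK) + FAULT_GRANULE_SIZE);
	}
	return 0;
}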

That said, since this is all preparatory to us wanting to write to it
eventually anyway, maybe marking it all dirty in the caches is only
good.

                Linus
Catalin Marinas Oct. 12, 2021, 5:27 p.m. UTC | #31
On Mon, Oct 11, 2021 at 04:59:28PM -0700, Linus Torvalds wrote:
> On Mon, Oct 11, 2021 at 2:08 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > +#ifdef CONFIG_ARM64_MTE
> > +#define FAULT_GRANULE_SIZE     (16)
> > +#define FAULT_GRANULE_MASK     (~(FAULT_GRANULE_SIZE-1))
> 
> [...]
> 
> > If this looks in the right direction, I'll do some proper patches
> > tomorrow.
> 
> Looks fine to me. It's going to be quite expensive and bad for caches,
> though.
> 
> That said, fault_in_writable() is _supposed_ to all be for the slow
> path when things go south and the normal path didn't work out, so I
> think it's fine.
> 
> I do wonder how the sub-page granularity works. Is it sufficient to
> just read from it?

For arm64 MTE and I think SPARC ADI, just reading should be sufficient.
There is CHERI in the long run, if it takes off, where the user can set
independent read/write permissions and uaccess would use the capability
rather than a match-all pointer (hence checked).

> Because then a _slightly_ better option might be to
> do one write per page (to catch page table writability) and then one
> read per "granule" (to catch pointer coloring or cache poisoning
> issues)?
> 
> That said, since this is all preparatory to us wanting to write to it
> eventually anyway, maybe marking it all dirty in the caches is only
> good.

It depends on how much would be written in the actual copy. For
significant memcpy on arm CPUs, write streaming usually kicks in and the
cache dirtying is skipped. This probably matters more for
copy_page_to_iter_iovec() than the btrfs search ioctl.

Apart from fault_in_pages_*(), there's also fault_in_user_writeable()
called from the futex code which uses the GUP mechanism as the write
would be destructive. It looks like it could potentially trigger the
same infinite loop on -EFAULT. For arm64 MTE, we get away with this by
disabling the tag checking around the arch futex code (we did it for an
unrelated issue - we don't have LDXR/STXR that would run with user
permissions in kernel mode like we do with LDTR/STTR).

I wonder whether we should actually just disable tag checking around the
problematic accesses. What these callers seem to have in common is using
pagefault_disable/enable(). We could abuse this to disable tag checking
or maybe in_atomic() when handling the exception to lazily disable such
faults temporarily.

A more invasive change would be to return a different error for such
faults like -EACCES and treat them differently in the caller.
Linus Torvalds Oct. 12, 2021, 5:58 p.m. UTC | #32
On Tue, Oct 12, 2021 at 10:27 AM Catalin Marinas
<catalin.marinas@arm.com> wrote:
>
> Apart from fault_in_pages_*(), there's also fault_in_user_writeable()
> called from the futex code which uses the GUP mechanism as the write
> would be destructive. It looks like it could potentially trigger the
> same infinite loop on -EFAULT.

Hmm.

I think the reason we do fault_in_user_writeable() using GUP is that

 (a) we can avoid the page fault overhead

 (b) we don't have any good "atomic_inc_user()" interface or similar
that could do a write with a zero increment or something like that.

We do have that "arch_futex_atomic_op_inuser()" thing, of course. It's
all kinds of crazy, but we *could* do

       arch_futex_atomic_op_inuser(FUTEX_OP_ADD, 0, &dummy, uaddr);

instead of doing the fault_in_user_writeable().

That might be a good idea anyway. I dunno.
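
[Editor's sketch] As a sketch, the probe could be as small as the following (the function name is made up, and whether page faults must be enabled around the arch op at this call site is glossed over; the real futex code would retry under its own locking):

/*
 * Illustrative only: fault in the futex word with a zero-add, so the
 * access runs with user privileges and sub-page checks (e.g. MTE) apply,
 * instead of going through the GUP-based fault_in_user_writeable().
 * Assumes kernel/futex.c context where arch_futex_atomic_op_inuser()
 * is available.
 */
static int futex_probe_word(u32 __user *uaddr)
{
	int dummy;

	/* FUTEX_OP_ADD with an operand of 0 leaves *uaddr unchanged. */
	return arch_futex_atomic_op_inuser(FUTEX_OP_ADD, 0, &dummy, uaddr);
}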

But I agree other options exist:

> I wonder whether we should actually just disable tag checking around the
> problematic accesses. What these callers seem to have in common is using
> pagefault_disable/enable(). We could abuse this to disable tag checking
> or maybe in_atomic() when handling the exception to lazily disable such
> faults temporarily.

Hmm. That would work for MTE, but possibly be very inconvenient for
other situations.

> A more invasive change would be to return a different error for such
> faults like -EACCESS and treat them differently in the caller.

That's _really_ hard for things like "copy_to_user()", that isn't a
single operation, and is supposed to return the bytes left.

Adding another error return would be nasty.

We've had hacks like "squirrel away the actual error code in the task
structure", but that tends to be unmaintainable because we have
interrupts (and NMI's) doing their own possibly nested atomics, so
even disabling preemption won't actually fix some of the nesting
issues.

All of these things make me think that the proper fix ends up being to
make sure that our "fault_in_xyz()" functions simply should always
handle all faults.

Another option may be to teach the GUP code to actually check
architecture-specific sub-page ranges.

        Linus
Catalin Marinas Oct. 18, 2021, 5:13 p.m. UTC | #33
On Tue, Oct 12, 2021 at 10:58:46AM -0700, Linus Torvalds wrote:
> On Tue, Oct 12, 2021 at 10:27 AM Catalin Marinas
> <catalin.marinas@arm.com> wrote:
> > Apart from fault_in_pages_*(), there's also fault_in_user_writeable()
> > called from the futex code which uses the GUP mechanism as the write
> > would be destructive. It looks like it could potentially trigger the
> > same infinite loop on -EFAULT.
> 
> Hmm.
> 
> I think the reason we do fault_in_user_writeable() using GUP is that
> 
>  (a) we can avoid the page fault overhead
> 
>  (b) we don't have any good "atomic_inc_user()" interface or similar
> that could do a write with a zero increment or something like that.
> 
> We do have that "arch_futex_atomic_op_inuser()" thing, of course. It's
> all kinds of crazy, but we *could* do
> 
>        arch_futex_atomic_op_inuser(FUTEX_OP_ADD, 0, &dummy, uaddr);
> 
> instead of doing the fault_in_user_writeable().
> 
> That might be a good idea anyway. I dunno.

I gave this a quick try for futex (though MTE is not affected at the
moment):

https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/log/?h=devel/sub-page-faults

However, I still have doubts about fault_in_pages_*() probing every 16
bytes, especially if one decides to change these routines to be
GUP-based.

> > A more invasive change would be to return a different error for such
> > faults like -EACCES and treat them differently in the caller.
> 
> That's _really_ hard for things like "copy_to_user()", that isn't a
> single operation, and is supposed to return the bytes left.
> 
> Adding another error return would be nasty.
> 
> We've had hacks like "squirrel away the actual error code in the task
> structure", but that tends to be unmaintainable because we have
> interrupts (and NMI's) doing their own possibly nested atomics, so
> even disabling preemption won't actually fix some of the nesting
> issues.

I think we can do something similar to the __get_user_error() on arm64.
We can keep the __copy_to_user_inatomic() etc. returning the number of
bytes left but change the exception handling path in those routines to
set an error code or boolean via a pointer passed in at uaccess routine call
time. The caller would do something along these lines:

	bool page_fault;
	left = copy_to_user_inatomic(dst, src, size, &page_fault);
	if (left && page_fault)
		goto repeat_fault_in;

copy_to_user_nofault() could also change its return value from -EFAULT to
something else based on whether page_fault was set or not.

Most architectures will use a generic copy_to_user_inatomic() wrapper
where page_fault == true for any fault. Arm64 needs some adjustment to
the uaccess fault handling to pass the fault code down to the exception
code. This way, at least for arm64, I don't think an interrupt or NMI
would be problematic.
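
[Editor's sketch] A sketch of the generic wrapper this describes (the signature with a bool out-parameter is hypothetical, not an existing kernel API; generic architectures would simply report every fault as a page fault):

static inline unsigned long
copy_to_user_inatomic(void __user *to, const void *from, unsigned long n,
		      bool *page_fault)
{
	unsigned long left;

	pagefault_disable();
	left = __copy_to_user_inatomic(to, from, n);
	pagefault_enable();
	/*
	 * Without arch support we cannot tell fault types apart, so any
	 * short copy is reported as a (retryable) page fault.
	 */
	*page_fault = left != 0;
	return left;
}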

> All of these things make me think that the proper fix ends up being to
> make sure that our "fault_in_xyz()" functions simply should always
> handle all faults.
> 
> Another option may be to teach the GUP code to actually check
> architecture-specific sub-page ranges.

Teaching GUP about this is likely to be expensive. A put_user() for
probing on arm64 uses a STTR instruction that's run with user privileges
on the user address and the user tag checking mode. The GUP code for
MTE, OTOH, would need to explicitly read the tag in memory and compare
it with the user pointer tag (which is normally cleared in the GUP code
by untagged_addr()).

To me it makes more sense for the fault_in_*() functions to only deal
with those permissions the kernel controls, i.e. the pte. Sub-page
permissions like MTE or CHERI are controlled by the user directly, so
the kernel cannot fix them up anyway. Rather than overloading
fault_in_*() with additional checks, I think we should expand the
in-atomic uaccess API to cover the type of fault.
Andreas Gruenbacher Oct. 21, 2021, 12:46 a.m. UTC | #34
On Tue, Oct 12, 2021 at 1:59 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Mon, Oct 11, 2021 at 2:08 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
> >
> > +#ifdef CONFIG_ARM64_MTE
> > +#define FAULT_GRANULE_SIZE     (16)
> > +#define FAULT_GRANULE_MASK     (~(FAULT_GRANULE_SIZE-1))
>
> [...]
>
> > If this looks in the right direction, I'll do some proper patches
> > tomorrow.
>
> Looks fine to me. It's going to be quite expensive and bad for caches, though.
>
> That said, fault_in_writable() is _supposed_ to all be for the slow
> path when things go south and the normal path didn't work out, so I
> think it's fine.

Let me get back to this; I'm actually not convinced that we need to
worry about sub-page-size fault granules in fault_in_pages_readable or
fault_in_pages_writeable.

From a filesystem point of view, we can get into trouble when a
user-space read or write triggers a page fault while we're holding
filesystem locks, and that page fault ends up calling back into the
filesystem. To deal with that, we're performing those user-space
accesses with page faults disabled. When a page fault would occur, we
get back an error instead, and then we try to fault in the offending
pages. If a page is resident and we still get a fault trying to access
it, trying to fault in the same page again isn't going to help and we
have a true error. We're clearly looking at memory at a page
granularity; faults at a sub-page level don't matter at this level of
abstraction (but they do show similar error behavior). To avoid
getting stuck, when it gets a short result or -EFAULT, the filesystem
implements the following backoff strategy: first, it tries to fault in
a number of pages. When the read or write still doesn't make progress,
it scales back and faults in a single page. Finally, when that still
doesn't help, it gives up. This strategy is needed for actual page
faults, but it also handles sub-page faults appropriately as long as
the user-space access functions give sensible results.
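
[Editor's sketch] Illustratively, that backoff amounts to a loop of roughly the following shape (the copy helper and the initial window size are placeholders, not the actual gfs2 code):

static ssize_t do_copy(struct iov_iter *iter);	/* placeholder locked copy */

static ssize_t copy_with_backoff(struct iov_iter *iter)
{
	size_t window = 8 * PAGE_SIZE;	/* arbitrary initial fault-in window */
	int tries = 0;
	ssize_t ret;

	for (;;) {
		pagefault_disable();
		ret = do_copy(iter);
		pagefault_enable();
		if (ret != -EFAULT)
			break;
		if (tries == 1)
			window = PAGE_SIZE;	/* scale back to a single page */
		else if (tries >= 2)
			break;			/* one page didn't help: give up */
		tries++;
		/* fault_in_iov_iter_writeable() returns bytes NOT faulted in */
		if (fault_in_iov_iter_writeable(iter, window) == window)
			break;			/* nothing could be faulted in */
	}
	return ret;
}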

What am I missing?

Thanks,
Andreas

> I do wonder how the sub-page granularity works. Is it sufficient to
> just read from it? Because then a _slightly_ better option might be to
> do one write per page (to catch page table writability) and then one
> read per "granule" (to catch pointer coloring or cache poisoning
> issues)?
>
> That said, since this is all preparatory to us wanting to write to it
> eventually anyway, maybe marking it all dirty in the caches is only
> good.
>
>                 Linus
>
Catalin Marinas Oct. 21, 2021, 10:05 a.m. UTC | #35
On Thu, Oct 21, 2021 at 02:46:10AM +0200, Andreas Gruenbacher wrote:
> On Tue, Oct 12, 2021 at 1:59 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> > On Mon, Oct 11, 2021 at 2:08 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > >
> > > +#ifdef CONFIG_ARM64_MTE
> > > +#define FAULT_GRANULE_SIZE     (16)
> > > +#define FAULT_GRANULE_MASK     (~(FAULT_GRANULE_SIZE-1))
> >
> > [...]
> >
> > > If this looks in the right direction, I'll do some proper patches
> > > tomorrow.
> >
> > Looks fine to me. It's going to be quite expensive and bad for caches, though.
> >
> > That said, fault_in_writable() is _supposed_ to all be for the slow
> > path when things go south and the normal path didn't work out, so I
> > think it's fine.
> 
> Let me get back to this; I'm actually not convinced that we need to
> worry about sub-page-size fault granules in fault_in_pages_readable or
> fault_in_pages_writeable.
> 
> From a filesystem point of view, we can get into trouble when a
> user-space read or write triggers a page fault while we're holding
> filesystem locks, and that page fault ends up calling back into the
> filesystem. To deal with that, we're performing those user-space
> accesses with page faults disabled.

Yes, this makes sense.

> When a page fault would occur, we
> get back an error instead, and then we try to fault in the offending
> pages. If a page is resident and we still get a fault trying to access
> it, trying to fault in the same page again isn't going to help and we
> have a true error.

You can't be sure the second fault is a true error. The unlocked
fault_in_*() may race with some LRU scheme making the pte not accessible
or a write-back making it clean/read-only. copy_to_user() with
pagefault_disabled() fails again but that's a benign fault. The
filesystem should re-attempt the fault-in (gup would correct the pte),
disable page faults and copy_to_user(), potentially in an infinite loop.
If you bail out on the second/third uaccess following a fault_in_*()
call, you may get some unexpected errors (though very rare). Maybe the
filesystems avoid this problem somehow but I couldn't figure it out.

> We're clearly looking at memory at a page
> granularity; faults at a sub-page level don't matter at this level of
> abstraction (but they do show similar error behavior). To avoid
> getting stuck, when it gets a short result or -EFAULT, the filesystem
> implements the following backoff strategy: first, it tries to fault in
> a number of pages. When the read or write still doesn't make progress,
> it scales back and faults in a single page. Finally, when that still
> doesn't help, it gives up. This strategy is needed for actual page
> faults, but it also handles sub-page faults appropriately as long as
> the user-space access functions give sensible results.

As I said above, I think with this approach there's a small chance of
incorrectly reporting an error when the fault is recoverable. If you
change it to an infinite loop, you'd run into the sub-page fault
problem.

There are some places with such infinite loops: futex_wake_op(),
search_ioctl() in the btrfs code. I still have to get my head around
generic_perform_write() but I think we get away here because it faults
in the page with a get_user() rather than gup (and copy_from_user() is
guaranteed to make progress if any bytes can still be accessed).
Andreas Gruenbacher Oct. 21, 2021, 2:42 p.m. UTC | #36
On Thu, Oct 21, 2021 at 12:06 PM Catalin Marinas
<catalin.marinas@arm.com> wrote:
> On Thu, Oct 21, 2021 at 02:46:10AM +0200, Andreas Gruenbacher wrote:
> > On Tue, Oct 12, 2021 at 1:59 AM Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > > On Mon, Oct 11, 2021 at 2:08 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > > >
> > > > +#ifdef CONFIG_ARM64_MTE
> > > > +#define FAULT_GRANULE_SIZE     (16)
> > > > +#define FAULT_GRANULE_MASK     (~(FAULT_GRANULE_SIZE-1))
> > >
> > > [...]
> > >
> > > > If this looks in the right direction, I'll do some proper patches
> > > > tomorrow.
> > >
> > > Looks fine to me. It's going to be quite expensive and bad for caches, though.
> > >
> > > That said, fault_in_writable() is _supposed_ to all be for the slow
> > > path when things go south and the normal path didn't work out, so I
> > > think it's fine.
> >
> > Let me get back to this; I'm actually not convinced that we need to
> > worry about sub-page-size fault granules in fault_in_pages_readable or
> > fault_in_pages_writeable.
> >
> > From a filesystem point of view, we can get into trouble when a
> > user-space read or write triggers a page fault while we're holding
> > filesystem locks, and that page fault ends up calling back into the
> > filesystem. To deal with that, we're performing those user-space
> > accesses with page faults disabled.
>
> Yes, this makes sense.
>
> > When a page fault would occur, we
> > get back an error instead, and then we try to fault in the offending
> > pages. If a page is resident and we still get a fault trying to access
> > it, trying to fault in the same page again isn't going to help and we
> > have a true error.
>
> You can't be sure the second fault is a true error. The unlocked
> fault_in_*() may race with some LRU scheme making the pte not accessible
> or a write-back making it clean/read-only. copy_to_user() with
> pagefault_disabled() fails again but that's a benign fault. The
> filesystem should re-attempt the fault-in (gup would correct the pte),
> disable page faults and copy_to_user(), potentially in an infinite loop.
> If you bail out on the second/third uaccess following a fault_in_*()
> call, you may get some unexpected errors (though very rare). Maybe the
> filesystems avoid this problem somehow but I couldn't figure it out.

Good point, we can indeed only bail out if both the user copy and the
fault-in fail.

But probing the entire memory range in fault domain granularity in the
page fault-in functions still doesn't actually make sense. Those
functions really only need to guarantee that we'll be able to make
progress eventually. From that point of view, it should be enough to
probe the first byte of the requested memory range, so when one of
those functions reports that the next N bytes should be accessible,
this really means that the first byte surely isn't permanently
inaccessible and that the rest is likely accessible. Functions
fault_in_readable and fault_in_writeable already work that way, so
this only leaves function fault_in_safe_writeable to worry about.

> > We're clearly looking at memory at a page
> > granularity; faults at a sub-page level don't matter at this level of
> > abstraction (but they do show similar error behavior). To avoid
> > getting stuck, when it gets a short result or -EFAULT, the filesystem
> > implements the following backoff strategy: first, it tries to fault in
> > a number of pages. When the read or write still doesn't make progress,
> > it scales back and faults in a single page. Finally, when that still
> > doesn't help, it gives up. This strategy is needed for actual page
> > faults, but it also handles sub-page faults appropriately as long as
> > the user-space access functions give sensible results.
>
> As I said above, I think with this approach there's a small chance of
> incorrectly reporting an error when the fault is recoverable. If you
> change it to an infinite loop, you'd run into the sub-page fault
> problem.

Yes, I see now, thanks.

> There are some places with such infinite loops: futex_wake_op(),
> search_ioctl() in the btrfs code. I still have to get my head around
> generic_perform_write() but I think we get away here because it faults
> in the page with a get_user() rather than gup (and copy_from_user() is
> guaranteed to make progress if any bytes can still be accessed).

Thanks,
Andreas
Catalin Marinas Oct. 21, 2021, 5:09 p.m. UTC | #37
On Thu, Oct 21, 2021 at 04:42:33PM +0200, Andreas Gruenbacher wrote:
> On Thu, Oct 21, 2021 at 12:06 PM Catalin Marinas
> <catalin.marinas@arm.com> wrote:
> > On Thu, Oct 21, 2021 at 02:46:10AM +0200, Andreas Gruenbacher wrote:
> > > When a page fault would occur, we
> > > get back an error instead, and then we try to fault in the offending
> > > pages. If a page is resident and we still get a fault trying to access
> > > it, trying to fault in the same page again isn't going to help and we
> > > have a true error.
> >
> > You can't be sure the second fault is a true error. The unlocked
> > fault_in_*() may race with some LRU scheme making the pte not accessible
> > or a write-back making it clean/read-only. copy_to_user() with
> > pagefault_disabled() fails again but that's a benign fault. The
> > filesystem should re-attempt the fault-in (gup would correct the pte),
> > disable page faults and copy_to_user(), potentially in an infinite loop.
> > If you bail out on the second/third uaccess following a fault_in_*()
> > call, you may get some unexpected errors (though very rare). Maybe the
> > filesystems avoid this problem somehow but I couldn't figure it out.
> 
> Good point, we can indeed only bail out if both the user copy and the
> fault-in fail.
> 
> But probing the entire memory range in fault domain granularity in the
> page fault-in functions still doesn't actually make sense. Those
> functions really only need to guarantee that we'll be able to make
> progress eventually. From that point of view, it should be enough to
> probe the first byte of the requested memory range, so when one of
> those functions reports that the next N bytes should be accessible,
> this really means that the first byte surely isn't permanently
> inaccessible and that the rest is likely accessible. Functions
> fault_in_readable and fault_in_writeable already work that way, so
> this only leaves function fault_in_safe_writeable to worry about.

I agree, that's why generic_perform_write() works. It does a get_user()
from the first byte in that range and the subsequent copy_from_user()
will make progress of at least one byte if it was readable. Eventually
it will hit the byte that faults. The gup-based fault_in_*() are a bit
more problematic.

Your series introduces fault_in_safe_writeable() and I think for MTE
doing a _single_ get_user(uaddr) (in addition to the gup checks for
write) would be sufficient as long as generic_file_read_iter() advances
by at least one byte (eventually).
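
[Editor's sketch] Concretely, that could be as small as a wrapper like this (sketch only; the real change would more likely go inside fault_in_safe_writeable() itself, and the wrapper name is made up):

static size_t fault_in_safe_writeable_checked(const char __user *uaddr,
					      size_t size)
{
	char c;

	/* One load run with user privileges, so MTE tag checks apply. */
	if (size && unlikely(__get_user(c, uaddr)))
		return size;
	/* The pte-level work is still done by the GUP-based helper. */
	return fault_in_safe_writeable(uaddr, size);
}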

This discussion started with the btrfs search_ioctl() where, even if
some bytes were written in copy_to_sk(), it always restarts from an
earlier position, reattempting to write the same bytes. Since
copy_to_sk() doesn't guarantee forward progress even if some bytes are
writable, Linus' suggestion was for fault_in_writable() to probe the
whole range. I consider this overkill since btrfs is the only one that
needs probing every 16 bytes. The other cases like the new
fault_in_safe_writeable() can be fixed by probing the first byte only
followed by gup.

I think we need to better define the semantics of the fault_in + uaccess
sequences. For uaccess, we document "a hard requirement that not storing
anything at all (i.e. returning size) should happen only when nothing
could be copied" (from linux/uaccess.h). I think we can add a
requirement for the new size_t-based fault_in_* variants without
mandating that the whole range is probed, something like: "returning
leftover < size guarantees that a subsequent user access at uaddr copies
at least one byte eventually". I said "eventually" but maybe we can come
up with some clearer wording for a liveness property.

Such requirement would ensure that infinite loops of fault_in_* +
uaccess make progress as long as they don't reset the probed range. Code
like btrfs search_ioctl() would need to be adjusted to avoid such range
reset and guarantee forward progress.
Andreas Gruenbacher Oct. 21, 2021, 6 p.m. UTC | #38
On Thu, Oct 21, 2021 at 7:09 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
> On Thu, Oct 21, 2021 at 04:42:33PM +0200, Andreas Gruenbacher wrote:
> > On Thu, Oct 21, 2021 at 12:06 PM Catalin Marinas
> > <catalin.marinas@arm.com> wrote:
> > > On Thu, Oct 21, 2021 at 02:46:10AM +0200, Andreas Gruenbacher wrote:
> > > > When a page fault would occur, we
> > > > get back an error instead, and then we try to fault in the offending
> > > > pages. If a page is resident and we still get a fault trying to access
> > > > it, trying to fault in the same page again isn't going to help and we
> > > > have a true error.
> > >
> > > You can't be sure the second fault is a true error. The unlocked
> > > fault_in_*() may race with some LRU scheme making the pte not accessible
> > > or a write-back making it clean/read-only. copy_to_user() with
> > > pagefault_disabled() fails again but that's a benign fault. The
> > > filesystem should re-attempt the fault-in (gup would correct the pte),
> > > disable page faults and copy_to_user(), potentially in an infinite loop.
> > > If you bail out on the second/third uaccess following a fault_in_*()
> > > call, you may get some unexpected errors (though very rare). Maybe the
> > > filesystems avoid this problem somehow but I couldn't figure it out.
> >
> > Good point, we can indeed only bail out if both the user copy and the
> > fault-in fail.
> >
> > But probing the entire memory range in fault domain granularity in the
> > page fault-in functions still doesn't actually make sense. Those
> > functions really only need to guarantee that we'll be able to make
> > progress eventually. From that point of view, it should be enough to
> > probe the first byte of the requested memory range, so when one of
> > those functions reports that the next N bytes should be accessible,
> > this really means that the first byte surely isn't permanently
> > inaccessible and that the rest is likely accessible. Functions
> > fault_in_readable and fault_in_writeable already work that way, so
> > this only leaves function fault_in_safe_writeable to worry about.
>
> I agree, that's why generic_perform_write() works. It does a get_user()
> from the first byte in that range and the subsequent copy_from_user()
> will make progress of at least one byte if it was readable. Eventually
> it will hit the byte that faults. The gup-based fault_in_*() are a bit
> more problematic.
>
> Your series introduces fault_in_safe_writeable() and I think for MTE
> doing a _single_ get_user(uaddr) (in addition to the gup checks for
> write) would be sufficient as long as generic_file_read_iter() advances
> by at least one byte (eventually).
>
> This discussion started with the btrfs search_ioctl() where, even if
> some bytes were written in copy_to_sk(), it always restarts from an
> earlier position, reattempting to write the same bytes. Since
> copy_to_sk() doesn't guarantee forward progress even if some bytes are
> writable, Linus' suggestion was for fault_in_writable() to probe the
> whole range. I consider this overkill since btrfs is the only one that
> needs probing every 16 bytes. The other cases like the new
> fault_in_safe_writeable() can be fixed by probing the first byte only
> followed by gup.

Hmm. Direct I/O request sizes are multiples of the underlying device
block size, so we'll also get stuck there if fault-in won't give us a
full block. This is getting pretty ugly. So scratch that idea; let's
stick with probing the whole range.

Thanks,
Andreas

> I think we need to better define the semantics of the fault_in + uaccess
> sequences. For uaccess, we document "a hard requirement that not storing
> anything at all (i.e. returning size) should happen only when nothing
> could be copied" (from linux/uaccess.h). I think we can add a
> requirement for the new size_t-based fault_in_* variants without
> mandating that the whole range is probed, something like: "returning
> leftover < size guarantees that a subsequent user access at uaddr copies
> at least one byte eventually". I said "eventually" but maybe we can come
> up with some clearer wording for a liveness property.
>
> Such requirement would ensure that infinite loops of fault_in_* +
> uaccess make progress as long as they don't reset the probed range. Code
> like btrfs search_ioctl() would need to be adjusted to avoid such range
> reset and guarantee forward progress.
>
> --
> Catalin
>
Linus Torvalds Oct. 22, 2021, 2:30 a.m. UTC | #39
On Thu, Oct 21, 2021 at 4:42 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
>
> But probing the entire memory range in fault domain granularity in the
> page fault-in functions still doesn't actually make sense. Those
> functions really only need to guarantee that we'll be able to make
> progress eventually. From that point of view, it should be enough to
> probe the first byte of the requested memory range

That's probably fine.

Although it should be more than one byte - "copy_from_user()" might do
word-at-a-time optimizations, so you could have an infinite loop of

 (a) copy_from_user() fails because the chunk it tried to get failed partly

 (b) fault_in() probing succeeds, because the beginning part is fine

so I agree that the fault-in code doesn't need to do the whole area,
but it needs to at least do some <N bytes, up to length> thing, to
handle the situation where the copy_to/from_user requires more than a
single byte.

                 Linus
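
[Editor's sketch] In other words, the probe would cover min(size, N) bytes rather than a single byte; a sketch, with the probe length as a placeholder constant:

/* Placeholder: must cover the largest access a word-at-a-time copy makes. */
#define FAULT_IN_MIN_PROBE	(2 * sizeof(long))

static size_t fault_in_some_readable(const char __user *uaddr, size_t size)
{
	size_t probe = min_t(size_t, size, FAULT_IN_MIN_PROBE);

	/* Returns the number of bytes not faulted in, like fault_in_readable(). */
	return fault_in_readable(uaddr, probe);
}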
Catalin Marinas Oct. 22, 2021, 9:34 a.m. UTC | #40
On Thu, Oct 21, 2021 at 04:30:30PM -1000, Linus Torvalds wrote:
> On Thu, Oct 21, 2021 at 4:42 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> > But probing the entire memory range in fault domain granularity in the
> > page fault-in functions still doesn't actually make sense. Those
> > functions really only need to guarantee that we'll be able to make
> > progress eventually. From that point of view, it should be enough to
> > probe the first byte of the requested memory range
> 
> That's probably fine.
> 
> Although it should be more than one byte - "copy_from_user()" might do
> word-at-a-time optimizations, so you could have an infinite loop of
> 
>  (a) copy_from_user() fails because the chunk it tried to get failed partly
> 
>  (b) fault_in() probing succeeds, because the beginning part is fine
> 
> so I agree that the fault-in code doesn't need to do the whole area,
> but it needs to at least do some <N bytes, up to length> thing, to
> handle the situation where the copy_to/from_user requires more than a
> single byte.

From a discussion with Al some months ago, if there are bytes still
accessible, copy_from_user() is not allowed to fail fully (i.e. return
the requested copy size) even when it uses word-at-a-time. In the worst
case, it should return size - 1. If the fault_in() then continues
probing from uaddr + 1, it should eventually hit the faulty address.

The problem appears when fault_in() restarts from uaddr rather than
where copy_from_user() stopped. That's what the btrfs search_ioctl()
does. I also need to check the direct I/O cases that Andreas mentioned,
maybe they can be changed not to attempt the fault_in() from the
beginning of the block.
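
[Editor's sketch] Put as code, the difference is between restarting the probe at uaddr and resuming at the first uncopied byte; a sketch of the latter (the helper name is illustrative):

static int copy_in_resumable(char *dst, const char __user *uaddr, size_t size)
{
	size_t done = 0, left;

	while (done < size) {
		pagefault_disable();
		left = copy_from_user(dst + done, uaddr + done, size - done);
		pagefault_enable();
		done += (size - done) - left;
		if (!left)
			break;
		/* Resume probing at the first uncopied byte, not at uaddr. */
		if (fault_in_readable(uaddr + done, size - done) == size - done)
			return -EFAULT;	/* nothing more can be faulted in */
	}
	return 0;
}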
Catalin Marinas Oct. 22, 2021, 6:41 p.m. UTC | #41
On Thu, Oct 21, 2021 at 08:00:50PM +0200, Andreas Gruenbacher wrote:
> On Thu, Oct 21, 2021 at 7:09 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > This discussion started with the btrfs search_ioctl() where, even if
> > some bytes were written in copy_to_sk(), it always restarts from an
> > earlier position, reattempting to write the same bytes. Since
> > copy_to_sk() doesn't guarantee forward progress even if some bytes are
> > writable, Linus' suggestion was for fault_in_writable() to probe the
> > whole range. I consider this overkill since btrfs is the only one that
> > needs probing every 16 bytes. The other cases like the new
> > fault_in_safe_writeable() can be fixed by probing the first byte only
> > followed by gup.
> 
> Hmm. Direct I/O request sizes are multiples of the underlying device
> block size, so we'll also get stuck there if fault-in won't give us a
> full block. This is getting pretty ugly. So scratch that idea; let's
> stick with probing the whole range.

Ah, I wasn't aware of this. I got lost in the call trees but I noticed
__iomap_dio_rw() does an iov_iter_revert() only if direction is READ. Is
this the case for writes as well?
Andreas Gruenbacher Oct. 25, 2021, 7:37 p.m. UTC | #42
On Fri, Oct 22, 2021 at 8:41 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
> On Thu, Oct 21, 2021 at 08:00:50PM +0200, Andreas Gruenbacher wrote:
> > On Thu, Oct 21, 2021 at 7:09 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > > This discussion started with the btrfs search_ioctl() where, even if
> > > some bytes were written in copy_to_sk(), it always restarts from an
> > > earlier position, reattempting to write the same bytes. Since
> > > copy_to_sk() doesn't guarantee forward progress even if some bytes are
> > > writable, Linus' suggestion was for fault_in_writable() to probe the
> > > whole range. I consider this overkill since btrfs is the only one that
> > > needs probing every 16 bytes. The other cases like the new
> > > fault_in_safe_writeable() can be fixed by probing the first byte only
> > > followed by gup.
> >
> > Hmm. Direct I/O request sizes are multiples of the underlying device
> > block size, so we'll also get stuck there if fault-in won't give us a
> > full block. This is getting pretty ugly. So scratch that idea; let's
> > stick with probing the whole range.
>
> Ah, I wasn't aware of this. I got lost in the call trees but I noticed
> __iomap_dio_rw() does an iov_iter_revert() only if direction is READ. Is
> this the case for writes as well?

It's the EOF case, so it only applies to reads:

        /*
         * We only report that we've read data up to i_size.
         * Revert iter to a state corresponding to that as some callers (such
         * as the splice code) rely on it.
         */
        if (iov_iter_rw(iter) == READ && iomi.pos >= dio->i_size)
                iov_iter_revert(iter, iomi.pos - dio->i_size);

Andreas
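
[Editor's sketch] For reference, the usage pattern the series has in mind for ->read_iter looks roughly like this (a sketch following the pattern described above; the locking and read helpers are placeholders, not the actual gfs2 code):

static ssize_t foo_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
	size_t window = iov_iter_count(to);
	ssize_t ret;

retry:
	foo_lock(iocb);				/* placeholder filesystem lock */
	pagefault_disable();
	ret = foo_do_read(iocb, to);		/* placeholder locked read */
	pagefault_enable();
	foo_unlock(iocb);

	if (ret == -EFAULT && window) {
		/* Fault the destination in without writing to it ... */
		if (fault_in_iov_iter_writeable(to, window) != window)
			goto retry;		/* ... and try again. */
	}
	return ret;
}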
diff mbox series

Patch

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 7c9edc9694d9..a629807edb8c 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -737,6 +737,7 @@  extern void add_page_wait_queue(struct page *page, wait_queue_entry_t *waiter);
  * Fault in userspace address range.
  */
 size_t fault_in_writeable(char __user *uaddr, size_t size);
+size_t fault_in_safe_writeable(const char __user *uaddr, size_t size);
 size_t fault_in_readable(const char __user *uaddr, size_t size);
 
 int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
diff --git a/include/linux/uio.h b/include/linux/uio.h
index 12d30246c2e9..ffa431aeb067 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -120,6 +120,7 @@  size_t copy_page_from_iter_atomic(struct page *page, unsigned offset,
 void iov_iter_advance(struct iov_iter *i, size_t bytes);
 void iov_iter_revert(struct iov_iter *i, size_t bytes);
 size_t fault_in_iov_iter_readable(const struct iov_iter *i, size_t bytes);
+size_t fault_in_iov_iter_writeable(const struct iov_iter *i, size_t bytes);
 size_t iov_iter_single_seg_count(const struct iov_iter *i);
 size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
 			 struct iov_iter *i);
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 082ab155496d..968f2d2595cd 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -467,6 +467,45 @@  size_t fault_in_iov_iter_readable(const struct iov_iter *i, size_t size)
 }
 EXPORT_SYMBOL(fault_in_iov_iter_readable);
 
+/*
+ * fault_in_iov_iter_writeable - fault in iov iterator for writing
+ * @i: iterator
+ * @size: maximum length
+ *
+ * Faults in the iterator using get_user_pages(), i.e., without triggering
+ * hardware page faults.  This is primarily useful when we know that some or
+ * all of the pages in @i aren't in memory.
+ *
+ * Returns the number of bytes not faulted in (like copy_to_user() and
+ * copy_from_user()).
+ *
+ * Always returns 0 for non-user space iterators.
+ */
+size_t fault_in_iov_iter_writeable(const struct iov_iter *i, size_t size)
+{
+	if (iter_is_iovec(i)) {
+		size_t count = min(size, iov_iter_count(i));
+		const struct iovec *p;
+		size_t skip;
+
+		size -= count;
+		for (p = i->iov, skip = i->iov_offset; count; p++, skip = 0) {
+			size_t len = min(count, p->iov_len - skip);
+			size_t ret;
+
+			if (unlikely(!len))
+				continue;
+			ret = fault_in_safe_writeable(p->iov_base + skip, len);
+			count -= len - ret;
+			if (ret)
+				break;
+		}
+		return count + size;
+	}
+	return 0;
+}
+EXPORT_SYMBOL(fault_in_iov_iter_writeable);
+
 void iov_iter_init(struct iov_iter *i, unsigned int direction,
 			const struct iovec *iov, unsigned long nr_segs,
 			size_t count)
diff --git a/mm/gup.c b/mm/gup.c
index 0cf47955e5a1..03ab03b68dc7 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1707,6 +1707,69 @@  size_t fault_in_writeable(char __user *uaddr, size_t size)
 }
 EXPORT_SYMBOL(fault_in_writeable);
 
+/*
+ * fault_in_safe_writeable - fault in an address range for writing
+ * @uaddr: start of address range
+ * @size: length of address range
+ *
+ * Faults in an address range using get_user_pages, i.e., without triggering
+ * hardware page faults.  This is primarily useful when we know that some or
+ * all of the pages in the address range aren't in memory.
+ *
+ * Unlike fault_in_writeable(), this function is non-destructive.
+ *
+ * Note that we don't pin or otherwise hold the pages referenced that we fault
+ * in.  There's no guarantee that they'll stay in memory for any duration of
+ * time.
+ *
+ * Returns the number of bytes not faulted in (like copy_to_user() and
+ * copy_from_user()).
+ */
+size_t fault_in_safe_writeable(const char __user *uaddr, size_t size)
+{
+	unsigned long start = (unsigned long)uaddr;
+	unsigned long end, nstart, nend;
+	struct mm_struct *mm = current->mm;
+	struct vm_area_struct *vma = NULL;
+	int locked = 0;
+
+	nstart = start & PAGE_MASK;
+	end = PAGE_ALIGN(start + size);
+	if (end < nstart)
+		end = 0;
+	for (; nstart != end; nstart = nend) {
+		unsigned long nr_pages;
+		long ret;
+
+		if (!locked) {
+			locked = 1;
+			mmap_read_lock(mm);
+			vma = find_vma(mm, nstart);
+		} else if (nstart >= vma->vm_end)
+			vma = vma->vm_next;
+		if (!vma || vma->vm_start >= end)
+			break;
+		nend = end ? min(end, vma->vm_end) : vma->vm_end;
+		if (vma->vm_flags & (VM_IO | VM_PFNMAP))
+			continue;
+		if (nstart < vma->vm_start)
+			nstart = vma->vm_start;
+		nr_pages = (nend - nstart) / PAGE_SIZE;
+		ret = __get_user_pages_locked(mm, nstart, nr_pages,
+					      NULL, NULL, &locked,
+					      FOLL_TOUCH | FOLL_WRITE);
+		if (ret <= 0)
+			break;
+		nend = nstart + ret * PAGE_SIZE;
+	}
+	if (locked)
+		mmap_read_unlock(mm);
+	if (nstart == end)
+		return 0;
+	return size - min_t(size_t, nstart - start, size);
+}
+EXPORT_SYMBOL(fault_in_safe_writeable);
+
 /**
  * fault_in_readable - fault in userspace address range for reading
  * @uaddr: start of user address range