Message ID | 20210827164926.1726765-6-agruenba@redhat.com (mailing list archive)
---|---
State | New, archived
Series | gfs2: Fix mmap + page fault deadlocks
On Fri, Aug 27, 2021 at 06:49:12PM +0200, Andreas Gruenbacher wrote:
> Introduce a new fault_in_iov_iter_writeable helper for safely faulting
> in an iterator for writing. Uses get_user_pages() to fault in the pages
> without actually writing to them, which would be destructive.
>
> We'll use fault_in_iov_iter_writeable in gfs2 once we've determined that
> the iterator passed to .read_iter isn't in memory.

Again, the calling conventions are wrong. Make it success/failure or
0/-EFAULT. And it's inconsistent for iovec and non-iovec cases as it is.
On Fri, Aug 27, 2021 at 11:52 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Again, the calling conventions are wrong. Make it success/failure or
> 0/-EFAULT. And it's inconsistent for iovec and non-iovec cases as it is.

Al, the 0/-EFAULT thing DOES NOT WORK.

The whole "success vs failure" model is broken. Because "success" for
some people is "everything worked". And for other people it is "at
least _part_ of it worked".

So no, 0/-EFAULT fundamentally cannot work, because the return needs to
be able to handle that ternary situation (ie "nothing" vs "something"
vs "everything").

This is *literally* the exact same thing that we have for
copy_to/from_user(). And Andreas' solution (based on my suggestion) is
the exact same one that we have had for that code since basically day
#1.

The whole "0/-EFAULT" is simpler, yes. And it's what "{get|put}_user()"
uses, yes. And it's more common to a lot of other functions that return
zero or an error.

But see above. People *need* that ternary result, and "bytes/pages
uncopied" is not only the traditional one we use elsewhere in similar
situations, it's the one that has the easiest error tests for existing
users (because zero remains "everything worked").

Andreas originally had that "how many bytes/pages succeeded" return
value instead, and yes, that's also ternary. But it means that now the
common "complete success" test ends up being a lot uglier, and the
semantics of the function changes completely where "0" no longer means
success, and that messes up much more.

So I really think you are barking entirely up the wrong tree. If there
is any inconsistency, maybe we should make _more_ cases use that "how
many bytes/pages not copied" logic, but in a lot of cases you don't
actually need the ternary decision value.

So the inconsistency is EXACTLY the same as the one we have always had
for get|put_user() vs copy_to|from_user(), and it exists for the EXACT
same reason.

IOW, please explain how you'd solve the ternary problem without making
the code a lot uglier.

               Linus
On Fri, Aug 27, 2021 at 12:05:32PM -0700, Linus Torvalds wrote:
> But see above. People *need* that ternary result, and "bytes/pages
> uncopied" is not only the traditional one we use elsewhere in similar
> situations, it's the one that has the easiest error tests for existing
> users (because zero remains "everything worked").

Could you show the cases where "partial copy, so it's OK" behaviour would
break anything? For that you would need a case where partial fault-in is
currently rejected, and where the checks downstream from there (for
failing copy-in/copy-out) would either be missing, would not handle a
partial fault-in correctly, or would slow a fast path down. I don't see
any such cases and I would be very surprised if such existed. If you see
any, please describe them - I could be wrong. And I would like to take a
good look at any such case and see how well it handles a possible short
copy after full fault-in.
On Fri, Aug 27, 2021 at 12:23 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Could you show the cases where "partial copy, so it's OK" behaviour would
> break anything?

Absolutely.

For example, it would cause an infinite loop in
restore_fpregs_from_user() if the "buf" argument is a situation where
the first page is fine, but the next page is not.

Why? Because __restore_fpregs_from_user() would take a fault, but then
fault_in_pages_readable() (renamed) would succeed, so you'd just do
that "retry" forever and ever.

Probably there are a number of other places too. That was literally
the *first* place I looked at. Seriously.

The current semantics are "check the whole area".

THOSE MUST NOT CHANGE.

               Linus
On Fri, Aug 27, 2021 at 12:33:00PM -0700, Linus Torvalds wrote:
> On Fri, Aug 27, 2021 at 12:23 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > Could you show the cases where "partial copy, so it's OK" behaviour would
> > break anything?
>
> Absolutely.
>
> For example, it would cause an infinite loop in
> restore_fpregs_from_user() if the "buf" argument is a situation where
> the first page is fine, but the next page is not.
>
> Why? Because __restore_fpregs_from_user() would take a fault, but then
> fault_in_pages_readable() (renamed) would succeed, so you'd just do
> that "retry" forever and ever.
>
> Probably there are a number of other places too. That was literally
> the *first* place I looked at.

OK...

Let me dig out the notes from the last time I looked through that area
and grep around a bit. Should be about an hour or two.
On Fri, Aug 27, 2021 at 07:37:25PM +0000, Al Viro wrote:
> On Fri, Aug 27, 2021 at 12:33:00PM -0700, Linus Torvalds wrote:
> > On Fri, Aug 27, 2021 at 12:23 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> > >
> > > Could you show the cases where "partial copy, so it's OK" behaviour would
> > > break anything?
> >
> > Absolutely.
> >
> > For example, it would cause an infinite loop in
> > restore_fpregs_from_user() if the "buf" argument is a situation where
> > the first page is fine, but the next page is not.
> >
> > Why? Because __restore_fpregs_from_user() would take a fault, but then
> > fault_in_pages_readable() (renamed) would succeed, so you'd just do
> > that "retry" forever and ever.
> >
> > Probably there are a number of other places too. That was literally
> > the *first* place I looked at.
>
> OK...
>
> Let me dig out the notes from the last time I looked through that area
> and grep around a bit. Should be about an hour or two.

OK, I've dug it out and rechecked the current mainline. Call trees:

fault_in_pages_readable()
        kvm_use_magic_page()
                Broken, as per mpe. Relevant part (see
                <87eeeqa7ng.fsf@mpe.ellerman.id.au> in your mailbox back
                in early May for the full story):
                |The current code is confused, ie. broken.
                ...
                |We want to check that the mapping succeeded, that the address is
                |readable (& writeable as well actually).
                ...
                |diff --git a/arch/powerpc/kernel/kvm.c b/arch/powerpc/kernel/kvm.c
                ...
                |-	if (!fault_in_pages_readable((const char *)KVM_MAGIC_PAGE, sizeof(u32))) {
                |+	if (get_kernel_nofault(c, (const char *)KVM_MAGIC_PAGE)) {
        [ppc32]swapcontext()
        [ppc32]debug_setcontext()
        [ppc64]swapcontext()
                Same situation in all three - it's going to kill the
                process if copy-in fails, so it tries to be gentler about
                it and treat fault-in failures as -EFAULT from syscall.
                AFAICS, it's pointless, but I would like comments from
                ppc folks. Note that bogus *contents* of the struct
                ucontext passed by user is almost certainly going to end
                up with segfault; trying to catch the cases when bogus
                address happens to point someplace unreadable is rather
                useless in that situation.
        restore_fpregs_from_user()
                The one you've caught; hadn't been there last time I'd
                checked (back in April). Its counterpart in
                copy_fpstate_to_sigframe() had been, though.
        armada_gem_pwrite_ioctl()
                Pointless, along with the access_ok() there - it does
                copy_from_user() on that area shortly afterwards and
                failure of either is not a fast path.
        copy_page_from_iter_iovec()
                Will do the right thing on short copy of any kind; we
                are fine with either semantics.
iov_iter_fault_in_readable()
        generic_perform_write()
                Any short copy that had not led to progress (== rejected
                by ->write_end()) will lead to next chunk shortened
                accordingly, so ->write_begin() would be asked to prepare
                for the amount we expect to be able to copy;
                ->write_end() should be fine with that. Failure to copy
                anything at all (possible due to eviction on memory
                pressure, etc.) leads to retry of the same chunk as the
                last time, and that's where we rely on fault-in rejecting
                "nothing could be faulted in" case. That one is fine with
                partial fault-in reported as success.
        f2fs_file_write_iter()
                Odd prealloc-related stuff. AFAICS, from the correctness
                POV either variant of semantics would do, but I'm not
                sure if either is the right match to what they are trying
                to do there.
        fuse_fill_write_pages()
                Similar to generic_perform_write() situation, only
                simpler (no ->write_end() counterpart there). All we care
                about is failure if nothing could be faulted in.
        btrfs_buffered_write()
                Again, similar to generic_perform_write(). More
                convoluted (after a short copy it switches to going
                page-by-page and getting destination pages uptodate,
                which will be equivalent to ->write_end() always
                accepting everything it's given from that point on), but
                it's the same "we care only about failure to fault in the
                first page" situation.
        ntfs_perform_write()
                Another generic_perform_write() analogue. Same situation
                wrt fault-in semantics.
        iomap_write_actor()
                Another generic_perform_write() relative. Same situation.
fault_in_pages_writeable()
        copy_fpstate_to_sigframe()
                Same kind of "retry everything from scratch on short
                copy" as in the other fpu/signal.c case.
        [btrfs]search_ioctl()
                Broken with memory poisoning, for either variant of
                semantics. Same for arm64 sub-page permission
                differences, I think.
        copy_page_to_iter_iovec()
                Will do the right thing on short copy of any kind; we
                are fine with either semantics.

So we have 3 callers where we want all-or-nothing semantics - two in
arch/x86/kernel/fpu/signal.c and one in btrfs. HWPOISON will be a problem
for all 3, AFAICS...

IOW, it looks like we have two different things mixed here - one that wants
to try and fault stuff in, with callers caring only about having _something_
faulted in (most of the users) and one that wants to make sure we *can* do
stores or loads on each byte in the affected area.

Just accessing a byte in each page really won't suffice for the second kind.
Neither will g-u-p use, unless we teach it about HWPOISON and other fun
beasts... Looks like we want that thing to be a separate primitive; for
btrfs I'd probably replace fault_in_pages_writeable() with clear_user()
as a quick fix for now...

Comments?
On Fri, Aug 27, 2021 at 09:48:55PM +0000, Al Viro wrote:
> [btrfs]search_ioctl()
> 	Broken with memory poisoning, for either variant of semantics. Same for
> 	arm64 sub-page permission differences, I think.

> So we have 3 callers where we want all-or-nothing semantics - two in
> arch/x86/kernel/fpu/signal.c and one in btrfs. HWPOISON will be a problem
> for all 3, AFAICS...
>
> IOW, it looks like we have two different things mixed here - one that wants
> to try and fault stuff in, with callers caring only about having _something_
> faulted in (most of the users) and one that wants to make sure we *can* do
> stores or loads on each byte in the affected area.
>
> Just accessing a byte in each page really won't suffice for the second kind.
> Neither will g-u-p use, unless we teach it about HWPOISON and other fun
> beasts... Looks like we want that thing to be a separate primitive; for
> btrfs I'd probably replace fault_in_pages_writeable() with clear_user()
> as a quick fix for now...
>
> Comments?

Wait a sec... Wasn't HWPOISON a per-page thing? arm64 definitely does have
smaller-than-page areas with different permissions, so btrfs search_ioctl()
has a problem there, but arch/x86/kernel/fpu/signal.c doesn't have to deal
with that...

Sigh... I really need more coffee...
On Fri, Aug 27, 2021 at 09:57:10PM +0000, Al Viro wrote:
> On Fri, Aug 27, 2021 at 09:48:55PM +0000, Al Viro wrote:
>
> > [btrfs]search_ioctl()
> > Broken with memory poisoning, for either variant of semantics. Same for
> > arm64 sub-page permission differences, I think.
> >
> > So we have 3 callers where we want all-or-nothing semantics - two in
> > arch/x86/kernel/fpu/signal.c and one in btrfs. HWPOISON will be a problem
> > for all 3, AFAICS...
> >
> > IOW, it looks like we have two different things mixed here - one that wants
> > to try and fault stuff in, with callers caring only about having _something_
> > faulted in (most of the users) and one that wants to make sure we *can* do
> > stores or loads on each byte in the affected area.
> >
> > Just accessing a byte in each page really won't suffice for the second kind.
> > Neither will g-u-p use, unless we teach it about HWPOISON and other fun
> > beasts... Looks like we want that thing to be a separate primitive; for
> > btrfs I'd probably replace fault_in_pages_writeable() with clear_user()
> > as a quick fix for now...
> >
> > Comments?
>
> Wait a sec... Wasn't HWPOISON a per-page thing? arm64 definitely does have
> smaller-than-page areas with different permissions, so btrfs search_ioctl()
> has a problem there, but arch/x86/kernel/fpu/signal.c doesn't have to deal
> with that...
>
> Sigh... I really need more coffee...

On Intel poison is tracked at the cache line granularity. Linux
inflates that to per-page (because it can only take a whole page away).
For faults triggered in ring3 this is pretty much the same thing because
mm/memory_failure.c unmaps the page ... so while you see a #MC on first
access, you get #PF when you retry. The x86 fault handler sees a magic
signature in the page table and sends a SIGBUS.

But it's all different if the #MC is triggered from ring0. The machine
check handler can't unmap the page. It just schedules task_work to do
the unmap when next returning to the user.
But if your kernel code loops and tries again without a return to user,
then you get another #MC.

-Tony
> But if your kernel code loops and tries again without a return to user,
> then you get another #MC.

I've been trying to push this patch:

https://lore.kernel.org/linux-edac/20210818002942.1607544-1-tony.luck@intel.com/

which turns the infinite loops of machine checks into a panic.

-Tony
AFAICS, a48b73eca4ce "btrfs: fix potential deadlock in the search ioctl"
has introduced a bug at least on arm64. Relevant bits: in search_ioctl()
we have

	while (1) {
		ret = fault_in_pages_writeable(ubuf + sk_offset,
					       *buf_size - sk_offset);
		if (ret)
			break;

		ret = btrfs_search_forward(root, &key, path, sk->min_transid);
		if (ret != 0) {
			if (ret > 0)
				ret = 0;
			goto err;
		}
		ret = copy_to_sk(path, &key, sk, buf_size, ubuf,
				 &sk_offset, &num_found);
		btrfs_release_path(path);
		if (ret)
			break;
	}

and in copy_to_sk() -

		sh.objectid = key->objectid;
		sh.offset = key->offset;
		sh.type = key->type;
		sh.len = item_len;
		sh.transid = found_transid;

		/*
		 * Copy search result header. If we fault then loop again so we
		 * can fault in the pages and -EFAULT there if there's a
		 * problem. Otherwise we'll fault and then copy the buffer in
		 * properly this next time through
		 */
		if (copy_to_user_nofault(ubuf + *sk_offset, &sh, sizeof(sh))) {
			ret = 0;
			goto out;
		}

with sk_offset left unchanged if the very first copy_to_user_nofault()
fails.

Now, consider a situation on arm64 where ubuf points to the beginning of
page, ubuf[0] can be accessed, but ubuf[16] can not (possible with MTE,
AFAICS). We do fault_in_pages_writeable(), which succeeds. When we get
to copy_to_user_nofault() we fail as soon as it gets past the first 16
bytes. And we repeat everything from scratch, with no progress made,
since short copies are treated as "discard and repeat" here.

Am I misreading what's going on there?
On Fri, Aug 27 2021 at 16:22, Tony Luck wrote:
> On Fri, Aug 27, 2021 at 09:57:10PM +0000, Al Viro wrote:
>> On Fri, Aug 27, 2021 at 09:48:55PM +0000, Al Viro wrote:
>>
>> > [btrfs]search_ioctl()
>> > Broken with memory poisoning, for either variant of semantics. Same for
>> > arm64 sub-page permission differences, I think.
>>
>> > So we have 3 callers where we want all-or-nothing semantics - two in
>> > arch/x86/kernel/fpu/signal.c and one in btrfs. HWPOISON will be a problem
>> > for all 3, AFAICS...
>> >
>> > IOW, it looks like we have two different things mixed here - one that wants
>> > to try and fault stuff in, with callers caring only about having _something_
>> > faulted in (most of the users) and one that wants to make sure we *can* do
>> > stores or loads on each byte in the affected area.
>> >
>> > Just accessing a byte in each page really won't suffice for the second kind.
>> > Neither will g-u-p use, unless we teach it about HWPOISON and other fun
>> > beasts... Looks like we want that thing to be a separate primitive; for
>> > btrfs I'd probably replace fault_in_pages_writeable() with clear_user()
>> > as a quick fix for now...
>> >
>> > Comments?
>>
>> Wait a sec... Wasn't HWPOISON a per-page thing? arm64 definitely does have
>> smaller-than-page areas with different permissions, so btrfs search_ioctl()
>> has a problem there, but arch/x86/kernel/fpu/signal.c doesn't have to deal
>> with that...
>>
>> Sigh... I really need more coffee...
>
> On Intel poison is tracked at the cache line granularity. Linux
> inflates that to per-page (because it can only take a whole page away).
> For faults triggered in ring3 this is pretty much the same thing because
> mm/memory_failure.c unmaps the page ... so while you see a #MC on first
> access, you get #PF when you retry. The x86 fault handler sees a magic
> signature in the page table and sends a SIGBUS.
>
> But it's all different if the #MC is triggered from ring0.
> The machine check handler can't unmap the page. It just schedules
> task_work to do the unmap when next returning to the user.
>
> But if your kernel code loops and tries again without a return to user,
> then you get another #MC.

But that's not the case for restore_fpregs_from_user() when it hits #MC.

restore_fpregs_from_user()
	...
	ret = __restore_fpregs_from_user(buf, xrestore, fx_only)

	/* Try to handle #PF, but anything else is fatal. */
	if (ret != -EFAULT)
		return -EINVAL;

Now let's look at __restore_fpregs_from_user()

__restore_fpregs_from_user()
	return $FPUVARIANT_rstor_from_user_sigframe()

which all end up in user_insn(). user_insn() returns 0 or the negated
trap number, which results in -EFAULT for #PF, but for #MC the negated
trap number is -18 i.e. != -EFAULT. IOW, there is no endless loop.

This used to be a problem before commit:

  aee8c67a4faa ("x86/fpu: Return proper error codes from user access functions")

and as the changelog says the initial reason for this was #GP going into
the fault path, but I'm pretty sure that I also discussed the #MC angle with
Borislav back then. Should have added some more comments there obviously.

Thanks,

        tglx
On Sat, Aug 28, 2021 at 11:47:03PM +0200, Thomas Gleixner wrote:

> /* Try to handle #PF, but anything else is fatal. */
> if (ret != -EFAULT)
> 	return -EINVAL;

> which all end up in user_insn(). user_insn() returns 0 or the negated
> trap number, which results in -EFAULT for #PF, but for #MC the negated
> trap number is -18 i.e. != -EFAULT. IOW, there is no endless loop.
>
> This used to be a problem before commit:
>
>   aee8c67a4faa ("x86/fpu: Return proper error codes from user access functions")
>
> and as the changelog says the initial reason for this was #GP going into
> the fault path, but I'm pretty sure that I also discussed the #MC angle with
> Borislav back then. Should have added some more comments there
> obviously.

... or at least have that check spelled

	if (ret != -X86_TRAP_PF)
		return -EINVAL;

Unless I'm misreading your explanation, that is...
On Sat, Aug 28, 2021 at 10:04:41PM +0000, Al Viro wrote:
> On Sat, Aug 28, 2021 at 11:47:03PM +0200, Thomas Gleixner wrote:
>
> > /* Try to handle #PF, but anything else is fatal. */
> > if (ret != -EFAULT)
> > 	return -EINVAL;
>
> > which all end up in user_insn(). user_insn() returns 0 or the negated
> > trap number, which results in -EFAULT for #PF, but for #MC the negated
> > trap number is -18 i.e. != -EFAULT. IOW, there is no endless loop.
> >
> > This used to be a problem before commit:
> >
> >   aee8c67a4faa ("x86/fpu: Return proper error codes from user access functions")
> >
> > and as the changelog says the initial reason for this was #GP going into
> > the fault path, but I'm pretty sure that I also discussed the #MC angle with
> > Borislav back then. Should have added some more comments there
> > obviously.
>
> ... or at least have that check spelled
>
> 	if (ret != -X86_TRAP_PF)
> 		return -EINVAL;
>
> Unless I'm misreading your explanation, that is...

BTW, is #MC triggered on stores to a poisoned cacheline? Existence of CLZERO
would seem to argue against that...
On Sat, Aug 28, 2021 at 10:11:36PM +0000, Al Viro wrote:
> On Sat, Aug 28, 2021 at 10:04:41PM +0000, Al Viro wrote:
> > On Sat, Aug 28, 2021 at 11:47:03PM +0200, Thomas Gleixner wrote:
> >
> > > /* Try to handle #PF, but anything else is fatal. */
> > > if (ret != -EFAULT)
> > > 	return -EINVAL;
> >
> > > which all end up in user_insn(). user_insn() returns 0 or the negated
> > > trap number, which results in -EFAULT for #PF, but for #MC the negated
> > > trap number is -18 i.e. != -EFAULT. IOW, there is no endless loop.
> > >
> > > This used to be a problem before commit:
> > >
> > >   aee8c67a4faa ("x86/fpu: Return proper error codes from user access functions")
> > >
> > > and as the changelog says the initial reason for this was #GP going into
> > > the fault path, but I'm pretty sure that I also discussed the #MC angle with
> > > Borislav back then. Should have added some more comments there
> > > obviously.
> >
> > ... or at least have that check spelled
> >
> > 	if (ret != -X86_TRAP_PF)
> > 		return -EINVAL;
> >
> > Unless I'm misreading your explanation, that is...
>
> BTW, is #MC triggered on stores to a poisoned cacheline? Existence of CLZERO
> would seem to argue against that...

How about taking __clear_user() out of copy_fpregs_to_sigframe()
and replacing the call of fault_in_pages_writeable() with

	if (!clear_user(buf_fx, fpu_user_xstate_size))
		goto retry;
	return -EFAULT;

in the caller?
On Sat, Aug 28, 2021 at 3:12 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> BTW, is #MC triggered on stores to a poisoned cacheline? Existence of CLZERO
> would seem to argue against that...

No #MC on stores. Just on loads. Note that you can't clear poison
state with a series of small writes to the cache line. But a single
64-byte store might do it (architects didn't want to guarantee that
it would work when I asked about avx512 stores to clear poison
many years ago).

-Tony
On Sat, Aug 28 2021 at 22:04, Al Viro wrote:
> On Sat, Aug 28, 2021 at 11:47:03PM +0200, Thomas Gleixner wrote:
>
>> /* Try to handle #PF, but anything else is fatal. */
>> if (ret != -EFAULT)
>> 	return -EINVAL;
>
>> which all end up in user_insn(). user_insn() returns 0 or the negated
>> trap number, which results in -EFAULT for #PF, but for #MC the negated
>> trap number is -18 i.e. != -EFAULT. IOW, there is no endless loop.
>>
>> This used to be a problem before commit:
>>
>>   aee8c67a4faa ("x86/fpu: Return proper error codes from user access functions")
>>
>> and as the changelog says the initial reason for this was #GP going into
>> the fault path, but I'm pretty sure that I also discussed the #MC angle with
>> Borislav back then. Should have added some more comments there
>> obviously.
>
> ... or at least have that check spelled
>
> 	if (ret != -X86_TRAP_PF)
> 		return -EINVAL;
>
> Unless I'm misreading your explanation, that is...

Yes, that makes a lot of sense.
On Sat, Aug 28, 2021 at 10:19:04PM +0000, Al Viro wrote:
> How about taking __clear_user() out of copy_fpregs_to_sigframe()
> and replacing the call of fault_in_pages_writeable() with
> 	if (!clear_user(buf_fx, fpu_user_xstate_size))
> 		goto retry;
> 	return -EFAULT;
> in the caller?

Something like this (completely untested)

Lift __clear_user() out of copy_fpregs_to_sigframe(), do not confuse
EFAULT with X86_TRAP_PF, don't bother with fault_in_pages_writeable()
(pointless, since now __clear_user() on error is not under
pagefault_disable()). And don't bother with retries on anything other
than #PF...

diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index 5a18694a89b2..71c6621a262f 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -17,6 +17,7 @@
 #include <linux/mm.h>
 
 #include <asm/user.h>
+#include <asm/trapnr.h>
 #include <asm/fpu/api.h>
 #include <asm/fpu/xstate.h>
 #include <asm/fpu/xcr.h>
@@ -345,7 +346,7 @@ static inline int xsave_to_user_sigframe(struct xregs_state __user *buf)
 	 */
 	err = __clear_user(&buf->header, sizeof(buf->header));
 	if (unlikely(err))
-		return -EFAULT;
+		return -X86_TRAP_PF;
 
 	stac();
 	XSTATE_OP(XSAVE, buf, lmask, hmask, err);
diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c
index 445c57c9c539..611b9ed9c06b 100644
--- a/arch/x86/kernel/fpu/signal.c
+++ b/arch/x86/kernel/fpu/signal.c
@@ -135,18 +135,12 @@ static inline int save_xstate_epilog(void __user *buf, int ia32_frame)
 
 static inline int copy_fpregs_to_sigframe(struct xregs_state __user *buf)
 {
-	int err;
-
 	if (use_xsave())
-		err = xsave_to_user_sigframe(buf);
-	else if (use_fxsr())
-		err = fxsave_to_user_sigframe((struct fxregs_state __user *) buf);
+		return xsave_to_user_sigframe(buf);
+	if (use_fxsr())
+		return fxsave_to_user_sigframe((struct fxregs_state __user *) buf);
 	else
-		err = fnsave_to_user_sigframe((struct fregs_state __user *) buf);
-
-	if (unlikely(err) && __clear_user(buf, fpu_user_xstate_size))
-		err = -EFAULT;
-	return err;
+		return fnsave_to_user_sigframe((struct fregs_state __user *) buf);
 }
 
 /*
@@ -205,9 +199,10 @@ int copy_fpstate_to_sigframe(void __user *buf, void __user *buf_fx, int size)
 	fpregs_unlock();
 
 	if (ret) {
-		if (!fault_in_pages_writeable(buf_fx, fpu_user_xstate_size))
+		if (!__clear_user(buf_fx, fpu_user_xstate_size) &&
+		    ret == -X86_TRAP_PF)
 			goto retry;
-		return -EFAULT;
+		return -1;
 	}
 
 	/* Save the fsave header for the 32-bit frames. */
@@ -275,7 +270,7 @@ static int restore_fpregs_from_user(void __user *buf, u64 xrestore,
 	fpregs_unlock();
 
 	/* Try to handle #PF, but anything else is fatal. */
-	if (ret != -EFAULT)
+	if (ret != -X86_TRAP_PF)
 		return -EINVAL;
 
 	ret = fault_in_pages_readable(buf, size);
On Fri, Aug 27, 2021 at 09:48:55PM +0000, Al Viro wrote:
> So we have 3 callers where we want all-or-nothing semantics - two in
> arch/x86/kernel/fpu/signal.c and one in btrfs. HWPOISON will be a problem
> for all 3, AFAICS...
>
> IOW, it looks like we have two different things mixed here - one that wants
> to try and fault stuff in, with callers caring only about having _something_
> faulted in (most of the users) and one that wants to make sure we *can* do
> stores or loads on each byte in the affected area.
>
> Just accessing a byte in each page really won't suffice for the second kind.
> Neither will g-u-p use, unless we teach it about HWPOISON and other fun
> beasts... Looks like we want that thing to be a separate primitive; for
> btrfs I'd probably replace fault_in_pages_writeable() with clear_user()
> as a quick fix for now...

Looks like out of these 3 we have

* x86 restoring FPU state on sigreturn: correct, if somewhat obfuscated;
  HWPOISON is not an issue. We want full fault-in there (1 or 2 pages)

* x86 saving FPU state into sigframe: not really needed; we do
  __clear_user() on any error anyway, and taking it into the caller past
  the pagefault_enable() will serve just fine instead of fault-in of the
  same for write.

* btrfs search_ioctl(): HWPOISON is not an issue (no #MC on stores), but
  arm64 side of the things very likely is a problem with MTE; there we
  can have successful store in some bytes in a page with faults on stores
  elsewhere in it. With such setups that thing will loop indefinitely.
  And unlike x86 FPU handling, btrfs is arch-independent.

IOW, unless I'm misreading the situation, we have one caller where "all
or nothing" semantics is correct and needed, several where fault-in is
pointless, one where the current use of fault-in is actively wrong (ppc
kvm, patch from ppc folks exists), another place where neither semantics
is right (btrfs on arm64) and a bunch where "can we access at least the
first byte?" should be fine...
On Sat, Aug 28, 2021 at 03:20:58PM -0700, Tony Luck wrote:
> On Sat, Aug 28, 2021 at 3:12 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> > BTW, is #MC triggered on stores to a poisoned cacheline? Existence of CLZERO
> > would seem to argue against that...
>
> No #MC on stores. Just on loads. Note that you can't clear poison
> state with a series of small writes to the cache line. But a single
> 64-byte store might do it (architects didn't want to guarantee that
> it would work when I asked about avx512 stores to clear poison
> many years ago).

Dave Jiang thinks MOVDIR64B clears poison.

http://archive.lwn.net:8080/linux-kernel/157617505636.42350.1170110675242558018.stgit@djiang5-desk3.ch.intel.com/
On Sat, Aug 28 2021 at 22:51, Al Viro wrote:
> @@ -345,7 +346,7 @@ static inline int xsave_to_user_sigframe(struct xregs_state __user *buf)
>  	 */
>  	err = __clear_user(&buf->header, sizeof(buf->header));
>  	if (unlikely(err))
> -		return -EFAULT;
> +		return -X86_TRAP_PF;

This clear_user can be lifted into copy_fpstate_to_sigframe(). Something
like the below.

Thanks,

        tglx
---
--- a/arch/x86/kernel/fpu/signal.c
+++ b/arch/x86/kernel/fpu/signal.c
@@ -135,18 +135,12 @@ static inline int save_xstate_epilog(voi
 
 static inline int copy_fpregs_to_sigframe(struct xregs_state __user *buf)
 {
-	int err;
-
 	if (use_xsave())
-		err = xsave_to_user_sigframe(buf);
-	else if (use_fxsr())
-		err = fxsave_to_user_sigframe((struct fxregs_state __user *) buf);
+		return xsave_to_user_sigframe(buf);
+	if (use_fxsr())
+		return fxsave_to_user_sigframe((struct fxregs_state __user *) buf);
 	else
-		err = fnsave_to_user_sigframe((struct fregs_state __user *) buf);
-
-	if (unlikely(err) && __clear_user(buf, fpu_user_xstate_size))
-		err = -EFAULT;
-	return err;
+		return fnsave_to_user_sigframe((struct fregs_state __user *) buf);
 }
 
 /*
@@ -188,6 +182,16 @@ int copy_fpstate_to_sigframe(void __user
 	if (!access_ok(buf, size))
 		return -EACCES;
+
+	if (use_xsave()) {
+		/*
+		 * Clear the xsave header first, so that reserved fields are
+		 * initialized to zero.
+		 */
+		ret = __clear_user(&buf->header, sizeof(buf->header));
+		if (unlikely(ret))
+			return ret;
+	}
 retry:
 	/*
 	 * Load the FPU registers if they are not valid for the current task.
@@ -205,9 +209,10 @@ int copy_fpstate_to_sigframe(void __user
 	fpregs_unlock();
 
 	if (ret) {
-		if (!fault_in_pages_writeable(buf_fx, fpu_user_xstate_size))
+		if (!__clear_user(buf_fx, fpu_user_xstate_size) &&
+		    ret == -X86_TRAP_PF)
 			goto retry;
-		return -EFAULT;
+		return -1;
 	}
 
 	/* Save the fsave header for the 32-bit frames. */
@@ -275,7 +280,7 @@ static int restore_fpregs_from_user(void
 	fpregs_unlock();
 
 	/* Try to handle #PF, but anything else is fatal. */
-	if (ret != -EFAULT)
+	if (ret != -X86_TRAP_PF)
 		return -EINVAL;
 
 	ret = fault_in_pages_readable(buf, size);
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -323,9 +323,12 @@ static inline void os_xrstor(struct xreg
  * We don't use modified optimization because xrstor/xrstors might track
  * a different application.
  *
- * We don't use compacted format xsave area for
- * backward compatibility for old applications which don't understand
- * compacted format of xsave area.
+ * We don't use compacted format xsave area for backward compatibility for
+ * old applications which don't understand the compacted format of the
+ * xsave area.
+ *
+ * The caller has to zero buf::header before calling this because XSAVE*
+ * does not touch them.
  */
 static inline int xsave_to_user_sigframe(struct xregs_state __user *buf)
 {
@@ -339,14 +342,6 @@ static inline int xsave_to_user_sigframe
 	u32 hmask = mask >> 32;
 	int err;
 
-	/*
-	 * Clear the xsave header first, so that reserved fields are
-	 * initialized to zero.
-	 */
-	err = __clear_user(&buf->header, sizeof(buf->header));
-	if (unlikely(err))
-		return -EFAULT;
-
 	stac();
 	XSTATE_OP(XSAVE, buf, lmask, hmask, err);
 	clac();
On Sun, Aug 29, 2021 at 08:44:04PM +0200, Thomas Gleixner wrote:
> On Sat, Aug 28 2021 at 22:51, Al Viro wrote:
> > @@ -345,7 +346,7 @@ static inline int xsave_to_user_sigframe(struct xregs_state __user *buf)
> >  	 */
> >  	err = __clear_user(&buf->header, sizeof(buf->header));
> >  	if (unlikely(err))
> > -		return -EFAULT;
> > +		return -X86_TRAP_PF;
>
> This clear_user can be lifted into copy_fpstate_to_sigframe(). Something
> like the below.

Hmm...  This mixing of -X86_TRAP_... with -E... looks like it's asking
for trouble in general.  Might be worth making e.g. fpu__restore_sig()
(and its callers) return bool, seeing that we only check for 0/non-zero
in there.
On Sun, Aug 29 2021 at 19:46, Al Viro wrote:
> On Sun, Aug 29, 2021 at 08:44:04PM +0200, Thomas Gleixner wrote:
>> On Sat, Aug 28 2021 at 22:51, Al Viro wrote:
>> > @@ -345,7 +346,7 @@ static inline int xsave_to_user_sigframe(struct xregs_state __user *buf)
>> >  	 */
>> >  	err = __clear_user(&buf->header, sizeof(buf->header));
>> >  	if (unlikely(err))
>> > -		return -EFAULT;
>> > +		return -X86_TRAP_PF;
>>
>> This clear_user can be lifted into copy_fpstate_to_sigframe(). Something
>> like the below.
>
> Hmm... This mixing of -X86_TRAP_... with -E... looks like it's asking for
> trouble in general. Might be worth making e.g. fpu__restore_sig() (and
> its callers) return bool, seeing that we only check for 0/non-zero in
> there.

Let me fix that.
>> No #MC on stores. Just on loads. Note that you can't clear poison
>> state with a series of small writes to the cache line. But a single
>> 64-byte store might do it (architects didn't want to guarantee that
>> it would work when I asked about avx512 stores to clear poison
>> many years ago).
>
> Dave Jiang thinks MOVDIR64B clears poison.
>
> http://archive.lwn.net:8080/linux-kernel/157617505636.42350.1170110675242558018.stgit@djiang5-desk3.ch.intel.com/

MOVDIR64B has some explicit guarantees (does a write-back invalidate if
the target is already in the cache) that a 64-byte avx512 write doesn't.

Of course it would stop working if some future CPU were to have a longer
than 64 bytes cache line.

-Tony
On Sat, Aug 28, 2021 at 08:28:17PM +0100, Al Viro wrote:
> AFAICS, a48b73eca4ce "btrfs: fix potential deadlock in the search ioctl"
> has introduced a bug at least on arm64.
>
> Relevant bits: in search_ioctl() we have
>
> 	while (1) {
> 		ret = fault_in_pages_writeable(ubuf + sk_offset,
> 					       *buf_size - sk_offset);
> 		if (ret)
> 			break;
>
> 		ret = btrfs_search_forward(root, &key, path, sk->min_transid);
> 		if (ret != 0) {
> 			if (ret > 0)
> 				ret = 0;
> 			goto err;
> 		}
> 		ret = copy_to_sk(path, &key, sk, buf_size, ubuf,
> 				 &sk_offset, &num_found);
> 		btrfs_release_path(path);
> 		if (ret)
> 			break;
> 	}
>
> and in copy_to_sk() -
>
> 		sh.objectid = key->objectid;
> 		sh.offset = key->offset;
> 		sh.type = key->type;
> 		sh.len = item_len;
> 		sh.transid = found_transid;
>
> 		/*
> 		 * Copy search result header. If we fault then loop again so we
> 		 * can fault in the pages and -EFAULT there if there's a
> 		 * problem. Otherwise we'll fault and then copy the buffer in
> 		 * properly this next time through
> 		 */
> 		if (copy_to_user_nofault(ubuf + *sk_offset, &sh, sizeof(sh))) {
> 			ret = 0;
> 			goto out;
> 		}
>
> with sk_offset left unchanged if the very first copy_to_user_nofault() fails.
>
> Now, consider a situation on arm64 where ubuf points to the beginning of page,
> ubuf[0] can be accessed, but ubuf[16] can not (possible with MTE, AFAICS).  We do
> fault_in_pages_writeable(), which succeeds.  When we get to copy_to_user_nofault()
> we fail as soon as it gets past the first 16 bytes.  And we repeat everything from
> scratch, with no progress made, since short copies are treated as "discard and
> repeat" here.

So if copy_to_user_nofault() returns -EFAULT, copy_to_sk() returns 0
(following commit a48b73eca4ce). I think you are right, search_ioctl()
can get into an infinite loop attempting to write to user if the
architecture can trigger faults at smaller granularity than the page
boundary. fault_in_pages_writeable() won't fix it if ubuf[0] is writable
and doesn't trigger an MTE tag check fault.

An arm64-specific workaround would be for pagefault_disable() to disable
tag checking. It's a pretty big hammer, weakening the out of bounds
access detection of MTE. My preference would be a fix in the btrfs code.

A btrfs option would be for copy_to_sk() to return an indication of
where the fault occurred and get fault_in_pages_writeable() to check
that location, even if the copying would restart from an earlier offset
(this requires open-coding copy_to_user_nofault()). An attempt below,
untested and does not cover read_extent_buffer_to_user_nofault():

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 0ba98e08a029..9e74ba1c955d 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2079,6 +2079,7 @@ static noinline int copy_to_sk(struct btrfs_path *path,
 			       size_t *buf_size,
 			       char __user *ubuf,
 			       unsigned long *sk_offset,
+			       unsigned long *fault_offset,
 			       int *num_found)
 {
 	u64 found_transid;
@@ -2143,7 +2144,11 @@ static noinline int copy_to_sk(struct btrfs_path *path,
 		 * problem. Otherwise we'll fault and then copy the buffer in
 		 * properly this next time through
 		 */
-		if (copy_to_user_nofault(ubuf + *sk_offset, &sh, sizeof(sh))) {
+		pagefault_disable();
+		ret = __copy_to_user_inatomic(ubuf + *sk_offset, &sh, sizeof(sh));
+		pagefault_enable();
+		*fault_offset = *sk_offset + sizeof(sh) - ret;
+		if (ret) {
 			ret = 0;
 			goto out;
 		}
@@ -2218,6 +2223,7 @@ static noinline int search_ioctl(struct inode *inode,
 	int ret;
 	int num_found = 0;
 	unsigned long sk_offset = 0;
+	unsigned long fault_offset = 0;
 
 	if (*buf_size < sizeof(struct btrfs_ioctl_search_header)) {
 		*buf_size = sizeof(struct btrfs_ioctl_search_header);
@@ -2244,8 +2250,8 @@ static noinline int search_ioctl(struct inode *inode,
 	key.offset = sk->min_offset;
 
 	while (1) {
-		ret = fault_in_pages_writeable(ubuf + sk_offset,
-					       *buf_size - sk_offset);
+		ret = fault_in_pages_writeable(ubuf + fault_offset,
+					       *buf_size - fault_offset);
 		if (ret)
 			break;
 
@@ -2256,7 +2262,7 @@ static noinline int search_ioctl(struct inode *inode,
 			goto err;
 		}
 		ret = copy_to_sk(path, &key, sk, buf_size, ubuf,
-				 &sk_offset, &num_found);
+				 &sk_offset, &fault_offset, &num_found);
 		btrfs_release_path(path);
 		if (ret)
 			break;
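The livelock described above is easy to reproduce outside the kernel. The following is a toy model, not btrfs code: all names, the 32-byte header size, and the 16-byte tag-check boundary are invented for illustration. The page-granular fault-in always succeeds because the pte is fine, the nofault header copy always fails at the sub-page boundary, `sk_offset` never advances, and the loop repeats the identical state (capped here instead of spinning forever).

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative constants, not the real btrfs ones. */
enum { SH_SIZE = 32, TAG_CHECK_LIMIT = 16, MAX_ITERS = 1000 };

/* Stands in for fault_in_pages_writeable(): the pte is fine, so the
 * page-granular probe always succeeds. */
int fault_in_page_writeable(void)
{
	return 0;
}

/* Stands in for copy_to_user_nofault() of the search header: fails as
 * soon as the copy would cross the sub-page tag-check boundary. */
int copy_header_nofault(size_t off)
{
	return off + SH_SIZE > TAG_CHECK_LIMIT ? -14 /* -EFAULT */ : 0;
}

/* The shape of the search_ioctl() loop: returns how many iterations it
 * ran before making progress (MAX_ITERS means it never did). */
int search_loop_iterations(void)
{
	size_t sk_offset = 0;
	int iters = 0;

	while (iters < MAX_ITERS) {
		iters++;
		if (fault_in_page_writeable())
			break;			/* never taken: pte is fine */
		if (copy_header_nofault(sk_offset))
			continue;		/* "retry", but no progress made */
		sk_offset += SH_SIZE;		/* never reached */
		break;
	}
	return iters;
}
```

Running it shows the loop exhausting the cap without ever advancing `sk_offset`, which is exactly the infinite loop in the real code.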
On Tue, Aug 31, 2021 at 02:54:50PM +0100, Catalin Marinas wrote:
> An arm64-specific workaround would be for pagefault_disable() to disable
> tag checking. It's a pretty big hammer, weakening the out of bounds
> access detection of MTE. My preference would be a fix in the btrfs code.
>
> A btrfs option would be for copy_to_sk() to return an indication of
> where the fault occurred and get fault_in_pages_writeable() to check
> that location, even if the copying would restart from an earlier offset
> (this requires open-coding copy_to_user_nofault()). An attempt below,
> untested and does not cover read_extent_buffer_to_user_nofault():

Umm...  There's another copy_to_user_nofault() call in the same function
(same story, AFAICS).

Can't say I'm fond of their ABI, but then I guess it could've been worse -
iterating over btree, running a user-supplied chunk of INTERCAL over it,
with all details of internal representation cast in stone by that
exposure...
On Tue, Aug 31, 2021 at 03:28:57PM +0000, Al Viro wrote:
> On Tue, Aug 31, 2021 at 02:54:50PM +0100, Catalin Marinas wrote:
> > An arm64-specific workaround would be for pagefault_disable() to disable
> > tag checking. It's a pretty big hammer, weakening the out of bounds
> > access detection of MTE. My preference would be a fix in the btrfs code.
> >
> > A btrfs option would be for copy_to_sk() to return an indication of
> > where the fault occurred and get fault_in_pages_writeable() to check
> > that location, even if the copying would restart from an earlier offset
> > (this requires open-coding copy_to_user_nofault()). An attempt below,
> > untested and does not cover read_extent_buffer_to_user_nofault():
>
> Umm... There's another copy_to_user_nofault() call in the same function
> (same story, AFAICS).

Yeah, I was too lazy to do it all and I don't have a setup to test the
patch quickly either. BTW, my hack is missing an access_ok() check.

I wonder whether copy_{to,from}_user_nofault() could actually return the
number of bytes left to copy, just like their non-nofault counterparts.
These are only used in a few places, so fairly easy to change. If we go
for a btrfs fix along the lines of my diff, it saves us from duplicating
the copy_to_user_nofault() code.
On Tue, Aug 31, 2021 at 03:28:57PM +0000, Al Viro wrote:
> On Tue, Aug 31, 2021 at 02:54:50PM +0100, Catalin Marinas wrote:
> > An arm64-specific workaround would be for pagefault_disable() to disable
> > tag checking. It's a pretty big hammer, weakening the out of bounds
> > access detection of MTE. My preference would be a fix in the btrfs code.
> >
> > A btrfs option would be for copy_to_sk() to return an indication of
> > where the fault occurred and get fault_in_pages_writeable() to check
> > that location, even if the copying would restart from an earlier offset
> > (this requires open-coding copy_to_user_nofault()). An attempt below,
> > untested and does not cover read_extent_buffer_to_user_nofault():
>
> Umm... There's another copy_to_user_nofault() call in the same function
> (same story, AFAICS).

I cleaned up this patch [1] but I realised it still doesn't solve it.
The arm64 __copy_to_user_inatomic(), while ensuring progress if called
in a loop, does not guarantee a precise copy to the fault position. The
copy_to_sk(), after returning an error, starts again from the previous
sizeof(sh) boundary rather than from where the __copy_to_user_inatomic()
stopped. So it can get stuck attempting to copy the same search header.

An ugly fix is to fall back to byte by byte copying so that we can
attempt the actual fault address in fault_in_pages_writeable().

If the sh being recreated in copy_to_sk() is the same on the retried
iteration, we could use an *sk_offset that is not a multiple of
sizeof(sh) in order to have progress. But it's not clear to me whether
the data being copied can change once btrfs_release_path() is called.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/log/?h=devel/btrfs-fix
On Mon, Oct 11, 2021 at 10:38 AM Catalin Marinas
<catalin.marinas@arm.com> wrote:
>
> I cleaned up this patch [1] but I realised it still doesn't solve it.
> The arm64 __copy_to_user_inatomic(), while ensuring progress if called
> in a loop, it does not guarantee precise copy to the fault position.

That should be ok. We've always allowed the user copy to return early
if it does word copies and hits a page crosser that causes a fault.

Any user then has the choice of:

 - partial copies are bad

 - partial copies are handled and then you retry from the place
   copy_to_user() failed at

and in that second case, the next time around, you'll get the fault
immediately (or you'll make some more progress - maybe the user copy
loop did something different just because the length and/or alignment
was different).

If you get the fault immediately, that's -EFAULT.

And if you make some more progress, it's again up to the caller to
rinse and repeat.

End result: user copy functions do not have to report errors exactly.
It is the caller that has to handle the situation.

Most importantly: "exact error or not" doesn't actually _matter_ to
the caller. If the caller does the right thing for an exact error, it
will do the right thing for an inexact one too. See above.

> The copy_to_sk(), after returning an error, starts again from the previous
> sizeof(sh) boundary rather than from where the __copy_to_user_inatomic()
> stopped. So it can get stuck attempting to copy the same search header.

That seems to be purely a copy_to_sk() bug.

Or rather, it looks like a bug in the caller. copy_to_sk() itself does

        if (copy_to_user_nofault(ubuf + *sk_offset, &sh, sizeof(sh))) {
                ret = 0;
                goto out;
        }

and the comment says

 * 0: all items from this leaf copied, continue with next

but that comment is then obviously not actually true in that it's not
"continue with next" at all.

But this is all very much a bug in the btrfs search_ioctl()/copy_to_sk()
code: it simply doesn't do the proper thing for a partial result.

Because no, "just retry the whole thing" is by definition not the proper
thing. That said, I think that if we can have faults at non-page-aligned
boundaries, then we just need to make fault_in_pages_writeable() check
non-page boundaries.

> An ugly fix is to fall back to byte by byte copying so that we can
> attempt the actual fault address in fault_in_pages_writeable().

No, changing the user copy machinery is wrong. The caller really has
to do the right thing with partial results.

And I think we need to make fault_in_pages_writeable() match the
actual faulting cases - maybe remove the "pages" part of the name?

That would fix the btrfs code - it's not doing the right thing as-is,
but it's "close enough" to right that I think fixing
fault_in_pages_writeable() should fix it.

No?

            Linus
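The caller-side contract described here can be sketched in plain userspace C. This is an illustrative model, not kernel code: `copy_fn_t` stands in for copy_to_user(), which returns the number of bytes NOT copied and may stop early at an arbitrary point. The caller retries from the exact failure position; only a retry that makes no progress at all is treated as a hard -EFAULT.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for copy_to_user(): returns bytes NOT copied, may stop early. */
typedef size_t (*copy_fn_t)(char *dst, const char *src, size_t len);

/* "Rinse and repeat" caller: retry from the failure point; a retry with
 * zero progress is the real fault. */
int copy_with_retry(copy_fn_t copy, char *dst, const char *src, size_t len)
{
	while (len) {
		size_t left = copy(dst, src, len);

		if (left == len)	/* immediate fault: hard error */
			return -14;	/* -EFAULT */
		dst += len - left;
		src += len - left;
		len = left;		/* retry the remainder */
	}
	return 0;
}

/* Demo copier that transfers at most 4 bytes per call: inexact, but it
 * always makes progress, so the caller still completes the copy. */
size_t chunky_copy(char *dst, const char *src, size_t len)
{
	size_t n = len < 4 ? len : 4;

	memcpy(dst, src, n);
	return len - n;
}

/* Demo copier that never makes progress, like a truly bad address. */
size_t no_copy(char *dst, const char *src, size_t len)
{
	(void)dst;
	(void)src;
	return len;
}
```

The point of the sketch: `copy_with_retry()` does the right thing whether the underlying copier reports the fault position exactly or stops short of it.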
On Mon, Oct 11, 2021 at 12:15:43PM -0700, Linus Torvalds wrote:
> On Mon, Oct 11, 2021 at 10:38 AM Catalin Marinas
> <catalin.marinas@arm.com> wrote:
> > I cleaned up this patch [1] but I realised it still doesn't solve it.
> > The arm64 __copy_to_user_inatomic(), while ensuring progress if called
> > in a loop, it does not guarantee precise copy to the fault position.
>
> That should be ok. We've always allowed the user copy to return early
> if it does word copies and hits a page crosser that causes a fault.
>
> Any user then has the choice of:
>
>  - partial copies are bad
>
>  - partial copies are handled and then you retry from the place
>    copy_to_user() failed at
>
> and in that second case, the next time around, you'll get the fault
> immediately (or you'll make some more progress - maybe the user copy
> loop did something different just because the length and/or alignment
> was different).
>
> If you get the fault immediately, that's -EFAULT.
>
> And if you make some more progress, it's again up to the caller to
> rinse and repeat.
>
> End result: user copy functions do not have to report errors exactly.
> It is the caller that has to handle the situation.
>
> Most importantly: "exact error or not" doesn't actually _matter_ to
> the caller. If the caller does the right thing for an exact error, it
> will do the right thing for an inexact one too. See above.

Yes, that's my expectation (though fixed fairly recently in the arm64
user copy routines).

> > The copy_to_sk(), after returning an error, starts again from the previous
> > sizeof(sh) boundary rather than from where the __copy_to_user_inatomic()
> > stopped. So it can get stuck attempting to copy the same search header.
>
> That seems to be purely a copy_to_sk() bug.
>
> Or rather, it looks like a bug in the caller. copy_to_sk() itself does
>
>         if (copy_to_user_nofault(ubuf + *sk_offset, &sh, sizeof(sh))) {
>                 ret = 0;
>                 goto out;
>         }
>
> and the comment says
>
>  * 0: all items from this leaf copied, continue with next
>
> but that comment is then obviously not actually true in that it's not
> "continue with next" at all.

The comments were correct before commit a48b73eca4ce ("btrfs: fix
potential deadlock in the search ioctl") which introduced the
potentially infinite loop. Something like this would make the comments
match (I think):

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index cc61813213d8..1debf6a124e8 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2161,7 +2161,7 @@ static noinline int copy_to_sk(struct btrfs_path *path,
 		 * properly this next time through
 		 */
 		if (copy_to_user_nofault(ubuf + *sk_offset, &sh, sizeof(sh))) {
-			ret = 0;
+			ret = -EFAULT;
 			goto out;
 		}
 
@@ -2175,7 +2175,7 @@ static noinline int copy_to_sk(struct btrfs_path *path,
 		 */
 		if (read_extent_buffer_to_user_nofault(leaf, up,
 						       item_off, item_len)) {
-			ret = 0;
+			ret = -EFAULT;
 			*sk_offset -= sizeof(sh);
 			goto out;
 		}
 
@@ -2260,12 +2260,8 @@ static noinline int search_ioctl(struct inode *inode,
 	key.type = sk->min_type;
 	key.offset = sk->min_offset;
 
-	while (1) {
-		ret = fault_in_pages_writeable(ubuf + sk_offset,
-					       *buf_size - sk_offset);
-		if (ret)
-			break;
-
+	ret = fault_in_pages_writeable(ubuf, *buf_size);
+	while (ret == 0) {
 		ret = btrfs_search_forward(root, &key, path, sk->min_transid);
 		if (ret != 0) {
 			if (ret > 0)
@@ -2275,9 +2271,14 @@ static noinline int search_ioctl(struct inode *inode,
 		ret = copy_to_sk(path, &key, sk, buf_size, ubuf,
 				 &sk_offset, &num_found);
 		btrfs_release_path(path);
-		if (ret)
-			break;
 
+		/*
+		 * Fault in copy_to_sk(), attempt to bring the page in after
+		 * releasing the locks and retry.
+		 */
+		if (ret == -EFAULT)
+			ret = fault_in_pages_writeable(ubuf + sk_offset,
+					sizeof(struct btrfs_ioctl_search_header));
 	}
 	if (ret > 0)
 		ret = 0;

> > An ugly fix is to fall back to byte by byte copying so that we can
> > attempt the actual fault address in fault_in_pages_writeable().
>
> No, changing the user copy machinery is wrong. The caller really has
> to do the right thing with partial results.
>
> And I think we need to make fault_in_pages_writeable() match the
> actual faulting cases - maybe remove the "pages" part of the name?

Ah, good point. Without removing "pages" from the name (too late over
here to grep the kernel), something like below:

diff --git a/arch/arm64/include/asm/page-def.h b/arch/arm64/include/asm/page-def.h
index 2403f7b4cdbf..3768ac4a6610 100644
--- a/arch/arm64/include/asm/page-def.h
+++ b/arch/arm64/include/asm/page-def.h
@@ -15,4 +15,9 @@
 #define PAGE_SIZE		(_AC(1, UL) << PAGE_SHIFT)
 #define PAGE_MASK		(~(PAGE_SIZE-1))
 
+#ifdef CONFIG_ARM64_MTE
+#define FAULT_GRANULE_SIZE	(16)
+#define FAULT_GRANULE_MASK	(~(FAULT_GRANULE_SIZE-1))
+#endif
+
 #endif /* __ASM_PAGE_DEF_H */
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 62db6b0176b9..7aef732e4fa7 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -16,6 +16,11 @@
 #include <linux/hardirq.h> /* for in_interrupt() */
 #include <linux/hugetlb_inline.h>
 
+#ifndef FAULT_GRANULE_SIZE
+#define FAULT_GRANULE_SIZE	PAGE_SIZE
+#define FAULT_GRANULE_MASK	PAGE_MASK
+#endif
+
 struct pagevec;
 
 static inline bool mapping_empty(struct address_space *mapping)
@@ -751,12 +756,12 @@ static inline int fault_in_pages_writeable(char __user *uaddr, size_t size)
 	do {
 		if (unlikely(__put_user(0, uaddr) != 0))
 			return -EFAULT;
-		uaddr += PAGE_SIZE;
+		uaddr += FAULT_GRANULE_SIZE;
 	} while (uaddr <= end);
 
-	/* Check whether the range spilled into the next page. */
-	if (((unsigned long)uaddr & PAGE_MASK) ==
-			((unsigned long)end & PAGE_MASK))
+	/* Check whether the range spilled into the next granule. */
+	if (((unsigned long)uaddr & FAULT_GRANULE_MASK) ==
+			((unsigned long)end & FAULT_GRANULE_MASK))
 		return __put_user(0, end);
 
 	return 0;

If this looks in the right direction, I'll do some proper patches
tomorrow.
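The address arithmetic of that probing loop can be checked in isolation. The sketch below mirrors the loop structure with `FAULT_GRANULE_SIZE` set to 16 (the MTE tag granule), but instead of issuing `__put_user()` probes it just counts how many probe addresses the loop would touch for a given range, including the trailing probe of `end` when the loop's overshoot lands in the granule holding `end`. It is a model for reasoning about the patch, not the patch itself.

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors the patch above with the MTE granule size. */
#define FAULT_GRANULE_SIZE 16UL
#define FAULT_GRANULE_MASK (~(FAULT_GRANULE_SIZE - 1))

/* How many probe addresses would the loop touch for [uaddr, uaddr+size)? */
unsigned long probe_count(unsigned long uaddr, size_t size)
{
	unsigned long end = uaddr + size - 1;
	unsigned long count = 0;

	if (!size)
		return 0;
	do {
		count++;			/* __put_user(0, uaddr) */
		uaddr += FAULT_GRANULE_SIZE;
	} while (uaddr <= end);

	/* Loop overshot into the granule holding 'end': probe it too. */
	if ((uaddr & FAULT_GRANULE_MASK) == (end & FAULT_GRANULE_MASK))
		count++;			/* __put_user(0, end) */
	return count;
}
```

For example, an aligned 16-byte range needs one probe, while a 16-byte range starting at offset 8 straddles two granules and needs two, which is exactly the case the spill check exists for.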
On Mon, Oct 11, 2021 at 2:08 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> +#ifdef CONFIG_ARM64_MTE
> +#define FAULT_GRANULE_SIZE	(16)
> +#define FAULT_GRANULE_MASK	(~(FAULT_GRANULE_SIZE-1))

[...]

> If this looks in the right direction, I'll do some proper patches
> tomorrow.

Looks fine to me. It's going to be quite expensive and bad for caches,
though.

That said, fault_in_writable() is _supposed_ to all be for the slow
path when things go south and the normal path didn't work out, so I
think it's fine.

I do wonder how the sub-page granularity works. Is it sufficient to
just read from it? Because then a _slightly_ better option might be to
do one write per page (to catch page table writability) and then one
read per "granule" (to catch pointer coloring or cache poisoning
issues)?

That said, since this is all preparatory to us wanting to write to it
eventually anyway, maybe marking it all dirty in the caches is only
good.

              Linus
On Mon, Oct 11, 2021 at 04:59:28PM -0700, Linus Torvalds wrote:
> On Mon, Oct 11, 2021 at 2:08 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > +#ifdef CONFIG_ARM64_MTE
> > +#define FAULT_GRANULE_SIZE	(16)
> > +#define FAULT_GRANULE_MASK	(~(FAULT_GRANULE_SIZE-1))
>
> [...]
>
> > If this looks in the right direction, I'll do some proper patches
> > tomorrow.
>
> Looks fine to me. It's going to be quite expensive and bad for caches,
> though.
>
> That said, fault_in_writable() is _supposed_ to all be for the slow
> path when things go south and the normal path didn't work out, so I
> think it's fine.
>
> I do wonder how the sub-page granularity works. Is it sufficient to
> just read from it?

For arm64 MTE and I think SPARC ADI, just reading should be sufficient.
There is CHERI in the long run, if it takes off, where the user can set
independent read/write permissions and uaccess would use the capability
rather than a match-all pointer (hence checked).

> Because then a _slightly_ better option might be to
> do one write per page (to catch page table writability) and then one
> read per "granule" (to catch pointer coloring or cache poisoning
> issues)?
>
> That said, since this is all preparatory to us wanting to write to it
> eventually anyway, maybe marking it all dirty in the caches is only
> good.

It depends on how much would be written in the actual copy. For
significant memcpy on arm CPUs, write streaming usually kicks in and the
cache dirtying is skipped. This probably matters more for
copy_page_to_iter_iovec() than the btrfs search ioctl.

Apart from fault_in_pages_*(), there's also fault_in_user_writeable()
called from the futex code which uses the GUP mechanism as the write
would be destructive. It looks like it could potentially trigger the
same infinite loop on -EFAULT.

For arm64 MTE, we get away with this by disabling the tag checking
around the arch futex code (we did it for an unrelated issue - we don't
have LDXR/STXR that would run with user permissions in kernel mode like
we do with LDTR/STTR).

I wonder whether we should actually just disable tag checking around the
problematic accesses. What these callers seem to have in common is using
pagefault_disable/enable(). We could abuse this to disable tag checking
or maybe in_atomic() when handling the exception to lazily disable such
faults temporarily.

A more invasive change would be to return a different error for such
faults like -EACCESS and treat them differently in the caller.
On Tue, Oct 12, 2021 at 10:27 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> Apart from fault_in_pages_*(), there's also fault_in_user_writeable()
> called from the futex code which uses the GUP mechanism as the write
> would be destructive. It looks like it could potentially trigger the
> same infinite loop on -EFAULT.

Hmm.

I think the reason we do fault_in_user_writeable() using GUP is that

 (a) we can avoid the page fault overhead

 (b) we don't have any good "atomic_inc_user()" interface or similar
     that could do a write with a zero increment or something like that.

We do have that "arch_futex_atomic_op_inuser()" thing, of course. It's
all kinds of crazy, but we *could* do

        arch_futex_atomic_op_inuser(FUTEX_OP_ADD, 0, &dummy, uaddr);

instead of doing the fault_in_user_writeable().

That might be a good idea anyway. I dunno.

But I agree other options exist:

> I wonder whether we should actually just disable tag checking around the
> problematic accesses. What these callers seem to have in common is using
> pagefault_disable/enable(). We could abuse this to disable tag checking
> or maybe in_atomic() when handling the exception to lazily disable such
> faults temporarily.

Hmm. That would work for MTE, but possibly be very inconvenient for
other situations.

> A more invasive change would be to return a different error for such
> faults like -EACCESS and treat them differently in the caller.

That's _really_ hard for things like "copy_to_user()", that isn't a
single operation, and is supposed to return the bytes left.

Adding another error return would be nasty.

We've had hacks like "squirrel away the actual error code in the task
structure", but that tends to be unmaintainable because we have
interrupts (and NMI's) doing their own possibly nested atomics, so even
disabling preemption won't actually fix some of the nesting issues.

All of these things make me think that the proper fix ends up being to
make sure that our "fault_in_xyz()" functions simply should always
handle all faults.

Another option may be to teach the GUP code to actually check
architecture-specific sub-page ranges.

              Linus
On Tue, Oct 12, 2021 at 10:58:46AM -0700, Linus Torvalds wrote:
> On Tue, Oct 12, 2021 at 10:27 AM Catalin Marinas
> <catalin.marinas@arm.com> wrote:
> > Apart from fault_in_pages_*(), there's also fault_in_user_writeable()
> > called from the futex code which uses the GUP mechanism as the write
> > would be destructive. It looks like it could potentially trigger the
> > same infinite loop on -EFAULT.
>
> Hmm.
>
> I think the reason we do fault_in_user_writeable() using GUP is that
>
>  (a) we can avoid the page fault overhead
>
>  (b) we don't have any good "atomic_inc_user()" interface or similar
>      that could do a write with a zero increment or something like that.
>
> We do have that "arch_futex_atomic_op_inuser()" thing, of course. It's
> all kinds of crazy, but we *could* do
>
>         arch_futex_atomic_op_inuser(FUTEX_OP_ADD, 0, &dummy, uaddr);
>
> instead of doing the fault_in_user_writeable().
>
> That might be a good idea anyway. I dunno.

I gave this a quick try for futex (though MTE is not affected at the
moment):

https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/log/?h=devel/sub-page-faults

However, I still have doubts about fault_in_pages_*() probing every 16
bytes, especially if one decides to change these routines to be
GUP-based.

> > A more invasive change would be to return a different error for such
> > faults like -EACCESS and treat them differently in the caller.
>
> That's _really_ hard for things like "copy_to_user()", that isn't a
> single operation, and is supposed to return the bytes left.
>
> Adding another error return would be nasty.
>
> We've had hacks like "squirrel away the actual error code in the task
> structure", but that tends to be unmaintainable because we have
> interrupts (and NMI's) doing their own possibly nested atomics, so
> even disabling preemption won't actually fix some of the nesting
> issues.

I think we can do something similar to the __get_user_error() on arm64.
We can keep the __copy_to_user_inatomic() etc. returning the number of
bytes left but change the exception handling path in those routines to
set an error code or boolean to a pointer passed at uaccess routine call
time. The caller would do something along these lines:

	bool page_fault;

	left = copy_to_user_inatomic(dst, src, size, &page_fault);
	if (left && page_fault)
		goto repeat_fault_in;

copy_to_user_nofault() could also change its return type from -EFAULT to
something else based on whether page_fault was set or not.

Most architectures will use a generic copy_to_user_inatomic() wrapper
where page_fault == true for any fault. Arm64 needs some adjustment to
the uaccess fault handling to pass the fault code down to the exception
code. This way, at least for arm64, I don't think an interrupt or NMI
would be problematic.

> All of these things make me think that the proper fix ends up being to
> make sure that our "fault_in_xyz()" functions simply should always
> handle all faults.
>
> Another option may be to teach the GUP code to actually check
> architecture-specific sub-page ranges.

Teaching GUP about this is likely to be expensive. A put_user() for
probing on arm64 uses a STTR instruction that's run with user privileges
on the user address and the user tag checking mode. The GUP code for
MTE, OTOH, would need to explicitly read the tag in memory and compare
it with the user pointer tag (which is normally cleared in the GUP code
by untagged_addr()).

To me it makes more sense for the fault_in_*() functions to only deal
with those permissions the kernel controls, i.e. the pte. Sub-page
permissions like MTE or CHERI are controlled by the user directly, so
the kernel cannot fix them up anyway. Rather than overloading
fault_in_*() with additional checks, I think we should expand the
in-atomic uaccess API to cover the type of fault.
On Tue, Oct 12, 2021 at 1:59 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Mon, Oct 11, 2021 at 2:08 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
> >
> > +#ifdef CONFIG_ARM64_MTE
> > +#define FAULT_GRANULE_SIZE	(16)
> > +#define FAULT_GRANULE_MASK	(~(FAULT_GRANULE_SIZE-1))
>
> [...]
>
> > If this looks in the right direction, I'll do some proper patches
> > tomorrow.
>
> Looks fine to me. It's going to be quite expensive and bad for caches,
> though.
>
> That said, fault_in_writable() is _supposed_ to all be for the slow
> path when things go south and the normal path didn't work out, so I
> think it's fine.

Let me get back to this; I'm actually not convinced that we need to
worry about sub-page-size fault granules in fault_in_pages_readable or
fault_in_pages_writeable.

From a filesystem point of view, we can get into trouble when a
user-space read or write triggers a page fault while we're holding
filesystem locks, and that page fault ends up calling back into the
filesystem. To deal with that, we're performing those user-space
accesses with page faults disabled. When a page fault would occur, we
get back an error instead, and then we try to fault in the offending
pages. If a page is resident and we still get a fault trying to access
it, trying to fault in the same page again isn't going to help and we
have a true error. We're clearly looking at memory at a page
granularity; faults at a sub-page level don't matter at this level of
abstraction (but they do show similar error behavior).

To avoid getting stuck, when it gets a short result or -EFAULT, the
filesystem implements the following backoff strategy: first, it tries
to fault in a number of pages. When the read or write still doesn't
make progress, it scales back and faults in a single page. Finally,
when that still doesn't help, it gives up. This strategy is needed for
actual page faults, but it also handles sub-page faults appropriately
as long as the user-space access functions give sensible results.

What am I missing?

Thanks,
Andreas

> I do wonder how the sub-page granularity works. Is it sufficient to
> just read from it? Because then a _slightly_ better option might be to
> do one write per page (to catch page table writability) and then one
> read per "granule" (to catch pointer coloring or cache poisoning
> issues)?
>
> That said, since this is all preparatory to us wanting to write to it
> eventually anyway, maybe marking it all dirty in the caches is only
> good.
>
> Linus
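The backoff strategy described above can be sketched as a userspace toy model. Everything here is invented for illustration (the page model, the function names, the `poisoned` flag standing in for a page that can never be made accessible); it is not gfs2 code. `do_write()` plays the pagefault-disabled copy, advancing page by page and stopping at the first non-resident page; `fault_in()` plays fault_in_pages_writeable(), making a window resident unless a page is truly bad.

```c
#include <assert.h>
#include <stddef.h>

enum { NPAGES = 8 };

struct user_buf {
	int resident[NPAGES];	/* page currently accessible? */
	int poisoned[NPAGES];	/* page that can never be faulted in */
};

/* Pagefault-disabled copy: stops at the first non-resident page and
 * returns how many pages it actually wrote. */
size_t do_write(struct user_buf *b, size_t first, size_t npages)
{
	size_t n = 0;

	while (n < npages && b->resident[first + n])
		n++;
	return n;
}

/* Fault-in: make the window resident, unless a page is truly bad. */
int fault_in(struct user_buf *b, size_t first, size_t npages)
{
	size_t i;

	for (i = 0; i < npages; i++) {
		if (b->poisoned[first + i])
			return -14;	/* -EFAULT */
		b->resident[first + i] = 1;
	}
	return 0;
}

/* The backoff: on a stall, first fault in the whole remainder; if that
 * fails, scale back to a single page; if even that fails, give up and
 * return the partial count (or -EFAULT if nothing was written). */
long write_with_backoff(struct user_buf *b, size_t npages)
{
	size_t done = 0;

	while (done < npages) {
		done += do_write(b, done, npages - done);
		if (done >= npages)
			break;
		if (fault_in(b, done, npages - done) &&	/* batch */
		    fault_in(b, done, 1))		/* single page */
			return done ? (long)done : -14;
	}
	return (long)done;
}
```

The loop terminates because every iteration either completes, returns, or faults in at least one page that `do_write()` then makes progress through, which is the property the text relies on.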
On Thu, Oct 21, 2021 at 02:46:10AM +0200, Andreas Gruenbacher wrote:
> On Tue, Oct 12, 2021 at 1:59 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> > On Mon, Oct 11, 2021 at 2:08 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > >
> > > +#ifdef CONFIG_ARM64_MTE
> > > +#define FAULT_GRANULE_SIZE	(16)
> > > +#define FAULT_GRANULE_MASK	(~(FAULT_GRANULE_SIZE-1))
> >
> > [...]
> >
> > > If this looks in the right direction, I'll do some proper patches
> > > tomorrow.
> >
> > Looks fine to me. It's going to be quite expensive and bad for caches, though.
> >
> > That said, fault_in_writable() is _supposed_ to all be for the slow
> > path when things go south and the normal path didn't work out, so I
> > think it's fine.
>
> Let me get back to this; I'm actually not convinced that we need to
> worry about sub-page-size fault granules in fault_in_pages_readable or
> fault_in_pages_writeable.
>
> From a filesystem point of view, we can get into trouble when a
> user-space read or write triggers a page fault while we're holding
> filesystem locks, and that page fault ends up calling back into the
> filesystem. To deal with that, we're performing those user-space
> accesses with page faults disabled.

Yes, this makes sense.

> When a page fault would occur, we
> get back an error instead, and then we try to fault in the offending
> pages. If a page is resident and we still get a fault trying to access
> it, trying to fault in the same page again isn't going to help and we
> have a true error.

You can't be sure the second fault is a true error. The unlocked
fault_in_*() may race with some LRU scheme making the pte not accessible
or a write-back making it clean/read-only. copy_to_user() with
pagefault_disabled() fails again but that's a benign fault. The
filesystem should re-attempt the fault-in (gup would correct the pte),
disable page faults and copy_to_user(), potentially in an infinite loop.
If you bail out on the second/third uaccess following a fault_in_*()
call, you may get some unexpected errors (though very rare). Maybe the
filesystems avoid this problem somehow but I couldn't figure it out.

> We're clearly looking at memory at a page
> granularity; faults at a sub-page level don't matter at this level of
> abstraction (but they do show similar error behavior). To avoid
> getting stuck, when it gets a short result or -EFAULT, the filesystem
> implements the following backoff strategy: first, it tries to fault in
> a number of pages. When the read or write still doesn't make progress,
> it scales back and faults in a single page. Finally, when that still
> doesn't help, it gives up. This strategy is needed for actual page
> faults, but it also handles sub-page faults appropriately as long as
> the user-space access functions give sensible results.

As I said above, I think with this approach there's a small chance of
incorrectly reporting an error when the fault is recoverable. If you
change it to an infinite loop, you'd run into the sub-page fault
problem.

There are some places with such infinite loops: futex_wake_op(),
search_ioctl() in the btrfs code. I still have to get my head around
generic_perform_write() but I think we get away here because it faults
in the page with a get_user() rather than gup (and copy_from_user() is
guaranteed to make progress if any bytes can still be accessed).
On Thu, Oct 21, 2021 at 12:06 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
> On Thu, Oct 21, 2021 at 02:46:10AM +0200, Andreas Gruenbacher wrote:
> > On Tue, Oct 12, 2021 at 1:59 AM Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > > On Mon, Oct 11, 2021 at 2:08 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > > >
> > > > +#ifdef CONFIG_ARM64_MTE
> > > > +#define FAULT_GRANULE_SIZE (16)
> > > > +#define FAULT_GRANULE_MASK (~(FAULT_GRANULE_SIZE-1))
> > >
> > > [...]
> > >
> > > > If this looks in the right direction, I'll do some proper patches
> > > > tomorrow.
> > >
> > > Looks fine to me. It's going to be quite expensive and bad for caches, though.
> > >
> > > That said, fault_in_writable() is _supposed_ to all be for the slow
> > > path when things go south and the normal path didn't work out, so I
> > > think it's fine.
> >
> > Let me get back to this; I'm actually not convinced that we need to
> > worry about sub-page-size fault granules in fault_in_pages_readable or
> > fault_in_pages_writeable.
> >
> > From a filesystem point of view, we can get into trouble when a
> > user-space read or write triggers a page fault while we're holding
> > filesystem locks, and that page fault ends up calling back into the
> > filesystem. To deal with that, we're performing those user-space
> > accesses with page faults disabled.
>
> Yes, this makes sense.
>
> > When a page fault would occur, we
> > get back an error instead, and then we try to fault in the offending
> > pages. If a page is resident and we still get a fault trying to access
> > it, trying to fault in the same page again isn't going to help and we
> > have a true error.
>
> You can't be sure the second fault is a true error. The unlocked
> fault_in_*() may race with some LRU scheme making the pte not accessible
> or a write-back making it clean/read-only. copy_to_user() with
> pagefault_disabled() fails again but that's a benign fault. The
> filesystem should re-attempt the fault-in (gup would correct the pte),
> disable page faults and copy_to_user(), potentially in an infinite loop.
> If you bail out on the second/third uaccess following a fault_in_*()
> call, you may get some unexpected errors (though very rare). Maybe the
> filesystems avoid this problem somehow but I couldn't figure it out.

Good point, we can indeed only bail out if both the user copy and the
fault-in fail.

But probing the entire memory range in fault domain granularity in the
page fault-in functions still doesn't actually make sense. Those
functions really only need to guarantee that we'll be able to make
progress eventually. From that point of view, it should be enough to
probe the first byte of the requested memory range, so when one of
those functions reports that the next N bytes should be accessible,
this really means that the first byte surely isn't permanently
inaccessible and that the rest is likely accessible. Functions
fault_in_readable and fault_in_writeable already work that way, so
this only leaves function fault_in_safe_writeable to worry about.

> > We're clearly looking at memory at a page
> > granularity; faults at a sub-page level don't matter at this level of
> > abstraction (but they do show similar error behavior). To avoid
> > getting stuck, when it gets a short result or -EFAULT, the filesystem
> > implements the following backoff strategy: first, it tries to fault in
> > a number of pages. When the read or write still doesn't make progress,
> > it scales back and faults in a single page. Finally, when that still
> > doesn't help, it gives up. This strategy is needed for actual page
> > faults, but it also handles sub-page faults appropriately as long as
> > the user-space access functions give sensible results.
>
> As I said above, I think with this approach there's a small chance of
> incorrectly reporting an error when the fault is recoverable. If you
> change it to an infinite loop, you'd run into the sub-page fault
> problem.

Yes, I see now, thanks.

> There are some places with such infinite loops: futex_wake_op(),
> search_ioctl() in the btrfs code. I still have to get my head around
> generic_perform_write() but I think we get away here because it faults
> in the page with a get_user() rather than gup (and copy_from_user() is
> guaranteed to make progress if any bytes can still be accessed).

Thanks,
Andreas
On Thu, Oct 21, 2021 at 04:42:33PM +0200, Andreas Gruenbacher wrote:
> On Thu, Oct 21, 2021 at 12:06 PM Catalin Marinas
> <catalin.marinas@arm.com> wrote:
> > On Thu, Oct 21, 2021 at 02:46:10AM +0200, Andreas Gruenbacher wrote:
> > > When a page fault would occur, we
> > > get back an error instead, and then we try to fault in the offending
> > > pages. If a page is resident and we still get a fault trying to access
> > > it, trying to fault in the same page again isn't going to help and we
> > > have a true error.
> >
> > You can't be sure the second fault is a true error. The unlocked
> > fault_in_*() may race with some LRU scheme making the pte not accessible
> > or a write-back making it clean/read-only. copy_to_user() with
> > pagefault_disabled() fails again but that's a benign fault. The
> > filesystem should re-attempt the fault-in (gup would correct the pte),
> > disable page faults and copy_to_user(), potentially in an infinite loop.
> > If you bail out on the second/third uaccess following a fault_in_*()
> > call, you may get some unexpected errors (though very rare). Maybe the
> > filesystems avoid this problem somehow but I couldn't figure it out.
>
> Good point, we can indeed only bail out if both the user copy and the
> fault-in fail.
>
> But probing the entire memory range in fault domain granularity in the
> page fault-in functions still doesn't actually make sense. Those
> functions really only need to guarantee that we'll be able to make
> progress eventually. From that point of view, it should be enough to
> probe the first byte of the requested memory range, so when one of
> those functions reports that the next N bytes should be accessible,
> this really means that the first byte surely isn't permanently
> inaccessible and that the rest is likely accessible. Functions
> fault_in_readable and fault_in_writeable already work that way, so
> this only leaves function fault_in_safe_writeable to worry about.

I agree, that's why generic_perform_write() works. It does a get_user()
from the first byte in that range and the subsequent copy_from_user()
will make progress of at least one byte if it was readable. Eventually
it will hit the byte that faults. The gup-based fault_in_*() are a bit
more problematic.

Your series introduces fault_in_safe_writeable() and I think for MTE
doing a _single_ get_user(uaddr) (in addition to the gup checks for
write) would be sufficient as long as generic_file_read_iter() advances
by at least one byte (eventually).

This discussion started with the btrfs search_ioctl() where, even if
some bytes were written in copy_to_sk(), it always restarts from an
earlier position, reattempting to write the same bytes. Since
copy_to_sk() doesn't guarantee forward progress even if some bytes are
writable, Linus' suggestion was for fault_in_writable() to probe the
whole range. I consider this overkill since btrfs is the only one that
needs probing every 16 bytes. The other cases like the new
fault_in_safe_writeable() can be fixed by probing the first byte only
followed by gup.

I think we need to better define the semantics of the fault_in + uaccess
sequences. For uaccess, we document "a hard requirement that not storing
anything at all (i.e. returning size) should happen only when nothing
could be copied" (from linux/uaccess.h). I think we can add a
requirement for the new size_t-based fault_in_* variants without
mandating that the whole range is probed, something like: "returning
leftover < size guarantees that a subsequent user access at uaddr copies
at least one byte eventually". I said "eventually" but maybe we can come
up with some clearer wording for a liveness property.

Such requirement would ensure that infinite loops of fault_in_* +
uaccess make progress as long as they don't reset the probed range. Code
like btrfs search_ioctl() would need to be adjusted to avoid such range
reset and guarantee forward progress.
On Thu, Oct 21, 2021 at 7:09 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
> On Thu, Oct 21, 2021 at 04:42:33PM +0200, Andreas Gruenbacher wrote:
> > On Thu, Oct 21, 2021 at 12:06 PM Catalin Marinas
> > <catalin.marinas@arm.com> wrote:
> > > On Thu, Oct 21, 2021 at 02:46:10AM +0200, Andreas Gruenbacher wrote:
> > > > When a page fault would occur, we
> > > > get back an error instead, and then we try to fault in the offending
> > > > pages. If a page is resident and we still get a fault trying to access
> > > > it, trying to fault in the same page again isn't going to help and we
> > > > have a true error.
> > >
> > > You can't be sure the second fault is a true error. The unlocked
> > > fault_in_*() may race with some LRU scheme making the pte not accessible
> > > or a write-back making it clean/read-only. copy_to_user() with
> > > pagefault_disabled() fails again but that's a benign fault. The
> > > filesystem should re-attempt the fault-in (gup would correct the pte),
> > > disable page faults and copy_to_user(), potentially in an infinite loop.
> > > If you bail out on the second/third uaccess following a fault_in_*()
> > > call, you may get some unexpected errors (though very rare). Maybe the
> > > filesystems avoid this problem somehow but I couldn't figure it out.
> >
> > Good point, we can indeed only bail out if both the user copy and the
> > fault-in fail.
> >
> > But probing the entire memory range in fault domain granularity in the
> > page fault-in functions still doesn't actually make sense. Those
> > functions really only need to guarantee that we'll be able to make
> > progress eventually. From that point of view, it should be enough to
> > probe the first byte of the requested memory range, so when one of
> > those functions reports that the next N bytes should be accessible,
> > this really means that the first byte surely isn't permanently
> > inaccessible and that the rest is likely accessible. Functions
> > fault_in_readable and fault_in_writeable already work that way, so
> > this only leaves function fault_in_safe_writeable to worry about.
>
> I agree, that's why generic_perform_write() works. It does a get_user()
> from the first byte in that range and the subsequent copy_from_user()
> will make progress of at least one byte if it was readable. Eventually
> it will hit the byte that faults. The gup-based fault_in_*() are a bit
> more problematic.
>
> Your series introduces fault_in_safe_writeable() and I think for MTE
> doing a _single_ get_user(uaddr) (in addition to the gup checks for
> write) would be sufficient as long as generic_file_read_iter() advances
> by at least one byte (eventually).
>
> This discussion started with the btrfs search_ioctl() where, even if
> some bytes were written in copy_to_sk(), it always restarts from an
> earlier position, reattempting to write the same bytes. Since
> copy_to_sk() doesn't guarantee forward progress even if some bytes are
> writable, Linus' suggestion was for fault_in_writable() to probe the
> whole range. I consider this overkill since btrfs is the only one that
> needs probing every 16 bytes. The other cases like the new
> fault_in_safe_writeable() can be fixed by probing the first byte only
> followed by gup.

Hmm. Direct I/O request sizes are multiples of the underlying device
block size, so we'll also get stuck there if fault-in won't give us a
full block. This is getting pretty ugly. So scratch that idea; let's
stick with probing the whole range.

Thanks,
Andreas

> I think we need to better define the semantics of the fault_in + uaccess
> sequences. For uaccess, we document "a hard requirement that not storing
> anything at all (i.e. returning size) should happen only when nothing
> could be copied" (from linux/uaccess.h). I think we can add a
> requirement for the new size_t-based fault_in_* variants without
> mandating that the whole range is probed, something like: "returning
> leftover < size guarantees that a subsequent user access at uaddr copies
> at least one byte eventually". I said "eventually" but maybe we can come
> up with some clearer wording for a liveness property.
>
> Such requirement would ensure that infinite loops of fault_in_* +
> uaccess make progress as long as they don't reset the probed range. Code
> like btrfs search_ioctl() would need to be adjusted to avoid such range
> reset and guarantee forward progress.
>
> --
> Catalin
On Thu, Oct 21, 2021 at 4:42 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
>
> But probing the entire memory range in fault domain granularity in the
> page fault-in functions still doesn't actually make sense. Those
> functions really only need to guarantee that we'll be able to make
> progress eventually. From that point of view, it should be enough to
> probe the first byte of the requested memory range

That's probably fine.

Although it should be more than one byte - "copy_from_user()" might do
word-at-a-time optimizations, so you could have an infinite loop of

 (a) copy_from_user() fails because the chunk it tried to get failed partly

 (b) fault_in() probing succeeds, because the beginning part is fine

so I agree that the fault-in code doesn't need to do the whole area,
but it needs to at least do some <N bytes, up to length> thing, to
handle the situation where the copy_to/from_user requires more than a
single byte.

               Linus
On Thu, Oct 21, 2021 at 04:30:30PM -1000, Linus Torvalds wrote:
> On Thu, Oct 21, 2021 at 4:42 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> > But probing the entire memory range in fault domain granularity in the
> > page fault-in functions still doesn't actually make sense. Those
> > functions really only need to guarantee that we'll be able to make
> > progress eventually. From that point of view, it should be enough to
> > probe the first byte of the requested memory range
>
> That's probably fine.
>
> Although it should be more than one byte - "copy_from_user()" might do
> word-at-a-time optimizations, so you could have an infinite loop of
>
>  (a) copy_from_user() fails because the chunk it tried to get failed partly
>
>  (b) fault_in() probing succeeds, because the beginning part is fine
>
> so I agree that the fault-in code doesn't need to do the whole area,
> but it needs to at least do some <N bytes, up to length> thing, to
> handle the situation where the copy_to/from_user requires more than a
> single byte.

From a discussion with Al some months ago, if there are bytes still
accessible, copy_from_user() is not allowed to fail fully (i.e. return
the requested copy size) even when it uses word-at-a-time. In the worst
case, it should return size - 1. If the fault_in() then continues
probing from uaddr + 1, it should eventually hit the faulty address.

The problem appears when fault_in() restarts from uaddr rather than
where copy_from_user() stopped. That's what the btrfs search_ioctl()
does. I also need to check the direct I/O cases that Andreas mentioned,
maybe they can be changed not to attempt the fault_in() from the
beginning of the block.
On Thu, Oct 21, 2021 at 08:00:50PM +0200, Andreas Gruenbacher wrote:
> On Thu, Oct 21, 2021 at 7:09 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > This discussion started with the btrfs search_ioctl() where, even if
> > some bytes were written in copy_to_sk(), it always restarts from an
> > earlier position, reattempting to write the same bytes. Since
> > copy_to_sk() doesn't guarantee forward progress even if some bytes are
> > writable, Linus' suggestion was for fault_in_writable() to probe the
> > whole range. I consider this overkill since btrfs is the only one that
> > needs probing every 16 bytes. The other cases like the new
> > fault_in_safe_writeable() can be fixed by probing the first byte only
> > followed by gup.
>
> Hmm. Direct I/O request sizes are multiples of the underlying device
> block size, so we'll also get stuck there if fault-in won't give us a
> full block. This is getting pretty ugly. So scratch that idea; let's
> stick with probing the whole range.

Ah, I wasn't aware of this. I got lost in the call trees but I noticed
__iomap_dio_rw() does an iov_iter_revert() only if direction is READ. Is
this the case for writes as well?
On Fri, Oct 22, 2021 at 8:41 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
> On Thu, Oct 21, 2021 at 08:00:50PM +0200, Andreas Gruenbacher wrote:
> > On Thu, Oct 21, 2021 at 7:09 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > > This discussion started with the btrfs search_ioctl() where, even if
> > > some bytes were written in copy_to_sk(), it always restarts from an
> > > earlier position, reattempting to write the same bytes. Since
> > > copy_to_sk() doesn't guarantee forward progress even if some bytes are
> > > writable, Linus' suggestion was for fault_in_writable() to probe the
> > > whole range. I consider this overkill since btrfs is the only one that
> > > needs probing every 16 bytes. The other cases like the new
> > > fault_in_safe_writeable() can be fixed by probing the first byte only
> > > followed by gup.
> >
> > Hmm. Direct I/O request sizes are multiples of the underlying device
> > block size, so we'll also get stuck there if fault-in won't give us a
> > full block. This is getting pretty ugly. So scratch that idea; let's
> > stick with probing the whole range.
>
> Ah, I wasn't aware of this. I got lost in the call trees but I noticed
> __iomap_dio_rw() does an iov_iter_revert() only if direction is READ. Is
> this the case for writes as well?

It's the EOF case, so it only applies to reads:

	/*
	 * We only report that we've read data up to i_size.
	 * Revert iter to a state corresponding to that as some callers (such
	 * as the splice code) rely on it.
	 */
	if (iov_iter_rw(iter) == READ && iomi.pos >= dio->i_size)
		iov_iter_revert(iter, iomi.pos - dio->i_size);

Andreas
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 7c9edc9694d9..a629807edb8c 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -737,6 +737,7 @@ extern void add_page_wait_queue(struct page *page, wait_queue_entry_t *waiter);
  * Fault in userspace address range.
  */
 size_t fault_in_writeable(char __user *uaddr, size_t size);
+size_t fault_in_safe_writeable(const char __user *uaddr, size_t size);
 size_t fault_in_readable(const char __user *uaddr, size_t size);
 
 int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
diff --git a/include/linux/uio.h b/include/linux/uio.h
index 12d30246c2e9..ffa431aeb067 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -120,6 +120,7 @@ size_t copy_page_from_iter_atomic(struct page *page, unsigned offset,
 void iov_iter_advance(struct iov_iter *i, size_t bytes);
 void iov_iter_revert(struct iov_iter *i, size_t bytes);
 size_t fault_in_iov_iter_readable(const struct iov_iter *i, size_t bytes);
+size_t fault_in_iov_iter_writeable(const struct iov_iter *i, size_t bytes);
 size_t iov_iter_single_seg_count(const struct iov_iter *i);
 size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
 		struct iov_iter *i);
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 082ab155496d..968f2d2595cd 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -467,6 +467,45 @@ size_t fault_in_iov_iter_readable(const struct iov_iter *i, size_t size)
 }
 EXPORT_SYMBOL(fault_in_iov_iter_readable);
 
+/*
+ * fault_in_iov_iter_writeable - fault in iov iterator for writing
+ * @i: iterator
+ * @size: maximum length
+ *
+ * Faults in the iterator using get_user_pages(), i.e., without triggering
+ * hardware page faults.  This is primarily useful when we know that some or
+ * all of the pages in @i aren't in memory.
+ *
+ * Returns the number of bytes not faulted in (like copy_to_user() and
+ * copy_from_user()).
+ *
+ * Always returns 0 for non-user space iterators.
+ */
+size_t fault_in_iov_iter_writeable(const struct iov_iter *i, size_t size)
+{
+	if (iter_is_iovec(i)) {
+		size_t count = min(size, iov_iter_count(i));
+		const struct iovec *p;
+		size_t skip;
+
+		size -= count;
+		for (p = i->iov, skip = i->iov_offset; count; p++, skip = 0) {
+			size_t len = min(count, p->iov_len - skip);
+			size_t ret;
+
+			if (unlikely(!len))
+				continue;
+			ret = fault_in_safe_writeable(p->iov_base + skip, len);
+			count -= len - ret;
+			if (ret)
+				break;
+		}
+		return count + size;
+	}
+	return 0;
+}
+EXPORT_SYMBOL(fault_in_iov_iter_writeable);
+
 void iov_iter_init(struct iov_iter *i, unsigned int direction,
 			const struct iovec *iov, unsigned long nr_segs,
 			size_t count)
diff --git a/mm/gup.c b/mm/gup.c
index 0cf47955e5a1..03ab03b68dc7 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1707,6 +1707,69 @@ size_t fault_in_writeable(char __user *uaddr, size_t size)
 }
 EXPORT_SYMBOL(fault_in_writeable);
 
+/*
+ * fault_in_safe_writeable - fault in an address range for writing
+ * @uaddr: start of address range
+ * @size: length of address range
+ *
+ * Faults in an address range using get_user_pages, i.e., without triggering
+ * hardware page faults.  This is primarily useful when we know that some or
+ * all of the pages in the address range aren't in memory.
+ *
+ * Unlike fault_in_writeable(), this function is non-destructive.
+ *
+ * Note that we don't pin or otherwise hold the pages referenced that we fault
+ * in.  There's no guarantee that they'll stay in memory for any duration of
+ * time.
+ *
+ * Returns the number of bytes not faulted in (like copy_to_user() and
+ * copy_from_user()).
+ */
+size_t fault_in_safe_writeable(const char __user *uaddr, size_t size)
+{
+	unsigned long start = (unsigned long)uaddr;
+	unsigned long end, nstart, nend;
+	struct mm_struct *mm = current->mm;
+	struct vm_area_struct *vma = NULL;
+	int locked = 0;
+
+	nstart = start & PAGE_MASK;
+	end = PAGE_ALIGN(start + size);
+	if (end < nstart)
+		end = 0;
+	for (; nstart != end; nstart = nend) {
+		unsigned long nr_pages;
+		long ret;
+
+		if (!locked) {
+			locked = 1;
+			mmap_read_lock(mm);
+			vma = find_vma(mm, nstart);
+		} else if (nstart >= vma->vm_end)
+			vma = vma->vm_next;
+		if (!vma || vma->vm_start >= end)
+			break;
+		nend = end ? min(end, vma->vm_end) : vma->vm_end;
+		if (vma->vm_flags & (VM_IO | VM_PFNMAP))
+			continue;
+		if (nstart < vma->vm_start)
+			nstart = vma->vm_start;
+		nr_pages = (nend - nstart) / PAGE_SIZE;
+		ret = __get_user_pages_locked(mm, nstart, nr_pages,
+					      NULL, NULL, &locked,
+					      FOLL_TOUCH | FOLL_WRITE);
+		if (ret <= 0)
+			break;
+		nend = nstart + ret * PAGE_SIZE;
+	}
+	if (locked)
+		mmap_read_unlock(mm);
+	if (nstart == end)
+		return 0;
+	return size - min_t(size_t, nstart - start, size);
+}
+EXPORT_SYMBOL(fault_in_safe_writeable);
+
 /**
  * fault_in_readable - fault in userspace address range for reading
  * @uaddr: start of user address range
Introduce a new fault_in_iov_iter_writeable helper for safely faulting
in an iterator for writing.  Uses get_user_pages() to fault in the pages
without actually writing to them, which would be destructive.

We'll use fault_in_iov_iter_writeable in gfs2 once we've determined that
the iterator passed to .read_iter isn't in memory.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 include/linux/pagemap.h |  1 +
 include/linux/uio.h     |  1 +
 lib/iov_iter.c          | 39 +++++++++++++++++++++++++
 mm/gup.c                | 63 +++++++++++++++++++++++++++++++++++++++++
 4 files changed, 104 insertions(+)