
[3/3] btrfs: Avoid live-lock in search_ioctl() on hardware with sub-page faults

Message ID 20211124192024.2408218-4-catalin.marinas@arm.com (mailing list archive)
State New, archived
Series Avoid live-lock in fault-in+uaccess loops with sub-page faults

Commit Message

Catalin Marinas Nov. 24, 2021, 7:20 p.m. UTC
Commit a48b73eca4ce ("btrfs: fix potential deadlock in the search
ioctl") addressed a lockdep warning by pre-faulting the user pages and
attempting the copy_to_user_nofault() in an infinite loop. On
architectures like arm64 with MTE, an access may fault within a page at
a location different from what fault_in_writeable() probed. Since the
sk_offset is rewound to the previous struct btrfs_ioctl_search_header
boundary, there is no guaranteed forward progress and search_ioctl() may
live-lock.

Use fault_in_exact_writeable() instead which probes the entire user
buffer for faults at sub-page granularity.

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/btrfs/ioctl.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
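
For reference, a simplified sketch of the loop shape in search_ioctl() that this patch changes (illustrative only, not the actual fs/btrfs/ioctl.c code):

	while (1) {
		ret = -EFAULT;
		/* probes one byte per page; an MTE tag-check fault can
		 * still happen in the middle of a probed page */
		if (fault_in_writeable(ubuf + sk_offset, *buf_size - sk_offset))
			break;

		ret = btrfs_search_forward(root, &key, path, sk->min_transid);
		...
		/* on a copy_to_user_nofault() failure inside copy_to_sk(),
		 * sk_offset is rewound to the previous
		 * struct btrfs_ioctl_search_header boundary, so the next
		 * iteration retries exactly the same range: no forward
		 * progress if the sub-page fault persists */
		ret = copy_to_sk(path, &key, sk, buf_size, ubuf,
				 &sk_offset, &num_found);
		if (ret)
			break;
	}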

Comments

Matthew Wilcox (Oracle) Nov. 24, 2021, 8:03 p.m. UTC | #1
On Wed, Nov 24, 2021 at 07:20:24PM +0000, Catalin Marinas wrote:
> +++ b/fs/btrfs/ioctl.c
> @@ -2223,7 +2223,8 @@ static noinline int search_ioctl(struct inode *inode,
>  
>  	while (1) {
>  		ret = -EFAULT;
> -		if (fault_in_writeable(ubuf + sk_offset, *buf_size - sk_offset))
> +		if (fault_in_exact_writeable(ubuf + sk_offset,
> +					     *buf_size - sk_offset))
>  			break;
>  
>  		ret = btrfs_search_forward(root, &key, path, sk->min_transid);

Couldn't we avoid all of this nastiness by doing ...

@@ -2121,10 +2121,9 @@ static noinline int copy_to_sk(struct btrfs_path *path,
                 * problem. Otherwise we'll fault and then copy the buffer in
                 * properly this next time through
                 */
-               if (copy_to_user_nofault(ubuf + *sk_offset, &sh, sizeof(sh))) {
-                       ret = 0;
+               ret = __copy_to_user_nofault(ubuf + *sk_offset, &sh, sizeof(sh));
+               if (ret)
                        goto out;
-               }
 
                *sk_offset += sizeof(sh);
@@ -2196,6 +2195,7 @@ static noinline int search_ioctl(struct inode *inode,
        int ret;
        int num_found = 0;
        unsigned long sk_offset = 0;
+       unsigned long next_offset = 0;
 
        if (*buf_size < sizeof(struct btrfs_ioctl_search_header)) {
                *buf_size = sizeof(struct btrfs_ioctl_search_header);
@@ -2223,7 +2223,8 @@ static noinline int search_ioctl(struct inode *inode,
 
        while (1) {
                ret = -EFAULT;
-               if (fault_in_writeable(ubuf + sk_offset, *buf_size - sk_offset))
+               if (fault_in_writeable(ubuf + sk_offset + next_offset,
+                                       *buf_size - sk_offset - next_offset))
                        break;
 
                ret = btrfs_search_forward(root, &key, path, sk->min_transid);
@@ -2235,11 +2236,12 @@ static noinline int search_ioctl(struct inode *inode,
                ret = copy_to_sk(path, &key, sk, buf_size, ubuf,
                                 &sk_offset, &num_found);
                btrfs_release_path(path);
-               if (ret)
+               if (ret > 0)
+                       next_offset = ret;
+               else if (ret < 0)
                        break;
-
        }
-       if (ret > 0)
+       if (ret == -ENOSPC || ret > 0)
                ret = 0;
 err:
        sk->nr_items = num_found;

(not shown: the tedious bits where the existing 'ret = 1' are converted
to 'ret = -ENOSPC' in copy_to_sk())
 
(where __copy_to_user_nofault() is a new function that does exactly what
copy_to_user_nofault() does, but returns the number of bytes copied)

That way, the existing fault_in_writable() will get the fault, and we
don't need to probe every 16 bytes.
Catalin Marinas Nov. 24, 2021, 8:37 p.m. UTC | #2
On Wed, Nov 24, 2021 at 08:03:58PM +0000, Matthew Wilcox wrote:
> On Wed, Nov 24, 2021 at 07:20:24PM +0000, Catalin Marinas wrote:
> > +++ b/fs/btrfs/ioctl.c
> > @@ -2223,7 +2223,8 @@ static noinline int search_ioctl(struct inode *inode,
> >  
> >  	while (1) {
> >  		ret = -EFAULT;
> > -		if (fault_in_writeable(ubuf + sk_offset, *buf_size - sk_offset))
> > +		if (fault_in_exact_writeable(ubuf + sk_offset,
> > +					     *buf_size - sk_offset))
> >  			break;
> >  
> >  		ret = btrfs_search_forward(root, &key, path, sk->min_transid);
> 
> Couldn't we avoid all of this nastiness by doing ...

I had a similar attempt initially but I concluded that it doesn't work:

https://lore.kernel.org/r/YS40qqmXL7CMFLGq@arm.com

> @@ -2121,10 +2121,9 @@ static noinline int copy_to_sk(struct btrfs_path *path,
>                  * problem. Otherwise we'll fault and then copy the buffer in
>                  * properly this next time through
>                  */
> -               if (copy_to_user_nofault(ubuf + *sk_offset, &sh, sizeof(sh))) {
> -                       ret = 0;
> +               ret = __copy_to_user_nofault(ubuf + *sk_offset, &sh, sizeof(sh));
> +               if (ret)

There is no requirement for the arch implementation to be exact and copy
the maximum number of bytes possible. It can fail early while there are
still some bytes left that would not fault. The only requirement is that
if it is restarted from where it faulted, it makes some progress (on
arm64 there is one extra byte).

>                         goto out;
> -               }
>  
>                 *sk_offset += sizeof(sh);
> @@ -2196,6 +2195,7 @@ static noinline int search_ioctl(struct inode *inode,
>         int ret;
>         int num_found = 0;
>         unsigned long sk_offset = 0;
> +       unsigned long next_offset = 0;
>  
>         if (*buf_size < sizeof(struct btrfs_ioctl_search_header)) {
>                 *buf_size = sizeof(struct btrfs_ioctl_search_header);
> @@ -2223,7 +2223,8 @@ static noinline int search_ioctl(struct inode *inode,
>  
>         while (1) {
>                 ret = -EFAULT;
> -               if (fault_in_writeable(ubuf + sk_offset, *buf_size - sk_offset))
> +               if (fault_in_writeable(ubuf + sk_offset + next_offset,
> +                                       *buf_size - sk_offset - next_offset))
>                         break;
>  
>                 ret = btrfs_search_forward(root, &key, path, sk->min_transid);
> @@ -2235,11 +2236,12 @@ static noinline int search_ioctl(struct inode *inode,
>                 ret = copy_to_sk(path, &key, sk, buf_size, ubuf,
>                                  &sk_offset, &num_found);
>                 btrfs_release_path(path);
> -               if (ret)
> +               if (ret > 0)
> +                       next_offset = ret;

So after this point, ubuf+sk_offset+next_offset is writeable by
fault_in_writable(). If copy_to_user() was attempted on
ubuf+sk_offset+next_offset, all would be fine, but copy_to_sk() restarts
the copy from ubuf+sk_offset, so it returns exactly the same ret as in
the previous iteration.
Linus Torvalds Nov. 24, 2021, 11 p.m. UTC | #3
Catalin talked about the other change, but this part:

On Wed, Nov 24, 2021 at 12:04 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> (where __copy_to_user_nofault() is a new function that does exactly what
> copy_to_user_nofault() does, but returns the number of bytes copied)

If we want the "how many bytes" part, then we should just make
copy_to_user_nofault() have the same semantics as a plain
copy_to_user().

IOW, change it to return "number of bytes not copied".

Looking at the current uses, such a change would be trivial. The only
case that wants a 0/-EFAULT error is the bpf_probe_write_user(),
everybody else already just wants "zero for success", so changing
copy_to_user_nofault() would be trivial.

And it really is odd and very non-intuitive that
copy_to_user_nofault() has a completely different return value from
copy_to_user().

So if _anybody_ wants a byte-count, that should just be fixed.

                    Linus
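
A minimal sketch of the return-value change Linus describes, modelled on the current mm/maccess.c implementation (illustrative only; the in-tree copy_to_user_nofault() still returns 0/-EFAULT):

	/* like copy_to_user_nofault(), but return the number of bytes not
	 * copied, matching copy_to_user() semantics */
	long copy_to_user_nofault(void __user *dst, const void *src, size_t size)
	{
		long ret = size;	/* nothing copied yet */

		if (access_ok(dst, size)) {
			pagefault_disable();
			ret = __copy_to_user_inatomic(dst, src, size);
			pagefault_enable();
		}
		return ret;		/* bytes not copied, 0 on success */
	}

The only change from the existing function is dropping the "if (ret) ret = -EFAULT;" squash, plus converting the few callers that currently test for -EFAULT.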
Catalin Marinas Nov. 25, 2021, 11:10 a.m. UTC | #4
On Wed, Nov 24, 2021 at 03:00:00PM -0800, Linus Torvalds wrote:
> On Wed, Nov 24, 2021 at 12:04 PM Matthew Wilcox <willy@infradead.org> wrote:
> > (where __copy_to_user_nofault() is a new function that does exactly what
> > copy_to_user_nofault() does, but returns the number of bytes copied)
> 
> If we want the "how many bytes" part, then we should just make
> copy_to_user_nofault() have the same semantics as a plain
> copy_to_user().
> 
> IOW, change it to return "number of bytes not copied".
> 
> Looking at the current uses, such a change would be trivial. The only
> case that wants a 0/-EFAULT error is the bpf_probe_write_user(),
> everybody else already just wants "zero for success", so changing
> copy_to_user_nofault() would be trivial.

I agree, if we want the number of bytes not copied, we should just change
copy_{to,from}_user_nofault() and their callers (I can count three
each).

For this specific btrfs case, if we want to go with tuning the offset based
on the fault address, we'd need copy_to_user_nofault() (or a new
function) to be exact. IOW, fall back to byte-at-a-time copy until it
hits the real faulting address.
Linus Torvalds Nov. 25, 2021, 6:13 p.m. UTC | #5
On Thu, Nov 25, 2021 at 3:10 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> For this specific btrfs case, if we want to go with tuning the offset based
> on the fault address, we'd need copy_to_user_nofault() (or a new
> function) to be exact.

I really don't see why you harp on the exactness.

I really believe that the fix is to make the read/write probing just
be more aggressive.

Make the read/write probing require that AT LEAST <n> bytes be
readable/writable at the beginning, where 'n' is 'min(len,ALIGN)', and
ALIGN is whatever size that copy_from/to_user_xyz() might require just
because it might do multi-byte accesses.

In fact, make ALIGN be perhaps something reasonable like 512 bytes or
whatever, and then you know you can handle the btrfs "copy a whole
structure and reset if that fails" case too.

Don't require that the fundamental copying routines (and whatever
fixup the code might need) be some kind of byte-precise - it's the
error case that should instead be made stricter.

If the user gave you a range that triggered a pointer color mismatch,
then returning an error is fine, rather than say "we'll do as much as
we can and waste time and effort on being byte-exact too".

Your earlier argument was that it was too expensive to probe things.
That was based on looking at the whole range that might be MB (or GB)
in size. So just make it check the first <n> bytes, and problem
solved.

                 Linus
Catalin Marinas Nov. 25, 2021, 8:43 p.m. UTC | #6
On Thu, Nov 25, 2021 at 10:13:25AM -0800, Linus Torvalds wrote:
> On Thu, Nov 25, 2021 at 3:10 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > For this specific btrfs case, if we want to go with tuning the offset based
> > on the fault address, we'd need copy_to_user_nofault() (or a new
> > function) to be exact.
> 
> I really don't see why you harp on the exactness.

I guess because I always thought we either fix fault_in_writable() to
probe the whole range (this series) or we change the loops to take the
copy_to_user() returned value into account when re-faulting.

> I really believe that the fix is to make the read/write probing just
> be more aggressive.
> 
> Make the read/write probing require that AT LEAST <n> bytes be
> readable/writable at the beginning, where 'n' is 'min(len,ALIGN)', and
> ALIGN is whatever size that copy_from/to_user_xyz() might require just
> because it might do multi-byte accesses.
> 
> In fact, make ALIGN be perhaps something reasonable like 512 bytes or
> whatever, and then you know you can handle the btrfs "copy a whole
> structure and reset if that fails" case too.

IIUC what you are suggesting, we still need changes to the btrfs loop
similar to willy's but that should work fine together with a slightly
more aggressive fault_in_writable().

A probing of at least sizeof(struct btrfs_ioctl_search_key) should
suffice without any loop changes and 512 would cover it but it doesn't
look generic enough. We could pass a 'probe_prefix' argument to
fault_in_exact_writeable() to only probe this and btrfs would just
specify the above sizeof().

> Don't require that the fundamental copying routines (and whatever
> fixup the code might need) be some kind of byte-precise - it's the
> error case that should instead be made stricter.
> 
> If the user gave you a range that triggered a pointer color mismatch,
> then returning an error is fine, rather than say "we'll do as much as
> we can and waste time and effort on being byte-exact too".

Yes, I'm fine with not copying anything at all, I just want to avoid the
infinite loop.

I think we are down to three approaches:

1. Probe the whole range in fault_in_*() for sub-page faults, no need to
   worry about copy_*_user() failures.

2. Get fault_in_*() to probe a sufficient prefix to cover the uaccess
   inexactness. In addition, change the btrfs loop to fault-in from
   where the copy_to_user() failed (willy's suggestion combined with
   the more aggressive probing in fault_in_*()).

3. Implement fault_in_exact_writeable(uaddr, size, probe_prefix) where
   loops that do some rewind would pass an appropriate value.

(1) is this series, (2) requires changing the loop logic, (3) looks
pretty simple.

I'm happy to attempt either (2) or (3) with a preference for the latter.
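
For illustration, option (3) at the btrfs call site might look like this (the probe_prefix argument is hypothetical and not part of this series; the search header size is used as the prefix since that is the rewind granularity mentioned in the commit message):

	if (fault_in_exact_writeable(ubuf + sk_offset, *buf_size - sk_offset,
				     sizeof(struct btrfs_ioctl_search_header)))
		break;

i.e. only the prefix that the loop can rewind over is probed at sub-page granularity, the rest at the usual page granularity.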
Matthew Wilcox (Oracle) Nov. 25, 2021, 9:02 p.m. UTC | #7
On Thu, Nov 25, 2021 at 08:43:57PM +0000, Catalin Marinas wrote:
> > I really believe that the fix is to make the read/write probing just
> > be more aggressive.
> > 
> > Make the read/write probing require that AT LEAST <n> bytes be
> > readable/writable at the beginning, where 'n' is 'min(len,ALIGN)', and
> > ALIGN is whatever size that copy_from/to_user_xyz() might require just
> > because it might do multi-byte accesses.
> > 
> > In fact, make ALIGN be perhaps something reasonable like 512 bytes or
> > whatever, and then you know you can handle the btrfs "copy a whole
> > structure and reset if that fails" case too.
> 
> IIUC what you are suggesting, we still need changes to the btrfs loop
> similar to willy's but that should work fine together with a slightly
> more aggressive fault_in_writable().
> 
> A probing of at least sizeof(struct btrfs_ioctl_search_key) should
> suffice without any loop changes and 512 would cover it but it doesn't
> look generic enough. We could pass a 'probe_prefix' argument to
> fault_in_exact_writeable() to only probe this and btrfs would just
> specify the above sizeof().

How about something like this?

+++ b/mm/gup.c
@@ -1672,6 +1672,13 @@ size_t fault_in_writeable(char __user *uaddr, size_t size)

        if (unlikely(size == 0))
                return 0;
+       if (SUBPAGE_PROBE_INTERVAL) {
+               while (uaddr < PAGE_ALIGN((unsigned long)uaddr)) {
+                       if (unlikely(__put_user(0, uaddr) != 0))
+                               goto out;
+                       uaddr += SUBPAGE_PROBE_INTERVAL;
+               }
+       }
        if (!PAGE_ALIGNED(uaddr)) {
                if (unlikely(__put_user(0, uaddr) != 0))
                        return size;

ARM then defines SUBPAGE_PROBE_INTERVAL to be 16 and the rest of us
leave it as 0.  That way we probe all the way to the end of the current
page and the start of the next page.

Oh, that needs to be checked to not exceed size as well ... anyway,
you get the idea.
Catalin Marinas Nov. 25, 2021, 9:29 p.m. UTC | #8
On Thu, Nov 25, 2021 at 09:02:29PM +0000, Matthew Wilcox wrote:
> On Thu, Nov 25, 2021 at 08:43:57PM +0000, Catalin Marinas wrote:
> > > I really believe that the fix is to make the read/write probing just
> > > be more aggressive.
> > > 
> > > Make the read/write probing require that AT LEAST <n> bytes be
> > > readable/writable at the beginning, where 'n' is 'min(len,ALIGN)', and
> > > ALIGN is whatever size that copy_from/to_user_xyz() might require just
> > > because it might do multi-byte accesses.
> > > 
> > > In fact, make ALIGN be perhaps something reasonable like 512 bytes or
> > > whatever, and then you know you can handle the btrfs "copy a whole
> > > structure and reset if that fails" case too.
> > 
> > IIUC what you are suggesting, we still need changes to the btrfs loop
> > similar to willy's but that should work fine together with a slightly
> > more aggressive fault_in_writable().
> > 
> > A probing of at least sizeof(struct btrfs_ioctl_search_key) should
> > suffice without any loop changes and 512 would cover it but it doesn't
> > look generic enough. We could pass a 'probe_prefix' argument to
> > fault_in_exact_writeable() to only probe this and btrfs would just
> > specify the above sizeof().
> 
> How about something like this?
> 
> +++ b/mm/gup.c
> @@ -1672,6 +1672,13 @@ size_t fault_in_writeable(char __user *uaddr, size_t size)
> 
>         if (unlikely(size == 0))
>                 return 0;
> +       if (SUBPAGE_PROBE_INTERVAL) {
> +               while (uaddr < PAGE_ALIGN((unsigned long)uaddr)) {
> +                       if (unlikely(__put_user(0, uaddr) != 0))
> +                               goto out;
> +                       uaddr += SUBPAGE_PROBE_INTERVAL;
> +               }
> +       }
>         if (!PAGE_ALIGNED(uaddr)) {
>                 if (unlikely(__put_user(0, uaddr) != 0))
>                         return size;
> 
> ARM then defines SUBPAGE_PROBE_INTERVAL to be 16 and the rest of us
> leave it as 0.  That way we probe all the way to the end of the current
> page and the start of the next page.

It doesn't help if the copy_to_user() fault happens 16 bytes into the
second page for example. The fault_in() passes, copy_to_user() fails and
the loop restarts from the same place. With sub-page faults, the page
boundary doesn't have any relevance. We want to probe the beginning of
the buffer that's at least as big as the loop rewind size even if it
goes past a page boundary.
Andreas Gruenbacher Nov. 25, 2021, 9:40 p.m. UTC | #9
On Thu, Nov 25, 2021 at 10:02 PM Matthew Wilcox <willy@infradead.org> wrote:
> On Thu, Nov 25, 2021 at 08:43:57PM +0000, Catalin Marinas wrote:
> > > I really believe that the fix is to make the read/write probing just
> > > be more aggressive.
> > >
> > > Make the read/write probing require that AT LEAST <n> bytes be
> > > readable/writable at the beginning, where 'n' is 'min(len,ALIGN)', and
> > > ALIGN is whatever size that copy_from/to_user_xyz() might require just
> > > because it might do multi-byte accesses.
> > >
> > > In fact, make ALIGN be perhaps something reasonable like 512 bytes or
> > > whatever, and then you know you can handle the btrfs "copy a whole
> > > structure and reset if that fails" case too.
> >
> > IIUC what you are suggesting, we still need changes to the btrfs loop
> > similar to willy's but that should work fine together with a slightly
> > more aggressive fault_in_writable().
> >
> > A probing of at least sizeof(struct btrfs_ioctl_search_key) should
> > suffice without any loop changes and 512 would cover it but it doesn't
> > look generic enough. We could pass a 'probe_prefix' argument to
> > fault_in_exact_writeable() to only probe this and btrfs would just
> > specify the above sizeof().
>
> How about something like this?
>
> +++ b/mm/gup.c
> @@ -1672,6 +1672,13 @@ size_t fault_in_writeable(char __user *uaddr, size_t size)
>
>         if (unlikely(size == 0))
>                 return 0;
> +       if (SUBPAGE_PROBE_INTERVAL) {
> +               while (uaddr < PAGE_ALIGN((unsigned long)uaddr)) {
> +                       if (unlikely(__put_user(0, uaddr) != 0))
> +                               goto out;
> +                       uaddr += SUBPAGE_PROBE_INTERVAL;
> +               }
> +       }
>         if (!PAGE_ALIGNED(uaddr)) {
>                 if (unlikely(__put_user(0, uaddr) != 0))
>                         return size;
>
> ARM then defines SUBPAGE_PROBE_INTERVAL to be 16 and the rest of us
> leave it as 0.  That way we probe all the way to the end of the current
> page and the start of the next page.
>
> Oh, that needs to be checked to not exceed size as well ... anyway,
> you get the idea.

Note that we don't need this additional probing when accessing the
buffer with byte granularity. That's probably the common case.

Andreas
Andreas Gruenbacher Nov. 25, 2021, 10:25 p.m. UTC | #10
On Wed, Nov 24, 2021 at 9:37 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
> On Wed, Nov 24, 2021 at 08:03:58PM +0000, Matthew Wilcox wrote:
> > On Wed, Nov 24, 2021 at 07:20:24PM +0000, Catalin Marinas wrote:
> > > +++ b/fs/btrfs/ioctl.c
> > > @@ -2223,7 +2223,8 @@ static noinline int search_ioctl(struct inode *inode,
> > >
> > >     while (1) {
> > >             ret = -EFAULT;
> > > -           if (fault_in_writeable(ubuf + sk_offset, *buf_size - sk_offset))
> > > +           if (fault_in_exact_writeable(ubuf + sk_offset,
> > > +                                        *buf_size - sk_offset))
> > >                     break;
> > >
> > >             ret = btrfs_search_forward(root, &key, path, sk->min_transid);
> >
> > Couldn't we avoid all of this nastiness by doing ...
>
> I had a similar attempt initially but I concluded that it doesn't work:
>
> https://lore.kernel.org/r/YS40qqmXL7CMFLGq@arm.com
>
> > @@ -2121,10 +2121,9 @@ static noinline int copy_to_sk(struct btrfs_path *path,
> >                  * problem. Otherwise we'll fault and then copy the buffer in
> >                  * properly this next time through
> >                  */
> > -               if (copy_to_user_nofault(ubuf + *sk_offset, &sh, sizeof(sh))) {
> > -                       ret = 0;
> > +               ret = __copy_to_user_nofault(ubuf + *sk_offset, &sh, sizeof(sh));
> > +               if (ret)
>
> There is no requirement for the arch implementation to be exact and copy
> the maximum number of bytes possible. It can fail early while there are
> still some bytes left that would not fault. The only requirement is that
> if it is restarted from where it faulted, it makes some progress (on
> arm64 there is one extra byte).
>
> >                         goto out;
> > -               }
> >
> >                 *sk_offset += sizeof(sh);
> > @@ -2196,6 +2195,7 @@ static noinline int search_ioctl(struct inode *inode,
> >         int ret;
> >         int num_found = 0;
> >         unsigned long sk_offset = 0;
> > +       unsigned long next_offset = 0;
> >
> >         if (*buf_size < sizeof(struct btrfs_ioctl_search_header)) {
> >                 *buf_size = sizeof(struct btrfs_ioctl_search_header);
> > @@ -2223,7 +2223,8 @@ static noinline int search_ioctl(struct inode *inode,
> >
> >         while (1) {
> >                 ret = -EFAULT;
> > -               if (fault_in_writeable(ubuf + sk_offset, *buf_size - sk_offset))
> > +               if (fault_in_writeable(ubuf + sk_offset + next_offset,
> > +                                       *buf_size - sk_offset - next_offset))
> >                         break;
> >
> >                 ret = btrfs_search_forward(root, &key, path, sk->min_transid);
> > @@ -2235,11 +2236,12 @@ static noinline int search_ioctl(struct inode *inode,
> >                 ret = copy_to_sk(path, &key, sk, buf_size, ubuf,
> >                                  &sk_offset, &num_found);
> >                 btrfs_release_path(path);
> > -               if (ret)
> > +               if (ret > 0)
> > +                       next_offset = ret;
>
> So after this point, ubuf+sk_offset+next_offset is writeable by
> fault_in_writable(). If copy_to_user() was attempted on
> ubuf+sk_offset+next_offset, all would be fine, but copy_to_sk() restarts
> > the copy from ubuf+sk_offset, so it returns exactly the same ret as in
> the previous iteration.

So this means that after a short copy_to_user_nofault(), copy_to_sk()
needs to figure out the actual point of failure. We'll have the same
problem elsewhere, so this should probably be a generic helper. The
alignment hacks are arch specific, so maybe we can have a generic
version that assumes no alignment restrictions, with arch-specific
overrides.

Once we know the exact point of failure, a
fault_in_writeable(point_of_failure, 1) in search_ioctl() will tell if
the failure is pertinent. Once we know that the failure isn't
pertinent, we're safe to retry the original fault_in_writeable().

Thanks,
Andreas
Catalin Marinas Nov. 25, 2021, 10:42 p.m. UTC | #11
On Thu, Nov 25, 2021 at 11:25:54PM +0100, Andreas Gruenbacher wrote:
> On Wed, Nov 24, 2021 at 9:37 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > On Wed, Nov 24, 2021 at 08:03:58PM +0000, Matthew Wilcox wrote:
> > > On Wed, Nov 24, 2021 at 07:20:24PM +0000, Catalin Marinas wrote:
> > > > +++ b/fs/btrfs/ioctl.c
> > > > @@ -2223,7 +2223,8 @@ static noinline int search_ioctl(struct inode *inode,
> > > >
> > > >     while (1) {
> > > >             ret = -EFAULT;
> > > > -           if (fault_in_writeable(ubuf + sk_offset, *buf_size - sk_offset))
> > > > +           if (fault_in_exact_writeable(ubuf + sk_offset,
> > > > +                                        *buf_size - sk_offset))
> > > >                     break;
> > > >
> > > >             ret = btrfs_search_forward(root, &key, path, sk->min_transid);
> > >
> > > Couldn't we avoid all of this nastiness by doing ...
> >
> > I had a similar attempt initially but I concluded that it doesn't work:
> >
> > https://lore.kernel.org/r/YS40qqmXL7CMFLGq@arm.com
> >
> > > @@ -2121,10 +2121,9 @@ static noinline int copy_to_sk(struct btrfs_path *path,
> > >                  * problem. Otherwise we'll fault and then copy the buffer in
> > >                  * properly this next time through
> > >                  */
> > > -               if (copy_to_user_nofault(ubuf + *sk_offset, &sh, sizeof(sh))) {
> > > -                       ret = 0;
> > > +               ret = __copy_to_user_nofault(ubuf + *sk_offset, &sh, sizeof(sh));
> > > +               if (ret)
> >
> > There is no requirement for the arch implementation to be exact and copy
> > the maximum number of bytes possible. It can fail early while there are
> > still some bytes left that would not fault. The only requirement is that
> > if it is restarted from where it faulted, it makes some progress (on
> > arm64 there is one extra byte).
> >
> > >                         goto out;
> > > -               }
> > >
> > >                 *sk_offset += sizeof(sh);
> > > @@ -2196,6 +2195,7 @@ static noinline int search_ioctl(struct inode *inode,
> > >         int ret;
> > >         int num_found = 0;
> > >         unsigned long sk_offset = 0;
> > > +       unsigned long next_offset = 0;
> > >
> > >         if (*buf_size < sizeof(struct btrfs_ioctl_search_header)) {
> > >                 *buf_size = sizeof(struct btrfs_ioctl_search_header);
> > > @@ -2223,7 +2223,8 @@ static noinline int search_ioctl(struct inode *inode,
> > >
> > >         while (1) {
> > >                 ret = -EFAULT;
> > > -               if (fault_in_writeable(ubuf + sk_offset, *buf_size - sk_offset))
> > > +               if (fault_in_writeable(ubuf + sk_offset + next_offset,
> > > +                                       *buf_size - sk_offset - next_offset))
> > >                         break;
> > >
> > >                 ret = btrfs_search_forward(root, &key, path, sk->min_transid);
> > > @@ -2235,11 +2236,12 @@ static noinline int search_ioctl(struct inode *inode,
> > >                 ret = copy_to_sk(path, &key, sk, buf_size, ubuf,
> > >                                  &sk_offset, &num_found);
> > >                 btrfs_release_path(path);
> > > -               if (ret)
> > > +               if (ret > 0)
> > > +                       next_offset = ret;
> >
> > So after this point, ubuf+sk_offset+next_offset is writeable by
> > fault_in_writable(). If copy_to_user() was attempted on
> > ubuf+sk_offset+next_offset, all would be fine, but copy_to_sk() restarts
> > > the copy from ubuf+sk_offset, so it returns exactly the same ret as in
> > the previous iteration.
> 
> So this means that after a short copy_to_user_nofault(), copy_to_sk()
> needs to figure out the actual point of failure. We'll have the same
> problem elsewhere, so this should probably be a generic helper. The
> alignment hacks are arch specific, so maybe we can have a generic
> version that assumes no alignment restrictions, with arch-specific
> overrides.
> 
> Once we know the exact point of failure, a
> fault_in_writeable(point_of_failure, 1) in search_ioctl() will tell if
> the failure is pertinent. Once we know that the failure isn't
> pertinent, we're safe to retry the original fault_in_writeable().

The "exact point of failure" is problematic since copy_to_user() may
fail a few bytes before the actual fault point (e.g. by doing an
unaligned store). As per Linus' reply, we can work around this by doing
a sub-page fault_in_writable(point_of_failure, align) where 'align'
should cover the copy_to_user() impreciseness.

(of course, fault_in_writable() takes the full size argument but behind
the scene it probes the 'align' prefix at sub-page fault granularity)
David Sterba Nov. 26, 2021, 4:42 p.m. UTC | #12
On Wed, Nov 24, 2021 at 07:20:24PM +0000, Catalin Marinas wrote:
> Commit a48b73eca4ce ("btrfs: fix potential deadlock in the search
> ioctl") addressed a lockdep warning by pre-faulting the user pages and
> attempting the copy_to_user_nofault() in an infinite loop. On
> architectures like arm64 with MTE, an access may fault within a page at
> a location different from what fault_in_writeable() probed. Since the
> sk_offset is rewound to the previous struct btrfs_ioctl_search_header
> boundary, there is no guaranteed forward progress and search_ioctl() may
> live-lock.
> 
> Use fault_in_exact_writeable() instead which probes the entire user
> buffer for faults at sub-page granularity.
> 
> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
> Reported-by: Al Viro <viro@zeniv.linux.org.uk>

Acked-by: David Sterba <dsterba@suse.com>
Andreas Gruenbacher Nov. 26, 2021, 10:29 p.m. UTC | #13
On Thu, Nov 25, 2021 at 11:42 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
> On Thu, Nov 25, 2021 at 11:25:54PM +0100, Andreas Gruenbacher wrote:
> > On Wed, Nov 24, 2021 at 9:37 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > > On Wed, Nov 24, 2021 at 08:03:58PM +0000, Matthew Wilcox wrote:
> > > > On Wed, Nov 24, 2021 at 07:20:24PM +0000, Catalin Marinas wrote:
> > > > > +++ b/fs/btrfs/ioctl.c
> > > > > @@ -2223,7 +2223,8 @@ static noinline int search_ioctl(struct inode *inode,
> > > > >
> > > > >     while (1) {
> > > > >             ret = -EFAULT;
> > > > > -           if (fault_in_writeable(ubuf + sk_offset, *buf_size - sk_offset))
> > > > > +           if (fault_in_exact_writeable(ubuf + sk_offset,
> > > > > +                                        *buf_size - sk_offset))
> > > > >                     break;
> > > > >
> > > > >             ret = btrfs_search_forward(root, &key, path, sk->min_transid);
> > > >
> > > > Couldn't we avoid all of this nastiness by doing ...
> > >
> > > I had a similar attempt initially but I concluded that it doesn't work:
> > >
> > > https://lore.kernel.org/r/YS40qqmXL7CMFLGq@arm.com
> > >
> > > > @@ -2121,10 +2121,9 @@ static noinline int copy_to_sk(struct btrfs_path *path,
> > > >                  * problem. Otherwise we'll fault and then copy the buffer in
> > > >                  * properly this next time through
> > > >                  */
> > > > -               if (copy_to_user_nofault(ubuf + *sk_offset, &sh, sizeof(sh))) {
> > > > -                       ret = 0;
> > > > +               ret = __copy_to_user_nofault(ubuf + *sk_offset, &sh, sizeof(sh));
> > > > +               if (ret)
> > >
> > > There is no requirement for the arch implementation to be exact and copy
> > > the maximum number of bytes possible. It can fail early while there are
> > > still some bytes left that would not fault. The only requirement is that
> > > if it is restarted from where it faulted, it makes some progress (on
> > > arm64 there is one extra byte).
> > >
> > > >                         goto out;
> > > > -               }
> > > >
> > > >                 *sk_offset += sizeof(sh);
> > > > @@ -2196,6 +2195,7 @@ static noinline int search_ioctl(struct inode *inode,
> > > >         int ret;
> > > >         int num_found = 0;
> > > >         unsigned long sk_offset = 0;
> > > > +       unsigned long next_offset = 0;
> > > >
> > > >         if (*buf_size < sizeof(struct btrfs_ioctl_search_header)) {
> > > >                 *buf_size = sizeof(struct btrfs_ioctl_search_header);
> > > > @@ -2223,7 +2223,8 @@ static noinline int search_ioctl(struct inode *inode,
> > > >
> > > >         while (1) {
> > > >                 ret = -EFAULT;
> > > > -               if (fault_in_writeable(ubuf + sk_offset, *buf_size - sk_offset))
> > > > +               if (fault_in_writeable(ubuf + sk_offset + next_offset,
> > > > +                                       *buf_size - sk_offset - next_offset))
> > > >                         break;
> > > >
> > > >                 ret = btrfs_search_forward(root, &key, path, sk->min_transid);
> > > > @@ -2235,11 +2236,12 @@ static noinline int search_ioctl(struct inode *inode,
> > > >                 ret = copy_to_sk(path, &key, sk, buf_size, ubuf,
> > > >                                  &sk_offset, &num_found);
> > > >                 btrfs_release_path(path);
> > > > -               if (ret)
> > > > +               if (ret > 0)
> > > > +                       next_offset = ret;
> > >
> > > So after this point, ubuf+sk_offset+next_offset is writeable by
> > > fault_in_writable(). If copy_to_user() was attempted on
> > > ubuf+sk_offset+next_offset, all would be fine, but copy_to_sk() restarts
> > > > the copy from ubuf+sk_offset, so it returns exactly the same ret as in
> > > the previous iteration.
> >
> > So this means that after a short copy_to_user_nofault(), copy_to_sk()
> > needs to figure out the actual point of failure. We'll have the same
> > problem elsewhere, so this should probably be a generic helper. The
> > alignment hacks are arch specific, so maybe we can have a generic
> > version that assumes no alignment restrictions, with arch-specific
> > overrides.
> >
> > Once we know the exact point of failure, a
> > fault_in_writeable(point_of_failure, 1) in search_ioctl() will tell if
> > the failure is pertinent. Once we know that the failure isn't
> > pertinent, we're safe to retry the original fault_in_writeable().
>
> The "exact point of failure" is problematic since copy_to_user() may
> fail a few bytes before the actual fault point (e.g. by doing an
> unaligned store).

That's why after the initial failure, we must keep trying until we hit
the actual point of failure independent of alignment.  If there's even a
single writable byte left, fault_in_writable() won't fail and we'll be
stuck in a loop.

On the other hand, once we've reached the actual point of failure, the
existing version of fault_in_writeable() will work for sub-page faults
as well.

> As per Linus' reply, we can work around this by doing
> a sub-page fault_in_writable(point_of_failure, align) where 'align'
> should cover the copy_to_user() impreciseness.
>
> (of course, fault_in_writable() takes the full size argument but behind
> the scene it probes the 'align' prefix at sub-page fault granularity)

That doesn't make sense; we don't want fault_in_writable() to fail or
succeed depending on the alignment of the address range passed to it.

Have a look at the below code to see what I mean.  Function
copy_to_user_nofault_unaligned() should be further optimized, maybe as
mm/maccess.c:copy_from_kernel_nofault() and/or per architecture
depending on the actual alignment rules; I'm not sure.

Thanks,
Andreas

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 4e03a6d3aa32..067408fd26f9 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -6764,7 +6764,8 @@ void read_extent_buffer(const struct extent_buffer *eb, void *dstv,
 
 int read_extent_buffer_to_user_nofault(const struct extent_buffer *eb,
 				       void __user *dstv,
-				       unsigned long start, unsigned long len)
+				       unsigned long start, unsigned long len,
+				       void __user **copy_failure)
 {
 	size_t cur;
 	size_t offset;
@@ -6773,6 +6774,7 @@ int read_extent_buffer_to_user_nofault(const struct extent_buffer *eb,
 	char __user *dst = (char __user *)dstv;
 	unsigned long i = get_eb_page_index(start);
 	int ret = 0;
+	size_t rest;
 
 	WARN_ON(start > eb->len);
 	WARN_ON(start + len > eb->start + eb->len);
@@ -6784,7 +6786,9 @@ int read_extent_buffer_to_user_nofault(const struct extent_buffer *eb,
 
 		cur = min(len, (PAGE_SIZE - offset));
 		kaddr = page_address(page);
-		if (copy_to_user_nofault(dst, kaddr + offset, cur)) {
+		rest = copy_to_user_nofault_unaligned(dst, kaddr + offset, cur);
+		if (rest) {
+			*copy_failure = dst + cur - rest;
 			ret = -EFAULT;
 			break;
 		}
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 0399cf8e3c32..833ff597a27f 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -238,9 +238,11 @@ int memcmp_extent_buffer(const struct extent_buffer *eb, const void *ptrv,
 void read_extent_buffer(const struct extent_buffer *eb, void *dst,
 			unsigned long start,
 			unsigned long len);
+size_t copy_to_user_nofault_unaligned(void __user *to, void *from, size_t size);
 int read_extent_buffer_to_user_nofault(const struct extent_buffer *eb,
 				       void __user *dst, unsigned long start,
-				       unsigned long len);
+				       unsigned long len,
+				       void __user **copy_failure);
 void write_extent_buffer_fsid(const struct extent_buffer *eb, const void *src);
 void write_extent_buffer_chunk_tree_uuid(const struct extent_buffer *eb,
 		const void *src);
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index fb8cc9642ac4..646f9c0251d9 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2051,13 +2051,30 @@ static noinline int key_in_sk(struct btrfs_key *key,
 	return 1;
 }
 
+size_t copy_to_user_nofault_unaligned(void __user *to, void *from, size_t size)
+{
+	size_t rest = copy_to_user_nofault(to, from, size);
+
+	if (rest) {
+		size_t n;
+
+		for (n = size - rest; n < size; n++) {
+			if (copy_to_user_nofault(to + n, from + n, 1))
+				break;
+		}
+		rest = size - n;
+	}
+	return rest;
+}
+
 static noinline int copy_to_sk(struct btrfs_path *path,
 			       struct btrfs_key *key,
 			       struct btrfs_ioctl_search_key *sk,
 			       size_t *buf_size,
 			       char __user *ubuf,
 			       unsigned long *sk_offset,
-			       int *num_found)
+			       int *num_found,
+			       void __user **copy_failure)
 {
 	u64 found_transid;
 	struct extent_buffer *leaf;
@@ -2069,6 +2086,7 @@ static noinline int copy_to_sk(struct btrfs_path *path,
 	int i;
 	int slot;
 	int ret = 0;
+	size_t rest;
 
 	leaf = path->nodes[0];
 	slot = path->slots[0];
@@ -2121,7 +2139,9 @@ static noinline int copy_to_sk(struct btrfs_path *path,
 		 * problem. Otherwise we'll fault and then copy the buffer in
 		 * properly this next time through
 		 */
-		if (copy_to_user_nofault(ubuf + *sk_offset, &sh, sizeof(sh))) {
+		rest = copy_to_user_nofault_unaligned(ubuf + *sk_offset, &sh, sizeof(sh));
+		if (rest) {
+			*copy_failure = ubuf + *sk_offset + sizeof(sh) - rest;
 			ret = 0;
 			goto out;
 		}
@@ -2135,7 +2155,8 @@ static noinline int copy_to_sk(struct btrfs_path *path,
 			 * * sk_offset so we copy the full thing again.
 			 */
 			if (read_extent_buffer_to_user_nofault(leaf, up,
-						item_off, item_len)) {
+						item_off, item_len,
+						copy_failure)) {
 				ret = 0;
 				*sk_offset -= sizeof(sh);
 				goto out;
@@ -2222,6 +2243,8 @@ static noinline int search_ioctl(struct inode *inode,
 	key.offset = sk->min_offset;
 
 	while (1) {
+		void __user *copy_failure = NULL;
+
 		ret = -EFAULT;
 		if (fault_in_writeable(ubuf + sk_offset, *buf_size - sk_offset))
 			break;
@@ -2233,11 +2256,13 @@ static noinline int search_ioctl(struct inode *inode,
 			goto err;
 		}
 		ret = copy_to_sk(path, &key, sk, buf_size, ubuf,
-				 &sk_offset, &num_found);
+				 &sk_offset, &num_found, &copy_failure);
 		btrfs_release_path(path);
 		if (ret)
 			break;
-
+		ret = -EFAULT;
+		if (copy_failure && fault_in_writeable(copy_failure, 1))
+			break;
 	}
 	if (ret > 0)
 		ret = 0;
Catalin Marinas Nov. 26, 2021, 10:57 p.m. UTC | #14
On Fri, Nov 26, 2021 at 11:29:45PM +0100, Andreas Gruenbacher wrote:
> On Thu, Nov 25, 2021 at 11:42 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > As per Linus' reply, we can work around this by doing
> > a sub-page fault_in_writable(point_of_failure, align) where 'align'
> > should cover the copy_to_user() impreciseness.
> >
> > (of course, fault_in_writable() takes the full size argument but behind
> > the scene it probes the 'align' prefix at sub-page fault granularity)
> 
> That doesn't make sense; we don't want fault_in_writable() to fail or
> succeed depending on the alignment of the address range passed to it.

If we know that the arch copy_to_user() has an error of say maximum 16
bytes (or 15 rather on arm64), we can instead get fault_in_writeable()
to probe the first 16 bytes rather than 1.

> Have a look at the below code to see what I mean.  Function
> copy_to_user_nofault_unaligned() should be further optimized, maybe as
> mm/maccess.c:copy_from_kernel_nofault() and/or per architecture
> depending on the actual alignment rules; I'm not sure.
[...]
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -2051,13 +2051,30 @@ static noinline int key_in_sk(struct btrfs_key *key,
>  	return 1;
>  }
>  
> +size_t copy_to_user_nofault_unaligned(void __user *to, void *from, size_t size)
> +{
> +	size_t rest = copy_to_user_nofault(to, from, size);
> +
> +	if (rest) {
> +		size_t n;
> +
> +		for (n = size - rest; n < size; n++) {
> +			if (copy_to_user_nofault(to + n, from + n, 1))
> +				break;
> +		}
> +		rest = size - n;
> +	}
> +	return rest;

That's what I was trying to avoid. That's basically a fall-back to byte
at a time copy (we do this in copy_mount_options(); at some point we
even had a copy_from_user_exact() IIRC).

Linus' idea (if I got it correctly) was instead to slightly extend the
probing in fault_in_writeable() for the beginning of the buffer from 1
byte to some per-arch range.

I attempted the above here and works ok:

https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/log/?h=devel/btrfs-live-lock-fix

but too late to post it this evening, I'll do it in the next day or so
as an alternative to this series.
Andreas Gruenbacher Nov. 27, 2021, 3:52 a.m. UTC | #15
On Sat, Nov 27, 2021 at 12:06 AM Catalin Marinas
<catalin.marinas@arm.com> wrote:
> On Fri, Nov 26, 2021 at 11:29:45PM +0100, Andreas Gruenbacher wrote:
> > On Thu, Nov 25, 2021 at 11:42 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > > As per Linus' reply, we can work around this by doing
> > > a sub-page fault_in_writable(point_of_failure, align) where 'align'
> > > should cover the copy_to_user() impreciseness.
> > >
> > > (of course, fault_in_writable() takes the full size argument but behind
> > > the scene it probes the 'align' prefix at sub-page fault granularity)
> >
> > That doesn't make sense; we don't want fault_in_writable() to fail or
> > succeed depending on the alignment of the address range passed to it.
>
> If we know that the arch copy_to_user() has an error of say maximum 16
> bytes (or 15 rather on arm64), we can instead get fault_in_writeable()
> to probe the first 16 bytes rather than 1.

That isn't going to help one bit: [raw_]copy_to_user() is allowed to
copy as little or as much as it wants as long as it follows the rules
documented in include/linux/uaccess.h:

[] If copying succeeds, the return value must be 0.  If some data cannot be
[] fetched, it is permitted to copy less than had been fetched; the only
[] hard requirement is that not storing anything at all (i.e. returning size)
[] should happen only when nothing could be copied.  In other words, you don't
[] have to squeeze as much as possible - it is allowed, but not necessary.

When fault_in_writeable() tells us that an address range is accessible
in principle, that doesn't mean that copy_to_user() will allow us to
access it in arbitrary chunks. It's also not the case that
fault_in_writeable(addr, size) is always followed by
copy_to_user(addr, ..., size) for the exact same address range, not
even in this case.

These alignment restrictions have nothing to do with page or sub-page faults.

I'm also fairly sure that passing in an unaligned buffer will send
search_ioctl into an endless loop on architectures with copy_to_user()
alignment restrictions; there don't seem to be any buffer alignment
checks.

> > Have a look at the below code to see what I mean.  Function
> > copy_to_user_nofault_unaligned() should be further optimized, maybe as
> > mm/maccess.c:copy_from_kernel_nofault() and/or per architecture
> > depending on the actual alignment rules; I'm not sure.
> [...]
> > --- a/fs/btrfs/ioctl.c
> > +++ b/fs/btrfs/ioctl.c
> > @@ -2051,13 +2051,30 @@ static noinline int key_in_sk(struct btrfs_key *key,
> >       return 1;
> >  }
> >
> > +size_t copy_to_user_nofault_unaligned(void __user *to, void *from, size_t size)
> > +{
> > +     size_t rest = copy_to_user_nofault(to, from, size);
> > +
> > +     if (rest) {
> > +             size_t n;
> > +
> > +             for (n = size - rest; n < size; n++) {
> > +                     if (copy_to_user_nofault(to + n, from + n, 1))
> > +                             break;
> > +             }
> > +             rest = size - n;
> > +     }
> > +     return rest;
>
> That's what I was trying to avoid. That's basically a fall-back to byte
> at a time copy (we do this in copy_mount_options(); at some point we
> even had a copy_from_user_exact() IIRC).

We could try 8/4/2 byte chunks if both buffers are 8/4/2-byte aligned.
It's just not clear that it's worth it.

> Linus' idea (if I got it correctly) was instead to slightly extend the
> probing in fault_in_writeable() for the beginning of the buffer from 1
> byte to some per-arch range.
>
> I attempted the above here and works ok:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/log/?h=devel/btrfs-live-lock-fix
>
> but too late to post it this evening, I'll do it in the next day or so
> as an alternative to this series.
>
> --
> Catalin
>

Thanks,
Andreas
Andreas Gruenbacher Nov. 27, 2021, 12:39 p.m. UTC | #16
On Sat, Nov 27, 2021 at 4:52 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> On Sat, Nov 27, 2021 at 12:06 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > On Fri, Nov 26, 2021 at 11:29:45PM +0100, Andreas Gruenbacher wrote:
> > > On Thu, Nov 25, 2021 at 11:42 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > > > As per Linus' reply, we can work around this by doing
> > > > a sub-page fault_in_writable(point_of_failure, align) where 'align'
> > > > should cover the copy_to_user() impreciseness.
> > > >
> > > > (of course, fault_in_writable() takes the full size argument but behind
> > > > the scene it probes the 'align' prefix at sub-page fault granularity)
> > >
> > > That doesn't make sense; we don't want fault_in_writable() to fail or
> > > succeed depending on the alignment of the address range passed to it.
> >
> > If we know that the arch copy_to_user() has an error of say maximum 16
> > bytes (or 15 rather on arm64), we can instead get fault_in_writeable()
> > to probe the first 16 bytes rather than 1.
>
> That isn't going to help one bit: [raw_]copy_to_user() is allowed to
> copy as little or as much as it wants as long as it follows the rules
> documented in include/linux/uaccess.h:
>
> [] If copying succeeds, the return value must be 0.  If some data cannot be
> [] fetched, it is permitted to copy less than had been fetched; the only
> [] hard requirement is that not storing anything at all (i.e. returning size)
> [] should happen only when nothing could be copied.  In other words, you don't
> [] have to squeeze as much as possible - it is allowed, but not necessary.
>
> When fault_in_writeable() tells us that an address range is accessible
> in principle, that doesn't mean that copy_to_user() will allow us to
> access it in arbitrary chunks. It's also not the case that
> fault_in_writeable(addr, size) is always followed by
> copy_to_user(addr, ..., size) for the exact same address range, not
> even in this case.
>
> These alignment restrictions have nothing to do with page or sub-page faults.
>
> I'm also fairly sure that passing in an unaligned buffer will send
> search_ioctl into an endless loop on architectures with copy_to_user()
> alignment restrictions; there don't seem to be any buffer alignment
> checks.

Let me retract that ...

The description in include/linux/uaccess.h leaves out permissible
reasons for fetching/storing less than requested. Thinking about it, if
the address range passed to one of the copy functions includes an
address that faults, it kind of makes sense to allow the copy function
to stop short instead of copying every last byte right up to the address
that fails.

If that's the only reason, then it would be great to have that included
in the description.  And then we can indeed deal with the alignment
effects in fault_in_writeable().

> > > copy_to_user_nofault_unaligned() should be further optimized, maybe as
> > > mm/maccess.c:copy_from_kernel_nofault() and/or per architecture
> > > depending on the actual alignment rules; I'm not sure.
> > [...]
> > > --- a/fs/btrfs/ioctl.c
> > > +++ b/fs/btrfs/ioctl.c
> > > @@ -2051,13 +2051,30 @@ static noinline int key_in_sk(struct btrfs_key *key,
> > >       return 1;
> > >  }
> > >
> > > +size_t copy_to_user_nofault_unaligned(void __user *to, void *from, size_t size)
> > > +{
> > > +     size_t rest = copy_to_user_nofault(to, from, size);
> > > +
> > > +     if (rest) {
> > > +             size_t n;
> > > +
> > > +             for (n = size - rest; n < size; n++) {
> > > +                     if (copy_to_user_nofault(to + n, from + n, 1))
> > > +                             break;
> > > +             }
> > > +             rest = size - n;
> > > +     }
> > > +     return rest;
> >
> > That's what I was trying to avoid. That's basically a fall-back to byte
> > at a time copy (we do this in copy_mount_options(); at some point we
> > even had a copy_from_user_exact() IIRC).
>
> We could try 8/4/2 byte chunks if both buffers are 8/4/2-byte aligned.
> It's just not clear that it's worth it.
>
> > Linus' idea (if I got it correctly) was instead to slightly extend the
> > probing in fault_in_writeable() for the beginning of the buffer from 1
> > byte to some per-arch range.
> >
> > I attempted the above here and works ok:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/log/?h=devel/btrfs-live-lock-fix
> >
> > but too late to post it this evening, I'll do it in the next day or so
> > as an alternative to this series.

I've taken a quick look.  Under the assumption that alignment effects
are tied to page / sub-page faults, I think we can really solve this
generically as Willy has proposed.  Maybe as shown below; no need for
arch-specific code.

Thanks,
Andreas

diff --git a/mm/gup.c b/mm/gup.c
index 2c51e9748a6a..a9b3d916b625 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1658,6 +1658,8 @@ static long __get_user_pages_locked(struct mm_struct *mm, unsigned long start,
 }
 #endif /* !CONFIG_MMU */
 
+#define SUBPAGE_FAULT_SIZE 16
+
 /**
  * fault_in_writeable - fault in userspace address range for writing
  * @uaddr: start of address range
@@ -1673,8 +1675,19 @@ size_t fault_in_writeable(char __user *uaddr, size_t size)
 	if (unlikely(size == 0))
 		return 0;
 	if (!PAGE_ALIGNED(uaddr)) {
+		if (SUBPAGE_FAULT_SIZE &&
+		    !IS_ALIGNED((unsigned long)uaddr, SUBPAGE_FAULT_SIZE)) {
+			end = PTR_ALIGN(uaddr, SUBPAGE_FAULT_SIZE);
+			if (end - uaddr < size) {
+				if (unlikely(__put_user(0, uaddr) != 0))
+					return size;
+				uaddr = end;
+				if (unlikely(!end))
+					goto out;
+			}
+		}
 		if (unlikely(__put_user(0, uaddr) != 0))
-			return size;
+			goto out;
 		uaddr = (char __user *)PAGE_ALIGN((unsigned long)uaddr);
 	}
 	end = (char __user *)PAGE_ALIGN((unsigned long)start + size);
Catalin Marinas Nov. 27, 2021, 2:33 p.m. UTC | #17
On Sat, Nov 27, 2021 at 04:52:16AM +0100, Andreas Gruenbacher wrote:
> On Sat, Nov 27, 2021 at 12:06 AM Catalin Marinas
> <catalin.marinas@arm.com> wrote:
> > On Fri, Nov 26, 2021 at 11:29:45PM +0100, Andreas Gruenbacher wrote:
> > > On Thu, Nov 25, 2021 at 11:42 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > > > As per Linus' reply, we can work around this by doing
> > > > a sub-page fault_in_writable(point_of_failure, align) where 'align'
> > > > should cover the copy_to_user() impreciseness.
> > > >
> > > > (of course, fault_in_writable() takes the full size argument but behind
> > > > the scene it probes the 'align' prefix at sub-page fault granularity)
> > >
> > > That doesn't make sense; we don't want fault_in_writable() to fail or
> > > succeed depending on the alignment of the address range passed to it.
> >
> > If we know that the arch copy_to_user() has an error of say maximum 16
> > bytes (or 15 rather on arm64), we can instead get fault_in_writeable()
> > to probe the first 16 bytes rather than 1.
> 
> That isn't going to help one bit:

Not on its own but it does allow the restarted loop to use
fault_in_writeable() on the address where copy_to_user() stopped,
without the need to be byte-precise in the latter.

> [raw_]copy_to_user() is allowed to
> copy as little or as much as it wants as long as it follows the rules
> documented in include/linux/uaccess.h:
> 
> [] If copying succeeds, the return value must be 0.  If some data cannot be
> [] fetched, it is permitted to copy less than had been fetched; the only
> [] hard requirement is that not storing anything at all (i.e. returning size)
> [] should happen only when nothing could be copied.  In other words, you don't
> [] have to squeeze as much as possible - it is allowed, but not necessary.
> 
> When fault_in_writeable() tells us that an address range is accessible
> in principle, that doesn't mean that copy_to_user() will allow us to
> access it in arbitrary chunks.

Ignoring sub-page faults, my interpretation of the fault_in_writeable()
semantics is that an arbitrary copy_to_user() within the faulted in
range will *eventually* either succeed or the fault_in() fails. There
are some theoretical live-lock conditions like a concurrent thread
changing the permission (mprotect) in a way that fault_in() always
succeeds and copy_to_user() always fails. Fortunately that's just
theoretical.

The above interpretation doesn't hold with sub-page faults because of
the way fault_in_writeable() is probing - one byte per page. This series
takes the big hammer approach of making the liveness assumption above
work in the presence of sub-page faults. I'm fine with this since, from
my tests so far, only the btrfs search_ioctl() is affected and
fault_in_writeable() is not used anywhere else that matters (some
highmem stuff we don't have on arm64).

> It's also not the case that fault_in_writeable(addr, size) is always
> followed by copy_to_user(addr, ..., size) for the exact same address
> range, not even in this case.

I agree, that's not a requirement. But there are some expectations of
how the fault_in_writeable()/copy_to_user() pair is used, typically:

a) pre-fault before the uaccess with the copy_to_user() within the range
   faulted in or

b) copy_to_user() attempted with a subsequent fault_in_writeable() on
   the next address that the uaccess failed to write to.

You can have a combination of the above but not completely disjoint
ranges.

For liveness properties, in addition, fault_in_writeable() needs to
reproduce the fault conditions of the copy_to_user(). If your algorithm
uses something like (a), you'd need to probe the whole range at sub-page
granularity (this series). If you go for something like (b), either
copy_to_user() is exact or fault_in_writeable() compensates for the
uaccess inexactness.
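
(A minimal sketch of pattern (b), purely illustrative:

	left = copy_to_user(ubuf, kbuf, len);
	while (left) {
		size_t done = len - left;

		/* re-fault from where the copy stopped, not from ubuf */
		if (fault_in_writeable(ubuf + done, left))
			return -EFAULT;
		left = copy_to_user(ubuf + done, kbuf + done, left);
	}

Liveness then depends on copy_to_user() stopping at, or fault_in_writeable()
probing past, the real faulting address, which is exactly the compensation
discussed here.)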

> These alignment restrictions have nothing to do with page or sub-page
> faults.

My point wasn't alignment faults (different set of problems, though on
arm64 one needs a device memory type in user space). Let's say we have a
user buffer:

	char mem[32];

and mem[0..15] has MTE tag 0, mem[16..31] has tag 1, on arm64 a
copy_to_user(mem, kbuf, 32) succeeds in writing 16 bytes. However, a
copy_to_user(mem + 8, kbuf, 24) only writes 1 byte even if 8 could have
been written (that's in line with the uaccess requirements you quoted
above).

If we know, for an arch, the maximum delta between the fault address
reported by copy_to_user() and the real one (i.e. what a byte-by-byte
copy would report), we can tweak fault_in_writeable() slightly to probe
a prefix of that length at sub-page granularity and bail out. No need
for an exact copy_to_user().

> I'm also fairly sure that passing in an unaligned buffer will send
> search_ioctl into an endless loop on architectures with copy_to_user()
> alignment restrictions; there don't seem to be any buffer alignment
> checks.

On such architectures, copy_to_user() should take care of doing aligned
writes. I don't think it's for the caller to guarantee anything here as
it doesn't know what the underlying uaccess implementation does. On
arm64, since the architecture can do unaligned writes to Normal memory,
the uaccess optimises the read to be aligned and the write may be
unaligned (write-combining in the hardware buffers sorts this out).

Thanks.
Catalin Marinas Nov. 27, 2021, 3:21 p.m. UTC | #18
On Sat, Nov 27, 2021 at 01:39:58PM +0100, Andreas Gruenbacher wrote:
> On Sat, Nov 27, 2021 at 4:52 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> > On Sat, Nov 27, 2021 at 12:06 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > > If we know that the arch copy_to_user() has an error of say maximum 16
> > > bytes (or 15 rather on arm64), we can instead get fault_in_writeable()
> > > to probe the first 16 bytes rather than 1.
> >
> > That isn't going to help one bit: [raw_]copy_to_user() is allowed to
> > copy as little or as much as it wants as long as it follows the rules
> > documented in include/linux/uaccess.h:
> >
> > [] If copying succeeds, the return value must be 0.  If some data cannot be
> > [] fetched, it is permitted to copy less than had been fetched; the only
> > [] hard requirement is that not storing anything at all (i.e. returning size)
> > [] should happen only when nothing could be copied.  In other words, you don't
> > [] have to squeeze as much as possible - it is allowed, but not necessary.
> >
> > When fault_in_writeable() tells us that an address range is accessible
> > in principle, that doesn't mean that copy_to_user() will allow us to
> > access it in arbitrary chunks. It's also not the case that
> > fault_in_writeable(addr, size) is always followed by
> > copy_to_user(addr, ..., size) for the exact same address range, not
> > even in this case.
> >
> > These alignment restrictions have nothing to do with page or sub-page faults.
> >
> > I'm also fairly sure that passing in an unaligned buffer will send
> > search_ioctl into an endless loop on architectures with copy_to_user()
> > alignment restrictions; there don't seem to be any buffer alignment
> > checks.
> 
> Let me retract that ...
> 
> The description in include/linux/uaccess.h leaves out permissible
> reasons for fetching/storing less than requested. Thinking about it, if
> the address range passed to one of the copy functions includes an
> address that faults, it kind of makes sense to allow the copy function
> to stop short instead of copying every last byte right up to the address
> that fails.
> 
> If that's the only reason, then it would be great to have that included
> in the description.  And then we can indeed deal with the alignment
> effects in fault_in_writeable().

Ah, I started replying last night, sent it today without seeing your
follow-up.

> > > I attempted the above here and works ok:
> > >
> > > https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/log/?h=devel/btrfs-live-lock-fix
> > >
> > > but too late to post it this evening, I'll do it in the next day or so
> > > as an alternative to this series.
> 
> I've taken a quick look.  Under the assumption that alignment effects
> are tied to page / sub-page faults, I think we can really solve this
> generically as Willy has proposed.

I think Willy's proposal stopped at the page boundary, it should go
beyond.

> Maybe as shown below; no need for arch-specific code.
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 2c51e9748a6a..a9b3d916b625 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1658,6 +1658,8 @@ static long __get_user_pages_locked(struct mm_struct *mm, unsigned long start,
>  }
>  #endif /* !CONFIG_MMU */
>  
> +#define SUBPAGE_FAULT_SIZE 16
> +
>  /**
>   * fault_in_writeable - fault in userspace address range for writing
>   * @uaddr: start of address range
> @@ -1673,8 +1675,19 @@ size_t fault_in_writeable(char __user *uaddr, size_t size)
>  	if (unlikely(size == 0))
>  		return 0;
>  	if (!PAGE_ALIGNED(uaddr)) {
> +		if (SUBPAGE_FAULT_SIZE &&
> +		    !IS_ALIGNED((unsigned long)uaddr, SUBPAGE_FAULT_SIZE)) {
> +			end = PTR_ALIGN(uaddr, SUBPAGE_FAULT_SIZE);
> +			if (end - uaddr < size) {
> +				if (unlikely(__put_user(0, uaddr) != 0))
> +					return size;
> +				uaddr = end;
> +				if (unlikely(!end))
> +					goto out;
> +			}
> +		}
>  		if (unlikely(__put_user(0, uaddr) != 0))
> -			return size;
> +			goto out;
>  		uaddr = (char __user *)PAGE_ALIGN((unsigned long)uaddr);
>  	}
>  	end = (char __user *)PAGE_ALIGN((unsigned long)start + size);

That's similar, somehow, to the arch-specific probing in one of my
patches: [1]. We could do the above if we can guarantee that the maximum
error margin in copy_to_user() is smaller than SUBPAGE_FAULT_SIZE. For
arm64 copy_to_user(), it is fine, but for copy_from_user(), if we ever
need to handle fault_in_readable(), it isn't (on arm64 up to 64 bytes
even if aligned: reads of large blocks are done in 4 * 16 loads, and if
one of them fails, e.g. because of a 16-byte sub-page fault, no write
is done at all, hence the delta can be larger than 16 bytes).

If you want something in the generic fault_in_writeable(), we probably
need a loop over UACCESS_MAX_WRITE_ERROR in SUBPAGE_FAULT_SIZE
increments. But I thought I'd rather keep this in the arch-specific
code.
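
In rough form, such a generic prefix probe could look like this (sketch
only; UACCESS_MAX_WRITE_ERROR and SUBPAGE_FAULT_SIZE would be per-arch
constants, neither exists today):

	/* at the start of fault_in_writeable(uaddr, size) */
	char __user *p = uaddr;
	char __user *probe_end = uaddr + min_t(size_t, size,
					       UACCESS_MAX_WRITE_ERROR);

	while (p < probe_end) {
		if (unlikely(__put_user(0, p) != 0))
			return size;
		/* step to the start of the next sub-page granule */
		p = PTR_ALIGN(p + 1, SUBPAGE_FAULT_SIZE);
	}

The existing page-granular loop would then cover the rest of the range
as it does now.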

Of course, the above fault_in_writeable() still needs the btrfs
search_ioctl() counterpart to change the probing on the actual fault
address or offset.

In the general case (uaccess error margin larger), I'm not entirely
convinced we can skip the check if PAGE_ALIGNED(uaddr). I should
probably get this logic through CBMC (or TLA+), I can't think it
through.

Thanks.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/commit/?h=devel/btrfs-live-lock-fix&id=af7e96d9e9537d9f9cc014f388b7b2bb4a5bc343
Andreas Gruenbacher Nov. 27, 2021, 6:05 p.m. UTC | #19
On Sat, Nov 27, 2021 at 4:21 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
> On Sat, Nov 27, 2021 at 01:39:58PM +0100, Andreas Gruenbacher wrote:
> > On Sat, Nov 27, 2021 at 4:52 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> > > On Sat, Nov 27, 2021 at 12:06 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > > > If we know that the arch copy_to_user() has an error of say maximum 16
> > > > bytes (or 15 rather on arm64), we can instead get fault_in_writeable()
> > > > to probe the first 16 bytes rather than 1.
> > >
> > > That isn't going to help one bit: [raw_]copy_to_user() is allowed to
> > > copy as little or as much as it wants as long as it follows the rules
> > > documented in include/linux/uaccess.h:
> > >
> > > [] If copying succeeds, the return value must be 0.  If some data cannot be
> > > [] fetched, it is permitted to copy less than had been fetched; the only
> > > [] hard requirement is that not storing anything at all (i.e. returning size)
> > > [] should happen only when nothing could be copied.  In other words, you don't
> > > [] have to squeeze as much as possible - it is allowed, but not necessary.
> > >
> > > When fault_in_writeable() tells us that an address range is accessible
> > > in principle, that doesn't mean that copy_to_user() will allow us to
> > > access it in arbitrary chunks. It's also not the case that
> > > fault_in_writeable(addr, size) is always followed by
> > > copy_to_user(addr, ..., size) for the exact same address range, not
> > > even in this case.
> > >
> > > These alignment restrictions have nothing to do with page or sub-page faults.
> > >
> > > I'm also fairly sure that passing in an unaligned buffer will send
> > > search_ioctl into an endless loop on architectures with copy_to_user()
> > > alignment restrictions; there don't seem to be any buffer alignment
> > > checks.
> >
> > Let me retract that ...
> >
> > The description in include/linux/uaccess.h leaves out permissible
> > reasons for fetching/storing less than requested. Thinking about it, if
> > the address range passed to one of the copy functions includes an
> > address that faults, it kind of makes sense to allow the copy function
> > to stop short instead of copying every last byte right up to the address
> > that fails.
> >
> > If that's the only reason, then it would be great to have that included
> > in the description.  And then we can indeed deal with the alignment
> > effects in fault_in_writeable().
>
> Ah, I started replying last night, sent it today without seeing your
> follow-up.
>
> > > > I attempted the above here and works ok:
> > > >
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/log/?h=devel/btrfs-live-lock-fix
> > > >
> > > > but too late to post it this evening, I'll do it in the next day or so
> > > > as an alternative to this series.
> >
> > I've taken a quick look.  Under the assumption that alignment effects
> > are tied to page / sub-page faults, I think we can really solve this
> > generically as Willy has proposed.
>
> I think Willy's proposal stopped at the page boundary, it should go
> beyond.
>
> > Maybe as shown below; no need for arch-specific code.
> >
> > diff --git a/mm/gup.c b/mm/gup.c
> > index 2c51e9748a6a..a9b3d916b625 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -1658,6 +1658,8 @@ static long __get_user_pages_locked(struct mm_struct *mm, unsigned long start,
> >  }
> >  #endif /* !CONFIG_MMU */
> >
> > +#define SUBPAGE_FAULT_SIZE 16
> > +
> >  /**
> >   * fault_in_writeable - fault in userspace address range for writing
> >   * @uaddr: start of address range
> > @@ -1673,8 +1675,19 @@ size_t fault_in_writeable(char __user *uaddr, size_t size)
> >       if (unlikely(size == 0))
> >               return 0;
> >       if (!PAGE_ALIGNED(uaddr)) {
> > +             if (SUBPAGE_FAULT_SIZE &&
> > +                 !IS_ALIGNED((unsigned long)uaddr, SUBPAGE_FAULT_SIZE)) {
> > +                     end = PTR_ALIGN(uaddr, SUBPAGE_FAULT_SIZE);
> > +                     if (end - uaddr < size) {
> > +                             if (unlikely(__put_user(0, uaddr) != 0))
> > +                                     return size;
> > +                             uaddr = end;
> > +                             if (unlikely(!end))
> > +                                     goto out;
> > +                     }
> > +             }
> >               if (unlikely(__put_user(0, uaddr) != 0))
> > -                     return size;
> > +                     goto out;
> >               uaddr = (char __user *)PAGE_ALIGN((unsigned long)uaddr);
> >       }
> >       end = (char __user *)PAGE_ALIGN((unsigned long)start + size);
>
> That's similar, somehow, to the arch-specific probing in one of my
> patches: [1]. We could do the above if we can guarantee that the maximum
> error margin in copy_to_user() is smaller than SUBPAGE_FAULT_SIZE. For
> arm64 copy_to_user(), it is fine, but for copy_from_user(), if we ever
> need to handle fault_in_readable(), it isn't (on arm64 up to 64 bytes
> even if aligned: reads of large blocks are done in 4 * 16 loads, and if
> one of them fails, e.g. because of a 16-byte sub-page fault, no write
> is done at all, hence the delta can be larger than 16 bytes).
>
> If you want something in the generic fault_in_writeable(), we probably
> need a loop over UACCESS_MAX_WRITE_ERROR in SUBPAGE_FAULT_SIZE
> increments. But I thought I'd rather keep this in the arch-specific code.

I see, that's even crazier than I'd thought. The looping / probing is
still pretty generic, so I'd still consider putting it in the generic
code.

We also still have fault_in_safe_writeable which is more difficult to
fix, and fault_in_readable which we don't want to leave behind broken,
either.

> Of course, the above fault_in_writeable() still needs the btrfs
> search_ioctl() counterpart to change the probing on the actual fault
> address or offset.

Yes, but that change is relatively simple and it eliminates the need
for probing the entire buffer, so it's a good thing. Maybe you want to
add this though:

--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2202,3 +2202,3 @@ static noinline int search_ioctl(struct inode *inode,
        unsigned long sk_offset = 0;
-       char __user *fault_in_addr;
+       char __user *fault_in_addr, *end;

@@ -2230,6 +2230,6 @@ static noinline int search_ioctl(struct inode *inode,
        fault_in_addr = ubuf;
+       end = ubuf + *buf_size;
        while (1) {
                ret = -EFAULT;
-               if (fault_in_writeable(fault_in_addr,
-                                      *buf_size - (fault_in_addr - ubuf)))
+               if (fault_in_writeable(fault_in_addr, end - fault_in_addr))
                        break;

> In the general case (uaccess error margin larger), I'm not entirely
> convinced we can skip the check if PAGE_ALIGNED(uaddr).

Yes, the loop can span multiple sub-page error domains, at least in
the read case, so it needs to happen even for page-aligned addresses.

> I should probably get this logic through CBMC (or TLA+), I can't think it
> through.
>
> Thanks.
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/commit/?h=devel/btrfs-live-lock-fix&id=af7e96d9e9537d9f9cc014f388b7b2bb4a5bc343
>
> --
> Catalin
>

Thanks,
Andreas
Catalin Marinas Nov. 29, 2021, 12:16 p.m. UTC | #20
On Sat, Nov 27, 2021 at 07:05:39PM +0100, Andreas Gruenbacher wrote:
> On Sat, Nov 27, 2021 at 4:21 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > That's similar, somehow, to the arch-specific probing in one of my
> > patches: [1]. We could do the above if we can guarantee that the maximum
> > error margin in copy_to_user() is smaller than SUBPAGE_FAULT_SIZE. For
> > arm64 copy_to_user(), it is fine, but for copy_from_user(), if we ever
> > need to handle fault_in_readable(), it isn't (on arm64 up to 64 bytes
> > even if aligned: reads of large blocks are done in 4 * 16 loads, and if
> > one of them fails, e.g. because of a 16-byte sub-page fault, no write
> > is done at all, hence the delta can be larger than 16 bytes).
> >
> > If you want something in the generic fault_in_writeable(), we probably
> > need a loop over UACCESS_MAX_WRITE_ERROR in SUBPAGE_FAULT_SIZE
> > increments. But I thought I'd rather keep this in the arch-specific code.
> 
> I see, that's even crazier than I'd thought. The looping / probing is
> still pretty generic, so I'd still consider putting it in the generic
> code.

In the arm64 probe_subpage_user_writeable(), the loop is skipped if
!system_supports_mte() (static label). It doesn't make much difference
for search_ioctl() in terms of performance but I'd like the arch code to
dynamically decide when to probe. An arch_has_subpage_faults() static
inline function would solve this.
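
As a sketch (neither definition exists today; the arm64 side would key
off the existing system_supports_mte() static key):

/* generic fallback, e.g. in include/linux/uaccess.h */
#ifndef arch_has_subpage_faults
static inline bool arch_has_subpage_faults(void)
{
	return false;
}
#endif

/* arm64 override in asm/uaccess.h */
#define arch_has_subpage_faults arch_has_subpage_faults
static inline bool arch_has_subpage_faults(void)
{
	return system_supports_mte();
}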

However, the above assumes that the only way of probing is by doing a
get_user/put_user(). A non-destructive probe with MTE would be to read
the actual tags in memory and compare them with the top byte of the
pointer.

There's the CHERI architecture as well. Although very early days for
arm64, we do have an incipient port (https://www.morello-project.org/).
The void __user * pointers are propagated inside the kernel as 128-bit
capabilities. A fault_in() would check whether the address (bottom
64-bit) is within the range and permissions specified in the upper
64-bit of the capability. There is no notion of sub-page fault
granularity here and no need to do a put_user() as the check is just
done on the pointer/capability.

Given the above, my preference is to keep the probing arch-specific.

> We also still have fault_in_safe_writeable which is more difficult to
> fix, and fault_in_readable which we don't want to leave behind broken,
> either.

fault_in_safe_writeable() can be done by using get_user() instead of
put_user() for arm64 MTE and probably SPARC ADI (an alternative is to
read the in-memory tags and compare them with the pointer). For CHERI,
that's different again since the capability encodes the read/write
permissions independently.
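
A rough sketch of that get_user()-based probe (the function name and
SUBPAGE_FAULT_SIZE are illustrative; it relies on MTE checking tags on
loads as well as stores, so the read is a non-destructive probe):

	static size_t probe_subpage_safe_writeable(const char __user *uaddr,
						   size_t size)
	{
		const char __user *p = uaddr;
		const char __user *end = uaddr + size;
		volatile char c;

		while (p < end) {
			if (unlikely(__get_user(c, p) != 0))
				return end - p;
			p = PTR_ALIGN(p + 1, SUBPAGE_FAULT_SIZE);
		}
		return 0;
	}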

However, do we actually want to change the fault_in_safe_writeable() and
fault_in_readable() functions at this stage? I could not get any of them
to live-lock, though I only tried btrfs, ext4 and gfs2. As per the
earlier discussion, normal file accesses are guaranteed to make
progress. The only problematic one was O_DIRECT which seems to be
alright for the above filesystems (the fs either bails out after several
attempts or uses GUP to read which skips the uaccess altogether).

Happy to address them if there is a real concern, I just couldn't
trigger it.

> > Of course, the above fault_in_writeable() still needs the btrfs
> > search_ioctl() counterpart to change the probing on the actual fault
> > address or offset.
> 
> Yes, but that change is relatively simple and it eliminates the need
> for probing the entire buffer, so it's a good thing. Maybe you want to
> add this though:
> 
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -2202,3 +2202,3 @@ static noinline int search_ioctl(struct inode *inode,
>         unsigned long sk_offset = 0;
> -       char __user *fault_in_addr;
> +       char __user *fault_in_addr, *end;
> 
> @@ -2230,6 +2230,6 @@ static noinline int search_ioctl(struct inode *inode,
>         fault_in_addr = ubuf;
> +       end = ubuf + *buf_size;
>         while (1) {
>                 ret = -EFAULT;
> -               if (fault_in_writeable(fault_in_addr,
> -                                      *buf_size - (fault_in_addr - ubuf)))
> +               if (fault_in_writeable(fault_in_addr, end - fault_in_addr))
>                         break;

Thanks, I'll add it.
Andreas Gruenbacher Nov. 29, 2021, 1:33 p.m. UTC | #21
On Mon, Nov 29, 2021 at 1:22 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
> On Sat, Nov 27, 2021 at 07:05:39PM +0100, Andreas Gruenbacher wrote:
> > On Sat, Nov 27, 2021 at 4:21 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > > That's similar, somehow, to the arch-specific probing in one of my
> > > patches: [1]. We could do the above if we can guarantee that the maximum
> > > error margin in copy_to_user() is smaller than SUBPAGE_FAULT_SIZE. For
> > > arm64 copy_to_user(), it is fine, but for copy_from_user(), if we ever
> > > need to handle fault_in_readable(), it isn't (on arm64 up to 64 bytes
> > > even if aligned: reads of large blocks are done in 4 * 16 loads, and if
> > > one of them fails, e.g. because of a 16-byte sub-page fault, no write
> > > is done at all, hence the delta can be larger than 16 bytes).
> > >
> > > If you want something in the generic fault_in_writeable(), we probably
> > > need a loop over UACCESS_MAX_WRITE_ERROR in SUBPAGE_FAULT_SIZE
> > > increments. But I thought I'd rather keep this in the arch-specific code.
> >
> > I see, that's even crazier than I'd thought. The looping / probing is
> > still pretty generic, so I'd still consider putting it in the generic
> > code.
>
> In the arm64 probe_subpage_user_writeable(), the loop is skipped if
> !system_supports_mte() (static label). It doesn't make much difference
> for search_ioctl() in terms of performance but I'd like the arch code to
> dynamically decide when to probe. An arch_has_subpage_faults() static
> inline function would solve this.
>
> However, the above assumes that the only way of probing is by doing a
> get_user/put_user(). A non-destructive probe with MTE would be to read
> the actual tags in memory and compare them with the top byte of the
> pointer.
>
> There's the CHERI architecture as well. Although very early days for
> arm64, we do have an incipient port (https://www.morello-project.org/).
> The void __user * pointers are propagated inside the kernel as 128-bit
> capabilities. A fault_in() would check whether the address (bottom
> 64-bit) is within the range and permissions specified in the upper
> 64-bit of the capability. There is no notion of sub-page fault
> granularity here and no need to do a put_user() as the check is just
> done on the pointer/capability.
>
> Given the above, my preference is to keep the probing arch-specific.
>
> > We also still have fault_in_safe_writeable which is more difficult to
> > fix, and fault_in_readable which we don't want to leave behind broken,
> > either.
>
> fault_in_safe_writeable() can be done by using get_user() instead of
> put_user() for arm64 MTE and probably SPARC ADI (an alternative is to
> read the in-memory tags and compare them with the pointer).

So we'd keep the existing fault_in_safe_writeable() logic for the
actual fault-in and use get_user() to check for sub-page faults? If
so, then that should probably also be hidden in arch code.

> For CHERI, that's different again since the capability
> encodes the read/write permissions independently.
>
> However, do we actually want to change the fault_in_safe_writeable() and
> fault_in_readable() functions at this stage? I could not get any of them
> to live-lock, though I only tried btrfs, ext4 and gfs2. As per the
> earlier discussion, normal file accesses are guaranteed to make
> progress. The only problematic one was O_DIRECT which seems to be
> alright for the above filesystems (the fs either bails out after several
> attempts or uses GUP to read which skips the uaccess altogether).

Only gfs2 uses fault_in_safe_writeable(). For buffered reads, progress
is guaranteed because failures are at a byte granularity.

O_DIRECT reads and writes happen in device block size granularity, but
the pages are grabbed with get_user_pages() before the copying
happens. So by the time the copying happens, the pages are guaranteed
to be resident, and we don't need to loop around fault_in_*().

You've mentioned before that copying to/from struct page bypasses
sub-page fault checking. If that is the case, then the checking
probably needs to happen in iomap_dio_bio_iter and dio_refill_pages
instead.

> Happy to address them if there is a real concern, I just couldn't trigger it.

Hopefully it should now be clear why you couldn't. One way of
reproducing with fault_in_safe_writeable() would be to use that in
btrfs instead of fault_in_writeable(), of course.

We're not doing any chunked reads from user space with page faults
disabled as far as I'm aware right now, so we probably don't have a
reproducer for fault_in_readable(). It would still be worth fixing
fault_in_readable() to prevent things from blowing up very
unexpectedly later, though.

Thanks,
Andreas

> > > Of course, the above fault_in_writeable() still needs the btrfs
> > > search_ioctl() counterpart to change the probing on the actual fault
> > > address or offset.
> >
> > Yes, but that change is relatively simple and it eliminates the need
> > for probing the entire buffer, so it's a good thing. Maybe you want to
> > add this though:
> >
> > --- a/fs/btrfs/ioctl.c
> > +++ b/fs/btrfs/ioctl.c
> > @@ -2202,3 +2202,3 @@ static noinline int search_ioctl(struct inode *inode,
> >         unsigned long sk_offset = 0;
> > -       char __user *fault_in_addr;
> > +       char __user *fault_in_addr, *end;
> >
> > @@ -2230,6 +2230,6 @@ static noinline int search_ioctl(struct inode *inode,
> >         fault_in_addr = ubuf;
> > +       end = ubuf + *buf_size;
> >         while (1) {
> >                 ret = -EFAULT;
> > -               if (fault_in_writeable(fault_in_addr,
> > -                                      *buf_size - (fault_in_addr - ubuf)))
> > +               if (fault_in_writeable(fault_in_addr, end - fault_in_addr))
> >                         break;
>
> Thanks, I'll add it.
>
> --
> Catalin
>
Catalin Marinas Nov. 29, 2021, 1:52 p.m. UTC | #22
On Sat, Nov 27, 2021 at 07:05:39PM +0100, Andreas Gruenbacher wrote:
> Maybe you want to add this though:
> 
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -2202,3 +2202,3 @@ static noinline int search_ioctl(struct inode *inode,
>         unsigned long sk_offset = 0;
> -       char __user *fault_in_addr;
> +       char __user *fault_in_addr, *end;
> 
> @@ -2230,6 +2230,6 @@ static noinline int search_ioctl(struct inode *inode,
>         fault_in_addr = ubuf;
> +       end = ubuf + *buf_size;
>         while (1) {
>                 ret = -EFAULT;
> -               if (fault_in_writeable(fault_in_addr,
> -                                      *buf_size - (fault_in_addr - ubuf)))
> +               if (fault_in_writeable(fault_in_addr, end - fault_in_addr))
>                         break;

It looks like *buf_size is updated inside copy_to_sk(), so I'll move the
end update inside the loop.
Catalin Marinas Nov. 29, 2021, 3:36 p.m. UTC | #23
On Mon, Nov 29, 2021 at 02:33:42PM +0100, Andreas Gruenbacher wrote:
> On Mon, Nov 29, 2021 at 1:22 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > On Sat, Nov 27, 2021 at 07:05:39PM +0100, Andreas Gruenbacher wrote:
> > > We also still have fault_in_safe_writeable which is more difficult to
> > > fix, and fault_in_readable which we don't want to leave behind broken,
> > > either.
> >
> > fault_in_safe_writeable() can be done by using get_user() instead of
> > put_user() for arm64 MTE and probably SPARC ADI (an alternative is to
> > read the in-memory tags and compare them with the pointer).
> 
> So we'd keep the existing fault_in_safe_writeable() logic for the
> actual fault-in and use get_user() to check for sub-page faults? If
> so, then that should probably also be hidden in arch code.

That's what this series does when it probes the whole range in
fault_in_writeable(). The main reason was that it's more efficient to do
a read than a write on a large range (the latter dirtying the cache
lines).

> > For CHERI, that's different again since the fault_in_safe_writeable capability
> > encodes the read/write permissions independently.
> >
> > However, do we actually want to change the fault_in_safe_writeable() and
> > fault_in_readable() functions at this stage? I could not get any of them
> > to live-lock, though I only tried btrfs, ext4 and gfs2. As per the
> > earlier discussion, normal file accesses are guaranteed to make
> > progress. The only problematic one was O_DIRECT which seems to be
> > alright for the above filesystems (the fs either bails out after several
> > attempts or uses GUP to read which skips the uaccess altogether).
> 
> Only gfs2 uses fault_in_safe_writeable(). For buffered reads, progress
> is guaranteed because failures are at a byte granularity.
> 
> O_DIRECT reads and writes happen in device block size granularity, but
> the pages are grabbed with get_user_pages() before the copying
> happens. So by the time the copying happens, the pages are guaranteed
> to be resident, and we don't need to loop around fault_in_*().

For file reads, I couldn't trigger any mismatched tag faults with gfs2
and O_DIRECT, so it matches your description above. For file writes it
does trigger such faults, so I suspect it doesn't always use
get_user_pages() for writes. No live-lock though with the vanilla
kernel. My test uses a page with some mismatched tags in the middle.

ext4: no tag faults with O_DIRECT read/write irrespective of whether the
user buffer is page aligned or not.

btrfs: O_DIRECT file writes - no faults on page-aligned buffers, faults
on unaligned; file reads - tag faults on both aligned/unaligned buffers.
No live-lock.

So, some tag faults still happen even with O_DIRECT|O_SYNC, but the
filesystems take care of the repeated faulting.

> You've mentioned before that copying to/from struct page bypasses
> sub-page fault checking. If that is the case, then the checking
> probably needs to happen in iomap_dio_bio_iter and dio_refill_pages
> instead.

It's too expensive and not really worth it. With a buffered access, the
uaccess takes care of checking at the time of load/store (the hardware
does this for us). With a GUP, the access is done via the kernel mapping
with a match-all tag to avoid faults (kernel panic). We set the ABI
expectation some time ago that kernel accesses to user memory may not
always be tag-checked if the access is done via a GUP'ed page.

> > Happy to address them if there is a real concern, I just couldn't trigger it.
> 
> Hopefully it should now be clear why you couldn't. One way of
> reproducing with fault_in_safe_writeable() would be to use that in
> btrfs instead of fault_in_writeable(), of course.

Yes, that would trigger it again. I guess if we want to make this API
safer in general, we can add the checks to the other functions. Only
probing a few bytes at the start shouldn't cause a performance issue.

Thanks.
Linus Torvalds Nov. 29, 2021, 6:40 p.m. UTC | #24
On Mon, Nov 29, 2021 at 7:36 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> That's what this series does when it probes the whole range in
> fault_in_writeable(). The main reason was that it's more efficient to do
> a read than a write on a large range (the latter dirtying the cache
> lines).

The more this thread goes on, the more I'm starting to think that we
should just make "fault_in_writable()" (and readable, of course) only
really work on the beginning of the area.

Not just for the finer-granularity pointer color probing, but for the
page probing too.

I'm looking at our current fault_in_writeable(), and I'm going

 (a) it uses __put_user() without range checks, which is really not great

 (b) it looks like a disaster from another standpoint: essentially
user-controlled loop size with no limit checking, no preemption, and
no check for fatal signals.

Now, (a) should be fixed with a access_ok() or similar.

And (b) can easily be fixed multiple ways, with one option simply just
being adding a can_resched() call and checking for fatal signals.

But faulting in the whole region is actually fundamentally wrong in
low-memory situations - the beginning of the region might be swapped
out by the time we get to the end. That's unlikely to be a problem in
real life, but it's an example of how it's simply not conceptually
sensible.

So I do wonder why we don't just say "fault_in_writable will fault in
_at_most_ X bytes", and simply limit the actual fault-in size to
something reasonable.

That solves _all_ the problems. It solves the lack of preemption and
fatal signals (by virtue of just limiting the amount of work we do).
It solves the low memory situation. And it solves the "excessive dirty
cachelines" case too.

Of course, we want to have some minimum bytes we fault in too, but
that minimum range might well be "we guarantee at least a full page
worth of data" (and in practice make it a couple of pages).

It's not like fault_in_writeable() avoids page faults or anything like
that - it just moves them around. So there's really very little reason
to fault in a large range, and there are multiple reasons _not_ to do
it.

Hmm?

               Linus
Andreas Gruenbacher Nov. 29, 2021, 7:31 p.m. UTC | #25
On Mon, Nov 29, 2021 at 7:41 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Mon, Nov 29, 2021 at 7:36 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
> >
> > That's what this series does when it probes the whole range in
> > fault_in_writeable(). The main reason was that it's more efficient to do
> > a read than a write on a large range (the latter dirtying the cache
> > lines).
>
> The more this thread goes on, the more I'm starting to think that we
> should just make "fault_in_writable()" (and readable, of course) only
> really work on the beginning of the area.
>
> Not just for the finer-granularity pointer color probing, but for the
> page probing too.
>
> I'm looking at our current fault_in_writeable(), and I'm going
>
>  (a) it uses __put_user() without range checks, which is really not great
>
>  (b) it looks like a disaster from another standpoint: essentially
> user-controlled loop size with no limit checking, no preemption, and
> no check for fatal signals.
>
> Now, (a) should be fixed with a access_ok() or similar.
>
> And (b) can easily be fixed multiple ways, with one option simply just
> being adding a can_resched() call and checking for fatal signals.
>
> But faulting in the whole region is actually fundamentally wrong in
> low-memory situations - the beginning of the region might be swapped
> out by the time we get to the end. That's unlikely to be a problem in
> real life, but it's an example of how it's simply not conceptually
> sensible.
>
> So I do wonder why we don't just say "fault_in_writable will fault in
> _at_most_ X bytes", and simply limit the actual fault-in size to
> something reasonable.
>
> That solves _all_ the problems. It solves the lack of preemption and
> fatal signals (by virtue of just limiting the amount of work we do).
> It solves the low memory situation. And it solves the "excessive dirty
> cachelines" case too.
>
> Of course, we want to have some minimum bytes we fault in too, but
> that minimum range might well be "we guarantee at least a full page
> worth of data" (and in practice make it a couple of pages).
>
> It's not like fault_in_writeable() avoids page faults or anything like
> that - it just moves them around. So there's really very little reason
> to fault in a large range, and there are multiple reasons _not_ to do
> it.
>
> Hmm?

This would mean that we could get rid of gfs2's
should_fault_in_pages() logic, which is based on what's in
btrfs_buffered_write().

Andreas

>
>                Linus
>
Catalin Marinas Nov. 29, 2021, 8:56 p.m. UTC | #26
On Mon, Nov 29, 2021 at 10:40:38AM -0800, Linus Torvalds wrote:
> On Mon, Nov 29, 2021 at 7:36 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > That's what this series does when it probes the whole range in
> > fault_in_writeable(). The main reason was that it's more efficient to do
> > a read than a write on a large range (the latter dirtying the cache
> > lines).
> 
> The more this thread goes on, the more I'm starting to think that we
> should just make "fault_in_writable()" (and readable, of course) only
> really work on the beginning of the area.
> 
> Not just for the finer-granularity pointer color probing, but for the
> page probing too.

I have patches for the finer-granularity checking of the beginning of
the buffer. They need a bit of testing, so probably posting them
tomorrow.

> I'm looking at our current fault_in_writeable(), and I'm going
> 
>  (a) it uses __put_user() without range checks, which is really not great

For arm64 at least __put_user() does the access_ok() check. I thought
only unsafe_put_user() should skip the checks. If __put_user() can write
arbitrary memory, we may have a bigger problem.

>  (b) it looks like a disaster from another standpoint: essentially
> user-controlled loop size with no limit checking, no preemption, and
> no check for fatal signals.

Indeed, the fault_in_*() loop can get pretty long, bounded only by how
much memory the user process can fault in. My patches for now only
address the outer loop doing the copy_to_user() as that can be
unbounded.

> Now, (a) should be fixed with a access_ok() or similar.
> 
> And (b) can easily be fixed multiple ways, with one option simply just
> being adding a can_resched() call and checking for fatal signals.
> 
> But faulting in the whole region is actually fundamentally wrong in
> low-memory situations - the beginning of the region might be swapped
> out by the time we get to the end. That's unlikely to be a problem in
> real life, but it's an example of how it's simply not conceptually
> sensible.
> 
> So I do wonder why we don't just say "fault_in_writable will fault in
> _at_most_ X bytes", and simply limit the actual fault-in size to
> something reasonable.
> 
> That solves _all_ the problems. It solves the lack of preemption and
> fatal signals (by virtue of just limiting the amount of work we do).
> It solves the low memory situation. And it solves the "excessive dirty
> cachelines" case too.

I think that would be useful, though it doesn't solve the potential
livelock with sub-page faults. We still need the outer loop to
handle the copy_to_user() for the whole user buffer and the sub-page
probing of the beginning of that buffer (or of wherever copy_to_user()
stopped). IOW, you still fault in the whole buffer eventually.

Anyway, I think the sub-page probing and limiting the fault-in are
complementary improvements.
Linus Torvalds Nov. 29, 2021, 9:53 p.m. UTC | #27
On Mon, Nov 29, 2021 at 12:56 PM Catalin Marinas
<catalin.marinas@arm.com> wrote:
>
> For arm64 at least __put_user() does the access_ok() check. I thought
> only unsafe_put_user() should skip the checks. If __put_user() can write
> arbitrary memory, we may have a bigger problem.

That's literally been the historical difference between __put_user() and
put_user() - the access check.

> I think that would be useful, though it doesn't solve the potential
> livelock with sub-page faults.

I was assuming we'd just do the sub-page faults.

In fact, I was assuming we'd basically just replace all the PAGE_ALIGN
and PAGE_SIZE with SUBPAGE_{ALIGN,SIZE}, together with something like

        if (size > PAGE_SIZE)
                size = PAGE_SIZE;

to limit that size thing (or possibly make that "min size" be a
parameter, so that people who have things like that "I need at least
this initial structure to be copied" issue can document their minimum
size needs).

            Linus
Catalin Marinas Nov. 29, 2021, 11:12 p.m. UTC | #28
On Mon, Nov 29, 2021 at 01:53:01PM -0800, Linus Torvalds wrote:
> On Mon, Nov 29, 2021 at 12:56 PM Catalin Marinas
> <catalin.marinas@arm.com> wrote:
> > I think that would be useful, though it doesn't solve the potential
> > livelock with sub-page faults.
> 
> I was assuming we'd just do the sub-page faults.
> 
> In fact, I was assuming we'd basically just replace all the PAGE_ALIGN
> and PAGE_SIZE with SUBPAGE_{ALIGN,SIZE}, together with something like
> 
>         if (size > PAGE_SIZE)
>                 size = PAGE_SIZE;
> 
> to limit that size thing (or possibly make that "min size" be a
> parameter, so that people who have things like that "I need at least
> this initial structure to be copied" issue can document their minimum
> size needs).

Ah, so fault_in_writeable() would never fault in the whole range (if too
large). When copy_to_user() goes beyond the faulted-in range, it may
fail and we go back to fault in a bit more of the range. A copy loop
would be equivalent to:

	fault_addr = ubuf;
	end = ubuf + size;
	while (1) {
		if (fault_in_writeable(fault_addr,
				       min(PAGE_SIZE, end - fault_addr)))
			break;
		left = copy_to_user(ubuf, kbuf, size);
		if (!left)
			break;
		fault_addr = end - left;
	}

That should work. I'll think about it tomorrow, getting late over here.

(I may still keep the sub-page probing in the arch code, see my earlier
exchanges with Andreas)
diff mbox series

Patch

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 92138ac2a4e2..23167c72fa47 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2223,7 +2223,8 @@  static noinline int search_ioctl(struct inode *inode,
 
 	while (1) {
 		ret = -EFAULT;
-		if (fault_in_writeable(ubuf + sk_offset, *buf_size - sk_offset))
+		if (fault_in_exact_writeable(ubuf + sk_offset,
+					     *buf_size - sk_offset))
 			break;
 
 		ret = btrfs_search_forward(root, &key, path, sk->min_transid);