Message ID | 20190904201933.10736-2-cyphar@cyphar.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | namei: openat2(2) path resolution restrictions | expand |
On Wed, Sep 4, 2019 at 1:20 PM Aleksa Sarai <cyphar@cyphar.com> wrote: > > A common pattern for syscall extensions is increasing the size of a > struct passed from userspace, such that the zero-value of the new fields > result in the old kernel behaviour (allowing for a mix of userspace and > kernel vintages to operate on one another in most cases). Ack, this makes the whole series (and a few unrelated system calls) cleaner. Linus
Hi, just kernel-doc fixes: On 9/4/19 1:19 PM, Aleksa Sarai wrote: > > diff --git a/lib/struct_user.c b/lib/struct_user.c > new file mode 100644 > index 000000000000..7301ab1bbe98 > --- /dev/null > +++ b/lib/struct_user.c > @@ -0,0 +1,182 @@ > +// SPDX-License-Identifier: GPL-2.0-or-later > +/* > + * Copyright (C) 2019 SUSE LLC > + * Copyright (C) 2019 Aleksa Sarai <cyphar@cyphar.com> > + */ > + > +#include <linux/types.h> > +#include <linux/export.h> > +#include <linux/uaccess.h> > +#include <linux/kernel.h> > +#include <linux/string.h> > + > +#define BUFFER_SIZE 64 > + > + > +/** > + * copy_struct_to_user: copy a struct to user space use correct format: * copy_struct_to_user - copy a struct to user space > + * @dst: Destination address, in user space. > + * @usize: Size of @dst struct. > + * @src: Source address, in kernel space. > + * @ksize: Size of @src struct. > + * > + * Copies a struct from kernel space to user space, in a way that guarantees > + * backwards-compatibility for struct syscall arguments (as long as future > + * struct extensions are made such that all new fields are *appended* to the > + * old struct, and zeroed-out new fields have the same meaning as the old > + * struct). > + * > + * @ksize is just sizeof(*dst), and @usize should've been passed by user space. > + * The recommended usage is something like the following: > + * > + * SYSCALL_DEFINE2(foobar, struct foo __user *, uarg, size_t, usize) > + * { > + * int err; > + * struct foo karg = {}; > + * > + * // do something with karg > + * > + * err = copy_struct_to_user(uarg, usize, &karg, sizeof(karg)); > + * if (err) > + * return err; > + * > + * // ... > + * } > + * > + * There are three cases to consider: > + * * If @usize == @ksize, then it's copied verbatim. > + * * If @usize < @ksize, then kernel space is "returning" a newer struct to an > + * older user space. In order to avoid user space getting incomplete > + * information (new fields might be important), all trailing bytes in @src > + * (@ksize - @usize) must be zerored, otherwise -EFBIG is returned. > + * * If @usize > @ksize, then the kernel is "returning" an older struct to a > + * newer user space. The trailing bytes in @dst (@usize - @ksize) will be > + * zero-filled. > + * > + * Returns (in all cases, some data may have been copied): > + * * -EFBIG: (@usize < @ksize) and there are non-zero trailing bytes in @src. > + * * -EFAULT: access to user space failed. > + */ > +int copy_struct_to_user(void __user *dst, size_t usize, > + const void *src, size_t ksize) > +{ > + size_t size = min(ksize, usize); > + size_t rest = abs(ksize - usize); > + > + if (unlikely(usize > PAGE_SIZE)) > + return -EFAULT; > + if (unlikely(!access_ok(dst, usize))) > + return -EFAULT; > + > + /* Deal with trailing bytes. */ > + if (usize < ksize) { > + if (memchr_inv(src + size, 0, rest)) > + return -EFBIG; > + } else if (usize > ksize) { > + if (__memzero_user(dst + size, rest)) > + return -EFAULT; > + } > + /* Copy the interoperable parts of the struct. */ > + if (__copy_to_user(dst, src, size)) > + return -EFAULT; > + return 0; > +} > +EXPORT_SYMBOL(copy_struct_to_user); > + > +/** same here: > + * copy_struct_from_user: copy a struct from user space * copy_struct_from_user - copy a struct from user space > + * @dst: Destination address, in kernel space. This buffer must be @ksize > + * bytes long. > + * @ksize: Size of @dst struct. > + * @src: Source address, in user space. > + * @usize: (Alleged) size of @src struct. > + * > + * Copies a struct from user space to kernel space, in a way that guarantees > + * backwards-compatibility for struct syscall arguments (as long as future > + * struct extensions are made such that all new fields are *appended* to the > + * old struct, and zeroed-out new fields have the same meaning as the old > + * struct). > + * > + * @ksize is just sizeof(*dst), and @usize should've been passed by user space. > + * The recommended usage is something like the following: > + * > + * SYSCALL_DEFINE2(foobar, const struct foo __user *, uarg, size_t, usize) > + * { > + * int err; > + * struct foo karg = {}; > + * > + * err = copy_struct_from_user(&karg, sizeof(karg), uarg, size); > + * if (err) > + * return err; > + * > + * // ... > + * } > + * > + * There are three cases to consider: > + * * If @usize == @ksize, then it's copied verbatim. > + * * If @usize < @ksize, then the user space has passed an old struct to a > + * newer kernel. The rest of the trailing bytes in @dst (@ksize - @usize) > + * are to be zero-filled. > + * * If @usize > @ksize, then the user space has passed a new struct to an > + * older kernel. The trailing bytes unknown to the kernel (@usize - @ksize) > + * are checked to ensure they are zeroed, otherwise -E2BIG is returned. > + * > + * Returns (in all cases, some data may have been copied): > + * * -E2BIG: (@usize > @ksize) and there are non-zero trailing bytes in @src. > + * * -E2BIG: @usize is "too big" (at time of writing, >PAGE_SIZE). > + * * -EFAULT: access to user space failed. > + */ > +int copy_struct_from_user(void *dst, size_t ksize, > + const void __user *src, size_t usize) > +{ > + size_t size = min(ksize, usize); > + size_t rest = abs(ksize - usize); > + > + if (unlikely(usize > PAGE_SIZE)) > + return -EFAULT; > + if (unlikely(!access_ok(src, usize))) > + return -EFAULT; > + > + /* Deal with trailing bytes. */ > + if (usize < ksize) > + memset(dst + size, 0, rest); > + else if (usize > ksize) { > + const void __user *addr = src + size; > + char buffer[BUFFER_SIZE] = {}; > + > + while (rest > 0) { > + size_t bufsize = min(rest, sizeof(buffer)); > + > + if (__copy_from_user(buffer, addr, bufsize)) > + return -EFAULT; > + if (memchr_inv(buffer, 0, bufsize)) > + return -E2BIG; > + > + addr += bufsize; > + rest -= bufsize; > + } > + } > + /* Copy the interoperable parts of the struct. */ > + if (__copy_from_user(dst, src, size)) > + return -EFAULT; > + return 0; > +} > +EXPORT_SYMBOL(copy_struct_from_user); > thanks.
On Thu, Sep 05, 2019 at 06:19:22AM +1000, Aleksa Sarai wrote: > +/** > + * copy_struct_to_user: copy a struct to user space > + * @dst: Destination address, in user space. > + * @usize: Size of @dst struct. > + * @src: Source address, in kernel space. > + * @ksize: Size of @src struct. > + * > + * Copies a struct from kernel space to user space, in a way that guarantees > + * backwards-compatibility for struct syscall arguments (as long as future > + * struct extensions are made such that all new fields are *appended* to the > + * old struct, and zeroed-out new fields have the same meaning as the old > + * struct). > + * > + * @ksize is just sizeof(*dst), and @usize should've been passed by user space. > + * The recommended usage is something like the following: > + * > + * SYSCALL_DEFINE2(foobar, struct foo __user *, uarg, size_t, usize) > + * { > + * int err; > + * struct foo karg = {}; > + * > + * // do something with karg > + * > + * err = copy_struct_to_user(uarg, usize, &karg, sizeof(karg)); > + * if (err) > + * return err; > + * > + * // ... > + * } > + * > + * There are three cases to consider: > + * * If @usize == @ksize, then it's copied verbatim. > + * * If @usize < @ksize, then kernel space is "returning" a newer struct to an > + * older user space. In order to avoid user space getting incomplete > + * information (new fields might be important), all trailing bytes in @src > + * (@ksize - @usize) must be zerored s/zerored/zero/, right? > , otherwise -EFBIG is returned. 'Funny' that, copy_struct_from_user() below seems to use E2BIG. > + * * If @usize > @ksize, then the kernel is "returning" an older struct to a > + * newer user space. The trailing bytes in @dst (@usize - @ksize) will be > + * zero-filled. > + * > + * Returns (in all cases, some data may have been copied): > + * * -EFBIG: (@usize < @ksize) and there are non-zero trailing bytes in @src. > + * * -EFAULT: access to user space failed. > + */ > +int copy_struct_to_user(void __user *dst, size_t usize, > + const void *src, size_t ksize) > +{ > + size_t size = min(ksize, usize); > + size_t rest = abs(ksize - usize); > + > + if (unlikely(usize > PAGE_SIZE)) > + return -EFAULT; Not documented above. Implementation consistent with *from*, but see below. > + if (unlikely(!access_ok(dst, usize))) > + return -EFAULT; > + > + /* Deal with trailing bytes. */ > + if (usize < ksize) { > + if (memchr_inv(src + size, 0, rest)) > + return -EFBIG; > + } else if (usize > ksize) { > + if (__memzero_user(dst + size, rest)) > + return -EFAULT; > + } > + /* Copy the interoperable parts of the struct. */ > + if (__copy_to_user(dst, src, size)) > + return -EFAULT; > + return 0; > +} > +EXPORT_SYMBOL(copy_struct_to_user); > + > +/** > + * copy_struct_from_user: copy a struct from user space > + * @dst: Destination address, in kernel space. This buffer must be @ksize > + * bytes long. > + * @ksize: Size of @dst struct. > + * @src: Source address, in user space. > + * @usize: (Alleged) size of @src struct. > + * > + * Copies a struct from user space to kernel space, in a way that guarantees > + * backwards-compatibility for struct syscall arguments (as long as future > + * struct extensions are made such that all new fields are *appended* to the > + * old struct, and zeroed-out new fields have the same meaning as the old > + * struct). > + * > + * @ksize is just sizeof(*dst), and @usize should've been passed by user space. > + * The recommended usage is something like the following: > + * > + * SYSCALL_DEFINE2(foobar, const struct foo __user *, uarg, size_t, usize) > + * { > + * int err; > + * struct foo karg = {}; > + * > + * err = copy_struct_from_user(&karg, sizeof(karg), uarg, size); > + * if (err) > + * return err; > + * > + * // ... > + * } > + * > + * There are three cases to consider: > + * * If @usize == @ksize, then it's copied verbatim. > + * * If @usize < @ksize, then the user space has passed an old struct to a > + * newer kernel. The rest of the trailing bytes in @dst (@ksize - @usize) > + * are to be zero-filled. > + * * If @usize > @ksize, then the user space has passed a new struct to an > + * older kernel. The trailing bytes unknown to the kernel (@usize - @ksize) > + * are checked to ensure they are zeroed, otherwise -E2BIG is returned. > + * > + * Returns (in all cases, some data may have been copied): > + * * -E2BIG: (@usize > @ksize) and there are non-zero trailing bytes in @src. > + * * -E2BIG: @usize is "too big" (at time of writing, >PAGE_SIZE). > + * * -EFAULT: access to user space failed. > + */ > +int copy_struct_from_user(void *dst, size_t ksize, > + const void __user *src, size_t usize) > +{ > + size_t size = min(ksize, usize); > + size_t rest = abs(ksize - usize); > + > + if (unlikely(usize > PAGE_SIZE)) > + return -EFAULT; Documented above as returning -E2BIG. > + if (unlikely(!access_ok(src, usize))) > + return -EFAULT; > + > + /* Deal with trailing bytes. */ > + if (usize < ksize) > + memset(dst + size, 0, rest); > + else if (usize > ksize) { > + const void __user *addr = src + size; > + char buffer[BUFFER_SIZE] = {}; Isn't that too big for on-stack? > + > + while (rest > 0) { > + size_t bufsize = min(rest, sizeof(buffer)); > + > + if (__copy_from_user(buffer, addr, bufsize)) > + return -EFAULT; > + if (memchr_inv(buffer, 0, bufsize)) > + return -E2BIG; > + > + addr += bufsize; > + rest -= bufsize; > + } The perf implementation uses get_user(); but if that is too slow, surely we can do something with uaccess_try() here? > + } > + /* Copy the interoperable parts of the struct. */ > + if (__copy_from_user(dst, src, size)) > + return -EFAULT; > + return 0; > +} > +EXPORT_SYMBOL(copy_struct_from_user); And personally I'm not a big fan of EXPORT_SYMBOL().
On 04/09/2019 22.19, Aleksa Sarai wrote: > A common pattern for syscall extensions is increasing the size of a > struct passed from userspace, such that the zero-value of the new fields > result in the old kernel behaviour (allowing for a mix of userspace and > kernel vintages to operate on one another in most cases). This is done > in both directions -- hence two helpers -- though it's more common to > have to copy user space structs into kernel space. > > Previously there was no common lib/ function that implemented > the necessary extension-checking semantics (and different syscalls > implemented them slightly differently or incompletely[1]). A future > patch replaces all of the common uses of this pattern to use the new > copy_struct_{to,from}_user() helpers. > > [1]: For instance {sched_setattr,perf_event_open,clone3}(2) all do do > similar checks to copy_struct_from_user() while rt_sigprocmask(2) > always rejects differently-sized struct arguments. > > Suggested-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> > Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> > --- > diff --git a/lib/struct_user.c b/lib/struct_user.c > new file mode 100644 > index 000000000000..7301ab1bbe98 > --- /dev/null > +++ b/lib/struct_user.c > @@ -0,0 +1,182 @@ > +// SPDX-License-Identifier: GPL-2.0-or-later > +/* > + * Copyright (C) 2019 SUSE LLC > + * Copyright (C) 2019 Aleksa Sarai <cyphar@cyphar.com> > + */ > + > +#include <linux/types.h> > +#include <linux/export.h> > +#include <linux/uaccess.h> > +#include <linux/kernel.h> > +#include <linux/string.h> > + > +#define BUFFER_SIZE 64 > + > +/* > + * "memset(p, 0, size)" but for user space buffers. Caller must have already > + * checked access_ok(p, size). > + */ Isn't this __clear_user() exactly (perhaps except for the return value)? Perhaps not every arch has that? > +static int __memzero_user(void __user *p, size_t s) > +{ > + const char zeros[BUFFER_SIZE] = {}; > + while (s > 0) { > + size_t n = min(s, sizeof(zeros)); > + > + if (__copy_to_user(p, zeros, n)) > + return -EFAULT; > + > + p += n; > + s -= n; > + } > + return 0; > +} > + > +/** > + * copy_struct_to_user: copy a struct to user space > + * @dst: Destination address, in user space. > + * @usize: Size of @dst struct. > + * @src: Source address, in kernel space. > + * @ksize: Size of @src struct. > + * > + * Returns (in all cases, some data may have been copied): > + * * -EFBIG: (@usize < @ksize) and there are non-zero trailing bytes in @src. > + * * -EFAULT: access to user space failed. > + */ > +int copy_struct_to_user(void __user *dst, size_t usize, > + const void *src, size_t ksize) > +{ > + size_t size = min(ksize, usize); > + size_t rest = abs(ksize - usize); Eh, I'd avoid abs() here due to the funkiness of the implicit type conversions - ksize-usize has type size_t, then that's coerced to an int (or a long maybe?), the abs is applied which return an int/long (or unsigned versions?). Something like "rest = max(ksize, usize) - size;" is more obviously correct and doesn't fall into any narrowing/widening/sign extending traps. > + if (unlikely(usize > PAGE_SIZE)) > + return -EFAULT; Please don't. That is a restriction on all future extensions - once a kernel is shipped with a syscall using this helper with that arbitrary restriction in place, that syscall is forever prevented from extending its arg struct beyond PAGE_SIZE (which is arch-dependent anyway). Sure, it's hard to imagine, but who'd have thought 32 O_* or CLONE_* bits weren't enough for everybody? This is only for future compatibility, and if someone runs an app compiled against 7.3 headers on a 5.4 kernel, they probably don't care about performance, but they would like their app to run. [If we ever create such a large ABI struct that doesn't fit on stack, we'd have to extend our API a little to create a dup_struct_from_user() that does the kmalloc() for us and then calls copy_struct_from_user() - but we might want that long before we hit PAGE_SIZE structs]. > + if (unlikely(!access_ok(dst, usize))) > + return -EFAULT; > + > + /* Deal with trailing bytes. */ > + if (usize < ksize) { > + if (memchr_inv(src + size, 0, rest)) > + return -EFBIG; > + } else if (usize > ksize) { > + if (__memzero_user(dst + size, rest)) > + return -EFAULT; I think that could simply be __clear_user(). > + } > + /* Copy the interoperable parts of the struct. */ > + if (__copy_to_user(dst, src, size)) > + return -EFAULT; I think I understand why you put this last instead of handling the buffer in the "natural" order. However, I'm wondering whether we should actually do this copy before checking that the extra kernel bytes are 0 - the user will still be told that there was some extra information via the -EFBIG/-E2BIG return, but maybe in some cases the part he understands is good enough. But I also guess we have to look to existing users to see whether that would prevent them from being converted to using this helper. linux-api folks, WDYT? > + return 0; Maybe more useful to "return size;", some users might want to know/pass on how much was actually copied. > +} > +EXPORT_SYMBOL(copy_struct_to_user); Can't we wait with this until a modular user shows up? The primary users are syscalls, which can't be modular AFAIK. > +/** > + * copy_struct_from_user: copy a struct from user space > + * @dst: Destination address, in kernel space. This buffer must be @ksize > + * bytes long. > + * @ksize: Size of @dst struct. > + * @src: Source address, in user space. > + * @usize: (Alleged) size of @src struct. > + * > + * Copies a struct from user space to kernel space, in a way that guarantees > + * backwards-compatibility for struct syscall arguments (as long as future > + * struct extensions are made such that all new fields are *appended* to the > + * old struct, and zeroed-out new fields have the same meaning as the old > + * struct). > + * > + * @ksize is just sizeof(*dst), and @usize should've been passed by user space. > + * The recommended usage is something like the following: > + * > + * SYSCALL_DEFINE2(foobar, const struct foo __user *, uarg, size_t, usize) > + * { > + * int err; > + * struct foo karg = {}; > + * > + * err = copy_struct_from_user(&karg, sizeof(karg), uarg, size); > + * if (err) > + * return err; > + * > + * // ... > + * } > + * > + * There are three cases to consider: > + * * If @usize == @ksize, then it's copied verbatim. > + * * If @usize < @ksize, then the user space has passed an old struct to a > + * newer kernel. The rest of the trailing bytes in @dst (@ksize - @usize) > + * are to be zero-filled. > + * * If @usize > @ksize, then the user space has passed a new struct to an > + * older kernel. The trailing bytes unknown to the kernel (@usize - @ksize) > + * are checked to ensure they are zeroed, otherwise -E2BIG is returned. > + * > + * Returns (in all cases, some data may have been copied): > + * * -E2BIG: (@usize > @ksize) and there are non-zero trailing bytes in @src. > + * * -E2BIG: @usize is "too big" (at time of writing, >PAGE_SIZE). > + * * -EFAULT: access to user space failed. > + */ > +int copy_struct_from_user(void *dst, size_t ksize, > + const void __user *src, size_t usize) > +{ > + size_t size = min(ksize, usize); > + size_t rest = abs(ksize - usize); As above. > + if (unlikely(usize > PAGE_SIZE)) > + return -EFAULT; As above. > + if (unlikely(!access_ok(src, usize))) > + return -EFAULT; > + > + /* Deal with trailing bytes. */ > + if (usize < ksize) > + memset(dst + size, 0, rest); > + else if (usize > ksize) { > + const void __user *addr = src + size; > + char buffer[BUFFER_SIZE] = {}; > + > + while (rest > 0) { > + size_t bufsize = min(rest, sizeof(buffer)); > + > + if (__copy_from_user(buffer, addr, bufsize)) > + return -EFAULT; > + if (memchr_inv(buffer, 0, bufsize)) > + return -E2BIG; > + > + addr += bufsize; > + rest -= bufsize; > + } I'd create a __user_is_zero() helper for this - that way the two branches in the two helpers become nicely symmetric, each just calling a single helper that deals appropriately with the tail. And we can discuss how to implement __user_is_zero() in another bikeshed. > + } > + /* Copy the interoperable parts of the struct. */ > + if (__copy_from_user(dst, src, size)) > + return -EFAULT; If you do move up the __copy_to_user(), please move this as well - on the kernel side, we certainly don't care that we copied some bytes to a local buffer which we then ignore because the user had a non-zero tail. But if __copy_to_user() is kept last in copy_struct_to_user(), this should stay for symmetry. > + return 0; As above. > +} > +EXPORT_SYMBOL(copy_struct_from_user); As above. Rasmus
On Sep 05 2019, Aleksa Sarai <cyphar@cyphar.com> wrote: > diff --git a/lib/struct_user.c b/lib/struct_user.c > new file mode 100644 > index 000000000000..7301ab1bbe98 > --- /dev/null > +++ b/lib/struct_user.c > @@ -0,0 +1,182 @@ > +// SPDX-License-Identifier: GPL-2.0-or-later > +/* > + * Copyright (C) 2019 SUSE LLC > + * Copyright (C) 2019 Aleksa Sarai <cyphar@cyphar.com> > + */ > + > +#include <linux/types.h> > +#include <linux/export.h> > +#include <linux/uaccess.h> > +#include <linux/kernel.h> > +#include <linux/string.h> > + > +#define BUFFER_SIZE 64 > + > +/* > + * "memset(p, 0, size)" but for user space buffers. Caller must have already > + * checked access_ok(p, size). > + */ > +static int __memzero_user(void __user *p, size_t s) > +{ > + const char zeros[BUFFER_SIZE] = {}; Perhaps make that static? Andreas.
On 2019-09-05, Peter Zijlstra <peterz@infradead.org> wrote: > On Thu, Sep 05, 2019 at 06:19:22AM +1000, Aleksa Sarai wrote: > > +/** > > + * copy_struct_to_user: copy a struct to user space > > + * @dst: Destination address, in user space. > > + * @usize: Size of @dst struct. > > + * @src: Source address, in kernel space. > > + * @ksize: Size of @src struct. > > + * > > + * Copies a struct from kernel space to user space, in a way that guarantees > > + * backwards-compatibility for struct syscall arguments (as long as future > > + * struct extensions are made such that all new fields are *appended* to the > > + * old struct, and zeroed-out new fields have the same meaning as the old > > + * struct). > > + * > > + * @ksize is just sizeof(*dst), and @usize should've been passed by user space. > > + * The recommended usage is something like the following: > > + * > > + * SYSCALL_DEFINE2(foobar, struct foo __user *, uarg, size_t, usize) > > + * { > > + * int err; > > + * struct foo karg = {}; > > + * > > + * // do something with karg > > + * > > + * err = copy_struct_to_user(uarg, usize, &karg, sizeof(karg)); > > + * if (err) > > + * return err; > > + * > > + * // ... > > + * } > > + * > > + * There are three cases to consider: > > + * * If @usize == @ksize, then it's copied verbatim. > > + * * If @usize < @ksize, then kernel space is "returning" a newer struct to an > > + * older user space. In order to avoid user space getting incomplete > > + * information (new fields might be important), all trailing bytes in @src > > + * (@ksize - @usize) must be zerored > > s/zerored/zero/, right? It should've been "zeroed". > > , otherwise -EFBIG is returned. > > 'Funny' that, copy_struct_from_user() below seems to use E2BIG. This is a copy of the semantics that sched_[sg]etattr(2) uses -- E2BIG for a "too big" struct passed to the kernel, and EFBIG for a "too big" struct passed to user-space. I would personally have preferred EMSGSIZE instead of EFBIG, but felt using the existing error codes would be less confusing. > > > + * * If @usize > @ksize, then the kernel is "returning" an older struct to a > > + * newer user space. The trailing bytes in @dst (@usize - @ksize) will be > > + * zero-filled. > > + * > > + * Returns (in all cases, some data may have been copied): > > + * * -EFBIG: (@usize < @ksize) and there are non-zero trailing bytes in @src. > > + * * -EFAULT: access to user space failed. > > + */ > > +int copy_struct_to_user(void __user *dst, size_t usize, > > + const void *src, size_t ksize) > > +{ > > + size_t size = min(ksize, usize); > > + size_t rest = abs(ksize - usize); > > + > > + if (unlikely(usize > PAGE_SIZE)) > > + return -EFAULT; > > Not documented above. Implementation consistent with *from*, but see > below. Will update the kernel-doc. > > + if (unlikely(!access_ok(dst, usize))) > > + return -EFAULT; > > + > > + /* Deal with trailing bytes. */ > > + if (usize < ksize) { > > + if (memchr_inv(src + size, 0, rest)) > > + return -EFBIG; > > + } else if (usize > ksize) { > > + if (__memzero_user(dst + size, rest)) > > + return -EFAULT; > > + } > > + /* Copy the interoperable parts of the struct. */ > > + if (__copy_to_user(dst, src, size)) > > + return -EFAULT; > > + return 0; > > +} > > +EXPORT_SYMBOL(copy_struct_to_user); > > + > > +/** > > + * copy_struct_from_user: copy a struct from user space > > + * @dst: Destination address, in kernel space. This buffer must be @ksize > > + * bytes long. > > + * @ksize: Size of @dst struct. > > + * @src: Source address, in user space. > > + * @usize: (Alleged) size of @src struct. > > + * > > + * Copies a struct from user space to kernel space, in a way that guarantees > > + * backwards-compatibility for struct syscall arguments (as long as future > > + * struct extensions are made such that all new fields are *appended* to the > > + * old struct, and zeroed-out new fields have the same meaning as the old > > + * struct). > > + * > > + * @ksize is just sizeof(*dst), and @usize should've been passed by user space. > > + * The recommended usage is something like the following: > > + * > > + * SYSCALL_DEFINE2(foobar, const struct foo __user *, uarg, size_t, usize) > > + * { > > + * int err; > > + * struct foo karg = {}; > > + * > > + * err = copy_struct_from_user(&karg, sizeof(karg), uarg, size); > > + * if (err) > > + * return err; > > + * > > + * // ... > > + * } > > + * > > + * There are three cases to consider: > > + * * If @usize == @ksize, then it's copied verbatim. > > + * * If @usize < @ksize, then the user space has passed an old struct to a > > + * newer kernel. The rest of the trailing bytes in @dst (@ksize - @usize) > > + * are to be zero-filled. > > + * * If @usize > @ksize, then the user space has passed a new struct to an > > + * older kernel. The trailing bytes unknown to the kernel (@usize - @ksize) > > + * are checked to ensure they are zeroed, otherwise -E2BIG is returned. > > + * > > + * Returns (in all cases, some data may have been copied): > > + * * -E2BIG: (@usize > @ksize) and there are non-zero trailing bytes in @src. > > + * * -E2BIG: @usize is "too big" (at time of writing, >PAGE_SIZE). > > + * * -EFAULT: access to user space failed. > > + */ > > +int copy_struct_from_user(void *dst, size_t ksize, > > + const void __user *src, size_t usize) > > +{ > > + size_t size = min(ksize, usize); > > + size_t rest = abs(ksize - usize); > > + > > + if (unlikely(usize > PAGE_SIZE)) > > + return -EFAULT; > > Documented above as returning -E2BIG. I will switch this (and to) back to -E2BIG -- I must've had a brain-fart when doing some refactoring. > > > + if (unlikely(!access_ok(src, usize))) > > + return -EFAULT; > > + > > + /* Deal with trailing bytes. */ > > + if (usize < ksize) > > + memset(dst + size, 0, rest); > > + else if (usize > ksize) { > > + const void __user *addr = src + size; > > + char buffer[BUFFER_SIZE] = {}; > > Isn't that too big for on-stack? Is a 64-byte buffer too big? I picked the number "at random" to be the size of a cache line, but I could shrink it down to 32 bytes if the size is an issue (I wanted to avoid needless allocations -- hence it being on-stack). > > + > > + while (rest > 0) { > > + size_t bufsize = min(rest, sizeof(buffer)); > > + > > + if (__copy_from_user(buffer, addr, bufsize)) > > + return -EFAULT; > > + if (memchr_inv(buffer, 0, bufsize)) > > + return -E2BIG; > > + > > + addr += bufsize; > > + rest -= bufsize; > > + } > > The perf implementation uses get_user(); but if that is too slow, surely > we can do something with uaccess_try() here? Is there a non-x86-specific way to do that (unless I'm mistaken only x86 has uaccess_try() or the other *_try() wrappers)? The main "performance improvement" (if you can even call it that) is that we use memchr_inv() which finds non-matching characters more efficiently than just doing a loop. > > + } > > + /* Copy the interoperable parts of the struct. */ > > + if (__copy_from_user(dst, src, size)) > > + return -EFAULT; > > + return 0; > > +} > > +EXPORT_SYMBOL(copy_struct_from_user); > > And personally I'm not a big fan of EXPORT_SYMBOL(). I don't have much of an opinion (after all, it only really makes sense a lot of sense for syscalls) -- though out-of-tree modules that define ioctl()s wouldn't be able to make use of them.
On Thu, Sep 05, 2019 at 07:26:22PM +1000, Aleksa Sarai wrote: > On 2019-09-05, Peter Zijlstra <peterz@infradead.org> wrote: > > On Thu, Sep 05, 2019 at 06:19:22AM +1000, Aleksa Sarai wrote: > > > +/** > > > + * copy_struct_to_user: copy a struct to user space > > > + * @dst: Destination address, in user space. > > > + * @usize: Size of @dst struct. > > > + * @src: Source address, in kernel space. > > > + * @ksize: Size of @src struct. > > > + * > > > + * Copies a struct from kernel space to user space, in a way that guarantees > > > + * backwards-compatibility for struct syscall arguments (as long as future > > > + * struct extensions are made such that all new fields are *appended* to the > > > + * old struct, and zeroed-out new fields have the same meaning as the old > > > + * struct). > > > + * > > > + * @ksize is just sizeof(*dst), and @usize should've been passed by user space. > > > + * The recommended usage is something like the following: > > > + * > > > + * SYSCALL_DEFINE2(foobar, struct foo __user *, uarg, size_t, usize) > > > + * { > > > + * int err; > > > + * struct foo karg = {}; > > > + * > > > + * // do something with karg > > > + * > > > + * err = copy_struct_to_user(uarg, usize, &karg, sizeof(karg)); > > > + * if (err) > > > + * return err; > > > + * > > > + * // ... > > > + * } > > > + * > > > + * There are three cases to consider: > > > + * * If @usize == @ksize, then it's copied verbatim. > > > + * * If @usize < @ksize, then kernel space is "returning" a newer struct to an > > > + * older user space. In order to avoid user space getting incomplete > > > + * information (new fields might be important), all trailing bytes in @src > > > + * (@ksize - @usize) must be zerored > > > > s/zerored/zero/, right? > > It should've been "zeroed". That reads wrong to me; that way it reads like this function must take that action and zero out the 'rest'; which is just wrong. This function must verify those bytes are zero, not make them zero. > > > , otherwise -EFBIG is returned. > > > > 'Funny' that, copy_struct_from_user() below seems to use E2BIG. > > This is a copy of the semantics that sched_[sg]etattr(2) uses -- E2BIG for > a "too big" struct passed to the kernel, and EFBIG for a "too big" > struct passed to user-space. I would personally have preferred EMSGSIZE > instead of EFBIG, but felt using the existing error codes would be less > confusing. Sadly a recent commit: 1251201c0d34 ("sched/core: Fix uclamp ABI bug, clean up and robustify sched_read_attr() ABI logic and code") Made the situation even 'worse'. > > > + if (unlikely(!access_ok(src, usize))) > > > + return -EFAULT; > > > + > > > + /* Deal with trailing bytes. */ > > > + if (usize < ksize) > > > + memset(dst + size, 0, rest); > > > + else if (usize > ksize) { > > > + const void __user *addr = src + size; > > > + char buffer[BUFFER_SIZE] = {}; > > > > Isn't that too big for on-stack? > > Is a 64-byte buffer too big? I picked the number "at random" to be the > size of a cache line, but I could shrink it down to 32 bytes if the size > is an issue (I wanted to avoid needless allocations -- hence it being > on-stack). Ah, my ctags gave me a definition of BUFFER_SIZE that was 512. I suppose 64 should be OK. > > > + > > > + while (rest > 0) { > > > + size_t bufsize = min(rest, sizeof(buffer)); > > > + > > > + if (__copy_from_user(buffer, addr, bufsize)) > > > + return -EFAULT; > > > + if (memchr_inv(buffer, 0, bufsize)) > > > + return -E2BIG; > > > + > > > + addr += bufsize; > > > + rest -= bufsize; > > > + } > > > > The perf implementation uses get_user(); but if that is too slow, surely > > we can do something with uaccess_try() here? > > Is there a non-x86-specific way to do that (unless I'm mistaken only x86 > has uaccess_try() or the other *_try() wrappers)? The main "performance > improvement" (if you can even call it that) is that we use memchr_inv() > which finds non-matching characters more efficiently than just doing a > loop. Oh, you're right, that's x86 only :/
On 2019-09-05, Rasmus Villemoes <linux@rasmusvillemoes.dk> wrote: > On 04/09/2019 22.19, Aleksa Sarai wrote: > > A common pattern for syscall extensions is increasing the size of a > > struct passed from userspace, such that the zero-value of the new fields > > result in the old kernel behaviour (allowing for a mix of userspace and > > kernel vintages to operate on one another in most cases). This is done > > in both directions -- hence two helpers -- though it's more common to > > have to copy user space structs into kernel space. > > > > Previously there was no common lib/ function that implemented > > the necessary extension-checking semantics (and different syscalls > > implemented them slightly differently or incompletely[1]). A future > > patch replaces all of the common uses of this pattern to use the new > > copy_struct_{to,from}_user() helpers. > > > > [1]: For instance {sched_setattr,perf_event_open,clone3}(2) all do do > > similar checks to copy_struct_from_user() while rt_sigprocmask(2) > > always rejects differently-sized struct arguments. > > > > Suggested-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> > > Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> > > --- > > diff --git a/lib/struct_user.c b/lib/struct_user.c > > new file mode 100644 > > index 000000000000..7301ab1bbe98 > > --- /dev/null > > +++ b/lib/struct_user.c > > @@ -0,0 +1,182 @@ > > +// SPDX-License-Identifier: GPL-2.0-or-later > > +/* > > + * Copyright (C) 2019 SUSE LLC > > + * Copyright (C) 2019 Aleksa Sarai <cyphar@cyphar.com> > > + */ > > + > > +#include <linux/types.h> > > +#include <linux/export.h> > > +#include <linux/uaccess.h> > > +#include <linux/kernel.h> > > +#include <linux/string.h> > > + > > +#define BUFFER_SIZE 64 > > + > > +/* > > + * "memset(p, 0, size)" but for user space buffers. Caller must have already > > + * checked access_ok(p, size). > > + */ > > Isn't this __clear_user() exactly (perhaps except for the return value)? > Perhaps not every arch has that? I didn't know about clear_user() -- I will switch to it. > > +static int __memzero_user(void __user *p, size_t s) > > +{ > > + const char zeros[BUFFER_SIZE] = {}; > > + while (s > 0) { > > + size_t n = min(s, sizeof(zeros)); > > + > > + if (__copy_to_user(p, zeros, n)) > > + return -EFAULT; > > + > > + p += n; > > + s -= n; > > + } > > + return 0; > > +} > > + > > +/** > > + * copy_struct_to_user: copy a struct to user space > > + * @dst: Destination address, in user space. > > + * @usize: Size of @dst struct. > > + * @src: Source address, in kernel space. > > + * @ksize: Size of @src struct. > > + * > > + * Returns (in all cases, some data may have been copied): > > + * * -EFBIG: (@usize < @ksize) and there are non-zero trailing bytes in @src. > > + * * -EFAULT: access to user space failed. > > + */ > > +int copy_struct_to_user(void __user *dst, size_t usize, > > + const void *src, size_t ksize) > > +{ > > + size_t size = min(ksize, usize); > > + size_t rest = abs(ksize - usize); > > Eh, I'd avoid abs() here due to the funkiness of the implicit type > conversions - ksize-usize has type size_t, then that's coerced to an int > (or a long maybe?), the abs is applied which return an int/long (or > unsigned versions?). Something like "rest = max(ksize, usize) - size;" > is more obviously correct and doesn't fall into any > narrowing/widening/sign extending traps. Yeah, I originally used "max(ksize, usize) - size" for that reason but was worried it looked too funky (and some quick tests showed that abs() gives the right results in most cases -- though I just realised it would probably not give the right results around SIZE_MAX). I'll switch back. > > + if (unlikely(usize > PAGE_SIZE)) > > + return -EFAULT; > > Please don't. That is a restriction on all future extensions - once a > kernel is shipped with a syscall using this helper with that arbitrary > restriction in place, that syscall is forever prevented from extending > its arg struct beyond PAGE_SIZE (which is arch-dependent anyway). Sure, > it's hard to imagine, but who'd have thought 32 O_* or CLONE_* bits > weren't enough for everybody? > > This is only for future compatibility, and if someone runs an app > compiled against 7.3 headers on a 5.4 kernel, they probably don't care > about performance, but they would like their app to run. I'm not sure I agree that the limit is in place *forever* -- it's generally not a break in compatibility to convert an error into a success (though, there are counterexamples such as mknod(2) -- but that was a very specific case). You're right that it would mean that some very new code won't run on very ancient kernels (assuming we ever pass around structs that massive), but there should be a reasonable trade-off here IMHO. If we allow very large sizes, a program could probably DoS the kernel by allocating a moderately-large block of memory and then spawning a bunch of threads that all cause the kernel to re-check that the same 1GB block of memory is zeroed. I haven't tried, but it seems like it's best to avoid the possibility altogether. > > + } > > + /* Copy the interoperable parts of the struct. */ > > + if (__copy_to_user(dst, src, size)) > > + return -EFAULT; > > I think I understand why you put this last instead of handling the > buffer in the "natural" order. However, > I'm wondering whether we should actually do this copy before checking > that the extra kernel bytes are 0 - the user will still be told that > there was some extra information via the -EFBIG/-E2BIG return, but maybe > in some cases the part he understands is good enough. But I also guess > we have to look to existing users to see whether that would prevent them > from being converted to using this helper. > > linux-api folks, WDYT? Regarding the order, I just copied what sched and perf already do. I wouldn't mind doing it the other way around -- though I am a little cautious about implicitly making guarantees like that. The syscall that uses copy_struct_to_user() might not want to make that guarantee (it might not make sense for them), and there are some -E2BIG returns that won't result in data being copied (usize > PAGE_SIZE). As for feedback, this is syscall-dependent at the moment. The sched and perf users explicitly return the size of the kernel structure (by overwriting uattr->size if -E2BIG is returned) for copies in either direction. So users arguably already have some kind of feedback about size issues. clone3() on the other hand doesn't do that (though it doesn't copy anything to user-space so this isn't relevant to this particular question). Effectively, I'd like to see someone argue that this is something that they would personally want (before we do it). > > + return 0; > > Maybe more useful to "return size;", some users might want to know/pass > on how much was actually copied. Even though it is "just" min(ksize, usize), I don't see any harm in returning it. Will do. > > +} > > +EXPORT_SYMBOL(copy_struct_to_user); > > Can't we wait with this until a modular user shows up? The primary users > are syscalls, which can't be modular AFAIK. Yeah, I'll drop it. You could use them for ioctl()s but we can always add EXPORT_SYMBOL() later. > > +/** > > + * copy_struct_from_user: copy a struct from user space > > + * @dst: Destination address, in kernel space. This buffer must be @ksize > > + * bytes long. > > + * @ksize: Size of @dst struct. > > + * @src: Source address, in user space. > > + * @usize: (Alleged) size of @src struct. > > + * > > + * Copies a struct from user space to kernel space, in a way that guarantees > > + * backwards-compatibility for struct syscall arguments (as long as future > > + * struct extensions are made such that all new fields are *appended* to the > > + * old struct, and zeroed-out new fields have the same meaning as the old > > + * struct). > > + * > > + * @ksize is just sizeof(*dst), and @usize should've been passed by user space. > > + * The recommended usage is something like the following: > > + * > > + * SYSCALL_DEFINE2(foobar, const struct foo __user *, uarg, size_t, usize) > > + * { > > + * int err; > > + * struct foo karg = {}; > > + * > > + * err = copy_struct_from_user(&karg, sizeof(karg), uarg, size); > > + * if (err) > > + * return err; > > + * > > + * // ... > > + * } > > + * > > + * There are three cases to consider: > > + * * If @usize == @ksize, then it's copied verbatim. > > + * * If @usize < @ksize, then the user space has passed an old struct to a > > + * newer kernel. The rest of the trailing bytes in @dst (@ksize - @usize) > > + * are to be zero-filled. > > + * * If @usize > @ksize, then the user space has passed a new struct to an > > + * older kernel. The trailing bytes unknown to the kernel (@usize - @ksize) > > + * are checked to ensure they are zeroed, otherwise -E2BIG is returned. > > + * > > + * Returns (in all cases, some data may have been copied): > > + * * -E2BIG: (@usize > @ksize) and there are non-zero trailing bytes in @src. > > + * * -E2BIG: @usize is "too big" (at time of writing, >PAGE_SIZE). > > + * * -EFAULT: access to user space failed. > > + */ > > +int copy_struct_from_user(void *dst, size_t ksize, > > + const void __user *src, size_t usize) > > +{ > > + size_t size = min(ksize, usize); > > + size_t rest = abs(ksize - usize); > > As above. > > > + if (unlikely(usize > PAGE_SIZE)) > > + return -EFAULT; > > As above. > > > + if (unlikely(!access_ok(src, usize))) > > + return -EFAULT; > > + > > + /* Deal with trailing bytes. */ > > + if (usize < ksize) > > + memset(dst + size, 0, rest); > > + else if (usize > ksize) { > > + const void __user *addr = src + size; > > + char buffer[BUFFER_SIZE] = {}; > > + > > + while (rest > 0) { > > + size_t bufsize = min(rest, sizeof(buffer)); > > + > > + if (__copy_from_user(buffer, addr, bufsize)) > > + return -EFAULT; > > + if (memchr_inv(buffer, 0, bufsize)) > > + return -E2BIG; > > + > > + addr += bufsize; > > + rest -= bufsize; > > + } > > I'd create a __user_is_zero() helper for this - that way the two > branches in the two helpers become nicely symmetric, each just calling a > single helper that deals appropriately with the tail. And we can discuss > how to implement __user_is_zero() in another bikeshed. Will do. > > > + } > > + /* Copy the interoperable parts of the struct. */ > > + if (__copy_from_user(dst, src, size)) > > + return -EFAULT; > > If you do move up the __copy_to_user(), please move this as well - on > the kernel side, we certainly don't care that we copied some bytes to a > local buffer which we then ignore because the user had a non-zero tail. > But if __copy_to_user() is kept last in copy_struct_to_user(), this > should stay for symmetry. I will keep that in mind.
On Thu, Sep 05, 2019 at 11:09:35AM +0200, Andreas Schwab wrote: > On Sep 05 2019, Aleksa Sarai <cyphar@cyphar.com> wrote: > > > diff --git a/lib/struct_user.c b/lib/struct_user.c > > new file mode 100644 > > index 000000000000..7301ab1bbe98 > > --- /dev/null > > +++ b/lib/struct_user.c > > @@ -0,0 +1,182 @@ > > +// SPDX-License-Identifier: GPL-2.0-or-later > > +/* > > + * Copyright (C) 2019 SUSE LLC > > + * Copyright (C) 2019 Aleksa Sarai <cyphar@cyphar.com> > > + */ > > + > > +#include <linux/types.h> > > +#include <linux/export.h> > > +#include <linux/uaccess.h> > > +#include <linux/kernel.h> > > +#include <linux/string.h> > > + > > +#define BUFFER_SIZE 64 > > + > > +/* > > + * "memset(p, 0, size)" but for user space buffers. Caller must have already > > + * checked access_ok(p, size). > > + */ > > +static int __memzero_user(void __user *p, size_t s) > > +{ > > + const char zeros[BUFFER_SIZE] = {}; > > Perhaps make that static? On SMP? It should at least be per cpu, and I'm not even sure with preemption. Gabriel > > Andreas. > > -- > Andreas Schwab, schwab@linux-m68k.org > GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510 2552 DF73 E780 A9DA AEC1 > "And now for something completely different."
On Thu, Sep 05, 2019 at 07:50:26PM +1000, Aleksa Sarai wrote: > On 2019-09-05, Rasmus Villemoes <linux@rasmusvillemoes.dk> wrote: > > On 04/09/2019 22.19, Aleksa Sarai wrote: > > > A common pattern for syscall extensions is increasing the size of a > > > struct passed from userspace, such that the zero-value of the new fields > > > result in the old kernel behaviour (allowing for a mix of userspace and > > > kernel vintages to operate on one another in most cases). This is done > > > in both directions -- hence two helpers -- though it's more common to > > > have to copy user space structs into kernel space. > > > > > > Previously there was no common lib/ function that implemented > > > the necessary extension-checking semantics (and different syscalls > > > implemented them slightly differently or incompletely[1]). A future > > > patch replaces all of the common uses of this pattern to use the new > > > copy_struct_{to,from}_user() helpers. > > > > > > [1]: For instance {sched_setattr,perf_event_open,clone3}(2) all do do > > > similar checks to copy_struct_from_user() while rt_sigprocmask(2) > > > always rejects differently-sized struct arguments. > > > > > > Suggested-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> > > > Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> > > > --- > > > diff --git a/lib/struct_user.c b/lib/struct_user.c > > > new file mode 100644 > > > index 000000000000..7301ab1bbe98 > > > --- /dev/null > > > +++ b/lib/struct_user.c > > > @@ -0,0 +1,182 @@ > > > +// SPDX-License-Identifier: GPL-2.0-or-later > > > +/* > > > + * Copyright (C) 2019 SUSE LLC > > > + * Copyright (C) 2019 Aleksa Sarai <cyphar@cyphar.com> > > > + */ > > > + > > > +#include <linux/types.h> > > > +#include <linux/export.h> > > > +#include <linux/uaccess.h> > > > +#include <linux/kernel.h> > > > +#include <linux/string.h> > > > + > > > +#define BUFFER_SIZE 64 > > > + > > > +/* > > > + * "memset(p, 0, size)" but for user space buffers. Caller must have already > > > + * checked access_ok(p, size). > > > + */ > > > > Isn't this __clear_user() exactly (perhaps except for the return value)? > > Perhaps not every arch has that? > > I didn't know about clear_user() -- I will switch to it. > > > > +static int __memzero_user(void __user *p, size_t s) > > > +{ > > > + const char zeros[BUFFER_SIZE] = {}; > > > + while (s > 0) { > > > + size_t n = min(s, sizeof(zeros)); > > > + > > > + if (__copy_to_user(p, zeros, n)) > > > + return -EFAULT; > > > + > > > + p += n; > > > + s -= n; > > > + } > > > + return 0; > > > +} > > > + > > > +/** > > > + * copy_struct_to_user: copy a struct to user space > > > + * @dst: Destination address, in user space. > > > + * @usize: Size of @dst struct. > > > + * @src: Source address, in kernel space. > > > + * @ksize: Size of @src struct. > > > + * > > > + * Returns (in all cases, some data may have been copied): > > > + * * -EFBIG: (@usize < @ksize) and there are non-zero trailing bytes in @src. > > > + * * -EFAULT: access to user space failed. > > > + */ > > > +int copy_struct_to_user(void __user *dst, size_t usize, > > > + const void *src, size_t ksize) > > > +{ > > > + size_t size = min(ksize, usize); > > > + size_t rest = abs(ksize - usize); > > > > Eh, I'd avoid abs() here due to the funkiness of the implicit type > > conversions - ksize-usize has type size_t, then that's coerced to an int > > (or a long maybe?), the abs is applied which return an int/long (or > > unsigned versions?). Something like "rest = max(ksize, usize) - size;" > > is more obviously correct and doesn't fall into any > > narrowing/widening/sign extending traps. > > Yeah, I originally used "max(ksize, usize) - size" for that reason but > was worried it looked too funky (and some quick tests showed that abs() > gives the right results in most cases -- though I just realised it would > probably not give the right results around SIZE_MAX). I'll switch back. > > > > + if (unlikely(usize > PAGE_SIZE)) > > > + return -EFAULT; > > > > Please don't. That is a restriction on all future extensions - once a > > kernel is shipped with a syscall using this helper with that arbitrary > > restriction in place, that syscall is forever prevented from extending > > its arg struct beyond PAGE_SIZE (which is arch-dependent anyway). Sure, > > it's hard to imagine, but who'd have thought 32 O_* or CLONE_* bits > > weren't enough for everybody? > > > > This is only for future compatibility, and if someone runs an app > > compiled against 7.3 headers on a 5.4 kernel, they probably don't care > > about performance, but they would like their app to run. > > I'm not sure I agree that the limit is in place *forever* -- it's > generally not a break in compatibility to convert an error into a > success (though, there are counterexamples such as mknod(2) -- but that > was a very specific case). > > You're right that it would mean that some very new code won't run on > very ancient kernels (assuming we ever pass around structs that > massive), but there should be a reasonable trade-off here IMHO. Passing a struct larger than a PAGE_SIZE right now (at least for all those calls that would make use of this helper at the moment) is to be considered a bug. The PAGE_SIZE check is a reasonable heuristic. It's an assumption that is pretty common in the kernel in other places as well. Plus the possibility of DoS. > > If we allow very large sizes, a program could probably DoS the kernel by > allocating a moderately-large block of memory and then spawning a bunch > of threads that all cause the kernel to re-check that the same 1GB block > of memory is zeroed. I haven't tried, but it seems like it's best to > avoid the possibility altogether. > > > > + } > > > + /* Copy the interoperable parts of the struct. */ > > > + if (__copy_to_user(dst, src, size)) > > > + return -EFAULT; > > > > I think I understand why you put this last instead of handling the > > buffer in the "natural" order. However, > > I'm wondering whether we should actually do this copy before checking > > that the extra kernel bytes are 0 - the user will still be told that > > there was some extra information via the -EFBIG/-E2BIG return, but maybe > > in some cases the part he understands is good enough. But I also guess > > we have to look to existing users to see whether that would prevent them > > from being converted to using this helper. > > > > linux-api folks, WDYT? > > Regarding the order, I just copied what sched and perf already do. I > wouldn't mind doing it the other way around -- though I am a little > cautious about implicitly making guarantees like that. The syscall that > uses copy_struct_to_user() might not want to make that guarantee (it > might not make sense for them), and there are some -E2BIG returns that > won't result in data being copied (usize > PAGE_SIZE). > > As for feedback, this is syscall-dependent at the moment. The sched and > perf users explicitly return the size of the kernel structure (by > overwriting uattr->size if -E2BIG is returned) for copies in either > direction. So users arguably already have some kind of feedback about > size issues. clone3() on the other hand doesn't do that (though it > doesn't copy anything to user-space so this isn't relevant to this > particular question). > > Effectively, I'd like to see someone argue that this is something that > they would personally want (before we do it). I think the order you have right now is fine. I don't see the point of doing work first before we have verified that things are sane.
On Thu, Sep 05, 2019 at 11:43:05AM +0200, Peter Zijlstra wrote: > On Thu, Sep 05, 2019 at 07:26:22PM +1000, Aleksa Sarai wrote: > > On 2019-09-05, Peter Zijlstra <peterz@infradead.org> wrote: > > > On Thu, Sep 05, 2019 at 06:19:22AM +1000, Aleksa Sarai wrote: > > > > +/** > > > > + * copy_struct_to_user: copy a struct to user space > > > > + * @dst: Destination address, in user space. > > > > + * @usize: Size of @dst struct. > > > > + * @src: Source address, in kernel space. > > > > + * @ksize: Size of @src struct. > > > > + * > > > > + * Copies a struct from kernel space to user space, in a way that guarantees > > > > + * backwards-compatibility for struct syscall arguments (as long as future > > > > + * struct extensions are made such that all new fields are *appended* to the > > > > + * old struct, and zeroed-out new fields have the same meaning as the old > > > > + * struct). > > > > + * > > > > + * @ksize is just sizeof(*dst), and @usize should've been passed by user space. > > > > + * The recommended usage is something like the following: > > > > + * > > > > + * SYSCALL_DEFINE2(foobar, struct foo __user *, uarg, size_t, usize) > > > > + * { > > > > + * int err; > > > > + * struct foo karg = {}; > > > > + * > > > > + * // do something with karg > > > > + * > > > > + * err = copy_struct_to_user(uarg, usize, &karg, sizeof(karg)); > > > > + * if (err) > > > > + * return err; > > > > + * > > > > + * // ... > > > > + * } > > > > + * > > > > + * There are three cases to consider: > > > > + * * If @usize == @ksize, then it's copied verbatim. > > > > + * * If @usize < @ksize, then kernel space is "returning" a newer struct to an > > > > + * older user space. In order to avoid user space getting incomplete > > > > + * information (new fields might be important), all trailing bytes in @src > > > > + * (@ksize - @usize) must be zerored > > > > > > s/zerored/zero/, right? > > > > It should've been "zeroed". > > That reads wrong to me; that way it reads like this function must take > that action and zero out the 'rest'; which is just wrong. > > This function must verify those bytes are zero, not make them zero. > > > > > , otherwise -EFBIG is returned. > > > > > > 'Funny' that, copy_struct_from_user() below seems to use E2BIG. > > > > This is a copy of the semantics that sched_[sg]etattr(2) uses -- E2BIG for > > a "too big" struct passed to the kernel, and EFBIG for a "too big" > > struct passed to user-space. I would personally have preferred EMSGSIZE > > instead of EFBIG, but felt using the existing error codes would be less > > confusing. > > Sadly a recent commit: > > 1251201c0d34 ("sched/core: Fix uclamp ABI bug, clean up and robustify sched_read_attr() ABI logic and code") > > Made the situation even 'worse'. And thinking more about things; I'm not convinced the above patch is actually right. Do we really want to simply truncate all the attributes of the task? And should we not at least set sched_flags when there are non-default clamp values applied? See; that is I think the primary bug that had chrt failing; we tried to publish the default clamp values as !0.
On Thu, Sep 05, 2019 at 06:19:22AM +1000, Aleksa Sarai wrote: > A common pattern for syscall extensions is increasing the size of a > struct passed from userspace, such that the zero-value of the new fields > result in the old kernel behaviour (allowing for a mix of userspace and > kernel vintages to operate on one another in most cases). This is done > in both directions -- hence two helpers -- though it's more common to > have to copy user space structs into kernel space. > > Previously there was no common lib/ function that implemented > the necessary extension-checking semantics (and different syscalls > implemented them slightly differently or incompletely[1]). A future > patch replaces all of the common uses of this pattern to use the new > copy_struct_{to,from}_user() helpers. > > [1]: For instance {sched_setattr,perf_event_open,clone3}(2) all do do > similar checks to copy_struct_from_user() while rt_sigprocmask(2) > always rejects differently-sized struct arguments. > > Suggested-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> > Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> > --- > include/linux/uaccess.h | 5 ++ > lib/Makefile | 2 +- > lib/struct_user.c | 182 ++++++++++++++++++++++++++++++++++++++++ > 3 files changed, 188 insertions(+), 1 deletion(-) > create mode 100644 lib/struct_user.c > > diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h > index 34a038563d97..0ad9544a1aee 100644 > --- a/include/linux/uaccess.h > +++ b/include/linux/uaccess.h > @@ -230,6 +230,11 @@ static inline unsigned long __copy_from_user_inatomic_nocache(void *to, > > #endif /* ARCH_HAS_NOCACHE_UACCESS */ > > +extern int copy_struct_to_user(void __user *dst, size_t usize, > + const void *src, size_t ksize); > +extern int copy_struct_from_user(void *dst, size_t ksize, > + const void __user *src, size_t usize); > + > /* > * probe_kernel_read(): safely attempt to read from a location > * @dst: pointer to the buffer that shall take the data > diff --git a/lib/Makefile b/lib/Makefile > index 29c02a924973..d86c71feaf0a 100644 > --- a/lib/Makefile > +++ b/lib/Makefile > @@ -28,7 +28,7 @@ endif > CFLAGS_string.o := $(call cc-option, -fno-stack-protector) > endif > > -lib-y := ctype.o string.o vsprintf.o cmdline.o \ > +lib-y := ctype.o string.o struct_user.o vsprintf.o cmdline.o \ > rbtree.o radix-tree.o timerqueue.o xarray.o \ > idr.o extable.o \ > sha1.o chacha.o irq_regs.o argv_split.o \ > diff --git a/lib/struct_user.c b/lib/struct_user.c > new file mode 100644 > index 000000000000..7301ab1bbe98 > --- /dev/null > +++ b/lib/struct_user.c > @@ -0,0 +1,182 @@ > +// SPDX-License-Identifier: GPL-2.0-or-later > +/* > + * Copyright (C) 2019 SUSE LLC > + * Copyright (C) 2019 Aleksa Sarai <cyphar@cyphar.com> > + */ > + > +#include <linux/types.h> > +#include <linux/export.h> > +#include <linux/uaccess.h> > +#include <linux/kernel.h> > +#include <linux/string.h> > + > +#define BUFFER_SIZE 64 > + > +/* > + * "memset(p, 0, size)" but for user space buffers. Caller must have already > + * checked access_ok(p, size). > + */ > +static int __memzero_user(void __user *p, size_t s) > +{ > + const char zeros[BUFFER_SIZE] = {}; > + while (s > 0) { > + size_t n = min(s, sizeof(zeros)); > + > + if (__copy_to_user(p, zeros, n)) > + return -EFAULT; > + > + p += n; > + s -= n; > + } > + return 0; > +} > + > +/** > + * copy_struct_to_user: copy a struct to user space > + * @dst: Destination address, in user space. > + * @usize: Size of @dst struct. > + * @src: Source address, in kernel space. > + * @ksize: Size of @src struct. > + * > + * Copies a struct from kernel space to user space, in a way that guarantees > + * backwards-compatibility for struct syscall arguments (as long as future > + * struct extensions are made such that all new fields are *appended* to the > + * old struct, and zeroed-out new fields have the same meaning as the old > + * struct). > + * > + * @ksize is just sizeof(*dst), and @usize should've been passed by user space. > + * The recommended usage is something like the following: > + * > + * SYSCALL_DEFINE2(foobar, struct foo __user *, uarg, size_t, usize) > + * { > + * int err; > + * struct foo karg = {}; > + * > + * // do something with karg > + * > + * err = copy_struct_to_user(uarg, usize, &karg, sizeof(karg)); > + * if (err) > + * return err; > + * > + * // ... > + * } > + * > + * There are three cases to consider: > + * * If @usize == @ksize, then it's copied verbatim. > + * * If @usize < @ksize, then kernel space is "returning" a newer struct to an > + * older user space. In order to avoid user space getting incomplete > + * information (new fields might be important), all trailing bytes in @src > + * (@ksize - @usize) must be zerored, otherwise -EFBIG is returned. > + * * If @usize > @ksize, then the kernel is "returning" an older struct to a > + * newer user space. The trailing bytes in @dst (@usize - @ksize) will be > + * zero-filled. > + * > + * Returns (in all cases, some data may have been copied): > + * * -EFBIG: (@usize < @ksize) and there are non-zero trailing bytes in @src. > + * * -EFAULT: access to user space failed. > + */ > +int copy_struct_to_user(void __user *dst, size_t usize, > + const void *src, size_t ksize) > +{ > + size_t size = min(ksize, usize); > + size_t rest = abs(ksize - usize); > + > + if (unlikely(usize > PAGE_SIZE)) > + return -EFAULT; Looks like this should be -EFBIG. > + if (unlikely(!access_ok(dst, usize))) > + return -EFAULT; > + > + /* Deal with trailing bytes. */ > + if (usize < ksize) { > + if (memchr_inv(src + size, 0, rest)) > + return -EFBIG; > + } else if (usize > ksize) { > + if (__memzero_user(dst + size, rest)) > + return -EFAULT; Is zeroing that memory really our job? Seems to me we should just check it is zeroed. > + } > + /* Copy the interoperable parts of the struct. */ > + if (__copy_to_user(dst, src, size)) > + return -EFAULT; > + return 0; > +} > +EXPORT_SYMBOL(copy_struct_to_user); > + > +/** > + * copy_struct_from_user: copy a struct from user space > + * @dst: Destination address, in kernel space. This buffer must be @ksize > + * bytes long. > + * @ksize: Size of @dst struct. > + * @src: Source address, in user space. > + * @usize: (Alleged) size of @src struct. > + * > + * Copies a struct from user space to kernel space, in a way that guarantees > + * backwards-compatibility for struct syscall arguments (as long as future > + * struct extensions are made such that all new fields are *appended* to the > + * old struct, and zeroed-out new fields have the same meaning as the old > + * struct). > + * > + * @ksize is just sizeof(*dst), and @usize should've been passed by user space. > + * The recommended usage is something like the following: > + * > + * SYSCALL_DEFINE2(foobar, const struct foo __user *, uarg, size_t, usize) > + * { > + * int err; > + * struct foo karg = {}; > + * > + * err = copy_struct_from_user(&karg, sizeof(karg), uarg, size); > + * if (err) > + * return err; > + * > + * // ... > + * } > + * > + * There are three cases to consider: > + * * If @usize == @ksize, then it's copied verbatim. > + * * If @usize < @ksize, then the user space has passed an old struct to a > + * newer kernel. The rest of the trailing bytes in @dst (@ksize - @usize) > + * are to be zero-filled. > + * * If @usize > @ksize, then the user space has passed a new struct to an > + * older kernel. The trailing bytes unknown to the kernel (@usize - @ksize) > + * are checked to ensure they are zeroed, otherwise -E2BIG is returned. > + * > + * Returns (in all cases, some data may have been copied): > + * * -E2BIG: (@usize > @ksize) and there are non-zero trailing bytes in @src. > + * * -E2BIG: @usize is "too big" (at time of writing, >PAGE_SIZE). > + * * -EFAULT: access to user space failed. > + */ > +int copy_struct_from_user(void *dst, size_t ksize, > + const void __user *src, size_t usize) > +{ > + size_t size = min(ksize, usize); > + size_t rest = abs(ksize - usize); > + > + if (unlikely(usize > PAGE_SIZE)) > + return -EFAULT; That should be -E2BIG. > + if (unlikely(!access_ok(src, usize))) > + return -EFAULT; > + > + /* Deal with trailing bytes. */ > + if (usize < ksize) > + memset(dst + size, 0, rest); I think kernel style mandates that if one branch in an if-else ladder requires {} all other must use {} as well. So this should be: if () { // one line } else { // one line // another line } That's a change in behavior for clone3() and sched at least, no? Unless - which I guess you might have done - you have moved the "error out when the struct is too small" part before the call to copy_struct_from_user() for them. > + else if (usize > ksize) { > + const void __user *addr = src + size; > + char buffer[BUFFER_SIZE] = {}; > + > + while (rest > 0) { > + size_t bufsize = min(rest, sizeof(buffer)); > + > + if (__copy_from_user(buffer, addr, bufsize)) > + return -EFAULT; > + if (memchr_inv(buffer, 0, bufsize)) > + return -E2BIG; > + > + addr += bufsize; > + rest -= bufsize; > + } > + } > + /* Copy the interoperable parts of the struct. */ > + if (__copy_from_user(dst, src, size)) > + return -EFAULT; > + return 0; > +} > +EXPORT_SYMBOL(copy_struct_from_user); > -- > 2.23.0 >
On Thu, Sep 05, 2019 at 06:19:22AM +1000, Aleksa Sarai wrote: > A common pattern for syscall extensions is increasing the size of a > struct passed from userspace, such that the zero-value of the new fields > result in the old kernel behaviour (allowing for a mix of userspace and > kernel vintages to operate on one another in most cases). This is done > in both directions -- hence two helpers -- though it's more common to > have to copy user space structs into kernel space. > > Previously there was no common lib/ function that implemented > the necessary extension-checking semantics (and different syscalls > implemented them slightly differently or incompletely[1]). A future > patch replaces all of the common uses of this pattern to use the new > copy_struct_{to,from}_user() helpers. > > [1]: For instance {sched_setattr,perf_event_open,clone3}(2) all do do > similar checks to copy_struct_from_user() while rt_sigprocmask(2) > always rejects differently-sized struct arguments. > > Suggested-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> > Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> I would probably split this out into a separate patchset. It can very well go in before openat2(). Thoughts? Christian
On 05/09/2019 13.05, Christian Brauner wrote: > On Thu, Sep 05, 2019 at 06:19:22AM +1000, Aleksa Sarai wrote: >> + if (unlikely(!access_ok(dst, usize))) >> + return -EFAULT; >> + >> + /* Deal with trailing bytes. */ >> + if (usize < ksize) { >> + if (memchr_inv(src + size, 0, rest)) >> + return -EFBIG; >> + } else if (usize > ksize) { >> + if (__memzero_user(dst + size, rest)) >> + return -EFAULT; > > Is zeroing that memory really our job? Seems to me we should just check > it is zeroed. Of course it is, otherwise you'd require userspace to clear the output buffer it gives us, which in the majority of cases is wasted work. It's much easier to reason about if we just say "the kernel populates [uaddr, uaddr + usize)". It's completely symmetric to copy_struct_from_user doing a memset() of the tail of the kernel buffer in case of ksize>usize - you wouldn't want to require the kernel callers to pass a zeroed buffer to copy_struct_from_user() - it's just that when we memset(__user*), there's an error check to do. Rasmus
On 2019-09-05, Christian Brauner <christian.brauner@ubuntu.com> wrote: > On Thu, Sep 05, 2019 at 06:19:22AM +1000, Aleksa Sarai wrote: > > A common pattern for syscall extensions is increasing the size of a > > struct passed from userspace, such that the zero-value of the new fields > > result in the old kernel behaviour (allowing for a mix of userspace and > > kernel vintages to operate on one another in most cases). This is done > > in both directions -- hence two helpers -- though it's more common to > > have to copy user space structs into kernel space. > > > > Previously there was no common lib/ function that implemented > > the necessary extension-checking semantics (and different syscalls > > implemented them slightly differently or incompletely[1]). A future > > patch replaces all of the common uses of this pattern to use the new > > copy_struct_{to,from}_user() helpers. > > > > [1]: For instance {sched_setattr,perf_event_open,clone3}(2) all do do > > similar checks to copy_struct_from_user() while rt_sigprocmask(2) > > always rejects differently-sized struct arguments. > > > > Suggested-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> > > Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> > > I would probably split this out into a separate patchset. It can very > well go in before openat2(). Thoughts? Yeah, I'll split this and the related patches out -- though I will admit I'm not sure how you're supposed to deal with multiple independent patchsets that depend on each other. How will folks reviewing openat2(2) know to include the lib/struct_user.c changes? Also, whose tree should it go through?
On Thu, Sep 05, 2019 at 01:17:38PM +0200, Rasmus Villemoes wrote: > On 05/09/2019 13.05, Christian Brauner wrote: > > On Thu, Sep 05, 2019 at 06:19:22AM +1000, Aleksa Sarai wrote: > > >> + if (unlikely(!access_ok(dst, usize))) > >> + return -EFAULT; > >> + > >> + /* Deal with trailing bytes. */ > >> + if (usize < ksize) { > >> + if (memchr_inv(src + size, 0, rest)) > >> + return -EFBIG; > >> + } else if (usize > ksize) { > >> + if (__memzero_user(dst + size, rest)) > >> + return -EFAULT; > > > > Is zeroing that memory really our job? Seems to me we should just check > > it is zeroed. > > Of course it is, otherwise you'd require userspace to clear the output > buffer it gives us, which in the majority of cases is wasted work. It's > much easier to reason about if we just say "the kernel populates [uaddr, > uaddr + usize)". I don't really mind either way so sure. :)
On Thu, Sep 05, 2019 at 09:27:18PM +1000, Aleksa Sarai wrote: > On 2019-09-05, Christian Brauner <christian.brauner@ubuntu.com> wrote: > > On Thu, Sep 05, 2019 at 06:19:22AM +1000, Aleksa Sarai wrote: > > > A common pattern for syscall extensions is increasing the size of a > > > struct passed from userspace, such that the zero-value of the new fields > > > result in the old kernel behaviour (allowing for a mix of userspace and > > > kernel vintages to operate on one another in most cases). This is done > > > in both directions -- hence two helpers -- though it's more common to > > > have to copy user space structs into kernel space. > > > > > > Previously there was no common lib/ function that implemented > > > the necessary extension-checking semantics (and different syscalls > > > implemented them slightly differently or incompletely[1]). A future > > > patch replaces all of the common uses of this pattern to use the new > > > copy_struct_{to,from}_user() helpers. > > > > > > [1]: For instance {sched_setattr,perf_event_open,clone3}(2) all do do > > > similar checks to copy_struct_from_user() while rt_sigprocmask(2) > > > always rejects differently-sized struct arguments. > > > > > > Suggested-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> > > > Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> > > > > I would probably split this out into a separate patchset. It can very > > well go in before openat2(). Thoughts? > > Yeah, I'll split this and the related patches out -- though I will admit > I'm not sure how you're supposed to deal with multiple independent > patchsets that depend on each other. How will folks reviewing openat2(2) > know to include the lib/struct_user.c changes? The way I usually deal with this is to make two branches. One with the changes the other depends on and then merge this branch into the other and put the changes on top. Then you can provide a complete branch that people can test when you send the patchset out by just linking to it in the cover letter. (But if it's too much hazzle just leave it.) > > Also, whose tree should it go through? If people think splitting it out makes sense and we can settle the technical details I can take it and let it stew in linux-next at least for a little while. I have changes to clone3() in there that touch copy_clone_args_from_user() anyway and there are tests for clone3() struct copying so we'd catch regressions (for clone3() at least) pretty quickly. If we don't see any major issues in the next two weeks it might even be ok to send for 5.4. Christian
On 2019-09-05, Peter Zijlstra <peterz@infradead.org> wrote: > On Thu, Sep 05, 2019 at 07:26:22PM +1000, Aleksa Sarai wrote: > > On 2019-09-05, Peter Zijlstra <peterz@infradead.org> wrote: > > > On Thu, Sep 05, 2019 at 06:19:22AM +1000, Aleksa Sarai wrote: > > > > +/** > > > > + * copy_struct_to_user: copy a struct to user space > > > > + * @dst: Destination address, in user space. > > > > + * @usize: Size of @dst struct. > > > > + * @src: Source address, in kernel space. > > > > + * @ksize: Size of @src struct. > > > > + * > > > > + * Copies a struct from kernel space to user space, in a way that guarantees > > > > + * backwards-compatibility for struct syscall arguments (as long as future > > > > + * struct extensions are made such that all new fields are *appended* to the > > > > + * old struct, and zeroed-out new fields have the same meaning as the old > > > > + * struct). > > > > + * > > > > + * @ksize is just sizeof(*dst), and @usize should've been passed by user space. > > > > + * The recommended usage is something like the following: > > > > + * > > > > + * SYSCALL_DEFINE2(foobar, struct foo __user *, uarg, size_t, usize) > > > > + * { > > > > + * int err; > > > > + * struct foo karg = {}; > > > > + * > > > > + * // do something with karg > > > > + * > > > > + * err = copy_struct_to_user(uarg, usize, &karg, sizeof(karg)); > > > > + * if (err) > > > > + * return err; > > > > + * > > > > + * // ... > > > > + * } > > > > + * > > > > + * There are three cases to consider: > > > > + * * If @usize == @ksize, then it's copied verbatim. > > > > + * * If @usize < @ksize, then kernel space is "returning" a newer struct to an > > > > + * older user space. In order to avoid user space getting incomplete > > > > + * information (new fields might be important), all trailing bytes in @src > > > > + * (@ksize - @usize) must be zerored > > > > > > s/zerored/zero/, right? > > > > It should've been "zeroed". > > That reads wrong to me; that way it reads like this function must take > that action and zero out the 'rest'; which is just wrong. > > This function must verify those bytes are zero, not make them zero. Right, in my head I was thinking "must have been zeroed" which isn't what it says. I'll switch to "zero". > > > > , otherwise -EFBIG is returned. > > > > > > 'Funny' that, copy_struct_from_user() below seems to use E2BIG. > > > > This is a copy of the semantics that sched_[sg]etattr(2) uses -- E2BIG for > > a "too big" struct passed to the kernel, and EFBIG for a "too big" > > struct passed to user-space. I would personally have preferred EMSGSIZE > > instead of EFBIG, but felt using the existing error codes would be less > > confusing. > > Sadly a recent commit: > > 1251201c0d34 ("sched/core: Fix uclamp ABI bug, clean up and robustify sched_read_attr() ABI logic and code") > > Made the situation even 'worse'. I hadn't seen this patch before, and I have a few questions taking a look at it: * An error code for a particular behaviour was changed (EFBIG -> E2BIG). Is this not a userspace breakage (I know Linus went ballistic about something similar a while ago[1]), or did I misunderstand what the issue was in [1]? * At the risk of bike-shedding -- of we are changing it, wouldn't -EMSGSIZE be more appropriate? To be fair, picking errno values has always been more of an art than a science, but to my ears "Argument list too long" doesn't make too much sense in the context of "returning" a struct back to userspace (and the cause of the error is that the argument passed by user space *isn't big enough*). If there was an E2SMALL that would also work. ;) * Do you want me to write a patch based on that, to switch it to copy_struct_to_user()? * That patch removes the "are there non-zero bytes in the tail that userspace won't know about" check (which I have included in mine). I understand that this caused issues specifically with sched_getattr(2) due to the default value not being zero -- how should we rectify that (given that we'd hopefully want to port everyone who uses that interface to copy_struct_{to,from}_user())? * Given that the [uk]attr->size construct is pretty important to the usability of the sched and perf interfaces, should we require (or encourage) it for all struct-extension syscall setups? > > > > + if (unlikely(!access_ok(src, usize))) > > > > + return -EFAULT; > > > > + > > > > + /* Deal with trailing bytes. */ > > > > + if (usize < ksize) > > > > + memset(dst + size, 0, rest); > > > > + else if (usize > ksize) { > > > > + const void __user *addr = src + size; > > > > + char buffer[BUFFER_SIZE] = {}; > > > > > > Isn't that too big for on-stack? > > > > Is a 64-byte buffer too big? I picked the number "at random" to be the > > size of a cache line, but I could shrink it down to 32 bytes if the size > > is an issue (I wanted to avoid needless allocations -- hence it being > > on-stack). > > Ah, my ctags gave me a definition of BUFFER_SIZE that was 512. I suppose > 64 should be OK. Good to know, though I'll rename it to avoid confusion. [1]: https://lore.kernel.org/lkml/CA+55aFy98A+LJK4+GWMcbzaa1zsPBRo76q+ioEjbx-uaMKH6Uw@mail.gmail.com/
On 2019-09-05, Christian Brauner <christian.brauner@ubuntu.com> wrote: > On Thu, Sep 05, 2019 at 06:19:22AM +1000, Aleksa Sarai wrote: > > A common pattern for syscall extensions is increasing the size of a > > struct passed from userspace, such that the zero-value of the new fields > > result in the old kernel behaviour (allowing for a mix of userspace and > > kernel vintages to operate on one another in most cases). This is done > > in both directions -- hence two helpers -- though it's more common to > > have to copy user space structs into kernel space. > > > > Previously there was no common lib/ function that implemented > > the necessary extension-checking semantics (and different syscalls > > implemented them slightly differently or incompletely[1]). A future > > patch replaces all of the common uses of this pattern to use the new > > copy_struct_{to,from}_user() helpers. > > > > [1]: For instance {sched_setattr,perf_event_open,clone3}(2) all do do > > similar checks to copy_struct_from_user() while rt_sigprocmask(2) > > always rejects differently-sized struct arguments. > > > > Suggested-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> > > Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> [...] > > + if (unlikely(!access_ok(src, usize))) > > + return -EFAULT; > > + > > + /* Deal with trailing bytes. */ > > + if (usize < ksize) > > + memset(dst + size, 0, rest); [...] > That's a change in behavior for clone3() and sched at least, no? Unless > - which I guess you might have done - you have moved the "error out when > the struct is too small" part before the call to copy_struct_from_user() > for them. Yes, I've put the minimum size check to the callers in all of the cases (in the case of clone3() I've #define'd a CLONE_ARGS_SIZE_VER0 to match the others -- see patch 2 of the series).
On 2019-09-05, Peter Zijlstra <peterz@infradead.org> wrote: > On Thu, Sep 05, 2019 at 07:26:22PM +1000, Aleksa Sarai wrote: > > On 2019-09-05, Peter Zijlstra <peterz@infradead.org> wrote: > > > On Thu, Sep 05, 2019 at 06:19:22AM +1000, Aleksa Sarai wrote: > > > > + > > > > + while (rest > 0) { > > > > + size_t bufsize = min(rest, sizeof(buffer)); > > > > + > > > > + if (__copy_from_user(buffer, addr, bufsize)) > > > > + return -EFAULT; > > > > + if (memchr_inv(buffer, 0, bufsize)) > > > > + return -E2BIG; > > > > + > > > > + addr += bufsize; > > > > + rest -= bufsize; > > > > + } > > > > > > The perf implementation uses get_user(); but if that is too slow, surely > > > we can do something with uaccess_try() here? > > > > Is there a non-x86-specific way to do that (unless I'm mistaken only x86 > > has uaccess_try() or the other *_try() wrappers)? The main "performance > > improvement" (if you can even call it that) is that we use memchr_inv() > > which finds non-matching characters more efficiently than just doing a > > loop. > > Oh, you're right, that's x86 only :/ Though, I just had an idea -- am I wrong to think that the following would work just as well (without the need for an intermediate buffer)? if (memchr_inv((const char __force *) src + size, 0, rest)) return -E2BIG; Or is this type of thing very much frowned upon? What if it was a separate memchr_inv_user() instead -- I feel as though there's not a strong argument for needing to use a buffer when we're single-passing the __user buffer and doing a basic boolean check.
On Thu, Sep 05, 2019 at 06:19:22AM +1000, Aleksa Sarai wrote: > +/* > + * "memset(p, 0, size)" but for user space buffers. Caller must have already > + * checked access_ok(p, size). > + */ > +static int __memzero_user(void __user *p, size_t s) > +{ > + const char zeros[BUFFER_SIZE] = {}; > + while (s > 0) { > + size_t n = min(s, sizeof(zeros)); > + > + if (__copy_to_user(p, zeros, n)) > + return -EFAULT; > + > + p += n; > + s -= n; > + } > + return 0; > +} That's called clear_user(). > +int copy_struct_to_user(void __user *dst, size_t usize, > + const void *src, size_t ksize) > +{ > + size_t size = min(ksize, usize); > + size_t rest = abs(ksize - usize); > + > + if (unlikely(usize > PAGE_SIZE)) > + return -EFAULT; Why? > + } else if (usize > ksize) { > + if (__memzero_user(dst + size, rest)) > + return -EFAULT; > + } > + /* Copy the interoperable parts of the struct. */ > + if (__copy_to_user(dst, src, size)) > + return -EFAULT; Why not simply clear_user() and copy_to_user()? > +int copy_struct_from_user(void *dst, size_t ksize, > + const void __user *src, size_t usize) > +{ > + size_t size = min(ksize, usize); > + size_t rest = abs(ksize - usize); Cute, but... you would be just as well without that 'rest' thing. > + > + if (unlikely(usize > PAGE_SIZE)) > + return -EFAULT; Again, why? > + if (unlikely(!access_ok(src, usize))) > + return -EFAULT; Why not simply copy_from_user() here? > + /* Deal with trailing bytes. */ > + if (usize < ksize) > + memset(dst + size, 0, rest); > + else if (usize > ksize) { > + const void __user *addr = src + size; > + char buffer[BUFFER_SIZE] = {}; > + > + while (rest > 0) { > + size_t bufsize = min(rest, sizeof(buffer)); > + > + if (__copy_from_user(buffer, addr, bufsize)) > + return -EFAULT; > + if (memchr_inv(buffer, 0, bufsize)) > + return -E2BIG; Frankly, that looks like a candidate for is_all_zeroes_user(). With the loop like above serving as a dumb default. And on badly alighed address it _will_ be dumb. Probably too much so - something like if ((unsigned long)addr & 1) { u8 v; if (get_user(v, (__u8 __user *)addr)) return -EFAULT; if (v) return -E2BIG; addr++; } if ((unsigned long)addr & 2) { u16 v; if (get_user(v, (__u16 __user *)addr)) return -EFAULT; if (v) return -E2BIG; addr +=2; } if ((unsigned long)addr & 4) { u32 v; if (get_user(v, (__u32 __user *)addr)) return -EFAULT; if (v) return -E2BIG; } <read the rest like you currently do> would be saner, and things like x86 could trivially add an asm variant - it's not hard. Incidentally, memchr_inv() is an overkill in this case...
On Thu, Sep 05, 2019 at 07:07:50PM +0100, Al Viro wrote: > On Thu, Sep 05, 2019 at 06:19:22AM +1000, Aleksa Sarai wrote: > > +/* > > + * "memset(p, 0, size)" but for user space buffers. Caller must have already > > + * checked access_ok(p, size). > > + */ > > +static int __memzero_user(void __user *p, size_t s) > > +{ > > + const char zeros[BUFFER_SIZE] = {}; > > + while (s > 0) { > > + size_t n = min(s, sizeof(zeros)); > > + > > + if (__copy_to_user(p, zeros, n)) > > + return -EFAULT; > > + > > + p += n; > > + s -= n; > > + } > > + return 0; > > +} > > That's called clear_user(). > > > +int copy_struct_to_user(void __user *dst, size_t usize, > > + const void *src, size_t ksize) > > +{ > > + size_t size = min(ksize, usize); > > + size_t rest = abs(ksize - usize); > > + > > + if (unlikely(usize > PAGE_SIZE)) > > + return -EFAULT; > > Why? Because every caller of that function right now has that limit set anyway iirc. So we can either remove it from here and place it back for the individual callers or leave it in the helper. Also, I'm really asking, why not? Is it unreasonable to have an upper bound on the size (for a long time probably) or are you disagreeing with PAGE_SIZE being used? PAGE_SIZE limit is currently used by sched, perf, bpf, and clone3 and in a few other places.
On Thu, Sep 05, 2019 at 08:23:03PM +0200, Christian Brauner wrote: > Because every caller of that function right now has that limit set > anyway iirc. So we can either remove it from here and place it back for > the individual callers or leave it in the helper. > Also, I'm really asking, why not? Is it unreasonable to have an upper > bound on the size (for a long time probably) or are you disagreeing with > PAGE_SIZE being used? PAGE_SIZE limit is currently used by sched, perf, > bpf, and clone3 and in a few other places. For a primitive that can be safely used with any size (OK, any within the usual 2Gb limit)? Why push the random policy into the place where it doesn't belong? Seriously, what's the point? If they want to have a large chunk of userland memory zeroed or checked for non-zeroes - why would that be a problem?
On Thu, Sep 05, 2019 at 07:28:01PM +0100, Al Viro wrote: > On Thu, Sep 05, 2019 at 08:23:03PM +0200, Christian Brauner wrote: > > > Because every caller of that function right now has that limit set > > anyway iirc. So we can either remove it from here and place it back for > > the individual callers or leave it in the helper. > > Also, I'm really asking, why not? Is it unreasonable to have an upper > > bound on the size (for a long time probably) or are you disagreeing with > > PAGE_SIZE being used? PAGE_SIZE limit is currently used by sched, perf, > > bpf, and clone3 and in a few other places. > > For a primitive that can be safely used with any size (OK, any within > the usual 2Gb limit)? Why push the random policy into the place where > it doesn't belong? Ah, the "not in the helper part" makes sense. As long as leave the check for the callers themselves. > > Seriously, what's the point? If they want to have a large chunk of > userland memory zeroed or checked for non-zeroes - why would that > be a problem?
On 2019-09-05, Al Viro <viro@zeniv.linux.org.uk> wrote: > On Thu, Sep 05, 2019 at 08:23:03PM +0200, Christian Brauner wrote: > > > Because every caller of that function right now has that limit set > > anyway iirc. So we can either remove it from here and place it back for > > the individual callers or leave it in the helper. > > Also, I'm really asking, why not? Is it unreasonable to have an upper > > bound on the size (for a long time probably) or are you disagreeing with > > PAGE_SIZE being used? PAGE_SIZE limit is currently used by sched, perf, > > bpf, and clone3 and in a few other places. > > For a primitive that can be safely used with any size (OK, any within > the usual 2Gb limit)? Why push the random policy into the place where > it doesn't belong? > > Seriously, what's the point? If they want to have a large chunk of > userland memory zeroed or checked for non-zeroes - why would that > be a problem? Thinking about it some more, there isn't really any r/w amplification -- so there isn't much to gain by passing giant structs. Though, if we are going to permit 2GB buffers, isn't that also an argument to use memchr_inv()? :P
On Fri, Sep 06, 2019 at 05:56:18AM +1000, Aleksa Sarai wrote: > On 2019-09-05, Al Viro <viro@zeniv.linux.org.uk> wrote: > > On Thu, Sep 05, 2019 at 08:23:03PM +0200, Christian Brauner wrote: > > > > > Because every caller of that function right now has that limit set > > > anyway iirc. So we can either remove it from here and place it back for > > > the individual callers or leave it in the helper. > > > Also, I'm really asking, why not? Is it unreasonable to have an upper > > > bound on the size (for a long time probably) or are you disagreeing with > > > PAGE_SIZE being used? PAGE_SIZE limit is currently used by sched, perf, > > > bpf, and clone3 and in a few other places. > > > > For a primitive that can be safely used with any size (OK, any within > > the usual 2Gb limit)? Why push the random policy into the place where > > it doesn't belong? > > > > Seriously, what's the point? If they want to have a large chunk of > > userland memory zeroed or checked for non-zeroes - why would that > > be a problem? > > Thinking about it some more, there isn't really any r/w amplification -- > so there isn't much to gain by passing giant structs. Though, if we are > going to permit 2GB buffers, isn't that also an argument to use > memchr_inv()? :P I'm not sure I understand the last bit. If you look at what copy_from_user() does on misaligned source/destination, especially on architectures that really, really do not like unaligned access... Case in point: alpha (and it's not unusual in that respect). What it boils down to is copy bytes until the destination is aligned if source and destination are both aligned copy word by word else read word by word, storing the mix of two adjacent words copy the rest byte by byte The unpleasant case (to and from having different remainders modulo 8) is basically if (count >= 8) { u64 *aligned = (u64 *)(from & ~7); u64 *dest = (u64 *)to; int bitshift = (from & 7) * 8; u64 prev, next; prev = aligned[0]; do { next = aligned[1]; prev <<= bitshift; prev |= next >> (64 - bitshift); *dest++ = prev; aligned++; prev = next; from += 8; to += 8; count -= 8; } while (count >= 8); } Now, mix that with "... and do memchr_inv() on the copy to find if we'd copied any non-zeroes, nevermind where" and it starts looking really ridiculous. We should just read the fscking source, aligned down to word boundary and check each word being read. The first and the last ones - masked. All there is to it. On almost all architectures that'll work well enough; s390 might want something more elaborate (there even word-by-word copies are costly, but I'd suggest talking to them for details). Something like bool all_zeroes_user(const void __user *p, size_t count) would probably be a sane API...
On 2019-09-05, Al Viro <viro@zeniv.linux.org.uk> wrote: > On Thu, Sep 05, 2019 at 06:19:22AM +1000, Aleksa Sarai wrote: > > +/* > > + * "memset(p, 0, size)" but for user space buffers. Caller must have already > > + * checked access_ok(p, size). > > + */ > > +static int __memzero_user(void __user *p, size_t s) > > +{ > > + const char zeros[BUFFER_SIZE] = {}; > > + while (s > 0) { > > + size_t n = min(s, sizeof(zeros)); > > + > > + if (__copy_to_user(p, zeros, n)) > > + return -EFAULT; > > + > > + p += n; > > + s -= n; > > + } > > + return 0; > > +} > > That's called clear_user(). Already switched, I didn't know about clear_user() -- I assumed it would've been called bzero_user() or memzero_user() and didn't find it when looking. > > +int copy_struct_to_user(void __user *dst, size_t usize, > > + const void *src, size_t ksize) > > +{ > > + size_t size = min(ksize, usize); > > + size_t rest = abs(ksize - usize); > > + > > + if (unlikely(usize > PAGE_SIZE)) > > + return -EFAULT; > > Why? > > > + } else if (usize > ksize) { > > + if (__memzero_user(dst + size, rest)) > > + return -EFAULT; > > + } > > + /* Copy the interoperable parts of the struct. */ > > + if (__copy_to_user(dst, src, size)) > > + return -EFAULT; > > Why not simply clear_user() and copy_to_user()? I'm not sure I understand what you mean -- are you asking why we need to do memchr_inv(src + size, 0, rest) earlier? > > > +int copy_struct_from_user(void *dst, size_t ksize, > > + const void __user *src, size_t usize) > > +{ > > + size_t size = min(ksize, usize); > > + size_t rest = abs(ksize - usize); > > Cute, but... you would be just as well without that 'rest' thing. I would argue it's harder to mess up using "rest" compared to getting "ksize - usize" and "usize - ksize" mixed up (and it's a bit more readable). > > + > > + if (unlikely(usize > PAGE_SIZE)) > > + return -EFAULT; > > Again, why? As discussed in a sister thread, I will leave this in the callers (though I would argue callers should always do some kind of sanity check like this). > > > + if (unlikely(!access_ok(src, usize))) > > + return -EFAULT; > > Why not simply copy_from_user() here? > > > + /* Deal with trailing bytes. */ > > + if (usize < ksize) > > + memset(dst + size, 0, rest); > > + else if (usize > ksize) { > > + const void __user *addr = src + size; > > + char buffer[BUFFER_SIZE] = {}; > > + > > + while (rest > 0) { > > + size_t bufsize = min(rest, sizeof(buffer)); > > + > > + if (__copy_from_user(buffer, addr, bufsize)) > > + return -EFAULT; > > + if (memchr_inv(buffer, 0, bufsize)) > > + return -E2BIG; > > Frankly, that looks like a candidate for is_all_zeroes_user(). > With the loop like above serving as a dumb default. And on > badly alighed address it _will_ be dumb. Probably too much > so - something like > if ((unsigned long)addr & 1) { > u8 v; > if (get_user(v, (__u8 __user *)addr)) > return -EFAULT; > if (v) > return -E2BIG; > addr++; > } > if ((unsigned long)addr & 2) { > u16 v; > if (get_user(v, (__u16 __user *)addr)) > return -EFAULT; > if (v) > return -E2BIG; > addr +=2; > } > if ((unsigned long)addr & 4) { > u32 v; > if (get_user(v, (__u32 __user *)addr)) > return -EFAULT; > if (v) > return -E2BIG; > } > <read the rest like you currently do> > would be saner, and things like x86 could trivially add an > asm variant - it's not hard. Incidentally, memchr_inv() is > an overkill in this case... Why is memchr_inv() overkill? But yes, breaking this out to an asm-generic is_all_zeroes_user() wouldn't hurt -- and I'll put a cleaned-up version of the alignment handling there too. Should I drop it in asm-generic/uaccess.h, or somewhere else?
On Fri, Sep 06, 2019 at 09:00:03AM +1000, Aleksa Sarai wrote: > > > + return -EFAULT; > > > + } > > > + /* Copy the interoperable parts of the struct. */ > > > + if (__copy_to_user(dst, src, size)) > > > + return -EFAULT; > > > > Why not simply clear_user() and copy_to_user()? > > I'm not sure I understand what you mean -- are you asking why we need to > do memchr_inv(src + size, 0, rest) earlier? I'm asking why bother with __ and separate access_ok(). > > if ((unsigned long)addr & 1) { > > u8 v; > > if (get_user(v, (__u8 __user *)addr)) > > return -EFAULT; > > if (v) > > return -E2BIG; > > addr++; > > } > > if ((unsigned long)addr & 2) { > > u16 v; > > if (get_user(v, (__u16 __user *)addr)) > > return -EFAULT; > > if (v) > > return -E2BIG; > > addr +=2; > > } > > if ((unsigned long)addr & 4) { > > u32 v; > > if (get_user(v, (__u32 __user *)addr)) > > return -EFAULT; > > if (v) > > return -E2BIG; > > } > > <read the rest like you currently do> Actually, this is a dumb way to do it - page size on anything is going to be a multiple of 8, so you could just as well read 8 bytes from an address aligned down. Then mask the bytes you don't want to check out and see if there's anything left. You can have readability boundaries inside a page - it's either the entire page (let alone a single word) being readable, or it's EFAULT for all parts. > > would be saner, and things like x86 could trivially add an > > asm variant - it's not hard. Incidentally, memchr_inv() is > > an overkill in this case... > > Why is memchr_inv() overkill? Look at its implementation; you only care if there are non-zeroes, you don't give a damn where in the buffer the first one would be. All you need is the same logics as in "from userland" case if (!count) return true; offset = (unsigned long)from & 7 p = (u64 *)(from - offset); v = *p++; if (offset) { // unaligned count += offset; v &= ~aligned_byte_mask(offset); // see strnlen_user.c } while (count > 8) { if (v) return false; v = *p++; count -= 8; } if (count != 8) v &= aligned_byte_mask(count); return v == 0; All there is to it...
On 2019-09-06, Al Viro <viro@zeniv.linux.org.uk> wrote: > On Fri, Sep 06, 2019 at 09:00:03AM +1000, Aleksa Sarai wrote: > > > > + return -EFAULT; > > > > + } > > > > + /* Copy the interoperable parts of the struct. */ > > > > + if (__copy_to_user(dst, src, size)) > > > > + return -EFAULT; > > > > > > Why not simply clear_user() and copy_to_user()? > > > > I'm not sure I understand what you mean -- are you asking why we need to > > do memchr_inv(src + size, 0, rest) earlier? > > I'm asking why bother with __ and separate access_ok(). Ah right, it was a dumb "optimisation" (since we need to do access_ok() anyway since we should early -EFAULT in that case). I've dropped the __ usages in my working copy. > > > if ((unsigned long)addr & 1) { > > > u8 v; > > > if (get_user(v, (__u8 __user *)addr)) > > > return -EFAULT; > > > if (v) > > > return -E2BIG; > > > addr++; > > > } > > > if ((unsigned long)addr & 2) { > > > u16 v; > > > if (get_user(v, (__u16 __user *)addr)) > > > return -EFAULT; > > > if (v) > > > return -E2BIG; > > > addr +=2; > > > } > > > if ((unsigned long)addr & 4) { > > > u32 v; > > > if (get_user(v, (__u32 __user *)addr)) > > > return -EFAULT; > > > if (v) > > > return -E2BIG; > > > } > > > <read the rest like you currently do> > > Actually, this is a dumb way to do it - page size on anything > is going to be a multiple of 8, so you could just as well > read 8 bytes from an address aligned down. Then mask the > bytes you don't want to check out and see if there's anything > left. > > You can have readability boundaries inside a page - it's either > the entire page (let alone a single word) being readable, or > it's EFAULT for all parts. > > > > would be saner, and things like x86 could trivially add an > > > asm variant - it's not hard. Incidentally, memchr_inv() is > > > an overkill in this case... > > > > Why is memchr_inv() overkill? > > Look at its implementation; you only care if there are > non-zeroes, you don't give a damn where in the buffer > the first one would be. All you need is the same logics > as in "from userland" case > if (!count) > return true; > offset = (unsigned long)from & 7 > p = (u64 *)(from - offset); > v = *p++; > if (offset) { // unaligned > count += offset; > v &= ~aligned_byte_mask(offset); // see strnlen_user.c > } > while (count > 8) { > if (v) > return false; > v = *p++; > count -= 8; > } > if (count != 8) > v &= aligned_byte_mask(count); > return v == 0; > > All there is to it... Alright, will do (for some reason I hadn't made the connection that memchr_inv() is doing effectively the same word-by-word comparison but also detecting where the first byte is).
On Fri, Sep 06, 2019 at 12:49:44AM +0100, Al Viro wrote: > On Fri, Sep 06, 2019 at 09:00:03AM +1000, Aleksa Sarai wrote: > > > > + return -EFAULT; > > > > + } > > > > + /* Copy the interoperable parts of the struct. */ > > > > + if (__copy_to_user(dst, src, size)) > > > > + return -EFAULT; > > > > > > Why not simply clear_user() and copy_to_user()? > > > > I'm not sure I understand what you mean -- are you asking why we need to > > do memchr_inv(src + size, 0, rest) earlier? > > I'm asking why bother with __ and separate access_ok(). > > > > if ((unsigned long)addr & 1) { > > > u8 v; > > > if (get_user(v, (__u8 __user *)addr)) > > > return -EFAULT; > > > if (v) > > > return -E2BIG; > > > addr++; > > > } > > > if ((unsigned long)addr & 2) { > > > u16 v; > > > if (get_user(v, (__u16 __user *)addr)) > > > return -EFAULT; > > > if (v) > > > return -E2BIG; > > > addr +=2; > > > } > > > if ((unsigned long)addr & 4) { > > > u32 v; > > > if (get_user(v, (__u32 __user *)addr)) > > > return -EFAULT; > > > if (v) > > > return -E2BIG; > > > } > > > <read the rest like you currently do> > > Actually, this is a dumb way to do it - page size on anything > is going to be a multiple of 8, so you could just as well > read 8 bytes from an address aligned down. Then mask the > bytes you don't want to check out and see if there's anything > left. > > You can have readability boundaries inside a page - it's either > the entire page (let alone a single word) being readable, or > it's EFAULT for all parts. > > > > would be saner, and things like x86 could trivially add an > > > asm variant - it's not hard. Incidentally, memchr_inv() is > > > an overkill in this case... > > > > Why is memchr_inv() overkill? > > Look at its implementation; you only care if there are > non-zeroes, you don't give a damn where in the buffer > the first one would be. All you need is the same logics > as in "from userland" case > if (!count) > return true; > offset = (unsigned long)from & 7 > p = (u64 *)(from - offset); > v = *p++; > if (offset) { // unaligned > count += offset; > v &= ~aligned_byte_mask(offset); // see strnlen_user.c > } > while (count > 8) { > if (v) > return false; > v = *p++; > count -= 8; > } > if (count != 8) > v &= aligned_byte_mask(count); > return v == 0; > > All there is to it... ... and __user case would be pretty much this with if (user_access_begin(from, count)) { .... user_access_end(); } wrapped around the damn thing - again, see strnlen_user.c, with unsafe_get_user(v, p++, efault); instead of those v = *p++; Calling conventions might need some thinking - it might be * all read, all zeroes * non-zero found * read failed so we probably want to map the "all zeroes" case to 0, "read failed" to -EFAULT and "non-zero found" to something else. Might be positive, might be some other -E.... - not sure if E2BIG (or EFBIG) makes much sense here. Need to look at the users...
On Fri, Sep 06, 2019 at 05:56:18AM +1000, Aleksa Sarai wrote: > On 2019-09-05, Al Viro <viro@zeniv.linux.org.uk> wrote: > > On Thu, Sep 05, 2019 at 08:23:03PM +0200, Christian Brauner wrote: > > > > > Because every caller of that function right now has that limit set > > > anyway iirc. So we can either remove it from here and place it back for > > > the individual callers or leave it in the helper. > > > Also, I'm really asking, why not? Is it unreasonable to have an upper > > > bound on the size (for a long time probably) or are you disagreeing with > > > PAGE_SIZE being used? PAGE_SIZE limit is currently used by sched, perf, > > > bpf, and clone3 and in a few other places. > > > > For a primitive that can be safely used with any size (OK, any within > > the usual 2Gb limit)? Why push the random policy into the place where > > it doesn't belong? > > > > Seriously, what's the point? If they want to have a large chunk of > > userland memory zeroed or checked for non-zeroes - why would that > > be a problem? > > Thinking about it some more, there isn't really any r/w amplification -- > so there isn't much to gain by passing giant structs. Though, if we are > going to permit 2GB buffers, isn't that also an argument to use > memchr_inv()? :P I think we should just do a really dumb, easy to understand minimal thing for now. It could even just be what every caller is doing right now anyway with the get_user() loop. The only way to settle memchr_inv() vs all the other clever ways suggested here is to benchmark it and see if it matters *for the current users* of this helper. If it does, great we can do it. If it doesn't why bother having that argument right now? Once we somehow end up in a possible world where we apparently have decided it's a great idea to copy 2GB argument structures for a syscall into or from the kernel we can start optimizing the hell out of this. Before that and especially with current callers I honestly doubt it matters whether we use memchr_inv() or while() {get_user()} loops. Christian
On 2019-09-05, Peter Zijlstra <peterz@infradead.org> wrote: > On Thu, Sep 05, 2019 at 11:43:05AM +0200, Peter Zijlstra wrote: > > On Thu, Sep 05, 2019 at 07:26:22PM +1000, Aleksa Sarai wrote: > > > On 2019-09-05, Peter Zijlstra <peterz@infradead.org> wrote: > > > > On Thu, Sep 05, 2019 at 06:19:22AM +1000, Aleksa Sarai wrote: > > > > > +/** > > > > > + * copy_struct_to_user: copy a struct to user space > > > > > + * @dst: Destination address, in user space. > > > > > + * @usize: Size of @dst struct. > > > > > + * @src: Source address, in kernel space. > > > > > + * @ksize: Size of @src struct. > > > > > + * > > > > > + * Copies a struct from kernel space to user space, in a way that guarantees > > > > > + * backwards-compatibility for struct syscall arguments (as long as future > > > > > + * struct extensions are made such that all new fields are *appended* to the > > > > > + * old struct, and zeroed-out new fields have the same meaning as the old > > > > > + * struct). > > > > > + * > > > > > + * @ksize is just sizeof(*dst), and @usize should've been passed by user space. > > > > > + * The recommended usage is something like the following: > > > > > + * > > > > > + * SYSCALL_DEFINE2(foobar, struct foo __user *, uarg, size_t, usize) > > > > > + * { > > > > > + * int err; > > > > > + * struct foo karg = {}; > > > > > + * > > > > > + * // do something with karg > > > > > + * > > > > > + * err = copy_struct_to_user(uarg, usize, &karg, sizeof(karg)); > > > > > + * if (err) > > > > > + * return err; > > > > > + * > > > > > + * // ... > > > > > + * } > > > > > + * > > > > > + * There are three cases to consider: > > > > > + * * If @usize == @ksize, then it's copied verbatim. > > > > > + * * If @usize < @ksize, then kernel space is "returning" a newer struct to an > > > > > + * older user space. In order to avoid user space getting incomplete > > > > > + * information (new fields might be important), all trailing bytes in @src > > > > > + * (@ksize - @usize) must be zerored > > > > > > > > s/zerored/zero/, right? > > > > > > It should've been "zeroed". > > > > That reads wrong to me; that way it reads like this function must take > > that action and zero out the 'rest'; which is just wrong. > > > > This function must verify those bytes are zero, not make them zero. > > > > > > > , otherwise -EFBIG is returned. > > > > > > > > 'Funny' that, copy_struct_from_user() below seems to use E2BIG. > > > > > > This is a copy of the semantics that sched_[sg]etattr(2) uses -- E2BIG for > > > a "too big" struct passed to the kernel, and EFBIG for a "too big" > > > struct passed to user-space. I would personally have preferred EMSGSIZE > > > instead of EFBIG, but felt using the existing error codes would be less > > > confusing. > > > > Sadly a recent commit: > > > > 1251201c0d34 ("sched/core: Fix uclamp ABI bug, clean up and robustify sched_read_attr() ABI logic and code") > > > > Made the situation even 'worse'. > > And thinking more about things; I'm not convinced the above patch is > actually right. > > Do we really want to simply truncate all the attributes of the task? > > And should we not at least set sched_flags when there are non-default > clamp values applied? > > See; that is I think the primary bug that had chrt failing; we tried to > publish the default clamp values as !0. I just saw this patch in -rc8 -- should I even attempt to port sched_getattr(2) to copy_struct_to_user()? I agree that publishing a default non-zero value is a mistake -- once you do that, old user space will either get confused or lose information.
diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h index 34a038563d97..0ad9544a1aee 100644 --- a/include/linux/uaccess.h +++ b/include/linux/uaccess.h @@ -230,6 +230,11 @@ static inline unsigned long __copy_from_user_inatomic_nocache(void *to, #endif /* ARCH_HAS_NOCACHE_UACCESS */ +extern int copy_struct_to_user(void __user *dst, size_t usize, + const void *src, size_t ksize); +extern int copy_struct_from_user(void *dst, size_t ksize, + const void __user *src, size_t usize); + /* * probe_kernel_read(): safely attempt to read from a location * @dst: pointer to the buffer that shall take the data diff --git a/lib/Makefile b/lib/Makefile index 29c02a924973..d86c71feaf0a 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -28,7 +28,7 @@ endif CFLAGS_string.o := $(call cc-option, -fno-stack-protector) endif -lib-y := ctype.o string.o vsprintf.o cmdline.o \ +lib-y := ctype.o string.o struct_user.o vsprintf.o cmdline.o \ rbtree.o radix-tree.o timerqueue.o xarray.o \ idr.o extable.o \ sha1.o chacha.o irq_regs.o argv_split.o \ diff --git a/lib/struct_user.c b/lib/struct_user.c new file mode 100644 index 000000000000..7301ab1bbe98 --- /dev/null +++ b/lib/struct_user.c @@ -0,0 +1,182 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright (C) 2019 SUSE LLC + * Copyright (C) 2019 Aleksa Sarai <cyphar@cyphar.com> + */ + +#include <linux/types.h> +#include <linux/export.h> +#include <linux/uaccess.h> +#include <linux/kernel.h> +#include <linux/string.h> + +#define BUFFER_SIZE 64 + +/* + * "memset(p, 0, size)" but for user space buffers. Caller must have already + * checked access_ok(p, size). + */ +static int __memzero_user(void __user *p, size_t s) +{ + const char zeros[BUFFER_SIZE] = {}; + while (s > 0) { + size_t n = min(s, sizeof(zeros)); + + if (__copy_to_user(p, zeros, n)) + return -EFAULT; + + p += n; + s -= n; + } + return 0; +} + +/** + * copy_struct_to_user: copy a struct to user space + * @dst: Destination address, in user space. + * @usize: Size of @dst struct. + * @src: Source address, in kernel space. + * @ksize: Size of @src struct. + * + * Copies a struct from kernel space to user space, in a way that guarantees + * backwards-compatibility for struct syscall arguments (as long as future + * struct extensions are made such that all new fields are *appended* to the + * old struct, and zeroed-out new fields have the same meaning as the old + * struct). + * + * @ksize is just sizeof(*dst), and @usize should've been passed by user space. + * The recommended usage is something like the following: + * + * SYSCALL_DEFINE2(foobar, struct foo __user *, uarg, size_t, usize) + * { + * int err; + * struct foo karg = {}; + * + * // do something with karg + * + * err = copy_struct_to_user(uarg, usize, &karg, sizeof(karg)); + * if (err) + * return err; + * + * // ... + * } + * + * There are three cases to consider: + * * If @usize == @ksize, then it's copied verbatim. + * * If @usize < @ksize, then kernel space is "returning" a newer struct to an + * older user space. In order to avoid user space getting incomplete + * information (new fields might be important), all trailing bytes in @src + * (@ksize - @usize) must be zerored, otherwise -EFBIG is returned. + * * If @usize > @ksize, then the kernel is "returning" an older struct to a + * newer user space. The trailing bytes in @dst (@usize - @ksize) will be + * zero-filled. + * + * Returns (in all cases, some data may have been copied): + * * -EFBIG: (@usize < @ksize) and there are non-zero trailing bytes in @src. + * * -EFAULT: access to user space failed. + */ +int copy_struct_to_user(void __user *dst, size_t usize, + const void *src, size_t ksize) +{ + size_t size = min(ksize, usize); + size_t rest = abs(ksize - usize); + + if (unlikely(usize > PAGE_SIZE)) + return -EFAULT; + if (unlikely(!access_ok(dst, usize))) + return -EFAULT; + + /* Deal with trailing bytes. */ + if (usize < ksize) { + if (memchr_inv(src + size, 0, rest)) + return -EFBIG; + } else if (usize > ksize) { + if (__memzero_user(dst + size, rest)) + return -EFAULT; + } + /* Copy the interoperable parts of the struct. */ + if (__copy_to_user(dst, src, size)) + return -EFAULT; + return 0; +} +EXPORT_SYMBOL(copy_struct_to_user); + +/** + * copy_struct_from_user: copy a struct from user space + * @dst: Destination address, in kernel space. This buffer must be @ksize + * bytes long. + * @ksize: Size of @dst struct. + * @src: Source address, in user space. + * @usize: (Alleged) size of @src struct. + * + * Copies a struct from user space to kernel space, in a way that guarantees + * backwards-compatibility for struct syscall arguments (as long as future + * struct extensions are made such that all new fields are *appended* to the + * old struct, and zeroed-out new fields have the same meaning as the old + * struct). + * + * @ksize is just sizeof(*dst), and @usize should've been passed by user space. + * The recommended usage is something like the following: + * + * SYSCALL_DEFINE2(foobar, const struct foo __user *, uarg, size_t, usize) + * { + * int err; + * struct foo karg = {}; + * + * err = copy_struct_from_user(&karg, sizeof(karg), uarg, size); + * if (err) + * return err; + * + * // ... + * } + * + * There are three cases to consider: + * * If @usize == @ksize, then it's copied verbatim. + * * If @usize < @ksize, then the user space has passed an old struct to a + * newer kernel. The rest of the trailing bytes in @dst (@ksize - @usize) + * are to be zero-filled. + * * If @usize > @ksize, then the user space has passed a new struct to an + * older kernel. The trailing bytes unknown to the kernel (@usize - @ksize) + * are checked to ensure they are zeroed, otherwise -E2BIG is returned. + * + * Returns (in all cases, some data may have been copied): + * * -E2BIG: (@usize > @ksize) and there are non-zero trailing bytes in @src. + * * -E2BIG: @usize is "too big" (at time of writing, >PAGE_SIZE). + * * -EFAULT: access to user space failed. + */ +int copy_struct_from_user(void *dst, size_t ksize, + const void __user *src, size_t usize) +{ + size_t size = min(ksize, usize); + size_t rest = abs(ksize - usize); + + if (unlikely(usize > PAGE_SIZE)) + return -EFAULT; + if (unlikely(!access_ok(src, usize))) + return -EFAULT; + + /* Deal with trailing bytes. */ + if (usize < ksize) + memset(dst + size, 0, rest); + else if (usize > ksize) { + const void __user *addr = src + size; + char buffer[BUFFER_SIZE] = {}; + + while (rest > 0) { + size_t bufsize = min(rest, sizeof(buffer)); + + if (__copy_from_user(buffer, addr, bufsize)) + return -EFAULT; + if (memchr_inv(buffer, 0, bufsize)) + return -E2BIG; + + addr += bufsize; + rest -= bufsize; + } + } + /* Copy the interoperable parts of the struct. */ + if (__copy_from_user(dst, src, size)) + return -EFAULT; + return 0; +} +EXPORT_SYMBOL(copy_struct_from_user);
A common pattern for syscall extensions is increasing the size of a struct passed from userspace, such that the zero-value of the new fields result in the old kernel behaviour (allowing for a mix of userspace and kernel vintages to operate on one another in most cases). This is done in both directions -- hence two helpers -- though it's more common to have to copy user space structs into kernel space. Previously there was no common lib/ function that implemented the necessary extension-checking semantics (and different syscalls implemented them slightly differently or incompletely[1]). A future patch replaces all of the common uses of this pattern to use the new copy_struct_{to,from}_user() helpers. [1]: For instance {sched_setattr,perf_event_open,clone3}(2) all do do similar checks to copy_struct_from_user() while rt_sigprocmask(2) always rejects differently-sized struct arguments. Suggested-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> --- include/linux/uaccess.h | 5 ++ lib/Makefile | 2 +- lib/struct_user.c | 182 ++++++++++++++++++++++++++++++++++++++++ 3 files changed, 188 insertions(+), 1 deletion(-) create mode 100644 lib/struct_user.c