Message ID:   20221119120929.2963813-2-Jason@zx2c4.com (mailing list archive)
State:        Not Applicable
Delegated to: Herbert Xu
Series:       implement getrandom() in vDSO
On Sat, Nov 19, 2022 at 01:09:27PM +0100, Jason A. Donenfeld wrote:
> +SYSCALL_DEFINE3(vgetrandom_alloc, unsigned long __user *, num,
> +		unsigned long __user *, size_per_each, unsigned int, flags)
> +{
> +	unsigned long alloc_size;
> +	unsigned long num_states;
> +	unsigned long pages_addr;
> +	int ret;
> +
> +	if (flags)
> +		return -EINVAL;
> +
> +	if (get_user(num_states, num))
> +		return -EFAULT;
> +
> +	alloc_size = size_mul(num_states, sizeof(struct vgetrandom_state));
> +	if (alloc_size == SIZE_MAX)
> +		return -EOVERFLOW;
> +	alloc_size = roundup(alloc_size, PAGE_SIZE);

Small detail: the roundup to PAGE_SIZE can make alloc_size overflow to 0.

Also, 'roundup(alloc_size, PAGE_SIZE)' could be 'PAGE_ALIGN(alloc_size)'.

> +	pages_addr = vm_mmap(NULL, 0, alloc_size, PROT_READ | PROT_WRITE,
> +			     MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKED, 0);
> +	if (IS_ERR_VALUE(pages_addr))
> +		return pages_addr;

This will only succeed if the userspace process has permission to mlock
pages, i.e. if there is space available in RLIMIT_MEMLOCK or if the process
has CAP_IPC_LOCK. I suppose this is working as intended, as this syscall can
be used to try to allocate and mlock arbitrary amounts of memory.

I wonder if this permission check will cause problems. Maybe there could be
a way to relax it for just one page per task? I don't know how that would
work, though, especially when the planned usage involves userspace
allocating a single pool of these contexts per process that get handed out
to threads.

- Eric
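
For concreteness, here is a minimal user-space sketch of the overflow Eric
describes, with roundup() reimplemented the way the kernel macro computes
and PAGE_SIZE assumed to be 4096 (illustrative only, not kernel code):

	#include <stdint.h>
	#include <stdio.h>

	#define PAGE_SIZE 4096UL
	/* arithmetically equivalent to the kernel's roundup() macro */
	#define roundup(x, y) ((((x) + ((y) - 1)) / (y)) * (y))

	int main(void)
	{
		/* Passes the 'alloc_size == SIZE_MAX' check above... */
		unsigned long alloc_size = SIZE_MAX - 1;

		/* ...but x + (PAGE_SIZE - 1) wraps, so the result is 0. */
		printf("%lu\n", roundup(alloc_size, PAGE_SIZE));
		return 0;
	}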
Hi Eric,

On Sat, Nov 19, 2022 at 12:39:26PM -0800, Eric Biggers wrote:
> On Sat, Nov 19, 2022 at 01:09:27PM +0100, Jason A. Donenfeld wrote:
> > +SYSCALL_DEFINE3(vgetrandom_alloc, unsigned long __user *, num,
> > +		unsigned long __user *, size_per_each, unsigned int, flags)
> > +{
> > +	unsigned long alloc_size;
> > +	unsigned long num_states;
> > +	unsigned long pages_addr;
> > +	int ret;
> > +
> > +	if (flags)
> > +		return -EINVAL;
> > +
> > +	if (get_user(num_states, num))
> > +		return -EFAULT;
> > +
> > +	alloc_size = size_mul(num_states, sizeof(struct vgetrandom_state));
> > +	if (alloc_size == SIZE_MAX)
> > +		return -EOVERFLOW;
> > +	alloc_size = roundup(alloc_size, PAGE_SIZE);
>
> Small detail: the roundup to PAGE_SIZE can make alloc_size overflow to 0.
>
> Also, 'roundup(alloc_size, PAGE_SIZE)' could be 'PAGE_ALIGN(alloc_size)'.

Good catch, thanks. So perhaps this?

	alloc_size = size_mul(num_states, sizeof(struct vgetrandom_state));
	if (alloc_size > SIZE_MAX - PAGE_SIZE + 1)
		return -EOVERFLOW;
	alloc_size = PAGE_ALIGN(alloc_size);

Does that look right?

> > +	pages_addr = vm_mmap(NULL, 0, alloc_size, PROT_READ | PROT_WRITE,
> > +			     MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKED, 0);
> > +	if (IS_ERR_VALUE(pages_addr))
> > +		return pages_addr;
>
> This will only succeed if the userspace process has permission to mlock
> pages, i.e. if there is space available in RLIMIT_MEMLOCK or if the process
> has CAP_IPC_LOCK. I suppose this is working as intended, as this syscall
> can be used to try to allocate and mlock arbitrary amounts of memory.
>
> I wonder if this permission check will cause problems. Maybe there could be
> a way to relax it for just one page per task? I don't know how that would
> work, though, especially when the planned usage involves userspace
> allocating a single pool of these contexts per process that get handed out
> to threads.

Probably, though, we don't want to create a mlock backdoor, right? I suppose
if a user is above RLIMIT_MEMLOCK, it'll just fall back to the slow path,
which still works. That seems like an okay enough circumstance.

Jason
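
To illustrate the fallback case under discussion, a small stand-alone sketch
(illustrative only): once RLIMIT_MEMLOCK is exhausted and the process lacks
CAP_IPC_LOCK, a MAP_LOCKED mapping is refused, which is the situation where
a libc would fall back to the plain getrandom() slow path.

	#include <stdio.h>
	#include <sys/mman.h>
	#include <sys/resource.h>

	int main(void)
	{
		/* Simulate an exhausted limit for an unprivileged process. */
		struct rlimit rl = { 0, 0 };
		setrlimit(RLIMIT_MEMLOCK, &rl);

		void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKED, -1, 0);
		if (p == MAP_FAILED)
			perror("mmap(MAP_LOCKED)");	/* fails once over the limit */
		return 0;
	}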
On Sun, Nov 20, 2022 at 12:59:07AM +0100, Jason A. Donenfeld wrote:
> Hi Eric,
>
> On Sat, Nov 19, 2022 at 12:39:26PM -0800, Eric Biggers wrote:
> > On Sat, Nov 19, 2022 at 01:09:27PM +0100, Jason A. Donenfeld wrote:
> > > +SYSCALL_DEFINE3(vgetrandom_alloc, unsigned long __user *, num,
> > > +		unsigned long __user *, size_per_each, unsigned int, flags)
> > > +{
> > > +	unsigned long alloc_size;
> > > +	unsigned long num_states;
> > > +	unsigned long pages_addr;
> > > +	int ret;
> > > +
> > > +	if (flags)
> > > +		return -EINVAL;
> > > +
> > > +	if (get_user(num_states, num))
> > > +		return -EFAULT;
> > > +
> > > +	alloc_size = size_mul(num_states, sizeof(struct vgetrandom_state));
> > > +	if (alloc_size == SIZE_MAX)
> > > +		return -EOVERFLOW;
> > > +	alloc_size = roundup(alloc_size, PAGE_SIZE);
> >
> > Small detail: the roundup to PAGE_SIZE can make alloc_size overflow to 0.
> >
> > Also, 'roundup(alloc_size, PAGE_SIZE)' could be 'PAGE_ALIGN(alloc_size)'.
>
> Good catch, thanks. So perhaps this?
>
>	alloc_size = size_mul(num_states, sizeof(struct vgetrandom_state));
>	if (alloc_size > SIZE_MAX - PAGE_SIZE + 1)
>		return -EOVERFLOW;
>	alloc_size = PAGE_ALIGN(alloc_size);
>
> Does that look right?

Yes. Maybe use 'SIZE_MAX & PAGE_MASK'?

Another alternative is the following:

	if (num_states >
	    (SIZE_MAX & PAGE_MASK) / sizeof(struct vgetrandom_state))
		return -EOVERFLOW;
	alloc_size = PAGE_ALIGN(num_states * sizeof(struct vgetrandom_state));

- Eric
On Sat, Nov 19, 2022 at 05:40:04PM -0800, Eric Biggers wrote:
> On Sun, Nov 20, 2022 at 12:59:07AM +0100, Jason A. Donenfeld wrote:
> > Hi Eric,
> >
> > On Sat, Nov 19, 2022 at 12:39:26PM -0800, Eric Biggers wrote:
> > > On Sat, Nov 19, 2022 at 01:09:27PM +0100, Jason A. Donenfeld wrote:
> > > > +SYSCALL_DEFINE3(vgetrandom_alloc, unsigned long __user *, num,
> > > > +		unsigned long __user *, size_per_each, unsigned int, flags)
> > > > +{
> > > > +	unsigned long alloc_size;
> > > > +	unsigned long num_states;
> > > > +	unsigned long pages_addr;
> > > > +	int ret;
> > > > +
> > > > +	if (flags)
> > > > +		return -EINVAL;
> > > > +
> > > > +	if (get_user(num_states, num))
> > > > +		return -EFAULT;
> > > > +
> > > > +	alloc_size = size_mul(num_states, sizeof(struct vgetrandom_state));
> > > > +	if (alloc_size == SIZE_MAX)
> > > > +		return -EOVERFLOW;
> > > > +	alloc_size = roundup(alloc_size, PAGE_SIZE);
> > >
> > > Small detail: the roundup to PAGE_SIZE can make alloc_size overflow to 0.
> > >
> > > Also, 'roundup(alloc_size, PAGE_SIZE)' could be 'PAGE_ALIGN(alloc_size)'.
> >
> > Good catch, thanks. So perhaps this?
> >
> >	alloc_size = size_mul(num_states, sizeof(struct vgetrandom_state));
> >	if (alloc_size > SIZE_MAX - PAGE_SIZE + 1)
> >		return -EOVERFLOW;
> >	alloc_size = PAGE_ALIGN(alloc_size);
> >
> > Does that look right?
>
> Yes. Maybe use 'SIZE_MAX & PAGE_MASK'?
>
> Another alternative is the following:
>
>	if (num_states >
>	    (SIZE_MAX & PAGE_MASK) / sizeof(struct vgetrandom_state))
>		return -EOVERFLOW;
>	alloc_size = PAGE_ALIGN(num_states * sizeof(struct vgetrandom_state));

Thanks, that's much nicer.

Jason
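
As a quick sanity check on that bound, a stand-alone sketch (illustrative
only: PAGE_SIZE is assumed to be 4096 and a stand-in value is used for
sizeof(struct vgetrandom_state)). The largest num_states that passes the
check still multiplies and page-aligns without wrapping to 0:

	#include <assert.h>
	#include <stdint.h>

	#define PAGE_SIZE	4096UL
	#define PAGE_MASK	(~(PAGE_SIZE - 1))
	#define PAGE_ALIGN(x)	(((x) + PAGE_SIZE - 1) & PAGE_MASK)
	#define STATE_SIZE	256UL	/* stand-in for the real state size */

	int main(void)
	{
		/* Largest value that passes 'num_states > ... / STATE_SIZE'. */
		unsigned long num_states = (SIZE_MAX & PAGE_MASK) / STATE_SIZE;
		unsigned long bytes = num_states * STATE_SIZE;

		/* The product stays within the largest page-aligned size... */
		assert(bytes <= (SIZE_MAX & PAGE_MASK));
		/* ...so aligning it up cannot overflow to 0. */
		assert(PAGE_ALIGN(bytes) >= bytes && PAGE_ALIGN(bytes) != 0);
		return 0;
	}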
diff --git a/MAINTAINERS b/MAINTAINERS
index 256f03904987..843dd6a49538 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -17287,6 +17287,7 @@ T:	git https://git.kernel.org/pub/scm/linux/kernel/git/crng/random.git
 S:	Maintained
 F:	drivers/char/random.c
 F:	drivers/virt/vmgenid.c
+F:	lib/vdso/getrandom.h
 
 RAPIDIO SUBSYSTEM
 M:	Matt Porter <mporter@kernel.crashing.org>
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 320480a8db4f..ea0fbc2ded5e 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -455,3 +455,4 @@
 448	i386	process_mrelease	sys_process_mrelease
 449	i386	futex_waitv		sys_futex_waitv
 450	i386	set_mempolicy_home_node		sys_set_mempolicy_home_node
+451	i386	vgetrandom_alloc	sys_vgetrandom_alloc
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index c84d12608cd2..0186f173f0e8 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -372,6 +372,7 @@
 448	common	process_mrelease	sys_process_mrelease
 449	common	futex_waitv		sys_futex_waitv
 450	common	set_mempolicy_home_node	sys_set_mempolicy_home_node
+451	common	vgetrandom_alloc	sys_vgetrandom_alloc
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/arch/x86/include/asm/unistd.h b/arch/x86/include/asm/unistd.h
index 761173ccc33c..f9673293cd95 100644
--- a/arch/x86/include/asm/unistd.h
+++ b/arch/x86/include/asm/unistd.h
@@ -57,5 +57,6 @@
 # define __ARCH_WANT_SYS_VFORK
 # define __ARCH_WANT_SYS_CLONE
 # define __ARCH_WANT_SYS_CLONE3
+# define __ARCH_WANT_VGETRANDOM_ALLOC
 
 #endif /* _ASM_X86_UNISTD_H */
diff --git a/drivers/char/random.c b/drivers/char/random.c
index 65ee69896967..ab6e02efa432 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -8,6 +8,7 @@
  * into roughly six sections, each with a section header:
  *
  * - Initialization and readiness waiting.
+ * - vDSO support helpers.
  * - Fast key erasure RNG, the "crng".
  * - Entropy accumulation and extraction routines.
  * - Entropy collection routines.
@@ -39,6 +40,7 @@
 #include <linux/blkdev.h>
 #include <linux/interrupt.h>
 #include <linux/mm.h>
+#include <linux/mman.h>
 #include <linux/nodemask.h>
 #include <linux/spinlock.h>
 #include <linux/kthread.h>
@@ -59,6 +61,7 @@
 #include <asm/irq.h>
 #include <asm/irq_regs.h>
 #include <asm/io.h>
+#include "../../lib/vdso/getrandom.h"
 
 /*********************************************************************
  *
@@ -146,6 +149,62 @@ EXPORT_SYMBOL(wait_for_random_bytes);
 		__func__, (void *)_RET_IP_, crng_init)
 
 
+
+/********************************************************************
+ *
+ * vDSO support helpers.
+ *
+ * The actual vDSO function is defined over in lib/vdso/getrandom.c,
+ * but this section contains the kernel-mode helpers to support that.
+ *
+ ********************************************************************/
+
+#ifdef __ARCH_WANT_VGETRANDOM_ALLOC
+/*
+ * The vgetrandom() function in userspace requires an opaque state, which this
+ * function provides to userspace. The result is that it maps a certain
+ * number of special pages into the calling process and returns the address.
+ */
+SYSCALL_DEFINE3(vgetrandom_alloc, unsigned long __user *, num,
+		unsigned long __user *, size_per_each, unsigned int, flags)
+{
+	unsigned long alloc_size;
+	unsigned long num_states;
+	unsigned long pages_addr;
+	int ret;
+
+	if (flags)
+		return -EINVAL;
+
+	if (get_user(num_states, num))
+		return -EFAULT;
+
+	alloc_size = size_mul(num_states, sizeof(struct vgetrandom_state));
+	if (alloc_size == SIZE_MAX)
+		return -EOVERFLOW;
+	alloc_size = roundup(alloc_size, PAGE_SIZE);
+
+	if (put_user(alloc_size / sizeof(struct vgetrandom_state), num) ||
+	    put_user(sizeof(struct vgetrandom_state), size_per_each))
+		return -EFAULT;
+
+	pages_addr = vm_mmap(NULL, 0, alloc_size, PROT_READ | PROT_WRITE,
+			     MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKED, 0);
+	if (IS_ERR_VALUE(pages_addr))
+		return pages_addr;
+
+	ret = do_madvise(current->mm, pages_addr, alloc_size, MADV_WIPEONFORK);
+	if (ret < 0)
+		goto err_unmap;
+
+	return pages_addr;
+
+err_unmap:
+	vm_munmap(pages_addr, alloc_size);
+	return ret;
+}
+#endif
+
 /*********************************************************************
  *
  * Fast key erasure RNG, the "crng".
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 45fa180cc56a..77b6debe7e18 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -886,8 +886,13 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
 #define __NR_set_mempolicy_home_node 450
 __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)
 
+#ifdef __ARCH_WANT_VGETRANDOM_ALLOC
+#define __NR_vgetrandom_alloc 451
+__SYSCALL(__NR_vgetrandom_alloc, sys_vgetrandom_alloc)
+#endif
+
 #undef __NR_syscalls
-#define __NR_syscalls 451
+#define __NR_syscalls 452
 
 /*
  * 32 bit systems traditionally used different
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 860b2dcf3ac4..f28196cb919b 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -360,6 +360,9 @@ COND_SYSCALL(pkey_free);
 /* memfd_secret */
 COND_SYSCALL(memfd_secret);
 
+/* random */
+COND_SYSCALL(vgetrandom_alloc);
+
 /*
  * Architecture specific weak syscall entries.
  */
diff --git a/lib/vdso/getrandom.h b/lib/vdso/getrandom.h
new file mode 100644
index 000000000000..85c2f62c0f5f
--- /dev/null
+++ b/lib/vdso/getrandom.h
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2022 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ */
+
+#ifndef _VDSO_LIB_GETRANDOM_H
+#define _VDSO_LIB_GETRANDOM_H
+
+#include <crypto/chacha.h>
+
+struct vgetrandom_state {
+	unsigned long generation;
+	union {
+		struct {
+			u8 key[CHACHA_KEY_SIZE];
+			u8 batch[CHACHA_BLOCK_SIZE * 3 / 2];
+		};
+		u8 key_batch[CHACHA_BLOCK_SIZE * 2];
+	};
+	u8 pos;
+};
+
+#endif /* _VDSO_LIB_GETRANDOM_H */
diff --git a/scripts/checksyscalls.sh b/scripts/checksyscalls.sh
index f33e61aca93d..7f7928c6487f 100755
--- a/scripts/checksyscalls.sh
+++ b/scripts/checksyscalls.sh
@@ -44,6 +44,10 @@ cat << EOF
 #define __IGNORE_memfd_secret
 #endif
 
+#ifndef __ARCH_WANT_VGETRANDOM_ALLOC
+#define __IGNORE_vgetrandom_alloc
+#endif
+
 /* Missing flags argument */
 #define __IGNORE_renameat	/* renameat2 */
 
diff --git a/tools/include/uapi/asm-generic/unistd.h b/tools/include/uapi/asm-generic/unistd.h
index 45fa180cc56a..77b6debe7e18 100644
--- a/tools/include/uapi/asm-generic/unistd.h
+++ b/tools/include/uapi/asm-generic/unistd.h
@@ -886,8 +886,13 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
 #define __NR_set_mempolicy_home_node 450
 __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)
 
+#ifdef __ARCH_WANT_VGETRANDOM_ALLOC
+#define __NR_vgetrandom_alloc 451
+__SYSCALL(__NR_vgetrandom_alloc, sys_vgetrandom_alloc)
+#endif
+
 #undef __NR_syscalls
-#define __NR_syscalls 451
+#define __NR_syscalls 452
 
 /*
  * 32 bit systems traditionally used different
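
The do_madvise(..., MADV_WIPEONFORK) call in the patch above is what
guarantees that a forked child can never observe or reuse its parent's
vgetrandom_state contents. As a small user-space illustration of that
semantic (assuming Linux >= 4.14, where MADV_WIPEONFORK was introduced):

	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <sys/wait.h>
	#include <unistd.h>

	int main(void)
	{
		unsigned char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
					MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED || madvise(p, 4096, MADV_WIPEONFORK))
			return 1;
		memset(p, 0xaa, 4096);

		if (fork() == 0) {
			/* Child: the page reads back zero-filled. */
			printf("child sees 0x%02x\n", p[0]);	/* 0x00 */
			_exit(0);
		}
		wait(NULL);
		printf("parent sees 0x%02x\n", p[0]);	/* 0xaa */
		return 0;
	}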
The vDSO getrandom() works over an opaque per-thread state of an unexported
size, which must be marked as MADV_WIPEONFORK and be mlock()'d for proper
operation. Over time, the nuances of these allocations may change or grow
or even differ based on architectural features.

The syscall has the signature:

  void *vgetrandom_alloc([inout] size_t *num,
                         [out] size_t *size_per_each,
                         unsigned int flags);

This takes the desired number of opaque states in `num`, and returns a
pointer to an array of opaque states, the number actually allocated back in
`num`, and the size in bytes of each one in `size_per_each`, enabling a libc
to slice up the returned array into one state per thread. (The `flags`
argument is always zero for now.) Libc is expected to allocate a chunk of
these on first use, and then dole them out to threads as they're created,
allocating more when needed. The following commit shows an example of this,
used in conjunction with the getrandom() vDSO function.

We very intentionally do *not* leave state allocation for vDSO getrandom()
up to userspace itself, but rather provide this new syscall for such
allocations. vDSO getrandom() must not store its state in just any old
memory address, but rather only ones that the kernel specially allocates
for it, leaving the particularities of those allocations up to the kernel.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
---
 MAINTAINERS                             |  1 +
 arch/x86/entry/syscalls/syscall_32.tbl  |  1 +
 arch/x86/entry/syscalls/syscall_64.tbl  |  1 +
 arch/x86/include/asm/unistd.h           |  1 +
 drivers/char/random.c                   | 59 +++++++++++++++++++++++++
 include/uapi/asm-generic/unistd.h       |  7 ++-
 kernel/sys_ni.c                         |  3 ++
 lib/vdso/getrandom.h                    | 23 ++++++++++
 scripts/checksyscalls.sh                |  4 ++
 tools/include/uapi/asm-generic/unistd.h |  7 ++-
 10 files changed, 105 insertions(+), 2 deletions(-)
 create mode 100644 lib/vdso/getrandom.h
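
To make the intended libc pattern concrete, here is a hypothetical
user-space sketch of invoking the new syscall directly via syscall(2) and
slicing the returned array into per-thread states. The syscall number is
taken from this patch's x86 tables; the thread hand-off shown here is purely
illustrative, not a published API:

	#include <stdio.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	#ifndef __NR_vgetrandom_alloc
	#define __NR_vgetrandom_alloc 451	/* x86, from this patch */
	#endif

	int main(void)
	{
		unsigned long num = 16;	/* in: desired; out: actually allocated */
		unsigned long size_per_each;
		long ret;

		ret = syscall(__NR_vgetrandom_alloc, &num, &size_per_each, 0);
		if (ret == -1) {
			perror("vgetrandom_alloc");
			return 1;
		}

		/* A libc would dole these slices out, one per thread: */
		char *states = (char *)ret;
		char *state_for_thread_3 = states + 3 * size_per_each;
		(void)state_for_thread_3;

		printf("allocated %lu states of %lu bytes each\n",
		       num, size_per_each);
		return 0;
	}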