Message ID | 20221121152909.3414096-2-Jason@zx2c4.com (mailing list archive) |
---|---|
State | Not Applicable |
Delegated to: | Herbert Xu |
Headers | show |
Series | implement getrandom() in vDSO | expand |
* Jason A. Donenfeld: > + * The vgetrandom() function in userspace requires an opaque state, which this > + * function provides to userspace, by mapping a certain number of special pages > + * into the calling process. It takes a hint as to the number of opaque states > + * desired, and returns the number of opaque states actually allocated, the > + * size of each one in bytes, and the address of the first state. > + */ > +SYSCALL_DEFINE3(vgetrandom_alloc, unsigned long __user *, num, > + unsigned long __user *, size_per_each, unsigned int, flags) I think you should make this __u64, so that you get a consistent userspace interface on all architectures, without the need for compat system calls. Thanks, Florian
Hi Florian, On Wed, Nov 23, 2022 at 11:46:58AM +0100, Florian Weimer wrote: > * Jason A. Donenfeld: > > > + * The vgetrandom() function in userspace requires an opaque state, which this > > + * function provides to userspace, by mapping a certain number of special pages > > + * into the calling process. It takes a hint as to the number of opaque states > > + * desired, and returns the number of opaque states actually allocated, the > > + * size of each one in bytes, and the address of the first state. > > + */ > > +SYSCALL_DEFINE3(vgetrandom_alloc, unsigned long __user *, num, > > + unsigned long __user *, size_per_each, unsigned int, flags) > > I think you should make this __u64, so that you get a consistent > userspace interface on all architectures, without the need for compat > system calls. That would be quite unconventional. Most syscalls that take lengths do so with the native register size (`unsigned long`, `size_t`), rather than u64. If you can point to a recent trend away from this by indicating some commits that added new syscalls with u64, I'd be happy to be shown otherwise. But AFAIK, that's not the way it's done. Jason
* Jason A. Donenfeld: > Hi Florian, > > On Wed, Nov 23, 2022 at 11:46:58AM +0100, Florian Weimer wrote: >> * Jason A. Donenfeld: >> >> > + * The vgetrandom() function in userspace requires an opaque state, which this >> > + * function provides to userspace, by mapping a certain number of special pages >> > + * into the calling process. It takes a hint as to the number of opaque states >> > + * desired, and returns the number of opaque states actually allocated, the >> > + * size of each one in bytes, and the address of the first state. >> > + */ >> > +SYSCALL_DEFINE3(vgetrandom_alloc, unsigned long __user *, num, >> > + unsigned long __user *, size_per_each, unsigned int, flags) >> >> I think you should make this __u64, so that you get a consistent >> userspace interface on all architectures, without the need for compat >> system calls. > > That would be quite unconventional. Most syscalls that take lengths do > so with the native register size (`unsigned long`, `size_t`), rather > than u64. If you can point to a recent trend away from this by > indicating some commits that added new syscalls with u64, I'd be happy > to be shown otherwise. But AFAIK, that's not the way it's done. See clone3 and struct clone_args. It's more common with pointers, which are now 64 bits unconditionally: struct futex_waitv, struct rseq_cs and struct rseq. If the length or pointer is a system call argument, widening it to 64 bits is not necessary because zero-extension to the full register eliminates the need for a compat system call. But if you pass the address to a size or pointer, you'll need compat syscalls if you don't make the passed data __u64. Thanks, Florian
Hi Florian, On Thu, Nov 24, 2022 at 06:25:39AM +0100, Florian Weimer wrote: > * Jason A. Donenfeld: > > > Hi Florian, > > > > On Wed, Nov 23, 2022 at 11:46:58AM +0100, Florian Weimer wrote: > >> * Jason A. Donenfeld: > >> > >> > + * The vgetrandom() function in userspace requires an opaque state, which this > >> > + * function provides to userspace, by mapping a certain number of special pages > >> > + * into the calling process. It takes a hint as to the number of opaque states > >> > + * desired, and returns the number of opaque states actually allocated, the > >> > + * size of each one in bytes, and the address of the first state. > >> > + */ > >> > +SYSCALL_DEFINE3(vgetrandom_alloc, unsigned long __user *, num, > >> > + unsigned long __user *, size_per_each, unsigned int, flags) > >> > >> I think you should make this __u64, so that you get a consistent > >> userspace interface on all architectures, without the need for compat > >> system calls. > > > > That would be quite unconventional. Most syscalls that take lengths do > > so with the native register size (`unsigned long`, `size_t`), rather > > than u64. If you can point to a recent trend away from this by > > indicating some commits that added new syscalls with u64, I'd be happy > > to be shown otherwise. But AFAIK, that's not the way it's done. > > See clone3 and struct clone_args. The struct is one thing. But actually, clone3 takes a `size_t`: SYSCALL_DEFINE2(clone3, struct clone_args __user *, uargs, size_t, size) I take from this that I too should use `size_t` rather than `unsigned long.` And it doesn't seem like there's any compat clone3. Jason
* Jason A. Donenfeld: > Hi Florian, > > On Thu, Nov 24, 2022 at 06:25:39AM +0100, Florian Weimer wrote: >> * Jason A. Donenfeld: >> >> > Hi Florian, >> > >> > On Wed, Nov 23, 2022 at 11:46:58AM +0100, Florian Weimer wrote: >> >> * Jason A. Donenfeld: >> >> >> >> > + * The vgetrandom() function in userspace requires an opaque state, which this >> >> > + * function provides to userspace, by mapping a certain number of special pages >> >> > + * into the calling process. It takes a hint as to the number of opaque states >> >> > + * desired, and returns the number of opaque states actually allocated, the >> >> > + * size of each one in bytes, and the address of the first state. >> >> > + */ >> >> > +SYSCALL_DEFINE3(vgetrandom_alloc, unsigned long __user *, num, >> >> > + unsigned long __user *, size_per_each, unsigned int, flags) >> >> >> >> I think you should make this __u64, so that you get a consistent >> >> userspace interface on all architectures, without the need for compat >> >> system calls. >> > >> > That would be quite unconventional. Most syscalls that take lengths do >> > so with the native register size (`unsigned long`, `size_t`), rather >> > than u64. If you can point to a recent trend away from this by >> > indicating some commits that added new syscalls with u64, I'd be happy >> > to be shown otherwise. But AFAIK, that's not the way it's done. >> >> See clone3 and struct clone_args. > > The struct is one thing. But actually, clone3 takes a `size_t`: > > SYSCALL_DEFINE2(clone3, struct clone_args __user *, uargs, size_t, size) > > I take from this that I too should use `size_t` rather than `unsigned > long.` And it doesn't seem like there's any compat clone3. But vgetrandom_alloc does not use unsigned long, but unsigned long *. You need to look at the contents for struct clone_args for comparison. Thanks, Florian
Hi Florian, On Thu, Nov 24, 2022 at 01:15:24PM +0100, Florian Weimer wrote: > * Jason A. Donenfeld: > > > Hi Florian, > > > > On Thu, Nov 24, 2022 at 06:25:39AM +0100, Florian Weimer wrote: > >> * Jason A. Donenfeld: > >> > >> > Hi Florian, > >> > > >> > On Wed, Nov 23, 2022 at 11:46:58AM +0100, Florian Weimer wrote: > >> >> * Jason A. Donenfeld: > >> >> > >> >> > + * The vgetrandom() function in userspace requires an opaque state, which this > >> >> > + * function provides to userspace, by mapping a certain number of special pages > >> >> > + * into the calling process. It takes a hint as to the number of opaque states > >> >> > + * desired, and returns the number of opaque states actually allocated, the > >> >> > + * size of each one in bytes, and the address of the first state. > >> >> > + */ > >> >> > +SYSCALL_DEFINE3(vgetrandom_alloc, unsigned long __user *, num, > >> >> > + unsigned long __user *, size_per_each, unsigned int, flags) > >> >> > >> >> I think you should make this __u64, so that you get a consistent > >> >> userspace interface on all architectures, without the need for compat > >> >> system calls. > >> > > >> > That would be quite unconventional. Most syscalls that take lengths do > >> > so with the native register size (`unsigned long`, `size_t`), rather > >> > than u64. If you can point to a recent trend away from this by > >> > indicating some commits that added new syscalls with u64, I'd be happy > >> > to be shown otherwise. But AFAIK, that's not the way it's done. > >> > >> See clone3 and struct clone_args. > > > > The struct is one thing. But actually, clone3 takes a `size_t`: > > > > SYSCALL_DEFINE2(clone3, struct clone_args __user *, uargs, size_t, size) > > > > I take from this that I too should use `size_t` rather than `unsigned > > long.` And it doesn't seem like there's any compat clone3. > > But vgetrandom_alloc does not use unsigned long, but unsigned long *. > You need to look at the contents for struct clone_args for comparison. Ah! I see what you mean; that's a good point. The usual register clearing thing isn't going to happen because these are addresses. I still am somewhat hesitant, though, because `size_t` is really the "proper" type to be used. Maybe the compat syscall thing is just a necessary evil? The other direction would be making this a u32, since 640k ought to be enough for anybody and such, but maybe that'd be a mistake too. So I'm not sure. Anybody else on the list with experience adding syscalls have an opinion? Jason
Hey again, On Thu, Nov 24, 2022 at 01:24:42PM +0100, Jason A. Donenfeld wrote: > Hi Florian, > > On Thu, Nov 24, 2022 at 01:15:24PM +0100, Florian Weimer wrote: > > * Jason A. Donenfeld: > > > > > Hi Florian, > > > > > > On Thu, Nov 24, 2022 at 06:25:39AM +0100, Florian Weimer wrote: > > >> * Jason A. Donenfeld: > > >> > > >> > Hi Florian, > > >> > > > >> > On Wed, Nov 23, 2022 at 11:46:58AM +0100, Florian Weimer wrote: > > >> >> * Jason A. Donenfeld: > > >> >> > > >> >> > + * The vgetrandom() function in userspace requires an opaque state, which this > > >> >> > + * function provides to userspace, by mapping a certain number of special pages > > >> >> > + * into the calling process. It takes a hint as to the number of opaque states > > >> >> > + * desired, and returns the number of opaque states actually allocated, the > > >> >> > + * size of each one in bytes, and the address of the first state. > > >> >> > + */ > > >> >> > +SYSCALL_DEFINE3(vgetrandom_alloc, unsigned long __user *, num, > > >> >> > + unsigned long __user *, size_per_each, unsigned int, flags) > > >> >> > > >> >> I think you should make this __u64, so that you get a consistent > > >> >> userspace interface on all architectures, without the need for compat > > >> >> system calls. > > >> > > > >> > That would be quite unconventional. Most syscalls that take lengths do > > >> > so with the native register size (`unsigned long`, `size_t`), rather > > >> > than u64. If you can point to a recent trend away from this by > > >> > indicating some commits that added new syscalls with u64, I'd be happy > > >> > to be shown otherwise. But AFAIK, that's not the way it's done. > > >> > > >> See clone3 and struct clone_args. > > > > > > The struct is one thing. But actually, clone3 takes a `size_t`: > > > > > > SYSCALL_DEFINE2(clone3, struct clone_args __user *, uargs, size_t, size) > > > > > > I take from this that I too should use `size_t` rather than `unsigned > > > long.` And it doesn't seem like there's any compat clone3. > > > > But vgetrandom_alloc does not use unsigned long, but unsigned long *. > > You need to look at the contents for struct clone_args for comparison. > > Ah! I see what you mean; that's a good point. The usual register > clearing thing isn't going to happen because these are addresses. > > I still am somewhat hesitant, though, because `size_t` is really the > "proper" type to be used. Maybe the compat syscall thing is just a > necessary evil? > > The other direction would be making this a u32, since 640k ought to be > enough for anybody and such, but maybe that'd be a mistake too. > > So I'm not sure. Anybody else on the list with experience adding > syscalls have an opinion? Looks like set_mempolicy, get_mempoliy, and migrate_pages pass an unsigned long pointer and I don't see any compat stuff around it: SYSCALL_DEFINE3(set_mempolicy, int, mode, const unsigned long __user *, nmask, unsigned long, maxnode) SYSCALL_DEFINE5(get_mempolicy, int __user *, policy, unsigned long __user *, nmask, unsigned long, maxnode, unsigned long, addr, unsigned long, flags) SYSCALL_DEFINE4(migrate_pages, pid_t, pid, unsigned long, maxnode, const unsigned long __user *, old_nodes, const unsigned long __user *, new_nodes) In contrast sched_setaffinity and get_robust_list take a unsigned long pointer and does have a compat wrapper: SYSCALL_DEFINE3(sched_setaffinity, pid_t, pid, unsigned int, len, unsigned long __user *, user_mask_ptr) SYSCALL_DEFINE3(get_robust_list, int, pid, struct robust_list_head __user * __user *, head_ptr, size_t __user *, len_ptr) Jason
On Thu, Nov 24, 2022 at 01:24:42PM +0100, Jason A. Donenfeld wrote: > Hi Florian, > > On Thu, Nov 24, 2022 at 01:15:24PM +0100, Florian Weimer wrote: > > * Jason A. Donenfeld: > > > > > Hi Florian, > > > > > > On Thu, Nov 24, 2022 at 06:25:39AM +0100, Florian Weimer wrote: > > >> * Jason A. Donenfeld: > > >> > > >> > Hi Florian, > > >> > > > >> > On Wed, Nov 23, 2022 at 11:46:58AM +0100, Florian Weimer wrote: > > >> >> * Jason A. Donenfeld: > > >> >> > > >> >> > + * The vgetrandom() function in userspace requires an opaque state, which this > > >> >> > + * function provides to userspace, by mapping a certain number of special pages > > >> >> > + * into the calling process. It takes a hint as to the number of opaque states > > >> >> > + * desired, and returns the number of opaque states actually allocated, the > > >> >> > + * size of each one in bytes, and the address of the first state. > > >> >> > + */ > > >> >> > +SYSCALL_DEFINE3(vgetrandom_alloc, unsigned long __user *, num, > > >> >> > + unsigned long __user *, size_per_each, unsigned int, flags) > > >> >> > > >> >> I think you should make this __u64, so that you get a consistent > > >> >> userspace interface on all architectures, without the need for compat > > >> >> system calls. > > >> > > > >> > That would be quite unconventional. Most syscalls that take lengths do > > >> > so with the native register size (`unsigned long`, `size_t`), rather > > >> > than u64. If you can point to a recent trend away from this by > > >> > indicating some commits that added new syscalls with u64, I'd be happy > > >> > to be shown otherwise. But AFAIK, that's not the way it's done. > > >> > > >> See clone3 and struct clone_args. For system calls that take structs as arguments we use u64 in the struct for proper alignment so we can extend structs without regressing old kernels. We have a few of those extensible struct system calls. But we don't really have a lot system calls that pass u64 as a pointer outside of a structure so far. Neither as register and nor as pointer iirc. Passing them as a register arg is problematic because of 32bit arches. But passing as pointer should be fine but it is indeed uncommon. > > > > > > The struct is one thing. But actually, clone3 takes a `size_t`: > > > > > > SYSCALL_DEFINE2(clone3, struct clone_args __user *, uargs, size_t, size) > > > > > > I take from this that I too should use `size_t` rather than `unsigned > > > long.` And it doesn't seem like there's any compat clone3. > > > > But vgetrandom_alloc does not use unsigned long, but unsigned long *. > > You need to look at the contents for struct clone_args for comparison. > > Ah! I see what you mean; that's a good point. The usual register > clearing thing isn't going to happen because these are addresses. > > I still am somewhat hesitant, though, because `size_t` is really the > "proper" type to be used. Maybe the compat syscall thing is just a > necessary evil? We try to avoid adding new compat-requiring syscalls like the plague usually. (At least for new syscalls that don't need to inherit behavior from earlier syscalls they are a revisions of.) > > The other direction would be making this a u32, since 640k ought to be > enough for anybody and such, but maybe that'd be a mistake too. I think making this a size_t is fine. We haven't traditionally used u32 for sizes. All syscalls that pass structs versioned by size use size_t. So I would recommend to stick with that. Alternatively, you could also introduce a simple struct versioned by size for this system call similar to mount_setatt() and clone3() and so on. This way you don't need to worry about future extensibilty. Just a thought.
Hi Christian, Thanks a bunch for chiming in. On Thu, Nov 24, 2022 at 01:49:27PM +0100, Christian Brauner wrote: > Alternatively, you could also introduce a simple struct versioned by > size for this system call similar to mount_setatt() and clone3() and so > on. This way you don't need to worry about future extensibilty. Just a > thought. Briefly considered that, but it seemed a bit heavy for something like this. I'm not super heavily opposed, but just seemed like a bit much. > > > >> >> > +SYSCALL_DEFINE3(vgetrandom_alloc, unsigned long __user *, num, > > > >> >> > + unsigned long __user *, size_per_each, unsigned int, flags) > > > >> >> > > > >> >> I think you should make this __u64, so that you get a consistent > > > >> >> userspace interface on all architectures, without the need for compat > > > >> >> system calls. > > > >> > > > > >> > That would be quite unconventional. Most syscalls that take lengths do > > > >> > so with the native register size (`unsigned long`, `size_t`), rather > > > >> > than u64. If you can point to a recent trend away from this by > > > >> > indicating some commits that added new syscalls with u64, I'd be happy > > > >> > to be shown otherwise. But AFAIK, that's not the way it's done. > > > >> > > > >> See clone3 and struct clone_args. > > For system calls that take structs as arguments we use u64 in the struct > for proper alignment so we can extend structs without regressing old > kernels. We have a few of those extensible struct system calls. > > But we don't really have a lot system calls that pass u64 as a pointer > outside of a structure so far. Neither as register and nor as pointer > iirc. Right, the __u64_aligned business seemed to be mostly about extensibility. > > > > The struct is one thing. But actually, clone3 takes a `size_t`: > > > > > > > > SYSCALL_DEFINE2(clone3, struct clone_args __user *, uargs, size_t, size) > > > > > > > > I take from this that I too should use `size_t` rather than `unsigned > > > > long.` And it doesn't seem like there's any compat clone3. > > > > > > But vgetrandom_alloc does not use unsigned long, but unsigned long *. > > > You need to look at the contents for struct clone_args for comparison. > > > > Ah! I see what you mean; that's a good point. The usual register > > clearing thing isn't going to happen because these are addresses. > > > > I still am somewhat hesitant, though, because `size_t` is really the > > "proper" type to be used. Maybe the compat syscall thing is just a > > necessary evil? > > I think making this a size_t is fine. We haven't traditionally used u32 > for sizes. All syscalls that pass structs versioned by size use size_t. > So I would recommend to stick with that. This isn't quite a struct versioned by size. This is: void *vgetrandom_alloc([inout] size_t *num, [out] size_t *size_per_each, unsigned int flags); You give it an input 'num' and some flags (currently flags=0), and it gives you back an output 'num' size, an output 'size_per_each' size, and an opaque pointer value mapping as its return value. I do like the idea of keeping size_t so that the type is "right". But the other arguments are equally compelling as well, so not sure. Jason
On Thu, Nov 24, 2022, at 13:48, Jason A. Donenfeld wrote: > On Thu, Nov 24, 2022 at 01:24:42PM +0100, Jason A. Donenfeld wrote: > Looks like set_mempolicy, get_mempoliy, and migrate_pages pass an > unsigned long pointer and I don't see any compat stuff around it: > > SYSCALL_DEFINE3(set_mempolicy, int, mode, const unsigned long > __user *, nmask, > unsigned long, maxnode) > > SYSCALL_DEFINE5(get_mempolicy, int __user *, policy, > unsigned long __user *, nmask, unsigned long, maxnode, > unsigned long, addr, unsigned long, flags) > > SYSCALL_DEFINE4(migrate_pages, pid_t, pid, unsigned long, maxnode, > const unsigned long __user *, old_nodes, > const unsigned long __user *, new_nodes) Compat handling for these is done all the way down in the pointer access: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/mempolicy.c#n1368 This works here because it's a special bitmap but is not the best approach if you just have a pointer to a single value. Arnd
On Thu, Nov 24, 2022 at 01:24:42PM +0100, Jason A. Donenfeld wrote: > Hi Florian, > > On Thu, Nov 24, 2022 at 01:15:24PM +0100, Florian Weimer wrote: > > * Jason A. Donenfeld: > > > > > Hi Florian, > > > > > > On Thu, Nov 24, 2022 at 06:25:39AM +0100, Florian Weimer wrote: > > >> * Jason A. Donenfeld: > > >> > > >> > Hi Florian, > > >> > > > >> > On Wed, Nov 23, 2022 at 11:46:58AM +0100, Florian Weimer wrote: > > >> >> * Jason A. Donenfeld: > > >> >> > > >> >> > + * The vgetrandom() function in userspace requires an opaque state, which this > > >> >> > + * function provides to userspace, by mapping a certain number of special pages > > >> >> > + * into the calling process. It takes a hint as to the number of opaque states > > >> >> > + * desired, and returns the number of opaque states actually allocated, the > > >> >> > + * size of each one in bytes, and the address of the first state. > > >> >> > + */ > > >> >> > +SYSCALL_DEFINE3(vgetrandom_alloc, unsigned long __user *, num, > > >> >> > + unsigned long __user *, size_per_each, unsigned int, flags) > > >> >> > > >> >> I think you should make this __u64, so that you get a consistent > > >> >> userspace interface on all architectures, without the need for compat > > >> >> system calls. > > >> > > > >> > That would be quite unconventional. Most syscalls that take lengths do > > >> > so with the native register size (`unsigned long`, `size_t`), rather > > >> > than u64. If you can point to a recent trend away from this by > > >> > indicating some commits that added new syscalls with u64, I'd be happy > > >> > to be shown otherwise. But AFAIK, that's not the way it's done. > > >> > > >> See clone3 and struct clone_args. > > > > > > The struct is one thing. But actually, clone3 takes a `size_t`: > > > > > > SYSCALL_DEFINE2(clone3, struct clone_args __user *, uargs, size_t, size) > > > > > > I take from this that I too should use `size_t` rather than `unsigned > > > long.` And it doesn't seem like there's any compat clone3. > > > > But vgetrandom_alloc does not use unsigned long, but unsigned long *. > > You need to look at the contents for struct clone_args for comparison. > > The other direction would be making this a u32 I think `unsigned int` is actually a sensible size for what these values should be. That eliminates the problem and potential bikeshed too. So I'll go with that for v+1. Jason
diff --git a/MAINTAINERS b/MAINTAINERS index 256f03904987..843dd6a49538 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -17287,6 +17287,7 @@ T: git https://git.kernel.org/pub/scm/linux/kernel/git/crng/random.git S: Maintained F: drivers/char/random.c F: drivers/virt/vmgenid.c +F: lib/vdso/getrandom.h RAPIDIO SUBSYSTEM M: Matt Porter <mporter@kernel.crashing.org> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 67745ceab0db..331e21ba961a 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -59,6 +59,7 @@ config X86 # select ACPI_LEGACY_TABLES_LOOKUP if ACPI select ACPI_SYSTEM_POWER_STATES_SUPPORT if ACPI + select ADVISE_SYSCALLS if X86_64 select ARCH_32BIT_OFF_T if X86_32 select ARCH_CLOCKSOURCE_INIT select ARCH_CORRECT_STACKTRACE_ON_KRETPROBE diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl index c84d12608cd2..0186f173f0e8 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -372,6 +372,7 @@ 448 common process_mrelease sys_process_mrelease 449 common futex_waitv sys_futex_waitv 450 common set_mempolicy_home_node sys_set_mempolicy_home_node +451 common vgetrandom_alloc sys_vgetrandom_alloc # # Due to a historical design error, certain syscalls are numbered differently diff --git a/arch/x86/include/asm/unistd.h b/arch/x86/include/asm/unistd.h index 761173ccc33c..1bf509eaeff1 100644 --- a/arch/x86/include/asm/unistd.h +++ b/arch/x86/include/asm/unistd.h @@ -27,6 +27,7 @@ # define __ARCH_WANT_COMPAT_SYS_PWRITEV64 # define __ARCH_WANT_COMPAT_SYS_PREADV64V2 # define __ARCH_WANT_COMPAT_SYS_PWRITEV64V2 +# define __ARCH_WANT_VGETRANDOM_ALLOC # define X32_NR_syscalls (__NR_x32_syscalls) # define IA32_NR_syscalls (__NR_ia32_syscalls) diff --git a/drivers/char/random.c b/drivers/char/random.c index 65ee69896967..9b64db52849f 100644 --- a/drivers/char/random.c +++ b/drivers/char/random.c @@ -8,6 +8,7 @@ * into roughly six sections, each with a section header: * * - Initialization and readiness waiting. + * - vDSO support helpers. * - Fast key erasure RNG, the "crng". * - Entropy accumulation and extraction routines. * - Entropy collection routines. @@ -39,6 +40,7 @@ #include <linux/blkdev.h> #include <linux/interrupt.h> #include <linux/mm.h> +#include <linux/mman.h> #include <linux/nodemask.h> #include <linux/spinlock.h> #include <linux/kthread.h> @@ -59,6 +61,7 @@ #include <asm/irq.h> #include <asm/irq_regs.h> #include <asm/io.h> +#include "../../lib/vdso/getrandom.h" /********************************************************************* * @@ -146,6 +149,62 @@ EXPORT_SYMBOL(wait_for_random_bytes); __func__, (void *)_RET_IP_, crng_init) + +/******************************************************************** + * + * vDSO support helpers. + * + * The actual vDSO function is defined over in lib/vdso/getrandom.c, + * but this section contains the kernel-mode helpers to support that. + * + ********************************************************************/ + +#ifdef __ARCH_WANT_VGETRANDOM_ALLOC +/* + * The vgetrandom() function in userspace requires an opaque state, which this + * function provides to userspace, by mapping a certain number of special pages + * into the calling process. It takes a hint as to the number of opaque states + * desired, and returns the number of opaque states actually allocated, the + * size of each one in bytes, and the address of the first state. + */ +SYSCALL_DEFINE3(vgetrandom_alloc, unsigned long __user *, num, + unsigned long __user *, size_per_each, unsigned int, flags) +{ + unsigned long alloc_size; + unsigned long num_states; + unsigned long pages_addr; + int ret; + + if (flags) + return -EINVAL; + + if (get_user(num_states, num)) + return -EFAULT; + + num_states = clamp(num_states, 1UL, (SIZE_MAX & PAGE_MASK) / sizeof(struct vgetrandom_state)); + alloc_size = PAGE_ALIGN(num_states * sizeof(struct vgetrandom_state)); + + if (put_user(alloc_size / sizeof(struct vgetrandom_state), num) || + put_user(sizeof(struct vgetrandom_state), size_per_each)) + return -EFAULT; + + pages_addr = vm_mmap(NULL, 0, alloc_size, PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKED, 0); + if (IS_ERR_VALUE(pages_addr)) + return pages_addr; + + ret = do_madvise(current->mm, pages_addr, alloc_size, MADV_WIPEONFORK); + if (ret < 0) + goto err_unmap; + + return pages_addr; + +err_unmap: + vm_munmap(pages_addr, alloc_size); + return ret; +} +#endif + /********************************************************************* * * Fast key erasure RNG, the "crng". diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h index 45fa180cc56a..77b6debe7e18 100644 --- a/include/uapi/asm-generic/unistd.h +++ b/include/uapi/asm-generic/unistd.h @@ -886,8 +886,13 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv) #define __NR_set_mempolicy_home_node 450 __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node) +#ifdef __ARCH_WANT_VGETRANDOM_ALLOC +#define __NR_vgetrandom_alloc 451 +__SYSCALL(__NR_vgetrandom_alloc, sys_vgetrandom_alloc) +#endif + #undef __NR_syscalls -#define __NR_syscalls 451 +#define __NR_syscalls 452 /* * 32 bit systems traditionally used different diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index 860b2dcf3ac4..f28196cb919b 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -360,6 +360,9 @@ COND_SYSCALL(pkey_free); /* memfd_secret */ COND_SYSCALL(memfd_secret); +/* random */ +COND_SYSCALL(vgetrandom_alloc); + /* * Architecture specific weak syscall entries. */ diff --git a/lib/vdso/getrandom.h b/lib/vdso/getrandom.h new file mode 100644 index 000000000000..c7f727db2aaa --- /dev/null +++ b/lib/vdso/getrandom.h @@ -0,0 +1,23 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (C) 2022 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved. + */ + +#ifndef _VDSO_LIB_GETRANDOM_H +#define _VDSO_LIB_GETRANDOM_H + +#include <crypto/chacha.h> + +struct vgetrandom_state { + union { + struct { + u8 batch[CHACHA_BLOCK_SIZE * 3 / 2]; + u32 key[CHACHA_KEY_SIZE / sizeof(u32)]; + }; + u8 batch_key[CHACHA_BLOCK_SIZE * 2]; + }; + unsigned long generation; + u8 pos; +}; + +#endif /* _VDSO_LIB_GETRANDOM_H */ diff --git a/scripts/checksyscalls.sh b/scripts/checksyscalls.sh index f33e61aca93d..7f7928c6487f 100755 --- a/scripts/checksyscalls.sh +++ b/scripts/checksyscalls.sh @@ -44,6 +44,10 @@ cat << EOF #define __IGNORE_memfd_secret #endif +#ifndef __ARCH_WANT_VGETRANDOM_ALLOC +#define __IGNORE_vgetrandom_alloc +#endif + /* Missing flags argument */ #define __IGNORE_renameat /* renameat2 */ diff --git a/tools/include/uapi/asm-generic/unistd.h b/tools/include/uapi/asm-generic/unistd.h index 45fa180cc56a..77b6debe7e18 100644 --- a/tools/include/uapi/asm-generic/unistd.h +++ b/tools/include/uapi/asm-generic/unistd.h @@ -886,8 +886,13 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv) #define __NR_set_mempolicy_home_node 450 __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node) +#ifdef __ARCH_WANT_VGETRANDOM_ALLOC +#define __NR_vgetrandom_alloc 451 +__SYSCALL(__NR_vgetrandom_alloc, sys_vgetrandom_alloc) +#endif + #undef __NR_syscalls -#define __NR_syscalls 451 +#define __NR_syscalls 452 /* * 32 bit systems traditionally used different
The vDSO getrandom() works over an opaque per-thread state of an unexported size, which must be marked as MADV_WIPEONFORK and be mlock()'d for proper operation. Over time, the nuances of these allocations may change or grow or even differ based on architectural features. The syscall has the signature: void *vgetrandom_alloc([inout] size_t *num, [out] size_t *size_per_each, unsigned int flags); This takes the desired number of opaque states in `num`, and returns a pointer to an array of opaque states, the number actually allocated back in `num`, and the size in bytes of each one in `size_per_each`, enabling a libc to slice up the returned array into a state per each thread. (The `flags` argument is always zero for now.) Libc is expected to allocate a chunk of these on first use, and then dole them out to threads as they're created, allocating more when needed. The following commit shows an example of this, being used in conjunction with the getrandom() vDSO function. We very intentionally do *not* leave state allocation for vDSO getrandom() up to userspace itself, but rather provide this new syscall for such allocations. vDSO getrandom() must not store its state in just any old memory address, but rather just ones that the kernel specially allocates for it, leaving the particularities of those allocations up to the kernel. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> --- MAINTAINERS | 1 + arch/x86/Kconfig | 1 + arch/x86/entry/syscalls/syscall_64.tbl | 1 + arch/x86/include/asm/unistd.h | 1 + drivers/char/random.c | 59 +++++++++++++++++++++++++ include/uapi/asm-generic/unistd.h | 7 ++- kernel/sys_ni.c | 3 ++ lib/vdso/getrandom.h | 23 ++++++++++ scripts/checksyscalls.sh | 4 ++ tools/include/uapi/asm-generic/unistd.h | 7 ++- 10 files changed, 105 insertions(+), 2 deletions(-) create mode 100644 lib/vdso/getrandom.h