[v2] aarch64: vdso: Wire up getrandom() vDSO implementation

Message ID	20240829201728.2825-1-adhemerval.zanella@linaro.org (mailing list archive)
State	New, archived
Headers	From: Adhemerval Zanella <adhemerval.zanella@linaro.org> To: "Jason A . Donenfeld" <Jason@zx2c4.com>, Theodore Ts'o <tytso@mit.edu>, linux-kernel@vger.kernel.org, linux-crypto@vger.kernel.org, linux-arm-kernel@lists.infradead.org, linux-arch@vger.kernel.org, Catalin Marinas <catalin.marinas@arm.com>, Will Deacon <will@kernel.org>, Thomas Gleixner <tglx@linutronix.de>, Eric Biggers <ebiggers@kernel.org>, Christophe Leroy <christophe.leroy@csgroup.eu> Subject: [PATCH v2] aarch64: vdso: Wire up getrandom() vDSO implementation Date: Thu, 29 Aug 2024 20:17:14 +0000 Message-ID: <20240829201728.2825-1-adhemerval.zanella@linaro.org>
Series	[v2] aarch64: vdso: Wire up getrandom() vDSO implementation

Adhemerval Zanella Netto Aug. 29, 2024, 8:17 p.m. UTC

Hook up the generic vDSO implementation to the aarch64 vDSO data page.
The data required by _vdso_rng_data is placed within the _vdso_data vvar
page, at an offset beyond the end of vdso_data.

The vDSO function requires a ChaCha20 implementation that does not
write to the stack, and that can do an entire ChaCha20 permutation.
The one provided is based on the current chacha-neon-core.S and uses NEON
for the permute operation. On chips that do not support NEON, the fallback
issues the syscall.

This also passes the vdso_test_chacha test along with
vdso_test_getrandom. The vdso_test_getrandom bench-single result on
Neoverse-N1 shows:

   vdso: 25000000 times in 0.746506464 seconds
   libc: 25000000 times in 8.849179444 seconds
syscall: 25000000 times in 8.818726425 seconds

Changes from v1:
- Fixed style issues and typos.
- Added fallback for systems without NEON support.
- Avoid use of non-volatile vector registers in neon chacha20.
- Use c-getrandom-y for vgetrandom.c.
- Fixed TIMENS vdso_rnd_data access.

Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
---
 arch/arm64/Kconfig                         |   1 +
 arch/arm64/include/asm/vdso.h              |   6 +
 arch/arm64/include/asm/vdso/getrandom.h    |  49 ++++++
 arch/arm64/include/asm/vdso/vsyscall.h     |  10 ++
 arch/arm64/kernel/vdso.c                   |   6 -
 arch/arm64/kernel/vdso/Makefile            |  11 +-
 arch/arm64/kernel/vdso/vdso                |   1 +
 arch/arm64/kernel/vdso/vdso.lds.S          |   4 +
 arch/arm64/kernel/vdso/vgetrandom-chacha.S | 168 +++++++++++++++++++++
 arch/arm64/kernel/vdso/vgetrandom.c        |  15 ++
 lib/vdso/getrandom.c                       |   1 +
 tools/arch/arm64/vdso                      |   1 +
 tools/include/linux/compiler.h             |   4 +
 tools/testing/selftests/vDSO/Makefile      |   5 +-
 14 files changed, 273 insertions(+), 9 deletions(-)
 create mode 100644 arch/arm64/include/asm/vdso/getrandom.h
 create mode 120000 arch/arm64/kernel/vdso/vdso
 create mode 100644 arch/arm64/kernel/vdso/vgetrandom-chacha.S
 create mode 100644 arch/arm64/kernel/vdso/vgetrandom.c
 create mode 120000 tools/arch/arm64/vdso
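
As a rough reading of the bench-single numbers quoted in the patch description, the totals can be converted into a per-call cost and a speedup ratio. The sketch below is not part of the patch; the helper names ns_per_call and speedup are made up for illustration, and the inputs are simply the figures reported for Neoverse-N1.

```c
#include <assert.h>

/* Hypothetical helper: average cost of one call, in nanoseconds. */
static double ns_per_call(double total_seconds, unsigned long calls)
{
	assert(calls > 0);
	return total_seconds * 1e9 / (double)calls;
}

/* Hypothetical helper: how many times faster `fast` is than `baseline`. */
static double speedup(double baseline_seconds, double fast_seconds)
{
	assert(fast_seconds > 0.0);
	return baseline_seconds / fast_seconds;
}
```

With the reported totals, ns_per_call(0.746506464, 25000000) is about 30 ns for the vDSO path and ns_per_call(8.818726425, 25000000) is about 353 ns for the raw syscall, so speedup(8.818726425, 0.746506464) comes out at roughly 11.8.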

Jason A. Donenfeld Aug. 29, 2024, 8:46 p.m. UTC | #1

Hi Catalin, Will, Adhemerval,

On Thu, Aug 29, 2024 at 08:17:14PM +0000, Adhemerval Zanella wrote:
> Hook up the generic vDSO implementation to the aarch64 vDSO data page.
> The _vdso_rng_data required data is placed within the _vdso_data vvar
> page, by using a offset larger than the vdso_data.
> 
> The vDSO function requires a ChaCha20 implementation that does not
> write to the stack, and that can do an entire ChaCha20 permutation.
> The one provided is based on the current chacha-neon-core.S and uses NEON
> on the permute operation. The fallback for chips that do not support
> NEON issues the syscall.
> 
> This also passes the vdso_test_chacha test along with
> vdso_test_getrandom. The vdso_test_getrandom bench-single result on
> Neoverse-N1 shows:
> 
>    vdso: 25000000 times in 0.746506464 seconds
>    libc: 25000000 times in 8.849179444 seconds
> syscall: 25000000 times in 8.818726425 seconds

Aside from the big endian concerns we discussed on IRC, this is looking
fine to me, and I'd like to get some variant of this queued up in my
random.git tree for 6.12 soon.

But first, Catalin or Will -- could one of you take a look and provide
your Acked-by for that, if the patch looks good to you?

Thanks,
Jason

Will Deacon Aug. 30, 2024, 11:46 a.m. UTC | #2

On Thu, Aug 29, 2024 at 08:17:14PM +0000, Adhemerval Zanella wrote:
> Hook up the generic vDSO implementation to the aarch64 vDSO data page.
> The _vdso_rng_data required data is placed within the _vdso_data vvar
> page, by using a offset larger than the vdso_data.
> 
> The vDSO function requires a ChaCha20 implementation that does not
> write to the stack, and that can do an entire ChaCha20 permutation.
> The one provided is based on the current chacha-neon-core.S and uses NEON
> on the permute operation. The fallback for chips that do not support
> NEON issues the syscall.
> 
> This also passes the vdso_test_chacha test along with
> vdso_test_getrandom. The vdso_test_getrandom bench-single result on
> Neoverse-N1 shows:
> 
>    vdso: 25000000 times in 0.746506464 seconds
>    libc: 25000000 times in 8.849179444 seconds
> syscall: 25000000 times in 8.818726425 seconds
> 
> Changes from v1:
> - Fixed style issues and typos.
> - Added fallback for systems without NEON support.
> - Avoid use of non-volatile vector registers in neon chacha20.
> - Use c-getrandom-y for vgetrandom.c.
> - Fixed TIMENS vdso_rnd_data access.
> 
> Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
> ---
>  arch/arm64/Kconfig                         |   1 +
>  arch/arm64/include/asm/vdso.h              |   6 +
>  arch/arm64/include/asm/vdso/getrandom.h    |  49 ++++++
>  arch/arm64/include/asm/vdso/vsyscall.h     |  10 ++
>  arch/arm64/kernel/vdso.c                   |   6 -
>  arch/arm64/kernel/vdso/Makefile            |  11 +-
>  arch/arm64/kernel/vdso/vdso                |   1 +
>  arch/arm64/kernel/vdso/vdso.lds.S          |   4 +
>  arch/arm64/kernel/vdso/vgetrandom-chacha.S | 168 +++++++++++++++++++++
>  arch/arm64/kernel/vdso/vgetrandom.c        |  15 ++
>  lib/vdso/getrandom.c                       |   1 +
>  tools/arch/arm64/vdso                      |   1 +
>  tools/include/linux/compiler.h             |   4 +
>  tools/testing/selftests/vDSO/Makefile      |   5 +-

Please can you split the tools/ changes into a separate patch?

>  14 files changed, 273 insertions(+), 9 deletions(-)
>  create mode 100644 arch/arm64/include/asm/vdso/getrandom.h
>  create mode 120000 arch/arm64/kernel/vdso/vdso
>  create mode 100644 arch/arm64/kernel/vdso/vgetrandom-chacha.S
>  create mode 100644 arch/arm64/kernel/vdso/vgetrandom.c
>  create mode 120000 tools/arch/arm64/vdso
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index a2f8ff354ca6..7f7424d1b3b8 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -262,6 +262,7 @@ config ARM64
>  	select TRACE_IRQFLAGS_NMI_SUPPORT
>  	select HAVE_SOFTIRQ_ON_OWN_STACK
>  	select USER_STACKTRACE_SUPPORT
> +	select VDSO_GETRANDOM
>  	help
>  	  ARM 64-bit (AArch64) Linux support.
>  
> diff --git a/arch/arm64/include/asm/vdso.h b/arch/arm64/include/asm/vdso.h
> index 4305995c8f82..18407b757c95 100644
> --- a/arch/arm64/include/asm/vdso.h
> +++ b/arch/arm64/include/asm/vdso.h
> @@ -16,6 +16,12 @@
>  
>  #ifndef __ASSEMBLY__
>  
> +enum vvar_pages {
> +	VVAR_DATA_PAGE_OFFSET,
> +	VVAR_TIMENS_PAGE_OFFSET,
> +	VVAR_NR_PAGES,
> +};
> +
>  #include <generated/vdso-offsets.h>
>  
>  #define VDSO_SYMBOL(base, name)						   \
> diff --git a/arch/arm64/include/asm/vdso/getrandom.h b/arch/arm64/include/asm/vdso/getrandom.h
> new file mode 100644
> index 000000000000..fca66ba49d4c
> --- /dev/null
> +++ b/arch/arm64/include/asm/vdso/getrandom.h
> @@ -0,0 +1,49 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#ifndef __ASM_VDSO_GETRANDOM_H
> +#define __ASM_VDSO_GETRANDOM_H
> +
> +#ifndef __ASSEMBLY__
> +
> +#include <asm/vdso.h>
> +#include <asm/unistd.h>
> +#include <vdso/datapage.h>
> +
> +/**
> + * getrandom_syscall - Invoke the getrandom() syscall.
> + * @buffer:	Destination buffer to fill with random bytes.
> + * @len:	Size of @buffer in bytes.
> + * @flags:	Zero or more GRND_* flags.
> + * Returns:	The number of random bytes written to @buffer, or a negative value indicating an error.
> + */
> +static __always_inline ssize_t getrandom_syscall(void *_buffer, size_t _len, unsigned int _flags)
> +{
> +	register void *buffer asm ("x0") = _buffer;
> +	register size_t len asm ("x1") = _len;
> +	register unsigned int flags asm ("x2") = _flags;
> +	register long ret asm ("x0");
> +	register long nr asm ("x8") = __NR_getrandom;
> +
> +	asm volatile(
> +	"       svc #0\n"
> +	: "=r" (ret)
> +	: "r" (buffer), "r" (len), "r" (flags), "r" (nr)
> +	: "memory");
> +
> +	return ret;
> +}
> +
> +static __always_inline const struct vdso_rng_data *__arch_get_vdso_rng_data(void)
> +{
> +	/*
> +	 * If a task belongs to a time namespace then a namespace the real
> +	 * VVAR page is mapped with the VVAR_TIMENS_PAGE_OFFSET.
> +	 */

This comment doesn't make sense.

> +	if (IS_ENABLED(CONFIG_TIME_NS) && _vdso_data->clock_mode == VDSO_CLOCKMODE_TIMENS)
> +		return (void*)&_vdso_rng_data + VVAR_TIMENS_PAGE_OFFSET * PAGE_SIZE;
> +	return &_vdso_rng_data;
> +}
> +
> +#endif /* !__ASSEMBLY__ */
> +
> +#endif /* __ASM_VDSO_GETRANDOM_H */
> diff --git a/arch/arm64/include/asm/vdso/vsyscall.h b/arch/arm64/include/asm/vdso/vsyscall.h
> index f94b1457c117..2a87f0e1b144 100644
> --- a/arch/arm64/include/asm/vdso/vsyscall.h
> +++ b/arch/arm64/include/asm/vdso/vsyscall.h
> @@ -2,8 +2,11 @@
>  #ifndef __ASM_VDSO_VSYSCALL_H
>  #define __ASM_VDSO_VSYSCALL_H
>  
> +#define __VDSO_RND_DATA_OFFSET  480

Why 480?

> +
>  #ifndef __ASSEMBLY__
>  
> +#include <asm/vdso.h>
>  #include <linux/timekeeper_internal.h>
>  #include <vdso/datapage.h>
>  
> @@ -21,6 +24,13 @@ struct vdso_data *__arm64_get_k_vdso_data(void)
>  }
>  #define __arch_get_k_vdso_data __arm64_get_k_vdso_data
>  
> +static __always_inline
> +struct vdso_rng_data *__arm64_get_k_vdso_rnd_data(void)
> +{
> +	return (void*)vdso_data + __VDSO_RND_DATA_OFFSET;
> +}
> +#define __arch_get_k_vdso_rng_data __arm64_get_k_vdso_rnd_data
> +
>  static __always_inline
>  void __arm64_update_vsyscall(struct vdso_data *vdata, struct timekeeper *tk)
>  {
> diff --git a/arch/arm64/kernel/vdso.c b/arch/arm64/kernel/vdso.c
> index 89b6e7840002..706c9c3a7a50 100644
> --- a/arch/arm64/kernel/vdso.c
> +++ b/arch/arm64/kernel/vdso.c
> @@ -34,12 +34,6 @@ enum vdso_abi {
>  	VDSO_ABI_AA32,
>  };
>  
> -enum vvar_pages {
> -	VVAR_DATA_PAGE_OFFSET,
> -	VVAR_TIMENS_PAGE_OFFSET,
> -	VVAR_NR_PAGES,
> -};
> -
>  struct vdso_abi_info {
>  	const char *name;
>  	const char *vdso_code_start;
> diff --git a/arch/arm64/kernel/vdso/Makefile b/arch/arm64/kernel/vdso/Makefile
> index d11da6461278..50246a38d6bd 100644
> --- a/arch/arm64/kernel/vdso/Makefile
> +++ b/arch/arm64/kernel/vdso/Makefile
> @@ -9,7 +9,7 @@
>  # Include the generic Makefile to check the built vdso.
>  include $(srctree)/lib/vdso/Makefile
>  
> -obj-vdso := vgettimeofday.o note.o sigreturn.o
> +obj-vdso := vgettimeofday.o note.o sigreturn.o vgetrandom.o vgetrandom-chacha.o
>  
>  # Build rules
>  targets := $(obj-vdso) vdso.so vdso.so.dbg
> @@ -40,13 +40,22 @@ CFLAGS_REMOVE_vgettimeofday.o = $(CC_FLAGS_FTRACE) -Os $(CC_FLAGS_SCS) \
>  				$(RANDSTRUCT_CFLAGS) $(GCC_PLUGINS_CFLAGS) \
>  				$(CC_FLAGS_LTO) $(CC_FLAGS_CFI) \
>  				-Wmissing-prototypes -Wmissing-declarations
> +CFLAGS_REMOVE_vgetrandom.o = $(CC_FLAGS_FTRACE) -Os $(CC_FLAGS_SCS) \
> +			     $(RANDSTRUCT_CFLAGS) $(GCC_PLUGINS_CFLAGS) \
> +			     $(CC_FLAGS_LTO) $(CC_FLAGS_CFI) \
> +			     -Wmissing-prototypes -Wmissing-declarations
>  
>  CFLAGS_vgettimeofday.o = -O2 -mcmodel=tiny -fasynchronous-unwind-tables
> +CFLAGS_vgetrandom.o = -O2 -mcmodel=tiny -fasynchronous-unwind-tables

You're using identical CFLAGS_ and CFLAGS_REMOVE_ definitions for
vgettimeofday.o and vgetrandom.o. Please refactor this so that they use
common definitions.

> diff --git a/arch/arm64/kernel/vdso/vdso b/arch/arm64/kernel/vdso/vdso
> new file mode 120000
> index 000000000000..233c7a26f6e5
> --- /dev/null
> +++ b/arch/arm64/kernel/vdso/vdso
> @@ -0,0 +1 @@
> +../../../arch/arm64/kernel/vdso
> \ No newline at end of file
> diff --git a/arch/arm64/kernel/vdso/vdso.lds.S b/arch/arm64/kernel/vdso/vdso.lds.S
> index 45354f2ddf70..f204a9ddc833 100644
> --- a/arch/arm64/kernel/vdso/vdso.lds.S
> +++ b/arch/arm64/kernel/vdso/vdso.lds.S
> @@ -11,7 +11,9 @@
>  #include <linux/const.h>
>  #include <asm/page.h>
>  #include <asm/vdso.h>
> +#include <asm/vdso/vsyscall.h>
>  #include <asm-generic/vmlinux.lds.h>
> +#include <vdso/datapage.h>
>  
>  OUTPUT_FORMAT("elf64-littleaarch64", "elf64-bigaarch64", "elf64-littleaarch64")
>  OUTPUT_ARCH(aarch64)
> @@ -19,6 +21,7 @@ OUTPUT_ARCH(aarch64)
>  SECTIONS
>  {
>  	PROVIDE(_vdso_data = . - __VVAR_PAGES * PAGE_SIZE);
> +	PROVIDE(_vdso_rng_data = _vdso_data + __VDSO_RND_DATA_OFFSET);
>  #ifdef CONFIG_TIME_NS
>  	PROVIDE(_timens_data = _vdso_data + PAGE_SIZE);
>  #endif
> @@ -102,6 +105,7 @@ VERSION
>  		__kernel_gettimeofday;
>  		__kernel_clock_gettime;
>  		__kernel_clock_getres;
> +		__kernel_getrandom;
>  	local: *;
>  	};
>  }
> diff --git a/arch/arm64/kernel/vdso/vgetrandom-chacha.S b/arch/arm64/kernel/vdso/vgetrandom-chacha.S
> new file mode 100644
> index 000000000000..9ebf12a09c65
> --- /dev/null
> +++ b/arch/arm64/kernel/vdso/vgetrandom-chacha.S
> @@ -0,0 +1,168 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include <linux/linkage.h>
> +#include <asm/cache.h>
> +#include <asm/assembler.h>
> +
> +	.text
> +
> +#define state0		v0
> +#define state1		v1
> +#define state2		v2
> +#define state3		v3
> +#define copy0		v4
> +#define copy1		v5
> +#define copy2		v6
> +#define copy3		v7
> +#define copy3_d		d7
> +#define one_d		d16
> +#define one_q		q16
> +#define tmp		v17
> +#define rot8		v18
> +
> +/*
> + * ARM64 ChaCha20 implementation meant for vDSO.  Produces a given positive
> + * number of blocks of output with nonce 0, taking an input key and an 8-byte
> + * counter.  Importantly does not spill to the stack.
> + *
> + * void __arch_chacha20_blocks_nostack(uint8_t *dst_bytes,
> + *				       const uint8_t *key,
> + * 				       uint32_t *counter,
> + *				       size_t nblocks)
> + *
> + * 	x0: output bytes
> + *	x1: 32-byte key input
> + *	x2: 8-byte counter input/output
> + *	x3: number of 64-byte blocks to write to output
> + */
> +SYM_FUNC_START(__arch_chacha20_blocks_nostack)

Is there any way we can reuse the existing code in
crypto/chacha-neon-core.S for this? It looks to my untrained eye like
this is an arbitrarily different implementation to what we already have.

> +	/* copy0 = "expand 32-byte k" */
> +	adr_l		x8, CTES
> +	ld1		{copy0.4s}, [x8]
> +	/* copy1,copy2 = key */
> +	ld1		{ copy1.4s, copy2.4s }, [x1]
> +	/* copy3 = counter || zero nonce  */
> +	ldr		copy3_d, [x2]
> +
> +	adr_l		x8, ONE
> +	ldr		one_q, [x8]
> +
> +	adr_l		x10, ROT8
> +	ld1		{rot8.4s}, [x10]
> +.Lblock:
> +	/* copy state to auxiliary vectors for the final add after the permute.  */
> +	mov		state0.16b, copy0.16b
> +	mov		state1.16b, copy1.16b
> +	mov		state2.16b, copy2.16b
> +	mov		state3.16b, copy3.16b
> +
> +	mov		w4, 20
> +.Lpermute:
> +	/*
> +	 * Permute one 64-byte block where the state matrix is stored in the four NEON
> +	 * registers state0-state3.  It performs matrix operations on four words in parallel,
> +	 * but requires shuffling to rearrange the words after each round.
> +	 */
> +
> +.Ldoubleround:
> +	/* state0 += state1, state3 = rotl32(state3 ^ state0, 16) */
> +	add		state0.4s, state0.4s, state1.4s
> +	eor		state3.16b, state3.16b, state0.16b
> +	rev32		state3.8h, state3.8h
> +
> +	/* state2 += state3, state1 = rotl32(state1 ^ state2, 12) */
> +	add		state2.4s, state2.4s, state3.4s
> +	eor		tmp.16b, state1.16b, state2.16b
> +	shl		state1.4s, tmp.4s, #12
> +	sri		state1.4s, tmp.4s, #20
> +
> +	/* state0 += state1, state3 = rotl32(state3 ^ state0, 8) */
> +	add		state0.4s, state0.4s, state1.4s
> +	eor		state3.16b, state3.16b, state0.16b
> +	tbl		state3.16b, {state3.16b}, rot8.16b
> +
> +	/* state2 += state3, state1 = rotl32(state1 ^ state2, 7) */
> +	add		state2.4s, state2.4s, state3.4s
> +	eor		tmp.16b, state1.16b, state2.16b
> +	shl		state1.4s, tmp.4s, #7
> +	sri		state1.4s, tmp.4s, #25
> +
> +	/* state1[0,1,2,3] = state1[1,2,3,0] */
> +	ext		state1.16b, state1.16b, state1.16b, #4
> +	/* state2[0,1,2,3] = state2[2,3,0,1] */
> +	ext		state2.16b, state2.16b, state2.16b, #8
> +	/* state3[0,1,2,3] = state3[3,0,1,2] */
> +	ext		state3.16b, state3.16b, state3.16b, #12
> +
> +	/* state0 += state1, state3 = rotl32(state3 ^ state0, 16) */
> +	add		state0.4s, state0.4s, state1.4s
> +	eor		state3.16b, state3.16b, state0.16b
> +	rev32		state3.8h, state3.8h
> +
> +	/* state2 += state3, state1 = rotl32(state1 ^ state2, 12) */
> +	add		state2.4s, state2.4s, state3.4s
> +	eor		tmp.16b, state1.16b, state2.16b
> +	shl		state1.4s, tmp.4s, #12
> +	sri		state1.4s, tmp.4s, #20
> +
> +	/* state0 += state1, state3 = rotl32(state3 ^ state0, 8) */
> +	add		state0.4s, state0.4s, state1.4s
> +	eor		state3.16b, state3.16b, state0.16b
> +	tbl		state3.16b, {state3.16b}, rot8.16b
> +
> +	/* state2 += state3, state1 = rotl32(state1 ^ state2, 7) */
> +	add		state2.4s, state2.4s, state3.4s
> +	eor		tmp.16b, state1.16b, state2.16b
> +	shl		state1.4s, tmp.4s, #7
> +	sri		state1.4s, tmp.4s, #25
> +
> +	/* state1[0,1,2,3] = state1[3,0,1,2] */
> +	ext		state1.16b, state1.16b, state1.16b, #12
> +	/* state2[0,1,2,3] = state2[2,3,0,1] */
> +	ext		state2.16b, state2.16b, state2.16b, #8
> +	/* state3[0,1,2,3] = state3[1,2,3,0] */
> +	ext		state3.16b, state3.16b, state3.16b, #4
> +
> +	subs		w4, w4, #2
> +	b.ne		.Ldoubleround
> +
> +	/* output0 = state0 + copy0 */
> +	add		state0.4s, state0.4s, copy0.4s
> +	/* output1 = state1 + copy1 */
> +	add		state1.4s, state1.4s, copy1.4s
> +	/* output2 = state2 + copy2 */
> +	add		state2.4s, state2.4s, copy2.4s
> +	/* output3 = state3 + copy3 */
> +	add		state3.4s, state3.4s, copy3.4s
> +	st1		{ state0.4s - state3.4s }, [x0]
> +
> +	/* ++copy3.counter */
> +	add		copy3_d, copy3_d, one_d
> +
> +	/* output += 64, --nblocks */
> +	add		x0, x0, 64
> +	subs		x3, x3, #1
> +	b.ne		.Lblock
> +
> +	/* counter = copy3.counter */
> +	str		copy3_d, [x2]
> +
> +	/* Zero out the potentially sensitive regs, in case nothing uses these again. */
> +	eor		state0.16b, state0.16b, state0.16b
> +	eor		state1.16b, state1.16b, state1.16b
> +	eor		state2.16b, state2.16b, state2.16b
> +	eor		state3.16b, state3.16b, state3.16b
> +	eor		copy1.16b, copy1.16b, copy1.16b
> +	eor		copy2.16b, copy2.16b, copy2.16b
> +	ret
> +SYM_FUNC_END(__arch_chacha20_blocks_nostack)
> +
> +        .section        ".rodata", "a", %progbits
> +        .align          L1_CACHE_SHIFT
> +
> +CTES:	.word		1634760805, 857760878, 	2036477234, 1797285236
> +ONE:    .xword		1, 0
> +ROT8:	.word		0x02010003, 0x06050407, 0x0a09080b, 0x0e0d0c0f
> +
> +emit_aarch64_feature_1_and
> diff --git a/arch/arm64/kernel/vdso/vgetrandom.c b/arch/arm64/kernel/vdso/vgetrandom.c
> new file mode 100644
> index 000000000000..0833d25f3121
> --- /dev/null
> +++ b/arch/arm64/kernel/vdso/vgetrandom.c
> @@ -0,0 +1,15 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +typeof(__cvdso_getrandom) __kernel_getrandom;
> +
> +ssize_t __kernel_getrandom(void *buffer, size_t len, unsigned int flags, void *opaque_state, size_t opaque_len)
> +{
> +	asm goto (
> +	ALTERNATIVE("b %[fallback]", "nop", RM64_HAS_FPSIMD) : : : : fallback);

"RM64_HAS_FPSIMD". Are you sure you've tested this?

> +	return __cvdso_getrandom(buffer, len, flags, opaque_state, opaque_len);
> +
> +fallback:
> +	if (unlikely(opaque_len == ~0UL && !buffer && !len && !flags))
> +		return -ENOSYS;
> +	return getrandom_syscall(buffer, len, flags);
> +}
> diff --git a/lib/vdso/getrandom.c b/lib/vdso/getrandom.c
> index 938ca539aaa6..7c9711248d9b 100644
> --- a/lib/vdso/getrandom.c
> +++ b/lib/vdso/getrandom.c
> @@ -5,6 +5,7 @@
>  
>  #include <linux/array_size.h>
>  #include <linux/minmax.h>
> +#include <linux/mm.h>
>  #include <vdso/datapage.h>
>  #include <vdso/getrandom.h>
>  #include <vdso/unaligned.h>

Looks like this should be a separate change?

Will
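
For readers following the vgetrandom.c hunk, the control flow of __kernel_getrandom() in this version can be modelled in plain C. This is only a sketch of the decision logic, not kernel code: the enum, the function name choose_path, and the has_fpsimd parameter (standing in for the ALTERNATIVE patching on the FPSIMD capability) are all hypothetical. The all-zero call with opaque_len == ~0UL appears to be the parameter query handled by the generic code, which the fallback answers with -ENOSYS.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical labels for the three outcomes; not kernel identifiers. */
enum vgetrandom_path {
	PATH_VDSO_FAST,		/* __cvdso_getrandom() would run */
	PATH_PROBE_ENOSYS,	/* parameter query, answered with -ENOSYS */
	PATH_SYSCALL,		/* getrandom_syscall() would run */
};

/* Mirrors the branch structure of the patch's __kernel_getrandom(). */
static enum vgetrandom_path choose_path(bool has_fpsimd, const void *buffer,
					 size_t len, unsigned int flags,
					 size_t opaque_len)
{
	/* With FPSIMD the "b fallback" is patched to a nop. */
	if (has_fpsimd)
		return PATH_VDSO_FAST;

	/* Fallback: same test as in the patch. */
	if (opaque_len == ~0UL && !buffer && !len && !flags)
		return PATH_PROBE_ENOSYS;

	return PATH_SYSCALL;
}
```

The point of the -ENOSYS branch is that a caller probing for vDSO support on a CPU without FPSIMD learns that the fast path is unavailable, while ordinary requests still succeed through the syscall.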

Jason A. Donenfeld Aug. 30, 2024, 12:04 p.m. UTC | #3

> > +SYM_FUNC_START(__arch_chacha20_blocks_nostack)
> 
> Is there any way we can reuse the existing code in
> crypto/chacha-neon-core.S for this? It looks to my untrained eye like
> this is an arbitrarily different implementation to what we already have.

Nope, it is indeed different, and not arbitrarily so. This patch is
mirroring exactly what we did on x86.

Jason
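
For reference, the computation that __arch_chacha20_blocks_nostack() is documented to perform (ChaCha20 with a zero nonce, a 32-byte key and an 8-byte counter that is advanced once per 64-byte block) can be written as a short portable C sketch. This is not code from the patch: chacha20_blocks_ref and its helpers are illustrative names, and unlike the vDSO assembly this version uses the stack freely, so it is only useful for checking output.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static inline uint32_t rotl32(uint32_t v, unsigned int n)
{
	return (v << n) | (v >> (32 - n));
}

/* One ChaCha quarter round on four state words. */
#define QR(a, b, c, d) do {			\
	a += b; d = rotl32(d ^ a, 16);		\
	c += d; b = rotl32(b ^ c, 12);		\
	a += b; d = rotl32(d ^ a, 8);		\
	c += d; b = rotl32(b ^ c, 7);		\
} while (0)

static uint32_t load_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
	       (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

static void store_le32(uint8_t *p, uint32_t v)
{
	p[0] = v; p[1] = v >> 8; p[2] = v >> 16; p[3] = v >> 24;
}

/* Reference sketch only: same arguments as the assembly routine. */
static void chacha20_blocks_ref(uint8_t *dst, const uint8_t key[32],
				uint32_t counter[2], size_t nblocks)
{
	uint32_t init[16], x[16];
	int i;

	assert(nblocks > 0);

	/* "expand 32-byte k", the same values as the CTES table. */
	init[0] = 0x61707865; init[1] = 0x3320646e;
	init[2] = 0x79622d32; init[3] = 0x6b206574;
	for (i = 0; i < 8; i++)
		init[4 + i] = load_le32(key + 4 * i);
	init[12] = counter[0];		/* 64-bit counter, low word first */
	init[13] = counter[1];
	init[14] = init[15] = 0;	/* zero nonce */

	while (nblocks--) {
		memcpy(x, init, sizeof(x));
		for (i = 0; i < 10; i++) {	/* 10 double rounds = 20 rounds */
			QR(x[0], x[4], x[8],  x[12]);	/* columns */
			QR(x[1], x[5], x[9],  x[13]);
			QR(x[2], x[6], x[10], x[14]);
			QR(x[3], x[7], x[11], x[15]);
			QR(x[0], x[5], x[10], x[15]);	/* diagonals */
			QR(x[1], x[6], x[11], x[12]);
			QR(x[2], x[7], x[8],  x[13]);
			QR(x[3], x[4], x[9],  x[14]);
		}
		for (i = 0; i < 16; i++)
			store_le32(dst + 4 * i, x[i] + init[i]);
		dst += 64;
		if (++init[12] == 0)	/* carry into the high counter word */
			++init[13];
	}
	counter[0] = init[12];
	counter[1] = init[13];
}
```

The NEON version keeps each row of this 4x4 state in one vector register, which is why it needs the ext shuffles between the column and diagonal halves of a double round instead of the index changes used here.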

Adhemerval Zanella Netto Aug. 30, 2024, 12:28 p.m. UTC | #4

On 30/08/24 08:46, Will Deacon wrote:
> On Thu, Aug 29, 2024 at 08:17:14PM +0000, Adhemerval Zanella wrote:
>> Hook up the generic vDSO implementation to the aarch64 vDSO data page.
>> The _vdso_rng_data required data is placed within the _vdso_data vvar
>> page, by using a offset larger than the vdso_data.
>>
>> The vDSO function requires a ChaCha20 implementation that does not
>> write to the stack, and that can do an entire ChaCha20 permutation.
>> The one provided is based on the current chacha-neon-core.S and uses NEON
>> on the permute operation. The fallback for chips that do not support
>> NEON issues the syscall.
>>
>> This also passes the vdso_test_chacha test along with
>> vdso_test_getrandom. The vdso_test_getrandom bench-single result on
>> Neoverse-N1 shows:
>>
>>    vdso: 25000000 times in 0.746506464 seconds
>>    libc: 25000000 times in 8.849179444 seconds
>> syscall: 25000000 times in 8.818726425 seconds
>>
>> Changes from v1:
>> - Fixed style issues and typos.
>> - Added fallback for systems without NEON support.
>> - Avoid use of non-volatile vector registers in neon chacha20.
>> - Use c-getrandom-y for vgetrandom.c.
>> - Fixed TIMENS vdso_rnd_data access.
>>
>> Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
>> ---
>>  arch/arm64/Kconfig                         |   1 +
>>  arch/arm64/include/asm/vdso.h              |   6 +
>>  arch/arm64/include/asm/vdso/getrandom.h    |  49 ++++++
>>  arch/arm64/include/asm/vdso/vsyscall.h     |  10 ++
>>  arch/arm64/kernel/vdso.c                   |   6 -
>>  arch/arm64/kernel/vdso/Makefile            |  11 +-
>>  arch/arm64/kernel/vdso/vdso                |   1 +
>>  arch/arm64/kernel/vdso/vdso.lds.S          |   4 +
>>  arch/arm64/kernel/vdso/vgetrandom-chacha.S | 168 +++++++++++++++++++++
>>  arch/arm64/kernel/vdso/vgetrandom.c        |  15 ++
>>  lib/vdso/getrandom.c                       |   1 +
>>  tools/arch/arm64/vdso                      |   1 +
>>  tools/include/linux/compiler.h             |   4 +
>>  tools/testing/selftests/vDSO/Makefile      |   5 +-
> 
> Please can you split the tools/ changes into a separate patch?

Alright. It would need to come after the patch that adds vgetrandom-chacha.S,
otherwise vdso_test_chacha will not build on aarch64.

> 
>>  14 files changed, 273 insertions(+), 9 deletions(-)
>>  create mode 100644 arch/arm64/include/asm/vdso/getrandom.h
>>  create mode 120000 arch/arm64/kernel/vdso/vdso
>>  create mode 100644 arch/arm64/kernel/vdso/vgetrandom-chacha.S
>>  create mode 100644 arch/arm64/kernel/vdso/vgetrandom.c
>>  create mode 120000 tools/arch/arm64/vdso
>>
>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>> index a2f8ff354ca6..7f7424d1b3b8 100644
>> --- a/arch/arm64/Kconfig
>> +++ b/arch/arm64/Kconfig
>> @@ -262,6 +262,7 @@ config ARM64
>>  	select TRACE_IRQFLAGS_NMI_SUPPORT
>>  	select HAVE_SOFTIRQ_ON_OWN_STACK
>>  	select USER_STACKTRACE_SUPPORT
>> +	select VDSO_GETRANDOM
>>  	help
>>  	  ARM 64-bit (AArch64) Linux support.
>>  
>> diff --git a/arch/arm64/include/asm/vdso.h b/arch/arm64/include/asm/vdso.h
>> index 4305995c8f82..18407b757c95 100644
>> --- a/arch/arm64/include/asm/vdso.h
>> +++ b/arch/arm64/include/asm/vdso.h
>> @@ -16,6 +16,12 @@
>>  
>>  #ifndef __ASSEMBLY__
>>  
>> +enum vvar_pages {
>> +	VVAR_DATA_PAGE_OFFSET,
>> +	VVAR_TIMENS_PAGE_OFFSET,
>> +	VVAR_NR_PAGES,
>> +};
>> +
>>  #include <generated/vdso-offsets.h>
>>  
>>  #define VDSO_SYMBOL(base, name)						   \
>> diff --git a/arch/arm64/include/asm/vdso/getrandom.h b/arch/arm64/include/asm/vdso/getrandom.h
>> new file mode 100644
>> index 000000000000..fca66ba49d4c
>> --- /dev/null
>> +++ b/arch/arm64/include/asm/vdso/getrandom.h
>> @@ -0,0 +1,49 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +
>> +#ifndef __ASM_VDSO_GETRANDOM_H
>> +#define __ASM_VDSO_GETRANDOM_H
>> +
>> +#ifndef __ASSEMBLY__
>> +
>> +#include <asm/vdso.h>
>> +#include <asm/unistd.h>
>> +#include <vdso/datapage.h>
>> +
>> +/**
>> + * getrandom_syscall - Invoke the getrandom() syscall.
>> + * @buffer:	Destination buffer to fill with random bytes.
>> + * @len:	Size of @buffer in bytes.
>> + * @flags:	Zero or more GRND_* flags.
>> + * Returns:	The number of random bytes written to @buffer, or a negative value indicating an error.
>> + */
>> +static __always_inline ssize_t getrandom_syscall(void *_buffer, size_t _len, unsigned int _flags)
>> +{
>> +	register void *buffer asm ("x0") = _buffer;
>> +	register size_t len asm ("x1") = _len;
>> +	register unsigned int flags asm ("x2") = _flags;
>> +	register long ret asm ("x0");
>> +	register long nr asm ("x8") = __NR_getrandom;
>> +
>> +	asm volatile(
>> +	"       svc #0\n"
>> +	: "=r" (ret)
>> +	: "r" (buffer), "r" (len), "r" (flags), "r" (nr)
>> +	: "memory");
>> +
>> +	return ret;
>> +}
>> +
>> +static __always_inline const struct vdso_rng_data *__arch_get_vdso_rng_data(void)
>> +{
>> +	/*
>> +	 * If a task belongs to a time namespace then a namespace the real
>> +	 * VVAR page is mapped with the VVAR_TIMENS_PAGE_OFFSET.
>> +	 */
> 
> This comment doesn't make sense.

I rephrased it from arch/arm64/kernel/vdso.c (vvar_fault). Did I confuse something?

The check itself is indeed required; otherwise the getrandom vDSO in a time
namespace does not see the generation counter correctly and falls back to the
syscall.

> 
>> +	if (IS_ENABLED(CONFIG_TIME_NS) && _vdso_data->clock_mode == VDSO_CLOCKMODE_TIMENS)
>> +		return (void*)&_vdso_rng_data + VVAR_TIMENS_PAGE_OFFSET * PAGE_SIZE;
>> +	return &_vdso_rng_data;
>> +}
>> +
>> +#endif /* !__ASSEMBLY__ */
>> +
>> +#endif /* __ASM_VDSO_GETRANDOM_H */
>> diff --git a/arch/arm64/include/asm/vdso/vsyscall.h b/arch/arm64/include/asm/vdso/vsyscall.h
>> index f94b1457c117..2a87f0e1b144 100644
>> --- a/arch/arm64/include/asm/vdso/vsyscall.h
>> +++ b/arch/arm64/include/asm/vdso/vsyscall.h
>> @@ -2,8 +2,11 @@
>>  #ifndef __ASM_VDSO_VSYSCALL_H
>>  #define __ASM_VDSO_VSYSCALL_H
>>  
>> +#define __VDSO_RND_DATA_OFFSET  480
> 
> Why 480?

I used the x86 strategy for placing the vdso_rng_data alongside the vdso_data,
but I could not make the vdso_data generation fit with the linker
script machinery required for vdso.lds.S (I think Jason faced a similar
issue with x86).

I will try to refactor the vdso_data definition in a subsequent patch so
that the vdso_rng_data is placed in a common struct.  It does not help that
each architecture now seems to place the vdso_rng_data in a different
place.
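
To make the layout in this version concrete, the address arithmetic can be sketched as below. This is an illustration, not part of the patch: the SKETCH_ names and both helpers are hypothetical, the 4 KiB page size is an assumption (arm64 also supports 16 KiB and 64 KiB pages), and the 480 and 1 values are taken from __VDSO_RND_DATA_OFFSET and VVAR_TIMENS_PAGE_OFFSET in the diff.

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE		4096UL	/* assumption: 4 KiB pages */
#define SKETCH_RND_DATA_OFFSET		480UL	/* __VDSO_RND_DATA_OFFSET */
#define SKETCH_TIMENS_PAGE_OFFSET	1UL	/* VVAR_TIMENS_PAGE_OFFSET */

/*
 * Offset of the rng data from _vdso_data as seen by the vDSO code,
 * following __arch_get_vdso_rng_data() in the patch.
 */
static size_t rng_data_offset(bool in_timens)
{
	size_t off = SKETCH_RND_DATA_OFFSET;

	/* In a time namespace the real data page sits one page further on. */
	if (in_timens)
		off += SKETCH_TIMENS_PAGE_OFFSET * SKETCH_PAGE_SIZE;
	return off;
}

/* The fixed offset only works while vdso_data stays below it. */
static bool vdso_data_fits(size_t sizeof_vdso_data)
{
	return sizeof_vdso_data <= SKETCH_RND_DATA_OFFSET;
}
```

A build-time assertion along the lines of vdso_data_fits() would be one way to catch a future growth of vdso_data past the hard-coded offset.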

> 
>> +
>>  #ifndef __ASSEMBLY__
>>  
>> +#include <asm/vdso.h>
>>  #include <linux/timekeeper_internal.h>
>>  #include <vdso/datapage.h>
>>  
>> @@ -21,6 +24,13 @@ struct vdso_data *__arm64_get_k_vdso_data(void)
>>  }
>>  #define __arch_get_k_vdso_data __arm64_get_k_vdso_data
>>  
>> +static __always_inline
>> +struct vdso_rng_data *__arm64_get_k_vdso_rnd_data(void)
>> +{
>> +	return (void*)vdso_data + __VDSO_RND_DATA_OFFSET;
>> +}
>> +#define __arch_get_k_vdso_rng_data __arm64_get_k_vdso_rnd_data
>> +
>>  static __always_inline
>>  void __arm64_update_vsyscall(struct vdso_data *vdata, struct timekeeper *tk)
>>  {
>> diff --git a/arch/arm64/kernel/vdso.c b/arch/arm64/kernel/vdso.c
>> index 89b6e7840002..706c9c3a7a50 100644
>> --- a/arch/arm64/kernel/vdso.c
>> +++ b/arch/arm64/kernel/vdso.c
>> @@ -34,12 +34,6 @@ enum vdso_abi {
>>  	VDSO_ABI_AA32,
>>  };
>>  
>> -enum vvar_pages {
>> -	VVAR_DATA_PAGE_OFFSET,
>> -	VVAR_TIMENS_PAGE_OFFSET,
>> -	VVAR_NR_PAGES,
>> -};
>> -
>>  struct vdso_abi_info {
>>  	const char *name;
>>  	const char *vdso_code_start;
>> diff --git a/arch/arm64/kernel/vdso/Makefile b/arch/arm64/kernel/vdso/Makefile
>> index d11da6461278..50246a38d6bd 100644
>> --- a/arch/arm64/kernel/vdso/Makefile
>> +++ b/arch/arm64/kernel/vdso/Makefile
>> @@ -9,7 +9,7 @@
>>  # Include the generic Makefile to check the built vdso.
>>  include $(srctree)/lib/vdso/Makefile
>>  
>> -obj-vdso := vgettimeofday.o note.o sigreturn.o
>> +obj-vdso := vgettimeofday.o note.o sigreturn.o vgetrandom.o vgetrandom-chacha.o
>>  
>>  # Build rules
>>  targets := $(obj-vdso) vdso.so vdso.so.dbg
>> @@ -40,13 +40,22 @@ CFLAGS_REMOVE_vgettimeofday.o = $(CC_FLAGS_FTRACE) -Os $(CC_FLAGS_SCS) \
>>  				$(RANDSTRUCT_CFLAGS) $(GCC_PLUGINS_CFLAGS) \
>>  				$(CC_FLAGS_LTO) $(CC_FLAGS_CFI) \
>>  				-Wmissing-prototypes -Wmissing-declarations
>> +CFLAGS_REMOVE_vgetrandom.o = $(CC_FLAGS_FTRACE) -Os $(CC_FLAGS_SCS) \
>> +			     $(RANDSTRUCT_CFLAGS) $(GCC_PLUGINS_CFLAGS) \
>> +			     $(CC_FLAGS_LTO) $(CC_FLAGS_CFI) \
>> +			     -Wmissing-prototypes -Wmissing-declarations
>>  
>>  CFLAGS_vgettimeofday.o = -O2 -mcmodel=tiny -fasynchronous-unwind-tables
>> +CFLAGS_vgetrandom.o = -O2 -mcmodel=tiny -fasynchronous-unwind-tables
> 
> You're using identical CFLAGS_ and CFLAGS_REMOVE_ definitions for
> vgettimeofdat.o and vgetrandom.o. Please refactor this so that they use
> common definitions.

Ack.

> 
>> diff --git a/arch/arm64/kernel/vdso/vdso b/arch/arm64/kernel/vdso/vdso
>> new file mode 120000
>> index 000000000000..233c7a26f6e5
>> --- /dev/null
>> +++ b/arch/arm64/kernel/vdso/vdso
>> @@ -0,0 +1 @@
>> +../../../arch/arm64/kernel/vdso
>> \ No newline at end of file
>> diff --git a/arch/arm64/kernel/vdso/vdso.lds.S b/arch/arm64/kernel/vdso/vdso.lds.S
>> index 45354f2ddf70..f204a9ddc833 100644
>> --- a/arch/arm64/kernel/vdso/vdso.lds.S
>> +++ b/arch/arm64/kernel/vdso/vdso.lds.S
>> @@ -11,7 +11,9 @@
>>  #include <linux/const.h>
>>  #include <asm/page.h>
>>  #include <asm/vdso.h>
>> +#include <asm/vdso/vsyscall.h>
>>  #include <asm-generic/vmlinux.lds.h>
>> +#include <vdso/datapage.h>
>>  
>>  OUTPUT_FORMAT("elf64-littleaarch64", "elf64-bigaarch64", "elf64-littleaarch64")
>>  OUTPUT_ARCH(aarch64)
>> @@ -19,6 +21,7 @@ OUTPUT_ARCH(aarch64)
>>  SECTIONS
>>  {
>>  	PROVIDE(_vdso_data = . - __VVAR_PAGES * PAGE_SIZE);
>> +	PROVIDE(_vdso_rng_data = _vdso_data + __VDSO_RND_DATA_OFFSET);
>>  #ifdef CONFIG_TIME_NS
>>  	PROVIDE(_timens_data = _vdso_data + PAGE_SIZE);
>>  #endif
>> @@ -102,6 +105,7 @@ VERSION
>>  		__kernel_gettimeofday;
>>  		__kernel_clock_gettime;
>>  		__kernel_clock_getres;
>> +		__kernel_getrandom;
>>  	local: *;
>>  	};
>>  }
>> diff --git a/arch/arm64/kernel/vdso/vgetrandom-chacha.S b/arch/arm64/kernel/vdso/vgetrandom-chacha.S
>> new file mode 100644
>> index 000000000000..9ebf12a09c65
>> --- /dev/null
>> +++ b/arch/arm64/kernel/vdso/vgetrandom-chacha.S
>> @@ -0,0 +1,168 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +
>> +#include <linux/linkage.h>
>> +#include <asm/cache.h>
>> +#include <asm/assembler.h>
>> +
>> +	.text
>> +
>> +#define state0		v0
>> +#define state1		v1
>> +#define state2		v2
>> +#define state3		v3
>> +#define copy0		v4
>> +#define copy1		v5
>> +#define copy2		v6
>> +#define copy3		v7
>> +#define copy3_d		d7
>> +#define one_d		d16
>> +#define one_q		q16
>> +#define tmp		v17
>> +#define rot8		v18
>> +
>> +/*
>> + * ARM64 ChaCha20 implementation meant for vDSO.  Produces a given positive
>> + * number of blocks of output with nonce 0, taking an input key and 8-bytes
>> + * counter.  Importantly does not spill to the stack.
>> + *
>> + * void __arch_chacha20_blocks_nostack(uint8_t *dst_bytes,
>> + *				       const uint8_t *key,
>> + * 				       uint32_t *counter,
>> + *				       size_t nblocks)
>> + *
>> + * 	x0: output bytes
>> + *	x1: 32-byte key input
>> + *	x2: 8-byte counter input/output
>> + *	x3: number of 64-byte block to write to output
>> + */
>> +SYM_FUNC_START(__arch_chacha20_blocks_nostack)
> 
> Is there any way we can reuse the existing code in
> crypto/chacha-neon-core.S for this? It looks to my untrained eye like
> this is an arbitrarily different implementation to what we already have.
> 
>> +	/* copy0 = "expand 32-byte k" */
>> +	adr_l		x8, CTES
>> +	ld1		{copy0.4s}, [x8]
>> +	/* copy1,copy2 = key */
>> +	ld1		{ copy1.4s, copy2.4s }, [x1]
>> +	/* copy3 = counter || zero nonce  */
>> +	ldr		copy3_d, [x2]
>> +
>> +	adr_l		x8, ONE
>> +	ldr		one_q, [x8]
>> +
>> +	adr_l		x10, ROT8
>> +	ld1		{rot8.4s}, [x10]
>> +.Lblock:
>> +	/* copy state to auxiliary vectors for the final add after the permute.  */
>> +	mov		state0.16b, copy0.16b
>> +	mov		state1.16b, copy1.16b
>> +	mov		state2.16b, copy2.16b
>> +	mov		state3.16b, copy3.16b
>> +
>> +	mov		w4, 20
>> +.Lpermute:
>> +	/*
>> +	 * Permute one 64-byte block where the state matrix is stored in the four NEON
>> +	 * registers state0-state3.  It performs matrix operations on four words in parallel,
>> +	 * but requires shuffling to rearrange the words after each round.
>> +	 */
>> +
>> +.Ldoubleround:
>> +	/* state0 += state1, state3 = rotl32(state3 ^ state0, 16) */
>> +	add		state0.4s, state0.4s, state1.4s
>> +	eor		state3.16b, state3.16b, state0.16b
>> +	rev32		state3.8h, state3.8h
>> +
>> +	/* state2 += state3, state1 = rotl32(state1 ^ state2, 12) */
>> +	add		state2.4s, state2.4s, state3.4s
>> +	eor		tmp.16b, state1.16b, state2.16b
>> +	shl		state1.4s, tmp.4s, #12
>> +	sri		state1.4s, tmp.4s, #20
>> +
>> +	/* state0 += state1, state3 = rotl32(state3 ^ state0, 8) */
>> +	add		state0.4s, state0.4s, state1.4s
>> +	eor		state3.16b, state3.16b, state0.16b
>> +	tbl		state3.16b, {state3.16b}, rot8.16b
>> +
>> +	/* state2 += state3, state1 = rotl32(state1 ^ state2, 7) */
>> +	add		state2.4s, state2.4s, state3.4s
>> +	eor		tmp.16b, state1.16b, state2.16b
>> +	shl		state1.4s, tmp.4s, #7
>> +	sri		state1.4s, tmp.4s, #25
>> +
>> +	/* state1[0,1,2,3] = state1[1,2,3,0] */
>> +	ext		state1.16b, state1.16b, state1.16b, #4
>> +	/* state2[0,1,2,3] = state2[2,3,0,1] */
>> +	ext		state2.16b, state2.16b, state2.16b, #8
>> +	/* state3[0,1,2,3] = state3[1,2,3,0] */
>> +	ext		state3.16b, state3.16b, state3.16b, #12
>> +
>> +	/* state0 += state1, state3 = rotl32(state3 ^ state0, 16) */
>> +	add		state0.4s, state0.4s, state1.4s
>> +	eor		state3.16b, state3.16b, state0.16b
>> +	rev32		state3.8h, state3.8h
>> +
>> +	/* state2 += state3, state1 = rotl32(state1 ^ state2, 12) */
>> +	add		state2.4s, state2.4s, state3.4s
>> +	eor		tmp.16b, state1.16b, state2.16b
>> +	shl		state1.4s, tmp.4s, #12
>> +	sri		state1.4s, tmp.4s, #20
>> +
>> +	/* state0 += state1, state3 = rotl32(state3 ^ state0, 8) */
>> +	add		state0.4s, state0.4s, state1.4s
>> +	eor		state3.16b, state3.16b, state0.16b
>> +	tbl		state3.16b, {state3.16b}, rot8.16b
>> +
>> +	/* state2 += state3, state1 = rotl32(state1 ^ state2, 7) */
>> +	add		state2.4s, state2.4s, state3.4s
>> +	eor		tmp.16b, state1.16b, state2.16b
>> +	shl		state1.4s, tmp.4s, #7
>> +	sri		state1.4s, tmp.4s, #25
>> +
>> +	/* state1[0,1,2,3] = state1[3,0,1,2] */
>> +	ext		state1.16b, state1.16b, state1.16b, #12
>> +	/* state2[0,1,2,3] = state2[2,3,0,1] */
>> +	ext		state2.16b, state2.16b, state2.16b, #8
>> +	/* state3[0,1,2,3] = state3[1,2,3,0] */
>> +	ext		state3.16b, state3.16b, state3.16b, #4
>> +
>> +	subs		w4, w4, #2
>> +	b.ne		.Ldoubleround
>> +
>> +	/* output0 = state0 + state0 */
>> +	add		state0.4s, state0.4s, copy0.4s
>> +	/* output1 = state1 + state1 */
>> +	add		state1.4s, state1.4s, copy1.4s
>> +	/* output2 = state2 + state2 */
>> +	add		state2.4s, state2.4s, copy2.4s
>> +	/* output2 = state3 + state3 */
>> +	add		state3.4s, state3.4s, copy3.4s
>> +	st1		{ state0.4s - state3.4s }, [x0]
>> +
>> +	/* ++copy3.counter */
>> +	add		copy3_d, copy3_d, one_d
>> +
>> +	/* output += 64, --nblocks */
>> +	add		x0, x0, 64
>> +	subs		x3, x3, #1
>> +	b.ne		.Lblock
>> +
>> +	/* counter = copy3.counter */
>> +	str		copy3_d, [x2]
>> +
>> +	/* Zero out the potentially sensitive regs, in case nothing uses these again. */
>> +	eor		state0.16b, state0.16b, state0.16b
>> +	eor		state1.16b, state1.16b, state1.16b
>> +	eor		state2.16b, state2.16b, state2.16b
>> +	eor		state3.16b, state3.16b, state3.16b
>> +	eor		copy1.16b, copy1.16b, copy1.16b
>> +	eor		copy2.16b, copy2.16b, copy2.16b
>> +	ret
>> +SYM_FUNC_END(__arch_chacha20_blocks_nostack)
>> +
>> +        .section        ".rodata", "a", %progbits
>> +        .align          L1_CACHE_SHIFT
>> +
>> +CTES:	.word		1634760805, 857760878, 	2036477234, 1797285236
>> +ONE:    .xword		1, 0
>> +ROT8:	.word		0x02010003, 0x06050407, 0x0a09080b, 0x0e0d0c0f
>> +
>> +emit_aarch64_feature_1_and
>> diff --git a/arch/arm64/kernel/vdso/vgetrandom.c b/arch/arm64/kernel/vdso/vgetrandom.c
>> new file mode 100644
>> index 000000000000..0833d25f3121
>> --- /dev/null
>> +++ b/arch/arm64/kernel/vdso/vgetrandom.c
>> @@ -0,0 +1,15 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +
>> +typeof(__cvdso_getrandom) __kernel_getrandom;
>> +
>> +ssize_t __kernel_getrandom(void *buffer, size_t len, unsigned int flags, void *opaque_state, size_t opaque_len)
>> +{
>> +	asm goto (
>> +	ALTERNATIVE("b %[fallback]", "nop", RM64_HAS_FPSIMD) : : : : fallback);
> 
> "RM64_HAS_FPSIMD". Are you sure you've tested this?

I am not sure why the build did not fail (I double-checked, and it does not
generate a wrong relocation), or why the vDSO still seems to have the nop in
the expected place. I have changed it to ARM64_HAS_FPSIMD.

> 
>> +	return __cvdso_getrandom(buffer, len, flags, opaque_state, opaque_len);
>> +
>> +fallback:
>> +	if (unlikely(opaque_len == ~0UL && !buffer && !len && !flags))
>> +		return -ENOSYS;
>> +	return getrandom_syscall(buffer, len, flags);
>> +}
>> diff --git a/lib/vdso/getrandom.c b/lib/vdso/getrandom.c
>> index 938ca539aaa6..7c9711248d9b 100644
>> --- a/lib/vdso/getrandom.c
>> +++ b/lib/vdso/getrandom.c
>> @@ -5,6 +5,7 @@
>>  
>>  #include <linux/array_size.h>
>>  #include <linux/minmax.h>
>> +#include <linux/mm.h>
>>  #include <vdso/datapage.h>
>>  #include <vdso/getrandom.h>
>>  #include <vdso/unaligned.h>
> 
> Looks like this should be a separate change?


It is required so that arm64 can use c-getrandom-y; otherwise the
vgetrandom.o build fails:

CC      arch/arm64/kernel/vdso/vgetrandom.o
In file included from ./include/uapi/linux/mman.h:5,
                 from /mnt/projects/linux/linux-git/lib/vdso/getrandom.c:13,
                 from <command-line>:
./arch/arm64/include/asm/mman.h: In function ‘arch_calc_vm_prot_bits’:
./arch/arm64/include/asm/mman.h:14:13: error: implicit declaration of function ‘system_supports_bti’ [-Werror=implicit-function-declaration]
   14 |         if (system_supports_bti() && (prot & PROT_BTI))
      |             ^~~~~~~~~~~~~~~~~~~
./arch/arm64/include/asm/mman.h:15:24: error: ‘VM_ARM64_BTI’ undeclared (first use in this function); did you mean ‘ARM64_BTI’?
   15 |                 ret |= VM_ARM64_BTI;
      |                        ^~~~~~~~~~~~
      |                        ARM64_BTI
./arch/arm64/include/asm/mman.h:15:24: note: each undeclared identifier is reported only once for each function it appears in
./arch/arm64/include/asm/mman.h:17:13: error: implicit declaration of function ‘system_supports_mte’ [-Werror=implicit-function-declaration]
   17 |         if (system_supports_mte() && (prot & PROT_MTE))
      |             ^~~~~~~~~~~~~~~~~~~
./arch/arm64/include/asm/mman.h:18:24: error: ‘VM_MTE’ undeclared (first use in this function)
   18 |                 ret |= VM_MTE;
      |                        ^~~~~~
./arch/arm64/include/asm/mman.h: In function ‘arch_calc_vm_flag_bits’:
./arch/arm64/include/asm/mman.h:32:24: error: ‘VM_MTE_ALLOWED’ undeclared (first use in this function)
   32 |                 return VM_MTE_ALLOWED;
      |                        ^~~~~~~~~~~~~~
./arch/arm64/include/asm/mman.h: In function ‘arch_validate_flags’:
./arch/arm64/include/asm/mman.h:59:29: error: ‘VM_MTE’ undeclared (first use in this function)
   59 |         return !(vm_flags & VM_MTE) || (vm_flags & VM_MTE_ALLOWED);
      |                             ^~~~~~
./arch/arm64/include/asm/mman.h:59:52: error: ‘VM_MTE_ALLOWED’ undeclared (first use in this function)
   59 |         return !(vm_flags & VM_MTE) || (vm_flags & VM_MTE_ALLOWED);
      |                                                    ^~~~~~~~~~~~~~
arch/arm64/kernel/vdso/vgetrandom.c: In function ‘__kernel_getrandom’:
arch/arm64/kernel/vdso/vgetrandom.c:18:25: error: ‘ENOSYS’ undeclared (first use in this function); did you mean ‘ENOSPC’?
   18 |                 return -ENOSYS;
      |                         ^~~~~~
      |                         ENOSPC
cc1: some warnings being treated as errors

I can move it to a separate patch, but it is really tied to this one.

Ard Biesheuvel Aug. 30, 2024, 2:11 p.m. UTC | #5

On Thu, 29 Aug 2024 at 22:17, Adhemerval Zanella
<adhemerval.zanella@linaro.org> wrote:
>
> Hook up the generic vDSO implementation to the aarch64 vDSO data page.
> The _vdso_rng_data required data is placed within the _vdso_data vvar
> page, by using an offset larger than the vdso_data.
>
> The vDSO function requires a ChaCha20 implementation that does not
> write to the stack, and that can do an entire ChaCha20 permutation.
> The one provided is based on the current chacha-neon-core.S and uses NEON
> on the permute operation. The fallback for chips that do not support
> NEON issues the syscall.
>
> This also passes the vdso_test_chacha test along with
> vdso_test_getrandom. The vdso_test_getrandom bench-single result on
> Neoverse-N1 shows:
>
>    vdso: 25000000 times in 0.746506464 seconds
>    libc: 25000000 times in 8.849179444 seconds
> syscall: 25000000 times in 8.818726425 seconds
>
> Changes from v1:
> - Fixed style issues and typos.
> - Added fallback for systems without NEON support.
> - Avoid use of non-volatile vector registers in neon chacha20.
> - Use c-getrandom-y for vgetrandom.c.
> - Fixed TIMENS vdso_rnd_data access.
>
> Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
> ---
...
> diff --git a/arch/arm64/kernel/vdso/vgetrandom-chacha.S b/arch/arm64/kernel/vdso/vgetrandom-chacha.S
> new file mode 100644
> index 000000000000..9ebf12a09c65
> --- /dev/null
> +++ b/arch/arm64/kernel/vdso/vgetrandom-chacha.S
> @@ -0,0 +1,168 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include <linux/linkage.h>
> +#include <asm/cache.h>
> +#include <asm/assembler.h>
> +
> +       .text
> +
> +#define state0         v0
> +#define state1         v1
> +#define state2         v2
> +#define state3         v3
> +#define copy0          v4
> +#define copy1          v5
> +#define copy2          v6
> +#define copy3          v7
> +#define copy3_d                d7
> +#define one_d          d16
> +#define one_q          q16
> +#define tmp            v17
> +#define rot8           v18
> +

Please make a note somewhere around here that you are deliberately
avoiding d8-d15 because they are callee-save in user space.

> +/*
> + * ARM64 ChaCha20 implementation meant for vDSO.  Produces a given positive
> + * number of blocks of output with nonce 0, taking an input key and 8-bytes
> + * counter.  Importantly does not spill to the stack.
> + *
> + * void __arch_chacha20_blocks_nostack(uint8_t *dst_bytes,
> + *                                    const uint8_t *key,
> + *                                    uint32_t *counter,
> + *                                    size_t nblocks)
> + *
> + *     x0: output bytes
> + *     x1: 32-byte key input
> + *     x2: 8-byte counter input/output
> + *     x3: number of 64-byte block to write to output
> + */
> +SYM_FUNC_START(__arch_chacha20_blocks_nostack)
> +
> +       /* copy0 = "expand 32-byte k" */
> +       adr_l           x8, CTES
> +       ld1             {copy0.4s}, [x8]
> +       /* copy1,copy2 = key */
> +       ld1             { copy1.4s, copy2.4s }, [x1]
> +       /* copy3 = counter || zero nonce  */
> +       ldr             copy3_d, [x2]
> +
> +       adr_l           x8, ONE
> +       ldr             one_q, [x8]
> +
> +       adr_l           x10, ROT8
> +       ld1             {rot8.4s}, [x10]

These immediate loads are forcing the vDSO to have a .rodata section,
which is best avoided, given that this is mapped into every user space
program.

Either use the existing mov_q macro and then move the values into SIMD
registers, or compose the required vectors in a different way.

E.g., with one_v == v16,

movi one_v.2s, #1
uzp1 one_v.4s, one_v.4s, one_v.4s

puts the correct value in one_d, uses one fewer instruction and 16 fewer
bytes of rodata, and avoids a memory access.
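
As a rough illustration of why that two-instruction sequence leaves 1 in
one_d, here is a small C model of the two lane operations. It is only a
sketch: the vec128 type and the helper names are made up for this example,
and it assumes little-endian lane numbering (lane 0 is the least
significant 32 bits).

```c
#include <stdint.h>

/* Hypothetical model of a 128-bit SIMD register as four 32-bit lanes. */
typedef struct { uint32_t s[4]; } vec128;

/* Models "movi Vd.2s, #imm": sets the two low lanes, zeroes the upper 64 bits. */
static vec128 movi_2s(uint32_t imm)
{
	vec128 r = { { imm, imm, 0, 0 } };
	return r;
}

/* Models "uzp1 Vd.4s, Vn.4s, Vm.4s": even-numbered lanes of Vn, then of Vm. */
static vec128 uzp1_4s(vec128 n, vec128 m)
{
	vec128 r = { { n.s[0], n.s[2], m.s[0], m.s[2] } };
	return r;
}

/* The "d" view of the register: its low 64 bits. */
static uint64_t low_d(vec128 v)
{
	return (uint64_t)v.s[0] | ((uint64_t)v.s[1] << 32);
}

/* {1,1,0,0} becomes {1,0,1,0}, so the low 64 bits hold the value 1. */
uint64_t compose_one_d(void)
{
	vec128 one = movi_2s(1);

	one = uzp1_4s(one, one);
	return low_d(one);
}
```

In this model the upper 64 bits end up as 1 as well, unlike the ONE constant
in the patch, but only the one_d view is used for the counter increment.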

The ROT8 + tbl can be replaced by shl/sri (see below)

> +.Lblock:
> +       /* copy state to auxiliary vectors for the final add after the permute.  */
> +       mov             state0.16b, copy0.16b
> +       mov             state1.16b, copy1.16b
> +       mov             state2.16b, copy2.16b
> +       mov             state3.16b, copy3.16b
> +
> +       mov             w4, 20
> +.Lpermute:
> +       /*
> +        * Permute one 64-byte block where the state matrix is stored in the four NEON
> +        * registers state0-state3.  It performs matrix operations on four words in parallel,
> +        * but requires shuffling to rearrange the words after each round.
> +        */
> +
> +.Ldoubleround:
> +       /* state0 += state1, state3 = rotl32(state3 ^ state0, 16) */
> +       add             state0.4s, state0.4s, state1.4s
> +       eor             state3.16b, state3.16b, state0.16b
> +       rev32           state3.8h, state3.8h
> +
> +       /* state2 += state3, state1 = rotl32(state1 ^ state2, 12) */
> +       add             state2.4s, state2.4s, state3.4s
> +       eor             tmp.16b, state1.16b, state2.16b
> +       shl             state1.4s, tmp.4s, #12
> +       sri             state1.4s, tmp.4s, #20
> +
> +       /* state0 += state1, state3 = rotl32(state3 ^ state0, 8) */
> +       add             state0.4s, state0.4s, state1.4s
> +       eor             state3.16b, state3.16b, state0.16b
> +       tbl             state3.16b, {state3.16b}, rot8.16b
> +

This can be changed to the below, removing the need for the ROT8 vector

eor   tmp.16b, state3.16b, state0.16b
shl   state3.4s, tmp.4s, #8
sri   state3.4s, tmp.4s, #24
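
To see that the shl/sri pair and the tbl lookup with the ROT8 table compute
the same thing, the per-word effect can be modelled in C. This is a sketch
for illustration only: the function names are invented here, and the tbl
model assumes the little-endian byte order of the first ROT8 word
(0x02010003, i.e. indices 3, 0, 1, 2).

```c
#include <stdint.h>

/* Models "shl Vd.4s, Vn.4s, #8" followed by "sri Vd.4s, Vn.4s, #24" on one word. */
uint32_t rotl8_shl_sri(uint32_t x)
{
	uint32_t r = x << 8;			/* shl: low 8 bits become zero */

	r = (r & 0xffffff00u) | (x >> 24);	/* sri: insert the top byte at the bottom */
	return r;
}

/* Models "tbl" on one word, using the byte indices of the first ROT8 entry. */
uint32_t rotl8_tbl(uint32_t x)
{
	static const uint8_t idx[4] = { 3, 0, 1, 2 };
	uint32_t r = 0;
	int i;

	for (i = 0; i < 4; i++) {
		uint32_t byte = (x >> (8 * idx[i])) & 0xff;	/* source byte idx[i] */

		r |= byte << (8 * i);				/* goes to result byte i */
	}
	return r;
}
```

Both functions implement rotl32(x, 8); for example, 0x11223344 maps to
0x22334411 in either case.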

> +       /* state2 += state3, state1 = rotl32(state1 ^ state2, 7) */
> +       add             state2.4s, state2.4s, state3.4s
> +       eor             tmp.16b, state1.16b, state2.16b
> +       shl             state1.4s, tmp.4s, #7
> +       sri             state1.4s, tmp.4s, #25
> +
> +       /* state1[0,1,2,3] = state1[1,2,3,0] */
> +       ext             state1.16b, state1.16b, state1.16b, #4
> +       /* state2[0,1,2,3] = state2[2,3,0,1] */
> +       ext             state2.16b, state2.16b, state2.16b, #8
> +       /* state3[0,1,2,3] = state3[1,2,3,0] */
> +       ext             state3.16b, state3.16b, state3.16b, #12
> +
> +       /* state0 += state1, state3 = rotl32(state3 ^ state0, 16) */
> +       add             state0.4s, state0.4s, state1.4s
> +       eor             state3.16b, state3.16b, state0.16b
> +       rev32           state3.8h, state3.8h
> +
> +       /* state2 += state3, state1 = rotl32(state1 ^ state2, 12) */
> +       add             state2.4s, state2.4s, state3.4s
> +       eor             tmp.16b, state1.16b, state2.16b
> +       shl             state1.4s, tmp.4s, #12
> +       sri             state1.4s, tmp.4s, #20
> +
> +       /* state0 += state1, state3 = rotl32(state3 ^ state0, 8) */
> +       add             state0.4s, state0.4s, state1.4s
> +       eor             state3.16b, state3.16b, state0.16b
> +       tbl             state3.16b, {state3.16b}, rot8.16b
> +
> +       /* state2 += state3, state1 = rotl32(state1 ^ state2, 7) */
> +       add             state2.4s, state2.4s, state3.4s
> +       eor             tmp.16b, state1.16b, state2.16b
> +       shl             state1.4s, tmp.4s, #7
> +       sri             state1.4s, tmp.4s, #25
> +
> +       /* state1[0,1,2,3] = state1[3,0,1,2] */
> +       ext             state1.16b, state1.16b, state1.16b, #12
> +       /* state2[0,1,2,3] = state2[2,3,0,1] */
> +       ext             state2.16b, state2.16b, state2.16b, #8
> +       /* state3[0,1,2,3] = state3[1,2,3,0] */
> +       ext             state3.16b, state3.16b, state3.16b, #4
> +
> +       subs            w4, w4, #2
> +       b.ne            .Ldoubleround
> +
> +       /* output0 = state0 + state0 */
> +       add             state0.4s, state0.4s, copy0.4s
> +       /* output1 = state1 + state1 */
> +       add             state1.4s, state1.4s, copy1.4s
> +       /* output2 = state2 + state2 */
> +       add             state2.4s, state2.4s, copy2.4s
> +       /* output2 = state3 + state3 */
> +       add             state3.4s, state3.4s, copy3.4s
> +       st1             { state0.4s - state3.4s }, [x0]
> +
> +       /* ++copy3.counter */
> +       add             copy3_d, copy3_d, one_d
> +

This 'add' clears the upper half of the SIMD register, which is where
the zero nonce lives. So this happens to be correct, but it is not
very intuitive, so perhaps a comment would be in order here.
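
For illustration, the behaviour being relied on can be modelled in C roughly
as follows. This is a sketch under the assumption that a scalar write to a
D register zeroes bits 127:64 of the corresponding V register; the vreg128
type and the function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical model of a 128-bit SIMD register as two 64-bit halves. */
typedef struct { uint64_t lo, hi; } vreg128;

/* A scalar write to the "d" view replaces the low half and zeroes the high half. */
static void write_d(vreg128 *v, uint64_t val)
{
	v->lo = val;
	v->hi = 0;
}

/* Models "add copy3_d, copy3_d, one_d": 64-bit counter += one, nonce half cleared. */
void counter_increment(vreg128 *copy3, uint64_t one)
{
	write_d(copy3, copy3->lo + one);
}
```

So the nonce half stays zero after every block as a side effect of the
scalar add, not because anything preserves it explicitly.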

> +       /* output += 64, --nblocks */
> +       add             x0, x0, 64
> +       subs            x3, x3, #1
> +       b.ne            .Lblock
> +
> +       /* counter = copy3.counter */
> +       str             copy3_d, [x2]
> +
> +       /* Zero out the potentially sensitive regs, in case nothing uses these again. */
> +       eor             state0.16b, state0.16b, state0.16b
> +       eor             state1.16b, state1.16b, state1.16b
> +       eor             state2.16b, state2.16b, state2.16b
> +       eor             state3.16b, state3.16b, state3.16b
> +       eor             copy1.16b, copy1.16b, copy1.16b
> +       eor             copy2.16b, copy2.16b, copy2.16b

This is not x86 - no need to use XOR to clear registers, you can just
use 'movi reg.16b, #0' here.

> +       ret
> +SYM_FUNC_END(__arch_chacha20_blocks_nostack)
> +
> +        .section        ".rodata", "a", %progbits
> +        .align          L1_CACHE_SHIFT
> +
> +CTES:  .word           1634760805, 857760878,  2036477234, 1797285236
> +ONE:    .xword         1, 0
> +ROT8:  .word           0x02010003, 0x06050407, 0x0a09080b, 0x0e0d0c0f
> +
> +emit_aarch64_feature_1_and
...

Mark Brown Aug. 30, 2024, 3:05 p.m. UTC | #6

On Fri, Aug 30, 2024 at 02:04:39PM +0200, Jason A. Donenfeld wrote:
> > > +SYM_FUNC_START(__arch_chacha20_blocks_nostack)

> > Is there any way we can reuse the existing code in
> > crypto/chacha-neon-core.S for this? It looks to my untrained eye like
> > this is an arbitrarily different implementation to what we already have.

> Nope, it is indeed different, and not arbitrarily so. This patch is
> mirroring exactly what we did on x86.

It's probably worth adding some comments explaining what's going on there
(the commit log for the x86 patch mentions that the vDSO needs a version
that doesn't write to the stack).
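
As background for readers of the assembly, a plain scalar C reference of the
computation described in the function's header comment (zero nonce, 32-byte
key, 8-byte counter, whole 64-byte blocks) could look like the sketch below.
It is not part of the patch: chacha20_blocks_ref is a hypothetical name, the
counter is treated as two little-endian 32-bit words, and unlike the vDSO
routine this version freely uses the stack.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Rotate a 32-bit word left by n bits (0 < n < 32). */
static uint32_t rotl32(uint32_t v, unsigned int n)
{
	return (v << n) | (v >> (32 - n));
}

/* One ChaCha quarter round on four words of the state. */
static void quarter_round(uint32_t *a, uint32_t *b, uint32_t *c, uint32_t *d)
{
	*a += *b; *d = rotl32(*d ^ *a, 16);
	*c += *d; *b = rotl32(*b ^ *c, 12);
	*a += *b; *d = rotl32(*d ^ *a, 8);
	*c += *d; *b = rotl32(*b ^ *c, 7);
}

static uint32_t load_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
	       (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

static void store_le32(uint8_t *p, uint32_t v)
{
	p[0] = (uint8_t)v;
	p[1] = (uint8_t)(v >> 8);
	p[2] = (uint8_t)(v >> 16);
	p[3] = (uint8_t)(v >> 24);
}

/* Hypothetical scalar reference: writes nblocks 64-byte blocks, updates counter. */
void chacha20_blocks_ref(uint8_t *dst, const uint8_t key[32],
			 uint32_t counter[2], size_t nblocks)
{
	/* "expand 32-byte k"; the remaining words start out as zero. */
	uint32_t init[16] = { 0x61707865, 0x3320646e, 0x79622d32, 0x6b206574 };
	uint32_t x[16];
	size_t i;

	for (i = 0; i < 8; i++)
		init[4 + i] = load_le32(key + 4 * i);
	/* init[14] and init[15] hold the zero nonce. */

	while (nblocks--) {
		init[12] = counter[0];
		init[13] = counter[1];
		memcpy(x, init, sizeof(x));

		/* 20 rounds = 10 double rounds (column round + diagonal round). */
		for (i = 0; i < 10; i++) {
			quarter_round(&x[0], &x[4], &x[8],  &x[12]);
			quarter_round(&x[1], &x[5], &x[9],  &x[13]);
			quarter_round(&x[2], &x[6], &x[10], &x[14]);
			quarter_round(&x[3], &x[7], &x[11], &x[15]);
			quarter_round(&x[0], &x[5], &x[10], &x[15]);
			quarter_round(&x[1], &x[6], &x[11], &x[12]);
			quarter_round(&x[2], &x[7], &x[8],  &x[13]);
			quarter_round(&x[3], &x[4], &x[9],  &x[14]);
		}

		/* Final add of the input state, then little-endian output. */
		for (i = 0; i < 16; i++)
			store_le32(dst + 4 * i, x[i] + init[i]);
		dst += 64;

		/* 64-bit counter increment with carry into the high word. */
		if (++counter[0] == 0)
			++counter[1];
	}
}
```

The NEON version keeps each row of the 4x4 state in one vector register, so
the four column quarter rounds run in parallel and the ext instructions
rotate the rows to line up the diagonals.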

Mark Brown Aug. 30, 2024, 3:19 p.m. UTC | #7

On Thu, Aug 29, 2024 at 08:17:14PM +0000, Adhemerval Zanella wrote:
> Hook up the generic vDSO implementation to the aarch64 vDSO data page.
> The _vdso_rng_data required data is placed within the _vdso_data vvar
> page, by using an offset larger than the vdso_data.

This exposes some preexisting compiler warnings in the getrandom test
when built with clang:

vdso_test_getrandom.c:145:40: warning: omitting the parameter name in a function definition is a C23 extension [-Wc23-extensions]
  145 | static void *test_vdso_getrandom(void *)
      |                                        ^
vdso_test_getrandom.c:155:40: warning: omitting the parameter name in a function definition is a C23 extension [-Wc23-extensions]
  155 | static void *test_libc_getrandom(void *)
      |                                        ^
vdso_test_getrandom.c:165:43: warning: omitting the parameter name in a function definition is a C23 extension [-Wc23-extensions]
  165 | static void *test_syscall_getrandom(void *)
      |                                           ^

which it'd be good to get fixed before merging.
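
One common way to silence this kind of warning, sketched here with a
hypothetical function rather than the actual test code, is to name the
parameter and mark it as unused explicitly:

```c
#include <stddef.h>

/* Hypothetical thread entry point; the argument is named but deliberately unused. */
void *example_thread_fn(void *ctx)
{
	(void)ctx;	/* explicitly mark the argument as unused */
	return NULL;
}
```

Naming the parameter keeps the definition valid in pre-C23 language modes,
and the cast to void avoids trading the warning for an unused-parameter one.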

Jason A. Donenfeld Aug. 30, 2024, 3:21 p.m. UTC | #8

On Fri, Aug 30, 2024 at 04:19:00PM +0100, Mark Brown wrote:
> On Thu, Aug 29, 2024 at 08:17:14PM +0000, Adhemerval Zanella wrote:
> > Hook up the generic vDSO implementation to the aarch64 vDSO data page.
> > The _vdso_rng_data required data is placed within the _vdso_data vvar
> > page, by using an offset larger than the vdso_data.
> 
> This exposes some preexisting compiler warnings in the getrandom test
> when built with clang:
> 
> vdso_test_getrandom.c:145:40: warning: omitting the parameter name in a function definition is a C23 extension [-Wc23-extensions]
>   145 | static void *test_vdso_getrandom(void *)
>       |                                        ^
> vdso_test_getrandom.c:155:40: warning: omitting the parameter name in a function definition is a C23 extension [-Wc23-extensions]
>   155 | static void *test_libc_getrandom(void *)
>       |                                        ^
> vdso_test_getrandom.c:165:43: warning: omitting the parameter name in a function definition is a C23 extension [-Wc23-extensions]
>   165 | static void *test_syscall_getrandom(void *)
>       |                                           ^
> 
> which it'd be good to get fixed before merging.

That's my bug. I'll fix that up in the tree and CC you on it. Thanks for
pointing it out.

Jason

kernel test robot Aug. 30, 2024, 4:18 p.m. UTC | #9

Hi Adhemerval,

kernel test robot noticed the following build errors:

[auto build test ERROR on crng-random/master]
[also build test ERROR on next-20240830]
[cannot apply to arm64/for-next/core shuah-kselftest/next shuah-kselftest/fixes linus/master v6.11-rc5]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Adhemerval-Zanella/aarch64-vdso-Wire-up-getrandom-vDSO-implementation/20240830-041912
base:   https://git.kernel.org/pub/scm/linux/kernel/git/crng/random.git master
patch link:    https://lore.kernel.org/r/20240829201728.2825-1-adhemerval.zanella%40linaro.org
patch subject: [PATCH v2] aarch64: vdso: Wire up getrandom() vDSO implementation
config: arm64-allyesconfig (https://download.01.org/0day-ci/archive/20240831/202408310030.S5ZNwLWz-lkp@intel.com/config)
compiler: clang version 20.0.0git (https://github.com/llvm/llvm-project 46fe36a4295f05d5d3731762e31fc4e6e99863e9)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240831/202408310030.S5ZNwLWz-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202408310030.S5ZNwLWz-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from arch/arm64/kernel/asm-offsets.c:10:
   In file included from include/linux/arm_sdei.h:8:
   In file included from include/acpi/ghes.h:5:
   In file included from include/acpi/apei.h:9:
   In file included from include/linux/acpi.h:39:
   In file included from include/acpi/acpi_io.h:7:
   In file included from arch/arm64/include/asm/acpi.h:14:
   In file included from include/linux/memblock.h:12:
   In file included from include/linux/mm.h:2228:
   include/linux/vmstat.h:503:43: warning: arithmetic between different enumeration types ('enum zone_stat_item' and 'enum numa_stat_item') [-Wenum-enum-conversion]
     503 |         return vmstat_text[NR_VM_ZONE_STAT_ITEMS +
         |                            ~~~~~~~~~~~~~~~~~~~~~ ^
     504 |                            item];
         |                            ~~~~
   include/linux/vmstat.h:510:43: warning: arithmetic between different enumeration types ('enum zone_stat_item' and 'enum numa_stat_item') [-Wenum-enum-conversion]
     510 |         return vmstat_text[NR_VM_ZONE_STAT_ITEMS +
         |                            ~~~~~~~~~~~~~~~~~~~~~ ^
     511 |                            NR_VM_NUMA_EVENT_ITEMS +
         |                            ~~~~~~~~~~~~~~~~~~~~~~
   include/linux/vmstat.h:517:36: warning: arithmetic between different enumeration types ('enum node_stat_item' and 'enum lru_list') [-Wenum-enum-conversion]
     517 |         return node_stat_name(NR_LRU_BASE + lru) + 3; // skip "nr_"
         |                               ~~~~~~~~~~~ ^ ~~~
   include/linux/vmstat.h:523:43: warning: arithmetic between different enumeration types ('enum zone_stat_item' and 'enum numa_stat_item') [-Wenum-enum-conversion]
     523 |         return vmstat_text[NR_VM_ZONE_STAT_ITEMS +
         |                            ~~~~~~~~~~~~~~~~~~~~~ ^
     524 |                            NR_VM_NUMA_EVENT_ITEMS +
         |                            ~~~~~~~~~~~~~~~~~~~~~~
   4 warnings generated.
   In file included from <built-in>:4:
   In file included from lib/vdso/getrandom.c:8:
   In file included from include/linux/mm.h:2228:
   include/linux/vmstat.h:503:43: warning: arithmetic between different enumeration types ('enum zone_stat_item' and 'enum numa_stat_item') [-Wenum-enum-conversion]
     503 |         return vmstat_text[NR_VM_ZONE_STAT_ITEMS +
         |                            ~~~~~~~~~~~~~~~~~~~~~ ^
     504 |                            item];
         |                            ~~~~
   include/linux/vmstat.h:510:43: warning: arithmetic between different enumeration types ('enum zone_stat_item' and 'enum numa_stat_item') [-Wenum-enum-conversion]
     510 |         return vmstat_text[NR_VM_ZONE_STAT_ITEMS +
         |                            ~~~~~~~~~~~~~~~~~~~~~ ^
     511 |                            NR_VM_NUMA_EVENT_ITEMS +
         |                            ~~~~~~~~~~~~~~~~~~~~~~
   include/linux/vmstat.h:517:36: warning: arithmetic between different enumeration types ('enum node_stat_item' and 'enum lru_list') [-Wenum-enum-conversion]
     517 |         return node_stat_name(NR_LRU_BASE + lru) + 3; // skip "nr_"
         |                               ~~~~~~~~~~~ ^ ~~~
   include/linux/vmstat.h:523:43: warning: arithmetic between different enumeration types ('enum zone_stat_item' and 'enum numa_stat_item') [-Wenum-enum-conversion]
     523 |         return vmstat_text[NR_VM_ZONE_STAT_ITEMS +
         |                            ~~~~~~~~~~~~~~~~~~~~~ ^
     524 |                            NR_VM_NUMA_EVENT_ITEMS +
         |                            ~~~~~~~~~~~~~~~~~~~~~~
   In file included from <built-in>:4:
   In file included from lib/vdso/getrandom.c:12:
   In file included from arch/arm64/include/asm/vdso/getrandom.h:8:
>> arch/arm64/include/asm/vdso.h:25:10: fatal error: 'generated/vdso-offsets.h' file not found
      25 | #include <generated/vdso-offsets.h>
         |          ^~~~~~~~~~~~~~~~~~~~~~~~~~
   4 warnings and 1 error generated.
   make[3]: *** [scripts/Makefile.build:244: arch/arm64/kernel/vdso/vgetrandom.o] Error 1
   make[3]: Target 'include/generated/vdso-offsets.h' not remade because of errors.
   make[3]: Target 'arch/arm64/kernel/vdso/vdso.so' not remade because of errors.
   make[2]: *** [arch/arm64/Makefile:217: vdso_prepare] Error 2
   make[2]: Target 'prepare' not remade because of errors.
   make[1]: *** [Makefile:224: __sub-make] Error 2
   make[1]: Target 'prepare' not remade because of errors.
   make: *** [Makefile:224: __sub-make] Error 2
   make: Target 'prepare' not remade because of errors.


vim +25 arch/arm64/include/asm/vdso.h

0a7927d2b89e55 Adhemerval Zanella 2024-08-29  24  
9031fefde6f2ac Will Deacon        2012-03-05 @25  #include <generated/vdso-offsets.h>
9031fefde6f2ac Will Deacon        2012-03-05  26

Adhemerval Zanella Netto Aug. 30, 2024, 5:38 p.m. UTC | #10

On 30/08/24 11:11, Ard Biesheuvel wrote:
> On Thu, 29 Aug 2024 at 22:17, Adhemerval Zanella
> <adhemerval.zanella@linaro.org> wrote:
>>
>> Hook up the generic vDSO implementation to the aarch64 vDSO data page.
>> The _vdso_rng_data required data is placed within the _vdso_data vvar
>> page, by using an offset larger than the vdso_data.
>>
>> The vDSO function requires a ChaCha20 implementation that does not
>> write to the stack, and that can do an entire ChaCha20 permutation.
>> The one provided is based on the current chacha-neon-core.S and uses NEON
>> on the permute operation. The fallback for chips that do not support
>> NEON issues the syscall.
>>
>> This also passes the vdso_test_chacha test along with
>> vdso_test_getrandom. The vdso_test_getrandom bench-single result on
>> Neoverse-N1 shows:
>>
>>    vdso: 25000000 times in 0.746506464 seconds
>>    libc: 25000000 times in 8.849179444 seconds
>> syscall: 25000000 times in 8.818726425 seconds
>>
>> Changes from v1:
>> - Fixed style issues and typos.
>> - Added fallback for systems without NEON support.
>> - Avoid use of non-volatile vector registers in neon chacha20.
>> - Use c-getrandom-y for vgetrandom.c.
>> - Fixed TIMENS vdso_rnd_data access.
>>
>> Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
>> ---
> ...
>> diff --git a/arch/arm64/kernel/vdso/vgetrandom-chacha.S b/arch/arm64/kernel/vdso/vgetrandom-chacha.S
>> new file mode 100644
>> index 000000000000..9ebf12a09c65
>> --- /dev/null
>> +++ b/arch/arm64/kernel/vdso/vgetrandom-chacha.S
>> @@ -0,0 +1,168 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +
>> +#include <linux/linkage.h>
>> +#include <asm/cache.h>
>> +#include <asm/assembler.h>
>> +
>> +       .text
>> +
>> +#define state0         v0
>> +#define state1         v1
>> +#define state2         v2
>> +#define state3         v3
>> +#define copy0          v4
>> +#define copy1          v5
>> +#define copy2          v6
>> +#define copy3          v7
>> +#define copy3_d                d7
>> +#define one_d          d16
>> +#define one_q          q16
>> +#define tmp            v17
>> +#define rot8           v18
>> +
> 
> Please make a note somewhere around here that you are deliberately
> avoiding d8-d15 because they are callee-save in user space.

Ack.

> 
>> +/*
>> + * ARM64 ChaCha20 implementation meant for vDSO.  Produces a given positive
>> + * number of blocks of output with nonce 0, taking an input key and 8-bytes
>> + * counter.  Importantly does not spill to the stack.
>> + *
>> + * void __arch_chacha20_blocks_nostack(uint8_t *dst_bytes,
>> + *                                    const uint8_t *key,
>> + *                                    uint32_t *counter,
>> + *                                    size_t nblocks)
>> + *
>> + *     x0: output bytes
>> + *     x1: 32-byte key input
>> + *     x2: 8-byte counter input/output
>> + *     x3: number of 64-byte block to write to output
>> + */
>> +SYM_FUNC_START(__arch_chacha20_blocks_nostack)
>> +
>> +       /* copy0 = "expand 32-byte k" */
>> +       adr_l           x8, CTES
>> +       ld1             {copy0.4s}, [x8]
>> +       /* copy1,copy2 = key */
>> +       ld1             { copy1.4s, copy2.4s }, [x1]
>> +       /* copy3 = counter || zero nonce  */
>> +       ldr             copy3_d, [x2]
>> +
>> +       adr_l           x8, ONE
>> +       ldr             one_q, [x8]
>> +
>> +       adr_l           x10, ROT8
>> +       ld1             {rot8.4s}, [x10]
> 
> These immediate loads are forcing the vDSO to have a .rodata section,
> which is best avoided, given that this is mapped into every user space
> program.
> 
> Either use the existing mov_q macro and then move the values into SIMD
> registers, or compose the required vectors in a different way.

Ack, mov_q seems to suffice here.

> 
> E.g., with one_v == v16,
> 
> movi one_v.2s, #1
> uzp1 one_v.4s, one_v.4s, one_v.4s
> 
> puts the correct value in one_d, uses one fewer instruction and 16 fewer
> bytes of rodata, and avoids a memory access.

Ack.
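
For readers following along, the lane arithmetic behind the movi/uzp1 suggestion can be checked with a small C model. This is only a sketch: vec128, movi_2s(), uzp1_4s() and low_d() are hypothetical helper names invented here, and the model covers just the two instruction forms quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* A 128-bit SIMD register viewed as four 32-bit lanes (lane 0 is the lowest). */
typedef struct { uint32_t s[4]; } vec128;

/* Model of "movi vd.2s, #imm": fill the two low lanes, zero the upper 64 bits. */
static vec128 movi_2s(uint32_t imm)
{
	vec128 r = { { imm, imm, 0, 0 } };
	return r;
}

/* Model of "uzp1 vd.4s, vn.4s, vm.4s": even lanes of vn, then even lanes of vm. */
static vec128 uzp1_4s(vec128 n, vec128 m)
{
	vec128 r = { { n.s[0], n.s[2], m.s[0], m.s[2] } };
	return r;
}

/* Low 64 bits of the register, i.e. what the dN view of it holds. */
static uint64_t low_d(vec128 v)
{
	return (uint64_t)v.s[0] | ((uint64_t)v.s[1] << 32);
}
```

In this model, uzp1_4s(movi_2s(1), movi_2s(1)) yields the lanes {1, 0, 1, 0}, so low_d() is 1 as required. The upper 64 bits also hold 1 rather than the 0 of the original ONE constant, which should not matter as long as only one_d is consumed.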

> 
> The ROT8 + tbl can be replaced by shl/sri (see below)
> 
>> +.Lblock:
>> +       /* copy state to auxiliary vectors for the final add after the permute.  */
>> +       mov             state0.16b, copy0.16b
>> +       mov             state1.16b, copy1.16b
>> +       mov             state2.16b, copy2.16b
>> +       mov             state3.16b, copy3.16b
>> +
>> +       mov             w4, 20
>> +.Lpermute:
>> +       /*
>> +        * Permute one 64-byte block where the state matrix is stored in the four NEON
>> +        * registers state0-state3.  It performs matrix operations on four words in parallel,
>> +        * but requires shuffling to rearrange the words after each round.
>> +        */
>> +
>> +.Ldoubleround:
>> +       /* state0 += state1, state3 = rotl32(state3 ^ state0, 16) */
>> +       add             state0.4s, state0.4s, state1.4s
>> +       eor             state3.16b, state3.16b, state0.16b
>> +       rev32           state3.8h, state3.8h
>> +
>> +       /* state2 += state3, state1 = rotl32(state1 ^ state2, 12) */
>> +       add             state2.4s, state2.4s, state3.4s
>> +       eor             tmp.16b, state1.16b, state2.16b
>> +       shl             state1.4s, tmp.4s, #12
>> +       sri             state1.4s, tmp.4s, #20
>> +
>> +       /* state0 += state1, state3 = rotl32(state3 ^ state0, 8) */
>> +       add             state0.4s, state0.4s, state1.4s
>> +       eor             state3.16b, state3.16b, state0.16b
>> +       tbl             state3.16b, {state3.16b}, rot8.16b
>> +
> 
> This can be changed to the below, removing the need for the ROT8 vector
> 
> eor   tmp.16b, state3.16b, state0.16b
> shl   state3.4s, tmp.4s, #8
> sri   state3.4s, tmp.4s, #24
> 

Ack.
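
As a sanity check on the shl/sri replacement, the following C sketch models one 32-bit lane of both variants. The helper names are hypothetical, and the tbl model hard-codes the per-lane byte indices 3, 0, 1, 2 taken from the ROT8 constant in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Model of "shl vd.4s, vn.4s, #n" on a single 32-bit lane (n in 0..31). */
static uint32_t shl32(uint32_t src, unsigned n)
{
	return src << n;
}

/*
 * Model of "sri vd.4s, vn.4s, #n" on a single lane (n in 1..31):
 * shift src right by n and insert it into dst, keeping the top n bits of dst.
 */
static uint32_t sri32(uint32_t dst, uint32_t src, unsigned n)
{
	uint32_t mask = 0xffffffffu >> n;	/* bits overwritten by the insert */

	return (dst & ~mask) | (src >> n);
}

/* rotl32(x, 8) as the suggested shl #8 followed by sri #24. */
static uint32_t rotl8_shl_sri(uint32_t x)
{
	return sri32(shl32(x, 8), x, 24);
}

/* rotl32(x, 8) as a byte shuffle, using the per-lane indices from ROT8. */
static uint32_t rotl8_tbl(uint32_t x)
{
	static const uint8_t idx[4] = { 3, 0, 1, 2 };
	uint32_t r = 0;

	for (int i = 0; i < 4; i++)
		r |= ((x >> (8 * idx[i])) & 0xffu) << (8 * i);
	return r;
}
```

Both helpers map 0x11223344 to 0x22334411, which is consistent with dropping the ROT8 table at the cost of one temporary register.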

>> +       /* state2 += state3, state1 = rotl32(state1 ^ state2, 7) */
>> +       add             state2.4s, state2.4s, state3.4s
>> +       eor             tmp.16b, state1.16b, state2.16b
>> +       shl             state1.4s, tmp.4s, #7
>> +       sri             state1.4s, tmp.4s, #25
>> +
>> +       /* state1[0,1,2,3] = state1[1,2,3,0] */
>> +       ext             state1.16b, state1.16b, state1.16b, #4
>> +       /* state2[0,1,2,3] = state2[2,3,0,1] */
>> +       ext             state2.16b, state2.16b, state2.16b, #8
>> +       /* state3[0,1,2,3] = state3[3,0,1,2] */
>> +       ext             state3.16b, state3.16b, state3.16b, #12
>> +
>> +       /* state0 += state1, state3 = rotl32(state3 ^ state0, 16) */
>> +       add             state0.4s, state0.4s, state1.4s
>> +       eor             state3.16b, state3.16b, state0.16b
>> +       rev32           state3.8h, state3.8h
>> +
>> +       /* state2 += state3, state1 = rotl32(state1 ^ state2, 12) */
>> +       add             state2.4s, state2.4s, state3.4s
>> +       eor             tmp.16b, state1.16b, state2.16b
>> +       shl             state1.4s, tmp.4s, #12
>> +       sri             state1.4s, tmp.4s, #20
>> +
>> +       /* state0 += state1, state3 = rotl32(state3 ^ state0, 8) */
>> +       add             state0.4s, state0.4s, state1.4s
>> +       eor             state3.16b, state3.16b, state0.16b
>> +       tbl             state3.16b, {state3.16b}, rot8.16b
>> +
>> +       /* state2 += state3, state1 = rotl32(state1 ^ state2, 7) */
>> +       add             state2.4s, state2.4s, state3.4s
>> +       eor             tmp.16b, state1.16b, state2.16b
>> +       shl             state1.4s, tmp.4s, #7
>> +       sri             state1.4s, tmp.4s, #25
>> +
>> +       /* state1[0,1,2,3] = state1[3,0,1,2] */
>> +       ext             state1.16b, state1.16b, state1.16b, #12
>> +       /* state2[0,1,2,3] = state2[2,3,0,1] */
>> +       ext             state2.16b, state2.16b, state2.16b, #8
>> +       /* state3[0,1,2,3] = state3[1,2,3,0] */
>> +       ext             state3.16b, state3.16b, state3.16b, #4
>> +
>> +       subs            w4, w4, #2
>> +       b.ne            .Ldoubleround
>> +
>> +       /* output0 = state0 + copy0 */
>> +       add             state0.4s, state0.4s, copy0.4s
>> +       /* output1 = state1 + copy1 */
>> +       add             state1.4s, state1.4s, copy1.4s
>> +       /* output2 = state2 + copy2 */
>> +       add             state2.4s, state2.4s, copy2.4s
>> +       /* output3 = state3 + copy3 */
>> +       add             state3.4s, state3.4s, copy3.4s
>> +       st1             { state0.4s - state3.4s }, [x0]
>> +
>> +       /* ++copy3.counter */
>> +       add             copy3_d, copy3_d, one_d
>> +
> 
> This 'add' clears the upper half of the SIMD register, which is where
> the zero nonce lives. So this happens to be correct, but it is not
> very intuitive, so perhaps a comment would be in order here.

Ack, will do.
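
The behaviour being relied on here can be illustrated with a tiny C model of the scalar add. It is a sketch only: vreg and add_d() are hypothetical names, and the model assumes the usual AArch64 rule that a write to the dN view of a SIMD register zeroes the upper 64 bits.

```c
#include <assert.h>
#include <stdint.h>

/* A 128-bit SIMD register as two 64-bit halves: d[0] is low, d[1] is high. */
typedef struct { uint64_t d[2]; } vreg;

/*
 * Model of the scalar "add dd, dn, dm": a 64-bit add on the low halves.
 * Writing the dN view leaves the upper 64 bits of the destination zeroed.
 */
static vreg add_d(vreg n, vreg m)
{
	vreg r = { { n.d[0] + m.d[0], 0 } };
	return r;
}
```

With copy3 holding the counter in its low half, add_d(copy3, one) bumps the counter and leaves the nonce half at zero regardless of what either operand had in its upper half.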

> 
>> +       /* output += 64, --nblocks */
>> +       add             x0, x0, 64
>> +       subs            x3, x3, #1
>> +       b.ne            .Lblock
>> +
>> +       /* counter = copy3.counter */
>> +       str             copy3_d, [x2]
>> +
>> +       /* Zero out the potentially sensitive regs, in case nothing uses these again. */
>> +       eor             state0.16b, state0.16b, state0.16b
>> +       eor             state1.16b, state1.16b, state1.16b
>> +       eor             state2.16b, state2.16b, state2.16b
>> +       eor             state3.16b, state3.16b, state3.16b
>> +       eor             copy1.16b, copy1.16b, copy1.16b
>> +       eor             copy2.16b, copy2.16b, copy2.16b
> 
> This is not x86 - no need to use XOR to clear registers, you can just
> use 'movi reg.16b, #0' here.

Ack.

> 
>> +       ret
>> +SYM_FUNC_END(__arch_chacha20_blocks_nostack)
>> +
>> +        .section        ".rodata", "a", %progbits
>> +        .align          L1_CACHE_SHIFT
>> +
>> +CTES:  .word           1634760805, 857760878,  2036477234, 1797285236
>> +ONE:    .xword         1, 0
>> +ROT8:  .word           0x02010003, 0x06050407, 0x0a09080b, 0x0e0d0c0f
>> +
>> +emit_aarch64_feature_1_and
> ...
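
For comparison with the assembly above, here is a portable C sketch of the same contract: nonce 0, a 64-bit block counter, and 10 double rounds per 64-byte block. It is not the kernel's code. chacha20_blocks_ref() and its helpers are hypothetical names, and the counter is assumed to be stored as two 32-bit words with the low word first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint32_t rotl32(uint32_t v, unsigned n)
{
	return (v << n) | (v >> (32 - n));
}

/* One ChaCha quarter round, with the same rotation amounts as the patch. */
#define QR(a, b, c, d) do {			\
	a += b; d = rotl32(d ^ a, 16);		\
	c += d; b = rotl32(b ^ c, 12);		\
	a += b; d = rotl32(d ^ a, 8);		\
	c += d; b = rotl32(b ^ c, 7);		\
} while (0)

static uint32_t load_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
	       (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

static void store_le32(uint8_t *p, uint32_t v)
{
	p[0] = (uint8_t)v;
	p[1] = (uint8_t)(v >> 8);
	p[2] = (uint8_t)(v >> 16);
	p[3] = (uint8_t)(v >> 24);
}

/* Hypothetical reference: writes nblocks 64-byte blocks and updates counter. */
static void chacha20_blocks_ref(uint8_t *dst, const uint8_t *key,
				uint32_t *counter, size_t nblocks)
{
	while (nblocks--) {
		uint32_t in[16], x[16];
		int i;

		/* "expand 32-byte k", then key, counter, zero nonce. */
		in[0] = 0x61707865; in[1] = 0x3320646e;
		in[2] = 0x79622d32; in[3] = 0x6b206574;
		for (i = 0; i < 8; i++)
			in[4 + i] = load_le32(key + 4 * i);
		in[12] = counter[0]; in[13] = counter[1];
		in[14] = 0; in[15] = 0;

		for (i = 0; i < 16; i++)
			x[i] = in[i];

		for (i = 0; i < 10; i++) {
			/* Column round. */
			QR(x[0], x[4], x[8], x[12]);
			QR(x[1], x[5], x[9], x[13]);
			QR(x[2], x[6], x[10], x[14]);
			QR(x[3], x[7], x[11], x[15]);
			/* Diagonal round. */
			QR(x[0], x[5], x[10], x[15]);
			QR(x[1], x[6], x[11], x[12]);
			QR(x[2], x[7], x[8], x[13]);
			QR(x[3], x[4], x[9], x[14]);
		}

		/* Final add of the input state, then little-endian output. */
		for (i = 0; i < 16; i++)
			store_le32(dst + 4 * i, x[i] + in[i]);
		dst += 64;

		/* 64-bit counter increment with carry into the high word. */
		if (++counter[0] == 0)
			++counter[1];
	}
}
```

A scalar version like this can be handy for cross-checking output against the NEON routine in a user-space test, in the spirit of vdso_test_chacha.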

kernel test robot Aug. 31, 2024, 1:56 a.m. UTC | #11

Hi Adhemerval,

kernel test robot noticed the following build errors:

[auto build test ERROR on crng-random/master]
[also build test ERROR on next-20240830]
[cannot apply to arm64/for-next/core shuah-kselftest/next shuah-kselftest/fixes linus/master v6.11-rc5]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Adhemerval-Zanella/aarch64-vdso-Wire-up-getrandom-vDSO-implementation/20240830-041912
base:   https://git.kernel.org/pub/scm/linux/kernel/git/crng/random.git master
patch link:    https://lore.kernel.org/r/20240829201728.2825-1-adhemerval.zanella%40linaro.org
patch subject: [PATCH v2] aarch64: vdso: Wire up getrandom() vDSO implementation
config: arm64-defconfig (https://download.01.org/0day-ci/archive/20240831/202408310834.qh5oO1N6-lkp@intel.com/config)
compiler: aarch64-linux-gcc (GCC) 13.3.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240831/202408310834.qh5oO1N6-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202408310834.qh5oO1N6-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from arch/arm64/include/asm/vdso/getrandom.h:8,
                    from lib/vdso/getrandom.c:12,
                    from <command-line>:
>> arch/arm64/include/asm/vdso.h:25:10: fatal error: generated/vdso-offsets.h: No such file or directory
      25 | #include <generated/vdso-offsets.h>
         |          ^~~~~~~~~~~~~~~~~~~~~~~~~~
   compilation terminated.
   make[3]: *** [scripts/Makefile.build:244: arch/arm64/kernel/vdso/vgetrandom.o] Error 1
   make[3]: Target 'include/generated/vdso-offsets.h' not remade because of errors.
   make[3]: Target 'arch/arm64/kernel/vdso/vdso.so' not remade because of errors.
   make[2]: *** [arch/arm64/Makefile:217: vdso_prepare] Error 2
   make[2]: Target 'prepare' not remade because of errors.
   make[1]: *** [Makefile:224: __sub-make] Error 2
   make[1]: Target 'prepare' not remade because of errors.
   make: *** [Makefile:224: __sub-make] Error 2
   make: Target 'prepare' not remade because of errors.


vim +25 arch/arm64/include/asm/vdso.h

0a7927d2b89e55 Adhemerval Zanella 2024-08-29  24  
9031fefde6f2ac Will Deacon        2012-03-05 @25  #include <generated/vdso-offsets.h>
9031fefde6f2ac Will Deacon        2012-03-05  26

Jason A. Donenfeld Sept. 2, 2024, 1:11 p.m. UTC | #12

Hey Christophe (for header logic) & Will (for arm64 stuff),

On Fri, Aug 30, 2024 at 09:28:29AM -0300, Adhemerval Zanella Netto wrote:
> >> diff --git a/lib/vdso/getrandom.c b/lib/vdso/getrandom.c
> >> index 938ca539aaa6..7c9711248d9b 100644
> >> --- a/lib/vdso/getrandom.c
> >> +++ b/lib/vdso/getrandom.c
> >> @@ -5,6 +5,7 @@
> >>  
> >>  #include <linux/array_size.h>
> >>  #include <linux/minmax.h>
> >> +#include <linux/mm.h>
> >>  #include <vdso/datapage.h>
> >>  #include <vdso/getrandom.h>
> >>  #include <vdso/unaligned.h>
> > 
> > Looks like this should be a separate change?
> 
> 
> It is required so arm64 can use  c-getrandom-y, otherwise vgetrandom.o build
> fails:
> 
> CC      arch/arm64/kernel/vdso/vgetrandom.o
> In file included from ./include/uapi/linux/mman.h:5,
>                  from /mnt/projects/linux/linux-git/lib/vdso/getrandom.c:13,
>                  from <command-line>:
> ./arch/arm64/include/asm/mman.h: In function ‘arch_calc_vm_prot_bits’:
> ./arch/arm64/include/asm/mman.h:14:13: error: implicit declaration of function ‘system_supports_bti’ [-Werror=implicit-function-declaration]
>    14 |         if (system_supports_bti() && (prot & PROT_BTI))
>       |             ^~~~~~~~~~~~~~~~~~~
> ./arch/arm64/include/asm/mman.h:15:24: error: ‘VM_ARM64_BTI’ undeclared (first use in this function); did you mean ‘ARM64_BTI’?
>    15 |                 ret |= VM_ARM64_BTI;
>       |                        ^~~~~~~~~~~~
>       |                        ARM64_BTI
> ./arch/arm64/include/asm/mman.h:15:24: note: each undeclared identifier is reported only once for each function it appears in
> ./arch/arm64/include/asm/mman.h:17:13: error: implicit declaration of function ‘system_supports_mte’ [-Werror=implicit-function-declaration]
>    17 |         if (system_supports_mte() && (prot & PROT_MTE))
>       |             ^~~~~~~~~~~~~~~~~~~
> ./arch/arm64/include/asm/mman.h:18:24: error: ‘VM_MTE’ undeclared (first use in this function)
>    18 |                 ret |= VM_MTE;
>       |                        ^~~~~~
> ./arch/arm64/include/asm/mman.h: In function ‘arch_calc_vm_flag_bits’:
> ./arch/arm64/include/asm/mman.h:32:24: error: ‘VM_MTE_ALLOWED’ undeclared (first use in this function)
>    32 |                 return VM_MTE_ALLOWED;
>       |                        ^~~~~~~~~~~~~~
> ./arch/arm64/include/asm/mman.h: In function ‘arch_validate_flags’:
> ./arch/arm64/include/asm/mman.h:59:29: error: ‘VM_MTE’ undeclared (first use in this function)
>    59 |         return !(vm_flags & VM_MTE) || (vm_flags & VM_MTE_ALLOWED);
>       |                             ^~~~~~
> ./arch/arm64/include/asm/mman.h:59:52: error: ‘VM_MTE_ALLOWED’ undeclared (first use in this function)
>    59 |         return !(vm_flags & VM_MTE) || (vm_flags & VM_MTE_ALLOWED);
>       |                                                    ^~~~~~~~~~~~~~
> arch/arm64/kernel/vdso/vgetrandom.c: In function ‘__kernel_getrandom’:
> arch/arm64/kernel/vdso/vgetrandom.c:18:25: error: ‘ENOSYS’ undeclared (first use in this function); did you mean ‘ENOSPC’?
>    18 |                 return -ENOSYS;
>       |                         ^~~~~~
>       |                         ENOSPC
> cc1: some warnings being treated as errors
> 
> I can move to a different patch, but this is really tied to this patch.

Adhemerval kept this change in this patch for v3, which, if it's
necessary, is fine with me. But I was looking to see if there was
another way of doing it, because including linux/mm.h inside of vdso
code is kind of contrary to your project with e379299fe0b3 ("random:
vDSO: minimize and simplify header includes").

getrandom.c includes uapi/linux/mman.h for the mmap constants. That
seems fine; it's userspace code after all. But then uapi/linux/mman.h
has this:

   #include <asm/mman.h>
   #include <asm-generic/hugetlb_encode.h>
   #include <linux/types.h>

The asm-generic/ one resolves to uapi/asm-generic. But the asm/ one
resolves to arch code, which is where we then get in trouble on ARM,
where arch/arm64/include/asm/mman.h has all sorts of kernel code in it.

Maybe, instead, it should resolve to arch/arm64/include/uapi/asm/mman.h,
which is the header that userspace actually uses in normal user code?

Is this a makefile problem? What's going on here? Seems like this is
something worth sorting out. Or I can take Adhemerval's v3 as-is and
we'll grit our teeth and work it out later, as you prefer. But I thought
I should mention it.

Thoughts?

Jason

Christophe Leroy Sept. 2, 2024, 1:19 p.m. UTC | #13

On 02/09/2024 at 15:11, Jason A. Donenfeld wrote:
> Is this a makefile problem? What's going on here? Seems like this is
> something worth sorting out. Or I can take Adhemerval's v3 as-is and
> we'll grit our teeth and work it out later, as you prefer. But I thought
> I should mention it.

That's a tricky problem, I also have it on powerpc, see patch 5, I 
solved it that way:

In the Makefile:
-ccflags-y := -fno-common -fno-builtin
+ccflags-y := -fno-common -fno-builtin -DBUILD_VDSO

In arch/powerpc/include/asm/mman.h:

diff --git a/arch/powerpc/include/asm/mman.h b/arch/powerpc/include/asm/mman.h
index 17a77d47ed6d..42a51a993d94 100644
--- a/arch/powerpc/include/asm/mman.h
+++ b/arch/powerpc/include/asm/mman.h
@@ -6,7 +6,7 @@

  #include <uapi/asm/mman.h>

-#ifdef CONFIG_PPC64
+#if defined(CONFIG_PPC64) && !defined(BUILD_VDSO)

  #include <asm/cputable.h>
  #include <linux/mm.h>

So that the only thing that remains in arch/powerpc/include/asm/mman.h 
when building a VDSO is #include <uapi/asm/mman.h>

I got the idea from ARM64, they use something similar in their 
arch/arm64/include/asm/rwonce.h

Christophe
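
To make the guard pattern concrete, here is a self-contained C sketch of how the same idea might look for an arm64-style asm/mman.h. It is an illustration only: the BUILD_VDSO define stands in for -DBUILD_VDSO on the compiler command line, the PROT_* defines stand in for the include of uapi/asm/mman.h, and MMAN_HAS_KERNEL_HELPERS and mman_has_kernel_helpers() are hypothetical names added to make the effect observable.

```c
#include <assert.h>

/* Stand-in for passing -DBUILD_VDSO via ccflags-y in the vDSO Makefile. */
#define BUILD_VDSO 1

/* Stand-in for "#include <uapi/asm/mman.h>": plain constants only. */
#define PROT_BTI 0x10
#define PROT_MTE 0x20

#ifndef BUILD_VDSO
/*
 * Kernel-only helper. In a vDSO build this would not compile, because
 * system_supports_bti() and VM_ARM64_BTI are not declared there, so the
 * guard removes it entirely.
 */
static inline unsigned long arch_calc_vm_prot_bits(unsigned long prot)
{
	unsigned long ret = 0;

	if (system_supports_bti() && (prot & PROT_BTI))
		ret |= VM_ARM64_BTI;
	return ret;
}
#define MMAN_HAS_KERNEL_HELPERS 1
#else
#define MMAN_HAS_KERNEL_HELPERS 0
#endif

/* Reports whether the kernel-only part of the header was compiled in. */
static int mman_has_kernel_helpers(void)
{
	return MMAN_HAS_KERNEL_HELPERS;
}
```

With the define in place, only the uapi constants remain visible to the vDSO code, which is the property that lets the linux/mm.h include be dropped from lib/vdso/getrandom.c.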

Jason A. Donenfeld Sept. 2, 2024, 1:25 p.m. UTC | #14

On Mon, Sep 02, 2024 at 03:19:56PM +0200, Christophe Leroy wrote:
> That's a tricky problem, I also have it on powerpc, see patch 5, I 
> solved it that way:
> 
> In the Makefile:
> -ccflags-y := -fno-common -fno-builtin
> +ccflags-y := -fno-common -fno-builtin -DBUILD_VDSO
> 
> In arch/powerpc/include/asm/mman.h:
> 
> diff --git a/arch/powerpc/include/asm/mman.h b/arch/powerpc/include/asm/mman.h
> index 17a77d47ed6d..42a51a993d94 100644
> --- a/arch/powerpc/include/asm/mman.h
> +++ b/arch/powerpc/include/asm/mman.h
> @@ -6,7 +6,7 @@
> 
>   #include <uapi/asm/mman.h>
> 
> -#ifdef CONFIG_PPC64
> +#if defined(CONFIG_PPC64) && !defined(BUILD_VDSO)
> 
>   #include <asm/cputable.h>
>   #include <linux/mm.h>
> 
> So that the only thing that remains in arch/powerpc/include/asm/mman.h 
> when building a VDSO is #include <uapi/asm/mman.h>
> 
> I got the idea from ARM64, they use something similar in their 
> arch/arm64/include/asm/rwonce.h

That seems reasonable enough. Adhemerval - do you want to incorporate
this solution for your v+1? And Will, is it okay to keep that as one
patch, as Christophe has done, rather than splitting it, so the whole
change is hermetic?

Jason

Adhemerval Zanella Netto Sept. 2, 2024, 1:47 p.m. UTC | #15

On 02/09/24 10:25, Jason A. Donenfeld wrote:
> That seems reasonable enough. Adhemerval - do you want to incorporate
> this solution for your v+1? And Will, is it okay to keep that as one
> patch, as Christophe has done, rather than splitting it, so the whole
> change is hermetic?

Sure, I will do it for v4.

Will Deacon Sept. 2, 2024, 3:10 p.m. UTC | #16

On Mon, Sep 02, 2024 at 03:25:34PM +0200, Jason A. Donenfeld wrote:
> On Mon, Sep 02, 2024 at 03:19:56PM +0200, Christophe Leroy wrote:
> > diff --git a/arch/powerpc/include/asm/mman.h b/arch/powerpc/include/asm/mman.h
> > index 17a77d47ed6d..42a51a993d94 100644
> > --- a/arch/powerpc/include/asm/mman.h
> > +++ b/arch/powerpc/include/asm/mman.h
> > @@ -6,7 +6,7 @@
> > 
> >   #include <uapi/asm/mman.h>
> > 
> > -#ifdef CONFIG_PPC64
> > +#if defined(CONFIG_PPC64) && !defined(BUILD_VDSO)
> > 
> >   #include <asm/cputable.h>
> >   #include <linux/mm.h>
> > 
> > So that the only thing that remains in arch/powerpc/include/asm/mman.h 
> > when building a VDSO is #include <uapi/asm/mman.h>
> > 
> > I got the idea from ARM64, they use something similar in their 
> > arch/arm64/include/asm/rwonce.h
> 
> That seems reasonable enough. Adhemerval - do you want to incorporate
> this solution for your v+1? And Will, is it okay to keep that as one
> patch, as Christophe has done, rather than splitting it, so the whole
> change is hermetic?

Yup, that makes sense to me (and the lib/vdso/getrandom.c change would go
away entirely).

Will
