diff mbox series

[22/35] arm64/mm: Implement map_shadow_stack()

Message ID 20230716-arm64-gcs-v1-22-bf567f93bba6@kernel.org (mailing list archive)
State Handled Elsewhere
Series arm64/gcs: Provide support for GCS at EL0

Checks

Context Check Description
conchuod/tree_selection fail Failed to apply to next/pending-fixes, riscv/for-next or riscv/master

Commit Message

Mark Brown July 16, 2023, 9:51 p.m. UTC
As discussed extensively in the changelog for the addition of this
syscall on x86 ("x86/shstk: Introduce map_shadow_stack syscall"), the
existing mmap() and madvise() syscalls do not map well onto the
security requirements for guarded control stacks: they lead to
windows where memory is allocated but not yet protected, or to stacks
which are not properly and safely initialised. Instead, a new syscall,
map_shadow_stack(), has been defined which allocates and initialises a
shadow stack page.

Implement this for arm64, initialising memory allocated this way with
the top two entries in the stack being 0 (to allow detection of the end
of the GCS) and a GCS cap token (to allow switching to the newly
allocated GCS via the GCS switch instructions).

Since the x86 code has not yet been rebased to v6.5-rc1, this includes
the architecture-neutral parts of Rick Edgecombe's "x86/shstk: Introduce
map_shadow_stack syscall".

Signed-off-by: Mark Brown <broonie@kernel.org>
---
 arch/arm64/mm/gcs.c               | 44 ++++++++++++++++++++++++++++++++++++++-
 include/linux/syscalls.h          |  1 +
 include/uapi/asm-generic/unistd.h |  5 ++++-
 kernel/sys_ni.c                   |  1 +
 4 files changed, 49 insertions(+), 2 deletions(-)
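
For readers following the layout described in the commit message, here is a
minimal userspace sketch of where this version of the patch places the two
top entries. It only mirrors the index arithmetic (PAGE_ALIGN(size), then
two 8-byte slots down from the top). The helper names page_align(),
cap_slot_index() and end_slot_index() are hypothetical and are not part of
the patch, and the sketch does not compute the cap token value itself.

```c
#include <stdint.h>

/*
 * Round size up to a multiple of page_size, in the manner of PAGE_ALIGN().
 * Assumes page_size is a power of two and size is nonzero.
 */
static inline uint64_t page_align(uint64_t size, uint64_t page_size)
{
	return (size + page_size - 1) & ~(page_size - 1);
}

/* Index of the 8-byte slot that receives the GCS cap token. */
static inline uint64_t cap_slot_index(uint64_t size, uint64_t page_size)
{
	return page_align(size, page_size) / sizeof(uint64_t) - 2;
}

/* Index of the top slot, which stays 0 to mark the end of the GCS. */
static inline uint64_t end_slot_index(uint64_t size, uint64_t page_size)
{
	return page_align(size, page_size) / sizeof(uint64_t) - 1;
}
```

Under these assumptions a request for 100 bytes and a request for 4096 bytes
both put the cap token in slot 510 on a 4K-page kernel, because the
placement depends on the aligned size rather than the requested size.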

Comments

Szabolcs Nagy July 18, 2023, 9:10 a.m. UTC | #1
The 07/16/2023 22:51, Mark Brown wrote:
> +SYSCALL_DEFINE3(map_shadow_stack, unsigned long, addr, unsigned long, size, unsigned int, flags)
> +{
> +	unsigned long aligned_size;
> +	unsigned long __user *cap_ptr;
> +	unsigned long cap_val;
> +	int ret;
> +
> +	if (!system_supports_gcs())
> +		return -EOPNOTSUPP;
> +
> +	if (flags)
> +		return -EINVAL;
> +
> +	/*
> +	 * An overflow would result in attempting to write the restore token
> +	 * to the wrong location. Not catastrophic, but just return the right
> +	 * error code and block it.
> +	 */
> +	aligned_size = PAGE_ALIGN(size);
> +	if (aligned_size < size)
> +		return -EOVERFLOW;
> +
> +	addr = alloc_gcs(addr, aligned_size, 0, false);
> +	if (IS_ERR_VALUE(addr))
> +		return addr;
> +
> +	/*
> +	 * Put a cap token at the end of the allocated region so it
> +	 * can be switched to.
> +	 */
> +	cap_ptr = (unsigned long __user *)(addr + aligned_size -
> +					   (2 * sizeof(unsigned long)));
> +	cap_val = GCS_CAP(cap_ptr);
> +
> +	ret = copy_to_user_gcs(cap_ptr, &cap_val, 1);

with

  uint64_t *p = map_shadow_stack(0, N*8, 0);

i'd expect p[N-1] to be the end token and p[N-2] to be the cap token,
not p[PAGE_ALIGN(N*8)/8-2].

if we allow misaligned size here (and in munmap) then i think it's
better to not page align.  size%8!=0 || size<16 can be an error.


> +	if (ret != 0) {
> +		vm_munmap(addr, size);
> +		return -EFAULT;
> +	}
> +
> +	return addr;
> +}
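
A minimal sketch of the alternative suggested in the comment above, assuming
the size is taken as given rather than page aligned. The helper names
gcs_size_valid(), proposed_cap_index() and proposed_end_index() are
hypothetical, and the review does not say which error code a rejected size
should produce, so the check here only returns a boolean.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Suggested validity check: the size must be a multiple of 8 bytes and
 * must have room for at least two 8-byte entries.
 */
static inline bool gcs_size_valid(uint64_t size)
{
	return size % 8 == 0 && size >= 16;
}

/* With size == N * 8, the cap token would land in p[N-2]. */
static inline uint64_t proposed_cap_index(uint64_t size)
{
	return size / 8 - 2;
}

/* With size == N * 8, the end token would land in p[N-1]. */
static inline uint64_t proposed_end_index(uint64_t size)
{
	return size / 8 - 1;
}
```

With this placement the token positions follow the size the caller passed in,
so they do not move when the same binary runs on a kernel with a different
page size.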
Mark Brown July 18, 2023, 1:55 p.m. UTC | #2
On Tue, Jul 18, 2023 at 10:10:04AM +0100, Szabolcs Nagy wrote:

>   uint64_t *p = map_shadow_stack(0, N*8, 0);

> i'd expect p[N-1] to be the end token and p[N-2] to be the cap token,
> not p[PAGE_ALIGN(N*8)/8-2].

Yes, that probably would be more helpful.

> if we allow misaligned size here (and in munmap) then i think it's
> better to not page align.  size%8!=0 || size<16 can be an error.

Honestly I'd be a lot happier to just not allow misalignment but that
raises the issue with binaries randomly not working when moved to a
kernel with a different page size.  I'll have a think but possibly the
safest thing would be requiring a multiple of 4K then rounding up to our
actual page size.
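
One possible reading of the "multiple of 4K, then round up" idea, as a
userspace sketch. The name gcs_alloc_size(), the GCS_SIZE_GRANULE constant,
the choice to reject a zero size, and the use of 0 as the failure value are
all assumptions made for illustration; none of them come from the patch.

```c
#include <stdint.h>

/* Hypothetical name for the 4K unit that sizes would have to be a multiple of. */
#define GCS_SIZE_GRANULE 4096ULL

/*
 * Return the size to allocate for a request, or 0 if the request is
 * rejected. page_size is assumed to be a power of two no smaller than 4K.
 */
static inline uint64_t gcs_alloc_size(uint64_t size, uint64_t page_size)
{
	uint64_t rounded;

	/* Require a nonzero multiple of 4K regardless of the real page size. */
	if (size == 0 || size % GCS_SIZE_GRANULE != 0)
		return 0;

	/* Round up to the page size the running kernel actually uses. */
	rounded = (size + page_size - 1) & ~(page_size - 1);
	if (rounded < size)	/* the addition wrapped around */
		return 0;

	return rounded;
}
```

Under this reading a 4096-byte request is accepted on both 4K-page and
64K-page kernels, with the allocation growing to 65536 bytes on the latter,
while a 100-byte request is rejected on either.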
Rick Edgecombe July 18, 2023, 3:49 p.m. UTC | #3
On Tue, 2023-07-18 at 14:55 +0100, Mark Brown wrote:
> On Tue, Jul 18, 2023 at 10:10:04AM +0100, Szabolcs Nagy wrote:
> 
> >    uint64_t *p = map_shadow_stack(0, N*8, 0);
> 
> > i'd expect p[N-1] to be the end token and p[N-2] to be the cap
> > token,
> > not p[PAGE_ALIGN(N*8)/8-2].
> 
> Yes, that probably would be more helpful.

HJ made a similar request on the x86 side. He wanted an unaligned size
passed in to result in unaligned token placement.

> 
> > if we allow misaligned size here (and in munmap) then i think it's
> > better to not page align.  size%8!=0 || size<16 can be an error.
> 
> Honestly I'd be a lot happier to just not allow misalignment but that
> raises the issue with binaries randomly not working when moved to a
> kernel with a different page size.  I'll have a think but possibly
> the
> safest thing would be requiring a multiple of 4K then rounding up to
> our
> actual page size.


Someday when the x86 side is finally upstream I have a manpage for
map_shadow_stack. Any differences on the arm side would need to be
documented, but I'm not sure why there should be any differences. Like,
why not use the same flags? Or have a new flag for token+end marker
that x86 can use as well?

Patch

diff --git a/arch/arm64/mm/gcs.c b/arch/arm64/mm/gcs.c
index b137493c594d..4a0a736800c0 100644
--- a/arch/arm64/mm/gcs.c
+++ b/arch/arm64/mm/gcs.c
@@ -52,7 +52,6 @@  unsigned long gcs_alloc_thread_stack(struct task_struct *tsk,
 		return 0;
 
 	size = gcs_size(size);
-
 	addr = alloc_gcs(0, size, 0, 0);
 	if (IS_ERR_VALUE(addr))
 		return addr;
@@ -64,6 +63,49 @@  unsigned long gcs_alloc_thread_stack(struct task_struct *tsk,
 	return addr;
 }
 
+SYSCALL_DEFINE3(map_shadow_stack, unsigned long, addr, unsigned long, size, unsigned int, flags)
+{
+	unsigned long aligned_size;
+	unsigned long __user *cap_ptr;
+	unsigned long cap_val;
+	int ret;
+
+	if (!system_supports_gcs())
+		return -EOPNOTSUPP;
+
+	if (flags)
+		return -EINVAL;
+
+	/*
+	 * An overflow would result in attempting to write the restore token
+	 * to the wrong location. Not catastrophic, but just return the right
+	 * error code and block it.
+	 */
+	aligned_size = PAGE_ALIGN(size);
+	if (aligned_size < size)
+		return -EOVERFLOW;
+
+	addr = alloc_gcs(addr, aligned_size, 0, false);
+	if (IS_ERR_VALUE(addr))
+		return addr;
+
+	/*
+	 * Put a cap token at the end of the allocated region so it
+	 * can be switched to.
+	 */
+	cap_ptr = (unsigned long __user *)(addr + aligned_size -
+					   (2 * sizeof(unsigned long)));
+	cap_val = GCS_CAP(cap_ptr);
+
+	ret = copy_to_user_gcs(cap_ptr, &cap_val, 1);
+	if (ret != 0) {
+		vm_munmap(addr, size);
+		return -EFAULT;
+	}
+
+	return addr;
+}
+
 /*
  * Apply the GCS mode configured for the specified task to the
  * hardware.
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 03e3d0121d5e..7f6dc0988197 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -953,6 +953,7 @@  asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long l
 asmlinkage long sys_cachestat(unsigned int fd,
 		struct cachestat_range __user *cstat_range,
 		struct cachestat __user *cstat, unsigned int flags);
+asmlinkage long sys_map_shadow_stack(unsigned long addr, unsigned long size, unsigned int flags);
 
 /*
  * Architecture-specific system calls
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index fd6c1cb585db..38885a795ea6 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -820,8 +820,11 @@  __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)
 #define __NR_cachestat 451
 __SYSCALL(__NR_cachestat, sys_cachestat)
 
+#define __NR_map_shadow_stack 452
+__SYSCALL(__NR_map_shadow_stack, sys_map_shadow_stack)
+
 #undef __NR_syscalls
-#define __NR_syscalls 452
+#define __NR_syscalls 453
 
 /*
  * 32 bit systems traditionally used different
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 781de7cc6a4e..e137c1385c56 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -274,6 +274,7 @@  COND_SYSCALL(vm86old);
 COND_SYSCALL(modify_ldt);
 COND_SYSCALL(vm86);
 COND_SYSCALL(kexec_file_load);
+COND_SYSCALL(map_shadow_stack);
 
 /* s390 */
 COND_SYSCALL(s390_pci_mmio_read);