[v7,30/41] x86/shstk: Handle thread shadow stack

Message ID	20230227222957.24501-31-rick.p.edgecombe@intel.com (mailing list archive)
State	New
Headers
Return-Path: <owner-linux-mm@kvack.org>
From: Rick Edgecombe <rick.p.edgecombe@intel.com>
To: x86@kernel.org, "H . Peter Anvin" <hpa@zytor.com>, Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@redhat.com>, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-mm@kvack.org, linux-arch@vger.kernel.org, linux-api@vger.kernel.org, Arnd Bergmann <arnd@arndb.de>, Andy Lutomirski <luto@kernel.org>, Balbir Singh <bsingharora@gmail.com>, Borislav Petkov <bp@alien8.de>, Cyrill Gorcunov <gorcunov@gmail.com>, Dave Hansen <dave.hansen@linux.intel.com>, Eugene Syromiatnikov <esyr@redhat.com>, Florian Weimer <fweimer@redhat.com>, "H . J . Lu" <hjl.tools@gmail.com>, Jann Horn <jannh@google.com>, Jonathan Corbet <corbet@lwn.net>, Kees Cook <keescook@chromium.org>, Mike Kravetz <mike.kravetz@oracle.com>, Nadav Amit <nadav.amit@gmail.com>, Oleg Nesterov <oleg@redhat.com>, Pavel Machek <pavel@ucw.cz>, Peter Zijlstra <peterz@infradead.org>, Randy Dunlap <rdunlap@infradead.org>, Weijiang Yang <weijiang.yang@intel.com>, "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>, John Allen <john.allen@amd.com>, kcc@google.com, eranian@google.com, rppt@kernel.org, jamorris@linux.microsoft.com, dethoma@microsoft.com, akpm@linux-foundation.org, Andrew.Cooper3@citrix.com, christina.schimpe@intel.com, david@redhat.com, debug@rivosinc.com
Cc: rick.p.edgecombe@intel.com, Yu-cheng Yu <yu-cheng.yu@intel.com>
Subject: [PATCH v7 30/41] x86/shstk: Handle thread shadow stack
Date: Mon, 27 Feb 2023 14:29:46 -0800
Message-Id: <20230227222957.24501-31-rick.p.edgecombe@intel.com>
In-Reply-To: <20230227222957.24501-1-rick.p.edgecombe@intel.com>
References: <20230227222957.24501-1-rick.p.edgecombe@intel.com>
Sender: owner-linux-mm@kvack.org
Precedence: bulk
Series	Shadow stacks for userspace

Edgecombe, Rick P Feb. 27, 2023, 10:29 p.m. UTC

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

When a process is duplicated, but the child shares the address space with
the parent, there is potential for the threads sharing a single stack to
cause conflicts for each other. In the normal non-cet case this is handled
in two ways.

With regular CLONE_VM a new stack is provided by userspace such that the
parent and child have different stacks.

For vfork, the parent is suspended until the child exits. So as long as
the child doesn't return from the vfork()/CLONE_VFORK calling function and
sticks to a limited set of operations, the parent and child can share the
same stack.

For shadow stack, these scenarios present similar sharing problems. For the
CLONE_VM case, the child and the parent must have separate shadow stacks.
Instead of changing clone to take a shadow stack, have the kernel just
allocate one and switch to it.

Use stack_size passed from clone3() syscall for thread shadow stack size. A
compat-mode thread shadow stack size is further reduced to 1/4. This
allows more threads to run in a 32-bit address space. The clone() does not
pass stack_size, which was added to clone3(). In that case, use
RLIMIT_STACK size and cap to 4 GB.
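The size selection described above could be sketched in plain userspace C roughly as follows. This is only an illustration, not the patch's code: every sketch_* name and SKETCH_* constant is hypothetical, a 4 KiB page is assumed, and applying the compat 1/4 reduction to whichever size was picked is an assumption about how the paragraph is meant to be read.

```c
#include <stdint.h>

/* Hypothetical constants for this sketch; not taken from the patch. */
#define SKETCH_PAGE_SIZE 4096ULL
#define SKETCH_SZ_4G     (1ULL << 32)

/* Round a size up to the next multiple of the (assumed) page size. */
static uint64_t sketch_page_align(uint64_t size)
{
	return (size + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1);
}

/*
 * requested:    stack_size as passed to clone3(), or 0 when the caller
 *               used clone(), which has no such field.
 * rlimit_stack: the RLIMIT_STACK soft limit (may be "infinity").
 * compat:       nonzero for a 32-bit (compat-mode) thread.
 */
static uint64_t sketch_thread_shstk_size(uint64_t requested,
					 uint64_t rlimit_stack, int compat)
{
	uint64_t size = requested;

	/* No explicit size: fall back to RLIMIT_STACK, capped at 4 GB. */
	if (!size)
		size = rlimit_stack < SKETCH_SZ_4G ? rlimit_stack : SKETCH_SZ_4G;

	/* Assumption: the compat reduction applies to either source. */
	if (compat)
		size /= 4;

	return sketch_page_align(size);
}
```

With an 8 MB RLIMIT_STACK and no clone3() size, this picks 8 MB for a 64-bit thread and 2 MB for a compat thread; an unlimited RLIMIT_STACK is capped at 4 GB.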

For shadow stack enabled vfork(), the parent and child can share the same
shadow stack, like they can share a normal stack. Since the parent is
suspended until the child terminates, the child will not interfere with
the parent while executing as long as it doesn't return from the vfork()
and overwrite up the shadow stack. The child can safely overwrite down
the shadow stack, as the parent can just overwrite this later. So CET does
not add any additional limitations for vfork().
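To summarize the cases discussed so far, here is a small userspace sketch of the decision the kernel has to make per clone. It is not the code from this patch: the enum, the function and the order of the checks are made up for illustration. Only the two flag values are real; they mirror CLONE_VM and CLONE_VFORK from <linux/sched.h>.

```c
/* Flag values mirror the Linux UAPI definitions in <linux/sched.h>. */
#define SKETCH_CLONE_VM    0x00000100UL
#define SKETCH_CLONE_VFORK 0x00004000UL

/* Hypothetical names for the possible outcomes. */
enum sketch_shstk_action {
	SKETCH_SHSTK_NONE,		/* shadow stack not enabled */
	SKETCH_SHSTK_COPIED_BY_FORK,	/* separate mm, nothing to allocate */
	SKETCH_SHSTK_SHARE_WITH_PARENT,	/* vfork(): parent is suspended */
	SKETCH_SHSTK_ALLOC_NEW		/* thread: needs its own shadow stack */
};

static enum sketch_shstk_action
sketch_shstk_action(int shstk_enabled, unsigned long clone_flags)
{
	if (!shstk_enabled)
		return SKETCH_SHSTK_NONE;

	/* No shared address space: the child gets a copy along with the mm. */
	if (!(clone_flags & SKETCH_CLONE_VM))
		return SKETCH_SHSTK_COPIED_BY_FORK;

	/* Shared mm, but the parent waits: the shadow stack can be shared. */
	if (clone_flags & SKETCH_CLONE_VFORK)
		return SKETCH_SHSTK_SHARE_WITH_PARENT;

	/* Shared mm and both tasks run: allocate a new shadow stack. */
	return SKETCH_SHSTK_ALLOC_NEW;
}
```

Only the last case allocates anything, which is also the only case where the new task's shadow stack pointer has to change.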

Userspace implementing posix vfork() can actually prevent the child from
returning from the vfork() calling function, using CET. Glibc does this
by adjusting the shadow stack pointer in the child, so that the child
receives a #CP if it tries to return from vfork() calling function.

Free the shadow stack on thread exit by doing it in mm_release(). Skip
this when exiting a vfork() child since the stack is shared in the
parent.

During this operation, the shadow stack pointer of the new thread needs
to be updated to point to the newly allocated shadow stack. Since the
ability to do this is confined to the FPU subsystem, change
fpu_clone() to take the new shadow stack pointer, and update it
internally inside the FPU subsystem. This part was suggested by Thomas
Gleixner.

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

---
v3:
 - Fix update_fpu_shstk() stub (Mike Rapoport)
 - Fix chunks around alloc_shstk() in wrong patch (Kees)
 - Fix stack_size/flags swap (Kees)
 - Use centralized stack size logic (Kees)

v2:
 - Have fpu_clone() take new shadow stack pointer and update SSP in
   xsave buffer for new task. (tglx)

v1:
 - Expand commit log.
 - Add more comments.
 - Switch to xsave helpers.

Yu-cheng v30:
 - Update comments about clone()/clone3(). (Borislav Petkov)
---
 arch/x86/include/asm/fpu/sched.h   |  3 ++-
 arch/x86/include/asm/mmu_context.h |  2 ++
 arch/x86/include/asm/shstk.h       |  7 +++++
 arch/x86/kernel/fpu/core.c         | 41 +++++++++++++++++++++++++++-
 arch/x86/kernel/process.c          | 18 ++++++++++++-
 arch/x86/kernel/shstk.c            | 43 ++++++++++++++++++++++++++++--
 6 files changed, 109 insertions(+), 5 deletions(-)

Szabolcs Nagy March 2, 2023, 5:34 p.m. UTC | #1

The 02/27/2023 14:29, Rick Edgecombe wrote:
> For shadow stack enabled vfork(), the parent and child can share the same
> shadow stack, like they can share a normal stack. Since the parent is
> suspended until the child terminates, the child will not interfere with
> the parent while executing as long as it doesn't return from the vfork()
> and overwrite up the shadow stack. The child can safely overwrite down
> the shadow stack, as the parent can just overwrite this later. So CET does
> not add any additional limitations for vfork().
> 
> Userspace implementing posix vfork() can actually prevent the child from
> returning from the vfork() calling function, using CET. Glibc does this
> by adjusting the shadow stack pointer in the child, so that the child
> receives a #CP if it tries to return from vfork() calling function.

this commit message implies there is protection against
the vfork child clobbering the parent's shadow stack,
but actually the child can INCSSP (or longjmp) and then
clobber it.

so the glibc code just tries to catch bugs and accidents
not a strong security mechanism. i'd skip this paragraph.

Edgecombe, Rick P March 2, 2023, 9:48 p.m. UTC | #2

On Thu, 2023-03-02 at 17:34 +0000, Szabolcs Nagy wrote:
> The 02/27/2023 14:29, Rick Edgecombe wrote:
> > For shadow stack enabled vfork(), the parent and child can share
> > the same
> > shadow stack, like they can share a normal stack. Since the parent
> > is
> > suspended until the child terminates, the child will not interfere
> > with
> > the parent while executing as long as it doesn't return from the
> > vfork()
> > and overwrite up the shadow stack. The child can safely overwrite
> > down
> > the shadow stack, as the parent can just overwrite this later. So
> > CET does
> > not add any additional limitations for vfork().
> > 
> > Userspace implementing posix vfork() can actually prevent the child
> > from
> > returning from the vfork() calling function, using CET. Glibc does
> > this
> > by adjusting the shadow stack pointer in the child, so that the
> > child
> > receives a #CP if it tries to return from vfork() calling function.
> 
> this commit message implies there is protection against
> the vfork child clobbering the parent's shadow stack,
> but actually the child can INCSSP (or longjmp) and then
> clobber it.

It's true the vfork child could use INCSSP and clobber to create
problems, so it is not a strong guarantee of shadow stack integrity.
But that's not claimed either. It does "prevent the child from
returning from the vfork() calling function" as much as shadow stack
protections apply, which I think would be reasonably understood. The
vfork child could also use wrss to write the return address to the
shadow stack and actually return, or disable shadow stack and return,
as other ways to create problems.

> 
> so the glibc code just tries to catch bugs and accidents
> not a strong security mechanism. i'd skip this paragraph.

Yep. I think it's very much a "nice to have" thing and not intended for
security. The paragraph is an aside anyway, because it is specifics
about another project. I don't have any objection to dropping it if the
opportunity comes up. IIRC it was added because someone thought vfork
couldn't work with shadow stack, so people might like to have the
details of how it can be done.

I wouldn't even be too bothered if the discussed glibc behavior was
dropped either. vfork() can go wrong many ways regardless of shadow
stack. Is it worth the extra special behavior? Maybe just barely...

Borislav Petkov March 8, 2023, 3:26 p.m. UTC | #3

On Mon, Feb 27, 2023 at 02:29:46PM -0800, Rick Edgecombe wrote:
> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> 
> When a process is duplicated, but the child shares the address space with
> the parent, there is potential for the threads sharing a single stack to
> cause conflicts for each other. In the normal non-cet case this is handled

"non-CET"

> in two ways.
> 
> With regular CLONE_VM a new stack is provided by userspace such that the
> parent and child have different stacks.
> 
> For vfork, the parent is suspended until the child exits. So as long as
> the child doesn't return from the vfork()/CLONE_VFORK calling function and
> sticks to a limited set of operations, the parent and child can share the
> same stack.
> 
> For shadow stack, these scenarios present similar sharing problems. For the
> CLONE_VM case, the child and the parent must have separate shadow stacks.
> Instead of changing clone to take a shadow stack, have the kernel just
> allocate one and switch to it.
> 
> Use stack_size passed from clone3() syscall for thread shadow stack size. A
> compat-mode thread shadow stack size is further reduced to 1/4. This
> allows more threads to run in a 32-bit address space. The clone() does not
> pass stack_size, which was added to clone3(). In that case, use
> RLIMIT_STACK size and cap to 4 GB.
> 
> For shadow stack enabled vfork(), the parent and child can share the same
> shadow stack, like they can share a normal stack. Since the parent is
> suspended until the child terminates, the child will not interfere with
> the parent while executing as long as it doesn't return from the vfork()
> and overwrite up the shadow stack. The child can safely overwrite down
> the shadow stack, as the parent can just overwrite this later. So CET does
> not add any additional limitations for vfork().
> 
> Userspace implementing posix vfork() can actually prevent the child from

"POSIX"

...

> diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
> index f851558b673f..bc3de4aeb661 100644
> --- a/arch/x86/kernel/fpu/core.c
> +++ b/arch/x86/kernel/fpu/core.c
> @@ -552,8 +552,41 @@ static inline void fpu_inherit_perms(struct fpu *dst_fpu)
>  	}
>  }
>  
> +#ifdef CONFIG_X86_USER_SHADOW_STACK
> +static int update_fpu_shstk(struct task_struct *dst, unsigned long ssp)
> +{
> +	struct cet_user_state *xstate;
> +
> +	/* If ssp update is not needed. */
> +	if (!ssp)
> +		return 0;
> +
> +	xstate = get_xsave_addr(&dst->thread.fpu.fpstate->regs.xsave,
> +				XFEATURE_CET_USER);
> +
> +	/*
> +	 * If there is a non-zero ssp, then 'dst' must be configured with a shadow
> +	 * stack and the fpu state should be up to date since it was just copied
> +	 * from the parent in fpu_clone(). So there must be a valid non-init CET
> +	 * state location in the buffer.
> +	 */
> +	if (WARN_ON_ONCE(!xstate))
> +		return 1;
> +
> +	xstate->user_ssp = (u64)ssp;
> +
> +	return 0;
> +}
> +#else
> +static int update_fpu_shstk(struct task_struct *dst, unsigned long shstk_addr)
								      ^^^^^^^^^^^
ssp, like above.

Better yet:

static int update_fpu_shstk(struct task_struct *dst, unsigned long ssp)
{
#ifdef CONFIG_X86_USER_SHADOW_STACK
	...
#endif
	return 0;
}

and less ifdeffery.
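Spelled out as something that compiles in userspace, the single-definition shape might look like the sketch below. It is only an illustration of the pattern: struct sketch_task, sketch_update_shstk() and the SKETCH_USER_SHADOW_STACK macro are hypothetical stand-ins for the task struct, update_fpu_shstk() and the Kconfig symbol, and the has_cet_state flag stands in for the get_xsave_addr() lookup.

```c
/* Stand-in for the Kconfig option; remove to get the "always 0" stub. */
#define SKETCH_USER_SHADOW_STACK 1

/* Hypothetical, minimal stand-in for the bits of task state involved. */
struct sketch_task {
	unsigned long long user_ssp;	/* saved shadow stack pointer */
	int has_cet_state;		/* is there CET state in the buffer? */
};

/* One definition for both configurations; only the body is conditional. */
static int sketch_update_shstk(struct sketch_task *dst, unsigned long ssp)
{
#ifdef SKETCH_USER_SHADOW_STACK
	/* Nothing to do if no new shadow stack was allocated. */
	if (!ssp)
		return 0;

	/* A non-zero ssp without CET state would be a bug: report it. */
	if (!dst->has_cet_state)
		return 1;

	dst->user_ssp = ssp;
#endif
	return 0;
}
```

The signature is written once, so the parameter name cannot drift between the real function and the stub, which is how the shstk_addr/ssp mismatch above came about.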



> +{
> +	return 0;
> +}
> +#endif
> +
>  /* Clone current's FPU state on fork */
> -int fpu_clone(struct task_struct *dst, unsigned long clone_flags, bool minimal)
> +int fpu_clone(struct task_struct *dst, unsigned long clone_flags, bool minimal,
> +	      unsigned long ssp)
>  {
>  	struct fpu *src_fpu = &current->thread.fpu;
>  	struct fpu *dst_fpu = &dst->thread.fpu;
> @@ -613,6 +646,12 @@ int fpu_clone(struct task_struct *dst, unsigned long clone_flags, bool minimal)
>  	if (use_xsave())
>  		dst_fpu->fpstate->regs.xsave.header.xfeatures &= ~XFEATURE_MASK_PASID;
>  
> +	/*
> +	 * Update shadow stack pointer, in case it changed during clone.
> +	 */
> +	if (update_fpu_shstk(dst, ssp))
> +		return 1;
> +
>  	trace_x86_fpu_copy_src(src_fpu);
>  	trace_x86_fpu_copy_dst(dst_fpu);
>  
> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index b650cde3f64d..bf703f53fa49 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -48,6 +48,7 @@
>  #include <asm/frame.h>
>  #include <asm/unwind.h>
>  #include <asm/tdx.h>
> +#include <asm/shstk.h>
>  
>  #include "process.h"
>  
> @@ -119,6 +120,7 @@ void exit_thread(struct task_struct *tsk)
>  
>  	free_vm86(t);
>  
> +	shstk_free(tsk);
>  	fpu__drop(fpu);
>  }
>  
> @@ -140,6 +142,7 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
>  	struct inactive_task_frame *frame;
>  	struct fork_frame *fork_frame;
>  	struct pt_regs *childregs;
> +	unsigned long shstk_addr = 0;
>  	int ret = 0;
>  
>  	childregs = task_pt_regs(p);
> @@ -174,7 +177,13 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
>  	frame->flags = X86_EFLAGS_FIXED;
>  #endif
>  
> -	fpu_clone(p, clone_flags, args->fn);
> +	/* Allocate a new shadow stack for pthread if needed */
> +	ret = shstk_alloc_thread_stack(p, clone_flags, args->stack_size,
> +				       &shstk_addr);

That function will return 0 even if shstk_addr hasn't been written in it
and you will continue merrily and call

	fpu_clone(..., shstk_addr=0);

why don't you return the shadow stack address or negative on error
instead of adding an I/O parameter which is pretty much always nasty to
deal with.
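For readers less familiar with the idiom, here is a userspace sketch of returning either an address or a negative errno in a single unsigned long. It imitates the kernel's IS_ERR_VALUE() convention (the top 4095 values are reserved for error codes) but is not kernel code: all sketch_* names, the flags passed to the allocator stand-in and the example address are made up.

```c
#include <errno.h>

/* Same idea as the kernel's MAX_ERRNO: the last 4095 values are errors. */
#define SKETCH_MAX_ERRNO 4095UL

/* True if v is really a negative errno stored in an unsigned long. */
static int sketch_is_err_value(unsigned long v)
{
	return v >= (unsigned long)-SKETCH_MAX_ERRNO;
}

/*
 * Hypothetical allocator stand-in with a single return value:
 *   0               shadow stack not enabled, nothing to update
 *   encoded -ENOMEM allocation failed
 *   anything else   the new shadow stack pointer (base + size)
 */
static unsigned long sketch_alloc_thread_stack(int enabled, int fail)
{
	unsigned long base = 0x10000000UL;	/* made-up address */
	unsigned long size = 0x2000UL;		/* made-up size */

	if (!enabled)
		return 0;
	if (fail)
		return (unsigned long)-ENOMEM;

	return base + size;
}
```

A caller checks sketch_is_err_value() first and, if it is set, casts the value back to long to recover the errno. Otherwise the value can be handed on as is, with 0 meaning "no update needed".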



> +	if (ret)
> +		return ret;
> +
> +	fpu_clone(p, clone_flags, args->fn, shstk_addr);
>  
>  	/* Kernel thread ? */
>  	if (unlikely(p->flags & PF_KTHREAD)) {

...

Edgecombe, Rick P March 8, 2023, 8:03 p.m. UTC | #4

On Wed, 2023-03-08 at 16:26 +0100, Borislav Petkov wrote:
> On Mon, Feb 27, 2023 at 02:29:46PM -0800, Rick Edgecombe wrote:
> > From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> > 
> > When a process is duplicated, but the child shares the address
> > space with
> > the parent, there is potential for the threads sharing a single
> > stack to
> > cause conflicts for each other. In the normal non-cet case this is
> > handled
> 
> "non-CET"

Sure.

> 
> > in two ways.
> > 
> > With regular CLONE_VM a new stack is provided by userspace such
> > that the
> > parent and child have different stacks.
> > 
> > For vfork, the parent is suspended until the child exits. So as
> > long as
> > the child doesn't return from the vfork()/CLONE_VFORK calling
> > function and
> > sticks to a limited set of operations, the parent and child can
> > share the
> > same stack.
> > 
> > For shadow stack, these scenarios present similar sharing problems.
> > For the
> > CLONE_VM case, the child and the parent must have separate shadow
> > stacks.
> > Instead of changing clone to take a shadow stack, have the kernel
> > just
> > allocate one and switch to it.
> > 
> > Use stack_size passed from clone3() syscall for thread shadow stack
> > size. A
> > compat-mode thread shadow stack size is further reduced to 1/4.
> > This
> > allows more threads to run in a 32-bit address space. The clone()
> > does not
> > pass stack_size, which was added to clone3(). In that case, use
> > RLIMIT_STACK size and cap to 4 GB.
> > 
> > For shadow stack enabled vfork(), the parent and child can share
> > the same
> > shadow stack, like they can share a normal stack. Since the parent
> > is
> > suspended until the child terminates, the child will not interfere
> > with
> > the parent while executing as long as it doesn't return from the
> > vfork()
> > and overwrite up the shadow stack. The child can safely overwrite
> > down
> > the shadow stack, as the parent can just overwrite this later. So
> > CET does
> > not add any additional limitations for vfork().
> > 
> > Userspace implementing posix vfork() can actually prevent the child
> > from
> 
> "POSIX"

Ok.

> 
> ...
> 
> > diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
> > index f851558b673f..bc3de4aeb661 100644
> > --- a/arch/x86/kernel/fpu/core.c
> > +++ b/arch/x86/kernel/fpu/core.c
> > @@ -552,8 +552,41 @@ static inline void fpu_inherit_perms(struct fpu *dst_fpu)
> >  	}
> >  }
> >  
> > +#ifdef CONFIG_X86_USER_SHADOW_STACK
> > +static int update_fpu_shstk(struct task_struct *dst, unsigned long ssp)
> > +{
> > +	struct cet_user_state *xstate;
> > +
> > +	/* If ssp update is not needed. */
> > +	if (!ssp)
> > +		return 0;
> > +
> > +	xstate = get_xsave_addr(&dst->thread.fpu.fpstate->regs.xsave,
> > +				XFEATURE_CET_USER);
> > +
> > +	/*
> > +	 * If there is a non-zero ssp, then 'dst' must be configured with a shadow
> > +	 * stack and the fpu state should be up to date since it was just copied
> > +	 * from the parent in fpu_clone(). So there must be a valid non-init CET
> > +	 * state location in the buffer.
> > +	 */
> > +	if (WARN_ON_ONCE(!xstate))
> > +		return 1;
> > +
> > +	xstate->user_ssp = (u64)ssp;
> > +
> > +	return 0;
> > +}
> > +#else
> > +static int update_fpu_shstk(struct task_struct *dst, unsigned long shstk_addr)
> 								      ^^^^^^^^^^^
> ssp, like above.
> 
> Better yet:
> 
> static int update_fpu_shstk(struct task_struct *dst, unsigned long ssp)
> {
> #ifdef CONFIG_X86_USER_SHADOW_STACK
> 	...
> #endif
> 	return 0;
> }
> 
> and less ifdeffery.

Sure. Sometimes people tell me to only ifdef out whole functions to
make it easier to read. I suppose in this case it's not hard to see.


> 
> 
> 
> > +{
> > +	return 0;
> > +}
> > +#endif
> > +
> >  /* Clone current's FPU state on fork */
> > -int fpu_clone(struct task_struct *dst, unsigned long clone_flags, bool minimal)
> > +int fpu_clone(struct task_struct *dst, unsigned long clone_flags, bool minimal,
> > +	      unsigned long ssp)
> >  {
> >  	struct fpu *src_fpu = &current->thread.fpu;
> >  	struct fpu *dst_fpu = &dst->thread.fpu;
> > @@ -613,6 +646,12 @@ int fpu_clone(struct task_struct *dst, unsigned long clone_flags, bool minimal)
> >  	if (use_xsave())
> >  		dst_fpu->fpstate->regs.xsave.header.xfeatures &= ~XFEATURE_MASK_PASID;
> >  
> > +	/*
> > +	 * Update shadow stack pointer, in case it changed during clone.
> > +	 */
> > +	if (update_fpu_shstk(dst, ssp))
> > +		return 1;
> > +
> >  	trace_x86_fpu_copy_src(src_fpu);
> >  	trace_x86_fpu_copy_dst(dst_fpu);
> >  
> > diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> > index b650cde3f64d..bf703f53fa49 100644
> > --- a/arch/x86/kernel/process.c
> > +++ b/arch/x86/kernel/process.c
> > @@ -48,6 +48,7 @@
> >  #include <asm/frame.h>
> >  #include <asm/unwind.h>
> >  #include <asm/tdx.h>
> > +#include <asm/shstk.h>
> >  
> >  #include "process.h"
> >  
> > @@ -119,6 +120,7 @@ void exit_thread(struct task_struct *tsk)
> >  
> >  	free_vm86(t);
> >  
> > +	shstk_free(tsk);
> >  	fpu__drop(fpu);
> >  }
> >  
> > @@ -140,6 +142,7 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
> >  	struct inactive_task_frame *frame;
> >  	struct fork_frame *fork_frame;
> >  	struct pt_regs *childregs;
> > +	unsigned long shstk_addr = 0;
> >  	int ret = 0;
> >  
> >  	childregs = task_pt_regs(p);
> > @@ -174,7 +177,13 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
> >  	frame->flags = X86_EFLAGS_FIXED;
> >  #endif
> >  
> > -	fpu_clone(p, clone_flags, args->fn);
> > +	/* Allocate a new shadow stack for pthread if needed */
> > +	ret = shstk_alloc_thread_stack(p, clone_flags, args->stack_size,
> > +				       &shstk_addr);
> 
> That function will return 0 even if shstk_addr hasn't been written in
> it
> and you will continue merrily and call
> 
> 	fpu_clone(..., shstk_addr=0);
> 
> why don't you return the shadow stack address or negative on error
> instead of adding an I/O parameter which is pretty much always nasty
> to
> deal with.

On a shadow stack allocation error, we fail the copy_thread(). When
shadow stack is enabled, the app might be able to handle a clone
failure, but would not be able to handle starting a new thread without
getting a new shadow stack.

So in your suggestion I guess we would have two types of failure one
that signifies shadow stack is enabled and the allocation failed, and
another that signifies that shadow stack is not enabled, so zero needs
to be passed into fpu_clone()?

We need the output param in shstk_alloc_thread_stack() because we need
to update the SSP to the new shadow stack. If we want to make the non-
shadow stack case handled differently, I think the extra conditionals
are worse, like:
/* Allocate a new shadow stack for pthread if needed */
ret = shstk_alloc_thread_stack(p, clone_flags, args->stack_size,
				&shstk_addr);
if (ret == -EOPNOTSUPP)
	fpu_clone(p, clone_flags, args->fn, 0);
else if (ret < 0)
	return ret;
else
	fpu_clone(p, clone_flags, args->fn, shstk_addr);

Do you think?

It used to be that shstk_alloc_thread_stack() reached into FPU
internals to do the SSP update itself. Then the ability to do this was
removed. So I came up with an interface for allowing features to modify
XSAVE buffers from outside the FPU code. On further discussion, letting
code outside the FPU have flexible access to the XSAVE buffer could
constrain the FPU code from adding optimizations. So Thomas suggested
to pass the SSP along into FPU code so that the FPU modification could
be all monolithic and flexible.

If the default SSP value logic is too hidden, what about some clearer
code and comments, like this?

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index bf703f53fa49..bd123527fcca 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -142,7 +142,7 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
        struct inactive_task_frame *frame;
        struct fork_frame *fork_frame;
        struct pt_regs *childregs;
-       unsigned long shstk_addr = 0;
+       unsigned long new_ssp;
        int ret = 0;
 
        childregs = task_pt_regs(p);
@@ -177,13 +177,18 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
        frame->flags = X86_EFLAGS_FIXED;
 #endif
 
-       /* Allocate a new shadow stack for pthread if needed */
+       /*
+        * Allocate a new shadow stack for thread if needed. If shadow stack
+        * is disabled, new_ssp will remain 0, and fpu_clone() will know not to
+        * update it.
+        */
+       new_ssp = 0;
        ret = shstk_alloc_thread_stack(p, clone_flags, args->stack_size,
-                                      &shstk_addr);
+                                      &new_ssp);
        if (ret)
                return ret;
 
-       fpu_clone(p, clone_flags, args->fn, shstk_addr);
+       fpu_clone(p, clone_flags, args->fn, new_ssp);
 
        /* Kernel thread ? */
        if (unlikely(p->flags & PF_KTHREAD)) {

Borislav Petkov March 9, 2023, 2:12 p.m. UTC | #5

On Wed, Mar 08, 2023 at 08:03:17PM +0000, Edgecombe, Rick P wrote:

Btw,

pls try to trim your replies as I need to scroll through pages of quoted
text to find the response.

> Sure. Sometimes people tell me to only ifdef out whole functions to
> make it easier to read. I suppose in this case it's not hard to see.

Yeah, the less ifdeffery we have, the better.

> If the default SSP value logic is too hidden, what about some clearer
> code and comments, like this?

The problem with this function is that it needs to return three things:

* success:
 ** 0
 or
 ** shadow stack address
* failure: due to allocation.

How about this below instead? (totally untested ofc):

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index bf703f53fa49..6e323d4e32fc 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -142,7 +142,7 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
 	struct inactive_task_frame *frame;
 	struct fork_frame *fork_frame;
 	struct pt_regs *childregs;
-	unsigned long shstk_addr = 0;
+	unsigned long shstk_addr;
 	int ret = 0;
 
 	childregs = task_pt_regs(p);
@@ -178,10 +178,9 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
 #endif
 
 	/* Allocate a new shadow stack for pthread if needed */
-	ret = shstk_alloc_thread_stack(p, clone_flags, args->stack_size,
-				       &shstk_addr);
-	if (ret)
-		return ret;
+	shstk_addr = shstk_alloc_thread_stack(p, clone_flags, args->stack_size);
+	if (IS_ERR_VALUE(shstk_addr))
+		return PTR_ERR((void *)shstk_addr);
 
 	fpu_clone(p, clone_flags, args->fn, shstk_addr);
 
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index 13c02747386f..b1668b499e9a 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -157,8 +157,8 @@ void reset_thread_features(void)
 	current->thread.features_locked = 0;
 }
 
-int shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long clone_flags,
-			     unsigned long stack_size, unsigned long *shstk_addr)
+unsigned long shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long clone_flags,
+				       unsigned long stack_size)
 {
 	struct thread_shstk *shstk = &tsk->thread.shstk;
 	unsigned long addr, size;
@@ -180,14 +180,12 @@ int shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long clone_flags,
 	size = adjust_shstk_size(stack_size);
 	addr = alloc_shstk(size);
 	if (IS_ERR_VALUE(addr))
-		return PTR_ERR((void *)addr);
+		return addr;
 
 	shstk->base = addr;
 	shstk->size = size;
 
-	*shstk_addr = addr + size;
-
-	return 0;
+	return addr + size;
 }
 
 static unsigned long get_user_shstk_addr(void)

Edgecombe, Rick P March 9, 2023, 4:59 p.m. UTC | #6

On Thu, 2023-03-09 at 15:12 +0100, Borislav Petkov wrote:
> On Wed, Mar 08, 2023 at 08:03:17PM +0000, Edgecombe, Rick P wrote:
> 
> Btw,
> 
> pls try to trim your replies as I need to scroll through pages of quoted
> text to find the response.

Sure sorry.


[...]
> 
> > If the default SSP value logic is too hidden, what about some
> > clearer
> > code and comments, like this?
> 
> The problem with this function is that it needs to return three
> things:
> 
> * success:
>  ** 0
>  or
>  ** shadow stack address
> * failure: due to allocation.
> 
> How about this below instead? (totally untested ofc):

Ah, I see what you were saying now. It looks like it will work to me if
you think it is better stylistically.

Borislav Petkov March 9, 2023, 5:04 p.m. UTC | #7

On Thu, Mar 09, 2023 at 04:59:52PM +0000, Edgecombe, Rick P wrote:
> Ah, I see what you were saying now. It looks like it will work to me if
> you think it is better stylistically.

Yeah, having a function return an error *and* an I/O parameter at the
same time is more complicated and error prone instead of when you have
a single retval and only input parameters.

I'd say.

Edgecombe, Rick P March 9, 2023, 8:29 p.m. UTC | #8

On Thu, 2023-03-09 at 18:04 +0100, Borislav Petkov wrote:
> On Thu, Mar 09, 2023 at 04:59:52PM +0000, Edgecombe, Rick P wrote:
> > Ah, I see what you were saying now. It looks like it will work to
> > me if
> > you think it is better stylistically.
> 
> Yeah, having a function return an error *and* an I/O parameter at the
> same time is more complicated and error prone instead of when you
> have
> a single retval and only input parameters.
> 
> I'd say.

Yea, I agree it's better this way, and at first I just missed your
point. By "if you think it's better", I just meant that if someone told
me to do it the other way I wouldn't die on the hill.
