diff mbox series

[v4,03/36] arm64/gcs: Document the ABI for Guarded Control Stacks

Message ID 20230807-arm64-gcs-v4-3-68cfa37f9069@kernel.org (mailing list archive)
State New
Headers show
Series arm64/gcs: Provide support for GCS in userspace | expand

Commit Message

Mark Brown Aug. 7, 2023, 10 p.m. UTC
Add some documentation of the userspace ABI for Guarded Control Stacks.

Signed-off-by: Mark Brown <broonie@kernel.org>
---
 Documentation/arch/arm64/gcs.rst   | 228 +++++++++++++++++++++++++++++++++++++
 Documentation/arch/arm64/index.rst |   1 +
 2 files changed, 229 insertions(+)

Comments

Catalin Marinas Aug. 9, 2023, 2:24 p.m. UTC | #1
On Mon, Aug 07, 2023 at 11:00:08PM +0100, Mark Brown wrote:
> +2.  Enabling and disabling Guarded Control Stacks
> +-------------------------------------------------
> +
> +* GCS is enabled and disabled for a thread via the PR_SET_SHADOW_STACK_STATUS
> +  prctl(), this takes a single flags argument specifying which GCS features
> +  should be used.
> +
> +* When set PR_SHADOW_STACK_ENABLE flag allocates a Guarded Control Stack for

The 'for' at the end of the line above is not needed.

> +  and enables GCS for the thread, enabling the functionality controlled by
> +  GCSPRE0_EL1.{nTR, RVCHKEN, PCRSEL}.

This should be GCSCRE0_EL1.

> +* When set the PR_SHADOW_STACK_PUSH flag enables the functionality controlled
> +  by GCSCRE0_EL1.PUSHMEn, allowing explicit GCS pushes.
> +
> +* When set the PR_SHADOW_STACK_WRITE flag enables the functionality controlled
> +  by GCSCRE0_EL1.STREn, allowing explicit stores to the Guarded Control Stack.
> +
> +* Any unknown flags will cause PR_SET_SHADOW_STACK_STATUS to return -EINVAL.
> +
> +* PR_LOCK_SHADOW_STACK_STATUS is passed a bitmask of features with the same
> +  values as used for PR_SET_SHADOW_STACK_STATUS.  Any future changes to the
> +  status of the specified GCS mode bits will be rejected.
> +
> +* PR_LOCK_SHADOW_STACK_STATUS allows any bit to be locked, this allows
> +  userspace to prevent changes to any future features.

I presume a new lock prctl() won't allow unlocking but can only extend
the lock. I haven't looked at the patches yet but it may be worth
spelling this out.

> +* PR_SET_SHADOW_STACK_STATUS and PR_LOCK_SHADOW_STACK_STATUS affect only the
> +  thread the called them, any other running threads will be unaffected.

s/the called/that called/

> +* New threads inherit the GCS configuration of the thread that created them.
> +
> +* GCS is disabled on exec().
> +
> +* The current GCS configuration for a thread may be read with the
> +  PR_GET_SHADOW_STACK_STATUS prctl(), this returns the same flags that
> +  are passed to PR_SET_SHADOW_STACK_STATUS.
> +
> +* If GCS is disabled for a thread after having previously been enabled then
> +  the stack will remain allocated for the lifetime of the thread.

Sorry if this has been discussed in other threads. What is the issue
with unmapping/freeing of the shadow stack?

> At present
> +  any attempt to reenable GCS for the thread will be rejected, this may be
> +  revisited in future.

What's the rationale here? Is it that function returns won't work?

> +3.  Allocation of Guarded Control Stacks
> +----------------------------------------
> +
> +* When GCS is enabled for a thread a new Guarded Control Stack will be
> +  allocated for it of size RLIMIT_STACK / 2 or 2 gigabytes, whichever is
> +  smaller.

Is this number based on the fact that a function call would only push
the LR to GCS while standard function prologue pushes at least two
registers?

> +* When GCS is disabled for a thread the Guarded Control Stack initially
> +  allocated for that thread will be freed.  Note carefully that if the
> +  stack has been switched this may not be the stack currently in use by
> +  the thread.

Does this not contradict an earlier statement that the GCS is not freed
for a thread when disabled?

> +4.  Signal handling
> +--------------------
> +
> +* A new signal frame record gcs_context encodes the current GCS mode and
> +  pointer for the interrupted context on signal delivery.  This will always
> +  be present on systems that support GCS.
> +
> +* The record contains a flag field which reports the current GCS configuration
> +  for the interrupted context as PR_GET_SHADOW_STACK_STATUS would.
> +
> +* The signal handler is run with the same GCS configuration as the interrupted
> +  context.
> +
> +* When GCS is enabled for the interrupted thread a signal handling specific
> +  GCS cap token will be written to the GCS, this is an architectural GCS cap
> +  token with bit 63 set.  The GCSPR_EL0 reported in the signal frame will
> +  point to this cap token.

I lost track of the GCS spec versions. Has the valid cap token format
been updated? What I have in my spec (though most likely old) is:

  An entry in the Guarded control stack is defined as a Valid cap entry,
  if bits [63:12] of the value are same as bits [63:12] of the address
  where the entry is stored and bits [11:0] contain a Valid cap token.


The other bits in the code look fine to me so far but I haven't looked
at the code yet.
Mark Brown Aug. 9, 2023, 3:34 p.m. UTC | #2
On Wed, Aug 09, 2023 at 03:24:14PM +0100, Catalin Marinas wrote:
> On Mon, Aug 07, 2023 at 11:00:08PM +0100, Mark Brown wrote:

> > +* When set PR_SHADOW_STACK_ENABLE flag allocates a Guarded Control Stack for
> 
> The 'for' at the end of the line above is not needed.
> 
> > +  and enables GCS for the thread, enabling the functionality controlled by

I find it a little clearer that it's a per thread stack here but sure.

> > +* PR_LOCK_SHADOW_STACK_STATUS allows any bit to be locked, this allows
> > +  userspace to prevent changes to any future features.

> I presume a new lock prctl() won't allow unlocking but can only extend
> the lock. I haven't looked at the patches yet but it may be worth
> spelling this out.

Yes, there is no unlock prctl() - the lock prctl() just lets you set
locks.

> > +* If GCS is disabled for a thread after having previously been enabled then
> > +  the stack will remain allocated for the lifetime of the thread.

> Sorry if this has been discussed in other threads. What is the issue
> with unmapping/freeing of the shadow stack?

If something was in the middle of looking the GCS (eg, you're trying to
disable GCS during signal handling that preempted something that is
logging the call stack) and you pull the GCS storage from underneath it
then things will go badly.

> > At present
> > +  any attempt to reenable GCS for the thread will be rejected, this may be
> > +  revisited in future.

> What's the rationale here? Is it that function returns won't work?

We have to work out where to point GCSPR_EL0 when reenabling - at the
top of the old stack, where it was before, on a new stack?  It was
easier to just not support reenabling than to work out what the sensible
answer is when it's not clear that any real use case exists to inform
what makes sense.  We can add support for that later (you can probe by
starting a thread, disabling and trying to reenable) if someone does
come up with a use case.

The use case for disabling is a non-enforcing mode which logs GCS faults
and then disables GCS.

> > +3.  Allocation of Guarded Control Stacks
> > +----------------------------------------

> > +* When GCS is enabled for a thread a new Guarded Control Stack will be
> > +  allocated for it of size RLIMIT_STACK / 2 or 2 gigabytes, whichever is
> > +  smaller.

> Is this number based on the fact that a function call would only push
> the LR to GCS while standard function prologue pushes at least two
> registers?

It's actually based on bitrot that I'd initially chosen a smaller value
since it's likely that functions will push at least something as you
suggest, the patches now just use RLIMIT_STACK.  I'll fix.

> > +* When GCS is disabled for a thread the Guarded Control Stack initially
> > +  allocated for that thread will be freed.  Note carefully that if the
> > +  stack has been switched this may not be the stack currently in use by
> > +  the thread.

> Does this not contradict an earlier statement that the GCS is not freed
> for a thread when disabled?

Yes, it was meant to say when the thread is freed rather than when GCS
is disabled.

> > +* When GCS is enabled for the interrupted thread a signal handling specific
> > +  GCS cap token will be written to the GCS, this is an architectural GCS cap
> > +  token with bit 63 set.  The GCSPR_EL0 reported in the signal frame will
> > +  point to this cap token.

> I lost track of the GCS spec versions. Has the valid cap token format
> been updated? What I have in my spec (though most likely old) is:

>   An entry in the Guarded control stack is defined as a Valid cap entry,
>   if bits [63:12] of the value are same as bits [63:12] of the address
>   where the entry is stored and bits [11:0] contain a Valid cap token.

You have a draft version of the spec, this was changed and now there is
now a token field reserved in the low 12 bits of the register:

#define GCS_CAP_ADDR_MASK		GENMASK(63, 12)
#define GCS_CAP_ADDR_SHIFT		12
#define GCS_CAP_ADDR_WIDTH		52
#define GCS_CAP_ADDR(x)			FIELD_GET(GCS_CAP_ADDR_MASK, x)

#define GCS_CAP_TOKEN_MASK		GENMASK(11, 0)
#define GCS_CAP_TOKEN_SHIFT		0
#define GCS_CAP_TOKEN_WIDTH		12
#define GCS_CAP_TOKEN(x)		FIELD_GET(GCS_CAP_TOKEN_MASK, x)

#define GCS_CAP_VALID_TOKEN		0x1
#define GCS_CAP_IN_PROGRESS_TOKEN	0x5

#define GCS_CAP(x)	((((unsigned long)x) & GCS_CAP_ADDR_MASK) | \
					       GCS_CAP_VALID_TOKEN)
Szabolcs Nagy Aug. 10, 2023, 8:55 a.m. UTC | #3
The 08/09/2023 16:34, Mark Brown wrote:
> On Wed, Aug 09, 2023 at 03:24:14PM +0100, Catalin Marinas wrote:
> > On Mon, Aug 07, 2023 at 11:00:08PM +0100, Mark Brown wrote:
> > > +* When GCS is enabled for a thread a new Guarded Control Stack will be
> > > +  allocated for it of size RLIMIT_STACK / 2 or 2 gigabytes, whichever is
> > > +  smaller.
> 
> > Is this number based on the fact that a function call would only push
> > the LR to GCS while standard function prologue pushes at least two
> > registers?
> 
> It's actually based on bitrot that I'd initially chosen a smaller value
> since it's likely that functions will push at least something as you
> suggest, the patches now just use RLIMIT_STACK.  I'll fix.

the pcs requires 16byte aligned stack frames, with 8byte per gcs entry
there is no need for same gcs size as stack size in userspace.

you can argue about a fixed size small increment (stacksize/2 + inc)
for signal handling on alt stack and special tokens, but stack size is
overkill i think.

fwiw my current makecontext patch uses roundup(stacksize/2+160).
(threads guaranteed to have about 300bytes of data on the stack in glibc
so if gcs is stacksize/2, that accounts for the increment. this is for
the theoretical case when an empty thread just tries to overflow the
stack and then handle the fault on sigaltstack.)
Mark Brown Aug. 10, 2023, 11:41 a.m. UTC | #4
On Thu, Aug 10, 2023 at 09:55:50AM +0100, Szabolcs Nagy wrote:
> The 08/09/2023 16:34, Mark Brown wrote:

> > It's actually based on bitrot that I'd initially chosen a smaller value
> > since it's likely that functions will push at least something as you
> > suggest, the patches now just use RLIMIT_STACK.  I'll fix.

> the pcs requires 16byte aligned stack frames, with 8byte per gcs entry
> there is no need for same gcs size as stack size in userspace.

I agree that it's going to be excessive for pretty much all
applications, I adjusted it to match x86 as part of the general effort
to avoid divergence and because I was a bit concerned about non-PCS
cases (eg, JITed code) potentially running into trouble, especially with
smaller stack limits.  It's not an issue I have super strong opinions on
though, as you can see I had implemented it both ways at various times.
Szabolcs Nagy Aug. 10, 2023, 1:34 p.m. UTC | #5
The 08/10/2023 12:41, Mark Brown wrote:
> On Thu, Aug 10, 2023 at 09:55:50AM +0100, Szabolcs Nagy wrote:
> > The 08/09/2023 16:34, Mark Brown wrote:
> 
> > > It's actually based on bitrot that I'd initially chosen a smaller value
> > > since it's likely that functions will push at least something as you
> > > suggest, the patches now just use RLIMIT_STACK.  I'll fix.
> 
> > the pcs requires 16byte aligned stack frames, with 8byte per gcs entry
> > there is no need for same gcs size as stack size in userspace.
> 
> I agree that it's going to be excessive for pretty much all
> applications, I adjusted it to match x86 as part of the general effort
> to avoid divergence and because I was a bit concerned about non-PCS
> cases (eg, JITed code) potentially running into trouble, especially with

is that even possible?

16byte alignment is not a convention but architectural:
access via unaligned sp traps (at least in userspace).

it is possible to use bl such that the stack is not involved
e.g. if there is no bl/ret pairing, but if we base the gcs
size on the stack size then i'd expect one stack frame per
bl/ret pair with 16byte alignment, or is there a programming
model possible that uses 8byte stack per bl?

> smaller stack limits.  It's not an issue I have super strong opinions on
> though, as you can see I had implemented it both ways at various times.
Mark Brown Aug. 10, 2023, 4:30 p.m. UTC | #6
On Thu, Aug 10, 2023 at 02:34:04PM +0100, Szabolcs Nagy wrote:
> The 08/10/2023 12:41, Mark Brown wrote:

> > I agree that it's going to be excessive for pretty much all
> > applications, I adjusted it to match x86 as part of the general effort
> > to avoid divergence and because I was a bit concerned about non-PCS
> > cases (eg, JITed code) potentially running into trouble, especially with

> is that even possible?

> 16byte alignment is not a convention but architectural:
> access via unaligned sp traps (at least in userspace).

> it is possible to use bl such that the stack is not involved
> e.g. if there is no bl/ret pairing, but if we base the gcs
> size on the stack size then i'd expect one stack frame per
> bl/ret pair with 16byte alignment, or is there a programming
> model possible that uses 8byte stack per bl?

That's definitely what I'd expect most of the time.  You'd need to be
tracking what needs pushing in some other register but it's possible.
Quite why you'd do this is a separate question, I think I'm being overly
cautious worrying about anyone actually having done it but it wouldn't
be the first time I was surprised by someone doing something unexpected.
Like I say I think it's excessive and was erring on the side of being
conservative.
Catalin Marinas Aug. 18, 2023, 5:29 p.m. UTC | #7
On Wed, Aug 09, 2023 at 04:34:38PM +0100, Mark Brown wrote:
> On Wed, Aug 09, 2023 at 03:24:14PM +0100, Catalin Marinas wrote:
> > On Mon, Aug 07, 2023 at 11:00:08PM +0100, Mark Brown wrote:
> > > +* When set PR_SHADOW_STACK_ENABLE flag allocates a Guarded Control Stack for
> > 
> > The 'for' at the end of the line above is not needed.
> > 
> > > +  and enables GCS for the thread, enabling the functionality controlled by
> 
> I find it a little clearer that it's a per thread stack here but sure.

If it reads better for you, feel free to keep it as is.

> > > +3.  Allocation of Guarded Control Stacks
> > > +----------------------------------------
> 
> > > +* When GCS is enabled for a thread a new Guarded Control Stack will be
> > > +  allocated for it of size RLIMIT_STACK / 2 or 2 gigabytes, whichever is
> > > +  smaller.
> 
> > Is this number based on the fact that a function call would only push
> > the LR to GCS while standard function prologue pushes at least two
> > registers?
> 
> It's actually based on bitrot that I'd initially chosen a smaller value
> since it's likely that functions will push at least something as you
> suggest, the patches now just use RLIMIT_STACK.  I'll fix.

A related question - it may have been discussed intensively on the x86
thread (I may read it sometime) - why not have the libc map the shadow
stack and pass the pointer/size to clone3()? It saves us from having to
guess what the right size we'd need. struct clone_args is extensible.

(I plan to get back next week to this series, I'll need to read a bit
more on the spec)
Mark Brown Aug. 18, 2023, 7:38 p.m. UTC | #8
On Fri, Aug 18, 2023 at 06:29:54PM +0100, Catalin Marinas wrote:

> A related question - it may have been discussed intensively on the x86
> thread (I may read it sometime) - why not have the libc map the shadow

Your assumption that this is a single thread feels optimistic there.

> stack and pass the pointer/size to clone3()? It saves us from having to
> guess what the right size we'd need. struct clone_args is extensible.

I can't recall or locate the specific reasoning there right now, perhaps
Rick or someone else can?  I'd guess there would be compat concerns for
things that don't go via libc which would complicate the story with
identifying and marking things as GCS/SS safe, it's going to be more
robust to just supply a GCS if the process is using it.  That said
having a default doesn't preclude us using the extensibility to allow
userspace directly to control the GCS size, I would certainly be in
favour of adding support for that.

> (I plan to get back next week to this series, I'll need to read a bit
> more on the spec)

I've been making changes, mostly in response to your feedback, so there
should be a new version on Monday even if not everything is addressed
yet.
Catalin Marinas Aug. 22, 2023, 4:49 p.m. UTC | #9
On Fri, Aug 18, 2023 at 08:38:02PM +0100, Mark Brown wrote:
> On Fri, Aug 18, 2023 at 06:29:54PM +0100, Catalin Marinas wrote:
> > A related question - it may have been discussed intensively on the x86
> > thread (I may read it sometime) - why not have the libc map the shadow
> 
> Your assumption that this is a single thread feels optimistic there.

Yeah and I unfortunately ignored all of them.

> > stack and pass the pointer/size to clone3()? It saves us from having to
> > guess what the right size we'd need. struct clone_args is extensible.
> 
> I can't recall or locate the specific reasoning there right now, perhaps
> Rick or someone else can?  I'd guess there would be compat concerns for
> things that don't go via libc which would complicate the story with
> identifying and marking things as GCS/SS safe, it's going to be more
> robust to just supply a GCS if the process is using it.  That said
> having a default doesn't preclude us using the extensibility to allow
> userspace directly to control the GCS size, I would certainly be in
> favour of adding support for that.

It would be good if someone provided a summary of the x86 decision (I'll
get to those thread but most likely in September). I think we concluded
that we can't deploy GCS entirely transparently, so we need a libc
change (apart from the ELF annotations). Since libc is opting in to GCS,
we could also update the pthread_create() etc. to allocate the shadow
together with the standard stack.

Anyway, that's my preference but maybe there were good reasons not to do
this.
Mark Brown Aug. 22, 2023, 5:53 p.m. UTC | #10
On Tue, Aug 22, 2023 at 05:49:51PM +0100, Catalin Marinas wrote:
> On Fri, Aug 18, 2023 at 08:38:02PM +0100, Mark Brown wrote:

> > > stack and pass the pointer/size to clone3()? It saves us from having to
> > > guess what the right size we'd need. struct clone_args is extensible.

> > I can't recall or locate the specific reasoning there right now, perhaps
> > Rick or someone else can?  I'd guess there would be compat concerns for
> > things that don't go via libc which would complicate the story with
> > identifying and marking things as GCS/SS safe, it's going to be more
> > robust to just supply a GCS if the process is using it.  That said
> > having a default doesn't preclude us using the extensibility to allow
> > userspace directly to control the GCS size, I would certainly be in
> > favour of adding support for that.

> It would be good if someone provided a summary of the x86 decision (I'll
> get to those thread but most likely in September). I think we concluded
> that we can't deploy GCS entirely transparently, so we need a libc
> change (apart from the ELF annotations). Since libc is opting in to GCS,

Right, we need changes for setjmp()/longjmp() for example.

> we could also update the pthread_create() etc. to allocate the shadow
> together with the standard stack.

> Anyway, that's my preference but maybe there were good reasons not to do
> this.

Yeah, it'd be good to understand.  I've been through quite a lot of old
versions of the x86 series (I've not found them all, there's 30 versions
or something of the old series plus the current one is on v9) and the
code always appears to have been this way with changelogs that explain
the what but not the why.  For example roughly the current behaviour was
already in place in v10 of the original series:

   https://lore.kernel.org/lkml/20200429220732.31602-26-yu-cheng.yu@intel.com/

I do worry about the story for users calling the underlying clone3() API
(or legacy clone() for that matter) directly, and we would also need to
handle the initial GCS enable via prctl() - that's not insurmountable,
we could add a size argument there that only gets interpreted during the
initial enable for example.

My sense is that they deployment story is going to be smoother with
defaults being provided since it avoids dealing with the issue of what
to do if userspace creates a thread without a GCS in a GCS enabled
process but like I say I'd be totally happy to extend clone3().  I will
put some patches together for that (probably once the x86 stuff lands).
Given the size of this series it might be better split out for
manageability if nothing else.
Szabolcs Nagy Aug. 23, 2023, 10:09 a.m. UTC | #11
The 08/22/2023 18:53, Mark Brown wrote:
> On Tue, Aug 22, 2023 at 05:49:51PM +0100, Catalin Marinas wrote:
> > On Fri, Aug 18, 2023 at 08:38:02PM +0100, Mark Brown wrote:
> 
> > > > stack and pass the pointer/size to clone3()? It saves us from having to
> > > > guess what the right size we'd need. struct clone_args is extensible.
> 
> > > I can't recall or locate the specific reasoning there right now, perhaps
> > > Rick or someone else can?  I'd guess there would be compat concerns for
> > > things that don't go via libc which would complicate the story with
> > > identifying and marking things as GCS/SS safe, it's going to be more
> > > robust to just supply a GCS if the process is using it.  That said
> > > having a default doesn't preclude us using the extensibility to allow
> > > userspace directly to control the GCS size, I would certainly be in
> > > favour of adding support for that.
> 
> > It would be good if someone provided a summary of the x86 decision (I'll
> > get to those thread but most likely in September). I think we concluded
> > that we can't deploy GCS entirely transparently, so we need a libc
> > change (apart from the ELF annotations). Since libc is opting in to GCS,
> 
> Right, we need changes for setjmp()/longjmp() for example.
> 
> > we could also update the pthread_create() etc. to allocate the shadow
> > together with the standard stack.
> 
> > Anyway, that's my preference but maybe there were good reasons not to do
> > this.
> 
> Yeah, it'd be good to understand.  I've been through quite a lot of old
> versions of the x86 series (I've not found them all, there's 30 versions
> or something of the old series plus the current one is on v9) and the
> code always appears to have been this way with changelogs that explain
> the what but not the why.  For example roughly the current behaviour was
> already in place in v10 of the original series:
> 
>    https://lore.kernel.org/lkml/20200429220732.31602-26-yu-cheng.yu@intel.com/

well the original shstk patches predate clone3 so no surprise there.
e.g. v6 is from 2018 and clone3 is 2019 linux 5.3
https://lore.kernel.org/lkml/20181119214809.6086-1-yu-cheng.yu@intel.com/

> 
> I do worry about the story for users calling the underlying clone3() API
> (or legacy clone() for that matter) directly, and we would also need to
> handle the initial GCS enable via prctl() - that's not insurmountable,
> we could add a size argument there that only gets interpreted during the
> initial enable for example.

musl and bionic currently use plain clone for threads.

and there is user code doing raw clone threads (such threads are
technically not allowed to call into libc) it's not immediately
clear to me if having gcs in those threads is better or worse.

glibc can use clone3 args for gcs, i'd expect the unmap to be more
annoying than the allocation, but possible (it is certainly more
work than leaving everything to the kernel).

one difference is that userspace can then set gcspr of a new thread
and e.g. two threads can have overlapping gcs, however i don't think
this impacts security much since if clone3 is attacker controlled
then likely all bets are off.

and yes the main thread gcs can also be libc allocated given we
have to deal with the prctl anyway.

if gcs size logic is in libc it can depend on env vars and can be
changed more easily (and adapted to android vs musl vs glibc
requirements).

sigaltstack with alt gcs was a case where i thought the kernel
doing it transparently is better (the libc cannot do the same
as it cannot wrap signal handlers currently so does not know
when a handler returns or the current alt stack state), but
others seems to want an explicit sigaltgcs syscall and expose
it to users. in any case we have no unwinder solution for alt
gcs nor longjmp solution when the thread gcs is overflowed so
this is not an issue for now.

> My sense is that they deployment story is going to be smoother with
> defaults being provided since it avoids dealing with the issue of what
> to do if userspace creates a thread without a GCS in a GCS enabled
> process but like I say I'd be totally happy to extend clone3().  I will
> put some patches together for that (probably once the x86 stuff lands).
> Given the size of this series it might be better split out for
> manageability if nothing else.

i would make thread without gcs to implicitly disable gcs, since
that's what's bw compat with clones outside of libc (the libc can
guarantee gcs allocation when gcs is enabled).
Mark Brown Aug. 23, 2023, 12:51 p.m. UTC | #12
On Wed, Aug 23, 2023 at 11:09:59AM +0100, Szabolcs Nagy wrote:
> The 08/22/2023 18:53, Mark Brown wrote:
> > On Tue, Aug 22, 2023 at 05:49:51PM +0100, Catalin Marinas wrote:

> > the what but not the why.  For example roughly the current behaviour was
> > already in place in v10 of the original series:

> well the original shstk patches predate clone3 so no surprise there.
> e.g. v6 is from 2018 and clone3 is 2019 linux 5.3
> https://lore.kernel.org/lkml/20181119214809.6086-1-yu-cheng.yu@intel.com/

Ah, that'd do it.  People weren't excited enough on about clone3() when
reviewing the shadow stack patches, I hadn't realised clone3() was so
new.

> > I do worry about the story for users calling the underlying clone3() API
> > (or legacy clone() for that matter) directly, and we would also need to
> > handle the initial GCS enable via prctl() - that's not insurmountable,
> > we could add a size argument there that only gets interpreted during the
> > initial enable for example.

> musl and bionic currently use plain clone for threads.

> and there is user code doing raw clone threads (such threads are
> technically not allowed to call into libc) it's not immediately
> clear to me if having gcs in those threads is better or worse.

Right, that's my big worry - I hadn't realised it was extending as far
as musl/bionic.

> one difference is that userspace can then set gcspr of a new thread
> and e.g. two threads can have overlapping gcs, however i don't think
> this impacts security much since if clone3 is attacker controlled
> then likely all bets are off.

Yeah, I think that's a "you broke it, you get all the pieces" thing.  

> > My sense is that they deployment story is going to be smoother with
> > defaults being provided since it avoids dealing with the issue of what
> > to do if userspace creates a thread without a GCS in a GCS enabled
> > process but like I say I'd be totally happy to extend clone3().  I will
> > put some patches together for that (probably once the x86 stuff lands).
> > Given the size of this series it might be better split out for
> > manageability if nothing else.

> i would make thread without gcs to implicitly disable gcs, since
> that's what's bw compat with clones outside of libc (the libc can
> guarantee gcs allocation when gcs is enabled).

That'd create a pretty substantial divergence with the x86 patches if
they land this time around, there's not enough time to rework them now -
I suppose it'd mainly bite people implementing libc type stuff but
still, doesn't feel great.
Catalin Marinas Aug. 23, 2023, 1:11 p.m. UTC | #13
On Wed, Aug 23, 2023 at 11:09:59AM +0100, Szabolcs Nagy wrote:
> The 08/22/2023 18:53, Mark Brown wrote:
> > On Tue, Aug 22, 2023 at 05:49:51PM +0100, Catalin Marinas wrote:
> > > It would be good if someone provided a summary of the x86 decision (I'll
> > > get to those thread but most likely in September). I think we concluded
> > > that we can't deploy GCS entirely transparently, so we need a libc
> > > change (apart from the ELF annotations). Since libc is opting in to GCS,
> > 
> > Right, we need changes for setjmp()/longjmp() for example.
> > 
> > > we could also update the pthread_create() etc. to allocate the shadow
> > > together with the standard stack.
> > > 
> > > Anyway, that's my preference but maybe there were good reasons not to do
> > > this.
> > 
> > Yeah, it'd be good to understand.  I've been through quite a lot of old
> > versions of the x86 series (I've not found them all, there's 30 versions
> > or something of the old series plus the current one is on v9) and the
> > code always appears to have been this way with changelogs that explain
> > the what but not the why.  For example roughly the current behaviour was
> > already in place in v10 of the original series:
> > 
> >    https://lore.kernel.org/lkml/20200429220732.31602-26-yu-cheng.yu@intel.com/
> 
> well the original shstk patches predate clone3 so no surprise there.
> e.g. v6 is from 2018 and clone3 is 2019 linux 5.3
> https://lore.kernel.org/lkml/20181119214809.6086-1-yu-cheng.yu@intel.com/

Good point, I had not realised that.

> > I do worry about the story for users calling the underlying clone3() API
> > (or legacy clone() for that matter) directly, and we would also need to
> > handle the initial GCS enable via prctl() - that's not insurmountable,
> > we could add a size argument there that only gets interpreted during the
> > initial enable for example.
> 
> musl and bionic currently use plain clone for threads.
> 
> and there is user code doing raw clone threads (such threads are
> technically not allowed to call into libc) it's not immediately
> clear to me if having gcs in those threads is better or worse.
> 
> glibc can use clone3 args for gcs, i'd expect the unmap to be more
> annoying than the allocation, but possible (it is certainly more
> work than leaving everything to the kernel).

Unmapping is indeed more complex but I guess something similar needs to
happen for the thread stack to be reclaimed.

The thing I dislike about the kernel automatically mapping it is the
arbitrary fraction of RLIMIT_STACK size. glibc may use RLIMIT_STACK as a
hint for the thread stack size but is this the case for other libraries?
Some quick search (which I may have misinterpreted) shows that musl uses
128KB, bionic 1MB. So at this point the shadow stack size has no
relevance for the actual thread stack.

An alternative would be for the clone3() to provide an address _hint_
and size for GCS and it would still be the kernel doing the mmap (and
munmap on clearing). But at least the user has some control over the
placement of the GCS and its size (and maybe providing the address has
MAP_FIXED semantics).

> > My sense is that they deployment story is going to be smoother with
> > defaults being provided since it avoids dealing with the issue of what
> > to do if userspace creates a thread without a GCS in a GCS enabled
> > process but like I say I'd be totally happy to extend clone3().  I will
> > put some patches together for that (probably once the x86 stuff lands).
> > Given the size of this series it might be better split out for
> > manageability if nothing else.
> 
> i would make thread without gcs to implicitly disable gcs, since
> that's what's bw compat with clones outside of libc (the libc can
> guarantee gcs allocation when gcs is enabled).

Yes, this should work. Any invocation of clone() or clone3() without a
shadow stack would disable GCS. What about the reverse, should GCS be
enabled for a thread even if the clone3() caller has GCS disabled? I
guess we shouldn't since GCS enabling depends on the prctl() state set
previously.
Mark Brown Aug. 23, 2023, 3:50 p.m. UTC | #14
On Wed, Aug 23, 2023 at 02:11:07PM +0100, Catalin Marinas wrote:

> Yes, this should work. Any invocation of clone() or clone3() without a
> shadow stack would disable GCS. What about the reverse, should GCS be
> enabled for a thread even if the clone3() caller has GCS disabled? I
> guess we shouldn't since GCS enabling depends on the prctl() state set
> previously.

It has a fairly obvious intended meaning so we could do it easily enough
but OTOH allowing it opens up the idea of people wanting to specify GCS
flags which starts to seem like more trouble than it's worth compared to
just having them do the prctl() in the new thread.
Catalin Marinas Aug. 23, 2023, 4:45 p.m. UTC | #15
On Wed, Aug 23, 2023 at 01:51:35PM +0100, Mark Brown wrote:
> On Wed, Aug 23, 2023 at 11:09:59AM +0100, Szabolcs Nagy wrote:
> > The 08/22/2023 18:53, Mark Brown wrote:
> > > My sense is that they deployment story is going to be smoother with
> > > defaults being provided since it avoids dealing with the issue of what
> > > to do if userspace creates a thread without a GCS in a GCS enabled
> > > process but like I say I'd be totally happy to extend clone3().  I will
> > > put some patches together for that (probably once the x86 stuff lands).
> > > Given the size of this series it might be better split out for
> > > manageability if nothing else.
> 
> > i would make thread without gcs to implicitly disable gcs, since
> > that's what's bw compat with clones outside of libc (the libc can
> > guarantee gcs allocation when gcs is enabled).
> 
> That'd create a pretty substantial divergence with the x86 patches if
> they land this time around, there's not enough time to rework them now -
> I suppose it'd mainly bite people implementing libc type stuff but
> still, doesn't feel great.

I don't mind the divergence in this area if the libc folks are ok with
it. x86 can eventually use the clone3() interface if they want more
flexibility, they'll just have to continue supporting the old one. I
think we already diverge around the prctl().
Mark Brown Aug. 23, 2023, 5:18 p.m. UTC | #16
On Wed, Aug 23, 2023 at 05:45:11PM +0100, Catalin Marinas wrote:

> I don't mind the divergence in this area if the libc folks are ok with
> it. x86 can eventually use the clone3() interface if they want more
> flexibility, they'll just have to continue supporting the old one. I
> think we already diverge around the prctl().

Yes, though that's basically the same thing - the difference is
basically just that x86 uses their custom arch_prctl() thing and we use
a regular prctl(), there's nothing conceptual going on like there is
here.  I don't really mind either, it's just something where I'd
anticipate pushback.
Szabolcs Nagy Aug. 23, 2023, 5:40 p.m. UTC | #17
The 08/23/2023 17:45, Catalin Marinas wrote:
> On Wed, Aug 23, 2023 at 01:51:35PM +0100, Mark Brown wrote:
> > On Wed, Aug 23, 2023 at 11:09:59AM +0100, Szabolcs Nagy wrote:
> > > The 08/22/2023 18:53, Mark Brown wrote:
> > > > My sense is that they deployment story is going to be smoother with
> > > > defaults being provided since it avoids dealing with the issue of what
> > > > to do if userspace creates a thread without a GCS in a GCS enabled
> > > > process but like I say I'd be totally happy to extend clone3().  I will
> > > > put some patches together for that (probably once the x86 stuff lands).
> > > > Given the size of this series it might be better split out for
> > > > manageability if nothing else.
> > 
> > > i would make thread without gcs to implicitly disable gcs, since
> > > that's what's bw compat with clones outside of libc (the libc can
> > > guarantee gcs allocation when gcs is enabled).
> > 
> > That'd create a pretty substantial divergence with the x86 patches if
> > they land this time around, there's not enough time to rework them now -
> > I suppose it'd mainly bite people implementing libc type stuff but
> > still, doesn't feel great.
> 
> I don't mind the divergence in this area if the libc folks are ok with
> it. x86 can eventually use the clone3() interface if they want more
> flexibility, they'll just have to continue supporting the old one. I
> think we already diverge around the prctl().

i will have to prototype it, but in principle i'm ok with moving gcs
allocation to userspace and passing it as argument to clone3. i will
have to think if x86 divergence could cause issues.

to maximize compat with existing raw clone users gcs either has to
be disabled implicitly or allocated by the kernel. if we move gcs
management to userspace then disable sounds better to me.
(except vfork/fork does not have to disable etc.)

to support gcs, a libc would have to use clone3 or enable gcs in the
clone start code.

i don't know if we can allow disabled gcs thread creation with locked
gcs state. (i can see arguments both ways, so further prctl flag may
be needed which may be another divergence from x86)

i wonder if we can allow MAP_FIXED as well as MAP_FIXED_NOREPLACE
semantics for map_shadow_stack (MAP_FIXED makes sense if userspace
allocates thread stack + tls + gcs + guard pages with one PROT_NONE
mapping and then mprotects / map_shadow_stack on top of that) i.e.
if userspace manages the gcs it may need more flexibility here.
(for now i think separate gcs mapping works for me.)
Mark Brown Aug. 23, 2023, 6:16 p.m. UTC | #18
On Wed, Aug 23, 2023 at 06:40:40PM +0100, Szabolcs Nagy wrote:

> i don't know if we can allow disabled gcs thread creation with locked
> gcs state. (i can see arguments both ways, so further prctl flag may
> be needed which may be another divergence from x86)

I think that if we do add a new flag that'd just be new functionality,
the divergence would be in allowing configuration via clone3() rather
than the flag.  TBH I'm not sure I see a use case for locking but
providing a mechanism for getting out of the lock, that seems very
questionable.
Catalin Marinas Aug. 24, 2023, 3:43 p.m. UTC | #19
On Wed, Aug 23, 2023 at 07:16:52PM +0100, Mark Brown wrote:
> On Wed, Aug 23, 2023 at 06:40:40PM +0100, Szabolcs Nagy wrote:
> > i don't know if we can allow disabled gcs thread creation with locked
> > gcs state. (i can see arguments both ways, so further prctl flag may
> > be needed which may be another divergence from x86)
> 
> I think that if we do add a new flag that'd just be new functionality,
> the divergence would be in allowing configuration via clone3() rather
> than the flag.  TBH I'm not sure I see a use case for locking but
> providing a mechanism for getting out of the lock, that seems very
> questionable.

You are right, once the configuration is locked a plain clone() or
clone3() without a GCS pointer should be rejected.

Is there a use-case for the unlocked configuration to allow disabling
the GCS implicitly via a clone syscall?

If we go for extending clone3, I wonder whether we should also introduce
a sigaltstack2/3 ;). I haven't checked what the current patches do and
won't have time until early September (on holiday from the end of today).
Mark Brown Aug. 24, 2023, 5:38 p.m. UTC | #20
On Thu, Aug 24, 2023 at 04:43:19PM +0100, Catalin Marinas wrote:

> If we go for extending clone3, I wonder whether we should also introduce
> a sigaltstack2/3 ;). I haven't checked what the current patches do and
> won't have time until early September (on holiday from the end of today).

The current patches (and the x86 stuff that's in -next) punt on
sigaltstack for now, the discussions around that are unresolved.  My
hope is that whatever we come up with there can be cross platform.
Szabolcs Nagy Aug. 30, 2023, 12:37 p.m. UTC | #21
The 08/24/2023 16:43, Catalin Marinas wrote:
> Is there a use-case for the unlocked configuration to allow disabling
> the GCS implicitly via a clone syscall?

how would you handle clone or clone3 without gcs specified?
(in the cases when clone creates a new thread with new stack)

(1) fail.
(2) allocate gcs.
(3) disable gcs.

the problem with (1) is that it requires changes to user code
(this only affects code outside the libc doing something special
since raw clone thread cannot call into the libc, all executing
code have to be tightly controlled. i don't know how common this
is, but i at least expect it in test code for system level tools
like debuggers, strace, valgrind, qemu-system, seccomp filters
etc. if it appears in actual use then that's a deployment issue
for distros: note that changing clone to clone3 is non-trivial:
it requires fallback logic and may have to deal with seccomp
filters).

problem with (2) is that the size policy and lifetime management
is in the kernel then. (since only special cases are affected i
guess that is ok, but i assumed we want to avoid this by moving
to clone3 and user managed gcs).

the problem with (3) is escaping the security measure, however
it only applies to very special threads that can always decide
to opt-in to gcs, so i don't see this as such a bad option and
at least bw compat with existing code. (in my threat model the
attacker cannot hijack clone syscalls as that seems stronger
than hijacking return addresses.)

so i guess the answer depends on how much headache failing
raw clone may cause and i don't know that.
Mark Brown Aug. 30, 2023, 4:42 p.m. UTC | #22
On Wed, Aug 30, 2023 at 01:37:33PM +0100, Szabolcs Nagy wrote:
> The 08/24/2023 16:43, Catalin Marinas wrote:

> > Is there a use-case for the unlocked configuration to allow disabling
> > the GCS implicitly via a clone syscall?

> how would you handle clone or clone3 without gcs specified?
> (in the cases when clone creates a new thread with new stack)

> (1) fail.
> (2) allocate gcs.
> (3) disable gcs.

...

> problem with (2) is that the size policy and lifetime management
> is in the kernel then. (since only special cases are affected i
> guess that is ok, but i assumed we want to avoid this by moving
> to clone3 and user managed gcs).

Right, it seems like if we go with this then we may as well just allow
plain clone() too.

> the problem with (3) is escaping the security measure, however
> it only applies to very special threads that can always decide
> to opt-in to gcs, so i don't see this as such a bad option and
> at least bw compat with existing code. (in my threat model the
> attacker cannot hijack clone syscalls as that seems stronger
> than hijacking return addresses.)

It doesn't seem great to have a feature which is to a large extent a
security feature where we provide a fairly straightforward mechanism for
disabling the feature and actively expect things to be using it.

Given the timescales until this gets practically deployed on arm64 I
would be inclined to go with making things fail and forcing updates in
the users, though obviously that's less helpful for x86 where the
hardware is in user hands already so it's more of a pressing issue (and
there's already what is effectively option 2 in the code).  We could
have the architectures diverge, as you say the effect is likely to be
mainly in very low level code rather than general software.
Szabolcs Nagy Sept. 28, 2023, 4:59 p.m. UTC | #23
The 08/23/2023 14:11, Catalin Marinas wrote:
> On Wed, Aug 23, 2023 at 11:09:59AM +0100, Szabolcs Nagy wrote:
> > The 08/22/2023 18:53, Mark Brown wrote:
> > > I do worry about the story for users calling the underlying clone3() API
> > > (or legacy clone() for that matter) directly, and we would also need to
> > > handle the initial GCS enable via prctl() - that's not insurmountable,
> > > we could add a size argument there that only gets interpreted during the
> > > initial enable for example.
...
> > and there is user code doing raw clone threads (such threads are
> > technically not allowed to call into libc) it's not immediately
> > clear to me if having gcs in those threads is better or worse.

i think raw clone / clone3 users may be relevant so we need a
solution such that they don't fail when gcs args are missing.

userspace allocated gcs works for me, but maybe the alternative
with size only is more consistent (thread gcs is kernel mapped
with fallback size logic if gcs size is missing):

> An alternative would be for the clone3() to provide an address _hint_
> and size for GCS and it would still be the kernel doing the mmap (and
> munmap on clearing). But at least the user has some control over the
> placement of the GCS and its size (and maybe providing the address has
> MAP_FIXED semantics).

the main thread gcs is still special: the size is provided
via prctl (if at all).
Mark Brown Oct. 2, 2023, 7:49 p.m. UTC | #24
On Thu, Sep 28, 2023 at 05:59:25PM +0100, Szabolcs Nagy wrote:
> The 08/23/2023 14:11, Catalin Marinas wrote:

> > > and there is user code doing raw clone threads (such threads are
> > > technically not allowed to call into libc) it's not immediately
> > > clear to me if having gcs in those threads is better or worse.

> i think raw clone / clone3 users may be relevant so we need a
> solution such that they don't fail when gcs args are missing.

Are we sure about that?  Old binaries shouldn't be affected since they
won't turn GCS so we're just talking about new binaries here - are there
really so many of them that we won't be able to get them all converted
over to clone3() and GCS in the timescales we're talking about for GCS
deployment?  I obviously don't particularly mind having the default size
logic but if we allow clone() then that's keeping the existing behaviour
and layering allocation via clone3() on top of it which Catalin didn't
want.  Catalin?

> userspace allocated gcs works for me, but maybe the alternative
> with size only is more consistent (thread gcs is kernel mapped
> with fallback size logic if gcs size is missing):

If we have size only then the handling of GCS and normal stack in struct
clone_args would be inconsistent.  Given that it seems better to have
the field present, we can allow it to be NULL and do the allocation with
the specified size but it should be there.

> > An alternative would be for the clone3() to provide an address _hint_
> > and size for GCS and it would still be the kernel doing the mmap (and
> > munmap on clearing). But at least the user has some control over the
> > placement of the GCS and its size (and maybe providing the address has
> > MAP_FIXED semantics).

> the main thread gcs is still special: the size is provided
> via prctl (if at all).

Either that or we have it do a map_shadow_stack() but that's an extra
syscall during startup.
Edgecombe, Rick P Oct. 2, 2023, 9:43 p.m. UTC | #25
On Mon, 2023-10-02 at 20:49 +0100, Mark Brown wrote:
> On Thu, Sep 28, 2023 at 05:59:25PM +0100, Szabolcs Nagy wrote:
> > The 08/23/2023 14:11, Catalin Marinas wrote:
> 
> > > > and there is user code doing raw clone threads (such threads
> > > > are
> > > > technically not allowed to call into libc) it's not immediately
> > > > clear to me if having gcs in those threads is better or worse.
> 
> > i think raw clone / clone3 users may be relevant so we need a
> > solution such that they don't fail when gcs args are missing.
> 
> Are we sure about that?  Old binaries shouldn't be affected since
> they
> won't turn GCS so we're just talking about new binaries here - are
> there
> really so many of them that we won't be able to get them all
> converted
> over to clone3() and GCS in the timescales we're talking about for
> GCS
> deployment?  I obviously don't particularly mind having the default
> size
> logic but if we allow clone() then that's keeping the existing
> behaviour
> and layering allocation via clone3() on top of it which Catalin
> didn't
> want. 

On the x86 side, there have been a lot of binaries generated that have
been blindly marked as supporting shadow stack. So even if they are not
old, there may still be mismarked ones using clone().

In general there are a lot of tradeoffs with shadow stack between
security, compatibility and performance. Szabolcs had previously
discussed some ideas around all three:
 - compatibility (automatic sigaltstack())
 - performance/utility (creating a new libc API for makecontext())
 - security (token schemes to guarantee only one user of a shadow 
             stack at a time)

On the x86 kernel side, we had our hands tied a bit by the existing
userspace, and kind of ended up with a mix. I imagined that we might
see demand along one of those axes after some real world use. At which
point we could add some opt-in ABI tweaks to help.

If ARM is thinking of doing things differently than x86, you might
think about how you weight those tradeoffs. Like, it might be silly to
worry about clone() support if something else ends up breaking
compatibility majorly. But, it might be worthwhile it you end up going
to the proposed extremes around signal alt stacks, to maximize
compatibility

Also then maybe x86 could copy the ARM ABI some day, if it ends up
chasing the tradeoff people prefer. It probably goes without saying
that the closer these features behave from the app developer
perspective, the better. So a different ABI than x86 that also targets
a mix would be a bit unfortunate. (not the end of the world though)

Anyway, just wanted to share some perspective there. Sorry for joining
this thread so late.
Szabolcs Nagy Oct. 3, 2023, 8:45 a.m. UTC | #26
The 10/02/2023 20:49, Mark Brown wrote:
> On Thu, Sep 28, 2023 at 05:59:25PM +0100, Szabolcs Nagy wrote:
> > The 08/23/2023 14:11, Catalin Marinas wrote:
> 
> > > > and there is user code doing raw clone threads (such threads are
> > > > technically not allowed to call into libc) it's not immediately
> > > > clear to me if having gcs in those threads is better or worse.
> 
> > i think raw clone / clone3 users may be relevant so we need a
> > solution such that they don't fail when gcs args are missing.
> 
> Are we sure about that?  Old binaries shouldn't be affected since they
> won't turn GCS so we're just talking about new binaries here - are there
> really so many of them that we won't be able to get them all converted
> over to clone3() and GCS in the timescales we're talking about for GCS
> deployment?  I obviously don't particularly mind having the default size
> logic but if we allow clone() then that's keeping the existing behaviour
> and layering allocation via clone3() on top of it which Catalin didn't
> want.  Catalin?

clone3 seems to have features that are only available in clone3 and
not exposed (reasonably) in libc apis so ppl will use clone3 directly
and those will be hard to fix for gcs (you have to convince upstream
to add future arm64 arch specific changes that they cannot test).
where this analysis might be wrong is that raw clone3 is more likely
used as fork/vfork without a new stack and thus no gcs issue.

even if we have time to fix code, we don't want too many ifdef hacks
just for gcs so it matters how many projects are affected.

> > userspace allocated gcs works for me, but maybe the alternative
> > with size only is more consistent (thread gcs is kernel mapped
> > with fallback size logic if gcs size is missing):
> 
> If we have size only then the handling of GCS and normal stack in struct
> clone_args would be inconsistent.  Given that it seems better to have
> the field present, we can allow it to be NULL and do the allocation with
> the specified size but it should be there.

i see, then try the original plan.

> > the main thread gcs is still special: the size is provided
> > via prctl (if at all).
> 
> Either that or we have it do a map_shadow_stack() but that's an extra
> syscall during startup.

an extra syscall is not too bad for the gcs enabled case.
Mark Brown Oct. 3, 2023, 1:38 p.m. UTC | #27
On Mon, Oct 02, 2023 at 09:43:25PM +0000, Edgecombe, Rick P wrote:

> If ARM is thinking of doing things differently than x86, you might
> think about how you weight those tradeoffs. Like, it might be silly to
> worry about clone() support if something else ends up breaking
> compatibility majorly. But, it might be worthwhile it you end up going
> to the proposed extremes around signal alt stacks, to maximize
> compatibility

Yeah, I think Catalin's thinking here was that we're quite a way out
from actual hardware so it's much more tractable to fix up callers than
it is for x86 where the hardware is widely available.

> Also then maybe x86 could copy the ARM ABI some day, if it ends up
> chasing the tradeoff people prefer. It probably goes without saying
> that the closer these features behave from the app developer
> perspective, the better. So a different ABI than x86 that also targets
> a mix would be a bit unfortunate. (not the end of the world though)

If nothing else even if we end up being stricter about things it would
be extremely disappointing if we ended up with something where code for
arm64 won't run when built for x86.
Mark Brown Oct. 3, 2023, 2:26 p.m. UTC | #28
On Tue, Oct 03, 2023 at 09:45:56AM +0100, Szabolcs Nagy wrote:

> clone3 seems to have features that are only available in clone3 and
> not exposed (reasonably) in libc apis so ppl will use clone3 directly
> and those will be hard to fix for gcs (you have to convince upstream
> to add future arm64 arch specific changes that they cannot test).

Ah, I hadn't realised that there were things that weren't available via
libc - that does change the calculation a bit here.  I would hope that
anything we do for clone3() would work just as well for x86 so the test
side should be a bit easier there than if it were a future arm64 thing,
though obviously it wouldn't be mandatory on x86 in the way that Catalin
wanted it for arm64.

> where this analysis might be wrong is that raw clone3 is more likely
> used as fork/vfork without a new stack and thus no gcs issue.

> even if we have time to fix code, we don't want too many ifdef hacks
> just for gcs so it matters how many projects are affected.

My impression was that raw usage of the APIs was a specialist enough
thing that this was viable, ICBW though - I might not have been
searching well enough (clone is an annoying term to search for!).
Catalin Marinas Oct. 5, 2023, 5:23 p.m. UTC | #29
On Tue, Oct 03, 2023 at 03:26:51PM +0100, Mark Brown wrote:
> On Tue, Oct 03, 2023 at 09:45:56AM +0100, Szabolcs Nagy wrote:
> > clone3 seems to have features that are only available in clone3 and
> > not exposed (reasonably) in libc apis so ppl will use clone3 directly
> > and those will be hard to fix for gcs (you have to convince upstream
> > to add future arm64 arch specific changes that they cannot test).
> 
> Ah, I hadn't realised that there were things that weren't available via
> libc - that does change the calculation a bit here.  I would hope that
> anything we do for clone3() would work just as well for x86 so the test
> side should be a bit easier there than if it were a future arm64 thing,
> though obviously it wouldn't be mandatory on x86 in the way that Catalin
> wanted it for arm64.

I haven't checked how many clone() or clone3() uses outside the libc are
(I tried some quick search in Debian but did not dig into the specifics
to see how generic that code is). I agree that having to change valid
cases outside of libc is not ideal. Even if we have the same clone3()
interface for x86 and arm64, we'd have other architectures that need
#ifdef'ing.

So I'm slightly warming up to the idea of having a default shadow stack
size (either RLIMIT_STACK or the clone3() stack size, following x86). A
clone3() extension can be added on top, though I wonder whether anyone
will use it if the kernel allocates a shadow stack by default.

It's not just the default size that I dislike (I think the x86
RLIMIT_STACK or clone3() stack_size is probably good enough) but the
kernel allocating the shadow stack and inserting it into the user
address space. The actual thread stack is managed by the user but the
shadow stack is not (and we don't do this very often). Anyway, I don't
have a better solution for direct uses of clone() or clone3(), other
than running those threads with the shadow stack disabled. Not sure
that's desirable.
Mark Brown Oct. 6, 2023, 12:17 p.m. UTC | #30
On Thu, Oct 05, 2023 at 06:23:10PM +0100, Catalin Marinas wrote:

> It's not just the default size that I dislike (I think the x86
> RLIMIT_STACK or clone3() stack_size is probably good enough) but the
> kernel allocating the shadow stack and inserting it into the user
> address space. The actual thread stack is managed by the user but the
> shadow stack is not (and we don't do this very often). Anyway, I don't
> have a better solution for direct uses of clone() or clone3(), other
> than running those threads with the shadow stack disabled. Not sure
> that's desirable.

Running threads with the shadow stack disabled if they don't explicitly
request it feels like it's asking for trouble - as well as the escape
route from the protection it'd provide I'd expect there to be trouble
for things that do stack pivots, potentially random issues if there's a
mix of ways threads are started.  It's going to be a tradeoff whatever
we do.
Eric W. Biederman Oct. 6, 2023, 12:29 p.m. UTC | #31
Mark Brown <broonie@kernel.org> writes:

> On Thu, Oct 05, 2023 at 06:23:10PM +0100, Catalin Marinas wrote:
>
>> It's not just the default size that I dislike (I think the x86
>> RLIMIT_STACK or clone3() stack_size is probably good enough) but the
>> kernel allocating the shadow stack and inserting it into the user
>> address space. The actual thread stack is managed by the user but the
>> shadow stack is not (and we don't do this very often). Anyway, I don't
>> have a better solution for direct uses of clone() or clone3(), other
>> than running those threads with the shadow stack disabled. Not sure
>> that's desirable.
>
> Running threads with the shadow stack disabled if they don't explicitly
> request it feels like it's asking for trouble - as well as the escape
> route from the protection it'd provide I'd expect there to be trouble
> for things that do stack pivots, potentially random issues if there's a
> mix of ways threads are started.  It's going to be a tradeoff whatever
> we do.

Something I haven't seen in the discussion is that one of the ways I
have seen a non-libc clone used is to implement a fork with flags.
That is a new mm is created, and effectively a new process.  Which
makes the characterization different.

In general creating a thread with clone and bypassing libc is
incompatible with pthreads, and the caller gets to keep both pieces.

As long as there is enough information code can detect that
shadow stacks are in use, and the code is able to create their own
I don't see why it shouldn't be the callers responsibility.

On the other hand I don't see the maintainer of clone Christian Brauner
or the libc folks especially Florian cc'd on this thread.  So I really
don't think you have the right folks in on this conversation.

Eric
k
Mark Brown Oct. 6, 2023, 1:23 p.m. UTC | #32
On Fri, Oct 06, 2023 at 07:29:45AM -0500, Eric W. Biederman wrote:
> Mark Brown <broonie@kernel.org> writes:

> >> It's not just the default size that I dislike (I think the x86
> >> RLIMIT_STACK or clone3() stack_size is probably good enough) but the
> >> kernel allocating the shadow stack and inserting it into the user
> >> address space. The actual thread stack is managed by the user but the
> >> shadow stack is not (and we don't do this very often). Anyway, I don't
> >> have a better solution for direct uses of clone() or clone3(), other
> >> than running those threads with the shadow stack disabled. Not sure
> >> that's desirable.

> > Running threads with the shadow stack disabled if they don't explicitly
> > request it feels like it's asking for trouble - as well as the escape
> > route from the protection it'd provide I'd expect there to be trouble
> > for things that do stack pivots, potentially random issues if there's a
> > mix of ways threads are started.  It's going to be a tradeoff whatever
> > we do.

> Something I haven't seen in the discussion is that one of the ways I
> have seen a non-libc clone used is to implement a fork with flags.
> That is a new mm is created, and effectively a new process.  Which
> makes the characterization different.

> In general creating a thread with clone and bypassing libc is
> incompatible with pthreads, and the caller gets to keep both pieces.

> As long as there is enough information code can detect that
> shadow stacks are in use, and the code is able to create their own
> I don't see why it shouldn't be the callers responsibility.

> On the other hand I don't see the maintainer of clone Christian Brauner
> or the libc folks especially Florian cc'd on this thread.  So I really
> don't think you have the right folks in on this conversation.

Well, copying them in now.  The discussion here is about allocation of
shadow stacks for the arm64 implementation of the feature (the arm64
feature is called Guarded Control Stack in the architecture).  These
maintain a second copy of the stack with only the return targets in
memory allocated with special protections so userspace can't write to it
directly and use this when doing returns to ensure that the returns
haven't been redirected.  These shadow stacks can be allocated directly
by userspace using a new system call map_shadow_stack(), doing this via
mmap() was extensively discussed but it was concluded that this was very
likely to lead to security problems so we've got this new syscall that
ensures that shadow stack memory is never accessible to userspace via
other means.

The x86 implementation that has already been merged into mainline will
allocate a new shadow stack for newly created threads when the creating
thread has one.  There was a suggestion to have arm64 diverge and
require that threads be created with clone3() and manualy provide a
shadow stack but then concerns were raised that as well as the issues
with divergence this would be too disruptive for adoption due to
non-libc thread creation.  It's not controversial that it'd be good to
have clone3() by able to explicitly specify a shadow stack, just if it
should be required.
Mark Brown Oct. 19, 2023, 5:08 p.m. UTC | #33
On Thu, Oct 05, 2023 at 06:23:10PM +0100, Catalin Marinas wrote:

> I haven't checked how many clone() or clone3() uses outside the libc are
> (I tried some quick search in Debian but did not dig into the specifics
> to see how generic that code is). I agree that having to change valid
> cases outside of libc is not ideal. Even if we have the same clone3()
> interface for x86 and arm64, we'd have other architectures that need
> #ifdef'ing.

FTR the set of Debian source packages that have references to the string
__NR_clone (which picks up clone3 too) is below.  At least some (eg,
kore) just have things that look like a copy of the syscall table rather
than things that look like calls, though equally it's likely we're
missing some.

aflplusplus
android-platform-tools
binutils-avr
box64
brltty
bubblewrap
chromium
chrony
crash
criu
crun
dietlibc
elogind
emscripten
fakeroot-ng
falcosecurity-libs
firefox
firefox-esr
flatpak
gcc-9
gcc-10
gcc-11
gcc-12
gcc-13
gcc-arm-none-eabi
gcc-snapshot
gdb-msp430
glibc
gnumach
hurd
klibc
kore
libpod
libseccomp
linux
llvm-toolchain-14
llvm-toolchain-15
llvm-toolchain-16
lxc
lxcfs
lxd
musl
newlib
notcurses
purelibc
pwntools
qemu
qt6-base
qt6-webengine
qtbase-opensource-src
qtbase-opensource-src-gles
qtwebengine-opensource-src
radare2
rumur
rustc
rust-linux-raw-sys
rust-rustix
strace
stress-ng
swtpm
systemd
systemtap
termpaint
thunderbird
tor
uclibc
umview
valgrind
vsftpd
wasi-libc
webkit2gtk
wpewebkit
diff mbox series

Patch

diff --git a/Documentation/arch/arm64/gcs.rst b/Documentation/arch/arm64/gcs.rst
new file mode 100644
index 000000000000..c0f43961fd4b
--- /dev/null
+++ b/Documentation/arch/arm64/gcs.rst
@@ -0,0 +1,228 @@ 
+===============================================
+Guarded Control Stack support for AArch64 Linux
+===============================================
+
+This document outlines briefly the interface provided to userspace by Linux in
+order to support use of the ARM Guarded Control Stack (GCS) feature.
+
+This is an outline of the most important features and issues only and not
+intended to be exhaustive.
+
+
+
+1.  General
+-----------
+
+* GCS is an architecture feature intended to provide greater protection
+  against return oriented programming (ROP) attacks and to simplify the
+  implementation of features that need to collect stack traces such as
+  profiling.
+
+* When GCS is enabled a separate guarded control stack is maintained by the
+  PE which is writeable only through specific GCS operations.  This
+  stores the call stack only, when a procedure call instruction is
+  performed the current PC is pushed onto the GCS and on RET the
+  address in the LR is verified against that on the top of the GCS.
+
+* When active current GCS pointer is stored in the system register
+  GCSPR_EL0.  This is readable by userspace but can only be updated
+  via specific GCS instructions.
+
+* The architecture provides instructions for switching between guarded
+  control stacks with checks to ensure that the new stack is a valid
+  target for switching.
+
+* The functionality of GCS is similar to that provided by the x86 Shadow
+  Stack feature, due to sharing of userspace interfaces the ABI refers to
+  shadow stacks rather than GCS.
+
+* Support for GCS is reported to userspace via HWCAP2_GCS in the aux vector
+  AT_HWCAP2 entry.
+
+* GCS is enabled per thread.  While there is support for disabling GCS
+  at runtime this should be done with great care.
+
+* GCS memory access faults are reported as normal memory access faults.
+
+* GCS specific errors (those reported with EC 0x2d) will be reported as
+  SIGSEGV with a si_code of SEGV_CPERR (control protection error).
+
+* GCS is supported only for AArch64.
+
+* On systems where GCS is supported GCSPR_EL0 is always readable by EL0
+  regardless of the GCS configuration for the thread.
+
+* The architecture supports enabling GCS without verifying that return values
+  in LR match those in the GCS, the LR will be ignored.  This is not supported
+  by Linux.
+
+* EL0 GCS entries with bit 63 set are reserved for use, one such use is defined
+  below for signals and should be ignored when parsing the stack if not
+  understood.
+
+
+2.  Enabling and disabling Guarded Control Stacks
+-------------------------------------------------
+
+* GCS is enabled and disabled for a thread via the PR_SET_SHADOW_STACK_STATUS
+  prctl(), this takes a single flags argument specifying which GCS features
+  should be used.
+
+* When set PR_SHADOW_STACK_ENABLE flag allocates a Guarded Control Stack for
+  and enables GCS for the thread, enabling the functionality controlled by
+  GCSPRE0_EL1.{nTR, RVCHKEN, PCRSEL}.
+
+* When set the PR_SHADOW_STACK_PUSH flag enables the functionality controlled
+  by GCSCRE0_EL1.PUSHMEn, allowing explicit GCS pushes.
+
+* When set the PR_SHADOW_STACK_WRITE flag enables the functionality controlled
+  by GCSCRE0_EL1.STREn, allowing explicit stores to the Guarded Control Stack.
+
+* Any unknown flags will cause PR_SET_SHADOW_STACK_STATUS to return -EINVAL.
+
+* PR_LOCK_SHADOW_STACK_STATUS is passed a bitmask of features with the same
+  values as used for PR_SET_SHADOW_STACK_STATUS.  Any future changes to the
+  status of the specified GCS mode bits will be rejected.
+
+* PR_LOCK_SHADOW_STACK_STATUS allows any bit to be locked, this allows
+  userspace to prevent changes to any future features.
+
+* PR_SET_SHADOW_STACK_STATUS and PR_LOCK_SHADOW_STACK_STATUS affect only the
+  thread the called them, any other running threads will be unaffected.
+
+* New threads inherit the GCS configuration of the thread that created them.
+
+* GCS is disabled on exec().
+
+* The current GCS configuration for a thread may be read with the
+  PR_GET_SHADOW_STACK_STATUS prctl(), this returns the same flags that
+  are passed to PR_SET_SHADOW_STACK_STATUS.
+
+* If GCS is disabled for a thread after having previously been enabled then
+  the stack will remain allocated for the lifetime of the thread.  At present
+  any attempt to reenable GCS for the thread will be rejected, this may be
+  revisited in future.
+
+* It should be noted that since enabling GCS will result in GCS becoming
+  active immediately it is not normally possible to return from the function
+  that invoked the prctl() that enabled GCS.  It is expected that the normal
+  usage will be that GCS is enabled very early in execution of a program.
+
+
+
+3.  Allocation of Guarded Control Stacks
+----------------------------------------
+
+* When GCS is enabled for a thread a new Guarded Control Stack will be
+  allocated for it of size RLIMIT_STACK / 2 or 2 gigabytes, whichever is
+  smaller.
+
+* When a new thread is created by a thread which has GCS enabled then a
+  new Guarded Control Stack will be allocated for the new thread with
+  half the size of the standard stack.
+
+* When a stack is allocated by enabling GCS or during thread creation then
+  the top 8 bytes of the stack will be initialised to 0 and GCSPR_EL0 will
+  be set to point to the address of this 0 value, this can be used to
+  detect the top of the stack.
+
+* Additional Guarded Control Stacks can be allocated using the
+  map_shadow_stack() system call.
+
+* Stacks allocated using map_shadow_stack() can optionally have an end of
+  stack marker and cap placed at the top of the stack.  If the flag
+  SHADOW_STACK_SET_TOKEN is specified a cap will be placed on the stack,
+  if SHADOW_STACK_SET_MARKER is not specified the cap will be the top 8
+  bytes of the stack and if it is specified then the cap will be the next
+  8 bytes.  While specifying just SHADOW_STACK_SET_MARKER by itself is
+  valid since the marker is all bits 0 it has no observable effect.
+
+* Stacks allocated using map_shadow_stack() must be larger than 16 bytes and
+  must be 16 bytes aligned.
+
+* When GCS is disabled for a thread the Guarded Control Stack initially
+  allocated for that thread will be freed.  Note carefully that if the
+  stack has been switched this may not be the stack currently in use by
+  the thread.
+
+
+4.  Signal handling
+--------------------
+
+* A new signal frame record gcs_context encodes the current GCS mode and
+  pointer for the interrupted context on signal delivery.  This will always
+  be present on systems that support GCS.
+
+* The record contains a flag field which reports the current GCS configuration
+  for the interrupted context as PR_GET_SHADOW_STACK_STATUS would.
+
+* The signal handler is run with the same GCS configuration as the interrupted
+  context.
+
+* When GCS is enabled for the interrupted thread a signal handling specific
+  GCS cap token will be written to the GCS, this is an architectural GCS cap
+  token with bit 63 set.  The GCSPR_EL0 reported in the signal frame will
+  point to this cap token.
+
+* The signal handler will use the same GCS as the interrupted context.
+
+* When GCS is enabled on signal entry a frame with the address of the signal
+  return handler will be pushed onto the GCS, allowing return from the signal
+  handler via RET as normal.  This will not be reported in the gcs_context in
+  the signal frame.
+
+
+5.  Signal return
+-----------------
+
+When returning from a signal handler:
+
+* If there is a gcs_context record in the signal frame then the GCS flags
+  and GCSPR_EL0 will be restored from that context prior to further
+  validation.
+
+* If there is no gcs_context record in the signal frame then the GCS
+  configuration will be unchanged.
+
+* If GCS is enabled on return from a signal handler then GCSPR_EL0 must
+  point to a valid GCS signal cap record, this will be popped from the
+  GCS prior to signal return.
+
+* If the GCS configuration is locked when returning from a signal then any
+  attempt to change the GCS configuration will be treated as an error.  This
+  is true even if GCS was not enabled prior to signal entry.
+
+* GCS may be disabled via signal return but any attempt to enable GCS via
+  signal return will be rejected.
+
+
+7.  ptrace extensions
+---------------------
+
+* A new regset NT_ARM_GCS is defined for use with PTRACE_GETREGSET and
+  PTRACE_SETREGSET.
+
+* Due to the complexity surrounding allocation and deallocation of stacks and
+  lack of practical application it is not possible to enable GCS via ptrace.
+  GCS may be disabled via the ptrace interface.
+
+* Other GCS modes may be configured via ptrace.
+
+* Configuration via ptrace ignores locking of GCS mode bits.
+
+
+8.  ELF coredump extensions
+---------------------------
+
+* NT_ARM_GCS notes will be added to each coredump for each thread of the
+  dumped process.  The contents will be equivalent to the data that would
+  have been read if a PTRACE_GETREGSET of the corresponding type were
+  executed for each thread when the coredump was generated.
+
+
+
+9.  /proc extensions
+--------------------
+
+* Guarded Control Stack pages will include "ss" in their VmFlags in
+  /proc/<pid>/smaps.
diff --git a/Documentation/arch/arm64/index.rst b/Documentation/arch/arm64/index.rst
index d08e924204bf..dcf3ee3eb8c0 100644
--- a/Documentation/arch/arm64/index.rst
+++ b/Documentation/arch/arm64/index.rst
@@ -14,6 +14,7 @@  ARM64 Architecture
     booting
     cpu-feature-registers
     elf_hwcaps
+    gcs
     hugetlbpage
     kdump
     legacy_instructions