Message ID | 20230807-arm64-gcs-v4-3-68cfa37f9069@kernel.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | arm64/gcs: Provide support for GCS in userspace | expand |
On Mon, Aug 07, 2023 at 11:00:08PM +0100, Mark Brown wrote: > +2. Enabling and disabling Guarded Control Stacks > +------------------------------------------------- > + > +* GCS is enabled and disabled for a thread via the PR_SET_SHADOW_STACK_STATUS > + prctl(), this takes a single flags argument specifying which GCS features > + should be used. > + > +* When set PR_SHADOW_STACK_ENABLE flag allocates a Guarded Control Stack for The 'for' at the end of the line above is not needed. > + and enables GCS for the thread, enabling the functionality controlled by > + GCSPRE0_EL1.{nTR, RVCHKEN, PCRSEL}. This should be GCSCRE0_EL1. > +* When set the PR_SHADOW_STACK_PUSH flag enables the functionality controlled > + by GCSCRE0_EL1.PUSHMEn, allowing explicit GCS pushes. > + > +* When set the PR_SHADOW_STACK_WRITE flag enables the functionality controlled > + by GCSCRE0_EL1.STREn, allowing explicit stores to the Guarded Control Stack. > + > +* Any unknown flags will cause PR_SET_SHADOW_STACK_STATUS to return -EINVAL. > + > +* PR_LOCK_SHADOW_STACK_STATUS is passed a bitmask of features with the same > + values as used for PR_SET_SHADOW_STACK_STATUS. Any future changes to the > + status of the specified GCS mode bits will be rejected. > + > +* PR_LOCK_SHADOW_STACK_STATUS allows any bit to be locked, this allows > + userspace to prevent changes to any future features. I presume a new lock prctl() won't allow unlocking but can only extend the lock. I haven't looked at the patches yet but it may be worth spelling this out. > +* PR_SET_SHADOW_STACK_STATUS and PR_LOCK_SHADOW_STACK_STATUS affect only the > + thread the called them, any other running threads will be unaffected. s/the called/that called/ > +* New threads inherit the GCS configuration of the thread that created them. > + > +* GCS is disabled on exec(). > + > +* The current GCS configuration for a thread may be read with the > + PR_GET_SHADOW_STACK_STATUS prctl(), this returns the same flags that > + are passed to PR_SET_SHADOW_STACK_STATUS. > + > +* If GCS is disabled for a thread after having previously been enabled then > + the stack will remain allocated for the lifetime of the thread. Sorry if this has been discussed in other threads. What is the issue with unmapping/freeing of the shadow stack? > At present > + any attempt to reenable GCS for the thread will be rejected, this may be > + revisited in future. What's the rationale here? Is it that function returns won't work? > +3. Allocation of Guarded Control Stacks > +---------------------------------------- > + > +* When GCS is enabled for a thread a new Guarded Control Stack will be > + allocated for it of size RLIMIT_STACK / 2 or 2 gigabytes, whichever is > + smaller. Is this number based on the fact that a function call would only push the LR to GCS while standard function prologue pushes at least two registers? > +* When GCS is disabled for a thread the Guarded Control Stack initially > + allocated for that thread will be freed. Note carefully that if the > + stack has been switched this may not be the stack currently in use by > + the thread. Does this not contradict an earlier statement that the GCS is not freed for a thread when disabled? > +4. Signal handling > +-------------------- > + > +* A new signal frame record gcs_context encodes the current GCS mode and > + pointer for the interrupted context on signal delivery. This will always > + be present on systems that support GCS. > + > +* The record contains a flag field which reports the current GCS configuration > + for the interrupted context as PR_GET_SHADOW_STACK_STATUS would. > + > +* The signal handler is run with the same GCS configuration as the interrupted > + context. > + > +* When GCS is enabled for the interrupted thread a signal handling specific > + GCS cap token will be written to the GCS, this is an architectural GCS cap > + token with bit 63 set. The GCSPR_EL0 reported in the signal frame will > + point to this cap token. I lost track of the GCS spec versions. Has the valid cap token format been updated? What I have in my spec (though most likely old) is: An entry in the Guarded control stack is defined as a Valid cap entry, if bits [63:12] of the value are same as bits [63:12] of the address where the entry is stored and bits [11:0] contain a Valid cap token. The other bits in the code look fine to me so far but I haven't looked at the code yet.
On Wed, Aug 09, 2023 at 03:24:14PM +0100, Catalin Marinas wrote: > On Mon, Aug 07, 2023 at 11:00:08PM +0100, Mark Brown wrote: > > +* When set PR_SHADOW_STACK_ENABLE flag allocates a Guarded Control Stack for > > The 'for' at the end of the line above is not needed. > > > + and enables GCS for the thread, enabling the functionality controlled by I find it a little clearer that it's a per thread stack here but sure. > > +* PR_LOCK_SHADOW_STACK_STATUS allows any bit to be locked, this allows > > + userspace to prevent changes to any future features. > I presume a new lock prctl() won't allow unlocking but can only extend > the lock. I haven't looked at the patches yet but it may be worth > spelling this out. Yes, there is no unlock prctl() - the lock prctl() just lets you set locks. > > +* If GCS is disabled for a thread after having previously been enabled then > > + the stack will remain allocated for the lifetime of the thread. > Sorry if this has been discussed in other threads. What is the issue > with unmapping/freeing of the shadow stack? If something was in the middle of looking the GCS (eg, you're trying to disable GCS during signal handling that preempted something that is logging the call stack) and you pull the GCS storage from underneath it then things will go badly. > > At present > > + any attempt to reenable GCS for the thread will be rejected, this may be > > + revisited in future. > What's the rationale here? Is it that function returns won't work? We have to work out where to point GCSPR_EL0 when reenabling - at the top of the old stack, where it was before, on a new stack? It was easier to just not support reenabling than to work out what the sensible answer is when it's not clear that any real use case exists to inform what makes sense. We can add support for that later (you can probe by starting a thread, disabling and trying to reenable) if someone does come up with a use case. The use case for disabling is a non-enforcing mode which logs GCS faults and then disables GCS. > > +3. Allocation of Guarded Control Stacks > > +---------------------------------------- > > +* When GCS is enabled for a thread a new Guarded Control Stack will be > > + allocated for it of size RLIMIT_STACK / 2 or 2 gigabytes, whichever is > > + smaller. > Is this number based on the fact that a function call would only push > the LR to GCS while standard function prologue pushes at least two > registers? It's actually based on bitrot that I'd initially chosen a smaller value since it's likely that functions will push at least something as you suggest, the patches now just use RLIMIT_STACK. I'll fix. > > +* When GCS is disabled for a thread the Guarded Control Stack initially > > + allocated for that thread will be freed. Note carefully that if the > > + stack has been switched this may not be the stack currently in use by > > + the thread. > Does this not contradict an earlier statement that the GCS is not freed > for a thread when disabled? Yes, it was meant to say when the thread is freed rather than when GCS is disabled. > > +* When GCS is enabled for the interrupted thread a signal handling specific > > + GCS cap token will be written to the GCS, this is an architectural GCS cap > > + token with bit 63 set. The GCSPR_EL0 reported in the signal frame will > > + point to this cap token. > I lost track of the GCS spec versions. Has the valid cap token format > been updated? What I have in my spec (though most likely old) is: > An entry in the Guarded control stack is defined as a Valid cap entry, > if bits [63:12] of the value are same as bits [63:12] of the address > where the entry is stored and bits [11:0] contain a Valid cap token. You have a draft version of the spec, this was changed and now there is now a token field reserved in the low 12 bits of the register: #define GCS_CAP_ADDR_MASK GENMASK(63, 12) #define GCS_CAP_ADDR_SHIFT 12 #define GCS_CAP_ADDR_WIDTH 52 #define GCS_CAP_ADDR(x) FIELD_GET(GCS_CAP_ADDR_MASK, x) #define GCS_CAP_TOKEN_MASK GENMASK(11, 0) #define GCS_CAP_TOKEN_SHIFT 0 #define GCS_CAP_TOKEN_WIDTH 12 #define GCS_CAP_TOKEN(x) FIELD_GET(GCS_CAP_TOKEN_MASK, x) #define GCS_CAP_VALID_TOKEN 0x1 #define GCS_CAP_IN_PROGRESS_TOKEN 0x5 #define GCS_CAP(x) ((((unsigned long)x) & GCS_CAP_ADDR_MASK) | \ GCS_CAP_VALID_TOKEN)
The 08/09/2023 16:34, Mark Brown wrote: > On Wed, Aug 09, 2023 at 03:24:14PM +0100, Catalin Marinas wrote: > > On Mon, Aug 07, 2023 at 11:00:08PM +0100, Mark Brown wrote: > > > +* When GCS is enabled for a thread a new Guarded Control Stack will be > > > + allocated for it of size RLIMIT_STACK / 2 or 2 gigabytes, whichever is > > > + smaller. > > > Is this number based on the fact that a function call would only push > > the LR to GCS while standard function prologue pushes at least two > > registers? > > It's actually based on bitrot that I'd initially chosen a smaller value > since it's likely that functions will push at least something as you > suggest, the patches now just use RLIMIT_STACK. I'll fix. the pcs requires 16byte aligned stack frames, with 8byte per gcs entry there is no need for same gcs size as stack size in userspace. you can argue about a fixed size small increment (stacksize/2 + inc) for signal handling on alt stack and special tokens, but stack size is overkill i think. fwiw my current makecontext patch uses roundup(stacksize/2+160). (threads guaranteed to have about 300bytes of data on the stack in glibc so if gcs is stacksize/2, that accounts for the increment. this is for the theoretical case when an empty thread just tries to overflow the stack and then handle the fault on sigaltstack.)
On Thu, Aug 10, 2023 at 09:55:50AM +0100, Szabolcs Nagy wrote: > The 08/09/2023 16:34, Mark Brown wrote: > > It's actually based on bitrot that I'd initially chosen a smaller value > > since it's likely that functions will push at least something as you > > suggest, the patches now just use RLIMIT_STACK. I'll fix. > the pcs requires 16byte aligned stack frames, with 8byte per gcs entry > there is no need for same gcs size as stack size in userspace. I agree that it's going to be excessive for pretty much all applications, I adjusted it to match x86 as part of the general effort to avoid divergence and because I was a bit concerned about non-PCS cases (eg, JITed code) potentially running into trouble, especially with smaller stack limits. It's not an issue I have super strong opinions on though, as you can see I had implemented it both ways at various times.
The 08/10/2023 12:41, Mark Brown wrote: > On Thu, Aug 10, 2023 at 09:55:50AM +0100, Szabolcs Nagy wrote: > > The 08/09/2023 16:34, Mark Brown wrote: > > > > It's actually based on bitrot that I'd initially chosen a smaller value > > > since it's likely that functions will push at least something as you > > > suggest, the patches now just use RLIMIT_STACK. I'll fix. > > > the pcs requires 16byte aligned stack frames, with 8byte per gcs entry > > there is no need for same gcs size as stack size in userspace. > > I agree that it's going to be excessive for pretty much all > applications, I adjusted it to match x86 as part of the general effort > to avoid divergence and because I was a bit concerned about non-PCS > cases (eg, JITed code) potentially running into trouble, especially with is that even possible? 16byte alignment is not a convention but architectural: access via unaligned sp traps (at least in userspace). it is possible to use bl such that the stack is not involved e.g. if there is no bl/ret pairing, but if we base the gcs size on the stack size then i'd expect one stack frame per bl/ret pair with 16byte alignment, or is there a programming model possible that uses 8byte stack per bl? > smaller stack limits. It's not an issue I have super strong opinions on > though, as you can see I had implemented it both ways at various times.
On Thu, Aug 10, 2023 at 02:34:04PM +0100, Szabolcs Nagy wrote: > The 08/10/2023 12:41, Mark Brown wrote: > > I agree that it's going to be excessive for pretty much all > > applications, I adjusted it to match x86 as part of the general effort > > to avoid divergence and because I was a bit concerned about non-PCS > > cases (eg, JITed code) potentially running into trouble, especially with > is that even possible? > 16byte alignment is not a convention but architectural: > access via unaligned sp traps (at least in userspace). > it is possible to use bl such that the stack is not involved > e.g. if there is no bl/ret pairing, but if we base the gcs > size on the stack size then i'd expect one stack frame per > bl/ret pair with 16byte alignment, or is there a programming > model possible that uses 8byte stack per bl? That's definitely what I'd expect most of the time. You'd need to be tracking what needs pushing in some other register but it's possible. Quite why you'd do this is a separate question, I think I'm being overly cautious worrying about anyone actually having done it but it wouldn't be the first time I was surprised by someone doing something unexpected. Like I say I think it's excessive and was erring on the side of being conservative.
On Wed, Aug 09, 2023 at 04:34:38PM +0100, Mark Brown wrote: > On Wed, Aug 09, 2023 at 03:24:14PM +0100, Catalin Marinas wrote: > > On Mon, Aug 07, 2023 at 11:00:08PM +0100, Mark Brown wrote: > > > +* When set PR_SHADOW_STACK_ENABLE flag allocates a Guarded Control Stack for > > > > The 'for' at the end of the line above is not needed. > > > > > + and enables GCS for the thread, enabling the functionality controlled by > > I find it a little clearer that it's a per thread stack here but sure. If it reads better for you, feel free to keep it as is. > > > +3. Allocation of Guarded Control Stacks > > > +---------------------------------------- > > > > +* When GCS is enabled for a thread a new Guarded Control Stack will be > > > + allocated for it of size RLIMIT_STACK / 2 or 2 gigabytes, whichever is > > > + smaller. > > > Is this number based on the fact that a function call would only push > > the LR to GCS while standard function prologue pushes at least two > > registers? > > It's actually based on bitrot that I'd initially chosen a smaller value > since it's likely that functions will push at least something as you > suggest, the patches now just use RLIMIT_STACK. I'll fix. A related question - it may have been discussed intensively on the x86 thread (I may read it sometime) - why not have the libc map the shadow stack and pass the pointer/size to clone3()? It saves us from having to guess what the right size we'd need. struct clone_args is extensible. (I plan to get back next week to this series, I'll need to read a bit more on the spec)
On Fri, Aug 18, 2023 at 06:29:54PM +0100, Catalin Marinas wrote: > A related question - it may have been discussed intensively on the x86 > thread (I may read it sometime) - why not have the libc map the shadow Your assumption that this is a single thread feels optimistic there. > stack and pass the pointer/size to clone3()? It saves us from having to > guess what the right size we'd need. struct clone_args is extensible. I can't recall or locate the specific reasoning there right now, perhaps Rick or someone else can? I'd guess there would be compat concerns for things that don't go via libc which would complicate the story with identifying and marking things as GCS/SS safe, it's going to be more robust to just supply a GCS if the process is using it. That said having a default doesn't preclude us using the extensibility to allow userspace directly to control the GCS size, I would certainly be in favour of adding support for that. > (I plan to get back next week to this series, I'll need to read a bit > more on the spec) I've been making changes, mostly in response to your feedback, so there should be a new version on Monday even if not everything is addressed yet.
On Fri, Aug 18, 2023 at 08:38:02PM +0100, Mark Brown wrote: > On Fri, Aug 18, 2023 at 06:29:54PM +0100, Catalin Marinas wrote: > > A related question - it may have been discussed intensively on the x86 > > thread (I may read it sometime) - why not have the libc map the shadow > > Your assumption that this is a single thread feels optimistic there. Yeah and I unfortunately ignored all of them. > > stack and pass the pointer/size to clone3()? It saves us from having to > > guess what the right size we'd need. struct clone_args is extensible. > > I can't recall or locate the specific reasoning there right now, perhaps > Rick or someone else can? I'd guess there would be compat concerns for > things that don't go via libc which would complicate the story with > identifying and marking things as GCS/SS safe, it's going to be more > robust to just supply a GCS if the process is using it. That said > having a default doesn't preclude us using the extensibility to allow > userspace directly to control the GCS size, I would certainly be in > favour of adding support for that. It would be good if someone provided a summary of the x86 decision (I'll get to those thread but most likely in September). I think we concluded that we can't deploy GCS entirely transparently, so we need a libc change (apart from the ELF annotations). Since libc is opting in to GCS, we could also update the pthread_create() etc. to allocate the shadow together with the standard stack. Anyway, that's my preference but maybe there were good reasons not to do this.
On Tue, Aug 22, 2023 at 05:49:51PM +0100, Catalin Marinas wrote: > On Fri, Aug 18, 2023 at 08:38:02PM +0100, Mark Brown wrote: > > > stack and pass the pointer/size to clone3()? It saves us from having to > > > guess what the right size we'd need. struct clone_args is extensible. > > I can't recall or locate the specific reasoning there right now, perhaps > > Rick or someone else can? I'd guess there would be compat concerns for > > things that don't go via libc which would complicate the story with > > identifying and marking things as GCS/SS safe, it's going to be more > > robust to just supply a GCS if the process is using it. That said > > having a default doesn't preclude us using the extensibility to allow > > userspace directly to control the GCS size, I would certainly be in > > favour of adding support for that. > It would be good if someone provided a summary of the x86 decision (I'll > get to those thread but most likely in September). I think we concluded > that we can't deploy GCS entirely transparently, so we need a libc > change (apart from the ELF annotations). Since libc is opting in to GCS, Right, we need changes for setjmp()/longjmp() for example. > we could also update the pthread_create() etc. to allocate the shadow > together with the standard stack. > Anyway, that's my preference but maybe there were good reasons not to do > this. Yeah, it'd be good to understand. I've been through quite a lot of old versions of the x86 series (I've not found them all, there's 30 versions or something of the old series plus the current one is on v9) and the code always appears to have been this way with changelogs that explain the what but not the why. For example roughly the current behaviour was already in place in v10 of the original series: https://lore.kernel.org/lkml/20200429220732.31602-26-yu-cheng.yu@intel.com/ I do worry about the story for users calling the underlying clone3() API (or legacy clone() for that matter) directly, and we would also need to handle the initial GCS enable via prctl() - that's not insurmountable, we could add a size argument there that only gets interpreted during the initial enable for example. My sense is that they deployment story is going to be smoother with defaults being provided since it avoids dealing with the issue of what to do if userspace creates a thread without a GCS in a GCS enabled process but like I say I'd be totally happy to extend clone3(). I will put some patches together for that (probably once the x86 stuff lands). Given the size of this series it might be better split out for manageability if nothing else.
The 08/22/2023 18:53, Mark Brown wrote: > On Tue, Aug 22, 2023 at 05:49:51PM +0100, Catalin Marinas wrote: > > On Fri, Aug 18, 2023 at 08:38:02PM +0100, Mark Brown wrote: > > > > > stack and pass the pointer/size to clone3()? It saves us from having to > > > > guess what the right size we'd need. struct clone_args is extensible. > > > > I can't recall or locate the specific reasoning there right now, perhaps > > > Rick or someone else can? I'd guess there would be compat concerns for > > > things that don't go via libc which would complicate the story with > > > identifying and marking things as GCS/SS safe, it's going to be more > > > robust to just supply a GCS if the process is using it. That said > > > having a default doesn't preclude us using the extensibility to allow > > > userspace directly to control the GCS size, I would certainly be in > > > favour of adding support for that. > > > It would be good if someone provided a summary of the x86 decision (I'll > > get to those thread but most likely in September). I think we concluded > > that we can't deploy GCS entirely transparently, so we need a libc > > change (apart from the ELF annotations). Since libc is opting in to GCS, > > Right, we need changes for setjmp()/longjmp() for example. > > > we could also update the pthread_create() etc. to allocate the shadow > > together with the standard stack. > > > Anyway, that's my preference but maybe there were good reasons not to do > > this. > > Yeah, it'd be good to understand. I've been through quite a lot of old > versions of the x86 series (I've not found them all, there's 30 versions > or something of the old series plus the current one is on v9) and the > code always appears to have been this way with changelogs that explain > the what but not the why. For example roughly the current behaviour was > already in place in v10 of the original series: > > https://lore.kernel.org/lkml/20200429220732.31602-26-yu-cheng.yu@intel.com/ well the original shstk patches predate clone3 so no surprise there. e.g. v6 is from 2018 and clone3 is 2019 linux 5.3 https://lore.kernel.org/lkml/20181119214809.6086-1-yu-cheng.yu@intel.com/ > > I do worry about the story for users calling the underlying clone3() API > (or legacy clone() for that matter) directly, and we would also need to > handle the initial GCS enable via prctl() - that's not insurmountable, > we could add a size argument there that only gets interpreted during the > initial enable for example. musl and bionic currently use plain clone for threads. and there is user code doing raw clone threads (such threads are technically not allowed to call into libc) it's not immediately clear to me if having gcs in those threads is better or worse. glibc can use clone3 args for gcs, i'd expect the unmap to be more annoying than the allocation, but possible (it is certainly more work than leaving everything to the kernel). one difference is that userspace can then set gcspr of a new thread and e.g. two threads can have overlapping gcs, however i don't think this impacts security much since if clone3 is attacker controlled then likely all bets are off. and yes the main thread gcs can also be libc allocated given we have to deal with the prctl anyway. if gcs size logic is in libc it can depend on env vars and can be changed more easily (and adapted to android vs musl vs glibc requirements). sigaltstack with alt gcs was a case where i thought the kernel doing it transparently is better (the libc cannot do the same as it cannot wrap signal handlers currently so does not know when a handler returns or the current alt stack state), but others seems to want an explicit sigaltgcs syscall and expose it to users. in any case we have no unwinder solution for alt gcs nor longjmp solution when the thread gcs is overflowed so this is not an issue for now. > My sense is that they deployment story is going to be smoother with > defaults being provided since it avoids dealing with the issue of what > to do if userspace creates a thread without a GCS in a GCS enabled > process but like I say I'd be totally happy to extend clone3(). I will > put some patches together for that (probably once the x86 stuff lands). > Given the size of this series it might be better split out for > manageability if nothing else. i would make thread without gcs to implicitly disable gcs, since that's what's bw compat with clones outside of libc (the libc can guarantee gcs allocation when gcs is enabled).
On Wed, Aug 23, 2023 at 11:09:59AM +0100, Szabolcs Nagy wrote: > The 08/22/2023 18:53, Mark Brown wrote: > > On Tue, Aug 22, 2023 at 05:49:51PM +0100, Catalin Marinas wrote: > > the what but not the why. For example roughly the current behaviour was > > already in place in v10 of the original series: > well the original shstk patches predate clone3 so no surprise there. > e.g. v6 is from 2018 and clone3 is 2019 linux 5.3 > https://lore.kernel.org/lkml/20181119214809.6086-1-yu-cheng.yu@intel.com/ Ah, that'd do it. People weren't excited enough on about clone3() when reviewing the shadow stack patches, I hadn't realised clone3() was so new. > > I do worry about the story for users calling the underlying clone3() API > > (or legacy clone() for that matter) directly, and we would also need to > > handle the initial GCS enable via prctl() - that's not insurmountable, > > we could add a size argument there that only gets interpreted during the > > initial enable for example. > musl and bionic currently use plain clone for threads. > and there is user code doing raw clone threads (such threads are > technically not allowed to call into libc) it's not immediately > clear to me if having gcs in those threads is better or worse. Right, that's my big worry - I hadn't realised it was extending as far as musl/bionic. > one difference is that userspace can then set gcspr of a new thread > and e.g. two threads can have overlapping gcs, however i don't think > this impacts security much since if clone3 is attacker controlled > then likely all bets are off. Yeah, I think that's a "you broke it, you get all the pieces" thing. > > My sense is that they deployment story is going to be smoother with > > defaults being provided since it avoids dealing with the issue of what > > to do if userspace creates a thread without a GCS in a GCS enabled > > process but like I say I'd be totally happy to extend clone3(). I will > > put some patches together for that (probably once the x86 stuff lands). > > Given the size of this series it might be better split out for > > manageability if nothing else. > i would make thread without gcs to implicitly disable gcs, since > that's what's bw compat with clones outside of libc (the libc can > guarantee gcs allocation when gcs is enabled). That'd create a pretty substantial divergence with the x86 patches if they land this time around, there's not enough time to rework them now - I suppose it'd mainly bite people implementing libc type stuff but still, doesn't feel great.
On Wed, Aug 23, 2023 at 11:09:59AM +0100, Szabolcs Nagy wrote: > The 08/22/2023 18:53, Mark Brown wrote: > > On Tue, Aug 22, 2023 at 05:49:51PM +0100, Catalin Marinas wrote: > > > It would be good if someone provided a summary of the x86 decision (I'll > > > get to those thread but most likely in September). I think we concluded > > > that we can't deploy GCS entirely transparently, so we need a libc > > > change (apart from the ELF annotations). Since libc is opting in to GCS, > > > > Right, we need changes for setjmp()/longjmp() for example. > > > > > we could also update the pthread_create() etc. to allocate the shadow > > > together with the standard stack. > > > > > > Anyway, that's my preference but maybe there were good reasons not to do > > > this. > > > > Yeah, it'd be good to understand. I've been through quite a lot of old > > versions of the x86 series (I've not found them all, there's 30 versions > > or something of the old series plus the current one is on v9) and the > > code always appears to have been this way with changelogs that explain > > the what but not the why. For example roughly the current behaviour was > > already in place in v10 of the original series: > > > > https://lore.kernel.org/lkml/20200429220732.31602-26-yu-cheng.yu@intel.com/ > > well the original shstk patches predate clone3 so no surprise there. > e.g. v6 is from 2018 and clone3 is 2019 linux 5.3 > https://lore.kernel.org/lkml/20181119214809.6086-1-yu-cheng.yu@intel.com/ Good point, I had not realised that. > > I do worry about the story for users calling the underlying clone3() API > > (or legacy clone() for that matter) directly, and we would also need to > > handle the initial GCS enable via prctl() - that's not insurmountable, > > we could add a size argument there that only gets interpreted during the > > initial enable for example. > > musl and bionic currently use plain clone for threads. > > and there is user code doing raw clone threads (such threads are > technically not allowed to call into libc) it's not immediately > clear to me if having gcs in those threads is better or worse. > > glibc can use clone3 args for gcs, i'd expect the unmap to be more > annoying than the allocation, but possible (it is certainly more > work than leaving everything to the kernel). Unmapping is indeed more complex but I guess something similar needs to happen for the thread stack to be reclaimed. The thing I dislike about the kernel automatically mapping it is the arbitrary fraction of RLIMIT_STACK size. glibc may use RLIMIT_STACK as a hint for the thread stack size but is this the case for other libraries? Some quick search (which I may have misinterpreted) shows that musl uses 128KB, bionic 1MB. So at this point the shadow stack size has no relevance for the actual thread stack. An alternative would be for the clone3() to provide an address _hint_ and size for GCS and it would still be the kernel doing the mmap (and munmap on clearing). But at least the user has some control over the placement of the GCS and its size (and maybe providing the address has MAP_FIXED semantics). > > My sense is that they deployment story is going to be smoother with > > defaults being provided since it avoids dealing with the issue of what > > to do if userspace creates a thread without a GCS in a GCS enabled > > process but like I say I'd be totally happy to extend clone3(). I will > > put some patches together for that (probably once the x86 stuff lands). > > Given the size of this series it might be better split out for > > manageability if nothing else. > > i would make thread without gcs to implicitly disable gcs, since > that's what's bw compat with clones outside of libc (the libc can > guarantee gcs allocation when gcs is enabled). Yes, this should work. Any invocation of clone() or clone3() without a shadow stack would disable GCS. What about the reverse, should GCS be enabled for a thread even if the clone3() caller has GCS disabled? I guess we shouldn't since GCS enabling depends on the prctl() state set previously.
On Wed, Aug 23, 2023 at 02:11:07PM +0100, Catalin Marinas wrote: > Yes, this should work. Any invocation of clone() or clone3() without a > shadow stack would disable GCS. What about the reverse, should GCS be > enabled for a thread even if the clone3() caller has GCS disabled? I > guess we shouldn't since GCS enabling depends on the prctl() state set > previously. It has a fairly obvious intended meaning so we could do it easily enough but OTOH allowing it opens up the idea of people wanting to specify GCS flags which starts to seem like more trouble than it's worth compared to just having them do the prctl() in the new thread.
On Wed, Aug 23, 2023 at 01:51:35PM +0100, Mark Brown wrote: > On Wed, Aug 23, 2023 at 11:09:59AM +0100, Szabolcs Nagy wrote: > > The 08/22/2023 18:53, Mark Brown wrote: > > > My sense is that they deployment story is going to be smoother with > > > defaults being provided since it avoids dealing with the issue of what > > > to do if userspace creates a thread without a GCS in a GCS enabled > > > process but like I say I'd be totally happy to extend clone3(). I will > > > put some patches together for that (probably once the x86 stuff lands). > > > Given the size of this series it might be better split out for > > > manageability if nothing else. > > > i would make thread without gcs to implicitly disable gcs, since > > that's what's bw compat with clones outside of libc (the libc can > > guarantee gcs allocation when gcs is enabled). > > That'd create a pretty substantial divergence with the x86 patches if > they land this time around, there's not enough time to rework them now - > I suppose it'd mainly bite people implementing libc type stuff but > still, doesn't feel great. I don't mind the divergence in this area if the libc folks are ok with it. x86 can eventually use the clone3() interface if they want more flexibility, they'll just have to continue supporting the old one. I think we already diverge around the prctl().
On Wed, Aug 23, 2023 at 05:45:11PM +0100, Catalin Marinas wrote: > I don't mind the divergence in this area if the libc folks are ok with > it. x86 can eventually use the clone3() interface if they want more > flexibility, they'll just have to continue supporting the old one. I > think we already diverge around the prctl(). Yes, though that's basically the same thing - the difference is basically just that x86 uses their custom arch_prctl() thing and we use a regular prctl(), there's nothing conceptual going on like there is here. I don't really mind either, it's just something where I'd anticipate pushback.
The 08/23/2023 17:45, Catalin Marinas wrote: > On Wed, Aug 23, 2023 at 01:51:35PM +0100, Mark Brown wrote: > > On Wed, Aug 23, 2023 at 11:09:59AM +0100, Szabolcs Nagy wrote: > > > The 08/22/2023 18:53, Mark Brown wrote: > > > > My sense is that they deployment story is going to be smoother with > > > > defaults being provided since it avoids dealing with the issue of what > > > > to do if userspace creates a thread without a GCS in a GCS enabled > > > > process but like I say I'd be totally happy to extend clone3(). I will > > > > put some patches together for that (probably once the x86 stuff lands). > > > > Given the size of this series it might be better split out for > > > > manageability if nothing else. > > > > > i would make thread without gcs to implicitly disable gcs, since > > > that's what's bw compat with clones outside of libc (the libc can > > > guarantee gcs allocation when gcs is enabled). > > > > That'd create a pretty substantial divergence with the x86 patches if > > they land this time around, there's not enough time to rework them now - > > I suppose it'd mainly bite people implementing libc type stuff but > > still, doesn't feel great. > > I don't mind the divergence in this area if the libc folks are ok with > it. x86 can eventually use the clone3() interface if they want more > flexibility, they'll just have to continue supporting the old one. I > think we already diverge around the prctl(). i will have to prototype it, but in principle i'm ok with moving gcs allocation to userspace and passing it as argument to clone3. i will have to think if x86 divergence could cause issues. to maximize compat with existing raw clone users gcs either has to be disabled implicitly or allocated by the kernel. if we move gcs management to userspace then disable sounds better to me. (except vfork/fork does not have to disable etc.) to support gcs, a libc would have to use clone3 or enable gcs in the clone start code. i don't know if we can allow disabled gcs thread creation with locked gcs state. (i can see arguments both ways, so further prctl flag may be needed which may be another divergence from x86) i wonder if we can allow MAP_FIXED as well as MAP_FIXED_NOREPLACE semantics for map_shadow_stack (MAP_FIXED makes sense if userspace allocates thread stack + tls + gcs + guard pages with one PROT_NONE mapping and then mprotects / map_shadow_stack on top of that) i.e. if userspace manages the gcs it may need more flexibility here. (for now i think separate gcs mapping works for me.)
On Wed, Aug 23, 2023 at 06:40:40PM +0100, Szabolcs Nagy wrote: > i don't know if we can allow disabled gcs thread creation with locked > gcs state. (i can see arguments both ways, so further prctl flag may > be needed which may be another divergence from x86) I think that if we do add a new flag that'd just be new functionality, the divergence would be in allowing configuration via clone3() rather than the flag. TBH I'm not sure I see a use case for locking but providing a mechanism for getting out of the lock, that seems very questionable.
On Wed, Aug 23, 2023 at 07:16:52PM +0100, Mark Brown wrote: > On Wed, Aug 23, 2023 at 06:40:40PM +0100, Szabolcs Nagy wrote: > > i don't know if we can allow disabled gcs thread creation with locked > > gcs state. (i can see arguments both ways, so further prctl flag may > > be needed which may be another divergence from x86) > > I think that if we do add a new flag that'd just be new functionality, > the divergence would be in allowing configuration via clone3() rather > than the flag. TBH I'm not sure I see a use case for locking but > providing a mechanism for getting out of the lock, that seems very > questionable. You are right, once the configuration is locked a plain clone() or clone3() without a GCS pointer should be rejected. Is there a use-case for the unlocked configuration to allow disabling the GCS implicitly via a clone syscall? If we go for extending clone3, I wonder whether we should also introduce a sigaltstack2/3 ;). I haven't checked what the current patches do and won't have time until early September (on holiday from the end of today).
On Thu, Aug 24, 2023 at 04:43:19PM +0100, Catalin Marinas wrote: > If we go for extending clone3, I wonder whether we should also introduce > a sigaltstack2/3 ;). I haven't checked what the current patches do and > won't have time until early September (on holiday from the end of today). The current patches (and the x86 stuff that's in -next) punt on sigaltstack for now, the discussions around that are unresolved. My hope is that whatever we come up with there can be cross platform.
The 08/24/2023 16:43, Catalin Marinas wrote: > Is there a use-case for the unlocked configuration to allow disabling > the GCS implicitly via a clone syscall? how would you handle clone or clone3 without gcs specified? (in the cases when clone creates a new thread with new stack) (1) fail. (2) allocate gcs. (3) disable gcs. the problem with (1) is that it requires changes to user code (this only affects code outside the libc doing something special since raw clone thread cannot call into the libc, all executing code have to be tightly controlled. i don't know how common this is, but i at least expect it in test code for system level tools like debuggers, strace, valgrind, qemu-system, seccomp filters etc. if it appears in actual use then that's a deployment issue for distros: note that changing clone to clone3 is non-trivial: it requires fallback logic and may have to deal with seccomp filters). problem with (2) is that the size policy and lifetime management is in the kernel then. (since only special cases are affected i guess that is ok, but i assumed we want to avoid this by moving to clone3 and user managed gcs). the problem with (3) is escaping the security measure, however it only applies to very special threads that can always decide to opt-in to gcs, so i don't see this as such a bad option and at least bw compat with existing code. (in my threat model the attacker cannot hijack clone syscalls as that seems stronger than hijacking return addresses.) so i guess the answer depends on how much headache failing raw clone may cause and i don't know that.
On Wed, Aug 30, 2023 at 01:37:33PM +0100, Szabolcs Nagy wrote: > The 08/24/2023 16:43, Catalin Marinas wrote: > > Is there a use-case for the unlocked configuration to allow disabling > > the GCS implicitly via a clone syscall? > how would you handle clone or clone3 without gcs specified? > (in the cases when clone creates a new thread with new stack) > (1) fail. > (2) allocate gcs. > (3) disable gcs. ... > problem with (2) is that the size policy and lifetime management > is in the kernel then. (since only special cases are affected i > guess that is ok, but i assumed we want to avoid this by moving > to clone3 and user managed gcs). Right, it seems like if we go with this then we may as well just allow plain clone() too. > the problem with (3) is escaping the security measure, however > it only applies to very special threads that can always decide > to opt-in to gcs, so i don't see this as such a bad option and > at least bw compat with existing code. (in my threat model the > attacker cannot hijack clone syscalls as that seems stronger > than hijacking return addresses.) It doesn't seem great to have a feature which is to a large extent a security feature where we provide a fairly straightforward mechanism for disabling the feature and actively expect things to be using it. Given the timescales until this gets practically deployed on arm64 I would be inclined to go with making things fail and forcing updates in the users, though obviously that's less helpful for x86 where the hardware is in user hands already so it's more of a pressing issue (and there's already what is effectively option 2 in the code). We could have the architectures diverge, as you say the effect is likely to be mainly in very low level code rather than general software.
The 08/23/2023 14:11, Catalin Marinas wrote: > On Wed, Aug 23, 2023 at 11:09:59AM +0100, Szabolcs Nagy wrote: > > The 08/22/2023 18:53, Mark Brown wrote: > > > I do worry about the story for users calling the underlying clone3() API > > > (or legacy clone() for that matter) directly, and we would also need to > > > handle the initial GCS enable via prctl() - that's not insurmountable, > > > we could add a size argument there that only gets interpreted during the > > > initial enable for example. ... > > and there is user code doing raw clone threads (such threads are > > technically not allowed to call into libc) it's not immediately > > clear to me if having gcs in those threads is better or worse. i think raw clone / clone3 users may be relevant so we need a solution such that they don't fail when gcs args are missing. userspace allocated gcs works for me, but maybe the alternative with size only is more consistent (thread gcs is kernel mapped with fallback size logic if gcs size is missing): > An alternative would be for the clone3() to provide an address _hint_ > and size for GCS and it would still be the kernel doing the mmap (and > munmap on clearing). But at least the user has some control over the > placement of the GCS and its size (and maybe providing the address has > MAP_FIXED semantics). the main thread gcs is still special: the size is provided via prctl (if at all).
On Thu, Sep 28, 2023 at 05:59:25PM +0100, Szabolcs Nagy wrote: > The 08/23/2023 14:11, Catalin Marinas wrote: > > > and there is user code doing raw clone threads (such threads are > > > technically not allowed to call into libc) it's not immediately > > > clear to me if having gcs in those threads is better or worse. > i think raw clone / clone3 users may be relevant so we need a > solution such that they don't fail when gcs args are missing. Are we sure about that? Old binaries shouldn't be affected since they won't turn GCS so we're just talking about new binaries here - are there really so many of them that we won't be able to get them all converted over to clone3() and GCS in the timescales we're talking about for GCS deployment? I obviously don't particularly mind having the default size logic but if we allow clone() then that's keeping the existing behaviour and layering allocation via clone3() on top of it which Catalin didn't want. Catalin? > userspace allocated gcs works for me, but maybe the alternative > with size only is more consistent (thread gcs is kernel mapped > with fallback size logic if gcs size is missing): If we have size only then the handling of GCS and normal stack in struct clone_args would be inconsistent. Given that it seems better to have the field present, we can allow it to be NULL and do the allocation with the specified size but it should be there. > > An alternative would be for the clone3() to provide an address _hint_ > > and size for GCS and it would still be the kernel doing the mmap (and > > munmap on clearing). But at least the user has some control over the > > placement of the GCS and its size (and maybe providing the address has > > MAP_FIXED semantics). > the main thread gcs is still special: the size is provided > via prctl (if at all). Either that or we have it do a map_shadow_stack() but that's an extra syscall during startup.
On Mon, 2023-10-02 at 20:49 +0100, Mark Brown wrote: > On Thu, Sep 28, 2023 at 05:59:25PM +0100, Szabolcs Nagy wrote: > > The 08/23/2023 14:11, Catalin Marinas wrote: > > > > > and there is user code doing raw clone threads (such threads > > > > are > > > > technically not allowed to call into libc) it's not immediately > > > > clear to me if having gcs in those threads is better or worse. > > > i think raw clone / clone3 users may be relevant so we need a > > solution such that they don't fail when gcs args are missing. > > Are we sure about that? Old binaries shouldn't be affected since > they > won't turn GCS so we're just talking about new binaries here - are > there > really so many of them that we won't be able to get them all > converted > over to clone3() and GCS in the timescales we're talking about for > GCS > deployment? I obviously don't particularly mind having the default > size > logic but if we allow clone() then that's keeping the existing > behaviour > and layering allocation via clone3() on top of it which Catalin > didn't > want. On the x86 side, there have been a lot of binaries generated that have been blindly marked as supporting shadow stack. So even if they are not old, there may still be mismarked ones using clone(). In general there are a lot of tradeoffs with shadow stack between security, compatibility and performance. Szabolcs had previously discussed some ideas around all three: - compatibility (automatic sigaltstack()) - performance/utility (creating a new libc API for makecontext()) - security (token schemes to guarantee only one user of a shadow stack at a time) On the x86 kernel side, we had our hands tied a bit by the existing userspace, and kind of ended up with a mix. I imagined that we might see demand along one of those axes after some real world use. At which point we could add some opt-in ABI tweaks to help. If ARM is thinking of doing things differently than x86, you might think about how you weight those tradeoffs. Like, it might be silly to worry about clone() support if something else ends up breaking compatibility majorly. But, it might be worthwhile it you end up going to the proposed extremes around signal alt stacks, to maximize compatibility Also then maybe x86 could copy the ARM ABI some day, if it ends up chasing the tradeoff people prefer. It probably goes without saying that the closer these features behave from the app developer perspective, the better. So a different ABI than x86 that also targets a mix would be a bit unfortunate. (not the end of the world though) Anyway, just wanted to share some perspective there. Sorry for joining this thread so late.
The 10/02/2023 20:49, Mark Brown wrote: > On Thu, Sep 28, 2023 at 05:59:25PM +0100, Szabolcs Nagy wrote: > > The 08/23/2023 14:11, Catalin Marinas wrote: > > > > > and there is user code doing raw clone threads (such threads are > > > > technically not allowed to call into libc) it's not immediately > > > > clear to me if having gcs in those threads is better or worse. > > > i think raw clone / clone3 users may be relevant so we need a > > solution such that they don't fail when gcs args are missing. > > Are we sure about that? Old binaries shouldn't be affected since they > won't turn GCS so we're just talking about new binaries here - are there > really so many of them that we won't be able to get them all converted > over to clone3() and GCS in the timescales we're talking about for GCS > deployment? I obviously don't particularly mind having the default size > logic but if we allow clone() then that's keeping the existing behaviour > and layering allocation via clone3() on top of it which Catalin didn't > want. Catalin? clone3 seems to have features that are only available in clone3 and not exposed (reasonably) in libc apis so ppl will use clone3 directly and those will be hard to fix for gcs (you have to convince upstream to add future arm64 arch specific changes that they cannot test). where this analysis might be wrong is that raw clone3 is more likely used as fork/vfork without a new stack and thus no gcs issue. even if we have time to fix code, we don't want too many ifdef hacks just for gcs so it matters how many projects are affected. > > userspace allocated gcs works for me, but maybe the alternative > > with size only is more consistent (thread gcs is kernel mapped > > with fallback size logic if gcs size is missing): > > If we have size only then the handling of GCS and normal stack in struct > clone_args would be inconsistent. Given that it seems better to have > the field present, we can allow it to be NULL and do the allocation with > the specified size but it should be there. i see, then try the original plan. > > the main thread gcs is still special: the size is provided > > via prctl (if at all). > > Either that or we have it do a map_shadow_stack() but that's an extra > syscall during startup. an extra syscall is not too bad for the gcs enabled case.
On Mon, Oct 02, 2023 at 09:43:25PM +0000, Edgecombe, Rick P wrote: > If ARM is thinking of doing things differently than x86, you might > think about how you weight those tradeoffs. Like, it might be silly to > worry about clone() support if something else ends up breaking > compatibility majorly. But, it might be worthwhile it you end up going > to the proposed extremes around signal alt stacks, to maximize > compatibility Yeah, I think Catalin's thinking here was that we're quite a way out from actual hardware so it's much more tractable to fix up callers than it is for x86 where the hardware is widely available. > Also then maybe x86 could copy the ARM ABI some day, if it ends up > chasing the tradeoff people prefer. It probably goes without saying > that the closer these features behave from the app developer > perspective, the better. So a different ABI than x86 that also targets > a mix would be a bit unfortunate. (not the end of the world though) If nothing else even if we end up being stricter about things it would be extremely disappointing if we ended up with something where code for arm64 won't run when built for x86.
On Tue, Oct 03, 2023 at 09:45:56AM +0100, Szabolcs Nagy wrote: > clone3 seems to have features that are only available in clone3 and > not exposed (reasonably) in libc apis so ppl will use clone3 directly > and those will be hard to fix for gcs (you have to convince upstream > to add future arm64 arch specific changes that they cannot test). Ah, I hadn't realised that there were things that weren't available via libc - that does change the calculation a bit here. I would hope that anything we do for clone3() would work just as well for x86 so the test side should be a bit easier there than if it were a future arm64 thing, though obviously it wouldn't be mandatory on x86 in the way that Catalin wanted it for arm64. > where this analysis might be wrong is that raw clone3 is more likely > used as fork/vfork without a new stack and thus no gcs issue. > even if we have time to fix code, we don't want too many ifdef hacks > just for gcs so it matters how many projects are affected. My impression was that raw usage of the APIs was a specialist enough thing that this was viable, ICBW though - I might not have been searching well enough (clone is an annoying term to search for!).
On Tue, Oct 03, 2023 at 03:26:51PM +0100, Mark Brown wrote: > On Tue, Oct 03, 2023 at 09:45:56AM +0100, Szabolcs Nagy wrote: > > clone3 seems to have features that are only available in clone3 and > > not exposed (reasonably) in libc apis so ppl will use clone3 directly > > and those will be hard to fix for gcs (you have to convince upstream > > to add future arm64 arch specific changes that they cannot test). > > Ah, I hadn't realised that there were things that weren't available via > libc - that does change the calculation a bit here. I would hope that > anything we do for clone3() would work just as well for x86 so the test > side should be a bit easier there than if it were a future arm64 thing, > though obviously it wouldn't be mandatory on x86 in the way that Catalin > wanted it for arm64. I haven't checked how many clone() or clone3() uses outside the libc are (I tried some quick search in Debian but did not dig into the specifics to see how generic that code is). I agree that having to change valid cases outside of libc is not ideal. Even if we have the same clone3() interface for x86 and arm64, we'd have other architectures that need #ifdef'ing. So I'm slightly warming up to the idea of having a default shadow stack size (either RLIMIT_STACK or the clone3() stack size, following x86). A clone3() extension can be added on top, though I wonder whether anyone will use it if the kernel allocates a shadow stack by default. It's not just the default size that I dislike (I think the x86 RLIMIT_STACK or clone3() stack_size is probably good enough) but the kernel allocating the shadow stack and inserting it into the user address space. The actual thread stack is managed by the user but the shadow stack is not (and we don't do this very often). Anyway, I don't have a better solution for direct uses of clone() or clone3(), other than running those threads with the shadow stack disabled. Not sure that's desirable.
On Thu, Oct 05, 2023 at 06:23:10PM +0100, Catalin Marinas wrote: > It's not just the default size that I dislike (I think the x86 > RLIMIT_STACK or clone3() stack_size is probably good enough) but the > kernel allocating the shadow stack and inserting it into the user > address space. The actual thread stack is managed by the user but the > shadow stack is not (and we don't do this very often). Anyway, I don't > have a better solution for direct uses of clone() or clone3(), other > than running those threads with the shadow stack disabled. Not sure > that's desirable. Running threads with the shadow stack disabled if they don't explicitly request it feels like it's asking for trouble - as well as the escape route from the protection it'd provide I'd expect there to be trouble for things that do stack pivots, potentially random issues if there's a mix of ways threads are started. It's going to be a tradeoff whatever we do.
Mark Brown <broonie@kernel.org> writes: > On Thu, Oct 05, 2023 at 06:23:10PM +0100, Catalin Marinas wrote: > >> It's not just the default size that I dislike (I think the x86 >> RLIMIT_STACK or clone3() stack_size is probably good enough) but the >> kernel allocating the shadow stack and inserting it into the user >> address space. The actual thread stack is managed by the user but the >> shadow stack is not (and we don't do this very often). Anyway, I don't >> have a better solution for direct uses of clone() or clone3(), other >> than running those threads with the shadow stack disabled. Not sure >> that's desirable. > > Running threads with the shadow stack disabled if they don't explicitly > request it feels like it's asking for trouble - as well as the escape > route from the protection it'd provide I'd expect there to be trouble > for things that do stack pivots, potentially random issues if there's a > mix of ways threads are started. It's going to be a tradeoff whatever > we do. Something I haven't seen in the discussion is that one of the ways I have seen a non-libc clone used is to implement a fork with flags. That is a new mm is created, and effectively a new process. Which makes the characterization different. In general creating a thread with clone and bypassing libc is incompatible with pthreads, and the caller gets to keep both pieces. As long as there is enough information code can detect that shadow stacks are in use, and the code is able to create their own I don't see why it shouldn't be the callers responsibility. On the other hand I don't see the maintainer of clone Christian Brauner or the libc folks especially Florian cc'd on this thread. So I really don't think you have the right folks in on this conversation. Eric k
On Fri, Oct 06, 2023 at 07:29:45AM -0500, Eric W. Biederman wrote: > Mark Brown <broonie@kernel.org> writes: > >> It's not just the default size that I dislike (I think the x86 > >> RLIMIT_STACK or clone3() stack_size is probably good enough) but the > >> kernel allocating the shadow stack and inserting it into the user > >> address space. The actual thread stack is managed by the user but the > >> shadow stack is not (and we don't do this very often). Anyway, I don't > >> have a better solution for direct uses of clone() or clone3(), other > >> than running those threads with the shadow stack disabled. Not sure > >> that's desirable. > > Running threads with the shadow stack disabled if they don't explicitly > > request it feels like it's asking for trouble - as well as the escape > > route from the protection it'd provide I'd expect there to be trouble > > for things that do stack pivots, potentially random issues if there's a > > mix of ways threads are started. It's going to be a tradeoff whatever > > we do. > Something I haven't seen in the discussion is that one of the ways I > have seen a non-libc clone used is to implement a fork with flags. > That is a new mm is created, and effectively a new process. Which > makes the characterization different. > In general creating a thread with clone and bypassing libc is > incompatible with pthreads, and the caller gets to keep both pieces. > As long as there is enough information code can detect that > shadow stacks are in use, and the code is able to create their own > I don't see why it shouldn't be the callers responsibility. > On the other hand I don't see the maintainer of clone Christian Brauner > or the libc folks especially Florian cc'd on this thread. So I really > don't think you have the right folks in on this conversation. Well, copying them in now. The discussion here is about allocation of shadow stacks for the arm64 implementation of the feature (the arm64 feature is called Guarded Control Stack in the architecture). These maintain a second copy of the stack with only the return targets in memory allocated with special protections so userspace can't write to it directly and use this when doing returns to ensure that the returns haven't been redirected. These shadow stacks can be allocated directly by userspace using a new system call map_shadow_stack(), doing this via mmap() was extensively discussed but it was concluded that this was very likely to lead to security problems so we've got this new syscall that ensures that shadow stack memory is never accessible to userspace via other means. The x86 implementation that has already been merged into mainline will allocate a new shadow stack for newly created threads when the creating thread has one. There was a suggestion to have arm64 diverge and require that threads be created with clone3() and manualy provide a shadow stack but then concerns were raised that as well as the issues with divergence this would be too disruptive for adoption due to non-libc thread creation. It's not controversial that it'd be good to have clone3() by able to explicitly specify a shadow stack, just if it should be required.
On Thu, Oct 05, 2023 at 06:23:10PM +0100, Catalin Marinas wrote: > I haven't checked how many clone() or clone3() uses outside the libc are > (I tried some quick search in Debian but did not dig into the specifics > to see how generic that code is). I agree that having to change valid > cases outside of libc is not ideal. Even if we have the same clone3() > interface for x86 and arm64, we'd have other architectures that need > #ifdef'ing. FTR the set of Debian source packages that have references to the string __NR_clone (which picks up clone3 too) is below. At least some (eg, kore) just have things that look like a copy of the syscall table rather than things that look like calls, though equally it's likely we're missing some. aflplusplus android-platform-tools binutils-avr box64 brltty bubblewrap chromium chrony crash criu crun dietlibc elogind emscripten fakeroot-ng falcosecurity-libs firefox firefox-esr flatpak gcc-9 gcc-10 gcc-11 gcc-12 gcc-13 gcc-arm-none-eabi gcc-snapshot gdb-msp430 glibc gnumach hurd klibc kore libpod libseccomp linux llvm-toolchain-14 llvm-toolchain-15 llvm-toolchain-16 lxc lxcfs lxd musl newlib notcurses purelibc pwntools qemu qt6-base qt6-webengine qtbase-opensource-src qtbase-opensource-src-gles qtwebengine-opensource-src radare2 rumur rustc rust-linux-raw-sys rust-rustix strace stress-ng swtpm systemd systemtap termpaint thunderbird tor uclibc umview valgrind vsftpd wasi-libc webkit2gtk wpewebkit
diff --git a/Documentation/arch/arm64/gcs.rst b/Documentation/arch/arm64/gcs.rst new file mode 100644 index 000000000000..c0f43961fd4b --- /dev/null +++ b/Documentation/arch/arm64/gcs.rst @@ -0,0 +1,228 @@ +=============================================== +Guarded Control Stack support for AArch64 Linux +=============================================== + +This document outlines briefly the interface provided to userspace by Linux in +order to support use of the ARM Guarded Control Stack (GCS) feature. + +This is an outline of the most important features and issues only and not +intended to be exhaustive. + + + +1. General +----------- + +* GCS is an architecture feature intended to provide greater protection + against return oriented programming (ROP) attacks and to simplify the + implementation of features that need to collect stack traces such as + profiling. + +* When GCS is enabled a separate guarded control stack is maintained by the + PE which is writeable only through specific GCS operations. This + stores the call stack only, when a procedure call instruction is + performed the current PC is pushed onto the GCS and on RET the + address in the LR is verified against that on the top of the GCS. + +* When active current GCS pointer is stored in the system register + GCSPR_EL0. This is readable by userspace but can only be updated + via specific GCS instructions. + +* The architecture provides instructions for switching between guarded + control stacks with checks to ensure that the new stack is a valid + target for switching. + +* The functionality of GCS is similar to that provided by the x86 Shadow + Stack feature, due to sharing of userspace interfaces the ABI refers to + shadow stacks rather than GCS. + +* Support for GCS is reported to userspace via HWCAP2_GCS in the aux vector + AT_HWCAP2 entry. + +* GCS is enabled per thread. While there is support for disabling GCS + at runtime this should be done with great care. + +* GCS memory access faults are reported as normal memory access faults. + +* GCS specific errors (those reported with EC 0x2d) will be reported as + SIGSEGV with a si_code of SEGV_CPERR (control protection error). + +* GCS is supported only for AArch64. + +* On systems where GCS is supported GCSPR_EL0 is always readable by EL0 + regardless of the GCS configuration for the thread. + +* The architecture supports enabling GCS without verifying that return values + in LR match those in the GCS, the LR will be ignored. This is not supported + by Linux. + +* EL0 GCS entries with bit 63 set are reserved for use, one such use is defined + below for signals and should be ignored when parsing the stack if not + understood. + + +2. Enabling and disabling Guarded Control Stacks +------------------------------------------------- + +* GCS is enabled and disabled for a thread via the PR_SET_SHADOW_STACK_STATUS + prctl(), this takes a single flags argument specifying which GCS features + should be used. + +* When set PR_SHADOW_STACK_ENABLE flag allocates a Guarded Control Stack for + and enables GCS for the thread, enabling the functionality controlled by + GCSPRE0_EL1.{nTR, RVCHKEN, PCRSEL}. + +* When set the PR_SHADOW_STACK_PUSH flag enables the functionality controlled + by GCSCRE0_EL1.PUSHMEn, allowing explicit GCS pushes. + +* When set the PR_SHADOW_STACK_WRITE flag enables the functionality controlled + by GCSCRE0_EL1.STREn, allowing explicit stores to the Guarded Control Stack. + +* Any unknown flags will cause PR_SET_SHADOW_STACK_STATUS to return -EINVAL. + +* PR_LOCK_SHADOW_STACK_STATUS is passed a bitmask of features with the same + values as used for PR_SET_SHADOW_STACK_STATUS. Any future changes to the + status of the specified GCS mode bits will be rejected. + +* PR_LOCK_SHADOW_STACK_STATUS allows any bit to be locked, this allows + userspace to prevent changes to any future features. + +* PR_SET_SHADOW_STACK_STATUS and PR_LOCK_SHADOW_STACK_STATUS affect only the + thread the called them, any other running threads will be unaffected. + +* New threads inherit the GCS configuration of the thread that created them. + +* GCS is disabled on exec(). + +* The current GCS configuration for a thread may be read with the + PR_GET_SHADOW_STACK_STATUS prctl(), this returns the same flags that + are passed to PR_SET_SHADOW_STACK_STATUS. + +* If GCS is disabled for a thread after having previously been enabled then + the stack will remain allocated for the lifetime of the thread. At present + any attempt to reenable GCS for the thread will be rejected, this may be + revisited in future. + +* It should be noted that since enabling GCS will result in GCS becoming + active immediately it is not normally possible to return from the function + that invoked the prctl() that enabled GCS. It is expected that the normal + usage will be that GCS is enabled very early in execution of a program. + + + +3. Allocation of Guarded Control Stacks +---------------------------------------- + +* When GCS is enabled for a thread a new Guarded Control Stack will be + allocated for it of size RLIMIT_STACK / 2 or 2 gigabytes, whichever is + smaller. + +* When a new thread is created by a thread which has GCS enabled then a + new Guarded Control Stack will be allocated for the new thread with + half the size of the standard stack. + +* When a stack is allocated by enabling GCS or during thread creation then + the top 8 bytes of the stack will be initialised to 0 and GCSPR_EL0 will + be set to point to the address of this 0 value, this can be used to + detect the top of the stack. + +* Additional Guarded Control Stacks can be allocated using the + map_shadow_stack() system call. + +* Stacks allocated using map_shadow_stack() can optionally have an end of + stack marker and cap placed at the top of the stack. If the flag + SHADOW_STACK_SET_TOKEN is specified a cap will be placed on the stack, + if SHADOW_STACK_SET_MARKER is not specified the cap will be the top 8 + bytes of the stack and if it is specified then the cap will be the next + 8 bytes. While specifying just SHADOW_STACK_SET_MARKER by itself is + valid since the marker is all bits 0 it has no observable effect. + +* Stacks allocated using map_shadow_stack() must be larger than 16 bytes and + must be 16 bytes aligned. + +* When GCS is disabled for a thread the Guarded Control Stack initially + allocated for that thread will be freed. Note carefully that if the + stack has been switched this may not be the stack currently in use by + the thread. + + +4. Signal handling +-------------------- + +* A new signal frame record gcs_context encodes the current GCS mode and + pointer for the interrupted context on signal delivery. This will always + be present on systems that support GCS. + +* The record contains a flag field which reports the current GCS configuration + for the interrupted context as PR_GET_SHADOW_STACK_STATUS would. + +* The signal handler is run with the same GCS configuration as the interrupted + context. + +* When GCS is enabled for the interrupted thread a signal handling specific + GCS cap token will be written to the GCS, this is an architectural GCS cap + token with bit 63 set. The GCSPR_EL0 reported in the signal frame will + point to this cap token. + +* The signal handler will use the same GCS as the interrupted context. + +* When GCS is enabled on signal entry a frame with the address of the signal + return handler will be pushed onto the GCS, allowing return from the signal + handler via RET as normal. This will not be reported in the gcs_context in + the signal frame. + + +5. Signal return +----------------- + +When returning from a signal handler: + +* If there is a gcs_context record in the signal frame then the GCS flags + and GCSPR_EL0 will be restored from that context prior to further + validation. + +* If there is no gcs_context record in the signal frame then the GCS + configuration will be unchanged. + +* If GCS is enabled on return from a signal handler then GCSPR_EL0 must + point to a valid GCS signal cap record, this will be popped from the + GCS prior to signal return. + +* If the GCS configuration is locked when returning from a signal then any + attempt to change the GCS configuration will be treated as an error. This + is true even if GCS was not enabled prior to signal entry. + +* GCS may be disabled via signal return but any attempt to enable GCS via + signal return will be rejected. + + +7. ptrace extensions +--------------------- + +* A new regset NT_ARM_GCS is defined for use with PTRACE_GETREGSET and + PTRACE_SETREGSET. + +* Due to the complexity surrounding allocation and deallocation of stacks and + lack of practical application it is not possible to enable GCS via ptrace. + GCS may be disabled via the ptrace interface. + +* Other GCS modes may be configured via ptrace. + +* Configuration via ptrace ignores locking of GCS mode bits. + + +8. ELF coredump extensions +--------------------------- + +* NT_ARM_GCS notes will be added to each coredump for each thread of the + dumped process. The contents will be equivalent to the data that would + have been read if a PTRACE_GETREGSET of the corresponding type were + executed for each thread when the coredump was generated. + + + +9. /proc extensions +-------------------- + +* Guarded Control Stack pages will include "ss" in their VmFlags in + /proc/<pid>/smaps. diff --git a/Documentation/arch/arm64/index.rst b/Documentation/arch/arm64/index.rst index d08e924204bf..dcf3ee3eb8c0 100644 --- a/Documentation/arch/arm64/index.rst +++ b/Documentation/arch/arm64/index.rst @@ -14,6 +14,7 @@ ARM64 Architecture booting cpu-feature-registers elf_hwcaps + gcs hugetlbpage kdump legacy_instructions
Add some documentation of the userspace ABI for Guarded Control Stacks. Signed-off-by: Mark Brown <broonie@kernel.org> --- Documentation/arch/arm64/gcs.rst | 228 +++++++++++++++++++++++++++++++++++++ Documentation/arch/arm64/index.rst | 1 + 2 files changed, 229 insertions(+)