diff mbox series

[v9,23/42] Documentation/x86: Add CET shadow stack description

Message ID 20230613001108.3040476-24-rick.p.edgecombe@intel.com (mailing list archive)
State New
Headers show
Series Shadow stacks for userspace | expand

Commit Message

Edgecombe, Rick P June 13, 2023, 12:10 a.m. UTC
Introduce a new document on Control-flow Enforcement Technology (CET).

Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
---
 Documentation/arch/x86/index.rst |   1 +
 Documentation/arch/x86/shstk.rst | 169 +++++++++++++++++++++++++++++++
 2 files changed, 170 insertions(+)
 create mode 100644 Documentation/arch/x86/shstk.rst

Comments

Mark Brown June 13, 2023, 11:55 a.m. UTC | #1
On Mon, Jun 12, 2023 at 05:10:49PM -0700, Rick Edgecombe wrote:

> +Enabling arch_prctl()'s
> +=======================
> +
> +Elf features should be enabled by the loader using the below arch_prctl's. They
> +are only supported in 64 bit user applications. These operate on the features
> +on a per-thread basis. The enablement status is inherited on clone, so if the
> +feature is enabled on the first thread, it will propagate to all the thread's
> +in an app.

I appreciate it's very late in the development of this series but given
that there are very similar features on both arm64 and riscv would it
make sense to make these just regular prctl()s, arch_prctl() isn't used
on other architectures and it'd reduce the amount of arch specific work
that userspace needs to do if the interface is shared.

It should also be possible to support both interfaces for x86 I guess,
though that feels like askng for trouble.
Florian Weimer June 13, 2023, 12:37 p.m. UTC | #2
* Mark Brown:

> On Mon, Jun 12, 2023 at 05:10:49PM -0700, Rick Edgecombe wrote:
>
>> +Enabling arch_prctl()'s
>> +=======================
>> +
>> +Elf features should be enabled by the loader using the below arch_prctl's. They
>> +are only supported in 64 bit user applications. These operate on the features
>> +on a per-thread basis. The enablement status is inherited on clone, so if the
>> +feature is enabled on the first thread, it will propagate to all the thread's
>> +in an app.
>
> I appreciate it's very late in the development of this series but given
> that there are very similar features on both arm64 and riscv would it
> make sense to make these just regular prctl()s, arch_prctl() isn't used
> on other architectures and it'd reduce the amount of arch specific work
> that userspace needs to do if the interface is shared.

Has the Arm feature been fully disclosed?

I would expect the integration with stack switching and unwinding
differs between architectures even if the core mechanism is similar.
It's probably tempting to handle shadow stack placement differently,
too.

Thanks,
Florian
Mark Brown June 13, 2023, 3:15 p.m. UTC | #3
On Tue, Jun 13, 2023 at 02:37:18PM +0200, Florian Weimer wrote:

> > I appreciate it's very late in the development of this series but given
> > that there are very similar features on both arm64 and riscv would it
> > make sense to make these just regular prctl()s, arch_prctl() isn't used
> > on other architectures and it'd reduce the amount of arch specific work
> > that userspace needs to do if the interface is shared.

> Has the Arm feature been fully disclosed?

Unfortunately no, it's not yet been folded into the ARM.  The system
registers and instructions are in the latest XML releases but that's not
the full story.

> I would expect the integration with stack switching and unwinding
> differs between architectures even if the core mechanism is similar.
> It's probably tempting to handle shadow stack placement differently,
> too.

Yeah, there's likely to be some differences (though given the amount of
discussion on the x86 implementation I'm trying to follow the decisions
there as much as reasonable on the basis that we should hopefully come
to the same conclusions).  It seemed worth mentioning as a needless
bump, OTOH I defninitely don't see it as critical.
Edgecombe, Rick P June 13, 2023, 5:11 p.m. UTC | #4
On Tue, 2023-06-13 at 16:15 +0100, Mark Brown wrote:
> > I would expect the integration with stack switching and unwinding
> > differs between architectures even if the core mechanism is
> > similar.
> > It's probably tempting to handle shadow stack placement
> > differently,
> > too.
> 
> Yeah, there's likely to be some differences (though given the amount
> of
> discussion on the x86 implementation I'm trying to follow the
> decisions
> there as much as reasonable on the basis that we should hopefully
> come
> to the same conclusions).  It seemed worth mentioning as a needless
> bump, OTOH I defninitely don't see it as critical.

Two things that came up as far as unifying the interface were:
1. The map_shadow_stack syscall
x86 shadow stack does some optional pre-populating of the shadow stack
memory. And in additional not all types of memory are supported
(private anonymous only). This is partly to strengthen the security
(which might be a cross-arch thing) and also partly due to x86's
Write=0,Dirty=1 PTE bit combination. So a new syscall fit better. Some
core-mm folks were not super keen on overloading mmap() to start doing
things like writing to the memory being mapped, as well.

2. The arch_prctl() interface
While enable and disable might be shared, there are some arch-specific
stuff for x86 like enabling the WRSS instruction.

For x86 all of the exercising of the kernel interface was in arch
specific code, so unifying the kernel interface didn't save much on the
user side. If there turns out to be some unification opportunities when
everything is explored and decided on, we could have the option of
tying x86's feature into it later.

I think the map_shadow_stack syscall had the most debate. But the
arch_prctl() was mostly agreed on IIRC. The debate was mostly with
glibc folks and the riscv shadow stack developer.


For my part, the thing I would really like to see unified as much as
possible is at the app developer's interface (glibc/gcc). The idea
would be to make it easy for app developers to know if their app
supports shadow stack. There will probably be some differences, but it
would be great if there was mostly the same behavior and a small list
of differences. I'm thinking about the behavior of longjmp(),
swapcontext(), etc.
Mark Brown June 13, 2023, 5:57 p.m. UTC | #5
On Tue, Jun 13, 2023 at 05:11:35PM +0000, Edgecombe, Rick P wrote:

> Two things that came up as far as unifying the interface were:
> 1. The map_shadow_stack syscall
> x86 shadow stack does some optional pre-populating of the shadow stack
> memory. And in additional not all types of memory are supported
> (private anonymous only). This is partly to strengthen the security
> (which might be a cross-arch thing) and also partly due to x86's
> Write=0,Dirty=1 PTE bit combination. So a new syscall fit better. Some
> core-mm folks were not super keen on overloading mmap() to start doing
> things like writing to the memory being mapped, as well.

Right, the strengthening security bits made this one look cross arch -
that one wasn't worrying me.

> 2. The arch_prctl() interface
> While enable and disable might be shared, there are some arch-specific
> stuff for x86 like enabling the WRSS instruction.

> For x86 all of the exercising of the kernel interface was in arch
> specific code, so unifying the kernel interface didn't save much on the
> user side. If there turns out to be some unification opportunities when
> everything is explored and decided on, we could have the option of
> tying x86's feature into it later.

> I think the map_shadow_stack syscall had the most debate. But the
> arch_prctl() was mostly agreed on IIRC. The debate was mostly with
> glibc folks and the riscv shadow stack developer.

For arm64 we have an equivalentish thing to WRSS which lets us control
if userspace can explicitly push or pop values onto the shadow stack
(GCS for us) so it all maps on well - before I noticed that it was
arch_prctl() I was looking at it and thinking it worked for us.  At the
minute I've taken the prctl() patch from the riscv series and added in a
flag for writability since we just don't have an arch_prctl(), this
isn't a huge deal but it just seemed like needless effort to wonder why
it's different.

> For my part, the thing I would really like to see unified as much as
> possible is at the app developer's interface (glibc/gcc). The idea
> would be to make it easy for app developers to know if their app
> supports shadow stack. There will probably be some differences, but it
> would be great if there was mostly the same behavior and a small list
> of differences. I'm thinking about the behavior of longjmp(),
> swapcontext(), etc.

Yes, very much so.  sigaltcontext() too.
Edgecombe, Rick P June 13, 2023, 7:57 p.m. UTC | #6
On Tue, 2023-06-13 at 18:57 +0100, Mark Brown wrote:
> > > > For my part, the thing I would really like to see unified as
> > > > much > > as
> > > > possible is at the app developer's interface (glibc/gcc). The
> > > > idea
> > > > would be to make it easy for app developers to know if their
> > > > app
> > > > supports shadow stack. There will probably be some differences,
> > > > but > > it
> > > > would be great if there was mostly the same behavior and a
> > > > small > > list
> > > > of differences. I'm thinking about the behavior of longjmp(),
> > > > swapcontext(), etc.
> > 
> > Yes, very much so.  sigaltcontext() too.

For alt shadow stack's, this is what I came up with:
https://lore.kernel.org/lkml/20220929222936.14584-40-rick.p.edgecombe@intel.com/

Unfortunately it can't work automatically with sigaltstack(). Since it
has to be a new thing anyway, it's been left for the future. I guess
that might have a better chance of being cross arch.


BTW, last time this series accidentally broke an arm config and made it
all the way through the robots up to Linus. Would you mind giving
patches 1-3 a check?

Thanks,

Rick
Szabolcs Nagy June 14, 2023, 10:43 a.m. UTC | #7
The 06/13/2023 19:57, Edgecombe, Rick P wrote:
> On Tue, 2023-06-13 at 18:57 +0100, Mark Brown wrote:
> > > > > For my part, the thing I would really like to see unified as
> > > > > much > > as
> > > > > possible is at the app developer's interface (glibc/gcc). The
> > > > > idea
> > > > > would be to make it easy for app developers to know if their
> > > > > app
> > > > > supports shadow stack. There will probably be some differences,
> > > > > but > > it
> > > > > would be great if there was mostly the same behavior and a
> > > > > small > > list
> > > > > of differences. I'm thinking about the behavior of longjmp(),
> > > > > swapcontext(), etc.
> > >
> > > Yes, very much so.  sigaltcontext() too.
>
> For alt shadow stack's, this is what I came up with:
> https://lore.kernel.org/lkml/20220929222936.14584-40-rick.p.edgecombe@intel.com/
>
> Unfortunately it can't work automatically with sigaltstack(). Since it
> has to be a new thing anyway, it's been left for the future. I guess
> that might have a better chance of being cross arch.

i dont think you can add sigaltshstk later.

libgcc already has unwinder code for shstk and that cannot handle
discontinous shadow stack. (may affect longjmp too depending on
how it is implemented)

we can change the unwinder now to know how to switch shstk when
it unwinds the signal frame and backport that to systems that
want to support shstk. or we can introduce a new elf marking
scheme just for sigaltshstk when it is added so incompatibility
can be detected. or we simply not support unwinding with
sigaltshstk which would make it pretty much useless in practice.
IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.
Mark Brown June 14, 2023, 1:12 p.m. UTC | #8
On Tue, Jun 13, 2023 at 07:57:37PM +0000, Edgecombe, Rick P wrote:

> For alt shadow stack's, this is what I came up with:
> https://lore.kernel.org/lkml/20220929222936.14584-40-rick.p.edgecombe@intel.com/

> Unfortunately it can't work automatically with sigaltstack(). Since it
> has to be a new thing anyway, it's been left for the future. I guess
> that might have a better chance of being cross arch.

Yeah, I've not seen and can't think of anything that's entirely
satisfactory either.  Like Szabolcs says I do think we need a story on
this.

> BTW, last time this series accidentally broke an arm config and made it
> all the way through the robots up to Linus. Would you mind giving
> patches 1-3 a check?

I'm in the middle of importing the whole series into my development
branch, but note that I'm only really working with arm64 not arm so
might miss stuff the bots would hit.  Hopefully there should be some
Tested-bys coming for arm64 anyway.
Edgecombe, Rick P June 14, 2023, 4:57 p.m. UTC | #9
On Wed, 2023-06-14 at 11:43 +0100, szabolcs.nagy@arm.com wrote:
> The 06/13/2023 19:57, Edgecombe, Rick P wrote:
> > On Tue, 2023-06-13 at 18:57 +0100, Mark Brown wrote:
> > > > > > For my part, the thing I would really like to see unified
> > > > > > as
> > > > > > much > > as
> > > > > > possible is at the app developer's interface (glibc/gcc).
> > > > > > The
> > > > > > idea
> > > > > > would be to make it easy for app developers to know if
> > > > > > their
> > > > > > app
> > > > > > supports shadow stack. There will probably be some
> > > > > > differences,
> > > > > > but > > it
> > > > > > would be great if there was mostly the same behavior and a
> > > > > > small > > list
> > > > > > of differences. I'm thinking about the behavior of
> > > > > > longjmp(),
> > > > > > swapcontext(), etc.
> > > > 
> > > > Yes, very much so.  sigaltcontext() too.
> > 
> > For alt shadow stack's, this is what I came up with:
> > https://lore.kernel.org/lkml/20220929222936.14584-40-rick.p.edgecombe@intel.com/
> > 
> > Unfortunately it can't work automatically with sigaltstack(). Since
> > it
> > has to be a new thing anyway, it's been left for the future. I
> > guess
> > that might have a better chance of being cross arch.
> 
> i dont think you can add sigaltshstk later.
> 
> libgcc already has unwinder code for shstk and that cannot handle
> discontinous shadow stack.

Are you referring to the existing C++ exception unwinding code that
expects a different signal frame format? Yea this is a problem, but I
don't see how it's a problem with any solutions now that will be harder
later. I mentioned it when I brought up all the app compatibility
problems.[0]

The problem is that gcc expects a fixed 8 byte sized shadow stack
signal frame. The format in these patches is such that it can be
expanded for the sake of supporting alt shadow stack later, but it
happens to be a fixed 8 bytes for now, so it will work seamlessly with
these old gcc's. HJ has some patches to fix GCC to jump over a
dynamically sized shadow stack signal frame, but this of course won't
stop old gcc's from generating binaries that won't work with an
expanded frame.

I was waffling on whether it would be better to pad the shadow stack
[1] signal frame to start, this would break compatibility with any
binaries that use this -fnon-call-exceptions feature (if there are
any), but would set us up better for the future if we got away with it.
On one hand we are already juggling some compatibility issues so maybe
it's not too much worse, but on the other hand the kernel is trying its
best to be as compatible as it can given the situation. It doesn't
*need* to break this compatibility at this point.

In the end I thought it was better to deal with it later.

>  (may affect longjmp too depending on
> how it is implemented)

glibc's longjmp ignores anything everything it skips over and just does
INCSSP until it gets back to the setjmp point. So it is not affected by
the shadow stack signal frame format. I don't think we can support
longjmping off an alt shadow stack unless we enable WRSS or get the
kernel's help. So this was to be declared as unsupported.

> 
> we can change the unwinder now to know how to switch shstk when
> it unwinds the signal frame and backport that to systems that
> want to support shstk. or we can introduce a new elf marking
> scheme just for sigaltshstk when it is added so incompatibility
> can be detected. or we simply not support unwinding with
> sigaltshstk which would make it pretty much useless in practice.

Yea, I was thinking along the same lines. Someday we could easily need
some new marker. Maybe because we want to add something, or maybe
because of the pre-existing userspace. In that case, this
implementation will get the ball rolling and we can learn more about
how shadow stack will be used. So if we need to break compatibility
with any apps, we would not really be in a different situation than we
are already in (if we are going to take proper care to not break
userspace). So if/when that happens all the learning's can go into the
clean break.

But if it's not clear, unwinder's that properly use the format in these
patches should work from an alt shadow stack implemented like that RFC
linked earlier in the thread. At least it will be able to read back the
shadow stack starting from the alt shadow stack, it can't actually
resume control flow from where it unwound to. For that we need WRSS or
some kernel help.

[0]
https://lore.kernel.org/lkml/7d8133c7e0186bdaeb3893c1c808148dc0d11945.camel@intel.com/
Szabolcs Nagy June 19, 2023, 8:47 a.m. UTC | #10
The 06/14/2023 16:57, Edgecombe, Rick P wrote:
> On Wed, 2023-06-14 at 11:43 +0100, szabolcs.nagy@arm.com wrote:
> > i dont think you can add sigaltshstk later.
> > 
> > libgcc already has unwinder code for shstk and that cannot handle
> > discontinous shadow stack.
> 
> Are you referring to the existing C++ exception unwinding code that
> expects a different signal frame format? Yea this is a problem, but I
> don't see how it's a problem with any solutions now that will be harder
> later. I mentioned it when I brought up all the app compatibility
> problems.[0]

there is old unwinder code incompatible with the current patches,
but that was fixed. however the new unwind code assumes signal
entry pushes one extra token that just have to be popped from the
shstk. this abi cannot be expanded which means

1) kernel cannot push more tokens for more integrity checks
   (or to add whatever other features)

2) sigaltshstk cannot work.

if the unwinder instead interprets the token to be the old ssp and
either incssp or switch to that ssp (depending on continous or
discontinous shstk, which the unwinder can detect), then 1) and 2)
are fixed.

but currently the distributed unwinder binary is incompatible with
1) and 2) so sigaltshstk cannot be added later. breaking the unwind
abi is not acceptable.

> The problem is that gcc expects a fixed 8 byte sized shadow stack
> signal frame. The format in these patches is such that it can be
> expanded for the sake of supporting alt shadow stack later, but it
> happens to be a fixed 8 bytes for now, so it will work seamlessly with
> these old gcc's. HJ has some patches to fix GCC to jump over a
> dynamically sized shadow stack signal frame, but this of course won't
> stop old gcc's from generating binaries that won't work with an
> expanded frame.
> 
> I was waffling on whether it would be better to pad the shadow stack
> [1] signal frame to start, this would break compatibility with any
> binaries that use this -fnon-call-exceptions feature (if there are
> any), but would set us up better for the future if we got away with it.

i don't see how -fnon-call-exceptions is relevant.

you can unwind from a signal handler (this is not a c++ question
but unwind abi question) and in practice eh works e.g. if the
signal is raised (sync or async) in a frame where there are no
cleanup handlers registered. in practice code rarely relies on
this (because it's not valid in c++). the main user of this i
know of is the glibc cancellation implmentation. (that is special
in that it never catches the exception so ssp does not have to be
updated for things to work, but in principle the unwinder should
still verify the entries on shstk, otherwise the security
guarantees are broken and the cleanup handlers can be hijacked.
there are glibc abi issues that prevent fixing this, but in other
libcs this may be still relevant).

> On one hand we are already juggling some compatibility issues so maybe
> it's not too much worse, but on the other hand the kernel is trying its
> best to be as compatible as it can given the situation. It doesn't
> *need* to break this compatibility at this point.
> 
> In the end I thought it was better to deal with it later.
> 
> >  (may affect longjmp too depending on
> > how it is implemented)
> 
> glibc's longjmp ignores anything everything it skips over and just does
> INCSSP until it gets back to the setjmp point. So it is not affected by
> the shadow stack signal frame format. I don't think we can support
> longjmping off an alt shadow stack unless we enable WRSS or get the
> kernel's help. So this was to be declared as unsupported.

longjmp can support discontinous shadow stack without wrss.
the current code proposed to glibc does not, which is wrong
(it breaks altshstk and green thread users like qemu for no
good reason).

declaring things unsupported means you have to go around to
audit and mark binaries accordingly.

> > we can change the unwinder now to know how to switch shstk when
> > it unwinds the signal frame and backport that to systems that
> > want to support shstk. or we can introduce a new elf marking
> > scheme just for sigaltshstk when it is added so incompatibility
> > can be detected. or we simply not support unwinding with
> > sigaltshstk which would make it pretty much useless in practice.
> 
> Yea, I was thinking along the same lines. Someday we could easily need
> some new marker. Maybe because we want to add something, or maybe
> because of the pre-existing userspace. In that case, this
> implementation will get the ball rolling and we can learn more about
> how shadow stack will be used. So if we need to break compatibility
> with any apps, we would not really be in a different situation than we
> are already in (if we are going to take proper care to not break
> userspace). So if/when that happens all the learning's can go into the
> clean break.
> 
> But if it's not clear, unwinder's that properly use the format in these
> patches should work from an alt shadow stack implemented like that RFC
> linked earlier in the thread. At least it will be able to read back the
> shadow stack starting from the alt shadow stack, it can't actually
> resume control flow from where it unwound to. For that we need WRSS or
> some kernel help.

wrss is not needed to resume control flow on a different shstk.

(if you needed wrss then the map_shadow_stack would be useless.)

> 
> [0]
> https://lore.kernel.org/lkml/7d8133c7e0186bdaeb3893c1c808148dc0d11945.camel@intel.com/
>
Edgecombe, Rick P June 19, 2023, 4:44 p.m. UTC | #11
On Mon, 2023-06-19 at 09:47 +0100, szabolcs.nagy@arm.com wrote:
> The 06/14/2023 16:57, Edgecombe, Rick P wrote:
> > On Wed, 2023-06-14 at 11:43 +0100, szabolcs.nagy@arm.com wrote:
> > > i dont think you can add sigaltshstk later.
> > > 
> > > libgcc already has unwinder code for shstk and that cannot handle
> > > discontinous shadow stack.
> > 
> > Are you referring to the existing C++ exception unwinding code that
> > expects a different signal frame format? Yea this is a problem, but
> > I
> > don't see how it's a problem with any solutions now that will be
> > harder
> > later. I mentioned it when I brought up all the app compatibility
> > problems.[0]
> 
> there is old unwinder code incompatible with the current patches,
> but that was fixed. however the new unwind code assumes signal
> entry pushes one extra token that just have to be popped from the
> shstk. this abi cannot be expanded which means
> 
> 1) kernel cannot push more tokens for more integrity checks
>    (or to add whatever other features)
> 
> 2) sigaltshstk cannot work.
> 
> if the unwinder instead interprets the token to be the old ssp and
> either incssp or switch to that ssp (depending on continous or
> discontinous shstk, which the unwinder can detect), then 1) and 2)
> are fixed.
> 
> but currently the distributed unwinder binary is incompatible with
> 1) and 2) so sigaltshstk cannot be added later. breaking the unwind
> abi is not acceptable.

Can you point me to what you are talking about? I tested adding fields
to the shadow stack on top of these changes. It worked with HJ's new
(unposted I think) C++ changes for gcc. Adding fields doesn't work with
the existing gcc because it assumes a fixed size.

> 
> > The problem is that gcc expects a fixed 8 byte sized shadow stack
> > signal frame. The format in these patches is such that it can be
> > expanded for the sake of supporting alt shadow stack later, but it
> > happens to be a fixed 8 bytes for now, so it will work seamlessly
> > with
> > these old gcc's. HJ has some patches to fix GCC to jump over a
> > dynamically sized shadow stack signal frame, but this of course
> > won't
> > stop old gcc's from generating binaries that won't work with an
> > expanded frame.
> > 
> > I was waffling on whether it would be better to pad the shadow
> > stack
> > [1] signal frame to start, this would break compatibility with any
> > binaries that use this -fnon-call-exceptions feature (if there are
> > any), but would set us up better for the future if we got away with
> > it.
> 
> i don't see how -fnon-call-exceptions is relevant.

It uses unwinder code that does assume a fixed shadow stack signal
frame size. Since gcc 8.5 I think. So these compilers will continue to
generate code that assumes a fixed frame size. This is one of the
limitations we have for not moving to a new elf bit.

> 
> you can unwind from a signal handler (this is not a c++ question
> but unwind abi question) and in practice eh works e.g. if the
> signal is raised (sync or async) in a frame where there are no
> cleanup handlers registered. in practice code rarely relies on
> this (because it's not valid in c++). the main user of this i
> know of is the glibc cancellation implmentation. (that is special
> in that it never catches the exception so ssp does not have to be
> updated for things to work, but in principle the unwinder should
> still verify the entries on shstk, otherwise the security
> guarantees are broken and the cleanup handlers can be hijacked.
> there are glibc abi issues that prevent fixing this, but in other
> libcs this may be still relevant).

I'm not fully sure what you are trying to say here. The glibc shadow
stack stuff that is there today supports unwinding through a signal
handler. The longjmp code (unlike fnon-call-exections) doesn't look at
the shstk signal frame. It just does INCSSP until it reaches its
desired SSP, not caring what it is INCSSPing over.

> 
> > On one hand we are already juggling some compatibility issues so
> > maybe
> > it's not too much worse, but on the other hand the kernel is trying
> > its
> > best to be as compatible as it can given the situation. It doesn't
> > *need* to break this compatibility at this point.
> > 
> > In the end I thought it was better to deal with it later.
> > 
> > >  (may affect longjmp too depending on
> > > how it is implemented)
> > 
> > glibc's longjmp ignores anything everything it skips over and just
> > does
> > INCSSP until it gets back to the setjmp point. So it is not
> > affected by
> > the shadow stack signal frame format. I don't think we can support
> > longjmping off an alt shadow stack unless we enable WRSS or get the
> > kernel's help. So this was to be declared as unsupported.
> 
> longjmp can support discontinous shadow stack without wrss.
> the current code proposed to glibc does not, which is wrong
> (it breaks altshstk and green thread users like qemu for no
> good reason).
> 
> declaring things unsupported means you have to go around to
> audit and mark binaries accordingly.

The idea that all apps can be supported without auditing has been
assumed to be impossible by everyone I've talked to, including the
GLIBC developers deeply versed in the architectural limitations of this
feature. So if you have a magic solution, then that is a notable claim
and I think you should propose it instead of just alluding to the fact
that there is one.

The only non-WRSS "longjmp from an alt shadow stack solution" that I
can think of would have something like a new syscall performing some
limited shadow stack actions normally prohibited in userspace by the
architecture. We'd have to think through how this would impact the
security. There are a lot of security/compatibility tradeoffs to parse
in this. So also, just because something can be done, doesn't mean we
should do it. I think the philosophy at this point is, lets get the
basics working that can support most apps, and learn more about stuff
like where this bar is in the real world.

> 
> > > we can change the unwinder now to know how to switch shstk when
> > > it unwinds the signal frame and backport that to systems that
> > > want to support shstk. or we can introduce a new elf marking
> > > scheme just for sigaltshstk when it is added so incompatibility
> > > can be detected. or we simply not support unwinding with
> > > sigaltshstk which would make it pretty much useless in practice.
> > 
> > Yea, I was thinking along the same lines. Someday we could easily
> > need
> > some new marker. Maybe because we want to add something, or maybe
> > because of the pre-existing userspace. In that case, this
> > implementation will get the ball rolling and we can learn more
> > about
> > how shadow stack will be used. So if we need to break compatibility
> > with any apps, we would not really be in a different situation than
> > we
> > are already in (if we are going to take proper care to not break
> > userspace). So if/when that happens all the learning's can go into
> > the
> > clean break.
> > 
> > But if it's not clear, unwinder's that properly use the format in
> > these
> > patches should work from an alt shadow stack implemented like that
> > RFC
> > linked earlier in the thread. At least it will be able to read back
> > the
> > shadow stack starting from the alt shadow stack, it can't actually
> > resume control flow from where it unwound to. For that we need WRSS
> > or
> > some kernel help.
> 
> wrss is not needed to resume control flow on a different shstk.

WRSS lets you resume control flow at arbitrarily points by writing your
own restore token. Otherwise there are restrictions.

> 
> (if you needed wrss then the map_shadow_stack would be useless.)

map_shadow_stack is usually prepopulated with a token, otherwise it
does need WRSS to create one on it.

> 
> > 
> > [0]
> > https://lore.kernel.org/lkml/7d8133c7e0186bdaeb3893c1c808148dc0d11945.camel@intel.com/
> >
Szabolcs Nagy June 20, 2023, 9:17 a.m. UTC | #12
The 06/19/2023 16:44, Edgecombe, Rick P wrote:
> On Mon, 2023-06-19 at 09:47 +0100, szabolcs.nagy@arm.com wrote:
> > The 06/14/2023 16:57, Edgecombe, Rick P wrote:
> > > On Wed, 2023-06-14 at 11:43 +0100, szabolcs.nagy@arm.com wrote:
> > > > i dont think you can add sigaltshstk later.
> > > > 
> > > > libgcc already has unwinder code for shstk and that cannot handle
> > > > discontinous shadow stack.
> > > 
> > > Are you referring to the existing C++ exception unwinding code that
> > > expects a different signal frame format? Yea this is a problem, but
> > > I
> > > don't see how it's a problem with any solutions now that will be
> > > harder
> > > later. I mentioned it when I brought up all the app compatibility
> > > problems.[0]
> > 
> > there is old unwinder code incompatible with the current patches,
> > but that was fixed. however the new unwind code assumes signal
> > entry pushes one extra token that just have to be popped from the
> > shstk. this abi cannot be expanded which means
> > 
> > 1) kernel cannot push more tokens for more integrity checks
> >    (or to add whatever other features)
> > 
> > 2) sigaltshstk cannot work.
> > 
> > if the unwinder instead interprets the token to be the old ssp and
> > either incssp or switch to that ssp (depending on continous or
> > discontinous shstk, which the unwinder can detect), then 1) and 2)
> > are fixed.
> > 
> > but currently the distributed unwinder binary is incompatible with
> > 1) and 2) so sigaltshstk cannot be added later. breaking the unwind
> > abi is not acceptable.
> 
> Can you point me to what you are talking about? I tested adding fields
> to the shadow stack on top of these changes. It worked with HJ's new
> (unposted I think) C++ changes for gcc. Adding fields doesn't work with
> the existing gcc because it assumes a fixed size.

if there is a fix that's good, i haven't seen it.

my point was that the current unwinder works with current kernel
patches, but does not allow future extensions which prevents
sigaltshstk to work. the unwinder is not versioned so this cannot
be fixed later. it only works if distros ensure shstk is disabled
until the unwinder is fixed. (however there is no way to detect
old unwinder if somebody builds gcc from source.)

also note that there is generic code in the unwinder that will
deal with this and likely the x86 patches will conflict with
arm and riscv etc patches that try to fix the same issue..
so posting patches on the tools side of the abi would be useful
at this point.

> > > The problem is that gcc expects a fixed 8 byte sized shadow stack
> > > signal frame. The format in these patches is such that it can be
> > > expanded for the sake of supporting alt shadow stack later, but it
> > > happens to be a fixed 8 bytes for now, so it will work seamlessly
> > > with
> > > these old gcc's. HJ has some patches to fix GCC to jump over a
> > > dynamically sized shadow stack signal frame, but this of course
> > > won't
> > > stop old gcc's from generating binaries that won't work with an
> > > expanded frame.
> > > 
> > > I was waffling on whether it would be better to pad the shadow
> > > stack
> > > [1] signal frame to start, this would break compatibility with any
> > > binaries that use this -fnon-call-exceptions feature (if there are
> > > any), but would set us up better for the future if we got away with
> > > it.
> > 
> > i don't see how -fnon-call-exceptions is relevant.
> 
> It uses unwinder code that does assume a fixed shadow stack signal
> frame size. Since gcc 8.5 I think. So these compilers will continue to
> generate code that assumes a fixed frame size. This is one of the
> limitations we have for not moving to a new elf bit.

how does "fixed shadow stack signal frame size" relates to
"-fnon-call-exceptions"?

if there were instruction boundaries within a function where the
ret addr is not yet pushed or already poped from the shstk then
the flag would be relevant, but since push/pop happens atomically
at function entry/return -fnon-call-exceptions makes no
difference as far as shstk unwinding is concerned.

> > you can unwind from a signal handler (this is not a c++ question
> > but unwind abi question) and in practice eh works e.g. if the
> > signal is raised (sync or async) in a frame where there are no
> > cleanup handlers registered. in practice code rarely relies on
> > this (because it's not valid in c++). the main user of this i
> > know of is the glibc cancellation implmentation. (that is special
> > in that it never catches the exception so ssp does not have to be
> > updated for things to work, but in principle the unwinder should
> > still verify the entries on shstk, otherwise the security
> > guarantees are broken and the cleanup handlers can be hijacked.
> > there are glibc abi issues that prevent fixing this, but in other
> > libcs this may be still relevant).
> 
> I'm not fully sure what you are trying to say here. The glibc shadow

you mentioned -fnon-call-exceptions and i'm saying even without that
unwinding from signal handler is relevant and has to track shstk and
ideally even update it for eh control transfer (i.e. properly switch
to a different shstk in case of discontinous shstk).

> stack stuff that is there today supports unwinding through a signal
> handler. The longjmp code (unlike fnon-call-exections) doesn't look at
> the shstk signal frame. It just does INCSSP until it reaches its
> desired SSP, not caring what it is INCSSPing over.

x86 longjmp has differnet problems (cannot handle discontinous
shstk now).

glibc cancellation is a mix of unwinding and special longjmp and
it is currently broken in that the unwind bit cannot verify the
return addresses. the unwinder does control transfer to cleanup
handlers so control flow hijack is possible in principle on a
corrupt stack (though i don't think cancellation is a practical
attack surface).

> > longjmp can support discontinous shadow stack without wrss.
> > the current code proposed to glibc does not, which is wrong
> > (it breaks altshstk and green thread users like qemu for no
> > good reason).
> > 
> > declaring things unsupported means you have to go around to
> > audit and mark binaries accordingly.
> 
> The idea that all apps can be supported without auditing has been
> assumed to be impossible by everyone I've talked to, including the
> GLIBC developers deeply versed in the architectural limitations of this
> feature. So if you have a magic solution, then that is a notable claim
> and I think you should propose it instead of just alluding to the fact
> that there is one.

there is no magic, longjmp should be implemented as:

	target_ssp = read from jmpbuf;
	current_ssp = read ssp;
	for (p = target_ssp; p != current_ssp; p--) {
		if (*p == restore-token) {
			// target_ssp is on a different shstk.
			switch_shstk_to(p);
			break;
		}
	}
	for (; p != target_ssp; p++)
		// ssp is now on the same shstk as target.
		inc_ssp();

this is what setcontext is doing and longjmp can do the same:
for programs that always longjmp within the same shstk the first
loop is just p = current_ssp, but it also works when longjmp
target is on a different shstk assuming nothing is running on
that shstk, which is only possible if there is a restore token
on top.

this implies if the kernel switches shstk on signal entry it has
to add a restore-token on the switched away shstk.

> The only non-WRSS "longjmp from an alt shadow stack solution" that I
> can think of would have something like a new syscall performing some
> limited shadow stack actions normally prohibited in userspace by the

there is setcontext and swapcontext already doing an shstk
switch, i don't see why you think longjmp is different and
needs magic syscalls or wrss.

> architecture. We'd have to think through how this would impact the
> security. There are a lot of security/compatibility tradeoffs to parse
> in this. So also, just because something can be done, doesn't mean we
> should do it. I think the philosophy at this point is, lets get the
> basics working that can support most apps, and learn more about stuff
> like where this bar is in the real world.

i think longjmp should really be discussed with libc devs,
not on the kernel list, since they know the practical
constraints and trade-offs better. however longjmp is
relevant for the signal abi design so it's not ideal to
push a linux abi and then have the libc side discussion
later..
Edgecombe, Rick P June 20, 2023, 7:34 p.m. UTC | #13
On Tue, 2023-06-20 at 10:17 +0100, szabolcs.nagy@arm.com wrote:
> if there is a fix that's good, i haven't seen it.
> 
> my point was that the current unwinder works with current kernel
> patches, but does not allow future extensions which prevents
> sigaltshstk to work. the unwinder is not versioned so this cannot
> be fixed later. it only works if distros ensure shstk is disabled
> until the unwinder is fixed. (however there is no way to detect
> old unwinder if somebody builds gcc from source.)

This is a problem the kernel is having to deal with, not causing. The
userspace changes were upstreamed before the kernel. Userspace folks
are adamantly against moving to a new elf bit, to start over with a
clean slate. I tried everything to influence this and was not
successful. So I'm still not sure what the proposal here is for the
kernel.

I am guessing that the fnon-call-exceptions/expanded frame size
incompatibilities could end up causing something to grow an opt-in at
some point.

> 
> also note that there is generic code in the unwinder that will
> deal with this and likely the x86 patches will conflict with
> arm and riscv etc patches that try to fix the same issue..
> so posting patches on the tools side of the abi would be useful
> at this point.

The glibc patches are unfortunately mostly upstream already. See HJ for
the diff that targets the new enabling interface. From lessons learned
earlier in this effort, he was not going push those changes before the
kernel support was upstream. There shouldn't be any glibc changes to
signal or longjmp stuff in those AFAIK though.

[ snip ]

> how does "fixed shadow stack signal frame size" relates to
> "-fnon-call-exceptions"?
> 
> if there were instruction boundaries within a function where the
> ret addr is not yet pushed or already poped from the shstk then
> the flag would be relevant, but since push/pop happens atomically
> at function entry/return -fnon-call-exceptions makes no
> difference as far as shstk unwinding is concerned.

As I said, the existing unwinding code for fnon-call-excecptions
assumes a fixed shadow stack signal frame size of 8 bytes. Since the
exception is thrown out of a signal, it needs to know how to unwind
through the shadow stack signal frame.

[ snip ]

> there is no magic, longjmp should be implemented as:
> 
>         target_ssp = read from jmpbuf;
>         current_ssp = read ssp;
>         for (p = target_ssp; p != current_ssp; p--) {
>                 if (*p == restore-token) {
>                         // target_ssp is on a different shstk.
>                         switch_shstk_to(p);
>                         break;
>                 }
>         }
>         for (; p != target_ssp; p++)
>                 // ssp is now on the same shstk as target.
>                 inc_ssp();
> 
> this is what setcontext is doing and longjmp can do the same:
> for programs that always longjmp within the same shstk the first
> loop is just p = current_ssp, but it also works when longjmp
> target is on a different shstk assuming nothing is running on
> that shstk, which is only possible if there is a restore token
> on top.
> 
> this implies if the kernel switches shstk on signal entry it has
> to add a restore-token on the switched away shstk.

I actually did a POC for this, but rejected it. The problem is, if
there is a shadow stack overflow at that point then the kernel can't
push the shadow stack token to the old stack. And shadow stack overflow
is exactly the alt shadow stack use case. So it doesn't really solve
the problem.

This reasoning was actually elaborated on when the alt shadow stack
patches were posted. And it looks like I previously pointed you at it.

This history here is quite long and complicated, but I’ve done my best
to summarize it in the coverletters. It would be helpful if you could
review those links.

[ snip ]

> i think longjmp should really be discussed with libc devs,
> not on the kernel list, since they know the practical
> constraints and trade-offs better. however longjmp is
> relevant for the signal abi design so it's not ideal to
> push a linux abi and then have the libc side discussion
> later..

It sounds like you are aware of the limitations the pre-existing
upstream userspace places on the shadow stack signal frame. We also
previously discussed how the kernel had to work around other aspects of
upstream userspace that assumed undecided kernel ABI. How on earth are
you getting that the kernel ABI is being pushed before input from the
userspace side? The situation is the opposite.
Szabolcs Nagy June 21, 2023, 11:36 a.m. UTC | #14
The 06/20/2023 19:34, Edgecombe, Rick P wrote:
> On Tue, 2023-06-20 at 10:17 +0100, szabolcs.nagy@arm.com wrote:
> > if there is a fix that's good, i haven't seen it.
> > 
> > my point was that the current unwinder works with current kernel
> > patches, but does not allow future extensions which prevents
> > sigaltshstk to work. the unwinder is not versioned so this cannot
> > be fixed later. it only works if distros ensure shstk is disabled
> > until the unwinder is fixed. (however there is no way to detect
> > old unwinder if somebody builds gcc from source.)
> 
> This is a problem the kernel is having to deal with, not causing. The
> userspace changes were upstreamed before the kernel. Userspace folks
> are adamantly against moving to a new elf bit, to start over with a
> clean slate. I tried everything to influence this and was not
> successful. So I'm still not sure what the proposal here is for the
> kernel.

i agree, the glibc and libgcc patches should not have been accepted
before a linux abi.

but the other direction also holds: the linux patches should not be
pushed before the userspace design is discussed. (the current code
upstream is wrong, and new code for the proposed linux abi is not
posted yet. this is not your fault, i'm saying it here, because the
discussion is here.)

> I am guessing that the fnon-call-exceptions/expanded frame size
> incompatibilities could end up causing something to grow an opt-in at
> some point.

there are independent userspace components and not every component
has a chance to opt-in.

> > how does "fixed shadow stack signal frame size" relates to
> > "-fnon-call-exceptions"?
> > 
> > if there were instruction boundaries within a function where the
> > ret addr is not yet pushed or already poped from the shstk then
> > the flag would be relevant, but since push/pop happens atomically
> > at function entry/return -fnon-call-exceptions makes no
> > difference as far as shstk unwinding is concerned.
> 
> As I said, the existing unwinding code for fnon-call-excecptions
> assumes a fixed shadow stack signal frame size of 8 bytes. Since the
> exception is thrown out of a signal, it needs to know how to unwind
> through the shadow stack signal frame.

sorry but there is some misunderstanding about -fnon-call-exceptions.

it is for emitting cleanup and exception handler data for a function
such that throwing from certain instructions within that function
works, while normally only throwing from calls work.

it is not about *unwinding* from an async signal handler, which is
-fasynchronous-unwind-tables and should always work on linux, nor for
dealing with cleanup/exception handlers above the interrupted frame
(likewise it works on linux without special cflags).

as far as i can tell the current unwinder handles shstk unwinding
correctly across signal handlers (sync or async and cleanup/exceptions
handlers too), i see no issue with "fixed shadow stack signal frame
size of 8 bytes" other than future extensions and discontinous shstk.

> > there is no magic, longjmp should be implemented as:
> > 
> >         target_ssp = read from jmpbuf;
> >         current_ssp = read ssp;
> >         for (p = target_ssp; p != current_ssp; p--) {
> >                 if (*p == restore-token) {
> >                         // target_ssp is on a different shstk.
> >                         switch_shstk_to(p);
> >                         break;
> >                 }
> >         }
> >         for (; p != target_ssp; p++)
> >                 // ssp is now on the same shstk as target.
> >                 inc_ssp();
> > 
> > this is what setcontext is doing and longjmp can do the same:
> > for programs that always longjmp within the same shstk the first
> > loop is just p = current_ssp, but it also works when longjmp
> > target is on a different shstk assuming nothing is running on
> > that shstk, which is only possible if there is a restore token
> > on top.
> > 
> > this implies if the kernel switches shstk on signal entry it has
> > to add a restore-token on the switched away shstk.
> 
> I actually did a POC for this, but rejected it. The problem is, if
> there is a shadow stack overflow at that point then the kernel can't
> push the shadow stack token to the old stack. And shadow stack overflow
> is exactly the alt shadow stack use case. So it doesn't really solve
> the problem.

the restore token in the alt shstk case does not regress anything but
makes some use-cases work.

alt shadow stack is important if code tries to jump in and out of
signal handlers (dosemu does this with swapcontext) and for that a
restore token is needed.

alt shadow stack is important if the original shstk did not overflow
but the signal handler would overflow it (small thread stack, huge
sigaltstack case).

alt shadow stack is also important for crash reporting on shstk
overflow even if longjmp does not work then. longjmp to a makecontext
stack would still work and longjmp back to the original stack can be
made to mostly work by an altshstk option to overwrite the top entry
with a restore token on overflow (this can break unwinding though).
Edgecombe, Rick P June 21, 2023, 6:54 p.m. UTC | #15
On Wed, 2023-06-21 at 12:36 +0100, szabolcs.nagy@arm.com wrote:
> > The 06/20/2023 19:34, Edgecombe, Rick P wrote:
> > > > On Tue, 2023-06-20 at 10:17 +0100, szabolcs.nagy@arm.com wrote:
> > > > > > if there is a fix that's good, i haven't seen it.
> > > > > > 
> > > > > > my point was that the current unwinder works with current
> > > > > > kernel
> > > > > > patches, but does not allow future extensions which
> > > > > > prevents
> > > > > > sigaltshstk to work. the unwinder is not versioned so this
> > > > > > cannot
> > > > > > be fixed later. it only works if distros ensure shstk is
> > > > > > disabled
> > > > > > until the unwinder is fixed. (however there is no way to
> > > > > > detect
> > > > > > old unwinder if somebody builds gcc from source.)
> > > > 
> > > > This is a problem the kernel is having to deal with, not
> > > > causing. > > The
> > > > userspace changes were upstreamed before the kernel. Userspace
> > > > > > folks
> > > > are adamantly against moving to a new elf bit, to start over
> > > > with a
> > > > clean slate. I tried everything to influence this and was not
> > > > successful. So I'm still not sure what the proposal here is for
> > > > the
> > > > kernel.
> > 
> > i agree, the glibc and libgcc patches should not have been accepted
> > before a linux abi.
> > 
> > but the other direction also holds: the linux patches should not be
> > pushed before the userspace design is discussed. (the current code
> > upstream is wrong, and new code for the proposed linux abi is not
> > posted yet. this is not your fault, i'm saying it here, because the
> > discussion is here.)

This series has been discussed with glibc/gcc developers regularly
throughout the enabling effort. In fact there have been ongoing
discussions about future shadow stack functionality. 

It's not like this feature has been a fast or hidden effort. You are
just walking into the tail end of it. (much of it predates my
involvement BTW, including the initial glibc support)

AFAIK HJ presented the enabling changes at some glibc meeting. The
signal side of glibc is unchanged from what is already upstream. So I'm
not sure characterizing it that way is fair. It seems you were not part
of those old discussions, but that might be because your interest is
new. In any case we are constrained by some of these earlier outcomes.
More on that below.

> > 
> > > > I am guessing that the fnon-call-exceptions/expanded frame size
> > > > incompatibilities could end up causing something to grow an
> > > > opt-in > > at
> > > > some point.
> > 
> > there are independent userspace components and not every component
> > has a chance to opt-in.
> > 
> > > > > > how does "fixed shadow stack signal frame size" relates to
> > > > > > "-fnon-call-exceptions"?
> > > > > > 
> > > > > > if there were instruction boundaries within a function
> > > > > > where the
> > > > > > ret addr is not yet pushed or already poped from the shstk
> > > > > > then
> > > > > > the flag would be relevant, but since push/pop happens
> > > > > > atomically
> > > > > > at function entry/return -fnon-call-exceptions makes no
> > > > > > difference as far as shstk unwinding is concerned.
> > > > 
> > > > As I said, the existing unwinding code for fnon-call-
> > > > excecptions
> > > > assumes a fixed shadow stack signal frame size of 8 bytes.
> > > > Since > > the
> > > > exception is thrown out of a signal, it needs to know how to
> > > > unwind
> > > > through the shadow stack signal frame.
> > 
> > sorry but there is some misunderstanding about -fnon-call-
> > exceptions.
> > 
> > it is for emitting cleanup and exception handler data for a
> > function
> > such that throwing from certain instructions within that function
> > works, while normally only throwing from calls work.
> > 
> > it is not about *unwinding* from an async signal handler, which is
> > -fasynchronous-unwind-tables and should always work on linux, nor
> > for
> > dealing with cleanup/exception handlers above the interrupted frame
> > (likewise it works on linux without special cflags).
> > 
> > as far as i can tell the current unwinder handles shstk unwinding
> > correctly across signal handlers (sync or async and >
> > cleanup/exceptions
> > handlers too), i see no issue with "fixed shadow stack signal frame
> > size of 8 bytes" other than future extensions and discontinous
> > shstk.

HJ, can you link your patch that makes it extensible and we can clear
this up? Maybe the issue extends beyond fnon-call-exceptions, but that
is where I reproduced it.

> > 
> > > > > > there is no magic, longjmp should be implemented as:
> > > > > > 
> > > > > >         target_ssp = read from jmpbuf;
> > > > > >         current_ssp = read ssp;
> > > > > >         for (p = target_ssp; p != current_ssp; p--) {
> > > > > >                 if (*p == restore-token) {
> > > > > >                         // target_ssp is on a different
> > > > > > shstk.
> > > > > >                         switch_shstk_to(p);
> > > > > >                         break;
> > > > > >                 }
> > > > > >         }
> > > > > >         for (; p != target_ssp; p++)
> > > > > >                 // ssp is now on the same shstk as target.
> > > > > >                 inc_ssp();
> > > > > > 
> > > > > > this is what setcontext is doing and longjmp can do the
> > > > > > same:
> > > > > > for programs that always longjmp within the same shstk the
> > > > > > first
> > > > > > loop is just p = current_ssp, but it also works when
> > > > > > longjmp
> > > > > > target is on a different shstk assuming nothing is running
> > > > > > on
> > > > > > that shstk, which is only possible if there is a restore
> > > > > > token
> > > > > > on top.
> > > > > > 
> > > > > > this implies if the kernel switches shstk on signal entry
> > > > > > it has
> > > > > > to add a restore-token on the switched away shstk.
> > > > 
> > > > I actually did a POC for this, but rejected it. The problem is,
> > > > if
> > > > there is a shadow stack overflow at that point then the kernel
> > > > > > can't
> > > > push the shadow stack token to the old stack. And shadow stack
> > > > > > overflow
> > > > is exactly the alt shadow stack use case. So it doesn't really
> > > > > > solve
> > > > the problem.
> > 
> > the restore token in the alt shstk case does not regress anything
> > but
> > makes some use-cases work.
> > 
> > alt shadow stack is important if code tries to jump in and out of
> > signal handlers (dosemu does this with swapcontext) and for that a
> > restore token is needed.
> > 
> > alt shadow stack is important if the original shstk did not
> > overflow
> > but the signal handler would overflow it (small thread stack, huge
> > sigaltstack case).
> > 
> > alt shadow stack is also important for crash reporting on shstk
> > overflow even if longjmp does not work then. longjmp to a
> > makecontext
> > stack would still work and longjmp back to the original stack can
> > be
> > made to mostly work by an altshstk option to overwrite the top
> > entry
> > with a restore token on overflow (this can break unwinding though).
> > 

There was previously a request to create an alt shadow stack for the
purpose of handling shadow stack overflow. So you are now suggesting to
to exclude that and instead target a different use case for alt shadow
stack?

But I'm not sure how much we should change the ABI at this point since
we are constrained by existing userspace. If you read the history, we
may end up needing to deprecate the whole elf bit for this and other
reasons.

So should we struggle to find a way to grow the existing ABI without
disturbing the existing userspace? Or should we start with something,
finally, and see where we need to grow and maybe get a chance at a
fresh start to grow it?

Like, maybe 3 people will show up saying "hey, I *really* need to use
shadow stack and longjmp from a ucontext stack", and no one says
anything about shadow stack overflow. Then we know what to do. And
maybe dosemu decides it doesn't need to implement shadow stack (highly
likely I would think). Now that I think about it, AFAIU SS_AUTODISARM
was created for dosemu, and the alt shadow stack patch adopted this
behavior. So it's speculation that there is even a problem in that
scenario.

Or maybe people just enable WRSS for longjmp() and directly jump back
to the setjmp() point. Do most people want fast setjmp/longjmp() at the
cost of a little security?

Even if, with enough discussion, we could optimize for all
hypotheticals without real user feedback, I don't see how it helps
users to hold shadow stack. So I think we should move forward with the
current ABI.
Edgecombe, Rick P June 21, 2023, 10:22 p.m. UTC | #16
On Wed, 2023-06-21 at 11:54 -0700, Rick Edgecombe wrote:
> > > > > > > there is no magic, longjmp should be implemented as:
> > > > > > > 
> > > > > > >         target_ssp = read from jmpbuf;
> > > > > > >         current_ssp = read ssp;
> > > > > > >         for (p = target_ssp; p != current_ssp; p--) {
> > > > > > >                 if (*p == restore-token) {
> > > > > > >                         // target_ssp is on a different
> > > > > > > shstk.
> > > > > > >                         switch_shstk_to(p);
> > > > > > >                         break;
> > > > > > >                 }
> > > > > > >         }
> > > > > > >         for (; p != target_ssp; p++)
> > > > > > >                 // ssp is now on the same shstk as
> > > > > > > target.
> > > > > > >                 inc_ssp();
> > > > > > > 
> > > > > > > this is what setcontext is doing and longjmp can do the
> > > > > > > same:
> > > > > > > for programs that always longjmp within the same shstk
> > > > > > > the
> > > > > > > first
> > > > > > > loop is just p = current_ssp, but it also works when
> > > > > > > longjmp
> > > > > > > target is on a different shstk assuming nothing is
> > > > > > > running
> > > > > > > on
> > > > > > > that shstk, which is only possible if there is a restore
> > > > > > > token
> > > > > > > on top.
> > > > > > > 
> > > > > > > this implies if the kernel switches shstk on signal entry
> > > > > > > it has
> > > > > > > to add a restore-token on the switched away shstk.

Wait a second, the claim is that the kernel should add a restore token
on the current shadow stack before handling a signal, to allow to
unwind from an alt shadow stack, right? But in this series there is not
an alt shadow stack, so signal will be handled on the current shadow
stack. If the user stays on the current shadow stack, the existing
simple INCSSP based solution will work.

If the user swapcontext()'s away while handling a signal (which *is*
currently supported) they will leave their own restore token on the old
stack. Hypothetically glibc could unwind back through a series of
ucontext stacks by pivoting, if it kept some metadata somewhere about
where to restore to. So there are actually already enough tokens to
make it back in this case, glibc just doesn't do this.

But how does the proposed token placed by the kernel on the original
stack help this problem? The longjmp() would have to be able to find
the location of the restore tokens somehow, which would not necessarily
be near the setjmp() point. The signal token could even be on a
different shadow stack.

So I think the above is short of a design for a universally compatible
longjmp().

Which makes me think if we did want to make a more compatible longjmp()
a better the way to do it might be an arch_prctl that emits a token at
the current SSP. This would be loosening up the security somewhat (have
to be an opt-in), but less so then enabling WRSS. But it would also be
way simpler, work for all cases (I think), and be faster (maybe?) than
INCSSPing through a bunch of stacks.

I'm also not sure leaving a token on signal doesn't weaken the security
it it's own way as well. Any thread could then swap to that token.
Where as the shadow stack signal frame ssp pointer can only be used
from the shadow stack the signal was handled on.

So I think, in addition to blocking the shadow stack overflow use case
in the future, leaving a token behind on signal will not really help
longjmp(). (or at least I'm not following)
H.J. Lu June 21, 2023, 11:02 p.m. UTC | #17
On Wed, Jun 21, 2023 at 11:54 AM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Wed, 2023-06-21 at 12:36 +0100, szabolcs.nagy@arm.com wrote:
> > > The 06/20/2023 19:34, Edgecombe, Rick P wrote:
> > > > > On Tue, 2023-06-20 at 10:17 +0100, szabolcs.nagy@arm.com wrote:
> > > > > > > if there is a fix that's good, i haven't seen it.
> > > > > > >
> > > > > > > my point was that the current unwinder works with current
> > > > > > > kernel
> > > > > > > patches, but does not allow future extensions which
> > > > > > > prevents
> > > > > > > sigaltshstk to work. the unwinder is not versioned so this
> > > > > > > cannot
> > > > > > > be fixed later. it only works if distros ensure shstk is
> > > > > > > disabled
> > > > > > > until the unwinder is fixed. (however there is no way to
> > > > > > > detect
> > > > > > > old unwinder if somebody builds gcc from source.)
> > > > >
> > > > > This is a problem the kernel is having to deal with, not
> > > > > causing. > > The
> > > > > userspace changes were upstreamed before the kernel. Userspace
> > > > > > > folks
> > > > > are adamantly against moving to a new elf bit, to start over
> > > > > with a
> > > > > clean slate. I tried everything to influence this and was not
> > > > > successful. So I'm still not sure what the proposal here is for
> > > > > the
> > > > > kernel.
> > >
> > > i agree, the glibc and libgcc patches should not have been accepted
> > > before a linux abi.
> > >
> > > but the other direction also holds: the linux patches should not be
> > > pushed before the userspace design is discussed. (the current code
> > > upstream is wrong, and new code for the proposed linux abi is not
> > > posted yet. this is not your fault, i'm saying it here, because the
> > > discussion is here.)
>
> This series has been discussed with glibc/gcc developers regularly
> throughout the enabling effort. In fact there have been ongoing
> discussions about future shadow stack functionality.
>
> It's not like this feature has been a fast or hidden effort. You are
> just walking into the tail end of it. (much of it predates my
> involvement BTW, including the initial glibc support)
>
> AFAIK HJ presented the enabling changes at some glibc meeting. The
> signal side of glibc is unchanged from what is already upstream. So I'm
> not sure characterizing it that way is fair. It seems you were not part
> of those old discussions, but that might be because your interest is
> new. In any case we are constrained by some of these earlier outcomes.
> More on that below.
>
> > >
> > > > > I am guessing that the fnon-call-exceptions/expanded frame size
> > > > > incompatibilities could end up causing something to grow an
> > > > > opt-in > > at
> > > > > some point.
> > >
> > > there are independent userspace components and not every component
> > > has a chance to opt-in.
> > >
> > > > > > > how does "fixed shadow stack signal frame size" relates to
> > > > > > > "-fnon-call-exceptions"?
> > > > > > >
> > > > > > > if there were instruction boundaries within a function
> > > > > > > where the
> > > > > > > ret addr is not yet pushed or already poped from the shstk
> > > > > > > then
> > > > > > > the flag would be relevant, but since push/pop happens
> > > > > > > atomically
> > > > > > > at function entry/return -fnon-call-exceptions makes no
> > > > > > > difference as far as shstk unwinding is concerned.
> > > > >
> > > > > As I said, the existing unwinding code for fnon-call-
> > > > > excecptions
> > > > > assumes a fixed shadow stack signal frame size of 8 bytes.
> > > > > Since > > the
> > > > > exception is thrown out of a signal, it needs to know how to
> > > > > unwind
> > > > > through the shadow stack signal frame.
> > >
> > > sorry but there is some misunderstanding about -fnon-call-
> > > exceptions.
> > >
> > > it is for emitting cleanup and exception handler data for a
> > > function
> > > such that throwing from certain instructions within that function
> > > works, while normally only throwing from calls work.
> > >
> > > it is not about *unwinding* from an async signal handler, which is
> > > -fasynchronous-unwind-tables and should always work on linux, nor
> > > for
> > > dealing with cleanup/exception handlers above the interrupted frame
> > > (likewise it works on linux without special cflags).
> > >
> > > as far as i can tell the current unwinder handles shstk unwinding
> > > correctly across signal handlers (sync or async and >
> > > cleanup/exceptions
> > > handlers too), i see no issue with "fixed shadow stack signal frame
> > > size of 8 bytes" other than future extensions and discontinous
> > > shstk.
>
> HJ, can you link your patch that makes it extensible and we can clear
> this up? Maybe the issue extends beyond fnon-call-exceptions, but that
> is where I reproduced it.

Here is the patch:

https://gitlab.com/x86-gcc/gcc/-/commit/aab4c24b67b5f05b72e52a3eaae005c2277710b9

> > >
> > > > > > > there is no magic, longjmp should be implemented as:
> > > > > > >
> > > > > > >         target_ssp = read from jmpbuf;
> > > > > > >         current_ssp = read ssp;
> > > > > > >         for (p = target_ssp; p != current_ssp; p--) {
> > > > > > >                 if (*p == restore-token) {
> > > > > > >                         // target_ssp is on a different
> > > > > > > shstk.
> > > > > > >                         switch_shstk_to(p);
> > > > > > >                         break;
> > > > > > >                 }
> > > > > > >         }
> > > > > > >         for (; p != target_ssp; p++)
> > > > > > >                 // ssp is now on the same shstk as target.
> > > > > > >                 inc_ssp();
> > > > > > >
> > > > > > > this is what setcontext is doing and longjmp can do the
> > > > > > > same:
> > > > > > > for programs that always longjmp within the same shstk the
> > > > > > > first
> > > > > > > loop is just p = current_ssp, but it also works when
> > > > > > > longjmp
> > > > > > > target is on a different shstk assuming nothing is running
> > > > > > > on
> > > > > > > that shstk, which is only possible if there is a restore
> > > > > > > token
> > > > > > > on top.
> > > > > > >
> > > > > > > this implies if the kernel switches shstk on signal entry
> > > > > > > it has
> > > > > > > to add a restore-token on the switched away shstk.
> > > > >
> > > > > I actually did a POC for this, but rejected it. The problem is,
> > > > > if
> > > > > there is a shadow stack overflow at that point then the kernel
> > > > > > > can't
> > > > > push the shadow stack token to the old stack. And shadow stack
> > > > > > > overflow
> > > > > is exactly the alt shadow stack use case. So it doesn't really
> > > > > > > solve
> > > > > the problem.
> > >
> > > the restore token in the alt shstk case does not regress anything
> > > but
> > > makes some use-cases work.
> > >
> > > alt shadow stack is important if code tries to jump in and out of
> > > signal handlers (dosemu does this with swapcontext) and for that a
> > > restore token is needed.
> > >
> > > alt shadow stack is important if the original shstk did not
> > > overflow
> > > but the signal handler would overflow it (small thread stack, huge
> > > sigaltstack case).
> > >
> > > alt shadow stack is also important for crash reporting on shstk
> > > overflow even if longjmp does not work then. longjmp to a
> > > makecontext
> > > stack would still work and longjmp back to the original stack can
> > > be
> > > made to mostly work by an altshstk option to overwrite the top
> > > entry
> > > with a restore token on overflow (this can break unwinding though).
> > >
>
> There was previously a request to create an alt shadow stack for the
> purpose of handling shadow stack overflow. So you are now suggesting to
> to exclude that and instead target a different use case for alt shadow
> stack?
>
> But I'm not sure how much we should change the ABI at this point since
> we are constrained by existing userspace. If you read the history, we
> may end up needing to deprecate the whole elf bit for this and other
> reasons.
>
> So should we struggle to find a way to grow the existing ABI without
> disturbing the existing userspace? Or should we start with something,
> finally, and see where we need to grow and maybe get a chance at a
> fresh start to grow it?
>
> Like, maybe 3 people will show up saying "hey, I *really* need to use
> shadow stack and longjmp from a ucontext stack", and no one says
> anything about shadow stack overflow. Then we know what to do. And
> maybe dosemu decides it doesn't need to implement shadow stack (highly
> likely I would think). Now that I think about it, AFAIU SS_AUTODISARM
> was created for dosemu, and the alt shadow stack patch adopted this
> behavior. So it's speculation that there is even a problem in that
> scenario.
>
> Or maybe people just enable WRSS for longjmp() and directly jump back
> to the setjmp() point. Do most people want fast setjmp/longjmp() at the
> cost of a little security?
>
> Even if, with enough discussion, we could optimize for all
> hypotheticals without real user feedback, I don't see how it helps
> users to hold shadow stack. So I think we should move forward with the
> current ABI.
>
>
H.J. Lu June 21, 2023, 11:05 p.m. UTC | #18
On Wed, Jun 21, 2023 at 3:23 PM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Wed, 2023-06-21 at 11:54 -0700, Rick Edgecombe wrote:
> > > > > > > > there is no magic, longjmp should be implemented as:
> > > > > > > >
> > > > > > > >         target_ssp = read from jmpbuf;
> > > > > > > >         current_ssp = read ssp;
> > > > > > > >         for (p = target_ssp; p != current_ssp; p--) {
> > > > > > > >                 if (*p == restore-token) {
> > > > > > > >                         // target_ssp is on a different
> > > > > > > > shstk.
> > > > > > > >                         switch_shstk_to(p);
> > > > > > > >                         break;
> > > > > > > >                 }
> > > > > > > >         }
> > > > > > > >         for (; p != target_ssp; p++)
> > > > > > > >                 // ssp is now on the same shstk as
> > > > > > > > target.
> > > > > > > >                 inc_ssp();
> > > > > > > >
> > > > > > > > this is what setcontext is doing and longjmp can do the
> > > > > > > > same:
> > > > > > > > for programs that always longjmp within the same shstk
> > > > > > > > the
> > > > > > > > first
> > > > > > > > loop is just p = current_ssp, but it also works when
> > > > > > > > longjmp
> > > > > > > > target is on a different shstk assuming nothing is
> > > > > > > > running
> > > > > > > > on
> > > > > > > > that shstk, which is only possible if there is a restore
> > > > > > > > token
> > > > > > > > on top.
> > > > > > > >
> > > > > > > > this implies if the kernel switches shstk on signal entry
> > > > > > > > it has
> > > > > > > > to add a restore-token on the switched away shstk.
>
> Wait a second, the claim is that the kernel should add a restore token
> on the current shadow stack before handling a signal, to allow to
> unwind from an alt shadow stack, right? But in this series there is not
> an alt shadow stack, so signal will be handled on the current shadow
> stack. If the user stays on the current shadow stack, the existing
> simple INCSSP based solution will work.
>
> If the user swapcontext()'s away while handling a signal (which *is*
> currently supported) they will leave their own restore token on the old
> stack. Hypothetically glibc could unwind back through a series of
> ucontext stacks by pivoting, if it kept some metadata somewhere about
> where to restore to. So there are actually already enough tokens to
> make it back in this case, glibc just doesn't do this.
>
> But how does the proposed token placed by the kernel on the original
> stack help this problem? The longjmp() would have to be able to find
> the location of the restore tokens somehow, which would not necessarily
> be near the setjmp() point. The signal token could even be on a
> different shadow stack.
>
> So I think the above is short of a design for a universally compatible
> longjmp().
>
> Which makes me think if we did want to make a more compatible longjmp()
> a better the way to do it might be an arch_prctl that emits a token at
> the current SSP. This would be loosening up the security somewhat (have
> to be an opt-in), but less so then enabling WRSS. But it would also be
> way simpler, work for all cases (I think), and be faster (maybe?) than
> INCSSPing through a bunch of stacks.

Since longjmp isn't required to be called after setjmp, leaving a restore
token doesn't work when longjmp isn't called.

> I'm also not sure leaving a token on signal doesn't weaken the security
> it it's own way as well. Any thread could then swap to that token.
> Where as the shadow stack signal frame ssp pointer can only be used
> from the shadow stack the signal was handled on.
>
> So I think, in addition to blocking the shadow stack overflow use case
> in the future, leaving a token behind on signal will not really help
> longjmp(). (or at least I'm not following)
>
Edgecombe, Rick P June 21, 2023, 11:15 p.m. UTC | #19
On Wed, 2023-06-21 at 16:05 -0700, H.J. Lu wrote:
> > Which makes me think if we did want to make a more compatible
> > longjmp()
> > a better the way to do it might be an arch_prctl that emits a token
> > at
> > the current SSP. This would be loosening up the security somewhat
> > (have
> > to be an opt-in), but less so then enabling WRSS. But it would also
> > be
> > way simpler, work for all cases (I think), and be faster (maybe?)
> > than
> > INCSSPing through a bunch of stacks.
> 
> Since longjmp isn't required to be called after setjmp, leaving a
> restore
> token doesn't work when longjmp isn't called.

Oh good point. Hmm.
Edgecombe, Rick P June 22, 2023, 1:07 a.m. UTC | #20
On Wed, 2023-06-21 at 16:15 -0700, Rick Edgecombe wrote:
> On Wed, 2023-06-21 at 16:05 -0700, H.J. Lu wrote:
> > > Which makes me think if we did want to make a more compatible
> > > longjmp()
> > > a better the way to do it might be an arch_prctl that emits a
> > > token
> > > at
> > > the current SSP. This would be loosening up the security somewhat
> > > (have
> > > to be an opt-in), but less so then enabling WRSS. But it would
> > > also
> > > be
> > > way simpler, work for all cases (I think), and be faster (maybe?)
> > > than
> > > INCSSPing through a bunch of stacks.
> > 
> > Since longjmp isn't required to be called after setjmp, leaving a
> > restore
> > token doesn't work when longjmp isn't called.
> 
> Oh good point. Hmm.

Just had a quick chat with HJ on this. It seems like it *might* be able
to made to work. How it would go is setjmp() could act as a wrapper by
calling it's own return address (the function that called setjmp()).
This would mean in the case of longjmp() not being called, control flow
would return through setjmp() before returning from the calling method.
This would allow libc to do a RSTORSSP when returning though setjmp() 
in the non-shadow stack case, and essentially skip over the kernel
placed restore token, and then return from setjmp() like normal. In the
case of longjmp() being called, it could RSTORSSP directly to the
token, and then return from setjmp().

Another option could be getting the compilers help to do the RSTORSSP
in the case of longjmp() not being called. Apparently compilers are
aware of setjmp() and already do special things around it (makes sense
I guess, but news to me).

And also, this all would actually work with IBT, because the compiler
knows already to add an endbr at that point right after setjmp().

I think neither of us were ready to bet on it, but thought maybe it
could work. And even if it works it's much more complicated than I
first thought, so I don't like it as much. It's also unclear what a
change like that would mean for security.

As for unwinding through the existing swapcontext() placed restore
tokens, the problem was as assumed - that it's difficult to find them.
Even considering brute force options like doing manual searches for a
nearby token to use turned up edge cases pretty quick. So I think that
kind of leaves us where we were originally, with no known solutions
that would require breaking kernel ABI changes.


Are you interested in helping get longjmp() from a ucontext stack
working for shadow stack? One other thing that came up in the
conversation was that while it is known that some apps are doing this,
there are no tests for mixing longjmp and ucontext in glibc. So we may
not know which combinations of mixing them together even work in the
non-shadow stack case.

It could be useful to add some tests for this to glibc and we could get
some clarity on what behaviors shadow stack would actually need to
support.
H.J. Lu June 22, 2023, 3:23 a.m. UTC | #21
On Wed, Jun 21, 2023 at 6:07 PM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Wed, 2023-06-21 at 16:15 -0700, Rick Edgecombe wrote:
> > On Wed, 2023-06-21 at 16:05 -0700, H.J. Lu wrote:
> > > > Which makes me think if we did want to make a more compatible
> > > > longjmp()
> > > > a better the way to do it might be an arch_prctl that emits a
> > > > token
> > > > at
> > > > the current SSP. This would be loosening up the security somewhat
> > > > (have
> > > > to be an opt-in), but less so then enabling WRSS. But it would
> > > > also
> > > > be
> > > > way simpler, work for all cases (I think), and be faster (maybe?)
> > > > than
> > > > INCSSPing through a bunch of stacks.
> > >
> > > Since longjmp isn't required to be called after setjmp, leaving a
> > > restore
> > > token doesn't work when longjmp isn't called.
> >
> > Oh good point. Hmm.
>
> Just had a quick chat with HJ on this. It seems like it *might* be able
> to made to work. How it would go is setjmp() could act as a wrapper by
> calling it's own return address (the function that called setjmp()).
> This would mean in the case of longjmp() not being called, control flow
> would return through setjmp() before returning from the calling method.

It may not work since we can't tell if RAX (return value) is set by longjmp
or function return.

> This would allow libc to do a RSTORSSP when returning though setjmp()
> in the non-shadow stack case, and essentially skip over the kernel
> placed restore token, and then return from setjmp() like normal. In the
> case of longjmp() being called, it could RSTORSSP directly to the
> token, and then return from setjmp().
>
> Another option could be getting the compilers help to do the RSTORSSP
> in the case of longjmp() not being called. Apparently compilers are
> aware of setjmp() and already do special things around it (makes sense
> I guess, but news to me).
>
> And also, this all would actually work with IBT, because the compiler
> knows already to add an endbr at that point right after setjmp().
>
> I think neither of us were ready to bet on it, but thought maybe it
> could work. And even if it works it's much more complicated than I
> first thought, so I don't like it as much. It's also unclear what a
> change like that would mean for security.
>
> As for unwinding through the existing swapcontext() placed restore
> tokens, the problem was as assumed - that it's difficult to find them.
> Even considering brute force options like doing manual searches for a
> nearby token to use turned up edge cases pretty quick. So I think that
> kind of leaves us where we were originally, with no known solutions
> that would require breaking kernel ABI changes.
>
>
> Are you interested in helping get longjmp() from a ucontext stack
> working for shadow stack? One other thing that came up in the
> conversation was that while it is known that some apps are doing this,
> there are no tests for mixing longjmp and ucontext in glibc. So we may
> not know which combinations of mixing them together even work in the
> non-shadow stack case.
>
> It could be useful to add some tests for this to glibc and we could get
> some clarity on what behaviors shadow stack would actually need to
> support.
Szabolcs Nagy June 22, 2023, 7:40 a.m. UTC | #22
The 06/21/2023 16:02, H.J. Lu wrote:
> On Wed, Jun 21, 2023 at 11:54 AM Edgecombe, Rick P
> <rick.p.edgecombe@intel.com> wrote:
> > HJ, can you link your patch that makes it extensible and we can clear
> > this up? Maybe the issue extends beyond fnon-call-exceptions, but that
> > is where I reproduced it.
> 
> Here is the patch:
> 
> https://gitlab.com/x86-gcc/gcc/-/commit/aab4c24b67b5f05b72e52a3eaae005c2277710b9

ok i don't see how this is related to fnon-call-exceptions..

but it is wrong if the shstk is ever discontinous even though the
shstk format would allow that. this was my original point in this
thread: with current unwinder code you cannot add altshstk later.
and no, introducing new binary markings and rebuild the world just
for altshstk is not OK. so we have to decide now if we want alt
shstk in the future or not, i gave reasons why we would want it.
Szabolcs Nagy June 22, 2023, 8:27 a.m. UTC | #23
The 06/21/2023 22:22, Edgecombe, Rick P wrote:
> On Wed, 2023-06-21 at 11:54 -0700, Rick Edgecombe wrote:
> > > > > > > > there is no magic, longjmp should be implemented as:
> > > > > > > > 
> > > > > > > >         target_ssp = read from jmpbuf;
> > > > > > > >         current_ssp = read ssp;
> > > > > > > >         for (p = target_ssp; p != current_ssp; p--) {
> > > > > > > >                 if (*p == restore-token) {
> > > > > > > >                         // target_ssp is on a different
> > > > > > > > shstk.
> > > > > > > >                         switch_shstk_to(p);
> > > > > > > >                         break;
> > > > > > > >                 }
> > > > > > > >         }
> > > > > > > >         for (; p != target_ssp; p++)
> > > > > > > >                 // ssp is now on the same shstk as
> > > > > > > > target.
> > > > > > > >                 inc_ssp();
> > > > > > > > 
> > > > > > > > this is what setcontext is doing and longjmp can do the
> > > > > > > > same:
> > > > > > > > for programs that always longjmp within the same shstk
> > > > > > > > the
> > > > > > > > first
> > > > > > > > loop is just p = current_ssp, but it also works when
> > > > > > > > longjmp
> > > > > > > > target is on a different shstk assuming nothing is
> > > > > > > > running
> > > > > > > > on
> > > > > > > > that shstk, which is only possible if there is a restore
> > > > > > > > token
> > > > > > > > on top.
> > > > > > > > 
> > > > > > > > this implies if the kernel switches shstk on signal entry
> > > > > > > > it has
> > > > > > > > to add a restore-token on the switched away shstk.
> 
> Wait a second, the claim is that the kernel should add a restore token
> on the current shadow stack before handling a signal, to allow to
> unwind from an alt shadow stack, right? But in this series there is not
> an alt shadow stack, so signal will be handled on the current shadow
> stack. If the user stays on the current shadow stack, the existing
> simple INCSSP based solution will work.

yes.

> If the user swapcontext()'s away while handling a signal (which *is*
> currently supported) they will leave their own restore token on the old
> stack. Hypothetically glibc could unwind back through a series of
> ucontext stacks by pivoting, if it kept some metadata somewhere about
> where to restore to. So there are actually already enough tokens to
> make it back in this case, glibc just doesn't do this.

swapcontext is currently *not* supported: for it to work you have to
be able to jump *back* into the signal handler, which does not work if
the swapcontext target is on the original thread stack (instead of
say a makecontext stack).

jumping back can only be supported if alt stack can be paired with
an alt shadow stack.

unwinding across a series of signal interrupts should work even
with discontinous shstk. libgcc does not implement this which is
a problem i think.

> But how does the proposed token placed by the kernel on the original
> stack help this problem? The longjmp() would have to be able to find
> the location of the restore tokens somehow, which would not necessarily
> be near the setjmp() point. The signal token could even be on a
> different shadow stack.

i posted the exact longjmp code and it takes care of this case.

setjmp does not need to do anything special.

the invariant is that an shstk is either capped by a restore token
or in use by some executing task. this is guaranteed architecturally
(when shstk is switched with an instruction) and should be guaranteed
by the kernel too (when shstk is switched by the kernel).

> I'm also not sure leaving a token on signal doesn't weaken the security
> it it's own way as well. Any thread could then swap to that token.
> Where as the shadow stack signal frame ssp pointer can only be used
> from the shadow stack the signal was handled on.

as far as i'm concerned it is a valid programming model to switch
to a stack that is currently not in use and we should always allow
that. (signal handled on an alt stack may not return)

> So I think, in addition to blocking the shadow stack overflow use case
> in the future, leaving a token behind on signal will not really help
> longjmp(). (or at least I'm not following)

the restore token must only be added if shstk is switched
(currently it is not switched so don't add it, however if
we agree on this then the unwinder can be fixed accordingly.)
Szabolcs Nagy June 22, 2023, 9:18 a.m. UTC | #24
The 06/21/2023 18:54, Edgecombe, Rick P wrote:
> On Wed, 2023-06-21 at 12:36 +0100, szabolcs.nagy@arm.com wrote:
> > > The 06/20/2023 19:34, Edgecombe, Rick P wrote:
> > > > > I actually did a POC for this, but rejected it. The problem is,
> > > > > if
> > > > > there is a shadow stack overflow at that point then the kernel
> > > > > > > can't
> > > > > push the shadow stack token to the old stack. And shadow stack
> > > > > > > overflow
> > > > > is exactly the alt shadow stack use case. So it doesn't really
> > > > > > > solve
> > > > > the problem.
> > > 
> > > the restore token in the alt shstk case does not regress anything
> > > but
> > > makes some use-cases work.
> > > 
> > > alt shadow stack is important if code tries to jump in and out of
> > > signal handlers (dosemu does this with swapcontext) and for that a
> > > restore token is needed.
> > > 
> > > alt shadow stack is important if the original shstk did not
> > > overflow
> > > but the signal handler would overflow it (small thread stack, huge
> > > sigaltstack case).
> > > 
> > > alt shadow stack is also important for crash reporting on shstk
> > > overflow even if longjmp does not work then. longjmp to a
> > > makecontext
> > > stack would still work and longjmp back to the original stack can
> > > be
> > > made to mostly work by an altshstk option to overwrite the top
> > > entry
> > > with a restore token on overflow (this can break unwinding though).
> > > 
> 
> There was previously a request to create an alt shadow stack for the
> purpose of handling shadow stack overflow. So you are now suggesting to
> to exclude that and instead target a different use case for alt shadow
> stack?

that is not what i said.

> But I'm not sure how much we should change the ABI at this point since
> we are constrained by existing userspace. If you read the history, we
> may end up needing to deprecate the whole elf bit for this and other
> reasons.

i'm not against deprecating the elf bit, but i think binary
marking will be difficult for this kind of feature no matter what
(code may be incompatible for complex runtime dependent reasons).

> So should we struggle to find a way to grow the existing ABI without
> disturbing the existing userspace? Or should we start with something,
> finally, and see where we need to grow and maybe get a chance at a
> fresh start to grow it?
> 
> Like, maybe 3 people will show up saying "hey, I *really* need to use
> shadow stack and longjmp from a ucontext stack", and no one says
> anything about shadow stack overflow. Then we know what to do. And
> maybe dosemu decides it doesn't need to implement shadow stack (highly
> likely I would think). Now that I think about it, AFAIU SS_AUTODISARM
> was created for dosemu, and the alt shadow stack patch adopted this
> behavior. So it's speculation that there is even a problem in that
> scenario.
> 
> Or maybe people just enable WRSS for longjmp() and directly jump back
> to the setjmp() point. Do most people want fast setjmp/longjmp() at the
> cost of a little security?
> 
> Even if, with enough discussion, we could optimize for all
> hypotheticals without real user feedback, I don't see how it helps
> users to hold shadow stack. So I think we should move forward with the
> current ABI.

you may not get a second chance to fix a security feature.
it will be just disabled if it causes problems.
Andy Lutomirski June 22, 2023, 3:26 p.m. UTC | #25
On Thu, Jun 22, 2023 at 2:28 AM szabolcs.nagy@arm.com
<szabolcs.nagy@arm.com> wrote:
>
> The 06/21/2023 18:54, Edgecombe, Rick P wrote:
> > On Wed, 2023-06-21 at 12:36 +0100, szabolcs.nagy@arm.com wrote:
> > > > The 06/20/2023 19:34, Edgecombe, Rick P wrote:
> > > > > > I actually did a POC for this, but rejected it. The problem is,
> > > > > > if
> > > > > > there is a shadow stack overflow at that point then the kernel
> > > > > > > > can't
> > > > > > push the shadow stack token to the old stack. And shadow stack
> > > > > > > > overflow
> > > > > > is exactly the alt shadow stack use case. So it doesn't really
> > > > > > > > solve
> > > > > > the problem.
> > > >
> > > > the restore token in the alt shstk case does not regress anything
> > > > but
> > > > makes some use-cases work.
> > > >
> > > > alt shadow stack is important if code tries to jump in and out of
> > > > signal handlers (dosemu does this with swapcontext) and for that a
> > > > restore token is needed.
> > > >
> > > > alt shadow stack is important if the original shstk did not
> > > > overflow
> > > > but the signal handler would overflow it (small thread stack, huge
> > > > sigaltstack case).
> > > >
> > > > alt shadow stack is also important for crash reporting on shstk
> > > > overflow even if longjmp does not work then. longjmp to a
> > > > makecontext
> > > > stack would still work and longjmp back to the original stack can
> > > > be
> > > > made to mostly work by an altshstk option to overwrite the top
> > > > entry
> > > > with a restore token on overflow (this can break unwinding though).
> > > >
> >
> > There was previously a request to create an alt shadow stack for the
> > purpose of handling shadow stack overflow. So you are now suggesting to
> > to exclude that and instead target a different use case for alt shadow
> > stack?
>
> that is not what i said.
>
> > But I'm not sure how much we should change the ABI at this point since
> > we are constrained by existing userspace. If you read the history, we
> > may end up needing to deprecate the whole elf bit for this and other
> > reasons.
>
> i'm not against deprecating the elf bit, but i think binary
> marking will be difficult for this kind of feature no matter what
> (code may be incompatible for complex runtime dependent reasons).
>
> > So should we struggle to find a way to grow the existing ABI without
> > disturbing the existing userspace? Or should we start with something,
> > finally, and see where we need to grow and maybe get a chance at a
> > fresh start to grow it?
> >
> > Like, maybe 3 people will show up saying "hey, I *really* need to use
> > shadow stack and longjmp from a ucontext stack", and no one says
> > anything about shadow stack overflow. Then we know what to do. And
> > maybe dosemu decides it doesn't need to implement shadow stack (highly
> > likely I would think). Now that I think about it, AFAIU SS_AUTODISARM
> > was created for dosemu, and the alt shadow stack patch adopted this
> > behavior. So it's speculation that there is even a problem in that
> > scenario.
> >
> > Or maybe people just enable WRSS for longjmp() and directly jump back
> > to the setjmp() point. Do most people want fast setjmp/longjmp() at the
> > cost of a little security?
> >
> > Even if, with enough discussion, we could optimize for all
> > hypotheticals without real user feedback, I don't see how it helps
> > users to hold shadow stack. So I think we should move forward with the
> > current ABI.
>
> you may not get a second chance to fix a security feature.
> it will be just disabled if it causes problems.

*I* would use altshadowstack.

I run a production system (that cares about correctness *and*
performance, but that's not really relevant here -- SHSTK ought to be
fast).  And, if it crashes, I want to know why.  So I handle SIGSEGV,
etc so I have good logs if it crashes.  And I want those same logs if
I overflow the stack.

That being said, I have no need for longjmp or siglongjmp for this.  I
use exit(2) to escape.

For what it's worth, setjmp/longjmp is a bad API.  The actual pattern
that ought to work well (and that could be supported well by fancy
compilers and non-C languages, as I understand it) is more like a
function call that has two ways out.  Like this (pseudo-C):

void function(struct better_jmp_buf &buf, args...)
{
   ...
       if (condition)
          better_long_jump(buf);  // long jumps out!
       // could also pass buf to another function
   ...
       // could also return normally
}

better_call_with_jmp_buf(function, args);

*This* could support altshadowstack just fine.  And many users might
be okay with the understanding that, if altshadowstack is on, you have
to use a better long jump to get out (or a normal sigreturn or _exit).
No one is getting an altshadowstack signal handler without code
changes.

siglongjmp() could support altshadowstack with help from the kernel,
but we probably don't want to go there.

--Andy
Szabolcs Nagy June 22, 2023, 4:42 p.m. UTC | #26
The 06/22/2023 08:26, Andy Lutomirski wrote:
> On Thu, Jun 22, 2023 at 2:28 AM szabolcs.nagy@arm.com
> <szabolcs.nagy@arm.com> wrote:
> >
> > The 06/21/2023 18:54, Edgecombe, Rick P wrote:
> > > On Wed, 2023-06-21 at 12:36 +0100, szabolcs.nagy@arm.com wrote:
> > > > > The 06/20/2023 19:34, Edgecombe, Rick P wrote:
> > > > > > > I actually did a POC for this, but rejected it. The problem is,
> > > > > > > if
> > > > > > > there is a shadow stack overflow at that point then the kernel
> > > > > > > > > can't
> > > > > > > push the shadow stack token to the old stack. And shadow stack
> > > > > > > > > overflow
> > > > > > > is exactly the alt shadow stack use case. So it doesn't really
> > > > > > > > > solve
> > > > > > > the problem.
> > > > >
> > > > > the restore token in the alt shstk case does not regress anything
> > > > > but
> > > > > makes some use-cases work.
> > > > >
> > > > > alt shadow stack is important if code tries to jump in and out of
> > > > > signal handlers (dosemu does this with swapcontext) and for that a
> > > > > restore token is needed.
> > > > >
> > > > > alt shadow stack is important if the original shstk did not
> > > > > overflow
> > > > > but the signal handler would overflow it (small thread stack, huge
> > > > > sigaltstack case).
> > > > >
> > > > > alt shadow stack is also important for crash reporting on shstk
> > > > > overflow even if longjmp does not work then. longjmp to a
> > > > > makecontext
> > > > > stack would still work and longjmp back to the original stack can
> > > > > be
> > > > > made to mostly work by an altshstk option to overwrite the top
> > > > > entry
> > > > > with a restore token on overflow (this can break unwinding though).
> > > > >
> > >
> > > There was previously a request to create an alt shadow stack for the
> > > purpose of handling shadow stack overflow. So you are now suggesting to
> > > to exclude that and instead target a different use case for alt shadow
> > > stack?
> >
> > that is not what i said.
> >
> > > But I'm not sure how much we should change the ABI at this point since
> > > we are constrained by existing userspace. If you read the history, we
> > > may end up needing to deprecate the whole elf bit for this and other
> > > reasons.
> >
> > i'm not against deprecating the elf bit, but i think binary
> > marking will be difficult for this kind of feature no matter what
> > (code may be incompatible for complex runtime dependent reasons).
> >
> > > So should we struggle to find a way to grow the existing ABI without
> > > disturbing the existing userspace? Or should we start with something,
> > > finally, and see where we need to grow and maybe get a chance at a
> > > fresh start to grow it?
> > >
> > > Like, maybe 3 people will show up saying "hey, I *really* need to use
> > > shadow stack and longjmp from a ucontext stack", and no one says
> > > anything about shadow stack overflow. Then we know what to do. And
> > > maybe dosemu decides it doesn't need to implement shadow stack (highly
> > > likely I would think). Now that I think about it, AFAIU SS_AUTODISARM
> > > was created for dosemu, and the alt shadow stack patch adopted this
> > > behavior. So it's speculation that there is even a problem in that
> > > scenario.
> > >
> > > Or maybe people just enable WRSS for longjmp() and directly jump back
> > > to the setjmp() point. Do most people want fast setjmp/longjmp() at the
> > > cost of a little security?
> > >
> > > Even if, with enough discussion, we could optimize for all
> > > hypotheticals without real user feedback, I don't see how it helps
> > > users to hold shadow stack. So I think we should move forward with the
> > > current ABI.
> >
> > you may not get a second chance to fix a security feature.
> > it will be just disabled if it causes problems.
> 
> *I* would use altshadowstack.
> 
> I run a production system (that cares about correctness *and*
> performance, but that's not really relevant here -- SHSTK ought to be
> fast).  And, if it crashes, I want to know why.  So I handle SIGSEGV,
> etc so I have good logs if it crashes.  And I want those same logs if
> I overflow the stack.
> 
> That being said, I have no need for longjmp or siglongjmp for this.  I
> use exit(2) to escape.

the same crash handler that prints a log on shstk overflow should
work when a different cause of SIGSEGV is recoverable via longjmp.
to me this means that alt shstk must work with longjmp at least in
the non-shstk overflow case (which can be declared non-recoverable).

> For what it's worth, setjmp/longjmp is a bad API.  The actual pattern
> that ought to work well (and that could be supported well by fancy
> compilers and non-C languages, as I understand it) is more like a
> function call that has two ways out.  Like this (pseudo-C):
> 
> void function(struct better_jmp_buf &buf, args...)
> {
>    ...
>        if (condition)
>           better_long_jump(buf);  // long jumps out!
>        // could also pass buf to another function
>    ...
>        // could also return normally
> }
> 
> better_call_with_jmp_buf(function, args);
> 
> *This* could support altshadowstack just fine.  And many users might
> be okay with the understanding that, if altshadowstack is on, you have
> to use a better long jump to get out (or a normal sigreturn or _exit).

i don't understand why this would work fine when longjmp does not.
how does the shstk switch happen?

> No one is getting an altshadowstack signal handler without code
> changes.

assuming the same component is doing the alt shstk setup as the
longjmp.

> siglongjmp() could support altshadowstack with help from the kernel,
> but we probably don't want to go there.

what kind of help? maybe we need that help..

e.g. if the signal frame token is detected by longjmp on
the shstk then doing an rt_sigreturn with the right signal
frame context allows longjmp to continue unwinding the shstk.
however kernel sigcontext layout can change so userspace may
not know it so longjmp needs a helper, but only in the jump
across signal frame case.

(this is a different design than what i proposed earlier,
it also makes longjmp from alt shstk work without wrss,
the downside is that longjmp across makecontext needs a
separate solution then which implies that all shstk needs
a detectable token at the end of the shstk.. so again
something that we have to get right now and cannot add
later.)
Edgecombe, Rick P June 22, 2023, 4:46 p.m. UTC | #27
On Thu, 2023-06-22 at 08:40 +0100, szabolcs.nagy@arm.com wrote:
> The 06/21/2023 16:02, H.J. Lu wrote:
> > On Wed, Jun 21, 2023 at 11:54 AM Edgecombe, Rick P
> > <rick.p.edgecombe@intel.com> wrote:
> > > HJ, can you link your patch that makes it extensible and we can
> > > clear
> > > this up? Maybe the issue extends beyond fnon-call-exceptions, but
> > > that
> > > is where I reproduced it.
> > 
> > Here is the patch:
> > 
> > https://gitlab.com/x86-gcc/gcc/-/commit/aab4c24b67b5f05b72e52a3eaae005c2277710b9
> 
> ok i don't see how this is related to fnon-call-exceptions..

I don't know what to tell you. I'm not a compiler expert, but a simple
fnon-call-exceptions test was how I reproduced it. If there is some
other use case for throwing an exception out of a signal handler, I
don't see how it makes fnon-call-exceptions unrelated.

> 
> but it is wrong if the shstk is ever discontinous even though the
> shstk format would allow that. this was my original point in this
> thread: with current unwinder code you cannot add altshstk later.
> and no, introducing new binary markings and rebuild the world just
> for altshstk is not OK. so we have to decide now if we want alt
> shstk in the future or not, i gave reasons why we would want it.

The point was that the old GCCs restrict expanding the shadow stack
signal frame, which is a prerequisite to supporting alt shadow stacks.

So the existing userspace prevents us from supporting regular alt
shadow stack, before we even get to fancy unwinding alt shadow stack.
Some kind of ABI opt in will likely be required. Maybe a new elf bit,
at which point we can take advantage of what we learned.

You previously said:

On Wed, 2023-06-21 at 12:36 +0100, szabolcs.nagy@arm.com wrote:
> as far as i can tell the current unwinder handles shstk unwinding
> correctly across signal handlers (sync or async and
> cleanup/exceptions
> handlers too), i see no issue with "fixed shadow stack signal frame
> size of 8 bytes" other than future extensions and discontinous shstk.

I took that to mean that you didn't see how the the existing unwinder
prevented alt shadow stacks. Hopefully we're all on the same page now. 

BTW, when alt shadow stack's were POCed, I hadn't encountered this GCC
behavior yet. So it was assumed it could be bolted on later without
disturbing anything. If Linus or someone wants to say we're ok with
breaking these old GCCs in this way, the first thing I would do would
be to pad the shadow stack signal frame with room for alt shadow stack
and more. I actually have a patch to do this, but alas we are already
pushing it regression wise.
Edgecombe, Rick P June 22, 2023, 4:47 p.m. UTC | #28
On Thu, 2023-06-22 at 09:27 +0100, szabolcs.nagy@arm.com wrote:


[ snip ]

> swapcontext is currently *not* supported: for it to work you have to
> be able to jump *back* into the signal handler, which does not work
> if
> the swapcontext target is on the original thread stack (instead of
> say a makecontext stack).
> 
> jumping back can only be supported if alt stack can be paired with
> an alt shadow stack.
> 
> unwinding across a series of signal interrupts should work even
> with discontinous shstk. libgcc does not implement this which is
> a problem i think.

I would be helpful if you could enumerate the cases you think are
important to support. I would like to see how we could support them in
the future in some mode.

> 
> > But how does the proposed token placed by the kernel on the
> > original
> > stack help this problem? The longjmp() would have to be able to
> > find
> > the location of the restore tokens somehow, which would not
> > necessarily
> > be near the setjmp() point. The signal token could even be on a
> > different shadow stack.
> 
> i posted the exact longjmp code and it takes care of this case.

I see how it works for the simple case of longjmp() from an alt shadow
stack. I would still prefer a different solution that works with the
overflow case. (probably new kernel functionality)

But I don't see how it works for unwinding off of a ucontext stack. Or
unwinding off of an ucontext stack that was swapped to from an alt
shadow stack.

> 
> setjmp does not need to do anything special.
> 
> the invariant is that an shstk is either capped by a restore token
> or in use by some executing task. this is guaranteed architecturally
> (when shstk is switched with an instruction) and should be guaranteed
> by the kernel too (when shstk is switched by the kernel).
> 
> > I'm also not sure leaving a token on signal doesn't weaken the
> > security
> > it it's own way as well. Any thread could then swap to that token.
> > Where as the shadow stack signal frame ssp pointer can only be used
> > from the shadow stack the signal was handled on.
> 
> as far as i'm concerned it is a valid programming model to switch
> to a stack that is currently not in use and we should always allow
> that. (signal handled on an alt stack may not return)

Some people just want shadow stack for like, a super stack canary, and
want everything to "just work". Some people want to lock things down as
much as possible and change their code to do it if needed.

It sure is a challenge to figure out where the happy medium is. Ideally
there would be several modes so all the users could be happy, but I
wouldn't want to start with that for multiple reasons. Although we do
have WRSS today, which can support pretty much everything
functionality-wise.

But in any case, we have limited room for movement. I actually had some
other ABI tweaks fully ready to post around the tracing usages, but I
just thought given the situation, it was better to start with what we
have. This project could really use a time machine...

> 
> > So I think, in addition to blocking the shadow stack overflow use
> > case
> > in the future, leaving a token behind on signal will not really
> > help
> > longjmp(). (or at least I'm not following)
> 
> the restore token must only be added if shstk is switched
> (currently it is not switched so don't add it, however if
> we agree on this then the unwinder can be fixed accordingly.)

I do agree that a token should not be added when the stack is not
switched, as today. But I don't agree on this solution for alt shadow
stacks. Again, I actually built that exact design in actual code, so
it's not a NIH thing. It just doesn't work for valid use cases.
Edgecombe, Rick P June 22, 2023, 11:18 p.m. UTC | #29
On Thu, 2023-06-22 at 17:42 +0100, szabolcs.nagy@arm.com wrote:
> the downside is that longjmp across makecontext needs a
> separate solution then which implies that all shstk needs
> a detectable token at the end of the shstk.. so again
> something that we have to get right now and cannot add
> later.)

This sounds like some scheme to search for a token on another stack,
which if so, you haven't elaborated on.

I'm not going to be able to contribute on this thread much over the
next week, but if you think you know to solve problems which have
remained unsolved for years, please spell out the solutions.

I'd also appreciate if you could spell out exactly which:
 - ucontext
 - signal
 - longjmp
 - custom library stack switching

patterns you think shadow stack should support working together.
Because even after all these mails, I'm still not sure exactly what you
are trying to achieve.
Szabolcs Nagy June 23, 2023, 4:25 p.m. UTC | #30
The 06/22/2023 16:47, Edgecombe, Rick P wrote:
> On Thu, 2023-06-22 at 09:27 +0100, szabolcs.nagy@arm.com wrote:
> 
> 
> [ snip ]
> 
> > swapcontext is currently *not* supported: for it to work you have to
> > be able to jump *back* into the signal handler, which does not work
> > if
> > the swapcontext target is on the original thread stack (instead of
> > say a makecontext stack).
> > 
> > jumping back can only be supported if alt stack can be paired with
> > an alt shadow stack.
> > 
> > unwinding across a series of signal interrupts should work even
> > with discontinous shstk. libgcc does not implement this which is
> > a problem i think.
> 
> I would be helpful if you could enumerate the cases you think are
> important to support. I would like to see how we could support them in
> the future in some mode.
> 
> > 
> > > But how does the proposed token placed by the kernel on the
> > > original
> > > stack help this problem? The longjmp() would have to be able to
> > > find
> > > the location of the restore tokens somehow, which would not
> > > necessarily
> > > be near the setjmp() point. The signal token could even be on a
> > > different shadow stack.
> > 
> > i posted the exact longjmp code and it takes care of this case.
> 
> I see how it works for the simple case of longjmp() from an alt shadow
> stack. I would still prefer a different solution that works with the
> overflow case. (probably new kernel functionality)
> 
> But I don't see how it works for unwinding off of a ucontext stack. Or
> unwinding off of an ucontext stack that was swapped to from an alt
> shadow stack.

why?

a stack can be active or inactive (task executing on it or not),
and if inactive it can be live or dead (has stack frames that can
be jumped to or not).

this is independent of shadow stacks: longjmp is only valid if the
target is either the same active stack or an inactive live stack.
(there are cases that may seem to work, but fundamentally broken
and not supportable: e.g. two tasks executing on the same stack
where the one above happens to not clobber frames deep enough to
collide with the task below.)

the proposed longjmp design works for both cases. no assumption is
made about ucontext or signals other than the shadow stack for an
inactive live stack ends in a restore token, which is guaranteed by
the isa so we only need the kernel to do the same when it switches
shadow stacks. then longjmp works by construction.

the only wart is that an overflowed shadow stack is inactive dead
instead of inactive live because the token cannot be added. (note
that handling shstk overflow and avoiding shstk overflow during
signal handling still works with alt shstk!)

an alternative solution is to allow jump to inactive dead stack
if that's created by a signal interrupt. for that a syscall is
needed and longjmp has to detect if the target stack is dead or
live. (the kernel also has to be able to tell if switching to the
dead stack is valid for security reasons.) i don't know if this
is doable (if we allow some hacks it's doable).

unwinding across signal handlers is just a matter of having
enough information at the signal frame to continue, it does
not have to follow crazy jumps or weird coroutine things:
that does not work without shadow stacks either. but unwind
across alt stack frame should work.
Edgecombe, Rick P June 25, 2023, 6:48 p.m. UTC | #31
On Fri, 2023-06-23 at 17:25 +0100, szabolcs.nagy@arm.com wrote:
> why?
> 
> a stack can be active or inactive (task executing on it or not),
> and if inactive it can be live or dead (has stack frames that can
> be jumped to or not).
> 
> this is independent of shadow stacks: longjmp is only valid if the
> target is either the same active stack or an inactive live stack.
> (there are cases that may seem to work, but fundamentally broken
> and not supportable: e.g. two tasks executing on the same stack
> where the one above happens to not clobber frames deep enough to
> collide with the task below.)
> 
> the proposed longjmp design works for both cases. no assumption is
> made about ucontext or signals other than the shadow stack for an
> inactive live stack ends in a restore token, 

One of the problems for the case of longjmp() from another another
stack, is how to find the old stack's token. HJ and I had previously
discussed searching for the token from the target SSP forward, but the
problems are it is 1, not gauranteed to be there and 2, pretty awkward
and potentially slow.

> which is guaranteed by
> the isa so we only need the kernel to do the same when it switches
> shadow stacks. then longjmp works by construction.

No it's not.

> 
> the only wart is that an overflowed shadow stack is inactive dead
> instead of inactive live because the token cannot be added. (note
> that handling shstk overflow and avoiding shstk overflow during
> signal handling still works with alt shstk!)

I thought we were on the same page as far as pushing a restore token on
signal not being robust against shadow stack overflow, so is this a new
idea? Usually people around here say "code talks", but all I'm asking
for is a full explanation of what you are trying to accomplish and what
the idea is. Otherwise this is asking to hold up this feature based on
hand waving.

Could you answer the questions here, along with a full description of
your proposal:
https://lore.kernel.org/all/1cd67ae45fc379fd82d2745190e4caf74e67499e.camel@intel.com/

It we are talking past each other, it could help to do a levelset at
this point.

> 
> an alternative solution is to allow jump to inactive dead stack
> if that's created by a signal interrupt. for that a syscall is
> needed and longjmp has to detect if the target stack is dead or
> live. (the kernel also has to be able to tell if switching to the
> dead stack is valid for security reasons.) i don't know if this
> is doable (if we allow some hacks it's doable).
> 
> unwinding across signal handlers is just a matter of having
> enough information at the signal frame to continue, it does
> not have to follow crazy jumps or weird coroutine things:
> that does not work without shadow stacks either. but unwind
> across alt stack frame should work.

Even if there is some workable idea, there is already a bunch of 
userspace built around the existing solution, and users waiting for it.

In addition the whole discussion around alt shadow stack cases will
require alt shadow stacks to be implemented and we might be constrained
there anyway.

If we want to take learnings and do something new, let's build it
around a new elf bit. This current kernel ABI is to support the old elf
bit userspace stack, which has been solidified by the existing upstream
userspace.

I had thought we should start from scratch now and proposed a patch to
block the old elf bit to force this, but lost that battle with other
glibc developers.

Speaking of which, I don't see any enthusiasm from any other glibc
developers that have been involved in this previously. Who were you
thinking was going to implement any of this on the glibc side? Have you
made any progress in getting any of them onboard with this order of
operations in the months since you first brought up changing the
design?
Andy Lutomirski June 25, 2023, 11:52 p.m. UTC | #32
On Thu, Jun 22, 2023 at 9:43 AM szabolcs.nagy@arm.com
<szabolcs.nagy@arm.com> wrote:
>
> The 06/22/2023 08:26, Andy Lutomirski wrote:
> > On Thu, Jun 22, 2023 at 2:28 AM szabolcs.nagy@arm.com
> > <szabolcs.nagy@arm.com> wrote:
> > >
> > > The 06/21/2023 18:54, Edgecombe, Rick P wrote:
> > > > On Wed, 2023-06-21 at 12:36 +0100, szabolcs.nagy@arm.com wrote:
> > > > > > The 06/20/2023 19:34, Edgecombe, Rick P wrote:
> > > > > > > > I actually did a POC for this, but rejected it. The problem is,
> > > > > > > > if
> > > > > > > > there is a shadow stack overflow at that point then the kernel
> > > > > > > > > > can't
> > > > > > > > push the shadow stack token to the old stack. And shadow stack
> > > > > > > > > > overflow
> > > > > > > > is exactly the alt shadow stack use case. So it doesn't really
> > > > > > > > > > solve
> > > > > > > > the problem.
> > > > > >
> > > > > > the restore token in the alt shstk case does not regress anything
> > > > > > but
> > > > > > makes some use-cases work.
> > > > > >
> > > > > > alt shadow stack is important if code tries to jump in and out of
> > > > > > signal handlers (dosemu does this with swapcontext) and for that a
> > > > > > restore token is needed.
> > > > > >
> > > > > > alt shadow stack is important if the original shstk did not
> > > > > > overflow
> > > > > > but the signal handler would overflow it (small thread stack, huge
> > > > > > sigaltstack case).
> > > > > >
> > > > > > alt shadow stack is also important for crash reporting on shstk
> > > > > > overflow even if longjmp does not work then. longjmp to a
> > > > > > makecontext
> > > > > > stack would still work and longjmp back to the original stack can
> > > > > > be
> > > > > > made to mostly work by an altshstk option to overwrite the top
> > > > > > entry
> > > > > > with a restore token on overflow (this can break unwinding though).
> > > > > >
> > > >
> > > > There was previously a request to create an alt shadow stack for the
> > > > purpose of handling shadow stack overflow. So you are now suggesting to
> > > > to exclude that and instead target a different use case for alt shadow
> > > > stack?
> > >
> > > that is not what i said.
> > >
> > > > But I'm not sure how much we should change the ABI at this point since
> > > > we are constrained by existing userspace. If you read the history, we
> > > > may end up needing to deprecate the whole elf bit for this and other
> > > > reasons.
> > >
> > > i'm not against deprecating the elf bit, but i think binary
> > > marking will be difficult for this kind of feature no matter what
> > > (code may be incompatible for complex runtime dependent reasons).
> > >
> > > > So should we struggle to find a way to grow the existing ABI without
> > > > disturbing the existing userspace? Or should we start with something,
> > > > finally, and see where we need to grow and maybe get a chance at a
> > > > fresh start to grow it?
> > > >
> > > > Like, maybe 3 people will show up saying "hey, I *really* need to use
> > > > shadow stack and longjmp from a ucontext stack", and no one says
> > > > anything about shadow stack overflow. Then we know what to do. And
> > > > maybe dosemu decides it doesn't need to implement shadow stack (highly
> > > > likely I would think). Now that I think about it, AFAIU SS_AUTODISARM
> > > > was created for dosemu, and the alt shadow stack patch adopted this
> > > > behavior. So it's speculation that there is even a problem in that
> > > > scenario.
> > > >
> > > > Or maybe people just enable WRSS for longjmp() and directly jump back
> > > > to the setjmp() point. Do most people want fast setjmp/longjmp() at the
> > > > cost of a little security?
> > > >
> > > > Even if, with enough discussion, we could optimize for all
> > > > hypotheticals without real user feedback, I don't see how it helps
> > > > users to hold shadow stack. So I think we should move forward with the
> > > > current ABI.
> > >
> > > you may not get a second chance to fix a security feature.
> > > it will be just disabled if it causes problems.
> >
> > *I* would use altshadowstack.
> >
> > I run a production system (that cares about correctness *and*
> > performance, but that's not really relevant here -- SHSTK ought to be
> > fast).  And, if it crashes, I want to know why.  So I handle SIGSEGV,
> > etc so I have good logs if it crashes.  And I want those same logs if
> > I overflow the stack.
> >
> > That being said, I have no need for longjmp or siglongjmp for this.  I
> > use exit(2) to escape.
>
> the same crash handler that prints a log on shstk overflow should
> work when a different cause of SIGSEGV is recoverable via longjmp.
> to me this means that alt shstk must work with longjmp at least in
> the non-shstk overflow case (which can be declared non-recoverable).

Sure, but how many SIGSEGV handlers would use altshadowstack and
*also, in the same handler* ever resume?  Not mine.  Obviously I'm
only one sample.

>
> > For what it's worth, setjmp/longjmp is a bad API.  The actual pattern
> > that ought to work well (and that could be supported well by fancy
> > compilers and non-C languages, as I understand it) is more like a
> > function call that has two ways out.  Like this (pseudo-C):
> >
> > void function(struct better_jmp_buf &buf, args...)
> > {
> >    ...
> >        if (condition)
> >           better_long_jump(buf);  // long jumps out!
> >        // could also pass buf to another function
> >    ...
> >        // could also return normally
> > }
> >
> > better_call_with_jmp_buf(function, args);
> >
> > *This* could support altshadowstack just fine.  And many users might
> > be okay with the understanding that, if altshadowstack is on, you have
> > to use a better long jump to get out (or a normal sigreturn or _exit).
>
> i don't understand why this would work fine when longjmp does not.
> how does the shstk switch happen?

Ugh, I think this may have some issues given how the ISA works.  Sigh.

I was imagining that better_call_with_jmp_buf would push a restore
token on the shadow stack, then call the passed-in function, then, on
a successful return, INCSSP over the token and continue on.
better_long_jump() would RSTORSSP to the saved token.

But I'm not sure how to write the token without WRUSS.

What *could* be done, which would be nasty and
sigaltshadowstack-specific, is to have a jump out of a signal handler
provide a pointer to the signal frame (siginfo_t or ucontext pointer),
and the kernel would assist it in switching the shadow stack back.
Eww.

--Andy

>
> > No one is getting an altshadowstack signal handler without code
> > changes.
>
> assuming the same component is doing the alt shstk setup as the
> longjmp.
>
> > siglongjmp() could support altshadowstack with help from the kernel,
> > but we probably don't want to go there.
>
> what kind of help? maybe we need that help..
>
> e.g. if the signal frame token is detected by longjmp on
> the shstk then doing an rt_sigreturn with the right signal
> frame context allows longjmp to continue unwinding the shstk.
> however kernel sigcontext layout can change so userspace may
> not know it so longjmp needs a helper, but only in the jump
> across signal frame case.
>
> (this is a different design than what i proposed earlier,
> it also makes longjmp from alt shstk work without wrss,
> the downside is that longjmp across makecontext needs a
> separate solution then which implies that all shstk needs
> a detectable token at the end of the shstk.. so again
> something that we have to get right now and cannot add
> later.)
Szabolcs Nagy June 26, 2023, 2:08 p.m. UTC | #33
The 06/22/2023 16:46, Edgecombe, Rick P wrote:
> You previously said:
> 
> On Wed, 2023-06-21 at 12:36 +0100, szabolcs.nagy@arm.com wrote:
> > as far as i can tell the current unwinder handles shstk unwinding
> > correctly across signal handlers (sync or async and
> > cleanup/exceptions
> > handlers too), i see no issue with "fixed shadow stack signal frame
> > size of 8 bytes" other than future extensions and discontinous shstk.
> 
> I took that to mean that you didn't see how the the existing unwinder
> prevented alt shadow stacks. Hopefully we're all on the same page now. 

well alt shstk is discontinous.

there were two separate confusions:

- your mention of fnon-call-exception threw me off since that is a
very specific corner case.

- i was talking about an unwind design that can deal with altshstk
which requires ssp switch. (current uwinder does not support this,
but i assumed we can add that now and ignore old broken unwinders).
you are saying that alt shstk support needs additional shstk tokens
in the signal frame to maintain alt shstk state for the kernel.

i think now we are on the same page.

> BTW, when alt shadow stack's were POCed, I hadn't encountered this GCC
> behavior yet. So it was assumed it could be bolted on later without
> disturbing anything. If Linus or someone wants to say we're ok with
> breaking these old GCCs in this way, the first thing I would do would
> be to pad the shadow stack signal frame with room for alt shadow stack
> and more. I actually have a patch to do this, but alas we are already
> pushing it regression wise.

sounds like it will be hard to add alt shstk later.

(i think maintaining alt shstk state on the stack instead of
shstk should work too. but if that does not work, then alt
shstk will require another abi opt-in.)
Edgecombe, Rick P June 28, 2023, 1:23 a.m. UTC | #34
On Mon, 2023-06-26 at 15:08 +0100, szabolcs.nagy@arm.com wrote:
> The 06/22/2023 16:46, Edgecombe, Rick P wrote:
> > You previously said:
> > 
> > On Wed, 2023-06-21 at 12:36 +0100, szabolcs.nagy@arm.com wrote:
> > > as far as i can tell the current unwinder handles shstk unwinding
> > > correctly across signal handlers (sync or async and
> > > cleanup/exceptions
> > > handlers too), i see no issue with "fixed shadow stack signal
> > > frame
> > > size of 8 bytes" other than future extensions and discontinous
> > > shstk.
> > 
> > I took that to mean that you didn't see how the the existing
> > unwinder
> > prevented alt shadow stacks. Hopefully we're all on the same page
> > now. 
> 
> well alt shstk is discontinous.
> 
> there were two separate confusions:
> 
> - your mention of fnon-call-exception threw me off since that is a
> very specific corner case.
> 
> - i was talking about an unwind design that can deal with altshstk
> which requires ssp switch. (current uwinder does not support this,
> but i assumed we can add that now and ignore old broken unwinders).
> you are saying that alt shstk support needs additional shstk tokens
> in the signal frame to maintain alt shstk state for the kernel.
> 
> i think now we are on the same page.

I don't think I fully understand your point still. It would really help
if you could do a levelset summary, like I asked on other branches of
this thread.

> 
> > BTW, when alt shadow stack's were POCed, I hadn't encountered this
> > GCC
> > behavior yet. So it was assumed it could be bolted on later without
> > disturbing anything. If Linus or someone wants to say we're ok with
> > breaking these old GCCs in this way, the first thing I would do
> > would
> > be to pad the shadow stack signal frame with room for alt shadow
> > stack
> > and more. I actually have a patch to do this, but alas we are
> > already
> > pushing it regression wise.
> 
> sounds like it will be hard to add alt shstk later.

Adding alt shadow stack without moving the elf bit runs the risk of
regressions because of the old GCCs. But moving the elf bit is easy
(from the technical side at least). There were several threads about
this in the past.

So I don't think any harder than it is now.

> 
> (i think maintaining alt shstk state on the stack instead of
> shstk should work too. but if that does not work, then alt
> shstk will require another abi opt-in.)

The x86 sigframe is pretty full AFAIK. There are already features that
require an opt in. It might be true that the alt shadow stack base and
size don't need to be protected like the previous SSP. I'd have to
think about it. It could be nice to optionally put some IBT stuff on
the shadow stack as well, which would require growing the frame, and at
that point the alt shadow stack stuff might as well go there too.
Szabolcs Nagy June 29, 2023, 4:07 p.m. UTC | #35
The 06/22/2023 23:18, Edgecombe, Rick P wrote:
> I'd also appreciate if you could spell out exactly which:
>  - ucontext
>  - signal
>  - longjmp
>  - custom library stack switching
> 
> patterns you think shadow stack should support working together.
> Because even after all these mails, I'm still not sure exactly what you
> are trying to achieve.

i'm trying to support two operations (in any combination):

(1) jump up the current (active) stack.

(2) jump to a live frame in a different inactive but live stack.
    the old stack becomes inactive (= no task executes on it)
    and live (= has valid frames to jump to).

with

(3) the runtime must manage the shadow stacks transparently.
    (= portable c code does not need modifications)

mapping this to c apis:

- swapcontext, setcontext, longjmp, custom stack switching are jump
  operations. (there are conditions under which (1) and (2) must work,
  further details don't matter.)

- makecontext creates an inactive live stack.

- signal is only special if it executes on an alt stack: on signal
  entry the alt stack becomes active and the interrupted stack
  inactive but live. (nested signals execute on the alt stack until
  that is left either via a jump or signal return.)

- unwinding can be implemented with jump operations (it needs some
  other things but that's out of scope here).

the patterns that shadow stack should support falls out of this model.
(e.g. posix does not allow jumping from one thread to the stack of a
different thread, but the model does not care about that, it only
cares if the target stack is inactive and live then jump should work.)

some observations:

- it is necessary for jump to detect case (2) and then switch to the
  target shadow stack. this is also sufficient to implement it. (note:
  the restore token can be used for detection since that is guaranteed
  to be present when user code creates an inactive live stack and is
  not present anywhere else by design. a different marking can be used
  if the inactive live stack is created by the kernel, but then the
  kernel has to provide a switch method, e.g. syscall. this should not
  be controversial.)

- in this model two live stacks cannot use the same shadow stack since
  jumping between the two stacks is allowed in both directions, but
  jumping within a shadow stack only works in one direction. (also two
  tasks could execute on the same shadow stack then. and it makes
  shadow stack size accounting problematic.)

- so sharing shadow stack with alt stack is broken. (the model is
  right in the sense that valid posix code can trigger the issue. we
  can ignore that corner case and adjust the model so the shared
  shadow stack works for alt stack, but it likely does not change the
  jump design: eventually we want alt shadow stack.)

- shadow stack cannot always be managed by the runtime transparently:
  it has to be allocated for makecontext and alt stack in situations
  where allocation failure cannot be handled. more alarmingly the
  destruction of stacks may not be visible to the runtime so the
  corresponding shadow stacks leak. my preferred way to fix this is
  new apis that are shadow stack compatible (e.g. shadow_makecontext
  with shadow_freecontext) and marking the incompatible apis as such.
  portable code then can decide to update to new apis, run with shstk
  disabled or accept the leaks and OOM failures. the current approach
  needs ifdef __CET__ in user code for makecontext and sigaltstack
  has many issues.

- i'm still not happy with the shadow stack sizing. and would like to
  have a token at the end of the shadow stack to allow scanning. and
  it would be nice to deal with shadow stack overflow. and there is
  async disable on dlopen. so there are things to work on.

i understand that the proposed linux abi makes most existing binaries
with shstk marking work, which is relevant for x86.

for a while i thought we can fix the remaining issues even if that
means breaking existing shstk binaries (just bump the abi marking).
now it seems the issues can only be addressed in a future abi break.

which means x86 linux will likely end up maintaining two incompatible
abis and the future one will need user code and build system changes,
not just runtime changes. it is not a small incremental change to add
alt shadow stack support for example.

i don't think the maintenance burden of two shadow stack abis is the
right path for arm64 to follow, so the shadow stack semantics will
likely become divergent not common across targets.

i hope my position is now clearer.
Edgecombe, Rick P July 2, 2023, 6:03 p.m. UTC | #36
On Thu, 2023-06-29 at 17:07 +0100, szabolcs.nagy@arm.com wrote:
> The 06/22/2023 23:18, Edgecombe, Rick P wrote:
> > I'd also appreciate if you could spell out exactly which:
> >  - ucontext
> >  - signal
> >  - longjmp
> >  - custom library stack switching
> > 
> > patterns you think shadow stack should support working together.
> > Because even after all these mails, I'm still not sure exactly what
> > you
> > are trying to achieve.

Hi Szablocs,

Thanks for writing all this up. It is helpful to understand where you
are coming from. Please don't miss my point at the very bottom of this
response.

> 
> i'm trying to support two operations (in any combination):
> 
> (1) jump up the current (active) stack.
> 
> (2) jump to a live frame in a different inactive but live stack.
>     the old stack becomes inactive (= no task executes on it)
>     and live (= has valid frames to jump to).
> 
> with
> 
> (3) the runtime must manage the shadow stacks transparently.
>     (= portable c code does not need modifications)
> 
> mapping this to c apis:
> 
> - swapcontext, setcontext, longjmp, custom stack switching are jump
>   operations. (there are conditions under which (1) and (2) must
> work,
>   further details don't matter.)
> 
> - makecontext creates an inactive live stack.
> 
> - signal is only special if it executes on an alt stack: on signal
>   entry the alt stack becomes active and the interrupted stack
>   inactive but live. (nested signals execute on the alt stack until
>   that is left either via a jump or signal return.)
> 
> - unwinding can be implemented with jump operations (it needs some
>   other things but that's out of scope here).
> 
> the patterns that shadow stack should support falls out of this
> model.
> (e.g. posix does not allow jumping from one thread to the stack of a
> different thread, but the model does not care about that, it only
> cares if the target stack is inactive and live then jump should
> work.)
> 
> some observations:
> 
> - it is necessary for jump to detect case (2) and then switch to the
>   target shadow stack. this is also sufficient to implement it.
> (note:
>   the restore token can be used for detection since that is
> guaranteed
>   to be present when user code creates an inactive live stack and is
>   not present anywhere else by design. a different marking can be
> used
>   if the inactive live stack is created by the kernel, but then the
>   kernel has to provide a switch method, e.g. syscall. this should
> not
>   be controversial.)

For x86's shadow stack you can jump to a new stack without leaving a
token behind. I don't know if maybe we could make it a rule in the
x86_64 ABI that you should always leave a token if you are going to
mark the SHSTK elf bit. But if anything did this, then longjmp() could
never make it back to the stack where setjmp() was called without
kernel help.

> 
> - in this model two live stacks cannot use the same shadow stack
> since
>   jumping between the two stacks is allowed in both directions, but
>   jumping within a shadow stack only works in one direction. (also
> two
>   tasks could execute on the same shadow stack then. and it makes
>   shadow stack size accounting problematic.)
> 
> - so sharing shadow stack with alt stack is broken. (the model is
>   right in the sense that valid posix code can trigger the issue.

Could you spell out what "the issue" is that can be triggered?

>  we
>   can ignore that corner case and adjust the model so the shared
>   shadow stack works for alt stack, but it likely does not change the
>   jump design: eventually we want alt shadow stack.)

As we discussed previously, alt shadow stack can't work transparently
with existing code due to the sigaltstack API. I wonder if maybe you
are trying to get at something else, and I'm not following.

> 
> - shadow stack cannot always be managed by the runtime transparently:
>   it has to be allocated for makecontext and alt stack in situations
>   where allocation failure cannot be handled. more alarmingly the
>   destruction of stacks may not be visible to the runtime so the
>   corresponding shadow stacks leak. my preferred way to fix this is
>   new apis that are shadow stack compatible (e.g. shadow_makecontext
>   with shadow_freecontext) and marking the incompatible apis as such.
>   portable code then can decide to update to new apis, run with shstk
>   disabled or accept the leaks and OOM failures. the current approach
>   needs ifdef __CET__ in user code for makecontext and sigaltstack
>   has many issues.

This sounds reasonable to me on the face of it. It seems mostly
unrelated to the kernel ABI and purely a userspace thing.

> 
> - i'm still not happy with the shadow stack sizing. and would like to
>   have a token at the end of the shadow stack to allow scanning. and
>   it would be nice to deal with shadow stack overflow. and there is
>   async disable on dlopen. so there are things to work on.

I was imagining that for tracing-only users, it might make sense to run
with WRSS enabled. This could mean libc's could write their own restore
tokens. In the case of longjmp() it could be simple and fast. The
implementation could just write a token at the target SSP and switch to
it. Non C runtimes that want to use if for backtracing could also write
their own preferred stack markers or other data. It also is whole
different solution to what is being discussed.

But over the course of this thread, I could imagine a little more now
how a top of stack marker could possibly be useful for non-tracing
usages. I have a patch prepared for this and I had tested to see if
adding this later could disturb anything in userspace. The only thing
that I found was that gdb might output a slightly different stack
trace. So it would be a user visible change, if not a regression.

One reason I held off on it still, is that the plan for the expanded
shadow stack signal frame includes using a 0 frame, to avoid a forgery
scenario. The token that makes sense for the end of stack marker is
also a 0 frame. So if userspace that looks for the end of stack marker
scans for the 0 frame without checking if it is part of an expanded
shadow stack signal frame, then it could make more trouble for alt
shadow stack.

So since they are tied together, and I thought to hold off on it for
now. I don't want to try to squeeze around the upstream userspace, I
think a version 2 should be a clean slate on a new elf bit.

> 
> i understand that the proposed linux abi makes most existing binaries
> with shstk marking work, which is relevant for x86.
> 
> for a while i thought we can fix the remaining issues even if that
> means breaking existing shstk binaries (just bump the abi marking).
> now it seems the issues can only be addressed in a future abi break.

Adding a new arch_prctl() ENABLE value was the plan. Not sure what you
mean by ABI break vs version bump. The plan was to add the new features
without userspace regression by putting any behavior behind a different
enable option. This relies on userspace to add a new elf bit, and to
use it.

> 
> which means x86 linux will likely end up maintaining two incompatible
> abis and the future one will need user code and build system changes,
> not just runtime changes. it is not a small incremental change to add
> alt shadow stack support for example.
> 
> i don't think the maintenance burden of two shadow stack abis is the
> right path for arm64 to follow, so the shadow stack semantics will
> likely become divergent not common across targets.

Unfortunately we are at a bit of an information asymmetry here because
the ARM spec and patches are not public. It may be part of the cause of
the confusion.

> 
> i hope my position is now clearer.

It kind of sounds like you don't like the x86 glibc implementation. And
you want to make sure the kernel can support whatever a new solution is
that you are working on. I am on board with the goal of having some
generic set of rules to make portable code work for other architectures
shadow stacks. But I think how close we can get to that goal or what it
looks like is an open question. For several reasons:
 1. Not everyone can see all the specs
 2. No POCs have been done (or at least shared)
 3. It's not clear what needs to be supported (yes, I know you have 
    made a rough proposal here, but it sounds like on the x86 glibc 
    side at least it's not even clear what non-shadow stack stack 
    switching operations can work together)

But towards these goals, I think your technical requests are:

1. Leave a token on switching to an alt shadow stack. As discussed
earlier, we can't do this because of the overflow issues. Also since,
alt shadow stack cannot be transparent to existing software anyway, it
should be ok to introduce limitations. So I think this one is a no.
What we could do is introduce security weakening kernel helpers, but
this would make sense to come with alt shadow stack support.
2. Add an end token at the top of the shadow stack. Because of the
existing userspace restriction interactions, this is complicated to
evaluate but I think we *could* do this now. There are pros and cons.
3. Support more options for shadow stack sizing. (I think you are
referring to this conversation:
https://lore.kernel.org/lkml/ZAIgrXQ4670gxlE4@arm.com/). I don't see
why this is needed for the base implementation. If ARM wants to add a
new rlimit or clone variant, I don't see why x86 can't support it
later.

So if we add 2, are you satisfied? Or otherwise, on the non-technical
request side, are you asking to hold off on x86 shadow stack, in order
to co-develop a unified solution?

I think the existing solution will see use in the meantime, including
for the development of all the x86 specific JIT implementations.


And finally, what I think is the most important point in all of this:

I think that *how* it gets used will be a better guide for further
development than us debating. For example the main pain point that has
come up so far is the problems around dlopen(). And the future work
that has been scoped has been for the kernel to help out in this area.
This is based on _user_ (distro) requests.

Any apps that don't work with shadow stack limitations can simply not
enable shadow stack. You and me are debating these specific API
combinations, but we can't know whether they are actually the best
place to focus development efforts. And the early signs are this is NOT
the most important problem to solve.
Mark Brown July 3, 2023, 1:32 p.m. UTC | #37
On Sun, Jul 02, 2023 at 06:03:42PM +0000, Edgecombe, Rick P wrote:
> On Thu, 2023-06-29 at 17:07 +0100, szabolcs.nagy@arm.com wrote:

> > which means x86 linux will likely end up maintaining two incompatible
> > abis and the future one will need user code and build system changes,
> > not just runtime changes. it is not a small incremental change to add
> > alt shadow stack support for example.

> > i don't think the maintenance burden of two shadow stack abis is the
> > right path for arm64 to follow, so the shadow stack semantics will
> > likely become divergent not common across targets.

> Unfortunately we are at a bit of an information asymmetry here because
> the ARM spec and patches are not public. It may be part of the cause of
> the confusion.

While the descriptive text bit of the spec is not yet integrated into
the ARM the architecture XML describing the instructions and system
registers is there, the document is numbered DDI0601:

    https://developer.arm.com/documentation/ddi0601/

The GCS specific instructions and system registers are all named
beginning with GCS, it's aarch64 only.

Hopefully I should have something out next week for the kernel.
Szabolcs Nagy July 3, 2023, 6:19 p.m. UTC | #38
The 07/02/2023 18:03, Edgecombe, Rick P wrote:
> On Thu, 2023-06-29 at 17:07 +0100, szabolcs.nagy@arm.com wrote:
> > The 06/22/2023 23:18, Edgecombe, Rick P wrote:
> > > I'd also appreciate if you could spell out exactly which:
> > >  - ucontext
> > >  - signal
> > >  - longjmp
> > >  - custom library stack switching
> > > 
> > > patterns you think shadow stack should support working together.
> > > Because even after all these mails, I'm still not sure exactly what
> > > you
> > > are trying to achieve.
> 
> Hi Szablocs,
> 
> Thanks for writing all this up. It is helpful to understand where you
> are coming from. Please don't miss my point at the very bottom of this
> response.
> 
> > 
> > i'm trying to support two operations (in any combination):
> > 
> > (1) jump up the current (active) stack.
> > 
> > (2) jump to a live frame in a different inactive but live stack.
> >     the old stack becomes inactive (= no task executes on it)
> >     and live (= has valid frames to jump to).
> > 
> > with
> > 
> > (3) the runtime must manage the shadow stacks transparently.
> >     (= portable c code does not need modifications)
> > 
> > mapping this to c apis:
> > 
> > - swapcontext, setcontext, longjmp, custom stack switching are jump
> >   operations. (there are conditions under which (1) and (2) must
> > work,
> >   further details don't matter.)
> > 
> > - makecontext creates an inactive live stack.
> > 
> > - signal is only special if it executes on an alt stack: on signal
> >   entry the alt stack becomes active and the interrupted stack
> >   inactive but live. (nested signals execute on the alt stack until
> >   that is left either via a jump or signal return.)
> > 
> > - unwinding can be implemented with jump operations (it needs some
> >   other things but that's out of scope here).
> > 
> > the patterns that shadow stack should support falls out of this
> > model.
> > (e.g. posix does not allow jumping from one thread to the stack of a
> > different thread, but the model does not care about that, it only
> > cares if the target stack is inactive and live then jump should
> > work.)
> > 
> > some observations:
> > 
> > - it is necessary for jump to detect case (2) and then switch to the
> >   target shadow stack. this is also sufficient to implement it.
> > (note:
> >   the restore token can be used for detection since that is
> > guaranteed
> >   to be present when user code creates an inactive live stack and is
> >   not present anywhere else by design. a different marking can be
> > used
> >   if the inactive live stack is created by the kernel, but then the
> >   kernel has to provide a switch method, e.g. syscall. this should
> > not
> >   be controversial.)
> 
> For x86's shadow stack you can jump to a new stack without leaving a
> token behind. I don't know if maybe we could make it a rule in the
> x86_64 ABI that you should always leave a token if you are going to
> mark the SHSTK elf bit. But if anything did this, then longjmp() could
> never make it back to the stack where setjmp() was called without
> kernel help.

ok. i didn't know that.

the restore token means the stack is valid to execute on later. so
from libc pov if the token is missing then jumping to that stack is
simply ub. (the kernel could help working around missing tokens but
presumably the point of not adding a token is exactly to prevent
any kind of jumps to that stack later).

so i don't think this changes much. (other than the x86 model has
inactive dead stacks that cannot be jumped to, and this is enforced.
but libc jumps are not supposed to leave such dead stacks behind so
libc jumps would have to add the restore token when doing a switch.)

> > - in this model two live stacks cannot use the same shadow stack
> > since
> >   jumping between the two stacks is allowed in both directions, but
> >   jumping within a shadow stack only works in one direction. (also
> > two
> >   tasks could execute on the same shadow stack then. and it makes
> >   shadow stack size accounting problematic.)
> > 
> > - so sharing shadow stack with alt stack is broken. (the model is
> >   right in the sense that valid posix code can trigger the issue.
> 
> Could you spell out what "the issue" is that can be triggered?

i meant jumping back from the main to the alt stack:

in main:
setup sig alt stack
setjmp buf1
	raise signal on first return
	longjmp buf2 on second return

in signal handler:
setjmp buf2
	longjmp buf1 on first return
	can continue after second return

in my reading of posix this is valid (and works if signals are masked
such that the alt stack is not clobbered when jumping away from it).

but cannot work with a single shared shadow stack.

> >  we
> >   can ignore that corner case and adjust the model so the shared
> >   shadow stack works for alt stack, but it likely does not change the
> >   jump design: eventually we want alt shadow stack.)
> 
> As we discussed previously, alt shadow stack can't work transparently
> with existing code due to the sigaltstack API. I wonder if maybe you
> are trying to get at something else, and I'm not following.

i would like a jump design that works with alt shadow stack.

...
> So since they are tied together, and I thought to hold off on it for
> now. I don't want to try to squeeze around the upstream userspace, I
> think a version 2 should be a clean slate on a new elf bit.

ok.

> > i understand that the proposed linux abi makes most existing binaries
> > with shstk marking work, which is relevant for x86.
> > 
> > for a while i thought we can fix the remaining issues even if that
> > means breaking existing shstk binaries (just bump the abi marking).
> > now it seems the issues can only be addressed in a future abi break.
> 
> Adding a new arch_prctl() ENABLE value was the plan. Not sure what you
> mean by ABI break vs version bump. The plan was to add the new features
> without userspace regression by putting any behavior behind a different
> enable option. This relies on userspace to add a new elf bit, and to
> use it.

yes future abi break was the wrong wording: new abi with new elf bit
is what i meant. i.e. x86 will have an abi v1 and abi v2, where v2
abi is not compatible e.g. with v1 unwinder.

> > which means x86 linux will likely end up maintaining two incompatible
> > abis and the future one will need user code and build system changes,
> > not just runtime changes. it is not a small incremental change to add
> > alt shadow stack support for example.
> > 
> > i don't think the maintenance burden of two shadow stack abis is the
> > right path for arm64 to follow, so the shadow stack semantics will
> > likely become divergent not common across targets.
> 
> Unfortunately we are at a bit of an information asymmetry here because
> the ARM spec and patches are not public. It may be part of the cause of
> the confusion.

yes that's unfortunate. but in this case i just meant that arm64
does not have existing marked binaries to worry about. so it seems
wrong to do a v1 abi and later a v2 abi to fix that up.

> It kind of sounds like you don't like the x86 glibc implementation. And
> you want to make sure the kernel can support whatever a new solution is
> that you are working on. I am on board with the goal of having some
> generic set of rules to make portable code work for other architectures
> shadow stacks. But I think how close we can get to that goal or what it
> looks like is an open question. For several reasons:
>  1. Not everyone can see all the specs
>  2. No POCs have been done (or at least shared)
>  3. It's not clear what needs to be supported (yes, I know you have 
>     made a rough proposal here, but it sounds like on the x86 glibc 
>     side at least it's not even clear what non-shadow stack stack 
>     switching operations can work together)
> 
> But towards these goals, I think your technical requests are:
> 
> 1. Leave a token on switching to an alt shadow stack. As discussed
> earlier, we can't do this because of the overflow issues. Also since,

i don't think the overflow discussion was conclusive.

the kernel could modify the top entry instead of adding a new token.
and provide a syscall to switch to that top entry undoing the
modification.

there might be cornercases and likely not much space for encoding
a return address and special marking in an entry. but otherwise
this makes jumping to an overflowed shadow stack work.

but it is also a valid design to just not support jumping out of alt
stack in the specific case of a shadow stack overflow (treat that as
a fatal error), but still allow the jump in less fatal situations.

> alt shadow stack cannot be transparent to existing software anyway, it

maybe not in glibc, but a libc can internally use alt shadow stack
in sigaltstack instead of exposing a separate sigaltshadowstack api.
(this is what a strict posix conform implementation has to do to
support shadow stacks), leaking shadow stacks is not a correctness
issue unless it prevents the program working (the shadow stack for
the main thread likely wastes more memory than all the alt stack
leaks. if the leaks become dominant in a thread the sigaltstack
libc api can just fail).

> should be ok to introduce limitations. So I think this one is a no.
> What we could do is introduce security weakening kernel helpers, but
> this would make sense to come with alt shadow stack support.

yes this would be for abi v2 on x86.

> 2. Add an end token at the top of the shadow stack. Because of the
> existing userspace restriction interactions, this is complicated to
> evaluate but I think we *could* do this now. There are pros and cons.

yes, this is a minor point. (and ok to do differently across targets)

> 3. Support more options for shadow stack sizing. (I think you are
> referring to this conversation:
> https://lore.kernel.org/lkml/ZAIgrXQ4670gxlE4@arm.com/). I don't see
> why this is needed for the base implementation. If ARM wants to add a
> new rlimit or clone variant, I don't see why x86 can't support it
> later.

i think it can be added later.

but it may be important for deployment on some platforms, since a
libc (or other language runtime) may want to set the shadow stack
size differently than the kernel default, because

- languages allocating large arrays on the stack
  (too big shadow stack can cause OOM with overcommit off and
  rlimits can be hit like RLIMIT_DATA, RLIMIT_AS because of it)

- tiny thread stack but big sigaltstack (musl libc, go).

> So if we add 2, are you satisfied? Or otherwise, on the non-technical
> request side, are you asking to hold off on x86 shadow stack, in order
> to co-develop a unified solution?

well a unified solution would need a v2 abi with new elf bit,
and it seems the preference is a v1 abi first, so at this point
i'm just trying to understand the potential brokenness in v1
and possible solutions for v2 since if there is a v2 i would
like that to be compatible across targets.

> I think the existing solution will see use in the meantime, including
> for the development of all the x86 specific JIT implementations.

i think some of that work have to be redone if there is a v2 abi,
which is why i thought having 2 abis is too much maintenance work.

> And finally, what I think is the most important point in all of this:
> 
> I think that *how* it gets used will be a better guide for further
> development than us debating. For example the main pain point that has
> come up so far is the problems around dlopen(). And the future work
> that has been scoped has been for the kernel to help out in this area.
> This is based on _user_ (distro) requests.

i think dlopen (and how it is used) is part of the api/abi design.

in presence of programs that can load any library (e.g. python exe)
it is difficult to make use of shstk without runtime disable.

however if dlopen gets support for runtime disable, reducing the
number of incompatible libraries over time is still relevant,
which requires abi/api design that allows that.

> Any apps that don't work with shadow stack limitations can simply not
> enable shadow stack. You and me are debating these specific API
> combinations, but we can't know whether they are actually the best
> place to focus development efforts. And the early signs are this is NOT
> the most important problem to solve.

disabling shadow stack is not simple (removing the elf marking from
an exe is often not possible/appropriate, but using an env var that
affects an entire process tree has problems too.) but i don't have
a solution for this (it is likely a userspace issue).

debating the api issues was at least useful for me to understand
what can go into the x86 v1 abi and what may go into a v2 abi.
Mark Brown July 3, 2023, 6:38 p.m. UTC | #39
On Mon, Jul 03, 2023 at 07:19:00PM +0100, szabolcs.nagy@arm.com wrote:
> The 07/02/2023 18:03, Edgecombe, Rick P wrote:

> > 3. Support more options for shadow stack sizing. (I think you are
> > referring to this conversation:
> > https://lore.kernel.org/lkml/ZAIgrXQ4670gxlE4@arm.com/). I don't see
> > why this is needed for the base implementation. If ARM wants to add a
> > new rlimit or clone variant, I don't see why x86 can't support it
> > later.

> i think it can be added later.

> but it may be important for deployment on some platforms, since a
> libc (or other language runtime) may want to set the shadow stack
> size differently than the kernel default, because

I agree that this can be deferred to later, so long as we err on the
side of large stacks now we can provide an additional limit without
confusing things.
Florian Weimer July 3, 2023, 6:49 p.m. UTC | #40
* szabolcs:

>> alt shadow stack cannot be transparent to existing software anyway, it
>
> maybe not in glibc, but a libc can internally use alt shadow stack
> in sigaltstack instead of exposing a separate sigaltshadowstack api.
> (this is what a strict posix conform implementation has to do to
> support shadow stacks), leaking shadow stacks is not a correctness
> issue unless it prevents the program working (the shadow stack for
> the main thread likely wastes more memory than all the alt stack
> leaks. if the leaks become dominant in a thread the sigaltstack
> libc api can just fail).

It should be possible in theory to carve out pages from sigaltstack and
push a shadow stack page and a guard page as part of the signal frame.
As far as I understand it, the signal frame layout is not ABI, so it's
possible to hide arbitrary stuff in it.  I'm just saying that it looks
possible, not that it's a good idea.

Perhaps that's not realistic with 64K pages, though.

Thanks,
Florian
Szabolcs Nagy July 4, 2023, 11:33 a.m. UTC | #41
The 07/03/2023 20:49, Florian Weimer wrote:
> * szabolcs:
> 
> >> alt shadow stack cannot be transparent to existing software anyway, it
> >
> > maybe not in glibc, but a libc can internally use alt shadow stack
> > in sigaltstack instead of exposing a separate sigaltshadowstack api.
> > (this is what a strict posix conform implementation has to do to
> > support shadow stacks), leaking shadow stacks is not a correctness
> > issue unless it prevents the program working (the shadow stack for
> > the main thread likely wastes more memory than all the alt stack
> > leaks. if the leaks become dominant in a thread the sigaltstack
> > libc api can just fail).
> 
> It should be possible in theory to carve out pages from sigaltstack and
> push a shadow stack page and a guard page as part of the signal frame.
> As far as I understand it, the signal frame layout is not ABI, so it's
> possible to hide arbitrary stuff in it.  I'm just saying that it looks
> possible, not that it's a good idea.
> 
> Perhaps that's not realistic with 64K pages, though.

interesting idea, but it would not work transparently:

the user expects the alt stack memory to be usable as normal
memory after longjmping out of a signal handler.

this would break code in practice e.g. when a malloced alt
stack is passed to free(), the contract there is to not
allow changes to the underlying mapping (affects malloc
interposition so not possible to paper over inside the
libc malloc).

so signal entry cannot change the mappings of alt stack.

i think kernel internal alt shadow stack allocation works
in practice where their lifetime is the same as the thread
lifetime. it is sketchy as os interface but doing it in
userspace should be fine i think (it's policy what kind of
sigaltstack usage is allowed). the kernel is easier in the
sense that if there is actual sigreturn then the alt shadow
stack can be freed, while libc cannot catch this case (at
least not easily). leaked shadow stack can also have
security implication but reuse of old alt shadow stack
sounds like a minor issue in practice.
Edgecombe, Rick P July 5, 2023, 6:45 p.m. UTC | #42
On Mon, 2023-07-03 at 19:19 +0100, szabolcs.nagy@arm.com wrote:
> Could you spell out what "the issue" is that can be triggered?
> 
> i meant jumping back from the main to the alt stack:
> 
> in main:
> setup sig alt stack
> setjmp buf1
>         raise signal on first return
>         longjmp buf2 on second return
> 
> in signal handler:
> setjmp buf2
>         longjmp buf1 on first return
>         can continue after second return
> 
> in my reading of posix this is valid (and works if signals are masked
> such that the alt stack is not clobbered when jumping away from it).
> 
> but cannot work with a single shared shadow stack.

Ah, I see. To make this work seamlessly, you would need to have
automatic alt shadow stacks, and as we previously discussed this is not
possible with the existing sigaltstack API. (Or at least it seemed like
a closed discussion to me).

If there is a solution, then we are currently missing a detailed
proposal. It looks like further down you proposed leaking alt shadow
stacks (quoted up here near the related discussion):

On Mon, 2023-07-03 at 19:19 +0100, szabolcs.nagy@arm.com wrote:
> maybe not in glibc, but a libc can internally use alt shadow stack
> in sigaltstack instead of exposing a separate sigaltshadowstack api.
> (this is what a strict posix conform implementation has to do to
> support shadow stacks), leaking shadow stacks is not a correctness
> issue unless it prevents the program working (the shadow stack for
> the main thread likely wastes more memory than all the alt stack
> leaks. if the leaks become dominant in a thread the sigaltstack
> libc api can just fail).

It seems like your priority must be to make sure pure C apps don't have
to make any changes in order to not crash with shadow stack enabled.
And this at the expense of any performance and memory usage. Do you
have some formalized priorities or design philosophy you can share?

Earlier you suggested glibc should create new interfaces to handle
makecontext() (makes sense). Shouldn't the same thing happen here? In
which case we are in code-changes territory and we should ask ourselves
what apps really need.

> 
> > >   we
> > >   can ignore that corner case and adjust the model so the shared
> > >   shadow stack works for alt stack, but it likely does not change
> > > the
> > >   jump design: eventually we want alt shadow stack.)
> > 
> > As we discussed previously, alt shadow stack can't work
> > transparently
> > with existing code due to the sigaltstack API. I wonder if maybe
> > you
> > are trying to get at something else, and I'm not following.
> 
> i would like a jump design that works with alt shadow stack.

A shadow stack switch could happen based on the following scenarios:
 1. Alt shadow stack
 2. ucontext
 3. custom stack switching logic

If we leave a token on signal, then 1 and 2 could be guaranteed to have
a token *somewhere* above where setjmp() could have been called.

The algorithm could be to search from the target SSP up the stack until
it finds a token, and then switch to it and INCSSP back to the SSP of
the setjmp() point. This is what we are talking about, right?

And the two problems are:
 - Alt shadow stack overflow problem
 - In the case of (3) there might not be a token

Let's ignore these problems for a second - now we have a solution that
allows you to longjmp() back from an alt stack or ucontext stack. Or at
least it works functionally. But is it going to actually work for
people who are using longjmp() for things that are supposed to be fast?
Like, is this the tradeoff people want? I see some references to fiber
switching implementations using longjmp(). I wonder if the existing
INCSSP loops are not going to be ideal for every usage already, and
this sounds like going further down that road.

For jumping out occasionally in some error case, it seems it would be
useful. But I think we are then talking about targeting a subset of
people using these stack switching patterns.

Looking at the docs Mark linked (thanks!), ARM has generic GCS PUSH and
POP shadow stack instructions? Can ARM just push a restore token at
setjmp time, like I was trying to figure out earlier with a push token
arch_prctl? It would be good to understand how ARM is going to
implement this with these differences in what is allowed by the HW.

If there are differences in how locked down/functional the hardware
implementations are, and if we want to have some unified set of rules
for apps, there will need to some give and take. The x86 approach was
mostly to not support all behaviors and ask apps to either change or
not enable shadow stacks. We don't want one architecture to have to do
a bunch of strange things, but we also don't want one to lose some key
end user value.

I'm thinking that for pure tracing users, glibc might do things a lot
differently (use of WRSS to speed things up). So I'm guessing we will
end up with at least one more "policy" on the x86 side.

I wonder if maybe we should have something like a "max compatibility"
policy/mode where arm/x86/riscv could all behave the same from the
glibc caller perspective. We could add kernel help to achieve this for
any implementation that is more locked down. And maybe that is x86's v2
ABI. I don't know, just sort of thinking out loud at this point. And
this sort of gets back to the point I keep making: if we need to decide
tradeoffs, it would be great to get some users to start using this and
start telling us what they want. Are people caring mostly about
security, compatibility or performance?

[snip]
Mark Brown July 5, 2023, 7:10 p.m. UTC | #43
On Wed, Jul 05, 2023 at 06:45:38PM +0000, Edgecombe, Rick P wrote:

> Looking at the docs Mark linked (thanks!), ARM has generic GCS PUSH and
> POP shadow stack instructions? Can ARM just push a restore token at
> setjmp time, like I was trying to figure out earlier with a push token
> arch_prctl? It would be good to understand how ARM is going to
> implement this with these differences in what is allowed by the HW.

> If there are differences in how locked down/functional the hardware
> implementations are, and if we want to have some unified set of rules
> for apps, there will need to some give and take. The x86 approach was
> mostly to not support all behaviors and ask apps to either change or
> not enable shadow stacks. We don't want one architecture to have to do
> a bunch of strange things, but we also don't want one to lose some key
> end user value.

GCS is all or nothing, either the hardware supports GCS or it doesn't.
There are finer grained hypervisor traps (see HFGxTR_EL2 in the system
registers) but they aren't intended to be used to disable partial
functionality and there's a strong chance we'd just disable the feature
in the face of such usage.  The kernel does have the option to control
which functionality is exposed to userspace, in particular we have
separate controls for use of the GCS, the push/pop instructions and the
store instructions (similarly to the control x86 has for WRSS).
Similarly to the handling of WRSS in your series my patches allow
userspace to choose which of these features are enabled.
Edgecombe, Rick P July 5, 2023, 7:17 p.m. UTC | #44
On Wed, 2023-07-05 at 20:10 +0100, Mark Brown wrote:
> On Wed, Jul 05, 2023 at 06:45:38PM +0000, Edgecombe, Rick P wrote:
> 
> > Looking at the docs Mark linked (thanks!), ARM has generic GCS PUSH
> > and
> > POP shadow stack instructions? Can ARM just push a restore token at
> > setjmp time, like I was trying to figure out earlier with a push
> > token
> > arch_prctl? It would be good to understand how ARM is going to
> > implement this with these differences in what is allowed by the HW.
> 
> > If there are differences in how locked down/functional the hardware
> > implementations are, and if we want to have some unified set of
> > rules
> > for apps, there will need to some give and take. The x86 approach
> > was
> > mostly to not support all behaviors and ask apps to either change
> > or
> > not enable shadow stacks. We don't want one architecture to have to
> > do
> > a bunch of strange things, but we also don't want one to lose some
> > key
> > end user value.
> 
> GCS is all or nothing, either the hardware supports GCS or it
> doesn't.
> There are finer grained hypervisor traps (see HFGxTR_EL2 in the
> system
> registers) but they aren't intended to be used to disable partial
> functionality and there's a strong chance we'd just disable the
> feature
> in the face of such usage.  The kernel does have the option to
> control
> which functionality is exposed to userspace, in particular we have
> separate controls for use of the GCS, the push/pop instructions and
> the
> store instructions (similarly to the control x86 has for WRSS).
> Similarly to the handling of WRSS in your series my patches allow
> userspace to choose which of these features are enabled.

Ah, interesting, thanks for the extra info. So which features is glibc
planning to use? (probably more of a question for Szabolcs). Are push
and pop controllable separately?
Mark Brown July 5, 2023, 7:29 p.m. UTC | #45
On Wed, Jul 05, 2023 at 07:17:25PM +0000, Edgecombe, Rick P wrote:

> Ah, interesting, thanks for the extra info. So which features is glibc
> planning to use? (probably more of a question for Szabolcs). Are push
> and pop controllable separately?

Push and pop are one control, you get both or neither.

I'll defer to Szabolcs on glibc plans.
Szabolcs Nagy July 6, 2023, 1:07 p.m. UTC | #46
The 07/05/2023 18:45, Edgecombe, Rick P wrote:
> On Mon, 2023-07-03 at 19:19 +0100, szabolcs.nagy@arm.com wrote:
> > Could you spell out what "the issue" is that can be triggered?
> > 
> > i meant jumping back from the main to the alt stack:
> > 
> > in main:
> > setup sig alt stack
> > setjmp buf1
> >         raise signal on first return
> >         longjmp buf2 on second return
> > 
> > in signal handler:
> > setjmp buf2
> >         longjmp buf1 on first return
> >         can continue after second return
> > 
> > in my reading of posix this is valid (and works if signals are masked
> > such that the alt stack is not clobbered when jumping away from it).
> > 
> > but cannot work with a single shared shadow stack.
> 
> Ah, I see. To make this work seamlessly, you would need to have
> automatic alt shadow stacks, and as we previously discussed this is not
> possible with the existing sigaltstack API. (Or at least it seemed like
> a closed discussion to me).
> 
> If there is a solution, then we are currently missing a detailed
> proposal. It looks like further down you proposed leaking alt shadow
> stacks (quoted up here near the related discussion):
> 
> On Mon, 2023-07-03 at 19:19 +0100, szabolcs.nagy@arm.com wrote:
> > maybe not in glibc, but a libc can internally use alt shadow stack
> > in sigaltstack instead of exposing a separate sigaltshadowstack api.
> > (this is what a strict posix conform implementation has to do to
> > support shadow stacks), leaking shadow stacks is not a correctness
> > issue unless it prevents the program working (the shadow stack for
> > the main thread likely wastes more memory than all the alt stack
> > leaks. if the leaks become dominant in a thread the sigaltstack
> > libc api can just fail).
> 
> It seems like your priority must be to make sure pure C apps don't have
> to make any changes in order to not crash with shadow stack enabled.
> And this at the expense of any performance and memory usage. Do you
> have some formalized priorities or design philosophy you can share?
> 
> Earlier you suggested glibc should create new interfaces to handle
> makecontext() (makes sense). Shouldn't the same thing happen here? In
> which case we are in code-changes territory and we should ask ourselves
> what apps really need.

instead of priority, i'd say "posix conform c apps work
without change" is a benchmark i use to see if the design
is sound.

i do not have a particular workload (or distro) in mind, so
i have to reason through the cases that make sense and the
current linux syscall abi allows, but fail or difficult to
support with shadow stacks.

one such case is jumping back to an alt stack (i.e. inactive
live alt stack):

- with shared shadow stack this does not work in many cases.

- with alt shadow stack this extends the lifetime beyond the
  point it become inactive (so it cannot be freed).

if there are no inactive live alt stacks then *both* shared
and implicit alt shadow stack works. and to me it looked
like implicit alt shadow stack is simply better of those two
(more alt shadow stack use-cases are supported, shadow stack
overflow can be handled. drawback: complications due to the
discontinous shadow stack.)

on arm64 i personally don't like the idea of "deal with alt
shadow stack later" because it likely requires a v2 abi
affecting the unwinder and jump implementations. (later
extensions are fine if they are bw compat and discoverable)

one nasty case is shadow stack overflow handling, but i
think i have a solution for that (not the nicest thing:
it involves setting the top bit on the last entry on the
shadow stack instead of adding a new entry to it. + a new
syscall that can switch to this entry. i havent convinced
myself about this yet).

> 
> > 
> > > >   we
> > > >   can ignore that corner case and adjust the model so the shared
> > > >   shadow stack works for alt stack, but it likely does not change
> > > > the
> > > >   jump design: eventually we want alt shadow stack.)
> > > 
> > > As we discussed previously, alt shadow stack can't work
> > > transparently
> > > with existing code due to the sigaltstack API. I wonder if maybe
> > > you
> > > are trying to get at something else, and I'm not following.
> > 
> > i would like a jump design that works with alt shadow stack.
> 
> A shadow stack switch could happen based on the following scenarios:
>  1. Alt shadow stack
>  2. ucontext
>  3. custom stack switching logic
> 
> If we leave a token on signal, then 1 and 2 could be guaranteed to have
> a token *somewhere* above where setjmp() could have been called.
> 
> The algorithm could be to search from the target SSP up the stack until
> it finds a token, and then switch to it and INCSSP back to the SSP of
> the setjmp() point. This is what we are talking about, right?
> 
> And the two problems are:
>  - Alt shadow stack overflow problem
>  - In the case of (3) there might not be a token
> 
> Let's ignore these problems for a second - now we have a solution that
> allows you to longjmp() back from an alt stack or ucontext stack. Or at
> least it works functionally. But is it going to actually work for
> people who are using longjmp() for things that are supposed to be fast?

slow longjmp is bad. (well longjmp is actually always slow
in glibc because it sets the signalmask with a syscall, but
there are other jump operations that don't do this and want
to be fast so yes we want fast jump to be possible).

jumping up the shadow stack is at least linear time in the
number of frames jumped over (which already sounds significant
slowdown however this is amortized by the fact that the stack
frames had to be created at some point and that is actually a
lot more expensive because it involves write operations, so a
zero cost jump will not do any asymptotic speedup compared to
a linear cost jump as far as i can see.).

with my proposed solution the jump is still linear. (i know
x86 incssp can jump many entries at a time and does not have
to actually read and check the entries, but technically it's
linear time too: you have to do at least one read per page to
have the guardpage protection). this all looks fine to me
even for extreme made up workloads.

> Like, is this the tradeoff people want? I see some references to fiber
> switching implementations using longjmp(). I wonder if the existing
> INCSSP loops are not going to be ideal for every usage already, and
> this sounds like going further down that road.
> 
> For jumping out occasionally in some error case, it seems it would be
> useful. But I think we are then talking about targeting a subset of
> people using these stack switching patterns.
> 

i simply don't see any trade-off here (i expect no measurable
difference with a scanning vs no scanning approach even in
a microbenchmark that does longjmp in a loop independently of
the stack switch pattern and even if the non-scanning
implementation can do wrss).

> Looking at the docs Mark linked (thanks!), ARM has generic GCS PUSH and
> POP shadow stack instructions? Can ARM just push a restore token at
> setjmp time, like I was trying to figure out earlier with a push token
> arch_prctl? It would be good to understand how ARM is going to
> implement this with these differences in what is allowed by the HW.
> 
> If there are differences in how locked down/functional the hardware
> implementations are, and if we want to have some unified set of rules
> for apps, there will need to some give and take. The x86 approach was
> mostly to not support all behaviors and ask apps to either change or
> not enable shadow stacks. We don't want one architecture to have to do
> a bunch of strange things, but we also don't want one to lose some key
> end user value.
> 
> I'm thinking that for pure tracing users, glibc might do things a lot
> differently (use of WRSS to speed things up). So I'm guessing we will
> end up with at least one more "policy" on the x86 side.
> 
> I wonder if maybe we should have something like a "max compatibility"
> policy/mode where arm/x86/riscv could all behave the same from the
> glibc caller perspective. We could add kernel help to achieve this for
> any implementation that is more locked down. And maybe that is x86's v2
> ABI. I don't know, just sort of thinking out loud at this point. And
> this sort of gets back to the point I keep making: if we need to decide
> tradeoffs, it would be great to get some users to start using this and
> start telling us what they want. Are people caring mostly about
> security, compatibility or performance?
> 
> [snip]
>
Szabolcs Nagy July 6, 2023, 1:14 p.m. UTC | #47
The 07/05/2023 20:29, Mark Brown wrote:
> On Wed, Jul 05, 2023 at 07:17:25PM +0000, Edgecombe, Rick P wrote:
> 
> > Ah, interesting, thanks for the extra info. So which features is glibc
> > planning to use? (probably more of a question for Szabolcs). Are push
> > and pop controllable separately?
> 
> Push and pop are one control, you get both or neither.
> 
> I'll defer to Szabolcs on glibc plans.

gcspopm is always available (esentially *ssp++, this is used
for longjmp).

i haven't planned anything yet for other modes (i dont know
anything where writable shadow stack is better than just
turning the feature off, so i expect we at most have a
glibc tunable env var to enable it but it will not affect
glibc behaviour otherwise).
Mark Brown July 6, 2023, 2:24 p.m. UTC | #48
On Thu, Jul 06, 2023 at 02:14:40PM +0100, szabolcs.nagy@arm.com wrote:
> The 07/05/2023 20:29, Mark Brown wrote:

> > Push and pop are one control, you get both or neither.

> gcspopm is always available (esentially *ssp++, this is used
> for longjmp).

Ah, sorry - I misremembered there.  You're right, it's only push that we
have control over.
Edgecombe, Rick P July 6, 2023, 4:59 p.m. UTC | #49
On Thu, 2023-07-06 at 15:24 +0100, Mark Brown wrote:
> On Thu, Jul 06, 2023 at 02:14:40PM +0100,
> szabolcs.nagy@arm.com wrote:
> > The 07/05/2023 20:29, Mark Brown wrote:
> 
> > > Push and pop are one control, you get both or neither.
> 
> > gcspopm is always available (esentially *ssp++, this is used
> > for longjmp).
> 
> Ah, sorry - I misremembered there.  You're right, it's only push that
> we
> have control over.

Ah, ok! So if you are not planning to enable the push mode then the
features are pretty well aligned, except:
 - On x86 it is possible to switch stacks without leaving a token 
   behind.
 - The GCSPOPM/INCSSP looping may require longer loops on ARM 
   because it only pops one at at time.

If you are not going to use GCSPUSHM by default, then I think we
*should* be able to have some unified set of rules for developers for
glibc behaviors at least.
Edgecombe, Rick P July 6, 2023, 6:25 p.m. UTC | #50
On Thu, 2023-07-06 at 14:07 +0100, szabolcs.nagy@arm.com wrote:

[ snip ]
> 
> instead of priority, i'd say "posix conform c apps work
> without change" is a benchmark i use to see if the design
> is sound.

This involves leaking shadow stacks for sigaltstack and makecontext,
though, right? This seems kind of wrong. It might be useful for
avoiding crashes at all costs, but probably shouldn't be the long term
solution. I thought your API updates were the right direction.

But, I'm of course not a glibc developer. HJ and friends would have to
agree to all of that.

> 
> i do not have a particular workload (or distro) in mind, so
> i have to reason through the cases that make sense and the
> current linux syscall abi allows, but fail or difficult to
> support with shadow stacks.
> 
> one such case is jumping back to an alt stack (i.e. inactive
> live alt stack):
> 
> - with shared shadow stack this does not work in many cases.
> 
> - with alt shadow stack this extends the lifetime beyond the
>   point it become inactive (so it cannot be freed).
> 
> if there are no inactive live alt stacks then *both* shared
> and implicit alt shadow stack works. and to me it looked
> like implicit alt shadow stack is simply better of those two
> (more alt shadow stack use-cases are supported, shadow stack
> overflow can be handled. drawback: complications due to the
> discontinous shadow stack.)
> 
> on arm64 i personally don't like the idea of "deal with alt
> shadow stack later" because it likely requires a v2 abi
> affecting the unwinder and jump implementations. (later
> extensions are fine if they are bw compat and discoverable)

I think you could do it, if your signal handler can push data on the
shadow stack like x86 does. I'd start with a padded shadow stack signal
frame though, and not trust userspace to parse it.

> 
> one nasty case is shadow stack overflow handling, but i
> think i have a solution for that (not the nicest thing:
> it involves setting the top bit on the last entry on the
> shadow stack instead of adding a new entry to it. + a new
> syscall that can switch to this entry. i havent convinced
> myself about this yet).

There might be some complicated thing around storing the last shadow
stack entry into the shadow stack sigframe and restoring it on
sigreturn. Then writing a token from the kernel to where the saved
frame was to live there in the meantime.

But to me this whole search, restore and INCSSP thing is suspect at
this point though. We could also increase compatibility and performance
more simply, by adding kernel help, at the expense of security.


[ snip ]

> slow longjmp is bad. (well longjmp is actually always slow
> in glibc because it sets the signalmask with a syscall, but
> there are other jump operations that don't do this and want
> to be fast so yes we want fast jump to be possible).
> 
> jumping up the shadow stack is at least linear time in the
> number of frames jumped over (which already sounds significant
> slowdown however this is amortized by the fact that the stack
> frames had to be created at some point and that is actually a
> lot more expensive because it involves write operations, so a
> zero cost jump will not do any asymptotic speedup compared to
> a linear cost jump as far as i can see.).
> 
> with my proposed solution the jump is still linear. (i know
> x86 incssp can jump many entries at a time and does not have
> to actually read and check the entries, but technically it's
> linear time too: you have to do at least one read per page to
> have the guardpage protection). this all looks fine to me
> even for extreme made up workloads.

Well I guess we are talking about hypothetical performance. But linear
time is still worse than O(1). And I thought longjmp() was supposed to
be an O(1) type thing.


Separate from all of this...now that all the constraints are clearer,
if you have changed your mind on whether this series is ready, could
you comment at the top of this thread something to that effect? I'm
imagining not many are reading so far down at this point.

For my part, I think we should go forward with what we have on the
kernel side, unless glibc/gcc developers would like to start by
deprecating the existing binaries. I just talked with HJ, and he has
not changed his plans around this. If anyone else in that community has
(Florian?), please speak up. But otherwise I think it's better to start
getting real world feedback and grow based on that.
Mark Brown July 6, 2023, 7:03 p.m. UTC | #51
On Thu, Jul 06, 2023 at 04:59:45PM +0000, Edgecombe, Rick P wrote:
> On Thu, 2023-07-06 at 15:24 +0100, Mark Brown wrote:
> > szabolcs.nagy@arm.com wrote:
> > > The 07/05/2023 20:29, Mark Brown wrote:

> > > gcspopm is always available (esentially *ssp++, this is used
> > > for longjmp).

> > Ah, sorry - I misremembered there.  You're right, it's only push that
> > we
> > have control over.

FWIW the confusion there was some of the hypervisor features which do
tie some of the push and pop instructions together.

> Ah, ok! So if you are not planning to enable the push mode then the
> features are pretty well aligned, except:
>  - On x86 it is possible to switch stacks without leaving a token 
>    behind.
>  - The GCSPOPM/INCSSP looping may require longer loops on ARM 
>    because it only pops one at at time.

> If you are not going to use GCSPUSHM by default, then I think we
> *should* be able to have some unified set of rules for developers for
> glibc behaviors at least.

Yes, the only case where I am aware of conciously diverging in any
substantial way is that we do not free the GCS when GCS is disabled by
userspace, we just disable the updates and checks, and reenabling after
disabling is not supported.  We have demand for disabling at runtime so
we want to keep the stack around for things like a running unwinder but
we don't see a practical use for reenabling so didn't worry about
figuring out what would make sense for userspace.  glibc isn't going to
be using that though.
Szabolcs Nagy July 7, 2023, 3:25 p.m. UTC | #52
The 07/06/2023 18:25, Edgecombe, Rick P wrote:
> On Thu, 2023-07-06 at 14:07 +0100, szabolcs.nagy@arm.com wrote:
> 
> [ snip ]
> > 
> > instead of priority, i'd say "posix conform c apps work
> > without change" is a benchmark i use to see if the design
> > is sound.
> 
> This involves leaking shadow stacks for sigaltstack and makecontext,
> though, right? This seems kind of wrong. It might be useful for
> avoiding crashes at all costs, but probably shouldn't be the long term
> solution. I thought your API updates were the right direction.

new apis are not enough.

existing apis either must do something reasonable or shstk must
be disabled when that api is used (sigaltstack, makecontext).

the disable does not really work as the apis are widely used
and there is no 'disable shstk locally' it is viral when a
widely used dependency is affected.

so apis must do something reasonable. however there will be
remaining issues and that will need new apis which can take
a long time to transition to.

> > one nasty case is shadow stack overflow handling, but i
> > think i have a solution for that (not the nicest thing:
> > it involves setting the top bit on the last entry on the
> > shadow stack instead of adding a new entry to it. + a new
> > syscall that can switch to this entry. i havent convinced
> > myself about this yet).
> 
> There might be some complicated thing around storing the last shadow
> stack entry into the shadow stack sigframe and restoring it on
> sigreturn. Then writing a token from the kernel to where the saved
> frame was to live there in the meantime.

this only works if you jump from the alt stack to the overflowed
stack, not if you jump somewhere else in between.

this is a would be nice to solve but not the most important case.

> 
> But to me this whole search, restore and INCSSP thing is suspect at
> this point though. We could also increase compatibility and performance
> more simply, by adding kernel help, at the expense of security.

what is the kernel help? (and security trade-off)

and why is scan+restore+incssp suspect?

the reasoning i've seen are

- some ppl might not add restore token: sounds fine, it's already
  ub to jump to such stack either way.

- it is slow: please give an example where there is slowdown.
  (the slowdown has to be larger than the call/ret overhead)

- jump to overflowed shadow stack: i'm fairly sure this can be done
  (but indeed complicated), and if that's not acceptable then not
  supporting this case is better than not supporting reliable
  crash handling (alt stack handler can overflow the shadow stack
  and shadow stack overflow cannot be handled).

> > slow longjmp is bad. (well longjmp is actually always slow
> > in glibc because it sets the signalmask with a syscall, but
> > there are other jump operations that don't do this and want
> > to be fast so yes we want fast jump to be possible).
> > 
> > jumping up the shadow stack is at least linear time in the
> > number of frames jumped over (which already sounds significant
> > slowdown however this is amortized by the fact that the stack
> > frames had to be created at some point and that is actually a
> > lot more expensive because it involves write operations, so a
> > zero cost jump will not do any asymptotic speedup compared to
> > a linear cost jump as far as i can see.).
> > 
> > with my proposed solution the jump is still linear. (i know
> > x86 incssp can jump many entries at a time and does not have
> > to actually read and check the entries, but technically it's
> > linear time too: you have to do at least one read per page to
> > have the guardpage protection). this all looks fine to me
> > even for extreme made up workloads.
> 
> Well I guess we are talking about hypothetical performance. But linear
> time is still worse than O(1). And I thought longjmp() was supposed to
> be an O(1) type thing.

longjmp is not O(1) with your proposed abi.

and i don't think linear time is worse than O(1) in this case.

> Separate from all of this...now that all the constraints are clearer,
> if you have changed your mind on whether this series is ready, could
> you comment at the top of this thread something to that effect? I'm
> imagining not many are reading so far down at this point.
> 
> For my part, I think we should go forward with what we have on the
> kernel side, unless glibc/gcc developers would like to start by
> deprecating the existing binaries. I just talked with HJ, and he has
> not changed his plans around this. If anyone else in that community has
> (Florian?), please speak up. But otherwise I think it's better to start
> getting real world feedback and grow based on that.
> 

the x86 v1 abi tries to be compatible with existing unwinders.
(are there other binaries that constrains v1? portable code
should be fine as they rely on libc which we can still change)

i will have to discuss the arm plan with the arm kernel devs.
the ugly bit i want to avoid on arm is to have to reimplement
unwind and jump ops to make alt shadow stack work in a v2 abi.

i think the worse bit of the x86 v1 abi is that crash handlers
don't work reliably (e.g. a crash on a tiny makecontext stack
with the usual sigaltstack crash handler can unrecoverably fail
during crash handling). i guess this can be somewhat mitigated
by both linux and libc adding an extra page to the shadow stack
size to guarantee that alt stack handlers with certain depth
always work.
Edgecombe, Rick P July 7, 2023, 5:37 p.m. UTC | #53
On Fri, 2023-07-07 at 16:25 +0100, szabolcs.nagy@arm.com wrote:
> > Separate from all of this...now that all the constraints are
> > clearer,
> > if you have changed your mind on whether this series is ready,
> > could
> > you comment at the top of this thread something to that effect? I'm
> > imagining not many are reading so far down at this point.
> > 
> > For my part, I think we should go forward with what we have on the
> > kernel side, unless glibc/gcc developers would like to start by
> > deprecating the existing binaries. I just talked with HJ, and he
> > has
> > not changed his plans around this. If anyone else in that community
> > has
> > (Florian?), please speak up. But otherwise I think it's better to
> > start
> > getting real world feedback and grow based on that.
> > 
> 
> the x86 v1 abi tries to be compatible with existing unwinders.
> (are there other binaries that constrains v1? portable code
> should be fine as they rely on libc which we can still change)
> 
> i will have to discuss the arm plan with the arm kernel devs.
> the ugly bit i want to avoid on arm is to have to reimplement
> unwind and jump ops to make alt shadow stack work in a v2 abi.
> 
> i think the worse bit of the x86 v1 abi is that crash handlers
> don't work reliably (e.g. a crash on a tiny makecontext stack
> with the usual sigaltstack crash handler can unrecoverably fail
> during crash handling). i guess this can be somewhat mitigated
> by both linux and libc adding an extra page to the shadow stack
> size to guarantee that alt stack handlers with certain depth
> always work.

Some mails back, I listed the three things you might be asking for from
the kernel side and pointedly asked you to clarify. The only one you
still were wishing for up front was "Leave a token on switching to an
alt shadow stack."

But how you want to use this involves a lot of details for how glibc
will work (automatic shadow stack for sigaltstack, scan-restore-incssp,
etc). I think you first need to get the story straight with other libc
developers, otherwise this is just brainstorming. I'm not a glibc
contributor, so winning me over is only half the battle.

Only after that is settled do we get to the problem of the old libgcc
unwinders, and how it is a challenge to even add alt shadow stack given
glibc's plans and the existing binaries.

Once that is solved we are at the overflow problem, and the current
state of thinking on that is "i'm fairly sure this can be done (but
indeed complicated)".

So I think we are still missing any actionable requests that should
hold this up.

Is this a reasonable summary?
Szabolcs Nagy July 10, 2023, 4:54 p.m. UTC | #54
The 07/07/2023 17:37, Edgecombe, Rick P wrote:
> On Fri, 2023-07-07 at 16:25 +0100, szabolcs.nagy@arm.com wrote:
> > > Separate from all of this...now that all the constraints are
> > > clearer,
> > > if you have changed your mind on whether this series is ready,
> > > could
> > > you comment at the top of this thread something to that effect? I'm
> > > imagining not many are reading so far down at this point.
> > > 
> > > For my part, I think we should go forward with what we have on the
> > > kernel side, unless glibc/gcc developers would like to start by
> > > deprecating the existing binaries. I just talked with HJ, and he
> > > has
> > > not changed his plans around this. If anyone else in that community
> > > has
> > > (Florian?), please speak up. But otherwise I think it's better to
> > > start
> > > getting real world feedback and grow based on that.
> > > 
> > 
> > the x86 v1 abi tries to be compatible with existing unwinders.
> > (are there other binaries that constrains v1? portable code
> > should be fine as they rely on libc which we can still change)
> > 
> > i will have to discuss the arm plan with the arm kernel devs.
> > the ugly bit i want to avoid on arm is to have to reimplement
> > unwind and jump ops to make alt shadow stack work in a v2 abi.
> > 
> > i think the worse bit of the x86 v1 abi is that crash handlers
> > don't work reliably (e.g. a crash on a tiny makecontext stack
> > with the usual sigaltstack crash handler can unrecoverably fail
> > during crash handling). i guess this can be somewhat mitigated
> > by both linux and libc adding an extra page to the shadow stack
> > size to guarantee that alt stack handlers with certain depth
> > always work.
> 
> Some mails back, I listed the three things you might be asking for from
> the kernel side and pointedly asked you to clarify. The only one you
> still were wishing for up front was "Leave a token on switching to an
> alt shadow stack."
> 
> But how you want to use this involves a lot of details for how glibc
> will work (automatic shadow stack for sigaltstack, scan-restore-incssp,
> etc). I think you first need to get the story straight with other libc
> developers, otherwise this is just brainstorming. I'm not a glibc
> contributor, so winning me over is only half the battle.
> 
> Only after that is settled do we get to the problem of the old libgcc
> unwinders, and how it is a challenge to even add alt shadow stack given
> glibc's plans and the existing binaries.
> 
> Once that is solved we are at the overflow problem, and the current
> state of thinking on that is "i'm fairly sure this can be done (but
> indeed complicated)".
> 
> So I think we are still missing any actionable requests that should
> hold this up.
> 
> Is this a reasonable summary?

not entirely.

the high level requirement is a design that

a) does not break many existing sigaltstack uses,

b) allows implementing jump and unwind that support the
   relevant use-cases around signals and stack switches
   with minimal userspace changes.

where (b) has nothing to add to v1 abi: existing unwind
binaries mean this needs a v2 abi. (the point of discussing
v2 ahead of time is to understand the cost of v2 and the
divergence wrt targets without abi compat issue.)

for (a) my actionable suggestion was to account altstack
when sizing shadow stacks. to document an altstack call
depth limit on the libc level (e.g. fixed 100 is fine) we
need guarantees from the kernel. (consider recursive calls
overflowing the stack with altstack crash handler: for this
to be reliable shadow stack size > stack size is needed.
but the diff can be tiny e.g. 1 page is enough.)

your previous 3 actionable item list was

1. add token when handling signals on altstack.

this falls under (b). your summary is correct that this
requires sorting out many fiddly details.

2. top of stack token.

this can work differently across targets so i have nothing
against the x86 v1 abi, but on arm64 we plan to have this.

3. more shadow stack sizing policies.

this can be done in the future except the default policy
should be fixed for (a) and a smaller size introduces the
overflow issue which may require v2.

in short the only important change for v1 is shstk sizing.
Edgecombe, Rick P July 10, 2023, 10:56 p.m. UTC | #55
On Mon, 2023-07-10 at 17:54 +0100, szabolcs.nagy@arm.com wrote:
> > Some mails back, I listed the three things you might be asking for
> > from
> > the kernel side and pointedly asked you to clarify. The only one
> > you
> > still were wishing for up front was "Leave a token on switching to
> > an
> > alt shadow stack."
> > 
> > But how you want to use this involves a lot of details for how
> > glibc
> > will work (automatic shadow stack for sigaltstack, scan-restore-
> > incssp,
> > etc). I think you first need to get the story straight with other
> > libc
> > developers, otherwise this is just brainstorming. I'm not a glibc
> > contributor, so winning me over is only half the battle.
> > 
> > Only after that is settled do we get to the problem of the old
> > libgcc
> > unwinders, and how it is a challenge to even add alt shadow stack
> > given
> > glibc's plans and the existing binaries.
> > 
> > Once that is solved we are at the overflow problem, and the current
> > state of thinking on that is "i'm fairly sure this can be done (but
> > indeed complicated)".
> > 
> > So I think we are still missing any actionable requests that should
> > hold this up.
> > 
> > Is this a reasonable summary?
> 
> not entirely.
> 
> the high level requirement is a design that
> 
> a) does not break many existing sigaltstack uses,
> 
> b) allows implementing jump and unwind that support the
>    relevant use-cases around signals and stack switches
>    with minimal userspace changes.

Please open a discussion with the other glibc developers that have been
involved with shadow stack regarding this subject(b). Please include me
(and probably AndyL would be interested?). I think we've talked it
through as much as you and I can at this point. Let's at least start a
new more focused thread on the "unwind across stacks" problem. And also
get some consensus on the wisdom of the related suggestion to leak
shadow stacks in order to transparently support existing posix APIs.

> 
> where (b) has nothing to add to v1 abi: existing unwind
> binaries mean this needs a v2 abi. (the point of discussing
> v2 ahead of time is to understand the cost of v2 and the
> divergence wrt targets without abi compat issue.)
> 
> for (a) my actionable suggestion was to account altstack
> when sizing shadow stacks. to document an altstack call
> depth limit on the libc level (e.g. fixed 100 is fine) we
> need guarantees from the kernel. (consider recursive calls
> overflowing the stack with altstack crash handler: for this
> to be reliable shadow stack size > stack size is needed.
> but the diff can be tiny e.g. 1 page is enough.)
> 
> your previous 3 actionable item list was
> 
> 1. add token when handling signals on altstack.
> 
> this falls under (b). your summary is correct that this
> requires sorting out many fiddly details.
> 
> 2. top of stack token.
> 
> this can work differently across targets so i have nothing
> against the x86 v1 abi, but on arm64 we plan to have this.
> 
> 3. more shadow stack sizing policies.
> 
> this can be done in the future except the default policy
> should be fixed for (a) and a smaller size introduces the
> overflow issue which may require v2.
> 
> in short the only important change for v1 is shstk sizing.

I tried searching through this long thread and AFAICT this is a new
idea. Sorry if I missed something, but your previous answer on this(3)
seemed concerned with the opposite problem (oversized shadow stacks).

Quoted from a past mail:
On Mon, 2023-07-03 at 19:19 +0100, szabolcs.nagy@arm.com wrote:
> i think it can be added later.
> 
> but it may be important for deployment on some platforms, since a
> libc (or other language runtime) may want to set the shadow stack
> size differently than the kernel default, because
> 
> - languages allocating large arrays on the stack
>   (too big shadow stack can cause OOM with overcommit off and
>   rlimits can be hit like RLIMIT_DATA, RLIMIT_AS because of it)
> 
> - tiny thread stack but big sigaltstack (musl libc, go).

So you can probably see how I got the impression that 3 was closed.

But anyways, ok, so if we add a page to every thread allocated shadow
stack, then you can guarantee that an alt stack can have some room to
handle at least a single alt stack signal, even in the case of
exhausting the entire stack by recursively making calls and pushing
nothing else to the stack. SS_AUTODISARM remains a bit muddy.

Also glibc would have to size ucontext shadow stacks with an additional
page as well. I think it would be good to get some other signs of
interest on this tweak due to the requirements for glibc to participate
on the scheme. Can you gather that quickly, so we can get this all
prepped again?

To me (unless I'm missing something), it seems like complicating the
equation for probably no real world benefit due to the low chances of
exhausting a shadow stack. But if there is consensus on the glibc side,
then I'm happy to make the change to finally settle this discussion.
Szabolcs Nagy July 11, 2023, 8:08 a.m. UTC | #56
The 07/10/2023 22:56, Edgecombe, Rick P wrote:
> On Mon, 2023-07-10 at 17:54 +0100, szabolcs.nagy@arm.com wrote:
> > in short the only important change for v1 is shstk sizing.
> 
> I tried searching through this long thread and AFAICT this is a new
> idea. Sorry if I missed something, but your previous answer on this(3)
> seemed concerned with the opposite problem (oversized shadow stacks).
> 
> Quoted from a past mail:
> On Mon, 2023-07-03 at 19:19 +0100, szabolcs.nagy@arm.com wrote:
...
> > - tiny thread stack but big sigaltstack (musl libc, go).

and 4 months earlier:

> Date: Fri, 3 Mar 2023 16:30:37 +0000
> Subject: Re: [PATCH v7 01/41] Documentation/x86: Add CET shadow stack description
> > Looking at this again, I'm not sure why a new rlimit is needed. It
> > seems many of those points were just formulations of that the clone3
> > stack size was not used, but it actually is and just not documented. If
> > you disagree perhaps you could elaborate on what the requirements are
> > and we can see if it seems tricky to do in a follow up.
> 
> - tiny thread stack and deep signal stack.
> (note that this does not really work with glibc because it has
> implementation internal signals that don't run on alt stack,
> cannot be masked and don't fit on a tiny thread stack, but
> with other runtimes this can be a valid use-case, e.g. musl
> allows tiny thread stacks, < pagesize.)
...

that analysis was wrong: it considered handling any signals
at all, but if the stack overflows due to a recursive call
with empty stack frames then the stack gets used up at the
same rate as the shadow stack and since the signal handler
on the alt stack uses the same shadow stack that can easily
overflow too independently of the stack size.

this is a big deal as it affects what operations the libc
can support and handling stack overflow is a common
requirement.

i originally argued for a fix using separate alt shadow
stacks, but since it became clear that does not work for
the x86 v1 abi i am recommending the size increase.

with shadow stack size = stack size + 1 page, libc can
document a depth limit for the alt stack that works.
(longjumping back to the alt stack is still broken
though, but that's rarely a requirement)

> Also glibc would have to size ucontext shadow stacks with an additional

yes, assuming glibc wants to support sigaltstack.

> page as well. I think it would be good to get some other signs of
> interest on this tweak due to the requirements for glibc to participate
> on the scheme. Can you gather that quickly, so we can get this all
> prepped again?

i can cc libc-alpha.

the decision is for x86 shadow stack linux abi to use

  shadow stack size = stack size

or

  shadow stack size = stack size + 1 page

as default policy when alt stack signals use the same
shadow stack, not a separate one.

note: smallest stack frame size is 8bytes, same as the
shadow stack entry. on a target where smallest frame
size is 2x shadow stack entry size, the formula would
use (stack size / 2).

note: there is no api to change the policy from userspace
at this point.
Szabolcs Nagy July 12, 2023, 9:39 a.m. UTC | #57
The 07/11/2023 09:08, szabolcs.nagy--- via Libc-alpha wrote:
> the decision is for x86 shadow stack linux abi to use
> 
>   shadow stack size = stack size
> 
> or
> 
>   shadow stack size = stack size + 1 page
> 
> as default policy when alt stack signals use the same
> shadow stack, not a separate one.
> 
> note: smallest stack frame size is 8bytes, same as the
> shadow stack entry. on a target where smallest frame
> size is 2x shadow stack entry size, the formula would
> use (stack size / 2).

i convinced myself that shadow stack size = stack size
works:

libc can reserve N bytes on the initial stack frame so
when the stack overflows there will be at least N bytes
on the shadow stack usable for signal handling.

this is only bad for tiny user allocated stacks where libc
should not consume too much stack space. but e.g. glibc
already uses >128 bytes on the initial stack frame for its
cancellation jumpbuf so 16 deep signal call stack is
already guaranteed to work.

the glibc makecontext code has to be adjusted, but that's
a libc side discussion.

the shadow stack of the main stack can still overflow, but
that requires increasing RLIMIT_STACK at runtime which is
not very common.
Szabolcs Nagy July 18, 2023, 7:32 p.m. UTC | #58
The 06/12/2023 17:10, Rick Edgecombe wrote:
> Introduce a new document on Control-flow Enforcement Technology (CET).
>
> Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
> Reviewed-by: Kees Cook <keescook@chromium.org>
> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
> Tested-by: Pengfei Xu <pengfei.xu@intel.com>
> Tested-by: John Allen <john.allen@amd.com>
> Tested-by: Kees Cook <keescook@chromium.org>

i don't have more comments on this.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>

IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.
diff mbox series

Patch

diff --git a/Documentation/arch/x86/index.rst b/Documentation/arch/x86/index.rst
index c73d133fd37c..8ac64d7de4dc 100644
--- a/Documentation/arch/x86/index.rst
+++ b/Documentation/arch/x86/index.rst
@@ -22,6 +22,7 @@  x86-specific Documentation
    mtrr
    pat
    intel-hfi
+   shstk
    iommu
    intel_txt
    amd-memory-encryption
diff --git a/Documentation/arch/x86/shstk.rst b/Documentation/arch/x86/shstk.rst
new file mode 100644
index 000000000000..f09afa504ec0
--- /dev/null
+++ b/Documentation/arch/x86/shstk.rst
@@ -0,0 +1,169 @@ 
+.. SPDX-License-Identifier: GPL-2.0
+
+======================================================
+Control-flow Enforcement Technology (CET) Shadow Stack
+======================================================
+
+CET Background
+==============
+
+Control-flow Enforcement Technology (CET) covers several related x86 processor
+features that provide protection against control flow hijacking attacks. CET
+can protect both applications and the kernel.
+
+CET introduces shadow stack and indirect branch tracking (IBT). A shadow stack
+is a secondary stack allocated from memory which cannot be directly modified by
+applications. When executing a CALL instruction, the processor pushes the
+return address to both the normal stack and the shadow stack. Upon
+function return, the processor pops the shadow stack copy and compares it
+to the normal stack copy. If the two differ, the processor raises a
+control-protection fault. IBT verifies indirect CALL/JMP targets are intended
+as marked by the compiler with 'ENDBR' opcodes. Not all CPU's have both Shadow
+Stack and Indirect Branch Tracking. Today in the 64-bit kernel, only userspace
+shadow stack and kernel IBT are supported.
+
+Requirements to use Shadow Stack
+================================
+
+To use userspace shadow stack you need HW that supports it, a kernel
+configured with it and userspace libraries compiled with it.
+
+The kernel Kconfig option is X86_USER_SHADOW_STACK.  When compiled in, shadow
+stacks can be disabled at runtime with the kernel parameter: nousershstk.
+
+To build a user shadow stack enabled kernel, Binutils v2.29 or LLVM v6 or later
+are required.
+
+At run time, /proc/cpuinfo shows CET features if the processor supports
+CET. "user_shstk" means that userspace shadow stack is supported on the current
+kernel and HW.
+
+Application Enabling
+====================
+
+An application's CET capability is marked in its ELF note and can be verified
+from readelf/llvm-readelf output::
+
+    readelf -n <application> | grep -a SHSTK
+        properties: x86 feature: SHSTK
+
+The kernel does not process these applications markers directly. Applications
+or loaders must enable CET features using the interface described in section 4.
+Typically this would be done in dynamic loader or static runtime objects, as is
+the case in GLIBC.
+
+Enabling arch_prctl()'s
+=======================
+
+Elf features should be enabled by the loader using the below arch_prctl's. They
+are only supported in 64 bit user applications. These operate on the features
+on a per-thread basis. The enablement status is inherited on clone, so if the
+feature is enabled on the first thread, it will propagate to all the thread's
+in an app.
+
+arch_prctl(ARCH_SHSTK_ENABLE, unsigned long feature)
+    Enable a single feature specified in 'feature'. Can only operate on
+    one feature at a time.
+
+arch_prctl(ARCH_SHSTK_DISABLE, unsigned long feature)
+    Disable a single feature specified in 'feature'. Can only operate on
+    one feature at a time.
+
+arch_prctl(ARCH_SHSTK_LOCK, unsigned long features)
+    Lock in features at their current enabled or disabled status. 'features'
+    is a mask of all features to lock. All bits set are processed, unset bits
+    are ignored. The mask is ORed with the existing value. So any feature bits
+    set here cannot be enabled or disabled afterwards.
+
+The return values are as follows. On success, return 0. On error, errno can
+be::
+
+        -EPERM if any of the passed feature are locked.
+        -ENOTSUPP if the feature is not supported by the hardware or
+         kernel.
+        -EINVAL arguments (non existing feature, etc)
+
+The feature's bits supported are::
+
+    ARCH_SHSTK_SHSTK - Shadow stack
+    ARCH_SHSTK_WRSS  - WRSS
+
+Currently shadow stack and WRSS are supported via this interface. WRSS
+can only be enabled with shadow stack, and is automatically disabled
+if shadow stack is disabled.
+
+Proc Status
+===========
+To check if an application is actually running with shadow stack, the
+user can read the /proc/$PID/status. It will report "wrss" or "shstk"
+depending on what is enabled. The lines look like this::
+
+    x86_Thread_features: shstk wrss
+    x86_Thread_features_locked: shstk wrss
+
+Implementation of the Shadow Stack
+==================================
+
+Shadow Stack Size
+-----------------
+
+A task's shadow stack is allocated from memory to a fixed size of
+MIN(RLIMIT_STACK, 4 GB). In other words, the shadow stack is allocated to
+the maximum size of the normal stack, but capped to 4 GB. In the case
+of the clone3 syscall, there is a stack size passed in and shadow stack
+uses this instead of the rlimit.
+
+Signal
+------
+
+The main program and its signal handlers use the same shadow stack. Because
+the shadow stack stores only return addresses, a large shadow stack covers
+the condition that both the program stack and the signal alternate stack run
+out.
+
+When a signal happens, the old pre-signal state is pushed on the stack. When
+shadow stack is enabled, the shadow stack specific state is pushed onto the
+shadow stack. Today this is only the old SSP (shadow stack pointer), pushed
+in a special format with bit 63 set. On sigreturn this old SSP token is
+verified and restored by the kernel. The kernel will also push the normal
+restorer address to the shadow stack to help userspace avoid a shadow stack
+violation on the sigreturn path that goes through the restorer.
+
+So the shadow stack signal frame format is as follows::
+
+    |1...old SSP| - Pointer to old pre-signal ssp in sigframe token format
+                    (bit 63 set to 1)
+    |        ...| - Other state may be added in the future
+
+
+32 bit ABI signals are not supported in shadow stack processes. Linux prevents
+32 bit execution while shadow stack is enabled by the allocating shadow stacks
+outside of the 32 bit address space. When execution enters 32 bit mode, either
+via far call or returning to userspace, a #GP is generated by the hardware
+which, will be delivered to the process as a segfault. When transitioning to
+userspace the register's state will be as if the userspace ip being returned to
+caused the segfault.
+
+Fork
+----
+
+The shadow stack's vma has VM_SHADOW_STACK flag set; its PTEs are required
+to be read-only and dirty. When a shadow stack PTE is not RO and dirty, a
+shadow access triggers a page fault with the shadow stack access bit set
+in the page fault error code.
+
+When a task forks a child, its shadow stack PTEs are copied and both the
+parent's and the child's shadow stack PTEs are cleared of the dirty bit.
+Upon the next shadow stack access, the resulting shadow stack page fault
+is handled by page copy/re-use.
+
+When a pthread child is created, the kernel allocates a new shadow stack
+for the new thread. New shadow stack creation behaves like mmap() with respect
+to ASLR behavior. Similarly, on thread exit the thread's shadow stack is
+disabled.
+
+Exec
+----
+
+On exec, shadow stack features are disabled by the kernel. At which point,
+userspace can choose to re-enable, or lock them.