[v26,25/30] x86/cet/shstk: Handle signals for shadow stack

Message ID	20210427204315.24153-26-yu-cheng.yu@intel.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=Z8q0=JY=kvack.org=owner-linux-mm@kernel.org> DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org E258360720 IronPort-SDR: vBIgsRpdHckTRqE/edWIecMl6QimoN0/JebSDP3Sje28r8+vxkGIxHXM6GKZbnr1ztOgNDSKMu QvgptRhJ+BhA== IronPort-SDR: rM42Qnag36xHnuLTxjiNeven6Z3hacM22stlqYZPZujkhxvTmy46s3gdNx9/rehtNpA6ZNn9gf dVRnV5tUiKig== From: Yu-cheng Yu <yu-cheng.yu@intel.com> To: x86@kernel.org, "H. Peter Anvin" <hpa@zytor.com>, Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@redhat.com>, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-mm@kvack.org, linux-arch@vger.kernel.org, linux-api@vger.kernel.org, Arnd Bergmann <arnd@arndb.de>, Andy Lutomirski <luto@kernel.org>, Balbir Singh <bsingharora@gmail.com>, Borislav Petkov <bp@alien8.de>, Cyrill Gorcunov <gorcunov@gmail.com>, Dave Hansen <dave.hansen@linux.intel.com>, Eugene Syromiatnikov <esyr@redhat.com>, Florian Weimer <fweimer@redhat.com>, "H.J. Lu" <hjl.tools@gmail.com>, Jann Horn <jannh@google.com>, Jonathan Corbet <corbet@lwn.net>, Kees Cook <keescook@chromium.org>, Mike Kravetz <mike.kravetz@oracle.com>, Nadav Amit <nadav.amit@gmail.com>, Oleg Nesterov <oleg@redhat.com>, Pavel Machek <pavel@ucw.cz>, Peter Zijlstra <peterz@infradead.org>, Randy Dunlap <rdunlap@infradead.org>, "Ravi V. Shankar" <ravi.v.shankar@intel.com>, Vedvyas Shanbhogue <vedvyas.shanbhogue@intel.com>, Dave Martin <Dave.Martin@arm.com>, Weijiang Yang <weijiang.yang@intel.com>, Pengfei Xu <pengfei.xu@intel.com>, Haitao Huang <haitao.huang@intel.com> Cc: Yu-cheng Yu <yu-cheng.yu@intel.com> Subject: [PATCH v26 25/30] x86/cet/shstk: Handle signals for shadow stack Date: Tue, 27 Apr 2021 13:43:10 -0700 Message-Id: <20210427204315.24153-26-yu-cheng.yu@intel.com> In-Reply-To: <20210427204315.24153-1-yu-cheng.yu@intel.com> References: <20210427204315.24153-1-yu-cheng.yu@intel.com> MIME-Version: 1.0 Received-SPF: none (intel.com>: No applicable sender policy available) receiver=imf09; identity=mailfrom; envelope-from="<yu-cheng.yu@intel.com>"; helo=mga17.intel.com; client-ip=192.55.52.151 Content-Transfer-Encoding: quoted-printable Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	Control-flow Enforcement: Shadow Stack \| expand [v26,00/30] Control-flow Enforcement: Shadow Stack [v26,01/30] Documentation/x86: Add CET description [v26,02/30] x86/cet/shstk: Add Kconfig option for Shadow Stack [v26,03/30] x86/cpufeatures: Add CET CPU feature flags for Control-flow Enforcement Technology (CET) [v26,04/30] x86/cpufeatures: Introduce CPU setup and option parsing for CET [v26,05/30] x86/fpu/xstate: Introduce CET MSR and XSAVES supervisor states [v26,06/30] x86/cet: Add control-protection fault handler [v26,07/30] x86/mm: Remove _PAGE_DIRTY from kernel RO pages [v26,08/30] x86/mm: Move pmd_write(), pud_write() up in the file [v26,09/30] x86/mm: Introduce _PAGE_COW [v26,10/30] drm/i915/gvt: Change _PAGE_DIRTY to _PAGE_DIRTY_BITS [v26,11/30] x86/mm: Update pte_modify for _PAGE_COW [v26,12/30] x86/mm: Update ptep_set_wrprotect() and pmdp_set_wrprotect() for transition from _PAGE_… [v26,13/30] mm: Introduce VM_SHADOW_STACK for shadow stack memory [v26,14/30] x86/mm: Shadow Stack page fault error checking [v26,15/30] x86/mm: Update maybe_mkwrite() for shadow stack [v26,16/30] mm: Fixup places that call pte_mkwrite() directly [v26,17/30] mm: Add guard pages around a shadow stack. [v26,18/30] mm/mmap: Add shadow stack pages to memory accounting [v26,19/30] mm: Update can_follow_write_pte() for shadow stack [v26,20/30] mm/mprotect: Exclude shadow stack from preserve_write [v26,21/30] mm: Re-introduce vm_flags to do_mmap() [v26,22/30] x86/cet/shstk: Add user-mode shadow stack support [v26,23/30] x86/cet/shstk: Handle thread shadow stack [v26,24/30] x86/cet/shstk: Introduce shadow stack token setup/verify routines [v26,25/30] x86/cet/shstk: Handle signals for shadow stack [v26,26/30] ELF: Introduce arch_setup_elf_property() [v26,27/30] x86/cet/shstk: Add arch_prctl functions for shadow stack [v26,28/30] mm: Move arch_calc_vm_prot_bits() to arch/x86/include/asm/mman.h [v26,29/30] mm: Update arch_validate_flags() to test vma anonymous [v26,30/30] mm: Introduce PROT_SHADOW_STACK for shadow stack

Yu-cheng Yu April 27, 2021, 8:43 p.m. UTC

When shadow stack is enabled, a task's shadow stack states must be saved
along with the signal context and later restored in sigreturn.  However,
currently there is no systematic facility for extending a signal context.
There is some space left in the ucontext, but changing ucontext is likely
to create compatibility issues and there is not enough space for further
extensions.

Introduce a signal context extension struct 'sc_ext', which is used to save
shadow stack restore token address.  The extension is located above the fpu
states, plus alignment.  The struct can be extended (such as the ibt's
wait_endbr status to be introduced later), and sc_ext.total_size field
keeps track of total size.

Introduce routines for the allocation, save, and restore for sc_ext:
- fpu__alloc_sigcontext_ext(),
- save_extra_state_to_sigframe(),
- get_extra_state_from_sigframe(),
- restore_extra_state_to_xregs().

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Cc: Kees Cook <keescook@chromium.org>
---
v25:
- Update commit log/comments for the sc_ext struct.
- Use restorer address already calculated.
- Change CONFIG_X86_CET to CONFIG_X86_SHADOW_STACK.
- Change X86_FEATURE_CET to X86_FEATURE_SHSTK.
- Eliminate writing to MSR_IA32_U_CET for shadow stack.
- Change wrmsrl() to wrmsrl_safe() and handle error.

v24:
- Split out shadow stack token routines to a separate patch.
- Put signal frame save/restore routines to fpu/signal.c and re-name accordingly.

 arch/x86/ia32/ia32_signal.c            |  24 +++--
 arch/x86/include/asm/cet.h             |   2 +
 arch/x86/include/asm/fpu/internal.h    |   2 +
 arch/x86/include/uapi/asm/sigcontext.h |   9 ++
 arch/x86/kernel/fpu/signal.c           | 137 ++++++++++++++++++++++++-
 arch/x86/kernel/signal.c               |   9 ++
 6 files changed, 172 insertions(+), 11 deletions(-)

Andy Lutomirski April 28, 2021, 11:03 p.m. UTC | #1

On Tue, Apr 27, 2021 at 1:44 PM Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
>
> When shadow stack is enabled, a task's shadow stack states must be saved
> along with the signal context and later restored in sigreturn.  However,
> currently there is no systematic facility for extending a signal context.
> There is some space left in the ucontext, but changing ucontext is likely
> to create compatibility issues and there is not enough space for further
> extensions.
>
> Introduce a signal context extension struct 'sc_ext', which is used to save
> shadow stack restore token address.  The extension is located above the fpu
> states, plus alignment.  The struct can be extended (such as the ibt's
> wait_endbr status to be introduced later), and sc_ext.total_size field
> keeps track of total size.

I still don't like this.

Here's how the signal layout works, for better or for worse:

The kernel has:

struct rt_sigframe {
    char __user *pretcode;
    struct ucontext uc;
    struct siginfo info;
    /* fp state follows here */
};

This is roughly the actual signal frame.  But userspace does not have
this struct declared, and user code does not know the sizes of the
fields.  So it's accessed in a nonsensical way.  The signal handler
function is passed a pointer to the whole sigframe implicitly in RSP,
a pointer to &frame->info in RSI, anda pointer to &frame->uc in RDX.
User code can *find* the fp state by following a pointer from
mcontext, which is, in turn, found via uc:

struct ucontext {
    unsigned long      uc_flags;
    struct ucontext  *uc_link;
    stack_t          uc_stack;
    struct sigcontext uc_mcontext;  <-- fp pointer is in here
    sigset_t      uc_sigmask;    /* mask last for extensibility */
};

The kernel, in sigreturn, works a bit differently.  The sigreturn
variants know the base address of the frame but don't have the benefit
of receiving pointers to the fields.  So instead the kernel takes
advantage of the fact that it knows the offset to uc and parses uc
accordingly.  And the kernel follows the pointer in mcontext to find
the fp state.  The latter bit is quite important later.  The kernel
does not parse info at all.

The fp state is its own mess.  When XSAVE happened, Intel kindly (?)
gave us a software defined area between the "legacy" x87 region and
the modern supposedly extensible part.  Linux sticks the following
structure in that hole:

struct _fpx_sw_bytes {
    /*
     * If set to FP_XSTATE_MAGIC1 then this is an xstate context.
     * 0 if a legacy frame.
     */
    __u32                magic1;

    /*
     * Total size of the fpstate area:
     *
     *  - if magic1 == 0 then it's sizeof(struct _fpstate)
     *  - if magic1 == FP_XSTATE_MAGIC1 then it's sizeof(struct _xstate)
     *    plus extensions (if any)
     */
    __u32                extended_size;

    /*
     * Feature bit mask (including FP/SSE/extended state) that is present
     * in the memory layout:
     */
    __u64                xfeatures;

    /*
     * Actual XSAVE state size, based on the xfeatures saved in the layout.
     * 'extended_size' is greater than 'xstate_size':
     */
    __u32                xstate_size;

    /* For future use: */
    __u32                padding[7];
};

That's where we are right now upstream.  The kernel has a parser for
the FPU state that is bugs piled upon bugs and is going to have to be
rewritten sometime soon.  On top of all this, we have two upcoming
features, both of which require different kinds of extensions:

1. AVX-512.  (Yeah, you thought this story was over a few years ago,
but no.  And AMX makes it worse.)  To make a long story short, we
promised user code many years ago that a signal frame fit in 2048
bytes with some room to spare.  With AVX-512 this is false.  With AMX
it's so wrong it's not even funny.  The only way out of the mess
anyone has come up with involves making the length of the FPU state
vary depending on which features are INIT, i.e. making it more compact
than "compact" mode is.  This has a side effect: it's no longer
possible to modify the state in place, because enabling a feature with
no space allocated will make the structure bigger, and the stack won't
have room.  Fortunately, one can relocate the entire FPU state, update
the pointer in mcontext, and the kernel will happily follow the
pointer.  So new code on a new kernel using a super-compact state
could expand the state by allocating new memory (on the heap? very
awkwardly on the stack?) and changing the pointer.  For all we know,
some code already fiddles with the pointer.  This is great, except
that your patch sticks more data at the end of the FPU block that no
one is expecting, and your sigreturn code follows that pointer, and
will read off into lala land.

2. CET.  CET wants us to find a few more bytes somewhere, and those
bytes logically belong in ucontext, and here we are.

This is *almost*, but not quite, easy: struct ucontext is already
variable length!  Unfortunately, the whole variable length portion is
used up by uc_sigmask.  So I propose that we introduce a brand new
bona fide extension mechanism.  It works like this:

First, we add a struct ucontext_extension at the end.  It looks like:

struct ucontext_extension {
  u64 length;  /* sizeof(struct ucontext_extension) */
  u64 flags;  /* we will want this some day */
  [CET stuff here]
  [future stuff here]
};

And we locate it by scrounging a word somewhere in ucontext to give
the offset from the beginning of struct ucontext to
ucontext_extension.  We indicate the presence of this feature using a
new uc_flags bit.  I can think of a couple of vaguely reasonable
places:

a) the reserved word in sigcontext.  This is fine for x86 but not so
great if other architectures want to do this.

b) uc_link.  Fine everywhere but powerpc.  Oops.

c) use the high bits of uc_flags.  After all, once we add extensions,
we don't need new flags, so we can steal 16 high bits of uc_flags for
this.

I think I'm in favor of (c).  We do:

(uc_flags & 0xffff0000) == 0: extension not present

Otherwise the extension region is at ucontext + (uc_flags >> 16).

And sigreturn finds the extension the same way, because CRIU can
already migrate a signal frame from one kernel to another, your patch
breaks this, and having sigreturn hardcode the offset would also break
it.

What do you think?

Yu-cheng Yu April 28, 2021, 11:20 p.m. UTC | #2

On 4/28/2021 4:03 PM, Andy Lutomirski wrote:
> On Tue, Apr 27, 2021 at 1:44 PM Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
>>
>> When shadow stack is enabled, a task's shadow stack states must be saved
>> along with the signal context and later restored in sigreturn.  However,
>> currently there is no systematic facility for extending a signal context.
>> There is some space left in the ucontext, but changing ucontext is likely
>> to create compatibility issues and there is not enough space for further
>> extensions.
>>
>> Introduce a signal context extension struct 'sc_ext', which is used to save
>> shadow stack restore token address.  The extension is located above the fpu
>> states, plus alignment.  The struct can be extended (such as the ibt's
>> wait_endbr status to be introduced later), and sc_ext.total_size field
>> keeps track of total size.
> 
> I still don't like this.
> 
> Here's how the signal layout works, for better or for worse:
> 
> The kernel has:
> 
> struct rt_sigframe {
>      char __user *pretcode;
>      struct ucontext uc;
>      struct siginfo info;
>      /* fp state follows here */
> };
> 
> This is roughly the actual signal frame.  But userspace does not have
> this struct declared, and user code does not know the sizes of the
> fields.  So it's accessed in a nonsensical way.  The signal handler
> function is passed a pointer to the whole sigframe implicitly in RSP,
> a pointer to &frame->info in RSI, anda pointer to &frame->uc in RDX.
> User code can *find* the fp state by following a pointer from
> mcontext, which is, in turn, found via uc:
> 
> struct ucontext {
>      unsigned long      uc_flags;
>      struct ucontext  *uc_link;
>      stack_t          uc_stack;
>      struct sigcontext uc_mcontext;  <-- fp pointer is in here
>      sigset_t      uc_sigmask;    /* mask last for extensibility */
> };
> 
> The kernel, in sigreturn, works a bit differently.  The sigreturn
> variants know the base address of the frame but don't have the benefit
> of receiving pointers to the fields.  So instead the kernel takes
> advantage of the fact that it knows the offset to uc and parses uc
> accordingly.  And the kernel follows the pointer in mcontext to find
> the fp state.  The latter bit is quite important later.  The kernel
> does not parse info at all.
> 
> The fp state is its own mess.  When XSAVE happened, Intel kindly (?)
> gave us a software defined area between the "legacy" x87 region and
> the modern supposedly extensible part.  Linux sticks the following
> structure in that hole:
> 
> struct _fpx_sw_bytes {
>      /*
>       * If set to FP_XSTATE_MAGIC1 then this is an xstate context.
>       * 0 if a legacy frame.
>       */
>      __u32                magic1;
> 
>      /*
>       * Total size of the fpstate area:
>       *
>       *  - if magic1 == 0 then it's sizeof(struct _fpstate)
>       *  - if magic1 == FP_XSTATE_MAGIC1 then it's sizeof(struct _xstate)
>       *    plus extensions (if any)
>       */
>      __u32                extended_size;
> 
>      /*
>       * Feature bit mask (including FP/SSE/extended state) that is present
>       * in the memory layout:
>       */
>      __u64                xfeatures;
> 
>      /*
>       * Actual XSAVE state size, based on the xfeatures saved in the layout.
>       * 'extended_size' is greater than 'xstate_size':
>       */
>      __u32                xstate_size;
> 
>      /* For future use: */
>      __u32                padding[7];
> };
> 
> 
> That's where we are right now upstream.  The kernel has a parser for
> the FPU state that is bugs piled upon bugs and is going to have to be
> rewritten sometime soon.  On top of all this, we have two upcoming
> features, both of which require different kinds of extensions:
> 
> 1. AVX-512.  (Yeah, you thought this story was over a few years ago,
> but no.  And AMX makes it worse.)  To make a long story short, we
> promised user code many years ago that a signal frame fit in 2048
> bytes with some room to spare.  With AVX-512 this is false.  With AMX
> it's so wrong it's not even funny.  The only way out of the mess
> anyone has come up with involves making the length of the FPU state
> vary depending on which features are INIT, i.e. making it more compact
> than "compact" mode is.  This has a side effect: it's no longer
> possible to modify the state in place, because enabling a feature with
> no space allocated will make the structure bigger, and the stack won't
> have room.  Fortunately, one can relocate the entire FPU state, update
> the pointer in mcontext, and the kernel will happily follow the
> pointer.  So new code on a new kernel using a super-compact state
> could expand the state by allocating new memory (on the heap? very
> awkwardly on the stack?) and changing the pointer.  For all we know,
> some code already fiddles with the pointer.  This is great, except
> that your patch sticks more data at the end of the FPU block that no
> one is expecting, and your sigreturn code follows that pointer, and
> will read off into lala land.
> 
> 2. CET.  CET wants us to find a few more bytes somewhere, and those
> bytes logically belong in ucontext, and here we are.
> 
> This is *almost*, but not quite, easy: struct ucontext is already
> variable length!  Unfortunately, the whole variable length portion is
> used up by uc_sigmask.  So I propose that we introduce a brand new
> bona fide extension mechanism.  It works like this:
> 
> First, we add a struct ucontext_extension at the end.  It looks like:
> 
> struct ucontext_extension {
>    u64 length;  /* sizeof(struct ucontext_extension) */
>    u64 flags;  /* we will want this some day */
>    [CET stuff here]
>    [future stuff here]
> };
> 
> And we locate it by scrounging a word somewhere in ucontext to give
> the offset from the beginning of struct ucontext to
> ucontext_extension.  We indicate the presence of this feature using a
> new uc_flags bit.  I can think of a couple of vaguely reasonable
> places:
> 
> a) the reserved word in sigcontext.  This is fine for x86 but not so
> great if other architectures want to do this.
> 
> b) uc_link.  Fine everywhere but powerpc.  Oops.
> 
> c) use the high bits of uc_flags.  After all, once we add extensions,
> we don't need new flags, so we can steal 16 high bits of uc_flags for
> this.
> 
> I think I'm in favor of (c).  We do:
> 
> (uc_flags & 0xffff0000) == 0: extension not present
> 
> Otherwise the extension region is at ucontext + (uc_flags >> 16).
> 
> And sigreturn finds the extension the same way, because CRIU can
> already migrate a signal frame from one kernel to another, your patch
> breaks this, and having sigreturn hardcode the offset would also break
> it.
> 
> What do you think?
> 

There are a lot of things in here.  I think I could create some patches 
for ucontext_extension and send out for discussion.  Thanks for 
explaining this!

Yu-cheng

Cyrill Gorcunov April 29, 2021, 7:28 a.m. UTC | #3

On Wed, Apr 28, 2021 at 04:03:55PM -0700, Andy Lutomirski wrote:
> On Tue, Apr 27, 2021 at 1:44 PM Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
> >
> > When shadow stack is enabled, a task's shadow stack states must be saved
> > along with the signal context and later restored in sigreturn.  However,
> > currently there is no systematic facility for extending a signal context.
> > There is some space left in the ucontext, but changing ucontext is likely
> > to create compatibility issues and there is not enough space for further
> > extensions.
> >
> > Introduce a signal context extension struct 'sc_ext', which is used to save
> > shadow stack restore token address.  The extension is located above the fpu
> > states, plus alignment.  The struct can be extended (such as the ibt's
> > wait_endbr status to be introduced later), and sc_ext.total_size field
> > keeps track of total size.
> 
> I still don't like this.
> 
> Here's how the signal layout works, for better or for worse:
> 
> The kernel has:
> 
> struct rt_sigframe {
>     char __user *pretcode;
>     struct ucontext uc;
>     struct siginfo info;
>     /* fp state follows here */
> };
> 
> This is roughly the actual signal frame.  But userspace does not have
> this struct declared, and user code does not know the sizes of the
> fields.  So it's accessed in a nonsensical way.  The signal handler

Well, not really. While indeed this is not declared as a part of API
the structure is widely used for rt_sigreturn syscall (and we're using
it inside criu thus any change here will simply break the restore
procedure). Sorry out of time right now, I'll read your mail more
carefully once time permit.

Andy Lutomirski April 29, 2021, 2:44 p.m. UTC | #4

On Thu, Apr 29, 2021 at 12:28 AM Cyrill Gorcunov <gorcunov@gmail.com> wrote:
>
> On Wed, Apr 28, 2021 at 04:03:55PM -0700, Andy Lutomirski wrote:
> > On Tue, Apr 27, 2021 at 1:44 PM Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
> > >
> > > When shadow stack is enabled, a task's shadow stack states must be saved
> > > along with the signal context and later restored in sigreturn.  However,
> > > currently there is no systematic facility for extending a signal context.
> > > There is some space left in the ucontext, but changing ucontext is likely
> > > to create compatibility issues and there is not enough space for further
> > > extensions.
> > >
> > > Introduce a signal context extension struct 'sc_ext', which is used to save
> > > shadow stack restore token address.  The extension is located above the fpu
> > > states, plus alignment.  The struct can be extended (such as the ibt's
> > > wait_endbr status to be introduced later), and sc_ext.total_size field
> > > keeps track of total size.
> >
> > I still don't like this.
> >
> > Here's how the signal layout works, for better or for worse:
> >
> > The kernel has:
> >
> > struct rt_sigframe {
> >     char __user *pretcode;
> >     struct ucontext uc;
> >     struct siginfo info;
> >     /* fp state follows here */
> > };
> >
> > This is roughly the actual signal frame.  But userspace does not have
> > this struct declared, and user code does not know the sizes of the
> > fields.  So it's accessed in a nonsensical way.  The signal handler
>
> Well, not really. While indeed this is not declared as a part of API
> the structure is widely used for rt_sigreturn syscall (and we're using
> it inside criu thus any change here will simply break the restore
> procedure). Sorry out of time right now, I'll read your mail more
> carefully once time permit.

I skimmed the CRIU code.  You appear to declare struct rt_sigframe,
and you use the offset from the start of rt_sigframe to uc.  You also
use the offset to the /* fp state follows here */ part, but that's
unnecessary -- you could just as easily have put the fp state at any
other address -- the kernel will happily follow the pointer you supply
regardless of where it points.  So the only issues I can see are if
you write the fp state on top of something else or if you
inadvertently fill in the proposed extension part of uc_flags.  Right
now you seem to be ignoring uc_flags, which I presume means that you
are filling it in as zero.  Even if the offset of the fp state in the
kernel rt_sigframe changes, the kernel should still successfully parse
the signal frame you generate.

I suppose there is another potential issue: would CRIU have issues if
the *save* runs on a kernel that uses this proposed extension
mechanism?  Are you doing something with the saved state that would
get confused?

--Andy

Cyrill Gorcunov April 29, 2021, 3:35 p.m. UTC | #5

On Thu, Apr 29, 2021 at 07:44:03AM -0700, Andy Lutomirski wrote:
> On Thu, Apr 29, 2021 at 12:28 AM Cyrill Gorcunov <gorcunov@gmail.com> wrote:
> >
> > On Wed, Apr 28, 2021 at 04:03:55PM -0700, Andy Lutomirski wrote:
> > > On Tue, Apr 27, 2021 at 1:44 PM Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
> > > >
> > > > When shadow stack is enabled, a task's shadow stack states must be saved
> > > > along with the signal context and later restored in sigreturn.  However,
> > > > currently there is no systematic facility for extending a signal context.
> > > > There is some space left in the ucontext, but changing ucontext is likely
> > > > to create compatibility issues and there is not enough space for further
> > > > extensions.
> > > >
> > > > Introduce a signal context extension struct 'sc_ext', which is used to save
> > > > shadow stack restore token address.  The extension is located above the fpu
> > > > states, plus alignment.  The struct can be extended (such as the ibt's
> > > > wait_endbr status to be introduced later), and sc_ext.total_size field
> > > > keeps track of total size.
> > >
> > > I still don't like this.
> > >
> > > Here's how the signal layout works, for better or for worse:
> > >
> > > The kernel has:
> > >
> > > struct rt_sigframe {
> > >     char __user *pretcode;
> > >     struct ucontext uc;
> > >     struct siginfo info;
> > >     /* fp state follows here */
> > > };
> > >
> > > This is roughly the actual signal frame.  But userspace does not have
> > > this struct declared, and user code does not know the sizes of the
> > > fields.  So it's accessed in a nonsensical way.  The signal handler
> >
> > Well, not really. While indeed this is not declared as a part of API
> > the structure is widely used for rt_sigreturn syscall (and we're using
> > it inside criu thus any change here will simply break the restore
> > procedure). Sorry out of time right now, I'll read your mail more
> > carefully once time permit.
> 
> I skimmed the CRIU code.  You appear to declare struct rt_sigframe,
> and you use the offset from the start of rt_sigframe to uc.  You also
> use the offset to the /* fp state follows here */ part, but that's
> unnecessary -- you could just as easily have put the fp state at any
> other address -- the kernel will happily follow the pointer you supply
> regardless of where it points.  So the only issues I can see are if

Yes, since rt_sigreturn is a very late stage of restore we've to preallocate
memory early and surely doing it in one slab is a way more convenient, thus
that's why fp context follows the structure declaration (well, it was at
least I didn't touch this code for years already). Anyway, I meant that
assuming "user code does not know the sizes of the fields" is not exactly
true and strictly speaking current layout became a part of nondeclared api
(simply because kernel use this structure run-time only and even if something
get changed in kernel declaration for this structure the kernel itself will
be fine because it doesn't save it on disk between kernel updates) but for
tools like criu such change might require update. IOW, I mean that changing
rt_sigreturn structure itself is not safe and we can't reorder the members
and theirs size.

Or I misunderstood you, and you meant only the fact that fp may be arbitrary,
pointing for any place in userspace? If so then indeed, fp can be pointing
to context placed anywere since kernel will fetch it from the address during
rt_sigreturn processing.

> you write the fp state on top of something else or if you
> inadvertently fill in the proposed extension part of uc_flags.  Right
> now you seem to be ignoring uc_flags, which I presume means that you
> are filling it in as zero.  Even if the offset of the fp state in the

yes

> kernel rt_sigframe changes, the kernel should still successfully parse
> the signal frame you generate.
> 
> I suppose there is another potential issue: would CRIU have issues if
> the *save* runs on a kernel that uses this proposed extension
> mechanism?  Are you doing something with the saved state that would
> get confused?

You know, actually I guess criu may not work with new extensions, we have
a predefined max xsave frame size and any new extensions to cpu states
will require additional support.

I must confess I didn't follow much on CET internals yet so I must be simply
wrong. CC'ing Anrew just in case.

	Cyrill

Florian Weimer April 30, 2021, 6:45 a.m. UTC | #6

* Andy Lutomirski:

> The kernel has:
>
> struct rt_sigframe {
>     char __user *pretcode;
>     struct ucontext uc;
>     struct siginfo info;
>     /* fp state follows here */
> };
>
> This is roughly the actual signal frame.  But userspace does not have
> this struct declared, and user code does not know the sizes of the
> fields.  So it's accessed in a nonsensical way.  The signal handler
> function is passed a pointer to the whole sigframe implicitly in RSP,
> a pointer to &frame->info in RSI, anda pointer to &frame->uc in RDX.
> User code can *find* the fp state by following a pointer from
> mcontext, which is, in turn, found via uc:
>
> struct ucontext {
>     unsigned long      uc_flags;
>     struct ucontext  *uc_link;
>     stack_t          uc_stack;
>     struct sigcontext uc_mcontext;  <-- fp pointer is in here
>     sigset_t      uc_sigmask;    /* mask last for extensibility */
> };

I must say that I haven't reviewed this in detail, but for historic
reasons, glibc userspace has a differently-sized sigset_t.  So the
kernel ucontext (used in signals) and user ucontext (used for
swapcontext et al.) are already fully decoupled?

Thanks,
Florian

Yu-cheng Yu April 30, 2021, 5 p.m. UTC | #7

On 4/28/2021 4:03 PM, Andy Lutomirski wrote:
> On Tue, Apr 27, 2021 at 1:44 PM Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
>>
>> When shadow stack is enabled, a task's shadow stack states must be saved
>> along with the signal context and later restored in sigreturn.  However,
>> currently there is no systematic facility for extending a signal context.
>> There is some space left in the ucontext, but changing ucontext is likely
>> to create compatibility issues and there is not enough space for further
>> extensions.
>>
>> Introduce a signal context extension struct 'sc_ext', which is used to save
>> shadow stack restore token address.  The extension is located above the fpu
>> states, plus alignment.  The struct can be extended (such as the ibt's
>> wait_endbr status to be introduced later), and sc_ext.total_size field
>> keeps track of total size.
> 
> I still don't like this.
> 
> Here's how the signal layout works, for better or for worse:
> 
> The kernel has:
> 
> struct rt_sigframe {
>      char __user *pretcode;
>      struct ucontext uc;
>      struct siginfo info;
>      /* fp state follows here */
> };
> 
> This is roughly the actual signal frame.  But userspace does not have
> this struct declared, and user code does not know the sizes of the
> fields.  So it's accessed in a nonsensical way.  The signal handler
> function is passed a pointer to the whole sigframe implicitly in RSP,
> a pointer to &frame->info in RSI, anda pointer to &frame->uc in RDX.
> User code can *find* the fp state by following a pointer from
> mcontext, which is, in turn, found via uc:
> 
> struct ucontext {
>      unsigned long      uc_flags;
>      struct ucontext  *uc_link;
>      stack_t          uc_stack;
>      struct sigcontext uc_mcontext;  <-- fp pointer is in here
>      sigset_t      uc_sigmask;    /* mask last for extensibility */
> };
> 
> The kernel, in sigreturn, works a bit differently.  The sigreturn
> variants know the base address of the frame but don't have the benefit
> of receiving pointers to the fields.  So instead the kernel takes
> advantage of the fact that it knows the offset to uc and parses uc
> accordingly.  And the kernel follows the pointer in mcontext to find
> the fp state.  The latter bit is quite important later.  The kernel
> does not parse info at all.
> 
> The fp state is its own mess.  When XSAVE happened, Intel kindly (?)
> gave us a software defined area between the "legacy" x87 region and
> the modern supposedly extensible part.  Linux sticks the following
> structure in that hole:
> 
> struct _fpx_sw_bytes {
>      /*
>       * If set to FP_XSTATE_MAGIC1 then this is an xstate context.
>       * 0 if a legacy frame.
>       */
>      __u32                magic1;
> 
>      /*
>       * Total size of the fpstate area:
>       *
>       *  - if magic1 == 0 then it's sizeof(struct _fpstate)
>       *  - if magic1 == FP_XSTATE_MAGIC1 then it's sizeof(struct _xstate)
>       *    plus extensions (if any)
>       */
>      __u32                extended_size;
> 
>      /*
>       * Feature bit mask (including FP/SSE/extended state) that is present
>       * in the memory layout:
>       */
>      __u64                xfeatures;
> 
>      /*
>       * Actual XSAVE state size, based on the xfeatures saved in the layout.
>       * 'extended_size' is greater than 'xstate_size':
>       */
>      __u32                xstate_size;
> 
>      /* For future use: */
>      __u32                padding[7];
> };
> 
> 
> That's where we are right now upstream.  The kernel has a parser for
> the FPU state that is bugs piled upon bugs and is going to have to be
> rewritten sometime soon.  On top of all this, we have two upcoming
> features, both of which require different kinds of extensions:
> 
> 1. AVX-512.  (Yeah, you thought this story was over a few years ago,
> but no.  And AMX makes it worse.)  To make a long story short, we
> promised user code many years ago that a signal frame fit in 2048
> bytes with some room to spare.  With AVX-512 this is false.  With AMX
> it's so wrong it's not even funny.  The only way out of the mess
> anyone has come up with involves making the length of the FPU state
> vary depending on which features are INIT, i.e. making it more compact
> than "compact" mode is.  This has a side effect: it's no longer
> possible to modify the state in place, because enabling a feature with
> no space allocated will make the structure bigger, and the stack won't
> have room.  Fortunately, one can relocate the entire FPU state, update
> the pointer in mcontext, and the kernel will happily follow the
> pointer.  So new code on a new kernel using a super-compact state
> could expand the state by allocating new memory (on the heap? very
> awkwardly on the stack?) and changing the pointer.  For all we know,
> some code already fiddles with the pointer.  This is great, except
> that your patch sticks more data at the end of the FPU block that no
> one is expecting, and your sigreturn code follows that pointer, and
> will read off into lala land.
> 

Then, what about we don't do that at all.  Is it possible from now on we 
don't stick more data at the end, and take the relocating-fpu approach?

> 2. CET.  CET wants us to find a few more bytes somewhere, and those
> bytes logically belong in ucontext, and here we are.
> 

Fortunately, we can spare CET the need of ucontext extension.  When the 
kernel handles sigreturn, the user-mode shadow stack pointer is right at 
the restore token.  There is no need to put that in ucontext.

However, the WAIT_ENDBR status needs to be saved/restored for signals. 
Since IBT is now dependent on shadow stack, we can use a spare bit of 
the shadow stack restore token for that.

I have tested the change, and will send out another version of the whole 
series.

> This is *almost*, but not quite, easy: struct ucontext is already
> variable length!  Unfortunately, the whole variable length portion is
> used up by uc_sigmask.  So I propose that we introduce a brand new
> bona fide extension mechanism.  It works like this:
> 
> First, we add a struct ucontext_extension at the end.  It looks like:
> 
> struct ucontext_extension {
>    u64 length;  /* sizeof(struct ucontext_extension) */
>    u64 flags;  /* we will want this some day */
>    [CET stuff here]
>    [future stuff here]
> };
> 
> And we locate it by scrounging a word somewhere in ucontext to give
> the offset from the beginning of struct ucontext to
> ucontext_extension.  We indicate the presence of this feature using a
> new uc_flags bit.  I can think of a couple of vaguely reasonable
> places:
> 
> a) the reserved word in sigcontext.  This is fine for x86 but not so
> great if other architectures want to do this.
> 
> b) uc_link.  Fine everywhere but powerpc.  Oops.
> 
> c) use the high bits of uc_flags.  After all, once we add extensions,
> we don't need new flags, so we can steal 16 high bits of uc_flags for
> this.
> 
> I think I'm in favor of (c).  We do:
> 
> (uc_flags & 0xffff0000) == 0: extension not present
> 
> Otherwise the extension region is at ucontext + (uc_flags >> 16).
> 
> And sigreturn finds the extension the same way, because CRIU can
> already migrate a signal frame from one kernel to another, your patch
> breaks this, and having sigreturn hardcode the offset would also break
> it.
> 
> What do you think?
>

Andy Lutomirski April 30, 2021, 5:47 p.m. UTC | #8

On Fri, Apr 30, 2021 at 10:00 AM Yu, Yu-cheng <yu-cheng.yu@intel.com> wrote:
>
> On 4/28/2021 4:03 PM, Andy Lutomirski wrote:
> > On Tue, Apr 27, 2021 at 1:44 PM Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
> >>
> >> When shadow stack is enabled, a task's shadow stack states must be saved
> >> along with the signal context and later restored in sigreturn.  However,
> >> currently there is no systematic facility for extending a signal context.
> >> There is some space left in the ucontext, but changing ucontext is likely
> >> to create compatibility issues and there is not enough space for further
> >> extensions.
> >>
> >> Introduce a signal context extension struct 'sc_ext', which is used to save
> >> shadow stack restore token address.  The extension is located above the fpu
> >> states, plus alignment.  The struct can be extended (such as the ibt's
> >> wait_endbr status to be introduced later), and sc_ext.total_size field
> >> keeps track of total size.
> >
> > I still don't like this.
> >
> > Here's how the signal layout works, for better or for worse:
> >
> > The kernel has:
> >
> > struct rt_sigframe {
> >      char __user *pretcode;
> >      struct ucontext uc;
> >      struct siginfo info;
> >      /* fp state follows here */
> > };
> >
> > This is roughly the actual signal frame.  But userspace does not have
> > this struct declared, and user code does not know the sizes of the
> > fields.  So it's accessed in a nonsensical way.  The signal handler
> > function is passed a pointer to the whole sigframe implicitly in RSP,
> > a pointer to &frame->info in RSI, anda pointer to &frame->uc in RDX.
> > User code can *find* the fp state by following a pointer from
> > mcontext, which is, in turn, found via uc:
> >
> > struct ucontext {
> >      unsigned long      uc_flags;
> >      struct ucontext  *uc_link;
> >      stack_t          uc_stack;
> >      struct sigcontext uc_mcontext;  <-- fp pointer is in here
> >      sigset_t      uc_sigmask;    /* mask last for extensibility */
> > };
> >
> > The kernel, in sigreturn, works a bit differently.  The sigreturn
> > variants know the base address of the frame but don't have the benefit
> > of receiving pointers to the fields.  So instead the kernel takes
> > advantage of the fact that it knows the offset to uc and parses uc
> > accordingly.  And the kernel follows the pointer in mcontext to find
> > the fp state.  The latter bit is quite important later.  The kernel
> > does not parse info at all.
> >
> > The fp state is its own mess.  When XSAVE happened, Intel kindly (?)
> > gave us a software defined area between the "legacy" x87 region and
> > the modern supposedly extensible part.  Linux sticks the following
> > structure in that hole:
> >
> > struct _fpx_sw_bytes {
> >      /*
> >       * If set to FP_XSTATE_MAGIC1 then this is an xstate context.
> >       * 0 if a legacy frame.
> >       */
> >      __u32                magic1;
> >
> >      /*
> >       * Total size of the fpstate area:
> >       *
> >       *  - if magic1 == 0 then it's sizeof(struct _fpstate)
> >       *  - if magic1 == FP_XSTATE_MAGIC1 then it's sizeof(struct _xstate)
> >       *    plus extensions (if any)
> >       */
> >      __u32                extended_size;
> >
> >      /*
> >       * Feature bit mask (including FP/SSE/extended state) that is present
> >       * in the memory layout:
> >       */
> >      __u64                xfeatures;
> >
> >      /*
> >       * Actual XSAVE state size, based on the xfeatures saved in the layout.
> >       * 'extended_size' is greater than 'xstate_size':
> >       */
> >      __u32                xstate_size;
> >
> >      /* For future use: */
> >      __u32                padding[7];
> > };
> >
> >
> > That's where we are right now upstream.  The kernel has a parser for
> > the FPU state that is bugs piled upon bugs and is going to have to be
> > rewritten sometime soon.  On top of all this, we have two upcoming
> > features, both of which require different kinds of extensions:
> >
> > 1. AVX-512.  (Yeah, you thought this story was over a few years ago,
> > but no.  And AMX makes it worse.)  To make a long story short, we
> > promised user code many years ago that a signal frame fit in 2048
> > bytes with some room to spare.  With AVX-512 this is false.  With AMX
> > it's so wrong it's not even funny.  The only way out of the mess
> > anyone has come up with involves making the length of the FPU state
> > vary depending on which features are INIT, i.e. making it more compact
> > than "compact" mode is.  This has a side effect: it's no longer
> > possible to modify the state in place, because enabling a feature with
> > no space allocated will make the structure bigger, and the stack won't
> > have room.  Fortunately, one can relocate the entire FPU state, update
> > the pointer in mcontext, and the kernel will happily follow the
> > pointer.  So new code on a new kernel using a super-compact state
> > could expand the state by allocating new memory (on the heap? very
> > awkwardly on the stack?) and changing the pointer.  For all we know,
> > some code already fiddles with the pointer.  This is great, except
> > that your patch sticks more data at the end of the FPU block that no
> > one is expecting, and your sigreturn code follows that pointer, and
> > will read off into lala land.
> >
>
> Then, what about we don't do that at all.  Is it possible from now on we
> don't stick more data at the end, and take the relocating-fpu approach?
>
> > 2. CET.  CET wants us to find a few more bytes somewhere, and those
> > bytes logically belong in ucontext, and here we are.
> >
>
> Fortunately, we can spare CET the need of ucontext extension.  When the
> kernel handles sigreturn, the user-mode shadow stack pointer is right at
> the restore token.  There is no need to put that in ucontext.

That seems entirely reasonable.  This might also avoid needing to
teach CRIU about CET at all.

>
> However, the WAIT_ENDBR status needs to be saved/restored for signals.
> Since IBT is now dependent on shadow stack, we can use a spare bit of
> the shadow stack restore token for that.

That seems like unnecessary ABI coupling.  We have plenty of bits in
uc_flags, and we have an entire reserved word in sigcontext.  How
about just sticking this bit in one of those places?

--Andy

Yu-cheng Yu April 30, 2021, 6:32 p.m. UTC | #9

On 4/30/2021 10:47 AM, Andy Lutomirski wrote:
> On Fri, Apr 30, 2021 at 10:00 AM Yu, Yu-cheng <yu-cheng.yu@intel.com> wrote:
>>
>> On 4/28/2021 4:03 PM, Andy Lutomirski wrote:
>>> On Tue, Apr 27, 2021 at 1:44 PM Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
>>>>
>>>> When shadow stack is enabled, a task's shadow stack states must be saved
>>>> along with the signal context and later restored in sigreturn.  However,
>>>> currently there is no systematic facility for extending a signal context.
>>>> There is some space left in the ucontext, but changing ucontext is likely
>>>> to create compatibility issues and there is not enough space for further
>>>> extensions.
>>>>
>>>> Introduce a signal context extension struct 'sc_ext', which is used to save
>>>> shadow stack restore token address.  The extension is located above the fpu
>>>> states, plus alignment.  The struct can be extended (such as the ibt's
>>>> wait_endbr status to be introduced later), and sc_ext.total_size field
>>>> keeps track of total size.
>>>
>>> I still don't like this.
>>>
>>> Here's how the signal layout works, for better or for worse:
>>>
>>> The kernel has:
>>>
>>> struct rt_sigframe {
>>>       char __user *pretcode;
>>>       struct ucontext uc;
>>>       struct siginfo info;
>>>       /* fp state follows here */
>>> };
>>>
>>> This is roughly the actual signal frame.  But userspace does not have
>>> this struct declared, and user code does not know the sizes of the
>>> fields.  So it's accessed in a nonsensical way.  The signal handler
>>> function is passed a pointer to the whole sigframe implicitly in RSP,
>>> a pointer to &frame->info in RSI, anda pointer to &frame->uc in RDX.
>>> User code can *find* the fp state by following a pointer from
>>> mcontext, which is, in turn, found via uc:
>>>
>>> struct ucontext {
>>>       unsigned long      uc_flags;
>>>       struct ucontext  *uc_link;
>>>       stack_t          uc_stack;
>>>       struct sigcontext uc_mcontext;  <-- fp pointer is in here
>>>       sigset_t      uc_sigmask;    /* mask last for extensibility */
>>> };
>>>
>>> The kernel, in sigreturn, works a bit differently.  The sigreturn
>>> variants know the base address of the frame but don't have the benefit
>>> of receiving pointers to the fields.  So instead the kernel takes
>>> advantage of the fact that it knows the offset to uc and parses uc
>>> accordingly.  And the kernel follows the pointer in mcontext to find
>>> the fp state.  The latter bit is quite important later.  The kernel
>>> does not parse info at all.
>>>
>>> The fp state is its own mess.  When XSAVE happened, Intel kindly (?)
>>> gave us a software defined area between the "legacy" x87 region and
>>> the modern supposedly extensible part.  Linux sticks the following
>>> structure in that hole:
>>>
>>> struct _fpx_sw_bytes {
>>>       /*
>>>        * If set to FP_XSTATE_MAGIC1 then this is an xstate context.
>>>        * 0 if a legacy frame.
>>>        */
>>>       __u32                magic1;
>>>
>>>       /*
>>>        * Total size of the fpstate area:
>>>        *
>>>        *  - if magic1 == 0 then it's sizeof(struct _fpstate)
>>>        *  - if magic1 == FP_XSTATE_MAGIC1 then it's sizeof(struct _xstate)
>>>        *    plus extensions (if any)
>>>        */
>>>       __u32                extended_size;
>>>
>>>       /*
>>>        * Feature bit mask (including FP/SSE/extended state) that is present
>>>        * in the memory layout:
>>>        */
>>>       __u64                xfeatures;
>>>
>>>       /*
>>>        * Actual XSAVE state size, based on the xfeatures saved in the layout.
>>>        * 'extended_size' is greater than 'xstate_size':
>>>        */
>>>       __u32                xstate_size;
>>>
>>>       /* For future use: */
>>>       __u32                padding[7];
>>> };
>>>
>>>
>>> That's where we are right now upstream.  The kernel has a parser for
>>> the FPU state that is bugs piled upon bugs and is going to have to be
>>> rewritten sometime soon.  On top of all this, we have two upcoming
>>> features, both of which require different kinds of extensions:
>>>
>>> 1. AVX-512.  (Yeah, you thought this story was over a few years ago,
>>> but no.  And AMX makes it worse.)  To make a long story short, we
>>> promised user code many years ago that a signal frame fit in 2048
>>> bytes with some room to spare.  With AVX-512 this is false.  With AMX
>>> it's so wrong it's not even funny.  The only way out of the mess
>>> anyone has come up with involves making the length of the FPU state
>>> vary depending on which features are INIT, i.e. making it more compact
>>> than "compact" mode is.  This has a side effect: it's no longer
>>> possible to modify the state in place, because enabling a feature with
>>> no space allocated will make the structure bigger, and the stack won't
>>> have room.  Fortunately, one can relocate the entire FPU state, update
>>> the pointer in mcontext, and the kernel will happily follow the
>>> pointer.  So new code on a new kernel using a super-compact state
>>> could expand the state by allocating new memory (on the heap? very
>>> awkwardly on the stack?) and changing the pointer.  For all we know,
>>> some code already fiddles with the pointer.  This is great, except
>>> that your patch sticks more data at the end of the FPU block that no
>>> one is expecting, and your sigreturn code follows that pointer, and
>>> will read off into lala land.
>>>
>>
>> Then, what about we don't do that at all.  Is it possible from now on we
>> don't stick more data at the end, and take the relocating-fpu approach?
>>
>>> 2. CET.  CET wants us to find a few more bytes somewhere, and those
>>> bytes logically belong in ucontext, and here we are.
>>>
>>
>> Fortunately, we can spare CET the need of ucontext extension.  When the
>> kernel handles sigreturn, the user-mode shadow stack pointer is right at
>> the restore token.  There is no need to put that in ucontext.
> 
> That seems entirely reasonable.  This might also avoid needing to
> teach CRIU about CET at all.
> 
>>
>> However, the WAIT_ENDBR status needs to be saved/restored for signals.
>> Since IBT is now dependent on shadow stack, we can use a spare bit of
>> the shadow stack restore token for that.
> 
> That seems like unnecessary ABI coupling.  We have plenty of bits in
> uc_flags, and we have an entire reserved word in sigcontext.  How
> about just sticking this bit in one of those places?

Yes, I will make it UC_WAIT_ENDBR.

Thanks,
Yu-cheng

Andy Lutomirski May 2, 2021, 11:23 p.m. UTC | #10

On Fri, Apr 30, 2021 at 10:47 AM Andy Lutomirski <luto@kernel.org> wrote:
>
> On Fri, Apr 30, 2021 at 10:00 AM Yu, Yu-cheng <yu-cheng.yu@intel.com> wrote:
> >
> > On 4/28/2021 4:03 PM, Andy Lutomirski wrote:
> > > On Tue, Apr 27, 2021 at 1:44 PM Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
> > >>
> > >> When shadow stack is enabled, a task's shadow stack states must be saved
> > >> along with the signal context and later restored in sigreturn.  However,
> > >> currently there is no systematic facility for extending a signal context.
> > >> There is some space left in the ucontext, but changing ucontext is likely
> > >> to create compatibility issues and there is not enough space for further
> > >> extensions.
> > >>
> > >> Introduce a signal context extension struct 'sc_ext', which is used to save
> > >> shadow stack restore token address.  The extension is located above the fpu
> > >> states, plus alignment.  The struct can be extended (such as the ibt's
> > >> wait_endbr status to be introduced later), and sc_ext.total_size field
> > >> keeps track of total size.
> > >
> > > I still don't like this.
> > >
> > > Here's how the signal layout works, for better or for worse:
> > >
> > > The kernel has:
> > >
> > > struct rt_sigframe {
> > >      char __user *pretcode;
> > >      struct ucontext uc;
> > >      struct siginfo info;
> > >      /* fp state follows here */
> > > };
> > >
> > > This is roughly the actual signal frame.  But userspace does not have
> > > this struct declared, and user code does not know the sizes of the
> > > fields.  So it's accessed in a nonsensical way.  The signal handler
> > > function is passed a pointer to the whole sigframe implicitly in RSP,
> > > a pointer to &frame->info in RSI, anda pointer to &frame->uc in RDX.
> > > User code can *find* the fp state by following a pointer from
> > > mcontext, which is, in turn, found via uc:
> > >
> > > struct ucontext {
> > >      unsigned long      uc_flags;
> > >      struct ucontext  *uc_link;
> > >      stack_t          uc_stack;
> > >      struct sigcontext uc_mcontext;  <-- fp pointer is in here
> > >      sigset_t      uc_sigmask;    /* mask last for extensibility */
> > > };
> > >
> > > The kernel, in sigreturn, works a bit differently.  The sigreturn
> > > variants know the base address of the frame but don't have the benefit
> > > of receiving pointers to the fields.  So instead the kernel takes
> > > advantage of the fact that it knows the offset to uc and parses uc
> > > accordingly.  And the kernel follows the pointer in mcontext to find
> > > the fp state.  The latter bit is quite important later.  The kernel
> > > does not parse info at all.
> > >
> > > The fp state is its own mess.  When XSAVE happened, Intel kindly (?)
> > > gave us a software defined area between the "legacy" x87 region and
> > > the modern supposedly extensible part.  Linux sticks the following
> > > structure in that hole:
> > >
> > > struct _fpx_sw_bytes {
> > >      /*
> > >       * If set to FP_XSTATE_MAGIC1 then this is an xstate context.
> > >       * 0 if a legacy frame.
> > >       */
> > >      __u32                magic1;
> > >
> > >      /*
> > >       * Total size of the fpstate area:
> > >       *
> > >       *  - if magic1 == 0 then it's sizeof(struct _fpstate)
> > >       *  - if magic1 == FP_XSTATE_MAGIC1 then it's sizeof(struct _xstate)
> > >       *    plus extensions (if any)
> > >       */
> > >      __u32                extended_size;
> > >
> > >      /*
> > >       * Feature bit mask (including FP/SSE/extended state) that is present
> > >       * in the memory layout:
> > >       */
> > >      __u64                xfeatures;
> > >
> > >      /*
> > >       * Actual XSAVE state size, based on the xfeatures saved in the layout.
> > >       * 'extended_size' is greater than 'xstate_size':
> > >       */
> > >      __u32                xstate_size;
> > >
> > >      /* For future use: */
> > >      __u32                padding[7];
> > > };
> > >
> > >
> > > That's where we are right now upstream.  The kernel has a parser for
> > > the FPU state that is bugs piled upon bugs and is going to have to be
> > > rewritten sometime soon.  On top of all this, we have two upcoming
> > > features, both of which require different kinds of extensions:
> > >
> > > 1. AVX-512.  (Yeah, you thought this story was over a few years ago,
> > > but no.  And AMX makes it worse.)  To make a long story short, we
> > > promised user code many years ago that a signal frame fit in 2048
> > > bytes with some room to spare.  With AVX-512 this is false.  With AMX
> > > it's so wrong it's not even funny.  The only way out of the mess
> > > anyone has come up with involves making the length of the FPU state
> > > vary depending on which features are INIT, i.e. making it more compact
> > > than "compact" mode is.  This has a side effect: it's no longer
> > > possible to modify the state in place, because enabling a feature with
> > > no space allocated will make the structure bigger, and the stack won't
> > > have room.  Fortunately, one can relocate the entire FPU state, update
> > > the pointer in mcontext, and the kernel will happily follow the
> > > pointer.  So new code on a new kernel using a super-compact state
> > > could expand the state by allocating new memory (on the heap? very
> > > awkwardly on the stack?) and changing the pointer.  For all we know,
> > > some code already fiddles with the pointer.  This is great, except
> > > that your patch sticks more data at the end of the FPU block that no
> > > one is expecting, and your sigreturn code follows that pointer, and
> > > will read off into lala land.
> > >
> >
> > Then, what about we don't do that at all.  Is it possible from now on we
> > don't stick more data at the end, and take the relocating-fpu approach?
> >
> > > 2. CET.  CET wants us to find a few more bytes somewhere, and those
> > > bytes logically belong in ucontext, and here we are.
> > >
> >
> > Fortunately, we can spare CET the need of ucontext extension.  When the
> > kernel handles sigreturn, the user-mode shadow stack pointer is right at
> > the restore token.  There is no need to put that in ucontext.
>
> That seems entirely reasonable.  This might also avoid needing to
> teach CRIU about CET at all.

Wait, what's the actual shadow stack token format?  And is the token
on the new stack or the old stack when sigaltstack is in use?  For
that matter, is there any support for an alternate shadow stack for
signals?

--Andy

H. Peter Anvin May 3, 2021, 6:03 a.m. UTC | #11

<vedvyas.shanbhogue@intel.com>,Dave Martin <Dave.Martin@arm.com>,Weijiang Yang <weijiang.yang@intel.com>,Pengfei Xu <pengfei.xu@intel.com>,Haitao Huang <haitao.huang@intel.com>
From: "H. Peter Anvin" <hpa@zytor.com>
Message-ID: <2360408E-9967-469C-96AF-E8AC40044021@zytor.com>

There are a few words at the end of the FXSAVE area for software use for this reason: so we can extend the signal frame – originally to enable saving the XSAVE state but that doesn't use up the whole software available area.

On May 2, 2021 4:23:49 PM PDT, Andy Lutomirski <luto@kernel.org> wrote:
>On Fri, Apr 30, 2021 at 10:47 AM Andy Lutomirski <luto@kernel.org>
>wrote:
>>
>> On Fri, Apr 30, 2021 at 10:00 AM Yu, Yu-cheng <yu-cheng.yu@intel.com>
>wrote:
>> >
>> > On 4/28/2021 4:03 PM, Andy Lutomirski wrote:
>> > > On Tue, Apr 27, 2021 at 1:44 PM Yu-cheng Yu
><yu-cheng.yu@intel.com> wrote:
>> > >>
>> > >> When shadow stack is enabled, a task's shadow stack states must
>be saved
>> > >> along with the signal context and later restored in sigreturn. 
>However,
>> > >> currently there is no systematic facility for extending a signal
>context.
>> > >> There is some space left in the ucontext, but changing ucontext
>is likely
>> > >> to create compatibility issues and there is not enough space for
>further
>> > >> extensions.
>> > >>
>> > >> Introduce a signal context extension struct 'sc_ext', which is
>used to save
>> > >> shadow stack restore token address.  The extension is located
>above the fpu
>> > >> states, plus alignment.  The struct can be extended (such as the
>ibt's
>> > >> wait_endbr status to be introduced later), and sc_ext.total_size
>field
>> > >> keeps track of total size.
>> > >
>> > > I still don't like this.
>> > >
>> > > Here's how the signal layout works, for better or for worse:
>> > >
>> > > The kernel has:
>> > >
>> > > struct rt_sigframe {
>> > >      char __user *pretcode;
>> > >      struct ucontext uc;
>> > >      struct siginfo info;
>> > >      /* fp state follows here */
>> > > };
>> > >
>> > > This is roughly the actual signal frame.  But userspace does not
>have
>> > > this struct declared, and user code does not know the sizes of
>the
>> > > fields.  So it's accessed in a nonsensical way.  The signal
>handler
>> > > function is passed a pointer to the whole sigframe implicitly in
>RSP,
>> > > a pointer to &frame->info in RSI, anda pointer to &frame->uc in
>RDX.
>> > > User code can *find* the fp state by following a pointer from
>> > > mcontext, which is, in turn, found via uc:
>> > >
>> > > struct ucontext {
>> > >      unsigned long      uc_flags;
>> > >      struct ucontext  *uc_link;
>> > >      stack_t          uc_stack;
>> > >      struct sigcontext uc_mcontext;  <-- fp pointer is in here
>> > >      sigset_t      uc_sigmask;    /* mask last for extensibility
>*/
>> > > };
>> > >
>> > > The kernel, in sigreturn, works a bit differently.  The sigreturn
>> > > variants know the base address of the frame but don't have the
>benefit
>> > > of receiving pointers to the fields.  So instead the kernel takes
>> > > advantage of the fact that it knows the offset to uc and parses
>uc
>> > > accordingly.  And the kernel follows the pointer in mcontext to
>find
>> > > the fp state.  The latter bit is quite important later.  The
>kernel
>> > > does not parse info at all.
>> > >
>> > > The fp state is its own mess.  When XSAVE happened, Intel kindly
>(?)
>> > > gave us a software defined area between the "legacy" x87 region
>and
>> > > the modern supposedly extensible part.  Linux sticks the
>following
>> > > structure in that hole:
>> > >
>> > > struct _fpx_sw_bytes {
>> > >      /*
>> > >       * If set to FP_XSTATE_MAGIC1 then this is an xstate
>context.
>> > >       * 0 if a legacy frame.
>> > >       */
>> > >      __u32                magic1;
>> > >
>> > >      /*
>> > >       * Total size of the fpstate area:
>> > >       *
>> > >       *  - if magic1 == 0 then it's sizeof(struct _fpstate)
>> > >       *  - if magic1 == FP_XSTATE_MAGIC1 then it's sizeof(struct
>_xstate)
>> > >       *    plus extensions (if any)
>> > >       */
>> > >      __u32                extended_size;
>> > >
>> > >      /*
>> > >       * Feature bit mask (including FP/SSE/extended state) that
>is present
>> > >       * in the memory layout:
>> > >       */
>> > >      __u64                xfeatures;
>> > >
>> > >      /*
>> > >       * Actual XSAVE state size, based on the xfeatures saved in
>the layout.
>> > >       * 'extended_size' is greater than 'xstate_size':
>> > >       */
>> > >      __u32                xstate_size;
>> > >
>> > >      /* For future use: */
>> > >      __u32                padding[7];
>> > > };
>> > >
>> > >
>> > > That's where we are right now upstream.  The kernel has a parser
>for
>> > > the FPU state that is bugs piled upon bugs and is going to have
>to be
>> > > rewritten sometime soon.  On top of all this, we have two
>upcoming
>> > > features, both of which require different kinds of extensions:
>> > >
>> > > 1. AVX-512.  (Yeah, you thought this story was over a few years
>ago,
>> > > but no.  And AMX makes it worse.)  To make a long story short, we
>> > > promised user code many years ago that a signal frame fit in 2048
>> > > bytes with some room to spare.  With AVX-512 this is false.  With
>AMX
>> > > it's so wrong it's not even funny.  The only way out of the mess
>> > > anyone has come up with involves making the length of the FPU
>state
>> > > vary depending on which features are INIT, i.e. making it more
>compact
>> > > than "compact" mode is.  This has a side effect: it's no longer
>> > > possible to modify the state in place, because enabling a feature
>with
>> > > no space allocated will make the structure bigger, and the stack
>won't
>> > > have room.  Fortunately, one can relocate the entire FPU state,
>update
>> > > the pointer in mcontext, and the kernel will happily follow the
>> > > pointer.  So new code on a new kernel using a super-compact state
>> > > could expand the state by allocating new memory (on the heap?
>very
>> > > awkwardly on the stack?) and changing the pointer.  For all we
>know,
>> > > some code already fiddles with the pointer.  This is great,
>except
>> > > that your patch sticks more data at the end of the FPU block that
>no
>> > > one is expecting, and your sigreturn code follows that pointer,
>and
>> > > will read off into lala land.
>> > >
>> >
>> > Then, what about we don't do that at all.  Is it possible from now
>on we
>> > don't stick more data at the end, and take the relocating-fpu
>approach?
>> >
>> > > 2. CET.  CET wants us to find a few more bytes somewhere, and
>those
>> > > bytes logically belong in ucontext, and here we are.
>> > >
>> >
>> > Fortunately, we can spare CET the need of ucontext extension.  When
>the
>> > kernel handles sigreturn, the user-mode shadow stack pointer is
>right at
>> > the restore token.  There is no need to put that in ucontext.
>>
>> That seems entirely reasonable.  This might also avoid needing to
>> teach CRIU about CET at all.
>
>Wait, what's the actual shadow stack token format?  And is the token
>on the new stack or the old stack when sigaltstack is in use?  For
>that matter, is there any support for an alternate shadow stack for
>signals?
>
>--Andy

Yu-cheng Yu May 3, 2021, 3:13 p.m. UTC | #12

On 5/2/2021 4:23 PM, Andy Lutomirski wrote:
> On Fri, Apr 30, 2021 at 10:47 AM Andy Lutomirski <luto@kernel.org> wrote:
>>
>> On Fri, Apr 30, 2021 at 10:00 AM Yu, Yu-cheng <yu-cheng.yu@intel.com> wrote:
>>>
>>> On 4/28/2021 4:03 PM, Andy Lutomirski wrote:
>>>> On Tue, Apr 27, 2021 at 1:44 PM Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
>>>>>
>>>>> When shadow stack is enabled, a task's shadow stack states must be saved
>>>>> along with the signal context and later restored in sigreturn.  However,
>>>>> currently there is no systematic facility for extending a signal context.
>>>>> There is some space left in the ucontext, but changing ucontext is likely
>>>>> to create compatibility issues and there is not enough space for further
>>>>> extensions.
>>>>>
>>>>> Introduce a signal context extension struct 'sc_ext', which is used to save
>>>>> shadow stack restore token address.  The extension is located above the fpu
>>>>> states, plus alignment.  The struct can be extended (such as the ibt's
>>>>> wait_endbr status to be introduced later), and sc_ext.total_size field
>>>>> keeps track of total size.
>>>>
>>>> I still don't like this.
>>>>
>>>> Here's how the signal layout works, for better or for worse:
>>>>

[...]

>>>>
>>>> That's where we are right now upstream.  The kernel has a parser for
>>>> the FPU state that is bugs piled upon bugs and is going to have to be
>>>> rewritten sometime soon.  On top of all this, we have two upcoming
>>>> features, both of which require different kinds of extensions:
>>>>
>>>> 1. AVX-512.  (Yeah, you thought this story was over a few years ago,
>>>> but no.  And AMX makes it worse.)  To make a long story short, we
>>>> promised user code many years ago that a signal frame fit in 2048
>>>> bytes with some room to spare.  With AVX-512 this is false.  With AMX
>>>> it's so wrong it's not even funny.  The only way out of the mess
>>>> anyone has come up with involves making the length of the FPU state
>>>> vary depending on which features are INIT, i.e. making it more compact
>>>> than "compact" mode is.  This has a side effect: it's no longer
>>>> possible to modify the state in place, because enabling a feature with
>>>> no space allocated will make the structure bigger, and the stack won't
>>>> have room.  Fortunately, one can relocate the entire FPU state, update
>>>> the pointer in mcontext, and the kernel will happily follow the
>>>> pointer.  So new code on a new kernel using a super-compact state
>>>> could expand the state by allocating new memory (on the heap? very
>>>> awkwardly on the stack?) and changing the pointer.  For all we know,
>>>> some code already fiddles with the pointer.  This is great, except
>>>> that your patch sticks more data at the end of the FPU block that no
>>>> one is expecting, and your sigreturn code follows that pointer, and
>>>> will read off into lala land.
>>>>
>>>
>>> Then, what about we don't do that at all.  Is it possible from now on we
>>> don't stick more data at the end, and take the relocating-fpu approach?
>>>
>>>> 2. CET.  CET wants us to find a few more bytes somewhere, and those
>>>> bytes logically belong in ucontext, and here we are.
>>>>
>>>
>>> Fortunately, we can spare CET the need of ucontext extension.  When the
>>> kernel handles sigreturn, the user-mode shadow stack pointer is right at
>>> the restore token.  There is no need to put that in ucontext.
>>
>> That seems entirely reasonable.  This might also avoid needing to
>> teach CRIU about CET at all.
> 
> Wait, what's the actual shadow stack token format?  And is the token
> on the new stack or the old stack when sigaltstack is in use?  For
> that matter, is there any support for an alternate shadow stack for
> signals?
> 

The restore token is a pointer pointing directly above itself and bit[0] 
indicates 64-bit mode.

Because the shadow stack stores only return addresses, there is no 
alternate shadow stack.  However, the application can allocate and 
switch to a new shadow stack.

Yu-cheng

Andy Lutomirski May 3, 2021, 3:29 p.m. UTC | #13

> On May 3, 2021, at 8:14 AM, Yu, Yu-cheng <yu-cheng.yu@intel.com> wrote:
> 
> On 5/2/2021 4:23 PM, Andy Lutomirski wrote:
>>> On Fri, Apr 30, 2021 at 10:47 AM Andy Lutomirski <luto@kernel.org> wrote:
>>> 
>>> On Fri, Apr 30, 2021 at 10:00 AM Yu, Yu-cheng <yu-cheng.yu@intel.com> wrote:
>>>> 
>>>> On 4/28/2021 4:03 PM, Andy Lutomirski wrote:
>>>>> On Tue, Apr 27, 2021 at 1:44 PM Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
>>>>>> 
>>>>>> When shadow stack is enabled, a task's shadow stack states must be saved
>>>>>> along with the signal context and later restored in sigreturn.  However,
>>>>>> currently there is no systematic facility for extending a signal context.
>>>>>> There is some space left in the ucontext, but changing ucontext is likely
>>>>>> to create compatibility issues and there is not enough space for further
>>>>>> extensions.
>>>>>> 
>>>>>> Introduce a signal context extension struct 'sc_ext', which is used to save
>>>>>> shadow stack restore token address.  The extension is located above the fpu
>>>>>> states, plus alignment.  The struct can be extended (such as the ibt's
>>>>>> wait_endbr status to be introduced later), and sc_ext.total_size field
>>>>>> keeps track of total size.
>>>>> 
>>>>> I still don't like this.
>>>>> 
>>>>> Here's how the signal layout works, for better or for worse:
>>>>> 
> 
> [...]
> 
>>>>> 
>>>>> That's where we are right now upstream.  The kernel has a parser for
>>>>> the FPU state that is bugs piled upon bugs and is going to have to be
>>>>> rewritten sometime soon.  On top of all this, we have two upcoming
>>>>> features, both of which require different kinds of extensions:
>>>>> 
>>>>> 1. AVX-512.  (Yeah, you thought this story was over a few years ago,
>>>>> but no.  And AMX makes it worse.)  To make a long story short, we
>>>>> promised user code many years ago that a signal frame fit in 2048
>>>>> bytes with some room to spare.  With AVX-512 this is false.  With AMX
>>>>> it's so wrong it's not even funny.  The only way out of the mess
>>>>> anyone has come up with involves making the length of the FPU state
>>>>> vary depending on which features are INIT, i.e. making it more compact
>>>>> than "compact" mode is.  This has a side effect: it's no longer
>>>>> possible to modify the state in place, because enabling a feature with
>>>>> no space allocated will make the structure bigger, and the stack won't
>>>>> have room.  Fortunately, one can relocate the entire FPU state, update
>>>>> the pointer in mcontext, and the kernel will happily follow the
>>>>> pointer.  So new code on a new kernel using a super-compact state
>>>>> could expand the state by allocating new memory (on the heap? very
>>>>> awkwardly on the stack?) and changing the pointer.  For all we know,
>>>>> some code already fiddles with the pointer.  This is great, except
>>>>> that your patch sticks more data at the end of the FPU block that no
>>>>> one is expecting, and your sigreturn code follows that pointer, and
>>>>> will read off into lala land.
>>>>> 
>>>> 
>>>> Then, what about we don't do that at all.  Is it possible from now on we
>>>> don't stick more data at the end, and take the relocating-fpu approach?
>>>> 
>>>>> 2. CET.  CET wants us to find a few more bytes somewhere, and those
>>>>> bytes logically belong in ucontext, and here we are.
>>>>> 
>>>> 
>>>> Fortunately, we can spare CET the need of ucontext extension.  When the
>>>> kernel handles sigreturn, the user-mode shadow stack pointer is right at
>>>> the restore token.  There is no need to put that in ucontext.
>>> 
>>> That seems entirely reasonable.  This might also avoid needing to
>>> teach CRIU about CET at all.
>> Wait, what's the actual shadow stack token format?  And is the token
>> on the new stack or the old stack when sigaltstack is in use?  For
>> that matter, is there any support for an alternate shadow stack for
>> signals?
> 
> The restore token is a pointer pointing directly above itself and bit[0] indicates 64-bit mode.
> 
> Because the shadow stack stores only return addresses, there is no alternate shadow stack.  However, the application can allocate and switch to a new shadow stack.

I think we should make the ABI support an alternate shadow stack even if we don’t implement it initially. After all, some day someone might want to register a handler for shadow stack overflow.

> 
> Yu-cheng

Yu-cheng Yu May 3, 2021, 8:25 p.m. UTC | #14

On 5/3/2021 8:29 AM, Andy Lutomirski wrote:
> 
>> On May 3, 2021, at 8:14 AM, Yu, Yu-cheng <yu-cheng.yu@intel.com> wrote:
>>
>> On 5/2/2021 4:23 PM, Andy Lutomirski wrote:
>>>> On Fri, Apr 30, 2021 at 10:47 AM Andy Lutomirski <luto@kernel.org> wrote:
>>>>
>>>> On Fri, Apr 30, 2021 at 10:00 AM Yu, Yu-cheng <yu-cheng.yu@intel.com> wrote:
>>>>>
>>>>> On 4/28/2021 4:03 PM, Andy Lutomirski wrote:
>>>>>> On Tue, Apr 27, 2021 at 1:44 PM Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
>>>>>>>
>>>>>>> When shadow stack is enabled, a task's shadow stack states must be saved
>>>>>>> along with the signal context and later restored in sigreturn.  However,
>>>>>>> currently there is no systematic facility for extending a signal context.
>>>>>>> There is some space left in the ucontext, but changing ucontext is likely
>>>>>>> to create compatibility issues and there is not enough space for further
>>>>>>> extensions.
>>>>>>>
>>>>>>> Introduce a signal context extension struct 'sc_ext', which is used to save
>>>>>>> shadow stack restore token address.  The extension is located above the fpu
>>>>>>> states, plus alignment.  The struct can be extended (such as the ibt's
>>>>>>> wait_endbr status to be introduced later), and sc_ext.total_size field
>>>>>>> keeps track of total size.
>>>>>>
>>>>>> I still don't like this.
>>>>>>
>>>>>> Here's how the signal layout works, for better or for worse:
>>>>>>
>>
>> [...]
>>
>>>>>>
>>>>>> That's where we are right now upstream.  The kernel has a parser for
>>>>>> the FPU state that is bugs piled upon bugs and is going to have to be
>>>>>> rewritten sometime soon.  On top of all this, we have two upcoming
>>>>>> features, both of which require different kinds of extensions:
>>>>>>
>>>>>> 1. AVX-512.  (Yeah, you thought this story was over a few years ago,
>>>>>> but no.  And AMX makes it worse.)  To make a long story short, we
>>>>>> promised user code many years ago that a signal frame fit in 2048
>>>>>> bytes with some room to spare.  With AVX-512 this is false.  With AMX
>>>>>> it's so wrong it's not even funny.  The only way out of the mess
>>>>>> anyone has come up with involves making the length of the FPU state
>>>>>> vary depending on which features are INIT, i.e. making it more compact
>>>>>> than "compact" mode is.  This has a side effect: it's no longer
>>>>>> possible to modify the state in place, because enabling a feature with
>>>>>> no space allocated will make the structure bigger, and the stack won't
>>>>>> have room.  Fortunately, one can relocate the entire FPU state, update
>>>>>> the pointer in mcontext, and the kernel will happily follow the
>>>>>> pointer.  So new code on a new kernel using a super-compact state
>>>>>> could expand the state by allocating new memory (on the heap? very
>>>>>> awkwardly on the stack?) and changing the pointer.  For all we know,
>>>>>> some code already fiddles with the pointer.  This is great, except
>>>>>> that your patch sticks more data at the end of the FPU block that no
>>>>>> one is expecting, and your sigreturn code follows that pointer, and
>>>>>> will read off into lala land.
>>>>>>
>>>>>
>>>>> Then, what about we don't do that at all.  Is it possible from now on we
>>>>> don't stick more data at the end, and take the relocating-fpu approach?
>>>>>
>>>>>> 2. CET.  CET wants us to find a few more bytes somewhere, and those
>>>>>> bytes logically belong in ucontext, and here we are.
>>>>>>
>>>>>
>>>>> Fortunately, we can spare CET the need of ucontext extension.  When the
>>>>> kernel handles sigreturn, the user-mode shadow stack pointer is right at
>>>>> the restore token.  There is no need to put that in ucontext.
>>>>
>>>> That seems entirely reasonable.  This might also avoid needing to
>>>> teach CRIU about CET at all.
>>> Wait, what's the actual shadow stack token format?  And is the token
>>> on the new stack or the old stack when sigaltstack is in use?  For
>>> that matter, is there any support for an alternate shadow stack for
>>> signals?
>>
>> The restore token is a pointer pointing directly above itself and bit[0] indicates 64-bit mode.
>>
>> Because the shadow stack stores only return addresses, there is no alternate shadow stack.  However, the application can allocate and switch to a new shadow stack.
> 
> I think we should make the ABI support an alternate shadow stack even if we don’t implement it initially. After all, some day someone might want to register a handler for shadow stack overflow.
> 

Agree.  We can probably add something in parallel of sigaltstack(), and 
let the user choose separately alternate normal/shadow stacks.

Thanks,
Yu-cheng

Yu-cheng Yu May 4, 2021, 8:49 p.m. UTC | #15

On 4/30/2021 11:32 AM, Yu, Yu-cheng wrote:
> On 4/30/2021 10:47 AM, Andy Lutomirski wrote:
>> On Fri, Apr 30, 2021 at 10:00 AM Yu, Yu-cheng <yu-cheng.yu@intel.com> 
>> wrote:
>>>
>>> On 4/28/2021 4:03 PM, Andy Lutomirski wrote:
>>>> On Tue, Apr 27, 2021 at 1:44 PM Yu-cheng Yu <yu-cheng.yu@intel.com> 
>>>> wrote:
>>>>>
>>>>> When shadow stack is enabled, a task's shadow stack states must be 
>>>>> saved
>>>>> along with the signal context and later restored in sigreturn.  
>>>>> However,
>>>>> currently there is no systematic facility for extending a signal 
>>>>> context.
>>>>> There is some space left in the ucontext, but changing ucontext is 
>>>>> likely
>>>>> to create compatibility issues and there is not enough space for 
>>>>> further
>>>>> extensions.
>>>>>
>>>>> Introduce a signal context extension struct 'sc_ext', which is used 
>>>>> to save
>>>>> shadow stack restore token address.  The extension is located above 
>>>>> the fpu
>>>>> states, plus alignment.  The struct can be extended (such as the ibt's
>>>>> wait_endbr status to be introduced later), and sc_ext.total_size field
>>>>> keeps track of total size.
>>>>
>>>> I still don't like this.
>>>>
>>>> Here's how the signal layout works, for better or for worse:
>>>>
>>>> The kernel has:
>>>>
>>>> struct rt_sigframe {
>>>>       char __user *pretcode;
>>>>       struct ucontext uc;
>>>>       struct siginfo info;
>>>>       /* fp state follows here */
>>>> };
>>>>
>>>> This is roughly the actual signal frame.  But userspace does not have
>>>> this struct declared, and user code does not know the sizes of the
>>>> fields.  So it's accessed in a nonsensical way.  The signal handler
>>>> function is passed a pointer to the whole sigframe implicitly in RSP,
>>>> a pointer to &frame->info in RSI, anda pointer to &frame->uc in RDX.
>>>> User code can *find* the fp state by following a pointer from
>>>> mcontext, which is, in turn, found via uc:
>>>>
>>>> struct ucontext {
>>>>       unsigned long      uc_flags;
>>>>       struct ucontext  *uc_link;
>>>>       stack_t          uc_stack;
>>>>       struct sigcontext uc_mcontext;  <-- fp pointer is in here
>>>>       sigset_t      uc_sigmask;    /* mask last for extensibility */
>>>> };
>>>>
>>>> The kernel, in sigreturn, works a bit differently.  The sigreturn
>>>> variants know the base address of the frame but don't have the benefit
>>>> of receiving pointers to the fields.  So instead the kernel takes
>>>> advantage of the fact that it knows the offset to uc and parses uc
>>>> accordingly.  And the kernel follows the pointer in mcontext to find
>>>> the fp state.  The latter bit is quite important later.  The kernel
>>>> does not parse info at all.
>>>>
>>>> The fp state is its own mess.  When XSAVE happened, Intel kindly (?)
>>>> gave us a software defined area between the "legacy" x87 region and
>>>> the modern supposedly extensible part.  Linux sticks the following
>>>> structure in that hole:
>>>>
>>>> struct _fpx_sw_bytes {
>>>>       /*
>>>>        * If set to FP_XSTATE_MAGIC1 then this is an xstate context.
>>>>        * 0 if a legacy frame.
>>>>        */
>>>>       __u32                magic1;
>>>>
>>>>       /*
>>>>        * Total size of the fpstate area:
>>>>        *
>>>>        *  - if magic1 == 0 then it's sizeof(struct _fpstate)
>>>>        *  - if magic1 == FP_XSTATE_MAGIC1 then it's sizeof(struct 
>>>> _xstate)
>>>>        *    plus extensions (if any)
>>>>        */
>>>>       __u32                extended_size;
>>>>
>>>>       /*
>>>>        * Feature bit mask (including FP/SSE/extended state) that is 
>>>> present
>>>>        * in the memory layout:
>>>>        */
>>>>       __u64                xfeatures;
>>>>
>>>>       /*
>>>>        * Actual XSAVE state size, based on the xfeatures saved in 
>>>> the layout.
>>>>        * 'extended_size' is greater than 'xstate_size':
>>>>        */
>>>>       __u32                xstate_size;
>>>>
>>>>       /* For future use: */
>>>>       __u32                padding[7];
>>>> };
>>>>
>>>>
>>>> That's where we are right now upstream.  The kernel has a parser for
>>>> the FPU state that is bugs piled upon bugs and is going to have to be
>>>> rewritten sometime soon.  On top of all this, we have two upcoming
>>>> features, both of which require different kinds of extensions:
>>>>
>>>> 1. AVX-512.  (Yeah, you thought this story was over a few years ago,
>>>> but no.  And AMX makes it worse.)  To make a long story short, we
>>>> promised user code many years ago that a signal frame fit in 2048
>>>> bytes with some room to spare.  With AVX-512 this is false.  With AMX
>>>> it's so wrong it's not even funny.  The only way out of the mess
>>>> anyone has come up with involves making the length of the FPU state
>>>> vary depending on which features are INIT, i.e. making it more compact
>>>> than "compact" mode is.  This has a side effect: it's no longer
>>>> possible to modify the state in place, because enabling a feature with
>>>> no space allocated will make the structure bigger, and the stack won't
>>>> have room.  Fortunately, one can relocate the entire FPU state, update
>>>> the pointer in mcontext, and the kernel will happily follow the
>>>> pointer.  So new code on a new kernel using a super-compact state
>>>> could expand the state by allocating new memory (on the heap? very
>>>> awkwardly on the stack?) and changing the pointer.  For all we know,
>>>> some code already fiddles with the pointer.  This is great, except
>>>> that your patch sticks more data at the end of the FPU block that no
>>>> one is expecting, and your sigreturn code follows that pointer, and
>>>> will read off into lala land.
>>>>
>>>
>>> Then, what about we don't do that at all.  Is it possible from now on we
>>> don't stick more data at the end, and take the relocating-fpu approach?
>>>
>>>> 2. CET.  CET wants us to find a few more bytes somewhere, and those
>>>> bytes logically belong in ucontext, and here we are.
>>>>
>>>
>>> Fortunately, we can spare CET the need of ucontext extension.  When the
>>> kernel handles sigreturn, the user-mode shadow stack pointer is right at
>>> the restore token.  There is no need to put that in ucontext.
>>
>> That seems entirely reasonable.  This might also avoid needing to
>> teach CRIU about CET at all.
>>
>>>
>>> However, the WAIT_ENDBR status needs to be saved/restored for signals.
>>> Since IBT is now dependent on shadow stack, we can use a spare bit of
>>> the shadow stack restore token for that.
>>
>> That seems like unnecessary ABI coupling.  We have plenty of bits in
>> uc_flags, and we have an entire reserved word in sigcontext.  How
>> about just sticking this bit in one of those places?
> 
> Yes, I will make it UC_WAIT_ENDBR.

Personally, I think an explicit flag is cleaner than using a reserved 
word somewhere.  However, there is a small issue: ia32 has no uc_flags.

This series can support legacy apps up to now.  But, instead of creating 
too many special cases, perhaps we should drop CET support of ia32?

Thoughts?

Thanks,
Yu-cheng

Yu-cheng Yu May 6, 2021, 10:05 p.m. UTC | #16

On 5/4/2021 1:49 PM, Yu, Yu-cheng wrote:
> On 4/30/2021 11:32 AM, Yu, Yu-cheng wrote:
>> On 4/30/2021 10:47 AM, Andy Lutomirski wrote:
>>> On Fri, Apr 30, 2021 at 10:00 AM Yu, Yu-cheng <yu-cheng.yu@intel.com> 
>>> wrote:
>>>>
>>>> On 4/28/2021 4:03 PM, Andy Lutomirski wrote:
>>>>> On Tue, Apr 27, 2021 at 1:44 PM Yu-cheng Yu <yu-cheng.yu@intel.com> 
>>>>> wrote:
>>>>>>
>>>>>> When shadow stack is enabled, a task's shadow stack states must be 
>>>>>> saved
>>>>>> along with the signal context and later restored in sigreturn. 
>>>>>> However,
>>>>>> currently there is no systematic facility for extending a signal 
>>>>>> context.
>>>>>> There is some space left in the ucontext, but changing ucontext is 
>>>>>> likely
>>>>>> to create compatibility issues and there is not enough space for 
>>>>>> further
>>>>>> extensions.
>>>>>>
>>>>>> Introduce a signal context extension struct 'sc_ext', which is 
>>>>>> used to save
>>>>>> shadow stack restore token address.  The extension is located 
>>>>>> above the fpu
>>>>>> states, plus alignment.  The struct can be extended (such as the 
>>>>>> ibt's
>>>>>> wait_endbr status to be introduced later), and sc_ext.total_size 
>>>>>> field
>>>>>> keeps track of total size.
>>>>>
>>>>> I still don't like this.
>>>>>

[...]

>>>>>
>>>>> That's where we are right now upstream.  The kernel has a parser for
>>>>> the FPU state that is bugs piled upon bugs and is going to have to be
>>>>> rewritten sometime soon.  On top of all this, we have two upcoming
>>>>> features, both of which require different kinds of extensions:
>>>>>
>>>>> 1. AVX-512.  (Yeah, you thought this story was over a few years ago,
>>>>> but no.  And AMX makes it worse.)  To make a long story short, we
>>>>> promised user code many years ago that a signal frame fit in 2048
>>>>> bytes with some room to spare.  With AVX-512 this is false.  With AMX
>>>>> it's so wrong it's not even funny.  The only way out of the mess
>>>>> anyone has come up with involves making the length of the FPU state
>>>>> vary depending on which features are INIT, i.e. making it more compact
>>>>> than "compact" mode is.  This has a side effect: it's no longer
>>>>> possible to modify the state in place, because enabling a feature with
>>>>> no space allocated will make the structure bigger, and the stack won't
>>>>> have room.  Fortunately, one can relocate the entire FPU state, update
>>>>> the pointer in mcontext, and the kernel will happily follow the
>>>>> pointer.  So new code on a new kernel using a super-compact state
>>>>> could expand the state by allocating new memory (on the heap? very
>>>>> awkwardly on the stack?) and changing the pointer.  For all we know,
>>>>> some code already fiddles with the pointer.  This is great, except
>>>>> that your patch sticks more data at the end of the FPU block that no
>>>>> one is expecting, and your sigreturn code follows that pointer, and
>>>>> will read off into lala land.
>>>>>
>>>>
>>>> Then, what about we don't do that at all.  Is it possible from now 
>>>> on we
>>>> don't stick more data at the end, and take the relocating-fpu approach?
>>>>
>>>>> 2. CET.  CET wants us to find a few more bytes somewhere, and those
>>>>> bytes logically belong in ucontext, and here we are.
>>>>>
>>>>
>>>> Fortunately, we can spare CET the need of ucontext extension.  When the
>>>> kernel handles sigreturn, the user-mode shadow stack pointer is 
>>>> right at
>>>> the restore token.  There is no need to put that in ucontext.
>>>
>>> That seems entirely reasonable.  This might also avoid needing to
>>> teach CRIU about CET at all.
>>>
>>>>
>>>> However, the WAIT_ENDBR status needs to be saved/restored for signals.
>>>> Since IBT is now dependent on shadow stack, we can use a spare bit of
>>>> the shadow stack restore token for that.
>>>
>>> That seems like unnecessary ABI coupling.  We have plenty of bits in
>>> uc_flags, and we have an entire reserved word in sigcontext.  How
>>> about just sticking this bit in one of those places?
>>
>> Yes, I will make it UC_WAIT_ENDBR.
> 
> Personally, I think an explicit flag is cleaner than using a reserved 
> word somewhere.  However, there is a small issue: ia32 has no uc_flags.
> 
> This series can support legacy apps up to now.  But, instead of creating 
> too many special cases, perhaps we should drop CET support of ia32?
> 
> Thoughts?
> 

Once we have UC_WAIT_ENDBR, IBT signal handling becomes quite simple. 
Like the following:

diff --git a/arch/x86/include/uapi/asm/ucontext.h 
b/arch/x86/include/uapi/asm/ucontext.h
index 5657b7a49f03..96375d609e11 100644
--- a/arch/x86/include/uapi/asm/ucontext.h
+++ b/arch/x86/include/uapi/asm/ucontext.h
@@ -49,6 +49,11 @@
   */
  #define UC_SIGCONTEXT_SS	0x2
  #define UC_STRICT_RESTORE_SS	0x4
+
+/*
+ * UC_WAIT_ENDBR indicates the task is in wait-ENDBR status.
+ */
+#define UC_WAIT_ENDBR		0x08
  #endif

  #include <asm-generic/ucontext.h>
diff --git a/arch/x86/kernel/ibt.c b/arch/x86/kernel/ibt.c
index d2563dd4759f..da804314ddc4 100644
--- a/arch/x86/kernel/ibt.c
+++ b/arch/x86/kernel/ibt.c
@@ -66,3 +66,32 @@ void ibt_disable(void)
  	ibt_set_clear_msr_bits(0, CET_ENDBR_EN);
  	current->thread.cet.ibt = 0;
  }
+
+int ibt_get_clear_wait_endbr(void)
+{
+	u64 msr_val = 0;
+
+	if (!current->thread.cet.ibt)
+		return 0;
+
+	fpregs_lock();
+
+	if (test_thread_flag(TIF_NEED_FPU_LOAD))
+		__fpregs_load_activate();
+
+	if (!rdmsrl_safe(MSR_IA32_U_CET, &msr_val))
+		wrmsrl(MSR_IA32_U_CET, msr_val & ~CET_WAIT_ENDBR);
+
+	fpregs_unlock();
+
+	return msr_val & CET_WAIT_ENDBR;
+}
+
+int ibt_set_wait_endbr(void)
+{
+	if (!current->thread.cet.ibt)
+		return 0;
+
+
+	return ibt_set_clear_msr_bits(CET_WAIT_ENDBR, 0);
+}
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 66b662e57e19..5afd15419006 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -46,6 +46,7 @@
  #include <asm/syscall.h>
  #include <asm/sigframe.h>
  #include <asm/signal.h>
+#include <asm/cet.h>

  #ifdef CONFIG_X86_64
  /*
@@ -134,6 +135,9 @@ static int restore_sigcontext(struct pt_regs *regs,
  	 */
  	if (unlikely(!(uc_flags & UC_STRICT_RESTORE_SS) && 
user_64bit_mode(regs)))
  		force_valid_ss(regs);
+
+	if (uc_flags & UC_WAIT_ENDBR)
+		ibt_set_wait_endbr();
  #endif

  	return fpu__restore_sig((void __user *)sc.fpstate,
@@ -433,6 +437,9 @@ static unsigned long frame_uc_flags(struct pt_regs 
*regs)
  	if (likely(user_64bit_mode(regs)))
  		flags |= UC_STRICT_RESTORE_SS;

+	if (ibt_get_clear_wait_endbr())
+		flags |= UC_WAIT_ENDBR;
+
  	return flags;
  }


However, this cannot handle ia32 with no SA_SIGINFO.  For that, can we 
create a synthetic token on the shadow stack?

- The token points to itself with reserved bit[1] set, and cannot be 
used for RSTORSSP.
- The token only exists for ia32 with no SA_SIGINFO *AND* when the task 
is in wait-endbr.

The signal shadow stack will look like this:

--> ssp before signal
     synthetic IBT token (for ia32 no SA_SIGINFO)
     shadow stack restore token
     sigreturn address

The synthetic token is not valid in other situations.
How is that?

Thanks,
Yu-cheng

Andy Lutomirski May 6, 2021, 11:31 p.m. UTC | #17

On Thu, May 6, 2021 at 3:05 PM Yu, Yu-cheng <yu-cheng.yu@intel.com> wrote:
>
> On 5/4/2021 1:49 PM, Yu, Yu-cheng wrote:
> > On 4/30/2021 11:32 AM, Yu, Yu-cheng wrote:
> >> On 4/30/2021 10:47 AM, Andy Lutomirski wrote:
> >>> On Fri, Apr 30, 2021 at 10:00 AM Yu, Yu-cheng <yu-cheng.yu@intel.com>
> >>> wrote:
> >>>>
> >>>> On 4/28/2021 4:03 PM, Andy Lutomirski wrote:
> >>>>> On Tue, Apr 27, 2021 at 1:44 PM Yu-cheng Yu <yu-cheng.yu@intel.com>
> >>>>> wrote:
> >>>>>>
> >>>>>> When shadow stack is enabled, a task's shadow stack states must be
> >>>>>> saved
> >>>>>> along with the signal context and later restored in sigreturn.
> >>>>>> However,
> >>>>>> currently there is no systematic facility for extending a signal
> >>>>>> context.
> >>>>>> There is some space left in the ucontext, but changing ucontext is
> >>>>>> likely
> >>>>>> to create compatibility issues and there is not enough space for
> >>>>>> further
> >>>>>> extensions.
> >>>>>>
> >>>>>> Introduce a signal context extension struct 'sc_ext', which is
> >>>>>> used to save
> >>>>>> shadow stack restore token address.  The extension is located
> >>>>>> above the fpu
> >>>>>> states, plus alignment.  The struct can be extended (such as the
> >>>>>> ibt's
> >>>>>> wait_endbr status to be introduced later), and sc_ext.total_size
> >>>>>> field
> >>>>>> keeps track of total size.
> >>>>>
> >>>>> I still don't like this.
> >>>>>
>
> [...]
>
> >>>>>
> >>>>> That's where we are right now upstream.  The kernel has a parser for
> >>>>> the FPU state that is bugs piled upon bugs and is going to have to be
> >>>>> rewritten sometime soon.  On top of all this, we have two upcoming
> >>>>> features, both of which require different kinds of extensions:
> >>>>>
> >>>>> 1. AVX-512.  (Yeah, you thought this story was over a few years ago,
> >>>>> but no.  And AMX makes it worse.)  To make a long story short, we
> >>>>> promised user code many years ago that a signal frame fit in 2048
> >>>>> bytes with some room to spare.  With AVX-512 this is false.  With AMX
> >>>>> it's so wrong it's not even funny.  The only way out of the mess
> >>>>> anyone has come up with involves making the length of the FPU state
> >>>>> vary depending on which features are INIT, i.e. making it more compact
> >>>>> than "compact" mode is.  This has a side effect: it's no longer
> >>>>> possible to modify the state in place, because enabling a feature with
> >>>>> no space allocated will make the structure bigger, and the stack won't
> >>>>> have room.  Fortunately, one can relocate the entire FPU state, update
> >>>>> the pointer in mcontext, and the kernel will happily follow the
> >>>>> pointer.  So new code on a new kernel using a super-compact state
> >>>>> could expand the state by allocating new memory (on the heap? very
> >>>>> awkwardly on the stack?) and changing the pointer.  For all we know,
> >>>>> some code already fiddles with the pointer.  This is great, except
> >>>>> that your patch sticks more data at the end of the FPU block that no
> >>>>> one is expecting, and your sigreturn code follows that pointer, and
> >>>>> will read off into lala land.
> >>>>>
> >>>>
> >>>> Then, what about we don't do that at all.  Is it possible from now
> >>>> on we
> >>>> don't stick more data at the end, and take the relocating-fpu approach?
> >>>>
> >>>>> 2. CET.  CET wants us to find a few more bytes somewhere, and those
> >>>>> bytes logically belong in ucontext, and here we are.
> >>>>>
> >>>>
> >>>> Fortunately, we can spare CET the need of ucontext extension.  When the
> >>>> kernel handles sigreturn, the user-mode shadow stack pointer is
> >>>> right at
> >>>> the restore token.  There is no need to put that in ucontext.
> >>>
> >>> That seems entirely reasonable.  This might also avoid needing to
> >>> teach CRIU about CET at all.
> >>>
> >>>>
> >>>> However, the WAIT_ENDBR status needs to be saved/restored for signals.
> >>>> Since IBT is now dependent on shadow stack, we can use a spare bit of
> >>>> the shadow stack restore token for that.
> >>>
> >>> That seems like unnecessary ABI coupling.  We have plenty of bits in
> >>> uc_flags, and we have an entire reserved word in sigcontext.  How
> >>> about just sticking this bit in one of those places?
> >>
> >> Yes, I will make it UC_WAIT_ENDBR.
> >
> > Personally, I think an explicit flag is cleaner than using a reserved
> > word somewhere.  However, there is a small issue: ia32 has no uc_flags.
> >
> > This series can support legacy apps up to now.  But, instead of creating
> > too many special cases, perhaps we should drop CET support of ia32?
> >
> > Thoughts?

I'm really not thrilled about coupling IBT and SHSTK like this.

Here are a couple of possible solutions:

- Don't support IBT in 32-bit mode, or maybe just don't support IBT
with legacy 32-bit signals.  The actual mechanics of this could be
awkward.  Maybe we would reject the sigaction() call or the
IBT-enabling request if they conflict?

- Find some space in the signal frame for these flags.  Looking around
a bit, sigframe_ia32 has fpstate_unused, but I can imagine things like
CRIU getting very confused if it stops being unused.  sigframe_ia32
uses sigcontext_32, which has a bunch of reserved space in __gsh,
__fsh, etc.

rt_sigframe_ia32 has uc_flags, so this isn't a real problem.

I don't have a brilliant solution here.

[v26,25/30] x86/cet/shstk: Handle signals for shadow stack

Commit Message

Comments

Patch