diff mbox series

[v14,09/19] x86/mm: x86/sgx: Signal SEGV_SGXERR for #PFs w/ PF_SGX

Message ID 20180925130845.9962-10-jarkko.sakkinen@linux.intel.com (mailing list archive)
State New, archived
Headers show
Series Intel SGX1 support | expand

Commit Message

Jarkko Sakkinen Sept. 25, 2018, 1:06 p.m. UTC
From: Sean Christopherson <sean.j.christopherson@intel.com>

Signal SIGSEGV(SEGV_SGXERR) for all faults with PF_SGX set in the
error code.  The PF_SGX bit is set if and only if the #PF is detected
by the Enclave Page Cache Map (EPCM), which is consulted only after
an access walks the kernel's page tables, i.e.:

  a. the access was allowed by the kernel
  b. the kernel's tables have become less restrictive than the EPCM
  c. the kernel cannot fixup the cause of the fault

Noteably, (b) implies that either the kernel has botched the EPC
mappings or the EPCM has been invalidated due to a power event.  In
either case, userspace needs to be alerted so that it can take
appropriate action, e.g. restart the enclave.  This is reinforced
by (c) as the kernel doesn't really have any other reasonable option,
e.g. we could kill the task or panic, but neither is warranted.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
---
 arch/x86/mm/fault.c | 20 +++++++++++++++++---
 1 file changed, 17 insertions(+), 3 deletions(-)

Comments

Andy Lutomirski Sept. 25, 2018, 10:53 p.m. UTC | #1
Minor nit:

On Tue, Sep 25, 2018 at 6:12 AM Jarkko Sakkinen
<jarkko.sakkinen@linux.intel.com> wrote:
>
> From: Sean Christopherson <sean.j.christopherson@intel.com>
>

> by (c) as the kernel doesn't really have any other reasonable option,
> e.g. we could kill the task or panic, but neither is warranted.

Not killing the task is quite nice, but...

> +       /*
> +        * Access is blocked by the Enclave Page Cache Map (EPCM),
> +        * i.e. the access is allowed by the PTE but not the EPCM.
> +        * This usually happens when the EPCM is yanked out from
> +        * under us, e.g. by hardware after a suspend/resume cycle.
> +        * In any case, there is nothing that can be done by the
> +        * kernel to resolve the fault (short of killing the task).

Maybe s/killing the task/sending a signal/?

--Andy
Sean Christopherson Sept. 26, 2018, 5:35 p.m. UTC | #2
On Tue, Sep 25, 2018 at 03:53:48PM -0700, Andy Lutomirski wrote:
> Minor nit:
> 
> On Tue, Sep 25, 2018 at 6:12 AM Jarkko Sakkinen
> <jarkko.sakkinen@linux.intel.com> wrote:
> >
> > From: Sean Christopherson <sean.j.christopherson@intel.com>
> >
> 
> > by (c) as the kernel doesn't really have any other reasonable option,
> > e.g. we could kill the task or panic, but neither is warranted.
> 
> Not killing the task is quite nice, but...
> 
> > +       /*
> > +        * Access is blocked by the Enclave Page Cache Map (EPCM),
> > +        * i.e. the access is allowed by the PTE but not the EPCM.
> > +        * This usually happens when the EPCM is yanked out from
> > +        * under us, e.g. by hardware after a suspend/resume cycle.
> > +        * In any case, there is nothing that can be done by the
> > +        * kernel to resolve the fault (short of killing the task).
> 
> Maybe s/killing the task/sending a signal/?

My intent was to document that, unlike all other page faults, the
kernel can't fix the source of the fault even if it were omniscient.

How about this?  With formatting changes since it's long-winded...

        /*
         * Access is blocked by the Enclave Page Cache Map (EPCM), i.e. the
         * access is allowed by the PTE but not the EPCM.  This usually happens
         * when the EPCM is yanked out from under us, e.g. by hardware after a
         * suspend/resume cycle.  In any case, software, i.e. the kernel, can't
         * fix the source of the fault as the EPCM can't be directly modified
         * by software.  Handle the fault as an access error in order to signal
         * userspace, e.g. so that userspace can rebuild their enclave(s), even
         * though userspace may not have actually violated access permissions.
         */
Andy Lutomirski Sept. 26, 2018, 6:12 p.m. UTC | #3
> On Sep 26, 2018, at 10:35 AM, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> 
>> On Tue, Sep 25, 2018 at 03:53:48PM -0700, Andy Lutomirski wrote:
>> Minor nit:
>> 
>> On Tue, Sep 25, 2018 at 6:12 AM Jarkko Sakkinen
>> <jarkko.sakkinen@linux.intel.com> wrote:
>>> 
>>> From: Sean Christopherson <sean.j.christopherson@intel.com>
>>> 
>> 
>>> by (c) as the kernel doesn't really have any other reasonable option,
>>> e.g. we could kill the task or panic, but neither is warranted.
>> 
>> Not killing the task is quite nice, but...
>> 
>>> +       /*
>>> +        * Access is blocked by the Enclave Page Cache Map (EPCM),
>>> +        * i.e. the access is allowed by the PTE but not the EPCM.
>>> +        * This usually happens when the EPCM is yanked out from
>>> +        * under us, e.g. by hardware after a suspend/resume cycle.
>>> +        * In any case, there is nothing that can be done by the
>>> +        * kernel to resolve the fault (short of killing the task).
>> 
>> Maybe s/killing the task/sending a signal/?
> 
> My intent was to document that, unlike all other page faults, the
> kernel can't fix the source of the fault even if it were omniscient.
> 
> How about this?  With formatting changes since it's long-winded...
> 
>        /*
>         * Access is blocked by the Enclave Page Cache Map (EPCM), i.e. the
>         * access is allowed by the PTE but not the EPCM.  This usually happens
>         * when the EPCM is yanked out from under us, e.g. by hardware after a
>         * suspend/resume cycle.  In any case, software, i.e. the kernel, can't
>         * fix the source of the fault as the EPCM can't be directly modified
>         * by software.  Handle the fault as an access error in order to signal
>         * userspace, e.g. so that userspace can rebuild their enclave(s), even
>         * though userspace may not have actually violated access permissions.
>         */
> 

Looks good to me.
Dave Hansen Sept. 26, 2018, 8:16 p.m. UTC | #4
On 09/26/2018 11:12 AM, Andy Lutomirski wrote:
>> e omniscient.
>>
>> How about this?  With formatting changes since it's long-winded...
>>
>>        /*
>>         * Access is blocked by the Enclave Page Cache Map (EPCM), i.e. the
>>         * access is allowed by the PTE but not the EPCM.  This usually happens
>>         * when the EPCM is yanked out from under us, e.g. by hardware after a
>>         * suspend/resume cycle.  In any case, software, i.e. the kernel, can't
>>         * fix the source of the fault as the EPCM can't be directly modified
>>         * by software.  Handle the fault as an access error in order to signal
>>         * userspace, e.g. so that userspace can rebuild their enclave(s), even
>>         * though userspace may not have actually violated access permissions.
>>         */
>>
> Looks good to me.

Including the actual architectural definition of the bit might add some
clarity.  The SDM explicitly says (Vol 3a section 4.7):

	The fault resulted from violation of SGX-specific access-control
	requirements.

Which totally squares with returning true from access_error().

There's also a tidbit that says:

	This flag is 1 if the exception is unrelated to paging and
	resulted from violation of SGX-specific access-control
	requirements. ... such a violation can occur only if there
	is no ordinary page fault...

This is pretty important.  It means that *none* of the other
paging-related stuff that we're doing applies.

We also need to clarify how this can happen.  Is it through something
than an app does, or is it solely when the hardware does something under
the covers, like suspend/resume.
Sean Christopherson Sept. 26, 2018, 8:44 p.m. UTC | #5
On Wed, Sep 26, 2018 at 01:16:59PM -0700, Dave Hansen wrote:
> On 09/26/2018 11:12 AM, Andy Lutomirski wrote:
> >> e omniscient.
> >>
> >> How about this?  With formatting changes since it's long-winded...
> >>
> >>        /*
> >>         * Access is blocked by the Enclave Page Cache Map (EPCM), i.e. the
> >>         * access is allowed by the PTE but not the EPCM.  This usually happens
> >>         * when the EPCM is yanked out from under us, e.g. by hardware after a
> >>         * suspend/resume cycle.  In any case, software, i.e. the kernel, can't
> >>         * fix the source of the fault as the EPCM can't be directly modified
> >>         * by software.  Handle the fault as an access error in order to signal
> >>         * userspace, e.g. so that userspace can rebuild their enclave(s), even
> >>         * though userspace may not have actually violated access permissions.
> >>         */
> >>
> > Looks good to me.
> 
> Including the actual architectural definition of the bit might add some
> clarity.  The SDM explicitly says (Vol 3a section 4.7):
> 
> 	The fault resulted from violation of SGX-specific access-control
> 	requirements.
> 
> Which totally squares with returning true from access_error().
> 
> There's also a tidbit that says:
> 
> 	This flag is 1 if the exception is unrelated to paging and
> 	resulted from violation of SGX-specific access-control
> 	requirements. ... such a violation can occur only if there
> 	is no ordinary page fault...
> 
> This is pretty important.  It means that *none* of the other
> paging-related stuff that we're doing applies.
>
> We also need to clarify how this can happen.  Is it through something
> than an app does, or is it solely when the hardware does something under
> the covers, like suspend/resume.

Are you looking for something in the changelog, the comment, or just
a response?  If it's the latter...

On bare metal with a bug-free kernel, the only scenario I'm aware of
where we'll encounter these faults is when hardware pulls the rug out
from under us.  In a virtualized environment all bets are off because
the architecture allows VMMs to silently "destroy" the EPC at will,
e.g. KVM, and I believe Hyper-V, will take advantage of this behavior
to support live migration.  Post migration, the destination system
will generate PF_SGX because the EPC{M} can't be migrated between
system, i.e. the destination EPCM sees all EPC pages as invalid.
Dave Hansen Sept. 26, 2018, 8:49 p.m. UTC | #6
On 09/26/2018 01:44 PM, Sean Christopherson wrote:
> On Wed, Sep 26, 2018 at 01:16:59PM -0700, Dave Hansen wrote:
>> We also need to clarify how this can happen.  Is it through something
>> than an app does, or is it solely when the hardware does something under
>> the covers, like suspend/resume.
> 
> Are you looking for something in the changelog, the comment, or just
> a response?  If it's the latter...

Comments, please.

> On bare metal with a bug-free kernel, the only scenario I'm aware of
> where we'll encounter these faults is when hardware pulls the rug out
> from under us.  In a virtualized environment all bets are off because
> the architecture allows VMMs to silently "destroy" the EPC at will,
> e.g. KVM, and I believe Hyper-V, will take advantage of this behavior
> to support live migration.  Post migration, the destination system
> will generate PF_SGX because the EPC{M} can't be migrated between
> system, i.e. the destination EPCM sees all EPC pages as invalid.

OK, cool.

That's good background fodder for the changelog.

But, for the comment, I'm happy with something like this:

	/*
	 * The fault resulted from violation of SGX-specific access-
	 * controls.  This is expected to be the result of some lower
	 * layer action (CPU suspend/resume, VM migration) and is
	 * not related to anything the OS did.  Treat it as an access
	 * error to ensure it is passed up to the app via a signal where
	 * it can be handled.
	 */

I really don't think we need to delve too deeply into the relationship
between EPCM and PTEs or anything.  Let's just say, "it's not the
kernel's fault, it's not the app's fault, so throw up our hands".
Andy Lutomirski Sept. 26, 2018, 9:15 p.m. UTC | #7
On Wed, Sep 26, 2018 at 1:55 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 09/26/2018 01:44 PM, Sean Christopherson wrote:
> > On Wed, Sep 26, 2018 at 01:16:59PM -0700, Dave Hansen wrote:
> >> We also need to clarify how this can happen.  Is it through something
> >> than an app does, or is it solely when the hardware does something under
> >> the covers, like suspend/resume.
> >
> > Are you looking for something in the changelog, the comment, or just
> > a response?  If it's the latter...
>
> Comments, please.
>
> > On bare metal with a bug-free kernel, the only scenario I'm aware of
> > where we'll encounter these faults is when hardware pulls the rug out
> > from under us.  In a virtualized environment all bets are off because
> > the architecture allows VMMs to silently "destroy" the EPC at will,
> > e.g. KVM, and I believe Hyper-V, will take advantage of this behavior
> > to support live migration.  Post migration, the destination system
> > will generate PF_SGX because the EPC{M} can't be migrated between
> > system, i.e. the destination EPCM sees all EPC pages as invalid.
>
> OK, cool.
>
> That's good background fodder for the changelog.
>
> But, for the comment, I'm happy with something like this:
>
>         /*
>          * The fault resulted from violation of SGX-specific access-
>          * controls.  This is expected to be the result of some lower
>          * layer action (CPU suspend/resume, VM migration) and is
>          * not related to anything the OS did.  Treat it as an access
>          * error to ensure it is passed up to the app via a signal where
>          * it can be handled.
>          */
>
> I really don't think we need to delve too deeply into the relationship
> between EPCM and PTEs or anything.  Let's just say, "it's not the
> kernel's fault, it's not the app's fault, so throw up our hands".

There is a non-nitpicky consideration here.  Logically, user code is
going to do this (totally made-up pseudocode):

enclave_t enclave = load_and_init_enclave(...);
int ret = sgx_run(enclave, some pointers to non-enclave-memory buffers, ...);

and, with the code in this patch, a correct implementation of
sgx_run() requires installing a signal handler.  This is nasty, since
signal handlers, expecially for something like SIGSEGV or SIGBUS, are
not fantastic to say the least in libraries.

Could we perhaps have a little vDSO entry (or syscall, I suppose) that
runs an enclave an returns an error code, and rig up the #PF handler
to check if the error happened in the vDSO entry and fix it up rather
than sending a signal?

On Windows, this is much less of a concern, because Windows has real
scoped fault handling. But Linux doesn't, at least not yet.


--
Andy Lutomirski
AMA Capital Management, LLC
Dave Hansen Sept. 26, 2018, 9:45 p.m. UTC | #8
On 09/26/2018 02:15 PM, Andy Lutomirski wrote:
> Could we perhaps have a little vDSO entry (or syscall, I suppose) that
> runs an enclave an returns an error code, and rig up the #PF handler
> to check if the error happened in the vDSO entry and fix it up rather
> than sending a signal?

Yeah, signals suck.

So, instead of doing the enclave entry instruction (EENTER is it?), the
app would do the vDSO call.  It would have some calling convention, like
"set %rax to 0 before entering".  Then, we just teach the page fault
handler about the %RIP in the vDSO that can fault and how to move one
instruction later, munge %RIP to a value that tells about the error,
then return from the fault.  It would basically be like the kernel
exception tables, but for userspace.  Right?

How would a syscall work, though?  I assume we can't just enter the
enclave from ring0.
Andy Lutomirski Sept. 26, 2018, 10:37 p.m. UTC | #9
On Wed, Sep 26, 2018 at 2:45 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 09/26/2018 02:15 PM, Andy Lutomirski wrote:
> > Could we perhaps have a little vDSO entry (or syscall, I suppose) that
> > runs an enclave an returns an error code, and rig up the #PF handler
> > to check if the error happened in the vDSO entry and fix it up rather
> > than sending a signal?
>
> Yeah, signals suck.
>
> So, instead of doing the enclave entry instruction (EENTER is it?), the
> app would do the vDSO call.  It would have some calling convention, like
> "set %rax to 0 before entering".  Then, we just teach the page fault
> handler about the %RIP in the vDSO that can fault and how to move one
> instruction later, munge %RIP to a value that tells about the error,
> then return from the fault.  It would basically be like the kernel
> exception tables, but for userspace.  Right?

Yeah.  Maybe like this:

xorl %eax,%eax
eenter_insn:
ENCLU[whatever]
eenter_landing_pad:
ret

And the kernel would use the existing vdso2c vdso-symbol-finding
mechanism to do the fixup.

>
> How would a syscall work, though?  I assume we can't just enter the
> enclave from ring0.

My understanding of how AEX works is a bit vague, but maybe a syscall
could reuse the mechanism?  The vDSO approach seems considerably
simpler.

We do need to make sure that a fault that happens on or after return
from an AEX event does the right thing.  But I'm still vague on how
that works, sigh.

--Andy
Jarkko Sakkinen Sept. 27, 2018, 1:14 p.m. UTC | #10
On Tue, Sep 25, 2018 at 03:53:48PM -0700, Andy Lutomirski wrote:
> Minor nit:
> 
> On Tue, Sep 25, 2018 at 6:12 AM Jarkko Sakkinen
> <jarkko.sakkinen@linux.intel.com> wrote:
> >
> > From: Sean Christopherson <sean.j.christopherson@intel.com>
> >
> 
> > by (c) as the kernel doesn't really have any other reasonable option,
> > e.g. we could kill the task or panic, but neither is warranted.
> 
> Not killing the task is quite nice, but...
> 
> > +       /*
> > +        * Access is blocked by the Enclave Page Cache Map (EPCM),
> > +        * i.e. the access is allowed by the PTE but not the EPCM.
> > +        * This usually happens when the EPCM is yanked out from
> > +        * under us, e.g. by hardware after a suspend/resume cycle.
> > +        * In any case, there is nothing that can be done by the
> > +        * kernel to resolve the fault (short of killing the task).
> 
> Maybe s/killing the task/sending a signal/?

Agreed. Thanks. 

/Jarkko
Jarkko Sakkinen Sept. 27, 2018, 1:42 p.m. UTC | #11
On Wed, Sep 26, 2018 at 01:16:59PM -0700, Dave Hansen wrote:
> On 09/26/2018 11:12 AM, Andy Lutomirski wrote:
> >> e omniscient.
> >>
> >> How about this?  With formatting changes since it's long-winded...
> >>
> >>        /*
> >>         * Access is blocked by the Enclave Page Cache Map (EPCM), i.e. the
> >>         * access is allowed by the PTE but not the EPCM.  This usually happens
> >>         * when the EPCM is yanked out from under us, e.g. by hardware after a
> >>         * suspend/resume cycle.  In any case, software, i.e. the kernel, can't
> >>         * fix the source of the fault as the EPCM can't be directly modified
> >>         * by software.  Handle the fault as an access error in order to signal
> >>         * userspace, e.g. so that userspace can rebuild their enclave(s), even
> >>         * though userspace may not have actually violated access permissions.
> >>         */
> >>
> > Looks good to me.
> 
> Including the actual architectural definition of the bit might add some
> clarity.  The SDM explicitly says (Vol 3a section 4.7):
> 
> 	The fault resulted from violation of SGX-specific access-control
> 	requirements.
> 
> Which totally squares with returning true from access_error().
> 
> There's also a tidbit that says:
> 
> 	This flag is 1 if the exception is unrelated to paging and
> 	resulted from violation of SGX-specific access-control
> 	requirements. ... such a violation can occur only if there
> 	is no ordinary page fault...
> 
> This is pretty important.  It means that *none* of the other
> paging-related stuff that we're doing applies.
> 
> We also need to clarify how this can happen.  Is it through something
> than an app does, or is it solely when the hardware does something under
> the covers, like suspend/resume.

When you change page permissions lets say with mprotect after the and
try to do an invalid access according to the EPCM permissions this can
happen.

/Jarkko
Jarkko Sakkinen Sept. 27, 2018, 1:56 p.m. UTC | #12
On Wed, Sep 26, 2018 at 02:45:17PM -0700, Dave Hansen wrote:
> On 09/26/2018 02:15 PM, Andy Lutomirski wrote:
> > Could we perhaps have a little vDSO entry (or syscall, I suppose) that
> > runs an enclave an returns an error code, and rig up the #PF handler
> > to check if the error happened in the vDSO entry and fix it up rather
> > than sending a signal?
> 
> Yeah, signals suck.
> 
> So, instead of doing the enclave entry instruction (EENTER is it?), the
> app would do the vDSO call.  It would have some calling convention, like
> "set %rax to 0 before entering".  Then, we just teach the page fault
> handler about the %RIP in the vDSO that can fault and how to move one
> instruction later, munge %RIP to a value that tells about the error,
> then return from the fault.  It would basically be like the kernel
> exception tables, but for userspace.  Right?
> 
> How would a syscall work, though?  I assume we can't just enter the
> enclave from ring0.

Enclave cannot be entered from ring-0.

For me this plan sounds simple and sound.

/Jarkko
Jarkko Sakkinen Sept. 27, 2018, 2:21 p.m. UTC | #13
On Wed, Sep 26, 2018 at 03:37:45PM -0700, Andy Lutomirski wrote:
> Yeah.  Maybe like this: > > xorl %eax,%eax > eenter_insn:
> ENCLU[whatever]
> eenter_landing_pad:
> ret
> 
> And the kernel would use the existing vdso2c vdso-symbol-finding
> mechanism to do the fixup.
> 
> >
> > How would a syscall work, though?  I assume we can't just enter the
> > enclave from ring0.
> 
> My understanding of how AEX works is a bit vague, but maybe a syscall
> could reuse the mechanism?  The vDSO approach seems considerably
> simpler.
> 
> We do need to make sure that a fault that happens on or after return
> from an AEX event does the right thing.  But I'm still vague on how
> that works, sigh.
> 
> --Andy

Returning from AEX does not differ from any other memory access event so
AFAIK it should be handled right with the proposed solution already.
For convenience I think we could have a fixed trampoline for AEX e.g.
this how it is implemented in the open source LE that I did:

sgx_get_token:
	push	%rbx
	mov	$0x02, %rax
	mov	%rsi, %rbx
	mov	%rdx, %rsi
	mov	$sgx_async_exit, %rcx
sgx_async_exit:
	ENCLU
	pop	%rbx
	ret

BTW, if I converted the in-kernel LE as a standalone test program, would
that be useful for basic testing of the series?

/Jarkko
Andy Lutomirski Sept. 27, 2018, 2:41 p.m. UTC | #14
> On Sep 27, 2018, at 7:21 AM, Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> wrote:
> 
>> On Wed, Sep 26, 2018 at 03:37:45PM -0700, Andy Lutomirski wrote:
>> Yeah.  Maybe like this: > > xorl %eax,%eax > eenter_insn:
>> ENCLU[whatever]
>> eenter_landing_pad:
>> ret
>> 
>> And the kernel would use the existing vdso2c vdso-symbol-finding
>> mechanism to do the fixup.
>> 
>>> 
>>> How would a syscall work, though?  I assume we can't just enter the
>>> enclave from ring0.
>> 
>> My understanding of how AEX works is a bit vague, but maybe a syscall
>> could reuse the mechanism?  The vDSO approach seems considerably
>> simpler.
>> 
>> We do need to make sure that a fault that happens on or after return
>> from an AEX event does the right thing.  But I'm still vague on how
>> that works, sigh.
>> 
>> --Andy
> 
> Returning from AEX does not differ from any other memory access event so
> AFAIK it should be handled right with the proposed solution already.
> For convenience I think we could have a fixed trampoline for AEX e.g.
> this how it is implemented in the open source LE that I did:
> 
> sgx_get_token:
>    push    %rbx
>    mov    $0x02, %rax
>    mov    %rsi, %rbx
>    mov    %rdx, %rsi
>    mov    $sgx_async_exit, %rcx
> sgx_async_exit:
>    ENCLU
>    pop    %rbx
>    ret
> 
> BTW, if I converted the in-kernel LE as a standalone test program, would
> that be useful for basic testing of the series?
> 
> 

Definitely. Especially if you stick it in selftests/x86 and make it exit cleanly (error code 0) on unsupported hardware.
Dave Hansen Sept. 27, 2018, 2:58 p.m. UTC | #15
On 09/27/2018 06:42 AM, Jarkko Sakkinen wrote:
>> 	This flag is 1 if the exception is unrelated to paging and
>> 	resulted from violation of SGX-specific access-control
>> 	requirements. ... such a violation can occur only if there
>> 	is no ordinary page fault...
>>
>> This is pretty important.  It means that *none* of the other
>> paging-related stuff that we're doing applies.
>>
>> We also need to clarify how this can happen.  Is it through something
>> than an app does, or is it solely when the hardware does something under
>> the covers, like suspend/resume.
> When you change page permissions lets say with mprotect after the and
> try to do an invalid access according to the EPCM permissions this can
> happen.

So, there are pages that are non-executable, non-readable, or
non-writable both via the page tables and via underlying SGX
permissions.  Then, we allow an mprotect() and a later access will
result in one of these SGX faults?

What permissions are these, exactly?  Is it even a good idea to let that
mprotect() go through in the first place?

Either way, it sounds like we have some new conditions to spell out in
that comment.
Jarkko Sakkinen Sept. 27, 2018, 3:39 p.m. UTC | #16
On Thu, Sep 27, 2018 at 07:58:41AM -0700, Dave Hansen wrote:
> On 09/27/2018 06:42 AM, Jarkko Sakkinen wrote:
> >> 	This flag is 1 if the exception is unrelated to paging and
> >> 	resulted from violation of SGX-specific access-control
> >> 	requirements. ... such a violation can occur only if there
> >> 	is no ordinary page fault...
> >>
> >> This is pretty important.  It means that *none* of the other
> >> paging-related stuff that we're doing applies.
> >>
> >> We also need to clarify how this can happen.  Is it through something
> >> than an app does, or is it solely when the hardware does something under
> >> the covers, like suspend/resume.
> > When you change page permissions lets say with mprotect after the and
> > try to do an invalid access according to the EPCM permissions this can
> > happen.
> 
> So, there are pages that are non-executable, non-readable, or
> non-writable both via the page tables and via underlying SGX
> permissions.  Then, we allow an mprotect() and a later access will
> result in one of these SGX faults?

The permissions are intersection of PTE and EPCM permissions.

EPCM permissions are part of the enclave measurement. For SGX1 they are
static. For SGX2 they can be changed with EMODPR/EACCEPT protocol (i.e.
measurement can be updated after enclave initialization).

> What permissions are these, exactly?  Is it even a good idea to let that
> mprotect() go through in the first place?

You define RWX for each page when you do EADD.

> Either way, it sounds like we have some new conditions to spell out in
> that comment.

Agreed.

/Jarkko
Dave Hansen Sept. 27, 2018, 3:53 p.m. UTC | #17
On 09/27/2018 08:39 AM, Jarkko Sakkinen wrote:
> On Thu, Sep 27, 2018 at 07:58:41AM -0700, Dave Hansen wrote:
>> On 09/27/2018 06:42 AM, Jarkko Sakkinen wrote:
>>>> 	This flag is 1 if the exception is unrelated to paging and
>>>> 	resulted from violation of SGX-specific access-control
>>>> 	requirements. ... such a violation can occur only if there
>>>> 	is no ordinary page fault...
>>>>
>>>> This is pretty important.  It means that *none* of the other
>>>> paging-related stuff that we're doing applies.
>>>>
>>>> We also need to clarify how this can happen.  Is it through something
>>>> than an app does, or is it solely when the hardware does something under
>>>> the covers, like suspend/resume.
>>> When you change page permissions lets say with mprotect after the and
>>> try to do an invalid access according to the EPCM permissions this can
>>> happen.
>>
>> So, there are pages that are non-executable, non-readable, or
>> non-writable both via the page tables and via underlying SGX
>> permissions.  Then, we allow an mprotect() and a later access will
>> result in one of these SGX faults?
> 
> The permissions are intersection of PTE and EPCM permissions.

Right, but this *fault* bit is not.

> EPCM permissions are part of the enclave measurement. For SGX1 they are
> static. For SGX2 they can be changed with EMODPR/EACCEPT protocol (i.e.
> measurement can be updated after enclave initialization).

What does this all have to do with enclave measurement?

>> What permissions are these, exactly?  Is it even a good idea to let that
>> mprotect() go through in the first place?
> 
> You define RWX for each page when you do EADD.

Are those permissions reflected into the VMAs mapping the enclave memory?
Eric W. Biederman Sept. 27, 2018, 7:43 p.m. UTC | #18
Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> writes:

> From: Sean Christopherson <sean.j.christopherson@intel.com>
>
> Signal SIGSEGV(SEGV_SGXERR) for all faults with PF_SGX set in the
> error code.  The PF_SGX bit is set if and only if the #PF is detected
> by the Enclave Page Cache Map (EPCM), which is consulted only after
> an access walks the kernel's page tables, i.e.:
>
>   a. the access was allowed by the kernel
>   b. the kernel's tables have become less restrictive than the EPCM
>   c. the kernel cannot fixup the cause of the fault
>
> Noteably, (b) implies that either the kernel has botched the EPC
> mappings or the EPCM has been invalidated due to a power event.  In
> either case, userspace needs to be alerted so that it can take
> appropriate action, e.g. restart the enclave.  This is reinforced
> by (c) as the kernel doesn't really have any other reasonable option,
> e.g. we could kill the task or panic, but neither is warranted.
>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
> ---
>  arch/x86/mm/fault.c | 20 +++++++++++++++++---
>  1 file changed, 17 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index 85d20516b2f3..3fb2b2838d6c 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -960,10 +960,13 @@ static noinline void
>  bad_area_access_error(struct pt_regs *regs, unsigned long error_code,
>  		      unsigned long address, struct vm_area_struct *vma)
>  {
> +	int si_code = SEGV_ACCERR;
> +
>  	if (bad_area_access_from_pkeys(error_code, vma))
> -		__bad_area(regs, error_code, address, vma, SEGV_PKUERR);
> -	else
> -		__bad_area(regs, error_code, address, vma, SEGV_ACCERR);
> +		si_code = SEGV_PKUERR;
> +	else if (unlikely(error_code & X86_PF_SGX))
> +		si_code = SEGV_SGXERR;
> +	__bad_area(regs, error_code, address, vma, si_code);
>  }

This conflicts with a cleanup in this area I have sitting in linux-next.
It isn't in the x86 tree but you can find my siginfo tree at:

git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git siginfo-next

In my tree bad area no longer takes a vma parameter.

If you are going to make changes to the fault handling code this cycle
please let's figure out how to build it on top of my clean ups.

Thank you,
Eric Biederman
Jarkko Sakkinen Sept. 28, 2018, 12:17 p.m. UTC | #19
On Thu, 2018-09-27 at 21:43 +0200, Eric W. Biederman wrote:
> Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> writes:
> 
> > From: Sean Christopherson <sean.j.christopherson@intel.com>
> > 
> > Signal SIGSEGV(SEGV_SGXERR) for all faults with PF_SGX set in the
> > error code.  The PF_SGX bit is set if and only if the #PF is detected
> > by the Enclave Page Cache Map (EPCM), which is consulted only after
> > an access walks the kernel's page tables, i.e.:
> > 
> >   a. the access was allowed by the kernel
> >   b. the kernel's tables have become less restrictive than the EPCM
> >   c. the kernel cannot fixup the cause of the fault
> > 
> > Noteably, (b) implies that either the kernel has botched the EPC
> > mappings or the EPCM has been invalidated due to a power event.  In
> > either case, userspace needs to be alerted so that it can take
> > appropriate action, e.g. restart the enclave.  This is reinforced
> > by (c) as the kernel doesn't really have any other reasonable option,
> > e.g. we could kill the task or panic, but neither is warranted.
> > 
> > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> > Cc: Dave Hansen <dave.hansen@linux.intel.com>
> > Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
> > ---
> >  arch/x86/mm/fault.c | 20 +++++++++++++++++---
> >  1 file changed, 17 insertions(+), 3 deletions(-)
> > 
> > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> > index 85d20516b2f3..3fb2b2838d6c 100644
> > --- a/arch/x86/mm/fault.c
> > +++ b/arch/x86/mm/fault.c
> > @@ -960,10 +960,13 @@ static noinline void
> >  bad_area_access_error(struct pt_regs *regs, unsigned long error_code,
> >  		      unsigned long address, struct vm_area_struct *vma)
> >  {
> > +	int si_code = SEGV_ACCERR;
> > +
> >  	if (bad_area_access_from_pkeys(error_code, vma))
> > -		__bad_area(regs, error_code, address, vma, SEGV_PKUERR);
> > -	else
> > -		__bad_area(regs, error_code, address, vma, SEGV_ACCERR);
> > +		si_code = SEGV_PKUERR;
> > +	else if (unlikely(error_code & X86_PF_SGX))
> > +		si_code = SEGV_SGXERR;
> > +	__bad_area(regs, error_code, address, vma, si_code);
> >  }
> 
> This conflicts with a cleanup in this area I have sitting in linux-next.
> It isn't in the x86 tree but you can find my siginfo tree at:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git
> siginfo-next
> 
> In my tree bad area no longer takes a vma parameter.
> 
> If you are going to make changes to the fault handling code this cycle
> please let's figure out how to build it on top of my clean ups.

We are now going with the proposed vdso solution for v15. Thank you for
your feedback. I'll sync with you if that route would show non-feasible
(still prototyping the vdso solution). 

> Thank you,
> Eric Biederman

/Jarkko
Sean Christopherson Oct. 1, 2018, 2:29 p.m. UTC | #20
On Wed, 2018-09-26 at 14:15 -0700, Andy Lutomirski wrote:
> On Wed, Sep 26, 2018 at 1:55 PM Dave Hansen <dave.hansen@intel.com> wrote:
> > 
> > 
> > On 09/26/2018 01:44 PM, Sean Christopherson wrote:
> > > 
> > > On Wed, Sep 26, 2018 at 01:16:59PM -0700, Dave Hansen wrote:
> > > > 
> > > > We also need to clarify how this can happen.  Is it through something
> > > > than an app does, or is it solely when the hardware does something under
> > > > the covers, like suspend/resume.
> > > Are you looking for something in the changelog, the comment, or just
> > > a response?  If it's the latter...
> > Comments, please.
> > 
> > > 
> > > On bare metal with a bug-free kernel, the only scenario I'm aware of
> > > where we'll encounter these faults is when hardware pulls the rug out
> > > from under us.  In a virtualized environment all bets are off because
> > > the architecture allows VMMs to silently "destroy" the EPC at will,
> > > e.g. KVM, and I believe Hyper-V, will take advantage of this behavior
> > > to support live migration.  Post migration, the destination system
> > > will generate PF_SGX because the EPC{M} can't be migrated between
> > > system, i.e. the destination EPCM sees all EPC pages as invalid.
> > OK, cool.
> > 
> > That's good background fodder for the changelog.
> > 
> > But, for the comment, I'm happy with something like this:
> > 
> >         /*
> >          * The fault resulted from violation of SGX-specific access-
> >          * controls.  This is expected to be the result of some lower
> >          * layer action (CPU suspend/resume, VM migration) and is
> >          * not related to anything the OS did.  Treat it as an access
> >          * error to ensure it is passed up to the app via a signal where
> >          * it can be handled.
> >          */
> > 
> > I really don't think we need to delve too deeply into the relationship
> > between EPCM and PTEs or anything.  Let's just say, "it's not the
> > kernel's fault, it's not the app's fault, so throw up our hands".
> There is a non-nitpicky consideration here.  Logically, user code is
> going to do this (totally made-up pseudocode):
> 
> enclave_t enclave = load_and_init_enclave(...);
> int ret = sgx_run(enclave, some pointers to non-enclave-memory buffers, ...);
> 
> and, with the code in this patch, a correct implementation of
> sgx_run() requires installing a signal handler.  This is nasty, since
> signal handlers, expecially for something like SIGSEGV or SIGBUS, are
> not fantastic to say the least in libraries.
>
> Could we perhaps have a little vDSO entry (or syscall, I suppose) that
> runs an enclave an returns an error code, and rig up the #PF handler
> to check if the error happened in the vDSO entry and fix it up rather
> than sending a signal?


If we want to avoid having to install a signal handler then I'm pretty
sure we'd need to fixup all #GPs and "bad access" #PFs that occur on
EENTER or in the enclave, not just PF_SGX faults.  SGX1 hardware takes
a #GP instead of a #PF on EPCM faults, and SGX2 hardware allows enclaves
to allocate/free/adjust EPC pages at runtime, e.g. an enclave runtime
might want to intercept #PFs from within the enclave so that the enclave
can dynamically grow its stack.

> On Windows, this is much less of a concern, because Windows has real
> scoped fault handling. But Linux doesn't, at least not yet.
> 
> 
> --
> Andy Lutomirski
> AMA Capital Management, LLC
Dave Hansen Oct. 1, 2018, 2:41 p.m. UTC | #21
On 10/01/2018 07:29 AM, Sean Christopherson wrote:
>> Could we perhaps have a little vDSO entry (or syscall, I suppose) that
>> runs an enclave an returns an error code, and rig up the #PF handler
>> to check if the error happened in the vDSO entry and fix it up rather
>> than sending a signal?
> 
> If we want to avoid having to install a signal handler then I'm pretty
> sure we'd need to fixup all #GPs and "bad access" #PFs that occur on
> EENTER or in the enclave, not just PF_SGX faults.  SGX1 hardware takes
> a #GP instead of a #PF on EPCM faults, and SGX2 hardware allows enclaves
> to allocate/free/adjust EPC pages at runtime, e.g. an enclave runtime
> might want to intercept #PFs from within the enclave so that the enclave
> can dynamically grow its stack.

I think the technique Andy describes can be used for that as well.  It
basically works for any case where we know which instructions will take
an exception (any exception), call the instruction from a fixed
location, and know the fault(s) it can throw.

To me, it's almost like turning these faulting instructions into mini
syscall instructions.  They enter the kernel only when they need help,
though, instead of always.
Jethro Beekman Oct. 1, 2018, 9:42 p.m. UTC | #22
On 2018-09-27 06:56, Jarkko Sakkinen wrote:
> On Wed, Sep 26, 2018 at 02:45:17PM -0700, Dave Hansen wrote:
>> On 09/26/2018 02:15 PM, Andy Lutomirski wrote:
>>> Could we perhaps have a little vDSO entry (or syscall, I suppose) that
>>> runs an enclave an returns an error code, and rig up the #PF handler
>>> to check if the error happened in the vDSO entry and fix it up rather
>>> than sending a signal?
> 
> For me this plan sounds simple and sound.

I support avoiding the need for a signal handler for various 
SGX-specific operations. Looking forward to v15.

I have some thoughts regarding the design of the vDSO function. Please 
consider the following as you work on the next patch set.

1) Even though the vDSO function exists, userspace may still call 
`ENCLU[EENTER]` manually, so the fault handling as described in the 
current patch should also be maintained.

2) All the information that would normally be provided through the 
signal handler (x86 fault number, reason) should be provided to userspace.

3) vDSO functions should be provided for all standard non-enclave ENCLU 
leafs, and should support most ways that an application might want to 
use. This includes:

* EENTER with a automatic AEX handler (as in Jarkko's sgx_get_token example)
* EENTER & ERESUME with a user-specified AEX handler, or possibly just a 
special return value from the ENCLU function on AEX

4) I think the vDSO functions should have a special calling convention 
(not conforming to the standard SysV ABI), such that most registers are 
passed between user space and enclave space untouched. Basically, only 
RAX, RBX, RCX are available as input and output registers.

--
Jethro Beekman | Fortanix
Dave Hansen Oct. 1, 2018, 10:03 p.m. UTC | #23
On 10/01/2018 02:42 PM, Jethro Beekman wrote:
> 
> 1) Even though the vDSO function exists, userspace may still call 
> `ENCLU[EENTER]` manually, so the fault handling as described in the 
> current patch should also be maintained.

Why?
Jarkko Sakkinen Oct. 2, 2018, 12:07 a.m. UTC | #24
On Mon, Oct 01, 2018 at 07:29:03AM -0700, Sean Christopherson wrote:
> On Wed, 2018-09-26 at 14:15 -0700, Andy Lutomirski wrote:
> > runs an enclave an returns an error code, and rig up the #PF handler
> > to check if the error happened in the vDSO entry and fix it up rather
> > than sending a signal?
> 
> 
> If we want to avoid having to install a signal handler then I'm pretty
> sure we'd need to fixup all #GPs and "bad access" #PFs that occur on
> EENTER or in the enclave, not just PF_SGX faults.  SGX1 hardware takes
> a #GP instead of a #PF on EPCM faults, and SGX2 hardware allows enclaves
> to allocate/free/adjust EPC pages at runtime, e.g. an enclave runtime
> might want to intercept #PFs from within the enclave so that the enclave
> can dynamically grow its stack.

If I've understood Andy's proposal correctly, the run-time would get the
same information as with a signal. The delivery path for this
information would be just different.

/Jarkko
Jarkko Sakkinen Oct. 2, 2018, 12:31 a.m. UTC | #25
On Mon, Oct 01, 2018 at 09:42:48PM +0000, Jethro Beekman wrote:
> 1) Even though the vDSO function exists, userspace may still call
> `ENCLU[EENTER]` manually, so the fault handling as described in the current
> patch should also be maintained.

You mean the way it was is in v13 and not the way it is in v14?

> 2) All the information that would normally be provided through the signal
> handler (x86 fault number, reason) should be provided to userspace.

As I've understood it, this should be just a change in the delivery path.

/Jarkko
Sean Christopherson Oct. 31, 2018, 9:30 p.m. UTC | #26
On Mon, Oct 01, 2018 at 03:03:30PM -0700, Dave Hansen wrote:
> On 10/01/2018 02:42 PM, Jethro Beekman wrote:
> > 
> > 1) Even though the vDSO function exists, userspace may still call 
> > `ENCLU[EENTER]` manually, so the fault handling as described in the 
> > current patch should also be maintained.
> 
> Why?

Circling back to this question, what if we take the easy way out and
simply signal SIGSEGV without an SGX-specific code?  I.e. treat #PF
with X86_PF_SGX as an access error, no more no less.  That should be
sufficient for userspace to function, albeit with a little more effort,
but presumably no more than would be needed to run on SGX1 hardware.

AFAIK there isn't a way to prevent userspace from manually invoking
EENTER, short of doing some really nasty text poking or PTE swizzling.
We could declare using EENTER as unsupported, but that seems like
cutting off the nose to spite the face.  Supporting userspace EENTER
in a limited capacity would allow people to do whatever crazy tricks
they're wont to do without having to deal with absurd requests for
the vDSO interface.

If we go this route we could also add the vDSO stuff after basic SGX
support is in mainline, obviously with approval from the powers that
be.
Dave Hansen Oct. 31, 2018, 9:35 p.m. UTC | #27
On 10/31/18 2:30 PM, Sean Christopherson wrote:
> On Mon, Oct 01, 2018 at 03:03:30PM -0700, Dave Hansen wrote:
>> On 10/01/2018 02:42 PM, Jethro Beekman wrote:
>>>
>>> 1) Even though the vDSO function exists, userspace may still call 
>>> `ENCLU[EENTER]` manually, so the fault handling as described in the 
>>> current patch should also be maintained.
>>
>> Why?
> 
> Circling back to this question, what if we take the easy way out and
> simply signal SIGSEGV without an SGX-specific code?  I.e. treat #PF
> with X86_PF_SGX as an access error, no more no less.  That should be
> sufficient for userspace to function, albeit with a little more effort,
> but presumably no more than would be needed to run on SGX1 hardware.

There are two sides to this ABI: what the kernel does to support SGX and
what userspace does.  If we do what you suggest, we remove any (most?)
needed kernel changes and foist the burden entirely into userspace.
But, we end up with two ABIs: the old one and the new vDSO one.

IOW, once we start doing SIGSEGV, we have to do it forever, despite if
we have a newer mechanism.

> AFAIK there isn't a way to prevent userspace from manually invoking
> EENTER, short of doing some really nasty text poking or PTE swizzling.
> We could declare using EENTER as unsupported,

Yep, userspace can call it all it wants, and we can also say that
calling it outside the vdso is "undefined".
Jethro Beekman Oct. 31, 2018, 9:53 p.m. UTC | #28
On 2018-10-31 14:35, Dave Hansen wrote:
> On 10/31/18 2:30 PM, Sean Christopherson wrote:
>> AFAIK there isn't a way to prevent userspace from manually invoking
>> EENTER, short of doing some really nasty text poking or PTE swizzling.
>> We could declare using EENTER as unsupported,
> 
> Yep, userspace can call it all it wants, and we can also say that
> calling it outside the vdso is "undefined".

Is there a precedent for this? Are there any other ring 3 x86 
instructions that Linux is claiming to be "undefined" when executed by a 
user process?

--
Jethro Beekman | Fortanix
Dave Hansen Oct. 31, 2018, 9:58 p.m. UTC | #29
On 10/31/18 2:53 PM, Jethro Beekman wrote:
> On 2018-10-31 14:35, Dave Hansen wrote:
>> On 10/31/18 2:30 PM, Sean Christopherson wrote:
>>> AFAIK there isn't a way to prevent userspace from manually invoking
>>> EENTER, short of doing some really nasty text poking or PTE swizzling.
>>> We could declare using EENTER as unsupported,
>>
>> Yep, userspace can call it all it wants, and we can also say that
>> calling it outside the vdso is "undefined".
> 
> Is there a precedent for this? Are there any other ring 3 x86
> instructions that Linux is claiming to be "undefined" when executed by a
> user process?

We did it for MPX.  "Don't use MPX unless you first tell the kernel, or
we'll eat your puppy."
Andy Lutomirski Oct. 31, 2018, 10:52 p.m. UTC | #30
> On Oct 31, 2018, at 2:58 PM, Dave Hansen <dave.hansen@intel.com> wrote:
> 
>> On 10/31/18 2:53 PM, Jethro Beekman wrote:
>>> On 2018-10-31 14:35, Dave Hansen wrote:
>>>> On 10/31/18 2:30 PM, Sean Christopherson wrote:
>>>> AFAIK there isn't a way to prevent userspace from manually invoking
>>>> EENTER, short of doing some really nasty text poking or PTE swizzling.
>>>> We could declare using EENTER as unsupported,
>>> 
>>> Yep, userspace can call it all it wants, and we can also say that
>>> calling it outside the vdso is "undefined".
>> 
>> Is there a precedent for this? Are there any other ring 3 x86
>> instructions that Linux is claiming to be "undefined" when executed by a
>> user process?
> 
> We did it for MPX.  "Don't use MPX unless you first tell the kernel, or
> we'll eat your puppy."

I think EENTER in plain user code should have well defined semantics, and it should get regular signals with the appropriate bits set in the error code field in the ucontext.  But we should probably simultaneously offer a nicer API, and the libraries will use it because it’s nicer.
Jarkko Sakkinen Nov. 1, 2018, 5:42 p.m. UTC | #31
On Wed, 31 Oct 2018, Sean Christopherson wrote:
> On Mon, Oct 01, 2018 at 03:03:30PM -0700, Dave Hansen wrote:
>> On 10/01/2018 02:42 PM, Jethro Beekman wrote:
>>>
>>> 1) Even though the vDSO function exists, userspace may still call
>>> `ENCLU[EENTER]` manually, so the fault handling as described in the
>>> current patch should also be maintained.
>>
>> Why?
>
> Circling back to this question, what if we take the easy way out and
> simply signal SIGSEGV without an SGX-specific code?  I.e. treat #PF
> with X86_PF_SGX as an access error, no more no less.  That should be
> sufficient for userspace to function, albeit with a little more effort,
> but presumably no more than would be needed to run on SGX1 hardware.
>
> AFAIK there isn't a way to prevent userspace from manually invoking
> EENTER, short of doing some really nasty text poking or PTE swizzling.
> We could declare using EENTER as unsupported, but that seems like
> cutting off the nose to spite the face.  Supporting userspace EENTER
> in a limited capacity would allow people to do whatever crazy tricks
> they're wont to do without having to deal with absurd requests for
> the vDSO interface.
>
> If we go this route we could also add the vDSO stuff after basic SGX
> support is in mainline, obviously with approval from the powers that
> be.
>

Yeah, this would give stable behavior when vDSO functions are not
available.

Here's a question: if we implement this behavior, could be upstream
series without vDSO's first and after those changes have been landed
we would continue with the vDSO's?

/Jarkko
Jarkko Sakkinen Nov. 1, 2018, 5:44 p.m. UTC | #32
On Thu, 1 Nov 2018, Jarkko Sakkinen wrote:

> On Wed, 31 Oct 2018, Sean Christopherson wrote:
>> On Mon, Oct 01, 2018 at 03:03:30PM -0700, Dave Hansen wrote:
>>> On 10/01/2018 02:42 PM, Jethro Beekman wrote:
>>>> 
>>>> 1) Even though the vDSO function exists, userspace may still call
>>>> `ENCLU[EENTER]` manually, so the fault handling as described in the
>>>> current patch should also be maintained.
>>> 
>>> Why?
>> 
>> Circling back to this question, what if we take the easy way out and
>> simply signal SIGSEGV without an SGX-specific code?  I.e. treat #PF
>> with X86_PF_SGX as an access error, no more no less.  That should be
>> sufficient for userspace to function, albeit with a little more effort,
>> but presumably no more than would be needed to run on SGX1 hardware.
>> 
>> AFAIK there isn't a way to prevent userspace from manually invoking
>> EENTER, short of doing some really nasty text poking or PTE swizzling.
>> We could declare using EENTER as unsupported, but that seems like
>> cutting off the nose to spite the face.  Supporting userspace EENTER
>> in a limited capacity would allow people to do whatever crazy tricks
>> they're wont to do without having to deal with absurd requests for
>> the vDSO interface.
>> 
>> If we go this route we could also add the vDSO stuff after basic SGX
>> support is in mainline, obviously with approval from the powers that
>> be.
>> 
>
> Yeah, this would give stable behavior when vDSO functions are not
> available.
>
> Here's a question: if we implement this behavior, could be upstream
> series without vDSO's first and after those changes have been landed
> we would continue with the vDSO's?

Right, it was in your last paragraph, sorry. Yeah, I fully support this 
idea. It will be easier also to work on the vDSO's once we have something
landed (instead of working on a moving platform).

/Jarkko
Dave Hansen Nov. 1, 2018, 5:51 p.m. UTC | #33
On 10/31/18 3:52 PM, Andy Lutomirski wrote:
> I think EENTER in plain user code should have well defined semantics,
> and it should get regular signals with the appropriate bits set in
> the error code field in the ucontext.  But we should probably
> simultaneously offer a nicer API, and the libraries will use it
> because it’s nicer.

FWIW, if we have a signal-based version and a VDSO-based version, nobody
will use the VDSO one.

The Intel libraries are surely going to keep using the approach they've
been using for years and I doubt their owners will be tempted even by a
simpler interface to change one line of code.

If we want to do the VDSO route, I think we probably need to go
whole-hog and toss the signal-based one.
Andy Lutomirski Nov. 1, 2018, 5:52 p.m. UTC | #34
On Thu, Nov 1, 2018 at 10:51 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 10/31/18 3:52 PM, Andy Lutomirski wrote:
> > I think EENTER in plain user code should have well defined semantics,
> > and it should get regular signals with the appropriate bits set in
> > the error code field in the ucontext.  But we should probably
> > simultaneously offer a nicer API, and the libraries will use it
> > because it’s nicer.
>
> FWIW, if we have a signal-based version and a VDSO-based version, nobody
> will use the VDSO one.
>
> The Intel libraries are surely going to keep using the approach they've
> been using for years and I doubt their owners will be tempted even by a
> simpler interface to change one line of code.
>
> If we want to do the VDSO route, I think we probably need to go
> whole-hog and toss the signal-based one.

That's a fair point.  We don't want a situation where a widely-used
SGX library registers a SIGSEGV handler.
diff mbox series

Patch

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 85d20516b2f3..3fb2b2838d6c 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -960,10 +960,13 @@  static noinline void
 bad_area_access_error(struct pt_regs *regs, unsigned long error_code,
 		      unsigned long address, struct vm_area_struct *vma)
 {
+	int si_code = SEGV_ACCERR;
+
 	if (bad_area_access_from_pkeys(error_code, vma))
-		__bad_area(regs, error_code, address, vma, SEGV_PKUERR);
-	else
-		__bad_area(regs, error_code, address, vma, SEGV_ACCERR);
+		si_code = SEGV_PKUERR;
+	else if (unlikely(error_code & X86_PF_SGX))
+		si_code = SEGV_SGXERR;
+	__bad_area(regs, error_code, address, vma, si_code);
 }
 
 static void
@@ -1153,6 +1156,17 @@  access_error(unsigned long error_code, struct vm_area_struct *vma)
 	if (error_code & X86_PF_PK)
 		return 1;
 
+	/*
+	 * Access is blocked by the Enclave Page Cache Map (EPCM),
+	 * i.e. the access is allowed by the PTE but not the EPCM.
+	 * This usually happens when the EPCM is yanked out from
+	 * under us, e.g. by hardware after a suspend/resume cycle.
+	 * In any case, there is nothing that can be done by the
+	 * kernel to resolve the fault (short of killing the task).
+	 */
+	if (unlikely(error_code & X86_PF_SGX))
+		return 1;
+
 	/*
 	 * Make sure to check the VMA so that we do not perform
 	 * faults just to hit a X86_PF_PK as soon as we fill in a