[RFC,v2,2/5] x86/sgx: Require userspace to define enclave pages' protection bits
diff mbox series

Message ID 20190606021145.12604-3-sean.j.christopherson@intel.com
State New
Headers show
Series
  • security: x86/sgx: SGX vs. LSM
Related show

Commit Message

Sean Christopherson June 6, 2019, 2:11 a.m. UTC
Existing Linux Security Module policies restrict userspace's ability to
map memory, e.g. may require priveleged permissions to map a page that
is simultaneously writable and executable.  Said permissions are often
tied to the file which backs the mapped memory, i.e. vm_file.

For reasons explained below, SGX does not allow LSMs to enforce policies
using existing LSM hooks such as file_mprotect().  Explicitly track the
protection bits for an enclave page (separate from the vma/pte bits) and
require userspace to explicit define each page's protection bit when the
page is added to the enclave.  Enclave page protection bits pave the way
adding security_enclave_load() as an SGX equivalent to file_mprotect(),
e.g. SGX can pass the page's protection bits and source vma to LSMs.
The source vma will allow LSMs to tie permissions to files, e.g. the
file containing the enclave's code and initial data, and the protection
bits will allow LSMs to make decisions based on the capabilities of the
enclave, e.g. if a page can be converted from RW to RX.

Due to the nature of the Enclave Page Cache, and because the EPC is
manually managed by SGX, all enclave vmas are backed by the same file,
i.e. /dev/sgx/enclave.  Specifically, a single file allows SGX to use
file op hooks to move pages in/out of the EPC.

Furthermore, EPC pages for any given enclave are fundamentally shared
between processes, i.e. CoW semantics are not possible with EPC pages
due to hardware restrictions such as 1:1 mappings between virtual and
physical addresses (within the enclave).

Lastly, all real world enclaves will need read, write and execute
permissions to EPC pages.

As a result, SGX does not play nice with existing LSM behavior as it is
impossible to apply policies to enclaves with reasonable granularity,
e.g. an LSM can deny access to EPC altogether, but can't deny
potentially unwanted behavior such as mapping pages RW->RW or RWX.

For example, because all (practical) enclaves need RW pages for data and
RX pages for code, SELinux's existing policies will require all enclaves
to have FILE__READ, FILE__WRITE and FILE__EXECUTE permissions on
/dev/sgx/enclave.  Witholding FILE__WRITE or FILE__EXECUTE in an attempt
to deny RW->RX or RWX would prevent running *any* enclave, even those
that cleanly separate RW and RX pages.  And because /dev/sgx/enclave
requires MAP_SHARED, the anonymous/CoW checks that would trigger
FILE__EXECMOD or PROCESS__EXECMEM permissions will never fire.

Taking protection bits has a second use in that it can be used to
prevent loading an enclave from a noexec file system.  On SGX2 hardware,
regardless of kernel support for SGX2, userspace could EADD a page from
a noexec path using read-only permissions and later mprotect() and
ENCLU[EMODPE] the page to gain execute permissions.  By requiring
the enclave's page protections up front, SGX will be able to enforce
noexec paths when building enclaves.

To prevent userspace from circumventing the allowed protections, do not
allow PROT_{READ,WRITE,EXEC} mappings to an enclave without an
associated enclave page, i.e. prevent creating a mapping with unchecked
protection bits.

Alternatively, SGX could pre-check what transitions are/aren't allowed
using some form of proxy for the enclave, e.g. its sigstruct, and
dynamically track protections in the SGX driver.  Dynamically tracking
protections and pre-checking permissions has several drawbacks:

  - Complicates the SGX implementation due to the need to coordinate
    tracking across multiple mm structs and vmas.

  - LSM auditing would log denials that never manifest in failure.

  - Requires additional SGX specific flags/definitions be passed to/from
    LSMs.

A second alternative would be to again use sigstruct as a proxy for the
enclave when performing access control checks, but hold a reference to
the sigstruct file and perform LSM checks during mmap()/mmprotect() as
opposed to pre-checking permissions at enclave build time.  The big
downside to this approach is that it effecitvely requires userspace to
place sigstruct in a file, and the SGX driver must "pin" said file by
holding a reference to the file for the lifetime of the enclave.

A third alternative would be to pull the protection bits from the page's
SECINFO, i.e. make decisions based on the protections enforced by
hardware.  However, with SGX2, userspace can extend the hardware-
enforced protections via ENCLU[EMODPE], e.g. can add a page as RW and
later convert it to RX.  With SGX2, making a decision based on the
initial protections would either create a security hole or force SGX to
dynamically track "dirty" pages (see first alternative above).

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/include/uapi/asm/sgx.h        |  2 +
 arch/x86/kernel/cpu/sgx/driver/ioctl.c | 14 +++++--
 arch/x86/kernel/cpu/sgx/driver/main.c  |  5 +++
 arch/x86/kernel/cpu/sgx/encl.c         | 53 ++++++++++++++++++++++++++
 arch/x86/kernel/cpu/sgx/encl.h         |  4 ++
 5 files changed, 74 insertions(+), 4 deletions(-)

Comments

Jarkko Sakkinen June 10, 2019, 3:27 p.m. UTC | #1
On Wed, Jun 05, 2019 at 07:11:42PM -0700, Sean Christopherson wrote:
> [SNAP]

Same general criticism as for the previous patch: try to say things as
they are without anything extra.

> A third alternative would be to pull the protection bits from the page's
> SECINFO, i.e. make decisions based on the protections enforced by
> hardware.  However, with SGX2, userspace can extend the hardware-
> enforced protections via ENCLU[EMODPE], e.g. can add a page as RW and
> later convert it to RX.  With SGX2, making a decision based on the
> initial protections would either create a security hole or force SGX to
> dynamically track "dirty" pages (see first alternative above).
> 
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>

'flags' should would renamed as 'secinfo_flags_mask' even if the name is
longish. It would use the same values as the SECINFO flags. The field in
struct sgx_encl_page should have the same name. That would express
exactly relation between SECINFO and the new field. I would have never
asked on last iteration why SECINFO is not enough with a better naming.

The same field can be also used to cage page type to a subset of values.

/Jarkko
Sean Christopherson June 10, 2019, 4:15 p.m. UTC | #2
On Mon, Jun 10, 2019 at 06:27:17PM +0300, Jarkko Sakkinen wrote:
> On Wed, Jun 05, 2019 at 07:11:42PM -0700, Sean Christopherson wrote:
> > [SNAP]
> 
> Same general criticism as for the previous patch: try to say things as
> they are without anything extra.
> 
> > A third alternative would be to pull the protection bits from the page's
> > SECINFO, i.e. make decisions based on the protections enforced by
> > hardware.  However, with SGX2, userspace can extend the hardware-
> > enforced protections via ENCLU[EMODPE], e.g. can add a page as RW and
> > later convert it to RX.  With SGX2, making a decision based on the
> > initial protections would either create a security hole or force SGX to
> > dynamically track "dirty" pages (see first alternative above).
> > 
> > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> 
> 'flags' should would renamed as 'secinfo_flags_mask' even if the name is
> longish. It would use the same values as the SECINFO flags. The field in
> struct sgx_encl_page should have the same name. That would express
> exactly relation between SECINFO and the new field. I would have never
> asked on last iteration why SECINFO is not enough with a better naming.

No, these flags do not impact the EPCM protections in any way.  Userspace
can extend the EPCM protections without going through the kernel.  The
protection flags for an enclave page impact VMA/PTE protection bits.

IMO, it is best to treat the EPCM as being completely separate from the
kernel's EPC management.

> The same field can be also used to cage page type to a subset of values.
> 
> /Jarkko
Jarkko Sakkinen June 10, 2019, 5:45 p.m. UTC | #3
On Mon, Jun 10, 2019 at 09:15:33AM -0700, Sean Christopherson wrote:
> > 'flags' should would renamed as 'secinfo_flags_mask' even if the name is
> > longish. It would use the same values as the SECINFO flags. The field in
> > struct sgx_encl_page should have the same name. That would express
> > exactly relation between SECINFO and the new field. I would have never
> > asked on last iteration why SECINFO is not enough with a better naming.
> 
> No, these flags do not impact the EPCM protections in any way.  Userspace
> can extend the EPCM protections without going through the kernel.  The
> protection flags for an enclave page impact VMA/PTE protection bits.
> 
> IMO, it is best to treat the EPCM as being completely separate from the
> kernel's EPC management.

It is a clumsy API if permissions are not taken in the same format for
everything. There is no reason not to do it. The way mprotect() callback
just interprets the field is as VMA permissions.

It would also be more future-proof just to have a mask covering all bits
of the SECINFO flags field.

/Jarkko
Sean Christopherson June 10, 2019, 6:17 p.m. UTC | #4
On Mon, Jun 10, 2019 at 08:45:06PM +0300, Jarkko Sakkinen wrote:
> On Mon, Jun 10, 2019 at 09:15:33AM -0700, Sean Christopherson wrote:
> > > 'flags' should would renamed as 'secinfo_flags_mask' even if the name is
> > > longish. It would use the same values as the SECINFO flags. The field in
> > > struct sgx_encl_page should have the same name. That would express
> > > exactly relation between SECINFO and the new field. I would have never
> > > asked on last iteration why SECINFO is not enough with a better naming.
> > 
> > No, these flags do not impact the EPCM protections in any way.  Userspace
> > can extend the EPCM protections without going through the kernel.  The
> > protection flags for an enclave page impact VMA/PTE protection bits.
> > 
> > IMO, it is best to treat the EPCM as being completely separate from the
> > kernel's EPC management.
> 
> It is a clumsy API if permissions are not taken in the same format for
> everything. There is no reason not to do it. The way mprotect() callback
> just interprets the field is as VMA permissions.

They are two entirely different things.  The explicit protection bits are
consumed by the kernel, while SECINFO.flags is consumed by the CPU.  The
intent is to have the protection flags be analogous to mprotect(), the
fact that they have a similar/identical format to SECINFO is irrelevant.

Calling the field secinfo_flags_mask is straight up wrong on SGX2, as 
userspace can use EMODPE to set SECINFO after the page is added.  It's
also wrong on SGX1 when adding TCS pages since SECINFO.RWX bits for TCS
pages are forced to zero by hardware.

> It would also be more future-proof just to have a mask covering all bits
> of the SECINFO flags field.

This simply doesn't work, e.g. the PENDING, MODIFIED and PR flags in the
SECINFO are read-only from a software perspective.
Xing, Cedric June 10, 2019, 6:29 p.m. UTC | #5
> From: Christopherson, Sean J
> Sent: Wednesday, June 05, 2019 7:12 PM
> 
> +/**
> + * sgx_map_allowed - check vma protections against the associated
> enclave page
> + * @encl:	an enclave
> + * @start:	start address of the mapping (inclusive)
> + * @end:	end address of the mapping (exclusive)
> + * @prot:	protection bits of the mapping
> + *
> + * Verify a userspace mapping to an enclave page would not violate the
> +security
> + * requirements of the *kernel*.  Note, this is in no way related to
> +the
> + * page protections enforced by hardware via the EPCM.  The EPCM
> +protections
> + * can be directly extended by the enclave, i.e. cannot be relied upon
> +by the
> + * kernel for security guarantees of any kind.
> + *
> + * Return:
> + *   0 on success,
> + *   -EACCES if the mapping is disallowed
> + */
> +int sgx_map_allowed(struct sgx_encl *encl, unsigned long start,
> +		    unsigned long end, unsigned long prot) {
> +	struct sgx_encl_page *page;
> +	unsigned long addr;
> +
> +	prot &= (VM_READ | VM_WRITE | VM_EXEC);
> +	if (!prot || !encl)
> +		return 0;
> +
> +	mutex_lock(&encl->lock);
> +
> +	for (addr = start; addr < end; addr += PAGE_SIZE) {
> +		page = radix_tree_lookup(&encl->page_tree, addr >>
> PAGE_SHIFT);
> +
> +		/*
> +		 * Do not allow R|W|X to a non-existent page, or protections
> +		 * beyond those of the existing enclave page.
> +		 */
> +		if (!page || (prot & ~page->prot))
> +			return -EACCES;

In SGX2, pages will be "mapped" before being populated.

Here's a brief summary for those who don't have enough background on how new EPC pages could be added to a running enclave in SGX2:
  - There are 2 new instructions - EACCEPT and EAUG.
  - EAUG is used by SGX module to add (augment) a new page to an existing enclave. The newly added page is *inaccessible* until the enclave *accepts* it.
  - EACCEPT is the instruction for an enclave to accept a new page.

And the s/w flow for an enclave to request new EPC pages is expected to be something like the following:
  - The enclave issues EACCEPT at the linear address that it would like a new page.
  - EACCEPT results in #PF, as there's no page at the linear address above.
  - SGX module is notified about the #PF, in form of its vma->vm_ops->fault() being called by kernel.
  - SGX module EAUGs a new EPC page at the fault address, and resumes the enclave.
  - EACCEPT is reattempted, and succeeds at this time.

But with the above check in sgx_map_allowed(), I'm not sure how this will work out with SGX2.

> +	}
> +
> +	mutex_unlock(&encl->lock);
> +
> +	return 0;
> +}
> +
Andy Lutomirski June 10, 2019, 7:15 p.m. UTC | #6
On Mon, Jun 10, 2019 at 11:29 AM Xing, Cedric <cedric.xing@intel.com> wrote:
>
> > From: Christopherson, Sean J
> > Sent: Wednesday, June 05, 2019 7:12 PM
> >
> > +/**
> > + * sgx_map_allowed - check vma protections against the associated
> > enclave page
> > + * @encl:    an enclave
> > + * @start:   start address of the mapping (inclusive)
> > + * @end:     end address of the mapping (exclusive)
> > + * @prot:    protection bits of the mapping
> > + *
> > + * Verify a userspace mapping to an enclave page would not violate the
> > +security
> > + * requirements of the *kernel*.  Note, this is in no way related to
> > +the
> > + * page protections enforced by hardware via the EPCM.  The EPCM
> > +protections
> > + * can be directly extended by the enclave, i.e. cannot be relied upon
> > +by the
> > + * kernel for security guarantees of any kind.
> > + *
> > + * Return:
> > + *   0 on success,
> > + *   -EACCES if the mapping is disallowed
> > + */
> > +int sgx_map_allowed(struct sgx_encl *encl, unsigned long start,
> > +                 unsigned long end, unsigned long prot) {
> > +     struct sgx_encl_page *page;
> > +     unsigned long addr;
> > +
> > +     prot &= (VM_READ | VM_WRITE | VM_EXEC);
> > +     if (!prot || !encl)
> > +             return 0;
> > +
> > +     mutex_lock(&encl->lock);
> > +
> > +     for (addr = start; addr < end; addr += PAGE_SIZE) {
> > +             page = radix_tree_lookup(&encl->page_tree, addr >>
> > PAGE_SHIFT);
> > +
> > +             /*
> > +              * Do not allow R|W|X to a non-existent page, or protections
> > +              * beyond those of the existing enclave page.
> > +              */
> > +             if (!page || (prot & ~page->prot))
> > +                     return -EACCES;
>
> In SGX2, pages will be "mapped" before being populated.
>
> Here's a brief summary for those who don't have enough background on how new EPC pages could be added to a running enclave in SGX2:
>   - There are 2 new instructions - EACCEPT and EAUG.
>   - EAUG is used by SGX module to add (augment) a new page to an existing enclave. The newly added page is *inaccessible* until the enclave *accepts* it.
>   - EACCEPT is the instruction for an enclave to accept a new page.
>
> And the s/w flow for an enclave to request new EPC pages is expected to be something like the following:
>   - The enclave issues EACCEPT at the linear address that it would like a new page.
>   - EACCEPT results in #PF, as there's no page at the linear address above.
>   - SGX module is notified about the #PF, in form of its vma->vm_ops->fault() being called by kernel.
>   - SGX module EAUGs a new EPC page at the fault address, and resumes the enclave.
>   - EACCEPT is reattempted, and succeeds at this time.

This seems like an odd workflow.  Shouldn't the #PF return back to
untrusted userspace so that the untrusted user code can make its own
decision as to whether it wants to EAUG a page there as opposed to,
say, killing the enclave or waiting to keep resource usage under
control?
Xing, Cedric June 10, 2019, 10:28 p.m. UTC | #7
> From: Andy Lutomirski [mailto:luto@kernel.org]
> Sent: Monday, June 10, 2019 12:15 PM
> 
> On Mon, Jun 10, 2019 at 11:29 AM Xing, Cedric <cedric.xing@intel.com>
> wrote:
> >
> > > From: Christopherson, Sean J
> > > Sent: Wednesday, June 05, 2019 7:12 PM
> > >
> > > +/**
> > > + * sgx_map_allowed - check vma protections against the associated
> > > enclave page
> > > + * @encl:    an enclave
> > > + * @start:   start address of the mapping (inclusive)
> > > + * @end:     end address of the mapping (exclusive)
> > > + * @prot:    protection bits of the mapping
> > > + *
> > > + * Verify a userspace mapping to an enclave page would not violate
> > > +the security
> > > + * requirements of the *kernel*.  Note, this is in no way related
> > > +to the
> > > + * page protections enforced by hardware via the EPCM.  The EPCM
> > > +protections
> > > + * can be directly extended by the enclave, i.e. cannot be relied
> > > +upon by the
> > > + * kernel for security guarantees of any kind.
> > > + *
> > > + * Return:
> > > + *   0 on success,
> > > + *   -EACCES if the mapping is disallowed
> > > + */
> > > +int sgx_map_allowed(struct sgx_encl *encl, unsigned long start,
> > > +                 unsigned long end, unsigned long prot) {
> > > +     struct sgx_encl_page *page;
> > > +     unsigned long addr;
> > > +
> > > +     prot &= (VM_READ | VM_WRITE | VM_EXEC);
> > > +     if (!prot || !encl)
> > > +             return 0;
> > > +
> > > +     mutex_lock(&encl->lock);
> > > +
> > > +     for (addr = start; addr < end; addr += PAGE_SIZE) {
> > > +             page = radix_tree_lookup(&encl->page_tree, addr >>
> > > PAGE_SHIFT);
> > > +
> > > +             /*
> > > +              * Do not allow R|W|X to a non-existent page, or
> protections
> > > +              * beyond those of the existing enclave page.
> > > +              */
> > > +             if (!page || (prot & ~page->prot))
> > > +                     return -EACCES;
> >
> > In SGX2, pages will be "mapped" before being populated.
> >
> > Here's a brief summary for those who don't have enough background on
> how new EPC pages could be added to a running enclave in SGX2:
> >   - There are 2 new instructions - EACCEPT and EAUG.
> >   - EAUG is used by SGX module to add (augment) a new page to an
> existing enclave. The newly added page is *inaccessible* until the
> enclave *accepts* it.
> >   - EACCEPT is the instruction for an enclave to accept a new page.
> >
> > And the s/w flow for an enclave to request new EPC pages is expected
> to be something like the following:
> >   - The enclave issues EACCEPT at the linear address that it would
> like a new page.
> >   - EACCEPT results in #PF, as there's no page at the linear address
> above.
> >   - SGX module is notified about the #PF, in form of its vma->vm_ops-
> >fault() being called by kernel.
> >   - SGX module EAUGs a new EPC page at the fault address, and resumes
> the enclave.
> >   - EACCEPT is reattempted, and succeeds at this time.
> 
> This seems like an odd workflow.  Shouldn't the #PF return back to
> untrusted userspace so that the untrusted user code can make its own
> decision as to whether it wants to EAUG a page there as opposed to, say,
> killing the enclave or waiting to keep resource usage under control?

This may seem odd to some at the first glance. But if you can think of how static heap (pre-allocated by EADD before EINIT) works, the load parses the "metadata" coming with the enclave to decide the address/size of the heap, EADDs it, and calls it done. In the case of "dynamic" heap (allocated dynamically by EAUG after EINIT), the same thing applies - the loader determines the range of the heap, tells the SGX module about it, and calls it done. Everything else is the between the enclave and the SGX module.

In practice, untrusted code usually doesn't know much about enclaves, just like it doesn't know much about the shared objects loaded into its address space either. Without the necessary knowledge, untrusted code usually just does what it is told (via o-calls, or return value from e-calls), without judging that's right or wrong. 

When it comes to #PF like what I described, of course a signal could be sent to the untrusted code but what would it do then? Usually it'd just come back asking for a page at the fault address. So we figured it'd be more efficient to just have the kernel EAUG at #PF. 

Please don't get me wrong though, as I'm not dictating what the s/w flow shall be. It's just going to be a choice offered to user mode. And that choice was planned to be offered via mprotect() - i.e. a writable vma causes kernel to EAUG while a non-writable vma will result in a signal (then the user mode could decide whether to EAUG). The key point is flexibility - as we want to allow all reasonable s/w flows instead of dictating one over others. We had similar discussions on vDSO API before. And I think you accepted my approach because of its flexibility. Am I right?
Andy Lutomirski June 12, 2019, 12:09 a.m. UTC | #8
On Jun 10, 2019, at 3:28 PM, Xing, Cedric <cedric.xing@intel.com> wrote:

>> From: Andy Lutomirski [mailto:luto@kernel.org]
>> Sent: Monday, June 10, 2019 12:15 PM
>> 
>> On Mon, Jun 10, 2019 at 11:29 AM Xing, Cedric <cedric.xing@intel.com>
>> wrote:
>>> 
>>>> From: Christopherson, Sean J
>>>> Sent: Wednesday, June 05, 2019 7:12 PM
>>>> 
>>>> +/**
>>>> + * sgx_map_allowed - check vma protections against the associated
>>>> enclave page
>>>> + * @encl:    an enclave
>>>> + * @start:   start address of the mapping (inclusive)
>>>> + * @end:     end address of the mapping (exclusive)
>>>> + * @prot:    protection bits of the mapping
>>>> + *
>>>> + * Verify a userspace mapping to an enclave page would not violate
>>>> +the security
>>>> + * requirements of the *kernel*.  Note, this is in no way related
>>>> +to the
>>>> + * page protections enforced by hardware via the EPCM.  The EPCM
>>>> +protections
>>>> + * can be directly extended by the enclave, i.e. cannot be relied
>>>> +upon by the
>>>> + * kernel for security guarantees of any kind.
>>>> + *
>>>> + * Return:
>>>> + *   0 on success,
>>>> + *   -EACCES if the mapping is disallowed
>>>> + */
>>>> +int sgx_map_allowed(struct sgx_encl *encl, unsigned long start,
>>>> +                 unsigned long end, unsigned long prot) {
>>>> +     struct sgx_encl_page *page;
>>>> +     unsigned long addr;
>>>> +
>>>> +     prot &= (VM_READ | VM_WRITE | VM_EXEC);
>>>> +     if (!prot || !encl)
>>>> +             return 0;
>>>> +
>>>> +     mutex_lock(&encl->lock);
>>>> +
>>>> +     for (addr = start; addr < end; addr += PAGE_SIZE) {
>>>> +             page = radix_tree_lookup(&encl->page_tree, addr >>
>>>> PAGE_SHIFT);
>>>> +
>>>> +             /*
>>>> +              * Do not allow R|W|X to a non-existent page, or
>> protections
>>>> +              * beyond those of the existing enclave page.
>>>> +              */
>>>> +             if (!page || (prot & ~page->prot))
>>>> +                     return -EACCES;
>>> 
>>> In SGX2, pages will be "mapped" before being populated.
>>> 
>>> Here's a brief summary for those who don't have enough background on
>> how new EPC pages could be added to a running enclave in SGX2:
>>>  - There are 2 new instructions - EACCEPT and EAUG.
>>>  - EAUG is used by SGX module to add (augment) a new page to an
>> existing enclave. The newly added page is *inaccessible* until the
>> enclave *accepts* it.
>>>  - EACCEPT is the instruction for an enclave to accept a new page.
>>> 
>>> And the s/w flow for an enclave to request new EPC pages is expected
>> to be something like the following:
>>>  - The enclave issues EACCEPT at the linear address that it would
>> like a new page.
>>>  - EACCEPT results in #PF, as there's no page at the linear address
>> above.
>>>  - SGX module is notified about the #PF, in form of its vma->vm_ops-
>>> fault() being called by kernel.
>>>  - SGX module EAUGs a new EPC page at the fault address, and resumes
>> the enclave.
>>>  - EACCEPT is reattempted, and succeeds at this time.
>> 
>> This seems like an odd workflow.  Shouldn't the #PF return back to
>> untrusted userspace so that the untrusted user code can make its own
>> decision as to whether it wants to EAUG a page there as opposed to, say,
>> killing the enclave or waiting to keep resource usage under control?
> 
> This may seem odd to some at the first glance. But if you can think of how static heap (pre-allocated by EADD before EINIT) works, the load parses the "metadata" coming with the enclave to decide the address/size of the heap, EADDs it, and calls it done. In the case of "dynamic" heap (allocated dynamically by EAUG after EINIT), the same thing applies - the loader determines the range of the heap, tells the SGX module about it, and calls it done. Everything else is the between the enclave and the SGX module.
> 
> In practice, untrusted code usually doesn't know much about enclaves, just like it doesn't know much about the shared objects loaded into its address space either. Without the necessary knowledge, untrusted code usually just does what it is told (via o-calls, or return value from e-calls), without judging that's right or wrong. 
> 
> When it comes to #PF like what I described, of course a signal could be sent to the untrusted code but what would it do then? Usually it'd just come back asking for a page at the fault address. So we figured it'd be more efficient to just have the kernel EAUG at #PF. 
> 
> Please don't get me wrong though, as I'm not dictating what the s/w flow shall be. It's just going to be a choice offered to user mode. And that choice was planned to be offered via mprotect() - i.e. a writable vma causes kernel to EAUG while a non-writable vma will result in a signal (then the user mode could decide whether to EAUG). The key point is flexibility - as we want to allow all reasonable s/w flows instead of dictating one over others. We had similar discussions on vDSO API before. And I think you accepted my approach because of its flexibility. Am I right?

As long as user code can turn this off, I have no real objection. But it might make sense to have it be more explicit — have an ioctl set up a range as “EAUG-on-demand”.

But this is all currently irrelevant. We can argue about it when the patches show up. :)
Sean Christopherson June 12, 2019, 2:34 p.m. UTC | #9
On Tue, Jun 11, 2019 at 05:09:28PM -0700, Andy Lutomirski wrote:
> 
> On Jun 10, 2019, at 3:28 PM, Xing, Cedric <cedric.xing@intel.com> wrote:
> 
> >> From: Andy Lutomirski [mailto:luto@kernel.org]
> >> Sent: Monday, June 10, 2019 12:15 PM
> >> This seems like an odd workflow.  Shouldn't the #PF return back to
> >> untrusted userspace so that the untrusted user code can make its own
> >> decision as to whether it wants to EAUG a page there as opposed to, say,
> >> killing the enclave or waiting to keep resource usage under control?
> > 
> > This may seem odd to some at the first glance. But if you can think of how
> > static heap (pre-allocated by EADD before EINIT) works, the load parses the
> > "metadata" coming with the enclave to decide the address/size of the heap,
> > EADDs it, and calls it done. In the case of "dynamic" heap (allocated
> > dynamically by EAUG after EINIT), the same thing applies - the loader
> > determines the range of the heap, tells the SGX module about it, and calls
> > it done. Everything else is the between the enclave and the SGX module.
> > 
> > In practice, untrusted code usually doesn't know much about enclaves, just
> > like it doesn't know much about the shared objects loaded into its address
> > space either. Without the necessary knowledge, untrusted code usually just
> > does what it is told (via o-calls, or return value from e-calls), without
> > judging that's right or wrong. 
> > 
> > When it comes to #PF like what I described, of course a signal could be
> > sent to the untrusted code but what would it do then? Usually it'd just
> > come back asking for a page at the fault address. So we figured it'd be
> > more efficient to just have the kernel EAUG at #PF. 
> > 
> > Please don't get me wrong though, as I'm not dictating what the s/w flow
> > shall be. It's just going to be a choice offered to user mode. And that
> > choice was planned to be offered via mprotect() - i.e. a writable vma
> > causes kernel to EAUG while a non-writable vma will result in a signal
> > (then the user mode could decide whether to EAUG). The key point is
> > flexibility - as we want to allow all reasonable s/w flows instead of
> > dictating one over others. We had similar discussions on vDSO API before.
> > And I think you accepted my approach because of its flexibility. Am I
> > right?
> 
> As long as user code can turn this off, I have no real objection. But it
> might make sense to have it be more explicit — have an ioctl set up a range
> as “EAUG-on-demand”.

This was part of the motivation behind changing SGX_IOC_ENCLAVE_ADD_PAGE
to SGX_IOC_ENCLAVE_ADD_REGION and adding a @flags parameter.  E.g. adding
support for "EAUG-on-demand" regions would just be a new flag.

> But this is all currently irrelevant. We can argue about it when the patches
> show up. :)
Xing, Cedric June 12, 2019, 6:20 p.m. UTC | #10
> From: Christopherson, Sean J
> Sent: Wednesday, June 12, 2019 7:34 AM
> 
> On Tue, Jun 11, 2019 at 05:09:28PM -0700, Andy Lutomirski wrote:
> >
> > On Jun 10, 2019, at 3:28 PM, Xing, Cedric <cedric.xing@intel.com>
> wrote:
> >
> > >> From: Andy Lutomirski [mailto:luto@kernel.org]
> > >> Sent: Monday, June 10, 2019 12:15 PM This seems like an odd
> > >> workflow.  Shouldn't the #PF return back to untrusted userspace so
> > >> that the untrusted user code can make its own decision as to
> > >> whether it wants to EAUG a page there as opposed to, say, killing
> > >> the enclave or waiting to keep resource usage under control?
> > >
> > > This may seem odd to some at the first glance. But if you can think
> > > of how static heap (pre-allocated by EADD before EINIT) works, the
> > > load parses the "metadata" coming with the enclave to decide the
> > > address/size of the heap, EADDs it, and calls it done. In the case
> > > of "dynamic" heap (allocated dynamically by EAUG after EINIT), the
> > > same thing applies - the loader determines the range of the heap,
> > > tells the SGX module about it, and calls it done. Everything else is
> the between the enclave and the SGX module.
> > >
> > > In practice, untrusted code usually doesn't know much about
> > > enclaves, just like it doesn't know much about the shared objects
> > > loaded into its address space either. Without the necessary
> > > knowledge, untrusted code usually just does what it is told (via
> > > o-calls, or return value from e-calls), without judging that's right
> or wrong.
> > >
> > > When it comes to #PF like what I described, of course a signal could
> > > be sent to the untrusted code but what would it do then? Usually
> > > it'd just come back asking for a page at the fault address. So we
> > > figured it'd be more efficient to just have the kernel EAUG at #PF.
> > >
> > > Please don't get me wrong though, as I'm not dictating what the s/w
> > > flow shall be. It's just going to be a choice offered to user mode.
> > > And that choice was planned to be offered via mprotect() - i.e. a
> > > writable vma causes kernel to EAUG while a non-writable vma will
> > > result in a signal (then the user mode could decide whether to
> > > EAUG). The key point is flexibility - as we want to allow all
> > > reasonable s/w flows instead of dictating one over others. We had
> similar discussions on vDSO API before.
> > > And I think you accepted my approach because of its flexibility. Am
> > > I right?
> >
> > As long as user code can turn this off, I have no real objection. But
> > it might make sense to have it be more explicit — have an ioctl set up
> > a range as “EAUG-on-demand”.
> 
> This was part of the motivation behind changing SGX_IOC_ENCLAVE_ADD_PAGE
> to SGX_IOC_ENCLAVE_ADD_REGION and adding a @flags parameter.  E.g.
> adding support for "EAUG-on-demand" regions would just be a new flag.

We'll end up in some sort of interface eventually. But that's too early to discuss.

Currently what we need is the plumbing - i.e. the range has to be mmap()'ed and it cannot be PROT_NONE, otherwise vm_ops->fault() will not be reached.

> 
> > But this is all currently irrelevant. We can argue about it when the
> > patches show up. :)
Jarkko Sakkinen June 12, 2019, 7:26 p.m. UTC | #11
On Mon, Jun 10, 2019 at 11:17:44AM -0700, Sean Christopherson wrote:
> On Mon, Jun 10, 2019 at 08:45:06PM +0300, Jarkko Sakkinen wrote:
> > On Mon, Jun 10, 2019 at 09:15:33AM -0700, Sean Christopherson wrote:
> > > > 'flags' should would renamed as 'secinfo_flags_mask' even if the name is
> > > > longish. It would use the same values as the SECINFO flags. The field in
> > > > struct sgx_encl_page should have the same name. That would express
> > > > exactly relation between SECINFO and the new field. I would have never
> > > > asked on last iteration why SECINFO is not enough with a better naming.
> > > 
> > > No, these flags do not impact the EPCM protections in any way.  Userspace
> > > can extend the EPCM protections without going through the kernel.  The
> > > protection flags for an enclave page impact VMA/PTE protection bits.
> > > 
> > > IMO, it is best to treat the EPCM as being completely separate from the
> > > kernel's EPC management.
> > 
> > It is a clumsy API if permissions are not taken in the same format for
> > everything. There is no reason not to do it. The way mprotect() callback
> > just interprets the field is as VMA permissions.
> 
> They are two entirely different things.  The explicit protection bits are
> consumed by the kernel, while SECINFO.flags is consumed by the CPU.  The
> intent is to have the protection flags be analogous to mprotect(), the
> fact that they have a similar/identical format to SECINFO is irrelevant.
> 
> Calling the field secinfo_flags_mask is straight up wrong on SGX2, as 
> userspace can use EMODPE to set SECINFO after the page is added.  It's
> also wrong on SGX1 when adding TCS pages since SECINFO.RWX bits for TCS
> pages are forced to zero by hardware.

The new variable tells the limits on which kernel will co-operate with
the enclave. It is way more descriptive than 'flags'.

> > It would also be more future-proof just to have a mask covering all bits
> > of the SECINFO flags field.
> 
> This simply doesn't work, e.g. the PENDING, MODIFIED and PR flags in the
> SECINFO are read-only from a software perspective.

It is easy to validate reserved bits from a SECINFO struct.

/Jarkko

Patch
diff mbox series

diff --git a/arch/x86/include/uapi/asm/sgx.h b/arch/x86/include/uapi/asm/sgx.h
index 9ed690a38c70..2c6198ffeaf8 100644
--- a/arch/x86/include/uapi/asm/sgx.h
+++ b/arch/x86/include/uapi/asm/sgx.h
@@ -37,12 +37,14 @@  struct sgx_enclave_create  {
  * @addr:	address within the ELRANGE
  * @src:	address for the page data
  * @secinfo:	address for the SECINFO data
+ * @flags:	flags, e.g. PROT_{READ,WRITE,EXEC}
  * @mrmask:	bitmask for the measured 256 byte chunks
  */
 struct sgx_enclave_add_page {
 	__u64	addr;
 	__u64	src;
 	__u64	secinfo;
+	__u32	flags;
 	__u16	mrmask;
 } __attribute__((__packed__));
 
diff --git a/arch/x86/kernel/cpu/sgx/driver/ioctl.c b/arch/x86/kernel/cpu/sgx/driver/ioctl.c
index a27ec26a9350..ef5c2ce0f37b 100644
--- a/arch/x86/kernel/cpu/sgx/driver/ioctl.c
+++ b/arch/x86/kernel/cpu/sgx/driver/ioctl.c
@@ -235,7 +235,8 @@  static int sgx_validate_secs(const struct sgx_secs *secs,
 }
 
 static struct sgx_encl_page *sgx_encl_page_alloc(struct sgx_encl *encl,
-						 unsigned long addr)
+						 unsigned long addr,
+						 unsigned long prot)
 {
 	struct sgx_encl_page *encl_page;
 	int ret;
@@ -247,6 +248,7 @@  static struct sgx_encl_page *sgx_encl_page_alloc(struct sgx_encl *encl,
 		return ERR_PTR(-ENOMEM);
 	encl_page->desc = addr;
 	encl_page->encl = encl;
+	encl_page->prot = prot;
 	ret = radix_tree_insert(&encl->page_tree, PFN_DOWN(encl_page->desc),
 				encl_page);
 	if (ret) {
@@ -531,7 +533,7 @@  static int __sgx_encl_add_page(struct sgx_encl *encl,
 
 static int sgx_encl_add_page(struct sgx_encl *encl, unsigned long addr,
 			     void *data, struct sgx_secinfo *secinfo,
-			     unsigned int mrmask)
+			     unsigned int mrmask, unsigned long prot)
 {
 	u64 page_type = secinfo->flags & SGX_SECINFO_PAGE_TYPE_MASK;
 	struct sgx_encl_page *encl_page;
@@ -557,7 +559,7 @@  static int sgx_encl_add_page(struct sgx_encl *encl, unsigned long addr,
 		goto out;
 	}
 
-	encl_page = sgx_encl_page_alloc(encl, addr);
+	encl_page = sgx_encl_page_alloc(encl, addr, prot);
 	if (IS_ERR(encl_page)) {
 		ret = PTR_ERR(encl_page);
 		goto out;
@@ -599,6 +601,7 @@  static long sgx_ioc_enclave_add_page(struct file *filep, unsigned int cmd,
 	struct sgx_enclave_add_page *addp = (void *)arg;
 	struct sgx_encl *encl = filep->private_data;
 	struct sgx_secinfo secinfo;
+	unsigned long prot;
 	struct page *data_page;
 	void *data;
 	int ret;
@@ -618,7 +621,10 @@  static long sgx_ioc_enclave_add_page(struct file *filep, unsigned int cmd,
 		goto out;
 	}
 
-	ret = sgx_encl_add_page(encl, addp->addr, data, &secinfo, addp->mrmask);
+	prot = addp->flags & (PROT_READ | PROT_WRITE | PROT_EXEC);
+
+	ret = sgx_encl_add_page(encl, addp->addr, data, &secinfo, addp->mrmask,
+				prot);
 	if (ret)
 		goto out;
 
diff --git a/arch/x86/kernel/cpu/sgx/driver/main.c b/arch/x86/kernel/cpu/sgx/driver/main.c
index 129d356aff30..65a87c2fdf02 100644
--- a/arch/x86/kernel/cpu/sgx/driver/main.c
+++ b/arch/x86/kernel/cpu/sgx/driver/main.c
@@ -63,6 +63,11 @@  static long sgx_compat_ioctl(struct file *filep, unsigned int cmd,
 static int sgx_mmap(struct file *file, struct vm_area_struct *vma)
 {
 	struct sgx_encl *encl = file->private_data;
+	int ret;
+
+	ret = sgx_map_allowed(encl, vma->vm_start, vma->vm_end, vma->vm_flags);
+	if (ret)
+		return ret;
 
 	vma->vm_ops = &sgx_vm_ops;
 	vma->vm_flags |= VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP | VM_IO;
diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index 7216bdf07bd0..a5a412220058 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -235,6 +235,58 @@  static void sgx_vma_close(struct vm_area_struct *vma)
 	kref_put(&encl->refcount, sgx_encl_release);
 }
 
+
+/**
+ * sgx_map_allowed - check vma protections against the associated enclave page
+ * @encl:	an enclave
+ * @start:	start address of the mapping (inclusive)
+ * @end:	end address of the mapping (exclusive)
+ * @prot:	protection bits of the mapping
+ *
+ * Verify a userspace mapping to an enclave page would not violate the security
+ * requirements of the *kernel*.  Note, this is in no way related to the
+ * page protections enforced by hardware via the EPCM.  The EPCM protections
+ * can be directly extended by the enclave, i.e. cannot be relied upon by the
+ * kernel for security guarantees of any kind.
+ *
+ * Return:
+ *   0 on success,
+ *   -EACCES if the mapping is disallowed
+ */
+int sgx_map_allowed(struct sgx_encl *encl, unsigned long start,
+		    unsigned long end, unsigned long prot)
+{
+	struct sgx_encl_page *page;
+	unsigned long addr;
+
+	prot &= (VM_READ | VM_WRITE | VM_EXEC);
+	if (!prot || !encl)
+		return 0;
+
+	mutex_lock(&encl->lock);
+
+	for (addr = start; addr < end; addr += PAGE_SIZE) {
+		page = radix_tree_lookup(&encl->page_tree, addr >> PAGE_SHIFT);
+
+		/*
+		 * Do not allow R|W|X to a non-existent page, or protections
+		 * beyond those of the existing enclave page.
+		 */
+		if (!page || (prot & ~page->prot))
+			return -EACCES;
+	}
+
+	mutex_unlock(&encl->lock);
+
+	return 0;
+}
+
+static int sgx_vma_mprotect(struct vm_area_struct *vma, unsigned long start,
+			    unsigned long end, unsigned long prot)
+{
+	return sgx_map_allowed(vma->vm_private_data, start, end, prot);
+}
+
 static unsigned int sgx_vma_fault(struct vm_fault *vmf)
 {
 	unsigned long addr = (unsigned long)vmf->address;
@@ -372,6 +424,7 @@  static int sgx_vma_access(struct vm_area_struct *vma, unsigned long addr,
 const struct vm_operations_struct sgx_vm_ops = {
 	.close = sgx_vma_close,
 	.open = sgx_vma_open,
+	.may_mprotect = sgx_vma_mprotect,
 	.fault = sgx_vma_fault,
 	.access = sgx_vma_access,
 };
diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
index c557f0374d74..176467c0eb22 100644
--- a/arch/x86/kernel/cpu/sgx/encl.h
+++ b/arch/x86/kernel/cpu/sgx/encl.h
@@ -41,6 +41,7 @@  enum sgx_encl_page_desc {
 
 struct sgx_encl_page {
 	unsigned long desc;
+	unsigned long prot;
 	struct sgx_epc_page *epc_page;
 	struct sgx_va_page *va_page;
 	struct sgx_encl *encl;
@@ -106,6 +107,9 @@  static inline unsigned long sgx_pcmd_offset(pgoff_t page_index)
 	       sizeof(struct sgx_pcmd);
 }
 
+int sgx_map_allowed(struct sgx_encl *encl, unsigned long start,
+		    unsigned long end, unsigned long prot);
+
 enum sgx_encl_mm_iter {
 	SGX_ENCL_MM_ITER_DONE		= 0,
 	SGX_ENCL_MM_ITER_NEXT		= 1,