[RFC,v4,04/12] x86/sgx: Require userspace to define enclave pages' protection bits

Message ID 20190619222401.14942-5-sean.j.christopherson@intel.com (mailing list archive)
State New, archived
Series security: x86/sgx: SGX vs. LSM

Commit Message

Sean Christopherson June 19, 2019, 10:23 p.m. UTC
Existing Linux Security Module policies restrict userspace's ability to
map memory, e.g. may require privileged permissions to map a page that
is simultaneously writable and executable.  Said permissions are often
tied to the file which backs the mapped memory, i.e. vm_file.

For reasons explained below, SGX does not allow LSMs to enforce policies
using existing LSM hooks such as file_mprotect().  Explicitly track the
protection bits for an enclave page (separate from the vma/pte bits) and
require userspace to explicitly define a page's protection bits when the
page is added to the enclave.  Tracking enclave page protection bits paves
the way for adding a security_enclave_load() LSM hook as an SGX equivalent
to security_file_mprotect(), e.g. SGX can pass the page's protection bits
and source vma to LSMs.  The source vma will allow LSMs to tie
permissions to files, e.g. the file containing the enclave's code and
initial data, and the protection bits will allow LSMs to make decisions
based on the capabilities of the process, e.g. if a process is allowed
to load unmeasured code or load code from anonymous memory.
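As a rough illustration, userspace would declare the protections at EADD
time roughly as follows (sketch only; field names match the uapi changes
below, the address/buffer variables are hypothetical):

	struct sgx_enclave_add_page addp = {
		.addr    = enclave_addr,          /* hypothetical target address */
		.src     = (uint64_t)page_data,   /* hypothetical source buffer */
		.secinfo = (uint64_t)&secinfo,
		.mrmask  = 0xFFFF,                /* measure all 256-byte chunks */
		.prot    = PROT_READ | PROT_EXEC, /* maximal protections, e.g. a code page */
	};

	if (ioctl(dev_fd, SGX_IOC_ENCLAVE_ADD_PAGE, &addp))
		/* handle error */;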

Due to the nature of the Enclave Page Cache (EPC), and because the EPC
is manually managed by SGX, all enclave vmas are backed by the same
file, i.e. /dev/sgx/enclave.  Specifically, using a single file allows
SGX to use file op hooks to move pages in/out of the EPC.

Furthermore, EPC pages for any given enclave are fundamentally shared
between processes, i.e. CoW semantics are not possible with EPC pages
due to hardware restrictions such as 1:1 mappings between virtual and
physical addresses (within the enclave).

Lastly, all real world enclaves will need read, write and execute
permissions to EPC pages.

As a result, SGX does not play nice with existing LSM behavior as it is
impossible to apply policies to enclaves with reasonable granularity,
e.g. an LSM can deny access to EPC altogether, but can't deny
potentially unwanted behavior such as mapping pages WX, loading code
from anonymous memory, loading unmeasured code, etc...

For example, because all (practical) enclaves need RW pages for data and
RX pages for code, SELinux's existing policies will require all enclaves
to have FILE__READ, FILE__WRITE and FILE__EXECUTE permissions on
/dev/sgx/enclave.  Withholding FILE__WRITE or FILE__EXECUTE in an attempt
to deny RW->RX or RWX would prevent running *any* enclave, even those
that cleanly separate RW and RX pages.  And because /dev/sgx/enclave
requires MAP_SHARED, the anonymous/CoW checks that would trigger
FILE__EXECMOD or PROCESS__EXECMEM permissions will never fire.

Requiring protection bits at load time has a second use: it allows SGX
to prevent loading an enclave from a noexec file system.  On SGX2 hardware,
regardless of kernel support for SGX2, userspace could EADD a page from
a noexec path using read-only permissions and later mprotect() and
ENCLU[EMODPE] the page to gain execute permissions.  By requiring
the enclave's page protections up front, SGX will be able to enforce
noexec paths when building enclaves.
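For illustration, the SGX2 scenario being guarded against would look
roughly like the following userspace fragment (hypothetical addresses
and variables, sketch only):

	/*
	 * Hypothetical SGX2 scenario: page data comes from a file on a
	 * noexec mount and is declared read-only at EADD.
	 */
	addp.src  = (uint64_t)code_mapped_from_noexec_file;  /* hypothetical */
	addp.prot = PROT_READ;
	ioctl(dev_fd, SGX_IOC_ENCLAVE_ADD_PAGE, &addp);

	/* ... EADD the rest of the enclave, EINIT, map it ... */

	/* Without up-front protections, nothing stops this on SGX2: */
	mprotect(page_addr, 4096, PROT_READ | PROT_EXEC);
	/* followed by ENCLU[EMODPE] inside the enclave to extend the EPCM bits. */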

To prevent userspace from circumventing the allowed protections, do not
allow PROT_{READ,WRITE,EXEC} mappings to an enclave without an
associated enclave page, i.e. prevent creating a mapping with unchecked
protection bits.

Many alternatives[1][2] have been explored, most notably the concept of
having SGX check (at load time) and save the permissions of the enclave
loader.  The permissions would then be enforced by SGX at run time, e.g.
via mmap()/mprotect() hooks of some form.  The basic functionality of
pre-checking permissions is relatively straightforward, but supporting
LSM auditing is complex and fraught with pitfalls.  If auditing is done
at the time of denial then the audit logs will potentially show a large
number of false positives.  Auditing when the denial is enforced, e.g.
at mprotect(), suffers from its own problems, e.g.:

  - Requires LSMs to pre-generate audit messages so that they can be
    replayed by SGX when the denial is actually enforced.

  - System changes can result in stale audit messages, e.g. if files
    are removed from the system, an LSM profile is modified, etc...

  - A process could log what is essentially a false positive denial,
    e.g. if the current process has the requisite capability but the
    original enclave loader did not.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/include/uapi/asm/sgx.h        |  6 ++--
 arch/x86/kernel/cpu/sgx/driver/ioctl.c | 15 +++++---
 arch/x86/kernel/cpu/sgx/driver/main.c  | 49 ++++++++++++++++++++++++++
 arch/x86/kernel/cpu/sgx/encl.h         |  1 +
 tools/testing/selftests/x86/sgx/main.c | 32 +++++++++++++++--
 5 files changed, 94 insertions(+), 9 deletions(-)

Comments

Jarkko Sakkinen June 21, 2019, 1:07 a.m. UTC | #1
On Wed, Jun 19, 2019 at 03:23:53PM -0700, Sean Christopherson wrote:
>  arch/x86/include/uapi/asm/sgx.h        |  6 ++--
>  arch/x86/kernel/cpu/sgx/driver/ioctl.c | 15 +++++---
>  arch/x86/kernel/cpu/sgx/driver/main.c  | 49 ++++++++++++++++++++++++++
>  arch/x86/kernel/cpu/sgx/encl.h         |  1 +
>  tools/testing/selftests/x86/sgx/main.c | 32 +++++++++++++++--

Please split the kselftest change to a separate patch.

>  5 files changed, 94 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/x86/include/uapi/asm/sgx.h b/arch/x86/include/uapi/asm/sgx.h
> index 6dba9f282232..67a3babbb24d 100644
> --- a/arch/x86/include/uapi/asm/sgx.h
> +++ b/arch/x86/include/uapi/asm/sgx.h
> @@ -35,15 +35,17 @@ struct sgx_enclave_create  {
>   * @src:	address for the page data
>   * @secinfo:	address for the SECINFO data
>   * @mrmask:	bitmask for the measured 256 byte chunks
> + * @prot:	maximal PROT_{READ,WRITE,EXEC} protections for the page
>   */
>  struct sgx_enclave_add_page {
>  	__u64	addr;
>  	__u64	src;
>  	__u64	secinfo;
> -	__u64	mrmask;
> +	__u16	mrmask;
> +	__u8	prot;
> +	__u8	pad;

__u8 pad[7];

> +/*
> + * Returns the AND of VM_{READ,WRITE,EXEC} permissions across all pages
> + * covered by the specific VMA.  A non-existent (or yet to be added) enclave
> + * page is considered to have no RWX permissions, i.e. is inaccessible.
> + */

That was a bit hard to grasp (at least for me). I would rephrase it like:

/**
 * sgx_calc_vma_prot_intersection() - Calculate intersection of the permissions
 *				      for a VMA
 * @encl:	an enclave
 * @vma:	a VMA inside the enclave
 *
 * Iterate through the page addresses inside the VMA and calculate a bitmask
 * of permissions that all pages have in common. Page addresses that do
 * not have an associated enclave page are interpreted to zero
 * permissions.
 */

> +static unsigned long sgx_allowed_rwx(struct sgx_encl *encl,
> +				     struct vm_area_struct *vma)

Suggestion for the name: sgx_calc_vma_prot_intersection()

> +{
> +	unsigned long allowed_rwx = VM_READ | VM_WRITE | VM_EXEC;
> +	unsigned long idx, idx_start, idx_end;
> +	struct sgx_encl_page *page;
> +
> +	idx_start = PFN_DOWN(vma->vm_start);
> +	idx_end = PFN_DOWN(vma->vm_end - 1);

Suggestion: just open code these to the for-statement.

> +
> +	for (idx = idx_start; idx <= idx_end; ++idx) {
> +		/*
> +		 * No need to take encl->lock, vm_prot_bits is set prior to
> +		 * insertion and never changes, and racing with adding pages is
> +		 * a userspace bug.
> +		 */
> +		rcu_read_lock();
> +		page = radix_tree_lookup(&encl->page_tree, idx);
> +		rcu_read_unlock();
> +
> +		/* Do not allow R|W|X to a non-existent page. */
> +		if (!page)
> +			allowed_rwx = 0;
> +		else
> +			allowed_rwx &= page->vm_prot_bits;

This would be a more clean way to express the same:

	if (!page)
		return 0;

	allowed_rwx &= page->vm_prot_bits;

/Jarkko
Jarkko Sakkinen June 21, 2019, 1:16 a.m. UTC | #2
On Fri, Jun 21, 2019 at 04:07:53AM +0300, Jarkko Sakkinen wrote:
> /**
>  * sgx_calc_vma_prot_intersection() - Calculate intersection of the permissions
>  *				      for a VMA
>  * @encl:	an enclave
>  * @vma:	a VMA inside the enclave
>  *
>  * Iterate through the page addresses inside the VMA and calculate a bitmask
>  * of permissions that all pages have in common. Page addresses that do
>  * not have an associated enclave page are interpreted to zero
>  * permissions.
>  */
> 
> > +static unsigned long sgx_allowed_rwx(struct sgx_encl *encl,
> > +				     struct vm_area_struct *vma)
> 
> Suggestion for the name: sgx_calc_vma_prot_intersection()

And have you thought of caching these results?

I.e. hold the result for each VMA and only recalculate when the old
value is dirty. Just a random thought, zero analysis, but I thought it
good to mention anyway.

/Jarkko
Xing, Cedric June 21, 2019, 4:42 p.m. UTC | #3
> From: Christopherson, Sean J
> Sent: Wednesday, June 19, 2019 3:24 PM
> 
> diff --git a/arch/x86/include/uapi/asm/sgx.h b/arch/x86/include/uapi/asm/sgx.h index
> 6dba9f282232..67a3babbb24d 100644
> --- a/arch/x86/include/uapi/asm/sgx.h
> +++ b/arch/x86/include/uapi/asm/sgx.h
> @@ -35,15 +35,17 @@ struct sgx_enclave_create  {
>   * @src:	address for the page data
>   * @secinfo:	address for the SECINFO data
>   * @mrmask:	bitmask for the measured 256 byte chunks
> + * @prot:	maximal PROT_{READ,WRITE,EXEC} protections for the page
>   */
>  struct sgx_enclave_add_page {
>  	__u64	addr;
>  	__u64	src;
>  	__u64	secinfo;
> -	__u64	mrmask;
> +	__u16	mrmask;
> +	__u8	prot;
> +	__u8	pad;
>  };

Given EPCM permissions cannot change in SGX1, these maximal PROT_* flags can be the same as the EPCM permissions, so they don't have to be specified by user code until SGX2. Given we don't have a clear picture on how SGX2 will work yet, I think we should take "prot" off until it is proven necessary. 
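A minimal sketch of that suggestion, assuming the SGX_SECINFO_R/W/X flag
definitions from the driver's arch.h (the helper name is hypothetical):

	static unsigned long sgx_secinfo_to_vm_prot(const struct sgx_secinfo *secinfo)
	{
		unsigned long vm_prot = 0;

		/* Derive the maximal protections from the EPCM permissions. */
		if (secinfo->flags & SGX_SECINFO_R)
			vm_prot |= VM_READ;
		if (secinfo->flags & SGX_SECINFO_W)
			vm_prot |= VM_WRITE;
		if (secinfo->flags & SGX_SECINFO_X)
			vm_prot |= VM_EXEC;

		return vm_prot;
	}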

> diff --git a/arch/x86/kernel/cpu/sgx/driver/main.c b/arch/x86/kernel/cpu/sgx/driver/main.c
> index 29384cdd0842..dabfe2a7245a 100644
> --- a/arch/x86/kernel/cpu/sgx/driver/main.c
> +++ b/arch/x86/kernel/cpu/sgx/driver/main.c
> @@ -93,15 +93,64 @@ static long sgx_compat_ioctl(struct file *filep, unsigned int cmd,  }
> #endif
> 
> +/*
> + * Returns the AND of VM_{READ,WRITE,EXEC} permissions across all pages
> + * covered by the specific VMA.  A non-existent (or yet to be added)
> +enclave
> + * page is considered to have no RWX permissions, i.e. is inaccessible.
> + */
> +static unsigned long sgx_allowed_rwx(struct sgx_encl *encl,
> +				     struct vm_area_struct *vma)
> +{
> +	unsigned long allowed_rwx = VM_READ | VM_WRITE | VM_EXEC;
> +	unsigned long idx, idx_start, idx_end;
> +	struct sgx_encl_page *page;
> +
> +	idx_start = PFN_DOWN(vma->vm_start);
> +	idx_end = PFN_DOWN(vma->vm_end - 1);
> +
> +	for (idx = idx_start; idx <= idx_end; ++idx) {
> +		/*
> +		 * No need to take encl->lock, vm_prot_bits is set prior to
> +		 * insertion and never changes, and racing with adding pages is
> +		 * a userspace bug.
> +		 */
> +		rcu_read_lock();
> +		page = radix_tree_lookup(&encl->page_tree, idx);
> +		rcu_read_unlock();

This loop iterates through every page in the range, which could be very slow if the range is large.

> +
> +		/* Do not allow R|W|X to a non-existent page. */
> +		if (!page)
> +			allowed_rwx = 0;
> +		else
> +			allowed_rwx &= page->vm_prot_bits;
> +		if (!allowed_rwx)
> +			break;
> +	}
> +
> +	return allowed_rwx;
> +}
> +
>  static int sgx_mmap(struct file *file, struct vm_area_struct *vma)  {
>  	struct sgx_encl *encl = file->private_data;
> +	unsigned long allowed_rwx;
>  	int ret;
> 
> +	allowed_rwx = sgx_allowed_rwx(encl, vma);
> +	if (vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC) & ~allowed_rwx)
> +		return -EACCES;
> +
>  	ret = sgx_encl_mm_add(encl, vma->vm_mm);
>  	if (ret)
>  		return ret;
> 
> +	if (!(allowed_rwx & VM_READ))
> +		vma->vm_flags &= ~VM_MAYREAD;
> +	if (!(allowed_rwx & VM_WRITE))
> +		vma->vm_flags &= ~VM_MAYWRITE;
> +	if (!(allowed_rwx & VM_EXEC))
> +		vma->vm_flags &= ~VM_MAYEXEC;
> +

Say a range comprised of a RW sub-range and an RX sub-range is being mmap()'ed as R here. It'd succeed, but mprotect(<RW sub-range>, RW) afterwards will fail because VM_MAYWRITE is cleared here. However, if those two sub-ranges are mapped by separate mmap() calls then the same mprotect() would succeed. The inconsistency here is unexpected and unprecedented.
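To illustrate (userspace sketch, addresses and sizes hypothetical; assume
[base, base+DATA_SZ) was EADDed RW and the remainder RX):

	/* One mapping over both sub-ranges: succeeds as R, but VM_MAYWRITE
	 * is cleared because the RX sub-range doesn't allow W ...
	 */
	mmap((void *)base, TOTAL_SZ, PROT_READ,
	     MAP_SHARED | MAP_FIXED, dev_fd, 0);
	mprotect((void *)base, DATA_SZ, PROT_READ | PROT_WRITE);  /* fails, EACCES */

	/* ... whereas two separate mappings allow the same mprotect(). */
	mmap((void *)base, DATA_SZ, PROT_READ,
	     MAP_SHARED | MAP_FIXED, dev_fd, 0);
	mmap((void *)(base + DATA_SZ), TOTAL_SZ - DATA_SZ, PROT_READ,
	     MAP_SHARED | MAP_FIXED, dev_fd, 0);
	mprotect((void *)base, DATA_SZ, PROT_READ | PROT_WRITE);  /* succeeds */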

>  	vma->vm_ops = &sgx_vm_ops;
>  	vma->vm_flags |= VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP | VM_IO;
>  	vma->vm_private_data = encl;
Andy Lutomirski July 1, 2019, 6 p.m. UTC | #4
On Wed, Jun 19, 2019 at 3:24 PM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>  static int sgx_mmap(struct file *file, struct vm_area_struct *vma)
>  {
>         struct sgx_encl *encl = file->private_data;
> +       unsigned long allowed_rwx;
>         int ret;
>
> +       allowed_rwx = sgx_allowed_rwx(encl, vma);
> +       if (vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC) & ~allowed_rwx)
> +               return -EACCES;
> +
>         ret = sgx_encl_mm_add(encl, vma->vm_mm);
>         if (ret)
>                 return ret;
>
> +       if (!(allowed_rwx & VM_READ))
> +               vma->vm_flags &= ~VM_MAYREAD;
> +       if (!(allowed_rwx & VM_WRITE))
> +               vma->vm_flags &= ~VM_MAYWRITE;
> +       if (!(allowed_rwx & VM_EXEC))
> +               vma->vm_flags &= ~VM_MAYEXEC;
> +

I'm with Cedric here -- this is no good.  The reason I think we need
.may_mprotect or similar is exactly to avoid doing this.

mmap() just needs to make the same type of VMA regardless of the pages
in the range.
Xing, Cedric July 1, 2019, 7:22 p.m. UTC | #5
> From: Andy Lutomirski [mailto:luto@kernel.org]
> Sent: Monday, July 01, 2019 11:00 AM
> 
> On Wed, Jun 19, 2019 at 3:24 PM Sean Christopherson
> <sean.j.christopherson@intel.com> wrote:
> >  static int sgx_mmap(struct file *file, struct vm_area_struct *vma)  {
> >         struct sgx_encl *encl = file->private_data;
> > +       unsigned long allowed_rwx;
> >         int ret;
> >
> > +       allowed_rwx = sgx_allowed_rwx(encl, vma);
> > +       if (vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC) &
> ~allowed_rwx)
> > +               return -EACCES;
> > +
> >         ret = sgx_encl_mm_add(encl, vma->vm_mm);
> >         if (ret)
> >                 return ret;
> >
> > +       if (!(allowed_rwx & VM_READ))
> > +               vma->vm_flags &= ~VM_MAYREAD;
> > +       if (!(allowed_rwx & VM_WRITE))
> > +               vma->vm_flags &= ~VM_MAYWRITE;
> > +       if (!(allowed_rwx & VM_EXEC))
> > +               vma->vm_flags &= ~VM_MAYEXEC;
> > +
> 
> I'm with Cedric here -- this is no good.  The reason I think we
> need .may_mprotect or similar is exactly to avoid doing this.
> 
> mmap() just needs to make the same type of VMA regardless of the pages
> in the range.

Instead of making decisions on its own, a more generic approach is for the SGX subsystem/module to ask the LSM for a decision by calling security_file_mprotect(), as a new mapping could be considered a protection change from PROT_NONE to (vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC)).

.may_mprotect() also solves part of the problem - i.e. VMAs will be created consistently, but non-existent pages still cannot be mapped, even though that is necessary for #PF-driven EAUG in SGX2. Given that security_file_mprotect() is invoked by the mprotect() syscall, it looks to me like a more streamlined solution to call the same hook (security_file_mprotect) from both places (mmap and mprotect).
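A rough sketch of the security_file_mprotect()-from-mmap idea
(hypothetical helper, not part of this patch; note the hook takes PROT_*
values rather than VM_* flags, and would need linux/security.h):

	/*
	 * Hypothetical: let the LSM judge a new enclave mapping as if it
	 * were an mprotect() from PROT_NONE to the requested protections.
	 */
	static int sgx_mmap_lsm_check(struct vm_area_struct *vma)
	{
		unsigned long prot = 0;

		if (vma->vm_flags & VM_READ)
			prot |= PROT_READ;
		if (vma->vm_flags & VM_WRITE)
			prot |= PROT_WRITE;
		if (vma->vm_flags & VM_EXEC)
			prot |= PROT_EXEC;

		/* mmap() has no separate "requested" value, so reqprot == prot. */
		return security_file_mprotect(vma, prot, prot);
	}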
Sean Christopherson July 8, 2019, 4:34 p.m. UTC | #6
On Fri, Jun 21, 2019 at 09:42:54AM -0700, Xing, Cedric wrote:
> > From: Christopherson, Sean J
> > Sent: Wednesday, June 19, 2019 3:24 PM
> > 
> > diff --git a/arch/x86/include/uapi/asm/sgx.h b/arch/x86/include/uapi/asm/sgx.h index
> > 6dba9f282232..67a3babbb24d 100644
> > --- a/arch/x86/include/uapi/asm/sgx.h
> > +++ b/arch/x86/include/uapi/asm/sgx.h
> > @@ -35,15 +35,17 @@ struct sgx_enclave_create  {
> >   * @src:	address for the page data
> >   * @secinfo:	address for the SECINFO data
> >   * @mrmask:	bitmask for the measured 256 byte chunks
> > + * @prot:	maximal PROT_{READ,WRITE,EXEC} protections for the page
> >   */
> >  struct sgx_enclave_add_page {
> >  	__u64	addr;
> >  	__u64	src;
> >  	__u64	secinfo;
> > -	__u64	mrmask;
> > +	__u16	mrmask;
> > +	__u8	prot;
> > +	__u8	pad;
> >  };
> 
> Given EPCM permissions cannot change in SGX1, these maximal PROT_* flags can
> be the same as EPCM permissions, so don't have to be specified by user code
> until SGX2. Given we don't have a clear picture on how SGX2 will work yet, I
> think we shall take "prot" off until it is proven necessary.

I'm ok with deriving the maximal protections from SECINFO, so long as we
acknowledge that we're preventing userspace from utilizing EMODPE (until
the kernel supports SGX2).

> > diff --git a/arch/x86/kernel/cpu/sgx/driver/main.c
> > b/arch/x86/kernel/cpu/sgx/driver/main.c
> > index 29384cdd0842..dabfe2a7245a 100644
> > --- a/arch/x86/kernel/cpu/sgx/driver/main.c
> > +++ b/arch/x86/kernel/cpu/sgx/driver/main.c
> > @@ -93,15 +93,64 @@ static long sgx_compat_ioctl(struct file *filep, unsigned int cmd,  }
> > #endif
> > 
> > +/*
> > + * Returns the AND of VM_{READ,WRITE,EXEC} permissions across all pages
> > + * covered by the specific VMA.  A non-existent (or yet to be added)
> > +enclave
> > + * page is considered to have no RWX permissions, i.e. is inaccessible.
> > + */
> > +static unsigned long sgx_allowed_rwx(struct sgx_encl *encl,
> > +				     struct vm_area_struct *vma)
> > +{
> > +	unsigned long allowed_rwx = VM_READ | VM_WRITE | VM_EXEC;
> > +	unsigned long idx, idx_start, idx_end;
> > +	struct sgx_encl_page *page;
> > +
> > +	idx_start = PFN_DOWN(vma->vm_start);
> > +	idx_end = PFN_DOWN(vma->vm_end - 1);
> > +
> > +	for (idx = idx_start; idx <= idx_end; ++idx) {
> > +		/*
> > +		 * No need to take encl->lock, vm_prot_bits is set prior to
> > +		 * insertion and never changes, and racing with adding pages is
> > +		 * a userspace bug.
> > +		 */
> > +		rcu_read_lock();
> > +		page = radix_tree_lookup(&encl->page_tree, idx);
> > +		rcu_read_unlock();
> 
> This loop iterates through every page in the range, which could be very slow
> if the range is large.

At this point I'm shooting for functional correctness and minimal code
changes.  Optimizations will be in order at some point, just not now.

> > +
> > +		/* Do not allow R|W|X to a non-existent page. */
> > +		if (!page)
> > +			allowed_rwx = 0;
> > +		else
> > +			allowed_rwx &= page->vm_prot_bits;
> > +		if (!allowed_rwx)
> > +			break;
> > +	}
> > +
> > +	return allowed_rwx;
> > +}
> > +
> >  static int sgx_mmap(struct file *file, struct vm_area_struct *vma)  {
> >  	struct sgx_encl *encl = file->private_data;
> > +	unsigned long allowed_rwx;
> >  	int ret;
> > 
> > +	allowed_rwx = sgx_allowed_rwx(encl, vma);
> > +	if (vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC) & ~allowed_rwx)
> > +		return -EACCES;
> > +
> >  	ret = sgx_encl_mm_add(encl, vma->vm_mm);
> >  	if (ret)
> >  		return ret;
> > 
> > +	if (!(allowed_rwx & VM_READ))
> > +		vma->vm_flags &= ~VM_MAYREAD;
> > +	if (!(allowed_rwx & VM_WRITE))
> > +		vma->vm_flags &= ~VM_MAYWRITE;
> > +	if (!(allowed_rwx & VM_EXEC))
> > +		vma->vm_flags &= ~VM_MAYEXEC;
> > +
> 
> Say a range comprised of a RW sub-range and a RX sub-range is being mmap()'ed
> as R here. It'd succeed but mprotect(<RW sub-range>, RW) afterwards will fail
> because VM_MAYWRITE is cleared here. However, if those two sub-ranges are
> mapped by separate mmap() calls then the same mprotect() would succeed. The
> inconsistency here is unexpected and unprecedented.

Boo, I thought I was being super clever.

> >  	vma->vm_ops = &sgx_vm_ops;
> >  	vma->vm_flags |= VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP | VM_IO;
> >  	vma->vm_private_data = encl;
>
Xing, Cedric July 8, 2019, 5:29 p.m. UTC | #7
On 7/8/2019 9:34 AM, Sean Christopherson wrote:
> On Fri, Jun 21, 2019 at 09:42:54AM -0700, Xing, Cedric wrote:
>>> From: Christopherson, Sean J
>>> Sent: Wednesday, June 19, 2019 3:24 PM
>>>
>>> diff --git a/arch/x86/include/uapi/asm/sgx.h b/arch/x86/include/uapi/asm/sgx.h index
>>> 6dba9f282232..67a3babbb24d 100644
>>> --- a/arch/x86/include/uapi/asm/sgx.h
>>> +++ b/arch/x86/include/uapi/asm/sgx.h
>>> @@ -35,15 +35,17 @@ struct sgx_enclave_create  {
>>>    * @src:	address for the page data
>>>    * @secinfo:	address for the SECINFO data
>>>    * @mrmask:	bitmask for the measured 256 byte chunks
>>> + * @prot:	maximal PROT_{READ,WRITE,EXEC} protections for the page
>>>    */
>>>   struct sgx_enclave_add_page {
>>>   	__u64	addr;
>>>   	__u64	src;
>>>   	__u64	secinfo;
>>> -	__u64	mrmask;
>>> +	__u16	mrmask;
>>> +	__u8	prot;
>>> +	__u8	pad;
>>>   };
>>
>> Given EPCM permissions cannot change in SGX1, these maximal PROT_* flags can
>> be the same as EPCM permissions, so don't have to be specified by user code
>> until SGX2. Given we don't have a clear picture on how SGX2 will work yet, I
>> think we shall take "prot" off until it is proven necessary.
> 
> I'm ok with deriving the maximal protections from SECINFO, so long as we
> acknowledge that we're preventing userspace from utilizing EMODPE (until
> the kernel supports SGX2).

I think that's alright.

>>> diff --git a/arch/x86/kernel/cpu/sgx/driver/main.c
>>> b/arch/x86/kernel/cpu/sgx/driver/main.c
>>> index 29384cdd0842..dabfe2a7245a 100644
>>> --- a/arch/x86/kernel/cpu/sgx/driver/main.c
>>> +++ b/arch/x86/kernel/cpu/sgx/driver/main.c
>>> @@ -93,15 +93,64 @@ static long sgx_compat_ioctl(struct file *filep, unsigned int cmd,  }
>>> #endif
>>>
>>> +/*
>>> + * Returns the AND of VM_{READ,WRITE,EXEC} permissions across all pages
>>> + * covered by the specific VMA.  A non-existent (or yet to be added)
>>> +enclave
>>> + * page is considered to have no RWX permissions, i.e. is inaccessible.
>>> + */
>>> +static unsigned long sgx_allowed_rwx(struct sgx_encl *encl,
>>> +				     struct vm_area_struct *vma)
>>> +{
>>> +	unsigned long allowed_rwx = VM_READ | VM_WRITE | VM_EXEC;
>>> +	unsigned long idx, idx_start, idx_end;
>>> +	struct sgx_encl_page *page;
>>> +
>>> +	idx_start = PFN_DOWN(vma->vm_start);
>>> +	idx_end = PFN_DOWN(vma->vm_end - 1);
>>> +
>>> +	for (idx = idx_start; idx <= idx_end; ++idx) {
>>> +		/*
>>> +		 * No need to take encl->lock, vm_prot_bits is set prior to
>>> +		 * insertion and never changes, and racing with adding pages is
>>> +		 * a userspace bug.
>>> +		 */
>>> +		rcu_read_lock();
>>> +		page = radix_tree_lookup(&encl->page_tree, idx);
>>> +		rcu_read_unlock();
>>
>> This loop iterates through every page in the range, which could be very slow
>> if the range is large.
> 
> At this point I'm shooting for functional correctness and minimal code
> changes.  Optimizations will be in order at some point, just not now.

I was trying to point out in this thread that your approach isn't as
simple as it looks.

>>> +
>>> +		/* Do not allow R|W|X to a non-existent page. */
>>> +		if (!page)
>>> +			allowed_rwx = 0;
>>> +		else
>>> +			allowed_rwx &= page->vm_prot_bits;
>>> +		if (!allowed_rwx)
>>> +			break;
>>> +	}
>>> +
>>> +	return allowed_rwx;
>>> +}
>>> +
>>>   static int sgx_mmap(struct file *file, struct vm_area_struct *vma)  {
>>>   	struct sgx_encl *encl = file->private_data;
>>> +	unsigned long allowed_rwx;
>>>   	int ret;
>>>
>>> +	allowed_rwx = sgx_allowed_rwx(encl, vma);
>>> +	if (vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC) & ~allowed_rwx)
>>> +		return -EACCES;
>>> +
>>>   	ret = sgx_encl_mm_add(encl, vma->vm_mm);
>>>   	if (ret)
>>>   		return ret;
>>>
>>> +	if (!(allowed_rwx & VM_READ))
>>> +		vma->vm_flags &= ~VM_MAYREAD;
>>> +	if (!(allowed_rwx & VM_WRITE))
>>> +		vma->vm_flags &= ~VM_MAYWRITE;
>>> +	if (!(allowed_rwx & VM_EXEC))
>>> +		vma->vm_flags &= ~VM_MAYEXEC;
>>> +
>>
>> Say a range comprised of a RW sub-range and a RX sub-range is being mmap()'ed
>> as R here. It'd succeed but mprotect(<RW sub-range>, RW) afterwards will fail
>> because VM_MAYWRITE is cleared here. However, if those two sub-ranges are
>> mapped by separate mmap() calls then the same mprotect() would succeed. The
>> inconsistency here is unexpected and unprecedented.
> 
> Boo, I thought I was being super clever.
> 
>>>   	vma->vm_ops = &sgx_vm_ops;
>>>   	vma->vm_flags |= VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP | VM_IO;
>>>   	vma->vm_private_data = encl;
>>

Patch

diff --git a/arch/x86/include/uapi/asm/sgx.h b/arch/x86/include/uapi/asm/sgx.h
index 6dba9f282232..67a3babbb24d 100644
--- a/arch/x86/include/uapi/asm/sgx.h
+++ b/arch/x86/include/uapi/asm/sgx.h
@@ -35,15 +35,17 @@  struct sgx_enclave_create  {
  * @src:	address for the page data
  * @secinfo:	address for the SECINFO data
  * @mrmask:	bitmask for the measured 256 byte chunks
+ * @prot:	maximal PROT_{READ,WRITE,EXEC} protections for the page
  */
 struct sgx_enclave_add_page {
 	__u64	addr;
 	__u64	src;
 	__u64	secinfo;
-	__u64	mrmask;
+	__u16	mrmask;
+	__u8	prot;
+	__u8	pad;
 };
 
-
 /**
  * struct sgx_enclave_init - parameter structure for the
  *                           %SGX_IOC_ENCLAVE_INIT ioctl
diff --git a/arch/x86/kernel/cpu/sgx/driver/ioctl.c b/arch/x86/kernel/cpu/sgx/driver/ioctl.c
index 3552d642b26f..e18d2afd2aad 100644
--- a/arch/x86/kernel/cpu/sgx/driver/ioctl.c
+++ b/arch/x86/kernel/cpu/sgx/driver/ioctl.c
@@ -2,6 +2,7 @@ 
 // Copyright(c) 2016-19 Intel Corporation.
 
 #include <asm/mman.h>
+#include <linux/mman.h>
 #include <linux/delay.h>
 #include <linux/file.h>
 #include <linux/hashtable.h>
@@ -235,7 +236,8 @@  static int sgx_validate_secs(const struct sgx_secs *secs,
 }
 
 static struct sgx_encl_page *sgx_encl_page_alloc(struct sgx_encl *encl,
-						 unsigned long addr)
+						 unsigned long addr,
+						 unsigned long prot)
 {
 	struct sgx_encl_page *encl_page;
 	int ret;
@@ -247,6 +249,7 @@  static struct sgx_encl_page *sgx_encl_page_alloc(struct sgx_encl *encl,
 		return ERR_PTR(-ENOMEM);
 	encl_page->desc = addr;
 	encl_page->encl = encl;
+	encl_page->vm_prot_bits = calc_vm_prot_bits(prot, 0);
 	ret = radix_tree_insert(&encl->page_tree, PFN_DOWN(encl_page->desc),
 				encl_page);
 	if (ret) {
@@ -517,7 +520,7 @@  static int __sgx_encl_add_page(struct sgx_encl *encl,
 
 static int sgx_encl_add_page(struct sgx_encl *encl, unsigned long addr,
 			     void *data, struct sgx_secinfo *secinfo,
-			     unsigned int mrmask)
+			     unsigned int mrmask, unsigned long prot)
 {
 	u64 page_type = secinfo->flags & SGX_SECINFO_PAGE_TYPE_MASK;
 	struct sgx_encl_page *encl_page;
@@ -543,7 +546,7 @@  static int sgx_encl_add_page(struct sgx_encl *encl, unsigned long addr,
 		goto out;
 	}
 
-	encl_page = sgx_encl_page_alloc(encl, addr);
+	encl_page = sgx_encl_page_alloc(encl, addr, prot);
 	if (IS_ERR(encl_page)) {
 		ret = PTR_ERR(encl_page);
 		goto out;
@@ -584,6 +587,7 @@  static long sgx_ioc_enclave_add_page(struct file *filep, void __user *arg)
 	struct sgx_enclave_add_page addp;
 	struct sgx_secinfo secinfo;
 	struct page *data_page;
+	unsigned long prot;
 	void *data;
 	int ret;
 
@@ -605,7 +609,10 @@  static long sgx_ioc_enclave_add_page(struct file *filep, void __user *arg)
 		goto out;
 	}
 
-	ret = sgx_encl_add_page(encl, addp.addr, data, &secinfo, addp.mrmask);
+	prot = addp.prot & (PROT_READ | PROT_WRITE | PROT_EXEC);
+
+	ret = sgx_encl_add_page(encl, addp.addr, data, &secinfo, addp.mrmask,
+				prot);
 	if (ret)
 		goto out;
 
diff --git a/arch/x86/kernel/cpu/sgx/driver/main.c b/arch/x86/kernel/cpu/sgx/driver/main.c
index 29384cdd0842..dabfe2a7245a 100644
--- a/arch/x86/kernel/cpu/sgx/driver/main.c
+++ b/arch/x86/kernel/cpu/sgx/driver/main.c
@@ -93,15 +93,64 @@  static long sgx_compat_ioctl(struct file *filep, unsigned int cmd,
 }
 #endif
 
+/*
+ * Returns the AND of VM_{READ,WRITE,EXEC} permissions across all pages
+ * covered by the specific VMA.  A non-existent (or yet to be added) enclave
+ * page is considered to have no RWX permissions, i.e. is inaccessible.
+ */
+static unsigned long sgx_allowed_rwx(struct sgx_encl *encl,
+				     struct vm_area_struct *vma)
+{
+	unsigned long allowed_rwx = VM_READ | VM_WRITE | VM_EXEC;
+	unsigned long idx, idx_start, idx_end;
+	struct sgx_encl_page *page;
+
+	idx_start = PFN_DOWN(vma->vm_start);
+	idx_end = PFN_DOWN(vma->vm_end - 1);
+
+	for (idx = idx_start; idx <= idx_end; ++idx) {
+		/*
+		 * No need to take encl->lock, vm_prot_bits is set prior to
+		 * insertion and never changes, and racing with adding pages is
+		 * a userspace bug.
+		 */
+		rcu_read_lock();
+		page = radix_tree_lookup(&encl->page_tree, idx);
+		rcu_read_unlock();
+
+		/* Do not allow R|W|X to a non-existent page. */
+		if (!page)
+			allowed_rwx = 0;
+		else
+			allowed_rwx &= page->vm_prot_bits;
+		if (!allowed_rwx)
+			break;
+	}
+
+	return allowed_rwx;
+}
+
 static int sgx_mmap(struct file *file, struct vm_area_struct *vma)
 {
 	struct sgx_encl *encl = file->private_data;
+	unsigned long allowed_rwx;
 	int ret;
 
+	allowed_rwx = sgx_allowed_rwx(encl, vma);
+	if (vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC) & ~allowed_rwx)
+		return -EACCES;
+
 	ret = sgx_encl_mm_add(encl, vma->vm_mm);
 	if (ret)
 		return ret;
 
+	if (!(allowed_rwx & VM_READ))
+		vma->vm_flags &= ~VM_MAYREAD;
+	if (!(allowed_rwx & VM_WRITE))
+		vma->vm_flags &= ~VM_MAYWRITE;
+	if (!(allowed_rwx & VM_EXEC))
+		vma->vm_flags &= ~VM_MAYEXEC;
+
 	vma->vm_ops = &sgx_vm_ops;
 	vma->vm_flags |= VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP | VM_IO;
 	vma->vm_private_data = encl;
diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
index 0904b3c20ed0..5ad018c8d74c 100644
--- a/arch/x86/kernel/cpu/sgx/encl.h
+++ b/arch/x86/kernel/cpu/sgx/encl.h
@@ -43,6 +43,7 @@  enum sgx_encl_page_desc {
 
 struct sgx_encl_page {
 	unsigned long desc;
+	unsigned long vm_prot_bits;
 	struct sgx_epc_page *epc_page;
 	struct sgx_va_page *va_page;
 	struct sgx_encl *encl;
diff --git a/tools/testing/selftests/x86/sgx/main.c b/tools/testing/selftests/x86/sgx/main.c
index e2265f841fb0..77e93f8e8a59 100644
--- a/tools/testing/selftests/x86/sgx/main.c
+++ b/tools/testing/selftests/x86/sgx/main.c
@@ -2,6 +2,7 @@ 
 // Copyright(c) 2016-18 Intel Corporation.
 
 #include <elf.h>
+#include <errno.h>
 #include <fcntl.h>
 #include <stdbool.h>
 #include <stdio.h>
@@ -18,6 +19,8 @@ 
 #include "../../../../../arch/x86/kernel/cpu/sgx/arch.h"
 #include "../../../../../arch/x86/include/uapi/asm/sgx.h"
 
+#define PAGE_SIZE  4096
+
 static const uint64_t MAGIC = 0x1122334455667788ULL;
 
 struct vdso_symtab {
@@ -135,8 +138,7 @@  static bool encl_create(int dev_fd, unsigned long bin_size,
 	for (secs->size = 4096; secs->size < bin_size; )
 		secs->size <<= 1;
 
-	base = mmap(NULL, secs->size, PROT_READ | PROT_WRITE | PROT_EXEC,
-		    MAP_SHARED, dev_fd, 0);
+	base = mmap(NULL, secs->size, PROT_NONE, MAP_SHARED, dev_fd, 0);
 	if (base == MAP_FAILED) {
 		perror("mmap");
 		return false;
@@ -147,7 +149,7 @@  static bool encl_create(int dev_fd, unsigned long bin_size,
 	ioc.src = (unsigned long)secs;
 	rc = ioctl(dev_fd, SGX_IOC_ENCLAVE_CREATE, &ioc);
 	if (rc) {
-		fprintf(stderr, "ECREATE failed rc=%d.\n", rc);
+		fprintf(stderr, "ECREATE failed rc=%d, err=%d.\n", rc, errno);
 		munmap(base, secs->size);
 		return false;
 	}
@@ -160,8 +162,14 @@  static bool encl_add_page(int dev_fd, unsigned long addr, void *data,
 {
 	struct sgx_enclave_add_page ioc;
 	struct sgx_secinfo secinfo;
+	unsigned long prot;
 	int rc;
 
+	if (flags == SGX_SECINFO_TCS)
+		prot = PROT_READ | PROT_WRITE;
+	else
+		prot = PROT_READ | PROT_WRITE | PROT_EXEC;
+
 	memset(&secinfo, 0, sizeof(secinfo));
 	secinfo.flags = flags;
 
@@ -169,6 +177,7 @@  static bool encl_add_page(int dev_fd, unsigned long addr, void *data,
 	ioc.mrmask = 0xFFFF;
 	ioc.addr = addr;
 	ioc.src = (uint64_t)data;
+	ioc.prot = prot;
 
 	rc = ioctl(dev_fd, SGX_IOC_ENCLAVE_ADD_PAGE, &ioc);
 	if (rc) {
@@ -184,6 +193,7 @@  static bool encl_load(struct sgx_secs *secs, unsigned long bin_size)
 	struct sgx_enclave_init ioc;
 	uint64_t offset;
 	uint64_t flags;
+	void *addr;
 	int dev_fd;
 	int rc;
 
@@ -215,6 +225,22 @@  static bool encl_load(struct sgx_secs *secs, unsigned long bin_size)
 		goto out_map;
 	}
 
+	addr = mmap((void *)secs->base, PAGE_SIZE, PROT_READ | PROT_WRITE,
+		    MAP_SHARED | MAP_FIXED, dev_fd, 0);
+	if (addr == MAP_FAILED) {
+		fprintf(stderr, "mmap() failed on TCS, errno=%d.\n", errno);
+		return false;
+	}
+
+	addr = mmap((void *)(secs->base + PAGE_SIZE), bin_size - PAGE_SIZE,
+		    PROT_READ | PROT_WRITE | PROT_EXEC,
+		    MAP_SHARED | MAP_FIXED, dev_fd, 0);
+	if (addr == MAP_FAILED) {
+		fprintf(stderr, "mmap() failed, errno=%d.\n", errno);
+		return false;
+	}
+
+
 	close(dev_fd);
 	return true;
 out_map: