diff mbox series

[v41,10/24] mm: Add 'mprotect' hook to struct vm_operations_struct

Message ID 20201112220135.165028-11-jarkko@kernel.org (mailing list archive)
State New, archived
Headers show
Series Intel SGX foundations | expand

Commit Message

Jarkko Sakkinen Nov. 12, 2020, 10:01 p.m. UTC
From: Sean Christopherson <sean.j.christopherson@intel.com>

Background
==========

1. SGX enclave pages are populated with data by copying from normal memory
   via ioctl() (SGX_IOC_ENCLAVE_ADD_PAGES), which will be added later in
   this series.
2. It is desirable to be able to restrict those normal memory data sources.
   For instance, to ensure that the source data is executable before
   copying data to an executable enclave page.
3. Enclave page permissions are dynamic (just like normal permissions) and
   can be adjusted at runtime with mprotect().

This creates a problem because the original data source may have long since
vanished at the time when enclave page permissions are established (mmap()
or mprotect()).

The solution (elsewhere in this series) is to force enclaves creators to
declare their paging permission *intent* up front to the ioctl().  This
intent can be immediately compared to the source data’s mapping and
rejected if necessary.

The “intent” is also stashed off for later comparison with enclave
PTEs. This ensures that any future mmap()/mprotect() operations
performed by the enclave creator or done on behalf of the enclave
can be compared with the earlier declared permissions.

Problem
=======

There is an existing mmap() hook which allows SGX to perform this
permission comparison at mmap() time.  However, there is no corresponding
->mprotect() hook.

Solution
========

Add a vm_ops->mprotect() hook so that mprotect() operations which are
inconsistent with any page's stashed intent can be rejected by the driver.

Cc: linux-mm@kvack.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Jethro Beekman <jethro@fortanix.com> # v40
Acked-by: Dave Hansen <dave.hansen@intel.com> # v40
# Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
---
Changes from v40:
* Move mprotect_fixup() outside of the callback. This keeps mm internals
  outside of the callback.

 include/linux/mm.h | 7 +++++++
 mm/mprotect.c      | 7 +++++++
 2 files changed, 14 insertions(+)

Comments

Mel Gorman Nov. 13, 2020, 10:25 a.m. UTC | #1
On Fri, Nov 13, 2020 at 12:01:21AM +0200, Jarkko Sakkinen wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
> 
> Background
> ==========
> 
> 1. SGX enclave pages are populated with data by copying from normal memory
>    via ioctl() (SGX_IOC_ENCLAVE_ADD_PAGES), which will be added later in
>    this series.
> 2. It is desirable to be able to restrict those normal memory data sources.
>    For instance, to ensure that the source data is executable before
>    copying data to an executable enclave page.
> 3. Enclave page permissions are dynamic (just like normal permissions) and
>    can be adjusted at runtime with mprotect().
> 
> This creates a problem because the original data source may have long since
> vanished at the time when enclave page permissions are established (mmap()
> or mprotect()).
> 
> The solution (elsewhere in this series) is to force enclaves creators to
> declare their paging permission *intent* up front to the ioctl().  This
> intent can be immediately compared to the source data???s mapping and
> rejected if necessary.
> 
> The ???intent??? is also stashed off for later comparison with enclave
> PTEs. This ensures that any future mmap()/mprotect() operations
> performed by the enclave creator or done on behalf of the enclave
> can be compared with the earlier declared permissions.
> 
> Problem
> =======
> 
> There is an existing mmap() hook which allows SGX to perform this
> permission comparison at mmap() time.  However, there is no corresponding
> ->mprotect() hook.
> 
> Solution
> ========
> 
> Add a vm_ops->mprotect() hook so that mprotect() operations which are
> inconsistent with any page's stashed intent can be rejected by the driver.
> 
> Cc: linux-mm@kvack.org
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Jethro Beekman <jethro@fortanix.com> # v40
> Acked-by: Dave Hansen <dave.hansen@intel.com> # v40
> # Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Co-developed-by: Jarkko Sakkinen <jarkko@kernel.org>
> Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>

Acked-by: Mel Gorman <mgorman@techsingularity.net>
Dr. Greg Nov. 15, 2020, 5:08 p.m. UTC | #2
On Fri, Nov 13, 2020 at 12:01:21AM +0200, Jarkko Sakkinen wrote:

Good morning, I hope the weekend is going well for everyone.

> From: Sean Christopherson <sean.j.christopherson@intel.com>

We wish Sean well in whatever new avocation he has chosen.

> Background
> ==========
> 
> 1. SGX enclave pages are populated with data by copying from normal memory
>    via ioctl() (SGX_IOC_ENCLAVE_ADD_PAGES), which will be added later in
>    this series.
> 2. It is desirable to be able to restrict those normal memory data sources.
>    For instance, to ensure that the source data is executable before
>    copying data to an executable enclave page.
> 3. Enclave page permissions are dynamic (just like normal permissions) and
>    can be adjusted at runtime with mprotect().
>
> This creates a problem because the original data source may have long since
> vanished at the time when enclave page permissions are established (mmap()
> or mprotect()).
> 
> The solution (elsewhere in this series) is to force enclaves creators to
> declare their paging permission *intent* up front to the ioctl().  This
> intent can be immediately compared to the source data???s mapping and
> rejected if necessary.
> 
> The ???intent??? is also stashed off for later comparison with enclave
> PTEs. This ensures that any future mmap()/mprotect() operations
> performed by the enclave creator or done on behalf of the enclave
> can be compared with the earlier declared permissions.

The new mprotect hook in vm_operations_struct is indeed useful, as I
will demonstrate in a subsequent patch for consideration.  However,
the officially stated intent of this version of the driver is to
implement SGX1 semantics even on hardware (SGX2) that implements the
instructions needed for Enclave Dynamic Memory Management (EDMM).

As a result, at this stage of the driver's development, the
implementation that walks the page permission intents can be
substituted with a simple denial of mmap and mprotect on an
initialized enclave.  With this prohibition in place, the hardware
itself will enforce the page permission intents that were encoded when
the enclave was loaded, thus making the subsequent scan irrelevant.

The following patch implements this functionality.

Dr. Greg

---------------------------------------------------------------------------
Subject: [PATCH] Unconditionally block permission changes on enclave memory.

In SGX there are two levels of memory protection, the classic
page table mechanism and SGX hardware based page protections
that are codified in the Enclave Page Cache Metadata.  A
successful memory access requires that both mechanisms agree
that the access is permitted.

In the initial implementation of SGX (SGX1), the page permissions
are immutable after the enclave is initialized.  Even if classic
page protections are modified via mprotect, any attempt to access
enclave memory with alternative permissions will be blocked.

One of the architectural changes implemented in the second
generation of SGX (SGX2) is the ability for page access
permissions to be dynamically manipulated after the enclave is
initialized.  This requires coordination between trusted code
running in the enclave and untrusted code using mprotect and
special ring-0 instructions.

One of the security threats associated with SGX2 hardware is that
enclave based code can conspire with its untrusted runtime to make
executable enclave memory writable.  This provides the opportunity for
completely anonymous code execution that the operating system has no
visibility into.

All that is needed to, simply, close this vulnerability is to
eliminate the ability to call mprotect or mmap against the virtual
memory range of an enclave after it is initialized.  Any
permission changes made prior to initialization that are inconsistent
with the permissions codified in the enclave will cause initialization
or execution of the enclave to fail.

Tested-by: Dr. Greg <greg@enjellic.com>
Signed-off-by: Dr. Greg <greg@enjellic.com>
---
 arch/x86/kernel/cpu/sgx/encl.c | 50 +++++++++-------------------------
 1 file changed, 13 insertions(+), 37 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index 5551c7d36483..3bd770fbfc32 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -212,27 +212,25 @@ static void sgx_vma_open(struct vm_area_struct *vma)
  * @end:		upper bound of the address range, exclusive
  * @vm_flags:		VMA flags
  *
- * Iterate through the enclave pages contained within [@start, @end) to verify
- * that the permissions requested by a subset of {VM_READ, VM_WRITE, VM_EXEC}
- * does not contain any permissions that are not contained in the build time
- * permissions of any of the enclave pages within the given address range.
+ * This function provides a method for determining whether or not mmap
+ * or mprotect can be invoked called on any pages in the virtual
+ * address range of an enclave.
  *
- * An enclave creator must declare the strongest permissions that will be
- * needed for each enclave page  This ensures that mappings  have the identical
- * or weaker permissions that the earlier declared permissions.
+ * Since this driver is not designed to support Enclave Dynamic Memory
+ * Management (EDMM), any attempts to modify enclave memory map after
+ * the enclave is initialized are simply blocked.
+ *
+ * The function signature is left intact since future versions of the
+ * driver may implement verifications that the requested permission
+ * changes are consistent with the desire of the enclave author.
  *
  * Return: 0 on success, -EACCES otherwise
  */
 int sgx_encl_may_map(struct sgx_encl *encl, unsigned long start,
 		     unsigned long end, unsigned long vm_flags)
 {
-	unsigned long vm_prot_bits = vm_flags & (VM_READ | VM_WRITE | VM_EXEC);
-	struct sgx_encl_page *page;
-	unsigned long count = 0;
 	int ret = 0;
 
-	XA_STATE(xas, &encl->page_array, PFN_DOWN(start));
-
 	/*
 	 * Disallow READ_IMPLIES_EXEC tasks as their VMA permissions might
 	 * conflict with the enclave page permissions.
@@ -240,31 +238,9 @@ int sgx_encl_may_map(struct sgx_encl *encl, unsigned long start,
 	if (current->personality & READ_IMPLIES_EXEC)
 		return -EACCES;
 
-	mutex_lock(&encl->lock);
-	xas_lock(&xas);
-	xas_for_each(&xas, page, PFN_DOWN(end - 1)) {
-		if (!page)
-			break;
-
-		if (~page->vm_max_prot_bits & vm_prot_bits) {
-			ret = -EACCES;
-			break;
-		}
-
-		/* Reschedule on every XA_CHECK_SCHED iteration. */
-		if (!(++count % XA_CHECK_SCHED)) {
-			xas_pause(&xas);
-			xas_unlock(&xas);
-			mutex_unlock(&encl->lock);
-
-			cond_resched();
-
-			mutex_lock(&encl->lock);
-			xas_lock(&xas);
-		}
-	}
-	xas_unlock(&xas);
-	mutex_unlock(&encl->lock);
+	/* Disallow mapping on an initialized enclave. */
+	if (test_bit(SGX_ENCL_INITIALIZED, &encl->flags))
+		ret = -EACCES;
 
 	return ret;
 }
Matthew Wilcox (Oracle) Nov. 15, 2020, 5:32 p.m. UTC | #3
On Fri, Nov 13, 2020 at 12:01:21AM +0200, Jarkko Sakkinen wrote:
> +++ b/include/linux/mm.h
> @@ -559,6 +559,13 @@ struct vm_operations_struct {
>  	void (*close)(struct vm_area_struct * area);
>  	int (*split)(struct vm_area_struct * area, unsigned long addr);
>  	int (*mremap)(struct vm_area_struct * area);
> +	/*
> +	 * Called by mprotect() to make driver-specific permission
> +	 * checks before mprotect() is finalised.   The VMA must not
> +	 * be modified.  Returns 0 if eprotect() can proceed.
> +	 */

This is the wrong place for this documentation, and it's absurdly
specific to your implementation.  It should be in
Documentation/filesystems/locking.rst.

> +	int (*mprotect)(struct vm_area_struct *vma, unsigned long start,
> +			unsigned long end, unsigned long newflags);
> +
> +		if (vma->vm_ops && vma->vm_ops->mprotect)
> +			error = vma->vm_ops->mprotect(vma, nstart, tmp, newflags);
> +		if (error)
> +			goto out;
> +
>  		error = mprotect_fixup(vma, &prev, nstart, tmp, newflags);
>  		if (error)
>  			goto out;
> +
>  		nstart = tmp;

Spurious newline added.
Dave Hansen Nov. 15, 2020, 6:36 p.m. UTC | #4
On 11/15/20 9:32 AM, Matthew Wilcox wrote:
> On Fri, Nov 13, 2020 at 12:01:21AM +0200, Jarkko Sakkinen wrote:
>> +++ b/include/linux/mm.h
>> @@ -559,6 +559,13 @@ struct vm_operations_struct {
>>  	void (*close)(struct vm_area_struct * area);
>>  	int (*split)(struct vm_area_struct * area, unsigned long addr);
>>  	int (*mremap)(struct vm_area_struct * area);
>> +	/*
>> +	 * Called by mprotect() to make driver-specific permission
>> +	 * checks before mprotect() is finalised.   The VMA must not
>> +	 * be modified.  Returns 0 if eprotect() can proceed.
>> +	 */
> 
> This is the wrong place for this documentation, and it's absurdly
> specific to your implementation.  It should be in
> Documentation/filesystems/locking.rst.

I'll let you and Mel duke that one out:

> https://lore.kernel.org/linux-mm/20201106100409.GD3371@techsingularity.net/

As for 'locking.rst', I didn't even know that was there.  It's also a
_bit_ silly that we would depend on folks finding that for code like SGX
that has nothing to do with filesystems.  If we expect folks to update
that, we need a comment spelling that out in the struct.

The comment also _looks_ reasonably generic to me.  None of the
parameters can be modified, so the impact on the core mm code must only
be from the result of the return code, even if the handler does some
other magic in addition to plain checks.

For one thing, I guess we could take the literal mention of "driver"
out.  vm_operations are certainly supplied by code that isn't a "driver"
per se.

>> +	int (*mprotect)(struct vm_area_struct *vma, unsigned long start,
>> +			unsigned long end, unsigned long newflags);
>> +
>> +		if (vma->vm_ops && vma->vm_ops->mprotect)
>> +			error = vma->vm_ops->mprotect(vma, nstart, tmp, newflags);
>> +		if (error)
>> +			goto out;
>> +
>>  		error = mprotect_fixup(vma, &prev, nstart, tmp, newflags);
>>  		if (error)
>>  			goto out;
>> +
>>  		nstart = tmp;
> 
> Spurious newline added.

I don't actually think this is spurious.  It ends up grouping the code
into the logical chunks with the ->mprotect() and goto in one chunk and
mprotect_fixup() and its goto in another.  Without this newline, the two
logical parts don't look separate.

It's kinda hard to see with just the diff context, though.
Mel Gorman Nov. 16, 2020, 10:09 a.m. UTC | #5
On Sun, Nov 15, 2020 at 10:36:51AM -0800, Dave Hansen wrote:
> On 11/15/20 9:32 AM, Matthew Wilcox wrote:
> > On Fri, Nov 13, 2020 at 12:01:21AM +0200, Jarkko Sakkinen wrote:
> >> +++ b/include/linux/mm.h
> >> @@ -559,6 +559,13 @@ struct vm_operations_struct {
> >>  	void (*close)(struct vm_area_struct * area);
> >>  	int (*split)(struct vm_area_struct * area, unsigned long addr);
> >>  	int (*mremap)(struct vm_area_struct * area);
> >> +	/*
> >> +	 * Called by mprotect() to make driver-specific permission
> >> +	 * checks before mprotect() is finalised.   The VMA must not
> >> +	 * be modified.  Returns 0 if eprotect() can proceed.
> >> +	 */
> > 
> > This is the wrong place for this documentation, and it's absurdly
> > specific to your implementation.  It should be in
> > Documentation/filesystems/locking.rst.
> 
> I'll let you and Mel duke that one out:
> 

I suggested placing the comment there to make it clear what the expected
semantics of the hook was to reduce the chances of abuse or surprises. The
hook does not affect locking so Documentation/filesystems/locking.rst
didn't appear appropriate other than maybe adding a note there
that it doesn't affect locks. The hook also is not expecting any
filesystems-specific action that I aware of but a note could be added to
the effect that filesystems should not need to take special action for it.
Protections on the filesystem level are for the inode, I can't imagine what
a filesystem would do with a protection change on the page table level
but maybe I'm not particularly imaginative today.
Jarkko Sakkinen Nov. 17, 2020, 6:16 p.m. UTC | #6
On Fri, Nov 13, 2020 at 10:25:43AM +0000, Mel Gorman wrote:
> On Fri, Nov 13, 2020 at 12:01:21AM +0200, Jarkko Sakkinen wrote:
> > From: Sean Christopherson <sean.j.christopherson@intel.com>
> > 
> > Background
> > ==========
> > 
> > 1. SGX enclave pages are populated with data by copying from normal memory
> >    via ioctl() (SGX_IOC_ENCLAVE_ADD_PAGES), which will be added later in
> >    this series.
> > 2. It is desirable to be able to restrict those normal memory data sources.
> >    For instance, to ensure that the source data is executable before
> >    copying data to an executable enclave page.
> > 3. Enclave page permissions are dynamic (just like normal permissions) and
> >    can be adjusted at runtime with mprotect().
> > 
> > This creates a problem because the original data source may have long since
> > vanished at the time when enclave page permissions are established (mmap()
> > or mprotect()).
> > 
> > The solution (elsewhere in this series) is to force enclaves creators to
> > declare their paging permission *intent* up front to the ioctl().  This
> > intent can be immediately compared to the source data???s mapping and
> > rejected if necessary.
> > 
> > The ???intent??? is also stashed off for later comparison with enclave
> > PTEs. This ensures that any future mmap()/mprotect() operations
> > performed by the enclave creator or done on behalf of the enclave
> > can be compared with the earlier declared permissions.
> > 
> > Problem
> > =======
> > 
> > There is an existing mmap() hook which allows SGX to perform this
> > permission comparison at mmap() time.  However, there is no corresponding
> > ->mprotect() hook.
> > 
> > Solution
> > ========
> > 
> > Add a vm_ops->mprotect() hook so that mprotect() operations which are
> > inconsistent with any page's stashed intent can be rejected by the driver.
> > 
> > Cc: linux-mm@kvack.org
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Matthew Wilcox <willy@infradead.org>
> > Cc: Mel Gorman <mgorman@techsingularity.net>
> > Acked-by: Jethro Beekman <jethro@fortanix.com> # v40
> > Acked-by: Dave Hansen <dave.hansen@intel.com> # v40
> > # Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> > Co-developed-by: Jarkko Sakkinen <jarkko@kernel.org>
> > Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
> 
> Acked-by: Mel Gorman <mgorman@techsingularity.net>

Thank you.

/Jarkko
Jarkko Sakkinen Nov. 17, 2020, 7:15 p.m. UTC | #7
On Mon, Nov 16, 2020 at 10:09:57AM +0000, Mel Gorman wrote:
> On Sun, Nov 15, 2020 at 10:36:51AM -0800, Dave Hansen wrote:
> > On 11/15/20 9:32 AM, Matthew Wilcox wrote:
> > > On Fri, Nov 13, 2020 at 12:01:21AM +0200, Jarkko Sakkinen wrote:
> > >> +++ b/include/linux/mm.h
> > >> @@ -559,6 +559,13 @@ struct vm_operations_struct {
> > >>  	void (*close)(struct vm_area_struct * area);
> > >>  	int (*split)(struct vm_area_struct * area, unsigned long addr);
> > >>  	int (*mremap)(struct vm_area_struct * area);
> > >> +	/*
> > >> +	 * Called by mprotect() to make driver-specific permission
> > >> +	 * checks before mprotect() is finalised.   The VMA must not
> > >> +	 * be modified.  Returns 0 if eprotect() can proceed.
> > >> +	 */

Wonder if this should also document the negative case for the return
value, i.e. -EACCES is returned otherwise.

> > > 
> > > This is the wrong place for this documentation, and it's absurdly
> > > specific to your implementation.  It should be in
> > > Documentation/filesystems/locking.rst.
> > 
> > I'll let you and Mel duke that one out:
> > 
> 
> I suggested placing the comment there to make it clear what the expected
> semantics of the hook was to reduce the chances of abuse or surprises. The
> hook does not affect locking so Documentation/filesystems/locking.rst
> didn't appear appropriate other than maybe adding a note there
> that it doesn't affect locks. The hook also is not expecting any
> filesystems-specific action that I aware of but a note could be added to
> the effect that filesystems should not need to take special action for it.
> Protections on the filesystem level are for the inode, I can't imagine what
> a filesystem would do with a protection change on the page table level
> but maybe I'm not particularly imaginative today.

I try to decipher this in generic context.

In a permission check of a filesystem, truncated pages should be
encapsulated in to the permission decision. It's a just a query.

So maybe I'll add something like:

"This callback does only a permission query, and thus does never
return locked pages."

> -- 
> Mel Gorman
> SUSE Labs

/Jarkko
diff mbox series

Patch

diff --git a/include/linux/mm.h b/include/linux/mm.h
index db6ae4d3fb4e..1813fa86b981 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -559,6 +559,13 @@  struct vm_operations_struct {
 	void (*close)(struct vm_area_struct * area);
 	int (*split)(struct vm_area_struct * area, unsigned long addr);
 	int (*mremap)(struct vm_area_struct * area);
+	/*
+	 * Called by mprotect() to make driver-specific permission
+	 * checks before mprotect() is finalised.   The VMA must not
+	 * be modified.  Returns 0 if eprotect() can proceed.
+	 */
+	int (*mprotect)(struct vm_area_struct *vma, unsigned long start,
+			unsigned long end, unsigned long newflags);
 	vm_fault_t (*fault)(struct vm_fault *vmf);
 	vm_fault_t (*huge_fault)(struct vm_fault *vmf,
 			enum page_entry_size pe_size);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 56c02beb6041..ab709023e9aa 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -616,9 +616,16 @@  static int do_mprotect_pkey(unsigned long start, size_t len,
 		tmp = vma->vm_end;
 		if (tmp > end)
 			tmp = end;
+
+		if (vma->vm_ops && vma->vm_ops->mprotect)
+			error = vma->vm_ops->mprotect(vma, nstart, tmp, newflags);
+		if (error)
+			goto out;
+
 		error = mprotect_fixup(vma, &prev, nstart, tmp, newflags);
 		if (error)
 			goto out;
+
 		nstart = tmp;
 
 		if (nstart < prev->vm_end)