diff mbox series

[v3,2/2] x86/sgx: Resolve EREMOVE page vs EAUG page data race

Message ID 20240517110631.3441817-3-dmitrii.kuvaiskii@intel.com (mailing list archive)
State New, archived
Headers show
Series x86/sgx: Fix two data races in EAUG/EREMOVE flows | expand

Commit Message

Dmitrii Kuvaiskii May 17, 2024, 11:06 a.m. UTC
Two enclave threads may try to add and remove the same enclave page
simultaneously (e.g., if the SGX runtime supports both lazy allocation
and MADV_DONTNEED semantics). Consider some enclave page added to the
enclave. User space decides to temporarily remove this page (e.g.,
emulating the MADV_DONTNEED semantics) on CPU1. At the same time, user
space performs a memory access on the same page on CPU2, which results
in a #PF and ultimately in sgx_vma_fault(). Scenario proceeds as
follows:

/*
 * CPU1: User space performs
 * ioctl(SGX_IOC_ENCLAVE_REMOVE_PAGES)
 * on enclave page X
 */
sgx_encl_remove_pages() {

  mutex_lock(&encl->lock);

  entry = sgx_encl_load_page(encl);
  /*
   * verify that page is
   * trimmed and accepted
   */

  mutex_unlock(&encl->lock);

  /*
   * remove PTE entry; cannot
   * be performed under lock
   */
  sgx_zap_enclave_ptes(encl);
                                 /*
                                  * Fault on CPU2 on same page X
                                  */
                                 sgx_vma_fault() {
                                   /*
                                    * PTE entry was removed, but the
                                    * page is still in enclave's xarray
                                    */
                                   xa_load(&encl->page_array) != NULL ->
                                   /*
                                    * SGX driver thinks that this page
                                    * was swapped out and loads it
                                    */
                                   mutex_lock(&encl->lock);
                                   /*
                                    * this is effectively a no-op
                                    */
                                   entry = sgx_encl_load_page_in_vma();
                                   /*
                                    * add PTE entry
                                    *
                                    * *BUG*: a PTE is installed for a
                                    * page in process of being removed
                                    */
                                   vmf_insert_pfn(...);

                                   mutex_unlock(&encl->lock);
                                   return VM_FAULT_NOPAGE;
                                 }
  /*
   * continue with page removal
   */
  mutex_lock(&encl->lock);

  sgx_encl_free_epc_page(epc_page) {
    /*
     * remove page via EREMOVE
     */
    /*
     * free EPC page
     */
    sgx_free_epc_page(epc_page);
  }

  xa_erase(&encl->page_array);

  mutex_unlock(&encl->lock);
}

Here, CPU1 removed the page. However CPU2 installed the PTE entry on the
same page. This enclave page becomes perpetually inaccessible (until
another SGX_IOC_ENCLAVE_REMOVE_PAGES ioctl). This is because the page is
marked accessible in the PTE entry but is not EAUGed, and any subsequent
access to this page raises a fault: with the kernel believing there to
be a valid VMA, the unlikely error code X86_PF_SGX encountered by code
path do_user_addr_fault() -> access_error() causes the SGX driver's
sgx_vma_fault() to be skipped and user space receives a SIGSEGV instead.
The userspace SIGSEGV handler cannot perform EACCEPT because the page
was not EAUGed. Thus, the user space is stuck with the inaccessible
page.

Fix this race by forcing the fault handler on CPU2 to back off if the
page is currently being removed (on CPU1). This is achieved by
introducing a new flag SGX_ENCL_PAGE_BEING_REMOVED, which is unset by
default and set only right-before the first mutex_unlock() in
sgx_encl_remove_pages(). Upon loading the page, CPU2 checks whether this
page is being removed, and if yes then CPU2 backs off and waits until
the page is completely removed. After that, any memory access to this
page results in a normal "allocate and EAUG a page on #PF" flow.

Fixes: 9849bb27152c ("x86/sgx: Support complete page removal")
Cc: stable@vger.kernel.org
Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Reviewed-by: Haitao Huang <haitao.huang@linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Acked-by: Reinette Chatre <reinette.chatre@intel.com>
---
 arch/x86/kernel/cpu/sgx/encl.c  | 3 ++-
 arch/x86/kernel/cpu/sgx/encl.h  | 3 +++
 arch/x86/kernel/cpu/sgx/ioctl.c | 1 +
 3 files changed, 6 insertions(+), 1 deletion(-)

Comments

Dave Hansen May 28, 2024, 4:23 p.m. UTC | #1
On 5/17/24 04:06, Dmitrii Kuvaiskii wrote:
...

First, why is SGX so special here?  How is the SGX problem different
than what the core mm code does?

> --- a/arch/x86/kernel/cpu/sgx/encl.h
> +++ b/arch/x86/kernel/cpu/sgx/encl.h
> @@ -25,6 +25,9 @@
>  /* 'desc' bit marking that the page is being reclaimed. */
>  #define SGX_ENCL_PAGE_BEING_RECLAIMED	BIT(3)
>  
> +/* 'desc' bit marking that the page is being removed. */
> +#define SGX_ENCL_PAGE_BEING_REMOVED	BIT(2)

Second, convince me that this _needs_ a new bit.  Why can't we just have
a bit that effectively means "return EBUSY if you see this bit when
handling a fault".

>  struct sgx_encl_page {
>  	unsigned long desc;
>  	unsigned long vm_max_prot_bits:8;
> diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
> index 5d390df21440..de59219ae794 100644
> --- a/arch/x86/kernel/cpu/sgx/ioctl.c
> +++ b/arch/x86/kernel/cpu/sgx/ioctl.c
> @@ -1142,6 +1142,7 @@ static long sgx_encl_remove_pages(struct sgx_encl *encl,
>  		 * Do not keep encl->lock because of dependency on
>  		 * mmap_lock acquired in sgx_zap_enclave_ptes().
>  		 */
> +		entry->desc |= SGX_ENCL_PAGE_BEING_REMOVED;

This also needs a comment, no matter what.
Haitao Huang June 3, 2024, 6:42 p.m. UTC | #2
On Tue, 28 May 2024 11:23:13 -0500, Dave Hansen <dave.hansen@intel.com>  
wrote:

> On 5/17/24 04:06, Dmitrii Kuvaiskii wrote:
> ...
>
> First, why is SGX so special here?  How is the SGX problem different
> than what the core mm code does?
>
>> --- a/arch/x86/kernel/cpu/sgx/encl.h
>> +++ b/arch/x86/kernel/cpu/sgx/encl.h
>> @@ -25,6 +25,9 @@
>>  /* 'desc' bit marking that the page is being reclaimed. */
>>  #define SGX_ENCL_PAGE_BEING_RECLAIMED	BIT(3)
>>
>> +/* 'desc' bit marking that the page is being removed. */
>> +#define SGX_ENCL_PAGE_BEING_REMOVED	BIT(2)
>
> Second, convince me that this _needs_ a new bit.  Why can't we just have
> a bit that effectively means "return EBUSY if you see this bit when
> handling a fault".
>

IIUC, reclaimer_writing_to_pcmd() also uses SGX_ENCL_PAGE_BEING_RECLAIMED  
to check if a page is about being reclaimed in order to prevent its VA  
slot fro being freed. So I think we do need separate bit for EREMOVE which  
does not write to VA slot?

BR
Haitao
Dmitrii Kuvaiskii June 7, 2024, 5:21 p.m. UTC | #3
On Tue, May 28, 2024 at 09:23:13AM -0700, Dave Hansen wrote:
> On 5/17/24 04:06, Dmitrii Kuvaiskii wrote:
> ...
>
> First, why is SGX so special here?  How is the SGX problem different
> than what the core mm code does?

Here is my understanding why SGX is so special and why I have to introduce
a new bit SGX_ENCL_PAGE_BEING_REMOVED.

In SGX's removal of the enclave page, two operations must happen
atomically: the PTE entry must be removed and the page must be EREMOVE'd.

Generally, to guarantee atomicity, encl->lock is acquired. Ideally, if
this encl->lock could be acquired at the beginning of
sgx_encl_remove_pages() and be released at the very end of this function,
there would be no EREMOVE page vs EAUG page data race, and my bug fix
(with SGX_ENCL_PAGE_BEING_REMOVED bit) wouldn't be needed.

However, the current implementation of sgx_encl_remove_pages() has to
release encl->lock before removing the PTE entry. Releasing the lock is
required because the function that removes the PTE entry --
sgx_zap_enclave_ptes() -- acquires another, enclave-MM lock:
mmap_read_lock(encl_mm->mm).

The two locks must be taken in this order:
1. mmap_read_lock(encl_mm->mm)
2. mutex_lock(&encl->lock)

This lock order is apparent from e.g. sgx_encl_add_page(). This order also
seems to make intuitive sense: VMA callbacks are called with the MM lock
being held, so the MM lock should be the first in lock order.

So, if sgx_encl_remove_pages() would _not_ release encl->lock before
calling sgx_zap_enclave_ptes(), this would violate the lock order and
might lead to deadlocks. At the same time, releasing encl->lock in the
middle of the two-operations flow leads to a data race that I found in
this patch series.

Quick summary:
- Removing the enclave page requires two operations: removing the PTE and
  performing EREMOVE.
- The complete flow of removing the enclave page cannot be protected by a
  single encl->lock, because it would violate the lock order and would
  lead to deadlocks.
- The current upstream implementation thus breaks the flow into two
  critical sections, releasing encl->lock before sgx_zap_enclave_ptes()
  and re-acquiring this lock afterwards. This leads to a data race.
- My patch restores "atomicity" of the flow by introducing a new flag
  SGX_ENCL_PAGE_BEING_REMOVED.

> > --- a/arch/x86/kernel/cpu/sgx/encl.h
> > +++ b/arch/x86/kernel/cpu/sgx/encl.h
> > @@ -25,6 +25,9 @@
> >  /* 'desc' bit marking that the page is being reclaimed. */
> >  #define SGX_ENCL_PAGE_BEING_RECLAIMED  BIT(3)
> >
> > +/* 'desc' bit marking that the page is being removed. */
> > +#define SGX_ENCL_PAGE_BEING_REMOVED    BIT(2)
>
> Second, convince me that this _needs_ a new bit.  Why can't we just have
> a bit that effectively means "return EBUSY if you see this bit when
> handling a fault".

As Haitao mentioned in his reply, the bit SGX_ENCL_PAGE_BEING_RECLAIMED is
also used in reclaimer_writing_to_pcmd(). If we would re-use this bit to
mark a page being removed, reclaimer_writing_to_pcmd() would incorrectly
return 1, meaning that the reclaimer is about to write to the PCMD page,
which is not true.

> >  struct sgx_encl_page {
> >     unsigned long desc;
> >     unsigned long vm_max_prot_bits:8;
> > diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
> > index 5d390df21440..de59219ae794 100644
> > --- a/arch/x86/kernel/cpu/sgx/ioctl.c
> > +++ b/arch/x86/kernel/cpu/sgx/ioctl.c
> > @@ -1142,6 +1142,7 @@ static long sgx_encl_remove_pages(struct sgx_encl *encl,
> >          * Do not keep encl->lock because of dependency on
> >          * mmap_lock acquired in sgx_zap_enclave_ptes().
> >          */
> > +       entry->desc |= SGX_ENCL_PAGE_BEING_REMOVED;
>
> This also needs a comment, no matter what.

Ok, I will write something along the lines that we want to prevent a data
race with an EAUG flow, and since we have to release encl->lock (which
would otherwise prevent the data race) we instead set a bit to mark this
enclave page as being in the process of removal, so that the EAUG flow
backs off and retries later.

--
Dmitrii Kuvaiskii
Dave Hansen June 7, 2024, 5:43 p.m. UTC | #4
On 6/3/24 11:42, Haitao Huang wrote:
>> Second, convince me that this _needs_ a new bit.  Why can't we just have
>> a bit that effectively means "return EBUSY if you see this bit when
>> handling a fault".
> 
> IIUC, reclaimer_writing_to_pcmd() also uses
> SGX_ENCL_PAGE_BEING_RECLAIMED to check if a page is about being
> reclaimed in order to prevent its VA slot fro being freed. So I think we
> do need separate bit for EREMOVE which does not write to VA slot?

I think the bits should be centered around what action the code needs to
take and not what is being done to the page.

Right now, SGX_ENCL_PAGE_BEING_RECLAIMED has two logical meanings:

 1. Don't load the page
 2. The page is in the backing store

But now folks are suggesting that a new bit is added which means "do #1,
but not #2".

Let's take a step back and look at what logical outcomes we want in the
code and then create the bits based on _that_.
diff mbox series

Patch

diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index 41f14b1a3025..7ccd8b2fce5f 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -257,7 +257,8 @@  static struct sgx_encl_page *__sgx_encl_load_page(struct sgx_encl *encl,
 
 	/* Entry successfully located. */
 	if (entry->epc_page) {
-		if (entry->desc & SGX_ENCL_PAGE_BEING_RECLAIMED)
+		if (entry->desc & (SGX_ENCL_PAGE_BEING_RECLAIMED |
+				   SGX_ENCL_PAGE_BEING_REMOVED))
 			return ERR_PTR(-EBUSY);
 
 		return entry;
diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
index f94ff14c9486..fff5f2293ae7 100644
--- a/arch/x86/kernel/cpu/sgx/encl.h
+++ b/arch/x86/kernel/cpu/sgx/encl.h
@@ -25,6 +25,9 @@ 
 /* 'desc' bit marking that the page is being reclaimed. */
 #define SGX_ENCL_PAGE_BEING_RECLAIMED	BIT(3)
 
+/* 'desc' bit marking that the page is being removed. */
+#define SGX_ENCL_PAGE_BEING_REMOVED	BIT(2)
+
 struct sgx_encl_page {
 	unsigned long desc;
 	unsigned long vm_max_prot_bits:8;
diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
index 5d390df21440..de59219ae794 100644
--- a/arch/x86/kernel/cpu/sgx/ioctl.c
+++ b/arch/x86/kernel/cpu/sgx/ioctl.c
@@ -1142,6 +1142,7 @@  static long sgx_encl_remove_pages(struct sgx_encl *encl,
 		 * Do not keep encl->lock because of dependency on
 		 * mmap_lock acquired in sgx_zap_enclave_ptes().
 		 */
+		entry->desc |= SGX_ENCL_PAGE_BEING_REMOVED;
 		mutex_unlock(&encl->lock);
 
 		sgx_zap_enclave_ptes(encl, addr);