[V2] x86/sgx: Fix free page accounting

Message ID b2e69e9febcae5d98d331de094d9cc7ce3217e66.1636487172.git.reinette.chatre@intel.com (mailing list archive)
State New, archived
Series [V2] x86/sgx: Fix free page accounting

Commit Message

Reinette Chatre Nov. 9, 2021, 8 p.m. UTC
The SGX driver maintains a single global free page counter,
sgx_nr_free_pages, that reflects the number of free pages available
across all NUMA nodes. Correspondingly, a list of free pages is
associated with each NUMA node and sgx_nr_free_pages is updated
every time a page is added or removed from any of the free page
lists. The main usage of sgx_nr_free_pages is by the reclaimer
that will run when it (sgx_nr_free_pages) goes below a watermark
to ensure that there are always some free pages available to, for
example, support efficient page faults.

With sgx_nr_free_pages accessed and modified from a few places
it is essential to ensure that these accesses are done safely but
this is not the case. sgx_nr_free_pages is read without any
protection and updated with inconsistent protection by any one
of the spin locks associated with the individual NUMA nodes.
For example:

      CPU_A                                 CPU_B
      -----                                 -----
 spin_lock(&nodeA->lock);              spin_lock(&nodeB->lock);
 ...                                   ...
 sgx_nr_free_pages--;  /* NOT SAFE */  sgx_nr_free_pages--;

 spin_unlock(&nodeA->lock);            spin_unlock(&nodeB->lock);
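
The decrement of sgx_nr_free_pages is a plain read-modify-write, so the
two critical sections above can interleave. For illustration (pseudo-code
only, not the actual kernel code):

 CPU_A: tmp = sgx_nr_free_pages;      /* reads, e.g., 100 */
 CPU_B: tmp = sgx_nr_free_pages;      /* also reads 100   */
 CPU_A: sgx_nr_free_pages = tmp - 1;  /* stores 99        */
 CPU_B: sgx_nr_free_pages = tmp - 1;  /* stores 99 again, one decrement is lost */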

The consequence of sgx_nr_free_pages not being protected is that
its value may not accurately reflect the actual number of free
pages on the system, impacting the availability of free pages in
support of many flows. The problematic scenario is when the
reclaimer does not run because it believes there to be sufficient
free pages while any attempt to allocate a page fails because there
are no free pages available.

The worst scenario observed was a user space hang because of
repeated page faults caused by no free pages made available.

The following flow was encountered:
asm_exc_page_fault
 ...
   sgx_vma_fault()
     sgx_encl_load_page()
       sgx_encl_eldu() // Encrypted page needs to be loaded from backing
                       // storage into newly allocated SGX memory page
         sgx_alloc_epc_page() // Allocate a page of SGX memory
           __sgx_alloc_epc_page() // Fails, no free SGX memory
           ...
           if (sgx_should_reclaim(SGX_NR_LOW_PAGES)) // Wake reclaimer
             wake_up(&ksgxd_waitq);
           return -EBUSY; // Return -EBUSY giving reclaimer time to run
       return -EBUSY;
     return -EBUSY;
   return VM_FAULT_NOPAGE;

The reclaimer is triggered in above flow with the following code:

static bool sgx_should_reclaim(unsigned long watermark)
{
        return sgx_nr_free_pages < watermark &&
               !list_empty(&sgx_active_page_list);
}

In the problematic scenario there were no free pages available yet the
value of sgx_nr_free_pages was above the watermark. The allocation of
SGX memory thus always failed because of a lack of free pages while no
free pages were made available because the reclaimer is never started
because of sgx_nr_free_pages' incorrect value. The consequence was that
user space kept encountering VM_FAULT_NOPAGE that caused the same
address to be accessed repeatedly with the same result.

Change the global free page counter to an atomic type that
ensures simultaneous updates are done safely. While doing so, move
the updating of the variable outside of the spin lock critical
section to which it does not belong.

Cc: stable@vger.kernel.org
Fixes: 901ddbb9ecf5 ("x86/sgx: Add a basic NUMA allocation scheme to sgx_alloc_epc_page()")
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---

Changes since V1:
- V1:
  https://lore.kernel.org/lkml/373992d869cd356ce9e9afe43ef4934b70d604fd.1636049678.git.reinette.chatre@intel.com/
- Add static to definition of sgx_nr_free_pages (Tony).
- Add Tony's signature.
- Provide detail about error scenario in changelog (Jarkko).

 arch/x86/kernel/cpu/sgx/main.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

Comments

Jarkko Sakkinen Nov. 10, 2021, 3:11 p.m. UTC | #1
On Tue, 2021-11-09 at 12:00 -0800, Reinette Chatre wrote:
> The SGX driver maintains a single global free page counter,
> sgx_nr_free_pages, that reflects the number of free pages available
> across all NUMA nodes. Correspondingly, a list of free pages is
> associated with each NUMA node and sgx_nr_free_pages is updated
> every time a page is added or removed from any of the free page
> lists. The main usage of sgx_nr_free_pages is by the reclaimer
> that will run when it (sgx_nr_free_pages) goes below a watermark
> to ensure that there are always some free pages available to, for
> example, support efficient page faults.
> 
> With sgx_nr_free_pages accessed and modified from a few places
> it is essential to ensure that these accesses are done safely but
> this is not the case. sgx_nr_free_pages is read without any
> protection and updated with inconsistent protection by any one
> of the spin locks associated with the individual NUMA nodes.
> For example:
> 
>       CPU_A                                 CPU_B
>       -----                                 -----
>  spin_lock(&nodeA->lock);              spin_lock(&nodeB->lock);
>  ...                                   ...
>  sgx_nr_free_pages--;  /* NOT SAFE */  sgx_nr_free_pages--;
> 
>  spin_unlock(&nodeA->lock);            spin_unlock(&nodeB->lock);

I don't understand the "NOT SAFE" part here. It is safe to access
the variable, even when it is not atomic...

I don't understand what the sequence above should tell me.

> The consequence of sgx_nr_free_pages not being protected is that
> its value may not accurately reflect the actual number of free
> pages on the system, impacting the availability of free pages in
> support of many flows. The problematic scenario is when the

The non-atomicity is not a problem, when it is not a problem :-)

> reclaimer does not run because it believes there to be sufficient
> free pages while any attempt to allocate a page fails because there
> are no free pages available.
> 
> The worst scenario observed was a user space hang because of
> repeated page faults caused by no free pages made available.
> 
> The following flow was encountered:
> asm_exc_page_fault
>  ...
>    sgx_vma_fault()
>      sgx_encl_load_page()
>        sgx_encl_eldu() // Encrypted page needs to be loaded from backing
>                        // storage into newly allocated SGX memory page
>          sgx_alloc_epc_page() // Allocate a page of SGX memory
>            __sgx_alloc_epc_page() // Fails, no free SGX memory
>            ...
>            if (sgx_should_reclaim(SGX_NR_LOW_PAGES)) // Wake reclaimer
>              wake_up(&ksgxd_waitq);
>            return -EBUSY; // Return -EBUSY giving reclaimer time to run
>        return -EBUSY;
>      return -EBUSY;
>    return VM_FAULT_NOPAGE;
> 
> The reclaimer is triggered in above flow with the following code:
> 
> static bool sgx_should_reclaim(unsigned long watermark)
> {
>         return sgx_nr_free_pages < watermark &&
>                !list_empty(&sgx_active_page_list);
> }
> 
> In the problematic scenario there were no free pages available yet the
> value of sgx_nr_free_pages was above the watermark. The allocation of
> SGX memory thus always failed because of a lack of free pages while no
> free pages were made available because the reclaimer is never started
> because of sgx_nr_free_pages' incorrect value. The consequence was that
> user space kept encountering VM_FAULT_NOPAGE that caused the same
> address to be accessed repeatedly with the same result.

That causes sgx_should_reclaim() to be executed multiple times as the
fault is retried. Eventually it should be successful.

> Change the global free page counter to an atomic type that
> ensures simultaneous updates are done safely. While doing so, move
> the updating of the variable outside of the spin lock critical
> section to which it does not belong.
> 
> Cc: stable@vger.kernel.org
> Fixes: 901ddbb9ecf5 ("x86/sgx: Add a basic NUMA allocation scheme to sgx_alloc_epc_page()")
> Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
> Reviewed-by: Tony Luck <tony.luck@intel.com>
> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>

I'm not yet sure if this is a bug, even if it'd be an improvement
to the performance.

/Jarkko
Reinette Chatre Nov. 10, 2021, 6:51 p.m. UTC | #2
Hi Jarkko,

On 11/10/2021 7:11 AM, Jarkko Sakkinen wrote:
> On Tue, 2021-11-09 at 12:00 -0800, Reinette Chatre wrote:
>> The SGX driver maintains a single global free page counter,
>> sgx_nr_free_pages, that reflects the number of free pages available
>> across all NUMA nodes. Correspondingly, a list of free pages is
>> associated with each NUMA node and sgx_nr_free_pages is updated
>> every time a page is added or removed from any of the free page
>> lists. The main usage of sgx_nr_free_pages is by the reclaimer
>> that will run when it (sgx_nr_free_pages) goes below a watermark
>> to ensure that there are always some free pages available to, for
>> example, support efficient page faults.
>>
>> With sgx_nr_free_pages accessed and modified from a few places
>> it is essential to ensure that these accesses are done safely but
>> this is not the case. sgx_nr_free_pages is read without any
>> protection and updated with inconsistent protection by any one
>> of the spin locks associated with the individual NUMA nodes.
>> For example:
>>
>>        CPU_A                                 CPU_B
>>        -----                                 -----
>>   spin_lock(&nodeA->lock);              spin_lock(&nodeB->lock);
>>   ...                                   ...
>>   sgx_nr_free_pages--;  /* NOT SAFE */  sgx_nr_free_pages--;
>>
>>   spin_unlock(&nodeA->lock);            spin_unlock(&nodeB->lock);
> 
> I don't understand the "NOT SAFE" part here. It is safe to access
> the variable, even when it is not atomic...

The "NOT SAFE" part is because in the above example (that reflects the 
current code behavior) the updates to sgx_nr_free_pages is "protected" 
by two _different_ spin locks and thus not actually protected.

> I don't understand what the sequence above should tell me.
> 
>> The consequence of sgx_nr_free_pages not being protected is that
>> its value may not accurately reflect the actual number of free
>> pages on the system, impacting the availability of free pages in
>> support of many flows. The problematic scenario is when the
> 
> The non-atomicity is not a problem, when it is not a problem :-)
> 
>> reclaimer does not run because it believes there to be sufficient
>> free pages while any attempt to allocate a page fails because there
>> are no free pages available.
>>
>> The worst scenario observed was a user space hang because of
>> repeated page faults caused by no free pages made available.
>>
>> The following flow was encountered:
>> asm_exc_page_fault
>>   ...
>>     sgx_vma_fault()
>>       sgx_encl_load_page()
>>         sgx_encl_eldu() // Encrypted page needs to be loaded from backing
>>                         // storage into newly allocated SGX memory page
>>           sgx_alloc_epc_page() // Allocate a page of SGX memory
>>             __sgx_alloc_epc_page() // Fails, no free SGX memory
>>             ...
>>             if (sgx_should_reclaim(SGX_NR_LOW_PAGES)) // Wake reclaimer
>>               wake_up(&ksgxd_waitq);
>>             return -EBUSY; // Return -EBUSY giving reclaimer time to run
>>         return -EBUSY;
>>       return -EBUSY;
>>     return VM_FAULT_NOPAGE;
>>
>> The reclaimer is triggered in above flow with the following code:
>>
>> static bool sgx_should_reclaim(unsigned long watermark)
>> {
>>          return sgx_nr_free_pages < watermark &&
>>                 !list_empty(&sgx_active_page_list);
>> }
>>
>> In the problematic scenario there were no free pages available yet the
>> value of sgx_nr_free_pages was above the watermark. The allocation of
>> SGX memory thus always failed because of a lack of free pages while no
>> free pages were made available because the reclaimer is never started
>> because of sgx_nr_free_pages' incorrect value. The consequence was that
>> user space kept encountering VM_FAULT_NOPAGE that caused the same
>> address to be accessed repeatedly with the same result.
> 
> That causes sgx_should_reclaim() to be executed multiple times as the
> fault is retried. Eventually it should be successful.


sgx_should_reclaim() would only succeed when sgx_nr_free_pages goes 
below the watermark. Once sgx_nr_free_pages becomes corrupted there is 
no clear way in which it can correct itself since it is only ever 
incremented or decremented.

It may indeed be possible for the reclaimer to eventually get a chance
to run with a corrupted sgx_nr_free_pages, if the value is not off by
more than SGX_NR_LOW_PAGES and enough free pages remain for the
reclaimer to be woken before memory is fully depleted. Unfortunately, as
in the scenario I encountered, it is also possible for the free pages to
be depleted while sgx_nr_free_pages is still above the watermark, and in
that case there is no way for the reclaimer to ever run.

On the system I tested with there were two nodes with about 64GB of SGX
memory per node and the test created an enclave that consumed all memory
across both nodes. The test then accessed all this memory three times:
once to change the type of each page, once for a read access to each
page from within the enclave, and once to remove the page. That many
accesses combined with the unsafe updating of sgx_nr_free_pages turned
out to be enough to trigger the scenario where sgx_nr_free_pages has a
value that is off by more than SGX_NR_LOW_PAGES (which is just 32).
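
If it helps, the same kind of drift can be reproduced with a small
self-contained userspace analogue (an illustration only, not part of the
patch and not kernel code): two threads decrement a plain long without
any shared protection and decrements are routinely lost, just as they
are for sgx_nr_free_pages:

#include <pthread.h>
#include <stdio.h>

/* volatile only so each decrement stays a real load/store for the demo */
static volatile long counter = 2000000;	/* stands in for sgx_nr_free_pages */

static void *worker(void *arg)
{
	(void)arg;
	for (int i = 0; i < 1000000; i++)
		counter--;		/* non-atomic read-modify-write */
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, worker, NULL);
	pthread_create(&b, NULL, worker, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);

	/* Expected 0; typically prints a positive value: lost decrements. */
	printf("counter = %ld\n", counter);
	return 0;
}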

>> Change the global free page counter to an atomic type that
>> ensures simultaneous updates are done safely. While doing so, move
>> the updating of the variable outside of the spin lock critical
>> section to which it does not belong.
>>
>> Cc: stable@vger.kernel.org
>> Fixes: 901ddbb9ecf5 ("x86/sgx: Add a basic NUMA allocation scheme to sgx_alloc_epc_page()")
>> Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
>> Reviewed-by: Tony Luck <tony.luck@intel.com>
>> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
> 
> I'm not yet sure if this is a bug, even if it'd be an improvement
> to the performance.

Please let me know what additional data would convince you. The traces I
provided earlier show that without this patch the system spends almost
100% of its time in the page fault handler while the user space
application hangs. With this patch the traces show that the system
divides its time between the page fault handler and the reclaimer, and
the user space application is able to complete.

Reinette
Luck, Tony Nov. 10, 2021, 7:16 p.m. UTC | #3
>>> The consequence of sgx_nr_free_pages not being protected is that
>>> its value may not accurately reflect the actual number of free
>>> pages on the system, impacting the availability of free pages in
>>> support of many flows. The problematic scenario is when the
>> 
> > The non-atomicity is not a problem, when it is not a problem :-)

This is most definitely a problem.

start with sgx_nr_free_pages == 100

Now have a CPU on node0 allocate a page at the same time as another
CPU on node1 allocates a page. Each holds the relevant per-node lock,
but that doesn't stop both CPUs from accessing the global together:

	CPU on node0		CPU on node1
	sgx_nr_free_pages--;		sgx_nr_free_pages--;

What is the value of sgx_nr_free_pages now? We want it to be 98,
but it could be 99.

Rinse, repeat thousands of times. Eventually the value of sgx_nr_free_pages
has not even a close connection to the number of free pages.

-Tony
Jarkko Sakkinen Nov. 11, 2021, 2:55 a.m. UTC | #4
On Wed, 2021-11-10 at 10:51 -0800, Reinette Chatre wrote:
> sgx_should_reclaim() would only succeed when sgx_nr_free_pages goes 
> below the watermark. Once sgx_nr_free_pages becomes corrupted there is 
> no clear way in which it can correct itself since it is only ever 
> incremented or decremented.

So one scenario would be:

1. CPU A does a READ of sgx_nr_free_pages.
2. CPU B does a READ of sgx_nr_free_pages.
3. CPU A does a STORE of sgx_nr_free_pages.
4. CPU B does a STORE of sgx_nr_free_pages.

?

That does corrupt the value, yes, but I don't see anything like this
in the commit message, so I'll have to check.

I think the commit message is lacking a concurrency scenario, and the
current transcripts are a bit useless.

/Jarkko
Jarkko Sakkinen Nov. 11, 2021, 2:56 a.m. UTC | #5
On Wed, 2021-11-10 at 19:16 +0000, Luck, Tony wrote:
> > > > The consequence of sgx_nr_free_pages not being protected is that
> > > > its value may not accurately reflect the actual number of free
> > > > pages on the system, impacting the availability of free pages in
> > > > support of many flows. The problematic scenario is when the
> > > 
> > > The non-atomicity is not a problem, when it is not a problem :-)
> 
> This is most definitely a problem.
> 
> start with sgx_nr_free_pages == 100
> 
> Now have a CPU on node0 allocate a page at the same time as another
> CPU on node1 allocates a page. Each holds the relevant per-node lock,
> but that doesn't stop both CPUs from accessing the global together:
> 
>         CPU on node0            CPU on node1
>         sgx_nr_free_pages--;            sgx_nr_free_pages--;
> 
> What is the value of sgx_nr_free_pages now? We want it to be 98,
> but it could be 99.
> 
> Rinse, repeat thousands of times. Eventually the value of sgx_nr_free_pages
> has not even a close connection to the number of free pages.
> 
> -Tony

Yeah, so I figured this (see my follow-up response to Reinette), but
such a description is lacking from the commit message.

/Jarkko
Luck, Tony Nov. 11, 2021, 3:26 a.m. UTC | #6
On Thu, Nov 11, 2021 at 04:55:14AM +0200, Jarkko Sakkinen wrote:
> On Wed, 2021-11-10 at 10:51 -0800, Reinette Chatre wrote:
> > sgx_should_reclaim() would only succeed when sgx_nr_free_pages goes 
> > below the watermark. Once sgx_nr_free_pages becomes corrupted there is 
> > no clear way in which it can correct itself since it is only ever 
> > incremented or decremented.
> 
> So one scenario would be:
> 
> 1. CPU A does a READ of sgx_nr_free_pages.
> 2. CPU B does a READ of sgx_nr_free_pages.
> 3. CPU A does a STORE of sgx_nr_free_pages.
> 4. CPU B does a STORE of sgx_nr_free_pages.
> 
> ?
> 
> That does corrupt the value, yes, but I don't see anything like this
> in the commit message, so I'll have to check.
> 
> I think the commit message is lacking a concurrency scenario, and the
> current transcripts are a bit useless.

What about this part:

	With sgx_nr_free_pages accessed and modified from a few places
	it is essential to ensure that these accesses are done safely but
	this is not the case. sgx_nr_free_pages is read without any
	protection and updated with inconsistent protection by any one
	of the spin locks associated with the individual NUMA nodes.
	For example:

	      CPU_A                                 CPU_B
	      -----                                 -----
	 spin_lock(&nodeA->lock);              spin_lock(&nodeB->lock);
	 ...                                   ...
	 sgx_nr_free_pages--;  /* NOT SAFE */  sgx_nr_free_pages--;

	 spin_unlock(&nodeA->lock);            spin_unlock(&nodeB->lock);

Maybe you missed the "NOT SAFE" hidden in the middle of
the picture?

-Tony
Jarkko Sakkinen Nov. 11, 2021, 3:50 a.m. UTC | #7
On Wed, 2021-11-10 at 19:26 -0800, Luck, Tony wrote:
> On Thu, Nov 11, 2021 at 04:55:14AM +0200, Jarkko Sakkinen wrote:
> > On Wed, 2021-11-10 at 10:51 -0800, Reinette Chatre wrote:
> > > sgx_should_reclaim() would only succeed when sgx_nr_free_pages goes 
> > > below the watermark. Once sgx_nr_free_pages becomes corrupted there is 
> > > no clear way in which it can correct itself since it is only ever 
> > > incremented or decremented.
> > 
> > So one scenario would be:
> > 
> > 1. CPU A does a READ of sgx_nr_free_pages.
> > 2. CPU B does a READ of sgx_nr_free_pages.
> > 3. CPU A does a STORE of sgx_nr_free_pages.
> > 4. CPU B does a STORE of sgx_nr_free_pages.
> > 
> > ?
> > 
> > That does corrupt the value, yes, but I don't see anything like this
> > in the commit message, so I'll have to check.
> > 
> > I think the commit message is lacking a concurrency scenario, and the
> > current transcripts are a bit useless.
> 
> What about this part:
> 
>         With sgx_nr_free_pages accessed and modified from a few places
>         it is essential to ensure that these accesses are done safely but
>         this is not the case. sgx_nr_free_pages is read without any
>         protection and updated with inconsistent protection by any one
>         of the spin locks associated with the individual NUMA nodes.
>         For example:
> 
>               CPU_A                                 CPU_B
>               -----                                 -----
>          spin_lock(&nodeA->lock);              spin_lock(&nodeB->lock);
>          ...                                   ...
>          sgx_nr_free_pages--;  /* NOT SAFE */  sgx_nr_free_pages--;
> 
>          spin_unlock(&nodeA->lock);            spin_unlock(&nodeB->lock);
> 
> Maybe you missed the "NOT SAFE" hidden in the middle of
> the picture?
> 
> -Tony

For me the ordering is not clear from that. E.g. compare to
https://www.kernel.org/doc/Documentation/memory-barriers.txt

/Jarkko
Dave Hansen Nov. 11, 2021, 4:01 a.m. UTC | #8
On 11/10/21 7:50 PM, Jarkko Sakkinen wrote:
>>               CPU_A                                 CPU_B
>>               -----                                 -----
>>          spin_lock(&nodeA->lock);              spin_lock(&nodeB->lock);
>>          ...                                   ...
>>          sgx_nr_free_pages--;  /* NOT SAFE */  sgx_nr_free_pages--;
>>
>>          spin_unlock(&nodeA->lock);            spin_unlock(&nodeB->lock);
>>
>> Maybe you missed the "NOT SAFE" hidden in the middle of
>> the picture?
>>
>> -Tony
> For me the ordering is not clear from that. E.g. compare to
> https://www.kernel.org/doc/Documentation/memory-barriers.txt

Jarkko,

Reinette's explanation looks great to me.  Something "protected" by two
different locks is not protected at all.  I don't think we need to fret
over this too much.

We don't need memory barriers or anything fancy at all to explain this.

Patch

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 63d3de02bbcc..8471a8b9b48e 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -28,8 +28,7 @@  static DECLARE_WAIT_QUEUE_HEAD(ksgxd_waitq);
 static LIST_HEAD(sgx_active_page_list);
 static DEFINE_SPINLOCK(sgx_reclaimer_lock);
 
-/* The free page list lock protected variables prepend the lock. */
-static unsigned long sgx_nr_free_pages;
+static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
 
 /* Nodes with one or more EPC sections. */
 static nodemask_t sgx_numa_mask;
@@ -403,14 +402,15 @@  static void sgx_reclaim_pages(void)
 
 		spin_lock(&node->lock);
 		list_add_tail(&epc_page->list, &node->free_page_list);
-		sgx_nr_free_pages++;
 		spin_unlock(&node->lock);
+		atomic_long_inc(&sgx_nr_free_pages);
 	}
 }
 
 static bool sgx_should_reclaim(unsigned long watermark)
 {
-	return sgx_nr_free_pages < watermark && !list_empty(&sgx_active_page_list);
+	return atomic_long_read(&sgx_nr_free_pages) < watermark &&
+	       !list_empty(&sgx_active_page_list);
 }
 
 static int ksgxd(void *p)
@@ -471,9 +471,9 @@  static struct sgx_epc_page *__sgx_alloc_epc_page_from_node(int nid)
 
 	page = list_first_entry(&node->free_page_list, struct sgx_epc_page, list);
 	list_del_init(&page->list);
-	sgx_nr_free_pages--;
 
 	spin_unlock(&node->lock);
+	atomic_long_dec(&sgx_nr_free_pages);
 
 	return page;
 }
@@ -625,9 +625,9 @@  void sgx_free_epc_page(struct sgx_epc_page *page)
 	spin_lock(&node->lock);
 
 	list_add_tail(&page->list, &node->free_page_list);
-	sgx_nr_free_pages++;
 
 	spin_unlock(&node->lock);
+	atomic_long_inc(&sgx_nr_free_pages);
 }
 
 static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
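
The change follows the standard pattern of taking an unprotected shared
counter and making every access to it atomic. Below is a minimal
standalone sketch of that pattern, using C11 atomics as a userspace
stand-in and made-up function names; the kernel code above uses
atomic_long_t with atomic_long_inc(), atomic_long_dec() and
atomic_long_read() instead:

#include <stdatomic.h>
#include <stdbool.h>

static atomic_long nr_free_pages = 0;	/* stands in for sgx_nr_free_pages */

static void note_page_freed(void)
{
	/* list manipulation would stay under the per-node lock ... */
	atomic_fetch_add(&nr_free_pages, 1);	/* ... the counter no longer needs it */
}

static void note_page_allocated(void)
{
	atomic_fetch_sub(&nr_free_pages, 1);
}

static bool should_reclaim(long watermark)
{
	/* concurrent updates can no longer be lost, so this check is reliable */
	return atomic_load(&nr_free_pages) < watermark;
}

int main(void)
{
	note_page_freed();
	note_page_allocated();
	return should_reclaim(32) ? 1 : 0;
}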