x86/sgx: Fix free page accounting

Message ID	373992d869cd356ce9e9afe43ef4934b70d604fd.1636049678.git.reinette.chatre@intel.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-sgx-owner@kernel.org> From: Reinette Chatre <reinette.chatre@intel.com> To: dave.hansen@linux.intel.com, jarkko@kernel.org, tglx@linutronix.de, bp@alien8.de, mingo@redhat.com, linux-sgx@vger.kernel.org, x86@kernel.org Cc: seanjc@google.com, tony.luck@intel.com, hpa@zytor.com, linux-kernel@vger.kernel.org, stable@vger.kernel.org Subject: [PATCH] x86/sgx: Fix free page accounting Date: Thu, 4 Nov 2021 11:28:54 -0700 Message-Id: <373992d869cd356ce9e9afe43ef4934b70d604fd.1636049678.git.reinette.chatre@intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	x86/sgx: Fix free page accounting \| expand x86/sgx: Fix free page accounting

Reinette Chatre Nov. 4, 2021, 6:28 p.m. UTC

The SGX driver maintains a single global free page counter,
sgx_nr_free_pages, that reflects the number of free pages available
across all NUMA nodes. Correspondingly, a list of free pages is
associated with each NUMA node and sgx_nr_free_pages is updated
every time a page is added or removed from any of the free page
lists. The main usage of sgx_nr_free_pages is by the reclaimer
that will run when the total free pages go below a watermark to
ensure that there are always some free pages available to, for
example, support efficient page faults.

With sgx_nr_free_pages accessed and modified from a few places
it is essential to ensure that these accesses are done safely but
this is not the case. sgx_nr_free_pages is sometimes accessed
without any protection and when it is protected it is done
inconsistently with any one of the spin locks associated with the
individual NUMA nodes.

The consequence of sgx_nr_free_pages not being protected is that
its value may not accurately reflect the actual number of free
pages on the system, impacting the availability of free pages in
support of many flows. The problematic scenario is when the
reclaimer never runs because it believes there to be sufficient
free pages while any attempt to allocate a page fails because there
are no free pages available. The worst scenario observed was a
user space hang because of repeated page faults caused by
no free pages ever made available.

Change the global free page counter to an atomic type that
ensures simultaneous updates are done safely. While doing so, move
the updating of the variable outside of the spin lock critical
section to which it does not belong.

Cc: stable@vger.kernel.org
Fixes: 901ddbb9ecf5 ("x86/sgx: Add a basic NUMA allocation scheme to sgx_alloc_epc_page()")
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---
 arch/x86/kernel/cpu/sgx/main.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

Luck, Tony Nov. 4, 2021, 6:36 p.m. UTC | #1

> -/* The free page list lock protected variables prepend the lock. */
> -static unsigned long sgx_nr_free_pages;
> +atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);

You accidentally lost the "static" here. This is still only accessed within this
one source file.

Otherwise:

Reviewed-by: Tony Luck <tony.luck@intel.com>

-Tony

Reinette Chatre Nov. 4, 2021, 6:44 p.m. UTC | #2

Hi Tony,

On 11/4/2021 11:36 AM, Luck, Tony wrote:
>> -/* The free page list lock protected variables prepend the lock. */
>> -static unsigned long sgx_nr_free_pages;
>> +atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
> 
> You accidentally lost the "static" here. This is still only accessed within this
> one source file.

Thank you very much for catching this. Will fix.

> 
> Otherwise:
> 
> Reviewed-by: Tony Luck <tony.luck@intel.com>
> 

Thank you very much.

Reinette

Greg Kroah-Hartman Nov. 4, 2021, 6:54 p.m. UTC | #3

On Thu, Nov 04, 2021 at 11:28:54AM -0700, Reinette Chatre wrote:
> The SGX driver maintains a single global free page counter,
> sgx_nr_free_pages, that reflects the number of free pages available
> across all NUMA nodes. Correspondingly, a list of free pages is
> associated with each NUMA node and sgx_nr_free_pages is updated
> every time a page is added or removed from any of the free page
> lists. The main usage of sgx_nr_free_pages is by the reclaimer
> that will run when the total free pages go below a watermark to
> ensure that there are always some free pages available to, for
> example, support efficient page faults.
> 
> With sgx_nr_free_pages accessed and modified from a few places
> it is essential to ensure that these accesses are done safely but
> this is not the case. sgx_nr_free_pages is sometimes accessed
> without any protection and when it is protected it is done
> inconsistently with any one of the spin locks associated with the
> individual NUMA nodes.
> 
> The consequence of sgx_nr_free_pages not being protected is that
> its value may not accurately reflect the actual number of free
> pages on the system, impacting the availability of free pages in
> support of many flows. The problematic scenario is when the
> reclaimer never runs because it believes there to be sufficient
> free pages while any attempt to allocate a page fails because there
> are no free pages available. The worst scenario observed was a
> user space hang because of repeated page faults caused by
> no free pages ever made available.
> 
> Change the global free page counter to an atomic type that
> ensures simultaneous updates are done safely. While doing so, move
> the updating of the variable outside of the spin lock critical
> section to which it does not belong.
> 
> Cc: stable@vger.kernel.org
> Fixes: 901ddbb9ecf5 ("x86/sgx: Add a basic NUMA allocation scheme to sgx_alloc_epc_page()")
> Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
> ---
>  arch/x86/kernel/cpu/sgx/main.c | 12 ++++++------
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> index 63d3de02bbcc..8558d7d5f3e7 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -28,8 +28,7 @@ static DECLARE_WAIT_QUEUE_HEAD(ksgxd_waitq);
>  static LIST_HEAD(sgx_active_page_list);
>  static DEFINE_SPINLOCK(sgx_reclaimer_lock);
>  
> -/* The free page list lock protected variables prepend the lock. */
> -static unsigned long sgx_nr_free_pages;
> +atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
>  
>  /* Nodes with one or more EPC sections. */
>  static nodemask_t sgx_numa_mask;
> @@ -403,14 +402,15 @@ static void sgx_reclaim_pages(void)
>  
>  		spin_lock(&node->lock);
>  		list_add_tail(&epc_page->list, &node->free_page_list);
> -		sgx_nr_free_pages++;
>  		spin_unlock(&node->lock);
> +		atomic_long_inc(&sgx_nr_free_pages);
>  	}
>  }
>  
>  static bool sgx_should_reclaim(unsigned long watermark)
>  {
> -	return sgx_nr_free_pages < watermark && !list_empty(&sgx_active_page_list);
> +	return atomic_long_read(&sgx_nr_free_pages) < watermark &&
> +	       !list_empty(&sgx_active_page_list);

What prevents the value from changing right after you test this?  Why is
an atomic value somehow solving the problem?

The value changes were happening safely, it was just the reading of the
value that was not.  You have not changed the fact that the value can
change right after reading given that there was not going to be a
problem with reading a stale value before.

In other words, what did you really fix here?  And how did you test it
to verify it did fix things?

thanks,

greg k-h

Dave Hansen Nov. 4, 2021, 7:04 p.m. UTC | #4

On 11/4/21 11:54 AM, Greg KH wrote:
>>  static bool sgx_should_reclaim(unsigned long watermark)
>>  {
>> -	return sgx_nr_free_pages < watermark && !list_empty(&sgx_active_page_list);
>> +	return atomic_long_read(&sgx_nr_free_pages) < watermark &&
>> +	       !list_empty(&sgx_active_page_list);
> What prevents the value from changing right after you test this?  Why is
> an atomic value somehow solving the problem?

Nothing.  It's fundamentally racy, and that's OK.

Just like the core VM, being under the watermark is an indication that
the kernel is low on pages (SGX pages in this case).  It means we should
try SGX reclaim.  Let's say there's a race and a bunch of pages are
freed.  Reclaim will run once iteration then stop.

We could make an argument that the sgx_reclaim_pages() loop should check
sgx_nr_free_pages in a few places to ensure it doesn't unnecessarily
reclaim too much memory.  That's something to look at, but it's beyond
the scope of this very simple bug fix.

Reinette Chatre Nov. 4, 2021, 8:57 p.m. UTC | #5

Hi Greg,

On 11/4/2021 11:54 AM, Greg KH wrote:
> On Thu, Nov 04, 2021 at 11:28:54AM -0700, Reinette Chatre wrote:
>> The SGX driver maintains a single global free page counter,
>> sgx_nr_free_pages, that reflects the number of free pages available
>> across all NUMA nodes. Correspondingly, a list of free pages is
>> associated with each NUMA node and sgx_nr_free_pages is updated
>> every time a page is added or removed from any of the free page
>> lists. The main usage of sgx_nr_free_pages is by the reclaimer
>> that will run when the total free pages go below a watermark to
>> ensure that there are always some free pages available to, for
>> example, support efficient page faults.
>>
>> With sgx_nr_free_pages accessed and modified from a few places
>> it is essential to ensure that these accesses are done safely but
>> this is not the case. sgx_nr_free_pages is sometimes accessed
>> without any protection and when it is protected it is done
>> inconsistently with any one of the spin locks associated with the
>> individual NUMA nodes.
>>
>> The consequence of sgx_nr_free_pages not being protected is that
>> its value may not accurately reflect the actual number of free
>> pages on the system, impacting the availability of free pages in
>> support of many flows. The problematic scenario is when the
>> reclaimer never runs because it believes there to be sufficient
>> free pages while any attempt to allocate a page fails because there
>> are no free pages available. The worst scenario observed was a
>> user space hang because of repeated page faults caused by
>> no free pages ever made available.
>>
>> Change the global free page counter to an atomic type that
>> ensures simultaneous updates are done safely. While doing so, move
>> the updating of the variable outside of the spin lock critical
>> section to which it does not belong.
>>
>> Cc: stable@vger.kernel.org
>> Fixes: 901ddbb9ecf5 ("x86/sgx: Add a basic NUMA allocation scheme to sgx_alloc_epc_page()")
>> Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
>> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
>> ---
>>   arch/x86/kernel/cpu/sgx/main.c | 12 ++++++------
>>   1 file changed, 6 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
>> index 63d3de02bbcc..8558d7d5f3e7 100644
>> --- a/arch/x86/kernel/cpu/sgx/main.c
>> +++ b/arch/x86/kernel/cpu/sgx/main.c
>> @@ -28,8 +28,7 @@ static DECLARE_WAIT_QUEUE_HEAD(ksgxd_waitq);
>>   static LIST_HEAD(sgx_active_page_list);
>>   static DEFINE_SPINLOCK(sgx_reclaimer_lock);
>>   
>> -/* The free page list lock protected variables prepend the lock. */
>> -static unsigned long sgx_nr_free_pages;
>> +atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
>>   
>>   /* Nodes with one or more EPC sections. */
>>   static nodemask_t sgx_numa_mask;
>> @@ -403,14 +402,15 @@ static void sgx_reclaim_pages(void)
>>   
>>   		spin_lock(&node->lock);
>>   		list_add_tail(&epc_page->list, &node->free_page_list);
>> -		sgx_nr_free_pages++;
>>   		spin_unlock(&node->lock);
>> +		atomic_long_inc(&sgx_nr_free_pages);
>>   	}
>>   }
>>   
>>   static bool sgx_should_reclaim(unsigned long watermark)
>>   {
>> -	return sgx_nr_free_pages < watermark && !list_empty(&sgx_active_page_list);
>> +	return atomic_long_read(&sgx_nr_free_pages) < watermark &&
>> +	       !list_empty(&sgx_active_page_list);
> 

Thank you very much for taking a look.

> What prevents the value from changing right after you test this?

You are correct. It is indeed possible for the value to change after 
this test. This test decides when to reclaim more pages so an absolute 
accurate value is not required but knowing that the amount of free pages 
are running low is.

>  Why is
> an atomic value somehow solving the problem?

During stress testing it was found that page allocation during hot path 
(for example page fault) is failing because there were no free pages 
available in any of the free page lists while the global counter 
contained a value that does not reflect the actual free pages and was 
higher than the watermark and thus the reclaimer is never run.

Changing it to atomic fixed this issue. I also reverted to how this 
counter was managed before with a spin lock protected free counter per 
free list and that also fixes the issue.

> 
> The value changes were happening safely, it was just the reading of the
> value that was not.  You have not changed the fact that the value can
> change right after reading given that there was not going to be a
> problem with reading a stale value before.

I am testing on a two socket system and I am seeing that the value of 
sgx_nr_free_pages does not accurately reflect the actual free pages.

It does not look to me like the value is updated safely as it is updated 
with inconsistent protection on a system like this. There is a spin lock 
associated with each NUMA node and which lock is used to update the 
variable depends on which NUMA node memory is being modified - it is not 
always protected with the same lock:

spin_lock(&node->lock);
sgx_nr_free_pages++;
spin_unlock(&node->lock);


> In other words, what did you really fix here?  And how did you test it
> to verify it did fix things?

To test this I created a stress test that builds on top of the new 
"oversubscription" test case that we are trying to add to the SGX 
kselftests:
https://lore.kernel.org/lkml/7715db4882ab9fd52d21de6f62bb3b7e94dc4885.1635447301.git.reinette.chatre@intel.com/

In the changed test an enclave is created with as much heap as SGX 
memory. After that all the pages are accessed, their type is changed, 
then the enclave is entered to run EACCEPT on each page, after that the 
pages are removed (EREMOVE).

This test places significant stress on the SGX memory allocation and 
reclaim subsystems. The troublesome part of the test is when the enclave 
is entered so that EACCEPT can be run on each page. During this time, 
because of the oversubscription and previous accesses, there are many 
page faults. During this time a new page needs to be allocated in the 
SGX memory into which the page being faulted needs to be decrypted and 
loaded back into SGX memory. At this point the test hangs.

Below I show the following:
* perf top showing that the test hangs because user space is stuck just 
encountering page faults
* below that I show the code traces explaining why the repeated page 
faults occur (because reclaimer never runs)
* below that shows the perf top traces after this patch is applied 
showing that the reclaimer now gets a chance to run and the test can 
complete


Here is the perf top trace before this patch is applied showing how user 
space is stuck hitting page faults over and over:
    PerfTop:    4569 irqs/sec  kernel:25.0%  exact: 100.0% lost: 0/0 
drop: 0/0 [4000Hz cycles],  (all, 224 CPUs)
------------------------------------------------------------------------------------------------------------------------------------------------------------

     94.34%    68.51%  [vdso]         [.] __vdso_sgx_enter_enclave
             |
             |--68.51%--0x5541f689495641d7
             |          __libc_start_main
             |          main
             |          test_harness_run
             |          __run_test
             | 
wrapper_enclave_unclobbered_vdso_oversubscribed_remove
             |          enclave_unclobbered_vdso_oversubscribed_remove
             |          __vdso_sgx_enter_enclave
             |          |
             |           --68.30%--asm_exc_page_fault
             |
              --7.84%--__vdso_sgx_enter_enclave
                        |
                         --2.58%--asm_exc_page_fault
                                   |
                                    --2.72%--exc_page_fault
                                              |
                                               --3.67%--do_user_addr_fault
                                                         |
 
--6.96%--handle_mm_fault
                                                                    |
 
|--4.37%--__handle_mm_fault
                                                                    | 
        |
                                                                    | 
         --1.65%--__do_fault
                                                                    | 
                   |
                                                                    | 
                    --2.66%--sgx_vma_fault
                                                                    | 
                              |
                                                                    | 
                              |--1.93%--xa_load
                                                                    | 
                              |          |
                                                                    | 
                              |           --1.85%--xas_load
                                                                    | 
                              |
                                                                    | 
                               --1.21%--sgx_encl_load_page
                                                                    |
 
--1.35%--__count_memcg_events

     85.55%     0.17%  [kernel]       [k] asm_exc_page_fault
             |
              --70.81%--asm_exc_page_fault
                        |
                         --2.73%--exc_page_fault
                                   |
                                    --3.71%--do_user_addr_fault
                                              |
                                               --7.00%--handle_mm_fault
                                                         |
 
|--4.42%--__handle_mm_fault
                                                         |          |
                                                         | 
--1.65%--__do_fault
                                                         | 
        |
                                                         | 
         --2.66%--sgx_vma_fault
                                                         | 
                   |
                                                         | 
                   |--1.93%--xa_load
                                                         | 
                   |          |
                                                         | 
                   |           --1.85%--xas_load
                                                         | 
                   |
                                                         | 
                    --1.21%--sgx_encl_load_page
                                                         |
 
--1.35%--__count_memcg_events

     16.83%     0.18%  [kernel]       [k] exc_page_fault
             |
              --2.57%--exc_page_fault
                        |
                         --3.69%--do_user_addr_fault
                                   |
                                    --7.00%--handle_mm_fault
                                              |
                                              |--4.42%--__handle_mm_fault
                                              |          |
                                              | 
--1.65%--__do_fault
                                              |                     |
                                              | 
--2.66%--sgx_vma_fault
                                              | 
        |
                                              | 
        |--1.93%--xa_load
                                              | 
        |          |
                                              | 
        |           --1.85%--xas_load
                                              | 
        |
                                              | 
         --1.21%--sgx_encl_load_page
                                              |
                                               --1.35%--__count_memcg_events

exiting.


What happens above is the following flow is encountered:
sgx_vma_fault()
   sgx_encl_load_page()
     sgx_encl_eldu() //page needs to be loaded from swap
       sgx_alloc_epc_page(..., false) // new EPC page is needed

Taking a closer look at sgx_alloc_epc_page:

struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
{
	struct sgx_epc_page *page;

	for ( ; ; ) {
		page = __sgx_alloc_epc_page(); <--- page == NULL because of no free pages
		if (!IS_ERR(page)) {
			page->owner = owner;
			break;
		}

		if (list_empty(&sgx_active_page_list)) <--- there are plenty of pages 
that can be reclaimed
			return ERR_PTR(-ENOMEM);

		if (!reclaim) { <--- reclaim is false so EBUSY will be returned but an 
attempt is made to wake the reclaimer below
			page = ERR_PTR(-EBUSY);
			break;
		}

		if (signal_pending(current)) {
			page = ERR_PTR(-ERESTARTSYS);
			break;
		}

		sgx_reclaim_pages();
		cond_resched();
	}

	if (sgx_should_reclaim(SGX_NR_LOW_PAGES)) <-- expected to wake up 
reclaimer but this is not happening
		wake_up(&ksgxd_waitq);

	return page;
}

Because the above returns EBUSY sgx_vma_fault() will return 
VM_FAULT_NOPAGE and user space will keep attempting to access the same 
page and trigger the same fault again because there still are no free pages.


static vm_fault_t sgx_vma_fault(struct vm_fault *vmf)
{

         ...
         entry = sgx_encl_load_page(encl, addr);
         if (IS_ERR(entry)) {
                 mutex_unlock(&encl->lock);
                 if (PTR_ERR(entry) == -EBUSY)
                         return VM_FAULT_NOPAGE;
         }
         ...
}


I added some tracing and it shows that the value of sgx_nr_free_pages 
was higher than the watermark and thus the reclaimer does not free up 
pages, yet the allocation of memory keeps failing because there are no 
more free pages.

With this patch the test is able to complete and the tracing shows that 
the reclaimer is getting a chance to run. Previously the system was 
spending almost all its time in page faults but now the time is split 
between the page faults and the reclaimer.


    PerfTop:    7432 irqs/sec  kernel:81.5%  exact: 100.0% lost: 0/0 
drop: 0/0 [4000Hz cycles],  (all, 224 CPUs)
------------------------------------------------------------------------------------------------------------------------------------------------------------

     49.62%     0.18%  test_sgx       [.] 
enclave_unclobbered_vdso_oversubscribed_remove
             |
              --14.59%--enclave_unclobbered_vdso_oversubscribed_remove
                        |
                         --20.19%--__vdso_sgx_enter_enclave
                                   |
                                    --11.23%--asm_exc_page_fault
                                              |
                                               --4.82%--exc_page_fault
                                                         |
 
--5.04%--do_user_addr_fault
                                                                    |
 
--5.33%--handle_mm_fault
 
        |
 
         --4.98%--__handle_mm_fault
 
                   |
 
                    --4.59%--__do_fault
 
                              |
 
                               --5.71%--sgx_vma_fault
 
                                         |
 
                                         |--16.71%--sgx_encl_load_page
 
                                         |          |
 
                                         |           --17.05%--sgx_encl_eldu
 
                                         |
 
                                          --5.31%--__mutex_lock.isra.9
 
                                                    |
 
 
--4.92%--mutex_spin_on_owner

     49.46%    14.41%  [vdso]         [.] __vdso_sgx_enter_enclave
             |
             |--5.81%--__vdso_sgx_enter_enclave
             |          |
             |           --4.83%--asm_exc_page_fault
             |                     |
             |                      --4.82%--exc_page_fault
             |                                |
             |                                 --5.04%--do_user_addr_fault
             |                                           |
             | 
--5.33%--handle_mm_fault
             |                                                      |
             | 
--4.98%--__handle_mm_fault
             | 
        |
             | 
         --4.59%--__do_fault
             | 
                   |
             | 
                    --5.71%--sgx_vma_fault
             | 
                              |
             | 
                              |--16.71%--sgx_encl_load_page
             | 
                              |          |
             | 
                              |           --17.05%--sgx_encl_eldu
             | 
                              |
             | 
                               --5.31%--__mutex_lock.isra.9
             | 
                                         |
             | 
                                          --4.92%--mutex_spin_on_owner
             |
              --2.10%--0x5541f689495641d7
                        __libc_start_main
                        main
                        test_harness_run
                        __run_test
 
wrapper_enclave_unclobbered_vdso_oversubscribed_remove
                        |
 
--8.90%--enclave_unclobbered_vdso_oversubscribed_remove
                                   |
                                    --14.39%--__vdso_sgx_enter_enclave
                                              |
                                               --6.39%--asm_exc_page_fault

     45.60%     0.05%  [kernel]       [k] ksgxd
             |
              --10.27%--ksgxd
                        |
                         --13.34%--sgx_reclaim_pages
                                   |
                                   |--19.74%--sgx_encl_ewb
                                   |          |
                                   |           --18.49%--__sgx_encl_ewb
                                   |
                                    --15.73%--__mutex_lock.isra.9
                                              |
                                               --14.77%--mutex_spin_on_owner

exiting.


Reinette

Greg Kroah-Hartman Nov. 5, 2021, 7:10 a.m. UTC | #6

On Thu, Nov 04, 2021 at 01:57:31PM -0700, Reinette Chatre wrote:
> Hi Greg,
> 
> On 11/4/2021 11:54 AM, Greg KH wrote:
> > On Thu, Nov 04, 2021 at 11:28:54AM -0700, Reinette Chatre wrote:
> > > The SGX driver maintains a single global free page counter,
> > > sgx_nr_free_pages, that reflects the number of free pages available
> > > across all NUMA nodes. Correspondingly, a list of free pages is
> > > associated with each NUMA node and sgx_nr_free_pages is updated
> > > every time a page is added or removed from any of the free page
> > > lists. The main usage of sgx_nr_free_pages is by the reclaimer
> > > that will run when the total free pages go below a watermark to
> > > ensure that there are always some free pages available to, for
> > > example, support efficient page faults.
> > > 
> > > With sgx_nr_free_pages accessed and modified from a few places
> > > it is essential to ensure that these accesses are done safely but
> > > this is not the case. sgx_nr_free_pages is sometimes accessed
> > > without any protection and when it is protected it is done
> > > inconsistently with any one of the spin locks associated with the
> > > individual NUMA nodes.
> > > 
> > > The consequence of sgx_nr_free_pages not being protected is that
> > > its value may not accurately reflect the actual number of free
> > > pages on the system, impacting the availability of free pages in
> > > support of many flows. The problematic scenario is when the
> > > reclaimer never runs because it believes there to be sufficient
> > > free pages while any attempt to allocate a page fails because there
> > > are no free pages available. The worst scenario observed was a
> > > user space hang because of repeated page faults caused by
> > > no free pages ever made available.
> > > 
> > > Change the global free page counter to an atomic type that
> > > ensures simultaneous updates are done safely. While doing so, move
> > > the updating of the variable outside of the spin lock critical
> > > section to which it does not belong.
> > > 
> > > Cc: stable@vger.kernel.org
> > > Fixes: 901ddbb9ecf5 ("x86/sgx: Add a basic NUMA allocation scheme to sgx_alloc_epc_page()")
> > > Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
> > > Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
> > > ---
> > >   arch/x86/kernel/cpu/sgx/main.c | 12 ++++++------
> > >   1 file changed, 6 insertions(+), 6 deletions(-)
> > > 
> > > diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> > > index 63d3de02bbcc..8558d7d5f3e7 100644
> > > --- a/arch/x86/kernel/cpu/sgx/main.c
> > > +++ b/arch/x86/kernel/cpu/sgx/main.c
> > > @@ -28,8 +28,7 @@ static DECLARE_WAIT_QUEUE_HEAD(ksgxd_waitq);
> > >   static LIST_HEAD(sgx_active_page_list);
> > >   static DEFINE_SPINLOCK(sgx_reclaimer_lock);
> > > -/* The free page list lock protected variables prepend the lock. */
> > > -static unsigned long sgx_nr_free_pages;
> > > +atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
> > >   /* Nodes with one or more EPC sections. */
> > >   static nodemask_t sgx_numa_mask;
> > > @@ -403,14 +402,15 @@ static void sgx_reclaim_pages(void)
> > >   		spin_lock(&node->lock);
> > >   		list_add_tail(&epc_page->list, &node->free_page_list);
> > > -		sgx_nr_free_pages++;
> > >   		spin_unlock(&node->lock);
> > > +		atomic_long_inc(&sgx_nr_free_pages);
> > >   	}
> > >   }
> > >   static bool sgx_should_reclaim(unsigned long watermark)
> > >   {
> > > -	return sgx_nr_free_pages < watermark && !list_empty(&sgx_active_page_list);
> > > +	return atomic_long_read(&sgx_nr_free_pages) < watermark &&
> > > +	       !list_empty(&sgx_active_page_list);
> > 
> 
> Thank you very much for taking a look.
> 
> > What prevents the value from changing right after you test this?
> 
> You are correct. It is indeed possible for the value to change after this
> test. This test decides when to reclaim more pages so an absolute accurate
> value is not required but knowing that the amount of free pages are running
> low is.
> 
> >  Why is
> > an atomic value somehow solving the problem?
> 
> During stress testing it was found that page allocation during hot path (for
> example page fault) is failing because there were no free pages available in
> any of the free page lists while the global counter contained a value that
> does not reflect the actual free pages and was higher than the watermark and
> thus the reclaimer is never run.
> 
> Changing it to atomic fixed this issue. I also reverted to how this counter
> was managed before with a spin lock protected free counter per free list and
> that also fixes the issue.
>
> > The value changes were happening safely, it was just the reading of the
> > value that was not.  You have not changed the fact that the value can
> > change right after reading given that there was not going to be a
> > problem with reading a stale value before.
> 
> I am testing on a two socket system and I am seeing that the value of
> sgx_nr_free_pages does not accurately reflect the actual free pages.
> 
> It does not look to me like the value is updated safely as it is updated
> with inconsistent protection on a system like this. There is a spin lock
> associated with each NUMA node and which lock is used to update the variable
> depends on which NUMA node memory is being modified - it is not always
> protected with the same lock:
> 
> spin_lock(&node->lock);
> sgx_nr_free_pages++;
> spin_unlock(&node->lock);

Ah, I missed that the original code was using a different lock for every
call place, while now you are just using a single lock (the atomic value
itself.)  That makes more sense, sorry for the noise.

But isn't this going to cause more thrashing and slow things down as you
are hitting the "global" lock for this variable for every allocation?
Or is this not in the hot path?



> 
> 
> > In other words, what did you really fix here?  And how did you test it
> > to verify it did fix things?
> 
> To test this I created a stress test that builds on top of the new
> "oversubscription" test case that we are trying to add to the SGX
> kselftests:
> https://lore.kernel.org/lkml/7715db4882ab9fd52d21de6f62bb3b7e94dc4885.1635447301.git.reinette.chatre@intel.com/
> 
> In the changed test an enclave is created with as much heap as SGX memory.
> After that all the pages are accessed, their type is changed, then the
> enclave is entered to run EACCEPT on each page, after that the pages are
> removed (EREMOVE).
> 
> This test places significant stress on the SGX memory allocation and reclaim
> subsystems. The troublesome part of the test is when the enclave is entered
> so that EACCEPT can be run on each page. During this time, because of the
> oversubscription and previous accesses, there are many page faults. During
> this time a new page needs to be allocated in the SGX memory into which the
> page being faulted needs to be decrypted and loaded back into SGX memory. At
> this point the test hangs.
> 
> Below I show the following:
> * perf top showing that the test hangs because user space is stuck just
> encountering page faults
> * below that I show the code traces explaining why the repeated page faults
> occur (because reclaimer never runs)
> * below that shows the perf top traces after this patch is applied showing
> that the reclaimer now gets a chance to run and the test can complete
> 
> 
> Here is the perf top trace before this patch is applied showing how user
> space is stuck hitting page faults over and over:
>    PerfTop:    4569 irqs/sec  kernel:25.0%  exact: 100.0% lost: 0/0 drop:
> 0/0 [4000Hz cycles],  (all, 224 CPUs)

<ascii art that line-wrapped snipped>

> With this patch the test is able to complete and the tracing shows that the
> reclaimer is getting a chance to run. Previously the system was spending
> almost all its time in page faults but now the time is split between the
> page faults and the reclaimer.
> 
> 
>    PerfTop:    7432 irqs/sec  kernel:81.5%  exact: 100.0% lost: 0/0 drop:
> 0/0 [4000Hz cycles],  (all, 224 CPUs)

Ok, that's better, you need the reclaim in order to make forward
progress.

Thanks for the detailed explaination, no objection from me, sorry for
the misunderstanding.

greg k-h

Jarkko Sakkinen Nov. 7, 2021, 4:45 p.m. UTC | #7

On Thu, 2021-11-04 at 11:28 -0700, Reinette Chatre wrote:
> The consequence of sgx_nr_free_pages not being protected is that
> its value may not accurately reflect the actual number of free
> pages on the system, impacting the availability of free pages in
> support of many flows. The problematic scenario is when the
> reclaimer never runs because it believes there to be sufficient
> free pages while any attempt to allocate a page fails because there
> are no free pages available. The worst scenario observed was a
> user space hang because of repeated page faults caused by
> no free pages ever made available.

Can you go in detail with the "concrete scenario" in the commit
message? It does not have to describe all the possible scenarios
but at least one sequence of events.

/Jarkko

Jarkko Sakkinen Nov. 7, 2021, 4:47 p.m. UTC | #8

On Sun, 2021-11-07 at 18:45 +0200, Jarkko Sakkinen wrote:
> On Thu, 2021-11-04 at 11:28 -0700, Reinette Chatre wrote:
> > The consequence of sgx_nr_free_pages not being protected is that
> > its value may not accurately reflect the actual number of free
> > pages on the system, impacting the availability of free pages in
> > support of many flows. The problematic scenario is when the
> > reclaimer never runs because it believes there to be sufficient
> > free pages while any attempt to allocate a page fails because there
> > are no free pages available. The worst scenario observed was a
> > user space hang because of repeated page faults caused by
> > no free pages ever made available.
> 
> Can you go in detail with the "concrete scenario" in the commit
> message? It does not have to describe all the possible scenarios
> but at least one sequence of events.

I.e. I don't have anything fundamentally against changing it to
atomic but the commit message is completely lacking the stimulus
of changing anything.

/Jarkko

Reinette Chatre Nov. 8, 2021, 7:19 p.m. UTC | #9

Hi Greg,

On 11/5/2021 12:10 AM, Greg KH wrote:
> On Thu, Nov 04, 2021 at 01:57:31PM -0700, Reinette Chatre wrote:
>> Hi Greg,
>>
>> On 11/4/2021 11:54 AM, Greg KH wrote:
>>> On Thu, Nov 04, 2021 at 11:28:54AM -0700, Reinette Chatre wrote:
>>>> The SGX driver maintains a single global free page counter,
>>>> sgx_nr_free_pages, that reflects the number of free pages available
>>>> across all NUMA nodes. Correspondingly, a list of free pages is
>>>> associated with each NUMA node and sgx_nr_free_pages is updated
>>>> every time a page is added or removed from any of the free page
>>>> lists. The main usage of sgx_nr_free_pages is by the reclaimer
>>>> that will run when the total free pages go below a watermark to
>>>> ensure that there are always some free pages available to, for
>>>> example, support efficient page faults.
>>>>
>>>> With sgx_nr_free_pages accessed and modified from a few places
>>>> it is essential to ensure that these accesses are done safely but
>>>> this is not the case. sgx_nr_free_pages is sometimes accessed
>>>> without any protection and when it is protected it is done
>>>> inconsistently with any one of the spin locks associated with the
>>>> individual NUMA nodes.
>>>>
>>>> The consequence of sgx_nr_free_pages not being protected is that
>>>> its value may not accurately reflect the actual number of free
>>>> pages on the system, impacting the availability of free pages in
>>>> support of many flows. The problematic scenario is when the
>>>> reclaimer never runs because it believes there to be sufficient
>>>> free pages while any attempt to allocate a page fails because there
>>>> are no free pages available. The worst scenario observed was a
>>>> user space hang because of repeated page faults caused by
>>>> no free pages ever made available.
>>>>
>>>> Change the global free page counter to an atomic type that
>>>> ensures simultaneous updates are done safely. While doing so, move
>>>> the updating of the variable outside of the spin lock critical
>>>> section to which it does not belong.
>>>>
>>>> Cc: stable@vger.kernel.org
>>>> Fixes: 901ddbb9ecf5 ("x86/sgx: Add a basic NUMA allocation scheme to sgx_alloc_epc_page()")
>>>> Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
>>>> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
>>>> ---
>>>>    arch/x86/kernel/cpu/sgx/main.c | 12 ++++++------
>>>>    1 file changed, 6 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
>>>> index 63d3de02bbcc..8558d7d5f3e7 100644
>>>> --- a/arch/x86/kernel/cpu/sgx/main.c
>>>> +++ b/arch/x86/kernel/cpu/sgx/main.c
>>>> @@ -28,8 +28,7 @@ static DECLARE_WAIT_QUEUE_HEAD(ksgxd_waitq);
>>>>    static LIST_HEAD(sgx_active_page_list);
>>>>    static DEFINE_SPINLOCK(sgx_reclaimer_lock);
>>>> -/* The free page list lock protected variables prepend the lock. */
>>>> -static unsigned long sgx_nr_free_pages;
>>>> +atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
>>>>    /* Nodes with one or more EPC sections. */
>>>>    static nodemask_t sgx_numa_mask;
>>>> @@ -403,14 +402,15 @@ static void sgx_reclaim_pages(void)
>>>>    		spin_lock(&node->lock);
>>>>    		list_add_tail(&epc_page->list, &node->free_page_list);
>>>> -		sgx_nr_free_pages++;
>>>>    		spin_unlock(&node->lock);
>>>> +		atomic_long_inc(&sgx_nr_free_pages);
>>>>    	}
>>>>    }
>>>>    static bool sgx_should_reclaim(unsigned long watermark)
>>>>    {
>>>> -	return sgx_nr_free_pages < watermark && !list_empty(&sgx_active_page_list);
>>>> +	return atomic_long_read(&sgx_nr_free_pages) < watermark &&
>>>> +	       !list_empty(&sgx_active_page_list);

...

>>> The value changes were happening safely, it was just the reading of the
>>> value that was not.  You have not changed the fact that the value can
>>> change right after reading given that there was not going to be a
>>> problem with reading a stale value before.
>>
>> I am testing on a two socket system and I am seeing that the value of
>> sgx_nr_free_pages does not accurately reflect the actual free pages.
>>
>> It does not look to me like the value is updated safely as it is updated
>> with inconsistent protection on a system like this. There is a spin lock
>> associated with each NUMA node and which lock is used to update the variable
>> depends on which NUMA node memory is being modified - it is not always
>> protected with the same lock:
>>
>> spin_lock(&node->lock);
>> sgx_nr_free_pages++;
>> spin_unlock(&node->lock);
> 
> Ah, I missed that the original code was using a different lock for every
> call place, while now you are just using a single lock (the atomic value
> itself.)  That makes more sense, sorry for the noise.
> 
> But isn't this going to cause more thrashing and slow things down as you
> are hitting the "global" lock for this variable for every allocation?
> Or is this not in the hot path?

I do see this as being in the hot path as it is in the page fault 
handling flow. A global lock does seem to be required since it is a 
single free page count that directs the reclaimer and that counter needs 
to be updated safely.

I obtained access to another two socket system where I can test this 
issue. Since this system also has two NUMA nodes it updates the global 
counter unsafely but on this system I am not encountering the user space 
hang and can thus test how much things are being slowed down by this fix.

Interesting is that without the fix the test is actually slightly 
_slower_ than with the fix. I am speculating now that the issue is 
indeed also encountered on this system also but not noticed because the 
global counter can correct itself after some time and not get stuck as 
on the other system from which I sent the long traces.

Here are four runs without the fix:
real    0m47.024s 0m47.433s 0m47.547s 0m47.569s
user    0m7.204s  0m7.292s  0m7.265s  0m7.388s
sys     0m39.806s 0m40.129s 0m40.271s 0m40.168s

Here are four runs with the fix:
real    0m46.893s 0m47.328s 0m46.786s 0m46.635s
user    0m7.351s  0m7.252s  0m7.130s  0m7.170s
sys     0m39.528s 0m40.063s 0m39.642s 0m39.452s


>> Here is the perf top trace before this patch is applied showing how user
>> space is stuck hitting page faults over and over:
>>     PerfTop:    4569 irqs/sec  kernel:25.0%  exact: 100.0% lost: 0/0 drop:
>> 0/0 [4000Hz cycles],  (all, 224 CPUs)
> 
> <ascii art that line-wrapped snipped>

Sorry about that.

> 
>> With this patch the test is able to complete and the tracing shows that the
>> reclaimer is getting a chance to run. Previously the system was spending
>> almost all its time in page faults but now the time is split between the
>> page faults and the reclaimer.
>>
>>
>>     PerfTop:    7432 irqs/sec  kernel:81.5%  exact: 100.0% lost: 0/0 drop:
>> 0/0 [4000Hz cycles],  (all, 224 CPUs)
> 
> Ok, that's better, you need the reclaim in order to make forward
> progress.
> 
> Thanks for the detailed explaination, no objection from me, sorry for
> the misunderstanding.

Thank you very much for taking a look. It is much appreciated.

Reinette

Reinette Chatre Nov. 8, 2021, 7:48 p.m. UTC | #10

Hi Jarkko,

On 11/7/2021 8:47 AM, Jarkko Sakkinen wrote:
> On Sun, 2021-11-07 at 18:45 +0200, Jarkko Sakkinen wrote:
>> On Thu, 2021-11-04 at 11:28 -0700, Reinette Chatre wrote:
>>> The consequence of sgx_nr_free_pages not being protected is that
>>> its value may not accurately reflect the actual number of free
>>> pages on the system, impacting the availability of free pages in
>>> support of many flows. The problematic scenario is when the
>>> reclaimer never runs because it believes there to be sufficient
>>> free pages while any attempt to allocate a page fails because there
>>> are no free pages available. The worst scenario observed was a
>>> user space hang because of repeated page faults caused by
>>> no free pages ever made available.
>>
>> Can you go in detail with the "concrete scenario" in the commit
>> message? It does not have to describe all the possible scenarios
>> but at least one sequence of events.

I provided significant detail regarding the "concrete scenario" in a 
separate response to Greg:
https://lore.kernel.org/lkml/a636290d-db04-be16-1c86-a8dcc3719b39@intel.com/

That message details the test that was run (the test hangs before the 
fix and can complete after the fix), the traces captured at the time the 
test hung, analysis of the traces with root cause of why the system is 
hung, traces after fix applied demonstrating why user space is able to 
make progress and explaining why the test can complete.

Unfortunately the traces I provided wrapped and are not easy to read. 
The essential message from the two traces are that the first trace 
(before the fix) shows that the system is stuck (almost 100%) in the SGX 
page fault handler and not able to make any progress and user space 
hangs. The second trace (after the fix) shows that the system splits its 
time between the SGX page fault handler and the reclaimer enabling user 
space to make progress and the test can complete.

 > I.e. I don't have anything fundamentally against changing it to
 > atomic but the commit message is completely lacking the stimulus
 > of changing anything.

The problem needing to be fixed is that sgx_nr_free_pages is not updated 
safely on systems with more than one NUMA node.

Reinette

Jarkko Sakkinen Nov. 8, 2021, 8:12 p.m. UTC | #11

On Mon, Nov 08, 2021 at 11:48:18AM -0800, Reinette Chatre wrote:
> Hi Jarkko,
> 
> On 11/7/2021 8:47 AM, Jarkko Sakkinen wrote:
> > On Sun, 2021-11-07 at 18:45 +0200, Jarkko Sakkinen wrote:
> > > On Thu, 2021-11-04 at 11:28 -0700, Reinette Chatre wrote:
> > > > The consequence of sgx_nr_free_pages not being protected is that
> > > > its value may not accurately reflect the actual number of free
> > > > pages on the system, impacting the availability of free pages in
> > > > support of many flows. The problematic scenario is when the
> > > > reclaimer never runs because it believes there to be sufficient
> > > > free pages while any attempt to allocate a page fails because there
> > > > are no free pages available. The worst scenario observed was a
> > > > user space hang because of repeated page faults caused by
> > > > no free pages ever made available.
> > > 
> > > Can you go in detail with the "concrete scenario" in the commit
> > > message? It does not have to describe all the possible scenarios
> > > but at least one sequence of events.
> 
> 
> I provided significant detail regarding the "concrete scenario" in a
> separate response to Greg:
> https://lore.kernel.org/lkml/a636290d-db04-be16-1c86-a8dcc3719b39@intel.com/
> 
> That message details the test that was run (the test hangs before the fix
> and can complete after the fix), the traces captured at the time the test
> hung, analysis of the traces with root cause of why the system is hung,
> traces after fix applied demonstrating why user space is able to make
> progress and explaining why the test can complete.

For me that sequence looks like something that you could "abstract"
a bit and get a rough description of the concurrency scenario.

It is as important in this type of patch, as the code change itself,
not least because it helps with maintaining in the future to have
that info in some level of detail in the commit log.

/Jarkko

Reinette Chatre Nov. 8, 2021, 8:56 p.m. UTC | #12

Hi Jarkko,

On 11/8/2021 12:12 PM, Jarkko Sakkinen wrote:
> On Mon, Nov 08, 2021 at 11:48:18AM -0800, Reinette Chatre wrote:
>> Hi Jarkko,
>>
>> On 11/7/2021 8:47 AM, Jarkko Sakkinen wrote:
>>> On Sun, 2021-11-07 at 18:45 +0200, Jarkko Sakkinen wrote:
>>>> On Thu, 2021-11-04 at 11:28 -0700, Reinette Chatre wrote:
>>>>> The consequence of sgx_nr_free_pages not being protected is that
>>>>> its value may not accurately reflect the actual number of free
>>>>> pages on the system, impacting the availability of free pages in
>>>>> support of many flows. The problematic scenario is when the
>>>>> reclaimer never runs because it believes there to be sufficient
>>>>> free pages while any attempt to allocate a page fails because there
>>>>> are no free pages available. The worst scenario observed was a
>>>>> user space hang because of repeated page faults caused by
>>>>> no free pages ever made available.
>>>>
>>>> Can you go in detail with the "concrete scenario" in the commit
>>>> message? It does not have to describe all the possible scenarios
>>>> but at least one sequence of events.
>>
>>
>> I provided significant detail regarding the "concrete scenario" in a
>> separate response to Greg:
>> https://lore.kernel.org/lkml/a636290d-db04-be16-1c86-a8dcc3719b39@intel.com/
>>
>> That message details the test that was run (the test hangs before the fix
>> and can complete after the fix), the traces captured at the time the test
>> hung, analysis of the traces with root cause of why the system is hung,
>> traces after fix applied demonstrating why user space is able to make
>> progress and explaining why the test can complete.
> 
> For me that sequence looks like something that you could "abstract"
> a bit and get a rough description of the concurrency scenario.
> 
> It is as important in this type of patch, as the code change itself,
> not least because it helps with maintaining in the future to have
> that info in some level of detail in the commit log.

My apologies. I understood your comment to be a concern with the change 
itself instead of just the commit message. I will add more detail about 
the failing scenario encountered to the commit message.

Reinette

Jarkko Sakkinen Nov. 9, 2021, 1:30 a.m. UTC | #13

On Mon, Nov 08, 2021 at 12:56:21PM -0800, Reinette Chatre wrote:
> Hi Jarkko,
> 
> On 11/8/2021 12:12 PM, Jarkko Sakkinen wrote:
> > On Mon, Nov 08, 2021 at 11:48:18AM -0800, Reinette Chatre wrote:
> > > Hi Jarkko,
> > > 
> > > On 11/7/2021 8:47 AM, Jarkko Sakkinen wrote:
> > > > On Sun, 2021-11-07 at 18:45 +0200, Jarkko Sakkinen wrote:
> > > > > On Thu, 2021-11-04 at 11:28 -0700, Reinette Chatre wrote:
> > > > > > The consequence of sgx_nr_free_pages not being protected is that
> > > > > > its value may not accurately reflect the actual number of free
> > > > > > pages on the system, impacting the availability of free pages in
> > > > > > support of many flows. The problematic scenario is when the
> > > > > > reclaimer never runs because it believes there to be sufficient
> > > > > > free pages while any attempt to allocate a page fails because there
> > > > > > are no free pages available. The worst scenario observed was a
> > > > > > user space hang because of repeated page faults caused by
> > > > > > no free pages ever made available.
> > > > > 
> > > > > Can you go in detail with the "concrete scenario" in the commit
> > > > > message? It does not have to describe all the possible scenarios
> > > > > but at least one sequence of events.
> > > 
> > > 
> > > I provided significant detail regarding the "concrete scenario" in a
> > > separate response to Greg:
> > > https://lore.kernel.org/lkml/a636290d-db04-be16-1c86-a8dcc3719b39@intel.com/
> > > 
> > > That message details the test that was run (the test hangs before the fix
> > > and can complete after the fix), the traces captured at the time the test
> > > hung, analysis of the traces with root cause of why the system is hung,
> > > traces after fix applied demonstrating why user space is able to make
> > > progress and explaining why the test can complete.
> > 
> > For me that sequence looks like something that you could "abstract"
> > a bit and get a rough description of the concurrency scenario.
> > 
> > It is as important in this type of patch, as the code change itself,
> > not least because it helps with maintaining in the future to have
> > that info in some level of detail in the commit log.
> 
> My apologies. I understood your comment to be a concern with the change
> itself instead of just the commit message. I will add more detail about the
> failing scenario encountered to the commit message.

Yeah, I went through the log and the code change makes sense :-)

/Jarkko

x86/sgx: Fix free page accounting

Commit Message

Comments

Patch