
[RFC,0/4] SGX shmem backing store issue

Message ID cover.1651171455.git.reinette.chatre@intel.com (mailing list archive)
State New, archived

Commit Message

Reinette Chatre April 28, 2022, 8:11 p.m. UTC
Hi Everybody,

Haitao reported encountering the following WARN while stress testing SGX
with the SGX2 series [1] applied:

ELDU returned 1073741837 (0x4000000d)
WARNING: CPU: 72 PID: 24407 at arch/x86/kernel/cpu/sgx/encl.c:81 sgx_encl_eldu+0x3cf/0x400
...
Call Trace:
<TASK>
? xa_load+0x6e/0xa0
__sgx_encl_load_page+0x3d/0x80
sgx_encl_load_page_in_vma+0x4a/0x60
sgx_vma_fault+0x7f/0x3b0
? sysvec_call_function_single+0x66/0xd0
? asm_sysvec_call_function_single+0x12/0x20
__do_fault+0x39/0x110
__handle_mm_fault+0x1222/0x15a0
handle_mm_fault+0xdb/0x2c0
do_user_addr_fault+0x1d1/0x650
? exit_to_user_mode_prepare+0x45/0x170
exc_page_fault+0x76/0x170
? asm_exc_page_fault+0x8/0x30
asm_exc_page_fault+0x1e/0x30
...
</TASK>

ENCLS[ELDU] is returning a #GP when attempting to load an enclave
page from the backing store into the enclave. This is likely because
of a MAC verification failure.

Haitao's stress testing involves running two concurrent loops of the SGX
selftests on a system with 4GB EPC memory. One of the tests is modified
to reduce the oversubscription heap size.

I focused on investigating this issue over the past two weeks and was able
to narrow down the cause but not yet fix the issue. I am now out of ideas
and would really appreciate it if you could provide guidance on how to proceed.

Here is what I have learned so far:
* Reverting commit 08999b2489b4 ("x86/sgx: Free backing memory after
  faulting the enclave page") resolves the issue. With that commit
  reverted the concurrent selftest loops can run to completion without
  encountering any WARNs.

* The issue is also resolved if only the calls (introduced by commit
  08999b2489b4 ("x86/sgx: Free backing memory after faulting the enclave
  page")) to sgx_encl_truncate_backing_page() within __sgx_encl_eldu()
  are disabled.

* Two issues with commit 08999b2489b4 ("x86/sgx: Free backing memory
  after faulting the enclave page") are fixed with patch 1 and 2 in
  this series. These fixes do not resolve the WARN.

* There is a potential race between the reclaimer and the page fault
  handler while interacting with the backing store. This is addressed
  in patch 3, but it does not resolve the WARN.

* Lockdep did reveal a locking issue in SGX2 - the EAUG flow should not
  do direct reclaim since the page fault handler already has mmap_lock
  that the reclaimer will attempt to obtain. This will be fixed in the
  next version of SGX2 but that fix does not resolve the WARN.

* Since ENCLS[ELDU] returns #GP on MAC verification failure, a
  possible cause is a problem occurring while the page is loaded from
  the backing store.
  sgx_encl_get_backing_page() is used to both allocate backing storage
  in the ENCLS[EWB] flow as well as to lookup backing storage in the
  ENCLS[ELDU] flow.
  As part of the investigation the interface was split to ensure that
  backing storage is only allocated during the ENCLS[EWB] flow and only
  lookup occurs during the ENCLS[ELDU] flow. This can be found in
  patch 4.
  This change revealed that there are cases during the ENCLS[ELDU]
  flow when the backing page is not present - attempting to look up a
  backing page returns -ENOENT. Before patch 4 a new backing page
  would be allocated, thus triggering a #GP when ENCLS[ELDU] is
  attempted.
  This change is not a fix; the code path just fails earlier now.
  It reveals a cause of the WARNs encountered (MAC verification will
  fail on a newly allocated page) but not why the pages cannot be
  found in the backing store. With this change the number of WARNs is
  reduced significantly but not completely: even with this patch
  applied the WARN was encountered once while running the stress
  selftests.

Could you please advise on how to proceed? Do you perhaps have ideas
why the backing store could not contain a page when it is loaded back
during the ENCLS[ELDU] flow?
There is some interesting text in the comments within shmem_fault()
that hints that faulting pages into holes while they are being
punched should be avoided ... but I do not understand those details
and would really appreciate guidance on how to understand what is
going on here.

These SGX2 tests exercise the concurrent operation of the reclaimer and
page fault handler and have a good track record of locating issues.
Thanks to them we already have:
8795359e35bc ("x86/sgx: Silence softlockup detection when releasing large enclaves")
ac5d272a0ad0 ("x86/sgx: Fix free page accounting")
While this issue is triggered by the SGX2 tests I cannot find a connection
with the SGX2 code paths. I have made a few attempts to create an
equivalent test for SGX1 but have not yet succeeded in creating an SGX1
test that can trigger this issue.

These patches are based on v5.18-rc4 with the SGX2 series applied but
do not touch SGX2 code paths. Patches 1, 2, and 3 also apply cleanly to
v5.18-rc4 while patch 4 has a conflict in arch/x86/kernel/cpu/sgx/encl.h
due to the SGX2 declarations.

Your feedback will be greatly appreciated.
Thank you very much

Reinette

[1] https://lore.kernel.org/lkml/cover.1649878359.git.reinette.chatre@intel.com/

Reinette Chatre (4):
  x86/sgx: Do not free backing memory on ENCLS[ELDU] failure
  x86/sgx: Set dirty bit after modifying page contents
  x86/sgx: Obtain backing storage page with enclave mutex held
  x86/sgx: Do not allocate backing pages when loading from backing store

 arch/x86/kernel/cpu/sgx/encl.c  | 89 ++++++++++++++++++++++++++++-----
 arch/x86/kernel/cpu/sgx/encl.h  |  8 ++-
 arch/x86/kernel/cpu/sgx/ioctl.c |  1 +
 arch/x86/kernel/cpu/sgx/main.c  | 15 +++---
 4 files changed, 93 insertions(+), 20 deletions(-)

Comments

Dave Hansen April 28, 2022, 9:12 p.m. UTC | #1
On 4/28/22 13:11, Reinette Chatre wrote:
> ELDU returned 1073741837 (0x4000000d)
> WARNING: CPU: 72 PID: 24407 at arch/x86/kernel/cpu/sgx/encl.c:81 sgx_encl_eldu+0x3cf/0x400
> ...
> Call Trace:
> <TASK>
> ? xa_load+0x6e/0xa0
> __sgx_encl_load_page+0x3d/0x80
> sgx_encl_load_page_in_vma+0x4a/0x60
> sgx_vma_fault+0x7f/0x3b0

First of all, thanks for all the work to narrow this down.

It sounds like there are probably at least two failure modes at play here:

	1. shmem_read_mapping_page_gfp() is called to retrieve an
	   existing page, but an empty one is allocated instead.  ELDU
	   fails on the empty page.  This one should be fixed by patch
	   4/4.
	2. shmem_read_mapping_page_gfp() actually finds a page, but it
	   still fails ELDU.

Is that right?

If so, I'd probably delve deeper into what the page and the PCMD look
like.  I usually go after these kinds of things with tracing.  I'd
probably dump some representation of the PCMD and page contents with
trace_printk().  Dump them at __sgx_encl_ewb() time, then also
dump them where the warning is being hit.  Pair the warning with a
tracing_off().

// A crude checksum:
u64 sum_page(u64 *page)
{
	u64 ret = 0;
	int i;

	for (i = 0; i < PAGE_SIZE/sizeof(u64); i++)
		ret += page[i];

	return ret;
}

Then, logically something like this:

	trace_printk("bad ELDU on shm page: %x sum: pcmd: %x %x...\n",
		page_to_pfn(shm_page), sum_page(page_kmap),
		&pcmd, ...);

Both at EWB time and ELDU time.  Let's see if the pages that are coming
out of shmem are the same as the ones that were put in.

When you hit the warning, tracing should turn itself off.  Then, you can
just grep through the trace for that same pfn.
Dave Hansen April 28, 2022, 9:29 p.m. UTC | #2
On 4/28/22 13:11, Reinette Chatre wrote:
> ELDU returned 1073741837 (0x4000000d)

BTW, what does that decode to exactly?
Reinette Chatre April 28, 2022, 10:20 p.m. UTC | #3
Hi Dave,

On 4/28/2022 2:29 PM, Dave Hansen wrote:
> On 4/28/22 13:11, Reinette Chatre wrote:
>> ELDU returned 1073741837 (0x4000000d)
> 
> BTW, what does that decode to exactly?

ENCLS instructions can fault as well as return SGX error
codes.

0x40000000 (SGX_ENCLS_FAULT_FLAG) indicates that this is
an ENCLS fault (as opposed to an error returned by the
ENCLS instruction) with 0xd (#GP) identifying the fault.

Reinette
Reinette Chatre April 29, 2022, 6:50 p.m. UTC | #4
Hi Dave,

On 4/28/2022 2:12 PM, Dave Hansen wrote:
> On 4/28/22 13:11, Reinette Chatre wrote:
>> ELDU returned 1073741837 (0x4000000d)
>> WARNING: CPU: 72 PID: 24407 at arch/x86/kernel/cpu/sgx/encl.c:81 sgx_encl_eldu+0x3cf/0x400
>> ...
>> Call Trace:
>> <TASK>
>> ? xa_load+0x6e/0xa0
>> __sgx_encl_load_page+0x3d/0x80
>> sgx_encl_load_page_in_vma+0x4a/0x60
>> sgx_vma_fault+0x7f/0x3b0
> 
> First of all, thanks for all the work to narrow this down.
> 
> It sounds like there are probably at least two failure modes at play here:
> 
> 	1. shmem_read_mapping_page_gfp() is called to retrieve an
> 	   existing page, but an empty one is allocated instead.  ELDU
> 	   fails on the empty page.  This one should be fixed by patch
> 	   4/4.

I am not yet comfortable calling patch 4/4 a fix. From what I understand,
hitting that error means that the kernel lost a page. The patch detects
it earlier so that we can avoid the ELDU #GP, but it still seems a
problem to hit that error in the first place.

> 	2. shmem_read_mapping_page_gfp() actually finds a page, but it
> 	   still fails ELDU.
> 
> Is that right?

Correct.

> 
> If so, I'd probably delve deeper into what the page and the PCMD look
> like.  I usually go after these kinds of things with tracing.  I'd
> probably dump some representation of the PCMD and page contents with
> trace_printk().  Dump them at __sgx_encl_ewb() time, then also
> dump them where the warning is being hit.  Pair the warning with a
> tracing_off().
> 
> // A crude checksum:
> u64 sum_page(u64 *page)
> {
> 	u64 ret = 0;
> 	int i;
> 
> 	for (i = 0; i < PAGE_SIZE/sizeof(u64); i++)
> 		ret += page[i];
> 
> 	return ret;
> }
> 
> Then, logically something like this:
> 
> 	trace_printk("bad ELDU on shm page: %x sum: pcmd: %x %x...\n",
> 		page_to_pfn(shm_page), sum_page(page_kmap),
> 		&pcmd, ...);
> 
> Both at EWB time and ELDU time.  Let's see if the pages that are coming
> out of shmem are the same as the ones that were put in.
> 
> When you hit the warning, tracing should turn itself off.  Then, you can
> just grep through the trace for that same pfn.

Thank you very much for this guidance.

I added trace_printk() to the EWB and ELDU flows as below and found an
interesting result.

EWB:
__sgx_encl_ewb() {
   ...
   trace_printk("EWB encl %px on SHM page: 0x%lx PCMD page: 0x%lx index %lu sum: 0x%llx \n",
                encl_page->encl, page_to_pfn(backing->contents), page_to_pfn(backing->pcmd),
                page_index, sum_page((u64 *)pginfo.contents));
   ...
}


ELDU:
__sgx_encl_eldu() {
  ...
  trace_printk("WARN encl %px ELDU on SHM page: 0x%lx PCMD page: 0x%lx index %lu sum: 0x%llx \n",
               encl, page_to_pfn(b.contents), page_to_pfn(b.pcmd), page_index, sum_page((u64 *)pginfo.contents));
  ...
}


Below is the last few entries of the trace buffer:
           ksgxd-806     [022] ...2.   857.259294: __sgx_encl_ewb: EWB encl ffff90a2a38d0000 on SHM page: 0x27906c PCMD page: 0x27906d index 168866 sum: 0x0
           ksgxd-806     [022] ...2.   857.259299: __sgx_encl_ewb: EWB encl ffff90a2a38d0000 on SHM page: 0x27906c PCMD page: 0x27906d index 168866 sum: 0x9f7dc28a239283bb
=>           ksgxd-806     [022] ...2.   857.259321: __sgx_encl_ewb: EWB encl ffff90a2a38d0000 on SHM page: 0x27906e PCMD page: 0x27906d index 168867 sum: 0xcfe97176b1e8c386
           ksgxd-806     [022] ...2.   857.259325: __sgx_encl_ewb: EWB encl ffff90a2a38d0000 on SHM page: 0x27906f PCMD page: 0x27906d index 168868 sum: 0xd1d012775db676d9
           ksgxd-806     [022] ...2.   857.259329: __sgx_encl_ewb: EWB encl ffff90a2a38d0000 on SHM page: 0x279070 PCMD page: 0x27906d index 168869 sum: 0x1c568b08bd31d7bb
           ksgxd-806     [022] ...2.   857.259333: __sgx_encl_ewb: EWB encl ffff90a2a38d0000 on SHM page: 0x279071 PCMD page: 0x27906d index 168870 sum: 0x4dba242d348982a7
=>  test_sgx_large-4092    [030] ...2.   857.259341: sgx_encl_eldu: WARN encl ffff90a2a38d0000 ELDU on SHM page: 0x27906e PCMD page: 0x279052 index 168867 sum: 0xcfe97176b1e8c386


It seems that in this scenario a page is evicted from the enclave and then almost
immediately faulted back in. When it is faulted back in, the sum matches that of
the original encrypted enclave page at the time it was evicted.
The interesting part is that the enclave page's PFN matches the PFN from when it
was evicted, but the PCMD PFN does not match.

I searched the log and cannot find the PCMD PFN printed during ELDU being used
by the enclave anywhere at all.

I am not familiar with this area - is the PFN expected to be consistent? Do
you perhaps have an idea why the PFN of the PCMD page may have changed? This
is running with this series applied so the ELDU flow would do a lookup of the
PCMD page and not attempt to allocate a new one.

Any hints would be appreciated.

Thank you very much

Reinette
Dave Hansen April 29, 2022, 7:45 p.m. UTC | #5
On 4/29/22 11:50, Reinette Chatre wrote:
> I am not familiar with this area - is the PFN expected to be consistent? Do
> you perhaps have an idea why the PFN of the PCMD page may have changed? This
> is running with this series applied so the ELDU flow would do a lookup of the
> PCMD page and not attempt to allocate a new one.

First of all, cool!  This is really good progress!

It's possible for the PCMD shmem page to be swapped out and swapped back
in.  The pfn would be likely to change in that case.  But, that seems
highly unlikely in that short of a period of time.

I'd dump out two more things:

First, dump out page_pcmd_off, just like you're doing with page_index in
case page_pcmd_off itself is getting botched.  I looked but couldn't
find anything obvious.

Second, dump out the pfn in sgx_encl_truncate_backing_page().  It's
possible that something is getting overly-aggressive and zapping the
PCMD page too early.  That would be easy to explain with that PCMD
locking issue you discovered.  But, your patch should have fixed that issue.

For debugging, could you double-check that the PCMD page *is* empty
around sgx_encl_truncate_backing_page()?  If there's a race here you can
also enlarge the race window by adding an msleep() or a spin loop
somewhere after the memchr_inv().

You could also hold an extra reference on the PCMD page, maybe something
like the attached patch.  That will let you inspect the actual page
after it is *actually* truncated.  There should never be data in the
page there.
Reinette Chatre April 30, 2022, 3:22 a.m. UTC | #6
Hi Dave,

On 4/29/2022 12:45 PM, Dave Hansen wrote:
> On 4/29/22 11:50, Reinette Chatre wrote:
>> I am not familiar with this area - is the PFN expected to be consistent? Do
>> you perhaps have an idea why the PFN of the PCMD page may have changed? This
>> is running with this series applied so the ELDU flow would do a lookup of the
>> PCMD page and not attempt to allocate a new one.
> 
> First of all, cool!  This is really good progress!
> 
> It's possible for the PCMD shmem page to be swapped out and swapped back
> in.  The pfn would be likely to change in that case.  But, that seems
> highly unlikely in that short of a period of time.
> 
> I'd dump out two more things:
> 
> First, dump out page_pcmd_off, just like you're doing with page_index in
> case page_pcmd_off itself is getting botched.  I looked but couldn't
> find anything obvious.
> 
> Second, dump out the pfn in sgx_encl_truncate_backing_page().  It's
> possible that something is getting overly-aggressive and zapping the
> PCMD page too early.  That would be easy to explain with that PCMD
> locking issue you discovered.  But, your patch should have fixed that issue.
> 
> For debugging, could you double-check that the PCMD page *is* empty
> around sgx_encl_truncate_backing_page()?  If there's a race here you can
> also enlarge the race window by adding an msleep() or a spin loop
> somewhere after the memchr_inv().
> 
> You could also hold an extra reference on the PCMD page, maybe something
> like the attached patch.  That will let you inspect the actual page
> after it is *actually* truncated.  There should never be data in the
> page there.


Thank you so much for pointing me in the right direction. With this
information I believe that I found a race condition. I did create a
fix for it and it has been running for more than an hour now without
a WARN _and_ without the -ENOENT error introduced in patch 4/4.

Please find below the description of the race and the fix.

Scenario is that the reclaimer is evicting a range of pages that
are being faulted right away.

Consider three enclave pages (ENCLAVE_A, ENCLAVE_B, and ENCLAVE_C)
sharing a PCMD page (PCMD_ABC). ENCLAVE_A and ENCLAVE_B are in the
enclave memory and ENCLAVE_C is in backing store. PCMD_ABC contains
just one entry, that of ENCLAVE_C.

Scenario proceeds where ENCLAVE_A is being evicted from the enclave
while ENCLAVE_C is faulted in.

Scenario is with this patch series applied.

sgx_reclaim_pages() {

  ...

  /*
   * Reclaim ENCLAVE_A
   */
  mutex_lock(&encl->lock);
  /*
   * Get a reference to ENCLAVE_A's
   * shmem page where enclave page
   * encrypted data will be stored
   * as well as a reference to the
   * enclave page's PCMD data page,
   * PCMD_ABC.
   */
  sgx_encl_alloc_backing(...);
  mutex_unlock(&encl->lock);

                                    /*
                                     * Fault ENCLAVE_C
                                     */

                                    sgx_vma_fault() {

                                      mutex_lock(&encl->lock);
                                      /*
                                       * Get reference to
                                       * ENCLAVE_C's shmem page
                                       * as well as PCMD_ABC.
                                       */
                                      sgx_encl_get_backing(...)
                                     /*
                                      * load page back into
                                      * enclave via ELDU
                                      */
                                     /*
                                      * Release reference to
                                      * ENCLAVE_C' shmem page and
                                      * PCMD_ABC.
                                      */
                                     sgx_encl_put_backing(...);
                                     /*
                                      * PCMD_ABC is found empty so
                                      * it and ENCLAVE_C's shmem page
                                      * are truncated.
                                      */
                                     /* Truncate ENCLAVE_C backing page */
                                     sgx_encl_truncate_backing_page();
                                     /* Truncate PCMD_ABC */
                                     sgx_encl_truncate_backing_page();

                                     mutex_unlock(&encl->lock);

                                     ...
                                     }
  mutex_lock(&encl->lock);
  /*
  * Write encrypted contents of
  * ENCLAVE_A to ENCLAVE_A shmem
  * page and its PCMD data to
  * PCMD_ABC.
  */

  sgx_encl_put_backing(...)

  /*
   * Reference to PCMD_ABC is
   * dropped and it is truncated.
   * ENCLAVE_A's PCMD data is lost.
   */
  mutex_unlock(&encl->lock);
}



What happens next depends on whether it is ENCLAVE_A being faulted
in or ENCLAVE_B (or ENCLAVE_C) being evicted.

If ENCLAVE_A is faulted then the error introduced in patch 4/4 would
be encountered since lookup of PCMD_ABC fails. Before patch 4/4 a new
PCMD page would be allocated and then ENCLS[ELDU] would WARN on the PCMD
data.
If ENCLAVE_B is evicted first then a new PCMD_ABC would be allocated but
later when ENCLAVE_A is faulted the ENCLS[ELDU] instruction would #GP
during its checks of the PCMD value and the WARN would be encountered.

Below is the fix I am currently testing. It does not truncate the PCMD
page while there is another reference to it.

diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index 22ed886dc825..81c7bfc12b9b 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -98,11 +98,10 @@ static int __sgx_encl_eldu(struct sgx_encl_page *encl_page,
 	kunmap_atomic(pcmd_page);
 	kunmap_atomic((void *)(unsigned long)pginfo.contents);
 
-	sgx_encl_put_backing(&b, true);
-
+	put_page(b.contents);
 	sgx_encl_truncate_backing_page(encl, page_index);
 
-	if (pcmd_page_empty) {
+	if (pcmd_page_empty && put_page_testzero(b.pcmd)) {
 		ida_free(&encl->pcmd_in_backing, PFN_DOWN(page_pcmd_off));
 		sgx_encl_truncate_backing_page(encl, PFN_DOWN(page_pcmd_off));
 	}


I plan on letting the test keep running to make sure it really fixes the
issue. If all goes well patch 4/4 can be dropped and the fix included instead.

Thanks again for your guidance. It was tremendously helpful.

Reinette
Reinette Chatre April 30, 2022, 3:52 p.m. UTC | #7
On 4/29/2022 8:22 PM, Reinette Chatre wrote:

> diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
> index 22ed886dc825..81c7bfc12b9b 100644
> --- a/arch/x86/kernel/cpu/sgx/encl.c
> +++ b/arch/x86/kernel/cpu/sgx/encl.c
> @@ -98,11 +98,10 @@ static int __sgx_encl_eldu(struct sgx_encl_page *encl_page,
>  	kunmap_atomic(pcmd_page);
>  	kunmap_atomic((void *)(unsigned long)pginfo.contents);
>  
> -	sgx_encl_put_backing(&b, true);
> -
> +	put_page(b.contents);
>  	sgx_encl_truncate_backing_page(encl, page_index);
>  
> -	if (pcmd_page_empty) {
> +	if (pcmd_page_empty && put_page_testzero(b.pcmd)) {

These tests need to be swapped and instead be:

	if (put_page_testzero(b.pcmd) && pcmd_page_empty) {

I fixed the tests. Even with the previous order the concurrent
selftests ran overnight without encountering a single error
or WARN. I increased the number of test iterations and will
run it over the weekend with the correct test order shown
above to ensure the reference to the PCMD page is always
dropped.

Reinette
Dave Hansen May 2, 2022, 2:36 p.m. UTC | #8
On 4/29/22 20:22, Reinette Chatre wrote:
> -	if (pcmd_page_empty) {
> +	if (pcmd_page_empty && put_page_testzero(b.pcmd)) {
>  		ida_free(&encl->pcmd_in_backing, PFN_DOWN(page_pcmd_off));
>  		sgx_encl_truncate_backing_page(encl, PFN_DOWN(page_pcmd_off));
>  	}

Reinette, nice work on tracking this down!

One comment on the fix though:  Will this put_page_testzero() ever
actually trigger?  The page cache itself holds a reference, which is
released in this case via:

	shmem_truncate_range() ->
	truncate_inode_pages_range() ->
	truncate_inode_folio() ->
	filemap_remove_folio() ->
	filemap_free_folio() ->
	folio_put_refs()

So I think the lowest the page refcount can actually be at the point of
put_page_testzero() is 1.

I suspect the end result of that patch would actually just be that PCMD
pages are never freed at runtime, which would also work around the bug
you discovered.

No matter what we do here, we need to ensure that there is no race between:

        pcmd_page_empty = !memchr_inv(pcmd_page, 0, PAGE_SIZE);
and
	sgx_encl_truncate_backing_page(encl, PFN_DOWN(page_pcmd_off));

Holding the encl->lock over that section of code would work.  It also
would not be awful to add a special "truncation lock" which is taken
before putting data in a pcmd page and is taken over the
memchr()->truncate window as well.
Reinette Chatre May 2, 2022, 5:11 p.m. UTC | #9
Hi Dave,

On 5/2/2022 7:36 AM, Dave Hansen wrote:
> On 4/29/22 20:22, Reinette Chatre wrote:
>> -	if (pcmd_page_empty) {
>> +	if (pcmd_page_empty && put_page_testzero(b.pcmd)) {
>>  		ida_free(&encl->pcmd_in_backing, PFN_DOWN(page_pcmd_off));
>>  		sgx_encl_truncate_backing_page(encl, PFN_DOWN(page_pcmd_off));
>>  	}
> 
> Reinette, nice work on tracking this down!
> 
> One comment on the fix though:  Will this put_page_testzero() ever
> actually trigger?  The page cache itself holds a reference, which is
> released in this case via:
> 
> 	shmem_truncate_range() ->
> 	truncate_inode_pages_range() ->
> 	truncate_inode_folio() ->
> 	filemap_remove_folio() ->
> 	filemap_free_folio() ->
> 	folio_put_refs()
> 
> So I think the lowest the page refcount can actually be at the point of
> put_page_testzero() is 1.

oh, I see, thank you for pointing that out. We then need something like
"put_page_testone()".

> 
> I suspect the end result of that patch would actually just be that PCMD
> pages are never freed at runtime, which would also work around the bug
> you discovered.
> 
> No matter what we do here, we need to ensure that there is no race between:
> 
>         pcmd_page_empty = !memchr_inv(pcmd_page, 0, PAGE_SIZE);
> and
> 	sgx_encl_truncate_backing_page(encl, PFN_DOWN(page_pcmd_off));
> 
> Holding the encl->lock over that section of code would work.  It also
> would not be awful to add a special "truncation lock" which is taken
> before putting data in a pcmd page and is taken over the
> memchr()->truncate window as well.

&encl->lock is already held over that section of the code so there
never is a race between the empty check and the truncate call.

The problem as I see it is that at the time __sgx_encl_eldu() is called
by the page fault handler (with enclave mutex held), the reclaimer already
has a reference to the PCMD page (but of course does _not_ have the enclave
mutex). The reclaimer expects to write to the PCMD page later when it obtains
the enclave mutex again and expects it to be safe since it has a reference
to the PCMD page.

The page fault handler will find the PCMD page empty when it removes
the last entry and thus truncate the page.

When reclaimer runs again it writes to the just-truncated PCMD page and
when it releases its reference the page is removed and the just written
data is lost.

My goal was to prevent the page fault handler from doing a "is the PCMD
page empty" if the reclaimer has a reference to it. Even if the PCMD page
is empty when the page fault handler checks it the page is expected to
get data right when reclaimer can get the mutex back since the reclaimer
already has a reference to the page.

Regarding your other suggestion, to use a "truncation lock". SGX just
gets the page pointer from shmem every time it is needed so there is
no room as I see it to add metadata to a PCMD page. One possibility
that we could do is for the page fault handler to check if any enclave
page sharing the PCMD page is being reclaimed (has the SGX_ENCL_PAGE_BEING_RECLAIMED
flag set). The reclaimer sets SGX_ENCL_PAGE_BEING_RECLAIMED for an enclave
page when it obtains the reference to its PCMD page and clears the flag
when it releases the reference.

Reinette
Dave Hansen May 2, 2022, 9:33 p.m. UTC | #10
On 5/2/22 10:11, Reinette Chatre wrote:
> My goal was to prevent the page fault handler from doing a "is the PCMD
> page empty" if the reclaimer has a reference to it. Even if the PCMD page
> is empty when the page fault handler checks it the page is expected to
> get data right when reclaimer can get the mutex back since the reclaimer
> already has a reference to the page.

I think shmem_truncate_range() might be the wrong operation.  It
destroys data and, in the end, we don't want to destroy data.
Filesystems and the page cache already have nice ways to keep from
destroying data, we just need to use them.

First, I think set_page_dirty(pcmd_page) before the EWB is a good start.
 That's what filesystems do before important data that needs to be saved
goes into pages.

Second, I think we need behavior like POSIX_FADV_DONTNEED, not
FALLOC_FL_PUNCH_HOLE.  The DONTNEED operations end up in
mapping_evict_folio(), which has both page refcount *and* dirty page
checks.  That means that either elevating a refcount or set_page_dirty()
will thwart DONTNEED-like behavior.

There are two basic things we need to do:

1. Prevent page from being truncated around EWB time
2. Prevent unreferenced page with all zeros from staying in shmem
   forever or taking up swap space

On the EWB (write to PCMD side) I think something like this works:

	sgx_encl_get_backing()
		get_page(pcmd_page)

	...
	lock_page(pcmd_page);
	// check for truncation since sgx_encl_get_backing()
	if (pcmd_page->mapping != shmem)
		goto retry;
	 // double check this is OK under lock_page():
	set_page_dirty(pcmd_page);
	__sgx_encl_ewb();
	unlock_page(pcmd_page);

That's basically what filesystems do.  Get the page from the page cache,
lock it, then make sure it is consistent.  If not, retry.

On the "free" / read in (ELDU) side:

	// get pcmd_page ref
	lock_page(pcmd_page);
	// truncation is not a concern because that's only done
	// on the read-in side, here, where we hold encl->lock

	memset();
	if (!memchr_inv())
		// clear the way for DONTNEED:
		ClearPageDirty(pcmd_page);
	unlock_page(pcmd_page);
	// drop pcmd_page ref
	...
	POSIX_FADV_DONTNEED

There's one downside to this: it's _possible_ that a transient
get_page() would block POSIX_FADV_DONTNEED.  Then the zeroed page would
stick around forever, or at least until the next ELDU operation did
another memchr_inv().

I went looking around for some of those and could not find any that I
*know* apply to shmem.

This doesn't feel like a great solution; it's more complicated than I
would like.  Any other ideas?
Jarkko Sakkinen May 4, 2022, 6:40 a.m. UTC | #11
On Thu, Apr 28, 2022 at 01:11:23PM -0700, Reinette Chatre wrote:
> Hi Everybody,
> 
> Haitao reported encountering the following WARN while stress testing SGX
> with the SGX2 series [1] applied:
> 
> ELDU returned 1073741837 (0x4000000d)
> WARNING: CPU: 72 PID: 24407 at arch/x86/kernel/cpu/sgx/encl.c:81 sgx_encl_eldu+0x3cf/0x400
> ...
> Call Trace:
> <TASK>
> ? xa_load+0x6e/0xa0
> __sgx_encl_load_page+0x3d/0x80
> sgx_encl_load_page_in_vma+0x4a/0x60
> sgx_vma_fault+0x7f/0x3b0
> ? sysvec_call_function_single+0x66/0xd0
> ? asm_sysvec_call_function_single+0x12/0x20
> __do_fault+0x39/0x110
> __handle_mm_fault+0x1222/0x15a0
> handle_mm_fault+0xdb/0x2c0
> do_user_addr_fault+0x1d1/0x650
> ? exit_to_user_mode_prepare+0x45/0x170
> exc_page_fault+0x76/0x170
> ? asm_exc_page_fault+0x8/0x30
> asm_exc_page_fault+0x1e/0x30
> ...
> </TASK>
> 
> ENCLS[ELDU] is returning a #GP when attempting to load an enclave
> page from the backing store into the enclave. This is likely because
> of a MAC verification failure.
> 
> Haitao's stress testing involves running two concurrent loops of the SGX
> selftests on a system with 4GB EPC memory. One of the tests is modified
> to reduce the oversubscription heap size:
> 
> diff --git a/tools/testing/selftests/sgx/main.c b/tools/testing/selftests/sgx/main.c
> index d480c2dd2858..12008789325b 100644
> --- a/tools/testing/selftests/sgx/main.c
> +++ b/tools/testing/selftests/sgx/main.c
> @@ -398,7 +398,7 @@ TEST_F_TIMEOUT(enclave, unclobbered_vdso_oversubscribed_remove, 900)
>  	 * Create enclave with additional heap that is as big as all
>  	 * available physical SGX memory.
>  	 */
> -	total_mem = get_total_epc_mem();
> +	total_mem = get_total_epc_mem()/16;
>  	ASSERT_NE(total_mem, 0);
>  	TH_LOG("Creating an enclave with %lu bytes heap may take a while ...",
>  	       total_mem);
> 
> If the test compiled with the above snippet is renamed as "test_sgx_small"
> and the original renamed as "test_sgx_large" the two concurrent loops are
> run as follows:
> 
> (for i in $(seq 1 999); do echo "Iteration $i"; sudo ./test_sgx_large; done ) > log.large 2>&1
> (for i in $(seq 1 9999); do echo "Iteration $i"; sudo ./test_sgx_small; done ) > log.small 2>&1
> 
> If the SGX driver is modified to always WARN when ENCLS[ELDU] encounters a #GP
> (see below) then the WARN appears after a few iterations of "test_sgx_large"
> and shows up throughout the testing.
> 
> diff --git a/arch/x86/kernel/cpu/sgx/encls.h b/arch/x86/kernel/cpu/sgx/encls.h
> index 99004b02e2ed..68c1dbc84ed3 100644
> --- a/arch/x86/kernel/cpu/sgx/encls.h
> +++ b/arch/x86/kernel/cpu/sgx/encls.h
> @@ -18,7 +18,7 @@
>  #define ENCLS_WARN(r, name) {						  \
>  	do {								  \
>  		int _r = (r);						  \
> -		WARN_ONCE(_r, "%s returned %d (0x%x)\n", (name), _r, _r); \
> +		WARN(_r, "%s returned %d (0x%x)\n", (name), _r, _r); \
>  	} while (0);							  \
>  }
> 
> I focused on investigating this issue for the past two weeks and was able
> to narrow down the cause but have not yet fixed the issue. Now I am out of
> ideas and would really appreciate it if you could provide guidance on how
> to proceed.
> 
> Here is what I have learned so far:
> * Reverting commit 08999b2489b4 ("x86/sgx: Free backing memory after
>   faulting the enclave page") resolves the issue. With that commit
>   reverted the concurrent selftest loops can run to completion without
>   encountering any WARNs.
> 
> * The issue is also resolved if only the calls (introduced by commit
>   08999b2489b4 ("x86/sgx: Free backing memory after faulting the enclave
>   page") ) to sgx_encl_truncate_backing_page() within __sgx_encl_eldu()
>   are disabled.
> 
> * Two issues with commit 08999b2489b4 ("x86/sgx: Free backing memory
>   after faulting the enclave page") are fixed with patch 1 and 2 in
>   this series. These fixes do not resolve the WARN.
> 
> * There is a potential race between the reclaimer and the page fault
>   handler while interacting with the backing store. This is addressed
>   in patch 3, but it does not resolve the WARN.
> 
> * Lockdep did reveal a locking issue in SGX2 - the EAUG flow should not
>   do direct reclaim since the page fault handler already has mmap_lock
>   that the reclaimer will attempt to obtain. This will be fixed in the
>   next version of SGX2 but that fix does not resolve the WARN.
> 
> * Since ENCLS[ELDU] returns #GP on MAC verification failure, a possible
>   cause could be a problem occurring while loading the page from the
>   backing store.
>   sgx_encl_get_backing_page() is used to both allocate backing storage
>   in the ENCLS[EWB] flow as well as to lookup backing storage in the
>   ENCLS[ELDU] flow.
>   As part of the investigation the interface was split to ensure that
>   backing storage is only allocated during the ENCLS[EWB] flow and only
>   lookup occurs during the ENCLS[ELDU] flow. This can be found in
>   patch 4.
>   This change revealed that there are cases during the ENCLS[ELDU] flow
>   when the backing page is not present - attempting to look up a
>   backing page returns -ENOENT. Before patch 4 a new backing page
>   would be allocated and would thus trigger a #GP when ENCLS[ELDU] is
>   attempted.
>   This change is not a fix (the code path just fails earlier now), but
>   it reveals a cause of the WARNs encountered (MAC verification will
>   fail on a newly allocated page), though not why the pages cannot be
>   found in the backing store. With this change the number of WARNs is
>   reduced significantly but not eliminated: even with this patch
>   applied the WARN was encountered once while running the stress
>   selftests.
> 
> Could you please advise on how to proceed? Do you perhaps have ideas
> why the backing store might not contain a page when it is loaded back
> during the ENCLS[ELDU] flow?
> There is some interesting text in the comments within shmem_fault()
> that hints that faulting pages into holes should be avoided while the
> holes are being punched ... but I do not understand those details and
> would really appreciate guidance on how to understand what is going on
> here.
> 
> These SGX2 tests exercise the concurrent operation of the reclaimer and
> page fault handler and have a good track record of locating issues.
> Thanks to them we already have:
> 8795359e35bc ("x86/sgx: Silence softlockup detection when releasing large enclaves")
> ac5d272a0ad0 ("x86/sgx: Fix free page accounting")
> While this issue is triggered by the SGX2 tests, I cannot find a
> connection to the SGX2 code paths. I have made a few attempts to create
> an equivalent test for SGX1 but have not yet succeeded in creating an
> SGX1 test that can trigger this issue.
> 
> These patches are based on v5.18-rc4 with the SGX2 series applied but
> do not touch SGX2 code paths. Patches 1, 2, and 3 also apply cleanly to
> v5.18-rc4 while patch 4 has a conflict in arch/x86/kernel/cpu/sgx/encl.h
> due to the SGX2 declarations.
> 
> Your feedback will be greatly appreciated.
> Thank you very much
> 
> Reinette

Hi, sorry for the response lag. I was on sick leave for the 2nd half of
last week. Looking into this.

BR, Jarkko
Reinette Chatre May 4, 2022, 10:13 p.m. UTC | #12
Hi Dave,

On 5/2/2022 2:33 PM, Dave Hansen wrote:
> On 5/2/22 10:11, Reinette Chatre wrote:
>> My goal was to prevent the page fault handler from doing a "is the PCMD
>> page empty" if the reclaimer has a reference to it. Even if the PCMD page
>> is empty when the page fault handler checks it the page is expected to
>> get data right when reclaimer can get the mutex back since the reclaimer
>> already has a reference to the page.
> 
> I think shmem_truncate_range() might be the wrong operation.  It
> destroys data and, in the end, we don't want to destroy data.
> Filesystems and the page cache already have nice ways to keep from
> destroying data, we just need to use them.
> 
> First, I think set_page_dirty(pcmd_page) before the EWB is a good start.
>  That's what filesystems do before important data that needs to be saved
> goes into pages.

ok, at this time the reclaimer sets the dirty bit _after_ writing to the
shmem backing store. 

From what I understand we need to split sgx_encl_put_backing(), remove
the part where the dirty bit is set, and only use sgx_encl_put_backing()
to do the put_page() calls.

The motivation for this change is not exactly clear to me since in
the implementation access to these pages is protected by the
enclave mutex.

> 
> Second, I think we need behavior like POSIX_FADV_DONTNEED, not
> FALLOC_FL_PUNCH_HOLE.  The DONTNEED operations end up in
> mapping_evict_folio(), which has both page refcount *and* dirty page
> checks.  That means that either elevating a refcount or set_page_dirty()
> will thwart DONTNEED-like behavior.
> 
> There are two basic things we need to do:
> 
> 1. Prevent page from being truncated around EWB time
> 2. Prevent unreferenced page with all zeros from staying in shmem
>    forever or taking up swap space
> 
> On the EWB (write to PCMD side) I think something like this works:
> 
> 	sgx_encl_get_backing()
> 		get_page(pcmd_page)
> 
> 	...
> 	lock_page(pcmd_page);
> 	// check for truncation since sgx_encl_get_backing()
> 	if (pcmd_page->mapping != shmem)
> 		goto retry;
> 	 // double check this is OK under lock_page():
> 	set_page_dirty(pcmd_page);
> 	__sgx_encl_ewb();
> 	unlock_page(pcmd_page);
> 
> That's basically what filesystems do.  Get the page from the page cache,
> lock it, then make sure it is consistent.  If not, retry.
> 
> On the "free" / read in (ELDU) side:
> 
> 	// get pcmd_page ref
> 	lock_page(pcmd_page);
> 	// truncation is not a concern because that's only done
> 	// on the read-in side, here, where we hold encl->lock
> 
> 	memset();
> 	if (!memchr_inv())
> 		// clear the way for DONTNEED:
> 		ClearPageDirty(pcmd_page);
> 	unlock_page(pcmd_page);
> 	// drop pcmd_page ref
> 	...
> 	POSIX_FADV_DONTNEED
> 
> There's one downside to this: it's _possible_ that a transient
> get_page() would block POSIX_FADV_DONTNEED.  Then the zeroed page would
> stick around forever, or at least until the next ELDU operation did
> another memchr_inv().
> 
> I went looking around for some of those and could not find any that I
> *know* apply to shmem.
> 
> This doesn't feel like a great solution; it's more complicated than I
> would like.  Any other ideas?

I am not familiar with the memory management internals and found the
above a bit intimidating. Noting that you are open to other ideas I
attempted to contain the fix using existing SGX state. Note that the
reclaimer sets SGX_ENCL_PAGE_BEING_RECLAIMED at the time it obtains a
reference to the backing store pages of an enclave page it is in the
process of reclaiming. The fix only truncates the PCMD page
after ensuring that no page sharing the PCMD page is in the process of being
reclaimed.

I confirmed with this fix that PCMD pages are indeed being truncated -
but only if they are empty _and_ no page sharing the PCMD page is in
the process of being reclaimed.

With this fix the errors printed by patch 4/4 are no longer encountered
and that patch is thus no longer required.

I temporarily added the debugging line that you can find in the patch
below; it is printed while running the stress test. Here is a
snippet from the log file to give an idea how frequently this scenario
is encountered while running the stress test:

[  368.277362] sgx: pcmd_page_in_use:49 encl ffff9e76d8660000 addr 0x7f2545148000 desc 0x7f2545148008 (index 282952) being reclaimed
[  368.277400] sgx: pcmd_page_in_use:49 encl ffff9e76d8660000 addr 0x7f2545149000 desc 0x7f2545149008 (index 282953) being reclaimed
[  655.620465] sgx: pcmd_page_in_use:49 encl ffff9e774ce50000 addr 0x7fbe340f9000 desc 0x7fbe340f9008 (index 213241) being reclaimed
[  655.620520] sgx: pcmd_page_in_use:49 encl ffff9e774ce50000 addr 0x7fbe340fa000 desc 0x7fbe340fa008 (index 213242) being reclaimed
[  717.514992] sgx: pcmd_page_in_use:49 encl ffff9e780cd60000 addr 0x7f312c8cc000 desc 0x7f312c8cc008 (index 182476) being reclaimed
[  911.387007] sgx: pcmd_page_in_use:49 encl ffff9e77a6770000 addr 0x7f252fa2d000 desc 0x7f252fa2d008 (index 195117) being reclaimed
[  946.991400] sgx: pcmd_page_in_use:49 encl ffff9e76dd1a0000 addr 0x7fc210784000 desc 0x7fc210784008 (index 67460) being reclaimed
[  946.991479] sgx: pcmd_page_in_use:49 encl ffff9e76dd1a0000 addr 0x7fc210788000 desc 0x7fc210788008 (index 67464) being reclaimed
[ 1157.210449] sgx: pcmd_page_in_use:49 encl ffff9e76dc550000 addr 0x7f273158c000 desc 0x7f273158c008 (index 202124) being reclaimed
[ 1158.478397] sgx: pcmd_page_in_use:49 encl ffff9e76dc550000 addr 0x7f2743cac000 desc 0x7f2743cac008 (index 277676) being reclaimed
[ 1158.478451] sgx: pcmd_page_in_use:49 encl ffff9e76dc550000 addr 0x7f2743cad000 desc 0x7f2743cad008 (index 277677) being reclaimed
[ 1158.478467] sgx: pcmd_page_in_use:49 encl ffff9e76dc550000 addr 0x7f2743cae000 desc 0x7f2743cae008 (index 277678) being reclaimed
[ 1615.425623] sgx: pcmd_page_in_use:49 encl ffff9e774cee0000 addr 0x7efe1692c000 desc 0x7efe1692c008 (index 92460) being reclaimed
[ 1677.797446] sgx: pcmd_page_in_use:49 encl ffff9e77efc30000 addr 0x7feb3637f000 desc 0x7feb3637f008 (index 222079) being reclaimed
[ 1802.242423] sgx: pcmd_page_in_use:49 encl ffff9e76de190000 addr 0x7f5344fbf000 desc 0x7f5344fbf008 (index 282559) being reclaimed
[ 1887.397432] sgx: pcmd_page_in_use:49 encl ffff9e76d7960000 addr 0x7f0870ba8000 desc 0x7f0870ba8008 (index 461736) being reclaimed
[ 1994.519753] sgx: pcmd_page_in_use:49 encl ffff9e76d98d0000 addr 0x7f0531b43000 desc 0x7f0531b43008 (index 203587) being reclaimed
[ 2178.070802] sgx: pcmd_page_in_use:49 encl ffff9e775b400000 addr 0x7f1134149000 desc 0x7f1134149008 (index 213321) being reclaimed
[ 2349.632505] sgx: pcmd_page_in_use:49 encl ffff9e76ed450000 addr 0x7f944b94c000 desc 0x7f944b94c008 (index 309580) being reclaimed
[ 2349.632597] sgx: pcmd_page_in_use:49 encl ffff9e76ed450000 addr 0x7f944b94d000 desc 0x7f944b94d008 (index 309581) being reclaimed
[ 2349.632613] sgx: pcmd_page_in_use:49 encl ffff9e76ed450000 addr 0x7f944b94e000 desc 0x7f944b94e008 (index 309582) being reclaimed

---8<---

diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index 3051a26179bc..53bcaa153378 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -12,6 +12,76 @@
 #include "encls.h"
 #include "sgx.h"
 
+/**
+ * pcmd_page_in_use() - Query if any enclave page associated with a PCMD
+ *                      page is in the process of being reclaimed.
+ * @encl:        Enclave to which PCMD page belongs
+ * @start_addr:  Address of enclave page using first entry within the PCMD page
+ *
+ * When an enclave page is reclaimed some Paging Crypto MetaData (PCMD) is
+ * stored. The PCMD data of a reclaimed enclave page contains enough
+ * information for the processor to verify the page at the time
+ * it is loaded back into the Enclave Page Cache (EPC).
+ *
+ * The backing storage to which enclave pages are reclaimed is laid out as
+ * follows:
+ * All enclave pages:SECS page:PCMD pages
+ *
+ * Each PCMD page contains the PCMD metadata of
+ * PAGE_SIZE/sizeof(struct sgx_pcmd) enclave pages.
+ *
+ * A PCMD page can only be truncated if it is (a) empty, and (b) no enclave
+ * page sharing the PCMD page is in the process of being reclaimed.
+ *
+ * The reclaimer sets the SGX_ENCL_PAGE_BEING_RECLAIMED flag when it
+ * intends to reclaim that enclave page - it means that the PCMD data
+ * associated with that enclave page is about to get some data and thus
+ * even if the PCMD page is empty, it should not be truncated.
+ *
+ * Return: 1 if the PCMD page is in use, 0 if it is not in use, -EFAULT on error
+ */
+static int pcmd_page_in_use(struct sgx_encl *encl,
+			    unsigned long start_addr)
+{
+	struct sgx_encl_page *entry;
+	unsigned long addr;
+	int reclaimed = 0;
+	int i;
+
+	for (i = 0; i < PAGE_SIZE/sizeof(struct sgx_pcmd); i++) {
+		addr = start_addr + i * PAGE_SIZE;
+
+		/*
+		 * Stop when reaching the SECS page - it does not
+		 * have a page_array entry and its reclaim is
+		 * started and completed with enclave mutex held so
+		 * it does not use the SGX_ENCL_PAGE_BEING_RECLAIMED
+		 * flag.
+		 */
+		if (addr == encl->base + encl->size)
+			break;
+
+		entry = xa_load(&encl->page_array, PFN_DOWN(addr));
+		if (!entry)
+			return -EFAULT;
+
+		/*
+		 * VA page slot ID uses same bit as the flag so it is important
+		 * to ensure that the page is not already in backing store.
+		 */
+		if (entry->epc_page &&
+		    (entry->desc & SGX_ENCL_PAGE_BEING_RECLAIMED)) {
+			pr_debug("%s:%d encl %px addr 0x%lx desc 0x%lx (index %lu) being reclaimed\n",
+				 __func__, __LINE__, encl, addr, entry->desc,
+				 PFN_DOWN(entry->desc - entry->encl->base));
+			reclaimed = 1;
+			break;
+		}
+	}
+
+	return reclaimed;
+}
+
 /*
  * Calculate byte offset of a PCMD struct associated with an enclave page. PCMD's
  * follow right after the EPC data in the backing storage. In addition to the
@@ -47,6 +117,7 @@ static int __sgx_encl_eldu(struct sgx_encl_page *encl_page,
 	unsigned long va_offset = encl_page->desc & SGX_ENCL_PAGE_VA_OFFSET_MASK;
 	struct sgx_encl *encl = encl_page->encl;
 	pgoff_t page_index, page_pcmd_off;
+	unsigned long pcmd_first_page;
 	struct sgx_pageinfo pginfo;
 	struct sgx_backing b;
 	bool pcmd_page_empty;
@@ -58,6 +129,12 @@ static int __sgx_encl_eldu(struct sgx_encl_page *encl_page,
 	else
 		page_index = PFN_DOWN(encl->size);
 
+	/*
+	 * Address of enclave page using the first entry within the
+	 * PCMD page.
+	 */
+	pcmd_first_page = PFN_PHYS(page_index & ~0x1FUL) + encl->base;
+
 	page_pcmd_off = sgx_encl_get_backing_page_pcmd_offset(encl, page_index);
 
 	ret = sgx_encl_lookup_backing(encl, page_index, &b);
@@ -103,7 +180,7 @@ static int __sgx_encl_eldu(struct sgx_encl_page *encl_page,
 
 	sgx_encl_truncate_backing_page(encl, page_index);
 
-	if (pcmd_page_empty) {
+	if (pcmd_page_empty && !pcmd_page_in_use(encl, pcmd_first_page)) {
 		ida_free(&encl->pcmd_in_backing, PFN_DOWN(page_pcmd_off));
 		sgx_encl_truncate_backing_page(encl, PFN_DOWN(page_pcmd_off));
 	}
---8<---

What do you think?

Reinette
Dave Hansen May 4, 2022, 10:58 p.m. UTC | #13
On 5/4/22 15:13, Reinette Chatre wrote:
> +		/*
> +		 * VA page slot ID uses same bit as the flag so it is important
> +		 * to ensure that the page is not already in backing store.
> +		 */
> +		if (entry->epc_page &&
> +		    (entry->desc & SGX_ENCL_PAGE_BEING_RECLAIMED)) {
> +			pr_debug("%s:%d encl %px addr 0x%lx desc 0x%lx (index %lu) being reclaimed\n",
> +				 __func__, __LINE__, encl, addr, entry->desc,
> +				 PFN_DOWN(entry->desc - entry->encl->base));
> +			reclaimed = 1;
> +			break;
> +		}

I don't think this works as-is.  As it stands,
SGX_ENCL_PAGE_BEING_RECLAIMED is cleared *BEFORE* the EWB.  It's
possible for that check above to race with SGX_ENCL_PAGE_BEING_RECLAIMED
being cleared.  That would erroneously lead pcmd_page_in_use() to return
false *just* before data is written to the PCMD page.

static void sgx_encl_ewb()
{
        encl_page->desc &= ~SGX_ENCL_PAGE_BEING_RECLAIMED;
	...
	__sgx_encl_ewb()
}

You might be able to move the ~SGX_ENCL_PAGE_BEING_RECLAIMED until after
the __sgx_encl_ewb().  But, at that point, the pcmd_page_empty is a bit
useless because it's stale.  I guess it might act as a performance
optimization because it allows avoiding the expensive pcmd_page_in_use()
check.

Also, one nit:

> +	/*
> +	 * Address of enclave page using the first entry within the
> +	 * PCMD page.
> +	 */
> +	pcmd_first_page = PFN_PHYS(page_index & ~0x1FUL) + encl->base;

That ~0x1FUL is pretty magic.  I assume that's 5 bits because there are
32 PCMDs per page.  This could probably be something resembling this
instead:

#define PCMDS_PER_PAGE (PAGE_SIZE/sizeof(struct sgx_pcmd))
#define PCMD_FIRST_MASK GENMASK(PCMDS_PER_PAGE, 0)
Dave Hansen May 4, 2022, 11:05 p.m. UTC | #14
On 5/4/22 15:13, Reinette Chatre wrote:
> -	if (pcmd_page_empty) {
> +	if (pcmd_page_empty && !pcmd_page_in_use(encl, pcmd_first_page)) {
>  		ida_free(&encl->pcmd_in_backing, PFN_DOWN(page_pcmd_off));
>  		sgx_encl_truncate_backing_page(encl, PFN_DOWN(page_pcmd_off));
>  	}

One other thing.  The role of encl->lock here is very important.
Without it, two concurrent page faults could do their individual
memset(), each see !pcmd_page_empty, then decline to truncate the page.

Also, given the challenges here, I do think we should check the
pcmd_page after truncate to ensure it is still all zero's.
Reinette Chatre May 4, 2022, 11:36 p.m. UTC | #15
Hi Dave,

On 5/4/2022 3:58 PM, Dave Hansen wrote:
> On 5/4/22 15:13, Reinette Chatre wrote:
>> +		/*
>> +		 * VA page slot ID uses same bit as the flag so it is important
>> +		 * to ensure that the page is not already in backing store.
>> +		 */
>> +		if (entry->epc_page &&
>> +		    (entry->desc & SGX_ENCL_PAGE_BEING_RECLAIMED)) {
>> +			pr_debug("%s:%d encl %px addr 0x%lx desc 0x%lx (index %lu) being reclaimed\n",
>> +				 __func__, __LINE__, encl, addr, entry->desc,
>> +				 PFN_DOWN(entry->desc - entry->encl->base));
>> +			reclaimed = 1;
>> +			break;
>> +		}
> 
> I don't think this works as-is.  As it stands,
> SGX_ENCL_PAGE_BEING_RECLAIMED is cleared *BEFORE* the EWB.  It's
> possible for that check above to race with SGX_ENCL_PAGE_BEING_RECLAIMED
> being cleared.  That would erroneously lead pcmd_page_in_use() to return
> false *just* before data is written to the PCMD page.
> 
> static void sgx_encl_ewb()
> {
>         encl_page->desc &= ~SGX_ENCL_PAGE_BEING_RECLAIMED;
> 	...
> 	__sgx_encl_ewb()
> }

You are correct that SGX_ENCL_PAGE_BEING_RECLAIMED is cleared before
EWB. Please do note though that clearing that bit and running EWB
are all done in one section with the enclave mutex held the entire time.

The page fault handler cannot check for SGX_ENCL_PAGE_BEING_RECLAIMED
until the reclaimer releases the mutex.

These checks are thus not able to race. Here is a higher level view:


    sgx_reclaim_pages() {

      ...

      /*
       * Reclaim ENCLAVE_A
       */
      mutex_lock(&encl->lock);
      /*
       * Get a reference to ENCLAVE_A's
       * shmem page where enclave page
       * encrypted data will be stored
       * as well as a reference to the
       * enclave page's PCMD data page,
       * PCMD_AB.
       * Release mutex before writing
       * any data to the shmem pages.
       */
      sgx_encl_get_backing(...);
      encl_page->desc |= SGX_ENCL_PAGE_BEING_RECLAIMED;
      mutex_unlock(&encl->lock);

                                        /*
                                         * Fault ENCLAVE_B
                                         */

                                        sgx_vma_fault() {

                                          mutex_lock(&encl->lock);
                                          /*
                                           * Get reference to
                                           * ENCLAVE_B's shmem page
                                           * as well as PCMD_AB.
                                           */
                                          sgx_encl_get_backing(...)
                                         /*
                                          * Load page back into
                                          * enclave via ELDU.
                                          */
                                         /*
                                          * Release reference to
                                          * ENCLAVE_B' shmem page and
                                          * PCMD_AB.
                                          */
                                         sgx_encl_put_backing(...);
                                         /*
                                          * PCMD_AB is found empty so
                                          * it and ENCLAVE_B's shmem page
                                          * are truncated.
                                          */
                                         /* Truncate ENCLAVE_B backing page */
                                         sgx_encl_truncate_backing_page();

                                          /*
                                           * pcmd_page_empty indicates that
                                           * PCMD_AB is empty
                                           */
                                          pcmd_page_in_use() {
                                           /*
                                            * Check SGX_ENCL_PAGE_BEING_RECLAIMED
                                            * of pages sharing PCMD page.
                                            * Will find that ENCLAVE_A's 
                                            * SGX_ENCL_PAGE_BEING_RECLAIMED is
                                            * set and PCMD_AB page will NOT
                                            * be truncated.
                                            */
                                          }
                                         /* sgx_encl_truncate_backing_page(); => not run */

                                         mutex_unlock(&encl->lock);

                                         ...
                                         }
      mutex_lock(&encl->lock);
      encl_page->desc &=
           ~SGX_ENCL_PAGE_BEING_RECLAIMED;
      /*
      * Write encrypted contents of
      * ENCLAVE_A to ENCLAVE_A shmem
      * page and its PCMD data to
      * PCMD_AB.
      */
      sgx_encl_put_backing(...)

      /*
       * Reference to PCMD_AB is
       * dropped.
       */
      mutex_unlock(&encl->lock);
    }



> 
> You might be able to move the ~SGX_ENCL_PAGE_BEING_RECLAIMED until after
> the __sgx_encl_ewb().

This will not change much since all is done with mutex held. At the current
location that bit has served its purpose and the code that follows will
replace it with the VA slot information.

>  But, at that point, the pcmd_page_empty is a bit
> useless because it's stale. 

I cannot see how pcmd_page_empty would be stale - could you please elaborate
on the flow where you see this happening?

> I guess it might act as a performance
> optimization because it allows avoiding the expensive pcmd_page_in_use()
> check.
> 
> Also, one nit:
> 
>> +	/*
>> +	 * Address of enclave page using the first entry within the
>> +	 * PCMD page.
>> +	 */
>> +	pcmd_first_page = PFN_PHYS(page_index & ~0x1FUL) + encl->base;
> 
> That ~0x1FUL is pretty magic.  I assume that's 5 bits because there are
> 32 PCMDs per page.  This could probably be something resembling this
> instead:
> 
> #define PCMDS_PER_PAGE (PAGE_SIZE/sizeof(struct sgx_pcmd))
> #define PCMD_FIRST_MASK GENMASK(PCMDS_PER_PAGE, 0)
> 

Thank you, yes, that needs to improve.

> One other thing.  The role of encl->lock here is very important.
> Without it, two concurrent page faults could do their individual
> memset(), each see !pcmd_page_empty, then decline to truncate the page.

Your argument is not clear to me. Yes, this would be an issue (and
that will not be the only issue) if the enclave mutex is not held
by the page fault handler ... but the enclave mutex _is_ held by the
page fault handler.
 
> Also, given the challenges here, I do think we should check the
> pcmd_page after truncate to ensure it is still all zero's.

Could you please elaborate on the challenges? I do not think adding a
check would help the issue at hand since this is done with the mutex
held while the reclaimer has a reference to the page but no mutex ...
as long as the page fault handler holds that mutex it can write and
check the PCMD page all it wants; new changes to the PCMD page would
occur when it (the page fault handler) releases the mutex and the
reclaimer attempts to use the page it has a reference to.

Reinette
Dave Hansen May 4, 2022, 11:50 p.m. UTC | #16
On 5/4/22 16:36, Reinette Chatre wrote:
> The page fault handler cannot check for SGX_ENCL_PAGE_BEING_RECLAIMED
> until the reclaimer releases the mutex.

Ahh, got it now.  I thought the mutex wasn't held on the reclaimer side.
 Thanks for the refresher.

>> One other thing.  The role of encl->lock here is very important.
>> Without it, two concurrent page faults could do their individual
>> memset(), each see !pcmd_page_empty, then decline to truncate the page.
> 
> Your argument is not clear to me. Yes, this would be an issue (and
> that will not be the only issue) if the enclave mutex is not held
> by the page fault handler ... but the enclave mutex _is_ held by the
> page fault handler.

It would just be nice to clearly state the locking logic in the eventual
changelog.

>> Also, given the challenges here, I do think we should check the
>> pcmd_page after truncate to ensure it is still all zero's.
> 
> Could you please elaborate the challenges? I do not think it would
> help the issue at hand at all to add a check since this is done with
> the mutex held while the reclaimer has a reference to the page but no
> mutex ... as long as the page fault handler has that mutex it can write
> and check the PCMD page all it want, the new changes would occur to the
> PCMD page when it (the page fault handler) releases the mutex and the
> reclaimer attempts to use the page it has a reference to.

I'm just asking for a debugging check that looks at the page *after* it
has been truncated to ensure it is zero, in case another bug creeps in here.
Reinette Chatre May 5, 2022, 12:08 a.m. UTC | #17
Hi Dave,

On 5/4/2022 4:50 PM, Dave Hansen wrote:
> On 5/4/22 16:36, Reinette Chatre wrote:
>> The page fault handler cannot check for SGX_ENCL_PAGE_BEING_RECLAIMED
>> until the reclaimer releases the mutex.
> 
> Ahh, got it now.  I thought the mutex wasn't held on the reclaimer side.
>  Thanks for the refresher.
> 
>>> One other thing.  The role of encl->lock here is very important.
>>> Without it, two concurrent page faults could do their individual
>>> memset(), each see !pcmd_page_empty, then decline to truncate the page.
>>
>> Your argument is not clear to me. Yes, this would be an issue (and
>> that will not be the only issue) if the enclave mutex is not held
>> by the page fault handler ... but the enclave mutex _is_ held by the
>> page fault handler.
> 
> It would just be nice to clearly state the locking logic in the eventual
> changelog.

Will do. Below is the changelog I am currently working on:

x86/sgx: Fix race between reclaimer and page fault handler

Haitao reported encountering a WARN triggered by the ENCLS[ELDU]
instruction faulting with a #GP.

The WARN is encountered when the reclaimer evicts a range of
pages from the enclave when the same pages are faulted back right away.

Consider two enclave pages (ENCLAVE_A and ENCLAVE_B)
sharing a PCMD page (PCMD_AB). ENCLAVE_A is in the
enclave memory and ENCLAVE_B is in the backing store. PCMD_AB contains
just one entry, that of ENCLAVE_B.

Scenario proceeds where ENCLAVE_A is being evicted from the enclave
while ENCLAVE_B is faulted in.

sgx_reclaim_pages() {

  ...

  /*
   * Reclaim ENCLAVE_A
   */
  mutex_lock(&encl->lock);
  /*
   * Get a reference to ENCLAVE_A's
   * shmem page where enclave page
   * encrypted data will be stored
   * as well as a reference to the
   * enclave page's PCMD data page,
   * PCMD_AB.
   * Release mutex before writing
   * any data to the shmem pages.
   */
  sgx_encl_get_backing(...);
  encl_page->desc |= SGX_ENCL_PAGE_BEING_RECLAIMED;
  mutex_unlock(&encl->lock);

                                    /*
                                     * Fault ENCLAVE_B
                                     */

                                    sgx_vma_fault() {

                                      mutex_lock(&encl->lock);
                                      /*
                                       * Get reference to
                                       * ENCLAVE_B's shmem page
                                       * as well as PCMD_AB.
                                       */
                                      sgx_encl_get_backing(...)
                                     /*
                                      * Load page back into
                                      * enclave via ELDU.
                                      */
                                     /*
                                      * Release reference to
                                      * ENCLAVE_B' shmem page and
                                      * PCMD_AB.
                                      */
                                     sgx_encl_put_backing(...);
                                     /*
                                      * PCMD_AB is found empty so
                                      * it and ENCLAVE_B's shmem page
                                      * are truncated.
                                      */
                                     /* Truncate ENCLAVE_B backing page */
                                     sgx_encl_truncate_backing_page();
                                     /* Truncate PCMD_AB */
                                     sgx_encl_truncate_backing_page();

                                     mutex_unlock(&encl->lock);

                                     ...
                                     }
  mutex_lock(&encl->lock);
  encl_page->desc &=
       ~SGX_ENCL_PAGE_BEING_RECLAIMED;
  /*
  * Write encrypted contents of
  * ENCLAVE_A to ENCLAVE_A shmem
  * page and its PCMD data to
  * PCMD_AB.
  */
  sgx_encl_put_backing(...)

  /*
   * Reference to PCMD_AB is
   * dropped and it is truncated.
   * ENCLAVE_A's PCMD data is lost.
   */
  mutex_unlock(&encl->lock);
}

What happens next depends on whether it is ENCLAVE_A being faulted
in or ENCLAVE_B being evicted - but both end up with ENCLS[ELDU] faulting
with a #GP.

If ENCLAVE_A is faulted first then, at the time sgx_encl_get_backing() is
called, a new PCMD page is allocated and the empty PCMD data provided
for ENCLAVE_A causes ENCLS[ELDU] to #GP.

If ENCLAVE_B is evicted first then a new PCMD_AB would be allocated by the
reclaimer, but later, when ENCLAVE_A is faulted, the ENCLS[ELDU]
instruction would #GP during its checks of the PCMD value and the WARN
would be encountered.

Noting that the reclaimer sets SGX_ENCL_PAGE_BEING_RECLAIMED at the time
it obtains a reference to the backing store pages of an enclave page it is
in the process of reclaiming, fix the race by only truncating the PCMD page
after ensuring that no page sharing the PCMD page is in the process of being
reclaimed.

Reported-by: Haitao Huang <haitao.huang@intel.com>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>

> 
>>> Also, given the challenges here, I do think we should check the
>>> pcmd_page after truncate to ensure it is still all zeros.
>>
>> Could you please elaborate on the challenges? I do not think adding a
>> check would help the issue at hand at all, since the check is done with
>> the mutex held while the reclaimer holds a reference to the page but not
>> the mutex ... as long as the page fault handler holds that mutex it can
>> write and check the PCMD page all it wants; the new changes occur to the
>> PCMD page when it (the page fault handler) releases the mutex and the
>> reclaimer uses the page it has a reference to.
> 
> I'm just asking for a debugging check that looks at the page *after* it
> has been truncated to ensure it is zero, in case another bug creeps in here.

ok, I can add your patch from
https://lore.kernel.org/linux-sgx/faaaf70d-22b7-eaa5-4623-bbdf1012827a@intel.com/
to this series.

Reinette
Jarkko Sakkinen May 5, 2022, 6:09 a.m. UTC | #18
On Thu, 2022-04-28 at 13:11 -0700, Reinette Chatre wrote:

> diff --git a/tools/testing/selftests/sgx/main.c b/tools/testing/selftests/sgx/main.c
> index d480c2dd2858..12008789325b 100644
> --- a/tools/testing/selftests/sgx/main.c
> +++ b/tools/testing/selftests/sgx/main.c
> @@ -398,7 +398,7 @@ TEST_F_TIMEOUT(enclave, unclobbered_vdso_oversubscribed_remove, 900)
>          * Create enclave with additional heap that is as big as all
>          * available physical SGX memory.
>          */
> -       total_mem = get_total_epc_mem();
> +       total_mem = get_total_epc_mem()/16;
>         ASSERT_NE(total_mem, 0);
>         TH_LOG("Creating an enclave with %lu bytes heap may take a while ...",
>                total_mem);
> 
> If the test compiled with the above snippet is renamed as "test_sgx_small"
> and the original renamed as "test_sgx_large" the two concurrent loops are
> run as follows:
> 
> (for i in $(seq 1 999); do echo "Iteration $i"; sudo ./test_sgx_large; done ) > log.large 2>&1
> (for i in $(seq 1 9999); do echo "Iteration $i"; sudo ./test_sgx_small; done ) > log.small 2>&1

Setting up a VM image to see if I can reproduce this locally.

I'll reduce the iterations to 64 and 640 to scale the test to a 256 MB EPC size.

BR, Jarkko
Jarkko Sakkinen May 7, 2022, 5:46 p.m. UTC | #19
On Mon, May 02, 2022 at 02:33:06PM -0700, Dave Hansen wrote:
> On 5/2/22 10:11, Reinette Chatre wrote:
> > My goal was to prevent the page fault handler from doing a "is the PCMD
> > page empty" if the reclaimer has a reference to it. Even if the PCMD page
> > is empty when the page fault handler checks it the page is expected to
> > get data right when reclaimer can get the mutex back since the reclaimer
> > already has a reference to the page.
> 
> I think shmem_truncate_range() might be the wrong operation.  It
> destroys data and, in the end, we don't want to destroy data.
> Filesystems and the page cache already have nice ways to keep from
> destroying data, we just need to use them.
> 
> First, I think set_page_dirty(pcmd_page) before the EWB is a good start.
>  That's what filesystems do before important data that needs to be saved
> goes into pages.
> 
> Second, I think we need behavior like POSIX_FADV_DONTNEED, not
> FALLOC_FL_PUNCH_HOLE.  The DONTNEED operations end up in
> mapping_evict_folio(), which has both page refcount *and* dirty page
> checks.  That means that either elevating a refcount or set_page_dirty()
> will thwart DONTNEED-like behavior.
> 
> There are two basic things we need to do:
> 
> 1. Prevent page from being truncated around EWB time
> 2. Prevent unreferenced page with all zeros from staying in shmem
>    forever or taking up swap space
> 
> On the EWB (write to PCMD side) I think something like this works:
> 
> 	sgx_encl_get_backing()
> 		get_page(pcmd_page)
> 
> 	...
> 	lock_page(pcmd_page);
> 	// check for truncation since sgx_encl_get_backing()
> 	if (pcmd_page->mapping != shmem)
> 		goto retry;
> 	 // double check this is OK under lock_page():
> 	set_page_dirty(pcmd_page);
> 	__sgx_encl_ewb();
> 	unlock_page(pcmd_page);
> 
> That's basically what filesystems do.  Get the page from the page cache,
> lock it, then make sure it is consistent.  If not, retry.
> 
> On the "free" / read in (ELDU) side:
> 
> 	// get pcmd_page ref
> 	lock_page(pcmd_page);
> 	// truncation is not a concern because that's only done
> 	// on the read-in side, here, where we hold encl->lock
> 
> 	memset();
> 	if (!memchr_inv())
> 		// clear the way for DONTNEED:
> 		ClearPageDirty(pcmd_page);
> 	unlock_page(pcmd_page);
> 	// drop pcmd_page ref
> 	...
> 	POSIX_FADV_DONTNEED
> 
> There's one downside to this: it's _possible_ that a transient
> get_page() would block POSIX_FADV_DONTNEED.  Then the zeroed page would
> stick around forever, or at least until the next ELDU operation did
> another memchr_inv().
> 
> I went looking around for some of those and could not find any that I
> *know* apply to shmem.
> 
> This doesn't feel like a great solution; it's more complicated than I
> would like.  Any other ideas?

If we could do both truncation and swapping in one side, i.e. in ksgxd,
that would simplify this process a lot. Then the whole synchronization
problem would not exist.

E.g. perhaps the #PF handler could just zero the PCMD and collect the zeroed
page indices into a list, and ksgxd would truncate them.

BR, Jarkko
Jarkko Sakkinen May 7, 2022, 5:48 p.m. UTC | #20
On Sat, May 07, 2022 at 08:46:50PM +0300, Jarkko Sakkinen wrote:
> On Mon, May 02, 2022 at 02:33:06PM -0700, Dave Hansen wrote:
> > On 5/2/22 10:11, Reinette Chatre wrote:
> > > My goal was to prevent the page fault handler from doing a "is the PCMD
> > > page empty" if the reclaimer has a reference to it. Even if the PCMD page
> > > is empty when the page fault handler checks it the page is expected to
> > > get data right when reclaimer can get the mutex back since the reclaimer
> > > already has a reference to the page.
> > 
> > I think shmem_truncate_range() might be the wrong operation.  It
> > destroys data and, in the end, we don't want to destroy data.
> > Filesystems and the page cache already have nice ways to keep from
> > destroying data, we just need to use them.
> > 
> > First, I think set_page_dirty(pcmd_page) before the EWB is a good start.
> >  That's what filesystems do before important data that needs to be saved
> > goes into pages.
> > 
> > Second, I think we need behavior like POSIX_FADV_DONTNEED, not
> > FALLOC_FL_PUNCH_HOLE.  The DONTNEED operations end up in
> > mapping_evict_folio(), which has both page refcount *and* dirty page
> > checks.  That means that either elevating a refcount or set_page_dirty()
> > will thwart DONTNEED-like behavior.
> > 
> > There are two basic things we need to do:
> > 
> > 1. Prevent page from being truncated around EWB time
> > 2. Prevent unreferenced page with all zeros from staying in shmem
> >    forever or taking up swap space
> > 
> > On the EWB (write to PCMD side) I think something like this works:
> > 
> > 	sgx_encl_get_backing()
> > 		get_page(pcmd_page)
> > 
> > 	...
> > 	lock_page(pcmd_page);
> > 	// check for truncation since sgx_encl_get_backing()
> > 	if (pcmd_page->mapping != shmem)
> > 		goto retry;
> > 	 // double check this is OK under lock_page():
> > 	set_page_dirty(pcmd_page);
> > 	__sgx_encl_ewb();
> > 	unlock_page(pcmd_page);
> > 
> > That's basically what filesystems do.  Get the page from the page cache,
> > lock it, then make sure it is consistent.  If not, retry.
> > 
> > On the "free" / read in (ELDU) side:
> > 
> > 	// get pcmd_page ref
> > 	lock_page(pcmd_page);
> > 	// truncation is not a concern because that's only done
> > 	// on the read-in side, here, where we hold encl->lock
> > 
> > 	memset();
> > 	if (!memchr_inv())
> > 		// clear the way for DONTNEED:
> > 		ClearPageDirty(pcmd_page);
> > 	unlock_page(pcmd_page);
> > 	// drop pcmd_page ref
> > 	...
> > 	POSIX_FADV_DONTNEED
> > 
> > There's one downside to this: it's _possible_ that a transient
> > get_page() would block POSIX_FADV_DONTNEED.  Then the zeroed page would
> > stick around forever, or at least until the next ELDU operation did
> > another memchr_inv().
> > 
> > I went looking around for some of those and could not find any that I
> > *know* apply to shmem.
> > 
> > This doesn't feel like a great solution; it's more complicated than I
> > would like.  Any other ideas?
> 
> If we could do both truncation and swapping in one side, i.e. in ksgxd,
> that would simplify this process a lot. Then the whole synchronization
> problem would not exist.
> 
> E.g. perhaps #PF handler could just zero PCMD and collect zeroed pages
> indices to a list and ksgxd would truncate them.

I.e. instead of an immediate response, go for a lazy response that is taken
care of by ksgxd.

BR, Jarkko
Reinette Chatre May 9, 2022, 5:09 p.m. UTC | #21
Hi Jarkko,

On 5/7/2022 10:48 AM, Jarkko Sakkinen wrote:
> On Sat, May 07, 2022 at 08:46:50PM +0300, Jarkko Sakkinen wrote:
>> On Mon, May 02, 2022 at 02:33:06PM -0700, Dave Hansen wrote:
>>> On 5/2/22 10:11, Reinette Chatre wrote:
>>>> My goal was to prevent the page fault handler from doing a "is the PCMD
>>>> page empty" if the reclaimer has a reference to it. Even if the PCMD page
>>>> is empty when the page fault handler checks it the page is expected to
>>>> get data right when reclaimer can get the mutex back since the reclaimer
>>>> already has a reference to the page.
>>>
>>> I think shmem_truncate_range() might be the wrong operation.  It
>>> destroys data and, in the end, we don't want to destroy data.
>>> Filesystems and the page cache already have nice ways to keep from
>>> destroying data, we just need to use them.
>>>
>>> First, I think set_page_dirty(pcmd_page) before the EWB is a good start.
>>>  That's what filesystems do before important data that needs to be saved
>>> goes into pages.
>>>
>>> Second, I think we need behavior like POSIX_FADV_DONTNEED, not
>>> FALLOC_FL_PUNCH_HOLE.  The DONTNEED operations end up in
>>> mapping_evict_folio(), which has both page refcount *and* dirty page
>>> checks.  That means that either elevating a refcount or set_page_dirty()
>>> will thwart DONTNEED-like behavior.
>>>
>>> There are two basic things we need to do:
>>>
>>> 1. Prevent page from being truncated around EWB time
>>> 2. Prevent unreferenced page with all zeros from staying in shmem
>>>    forever or taking up swap space
>>>
>>> On the EWB (write to PCMD side) I think something like this works:
>>>
>>> 	sgx_encl_get_backing()
>>> 		get_page(pcmd_page)
>>>
>>> 	...
>>> 	lock_page(pcmd_page);
>>> 	// check for truncation since sgx_encl_get_backing()
>>> 	if (pcmd_page->mapping != shmem)
>>> 		goto retry;
>>> 	 // double check this is OK under lock_page():
>>> 	set_page_dirty(pcmd_page);
>>> 	__sgx_encl_ewb();
>>> 	unlock_page(pcmd_page);
>>>
>>> That's basically what filesystems do.  Get the page from the page cache,
>>> lock it, then make sure it is consistent.  If not, retry.
>>>
>>> On the "free" / read in (ELDU) side:
>>>
>>> 	// get pcmd_page ref
>>> 	lock_page(pcmd_page);
>>> 	// truncation is not a concern because that's only done
>>> 	// on the read-in side, here, where we hold encl->lock
>>>
>>> 	memset();
>>> 	if (!memchr_inv())
>>> 		// clear the way for DONTNEED:
>>> 		ClearPageDirty(pcmd_page);
>>> 	unlock_page(pcmd_page);
>>> 	// drop pcmd_page ref
>>> 	...
>>> 	POSIX_FADV_DONTNEED
>>>
>>> There's one downside to this: it's _possible_ that a transient
>>> get_page() would block POSIX_FADV_DONTNEED.  Then the zeroed page would
>>> stick around forever, or at least until the next ELDU operation did
>>> another memchr_inv().
>>>
>>> I went looking around for some of those and could not find any that I
>>> *know* apply to shmem.
>>>
>>> This doesn't feel like a great solution; it's more complicated than I
>>> would like.  Any other ideas?
>>
>> If we could do both truncation and swapping in one side, i.e. in ksgxd,
>> that would simplify this process a lot. Then the whole synchronization
>> problem would not exist.
>>
>> E.g. perhaps #PF handler could just zero PCMD and collect zeroed pages
>> indices to a list and ksgxd would truncate them.
> 
> I.e. instead of immediate response, go for lazy response that is taken 
> care by ksgxd.

Could you please elaborate on how you envision this solution? From what I understand
there would be a per-enclave list that contains information about empty
PCMD pages intended to be truncated. The page fault handler adds pages
to this list and the reclaimer needs to remove pages from this list when
it writes to those pages and then do the actual truncation - but it is not
clear how the reclaimer will know when it can safely remove a page from the
list since it obtains PCMD page references in batches.

Did you get a chance to consider the fix proposed in
https://lore.kernel.org/linux-sgx/d4b52482-2dd0-d5f1-bda9-e1d97883298d@intel.com/ 

I understand that the email thread may have become hard to follow and I plan
to submit a new series today.

Reinette
Jarkko Sakkinen May 10, 2022, 10:28 p.m. UTC | #22
On Mon, May 09, 2022 at 10:09:24AM -0700, Reinette Chatre wrote:
> Hi Jarkko,
> 
> On 5/7/2022 10:48 AM, Jarkko Sakkinen wrote:
> > On Sat, May 07, 2022 at 08:46:50PM +0300, Jarkko Sakkinen wrote:
> >> On Mon, May 02, 2022 at 02:33:06PM -0700, Dave Hansen wrote:
> >>> On 5/2/22 10:11, Reinette Chatre wrote:
> >>>> My goal was to prevent the page fault handler from doing a "is the PCMD
> >>>> page empty" if the reclaimer has a reference to it. Even if the PCMD page
> >>>> is empty when the page fault handler checks it the page is expected to
> >>>> get data right when reclaimer can get the mutex back since the reclaimer
> >>>> already has a reference to the page.
> >>>
> >>> I think shmem_truncate_range() might be the wrong operation.  It
> >>> destroys data and, in the end, we don't want to destroy data.
> >>> Filesystems and the page cache already have nice ways to keep from
> >>> destroying data, we just need to use them.
> >>>
> >>> First, I think set_page_dirty(pcmd_page) before the EWB is a good start.
> >>>  That's what filesystems do before important data that needs to be saved
> >>> goes into pages.
> >>>
> >>> Second, I think we need behavior like POSIX_FADV_DONTNEED, not
> >>> FALLOC_FL_PUNCH_HOLE.  The DONTNEED operations end up in
> >>> mapping_evict_folio(), which has both page refcount *and* dirty page
> >>> checks.  That means that either elevating a refcount or set_page_dirty()
> >>> will thwart DONTNEED-like behavior.
> >>>
> >>> There are two basic things we need to do:
> >>>
> >>> 1. Prevent page from being truncated around EWB time
> >>> 2. Prevent unreferenced page with all zeros from staying in shmem
> >>>    forever or taking up swap space
> >>>
> >>> On the EWB (write to PCMD side) I think something like this works:
> >>>
> >>> 	sgx_encl_get_backing()
> >>> 		get_page(pcmd_page)
> >>>
> >>> 	...
> >>> 	lock_page(pcmd_page);
> >>> 	// check for truncation since sgx_encl_get_backing()
> >>> 	if (pcmd_page->mapping != shmem)
> >>> 		goto retry;
> >>> 	 // double check this is OK under lock_page():
> >>> 	set_page_dirty(pcmd_page);
> >>> 	__sgx_encl_ewb();
> >>> 	unlock_page(pcmd_page);
> >>>
> >>> That's basically what filesystems do.  Get the page from the page cache,
> >>> lock it, then make sure it is consistent.  If not, retry.
> >>>
> >>> On the "free" / read in (ELDU) side:
> >>>
> >>> 	// get pcmd_page ref
> >>> 	lock_page(pcmd_page);
> >>> 	// truncation is not a concern because that's only done
> >>> 	// on the read-in side, here, where we hold encl->lock
> >>>
> >>> 	memset();
> >>> 	if (!memchr_inv())
> >>> 		// clear the way for DONTNEED:
> >>> 		ClearPageDirty(pcmd_page);
> >>> 	unlock_page(pcmd_page);
> >>> 	// drop pcmd_page ref
> >>> 	...
> >>> 	POSIX_FADV_DONTNEED
> >>>
> >>> There's one downside to this: it's _possible_ that a transient
> >>> get_page() would block POSIX_FADV_DONTNEED.  Then the zeroed page would
> >>> stick around forever, or at least until the next ELDU operation did
> >>> another memchr_inv().
> >>>
> >>> I went looking around for some of those and could not find any that I
> >>> *know* apply to shmem.
> >>>
> >>> This doesn't feel like a great solution; it's more complicated than I
> >>> would like.  Any other ideas?
> >>
> >> If we could do both truncation and swapping in one side, i.e. in ksgxd,
> >> that would simplify this process a lot. Then the whole synchronization
> >> problem would not exist.
> >>
> >> E.g. perhaps #PF handler could just zero PCMD and collect zeroed pages
> >> indices to a list and ksgxd would truncate them.
> > 
> > I.e. instead of immediate response, go for lazy response that is taken 
> > care by ksgxd.
> 
> Could you please elaborate on how you envision this solution? From what I understand
> there would be a per-enclave list that contains information about empty
> PCMD pages intended to be truncated. The page fault handler adds pages
> to this list and the reclaimer needs to remove pages from this list when
> it writes to those pages and then do the actual truncation - but it is not
> clear how the reclaimer will know when it can safely remove a page from the
> list since it obtains PCMD page references in batches.
> 
> Did you get a chance to consider the fix proposed in
> https://lore.kernel.org/linux-sgx/d4b52482-2dd0-d5f1-bda9-e1d97883298d@intel.com/ 
> 
> I understand that the email thread may have become hard to follow and I plan
> to submit a new series today.

Let's just say that I came a bit late to the series, and should have read
the whole thread before responding to anything. As long as the enclave lock
is held on both sides things should be fine.

I think for bugs like these it would make sense to put them out as early as
possible, e.g. I would be fine getting them from the kernel bugzilla. Now
there was a two-week latency between finding the issue and making it public.
Unless there is something confidential, it would be best to get an early
alert. I'm always ready to change my priorities to help fix such issues.

BR, Jarkko
Reinette Chatre May 11, 2022, 5:23 p.m. UTC | #23
Hi Jarkko,

On 5/10/2022 3:28 PM, Jarkko Sakkinen wrote:
> Let's just say that I came a bit late to the series, and should have read
> the whole thread before responding to anything. As long as enclave lock is
> kept on both sides things should be fine.

For the most part, yes. The remaining scenario is the case where the reclaimer
releases the enclave mutex while keeping a reference to the backing store pages.
Releasing the enclave mutex creates an opportunity for the page fault handler
to run and also operate on the backing store. Both the reclaimer (after patch
3/4 in this series) and the page fault handler operate on the backing store
with the enclave mutex held, but if that is done without taking the backing
store references into account, data could be lost. This is addressed in the
following series with:
https://lore.kernel.org/linux-sgx/d0ace4a1770ab8f4196bfeae06d0970ddb14ef01.1652131695.git.reinette.chatre@intel.com/


> 
> I think for bugs like these it would make sense to put them out early as
> possible, e.g. I would be fine getting them from the kernel bugzilla. Now
> there was a two-week latency between finding the issue and making it public.
> Unless there is something confidential, it would be best to get early
> alert. I'm always ready to change my priorities to help to fix such issues.

I am sorry about this. The reason I first struggled with this by myself was
that it was made out to be an SGX2 issue. This was made worse when I was not
able to create an SGX1 test case that could trigger the issue. I thus lacked
evidence that it was an upstream issue, and it took me a while to debug and
understand it well enough to gain confidence that it is indeed an upstream
issue.

Reinette
Jarkko Sakkinen May 12, 2022, 2:10 p.m. UTC | #24
On Wed, May 11, 2022 at 10:23:29AM -0700, Reinette Chatre wrote:
> Hi Jarkko,
> 
> On 5/10/2022 3:28 PM, Jarkko Sakkinen wrote:
> > Let's just say that I came a bit late to the series, and should have read
> > the whole thread before responding to anything. As long as enclave lock is
> > kept on both sides things should be fine.
> 
> For the most part, yes. The remaining scenario is the case when the reclaimer
> releases the enclave mutex while keeping a reference to the backing store pages.
> By releasing the enclave mutex there is opportunity for page fault
> handler to run and also operate on the backing store. Both the reclaimer
> (after patch 3/4 in this series) and page fault handler operate on
> the backing store with enclave mutex held but if that is done without
> taking backing store references into account data could be lost. This is
> addressed in the following series with:
> https://lore.kernel.org/linux-sgx/d0ace4a1770ab8f4196bfeae06d0970ddb14ef01.1652131695.git.reinette.chatre@intel.com/

Right, I get it now. Great job figuring all this out!

> > 
> > I think for bugs like these it would make sense to put them out early as
> > possible, e.g. I would be fine getting them from the kernel bugzilla. Now
> > there was a two-week latency between finding the issue and making it public.
> > Unless there is something confidential, it would be best to get early
> > alert. I'm always ready to change my priorities to help to fix such issues.
> 
> I am sorry about this. The reason I first struggled with this by myself was
> because it was made out to be an SGX2 issue. This was made worse when I was not
> able to create an SGX1 test case that can trigger the issue. I thus lacked
> evidence that it is an upstream issue and it took me a while to debug and
> understand the issue to gain confidence that it is indeed an upstream issue.

No worries, just making the crash log available early on gives a chance
to investigate it :-) We can then figure out whether it is an SGX2 issue or
not, or any other circumstance. There should be a low barrier to doing this.

E.g. I happened to get sick when the series was posted, and therefore
lagged on the investigation. Giving the maximum window here is super
important.

> Reinette

BR, Jarkko
diff mbox

Patch

diff --git a/tools/testing/selftests/sgx/main.c b/tools/testing/selftests/sgx/main.c
index d480c2dd2858..12008789325b 100644
--- a/tools/testing/selftests/sgx/main.c
+++ b/tools/testing/selftests/sgx/main.c
@@ -398,7 +398,7 @@  TEST_F_TIMEOUT(enclave, unclobbered_vdso_oversubscribed_remove, 900)
 	 * Create enclave with additional heap that is as big as all
 	 * available physical SGX memory.
 	 */
-	total_mem = get_total_epc_mem();
+	total_mem = get_total_epc_mem()/16;
 	ASSERT_NE(total_mem, 0);
 	TH_LOG("Creating an enclave with %lu bytes heap may take a while ...",
 	       total_mem);

If the test compiled with the above snippet is renamed as "test_sgx_small"
and the original renamed as "test_sgx_large" the two concurrent loops are
run as follows:

(for i in $(seq 1 999); do echo "Iteration $i"; sudo ./test_sgx_large; done ) > log.large 2>&1
(for i in $(seq 1 9999); do echo "Iteration $i"; sudo ./test_sgx_small; done ) > log.small 2>&1

If the SGX driver is modified to always WARN when ENCLS[ELDU] encounters a #GP
(see below) then the WARN appears after a few iterations of "test_sgx_large"
and shows up throughout the testing.

diff --git a/arch/x86/kernel/cpu/sgx/encls.h b/arch/x86/kernel/cpu/sgx/encls.h
index 99004b02e2ed..68c1dbc84ed3 100644
--- a/arch/x86/kernel/cpu/sgx/encls.h
+++ b/arch/x86/kernel/cpu/sgx/encls.h
@@ -18,7 +18,7 @@ 
 #define ENCLS_WARN(r, name) {						  \
 	do {								  \
 		int _r = (r);						  \
-		WARN_ONCE(_r, "%s returned %d (0x%x)\n", (name), _r, _r); \
+		WARN(_r, "%s returned %d (0x%x)\n", (name), _r, _r); \
 	} while (0);							  \
 }