[intel-sgx-kernel-dev,v4,8/8] intel_sgx: add LRU algorithm to page swapping
diff mbox

Message ID 20161201205632.8593-9-jarkko.sakkinen@linux.intel.com
State New
Headers show

Commit Message

Jarkko Sakkinen Dec. 1, 2016, 8:56 p.m. UTC
From: Sean Christopherson <sean.j.christopherson@intel.com>

Test and clear the A bit of an EPC page when isolating pages during EPC
page swap.  Move accessed pages to the end of the load list instead of
the eviction list, i.e. isolate only those pages that have not been
accessed since the last time the swapping flow was run.

This basic LRU algorithm yields a significant improvement in throughput
when the system is under a heavy EPC pressure.

This patch is based on the code originally written Serge Ayoun for the
out-of-tree driver with a bug fix that the EPC pages are moved always to
the end of the list if they have been lately accessed. For unknown
reason a version of out-of-tree driver slipped into the Github, which
this regression.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
---
 drivers/platform/x86/intel_sgx.h            |  2 +-
 drivers/platform/x86/intel_sgx_ioctl.c      |  1 +
 drivers/platform/x86/intel_sgx_page_cache.c | 34 ++++++++++++++++++++++++++++-
 drivers/platform/x86/intel_sgx_vma.c        |  1 +
 4 files changed, 36 insertions(+), 2 deletions(-)

Comments

Dave Hansen Dec. 1, 2016, 9:05 p.m. UTC | #1
On 12/01/2016 12:56 PM, Jarkko Sakkinen wrote:
> +int sgx_test_and_clear_young(struct sgx_encl_page *page, struct sgx_encl *encl)
> +{
> +	struct vm_area_struct *vma = sgx_find_vma(encl, page->addr);
> +	if (!vma)
> +		return 0;
> +
> +	return apply_to_page_range(vma->vm_mm, page->addr, PAGE_SIZE,
> +				   sgx_test_and_clear_young_cb, vma->vm_mm);
> +}

Don't you eventually need to flush the TLB?
Sean Christopherson Dec. 1, 2016, 10:14 p.m. UTC | #2
Dave Hansen <dave.hansen@intel.com> wrote:
> On 12/01/2016 12:56 PM, Jarkko Sakkinen wrote:
> > +int sgx_test_and_clear_young(struct sgx_encl_page *page, struct 
> > +sgx_encl *encl) {
> > +	struct vm_area_struct *vma = sgx_find_vma(encl, page->addr);
> > +	if (!vma)
> > +		return 0;
> > +
> > +	return apply_to_page_range(vma->vm_mm, page->addr, PAGE_SIZE,
> > +				   sgx_test_and_clear_young_cb,
> > vma->vm_mm); }
> 
> Don't you eventually need to flush the TLB?

Maybe?  The TLB doesn't need to be flushed on the thread modifying the PTE as an enclave's page entries can't be cached in the TLB outside of the enclave.  If all we cared about was accuracy then we'd need to send an IPI to all other threads to flush their TLBs, but from a performance perspective I don't know that sending IPIs would be an improvement.  A thread's EPC TLB entries will be flushed whenever it exits the enclave, which I assume happens often enough that missed A/D assists due to stale TLB entries wouldn't be a major issue.  

Anyways, my thought was to get the quick and easy solution in place now and then focus on tuning the eviction code.
Dave Hansen Dec. 1, 2016, 10:40 p.m. UTC | #3
On 12/01/2016 02:14 PM, Christopherson, Sean J wrote:
> Dave Hansen <dave.hansen@intel.com> wrote:
>> On 12/01/2016 12:56 PM, Jarkko Sakkinen wrote:
>>> +int sgx_test_and_clear_young(struct sgx_encl_page *page, struct 
>>> +sgx_encl *encl) {
>>> +	struct vm_area_struct *vma = sgx_find_vma(encl, page->addr);
>>> +	if (!vma)
>>> +		return 0;
>>> +
>>> +	return apply_to_page_range(vma->vm_mm, page->addr, PAGE_SIZE,
>>> +				   sgx_test_and_clear_young_cb,
>>> vma->vm_mm); }
>>
>> Don't you eventually need to flush the TLB?
> 
> Maybe?  The TLB doesn't need to be flushed on the thread modifying
> the PTE as an enclave's page entries can't be cached in the TLB
> outside of the enclave

Is that architectural?

> If all we cared about was accuracy then we'd
> need to send an IPI to all other threads to flush their TLBs, but
> from a performance perspective I don't know that sending IPIs would
> be an improvement.  A thread's EPC TLB entries will be flushed
> whenever it exits the enclave, which I assume happens often enough
> that missed A/D assists due to stale TLB entries wouldn't be a major
> issue.

OK, fair enough.  If you want to do this, let's at least document that
we're intentionally skipping the flush and what the downsides are of
doing that.
Sean Christopherson Dec. 1, 2016, 10:53 p.m. UTC | #4
Dave Hansen <dave.hansen@intel.com> wrote:
> >> Don't you eventually need to flush the TLB?
> > 
> > Maybe?  The TLB doesn't need to be flushed on the thread modifying the 
> > PTE as an enclave's page entries can't be cached in the TLB outside of 
> > the enclave
> 
> Is that architectural?

By inference, yes.  Eviction/aging is performed outside of an enclave, and TLB entries for an enclave's context are flushed on enclave exit.  Any attempt to access an EPC page outside of an enclave will get a master abort and will not be pulled into the TLB.
Jarkko Sakkinen Dec. 2, 2016, 7:09 a.m. UTC | #5
On Thu, Dec 01, 2016 at 02:40:25PM -0800, Dave Hansen wrote:
> On 12/01/2016 02:14 PM, Christopherson, Sean J wrote:
> > Dave Hansen <dave.hansen@intel.com> wrote:
> >> On 12/01/2016 12:56 PM, Jarkko Sakkinen wrote:
> >>> +int sgx_test_and_clear_young(struct sgx_encl_page *page, struct 
> >>> +sgx_encl *encl) {
> >>> +	struct vm_area_struct *vma = sgx_find_vma(encl, page->addr);
> >>> +	if (!vma)
> >>> +		return 0;
> >>> +
> >>> +	return apply_to_page_range(vma->vm_mm, page->addr, PAGE_SIZE,
> >>> +				   sgx_test_and_clear_young_cb,
> >>> vma->vm_mm); }
> >>
> >> Don't you eventually need to flush the TLB?
> > 
> > Maybe?  The TLB doesn't need to be flushed on the thread modifying
> > the PTE as an enclave's page entries can't be cached in the TLB
> > outside of the enclave
> 
> Is that architectural?
> 
> > If all we cared about was accuracy then we'd
> > need to send an IPI to all other threads to flush their TLBs, but
> > from a performance perspective I don't know that sending IPIs would
> > be an improvement.  A thread's EPC TLB entries will be flushed
> > whenever it exits the enclave, which I assume happens often enough
> > that missed A/D assists due to stale TLB entries wouldn't be a major
> > issue.
> 
> OK, fair enough.  If you want to do this, let's at least document that
> we're intentionally skipping the flush and what the downsides are of
> doing that.

All TLBs are flushed after hardware threads have exited. That is
arcitectural. The driver sends an IPI at the time of EWB's unless the HW
threads have already exited (by checking the erro code of EWB).

/Jarkko
Jarkko Sakkinen Dec. 2, 2016, 7:12 a.m. UTC | #6
On Fri, Dec 02, 2016 at 09:09:25AM +0200, Jarkko Sakkinen wrote:
> On Thu, Dec 01, 2016 at 02:40:25PM -0800, Dave Hansen wrote:
> > On 12/01/2016 02:14 PM, Christopherson, Sean J wrote:
> > > Dave Hansen <dave.hansen@intel.com> wrote:
> > >> On 12/01/2016 12:56 PM, Jarkko Sakkinen wrote:
> > >>> +int sgx_test_and_clear_young(struct sgx_encl_page *page, struct 
> > >>> +sgx_encl *encl) {
> > >>> +	struct vm_area_struct *vma = sgx_find_vma(encl, page->addr);
> > >>> +	if (!vma)
> > >>> +		return 0;
> > >>> +
> > >>> +	return apply_to_page_range(vma->vm_mm, page->addr, PAGE_SIZE,
> > >>> +				   sgx_test_and_clear_young_cb,
> > >>> vma->vm_mm); }
> > >>
> > >> Don't you eventually need to flush the TLB?
> > > 
> > > Maybe?  The TLB doesn't need to be flushed on the thread modifying
> > > the PTE as an enclave's page entries can't be cached in the TLB
> > > outside of the enclave
> > 
> > Is that architectural?
> > 
> > > If all we cared about was accuracy then we'd
> > > need to send an IPI to all other threads to flush their TLBs, but
> > > from a performance perspective I don't know that sending IPIs would
> > > be an improvement.  A thread's EPC TLB entries will be flushed
> > > whenever it exits the enclave, which I assume happens often enough
> > > that missed A/D assists due to stale TLB entries wouldn't be a major
> > > issue.
> > 
> > OK, fair enough.  If you want to do this, let's at least document that
> > we're intentionally skipping the flush and what the downsides are of
> > doing that.
> 
> All TLBs are flushed after hardware threads have exited. That is
> arcitectural. The driver sends an IPI at the time of EWB's unless the HW
> threads have already exited (by checking the erro code of EWB).

Sorry forgot to put the question. I'm not sure what should be documented
as the code flow will eventually flush enclave TLBs?

/Jarkko
Jarkko Sakkinen Dec. 2, 2016, 10:25 a.m. UTC | #7
On Thu, Dec 01, 2016 at 10:14:20PM +0000, Christopherson, Sean J wrote:
> Maybe?  The TLB doesn't need to be flushed on the thread modifying the
> PTE as an enclave's page entries can't be cached in the TLB outside of
> the enclave.  If all we cared about was accuracy then we'd need to
> send an IPI to all other threads to flush their TLBs, but from a
> performance perspective I don't know that sending IPIs would be an
> improvement.  A thread's EPC TLB entries will be flushed whenever it

How that would improve accurancy? New hardware threads can enter after
you send the IPI. It just moves "error" to the different direction. It
would be just repositioning when we send IPI.

The alternative bug fix that I proposed a while ago that sends IPI on a
failed ETRACK instead of changing the locking behavior will do such a
thing in effect.

The only somewhat "non-racy" position to test A bit would be after
EBLOCKs (as new TLBs cannot be created), which obviously does not make
sense for the code flow as you want to select the pages to be blocked
based on A bit.
 
/Jarkko
Jarkko Sakkinen Dec. 2, 2016, 10:26 a.m. UTC | #8
On Thu, Dec 01, 2016 at 10:56:32PM +0200, Jarkko Sakkinen wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
> 
> Test and clear the A bit of an EPC page when isolating pages during EPC
> page swap.  Move accessed pages to the end of the load list instead of
> the eviction list, i.e. isolate only those pages that have not been
> accessed since the last time the swapping flow was run.
> 
> This basic LRU algorithm yields a significant improvement in throughput
> when the system is under a heavy EPC pressure.
> 
> This patch is based on the code originally written Serge Ayoun for the
> out-of-tree driver with a bug fix that the EPC pages are moved always to
> the end of the list if they have been lately accessed. For unknown
> reason a version of out-of-tree driver slipped into the Github, which
> this regression.
> 
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Reviewed-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>

Tested-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>

/Jarkko
Sean Christopherson Dec. 2, 2016, 4:37 p.m. UTC | #9
Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> wrote:
> On Thu, Dec 01, 2016 at 10:14:20PM +0000, Christopherson, Sean J wrote:
> > Maybe?  The TLB doesn't need to be flushed on the thread modifying the 
> > PTE as an enclave's page entries can't be cached in the TLB outside of 
> > the enclave.  If all we cared about was accuracy then we'd need to 
> > send an IPI to all other threads to flush their TLBs, but from a 
> > performance perspective I don't know that sending IPIs would be an 
> > improvement.  A thread's EPC TLB entries will be flushed whenever it
> 
> How that would improve accurancy? New hardware threads can enter after you
> send the IPI. It just moves "error" to the different direction. It would be
> just repositioning when we send IPI.
> 
> The alternative bug fix that I proposed a while ago that sends IPI on a
> failed ETRACK instead of changing the locking behavior will do such a thing
> in effect.
> 
> The only somewhat "non-racy" position to test A bit would be after EBLOCKs
> (as new TLBs cannot be created), which obviously does not make sense for the
> code flow as you want to select the pages to be blocked based on A
> bit.

Dave's question about needing to flush the TLBs was in regards to clearing the accessed bit, the code in question has nothing to do with the actual eviction of pages.  Aging and eviction are two distinct operations, the current driver code just happens to tie the two together.  For example, if an enclave contains nothing but recently accessed pages, then we'll age (some of) its pages but not perform eviction since there are no old pages to evict.

Accuracy is referring to accurately tracking when a page has been accessed.  By clearing the accessed bit in memory but not flushing the TLBs, then an enclave thread with the associated PTE in its TLB will not trigger an A/D assist when accessing the page.  The driver will still see the page as being old (not accessed) and could incorrectly select the page for eviction in the future.  To guarantee that the LRU algorithm is 100% accurate the driver would need to flush all threads' TLBs after aging paging, regardless of whether or not a subsequent eviction action is taken.  My reasoning for not forcing a TLB flush at this time is that enclave TLB entries are flushed relatively frequently as is, i.e. accuracy should be quite high, and kicking threads out of enclaves is an expensive operation, i.e. the TLB flush is likely more costly than a false positive on an LRU check.
Jarkko Sakkinen Dec. 2, 2016, 7:48 p.m. UTC | #10
On Fri, Dec 02, 2016 at 04:37:28PM +0000, Christopherson, Sean J wrote:
> Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> wrote:
> > On Thu, Dec 01, 2016 at 10:14:20PM +0000, Christopherson, Sean J wrote:
> > > Maybe?  The TLB doesn't need to be flushed on the thread modifying the 
> > > PTE as an enclave's page entries can't be cached in the TLB outside of 
> > > the enclave.  If all we cared about was accuracy then we'd need to 
> > > send an IPI to all other threads to flush their TLBs, but from a 
> > > performance perspective I don't know that sending IPIs would be an 
> > > improvement.  A thread's EPC TLB entries will be flushed whenever it
> > 
> > How that would improve accurancy? New hardware threads can enter after you
> > send the IPI. It just moves "error" to the different direction. It would be
> > just repositioning when we send IPI.
> > 
> > The alternative bug fix that I proposed a while ago that sends IPI on a
> > failed ETRACK instead of changing the locking behavior will do such a thing
> > in effect.
> > 
> > The only somewhat "non-racy" position to test A bit would be after EBLOCKs
> > (as new TLBs cannot be created), which obviously does not make sense for the
> > code flow as you want to select the pages to be blocked based on A
> > bit.
> 
> Dave's question about needing to flush the TLBs was in regards to
> clearing the accessed bit, the code in question has nothing to do with
> the actual eviction of pages.  Aging and eviction are two distinct
> operations, the current driver code just happens to tie the two
> together.  For example, if an enclave contains nothing but recently
> accessed pages, then we'll age (some of) its pages but not perform
> eviction since there are no old pages to evict.
> 
> Accuracy is referring to accurately tracking when a page has been
> accessed.  By clearing the accessed bit in memory but not flushing the
> TLBs, then an enclave thread with the associated PTE in its TLB will
> not trigger an A/D assist when accessing the page.  The driver will
> still see the page as being old (not accessed) and could incorrectly
> select the page for eviction in the future.  To guarantee that the LRU
> algorithm is 100% accurate the driver would need to flush all threads'
> TLBs after aging paging, regardless of whether or not a subsequent
> eviction action is taken.  My reasoning for not forcing a TLB flush at
> this time is that enclave TLB entries are flushed relatively
> frequently as is, i.e. accuracy should be quite high, and kicking
> threads out of enclaves is an expensive operation, i.e. the TLB flush
> is likely more costly than a false positive on an LRU check.

Right, I see your point now.

I think it is a reasonable heuristics to assume that enclaves that
consume a lot of EPC pages are more likely to do a lot of IO operations
and other system calls i.e. enclave exits are much more frequent.

Because of this I tend to agree that false positives are more
acceptable.

/Jarkko

Patch
diff mbox

diff --git a/drivers/platform/x86/intel_sgx.h b/drivers/platform/x86/intel_sgx.h
index fc52a6f..a5e54f8 100644
--- a/drivers/platform/x86/intel_sgx.h
+++ b/drivers/platform/x86/intel_sgx.h
@@ -193,7 +193,7 @@  long sgx_compat_ioctl(struct file *filep, unsigned int cmd, unsigned long arg);
 #endif
 
 /* Utility functions */
-
+int sgx_test_and_clear_young(struct sgx_encl_page *page, struct sgx_encl *encl);
 void *sgx_get_epc_page(struct sgx_epc_page *entry);
 void sgx_put_epc_page(void *epc_page_vaddr);
 struct page *sgx_get_backing(struct sgx_encl *encl,
diff --git a/drivers/platform/x86/intel_sgx_ioctl.c b/drivers/platform/x86/intel_sgx_ioctl.c
index 3bceebd..363aa68 100644
--- a/drivers/platform/x86/intel_sgx_ioctl.c
+++ b/drivers/platform/x86/intel_sgx_ioctl.c
@@ -296,6 +296,7 @@  static bool sgx_process_add_page_req(struct sgx_add_page_req *req)
 	}
 
 	encl_page->epc_page = epc_page;
+	sgx_test_and_clear_young(encl_page, encl);
 	list_add_tail(&encl_page->load_list, &encl->load_list);
 
 	mutex_unlock(&encl->lock);
diff --git a/drivers/platform/x86/intel_sgx_page_cache.c b/drivers/platform/x86/intel_sgx_page_cache.c
index c4865d2..091d0d3 100644
--- a/drivers/platform/x86/intel_sgx_page_cache.c
+++ b/drivers/platform/x86/intel_sgx_page_cache.c
@@ -77,6 +77,37 @@  static unsigned int sgx_nr_high_pages;
 struct task_struct *ksgxswapd_tsk;
 static DECLARE_WAIT_QUEUE_HEAD(ksgxswapd_waitq);
 
+
+static int sgx_test_and_clear_young_cb(pte_t *ptep, pgtable_t token,
+				       unsigned long addr, void *data)
+{
+	int ret = pte_young(*ptep);
+	if (ret) {
+		pte_t pte = pte_mkold(*ptep);
+		set_pte_at((struct mm_struct *) data, addr, ptep, pte);
+	}
+	return ret;
+}
+
+/**
+ * sgx_test_and_clear_young() - Test and reset the accessed bit
+ * @page:	enclave EPC page to be tested for recent access
+ * @encl:	enclave which owns @page
+ *
+ * Checks the Access (A) bit from the PTE corresponding to the
+ * enclave page and clears it.  Returns 1 if the page has been
+ * recently accessed and 0 if not.
+ */
+int sgx_test_and_clear_young(struct sgx_encl_page *page, struct sgx_encl *encl)
+{
+	struct vm_area_struct *vma = sgx_find_vma(encl, page->addr);
+	if (!vma)
+		return 0;
+
+	return apply_to_page_range(vma->vm_mm, page->addr, PAGE_SIZE,
+				   sgx_test_and_clear_young_cb, vma->vm_mm);
+}
+
 static struct sgx_tgid_ctx *sgx_isolate_tgid_ctx(unsigned long nr_to_scan)
 {
 	struct sgx_tgid_ctx *ctx = NULL;
@@ -166,7 +197,8 @@  static void sgx_isolate_pages(struct sgx_encl *encl,
 					 struct sgx_encl_page,
 					 load_list);
 
-		if (!(entry->flags & SGX_ENCL_PAGE_RESERVED)) {
+		if (!sgx_test_and_clear_young(entry, encl) &&
+		    !(entry->flags & SGX_ENCL_PAGE_RESERVED)) {
 			entry->flags |= SGX_ENCL_PAGE_RESERVED;
 			list_move_tail(&entry->load_list, dst);
 		} else {
diff --git a/drivers/platform/x86/intel_sgx_vma.c b/drivers/platform/x86/intel_sgx_vma.c
index 3ea200c..1349d73 100644
--- a/drivers/platform/x86/intel_sgx_vma.c
+++ b/drivers/platform/x86/intel_sgx_vma.c
@@ -256,6 +256,7 @@  static struct sgx_encl_page *sgx_vma_do_fault(struct vm_area_struct *vma,
 	/* Do not free */
 	epc_page = NULL;
 
+	sgx_test_and_clear_young(entry, encl);
 	list_add_tail(&entry->load_list, &encl->load_list);
 out:
 	mutex_unlock(&encl->lock);