[v15,12/14] x86/sgx: Turn on per-cgroup EPC reclamation

Message ID: 20240617125321.36658-13-haitao.huang@linux.intel.com (mailing list archive)
State: New
Series: Add Cgroup support for SGX EPC memory

Commit Message

Haitao Huang June 17, 2024, 12:53 p.m. UTC
From: Kristen Carlson Accardi <kristen@linux.intel.com>

Previous patches have implemented all the infrastructure needed for
per-cgroup EPC page tracking and reclaiming. But all reclaimable EPC
pages are still tracked in the global LRU, as sgx_epc_page_lru() always
returns a reference to the global LRU.

Change sgx_epc_page_lru() to return the LRU of the cgroup in which the
given EPC page is allocated.

This makes all EPC pages tracked in per-cgroup LRUs, and the global
reclaimer (ksgxd) will no longer be able to reclaim any pages from the
global LRU. However, in cases of over-committing, i.e., when the sum of
cgroup limits is greater than the total capacity, individual cgroups may
never reclaim while the total usage can still be near the capacity.
Therefore, a global reclamation is still needed in those cases, and it
should be performed from the root cgroup.

Modify sgx_reclaim_pages_global() to reclaim from the root EPC cgroup
when the misc cgroup is enabled. Similar to sgx_cgroup_reclaim_pages(),
return the next cgroup so callers can use it as the new starting node
for the next round of reclamation if needed.

Also update sgx_can_reclaim_global() to check the emptiness of the LRUs
of all cgroups when the EPC cgroup is enabled, and otherwise only check
the global LRU.

Finally, change sgx_reclaim_direct() to check and ensure there are free
pages at the cgroup level so forward progress can be made by the caller.

With these changes, the global reclamation and per-cgroup reclamation
both work properly with all pages tracked in per-cgroup LRUs.

Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com>
Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Tested-by: Mikko Ylinen <mikko.ylinen@linux.intel.com>
Tested-by: Jarkko Sakkinen <jarkko@kernel.org>
---
V14:
- Update global reclamation to use the new sgx_cgroup_reclaim_pages() to
iterate cgroups at lower levels if the top cgroups are too busy.

V13:
- Use IS_ENABLED(CONFIG_CGROUP_MISC) in sgx_can_reclaim_global(). (Kai)

V12:
- Remove CONFIG_CGROUP_SGX_EPC; conditionally compile SGX cgroup support
for CONFIG_CGROUP_MISC. (Jarkko)

V11:
- Reword the comments for global reclamation on allocation failure
after passing cgroup charging. (Kai)
- Add stub functions to remove ifdefs in the C file. (Kai)
- Add more detailed comments to clarify each page belongs to one cgroup, or the
root. (Kai)

V10:
- Add comment to clarify each page belongs to one cgroup, or the root by
default. (Kai)
- Merge the changes that expose sgx_cgroup_* functions to this patch.
- Add changes for sgx_reclaim_direct() that was missed previously.

V7:
- Split this out from the big patch, #10 in V6. (Dave, Kai)
---
 arch/x86/kernel/cpu/sgx/epc_cgroup.c |  8 +--
 arch/x86/kernel/cpu/sgx/epc_cgroup.h | 32 +++++++++++
 arch/x86/kernel/cpu/sgx/main.c       | 80 +++++++++++++++++++++++++---
 3 files changed, 109 insertions(+), 11 deletions(-)

Comments

Huang, Kai June 20, 2024, 10:30 a.m. UTC | #1
On 18/06/2024 12:53 am, Huang, Haitao wrote:
> From: Kristen Carlson Accardi <kristen@linux.intel.com>
> 
> Previous patches have implemented all the infrastructure needed for
> per-cgroup EPC page tracking and reclaiming. But all reclaimable EPC
> pages are still tracked in the global LRU, as sgx_epc_page_lru() always
> returns a reference to the global LRU.
>
> Change sgx_epc_page_lru() to return the LRU of the cgroup in which the
> given EPC page is allocated.
>
> This makes all EPC pages tracked in per-cgroup LRUs, and the global
> reclaimer (ksgxd) will no longer be able to reclaim any pages from the
> global LRU. However, in cases of over-committing, i.e., when the sum of
> cgroup limits is greater than the total capacity, individual cgroups may
> never reclaim while the total usage can still be near the capacity.
> Therefore, a global reclamation is still needed in those cases, and it
> should be performed from the root cgroup.
>
> Modify sgx_reclaim_pages_global() to reclaim from the root EPC cgroup
> when the misc cgroup is enabled. Similar to sgx_cgroup_reclaim_pages(),
> return the next cgroup so callers can use it as the new starting node
> for the next round of reclamation if needed.
>
> Also update sgx_can_reclaim_global() to check the emptiness of the LRUs
> of all cgroups when the EPC cgroup is enabled, and otherwise only check
> the global LRU.
>
> Finally, change sgx_reclaim_direct() to check and ensure there are free
> pages at the cgroup level so forward progress can be made by the caller.

Reading above, it's not clear how the _new_ global reclaim works with
multiple LRUs.

E.g., the current global reclaim essentially treats all EPC pages equally
when scanning those pages.  From the above, I don't see how this is
achieved in the new global reclaim.

The changelog should:

1) describe how the existing global reclaim works, and then describe
how to achieve the same behaviour in the new global reclaim, which works
with multiple LRUs;

2) If there's any behaviour difference between the "existing" vs the "new"
global reclaim, the changelog should point out the difference, and explain
why such a difference is OK.

> 
> With these changes, the global reclamation and per-cgroup reclamation
> both work properly with all pages tracked in per-cgroup LRUs.
> 

[...]

>   
> -static void sgx_reclaim_pages_global(struct mm_struct *charge_mm)
> +static struct misc_cg *sgx_reclaim_pages_global(struct misc_cg *next_cg,
> +						struct mm_struct *charge_mm)
>   {
> +	if (IS_ENABLED(CONFIG_CGROUP_MISC))
> +		return sgx_cgroup_reclaim_pages(misc_cg_root(), next_cg, charge_mm);
> +
>   	sgx_reclaim_pages(&sgx_global_lru, charge_mm);
> +	return NULL;
>   }
>   
>   /*
> @@ -414,12 +443,35 @@ static void sgx_reclaim_pages_global(struct mm_struct *charge_mm)
>    */
>   void sgx_reclaim_direct(void)
>   {
> +	struct sgx_cgroup *sgx_cg = sgx_get_current_cg();
> +	struct misc_cg *cg = misc_from_sgx(sgx_cg);

From below @sgx_cg could be NULL.  It's not immediately clear whether calling 
misc_from_sgx(sgx_cg) unconditionally is safe here.

Leave the initialization of @cg to a later phase where @sgx_cg is
guaranteed not to be NULL, or initialize @cg to NULL here and update it
later.

> +	struct misc_cg *next_cg = NULL;
> +
> +	/*
> +	 * Make sure there are some free pages at both cgroup and global levels.
> +	 * In both cases, only make one attempt of reclamation to avoid lengthy
> +	 * block on the caller.
> +	 */
> +	if (sgx_cg && sgx_cgroup_should_reclaim(sgx_cg))
> +		next_cg = sgx_cgroup_reclaim_pages(cg, next_cg, current->mm);

I don't quite follow the logic.

First of all, sgx_cgroup_reclaim_pages() isn't called in a loop, so why
not just do:

	next_cg = sgx_cgroup_reclaim_pages(cg, NULL, current->mm);

And what is the point of setting @next_cg here, since ...


> +
> +	if (next_cg != cg)
> +		put_misc_cg(next_cg);
> +
> +	next_cg = NULL;

... here @next_cg is reset to NULL ?

It looks like the only reason is that you need to do ...

	put_misc_cg(next_cg);

... above?

This piece of code appears repeatedly in this file.  Is there any way we
can get rid of it?

>   	if (sgx_should_reclaim_global(SGX_NR_LOW_PAGES))
> -		sgx_reclaim_pages_global(current->mm);
> +		next_cg = sgx_reclaim_pages_global(next_cg, current->mm);

And this doesn't seem to be "global reclaim" at all?

Because it is essentially equivalent to:

	next_cg = sgx_reclaim_pages_global(NULL, current->mm);

which always reclaims from the ROOT.

So each call to sgx_reclaim_direct() will always reclaim from the ROOT --
any other LRUs in the hierarchy are unlikely to get any chance to be
reclaimed.

> +
> +	if (next_cg != misc_cg_root())
> +		put_misc_cg(next_cg);
> +
> +	sgx_put_cg(sgx_cg);
>   }
>   
>   static int ksgxd(void *p)
>   {
> +	struct misc_cg *next_cg = NULL;
> +
>   	set_freezable();
>   
>   	/*
> @@ -437,11 +489,15 @@ static int ksgxd(void *p)
>   				     kthread_should_stop() ||
>   				     sgx_should_reclaim_global(SGX_NR_HIGH_PAGES));
>   
> -		if (sgx_should_reclaim_global(SGX_NR_HIGH_PAGES))
> +		while (!kthread_should_stop() && sgx_should_reclaim_global(SGX_NR_HIGH_PAGES)) {
>   			/* Indirect reclaim, no mm to charge, so NULL: */
> -			sgx_reclaim_pages_global(NULL);
> +			next_cg = sgx_reclaim_pages_global(next_cg, NULL);
> +			cond_resched();

Should the 'put_misc_cg(next_cg)' be done within the while() loop rather
than below?
> +		}
>   
> -		cond_resched();
> +		if (next_cg != misc_cg_root())
> +			put_misc_cg(next_cg);
> +		next_cg = NULL;

Again, it doesn't seem to be "global reclaim" here, since you always restart
from the ROOT once the target pages have been reclaimed.

AFAICT it's completely different from the existing global reclaim.

>   	}
>   
>   	return 0;
> @@ -583,6 +639,7 @@ int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
>    */
>   struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim)
>   {
> +	struct misc_cg *next_cg = NULL;
>   	struct sgx_cgroup *sgx_cg;
>   	struct sgx_epc_page *page;
>   	int ret;
> @@ -616,10 +673,19 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim)
>   			break;
>   		}
>   
> -		sgx_reclaim_pages_global(current->mm);
> +		/*
> +		 * At this point, the usage within this cgroup is under its
> +		 * limit but there is no physical page left for allocation.
> +		 * Perform a global reclaim to get some pages released from any
> +		 * cgroup with reclaimable pages.
> +		 */
> +		next_cg = sgx_reclaim_pages_global(next_cg, current->mm);
>   		cond_resched();
>   	}

Ditto IIUC.
Haitao Huang June 20, 2024, 3:06 p.m. UTC | #2
Hi Kai

On Thu, 20 Jun 2024 05:30:16 -0500, Huang, Kai <kai.huang@intel.com> wrote:

>
> On 18/06/2024 12:53 am, Huang, Haitao wrote:
>> From: Kristen Carlson Accardi <kristen@linux.intel.com>
>>
>> Previous patches have implemented all the infrastructure needed for
>> per-cgroup EPC page tracking and reclaiming. But all reclaimable EPC
>> pages are still tracked in the global LRU, as sgx_epc_page_lru() always
>> returns a reference to the global LRU.
>>
>> Change sgx_epc_page_lru() to return the LRU of the cgroup in which the
>> given EPC page is allocated.
>>
>> This makes all EPC pages tracked in per-cgroup LRUs, and the global
>> reclaimer (ksgxd) will no longer be able to reclaim any pages from the
>> global LRU. However, in cases of over-committing, i.e., when the sum of
>> cgroup limits is greater than the total capacity, individual cgroups may
>> never reclaim while the total usage can still be near the capacity.
>> Therefore, a global reclamation is still needed in those cases, and it
>> should be performed from the root cgroup.
>>
>> Modify sgx_reclaim_pages_global() to reclaim from the root EPC cgroup
>> when the misc cgroup is enabled. Similar to sgx_cgroup_reclaim_pages(),
>> return the next cgroup so callers can use it as the new starting node
>> for the next round of reclamation if needed.
>>
>> Also update sgx_can_reclaim_global() to check the emptiness of the LRUs
>> of all cgroups when the EPC cgroup is enabled, and otherwise only check
>> the global LRU.
>>
>> Finally, change sgx_reclaim_direct() to check and ensure there are free
>> pages at the cgroup level so forward progress can be made by the caller.
>
> Reading above, it's not clear how the _new_ global reclaim works with
> multiple LRUs.
>
> E.g., the current global reclaim essentially treats all EPC pages equally
> when scanning those pages.  From the above, I don't see how this is
> achieved in the new global reclaim.
>
> The changelog should:
>
> 1) describe how the existing global reclaim works, and then describe
> how to achieve the same behaviour in the new global reclaim, which works
> with multiple LRUs;
>
> 2) If there's any behaviour difference between the "existing" vs the "new"
> global reclaim, the changelog should point out the difference, and explain
> why such a difference is OK.

Sure, I can explain. Here is what I plan to add for v16:

Note the original global reclaimer has only one LRU and always scans and
reclaims from the head of this global LRU. The new global reclaimer
always starts scanning from the root node, and only moves down to its
descendants if more reclamation is needed or the root node does not have
SGX_NR_TO_SCAN (16) pages in its LRU. This makes the enclave pages in
the root node more likely to be reclaimed if they are not frequently
used (not 'young'). Unless we track pages in one LRU again, we cannot
exactly match the behavior of the original global reclaimer, and a
one-LRU approach would make per-cgroup reclamation scanning and
reclaiming too complex. On the other hand, this design is acceptable for
the following reasons:

1) For all purposes of using cgroups, enclaves will need to live in
      non-root (leaf for cgroup v2) nodes where limits can be enforced
      per-cgroup.
2) Global reclamation now only happens in the situation mentioned above,
      when a lower-level cgroup not at its limit can't allocate due to
      over-commit at the global level.
3) The pages in the root being slightly penalized are not busily used
      anyway.
4) In cases where multiple rounds of reclamation are needed, the caller
      of sgx_reclaim_pages_global() can call it again to reclaim from the
      'next' descendant in a round-robin way; each round scans up to
      SGX_NR_TO_SCAN pages from the head of the 'next' cgroup's LRU, as
      in the sketch below.
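
A minimal sketch of that round-robin caller pattern, condensed from the
ksgxd() hunk in this patch (need_more_reclaim() is a hypothetical
stand-in for the sgx_should_reclaim_global() watermark check, and
charge_mm is NULL for ksgxd):

	struct misc_cg *next_cg = NULL;

	/*
	 * Each call starts scanning at @next_cg (the root when NULL) and
	 * returns the cgroup where the next round should continue.
	 */
	while (need_more_reclaim()) {
		next_cg = sgx_reclaim_pages_global(next_cg, charge_mm);
		cond_resched();
	}

	/*
	 * Drop the reference returned by the last round; per the checks
	 * in this patch, the root is never put.
	 */
	if (next_cg != misc_cg_root())
		put_misc_cg(next_cg);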


>
>>
>> With these changes, the global reclamation and per-cgroup reclamation
>> both work properly with all pages tracked in per-cgroup LRUs.
>>
>
> [...]
>
>>
>> -static void sgx_reclaim_pages_global(struct mm_struct *charge_mm)
>> +static struct misc_cg *sgx_reclaim_pages_global(struct misc_cg *next_cg,
>> +						struct mm_struct *charge_mm)
>>   {
>> +	if (IS_ENABLED(CONFIG_CGROUP_MISC))
>> +		return sgx_cgroup_reclaim_pages(misc_cg_root(), next_cg, charge_mm);
>> +
>>   	sgx_reclaim_pages(&sgx_global_lru, charge_mm);
>> +	return NULL;
>>   }
>>
>>   /*
>> @@ -414,12 +443,35 @@ static void sgx_reclaim_pages_global(struct mm_struct *charge_mm)
>>    */
>>   void sgx_reclaim_direct(void)
>>   {
>> +	struct sgx_cgroup *sgx_cg = sgx_get_current_cg();
>> +	struct misc_cg *cg = misc_from_sgx(sgx_cg);
>
> From below @sgx_cg could be NULL.  It's not immediately clear whether
> calling misc_from_sgx(sgx_cg) unconditionally is safe here.
>
> Leave the initialization of @cg to a later phase where @sgx_cg is
> guaranteed not to be NULL, or initialize @cg to NULL here and update it
> later.
>

Ok

>> +	struct misc_cg *next_cg = NULL;
>> +
>> +	/*
>> +	 * Make sure there are some free pages at both cgroup and global levels.
>> +	 * In both cases, only make one attempt of reclamation to avoid lengthy
>> +	 * block on the caller.
>> +	 */
>> +	if (sgx_cg && sgx_cgroup_should_reclaim(sgx_cg))
>> +		next_cg = sgx_cgroup_reclaim_pages(cg, next_cg, current->mm);
>
> I don't quite follow the logic.
>
> First of all, sgx_cgroup_reclaim_pages() isn't called in a loop, so why
> not just do:
>
> 	next_cg = sgx_cgroup_reclaim_pages(cg, NULL, current->mm);
>

Okay, I can replace it with NULL. I thought it was convenient to grep all  
such calls to see if they are used consistently.

> And what is the point of setting @next_cg here, since ...
>
>
>> +
>> +	if (next_cg != cg)
>> +		put_misc_cg(next_cg);
>> +
>> +	next_cg = NULL;
>
> ... here @next_cg is reset to NULL ?
>
> It looks like the only reason is that you need to do ...
>
> 	put_misc_cg(next_cg);
>
> ... above?
>

yes

> This piece of code appears repeatedly in this file.  Is there any way we
> can get rid of it?
>

sgx_cgroup_reclaim_pages() returns 'next' in case the caller needs to run
it in a loop. In this case, no.

I could create a thin wrapper for this case, but since there is only one
such case, I did not feel it's worth the extra layer; a sketch of what it
could look like is below.
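
For reference, a minimal sketch of such a wrapper (hypothetical name, not
part of this patch; it just factors out the repeated pattern):

	/*
	 * Drop the reference on the next-victim cgroup returned by a
	 * reclaim call, unless it is the cgroup the caller tracks
	 * separately (@keep).
	 */
	static void sgx_put_next_cg(struct misc_cg *next_cg, struct misc_cg *keep)
	{
		if (next_cg != keep)
			put_misc_cg(next_cg);
	}

Callers would then do sgx_put_next_cg(next_cg, cg) in sgx_reclaim_direct()
and sgx_put_next_cg(next_cg, misc_cg_root()) in the global paths.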

>>   	if (sgx_should_reclaim_global(SGX_NR_LOW_PAGES))
>> -		sgx_reclaim_pages_global(current->mm);
>> +		next_cg = sgx_reclaim_pages_global(next_cg, current->mm);
>
> And this doesn't seem to be "global reclaim" at all?
>
> Because it is essentially equivalent to:
>
> 	next_cg = sgx_reclaim_pages_global(NULL, current->mm);
>
> which always reclaims from the ROOT.
>
> So each call to sgx_reclaim_direct() will always reclaim from the ROOT --
> any other LRUs in the hierarchy are unlikely to get any chance to be
> reclaimed.

No, the current cgroup is always checked first and reclaimed from if
needed. Global reclaim only happens when the system-wide watermark,
SGX_NR_LOW_PAGES, is violated.
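
Condensed, the flow in sgx_reclaim_direct() is (a sketch only; the
handling of the returned next-victim references is omitted, see the hunk
further below):

	/* One reclaim attempt for the current cgroup, if it is near its limit. */
	if (sgx_cg && sgx_cgroup_should_reclaim(sgx_cg))
		sgx_cgroup_reclaim_pages(cg, NULL, current->mm);

	/* One global attempt, only if the system-wide low watermark is hit. */
	if (sgx_should_reclaim_global(SGX_NR_LOW_PAGES))
		sgx_reclaim_pages_global(NULL, current->mm);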

>
>> +
>> +	if (next_cg != misc_cg_root())
>> +		put_misc_cg(next_cg);
>> +
>> +	sgx_put_cg(sgx_cg);
>>   }
>>
>>   static int ksgxd(void *p)
>>   {
>> +	struct misc_cg *next_cg = NULL;
>> +
>>   	set_freezable();
>>
>>   	/*
>> @@ -437,11 +489,15 @@ static int ksgxd(void *p)
>>   				     kthread_should_stop() ||
>>   				     sgx_should_reclaim_global(SGX_NR_HIGH_PAGES));
>>
>> -		if (sgx_should_reclaim_global(SGX_NR_HIGH_PAGES))
>> +		while (!kthread_should_stop() && sgx_should_reclaim_global(SGX_NR_HIGH_PAGES)) {
>>   			/* Indirect reclaim, no mm to charge, so NULL: */
>> -			sgx_reclaim_pages_global(NULL);
>> +			next_cg = sgx_reclaim_pages_global(next_cg, NULL);
>> +			cond_resched();
>
> Should the 'put_misc_cg(next_cg)' be done within the while() loop rather
> than below?

No, whenever there is a loop, we want to reclaim in a round-robin way.

>> +		}
>>
>> -		cond_resched();
>> +		if (next_cg != misc_cg_root())
>> +			put_misc_cg(next_cg);
>> +		next_cg = NULL;
>
> Again, it doesn't seem to be "global reclaim" here, since you always restart
> from the ROOT once the target pages have been reclaimed.
>
> AFAICT it's completely different from the existing global reclaim.
>
Hopefully adding the comments mentioned above clarifies this.

>>   	}
>>
>>   	return 0;
>> @@ -583,6 +639,7 @@ int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
>>    */
>>   struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim)
>>   {
>> +	struct misc_cg *next_cg = NULL;
>>   	struct sgx_cgroup *sgx_cg;
>>   	struct sgx_epc_page *page;
>>   	int ret;
>> @@ -616,10 +673,19 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim)
>>   			break;
>>   		}
>>
>> -		sgx_reclaim_pages_global(current->mm);
>> +		/*
>> +		 * At this point, the usage within this cgroup is under its
>> +		 * limit but there is no physical page left for allocation.
>> +		 * Perform a global reclaim to get some pages released from any
>> +		 * cgroup with reclaimable pages.
>> +		 */
>> +		next_cg = sgx_reclaim_pages_global(next_cg, current->mm);
>>   		cond_resched();
>>   	}
>
> Ditto IIUC.
>
>
Huang, Kai June 20, 2024, 11:53 p.m. UTC | #3
On Thu, 2024-06-20 at 10:06 -0500, Haitao Huang wrote:
> Hi Kai
> 
> On Thu, 20 Jun 2024 05:30:16 -0500, Huang, Kai <kai.huang@intel.com> wrote:
> 
> > 
> > On 18/06/2024 12:53 am, Huang, Haitao wrote:
> > > From: Kristen Carlson Accardi <kristen@linux.intel.com>
> > > 
> > > Previous patches have implemented all the infrastructure needed for
> > > per-cgroup EPC page tracking and reclaiming. But all reclaimable EPC
> > > pages are still tracked in the global LRU, as sgx_epc_page_lru() always
> > > returns a reference to the global LRU.
> > > 
> > > Change sgx_epc_page_lru() to return the LRU of the cgroup in which the
> > > given EPC page is allocated.
> > > 
> > > This makes all EPC pages tracked in per-cgroup LRUs, and the global
> > > reclaimer (ksgxd) will no longer be able to reclaim any pages from the
> > > global LRU. However, in cases of over-committing, i.e., when the sum of
> > > cgroup limits is greater than the total capacity, individual cgroups may
> > > never reclaim while the total usage can still be near the capacity.
> > > Therefore, a global reclamation is still needed in those cases, and it
> > > should be performed from the root cgroup.
> > > 
> > > Modify sgx_reclaim_pages_global() to reclaim from the root EPC cgroup
> > > when the misc cgroup is enabled. Similar to sgx_cgroup_reclaim_pages(),
> > > return the next cgroup so callers can use it as the new starting node
> > > for the next round of reclamation if needed.
> > > 
> > > Also update sgx_can_reclaim_global() to check the emptiness of the LRUs
> > > of all cgroups when the EPC cgroup is enabled, and otherwise only check
> > > the global LRU.
> > > 
> > > Finally, change sgx_reclaim_direct() to check and ensure there are free
> > > pages at the cgroup level so forward progress can be made by the caller.
> > 
> > Reading above, it's not clear how the _new_ global reclaim works with
> > multiple LRUs.
> > 
> > E.g., the current global reclaim essentially treats all EPC pages equally
> > when scanning those pages.  From the above, I don't see how this is
> > achieved in the new global reclaim.
> > 
> > The changelog should:
> > 
> > 1) describe how the existing global reclaim works, and then describe
> > how to achieve the same behaviour in the new global reclaim, which works
> > with multiple LRUs;
> > 
> > 2) If there's any behaviour difference between the "existing" vs the "new"
> > global reclaim, the changelog should point out the difference, and explain
> > why such a difference is OK.
> 
> Sure, I can explain. Here is what I plan to add for v16:
> 
> Note the original global reclaimer has only one LRU and always scans and
> reclaims from the head of this global LRU. The new global reclaimer
> always starts scanning from the root node, and only moves down to its
> descendants if more reclamation is needed or the root node does not have
> SGX_NR_TO_SCAN (16) pages in its LRU. This makes the enclave pages in
> the root node more likely to be reclaimed if they are not frequently
> used (not 'young'). Unless we track pages in one LRU again, we cannot
> exactly match the behavior of the original global reclaimer, and a
> one-LRU approach would make per-cgroup reclamation scanning and
> reclaiming too complex. On the other hand, this design is acceptable for
> the following reasons:
> 
> 1) For all purposes of using cgroups, enclaves will need to live in
>       non-root (leaf for cgroup v2) nodes where limits can be enforced
>       per-cgroup.

I don't see how it matters.  If ROOT is empty, then it moves to the first
descendant.

> 2) Global reclamation now only happens in the situation mentioned above,
>       when a lower-level cgroup not at its limit can't allocate due to
>       over-commit at the global level.

Really?  In sgx_reclaim_direct() the code says:

/*
 * Make sure there are some free pages at both cgroup and global levels.
 * In both cases, only make one attempt of reclamation to avoid lengthy
 * block on the caller.
 */

Yeah, only one attempt will be made at the global level, but it is still
global reclaim.

> 3) The pages in the root being slightly penalized are not busily used
>       anyway.

Point 1) says that in practice the root will have no enclaves, thus no
EPC at all.

In other words, in practice the global reclaim will always skip the root
and move to the first descendant.

> 4) In cases where multiple rounds of reclamation are needed, the caller
>       of sgx_reclaim_pages_global() can call it again to reclaim from the
>       'next' descendant in a round-robin way; each round scans up to
>       SGX_NR_TO_SCAN pages from the head of the 'next' cgroup's LRU.

"multiple rounds of reclamation" isn't clear enough.  Does it mean
multiple call of sgx_cgroup_reclaim_pages(), or does it mean each trigger
of global reclaim?

The current patch seems to be the former.  See the 'next_cg' is reset to
NULL for each loop of the main loop in ksgxd().

This essentially means each trigger of global reclaim will start from the
ROOT; in practice (based on 1) and 3) above), the first descendant will
always be the victim of each global reclaim.

I agree it's hard to _EXACTLY_ match the existing global reclaim, but IMHO
we should at least treat all cgroups equally.

Patch

diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
index ec7f267d3e7a..04d156848cf1 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -67,7 +67,7 @@  static inline u64 sgx_cgroup_max_pages_to_root(struct sgx_cgroup *sgx_cg)
  *
  * Return: %true if all cgroups under the specified root have empty LRU lists.
  */
-static bool sgx_cgroup_lru_empty(struct misc_cg *root)
+bool sgx_cgroup_lru_empty(struct misc_cg *root)
 {
 	struct cgroup_subsys_state *css_root;
 	struct cgroup_subsys_state *pos;
@@ -127,8 +127,8 @@  static bool sgx_cgroup_lru_empty(struct misc_cg *root)
  * release the reference if the returned is not used as %start for a subsequent
  * call.
  */
-static struct misc_cg *sgx_cgroup_reclaim_pages(struct misc_cg *root, struct misc_cg *start,
-						struct mm_struct *charge_mm)
+struct misc_cg *sgx_cgroup_reclaim_pages(struct misc_cg *root, struct misc_cg *start,
+					 struct mm_struct *charge_mm)
 {
 	struct cgroup_subsys_state *css_root, *pos;
 	struct cgroup_subsys_state *next = NULL;
@@ -181,7 +181,7 @@  static struct misc_cg *sgx_cgroup_reclaim_pages(struct misc_cg *root, struct mis
  * threshold (%SGX_CG_MIN_FREE_PAGE) and there are reclaimable pages within the
  * cgroup.
  */
-static bool sgx_cgroup_should_reclaim(struct sgx_cgroup *sgx_cg)
+bool sgx_cgroup_should_reclaim(struct sgx_cgroup *sgx_cg)
 {
 	u64 cur, max;
 
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
index 2f4ed046b1b5..4601cd507eaf 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.h
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
@@ -13,6 +13,11 @@ 
 #define MISC_CG_RES_SGX_EPC MISC_CG_RES_TYPES
 struct sgx_cgroup;
 
+static inline struct misc_cg *misc_from_sgx(struct sgx_cgroup *sgx_cg)
+{
+	return NULL;
+}
+
 static inline struct sgx_cgroup *sgx_get_current_cg(void)
 {
 	return NULL;
@@ -27,8 +32,25 @@  static inline int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_recl
 
 static inline void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg) { }
 
+static inline bool sgx_cgroup_lru_empty(struct misc_cg *root)
+{
+	return true;
+}
+
+static inline bool sgx_cgroup_should_reclaim(struct sgx_cgroup *sgx_cg)
+{
+	return false;
+}
+
 static inline void sgx_cgroup_init(void) { }
 
+static inline struct misc_cg *sgx_cgroup_reclaim_pages(struct misc_cg *root,
+						       struct misc_cg *next_cg,
+						       struct mm_struct *charge_mm)
+{
+	return NULL;
+}
+
 #else /* CONFIG_CGROUP_MISC */
 
 struct sgx_cgroup {
@@ -37,6 +59,11 @@  struct sgx_cgroup {
 	struct work_struct reclaim_work;
 };
 
+static inline struct misc_cg *misc_from_sgx(struct sgx_cgroup *sgx_cg)
+{
+	return sgx_cg->cg;
+}
+
 static inline struct sgx_cgroup *sgx_cgroup_from_misc_cg(struct misc_cg *cg)
 {
 	return (struct sgx_cgroup *)(cg->res[MISC_CG_RES_SGX_EPC].priv);
@@ -67,6 +94,11 @@  static inline void sgx_put_cg(struct sgx_cgroup *sgx_cg)
 
 int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_reclaim reclaim);
 void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg);
+bool sgx_cgroup_lru_empty(struct misc_cg *root);
+bool sgx_cgroup_should_reclaim(struct sgx_cgroup *sgx_cg);
+struct misc_cg *sgx_cgroup_reclaim_pages(struct misc_cg *root,
+					 struct misc_cg *next_cg,
+					 struct mm_struct *charge_mm);
 int __init sgx_cgroup_init(void);
 
 #endif /* CONFIG_CGROUP_MISC */
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index e91fa5f5348d..894adf6922f8 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -32,9 +32,30 @@  static DEFINE_XARRAY(sgx_epc_address_space);
  */
 static struct sgx_epc_lru_list sgx_global_lru;
 
+/*
+ * Get the per-cgroup or global LRU list that tracks the given reclaimable page.
+ */
 static inline struct sgx_epc_lru_list *sgx_epc_page_lru(struct sgx_epc_page *epc_page)
 {
+#ifdef CONFIG_CGROUP_MISC
+	/*
+	 * epc_page->sgx_cg here is never NULL during a reclaimable epc_page's
+	 * life between sgx_alloc_epc_page() and sgx_free_epc_page():
+	 *
+	 * In sgx_alloc_epc_page(), epc_page->sgx_cg is set to the return from
+	 * sgx_get_current_cg() which is the misc cgroup of the current task, or
+	 * the root by default even if the misc cgroup is disabled by kernel
+	 * command line.
+	 *
+	 * epc_page->sgx_cg is only unset by sgx_free_epc_page().
+	 *
+	 * This function is never used before sgx_alloc_epc_page() or after
+	 * sgx_free_epc_page().
+	 */
+	return &epc_page->sgx_cg->lru;
+#else
 	return &sgx_global_lru;
+#endif
 }
 
 /*
@@ -42,7 +63,10 @@  static inline struct sgx_epc_lru_list *sgx_epc_page_lru(struct sgx_epc_page *epc
  */
 static inline bool sgx_can_reclaim_global(void)
 {
-	return !list_empty(&sgx_global_lru.reclaimable);
+	if (IS_ENABLED(CONFIG_CGROUP_MISC))
+		return !sgx_cgroup_lru_empty(misc_cg_root());
+	else
+		return !list_empty(&sgx_global_lru.reclaimable);
 }
 
 static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
@@ -402,9 +426,14 @@  static bool sgx_should_reclaim_global(unsigned long watermark)
 		sgx_can_reclaim_global();
 }
 
-static void sgx_reclaim_pages_global(struct mm_struct *charge_mm)
+static struct misc_cg *sgx_reclaim_pages_global(struct misc_cg *next_cg,
+						struct mm_struct *charge_mm)
 {
+	if (IS_ENABLED(CONFIG_CGROUP_MISC))
+		return sgx_cgroup_reclaim_pages(misc_cg_root(), next_cg, charge_mm);
+
 	sgx_reclaim_pages(&sgx_global_lru, charge_mm);
+	return NULL;
 }
 
 /*
@@ -414,12 +443,35 @@  static void sgx_reclaim_pages_global(struct mm_struct *charge_mm)
  */
 void sgx_reclaim_direct(void)
 {
+	struct sgx_cgroup *sgx_cg = sgx_get_current_cg();
+	struct misc_cg *cg = misc_from_sgx(sgx_cg);
+	struct misc_cg *next_cg = NULL;
+
+	/*
+	 * Make sure there are some free pages at both cgroup and global levels.
+	 * In both cases, only make one attempt of reclamation to avoid lengthy
+	 * block on the caller.
+	 */
+	if (sgx_cg && sgx_cgroup_should_reclaim(sgx_cg))
+		next_cg = sgx_cgroup_reclaim_pages(cg, next_cg, current->mm);
+
+	if (next_cg != cg)
+		put_misc_cg(next_cg);
+
+	next_cg = NULL;
 	if (sgx_should_reclaim_global(SGX_NR_LOW_PAGES))
-		sgx_reclaim_pages_global(current->mm);
+		next_cg = sgx_reclaim_pages_global(next_cg, current->mm);
+
+	if (next_cg != misc_cg_root())
+		put_misc_cg(next_cg);
+
+	sgx_put_cg(sgx_cg);
 }
 
 static int ksgxd(void *p)
 {
+	struct misc_cg *next_cg = NULL;
+
 	set_freezable();
 
 	/*
@@ -437,11 +489,15 @@  static int ksgxd(void *p)
 				     kthread_should_stop() ||
 				     sgx_should_reclaim_global(SGX_NR_HIGH_PAGES));
 
-		if (sgx_should_reclaim_global(SGX_NR_HIGH_PAGES))
+		while (!kthread_should_stop() && sgx_should_reclaim_global(SGX_NR_HIGH_PAGES)) {
 			/* Indirect reclaim, no mm to charge, so NULL: */
-			sgx_reclaim_pages_global(NULL);
+			next_cg = sgx_reclaim_pages_global(next_cg, NULL);
+			cond_resched();
+		}
 
-		cond_resched();
+		if (next_cg != misc_cg_root())
+			put_misc_cg(next_cg);
+		next_cg = NULL;
 	}
 
 	return 0;
@@ -583,6 +639,7 @@  int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
  */
 struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim)
 {
+	struct misc_cg *next_cg = NULL;
 	struct sgx_cgroup *sgx_cg;
 	struct sgx_epc_page *page;
 	int ret;
@@ -616,10 +673,19 @@  struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim)
 			break;
 		}
 
-		sgx_reclaim_pages_global(current->mm);
+		/*
+		 * At this point, the usage within this cgroup is under its
+		 * limit but there is no physical page left for allocation.
+		 * Perform a global reclaim to get some pages released from any
+		 * cgroup with reclaimable pages.
+		 */
+		next_cg = sgx_reclaim_pages_global(next_cg, current->mm);
 		cond_resched();
 	}
 
+	if (next_cg != misc_cg_root())
+		put_misc_cg(next_cg);
+
 	if (!IS_ERR(page)) {
 		WARN_ON_ONCE(sgx_epc_page_get_cgroup(page));
 		/* sgx_put_cg() in sgx_free_epc_page() */