diff mbox series

[v9,10/15] x86/sgx: Add EPC reclamation in cgroup try_charge()

Message ID 20240205210638.157741-11-haitao.huang@linux.intel.com (mailing list archive)
State New, archived
Headers show
Series Add Cgroup support for SGX EPC memory | expand

Commit Message

Haitao Huang Feb. 5, 2024, 9:06 p.m. UTC
From: Kristen Carlson Accardi <kristen@linux.intel.com>

When the EPC usage of a cgroup is near its limit, the cgroup needs to
reclaim pages used in the same cgroup to make room for new allocations.
This is analogous to the behavior that the global reclaimer is triggered
when the global usage is close to total available EPC.

Add a Boolean parameter for sgx_epc_cgroup_try_charge() to indicate
whether synchronous reclaim is allowed or not. And trigger the
synchronous/asynchronous reclamation flow accordingly.

Note at this point, all reclaimable EPC pages are still tracked in the
global LRU and per-cgroup LRUs are empty. So no per-cgroup reclamation
is activated yet.

Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com>
Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
---
V7:
- Split this out from the big patch, #10 in V6. (Dave, Kai)
---
 arch/x86/kernel/cpu/sgx/epc_cgroup.c | 26 ++++++++++++++++++++++++--
 arch/x86/kernel/cpu/sgx/epc_cgroup.h |  4 ++--
 arch/x86/kernel/cpu/sgx/main.c       |  2 +-
 3 files changed, 27 insertions(+), 5 deletions(-)

Comments

Jarkko Sakkinen Feb. 12, 2024, 7:55 p.m. UTC | #1
On Mon Feb 5, 2024 at 11:06 PM EET, Haitao Huang wrote:
> From: Kristen Carlson Accardi <kristen@linux.intel.com>
>
> When the EPC usage of a cgroup is near its limit, the cgroup needs to
> reclaim pages used in the same cgroup to make room for new allocations.
> This is analogous to the behavior that the global reclaimer is triggered
> when the global usage is close to total available EPC.
>
> Add a Boolean parameter for sgx_epc_cgroup_try_charge() to indicate
> whether synchronous reclaim is allowed or not. And trigger the
> synchronous/asynchronous reclamation flow accordingly.
>
> Note at this point, all reclaimable EPC pages are still tracked in the
> global LRU and per-cgroup LRUs are empty. So no per-cgroup reclamation
> is activated yet.
>
> Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com>
> Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
> ---
> V7:
> - Split this out from the big patch, #10 in V6. (Dave, Kai)
> ---
>  arch/x86/kernel/cpu/sgx/epc_cgroup.c | 26 ++++++++++++++++++++++++--
>  arch/x86/kernel/cpu/sgx/epc_cgroup.h |  4 ++--
>  arch/x86/kernel/cpu/sgx/main.c       |  2 +-
>  3 files changed, 27 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> index d399fda2b55e..abf74fdb12b4 100644
> --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> @@ -184,13 +184,35 @@ static void sgx_epc_cgroup_reclaim_work_func(struct work_struct *work)
>  /**
>   * sgx_epc_cgroup_try_charge() - try to charge cgroup for a single EPC page
>   * @epc_cg:	The EPC cgroup to be charged for the page.
> + * @reclaim:	Whether or not synchronous reclaim is allowed
>   * Return:
>   * * %0 - If successfully charged.
>   * * -errno - for failures.
>   */
> -int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg)
> +int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool reclaim)
>  {
> -	return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE);
> +	for (;;) {
> +		if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg,
> +					PAGE_SIZE))
> +			break;
> +
> +		if (sgx_epc_cgroup_lru_empty(epc_cg->cg))
> +			return -ENOMEM;
> + +		if (signal_pending(current))
> +			return -ERESTARTSYS;
> +
> +		if (!reclaim) {
> +			queue_work(sgx_epc_cg_wq, &epc_cg->reclaim_work);
> +			return -EBUSY;
> +		}
> +
> +		if (!sgx_epc_cgroup_reclaim_pages(epc_cg->cg, false))
> +			/* All pages were too young to reclaim, try again a little later */
> +			schedule();

This will be total pain to backtrack after a while when something
needs to be changed so there definitely should be inline comments
addressing each branch condition.

I'd rethink this as:

1. Create static __sgx_epc_cgroup_try_charge() for addressing single
   iteration with the new "reclaim" parameter.
2. Add a new sgx_epc_group_try_charge_reclaim() function.

There's a bit of redundancy with sgx_epc_cgroup_try_charge() and
sgx_epc_cgroup_try_charge_reclaim() because both have almost the
same loop calling internal __sgx_epc_cgroup_try_charge() with
different parameters. That is totally acceptable.

Please also add my suggested-by.

BR, Jarkko

BR, Jarkko
Haitao Huang Feb. 12, 2024, 11:15 p.m. UTC | #2
Hi Jarkko

On Mon, 12 Feb 2024 13:55:46 -0600, Jarkko Sakkinen <jarkko@kernel.org>  
wrote:

> On Mon Feb 5, 2024 at 11:06 PM EET, Haitao Huang wrote:
>> From: Kristen Carlson Accardi <kristen@linux.intel.com>
>>
>> When the EPC usage of a cgroup is near its limit, the cgroup needs to
>> reclaim pages used in the same cgroup to make room for new allocations.
>> This is analogous to the behavior that the global reclaimer is triggered
>> when the global usage is close to total available EPC.
>>
>> Add a Boolean parameter for sgx_epc_cgroup_try_charge() to indicate
>> whether synchronous reclaim is allowed or not. And trigger the
>> synchronous/asynchronous reclamation flow accordingly.
>>
>> Note at this point, all reclaimable EPC pages are still tracked in the
>> global LRU and per-cgroup LRUs are empty. So no per-cgroup reclamation
>> is activated yet.
>>
>> Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
>> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
>> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
>> Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com>
>> Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
>> ---
>> V7:
>> - Split this out from the big patch, #10 in V6. (Dave, Kai)
>> ---
>>  arch/x86/kernel/cpu/sgx/epc_cgroup.c | 26 ++++++++++++++++++++++++--
>>  arch/x86/kernel/cpu/sgx/epc_cgroup.h |  4 ++--
>>  arch/x86/kernel/cpu/sgx/main.c       |  2 +-
>>  3 files changed, 27 insertions(+), 5 deletions(-)
>>
>> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c  
>> b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
>> index d399fda2b55e..abf74fdb12b4 100644
>> --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
>> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
>> @@ -184,13 +184,35 @@ static void  
>> sgx_epc_cgroup_reclaim_work_func(struct work_struct *work)
>>  /**
>>   * sgx_epc_cgroup_try_charge() - try to charge cgroup for a single EPC  
>> page
>>   * @epc_cg:	The EPC cgroup to be charged for the page.
>> + * @reclaim:	Whether or not synchronous reclaim is allowed
>>   * Return:
>>   * * %0 - If successfully charged.
>>   * * -errno - for failures.
>>   */
>> -int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg)
>> +int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool  
>> reclaim)
>>  {
>> -	return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE);
>> +	for (;;) {
>> +		if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg,
>> +					PAGE_SIZE))
>> +			break;
>> +
>> +		if (sgx_epc_cgroup_lru_empty(epc_cg->cg))
>> +			return -ENOMEM;
>> + +		if (signal_pending(current))
>> +			return -ERESTARTSYS;
>> +
>> +		if (!reclaim) {
>> +			queue_work(sgx_epc_cg_wq, &epc_cg->reclaim_work);
>> +			return -EBUSY;
>> +		}
>> +
>> +		if (!sgx_epc_cgroup_reclaim_pages(epc_cg->cg, false))
>> +			/* All pages were too young to reclaim, try again a little later */
>> +			schedule();
>
> This will be total pain to backtrack after a while when something
> needs to be changed so there definitely should be inline comments
> addressing each branch condition.
>
> I'd rethink this as:
>
> 1. Create static __sgx_epc_cgroup_try_charge() for addressing single
>    iteration with the new "reclaim" parameter.
> 2. Add a new sgx_epc_group_try_charge_reclaim() function.
>
> There's a bit of redundancy with sgx_epc_cgroup_try_charge() and
> sgx_epc_cgroup_try_charge_reclaim() because both have almost the
> same loop calling internal __sgx_epc_cgroup_try_charge() with
> different parameters. That is totally acceptable.
>
> Please also add my suggested-by.
>
> BR, Jarkko
>
> BR, Jarkko
>
For #2:
The only caller of this function, sgx_alloc_epc_page(), has the same  
boolean which is passed into this this function.

If we separate it into sgx_epc_cgroup_try_charge() and  
sgx_epc_cgroup_try_charge_reclaim(), then the caller has to have the  
if/else branches. So separation here seems not help?


For #1:
If we don't do #2, It seems overkill at the moment for such a short  
function.

How about we add inline comments for each branch for now, and if later  
there are more branches and the function become too long we add  
__sgx_epc_cgroup_try_charge() as you suggested?

Thanks
Haitao
Jarkko Sakkinen Feb. 14, 2024, 1:52 a.m. UTC | #3
On Tue Feb 13, 2024 at 1:15 AM EET, Haitao Huang wrote:
> Hi Jarkko
>
> On Mon, 12 Feb 2024 13:55:46 -0600, Jarkko Sakkinen <jarkko@kernel.org>  
> wrote:
>
> > On Mon Feb 5, 2024 at 11:06 PM EET, Haitao Huang wrote:
> >> From: Kristen Carlson Accardi <kristen@linux.intel.com>
> >>
> >> When the EPC usage of a cgroup is near its limit, the cgroup needs to
> >> reclaim pages used in the same cgroup to make room for new allocations.
> >> This is analogous to the behavior that the global reclaimer is triggered
> >> when the global usage is close to total available EPC.
> >>
> >> Add a Boolean parameter for sgx_epc_cgroup_try_charge() to indicate
> >> whether synchronous reclaim is allowed or not. And trigger the
> >> synchronous/asynchronous reclamation flow accordingly.
> >>
> >> Note at this point, all reclaimable EPC pages are still tracked in the
> >> global LRU and per-cgroup LRUs are empty. So no per-cgroup reclamation
> >> is activated yet.
> >>
> >> Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
> >> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> >> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> >> Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com>
> >> Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
> >> ---
> >> V7:
> >> - Split this out from the big patch, #10 in V6. (Dave, Kai)
> >> ---
> >>  arch/x86/kernel/cpu/sgx/epc_cgroup.c | 26 ++++++++++++++++++++++++--
> >>  arch/x86/kernel/cpu/sgx/epc_cgroup.h |  4 ++--
> >>  arch/x86/kernel/cpu/sgx/main.c       |  2 +-
> >>  3 files changed, 27 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c  
> >> b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> >> index d399fda2b55e..abf74fdb12b4 100644
> >> --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> >> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> >> @@ -184,13 +184,35 @@ static void  
> >> sgx_epc_cgroup_reclaim_work_func(struct work_struct *work)
> >>  /**
> >>   * sgx_epc_cgroup_try_charge() - try to charge cgroup for a single EPC  
> >> page
> >>   * @epc_cg:	The EPC cgroup to be charged for the page.
> >> + * @reclaim:	Whether or not synchronous reclaim is allowed
> >>   * Return:
> >>   * * %0 - If successfully charged.
> >>   * * -errno - for failures.
> >>   */
> >> -int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg)
> >> +int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool  
> >> reclaim)
> >>  {
> >> -	return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE);
> >> +	for (;;) {
> >> +		if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg,
> >> +					PAGE_SIZE))
> >> +			break;
> >> +
> >> +		if (sgx_epc_cgroup_lru_empty(epc_cg->cg))
> >> +			return -ENOMEM;
> >> + +		if (signal_pending(current))
> >> +			return -ERESTARTSYS;
> >> +
> >> +		if (!reclaim) {
> >> +			queue_work(sgx_epc_cg_wq, &epc_cg->reclaim_work);
> >> +			return -EBUSY;
> >> +		}
> >> +
> >> +		if (!sgx_epc_cgroup_reclaim_pages(epc_cg->cg, false))
> >> +			/* All pages were too young to reclaim, try again a little later */
> >> +			schedule();
> >
> > This will be total pain to backtrack after a while when something
> > needs to be changed so there definitely should be inline comments
> > addressing each branch condition.
> >
> > I'd rethink this as:
> >
> > 1. Create static __sgx_epc_cgroup_try_charge() for addressing single
> >    iteration with the new "reclaim" parameter.
> > 2. Add a new sgx_epc_group_try_charge_reclaim() function.
> >
> > There's a bit of redundancy with sgx_epc_cgroup_try_charge() and
> > sgx_epc_cgroup_try_charge_reclaim() because both have almost the
> > same loop calling internal __sgx_epc_cgroup_try_charge() with
> > different parameters. That is totally acceptable.
> >
> > Please also add my suggested-by.
> >
> > BR, Jarkko
> >
> > BR, Jarkko
> >
> For #2:
> The only caller of this function, sgx_alloc_epc_page(), has the same  
> boolean which is passed into this this function.

I know. This would be good opportunity to fix that up. Large patch
sets should try to make the space for its feature best possible and
thus also clean up the code base overally.

> If we separate it into sgx_epc_cgroup_try_charge() and  
> sgx_epc_cgroup_try_charge_reclaim(), then the caller has to have the  
> if/else branches. So separation here seems not help?

Of course it does. It makes the code in that location self-documenting
and easier to remember what it does.

BR, Jarkko
Haitao Huang Feb. 19, 2024, 3:12 p.m. UTC | #4
On Tue, 13 Feb 2024 19:52:25 -0600, Jarkko Sakkinen <jarkko@kernel.org>  
wrote:

> On Tue Feb 13, 2024 at 1:15 AM EET, Haitao Huang wrote:
>> Hi Jarkko
>>
>> On Mon, 12 Feb 2024 13:55:46 -0600, Jarkko Sakkinen <jarkko@kernel.org>
>> wrote:
>>
>> > On Mon Feb 5, 2024 at 11:06 PM EET, Haitao Huang wrote:
>> >> From: Kristen Carlson Accardi <kristen@linux.intel.com>
>> >>
>> >> When the EPC usage of a cgroup is near its limit, the cgroup needs to
>> >> reclaim pages used in the same cgroup to make room for new  
>> allocations.
>> >> This is analogous to the behavior that the global reclaimer is  
>> triggered
>> >> when the global usage is close to total available EPC.
>> >>
>> >> Add a Boolean parameter for sgx_epc_cgroup_try_charge() to indicate
>> >> whether synchronous reclaim is allowed or not. And trigger the
>> >> synchronous/asynchronous reclamation flow accordingly.
>> >>
>> >> Note at this point, all reclaimable EPC pages are still tracked in  
>> the
>> >> global LRU and per-cgroup LRUs are empty. So no per-cgroup  
>> reclamation
>> >> is activated yet.
>> >>
>> >> Co-developed-by: Sean Christopherson  
>> <sean.j.christopherson@intel.com>
>> >> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
>> >> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
>> >> Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com>
>> >> Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
>> >> ---
>> >> V7:
>> >> - Split this out from the big patch, #10 in V6. (Dave, Kai)
>> >> ---
>> >>  arch/x86/kernel/cpu/sgx/epc_cgroup.c | 26 ++++++++++++++++++++++++--
>> >>  arch/x86/kernel/cpu/sgx/epc_cgroup.h |  4 ++--
>> >>  arch/x86/kernel/cpu/sgx/main.c       |  2 +-
>> >>  3 files changed, 27 insertions(+), 5 deletions(-)
>> >>
>> >> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
>> >> b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
>> >> index d399fda2b55e..abf74fdb12b4 100644
>> >> --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
>> >> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
>> >> @@ -184,13 +184,35 @@ static void
>> >> sgx_epc_cgroup_reclaim_work_func(struct work_struct *work)
>> >>  /**
>> >>   * sgx_epc_cgroup_try_charge() - try to charge cgroup for a single  
>> EPC
>> >> page
>> >>   * @epc_cg:	The EPC cgroup to be charged for the page.
>> >> + * @reclaim:	Whether or not synchronous reclaim is allowed
>> >>   * Return:
>> >>   * * %0 - If successfully charged.
>> >>   * * -errno - for failures.
>> >>   */
>> >> -int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg)
>> >> +int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool
>> >> reclaim)
>> >>  {
>> >> -	return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg,  
>> PAGE_SIZE);
>> >> +	for (;;) {
>> >> +		if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg,
>> >> +					PAGE_SIZE))
>> >> +			break;
>> >> +
>> >> +		if (sgx_epc_cgroup_lru_empty(epc_cg->cg))
>> >> +			return -ENOMEM;
>> >> + +		if (signal_pending(current))
>> >> +			return -ERESTARTSYS;
>> >> +
>> >> +		if (!reclaim) {
>> >> +			queue_work(sgx_epc_cg_wq, &epc_cg->reclaim_work);
>> >> +			return -EBUSY;
>> >> +		}
>> >> +
>> >> +		if (!sgx_epc_cgroup_reclaim_pages(epc_cg->cg, false))
>> >> +			/* All pages were too young to reclaim, try again a little later  
>> */
>> >> +			schedule();
>> >
>> > This will be total pain to backtrack after a while when something
>> > needs to be changed so there definitely should be inline comments
>> > addressing each branch condition.
>> >
>> > I'd rethink this as:
>> >
>> > 1. Create static __sgx_epc_cgroup_try_charge() for addressing single
>> >    iteration with the new "reclaim" parameter.
>> > 2. Add a new sgx_epc_group_try_charge_reclaim() function.
>> >
>> > There's a bit of redundancy with sgx_epc_cgroup_try_charge() and
>> > sgx_epc_cgroup_try_charge_reclaim() because both have almost the
>> > same loop calling internal __sgx_epc_cgroup_try_charge() with
>> > different parameters. That is totally acceptable.
>> >
>> > Please also add my suggested-by.
>> >
>> > BR, Jarkko
>> >
>> > BR, Jarkko
>> >
>> For #2:
>> The only caller of this function, sgx_alloc_epc_page(), has the same
>> boolean which is passed into this this function.
>
> I know. This would be good opportunity to fix that up. Large patch
> sets should try to make the space for its feature best possible and
> thus also clean up the code base overally.
>
>> If we separate it into sgx_epc_cgroup_try_charge() and
>> sgx_epc_cgroup_try_charge_reclaim(), then the caller has to have the
>> if/else branches. So separation here seems not help?
>
> Of course it does. It makes the code in that location self-documenting
> and easier to remember what it does.
>
> BR, Jarkko
>

Please let me know if this aligns with your suggestion.


static int ___sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg)
{
         if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg,
                                         PAGE_SIZE))
                 return 0;

         if (sgx_epc_cgroup_lru_empty(epc_cg->cg))
                 return -ENOMEM;

         if (signal_pending(current))
                 return -ERESTARTSYS;

         return -EBUSY;
}

/**
  * sgx_epc_cgroup_try_charge() - try to charge cgroup for a single page
  * @epc_cg:     The EPC cgroup to be charged for the page.
  *
  * Try to reclaim pages in the background if the group reaches its limit  
and
  * there are reclaimable pages in the group.
  * Return:
  * * %0 - If successfully charged.
  * * -errno - for failures.
  */
int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg)
{
         int ret =  ___sgx_epc_cgroup_try_charge(epc_cg);

         if (ret == -EBUSY)
                 queue_work(sgx_epc_cg_wq, &epc_cg->reclaim_work);

         return ret;
}

/**
  * sgx_epc_cgroup_try_charge_reclaim() - try to charge cgroup for a single  
page
  * @epc_cg:     The EPC cgroup to be charged for the page.
  *
  * Try to reclaim pages directly if the group reaches its limit and there  
are
  * reclaimable pages in the group.
  * Return:
  * * %0 - If successfully charged.
  * * -errno - for failures.
  */
int sgx_epc_cgroup_try_charge_reclaim(struct sgx_epc_cgroup *epc_cg)
{
         int ret;

         for (;;) {
                 ret =  ___sgx_epc_cgroup_try_charge(epc_cg);
                 if (ret != -EBUSY)
                         return ret;

                 if (!sgx_epc_cgroup_reclaim_pages(epc_cg->cg, current->mm))
                         /* All pages were too young to reclaim, try again  
a little later */
                         schedule();
         }

         return 0;
}

It is a little more involved to remove the boolean for  
sgx_alloc_epc_page() and its callers like sgx_encl_grow(),  
sgx_alloc_va_page(). I'll send a separate patch for comments.

Thanks
Haitao
Jarkko Sakkinen Feb. 19, 2024, 8:20 p.m. UTC | #5
On Mon Feb 19, 2024 at 3:12 PM UTC, Haitao Huang wrote:
> On Tue, 13 Feb 2024 19:52:25 -0600, Jarkko Sakkinen <jarkko@kernel.org>  
> wrote:
>
> > On Tue Feb 13, 2024 at 1:15 AM EET, Haitao Huang wrote:
> >> Hi Jarkko
> >>
> >> On Mon, 12 Feb 2024 13:55:46 -0600, Jarkko Sakkinen <jarkko@kernel.org>
> >> wrote:
> >>
> >> > On Mon Feb 5, 2024 at 11:06 PM EET, Haitao Huang wrote:
> >> >> From: Kristen Carlson Accardi <kristen@linux.intel.com>
> >> >>
> >> >> When the EPC usage of a cgroup is near its limit, the cgroup needs to
> >> >> reclaim pages used in the same cgroup to make room for new  
> >> allocations.
> >> >> This is analogous to the behavior that the global reclaimer is  
> >> triggered
> >> >> when the global usage is close to total available EPC.
> >> >>
> >> >> Add a Boolean parameter for sgx_epc_cgroup_try_charge() to indicate
> >> >> whether synchronous reclaim is allowed or not. And trigger the
> >> >> synchronous/asynchronous reclamation flow accordingly.
> >> >>
> >> >> Note at this point, all reclaimable EPC pages are still tracked in  
> >> the
> >> >> global LRU and per-cgroup LRUs are empty. So no per-cgroup  
> >> reclamation
> >> >> is activated yet.
> >> >>
> >> >> Co-developed-by: Sean Christopherson  
> >> <sean.j.christopherson@intel.com>
> >> >> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> >> >> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> >> >> Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com>
> >> >> Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
> >> >> ---
> >> >> V7:
> >> >> - Split this out from the big patch, #10 in V6. (Dave, Kai)
> >> >> ---
> >> >>  arch/x86/kernel/cpu/sgx/epc_cgroup.c | 26 ++++++++++++++++++++++++--
> >> >>  arch/x86/kernel/cpu/sgx/epc_cgroup.h |  4 ++--
> >> >>  arch/x86/kernel/cpu/sgx/main.c       |  2 +-
> >> >>  3 files changed, 27 insertions(+), 5 deletions(-)
> >> >>
> >> >> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> >> >> b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> >> >> index d399fda2b55e..abf74fdb12b4 100644
> >> >> --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> >> >> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> >> >> @@ -184,13 +184,35 @@ static void
> >> >> sgx_epc_cgroup_reclaim_work_func(struct work_struct *work)
> >> >>  /**
> >> >>   * sgx_epc_cgroup_try_charge() - try to charge cgroup for a single  
> >> EPC
> >> >> page
> >> >>   * @epc_cg:	The EPC cgroup to be charged for the page.
> >> >> + * @reclaim:	Whether or not synchronous reclaim is allowed
> >> >>   * Return:
> >> >>   * * %0 - If successfully charged.
> >> >>   * * -errno - for failures.
> >> >>   */
> >> >> -int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg)
> >> >> +int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool
> >> >> reclaim)
> >> >>  {
> >> >> -	return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg,  
> >> PAGE_SIZE);
> >> >> +	for (;;) {
> >> >> +		if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg,
> >> >> +					PAGE_SIZE))
> >> >> +			break;
> >> >> +
> >> >> +		if (sgx_epc_cgroup_lru_empty(epc_cg->cg))
> >> >> +			return -ENOMEM;
> >> >> + +		if (signal_pending(current))
> >> >> +			return -ERESTARTSYS;
> >> >> +
> >> >> +		if (!reclaim) {
> >> >> +			queue_work(sgx_epc_cg_wq, &epc_cg->reclaim_work);
> >> >> +			return -EBUSY;
> >> >> +		}
> >> >> +
> >> >> +		if (!sgx_epc_cgroup_reclaim_pages(epc_cg->cg, false))
> >> >> +			/* All pages were too young to reclaim, try again a little later  
> >> */
> >> >> +			schedule();
> >> >
> >> > This will be total pain to backtrack after a while when something
> >> > needs to be changed so there definitely should be inline comments
> >> > addressing each branch condition.
> >> >
> >> > I'd rethink this as:
> >> >
> >> > 1. Create static __sgx_epc_cgroup_try_charge() for addressing single
> >> >    iteration with the new "reclaim" parameter.
> >> > 2. Add a new sgx_epc_group_try_charge_reclaim() function.
> >> >
> >> > There's a bit of redundancy with sgx_epc_cgroup_try_charge() and
> >> > sgx_epc_cgroup_try_charge_reclaim() because both have almost the
> >> > same loop calling internal __sgx_epc_cgroup_try_charge() with
> >> > different parameters. That is totally acceptable.
> >> >
> >> > Please also add my suggested-by.
> >> >
> >> > BR, Jarkko
> >> >
> >> > BR, Jarkko
> >> >
> >> For #2:
> >> The only caller of this function, sgx_alloc_epc_page(), has the same
> >> boolean which is passed into this this function.
> >
> > I know. This would be good opportunity to fix that up. Large patch
> > sets should try to make the space for its feature best possible and
> > thus also clean up the code base overally.
> >
> >> If we separate it into sgx_epc_cgroup_try_charge() and
> >> sgx_epc_cgroup_try_charge_reclaim(), then the caller has to have the
> >> if/else branches. So separation here seems not help?
> >
> > Of course it does. It makes the code in that location self-documenting
> > and easier to remember what it does.
> >
> > BR, Jarkko
> >
>
> Please let me know if this aligns with your suggestion.
>
>
> static int ___sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg)
> {
>          if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg,
>                                          PAGE_SIZE))
>                  return 0;
>
>          if (sgx_epc_cgroup_lru_empty(epc_cg->cg))
>                  return -ENOMEM;
>
>          if (signal_pending(current))
>                  return -ERESTARTSYS;
>
>          return -EBUSY;
> }
>
> /**
>   * sgx_epc_cgroup_try_charge() - try to charge cgroup for a single page
>   * @epc_cg:     The EPC cgroup to be charged for the page.
>   *
>   * Try to reclaim pages in the background if the group reaches its limit  
> and
>   * there are reclaimable pages in the group.
>   * Return:
>   * * %0 - If successfully charged.
>   * * -errno - for failures.
>   */
> int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg)
> {
>          int ret =  ___sgx_epc_cgroup_try_charge(epc_cg);
>
>          if (ret == -EBUSY)
>                  queue_work(sgx_epc_cg_wq, &epc_cg->reclaim_work);
>
>          return ret;
> }
>
> /**
>   * sgx_epc_cgroup_try_charge_reclaim() - try to charge cgroup for a single  
> page
>   * @epc_cg:     The EPC cgroup to be charged for the page.
>   *
>   * Try to reclaim pages directly if the group reaches its limit and there  
> are
>   * reclaimable pages in the group.
>   * Return:
>   * * %0 - If successfully charged.
>   * * -errno - for failures.
>   */
> int sgx_epc_cgroup_try_charge_reclaim(struct sgx_epc_cgroup *epc_cg)
> {
>          int ret;
>
>          for (;;) {
>                  ret =  ___sgx_epc_cgroup_try_charge(epc_cg);
>                  if (ret != -EBUSY)
>                          return ret;
>
>                  if (!sgx_epc_cgroup_reclaim_pages(epc_cg->cg, current->mm))
>                          /* All pages were too young to reclaim, try again  
> a little later */
>                          schedule();
>          }
>
>          return 0;
> }
>
> It is a little more involved to remove the boolean for  
> sgx_alloc_epc_page() and its callers like sgx_encl_grow(),  
> sgx_alloc_va_page(). I'll send a separate patch for comments.

With quick look, it is towards right direction for sure.

BR, Jarkko
Huang, Kai Feb. 21, 2024, 11:06 a.m. UTC | #6
> -int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg)
> +int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool reclaim)
>  {
> -	return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE);
> +	for (;;) {
> +		if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg,
> +					PAGE_SIZE))
> +			break;
> +
> +		if (sgx_epc_cgroup_lru_empty(epc_cg->cg))
> +			return -ENOMEM;
> +
> +		if (signal_pending(current))
> +			return -ERESTARTSYS;
> +
> +		if (!reclaim) {
> +			queue_work(sgx_epc_cg_wq, &epc_cg->reclaim_work);
> +			return -EBUSY;
> +		}
> +
> +		if (!sgx_epc_cgroup_reclaim_pages(epc_cg->cg, false))
> +			/* All pages were too young to reclaim, try again a little later */
> +			schedule();
> +	}
> +
> +	return 0;
>  }
>  

Seems this code change is 90% similar to the existing code in the
sgx_alloc_epc_page():

	...
	for ( ; ; ) {
                page = __sgx_alloc_epc_page();
                if (!IS_ERR(page)) {
                        page->owner = owner;
                        break;
                }

                if (list_empty(&sgx_active_page_list))
                        return ERR_PTR(-ENOMEM);

                if (!reclaim) {
                        page = ERR_PTR(-EBUSY);
                        break;
                }

                if (signal_pending(current)) {
                        page = ERR_PTR(-ERESTARTSYS);
                        break;
                }

                sgx_reclaim_pages();
                cond_resched();
        }
	...

Is it better to move the logic/code change in try_charge() out to
sgx_alloc_epc_page() to unify them?

IIUC, the logic is quite similar: When you either failed to allocate one page,
or failed to charge one page, you try to reclaim EPC page(s) from the current
EPC cgroup, either directly or indirectly.

No?
Haitao Huang Feb. 22, 2024, 5:09 p.m. UTC | #7
On Wed, 21 Feb 2024 05:06:02 -0600, Huang, Kai <kai.huang@intel.com> wrote:

>
>> -int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg)
>> +int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool  
>> reclaim)
>>  {
>> -	return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE);
>> +	for (;;) {
>> +		if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg,
>> +					PAGE_SIZE))
>> +			break;
>> +
>> +		if (sgx_epc_cgroup_lru_empty(epc_cg->cg))
>> +			return -ENOMEM;
>> +
>> +		if (signal_pending(current))
>> +			return -ERESTARTSYS;
>> +
>> +		if (!reclaim) {
>> +			queue_work(sgx_epc_cg_wq, &epc_cg->reclaim_work);
>> +			return -EBUSY;
>> +		}
>> +
>> +		if (!sgx_epc_cgroup_reclaim_pages(epc_cg->cg, false))
>> +			/* All pages were too young to reclaim, try again a little later */
>> +			schedule();
>> +	}
>> +
>> +	return 0;
>>  }
>>
>
> Seems this code change is 90% similar to the existing code in the
> sgx_alloc_epc_page():
>
> 	...
> 	for ( ; ; ) {
>                 page = __sgx_alloc_epc_page();
>                 if (!IS_ERR(page)) {
>                         page->owner = owner;
>                         break;
>                 }
>
>                 if (list_empty(&sgx_active_page_list))
>                         return ERR_PTR(-ENOMEM);
>
>                 if (!reclaim) {
>                         page = ERR_PTR(-EBUSY);
>                         break;
>                 }
>
>                 if (signal_pending(current)) {
>                         page = ERR_PTR(-ERESTARTSYS);
>                         break;
>                 }
>
>                 sgx_reclaim_pages();
>                 cond_resched();
>         }
> 	...
>
> Is it better to move the logic/code change in try_charge() out to
> sgx_alloc_epc_page() to unify them?
>
> IIUC, the logic is quite similar: When you either failed to allocate one  
> page,
> or failed to charge one page, you try to reclaim EPC page(s) from the  
> current
> EPC cgroup, either directly or indirectly.
>
> No?

Only these lines are the same:
                 if (!reclaim) {
                         page = ERR_PTR(-EBUSY);
                         break;
                 }

                 if (signal_pending(current)) {
                         page = ERR_PTR(-ERESTARTSYS);
                         break;
                 }

In sgx_alloc_epc_page() we do global reclamation but here we do per-cgroup  
reclamation. That's why the logic of other lines is different though they  
look similar due to similar function names. For the global reclamation we  
need consider case in that cgroup is not enabled. Similarly  
list_empty(&sgx_active_page_list) would have to be changed to check root  
cgroup if cgroups enabled otherwise check global LRU.  The (!reclaim) case  
is also different.  So I don't see an obvious good way to abstract those  
to get meaningful savings.

Thanks
Haitao
Huang, Kai Feb. 22, 2024, 9:26 p.m. UTC | #8
On 23/02/2024 6:09 am, Haitao Huang wrote:
> On Wed, 21 Feb 2024 05:06:02 -0600, Huang, Kai <kai.huang@intel.com> wrote:
> 
>>
>>> -int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg)
>>> +int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool 
>>> reclaim)
>>>  {
>>> -    return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, 
>>> PAGE_SIZE);
>>> +    for (;;) {
>>> +        if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg,
>>> +                    PAGE_SIZE))
>>> +            break;
>>> +
>>> +        if (sgx_epc_cgroup_lru_empty(epc_cg->cg))
>>> +            return -ENOMEM;
>>> +
>>> +        if (signal_pending(current))
>>> +            return -ERESTARTSYS;
>>> +
>>> +        if (!reclaim) {
>>> +            queue_work(sgx_epc_cg_wq, &epc_cg->reclaim_work);
>>> +            return -EBUSY;
>>> +        }
>>> +
>>> +        if (!sgx_epc_cgroup_reclaim_pages(epc_cg->cg, false))
>>> +            /* All pages were too young to reclaim, try again a 
>>> little later */
>>> +            schedule();
>>> +    }
>>> +
>>> +    return 0;
>>>  }
>>>
>>
>> Seems this code change is 90% similar to the existing code in the
>> sgx_alloc_epc_page():
>>
>>     ...
>>     for ( ; ; ) {
>>                 page = __sgx_alloc_epc_page();
>>                 if (!IS_ERR(page)) {
>>                         page->owner = owner;
>>                         break;
>>                 }
>>
>>                 if (list_empty(&sgx_active_page_list))
>>                         return ERR_PTR(-ENOMEM);
>>
>>                 if (!reclaim) {
>>                         page = ERR_PTR(-EBUSY);
>>                         break;
>>                 }
>>
>>                 if (signal_pending(current)) {
>>                         page = ERR_PTR(-ERESTARTSYS);
>>                         break;
>>                 }
>>
>>                 sgx_reclaim_pages();
>>                 cond_resched();
>>         }
>>     ...
>>
>> Is it better to move the logic/code change in try_charge() out to
>> sgx_alloc_epc_page() to unify them?
>>
>> IIUC, the logic is quite similar: When you either failed to allocate 
>> one page,
>> or failed to charge one page, you try to reclaim EPC page(s) from the 
>> current
>> EPC cgroup, either directly or indirectly.
>>
>> No?
> 
> Only these lines are the same:
>                  if (!reclaim) {
>                          page = ERR_PTR(-EBUSY);
>                          break;
>                  }
> 
>                  if (signal_pending(current)) {
>                          page = ERR_PTR(-ERESTARTSYS);
>                          break;
>                  }
> 
> In sgx_alloc_epc_page() we do global reclamation but here we do 
> per-cgroup reclamation. 

But why?  If we failed to allocate, shouldn't we try to reclaim from the 
_current_ EPC cgroup instead of global?  E.g., I thought one enclave in 
one EPC cgroup requesting insane amount of EPC shouldn't impact enclaves 
inside other cgroups?

That's why the logic of other lines is different
> though they look similar due to similar function names. For the global 
> reclamation we need consider case in that cgroup is not enabled. 
> Similarly list_empty(&sgx_active_page_list) would have to be changed to 
> check root cgroup if cgroups enabled otherwise check global LRU.  The 
> (!reclaim) case is also different.  

W/o getting clear on my above question, so far I am not convinced why 
such difference cannot be hide inside wrapper function(s).

So I don't see an obvious good way
> to abstract those to get meaningful savings.
> 
> Thanks
> Haitao
Haitao Huang Feb. 22, 2024, 10:57 p.m. UTC | #9
On Thu, 22 Feb 2024 15:26:05 -0600, Huang, Kai <kai.huang@intel.com> wrote:

>
>
> On 23/02/2024 6:09 am, Haitao Huang wrote:
>> On Wed, 21 Feb 2024 05:06:02 -0600, Huang, Kai <kai.huang@intel.com>  
>> wrote:
>>
>>>
>>>> -int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg)
>>>> +int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool  
>>>> reclaim)
>>>>  {
>>>> -    return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg,  
>>>> PAGE_SIZE);
>>>> +    for (;;) {
>>>> +        if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg,
>>>> +                    PAGE_SIZE))
>>>> +            break;
>>>> +
>>>> +        if (sgx_epc_cgroup_lru_empty(epc_cg->cg))
>>>> +            return -ENOMEM;
>>>> +
>>>> +        if (signal_pending(current))
>>>> +            return -ERESTARTSYS;
>>>> +
>>>> +        if (!reclaim) {
>>>> +            queue_work(sgx_epc_cg_wq, &epc_cg->reclaim_work);
>>>> +            return -EBUSY;
>>>> +        }
>>>> +
>>>> +        if (!sgx_epc_cgroup_reclaim_pages(epc_cg->cg, false))
>>>> +            /* All pages were too young to reclaim, try again a  
>>>> little later */
>>>> +            schedule();
>>>> +    }
>>>> +
>>>> +    return 0;
>>>>  }
>>>>
>>>
>>> Seems this code change is 90% similar to the existing code in the
>>> sgx_alloc_epc_page():
>>>
>>>     ...
>>>     for ( ; ; ) {
>>>                 page = __sgx_alloc_epc_page();
>>>                 if (!IS_ERR(page)) {
>>>                         page->owner = owner;
>>>                         break;
>>>                 }
>>>
>>>                 if (list_empty(&sgx_active_page_list))
>>>                         return ERR_PTR(-ENOMEM);
>>>
>>>                 if (!reclaim) {
>>>                         page = ERR_PTR(-EBUSY);
>>>                         break;
>>>                 }
>>>
>>>                 if (signal_pending(current)) {
>>>                         page = ERR_PTR(-ERESTARTSYS);
>>>                         break;
>>>                 }
>>>
>>>                 sgx_reclaim_pages();
>>>                 cond_resched();
>>>         }
>>>     ...
>>>
>>> Is it better to move the logic/code change in try_charge() out to
>>> sgx_alloc_epc_page() to unify them?
>>>
>>> IIUC, the logic is quite similar: When you either failed to allocate  
>>> one page,
>>> or failed to charge one page, you try to reclaim EPC page(s) from the  
>>> current
>>> EPC cgroup, either directly or indirectly.
>>>
>>> No?
>>  Only these lines are the same:
>>                  if (!reclaim) {
>>                          page = ERR_PTR(-EBUSY);
>>                          break;
>>                  }
>>                   if (signal_pending(current)) {
>>                          page = ERR_PTR(-ERESTARTSYS);
>>                          break;
>>                  }
>>  In sgx_alloc_epc_page() we do global reclamation but here we do  
>> per-cgroup reclamation.
>
> But why?  If we failed to allocate, shouldn't we try to reclaim from the  
> _current_ EPC cgroup instead of global?  E.g., I thought one enclave in  
> one EPC cgroup requesting insane amount of EPC shouldn't impact enclaves  
> inside other cgroups?
>
Right. When code reaches to here, we already passed reclaim per cgroup.  
The cgroup may not at or reach limit but system has run out of physical  
EPC.

Thanks
Haitao
Huang, Kai Feb. 23, 2024, 10:18 a.m. UTC | #10
> > 
> Right. When code reaches to here, we already passed reclaim per cgroup.  

Yes if try_charge() failed we must do pre-cgroup reclaim.

> The cgroup may not at or reach limit but system has run out of physical  
> EPC.
> 

But after try_charge() we can still choose to reclaim from the current group,
but not necessarily have to be global, right?  I am not sure whether I am
missing something, but could you elaborate why we should choose to reclaim from
the global?
Haitao Huang Feb. 23, 2024, 5 p.m. UTC | #11
On Fri, 23 Feb 2024 04:18:18 -0600, Huang, Kai <kai.huang@intel.com> wrote:

>> >
>> Right. When code reaches to here, we already passed reclaim per cgroup.
>
> Yes if try_charge() failed we must do pre-cgroup reclaim.
>
>> The cgroup may not at or reach limit but system has run out of physical
>> EPC.
>>
>
> But after try_charge() we can still choose to reclaim from the current  
> group,
> but not necessarily have to be global, right?  I am not sure whether I am
> missing something, but could you elaborate why we should choose to  
> reclaim from
> the global?
>

Once try_charge is done and returns zero that means the cgroup usage is  
charged and it's not over usage limit. So you really can't reclaim from  
that cgroup if allocation failed. The only  thing you can do is to reclaim  
globally.

This could happen when the sum of limits of all cgroups is greater than  
the physical EPC, i.e., user is overcommitting.

Thanks

Haitao
Huang, Kai Feb. 26, 2024, 1:38 a.m. UTC | #12
On 24/02/2024 6:00 am, Haitao Huang wrote:
> On Fri, 23 Feb 2024 04:18:18 -0600, Huang, Kai <kai.huang@intel.com> wrote:
> 
>>> >
>>> Right. When code reaches to here, we already passed reclaim per cgroup.
>>
>> Yes if try_charge() failed we must do pre-cgroup reclaim.
>>
>>> The cgroup may not at or reach limit but system has run out of physical
>>> EPC.
>>>
>>
>> But after try_charge() we can still choose to reclaim from the current 
>> group,
>> but not necessarily have to be global, right?  I am not sure whether I am
>> missing something, but could you elaborate why we should choose to 
>> reclaim from
>> the global?
>>
> 
> Once try_charge is done and returns zero that means the cgroup usage is 
> charged and it's not over usage limit. So you really can't reclaim from 
> that cgroup if allocation failed. The only  thing you can do is to 
> reclaim globally.

Sorry I still cannot establish the logic here.

Let's say the sum of all cgroups are greater than the physical EPC, and 
elclave(s) in each cgroup could potentially fault w/o reaching cgroup's 
limit.

In this case, when enclave(s) in one cgroup faults, why we cannot 
reclaim from the current cgroup, but have to reclaim from global?

Is there any real downside of the former, or you just want to follow the 
reclaim logic w/o cgroup at all?

IIUC, there's at least one advantage of reclaim from the current group, 
that faults of enclave(s) in one group won't impact other enclaves in 
other cgroups.  E.g., in this way other enclaves in other groups may 
never need to trigger faults.

Or perhaps I am missing anything?
Haitao Huang Feb. 26, 2024, 4:03 a.m. UTC | #13
On Sun, 25 Feb 2024 19:38:26 -0600, Huang, Kai <kai.huang@intel.com> wrote:

>
>
> On 24/02/2024 6:00 am, Haitao Huang wrote:
>> On Fri, 23 Feb 2024 04:18:18 -0600, Huang, Kai <kai.huang@intel.com>  
>> wrote:
>>
>>>> >
>>>> Right. When code reaches to here, we already passed reclaim per  
>>>> cgroup.
>>>
>>> Yes if try_charge() failed we must do pre-cgroup reclaim.
>>>
>>>> The cgroup may not at or reach limit but system has run out of  
>>>> physical
>>>> EPC.
>>>>
>>>
>>> But after try_charge() we can still choose to reclaim from the current  
>>> group,
>>> but not necessarily have to be global, right?  I am not sure whether I  
>>> am
>>> missing something, but could you elaborate why we should choose to  
>>> reclaim from
>>> the global?
>>>
>>  Once try_charge is done and returns zero that means the cgroup usage  
>> is charged and it's not over usage limit. So you really can't reclaim  
>> from that cgroup if allocation failed. The only  thing you can do is to  
>> reclaim globally.
>
> Sorry I still cannot establish the logic here.
>
> Let's say the sum of all cgroups are greater than the physical EPC, and  
> elclave(s) in each cgroup could potentially fault w/o reaching cgroup's  
> limit.
>
> In this case, when enclave(s) in one cgroup faults, why we cannot  
> reclaim from the current cgroup, but have to reclaim from global?
>
> Is there any real downside of the former, or you just want to follow the  
> reclaim logic w/o cgroup at all?
>
> IIUC, there's at least one advantage of reclaim from the current group,  
> that faults of enclave(s) in one group won't impact other enclaves in  
> other cgroups.  E.g., in this way other enclaves in other groups may  
> never need to trigger faults.
>
> Or perhaps I am missing anything?
>
The use case here is that user knows it's OK for group A to borrow some  
pages from group B for some time without impact much performance, vice  
versa. That's why the user is overcomitting so system can run more  
enclave/groups. Otherwise, if she is concerned about impact of A on B, she  
could lower limit for A so it never interfere or interfere less with B  
(assume the lower limit is still high enough to run all enclaves in A),  
and sacrifice some of A's performance. Or if she does not want any  
interference between groups, just don't over-comit. So we don't really  
lose anything here.

In case of overcomitting, even if we always reclaim from the same cgroup  
for each fault, one group may still interfere the other: e.g., consider an  
extreme case in that group A used up almost all EPC at the time group B  
has a fault, B has to fail allocation and kill enclaves.

Haitao
Huang, Kai Feb. 26, 2024, 11:36 a.m. UTC | #14
On Sun, 2024-02-25 at 22:03 -0600, Haitao Huang wrote:
> On Sun, 25 Feb 2024 19:38:26 -0600, Huang, Kai <kai.huang@intel.com> wrote:
> 
> > 
> > 
> > On 24/02/2024 6:00 am, Haitao Huang wrote:
> > > On Fri, 23 Feb 2024 04:18:18 -0600, Huang, Kai <kai.huang@intel.com>  
> > > wrote:
> > > 
> > > > > > 
> > > > > Right. When code reaches to here, we already passed reclaim per  
> > > > > cgroup.
> > > > 
> > > > Yes if try_charge() failed we must do pre-cgroup reclaim.
> > > > 
> > > > > The cgroup may not at or reach limit but system has run out of  
> > > > > physical
> > > > > EPC.
> > > > > 
> > > > 
> > > > But after try_charge() we can still choose to reclaim from the current  
> > > > group,
> > > > but not necessarily have to be global, right?  I am not sure whether I  
> > > > am
> > > > missing something, but could you elaborate why we should choose to  
> > > > reclaim from
> > > > the global?
> > > > 
> > >  Once try_charge is done and returns zero that means the cgroup usage  
> > > is charged and it's not over usage limit. So you really can't reclaim  
> > > from that cgroup if allocation failed. The only  thing you can do is to  
> > > reclaim globally.
> > 
> > Sorry I still cannot establish the logic here.
> > 
> > Let's say the sum of all cgroups are greater than the physical EPC, and  
> > elclave(s) in each cgroup could potentially fault w/o reaching cgroup's  
> > limit.
> > 
> > In this case, when enclave(s) in one cgroup faults, why we cannot  
> > reclaim from the current cgroup, but have to reclaim from global?
> > 
> > Is there any real downside of the former, or you just want to follow the  
> > reclaim logic w/o cgroup at all?
> > 
> > IIUC, there's at least one advantage of reclaim from the current group,  
> > that faults of enclave(s) in one group won't impact other enclaves in  
> > other cgroups.  E.g., in this way other enclaves in other groups may  
> > never need to trigger faults.
> > 
> > Or perhaps I am missing anything?
> > 
> The use case here is that user knows it's OK for group A to borrow some  
> pages from group B for some time without impact much performance, vice  
> versa. That's why the user is overcomitting so system can run more  
> enclave/groups. Otherwise, if she is concerned about impact of A on B, she  
> could lower limit for A so it never interfere or interfere less with B  
> (assume the lower limit is still high enough to run all enclaves in A),  
> and sacrifice some of A's performance. Or if she does not want any  
> interference between groups, just don't over-comit. So we don't really  
> lose anything here.

But if we reclaim from the same group, seems we could enable a user case that
allows the admin to ensure certain group won't be impacted at all, while
allowing other groups to over-commit?

E.g., let's say we have 100M physical EPC.  And let's say the admin wants to run
some performance-critical enclave(s) which costs 50M EPC w/o being impacted. 
The admin also wants to run other enclaves which could cost 100M EPC in total
but EPC swapping among them is acceptable.

If we choose to reclaim from the current EPC cgroup, then seems to that the
admin can achieve the above by setting up 2 groups with group1 having 50M limit
and group2 having 100M limit, and then run performance-critical enclave(s) in
group1 and others in group2?  Or am I missing anything?

If we choose to do global reclaim, then we cannot achieve that.

> 
> In case of overcomitting, even if we always reclaim from the same cgroup  
> for each fault, one group may still interfere the other: e.g., consider an  
> extreme case in that group A used up almost all EPC at the time group B  
> has a fault, B has to fail allocation and kill enclaves.

If the admin allows group A to use almost all EPC, to me it's fair to say he/she
doesn't want to run anything inside B at all and it is acceptable enclaves in B
to be killed.
Dave Hansen Feb. 26, 2024, 2:04 p.m. UTC | #15
On 2/26/24 03:36, Huang, Kai wrote:
>> In case of overcomitting, even if we always reclaim from the same cgroup  
>> for each fault, one group may still interfere the other: e.g., consider an  
>> extreme case in that group A used up almost all EPC at the time group B  
>> has a fault, B has to fail allocation and kill enclaves.
> If the admin allows group A to use almost all EPC, to me it's fair to say he/she
> doesn't want to run anything inside B at all and it is acceptable enclaves in B
> to be killed.

Folks, I'm having a really hard time following this thread.  It sounds
like there's disagreement about when to do system-wide reclaim.  Could
someone remind me of the choices that we have?  (A proposed patch would
go a _long_ way to helping me understand)

Also, what does the core mm memcg code do?

Last, what is the simplest (least amount of code) thing that the SGX
cgroup controller could implement here?
Haitao Huang Feb. 26, 2024, 9:18 p.m. UTC | #16
On Mon, 26 Feb 2024 05:36:02 -0600, Huang, Kai <kai.huang@intel.com> wrote:

> On Sun, 2024-02-25 at 22:03 -0600, Haitao Huang wrote:
>> On Sun, 25 Feb 2024 19:38:26 -0600, Huang, Kai <kai.huang@intel.com>  
>> wrote:
>>
>> >
>> >
>> > On 24/02/2024 6:00 am, Haitao Huang wrote:
>> > > On Fri, 23 Feb 2024 04:18:18 -0600, Huang, Kai <kai.huang@intel.com>
>> > > wrote:
>> > >
>> > > > > >
>> > > > > Right. When code reaches to here, we already passed reclaim per
>> > > > > cgroup.
>> > > >
>> > > > Yes if try_charge() failed we must do pre-cgroup reclaim.
>> > > >
>> > > > > The cgroup may not at or reach limit but system has run out of
>> > > > > physical
>> > > > > EPC.
>> > > > >
>> > > >
>> > > > But after try_charge() we can still choose to reclaim from the  
>> current
>> > > > group,
>> > > > but not necessarily have to be global, right?  I am not sure  
>> whether I
>> > > > am
>> > > > missing something, but could you elaborate why we should choose to
>> > > > reclaim from
>> > > > the global?
>> > > >
>> > >  Once try_charge is done and returns zero that means the cgroup  
>> usage
>> > > is charged and it's not over usage limit. So you really can't  
>> reclaim
>> > > from that cgroup if allocation failed. The only  thing you can do  
>> is to
>> > > reclaim globally.
>> >
>> > Sorry I still cannot establish the logic here.
>> >
>> > Let's say the sum of all cgroups are greater than the physical EPC,  
>> and
>> > elclave(s) in each cgroup could potentially fault w/o reaching  
>> cgroup's
>> > limit.
>> >
>> > In this case, when enclave(s) in one cgroup faults, why we cannot
>> > reclaim from the current cgroup, but have to reclaim from global?
>> >
>> > Is there any real downside of the former, or you just want to follow  
>> the
>> > reclaim logic w/o cgroup at all?
>> >
>> > IIUC, there's at least one advantage of reclaim from the current  
>> group,
>> > that faults of enclave(s) in one group won't impact other enclaves in
>> > other cgroups.  E.g., in this way other enclaves in other groups may
>> > never need to trigger faults.
>> >
>> > Or perhaps I am missing anything?
>> >
>> The use case here is that user knows it's OK for group A to borrow some
>> pages from group B for some time without impact much performance, vice
>> versa. That's why the user is overcomitting so system can run more
>> enclave/groups. Otherwise, if she is concerned about impact of A on B,  
>> she
>> could lower limit for A so it never interfere or interfere less with B
>> (assume the lower limit is still high enough to run all enclaves in A),
>> and sacrifice some of A's performance. Or if she does not want any
>> interference between groups, just don't over-comit. So we don't really
>> lose anything here.
>
> But if we reclaim from the same group, seems we could enable a user case  
> that
> allows the admin to ensure certain group won't be impacted at all, while
> allowing other groups to over-commit?
>
> E.g., let's say we have 100M physical EPC.  And let's say the admin  
> wants to run
> some performance-critical enclave(s) which costs 50M EPC w/o being  
> impacted.
> The admin also wants to run other enclaves which could cost 100M EPC in  
> total
> but EPC swapping among them is acceptable.
>
> If we choose to reclaim from the current EPC cgroup, then seems to that  
> the
> admin can achieve the above by setting up 2 groups with group1 having  
> 50M limit
> and group2 having 100M limit, and then run performance-critical  
> enclave(s) in
> group1 and others in group2?  Or am I missing anything?
>

The more important groups should have limits higher than or equal to peak  
usage to ensure no impact.
The less important groups should have lower limits than its peak usage to  
avoid impacting higher priority groups.
The limit is the maximum usage allowed.

By setting group2 limit to 100M, you are allowing it to use 100M. So as  
soon as it gets up and consume 100M, group1 can not even load any enclave  
if we only reclaim per-cgroup and do not do global reclaim.

> If we choose to do global reclaim, then we cannot achieve that.


You can achieve this by setting group 2 limit to 50M. No need to  
overcommiting to the system.
Group 2 will swap as soon as it hits 50M, which is the maximum it can  
consume so no impact to group 1.

>
>>
>> In case of overcomitting, even if we always reclaim from the same cgroup
>> for each fault, one group may still interfere the other: e.g., consider  
>> an
>> extreme case in that group A used up almost all EPC at the time group B
>> has a fault, B has to fail allocation and kill enclaves.
>
> If the admin allows group A to use almost all EPC, to me it's fair to  
> say he/she
> doesn't want to run anything inside B at all and it is acceptable  
> enclaves in B
> to be killed.
>
>
I don't think so. The user just knows group A + B peak usages higher than  
system capacity. And she is OK for them to share some of the pages  
dynamically. So kernel should allow one borrow from the other at a  
particular instance when one group has higher demand. And later doing the  
opposite. IOW, the favor goes both ways.

Haitao
Haitao Huang Feb. 26, 2024, 9:48 p.m. UTC | #17
Hi Dave,

On Mon, 26 Feb 2024 08:04:54 -0600, Dave Hansen <dave.hansen@intel.com>  
wrote:

> On 2/26/24 03:36, Huang, Kai wrote:
>>> In case of overcomitting, even if we always reclaim from the same  
>>> cgroup
>>> for each fault, one group may still interfere the other: e.g.,  
>>> consider an
>>> extreme case in that group A used up almost all EPC at the time group B
>>> has a fault, B has to fail allocation and kill enclaves.
>> If the admin allows group A to use almost all EPC, to me it's fair to  
>> say he/she
>> doesn't want to run anything inside B at all and it is acceptable  
>> enclaves in B
>> to be killed.
>
> Folks, I'm having a really hard time following this thread.  It sounds
> like there's disagreement about when to do system-wide reclaim.  Could
> someone remind me of the choices that we have?  (A proposed patch would
> go a _long_ way to helping me understand)
>

In case of overcomitting, i.e., sum of limits greater than the EPC  
capacity, if one group has a fault, and its usage is not above its own  
limit (try_charge() passes), yet total usage of the system has exceeded  
the capacity, whether we do global reclaim or just reclaim pages in the  
current faulting group.

> Also, what does the core mm memcg code do?
>
I'm not sure. I'll try to find out but it'd be appreciated if someone more  
knowledgeable can comment on this. memcg also has the protection mechanism  
(i.e., min, low settings) to guarantee some allocation per group so its  
approach might not be applicable to misc controller here.

> Last, what is the simplest (least amount of code) thing that the SGX
> cgroup controller could implement here?
>
>

I still think the current approach of doing global reclaim is reasonable  
and simple: try_charge() checks cgroup limit and reclaim within the group  
if needed, then do EPC page allocation, reclaim globally if allocation  
fails due to global usage reaches the capacity.

I'm not sure how not doing global reclaiming in this case would bring any  
benefit. Please see my response to Kai's example cases.

Thanks
Haitao
Dave Hansen Feb. 26, 2024, 9:56 p.m. UTC | #18
On 2/26/24 13:48, Haitao Huang wrote:
> In case of overcomitting, i.e., sum of limits greater than the EPC
> capacity, if one group has a fault, and its usage is not above its own
> limit (try_charge() passes), yet total usage of the system has exceeded
> the capacity, whether we do global reclaim or just reclaim pages in the
> current faulting group.

I don't see _any_ reason to limit reclaim to the current faulting cgroup.

>> Last, what is the simplest (least amount of code) thing that the SGX
>> cgroup controller could implement here?
> 
> I still think the current approach of doing global reclaim is reasonable
> and simple: try_charge() checks cgroup limit and reclaim within the
> group if needed, then do EPC page allocation, reclaim globally if
> allocation fails due to global usage reaches the capacity.
> 
> I'm not sure how not doing global reclaiming in this case would bring
> any benefit.
I tend to agree.

Kai, I think your examples sound a little bit contrived.  Have actual
users expressed a strong intent for doing anything with this series
other than limiting bad actors from eating all the EPC?
Huang, Kai Feb. 26, 2024, 10:24 p.m. UTC | #19
On 27/02/2024 10:18 am, Haitao Huang wrote:
> On Mon, 26 Feb 2024 05:36:02 -0600, Huang, Kai <kai.huang@intel.com> wrote:
> 
>> On Sun, 2024-02-25 at 22:03 -0600, Haitao Huang wrote:
>>> On Sun, 25 Feb 2024 19:38:26 -0600, Huang, Kai <kai.huang@intel.com> 
>>> wrote:
>>>
>>> >
>>> >
>>> > On 24/02/2024 6:00 am, Haitao Huang wrote:
>>> > > On Fri, 23 Feb 2024 04:18:18 -0600, Huang, Kai <kai.huang@intel.com>
>>> > > wrote:
>>> > >
>>> > > > > >
>>> > > > > Right. When code reaches to here, we already passed reclaim per
>>> > > > > cgroup.
>>> > > >
>>> > > > Yes if try_charge() failed we must do pre-cgroup reclaim.
>>> > > >
>>> > > > > The cgroup may not at or reach limit but system has run out of
>>> > > > > physical
>>> > > > > EPC.
>>> > > > >
>>> > > >
>>> > > > But after try_charge() we can still choose to reclaim from the 
>>> current
>>> > > > group,
>>> > > > but not necessarily have to be global, right?  I am not sure 
>>> whether I
>>> > > > am
>>> > > > missing something, but could you elaborate why we should choose to
>>> > > > reclaim from
>>> > > > the global?
>>> > > >
>>> > >  Once try_charge is done and returns zero that means the cgroup 
>>> usage
>>> > > is charged and it's not over usage limit. So you really can't 
>>> reclaim
>>> > > from that cgroup if allocation failed. The only  thing you can do 
>>> is to
>>> > > reclaim globally.
>>> >
>>> > Sorry I still cannot establish the logic here.
>>> >
>>> > Let's say the sum of all cgroups are greater than the physical EPC, 
>>> and
>>> > elclave(s) in each cgroup could potentially fault w/o reaching 
>>> cgroup's
>>> > limit.
>>> >
>>> > In this case, when enclave(s) in one cgroup faults, why we cannot
>>> > reclaim from the current cgroup, but have to reclaim from global?
>>> >
>>> > Is there any real downside of the former, or you just want to 
>>> follow the
>>> > reclaim logic w/o cgroup at all?
>>> >
>>> > IIUC, there's at least one advantage of reclaim from the current 
>>> group,
>>> > that faults of enclave(s) in one group won't impact other enclaves in
>>> > other cgroups.  E.g., in this way other enclaves in other groups may
>>> > never need to trigger faults.
>>> >
>>> > Or perhaps I am missing anything?
>>> >
>>> The use case here is that user knows it's OK for group A to borrow some
>>> pages from group B for some time without impact much performance, vice
>>> versa. That's why the user is overcomitting so system can run more
>>> enclave/groups. Otherwise, if she is concerned about impact of A on 
>>> B, she
>>> could lower limit for A so it never interfere or interfere less with B
>>> (assume the lower limit is still high enough to run all enclaves in A),
>>> and sacrifice some of A's performance. Or if she does not want any
>>> interference between groups, just don't over-comit. So we don't really
>>> lose anything here.
>>
>> But if we reclaim from the same group, seems we could enable a user 
>> case that
>> allows the admin to ensure certain group won't be impacted at all, while
>> allowing other groups to over-commit?
>>
>> E.g., let's say we have 100M physical EPC.  And let's say the admin 
>> wants to run
>> some performance-critical enclave(s) which costs 50M EPC w/o being 
>> impacted.
>> The admin also wants to run other enclaves which could cost 100M EPC 
>> in total
>> but EPC swapping among them is acceptable.
>>
>> If we choose to reclaim from the current EPC cgroup, then seems to 
>> that the
>> admin can achieve the above by setting up 2 groups with group1 having 
>> 50M limit
>> and group2 having 100M limit, and then run performance-critical 
>> enclave(s) in
>> group1 and others in group2?  Or am I missing anything?
>>
> 
> The more important groups should have limits higher than or equal to 
> peak usage to ensure no impact.

Yes.  But if you do global reclaim there's no guarantee of this 
regardless of the limit setting.  It depends on setting of limits of 
other groups.

> The less important groups should have lower limits than its peak usage 
> to avoid impacting higher priority groups.

Yeah, but depending on how low the limit is, the try_charge() can still 
succeed but physical EPC is already running out.

Are you saying we should always expect the admin to set limits of groups 
not exceeding the physical EPC?

> The limit is the maximum usage allowed.
> 
> By setting group2 limit to 100M, you are allowing it to use 100M. So as 
> soon as it gets up and consume 100M, group1 can not even load any 
> enclave if we only reclaim per-cgroup and do not do global reclaim.

I kinda forgot, but I think SGX supports swapping out EPC of an enclave 
before EINIT?  Also, with SGX2 the initial enclave can take less EPC to 
be loaded.

> 
>> If we choose to do global reclaim, then we cannot achieve that.
> 
> 
> You can achieve this by setting group 2 limit to 50M. No need to 
> overcommiting to the system.
> Group 2 will swap as soon as it hits 50M, which is the maximum it can 
> consume so no impact to group 1.

Right.  We can achieve this by doing so.  But as said above, you are 
depending on setting up the limit to do per-cgorup reclaim.

So, back to the question:

What is the downside of doing per-group reclaim when try_charge() 
succeeds for the enclave but failed to allocate EPC page?

Could you give an complete answer why you choose to use global reclaim 
for the above case?
Dave Hansen Feb. 26, 2024, 10:31 p.m. UTC | #20
On 2/26/24 14:24, Huang, Kai wrote:
> What is the downside of doing per-group reclaim when try_charge()
> succeeds for the enclave but failed to allocate EPC page?
> 
> Could you give an complete answer why you choose to use global reclaim
> for the above case?

There are literally two different limits at play.  There's the limit
that the cgroup imposes and then the actual physical limit.

Hitting the cgroup limit induces cgroup reclaim.

Hitting the physical limit induces global reclaim.

Maybe I'm just being dense, but I fail to understand why you would want
to entangle those two different concepts more than absolutely necessary.
Huang, Kai Feb. 26, 2024, 10:34 p.m. UTC | #21
> 
> Kai, I think your examples sound a little bit contrived.  Have actual
> users expressed a strong intent for doing anything with this series
> other than limiting bad actors from eating all the EPC?
I am not sure about this.  I am also trying to get a full picture.

I asked because I didn't quite like the duplicated code change in 
try_charge() in this patch and in sgx_alloc_epc_page().  I think using 
per-group reclaim we can unify the code (I have even started to write 
the code) and I don't see the downside of doing so.

So I am trying to get the actual downside of doing per-cgroup reclaim or 
the full reason that we choose global reclaim.
Dave Hansen Feb. 26, 2024, 10:38 p.m. UTC | #22
On 2/26/24 14:34, Huang, Kai wrote:
> So I am trying to get the actual downside of doing per-cgroup reclaim or
> the full reason that we choose global reclaim.

Take the most extreme example:

	while (hit_global_sgx_limit())
		reclaim_from_this(cgroup);

You eventually end up with all of 'cgroup's pages gone and handed out to
other users on the system who stole them all.  Other users might cause
you to go over the global limit.  *They* should be paying part of the
cost, not just you and your cgroup.
Huang, Kai Feb. 26, 2024, 10:38 p.m. UTC | #23
On 27/02/2024 11:31 am, Dave Hansen wrote:
> On 2/26/24 14:24, Huang, Kai wrote:
>> What is the downside of doing per-group reclaim when try_charge()
>> succeeds for the enclave but failed to allocate EPC page?
>>
>> Could you give an complete answer why you choose to use global reclaim
>> for the above case?
> 
> There are literally two different limits at play.  There's the limit
> that the cgroup imposes and then the actual physical limit.
> 
> Hitting the cgroup limit induces cgroup reclaim.
> 
> Hitting the physical limit induces global reclaim.
> 
> Maybe I'm just being dense, but I fail to understand why you would want
> to entangle those two different concepts more than absolutely necessary.

OK.  Yes I agree doing per-cgroup reclaim when hitting physical limit 
would bring another layer of consideration of when to do global reclaim, 
which is not necessary now.
Huang, Kai Feb. 26, 2024, 10:46 p.m. UTC | #24
On 27/02/2024 11:38 am, Dave Hansen wrote:
> On 2/26/24 14:34, Huang, Kai wrote:
>> So I am trying to get the actual downside of doing per-cgroup reclaim or
>> the full reason that we choose global reclaim.
> 
> Take the most extreme example:
> 
> 	while (hit_global_sgx_limit())
> 		reclaim_from_this(cgroup);
> 
> You eventually end up with all of 'cgroup's pages gone and handed out to
> other users on the system who stole them all.  Other users might cause
> you to go over the global limit.  *They* should be paying part of the
> cost, not just you and your cgroup.

Yeah likely we will need another layer of logic to decide when to do 
global reclaim.  I agree that is downside and is unnecessary for this 
patchset.

Thanks for the comments.
Michal Koutný Feb. 27, 2024, 9:26 a.m. UTC | #25
Hello.

On Mon, Feb 26, 2024 at 03:48:18PM -0600, Haitao Huang <haitao.huang@linux.intel.com> wrote:
> In case of overcomitting, i.e., sum of limits greater than the EPC capacity,
> if one group has a fault, and its usage is not above its own limit
> (try_charge() passes), yet total usage of the system has exceeded the
> capacity, whether we do global reclaim or just reclaim pages in the current
> faulting group.
> 
> > Also, what does the core mm memcg code do?
> > 
> I'm not sure. I'll try to find out but it'd be appreciated if someone more
> knowledgeable can comment on this. memcg also has the protection mechanism
> (i.e., min, low settings) to guarantee some allocation per group so its
> approach might not be applicable to misc controller here.

I only follow the discussion superficially but it'd be nice to have
analogous mechanisms in memcg and sgx controller.

The memory limits are rather simple -- when allocating new memory, the
tightest limit of ancestor applies and reclaim is applied to whole
subtree of such an ancestor (not necessearily the originating cgroup of
the allocation). Overcommit is admited, competition among siblings is
resolved on the first comes, first served basis.

The memory protections are an additional (and in a sense orthogoal)
mechanism. They can be interpretted as limits that are enforced not at
the time of allocation but at the time of reclaim (and reclaim is
triggered independetly, either global or with the limits above). The
consequence is that the protection code must do some normalization to
evaluate overcommited values sensibly.

(Historically, there was also memory.soft_limit_in_bytes, which combined
"protection" and global reclaim but it turned out broken. I suggest
reading Documentation/admin-guide/cgroup-v2.rst section Controller
Issues and Remedies/Memory for more details and as cautionary example.)

HTH,
Michal
Jarkko Sakkinen Feb. 27, 2024, 8:41 p.m. UTC | #26
On Mon Feb 26, 2024 at 11:56 PM EET, Dave Hansen wrote:
> On 2/26/24 13:48, Haitao Huang wrote:
> > In case of overcomitting, i.e., sum of limits greater than the EPC
> > capacity, if one group has a fault, and its usage is not above its own
> > limit (try_charge() passes), yet total usage of the system has exceeded
> > the capacity, whether we do global reclaim or just reclaim pages in the
> > current faulting group.
>
> I don't see _any_ reason to limit reclaim to the current faulting cgroup.
>
> >> Last, what is the simplest (least amount of code) thing that the SGX
> >> cgroup controller could implement here?
> > 
> > I still think the current approach of doing global reclaim is reasonable
> > and simple: try_charge() checks cgroup limit and reclaim within the
> > group if needed, then do EPC page allocation, reclaim globally if
> > allocation fails due to global usage reaches the capacity.
> > 
> > I'm not sure how not doing global reclaiming in this case would bring
> > any benefit.
> I tend to agree.
>
> Kai, I think your examples sound a little bit contrived.  Have actual
> users expressed a strong intent for doing anything with this series
> other than limiting bad actors from eating all the EPC?

I'd consider this from the viewpoint is there anything in the user space
visible portion of the patch set that would limit tuning the performance
later on, if required let's say by a workload that acts sub-optimally.

If not, then most of performance related issues can be only identified
by actual use of the code.

BR, Jarkko
diff mbox series

Patch

diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
index d399fda2b55e..abf74fdb12b4 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -184,13 +184,35 @@  static void sgx_epc_cgroup_reclaim_work_func(struct work_struct *work)
 /**
  * sgx_epc_cgroup_try_charge() - try to charge cgroup for a single EPC page
  * @epc_cg:	The EPC cgroup to be charged for the page.
+ * @reclaim:	Whether or not synchronous reclaim is allowed
  * Return:
  * * %0 - If successfully charged.
  * * -errno - for failures.
  */
-int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg)
+int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool reclaim)
 {
-	return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE);
+	for (;;) {
+		if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg,
+					PAGE_SIZE))
+			break;
+
+		if (sgx_epc_cgroup_lru_empty(epc_cg->cg))
+			return -ENOMEM;
+
+		if (signal_pending(current))
+			return -ERESTARTSYS;
+
+		if (!reclaim) {
+			queue_work(sgx_epc_cg_wq, &epc_cg->reclaim_work);
+			return -EBUSY;
+		}
+
+		if (!sgx_epc_cgroup_reclaim_pages(epc_cg->cg, false))
+			/* All pages were too young to reclaim, try again a little later */
+			schedule();
+	}
+
+	return 0;
 }
 
 /**
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
index e3c6a08f0ee8..d061cd807b45 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.h
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
@@ -23,7 +23,7 @@  static inline struct sgx_epc_cgroup *sgx_get_current_epc_cg(void)
 
 static inline void sgx_put_epc_cg(struct sgx_epc_cgroup *epc_cg) { }
 
-static inline int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg)
+static inline int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool reclaim)
 {
 	return 0;
 }
@@ -66,7 +66,7 @@  static inline void sgx_put_epc_cg(struct sgx_epc_cgroup *epc_cg)
 		put_misc_cg(epc_cg->cg);
 }
 
-int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg);
+int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool reclaim);
 void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg);
 bool sgx_epc_cgroup_lru_empty(struct misc_cg *root);
 void sgx_epc_cgroup_init(void);
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 51904f191b97..2279ae967707 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -588,7 +588,7 @@  struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
 	int ret;
 
 	epc_cg = sgx_get_current_epc_cg();
-	ret = sgx_epc_cgroup_try_charge(epc_cg);
+	ret = sgx_epc_cgroup_try_charge(epc_cg, reclaim);
 	if (ret) {
 		sgx_put_epc_cg(epc_cg);
 		return ERR_PTR(ret);