diff mbox series

[v12,09/14] x86/sgx: Implement async reclamation for cgroup

Message ID 20240416032011.58578-10-haitao.huang@linux.intel.com (mailing list archive)
State New, archived
Headers show
Series Add Cgroup support for SGX EPC memory | expand

Commit Message

Haitao Huang April 16, 2024, 3:20 a.m. UTC
From: Kristen Carlson Accardi <kristen@linux.intel.com>

In cases EPC pages need be allocated during a page fault and the cgroup
usage is near its limit, an asynchronous reclamation needs be triggered
to avoid blocking the page fault handling.

Create a workqueue, corresponding work item and function definitions
for EPC cgroup to support the asynchronous reclamation.

In case the workqueue allocation is failed during init, disable cgroup.

In sgx_cgroup_try_charge(), if caller does not allow synchronous
reclamation, queue an asynchronous work into the workqueue.

Reclaiming only when the usage is at or very close to the limit would
cause thrashing. To avoid that, before returning from
sgx_cgroup_try_charge(), check the need for reclamation (usage too close
to the limit), queue an async work if needed, similar to how the global
reclaimer wakes up its reclaiming thread after each allocation in
sgx_alloc_epc_pages().

Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com>
Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
Tested-by: Jarkko Sakkinen <jarkko@kernel.org>
---
V11:
- Print error instead of WARN (Kai)
- Add check for need to queue an async reclamation before returning from
try_charge(), do so if needed. This is to be consistent with global
reclaimer to minimize thrashing during allocation time.

V10:
- Split asynchronous flow in separate patch. (Kai)
- Consider cgroup disabled when the workqueue allocation fail during
init. (Kai)
- Abstract out sgx_cgroup_should_reclaim().

V9:
- Add comments for static variables. (Jarkko)

V8:
- Remove alignment for substructure variables. (Jarkko)

V7:
- Split this out from the big patch, #10 in V6. (Dave, Kai)
---
 arch/x86/kernel/cpu/sgx/epc_cgroup.c | 129 ++++++++++++++++++++++++++-
 arch/x86/kernel/cpu/sgx/epc_cgroup.h |   1 +
 2 files changed, 128 insertions(+), 2 deletions(-)

Comments

Huang, Kai April 19, 2024, 1:32 a.m. UTC | #1
On 16/04/2024 3:20 pm, Haitao Huang wrote:
> From: Kristen Carlson Accardi <kristen@linux.intel.com>
> 
> In cases EPC pages need be allocated during a page fault and the cgroup
> usage is near its limit, an asynchronous reclamation needs be triggered
> to avoid blocking the page fault handling.
> 
> Create a workqueue, corresponding work item and function definitions
> for EPC cgroup to support the asynchronous reclamation.
> 
> In case the workqueue allocation is failed during init, disable cgroup.

It's fine and reasonable to disable (SGX EPC) cgroup.  The problem is 
"exactly what does this mean" isn't quite clear.

Given SGX EPC is just one type of MISC cgroup resources, we cannot just 
disable MISC cgroup as a whole.

So, the first interpretation is we treat the entire MISC_CG_RES_SGX 
resource type doesn't exist, that is, we just don't show control files 
in the file system, and all EPC pages are tracked in the global list.

But it might be not straightforward to implement in the SGX driver, 
i.e., we might need to do more MISC cgroup core code change to make it 
being able to support disable particular resource at runtime -- I need 
to double check.

So if that is not something worth to do, we will still need to live with 
the fact that, the user is still able to create SGX cgroup in the 
hierarchy and see those control files, and being able to read/write them.

The second interpretation I suppose is, although the SGX cgroup is still 
seen as supported in userspace, in kernel we just treat it doesn't exist.

Specifically, that means: 1) we always return the root SGX cgroup for 
any EPC page when allocating a new one; 2) as a result, we still track 
all EPC pages in a single global list.

But from the code below ...


>   static int __sgx_cgroup_try_charge(struct sgx_cgroup *epc_cg)
>   {
>   	if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE))
> @@ -117,19 +226,28 @@ int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_reclaim reclaim)
>   {
>   	int ret;
>   
> +	/* cgroup disabled due to wq allocation failure during sgx_cgroup_init(). */
> +	if (!sgx_cg_wq)
> +		return 0;
> +

..., IIUC you choose a (third) solution that is even one more step back:

It just makes try_charge() always succeed, but EPC pages are still 
managed in the "per-cgroup" list.

But this solution, AFAICT, doesn't work.  The reason is when you fail to 
allocate EPC page you will do the global reclaim, but now the global 
list is empty.

Am I missing anything?

So my thinking is, we have two options:

1) Modify the MISC cgroup core code to allow the kernel to disable one 
particular resource.  It shouldn't be hard, e.g., we can add a 
'disabled' flag to the 'struct misc_res'.

Hmm.. wait, after checking, the MISC cgroup won't show any control files 
if the "capacity" of the resource is 0:

"
  * Miscellaneous resources capacity for the entire machine. 0 capacity
  * means resource is not initialized or not present in the host.
"

So I really suppose we should go with this route, i.e., by just setting 
the EPC capacity to 0?

Note misc_cg_try_charge() will fail if capacity is 0, but we can make it 
return success by explicitly check whether SGX cgroup is disabled by 
using a helper, e.g., sgx_cgroup_disabled().

And you always return the root SGX cgroup in sgx_get_current_cg() when 
sgx_cgroup_disabled() is true.

And in sgx_reclaim_pages_global(), you do something like:

	static void sgx_reclaim_pages_global(..)
	{
	#ifdef CONFIG_CGROUP_MISC
		if (sgx_cgroup_disabled())
			sgx_reclaim_pages(&sgx_root_cg.lru);
		else
			sgx_cgroup_reclaim_pages(misc_cg_root());
	#else
		sgx_reclaim_pages(&sgx_global_list);
	#endif
	}

I am perhaps missing some other spots too but you got the idea.

At last, after typing those, I believe we should have a separate patch 
to handle disable SGX cgroup at initialization time.  And you can even 
put this patch _somewhere_ after the patch

	"x86/sgx: Implement basic EPC misc cgroup functionality"

and before this patch.

It makes sense to have such patch anyway, because with it we can easily 
to add a kernel command line 'sgx_cgroup=disabled" if the user wants it 
disabled (when someone has such requirement in the future).
Haitao Huang April 19, 2024, 6:55 p.m. UTC | #2
On Thu, 18 Apr 2024 20:32:14 -0500, Huang, Kai <kai.huang@intel.com> wrote:

>
>
> On 16/04/2024 3:20 pm, Haitao Huang wrote:
>> From: Kristen Carlson Accardi <kristen@linux.intel.com>
>>  In cases EPC pages need be allocated during a page fault and the cgroup
>> usage is near its limit, an asynchronous reclamation needs be triggered
>> to avoid blocking the page fault handling.
>>  Create a workqueue, corresponding work item and function definitions
>> for EPC cgroup to support the asynchronous reclamation.
>>  In case the workqueue allocation is failed during init, disable cgroup.
>
> It's fine and reasonable to disable (SGX EPC) cgroup.  The problem is  
> "exactly what does this mean" isn't quite clear.
>
First, this is really some corner case most people don't care: during  
init, kernel can't even allocate a workqueue object. So I don't think we  
should write extra code to implement some sophisticated solution. Any  
solution we come up with may just not work as the way user want or solve  
the real issue due to the fact such allocation failure even happens at  
init time.

So IMHO the current solution should be fine and I'll answer some of your  
detailed questions below.

> Given SGX EPC is just one type of MISC cgroup resources, we cannot just  
> disable MISC cgroup as a whole.
>
> So, the first interpretation is we treat the entire MISC_CG_RES_SGX  
> resource type doesn't exist, that is, we just don't show control files  
> in the file system, and all EPC pages are tracked in the global list.
>
> But it might be not straightforward to implement in the SGX driver,  
> i.e., we might need to do more MISC cgroup core code change to make it  
> being able to support disable particular resource at runtime -- I need  
> to double check.
>
> So if that is not something worth to do, we will still need to live with  
> the fact that, the user is still able to create SGX cgroup in the  
> hierarchy and see those control files, and being able to read/write them.
>

Can not reliably predict what will happen. Most likely the ENOMEM will be  
returned by sgx_cgroup_alloc() if reached or other error in the stack if  
not reached to sgx_cgroup_alloc()
  and user fails on creating anything.

But if they do end up creating some cgroups (sgx_cgroup_alloc() and  
everything else  on the call stack passed without failure), everything  
still kind of works for the reason answered below.

> The second interpretation I suppose is, although the SGX cgroup is still  
> seen as supported in userspace, in kernel we just treat it doesn't exist.
>
> Specifically, that means: 1) we always return the root SGX cgroup for  
> any EPC page when allocating a new one; 2) as a result, we still track  
> all EPC pages in a single global list.
>

Current code has similar behavior without extra code.

> But from the code below ...
>
>
>>   static int __sgx_cgroup_try_charge(struct sgx_cgroup *epc_cg)
>>   {
>>   	if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE))
>> @@ -117,19 +226,28 @@ int sgx_cgroup_try_charge(struct sgx_cgroup  
>> *sgx_cg, enum sgx_reclaim reclaim)
>>   {
>>   	int ret;
>>   +	/* cgroup disabled due to wq allocation failure during  
>> sgx_cgroup_init(). */
>> +	if (!sgx_cg_wq)
>> +		return 0;
>> +
>
> ..., IIUC you choose a (third) solution that is even one more step back:
>
> It just makes try_charge() always succeed, but EPC pages are still  
> managed in the "per-cgroup" list.
>
> But this solution, AFAICT, doesn't work.  The reason is when you fail to  
> allocate EPC page you will do the global reclaim, but now the global  
> list is empty.
>
> Am I missing anything?

But when cgroups enabled in config, global reclamation starts from root  
and reclaim from the whole hierarchy if user may still be able to create.  
Just that we don't have async/sync per-cgroup reclaim triggered.

>
> So my thinking is, we have two options:
>
> 1) Modify the MISC cgroup core code to allow the kernel to disable one  
> particular resource.  It shouldn't be hard, e.g., we can add a  
> 'disabled' flag to the 'struct misc_res'.
>
> Hmm.. wait, after checking, the MISC cgroup won't show any control files  
> if the "capacity" of the resource is 0:
>
> "
>   * Miscellaneous resources capacity for the entire machine. 0 capacity
>   * means resource is not initialized or not present in the host.
> "
>
> So I really suppose we should go with this route, i.e., by just setting  
> the EPC capacity to 0?
>
> Note misc_cg_try_charge() will fail if capacity is 0, but we can make it  
> return success by explicitly check whether SGX cgroup is disabled by  
> using a helper, e.g., sgx_cgroup_disabled().
>
> And you always return the root SGX cgroup in sgx_get_current_cg() when  
> sgx_cgroup_disabled() is true.
>
> And in sgx_reclaim_pages_global(), you do something like:
>
> 	static void sgx_reclaim_pages_global(..)
> 	{
> 	#ifdef CONFIG_CGROUP_MISC
> 		if (sgx_cgroup_disabled())
> 			sgx_reclaim_pages(&sgx_root_cg.lru);
> 		else
> 			sgx_cgroup_reclaim_pages(misc_cg_root());
> 	#else
> 		sgx_reclaim_pages(&sgx_global_list);
> 	#endif
> 	}
>
> I am perhaps missing some other spots too but you got the idea.
>
> At last, after typing those, I believe we should have a separate patch  
> to handle disable SGX cgroup at initialization time.  And you can even  
> put this patch _somewhere_ after the patch
>
> 	"x86/sgx: Implement basic EPC misc cgroup functionality"
>
> and before this patch.
>
> It makes sense to have such patch anyway, because with it we can easily  
> to add a kernel command line 'sgx_cgroup=disabled" if the user wants it  
> disabled (when someone has such requirement in the future).
>

I think we can add support for "sgx_cgroup=disabled" in future if indeed  
needed. But just for init failure, no?

Thanks
Haitao
Huang, Kai April 19, 2024, 10:44 p.m. UTC | #3
On Fri, 2024-04-19 at 13:55 -0500, Haitao Huang wrote:
> On Thu, 18 Apr 2024 20:32:14 -0500, Huang, Kai <kai.huang@intel.com> wrote:
> 
> > 
> > 
> > On 16/04/2024 3:20 pm, Haitao Huang wrote:
> > > From: Kristen Carlson Accardi <kristen@linux.intel.com>
> > >  In cases EPC pages need be allocated during a page fault and the cgroup
> > > usage is near its limit, an asynchronous reclamation needs be triggered
> > > to avoid blocking the page fault handling.
> > >  Create a workqueue, corresponding work item and function definitions
> > > for EPC cgroup to support the asynchronous reclamation.
> > >  In case the workqueue allocation is failed during init, disable cgroup.
> > 
> > It's fine and reasonable to disable (SGX EPC) cgroup.  The problem is  
> > "exactly what does this mean" isn't quite clear.
> > 
> First, this is really some corner case most people don't care: during  
> init, kernel can't even allocate a workqueue object. So I don't think we  
> should write extra code to implement some sophisticated solution. Any  
> solution we come up with may just not work as the way user want or solve  
> the real issue due to the fact such allocation failure even happens at  
> init time.

I think for such boot time failure we can either choose directly BUG_ON(),
or we try to handle it _nicely_, but not half-way.  My experience is
adding BUG_ON() should be avoided in general, but it might be acceptable
during kernel boot.  I will leave it to others.


[...]

> > 
> > ..., IIUC you choose a (third) solution that is even one more step back:
> > 
> > It just makes try_charge() always succeed, but EPC pages are still  
> > managed in the "per-cgroup" list.
> > 
> > But this solution, AFAICT, doesn't work.  The reason is when you fail to  
> > allocate EPC page you will do the global reclaim, but now the global  
> > list is empty.
> > 
> > Am I missing anything?
> 
> But when cgroups enabled in config, global reclamation starts from root  
> and reclaim from the whole hierarchy if user may still be able to create.  
> Just that we don't have async/sync per-cgroup reclaim triggered.

OK.  I missed this as it is in a later patch.

> 
> > 
> > So my thinking is, we have two options:
> > 
> > 1) Modify the MISC cgroup core code to allow the kernel to disable one  
> > particular resource.  It shouldn't be hard, e.g., we can add a  
> > 'disabled' flag to the 'struct misc_res'.
> > 
> > Hmm.. wait, after checking, the MISC cgroup won't show any control files  
> > if the "capacity" of the resource is 0:
> > 
> > "
> >   * Miscellaneous resources capacity for the entire machine. 0 capacity
> >   * means resource is not initialized or not present in the host.
> > "
> > 
> > So I really suppose we should go with this route, i.e., by just setting  
> > the EPC capacity to 0?
> > 
> > Note misc_cg_try_charge() will fail if capacity is 0, but we can make it  
> > return success by explicitly check whether SGX cgroup is disabled by  
> > using a helper, e.g., sgx_cgroup_disabled().
> > 
> > And you always return the root SGX cgroup in sgx_get_current_cg() when  
> > sgx_cgroup_disabled() is true.
> > 
> > And in sgx_reclaim_pages_global(), you do something like:
> > 
> > 	static void sgx_reclaim_pages_global(..)
> > 	{
> > 	#ifdef CONFIG_CGROUP_MISC
> > 		if (sgx_cgroup_disabled())
> > 			sgx_reclaim_pages(&sgx_root_cg.lru);
> > 		else
> > 			sgx_cgroup_reclaim_pages(misc_cg_root());
> > 	#else
> > 		sgx_reclaim_pages(&sgx_global_list);
> > 	#endif
> > 	}
> > 
> > I am perhaps missing some other spots too but you got the idea.
> > 
> > At last, after typing those, I believe we should have a separate patch  
> > to handle disable SGX cgroup at initialization time.  And you can even  
> > put this patch _somewhere_ after the patch
> > 
> > 	"x86/sgx: Implement basic EPC misc cgroup functionality"
> > 
> > and before this patch.
> > 
> > It makes sense to have such patch anyway, because with it we can easily  
> > to add a kernel command line 'sgx_cgroup=disabled" if the user wants it  
> > disabled (when someone has such requirement in the future).
> > 
> 
> I think we can add support for "sgx_cgroup=disabled" in future if indeed  
> needed. But just for init failure, no?
> 

It's not about the commandline, which we can add in the future when
needed.  It's about we need to have a way to handle SGX cgroup being
disabled at boot time nicely, because we already have a case where we need
to do so.

Your approach looks half-way to me, and is not future extendible.  If we
choose to do it, do it right -- that is, we need a way to disable it
completely in both kernel and userspace so that userspace won't be able to
see it.
Haitao Huang April 20, 2024, 1:14 a.m. UTC | #4
On Fri, 19 Apr 2024 17:44:59 -0500, Huang, Kai <kai.huang@intel.com> wrote:

> On Fri, 2024-04-19 at 13:55 -0500, Haitao Huang wrote:
>> On Thu, 18 Apr 2024 20:32:14 -0500, Huang, Kai <kai.huang@intel.com>  
>> wrote:
>>
>> >
>> >
>> > On 16/04/2024 3:20 pm, Haitao Huang wrote:
>> > > From: Kristen Carlson Accardi <kristen@linux.intel.com>
>> > >  In cases EPC pages need be allocated during a page fault and the  
>> cgroup
>> > > usage is near its limit, an asynchronous reclamation needs be  
>> triggered
>> > > to avoid blocking the page fault handling.
>> > >  Create a workqueue, corresponding work item and function  
>> definitions
>> > > for EPC cgroup to support the asynchronous reclamation.
>> > >  In case the workqueue allocation is failed during init, disable  
>> cgroup.
>> >
>> > It's fine and reasonable to disable (SGX EPC) cgroup.  The problem is
>> > "exactly what does this mean" isn't quite clear.
>> >
>> First, this is really some corner case most people don't care: during
>> init, kernel can't even allocate a workqueue object. So I don't think we
>> should write extra code to implement some sophisticated solution. Any
>> solution we come up with may just not work as the way user want or solve
>> the real issue due to the fact such allocation failure even happens at
>> init time.
>
> I think for such boot time failure we can either choose directly  
> BUG_ON(),
> or we try to handle it _nicely_, but not half-way.  My experience is
> adding BUG_ON() should be avoided in general, but it might be acceptable
> during kernel boot.  I will leave it to others.
>
>
> [...]
>
>> >
>> > ..., IIUC you choose a (third) solution that is even one more step  
>> back:
>> >
>> > It just makes try_charge() always succeed, but EPC pages are still
>> > managed in the "per-cgroup" list.
>> >
>> > But this solution, AFAICT, doesn't work.  The reason is when you fail  
>> to
>> > allocate EPC page you will do the global reclaim, but now the global
>> > list is empty.
>> >
>> > Am I missing anything?
>>
>> But when cgroups enabled in config, global reclamation starts from root
>> and reclaim from the whole hierarchy if user may still be able to  
>> create.
>> Just that we don't have async/sync per-cgroup reclaim triggered.
>
> OK.  I missed this as it is in a later patch.
>
>>
>> >
>> > So my thinking is, we have two options:
>> >
>> > 1) Modify the MISC cgroup core code to allow the kernel to disable one
>> > particular resource.  It shouldn't be hard, e.g., we can add a
>> > 'disabled' flag to the 'struct misc_res'.
>> >
>> > Hmm.. wait, after checking, the MISC cgroup won't show any control  
>> files
>> > if the "capacity" of the resource is 0:
>> >
>> > "
>> >   * Miscellaneous resources capacity for the entire machine. 0  
>> capacity
>> >   * means resource is not initialized or not present in the host.
>> > "
>> >
>> > So I really suppose we should go with this route, i.e., by just  
>> setting
>> > the EPC capacity to 0?
>> >
>> > Note misc_cg_try_charge() will fail if capacity is 0, but we can make  
>> it
>> > return success by explicitly check whether SGX cgroup is disabled by
>> > using a helper, e.g., sgx_cgroup_disabled().
>> >
>> > And you always return the root SGX cgroup in sgx_get_current_cg() when
>> > sgx_cgroup_disabled() is true.
>> >
>> > And in sgx_reclaim_pages_global(), you do something like:
>> >
>> > 	static void sgx_reclaim_pages_global(..)
>> > 	{
>> > 	#ifdef CONFIG_CGROUP_MISC
>> > 		if (sgx_cgroup_disabled())
>> > 			sgx_reclaim_pages(&sgx_root_cg.lru);
>> > 		else
>> > 			sgx_cgroup_reclaim_pages(misc_cg_root());
>> > 	#else
>> > 		sgx_reclaim_pages(&sgx_global_list);
>> > 	#endif
>> > 	}
>> >
>> > I am perhaps missing some other spots too but you got the idea.
>> >
>> > At last, after typing those, I believe we should have a separate patch
>> > to handle disable SGX cgroup at initialization time.  And you can even
>> > put this patch _somewhere_ after the patch
>> >
>> > 	"x86/sgx: Implement basic EPC misc cgroup functionality"
>> >
>> > and before this patch.
>> >
>> > It makes sense to have such patch anyway, because with it we can  
>> easily
>> > to add a kernel command line 'sgx_cgroup=disabled" if the user wants  
>> it
>> > disabled (when someone has such requirement in the future).
>> >
>>
>> I think we can add support for "sgx_cgroup=disabled" in future if indeed
>> needed. But just for init failure, no?
>>
>
> It's not about the commandline, which we can add in the future when
> needed.  It's about we need to have a way to handle SGX cgroup being
> disabled at boot time nicely, because we already have a case where we  
> need
> to do so.
>
> Your approach looks half-way to me, and is not future extendible.  If we
> choose to do it, do it right -- that is, we need a way to disable it
> completely in both kernel and userspace so that userspace won't be able  
> to
> see it.

That would need more changes in misc cgroup implementation to support  
sgx-disable. Right now misc does not have separate files for different  
resource types. So we can only block echo "sgx_epc..." to those interface  
files, can't really make files not visible.

Anyway, I see this is really a configuration failure case. And we handle  
it without added complexity and do as much as we can gracefully until  
another fatal error happens. So if we need support disable sgx through  
command line in future, then we can also make this failure handling even  
more graceful at that time. Nothing really is lost IMHO.

Originally we put in BUG_ON() but then we changed to handle it based on  
your feedback. I can put BUG_ON() back if we agree that's more appropriate  
at the moment.

@Dave, @Jarkko, @Michal, your thoughts?

Thanks
Haitao
Huang, Kai April 22, 2024, 12:22 a.m. UTC | #5
On Fri, 2024-04-19 at 20:14 -0500, Haitao Huang wrote:
> > > I think we can add support for "sgx_cgroup=disabled" in future if indeed
> > > needed. But just for init failure, no?
> > > 
> > 
> > It's not about the commandline, which we can add in the future when
> > needed.  It's about we need to have a way to handle SGX cgroup being
> > disabled at boot time nicely, because we already have a case where we  
> > need
> > to do so.
> > 
> > Your approach looks half-way to me, and is not future extendible.  If we
> > choose to do it, do it right -- that is, we need a way to disable it
> > completely in both kernel and userspace so that userspace won't be able  
> > to
> > see it.
> 
> That would need more changes in misc cgroup implementation to support  
> sgx-disable. Right now misc does not have separate files for different  
> resource types. So we can only block echo "sgx_epc..." to those interface  
> files, can't really make files not visible.

"won't be able to see" I mean "only for SGX EPC resource", but not the
control files for the entire MISC cgroup.

I replied at the beginning of the previous reply:

"
Given SGX EPC is just one type of MISC cgroup resources, we cannot just 
disable MISC cgroup as a whole.
"

You just need to set the SGX EPC "capacity" to 0 to disable SGX EPC.  See
the comment of @misc_res_capacity:

 * Miscellaneous resources capacity for the entire machine. 0 capacity
 * means resource is not initialized or not present in the host.

And "blocking echo sgx_epc ... to those control files" is already
sufficient for the purpose of not exposing SGX EPC to userspace, correct?

E.g., if SGX cgroup is enabled, you can see below when you read "max":

 # cat /sys/fs/cgroup/my_group/misc.max
 # <resource1> <max1>
   sgx_epc ...
   ...

Otherwise you won't be able to see "sgx_epc":

 # cat /sys/fs/cgroup/my_group/misc.max
 # <resource1> <max1>
   ...

And when you try to write the "max" for "sgx_epc", you will hit error:

 # echo "sgx_epc 100" > /sys/fs/cgroup/my_group/misc.max
 # ... echo: write error: Invalid argument

The above applies to all the control files.  To me this is pretty much
means "SGX EPC is disabled" or "not supported" for userspace. 

Am I missing anything?
Haitao Huang April 22, 2024, 4:17 p.m. UTC | #6
On Sun, 21 Apr 2024 19:22:27 -0500, Huang, Kai <kai.huang@intel.com> wrote:

> On Fri, 2024-04-19 at 20:14 -0500, Haitao Huang wrote:
>> > > I think we can add support for "sgx_cgroup=disabled" in future if  
>> indeed
>> > > needed. But just for init failure, no?
>> > >
>> >
>> > It's not about the commandline, which we can add in the future when
>> > needed.  It's about we need to have a way to handle SGX cgroup being
>> > disabled at boot time nicely, because we already have a case where we 
>> > need
>> > to do so.
>> >
>> > Your approach looks half-way to me, and is not future extendible.  If  
>> we
>> > choose to do it, do it right -- that is, we need a way to disable it
>> > completely in both kernel and userspace so that userspace won't be  
>> able> to
>> > see it.
>>
>> That would need more changes in misc cgroup implementation to support 
>> sgx-disable. Right now misc does not have separate files for different 
>> resource types. So we can only block echo "sgx_epc..." to those  
>> interfacefiles, can't really make files not visible.
>
> "won't be able to see" I mean "only for SGX EPC resource", but not the
> control files for the entire MISC cgroup.
>
> I replied at the beginning of the previous reply:
>
> "
> Given SGX EPC is just one type of MISC cgroup resources, we cannot just
> disable MISC cgroup as a whole.
> "
>
Sorry I missed this point. below.

> You just need to set the SGX EPC "capacity" to 0 to disable SGX EPC.  See
> the comment of @misc_res_capacity:
>
>  * Miscellaneous resources capacity for the entire machine. 0 capacity
>  * means resource is not initialized or not present in the host.
>

IIUC I don't think the situation we have is either of those cases. For our  
case, resource is inited and present on the host but we have allocation  
error for sgx cgroup infra.

> And "blocking echo sgx_epc ... to those control files" is already
> sufficient for the purpose of not exposing SGX EPC to userspace, correct?
>
> E.g., if SGX cgroup is enabled, you can see below when you read "max":
>
>  # cat /sys/fs/cgroup/my_group/misc.max
>  # <resource1> <max1>
>    sgx_epc ...
>    ...
>
> Otherwise you won't be able to see "sgx_epc":
>
>  # cat /sys/fs/cgroup/my_group/misc.max
>  # <resource1> <max1>
>    ...
>
> And when you try to write the "max" for "sgx_epc", you will hit error:
>
>  # echo "sgx_epc 100" > /sys/fs/cgroup/my_group/misc.max
>  # ... echo: write error: Invalid argument
>
> The above applies to all the control files.  To me this is pretty much
> means "SGX EPC is disabled" or "not supported" for userspace.
>
You are right, capacity == 0 does block echoing max and users see an error  
if they do that. But 1) doubt you literately wanted "SGX EPC is disabled"  
and make it unsupported in this case, 2) even if we accept this is "sgx  
cgroup disabled" I don't see how it is much better user experience than  
current solution or really helps user better.

Also to implement this approach, as you mentioned, we need workaround the  
fact that misc_try_charge() fails when capacity set to zero, and adding  
code to return root always? So it seems like more workaround code to just  
make it work for a failing case no one really care much and end result is  
not really much better IMHO.

Thanks
Haitao
Huang, Kai April 22, 2024, 10:16 p.m. UTC | #7
On Mon, 2024-04-22 at 11:17 -0500, Haitao Huang wrote:
> On Sun, 21 Apr 2024 19:22:27 -0500, Huang, Kai <kai.huang@intel.com> wrote:
> 
> > On Fri, 2024-04-19 at 20:14 -0500, Haitao Huang wrote:
> > > > > I think we can add support for "sgx_cgroup=disabled" in future if  
> > > indeed
> > > > > needed. But just for init failure, no?
> > > > > 
> > > > 
> > > > It's not about the commandline, which we can add in the future when
> > > > needed.  It's about we need to have a way to handle SGX cgroup being
> > > > disabled at boot time nicely, because we already have a case where we 
> > > > need
> > > > to do so.
> > > > 
> > > > Your approach looks half-way to me, and is not future extendible.  If  
> > > we
> > > > choose to do it, do it right -- that is, we need a way to disable it
> > > > completely in both kernel and userspace so that userspace won't be  
> > > able> to
> > > > see it.
> > > 
> > > That would need more changes in misc cgroup implementation to support 
> > > sgx-disable. Right now misc does not have separate files for different 
> > > resource types. So we can only block echo "sgx_epc..." to those  
> > > interfacefiles, can't really make files not visible.
> > 
> > "won't be able to see" I mean "only for SGX EPC resource", but not the
> > control files for the entire MISC cgroup.
> > 
> > I replied at the beginning of the previous reply:
> > 
> > "
> > Given SGX EPC is just one type of MISC cgroup resources, we cannot just
> > disable MISC cgroup as a whole.
> > "
> > 
> Sorry I missed this point. below.
> 
> > You just need to set the SGX EPC "capacity" to 0 to disable SGX EPC.  See
> > the comment of @misc_res_capacity:
> > 
> >  * Miscellaneous resources capacity for the entire machine. 0 capacity
> >  * means resource is not initialized or not present in the host.
> > 
> 
> IIUC I don't think the situation we have is either of those cases. For our  
> case, resource is inited and present on the host but we have allocation  
> error for sgx cgroup infra.

You have calculated the "capacity", but later you failed something and
then reset the "capacity" to 0, i.e., cleanup.  What's wrong with that?

> 
> > And "blocking echo sgx_epc ... to those control files" is already
> > sufficient for the purpose of not exposing SGX EPC to userspace, correct?
> > 
> > E.g., if SGX cgroup is enabled, you can see below when you read "max":
> > 
> >  # cat /sys/fs/cgroup/my_group/misc.max
> >  # <resource1> <max1>
> >    sgx_epc ...
> >    ...
> > 
> > Otherwise you won't be able to see "sgx_epc":
> > 
> >  # cat /sys/fs/cgroup/my_group/misc.max
> >  # <resource1> <max1>
> >    ...
> > 
> > And when you try to write the "max" for "sgx_epc", you will hit error:
> > 
> >  # echo "sgx_epc 100" > /sys/fs/cgroup/my_group/misc.max
> >  # ... echo: write error: Invalid argument
> > 
> > The above applies to all the control files.  To me this is pretty much
> > means "SGX EPC is disabled" or "not supported" for userspace.
> > 
> You are right, capacity == 0 does block echoing max and users see an error  
> if they do that. But 1) doubt you literately wanted "SGX EPC is disabled"  
> and make it unsupported in this case, 
> 

I don't understand.  Something failed during SGX cgroup initialization,
you _literally_ cannot continue to support it.


> 2) even if we accept this is "sgx  
> cgroup disabled" I don't see how it is much better user experience than  
> current solution or really helps user better.

In your way, the userspace is still able to see "sgx_epc" in control files
and is able to update them.  So from userspace's perspective SGX cgroup is
enabled, but obviously updating to "max" doesn't have any impact.  This
will confuse userspace.

> 
> Also to implement this approach, as you mentioned, we need workaround the  
> fact that misc_try_charge() fails when capacity set to zero, and adding  
> code to return root always? 
> 

Why this is a problem?

> So it seems like more workaround code to just  
> make it work for a failing case no one really care much and end result is  
> not really much better IMHO.

It's not workaround, it's the right thing to do.

The result is userspace will see it being disabled when kernel disables
it.
Haitao Huang April 23, 2024, 1:08 p.m. UTC | #8
On Mon, 22 Apr 2024 17:16:34 -0500, Huang, Kai <kai.huang@intel.com> wrote:

> On Mon, 2024-04-22 at 11:17 -0500, Haitao Huang wrote:
>> On Sun, 21 Apr 2024 19:22:27 -0500, Huang, Kai <kai.huang@intel.com>  
>> wrote:
>>
>> > On Fri, 2024-04-19 at 20:14 -0500, Haitao Huang wrote:
>> > > > > I think we can add support for "sgx_cgroup=disabled" in future  
>> if
>> > > indeed
>> > > > > needed. But just for init failure, no?
>> > > > >
>> > > >
>> > > > It's not about the commandline, which we can add in the future  
>> when
>> > > > needed.  It's about we need to have a way to handle SGX cgroup  
>> being
>> > > > disabled at boot time nicely, because we already have a case  
>> where we
>> > > > need
>> > > > to do so.
>> > > >
>> > > > Your approach looks half-way to me, and is not future  
>> extendible.  If
>> > > we
>> > > > choose to do it, do it right -- that is, we need a way to disable  
>> it
>> > > > completely in both kernel and userspace so that userspace won't be
>> > > able> to
>> > > > see it.
>> > >
>> > > That would need more changes in misc cgroup implementation to  
>> support
>> > > sgx-disable. Right now misc does not have separate files for  
>> different
>> > > resource types. So we can only block echo "sgx_epc..." to those
>> > > interfacefiles, can't really make files not visible.
>> >
>> > "won't be able to see" I mean "only for SGX EPC resource", but not the
>> > control files for the entire MISC cgroup.
>> >
>> > I replied at the beginning of the previous reply:
>> >
>> > "
>> > Given SGX EPC is just one type of MISC cgroup resources, we cannot  
>> just
>> > disable MISC cgroup as a whole.
>> > "
>> >
>> Sorry I missed this point. below.
>>
>> > You just need to set the SGX EPC "capacity" to 0 to disable SGX EPC.   
>> See
>> > the comment of @misc_res_capacity:
>> >
>> >  * Miscellaneous resources capacity for the entire machine. 0 capacity
>> >  * means resource is not initialized or not present in the host.
>> >
>>
>> IIUC I don't think the situation we have is either of those cases. For  
>> our
>> case, resource is inited and present on the host but we have allocation
>> error for sgx cgroup infra.
>
> You have calculated the "capacity", but later you failed something and
> then reset the "capacity" to 0, i.e., cleanup.  What's wrong with that?
>
>>
>> > And "blocking echo sgx_epc ... to those control files" is already
>> > sufficient for the purpose of not exposing SGX EPC to userspace,  
>> correct?
>> >
>> > E.g., if SGX cgroup is enabled, you can see below when you read "max":
>> >
>> >  # cat /sys/fs/cgroup/my_group/misc.max
>> >  # <resource1> <max1>
>> >    sgx_epc ...
>> >    ...
>> >
>> > Otherwise you won't be able to see "sgx_epc":
>> >
>> >  # cat /sys/fs/cgroup/my_group/misc.max
>> >  # <resource1> <max1>
>> >    ...
>> >
>> > And when you try to write the "max" for "sgx_epc", you will hit error:
>> >
>> >  # echo "sgx_epc 100" > /sys/fs/cgroup/my_group/misc.max
>> >  # ... echo: write error: Invalid argument
>> >
>> > The above applies to all the control files.  To me this is pretty much
>> > means "SGX EPC is disabled" or "not supported" for userspace.
>> >
>> You are right, capacity == 0 does block echoing max and users see an  
>> error
>> if they do that. But 1) doubt you literately wanted "SGX EPC is  
>> disabled"
>> and make it unsupported in this case,
>
> I don't understand.  Something failed during SGX cgroup initialization,
> you _literally_ cannot continue to support it.
>
>

Then we should just return -ENOMEM from sgx_init() when sgx cgroup  
initialization fails?
I thought we only disable SGX cgroup support. SGX can still run.

>> 2) even if we accept this is "sgx
>> cgroup disabled" I don't see how it is much better user experience than
>> current solution or really helps user better.
>
> In your way, the userspace is still able to see "sgx_epc" in control  
> files
> and is able to update them.  So from userspace's perspective SGX cgroup  
> is
> enabled, but obviously updating to "max" doesn't have any impact.  This
> will confuse userspace.
>
>>

Setting capacity to zero also confuses user space. Some application may  
rely on this file to know the capacity.

>> Also to implement this approach, as you mentioned, we need workaround  
>> the
>> fact that misc_try_charge() fails when capacity set to zero, and adding
>> code to return root always?
>
> Why this is a problem?
>

It changes/overrides the the original meaning of capacity==0: no one can  
allocate if capacity is zero.

>> So it seems like more workaround code to just
>> make it work for a failing case no one really care much and end result  
>> is
>> not really much better IMHO.
>
> It's not workaround, it's the right thing to do.
>
> The result is userspace will see it being disabled when kernel disables
> it.
>
>
It's a workaround because you use the capacity==0 but it does not really  
mean to disable the misc cgroup for specific resource IIUC.

There is explicit way for user to disable misc without setting capacity to  
zero. So in future if we want really disable sgx_epc cgroup specifically  
we should not use capacity.  Therefore your approach is not  
extensible/reusable.

Given this is a rare corner case caused by configuration, we can only do  
as much as possible IMHO, not trying to implement a perfect solution at  
the moment. Maybe BUG_ON() is more appropriate?

(If you want to disable sgx, we can return error from sgx_init() but I  
still doubt that's what you meant.)

Thanks
Haitao
Huang, Kai April 23, 2024, 2:19 p.m. UTC | #9
On Tue, 2024-04-23 at 08:08 -0500, Haitao Huang wrote:
> On Mon, 22 Apr 2024 17:16:34 -0500, Huang, Kai <kai.huang@intel.com> wrote:
> 
> > On Mon, 2024-04-22 at 11:17 -0500, Haitao Huang wrote:
> > > On Sun, 21 Apr 2024 19:22:27 -0500, Huang, Kai <kai.huang@intel.com>  
> > > wrote:
> > > 
> > > > On Fri, 2024-04-19 at 20:14 -0500, Haitao Huang wrote:
> > > > > > > I think we can add support for "sgx_cgroup=disabled" in future  
> > > if
> > > > > indeed
> > > > > > > needed. But just for init failure, no?
> > > > > > > 
> > > > > > 
> > > > > > It's not about the commandline, which we can add in the future  
> > > when
> > > > > > needed.  It's about we need to have a way to handle SGX cgroup  
> > > being
> > > > > > disabled at boot time nicely, because we already have a case  
> > > where we
> > > > > > need
> > > > > > to do so.
> > > > > > 
> > > > > > Your approach looks half-way to me, and is not future  
> > > extendible.  If
> > > > > we
> > > > > > choose to do it, do it right -- that is, we need a way to disable  
> > > it
> > > > > > completely in both kernel and userspace so that userspace won't be
> > > > > able> to
> > > > > > see it.
> > > > > 
> > > > > That would need more changes in misc cgroup implementation to  
> > > support
> > > > > sgx-disable. Right now misc does not have separate files for  
> > > different
> > > > > resource types. So we can only block echo "sgx_epc..." to those
> > > > > interfacefiles, can't really make files not visible.
> > > > 
> > > > "won't be able to see" I mean "only for SGX EPC resource", but not the
> > > > control files for the entire MISC cgroup.
> > > > 
> > > > I replied at the beginning of the previous reply:
> > > > 
> > > > "
> > > > Given SGX EPC is just one type of MISC cgroup resources, we cannot  
> > > just
> > > > disable MISC cgroup as a whole.
> > > > "
> > > > 
> > > Sorry I missed this point. below.
> > > 
> > > > You just need to set the SGX EPC "capacity" to 0 to disable SGX EPC.   
> > > See
> > > > the comment of @misc_res_capacity:
> > > > 
> > > >  * Miscellaneous resources capacity for the entire machine. 0 capacity
> > > >  * means resource is not initialized or not present in the host.
> > > > 
> > > 
> > > IIUC I don't think the situation we have is either of those cases. For  
> > > our
> > > case, resource is inited and present on the host but we have allocation
> > > error for sgx cgroup infra.
> > 
> > You have calculated the "capacity", but later you failed something and
> > then reset the "capacity" to 0, i.e., cleanup.  What's wrong with that?
> > 
> > > 
> > > > And "blocking echo sgx_epc ... to those control files" is already
> > > > sufficient for the purpose of not exposing SGX EPC to userspace,  
> > > correct?
> > > > 
> > > > E.g., if SGX cgroup is enabled, you can see below when you read "max":
> > > > 
> > > >  # cat /sys/fs/cgroup/my_group/misc.max
> > > >  # <resource1> <max1>
> > > >    sgx_epc ...
> > > >    ...
> > > > 
> > > > Otherwise you won't be able to see "sgx_epc":
> > > > 
> > > >  # cat /sys/fs/cgroup/my_group/misc.max
> > > >  # <resource1> <max1>
> > > >    ...
> > > > 
> > > > And when you try to write the "max" for "sgx_epc", you will hit error:
> > > > 
> > > >  # echo "sgx_epc 100" > /sys/fs/cgroup/my_group/misc.max
> > > >  # ... echo: write error: Invalid argument
> > > > 
> > > > The above applies to all the control files.  To me this is pretty much
> > > > means "SGX EPC is disabled" or "not supported" for userspace.
> > > > 
> > > You are right, capacity == 0 does block echoing max and users see an  
> > > error
> > > if they do that. But 1) doubt you literately wanted "SGX EPC is  
> > > disabled"
> > > and make it unsupported in this case,
> > 
> > I don't understand.  Something failed during SGX cgroup initialization,
> > you _literally_ cannot continue to support it.
> > 
> > 
> 
> Then we should just return -ENOMEM from sgx_init() when sgx cgroup  
> initialization fails?
> I thought we only disable SGX cgroup support. SGX can still run.

I am not sure how you got this conclusion.  I specifically said something
failed during SGX "cgroup" initialization, so only SGX "cgroup" needs to
be disabled, not SGX as a whole.

> 
> > > 2) even if we accept this is "sgx
> > > cgroup disabled" I don't see how it is much better user experience than
> > > current solution or really helps user better.
> > 
> > In your way, the userspace is still able to see "sgx_epc" in control  
> > files
> > and is able to update them.  So from userspace's perspective SGX cgroup  
> > is
> > enabled, but obviously updating to "max" doesn't have any impact.  This
> > will confuse userspace.
> > 
> > > 
> 
> Setting capacity to zero also confuses user space. Some application may  
> rely on this file to know the capacity.


Why??

Are you saying before this SGX cgroup patchset those applications cannot
run?

> 
> > > Also to implement this approach, as you mentioned, we need workaround  
> > > the
> > > fact that misc_try_charge() fails when capacity set to zero, and adding
> > > code to return root always?
> > 
> > Why this is a problem?
> > 
> 
> It changes/overrides the the original meaning of capacity==0: no one can  
> allocate if capacity is zero.

Why??

Are you saying before this series, no one can allocate EPC page?

> 
> > > So it seems like more workaround code to just
> > > make it work for a failing case no one really care much and end result  
> > > is
> > > not really much better IMHO.
> > 
> > It's not workaround, it's the right thing to do.
> > 
> > The result is userspace will see it being disabled when kernel disables
> > it.
> > 
> > 
> It's a workaround because you use the capacity==0 but it does not really  
> mean to disable the misc cgroup for specific resource IIUC.

Please read the comment around @misc_res_capacity again:

 * Miscellaneous resources capacity for the entire machine. 0 capacity
 * means resource is not initialized or not present in the host.

> 
> There is explicit way for user to disable misc without setting capacity to  
> zero. 
> 

Which way are you talking about?

> So in future if we want really disable sgx_epc cgroup specifically  
> we should not use capacity.  Therefore your approach is not  
> extensible/reusable.
> 
> Given this is a rare corner case caused by configuration, we can only do  
> as much as possible IMHO, not trying to implement a perfect solution at  
> the moment. Maybe BUG_ON() is more appropriate?
> 

I think I will reply this thread for the last time:

I don't have strong opinion to against using BUG_ON() when you fail to
allocate workqueue.  If you choose to do this, I'll leave to others.

If you want to "disable SGX cgroup" when you fail to allocate workqueue,
reset the "capacity" to 0 to disable it.
Haitao Huang April 23, 2024, 3:30 p.m. UTC | #10
On Tue, 23 Apr 2024 09:19:53 -0500, Huang, Kai <kai.huang@intel.com> wrote:

> On Tue, 2024-04-23 at 08:08 -0500, Haitao Huang wrote:
>> On Mon, 22 Apr 2024 17:16:34 -0500, Huang, Kai <kai.huang@intel.com>  
>> wrote:
>>
>> > On Mon, 2024-04-22 at 11:17 -0500, Haitao Huang wrote:
>> > > On Sun, 21 Apr 2024 19:22:27 -0500, Huang, Kai <kai.huang@intel.com>
>> > > wrote:
>> > >
>> > > > On Fri, 2024-04-19 at 20:14 -0500, Haitao Huang wrote:
>> > > > > > > I think we can add support for "sgx_cgroup=disabled" in  
>> future
>> > > if
>> > > > > indeed
>> > > > > > > needed. But just for init failure, no?
>> > > > > > >
>> > > > > >
>> > > > > > It's not about the commandline, which we can add in the future
>> > > when
>> > > > > > needed.  It's about we need to have a way to handle SGX cgroup
>> > > being
>> > > > > > disabled at boot time nicely, because we already have a case
>> > > where we
>> > > > > > need
>> > > > > > to do so.
>> > > > > >
>> > > > > > Your approach looks half-way to me, and is not future
>> > > extendible.  If
>> > > > > we
>> > > > > > choose to do it, do it right -- that is, we need a way to  
>> disable
>> > > it
>> > > > > > completely in both kernel and userspace so that userspace  
>> won't be
>> > > > > able> to
>> > > > > > see it.
>> > > > >
>> > > > > That would need more changes in misc cgroup implementation to
>> > > support
>> > > > > sgx-disable. Right now misc does not have separate files for
>> > > different
>> > > > > resource types. So we can only block echo "sgx_epc..." to those
>> > > > > interfacefiles, can't really make files not visible.
>> > > >
>> > > > "won't be able to see" I mean "only for SGX EPC resource", but  
>> not the
>> > > > control files for the entire MISC cgroup.
>> > > >
>> > > > I replied at the beginning of the previous reply:
>> > > >
>> > > > "
>> > > > Given SGX EPC is just one type of MISC cgroup resources, we cannot
>> > > just
>> > > > disable MISC cgroup as a whole.
>> > > > "
>> > > >
>> > > Sorry I missed this point. below.
>> > >
>> > > > You just need to set the SGX EPC "capacity" to 0 to disable SGX  
>> EPC.
>> > > See
>> > > > the comment of @misc_res_capacity:
>> > > >
>> > > >  * Miscellaneous resources capacity for the entire machine. 0  
>> capacity
>> > > >  * means resource is not initialized or not present in the host.
>> > > >
>> > >
>> > > IIUC I don't think the situation we have is either of those cases.  
>> For
>> > > our
>> > > case, resource is inited and present on the host but we have  
>> allocation
>> > > error for sgx cgroup infra.
>> >
>> > You have calculated the "capacity", but later you failed something and
>> > then reset the "capacity" to 0, i.e., cleanup.  What's wrong with  
>> that?
>> >
>> > >
>> > > > And "blocking echo sgx_epc ... to those control files" is already
>> > > > sufficient for the purpose of not exposing SGX EPC to userspace,
>> > > correct?
>> > > >
>> > > > E.g., if SGX cgroup is enabled, you can see below when you read  
>> "max":
>> > > >
>> > > >  # cat /sys/fs/cgroup/my_group/misc.max
>> > > >  # <resource1> <max1>
>> > > >    sgx_epc ...
>> > > >    ...
>> > > >
>> > > > Otherwise you won't be able to see "sgx_epc":
>> > > >
>> > > >  # cat /sys/fs/cgroup/my_group/misc.max
>> > > >  # <resource1> <max1>
>> > > >    ...
>> > > >
>> > > > And when you try to write the "max" for "sgx_epc", you will hit  
>> error:
>> > > >
>> > > >  # echo "sgx_epc 100" > /sys/fs/cgroup/my_group/misc.max
>> > > >  # ... echo: write error: Invalid argument
>> > > >
>> > > > The above applies to all the control files.  To me this is pretty  
>> much
>> > > > means "SGX EPC is disabled" or "not supported" for userspace.
>> > > >
>> > > You are right, capacity == 0 does block echoing max and users see an
>> > > error
>> > > if they do that. But 1) doubt you literately wanted "SGX EPC is
>> > > disabled"
>> > > and make it unsupported in this case,
>> >
>> > I don't understand.  Something failed during SGX cgroup  
>> initialization,
>> > you _literally_ cannot continue to support it.
>> >
>> >
>>
>> Then we should just return -ENOMEM from sgx_init() when sgx cgroup
>> initialization fails?
>> I thought we only disable SGX cgroup support. SGX can still run.
>
> I am not sure how you got this conclusion.  I specifically said something
> failed during SGX "cgroup" initialization, so only SGX "cgroup" needs to
> be disabled, not SGX as a whole.
>
>>
>> > > 2) even if we accept this is "sgx
>> > > cgroup disabled" I don't see how it is much better user experience  
>> than
>> > > current solution or really helps user better.
>> >
>> > In your way, the userspace is still able to see "sgx_epc" in control
>> > files
>> > and is able to update them.  So from userspace's perspective SGX  
>> cgroup
>> > is
>> > enabled, but obviously updating to "max" doesn't have any impact.   
>> This
>> > will confuse userspace.
>> >
>> > >
>>
>> Setting capacity to zero also confuses user space. Some application may
>> rely on this file to know the capacity.
>
>
> Why??
>
> Are you saying before this SGX cgroup patchset those applications cannot
> run?
>
>>
>> > > Also to implement this approach, as you mentioned, we need  
>> workaround
>> > > the
>> > > fact that misc_try_charge() fails when capacity set to zero, and  
>> adding
>> > > code to return root always?
>> >
>> > Why this is a problem?
>> >
>>
>> It changes/overrides the the original meaning of capacity==0: no one can
>> allocate if capacity is zero.
>
> Why??
>
> Are you saying before this series, no one can allocate EPC page?
>
>>
>> > > So it seems like more workaround code to just
>> > > make it work for a failing case no one really care much and end  
>> result
>> > > is
>> > > not really much better IMHO.
>> >
>> > It's not workaround, it's the right thing to do.
>> >
>> > The result is userspace will see it being disabled when kernel  
>> disables
>> > it.
>> >
>> >
>> It's a workaround because you use the capacity==0 but it does not really
>> mean to disable the misc cgroup for specific resource IIUC.
>
> Please read the comment around @misc_res_capacity again:
>
>  * Miscellaneous resources capacity for the entire machine. 0 capacity
>  * means resource is not initialized or not present in the host.
>

I mentioned this in earlier email. I think this means no SGX EPC. It does  
not mean sgx epc cgroup not enabled. That's also consistent with the  
behavior try_charge() fails if capacity is zero.

>>
>> There is explicit way for user to disable misc without setting capacity  
>> to
>> zero.
>
> Which way are you talking about?

Echo "-misc" to cgroup.subtree_control at root level for example still  
shows non-zero sgx_epc capacity.

>
>> So in future if we want really disable sgx_epc cgroup specifically
>> we should not use capacity.  Therefore your approach is not
>> extensible/reusable.
>>
>> Given this is a rare corner case caused by configuration, we can only do
>> as much as possible IMHO, not trying to implement a perfect solution at
>> the moment. Maybe BUG_ON() is more appropriate?
>>
>
> I think I will reply this thread for the last time:
>
> I don't have strong opinion to against using BUG_ON() when you fail to
> allocate workqueue.  If you choose to do this, I'll leave to others.
>
> If you want to "disable SGX cgroup" when you fail to allocate workqueue,
> reset the "capacity" to 0 to disable it.

Unless I hear otherwise, I'll revert to BUG_ON().

Thanks
Haitao
Huang, Kai April 23, 2024, 10:13 p.m. UTC | #11
On Tue, 2024-04-23 at 10:30 -0500, Haitao Huang wrote:
> > > It's a workaround because you use the capacity==0 but it does not really
> > > mean to disable the misc cgroup for specific resource IIUC.
> > 
> > Please read the comment around @misc_res_capacity again:
> > 
> >   * Miscellaneous resources capacity for the entire machine. 0 capacity
> >   * means resource is not initialized or not present in the host.
> > 
> 
> I mentioned this in earlier email. I think this means no SGX EPC. It does  
> not mean sgx epc cgroup not enabled. That's also consistent with the  
> behavior try_charge() fails if capacity is zero.

OK. To me the "capacity" is purely the concept of cgroup, so it must be
interpreted within the scope of "cgroup".  If cgroup, in our case, SGX
cgroup, is disabled, then whether "leaving the capacity to reflect the
presence of hardware resource" doesn't really matter. 

So what you are saying is that, the kernel must initialize the capacity of
some MISC resource once it is added as valid type.  

And you must initialize the "capacity" even MISC cgroup is disabled
entirely by kernel commandline, in which case, IIUC, misc.capacity is not
even going to show in the /fs.

If this is your point, then your patch:

	cgroup/misc: Add SGX EPC resource type

is already broken, because you added the new type w/o initializing the
capacity.

Please fix that up.

> 
> > > 
> > > There is explicit way for user to disable misc without setting capacity  
> > > to
> > > zero.
> > 
> > Which way are you talking about?
> 
> Echo "-misc" to cgroup.subtree_control at root level for example still  
> shows non-zero sgx_epc capacity.

I guess "having to disable all MISC resources just in order to disable SGX
EPC cgroup" is a brilliant idea.

You can easily disable the entire MISC cgroup by commandline for that
purpose if that's acceptable.

And I have no idea why "still showing non-zero EPC capacity" is important
if SGX cgroup cannot be supported at all.
Haitao Huang April 24, 2024, 12:26 a.m. UTC | #12
On Tue, 23 Apr 2024 17:13:15 -0500, Huang, Kai <kai.huang@intel.com> wrote:

> On Tue, 2024-04-23 at 10:30 -0500, Haitao Huang wrote:
>> > > It's a workaround because you use the capacity==0 but it does not  
>> really
>> > > mean to disable the misc cgroup for specific resource IIUC.
>> >
>> > Please read the comment around @misc_res_capacity again:
>> >
>> >   * Miscellaneous resources capacity for the entire machine. 0  
>> capacity
>> >   * means resource is not initialized or not present in the host.
>> >
>>
>> I mentioned this in earlier email. I think this means no SGX EPC. It  
>> doesnot mean sgx epc cgroup not enabled. That's also consistent with the 
>> behavior try_charge() fails if capacity is zero.
>
> OK. To me the "capacity" is purely the concept of cgroup, so it must be
> interpreted within the scope of "cgroup".  If cgroup, in our case, SGX
> cgroup, is disabled, then whether "leaving the capacity to reflect the
> presence of hardware resource" doesn't really matter.
> So what you are saying is that, the kernel must initialize the capacity  
> of
> some MISC resource once it is added as valid type.  
> And you must initialize the "capacity" even MISC cgroup is disabled
> entirely by kernel commandline, in which case, IIUC, misc.capacity is not
> even going to show in the /fs.
>
> If this is your point, then your patch:
>
> 	cgroup/misc: Add SGX EPC resource type
>
> is already broken, because you added the new type w/o initializing the
> capacity.
>
> Please fix that up.
>
>>
>> > >
>> > > There is explicit way for user to disable misc without setting  
>> capacity> > to
>> > > zero.
>> >
>> > Which way are you talking about?
>>
>> Echo "-misc" to cgroup.subtree_control at root level for example still 
>> shows non-zero sgx_epc capacity.
>
> I guess "having to disable all MISC resources just in order to disable  
> SGX
> EPC cgroup" is a brilliant idea.
>
> You can easily disable the entire MISC cgroup by commandline for that
> purpose if that's acceptable.
>
> And I have no idea why "still showing non-zero EPC capacity" is important
> if SGX cgroup cannot be supported at all.
>

Okay, all I'm trying to say is we should care about consistency in code  
and don't want SGX do something different. Mixing "disable" with  
"capacity==0" causes inconsistencies AFAICS:

1) The try_charge() API currently returns error when capacity is zero. So  
it appears not to mean that the cgroup is disabled otherwise it should  
return success.

2) The current explicit way ("-misc") to disable misc still shows non-zero  
entries in misc.capacity. (At least for v2 cgroup, it does when I tested).  
Maybe this is not important but I just don't feel good about this  
inconsistency.

For now I'll just do BUG_ON() unless there are more strong opinions one  
way or the other.

BR
Haitao
Huang, Kai April 24, 2024, 2:13 a.m. UTC | #13
On Tue, 2024-04-23 at 19:26 -0500, Haitao Huang wrote:
> On Tue, 23 Apr 2024 17:13:15 -0500, Huang, Kai <kai.huang@intel.com> wrote:
> 
> > On Tue, 2024-04-23 at 10:30 -0500, Haitao Huang wrote:
> > > > > It's a workaround because you use the capacity==0 but it does not  
> > > really
> > > > > mean to disable the misc cgroup for specific resource IIUC.
> > > > 
> > > > Please read the comment around @misc_res_capacity again:
> > > > 
> > > >   * Miscellaneous resources capacity for the entire machine. 0  
> > > capacity
> > > >   * means resource is not initialized or not present in the host.
> > > > 
> > > 
> > > I mentioned this in earlier email. I think this means no SGX EPC. It  
> > > doesnot mean sgx epc cgroup not enabled. That's also consistent with the 
> > > behavior try_charge() fails if capacity is zero.
> > 
> > OK. To me the "capacity" is purely the concept of cgroup, so it must be
> > interpreted within the scope of "cgroup".  If cgroup, in our case, SGX
> > cgroup, is disabled, then whether "leaving the capacity to reflect the
> > presence of hardware resource" doesn't really matter.
> > So what you are saying is that, the kernel must initialize the capacity  
> > of
> > some MISC resource once it is added as valid type.  
> > And you must initialize the "capacity" even MISC cgroup is disabled
> > entirely by kernel commandline, in which case, IIUC, misc.capacity is not
> > even going to show in the /fs.
> > 
> > If this is your point, then your patch:
> > 
> > 	cgroup/misc: Add SGX EPC resource type
> > 
> > is already broken, because you added the new type w/o initializing the
> > capacity.
> > 
> > Please fix that up.
> > 
> > > 
> > > > > 
> > > > > There is explicit way for user to disable misc without setting  
> > > capacity> > to
> > > > > zero.
> > > > 
> > > > Which way are you talking about?
> > > 
> > > Echo "-misc" to cgroup.subtree_control at root level for example still 
> > > shows non-zero sgx_epc capacity.
> > 
> > I guess "having to disable all MISC resources just in order to disable  
> > SGX
> > EPC cgroup" is a brilliant idea.
> > 
> > You can easily disable the entire MISC cgroup by commandline for that
> > purpose if that's acceptable.
> > 
> > And I have no idea why "still showing non-zero EPC capacity" is important
> > if SGX cgroup cannot be supported at all.
> > 
> 
> Okay, all I'm trying to say is we should care about consistency in code  
> and don't want SGX do something different. Mixing "disable" with  
> "capacity==0" causes inconsistencies AFAICS:
> 
> 1) The try_charge() API currently returns error when capacity is zero. So  
> it appears not to mean that the cgroup is disabled otherwise it should  
> return success.

I agree this isn't ideal.  My view is we can fix it in MISC code if
needed.

> 
> 2) The current explicit way ("-misc") to disable misc still shows non-zero  
> entries in misc.capacity. (At least for v2 cgroup, it does when I tested).  
> Maybe this is not important but I just don't feel good about this  
> inconsistency.

This belongs to "MISC resource cgroup was initially enabled by the kernel
at boot time, but later was disabled *somewhere in hierarchy* by the
user".

In fact, if you only do "-misc" for "some subtree", it's quite reasonable
to still report the resource in max.capacity.  In the case above, the
"some subtree" happens to be the root.

So to me it's reasonable to still show max.capacity in this case.  And you
can actually argue that the kernel still supports the cgroup for the
resource.  E.g., you can at runtime do "+misc" to re-enable.

However, if the kernel isn't able to support certain MISC resource cgroup
at boot time, it's quite reasonable to just set the "capacity" to 0 so it
isn't visible to userspace.

Note:

My key point is, when userspace sees 0 "capacity", it shouldn't need to
care about whether it is because of "hardware resource is not available",
or "hardware resource is available but kernel cannot support cgroup for
it".  The resource cgroup is simply unavailable.

That means the kernel has full right to just hide that resource from the
cgroup at boot time.

But this should be just within "cgroup's scope", i.e., that resource can
still be available if kernel can provide it.  If some app wants to
additionally check whether such resource is indeed available but only
cgroup is not available, it should check resource specific interface but
not take advantage of the MISC cgroup interface.

> 
> For now I'll just do BUG_ON() unless there are more strong opinions one  
> way or the other.
> 

Fine to me.
diff mbox series

Patch

diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
index 74d403d1e0d4..8151371a198b 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -5,9 +5,63 @@ 
 #include <linux/kernel.h>
 #include "epc_cgroup.h"
 
+/*
+ * The minimal free pages maintained by per-cgroup reclaimer
+ * Set this to the low threshold used by the global reclaimer, ksgxd.
+ */
+#define SGX_CG_MIN_FREE_PAGE	(SGX_NR_LOW_PAGES)
+
+/*
+ * If the cgroup limit is close to SGX_CG_MIN_FREE_PAGE, maintaining the minimal
+ * free pages would barely leave any page for use, causing excessive reclamation
+ * and thrashing.
+ *
+ * Define the following limit, below which cgroup does not maintain the minimal
+ * free page threshold. Set this to quadruple of the minimal so at least 75%
+ * pages used without being reclaimed.
+ */
+#define SGX_CG_LOW_LIMIT	(SGX_CG_MIN_FREE_PAGE * 4)
+
 /* The root SGX EPC cgroup */
 static struct sgx_cgroup sgx_cg_root;
 
+/*
+ * The work queue that reclaims EPC pages in the background for cgroups.
+ *
+ * A cgroup schedules a work item into this queue to reclaim pages within the
+ * same cgroup when its usage limit is reached and synchronous reclamation is not
+ * an option, i.e., in a page fault handler.
+ */
+static struct workqueue_struct *sgx_cg_wq;
+
+static inline u64 sgx_cgroup_page_counter_read(struct sgx_cgroup *sgx_cg)
+{
+	return atomic64_read(&sgx_cg->cg->res[MISC_CG_RES_SGX_EPC].usage) / PAGE_SIZE;
+}
+
+static inline u64 sgx_cgroup_max_pages(struct sgx_cgroup *sgx_cg)
+{
+	return READ_ONCE(sgx_cg->cg->res[MISC_CG_RES_SGX_EPC].max) / PAGE_SIZE;
+}
+
+/*
+ * Get the lower bound of limits of a cgroup and its ancestors. Used in
+ * sgx_cgroup_should_reclaim() to determine if EPC usage of a cgroup is
+ * close to its limit or its ancestors' hence reclamation is needed.
+ */
+static inline u64 sgx_cgroup_max_pages_to_root(struct sgx_cgroup *sgx_cg)
+{
+	struct misc_cg *i = sgx_cg->cg;
+	u64 m = U64_MAX;
+
+	while (i) {
+		m = min(m, READ_ONCE(i->res[MISC_CG_RES_SGX_EPC].max));
+		i = misc_cg_parent(i);
+	}
+
+	return m / PAGE_SIZE;
+}
+
 /**
  * sgx_cgroup_lru_empty() - check if a cgroup tree has no pages on its LRUs
  * @root:	Root of the tree to check
@@ -90,6 +144,61 @@  static void sgx_cgroup_reclaim_pages(struct misc_cg *root)
 	rcu_read_unlock();
 }
 
+/**
+ * sgx_cgroup_should_reclaim() - check if EPC reclamation is needed for a cgroup
+ * @sgx_cg: The cgroup to be checked.
+ *
+ * This function can be used to guard a call to sgx_cgroup_reclaim_pages() where
+ * the minimal number of free page needs be maintained for the cgroup to make
+ * good forward progress.
+ *
+ * Return: %true if number of free pages available for the cgroup below a
+ * threshold (%SGX_CG_MIN_FREE_PAGE) and there are reclaimable pages within the
+ * cgroup.
+ */
+static bool sgx_cgroup_should_reclaim(struct sgx_cgroup *sgx_cg)
+{
+	u64 cur, max;
+
+	if (sgx_cgroup_lru_empty(sgx_cg->cg))
+		return false;
+
+	max = sgx_cgroup_max_pages_to_root(sgx_cg);
+
+	/*
+	 * Unless the limit is very low, maintain a minimal number of free pages
+	 * so there is always a few pages available to serve new allocation
+	 * requests quickly.
+	 */
+	if (max > SGX_CG_LOW_LIMIT)
+		max -= SGX_CG_MIN_FREE_PAGE;
+
+	cur = sgx_cgroup_page_counter_read(sgx_cg);
+
+	return (cur >= max);
+}
+
+/*
+ * Asynchronous work flow to reclaim pages from the cgroup when the cgroup is
+ * at/near its maximum capacity.
+ */
+static void sgx_cgroup_reclaim_work_func(struct work_struct *work)
+{
+	struct sgx_cgroup *sgx_cg = container_of(work, struct sgx_cgroup, reclaim_work);
+
+	/*
+	 * This work func is scheduled by sgx_cgroup_try_charge() when it cannot
+	 * directly reclaim, i.e., EPC allocation in a fault handler. Waiting to
+	 * reclaim until the cgroup is actually at its limit is less performant,
+	 * as it means the task scheduling this asynchronous work is effectively
+	 * blocked until a worker makes its way through the global work queue.
+	 */
+	while (sgx_cgroup_should_reclaim(sgx_cg)) {
+		sgx_cgroup_reclaim_pages(sgx_cg->cg);
+		cond_resched();
+	}
+}
+
 static int __sgx_cgroup_try_charge(struct sgx_cgroup *epc_cg)
 {
 	if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE))
@@ -117,19 +226,28 @@  int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_reclaim reclaim)
 {
 	int ret;
 
+	/* cgroup disabled due to wq allocation failure during sgx_cgroup_init(). */
+	if (!sgx_cg_wq)
+		return 0;
+
 	for (;;) {
 		ret = __sgx_cgroup_try_charge(sgx_cg);
 
 		if (ret != -EBUSY)
 			return ret;
 
-		if (reclaim == SGX_NO_RECLAIM)
-			return -ENOMEM;
+		if (reclaim == SGX_NO_RECLAIM) {
+			queue_work(sgx_cg_wq, &sgx_cg->reclaim_work);
+			return -EBUSY;
+		}
 
 		sgx_cgroup_reclaim_pages(sgx_cg->cg);
 		cond_resched();
 	}
 
+	if (sgx_cgroup_should_reclaim(sgx_cg))
+		queue_work(sgx_cg_wq, &sgx_cg->reclaim_work);
+
 	return 0;
 }
 
@@ -150,12 +268,14 @@  static void sgx_cgroup_free(struct misc_cg *cg)
 	if (!sgx_cg)
 		return;
 
+	cancel_work_sync(&sgx_cg->reclaim_work);
 	kfree(sgx_cg);
 }
 
 static void sgx_cgroup_misc_init(struct misc_cg *cg, struct sgx_cgroup *sgx_cg)
 {
 	sgx_lru_init(&sgx_cg->lru);
+	INIT_WORK(&sgx_cg->reclaim_work, sgx_cgroup_reclaim_work_func);
 	cg->res[MISC_CG_RES_SGX_EPC].priv = sgx_cg;
 	sgx_cg->cg = cg;
 }
@@ -182,4 +302,9 @@  void sgx_cgroup_init(void)
 {
 	misc_cg_set_ops(MISC_CG_RES_SGX_EPC, &sgx_cgroup_ops);
 	sgx_cgroup_misc_init(misc_cg_root(), &sgx_cg_root);
+
+	sgx_cg_wq = alloc_workqueue("sgx_cg_wq", WQ_UNBOUND | WQ_FREEZABLE, WQ_UNBOUND_MAX_ACTIVE);
+
+	if (!sgx_cg_wq)
+		pr_err("SGX EPC cgroup disabled: alloc_workqueue() failed.\n");
 }
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
index 538524f5669d..2044e0d64076 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.h
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
@@ -34,6 +34,7 @@  static inline void sgx_cgroup_init(void) { }
 struct sgx_cgroup {
 	struct misc_cg *cg;
 	struct sgx_epc_lru_list lru;
+	struct work_struct reclaim_work;
 };
 
 static inline struct sgx_cgroup *sgx_cgroup_from_misc_cg(struct misc_cg *cg)