diff mbox series

[v15,05/14] x86/sgx: Implement basic EPC misc cgroup functionality

Message ID 20240617125321.36658-6-haitao.huang@linux.intel.com (mailing list archive)
State New
Headers show
Series Add Cgroup support for SGX EPC memory | expand

Commit Message

Haitao Huang June 17, 2024, 12:53 p.m. UTC
From: Kristen Carlson Accardi <kristen@linux.intel.com>

SGX Enclave Page Cache (EPC) memory allocations are separate from normal
RAM allocations, and are managed solely by the SGX subsystem. The
existing cgroup memory controller cannot be used to limit or account for
SGX EPC memory, which is a desirable feature in some environments. For
instance, within a Kubernetes environment, while a user may specify a
particular EPC quota for a pod, the orchestrator requires a mechanism to
enforce that the pod's actual runtime EPC usage does not exceed the
allocated quota.

Utilize the misc controller [admin-guide/cgroup-v2.rst, 5-9. Misc] to
limit and track EPC allocations per cgroup. Earlier patches have added
the "sgx_epc" resource type in the misc cgroup subsystem. Add basic
support in SGX driver as the "sgx_epc" resource provider:

- Set "capacity" of EPC by calling misc_cg_set_capacity()
- Update EPC usage counter, "current", by calling charge and uncharge
APIs for EPC allocation and deallocation, respectively.
- Setup sgx_epc resource type specific callbacks, which perform
initialization and cleanup during cgroup allocation and deallocation,
respectively.

With these changes, the misc cgroup controller enables users to set a hard
limit for EPC usage in the "misc.max" interface file. It reports current
usage in "misc.current", the total EPC memory available in
"misc.capacity", and the number of times EPC usage reached the max limit
in "misc.events".

For now, the EPC cgroup simply blocks additional EPC allocation in
sgx_alloc_epc_page() when the limit is reached. Reclaimable pages are
still tracked in the global active list, only reclaimed by the global
reclaimer when the total free page count is lower than a threshold.

Later patches will reorganize the tracking and reclamation code in the
global reclaimer and implement per-cgroup tracking and reclaiming.

Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com>
Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Reviewed-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Tested-by: Jarkko Sakkinen <jarkko@kernel.org>
---
V15:
- Declare __init for sgx_cgroup_init() (Jarkko)
- Disable SGX when sgx_cgroup_init() fails (Jarkko)

V13:
- Remove unneeded includes. (Kai)

V12:
- Remove CONFIG_CGROUP_SGX_EPC and make sgx cgroup implementation
conditionally compiled with CONFIG_CGROUP_MISC. (Jarkko)

V11:
- Update copyright and format better (Kai)
- Create wrappers to remove #ifdefs in c file. (Kai)
- Remove unneeded comments (Kai)

V10:
- Shorten function, variable, struct names, s/sgx_epc_cgroup/sgx_cgroup. (Jarkko)
- Use enums instead of booleans for the parameters. (Dave, Jarkko)

V8:
- Remove null checks for epc_cg in try_charge()/uncharge(). (Jarkko)
- Remove extra space, '_INTEL'. (Jarkko)

V7:
- Use a static for root cgroup (Kai)
- Wrap epc_cg field in sgx_epc_page struct with #ifdef (Kai)
- Correct check for charge API return (Kai)
- Start initialization in SGX device driver init (Kai)
- Remove unneeded BUG_ON (Kai)
- Split  sgx_get_current_epc_cg() out of sgx_epc_cg_try_charge() (Kai)

V6:
- Split the original large patch"Limit process EPC usage with misc
cgroup controller"  and restructure it (Kai)
---
 arch/x86/kernel/cpu/sgx/Makefile     |  1 +
 arch/x86/kernel/cpu/sgx/epc_cgroup.c | 73 ++++++++++++++++++++++++++++
 arch/x86/kernel/cpu/sgx/epc_cgroup.h | 72 +++++++++++++++++++++++++++
 arch/x86/kernel/cpu/sgx/main.c       | 42 ++++++++++++++--
 arch/x86/kernel/cpu/sgx/sgx.h        | 21 ++++++++
 include/linux/misc_cgroup.h          |  2 +
 6 files changed, 208 insertions(+), 3 deletions(-)
 create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c
 create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h

Comments

Huang, Kai June 18, 2024, 11:31 a.m. UTC | #1
> @@ -921,7 +956,8 @@ static int __init sgx_init(void)
>  	if (!sgx_page_cache_init())
>  		return -ENOMEM;
>  
> -	if (!sgx_page_reclaimer_init()) {
> +	if (!sgx_page_reclaimer_init() || !sgx_cgroup_init()) {
> +		misc_cg_set_capacity(MISC_CG_RES_SGX_EPC, 0);
>  		ret = -ENOMEM;
>  		goto err_page_cache;
>  	}

This code change is wrong due to two reasons:

1) If sgx_page_reclaimer_init() was successful, but sgx_cgroup_init()
failed, you actually need to 'goto err_kthread' because the ksgxd() kernel
thread is already created and is running.

2) There are other cases after here that can also result in sgx_init() to
fail completely, e.g., registering sgx_dev_provision mics device.  We need
to reset the capacity to 0 for those cases as well.

AFAICT, you need something like:

diff --git a/arch/x86/kernel/cpu/sgx/main.c
b/arch/x86/kernel/cpu/sgx/main.c
index 27892e57c4ef..46f9c26992a7 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -930,6 +930,10 @@ static int __init sgx_init(void)
        if (ret)
                goto err_kthread;
 
+       ret = sgx_cgroup_init();
+       if (ret)
+               goto err_provision;
+
        /*
         * Always try to initialize the native *and* KVM drivers.
         * The KVM driver is less picky than the native one and
@@ -941,10 +945,12 @@ static int __init sgx_init(void)
        ret = sgx_drv_init();
 
        if (sgx_vepc_init() && ret)
-               goto err_provision;
+               goto err_cgroup;
 
        return 0;
 
+err_cgroup:
+       /* SGX EPC cgroup cleanup */
 err_provision:
        misc_deregister(&sgx_dev_provision);
 
@@ -952,6 +958,8 @@ static int __init sgx_init(void)
        kthread_stop(ksgxd_tsk);
 
 err_page_cache:
+       misc_misc_cg_set_capacity(MISC_CG_RES_SGX_EPC, 0);
+
        for (i = 0; i < sgx_nr_epc_sections; i++) {
                vfree(sgx_epc_sections[i].pages);
                memunmap(sgx_epc_sections[i].virt_addr);


I put the sgx_cgroup_init() before sgx_drv_init() and sgx_vepc_init(),
otherwise you will need sgx_drv_cleanup() and sgx_vepc_cleanup()
respectively when sgx_cgroup_init() fails.

This looks a little bit weird too, though:

Calling misc_misc_cg_set_capacity() to reset capacity to 0 is done at end
of sgx_init() error path, because the "set capacity" part is done in
sgx_epc_cache_init().  

But logically, both "set capacity" and "reset capacity to 0" should be SGX
EPC cgroup operation, so it's more reasonable to do "set capacity" in
sgx_cgroup_init() and do "reset to 0" in the

	/* SGX EPC cgroup cleanup */

as shown above.

Eventually, you will need to do EPC cgroup cleanup anyway, e.g., to free
the workqueue, so it's odd to have two places to handle EPC cgroup
cleanup.

I understand the reason "set capacity" part is done in
sgx_page_cache_init() now is because in that function you can easily get
the capacity.  But the fact is @sgx_numa_nodes also tracks EPC size for
each node, so you can also get the total EPC size from @sgx_numa_node in
sgx_cgroup_init() and set capacity there.

In this case, you can put "reset capacity to 0" and "free workqueue"
together as the "SGX EPC cgroup cleanup", which is way more clear IMHO.
Haitao Huang June 18, 2024, 12:56 p.m. UTC | #2
On Tue, 18 Jun 2024 06:31:09 -0500, Huang, Kai <kai.huang@intel.com> wrote:

>
>> @@ -921,7 +956,8 @@ static int __init sgx_init(void)
>>  	if (!sgx_page_cache_init())
>>  		return -ENOMEM;
>>
>> -	if (!sgx_page_reclaimer_init()) {
>> +	if (!sgx_page_reclaimer_init() || !sgx_cgroup_init()) {
>> +		misc_cg_set_capacity(MISC_CG_RES_SGX_EPC, 0);
>>  		ret = -ENOMEM;
>>  		goto err_page_cache;
>>  	}
>
> This code change is wrong due to two reasons:
>
> 1) If sgx_page_reclaimer_init() was successful, but sgx_cgroup_init()
> failed, you actually need to 'goto err_kthread' because the ksgxd()  
> kernel
> thread is already created and is running.
>
> 2) There are other cases after here that can also result in sgx_init() to
> fail completely, e.g., registering sgx_dev_provision mics device.  We  
> need
> to reset the capacity to 0 for those cases as well.
>
> AFAICT, you need something like:
>
> diff --git a/arch/x86/kernel/cpu/sgx/main.c
> b/arch/x86/kernel/cpu/sgx/main.c
> index 27892e57c4ef..46f9c26992a7 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -930,6 +930,10 @@ static int __init sgx_init(void)
>         if (ret)
>                 goto err_kthread;
> +       ret = sgx_cgroup_init();
> +       if (ret)
> +               goto err_provision;
> +
>         /*
>          * Always try to initialize the native *and* KVM drivers.
>          * The KVM driver is less picky than the native one and
> @@ -941,10 +945,12 @@ static int __init sgx_init(void)
>         ret = sgx_drv_init();
>        if (sgx_vepc_init() && ret)
> -               goto err_provision;
> +               goto err_cgroup;
>        return 0;
> +err_cgroup:
> +       /* SGX EPC cgroup cleanup */
>  err_provision:
>         misc_deregister(&sgx_dev_provision);
> @@ -952,6 +958,8 @@ static int __init sgx_init(void)
>         kthread_stop(ksgxd_tsk);
> err_page_cache:
> +       misc_misc_cg_set_capacity(MISC_CG_RES_SGX_EPC, 0);
> +
>         for (i = 0; i < sgx_nr_epc_sections; i++) {
>                 vfree(sgx_epc_sections[i].pages);
>                 memunmap(sgx_epc_sections[i].virt_addr);
>
>
> I put the sgx_cgroup_init() before sgx_drv_init() and sgx_vepc_init(),
> otherwise you will need sgx_drv_cleanup() and sgx_vepc_cleanup()
> respectively when sgx_cgroup_init() fails.
>

Yes, good catch.

> This looks a little bit weird too, though:
>
> Calling misc_misc_cg_set_capacity() to reset capacity to 0 is done at end
> of sgx_init() error path, because the "set capacity" part is done in
> sgx_epc_cache_init().  
> But logically, both "set capacity" and "reset capacity to 0" should be  
> SGX
> EPC cgroup operation, so it's more reasonable to do "set capacity" in
> sgx_cgroup_init() and do "reset to 0" in the
>
> 	/* SGX EPC cgroup cleanup */
>
> as shown above.
>
> Eventually, you will need to do EPC cgroup cleanup anyway, e.g., to free
> the workqueue, so it's odd to have two places to handle EPC cgroup
> cleanup.
>
> I understand the reason "set capacity" part is done in
> sgx_page_cache_init() now is because in that function you can easily get
> the capacity.  But the fact is @sgx_numa_nodes also tracks EPC size for
> each node, so you can also get the total EPC size from @sgx_numa_node in
> sgx_cgroup_init() and set capacity there.
>
> In this case, you can put "reset capacity to 0" and "free workqueue"
> together as the "SGX EPC cgroup cleanup", which is way more clear IMHO.
>
Okay, will  expose @sgx_numa_nodes to epc_cgroup.c and do the calculations  
in sgx_cgroup_init().
Thanks
Haitao
Huang, Kai June 18, 2024, 11:15 p.m. UTC | #3
On Tue, 2024-06-18 at 07:56 -0500, Haitao Huang wrote:
> On Tue, 18 Jun 2024 06:31:09 -0500, Huang, Kai <kai.huang@intel.com> wrote:
> 
> > 
> > > @@ -921,7 +956,8 @@ static int __init sgx_init(void)
> > >  	if (!sgx_page_cache_init())
> > >  		return -ENOMEM;
> > > 
> > > -	if (!sgx_page_reclaimer_init()) {
> > > +	if (!sgx_page_reclaimer_init() || !sgx_cgroup_init()) {
> > > +		misc_cg_set_capacity(MISC_CG_RES_SGX_EPC, 0);
> > >  		ret = -ENOMEM;
> > >  		goto err_page_cache;
> > >  	}
> > 
> > This code change is wrong due to two reasons:
> > 
> > 1) If sgx_page_reclaimer_init() was successful, but sgx_cgroup_init()
> > failed, you actually need to 'goto err_kthread' because the ksgxd()  
> > kernel
> > thread is already created and is running.
> > 
> > 2) There are other cases after here that can also result in sgx_init() to
> > fail completely, e.g., registering sgx_dev_provision mics device.  We  
> > need
> > to reset the capacity to 0 for those cases as well.
> > 
> > AFAICT, you need something like:
> > 
> > diff --git a/arch/x86/kernel/cpu/sgx/main.c
> > b/arch/x86/kernel/cpu/sgx/main.c
> > index 27892e57c4ef..46f9c26992a7 100644
> > --- a/arch/x86/kernel/cpu/sgx/main.c
> > +++ b/arch/x86/kernel/cpu/sgx/main.c
> > @@ -930,6 +930,10 @@ static int __init sgx_init(void)
> >         if (ret)
> >                 goto err_kthread;
> > +       ret = sgx_cgroup_init();
> > +       if (ret)
> > +               goto err_provision;
> > +
> >         /*
> >          * Always try to initialize the native *and* KVM drivers.
> >          * The KVM driver is less picky than the native one and
> > @@ -941,10 +945,12 @@ static int __init sgx_init(void)
> >         ret = sgx_drv_init();
> >        if (sgx_vepc_init() && ret)
> > -               goto err_provision;
> > +               goto err_cgroup;
> >        return 0;
> > +err_cgroup:
> > +       /* SGX EPC cgroup cleanup */
> >  err_provision:
> >         misc_deregister(&sgx_dev_provision);
> > @@ -952,6 +958,8 @@ static int __init sgx_init(void)
> >         kthread_stop(ksgxd_tsk);
> > err_page_cache:
> > +       misc_misc_cg_set_capacity(MISC_CG_RES_SGX_EPC, 0);
> > +
> >         for (i = 0; i < sgx_nr_epc_sections; i++) {
> >                 vfree(sgx_epc_sections[i].pages);
> >                 memunmap(sgx_epc_sections[i].virt_addr);
> > 
> > 
> > I put the sgx_cgroup_init() before sgx_drv_init() and sgx_vepc_init(),
> > otherwise you will need sgx_drv_cleanup() and sgx_vepc_cleanup()
> > respectively when sgx_cgroup_init() fails.
> > 
> 
> Yes, good catch.
> 
> > This looks a little bit weird too, though:
> > 
> > Calling misc_misc_cg_set_capacity() to reset capacity to 0 is done at end
> > of sgx_init() error path, because the "set capacity" part is done in
> > sgx_epc_cache_init().  
> > But logically, both "set capacity" and "reset capacity to 0" should be  
> > SGX
> > EPC cgroup operation, so it's more reasonable to do "set capacity" in
> > sgx_cgroup_init() and do "reset to 0" in the
> > 
> > 	/* SGX EPC cgroup cleanup */
> > 
> > as shown above.
> > 
> > Eventually, you will need to do EPC cgroup cleanup anyway, e.g., to free
> > the workqueue, so it's odd to have two places to handle EPC cgroup
> > cleanup.
> > 
> > I understand the reason "set capacity" part is done in
> > sgx_page_cache_init() now is because in that function you can easily get
> > the capacity.  But the fact is @sgx_numa_nodes also tracks EPC size for
> > each node, so you can also get the total EPC size from @sgx_numa_node in
> > sgx_cgroup_init() and set capacity there.
> > 
> > In this case, you can put "reset capacity to 0" and "free workqueue"
> > together as the "SGX EPC cgroup cleanup", which is way more clear IMHO.
> > 
> Okay, will  expose @sgx_numa_nodes to epc_cgroup.c and do the calculations  
> in sgx_cgroup_init().
> 

Looks you will also need to expose @sgx_numa_mask, which looks overkill.

Other options:

1) Expose a function to return total EPC pages/size in "sgx.h".

2) Move out the new 'capacity' variable in this patch as a global variable
and expose it in "sgx.h" (perhaps rename to 'sgx_total_epc_pages/size').

3) Make sgx_cgroup_init() to take an argument of total EPC pages/size, and
pass it in sgx_init().  

For 3) there are also options to get total EPC pages/size:

a) Move out the new 'capacity' variable in this patch as a static.

b) Add a function to calculate total EPC pages/size from sgx_numa_nodes.

Hmm.. I think we can just use option 2)?
Haitao Huang June 18, 2024, 11:23 p.m. UTC | #4
On Tue, 18 Jun 2024 18:15:37 -0500, Huang, Kai <kai.huang@intel.com> wrote:

> On Tue, 2024-06-18 at 07:56 -0500, Haitao Huang wrote:
>> On Tue, 18 Jun 2024 06:31:09 -0500, Huang, Kai <kai.huang@intel.com>  
>> wrote:
>>
>> >
>> > > @@ -921,7 +956,8 @@ static int __init sgx_init(void)
>> > >  	if (!sgx_page_cache_init())
>> > >  		return -ENOMEM;
>> > >
>> > > -	if (!sgx_page_reclaimer_init()) {
>> > > +	if (!sgx_page_reclaimer_init() || !sgx_cgroup_init()) {
>> > > +		misc_cg_set_capacity(MISC_CG_RES_SGX_EPC, 0);
>> > >  		ret = -ENOMEM;
>> > >  		goto err_page_cache;
>> > >  	}
>> >
>> > This code change is wrong due to two reasons:
>> >
>> > 1) If sgx_page_reclaimer_init() was successful, but sgx_cgroup_init()
>> > failed, you actually need to 'goto err_kthread' because the ksgxd()
>> > kernel
>> > thread is already created and is running.
>> >
>> > 2) There are other cases after here that can also result in  
>> sgx_init() to
>> > fail completely, e.g., registering sgx_dev_provision mics device.  We
>> > need
>> > to reset the capacity to 0 for those cases as well.
>> >
>> > AFAICT, you need something like:
>> >
>> > diff --git a/arch/x86/kernel/cpu/sgx/main.c
>> > b/arch/x86/kernel/cpu/sgx/main.c
>> > index 27892e57c4ef..46f9c26992a7 100644
>> > --- a/arch/x86/kernel/cpu/sgx/main.c
>> > +++ b/arch/x86/kernel/cpu/sgx/main.c
>> > @@ -930,6 +930,10 @@ static int __init sgx_init(void)
>> >         if (ret)
>> >                 goto err_kthread;
>> > +       ret = sgx_cgroup_init();
>> > +       if (ret)
>> > +               goto err_provision;
>> > +
>> >         /*
>> >          * Always try to initialize the native *and* KVM drivers.
>> >          * The KVM driver is less picky than the native one and
>> > @@ -941,10 +945,12 @@ static int __init sgx_init(void)
>> >         ret = sgx_drv_init();
>> >        if (sgx_vepc_init() && ret)
>> > -               goto err_provision;
>> > +               goto err_cgroup;
>> >        return 0;
>> > +err_cgroup:
>> > +       /* SGX EPC cgroup cleanup */
>> >  err_provision:
>> >         misc_deregister(&sgx_dev_provision);
>> > @@ -952,6 +958,8 @@ static int __init sgx_init(void)
>> >         kthread_stop(ksgxd_tsk);
>> > err_page_cache:
>> > +       misc_misc_cg_set_capacity(MISC_CG_RES_SGX_EPC, 0);
>> > +
>> >         for (i = 0; i < sgx_nr_epc_sections; i++) {
>> >                 vfree(sgx_epc_sections[i].pages);
>> >                 memunmap(sgx_epc_sections[i].virt_addr);
>> >
>> >
>> > I put the sgx_cgroup_init() before sgx_drv_init() and sgx_vepc_init(),
>> > otherwise you will need sgx_drv_cleanup() and sgx_vepc_cleanup()
>> > respectively when sgx_cgroup_init() fails.
>> >
>>
>> Yes, good catch.
>>
>> > This looks a little bit weird too, though:
>> >
>> > Calling misc_misc_cg_set_capacity() to reset capacity to 0 is done at  
>> end
>> > of sgx_init() error path, because the "set capacity" part is done in
>> > sgx_epc_cache_init().
>> > But logically, both "set capacity" and "reset capacity to 0" should be
>> > SGX
>> > EPC cgroup operation, so it's more reasonable to do "set capacity" in
>> > sgx_cgroup_init() and do "reset to 0" in the
>> >
>> > 	/* SGX EPC cgroup cleanup */
>> >
>> > as shown above.
>> >
>> > Eventually, you will need to do EPC cgroup cleanup anyway, e.g., to  
>> free
>> > the workqueue, so it's odd to have two places to handle EPC cgroup
>> > cleanup.
>> >
>> > I understand the reason "set capacity" part is done in
>> > sgx_page_cache_init() now is because in that function you can easily  
>> get
>> > the capacity.  But the fact is @sgx_numa_nodes also tracks EPC size  
>> for
>> > each node, so you can also get the total EPC size from @sgx_numa_node  
>> in
>> > sgx_cgroup_init() and set capacity there.
>> >
>> > In this case, you can put "reset capacity to 0" and "free workqueue"
>> > together as the "SGX EPC cgroup cleanup", which is way more clear  
>> IMHO.
>> >
>> Okay, will  expose @sgx_numa_nodes to epc_cgroup.c and do the  
>> calculations
>> in sgx_cgroup_init().
>>
>
> Looks you will also need to expose @sgx_numa_mask, which looks overkill.
>
> Other options:
>
> 1) Expose a function to return total EPC pages/size in "sgx.h".
>
> 2) Move out the new 'capacity' variable in this patch as a global  
> variable
> and expose it in "sgx.h" (perhaps rename to 'sgx_total_epc_pages/size').
>
> 3) Make sgx_cgroup_init() to take an argument of total EPC pages/size,  
> and
> pass it in sgx_init().  
> For 3) there are also options to get total EPC pages/size:
>
> a) Move out the new 'capacity' variable in this patch as a static.
>
> b) Add a function to calculate total EPC pages/size from sgx_numa_nodes.
>
> Hmm.. I think we can just use option 2)?
>
>
I was  about doing this in sgx_cgroup_init():
         for (i = 0; i < num_possible_nodes(); i++)
                 capacity += sgx_numa_nodes[i].size;

any concern using num_possible_nodes()?

I think case handled in sgx_page_cache_init() for a node with no epc (or  
mask). Only requirement is sgx_cgroup_init() called after  
sgx_page_cache_init().
Haitao
Huang, Kai June 19, 2024, 2 a.m. UTC | #5
On Tue, 2024-06-18 at 18:23 -0500, Haitao Huang wrote:
> On Tue, 18 Jun 2024 18:15:37 -0500, Huang, Kai <kai.huang@intel.com> wrote:
> 
> > On Tue, 2024-06-18 at 07:56 -0500, Haitao Huang wrote:
> > > On Tue, 18 Jun 2024 06:31:09 -0500, Huang, Kai <kai.huang@intel.com>  
> > > wrote:
> > > 
> > > > 
> > > > > @@ -921,7 +956,8 @@ static int __init sgx_init(void)
> > > > >  	if (!sgx_page_cache_init())
> > > > >  		return -ENOMEM;
> > > > > 
> > > > > -	if (!sgx_page_reclaimer_init()) {
> > > > > +	if (!sgx_page_reclaimer_init() || !sgx_cgroup_init()) {
> > > > > +		misc_cg_set_capacity(MISC_CG_RES_SGX_EPC, 0);
> > > > >  		ret = -ENOMEM;
> > > > >  		goto err_page_cache;
> > > > >  	}
> > > > 
> > > > This code change is wrong due to two reasons:
> > > > 
> > > > 1) If sgx_page_reclaimer_init() was successful, but sgx_cgroup_init()
> > > > failed, you actually need to 'goto err_kthread' because the ksgxd()
> > > > kernel
> > > > thread is already created and is running.
> > > > 
> > > > 2) There are other cases after here that can also result in  
> > > sgx_init() to
> > > > fail completely, e.g., registering sgx_dev_provision mics device.  We
> > > > need
> > > > to reset the capacity to 0 for those cases as well.
> > > > 
> > > > AFAICT, you need something like:
> > > > 
> > > > diff --git a/arch/x86/kernel/cpu/sgx/main.c
> > > > b/arch/x86/kernel/cpu/sgx/main.c
> > > > index 27892e57c4ef..46f9c26992a7 100644
> > > > --- a/arch/x86/kernel/cpu/sgx/main.c
> > > > +++ b/arch/x86/kernel/cpu/sgx/main.c
> > > > @@ -930,6 +930,10 @@ static int __init sgx_init(void)
> > > >         if (ret)
> > > >                 goto err_kthread;
> > > > +       ret = sgx_cgroup_init();
> > > > +       if (ret)
> > > > +               goto err_provision;
> > > > +
> > > >         /*
> > > >          * Always try to initialize the native *and* KVM drivers.
> > > >          * The KVM driver is less picky than the native one and
> > > > @@ -941,10 +945,12 @@ static int __init sgx_init(void)
> > > >         ret = sgx_drv_init();
> > > >        if (sgx_vepc_init() && ret)
> > > > -               goto err_provision;
> > > > +               goto err_cgroup;
> > > >        return 0;
> > > > +err_cgroup:
> > > > +       /* SGX EPC cgroup cleanup */
> > > >  err_provision:
> > > >         misc_deregister(&sgx_dev_provision);
> > > > @@ -952,6 +958,8 @@ static int __init sgx_init(void)
> > > >         kthread_stop(ksgxd_tsk);
> > > > err_page_cache:
> > > > +       misc_misc_cg_set_capacity(MISC_CG_RES_SGX_EPC, 0);
> > > > +
> > > >         for (i = 0; i < sgx_nr_epc_sections; i++) {
> > > >                 vfree(sgx_epc_sections[i].pages);
> > > >                 memunmap(sgx_epc_sections[i].virt_addr);
> > > > 
> > > > 
> > > > I put the sgx_cgroup_init() before sgx_drv_init() and sgx_vepc_init(),
> > > > otherwise you will need sgx_drv_cleanup() and sgx_vepc_cleanup()
> > > > respectively when sgx_cgroup_init() fails.
> > > > 
> > > 
> > > Yes, good catch.
> > > 
> > > > This looks a little bit weird too, though:
> > > > 
> > > > Calling misc_misc_cg_set_capacity() to reset capacity to 0 is done at  
> > > end
> > > > of sgx_init() error path, because the "set capacity" part is done in
> > > > sgx_epc_cache_init().
> > > > But logically, both "set capacity" and "reset capacity to 0" should be
> > > > SGX
> > > > EPC cgroup operation, so it's more reasonable to do "set capacity" in
> > > > sgx_cgroup_init() and do "reset to 0" in the
> > > > 
> > > > 	/* SGX EPC cgroup cleanup */
> > > > 
> > > > as shown above.
> > > > 
> > > > Eventually, you will need to do EPC cgroup cleanup anyway, e.g., to  
> > > free
> > > > the workqueue, so it's odd to have two places to handle EPC cgroup
> > > > cleanup.
> > > > 
> > > > I understand the reason "set capacity" part is done in
> > > > sgx_page_cache_init() now is because in that function you can easily  
> > > get
> > > > the capacity.  But the fact is @sgx_numa_nodes also tracks EPC size  
> > > for
> > > > each node, so you can also get the total EPC size from @sgx_numa_node  
> > > in
> > > > sgx_cgroup_init() and set capacity there.
> > > > 
> > > > In this case, you can put "reset capacity to 0" and "free workqueue"
> > > > together as the "SGX EPC cgroup cleanup", which is way more clear  
> > > IMHO.
> > > > 
> > > Okay, will  expose @sgx_numa_nodes to epc_cgroup.c and do the  
> > > calculations
> > > in sgx_cgroup_init().
> > > 
> > 
> > Looks you will also need to expose @sgx_numa_mask, which looks overkill.
> > 
> > Other options:
> > 
> > 1) Expose a function to return total EPC pages/size in "sgx.h".
> > 
> > 2) Move out the new 'capacity' variable in this patch as a global  
> > variable
> > and expose it in "sgx.h" (perhaps rename to 'sgx_total_epc_pages/size').
> > 
> > 3) Make sgx_cgroup_init() to take an argument of total EPC pages/size,  
> > and
> > pass it in sgx_init().  
> > For 3) there are also options to get total EPC pages/size:
> > 
> > a) Move out the new 'capacity' variable in this patch as a static.
> > 
> > b) Add a function to calculate total EPC pages/size from sgx_numa_nodes.
> > 
> > Hmm.. I think we can just use option 2)?
> > 
> > 
> I was  about doing this in sgx_cgroup_init():
>          for (i = 0; i < num_possible_nodes(); i++)
>                  capacity += sgx_numa_nodes[i].size;
> 
> any concern using num_possible_nodes()?
> 
> I think case handled in sgx_page_cache_init() for a node with no epc (or  
> mask). Only requirement is sgx_cgroup_init() called after  
> sgx_page_cache_init().
> 

All other code uses sgx_numa_mask to tell whether a node has EPC.  It
would be great if we use this way consistently in all SGX code.
diff mbox series

Patch

diff --git a/arch/x86/kernel/cpu/sgx/Makefile b/arch/x86/kernel/cpu/sgx/Makefile
index 9c1656779b2a..081cb424575e 100644
--- a/arch/x86/kernel/cpu/sgx/Makefile
+++ b/arch/x86/kernel/cpu/sgx/Makefile
@@ -4,3 +4,4 @@  obj-y += \
 	ioctl.o \
 	main.o
 obj-$(CONFIG_X86_SGX_KVM)	+= virt.o
+obj-$(CONFIG_CGROUP_MISC)	+= epc_cgroup.o
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
new file mode 100644
index 000000000000..73e4c0304b77
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -0,0 +1,73 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2022-2024 Intel Corporation. */
+
+#include<linux/slab.h>
+#include "epc_cgroup.h"
+
+/* The root SGX EPC cgroup */
+static struct sgx_cgroup sgx_cg_root;
+
+/**
+ * sgx_cgroup_try_charge() - try to charge cgroup for a single EPC page
+ *
+ * @sgx_cg:	The EPC cgroup to be charged for the page.
+ * Return:
+ * * %0 - If successfully charged.
+ * * -errno - for failures.
+ */
+int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg)
+{
+	return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, sgx_cg->cg, PAGE_SIZE);
+}
+
+/**
+ * sgx_cgroup_uncharge() - uncharge a cgroup for an EPC page
+ * @sgx_cg:	The charged sgx cgroup.
+ */
+void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg)
+{
+	misc_cg_uncharge(MISC_CG_RES_SGX_EPC, sgx_cg->cg, PAGE_SIZE);
+}
+
+static void sgx_cgroup_free(struct misc_cg *cg)
+{
+	struct sgx_cgroup *sgx_cg;
+
+	sgx_cg = sgx_cgroup_from_misc_cg(cg);
+	if (!sgx_cg)
+		return;
+
+	kfree(sgx_cg);
+}
+
+static void sgx_cgroup_misc_init(struct misc_cg *cg, struct sgx_cgroup *sgx_cg)
+{
+	cg->res[MISC_CG_RES_SGX_EPC].priv = sgx_cg;
+	sgx_cg->cg = cg;
+}
+
+static int sgx_cgroup_alloc(struct misc_cg *cg)
+{
+	struct sgx_cgroup *sgx_cg;
+
+	sgx_cg = kzalloc(sizeof(*sgx_cg), GFP_KERNEL);
+	if (!sgx_cg)
+		return -ENOMEM;
+
+	sgx_cgroup_misc_init(cg, sgx_cg);
+
+	return 0;
+}
+
+const struct misc_res_ops sgx_cgroup_ops = {
+	.alloc = sgx_cgroup_alloc,
+	.free = sgx_cgroup_free,
+};
+
+int __init sgx_cgroup_init(void)
+{
+	misc_cg_set_ops(MISC_CG_RES_SGX_EPC, &sgx_cgroup_ops);
+	sgx_cgroup_misc_init(misc_cg_root(), &sgx_cg_root);
+
+	return 0;
+}
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
new file mode 100644
index 000000000000..5750eebc2f3f
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
@@ -0,0 +1,72 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _SGX_EPC_CGROUP_H_
+#define _SGX_EPC_CGROUP_H_
+
+#include <asm/sgx.h>
+#include <linux/cgroup.h>
+#include <linux/misc_cgroup.h>
+
+#include "sgx.h"
+
+#ifndef CONFIG_CGROUP_MISC
+
+#define MISC_CG_RES_SGX_EPC MISC_CG_RES_TYPES
+struct sgx_cgroup;
+
+static inline struct sgx_cgroup *sgx_get_current_cg(void)
+{
+	return NULL;
+}
+
+static inline void sgx_put_cg(struct sgx_cgroup *sgx_cg) { }
+
+static inline int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg)
+{
+	return 0;
+}
+
+static inline void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg) { }
+
+static inline void sgx_cgroup_init(void) { }
+
+#else /* CONFIG_CGROUP_MISC */
+
+struct sgx_cgroup {
+	struct misc_cg *cg;
+};
+
+static inline struct sgx_cgroup *sgx_cgroup_from_misc_cg(struct misc_cg *cg)
+{
+	return (struct sgx_cgroup *)(cg->res[MISC_CG_RES_SGX_EPC].priv);
+}
+
+/**
+ * sgx_get_current_cg() - get the EPC cgroup of current process.
+ *
+ * Returned cgroup has its ref count increased by 1. Caller must call
+ * sgx_put_cg() to return the reference.
+ *
+ * Return: EPC cgroup to which the current task belongs to.
+ */
+static inline struct sgx_cgroup *sgx_get_current_cg(void)
+{
+	/* get_current_misc_cg() never returns NULL when Kconfig enabled */
+	return sgx_cgroup_from_misc_cg(get_current_misc_cg());
+}
+
+/**
+ * sgx_put_cg() - Put the EPC cgroup and reduce its ref count.
+ * @sgx_cg - EPC cgroup to put.
+ */
+static inline void sgx_put_cg(struct sgx_cgroup *sgx_cg)
+{
+	put_misc_cg(sgx_cg->cg);
+}
+
+int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg);
+void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg);
+int __init sgx_cgroup_init(void);
+
+#endif /* CONFIG_CGROUP_MISC */
+
+#endif /* _SGX_EPC_CGROUP_H_ */
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index e64073fb4256..8c30257e0953 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -18,6 +18,7 @@ 
 #include "driver.h"
 #include "encl.h"
 #include "encls.h"
+#include "epc_cgroup.h"
 
 struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];
 static int sgx_nr_epc_sections;
@@ -559,7 +560,16 @@  int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
  */
 struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim)
 {
+	struct sgx_cgroup *sgx_cg;
 	struct sgx_epc_page *page;
+	int ret;
+
+	sgx_cg = sgx_get_current_cg();
+	ret = sgx_cgroup_try_charge(sgx_cg);
+	if (ret) {
+		sgx_put_cg(sgx_cg);
+		return ERR_PTR(ret);
+	}
 
 	for ( ; ; ) {
 		page = __sgx_alloc_epc_page();
@@ -568,8 +578,10 @@  struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim)
 			break;
 		}
 
-		if (list_empty(&sgx_active_page_list))
-			return ERR_PTR(-ENOMEM);
+		if (list_empty(&sgx_active_page_list)) {
+			page = ERR_PTR(-ENOMEM);
+			break;
+		}
 
 		if (reclaim == SGX_NO_RECLAIM) {
 			page = ERR_PTR(-EBUSY);
@@ -585,6 +597,15 @@  struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim)
 		cond_resched();
 	}
 
+	if (!IS_ERR(page)) {
+		WARN_ON_ONCE(sgx_epc_page_get_cgroup(page));
+		/* sgx_put_cg() in sgx_free_epc_page() */
+		sgx_epc_page_set_cgroup(page, sgx_cg);
+	} else {
+		sgx_cgroup_uncharge(sgx_cg);
+		sgx_put_cg(sgx_cg);
+	}
+
 	if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
 		wake_up(&ksgxd_waitq);
 
@@ -603,8 +624,16 @@  struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim)
 void sgx_free_epc_page(struct sgx_epc_page *page)
 {
 	struct sgx_epc_section *section = &sgx_epc_sections[page->section];
+	struct sgx_cgroup *sgx_cg = sgx_epc_page_get_cgroup(page);
 	struct sgx_numa_node *node = section->node;
 
+	/* sgx_cg could be NULL if called from __sgx_sanitize_pages() */
+	if (sgx_cg) {
+		sgx_cgroup_uncharge(sgx_cg);
+		sgx_put_cg(sgx_cg);
+		sgx_epc_page_set_cgroup(page, NULL);
+	}
+
 	spin_lock(&node->lock);
 
 	page->owner = NULL;
@@ -644,6 +673,8 @@  static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
 		section->pages[i].flags = 0;
 		section->pages[i].owner = NULL;
 		section->pages[i].poison = 0;
+		sgx_epc_page_set_cgroup(&section->pages[i], NULL);
+
 		list_add_tail(&section->pages[i].list, &sgx_dirty_page_list);
 	}
 
@@ -788,6 +819,7 @@  static void __init arch_update_sysfs_visibility(int nid) {}
 static bool __init sgx_page_cache_init(void)
 {
 	u32 eax, ebx, ecx, edx, type;
+	u64 capacity = 0;
 	u64 pa, size;
 	int nid;
 	int i;
@@ -838,6 +870,7 @@  static bool __init sgx_page_cache_init(void)
 
 		sgx_epc_sections[i].node =  &sgx_numa_nodes[nid];
 		sgx_numa_nodes[nid].size += size;
+		capacity += size;
 
 		sgx_nr_epc_sections++;
 	}
@@ -847,6 +880,8 @@  static bool __init sgx_page_cache_init(void)
 		return false;
 	}
 
+	misc_cg_set_capacity(MISC_CG_RES_SGX_EPC, capacity);
+
 	return true;
 }
 
@@ -921,7 +956,8 @@  static int __init sgx_init(void)
 	if (!sgx_page_cache_init())
 		return -ENOMEM;
 
-	if (!sgx_page_reclaimer_init()) {
+	if (!sgx_page_reclaimer_init() || !sgx_cgroup_init()) {
+		misc_cg_set_capacity(MISC_CG_RES_SGX_EPC, 0);
 		ret = -ENOMEM;
 		goto err_page_cache;
 	}
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index ca34cd4f58ac..fae8eef10232 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -39,14 +39,35 @@  enum sgx_reclaim {
 	SGX_DO_RECLAIM
 };
 
+struct sgx_cgroup;
+
 struct sgx_epc_page {
 	unsigned int section;
 	u16 flags;
 	u16 poison;
 	struct sgx_encl_page *owner;
 	struct list_head list;
+#ifdef CONFIG_CGROUP_MISC
+	struct sgx_cgroup *sgx_cg;
+#endif
 };
 
+static inline void sgx_epc_page_set_cgroup(struct sgx_epc_page *page, struct sgx_cgroup *cg)
+{
+#ifdef CONFIG_CGROUP_MISC
+	page->sgx_cg = cg;
+#endif
+}
+
+static inline struct sgx_cgroup *sgx_epc_page_get_cgroup(struct sgx_epc_page *page)
+{
+#ifdef CONFIG_CGROUP_MISC
+	return page->sgx_cg;
+#else
+	return NULL;
+#endif
+}
+
 /*
  * Contains the tracking data for NUMA nodes having EPC pages. Most importantly,
  * the free page list local to the node is stored here.
diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h
index 440ed2bb8053..c9b47a5e966a 100644
--- a/include/linux/misc_cgroup.h
+++ b/include/linux/misc_cgroup.h
@@ -46,11 +46,13 @@  struct misc_res_ops {
  * @max: Maximum limit on the resource.
  * @usage: Current usage of the resource.
  * @events: Number of times, the resource limit exceeded.
+ * @priv: resource specific data.
  */
 struct misc_res {
 	u64 max;
 	atomic64_t usage;
 	atomic64_t events;
+	void *priv;
 };
 
 /**