Message ID | 20240416032011.58578-6-haitao.huang@linux.intel.com (mailing list archive) |
---|---|
State | New |
Series | Add Cgroup support for SGX EPC memory |
On Mon, 2024-04-15 at 20:20 -0700, Haitao Huang wrote:
> From: Kristen Carlson Accardi <kristen@linux.intel.com>
>
> SGX Enclave Page Cache (EPC) memory allocations are separate from normal
> RAM allocations, and are managed solely by the SGX subsystem. The
> existing cgroup memory controller cannot be used to limit or account for
> SGX EPC memory, which is a desirable feature in some environments. For
> instance, within a Kubernetes environment, while a user may specify a
> particular EPC quota for a pod, the orchestrator requires a mechanism to
> enforce that the pod's actual runtime EPC usage does not exceed the
> allocated quota.
>
> Utilize the misc controller [admin-guide/cgroup-v2.rst, 5-9. Misc] to
> limit and track EPC allocations per cgroup. Earlier patches have added
> the "sgx_epc" resource type in the misc cgroup subsystem. Add basic
> support in SGX driver as the "sgx_epc" resource provider:
>
> - Set "capacity" of EPC by calling misc_cg_set_capacity()
> - Update EPC usage counter, "current", by calling charge and uncharge
>   APIs for EPC allocation and deallocation, respectively.
> - Setup sgx_epc resource type specific callbacks, which perform
>   initialization and cleanup during cgroup allocation and deallocation,
>   respectively.
>
> With these changes, the misc cgroup controller enables users to set a hard
> limit for EPC usage in the "misc.max" interface file. It reports current
> usage in "misc.current", the total EPC memory available in
> "misc.capacity", and the number of times EPC usage reached the max limit
> in "misc.events".
>
> For now, the EPC cgroup simply blocks additional EPC allocation in
> sgx_alloc_epc_page() when the limit is reached. Reclaimable pages are
> still tracked in the global active list, only reclaimed by the global
> reclaimer when the total free page count is lower than a threshold.
>
> Later patches will reorganize the tracking and reclamation code in the
> global reclaimer and implement per-cgroup tracking and reclaiming.
>
> Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com>
> Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
> Reviewed-by: Tejun Heo <tj@kernel.org>
> Tested-by: Jarkko Sakkinen <jarkko@kernel.org>

I don't see any big issue, so feel free to add:

Reviewed-by: Kai Huang <kai.huang@intel.com>

Nitpickings below:

[...]

> --- /dev/null
> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> @@ -0,0 +1,72 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright(c) 2022-2024 Intel Corporation. */
> +
> +#include <linux/atomic.h>
> +#include <linux/kernel.h>

It doesn't seem you need the above two here.

Probably they are needed in later patches, in which case we can move them
to the relevant patch(es) where they get used.

However I think it's better to explicitly include <linux/slab.h> since
kzalloc()/kfree() are used.

Btw, I am not sure whether you want to use <linux/kernel.h> because it
looks like it contains a lot of unrelated stuff. Anyway I guess nobody
cares.

> +#include "epc_cgroup.h"
> +
> +/* The root SGX EPC cgroup */
> +static struct sgx_cgroup sgx_cg_root;

The comment isn't necessary (sorry didn't notice before), because the code
already says that pretty clearly IMHO.

[...]

>
> --- /dev/null
> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
> @@ -0,0 +1,72 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _SGX_EPC_CGROUP_H_
> +#define _SGX_EPC_CGROUP_H_
> +
> +#include <asm/sgx.h>

I don't see why you need <asm/sgx.h> here. Also, ...

> +#include <linux/cgroup.h>
> +#include <linux/misc_cgroup.h>
> +
> +#include "sgx.h"

... "sgx.h" already includes <asm/sgx.h>

[...]

>
> +static inline struct sgx_cgroup *sgx_get_current_cg(void)
> +{
> +	/* get_current_misc_cg() never returns NULL when Kconfig enabled */
> +	return sgx_cgroup_from_misc_cg(get_current_misc_cg());
> +}

I spent some time looking into this. And yes, if I am reading the code
correctly, get_current_misc_cg() should never return NULL when Kconfig is
on.

I typed my analysis below in [*]. And it would be helpful if any cgroup
expert could take a second look at this.

[...]

> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -6,6 +6,7 @@
>  #include <linux/highmem.h>
>  #include <linux/kthread.h>
>  #include <linux/miscdevice.h>
> +#include <linux/misc_cgroup.h>

Is this needed? I believe SGX variants in "epc_cgroup.h" should be enough
for sgx/main.c?

[...]

[*] IIUC get_current_misc_cg() should never return NULL when Kconfig is on
(code indent slightly adjusted for text wrap).

Firstly, during kernel boot there's always a valid @css allocated for MISC
cgroup, regardless of whether it is disabled on the kernel command line.

int __init cgroup_init(void)
{
	...

	for_each_subsys(ss, ssid) {
		if (ss->early_init) {
			...
		} else {
			cgroup_init_subsys(ss, false);
		}

		...

		if (!cgroup_ssid_enabled(ssid))
			continue;
		...
	}
	...
}

cgroup_init_subsys() makes sure a valid @css is allocated for MISC cgroup
and stores the pointer in @init_css_set.

static void __init cgroup_init_subsys(struct cgroup_subsys *ss, ...)
{
	struct cgroup_subsys_state *css;
	...

	css = ss->css_alloc(NULL);
	/* We don't handle early failures gracefully */
	BUG_ON(IS_ERR(css));

	...
	init_css_set.subsys[ss->id] = css;
	...
}

All processes are by default associated with the @init_css_set:

void cgroup_fork(struct task_struct *child)
{
	RCU_INIT_POINTER(child->cgroups, &init_css_set);
	INIT_LIST_HEAD(&child->cg_list);
}

At runtime, when a new cgroup is created in the hierarchy, the "cgroup"
can have a NULL @css if some subsystem is not enabled in it:

static int cgroup_apply_control_enable(struct cgroup *cgrp)
{
	struct cgroup *dsct;
	struct cgroup_subsys_state *d_css;
	struct cgroup_subsys *ss;
	int ssid, ret;

	cgroup_for_each_live_descendant_pre(dsct, d_css, cgrp) {
		for_each_subsys(ss, ssid) {
			struct cgroup_subsys_state *css =
					cgroup_css(dsct, ss);

			if (!(cgroup_ss_mask(dsct) & (1 << ss->id)))
				continue;

			if (!css) {
				css = css_create(dsct, ss);
				if (IS_ERR(css))
					return PTR_ERR(css);
			}
			...
		}
	}
}

We can see that if cgroup_ss_mask(dsct) doesn't have the subsystem
enabled, css_create() won't be invoked, and cgroup->subsys[ssid] will
remain NULL.

However, when a process is bound to a specific cgroup, the kernel tries to
get the cgroup's "effective css", and it seems this "effective css" cannot
be NULL if the subsys has a valid 'struct cgroup_subsys' provided, which
the MISC cgroup does.

There are a couple of code paths that can lead to this, but they all reach

static struct css_set *find_existing_css_set(...)
{
	struct cgroup_root *root = cgrp->root;
	struct cgroup_subsys *ss;
	struct css_set *cset;
	unsigned long key;
	int i;

	for_each_subsys(ss, i) {
		if (root->subsys_mask & (1UL << i)) {
			/*
			 * @ss is in this hierarchy, so we want
			 * the effective css from @cgrp.
			 */
			template[i] = cgroup_e_css_by_mask(cgrp, ss);
		} else {
			/*
			 * @ss is not in this hierarchy, so we
			 * don't want to change the css.
			 */
			template[i] = old_cset->subsys[i];
		}
	}
	...
}

This calls cgroup_e_css_by_mask() to get the "effective css" when the
subsys is enabled in the root cgroup (which means MISC cgroup is not
disabled by the kernel command line), or gets the default css, which is
@init_css_set->subsys[ssid], which is always valid for MISC cgroup.

And more specifically, the "effective css" in cgroup_e_css_by_mask() is
found by searching the entire hierarchy, so MISC cgroup will always have a
valid "effective css".

static struct cgroup_subsys_state *cgroup_e_css_by_mask(
		struct cgroup *cgrp, struct cgroup_subsys *ss)
{
	lockdep_assert_held(&cgroup_mutex);

	if (!ss)
		return &cgrp->self;

	...
	while (!(cgroup_ss_mask(cgrp) & (1 << ss->id))) {
		cgrp = cgroup_parent(cgrp);
		if (!cgrp)
			return NULL;
	}

	return cgroup_css(cgrp, ss);
}

The comment of cgroup_e_css_by_mask() says:

 * Similar to cgroup_css() but returns the effective css, which is defined
 * as the matching css of the nearest ancestor including self which has @ss
 * enabled. If @ss is associated with the hierarchy @cgrp is on, this
 * function is guaranteed to return non-NULL css.

It's hard for me to interpret the second sentence, specifically, what
"@ss is associated with the hierarchy @cgrp is on" means. I interpret it
as "subsys is enabled in root and/or any descendants".

But again, in find_existing_css_set() it is called when the root cgroup
has enabled the subsys, so it should always return a non-NULL css. And
that means for any process, get_current_misc_cg() cannot be NULL.
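The ancestor walk discussed in the analysis above can be illustrated with a small user-space sketch. This is not kernel code: `struct toy_cgroup`, `toy_e_css()` and `TOY_SS_MISC` are hypothetical stand-ins for `struct cgroup`, `cgroup_e_css_by_mask()` and the controller bit tested via `cgroup_ss_mask()`. It only models the claim that a lookup starting anywhere in the hierarchy finds a css as long as the root has the controller enabled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a controller bit such as misc. */
#define TOY_SS_MISC (1u << 0)

/* Hypothetical, heavily simplified model of struct cgroup. */
struct toy_cgroup {
	struct toy_cgroup *parent;	/* NULL for the root */
	unsigned int ss_mask;		/* controllers enabled on this cgroup */
	void *css;			/* controller state, NULL if never created */
};

/*
 * Model of the loop quoted from cgroup_e_css_by_mask(): climb towards
 * the root until a cgroup with the controller enabled is found.
 */
static void *toy_e_css(struct toy_cgroup *cgrp, unsigned int ss_bit)
{
	while (!(cgrp->ss_mask & ss_bit)) {
		cgrp = cgrp->parent;
		if (!cgrp)
			return NULL;
	}
	return cgrp->css;
}

/* root -> mid -> leaf, with the controller enabled only on the root. */
static int toy_demo_root_enabled(void)
{
	int root_state = 1;
	struct toy_cgroup root = { NULL, TOY_SS_MISC, &root_state };
	struct toy_cgroup mid = { &root, 0, NULL };
	struct toy_cgroup leaf = { &mid, 0, NULL };

	/* The leaf has no css of its own, so it resolves to the root's. */
	return toy_e_css(&leaf, TOY_SS_MISC) == &root_state;
}

/* Same walk when no cgroup in the chain has the controller enabled. */
static int toy_demo_root_disabled(void)
{
	struct toy_cgroup root = { NULL, 0, NULL };
	struct toy_cgroup leaf = { &root, 0, NULL };

	return toy_e_css(&leaf, TOY_SS_MISC) == NULL;
}
```

In this model the only way to get NULL is the second case, which corresponds to the branch in find_existing_css_set() that does not call the walk at all and reuses the old css instead.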
On Mon, 15 Apr 2024 22:20:02 -0500, Haitao Huang <haitao.huang@linux.intel.com> wrote:

> diff --git a/arch/x86/kernel/cpu/sgx/Makefile b/arch/x86/kernel/cpu/sgx/Makefile
> index 9c1656779b2a..400baa7cfb69 100644
> --- a/arch/x86/kernel/cpu/sgx/Makefile
> +++ b/arch/x86/kernel/cpu/sgx/Makefile
> @@ -1,6 +1,7 @@
>  obj-y += \
>  	driver.o \
>  	encl.o \
> +	epc_cgroup.o \

It should be:

+obj-$(CONFIG_CGROUP_MISC) += epc_cgroup.o

Haitao
On Tue, 16 Apr 2024 08:22:06 -0500, Huang, Kai <kai.huang@intel.com> wrote:

> On Mon, 2024-04-15 at 20:20 -0700, Haitao Huang wrote:
>> From: Kristen Carlson Accardi <kristen@linux.intel.com>
>>
>> SGX Enclave Page Cache (EPC) memory allocations are separate from normal
>> RAM allocations, and are managed solely by the SGX subsystem. The
>> existing cgroup memory controller cannot be used to limit or account for
>> SGX EPC memory, which is a desirable feature in some environments. For
>> instance, within a Kubernetes environment, while a user may specify a
>> particular EPC quota for a pod, the orchestrator requires a mechanism to
>> enforce that the pod's actual runtime EPC usage does not exceed the
>> allocated quota.
>>
>> Utilize the misc controller [admin-guide/cgroup-v2.rst, 5-9. Misc] to
>> limit and track EPC allocations per cgroup. Earlier patches have added
>> the "sgx_epc" resource type in the misc cgroup subsystem. Add basic
>> support in SGX driver as the "sgx_epc" resource provider:
>>
>> - Set "capacity" of EPC by calling misc_cg_set_capacity()
>> - Update EPC usage counter, "current", by calling charge and uncharge
>>   APIs for EPC allocation and deallocation, respectively.
>> - Setup sgx_epc resource type specific callbacks, which perform
>>   initialization and cleanup during cgroup allocation and deallocation,
>>   respectively.
>>
>> With these changes, the misc cgroup controller enables users to set a hard
>> limit for EPC usage in the "misc.max" interface file. It reports current
>> usage in "misc.current", the total EPC memory available in
>> "misc.capacity", and the number of times EPC usage reached the max limit
>> in "misc.events".
>>
>> For now, the EPC cgroup simply blocks additional EPC allocation in
>> sgx_alloc_epc_page() when the limit is reached. Reclaimable pages are
>> still tracked in the global active list, only reclaimed by the global
>> reclaimer when the total free page count is lower than a threshold.
>>
>> Later patches will reorganize the tracking and reclamation code in the
>> global reclaimer and implement per-cgroup tracking and reclaiming.
>>
>> Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
>> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
>> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
>> Co-developed-by: Haitao Huang <haitao.huang@linux.intel.com>
>> Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
>> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
>> Reviewed-by: Tejun Heo <tj@kernel.org>
>> Tested-by: Jarkko Sakkinen <jarkko@kernel.org>
>
> I don't see any big issue, so feel free to add:
>
> Reviewed-by: Kai Huang <kai.huang@intel.com>
>

Thanks

> Nitpickings below:
>
> [...]
>
>
>> --- /dev/null
>> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
>> @@ -0,0 +1,72 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/* Copyright(c) 2022-2024 Intel Corporation. */
>> +
>> +#include <linux/atomic.h>
>> +#include <linux/kernel.h>
>
> It doesn't seem you need the above two here.
>
> Probably they are needed in later patches, in which case we can move them
> to the relevant patch(es) where they get used.
>
> However I think it's better to explicitly include <linux/slab.h> since
> kzalloc()/kfree() are used.
>
> Btw, I am not sure whether you want to use <linux/kernel.h> because it
> looks like it contains a lot of unrelated stuff. Anyway I guess nobody
> cares.
>

I'll check and remove as needed.

>> +#include "epc_cgroup.h"
>> +
>> +/* The root SGX EPC cgroup */
>> +static struct sgx_cgroup sgx_cg_root;
>
> The comment isn't necessary (sorry didn't notice before), because the code
> already says that pretty clearly IMHO.
>

Was requested by Jarkko:
https://lore.kernel.org/lkml/CYU504RLY7QU.QZY9LWC076NX@suppilovahvero/#t

> [...]
>
>>
>> --- /dev/null
>> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
>> @@ -0,0 +1,72 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#ifndef _SGX_EPC_CGROUP_H_
>> +#define _SGX_EPC_CGROUP_H_
>> +
>> +#include <asm/sgx.h>
>
> I don't see why you need <asm/sgx.h> here. Also, ...
>
>> +#include <linux/cgroup.h>
>> +#include <linux/misc_cgroup.h>
>> +
>> +#include "sgx.h"
>
> ... "sgx.h" already includes <asm/sgx.h>
>
> [...]
>

right

>>
>> +static inline struct sgx_cgroup *sgx_get_current_cg(void)
>> +{
>> +	/* get_current_misc_cg() never returns NULL when Kconfig enabled */
>> +	return sgx_cgroup_from_misc_cg(get_current_misc_cg());
>> +}
>
> I spent some time looking into this. And yes, if I am reading the code
> correctly, get_current_misc_cg() should never return NULL when Kconfig is
> on.
>
> I typed my analysis below in [*]. And it would be helpful if any cgroup
> expert could take a second look at this.
>
> [...]
>

Thanks for checking this. I did a similar analysis and agree with the
conclusion. I think this is also confirmed by Michal's description AFAICT:

"
The current implementation creates root css object (see cgroup_init(),
cgroup_ssid_enabled() check is after cgroup_init_subsys()).
I.e. it will look like all tasks are members of root cgroup wrt given
controller permanently and controller attribute files won't exist."

>
>> --- a/arch/x86/kernel/cpu/sgx/main.c
>> +++ b/arch/x86/kernel/cpu/sgx/main.c
>> @@ -6,6 +6,7 @@
>>  #include <linux/highmem.h>
>>  #include <linux/kthread.h>
>>  #include <linux/miscdevice.h>
>> +#include <linux/misc_cgroup.h>
>
> Is this needed? I believe SGX variants in "epc_cgroup.h" should be enough
> for sgx/main.c?
>
> [...]
>

right

> [*] IIUC get_current_misc_cg() should never return NULL when Kconfig is
> on

yes

[...]

Thanks
Haitao
> Was requested by Jarkko:
> https://lore.kernel.org/lkml/CYU504RLY7QU.QZY9LWC076NX@suppilovahvero/#t
>
>> [...]

Ah I missed that. No problem to me.

>>
>>>
>>> --- /dev/null
>>> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
>>> @@ -0,0 +1,72 @@
>>> +/* SPDX-License-Identifier: GPL-2.0 */
>>> +#ifndef _SGX_EPC_CGROUP_H_
>>> +#define _SGX_EPC_CGROUP_H_
>>> +
>>> +#include <asm/sgx.h>
>>
>> I don't see why you need <asm/sgx.h> here. Also, ...
>>
>>> +#include <linux/cgroup.h>
>>> +#include <linux/misc_cgroup.h>
>>> +
>>> +#include "sgx.h"
>>
>> ... "sgx.h" already includes <asm/sgx.h>
>>
>> [...]
>>
> right
>
>>>
>>> +static inline struct sgx_cgroup *sgx_get_current_cg(void)
>>> +{
>>> +	/* get_current_misc_cg() never returns NULL when Kconfig enabled */
>>> +	return sgx_cgroup_from_misc_cg(get_current_misc_cg());
>>> +}
>>
>> I spent some time looking into this. And yes, if I am reading the code
>> correctly, get_current_misc_cg() should never return NULL when Kconfig is
>> on.
>>
>> I typed my analysis below in [*]. And it would be helpful if any cgroup
>> expert could take a second look at this.
>>
>> [...]
>>
> Thanks for checking this. I did a similar analysis and agree with the
> conclusion. I think this is also confirmed by Michal's description AFAICT:
> "
> The current implementation creates root css object (see cgroup_init(),
> cgroup_ssid_enabled() check is after cgroup_init_subsys()).
> I.e. it will look like all tasks are members of root cgroup wrt given
> controller permanently and controller attribute files won't exist."

After looking I believe we can even disable MISC cgroup at runtime for a
particular cgroup (haven't actually verified on a real machine, though):

 # echo "-misc" > /sys/fs/cgroup/my_group/cgroup.subtree_control

And if you look at the MISC cgroup core code, many functions actually
handle a NULL css, e.g., misc_cg_try_charge():

int misc_cg_try_charge(enum misc_res_type type,
		       struct misc_cg *cg, u64 amount)
{
	...

	if (!(valid_type(type) && cg &&
	      READ_ONCE(misc_res_capacity[type])))
		return -EINVAL;

	...
}

That's why I am still a little bit worried about this. And it's better to
have cgroup expert(s) confirm here.

Btw, AMD SEV doesn't need to worry because it doesn't dereference @css but
just passes it to MISC cgroup core functions like misc_cg_try_charge().
But for SGX, we actually dereference it directly.
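The distinction raised above, between passing a possibly-NULL cgroup to a core helper that checks it and dereferencing it directly, can be sketched in user space as follows. All names here (`struct toy_misc_cg`, `toy_try_charge()`, `toy_priv_from_cg()`, `toy_priv_from_cg_checked()`) are hypothetical and only loosely modelled on `misc_cg_try_charge()` and `sgx_cgroup_from_misc_cg()`; the limit check is a simplification, not the kernel's implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct misc_cg. */
struct toy_misc_cg {
	unsigned long usage;	/* amount currently charged */
	unsigned long max;	/* limit for this cgroup */
	void *priv;		/* resource-specific data, like misc_res.priv */
};

/* Core-style helper: tolerates a NULL cgroup and reports -EINVAL. */
static int toy_try_charge(struct toy_misc_cg *cg, unsigned long amount)
{
	if (!cg)
		return -EINVAL;
	if (cg->usage + amount > cg->max)
		return -EBUSY;
	cg->usage += amount;
	return 0;
}

/*
 * SGX-style helper: dereferences @cg without a check, so it is only
 * safe if the caller can guarantee a non-NULL pointer.
 */
static void *toy_priv_from_cg(struct toy_misc_cg *cg)
{
	return cg->priv;
}

/* Defensive variant shown purely for contrast. */
static void *toy_priv_from_cg_checked(struct toy_misc_cg *cg)
{
	return cg ? cg->priv : NULL;
}
```

A caller that only ever forwards the pointer to `toy_try_charge()` is unaffected by a NULL cgroup, whereas a caller of `toy_priv_from_cg()` depends entirely on the never-NULL guarantee being true.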
On Thu, 18 Apr 2024 18:29:53 -0500, Huang, Kai <kai.huang@intel.com> wrote:

>>>>
>>>> --- /dev/null
>>>> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
>>>> @@ -0,0 +1,72 @@
>>>> +/* SPDX-License-Identifier: GPL-2.0 */
>>>> +#ifndef _SGX_EPC_CGROUP_H_
>>>> +#define _SGX_EPC_CGROUP_H_
>>>> +
>>>> +#include <asm/sgx.h>
>>>
>>> I don't see why you need <asm/sgx.h> here. Also, ...
>>>
>>>> +#include <linux/cgroup.h>
>>>> +#include <linux/misc_cgroup.h>
>>>> +
>>>> +#include "sgx.h"
>>>
>>> ... "sgx.h" already includes <asm/sgx.h>
>>>
>>> [...]
>>>
>> right
>>
>>>>
>>>> +static inline struct sgx_cgroup *sgx_get_current_cg(void)
>>>> +{
>>>> +	/* get_current_misc_cg() never returns NULL when Kconfig enabled */
>>>> +	return sgx_cgroup_from_misc_cg(get_current_misc_cg());
>>>> +}
>>>
>>> I spent some time looking into this. And yes, if I am reading the code
>>> correctly, get_current_misc_cg() should never return NULL when Kconfig
>>> is on.
>>>
>>> I typed my analysis below in [*]. And it would be helpful if any cgroup
>>> expert could take a second look at this.
>>>
>>> [...]
>>>
>> Thanks for checking this. I did a similar analysis and agree with the
>> conclusion. I think this is also confirmed by Michal's description
>> AFAICT:
>> "
>> The current implementation creates root css object (see cgroup_init(),
>> cgroup_ssid_enabled() check is after cgroup_init_subsys()).
>> I.e. it will look like all tasks are members of root cgroup wrt given
>> controller permanently and controller attribute files won't exist."
>
> After looking I believe we can even disable MISC cgroup at runtime for a
> particular cgroup (haven't actually verified on a real machine, though):
>
>  # echo "-misc" > /sys/fs/cgroup/my_group/cgroup.subtree_control
>

My test confirms this does not cause a NULL cgroup for the tasks.
It actually works the same way as the command-line disable, except that
it only disables misc in the subtree and does not show any misc.* files or
allow creating such files in the subtree.

> And if you look at the MISC cgroup core code, many functions actually
> handle a NULL css, e.g., misc_cg_try_charge():
>
> int misc_cg_try_charge(enum misc_res_type type,
> 			struct misc_cg *cg, u64 amount)
> {
> 	...
>
> 	if (!(valid_type(type) && cg &&
> 	      READ_ONCE(misc_res_capacity[type])))
> 		return -EINVAL;
>
> 	...
> }
>
> That's why I am still a little bit worried about this. And it's better
> to have cgroup expert(s) confirm here.
>

I think it's just being defensive, as this function is a public API called
by other parts of the kernel. Documentation of task_get_css() says it
always returns a valid css. This function is used by get_current_misc_cg()
to get the css reference.

/**
 * task_get_css - find and get the css for (task, subsys)
 * @task: the target task
 * @subsys_id: the target subsystem ID
 *
 * Find the css for the (@task, @subsys_id) combination, increment a
 * reference on and return it. This function is guaranteed to return a
 * valid css. The returned css may already have been offlined.
 */
static inline struct cgroup_subsys_state *
task_get_css(struct task_struct *task, int subsys_id)

If you look at the code of this function, you will see it does not check
the result of task_css() for NULL either.

So I think we can be pretty sure here: it's confirmed by this
documentation and by testing.

Thanks
Haitao
> Documentation of task_get_css() says it always returns a valid css. This
> function is used by get_current_misc_cg() to get the css reference.
>
>
> /**
>  * task_get_css - find and get the css for (task, subsys)
>  * @task: the target task
>  * @subsys_id: the target subsystem ID
>  *
>  * Find the css for the (@task, @subsys_id) combination, increment a
>  * reference on and return it. This function is guaranteed to return a
>  * valid css. The returned css may already have been offlined.
>  */
> static inline struct cgroup_subsys_state *
> task_get_css(struct task_struct *task, int subsys_id)

Ah, I missed this comment. This confirms my code reading too.

>
>
> If you look at the code of this function, you will see it does not check
> the result of task_css() for NULL either.
>
> So I think we can be pretty sure here: it's confirmed by this
> documentation and by testing.

Yeah agreed. Thanks.
diff --git a/arch/x86/kernel/cpu/sgx/Makefile b/arch/x86/kernel/cpu/sgx/Makefile
index 9c1656779b2a..400baa7cfb69 100644
--- a/arch/x86/kernel/cpu/sgx/Makefile
+++ b/arch/x86/kernel/cpu/sgx/Makefile
@@ -1,6 +1,7 @@
 obj-y += \
 	driver.o \
 	encl.o \
+	epc_cgroup.o \
 	ioctl.o \
 	main.o
 obj-$(CONFIG_X86_SGX_KVM) += virt.o
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
new file mode 100644
index 000000000000..ff4d4a25dbe7
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -0,0 +1,72 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2022-2024 Intel Corporation. */
+
+#include <linux/atomic.h>
+#include <linux/kernel.h>
+#include "epc_cgroup.h"
+
+/* The root SGX EPC cgroup */
+static struct sgx_cgroup sgx_cg_root;
+
+/**
+ * sgx_cgroup_try_charge() - try to charge cgroup for a single EPC page
+ *
+ * @sgx_cg: The EPC cgroup to be charged for the page.
+ * Return:
+ * * %0 - If successfully charged.
+ * * -errno - for failures.
+ */
+int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg)
+{
+	return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, sgx_cg->cg, PAGE_SIZE);
+}
+
+/**
+ * sgx_cgroup_uncharge() - uncharge a cgroup for an EPC page
+ * @sgx_cg: The charged sgx cgroup.
+ */
+void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg)
+{
+	misc_cg_uncharge(MISC_CG_RES_SGX_EPC, sgx_cg->cg, PAGE_SIZE);
+}
+
+static void sgx_cgroup_free(struct misc_cg *cg)
+{
+	struct sgx_cgroup *sgx_cg;
+
+	sgx_cg = sgx_cgroup_from_misc_cg(cg);
+	if (!sgx_cg)
+		return;
+
+	kfree(sgx_cg);
+}
+
+static void sgx_cgroup_misc_init(struct misc_cg *cg, struct sgx_cgroup *sgx_cg)
+{
+	cg->res[MISC_CG_RES_SGX_EPC].priv = sgx_cg;
+	sgx_cg->cg = cg;
+}
+
+static int sgx_cgroup_alloc(struct misc_cg *cg)
+{
+	struct sgx_cgroup *sgx_cg;
+
+	sgx_cg = kzalloc(sizeof(*sgx_cg), GFP_KERNEL);
+	if (!sgx_cg)
+		return -ENOMEM;
+
+	sgx_cgroup_misc_init(cg, sgx_cg);
+
+	return 0;
+}
+
+const struct misc_res_ops sgx_cgroup_ops = {
+	.alloc = sgx_cgroup_alloc,
+	.free = sgx_cgroup_free,
+};
+
+void sgx_cgroup_init(void)
+{
+	misc_cg_set_ops(MISC_CG_RES_SGX_EPC, &sgx_cgroup_ops);
+	sgx_cgroup_misc_init(misc_cg_root(), &sgx_cg_root);
+}
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
new file mode 100644
index 000000000000..bd9606479e67
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
@@ -0,0 +1,72 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _SGX_EPC_CGROUP_H_
+#define _SGX_EPC_CGROUP_H_
+
+#include <asm/sgx.h>
+#include <linux/cgroup.h>
+#include <linux/misc_cgroup.h>
+
+#include "sgx.h"
+
+#ifndef CONFIG_CGROUP_MISC
+
+#define MISC_CG_RES_SGX_EPC MISC_CG_RES_TYPES
+struct sgx_cgroup;
+
+static inline struct sgx_cgroup *sgx_get_current_cg(void)
+{
+	return NULL;
+}
+
+static inline void sgx_put_cg(struct sgx_cgroup *sgx_cg) { }
+
+static inline int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg)
+{
+	return 0;
+}
+
+static inline void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg) { }
+
+static inline void sgx_cgroup_init(void) { }
+
+#else /* CONFIG_CGROUP_MISC */
+
+struct sgx_cgroup {
+	struct misc_cg *cg;
+};
+
+static inline struct sgx_cgroup *sgx_cgroup_from_misc_cg(struct misc_cg *cg)
+{
+	return (struct sgx_cgroup *)(cg->res[MISC_CG_RES_SGX_EPC].priv);
+}
+
+/**
+ * sgx_get_current_cg() - get the EPC cgroup of current process.
+ *
+ * Returned cgroup has its ref count increased by 1. Caller must call
+ * sgx_put_cg() to return the reference.
+ *
+ * Return: EPC cgroup to which the current task belongs to.
+ */
+static inline struct sgx_cgroup *sgx_get_current_cg(void)
+{
+	/* get_current_misc_cg() never returns NULL when Kconfig enabled */
+	return sgx_cgroup_from_misc_cg(get_current_misc_cg());
+}
+
+/**
+ * sgx_put_cg() - Put the EPC cgroup and reduce its ref count.
+ * @sgx_cg - EPC cgroup to put.
+ */
+static inline void sgx_put_cg(struct sgx_cgroup *sgx_cg)
+{
+	put_misc_cg(sgx_cg->cg);
+}
+
+int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg);
+void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg);
+void sgx_cgroup_init(void);
+
+#endif /* CONFIG_CGROUP_MISC */
+
+#endif /* _SGX_EPC_CGROUP_H_ */
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index d219f14365d4..d482ae7fdabf 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -6,6 +6,7 @@
 #include <linux/highmem.h>
 #include <linux/kthread.h>
 #include <linux/miscdevice.h>
+#include <linux/misc_cgroup.h>
 #include <linux/node.h>
 #include <linux/pagemap.h>
 #include <linux/ratelimit.h>
@@ -17,6 +18,7 @@
 #include "driver.h"
 #include "encl.h"
 #include "encls.h"
+#include "epc_cgroup.h"
 
 struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];
 static int sgx_nr_epc_sections;
@@ -558,7 +560,16 @@ int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
  */
 struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim)
 {
+	struct sgx_cgroup *sgx_cg;
 	struct sgx_epc_page *page;
+	int ret;
+
+	sgx_cg = sgx_get_current_cg();
+	ret = sgx_cgroup_try_charge(sgx_cg);
+	if (ret) {
+		sgx_put_cg(sgx_cg);
+		return ERR_PTR(ret);
+	}
 
 	for ( ; ; ) {
 		page = __sgx_alloc_epc_page();
@@ -567,8 +578,10 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim)
 			break;
 		}
 
-		if (list_empty(&sgx_active_page_list))
-			return ERR_PTR(-ENOMEM);
+		if (list_empty(&sgx_active_page_list)) {
+			page = ERR_PTR(-ENOMEM);
+			break;
+		}
 
 		if (reclaim == SGX_NO_RECLAIM) {
 			page = ERR_PTR(-EBUSY);
@@ -584,6 +597,15 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim)
 		cond_resched();
 	}
 
+	if (!IS_ERR(page)) {
+		WARN_ON_ONCE(sgx_epc_page_get_cgroup(page));
+		/* sgx_put_cg() in sgx_free_epc_page() */
+		sgx_epc_page_set_cgroup(page, sgx_cg);
+	} else {
+		sgx_cgroup_uncharge(sgx_cg);
+		sgx_put_cg(sgx_cg);
+	}
+
 	if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
 		wake_up(&ksgxd_waitq);
 
@@ -602,8 +624,16 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim)
 void sgx_free_epc_page(struct sgx_epc_page *page)
 {
 	struct sgx_epc_section *section = &sgx_epc_sections[page->section];
+	struct sgx_cgroup *sgx_cg = sgx_epc_page_get_cgroup(page);
 	struct sgx_numa_node *node = section->node;
 
+	/* sgx_cg could be NULL if called from __sgx_sanitize_pages() */
+	if (sgx_cg) {
+		sgx_cgroup_uncharge(sgx_cg);
+		sgx_put_cg(sgx_cg);
+		sgx_epc_page_set_cgroup(page, NULL);
+	}
+
 	spin_lock(&node->lock);
 
 	page->owner = NULL;
@@ -643,6 +673,8 @@ static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
 		section->pages[i].flags = 0;
 		section->pages[i].owner = NULL;
 		section->pages[i].poison = 0;
+		sgx_epc_page_set_cgroup(&section->pages[i], NULL);
+
 		list_add_tail(&section->pages[i].list, &sgx_dirty_page_list);
 	}
 
@@ -787,6 +819,7 @@ static void __init arch_update_sysfs_visibility(int nid) {}
 static bool __init sgx_page_cache_init(void)
 {
 	u32 eax, ebx, ecx, edx, type;
+	u64 capacity = 0;
 	u64 pa, size;
 	int nid;
 	int i;
@@ -837,6 +870,7 @@ static bool __init sgx_page_cache_init(void)
 
 		sgx_epc_sections[i].node = &sgx_numa_nodes[nid];
 		sgx_numa_nodes[nid].size += size;
+		capacity += size;
 
 		sgx_nr_epc_sections++;
 	}
@@ -846,6 +880,8 @@ static bool __init sgx_page_cache_init(void)
 		return false;
 	}
 
+	misc_cg_set_capacity(MISC_CG_RES_SGX_EPC, capacity);
+
 	return true;
 }
 
@@ -942,6 +978,9 @@ static int __init sgx_init(void)
 	if (sgx_vepc_init() && ret)
 		goto err_provision;
 
+	/* Setup cgroup if either the native or vepc driver is active */
+	sgx_cgroup_init();
+
 	return 0;
 
 err_provision:
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index ca34cd4f58ac..fae8eef10232 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -39,14 +39,35 @@ enum sgx_reclaim {
 	SGX_DO_RECLAIM
 };
 
+struct sgx_cgroup;
+
 struct sgx_epc_page {
 	unsigned int section;
 	u16 flags;
 	u16 poison;
 	struct sgx_encl_page *owner;
 	struct list_head list;
+#ifdef CONFIG_CGROUP_MISC
+	struct sgx_cgroup *sgx_cg;
+#endif
 };
 
+static inline void sgx_epc_page_set_cgroup(struct sgx_epc_page *page, struct sgx_cgroup *cg)
+{
+#ifdef CONFIG_CGROUP_MISC
+	page->sgx_cg = cg;
+#endif
+}
+
+static inline struct sgx_cgroup *sgx_epc_page_get_cgroup(struct sgx_epc_page *page)
+{
+#ifdef CONFIG_CGROUP_MISC
+	return page->sgx_cg;
+#else
+	return NULL;
+#endif
+}
+
 /*
  * Contains the tracking data for NUMA nodes having EPC pages. Most importantly,
  * the free page list local to the node is stored here.
diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h
index 440ed2bb8053..c9b47a5e966a 100644
--- a/include/linux/misc_cgroup.h
+++ b/include/linux/misc_cgroup.h
@@ -46,11 +46,13 @@ struct misc_res_ops {
  * @max: Maximum limit on the resource.
  * @usage: Current usage of the resource.
  * @events: Number of times, the resource limit exceeded.
+ * @priv: resource specific data.
  */
 struct misc_res {
 	u64 max;
 	atomic64_t usage;
 	atomic64_t events;
+	void *priv;
 };
 
 /**
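As a reading aid for the allocation and free paths in the patch above, here is a minimal user-space sketch of the same ordering: take a reference, charge one page, try to get a page, and undo both on failure; on free, uncharge and drop the reference. Everything in it (`struct toy_cg`, `struct toy_page`, `toy_alloc_page()`, `toy_free_page()`, `TOY_PAGE_SIZE`, the plain integer reference count) is hypothetical and much simpler than the kernel code. It has no locking and no reclaim, and returns an errno-style int instead of an ERR_PTR().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096UL

/* Hypothetical model of an EPC cgroup. */
struct toy_cg {
	unsigned long usage;	/* bytes charged, in the spirit of misc.current */
	unsigned long max;	/* hard limit, in the spirit of misc.max */
	int refs;		/* references held on this cgroup */
};

/* Hypothetical model of an EPC page. */
struct toy_page {
	struct toy_cg *cg;	/* charged cgroup, NULL while the page is free */
	int in_use;
};

static int toy_try_charge(struct toy_cg *cg)
{
	if (cg->usage + TOY_PAGE_SIZE > cg->max)
		return -EBUSY;
	cg->usage += TOY_PAGE_SIZE;
	return 0;
}

static void toy_uncharge(struct toy_cg *cg)
{
	cg->usage -= TOY_PAGE_SIZE;
}

/* Charge first, then allocate; returns 0 on success or -errno. */
static int toy_alloc_page(struct toy_cg *cg, struct toy_page *pool,
			  size_t nr, struct toy_page **out)
{
	size_t i;
	int ret;

	cg->refs++;			/* models sgx_get_current_cg() */
	ret = toy_try_charge(cg);
	if (ret) {
		cg->refs--;		/* models sgx_put_cg() */
		return ret;
	}

	for (i = 0; i < nr; i++) {
		if (!pool[i].in_use) {
			pool[i].in_use = 1;
			pool[i].cg = cg;	/* reference kept until free */
			*out = &pool[i];
			return 0;
		}
	}

	/* No page available: undo the charge and drop the reference. */
	toy_uncharge(cg);
	cg->refs--;
	return -ENOMEM;
}

/* Uncharge and drop the reference only if the page was ever charged. */
static void toy_free_page(struct toy_page *page)
{
	if (page->cg) {
		toy_uncharge(page->cg);
		page->cg->refs--;
		page->cg = NULL;
	}
	page->in_use = 0;
}
```

With a limit of two pages, a third allocation fails at the charge step and leaves usage and the reference count unchanged, which mirrors the "simply blocks additional EPC allocation" behavior described in the commit message.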