x86/sgx: Set active memcg prior to shmem allocation

Message ID: 20220517164713.4610-1-kristen@linux.intel.com
State: New, archived
Series: x86/sgx: Set active memcg prior to shmem allocation

Commit Message

Kristen Carlson Accardi May 17, 2022, 4:47 p.m. UTC
When the system runs out of enclave memory, SGX can reclaim EPC pages
by swapping to normal RAM. These backing pages are allocated via a
per-enclave shared memory area. Since SGX allows unlimited
overcommit on EPC memory, the reclaimer thread can allocate a large
number of backing RAM pages in response to EPC memory pressure.

When the shared memory backing RAM allocation occurs during
the reclaimer thread context, the shared memory is charged to
the root memory control group, and the shmem usage of the enclave
is not properly accounted for, making cgroups ineffective at
limiting the amount of RAM an enclave can consume.

For example, when using a cgroup to launch a set of test
enclaves, the kernel does not properly account for 50% - 75% of
shmem page allocations on average. In the worst case, when
nearly all allocations occur during the reclaimer thread, the
kernel accounts less than one percent of the enclave's shmem usage
to the correct cgroup.

SGX currently stores a list of mm_structs that are associated with
an enclave. In order to allow the enclave's cgroup to more accurately
reflect the shmem usage, the memory control group (struct mem_cgroup)
of one of these mm_structs can be set as the active memory cgroup
prior to allocating any EPC backing pages. This will make any shmem
allocations be charged to a memory control group associated with the
enclave's cgroup. This will allow memory cgroup limits to restrict
RAM usage more effectively.
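
For reference, the mechanism here is the kernel's existing
set_active_memcg() helper (include/linux/sched/mm.h). The usual
pattern, sketched here rather than quoted from the final code, is:

	struct mem_cgroup *old_memcg;

	old_memcg = set_active_memcg(memcg);
	/*
	 * Allocations made here are charged to @memcg instead of to
	 * the root memcg that a kthread would otherwise fall back to.
	 */
	set_active_memcg(old_memcg);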

This patch will create a new function - sgx_encl_alloc_backing().
This function will be used whenever a new backing storage page
needs to be allocated. Previously the same function was used for
page allocation as well as retrieving a previously allocated page.
Prior to backing page allocation, if there is a mm_struct associated
with the enclave that is requesting the allocation, it will be set
as the active memory control group.

Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
---
 arch/x86/kernel/cpu/sgx/encl.c | 111 ++++++++++++++++++++++++++++++++-
 arch/x86/kernel/cpu/sgx/encl.h |   6 +-
 arch/x86/kernel/cpu/sgx/main.c |   4 +-
 3 files changed, 115 insertions(+), 6 deletions(-)

Comments

Dave Hansen May 17, 2022, 5:57 p.m. UTC | #1
... also adding the same folks that mhocko did in the other post

On 5/17/22 09:47, Kristen Carlson Accardi wrote:
> When the system runs out of enclave memory, SGX can reclaim EPC pages
> by swapping to normal RAM. These backing pages are allocated via a
> per-enclave shared memory area. Since SGX allows unlimited over
> commit on EPC memory, the reclaimer thread can allocate a large
> number of backing RAM pages in response to EPC memory pressure.

A few bits of info that the folks not deeply familiar with SGX might
care about: SGX "enclave memory" is RAM, but it is marked reserved by
the BIOS and not managed by the core mm.  The SGX "driver" manages the
memory and has its own little mm subsystem, including a reclaimer.

( Aside: If you haven't encountered SGX before, as core mm folks, your
  first reaction is going to be to recoil in disgust.  This is an
  appropriate reaction.  In order to mitigate attacks from the OS, the
  SGX architecture partially duplicates a ton of existing x86
  architectural structures.  For instance, SGX has its own page
  permissions which are separate from the page tables.  SGX is weird. )

> When the shared memory backing RAM allocation occurs during
> the reclaimer thread context, the shared memory is charged to
> the root memory control group, and the shmem usage of the enclave
> is not properly accounted for, making cgroups ineffective at
> limiting the amount of RAM an enclave can consume.

One more bit of context:

Just like the core mm, SGX has both a direct and an indirect reclaim
path.  The direct reclaim path properly accounts shared memory
allocations to the cgroup of the task doing the reclaim.  The problem
here is with the SGX indirect reclaim path.

> For example, when using a cgroup to launch a set of test
> enclaves, the kernel does not properly account for 50% - 75% of
> shmem page allocations on average. In the worst case, when
> nearly all allocations occur during the reclaimer thread, the

				s/during the/in/

> kernel accounts less than a percent of the amount of shmem used
> by the enclave's cgroup to the correct cgroup.
> 
> SGX currently stores a list of mm_structs that are associated with
> an enclave. In order to allow the enclave's cgroup to more accurately
> reflect the shmem usage, the memory control group (struct mem_cgroup)
> of one of these mm_structs can be set as the active memory cgroup
> prior to allocating any EPC backing pages. This will make any shmem
> allocations be charged to a memory control group associated with the
> enclave's cgroup. This will allow memory cgroup limits to restrict
> RAM usage more effectively.

Let's make this a bit more imperative:

	SGX stores a list of mm_structs that are associated with an
	enclave.  Pick one of them during reclaim and charge that mm's
 	memcg with the shmem allocation.  The one that gets picked is
	arbitrary, but this list almost always only has one mm.  The
	cases where there is more than one mm with *different* memcgs
	are not even worth considering.

> This patch will create a new function - sgx_encl_alloc_backing().

No "this patch"'s, please.  Replace:

	This patch will create a new function -

With:

	Create a new function -

> This function will be used whenever a new backing storage page
> needs to be allocated. Previously the same function was used for
> page allocation as well as retrieving a previously allocated page.
> Prior to backing page allocation, if there is a mm_struct associated
> with the enclave that is requesting the allocation, it will be set
> as the active memory control group.
> [...]
> +static struct mem_cgroup * sgx_encl_set_active_memcg(struct sgx_encl *encl)

			     ^ stray whitespace

A comment saying what this returns would be nice too.

Could this maybe be named something like:

	set_active_memcg_from_encl()

This otherwise makes it sound like it's setting an *enclave's* memcg.

> +{
> +	struct mm_struct *mm = current->mm;
> +	struct sgx_encl_mm *encl_mm;
> +	struct mem_cgroup *memcg;
> +	int idx;
> +
> +	/*
> +	 * If current->mm is NULL, get_mem_cgroup_from_mm() will return
> +	 * the currently active mem_cgroup. This may be the root mem_cgroup
> +	 * if there is no active mem_cgroup set.
> +	 */
> +	memcg = get_mem_cgroup_from_mm(mm);
> +
> +	/*
> +	 * If we already have an mm, we are not in thread context and the
> +	 * mem_cgroup for the enclave will be charged for any allocations.
> +	 */
> +	if (mm)
> +		return memcg;

Can we just be more direct about this?

	/*
	 * If being called from normal task context, just use
	 * the task's normal memcg.  The remainder of the handling
	 * is for ksgxd.
	 */
	if (!current_is_ksgxd())
		return get_mem_cgroup_from_mm(mm);

It will mean adding that helper, but it's a *lot* more obvious what is
going on.
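
A minimal sketch of what that helper could look like (illustrative
only; it assumes the ksgxd task_struct is saved off when the thread
is created, e.g. in sgx_page_reclaimer_init()):

	/* Stashed by the reclaimer init code when ksgxd is started. */
	static struct task_struct *ksgxd_tsk;

	static bool current_is_ksgxd(void)
	{
		return current == ksgxd_tsk;
	}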

> +	/*
> +	 * If there is no mm, it means that we are in thread context,
> +	 * and any backing RAM allocations would be charged to the root
> +	 * mem_cgroup unless the active mem_cgroup is set. Search the
> +	 * enclave's mm_list to find any mm associated with this enclave.
> +	 */
> +	idx = srcu_read_lock(&encl->srcu);
> +
> +	list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) {
> +		if (encl_mm->mm == NULL)
> +			continue;
> +
> +		mm = encl_mm->mm;
> +		break;
> +
> +	}
> +
> +	srcu_read_unlock(&encl->srcu, idx);
> +
> +	/*
> +	 * If an associated mm was not found, the allocation will just
> +	 * need to be charged to the root mem_cgroup.
> +	 */
> +	if (!mm)
> +		return memcg;
> +
> +	memcg = get_mem_cgroup_from_mm(mm);

What keeps the mm around between the srcu_read_unlock() and here?  Do
you need a mmget_not_zero() like sgx_reclaimer_block() uses?
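
One way to close that window, following the sgx_reclaimer_block()
pattern (a sketch, not the submitted code), would be to pin the mm
with mmget_not_zero() while still inside the SRCU read-side section
and drop the pin once the memcg reference has been taken:

	idx = srcu_read_lock(&encl->srcu);

	list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) {
		/* Skip any mm that is already being torn down. */
		if (!mmget_not_zero(encl_mm->mm))
			continue;

		mm = encl_mm->mm;
		break;
	}

	srcu_read_unlock(&encl->srcu, idx);

	if (!mm)
		return memcg;

	/* The pin taken above keeps the mm valid across this call. */
	memcg = get_mem_cgroup_from_mm(mm);
	mmput_async(mm);	/* as sgx_reclaimer_block() does */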

> +	/*
> +	 * set_active_memcg() returns the previous active memcg.
> +	 */
> +	return set_active_memcg(memcg);
> +}
> +
> +/**
> + * sgx_encl_alloc_backing() - allocate a new backing storage page
> + * @encl:	an enclave pointer
> + * @page_index:	enclave page index
> + * @backing:	data for accessing backing storage for the page
> + *
> + * If this function is called from the kernel thread, it will set
> + * the active memcg to one of the enclaves mm's in order to ensure

				"enclave's"
Jarkko Sakkinen May 18, 2022, 1:24 a.m. UTC | #2
On Tue, 2022-05-17 at 09:47 -0700, Kristen Carlson Accardi wrote:
> [...]
> +static struct mem_cgroup * sgx_encl_set_active_memcg(struct sgx_encl *encl)
> +{
> +       struct mm_struct *mm = current->mm;
> +       struct sgx_encl_mm *encl_mm;
> +       struct mem_cgroup *memcg;
> +       int idx;
> +
> +       /*
> +        * If current->mm is NULL, get_mem_cgroup_from_mm() will return
> +        * the currently active mem_cgroup. This may be the root mem_cgroup
> +        * if there is no active mem_cgroup set.
> +        */
> +       memcg = get_mem_cgroup_from_mm(mm);
> +
> +       /*
> +        * If we already have an mm, we are not in thread context and the
> +        * mem_cgroup for the enclave will be charged for any allocations.
> +        */


AFAIK the CPU is always executing in some thread.

This would be a lot clearer:

/*
 * If the CPU is not inside a kthread, return the active memcg.
 */
 
Note the absence of "we" in my version. It's better to be
explicit when you can.

> +       if (mm)
> +               return memcg;
> [...]
BR, Jarkko
Shakeel Butt May 19, 2022, 5:36 a.m. UTC | #3
On Tue, May 17, 2022 at 09:47:13AM -0700, Kristen Carlson Accardi wrote:
>  
[...]
> +int sgx_encl_alloc_backing(struct sgx_encl *encl, unsigned long page_index,
> +			   struct sgx_backing *backing)
> +{
> +	struct mem_cgroup *old_memcg;
> +	int ret;
> +
> +	old_memcg = sgx_encl_set_active_memcg(encl);

This function is leaking memcg reference. Please change
sgx_encl_set_active_memcg() to sgx_encl_get_mem_cgroup() or something
which will return the memcg with the refcount elevated. Then use
set_active_memcg(returned_memcg) here.

> +
> +	ret = sgx_encl_get_backing(encl, page_index, backing);
> +
> +	set_active_memcg(old_memcg);

mem_cgroup_put(returned_memcg) here.

> +
> +	return ret;
> +}
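
Assembled, the suggested calling convention would look roughly like
this (a sketch; sgx_encl_get_mem_cgroup() is the proposed rename and
is assumed to return a memcg with an elevated refcount):

	int sgx_encl_alloc_backing(struct sgx_encl *encl, unsigned long page_index,
				   struct sgx_backing *backing)
	{
		struct mem_cgroup *encl_memcg = sgx_encl_get_mem_cgroup(encl);
		struct mem_cgroup *old_memcg = set_active_memcg(encl_memcg);
		int ret;

		ret = sgx_encl_get_backing(encl, page_index, backing);

		set_active_memcg(old_memcg);
		mem_cgroup_put(encl_memcg);	/* drop the reference taken above */

		return ret;
	}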
Shakeel Butt May 19, 2022, 5:43 a.m. UTC | #4
(Fixing the CC list and sending again)

On Tue, May 17, 2022 at 09:47:13AM -0700, Kristen Carlson Accardi wrote:
> [...]
Kristen Carlson Accardi May 19, 2022, 8:44 p.m. UTC | #5
On Wed, 2022-05-18 at 22:43 -0700, Shakeel Butt wrote:
> [...]

Thank you for your review. I will incorporate this fix into my next
version.
Kristen Carlson Accardi May 19, 2022, 8:45 p.m. UTC | #6
On Tue, 2022-05-17 at 10:57 -0700, Dave Hansen wrote:
> [...]

Thanks for your review. I've incorporated your feedback into my next
version.
Kristen Carlson Accardi May 19, 2022, 8:47 p.m. UTC | #7
On Wed, 2022-05-18 at 04:24 +0300, Jarkko Sakkinen wrote:
> [...]

Thank you for your feedback. I've incorporated it into my next version.

Kristen

Patch

diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index 001808e3901c..c3a5e57040bc 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -32,7 +32,7 @@  static int __sgx_encl_eldu(struct sgx_encl_page *encl_page,
 	else
 		page_index = PFN_DOWN(encl->size);
 
-	ret = sgx_encl_get_backing(encl, page_index, &b);
+	ret = sgx_encl_lookup_backing(encl, page_index, &b);
 	if (ret)
 		return ret;
 
@@ -574,7 +574,7 @@  static struct page *sgx_encl_get_backing_page(struct sgx_encl *encl,
  *   0 on success,
  *   -errno otherwise.
  */
-int sgx_encl_get_backing(struct sgx_encl *encl, unsigned long page_index,
+static int sgx_encl_get_backing(struct sgx_encl *encl, unsigned long page_index,
 			 struct sgx_backing *backing)
 {
 	pgoff_t pcmd_index = PFN_DOWN(encl->size) + 1 + (page_index >> 5);
@@ -601,6 +601,113 @@  int sgx_encl_get_backing(struct sgx_encl *encl, unsigned long page_index,
 	return 0;
 }
 
+static struct mem_cgroup * sgx_encl_set_active_memcg(struct sgx_encl *encl)
+{
+	struct mm_struct *mm = current->mm;
+	struct sgx_encl_mm *encl_mm;
+	struct mem_cgroup *memcg;
+	int idx;
+
+	/*
+	 * If current->mm is NULL, get_mem_cgroup_from_mm() will return
+	 * the currently active mem_cgroup. This may be the root mem_cgroup
+	 * if there is no active mem_cgroup set.
+	 */
+	memcg = get_mem_cgroup_from_mm(mm);
+
+	/*
+	 * If we already have an mm, we are not in thread context and the
+	 * mem_cgroup for the enclave will be charged for any allocations.
+	 */
+	if (mm)
+		return memcg;
+
+	/*
+	 * If there is no mm, it means that we are in thread context,
+	 * and any backing RAM allocations would be charged to the root
+	 * mem_cgroup unless the active mem_cgroup is set. Search the
+	 * enclave's mm_list to find any mm associated with this enclave.
+	 */
+	idx = srcu_read_lock(&encl->srcu);
+
+	list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) {
+		if (encl_mm->mm == NULL)
+			continue;
+
+		mm = encl_mm->mm;
+		break;
+
+	}
+
+	srcu_read_unlock(&encl->srcu, idx);
+
+	/*
+	 * If an associated mm was not found, the allocation will just
+	 * need to be charged to the root mem_cgroup.
+	 */
+	if (!mm)
+		return memcg;
+
+	memcg = get_mem_cgroup_from_mm(mm);
+
+	/*
+	 * set_active_memcg() returns the previous active memcg.
+	 */
+	return set_active_memcg(memcg);
+}
+
+/**
+ * sgx_encl_alloc_backing() - allocate a new backing storage page
+ * @encl:	an enclave pointer
+ * @page_index:	enclave page index
+ * @backing:	data for accessing backing storage for the page
+ *
+ * If this function is called from the kernel thread, it will set
+ * the active memcg to one of the enclaves mm's in order to ensure
+ * that shmem page allocations are charged to the enclave when they
+ * are retrieved. Upon exit, the old memcg (if it existed at all)
+ * will be restored.
+ *
+ * Return:
+ *   0 on success,
+ *   -errno otherwise.
+ */
+int sgx_encl_alloc_backing(struct sgx_encl *encl, unsigned long page_index,
+			   struct sgx_backing *backing)
+{
+	struct mem_cgroup *old_memcg;
+	int ret;
+
+	old_memcg = sgx_encl_set_active_memcg(encl);
+
+	ret = sgx_encl_get_backing(encl, page_index, backing);
+
+	set_active_memcg(old_memcg);
+
+	return ret;
+}
+
+/**
+ * sgx_encl_lookup_backing() - retrieve an existing backing storage page
+ * @encl:	an enclave pointer
+ * @page_index:	enclave page index
+ * @backing:	data for accessing backing storage for the page
+ *
+ * Retrieve a backing page for loading data back into an EPC page with ELDU.
+ * It is the caller's responsibility to ensure that it is appropriate to use
+ * sgx_encl_lookup_backing() rather than sgx_encl_alloc_backing(). If lookup is
+ * not used correctly, this will cause an allocation which is not accounted for.
+ *
+ * Return:
+ *   0 on success,
+ *   -errno otherwise.
+ */
+int sgx_encl_lookup_backing(struct sgx_encl *encl, unsigned long page_index,
+			   struct sgx_backing *backing)
+{
+	return sgx_encl_get_backing(encl, page_index, backing);
+}
+
 /**
  * sgx_encl_put_backing() - Unpin the backing storage
  * @backing:	data for accessing backing storage for the page
diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
index fec43ca65065..7816cfe8f832 100644
--- a/arch/x86/kernel/cpu/sgx/encl.h
+++ b/arch/x86/kernel/cpu/sgx/encl.h
@@ -105,8 +105,10 @@  int sgx_encl_may_map(struct sgx_encl *encl, unsigned long start,
 
 void sgx_encl_release(struct kref *ref);
 int sgx_encl_mm_add(struct sgx_encl *encl, struct mm_struct *mm);
-int sgx_encl_get_backing(struct sgx_encl *encl, unsigned long page_index,
-			 struct sgx_backing *backing);
+int sgx_encl_lookup_backing(struct sgx_encl *encl, unsigned long page_index,
+			    struct sgx_backing *backing);
+int sgx_encl_alloc_backing(struct sgx_encl *encl, unsigned long page_index,
+			   struct sgx_backing *backing);
 void sgx_encl_put_backing(struct sgx_backing *backing, bool do_write);
 int sgx_encl_test_and_clear_young(struct mm_struct *mm,
 				  struct sgx_encl_page *page);
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 4b41efc9e367..7d41c8538795 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -310,7 +310,7 @@  static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
 	encl->secs_child_cnt--;
 
 	if (!encl->secs_child_cnt && test_bit(SGX_ENCL_INITIALIZED, &encl->flags)) {
-		ret = sgx_encl_get_backing(encl, PFN_DOWN(encl->size),
+		ret = sgx_encl_alloc_backing(encl, PFN_DOWN(encl->size),
 					   &secs_backing);
 		if (ret)
 			goto out;
@@ -381,7 +381,7 @@  static void sgx_reclaim_pages(void)
 			goto skip;
 
 		page_index = PFN_DOWN(encl_page->desc - encl_page->encl->base);
-		ret = sgx_encl_get_backing(encl_page->encl, page_index, &backing[i]);
+		ret = sgx_encl_alloc_backing(encl_page->encl, page_index, &backing[i]);
 		if (ret)
 			goto skip;