
[RFC,00/20] Add Cgroup support for SGX EPC memory

Message ID 20220922171057.1236139-1-kristen@linux.intel.com (mailing list archive)

Message

Kristen Carlson Accardi Sept. 22, 2022, 5:10 p.m. UTC
Add a new cgroup controller to regulate the distribution of SGX EPC memory,
which is a subset of system RAM that is used to provide SGX-enabled
applications with protected memory, and is otherwise inaccessible.

SGX EPC memory allocations are separate from normal RAM allocations,
and are managed solely by the SGX subsystem. The existing cgroup memory
controller cannot be used to limit or account for SGX EPC memory.

This patchset implements the sgx_epc cgroup controller, which will provide
support for stats, events, and the following interface files:

sgx_epc.current
	A read-only value which represents the total amount of EPC
	memory currently being used by the cgroup and its descendants.

sgx_epc.low
	A read-write value which is used to set best-effort protection
	of EPC usage. If the EPC usage of a cgroup is below this value,
	then the cgroup's EPC memory will not be reclaimed if possible.

sgx_epc.high
	A read-write value which is used to set a best-effort limit
	on the amount of EPC usage a cgroup has. If a cgroup's usage
	goes past the high value, the EPC memory of that cgroup will
	get reclaimed back under the high limit.

sgx_epc.max
	A read-write value which is used to set a hard limit for
	cgroup EPC usage. If a cgroup's EPC usage reaches this limit,
	allocations are blocked until EPC memory can be reclaimed from
	the cgroup.
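
As a rough illustration of how these files might be driven from userspace,
here is a minimal sketch (the cgroup path and byte values are examples only,
and assume the controller has been enabled via cgroup.subtree_control):

#include <stdio.h>
#include <stdlib.h>

/* Illustrative only: paths and limits below are examples, not defaults. */
static void write_val(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f || fputs(val, f) == EOF) {
		perror(path);
		exit(1);
	}
	fclose(f);
}

int main(void)
{
	char buf[64];
	FILE *f;

	/* Hard cap: allocations beyond this block until reclaim succeeds. */
	write_val("/sys/fs/cgroup/myjob/sgx_epc.max", "67108864");
	/* Best-effort limit: reclaim pushes usage back under this value. */
	write_val("/sys/fs/cgroup/myjob/sgx_epc.high", "50331648");

	f = fopen("/sys/fs/cgroup/myjob/sgx_epc.current", "r");
	if (f && fgets(buf, sizeof(buf), f))
		printf("current EPC usage: %s", buf);
	if (f)
		fclose(f);
	return 0;
}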

This work was originally authored by Sean Christopherson a few years ago,
and was modified to work with more recent kernels.

The patchset adds support for multiple LRUs to track both reclaimable
EPC pages (i.e. pages the reclaimer knows about) and unreclaimable
EPC pages (i.e. pages the reclaimer isn't aware of, such as VA pages).
These pages are assigned to an LRU, as well as an enclave, so that an
enclave's full EPC usage can be tracked. During OOM events, an enclave
can have its memory zapped, and all the EPC pages not tracked by the
reclaimer can be freed.
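
For orientation, the LRU encapsulation described above is roughly of the
following shape (a simplified sketch for this cover letter, not the actual
code from the patches; field names may differ):

#include <linux/list.h>
#include <linux/spinlock.h>

struct sgx_epc_lru {
	spinlock_t lock;
	struct list_head reclaimable;	/* pages the reclaimer ages and evicts */
	struct list_head unreclaimable;	/* e.g. VA pages, freed only on OOM */
};

static void sgx_epc_lru_init(struct sgx_epc_lru *lru)
{
	spin_lock_init(&lru->lock);
	INIT_LIST_HEAD(&lru->reclaimable);
	INIT_LIST_HEAD(&lru->unreclaimable);
}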

I appreciate your comments and feedback.

Sean Christopherson (20):
  x86/sgx: Call cond_resched() at the end of sgx_reclaim_pages()
  x86/sgx: Store EPC page owner as a 'void *' to handle multiple users
  x86/sgx: Track owning enclave in VA EPC pages
  x86/sgx: Add 'struct sgx_epc_lru' to encapsulate lru list(s)
  x86/sgx: Introduce unreclaimable EPC page lists
  x86/sgx: Introduce RECLAIM_IN_PROGRESS flag for EPC pages
  x86/sgx: Use a list to track to-be-reclaimed pages during reclaim
  x86/sgx: Add EPC page flags to identify type of page
  x86/sgx: Allow reclaiming up to 32 pages, but scan 16 by default
  x86/sgx: Return the number of EPC pages that were successfully
    reclaimed
  x86/sgx: Add option to ignore age of page during EPC reclaim
  x86/sgx: Add helper to retrieve SGX EPC LRU given an EPC page
  x86/sgx: Prepare for multiple LRUs
  x86/sgx: Expose sgx_reclaim_pages() for use by EPC cgroup
  x86/sgx: Add helper to grab pages from an arbitrary EPC LRU
  x86/sgx: Add EPC OOM path to forcefully reclaim EPC
  cgroup, x86/sgx: Add SGX EPC cgroup controller
  x86/sgx: Enable EPC cgroup controller in SGX core
  x86/sgx: Add stats and events interfaces to EPC cgroup controller
  docs, cgroup, x86/sgx: Add SGX EPC cgroup controller documentation

 Documentation/admin-guide/cgroup-v2.rst | 201 +++++
 arch/x86/kernel/cpu/sgx/Makefile        |   1 +
 arch/x86/kernel/cpu/sgx/encl.c          |  89 ++-
 arch/x86/kernel/cpu/sgx/encl.h          |   4 +-
 arch/x86/kernel/cpu/sgx/epc_cgroup.c    | 950 ++++++++++++++++++++++++
 arch/x86/kernel/cpu/sgx/epc_cgroup.h    |  51 ++
 arch/x86/kernel/cpu/sgx/ioctl.c         |  13 +-
 arch/x86/kernel/cpu/sgx/main.c          | 389 ++++++++--
 arch/x86/kernel/cpu/sgx/sgx.h           |  40 +-
 arch/x86/kernel/cpu/sgx/virt.c          |  28 +-
 include/linux/cgroup_subsys.h           |   4 +
 init/Kconfig                            |  12 +
 12 files changed, 1669 insertions(+), 113 deletions(-)
 create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c
 create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h

Comments

Tejun Heo Sept. 22, 2022, 5:41 p.m. UTC | #1
Hello,

(cc'ing memcg folks)

On Thu, Sep 22, 2022 at 10:10:37AM -0700, Kristen Carlson Accardi wrote:
> Add a new cgroup controller to regulate the distribution of SGX EPC memory,
> which is a subset of system RAM that is used to provide SGX-enabled
> applications with protected memory, and is otherwise inaccessible.
> 
> SGX EPC memory allocations are separate from normal RAM allocations,
> and is managed solely by the SGX subsystem. The existing cgroup memory
> controller cannot be used to limit or account for SGX EPC memory.
> 
> This patchset implements the sgx_epc cgroup controller, which will provide
> support for stats, events, and the following interface files:
> 
> sgx_epc.current
> 	A read-only value which represents the total amount of EPC
> 	memory currently being used on by the cgroup and its descendents.
> 
> sgx_epc.low
> 	A read-write value which is used to set best-effort protection
> 	of EPC usage. If the EPC usage of a cgroup drops below this value,
> 	then the cgroup's EPC memory will not be reclaimed if possible.
> 
> sgx_epc.high
> 	A read-write value which is used to set a best-effort limit
> 	on the amount of EPC usage a cgroup has. If a cgroup's usage
> 	goes past the high value, the EPC memory of that cgroup will
> 	get reclaimed back under the high limit.
> 
> sgx_epc.max
> 	A read-write value which is used to set a hard limit for
> 	cgroup EPC usage. If a cgroup's EPC usage reaches this limit,
> 	allocations are blocked until EPC memory can be reclaimed from
> 	the cgroup.

I don't know how SGX uses its memory but you said in the other message that
it's usually a really small portion of the memory, and glancing at the code it
looks like it does its own page aging and all. Can you give some concrete
examples of how it's used and why we need cgroup support for it? Also, do you
really need all three control knobs here? e.g. given that .high is only really
useful in conjunction with memory pressure and oom handling from userspace,
I don't see how this would actually be useful for something like this.

Thanks.
Kristen Carlson Accardi Sept. 22, 2022, 6:59 p.m. UTC | #2
On Thu, 2022-09-22 at 07:41 -1000, Tejun Heo wrote:
> Hello,
> 
> (cc'ing memcg folks)
> 
> On Thu, Sep 22, 2022 at 10:10:37AM -0700, Kristen Carlson Accardi
> wrote:
> > Add a new cgroup controller to regulate the distribution of SGX EPC
> > memory,
> > which is a subset of system RAM that is used to provide SGX-enabled
> > applications with protected memory, and is otherwise inaccessible.
> > 
> > SGX EPC memory allocations are separate from normal RAM
> > allocations,
> > and is managed solely by the SGX subsystem. The existing cgroup
> > memory
> > controller cannot be used to limit or account for SGX EPC memory.
> > 
> > This patchset implements the sgx_epc cgroup controller, which will
> > provide
> > support for stats, events, and the following interface files:
> > 
> > sgx_epc.current
> >         A read-only value which represents the total amount of EPC
> >         memory currently being used on by the cgroup and its
> > descendents.
> > 
> > sgx_epc.low
> >         A read-write value which is used to set best-effort
> > protection
> >         of EPC usage. If the EPC usage of a cgroup drops below this
> > value,
> >         then the cgroup's EPC memory will not be reclaimed if
> > possible.
> > 
> > sgx_epc.high
> >         A read-write value which is used to set a best-effort limit
> >         on the amount of EPC usage a cgroup has. If a cgroup's
> > usage
> >         goes past the high value, the EPC memory of that cgroup
> > will
> >         get reclaimed back under the high limit.
> > 
> > sgx_epc.max
> >         A read-write value which is used to set a hard limit for
> >         cgroup EPC usage. If a cgroup's EPC usage reaches this
> > limit,
> >         allocations are blocked until EPC memory can be reclaimed
> > from
> >         the cgroup.
> 
> I don't know how SGX uses its memory but you said in the other
> message that
> it's usually a really small portion of the memory and glancing the
> code it
> looks like its own page aging and all. Can you give some concrete
> examples
> on how it's used and why we need cgroup support for it? Also, do you
> really
> need all three control knobs here? e.g. given that .high is only
> really
> useful in conjunction with memory pressure and oom handling from
> userspace,
> I don't see how this would actually be useful for something like
> this.
> 
> Thanks.
> 

Thanks for your question. SGX EPC memory is a global shared
resource that can be overcommitted. The SGX EPC controller should be
used similarly to the normal memory controller. Normally, when there is
pressure on EPC memory, the reclaimer thread will write out pages from
EPC memory to backing RAM that is allocated per enclave. It is
currently possible for even a single enclave to force all the other
enclaves to have their EPC pages written to backing RAM by allocating
all the available system EPC memory. This can cause performance issues
for the enclaves when they have to fault to bring those pages back in.

The sgx_epc.high value will help control the EPC usage of the cgroup. The
SGX reclaimer will use this value to prevent the total EPC usage of a
cgroup from exceeding it (best effort). This way, if a system
administrator would like to try to prevent single enclaves, or groups
of enclaves, from allocating all of the EPC memory and causing
performance issues for the other enclaves on the system, they can set
this limit. sgx_epc.max can be used to set a hard limit, which will
cause an enclave to have all of its EPC pages zapped; the enclave is
effectively killed until it is rebuilt by the owning SGX
application. sgx_epc.low can be used to try (best effort) to ensure
that some minimum amount of EPC pages is protected for enclaves in a
particular cgroup. This can be useful for preventing evictions and thus
performance issues due to faults.
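
To make the interaction between the three knobs concrete, the allocation
path would behave roughly like this (an illustrative sketch only; the struct
layout and helper functions below are placeholders, not code from the
patchset):

#include <linux/mm.h>
#include <linux/types.h>

struct sgx_epc_cgroup {
	unsigned long max;	/* sgx_epc.max, bytes */
	unsigned long high;	/* sgx_epc.high, bytes */
	unsigned long low;	/* sgx_epc.low, bytes */
};

unsigned long epc_cg_usage(struct sgx_epc_cgroup *cg);	/* placeholder */
bool epc_cg_reclaim(struct sgx_epc_cgroup *cg);		/* placeholder */

static int epc_cg_try_alloc_page(struct sgx_epc_cgroup *cg)
{
	/*
	 * sgx_epc.max is a hard cap: keep reclaiming (and eventually OOM
	 * the cgroup's enclaves) rather than exceed it.
	 */
	while (epc_cg_usage(cg) + PAGE_SIZE > cg->max) {
		if (!epc_cg_reclaim(cg))
			return -ENOMEM;
	}

	/* sgx_epc.high is best effort: kick reclaim but do not block. */
	if (epc_cg_usage(cg) + PAGE_SIZE > cg->high)
		epc_cg_reclaim(cg);

	/*
	 * sgx_epc.low is best-effort protection: the reclaimer skips
	 * cgroups below their low value while other pages are available.
	 */
	return 0;
}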

I hope this answers your question.

Thanks,
Kristen
Tejun Heo Sept. 22, 2022, 7:08 p.m. UTC | #3
Hello,

On Thu, Sep 22, 2022 at 11:59:14AM -0700, Kristen Carlson Accardi wrote:
> Thanks for your question. The SGX EPC memory is a global shared
> resource that can be over committed. The SGX EPC controller should be
> used similarly to the normal memory controller. Normally when there is
> pressure on EPC memory, the reclaimer thread will write out pages from
> EPC memory to a backing RAM that is allocated per enclave. It is
> possible currently for even a single enclave to force all the other
> enclaves to have their epc pages written to backing RAM by allocating
> all the available system EPC memory. This can cause performance issues
> for the enclaves when they have to fault to load pages page in.

Can you please give more concrete examples? I'd love to hear how SGX EPC
memory is typically used, in what amounts, and what the performance
implications are when pages get reclaimed, and so on. ie. Please describe a
realistic usage scenario of contention with sufficient details on how the
system is set up, what the applications are using the SGX EPC memory for and
how much, how the contention on memory affects the users and so on.

Thank you.
Dave Hansen Sept. 22, 2022, 9:03 p.m. UTC | #4
On 9/22/22 12:08, Tejun Heo wrote:
> Can you please give more concrete examples? I'd love to hear how the SGX EPC
> memory is typically used in what amounts and what's the performance
> implications when they get reclaimed and so on. ie. Please describe a
> realistic usage scenario of contention with sufficient details on how the
> system is set up, what the applications are using the SGX EPC memory for and
> how much, how the contention on memory affects the users and so on.

One wrinkle is that the apps that use SGX EPC memory are *normal* apps.
 There are frameworks that some folks are very excited about that allow
you to run mostly unmodified app stacks inside SGX.  For example:

	https://github.com/gramineproject/graphene

In fact, Gramine users are the troublesome ones for overcommit.  Most
explicitly-written SGX applications are quite austere in their SGX
memory use; they're probably never going to see overcommit.  These
Gramine-wrapped apps are (relative) pigs.  They've been the ones finding
bugs in the existing SGX overcommit code.

So, where does all the SGX memory go?  It's the usual suspects:
memcached and redis. ;)
Jarkko Sakkinen Sept. 23, 2022, 12:24 p.m. UTC | #5
On Thu, Sep 22, 2022 at 10:10:37AM -0700, Kristen Carlson Accardi wrote:
> Add a new cgroup controller to regulate the distribution of SGX EPC memory,
> which is a subset of system RAM that is used to provide SGX-enabled
> applications with protected memory, and is otherwise inaccessible.
> 
> SGX EPC memory allocations are separate from normal RAM allocations,
> and is managed solely by the SGX subsystem. The existing cgroup memory
> controller cannot be used to limit or account for SGX EPC memory.
> 
> This patchset implements the sgx_epc cgroup controller, which will provide
> support for stats, events, and the following interface files:
> 
> sgx_epc.current
> 	A read-only value which represents the total amount of EPC
> 	memory currently being used on by the cgroup and its descendents.
> 
> sgx_epc.low
> 	A read-write value which is used to set best-effort protection
> 	of EPC usage. If the EPC usage of a cgroup drops below this value,
> 	then the cgroup's EPC memory will not be reclaimed if possible.
> 
> sgx_epc.high
> 	A read-write value which is used to set a best-effort limit
> 	on the amount of EPC usage a cgroup has. If a cgroup's usage
> 	goes past the high value, the EPC memory of that cgroup will
> 	get reclaimed back under the high limit.
> 
> sgx_epc.max
> 	A read-write value which is used to set a hard limit for
> 	cgroup EPC usage. If a cgroup's EPC usage reaches this limit,
> 	allocations are blocked until EPC memory can be reclaimed from
> 	the cgroup.

It would be worth mentioning for clarity that shmem is accounted to
memcg.

BR, Jarkko
Tejun Heo Sept. 24, 2022, 12:09 a.m. UTC | #6
Hello,

On Thu, Sep 22, 2022 at 02:03:52PM -0700, Dave Hansen wrote:
> On 9/22/22 12:08, Tejun Heo wrote:
> > Can you please give more concrete examples? I'd love to hear how the SGX EPC
> > memory is typically used in what amounts and what's the performance
> > implications when they get reclaimed and so on. ie. Please describe a
> > realistic usage scenario of contention with sufficient details on how the
> > system is set up, what the applications are using the SGX EPC memory for and
> > how much, how the contention on memory affects the users and so on.
> 
> One wrinkle is that the apps that use SGX EPC memory are *normal* apps.
>  There are frameworks that some folks are very excited about that allow
> you to run mostly unmodified app stacks inside SGX.  For example:
> 
> 	https://github.com/gramineproject/graphene
> 
> In fact, Gramine users are the troublesome ones for overcommit.  Most
> explicitly-written SGX applications are quite austere in their SGX
> memory use; they're probably never going to see overcommit.  These
> Gramine-wrapped apps are (relative) pigs.  They've been the ones finding
> bugs in the existing SGX overcommit code.
> 
> So, where does all the SGX memory go?  It's the usual suspects:
> memcached and redis. ;)

Hey, so, I'm a bit wary that this doesn't seem to have strong demand at
this point. When there's clear shared demand, I usually hear from multiple
parties about their use cases and the practical problems they're trying to
solve and so on. This, at least to me, seems primarily driven by producers
rather than consumers.

There's nothing wrong with projecting future usages and jumping ahead of the
curve but there's a balance to hit, and going full-on with a memcg-style
controller with three control knobs seems to be jumping the gun and may create
commitments which we end up looking back on with a bit of regret.

Given that, how about this? We can easily add the functionality of .max
through the misc controller. Add a new key there, try-charge when allocating
new memory; if that fails, try reclaim and then fail the allocation if reclaim
fails hard enough. I believe that should give at least a reasonable place to
start, especially given that memcg only had limits with similar semantics for
quite a while at the beginning.
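
For reference, that flow maps onto the existing misc controller interfaces
roughly like this (a sketch only: MISC_CG_RES_SGX_EPC would be a new key, and
the retry bound and the call into the SGX reclaimer are illustrative):

#include <linux/misc_cgroup.h>
#include <linux/mm.h>

void sgx_reclaim_pages(void);	/* existing SGX reclaimer entry point */

static int sgx_epc_misc_try_charge(struct misc_cg **cg_ret)
{
	struct misc_cg *cg = get_current_misc_cg();
	int attempts = 5;	/* arbitrary bound for the sketch */

	/* misc_cg_try_charge() fails once the key's max would be exceeded. */
	while (misc_cg_try_charge(MISC_CG_RES_SGX_EPC, cg, PAGE_SIZE)) {
		if (!--attempts) {
			put_misc_cg(cg);
			return -ENOMEM;	/* reclaim failed "hard enough" */
		}
		sgx_reclaim_pages();	/* try to make room, then retry */
	}

	*cg_ret = cg;	/* uncharge and put when the page is freed */
	return 0;
}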

That way, we avoid creating big interface commitments while providing a
feature which should be able to serve and test out the immediate use cases.
If, for some reason, many of us end up running hefty applications in SGX, we
can revisit the issue and build up something more complete with provisions
for backward compatibility.

Thanks.
Kristen Carlson Accardi Sept. 26, 2022, 6:30 p.m. UTC | #7
On Fri, 2022-09-23 at 14:09 -1000, Tejun Heo wrote:
> Hello,
> 
> On Thu, Sep 22, 2022 at 02:03:52PM -0700, Dave Hansen wrote:
> > On 9/22/22 12:08, Tejun Heo wrote:
> > > Can you please give more concrete examples? I'd love to hear how
> > > the SGX EPC
> > > memory is typically used in what amounts and what's the
> > > performance
> > > implications when they get reclaimed and so on. ie. Please
> > > describe a
> > > realistic usage scenario of contention with sufficient details on
> > > how the
> > > system is set up, what the applications are using the SGX EPC
> > > memory for and
> > > how much, how the contention on memory affects the users and so
> > > on.
> > 
> > One wrinkle is that the apps that use SGX EPC memory are *normal*
> > apps.
> >  There are frameworks that some folks are very excited about that
> > allow
> > you to run mostly unmodified app stacks inside SGX.  For example:
> > 
> >         https://github.com/gramineproject/graphene
> > 
> > In fact, Gramine users are the troublesome ones for overcommit. 
> > Most
> > explicitly-written SGX applications are quite austere in their SGX
> > memory use; they're probably never going to see overcommit.  These
> > Gramine-wrapped apps are (relative) pigs.  They've been the ones
> > finding
> > bugs in the existing SGX overcommit code.
> > 
> > So, where does all the SGX memory go?  It's the usual suspects:
> > memcached and redis. ;)
> 
> Hey, so, I'm a bit weary that this doesn't seem to have a strong
> demand at
> this point. When there's clear shared demand, I usually hear from
> multiple
> parties about their use cases and the practical problems they're
> trying to
> solve and so on. This, at least to me, seems primarily driven by
> producers
> than consumers.
> 
> There's nothing wrong with projecting future usages and jumping ahead
> the
> curve but there's a balance to hit, and going full-on memcg-style
> controller
> with three control knobs seems to be jumping the gun and may create
> commitments which we end up looking back on with a bit of regret.
> 
> Given that, how about this? We can easily add the functionality of
> .max
> through the misc controller. Add a new key there, trycharge when
> allocating
> new memory, if fails, try reclaim and then fail allocation if reclaim
> fails
> hard enough. I belive that should give at least a reasonable place to
> start
> especially given that memcg only had limits with similar semantics
> for quite
> a while at the beginning.
> 
> That way, we avoid creating a big interface commitments while
> providing a
> feature which should be able to serve and test out the immediate
> usecases.
> If, for some reason, many of us end up running hefty applications in
> SGX, we
> can revisit the issue and build up something more complete with
> provisions
> for backward compatibility.
> 
> Thanks.
> 

Hi Tejun,

Thanks for your suggestion. Let me discuss this with the customers who
requested this feature (not all customers like to respond publicly)
and see if it will meet their needs. If there is an issue, I'll follow up
with concerns.

Thanks,
Kristen
Kristen Carlson Accardi Oct. 7, 2022, 4:39 p.m. UTC | #8
On Fri, 2022-09-23 at 14:09 -1000, Tejun Heo wrote:
<snip>

> 
> Given that, how about this? We can easily add the functionality of
> .max
> through the misc controller. Add a new key there, trycharge when
> allocating
> new memory, if fails, try reclaim and then fail allocation if reclaim
> fails
> hard enough. I belive that should give at least a reasonable place to
> start
> especially given that memcg only had limits with similar semantics
> for quite
> a while at the beginning.
> 

Hi Tejun,
I'm playing with the misc controller to see if I can make it do what I
need to do, and I had a question for you. Is there a way to easily get
notified when there are writes to the "max" file? For example, in my
full controller implementation, if a max value is written, the current
EPC usage for that cgroup is immediately examined. If that usage is
over the new value of max, then the reclaimer will reclaim from that
particular cgroup to get it under the max. If it is not possible to
reclaim enough to get it under the max, enclaves will be killed so that
all their EPC pages can be released and usage brought under the max value.
With the misc controller, I haven't been able to find a way to easily
react to a change in the max value. Am I missing something?

Thanks,
Kristen
Tejun Heo Oct. 7, 2022, 4:42 p.m. UTC | #9
Hello, Kristen.

On Fri, Oct 07, 2022 at 09:39:40AM -0700, Kristen Carlson Accardi wrote:
...
> With the misc controller, i haven't been able to find a way to easily
> react to a change in the max value. Am I missing something?

There isn't currently but it should be possible to add per-key notifiers,
right?

Thanks.
Kristen Carlson Accardi Oct. 7, 2022, 4:46 p.m. UTC | #10
On Fri, 2022-10-07 at 06:42 -1000, Tejun Heo wrote:
> Hello, Kristen.
> 
> On Fri, Oct 07, 2022 at 09:39:40AM -0700, Kristen Carlson Accardi
> wrote:
> ...
> > With the misc controller, i haven't been able to find a way to
> > easily
> > react to a change in the max value. Am I missing something?
> 
> There isn't currently but it should be possible to add per-key
> notifiers,
> right?
> 
> Thanks.
> 

OK - yes, I will include a modification to the misc controller for the
functionality I need in my patchset.