
[1/2] x86/sgx: Add accounting for tracking overcommit

Message ID 20211220174640.7542-2-kristen@linux.intel.com (mailing list archive)
State New, archived
Series x86/sgx: Limit EPC overcommit

Commit Message

Kristen Carlson Accardi Dec. 20, 2021, 5:46 p.m. UTC
When the system runs out of enclave memory, SGX can reclaim EPC pages
by swapping to normal RAM. This normal RAM is allocated via a
per-enclave shared memory area. The shared memory area is not mapped
into the enclave or the task mapping it, which makes its memory use
opaque (including to the OOM killer). Having lots of hard to find
memory around is problematic, especially when there is no limit.

Introduce a module parameter and a global counter that can be used to limit
the number of pages that enclaves are able to consume for backing storage.
This parameter is a percentage value that is used in conjunction with the
number of EPC pages in the system to set a cap on the amount of backing
RAM that can be consumed.

The default for this value is 100, which limits the total number of
shared memory pages that may be consumed by all enclaves as backing pages
to the number of EPC pages on the system.

For example, on an SGX system that has 128MB of EPC, this default would
cap the amount of normal RAM that SGX consumes for its shared memory
areas at 128MB.

If sgx.overcommit_percent is set to a negative value (such as -1),
SGX will not place any limits on the amount of overcommit that might
be requested, and SGX will behave as it has previously without the
overcommit_percent limit.

SGX may not be built as a module, but the module parameter interface
is used in order to provide a convenient interface.

The SGX overcommit_percent works differently than the core VM overcommit
limit. Enclaves request backing pages one page at a time, and the number
of in use backing pages that are allowed is a global resource that is
limited for all enclaves.

Introduce a pair of functions which can be used by callers when requesting
backing RAM pages. These functions are responsible for accounting the
page charges. A request may return an error if the request will cause the
counter to exceed the backing page cap.

Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
---
 .../admin-guide/kernel-parameters.txt         |  7 ++
 Documentation/x86/sgx.rst                     | 16 ++++-
 arch/x86/kernel/cpu/sgx/Makefile              |  6 +-
 arch/x86/kernel/cpu/sgx/main.c                | 64 +++++++++++++++++++
 arch/x86/kernel/cpu/sgx/sgx.h                 |  2 +
 5 files changed, 93 insertions(+), 2 deletions(-)

Comments

Borislav Petkov Dec. 20, 2021, 7:30 p.m. UTC | #1
On Mon, Dec 20, 2021 at 09:46:39AM -0800, Kristen Carlson Accardi wrote:
>  Similar to the core kswapd, ksgxd, is responsible for managing the
>  overcommitment of enclave memory.  If the system runs out of enclave memory,
> -*ksgxd* “swaps” enclave memory to normal memory.
> +*ksgxd* “swaps” enclave memory to normal RAM. This normal RAM is allocated
> +via per enclave shared memory. The shared memory area is not mapped into the
> +enclave or the task mapping it, which makes its memory use opaque - including
> +to the system out of memory killer (OOM). This can be problematic when there
> +are no limits in place on the amount an enclave can allocate.

Problematic how?

The commit message above is talking about what your patch does and that
is kinda clear from the diff. The *why* is really missing. Only that
allusion that it might be problematic in some cases but that's not even
scratching the surface.

> +At boot time, the module parameter "sgx.overcommit_percent" can be used to
> +place a limit on the number of shared memory backing pages that may be
> +allocated, expressed as a percentage of the total number of EPC pages in the
> +system.  A value of 100 is the default, and represents a limit equal to the
> +number of EPC pages in the system. To disable the limit, set
> +sgx.overcommit_percent to -1. The number of backing pages available to
> +enclaves is a global resource. If the system exceeds the number of allowed
> +backing pages in use, the reclaimer will be unable to swap EPC pages to
> +shared memory.

So you're basically putting the burden on the user/sysadmin to
*actually* *know* what percentage is "problematic" and to know what to
supply. I'd bet not very many people would know how much is problematic
and it probably all depends.

So why don't you come up with a sane default, instead, which works in
most cases and set it automatically?

Dunno, maybe some scaled percentage of memory depending on how many
enclaves are run but all up to a sane limit of, say, 90% of total memory
so that there are 10% left for normal system operation.

This way you'll avoid "problematic" and still have some memory left for
other use.

Or something like that.

Adding yet another knob is yuck and the easy way out. And we have waaaay
too many knobs so we should always try to do the automatic thing, if at
all possible.

Thx.
Kristen Carlson Accardi Dec. 20, 2021, 8:39 p.m. UTC | #2
Hi Boris, thanks for taking a look.


On Mon, 2021-12-20 at 20:30 +0100, Borislav Petkov wrote:
> On Mon, Dec 20, 2021 at 09:46:39AM -0800, Kristen Carlson Accardi
> wrote:
> >  Similar to the core kswapd, ksgxd, is responsible for managing the
> >  overcommitment of enclave memory.  If the system runs out of
> > enclave memory,
> > -*ksgxd* “swaps” enclave memory to normal memory.
> > +*ksgxd* “swaps” enclave memory to normal RAM. This normal RAM is
> > allocated
> > +via per enclave shared memory. The shared memory area is not
> > mapped into the
> > +enclave or the task mapping it, which makes its memory use opaque
> > - including
> > +to the system out of memory killer (OOM). This can be problematic
> > when there
> > +are no limits in place on the amount an enclave can allocate.
> 
> Problematic how?

If a malicious or just extra large enclave is loaded, or even just a
lot of enclaves, it can eat up all the normal RAM on the system. Normal
methods of finding out where all the memory on the system is being
used, wouldn't be able to find this usage since it is shared memory. In
addition, the OOM killer wouldn't be able to kill any enclaves.
 
> 
> The commit message above is talking about what your patch does and
> that
> is kinda clear from the diff. The *why* is really missing. Only that
> allusion that it might be problematic in some cases but that's not
> even
> scratching the surface.
> 
> > +At boot time, the module parameter "sgx.overcommit_percent" can be
> > used to
> > +place a limit on the number of shared memory backing pages that
> > may be
> > +allocated, expressed as a percentage of the total number of EPC
> > pages in the
> > +system.  A value of 100 is the default, and represents a limit
> > equal to the
> > +number of EPC pages in the system. To disable the limit, set
> > +sgx.overcommit_percent to -1. The number of backing pages
> > available to
> > +enclaves is a global resource. If the system exceeds the number of
> > allowed
> > +backing pages in use, the reclaimer will be unable to swap EPC
> > pages to
> > +shared memory.
> 
> So you're basically putting the burden on the user/sysadmin to
> *actually* *know* what percentage is "problematic" and to know what
> to
> supply. I'd bet not very many people would know how much is
> problematic
> and it probably all depends.
> 
> So why don't you come up with a sane default, instead, which works in
> most cases and set it automatically?

The default value is set to 100%, and this percentage, together with
the number of EPC pages in the system, is used to calculate a sane
value for the number of backing pages to allow - in this case, exactly
the number of EPC pages in the system. It is set automatically if the
parameter is not used to override it.

> 
> Dunno, maybe some scaled percentage of memory depending on how many
> enclaves are run but all up to a sane limit of, say, 90% of total
> memory
> so that there are 10% left for normal system operation.

I think in this case you are still making a judgement call for the
admin about how much memory you think they ought to be able to use for
non-SGX operations, and it feels more natural to me to have the default
based on how many backing pages you might reasonably need. I suppose
one could check the total system memory available and double check that
the calculated number of backing pages you'd give out would never
exceed some percentage of total system RAM, but then you wind up with 2
knobs to play with instead of just one, which seems wrong to me.

> 
> This way you'll avoid "problematic" and still have some memory left
> for
> other use.
> 
> Or something like that.
> 
> Adding yet another knob is yuck and the easy way out. And we have
> waaaay
> too many knobs so we should always try to do the automatic thing, if
> at
> all possible.

I completely agree - so I'm trying to make sure I understand this
comment, as the value currently has a default that is applied
automatically and is based on the EPC memory present, not a fixed
value. So I'd like to understand what you'd like to see done
differently. Are you saying you'd like to eliminate the ability to
override the automatic default? Or just that you'd rather calculate the
percentage based on total normal system RAM rather than EPC memory?

Thanks,
Kristen
Borislav Petkov Dec. 20, 2021, 9:11 p.m. UTC | #3
Bah, that thread is not on lkml. Please always Cc lkml in the future.

On Mon, Dec 20, 2021 at 12:39:19PM -0800, Kristen Carlson Accardi wrote:
> If a malicious or just extra large enclave is loaded, or even just a
> lot of enclaves, it can eat up all the normal RAM on the system. Normal
> methods of finding out where all the memory on the system is being
> used, wouldn't be able to find this usage since it is shared memory. In
> addition, the OOM killer wouldn't be able to kill any enclaves.

So you need some sort of limiting against malicious enclaves anyways,
regardless of this knob. IOW, you can set a percentage limit of
per-enclave memory which cannot exceed a certain amount which won't
prevent the system from its normal operation. For example.

> I completely agree - so I'm trying to make sure I understand this
> comment, as the value is currently set to default that would
> automatically apply that is based on EPC memory present and not a fixed
> value. So I'd like to understand what you'd like to see done
> differently. are you saying you'd like to eliminated the ability to
> override the automatic default? Or just that you'd rather calculate the
> percentage based on total normal system RAM rather than EPC memory?

So you say that there are cases where swapping to normal RAM can eat
up all RAM.

So the first heuristic should be: do not allow for *all* enclaves
together to use up more than X percent of normal RAM during EPC reclaim.

X percent being, say, 90% of all RAM. For example. I guess 10% should
be enough for normal operation but someone who's more knowledgeable in
system limits could chime in here.

Then, define a per-enclave limit which says, an enclave can use Y % of
memory for swapping when trying to reclaim EPC memory. And that can be
something like:

	90 % RAM
	--------
	total amount of enclaves currently on the system

And you can obviously create scenarios where creating too many enclaves
can bring the system into a situation where it doesn't do any forward
progress.

But you probably can cause the same with overcommitting with VMs so
perhaps it would be a good idea to look how overcommitting VMs and
limits there are handled.

Bottom line is: the logic should be for the most common cases to
function properly, out-of-the-box, without knobs. And then to keep the
system operational by preventing enclaves from bringing it down to a
halt just by doing EPC reclaim.

Does that make more sense?

Thx.
Kristen Carlson Accardi Dec. 20, 2021, 9:35 p.m. UTC | #4
On Mon, 2021-12-20 at 22:11 +0100, Borislav Petkov wrote:
> Bah, that thread is not on lkml. Please always Cc lkml in the future.
> 
> On Mon, Dec 20, 2021 at 12:39:19PM -0800, Kristen Carlson Accardi
> wrote:
> > If a malicious or just extra large enclave is loaded, or even just
> > a
> > lot of enclaves, it can eat up all the normal RAM on the system.
> > Normal
> > methods of finding out where all the memory on the system is being
> > used, wouldn't be able to find this usage since it is shared
> > memory. In
> > addition, the OOM killer wouldn't be able to kill any enclaves.
> 
> So you need some sort of limiting against malicious enclaves anyways,
> regardless of this knob. IOW, you can set a percentage limit of
> per-enclave memory which cannot exceed a certain amount which won't
> prevent the system from its normal operation. For example.
> 
> > I completely agree - so I'm trying to make sure I understand this
> > comment, as the value is currently set to default that would
> > automatically apply that is based on EPC memory present and not a
> > fixed
> > value. So I'd like to understand what you'd like to see done
> > differently. are you saying you'd like to eliminated the ability to
> > override the automatic default? Or just that you'd rather calculate
> > the
> > percentage based on total normal system RAM rather than EPC memory?
> 
> So you say that there are cases where swapping to normal RAM can eat
> up all RAM.
> 
> So the first heuristic should be: do not allow for *all* enclaves
> together to use up more than X percent of normal RAM during EPC
> reclaim.

So, in your proposal, you would first change the calculated number of
maximum available backing pages to be based on total system RAM rather
than EPC memory, got it.

> 
> X percent being, say, 90% of all RAM. For example. I guess 10% should
> be enough for normal operation but someone who's more knowledgeable
> in
> system limits could chime in here.
> 
> Then, define a per-enclave limit which says, an enclave can use Y %
> of
> memory for swapping when trying to reclaim EPC memory. And that can
> be
> something like:
> 
> 	90 % RAM
> 	--------
> 	total amount of enclaves currently on the system
> 

This would require recalculating the max number of allowed backing
pages per enclave at run time whenever a new enclave is loaded - but
all the existing enclaves may have already used more than the new max
number of per-enclave allowable pages. How would you handle that
scenario? This would add a lot of complexity for sure - and it does
make me wonder whether any additional benefit of limiting per enclave
would be worth it.

> And you can obviously create scenarios where creating too many
> enclaves
> can bring the system into a situation where it doesn't do any forward
> progress.
> 
> But you probably can cause the same with overcommitting with VMs so
> perhaps it would be a good idea to look how overcommitting VMs and
> limits there are handled.
> 
> Bottom line is: the logic should be for the most common cases to
> function properly, out-of-the-box, without knobs. And then to keep
> the
> system operational by preventing enclaves from bringing it down to a
> halt just by doing EPC reclaim.
> 
> Does that make more sense?
> 

Thanks for your more detailed explanation - I will take a look at the
VM overcommit limits. Since obviously the original implementation did
have a default value set, I had still a remaining specific question
about your comments. Are you suggesting that there should not be a way
to override any overcommit limit at all? So drop the parameter all
together?
Borislav Petkov Dec. 20, 2021, 10:48 p.m. UTC | #5
On Mon, Dec 20, 2021 at 01:35:03PM -0800, Kristen Carlson Accardi wrote:
> So, in your proposal you would first change the calculated number of
> maximum available backing pages to be based on total system RAM rather
> than EPC memory, got it.

This was just an example. My point is to try to make it automatic and
not introduce another knob. And to decide on the limits and behavior
by using common sense and addressing the common use cases first.

> This would require recalculating the max number of allowed backing
> pages per enclave at run time whenever a new enclave is loaded - but
> all the existing enclaves may have already used more than the new max
> number of per-enclave allowable pages. How would you handle that
> scenario? This would add a lot of complexity for sure - and it does
> make me wonder whether any additional benefit of limiting per enclave
> would be worth it.

See above - this was just an example. And as you've shown, an example of
what *not* to do.

> Thanks for your more detailed explanation - I will take a look at the
> VM overcommit limits. Since obviously the original implementation did
> have a default value set, I had still a remaining specific question
> about your comments. Are you suggesting that there should not be a way
> to override any overcommit limit at all? So drop the parameter all
> together?

So let me ask you a counter-question:

Imagine you're a sysadmin. Or a general, common system user if there
ever is one.

When your system starts thrashing and you're running a bunch of
enclaves, how do you find out there even is a knob which might
potentially help you?

And after you find it, how would you use that knob?

Or would you rather prefer that the system did the right thing for you
instead of having to figure out what the sensible values for that knob
would be?

My point is: put yourself in the user's shoes and try to think about
what would be the optimal thing the system should do.

Once that is cleanly and properly defined, then we can deal with knobs
and policies etc.

I sincerely hope that makes more sense.
Dave Hansen Dec. 21, 2021, 3:53 p.m. UTC | #6
On 12/20/21 2:48 PM, Borislav Petkov wrote:
> On Mon, Dec 20, 2021 at 01:35:03PM -0800, Kristen Carlson Accardi wrote:
>> So, in your proposal you would first change the calculated number of
>> maximum available backing pages to be based on total system RAM rather
>> than EPC memory, got it.
> 
> This was just an example. My point is to try to make it automatic and
> not introduce another knob. And to decide on the limits and behavior
> by using common sense and addressing the common use cases first.

The common case is clearly a few enclaves on systems where the
overcommit levels are modest.  The "100%" limit will work great there.

I'd personally be fine with just enforcing that limit as the default and
ignoring everything else.  It makes me a bit nervous, though, because
suddenly enforcing a limit is an ABI break.  The tunable effectively
gives us a way out if we screw up the limit's quantity or if someone
needs the old ABI back.

That said, we don't *need* it to be tunable, boot parameter or not.  If
you're concerned about the tunable, I think we should drop it and not
look back.

> Imagine you're a sysadmin. Or a general, common system user if there
> ever is one.
> 
> When your system starts thrashing and you're running a bunch of
> enclaves, how do you find out there even is a knob which might
> potentially help you?

I'm selfish.  The tunable isn't for end users.  It's for me.

The scenario I envisioned is that a user upgrades to a new kernel and
their enclaves start crashing.  They'll eventually find us, the
maintainers of the SGX code, and we'll have a tool as kernel developers
to help them.  The tunable helps _me_ in two ways:
1. It helps me easily get users back to pre-5.17 (or whatever) behavior
2. If we got the "100%" value wrong, end users can help us experiment to
   help find a better value.

BTW, all the chat about "malicious" enclaves and so forth...  I
*totally* agree that this is a problem and one worth solving.  It just
can't be solved today.  We need real cgroup support.  It's coming soon.
Dave Hansen Dec. 22, 2021, 2:21 p.m. UTC | #7
On 12/20/21 1:35 PM, Kristen Carlson Accardi wrote:
> Thanks for your more detailed explanation - I will take a look at the
> VM overcommit limits. Since obviously the original implementation did
> have a default value set, I had still a remaining specific question
> about your comments. Are you suggesting that there should not be a way
> to override any overcommit limit at all? So drop the parameter all
> together?

Yes, let's kill the exposed tunable.

We don't have any strong, practical need for it at the moment other than
paranoia.
Jarkko Sakkinen Dec. 28, 2021, 11:04 p.m. UTC | #8
On Mon, Dec 20, 2021 at 09:46:39AM -0800, Kristen Carlson Accardi wrote:
> When the system runs out of enclave memory, SGX can reclaim EPC pages
> by swapping to normal RAM. This normal RAM is allocated via a
> per-enclave shared memory area. The shared memory area is not mapped
> into the enclave or the task mapping it, which makes its memory use
> opaque (including to the OOM killer). Having lots of hard to find
> memory around is problematic, especially when there is no limit.
> 
> Introduce a module parameter and a global counter that can be used to limit
> the number of pages that enclaves are able to consume for backing storage.
> This parameter is a percentage value that is used in conjunction with the
> number of EPC pages in the system to set a cap on the amount of backing
> RAM that can be consumed.
> 
> The default for this value is 100, which limits the total number of
> shared memory pages that may be consumed by all enclaves as backing pages
> to the number of EPC pages on the system.
> 
> For example, on an SGX system that has 128MB of EPC, this default would
> cap the amount of normal RAM that SGX consumes for its shared memory
> areas at 128MB.
> 
> If sgx.overcommit_percent is set to a negative value (such as -1),
> SGX will not place any limits on the amount of overcommit that might
> be requested, and SGX will behave as it has previously without the
> overcommit_percent limit.
> 
> SGX may not be built as a module, but the module parameter interface
> is used in order to provide a convenient interface.
> 
> The SGX overcommit_percent works differently than the core VM overcommit
> limit. Enclaves request backing pages one page at a time, and the number
> of in use backing pages that are allowed is a global resource that is
> limited for all enclaves.
> 
> Introduce a pair of functions which can be used by callers when requesting
> backing RAM pages. These functions are responsible for accounting the
> page charges. A request may return an error if the request will cause the
> counter to exceed the backing page cap.
> 
> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> ---
>  .../admin-guide/kernel-parameters.txt         |  7 ++
>  Documentation/x86/sgx.rst                     | 16 ++++-
>  arch/x86/kernel/cpu/sgx/Makefile              |  6 +-
>  arch/x86/kernel/cpu/sgx/main.c                | 64 +++++++++++++++++++
>  arch/x86/kernel/cpu/sgx/sgx.h                 |  2 +
>  5 files changed, 93 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 9725c546a0d4..9d23c05a833b 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -5165,6 +5165,13 @@
>  
>  	serialnumber	[BUGS=X86-32]
>  
> +	sgx.overcommit_percent=	[X86-64,SGX]
> +			Limits the amount of normal RAM used for backing
> +			storage that may be allocate, expressed as a
> +			percentage of the total number of EPC pages in the
> +			system.
> +			See Documentation/x86/sgx.rst for more information.
> +
>  	shapers=	[NET]
>  			Maximal number of shapers.
>  
> diff --git a/Documentation/x86/sgx.rst b/Documentation/x86/sgx.rst
> index 265568a9292c..4f9a1c68be94 100644
> --- a/Documentation/x86/sgx.rst
> +++ b/Documentation/x86/sgx.rst
> @@ -147,7 +147,21 @@ Page reclaimer
>  
>  Similar to the core kswapd, ksgxd, is responsible for managing the
>  overcommitment of enclave memory.  If the system runs out of enclave memory,
> -*ksgxd* “swaps” enclave memory to normal memory.
> +*ksgxd* “swaps” enclave memory to normal RAM. This normal RAM is allocated
> +via per enclave shared memory. The shared memory area is not mapped into the
> +enclave or the task mapping it, which makes its memory use opaque - including
> +to the system out of memory killer (OOM). This can be problematic when there
> +are no limits in place on the amount an enclave can allocate.
> +
> +At boot time, the module parameter "sgx.overcommit_percent" can be used to
> +place a limit on the number of shared memory backing pages that may be
> +allocated, expressed as a percentage of the total number of EPC pages in the
> +system.  A value of 100 is the default, and represents a limit equal to the
> +number of EPC pages in the system. To disable the limit, set
> +sgx.overcommit_percent to -1. The number of backing pages available to
> +enclaves is a global resource. If the system exceeds the number of allowed
> +backing pages in use, the reclaimer will be unable to swap EPC pages to
> +shared memory.
>  
>  Launch Control
>  ==============
> diff --git a/arch/x86/kernel/cpu/sgx/Makefile b/arch/x86/kernel/cpu/sgx/Makefile
> index 9c1656779b2a..72f9192a43fe 100644
> --- a/arch/x86/kernel/cpu/sgx/Makefile
> +++ b/arch/x86/kernel/cpu/sgx/Makefile
> @@ -1,6 +1,10 @@
> -obj-y += \
> +# This allows sgx to have module namespace
> +obj-y += sgx.o
> +
> +sgx-y += \
>  	driver.o \
>  	encl.o \
>  	ioctl.o \
>  	main.o
> +
>  obj-$(CONFIG_X86_SGX_KVM)	+= virt.o
> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> index 2857a49f2335..c58ce9d9fd56 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -1,6 +1,7 @@
>  // SPDX-License-Identifier: GPL-2.0
>  /*  Copyright(c) 2016-20 Intel Corporation. */
>  
> +#include <linux/moduleparam.h>
>  #include <linux/file.h>
>  #include <linux/freezer.h>
>  #include <linux/highmem.h>
> @@ -43,6 +44,54 @@ static struct sgx_numa_node *sgx_numa_nodes;
>  
>  static LIST_HEAD(sgx_dirty_page_list);
>  
> +/*
> + * Limits the amount of normal RAM that SGX can consume for EPC
> + * overcommit to the total EPC pages * sgx_overcommit_percent / 100
> + */
> +static int sgx_overcommit_percent = 100;
> +module_param_named(overcommit_percent, sgx_overcommit_percent, int, 0440);
> +MODULE_PARM_DESC(overcommit_percent, "Percentage of overcommit of EPC pages.");
> +
> +/* The number of pages that can be allocated globally for backing storage. */
> +static atomic_long_t sgx_nr_available_backing_pages;
> +static bool sgx_disable_overcommit_tracking;

I don't like the use of word tracking as we already have ETRACK. I'd also
shorten the first global as "sgx_nr_backing_pages".

Couldn't you set "sgx_nr_backing_pages" to -1 when capping is disabled, and
then you would not need that bool in the first place?

> +
> +/**
> + * sgx_charge_mem() - charge for a page used for backing storage
> + *

Please remove this empty line:

https://www.kernel.org/doc/Documentation/kernel-doc-nano-HOWTO.txt

> + * Backing storage usage is capped by the sgx_nr_available_backing_pages.
> + * If the backing storage usage is over the overcommit limit,

Where does this verb "charge" come from?

/Jarkko
Dave Hansen Dec. 28, 2021, 11:34 p.m. UTC | #9
>> +/*
>> + * Limits the amount of normal RAM that SGX can consume for EPC
>> + * overcommit to the total EPC pages * sgx_overcommit_percent / 100
>> + */
>> +static int sgx_overcommit_percent = 100;
>> +module_param_named(overcommit_percent, sgx_overcommit_percent, int, 0440);
>> +MODULE_PARM_DESC(overcommit_percent, "Percentage of overcommit of EPC pages.");
>> +
>> +/* The number of pages that can be allocated globally for backing storage. */
>> +static atomic_long_t sgx_nr_available_backing_pages;
>> +static bool sgx_disable_overcommit_tracking;
> 
> I don't like the use of word tracking as we already have ETRACK.

I don't think anyone is going to confuse "overcommit tracking" with ETRACK.

That said, this *could* be "sgx_disable_overcommit_limits", I guess.

> I'd also shorten the first global as "sgx_nr_backing_pages".
That means something different from the variable, though.

"sgx_nr_backing_pages" would be the name for the current number of
backing pages which currently exist.

> Couldn't you set "sgx_nr_backing_pages" to -1 when capping is disabled, and
> then you would not need that bool in the first place?
> 
>> +
>> +/**
>> + * sgx_charge_mem() - charge for a page used for backing storage
>> + *
> 
> Please remove this empty line:
> 
> https://www.kernel.org/doc/Documentation/kernel-doc-nano-HOWTO.txt

That *might* make sense when there are arguments.  The arguments at
least help visually separate the short function description from the
full text description.
Kristen Carlson Accardi Jan. 6, 2022, 6:26 p.m. UTC | #10
Hi Jarkko, thanks for your review,

On Wed, 2021-12-29 at 01:04 +0200, Jarkko Sakkinen wrote:
> On Mon, Dec 20, 2021 at 09:46:39AM -0800, Kristen Carlson Accardi
> wrote:
> > 
> > +
> > +/**
> > + * sgx_charge_mem() - charge for a page used for backing storage
> > + *
> 
> Please remove this empty line:
> 
> https://www.kernel.org/doc/Documentation/kernel-doc-nano-HOWTO.txt

I read this to be that there should be an empty line after the short
description/arg list but before the longer description. I think for
functions without args this is the proper layout. It also is more
readable.

> 
> > + * Backing storage usage is capped by the
> > sgx_nr_available_backing_pages.
> > + * If the backing storage usage is over the overcommit limit,
> 
> Where does this verb "charge" come from?

Charge in this context means that some available backing pages are now
not available because they are in use. It feels appropriate to me to
use "charge/uncharge" verbs for this action, unless you think it's
confusing somehow.
Jarkko Sakkinen Jan. 7, 2022, 12:25 p.m. UTC | #11
On Thu, Jan 06, 2022 at 10:26:18AM -0800, Kristen Carlson Accardi wrote:
> Hi Jarkko, thanks for your review,
> 
> On Wed, 2021-12-29 at 01:04 +0200, Jarkko Sakkinen wrote:
> > On Mon, Dec 20, 2021 at 09:46:39AM -0800, Kristen Carlson Accardi
> > wrote:
> > > 
> > > +
> > > +/**
> > > + * sgx_charge_mem() - charge for a page used for backing storage
> > > + *
> > 
> > Please remove this empty line:
> > 
> > https://www.kernel.org/doc/Documentation/kernel-doc-nano-HOWTO.txt
> 
> I read this to be that there should be an empty line after the short
> description/arg list but before the longer description. I think for
> functions without args this is the proper layout. It also is more
> readable.
> 
> > 
> > > + * Backing storage usage is capped by the
> > > sgx_nr_available_backing_pages.
> > > + * If the backing storage usage is over the overcommit limit,
> > 
> > Where does this verb "charge" come from?
> 
> Charge in this context means that some available backing pages are now
> not available because they are in use. It feels appropriate to me to
> use "charge/uncharge" verbs for this action, unless you think it's
> confusing somehow.

OK, it's cool.

I'm still wondering why you need extra variable given that
sgx_nr_available_backing_pages is signed. You could mark the
feature being disabled by setting it to -1.

It might cosmetically improve readability to have a boolean but
for practical uses (e.g. eBPF tracing scripts) it only adds extra
complexity.

/Jarkko
Kristen Carlson Accardi Jan. 7, 2022, 5:17 p.m. UTC | #12
On Fri, 2022-01-07 at 14:25 +0200, Jarkko Sakkinen wrote:
> On Thu, Jan 06, 2022 at 10:26:18AM -0800, Kristen Carlson Accardi
> wrote:
> > Hi Jarkko, thanks for your review,
> > 
> > On Wed, 2021-12-29 at 01:04 +0200, Jarkko Sakkinen wrote:
> > > On Mon, Dec 20, 2021 at 09:46:39AM -0800, Kristen Carlson Accardi
> > > wrote:
> > > > +
> > > > +/**
> > > > + * sgx_charge_mem() - charge for a page used for backing
> > > > storage
> > > > + *
> > > 
> > > Please remove this empty line:
> > > 
> > > https://www.kernel.org/doc/Documentation/kernel-doc-nano-HOWTO.txt
> > 
> > I read this to be that there should be an empty line after the
> > short
> > description/arg list but before the longer description. I think for
> > functions without args this is the proper layout. It also is more
> > readable.
> > 
> > > > + * Backing storage usage is capped by the
> > > > sgx_nr_available_backing_pages.
> > > > + * If the backing storage usage is over the overcommit limit,
> > > 
> > > Where does this verb "charge" come from?
> > 
> > Charge in this context means that some available backing pages are
> > now
> > not available because they are in use. It feels appropriate to me
> > to
> > use "charge/uncharge" verbs for this action, unless you think it's
> > confusing somehow.
> 
> OK, it's cool.
> 
> I'm still wondering why you need an extra variable given that
> sgx_nr_available_backing_pages is signed. You could mark the
> feature being disabled by setting it to -1.
> 
> It might cosmetically improve readability to have a boolean but
> for practical uses (e.g. eBPF tracing scripts) it only adds extra
> complexity.
> 
> /Jarkko

I can see your point. Since Boris objected to the module param, the
next version will not have a way to disable limits at all, so I am
deleting that boolean all together.
Jarkko Sakkinen Jan. 8, 2022, 3:54 p.m. UTC | #13
On Fri, Jan 07, 2022 at 09:17:03AM -0800, Kristen Carlson Accardi wrote:
> On Fri, 2022-01-07 at 14:25 +0200, Jarkko Sakkinen wrote:
> > On Thu, Jan 06, 2022 at 10:26:18AM -0800, Kristen Carlson Accardi
> > wrote:
> > > Hi Jarkko, thanks for your review,
> > > 
> > > On Wed, 2021-12-29 at 01:04 +0200, Jarkko Sakkinen wrote:
> > > > On Mon, Dec 20, 2021 at 09:46:39AM -0800, Kristen Carlson Accardi
> > > > wrote:
> > > > > +
> > > > > +/**
> > > > > + * sgx_charge_mem() - charge for a page used for backing
> > > > > storage
> > > > > + *
> > > > 
> > > > Please remove this empty line:
> > > > 
> > > > https://www.kernel.org/doc/Documentation/kernel-doc-nano-HOWTO.txt
> > > 
> > > I read this to be that there should be an empty line after the
> > > short
> > > description/arg list but before the longer description. I think for
> > > functions without args this is the proper layout. It also is more
> > > readable.
> > > 
> > > > > + * Backing storage usage is capped by the
> > > > > sgx_nr_available_backing_pages.
> > > > > + * If the backing storage usage is over the overcommit limit,
> > > > 
> > > > Where does this verb "charge" come from?
> > > 
> > > Charge in this context means that some available backing pages are
> > > now
> > > not available because they are in use. It feels appropriate to me
> > > to
> > > use "charge/uncharge" verbs for this action, unless you think it's
> > > confusing somehow.
> > 
> > OK, it's cool.
> > 
> > I'm still wondering why you need an extra variable given that
> > sgx_nr_available_backing_pages is signed. You could mark the
> > feature being disabled by setting it to -1.
> > 
> > It might cosmetically improve readability to have a boolean but
> > for practical uses (e.g. eBPF tracing scripts) it only adds extra
> > complexity.
> > 
> > /Jarkko
> 
> I can see your point. Since Boris objected to the module param, the
> next version will not have a way to disable limits at all, so I am
> deleting that boolean all together.

Ok, cool, I'm looking forward to the next version.

/Jarkko

Patch

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 9725c546a0d4..9d23c05a833b 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5165,6 +5165,13 @@ 
 
 	serialnumber	[BUGS=X86-32]
 
+	sgx.overcommit_percent=	[X86-64,SGX]
+			Limits the amount of normal RAM that may be
+			allocated for SGX backing storage, expressed as a
+			percentage of the total number of EPC pages in the
+			system.
+			See Documentation/x86/sgx.rst for more information.
+
 	shapers=	[NET]
 			Maximal number of shapers.
 
diff --git a/Documentation/x86/sgx.rst b/Documentation/x86/sgx.rst
index 265568a9292c..4f9a1c68be94 100644
--- a/Documentation/x86/sgx.rst
+++ b/Documentation/x86/sgx.rst
@@ -147,7 +147,21 @@  Page reclaimer
 
 Similar to the core kswapd, ksgxd, is responsible for managing the
 overcommitment of enclave memory.  If the system runs out of enclave memory,
-*ksgxd* “swaps” enclave memory to normal memory.
+*ksgxd* “swaps” enclave memory to normal RAM. This normal RAM is allocated
+via per-enclave shared memory. The shared memory area is not mapped into the
+enclave or the task mapping it, which makes its memory use opaque, including
+to the out-of-memory (OOM) killer. This can be problematic when there
+are no limits in place on the amount an enclave can allocate.
+
+At boot time, the module parameter "sgx.overcommit_percent" can be used to
+place a limit on the number of shared memory backing pages that may be
+allocated, expressed as a percentage of the total number of EPC pages in the
+system.  A value of 100 is the default, and represents a limit equal to the
+number of EPC pages in the system. To disable the limit, set
+sgx.overcommit_percent to -1. The number of backing pages available to
+enclaves is a global resource. If the system exceeds the number of allowed
+backing pages in use, the reclaimer will be unable to swap EPC pages to
+shared memory.
 
 Launch Control
 ==============
diff --git a/arch/x86/kernel/cpu/sgx/Makefile b/arch/x86/kernel/cpu/sgx/Makefile
index 9c1656779b2a..72f9192a43fe 100644
--- a/arch/x86/kernel/cpu/sgx/Makefile
+++ b/arch/x86/kernel/cpu/sgx/Makefile
@@ -1,6 +1,10 @@ 
-obj-y += \
+# This allows sgx to have a module namespace
+obj-y += sgx.o
+
+sgx-y += \
 	driver.o \
 	encl.o \
 	ioctl.o \
 	main.o
+
 obj-$(CONFIG_X86_SGX_KVM)	+= virt.o
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 2857a49f2335..c58ce9d9fd56 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -1,6 +1,7 @@ 
 // SPDX-License-Identifier: GPL-2.0
 /*  Copyright(c) 2016-20 Intel Corporation. */
 
+#include <linux/moduleparam.h>
 #include <linux/file.h>
 #include <linux/freezer.h>
 #include <linux/highmem.h>
@@ -43,6 +44,54 @@  static struct sgx_numa_node *sgx_numa_nodes;
 
 static LIST_HEAD(sgx_dirty_page_list);
 
+/*
+ * Limits the amount of normal RAM that SGX can consume for EPC
+ * overcommit to the total EPC pages * sgx_overcommit_percent / 100
+ */
+static int sgx_overcommit_percent = 100;
+module_param_named(overcommit_percent, sgx_overcommit_percent, int, 0440);
+MODULE_PARM_DESC(overcommit_percent, "Percentage of overcommit of EPC pages.");
+
+/* The number of pages that can be allocated globally for backing storage. */
+static atomic_long_t sgx_nr_available_backing_pages;
+static bool sgx_disable_overcommit_tracking;
+
+/**
+ * sgx_charge_mem() - charge for a page used for backing storage
+ *
+ * Backing storage usage is capped by the sgx_nr_available_backing_pages.
+ * If the backing storage usage is over the overcommit limit,
+ * return an error.
+ *
+ * Return:
+ * 0:		The page requested does not exceed the limit
+ * -ENOMEM:	The page requested exceeds the overcommit limit
+ */
+int sgx_charge_mem(void)
+{
+	if (sgx_disable_overcommit_tracking)
+		return 0;
+
+	if (!atomic_long_add_unless(&sgx_nr_available_backing_pages, -1, 0))
+		return -ENOMEM;
+
+	return 0;
+}
+
+/**
+ * sgx_uncharge_mem() - uncharge a page previously used for backing storage
+ *
+ * When backing storage is no longer in use, increment the
+ * sgx_nr_available_backing_pages counter.
+ */
+void sgx_uncharge_mem(void)
+{
+	if (sgx_disable_overcommit_tracking)
+		return;
+
+	atomic_long_inc(&sgx_nr_available_backing_pages);
+}
+
 /*
  * Reset post-kexec EPC pages to the uninitialized state. The pages are removed
  * from the input list, and made available for the page allocator. SECS pages
@@ -786,6 +835,7 @@  static bool __init sgx_page_cache_init(void)
 	u64 pa, size;
 	int nid;
 	int i;
+	u64 total_epc_bytes = 0;
 
 	sgx_numa_nodes = kmalloc_array(num_possible_nodes(), sizeof(*sgx_numa_nodes), GFP_KERNEL);
 	if (!sgx_numa_nodes)
@@ -830,6 +880,7 @@  static bool __init sgx_page_cache_init(void)
 
 		sgx_epc_sections[i].node =  &sgx_numa_nodes[nid];
 		sgx_numa_nodes[nid].size += size;
+		total_epc_bytes += size;
 
 		sgx_nr_epc_sections++;
 	}
@@ -839,6 +890,19 @@  static bool __init sgx_page_cache_init(void)
 		return false;
 	}
 
+	if (sgx_overcommit_percent >= 0) {
+		u64 available_backing_bytes;
+
+		available_backing_bytes =
+			total_epc_bytes * sgx_overcommit_percent / 100;
+
+		atomic_long_set(&sgx_nr_available_backing_pages,
+				available_backing_bytes >> PAGE_SHIFT);
+	} else {
+		pr_info("Disabling overcommit limit.\n");
+		sgx_disable_overcommit_tracking = true;
+	}
+
 	return true;
 }
 
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index 0f17def9fe6f..3507a9983fc1 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -89,6 +89,8 @@  void sgx_free_epc_page(struct sgx_epc_page *page);
 void sgx_mark_page_reclaimable(struct sgx_epc_page *page);
 int sgx_unmark_page_reclaimable(struct sgx_epc_page *page);
 struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim);
+int sgx_charge_mem(void);
+void sgx_uncharge_mem(void);
 
 #ifdef CONFIG_X86_SGX_KVM
 int __init sgx_vepc_init(void);