| Message ID | 20211220174640.7542-2-kristen@linux.intel.com (mailing list archive) |
|---|---|
| State | New, archived |
| Series | x86/sgx: Limit EPC overcommit |
On Mon, Dec 20, 2021 at 09:46:39AM -0800, Kristen Carlson Accardi wrote: > Similar to the core kswapd, ksgxd, is responsible for managing the > overcommitment of enclave memory. If the system runs out of enclave memory, > -*ksgxd* “swaps” enclave memory to normal memory. > +*ksgxd* “swaps” enclave memory to normal RAM. This normal RAM is allocated > +via per enclave shared memory. The shared memory area is not mapped into the > +enclave or the task mapping it, which makes its memory use opaque - including > +to the system out of memory killer (OOM). This can be problematic when there > +are no limits in place on the amount an enclave can allocate. Problematic how? The commit message above is talking about what your patch does and that is kinda clear from the diff. The *why* is really missing. Only that allusion that it might be problematic in some cases but that's not even scratching the surface. > +At boot time, the module parameter "sgx.overcommit_percent" can be used to > +place a limit on the number of shared memory backing pages that may be > +allocated, expressed as a percentage of the total number of EPC pages in the > +system. A value of 100 is the default, and represents a limit equal to the > +number of EPC pages in the system. To disable the limit, set > +sgx.overcommit_percent to -1. The number of backing pages available to > +enclaves is a global resource. If the system exceeds the number of allowed > +backing pages in use, the reclaimer will be unable to swap EPC pages to > +shared memory. So you're basically putting the burden on the user/sysadmin to *actually* *know* what percentage is "problematic" and to know what to supply. I'd bet not very many people would know how much is problematic and it probably all depends. So why don't you come up with a sane default, instead, which works in most cases and set it automatically? 
Dunno, maybe some scaled percentage of memory depending on how many enclaves are run but all up to a sane limit of, say, 90% of total memory so that there are 10% left for normal system operation. This way you'll avoid "problematic" and still have some memory left for other use. Or something like that. Adding yet another knob is yuck and the easy way out. And we have waaaay too many knobs so we should always try to do the automatic thing, if at all possible. Thx.
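The cap described in the patch is straightforward arithmetic; a minimal userspace sketch of that calculation (the function name is illustrative, not the kernel's actual code) follows. A negative percent disables the limit, otherwise the cap is a percentage of the EPC page count:

```c
#include <assert.h>

/* Illustrative sketch of the backing-page cap described above:
 * -1 means "no limit", otherwise the cap is nr_epc_pages scaled
 * by the overcommit percentage. */
long sgx_backing_cap_pages(long nr_epc_pages, int overcommit_percent)
{
	if (overcommit_percent < 0)
		return -1;	/* limit disabled */

	return nr_epc_pages * overcommit_percent / 100;
}
```

For a 128MB EPC system (32768 4KiB pages), the default of 100 yields a cap of exactly 32768 backing pages.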
Hi Boris, thanks for taking a look. On Mon, 2021-12-20 at 20:30 +0100, Borislav Petkov wrote: > On Mon, Dec 20, 2021 at 09:46:39AM -0800, Kristen Carlson Accardi > wrote: > > Similar to the core kswapd, ksgxd, is responsible for managing the > > overcommitment of enclave memory. If the system runs out of > > enclave memory, > > -*ksgxd* “swaps” enclave memory to normal memory. > > +*ksgxd* “swaps” enclave memory to normal RAM. This normal RAM is > > allocated > > +via per enclave shared memory. The shared memory area is not > > mapped into the > > +enclave or the task mapping it, which makes its memory use opaque > > - including > > +to the system out of memory killer (OOM). This can be problematic > > when there > > +are no limits in place on the amount an enclave can allocate. > > Problematic how? If a malicious or just extra large enclave is loaded, or even just a lot of enclaves, it can eat up all the normal RAM on the system. Normal methods of finding out where all the memory on the system is being used wouldn't be able to find this usage since it is shared memory. In addition, the OOM killer wouldn't be able to kill any enclaves. > > The commit message above is talking about what your patch does and > that > is kinda clear from the diff. The *why* is really missing. Only that > allusion that it might be problematic in some cases but that's not > even > scratching the surface. > > > +At boot time, the module parameter "sgx.overcommit_percent" can be > > used to > > +place a limit on the number of shared memory backing pages that > > may be > > +allocated, expressed as a percentage of the total number of EPC > > pages in the > > +system. A value of 100 is the default, and represents a limit > > equal to the > > +number of EPC pages in the system. To disable the limit, set > > +sgx.overcommit_percent to -1. The number of backing pages > > available to > > +enclaves is a global resource. 
If the system exceeds the number of > > allowed > > +backing pages in use, the reclaimer will be unable to swap EPC > > pages to > > +shared memory. > > So you're basically putting the burden on the user/sysadmin to > *actually* *know* what percentage is "problematic" and to know what > to > supply. I'd bet not very many people would know how much is > problematic > and it probably all depends. > > So why don't you come up with a sane default, instead, which works in > most cases and set it automatically? The default value is set to 100%, and this percentage, plus the number of EPC pages in the system is used to calculate a sane value for the number of backing pages to add - in this case, exactly the number of EPC pages in the system. It is set automatically if the parameter is not used to override it. > > Dunno, maybe some scaled percentage of memory depending on how many > enclaves are run but all up to a sane limit of, say, 90% of total > memory > so that there are 10% left for normal system operation. I think in this case you are still making a judgement call for the admin about how much memory you think they ought to be able to use for non-SGX operations, and it feels more natural to me to have the default based on how many backing pages you might reasonably need. I suppose one could check the total system memory available and double check that the calculated number of backing pages you'd give out would never exceed some percentage of total system RAM, but then you wind up with 2 knobs to play with instead of just one, which seems wrong to me. > > This way you'll avoid "problematic" and still have some memory left > for > other use. > > Or something like that. > > Adding yet another knob is yuck and the easy way out. And we have > waaaay > too many knobs so we should always try to do the automatic thing, if > at > all possible. 
I completely agree - so I'm trying to make sure I understand this comment, as the value is currently set to a default that is applied automatically and is based on the EPC memory present rather than a fixed value. So I'd like to understand what you'd like to see done differently. Are you saying you'd like to eliminate the ability to override the automatic default? Or just that you'd rather calculate the percentage based on total normal system RAM rather than EPC memory? Thanks, Kristen
Bah, that thread is not on lkml. Please always Cc lkml in the future. On Mon, Dec 20, 2021 at 12:39:19PM -0800, Kristen Carlson Accardi wrote: > If a malicious or just extra large enclave is loaded, or even just a > lot of enclaves, it can eat up all the normal RAM on the system. Normal > methods of finding out where all the memory on the system is being > used, wouldn't be able to find this usage since it is shared memory. In > addition, the OOM killer wouldn't be able to kill any enclaves. So you need some sort of limiting against malicious enclaves anyways, regardless of this knob. IOW, you can set a percentage limit of per-enclave memory which cannot exceed a certain amount which won't prevent the system from its normal operation. For example. > I completely agree - so I'm trying to make sure I understand this > comment, as the value is currently set to default that would > automatically apply that is based on EPC memory present and not a fixed > value. So I'd like to understand what you'd like to see done > differently. are you saying you'd like to eliminated the ability to > override the automatic default? Or just that you'd rather calculate the > percentage based on total normal system RAM rather than EPC memory? So you say that there are cases where swapping to normal RAM can eat up all RAM. So the first heuristic should be: do not allow for *all* enclaves together to use up more than X percent of normal RAM during EPC reclaim. X percent being, say, 90% of all RAM. For example. I guess 10% should be enough for normal operation but someone who's more knowledgeable in system limits could chime in here. Then, define a per-enclave limit which says, an enclave can use Y % of memory for swapping when trying to reclaim EPC memory. 
And that can be something like: 90 % RAM -------- total amount of enclaves currently on the system And you can obviously create scenarios where creating too many enclaves can bring the system into a situation where it doesn't do any forward progress. But you probably can cause the same with overcommitting with VMs so perhaps it would be a good idea to look how overcommitting VMs and limits there are handled. Bottom line is: the logic should be for the most common cases to function properly, out-of-the-box, without knobs. And then to keep the system operational by preventing enclaves from bringing it down to a halt just by doing EPC reclaim. Does that make more sense? Thx.
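The heuristic Boris sketches above (reserve 10% of RAM for normal operation, split the remaining 90% across enclaves) could look roughly like this in userspace; this is purely illustrative of his example, not a proposed implementation:

```c
#include <assert.h>

/* Sketch of the per-enclave swap limit suggested above: keep 10% of
 * RAM free for normal operation and divide the remaining 90% evenly
 * among the enclaves currently on the system. */
unsigned long per_enclave_swap_limit(unsigned long total_ram_pages,
				     unsigned long nr_enclaves)
{
	unsigned long usable = total_ram_pages * 90 / 100;

	if (!nr_enclaves)
		return usable;

	return usable / nr_enclaves;
}
```

As the follow-up notes, the limit would shrink each time an enclave is loaded, which raises the question of what to do with enclaves that already exceed the recomputed value.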
On Mon, 2021-12-20 at 22:11 +0100, Borislav Petkov wrote: > Bah, that thread is not on lkml. Please always Cc lkml in the future. > > On Mon, Dec 20, 2021 at 12:39:19PM -0800, Kristen Carlson Accardi > wrote: > > If a malicious or just extra large enclave is loaded, or even just > > a > > lot of enclaves, it can eat up all the normal RAM on the system. > > Normal > > methods of finding out where all the memory on the system is being > > used, wouldn't be able to find this usage since it is shared > > memory. In > > addition, the OOM killer wouldn't be able to kill any enclaves. > > So you need some sort of limiting against malicious enclaves anyways, > regardless of this knob. IOW, you can set a percentage limit of > per-enclave memory which cannot exceed a certain amount which won't > prevent the system from its normal operation. For example. > > > I completely agree - so I'm trying to make sure I understand this > > comment, as the value is currently set to default that would > > automatically apply that is based on EPC memory present and not a > > fixed > > value. So I'd like to understand what you'd like to see done > > differently. are you saying you'd like to eliminated the ability to > > override the automatic default? Or just that you'd rather calculate > > the > > percentage based on total normal system RAM rather than EPC memory? > > So you say that there are cases where swapping to normal RAM can eat > up all RAM. > > So the first heuristic should be: do not allow for *all* enclaves > together to use up more than X percent of normal RAM during EPC > reclaim. So, in your proposal, you would first change the calculated number of maximum available backing pages to be based on total system RAM rather than EPC memory, got it. > > X percent being, say, 90% of all RAM. For example. I guess 10% should > be enough for normal operation but someone who's more knowledgeable > in > system limits could chime in here. 
> > Then, define a per-enclave limit which says, an enclave can use Y % > of > memory for swapping when trying to reclaim EPC memory. And that can > be > something like: > > 90 % RAM > -------- > total amount of enclaves currently on the system > This would require recalculating the max number of allowed backing pages per enclave at run time whenever a new enclave is loaded - but all the existing enclaves may have already used more than the new max number of per-enclave allowable pages. How would you handle that scenario? This would add a lot of complexity for sure - and it does make me wonder whether any additional benefit of limiting per enclave would be worth it. > And you can obviously create scenarios where creating too many > enclaves > can bring the system into a situation where it doesn't do any forward > progress. > > But you probably can cause the same with overcommitting with VMs so > perhaps it would be a good idea to look how overcommitting VMs and > limits there are handled. > > Bottom line is: the logic should be for the most common cases to > function properly, out-of-the-box, without knobs. And then to keep > the > system operational by preventing enclaves from bringing it down to a > halt just by doing EPC reclaim. > > Does that make more sense? > Thanks for your more detailed explanation - I will take a look at the VM overcommit limits. Since obviously the original implementation did have a default value set, I had still a remaining specific question about your comments. Are you suggesting that there should not be a way to override any overcommit limit at all? So drop the parameter all together?
On Mon, Dec 20, 2021 at 01:35:03PM -0800, Kristen Carlson Accardi wrote: > So, in your proposal you would first change the calculated number of > maximum available backing pages to be based on total system RAM rather > than EPC memory, got it. This was just an example. My point is to try to make it automatic and not introduce another knob. And to decide on the limits and behavior by using common sense and addressing the common use cases first. > This would require recalculating the max number of allowed backing > pages per enclave at run time whenever a new enclave is loaded - but > all the existing enclaves may have already used more than the new max > number of per-enclave allowable pages. How would you handle that > scenario? This would add a lot of complexity for sure - and it does > make me wonder whether any additional benefit of limiting per enclave > would be worth it. See above - this was just an example. And as you've shown, an example of what *not* to do. > Thanks for your more detailed explanation - I will take a look at the > VM overcommit limits. Since obviously the original implementation did > have a default value set, I had still a remaining specific question > about your comments. Are you suggesting that there should not be a way > to override any overcommit limit at all? So drop the parameter all > together? So let me ask you a counter-question: Imagine you're a sysadmin. Or a general, common system user if there ever is one. When your system starts thrashing and you're running a bunch of enclaves, how do you find out there even is a knob which might potentially help you? And after you find it, how would you use that knob? Or would you rather prefer that the system did the right thing for you instead of having to figure out what the sensible values for that knob would be? My point is: put yourself in the user's shoes and try to think about what would be the optimal thing the system should do. 
Once that is cleanly and properly defined, then we can deal with knobs and policies etc. I sincerely hope that makes more sense.
On 12/20/21 2:48 PM, Borislav Petkov wrote: > On Mon, Dec 20, 2021 at 01:35:03PM -0800, Kristen Carlson Accardi wrote: >> So, in your proposal you would first change the calculated number of >> maximum available backing pages to be based on total system RAM rather >> than EPC memory, got it. > > This was just an example. My point is to try to make it automatic and > not introduce another knob. And to decide on the limits and behavior > by using common sense and addressing the common use cases first. The common case is clearly a few enclaves on systems where the overcommit levels are modest. The "100%" limit will work great there. I'd personally be fine with just enforcing that limit as the default and ignoring everything else. It makes me a bit nervous, though, because suddenly enforcing a limit is an ABI break. The tunable effectively gives us a way out if we screw up either the limit's quantity or someone needs the old ABI back. That said, we don't *need* it to be tunable, boot parameter or not. If you're concerned about the tunable, I think we should drop it and not look back. > Imagine you're a sysadmin. Or a general, common system user if there > ever is one. > > When your system starts thrashing and you're running a bunch of > enclaves, how do you find out there even is a knob which might > potentially help you? I'm selfish. The tunable isn't for end users. It's for me. The scenario I envisioned is that a user upgrades to a new kernel and their enclaves start crashing. They'll eventually find us, the maintainers of the SGX code, and we'll have a tool as kernel developers to help them. The tunable helps _me_ in two ways: 1. It helps me easily get users back to pre-5.17 (or whatever) behavior 2. If we got the "100%" value wrong, end users can help us experiment to find a better value. BTW, all the chat about "malicious" enclaves and so forth... I *totally* agree that this is a problem and one worth solving. It just can't be solved today. 
We need real cgroup support. It's coming soon.
On 12/20/21 1:35 PM, Kristen Carlson Accardi wrote: > Thanks for your more detailed explanation - I will take a look at the > VM overcommit limits. Since obviously the original implementation did > have a default value set, I had still a remaining specific question > about your comments. Are you suggesting that there should not be a way > to override any overcommit limit at all? So drop the parameter all > together? Yes, let's kill the exposed tunable. We don't have any strong, practical need for it at the moment other than paranoia.
On Mon, Dec 20, 2021 at 09:46:39AM -0800, Kristen Carlson Accardi wrote: > When the system runs out of enclave memory, SGX can reclaim EPC pages > by swapping to normal RAM. This normal RAM is allocated via a > per-enclave shared memory area. The shared memory area is not mapped > into the enclave or the task mapping it, which makes its memory use > opaque (including to the OOM killer). Having lots of hard to find > memory around is problematic, especially when there is no limit. > > Introduce a module parameter and a global counter that can be used to limit > the number of pages that enclaves are able to consume for backing storage. > This parameter is a percentage value that is used in conjunction with the > number of EPC pages in the system to set a cap on the amount of backing > RAM that can be consumed. > > The default for this value is 100, which limits the total number of > shared memory pages that may be consumed by all enclaves as backing pages > to the number of EPC pages on the system. > > For example, on an SGX system that has 128MB of EPC, this default would > cap the amount of normal RAM that SGX consumes for its shared memory > areas at 128MB. > > If sgx.overcommit_percent is set to a negative value (such as -1), > SGX will not place any limits on the amount of overcommit that might > be requested, and SGX will behave as it has previously without the > overcommit_percent limit. > > SGX may not be built as a module, but the module parameter interface > is used in order to provide a convenient interface. > > The SGX overcommit_percent works differently than the core VM overcommit > limit. Enclaves request backing pages one page at a time, and the number > of in use backing pages that are allowed is a global resource that is > limited for all enclaves. > > Introduce a pair of functions which can be used by callers when requesting > backing RAM pages. These functions are responsible for accounting the > page charges. 
A request may return an error if the request will cause the > counter to exceed the backing page cap. > > Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com> > --- > .../admin-guide/kernel-parameters.txt | 7 ++ > Documentation/x86/sgx.rst | 16 ++++- > arch/x86/kernel/cpu/sgx/Makefile | 6 +- > arch/x86/kernel/cpu/sgx/main.c | 64 +++++++++++++++++++ > arch/x86/kernel/cpu/sgx/sgx.h | 2 + > 5 files changed, 93 insertions(+), 2 deletions(-) > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt > index 9725c546a0d4..9d23c05a833b 100644 > --- a/Documentation/admin-guide/kernel-parameters.txt > +++ b/Documentation/admin-guide/kernel-parameters.txt > @@ -5165,6 +5165,13 @@ > > serialnumber [BUGS=X86-32] > > + sgx.overcommit_percent= [X86-64,SGX] > + Limits the amount of normal RAM used for backing > + storage that may be allocate, expressed as a > + percentage of the total number of EPC pages in the > + system. > + See Documentation/x86/sgx.rst for more information. > + > shapers= [NET] > Maximal number of shapers. > > diff --git a/Documentation/x86/sgx.rst b/Documentation/x86/sgx.rst > index 265568a9292c..4f9a1c68be94 100644 > --- a/Documentation/x86/sgx.rst > +++ b/Documentation/x86/sgx.rst > @@ -147,7 +147,21 @@ Page reclaimer > > Similar to the core kswapd, ksgxd, is responsible for managing the > overcommitment of enclave memory. If the system runs out of enclave memory, > -*ksgxd* “swaps” enclave memory to normal memory. > +*ksgxd* “swaps” enclave memory to normal RAM. This normal RAM is allocated > +via per enclave shared memory. The shared memory area is not mapped into the > +enclave or the task mapping it, which makes its memory use opaque - including > +to the system out of memory killer (OOM). This can be problematic when there > +are no limits in place on the amount an enclave can allocate. 
> + > +At boot time, the module parameter "sgx.overcommit_percent" can be used to > +place a limit on the number of shared memory backing pages that may be > +allocated, expressed as a percentage of the total number of EPC pages in the > +system. A value of 100 is the default, and represents a limit equal to the > +number of EPC pages in the system. To disable the limit, set > +sgx.overcommit_percent to -1. The number of backing pages available to > +enclaves is a global resource. If the system exceeds the number of allowed > +backing pages in use, the reclaimer will be unable to swap EPC pages to > +shared memory. > > Launch Control > ============== > diff --git a/arch/x86/kernel/cpu/sgx/Makefile b/arch/x86/kernel/cpu/sgx/Makefile > index 9c1656779b2a..72f9192a43fe 100644 > --- a/arch/x86/kernel/cpu/sgx/Makefile > +++ b/arch/x86/kernel/cpu/sgx/Makefile > @@ -1,6 +1,10 @@ > -obj-y += \ > +# This allows sgx to have module namespace > +obj-y += sgx.o > + > +sgx-y += \ > driver.o \ > encl.o \ > ioctl.o \ > main.o > + > obj-$(CONFIG_X86_SGX_KVM) += virt.o > diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c > index 2857a49f2335..c58ce9d9fd56 100644 > --- a/arch/x86/kernel/cpu/sgx/main.c > +++ b/arch/x86/kernel/cpu/sgx/main.c > @@ -1,6 +1,7 @@ > // SPDX-License-Identifier: GPL-2.0 > /* Copyright(c) 2016-20 Intel Corporation. 
*/ > > +#include <linux/moduleparam.h> > #include <linux/file.h> > #include <linux/freezer.h> > #include <linux/highmem.h> > @@ -43,6 +44,54 @@ static struct sgx_numa_node *sgx_numa_nodes; > > static LIST_HEAD(sgx_dirty_page_list); > > +/* > + * Limits the amount of normal RAM that SGX can consume for EPC > + * overcommit to the total EPC pages * sgx_overcommit_percent / 100 > + */ > +static int sgx_overcommit_percent = 100; > +module_param_named(overcommit_percent, sgx_overcommit_percent, int, 0440); > +MODULE_PARM_DESC(overcommit_percent, "Percentage of overcommit of EPC pages."); > + > +/* The number of pages that can be allocated globally for backing storage. */ > +static atomic_long_t sgx_nr_available_backing_pages; > +static bool sgx_disable_overcommit_tracking; I don't like the use of word tracking as we already have ETRACK. I'd also shorten the first global as "sgx_nr_backing_pages". Couldn't you set "sgx_nr_backing_pages" to -1 when capping is disabled, and then you would not need that bool in the first place? > + > +/** > + * sgx_charge_mem() - charge for a page used for backing storage > + * Please remove this empty line: https://www.kernel.org/doc/Documentation/kernel-doc-nano-HOWTO.txt > + * Backing storage usage is capped by the sgx_nr_available_backing_pages. > + * If the backing storage usage is over the overcommit limit, Where does this verb "charge" come from? /Jarkko
>> +/* >> + * Limits the amount of normal RAM that SGX can consume for EPC >> + * overcommit to the total EPC pages * sgx_overcommit_percent / 100 >> + */ >> +static int sgx_overcommit_percent = 100; >> +module_param_named(overcommit_percent, sgx_overcommit_percent, int, 0440); >> +MODULE_PARM_DESC(overcommit_percent, "Percentage of overcommit of EPC pages."); >> + >> +/* The number of pages that can be allocated globally for backing storage. */ >> +static atomic_long_t sgx_nr_available_backing_pages; >> +static bool sgx_disable_overcommit_tracking; > > I don't like the use of word tracking as we already have ETRACK. I don't think anyone is going to confuse "overcommit tracking" with ETRACK. That said, this *could* be "sgx_disable_overcommit_limits", I guess. > I'd also shorten the first global as "sgx_nr_backing_pages". That means something different from the variable, though. "sgx_nr_backing_pages" would be the name for the current number of backing pages which currently exist. > Couldn't you set "sgx_nr_backing_pages" to -1 when capping is disabled, and > then you would not need that bool in the first place? > >> + >> +/** >> + * sgx_charge_mem() - charge for a page used for backing storage >> + * > > Please remove this empty line: > > https://www.kernel.org/doc/Documentation/kernel-doc-nano-HOWTO.txt That *might* make sense when there are arguments. The arguments at least help visually separate the short function description from the full text description.
Hi Jarkko, thanks for your review, On Wed, 2021-12-29 at 01:04 +0200, Jarkko Sakkinen wrote: > On Mon, Dec 20, 2021 at 09:46:39AM -0800, Kristen Carlson Accardi > wrote: > > > > + > > +/** > > + * sgx_charge_mem() - charge for a page used for backing storage > > + * > > Please remove this empty line: > > https://www.kernel.org/doc/Documentation/kernel-doc-nano-HOWTO.txt I read this to be that there should be an empty line after the short description/arg list but before the longer description. I think for functions without args this is the proper layout. It also is more readable. > > > + * Backing storage usage is capped by the > > sgx_nr_available_backing_pages. > > + * If the backing storage usage is over the overcommit limit, > > Where does this verb "charge" come from? Charge in this context means that some available backing pages are now not available because they are in use. It feels appropriate to me to use "charge/uncharge" verbs for this action, unless you think it's confusing somehow.
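The charge/uncharge semantics described here can be sketched in userspace, with C11 atomics standing in for the kernel's atomic_long API (names are illustrative): charging takes one page from the pool unless the pool is empty, mirroring the `atomic_long_add_unless()` pattern in the patch.

```c
#include <stdatomic.h>
#include <errno.h>
#include <assert.h>

/* Userspace sketch of the backing-page accounting discussed above.
 * The counter holds the number of backing pages still available. */
static atomic_long nr_available_backing_pages;

/* Take one page from the pool unless the counter is already zero,
 * analogous to atomic_long_add_unless(&ctr, -1, 0). */
int charge_backing_page(void)
{
	long old = atomic_load(&nr_available_backing_pages);

	while (old != 0) {
		/* On failure, 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&nr_available_backing_pages,
						 &old, old - 1))
			return 0;
	}

	return -ENOMEM;	/* pool exhausted: reclaimer cannot swap */
}

/* Return a page to the pool when the backing storage is freed. */
void uncharge_backing_page(void)
{
	atomic_fetch_add(&nr_available_backing_pages, 1);
}
```

With the pool seeded to two pages, two charges succeed, a third fails with -ENOMEM, and an uncharge makes one charge possible again.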
On Thu, Jan 06, 2022 at 10:26:18AM -0800, Kristen Carlson Accardi wrote: > Hi Jarkko, thanks for your review, > > On Wed, 2021-12-29 at 01:04 +0200, Jarkko Sakkinen wrote: > > On Mon, Dec 20, 2021 at 09:46:39AM -0800, Kristen Carlson Accardi > > wrote: > > > > > > + > > > +/** > > > + * sgx_charge_mem() - charge for a page used for backing storage > > > + * > > > > Please remove this empty line: > > > > https://www.kernel.org/doc/Documentation/kernel-doc-nano-HOWTO.txt > > I read this to be that there should be an empty line after the short > description/arg list but before the longer description. I think for > functions without args this is the proper layout. It also is more > readable. > > > > > > + * Backing storage usage is capped by the > > > sgx_nr_available_backing_pages. > > > + * If the backing storage usage is over the overcommit limit, > > > > Where does this verb "charge" come from? > > Charge in this context means that some available backing pages are now > not available because they are in use. It feels appropriate to me to > use "charge/uncharge" verbs for this action, unless you think it's > confusing somehow. OK, it's cool. I'm still wondering why you need extra variable given that sgx_nr_backing_available_pages is signed. You could mark the feature being disabled by setting it to -1. It might cosmetically improve readability to have a boolean but for practical uses (e.g. eBPF tracing scripts) it only adds extra complexity. /Jarkko
On Fri, 2022-01-07 at 14:25 +0200, Jarkko Sakkinen wrote: > On Thu, Jan 06, 2022 at 10:26:18AM -0800, Kristen Carlson Accardi > wrote: > > Hi Jarkko, thanks for your review, > > > > On Wed, 2021-12-29 at 01:04 +0200, Jarkko Sakkinen wrote: > > > On Mon, Dec 20, 2021 at 09:46:39AM -0800, Kristen Carlson Accardi > > > wrote: > > > > + > > > > +/** > > > > + * sgx_charge_mem() - charge for a page used for backing > > > > storage > > > > + * > > > > > > Please remove this empty line: > > > > > > https://www.kernel.org/doc/Documentation/kernel-doc-nano-HOWTO.txt > > > > I read this to be that there should be an empty line after the > > short > > description/arg list but before the longer description. I think for > > functions without args this is the proper layout. It also is more > > readable. > > > > > > + * Backing storage usage is capped by the > > > > sgx_nr_available_backing_pages. > > > > + * If the backing storage usage is over the overcommit limit, > > > > > > Where does this verb "charge" come from? > > > > Charge in this context means that some available backing pages are > > now > > not available because they are in use. It feels appropriate to me > > to > > use "charge/uncharge" verbs for this action, unless you think it's > > confusing somehow. > > OK, it's cool. > > I'm still wondering why you need extra variable given that > sgx_nr_backing_available_pages is signed. You could mark the > feature being disabled by setting it to -1. > > It might cosmetically improve readability to have a boolean but > for practical uses (e.g. eBPF tracing scripts) it only adds extra > complexity. > > /Jarkko I can see your point. Since Boris objected to the module param, the next version will not have a way to disable limits at all, so I am deleting that boolean all together.
On Fri, Jan 07, 2022 at 09:17:03AM -0800, Kristen Carlson Accardi wrote: > On Fri, 2022-01-07 at 14:25 +0200, Jarkko Sakkinen wrote: > > On Thu, Jan 06, 2022 at 10:26:18AM -0800, Kristen Carlson Accardi > > wrote: > > > Hi Jarkko, thanks for your review, > > > > > > On Wed, 2021-12-29 at 01:04 +0200, Jarkko Sakkinen wrote: > > > > On Mon, Dec 20, 2021 at 09:46:39AM -0800, Kristen Carlson Accardi > > > > wrote: > > > > > + > > > > > +/** > > > > > + * sgx_charge_mem() - charge for a page used for backing > > > > > storage > > > > > + * > > > > > > > > Please remove this empty line: > > > > > > > > https://www.kernel.org/doc/Documentation/kernel-doc-nano-HOWTO.txt > > > > > > I read this to be that there should be an empty line after the > > > short > > > description/arg list but before the longer description. I think for > > > functions without args this is the proper layout. It also is more > > > readable. > > > > > > > > + * Backing storage usage is capped by the > > > > > sgx_nr_available_backing_pages. > > > > > + * If the backing storage usage is over the overcommit limit, > > > > > > > > Where does this verb "charge" come from? > > > > > > Charge in this context means that some available backing pages are > > > now > > > not available because they are in use. It feels appropriate to me > > > to > > > use "charge/uncharge" verbs for this action, unless you think it's > > > confusing somehow. > > > > OK, it's cool. > > > > I'm still wondering why you need extra variable given that > > sgx_nr_backing_available_pages is signed. You could mark the > > feature being disabled by setting it to -1. > > > > It might cosmetically improve readability to have a boolean but > > for practical uses (e.g. eBPF tracing scripts) it only adds extra > > complexity. > > > > /Jarkko > > I can see your point. Since Boris objected to the module param, the > next version will not have a way to disable limits at all, so I am > deleting that boolean all together. 
Ok, cool, I'm looking forward to the next version. /Jarkko
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 9725c546a0d4..9d23c05a833b 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -5165,6 +5165,13 @@ serialnumber [BUGS=X86-32] + sgx.overcommit_percent= [X86-64,SGX] + Limits the amount of normal RAM used for backing + storage that may be allocate, expressed as a + percentage of the total number of EPC pages in the + system. + See Documentation/x86/sgx.rst for more information. + shapers= [NET] Maximal number of shapers. diff --git a/Documentation/x86/sgx.rst b/Documentation/x86/sgx.rst index 265568a9292c..4f9a1c68be94 100644 --- a/Documentation/x86/sgx.rst +++ b/Documentation/x86/sgx.rst @@ -147,7 +147,21 @@ Page reclaimer Similar to the core kswapd, ksgxd, is responsible for managing the overcommitment of enclave memory. If the system runs out of enclave memory, -*ksgxd* “swaps” enclave memory to normal memory. +*ksgxd* “swaps” enclave memory to normal RAM. This normal RAM is allocated +via per enclave shared memory. The shared memory area is not mapped into the +enclave or the task mapping it, which makes its memory use opaque - including +to the system out of memory killer (OOM). This can be problematic when there +are no limits in place on the amount an enclave can allocate. + +At boot time, the module parameter "sgx.overcommit_percent" can be used to +place a limit on the number of shared memory backing pages that may be +allocated, expressed as a percentage of the total number of EPC pages in the +system. A value of 100 is the default, and represents a limit equal to the +number of EPC pages in the system. To disable the limit, set +sgx.overcommit_percent to -1. The number of backing pages available to +enclaves is a global resource. If the system exceeds the number of allowed +backing pages in use, the reclaimer will be unable to swap EPC pages to +shared memory. 
 
 Launch Control
 ==============
diff --git a/arch/x86/kernel/cpu/sgx/Makefile b/arch/x86/kernel/cpu/sgx/Makefile
index 9c1656779b2a..72f9192a43fe 100644
--- a/arch/x86/kernel/cpu/sgx/Makefile
+++ b/arch/x86/kernel/cpu/sgx/Makefile
@@ -1,6 +1,10 @@
-obj-y += \
+# This allows sgx to have module namespace
+obj-y += sgx.o
+
+sgx-y += \
 	driver.o \
 	encl.o \
 	ioctl.o \
 	main.o
+
 obj-$(CONFIG_X86_SGX_KVM) += virt.o
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 2857a49f2335..c58ce9d9fd56 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -1,6 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0
 /* Copyright(c) 2016-20 Intel Corporation. */
 
+#include <linux/moduleparam.h>
 #include <linux/file.h>
 #include <linux/freezer.h>
 #include <linux/highmem.h>
@@ -43,6 +44,54 @@ static struct sgx_numa_node *sgx_numa_nodes;
 
 static LIST_HEAD(sgx_dirty_page_list);
 
+/*
+ * Limits the amount of normal RAM that SGX can consume for EPC
+ * overcommit to the total EPC pages * sgx_overcommit_percent / 100
+ */
+static int sgx_overcommit_percent = 100;
+module_param_named(overcommit_percent, sgx_overcommit_percent, int, 0440);
+MODULE_PARM_DESC(overcommit_percent, "Percentage of overcommit of EPC pages.");
+
+/* The number of pages that can be allocated globally for backing storage. */
+static atomic_long_t sgx_nr_available_backing_pages;
+static bool sgx_disable_overcommit_tracking;
+
+/**
+ * sgx_charge_mem() - charge for a page used for backing storage
+ *
+ * Backing storage usage is capped by the sgx_nr_available_backing_pages.
+ * If the backing storage usage is over the overcommit limit,
+ * return an error.
+ *
+ * Return:
+ * 0:       The page requested does not exceed the limit
+ * -ENOMEM: The page requested exceeds the overcommit limit
+ */
+int sgx_charge_mem(void)
+{
+	if (sgx_disable_overcommit_tracking)
+		return 0;
+
+	if (!atomic_long_add_unless(&sgx_nr_available_backing_pages, -1, 0))
+		return -ENOMEM;
+
+	return 0;
+}
+
+/**
+ * sgx_uncharge_mem() - uncharge a page previously used for backing storage
+ *
+ * When backing storage is no longer in use, increment the
+ * sgx_nr_available_backing_pages counter.
+ */
+void sgx_uncharge_mem(void)
+{
+	if (sgx_disable_overcommit_tracking)
+		return;
+
+	atomic_long_inc(&sgx_nr_available_backing_pages);
+}
+
 /*
  * Reset post-kexec EPC pages to the uninitialized state. The pages are removed
  * from the input list, and made available for the page allocator. SECS pages
@@ -786,6 +835,7 @@ static bool __init sgx_page_cache_init(void)
 	u64 pa, size;
 	int nid;
 	int i;
+	u64 total_epc_bytes = 0;
 
 	sgx_numa_nodes = kmalloc_array(num_possible_nodes(), sizeof(*sgx_numa_nodes), GFP_KERNEL);
 	if (!sgx_numa_nodes)
@@ -830,6 +880,7 @@ static bool __init sgx_page_cache_init(void)
 		sgx_epc_sections[i].node = &sgx_numa_nodes[nid];
 		sgx_numa_nodes[nid].size += size;
+		total_epc_bytes += size;
 
 		sgx_nr_epc_sections++;
 	}
@@ -839,6 +890,19 @@ static bool __init sgx_page_cache_init(void)
 		return false;
 	}
 
+	if (sgx_overcommit_percent >= 0) {
+		u64 available_backing_bytes;
+
+		available_backing_bytes = total_epc_bytes * sgx_overcommit_percent / 100;
+
+		atomic_long_set(&sgx_nr_available_backing_pages,
+				available_backing_bytes >> PAGE_SHIFT);
+	} else {
+		pr_info("Disabling overcommit limit.\n");
+		sgx_disable_overcommit_tracking = true;
+	}
+
 	return true;
 }
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index 0f17def9fe6f..3507a9983fc1 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -89,6 +89,8 @@
 void sgx_free_epc_page(struct sgx_epc_page *page);
 void sgx_mark_page_reclaimable(struct sgx_epc_page *page);
 int sgx_unmark_page_reclaimable(struct sgx_epc_page *page);
 struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim);
+int sgx_charge_mem(void);
+void sgx_uncharge_mem(void);
 
 #ifdef CONFIG_X86_SGX_KVM
 int __init sgx_vepc_init(void);
When the system runs out of enclave memory, SGX can reclaim EPC pages by
swapping to normal RAM. This normal RAM is allocated via a per-enclave
shared memory area. The shared memory area is not mapped into the enclave
or the task mapping it, which makes its memory use opaque (including to
the OOM killer). Having lots of hard-to-find memory around is
problematic, especially when there is no limit.

Introduce a module parameter and a global counter that can be used to
limit the number of pages that enclaves are able to consume for backing
storage. This parameter is a percentage value that is used in conjunction
with the number of EPC pages in the system to set a cap on the amount of
backing RAM that can be consumed.

The default for this value is 100, which limits the total number of
shared memory pages that may be consumed by all enclaves as backing pages
to the number of EPC pages on the system. For example, on an SGX system
that has 128MB of EPC, this default would cap the amount of normal RAM
that SGX consumes for its shared memory areas at 128MB.

If sgx.overcommit_percent is set to a negative value (such as -1), SGX
will not place any limits on the amount of overcommit that might be
requested, and SGX will behave as it has previously, without the
overcommit_percent limit.

SGX may not be built as a module, but the module parameter interface is
used in order to provide a convenient interface.

The SGX overcommit_percent works differently than the core VM overcommit
limit. Enclaves request backing pages one page at a time, and the number
of in-use backing pages that are allowed is a global resource that is
limited for all enclaves.

Introduce a pair of functions which can be used by callers when
requesting backing RAM pages. These functions are responsible for
accounting the page charges. A request may return an error if the
request will cause the counter to exceed the backing page cap.
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
---
 .../admin-guide/kernel-parameters.txt |  7 ++
 Documentation/x86/sgx.rst             | 16 ++++-
 arch/x86/kernel/cpu/sgx/Makefile      |  6 +-
 arch/x86/kernel/cpu/sgx/main.c        | 64 +++++++++++++++++++
 arch/x86/kernel/cpu/sgx/sgx.h         |  2 +
 5 files changed, 93 insertions(+), 2 deletions(-)