diff mbox series

[v3,03/25] x86/sgx: Wipe out EREMOVE from sgx_free_epc_page()

Message ID 062acb801926b2ade2f9fe1672afb7113453a741.1616136308.git.kai.huang@intel.com (mailing list archive)
State New, archived
Headers show
Series KVM SGX virtualization support | expand

Commit Message

Huang, Kai March 19, 2021, 7:22 a.m. UTC
EREMOVE takes a page and removes any association between that page and
an enclave.  It must be run on a page before it can be added into
another enclave.  Currently, EREMOVE is run as part of pages being freed
into the SGX page allocator.  It is not expected to fail.

KVM does not track how guest pages are used, which means that SGX
virtualization use of EREMOVE might fail.

Break out the EREMOVE call from the SGX page allocator.  This will allow
the SGX virtualization code to use the allocator directly.  (SGX/KVM
will also introduce a more permissive EREMOVE helper).

Implement original sgx_free_epc_page() as sgx_encl_free_epc_page() to be
more specific that it is used to free EPC page assigned to one enclave.
Explicitly give an message using WARN_ONCE() when EREMOVE fails, to call
out EPC page is leaked, and requires machine reboot to get leaked pages
back.

Replace sgx_free_epc_page() with sgx_encl_free_epc_page() in all call
sites.  No functional change is intended, except the new WARNING message
when EREMOVE fails.

Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Kai Huang <kai.huang@intel.com>
---
v2->v3:

 - Changed to replace all call sites of sgx_free_epc_page() with
   sgx_encl_free_epc_page() to make this patch have no functional change,
   except a WARN() upon EREMOVE failure requested by Dave.
 - Rebased to latest tip/x86/sgx to resolve merge conflict with Jarkko's NUMA
   allocation.
 - Removed Jarkko as author. Added Jarkko's Acked-by.

v1->v2:

 - Merge original WARN() and pr_err_once() into one single WARN(), suggested
   by Sean.

---
 arch/x86/kernel/cpu/sgx/encl.c  | 40 ++++++++++++++++++++++++++++-----
 arch/x86/kernel/cpu/sgx/encl.h  |  1 +
 arch/x86/kernel/cpu/sgx/ioctl.c |  6 ++---
 arch/x86/kernel/cpu/sgx/main.c  | 14 +++++-------
 4 files changed, 44 insertions(+), 17 deletions(-)

Comments

Borislav Petkov March 22, 2021, 6:16 p.m. UTC | #1
On Fri, Mar 19, 2021 at 08:22:19PM +1300, Kai Huang wrote:
> +/**
> + * sgx_encl_free_epc_page - free EPC page assigned to an enclave
> + * @page:	EPC page to be freed
> + *
> + * Free EPC page assigned to an enclave.  It does EREMOVE for the page, and
> + * only upon success, it puts the page back to free page list.  Otherwise, it
> + * gives a WARNING to indicate page is leaked, and require reboot to retrieve
> + * leaked pages.
> + */
> +void sgx_encl_free_epc_page(struct sgx_epc_page *page)
> +{
> +	int ret;
> +
> +	WARN_ON_ONCE(page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED);
> +
> +	/*
> +	 * Give a message to remind EPC page is leaked when EREMOVE fails,
> +	 * and requires machine reboot to get leaked pages back. This can
> +	 * be improved in future by adding stats of leaked pages, etc.
> +	 */
> +#define EREMOVE_ERROR_MESSAGE \
> +	"EREMOVE returned %d (0x%x).  EPC page leaked.  Reboot required to retrieve leaked pages."

A reboot? Seriously? Why?

How are you going to explain to cloud people that they need to reboot
their fat server? The same cloud people who want to make sure Intel
supports late microcode loading no matter the effort just so to avoid
rebooting the machine.

But now all of a sudden, if they wanna have SGX enclaves in guests, they
need to get prepared for potential rebooting.

I sure hope I'm missing something...
Sean Christopherson March 22, 2021, 6:56 p.m. UTC | #2
On Mon, Mar 22, 2021, Borislav Petkov wrote:
> On Fri, Mar 19, 2021 at 08:22:19PM +1300, Kai Huang wrote:
> > +/**
> > + * sgx_encl_free_epc_page - free EPC page assigned to an enclave
> > + * @page:	EPC page to be freed
> > + *
> > + * Free EPC page assigned to an enclave.  It does EREMOVE for the page, and
> > + * only upon success, it puts the page back to free page list.  Otherwise, it
> > + * gives a WARNING to indicate page is leaked, and require reboot to retrieve
> > + * leaked pages.
> > + */
> > +void sgx_encl_free_epc_page(struct sgx_epc_page *page)
> > +{
> > +	int ret;
> > +
> > +	WARN_ON_ONCE(page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED);
> > +
> > +	/*
> > +	 * Give a message to remind EPC page is leaked when EREMOVE fails,
> > +	 * and requires machine reboot to get leaked pages back. This can
> > +	 * be improved in future by adding stats of leaked pages, etc.
> > +	 */
> > +#define EREMOVE_ERROR_MESSAGE \
> > +	"EREMOVE returned %d (0x%x).  EPC page leaked.  Reboot required to retrieve leaked pages."
> 
> A reboot? Seriously? Why?
> 
> How are you going to explain to cloud people that they need to reboot
> their fat server? The same cloud people who want to make sure Intel
> supports late microcode loading no matter the effort just so to avoid
> rebooting the machine.
> 
> But now all of a sudden, if they wanna have SGX enclaves in guests, they
> need to get prepared for potential rebooting.

Not necessarily.  This can only trigger in the host, and thus require a host
reboot, if the host is also running enclaves.  If the CSP is not running
enclaves, or is running its enclaves in a separate VM, then this path cannot be
reached.

> I sure hope I'm missing something...

EREMOVE can only fail if there's a kernel or hardware bug (or a VMM bug if
running as a guest).  IME, nearly every kernel/KVM bug that I introduced that
led to EREMOVE failure was also quite fatal to SGX, i.e. this is just the canary
in the coal mine.

It's certainly possible to add more sophisticated error handling, e.g. through
the pages onto a list and periodically try to recover them.  But, since the vast
majority of bugs that cause EREMOVE failure are fatal to SGX, implementing
sophisticated handling is quite low on the list of priorities.

Dave wanted the "page leaked" error message so that it's abundantly clear that
the kernel is leaking pages on EREMOVE failure and that the WARN isn't "benign".
Paolo Bonzini March 22, 2021, 7:11 p.m. UTC | #3
On 22/03/21 19:56, Sean Christopherson wrote:
> EREMOVE can only fail if there's a kernel or hardware bug (or a VMM bug if
> running as a guest).  IME, nearly every kernel/KVM bug that I introduced that
> led to EREMOVE failure was also quite fatal to SGX, i.e. this is just the canary
> in the coal mine.

That was my recollection as well from previous threads but, to be fair 
to Boris, the commit message is a lot more scary (and, which is what 
triggers me, puts the blame on KVM).  It just says "KVM does not track 
how guest pages are used, which means that SGX virtualization use of 
EREMOVE might fail".

Paolo
Borislav Petkov March 22, 2021, 7:15 p.m. UTC | #4
On Mon, Mar 22, 2021 at 11:56:37AM -0700, Sean Christopherson wrote:
> Not necessarily.  This can only trigger in the host, and thus require a host
> reboot, if the host is also running enclaves.  If the CSP is not running
> enclaves, or is running its enclaves in a separate VM, then this path cannot be
> reached.

That's what I meant. Rebooting guests is a lot easier, ofc.

Or are you saying, this can trigger *only* when they're running enclaves
on the *host* too?

> EREMOVE can only fail if there's a kernel or hardware bug (or a VMM bug if
> running as a guest). 

We get those on a daily basis.

> IME, nearly every kernel/KVM bug that I introduced that led to EREMOVE
> failure was also quite fatal to SGX, i.e. this is just the canary in
> the coal mine.
>
> It's certainly possible to add more sophisticated error handling, e.g. through
> the pages onto a list and periodically try to recover them.  But, since the vast
> majority of bugs that cause EREMOVE failure are fatal to SGX, implementing
> sophisticated handling is quite low on the list of priorities.
> 
> Dave wanted the "page leaked" error message so that it's abundantly clear that
> the kernel is leaking pages on EREMOVE failure and that the WARN isn't "benign".

So this sounds to me like this should BUG too eventually.

Or is this one of those "this should never happen" things so no one
should worry?

Whatever it is, if an admin sees this message in dmesg and doesn't get a
lengthy explanation what she/he is supposed to do, I don't think she/he
will be as relaxed.

Hell, people open bugs for correctable ECCs and are asking whether they
need to replace their hardware.

So let's play this out: put yourself in an admin's shoes and tell me how
should an admin react when she/he sees that?

Should the kernel probably also say: "Don't worry, you have enough
memory and what's a 4K, who cares? You'll reboot eventually."

Or should the kernel say "You need to reboot ASAP."

And so on...

So what is the scenario here and what kind of reaction is that message
supposed to cause, recovery action, blabla, the whole spiel?

Thx.
Sean Christopherson March 22, 2021, 7:37 p.m. UTC | #5
On Mon, Mar 22, 2021, Borislav Petkov wrote:
> On Mon, Mar 22, 2021 at 11:56:37AM -0700, Sean Christopherson wrote:
> > Not necessarily.  This can only trigger in the host, and thus require a host
> > reboot, if the host is also running enclaves.  If the CSP is not running
> > enclaves, or is running its enclaves in a separate VM, then this path cannot be
> > reached.
> 
> That's what I meant. Rebooting guests is a lot easier, ofc.
> 
> Or are you saying, this can trigger *only* when they're running enclaves
> on the *host* too?

Yes.  Note, it's still true if you strike out the "too", KVM support is completely
orthogonal to this code.  The purpose of this patch is to separate out the EREMOVE
path used for host enclaves (/dev/sgx_enclave), because EPC virtualization for
KVM will have non-buggy scenarios where EREMOVE can fail.  But the virt EPC code
is designed to handle that gracefully.

> > EREMOVE can only fail if there's a kernel or hardware bug (or a VMM bug if
> > running as a guest). 
> 
> We get those on a daily basis.
> 
> > IME, nearly every kernel/KVM bug that I introduced that led to EREMOVE
> > failure was also quite fatal to SGX, i.e. this is just the canary in
> > the coal mine.
> >
> > It's certainly possible to add more sophisticated error handling, e.g. through
> > the pages onto a list and periodically try to recover them.  But, since the vast
> > majority of bugs that cause EREMOVE failure are fatal to SGX, implementing
> > sophisticated handling is quite low on the list of priorities.
> > 
> > Dave wanted the "page leaked" error message so that it's abundantly clear that
> > the kernel is leaking pages on EREMOVE failure and that the WARN isn't "benign".
> 
> So this sounds to me like this should BUG too eventually.
> 
> Or is this one of those "this should never happen" things so no one
> should worry?

Hmm.  I don't think it warrants BUG.  At worst, leaking EPC pages is fatal only
to SGX.  If the underlying bug caused other fallout, e.g. didn't release a lock,
then obviously that could be fatal to the kernel.  But I don't think there's
ever a case where SGX being unusuable would prevent the kernel from functioning.
 
> Whatever it is, if an admin sees this message in dmesg and doesn't get a
> lengthy explanation what she/he is supposed to do, I don't think she/he
> will be as relaxed.
> 
> Hell, people open bugs for correctable ECCs and are asking whether they
> need to replace their hardware.

LOL.

> So let's play this out: put yourself in an admin's shoes and tell me how
> should an admin react when she/he sees that?
> 
> Should the kernel probably also say: "Don't worry, you have enough
> memory and what's a 4K, who cares? You'll reboot eventually."
 
> Or should the kernel say "You need to reboot ASAP."
> 
> And so on...
> 
> So what is the scenario here and what kind of reaction is that message
> supposed to cause, recovery action, blabla, the whole spiel?

Probably something in between.  Odds are good SGX will eventually become
unusuable, e.g. either kernel SGX support is completely hosted, or it will soon
leak the majority of EPC pages.  Something like this?

  "EREMOVE returned %d (0x%x), kernel bug likely.  EPC page leaked, SGX may become unusuable.  Reboot recommended to continue using SGX."
Huang, Kai March 22, 2021, 8:36 p.m. UTC | #6
On Mon, 22 Mar 2021 12:37:02 -0700 Sean Christopherson wrote:
> On Mon, Mar 22, 2021, Borislav Petkov wrote:
> > On Mon, Mar 22, 2021 at 11:56:37AM -0700, Sean Christopherson wrote:
> > > Not necessarily.  This can only trigger in the host, and thus require a host
> > > reboot, if the host is also running enclaves.  If the CSP is not running
> > > enclaves, or is running its enclaves in a separate VM, then this path cannot be
> > > reached.
> > 
> > That's what I meant. Rebooting guests is a lot easier, ofc.
> > 
> > Or are you saying, this can trigger *only* when they're running enclaves
> > on the *host* too?
> 
> Yes.  Note, it's still true if you strike out the "too", KVM support is completely
> orthogonal to this code.  The purpose of this patch is to separate out the EREMOVE
> path used for host enclaves (/dev/sgx_enclave), because EPC virtualization for
> KVM will have non-buggy scenarios where EREMOVE can fail.  But the virt EPC code
> is designed to handle that gracefully.
> 
> > > EREMOVE can only fail if there's a kernel or hardware bug (or a VMM bug if
> > > running as a guest). 
> > 
> > We get those on a daily basis.
> > 
> > > IME, nearly every kernel/KVM bug that I introduced that led to EREMOVE
> > > failure was also quite fatal to SGX, i.e. this is just the canary in
> > > the coal mine.
> > >
> > > It's certainly possible to add more sophisticated error handling, e.g. through
> > > the pages onto a list and periodically try to recover them.  But, since the vast
> > > majority of bugs that cause EREMOVE failure are fatal to SGX, implementing
> > > sophisticated handling is quite low on the list of priorities.
> > > 
> > > Dave wanted the "page leaked" error message so that it's abundantly clear that
> > > the kernel is leaking pages on EREMOVE failure and that the WARN isn't "benign".
> > 
> > So this sounds to me like this should BUG too eventually.
> > 
> > Or is this one of those "this should never happen" things so no one
> > should worry?
> 
> Hmm.  I don't think it warrants BUG.  At worst, leaking EPC pages is fatal only
> to SGX.  If the underlying bug caused other fallout, e.g. didn't release a lock,
> then obviously that could be fatal to the kernel.  But I don't think there's
> ever a case where SGX being unusuable would prevent the kernel from functioning.
>  
> > Whatever it is, if an admin sees this message in dmesg and doesn't get a
> > lengthy explanation what she/he is supposed to do, I don't think she/he
> > will be as relaxed.
> > 
> > Hell, people open bugs for correctable ECCs and are asking whether they
> > need to replace their hardware.
> 
> LOL.
> 
> > So let's play this out: put yourself in an admin's shoes and tell me how
> > should an admin react when she/he sees that?
> > 
> > Should the kernel probably also say: "Don't worry, you have enough
> > memory and what's a 4K, who cares? You'll reboot eventually."
>  
> > Or should the kernel say "You need to reboot ASAP."
> > 
> > And so on...
> > 
> > So what is the scenario here and what kind of reaction is that message
> > supposed to cause, recovery action, blabla, the whole spiel?
> 
> Probably something in between.  Odds are good SGX will eventually become
> unusuable, e.g. either kernel SGX support is completely hosted, or it will soon
> leak the majority of EPC pages.  Something like this?
> 
>   "EREMOVE returned %d (0x%x), kernel bug likely.  EPC page leaked, SGX may become unusuable.  Reboot recommended to continue using SGX."

Or perhaps just stick to old behavior in original sgx_free_epc_page()?

	ret = __eremove(sgx_get_epc_virt_addr(page));
	if (WARN_ONCE(ret, "EREMOVE returned %d (0x%x)", ret, ret))
		return;

This code path is only used by host SGX driver, but not KVM. And this patch's
*main* intention is to break EREMOVE out of sgx_free_epc_page() so virtual EPC
code can use sgx_free_epc_page(). 

Improving the error msg can be a separate discussion and separate patch which
can be done in the future, and this has nothing to do with SGX virtualization
support.
Huang, Kai March 22, 2021, 8:43 p.m. UTC | #7
On Mon, 22 Mar 2021 20:11:57 +0100 Paolo Bonzini wrote:
> On 22/03/21 19:56, Sean Christopherson wrote:
> > EREMOVE can only fail if there's a kernel or hardware bug (or a VMM bug if
> > running as a guest).  IME, nearly every kernel/KVM bug that I introduced that
> > led to EREMOVE failure was also quite fatal to SGX, i.e. this is just the canary
> > in the coal mine.
> 
> That was my recollection as well from previous threads but, to be fair 
> to Boris, the commit message is a lot more scary (and, which is what 
> triggers me, puts the blame on KVM).  It just says "KVM does not track 
> how guest pages are used, which means that SGX virtualization use of 
> EREMOVE might fail".

I don't see the commit msg being scary.  EREMOVE might fail but virtual EPC code
can handle that.  This is the reason to break out EREMOVE from original
sgx_free_epc_page(), so virtual EPC code can have its own logic of handling
EREMOVE failure.
Borislav Petkov March 22, 2021, 9:06 p.m. UTC | #8
On Mon, Mar 22, 2021 at 12:37:02PM -0700, Sean Christopherson wrote:
> Yes.  Note, it's still true if you strike out the "too", KVM support is completely
> orthogonal to this code.  The purpose of this patch is to separate out the EREMOVE
> path used for host enclaves (/dev/sgx_enclave), because EPC virtualization for
> KVM will have non-buggy scenarios where EREMOVE can fail.  But the virt EPC code
> is designed to handle that gracefully.

"gracefully" as it won't leak EPC pages which would require a host reboot? That
leaking is done by host enclaves only?

> Hmm.  I don't think it warrants BUG.  At worst, leaking EPC pages is fatal only
> to SGX.

Fatal how? If it keeps leaking, at some point it won't have any pages
for EPC pages anymore?

Btw, I probably have seen this and forgotten again so pls remind me,
is the amount of pages available for SGX use static and limited by,
I believe BIOS, or can a leakage in EPC pages cause system memory
shortage?

> If the underlying bug caused other fallout, e.g. didn't release a
> lock, then obviously that could be fatal to the kernel. But I don't
> think there's ever a case where SGX being unusuable would prevent the
> kernel from functioning.

This kinda replies my question above but still...

> Probably something in between.  Odds are good SGX will eventually become
> unusuable, e.g. either kernel SGX support is completely hosted, or it will soon
> leak the majority of EPC pages.  Something like this?
> 
>   "EREMOVE returned %d (0x%x), kernel bug likely.  EPC page leaked, SGX may become unusuable.  Reboot recommended to continue using SGX."

So all this handwaving I'm doing is to provoke a proper response from
you guys as to how a EPC page leaking is supposed to be handled by the
users of the technology:

1. Issue a warning message and forget about it, eventual reboot

2. Really scary message to make users reboot sooner

3. Detect when host enclaves are run while guest enclaves are running
and issue a warning then.

4. Fall on knees and pray to not get sued by customers because their
enclaves are not working anymore.

....

Btw, 4. needs to be considered properly so that people can cover asses.

Oh and whatever we end up deciding, we should document that in
Documentation/... somewhere and point users to it in that warning
message where a longer treatise is explaining the whole deal properly.
Huang, Kai March 22, 2021, 10:06 p.m. UTC | #9
On Mon, 22 Mar 2021 22:06:45 +0100 Borislav Petkov wrote:
> On Mon, Mar 22, 2021 at 12:37:02PM -0700, Sean Christopherson wrote:
> > Yes.  Note, it's still true if you strike out the "too", KVM support is completely
> > orthogonal to this code.  The purpose of this patch is to separate out the EREMOVE
> > path used for host enclaves (/dev/sgx_enclave), because EPC virtualization for
> > KVM will have non-buggy scenarios where EREMOVE can fail.  But the virt EPC code
> > is designed to handle that gracefully.
> 
> "gracefully" as it won't leak EPC pages which would require a host reboot? That
> leaking is done by host enclaves only?

This path is called by host SGX driver only, so yes this leaking is done by
host enclaves only.

This patch is purpose is to break EREMOVE out of sgx_free_epc_page() so virtual
EPC code can use sgx_free_epc_page(), and handle EREMOVE logic differently.
This patch doesn't have intention to bring functional change.  I changed the
error msg because Dave said it worth to give a message saying EPC page is
leaked, and I thought this minor change won't break anything.

Perpahps we can avoid changing error message but stick to existing SGX driver
behavior? 

> 
> > Hmm.  I don't think it warrants BUG.  At worst, leaking EPC pages is fatal only
> > to SGX.
> 
> Fatal how? If it keeps leaking, at some point it won't have any pages
> for EPC pages anymore?
> 
> Btw, I probably have seen this and forgotten again so pls remind me,
> is the amount of pages available for SGX use static and limited by,
> I believe BIOS, or can a leakage in EPC pages cause system memory
> shortage?
> 
> > If the underlying bug caused other fallout, e.g. didn't release a
> > lock, then obviously that could be fatal to the kernel. But I don't
> > think there's ever a case where SGX being unusuable would prevent the
> > kernel from functioning.
> 
> This kinda replies my question above but still...
> 
> > Probably something in between.  Odds are good SGX will eventually become
> > unusuable, e.g. either kernel SGX support is completely hosted, or it will soon
> > leak the majority of EPC pages.  Something like this?
> > 
> >   "EREMOVE returned %d (0x%x), kernel bug likely.  EPC page leaked, SGX may become unusuable.  Reboot recommended to continue using SGX."
> 
> So all this handwaving I'm doing is to provoke a proper response from
> you guys as to how a EPC page leaking is supposed to be handled by the
> users of the technology:
> 
> 1. Issue a warning message and forget about it, eventual reboot

This is the existing SGX driver behavior IMHO. It just gives a WARN() saying
EREMOVE failed with the error code printed.  The EPC page is leaked w/o any
message to user.  I can live with it, and it is existing code anyway.

btw, currently virtual EPC code (patch 5) handles in similar way: There's one
EREMOVE error is expected and virtual EPC code can handle, but for other
errors, it means kernel bug, and virtual EPC code gives a WARN(), and that EPC
page is leaked too:

+		WARN_ONCE(ret != SGX_CHILD_PRESENT,
+			  "EREMOVE (EPC page 0x%lx): unexpected error: %d\n",
+			  sgx_get_epc_phys_addr(epc_page), ret);
+		return ret;

So to me they are just WARN() to catch kernel bug.

> 
> 2. Really scary message to make users reboot sooner
> 
> 3. Detect when host enclaves are run while guest enclaves are running
> and issue a warning then.

This code path has nothing to do with guest enclaves.

> 
> 4. Fall on knees and pray to not get sued by customers because their
> enclaves are not working anymore.
> 
> ....
> 
> Btw, 4. needs to be considered properly so that people can cover asses.

If we are talking about CSPs being unable to provide correct services to
customers due to kernel bug, I think this is a bigger question but not
just related to SGX,  since other kernel bug can also cause similar problem, for
instance, VM or SGX process itself being killed.

> 
> Oh and whatever we end up deciding, we should document that in
> Documentation/... somewhere and point users to it in that warning
> message where a longer treatise is explaining the whole deal properly.
> 
> -- 
> Regards/Gruss,
>     Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette
Huang, Kai March 22, 2021, 10:23 p.m. UTC | #10
> 
> Btw, I probably have seen this and forgotten again so pls remind me,
> is the amount of pages available for SGX use static and limited by,
> I believe BIOS, or can a leakage in EPC pages cause system memory
> shortage?
> 

Yes EPC size is fixed and configured in BIOS. Leaking EPC pages may cause EPC
shortage, but not system memory.
Borislav Petkov March 22, 2021, 10:37 p.m. UTC | #11
On Tue, Mar 23, 2021 at 11:06:43AM +1300, Kai Huang wrote:
> This path is called by host SGX driver only, so yes this leaking is done by
> host enclaves only.

Yes, so I was told.

> This patch is purpose is to break EREMOVE out of sgx_free_epc_page() so virtual
> EPC code can use sgx_free_epc_page(), and handle EREMOVE logic differently.
> This patch doesn't have intention to bring functional change.  I changed the
> error msg because Dave said it worth to give a message saying EPC page is
> leaked, and I thought this minor change won't break anything.

I read that already - you don't have to repeat it.

> btw, currently virtual EPC code (patch 5) handles in similar way: There's one
> EREMOVE error is expected and virtual EPC code can handle, but for other
> errors, it means kernel bug, and virtual EPC code gives a WARN(), and that EPC
> page is leaked too:
> 
> +		WARN_ONCE(ret != SGX_CHILD_PRESENT,
> +			  "EREMOVE (EPC page 0x%lx): unexpected error: %d\n",
> +			  sgx_get_epc_phys_addr(epc_page), ret);
> +		return ret;
> 
> So to me they are just WARN() to catch kernel bug.

You don't care about users, do you? Because when that error happens,
they won't come crying to you to fix it.

Lemme save you some trouble: we don't take half-baked code into the
kernel until stuff has been discussed and analyzed properly. So instead
of trying to downplay this, try answering my questions.

Here's another one: when does EREMOVE fail?

/me goes and looks it up

"The instruction fails if the operand is not properly aligned or does
not refer to an EPC page or the page is in use by another thread, or
other threads are running in the enclave to which the page belongs. In
addition the instruction fails if the operand refers to an SECS with
associations."

And I guess those conditions will become more in the future.

Now, let's play. I'm the cloud admin and you're cloud OS customer
support. I say:

"I got this scary error message while running enclaves on my server

"EREMOVE returned ... .  EPC page leaked.  Reboot required to retrieve leaked pages."

but I cannot reboot that machine because there are guests running on it
and I'm getting paid for those guests and I might get sued if I do?"

Your turn, go wild.
Huang, Kai March 22, 2021, 11:16 p.m. UTC | #12
On Mon, 22 Mar 2021 23:37:26 +0100 Borislav Petkov wrote:
> On Tue, Mar 23, 2021 at 11:06:43AM +1300, Kai Huang wrote:
> > This path is called by host SGX driver only, so yes this leaking is done by
> > host enclaves only.
> 
> Yes, so I was told.
> 
> > This patch is purpose is to break EREMOVE out of sgx_free_epc_page() so virtual
> > EPC code can use sgx_free_epc_page(), and handle EREMOVE logic differently.
> > This patch doesn't have intention to bring functional change.  I changed the
> > error msg because Dave said it worth to give a message saying EPC page is
> > leaked, and I thought this minor change won't break anything.
> 
> I read that already - you don't have to repeat it.
> 
> > btw, currently virtual EPC code (patch 5) handles in similar way: There's one
> > EREMOVE error is expected and virtual EPC code can handle, but for other
> > errors, it means kernel bug, and virtual EPC code gives a WARN(), and that EPC
> > page is leaked too:
> > 
> > +		WARN_ONCE(ret != SGX_CHILD_PRESENT,
> > +			  "EREMOVE (EPC page 0x%lx): unexpected error: %d\n",
> > +			  sgx_get_epc_phys_addr(epc_page), ret);
> > +		return ret;
> > 
> > So to me they are just WARN() to catch kernel bug.
> 
> You don't care about users, do you? Because when that error happens,
> they won't come crying to you to fix it.
> 
> Lemme save you some trouble: we don't take half-baked code into the
> kernel until stuff has been discussed and analyzed properly. So instead
> of trying to downplay this, try answering my questions.
> 
> Here's another one: when does EREMOVE fail?
> 
> /me goes and looks it up
> 
> "The instruction fails if the operand is not properly aligned or does
> not refer to an EPC page or the page is in use by another thread, or
> other threads are running in the enclave to which the page belongs. In
> addition the instruction fails if the operand refers to an SECS with
> associations."
> 
> And I guess those conditions will become more in the future.
> 
> Now, let's play. I'm the cloud admin and you're cloud OS customer
> support. I say:
> 
> "I got this scary error message while running enclaves on my server
> 
> "EREMOVE returned ... .  EPC page leaked.  Reboot required to retrieve leaked pages."
> 
> but I cannot reboot that machine because there are guests running on it
> and I'm getting paid for those guests and I might get sued if I do?"
> 
> Your turn, go wild.

I suppose admin can migrate those VMs, and then engineers can analyse the root
cause of such failure, and then fix it.
Sean Christopherson March 23, 2021, 3:45 p.m. UTC | #13
On Tue, Mar 23, 2021, Kai Huang wrote:
> On Mon, 22 Mar 2021 23:37:26 +0100 Borislav Petkov wrote:
> > "The instruction fails if the operand is not properly aligned or does
> > not refer to an EPC page or the page is in use by another thread, or
> > other threads are running in the enclave to which the page belongs. In
> > addition the instruction fails if the operand refers to an SECS with
> > associations."
> > 
> > And I guess those conditions will become more in the future.

Yep, IME these types of bugs rarely, if ever, lead to isolated failures.

> > Now, let's play. I'm the cloud admin and you're cloud OS customer
> > support. I say:
> > 
> > "I got this scary error message while running enclaves on my server
> > 
> > "EREMOVE returned ... .  EPC page leaked.  Reboot required to retrieve leaked pages."
> > 
> > but I cannot reboot that machine because there are guests running on it
> > and I'm getting paid for those guests and I might get sued if I do?"
> > 
> > Your turn, go wild.
> 
> I suppose admin can migrate those VMs, and then engineers can analyse the root
> cause of such failure, and then fix it.

That's more than likely what will happen, though there are a lot of "ifs" and
"buts" in any answer, e.g. things will go downhill fast if the majority of
systems in the fleet are running the buggy kernel and are triggering the error.

Practically speaking, "basic" deployments of SGX VMs will be insulated from
this bug.  KVM doesn't support EPC oversubscription, so even if all EPC is
exhausted, new VMs will fail to launch, but existing VMs will continue to chug
along with no ill effects.  There are again caveats, e.g. if EPC is being lazily
allocated for VMs, then running VMs will be affected if a VM starts using SGX
after the leak in the host occurs.  But, IMO doing lazy allocation _and_ running
enclaves in the host falls firmly into the "advanced" bucket; anyone going that
route had better do their homework to understand the various EPC interactions.
Borislav Petkov March 23, 2021, 4:06 p.m. UTC | #14
On Tue, Mar 23, 2021 at 03:45:14PM +0000, Sean Christopherson wrote:
> Practically speaking, "basic" deployments of SGX VMs will be insulated from
> this bug.  KVM doesn't support EPC oversubscription, so even if all EPC is
> exhausted, new VMs will fail to launch, but existing VMs will continue to chug
> along with no ill effects....

Ok, so it sounds to me like *at* *least* there should be some writeup in
Documentation/ explaining to the user what to do when she sees such an
EREMOVE failure, perhaps the gist of this thread and then possibly the
error message should point to that doc.

We will of course have to revisit when this hits the wild and people
start (or not) hitting this. But judging by past experience, if it is
there, we will hit it. Murphy says so.

Thx.
Sean Christopherson March 23, 2021, 4:21 p.m. UTC | #15
On Tue, Mar 23, 2021, Borislav Petkov wrote:
> On Tue, Mar 23, 2021 at 03:45:14PM +0000, Sean Christopherson wrote:
> > Practically speaking, "basic" deployments of SGX VMs will be insulated from
> > this bug.  KVM doesn't support EPC oversubscription, so even if all EPC is
> > exhausted, new VMs will fail to launch, but existing VMs will continue to chug
> > along with no ill effects....
> 
> Ok, so it sounds to me like *at* *least* there should be some writeup in
> Documentation/ explaining to the user what to do when she sees such an
> EREMOVE failure, perhaps the gist of this thread and then possibly the
> error message should point to that doc.
> 
> We will of course have to revisit when this hits the wild and people
> start (or not) hitting this. But judging by past experience, if it is
> there, we will hit it. Murphy says so.

I like the idea of pointing at the documentation.  The documentation should
probably emphasize that something is very, very wrong.  E.g. if a kernel bug
triggers EREMOVE failure and isn't detected until the kernel is widely deployed
in a fleet, then the folks deploying the kernel probably _should_ be in all out
panic.  For this variety of bug to escape that far, it means there are huge
holes in test coverage, in both the kernel itself and in the infrasturcture of
whoever is rolling out their new kernel.
Borislav Petkov March 23, 2021, 4:32 p.m. UTC | #16
On Tue, Mar 23, 2021 at 04:21:47PM +0000, Sean Christopherson wrote:
> I like the idea of pointing at the documentation.  The documentation should
> probably emphasize that something is very, very wrong.

Yap, because no matter how we formulate the error message, it still ain't enough
and needs a longer explanation.

> E.g. if a kernel bug triggers EREMOVE failure and isn't detected until
> the kernel is widely deployed in a fleet, then the folks deploying the
> kernel probably _should_ be in all out panic. For this variety of bug
> to escape that far, it means there are huge holes in test coverage, in
> both the kernel itself and in the infrasturcture of whoever is rolling
> out their new kernel.

You sound just like someone who works at a company with a big fleet, oh
wait...

:-)

And yap, you big fleeted guys will more likely catch it but we do have
all these other customers who have a handful of servers only so they
probably won't be able to do such a wide coverage.

So I hope they'll appreciate this longer explanation about what to do
when they hit it. And normally I wouldn't even care but we almost never
tell people to reboot their boxes to fix sh*t - that's the other OS.

Thx.
Paolo Bonzini March 23, 2021, 4:38 p.m. UTC | #17
On 23/03/21 17:06, Borislav Petkov wrote:
>> Practically speaking, "basic" deployments of SGX VMs will be insulated from
>> this bug.  KVM doesn't support EPC oversubscription, so even if all EPC is
>> exhausted, new VMs will fail to launch, but existing VMs will continue to chug
>> along with no ill effects....
>
> Ok, so it sounds to me like*at*  *least*  there should be some writeup in
> Documentation/ explaining to the user what to do when she sees such an
> EREMOVE failure, perhaps the gist of this thread and then possibly the
> error message should point to that doc.

That's important, but it's even more important *to developers* that the 
commit message spells out why this would be a kernel bug more often than 
not.  I for one do not understand it, and I suspect I'm not alone.

Maybe (optimistically) once we see that explanation we decide that the 
documentation is not important.  Sean, Kai, can you explain it?

Paolo
Paolo Bonzini March 23, 2021, 4:40 p.m. UTC | #18
On 22/03/21 21:43, Kai Huang wrote:
>> That was my recollection as well from previous threads but, to be fair
>> to Boris, the commit message is a lot more scary (and, which is what
>> triggers me, puts the blame on KVM).  It just says "KVM does not track
>> how guest pages are used, which means that SGX virtualization use of
>> EREMOVE might fail".
>
> I don't see the commit msg being scary.  EREMOVE might fail but virtual EPC code
> can handle that.  This is the reason to break out EREMOVE from original
> sgx_free_epc_page(), so virtual EPC code can have its own logic of handling
> EREMOVE failure.

I should explain what I mean by scary.

What you wrote above, "EREMOVE might fail but virtual EPC code can 
handle that" sounds fine.  But it doesn't say the failure mode, so it's 
hiding information.

What I would like to have, "EREMOVE might fail and will be leaked, but 
virtual EPC code will not crash and in any case there are much worse 
problems waiting to happen" is fine.  (It's even better with an 
explanation of the problems).

Your message however was in the middle: "EREMOVE might fail, virtual EPC 
code will not crash but the page will be leaked".  It gives the failure 
mode but not how the problem arises, and it is this combination that 
results in something scary-sounding.

Paolo
Sean Christopherson March 23, 2021, 4:51 p.m. UTC | #19
On Tue, Mar 23, 2021, Borislav Petkov wrote:
> On Tue, Mar 23, 2021 at 04:21:47PM +0000, Sean Christopherson wrote:
> > I like the idea of pointing at the documentation.  The documentation should
> > probably emphasize that something is very, very wrong.
> 
> Yap, because no matter how we formulate the error message, it still ain't enough
> and needs a longer explanation.
> 
> > E.g. if a kernel bug triggers EREMOVE failure and isn't detected until
> > the kernel is widely deployed in a fleet, then the folks deploying the
> > kernel probably _should_ be in all out panic. For this variety of bug
> > to escape that far, it means there are huge holes in test coverage, in
> > both the kernel itself and in the infrasturcture of whoever is rolling
> > out their new kernel.
> 
> You sound just like someone who works at a company with a big fleet, oh
> wait...
> 
> :-)
> 
> And yap, you big fleeted guys will more likely catch it but we do have
> all these other customers who have a handful of servers only so they
> probably won't be able to do such a wide coverage.

The size of the fleet shouldn't matter for this specific case.  This bug
requires the _host_ to be running enclaves, and obviously it also requires the
system to be running SGX-enabled guests as well.  In such a setup, the SGX
workload running in the host should be very well defined and understood, i.e.
testing should be a well-bounded problem to solve.

Running enclaves in both the host and guest should be uncommon in and of itself,
and for such setups, running _any_ SGX workloads in the host, let alone more
than 1 or 2 unique workloads, without ensuring guests are fully isolated is,
IMO, insane.

But yeah, what can happen, will happen.
 
> So I hope they'll appreciate this longer explanation about what to do
> when they hit it. And normally I wouldn't even care but we almost never
> tell people to reboot their boxes to fix sh*t - that's the other OS.
> 
> Thx.
> 
> -- 
> Regards/Gruss,
>     Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette
Sean Christopherson March 23, 2021, 5:02 p.m. UTC | #20
On Tue, Mar 23, 2021, Paolo Bonzini wrote:
> On 23/03/21 17:06, Borislav Petkov wrote:
> > > Practically speaking, "basic" deployments of SGX VMs will be insulated from
> > > this bug.  KVM doesn't support EPC oversubscription, so even if all EPC is
> > > exhausted, new VMs will fail to launch, but existing VMs will continue to chug
> > > along with no ill effects....
> > 
> > Ok, so it sounds to me like*at*  *least*  there should be some writeup in
> > Documentation/ explaining to the user what to do when she sees such an
> > EREMOVE failure, perhaps the gist of this thread and then possibly the
> > error message should point to that doc.
> 
> That's important, but it's even more important *to developers* that the
> commit message spells out why this would be a kernel bug more often than
> not.  I for one do not understand it, and I suspect I'm not alone.
> 
> Maybe (optimistically) once we see that explanation we decide that the
> documentation is not important.  Sean, Kai, can you explain it?

Thought of a good analogy that can be used for the changelog and/or docs:

This is effectively a kernel use-after-free of EPC, and due to the way SGX works,
the bug is detected at freeing.  Rather than add the page back to the pool of
available EPC, the kernel intentionally leaks the page to avoid additional
errors in the future.

Does that help?
Paolo Bonzini March 23, 2021, 5:06 p.m. UTC | #21
On 23/03/21 18:02, Sean Christopherson wrote:
>> That's important, but it's even more important *to developers* that the
>> commit message spells out why this would be a kernel bug more often than
>> not.  I for one do not understand it, and I suspect I'm not alone.
>> 
>> Maybe (optimistically) once we see that explanation we decide that the
>> documentation is not important.  Sean, Kai, can you explain it?
>
> Thought of a good analogy that can be used for the changelog and/or docs:
> 
> This is effectively a kernel use-after-free of EPC, and due to the way SGX works,
> the bug is detected at freeing.  Rather than add the page back to the pool of
> available EPC, the kernel intentionally leaks the page to avoid additional
> errors in the future.
> 
> Does that help?

Very much, and for me this also settles the question of documentation. 
Borislav or Kai, can you add it to the commit message?

Paolo
Sean Christopherson March 23, 2021, 5:16 p.m. UTC | #22
On Tue, Mar 23, 2021, Paolo Bonzini wrote:
> On 23/03/21 18:02, Sean Christopherson wrote:
> > > That's important, but it's even more important *to developers* that the
> > > commit message spells out why this would be a kernel bug more often than
> > > not.  I for one do not understand it, and I suspect I'm not alone.
> > > 
> > > Maybe (optimistically) once we see that explanation we decide that the
> > > documentation is not important.  Sean, Kai, can you explain it?
> > 
> > Thought of a good analogy that can be used for the changelog and/or docs:
> > 
> > This is effectively a kernel use-after-free of EPC, and due to the way SGX works,
> > the bug is detected at freeing.  Rather than add the page back to the pool of
> > available EPC, the kernel intentionally leaks the page to avoid additional
> > errors in the future.
> > 
> > Does that help?
> 
> Very much, and for me this also settles the question of documentation.
> Borislav or Kai, can you add it to the commit message?

One last thought.  This error/WARN doesn't guarantee that a conflict hasn't
already occurred, e.g. the EPC page was prematurely put back on the list and
already handed out to a second enclave.  In that case there will undoubtedly be
a slew of WARNs/errors leading up to this one, I just wanted to clarify that
intentionally leaking the page doesn't magically cure _all_ use-after-free or
double-use bugs.
Borislav Petkov March 23, 2021, 6:16 p.m. UTC | #23
On Tue, Mar 23, 2021 at 06:06:19PM +0100, Paolo Bonzini wrote:
> Very much, and for me this also settles the question of documentation.
> Borislav or Kai, can you add it to the commit message?

Not only the commit message - that will become hard to find over time. I
believe Documentation/x86/sgx.rst is a good place to put the gist of it
in and refer to it in the warning message.

Thx.
Jarkko Sakkinen March 24, 2021, 9:26 a.m. UTC | #24
On Tue, Mar 23, 2021 at 05:06:04PM +0100, Borislav Petkov wrote:
> On Tue, Mar 23, 2021 at 03:45:14PM +0000, Sean Christopherson wrote:
> > Practically speaking, "basic" deployments of SGX VMs will be insulated from
> > this bug.  KVM doesn't support EPC oversubscription, so even if all EPC is
> > exhausted, new VMs will fail to launch, but existing VMs will continue to chug
> > along with no ill effects....
> 
> Ok, so it sounds to me like *at* *least* there should be some writeup in
> Documentation/ explaining to the user what to do when she sees such an
> EREMOVE failure, perhaps the gist of this thread and then possibly the
> error message should point to that doc.
> 
> We will of course have to revisit when this hits the wild and people
> start (or not) hitting this. But judging by past experience, if it is
> there, we will hit it. Murphy says so.
> 
> Thx.

We had recently a steady flush of bug reports about a WARN() in tpm_tis
driver, from all levels of involvement with the kernel. Even people who
don't know what kernel documentation is, got their message through.

When a WARN() triggers anywhere in the kernel, what people tend to do is
that they go to the distro bugzilla, and the issue is quickly escalated
to the corresponding maintainer.

So, what is the part missing from the equation that should be documented
to the kernel documentation. This not a counter argument per se, I just
don't fully understand what is the missing piece that should be put there.

> -- 
> Regards/Gruss,
>     Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette
> 

/Jarkko
Jarkko Sakkinen March 24, 2021, 9:28 a.m. UTC | #25
On Tue, Mar 23, 2021 at 04:21:47PM +0000, Sean Christopherson wrote:
> On Tue, Mar 23, 2021, Borislav Petkov wrote:
> > On Tue, Mar 23, 2021 at 03:45:14PM +0000, Sean Christopherson wrote:
> > > Practically speaking, "basic" deployments of SGX VMs will be insulated from
> > > this bug.  KVM doesn't support EPC oversubscription, so even if all EPC is
> > > exhausted, new VMs will fail to launch, but existing VMs will continue to chug
> > > along with no ill effects....
> > 
> > Ok, so it sounds to me like *at* *least* there should be some writeup in
> > Documentation/ explaining to the user what to do when she sees such an
> > EREMOVE failure, perhaps the gist of this thread and then possibly the
> > error message should point to that doc.
> > 
> > We will of course have to revisit when this hits the wild and people
> > start (or not) hitting this. But judging by past experience, if it is
> > there, we will hit it. Murphy says so.
> 
> I like the idea of pointing at the documentation.  The documentation should
> probably emphasize that something is very, very wrong.  E.g. if a kernel bug
> triggers EREMOVE failure and isn't detected until the kernel is widely deployed
> in a fleet, then the folks deploying the kernel probably _should_ be in all out
> panic.  For this variety of bug to escape that far, it means there are huge
> holes in test coverage, in both the kernel itself and in the infrasturcture of
> whoever is rolling out their new kernel.

My own experience with WARN()'s has been so far, that the stack trace does
the job fairly well. It's commonly misintepreted same as oops.

/Jarkko
Huang, Kai March 24, 2021, 9:38 a.m. UTC | #26
On Tue, 2021-03-23 at 17:32 +0100, Borislav Petkov wrote:
> On Tue, Mar 23, 2021 at 04:21:47PM +0000, Sean Christopherson wrote:
> > I like the idea of pointing at the documentation.  The documentation should
> > probably emphasize that something is very, very wrong.
> 
> Yap, because no matter how we formulate the error message, it still ain't enough
> and needs a longer explanation.
> 
> > E.g. if a kernel bug triggers EREMOVE failure and isn't detected until
> > the kernel is widely deployed in a fleet, then the folks deploying the
> > kernel probably _should_ be in all out panic. For this variety of bug
> > to escape that far, it means there are huge holes in test coverage, in
> > both the kernel itself and in the infrasturcture of whoever is rolling
> > out their new kernel.
> 
> You sound just like someone who works at a company with a big fleet, oh
> wait...
> 
> :-)
> 
> And yap, you big fleeted guys will more likely catch it but we do have
> all these other customers who have a handful of servers only so they
> probably won't be able to do such a wide coverage.
> 
> So I hope they'll appreciate this longer explanation about what to do
> when they hit it. And normally I wouldn't even care but we almost never
> tell people to reboot their boxes to fix sh*t - that's the other OS.
> 
> Thx.
> 

Hi Sean, Boris, Paolo,

Thanks for the discussion. I tried to digest all your conversations and
hopefully I have understood you correctly. I pasted the new patch here
(not full patch, but relevant part only). I modified the error msg, added
some writeup to Documentation/x86/sgx.rst, and put Sean's explanation of this
bug to the commit msg (per Paolo). I am terrible Documentation writer, so
please help to check and give comments. Thanks!

---
commit 1e297a535bcb4f51a08343c40207520017d85efe (HEAD)
Author: Kai Huang <kai.huang@intel.com>
Date:   Wed Jan 20 03:40:53 2021 +0200

    x86/sgx: Wipe out EREMOVE from sgx_free_epc_page()
    
    EREMOVE takes a page and removes any association between that page and
    an enclave.  It must be run on a page before it can be added into
    another enclave.  Currently, EREMOVE is run as part of pages being freed
    into the SGX page allocator.  It is not expected to fail.
    
    KVM does not track how guest pages are used, which means that SGX
    virtualization use of EREMOVE might fail.  Specifically, it is
    legitimate that EREMOVE returns SGX_CHILD_PRESENT for EPC assigned to
    KVM guest, because KVM/kernel doesn't track SECS pages.
    
    Break out the EREMOVE call from the SGX page allocator.  This will allow
    the SGX virtualization code to use the allocator directly.  (SGX/KVM
    will also introduce a more permissive EREMOVE helper).
    
    Implement original sgx_free_epc_page() as sgx_encl_free_epc_page() to be
    more specific that it is used to free EPC page assigned host enclave.
    Replace sgx_free_epc_page() with sgx_encl_free_epc_page() in all call
    sites so there's no functional change.
    
    Improve error message when EREMOVE fails, and kernel is about to leak
    EPC page, which is likely a kernel bug.  This is effectively a kernel
    use-after-free of EPC, and due to the way SGX works, the bug is detected
    at freeing.  Rather than add the page back to the pool of available EPC,
    the kernel intentionally leaks the page to avoid additional errors in
    the future.
    
    Also add documentation to explain to user what is the bug and suggest
    user what to do when this bug happens, although extremely unlikely.
    
    Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
    Signed-off-by: Kai Huang <kai.huang@intel.com>

diff --git a/Documentation/x86/sgx.rst b/Documentation/x86/sgx.rst
index eaee1368b4fd..cb0428a8f4dd 100644
--- a/Documentation/x86/sgx.rst
+++ b/Documentation/x86/sgx.rst
@@ -209,3 +209,29 @@ An application may be loaded into a container enclave which is
specially
 configured with a library OS and run-time which permits the application to run.
 The enclave run-time and library OS work together to execute the application
 when a thread enters the enclave.
+
+Impact of Potential Kernel SGX Bugs
+===================================
+
+EPC leaks
+---------
+
+Although extremely unlikely, EPC leaks can happen if kernel SGX bug happens,
+when a WARNING with below message is shown in dmesg:
+
+"...EREMOVE returned ..., kernel bug likely.  EPC page leaked, SGX may become
+unusuable.  Please refer to Documentation/x86/sgx.rst for more information."
+
+This is effectively a kernel use-after-free of EPC, and due to the way SGX
+works, the bug is detected at freeing. Rather than add the page back to the pool
+of available EPC, the kernel intentionally leaks the page to avoid additional
+errors in the future.
+
+When this happens, kernel will likely soon leak majority of EPC pages, and SGX
+will likely become unusable. However while this may be fatal to SGX, other
+kernel functionalities are unlikely to be impacted, and should continue to work.
+
+As a result, when this happpens, user should stop running any new SGX workloads,
+(or just any new workloads), and migrate all valuable workloads, for instance,
+virtual machines, to other places. Although a machine reboot can recover all
+EPC, debugging and fixing this bug is appreciated.
diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index 7449ef33f081..26c0987153de 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -78,7 +78,7 @@ static struct sgx_epc_page *sgx_encl_eldu(struct sgx_encl_page
*encl_page,
 
        ret = __sgx_encl_eldu(encl_page, epc_page, secs_page);
        if (ret) {
-               sgx_free_epc_page(epc_page);
+               sgx_encl_free_epc_page(epc_page);
                return ERR_PTR(ret);
        }
 
@@ -404,7 +404,7 @@ void sgx_encl_release(struct kref *ref)
                        if (sgx_unmark_page_reclaimable(entry->epc_page))
                                continue;
 
-                       sgx_free_epc_page(entry->epc_page);
+                       sgx_encl_free_epc_page(entry->epc_page);
                        encl->secs_child_cnt--;
                        entry->epc_page = NULL;
                }
@@ -415,7 +415,7 @@ void sgx_encl_release(struct kref *ref)
        xa_destroy(&encl->page_array);
 
        if (!encl->secs_child_cnt && encl->secs.epc_page) {
-               sgx_free_epc_page(encl->secs.epc_page);
+               sgx_encl_free_epc_page(encl->secs.epc_page);
                encl->secs.epc_page = NULL;
        }
 
@@ -423,7 +423,7 @@ void sgx_encl_release(struct kref *ref)
                va_page = list_first_entry(&encl->va_pages, struct sgx_va_page,
                                           list);
                list_del(&va_page->list);
-               sgx_free_epc_page(va_page->epc_page);
+               sgx_encl_free_epc_page(va_page->epc_page);
                kfree(va_page);
        }
 
@@ -686,7 +686,7 @@ struct sgx_epc_page *sgx_alloc_va_page(void)
        ret = __epa(sgx_get_epc_virt_addr(epc_page));
        if (ret) {
                WARN_ONCE(1, "EPA returned %d (0x%x)", ret, ret);
-               sgx_free_epc_page(epc_page);
+               sgx_encl_free_epc_page(epc_page);
                return ERR_PTR(-EFAULT);
        }
 
@@ -735,3 +735,25 @@ bool sgx_va_page_full(struct sgx_va_page *va_page)
 
        return slot == SGX_VA_SLOT_COUNT;
 }
+
+/**
+ * sgx_encl_free_epc_page - free EPC page assigned to an enclave
+ * @page:      EPC page to be freed
+ *
+ * Free EPC page assigned to an enclave.  It does EREMOVE for the page, and
+ * only upon success, it puts the page back to free page list.  Otherwise, it
+ * gives a WARNING to indicate page is leaked, and require reboot to retrieve
+ * leaked pages.
+ */
+void sgx_encl_free_epc_page(struct sgx_epc_page *page)
+{
+       int ret;
+
+       WARN_ON_ONCE(page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED);
+
+       ret = __eremove(sgx_get_epc_virt_addr(page));
+       if (WARN_ONCE(ret, EREMOVE_ERROR_MESSAGE, ret, ret))
+               return;
+
+       sgx_free_epc_page(page);
+}
diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
index d8d30ccbef4c..6e74f85b6264 100644
--- a/arch/x86/kernel/cpu/sgx/encl.h
+++ b/arch/x86/kernel/cpu/sgx/encl.h
@@ -115,5 +115,6 @@ struct sgx_epc_page *sgx_alloc_va_page(void);
 unsigned int sgx_alloc_va_slot(struct sgx_va_page *va_page);
 void sgx_free_va_slot(struct sgx_va_page *va_page, unsigned int offset);
 bool sgx_va_page_full(struct sgx_va_page *va_page);
+void sgx_encl_free_epc_page(struct sgx_epc_page *page);
 
 #endif /* _X86_ENCL_H */
diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
index 90a5caf76939..772b9c648cf1 100644
--- a/arch/x86/kernel/cpu/sgx/ioctl.c
+++ b/arch/x86/kernel/cpu/sgx/ioctl.c
@@ -47,7 +47,7 @@ static void sgx_encl_shrink(struct sgx_encl *encl, struct sgx_va_page
*va_page)
        encl->page_cnt--;
 
        if (va_page) {
-               sgx_free_epc_page(va_page->epc_page);
+               sgx_encl_free_epc_page(va_page->epc_page);
                list_del(&va_page->list);
                kfree(va_page);
        }
@@ -117,7 +117,7 @@ static int sgx_encl_create(struct sgx_encl *encl, struct sgx_secs
*secs)
        return 0;
 
 err_out:
-       sgx_free_epc_page(encl->secs.epc_page);
+       sgx_encl_free_epc_page(encl->secs.epc_page);
        encl->secs.epc_page = NULL;
 
 err_out_backing:
@@ -365,7 +365,7 @@ static int sgx_encl_add_page(struct sgx_encl *encl, unsigned long src,
        mmap_read_unlock(current->mm);
 
 err_out_free:
-       sgx_free_epc_page(epc_page);
+       sgx_encl_free_epc_page(epc_page);
        kfree(encl_page);
 
        return ret;
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 5c9c5e5fb1fb..6a734f484aa7 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -294,7 +294,7 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
 
                sgx_encl_ewb(encl->secs.epc_page, &secs_backing);
 
-               sgx_free_epc_page(encl->secs.epc_page);
+               sgx_encl_free_epc_page(encl->secs.epc_page);
                encl->secs.epc_page = NULL;
 
                sgx_encl_put_backing(&secs_backing, true);
@@ -609,19 +609,15 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
  * sgx_free_epc_page() - Free an EPC page
  * @page:      an EPC page
  *
- * Call EREMOVE for an EPC page and insert it back to the list of free pages.
+ * Put the EPC page back to the list of free pages. It's the caller's
+ * responsibility to make sure that the page is in uninitialized state. In other
+ * words, do EREMOVE, EWB or whatever operation is necessary before calling
+ * this function.
  */
 void sgx_free_epc_page(struct sgx_epc_page *page)
 {
        struct sgx_epc_section *section = &sgx_epc_sections[page->section];
        struct sgx_numa_node *node = section->node;
-       int ret;
-
-       WARN_ON_ONCE(page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED);
-
-       ret = __eremove(sgx_get_epc_virt_addr(page));
-       if (WARN_ONCE(ret, "EREMOVE returned %d (0x%x)", ret, ret))
-               return;
 
        spin_lock(&node->lock);
 
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index 653af8ca1a25..a66614f94538 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -13,6 +13,10 @@
 #undef pr_fmt
 #define pr_fmt(fmt) "sgx: " fmt
 
+/* Error message for EREMOVE failure, when kernel is about to leak EPC page */
+#define EREMOVE_ERROR_MESSAGE \
+       "EREMOVE returned %d (0x%x), kernel bug likely.  EPC page leaked, SGX may become
unusuable.  Please refer to Documentation/x86/sgx.rst for more information."
+
 #define SGX_MAX_EPC_SECTIONS           8
 #define SGX_EEXTEND_BLOCK_SIZE         256
 #define SGX_NR_TO_SCAN                 16
Paolo Bonzini March 24, 2021, 10:09 a.m. UTC | #27
On 24/03/21 10:38, Kai Huang wrote:
> Hi Sean, Boris, Paolo,
> 
> Thanks for the discussion. I tried to digest all your conversations and
> hopefully I have understood you correctly. I pasted the new patch here
> (not full patch, but relevant part only). I modified the error msg, added
> some writeup to Documentation/x86/sgx.rst, and put Sean's explanation of this
> bug to the commit msg (per Paolo). I am terrible Documentation writer, so
> please help to check and give comments. Thanks!

I have some phrasing suggestions below but that was actually pretty good.

> ---
> commit 1e297a535bcb4f51a08343c40207520017d85efe (HEAD)
> Author: Kai Huang <kai.huang@intel.com>
> Date:   Wed Jan 20 03:40:53 2021 +0200
> 
>      x86/sgx: Wipe out EREMOVE from sgx_free_epc_page()
>      
>      EREMOVE takes a page and removes any association between that page and
>      an enclave.  It must be run on a page before it can be added into
>      another enclave.  Currently, EREMOVE is run as part of pages being freed
>      into the SGX page allocator.  It is not expected to fail.
>      
>      KVM does not track how guest pages are used, which means that SGX
>      virtualization use of EREMOVE might fail.  Specifically, it is
>      legitimate that EREMOVE returns SGX_CHILD_PRESENT for EPC assigned to
>      KVM guest, because KVM/kernel doesn't track SECS pages.
>
>      Break out the EREMOVE call from the SGX page allocator.  This will allow
>      the SGX virtualization code to use the allocator directly.  (SGX/KVM
>      will also introduce a more permissive EREMOVE helper).

Ok, I think I got the source of my confusion.  The part in parentheses
is the key.  It was not clear that KVM can deal with EREMOVE failures
*without printing the error*.  Good!

>      Implement original sgx_free_epc_page() as sgx_encl_free_epc_page() to be
>      more specific that it is used to free EPC page assigned host enclave.
>      Replace sgx_free_epc_page() with sgx_encl_free_epc_page() in all call
>      sites so there's no functional change.
>      
>      Improve error message when EREMOVE fails, and kernel is about to leak
>      EPC page, which is likely a kernel bug.  This is effectively a kernel
>      use-after-free of EPC, and due to the way SGX works, the bug is detected
>      at freeing.  Rather than add the page back to the pool of available EPC,
>      the kernel intentionally leaks the page to avoid additional errors in
>      the future.
>      
>      Also add documentation to explain to user what is the bug and suggest
>      user what to do when this bug happens, although extremely unlikely.

Rewritten:

EREMOVE takes a page and removes any association between that page and
an enclave.  It must be run on a page before it can be added into
another enclave.  Currently, EREMOVE is run as part of pages being freed
into the SGX page allocator.  It is not expected to fail, as it would
indicate a use-after-free of EPC.  Rather than add the page back to the
pool of available EPC, the kernel intentionally leaks the page to avoid
additional errors in the future.

However, KVM does not track how guest pages are used, which means that SGX
virtualization use of EREMOVE might fail.  Specifically, it is
legitimate that EREMOVE returns SGX_CHILD_PRESENT for EPC assigned to
KVM guest, because KVM/kernel doesn't track SECS pages.

To allow SGX/KVM to introduce a more permissive EREMOVE helper and to
let the SGX virtualization code use the allocator directly,
break out the EREMOVE call from the SGX page allocator.  Rename the
original sgx_free_epc_page() to sgx_encl_free_epc_page(),
indicating that it is used to free EPC page assigned host enclave.
Replace sgx_free_epc_page() with sgx_encl_free_epc_page() in all call
sites so there's no functional change.

At the same time improve error message when EREMOVE fails, and add
documentation to explain to user what is the bug and suggest user what
to do when this bug happens, although extremely unlikely.

> +Although extremely unlikely, EPC leaks can happen if kernel SGX bug happens,
> +when a WARNING with below message is shown in dmesg:

Remove "Although extremely unlikely".

> +"...EREMOVE returned ..., kernel bug likely.  EPC page leaked, SGX may become
> +unusuable.  Please refer to Documentation/x86/sgx.rst for more information."
> +
> +This is effectively a kernel use-after-free of EPC, and due to the way SGX
> +works, the bug is detected at freeing. Rather than add the page back to the pool
> +of available EPC, the kernel intentionally leaks the page to avoid additional
> +errors in the future.
> +
> +When this happens, kernel will likely soon leak majority of EPC pages, and SGX
> +will likely become unusable. However while this may be fatal to SGX, other
> +kernel functionalities are unlikely to be impacted, and should continue to work.
> +
> +As a result, when this happpens, user should stop running any new SGX workloads,
> +(or just any new workloads), and migrate all valuable workloads, for instance,
> +virtual machines, to other places.

Remove everything starting with "for instance".

  Although a machine reboot can recover all
> +EPC, debugging and fixing this bug is appreciated.

Replace the second part with "the bug should be reported to the Linux developers".
The poor user is not expected to debug SGX. ;)

> +/* Error message for EREMOVE failure, when kernel is about to leak EPC page */
> +#define EREMOVE_ERROR_MESSAGE \
> +       "EREMOVE returned %d (0x%x), kernel bug likely.  EPC page leaked, SGX may become
> unusuable.  Please refer to Documentation/x86/sgx.rst for more information."

Rewritten:

EREMOVE returned %d and an EPC page was leaked; SGX may become unusable.
This is a kernel bug, refer to Documentation/x86/sgx.rst for more information.

Also please split it across multiple lines.

Paolo
Huang, Kai March 24, 2021, 10:48 a.m. UTC | #28
On Wed, 24 Mar 2021 11:09:20 +0100 Paolo Bonzini wrote:
> On 24/03/21 10:38, Kai Huang wrote:
> > Hi Sean, Boris, Paolo,
> > 
> > Thanks for the discussion. I tried to digest all your conversations and
> > hopefully I have understood you correctly. I pasted the new patch here
> > (not full patch, but relevant part only). I modified the error msg, added
> > some writeup to Documentation/x86/sgx.rst, and put Sean's explanation of this
> > bug to the commit msg (per Paolo). I am terrible Documentation writer, so
> > please help to check and give comments. Thanks!
> 
> I have some phrasing suggestions below but that was actually pretty good.
> 
> > ---
> > commit 1e297a535bcb4f51a08343c40207520017d85efe (HEAD)
> > Author: Kai Huang <kai.huang@intel.com>
> > Date:   Wed Jan 20 03:40:53 2021 +0200
> > 
> >      x86/sgx: Wipe out EREMOVE from sgx_free_epc_page()
> >      
> >      EREMOVE takes a page and removes any association between that page and
> >      an enclave.  It must be run on a page before it can be added into
> >      another enclave.  Currently, EREMOVE is run as part of pages being freed
> >      into the SGX page allocator.  It is not expected to fail.
> >      
> >      KVM does not track how guest pages are used, which means that SGX
> >      virtualization use of EREMOVE might fail.  Specifically, it is
> >      legitimate that EREMOVE returns SGX_CHILD_PRESENT for EPC assigned to
> >      KVM guest, because KVM/kernel doesn't track SECS pages.
> >
> >      Break out the EREMOVE call from the SGX page allocator.  This will allow
> >      the SGX virtualization code to use the allocator directly.  (SGX/KVM
> >      will also introduce a more permissive EREMOVE helper).
> 
> Ok, I think I got the source of my confusion.  The part in parentheses
> is the key.  It was not clear that KVM can deal with EREMOVE failures
> *without printing the error*.  Good!

Yes the "will also introduce a more premissive EREMOVE helper" is done in patch
5 (x86/sgx: Introduce virtual EPC for use by KVM guests).

> 
> >      Implement original sgx_free_epc_page() as sgx_encl_free_epc_page() to be
> >      more specific that it is used to free EPC page assigned host enclave.
> >      Replace sgx_free_epc_page() with sgx_encl_free_epc_page() in all call
> >      sites so there's no functional change.
> >      
> >      Improve error message when EREMOVE fails, and kernel is about to leak
> >      EPC page, which is likely a kernel bug.  This is effectively a kernel
> >      use-after-free of EPC, and due to the way SGX works, the bug is detected
> >      at freeing.  Rather than add the page back to the pool of available EPC,
> >      the kernel intentionally leaks the page to avoid additional errors in
> >      the future.
> >      
> >      Also add documentation to explain to user what is the bug and suggest
> >      user what to do when this bug happens, although extremely unlikely.
> 
> Rewritten:
> 
> EREMOVE takes a page and removes any association between that page and
> an enclave.  It must be run on a page before it can be added into
> another enclave.  Currently, EREMOVE is run as part of pages being freed
> into the SGX page allocator.  It is not expected to fail, as it would
> indicate a use-after-free of EPC.  Rather than add the page back to the
> pool of available EPC, the kernel intentionally leaks the page to avoid
> additional errors in the future.
> 
> However, KVM does not track how guest pages are used, which means that SGX
> virtualization use of EREMOVE might fail.  Specifically, it is
> legitimate that EREMOVE returns SGX_CHILD_PRESENT for EPC assigned to
> KVM guest, because KVM/kernel doesn't track SECS pages.
> 
> To allow SGX/KVM to introduce a more permissive EREMOVE helper and to
> let the SGX virtualization code use the allocator directly,
> break out the EREMOVE call from the SGX page allocator.  Rename the
> original sgx_free_epc_page() to sgx_encl_free_epc_page(),
> indicating that it is used to free EPC page assigned host enclave.
> Replace sgx_free_epc_page() with sgx_encl_free_epc_page() in all call
> sites so there's no functional change.
> 
> At the same time improve error message when EREMOVE fails, and add
> documentation to explain to user what is the bug and suggest user what
> to do when this bug happens, although extremely unlikely.

Thanks :)

> 
> > +Although extremely unlikely, EPC leaks can happen if kernel SGX bug happens,
> > +when a WARNING with below message is shown in dmesg:
> 
> Remove "Although extremely unlikely".

Will do.

> 
> > +"...EREMOVE returned ..., kernel bug likely.  EPC page leaked, SGX may become
> > +unusuable.  Please refer to Documentation/x86/sgx.rst for more information."
> > +
> > +This is effectively a kernel use-after-free of EPC, and due to the way SGX
> > +works, the bug is detected at freeing. Rather than add the page back to the pool
> > +of available EPC, the kernel intentionally leaks the page to avoid additional
> > +errors in the future.
> > +
> > +When this happens, kernel will likely soon leak majority of EPC pages, and SGX
> > +will likely become unusable. However while this may be fatal to SGX, other
> > +kernel functionalities are unlikely to be impacted, and should continue to work.
> > +
> > +As a result, when this happpens, user should stop running any new SGX workloads,
> > +(or just any new workloads), and migrate all valuable workloads, for instance,
> > +virtual machines, to other places.
> 
> Remove everything starting with "for instance".

Will do.

> 
>   Although a machine reboot can recover all
> > +EPC, debugging and fixing this bug is appreciated.
> 
> Replace the second part with "the bug should be reported to the Linux developers".
> The poor user is not expected to debug SGX. ;)

Hmm.. Makes sense :)

> 
> > +/* Error message for EREMOVE failure, when kernel is about to leak EPC page */
> > +#define EREMOVE_ERROR_MESSAGE \
> > +       "EREMOVE returned %d (0x%x), kernel bug likely.  EPC page leaked, SGX may become
> > unusuable.  Please refer to Documentation/x86/sgx.rst for more information."
> 
> Rewritten:
> 
> EREMOVE returned %d and an EPC page was leaked; SGX may become unusable.
> This is a kernel bug, refer to Documentation/x86/sgx.rst for more information.

Fine to me, although this would have %d (0x%x) -> %d change in the code.

> 
> Also please split it across multiple lines.
> 
> Paolo
>
Paolo Bonzini March 24, 2021, 11:24 a.m. UTC | #29
On 24/03/21 11:48, Kai Huang wrote:
>>> +/* Error message for EREMOVE failure, when kernel is about to leak EPC page */
>>> +#define EREMOVE_ERROR_MESSAGE \
>>> +       "EREMOVE returned %d (0x%x), kernel bug likely.  EPC page leaked, SGX may become
>>> unusuable.  Please refer to Documentation/x86/sgx.rst for more information."
>> Rewritten:
>>
>> EREMOVE returned %d and an EPC page was leaked; SGX may become unusable.
>> This is a kernel bug, refer to Documentation/x86/sgx.rst for more information.
> Fine to me, although this would have %d (0x%x) -> %d change in the code.
> 

Yeah you can of course keep the 0x%x part.

Paolo
Huang, Kai March 24, 2021, 11:23 p.m. UTC | #30
> 
> > +/* Error message for EREMOVE failure, when kernel is about to leak EPC page */
> > +#define EREMOVE_ERROR_MESSAGE \
> > +       "EREMOVE returned %d (0x%x), kernel bug likely.  EPC page leaked, SGX may become
> > unusuable.  Please refer to Documentation/x86/sgx.rst for more information."
> 
> Rewritten:
> 
> EREMOVE returned %d and an EPC page was leaked; SGX may become unusable.
> This is a kernel bug, refer to Documentation/x86/sgx.rst for more information.
> 
> Also please split it across multiple lines.
> 
> Paolo
> 

Hi Boris/Paolo,

I changed to below (with slight modification on Paolo's):

/* Error message for EREMOVE failure, when kernel is about to leak EPC page */
#define EREMOVE_ERROR_MESSAGE \ 
        "EREMOVE returned %d (0x%x) and an EPC page was leaked.  SGX may become unusuable.  " \
        "This is likely a kernel bug.  Refer to Documentation/x86/sgx.rst for more information."

I got a checkpatch warning however:

WARNING: It's generally not useful to have the filename in the file
#60: FILE: Documentation/x86/sgx.rst:223:
+This is likely a kernel bug.  Refer to Documentation/x86/sgx.rst for more

I suppose it is OK? Since the error msg is actually hard-coded in the code,
and in this document, IMHO we should explicitly call out what error message user
is supposed to see, when this bug happens, so that user can absolutely know
he/she is dealing with this particular issue.
Paolo Bonzini March 24, 2021, 11:39 p.m. UTC | #31
On 25/03/21 00:23, Kai Huang wrote:
> I changed to below (with slight modification on Paolo's):
> 
> /* Error message for EREMOVE failure, when kernel is about to leak EPC page */
> #define EREMOVE_ERROR_MESSAGE \
>          "EREMOVE returned %d (0x%x) and an EPC page was leaked.  SGX may become unusuable.  " \
>          "This is likely a kernel bug.  Refer to Documentation/x86/sgx.rst for more information."
> 
> I got a checkpatch warning however:
> 
> WARNING: It's generally not useful to have the filename in the file
> #60: FILE: Documentation/x86/sgx.rst:223:
> +This is likely a kernel bug.  Refer to Documentation/x86/sgx.rst for more

Yeah, this is more or less a false positive.

Paolo
Huang, Kai March 24, 2021, 11:46 p.m. UTC | #32
On Thu, 25 Mar 2021 00:39:01 +0100 Paolo Bonzini wrote:
> On 25/03/21 00:23, Kai Huang wrote:
> > I changed to below (with slight modification on Paolo's):
> > 
> > /* Error message for EREMOVE failure, when kernel is about to leak EPC page */
> > #define EREMOVE_ERROR_MESSAGE \
> >          "EREMOVE returned %d (0x%x) and an EPC page was leaked.  SGX may become unusuable.  " \
> >          "This is likely a kernel bug.  Refer to Documentation/x86/sgx.rst for more information."
> > 
> > I got a checkpatch warning however:
> > 
> > WARNING: It's generally not useful to have the filename in the file
> > #60: FILE: Documentation/x86/sgx.rst:223:
> > +This is likely a kernel bug.  Refer to Documentation/x86/sgx.rst for more
> 
> Yeah, this is more or less a false positive.
> 
> Paolo
> 

Thanks.
Borislav Petkov March 25, 2021, 8:42 a.m. UTC | #33
... so you could send the final version of this patch as a reply to this
thread, now that everyone agrees, so that I can continue going through
the rest.

Thx.
Huang, Kai March 25, 2021, 9:38 a.m. UTC | #34
On Thu, 25 Mar 2021 09:42:41 +0100 Borislav Petkov wrote:
> ... so you could send the final version of this patch as a reply to this
> thread, now that everyone agrees, so that I can continue going through
> the rest.
> 

I have sent it by replying to this patch.

[PATCH v4 03/25] x86/sgx: Wipe out EREMOVE from sgx_free_epc_page()

Btw, with this patch being changed, I think there's a place in patch 5 should
also be changed. I have replied patch 5. Please take a look.

Thanks.
Borislav Petkov March 25, 2021, 4:52 p.m. UTC | #35
On Thu, Mar 25, 2021 at 10:38:13PM +1300, Kai Huang wrote:
> I have sent it by replying to this patch.
>
> [PATCH v4 03/25] x86/sgx: Wipe out EREMOVE from sgx_free_epc_page()

Thanks, I've committed the below.

> Btw, with this patch being changed, I think there's a place in patch 5 should
> also be changed. I have replied patch 5. Please take a look.

Ok, thx.

---
From: Kai Huang <kai.huang@intel.com>
Date: Thu, 25 Mar 2021 22:30:57 +1300
Subject: [PATCH] x86/sgx: Wipe out EREMOVE from sgx_free_epc_page()

EREMOVE takes a page and removes any association between that page and
an enclave. It must be run on a page before it can be added into another
enclave. Currently, EREMOVE is run as part of pages being freed into the
SGX page allocator. It is not expected to fail, as it would indicate a
use-after-free of EPC pages. Rather than add the page back to the pool
of available EPC pages, the kernel intentionally leaks the page to avoid
additional errors in the future.

However, KVM does not track how guest pages are used, which means that
SGX virtualization use of EREMOVE might fail. Specifically, it is
legitimate that EREMOVE returns SGX_CHILD_PRESENT for EPC assigned to
KVM guest, because KVM/kernel doesn't track SECS pages.

To allow SGX/KVM to introduce a more permissive EREMOVE helper and
to let the SGX virtualization code use the allocator directly, break
out the EREMOVE call from the SGX page allocator. Rename the original
sgx_free_epc_page() to sgx_encl_free_epc_page(), indicating that
it is used to free an EPC page assigned to a host enclave. Replace
sgx_free_epc_page() with sgx_encl_free_epc_page() in all call sites so
there's no functional change.

At the same time, improve the error message when EREMOVE fails, and
add documentation to explain to the user what that failure means and
to suggest to the user what to do when this bug happens in the case it
happens.

 [ bp: Massage commit message, fix typos and sanitize text, simplify. ]

Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210325093057.122834-1-kai.huang@intel.com
---
 Documentation/x86/sgx.rst       | 25 +++++++++++++++++++++++++
 arch/x86/kernel/cpu/sgx/encl.c  | 31 ++++++++++++++++++++++++++-----
 arch/x86/kernel/cpu/sgx/encl.h  |  1 +
 arch/x86/kernel/cpu/sgx/ioctl.c |  6 +++---
 arch/x86/kernel/cpu/sgx/main.c  | 14 +++++---------
 arch/x86/kernel/cpu/sgx/sgx.h   |  4 ++++
 6 files changed, 64 insertions(+), 17 deletions(-)

diff --git a/Documentation/x86/sgx.rst b/Documentation/x86/sgx.rst
index eaee1368b4fd..f90076e67cde 100644
--- a/Documentation/x86/sgx.rst
+++ b/Documentation/x86/sgx.rst
@@ -209,3 +209,28 @@ An application may be loaded into a container enclave which is specially
 configured with a library OS and run-time which permits the application to run.
 The enclave run-time and library OS work together to execute the application
 when a thread enters the enclave.
+
+Impact of Potential Kernel SGX Bugs
+===================================
+
+EPC leaks
+---------
+
+When EPC page leaks happen, a WARNING like this is shown in dmesg:
+
+"EREMOVE returned ... and an EPC page was leaked.  SGX may become unusable..."
+
+This is effectively a kernel use-after-free of an EPC page, and due
+to the way SGX works, the bug is detected at freeing. Rather than
+adding the page back to the pool of available EPC pages, the kernel
+intentionally leaks the page to avoid additional errors in the future.
+
+When this happens, the kernel will likely soon leak more EPC pages, and
+SGX will likely become unusable because the memory available to SGX is
+limited. However, while this may be fatal to SGX, the rest of the kernel
+is unlikely to be impacted and should continue to work.
+
+As a result, when this happpens, user should stop running any new
+SGX workloads, (or just any new workloads), and migrate all valuable
+workloads. Although a machine reboot can recover all EPC memory, the bug
+should be reported to Linux developers.
diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index 7449ef33f081..d25f2a245e1d 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -78,7 +78,7 @@ static struct sgx_epc_page *sgx_encl_eldu(struct sgx_encl_page *encl_page,
 
 	ret = __sgx_encl_eldu(encl_page, epc_page, secs_page);
 	if (ret) {
-		sgx_free_epc_page(epc_page);
+		sgx_encl_free_epc_page(epc_page);
 		return ERR_PTR(ret);
 	}
 
@@ -404,7 +404,7 @@ void sgx_encl_release(struct kref *ref)
 			if (sgx_unmark_page_reclaimable(entry->epc_page))
 				continue;
 
-			sgx_free_epc_page(entry->epc_page);
+			sgx_encl_free_epc_page(entry->epc_page);
 			encl->secs_child_cnt--;
 			entry->epc_page = NULL;
 		}
@@ -415,7 +415,7 @@ void sgx_encl_release(struct kref *ref)
 	xa_destroy(&encl->page_array);
 
 	if (!encl->secs_child_cnt && encl->secs.epc_page) {
-		sgx_free_epc_page(encl->secs.epc_page);
+		sgx_encl_free_epc_page(encl->secs.epc_page);
 		encl->secs.epc_page = NULL;
 	}
 
@@ -423,7 +423,7 @@ void sgx_encl_release(struct kref *ref)
 		va_page = list_first_entry(&encl->va_pages, struct sgx_va_page,
 					   list);
 		list_del(&va_page->list);
-		sgx_free_epc_page(va_page->epc_page);
+		sgx_encl_free_epc_page(va_page->epc_page);
 		kfree(va_page);
 	}
 
@@ -686,7 +686,7 @@ struct sgx_epc_page *sgx_alloc_va_page(void)
 	ret = __epa(sgx_get_epc_virt_addr(epc_page));
 	if (ret) {
 		WARN_ONCE(1, "EPA returned %d (0x%x)", ret, ret);
-		sgx_free_epc_page(epc_page);
+		sgx_encl_free_epc_page(epc_page);
 		return ERR_PTR(-EFAULT);
 	}
 
@@ -735,3 +735,24 @@ bool sgx_va_page_full(struct sgx_va_page *va_page)
 
 	return slot == SGX_VA_SLOT_COUNT;
 }
+
+/**
+ * sgx_encl_free_epc_page - free an EPC page assigned to an enclave
+ * @page:	EPC page to be freed
+ *
+ * Free an EPC page assigned to an enclave. It does EREMOVE for the page, and
+ * only upon success, it puts the page back to free page list.  Otherwise, it
+ * gives a WARNING to indicate page is leaked.
+ */
+void sgx_encl_free_epc_page(struct sgx_epc_page *page)
+{
+	int ret;
+
+	WARN_ON_ONCE(page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED);
+
+	ret = __eremove(sgx_get_epc_virt_addr(page));
+	if (WARN_ONCE(ret, EREMOVE_ERROR_MESSAGE, ret, ret))
+		return;
+
+	sgx_free_epc_page(page);
+}
diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
index d8d30ccbef4c..6e74f85b6264 100644
--- a/arch/x86/kernel/cpu/sgx/encl.h
+++ b/arch/x86/kernel/cpu/sgx/encl.h
@@ -115,5 +115,6 @@ struct sgx_epc_page *sgx_alloc_va_page(void);
 unsigned int sgx_alloc_va_slot(struct sgx_va_page *va_page);
 void sgx_free_va_slot(struct sgx_va_page *va_page, unsigned int offset);
 bool sgx_va_page_full(struct sgx_va_page *va_page);
+void sgx_encl_free_epc_page(struct sgx_epc_page *page);
 
 #endif /* _X86_ENCL_H */
diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
index 2e10367ea66c..354e309fcdb7 100644
--- a/arch/x86/kernel/cpu/sgx/ioctl.c
+++ b/arch/x86/kernel/cpu/sgx/ioctl.c
@@ -47,7 +47,7 @@ static void sgx_encl_shrink(struct sgx_encl *encl, struct sgx_va_page *va_page)
 	encl->page_cnt--;
 
 	if (va_page) {
-		sgx_free_epc_page(va_page->epc_page);
+		sgx_encl_free_epc_page(va_page->epc_page);
 		list_del(&va_page->list);
 		kfree(va_page);
 	}
@@ -117,7 +117,7 @@ static int sgx_encl_create(struct sgx_encl *encl, struct sgx_secs *secs)
 	return 0;
 
 err_out:
-	sgx_free_epc_page(encl->secs.epc_page);
+	sgx_encl_free_epc_page(encl->secs.epc_page);
 	encl->secs.epc_page = NULL;
 
 err_out_backing:
@@ -365,7 +365,7 @@ static int sgx_encl_add_page(struct sgx_encl *encl, unsigned long src,
 	mmap_read_unlock(current->mm);
 
 err_out_free:
-	sgx_free_epc_page(epc_page);
+	sgx_encl_free_epc_page(epc_page);
 	kfree(encl_page);
 
 	return ret;
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 13a7599ce7d4..b227629b1e9c 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -294,7 +294,7 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
 
 		sgx_encl_ewb(encl->secs.epc_page, &secs_backing);
 
-		sgx_free_epc_page(encl->secs.epc_page);
+		sgx_encl_free_epc_page(encl->secs.epc_page);
 		encl->secs.epc_page = NULL;
 
 		sgx_encl_put_backing(&secs_backing, true);
@@ -609,19 +609,15 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
  * sgx_free_epc_page() - Free an EPC page
  * @page:	an EPC page
  *
- * Call EREMOVE for an EPC page and insert it back to the list of free pages.
+ * Put the EPC page back to the list of free pages. It's the caller's
+ * responsibility to make sure that the page is in uninitialized state. In other
+ * words, do EREMOVE, EWB or whatever operation is necessary before calling
+ * this function.
  */
 void sgx_free_epc_page(struct sgx_epc_page *page)
 {
 	struct sgx_epc_section *section = &sgx_epc_sections[page->section];
 	struct sgx_numa_node *node = section->node;
-	int ret;
-
-	WARN_ON_ONCE(page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED);
-
-	ret = __eremove(sgx_get_epc_virt_addr(page));
-	if (WARN_ONCE(ret, "EREMOVE returned %d (0x%x)", ret, ret))
-		return;
 
 	spin_lock(&node->lock);
 
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index 653af8ca1a25..4aa40c627819 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -13,6 +13,10 @@
 #undef pr_fmt
 #define pr_fmt(fmt) "sgx: " fmt
 
+#define EREMOVE_ERROR_MESSAGE \
+	"EREMOVE returned %d (0x%x) and an EPC page was leaked. SGX may become unusable. " \
+	"Refer to Documentation/x86/sgx.rst for more information."
+
 #define SGX_MAX_EPC_SECTIONS		8
 #define SGX_EEXTEND_BLOCK_SIZE		256
 #define SGX_NR_TO_SCAN			16
diff mbox series

Patch

diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index 7449ef33f081..e0fb0f121616 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -78,7 +78,7 @@  static struct sgx_epc_page *sgx_encl_eldu(struct sgx_encl_page *encl_page,
 
 	ret = __sgx_encl_eldu(encl_page, epc_page, secs_page);
 	if (ret) {
-		sgx_free_epc_page(epc_page);
+		sgx_encl_free_epc_page(epc_page);
 		return ERR_PTR(ret);
 	}
 
@@ -404,7 +404,7 @@  void sgx_encl_release(struct kref *ref)
 			if (sgx_unmark_page_reclaimable(entry->epc_page))
 				continue;
 
-			sgx_free_epc_page(entry->epc_page);
+			sgx_encl_free_epc_page(entry->epc_page);
 			encl->secs_child_cnt--;
 			entry->epc_page = NULL;
 		}
@@ -415,7 +415,7 @@  void sgx_encl_release(struct kref *ref)
 	xa_destroy(&encl->page_array);
 
 	if (!encl->secs_child_cnt && encl->secs.epc_page) {
-		sgx_free_epc_page(encl->secs.epc_page);
+		sgx_encl_free_epc_page(encl->secs.epc_page);
 		encl->secs.epc_page = NULL;
 	}
 
@@ -423,7 +423,7 @@  void sgx_encl_release(struct kref *ref)
 		va_page = list_first_entry(&encl->va_pages, struct sgx_va_page,
 					   list);
 		list_del(&va_page->list);
-		sgx_free_epc_page(va_page->epc_page);
+		sgx_encl_free_epc_page(va_page->epc_page);
 		kfree(va_page);
 	}
 
@@ -686,7 +686,7 @@  struct sgx_epc_page *sgx_alloc_va_page(void)
 	ret = __epa(sgx_get_epc_virt_addr(epc_page));
 	if (ret) {
 		WARN_ONCE(1, "EPA returned %d (0x%x)", ret, ret);
-		sgx_free_epc_page(epc_page);
+		sgx_encl_free_epc_page(epc_page);
 		return ERR_PTR(-EFAULT);
 	}
 
@@ -735,3 +735,33 @@  bool sgx_va_page_full(struct sgx_va_page *va_page)
 
 	return slot == SGX_VA_SLOT_COUNT;
 }
+
+/**
+ * sgx_encl_free_epc_page - free EPC page assigned to an enclave
+ * @page:	EPC page to be freed
+ *
+ * Free EPC page assigned to an enclave.  It does EREMOVE for the page, and
+ * only upon success, it puts the page back to free page list.  Otherwise, it
+ * gives a WARNING to indicate page is leaked, and require reboot to retrieve
+ * leaked pages.
+ */
+void sgx_encl_free_epc_page(struct sgx_epc_page *page)
+{
+	int ret;
+
+	WARN_ON_ONCE(page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED);
+
+	/*
+	 * Give a message to remind EPC page is leaked when EREMOVE fails,
+	 * and requires machine reboot to get leaked pages back. This can
+	 * be improved in future by adding stats of leaked pages, etc.
+	 */
+#define EREMOVE_ERROR_MESSAGE \
+	"EREMOVE returned %d (0x%x).  EPC page leaked.  Reboot required to retrieve leaked pages."
+	ret = __eremove(sgx_get_epc_virt_addr(page));
+	if (WARN_ONCE(ret, EREMOVE_ERROR_MESSAGE, ret, ret))
+		return;
+#undef EREMOVE_ERROR_MESSAGE
+
+	sgx_free_epc_page(page);
+}
diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
index d8d30ccbef4c..6e74f85b6264 100644
--- a/arch/x86/kernel/cpu/sgx/encl.h
+++ b/arch/x86/kernel/cpu/sgx/encl.h
@@ -115,5 +115,6 @@  struct sgx_epc_page *sgx_alloc_va_page(void);
 unsigned int sgx_alloc_va_slot(struct sgx_va_page *va_page);
 void sgx_free_va_slot(struct sgx_va_page *va_page, unsigned int offset);
 bool sgx_va_page_full(struct sgx_va_page *va_page);
+void sgx_encl_free_epc_page(struct sgx_epc_page *page);
 
 #endif /* _X86_ENCL_H */
diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
index 90a5caf76939..772b9c648cf1 100644
--- a/arch/x86/kernel/cpu/sgx/ioctl.c
+++ b/arch/x86/kernel/cpu/sgx/ioctl.c
@@ -47,7 +47,7 @@  static void sgx_encl_shrink(struct sgx_encl *encl, struct sgx_va_page *va_page)
 	encl->page_cnt--;
 
 	if (va_page) {
-		sgx_free_epc_page(va_page->epc_page);
+		sgx_encl_free_epc_page(va_page->epc_page);
 		list_del(&va_page->list);
 		kfree(va_page);
 	}
@@ -117,7 +117,7 @@  static int sgx_encl_create(struct sgx_encl *encl, struct sgx_secs *secs)
 	return 0;
 
 err_out:
-	sgx_free_epc_page(encl->secs.epc_page);
+	sgx_encl_free_epc_page(encl->secs.epc_page);
 	encl->secs.epc_page = NULL;
 
 err_out_backing:
@@ -365,7 +365,7 @@  static int sgx_encl_add_page(struct sgx_encl *encl, unsigned long src,
 	mmap_read_unlock(current->mm);
 
 err_out_free:
-	sgx_free_epc_page(epc_page);
+	sgx_encl_free_epc_page(epc_page);
 	kfree(encl_page);
 
 	return ret;
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 5c9c5e5fb1fb..6a734f484aa7 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -294,7 +294,7 @@  static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
 
 		sgx_encl_ewb(encl->secs.epc_page, &secs_backing);
 
-		sgx_free_epc_page(encl->secs.epc_page);
+		sgx_encl_free_epc_page(encl->secs.epc_page);
 		encl->secs.epc_page = NULL;
 
 		sgx_encl_put_backing(&secs_backing, true);
@@ -609,19 +609,15 @@  struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
  * sgx_free_epc_page() - Free an EPC page
  * @page:	an EPC page
  *
- * Call EREMOVE for an EPC page and insert it back to the list of free pages.
+ * Put the EPC page back to the list of free pages. It's the caller's
+ * responsibility to make sure that the page is in uninitialized state. In other
+ * words, do EREMOVE, EWB or whatever operation is necessary before calling
+ * this function.
  */
 void sgx_free_epc_page(struct sgx_epc_page *page)
 {
 	struct sgx_epc_section *section = &sgx_epc_sections[page->section];
 	struct sgx_numa_node *node = section->node;
-	int ret;
-
-	WARN_ON_ONCE(page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED);
-
-	ret = __eremove(sgx_get_epc_virt_addr(page));
-	if (WARN_ONCE(ret, "EREMOVE returned %d (0x%x)", ret, ret))
-		return;
 
 	spin_lock(&node->lock);