diff mbox series

[1/2] x86: sgx_vepc: extract sgx_vepc_remove_page

Message ID 20210913131153.1202354-2-pbonzini@redhat.com (mailing list archive)
State New, archived
Headers show
Series x86: sgx_vepc: implement ioctl to EREMOVE all pages | expand

Commit Message

Paolo Bonzini Sept. 13, 2021, 1:11 p.m. UTC
Windows expects all pages to be in uninitialized state on startup.
In order to implement this, we will need a ioctl that performs
EREMOVE on all pages mapped by a /dev/sgx_vepc file descriptor:
other possibilities, such as closing and reopening the device,
are racy.

Start the implementation by pulling the EREMOVE into a separate
function.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kernel/cpu/sgx/virt.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

Comments

Dave Hansen Sept. 13, 2021, 2:05 p.m. UTC | #1
On 9/13/21 6:11 AM, Paolo Bonzini wrote:
> Windows expects all pages to be in uninitialized state on startup.
> In order to implement this, we will need a ioctl that performs
> EREMOVE on all pages mapped by a /dev/sgx_vepc file descriptor:
> other possibilities, such as closing and reopening the device,
> are racy.

Hi Paolo,

How does this end up happening in the first place?

All enclave pages should start out on 'sgx_dirty_page_list' and
ksgxd sanitizes them with EREMOVE before making them available.  That
should cover EREMOVE after reboots while SGX pages are initialized,
including kexec().

sgx_vepc_free_page() should do the same for pages that a guest not not
clean up properly.

sgx_encl_free_epc_page() does an EREMOVE after a normal enclave has used
a page.

Those are the only three cases that I can think of.  So, it sounds like
one of those is buggy, or there's another unexpected path out there.
Ultimately, I think it would be really handy if we could do this EREMOVE
implicitly and without any new ABI.
Paolo Bonzini Sept. 13, 2021, 2:24 p.m. UTC | #2
On 13/09/21 16:05, Dave Hansen wrote:
> On 9/13/21 6:11 AM, Paolo Bonzini wrote:
>> Windows expects all pages to be in uninitialized state on startup.
>> In order to implement this, we will need a ioctl that performs
>> EREMOVE on all pages mapped by a /dev/sgx_vepc file descriptor:
>> other possibilities, such as closing and reopening the device,
>> are racy.
> 
> Hi Paolo,
> 
> How does this end up happening in the first place?
> 
> All enclave pages should start out on 'sgx_dirty_page_list' and
> ksgxd sanitizes them with EREMOVE before making them available.  That
> should cover EREMOVE after reboots while SGX pages are initialized,
> including kexec().

By "Windows startup" I mean even after guest reboot.  Because another 
process could sneak in and steal your EPC pages between a close() and an 
open(), I'd like to have a way to EREMOVE the pages while keeping them 
assigned to the specific vEPC instance, i.e. *without* going through 
sgx_vepc_free_page().

Thanks,

Paolo

> sgx_vepc_free_page() should do the same for pages that a guest not not
> clean up properly.
> 
> sgx_encl_free_epc_page() does an EREMOVE after a normal enclave has used
> a page.
> 
> Those are the only three cases that I can think of.  So, it sounds like
> one of those is buggy, or there's another unexpected path out there.
> Ultimately, I think it would be really handy if we could do this EREMOVE
> implicitly and without any new ABI.
>
Dave Hansen Sept. 13, 2021, 2:55 p.m. UTC | #3
On 9/13/21 7:24 AM, Paolo Bonzini wrote:
>> How does this end up happening in the first place?
>>
>> All enclave pages should start out on 'sgx_dirty_page_list' and
>> ksgxd sanitizes them with EREMOVE before making them available.  That
>> should cover EREMOVE after reboots while SGX pages are initialized,
>> including kexec().
> 
> By "Windows startup" I mean even after guest reboot.  Because another
> process could sneak in and steal your EPC pages between a close() and an
> open(), I'd like to have a way to EREMOVE the pages while keeping them
> assigned to the specific vEPC instance, i.e. *without* going through
> sgx_vepc_free_page().

Oh, so you want fresh EPC state for the guest, but you're concerned that
the previous guest might have left them in a bad state.  The current
method of getting a new vepc instance (which guarantees fresh state) has
some other downsides.

Can't another process steal pages via sgxd and reclaim at any time?
What's the extra concern here about going through a close()/open()
cycle?  Performance?
Paolo Bonzini Sept. 13, 2021, 3:14 p.m. UTC | #4
On 13/09/21 16:55, Dave Hansen wrote:
>> By "Windows startup" I mean even after guest reboot.  Because another
>> process could sneak in and steal your EPC pages between a close() and an
>> open(), I'd like to have a way to EREMOVE the pages while keeping them
>> assigned to the specific vEPC instance, i.e.*without*  going through
>> sgx_vepc_free_page().
> Oh, so you want fresh EPC state for the guest, but you're concerned that
> the previous guest might have left them in a bad state.  The current
> method of getting a new vepc instance (which guarantees fresh state) has
> some other downsides.
> 
> Can't another process steal pages via sgxd and reclaim at any time?

vEPC pages never call sgx_mark_page_reclaimable, don't they?

> What's the extra concern here about going through a close()/open()
> cycle?  Performance?

Apart from reclaiming, /dev/sgx_vepc might disappear between the first 
open() and subsequent ones.

Paolo
Dave Hansen Sept. 13, 2021, 3:29 p.m. UTC | #5
On 9/13/21 8:14 AM, Paolo Bonzini wrote:
> On 13/09/21 16:55, Dave Hansen wrote:
>>> By "Windows startup" I mean even after guest reboot.  Because another
>>> process could sneak in and steal your EPC pages between a close() and an
>>> open(), I'd like to have a way to EREMOVE the pages while keeping them
>>> assigned to the specific vEPC instance, i.e.*without*  going through
>>> sgx_vepc_free_page().
>> Oh, so you want fresh EPC state for the guest, but you're concerned that
>> the previous guest might have left them in a bad state.  The current
>> method of getting a new vepc instance (which guarantees fresh state) has
>> some other downsides.
>>
>> Can't another process steal pages via sgxd and reclaim at any time?
> 
> vEPC pages never call sgx_mark_page_reclaimable, don't they?

Oh, I was just looking that they were on the SGX LRU.  You might be right.

But, we certainly don't want the fact that they are unreclaimable today
to be part of the ABI.  It's more of a bug than a feature.

>> What's the extra concern here about going through a close()/open()
>> cycle?  Performance?
> 
> Apart from reclaiming, /dev/sgx_vepc might disappear between the first
> open() and subsequent ones.

Aside from it being rm'd, I don't think that's possible.
Paolo Bonzini Sept. 13, 2021, 6:35 p.m. UTC | #6
On 13/09/21 17:29, Dave Hansen wrote:
> On 9/13/21 8:14 AM, Paolo Bonzini wrote:
>> On 13/09/21 16:55, Dave Hansen wrote:
>>>> By "Windows startup" I mean even after guest reboot.  Because another
>>>> process could sneak in and steal your EPC pages between a close() and an
>>>> open(), I'd like to have a way to EREMOVE the pages while keeping them
>>>> assigned to the specific vEPC instance, i.e.*without*  going through
>>>> sgx_vepc_free_page().
>>> Oh, so you want fresh EPC state for the guest, but you're concerned that
>>> the previous guest might have left them in a bad state.  The current
>>> method of getting a new vepc instance (which guarantees fresh state) has
>>> some other downsides.
>>>
>>> Can't another process steal pages via sgxd and reclaim at any time?
>>
>> vEPC pages never call sgx_mark_page_reclaimable, don't they?
> 
> Oh, I was just looking that they were on the SGX LRU.  You might be right.
> But, we certainly don't want the fact that they are unreclaimable today
> to be part of the ABI.  It's more of a bug than a feature.

Sure, that's fine.

>>> What's the extra concern here about going through a close()/open()
>>> cycle?  Performance?
>>
>> Apart from reclaiming, /dev/sgx_vepc might disappear between the first
>> open() and subsequent ones.
> 
> Aside from it being rm'd, I don't think that's possible.
> 

Being rm'd would be a possibility in principle, and it would be ugly for 
it to cause issues on running VMs.  Also I'd like for it to be able to 
pass /dev/sgx_vepc in via a file descriptor, and run QEMU in a chroot or 
a mount namespace.  Alternatively, with seccomp it may be possible to 
sandbox a running QEMU process in such a way that open() is forbidden at 
runtime (all hotplug is done via file descriptor passing); it is not yet 
possible, but it is a goal.

Paolo
Dave Hansen Sept. 13, 2021, 7:25 p.m. UTC | #7
On 9/13/21 11:35 AM, Paolo Bonzini wrote:
>>> Apart from reclaiming, /dev/sgx_vepc might disappear between the first
>>> open() and subsequent ones.
>>
>> Aside from it being rm'd, I don't think that's possible.
>>
> 
> Being rm'd would be a possibility in principle, and it would be ugly for
> it to cause issues on running VMs.  Also I'd like for it to be able to
> pass /dev/sgx_vepc in via a file descriptor, and run QEMU in a chroot or
> a mount namespace.  Alternatively, with seccomp it may be possible to
> sandbox a running QEMU process in such a way that open() is forbidden at
> runtime (all hotplug is done via file descriptor passing); it is not yet
> possible, but it is a goal.

OK, so maybe another way of saying this:

For bare-metal SGX on real hardware, the hardware provides guarantees
SGX state at reboot.  For instance, all pages start out uninitialized.
The vepc driver provides a similar guarantee today for freshly-opened
vepc instances.

But, vepc users have a problem: they might want to run an OS that
expects to be booted with clean, fully uninitialized SGX state, just as
it would be on bare-metal.  Right now, the only way to get that is to
create a new vepc instance.  That might not be possible in all
configurations, like if the permission to open(/dev/sgx_vepc) has been
lost since the VM was first booted.

Windows has these expectations about booting with fully uninitialized
state.  There are also a number of environments where QEMU is sandboxed
or drops permissions in a way that prevent free and open access to
/dev/sgx_vepc.

So good so far?
Jarkko Sakkinen Sept. 13, 2021, 8:33 p.m. UTC | #8
On Mon, 2021-09-13 at 09:11 -0400, Paolo Bonzini wrote:
> Windows expects all pages to be in uninitialized state on startup.
> In order to implement this, we will need a ioctl that performs
> EREMOVE on all pages mapped by a /dev/sgx_vepc file descriptor:
> other possibilities, such as closing and reopening the device,
> are racy.

So what makes it racy, and what do mean by racy in this case?

I.e. you have to do open() and mmap(), and munmap() + close()
for removal. Depending on situation that is racy or not...

And is "Windows" more precisely a "Windows guest running in
Linux QEMU host"? It's ambiguous..

/Jarkko
Jarkko Sakkinen Sept. 13, 2021, 9 p.m. UTC | #9
On Mon, 2021-09-13 at 16:24 +0200, Paolo Bonzini wrote:
> On 13/09/21 16:05, Dave Hansen wrote:
> > On 9/13/21 6:11 AM, Paolo Bonzini wrote:
> > > Windows expects all pages to be in uninitialized state on startup.
> > > In order to implement this, we will need a ioctl that performs
> > > EREMOVE on all pages mapped by a /dev/sgx_vepc file descriptor:
> > > other possibilities, such as closing and reopening the device,
> > > are racy.
> > 
> > Hi Paolo,
> > 
> > How does this end up happening in the first place?
> > 
> > All enclave pages should start out on 'sgx_dirty_page_list' and
> > ksgxd sanitizes them with EREMOVE before making them available.  That
> > should cover EREMOVE after reboots while SGX pages are initialized,
> > including kexec().
> 
> By "Windows startup" I mean even after guest reboot.  Because another 
> process could sneak in and steal your EPC pages between a close() and an 
> open(), I'd like to have a way to EREMOVE the pages while keeping them 
> assigned to the specific vEPC instance, i.e. *without* going through 
> sgx_vepc_free_page().

Isn't "other process in and steal your EPC pages" more like sysadmin
problem, rather than software?

I'm lacking of understanding what would be the collateral damage in
the end.

/Jarkko
Jarkko Sakkinen Sept. 13, 2021, 9:12 p.m. UTC | #10
On Mon, 2021-09-13 at 07:55 -0700, Dave Hansen wrote:
> On 9/13/21 7:24 AM, Paolo Bonzini wrote:
> > > How does this end up happening in the first place?
> > > 
> > > All enclave pages should start out on 'sgx_dirty_page_list' and
> > > ksgxd sanitizes them with EREMOVE before making them available.  That
> > > should cover EREMOVE after reboots while SGX pages are initialized,
> > > including kexec().
> > 
> > By "Windows startup" I mean even after guest reboot.  Because another
> > process could sneak in and steal your EPC pages between a close() and an
> > open(), I'd like to have a way to EREMOVE the pages while keeping them
> > assigned to the specific vEPC instance, i.e. *without* going through
> > sgx_vepc_free_page().
> 
> Oh, so you want fresh EPC state for the guest, but you're concerned that
> the previous guest might have left them in a bad state.  The current
> method of getting a new vepc instance (which guarantees fresh state) has
> some other downsides.
> 
> Can't another process steal pages via sgxd and reclaim at any time?
> What's the extra concern here about going through a close()/open()
> cycle?  Performance?

sgxd does not steal anything from vepc regions.

They are not part of the reclaiming process.

/Jarkko
Jarkko Sakkinen Sept. 13, 2021, 9:13 p.m. UTC | #11
On Mon, 2021-09-13 at 17:14 +0200, Paolo Bonzini wrote:
> On 13/09/21 16:55, Dave Hansen wrote:
> > > By "Windows startup" I mean even after guest reboot.  Because another
> > > process could sneak in and steal your EPC pages between a close() and an
> > > open(), I'd like to have a way to EREMOVE the pages while keeping them
> > > assigned to the specific vEPC instance, i.e.*without*  going through
> > > sgx_vepc_free_page().
> > Oh, so you want fresh EPC state for the guest, but you're concerned that
> > the previous guest might have left them in a bad state.  The current
> > method of getting a new vepc instance (which guarantees fresh state) has
> > some other downsides.
> > 
> > Can't another process steal pages via sgxd and reclaim at any time?
> 
> vEPC pages never call sgx_mark_page_reclaimable, don't they?
> 
> > What's the extra concern here about going through a close()/open()
> > cycle?  Performance?
> 
> Apart from reclaiming, /dev/sgx_vepc might disappear between the first 
> open() and subsequent ones.

If /dev/sgx_vepc dissapears, why is it a problem *for the software*, and
not a sysadmin problem?

I think that this is what the whole patch is lacking, why are we talking
about a software problem...

/Jarkko
Jarkko Sakkinen Sept. 13, 2021, 9:15 p.m. UTC | #12
On Mon, 2021-09-13 at 20:35 +0200, Paolo Bonzini wrote:
> On 13/09/21 17:29, Dave Hansen wrote:
> > On 9/13/21 8:14 AM, Paolo Bonzini wrote:
> > > On 13/09/21 16:55, Dave Hansen wrote:
> > > > > By "Windows startup" I mean even after guest reboot.  Because another
> > > > > process could sneak in and steal your EPC pages between a close() and an
> > > > > open(), I'd like to have a way to EREMOVE the pages while keeping them
> > > > > assigned to the specific vEPC instance, i.e.*without*  going through
> > > > > sgx_vepc_free_page().
> > > > Oh, so you want fresh EPC state for the guest, but you're concerned that
> > > > the previous guest might have left them in a bad state.  The current
> > > > method of getting a new vepc instance (which guarantees fresh state) has
> > > > some other downsides.
> > > > 
> > > > Can't another process steal pages via sgxd and reclaim at any time?
> > > 
> > > vEPC pages never call sgx_mark_page_reclaimable, don't they?
> > 
> > Oh, I was just looking that they were on the SGX LRU.  You might be right.
> > But, we certainly don't want the fact that they are unreclaimable today
> > to be part of the ABI.  It's more of a bug than a feature.
> 
> Sure, that's fine.
> 
> > > > What's the extra concern here about going through a close()/open()
> > > > cycle?  Performance?
> > > 
> > > Apart from reclaiming, /dev/sgx_vepc might disappear between the first
> > > open() and subsequent ones.
> > 
> > Aside from it being rm'd, I don't think that's possible.
> > 
> 
> Being rm'd would be a possibility in principle, and it would be ugly for 
> it to cause issues on running VMs.  Also I'd like for it to be able to 
> pass /dev/sgx_vepc in via a file descriptor, and run QEMU in a chroot or 
> a mount namespace.  Alternatively, with seccomp it may be possible to 
> sandbox a running QEMU process in such a way that open() is forbidden at 
> runtime (all hotplug is done via file descriptor passing); it is not yet 
> possible, but it is a goal.

AFAIK, as long you have open files for a device, they work
as expected.

/Jarkko
Jarkko Sakkinen Sept. 13, 2021, 9:16 p.m. UTC | #13
On Mon, 2021-09-13 at 12:25 -0700, Dave Hansen wrote:
> On 9/13/21 11:35 AM, Paolo Bonzini wrote:
> > > > Apart from reclaiming, /dev/sgx_vepc might disappear between the first
> > > > open() and subsequent ones.
> > > 
> > > Aside from it being rm'd, I don't think that's possible.
> > > 
> > 
> > Being rm'd would be a possibility in principle, and it would be ugly for
> > it to cause issues on running VMs.  Also I'd like for it to be able to
> > pass /dev/sgx_vepc in via a file descriptor, and run QEMU in a chroot or
> > a mount namespace.  Alternatively, with seccomp it may be possible to
> > sandbox a running QEMU process in such a way that open() is forbidden at
> > runtime (all hotplug is done via file descriptor passing); it is not yet
> > possible, but it is a goal.
> 
> OK, so maybe another way of saying this:
> 
> For bare-metal SGX on real hardware, the hardware provides guarantees
> SGX state at reboot.  For instance, all pages start out uninitialized.
> The vepc driver provides a similar guarantee today for freshly-opened
> vepc instances.
> 
> But, vepc users have a problem: they might want to run an OS that
> expects to be booted with clean, fully uninitialized SGX state, just as
> it would be on bare-metal.  Right now, the only way to get that is to
> create a new vepc instance.  That might not be possible in all
> configurations, like if the permission to open(/dev/sgx_vepc) has been
> lost since the VM was first booted.

So you maintain your systems in a way that this does not happen?

/Jarkko
Paolo Bonzini Sept. 14, 2021, 5:36 a.m. UTC | #14
On 13/09/21 23:13, Jarkko Sakkinen wrote:
>> Apart from reclaiming, /dev/sgx_vepc might disappear between the first
>> open() and subsequent ones.
>
> If /dev/sgx_vepc disappears, why is it a problem *for the software*, and
> not a sysadmin problem?

Rather than disappearing, it could be that a program first gets all the 
resources it needs before it gets malicious input, and then enter a 
restrictive sandbox.  In this case open() could be completely forbidden.

I will improve the documentation and changelogs when I post the non-RFC 
version; that could have been done better, sorry.

Paolo
Jarkko Sakkinen Sept. 14, 2021, 4:05 p.m. UTC | #15
On Tue, 2021-09-14 at 07:36 +0200, Paolo Bonzini wrote:
> On 13/09/21 23:13, Jarkko Sakkinen wrote:
> > > Apart from reclaiming, /dev/sgx_vepc might disappear between the first
> > > open() and subsequent ones.
> > 
> > If /dev/sgx_vepc disappears, why is it a problem *for the software*, and
> > not a sysadmin problem?
> 
> Rather than disappearing, it could be that a program first gets all the 
> resources it needs before it gets malicious input, and then enter a 
> restrictive sandbox.  In this case open() could be completely forbidden.
> 
> I will improve the documentation and changelogs when I post the non-RFC 
> version; that could have been done better, sorry.
> 

No worries, just trying to get bottom of the actual issue.

/Jarkko
diff mbox series

Patch

diff --git a/arch/x86/kernel/cpu/sgx/virt.c b/arch/x86/kernel/cpu/sgx/virt.c
index 64511c4a5200..59b9c13121cd 100644
--- a/arch/x86/kernel/cpu/sgx/virt.c
+++ b/arch/x86/kernel/cpu/sgx/virt.c
@@ -111,7 +111,7 @@  static int sgx_vepc_mmap(struct file *file, struct vm_area_struct *vma)
 	return 0;
 }
 
-static int sgx_vepc_free_page(struct sgx_epc_page *epc_page)
+static int sgx_vepc_remove_page(struct sgx_epc_page *epc_page)
 {
 	int ret;
 
@@ -140,11 +140,17 @@  static int sgx_vepc_free_page(struct sgx_epc_page *epc_page)
 		 */
 		WARN_ONCE(ret != SGX_CHILD_PRESENT, EREMOVE_ERROR_MESSAGE,
 			  ret, ret);
-		return ret;
 	}
+	return ret;
+}
 
-	sgx_free_epc_page(epc_page);
+static int sgx_vepc_free_page(struct sgx_epc_page *epc_page)
+{
+	int ret = sgx_vepc_remove_page(epc_page);
+	if (ret)
+		return ret;
 
+	sgx_free_epc_page(epc_page);
 	return 0;
 }