
[2/2] x86: sgx_vepc: implement SGX_IOC_VEPC_REMOVE ioctl

Message ID 20210913131153.1202354-3-pbonzini@redhat.com (mailing list archive)
State New, archived
Series x86: sgx_vepc: implement ioctl to EREMOVE all pages

Commit Message

Paolo Bonzini Sept. 13, 2021, 1:11 p.m. UTC
Windows expects all pages to be in the uninitialized state on startup.
Add an ioctl that does this with EREMOVE, so that userspace can bring
the pages back to this state when resetting the VM as well.
Pure userspace implementations, such as closing and reopening the device,
are racy.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/include/uapi/asm/sgx.h |  2 ++
 arch/x86/kernel/cpu/sgx/virt.c  | 36 +++++++++++++++++++++++++++++++++
 2 files changed, 38 insertions(+)

Comments

Dave Hansen Sept. 13, 2021, 7:33 p.m. UTC | #1
On 9/13/21 6:11 AM, Paolo Bonzini wrote:
> +static long sgx_vepc_remove_all(struct sgx_vepc *vepc)
> +{
> +	struct sgx_epc_page *entry;
> +	unsigned long index;
> +	long failures = 0;
> +
> +	xa_for_each(&vepc->page_array, index, entry)
> +		if (sgx_vepc_remove_page(entry))
> +			failures++;
> +
> +	/*
> +	 * Return the number of pages that failed to be removed, so
> +	 * userspace knows that there are still SECS pages lying
> +	 * around.
> +	 */
> +	return failures;
> +}

I'm not sure the retry logic should be in userspace.  Also, is this
strictly limited to SECS pages?  It could also happen if there were
enclaves running that used the page.  Granted, userspace can probably
stop all the vcpus, but the *interface* doesn't prevent it being called
like that.

What else can userspace do but:

	ret = ioctl(fd, SGX_IOC_VEPC_REMOVE);
	if (ret)
		ret = ioctl(fd, SGX_IOC_VEPC_REMOVE);
	if (ret)
		printf("uh oh\n");

We already have existing code to gather up the pages that couldn't be
EREMOVE'd and selectively EREMOVE them again.  Why not reuse that code
here?  If there is 100GB of EPC, it's gotta be painful to run through
the ->page_array twice when once plus a small list iteration will do.

Which reminds me...  Do we need a cond_resched() in there or anything?
That loop could theoretically get really, really long.
Sean Christopherson Sept. 13, 2021, 9:11 p.m. UTC | #2
On Mon, Sep 13, 2021, Dave Hansen wrote:
> On 9/13/21 6:11 AM, Paolo Bonzini wrote:
> > +static long sgx_vepc_remove_all(struct sgx_vepc *vepc)
> > +{
> > +	struct sgx_epc_page *entry;
> > +	unsigned long index;
> > +	long failures = 0;
> > +
> > +	xa_for_each(&vepc->page_array, index, entry)
> > +		if (sgx_vepc_remove_page(entry))
> > +			failures++;
> > +
> > +	/*
> > +	 * Return the number of pages that failed to be removed, so
> > +	 * userspace knows that there are still SECS pages lying
> > +	 * around.
> > +	 */
> > +	return failures;
> > +}
> 
> I'm not sure the retry logic should be in userspace.  Also, is this
> strictly limited to SECS pages?  It could also happen if there were
> enclaves running that used the page.  Granted, userspace can probably
> stop all the vcpus, but the *interface* doesn't prevent it being called
> like that.

The general rule for KVM is that so long as userspace abuse of running vCPUs (or
other concurrent operations) doesn't put the kernel/platform at risk, it's
userspace's responsibility to not screw up.  The main argument being that there
are myriad ways the VMM can DoS the guest without having to abuse an ioctl().

> What else can userspace do but:
> 
> 	ret = ioctl(fd, SGX_IOC_VEPC_REMOVE);
> 	if (ret)
> 		ret = ioctl(fd, SGX_IOC_VEPC_REMOVE);
> 	if (ret)
> 		printf("uh oh\n");
> 
> We already have existing code to gather up the pages that couldn't be
> EREMOVE'd and selectively EREMOVE them again.  Why not reuse that code
> here?  If there is 100GB of EPC, it's gotta be painful to run through
> the ->page_array twice when once plus a small list iteration will do.

My argument against handling this fully in the kernel is that to handle a vNUMA
setup with multiple vEPC sections, the ioctl() would need to take a set of file
descriptors to handle the case where an SECS is pinned by a child page in a
different vEPC.  It would also require an extra list_head per page (or dynamic
allocations), as the list_head in sgx_epc_page will (likely, eventually) be needed
to handle EPC OOM.  In the teardown case, sgx_epc_page.list can be used because
the pages are taken off the active/allocated list as part of teardown.

Neither issue is the end of the world, but IMO it's not worth burning the memory
and taking on extra complexity in the kernel for a relatively rare operation that's
slow as dirt anyways.

> Which reminds me...  Do we need a cond_resched() in there or anything?
> That loop could theoretically get really, really long.
Dave Hansen Sept. 13, 2021, 10:43 p.m. UTC | #3
On 9/13/21 2:11 PM, Sean Christopherson wrote:
> My argument against handling this fully in the kernel is that to handle a vNUMA
> setup with multiple vEPC sections, the ioctl() would need to take a set of file
> descriptors to handle the case where an SECS is pinned by a child page in a
> different vEPC.

Bah, I'm always forgetting about the multiple vepc fd's case.

I completely agree that there's no sane way to do this with a per-vepc
ioctl() when the EREMOVE failures can originate from other vepc instances.

The only other possible thing would be to keep an mm_list for vepc
instances and have this ioctl() (or another interface) blast them all.
But that's going to be a heck of a lot more complicated than this is.

OK... you two are wearing me down on this one.

Let's just get this all documented in the changelogs, especially the
retry behavior.
Huang, Kai Sept. 14, 2021, 10:55 a.m. UTC | #4
On Mon, 13 Sep 2021 09:11:53 -0400 Paolo Bonzini wrote:
> Windows expects all pages to be in the uninitialized state on startup.
> Add an ioctl that does this with EREMOVE, so that userspace can bring
> the pages back to this state when resetting the VM as well.
> Pure userspace implementations, such as closing and reopening the device,
> are racy.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  arch/x86/include/uapi/asm/sgx.h |  2 ++
>  arch/x86/kernel/cpu/sgx/virt.c  | 36 +++++++++++++++++++++++++++++++++
>  2 files changed, 38 insertions(+)
> 
> diff --git a/arch/x86/include/uapi/asm/sgx.h b/arch/x86/include/uapi/asm/sgx.h
> index 9690d6899ad9..f79d84ce8033 100644
> --- a/arch/x86/include/uapi/asm/sgx.h
> +++ b/arch/x86/include/uapi/asm/sgx.h
> @@ -27,6 +27,8 @@ enum sgx_page_flags {
>  	_IOW(SGX_MAGIC, 0x02, struct sgx_enclave_init)
>  #define SGX_IOC_ENCLAVE_PROVISION \
>  	_IOW(SGX_MAGIC, 0x03, struct sgx_enclave_provision)
> +#define SGX_IOC_VEPC_REMOVE \
> +	_IO(SGX_MAGIC, 0x04)

Perhaps SGX_IOC_VEPC_RESET is a better name than REMOVE, since this ioctl
doesn't actually remove any EPC page from the virtual EPC device; it just
resets the pages to a clean slate (by using EREMOVE).

Patch

diff --git a/arch/x86/include/uapi/asm/sgx.h b/arch/x86/include/uapi/asm/sgx.h
index 9690d6899ad9..f79d84ce8033 100644
--- a/arch/x86/include/uapi/asm/sgx.h
+++ b/arch/x86/include/uapi/asm/sgx.h
@@ -27,6 +27,8 @@  enum sgx_page_flags {
 	_IOW(SGX_MAGIC, 0x02, struct sgx_enclave_init)
 #define SGX_IOC_ENCLAVE_PROVISION \
 	_IOW(SGX_MAGIC, 0x03, struct sgx_enclave_provision)
+#define SGX_IOC_VEPC_REMOVE \
+	_IO(SGX_MAGIC, 0x04)
 
 /**
  * struct sgx_enclave_create - parameter structure for the
diff --git a/arch/x86/kernel/cpu/sgx/virt.c b/arch/x86/kernel/cpu/sgx/virt.c
index 59b9c13121cd..81dc186fda2e 100644
--- a/arch/x86/kernel/cpu/sgx/virt.c
+++ b/arch/x86/kernel/cpu/sgx/virt.c
@@ -154,6 +154,24 @@  static int sgx_vepc_free_page(struct sgx_epc_page *epc_page)
 	return 0;
 }
 
+static long sgx_vepc_remove_all(struct sgx_vepc *vepc)
+{
+	struct sgx_epc_page *entry;
+	unsigned long index;
+	long failures = 0;
+
+	xa_for_each(&vepc->page_array, index, entry)
+		if (sgx_vepc_remove_page(entry))
+			failures++;
+
+	/*
+	 * Return the number of pages that failed to be removed, so
+	 * userspace knows that there are still SECS pages lying
+	 * around.
+	 */
+	return failures;
+}
+
 static int sgx_vepc_release(struct inode *inode, struct file *file)
 {
 	struct sgx_vepc *vepc = file->private_data;
@@ -239,9 +257,27 @@  static int sgx_vepc_open(struct inode *inode, struct file *file)
 	return 0;
 }
 
+static long sgx_vepc_ioctl(struct file *file,
+			   unsigned int cmd, unsigned long arg)
+{
+	struct sgx_vepc *vepc = file->private_data;
+
+	switch (cmd) {
+	case SGX_IOC_VEPC_REMOVE:
+		if (arg)
+			return -EINVAL;
+		return sgx_vepc_remove_all(vepc);
+
+	default:
+		return -ENOTTY;
+	}
+}
+
 static const struct file_operations sgx_vepc_fops = {
 	.owner		= THIS_MODULE,
 	.open		= sgx_vepc_open,
+	.unlocked_ioctl	= sgx_vepc_ioctl,
+	.compat_ioctl	= sgx_vepc_ioctl,
 	.release	= sgx_vepc_release,
 	.mmap		= sgx_vepc_mmap,
 };