
[v3,05/25] x86/sgx: Introduce virtual EPC for use by KVM guests

Message ID 0c38ced8c8e5a69872db4d6a1c0dabd01e07cad7.1616136308.git.kai.huang@intel.com (mailing list archive)
State New, archived
Series KVM SGX virtualization support

Commit Message

Huang, Kai March 19, 2021, 7:22 a.m. UTC
From: Sean Christopherson <sean.j.christopherson@intel.com>

Add a misc device /dev/sgx_vepc to allow userspace to allocate "raw" EPC
without an associated enclave.  The intended and only known use case for
raw EPC allocation is to expose EPC to a KVM guest, hence the 'vepc'
moniker, virt.{c,h} files and X86_SGX_KVM Kconfig.

The SGX driver uses the misc device /dev/sgx_enclave to support userspace
in creating an enclave.  Each file descriptor returned from opening
/dev/sgx_enclave represents an enclave.  Unlike the SGX driver, KVM doesn't
control how the guest uses EPC, therefore EPC allocated to a KVM guest is
not associated with an enclave, and /dev/sgx_enclave is not suitable for
allocating EPC for a KVM guest.

Having separate device nodes for SGX driver and KVM virtual EPC also
allows separate permission control for running host SGX enclaves and
KVM SGX guests.

To use /dev/sgx_vepc to allocate a virtual EPC instance of a particular
size, the userspace hypervisor opens /dev/sgx_vepc and calls mmap() with
the intended size to get an address range of virtual EPC.  It may then use
that address range to create a KVM memory slot serving as virtual EPC for
the guest.
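
As an illustration only (not part of this patch), a minimal userspace sketch
of that flow could look like the code below; the guest EPC base address and
the KVM memory slot number are hypothetical, and error handling is elided:

#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/kvm.h>

static void *add_vepc_memslot(int vm_fd, __u64 guest_epc_base,
			      size_t epc_size)
{
	struct kvm_userspace_memory_region region;
	void *vepc;
	int fd;

	fd = open("/dev/sgx_vepc", O_RDWR);
	if (fd < 0)
		return NULL;

	/* EPC pages are populated lazily, on fault, not at mmap() time. */
	vepc = mmap(NULL, epc_size, PROT_READ | PROT_WRITE | PROT_EXEC,
		    MAP_SHARED, fd, 0);
	if (vepc == MAP_FAILED)
		return NULL;

	region.slot            = 1;	/* hypothetical slot number */
	region.flags           = 0;
	region.guest_phys_addr = guest_epc_base;
	region.memory_size     = epc_size;
	region.userspace_addr  = (__u64)(unsigned long)vepc;

	if (ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region) < 0)
		return NULL;

	return vepc;
}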

Implement the "raw" EPC allocation in the x86 core-SGX subsystem via
/dev/sgx_vepc rather than in KVM. Doing so has two major advantages:

  - Does not require changes to KVM's uAPI, e.g. EPC gets handled as
    just another memory backend for guests.

  - EPC management is wholly contained in the SGX subsystem, e.g. SGX
    does not have to export any symbols, changes to reclaim flows don't
    need to be routed through KVM, SGX's dirty laundry doesn't have to
    get aired out for the world to see, and so on and so forth.

The virtual EPC pages allocated to guests are currently not reclaimable.
Reclaiming an EPC page used by an enclave requires a special reclaim
mechanism separate from normal page reclaim, and that mechanism is not
supported for virtual EPC pages.  Due to the complications of handling
reclaim conflicts between guest and host, reclaiming virtual EPC pages is
significantly more complex than basic support for SGX virtualization.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Co-developed-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 arch/x86/Kconfig                 |  12 ++
 arch/x86/kernel/cpu/sgx/Makefile |   1 +
 arch/x86/kernel/cpu/sgx/sgx.h    |   9 ++
 arch/x86/kernel/cpu/sgx/virt.c   | 260 +++++++++++++++++++++++++++++++
 4 files changed, 282 insertions(+)
 create mode 100644 arch/x86/kernel/cpu/sgx/virt.c

Comments

Huang, Kai March 25, 2021, 9:36 a.m. UTC | #1
> +
> +static int sgx_vepc_free_page(struct sgx_epc_page *epc_page)
> +{
> +	int ret;
> +
> +	/*
> +	 * Take a previously guest-owned EPC page and return it to the
> +	 * general EPC page pool.
> +	 *
> +	 * Guests can not be trusted to have left this page in a good
> +	 * state, so run EREMOVE on the page unconditionally.  In the
> +	 * case that a guest properly EREMOVE'd this page, a superfluous
> +	 * EREMOVE is harmless.
> +	 */
> +	ret = __eremove(sgx_get_epc_virt_addr(epc_page));
> +	if (ret) {
> +		/*
> +		 * Only SGX_CHILD_PRESENT is expected, which happens when
> +		 * EREMOVE'ing an SECS that still has children.  That case is
> +		 * handled by EREMOVE'ing the SECS again after all pages in
> +		 * the virtual EPC have been EREMOVE'd.  See comments below
> +		 * in sgx_vepc_release().
> +		 *
> +		 * The user of virtual EPC (KVM) needs to guarantee that no
> +		 * logical processor is still running in the enclave in the
> +		 * guest, otherwise EREMOVE will get SGX_ENCLAVE_ACT which
> +		 * cannot be handled here.
> +		 */
> +		WARN_ONCE(ret != SGX_CHILD_PRESENT,
> +			  "EREMOVE (EPC page 0x%lx): unexpected error: %d\n",
> +			  sgx_get_epc_phys_addr(epc_page), ret);

Hi Boris,

With the change to patch 3, I think perhaps this WARN_ONCE() should also be
changed to:

                WARN_ONCE(ret != SGX_CHILD_PRESENT, EREMOVE_ERROR_MESSAGE,
                                ret, ret);

> +		return ret;
> +	}
> +
> +	sgx_free_epc_page(epc_page);
> +
> +	return 0;
> +}
>
Borislav Petkov March 26, 2021, 3:03 p.m. UTC | #2
On Fri, Mar 19, 2021 at 08:22:21PM +1300, Kai Huang wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
> 
> Add a misc device /dev/sgx_vepc to allow userspace to allocate "raw" EPC
> without an associated enclave.  The intended and only known use case for
> raw EPC allocation is to expose EPC to a KVM guest, hence the 'vepc'
> moniker, virt.{c,h} files and X86_SGX_KVM Kconfig.
> 
> SGX driver uses misc device /dev/sgx_enclave to support userspace to
> create enclave.  Each file descriptor from opening /dev/sgx_enclave
> represents an enclave.  Unlike SGX driver, KVM doesn't control how guest
> uses EPC, therefore EPC allocated to KVM guest is not associated to an
> enclave, and /dev/sgx_enclave is not suitable for allocating EPC for KVM
> guest.
> 
> Having separate device nodes for SGX driver and KVM virtual EPC also
> allows separate permission control for running host SGX enclaves and
> KVM SGX guests.

Hmm, just a question on the big picture here - that might've popped up
already:

So baremetal uses /dev/sgx_enclave and KVM uses /dev/sgx_vepc. Who's
deciding which of the two has priority?

Let's say all guests start using enclaves and baremetal cannot start any
new ones anymore due to no more memory. Are we ok with that?

What if baremetal creates a big fat enclave and starves guests all of a
sudden. Are we ok with that either?

In general, having two disjoint things give out SGX resources separately
sounds like trouble to me.

IOW, why don't all virt allocations go through /dev/sgx_enclave too, so
that you can have a single place to control all resource allocations?

> To use /dev/sgx_vepc to allocate a virtual EPC instance with particular
> size, the userspace hypervisor opens /dev/sgx_vepc, and uses mmap()
> with the intended size to get an address range of virtual EPC.  Then
> it may use the address range to create one KVM memory slot as virtual
> EPC for guest.
> 
> Implement the "raw" EPC allocation in the x86 core-SGX subsystem via
> /dev/sgx_vepc rather than in KVM. Doing so has two major advantages:
> 
>   - Does not require changes to KVM's uAPI, e.g. EPC gets handled as
>     just another memory backend for guests.
> 
>   - EPC management is wholly contained in the SGX subsystem, e.g. SGX
>     does not have to export any symbols, changes to reclaim flows don't
>     need to be routed through KVM, SGX's dirty laundry doesn't have to
>     get aired out for the world to see,

Good one. :-)

> and so on and so forth.

> The virtual EPC pages allocated to guests are currently not reclaimable.
> Reclaiming EPC page used by enclave requires a special reclaim mechanism
> separate from normal page reclaim, and that mechanism is not supported
> for virtual EPC pages.  Due to the complications of handling reclaim
> conflicts between guest and host, reclaiming virtual EPC pages is
> significantly more complex than basic support for SGX virtualization.

What happens if someone in the future wants to change that? Someone
needs to write patches or there's a more fundamental stopper issue
involved?

As always, I might be missing something but that doesn't stop me from
being devil's advocate. :-)

Thx.
Dave Hansen March 26, 2021, 3:17 p.m. UTC | #3
On 3/26/21 8:03 AM, Borislav Petkov wrote:
> Let's say all guests start using enclaves and baremetal cannot start any
> new ones anymore due to no more memory. Are we ok with that?

Yes, for now.

> What if baremetal creates a big fat enclave and starves guests all of a
> sudden. Are we ok with that either?

Actually, the baremetal enclave will get a large chunk of its resources
reclaimed and stolen from it.  The guests will probably start and the
baremetal will probably thrash until its allocations fail and it is
killed because it couldn't allocate enclave memory in a page fault.

> In general, having two disjoint things give out SGX resources separately
> sounds like trouble to me.

Yes, it's trouble as-is.

We're working on a cgroup controller just for enclave pages that will
apply to guest use and bare metal.  It would have been nice to have up
front, but we're trying to do things incrementally.  A cgroup controller
should solve the vast majority of these issues where users are quarreling
about who gets enclave memory.

BTW, we probably should have laid this out up front in the original
merge, but the plans in order were roughly:

1. Core SGX functionality (merged into 5.11)
2. NUMA and KVM work
3. cgroup controller for enclave pages
4. EDMM support (lets you add/remove pages and change permissions while
   enclave runs.  Current enclaves are stuck with the same memory they
   start with)

After that, things become less clear.  There's some debate whether we
need to rework the VA pages (enclave swapping metadata to prevent
replay) or improve ability to reclaim guest pages.
Borislav Petkov March 26, 2021, 3:29 p.m. UTC | #4
On Fri, Mar 26, 2021 at 08:17:38AM -0700, Dave Hansen wrote:
> We're working on a cgroup controller just for enclave pages that will
> apply to guest use and bare metal.  It would have been nice to have up
> front, but we're trying to do things incrementally.  A cgroup controller
> should solve the vast majority of these issues where users are quarreling
> about who gets enclave memory.

Maybe I'm missing something but why do you need a cgroup controller
instead of controlling that resource sharing in the sgx core? Or the
cgroup thing has additional functionality which is good to have anyway?

> BTW, we probably should have laid this out up front in the original
> merge, but the plans in order were roughly:
> 
> 1. Core SGX functionality (merged into 5.11)
> 2. NUMA and KVM work
> 3. cgroup controller for enclave pages
> 4. EDMM support (lets you add/remove pages and change permissions while
>    enclave runs.  Current enclaves are stuck with the same memory they
>    start with)

Oh yeah, that helps, thanks!
Dave Hansen March 26, 2021, 3:35 p.m. UTC | #5
On 3/26/21 8:29 AM, Borislav Petkov wrote:
> On Fri, Mar 26, 2021 at 08:17:38AM -0700, Dave Hansen wrote:
>> We're working on a cgroup controller just for enclave pages that will
>> apply to guest use and bare metal.  It would have been nice to have up
>> front, but we're trying to do things incrementally.  A cgroup controller
>> should solve the vast majority of these issues where users are quarreling
>> about who gets enclave memory.
> Maybe I'm missing something but why do you need a cgroup controller
> instead of controlling that resource sharing in the sgx core? Or the
> cgroup thing has additional functionality which is good to have anyway?

We could do it in the SGX core, but I think what we end up with will end
up looking a lot like a cgroup controller.  It seems like overkill, but
I think there's enough infrastructure to leverage that it's simpler to
do it with cgroups versus anything else.
Borislav Petkov March 26, 2021, 5:02 p.m. UTC | #6
On Fri, Mar 26, 2021 at 08:35:34AM -0700, Dave Hansen wrote:
> We could do it in the SGX core, but I think what we end up with will end
> up looking a lot like a cgroup controller.  It seems like overkill, but
> I think there's enough infrastructure to leverage that it's simpler to
> do it with cgroups versus anything else.

Right.

Thx.
Huang, Kai March 31, 2021, 1:10 a.m. UTC | #7
On Fri, 26 Mar 2021 16:03:55 +0100 Borislav Petkov wrote:
> On Fri, Mar 19, 2021 at 08:22:21PM +1300, Kai Huang wrote:
> > From: Sean Christopherson <sean.j.christopherson@intel.com>
> > 
> > Add a misc device /dev/sgx_vepc to allow userspace to allocate "raw" EPC
> > without an associated enclave.  The intended and only known use case for
> > raw EPC allocation is to expose EPC to a KVM guest, hence the 'vepc'
> > moniker, virt.{c,h} files and X86_SGX_KVM Kconfig.
> > 
> > SGX driver uses misc device /dev/sgx_enclave to support userspace to
> > create enclave.  Each file descriptor from opening /dev/sgx_enclave
> > represents an enclave.  Unlike SGX driver, KVM doesn't control how guest
> > uses EPC, therefore EPC allocated to KVM guest is not associated to an
> > enclave, and /dev/sgx_enclave is not suitable for allocating EPC for KVM
> > guest.
> > 
> > Having separate device nodes for SGX driver and KVM virtual EPC also
> > allows separate permission control for running host SGX enclaves and
> > KVM SGX guests.
> 
> Hmm, just a question on the big picture here - that might've popped up
> already:
> 
> So baremetal uses /dev/sgx_enclave and KVM uses /dev/sgx_vepc. Who's
> deciding which of the two has priority?

Hi Boris,

Sorry for the late response (I saw Dave was replying. Thanks Dave :)).

Ultimately the admin (or the user) decides; from the perspective of EPC page
allocation, neither has priority. The SGX driver's EPC page reclaiming won't
be able to reclaim pages that have been allocated to KVM guests, and when the
virtual EPC fault handler tries to allocate an EPC page, it won't try to
reclaim pages that have been allocated to host enclaves either.

For instance, in the cloud case, where KVM SGX is the main usage, the SGX
driver in the host either won't be used, or only very minimal, specific and
well-defined workloads will be deployed in the host (for instance, the
Quoting Enclave and architectural enclaves used for attestation). The admin
will be aware of this disjoint EPC allocation situation and will deploy host
enclaves/KVM SGX guests accordingly.

> 
> Let's say all guests start using enclaves and baremetal cannot start any
> new ones anymore due to no more memory. Are we ok with that?
> 
> What if baremetal creates a big fat enclave and starves guests all of a
> sudden. Are we ok with that either?

Yes to both above questions.

> 
> In general, having two disjoint things give out SGX resources separately
> sounds like trouble to me.
> 
> IOW, why don't all virt allocations go through /dev/sgx_enclave too, so
> that you can have a single place to control all resource allocations?

Overall, there are two reasons (also mentioned in the commit msg of this patch):

1) /dev/sgx_enclave, by its name, implies that EPC pages allocated through it
are associated with a host enclave, so it is not suitable for virtual EPC,
since EPC allocated to a KVM guest won't have an enclave associated with it.
It's possible to modify the SGX driver (such as deferring the 'struct
sgx_encl' allocation from open to the CREATE_ENCLAVE ioctl, modifying the
majority of code flows to cover both cases, etc.), but even with that, we'd
still be better off renaming /dev/sgx_enclave to, for instance, /dev/sgx_epc,
so it doesn't imply that the fd opened from it is a host enclave rather than
some raw EPC. However, that would be a userspace ABI change.
2) Having separate /dev/sgx_enclave and /dev/sgx_vepc allows the admin to
apply different permission controls, if required.

So based on the above reasons, we agreed it's better to have two device nodes.

Please see previous discussion for RFC v4:
https://lore.kernel.org/linux-sgx/c50ffb557166132cf73d0e838d3a5c1f653b28b7.camel@intel.com/

> 
> > To use /dev/sgx_vepc to allocate a virtual EPC instance with particular
> > size, the userspace hypervisor opens /dev/sgx_vepc, and uses mmap()
> > with the intended size to get an address range of virtual EPC.  Then
> > it may use the address range to create one KVM memory slot as virtual
> > EPC for guest.
> > 
> > Implement the "raw" EPC allocation in the x86 core-SGX subsystem via
> > /dev/sgx_vepc rather than in KVM. Doing so has two major advantages:
> > 
> >   - Does not require changes to KVM's uAPI, e.g. EPC gets handled as
> >     just another memory backend for guests.
> > 
> >   - EPC management is wholly contained in the SGX subsystem, e.g. SGX
> >     does not have to export any symbols, changes to reclaim flows don't
> >     need to be routed through KVM, SGX's dirty laundry doesn't have to
> >     get aired out for the world to see,
> 
> Good one. :-)
> 
> > and so on and so forth.
> 
> > The virtual EPC pages allocated to guests are currently not reclaimable.
> > Reclaiming EPC page used by enclave requires a special reclaim mechanism
> > separate from normal page reclaim, and that mechanism is not supported
> > for virtual EPC pages.  Due to the complications of handling reclaim
> > conflicts between guest and host, reclaiming virtual EPC pages is
> > significantly more complex than basic support for SGX virtualization.
> 
> What happens if someone in the future wants to change that? Someone
> needs to write patches or there's a more fundamental stopper issue
> involved?

Sorry, I am not following. Do you mean if someone wants to support "reclaiming
EPC pages from KVM guests"? If so, yes, someone needs to write patches (we
internally have some, actually), but could you elaborate on why there would be
a more fundamental stopper issue involved?
Borislav Petkov March 31, 2021, 6:44 a.m. UTC | #8
On March 31, 2021 3:10:32 AM GMT+02:00, Kai Huang <kai.huang@intel.com> wrote: 

> The admin will be aware of this disjoint EPC allocation situation and will
> deploy host enclaves/KVM SGX guests accordingly.

The admin will be aware because...

1) he's following our discussion?

2) he'll read the commit messages and hopefully understand?

3) we *actually* have documentation somewhere explaining how we envision that stuff to be used?

Or none of the above and he'll end up doing whatever and then he'll eventually figure out that we don't support that use case but he's doing it already anyway and we don't break userspace so we have to support it now and we're stuck somewhere between a rock and a hard place?

Hmm, I think we have enough misguided use cases as it is - don't need another one.
Huang, Kai March 31, 2021, 6:51 a.m. UTC | #9
On Wed, 31 Mar 2021 08:44:23 +0200 Boris Petkov wrote:
> On March 31, 2021 3:10:32 AM GMT+02:00, Kai Huang <kai.huang@intel.com> wrote: 
> 
> > The admin will be aware of this disjoint EPC allocation situation and will
> > deploy host enclaves/KVM SGX guests accordingly.
> 
> The admin will be aware because...
> 
> 1) he's following our discussion?
> 
> 2) he'll read the commit messages and hopefully understand?
> 
> 3) we *actually* have documentation somewhere explaining how we envision that stuff to be used?
> 
> Or none of the above and he'll end up doing whatever and then he'll eventually figure out that we don't support that use case but he's doing it already anyway and we don't break userspace so we have to support it now and we're stuck somewhere between a rock and a hard place?
> 
> Hmm, I think we have enough misguided use cases as it is - don't need another one.

How about adding explanation to Documentation/x86/sgx.rst?
Borislav Petkov March 31, 2021, 7:44 a.m. UTC | #10
On March 31, 2021 8:51:38 AM GMT+02:00, Kai Huang <kai.huang@intel.com> wrote:
>How about adding explanation to Documentation/x86/sgx.rst?

Sure, and then we should point users at it. The thing is also indexed by search engines so hopefully people will find it.

Thx.
Huang, Kai March 31, 2021, 8:53 a.m. UTC | #11
On Wed, 31 Mar 2021 09:44:39 +0200 Boris Petkov wrote:
> On March 31, 2021 8:51:38 AM GMT+02:00, Kai Huang <kai.huang@intel.com> wrote:
> >How about adding explanation to Documentation/x86/sgx.rst?
> 
> Sure, and then we should point users at it. The thing is also indexed by search engines so hopefully people will find it.

Thanks. Will do and send out new patch for review.
Huang, Kai March 31, 2021, 12:20 p.m. UTC | #12
On Wed, 31 Mar 2021 21:53:45 +1300 Kai Huang wrote:
> On Wed, 31 Mar 2021 09:44:39 +0200 Boris Petkov wrote:
> > On March 31, 2021 8:51:38 AM GMT+02:00, Kai Huang <kai.huang@intel.com> wrote:
> > >How about adding explanation to Documentation/x86/sgx.rst?
> > 
> > Sure, and then we should point users at it. The thing is also indexed by search engines so hopefully people will find it.
> 
> Thanks. Will do and send out new patch for review.
> 
Hi Boris,

Could you help to review whether below change is OK?

diff --git a/Documentation/x86/sgx.rst b/Documentation/x86/sgx.rst
index 5ec7d17e65e0..49a840718a4d 100644
--- a/Documentation/x86/sgx.rst
+++ b/Documentation/x86/sgx.rst
@@ -236,3 +236,19 @@ As a result, when this happpens, user should stop running
any new SGX workloads, (or just any new workloads), and migrate all valuable
workloads. Although a machine reboot can recover all EPC, the bug should be
reported to Linux developers.
+
+Virtual EPC
+===========
+
+Separated from SGX driver for creating and running enclaves in host, SGX core
+also supports virtual EPC driver to support KVM SGX virtualization. Unlike SGX
+driver, EPC page allocated via virtual EPC driver is "raw" EPC page and doesn't
+have specific enclave associated. This is because KVM doesn't track how guest
+uses EPC pages.
+
+As a result, SGX core page reclaimer doesn't support reclaiming EPC pages
+allocated to KVM guests via virtual EPC driver. If user wants to deploy both
+host SGX applications and KVM SGX guests on the same machine, user should
+reserve enough EPC (by taking out total virtual EPC size of all SGX VMs from
+physical EPC size) for host SGX applications so they can run with acceptable
+performance.
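
(As a purely illustrative example of that reservation, with hypothetical
numbers: on a host with 92 MB of usable EPC, two SGX VMs with 32 MB of
virtual EPC each leave roughly 92 - 2 * 32 = 28 MB of EPC for host enclaves.)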

In my local tree, I have squashed the above change into this patch, and also
added the below paragraph to the commit message:

    Also add documentation to explain what virtual EPC is, and point out
    that virtual EPC pages are not reclaimable, so users should take this
    into account when deploying both host SGX applications and KVM SGX
    guests on the same machine.
Huang, Kai April 1, 2021, 9:45 a.m. UTC | #13
On Wed, 31 Mar 2021 21:53:45 +1300 Kai Huang wrote:
> On Wed, 31 Mar 2021 09:44:39 +0200 Boris Petkov wrote:
> > On March 31, 2021 8:51:38 AM GMT+02:00, Kai Huang <kai.huang@intel.com> wrote:
> > >How about adding explanation to Documentation/x86/sgx.rst?
> > 
> > Sure, and then we should point users at it. The thing is also indexed by search engines so hopefully people will find it.
> 
> Thanks. Will do and send out new patch for review.
> 

Hi Boris,

I have sent out the updated patch, with documentation added. I also added a
changelog to the patch. Please help take a look. Thanks!
Borislav Petkov April 1, 2021, 6:31 p.m. UTC | #14
On Thu, Apr 01, 2021 at 01:20:39AM +1300, Kai Huang wrote:
> Could you help to review whether below change is OK?

I ended up applying this:

---
From: Sean Christopherson <sean.j.christopherson@intel.com>
Date: Fri, 19 Mar 2021 20:22:21 +1300
Subject: [PATCH] x86/sgx: Introduce virtual EPC for use by KVM guests

Add a misc device /dev/sgx_vepc to allow userspace to allocate "raw"
Enclave Page Cache (EPC) without an associated enclave. The intended
and only known use case for raw EPC allocation is to expose EPC to a
KVM guest, hence the 'vepc' moniker, virt.{c,h} files and X86_SGX_KVM
Kconfig.

The SGX driver uses the misc device /dev/sgx_enclave to support
userspace in creating an enclave. Each file descriptor returned from
opening /dev/sgx_enclave represents an enclave. Unlike the SGX driver,
KVM doesn't control how the guest uses the EPC, therefore EPC allocated
to a KVM guest is not associated with an enclave, and /dev/sgx_enclave
is not suitable for allocating EPC for a KVM guest.

Having separate device nodes for the SGX driver and KVM virtual EPC also
allows separate permission control for running host SGX enclaves and KVM
SGX guests.

To use /dev/sgx_vepc to allocate a virtual EPC instance with particular
size, the hypervisor opens /dev/sgx_vepc, and uses mmap() with the
intended size to get an address range of virtual EPC. Then it may use
the address range to create one KVM memory slot as virtual EPC for
a guest.

Implement the "raw" EPC allocation in the x86 core-SGX subsystem via
/dev/sgx_vepc rather than in KVM. Doing so has two major advantages:

  - Does not require changes to KVM's uAPI, e.g. EPC gets handled as
    just another memory backend for guests.

  - EPC management is wholly contained in the SGX subsystem, e.g. SGX
    does not have to export any symbols, changes to reclaim flows don't
    need to be routed through KVM, SGX's dirty laundry doesn't have to
    get aired out for the world to see, and so on and so forth.

The virtual EPC pages allocated to guests are currently not reclaimable.
Reclaiming an EPC page used by an enclave requires a special reclaim
mechanism separate from normal page reclaim, and that mechanism is not
supported for virtual EPC pages. Due to the complications of handling
reclaim conflicts between guest and host, reclaiming virtual EPC pages
is significantly more complex than basic support for SGX virtualization.

 [ bp:
   - Massage commit message and comments
   - use cpu_feature_enabled()
   - vertically align struct members init
   - massage Virtual EPC clarification text ]

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
Link: https://lkml.kernel.org/r/0c38ced8c8e5a69872db4d6a1c0dabd01e07cad7.1616136308.git.kai.huang@intel.com
---
 Documentation/x86/sgx.rst        |  16 ++
 arch/x86/Kconfig                 |  12 ++
 arch/x86/kernel/cpu/sgx/Makefile |   1 +
 arch/x86/kernel/cpu/sgx/sgx.h    |   9 ++
 arch/x86/kernel/cpu/sgx/virt.c   | 259 +++++++++++++++++++++++++++++++
 5 files changed, 297 insertions(+)
 create mode 100644 arch/x86/kernel/cpu/sgx/virt.c

diff --git a/Documentation/x86/sgx.rst b/Documentation/x86/sgx.rst
index f90076e67cde..dd0ac96ff9ef 100644
--- a/Documentation/x86/sgx.rst
+++ b/Documentation/x86/sgx.rst
@@ -234,3 +234,19 @@ As a result, when this happpens, user should stop running any new
 SGX workloads, (or just any new workloads), and migrate all valuable
 workloads. Although a machine reboot can recover all EPC memory, the bug
 should be reported to Linux developers.
+
+
+Virtual EPC
+===========
+
+The implementation also has a virtual EPC driver to support SGX enclaves
+in guests. Unlike the SGX driver, an EPC page allocated by the virtual
+EPC driver doesn't have a specific enclave associated with it. This is
+because KVM doesn't track how a guest uses EPC pages.
+
+As a result, the SGX core page reclaimer doesn't support reclaiming EPC
+pages allocated to KVM guests through the virtual EPC driver. If the
+user wants to deploy SGX applications both on the host and in guests
+on the same machine, the user should reserve enough EPC (by taking out
+total virtual EPC size of all SGX VMs from the physical EPC size) for
+host SGX applications so they can run with acceptable performance.
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 35391e94bd22..007912f67a06 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1942,6 +1942,18 @@ config X86_SGX
 
 	  If unsure, say N.
 
+config X86_SGX_KVM
+	bool "Software Guard eXtensions (SGX) Virtualization"
+	depends on X86_SGX && KVM_INTEL
+	help
+
+	  Enables KVM guests to create SGX enclaves.
+
+	  This includes support to expose "raw" unreclaimable enclave memory to
+	  guests via a device node, e.g. /dev/sgx_vepc.
+
+	  If unsure, say N.
+
 config EFI
 	bool "EFI runtime service support"
 	depends on ACPI
diff --git a/arch/x86/kernel/cpu/sgx/Makefile b/arch/x86/kernel/cpu/sgx/Makefile
index 91d3dc784a29..9c1656779b2a 100644
--- a/arch/x86/kernel/cpu/sgx/Makefile
+++ b/arch/x86/kernel/cpu/sgx/Makefile
@@ -3,3 +3,4 @@ obj-y += \
 	encl.o \
 	ioctl.o \
 	main.o
+obj-$(CONFIG_X86_SGX_KVM)	+= virt.o
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index 4aa40c627819..4854f3980edd 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -84,4 +84,13 @@ void sgx_mark_page_reclaimable(struct sgx_epc_page *page);
 int sgx_unmark_page_reclaimable(struct sgx_epc_page *page);
 struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim);
 
+#ifdef CONFIG_X86_SGX_KVM
+int __init sgx_vepc_init(void);
+#else
+static inline int __init sgx_vepc_init(void)
+{
+	return -ENODEV;
+}
+#endif
+
 #endif /* _X86_SGX_H */
diff --git a/arch/x86/kernel/cpu/sgx/virt.c b/arch/x86/kernel/cpu/sgx/virt.c
new file mode 100644
index 000000000000..259cc46ad78c
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/virt.c
@@ -0,0 +1,259 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Device driver to expose SGX enclave memory to KVM guests.
+ *
+ * Copyright(c) 2021 Intel Corporation.
+ */
+
+#include <linux/miscdevice.h>
+#include <linux/mm.h>
+#include <linux/mman.h>
+#include <linux/sched/mm.h>
+#include <linux/sched/signal.h>
+#include <linux/slab.h>
+#include <linux/xarray.h>
+#include <asm/sgx.h>
+#include <uapi/asm/sgx.h>
+
+#include "encls.h"
+#include "sgx.h"
+
+struct sgx_vepc {
+	struct xarray page_array;
+	struct mutex lock;
+};
+
+/*
+ * Temporary SECS pages that cannot be EREMOVE'd due to having children in
+ * other virtual EPC instances, and the lock to protect the list.
+ */
+static struct mutex zombie_secs_pages_lock;
+static struct list_head zombie_secs_pages;
+
+static int __sgx_vepc_fault(struct sgx_vepc *vepc,
+			    struct vm_area_struct *vma, unsigned long addr)
+{
+	struct sgx_epc_page *epc_page;
+	unsigned long index, pfn;
+	int ret;
+
+	WARN_ON(!mutex_is_locked(&vepc->lock));
+
+	/* Calculate index of EPC page in virtual EPC's page_array */
+	index = vma->vm_pgoff + PFN_DOWN(addr - vma->vm_start);
+
+	epc_page = xa_load(&vepc->page_array, index);
+	if (epc_page)
+		return 0;
+
+	epc_page = sgx_alloc_epc_page(vepc, false);
+	if (IS_ERR(epc_page))
+		return PTR_ERR(epc_page);
+
+	ret = xa_err(xa_store(&vepc->page_array, index, epc_page, GFP_KERNEL));
+	if (ret)
+		goto err_free;
+
+	pfn = PFN_DOWN(sgx_get_epc_phys_addr(epc_page));
+
+	ret = vmf_insert_pfn(vma, addr, pfn);
+	if (ret != VM_FAULT_NOPAGE) {
+		ret = -EFAULT;
+		goto err_delete;
+	}
+
+	return 0;
+
+err_delete:
+	xa_erase(&vepc->page_array, index);
+err_free:
+	sgx_free_epc_page(epc_page);
+	return ret;
+}
+
+static vm_fault_t sgx_vepc_fault(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	struct sgx_vepc *vepc = vma->vm_private_data;
+	int ret;
+
+	mutex_lock(&vepc->lock);
+	ret = __sgx_vepc_fault(vepc, vma, vmf->address);
+	mutex_unlock(&vepc->lock);
+
+	if (!ret)
+		return VM_FAULT_NOPAGE;
+
+	if (ret == -EBUSY && (vmf->flags & FAULT_FLAG_ALLOW_RETRY)) {
+		mmap_read_unlock(vma->vm_mm);
+		return VM_FAULT_RETRY;
+	}
+
+	return VM_FAULT_SIGBUS;
+}
+
+const struct vm_operations_struct sgx_vepc_vm_ops = {
+	.fault = sgx_vepc_fault,
+};
+
+static int sgx_vepc_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct sgx_vepc *vepc = file->private_data;
+
+	if (!(vma->vm_flags & VM_SHARED))
+		return -EINVAL;
+
+	vma->vm_ops = &sgx_vepc_vm_ops;
+	/* Don't copy VMA in fork() */
+	vma->vm_flags |= VM_PFNMAP | VM_IO | VM_DONTDUMP | VM_DONTCOPY;
+	vma->vm_private_data = vepc;
+
+	return 0;
+}
+
+static int sgx_vepc_free_page(struct sgx_epc_page *epc_page)
+{
+	int ret;
+
+	/*
+	 * Take a previously guest-owned EPC page and return it to the
+	 * general EPC page pool.
+	 *
+	 * Guests can not be trusted to have left this page in a good
+	 * state, so run EREMOVE on the page unconditionally.  In the
+	 * case that a guest properly EREMOVE'd this page, a superfluous
+	 * EREMOVE is harmless.
+	 */
+	ret = __eremove(sgx_get_epc_virt_addr(epc_page));
+	if (ret) {
+		/*
+		 * Only SGX_CHILD_PRESENT is expected, which happens when
+		 * EREMOVE'ing an SECS that still has children.  That case is
+		 * handled by EREMOVE'ing the SECS again after all pages in
+		 * the virtual EPC have been EREMOVE'd.  See comments below
+		 * in sgx_vepc_release().
+		 *
+		 * The user of virtual EPC (KVM) needs to guarantee that no
+		 * logical processor is still running in the enclave in the
+		 * guest, otherwise EREMOVE will get SGX_ENCLAVE_ACT which
+		 * cannot be handled here.
+		 */
+		WARN_ONCE(ret != SGX_CHILD_PRESENT, EREMOVE_ERROR_MESSAGE,
+			  ret, ret);
+		return ret;
+	}
+
+	sgx_free_epc_page(epc_page);
+
+	return 0;
+}
+
+static int sgx_vepc_release(struct inode *inode, struct file *file)
+{
+	struct sgx_vepc *vepc = file->private_data;
+	struct sgx_epc_page *epc_page, *tmp, *entry;
+	unsigned long index;
+
+	LIST_HEAD(secs_pages);
+
+	xa_for_each(&vepc->page_array, index, entry) {
+		/*
+		 * Remove all normal, child pages.  sgx_vepc_free_page()
+		 * will fail if EREMOVE fails, but this is OK and expected on
+		 * SECS pages.  Those can only be EREMOVE'd *after* all their
+		 * child pages. Retries below will clean them up.
+		 */
+		if (sgx_vepc_free_page(entry))
+			continue;
+
+		xa_erase(&vepc->page_array, index);
+	}
+
+	/*
+	 * Retry EREMOVE'ing pages.  This will clean up any SECS pages that
+	 * only had children in this 'epc' area.
+	 */
+	xa_for_each(&vepc->page_array, index, entry) {
+		epc_page = entry;
+		/*
+		 * An EREMOVE failure here means that the SECS page still
+		 * has children.  But, since all children in this 'sgx_vepc'
+		 * have been removed, the SECS page must have a child on
+		 * another instance.
+		 */
+		if (sgx_vepc_free_page(epc_page))
+			list_add_tail(&epc_page->list, &secs_pages);
+
+		xa_erase(&vepc->page_array, index);
+	}
+
+	/*
+	 * SECS pages are "pinned" by child pages, and "unpinned" once all
+	 * children have been EREMOVE'd.  A child page in this instance
+	 * may have pinned an SECS page encountered in an earlier release(),
+	 * creating a zombie.  Since some children were EREMOVE'd above,
+	 * try to EREMOVE all zombies in the hopes that one was unpinned.
+	 */
+	mutex_lock(&zombie_secs_pages_lock);
+	list_for_each_entry_safe(epc_page, tmp, &zombie_secs_pages, list) {
+		/*
+		 * Speculatively remove the page from the list of zombies,
+		 * if the page is successfully EREMOVE'd it will be added to
+		 * the list of free pages.  If EREMOVE fails, throw the page
+		 * on the local list, which will be spliced on at the end.
+		 */
+		list_del(&epc_page->list);
+
+		if (sgx_vepc_free_page(epc_page))
+			list_add_tail(&epc_page->list, &secs_pages);
+	}
+
+	if (!list_empty(&secs_pages))
+		list_splice_tail(&secs_pages, &zombie_secs_pages);
+	mutex_unlock(&zombie_secs_pages_lock);
+
+	kfree(vepc);
+
+	return 0;
+}
+
+static int sgx_vepc_open(struct inode *inode, struct file *file)
+{
+	struct sgx_vepc *vepc;
+
+	vepc = kzalloc(sizeof(struct sgx_vepc), GFP_KERNEL);
+	if (!vepc)
+		return -ENOMEM;
+	mutex_init(&vepc->lock);
+	xa_init(&vepc->page_array);
+
+	file->private_data = vepc;
+
+	return 0;
+}
+
+static const struct file_operations sgx_vepc_fops = {
+	.owner		= THIS_MODULE,
+	.open		= sgx_vepc_open,
+	.release	= sgx_vepc_release,
+	.mmap		= sgx_vepc_mmap,
+};
+
+static struct miscdevice sgx_vepc_dev = {
+	.minor		= MISC_DYNAMIC_MINOR,
+	.name		= "sgx_vepc",
+	.nodename	= "sgx_vepc",
+	.fops		= &sgx_vepc_fops,
+};
+
+int __init sgx_vepc_init(void)
+{
+	/* SGX virtualization requires KVM to work */
+	if (!cpu_feature_enabled(X86_FEATURE_VMX))
+		return -ENODEV;
+
+	INIT_LIST_HEAD(&zombie_secs_pages);
+	mutex_init(&zombie_secs_pages_lock);
+
+	return misc_register(&sgx_vepc_dev);
+}
Huang, Kai April 1, 2021, 11:38 p.m. UTC | #15
On Thu, 1 Apr 2021 20:31:59 +0200 Borislav Petkov wrote:
> On Thu, Apr 01, 2021 at 01:20:39AM +1300, Kai Huang wrote:
> > Could you help to review whether below change is OK?
> 
> I ended up applying this:

Thank you!

Borislav Petkov April 5, 2021, 9:01 a.m. UTC | #16
On Fri, Mar 19, 2021 at 08:22:21PM +1300, Kai Huang wrote:
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 35391e94bd22..007912f67a06 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1942,6 +1942,18 @@ config X86_SGX
>  
>  	  If unsure, say N.
>  
> +config X86_SGX_KVM
> +	bool "Software Guard eXtensions (SGX) Virtualization"
> +	depends on X86_SGX && KVM_INTEL
> +	help

It seems to me this would fit better under "Virtualization" because even
if I want to enable it, I have to go enable KVM_INTEL first and then
return back here to turn it on too.

And under "Virtualization" is where we enable all kinds of aspects which
belong to it.
Huang, Kai April 5, 2021, 9:46 p.m. UTC | #17
On Mon, 5 Apr 2021 11:01:51 +0200 Borislav Petkov wrote:
> On Fri, Mar 19, 2021 at 08:22:21PM +1300, Kai Huang wrote:
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index 35391e94bd22..007912f67a06 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -1942,6 +1942,18 @@ config X86_SGX
> >  
> >  	  If unsure, say N.
> >  
> > +config X86_SGX_KVM
> > +	bool "Software Guard eXtensions (SGX) Virtualization"
> > +	depends on X86_SGX && KVM_INTEL
> > +	help
> 
> It seems to me this would fit better under "Virtualization" because even
> if I want to enable it, I have to go enable KVM_INTEL first and then
> return back here to turn it on too.
> 
> And under "Virtualization" is where we enable all kinds of aspects which
> belong to it.

Fine to me. Please let me know if you want me to resend patches. Thanks.
Borislav Petkov April 6, 2021, 8:28 a.m. UTC | #18
On Tue, Apr 06, 2021 at 09:46:34AM +1200, Kai Huang wrote:
> Fine to me. Please let me know if you want me to resend patches. Thanks.

Patch updated:

---
From: Sean Christopherson <sean.j.christopherson@intel.com>
Date: Fri, 19 Mar 2021 20:22:21 +1300
Subject: [PATCH] x86/sgx: Introduce virtual EPC for use by KVM guests

Add a misc device /dev/sgx_vepc to allow userspace to allocate "raw"
Enclave Page Cache (EPC) without an associated enclave. The intended
and only known use case for raw EPC allocation is to expose EPC to a
KVM guest, hence the 'vepc' moniker, virt.{c,h} files and X86_SGX_KVM
Kconfig.

The SGX driver uses the misc device /dev/sgx_enclave to support
userspace in creating an enclave. Each file descriptor returned from
opening /dev/sgx_enclave represents an enclave. Unlike the SGX driver,
KVM doesn't control how the guest uses the EPC, therefore EPC allocated
to a KVM guest is not associated with an enclave, and /dev/sgx_enclave
is not suitable for allocating EPC for a KVM guest.

Having separate device nodes for the SGX driver and KVM virtual EPC also
allows separate permission control for running host SGX enclaves and KVM
SGX guests.

To use /dev/sgx_vepc to allocate a virtual EPC instance with particular
size, the hypervisor opens /dev/sgx_vepc, and uses mmap() with the
intended size to get an address range of virtual EPC. Then it may use
the address range to create one KVM memory slot as virtual EPC for
a guest.

Implement the "raw" EPC allocation in the x86 core-SGX subsystem via
/dev/sgx_vepc rather than in KVM. Doing so has two major advantages:

  - Does not require changes to KVM's uAPI, e.g. EPC gets handled as
    just another memory backend for guests.

  - EPC management is wholly contained in the SGX subsystem, e.g. SGX
    does not have to export any symbols, changes to reclaim flows don't
    need to be routed through KVM, SGX's dirty laundry doesn't have to
    get aired out for the world to see, and so on and so forth.

The virtual EPC pages allocated to guests are currently not reclaimable.
Reclaiming an EPC page used by an enclave requires a special reclaim
mechanism separate from normal page reclaim, and that mechanism is not
supported for virtual EPC pages. Due to the complications of handling
reclaim conflicts between guest and host, reclaiming virtual EPC pages
is significantly more complex than basic support for SGX virtualization.

 [ bp:
   - Massage commit message and comments
   - use cpu_feature_enabled()
   - vertically align struct members init
   - massage Virtual EPC clarification text
   - move Kconfig prompt to Virtualization ]

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
Link: https://lkml.kernel.org/r/0c38ced8c8e5a69872db4d6a1c0dabd01e07cad7.1616136308.git.kai.huang@intel.com
---
 Documentation/x86/sgx.rst        |  16 ++
 arch/x86/kernel/cpu/sgx/Makefile |   1 +
 arch/x86/kernel/cpu/sgx/sgx.h    |   9 ++
 arch/x86/kernel/cpu/sgx/virt.c   | 259 +++++++++++++++++++++++++++++++
 arch/x86/kvm/Kconfig             |  12 ++
 5 files changed, 297 insertions(+)
 create mode 100644 arch/x86/kernel/cpu/sgx/virt.c

diff --git a/Documentation/x86/sgx.rst b/Documentation/x86/sgx.rst
index f90076e67cde..dd0ac96ff9ef 100644
--- a/Documentation/x86/sgx.rst
+++ b/Documentation/x86/sgx.rst
@@ -234,3 +234,19 @@ As a result, when this happpens, user should stop running any new
 SGX workloads, (or just any new workloads), and migrate all valuable
 workloads. Although a machine reboot can recover all EPC memory, the bug
 should be reported to Linux developers.
+
+
+Virtual EPC
+===========
+
+The implementation also has a virtual EPC driver to support SGX enclaves
+in guests. Unlike the SGX driver, an EPC page allocated by the virtual
+EPC driver doesn't have a specific enclave associated with it. This is
+because KVM doesn't track how a guest uses EPC pages.
+
+As a result, the SGX core page reclaimer doesn't support reclaiming EPC
+pages allocated to KVM guests through the virtual EPC driver. If the
+user wants to deploy SGX applications both on the host and in guests
+on the same machine, the user should reserve enough EPC (by taking out
+total virtual EPC size of all SGX VMs from the physical EPC size) for
+host SGX applications so they can run with acceptable performance.
diff --git a/arch/x86/kernel/cpu/sgx/Makefile b/arch/x86/kernel/cpu/sgx/Makefile
index 91d3dc784a29..9c1656779b2a 100644
--- a/arch/x86/kernel/cpu/sgx/Makefile
+++ b/arch/x86/kernel/cpu/sgx/Makefile
@@ -3,3 +3,4 @@ obj-y += \
 	encl.o \
 	ioctl.o \
 	main.o
+obj-$(CONFIG_X86_SGX_KVM)	+= virt.o
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index 4aa40c627819..4854f3980edd 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -84,4 +84,13 @@ void sgx_mark_page_reclaimable(struct sgx_epc_page *page);
 int sgx_unmark_page_reclaimable(struct sgx_epc_page *page);
 struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim);
 
+#ifdef CONFIG_X86_SGX_KVM
+int __init sgx_vepc_init(void);
+#else
+static inline int __init sgx_vepc_init(void)
+{
+	return -ENODEV;
+}
+#endif
+
 #endif /* _X86_SGX_H */
diff --git a/arch/x86/kernel/cpu/sgx/virt.c b/arch/x86/kernel/cpu/sgx/virt.c
new file mode 100644
index 000000000000..259cc46ad78c
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/virt.c
@@ -0,0 +1,259 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Device driver to expose SGX enclave memory to KVM guests.
+ *
+ * Copyright(c) 2021 Intel Corporation.
+ */
+
+#include <linux/miscdevice.h>
+#include <linux/mm.h>
+#include <linux/mman.h>
+#include <linux/sched/mm.h>
+#include <linux/sched/signal.h>
+#include <linux/slab.h>
+#include <linux/xarray.h>
+#include <asm/sgx.h>
+#include <uapi/asm/sgx.h>
+
+#include "encls.h"
+#include "sgx.h"
+
+struct sgx_vepc {
+	struct xarray page_array;
+	struct mutex lock;
+};
+
+/*
+ * Temporary SECS pages that cannot be EREMOVE'd due to having children in
+ * other virtual EPC instances, and the lock to protect the list.
+ */
+static struct mutex zombie_secs_pages_lock;
+static struct list_head zombie_secs_pages;
+
+static int __sgx_vepc_fault(struct sgx_vepc *vepc,
+			    struct vm_area_struct *vma, unsigned long addr)
+{
+	struct sgx_epc_page *epc_page;
+	unsigned long index, pfn;
+	int ret;
+
+	WARN_ON(!mutex_is_locked(&vepc->lock));
+
+	/* Calculate index of EPC page in virtual EPC's page_array */
+	index = vma->vm_pgoff + PFN_DOWN(addr - vma->vm_start);
+
+	epc_page = xa_load(&vepc->page_array, index);
+	if (epc_page)
+		return 0;
+
+	epc_page = sgx_alloc_epc_page(vepc, false);
+	if (IS_ERR(epc_page))
+		return PTR_ERR(epc_page);
+
+	ret = xa_err(xa_store(&vepc->page_array, index, epc_page, GFP_KERNEL));
+	if (ret)
+		goto err_free;
+
+	pfn = PFN_DOWN(sgx_get_epc_phys_addr(epc_page));
+
+	ret = vmf_insert_pfn(vma, addr, pfn);
+	if (ret != VM_FAULT_NOPAGE) {
+		ret = -EFAULT;
+		goto err_delete;
+	}
+
+	return 0;
+
+err_delete:
+	xa_erase(&vepc->page_array, index);
+err_free:
+	sgx_free_epc_page(epc_page);
+	return ret;
+}
+
+static vm_fault_t sgx_vepc_fault(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	struct sgx_vepc *vepc = vma->vm_private_data;
+	int ret;
+
+	mutex_lock(&vepc->lock);
+	ret = __sgx_vepc_fault(vepc, vma, vmf->address);
+	mutex_unlock(&vepc->lock);
+
+	if (!ret)
+		return VM_FAULT_NOPAGE;
+
+	if (ret == -EBUSY && (vmf->flags & FAULT_FLAG_ALLOW_RETRY)) {
+		mmap_read_unlock(vma->vm_mm);
+		return VM_FAULT_RETRY;
+	}
+
+	return VM_FAULT_SIGBUS;
+}
+
+const struct vm_operations_struct sgx_vepc_vm_ops = {
+	.fault = sgx_vepc_fault,
+};
+
+static int sgx_vepc_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct sgx_vepc *vepc = file->private_data;
+
+	if (!(vma->vm_flags & VM_SHARED))
+		return -EINVAL;
+
+	vma->vm_ops = &sgx_vepc_vm_ops;
+	/* Don't copy VMA in fork() */
+	vma->vm_flags |= VM_PFNMAP | VM_IO | VM_DONTDUMP | VM_DONTCOPY;
+	vma->vm_private_data = vepc;
+
+	return 0;
+}
+
+static int sgx_vepc_free_page(struct sgx_epc_page *epc_page)
+{
+	int ret;
+
+	/*
+	 * Take a previously guest-owned EPC page and return it to the
+	 * general EPC page pool.
+	 *
+	 * Guests can not be trusted to have left this page in a good
+	 * state, so run EREMOVE on the page unconditionally.  In the
+	 * case that a guest properly EREMOVE'd this page, a superfluous
+	 * EREMOVE is harmless.
+	 */
+	ret = __eremove(sgx_get_epc_virt_addr(epc_page));
+	if (ret) {
+		/*
+		 * Only SGX_CHILD_PRESENT is expected, which happens when
+		 * EREMOVE'ing an SECS page that still has children.  This is
+		 * handled by EREMOVE'ing the SECS again after all pages in
+		 * the virtual EPC have been EREMOVE'd.  See the comments
+		 * below in sgx_vepc_release().
+		 *
+		 * The user of the virtual EPC (KVM) needs to guarantee that
+		 * no logical processor is still running in the enclave in the
+		 * guest, otherwise EREMOVE will return SGX_ENCLAVE_ACT which
+		 * cannot be handled here.
+		 */
+		WARN_ONCE(ret != SGX_CHILD_PRESENT, EREMOVE_ERROR_MESSAGE,
+			  ret, ret);
+		return ret;
+	}
+
+	sgx_free_epc_page(epc_page);
+
+	return 0;
+}
+
+static int sgx_vepc_release(struct inode *inode, struct file *file)
+{
+	struct sgx_vepc *vepc = file->private_data;
+	struct sgx_epc_page *epc_page, *tmp, *entry;
+	unsigned long index;
+
+	LIST_HEAD(secs_pages);
+
+	xa_for_each(&vepc->page_array, index, entry) {
+		/*
+		 * Remove all normal, child pages.  sgx_vepc_free_page()
+		 * will fail if EREMOVE fails, but this is OK and expected on
+		 * SECS pages.  Those can only be EREMOVE'd *after* all their
+		 * child pages. Retries below will clean them up.
+		 */
+		if (sgx_vepc_free_page(entry))
+			continue;
+
+		xa_erase(&vepc->page_array, index);
+	}
+
+	/*
+	 * Retry EREMOVE'ing pages.  This will clean up any SECS pages that
+	 * only had children in this 'epc' area.
+	 */
+	xa_for_each(&vepc->page_array, index, entry) {
+		epc_page = entry;
+		/*
+		 * An EREMOVE failure here means that the SECS page still
+		 * has children.  But, since all children in this 'sgx_vepc'
+		 * have been removed, the SECS page must have a child on
+		 * another instance.
+		 */
+		if (sgx_vepc_free_page(epc_page))
+			list_add_tail(&epc_page->list, &secs_pages);
+
+		xa_erase(&vepc->page_array, index);
+	}
+
+	/*
+	 * SECS pages are "pinned" by child pages, and "unpinned" once all
+	 * children have been EREMOVE'd.  A child page in this instance
+	 * may have pinned an SECS page encountered in an earlier release(),
+	 * creating a zombie.  Since some children were EREMOVE'd above,
+	 * try to EREMOVE all zombies in the hopes that one was unpinned.
+	 */
+	mutex_lock(&zombie_secs_pages_lock);
+	list_for_each_entry_safe(epc_page, tmp, &zombie_secs_pages, list) {
+		/*
+		 * Speculatively remove the page from the list of zombies,
+		 * if the page is successfully EREMOVE'd it will be added to
+		 * the list of free pages.  If EREMOVE fails, throw the page
+		 * on the local list, which will be spliced on at the end.
+		 */
+		list_del(&epc_page->list);
+
+		if (sgx_vepc_free_page(epc_page))
+			list_add_tail(&epc_page->list, &secs_pages);
+	}
+
+	if (!list_empty(&secs_pages))
+		list_splice_tail(&secs_pages, &zombie_secs_pages);
+	mutex_unlock(&zombie_secs_pages_lock);
+
+	kfree(vepc);
+
+	return 0;
+}
+
+static int sgx_vepc_open(struct inode *inode, struct file *file)
+{
+	struct sgx_vepc *vepc;
+
+	vepc = kzalloc(sizeof(struct sgx_vepc), GFP_KERNEL);
+	if (!vepc)
+		return -ENOMEM;
+	mutex_init(&vepc->lock);
+	xa_init(&vepc->page_array);
+
+	file->private_data = vepc;
+
+	return 0;
+}
+
+static const struct file_operations sgx_vepc_fops = {
+	.owner		= THIS_MODULE,
+	.open		= sgx_vepc_open,
+	.release	= sgx_vepc_release,
+	.mmap		= sgx_vepc_mmap,
+};
+
+static struct miscdevice sgx_vepc_dev = {
+	.minor		= MISC_DYNAMIC_MINOR,
+	.name		= "sgx_vepc",
+	.nodename	= "sgx_vepc",
+	.fops		= &sgx_vepc_fops,
+};
+
+int __init sgx_vepc_init(void)
+{
+	/* SGX virtualization requires KVM to work */
+	if (!cpu_feature_enabled(X86_FEATURE_VMX))
+		return -ENODEV;
+
+	INIT_LIST_HEAD(&zombie_secs_pages);
+	mutex_init(&zombie_secs_pages_lock);
+
+	return misc_register(&sgx_vepc_dev);
+}
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index a788d5120d4d..f6b93a35ce14 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -84,6 +84,18 @@ config KVM_INTEL
 	  To compile this as a module, choose M here: the module
 	  will be called kvm-intel.
 
+config X86_SGX_KVM
+	bool "Software Guard eXtensions (SGX) Virtualization"
+	depends on X86_SGX && KVM_INTEL
+	help
+
+	  Enables KVM guests to create SGX enclaves.
+
+	  This includes support to expose "raw" unreclaimable enclave memory to
+	  guests via a device node, e.g. /dev/sgx_vepc.
+
+	  If unsure, say N.
+
 config KVM_AMD
 	tristate "KVM for AMD processors support"
 	depends on KVM
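
The Virtual EPC documentation above describes the driver only from the host's
perspective. As a rough illustration of the intended flow, below is a minimal
userspace sketch, not taken from this patch or from any real VMM, of how a
hypervisor could obtain a virtual EPC range: it opens /dev/sgx_vepc and creates
a MAP_SHARED mapping of the desired size; the EPC pages themselves are only
allocated lazily by sgx_vepc_fault(). The size and protection flags here are
assumptions chosen for illustration. As a concrete instance of the sizing
advice in the documentation: with 128 MB of physical EPC and two VMs each given
32 MB of virtual EPC, roughly 64 MB is left for host enclaves.

/* Hypothetical example only: allocate a 64 MB virtual EPC range. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t size = 64 * 1024 * 1024;		/* assumed guest EPC size */
	void *vepc;
	int fd;

	fd = open("/dev/sgx_vepc", O_RDWR);
	if (fd < 0) {
		perror("open /dev/sgx_vepc");
		return EXIT_FAILURE;
	}

	/*
	 * Only shared mappings are accepted (sgx_vepc_mmap() rejects
	 * anything without VM_SHARED); EPC pages are allocated on fault.
	 */
	vepc = mmap(NULL, size, PROT_READ | PROT_WRITE | PROT_EXEC,
		    MAP_SHARED, fd, 0);
	if (vepc == MAP_FAILED) {
		perror("mmap /dev/sgx_vepc");
		close(fd);
		return EXIT_FAILURE;
	}

	printf("virtual EPC range at %p, size %zu bytes\n", vepc, size);

	munmap(vepc, size);
	close(fd);
	return EXIT_SUCCESS;
}

A real hypervisor would then hand the returned address range to KVM as the
backing memory for the guest's EPC section; that step is sketched after the
patch at the bottom of this page.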
Huang, Kai April 6, 2021, 9:04 a.m. UTC | #19
On Tue, 6 Apr 2021 10:28:00 +0200 Borislav Petkov wrote:
> On Tue, Apr 06, 2021 at 09:46:34AM +1200, Kai Huang wrote:
> > Fine to me. Please let me know if you want me to resend patches. Thanks.
> 
> Patch updated:

Looks fine. Thank you!

diff mbox series

Patch

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 35391e94bd22..007912f67a06 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1942,6 +1942,18 @@  config X86_SGX
 
 	  If unsure, say N.
 
+config X86_SGX_KVM
+	bool "Software Guard eXtensions (SGX) Virtualization"
+	depends on X86_SGX && KVM_INTEL
+	help
+
+	  Enables KVM guests to create SGX enclaves.
+
+	  This includes support to expose "raw" unreclaimable enclave memory to
+	  guests via a device node, e.g. /dev/sgx_vepc.
+
+	  If unsure, say N.
+
 config EFI
 	bool "EFI runtime service support"
 	depends on ACPI
diff --git a/arch/x86/kernel/cpu/sgx/Makefile b/arch/x86/kernel/cpu/sgx/Makefile
index 91d3dc784a29..9c1656779b2a 100644
--- a/arch/x86/kernel/cpu/sgx/Makefile
+++ b/arch/x86/kernel/cpu/sgx/Makefile
@@ -3,3 +3,4 @@  obj-y += \
 	encl.o \
 	ioctl.o \
 	main.o
+obj-$(CONFIG_X86_SGX_KVM)	+= virt.o
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index 653af8ca1a25..a3aa00cb1ac4 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -80,4 +80,13 @@  void sgx_mark_page_reclaimable(struct sgx_epc_page *page);
 int sgx_unmark_page_reclaimable(struct sgx_epc_page *page);
 struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim);
 
+#ifdef CONFIG_X86_SGX_KVM
+int __init sgx_vepc_init(void);
+#else
+static inline int __init sgx_vepc_init(void)
+{
+	return -ENODEV;
+}
+#endif
+
 #endif /* _X86_SGX_H */
diff --git a/arch/x86/kernel/cpu/sgx/virt.c b/arch/x86/kernel/cpu/sgx/virt.c
new file mode 100644
index 000000000000..29d8d28b4695
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/virt.c
@@ -0,0 +1,260 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Device driver to expose SGX enclave memory to KVM guests.
+ *
+ * Copyright(c) 2021 Intel Corporation.
+ */
+
+#include <linux/miscdevice.h>
+#include <linux/mm.h>
+#include <linux/mman.h>
+#include <linux/sched/mm.h>
+#include <linux/sched/signal.h>
+#include <linux/slab.h>
+#include <linux/xarray.h>
+#include <asm/sgx.h>
+#include <uapi/asm/sgx.h>
+
+#include "encls.h"
+#include "sgx.h"
+
+struct sgx_vepc {
+	struct xarray page_array;
+	struct mutex lock;
+};
+
+/*
+ * Temporary SECS pages that cannot be EREMOVE'd due to having children in
+ * other virtual EPC instances, and the lock to protect the list.
+ */
+static struct mutex zombie_secs_pages_lock;
+static struct list_head zombie_secs_pages;
+
+static int __sgx_vepc_fault(struct sgx_vepc *vepc,
+			    struct vm_area_struct *vma, unsigned long addr)
+{
+	struct sgx_epc_page *epc_page;
+	unsigned long index, pfn;
+	int ret;
+
+	WARN_ON(!mutex_is_locked(&vepc->lock));
+
+	/* Calculate index of EPC page in virtual EPC's page_array */
+	index = vma->vm_pgoff + PFN_DOWN(addr - vma->vm_start);
+
+	epc_page = xa_load(&vepc->page_array, index);
+	if (epc_page)
+		return 0;
+
+	epc_page = sgx_alloc_epc_page(vepc, false);
+	if (IS_ERR(epc_page))
+		return PTR_ERR(epc_page);
+
+	ret = xa_err(xa_store(&vepc->page_array, index, epc_page, GFP_KERNEL));
+	if (ret)
+		goto err_free;
+
+	pfn = PFN_DOWN(sgx_get_epc_phys_addr(epc_page));
+
+	ret = vmf_insert_pfn(vma, addr, pfn);
+	if (ret != VM_FAULT_NOPAGE) {
+		ret = -EFAULT;
+		goto err_delete;
+	}
+
+	return 0;
+
+err_delete:
+	xa_erase(&vepc->page_array, index);
+err_free:
+	sgx_free_epc_page(epc_page);
+	return ret;
+}
+
+static vm_fault_t sgx_vepc_fault(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	struct sgx_vepc *vepc = vma->vm_private_data;
+	int ret;
+
+	mutex_lock(&vepc->lock);
+	ret = __sgx_vepc_fault(vepc, vma, vmf->address);
+	mutex_unlock(&vepc->lock);
+
+	if (!ret)
+		return VM_FAULT_NOPAGE;
+
+	if (ret == -EBUSY && (vmf->flags & FAULT_FLAG_ALLOW_RETRY)) {
+		mmap_read_unlock(vma->vm_mm);
+		return VM_FAULT_RETRY;
+	}
+
+	return VM_FAULT_SIGBUS;
+}
+
+const struct vm_operations_struct sgx_vepc_vm_ops = {
+	.fault = sgx_vepc_fault,
+};
+
+static int sgx_vepc_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct sgx_vepc *vepc = file->private_data;
+
+	if (!(vma->vm_flags & VM_SHARED))
+		return -EINVAL;
+
+	vma->vm_ops = &sgx_vepc_vm_ops;
+	/* Don't copy VMA in fork() */
+	vma->vm_flags |= VM_PFNMAP | VM_IO | VM_DONTDUMP | VM_DONTCOPY;
+	vma->vm_private_data = vepc;
+
+	return 0;
+}
+
+static int sgx_vepc_free_page(struct sgx_epc_page *epc_page)
+{
+	int ret;
+
+	/*
+	 * Take a previously guest-owned EPC page and return it to the
+	 * general EPC page pool.
+	 *
+	 * Guests can not be trusted to have left this page in a good
+	 * state, so run EREMOVE on the page unconditionally.  In the
+	 * case that a guest properly EREMOVE'd this page, a superfluous
+	 * EREMOVE is harmless.
+	 */
+	ret = __eremove(sgx_get_epc_virt_addr(epc_page));
+	if (ret) {
+		/*
+		 * Only SGX_CHILD_PRESENT is expected, which happens when
+		 * EREMOVE'ing an SECS page that still has children.  This is
+		 * handled by EREMOVE'ing the SECS again after all pages in
+		 * the virtual EPC have been EREMOVE'd.  See the comments
+		 * below in sgx_vepc_release().
+		 *
+		 * The user of the virtual EPC (KVM) needs to guarantee that
+		 * no logical processor is still running in the enclave in the
+		 * guest, otherwise EREMOVE will return SGX_ENCLAVE_ACT which
+		 * cannot be handled here.
+		 */
+		WARN_ONCE(ret != SGX_CHILD_PRESENT,
+			  "EREMOVE (EPC page 0x%lx): unexpected error: %d\n",
+			  sgx_get_epc_phys_addr(epc_page), ret);
+		return ret;
+	}
+
+	sgx_free_epc_page(epc_page);
+
+	return 0;
+}
+
+static int sgx_vepc_release(struct inode *inode, struct file *file)
+{
+	struct sgx_vepc *vepc = file->private_data;
+	struct sgx_epc_page *epc_page, *tmp, *entry;
+	unsigned long index;
+
+	LIST_HEAD(secs_pages);
+
+	xa_for_each(&vepc->page_array, index, entry) {
+		/*
+		 * Remove all normal, child pages.  sgx_vepc_free_page()
+		 * will fail if EREMOVE fails, but this is OK and expected on
+		 * SECS pages.  Those can only be EREMOVE'd *after* all their
+		 * child pages. Retries below will clean them up.
+		 */
+		if (sgx_vepc_free_page(entry))
+			continue;
+
+		xa_erase(&vepc->page_array, index);
+	}
+
+	/*
+	 * Retry EREMOVE'ing pages.  This will clean up any SECS pages that
+	 * only had children in this 'epc' area.
+	 */
+	xa_for_each(&vepc->page_array, index, entry) {
+		epc_page = entry;
+		/*
+		 * An EREMOVE failure here means that the SECS page still
+		 * has children.  But, since all children in this 'sgx_vepc'
+		 * have been removed, the SECS page must have a child on
+		 * another instance.
+		 */
+		if (sgx_vepc_free_page(epc_page))
+			list_add_tail(&epc_page->list, &secs_pages);
+
+		xa_erase(&vepc->page_array, index);
+	}
+
+	/*
+	 * SECS pages are "pinned" by child pages, and unpinned once all
+	 * children have been EREMOVE'd.  A child page in this instance
+	 * may have pinned an SECS page encountered in an earlier release(),
+	 * creating a zombie.  Since some children were EREMOVE'd above,
+	 * try to EREMOVE all zombies in the hopes that one was unpinned.
+	 */
+	mutex_lock(&zombie_secs_pages_lock);
+	list_for_each_entry_safe(epc_page, tmp, &zombie_secs_pages, list) {
+		/*
+		 * Speculatively remove the page from the list of zombies,
+		 * if the page is successfully EREMOVE'd it will be added to
+		 * the list of free pages.  If EREMOVE fails, throw the page
+		 * on the local list, which will be spliced on at the end.
+		 */
+		list_del(&epc_page->list);
+
+		if (sgx_vepc_free_page(epc_page))
+			list_add_tail(&epc_page->list, &secs_pages);
+	}
+
+	if (!list_empty(&secs_pages))
+		list_splice_tail(&secs_pages, &zombie_secs_pages);
+	mutex_unlock(&zombie_secs_pages_lock);
+
+	kfree(vepc);
+
+	return 0;
+}
+
+static int sgx_vepc_open(struct inode *inode, struct file *file)
+{
+	struct sgx_vepc *vepc;
+
+	vepc = kzalloc(sizeof(struct sgx_vepc), GFP_KERNEL);
+	if (!vepc)
+		return -ENOMEM;
+	mutex_init(&vepc->lock);
+	xa_init(&vepc->page_array);
+
+	file->private_data = vepc;
+
+	return 0;
+}
+
+static const struct file_operations sgx_vepc_fops = {
+	.owner		= THIS_MODULE,
+	.open		= sgx_vepc_open,
+	.release	= sgx_vepc_release,
+	.mmap		= sgx_vepc_mmap,
+};
+
+static struct miscdevice sgx_vepc_dev = {
+	.minor = MISC_DYNAMIC_MINOR,
+	.name = "sgx_vepc",
+	.nodename = "sgx_vepc",
+	.fops = &sgx_vepc_fops,
+};
+
+int __init sgx_vepc_init(void)
+{
+	/* SGX virtualization requires KVM to work */
+	if (!boot_cpu_has(X86_FEATURE_VMX))
+		return -ENODEV;
+
+	INIT_LIST_HEAD(&zombie_secs_pages);
+	mutex_init(&zombie_secs_pages_lock);
+
+	return misc_register(&sgx_vepc_dev);
+}
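
The X86_SGX_KVM help text above says the raw enclave memory is exposed to
guests via a device node such as /dev/sgx_vepc; turning that memory into guest
EPC is the hypervisor's and KVM's job. The sketch below is again an
assumption-laden illustration rather than code from this series or from any
real VMM: the slot number and guest physical address are invented, vm_fd is
assumed to be a VM file descriptor obtained with KVM_CREATE_VM, and the guest
still needs the SGX-specific enablement added by the other patches in this
series before it can actually use the range as EPC.

/*
 * Hypothetical sketch: register an mmap()'ed /dev/sgx_vepc range as a KVM
 * memory slot.  Slot number and guest physical address are illustrative.
 */
#include <linux/kvm.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>

int add_vepc_memslot(int vm_fd, void *vepc, uint64_t gpa, uint64_t size)
{
	struct kvm_userspace_memory_region region = {
		.slot            = 5,			/* assumed unused slot */
		.flags           = 0,
		.guest_phys_addr = gpa,			/* guest EPC base */
		.memory_size     = size,
		.userspace_addr  = (uintptr_t)vepc,	/* from mmap() of /dev/sgx_vepc */
	};

	if (ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region) < 0) {
		perror("KVM_SET_USER_MEMORY_REGION");
		return -1;
	}

	return 0;
}

On the KVM side the slot itself is ordinary guest memory; the SGX-specific
handling is what the remainder of this series adds.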