mbox series

[RFC,v3,00/27] KVM SGX virtualization support

Message ID cover.1611634586.git.kai.huang@intel.com (mailing list archive)
Headers show
Series KVM SGX virtualization support | expand

Message

Huang, Kai Jan. 26, 2021, 10:10 a.m. UTC
--- Disclaimer ---

These patches were originally written by Sean Christopherson while at Intel.
Now that Sean has left Intel, I (Kai) have taken over getting them upstream.
This series needs more review before it can be merged.  It is being posted
publicly and under RFC so Sean and others can review it. Maintainers are safe
ignoring it for now.

------------------

Hi all,

This series adds KVM SGX virtualization support. The first 15 patches starting
with x86/sgx or x86/cpu.. are necessary changes to x86 and SGX core/driver to
support KVM SGX virtualization, while the rest are patches to KVM subsystem.

Please help to review this series. Any feedback is highly appreciated.
Please let me know if I forgot to CC anyone, or anyone wants to be removed from
CC. Thanks in advance!

This series is based against tip/x86/sgx. You can also get the code from
upstream branch of kvm-sgx repo on github:

        https://github.com/intel/kvm-sgx.git upstream

It also requires Qemu changes to create VM with SGX support. You can find Qemu
repo here:

	https://github.com/intel/qemu-sgx.git upstream

Please refer to README.md of above qemu-sgx repo for detail on how to create
guest with SGX support. At meantime, for your quick reference you can use below
command to create SGX guest:

	#qemu-system-x86_64 -smp 4 -m 2G -drive file=<your_vm_image>,if=virtio \
		-cpu host,+sgx_provisionkey \
		-sgx-epc id=epc1,memdev=mem1 \
		-object memory-backend-epc,id=mem1,size=64M,prealloc

Please note that the SGX relevant part is:

		-cpu host,+sgx_provisionkey \
		-sgx-epc id=epc1,memdev=mem1 \
		-object memory-backend-epc,id=mem1,size=64M,prealloc

And you can change other parameters of your qemu command based on your needs.

=========
Changelog:

(Changelog here is for global changes. Please see each patch's changelog for
 changes made to specific patch.)

v2->v3:

 - Split original "x86/cpufeatures: Add SGX1 and SGX2 sub-features" patch into
   two patches, by splitting moving SGX_LC bit also into cpuid-deps table logic
   into a separate patch 2:
       [RFC PATCH v3 01/27] x86/cpufeatures: Add SGX1 and SGX2 sub-features
       [RFC PATCH v3 02/27] x86/cpufeatures: Make SGX_LC feature bit depend on SGX bit
 - Changed from /dev/sgx_virt_epc to /dev/sgx_vepc, per Jarkko. And accordingly,
   changed prefix 'sgx_virt_epc_xx' to 'sgx_vepc_xx' in various functions and
   structures.
 - Changed CONFIG_X86_SGX_VIRTUALIZATION to CONFIG_X86_SGX_KVM, per Dave. Couple
   of x86 patches and KVM patches are changed too due to the renaming.

v1->v2:

 - Refined this cover letter by addressing comments from Dave and Jarkko.
 - The original patch which introduced new X86_FEATURE_SGX1/SGX2 were replaced
   by 3 new patches from Sean, following Boris and Sean's discussion.
       [RFC PATCH v2 01/26] x86/cpufeatures: Add SGX1 and SGX2 sub-features
       [RFC PATCH v2 18/26] KVM: x86: Add support for reverse CPUID lookup of scattered features
       [RFC PATCH v2 19/26] KVM: x86: Add reverse-CPUID lookup support for scattered SGX features
 - The original patch 1
       x86/sgx: Split out adding EPC page to free list to separate helper
   was replaced with 2 new patches from Jarkko
       [RFC PATCH v2 02/26] x86/sgx: Remove a warn from sgx_free_epc_page()
       [RFC PATCH v2 03/26] x86/sgx: Wipe out EREMOVE from sgx_free_epc_page()
   addressing Jarkko's comments.
 - Moved modifying sgx_init() to always initialize sgx_virt_epc_init() out of
   patch
       x86/sgx: Introduce virtual EPC for use by KVM guests
   to a separate patch:
       [RFC PATCH v2 07/26] x86/sgx: Initialize virtual EPC driver even when SGX driver is disabled
   to address Dave's comment that patch ordering can be improved due to before
   patch "Allow SGX virtualization without Launch Control support", all SGX,
   including SGX virtualization, is actually disabled when SGX LC is not
   present.

=========
KVM SGX virtualization Overview

- Virtual EPC

SGX enclave memory is special and is reserved specifically for enclave use.
In bare-metal SGX enclaves, the kernel allocates enclave pages, copies data
into the pages with privileged instructions, then allows the enclave to start.
In this scenario, only initialized pages already assigned to an enclave are
mapped to userspace.

In virtualized environments, the hypervisor still needs to do the physical
enclave page allocation.  The guest kernel is responsible for the data copying
(among other things).  This means that the job of starting an enclave is now
split between hypervisor and guest.

This series introduces a new misc device: /dev/sgx_vepc.  This device allows
the host to map *uninitialized* enclave memory into userspace, which can then
be passed into a guest.

While it might be *possible* to start a host-side enclave with /dev/sgx_enclave
and pass its memory into a guest, it would be wasteful and convoluted.

Implement the *raw* EPC allocation in the x86 core-SGX subsystem via
/dev/sgx_vepc rather than in KVM.  Doing so has two major advantages:

  - Does not require changes to KVM's uAPI, e.g. EPC gets handled as
    just another memory backend for guests.

  - EPC management is wholly contained in the SGX subsystem, e.g. SGX
    does not have to export any symbols, changes to reclaim flows don't
    need to be routed through KVM, SGX's dirty laundry doesn't have to
    get aired out for the world to see, and so on and so forth.

The virtual EPC pages allocated to guests are currently not reclaimable.
Reclaiming EPC page used by enclave requires a special reclaim mechanism
separate from normal page reclaim, and that mechanism is not supported
for virutal EPC pages.  Due to the complications of handling reclaim
conflicts between guest and host, reclaiming virtual EPC pages is 
significantly more complex than basic support for SGX virtualization.

- Support SGX virtualization without SGX Flexible Launch Control

SGX hardware supports two "launch control" modes to limit which enclaves can
run.  In the "locked" mode, the hardware prevents enclaves from running unless
they are blessed by a third party.  In the unlocked mode, the kernel is in
full control of which enclaves can run.  The bare-metal SGX code refuses to
launch enclaves unless it is in the unlocked mode.

This sgx_virt_epc driver does not have such a restriction.  This allows guests
which are OK with the locked mode to use SGX, even if the host kernel refuses
to.

- Support exposing SGX2

Due to the same reason above, SGX2 feature detection is added to core SGX code
to allow KVM to expose SGX2 to guest, even currently SGX driver doesn't support
SGX2, because SGX2 can work just fine in guest w/o any interaction to host SGX
driver.

- Restricit SGX guest access to provisioning key

To grant guest being able to fully use SGX, guest needs to be able to access
provisioning key.  The provisioning key is sensitive, and accessing to it should
be restricted. In bare-metal driver, allowing enclave to access provisioning key
is restricted by being able to open /dev/sgx_provision.

Add a new KVM_CAP_SGX_ATTRIBUTE to KVM uAPI to extend above mechanism to KVM
guests as well.  When userspace hypervisor creates a new VM, the new cap is only
added to VM when userspace hypervisior is able to open /dev/sgx_provision,
following the same role as in bare-metal driver.  KVM then traps ECREATE from
guest, and only allows ECREATE with provisioning key bit to run when guest
supports KVM_CAP_SGX_ATTRIBUTE.

Jarkko Sakkinen (2):
  x86/sgx: Remove a warn from sgx_free_epc_page()
  x86/sgx: Wipe out EREMOVE from sgx_free_epc_page()

Kai Huang (3):
  x86/cpufeatures: Make SGX_LC feature bit depend on SGX bit
  x86/sgx: Initialize virtual EPC driver even when SGX driver is
    disabled
  x86/sgx: Add helper to update SGX_LEPUBKEYHASHn MSRs

Sean Christopherson (22):
  x86/cpufeatures: Add SGX1 and SGX2 sub-features
  x86/sgx: Add SGX_CHILD_PRESENT hardware error code
  x86/sgx: Introduce virtual EPC for use by KVM guests
  x86/cpu/intel: Allow SGX virtualization without Launch Control support
  x86/sgx: Expose SGX architectural definitions to the kernel
  x86/sgx: Move ENCLS leaf definitions to sgx_arch.h
  x86/sgx: Add SGX2 ENCLS leaf definitions (EAUG, EMODPR and EMODT)
  x86/sgx: Add encls_faulted() helper
  x86/sgx: Add helpers to expose ECREATE and EINIT to KVM
  x86/sgx: Move provisioning device creation out of SGX driver
  KVM: VMX: Convert vcpu_vmx.exit_reason to a union
  KVM: x86: Export kvm_mmu_gva_to_gpa_{read,write}() for SGX (VMX)
  KVM: x86: Define new #PF SGX error code bit
  KVM: x86: Add support for reverse CPUID lookup of scattered features
  KVM: x86: Add reverse-CPUID lookup support for scattered SGX features
  KVM: VMX: Add basic handling of VM-Exit from SGX enclave
  KVM: VMX: Frame in ENCLS handler for SGX virtualization
  KVM: VMX: Add SGX ENCLS[ECREATE] handler to enforce CPUID restrictions
  KVM: VMX: Add emulation of SGX Launch Control LE hash MSRs
  KVM: VMX: Add ENCLS[EINIT] handler to support SGX Launch Control (LC)
  KVM: VMX: Enable SGX virtualization for SGX1, SGX2 and LC
  KVM: x86: Add capability to grant VM access to privileged SGX
    attribute

 Documentation/virt/kvm/api.rst                |  23 +
 arch/x86/Kconfig                              |  12 +
 arch/x86/include/asm/cpufeatures.h            |   2 +
 arch/x86/include/asm/kvm_host.h               |   5 +
 arch/x86/include/asm/sgx.h                    |  19 +
 .../cpu/sgx/arch.h => include/asm/sgx_arch.h} |  20 +
 arch/x86/include/asm/vmx.h                    |   1 +
 arch/x86/include/uapi/asm/vmx.h               |   1 +
 arch/x86/kernel/cpu/cpuid-deps.c              |   3 +
 arch/x86/kernel/cpu/feat_ctl.c                |  70 ++-
 arch/x86/kernel/cpu/scattered.c               |   2 +
 arch/x86/kernel/cpu/sgx/Makefile              |   1 +
 arch/x86/kernel/cpu/sgx/driver.c              |  17 -
 arch/x86/kernel/cpu/sgx/encl.c                |  15 +-
 arch/x86/kernel/cpu/sgx/encls.h               |  30 +-
 arch/x86/kernel/cpu/sgx/ioctl.c               |  23 +-
 arch/x86/kernel/cpu/sgx/main.c                |  87 +++-
 arch/x86/kernel/cpu/sgx/sgx.h                 |   4 +-
 arch/x86/kernel/cpu/sgx/virt.c                | 347 +++++++++++++
 arch/x86/kernel/cpu/sgx/virt.h                |  14 +
 arch/x86/kvm/Makefile                         |   2 +
 arch/x86/kvm/cpuid.c                          |  89 +++-
 arch/x86/kvm/cpuid.h                          |  50 +-
 arch/x86/kvm/vmx/nested.c                     |  70 ++-
 arch/x86/kvm/vmx/nested.h                     |   5 +
 arch/x86/kvm/vmx/sgx.c                        | 462 ++++++++++++++++++
 arch/x86/kvm/vmx/sgx.h                        |  34 ++
 arch/x86/kvm/vmx/vmcs12.c                     |   1 +
 arch/x86/kvm/vmx/vmcs12.h                     |   4 +-
 arch/x86/kvm/vmx/vmx.c                        | 171 +++++--
 arch/x86/kvm/vmx/vmx.h                        |  27 +-
 arch/x86/kvm/x86.c                            |  24 +
 include/uapi/linux/kvm.h                      |   1 +
 tools/testing/selftests/sgx/defines.h         |   2 +-
 34 files changed, 1482 insertions(+), 156 deletions(-)
 create mode 100644 arch/x86/include/asm/sgx.h
 rename arch/x86/{kernel/cpu/sgx/arch.h => include/asm/sgx_arch.h} (96%)
 create mode 100644 arch/x86/kernel/cpu/sgx/virt.c
 create mode 100644 arch/x86/kernel/cpu/sgx/virt.h
 create mode 100644 arch/x86/kvm/vmx/sgx.c
 create mode 100644 arch/x86/kvm/vmx/sgx.h

Comments

Dave Hansen Jan. 26, 2021, 3:34 p.m. UTC | #1
On 1/26/21 1:30 AM, Kai Huang wrote:
> From: Sean Christopherson <seanjc@google.com>
> 
> Add SGX1 and SGX2 feature flags, via CPUID.0x12.0x0.EAX, as scattered
> features, since adding a new leaf for only two bits would be wasteful.
> As part of virtualizing SGX, KVM will expose the SGX CPUID leafs to its
> guest, and to do so correctly needs to query hardware and kernel support
> for SGX1 and SGX2.

It's also not _just_ exposing the CPUID leaves.  There are some checks
here when KVM is emulating some SGX instructions too, right?

> diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> index 84b887825f12..18b2d0c8bbbe 100644
> --- a/arch/x86/include/asm/cpufeatures.h
> +++ b/arch/x86/include/asm/cpufeatures.h
> @@ -292,6 +292,8 @@
>  #define X86_FEATURE_FENCE_SWAPGS_KERNEL	(11*32+ 5) /* "" LFENCE in kernel entry SWAPGS path */
>  #define X86_FEATURE_SPLIT_LOCK_DETECT	(11*32+ 6) /* #AC for split lock */
>  #define X86_FEATURE_PER_THREAD_MBA	(11*32+ 7) /* "" Per-thread Memory Bandwidth Allocation */
> +#define X86_FEATURE_SGX1		(11*32+ 8) /* Software Guard Extensions sub-feature SGX1 */
> +#define X86_FEATURE_SGX2        	(11*32+ 9) /* Software Guard Extensions sub-feature SGX2 */

FWIW, I'm not sure how valuable it is to spell the SGX acronym out three
times.  Can't we use those bytes to put something more useful in that
comment?
Dave Hansen Jan. 26, 2021, 3:35 p.m. UTC | #2
On 1/26/21 1:30 AM, Kai Huang wrote:
> Move SGX_LC feature bit to CPUID dependency table as well, along with
> new added SGX1 and SGX2 bit, to make clearing all SGX feature bits
> easier. Also remove clear_sgx_caps() since it is just a wrapper of
> setup_clear_cpu_cap(X86_FEATURE_SGX) now.
> 
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Kai Huang <kai.huang@intel.com>

Looks good:

Acked-by: Dave Hansen <dave.hansen@intel.com>
Dave Hansen Jan. 26, 2021, 3:39 p.m. UTC | #3
On 1/26/21 1:30 AM, Kai Huang wrote:
> Remove SGX_EPC_PAGE_RECLAIMER_TRACKED check and warning.  This cannot
> happen, as enclave pages are freed only at the time when encl->refcount
> triggers, i.e. when both VFS and the page reclaimer have given up on
> their references.
> 
> Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> ---
>  arch/x86/kernel/cpu/sgx/main.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> index 8df81a3ed945..f330abdb5bb1 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -605,8 +605,6 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
>  	struct sgx_epc_section *section = &sgx_epc_sections[page->section];
>  	int ret;
>  
> -	WARN_ON_ONCE(page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED);

I'm all for cleaning up silly, useless warnings.  But, don't we usually
put warnings in for things that we don't expect to be able to happen?

In other words, I'm fine with removing this if it hasn't been a valuable
warning and we don't expect it to become a valuable warning.  But, the
changelog doesn't say that.  It also doesn't explain what this patch is
doing in this series.

Why is this her?
Dave Hansen Jan. 26, 2021, 3:49 p.m. UTC | #4
On 1/26/21 1:30 AM, Kai Huang wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
> 
> SGX virtualization requires to allocate "raw" EPC and use it as "virtual
> EPC" for SGX guest.  Unlike EPC used by SGX driver, virtual EPC doesn't
> track how EPC pages are used in VM, e.g. (de)construction of enclaves,
> so it cannot guarantee EREMOVE success, e.g. it doesn't have a priori
> knowledge of which pages are SECS with non-zero child counts.

The grammar there is a bit questionable in spots.  Here's a rewrite:

SGX can accurately track how bare-metal enclave pages are used.  This
enables SECS to be specifically targeted and EREMOVE'd only after all
child pages have been EREMOVE'd.  This ensures that bare-metal SGX will
never encounter SGX_CHILD_PRESENT in normal operation.

Virtual EPC is different.  The host does not track how EPC pages are
used by the guest, so it cannot guarantee EREMOVE success.  It might,
for instance, encounter a SECS with a non-zero child count.

Aside: Would it be *possible* for the host to figure out where the SECS
pages are?  If not, we can say "host can not track" versus what I said:
"host does not track".

> Add SGX_CHILD_PRESENT for use by SGX virtualization to assert EREMOVE
> failures are expected, but only due to SGX_CHILD_PRESENT.
> 
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
> Signed-off-by: Kai Huang <kai.huang@intel.com>

With the improved changelog:

Acked-by: Dave Hansen <dave.hansen@intel.com>
Dave Hansen Jan. 26, 2021, 4:04 p.m. UTC | #5
On 1/26/21 1:30 AM, Kai Huang wrote:
> From: Jarkko Sakkinen <jarkko@kernel.org>
> 
> Encapsulate the snippet in sgx_free_epc_page() concerning EREMOVE to
> sgx_reset_epc_page(), which is a static helper function for
> sgx_encl_release().  It's the only function existing, which deals with
> initialized pages.

Yikes.  I have no idea what that is saying.  Here's a rewrite:

EREMOVE takes a pages and removes any association between that page and
an enclave.  It must be run on a page before it can be added into
another enclave.  Currently, EREMOVE is run as part of pages being freed
into the SGX page allocator.  It is not expected to fail.

KVM does not track how guest pages are used, which means that SGX
virtualization use of EREMOVE might fail.

Break out the EREMOVE call from the SGX page allocator.  This will allow
the SGX virtualization code to use the allocator directly.  (SGX/KVM
will also introduce a more permissive EREMOVE helper).

> diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
> index ee50a5010277..a78b71447771 100644
> --- a/arch/x86/kernel/cpu/sgx/encl.c
> +++ b/arch/x86/kernel/cpu/sgx/encl.c
> @@ -389,6 +389,16 @@ const struct vm_operations_struct sgx_vm_ops = {
>  	.access = sgx_vma_access,
>  };
>  
> +
> +static void sgx_reset_epc_page(struct sgx_epc_page *epc_page)
> +{
> +	int ret;
> +
> +	ret = __eremove(sgx_get_epc_virt_addr(epc_page));
> +	if (WARN_ONCE(ret, "EREMOVE returned %d (0x%x)", ret, ret))
> +		return;
> +}
> +
>  /**
>   * sgx_encl_release - Destroy an enclave instance
>   * @kref:	address of a kref inside &sgx_encl
> @@ -412,6 +422,7 @@ void sgx_encl_release(struct kref *ref)
>  			if (sgx_unmark_page_reclaimable(entry->epc_page))
>  				continue;
>  
> +			sgx_reset_epc_page(entry->epc_page);
>  			sgx_free_epc_page(entry->epc_page);
>  			encl->secs_child_cnt--;
>  			entry->epc_page = NULL;
> @@ -423,6 +434,7 @@ void sgx_encl_release(struct kref *ref)
>  	xa_destroy(&encl->page_array);
>  
>  	if (!encl->secs_child_cnt && encl->secs.epc_page) {
> +		sgx_reset_epc_page(encl->secs.epc_page);
>  		sgx_free_epc_page(encl->secs.epc_page);
>  		encl->secs.epc_page = NULL;
>  	}
> @@ -431,6 +443,7 @@ void sgx_encl_release(struct kref *ref)
>  		va_page = list_first_entry(&encl->va_pages, struct sgx_va_page,
>  					   list);
>  		list_del(&va_page->list);
> +		sgx_reset_epc_page(va_page->epc_page);
>  		sgx_free_epc_page(va_page->epc_page);
>  		kfree(va_page);
>  	}
> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> index f330abdb5bb1..21c2ffa13870 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -598,16 +598,14 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
>   * sgx_free_epc_page() - Free an EPC page
>   * @page:	an EPC page
>   *
> - * Call EREMOVE for an EPC page and insert it back to the list of free pages.
> + * Put the EPC page back to the list of free pages. It's the callers

"caller's"

> + * responsibility to make sure that the page is in uninitialized state In other

Period after "state", please.

> + * words, do EREMOVE, EWB or whatever operation is necessary before calling
> + * this function.
>   */

OK, so if you're going to say "the caller must put the page in
uninitialized state", let's also add a comment to the place that *DO*
that, like the shiny new sgx_reset_epc_page().
Sean Christopherson Jan. 26, 2021, 4:30 p.m. UTC | #6
On Tue, Jan 26, 2021, Dave Hansen wrote:
> On 1/26/21 1:30 AM, Kai Huang wrote:
> > Remove SGX_EPC_PAGE_RECLAIMER_TRACKED check and warning.  This cannot
> > happen, as enclave pages are freed only at the time when encl->refcount
> > triggers, i.e. when both VFS and the page reclaimer have given up on
> > their references.
> > 
> > Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
> > Signed-off-by: Kai Huang <kai.huang@intel.com>
> > ---
> >  arch/x86/kernel/cpu/sgx/main.c | 2 --
> >  1 file changed, 2 deletions(-)
> > 
> > diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> > index 8df81a3ed945..f330abdb5bb1 100644
> > --- a/arch/x86/kernel/cpu/sgx/main.c
> > +++ b/arch/x86/kernel/cpu/sgx/main.c
> > @@ -605,8 +605,6 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
> >  	struct sgx_epc_section *section = &sgx_epc_sections[page->section];
> >  	int ret;
> >  
> > -	WARN_ON_ONCE(page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED);
> 
> I'm all for cleaning up silly, useless warnings.  But, don't we usually
> put warnings in for things that we don't expect to be able to happen?
> 
> In other words, I'm fine with removing this if it hasn't been a valuable
> warning and we don't expect it to become a valuable warning.

Ya, I don't understand the motivation for removing this warning.  I tripped it
more than once in the past during one of the many rebases of the virtual EPC
and EPC cgroup branches.
Huang, Kai Jan. 26, 2021, 11:18 p.m. UTC | #7
On Tue, 2021-01-26 at 07:34 -0800, Dave Hansen wrote:
> On 1/26/21 1:30 AM, Kai Huang wrote:
> > From: Sean Christopherson <seanjc@google.com>
> > 
> > Add SGX1 and SGX2 feature flags, via CPUID.0x12.0x0.EAX, as scattered
> > features, since adding a new leaf for only two bits would be wasteful.
> > As part of virtualizing SGX, KVM will expose the SGX CPUID leafs to its
> > guest, and to do so correctly needs to query hardware and kernel support
> > for SGX1 and SGX2.
> 
> It's also not _just_ exposing the CPUID leaves.  There are some checks
> here when KVM is emulating some SGX instructions too, right?

I would say trapping instead of emulating, but yes KVM will do more. However those
are quite details, and I don't think we should put lots of details here. Or perhaps
we can use 'for instance' as brief description:

As part of virtualizing SGX, KVM will need to use the two flags, for instance, to
expose them to guest.

?

> 
> > diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> > index 84b887825f12..18b2d0c8bbbe 100644
> > --- a/arch/x86/include/asm/cpufeatures.h
> > +++ b/arch/x86/include/asm/cpufeatures.h
> > @@ -292,6 +292,8 @@
> >  #define X86_FEATURE_FENCE_SWAPGS_KERNEL	(11*32+ 5) /* "" LFENCE in kernel entry SWAPGS path */
> >  #define X86_FEATURE_SPLIT_LOCK_DETECT	(11*32+ 6) /* #AC for split lock */
> >  #define X86_FEATURE_PER_THREAD_MBA	(11*32+ 7) /* "" Per-thread Memory Bandwidth Allocation */
> > +#define X86_FEATURE_SGX1		(11*32+ 8) /* Software Guard Extensions sub-feature SGX1 */
> > +#define X86_FEATURE_SGX2        	(11*32+ 9) /* Software Guard Extensions sub-feature SGX2 */
> 
> FWIW, I'm not sure how valuable it is to spell the SGX acronym out three
> times.  Can't we use those bytes to put something more useful in that
> comment?

I think we can remove comment for SGX1, since it is basically SGX.

For SGX2, how about below?

/* SGX Enclave Dynamic Memory Management */
Huang, Kai Jan. 27, 2021, midnight UTC | #8
On Tue, 2021-01-26 at 07:49 -0800, Dave Hansen wrote:
> On 1/26/21 1:30 AM, Kai Huang wrote:
> > From: Sean Christopherson <sean.j.christopherson@intel.com>
> > 
> > SGX virtualization requires to allocate "raw" EPC and use it as "virtual
> > EPC" for SGX guest.  Unlike EPC used by SGX driver, virtual EPC doesn't
> > track how EPC pages are used in VM, e.g. (de)construction of enclaves,
> > so it cannot guarantee EREMOVE success, e.g. it doesn't have a priori
> > knowledge of which pages are SECS with non-zero child counts.
> 
> The grammar there is a bit questionable in spots.  Here's a rewrite:
> 
> SGX can accurately track how bare-metal enclave pages are used.  This
> enables SECS to be specifically targeted and EREMOVE'd only after all
> child pages have been EREMOVE'd.  This ensures that bare-metal SGX will
> never encounter SGX_CHILD_PRESENT in normal operation.

How about:

"SGX driver can accurate track how enclave pages are used. This enables..."

Since in another email, you mentioned that we should get rid of bare-metal driver,
and Andy suggested we can just use SGX driver?

> 
> Virtual EPC is different.  The host does not track how EPC pages are
> used by the guest, so it cannot guarantee EREMOVE success.  It might,
> for instance, encounter a SECS with a non-zero child count.
> 
> Aside: Would it be *possible* for the host to figure out where the SECS
> pages are?  If not, we can say "host can not track" versus what I said:
> "host does not track".

Technically it is possible, so "host does not track" is more reasonable.

> 
> > Add SGX_CHILD_PRESENT for use by SGX virtualization to assert EREMOVE
> > failures are expected, but only due to SGX_CHILD_PRESENT.
> > 
> > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> > Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
> > Signed-off-by: Kai Huang <kai.huang@intel.com>
> 
> With the improved changelog:
> 
> Acked-by: Dave Hansen <dave.hansen@intel.com>

Thanks.
Dave Hansen Jan. 27, 2021, 12:21 a.m. UTC | #9
On 1/26/21 4:00 PM, Kai Huang wrote:
> On Tue, 2021-01-26 at 07:49 -0800, Dave Hansen wrote:
>> On 1/26/21 1:30 AM, Kai Huang wrote:
>>> From: Sean Christopherson <sean.j.christopherson@intel.com>
>>>
>>> SGX virtualization requires to allocate "raw" EPC and use it as "virtual
>>> EPC" for SGX guest.  Unlike EPC used by SGX driver, virtual EPC doesn't
>>> track how EPC pages are used in VM, e.g. (de)construction of enclaves,
>>> so it cannot guarantee EREMOVE success, e.g. it doesn't have a priori
>>> knowledge of which pages are SECS with non-zero child counts.
>>
>> The grammar there is a bit questionable in spots.  Here's a rewrite:
>>
>> SGX can accurately track how bare-metal enclave pages are used.  This
>> enables SECS to be specifically targeted and EREMOVE'd only after all
>> child pages have been EREMOVE'd.  This ensures that bare-metal SGX will
>> never encounter SGX_CHILD_PRESENT in normal operation.
> 
> How about:
> 
> "SGX driver can accurate track how enclave pages are used. This enables..."
> 
> Since in another email, you mentioned that we should get rid of bare-metal driver,
> and Andy suggested we can just use SGX driver?

<sigh>

Sure, but with correct grammar, please.

"SGX driver can accurately track how enclave pages are used. This
enables..."

Seriously, if you just paste the sentences into Word, it will highlight
this and tell you.
Huang, Kai Jan. 27, 2021, 12:52 a.m. UTC | #10
On Tue, 26 Jan 2021 16:21:36 -0800 Dave Hansen wrote:
> On 1/26/21 4:00 PM, Kai Huang wrote:
> > On Tue, 2021-01-26 at 07:49 -0800, Dave Hansen wrote:
> >> On 1/26/21 1:30 AM, Kai Huang wrote:
> >>> From: Sean Christopherson <sean.j.christopherson@intel.com>
> >>>
> >>> SGX virtualization requires to allocate "raw" EPC and use it as "virtual
> >>> EPC" for SGX guest.  Unlike EPC used by SGX driver, virtual EPC doesn't
> >>> track how EPC pages are used in VM, e.g. (de)construction of enclaves,
> >>> so it cannot guarantee EREMOVE success, e.g. it doesn't have a priori
> >>> knowledge of which pages are SECS with non-zero child counts.
> >>
> >> The grammar there is a bit questionable in spots.  Here's a rewrite:
> >>
> >> SGX can accurately track how bare-metal enclave pages are used.  This
> >> enables SECS to be specifically targeted and EREMOVE'd only after all
> >> child pages have been EREMOVE'd.  This ensures that bare-metal SGX will
> >> never encounter SGX_CHILD_PRESENT in normal operation.
> > 
> > How about:
> > 
> > "SGX driver can accurate track how enclave pages are used. This enables..."
> > 
> > Since in another email, you mentioned that we should get rid of bare-metal driver,
> > and Andy suggested we can just use SGX driver?
> 
> <sigh>
> 
> Sure, but with correct grammar, please.
> 
> "SGX driver can accurately track how enclave pages are used. This
> enables..."
> 
> Seriously, if you just paste the sentences into Word, it will highlight
> this and tell you.

Thanks. My fault.
Huang, Kai Jan. 27, 2021, 1:08 a.m. UTC | #11
On Tue, 2021-01-26 at 07:39 -0800, Dave Hansen wrote:
> On 1/26/21 1:30 AM, Kai Huang wrote:
> > Remove SGX_EPC_PAGE_RECLAIMER_TRACKED check and warning.  This cannot
> > happen, as enclave pages are freed only at the time when encl->refcount
> > triggers, i.e. when both VFS and the page reclaimer have given up on
> > their references.
> > 
> > Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
> > Signed-off-by: Kai Huang <kai.huang@intel.com>
> > ---
> >  arch/x86/kernel/cpu/sgx/main.c | 2 --
> >  1 file changed, 2 deletions(-)
> > 
> > diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> > index 8df81a3ed945..f330abdb5bb1 100644
> > --- a/arch/x86/kernel/cpu/sgx/main.c
> > +++ b/arch/x86/kernel/cpu/sgx/main.c
> > @@ -605,8 +605,6 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
> >  	struct sgx_epc_section *section = &sgx_epc_sections[page->section];
> >  	int ret;
> >  
> > 
> > -	WARN_ON_ONCE(page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED);
> 
> I'm all for cleaning up silly, useless warnings.  But, don't we usually
> put warnings in for things that we don't expect to be able to happen?
> 
> In other words, I'm fine with removing this if it hasn't been a valuable
> warning and we don't expect it to become a valuable warning.  But, the
> changelog doesn't say that.  It also doesn't explain what this patch is
> doing in this series.
> 
> Why is this her?

Hi Jarkko,

I don't have deep understanding of SGX driver. Would you help to answer?
Dave Hansen Jan. 27, 2021, 1:12 a.m. UTC | #12
On 1/26/21 5:08 PM, Kai Huang wrote:
> I don't have deep understanding of SGX driver. Would you help to answer?

Kai, as the patch submitter, you are expected to be able to at least
minimally explain what the patch is doing.  Please endeavor to obtain
this understanding before sending patches in the future.
Huang, Kai Jan. 27, 2021, 1:25 a.m. UTC | #13
On Tue, 26 Jan 2021 08:04:35 -0800 Dave Hansen wrote:
> On 1/26/21 1:30 AM, Kai Huang wrote:
> > From: Jarkko Sakkinen <jarkko@kernel.org>
> > 
> > Encapsulate the snippet in sgx_free_epc_page() concerning EREMOVE to
> > sgx_reset_epc_page(), which is a static helper function for
> > sgx_encl_release().  It's the only function existing, which deals with
> > initialized pages.
> 
> Yikes.  I have no idea what that is saying.  Here's a rewrite:
> 
> EREMOVE takes a pages and removes any association between that page and
> an enclave.  It must be run on a page before it can be added into
> another enclave.  Currently, EREMOVE is run as part of pages being freed
> into the SGX page allocator.  It is not expected to fail.
> 
> KVM does not track how guest pages are used, which means that SGX
> virtualization use of EREMOVE might fail.
> 
> Break out the EREMOVE call from the SGX page allocator.  This will allow
> the SGX virtualization code to use the allocator directly.  (SGX/KVM
> will also introduce a more permissive EREMOVE helper).

Thanks.

Hi Jarkko,

Do you want me to update your patch directly, or do you want to take the
change, and send me the patch again?

> 
> > diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
> > index ee50a5010277..a78b71447771 100644
> > --- a/arch/x86/kernel/cpu/sgx/encl.c
> > +++ b/arch/x86/kernel/cpu/sgx/encl.c
> > @@ -389,6 +389,16 @@ const struct vm_operations_struct sgx_vm_ops = {
> >  	.access = sgx_vma_access,
> >  };
> >  
> > +
> > +static void sgx_reset_epc_page(struct sgx_epc_page *epc_page)
> > +{
> > +	int ret;
> > +
> > +	ret = __eremove(sgx_get_epc_virt_addr(epc_page));
> > +	if (WARN_ONCE(ret, "EREMOVE returned %d (0x%x)", ret, ret))
> > +		return;
> > +}
> > +
> >  /**
> >   * sgx_encl_release - Destroy an enclave instance
> >   * @kref:	address of a kref inside &sgx_encl
> > @@ -412,6 +422,7 @@ void sgx_encl_release(struct kref *ref)
> >  			if (sgx_unmark_page_reclaimable(entry->epc_page))
> >  				continue;
> >  
> > +			sgx_reset_epc_page(entry->epc_page);
> >  			sgx_free_epc_page(entry->epc_page);
> >  			encl->secs_child_cnt--;
> >  			entry->epc_page = NULL;
> > @@ -423,6 +434,7 @@ void sgx_encl_release(struct kref *ref)
> >  	xa_destroy(&encl->page_array);
> >  
> >  	if (!encl->secs_child_cnt && encl->secs.epc_page) {
> > +		sgx_reset_epc_page(encl->secs.epc_page);
> >  		sgx_free_epc_page(encl->secs.epc_page);
> >  		encl->secs.epc_page = NULL;
> >  	}
> > @@ -431,6 +443,7 @@ void sgx_encl_release(struct kref *ref)
> >  		va_page = list_first_entry(&encl->va_pages, struct sgx_va_page,
> >  					   list);
> >  		list_del(&va_page->list);
> > +		sgx_reset_epc_page(va_page->epc_page);
> >  		sgx_free_epc_page(va_page->epc_page);
> >  		kfree(va_page);
> >  	}
> > diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> > index f330abdb5bb1..21c2ffa13870 100644
> > --- a/arch/x86/kernel/cpu/sgx/main.c
> > +++ b/arch/x86/kernel/cpu/sgx/main.c
> > @@ -598,16 +598,14 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
> >   * sgx_free_epc_page() - Free an EPC page
> >   * @page:	an EPC page
> >   *
> > - * Call EREMOVE for an EPC page and insert it back to the list of free pages.
> > + * Put the EPC page back to the list of free pages. It's the callers
> 
> "caller's"
> 
> > + * responsibility to make sure that the page is in uninitialized state In other
> 
> Period after "state", please.
> 
> > + * words, do EREMOVE, EWB or whatever operation is necessary before calling
> > + * this function.
> >   */
> 
> OK, so if you're going to say "the caller must put the page in
> uninitialized state", let's also add a comment to the place that *DO*
> that, like the shiny new sgx_reset_epc_page().

Hi Dave,

Sorry I am a little bit confused here. Do you mean we should add a comment in
sgx_reset_epc_page() to say, for instance: sgx_free_epc_page() requires the EPC
page already been EREMOVE'd?
Huang, Kai Jan. 27, 2021, 1:26 a.m. UTC | #14
On Tue, 26 Jan 2021 17:12:12 -0800 Dave Hansen wrote:
> On 1/26/21 5:08 PM, Kai Huang wrote:
> > I don't have deep understanding of SGX driver. Would you help to answer?
> 
> Kai, as the patch submitter, you are expected to be able to at least
> minimally explain what the patch is doing.  Please endeavor to obtain
> this understanding before sending patches in the future.

I see. Thanks.
Jarkko Sakkinen Jan. 30, 2021, 1:20 p.m. UTC | #15
On Wed, Jan 27, 2021 at 12:18:32PM +1300, Kai Huang wrote:
> On Tue, 2021-01-26 at 07:34 -0800, Dave Hansen wrote:
> > On 1/26/21 1:30 AM, Kai Huang wrote:
> > > From: Sean Christopherson <seanjc@google.com>
> > > 
> > > Add SGX1 and SGX2 feature flags, via CPUID.0x12.0x0.EAX, as scattered
> > > features, since adding a new leaf for only two bits would be wasteful.
> > > As part of virtualizing SGX, KVM will expose the SGX CPUID leafs to its
> > > guest, and to do so correctly needs to query hardware and kernel support
> > > for SGX1 and SGX2.
> > 
> > It's also not _just_ exposing the CPUID leaves.  There are some checks
> > here when KVM is emulating some SGX instructions too, right?
> 
> I would say trapping instead of emulating, but yes KVM will do more. However those
> are quite details, and I don't think we should put lots of details here. Or perhaps
> we can use 'for instance' as brief description:
> 
> As part of virtualizing SGX, KVM will need to use the two flags, for instance, to
> expose them to guest.
> 
> ?
> 
> > 
> > > diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> > > index 84b887825f12..18b2d0c8bbbe 100644
> > > --- a/arch/x86/include/asm/cpufeatures.h
> > > +++ b/arch/x86/include/asm/cpufeatures.h
> > > @@ -292,6 +292,8 @@
> > >  #define X86_FEATURE_FENCE_SWAPGS_KERNEL	(11*32+ 5) /* "" LFENCE in kernel entry SWAPGS path */
> > >  #define X86_FEATURE_SPLIT_LOCK_DETECT	(11*32+ 6) /* #AC for split lock */
> > >  #define X86_FEATURE_PER_THREAD_MBA	(11*32+ 7) /* "" Per-thread Memory Bandwidth Allocation */
> > > +#define X86_FEATURE_SGX1		(11*32+ 8) /* Software Guard Extensions sub-feature SGX1 */
> > > +#define X86_FEATURE_SGX2        	(11*32+ 9) /* Software Guard Extensions sub-feature SGX2 */
> > 
> > FWIW, I'm not sure how valuable it is to spell the SGX acronym out three
> > times.  Can't we use those bytes to put something more useful in that
> > comment?
> 
> I think we can remove comment for SGX1, since it is basically SGX.
> 
> For SGX2, how about below?
> 
> /* SGX Enclave Dynamic Memory Management */

(EDMM)

/Jarkko
Huang, Kai Feb. 1, 2021, 12:01 a.m. UTC | #16
On Sat, 30 Jan 2021 15:20:54 +0200 Jarkko Sakkinen wrote:
> On Wed, Jan 27, 2021 at 12:18:32PM +1300, Kai Huang wrote:
> > On Tue, 2021-01-26 at 07:34 -0800, Dave Hansen wrote:
> > > On 1/26/21 1:30 AM, Kai Huang wrote:
> > > > From: Sean Christopherson <seanjc@google.com>
> > > > 
> > > > Add SGX1 and SGX2 feature flags, via CPUID.0x12.0x0.EAX, as scattered
> > > > features, since adding a new leaf for only two bits would be wasteful.
> > > > As part of virtualizing SGX, KVM will expose the SGX CPUID leafs to its
> > > > guest, and to do so correctly needs to query hardware and kernel support
> > > > for SGX1 and SGX2.
> > > 
> > > It's also not _just_ exposing the CPUID leaves.  There are some checks
> > > here when KVM is emulating some SGX instructions too, right?
> > 
> > I would say trapping instead of emulating, but yes KVM will do more. However those
> > are quite details, and I don't think we should put lots of details here. Or perhaps
> > we can use 'for instance' as brief description:
> > 
> > As part of virtualizing SGX, KVM will need to use the two flags, for instance, to
> > expose them to guest.
> > 
> > ?
> > 
> > > 
> > > > diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> > > > index 84b887825f12..18b2d0c8bbbe 100644
> > > > --- a/arch/x86/include/asm/cpufeatures.h
> > > > +++ b/arch/x86/include/asm/cpufeatures.h
> > > > @@ -292,6 +292,8 @@
> > > >  #define X86_FEATURE_FENCE_SWAPGS_KERNEL	(11*32+ 5) /* "" LFENCE in kernel entry SWAPGS path */
> > > >  #define X86_FEATURE_SPLIT_LOCK_DETECT	(11*32+ 6) /* #AC for split lock */
> > > >  #define X86_FEATURE_PER_THREAD_MBA	(11*32+ 7) /* "" Per-thread Memory Bandwidth Allocation */
> > > > +#define X86_FEATURE_SGX1		(11*32+ 8) /* Software Guard Extensions sub-feature SGX1 */
> > > > +#define X86_FEATURE_SGX2        	(11*32+ 9) /* Software Guard Extensions sub-feature SGX2 */
> > > 
> > > FWIW, I'm not sure how valuable it is to spell the SGX acronym out three
> > > times.  Can't we use those bytes to put something more useful in that
> > > comment?
> > 
> > I think we can remove comment for SGX1, since it is basically SGX.
> > 
> > For SGX2, how about below?
> > 
> > /* SGX Enclave Dynamic Memory Management */
> 
> (EDMM)

Does EDMM obvious to everyone, instead of explicitly saying Enclave Dynamic
Memory Management?

Also do you think we need a comment for SGX1 bit? I can add /* Basic SGX */,
but I am not sure whether it is required.
Huang, Kai Feb. 1, 2021, 12:11 a.m. UTC | #17
On Wed, 27 Jan 2021 14:26:52 +1300 Kai Huang wrote:
> On Tue, 26 Jan 2021 17:12:12 -0800 Dave Hansen wrote:
> > On 1/26/21 5:08 PM, Kai Huang wrote:
> > > I don't have deep understanding of SGX driver. Would you help to answer?
> > 
> > Kai, as the patch submitter, you are expected to be able to at least
> > minimally explain what the patch is doing.  Please endeavor to obtain
> > this understanding before sending patches in the future.
> 
> I see. Thanks.

Hi Jarkko,

I think I'll remove this patch in next version, since it is not related to KVM
SGX. And I'll rebase your second patch based on current tip/x86/sgx. You may
send out this patch independently. Let me know if you have comment.
Jarkko Sakkinen Feb. 2, 2021, 5:17 p.m. UTC | #18
On Mon, Feb 01, 2021 at 01:01:51PM +1300, Kai Huang wrote:
> On Sat, 30 Jan 2021 15:20:54 +0200 Jarkko Sakkinen wrote:
> > On Wed, Jan 27, 2021 at 12:18:32PM +1300, Kai Huang wrote:
> > > On Tue, 2021-01-26 at 07:34 -0800, Dave Hansen wrote:
> > > > On 1/26/21 1:30 AM, Kai Huang wrote:
> > > > > From: Sean Christopherson <seanjc@google.com>
> > > > > 
> > > > > Add SGX1 and SGX2 feature flags, via CPUID.0x12.0x0.EAX, as scattered
> > > > > features, since adding a new leaf for only two bits would be wasteful.
> > > > > As part of virtualizing SGX, KVM will expose the SGX CPUID leafs to its
> > > > > guest, and to do so correctly needs to query hardware and kernel support
> > > > > for SGX1 and SGX2.
> > > > 
> > > > It's also not _just_ exposing the CPUID leaves.  There are some checks
> > > > here when KVM is emulating some SGX instructions too, right?
> > > 
> > > I would say trapping instead of emulating, but yes KVM will do more. However those
> > > are quite details, and I don't think we should put lots of details here. Or perhaps
> > > we can use 'for instance' as brief description:
> > > 
> > > As part of virtualizing SGX, KVM will need to use the two flags, for instance, to
> > > expose them to guest.
> > > 
> > > ?
> > > 
> > > > 
> > > > > diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> > > > > index 84b887825f12..18b2d0c8bbbe 100644
> > > > > --- a/arch/x86/include/asm/cpufeatures.h
> > > > > +++ b/arch/x86/include/asm/cpufeatures.h
> > > > > @@ -292,6 +292,8 @@
> > > > >  #define X86_FEATURE_FENCE_SWAPGS_KERNEL	(11*32+ 5) /* "" LFENCE in kernel entry SWAPGS path */
> > > > >  #define X86_FEATURE_SPLIT_LOCK_DETECT	(11*32+ 6) /* #AC for split lock */
> > > > >  #define X86_FEATURE_PER_THREAD_MBA	(11*32+ 7) /* "" Per-thread Memory Bandwidth Allocation */
> > > > > +#define X86_FEATURE_SGX1		(11*32+ 8) /* Software Guard Extensions sub-feature SGX1 */
> > > > > +#define X86_FEATURE_SGX2        	(11*32+ 9) /* Software Guard Extensions sub-feature SGX2 */
> > > > 
> > > > FWIW, I'm not sure how valuable it is to spell the SGX acronym out three
> > > > times.  Can't we use those bytes to put something more useful in that
> > > > comment?
> > > 
> > > I think we can remove comment for SGX1, since it is basically SGX.
> > > 
> > > For SGX2, how about below?
> > > 
> > > /* SGX Enclave Dynamic Memory Management */
> > 
> > (EDMM)
> 
> Does EDMM obvious to everyone, instead of explicitly saying Enclave Dynamic
> Memory Management?
> 
> Also do you think we need a comment for SGX1 bit? I can add /* Basic SGX */,
> but I am not sure whether it is required.

I would put write the whole thing down and put EDMM to parentheses.

For SGX1 I would put "Basic SGX features for enclave construction".

/Jarkko
Paolo Bonzini Feb. 2, 2021, 5:56 p.m. UTC | #19
On 01/02/21 01:01, Kai Huang wrote:
>>> I think we can remove comment for SGX1, since it is basically SGX.
>>>
>>> For SGX2, how about below?
>>>
>>> /* SGX Enclave Dynamic Memory Management */
>> (EDMM)
> Does EDMM obvious to everyone, instead of explicitly saying Enclave Dynamic
> Memory Management?
> 
> Also do you think we need a comment for SGX1 bit? I can add /* Basic SGX */,
> but I am not sure whether it is required.
> 

Yes, please use

/* "" Basic SGX */
/* "" SGX Enclave Dynamic Memory Mgmt */

Paolo
Dave Hansen Feb. 2, 2021, 6 p.m. UTC | #20
On 2/2/21 9:56 AM, Paolo Bonzini wrote:
> On 01/02/21 01:01, Kai Huang wrote:
>>>> I think we can remove comment for SGX1, since it is basically SGX.
>>>>
>>>> For SGX2, how about below?
>>>>
>>>> /* SGX Enclave Dynamic Memory Management */
>>> (EDMM)
>> Does EDMM obvious to everyone, instead of explicitly saying Enclave
>> Dynamic
>> Memory Management?
>>
>> Also do you think we need a comment for SGX1 bit? I can add /* Basic
>> SGX */,
>> but I am not sure whether it is required.
> 
> Yes, please use
> 
> /* "" Basic SGX */
> /* "" SGX Enclave Dynamic Memory Mgmt */

Do you actually want to suppress these from /proc/cpuinfo with the ""?
Paolo Bonzini Feb. 2, 2021, 6 p.m. UTC | #21
On 27/01/21 02:25, Kai Huang wrote:
> On Tue, 26 Jan 2021 08:04:35 -0800 Dave Hansen wrote:
>> On 1/26/21 1:30 AM, Kai Huang wrote:
>>> From: Jarkko Sakkinen <jarkko@kernel.org>
>>>
>>> Encapsulate the snippet in sgx_free_epc_page() concerning EREMOVE to
>>> sgx_reset_epc_page(), which is a static helper function for
>>> sgx_encl_release().  It's the only function existing, which deals with
>>> initialized pages.
>>
>> Yikes.  I have no idea what that is saying.  Here's a rewrite:
>>
>> EREMOVE takes a pages and removes any association between that page and
>> an enclave.  It must be run on a page before it can be added into
>> another enclave.  Currently, EREMOVE is run as part of pages being freed
>> into the SGX page allocator.  It is not expected to fail.
>>
>> KVM does not track how guest pages are used, which means that SGX
>> virtualization use of EREMOVE might fail.
>>
>> Break out the EREMOVE call from the SGX page allocator.  This will allow
>> the SGX virtualization code to use the allocator directly.  (SGX/KVM
>> will also introduce a more permissive EREMOVE helper).
> 
> Thanks.
> 
> Hi Jarkko,
> 
> Do you want me to update your patch directly, or do you want to take the
> change, and send me the patch again?

I think you should treat all these 27 patches as yours now (including 
removing them, reordering them, adjusting commit message etc.).


>> OK, so if you're going to say "the caller must put the page in
>> uninitialized state", let's also add a comment to the place that *DO*
>> that, like the shiny new sgx_reset_epc_page().
> 
> Hi Dave,
> 
> Sorry I am a little bit confused here. Do you mean we should add a comment in
> sgx_reset_epc_page() to say, for instance: sgx_free_epc_page() requires the EPC
> page already been EREMOVE'd?

I also don't understand Dave's comment.  I would say

It's the caller's responsibility to make sure that the page is in 
uninitialized state with EREMOVE (sgx_reset_epc_page), EWB etc. before 
calling this function.

Paolo
Paolo Bonzini Feb. 2, 2021, 6:03 p.m. UTC | #22
On 02/02/21 19:00, Dave Hansen wrote:
>> /* "" Basic SGX */
>> /* "" SGX Enclave Dynamic Memory Mgmt */
> Do you actually want to suppress these from /proc/cpuinfo with the ""?
> 

sgx1 yes.  However sgx2 can be useful to have there, I guess.

Paolo
Sean Christopherson Feb. 2, 2021, 6:42 p.m. UTC | #23
On Tue, Feb 02, 2021, Paolo Bonzini wrote:
> On 02/02/21 19:00, Dave Hansen wrote:
> > > /* "" Basic SGX */
> > > /* "" SGX Enclave Dynamic Memory Mgmt */
> > Do you actually want to suppress these from /proc/cpuinfo with the ""?
> > 
> 
> sgx1 yes.  However sgx2 can be useful to have there, I guess.

Agreed, /proc/cpuinfo's sgx1 will always be in lockstep with sgx, so it won't
be useful for dealing with the fallout of hardware disabling SGX due to software
disabling a machine check bank via WRMSR(MCi_CTL).  I can't think of any other
use case for checking /proc/cpuinfo's sgx1.
Dave Hansen Feb. 2, 2021, 7:02 p.m. UTC | #24
On 1/26/21 5:25 PM, Kai Huang wrote:
>>
>>> + * responsibility to make sure that the page is in uninitialized state In other
>> Period after "state", please.
>>
>>> + * words, do EREMOVE, EWB or whatever operation is necessary before calling
>>> + * this function.
>>>   */
>> OK, so if you're going to say "the caller must put the page in
>> uninitialized state", let's also add a comment to the place that *DO*
>> that, like the shiny new sgx_reset_epc_page().
> Hi Dave,
> 
> Sorry I am a little bit confused here. Do you mean we should add a comment in
> sgx_reset_epc_page() to say, for instance: sgx_free_epc_page() requires the EPC
> page already been EREMOVE'd?

Yes.  You need to place a comment in sgx_reset_epc_page() which says
something like:

	/* Place the page in uninitialized state: */
Huang, Kai Feb. 2, 2021, 7:25 p.m. UTC | #25
On Tue, 2 Feb 2021 19:00:48 +0100 Paolo Bonzini wrote:
> On 27/01/21 02:25, Kai Huang wrote:
> > On Tue, 26 Jan 2021 08:04:35 -0800 Dave Hansen wrote:
> >> On 1/26/21 1:30 AM, Kai Huang wrote:
> >>> From: Jarkko Sakkinen <jarkko@kernel.org>
> >>>
> >>> Encapsulate the snippet in sgx_free_epc_page() concerning EREMOVE to
> >>> sgx_reset_epc_page(), which is a static helper function for
> >>> sgx_encl_release().  It's the only function existing, which deals with
> >>> initialized pages.
> >>
> >> Yikes.  I have no idea what that is saying.  Here's a rewrite:
> >>
> >> EREMOVE takes a pages and removes any association between that page and
> >> an enclave.  It must be run on a page before it can be added into
> >> another enclave.  Currently, EREMOVE is run as part of pages being freed
> >> into the SGX page allocator.  It is not expected to fail.
> >>
> >> KVM does not track how guest pages are used, which means that SGX
> >> virtualization use of EREMOVE might fail.
> >>
> >> Break out the EREMOVE call from the SGX page allocator.  This will allow
> >> the SGX virtualization code to use the allocator directly.  (SGX/KVM
> >> will also introduce a more permissive EREMOVE helper).
> > 
> > Thanks.
> > 
> > Hi Jarkko,
> > 
> > Do you want me to update your patch directly, or do you want to take the
> > change, and send me the patch again?
> 
> I think you should treat all these 27 patches as yours now (including 
> removing them, reordering them, adjusting commit message etc.).

Agreed. Thank you Paolo for starting to review this series :)
Edgecombe, Rick P Feb. 2, 2021, 10:21 p.m. UTC | #26
On Tue, 2021-01-26 at 23:10 +1300, Kai Huang wrote:
> This series adds KVM SGX virtualization support. The first 15 patches
> starting
> with x86/sgx or x86/cpu.. are necessary changes to x86 and SGX
> core/driver to
> support KVM SGX virtualization, while the rest are patches to KVM
> subsystem.

Do we need to restrict normal KVM host kernel access to EPC (i.e. via
__kvm_map_gfn() and friends)? As best I can tell the exact behavior of
this kind of access is undefined. The concern would be if any HW ever
treated it as an error, the guest could subject the host kernel to it.
Is it worth a check in those?
Sean Christopherson Feb. 2, 2021, 10:33 p.m. UTC | #27
On Tue, Feb 02, 2021, Edgecombe, Rick P wrote:
> On Tue, 2021-01-26 at 23:10 +1300, Kai Huang wrote:
> > This series adds KVM SGX virtualization support. The first 15 patches
> > starting
> > with x86/sgx or x86/cpu.. are necessary changes to x86 and SGX
> > core/driver to
> > support KVM SGX virtualization, while the rest are patches to KVM
> > subsystem.
> 
> Do we need to restrict normal KVM host kernel access to EPC (i.e. via
> __kvm_map_gfn() and friends)? As best I can tell the exact behavior of
> this kind of access is undefined. The concern would be if any HW ever
> treated it as an error, the guest could subject the host kernel to it.
> Is it worth a check in those?

I don't think so.  The SDM does state that the exact behavior is uArch specific,
but it also explicitly states that the access will be altered, which IMO doesn't
leave any wiggle room for a future CPU to fault instead of using some form of
abort semantics.

  Attempts to execute, read, or write to linear addresses mapped to EPC pages
  when not inside an enclave will result in the processor altering the access to
  preserve the confidentiality and integrity of the enclave. The exact behavior
  may be different between implementations.
Dave Hansen Feb. 2, 2021, 10:36 p.m. UTC | #28
On 2/2/21 2:21 PM, Edgecombe, Rick P wrote:
> On Tue, 2021-01-26 at 23:10 +1300, Kai Huang wrote:
>> This series adds KVM SGX virtualization support. The first 15 patches
>> starting
>> with x86/sgx or x86/cpu.. are necessary changes to x86 and SGX
>> core/driver to
>> support KVM SGX virtualization, while the rest are patches to KVM
>> subsystem.
> 
> Do we need to restrict normal KVM host kernel access to EPC (i.e. via
> __kvm_map_gfn() and friends)? As best I can tell the exact behavior of
> this kind of access is undefined. The concern would be if any HW ever
> treated it as an error, the guest could subject the host kernel to it.
> Is it worth a check in those?

Geez, you're right.  It's not even a page fault we can recover from.

SDM, Vol. 3D 37-1, 37.3 ACCESS-CONTROL REQUIREMENTS, says:

"Non-enclave accesses to EPC memory result in undefined behavior"
Dave Hansen Feb. 2, 2021, 11:21 p.m. UTC | #29
On 2/2/21 2:33 PM, Sean Christopherson wrote:
>> Do we need to restrict normal KVM host kernel access to EPC (i.e. via
>> __kvm_map_gfn() and friends)? As best I can tell the exact behavior of
>> this kind of access is undefined. The concern would be if any HW ever
>> treated it as an error, the guest could subject the host kernel to it.
>> Is it worth a check in those?
> I don't think so.  The SDM does state that the exact behavior is uArch specific,
> but it also explicitly states that the access will be altered, which IMO doesn't
> leave any wiggle room for a future CPU to fault instead of using some form of
> abort semantics.
> 
>   Attempts to execute, read, or write to linear addresses mapped to EPC pages
>   when not inside an enclave will result in the processor altering the access to
>   preserve the confidentiality and integrity of the enclave. The exact behavior
>   may be different between implementations.

I seem to remember much stronger language in the SDM about this.  I've
always thought of SGX as a big unrecoverable machine-check party waiting
to happen.

I'll ask around internally at Intel and see what folks say.  Basically,
should we be afraid of a big bad EPC access?
Sean Christopherson Feb. 2, 2021, 11:56 p.m. UTC | #30
On Tue, Feb 02, 2021, Dave Hansen wrote:
> On 2/2/21 2:33 PM, Sean Christopherson wrote:
> >> Do we need to restrict normal KVM host kernel access to EPC (i.e. via
> >> __kvm_map_gfn() and friends)? As best I can tell the exact behavior of
> >> this kind of access is undefined. The concern would be if any HW ever
> >> treated it as an error, the guest could subject the host kernel to it.
> >> Is it worth a check in those?
> > I don't think so.  The SDM does state that the exact behavior is uArch specific,
> > but it also explicitly states that the access will be altered, which IMO doesn't
> > leave any wiggle room for a future CPU to fault instead of using some form of
> > abort semantics.
> > 
> >   Attempts to execute, read, or write to linear addresses mapped to EPC pages
> >   when not inside an enclave will result in the processor altering the access to
> >   preserve the confidentiality and integrity of the enclave. The exact behavior
> >   may be different between implementations.
> 
> I seem to remember much stronger language in the SDM about this.  I've
> always thought of SGX as a big unrecoverable machine-check party waiting
> to happen.
>
> I'll ask around internally at Intel and see what folks say.  Basically,
> should we be afraid of a big bad EPC access?

If bad accesses to the EPC can cause machine checks, then EPC should never be
mapped into userspace, i.e. the SGX driver should never have been merged.

The SGX architecture is predicated on using isolation to protect enclaves from
software, not by poisoning memory, a la TDX.  E.g. SGX on ICX's MKTME wouldn't
be a thing if that weren't the case.

A physical attack on DRAM can trigger #MC on systems that use the MEE as
opposed to MKTME, but that obviously doesn't require a guest to coerce KVM into
accessing the EPC.
Dave Hansen Feb. 3, 2021, 12:43 a.m. UTC | #31
On 2/2/21 3:56 PM, Sean Christopherson wrote:
>> I'll ask around internally at Intel and see what folks say.  Basically,
>> should we be afraid of a big bad EPC access?
> If bad accesses to the EPC can cause machine checks, then EPC should never be
> mapped into userspace, i.e. the SGX driver should never have been merged.

That's a good point.  However, I've learned not to assume too much about
the SGX architecture.

Either way, I think we need some architectural clarification.  If it
can't *possibly* be harmful, then the architecture docs should at least
put a stake in the ground and say so.  I'll go rattle some cages.
Huang, Kai Feb. 3, 2021, 1:05 a.m. UTC | #32
On Tue, 2 Feb 2021 10:42:05 -0800 Sean Christopherson wrote:
> On Tue, Feb 02, 2021, Paolo Bonzini wrote:
> > On 02/02/21 19:00, Dave Hansen wrote:
> > > > /* "" Basic SGX */
> > > > /* "" SGX Enclave Dynamic Memory Mgmt */
> > > Do you actually want to suppress these from /proc/cpuinfo with the ""?
> > > 
> > 
> > sgx1 yes.  However sgx2 can be useful to have there, I guess.
> 
> Agreed, /proc/cpuinfo's sgx1 will always be in lockstep with sgx, so it won't
> be useful for dealing with the fallout of hardware disabling SGX due to software
> disabling a machine check bank via WRMSR(MCi_CTL).  I can't think of any other
> use case for checking /proc/cpuinfo's sgx1.

So combing all feedbacks, I'll put:

/* "" Basic SGX */
/* SGX Enclave Dynamic Memory Management (EDMM) */

Let me know if you guys have concern.
Huang, Kai Feb. 3, 2021, 1:09 a.m. UTC | #33
On Tue, 2 Feb 2021 19:17:56 +0200 Jarkko Sakkinen wrote:
> On Mon, Feb 01, 2021 at 01:01:51PM +1300, Kai Huang wrote:
> > On Sat, 30 Jan 2021 15:20:54 +0200 Jarkko Sakkinen wrote:
> > > On Wed, Jan 27, 2021 at 12:18:32PM +1300, Kai Huang wrote:
> > > > On Tue, 2021-01-26 at 07:34 -0800, Dave Hansen wrote:
> > > > > On 1/26/21 1:30 AM, Kai Huang wrote:
> > > > > > From: Sean Christopherson <seanjc@google.com>
> > > > > > 
> > > > > > Add SGX1 and SGX2 feature flags, via CPUID.0x12.0x0.EAX, as scattered
> > > > > > features, since adding a new leaf for only two bits would be wasteful.
> > > > > > As part of virtualizing SGX, KVM will expose the SGX CPUID leafs to its
> > > > > > guest, and to do so correctly needs to query hardware and kernel support
> > > > > > for SGX1 and SGX2.
> > > > > 
> > > > > It's also not _just_ exposing the CPUID leaves.  There are some checks
> > > > > here when KVM is emulating some SGX instructions too, right?
> > > > 
> > > > I would say trapping instead of emulating, but yes KVM will do more. However those
> > > > are quite details, and I don't think we should put lots of details here. Or perhaps
> > > > we can use 'for instance' as brief description:
> > > > 
> > > > As part of virtualizing SGX, KVM will need to use the two flags, for instance, to
> > > > expose them to guest.
> > > > 
> > > > ?
> > > > 
> > > > > 
> > > > > > diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> > > > > > index 84b887825f12..18b2d0c8bbbe 100644
> > > > > > --- a/arch/x86/include/asm/cpufeatures.h
> > > > > > +++ b/arch/x86/include/asm/cpufeatures.h
> > > > > > @@ -292,6 +292,8 @@
> > > > > >  #define X86_FEATURE_FENCE_SWAPGS_KERNEL	(11*32+ 5) /* "" LFENCE in kernel entry SWAPGS path */
> > > > > >  #define X86_FEATURE_SPLIT_LOCK_DETECT	(11*32+ 6) /* #AC for split lock */
> > > > > >  #define X86_FEATURE_PER_THREAD_MBA	(11*32+ 7) /* "" Per-thread Memory Bandwidth Allocation */
> > > > > > +#define X86_FEATURE_SGX1		(11*32+ 8) /* Software Guard Extensions sub-feature SGX1 */
> > > > > > +#define X86_FEATURE_SGX2        	(11*32+ 9) /* Software Guard Extensions sub-feature SGX2 */
> > > > > 
> > > > > FWIW, I'm not sure how valuable it is to spell the SGX acronym out three
> > > > > times.  Can't we use those bytes to put something more useful in that
> > > > > comment?
> > > > 
> > > > I think we can remove comment for SGX1, since it is basically SGX.
> > > > 
> > > > For SGX2, how about below?
> > > > 
> > > > /* SGX Enclave Dynamic Memory Management */
> > > 
> > > (EDMM)
> > 
> > Does EDMM obvious to everyone, instead of explicitly saying Enclave Dynamic
> > Memory Management?
> > 
> > Also do you think we need a comment for SGX1 bit? I can add /* Basic SGX */,
> > but I am not sure whether it is required.
> 
> I would put write the whole thing down and put EDMM to parentheses.

Good idea to me. Will do.

> 
> For SGX1 I would put "Basic SGX features for enclave construction".

I think "Basic SGX" should be enough, since it already implies "enclave
construction" part (plus other things). For someone doesn't care about SGX,
having "enclave construction" or not doesn't matter; for someone has some
knowledge of SGX, he or she knows what does "Basic SGX" mean.

> 
> /Jarkko
Jarkko Sakkinen Feb. 3, 2021, 10:03 a.m. UTC | #34
On Mon, Feb 01, 2021 at 01:11:10PM +1300, Kai Huang wrote:
> On Wed, 27 Jan 2021 14:26:52 +1300 Kai Huang wrote:
> > On Tue, 26 Jan 2021 17:12:12 -0800 Dave Hansen wrote:
> > > On 1/26/21 5:08 PM, Kai Huang wrote:
> > > > I don't have deep understanding of SGX driver. Would you help to answer?
> > > 
> > > Kai, as the patch submitter, you are expected to be able to at least
> > > minimally explain what the patch is doing.  Please endeavor to obtain
> > > this understanding before sending patches in the future.
> > 
> > I see. Thanks.
> 
> Hi Jarkko,
> 
> I think I'll remove this patch in next version, since it is not related to KVM
> SGX. And I'll rebase your second patch based on current tip/x86/sgx. You may
> send out this patch independently. Let me know if you have comment.

I don't like to pre-ack changes.

My main concern is not to introduce multiple disjoint versions
of sgx_free_epc_page(). It is just not sane because you can do
an implementation where those don't exist.

/Jarkko
Dave Hansen Feb. 3, 2021, 3:10 p.m. UTC | #35
On 2/2/21 3:56 PM, Sean Christopherson wrote:
>> I seem to remember much stronger language in the SDM about this.  I've
>> always thought of SGX as a big unrecoverable machine-check party waiting
>> to happen.
>>
>> I'll ask around internally at Intel and see what folks say.  Basically,
>> should we be afraid of a big bad EPC access?
> If bad accesses to the EPC can cause machine checks, then EPC should never be
> mapped into userspace, i.e. the SGX driver should never have been merged.

The SDM doesn't define the behavior well enough.  I'll try to get that
fixed.

But, there is some documentation of the abort page semantics:

> https://download.01.org/intel-sgx/sgx-linux/2.10/docs/Intel_SGX_Developer_Reference_Linux_2.10_Open_Source.pdf

Basically, writes get dropped and reads get all 1's on all the
implementations in the wild.  I actually would have much rather gotten a
fault, but oh well.

It sounds like we need to at least modify KVM to make sure not to map
and access EPC addresses.  We might even want to add some VM_WARN_ON()s
in the code that creates kernel mappings to catch these mappings if they
happen anywhere else.

EPC mappings seem like (silent) trouble waiting to happen.
Sean Christopherson Feb. 3, 2021, 5:36 p.m. UTC | #36
On Wed, Feb 03, 2021, Dave Hansen wrote:
> On 2/2/21 3:56 PM, Sean Christopherson wrote:
> >> I seem to remember much stronger language in the SDM about this.  I've
> >> always thought of SGX as a big unrecoverable machine-check party waiting
> >> to happen.
> >>
> >> I'll ask around internally at Intel and see what folks say.  Basically,
> >> should we be afraid of a big bad EPC access?
> > If bad accesses to the EPC can cause machine checks, then EPC should never be
> > mapped into userspace, i.e. the SGX driver should never have been merged.
> 
> The SDM doesn't define the behavior well enough.  I'll try to get that
> fixed.
> 
> But, there is some documentation of the abort page semantics:
> 
> > https://download.01.org/intel-sgx/sgx-linux/2.10/docs/Intel_SGX_Developer_Reference_Linux_2.10_Open_Source.pdf
> 
> Basically, writes get dropped and reads get all 1's on all the
> implementations in the wild.  I actually would have much rather gotten a
> fault, but oh well.
> 
> It sounds like we need to at least modify KVM to make sure not to map
> and access EPC addresses.

Why?  KVM will read garbage, but KVM needs to be careful with the data it reads,
irrespective of SGX, because the data is user/guest controlled even in the happy
case.

I'm not at all opposed to preventing KVM from accessing EPC, but I really don't
want to add a special check in KVM to avoid reading EPC.  KVM generally isn't
aware of physical backings, and the relevant KVM code is shared between all
architectures.

> We might even want to add some VM_WARN_ON()s in the code that creates kernel
> mappings to catch these mappings if they happen anywhere else.

One thought for handling this would be to extend __ioremap_check_other() to flag
EPC in some way, and then disallow memremap() to EPC.  A clever way to do that
without disallowing SGX's initial memremap() would be to tap into SGX's
sgx_epc_sections array, as the per-section check would activate only after each
section is initialized/map by SGX.

Disallowing memremap(), without warning, would address the KVM flow (the
memremap() in __kvm_map_gfn()) without forcing KVM to explicitly check for EPC.

E.g. something like:

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index c519fc5f6948..f263f3554f27 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -26,6 +26,19 @@ static LIST_HEAD(sgx_active_page_list);

 static DEFINE_SPINLOCK(sgx_reclaimer_lock);

+bool is_sgx_epc(resource_size_t addr, unsigned long size)
+{
+       resource_size_t end = addr + size - 1;
+       int i;
+
+       for (i = 0; i < sgx_nr_epc_sections; i++) {
+               if (<check for overlap with sgx_epc_sections[i])
+                       return true;
+       }
+
+       return false;
+}
+
 /*
  * Reset dirty EPC pages to uninitialized state. Laundry can be left with SECS
  * pages whose child pages blocked EREMOVE.
diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index 9e5ccc56f8e0..145fc6fc6bc5 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -34,6 +34,7 @@
  */
 struct ioremap_desc {
        unsigned int flags;
+       bool sgx_epc;
 };

 /*
@@ -110,8 +111,14 @@ static unsigned int __ioremap_check_encrypted(struct resource *res)
  * The EFI runtime services data area is not covered by walk_mem_res(), but must
  * be mapped encrypted when SEV is active.
  */
-static void __ioremap_check_other(resource_size_t addr, struct ioremap_desc *desc)
+static void __ioremap_check_other(resource_size_t addr, unsigned long size,
+                                 struct ioremap_desc *desc)
 {
+       if (sgx_is_epc(addr, size)) {
+               desc->sgx_epc = true;
+               return;
+       }
+
        if (!sev_active())
                return;

@@ -155,7 +162,7 @@ static void __ioremap_check_mem(resource_size_t addr, unsigned long size,

        walk_mem_res(start, end, desc, __ioremap_collect_map_flags);

-       __ioremap_check_other(addr, desc);
+       __ioremap_check_other(addr, size, desc);
 }

 /*
@@ -210,6 +217,13 @@ __ioremap_caller(resource_size_t phys_addr, unsigned long size,
                return NULL;
        }

+       /*
+        * Don't allow mapping SGX EPC, it's not accessible via normal reads
+        * and writes.
+        */
+       if (io_desc.epc)
+               return NULL;
+
        /*
         * Mappings have to be page-aligned
         */
Paolo Bonzini Feb. 3, 2021, 5:43 p.m. UTC | #37
On 03/02/21 18:36, Sean Christopherson wrote:
> I'm not at all opposed to preventing KVM from accessing EPC, but I 
> really don't want to add a special check in KVM to avoid reading EPC. 
> KVM generally isn't aware of physical backings, and the relevant KVM 
> code is shared between all architectures.

Yeah, special casing KVM is almost always the wrong thing to do. 
Anything that KVM can do, other subsystems will do as well.

Paolo
Dave Hansen Feb. 3, 2021, 5:46 p.m. UTC | #38
On 2/3/21 9:43 AM, Paolo Bonzini wrote:
> On 03/02/21 18:36, Sean Christopherson wrote:
>> I'm not at all opposed to preventing KVM from accessing EPC, but I
>> really don't want to add a special check in KVM to avoid reading EPC.
>> KVM generally isn't aware of physical backings, and the relevant KVM
>> code is shared between all architectures.
> 
> Yeah, special casing KVM is almost always the wrong thing to do.
> Anything that KVM can do, other subsystems will do as well.

Agreed.  Thwarting ioremap itself seems like the right way to go.
Huang, Kai Feb. 3, 2021, 11:09 p.m. UTC | #39
On Wed, 2021-02-03 at 09:46 -0800, Dave Hansen wrote:
> On 2/3/21 9:43 AM, Paolo Bonzini wrote:
> > On 03/02/21 18:36, Sean Christopherson wrote:
> > > I'm not at all opposed to preventing KVM from accessing EPC, but I
> > > really don't want to add a special check in KVM to avoid reading EPC.
> > > KVM generally isn't aware of physical backings, and the relevant KVM
> > > code is shared between all architectures.
> > 
> > Yeah, special casing KVM is almost always the wrong thing to do.
> > Anything that KVM can do, other subsystems will do as well.
> 
> Agreed.  Thwarting ioremap itself seems like the right way to go.

This sounds irrelevant to KVM SGX, thus I won't include it to KVM SGX series.
Sean Christopherson Feb. 3, 2021, 11:32 p.m. UTC | #40
On Thu, Feb 04, 2021, Kai Huang wrote:
> On Wed, 2021-02-03 at 09:46 -0800, Dave Hansen wrote:
> > On 2/3/21 9:43 AM, Paolo Bonzini wrote:
> > > On 03/02/21 18:36, Sean Christopherson wrote:
> > > > I'm not at all opposed to preventing KVM from accessing EPC, but I
> > > > really don't want to add a special check in KVM to avoid reading EPC.
> > > > KVM generally isn't aware of physical backings, and the relevant KVM
> > > > code is shared between all architectures.
> > > 
> > > Yeah, special casing KVM is almost always the wrong thing to do.
> > > Anything that KVM can do, other subsystems will do as well.
> > 
> > Agreed.  Thwarting ioremap itself seems like the right way to go.
> 
> This sounds irrelevant to KVM SGX, thus I won't include it to KVM SGX series.

I would say it's relevant, but a pre-existing bug.  Same net effect on what's
needed for this series..

I say it's a pre-existing bug, because I'm pretty sure KVM can be coerced into
accessing the EPC by handing KVM a memslot that's backed by an enclave that was
created by host userspace (via /dev/sgx_enclave).
Dave Hansen Feb. 3, 2021, 11:37 p.m. UTC | #41
On 2/3/21 3:32 PM, Sean Christopherson wrote:
>>>> Yeah, special casing KVM is almost always the wrong thing to do.
>>>> Anything that KVM can do, other subsystems will do as well.
>>> Agreed.  Thwarting ioremap itself seems like the right way to go.
>> This sounds irrelevant to KVM SGX, thus I won't include it to KVM SGX series.
> I would say it's relevant, but a pre-existing bug.  Same net effect on what's
> needed for this series..
> 
> I say it's a pre-existing bug, because I'm pretty sure KVM can be coerced into
> accessing the EPC by handing KVM a memslot that's backed by an enclave that was
> created by host userspace (via /dev/sgx_enclave).

Dang, you beat me to it.  I was composing another email that said the
exact same thing.

I guess we need to take a closer look at the KVM fallout from this.
It's a few spots where it KVM knew it might be consuming garbage.  It
just get extra weird stinky garbage now.
Huang, Kai Feb. 4, 2021, 12:04 a.m. UTC | #42
On Wed, 2021-02-03 at 15:37 -0800, Dave Hansen wrote:
> On 2/3/21 3:32 PM, Sean Christopherson wrote:
> > > > > Yeah, special casing KVM is almost always the wrong thing to do.
> > > > > Anything that KVM can do, other subsystems will do as well.
> > > > Agreed.  Thwarting ioremap itself seems like the right way to go.
> > > This sounds irrelevant to KVM SGX, thus I won't include it to KVM SGX series.
> > I would say it's relevant, but a pre-existing bug.  Same net effect on what's
> > needed for this series..
> > 
> > I say it's a pre-existing bug, because I'm pretty sure KVM can be coerced into
> > accessing the EPC by handing KVM a memslot that's backed by an enclave that was
> > created by host userspace (via /dev/sgx_enclave).
> 
> Dang, you beat me to it.  I was composing another email that said the
> exact same thing.
> 
> I guess we need to take a closer look at the KVM fallout from this.
> It's a few spots where it KVM knew it might be consuming garbage.  It
> just get extra weird stinky garbage now.

I don't quite understand how KVM will need to access EPC memslot. It is *guest*, but
not KVM, who can read EPC from non-enclave. And if I understand correctly, there will
be no place for KVM to use kernel address of EPC to access it. To KVM, there's no
difference, whether EPC backend is from /dev/sgx_enclave, or /dev/sgx_vepc. And we
really cannot prevent guest from doing anything. 

So how memremap() of EPC section is related to KVM SGX? For instance, the
implementation of this series needs to be modified due to this?
Sean Christopherson Feb. 4, 2021, 12:28 a.m. UTC | #43
On Thu, Feb 04, 2021, Kai Huang wrote:
> On Wed, 2021-02-03 at 15:37 -0800, Dave Hansen wrote:
> > On 2/3/21 3:32 PM, Sean Christopherson wrote:
> > > > > > Yeah, special casing KVM is almost always the wrong thing to do.
> > > > > > Anything that KVM can do, other subsystems will do as well.
> > > > > Agreed.  Thwarting ioremap itself seems like the right way to go.
> > > > This sounds irrelevant to KVM SGX, thus I won't include it to KVM SGX series.
> > > I would say it's relevant, but a pre-existing bug.  Same net effect on what's
> > > needed for this series..
> > > 
> > > I say it's a pre-existing bug, because I'm pretty sure KVM can be coerced into
> > > accessing the EPC by handing KVM a memslot that's backed by an enclave that was
> > > created by host userspace (via /dev/sgx_enclave).
> > 
> > Dang, you beat me to it.  I was composing another email that said the
> > exact same thing.
> > 
> > I guess we need to take a closer look at the KVM fallout from this.
> > It's a few spots where it KVM knew it might be consuming garbage.  It
> > just get extra weird stinky garbage now.
> 
> I don't quite understand how KVM will need to access EPC memslot. It is *guest*, but
> not KVM, who can read EPC from non-enclave. And if I understand correctly, there will
> be no place for KVM to use kernel address of EPC to access it. To KVM, there's no
> difference, whether EPC backend is from /dev/sgx_enclave, or /dev/sgx_vepc. And we
> really cannot prevent guest from doing anything.
> 
> So how memremap() of EPC section is related to KVM SGX? For instance, the
> implementation of this series needs to be modified due to this?

See kvm_vcpu_map() -> __kvm_map_gfn(), which blindly uses memremap() when the
resulting pfn isn't a "valid" pfn.  KVM doesn't need access to an EPC memslot,
we're talking the case where a malicious userspace/guest hands KVM a GPA that
resolves to the EPC.  E.g. nested VM-Enter with the L1->L2 MSR bitmap pointing
at EPC.  L0 KVM will intercept VM-Enter and then read L1's bitmap to merge it's
desires with L0 KVM's requirements.  That read will hit the EPC, and thankfully
for KVM, return garbage.
Huang, Kai Feb. 4, 2021, 3:18 a.m. UTC | #44
On Wed, 2021-02-03 at 16:28 -0800, Sean Christopherson wrote:
> On Thu, Feb 04, 2021, Kai Huang wrote:
> > On Wed, 2021-02-03 at 15:37 -0800, Dave Hansen wrote:
> > > On 2/3/21 3:32 PM, Sean Christopherson wrote:
> > > > > > > Yeah, special casing KVM is almost always the wrong thing to do.
> > > > > > > Anything that KVM can do, other subsystems will do as well.
> > > > > > Agreed.  Thwarting ioremap itself seems like the right way to go.
> > > > > This sounds irrelevant to KVM SGX, thus I won't include it to KVM SGX series.
> > > > I would say it's relevant, but a pre-existing bug.  Same net effect on what's
> > > > needed for this series..
> > > > 
> > > > I say it's a pre-existing bug, because I'm pretty sure KVM can be coerced into
> > > > accessing the EPC by handing KVM a memslot that's backed by an enclave that was
> > > > created by host userspace (via /dev/sgx_enclave).
> > > 
> > > Dang, you beat me to it.  I was composing another email that said the
> > > exact same thing.
> > > 
> > > I guess we need to take a closer look at the KVM fallout from this.
> > > It's a few spots where it KVM knew it might be consuming garbage.  It
> > > just get extra weird stinky garbage now.
> > 
> > I don't quite understand how KVM will need to access EPC memslot. It is *guest*, but
> > not KVM, who can read EPC from non-enclave. And if I understand correctly, there will
> > be no place for KVM to use kernel address of EPC to access it. To KVM, there's no
> > difference, whether EPC backend is from /dev/sgx_enclave, or /dev/sgx_vepc. And we
> > really cannot prevent guest from doing anything.
> > 
> > So how memremap() of EPC section is related to KVM SGX? For instance, the
> > implementation of this series needs to be modified due to this?
> 
> See kvm_vcpu_map() -> __kvm_map_gfn(), which blindly uses memremap() when the
> resulting pfn isn't a "valid" pfn.  KVM doesn't need access to an EPC memslot,
> we're talking the case where a malicious userspace/guest hands KVM a GPA that
> resolves to the EPC.  E.g. nested VM-Enter with the L1->L2 MSR bitmap pointing
> at EPC.  L0 KVM will intercept VM-Enter and then read L1's bitmap to merge it's
> desires with L0 KVM's requirements.  That read will hit the EPC, and thankfully
> for KVM, return garbage.

Right. I missed __kvm_map_gfn(). 

I am not quite sure returning all ones can be treated as garbage, since one can means
true for a boolean, or one bit in bitmap as you said. But since this only happens
when guest/userspace is malicious, so causing misbehavior to the guest is fine? Do we
see any security risk here?

And I also agree that denying memremap() for EPC is desirable. But I am not sure
whether this should be addressed before KVM SGX series, or KVM SGX series should take
care of it.
Sean Christopherson Feb. 4, 2021, 4:28 p.m. UTC | #45
On Thu, Feb 04, 2021, Kai Huang wrote:
> On Wed, 2021-02-03 at 16:28 -0800, Sean Christopherson wrote:
> > On Thu, Feb 04, 2021, Kai Huang wrote:
> > > On Wed, 2021-02-03 at 15:37 -0800, Dave Hansen wrote:
> > > > On 2/3/21 3:32 PM, Sean Christopherson wrote:
> > > > > > > > Yeah, special casing KVM is almost always the wrong thing to do.
> > > > > > > > Anything that KVM can do, other subsystems will do as well.
> > > > > > > Agreed.  Thwarting ioremap itself seems like the right way to go.
> > > > > > This sounds irrelevant to KVM SGX, thus I won't include it to KVM SGX series.
> > > > > I would say it's relevant, but a pre-existing bug.  Same net effect on what's
> > > > > needed for this series..
> > > > > 
> > > > > I say it's a pre-existing bug, because I'm pretty sure KVM can be coerced into
> > > > > accessing the EPC by handing KVM a memslot that's backed by an enclave that was
> > > > > created by host userspace (via /dev/sgx_enclave).
> > > > 
> > > > Dang, you beat me to it.  I was composing another email that said the
> > > > exact same thing.
> > > > 
> > > > I guess we need to take a closer look at the KVM fallout from this.
> > > > It's a few spots where it KVM knew it might be consuming garbage.  It
> > > > just get extra weird stinky garbage now.
> > > 
> > > I don't quite understand how KVM will need to access EPC memslot. It is *guest*, but
> > > not KVM, who can read EPC from non-enclave. And if I understand correctly, there will
> > > be no place for KVM to use kernel address of EPC to access it. To KVM, there's no
> > > difference, whether EPC backend is from /dev/sgx_enclave, or /dev/sgx_vepc. And we
> > > really cannot prevent guest from doing anything.
> > > 
> > > So how memremap() of EPC section is related to KVM SGX? For instance, the
> > > implementation of this series needs to be modified due to this?
> > 
> > See kvm_vcpu_map() -> __kvm_map_gfn(), which blindly uses memremap() when the
> > resulting pfn isn't a "valid" pfn.  KVM doesn't need access to an EPC memslot,
> > we're talking the case where a malicious userspace/guest hands KVM a GPA that
> > resolves to the EPC.  E.g. nested VM-Enter with the L1->L2 MSR bitmap pointing
> > at EPC.  L0 KVM will intercept VM-Enter and then read L1's bitmap to merge it's
> > desires with L0 KVM's requirements.  That read will hit the EPC, and thankfully
> > for KVM, return garbage.
> 
> Right. I missed __kvm_map_gfn(). 
> 
> I am not quite sure returning all ones can be treated as garbage, since one can means
> true for a boolean, or one bit in bitmap as you said. But since this only happens
> when guest/userspace is malicious, so causing misbehavior to the guest is fine?

Yes, it's fine.  It's really the guest causing misbehavior for itself.

> Do we see any security risk here?

Not with current CPUs, which drop writes and read all ones.  If future CPUs take
creatives liberties with the SDM, then we could have a problem, but that's why
Dave is trying to get stronger guarantees into the SDM.
Dave Hansen Feb. 4, 2021, 4:48 p.m. UTC | #46
On 2/4/21 8:28 AM, Sean Christopherson wrote:
>> Do we see any security risk here?
> Not with current CPUs, which drop writes and read all ones.  If future CPUs take
> creatives liberties with the SDM, then we could have a problem, but that's why
> Dave is trying to get stronger guarantees into the SDM.

I really don't like the idea of the abort page being used by code that
doesn't know what it's dealing with.  It just seems like trouble (aka.
security risk) waiting to happen.
Huang, Kai Feb. 5, 2021, 12:32 p.m. UTC | #47
On Thu, 2021-02-04 at 08:48 -0800, Dave Hansen wrote:
> On 2/4/21 8:28 AM, Sean Christopherson wrote:
> > > Do we see any security risk here?
> > Not with current CPUs, which drop writes and read all ones.  If future CPUs take
> > creatives liberties with the SDM, then we could have a problem, but that's why
> > Dave is trying to get stronger guarantees into the SDM.
> 
> I really don't like the idea of the abort page being used by code that
> doesn't know what it's dealing with.  It just seems like trouble (aka.
> security risk) waiting to happen.

Hi Dave,

Just to confirm, you want this (disallow ioremap() for EPC) fixed in upstream kernel
before KVM SGX can be merged, correct?

If so, and since it seems you also agreed that better solution is to modify ioremap()
to refuse to map EPC, what do you think of the sample code Sean put in his previous
reply?

https://www.spinics.net/lists/kvm/msg234754.html

IMHO adding 'bool sgx_epc' to ioremap_desc seems not ideal, since it's not generic.
Instead, we may define some new flag here, and ioremap_desc->flag can just cope with
it.

Btw as Sean already pointed out, SGX code uses memremap() to initialize EPC section,
and we could choose to still allow this to avoid code change to SGX driver. But it
seems it is a little hack here. What's your opinion? 

Hi Sean,

If we all agree the fix is needed here, do you want to work on the patch (since you
already provided your thought), or do you want me to do it, with Suggested-by you?
Sean Christopherson Feb. 5, 2021, 4:51 p.m. UTC | #48
On Sat, Feb 06, 2021, Kai Huang wrote:
> Hi Sean,
> 
> If we all agree the fix is needed here, do you want to work on the patch (since you
> already provided your thought), or do you want me to do it, with Suggested-by you?

Nope, all yours.