[v19,12/27] x86/sgx: Enumerate and track EPC sections

Message ID 20190317211456.13927-13-jarkko.sakkinen@linux.intel.com (mailing list archive)
State New, archived
Series Intel SGX1 support

Commit Message

Jarkko Sakkinen March 17, 2019, 9:14 p.m. UTC
From: Sean Christopherson <sean.j.christopherson@intel.com>

Enumerate Enclave Page Cache (EPC) sections via CPUID and add the data
structures necessary to track EPC pages so that they can be allocated,
freed and managed.  As a system may have multiple EPC sections, invoke
CPUID on SGX sub-leafs until an invalid leaf is encountered.

On NUMA systems, a node can have at most one bank. A bank can be part of
at most two nodes.  SGX supports both nodes with a single memory
controller and also sub-cluster nodes with several memory controllers
on a single die.

For simplicity, support a maximum of eight EPC sections.  Current
client hardware supports only a single section, while upcoming server
hardware will support at most eight sections.  Bounding the number of
sections also allows the section ID to be embedded along with a page's
offset in a single unsigned long, enabling easy retrieval of both the
VA and PA for a given page.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
Co-developed-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Co-developed-by: Serge Ayoun <serge.ayoun@intel.com>
Signed-off-by: Serge Ayoun <serge.ayoun@intel.com>
---
 arch/x86/Kconfig                 |  19 ++++
 arch/x86/kernel/cpu/Makefile     |   1 +
 arch/x86/kernel/cpu/sgx/Makefile |   1 +
 arch/x86/kernel/cpu/sgx/main.c   | 149 +++++++++++++++++++++++++++++++
 arch/x86/kernel/cpu/sgx/sgx.h    |  62 +++++++++++++
 5 files changed, 232 insertions(+)
 create mode 100644 arch/x86/kernel/cpu/sgx/Makefile
 create mode 100644 arch/x86/kernel/cpu/sgx/main.c
 create mode 100644 arch/x86/kernel/cpu/sgx/sgx.h

Comments

Sean Christopherson March 18, 2019, 7:50 p.m. UTC | #1
On Sun, Mar 17, 2019 at 11:14:41PM +0200, Jarkko Sakkinen wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
> 
> Enumerate Enclave Page Cache (EPC) sections via CPUID and add the data
> structures necessary to track EPC pages so that they can be allocated,
> freed and managed.  As a system may have multiple EPC sections, invoke
> CPUID on SGX sub-leafs until an invalid leaf is encountered.
> 
> On NUMA systems, a node can have at most one bank. A bank can be part of
> at most two nodes.  SGX supports both nodes with a single memory
> controller and also sub-cluster nodes with several memory controllers
> on a single die.
> 
> For simplicity, support a maximum of eight EPC sections.  Current
> client hardware supports only a single section, while upcoming server
> hardware will support at most eight sections.  Bounding the number of
> sections also allows the section ID to be embedded along with a page's
> offset in a single unsigned long, enabling easy retrieval of both the
> VA and PA for a given page.
> 
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Co-developed-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
> Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
> Co-developed-by: Suresh Siddha <suresh.b.siddha@intel.com>
> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
> Co-developed-by: Serge Ayoun <serge.ayoun@intel.com>
> Signed-off-by: Serge Ayoun <serge.ayoun@intel.com>
> ---
>  arch/x86/Kconfig                 |  19 ++++
>  arch/x86/kernel/cpu/Makefile     |   1 +
>  arch/x86/kernel/cpu/sgx/Makefile |   1 +
>  arch/x86/kernel/cpu/sgx/main.c   | 149 +++++++++++++++++++++++++++++++
>  arch/x86/kernel/cpu/sgx/sgx.h    |  62 +++++++++++++
>  5 files changed, 232 insertions(+)
>  create mode 100644 arch/x86/kernel/cpu/sgx/Makefile
>  create mode 100644 arch/x86/kernel/cpu/sgx/main.c
>  create mode 100644 arch/x86/kernel/cpu/sgx/sgx.h
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index c1f9b3cf437c..dc630208003f 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1921,6 +1921,25 @@ config X86_INTEL_MEMORY_PROTECTION_KEYS
>  
>  	  If unsure, say y.
>  
> +config INTEL_SGX
> +	bool "Intel SGX core functionality"
> +	depends on X86_64 && CPU_SUP_INTEL
> +	help
> +
> +	Intel(R) SGX is a set of CPU instructions that can be used by
> +	applications to set aside private regions of code and data.  The code
> +	outside the enclave is disallowed to access the memory inside the
> +	enclave by the CPU access control.

"enclave" is used before it's defined.  And the second sentence could be
tweaked to make it explicitly clear that hardware disallows cross-enclave
access.  E.g.:

	Intel(R) SGX is a set of CPU instructions that can be used by
	applications to set aside private regions of code and data, referred
	to as enclaves.  An enclave's private memory can only be accessed by
	code running within the enclave.  Accesses from outside the enclave,
	including other enclaves, are disallowed by hardware.

> +
> +	The firmware uses PRMRR registers to reserve an area of physical memory
> +	called Enclave Page Cache (EPC). There is a hardware unit in the
> +	processor called Memory Encryption Engine. The MEE encrypts and decrypts
> +	the EPC pages as they enter and leave the processor package.

This second paragraph can probably be dropped altogether.  A reader won't
know what PRMRR means unless they're already familiar with SGX.  And the
PRMRR+MEE implementation is not architectural, i.e. future hardware could
support EPC through some other mechanism.  SGX does more than just encrypt
memory, covering those details is probably best left to intel_sgx.rst.

> +
> +	For details, see Documentation/x86/intel_sgx.rst
> +
> +	If unsure, say N.
> +
>  config EFI
>  	bool "EFI runtime service support"
>  	depends on ACPI
> diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
> index cfd24f9f7614..d1163c0fd5d6 100644
> --- a/arch/x86/kernel/cpu/Makefile
> +++ b/arch/x86/kernel/cpu/Makefile
> @@ -40,6 +40,7 @@ obj-$(CONFIG_X86_MCE)			+= mce/
>  obj-$(CONFIG_MTRR)			+= mtrr/
>  obj-$(CONFIG_MICROCODE)			+= microcode/
>  obj-$(CONFIG_X86_CPU_RESCTRL)		+= resctrl/
> +obj-$(CONFIG_INTEL_SGX)			+= sgx/
>  
>  obj-$(CONFIG_X86_LOCAL_APIC)		+= perfctr-watchdog.o
>  
> diff --git a/arch/x86/kernel/cpu/sgx/Makefile b/arch/x86/kernel/cpu/sgx/Makefile
> new file mode 100644
> index 000000000000..b666967fd570
> --- /dev/null
> +++ b/arch/x86/kernel/cpu/sgx/Makefile
> @@ -0,0 +1 @@
> +obj-y += main.o
> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> new file mode 100644
> index 000000000000..18ce4acdd7ef
> --- /dev/null
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -0,0 +1,149 @@
> +// SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause)
> +// Copyright(c) 2016-17 Intel Corporation.
> +
> +#include <linux/freezer.h>
> +#include <linux/highmem.h>
> +#include <linux/kthread.h>
> +#include <linux/pagemap.h>
> +#include <linux/ratelimit.h>
> +#include <linux/sched/signal.h>
> +#include <linux/slab.h>
> +#include "arch.h"
> +#include "sgx.h"
> +
> +struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];

Dynamically allocating sgx_epc_sections isn't exactly difficult, and
AFAICT the static allocation is the primary motivation for capping
SGX_MAX_EPC_SECTIONS at such a low value (8).  I still think it makes
sense to define SGX_MAX_EPC_SECTIONS so that the section number can
be embedded in the offset, along with flags.  But the max can be
significantly higher, e.g. using 7 bits to support 128 sections.

I realize hardware is highly unlikely to have more than 8 sections, at
least for the near future, but IMO the small amount of extra complexity
is worth having a bit of breathing room.
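
To make the 7-bit idea concrete, here is a minimal user-space sketch of
such a descriptor layout.  The EX_* names, the split of the low 12 bits
into 7 section bits plus 5 flag bits, and the helpers are all
hypothetical and not taken from the patch; the only thing carried over
is that bits 12-63 hold the page's physical address.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: bits 0-6 section, bits 7-11 flags, bits 12-63 PA. */
#define EX_PAGE_SHIFT	12
#define EX_SECTION_BITS	7
#define EX_MAX_SECTIONS	(1u << EX_SECTION_BITS)
#define EX_LOW_MASK	((UINT64_C(1) << EX_PAGE_SHIFT) - 1)
#define EX_SECTION_MASK	((UINT64_C(1) << EX_SECTION_BITS) - 1)
#define EX_FLAGS_MASK	(EX_LOW_MASK & ~EX_SECTION_MASK)
#define EX_ADDR_MASK	(~EX_LOW_MASK)

/* Pack a page-aligned physical address, a section index and flags. */
static uint64_t ex_desc_pack(uint64_t pa, unsigned int section, uint64_t flags)
{
	assert((pa & ~EX_ADDR_MASK) == 0);	/* must be page aligned */
	assert(section < EX_MAX_SECTIONS);	/* must fit in 7 bits */
	assert((flags & ~EX_FLAGS_MASK) == 0);	/* must fit in bits 7-11 */

	return pa | flags | section;
}

static unsigned int ex_desc_section(uint64_t desc)
{
	return (unsigned int)(desc & EX_SECTION_MASK);
}

static uint64_t ex_desc_pa(uint64_t desc)
{
	return desc & EX_ADDR_MASK;
}

static uint64_t ex_desc_flags(uint64_t desc)
{
	return desc & EX_FLAGS_MASK;
}
```

With this split, raising the section limit costs nothing per page: the
descriptor stays a single 64-bit value and the remaining 5 low bits are
still free for flags.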

> +EXPORT_SYMBOL_GPL(sgx_epc_sections);
> +
> +static int sgx_nr_epc_sections;
> +
> +static void sgx_section_put_page(struct sgx_epc_section *section,
> +				 struct sgx_epc_page *page)
> +{
> +	list_add_tail(&page->list, &section->page_list);
> +	section->free_cnt++;
> +}
> +
> +static __init void sgx_free_epc_section(struct sgx_epc_section *section)
> +{
> +	struct sgx_epc_page *page;
> +
> +	while (!list_empty(&section->page_list)) {
> +		page = list_first_entry(&section->page_list,
> +					struct sgx_epc_page, list);
> +		list_del(&page->list);
> +		kfree(page);
> +	}
> +	memunmap(section->va);
> +}
> +
> +static __init int sgx_init_epc_section(u64 addr, u64 size, unsigned long index,
> +				       struct sgx_epc_section *section)
> +{
> +	unsigned long nr_pages = size >> PAGE_SHIFT;
> +	struct sgx_epc_page *page;
> +	unsigned long i;
> +
> +	section->va = memremap(addr, size, MEMREMAP_WB);
> +	if (!section->va)
> +		return -ENOMEM;
> +
> +	section->pa = addr;
> +	spin_lock_init(&section->lock);
> +	INIT_LIST_HEAD(&section->page_list);
> +
> +	for (i = 0; i < nr_pages; i++) {
> +		page = kzalloc(sizeof(*page), GFP_KERNEL);
> +		if (!page)
> +			goto out;
> +		page->desc = (addr + (i << PAGE_SHIFT)) | index;
> +		sgx_section_put_page(section, page);
> +	}

Not sure if this is the correct location, but at some point the kernel
needs to sanitize the EPC during init.  EPC pages may be in an unknown
state, e.g. after kexec(), which will cause all manner of faults and
warnings.  Maybe the best approach is to sanitize on-demand, e.g. suppress
the first WARN due to unexpected ENCLS failure and purge the EPC at that
time.  The downside of that approach is that exposing EPC to a guest would
need to implement its own sanitization flow.

> +
> +	return 0;
> +out:
> +	sgx_free_epc_section(section);
> +	return -ENOMEM;
> +}
> +
> +static __init void sgx_page_cache_teardown(void)
> +{
> +	int i;
> +
> +	for (i = 0; i < sgx_nr_epc_sections; i++)
> +		sgx_free_epc_section(&sgx_epc_sections[i]);
> +}
> +
> +/**
> + * A section metric is concatenated in a way that @low bits 12-31 define the
> + * bits 12-31 of the metric and @high bits 0-19 define the bits 32-51 of the
> + * metric.
> + */
> +static inline u64 sgx_calc_section_metric(u64 low, u64 high)
> +{
> +	return (low & GENMASK_ULL(31, 12)) +
> +	       ((high & GENMASK_ULL(19, 0)) << 32);
> +}
> +
> +static __init int sgx_page_cache_init(void)
> +{
> +	u32 eax, ebx, ecx, edx, type;
> +	u64 pa, size;
> +	int ret;
> +	int i;
> +
> +	BUILD_BUG_ON(SGX_MAX_EPC_SECTIONS > (SGX_EPC_SECTION_MASK + 1));
> +
> +	for (i = 0; i < (SGX_MAX_EPC_SECTIONS + 1); i++) {
> +		cpuid_count(SGX_CPUID, i + SGX_CPUID_FIRST_VARIABLE_SUB_LEAF,
> +			    &eax, &ebx, &ecx, &edx);
> +
> +		type = eax & SGX_CPUID_SUB_LEAF_TYPE_MASK;
> +		if (type == SGX_CPUID_SUB_LEAF_INVALID)
> +			break;
> +		if (type != SGX_CPUID_SUB_LEAF_EPC_SECTION) {
> +			pr_err_once("sgx: Unknown sub-leaf type: %u\n", type);
> +			return -ENODEV;

This should probably be "continue" rather than "return -ENODEV".  SGX
can still be used in the (extremely) unlikely event that there is usable
EPC and some unknown memory type enumerated.

> +		}
> +		if (i == SGX_MAX_EPC_SECTIONS) {
> +			pr_warn("sgx: More than "
> +				__stringify(SGX_MAX_EPC_SECTIONS)
> +				" EPC sections\n");

This isn't a very helpful message, e.g. it doesn't even imply that the
kernel is ignoring EPC sections.  It'd also be helpful to display the
sections that are being ignored.  Might also warrant pr_err() since it
means system resources are being ignored.

E.g.:

#define SGX_ARBITRARY_LOOP_TERMINATOR   1000

	for (i = 0; i < SGX_ARBITRARY_LOOP_TERMINATOR; i++) {
		...

		if (i >= SGX_MAX_EPC_SECTIONS) {
			pr_err("sgx: Reached max number of EPC sections (%u), "
			       "ignoring section 0x%llx-0x%llx\n",
			       SGX_MAX_EPC_SECTIONS, pa, pa + size - 1);
		}
	}
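
Putting the two suggestions together (skip unknown sub-leaf types, and
ignore rather than abort on sections beyond the max), the control flow
can be modeled in user space roughly as below.  This is only a sketch:
CPUID is replaced by a caller-supplied table, the EX_* constants and
struct names are hypothetical, and the type encoding (EAX bits 3:0,
0 = invalid, 1 = EPC section) is an assumption about what arch.h
defines.  The metric helper mirrors sgx_calc_section_metric() from the
patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EX_LEAF_TYPE_MASK	0xfu	/* assumed: type lives in EAX[3:0] */
#define EX_LEAF_INVALID		0x0u
#define EX_LEAF_EPC_SECTION	0x1u
#define EX_MAX_SECTIONS		8

struct ex_leaf { uint32_t eax, ebx, ecx, edx; };	/* fake CPUID output */
struct ex_range { uint64_t pa, size; };

/* Bits 12-31 come from @low, bits 32-51 from bits 0-19 of @high. */
static uint64_t ex_section_metric(uint32_t low, uint32_t high)
{
	return ((uint64_t)low & UINT64_C(0xfffff000)) |
	       (((uint64_t)high & UINT64_C(0x000fffff)) << 32);
}

/*
 * Walk the sub-leaves until an invalid one is found.  Unknown types are
 * skipped, and EPC sections beyond EX_MAX_SECTIONS are counted in
 * @nr_ignored instead of failing the whole enumeration.
 */
static size_t ex_enumerate(const struct ex_leaf *leaves, size_t nr_leaves,
			   struct ex_range *out, size_t *nr_ignored)
{
	size_t nr_used = 0;

	*nr_ignored = 0;
	for (size_t i = 0; i < nr_leaves; i++) {
		uint32_t type = leaves[i].eax & EX_LEAF_TYPE_MASK;

		if (type == EX_LEAF_INVALID)
			break;
		if (type != EX_LEAF_EPC_SECTION)
			continue;	/* unknown type: keep going */
		if (nr_used == EX_MAX_SECTIONS) {
			(*nr_ignored)++;	/* would be reported via pr_err() */
			continue;
		}
		out[nr_used].pa = ex_section_metric(leaves[i].eax, leaves[i].ebx);
		out[nr_used].size = ex_section_metric(leaves[i].ecx, leaves[i].edx);
		nr_used++;
	}
	return nr_used;
}
```

Note that the sketch indexes the output by the number of usable sections
rather than by sub-leaf number, so a skipped leaf does not leave a hole
in the array.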

> +			break;
> +		}
> +
> +		pa = sgx_calc_section_metric(eax, ebx);
> +		size = sgx_calc_section_metric(ecx, edx);
> +		pr_info("sgx: EPC section 0x%llx-0x%llx\n", pa, pa + size - 1);
> +
> +		ret = sgx_init_epc_section(pa, size, i, &sgx_epc_sections[i]);
> +		if (ret) {
> +			sgx_page_cache_teardown();
> +			return ret;

Similar to encountering unknown sections, any reason why we wouldn't
continue here and use whatever EPC was successfully initialized?

> +		}
> +
> +		sgx_nr_epc_sections++;
> +	}
> +
> +	if (!sgx_nr_epc_sections) {
> +		pr_err("sgx: There are zero EPC sections.\n");
> +		return -ENODEV;
> +	}
> +
> +	return 0;
> +}
> +
> +static __init int sgx_init(void)
> +{
> +	int ret;
> +
> +	if (!boot_cpu_has(X86_FEATURE_SGX))
> +		return false;
> +
> +	ret = sgx_page_cache_init();
> +	if (ret)
> +		return ret;
> +
> +	return 0;
> +}
> +
> +arch_initcall(sgx_init);
> diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
> new file mode 100644
> index 000000000000..228e3dae360d
> --- /dev/null
> +++ b/arch/x86/kernel/cpu/sgx/sgx.h
> @@ -0,0 +1,62 @@
> +/* SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause) */
> +#ifndef _X86_SGX_H
> +#define _X86_SGX_H
> +
> +#include <linux/bitops.h>
> +#include <linux/err.h>
> +#include <linux/io.h>
> +#include <linux/rwsem.h>
> +#include <linux/types.h>
> +#include <asm/asm.h>
> +#include <uapi/asm/sgx_errno.h>
> +
> +struct sgx_epc_page {
> +	unsigned long desc;
> +	struct list_head list;
> +};
> +
> +/**
> + * struct sgx_epc_section
> + *
> + * The firmware can define multiple chunks of EPC to the different areas of the
> + * physical memory e.g. for memory areas of the each node. This structure is
> + * used to store EPC pages for one EPC section and virtual memory area where
> + * the pages have been mapped.
> + */
> +struct sgx_epc_section {
> +	unsigned long pa;
> +	void *va;
> +	struct list_head page_list;
> +	unsigned long free_cnt;
> +	spinlock_t lock;
> +};
> +
> +#define SGX_MAX_EPC_SECTIONS	8
> +
> +extern struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];
> +
> +/**
> + * enum sgx_epc_page_desc - bits and masks for an EPC page's descriptor
> + * %SGX_EPC_SECTION_MASK:	SGX allows to have multiple EPC sections in the
> + *				physical memory. The existing and near-future
> + *				hardware defines at most eight sections, hence
> + *				three bits to hold a section.
> + */
> +enum sgx_epc_page_desc {
> +	SGX_EPC_SECTION_MASK			= GENMASK_ULL(3, 0),
> +	/* bits 12-63 are reserved for the physical page address of the page */
> +};
> +
> +static inline struct sgx_epc_section *sgx_epc_section(struct sgx_epc_page *page)
> +{
> +	return &sgx_epc_sections[page->desc & SGX_EPC_SECTION_MASK];
> +}
> +
> +static inline void *sgx_epc_addr(struct sgx_epc_page *page)
> +{
> +	struct sgx_epc_section *section = sgx_epc_section(page);
> +
> +	return section->va + (page->desc & PAGE_MASK) - section->pa;
> +}
> +
> +#endif /* _X86_SGX_H */
> -- 
> 2.19.1
>
Jarkko Sakkinen March 21, 2019, 2:40 p.m. UTC | #2
On Mon, Mar 18, 2019 at 12:50:43PM -0700, Sean Christopherson wrote:
> On Sun, Mar 17, 2019 at 11:14:41PM +0200, Jarkko Sakkinen wrote:
> > From: Sean Christopherson <sean.j.christopherson@intel.com>
> > 
> > Enumerate Enclave Page Cache (EPC) sections via CPUID and add the data
> > structures necessary to track EPC pages so that they can be allocated,
> > freed and managed.  As a system may have multiple EPC sections, invoke
> > CPUID on SGX sub-leafs until an invalid leaf is encountered.
> > 
> > On NUMA systems, a node can have at most one bank. A bank can be part of
> > at most two nodes.  SGX supports both nodes with a single memory
> > controller and also sub-cluster nodes with several memory controllers
> > on a single die.
> > 
> > For simplicity, support a maximum of eight EPC sections.  Current
> > client hardware supports only a single section, while upcoming server
> > hardware will support at most eight sections.  Bounding the number of
> > sections also allows the section ID to be embedded along with a page's
> > offset in a single unsigned long, enabling easy retrieval of both the
> > VA and PA for a given page.
> > 
> > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> > Co-developed-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
> > Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
> > Co-developed-by: Suresh Siddha <suresh.b.siddha@intel.com>
> > Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
> > Co-developed-by: Serge Ayoun <serge.ayoun@intel.com>
> > Signed-off-by: Serge Ayoun <serge.ayoun@intel.com>
> > ---
> >  arch/x86/Kconfig                 |  19 ++++
> >  arch/x86/kernel/cpu/Makefile     |   1 +
> >  arch/x86/kernel/cpu/sgx/Makefile |   1 +
> >  arch/x86/kernel/cpu/sgx/main.c   | 149 +++++++++++++++++++++++++++++++
> >  arch/x86/kernel/cpu/sgx/sgx.h    |  62 +++++++++++++
> >  5 files changed, 232 insertions(+)
> >  create mode 100644 arch/x86/kernel/cpu/sgx/Makefile
> >  create mode 100644 arch/x86/kernel/cpu/sgx/main.c
> >  create mode 100644 arch/x86/kernel/cpu/sgx/sgx.h
> > 
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index c1f9b3cf437c..dc630208003f 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -1921,6 +1921,25 @@ config X86_INTEL_MEMORY_PROTECTION_KEYS
> >  
> >  	  If unsure, say y.
> >  
> > +config INTEL_SGX
> > +	bool "Intel SGX core functionality"
> > +	depends on X86_64 && CPU_SUP_INTEL
> > +	help
> > +
> > +	Intel(R) SGX is a set of CPU instructions that can be used by
> > +	applications to set aside private regions of code and data.  The code
> > +	outside the enclave is disallowed to access the memory inside the
> > +	enclave by the CPU access control.
> 
> "enclave" is used before it's defined.  And the second sentence could be
> tweaked to make it explicitly clear that hardware disallows cross-enclave
> access.  E.g.:
> 
> 	Intel(R) SGX is a set of CPU instructions that can be used by
> 	applications to set aside private regions of code and data, referred
> 	to as enclaves.  An enclave's private memory can only be accessed by
> 	code running within the enclave.  Accesses from outside the enclave,
> 	including other enclaves, are disallowed by hardware.

Agreed.

> 
> > +
> > +	The firmware uses PRMRR registers to reserve an area of physical memory
> > +	called Enclave Page Cache (EPC). There is a hardware unit in the
> > +	processor called Memory Encryption Engine. The MEE encrypts and decrypts
> > +	the EPC pages as they enter and leave the processor package.
> 
> This second paragraph can probably be dropped altogether.  A reader won't
> know what PRMRR means unless they're already familiar with SGX.  And the
> PRMRR+MEE implementation is not architectural, i.e. future hardware could
> support EPC through some other mechanism.  SGX does more than just encrypt
> memory, covering those details is probably best left to intel_sgx.rst.

Ditto.

> 
> > +
> > +	For details, see Documentation/x86/intel_sgx.rst
> > +
> > +	If unsure, say N.
> > +
> >  config EFI
> >  	bool "EFI runtime service support"
> >  	depends on ACPI
> > diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
> > index cfd24f9f7614..d1163c0fd5d6 100644
> > --- a/arch/x86/kernel/cpu/Makefile
> > +++ b/arch/x86/kernel/cpu/Makefile
> > @@ -40,6 +40,7 @@ obj-$(CONFIG_X86_MCE)			+= mce/
> >  obj-$(CONFIG_MTRR)			+= mtrr/
> >  obj-$(CONFIG_MICROCODE)			+= microcode/
> >  obj-$(CONFIG_X86_CPU_RESCTRL)		+= resctrl/
> > +obj-$(CONFIG_INTEL_SGX)			+= sgx/
> >  
> >  obj-$(CONFIG_X86_LOCAL_APIC)		+= perfctr-watchdog.o
> >  
> > diff --git a/arch/x86/kernel/cpu/sgx/Makefile b/arch/x86/kernel/cpu/sgx/Makefile
> > new file mode 100644
> > index 000000000000..b666967fd570
> > --- /dev/null
> > +++ b/arch/x86/kernel/cpu/sgx/Makefile
> > @@ -0,0 +1 @@
> > +obj-y += main.o
> > diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> > new file mode 100644
> > index 000000000000..18ce4acdd7ef
> > --- /dev/null
> > +++ b/arch/x86/kernel/cpu/sgx/main.c
> > @@ -0,0 +1,149 @@
> > +// SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause)
> > +// Copyright(c) 2016-17 Intel Corporation.
> > +
> > +#include <linux/freezer.h>
> > +#include <linux/highmem.h>
> > +#include <linux/kthread.h>
> > +#include <linux/pagemap.h>
> > +#include <linux/ratelimit.h>
> > +#include <linux/sched/signal.h>
> > +#include <linux/slab.h>
> > +#include "arch.h"
> > +#include "sgx.h"
> > +
> > +struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];
> 
> Dynamically allocating sgx_epc_sections isn't exactly difficult, and
> AFAICT the static allocation is the primary motivation for capping
> SGX_MAX_EPC_SECTIONS at such a low value (8).  I still think it makes
> sense to define SGX_MAX_EPC_SECTIONS so that the section number can
> be embedded in the offset, along with flags.  But the max can be
> significantly higher, e.g. using 7 bits to support 128 sections.
> 

I don't disagree with you but I think for the existing and foreseeable
hardware this is good enough. Can be refined if there is ever need.

> I realize hardware is highly unlikely to have more than 8 sections, at
> least for the near future, but IMO the small amount of extra complexity
> is worth having a bit of breathing room.

Yup.

> 
> > +EXPORT_SYMBOL_GPL(sgx_epc_sections);
> > +
> > +static int sgx_nr_epc_sections;
> > +
> > +static void sgx_section_put_page(struct sgx_epc_section *section,
> > +				 struct sgx_epc_page *page)
> > +{
> > +	list_add_tail(&page->list, &section->page_list);
> > +	section->free_cnt++;
> > +}
> > +
> > +static __init void sgx_free_epc_section(struct sgx_epc_section *section)
> > +{
> > +	struct sgx_epc_page *page;
> > +
> > +	while (!list_empty(&section->page_list)) {
> > +		page = list_first_entry(&section->page_list,
> > +					struct sgx_epc_page, list);
> > +		list_del(&page->list);
> > +		kfree(page);
> > +	}
> > +	memunmap(section->va);
> > +}
> > +
> > +static __init int sgx_init_epc_section(u64 addr, u64 size, unsigned long index,
> > +				       struct sgx_epc_section *section)
> > +{
> > +	unsigned long nr_pages = size >> PAGE_SHIFT;
> > +	struct sgx_epc_page *page;
> > +	unsigned long i;
> > +
> > +	section->va = memremap(addr, size, MEMREMAP_WB);
> > +	if (!section->va)
> > +		return -ENOMEM;
> > +
> > +	section->pa = addr;
> > +	spin_lock_init(&section->lock);
> > +	INIT_LIST_HEAD(&section->page_list);
> > +
> > +	for (i = 0; i < nr_pages; i++) {
> > +		page = kzalloc(sizeof(*page), GFP_KERNEL);
> > +		if (!page)
> > +			goto out;
> > +		page->desc = (addr + (i << PAGE_SHIFT)) | index;
> > +		sgx_section_put_page(section, page);
> > +	}
> 
> Not sure if this is the correct location, but at some point the kernel
> needs to sanitize the EPC during init.  EPC pages may be in an unknown
> state, e.g. after kexec(), which will cause all manner of faults and
> warnings.  Maybe the best approach is to sanitize on-demand, e.g. suppress
> the first WARN due to unexpected ENCLS failure and purge the EPC at that
> time.  The downside of that approach is that exposing EPC to a guest would
> need to implement its own sanitization flow.

Hmm... Let's think this through. I'm just thinking how sanitization on
demand would actually work given the parent-child relationships.

> 
> > +
> > +	return 0;
> > +out:
> > +	sgx_free_epc_section(section);
> > +	return -ENOMEM;
> > +}
> > +
> > +static __init void sgx_page_cache_teardown(void)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < sgx_nr_epc_sections; i++)
> > +		sgx_free_epc_section(&sgx_epc_sections[i]);
> > +}
> > +
> > +/**
> > + * A section metric is concatenated in a way that @low bits 12-31 define the
> > + * bits 12-31 of the metric and @high bits 0-19 define the bits 32-51 of the
> > + * metric.
> > + */
> > +static inline u64 sgx_calc_section_metric(u64 low, u64 high)
> > +{
> > +	return (low & GENMASK_ULL(31, 12)) +
> > +	       ((high & GENMASK_ULL(19, 0)) << 32);
> > +}
> > +
> > +static __init int sgx_page_cache_init(void)
> > +{
> > +	u32 eax, ebx, ecx, edx, type;
> > +	u64 pa, size;
> > +	int ret;
> > +	int i;
> > +
> > +	BUILD_BUG_ON(SGX_MAX_EPC_SECTIONS > (SGX_EPC_SECTION_MASK + 1));
> > +
> > +	for (i = 0; i < (SGX_MAX_EPC_SECTIONS + 1); i++) {
> > +		cpuid_count(SGX_CPUID, i + SGX_CPUID_FIRST_VARIABLE_SUB_LEAF,
> > +			    &eax, &ebx, &ecx, &edx);
> > +
> > +		type = eax & SGX_CPUID_SUB_LEAF_TYPE_MASK;
> > +		if (type == SGX_CPUID_SUB_LEAF_INVALID)
> > +			break;
> > +		if (type != SGX_CPUID_SUB_LEAF_EPC_SECTION) {
> > +			pr_err_once("sgx: Unknown sub-leaf type: %u\n", type);
> > +			return -ENODEV;
> 
> This should probably be "continue" rather than "return -ENODEV".  SGX
> can still be used in the (extremely) unlikely event that there is usable
> EPC and some unknown memory type enumerated.

OK, let's do that. Maybe pr_warn_once() should be used as well?

> 
> > +		}
> > +		if (i == SGX_MAX_EPC_SECTIONS) {
> > +			pr_warn("sgx: More than "
> > +				__stringify(SGX_MAX_EPC_SECTIONS)
> > +				" EPC sections\n");
> 
> This isn't a very helpful message, e.g. it doesn't even imply that the
> kernel is ignoring EPC sections.  It'd also be helpful to display the
> sections that are being ignored.  Might also warrant pr_err() since it
> means system resources are being ignored.
> 
> E.g.:
> 
> #define SGX_ARBITRARY_LOOP_TERMINATOR   1000
> 
> 	for (i = 0; i < SGX_ARBITRARY_LOOP_TERMINATOR; i++) {
> 		...
> 
> 		if (i >= SGX_MAX_EPC_SECTIONS) {
> 			pr_err("sgx: Reached max number of EPC sections (%u), "
> 			       "ignoring section 0x%llx-0x%llx\n",
> 			       SGX_MAX_EPC_SECTIONS, pa, pa + size - 1);
> 		}

Fully agree with these proposals!

> 	}
> 
> > +			break;
> > +		}
> > +
> > +		pa = sgx_calc_section_metric(eax, ebx);
> > +		size = sgx_calc_section_metric(ecx, edx);
> > +		pr_info("sgx: EPC section 0x%llx-0x%llx\n", pa, pa + size - 1);
> > +
> > +		ret = sgx_init_epc_section(pa, size, i, &sgx_epc_sections[i]);
> > +		if (ret) {
> > +			sgx_page_cache_teardown();
> > +			return ret;
> 
> Similar to encountering unknown sections, any reason why we wouldn't
> continue here and use whatever EPC was successfully initialized?

Nope.

> 
> > +		}
> > +
> > +		sgx_nr_epc_sections++;
> > +	}
> > +
> > +	if (!sgx_nr_epc_sections) {
> > +		pr_err("sgx: There are zero EPC sections.\n");
> > +		return -ENODEV;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +static __init int sgx_init(void)
> > +{
> > +	int ret;
> > +
> > +	if (!boot_cpu_has(X86_FEATURE_SGX))
> > +		return false;
> > +
> > +	ret = sgx_page_cache_init();
> > +	if (ret)
> > +		return ret;
> > +
> > +	return 0;
> > +}
> > +
> > +arch_initcall(sgx_init);
> > diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
> > new file mode 100644
> > index 000000000000..228e3dae360d
> > --- /dev/null
> > +++ b/arch/x86/kernel/cpu/sgx/sgx.h
> > @@ -0,0 +1,62 @@
> > +/* SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause) */
> > +#ifndef _X86_SGX_H
> > +#define _X86_SGX_H
> > +
> > +#include <linux/bitops.h>
> > +#include <linux/err.h>
> > +#include <linux/io.h>
> > +#include <linux/rwsem.h>
> > +#include <linux/types.h>
> > +#include <asm/asm.h>
> > +#include <uapi/asm/sgx_errno.h>
> > +
> > +struct sgx_epc_page {
> > +	unsigned long desc;
> > +	struct list_head list;
> > +};
> > +
> > +/**
> > + * struct sgx_epc_section
> > + *
> > + * The firmware can define multiple chunks of EPC to the different areas of the
> > + * physical memory e.g. for memory areas of the each node. This structure is
> > + * used to store EPC pages for one EPC section and virtual memory area where
> > + * the pages have been mapped.
> > + */
> > +struct sgx_epc_section {
> > +	unsigned long pa;
> > +	void *va;
> > +	struct list_head page_list;
> > +	unsigned long free_cnt;
> > +	spinlock_t lock;
> > +};
> > +
> > +#define SGX_MAX_EPC_SECTIONS	8
> > +
> > +extern struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];
> > +
> > +/**
> > + * enum sgx_epc_page_desc - bits and masks for an EPC page's descriptor
> > + * %SGX_EPC_SECTION_MASK:	SGX allows to have multiple EPC sections in the
> > + *				physical memory. The existing and near-future
> > + *				hardware defines at most eight sections, hence
> > + *				three bits to hold a section.
> > + */
> > +enum sgx_epc_page_desc {
> > +	SGX_EPC_SECTION_MASK			= GENMASK_ULL(3, 0),
> > +	/* bits 12-63 are reserved for the physical page address of the page */
> > +};
> > +
> > +static inline struct sgx_epc_section *sgx_epc_section(struct sgx_epc_page *page)
> > +{
> > +	return &sgx_epc_sections[page->desc & SGX_EPC_SECTION_MASK];
> > +}
> > +
> > +static inline void *sgx_epc_addr(struct sgx_epc_page *page)
> > +{
> > +	struct sgx_epc_section *section = sgx_epc_section(page);
> > +
> > +	return section->va + (page->desc & PAGE_MASK) - section->pa;
> > +}
> > +
> > +#endif /* _X86_SGX_H */
> > -- 
> > 2.19.1
> > 

/Jarkko
Sean Christopherson March 21, 2019, 3:28 p.m. UTC | #3
On Thu, Mar 21, 2019 at 04:40:56PM +0200, Jarkko Sakkinen wrote:
> On Mon, Mar 18, 2019 at 12:50:43PM -0700, Sean Christopherson wrote:
> > On Sun, Mar 17, 2019 at 11:14:41PM +0200, Jarkko Sakkinen wrote:
> > Dynamically allocating sgx_epc_sections isn't exactly difficult, and
> > AFAICT the static allocation is the primary motivation for capping
> > SGX_MAX_EPC_SECTIONS at such a low value (8).  I still think it makes
> > sense to define SGX_MAX_EPC_SECTIONS so that the section number can
> > be embedded in the offset, along with flags.  But the max can be
> > significantly higher, e.g. using 7 bits to support 128 sections.
> > 
> 
> I don't disagree with you but I think for the existing and foreseeable
> hardware this is good enough. Can be refined if there is ever need.

My concern is that there may be virtualization use cases that want to
expose more than 8 EPC sections to a guest.  I have no idea if this is
anything more than paranoia, but at the same time the cost to increase
support to 128+ sections is quite low.
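
As a rough illustration of how small the change is, the section array
could be sized at runtime from the number of sections actually
enumerated, with the cap coming only from the bits available in the
page descriptor.  The sketch below is user-space C with calloc()
standing in for a kernel allocator; struct ex_section, the EX_* names
and the 7-bit cap are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

#define EX_SECTION_BITS	7
#define EX_MAX_SECTIONS	(1u << EX_SECTION_BITS)	/* 128 encodable sections */

/* Hypothetical, trimmed-down stand-in for struct sgx_epc_section. */
struct ex_section {
	unsigned long pa;
	unsigned long nr_pages;
};

/*
 * Allocate a zeroed array for @nr sections.  Returns NULL if @nr is zero,
 * exceeds what the descriptor can encode, or the allocation fails.
 */
static struct ex_section *ex_alloc_sections(unsigned int nr)
{
	if (nr == 0 || nr > EX_MAX_SECTIONS)
		return NULL;

	return calloc(nr, sizeof(struct ex_section));
}
```

A host with a single section then pays for one entry, while a guest
that is handed a larger number of sections is still covered up to the
encodable limit.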

> > I realize hardware is highly unlikely to have more than 8 sections, at
> > least for the near future, but IMO the small amount of extra complexity
> > is worth having a bit of breathing room.
> 
> Yup.
> 
> > > +static __init int sgx_init_epc_section(u64 addr, u64 size, unsigned long index,
> > > +				       struct sgx_epc_section *section)
> > > +{
> > > +	unsigned long nr_pages = size >> PAGE_SHIFT;
> > > +	struct sgx_epc_page *page;
> > > +	unsigned long i;
> > > +
> > > +	section->va = memremap(addr, size, MEMREMAP_WB);
> > > +	if (!section->va)
> > > +		return -ENOMEM;
> > > +
> > > +	section->pa = addr;
> > > +	spin_lock_init(&section->lock);
> > > +	INIT_LIST_HEAD(&section->page_list);
> > > +
> > > +	for (i = 0; i < nr_pages; i++) {
> > > +		page = kzalloc(sizeof(*page), GFP_KERNEL);
> > > +		if (!page)
> > > +			goto out;
> > > +		page->desc = (addr + (i << PAGE_SHIFT)) | index;
> > > +		sgx_section_put_page(section, page);
> > > +	}
> > 
> > Not sure if this is the correct location, but at some point the kernel
> > needs to sanitize the EPC during init.  EPC pages may be in an unknown
> > state, e.g. after kexec(), which will cause all manner of faults and
> > warnings.  Maybe the best approach is to sanitize on-demand, e.g. suppress
> > the first WARN due to unexpected ENCLS failure and purge the EPC at that
> > time.  The downside of that approach is that exposing EPC to a guest would
> > need to implement its own sanitization flow.
> 
> Hmm... Let's think this through. I'm just thinking how sanitization on
> demand would actually work given the parent-child relationships.

It's ugly.

  1. Temporarily disable EPC allocation and enclave fault handling
  2. Zap all TCS PTEs in all enclaves
  3. Flush all logical CPUs from enclaves via IPI
  4. Forcefully reclaim all EPC pages from enclaves
  5. EREMOVE all "free" EPC pages, track pages that fail with SGX_CHILD_PRESENT
  6. EREMOVE all EPC pages that failed with SGX_CHILD_PRESENT
  7. Disable SGX if any EREMOVE failed in step 6
  8. Re-enable EPC allocation and enclave fault handling
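
Steps 5-7 boil down to a two-pass removal.  The toy model below shows
why the second pass is needed; it is not kernel code.  ex_eremove() is
a made-up stand-in for EREMOVE that refuses to remove a page while any
page naming it as parent is still valid, and all EX_*/ex_* names are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum ex_status { EX_OK = 0, EX_CHILD_PRESENT = 1 };

/* Toy EPC page: @parent is the index of its parent page, or -1. */
struct ex_page {
	int parent;
	bool valid;	/* true until the page has been removed */
};

/* Stand-in for EREMOVE: a parent cannot go while a child is valid. */
static enum ex_status ex_eremove(struct ex_page *pages, size_t nr, size_t idx)
{
	for (size_t i = 0; i < nr; i++)
		if (pages[i].valid && pages[i].parent == (int)idx)
			return EX_CHILD_PRESENT;

	pages[idx].valid = false;
	return EX_OK;
}

/*
 * Pass 1 removes every page and records the ones that fail because a
 * child is still present.  Pass 2 retries only those.  Returns false if
 * any retry fails, i.e. the case where SGX would have to be disabled.
 * @retry is caller-provided scratch space with @nr entries.
 */
static bool ex_sanitize(struct ex_page *pages, size_t nr, bool *retry)
{
	bool ok = true;

	for (size_t i = 0; i < nr; i++)
		retry[i] = ex_eremove(pages, nr, i) == EX_CHILD_PRESENT;

	for (size_t i = 0; i < nr; i++) {
		if (!retry[i])
			continue;
		if (ex_eremove(pages, nr, i) != EX_OK)
			ok = false;
	}
	return ok;
}
```

In the common case every child is removed in the first pass, so the
parents that were deferred succeed on the retry.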

Exposing EPC to a VM would still require sanitization.

Sanitizing during boot is a lot cleaner, the primary concern is that it
will significantly increase boot time on systems with large EPCs.  If we
can somehow limit this to kexec() and that's the only scenario where the
EPC needs to be sanitized, then that would mitigate the boot time concern.

We might also be able to get away with unconditionally sanitizing the EPC
post-boot, e.g. via worker threads, returning -EBUSY for everything until
the EPC is good to go.

> 
> > 
> > > +
> > > +	return 0;
> > > +out:
> > > +	sgx_free_epc_section(section);
> > > +	return -ENOMEM;
> > > +}
> > > +
> > > +static __init void sgx_page_cache_teardown(void)
> > > +{
> > > +	int i;
> > > +
> > > +	for (i = 0; i < sgx_nr_epc_sections; i++)
> > > +		sgx_free_epc_section(&sgx_epc_sections[i]);
> > > +}
> > > +
> > > +/**
> > > + * A section metric is concatenated in a way that @low bits 12-31 define the
> > > + * bits 12-31 of the metric and @high bits 0-19 define the bits 32-51 of the
> > > + * metric.
> > > + */
> > > +static inline u64 sgx_calc_section_metric(u64 low, u64 high)
> > > +{
> > > +	return (low & GENMASK_ULL(31, 12)) +
> > > +	       ((high & GENMASK_ULL(19, 0)) << 32);
> > > +}
> > > +
> > > +static __init int sgx_page_cache_init(void)
> > > +{
> > > +	u32 eax, ebx, ecx, edx, type;
> > > +	u64 pa, size;
> > > +	int ret;
> > > +	int i;
> > > +
> > > +	BUILD_BUG_ON(SGX_MAX_EPC_SECTIONS > (SGX_EPC_SECTION_MASK + 1));
> > > +
> > > +	for (i = 0; i < (SGX_MAX_EPC_SECTIONS + 1); i++) {
> > > +		cpuid_count(SGX_CPUID, i + SGX_CPUID_FIRST_VARIABLE_SUB_LEAF,
> > > +			    &eax, &ebx, &ecx, &edx);
> > > +
> > > +		type = eax & SGX_CPUID_SUB_LEAF_TYPE_MASK;
> > > +		if (type == SGX_CPUID_SUB_LEAF_INVALID)
> > > +			break;
> > > +		if (type != SGX_CPUID_SUB_LEAF_EPC_SECTION) {
> > > +			pr_err_once("sgx: Unknown sub-leaf type: %u\n", type);
> > > +			return -ENODEV;
> > 
> > This should probably be "continue" rather than "return -ENODEV".  SGX
> > can still be used in the (extremely) unlikely event that there is usable
> > EPC and some unknown memory type enumerated.
> 
> OK, let's do that. Maybe also pr_warn_once() should be used?

Yeah, probably.  If we use pr_warn here, then we should also convert
pr_err("sgx: There are zero EPC sections.\n"); to use pr_warn() since
that'll likely fire at the same time.

> 
> > 
> > > +		}
> > > +		if (i == SGX_MAX_EPC_SECTIONS) {
> > > +			pr_warn("sgx: More than "
> > > +				__stringify(SGX_MAX_EPC_SECTIONS)
> > > +				" EPC sections\n");
> > 
> > This isn't a very helpful message, e.g. it doesn't even imply that the
> > kernel is ignoring EPC sections.  It'd also be helpful to display the
> > sections that are being ignored.  Might also warrant pr_err() since it
> > means system resources are being ignored.
> > 
> > E.g.:
> > 
> > #define SGX_ARBITRARY_LOOP_TERMINATOR   1000
> > 
> > 	for (i = 0; i < SGX_ARBITRARY_LOOP_TERMINATOR; i++) {
> > 		...
> > 
> > 		if (i >= SGX_MAX_EPC_SECTIONS) {
> > 			pr_err("sgx: Reached max number of EPC sections (%u), "
> > 			       "ignoring section 0x%llx-0x%llx\n",
> > 			       pa, pa + size - 1);
> > 		}
> 
> Fully agree with these proposals!
> 
> > 	}
> > 
> > > +			break;
> > > +		}
> > > +
> > > +		pa = sgx_calc_section_metric(eax, ebx);
> > > +		size = sgx_calc_section_metric(ecx, edx);
> > > +		pr_info("sgx: EPC section 0x%llx-0x%llx\n", pa, pa + size - 1);
> > > +
> > > +		ret = sgx_init_epc_section(pa, size, i, &sgx_epc_sections[i]);
> > > +		if (ret) {
> > > +			sgx_page_cache_teardown();
> > > +			return ret;
> > 
> > Similar to encountering unknown sections, any reason why we wouldn't
> > > continue here and use whatever EPC was successfully initialized?
> 
> Nope.
> 
> > 
> > > +		}
> > > +
> > > +		sgx_nr_epc_sections++;
> > > +	}
> > > +
> > > +	if (!sgx_nr_epc_sections) {
> > > +		pr_err("sgx: There are zero EPC sections.\n");
> > > +		return -ENODEV;
> > > +	}
> > > +
> > > +	return 0;
> > > +}
Jarkko Sakkinen March 22, 2019, 10:19 a.m. UTC | #4
On Thu, Mar 21, 2019 at 08:28:10AM -0700, Sean Christopherson wrote:
> On Thu, Mar 21, 2019 at 04:40:56PM +0200, Jarkko Sakkinen wrote:
> > On Mon, Mar 18, 2019 at 12:50:43PM -0700, Sean Christopherson wrote:
> > > On Sun, Mar 17, 2019 at 11:14:41PM +0200, Jarkko Sakkinen wrote:
> > > Dynamically allocating sgx_epc_sections isn't exactly difficult, and
> > > AFAICT the static allocation is the primary motivation for capping
> > > SGX_MAX_EPC_SECTIONS at such a low value (8).  I still think it makes
> > > sense to define SGX_MAX_EPC_SECTIONS so that the section number can
> > > be embedded in the offset, along with flags.  But the max can be
> > > significantly higher, e.g. using 7 bits to support 128 sections.
> > > 
> > 
> > > I don't disagree with you but I think for the existing and foreseeable
> > hardware this is good enough. Can be refined if there is ever need.
> 
> My concern is that there may be virtualization use cases that want to
> expose more than 8 EPC sections to a guest.  I have no idea if this is
> anything more than paranoia, but at the same time the cost to increase
> support to 128+ sections is quite low.
> 
> > > I realize hardware is highly unlikely to have more than 8 sections, at
> > > least for the near future, but IMO the small amount of extra complexity
> > > is worth having a bit of breathing room.
> > 
> > Yup.
> > 
> > > > +static __init int sgx_init_epc_section(u64 addr, u64 size, unsigned long index,
> > > > +				       struct sgx_epc_section *section)
> > > > +{
> > > > +	unsigned long nr_pages = size >> PAGE_SHIFT;
> > > > +	struct sgx_epc_page *page;
> > > > +	unsigned long i;
> > > > +
> > > > +	section->va = memremap(addr, size, MEMREMAP_WB);
> > > > +	if (!section->va)
> > > > +		return -ENOMEM;
> > > > +
> > > > +	section->pa = addr;
> > > > +	spin_lock_init(&section->lock);
> > > > +	INIT_LIST_HEAD(&section->page_list);
> > > > +
> > > > +	for (i = 0; i < nr_pages; i++) {
> > > > +		page = kzalloc(sizeof(*page), GFP_KERNEL);
> > > > +		if (!page)
> > > > +			goto out;
> > > > +		page->desc = (addr + (i << PAGE_SHIFT)) | index;
> > > > +		sgx_section_put_page(section, page);
> > > > +	}
> > > 
> > > Not sure if this is the correct location, but at some point the kernel
> > > needs to sanitize the EPC during init.  EPC pages may be in an unknown
> > > state, e.g. after kexec(), which will cause all manner of faults and
> > > warnings.  Maybe the best approach is to sanitize on-demand, e.g. suppress
> > > the first WARN due to unexpected ENCLS failure and purge the EPC at that
> > > time.  The downside of that approach is that exposing EPC to a guest would
> > > need to implement its own sanitization flow.
> > 
> > Hmm... Let's think this through. I'm just thinking how sanitization on
> > demand would actually work given the parent-child relationships.
> 
> It's ugly.
> 
>   1. Temporarily disable EPC allocation and enclave fault handling
>   2. Zap all TCS PTEs in all enclaves
>   3. Flush all logical CPUs from enclaves via IPI
>   4. Forcefully reclaim all EPC pages from enclaves
>   5. EREMOVE all "free" EPC pages, track pages that fail with SGX_CHILD_PRESENT
>   6. EREMOVE all EPC pages that failed with SGX_CHILD_PRESENT
>   7. Disable SGX if any EREMOVE failed in step 6
>   8. Re-enable EPC allocation and enclave fault handling
> 
> Exposing EPC to a VM would still require sanitization.
> 
> Sanitizing during boot is a lot cleaner, the primary concern is that it
> will significantly increase boot time on systems with large EPCs.  If we
> can somehow limit this to kexec() and that's the only scenario where the
> EPC needs to be sanitized, then that would mitigate the boot time concern.
> 
> We might also be able to get away with unconditionally sanitizing the EPC
> post-boot, e.g. via worker threads, returning -EBUSY for everything until
> the EPC is good to go.

I like the worker threads approach better. It is something that is
maintainable. I don't see any better solution given the hierarchical
nature of enclaves. It is also fairly easy to implement without making
major changes to the other parts of the implementation.

I.e. every time the driver initializes:

1. Move all EPC first to a bad pool.
2. Let worker threads move EPC to the real allocation pool.

Then the OS can immediately start to use EPC.
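
To make the two-pool idea concrete, here is a minimal userspace sketch. It
rests on assumptions: the names (struct epc_pools, pools_worker_step(),
pools_alloc()) are hypothetical, clearing the dirty flag stands in for
EREMOVE, and there is no locking, which a real implementation would need
between the worker and the allocator.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical model; the names are illustrative, not from the patch. */
struct pool_page {
	struct pool_page *next;
	int dirty; /* non-zero until the worker has sanitized the page */
};

struct epc_pools {
	struct pool_page *unsanitized; /* the "bad pool" */
	struct pool_page *free;        /* the real allocation pool */
};

static void pool_push(struct pool_page **head, struct pool_page *page)
{
	page->next = *head;
	*head = page;
}

static struct pool_page *pool_pop(struct pool_page **head)
{
	struct pool_page *page = *head;

	if (page) {
		*head = page->next;
		page->next = NULL;
	}
	return page;
}

/* Driver init: every page is treated as dirty until sanitized. */
static void pools_init(struct epc_pools *pools, struct pool_page *pages,
		       size_t nr)
{
	size_t i;

	pools->unsanitized = NULL;
	pools->free = NULL;
	for (i = 0; i < nr; i++) {
		pages[i].dirty = 1;
		pool_push(&pools->unsanitized, &pages[i]);
	}
}

/* One unit of worker progress. Returns 0 once there is nothing left. */
static int pools_worker_step(struct epc_pools *pools)
{
	struct pool_page *page = pool_pop(&pools->unsanitized);

	if (!page)
		return 0;
	page->dirty = 0; /* stands in for EREMOVE */
	pool_push(&pools->free, page);
	return 1;
}

/* -EBUSY while sanitization is pending, -ENOMEM when truly exhausted. */
static int pools_alloc(struct epc_pools *pools, struct pool_page **out)
{
	struct pool_page *page = pool_pop(&pools->free);

	if (page) {
		assert(!page->dirty);
		*out = page;
		return 0;
	}
	return pools->unsanitized ? -EBUSY : -ENOMEM;
}
```

With this split, allocation never hands out a page that has not been
through the worker, and callers can tell "not ready yet" apart from "out of
EPC".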

Is this along the lines of what you had in mind?

/Jarkko
Jarkko Sakkinen March 22, 2019, 10:50 a.m. UTC | #5
On Fri, Mar 22, 2019 at 12:19:40PM +0200, Jarkko Sakkinen wrote:
> On Thu, Mar 21, 2019 at 08:28:10AM -0700, Sean Christopherson wrote:
> > On Thu, Mar 21, 2019 at 04:40:56PM +0200, Jarkko Sakkinen wrote:
> > > On Mon, Mar 18, 2019 at 12:50:43PM -0700, Sean Christopherson wrote:
> > > > On Sun, Mar 17, 2019 at 11:14:41PM +0200, Jarkko Sakkinen wrote:
> > > > Dynamically allocating sgx_epc_sections isn't exactly difficult, and
> > > > AFAICT the static allocation is the primary motivation for capping
> > > > SGX_MAX_EPC_SECTIONS at such a low value (8).  I still think it makes
> > > > sense to define SGX_MAX_EPC_SECTIONS so that the section number can
> > > > be embedded in the offset, along with flags.  But the max can be
> > > > significantly higher, e.g. using 7 bits to support 128 sections.
> > > > 
> > > 
> > > > I don't disagree with you but I think for the existing and foreseeable
> > > hardware this is good enough. Can be refined if there is ever need.
> > 
> > My concern is that there may be virtualization use cases that want to
> > expose more than 8 EPC sections to a guest.  I have no idea if this is
> > anything more than paranoia, but at the same time the cost to increase
> > support to 128+ sections is quite low.
> > 
> > > > I realize hardware is highly unlikely to have more than 8 sections, at
> > > > least for the near future, but IMO the small amount of extra complexity
> > > > is worth having a bit of breathing room.
> > > 
> > > Yup.
> > > 
> > > > > +static __init int sgx_init_epc_section(u64 addr, u64 size, unsigned long index,
> > > > > +				       struct sgx_epc_section *section)
> > > > > +{
> > > > > +	unsigned long nr_pages = size >> PAGE_SHIFT;
> > > > > +	struct sgx_epc_page *page;
> > > > > +	unsigned long i;
> > > > > +
> > > > > +	section->va = memremap(addr, size, MEMREMAP_WB);
> > > > > +	if (!section->va)
> > > > > +		return -ENOMEM;
> > > > > +
> > > > > +	section->pa = addr;
> > > > > +	spin_lock_init(&section->lock);
> > > > > +	INIT_LIST_HEAD(&section->page_list);
> > > > > +
> > > > > +	for (i = 0; i < nr_pages; i++) {
> > > > > +		page = kzalloc(sizeof(*page), GFP_KERNEL);
> > > > > +		if (!page)
> > > > > +			goto out;
> > > > > +		page->desc = (addr + (i << PAGE_SHIFT)) | index;
> > > > > +		sgx_section_put_page(section, page);
> > > > > +	}
> > > > 
> > > > Not sure if this is the correct location, but at some point the kernel
> > > > needs to sanitize the EPC during init.  EPC pages may be in an unknown
> > > > state, e.g. after kexec(), which will cause all manner of faults and
> > > > warnings.  Maybe the best approach is to sanitize on-demand, e.g. suppress
> > > > the first WARN due to unexpected ENCLS failure and purge the EPC at that
> > > > time.  The downside of that approach is that exposing EPC to a guest would
> > > > need to implement its own sanitization flow.
> > > 
> > > Hmm... Let's think this through. I'm just thinking how sanitization on
> > > demand would actually work given the parent-child relationships.
> > 
> > It's ugly.
> > 
> >   1. Temporarily disable EPC allocation and enclave fault handling
> >   2. Zap all TCS PTEs in all enclaves
> >   3. Flush all logical CPUs from enclaves via IPI
> >   4. Forcefully reclaim all EPC pages from enclaves
> >   5. EREMOVE all "free" EPC pages, track pages that fail with SGX_CHILD_PRESENT
> >   6. EREMOVE all EPC pages that failed with SGX_CHILD_PRESENT
> >   7. Disable SGX if any EREMOVE failed in step 6
> >   8. Re-enable EPC allocation and enclave fault handling
> > 
> > Exposing EPC to a VM would still require sanitization.
> > 
> > Sanitizing during boot is a lot cleaner, the primary concern is that it
> > will significantly increase boot time on systems with large EPCs.  If we
> > can somehow limit this to kexec() and that's the only scenario where the
> > EPC needs to be sanitized, then that would mitigate the boot time concern.
> > 
> > We might also be able to get away with unconditionally sanitizing the EPC
> > post-boot, e.g. via worker threads, returning -EBUSY for everything until
> > the EPC is good to go.
> 
> I like the worker threads approach better. It is something that is
> maintainable. I don't see any better solution given the hierarchical
> nature of enclaves. It is also fairly easy to implement without making
> major changes to the other parts of the implementation.
> 
> I.e. every time the driver initializes:
> 
> 1. Move all EPC first to a bad pool.
> 2. Let worker threads move EPC to the real allocation pool.
> 
> Then the OS can immediately start to use EPC.
> 
> Is this about along the lines what you had in mind?

We could even simplify this by using the already existing reclaimer
thread for the purpose.

/Jarkko

Patch

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index c1f9b3cf437c..dc630208003f 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1921,6 +1921,25 @@  config X86_INTEL_MEMORY_PROTECTION_KEYS
 
 	  If unsure, say y.
 
+config INTEL_SGX
+	bool "Intel SGX core functionality"
+	depends on X86_64 && CPU_SUP_INTEL
+	help
+
+	Intel(R) SGX is a set of CPU instructions that can be used by
+	applications to set aside private regions of code and data.  The code
+	outside the enclave is disallowed to access the memory inside the
+	enclave by the CPU access control.
+
+	The firmware uses PRMRR registers to reserve an area of physical memory
+	called Enclave Page Cache (EPC). There is a hardware unit in the
+	processor called Memory Encryption Engine. The MEE encrypts and decrypts
+	the EPC pages as they enter and leave the processor package.
+
+	For details, see Documentation/x86/intel_sgx.rst
+
+	If unsure, say N.
+
 config EFI
 	bool "EFI runtime service support"
 	depends on ACPI
diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index cfd24f9f7614..d1163c0fd5d6 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -40,6 +40,7 @@  obj-$(CONFIG_X86_MCE)			+= mce/
 obj-$(CONFIG_MTRR)			+= mtrr/
 obj-$(CONFIG_MICROCODE)			+= microcode/
 obj-$(CONFIG_X86_CPU_RESCTRL)		+= resctrl/
+obj-$(CONFIG_INTEL_SGX)			+= sgx/
 
 obj-$(CONFIG_X86_LOCAL_APIC)		+= perfctr-watchdog.o
 
diff --git a/arch/x86/kernel/cpu/sgx/Makefile b/arch/x86/kernel/cpu/sgx/Makefile
new file mode 100644
index 000000000000..b666967fd570
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/Makefile
@@ -0,0 +1 @@ 
+obj-y += main.o
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
new file mode 100644
index 000000000000..18ce4acdd7ef
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -0,0 +1,149 @@ 
+// SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause)
+// Copyright(c) 2016-17 Intel Corporation.
+
+#include <linux/freezer.h>
+#include <linux/highmem.h>
+#include <linux/kthread.h>
+#include <linux/pagemap.h>
+#include <linux/ratelimit.h>
+#include <linux/sched/signal.h>
+#include <linux/slab.h>
+#include "arch.h"
+#include "sgx.h"
+
+struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];
+EXPORT_SYMBOL_GPL(sgx_epc_sections);
+
+static int sgx_nr_epc_sections;
+
+static void sgx_section_put_page(struct sgx_epc_section *section,
+				 struct sgx_epc_page *page)
+{
+	list_add_tail(&page->list, &section->page_list);
+	section->free_cnt++;
+}
+
+static __init void sgx_free_epc_section(struct sgx_epc_section *section)
+{
+	struct sgx_epc_page *page;
+
+	while (!list_empty(&section->page_list)) {
+		page = list_first_entry(&section->page_list,
+					struct sgx_epc_page, list);
+		list_del(&page->list);
+		kfree(page);
+	}
+	memunmap(section->va);
+}
+
+static __init int sgx_init_epc_section(u64 addr, u64 size, unsigned long index,
+				       struct sgx_epc_section *section)
+{
+	unsigned long nr_pages = size >> PAGE_SHIFT;
+	struct sgx_epc_page *page;
+	unsigned long i;
+
+	section->va = memremap(addr, size, MEMREMAP_WB);
+	if (!section->va)
+		return -ENOMEM;
+
+	section->pa = addr;
+	spin_lock_init(&section->lock);
+	INIT_LIST_HEAD(&section->page_list);
+
+	for (i = 0; i < nr_pages; i++) {
+		page = kzalloc(sizeof(*page), GFP_KERNEL);
+		if (!page)
+			goto out;
+		page->desc = (addr + (i << PAGE_SHIFT)) | index;
+		sgx_section_put_page(section, page);
+	}
+
+	return 0;
+out:
+	sgx_free_epc_section(section);
+	return -ENOMEM;
+}
+
+static __init void sgx_page_cache_teardown(void)
+{
+	int i;
+
+	for (i = 0; i < sgx_nr_epc_sections; i++)
+		sgx_free_epc_section(&sgx_epc_sections[i]);
+}
+
+/**
+ * A section metric is concatenated in a way that @low bits 12-31 define the
+ * bits 12-31 of the metric and @high bits 0-19 define the bits 32-51 of the
+ * metric.
+ */
+static inline u64 sgx_calc_section_metric(u64 low, u64 high)
+{
+	return (low & GENMASK_ULL(31, 12)) +
+	       ((high & GENMASK_ULL(19, 0)) << 32);
+}
+
+static __init int sgx_page_cache_init(void)
+{
+	u32 eax, ebx, ecx, edx, type;
+	u64 pa, size;
+	int ret;
+	int i;
+
+	BUILD_BUG_ON(SGX_MAX_EPC_SECTIONS > (SGX_EPC_SECTION_MASK + 1));
+
+	for (i = 0; i < (SGX_MAX_EPC_SECTIONS + 1); i++) {
+		cpuid_count(SGX_CPUID, i + SGX_CPUID_FIRST_VARIABLE_SUB_LEAF,
+			    &eax, &ebx, &ecx, &edx);
+
+		type = eax & SGX_CPUID_SUB_LEAF_TYPE_MASK;
+		if (type == SGX_CPUID_SUB_LEAF_INVALID)
+			break;
+		if (type != SGX_CPUID_SUB_LEAF_EPC_SECTION) {
+			pr_err_once("sgx: Unknown sub-leaf type: %u\n", type);
+			return -ENODEV;
+		}
+		if (i == SGX_MAX_EPC_SECTIONS) {
+			pr_warn("sgx: More than "
+				__stringify(SGX_MAX_EPC_SECTIONS)
+				" EPC sections\n");
+			break;
+		}
+
+		pa = sgx_calc_section_metric(eax, ebx);
+		size = sgx_calc_section_metric(ecx, edx);
+		pr_info("sgx: EPC section 0x%llx-0x%llx\n", pa, pa + size - 1);
+
+		ret = sgx_init_epc_section(pa, size, i, &sgx_epc_sections[i]);
+		if (ret) {
+			sgx_page_cache_teardown();
+			return ret;
+		}
+
+		sgx_nr_epc_sections++;
+	}
+
+	if (!sgx_nr_epc_sections) {
+		pr_err("sgx: There are zero EPC sections.\n");
+		return -ENODEV;
+	}
+
+	return 0;
+}
+
+static __init int sgx_init(void)
+{
+	int ret;
+
+	if (!boot_cpu_has(X86_FEATURE_SGX))
+		return 0;
+
+	ret = sgx_page_cache_init();
+	if (ret)
+		return ret;
+
+	return 0;
+}
+
+arch_initcall(sgx_init);
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
new file mode 100644
index 000000000000..228e3dae360d
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -0,0 +1,62 @@ 
+/* SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause) */
+#ifndef _X86_SGX_H
+#define _X86_SGX_H
+
+#include <linux/bitops.h>
+#include <linux/err.h>
+#include <linux/io.h>
+#include <linux/rwsem.h>
+#include <linux/types.h>
+#include <asm/asm.h>
+#include <uapi/asm/sgx_errno.h>
+
+struct sgx_epc_page {
+	unsigned long desc;
+	struct list_head list;
+};
+
+/**
+ * struct sgx_epc_section
+ *
+ * The firmware can define multiple chunks of EPC in different areas of
+ * physical memory, e.g. for the memory areas of each node. This structure is
+ * used to store the EPC pages of one EPC section and the virtual memory area
+ * where the pages have been mapped.
+ */
+struct sgx_epc_section {
+	unsigned long pa;
+	void *va;
+	struct list_head page_list;
+	unsigned long free_cnt;
+	spinlock_t lock;
+};
+
+#define SGX_MAX_EPC_SECTIONS	8
+
+extern struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];
+
+/**
+ * enum sgx_epc_page_desc - bits and masks for an EPC page's descriptor
+ * %SGX_EPC_SECTION_MASK:	SGX allows multiple EPC sections in physical
+ *				memory. The existing and near-future hardware
+ *				defines at most eight sections, so three bits
+ *				would suffice; the mask reserves four.
+ */
+enum sgx_epc_page_desc {
+	SGX_EPC_SECTION_MASK			= GENMASK_ULL(3, 0),
+	/* bits 12-63 are reserved for the physical page address of the page */
+};
+
+static inline struct sgx_epc_section *sgx_epc_section(struct sgx_epc_page *page)
+{
+	return &sgx_epc_sections[page->desc & SGX_EPC_SECTION_MASK];
+}
+
+static inline void *sgx_epc_addr(struct sgx_epc_page *page)
+{
+	struct sgx_epc_section *section = sgx_epc_section(page);
+
+	return section->va + (page->desc & PAGE_MASK) - section->pa;
+}
+
+#endif /* _X86_SGX_H */
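
As a worked example of the bit layout used above, the following userspace
sketch reproduces the section metric concatenation and the packing of a
section index into the low bits of a page descriptor. It is an illustration
under assumptions: the model_* names are hypothetical, a 4 KiB page size is
assumed, and fixed-width types replace unsigned long so the arithmetic does
not depend on the host.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT   12
#define MODEL_PAGE_MASK    (~((UINT64_C(1) << MODEL_PAGE_SHIFT) - 1))
#define MODEL_SECTION_MASK UINT64_C(0xf) /* GENMASK_ULL(3, 0) in the patch */

/* Bits 12-31 come from @low, bits 32-51 from bits 0-19 of @high. */
static uint64_t model_section_metric(uint32_t low, uint32_t high)
{
	return ((uint64_t)low & UINT64_C(0xfffff000)) +
	       (((uint64_t)high & UINT64_C(0xfffff)) << 32);
}

/* Packs a page-aligned physical address and a section index in one word. */
static uint64_t model_desc(uint64_t section_pa, uint64_t page_nr,
			   unsigned int index)
{
	assert((section_pa & ~MODEL_PAGE_MASK) == 0);
	assert(index <= MODEL_SECTION_MASK);
	return (section_pa + (page_nr << MODEL_PAGE_SHIFT)) | index;
}

/* Recovers the section index from the low bits of the descriptor. */
static unsigned int model_desc_section(uint64_t desc)
{
	return (unsigned int)(desc & MODEL_SECTION_MASK);
}

/* Offset of the page in its section; the patch adds this to section->va. */
static uint64_t model_desc_offset(uint64_t desc, uint64_t section_pa)
{
	return (desc & MODEL_PAGE_MASK) - section_pa;
}
```

For instance, low = 0x70200001 and high = 0x1 give a metric of 0x170200000:
the low 12 bits of low (which carry the sub-leaf type) are masked off, and
high contributes bit 32. Page 3 of section 2 at that base is then described
by 0x170203002, from which both the section index (2) and the offset
(0x3000) can be recovered.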