
[v39,23/24] docs: x86/sgx: Document SGX micro architecture and kernel internals

Message ID 20201003045059.665934-24-jarkko.sakkinen@linux.intel.com
State New, archived
Series Intel SGX foundations

Commit Message

Jarkko Sakkinen Oct. 3, 2020, 4:50 a.m. UTC
Document the Intel SGX kernel architecture. The fine-grained micro
architecture details can be looked up from Intel SDM Volume 3D.

Cc: linux-doc@vger.kernel.org
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
 Documentation/x86/index.rst |   1 +
 Documentation/x86/sgx.rst   | 284 ++++++++++++++++++++++++++++++++++++
 2 files changed, 285 insertions(+)
 create mode 100644 Documentation/x86/sgx.rst


diff --git a/Documentation/x86/index.rst b/Documentation/x86/index.rst
index 740ee7f87898..b9db893c8aee 100644
--- a/Documentation/x86/index.rst
+++ b/Documentation/x86/index.rst
@@ -32,3 +32,4 @@  x86-specific Documentation
+   sgx
diff --git a/Documentation/x86/sgx.rst b/Documentation/x86/sgx.rst
new file mode 100644
index 000000000000..7b742c331247
--- /dev/null
+++ b/Documentation/x86/sgx.rst
@@ -0,0 +1,284 @@ 
+.. SPDX-License-Identifier: GPL-2.0
+Software Guard eXtensions (SGX)
+*Software Guard eXtensions (SGX)* is a set of instructions that enable ring-3
+applications to set aside private regions of code and data. These regions are
+called enclaves. An enclave can be entered at a fixed set of entry points. Only
+a CPU running inside the enclave can access its code and data.
+SGX support can be checked with:
+	``grep sgx /proc/cpuinfo``
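
The same check can be sketched programmatically. The helper below is an illustration only; the name `cpu_flags_contain` is hypothetical and not part of the patch. It looks for a flag as a whole token in a `flags` line read from /proc/cpuinfo, which is stricter than the plain `grep` above (that would also match related flags such as `sgx_lc`).

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: return true if `flag` appears as a whole
 * whitespace-separated token in `flags` (one line of /proc/cpuinfo).
 */
bool cpu_flags_contain(const char *flags, const char *flag)
{
	size_t len = strlen(flag);
	const char *p = flags;

	if (len == 0)
		return false;

	while ((p = strstr(p, flag)) != NULL) {
		/* The match must be delimited on both sides to count. */
		bool starts = (p == flags) || p[-1] == ' ' || p[-1] == '\t';
		bool ends = p[len] == '\0' || p[len] == ' ' || p[len] == '\n';

		if (starts && ends)
			return true;
		p += len;
	}
	return false;
}
```

A caller would read the `flags` line from /proc/cpuinfo and pass it in together with `"sgx"`.
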
+Enclave Page Cache
+SGX utilizes an *Enclave Page Cache (EPC)* to store pages that are associated
+with an enclave. It is contained in a BIOS-reserved region of physical memory.
+Unlike pages used for regular memory, EPC pages can only be accessed from
+outside the enclave, for different purposes, with the instructions **ENCLS**,
+**ENCLV** and **ENCLU**.
+Direct memory accesses to an enclave can only be done by a CPU executing inside
+the enclave. An enclave can be entered with **ENCLU[EENTER]** at a fixed set of
+entry points. However, a CPU executing inside the enclave can access memory
+outside the enclave.
+Page Types
+**SGX Enclave Control Structure (SECS)**
+   Enclave's address range, attributes and other global data are defined
+   by this structure.
+**Regular (REG)**
+   Regular EPC pages contain the code and data of an enclave.
+**Thread Control Structure (TCS)**
+   Thread Control Structure pages define the entry points to an enclave and
+   track the execution state of an enclave thread.
+**Version Array (VA)**
+   Version Array pages contain 512 slots, each of which can contain a version
+   number for a page evicted from the EPC.
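
The slot count follows from the page geometry: a 4 KiB EPC page divided into 8-byte slots gives 512 slots. A minimal sketch of the arithmetic, with macro and function names chosen here for illustration only:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative constants: a 4 KiB EPC page holding 8-byte version slots. */
#define EPC_PAGE_SIZE	4096u
#define VA_SLOT_SIZE	8u
#define VA_SLOT_COUNT	(EPC_PAGE_SIZE / VA_SLOT_SIZE)

/* Byte offset of a given slot inside a Version Array page. */
size_t va_slot_offset(unsigned int slot)
{
	assert(slot < VA_SLOT_COUNT);
	return (size_t)slot * VA_SLOT_SIZE;
}
```
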
+Enclave Page Cache Map
+The processor tracks EPC pages via the *Enclave Page Cache Map (EPCM)*.  EPCM
+contains an entry for each EPC page, which describes the owning enclave, access
+rights and page type, among other things.
+The permissions from EPCM are consulted if and only if walking the kernel page
+tables succeeds. The effective permissions are thus the conjunction of the page
+table and EPCM permissions.
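
The conjunction can be modelled in a few lines. This is a simplified sketch, not how the hardware or the kernel represents permissions; the `PERM_*` bits and `effective_perms()` are hypothetical names.

```c
#include <stdbool.h>

/* Hypothetical permission bits used only for this model. */
#define PERM_R	0x1u
#define PERM_W	0x2u
#define PERM_X	0x4u

/*
 * EPCM permissions are consulted only if the page table walk succeeds;
 * the result is the bitwise AND of both permission sets.
 */
unsigned int effective_perms(bool pt_walk_ok, unsigned int pt_perms,
			     unsigned int epcm_perms)
{
	if (!pt_walk_ok)
		return 0;

	return pt_perms & epcm_perms;
}
```

For example, a page mapped read-write in the page tables but marked read-execute in the EPCM ends up readable only.
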
+For all intents and purposes, the SGX architecture allows the processor to
+invalidate all EPCM entries at will, i.e. requires that software be prepared to
+handle an EPCM fault at any time. The contents of EPC are encrypted with an
+ephemeral key, which is lost on power transitions.
+EPC management
+EPC pages do not have ``struct page`` instances. They are IO memory from the
+kernel's perspective. The consequence is that they are always mapped as shared
+memory.
+The kernel defines ``/dev/sgx/enclave``, which can be mapped as ``MAP_SHARED`` to
+define the address range for an enclave.
+EPC Over-subscription
+When the number of free EPC pages drops below a low watermark, the swapping
+thread starts reclaiming pages. The pages that do not have the **A** bit set are
+selected as victim pages.
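
The selection policy described above can be sketched as follows. This is a simplified model under stated assumptions: `struct epc_page_model`, `select_victims()` and the flat array scan are hypothetical and do not reflect the data structures used by the actual reclaimer.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of an EPC page: only the accessed (A) bit matters here. */
struct epc_page_model {
	bool accessed;
};

/*
 * If the free page count is below the low watermark, collect the indices
 * of pages whose A bit is clear, up to max_victims. Returns the number of
 * victims written to `victims`.
 */
size_t select_victims(const struct epc_page_model *pages, size_t nr_pages,
		      size_t nr_free, size_t low_watermark,
		      size_t *victims, size_t max_victims)
{
	size_t n = 0;

	if (nr_free >= low_watermark)
		return 0;	/* enough free pages, nothing to reclaim */

	for (size_t i = 0; i < nr_pages && n < max_victims; i++) {
		if (!pages[i].accessed)
			victims[n++] = i;
	}

	return n;
}
```
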
+Launch Control
+SGX provides a launch control mechanism. After all enclave pages have been
+copied, the kernel executes **ENCLS[EINIT]**, which initializes the enclave.
+Only after this can the CPU execute inside the enclave.
+This leaf function takes an RSA-3072 signature of the enclave measurement and an
+optional cryptographic token. Linux does not take advantage of launch tokens.
+The leaf function checks that the measurement is correct and that the signature
+was made with the key whose SHA256 hash is stored in the four
+**IA32_SGXLEPUBKEYHASH{0, 1, 2, 3}** MSRs.
+Those MSRs can be configured by the BIOS to be either read-only or writable.
+Linux supports only the writable configuration in order to give full control to
+the kernel on launch control policy. The read-only configuration requires the
+use of the previously mentioned launch tokens.
+The launch is performed by setting the MSRs to the hash of the enclave signer's
+public key. The alternative would be to have *a launch enclave* that would be
+signed with the key set into MSRs, which would then generate launch tokens for
+other enclaves. This would only make sense with read-only MSRs, and thus the
+option has been discarded.
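
Since a SHA256 digest is 256 bits, it fits exactly into the four 64-bit MSRs. The sketch below shows one way to split a digest into four words. The function name is hypothetical, and the little-endian packing is an assumption that matches viewing the digest bytes as an array of four 64-bit integers on x86; writing the words to the MSRs is privileged and not shown.

```c
#include <stddef.h>
#include <stdint.h>

#define LEPUBKEYHASH_WORDS	4

/*
 * Split a 32-byte digest into four 64-bit words, least significant
 * byte first (assumed layout, see the note above).
 */
void hash_to_msr_words(const uint8_t digest[32],
		       uint64_t words[LEPUBKEYHASH_WORDS])
{
	for (size_t i = 0; i < LEPUBKEYHASH_WORDS; i++) {
		uint64_t w = 0;

		for (size_t b = 0; b < 8; b++)
			w |= (uint64_t)digest[i * 8 + b] << (8 * b);

		words[i] = w;
	}
}
```
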
+Local Attestation
+In local attestation, an enclave creates a **REPORT** data structure with
+**ENCLU[EREPORT]**, which describes the origin of an enclave. In particular, it
+contains an AES-CMAC of the enclave contents computed with a report key unique
+to each processor. All enclaves have access to this key.
+This mechanism can also be used as a communication channel, as the
+**REPORT** data structure includes a 64-byte field for variable information.
+Remote Attestation
+The Provisioning Certification Enclave (PCE), the root of trust for other
+enclaves, generates a signing key from a fused key called the Provisioning
+Certification Key. The PCE can then use this key to certify an attestation key
+of a Quoting Enclave (QE), i.e. we get the chain of trust down to the hardware
+if the Intel-signed PCE is used.
+To use the needed keys, ATTRIBUTE.PROVISIONKEY is required but should only be
+allowed for those who actually need it, so that only trusted parties can
+certify QEs.
+A device file called /dev/sgx/provision exists to provide file descriptors that
+act as privilege tokens for building provisioning enclaves. These can be
+associated with enclaves with the ioctl SGX_IOC_ENCLAVE_SET_ATTRIBUTE.
+Encryption engines
+In order to conceal the enclave data while it is out of the CPU package, the
+memory controller has to be extended with an encryption engine. The MC can then
+route requests coming from CPU cores running in enclave mode to the
+encryption engine.
+In CPUs prior to Icelake, the Memory Encryption Engine (MEE) is used to
+encrypt pages leaving the CPU caches. MEE uses an n-ary Merkle tree with its
+root in SRAM to maintain integrity of the encrypted data. This provides integrity and
+anti-replay protection but does not scale to large memory sizes because the time
+required to update the Merkle tree grows logarithmically in relation to the
+memory size.
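
The logarithmic growth can be illustrated by counting the levels between the leaves and the root of an n-ary tree. This is a generic sketch; the function name and the arity used in the example are assumptions, not MEE parameters taken from the text.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of tree levels above the leaves for an n-ary Merkle tree.
 * Each loop iteration collapses one level, so the result grows with
 * log_arity(nr_leaves), rounded up.
 */
unsigned int merkle_levels(uint64_t nr_leaves, unsigned int arity)
{
	unsigned int levels = 0;

	assert(arity >= 2);

	while (nr_leaves > 1) {
		nr_leaves = (nr_leaves + arity - 1) / arity;
		levels++;
	}

	return levels;
}
```

With an assumed arity of 8, going from 512 to 4096 protected blocks adds one level to every update path.
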
+CPUs starting from Icelake use Total Memory Encryption (TME) in place of
+MEE. SGX using TME does not have an integrity Merkle tree, which means losing HW
+protection against integrity and replay attacks, but includes additional changes
+to prevent ciphertext from being returned and SW memory aliases from being
+created. DMA to the EPC memory remains blocked by the PRMRR even on systems that
+use TME (SDM section 41.10).
+Backing storage
+Backing storage is shared and not accounted. It is implemented as a private
+shmem file. Providing backing storage in some form from user space is not
+possible: accounting would go into an invalid state, as reclaimed pages would
+get accounted to the processes on whose behalf the kernel happened to be acting.
+Enclave Life Cycle
+Enclaves must be built before they can be executed (entered). The first step in
+building an enclave is opening the `/dev/sgx/enclave` device. Then, the enclave
+is built with ioctls documented in `arch/x86/include/uapi/asm/sgx.h`.
+Since enclave memory is protected from direct access, special privileged
+instructions (name them here) are used to copy data into enclave pages and
+establish enclave page permissions within ioctl(SGX_whatever) calls.
+`mmap()` permissions are capped by the enclave permissions. A direct
+consequence of this is that all the pages for an address range must be added
+before `mmap()` can be applied. Effectively an enclave page with minimum
+permissions in the address range sets the permission cap for the mapping.
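
The capping rule can be sketched as a check over the pages in the requested range. This is a simplified model with hypothetical names (`ENCL_PROT_*`, `encl_may_map()`); a page that has not been added yet is modelled as having no permissions, so any mapping request covering it is refused.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical permission bits used only for this model. */
#define ENCL_PROT_R	0x1u
#define ENCL_PROT_W	0x2u
#define ENCL_PROT_X	0x4u

/*
 * A mapping request is allowed only if every page in the range permits
 * all requested bits, i.e. the page with the fewest permissions sets
 * the cap for the whole range.
 */
bool encl_may_map(const unsigned int *page_max_prot, size_t nr_pages,
		  unsigned int requested)
{
	for (size_t i = 0; i < nr_pages; i++) {
		if (requested & ~page_max_prot[i])
			return false;
	}

	return true;
}
```
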
+The basic concept and implementation are very similar to the kernel's exception
+fixup mechanism.  The key differences are that the kernel handler is hardcoded
+and the fixup entry addresses are relative to the overall table as opposed to
+individual entries.
+Hardcoding the kernel handler avoids the need to figure out how to get userspace
+code to point at a kernel function.  Given that the expected usage is to
+propagate information to userspace, dumping all fault information into registers
+is likely the desired behavior for the vast majority of yet-to-be-created
+functions.  Use registers DI, SI and DX to communicate fault information, which
+follows Linux's ABI for register consumption and hopefully avoids conflict with
+hardware features that might leverage the fixup capabilities, e.g.  register
+usage for SGX instructions was at least partially designed with calling
+conventions in mind.
+Making fixup addresses relative to the overall table allows the table to be
+stripped from the final vDSO image (it's a kernel construct) without
+complicating the offset logic, e.g. entry-relative addressing would also need to
+account for the table's location relative to the image.
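
Table-relative addressing can be sketched as below. The struct layout and function name are hypothetical, chosen only to illustrate the idea: each entry stores signed offsets from the table's original base, so the table can live anywhere (for example, as a standalone byte array) as long as that base is known.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical entry: both fields are offsets from the table's base. */
struct vdso_fixup_entry_model {
	int32_t insn;	/* faulting instruction, relative to table base */
	int32_t fixup;	/* fixup target, relative to table base */
};

/*
 * Look up fault_ip in the table. table_base is the address the table
 * had in the unstripped image; `table` may point to a copy elsewhere.
 */
bool vdso_fixup_lookup(uintptr_t table_base,
		       const struct vdso_fixup_entry_model *table, size_t nr,
		       uintptr_t fault_ip, uintptr_t *fixup_ip)
{
	for (size_t i = 0; i < nr; i++) {
		uintptr_t insn = table_base + (uintptr_t)(intptr_t)table[i].insn;

		if (insn == fault_ip) {
			*fixup_ip = table_base +
				    (uintptr_t)(intptr_t)table[i].fixup;
			return true;
		}
	}

	return false;
}
```

With entry-relative offsets, the lookup would instead need the address of each entry in the image, which no longer exists once the table has been stripped.
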
+Regarding stripping the table, modify vdso2c to extract the table from the raw,
+a.k.a. unstripped, data and dump it as a standalone byte array in the resulting
+.c file.  The original base of the table, its length and a pointer to the byte
+array are captured in struct vdso_image.  Alternatively, the table could be
+dumped directly into the struct, but because the number of entries can vary per
+image, that would require either hardcoding a max sized table into the struct
+definition or defining the table as a flexible length array.  The flexible
+length array approach has zero benefits, e.g. the base/size are still needed,
+and prevents reusing the extraction code, while hardcoding the max size adds
+ongoing maintenance just to avoid exporting the explicit size.
+The immediate use case is for Intel Software Guard Extensions (SGX).  SGX
+introduces a new CPL3-only "enclave" mode that runs as a sort of black box
+shared object that is hosted by an untrusted "normal" CPL3 process.
+Entering an enclave can only be done through SGX-specific instructions, EENTER
+and ERESUME, and is a non-trivial process.  Because of the complexity of
+transitioning to/from an enclave, the vast majority of enclaves are expected to
+utilize a library to handle the actual transitions.  This is roughly analogous
+to how e.g. libc implementations are used by most applications.
+Another crucial characteristic of SGX enclaves is that they can generate
+exceptions as part of their normal (at least as "normal" as SGX can be)
+operation that need to be handled *in* the enclave and/or are unique to SGX.
+And because they are essentially fancy shared objects, a process can host any
+number of enclaves, each of which can execute multiple threads simultaneously.
+Putting everything together, userspace enclaves will utilize a library that must
+be prepared to handle any and (almost) all exceptions any time at least one
+thread may be executing in an enclave.  Leveraging signals to handle the enclave
+exceptions is unpleasant, to put it mildly, e.g.  the SGX library must
+constantly (un)register its signal handler based on whether or not at least one
+thread is executing in an enclave, and filter and forward exceptions that aren't
+related to its enclaves.  This becomes particularly nasty when using multiple
+levels of libraries that register signal handlers, e.g. running an enclave via
+cgo inside of the Go runtime.
+Enabling exception fixup in vDSO allows the kernel to provide a vDSO function
+that wraps the low-level transitions to/from the enclave, i.e.  the EENTER and
+ERESUME instructions.  The vDSO function can intercept exceptions that would
+otherwise generate a signal and return the fault information directly to its
+caller, thus avoiding the need to juggle signal handlers.
+Note that unlike the kernel's _ASM_EXTABLE_HANDLE implementation, the 'C'
+version of _ASM_VDSO_EXTABLE_HANDLE doesn't use a pre-compiled assembly macro.
+Duplicating four lines of code is simpler than adding the necessary
+infrastructure to generate pre-compiled assembly and the intended benefit of
+massaging GCC's inlining algorithm is unlikely to be realized in the vDSO any time
+soon, if ever.
+Usage Models
+Shared Library
+Sensitive data and the code that acts on it are partitioned from the application
+into a separate library. The library is then linked as a DSO which can be loaded
+into an enclave. The application can then make individual function calls into
+the enclave through special SGX instructions. A run-time within the enclave is
+configured to marshal function parameters into and out of the enclave and to
+call the correct library function.
+Application Container
+An application may be loaded into a container enclave which is specially
+configured with a library OS and run-time which permits the application to run.
+The enclave run-time and library OS work together to execute the application
+when a thread enters the enclave.
+"Supporting Third Party Attestation for Intel® SGX with Intel® Data Center
+Attestation Primitives"
+   https://software.intel.com/sites/default/files/managed/f1/b8/intel-sgx-support-for-third-party-attestation.pdf