Message ID | cover.1644274683.git.reinette.chatre@intel.com (mailing list archive) |
---|---|
Headers | show |
Series | x86/sgx and selftests/sgx: Support SGX2 | expand |
1. This interface looks very odd to me. mmap() is the kernel interface for changing user space memory maps. Why are we introducing a new interface for this? You can just simply add a new mmap flag (i.e. MAP_SGX_TCS*) and then figure out which SGX instructions to execute based on the desired state of the memory maps. If you do this, none of the following ioctls are needed: * SGX_IOC_ENCLAVE_RELAX_PERMISSIONS * SGX_IOC_ENCLAVE_RESTRICT_PERMISSIONS * SGX_IOC_ENCLAVE_REMOVE_PAGES * SGX_IOC_ENCLAVE_MODIFY_TYPE It also means that languages don't have to grow support for all these ioctls. Instead, they can just reuse the existing mmap() bindings with the new flag. Also, multiple operations can be combined into a single mmap() call, amortizing the changes over a single context switch. 2. Automatically adding pages with hard-coded permissions in a fault handler seems like a really bad idea. How do you distinguish between accesses which should result in an updated mapping and accesses that should result in a fault? IMHO, all unmapped page accesses should result in a page fault. mmap() should be called first to identify the correct permissions for these pages. Then the page handler should be updated to use the permissions from the mapping when backfilling physical pages. If I understand correctly, this should also obviate the need for the weird userspace callback to allow for execute permissions. 3. Implementing as I've suggested also means that we can lock down an enclave, for example - after code has been JITed, by closing the file descriptor. Once the file descriptor used to create the enclave is closed, no further mmap() can be performed on the enclave. Attempting to do EACCEPT on an unmapped page will generate a page fault. * - I'm aware that a new flag might be frowned upon. I see a few other options: 1. reuse an existing flag which doesn't make sense in this context 2. communicate the page type in the offset argument 3. keep SGX_IOC_ENCLAVE_MODIFY_TYPE On Mon, Feb 7, 2022 at 8:07 PM Reinette Chatre <reinette.chatre@intel.com> wrote: > > V1: https://lore.kernel.org/linux-sgx/cover.1638381245.git.reinette.chatre@intel.com/ > > Changes since V1 that directly impact user space: > - SGX2 permission changes changed from a single ioctl() named > SGX_IOC_PAGE_MODP to two new ioctl()s: > SGX_IOC_ENCLAVE_RELAX_PERMISSIONS and > SGX_IOC_ENCLAVE_RESTRICT_PERMISSIONS, supported by two different > parameter structures (SGX_IOC_ENCLAVE_RELAX_PERMISSIONS does > not support a result output parameter) (Jarkko). > > User space flow impact: After user space runs ENCLU[EMODPE] it > needs to call SGX_IOC_ENCLAVE_RELAX_PERMISSIONS to have PTEs > updated. Previously running SGX_IOC_PAGE_MODP in this scenario > resulted in EPCM.PR being set but calling > SGX_IOC_ENCLAVE_RELAX_PERMISSIONS will not result in EPCM.PR > being set anymore and thus no need for an additional > ENCLU[EACCEPT]. > > - SGX_IOC_ENCLAVE_RELAX_PERMISSIONS and > SGX_IOC_ENCLAVE_RESTRICT_PERMISSIONS > obtain new permissions from secinfo as parameter instead of > the permissions directly (Jarkko). > > - ioctl() supporting SGX2 page type change is renamed from > SGX_IOC_PAGE_MODT to SGX_IOC_ENCLAVE_MODIFY_TYPE (Jarkko). > > - SGX_IOC_ENCLAVE_MODIFY_TYPE obtains new page type from secinfo > as parameter instead of the page type directly (Jarkko). > > - ioctl() supporting SGX2 page removal is renamed from > SGX_IOC_PAGE_REMOVE to SGX_IOC_ENCLAVE_REMOVE_PAGES (Jarkko). > > - All ioctl() parameter structures have been renamed as a result of the > ioctl() renaming: > SGX_IOC_ENCLAVE_RELAX_PERMISSIONS => struct sgx_enclave_relax_perm > SGX_IOC_ENCLAVE_RESTRICT_PERMISSIONS => struct sgx_enclave_restrict_perm > SGX_IOC_ENCLAVE_MODIFY_TYPE => struct sgx_enclave_modt > SGX_IOC_ENCLAVE_REMOVE_PAGES => struct sgx_enclave_remove_pages > > Changes since V1 that do not directly impact user space: > - Number of patches in series increased from 25 to 32 primarily because > of splitting the original submission: > - Wrappers for the new SGX2 functions are introduced in three separate > patches replacing the original "x86/sgx: Add wrappers for SGX2 > functions" > (Jarkko). > - Moving and renaming sgx_encl_ewb_cpumask() is done with two patches > replacing the original "x86/sgx: Use more generic name for enclave > cpumask function" (Jarkko). > - Support for SGX2 EPCM permission changes is split into two ioctls(), > one for relaxing and one for restricting permissions, each introduced > by a new patch replacing the original "x86/sgx: Support enclave page > permission changes" (Jarkko). > - Extracted code used by existing ioctls() for usage by new ioctl()s > into a new utility in new patch "x86/sgx: Create utility to validate > user provided offset and length" (Dave did not specifically ask for > this but it addresses his review feedback). > - Two new Documentation patches to support the SGX2 work > ("Documentation/x86: Introduce enclave runtime management") and > a dedicated section on the enclave permission management > ("Documentation/x86: Document SGX permission details") (Andy). > - Most patches were reworked to improve the language by: > * aiming to refer to exact item instead of English rephrasing (Jarkko). > * use ioctl() instead of ioctl throughout (Dave). > * Use "relaxed" instead of "exceed" when referring to permissions > (Dave). > - Improved documentation with several additions to > Documentation/x86/sgx.rst. > - Many smaller changes, please refer to individual patches. > > Hi Everybody, > > The current Linux kernel support for SGX includes support for SGX1 that > requires that an enclave be created with properties that accommodate all > usages over its (the enclave's) lifetime. This includes properties such > as permissions of enclave pages, the number of enclave pages, and the > number of threads supported by the enclave. > > Consequences of this requirement to have the enclave be created to > accommodate all usages include: > * pages needing to support relocated code are required to have RWX > permissions for their entire lifetime, > * an enclave needs to be created with the maximum stack and heap > projected to be needed during the enclave's entire lifetime which > can be longer than the processes running within it, > * an enclave needs to be created with support for the maximum number > of threads projected to run in the enclave. > > Since SGX1 a few more functions were introduced, collectively called > SGX2, that support modifications to an initialized enclave. Hardware > supporting these functions are already available as listed on > https://github.com/ayeks/SGX-hardware > > This series adds support for SGX2, also referred to as Enclave Dynamic > Memory Management (EDMM). This includes: > > * Support modifying permissions of regular enclave pages belonging to an > initialized enclave. New permissions are not allowed to exceed the > originally vetted permissions. For example, RX isn't allowed unless > the page was originally added with RX or RWX. > Modifying permissions is accomplished with two new ioctl()s: > SGX_IOC_ENCLAVE_RELAX_PERMISSIONS and > SGX_IOC_ENCLAVE_RESTRICT_PERMISSIONS. > > * Support dynamic addition of regular enclave pages to an initialized > enclave. Pages are added with RW permissions as their "originally > vetted permissions" (see previous bullet) and thus not allowed to > be made executable at this time. Enabling dynamically added pages > to obtain executable permissions require integration with user space > policy that is deferred until the core SGX2 enabling is complete. > Pages are dynamically added to an initialized enclave from the SGX > page fault handler. > > * Support expanding an initialized enclave to accommodate more threads. > More threads can be accommodated by an enclave with the addition of > Thread Control Structure (TCS) pages that is done by changing the > type of regular enclave pages to TCS pages using a new ioctl() > SGX_IOC_ENCLAVE_MODIFY_TYPE. > > * Support removing regular and TCS pages from an initialized enclave. > Removing pages is accomplished in two stages as supported by two new > ioctl()s SGX_IOC_ENCLAVE_MODIFY_TYPE (same ioctl() as mentioned in > previous bullet) and SGX_IOC_ENCLAVE_REMOVE_PAGES. > > * Tests covering all the new flows, some edge cases, and one > comprehensive stress scenario. > > No additional work is needed to support SGX2 in a virtualized > environment. The tests included in this series can also be run from > a guest and was tested with the recent QEMU release based on 6.2.0 > that supports SGX. > > Patches 1 to 14 prepares the existing code for SGX2 support by > introducing the SGX2 functions, making sure pages remain accessible > after their enclave permissions are changed, and tracking enclave page > types as well as runtime permissions as needed by SGX2. > > Patches 15 through 32 are a mix of x86/sgx and selftests/sgx patches > that follow the format where first an SGX2 feature is > enabled and then followed by tests of the new feature and/or > tests of scenarios that combine SGX2 features enabled up to that point. > > In two cases (patches 20 and 31) code in support of SGX2 is separated > out with detailed motivation to support the review. > > This series is based on v5.17-rc2 with the following fixes additionally > applied: > > "selftests/sgx: Remove extra newlines in test output" > https://lore.kernel.org/linux-sgx/16317683a1822bbd44ab3ca48b60a9a217ac24de.1643754040.git.reinette.chatre@intel.com/ > "selftests/sgx: Ensure enclave data available during debug print" > https://lore.kernel.org/linux-sgx/eaaeeb9122916d831942fc8a3043c687137314c1.1643754040.git.reinette.chatre@intel.com/ > "selftests/sgx: Do not attempt enclave build without valid enclave" > https://lore.kernel.org/linux-sgx/4e4ea6d70c286c209964bec1e8d29ac8e692748b.1643754040.git.reinette.chatre@intel.com/ > "selftests/sgx: Fix NULL-pointer-dereference upon early test failure" > https://lore.kernel.org/linux-sgx/89824888783fd8e770bfc64530c7549650a41851.1643754040.git.reinette.chatre@intel.com/ > "x86/sgx: Add poison handling to reclaimer" > https://lore.kernel.org/linux-sgx/dcc95eb2aaefb042527ac50d0a50738c7c160dac.1643830353.git.reinette.chatre@intel.com/ > "x86/sgx: Silence softlockup detection when releasing large enclaves" > https://lore.kernel.org/linux-sgx/b5e9f218064aa76e3026f778e1ad0a1d823e3db8.1643133224.git.reinette.chatre@intel.com/ > > Your feedback will be greatly appreciated. > > Regards, > > Reinette > > Reinette Chatre (32): > x86/sgx: Add short descriptions to ENCLS wrappers > x86/sgx: Add wrapper for SGX2 EMODPR function > x86/sgx: Add wrapper for SGX2 EMODT function > x86/sgx: Add wrapper for SGX2 EAUG function > Documentation/x86: Document SGX permission details > x86/sgx: Support VMA permissions more relaxed than enclave permissions > x86/sgx: Add pfn_mkwrite() handler for present PTEs > x86/sgx: x86/sgx: Add sgx_encl_page->vm_run_prot_bits for dynamic > permission changes > x86/sgx: Export sgx_encl_ewb_cpumask() > x86/sgx: Rename sgx_encl_ewb_cpumask() as sgx_encl_cpumask() > x86/sgx: Move PTE zap code to new sgx_zap_enclave_ptes() > x86/sgx: Make sgx_ipi_cb() available internally > x86/sgx: Create utility to validate user provided offset and length > x86/sgx: Keep record of SGX page type > x86/sgx: Support relaxing of enclave page permissions > x86/sgx: Support restricting of enclave page permissions > selftests/sgx: Add test for EPCM permission changes > selftests/sgx: Add test for TCS page permission changes > x86/sgx: Support adding of pages to an initialized enclave > x86/sgx: Tighten accessible memory range after enclave initialization > selftests/sgx: Test two different SGX2 EAUG flows > x86/sgx: Support modifying SGX page type > x86/sgx: Support complete page removal > Documentation/x86: Introduce enclave runtime management section > selftests/sgx: Introduce dynamic entry point > selftests/sgx: Introduce TCS initialization enclave operation > selftests/sgx: Test complete changing of page type flow > selftests/sgx: Test faulty enclave behavior > selftests/sgx: Test invalid access to removed enclave page > selftests/sgx: Test reclaiming of untouched page > x86/sgx: Free up EPC pages directly to support large page ranges > selftests/sgx: Page removal stress test > > Documentation/x86/sgx.rst | 64 +- > arch/x86/include/asm/sgx.h | 8 + > arch/x86/include/uapi/asm/sgx.h | 81 + > arch/x86/kernel/cpu/sgx/encl.c | 334 +++- > arch/x86/kernel/cpu/sgx/encl.h | 12 +- > arch/x86/kernel/cpu/sgx/encls.h | 33 + > arch/x86/kernel/cpu/sgx/ioctl.c | 831 ++++++++- > arch/x86/kernel/cpu/sgx/main.c | 70 +- > arch/x86/kernel/cpu/sgx/sgx.h | 3 + > tools/testing/selftests/sgx/defines.h | 23 + > tools/testing/selftests/sgx/load.c | 41 + > tools/testing/selftests/sgx/main.c | 1484 +++++++++++++++++ > tools/testing/selftests/sgx/main.h | 1 + > tools/testing/selftests/sgx/test_encl.c | 68 + > .../selftests/sgx/test_encl_bootstrap.S | 6 + > 15 files changed, 2963 insertions(+), 96 deletions(-) > > > base-commit: 26291c54e111ff6ba87a164d85d4a4e134b7315c > prerequisite-patch-id: 3c3908f1c3536cc04ba020fb3e81f51395b44223 > prerequisite-patch-id: e860923423c3387cf6fdcceb2fa41dc5e9454ef4 > prerequisite-patch-id: 986260c8bc4255eb61e2c4afa88d2b723e376423 > prerequisite-patch-id: ba014a99fced2b57d5d9e2dfb9d80ddf4333c13e > prerequisite-patch-id: 65cbb72889b6353a5639b984615d12019136b270 > prerequisite-patch-id: e3296a2f0345a77c8a7ca91f76697ae2e1dca21f > -- > 2.25.1 >
Hi Nathaniel, On 2/22/2022 12:27 PM, Nathaniel McCallum wrote: > 1. This interface looks very odd to me. mmap() is the kernel interface > for changing user space memory maps. Why are we introducing a new > interface for this? mmap() is the kernel interface used to create new mappings in the virtual address space of the calling process. This is different from the permissions and properties of the underlying file/memory being mapped. A new interface is introduced because changes need to be made to the permissions and properties of the underlying enclave. A new virtual address space is not needed nor should existing VMAs be impacted. This is similar to how mmap() is not used to change file permissions. VMA permissions are separate from enclave page permissions as found in the EPCM (Enclave Page Cache Map). The current implementation (SGX1) already distinguishes between the VMA and EPCM permissions - for example, it is already possible to create a read-only VMA from enclave pages that have RW EPCM permissions. mmap() of a portion of EPC memory with a particular permission does not imply that the underlying EPCM permissions (should)have that permission. > You can just simply add a new mmap flag (i.e. > MAP_SGX_TCS*) and then figure out which SGX instructions to execute > based on the desired state of the memory maps. If you do this, none of > the following ioctls are needed: > > * SGX_IOC_ENCLAVE_RELAX_PERMISSIONS > * SGX_IOC_ENCLAVE_RESTRICT_PERMISSIONS > * SGX_IOC_ENCLAVE_REMOVE_PAGES > * SGX_IOC_ENCLAVE_MODIFY_TYPE > > It also means that languages don't have to grow support for all these > ioctls. Instead, they can just reuse the existing mmap() bindings with > the new flag. Also, multiple operations can be combined into a single > mmap() call, amortizing the changes over a single context switch. > > 2. Automatically adding pages with hard-coded permissions in a fault > handler seems like a really bad idea. Could you please elaborate why this is a bad idea? > How do you distinguish between > accesses which should result in an updated mapping and accesses that > should result in a fault? Accesses that should result in an updated mapping have two requirements: (a) address accessed belongs to the enclave based on the address range specified during enclave create (b) there is no backing enclave page for the address > IMHO, all unmapped page accesses should > result in a page fault. mmap() should be called first to identify the > correct permissions for these pages. > Then the page handler should be > updated to use the permissions from the mapping when backfilling > physical pages. If I understand correctly, this should also obviate Regular enclave pages can _only_ be dynamically added with RW permission. SGX2's support for adding regular pages to an enclave via the EAUG instruction is architecturally set at RW. The OS cannot change those permissions via the EAUG instruction nor can the OS do so with a different/additional instruction because: * the OS is not able to relax permissions since that can only be done from within the enclave with ENCLU[EMODPE], thus it is not possible for the OS to dynamically add pages via EAUG as RW and then relax permissions to RWX. * the OS is not able to EAUG a page and immediately attempt an EMODPR either as Jarkko also recently inquired about: https://lore.kernel.org/linux-sgx/80f3d7b9-e3d5-b2c0-7707-710bf6f5081e@intel.com/ > the need for the weird userspace callback to allow for execute > permissions. User policy integration would always be required to allow execute permissions on a writable page. This is not expected to be a userspace callback but instead integration with existing user policy subsystem(s). > > 3. Implementing as I've suggested also means that we can lock down an > enclave, for example - after code has been JITed, by closing the file > descriptor. Once the file descriptor used to create the enclave is > closed, no further mmap() can be performed on the enclave. Attempting > to do EACCEPT on an unmapped page will generate a page fault. This is not clear to me. If the file descriptor is closed and no further mmap() is allowed then how would a process be able to enter the enclave to execute code within it? This series does indeed lock down the address range to ensure that it is not possible to map memory that does not belong to the enclave after the enclave is created. Please see: https://lore.kernel.org/linux-sgx/1b833dbce6c937f71523f4aaf4b2181b9673519f.1644274683.git.reinette.chatre@intel.com/ > > * - I'm aware that a new flag might be frowned upon. I see a few other options: > 1. reuse an existing flag which doesn't make sense in this context > 2. communicate the page type in the offset argument > 3. keep SGX_IOC_ENCLAVE_MODIFY_TYPE > Reinette
On Tue, Feb 22, 2022 at 5:39 PM Reinette Chatre <reinette.chatre@intel.com> wrote: > > Hi Nathaniel, > > On 2/22/2022 12:27 PM, Nathaniel McCallum wrote: > > 1. This interface looks very odd to me. mmap() is the kernel interface > > for changing user space memory maps. Why are we introducing a new > > interface for this? > > mmap() is the kernel interface used to create new mappings in the > virtual address space of the calling process. This is different from > the permissions and properties of the underlying file/memory being mapped. > > A new interface is introduced because changes need to be made to the > permissions and properties of the underlying enclave. A new virtual > address space is not needed nor should existing VMAs be impacted. > > This is similar to how mmap() is not used to change file permissions. > > VMA permissions are separate from enclave page permissions as found in > the EPCM (Enclave Page Cache Map). The current implementation (SGX1) already > distinguishes between the VMA and EPCM permissions - for example, it is > already possible to create a read-only VMA from enclave pages that have > RW EPCM permissions. mmap() of a portion of EPC memory with a particular > permission does not imply that the underlying EPCM permissions (should)have > that permission. Yes. BUT... unlike the file permissions, this leaks an implementation detail. The user process is governed by VMA permissions. And during enclave creation, it had to mmap() all the enclave regions to their final VMA permissions. So during enclave creation you have to use mmap() but after enclave creation you use custom APIs? That's inconsistent at best. Forcing userspace to worry about the (mostly undocumented!) interactions between EPC, PTE and VMA permissions makes these APIs hard to use and difficult to reason about. When I call SGX_IOC_ENCLAVE_RELAX_PERMISSIONS, do I also have to call mmap() to update the VMA permissions to match? It isn't clear. Nor is it really clear why I'm calling completely separate APIs. > > You can just simply add a new mmap flag (i.e. > > MAP_SGX_TCS*) and then figure out which SGX instructions to execute > > based on the desired state of the memory maps. If you do this, none of > > the following ioctls are needed: > > > > * SGX_IOC_ENCLAVE_RELAX_PERMISSIONS > > * SGX_IOC_ENCLAVE_RESTRICT_PERMISSIONS > > * SGX_IOC_ENCLAVE_REMOVE_PAGES > > * SGX_IOC_ENCLAVE_MODIFY_TYPE > > > > It also means that languages don't have to grow support for all these > > ioctls. Instead, they can just reuse the existing mmap() bindings with > > the new flag. Also, multiple operations can be combined into a single > > mmap() call, amortizing the changes over a single context switch. > > > > 2. Automatically adding pages with hard-coded permissions in a fault > > handler seems like a really bad idea. > > Could you please elaborate why this is a bad idea? Because implementations that miss this subtlety suddenly have pages with magic permissions. Magic is bad. Explicit is good. > > How do you distinguish between > > accesses which should result in an updated mapping and accesses that > > should result in a fault? > > Accesses that should result in an updated mapping have two requirements: > (a) address accessed belongs to the enclave based on the address > range specified during enclave create > (b) there is no backing enclave page for the address What happens if the enclave is buggy? Or has been compromised. In both of those cases, there should be a userspace visible fault and pages should not be added. > > IMHO, all unmapped page accesses should > > result in a page fault. mmap() should be called first to identify the > > correct permissions for these pages. > > Then the page handler should be > > updated to use the permissions from the mapping when backfilling > > physical pages. If I understand correctly, this should also obviate > > Regular enclave pages can _only_ be dynamically added with RW permission. > > SGX2's support for adding regular pages to an enclave via the EAUG > instruction is architecturally set at RW. The OS cannot change those permissions > via the EAUG instruction nor can the OS do so with a different/additional > instruction because: > * the OS is not able to relax permissions since that can only be done from > within the enclave with ENCLU[EMODPE], thus it is not possible for the OS to > dynamically add pages via EAUG as RW and then relax permissions to RWX. > * the OS is not able to EAUG a page and immediately attempt an EMODPR either > as Jarkko also recently inquired about: > https://lore.kernel.org/linux-sgx/80f3d7b9-e3d5-b2c0-7707-710bf6f5081e@intel.com/ This design looks... unfinished. EAUG takes a PAGEINFO in RBX, but PAGEINFO.SECINFO must be zeroed and EAUG instead sets magic hard-coded permissions. Why doesn't EAUG just respect the permissions in PAGEINFO.SECINFO? We aren't told. Further, if the enclave can do EMODPE, why does SGX_IOC_ENCLAVE_RELAX_PERMISSIONS even exist? None of the documentation explains what this ioctl even does. Does it update PTE permissions? VMA permissions? Nobody knows without reading the source code. Userspace should not be bothered with the subtle details of the interaction between EPC, PTE and VMA permissions. But this API does everything it can do to expose all these details to userspace. And it doesn't bother to document them (probably because it is hard). It would be much better to avoid exposing these details to userspace. IMHO, there should be a simple flow like this (if EAUG respects PAGEINFO.SECINFO): 1. Non-enclave calls mmap()/munmap(). 2. Enclave issues EACCEPT, if necessary. 3. Enclave issues EMODPE, if necessary. Notice that in the second step above, during the mmap() call, the kernel ensures that EPC, PTE and VMA are in sync and fails if they cannot be made to be compatible. Also note that in the above flow EAUG instructions can be efficiently batched. Given the current poor state of the EAUG instruction, we might need to do this flow instead: 1. Enclave issues EACCEPT, if necessary. (Add RW pages...) 2. Non-enclave calls mmap()/munmap(). 3. Enclave issues EACCEPT, if necessary. 4. Enclave issues EMODPE, if necessary. However, doing EAUG only via the page access handler means that there is no way to batch EAUG instructions and this forces a context switch for every page you want to add. This has to be terrible for performance. Note specifically that the SDM calls out batching, which is currently impossible under this patch set. 35.5.7 - "Page allocation operations may be batched to improve efficiency." As it stands today, if I want to add 256MiB of pages to an enclave, I'll have to do 2^16 context switches. That doesn't seem scalable. > > the need for the weird userspace callback to allow for execute > > permissions. > > User policy integration would always be required to allow execute > permissions on a writable page. This is not expected to be a userspace > callback but instead integration with existing user policy subsystem(s). Why? This isn't documented. > > 3. Implementing as I've suggested also means that we can lock down an > > enclave, for example - after code has been JITed, by closing the file > > descriptor. Once the file descriptor used to create the enclave is > > closed, no further mmap() can be performed on the enclave. Attempting > > to do EACCEPT on an unmapped page will generate a page fault. > > This is not clear to me. If the file descriptor is closed and no further > mmap() is allowed then how would a process be able to enter the enclave > to execute code within it? EENTER (or the vdso function) with the address of a TCS page, like normal. In Enarx, we don't retain the enclave fd after the final mmap() following EINIT. Everything works just fine. > This series does indeed lock down the address range to ensure that it is > not possible to map memory that does not belong to the enclave after the > enclave is created. Please see: > https://lore.kernel.org/linux-sgx/1b833dbce6c937f71523f4aaf4b2181b9673519f.1644274683.git.reinette.chatre@intel.com/ That's not what I'm talking about. I'm talking about a workflow like this: 1. Enclave initialization: ECREATE ... EINIT 2. EENTER 3. Enclave JITs some code (changes page permissions) 4. EEXIT 5. Close enclave fd. 6. EENTER 7. If an enclave attempts page modifications, a fault occurs. Think of this similar to seccomp(). The enclave wants to do some dynamic page table manipulation. But then it wants to lock down page table modification so that, if compromised, attackers have no ability to obtain RWX permissions.
Hi Nathaniel, On 2/23/2022 5:24 AM, Nathaniel McCallum wrote: > On Tue, Feb 22, 2022 at 5:39 PM Reinette Chatre > <reinette.chatre@intel.com> wrote: >> >> Hi Nathaniel, >> >> On 2/22/2022 12:27 PM, Nathaniel McCallum wrote: >>> 1. This interface looks very odd to me. mmap() is the kernel interface >>> for changing user space memory maps. Why are we introducing a new >>> interface for this? >> >> mmap() is the kernel interface used to create new mappings in the >> virtual address space of the calling process. This is different from >> the permissions and properties of the underlying file/memory being mapped. >> >> A new interface is introduced because changes need to be made to the >> permissions and properties of the underlying enclave. A new virtual >> address space is not needed nor should existing VMAs be impacted. >> >> This is similar to how mmap() is not used to change file permissions. >> >> VMA permissions are separate from enclave page permissions as found in >> the EPCM (Enclave Page Cache Map). The current implementation (SGX1) already >> distinguishes between the VMA and EPCM permissions - for example, it is >> already possible to create a read-only VMA from enclave pages that have >> RW EPCM permissions. mmap() of a portion of EPC memory with a particular >> permission does not imply that the underlying EPCM permissions (should)have >> that permission. > > Yes. BUT... unlike the file permissions, this leaks an implementation detail. Not really - just like a RW file can be mapped read-only or RW, RW enclave memory can be mapped read-only or RW. > > The user process is governed by VMA permissions. And during enclave > creation, it had to mmap() all the enclave regions to their final VMA > permissions. So during enclave creation you have to use mmap() but > after enclave creation you use custom APIs? That's inconsistent at > best. No. ioctl()s are consistently used to manage enclave memory. The existing ioctls() SGX_IOC_ENCLAVE_CREATE, SGX_IOC_ENCLAVE_ADD_PAGES, and SGX_IOC_ENCLAVE_INIT are used to set up to initialize the enclave memory. The new ioctls() are used to manage enclave memory after enclave initialization. The enclave memory is thus managed with a consistent interface. mmap() is required before SGX_IOC_ENCLAVE_CREATE to obtain a base address for the enclave that is required by the ioctl(). The rest of the ioctl()s, existing and new, are consistent in interface by not requiring a memory mapping but instead work from an offset from the base address. > Forcing userspace to worry about the (mostly undocumented!) > interactions between EPC, PTE and VMA permissions makes these APIs > hard to use and difficult to reason about. This is not new. The current SGX1 user space is already prevented from creating a mapping of enclave memory that is more relaxed than the enclave memory. For example, if the enclave memory has RW EPCM permissions then it is not possible to mmap() that memory as RWX. > > When I call SGX_IOC_ENCLAVE_RELAX_PERMISSIONS, do I also have to call > mmap() to update the VMA permissions to match? It isn't clear. Nor is mprotect() may be the better call to use. > it really clear why I'm calling completely separate APIs. > >>> You can just simply add a new mmap flag (i.e. >>> MAP_SGX_TCS*) and then figure out which SGX instructions to execute >>> based on the desired state of the memory maps. If you do this, none of >>> the following ioctls are needed: >>> >>> * SGX_IOC_ENCLAVE_RELAX_PERMISSIONS >>> * SGX_IOC_ENCLAVE_RESTRICT_PERMISSIONS >>> * SGX_IOC_ENCLAVE_REMOVE_PAGES >>> * SGX_IOC_ENCLAVE_MODIFY_TYPE >>> >>> It also means that languages don't have to grow support for all these >>> ioctls. Instead, they can just reuse the existing mmap() bindings with >>> the new flag. Also, multiple operations can be combined into a single >>> mmap() call, amortizing the changes over a single context switch. >>> >>> 2. Automatically adding pages with hard-coded permissions in a fault >>> handler seems like a really bad idea. >> >> Could you please elaborate why this is a bad idea? > > Because implementations that miss this subtlety suddenly have pages > with magic permissions. Magic is bad. Explicit is good. > There is no magic. Any new pages have to be accepted by the enclave. The enclave will not be able to access these pages unless explicitly accepted, ENCLU[EACCEPT], from within the enclave. >>> How do you distinguish between >>> accesses which should result in an updated mapping and accesses that >>> should result in a fault? >> >> Accesses that should result in an updated mapping have two requirements: >> (a) address accessed belongs to the enclave based on the address >> range specified during enclave create >> (b) there is no backing enclave page for the address > > What happens if the enclave is buggy? Or has been compromised. In both > of those cases, there should be a userspace visible fault and pages > should not be added. If user space accesses a memory address with a regular read/write that results in a new page added then there is indeed a user space visible fault. You can see this flow in action in the "augment" test case in https://lore.kernel.org/linux-sgx/32c1116934a588bd3e6c174684e3e36a05c0a4d4.1644274683.git.reinette.chatre@intel.com/ If user space indeed wants the page after encountering such a fault then it needs to enter the enclave again, from a different entry point, to run ENCLU[EACCEPT], before it can return to the original entry point to continue execution from the instruction that triggered the original read/write. The only flow where a page is added without a user space visible fault is when user space explicitly runs the ENCLU[EACCEPT] to do so. > >>> IMHO, all unmapped page accesses should >>> result in a page fault. mmap() should be called first to identify the >>> correct permissions for these pages. >>> Then the page handler should be >>> updated to use the permissions from the mapping when backfilling >>> physical pages. If I understand correctly, this should also obviate >> >> Regular enclave pages can _only_ be dynamically added with RW permission. >> >> SGX2's support for adding regular pages to an enclave via the EAUG >> instruction is architecturally set at RW. The OS cannot change those permissions >> via the EAUG instruction nor can the OS do so with a different/additional >> instruction because: >> * the OS is not able to relax permissions since that can only be done from >> within the enclave with ENCLU[EMODPE], thus it is not possible for the OS to >> dynamically add pages via EAUG as RW and then relax permissions to RWX. >> * the OS is not able to EAUG a page and immediately attempt an EMODPR either >> as Jarkko also recently inquired about: >> https://lore.kernel.org/linux-sgx/80f3d7b9-e3d5-b2c0-7707-710bf6f5081e@intel.com/ > > This design looks... unfinished. EAUG takes a PAGEINFO in RBX, but > PAGEINFO.SECINFO must be zeroed and EAUG instead sets magic hard-coded > permissions. Why doesn't EAUG just respect the permissions in > PAGEINFO.SECINFO? We aren't told. This design is finished and respects the hardware specification. You can find the details in the SDM's documentation of the EAUG function. If the SECINFO field has a value then the hardware requires it to indicate that it is a new shadow stack page being added, not a regular page. Support for shadow stack pages is not in scope for this work. Attempting to dynamically add a regular page with explicit permissions will result in a #GP(0). The only way to add a regular enclave page is to make the SECINFO field empty and doing so forces the page type to be a regular page and the permissions to be RW. > > Further, if the enclave can do EMODPE, why does > SGX_IOC_ENCLAVE_RELAX_PERMISSIONS even exist? None of the > documentation explains what this ioctl even does. Does it update PTE > permissions? VMA permissions? Nobody knows without reading the source > code. Build the documentation (after applying this series) and it should contain all the information you are searching for. As is the current custom in the SGX documentation the built documentation pulls its content from the kernel doc of the functions that implement the core of the user space interactions. > > Userspace should not be bothered with the subtle details of the > interaction between EPC, PTE and VMA permissions. But this API does > everything it can do to expose all these details to userspace. And it > doesn't bother to document them (probably because it is hard). It > would be much better to avoid exposing these details to userspace. > > IMHO, there should be a simple flow like this (if EAUG respects > PAGEINFO.SECINFO): EAUG does not respect PAGEINFO.SECINFO for regular pages. > > 1. Non-enclave calls mmap()/munmap(). > 2. Enclave issues EACCEPT, if necessary. > 3. Enclave issues EMODPE, if necessary. > > Notice that in the second step above, during the mmap() call, the > kernel ensures that EPC, PTE and VMA are in sync and fails if they > cannot be made to be compatible. Also note that in the above flow EAUG > instructions can be efficiently batched. > > Given the current poor state of the EAUG instruction, we might need to > do this flow instead: > > 1. Enclave issues EACCEPT, if necessary. (Add RW pages...) > 2. Non-enclave calls mmap()/munmap(). > 3. Enclave issues EACCEPT, if necessary. > 4. Enclave issues EMODPE, if necessary. > > However, doing EAUG only via the page access handler means that there > is no way to batch EAUG instructions and this forces a context switch > for every page you want to add. This has to be terrible for > performance. Note specifically that the SDM calls out batching, which > is currently impossible under this patch set. 35.5.7 - "Page > allocation operations may be batched to improve efficiency." These page functions are all per-page so it is not possible to add multiple pages with a single instruction. It is indeed possible to pre-fault pages. > As it stands today, if I want to add 256MiB of pages to an enclave, > I'll have to do 2^16 context switches. That doesn't seem scalable. No. Running ENCLU[EACCEPT] on each of the pages within that range should not need any explicit context switch out of the enclave. See the "augment_via_eaccept" test case in: https://lore.kernel.org/linux-sgx/32c1116934a588bd3e6c174684e3e36a05c0a4d4.1644274683.git.reinette.chatre@intel.com/ >>> the need for the weird userspace callback to allow for execute >>> permissions. >> >> User policy integration would always be required to allow execute >> permissions on a writable page. This is not expected to be a userspace >> callback but instead integration with existing user policy subsystem(s). > > Why? This isn't documented. This is similar to the existing policies involved in managing the permissions of mapped memory. When user space calls mprotect() to change permissions of a mapped region then the kernel will not blindly allow the permissions but instead ensure that it is allowed based on user policy by calling the LSM (Linux Security Module) hooks. You can learn more about LSM and various security modules at: Documentation/security/lsm.rst Documentation/admin-guide/LSM/* You can compare what is needed here to what is currently done when user space attempts to make some memory executable (see: mm/mprotect.c:do_mprotect_key()->security_file_mprotect()). User policy needs to help the kernel determine if this is allowed. For example, when SELinux is the security module of choice then the process or file (depending on what type of memory is being changed) needs to have a special permission (PROCESS__EXECHEAP, PROCESS__EXECSTACK, or FILE__EXECMOD) assigned by user space to allow this. Integration with user space policy is required for RWX of dynamically added pages to be supported. In this series dynamically added pages will not be allowed to be made executable, a follow-up series will add support for user policy integration to support RWX permissions of dynamically added pages. >>> 3. Implementing as I've suggested also means that we can lock down an >>> enclave, for example - after code has been JITed, by closing the file >>> descriptor. Once the file descriptor used to create the enclave is >>> closed, no further mmap() can be performed on the enclave. Attempting >>> to do EACCEPT on an unmapped page will generate a page fault. >> >> This is not clear to me. If the file descriptor is closed and no further >> mmap() is allowed then how would a process be able to enter the enclave >> to execute code within it? > > EENTER (or the vdso function) with the address of a TCS page, like > normal. In Enarx, we don't retain the enclave fd after the final > mmap() following EINIT. Everything works just fine. The OS fault handler is responsible for managing the PTEs that is required for the enclave to be able to access the memory within the enclave. The OS fault handler is attached to a VMA that is created with mmap(). > >> This series does indeed lock down the address range to ensure that it is >> not possible to map memory that does not belong to the enclave after the >> enclave is created. Please see: >> https://lore.kernel.org/linux-sgx/1b833dbce6c937f71523f4aaf4b2181b9673519f.1644274683.git.reinette.chatre@intel.com/ > > That's not what I'm talking about. I'm talking about a workflow like this: > > 1. Enclave initialization: ECREATE ... EINIT > 2. EENTER > 3. Enclave JITs some code (changes page permissions) > 4. EEXIT > 5. Close enclave fd. > 6. EENTER > 7. If an enclave attempts page modifications, a fault occurs. The original fd that was created to obtain the enclave base address may be closed at (5) but the executable and data portions of the enclave still needs to be mapped afterwards to be able to have OS support for managing the PTEs that the enclave depends on to access those pages. > > Think of this similar to seccomp(). The enclave wants to do some > dynamic page table manipulation. But then it wants to lock down page > table modification so that, if compromised, attackers have no ability > to obtain RWX permissions. Reinette
Reinette, Perhaps it would be better for us to have a shared understanding on how the patches as posted are supposed to work in the most common cases? I'm thinking here of projects such as Enarx, Gramine and Occulum, which all have a similar process. Namely they execute an executable (called exec in the below chart) which has things like syscalls handled by a shim. These two components (shim and exec) are supported by a non-enclave userspace runtime. Given this common architectural pattern, this is how I understand adding pages via an exec call to mmap() to work. https://mermaid.live/edit#pako:eNp1k81qwzAQhF9F6NRCAu1Vh0BIRemhoeSHBuIettYmFpElVZZLQ8i7144sJ8aOT2bmY3d2vT7R1AikjBb4U6JO8UXC3kGeaFI9FpyXqbSgPTmg06j6uiu1lzn2jSKTA2XwD9NEB31uPBLzi-6iMpLnYB8Wn4-kOBYpKBW52iXj8WQSmzEy5Zvt01ewG5HUQN2UEc7nK77YPjdALd64GWih8NpkALGwR_JtzOGAaKXexyTKGEt2pgoMaXahgj5Qgk9nM_6xGvDDJpsmOyiVv0LB62B8un4dBDrLiLPeWciCL9fvvKVQizhSG6stFz9Df7sxUpcYitR-SodFO2A_Vw-7l4nzzduqjX9bKJxOHDDeBB3RHF0OUlS3faq1hPoMqzulrHoVGPZOE32u0NIK8MiF9MZRtgNV4IhC6c3yqFPKvCsxQs3_0VDnfzf-CPg This only covers adding RW pages. I haven't even tackled permission changes yet. Is that understanding correct? If not, please provide an alternative sequence diagram to explain how you expect this to be used. On Wed, Feb 23, 2022 at 1:25 PM Reinette Chatre <reinette.chatre@intel.com> wrote: > > Hi Nathaniel, > > On 2/23/2022 5:24 AM, Nathaniel McCallum wrote: > > On Tue, Feb 22, 2022 at 5:39 PM Reinette Chatre > > <reinette.chatre@intel.com> wrote: > >> > >> Hi Nathaniel, > >> > >> On 2/22/2022 12:27 PM, Nathaniel McCallum wrote: > >>> 1. This interface looks very odd to me. mmap() is the kernel interface > >>> for changing user space memory maps. Why are we introducing a new > >>> interface for this? > >> > >> mmap() is the kernel interface used to create new mappings in the > >> virtual address space of the calling process. This is different from > >> the permissions and properties of the underlying file/memory being mapped. > >> > >> A new interface is introduced because changes need to be made to the > >> permissions and properties of the underlying enclave. A new virtual > >> address space is not needed nor should existing VMAs be impacted. > >> > >> This is similar to how mmap() is not used to change file permissions. > >> > >> VMA permissions are separate from enclave page permissions as found in > >> the EPCM (Enclave Page Cache Map). The current implementation (SGX1) already > >> distinguishes between the VMA and EPCM permissions - for example, it is > >> already possible to create a read-only VMA from enclave pages that have > >> RW EPCM permissions. mmap() of a portion of EPC memory with a particular > >> permission does not imply that the underlying EPCM permissions (should)have > >> that permission. > > > > Yes. BUT... unlike the file permissions, this leaks an implementation detail. > > Not really - just like a RW file can be mapped read-only or RW, RW enclave > memory can be mapped read-only or RW. > > > > > The user process is governed by VMA permissions. And during enclave > > creation, it had to mmap() all the enclave regions to their final VMA > > permissions. So during enclave creation you have to use mmap() but > > after enclave creation you use custom APIs? That's inconsistent at > > best. > > No. ioctl()s are consistently used to manage enclave memory. > > The existing ioctls() SGX_IOC_ENCLAVE_CREATE, SGX_IOC_ENCLAVE_ADD_PAGES, > and SGX_IOC_ENCLAVE_INIT are used to set up to initialize the enclave memory. > > The new ioctls() are used to manage enclave memory after enclave initialization. > > The enclave memory is thus managed with a consistent interface. > > mmap() is required before SGX_IOC_ENCLAVE_CREATE to obtain a base address > for the enclave that is required by the ioctl(). The rest of the ioctl()s, > existing and new, are consistent in interface by not requiring a memory > mapping but instead work from an offset from the base address. > > > Forcing userspace to worry about the (mostly undocumented!) > > interactions between EPC, PTE and VMA permissions makes these APIs > > hard to use and difficult to reason about. > > This is not new. The current SGX1 user space is already prevented from > creating a mapping of enclave memory that is more relaxed than the enclave > memory. For example, if the enclave memory has RW EPCM permissions then it > is not possible to mmap() that memory as RWX. > > > > > When I call SGX_IOC_ENCLAVE_RELAX_PERMISSIONS, do I also have to call > > mmap() to update the VMA permissions to match? It isn't clear. Nor is > > mprotect() may be the better call to use. > > > it really clear why I'm calling completely separate APIs. > > > >>> You can just simply add a new mmap flag (i.e. > >>> MAP_SGX_TCS*) and then figure out which SGX instructions to execute > >>> based on the desired state of the memory maps. If you do this, none of > >>> the following ioctls are needed: > >>> > >>> * SGX_IOC_ENCLAVE_RELAX_PERMISSIONS > >>> * SGX_IOC_ENCLAVE_RESTRICT_PERMISSIONS > >>> * SGX_IOC_ENCLAVE_REMOVE_PAGES > >>> * SGX_IOC_ENCLAVE_MODIFY_TYPE > >>> > >>> It also means that languages don't have to grow support for all these > >>> ioctls. Instead, they can just reuse the existing mmap() bindings with > >>> the new flag. Also, multiple operations can be combined into a single > >>> mmap() call, amortizing the changes over a single context switch. > >>> > >>> 2. Automatically adding pages with hard-coded permissions in a fault > >>> handler seems like a really bad idea. > >> > >> Could you please elaborate why this is a bad idea? > > > > Because implementations that miss this subtlety suddenly have pages > > with magic permissions. Magic is bad. Explicit is good. > > > > There is no magic. Any new pages have to be accepted by the enclave. > The enclave will not be able to access these pages unless explicitly > accepted, ENCLU[EACCEPT], from within the enclave. > > >>> How do you distinguish between > >>> accesses which should result in an updated mapping and accesses that > >>> should result in a fault? > >> > >> Accesses that should result in an updated mapping have two requirements: > >> (a) address accessed belongs to the enclave based on the address > >> range specified during enclave create > >> (b) there is no backing enclave page for the address > > > > What happens if the enclave is buggy? Or has been compromised. In both > > of those cases, there should be a userspace visible fault and pages > > should not be added. > > If user space accesses a memory address with a regular read/write that > results in a new page added then there is indeed a user space visible > fault. You can see this flow in action in the "augment" test case in > https://lore.kernel.org/linux-sgx/32c1116934a588bd3e6c174684e3e36a05c0a4d4.1644274683.git.reinette.chatre@intel.com/ > > If user space indeed wants the page after encountering such a fault then > it needs to enter the enclave again, from a different entry point, to > run ENCLU[EACCEPT], before it can return to the original entry point to > continue execution from the instruction that triggered the original read/write. > > The only flow where a page is added without a user space visible fault > is when user space explicitly runs the ENCLU[EACCEPT] to do so. > > > > >>> IMHO, all unmapped page accesses should > >>> result in a page fault. mmap() should be called first to identify the > >>> correct permissions for these pages. > >>> Then the page handler should be > >>> updated to use the permissions from the mapping when backfilling > >>> physical pages. If I understand correctly, this should also obviate > >> > >> Regular enclave pages can _only_ be dynamically added with RW permission. > >> > >> SGX2's support for adding regular pages to an enclave via the EAUG > >> instruction is architecturally set at RW. The OS cannot change those permissions > >> via the EAUG instruction nor can the OS do so with a different/additional > >> instruction because: > >> * the OS is not able to relax permissions since that can only be done from > >> within the enclave with ENCLU[EMODPE], thus it is not possible for the OS to > >> dynamically add pages via EAUG as RW and then relax permissions to RWX. > >> * the OS is not able to EAUG a page and immediately attempt an EMODPR either > >> as Jarkko also recently inquired about: > >> https://lore.kernel.org/linux-sgx/80f3d7b9-e3d5-b2c0-7707-710bf6f5081e@intel.com/ > > > > This design looks... unfinished. EAUG takes a PAGEINFO in RBX, but > > PAGEINFO.SECINFO must be zeroed and EAUG instead sets magic hard-coded > > permissions. Why doesn't EAUG just respect the permissions in > > PAGEINFO.SECINFO? We aren't told. > > This design is finished and respects the hardware specification. You can find > the details in the SDM's documentation of the EAUG function. > > If the SECINFO field has a value then the hardware requires it to indicate > that it is a new shadow stack page being added, not a regular page. Support for > shadow stack pages is not in scope for this work. Attempting to dynamically > add a regular page with explicit permissions will result in a #GP(0). > > The only way to add a regular enclave page is to make the SECINFO field empty > and doing so forces the page type to be a regular page and the permissions to > be RW. > > > > > Further, if the enclave can do EMODPE, why does > > SGX_IOC_ENCLAVE_RELAX_PERMISSIONS even exist? None of the > > documentation explains what this ioctl even does. Does it update PTE > > permissions? VMA permissions? Nobody knows without reading the source > > code. > > Build the documentation (after applying this series) and it should > contain all the information you are searching for. As is the current custom > in the SGX documentation the built documentation pulls its content from > the kernel doc of the functions that implement the core of the > user space interactions. > > > > > Userspace should not be bothered with the subtle details of the > > interaction between EPC, PTE and VMA permissions. But this API does > > everything it can do to expose all these details to userspace. And it > > doesn't bother to document them (probably because it is hard). It > > would be much better to avoid exposing these details to userspace. > > > > IMHO, there should be a simple flow like this (if EAUG respects > > PAGEINFO.SECINFO): > > EAUG does not respect PAGEINFO.SECINFO for regular pages. > > > > > 1. Non-enclave calls mmap()/munmap(). > > 2. Enclave issues EACCEPT, if necessary. > > 3. Enclave issues EMODPE, if necessary. > > > > Notice that in the second step above, during the mmap() call, the > > kernel ensures that EPC, PTE and VMA are in sync and fails if they > > cannot be made to be compatible. Also note that in the above flow EAUG > > instructions can be efficiently batched. > > > > Given the current poor state of the EAUG instruction, we might need to > > do this flow instead: > > > > 1. Enclave issues EACCEPT, if necessary. (Add RW pages...) > > 2. Non-enclave calls mmap()/munmap(). > > 3. Enclave issues EACCEPT, if necessary. > > 4. Enclave issues EMODPE, if necessary. > > > > However, doing EAUG only via the page access handler means that there > > is no way to batch EAUG instructions and this forces a context switch > > for every page you want to add. This has to be terrible for > > performance. Note specifically that the SDM calls out batching, which > > is currently impossible under this patch set. 35.5.7 - "Page > > allocation operations may be batched to improve efficiency." > > These page functions are all per-page so it is not possible to add multiple > pages with a single instruction. It is indeed possible to pre-fault pages. > > > As it stands today, if I want to add 256MiB of pages to an enclave, > > I'll have to do 2^16 context switches. That doesn't seem scalable. > > No. Running ENCLU[EACCEPT] on each of the pages within that range should not > need any explicit context switch out of the enclave. See the "augment_via_eaccept" > test case in: > https://lore.kernel.org/linux-sgx/32c1116934a588bd3e6c174684e3e36a05c0a4d4.1644274683.git.reinette.chatre@intel.com/ > > > >>> the need for the weird userspace callback to allow for execute > >>> permissions. > >> > >> User policy integration would always be required to allow execute > >> permissions on a writable page. This is not expected to be a userspace > >> callback but instead integration with existing user policy subsystem(s). > > > > Why? This isn't documented. > > This is similar to the existing policies involved in managing the permissions > of mapped memory. When user space calls mprotect() to change permissions > of a mapped region then the kernel will not blindly allow the permissions but > instead ensure that it is allowed based on user policy by calling the LSM > (Linux Security Module) hooks. > > You can learn more about LSM and various security modules at: > Documentation/security/lsm.rst > Documentation/admin-guide/LSM/* > > You can compare what is needed here to what is currently done when user space > attempts to make some memory executable (see: > mm/mprotect.c:do_mprotect_key()->security_file_mprotect()). User policy needs > to help the kernel determine if this is allowed. For example, when SELinux is > the security module of choice then the process or file (depending on what type > of memory is being changed) needs to have a special permission (PROCESS__EXECHEAP, > PROCESS__EXECSTACK, or FILE__EXECMOD) assigned by user space to allow this. > > Integration with user space policy is required for RWX of dynamically added pages > to be supported. In this series dynamically added pages will not be allowed to > be made executable, a follow-up series will add support for user policy > integration to support RWX permissions of dynamically added pages. > > >>> 3. Implementing as I've suggested also means that we can lock down an > >>> enclave, for example - after code has been JITed, by closing the file > >>> descriptor. Once the file descriptor used to create the enclave is > >>> closed, no further mmap() can be performed on the enclave. Attempting > >>> to do EACCEPT on an unmapped page will generate a page fault. > >> > >> This is not clear to me. If the file descriptor is closed and no further > >> mmap() is allowed then how would a process be able to enter the enclave > >> to execute code within it? > > > > EENTER (or the vdso function) with the address of a TCS page, like > > normal. In Enarx, we don't retain the enclave fd after the final > > mmap() following EINIT. Everything works just fine. > > The OS fault handler is responsible for managing the PTEs that is required > for the enclave to be able to access the memory within the enclave. > The OS fault handler is attached to a VMA that is created with mmap(). > > > > >> This series does indeed lock down the address range to ensure that it is > >> not possible to map memory that does not belong to the enclave after the > >> enclave is created. Please see: > >> https://lore.kernel.org/linux-sgx/1b833dbce6c937f71523f4aaf4b2181b9673519f.1644274683.git.reinette.chatre@intel.com/ > > > > That's not what I'm talking about. I'm talking about a workflow like this: > > > > 1. Enclave initialization: ECREATE ... EINIT > > 2. EENTER > > 3. Enclave JITs some code (changes page permissions) > > 4. EEXIT > > 5. Close enclave fd. > > 6. EENTER > > 7. If an enclave attempts page modifications, a fault occurs. > > The original fd that was created to obtain the enclave base address > may be closed at (5) but the executable and data portions of the enclave > still needs to be mapped afterwards to be able to have OS support for > managing the PTEs that the enclave depends on to access those pages. > > > > > Think of this similar to seccomp(). The enclave wants to do some > > dynamic page table manipulation. But then it wants to lock down page > > table modification so that, if compromised, attackers have no ability > > to obtain RWX permissions. > > Reinette
Hi Nathaniel, On 3/2/2022 8:57 AM, Nathaniel McCallum wrote: > Perhaps it would be better for us to have a shared understanding on > how the patches as posted are supposed to work in the most common > cases? I'm thinking here of projects such as Enarx, Gramine and > Occulum, which all have a similar process. Namely they execute an > executable (called exec in the below chart) which has things like > syscalls handled by a shim. These two components (shim and exec) are > supported by a non-enclave userspace runtime. Given this common > architectural pattern, this is how I understand adding pages via an > exec call to mmap() to work. > > https://mermaid.live/edit#pako:eNp1k81qwzAQhF9F6NRCAu1Vh0BIRemhoeSHBuIettYmFpElVZZLQ8i7144sJ8aOT2bmY3d2vT7R1AikjBb4U6JO8UXC3kGeaFI9FpyXqbSgPTmg06j6uiu1lzn2jSKTA2XwD9NEB31uPBLzi-6iMpLnYB8Wn4-kOBYpKBW52iXj8WQSmzEy5Zvt01ewG5HUQN2UEc7nK77YPjdALd64GWih8NpkALGwR_JtzOGAaKXexyTKGEt2pgoMaXahgj5Qgk9nM_6xGvDDJpsmOyiVv0LB62B8un4dBDrLiLPeWciCL9fvvKVQizhSG6stFz9Df7sxUpcYitR-SodFO2A_Vw-7l4nzzduqjX9bKJxOHDDeBB3RHF0OUlS3faq1hPoMqzulrHoVGPZOE32u0NIK8MiF9MZRtgNV4IhC6c3yqFPKvCsxQs3_0VDnfzf-CPg > > This only covers adding RW pages. I haven't even tackled permission > changes yet. Is that understanding correct? If not, please provide an > alternative sequence diagram to explain how you expect this to be > used. Please find my attempt linked below: https://mermaid.live/edit#pako:eNqFUsFqAjEQ_ZWQUwsK7XUPgthQeqiUVang9jAkoxu6m2yzWVsR_72J2WTbKnSOb97MvPeSI-VaIM1oix8dKo4PEnYG6kIRVw0YK7lsQFlSghGfYPCy845GYXWJm05ZWV8ZaEt55QB-IS9UwOfaItF7NGc0I3UNzU3-ekvaQ8uhqiLPd8l4PJnEYxmZsvXm7i20e5B4QlA5rAqMgJJfG9Ixg21X2ctVXn9GGJsvWb65729FSZXWDdlqpxx46Qzu-gB8-cHzhhim2zKdzdjLcuAAt3IPzv6Qkq84EdxGM3492UJS-cdSpLHp6nEgCPz3RjI5NPvAlRisJjspOsbWT8sUyc_MwjuynC1Wzyw9EB3RGk0NUrgvePRYQW2J7tNQd5sKDN5ooU6O2jXCiWZCWm1otoWqxRGFzurFQXGaWdNhJPXfuGedvgFejOuH The changes include: * Move mmap() to occur before attempting EACCEPT on the addresses. This is required for EACCEPT (as well as any subsequent access from within the enclave) to be able to access the pages. * Remove AEX[1] to the runtime within the loop. After EAUG returns execution will return to the instruction pointer that triggered the #PF, EACCEPT, this will cause the EACCEPT to be run again, this time succeeding. This is based on the implementation within this series. When supporting the new ioctl() requested by Jarkko there will be an additional ioctl() required before the loop. Reinette
On Wed, Mar 2, 2022 at 4:20 PM Reinette Chatre <reinette.chatre@intel.com> wrote: > > Hi Nathaniel, > > On 3/2/2022 8:57 AM, Nathaniel McCallum wrote: > > Perhaps it would be better for us to have a shared understanding on > > how the patches as posted are supposed to work in the most common > > cases? I'm thinking here of projects such as Enarx, Gramine and > > Occulum, which all have a similar process. Namely they execute an > > executable (called exec in the below chart) which has things like > > syscalls handled by a shim. These two components (shim and exec) are > > supported by a non-enclave userspace runtime. Given this common > > architectural pattern, this is how I understand adding pages via an > > exec call to mmap() to work. > > > > https://mermaid.live/edit#pako:eNp1k81qwzAQhF9F6NRCAu1Vh0BIRemhoeSHBuIettYmFpElVZZLQ8i7144sJ8aOT2bmY3d2vT7R1AikjBb4U6JO8UXC3kGeaFI9FpyXqbSgPTmg06j6uiu1lzn2jSKTA2XwD9NEB31uPBLzi-6iMpLnYB8Wn4-kOBYpKBW52iXj8WQSmzEy5Zvt01ewG5HUQN2UEc7nK77YPjdALd64GWih8NpkALGwR_JtzOGAaKXexyTKGEt2pgoMaXahgj5Qgk9nM_6xGvDDJpsmOyiVv0LB62B8un4dBDrLiLPeWciCL9fvvKVQizhSG6stFz9Df7sxUpcYitR-SodFO2A_Vw-7l4nzzduqjX9bKJxOHDDeBB3RHF0OUlS3faq1hPoMqzulrHoVGPZOE32u0NIK8MiF9MZRtgNV4IhC6c3yqFPKvCsxQs3_0VDnfzf-CPg > > > > This only covers adding RW pages. I haven't even tackled permission > > changes yet. Is that understanding correct? If not, please provide an > > alternative sequence diagram to explain how you expect this to be > > used. > > Please find my attempt linked below: > > https://mermaid.live/edit#pako:eNqFUsFqAjEQ_ZWQUwsK7XUPgthQeqiUVang9jAkoxu6m2yzWVsR_72J2WTbKnSOb97MvPeSI-VaIM1oix8dKo4PEnYG6kIRVw0YK7lsQFlSghGfYPCy845GYXWJm05ZWV8ZaEt55QB-IS9UwOfaItF7NGc0I3UNzU3-ekvaQ8uhqiLPd8l4PJnEYxmZsvXm7i20e5B4QlA5rAqMgJJfG9Ixg21X2ctVXn9GGJsvWb65729FSZXWDdlqpxx46Qzu-gB8-cHzhhim2zKdzdjLcuAAt3IPzv6Qkq84EdxGM3492UJS-cdSpLHp6nEgCPz3RjI5NPvAlRisJjspOsbWT8sUyc_MwjuynC1Wzyw9EB3RGk0NUrgvePRYQW2J7tNQd5sKDN5ooU6O2jXCiWZCWm1otoWqxRGFzurFQXGaWdNhJPXfuGedvgFejOuH > > The changes include: > * Move mmap() to occur before attempting EACCEPT on the addresses. This is > required for EACCEPT (as well as any subsequent access from within the enclave) > to be able to access the pages. > * Remove AEX[1] to the runtime within the loop. After EAUG returns execution > will return to the instruction pointer that triggered the #PF, EACCEPT, > this will cause the EACCEPT to be run again, this time succeeding. > > This is based on the implementation within this series. When supporting > the new ioctl() requested by Jarkko there will be an additional ioctl() > required before the loop. https://mermaid.live/edit/#pako:eNp1U9FqgzAU_ZWQpw1a2F6FFaQLYw8ro7asUPeQmWsNNYlL4rZS-u-LRmut1ie953jvOecmR5woBjjABr5LkAk8c7rTVMQSuYeWVslSfIH23wXVlie8oNKijGr2SzUMkT1oCfmwrktpuRj5wWRcDKvwB0ksfX2hLCD1A7quBkgIWtwtP-6ROZiE5nnLq1A0nc5m7bAAhWSzffj0cFNEFaEaGiBCFiuy3D42hKp4gWZUshy6ISOUL6X2e4CCy10rQhUW8dR52QESivGUJ9RyJQ2SAAyYZ_V6ndUSsnldneVca_bJdvY7lkf6vc4haTBlbsdbDmLoaLlSBUqVy5wmWW2nw3rq26Pg-oTzOXlf9Xkt7BfTeqjjSWlP2JWTlkrC9cutlmcLlUlxoRBkE3T9Mrq7KArd0UBPqFDGTpstI2OphSv-jf1cBukPJlmSaP1GXFs8wQK0oJy523Ws-DG2GTiJOHCvDLx3HMuTo5YFc1MJ41ZpHKQ0NzDB1fWLDjLBgdUltKTmhjas0z-kWy8L My comments below correspond to the arrow numbers in the diagram. 2. When the runtime receives the AEX, it doesn't have enough knowledge to know whether or not to ask the kernel for an mmap(). So it has to reenter the shim. 3. The shim has to handle the syscall instruction routing it to the enclave's memory management subsystem. 4. The shim has to do bookkeeping and decide if additional pages are even needed. If pages are already allocated, for example, it can skip directly to step 13. However, if modifications are needed, it will go to steps 5-12. 5-12. This is the part that represents new code from the kernel's perspective for SGX2. It is also in a performance critical path and should be evaluated with greater scrutiny. The number of context switches is O(2N + 4) for each new allocated block, where N is the number of pages: a context switch occurs at step 5, 6, 7, 8, 9/10 and 12. However, this can be reduced to O(4) for each new allocated block with a simple modification: https://mermaid.live/edit/#pako:eNqNk11rwyAUhv-KeLVBC9ttYIXQydjFymhaVmh24fSkkUbN1Gwrpf99pvlsk8G80nMez3l91SNmmgMOsIXPAhSDR0F3hspYIT9o4bQq5AeYap1T4wQTOVUOpdTwb2pgmNmDUZAN46ZQTsiRDTYVchiFH2CxquIL7QDpLzDnaICkpPnN8u0W2YNlNMsarsyi6XQ2a5oFKCSb7d17la6DqATKpgEiZLEiy-19DZTBXjalimfQNRlBPrTe7wFyoXaNCJ07JBJ_lh0gqblIBKNOaGWRAuDAK-qiVquWkM3zqpVzrblytjt-R2Va5yjR3h_K0nPrLleOaudFERKunzoIVE9Xj26VtZYbsEXmxgUOTP2_witfSTifk9fViMDzZPQuoij0V40eUK6tm9a3hqyjDq74P_zuH6V6aGRJovUL8WXxBEswkgruf8ux5GPsUvDvGQd-yiGhpS04ViePFjn3XQkXThscJDSzMMHld4oOiuHAmQIaqP5xNXX6BeBJIEk The interesting thing about this pattern is that this can be done for all page modification types except EMODT. For example, here's the same process for changing a mapping from RW to RX: https://mermaid.live/edit/#pako:eNqNk11rwyAUhv-KeLVBC9ttYIVCvdhFu5F0UGh24fSkkUbN1Gwrpf995jttMphXes7jOa-vesZMc8ABtvBZgGKwEvRgqIwV8oMWTqtCfoCp1zk1TjCRU-VQSg3_pgbGmSMYBdk4bgrlhJzYYFMhx1H4ARarOr7RDpD-AlNFAyQlze_C3T2yJ8tolrVcmUXz-WLRNgvQkuz2D-91ugmiEiibBoiQzZaE-8cGKIODbEoVz6BvMoF8aH08AuRCHVoROndIJP4sB0BSc5EIRp3QyiIFwIHX1FWtTi0hu-dtJ-dWc-1sf_yeyrTOUaK9P5SlVes-V45651URsn5ZvYY9BmqgbMB32jrTDdgic9MSR7b-X-ONs5U-MqGvmkxeRhQt_V2jJ5Rr6-bNtSHrqIMb_g_DhyepXxoJSfS2Jr4snmEJRlLB_Xc5l3yMXQr-QePATzkktHQFx-ri0SLnvivhwmmDg4RmFma4_E_RSTEcOFNACzVfrqEuvytQILY My point in this thread has always been that it is an anti-feature to presume that there is a need to treat EPC and VLA permissions separately. This is a performance sink and it optimizes for a use case which doesn't exist. Nobody actually wants there to be a mismatch between EPC and VLA permissions. So, besides EMODT, the only userspace interface we need is mmap()/mprotect()/munmap(). The kernel should either succeed the mmap()/mprotect()/munmap() syscall if the EPC permissions can be made compatible or should fail otherwise. Another interesting property arises from this flow. Since the EPC and VLA permissions are always synchronized from the perspective of userspace, in cases where the memory state between the kernel and the exec layer is roughly synchronous, bookkeeping in the shim can be implemented without any persistent memory between syscall handling events. So, for example, the shim can implement brk() and mmap()/munmap()/mprotect() with just two pointers: one to the break position and one to the lowest mmap(). It is true that this basically commits enclave authors to doing all EACCEPT calls immediately after modifications. But I suspect everyone will do this anyway since there is no efficient (read: performant) way for shims to handle page faults. So trying to do this lazily will just result in a huge decrease in performance.
Hi Nathaniel, On 3/2/2022 5:13 PM, Nathaniel McCallum wrote: > On Wed, Mar 2, 2022 at 4:20 PM Reinette Chatre > <reinette.chatre@intel.com> wrote: >> >> Hi Nathaniel, >> >> On 3/2/2022 8:57 AM, Nathaniel McCallum wrote: >>> Perhaps it would be better for us to have a shared understanding on >>> how the patches as posted are supposed to work in the most common >>> cases? I'm thinking here of projects such as Enarx, Gramine and >>> Occulum, which all have a similar process. Namely they execute an >>> executable (called exec in the below chart) which has things like >>> syscalls handled by a shim. These two components (shim and exec) are >>> supported by a non-enclave userspace runtime. Given this common >>> architectural pattern, this is how I understand adding pages via an >>> exec call to mmap() to work. >>> >>> https://mermaid.live/edit#pako:eNp1k81qwzAQhF9F6NRCAu1Vh0BIRemhoeSHBuIettYmFpElVZZLQ8i7144sJ8aOT2bmY3d2vT7R1AikjBb4U6JO8UXC3kGeaFI9FpyXqbSgPTmg06j6uiu1lzn2jSKTA2XwD9NEB31uPBLzi-6iMpLnYB8Wn4-kOBYpKBW52iXj8WQSmzEy5Zvt01ewG5HUQN2UEc7nK77YPjdALd64GWih8NpkALGwR_JtzOGAaKXexyTKGEt2pgoMaXahgj5Qgk9nM_6xGvDDJpsmOyiVv0LB62B8un4dBDrLiLPeWciCL9fvvKVQizhSG6stFz9Df7sxUpcYitR-SodFO2A_Vw-7l4nzzduqjX9bKJxOHDDeBB3RHF0OUlS3faq1hPoMqzulrHoVGPZOE32u0NIK8MiF9MZRtgNV4IhC6c3yqFPKvCsxQs3_0VDnfzf-CPg >>> >>> This only covers adding RW pages. I haven't even tackled permission >>> changes yet. Is that understanding correct? If not, please provide an >>> alternative sequence diagram to explain how you expect this to be >>> used. >> >> Please find my attempt linked below: >> >> https://mermaid.live/edit#pako:eNqFUsFqAjEQ_ZWQUwsK7XUPgthQeqiUVang9jAkoxu6m2yzWVsR_72J2WTbKnSOb97MvPeSI-VaIM1oix8dKo4PEnYG6kIRVw0YK7lsQFlSghGfYPCy845GYXWJm05ZWV8ZaEt55QB-IS9UwOfaItF7NGc0I3UNzU3-ekvaQ8uhqiLPd8l4PJnEYxmZsvXm7i20e5B4QlA5rAqMgJJfG9Ixg21X2ctVXn9GGJsvWb65729FSZXWDdlqpxx46Qzu-gB8-cHzhhim2zKdzdjLcuAAt3IPzv6Qkq84EdxGM3492UJS-cdSpLHp6nEgCPz3RjI5NPvAlRisJjspOsbWT8sUyc_MwjuynC1Wzyw9EB3RGk0NUrgvePRYQW2J7tNQd5sKDN5ooU6O2jXCiWZCWm1otoWqxRGFzurFQXGaWdNhJPXfuGedvgFejOuH >> >> The changes include: >> * Move mmap() to occur before attempting EACCEPT on the addresses. This is >> required for EACCEPT (as well as any subsequent access from within the enclave) >> to be able to access the pages. >> * Remove AEX[1] to the runtime within the loop. After EAUG returns execution >> will return to the instruction pointer that triggered the #PF, EACCEPT, >> this will cause the EACCEPT to be run again, this time succeeding. >> >> This is based on the implementation within this series. When supporting >> the new ioctl() requested by Jarkko there will be an additional ioctl() >> required before the loop. > > https://mermaid.live/edit/#pako:eNp1U9FqgzAU_ZWQpw1a2F6FFaQLYw8ro7asUPeQmWsNNYlL4rZS-u-LRmut1ie953jvOecmR5woBjjABr5LkAk8c7rTVMQSuYeWVslSfIH23wXVlie8oNKijGr2SzUMkT1oCfmwrktpuRj5wWRcDKvwB0ksfX2hLCD1A7quBkgIWtwtP-6ROZiE5nnLq1A0nc5m7bAAhWSzffj0cFNEFaEaGiBCFiuy3D42hKp4gWZUshy6ISOUL6X2e4CCy10rQhUW8dR52QESivGUJ9RyJQ2SAAyYZ_V6ndUSsnldneVca_bJdvY7lkf6vc4haTBlbsdbDmLoaLlSBUqVy5wmWW2nw3rq26Pg-oTzOXlf9Xkt7BfTeqjjSWlP2JWTlkrC9cutlmcLlUlxoRBkE3T9Mrq7KArd0UBPqFDGTpstI2OphSv-jf1cBukPJlmSaP1GXFs8wQK0oJy523Ws-DG2GTiJOHCvDLx3HMuTo5YFc1MJ41ZpHKQ0NzDB1fWLDjLBgdUltKTmhjas0z-kWy8L > > My comments below correspond to the arrow numbers in the diagram. > > 2. When the runtime receives the AEX, it doesn't have enough knowledge > to know whether or not to ask the kernel for an mmap(). So it has to > reenter the shim. > > 3. The shim has to handle the syscall instruction routing it to the > enclave's memory management subsystem. > > 4. The shim has to do bookkeeping and decide if additional pages are > even needed. If pages are already allocated, for example, it can skip > directly to step 13. However, if modifications are needed, it will go > to steps 5-12. > > 5-12. This is the part that represents new code from the kernel's > perspective for SGX2. It is also in a performance critical path and > should be evaluated with greater scrutiny. The number of context > switches is O(2N + 4) for each new allocated block, where N is the > number of pages: a context switch occurs at step 5, 6, 7, 8, 9/10 and > 12. However, this can be reduced to O(4) for each new allocated block > with a simple modification: > > https://mermaid.live/edit/#pako:eNqNk11rwyAUhv-KeLVBC9ttYIXQydjFymhaVmh24fSkkUbN1Gwrpf99pvlsk8G80nMez3l91SNmmgMOsIXPAhSDR0F3hspYIT9o4bQq5AeYap1T4wQTOVUOpdTwb2pgmNmDUZAN46ZQTsiRDTYVchiFH2CxquIL7QDpLzDnaICkpPnN8u0W2YNlNMsarsyi6XQ2a5oFKCSb7d17la6DqATKpgEiZLEiy-19DZTBXjalimfQNRlBPrTe7wFyoXaNCJ07JBJ_lh0gqblIBKNOaGWRAuDAK-qiVquWkM3zqpVzrblytjt-R2Va5yjR3h_K0nPrLleOaudFERKunzoIVE9Xj26VtZYbsEXmxgUOTP2_witfSTifk9fViMDzZPQuoij0V40eUK6tm9a3hqyjDq74P_zuH6V6aGRJovUL8WXxBEswkgruf8ux5GPsUvDvGQd-yiGhpS04ViePFjn3XQkXThscJDSzMMHld4oOiuHAmQIaqP5xNXX6BeBJIEk Your optimized proposal is possible in the current implementation as follows: https://mermaid.live/edit#pako:eNp1k11vgjAUhv_KSa-2RJPtlmQmxvViFzOLuMxEdlHbgzTSlrVlmzH-9xUBUWFclfc8nI-X0wPhRiCJiMOvEjXHZ8m2lqlEQ3hY6Y0u1QZt_V4w6yWXBdMeMmbFD7PYj-zQasz7ui21l2rgA5dJ1VfxF3mia31uPIL5RntSI1CKFXeLj3twe8dZnrdcFYXxeDJpi0Uwpav1w2cdbkSogKpoBJTOl3SxfmyASryIZkyLHLsiA8jGmN0OsZB62zZhCg8yDbNsEZQRMpWceWm0A40oUNTUVa5zt5SuXpbndm57rp3txu-oOnKd62ySRVfmfjhlz4YOy40pIDXBc8az0zhdbMAJOp3N6NuyY1A3o54Og-7F8TT8HHiCwjg_bnwG55nHG_4fhy5HqVeDLmj8_kpDWjIiCq1iUoT9PlR8QnyGYQNJFI4CU1bZQhJ9DGhZiFCVCumNJVHKcocjUl2AeK85ibwtsYWaO9JQxz-gBQs- You can think of that EACCEPT instruction similar to a current (SGX1) enclave memory read or write when the enclave page is not currently in the EPC, for example, if the enclave memory being accessed is swapped out and need to be decrypted and loaded back. Instead of ENCLS[ELDU] incorporated to load the enclave page back into EPC, ENCLS[EAUG] is incorporated to create a new EPC page. You can find an example of such a flow involving EACCEPT in the "augment_via_eaccept" test found in "[PATCH V2 21/32] selftests/sgx: Test two different SGX2 EAUG flows" > The interesting thing about this pattern is that this can be done for > all page modification types except EMODT. For example, here's the same > process for changing a mapping from RW to RX: > > https://mermaid.live/edit/#pako:eNqNk11rwyAUhv-KeLVBC9ttYIVCvdhFu5F0UGh24fSkkUbN1Gwrpf995jttMphXes7jOa-vesZMc8ABtvBZgGKwEvRgqIwV8oMWTqtCfoCp1zk1TjCRU-VQSg3_pgbGmSMYBdk4bgrlhJzYYFMhx1H4ARarOr7RDpD-AlNFAyQlze_C3T2yJ8tolrVcmUXz-WLRNgvQkuz2D-91ugmiEiibBoiQzZaE-8cGKIODbEoVz6BvMoF8aH08AuRCHVoROndIJP4sB0BSc5EIRp3QyiIFwIHX1FWtTi0hu-dtJ-dWc-1sf_yeyrTOUaK9P5SlVes-V45651URsn5ZvYY9BmqgbMB32jrTDdgic9MSR7b-X-ONs5U-MqGvmkxeRhQt_V2jJ5Rr6-bNtSHrqIMb_g_DhyepXxoJSfS2Jr4snmEJRlLB_Xc5l3yMXQr-QePATzkktHQFx-ri0SLnvivhwmmDg4RmFma4_E_RSTEcOFNACzVfrqEuvytQILY > > My point in this thread has always been that it is an anti-feature to > presume that there is a need to treat EPC and VLA permissions > separately. This is a performance sink and it optimizes for a use case > which doesn't exist. Nobody actually wants there to be a mismatch > between EPC and VLA permissions. I assume you mean VMA permissions. It is hard for me to trust the statement that nobody wants there to be a mismatch since VMA permissions being separate from EPC permissions is an intentional (as documented) and integral part of the current SGX ABI. Current SGX implementation explicitly checks for and supports VMA mappings with permissions different from EPC permissions. This SGX2 implementation follows and respects the current ABI and changing ABI cannot be taken lightly. Reinette
On Wed, Mar 02, 2022 at 08:13:55PM -0500, Nathaniel McCallum wrote: > On Wed, Mar 2, 2022 at 4:20 PM Reinette Chatre > <reinette.chatre@intel.com> wrote: > > > > Hi Nathaniel, > > > > On 3/2/2022 8:57 AM, Nathaniel McCallum wrote: > > > Perhaps it would be better for us to have a shared understanding on > > > how the patches as posted are supposed to work in the most common > > > cases? I'm thinking here of projects such as Enarx, Gramine and > > > Occulum, which all have a similar process. Namely they execute an > > > executable (called exec in the below chart) which has things like > > > syscalls handled by a shim. These two components (shim and exec) are > > > supported by a non-enclave userspace runtime. Given this common > > > architectural pattern, this is how I understand adding pages via an > > > exec call to mmap() to work. > > > > > > https://mermaid.live/edit#pako:eNp1k81qwzAQhF9F6NRCAu1Vh0BIRemhoeSHBuIettYmFpElVZZLQ8i7144sJ8aOT2bmY3d2vT7R1AikjBb4U6JO8UXC3kGeaFI9FpyXqbSgPTmg06j6uiu1lzn2jSKTA2XwD9NEB31uPBLzi-6iMpLnYB8Wn4-kOBYpKBW52iXj8WQSmzEy5Zvt01ewG5HUQN2UEc7nK77YPjdALd64GWih8NpkALGwR_JtzOGAaKXexyTKGEt2pgoMaXahgj5Qgk9nM_6xGvDDJpsmOyiVv0LB62B8un4dBDrLiLPeWciCL9fvvKVQizhSG6stFz9Df7sxUpcYitR-SodFO2A_Vw-7l4nzzduqjX9bKJxOHDDeBB3RHF0OUlS3faq1hPoMqzulrHoVGPZOE32u0NIK8MiF9MZRtgNV4IhC6c3yqFPKvCsxQs3_0VDnfzf-CPg > > > > > > This only covers adding RW pages. I haven't even tackled permission > > > changes yet. Is that understanding correct? If not, please provide an > > > alternative sequence diagram to explain how you expect this to be > > > used. > > > > Please find my attempt linked below: > > > > https://mermaid.live/edit#pako:eNqFUsFqAjEQ_ZWQUwsK7XUPgthQeqiUVang9jAkoxu6m2yzWVsR_72J2WTbKnSOb97MvPeSI-VaIM1oix8dKo4PEnYG6kIRVw0YK7lsQFlSghGfYPCy845GYXWJm05ZWV8ZaEt55QB-IS9UwOfaItF7NGc0I3UNzU3-ekvaQ8uhqiLPd8l4PJnEYxmZsvXm7i20e5B4QlA5rAqMgJJfG9Ixg21X2ctVXn9GGJsvWb65729FSZXWDdlqpxx46Qzu-gB8-cHzhhim2zKdzdjLcuAAt3IPzv6Qkq84EdxGM3492UJS-cdSpLHp6nEgCPz3RjI5NPvAlRisJjspOsbWT8sUyc_MwjuynC1Wzyw9EB3RGk0NUrgvePRYQW2J7tNQd5sKDN5ooU6O2jXCiWZCWm1otoWqxRGFzurFQXGaWdNhJPXfuGedvgFejOuH > > > > The changes include: > > * Move mmap() to occur before attempting EACCEPT on the addresses. This is > > required for EACCEPT (as well as any subsequent access from within the enclave) > > to be able to access the pages. > > * Remove AEX[1] to the runtime within the loop. After EAUG returns execution > > will return to the instruction pointer that triggered the #PF, EACCEPT, > > this will cause the EACCEPT to be run again, this time succeeding. > > > > This is based on the implementation within this series. When supporting > > the new ioctl() requested by Jarkko there will be an additional ioctl() > > required before the loop. > > https://mermaid.live/edit/#pako:eNp1U9FqgzAU_ZWQpw1a2F6FFaQLYw8ro7asUPeQmWsNNYlL4rZS-u-LRmut1ie953jvOecmR5woBjjABr5LkAk8c7rTVMQSuYeWVslSfIH23wXVlie8oNKijGr2SzUMkT1oCfmwrktpuRj5wWRcDKvwB0ksfX2hLCD1A7quBkgIWtwtP-6ROZiE5nnLq1A0nc5m7bAAhWSzffj0cFNEFaEaGiBCFiuy3D42hKp4gWZUshy6ISOUL6X2e4CCy10rQhUW8dR52QESivGUJ9RyJQ2SAAyYZ_V6ndUSsnldneVca_bJdvY7lkf6vc4haTBlbsdbDmLoaLlSBUqVy5wmWW2nw3rq26Pg-oTzOXlf9Xkt7BfTeqjjSWlP2JWTlkrC9cutlmcLlUlxoRBkE3T9Mrq7KArd0UBPqFDGTpstI2OphSv-jf1cBukPJlmSaP1GXFs8wQK0oJy523Ws-DG2GTiJOHCvDLx3HMuTo5YFc1MJ41ZpHKQ0NzDB1fWLDjLBgdUltKTmhjas0z-kWy8L > > My comments below correspond to the arrow numbers in the diagram. > > 2. When the runtime receives the AEX, it doesn't have enough knowledge > to know whether or not to ask the kernel for an mmap(). So it has to > reenter the shim. > > 3. The shim has to handle the syscall instruction routing it to the > enclave's memory management subsystem. > > 4. The shim has to do bookkeeping and decide if additional pages are > even needed. If pages are already allocated, for example, it can skip > directly to step 13. However, if modifications are needed, it will go > to steps 5-12. > > 5-12. This is the part that represents new code from the kernel's > perspective for SGX2. It is also in a performance critical path and > should be evaluated with greater scrutiny. The number of context > switches is O(2N + 4) for each new allocated block, where N is the > number of pages: a context switch occurs at step 5, 6, 7, 8, 9/10 and > 12. However, this can be reduced to O(4) for each new allocated block > with a simple modification: > > https://mermaid.live/edit/#pako:eNqNk11rwyAUhv-KeLVBC9ttYIXQydjFymhaVmh24fSkkUbN1Gwrpf99pvlsk8G80nMez3l91SNmmgMOsIXPAhSDR0F3hspYIT9o4bQq5AeYap1T4wQTOVUOpdTwb2pgmNmDUZAN46ZQTsiRDTYVchiFH2CxquIL7QDpLzDnaICkpPnN8u0W2YNlNMsarsyi6XQ2a5oFKCSb7d17la6DqATKpgEiZLEiy-19DZTBXjalimfQNRlBPrTe7wFyoXaNCJ07JBJ_lh0gqblIBKNOaGWRAuDAK-qiVquWkM3zqpVzrblytjt-R2Va5yjR3h_K0nPrLleOaudFERKunzoIVE9Xj26VtZYbsEXmxgUOTP2_witfSTifk9fViMDzZPQuoij0V40eUK6tm9a3hqyjDq74P_zuH6V6aGRJovUL8WXxBEswkgruf8ux5GPsUvDvGQd-yiGhpS04ViePFjn3XQkXThscJDSzMMHld4oOiuHAmQIaqP5xNXX6BeBJIEk > > The interesting thing about this pattern is that this can be done for > all page modification types except EMODT. For example, here's the same > process for changing a mapping from RW to RX: > > https://mermaid.live/edit/#pako:eNqNk11rwyAUhv-KeLVBC9ttYIVCvdhFu5F0UGh24fSkkUbN1Gwrpf995jttMphXes7jOa-vesZMc8ABtvBZgGKwEvRgqIwV8oMWTqtCfoCp1zk1TjCRU-VQSg3_pgbGmSMYBdk4bgrlhJzYYFMhx1H4ARarOr7RDpD-AlNFAyQlze_C3T2yJ8tolrVcmUXz-WLRNgvQkuz2D-91ugmiEiibBoiQzZaE-8cGKIODbEoVz6BvMoF8aH08AuRCHVoROndIJP4sB0BSc5EIRp3QyiIFwIHX1FWtTi0hu-dtJ-dWc-1sf_yeyrTOUaK9P5SlVes-V45651URsn5ZvYY9BmqgbMB32jrTDdgic9MSR7b-X-ONs5U-MqGvmkxeRhQt_V2jJ5Rr6-bNtSHrqIMb_g_DhyepXxoJSfS2Jr4snmEJRlLB_Xc5l3yMXQr-QePATzkktHQFx-ri0SLnvivhwmmDg4RmFma4_E_RSTEcOFNACzVfrqEuvytQILY > > My point in this thread has always been that it is an anti-feature to > presume that there is a need to treat EPC and VLA permissions > separately. This is a performance sink and it optimizes for a use case > which doesn't exist. Nobody actually wants there to be a mismatch > between EPC and VLA permissions. I would not touch pre-initialization, EADD'd pages. The reason is backwards compatibility. For post-initialization, options are still open. > So, besides EMODT, the only userspace interface we need is > mmap()/mprotect()/munmap(). The kernel should either succeed the > mmap()/mprotect()/munmap() syscall if the EPC permissions can be made > compatible or should fail otherwise. For mmap() it is the enclave who sets the permissions, not kernel, i.e. you get a half-broken mmap() implementation. Kernel does EAUG, enclave does EACCEPTCOPY. I think what you're asking is too simple to be true, and even if we could do it, it might limit possibilities to optimize user space, e.g. because there is two ENCLU leaf functions (EMODPE, EACCEPTCOPY) and one ENCLS leaf function (EMODPR), which can modify permissions. A kernel syscall is essentially something that can be fully serviced by the kernel. This is not such situation. The work is split. BR, Jarkko