[V2,00/32] x86/sgx and selftests/sgx: Support SGX2

Message ID	cover.1644274683.git.reinette.chatre@intel.com (mailing list archive)
Headers	show Return-Path: <linux-sgx-owner@kernel.org> From: Reinette Chatre <reinette.chatre@intel.com> To: dave.hansen@linux.intel.com, jarkko@kernel.org, tglx@linutronix.de, bp@alien8.de, luto@kernel.org, mingo@redhat.com, linux-sgx@vger.kernel.org, x86@kernel.org Cc: seanjc@google.com, kai.huang@intel.com, cathy.zhang@intel.com, cedric.xing@intel.com, haitao.huang@intel.com, mark.shanahan@intel.com, hpa@zytor.com, linux-kernel@vger.kernel.org Subject: [PATCH V2 00/32] x86/sgx and selftests/sgx: Support SGX2 Date: Mon, 7 Feb 2022 16:45:22 -0800 Message-Id: <cover.1644274683.git.reinette.chatre@intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	x86/sgx and selftests/sgx: Support SGX2 \| expand [V2,00/32] x86/sgx and selftests/sgx: Support SGX2 [V2,01/32] x86/sgx: Add short descriptions to ENCLS wrappers [V2,02/32] x86/sgx: Add wrapper for SGX2 EMODPR function [V2,03/32] x86/sgx: Add wrapper for SGX2 EMODT function [V2,04/32] x86/sgx: Add wrapper for SGX2 EAUG function [V2,05/32] Documentation/x86: Document SGX permission details [V2,06/32] x86/sgx: Support VMA permissions more relaxed than enclave permissions [V2,07/32] x86/sgx: Add pfn_mkwrite() handler for present PTEs [V2,08/32] x86/sgx: x86/sgx: Add sgx_encl_page->vm_run_prot_bits for dynamic permission changes [V2,09/32] x86/sgx: Export sgx_encl_ewb_cpumask() [V2,10/32] x86/sgx: Rename sgx_encl_ewb_cpumask() as sgx_encl_cpumask() [V2,11/32] x86/sgx: Move PTE zap code to new sgx_zap_enclave_ptes() [V2,12/32] x86/sgx: Make sgx_ipi_cb() available internally [V2,13/32] x86/sgx: Create utility to validate user provided offset and length [V2,14/32] x86/sgx: Keep record of SGX page type [V2,15/32] x86/sgx: Support relaxing of enclave page permissions [V2,16/32] x86/sgx: Support restricting of enclave page permissions [V2,17/32] selftests/sgx: Add test for EPCM permission changes [V2,18/32] selftests/sgx: Add test for TCS page permission changes [V2,19/32] x86/sgx: Support adding of pages to an initialized enclave [V2,20/32] x86/sgx: Tighten accessible memory range after enclave initialization [V2,21/32] selftests/sgx: Test two different SGX2 EAUG flows [V2,22/32] x86/sgx: Support modifying SGX page type [V2,23/32] x86/sgx: Support complete page removal [V2,24/32] Documentation/x86: Introduce enclave runtime management section [V2,25/32] selftests/sgx: Introduce dynamic entry point [V2,26/32] selftests/sgx: Introduce TCS initialization enclave operation [V2,27/32] selftests/sgx: Test complete changing of page type flow [V2,28/32] selftests/sgx: Test faulty enclave behavior [V2,29/32] selftests/sgx: Test invalid access to removed enclave page [V2,30/32] selftests/sgx: Test reclaiming of untouched page [V2,31/32] x86/sgx: Free up EPC pages directly to support large page ranges [V2,32/32] selftests/sgx: Page removal stress test

Reinette Chatre Feb. 8, 2022, 12:45 a.m. UTC

V1: https://lore.kernel.org/linux-sgx/cover.1638381245.git.reinette.chatre@intel.com/

Changes since V1 that directly impact user space:
- SGX2 permission changes changed from a single ioctl() named
  SGX_IOC_PAGE_MODP to two new ioctl()s:
  SGX_IOC_ENCLAVE_RELAX_PERMISSIONS and
  SGX_IOC_ENCLAVE_RESTRICT_PERMISSIONS, supported by two different
  parameter structures (SGX_IOC_ENCLAVE_RELAX_PERMISSIONS does
  not support a result output parameter) (Jarkko).

  User space flow impact: After user space runs ENCLU[EMODPE] it
  needs to call SGX_IOC_ENCLAVE_RELAX_PERMISSIONS to have PTEs
  updated. Previously running SGX_IOC_PAGE_MODP in this scenario
  resulted in EPCM.PR being set but calling
  SGX_IOC_ENCLAVE_RELAX_PERMISSIONS will not result in EPCM.PR
  being set anymore and thus no need for an additional
  ENCLU[EACCEPT].

- SGX_IOC_ENCLAVE_RELAX_PERMISSIONS and
  SGX_IOC_ENCLAVE_RESTRICT_PERMISSIONS
  obtain new permissions from secinfo as parameter instead of
  the permissions directly (Jarkko).

- ioctl() supporting SGX2 page type change is renamed from
  SGX_IOC_PAGE_MODT to SGX_IOC_ENCLAVE_MODIFY_TYPE (Jarkko).

- SGX_IOC_ENCLAVE_MODIFY_TYPE obtains new page type from secinfo
  as parameter instead of the page type directly (Jarkko).

- ioctl() supporting SGX2 page removal is renamed from
  SGX_IOC_PAGE_REMOVE to SGX_IOC_ENCLAVE_REMOVE_PAGES (Jarkko).

- All ioctl() parameter structures have been renamed as a result of the
  ioctl() renaming:
  SGX_IOC_ENCLAVE_RELAX_PERMISSIONS => struct sgx_enclave_relax_perm
  SGX_IOC_ENCLAVE_RESTRICT_PERMISSIONS => struct sgx_enclave_restrict_perm
  SGX_IOC_ENCLAVE_MODIFY_TYPE => struct sgx_enclave_modt
  SGX_IOC_ENCLAVE_REMOVE_PAGES => struct sgx_enclave_remove_pages

Changes since V1 that do not directly impact user space:
- Number of patches in series increased from 25 to 32 primarily because
  of splitting the original submission:
  - Wrappers for the new SGX2 functions are introduced in three separate
    patches replacing the original "x86/sgx: Add wrappers for SGX2
    functions"
    (Jarkko).
  - Moving and renaming sgx_encl_ewb_cpumask() is done with two patches
    replacing the original "x86/sgx: Use more generic name for enclave
    cpumask function" (Jarkko).
  - Support for SGX2 EPCM permission changes is split into two ioctls(),
    one for relaxing and one for restricting permissions, each introduced
    by a new patch replacing the original "x86/sgx: Support enclave page
    permission changes" (Jarkko).
  - Extracted code used by existing ioctls() for usage by new ioctl()s
    into a new utility in new patch "x86/sgx: Create utility to validate
    user provided offset and length" (Dave did not specifically ask for
    this but it addresses his review feedback).
  - Two new Documentation patches to support the SGX2 work
    ("Documentation/x86: Introduce enclave runtime management") and
    a dedicated section on the enclave permission management
    ("Documentation/x86: Document SGX permission details") (Andy).
- Most patches were reworked to improve the language by:
  * aiming to refer to exact item instead of English rephrasing (Jarkko).
  * use ioctl() instead of ioctl throughout (Dave).
  * Use "relaxed" instead of "exceed" when referring to permissions
    (Dave).
- Improved documentation with several additions to
  Documentation/x86/sgx.rst.
- Many smaller changes, please refer to individual patches.

Hi Everybody,

The current Linux kernel support for SGX includes support for SGX1 that
requires that an enclave be created with properties that accommodate all
usages over its (the enclave's) lifetime. This includes properties such
as permissions of enclave pages, the number of enclave pages, and the
number of threads supported by the enclave.

Consequences of this requirement to have the enclave be created to
accommodate all usages include:
* pages needing to support relocated code are required to have RWX
  permissions for their entire lifetime,
* an enclave needs to be created with the maximum stack and heap
  projected to be needed during the enclave's entire lifetime which
  can be longer than the processes running within it,
* an enclave needs to be created with support for the maximum number
  of threads projected to run in the enclave.

Since SGX1 a few more functions were introduced, collectively called
SGX2, that support modifications to an initialized enclave. Hardware
supporting these functions are already available as listed on
https://github.com/ayeks/SGX-hardware

This series adds support for SGX2, also referred to as Enclave Dynamic
Memory Management (EDMM). This includes:

* Support modifying permissions of regular enclave pages belonging to an
  initialized enclave. New permissions are not allowed to exceed the
  originally vetted permissions. For example, RX isn't allowed unless
  the page was originally added with RX or RWX.
  Modifying permissions is accomplished with two new ioctl()s:
  SGX_IOC_ENCLAVE_RELAX_PERMISSIONS and
  SGX_IOC_ENCLAVE_RESTRICT_PERMISSIONS.

* Support dynamic addition of regular enclave pages to an initialized
  enclave. Pages are added with RW permissions as their "originally
  vetted permissions" (see previous bullet) and thus not allowed to
  be made executable at this time. Enabling dynamically added pages
  to obtain executable permissions require integration with user space
  policy that is deferred until the core SGX2 enabling is complete.
  Pages are dynamically added to an initialized enclave from the SGX
  page fault handler.

* Support expanding an initialized enclave to accommodate more threads.
  More threads can be accommodated by an enclave with the addition of
  Thread Control Structure (TCS) pages that is done by changing the
  type of regular enclave pages to TCS pages using a new ioctl()
  SGX_IOC_ENCLAVE_MODIFY_TYPE.

* Support removing regular and TCS pages from an initialized enclave.
  Removing pages is accomplished in two stages as supported by two new
  ioctl()s SGX_IOC_ENCLAVE_MODIFY_TYPE (same ioctl() as mentioned in
  previous bullet) and SGX_IOC_ENCLAVE_REMOVE_PAGES.

* Tests covering all the new flows, some edge cases, and one
  comprehensive stress scenario.

No additional work is needed to support SGX2 in a virtualized
environment. The tests included in this series can also be run from
a guest and was tested with the recent QEMU release based on 6.2.0
that supports SGX.

Patches 1 to 14 prepares the existing code for SGX2 support by
introducing the SGX2 functions, making sure pages remain accessible
after their enclave permissions are changed, and tracking enclave page
types as well as runtime permissions as needed by SGX2.

Patches 15 through 32 are a mix of x86/sgx and selftests/sgx patches
that follow the format where first an SGX2 feature is
enabled and then followed by tests of the new feature and/or
tests of scenarios that combine SGX2 features enabled up to that point.

In two cases (patches 20 and 31) code in support of SGX2 is separated
out with detailed motivation to support the review.

This series is based on v5.17-rc2 with the following fixes additionally
applied:

"selftests/sgx: Remove extra newlines in test output"
 https://lore.kernel.org/linux-sgx/16317683a1822bbd44ab3ca48b60a9a217ac24de.1643754040.git.reinette.chatre@intel.com/
"selftests/sgx: Ensure enclave data available during debug print"
 https://lore.kernel.org/linux-sgx/eaaeeb9122916d831942fc8a3043c687137314c1.1643754040.git.reinette.chatre@intel.com/
"selftests/sgx: Do not attempt enclave build without valid enclave"
 https://lore.kernel.org/linux-sgx/4e4ea6d70c286c209964bec1e8d29ac8e692748b.1643754040.git.reinette.chatre@intel.com/
"selftests/sgx: Fix NULL-pointer-dereference upon early test failure"
 https://lore.kernel.org/linux-sgx/89824888783fd8e770bfc64530c7549650a41851.1643754040.git.reinette.chatre@intel.com/
"x86/sgx: Add poison handling to reclaimer"
 https://lore.kernel.org/linux-sgx/dcc95eb2aaefb042527ac50d0a50738c7c160dac.1643830353.git.reinette.chatre@intel.com/
"x86/sgx: Silence softlockup detection when releasing large enclaves"
 https://lore.kernel.org/linux-sgx/b5e9f218064aa76e3026f778e1ad0a1d823e3db8.1643133224.git.reinette.chatre@intel.com/

Your feedback will be greatly appreciated.

Regards,

Reinette

Reinette Chatre (32):
  x86/sgx: Add short descriptions to ENCLS wrappers
  x86/sgx: Add wrapper for SGX2 EMODPR function
  x86/sgx: Add wrapper for SGX2 EMODT function
  x86/sgx: Add wrapper for SGX2 EAUG function
  Documentation/x86: Document SGX permission details
  x86/sgx: Support VMA permissions more relaxed than enclave permissions
  x86/sgx: Add pfn_mkwrite() handler for present PTEs
  x86/sgx: x86/sgx: Add sgx_encl_page->vm_run_prot_bits for dynamic
    permission changes
  x86/sgx: Export sgx_encl_ewb_cpumask()
  x86/sgx: Rename sgx_encl_ewb_cpumask() as sgx_encl_cpumask()
  x86/sgx: Move PTE zap code to new sgx_zap_enclave_ptes()
  x86/sgx: Make sgx_ipi_cb() available internally
  x86/sgx: Create utility to validate user provided offset and length
  x86/sgx: Keep record of SGX page type
  x86/sgx: Support relaxing of enclave page permissions
  x86/sgx: Support restricting of enclave page permissions
  selftests/sgx: Add test for EPCM permission changes
  selftests/sgx: Add test for TCS page permission changes
  x86/sgx: Support adding of pages to an initialized enclave
  x86/sgx: Tighten accessible memory range after enclave initialization
  selftests/sgx: Test two different SGX2 EAUG flows
  x86/sgx: Support modifying SGX page type
  x86/sgx: Support complete page removal
  Documentation/x86: Introduce enclave runtime management section
  selftests/sgx: Introduce dynamic entry point
  selftests/sgx: Introduce TCS initialization enclave operation
  selftests/sgx: Test complete changing of page type flow
  selftests/sgx: Test faulty enclave behavior
  selftests/sgx: Test invalid access to removed enclave page
  selftests/sgx: Test reclaiming of untouched page
  x86/sgx: Free up EPC pages directly to support large page ranges
  selftests/sgx: Page removal stress test

 Documentation/x86/sgx.rst                     |   64 +-
 arch/x86/include/asm/sgx.h                    |    8 +
 arch/x86/include/uapi/asm/sgx.h               |   81 +
 arch/x86/kernel/cpu/sgx/encl.c                |  334 +++-
 arch/x86/kernel/cpu/sgx/encl.h                |   12 +-
 arch/x86/kernel/cpu/sgx/encls.h               |   33 +
 arch/x86/kernel/cpu/sgx/ioctl.c               |  831 ++++++++-
 arch/x86/kernel/cpu/sgx/main.c                |   70 +-
 arch/x86/kernel/cpu/sgx/sgx.h                 |    3 +
 tools/testing/selftests/sgx/defines.h         |   23 +
 tools/testing/selftests/sgx/load.c            |   41 +
 tools/testing/selftests/sgx/main.c            | 1484 +++++++++++++++++
 tools/testing/selftests/sgx/main.h            |    1 +
 tools/testing/selftests/sgx/test_encl.c       |   68 +
 .../selftests/sgx/test_encl_bootstrap.S       |    6 +
 15 files changed, 2963 insertions(+), 96 deletions(-)


base-commit: 26291c54e111ff6ba87a164d85d4a4e134b7315c
prerequisite-patch-id: 3c3908f1c3536cc04ba020fb3e81f51395b44223
prerequisite-patch-id: e860923423c3387cf6fdcceb2fa41dc5e9454ef4
prerequisite-patch-id: 986260c8bc4255eb61e2c4afa88d2b723e376423
prerequisite-patch-id: ba014a99fced2b57d5d9e2dfb9d80ddf4333c13e
prerequisite-patch-id: 65cbb72889b6353a5639b984615d12019136b270
prerequisite-patch-id: e3296a2f0345a77c8a7ca91f76697ae2e1dca21f

Nathaniel McCallum Feb. 22, 2022, 8:27 p.m. UTC | #1

1. This interface looks very odd to me. mmap() is the kernel interface
for changing user space memory maps. Why are we introducing a new
interface for this? You can just simply add a new mmap flag (i.e.
MAP_SGX_TCS*) and then figure out which SGX instructions to execute
based on the desired state of the memory maps. If you do this, none of
the following ioctls are needed:

* SGX_IOC_ENCLAVE_RELAX_PERMISSIONS
* SGX_IOC_ENCLAVE_RESTRICT_PERMISSIONS
* SGX_IOC_ENCLAVE_REMOVE_PAGES
* SGX_IOC_ENCLAVE_MODIFY_TYPE

It also means that languages don't have to grow support for all these
ioctls. Instead, they can just reuse the existing mmap() bindings with
the new flag. Also, multiple operations can be combined into a single
mmap() call, amortizing the changes over a single context switch.

2. Automatically adding pages with hard-coded permissions in a fault
handler seems like a really bad idea. How do you distinguish between
accesses which should result in an updated mapping and accesses that
should result in a fault? IMHO, all unmapped page accesses should
result in a page fault. mmap() should be called first to identify the
correct permissions for these pages. Then the page handler should be
updated to use the permissions from the mapping when backfilling
physical pages. If I understand correctly, this should also obviate
the need for the weird userspace callback to allow for execute
permissions.

3. Implementing as I've suggested also means that we can lock down an
enclave, for example - after code has been JITed, by closing the file
descriptor. Once the file descriptor used to create the enclave is
closed, no further mmap() can be performed on the enclave. Attempting
to do EACCEPT on an unmapped page will generate a page fault.

* - I'm aware that a new flag might be frowned upon. I see a few other options:
1. reuse an existing flag which doesn't make sense in this context
2. communicate the page type in the offset argument
3. keep SGX_IOC_ENCLAVE_MODIFY_TYPE

On Mon, Feb 7, 2022 at 8:07 PM Reinette Chatre
<reinette.chatre@intel.com> wrote:
>
> V1: https://lore.kernel.org/linux-sgx/cover.1638381245.git.reinette.chatre@intel.com/
>
> Changes since V1 that directly impact user space:
> - SGX2 permission changes changed from a single ioctl() named
>   SGX_IOC_PAGE_MODP to two new ioctl()s:
>   SGX_IOC_ENCLAVE_RELAX_PERMISSIONS and
>   SGX_IOC_ENCLAVE_RESTRICT_PERMISSIONS, supported by two different
>   parameter structures (SGX_IOC_ENCLAVE_RELAX_PERMISSIONS does
>   not support a result output parameter) (Jarkko).
>
>   User space flow impact: After user space runs ENCLU[EMODPE] it
>   needs to call SGX_IOC_ENCLAVE_RELAX_PERMISSIONS to have PTEs
>   updated. Previously running SGX_IOC_PAGE_MODP in this scenario
>   resulted in EPCM.PR being set but calling
>   SGX_IOC_ENCLAVE_RELAX_PERMISSIONS will not result in EPCM.PR
>   being set anymore and thus no need for an additional
>   ENCLU[EACCEPT].
>
> - SGX_IOC_ENCLAVE_RELAX_PERMISSIONS and
>   SGX_IOC_ENCLAVE_RESTRICT_PERMISSIONS
>   obtain new permissions from secinfo as parameter instead of
>   the permissions directly (Jarkko).
>
> - ioctl() supporting SGX2 page type change is renamed from
>   SGX_IOC_PAGE_MODT to SGX_IOC_ENCLAVE_MODIFY_TYPE (Jarkko).
>
> - SGX_IOC_ENCLAVE_MODIFY_TYPE obtains new page type from secinfo
>   as parameter instead of the page type directly (Jarkko).
>
> - ioctl() supporting SGX2 page removal is renamed from
>   SGX_IOC_PAGE_REMOVE to SGX_IOC_ENCLAVE_REMOVE_PAGES (Jarkko).
>
> - All ioctl() parameter structures have been renamed as a result of the
>   ioctl() renaming:
>   SGX_IOC_ENCLAVE_RELAX_PERMISSIONS => struct sgx_enclave_relax_perm
>   SGX_IOC_ENCLAVE_RESTRICT_PERMISSIONS => struct sgx_enclave_restrict_perm
>   SGX_IOC_ENCLAVE_MODIFY_TYPE => struct sgx_enclave_modt
>   SGX_IOC_ENCLAVE_REMOVE_PAGES => struct sgx_enclave_remove_pages
>
> Changes since V1 that do not directly impact user space:
> - Number of patches in series increased from 25 to 32 primarily because
>   of splitting the original submission:
>   - Wrappers for the new SGX2 functions are introduced in three separate
>     patches replacing the original "x86/sgx: Add wrappers for SGX2
>     functions"
>     (Jarkko).
>   - Moving and renaming sgx_encl_ewb_cpumask() is done with two patches
>     replacing the original "x86/sgx: Use more generic name for enclave
>     cpumask function" (Jarkko).
>   - Support for SGX2 EPCM permission changes is split into two ioctls(),
>     one for relaxing and one for restricting permissions, each introduced
>     by a new patch replacing the original "x86/sgx: Support enclave page
>     permission changes" (Jarkko).
>   - Extracted code used by existing ioctls() for usage by new ioctl()s
>     into a new utility in new patch "x86/sgx: Create utility to validate
>     user provided offset and length" (Dave did not specifically ask for
>     this but it addresses his review feedback).
>   - Two new Documentation patches to support the SGX2 work
>     ("Documentation/x86: Introduce enclave runtime management") and
>     a dedicated section on the enclave permission management
>     ("Documentation/x86: Document SGX permission details") (Andy).
> - Most patches were reworked to improve the language by:
>   * aiming to refer to exact item instead of English rephrasing (Jarkko).
>   * use ioctl() instead of ioctl throughout (Dave).
>   * Use "relaxed" instead of "exceed" when referring to permissions
>     (Dave).
> - Improved documentation with several additions to
>   Documentation/x86/sgx.rst.
> - Many smaller changes, please refer to individual patches.
>
> Hi Everybody,
>
> The current Linux kernel support for SGX includes support for SGX1 that
> requires that an enclave be created with properties that accommodate all
> usages over its (the enclave's) lifetime. This includes properties such
> as permissions of enclave pages, the number of enclave pages, and the
> number of threads supported by the enclave.
>
> Consequences of this requirement to have the enclave be created to
> accommodate all usages include:
> * pages needing to support relocated code are required to have RWX
>   permissions for their entire lifetime,
> * an enclave needs to be created with the maximum stack and heap
>   projected to be needed during the enclave's entire lifetime which
>   can be longer than the processes running within it,
> * an enclave needs to be created with support for the maximum number
>   of threads projected to run in the enclave.
>
> Since SGX1 a few more functions were introduced, collectively called
> SGX2, that support modifications to an initialized enclave. Hardware
> supporting these functions are already available as listed on
> https://github.com/ayeks/SGX-hardware
>
> This series adds support for SGX2, also referred to as Enclave Dynamic
> Memory Management (EDMM). This includes:
>
> * Support modifying permissions of regular enclave pages belonging to an
>   initialized enclave. New permissions are not allowed to exceed the
>   originally vetted permissions. For example, RX isn't allowed unless
>   the page was originally added with RX or RWX.
>   Modifying permissions is accomplished with two new ioctl()s:
>   SGX_IOC_ENCLAVE_RELAX_PERMISSIONS and
>   SGX_IOC_ENCLAVE_RESTRICT_PERMISSIONS.
>
> * Support dynamic addition of regular enclave pages to an initialized
>   enclave. Pages are added with RW permissions as their "originally
>   vetted permissions" (see previous bullet) and thus not allowed to
>   be made executable at this time. Enabling dynamically added pages
>   to obtain executable permissions require integration with user space
>   policy that is deferred until the core SGX2 enabling is complete.
>   Pages are dynamically added to an initialized enclave from the SGX
>   page fault handler.
>
> * Support expanding an initialized enclave to accommodate more threads.
>   More threads can be accommodated by an enclave with the addition of
>   Thread Control Structure (TCS) pages that is done by changing the
>   type of regular enclave pages to TCS pages using a new ioctl()
>   SGX_IOC_ENCLAVE_MODIFY_TYPE.
>
> * Support removing regular and TCS pages from an initialized enclave.
>   Removing pages is accomplished in two stages as supported by two new
>   ioctl()s SGX_IOC_ENCLAVE_MODIFY_TYPE (same ioctl() as mentioned in
>   previous bullet) and SGX_IOC_ENCLAVE_REMOVE_PAGES.
>
> * Tests covering all the new flows, some edge cases, and one
>   comprehensive stress scenario.
>
> No additional work is needed to support SGX2 in a virtualized
> environment. The tests included in this series can also be run from
> a guest and was tested with the recent QEMU release based on 6.2.0
> that supports SGX.
>
> Patches 1 to 14 prepares the existing code for SGX2 support by
> introducing the SGX2 functions, making sure pages remain accessible
> after their enclave permissions are changed, and tracking enclave page
> types as well as runtime permissions as needed by SGX2.
>
> Patches 15 through 32 are a mix of x86/sgx and selftests/sgx patches
> that follow the format where first an SGX2 feature is
> enabled and then followed by tests of the new feature and/or
> tests of scenarios that combine SGX2 features enabled up to that point.
>
> In two cases (patches 20 and 31) code in support of SGX2 is separated
> out with detailed motivation to support the review.
>
> This series is based on v5.17-rc2 with the following fixes additionally
> applied:
>
> "selftests/sgx: Remove extra newlines in test output"
>  https://lore.kernel.org/linux-sgx/16317683a1822bbd44ab3ca48b60a9a217ac24de.1643754040.git.reinette.chatre@intel.com/
> "selftests/sgx: Ensure enclave data available during debug print"
>  https://lore.kernel.org/linux-sgx/eaaeeb9122916d831942fc8a3043c687137314c1.1643754040.git.reinette.chatre@intel.com/
> "selftests/sgx: Do not attempt enclave build without valid enclave"
>  https://lore.kernel.org/linux-sgx/4e4ea6d70c286c209964bec1e8d29ac8e692748b.1643754040.git.reinette.chatre@intel.com/
> "selftests/sgx: Fix NULL-pointer-dereference upon early test failure"
>  https://lore.kernel.org/linux-sgx/89824888783fd8e770bfc64530c7549650a41851.1643754040.git.reinette.chatre@intel.com/
> "x86/sgx: Add poison handling to reclaimer"
>  https://lore.kernel.org/linux-sgx/dcc95eb2aaefb042527ac50d0a50738c7c160dac.1643830353.git.reinette.chatre@intel.com/
> "x86/sgx: Silence softlockup detection when releasing large enclaves"
>  https://lore.kernel.org/linux-sgx/b5e9f218064aa76e3026f778e1ad0a1d823e3db8.1643133224.git.reinette.chatre@intel.com/
>
> Your feedback will be greatly appreciated.
>
> Regards,
>
> Reinette
>
> Reinette Chatre (32):
>   x86/sgx: Add short descriptions to ENCLS wrappers
>   x86/sgx: Add wrapper for SGX2 EMODPR function
>   x86/sgx: Add wrapper for SGX2 EMODT function
>   x86/sgx: Add wrapper for SGX2 EAUG function
>   Documentation/x86: Document SGX permission details
>   x86/sgx: Support VMA permissions more relaxed than enclave permissions
>   x86/sgx: Add pfn_mkwrite() handler for present PTEs
>   x86/sgx: x86/sgx: Add sgx_encl_page->vm_run_prot_bits for dynamic
>     permission changes
>   x86/sgx: Export sgx_encl_ewb_cpumask()
>   x86/sgx: Rename sgx_encl_ewb_cpumask() as sgx_encl_cpumask()
>   x86/sgx: Move PTE zap code to new sgx_zap_enclave_ptes()
>   x86/sgx: Make sgx_ipi_cb() available internally
>   x86/sgx: Create utility to validate user provided offset and length
>   x86/sgx: Keep record of SGX page type
>   x86/sgx: Support relaxing of enclave page permissions
>   x86/sgx: Support restricting of enclave page permissions
>   selftests/sgx: Add test for EPCM permission changes
>   selftests/sgx: Add test for TCS page permission changes
>   x86/sgx: Support adding of pages to an initialized enclave
>   x86/sgx: Tighten accessible memory range after enclave initialization
>   selftests/sgx: Test two different SGX2 EAUG flows
>   x86/sgx: Support modifying SGX page type
>   x86/sgx: Support complete page removal
>   Documentation/x86: Introduce enclave runtime management section
>   selftests/sgx: Introduce dynamic entry point
>   selftests/sgx: Introduce TCS initialization enclave operation
>   selftests/sgx: Test complete changing of page type flow
>   selftests/sgx: Test faulty enclave behavior
>   selftests/sgx: Test invalid access to removed enclave page
>   selftests/sgx: Test reclaiming of untouched page
>   x86/sgx: Free up EPC pages directly to support large page ranges
>   selftests/sgx: Page removal stress test
>
>  Documentation/x86/sgx.rst                     |   64 +-
>  arch/x86/include/asm/sgx.h                    |    8 +
>  arch/x86/include/uapi/asm/sgx.h               |   81 +
>  arch/x86/kernel/cpu/sgx/encl.c                |  334 +++-
>  arch/x86/kernel/cpu/sgx/encl.h                |   12 +-
>  arch/x86/kernel/cpu/sgx/encls.h               |   33 +
>  arch/x86/kernel/cpu/sgx/ioctl.c               |  831 ++++++++-
>  arch/x86/kernel/cpu/sgx/main.c                |   70 +-
>  arch/x86/kernel/cpu/sgx/sgx.h                 |    3 +
>  tools/testing/selftests/sgx/defines.h         |   23 +
>  tools/testing/selftests/sgx/load.c            |   41 +
>  tools/testing/selftests/sgx/main.c            | 1484 +++++++++++++++++
>  tools/testing/selftests/sgx/main.h            |    1 +
>  tools/testing/selftests/sgx/test_encl.c       |   68 +
>  .../selftests/sgx/test_encl_bootstrap.S       |    6 +
>  15 files changed, 2963 insertions(+), 96 deletions(-)
>
>
> base-commit: 26291c54e111ff6ba87a164d85d4a4e134b7315c
> prerequisite-patch-id: 3c3908f1c3536cc04ba020fb3e81f51395b44223
> prerequisite-patch-id: e860923423c3387cf6fdcceb2fa41dc5e9454ef4
> prerequisite-patch-id: 986260c8bc4255eb61e2c4afa88d2b723e376423
> prerequisite-patch-id: ba014a99fced2b57d5d9e2dfb9d80ddf4333c13e
> prerequisite-patch-id: 65cbb72889b6353a5639b984615d12019136b270
> prerequisite-patch-id: e3296a2f0345a77c8a7ca91f76697ae2e1dca21f
> --
> 2.25.1
>

Reinette Chatre Feb. 22, 2022, 10:39 p.m. UTC | #2

Hi Nathaniel,

On 2/22/2022 12:27 PM, Nathaniel McCallum wrote:
> 1. This interface looks very odd to me. mmap() is the kernel interface
> for changing user space memory maps. Why are we introducing a new
> interface for this?

mmap() is the kernel interface used to create new mappings in the
virtual address space of the calling process. This is different from
the permissions and properties of the underlying file/memory being mapped.

A new interface is introduced because changes need to be made to the
permissions and properties of the underlying enclave. A new virtual
address space is not needed nor should existing VMAs be impacted.

This is similar to how mmap() is not used to change file permissions.

VMA permissions are separate from enclave page permissions as found in
the EPCM (Enclave Page Cache Map). The current implementation (SGX1) already
distinguishes between the VMA and EPCM permissions - for example, it is
already possible to create a read-only VMA from enclave pages that have
RW EPCM permissions. mmap() of a portion of EPC memory with a particular
permission does not imply that the underlying EPCM permissions (should)have
that permission. 

> You can just simply add a new mmap flag (i.e.
> MAP_SGX_TCS*) and then figure out which SGX instructions to execute
> based on the desired state of the memory maps. If you do this, none of
> the following ioctls are needed:
> 
> * SGX_IOC_ENCLAVE_RELAX_PERMISSIONS
> * SGX_IOC_ENCLAVE_RESTRICT_PERMISSIONS
> * SGX_IOC_ENCLAVE_REMOVE_PAGES
> * SGX_IOC_ENCLAVE_MODIFY_TYPE
> 
> It also means that languages don't have to grow support for all these
> ioctls. Instead, they can just reuse the existing mmap() bindings with
> the new flag. Also, multiple operations can be combined into a single
> mmap() call, amortizing the changes over a single context switch.
> 
> 2. Automatically adding pages with hard-coded permissions in a fault
> handler seems like a really bad idea.

Could you please elaborate why this is a bad idea?

> How do you distinguish between
> accesses which should result in an updated mapping and accesses that
> should result in a fault?

Accesses that should result in an updated mapping have two requirements:
(a) address accessed belongs to the enclave based on the address
    range specified during enclave create
(b) there is no backing enclave page for the address

> IMHO, all unmapped page accesses should
> result in a page fault. mmap() should be called first to identify the
> correct permissions for these pages.
> Then the page handler should be
> updated to use the permissions from the mapping when backfilling
> physical pages. If I understand correctly, this should also obviate

Regular enclave pages can _only_ be dynamically added with RW permission.

SGX2's support for adding regular pages to an enclave via the EAUG
instruction is architecturally set at RW. The OS cannot change those permissions
via the EAUG instruction nor can the OS do so with a different/additional
instruction because:
* the OS is not able to relax permissions since that can only be done from
within the enclave with ENCLU[EMODPE], thus it is not possible for the OS to
dynamically add pages via EAUG as RW and then relax permissions to RWX. 
* the OS is not able to EAUG a page and immediately attempt an EMODPR either
as Jarkko also recently inquired about:
https://lore.kernel.org/linux-sgx/80f3d7b9-e3d5-b2c0-7707-710bf6f5081e@intel.com/

> the need for the weird userspace callback to allow for execute
> permissions.

User policy integration would always be required to allow execute
permissions on a writable page. This is not expected to be a userspace
callback but instead integration with existing user policy subsystem(s).

> 
> 3. Implementing as I've suggested also means that we can lock down an
> enclave, for example - after code has been JITed, by closing the file
> descriptor. Once the file descriptor used to create the enclave is
> closed, no further mmap() can be performed on the enclave. Attempting
> to do EACCEPT on an unmapped page will generate a page fault.

This is not clear to me. If the file descriptor is closed and no further
mmap() is allowed then how would a process be able to enter the enclave
to execute code within it?

This series does indeed lock down the address range to ensure that it is
not possible to map memory that does not belong to the enclave after the
enclave is created. Please see:
https://lore.kernel.org/linux-sgx/1b833dbce6c937f71523f4aaf4b2181b9673519f.1644274683.git.reinette.chatre@intel.com/

> 
> * - I'm aware that a new flag might be frowned upon. I see a few other options:
> 1. reuse an existing flag which doesn't make sense in this context
> 2. communicate the page type in the offset argument
> 3. keep SGX_IOC_ENCLAVE_MODIFY_TYPE
> 

Reinette

Nathaniel McCallum Feb. 23, 2022, 1:24 p.m. UTC | #3

On Tue, Feb 22, 2022 at 5:39 PM Reinette Chatre
<reinette.chatre@intel.com> wrote:
>
> Hi Nathaniel,
>
> On 2/22/2022 12:27 PM, Nathaniel McCallum wrote:
> > 1. This interface looks very odd to me. mmap() is the kernel interface
> > for changing user space memory maps. Why are we introducing a new
> > interface for this?
>
> mmap() is the kernel interface used to create new mappings in the
> virtual address space of the calling process. This is different from
> the permissions and properties of the underlying file/memory being mapped.
>
> A new interface is introduced because changes need to be made to the
> permissions and properties of the underlying enclave. A new virtual
> address space is not needed nor should existing VMAs be impacted.
>
> This is similar to how mmap() is not used to change file permissions.
>
> VMA permissions are separate from enclave page permissions as found in
> the EPCM (Enclave Page Cache Map). The current implementation (SGX1) already
> distinguishes between the VMA and EPCM permissions - for example, it is
> already possible to create a read-only VMA from enclave pages that have
> RW EPCM permissions. mmap() of a portion of EPC memory with a particular
> permission does not imply that the underlying EPCM permissions (should)have
> that permission.

Yes. BUT... unlike the file permissions, this leaks an implementation detail.

The user process is governed by VMA permissions. And during enclave
creation, it had to mmap() all the enclave regions to their final VMA
permissions. So during enclave creation you have to use mmap() but
after enclave creation you use custom APIs? That's inconsistent at
best.

Forcing userspace to worry about the (mostly undocumented!)
interactions between EPC, PTE and VMA permissions makes these APIs
hard to use and difficult to reason about.

When I call SGX_IOC_ENCLAVE_RELAX_PERMISSIONS, do I also have to call
mmap() to update the VMA permissions to match? It isn't clear. Nor is
it really clear why I'm calling completely separate APIs.

> > You can just simply add a new mmap flag (i.e.
> > MAP_SGX_TCS*) and then figure out which SGX instructions to execute
> > based on the desired state of the memory maps. If you do this, none of
> > the following ioctls are needed:
> >
> > * SGX_IOC_ENCLAVE_RELAX_PERMISSIONS
> > * SGX_IOC_ENCLAVE_RESTRICT_PERMISSIONS
> > * SGX_IOC_ENCLAVE_REMOVE_PAGES
> > * SGX_IOC_ENCLAVE_MODIFY_TYPE
> >
> > It also means that languages don't have to grow support for all these
> > ioctls. Instead, they can just reuse the existing mmap() bindings with
> > the new flag. Also, multiple operations can be combined into a single
> > mmap() call, amortizing the changes over a single context switch.
> >
> > 2. Automatically adding pages with hard-coded permissions in a fault
> > handler seems like a really bad idea.
>
> Could you please elaborate why this is a bad idea?

Because implementations that miss this subtlety suddenly have pages
with magic permissions. Magic is bad. Explicit is good.

> > How do you distinguish between
> > accesses which should result in an updated mapping and accesses that
> > should result in a fault?
>
> Accesses that should result in an updated mapping have two requirements:
> (a) address accessed belongs to the enclave based on the address
>     range specified during enclave create
> (b) there is no backing enclave page for the address

What happens if the enclave is buggy? Or has been compromised. In both
of those cases, there should be a userspace visible fault and pages
should not be added.

> > IMHO, all unmapped page accesses should
> > result in a page fault. mmap() should be called first to identify the
> > correct permissions for these pages.
> > Then the page handler should be
> > updated to use the permissions from the mapping when backfilling
> > physical pages. If I understand correctly, this should also obviate
>
> Regular enclave pages can _only_ be dynamically added with RW permission.
>
> SGX2's support for adding regular pages to an enclave via the EAUG
> instruction is architecturally set at RW. The OS cannot change those permissions
> via the EAUG instruction nor can the OS do so with a different/additional
> instruction because:
> * the OS is not able to relax permissions since that can only be done from
> within the enclave with ENCLU[EMODPE], thus it is not possible for the OS to
> dynamically add pages via EAUG as RW and then relax permissions to RWX.
> * the OS is not able to EAUG a page and immediately attempt an EMODPR either
> as Jarkko also recently inquired about:
> https://lore.kernel.org/linux-sgx/80f3d7b9-e3d5-b2c0-7707-710bf6f5081e@intel.com/

This design looks... unfinished. EAUG takes a PAGEINFO in RBX, but
PAGEINFO.SECINFO must be zeroed and EAUG instead sets magic hard-coded
permissions. Why doesn't EAUG just respect the permissions in
PAGEINFO.SECINFO? We aren't told.

Further, if the enclave can do EMODPE, why does
SGX_IOC_ENCLAVE_RELAX_PERMISSIONS even exist? None of the
documentation explains what this ioctl even does. Does it update PTE
permissions? VMA permissions? Nobody knows without reading the source
code.

Userspace should not be bothered with the subtle details of the
interaction between EPC, PTE and VMA permissions. But this API does
everything it can do to expose all these details to userspace. And it
doesn't bother to document them (probably because it is hard). It
would be much better to avoid exposing these details to userspace.

IMHO, there should be a simple flow like this (if EAUG respects
PAGEINFO.SECINFO):

1. Non-enclave calls mmap()/munmap().
2. Enclave issues EACCEPT, if necessary.
3. Enclave issues EMODPE, if necessary.

Notice that in the second step above, during the mmap() call, the
kernel ensures that EPC, PTE and VMA are in sync and fails if they
cannot be made to be compatible. Also note that in the above flow EAUG
instructions can be efficiently batched.

Given the current poor state of the EAUG instruction, we might need to
do this flow instead:

1. Enclave issues EACCEPT, if necessary. (Add RW pages...)
2. Non-enclave calls mmap()/munmap().
3. Enclave issues EACCEPT, if necessary.
4. Enclave issues EMODPE, if necessary.

However, doing EAUG only via the page access handler means that there
is no way to batch EAUG instructions and this forces a context switch
for every page you want to add. This has to be terrible for
performance. Note specifically that the SDM calls out batching, which
is currently impossible under this patch set. 35.5.7 - "Page
allocation operations may be batched to improve efficiency."

As it stands today, if I want to add 256MiB of pages to an enclave,
I'll have to do 2^16 context switches. That doesn't seem scalable.

> > the need for the weird userspace callback to allow for execute
> > permissions.
>
> User policy integration would always be required to allow execute
> permissions on a writable page. This is not expected to be a userspace
> callback but instead integration with existing user policy subsystem(s).

Why? This isn't documented.

> > 3. Implementing as I've suggested also means that we can lock down an
> > enclave, for example - after code has been JITed, by closing the file
> > descriptor. Once the file descriptor used to create the enclave is
> > closed, no further mmap() can be performed on the enclave. Attempting
> > to do EACCEPT on an unmapped page will generate a page fault.
>
> This is not clear to me. If the file descriptor is closed and no further
> mmap() is allowed then how would a process be able to enter the enclave
> to execute code within it?

EENTER (or the vdso function) with the address of a TCS page, like
normal. In Enarx, we don't retain the enclave fd after the final
mmap() following EINIT. Everything works just fine.

> This series does indeed lock down the address range to ensure that it is
> not possible to map memory that does not belong to the enclave after the
> enclave is created. Please see:
> https://lore.kernel.org/linux-sgx/1b833dbce6c937f71523f4aaf4b2181b9673519f.1644274683.git.reinette.chatre@intel.com/

That's not what I'm talking about. I'm talking about a workflow like this:

1. Enclave initialization: ECREATE ... EINIT
2. EENTER
3. Enclave JITs some code (changes page permissions)
4. EEXIT
5. Close enclave fd.
6. EENTER
7. If an enclave attempts page modifications, a fault occurs.

Think of this similar to seccomp(). The enclave wants to do some
dynamic page table manipulation. But then it wants to lock down page
table modification so that, if compromised, attackers have no ability
to obtain RWX permissions.

Reinette Chatre Feb. 23, 2022, 6:25 p.m. UTC | #4

Hi Nathaniel,

On 2/23/2022 5:24 AM, Nathaniel McCallum wrote:
> On Tue, Feb 22, 2022 at 5:39 PM Reinette Chatre
> <reinette.chatre@intel.com> wrote:
>>
>> Hi Nathaniel,
>>
>> On 2/22/2022 12:27 PM, Nathaniel McCallum wrote:
>>> 1. This interface looks very odd to me. mmap() is the kernel interface
>>> for changing user space memory maps. Why are we introducing a new
>>> interface for this?
>>
>> mmap() is the kernel interface used to create new mappings in the
>> virtual address space of the calling process. This is different from
>> the permissions and properties of the underlying file/memory being mapped.
>>
>> A new interface is introduced because changes need to be made to the
>> permissions and properties of the underlying enclave. A new virtual
>> address space is not needed nor should existing VMAs be impacted.
>>
>> This is similar to how mmap() is not used to change file permissions.
>>
>> VMA permissions are separate from enclave page permissions as found in
>> the EPCM (Enclave Page Cache Map). The current implementation (SGX1) already
>> distinguishes between the VMA and EPCM permissions - for example, it is
>> already possible to create a read-only VMA from enclave pages that have
>> RW EPCM permissions. mmap() of a portion of EPC memory with a particular
>> permission does not imply that the underlying EPCM permissions (should)have
>> that permission.
> 
> Yes. BUT... unlike the file permissions, this leaks an implementation detail.

Not really - just like a RW file can be mapped read-only or RW, RW enclave
memory can be mapped read-only or RW.

> 
> The user process is governed by VMA permissions. And during enclave
> creation, it had to mmap() all the enclave regions to their final VMA
> permissions. So during enclave creation you have to use mmap() but
> after enclave creation you use custom APIs? That's inconsistent at
> best.

No. ioctl()s are consistently used to manage enclave memory.

The existing ioctls() SGX_IOC_ENCLAVE_CREATE, SGX_IOC_ENCLAVE_ADD_PAGES,
and SGX_IOC_ENCLAVE_INIT are used to set up to initialize the enclave memory.

The new ioctls() are used to manage enclave memory after enclave initialization.

The enclave memory is thus managed with a consistent interface.

mmap() is required before SGX_IOC_ENCLAVE_CREATE to obtain a base address
for the enclave that is required by the ioctl(). The rest of the ioctl()s,
existing and new, are consistent in interface by not requiring a memory
mapping but instead work from an offset from the base address.

> Forcing userspace to worry about the (mostly undocumented!)
> interactions between EPC, PTE and VMA permissions makes these APIs
> hard to use and difficult to reason about.

This is not new. The current SGX1 user space is already prevented from
creating a mapping of enclave memory that is more relaxed than the enclave
memory. For example, if the enclave memory has RW EPCM permissions then it
is not possible to mmap() that memory as RWX.

> 
> When I call SGX_IOC_ENCLAVE_RELAX_PERMISSIONS, do I also have to call
> mmap() to update the VMA permissions to match? It isn't clear. Nor is

mprotect() may be the better call to use.

> it really clear why I'm calling completely separate APIs.
> 
>>> You can just simply add a new mmap flag (i.e.
>>> MAP_SGX_TCS*) and then figure out which SGX instructions to execute
>>> based on the desired state of the memory maps. If you do this, none of
>>> the following ioctls are needed:
>>>
>>> * SGX_IOC_ENCLAVE_RELAX_PERMISSIONS
>>> * SGX_IOC_ENCLAVE_RESTRICT_PERMISSIONS
>>> * SGX_IOC_ENCLAVE_REMOVE_PAGES
>>> * SGX_IOC_ENCLAVE_MODIFY_TYPE
>>>
>>> It also means that languages don't have to grow support for all these
>>> ioctls. Instead, they can just reuse the existing mmap() bindings with
>>> the new flag. Also, multiple operations can be combined into a single
>>> mmap() call, amortizing the changes over a single context switch.
>>>
>>> 2. Automatically adding pages with hard-coded permissions in a fault
>>> handler seems like a really bad idea.
>>
>> Could you please elaborate why this is a bad idea?
> 
> Because implementations that miss this subtlety suddenly have pages
> with magic permissions. Magic is bad. Explicit is good.
> 

There is no magic. Any new pages have to be accepted by the enclave.
The enclave will not be able to access these pages unless explicitly
accepted, ENCLU[EACCEPT], from within the enclave.

>>> How do you distinguish between
>>> accesses which should result in an updated mapping and accesses that
>>> should result in a fault?
>>
>> Accesses that should result in an updated mapping have two requirements:
>> (a) address accessed belongs to the enclave based on the address
>>     range specified during enclave create
>> (b) there is no backing enclave page for the address
> 
> What happens if the enclave is buggy? Or has been compromised. In both
> of those cases, there should be a userspace visible fault and pages
> should not be added.

If user space accesses a memory address with a regular read/write that
results in a new page added then there is indeed a user space visible
fault. You can see this flow in action in the "augment" test case in
https://lore.kernel.org/linux-sgx/32c1116934a588bd3e6c174684e3e36a05c0a4d4.1644274683.git.reinette.chatre@intel.com/

If user space indeed wants the page after encountering such a fault then
it needs to enter the enclave again, from a different entry point, to
run ENCLU[EACCEPT], before it can return to the original entry point to
continue execution from the instruction that triggered the original read/write.

The only flow where a page is added without a user space visible fault
is when user space explicitly runs the ENCLU[EACCEPT] to do so.

> 
>>> IMHO, all unmapped page accesses should
>>> result in a page fault. mmap() should be called first to identify the
>>> correct permissions for these pages.
>>> Then the page handler should be
>>> updated to use the permissions from the mapping when backfilling
>>> physical pages. If I understand correctly, this should also obviate
>>
>> Regular enclave pages can _only_ be dynamically added with RW permission.
>>
>> SGX2's support for adding regular pages to an enclave via the EAUG
>> instruction is architecturally set at RW. The OS cannot change those permissions
>> via the EAUG instruction nor can the OS do so with a different/additional
>> instruction because:
>> * the OS is not able to relax permissions since that can only be done from
>> within the enclave with ENCLU[EMODPE], thus it is not possible for the OS to
>> dynamically add pages via EAUG as RW and then relax permissions to RWX.
>> * the OS is not able to EAUG a page and immediately attempt an EMODPR either
>> as Jarkko also recently inquired about:
>> https://lore.kernel.org/linux-sgx/80f3d7b9-e3d5-b2c0-7707-710bf6f5081e@intel.com/
> 
> This design looks... unfinished. EAUG takes a PAGEINFO in RBX, but
> PAGEINFO.SECINFO must be zeroed and EAUG instead sets magic hard-coded
> permissions. Why doesn't EAUG just respect the permissions in
> PAGEINFO.SECINFO? We aren't told.

This design is finished and respects the hardware specification. You can find
the details in the SDM's documentation of the EAUG function.

If the SECINFO field has a value then the hardware requires it to indicate
that it is a new shadow stack page being added, not a regular page. Support for
shadow stack pages is not in scope for this work. Attempting to dynamically
add a regular page with explicit permissions will result in a #GP(0).

The only way to add a regular enclave page is to make the SECINFO field empty
and doing so forces the page type to be a regular page and the permissions to
be RW.

> 
> Further, if the enclave can do EMODPE, why does
> SGX_IOC_ENCLAVE_RELAX_PERMISSIONS even exist? None of the
> documentation explains what this ioctl even does. Does it update PTE
> permissions? VMA permissions? Nobody knows without reading the source
> code.

Build the documentation (after applying this series) and it should
contain all the information you are searching for. As is the current custom
in the SGX documentation the built documentation pulls its content from
the kernel doc of the functions that implement the core of the 
user space interactions.

> 
> Userspace should not be bothered with the subtle details of the
> interaction between EPC, PTE and VMA permissions. But this API does
> everything it can do to expose all these details to userspace. And it
> doesn't bother to document them (probably because it is hard). It
> would be much better to avoid exposing these details to userspace.
> 
> IMHO, there should be a simple flow like this (if EAUG respects
> PAGEINFO.SECINFO):

EAUG does not respect PAGEINFO.SECINFO for regular pages.

> 
> 1. Non-enclave calls mmap()/munmap().
> 2. Enclave issues EACCEPT, if necessary.
> 3. Enclave issues EMODPE, if necessary.
> 
> Notice that in the second step above, during the mmap() call, the
> kernel ensures that EPC, PTE and VMA are in sync and fails if they
> cannot be made to be compatible. Also note that in the above flow EAUG
> instructions can be efficiently batched.
> 
> Given the current poor state of the EAUG instruction, we might need to
> do this flow instead:
> 
> 1. Enclave issues EACCEPT, if necessary. (Add RW pages...)
> 2. Non-enclave calls mmap()/munmap().
> 3. Enclave issues EACCEPT, if necessary.
> 4. Enclave issues EMODPE, if necessary.
> 
> However, doing EAUG only via the page access handler means that there
> is no way to batch EAUG instructions and this forces a context switch
> for every page you want to add. This has to be terrible for
> performance. Note specifically that the SDM calls out batching, which
> is currently impossible under this patch set. 35.5.7 - "Page
> allocation operations may be batched to improve efficiency."

These page functions are all per-page so it is not possible to add multiple
pages with a single instruction. It is indeed possible to pre-fault pages.

> As it stands today, if I want to add 256MiB of pages to an enclave,
> I'll have to do 2^16 context switches. That doesn't seem scalable.

No. Running ENCLU[EACCEPT] on each of the pages within that range should not
need any explicit context switch out of the enclave. See the "augment_via_eaccept" 
test case in:
https://lore.kernel.org/linux-sgx/32c1116934a588bd3e6c174684e3e36a05c0a4d4.1644274683.git.reinette.chatre@intel.com/

>>> the need for the weird userspace callback to allow for execute
>>> permissions.
>>
>> User policy integration would always be required to allow execute
>> permissions on a writable page. This is not expected to be a userspace
>> callback but instead integration with existing user policy subsystem(s).
> 
> Why? This isn't documented.

This is similar to the existing policies involved in managing the permissions
of mapped memory. When user space calls mprotect() to change permissions
of a mapped region then the kernel will not blindly allow the permissions but
instead ensure that it is allowed based on user policy by calling the LSM
(Linux Security Module) hooks.

You can learn more about LSM and various security modules at:
Documentation/security/lsm.rst
Documentation/admin-guide/LSM/*

You can compare what is needed here to what is currently done when user space
attempts to make some memory executable (see:
mm/mprotect.c:do_mprotect_key()->security_file_mprotect()). User policy needs
to help the kernel determine if this is allowed. For example, when SELinux is
the security module of choice then the process or file (depending on what type
of memory is being changed) needs to have a special permission (PROCESS__EXECHEAP,
PROCESS__EXECSTACK, or FILE__EXECMOD) assigned by user space to allow this.

Integration with user space policy is required for RWX of dynamically added pages
to be supported. In this series dynamically added pages will not be allowed to
be made executable, a follow-up series will add support for user policy
integration to support RWX permissions of dynamically added pages.

>>> 3. Implementing as I've suggested also means that we can lock down an
>>> enclave, for example - after code has been JITed, by closing the file
>>> descriptor. Once the file descriptor used to create the enclave is
>>> closed, no further mmap() can be performed on the enclave. Attempting
>>> to do EACCEPT on an unmapped page will generate a page fault.
>>
>> This is not clear to me. If the file descriptor is closed and no further
>> mmap() is allowed then how would a process be able to enter the enclave
>> to execute code within it?
> 
> EENTER (or the vdso function) with the address of a TCS page, like
> normal. In Enarx, we don't retain the enclave fd after the final
> mmap() following EINIT. Everything works just fine.

The OS fault handler is responsible for managing the PTEs that is required
for the enclave to be able to access the memory within the enclave.
The OS fault handler is attached to a VMA that is created with mmap(). 

> 
>> This series does indeed lock down the address range to ensure that it is
>> not possible to map memory that does not belong to the enclave after the
>> enclave is created. Please see:
>> https://lore.kernel.org/linux-sgx/1b833dbce6c937f71523f4aaf4b2181b9673519f.1644274683.git.reinette.chatre@intel.com/
> 
> That's not what I'm talking about. I'm talking about a workflow like this:
> 
> 1. Enclave initialization: ECREATE ... EINIT
> 2. EENTER
> 3. Enclave JITs some code (changes page permissions)
> 4. EEXIT
> 5. Close enclave fd.
> 6. EENTER
> 7. If an enclave attempts page modifications, a fault occurs.

The original fd that was created to obtain the enclave base address
may be closed at (5) but the executable and data portions of the enclave
still needs to be mapped afterwards to be able to have OS support for
managing the PTEs that the enclave depends on to access those pages.

> 
> Think of this similar to seccomp(). The enclave wants to do some
> dynamic page table manipulation. But then it wants to lock down page
> table modification so that, if compromised, attackers have no ability
> to obtain RWX permissions.

Reinette

Nathaniel McCallum March 2, 2022, 4:57 p.m. UTC | #5

Reinette,

Perhaps it would be better for us to have a shared understanding on
how the patches as posted are supposed to work in the most common
cases? I'm thinking here of projects such as Enarx, Gramine and
Occulum, which all have a similar process. Namely they execute an
executable (called exec in the below chart) which has things like
syscalls handled by a shim. These two components (shim and exec) are
supported by a non-enclave userspace runtime. Given this common
architectural pattern, this is how I understand adding pages via an
exec call to mmap() to work.

https://mermaid.live/edit#pako:eNp1k81qwzAQhF9F6NRCAu1Vh0BIRemhoeSHBuIettYmFpElVZZLQ8i7144sJ8aOT2bmY3d2vT7R1AikjBb4U6JO8UXC3kGeaFI9FpyXqbSgPTmg06j6uiu1lzn2jSKTA2XwD9NEB31uPBLzi-6iMpLnYB8Wn4-kOBYpKBW52iXj8WQSmzEy5Zvt01ewG5HUQN2UEc7nK77YPjdALd64GWih8NpkALGwR_JtzOGAaKXexyTKGEt2pgoMaXahgj5Qgk9nM_6xGvDDJpsmOyiVv0LB62B8un4dBDrLiLPeWciCL9fvvKVQizhSG6stFz9Df7sxUpcYitR-SodFO2A_Vw-7l4nzzduqjX9bKJxOHDDeBB3RHF0OUlS3faq1hPoMqzulrHoVGPZOE32u0NIK8MiF9MZRtgNV4IhC6c3yqFPKvCsxQs3_0VDnfzf-CPg

This only covers adding RW pages. I haven't even tackled permission
changes yet. Is that understanding correct? If not, please provide an
alternative sequence diagram to explain how you expect this to be
used.

On Wed, Feb 23, 2022 at 1:25 PM Reinette Chatre
<reinette.chatre@intel.com> wrote:
>
> Hi Nathaniel,
>
> On 2/23/2022 5:24 AM, Nathaniel McCallum wrote:
> > On Tue, Feb 22, 2022 at 5:39 PM Reinette Chatre
> > <reinette.chatre@intel.com> wrote:
> >>
> >> Hi Nathaniel,
> >>
> >> On 2/22/2022 12:27 PM, Nathaniel McCallum wrote:
> >>> 1. This interface looks very odd to me. mmap() is the kernel interface
> >>> for changing user space memory maps. Why are we introducing a new
> >>> interface for this?
> >>
> >> mmap() is the kernel interface used to create new mappings in the
> >> virtual address space of the calling process. This is different from
> >> the permissions and properties of the underlying file/memory being mapped.
> >>
> >> A new interface is introduced because changes need to be made to the
> >> permissions and properties of the underlying enclave. A new virtual
> >> address space is not needed nor should existing VMAs be impacted.
> >>
> >> This is similar to how mmap() is not used to change file permissions.
> >>
> >> VMA permissions are separate from enclave page permissions as found in
> >> the EPCM (Enclave Page Cache Map). The current implementation (SGX1) already
> >> distinguishes between the VMA and EPCM permissions - for example, it is
> >> already possible to create a read-only VMA from enclave pages that have
> >> RW EPCM permissions. mmap() of a portion of EPC memory with a particular
> >> permission does not imply that the underlying EPCM permissions (should)have
> >> that permission.
> >
> > Yes. BUT... unlike the file permissions, this leaks an implementation detail.
>
> Not really - just like a RW file can be mapped read-only or RW, RW enclave
> memory can be mapped read-only or RW.
>
> >
> > The user process is governed by VMA permissions. And during enclave
> > creation, it had to mmap() all the enclave regions to their final VMA
> > permissions. So during enclave creation you have to use mmap() but
> > after enclave creation you use custom APIs? That's inconsistent at
> > best.
>
> No. ioctl()s are consistently used to manage enclave memory.
>
> The existing ioctls() SGX_IOC_ENCLAVE_CREATE, SGX_IOC_ENCLAVE_ADD_PAGES,
> and SGX_IOC_ENCLAVE_INIT are used to set up to initialize the enclave memory.
>
> The new ioctls() are used to manage enclave memory after enclave initialization.
>
> The enclave memory is thus managed with a consistent interface.
>
> mmap() is required before SGX_IOC_ENCLAVE_CREATE to obtain a base address
> for the enclave that is required by the ioctl(). The rest of the ioctl()s,
> existing and new, are consistent in interface by not requiring a memory
> mapping but instead work from an offset from the base address.
>
> > Forcing userspace to worry about the (mostly undocumented!)
> > interactions between EPC, PTE and VMA permissions makes these APIs
> > hard to use and difficult to reason about.
>
> This is not new. The current SGX1 user space is already prevented from
> creating a mapping of enclave memory that is more relaxed than the enclave
> memory. For example, if the enclave memory has RW EPCM permissions then it
> is not possible to mmap() that memory as RWX.
>
> >
> > When I call SGX_IOC_ENCLAVE_RELAX_PERMISSIONS, do I also have to call
> > mmap() to update the VMA permissions to match? It isn't clear. Nor is
>
> mprotect() may be the better call to use.
>
> > it really clear why I'm calling completely separate APIs.
> >
> >>> You can just simply add a new mmap flag (i.e.
> >>> MAP_SGX_TCS*) and then figure out which SGX instructions to execute
> >>> based on the desired state of the memory maps. If you do this, none of
> >>> the following ioctls are needed:
> >>>
> >>> * SGX_IOC_ENCLAVE_RELAX_PERMISSIONS
> >>> * SGX_IOC_ENCLAVE_RESTRICT_PERMISSIONS
> >>> * SGX_IOC_ENCLAVE_REMOVE_PAGES
> >>> * SGX_IOC_ENCLAVE_MODIFY_TYPE
> >>>
> >>> It also means that languages don't have to grow support for all these
> >>> ioctls. Instead, they can just reuse the existing mmap() bindings with
> >>> the new flag. Also, multiple operations can be combined into a single
> >>> mmap() call, amortizing the changes over a single context switch.
> >>>
> >>> 2. Automatically adding pages with hard-coded permissions in a fault
> >>> handler seems like a really bad idea.
> >>
> >> Could you please elaborate why this is a bad idea?
> >
> > Because implementations that miss this subtlety suddenly have pages
> > with magic permissions. Magic is bad. Explicit is good.
> >
>
> There is no magic. Any new pages have to be accepted by the enclave.
> The enclave will not be able to access these pages unless explicitly
> accepted, ENCLU[EACCEPT], from within the enclave.
>
> >>> How do you distinguish between
> >>> accesses which should result in an updated mapping and accesses that
> >>> should result in a fault?
> >>
> >> Accesses that should result in an updated mapping have two requirements:
> >> (a) address accessed belongs to the enclave based on the address
> >>     range specified during enclave create
> >> (b) there is no backing enclave page for the address
> >
> > What happens if the enclave is buggy? Or has been compromised. In both
> > of those cases, there should be a userspace visible fault and pages
> > should not be added.
>
> If user space accesses a memory address with a regular read/write that
> results in a new page added then there is indeed a user space visible
> fault. You can see this flow in action in the "augment" test case in
> https://lore.kernel.org/linux-sgx/32c1116934a588bd3e6c174684e3e36a05c0a4d4.1644274683.git.reinette.chatre@intel.com/
>
> If user space indeed wants the page after encountering such a fault then
> it needs to enter the enclave again, from a different entry point, to
> run ENCLU[EACCEPT], before it can return to the original entry point to
> continue execution from the instruction that triggered the original read/write.
>
> The only flow where a page is added without a user space visible fault
> is when user space explicitly runs the ENCLU[EACCEPT] to do so.
>
> >
> >>> IMHO, all unmapped page accesses should
> >>> result in a page fault. mmap() should be called first to identify the
> >>> correct permissions for these pages.
> >>> Then the page handler should be
> >>> updated to use the permissions from the mapping when backfilling
> >>> physical pages. If I understand correctly, this should also obviate
> >>
> >> Regular enclave pages can _only_ be dynamically added with RW permission.
> >>
> >> SGX2's support for adding regular pages to an enclave via the EAUG
> >> instruction is architecturally set at RW. The OS cannot change those permissions
> >> via the EAUG instruction nor can the OS do so with a different/additional
> >> instruction because:
> >> * the OS is not able to relax permissions since that can only be done from
> >> within the enclave with ENCLU[EMODPE], thus it is not possible for the OS to
> >> dynamically add pages via EAUG as RW and then relax permissions to RWX.
> >> * the OS is not able to EAUG a page and immediately attempt an EMODPR either
> >> as Jarkko also recently inquired about:
> >> https://lore.kernel.org/linux-sgx/80f3d7b9-e3d5-b2c0-7707-710bf6f5081e@intel.com/
> >
> > This design looks... unfinished. EAUG takes a PAGEINFO in RBX, but
> > PAGEINFO.SECINFO must be zeroed and EAUG instead sets magic hard-coded
> > permissions. Why doesn't EAUG just respect the permissions in
> > PAGEINFO.SECINFO? We aren't told.
>
> This design is finished and respects the hardware specification. You can find
> the details in the SDM's documentation of the EAUG function.
>
> If the SECINFO field has a value then the hardware requires it to indicate
> that it is a new shadow stack page being added, not a regular page. Support for
> shadow stack pages is not in scope for this work. Attempting to dynamically
> add a regular page with explicit permissions will result in a #GP(0).
>
> The only way to add a regular enclave page is to make the SECINFO field empty
> and doing so forces the page type to be a regular page and the permissions to
> be RW.
>
> >
> > Further, if the enclave can do EMODPE, why does
> > SGX_IOC_ENCLAVE_RELAX_PERMISSIONS even exist? None of the
> > documentation explains what this ioctl even does. Does it update PTE
> > permissions? VMA permissions? Nobody knows without reading the source
> > code.
>
> Build the documentation (after applying this series) and it should
> contain all the information you are searching for. As is the current custom
> in the SGX documentation the built documentation pulls its content from
> the kernel doc of the functions that implement the core of the
> user space interactions.
>
> >
> > Userspace should not be bothered with the subtle details of the
> > interaction between EPC, PTE and VMA permissions. But this API does
> > everything it can do to expose all these details to userspace. And it
> > doesn't bother to document them (probably because it is hard). It
> > would be much better to avoid exposing these details to userspace.
> >
> > IMHO, there should be a simple flow like this (if EAUG respects
> > PAGEINFO.SECINFO):
>
> EAUG does not respect PAGEINFO.SECINFO for regular pages.
>
> >
> > 1. Non-enclave calls mmap()/munmap().
> > 2. Enclave issues EACCEPT, if necessary.
> > 3. Enclave issues EMODPE, if necessary.
> >
> > Notice that in the second step above, during the mmap() call, the
> > kernel ensures that EPC, PTE and VMA are in sync and fails if they
> > cannot be made to be compatible. Also note that in the above flow EAUG
> > instructions can be efficiently batched.
> >
> > Given the current poor state of the EAUG instruction, we might need to
> > do this flow instead:
> >
> > 1. Enclave issues EACCEPT, if necessary. (Add RW pages...)
> > 2. Non-enclave calls mmap()/munmap().
> > 3. Enclave issues EACCEPT, if necessary.
> > 4. Enclave issues EMODPE, if necessary.
> >
> > However, doing EAUG only via the page access handler means that there
> > is no way to batch EAUG instructions and this forces a context switch
> > for every page you want to add. This has to be terrible for
> > performance. Note specifically that the SDM calls out batching, which
> > is currently impossible under this patch set. 35.5.7 - "Page
> > allocation operations may be batched to improve efficiency."
>
> These page functions are all per-page so it is not possible to add multiple
> pages with a single instruction. It is indeed possible to pre-fault pages.
>
> > As it stands today, if I want to add 256MiB of pages to an enclave,
> > I'll have to do 2^16 context switches. That doesn't seem scalable.
>
> No. Running ENCLU[EACCEPT] on each of the pages within that range should not
> need any explicit context switch out of the enclave. See the "augment_via_eaccept"
> test case in:
> https://lore.kernel.org/linux-sgx/32c1116934a588bd3e6c174684e3e36a05c0a4d4.1644274683.git.reinette.chatre@intel.com/
>
>
> >>> the need for the weird userspace callback to allow for execute
> >>> permissions.
> >>
> >> User policy integration would always be required to allow execute
> >> permissions on a writable page. This is not expected to be a userspace
> >> callback but instead integration with existing user policy subsystem(s).
> >
> > Why? This isn't documented.
>
> This is similar to the existing policies involved in managing the permissions
> of mapped memory. When user space calls mprotect() to change permissions
> of a mapped region then the kernel will not blindly allow the permissions but
> instead ensure that it is allowed based on user policy by calling the LSM
> (Linux Security Module) hooks.
>
> You can learn more about LSM and various security modules at:
> Documentation/security/lsm.rst
> Documentation/admin-guide/LSM/*
>
> You can compare what is needed here to what is currently done when user space
> attempts to make some memory executable (see:
> mm/mprotect.c:do_mprotect_key()->security_file_mprotect()). User policy needs
> to help the kernel determine if this is allowed. For example, when SELinux is
> the security module of choice then the process or file (depending on what type
> of memory is being changed) needs to have a special permission (PROCESS__EXECHEAP,
> PROCESS__EXECSTACK, or FILE__EXECMOD) assigned by user space to allow this.
>
> Integration with user space policy is required for RWX of dynamically added pages
> to be supported. In this series dynamically added pages will not be allowed to
> be made executable, a follow-up series will add support for user policy
> integration to support RWX permissions of dynamically added pages.
>
> >>> 3. Implementing as I've suggested also means that we can lock down an
> >>> enclave, for example - after code has been JITed, by closing the file
> >>> descriptor. Once the file descriptor used to create the enclave is
> >>> closed, no further mmap() can be performed on the enclave. Attempting
> >>> to do EACCEPT on an unmapped page will generate a page fault.
> >>
> >> This is not clear to me. If the file descriptor is closed and no further
> >> mmap() is allowed then how would a process be able to enter the enclave
> >> to execute code within it?
> >
> > EENTER (or the vdso function) with the address of a TCS page, like
> > normal. In Enarx, we don't retain the enclave fd after the final
> > mmap() following EINIT. Everything works just fine.
>
> The OS fault handler is responsible for managing the PTEs that is required
> for the enclave to be able to access the memory within the enclave.
> The OS fault handler is attached to a VMA that is created with mmap().
>
> >
> >> This series does indeed lock down the address range to ensure that it is
> >> not possible to map memory that does not belong to the enclave after the
> >> enclave is created. Please see:
> >> https://lore.kernel.org/linux-sgx/1b833dbce6c937f71523f4aaf4b2181b9673519f.1644274683.git.reinette.chatre@intel.com/
> >
> > That's not what I'm talking about. I'm talking about a workflow like this:
> >
> > 1. Enclave initialization: ECREATE ... EINIT
> > 2. EENTER
> > 3. Enclave JITs some code (changes page permissions)
> > 4. EEXIT
> > 5. Close enclave fd.
> > 6. EENTER
> > 7. If an enclave attempts page modifications, a fault occurs.
>
> The original fd that was created to obtain the enclave base address
> may be closed at (5) but the executable and data portions of the enclave
> still needs to be mapped afterwards to be able to have OS support for
> managing the PTEs that the enclave depends on to access those pages.
>
> >
> > Think of this similar to seccomp(). The enclave wants to do some
> > dynamic page table manipulation. But then it wants to lock down page
> > table modification so that, if compromised, attackers have no ability
> > to obtain RWX permissions.
>
> Reinette

Reinette Chatre March 2, 2022, 9:20 p.m. UTC | #6

Hi Nathaniel,

On 3/2/2022 8:57 AM, Nathaniel McCallum wrote:
> Perhaps it would be better for us to have a shared understanding on
> how the patches as posted are supposed to work in the most common
> cases? I'm thinking here of projects such as Enarx, Gramine and
> Occulum, which all have a similar process. Namely they execute an
> executable (called exec in the below chart) which has things like
> syscalls handled by a shim. These two components (shim and exec) are
> supported by a non-enclave userspace runtime. Given this common
> architectural pattern, this is how I understand adding pages via an
> exec call to mmap() to work.
> 
> https://mermaid.live/edit#pako:eNp1k81qwzAQhF9F6NRCAu1Vh0BIRemhoeSHBuIettYmFpElVZZLQ8i7144sJ8aOT2bmY3d2vT7R1AikjBb4U6JO8UXC3kGeaFI9FpyXqbSgPTmg06j6uiu1lzn2jSKTA2XwD9NEB31uPBLzi-6iMpLnYB8Wn4-kOBYpKBW52iXj8WQSmzEy5Zvt01ewG5HUQN2UEc7nK77YPjdALd64GWih8NpkALGwR_JtzOGAaKXexyTKGEt2pgoMaXahgj5Qgk9nM_6xGvDDJpsmOyiVv0LB62B8un4dBDrLiLPeWciCL9fvvKVQizhSG6stFz9Df7sxUpcYitR-SodFO2A_Vw-7l4nzzduqjX9bKJxOHDDeBB3RHF0OUlS3faq1hPoMqzulrHoVGPZOE32u0NIK8MiF9MZRtgNV4IhC6c3yqFPKvCsxQs3_0VDnfzf-CPg
> 
> This only covers adding RW pages. I haven't even tackled permission
> changes yet. Is that understanding correct? If not, please provide an
> alternative sequence diagram to explain how you expect this to be
> used.

Please find my attempt linked below:

https://mermaid.live/edit#pako:eNqFUsFqAjEQ_ZWQUwsK7XUPgthQeqiUVang9jAkoxu6m2yzWVsR_72J2WTbKnSOb97MvPeSI-VaIM1oix8dKo4PEnYG6kIRVw0YK7lsQFlSghGfYPCy845GYXWJm05ZWV8ZaEt55QB-IS9UwOfaItF7NGc0I3UNzU3-ekvaQ8uhqiLPd8l4PJnEYxmZsvXm7i20e5B4QlA5rAqMgJJfG9Ixg21X2ctVXn9GGJsvWb65729FSZXWDdlqpxx46Qzu-gB8-cHzhhim2zKdzdjLcuAAt3IPzv6Qkq84EdxGM3492UJS-cdSpLHp6nEgCPz3RjI5NPvAlRisJjspOsbWT8sUyc_MwjuynC1Wzyw9EB3RGk0NUrgvePRYQW2J7tNQd5sKDN5ooU6O2jXCiWZCWm1otoWqxRGFzurFQXGaWdNhJPXfuGedvgFejOuH

The changes include:
* Move mmap() to occur before attempting EACCEPT on the addresses. This is
  required for EACCEPT (as well as any subsequent access from within the enclave)
  to be able to access the pages.
* Remove AEX[1] to the runtime within the loop. After EAUG returns execution
  will return to the instruction pointer that triggered the #PF, EACCEPT,
  this will cause the EACCEPT to be run again, this time succeeding.

This is based on the implementation within this series. When supporting
the new ioctl() requested by Jarkko there will be an additional ioctl()
required before the loop.

Reinette

Nathaniel McCallum March 3, 2022, 1:13 a.m. UTC | #7

On Wed, Mar 2, 2022 at 4:20 PM Reinette Chatre
<reinette.chatre@intel.com> wrote:
>
> Hi Nathaniel,
>
> On 3/2/2022 8:57 AM, Nathaniel McCallum wrote:
> > Perhaps it would be better for us to have a shared understanding on
> > how the patches as posted are supposed to work in the most common
> > cases? I'm thinking here of projects such as Enarx, Gramine and
> > Occulum, which all have a similar process. Namely they execute an
> > executable (called exec in the below chart) which has things like
> > syscalls handled by a shim. These two components (shim and exec) are
> > supported by a non-enclave userspace runtime. Given this common
> > architectural pattern, this is how I understand adding pages via an
> > exec call to mmap() to work.
> >
> > https://mermaid.live/edit#pako:eNp1k81qwzAQhF9F6NRCAu1Vh0BIRemhoeSHBuIettYmFpElVZZLQ8i7144sJ8aOT2bmY3d2vT7R1AikjBb4U6JO8UXC3kGeaFI9FpyXqbSgPTmg06j6uiu1lzn2jSKTA2XwD9NEB31uPBLzi-6iMpLnYB8Wn4-kOBYpKBW52iXj8WQSmzEy5Zvt01ewG5HUQN2UEc7nK77YPjdALd64GWih8NpkALGwR_JtzOGAaKXexyTKGEt2pgoMaXahgj5Qgk9nM_6xGvDDJpsmOyiVv0LB62B8un4dBDrLiLPeWciCL9fvvKVQizhSG6stFz9Df7sxUpcYitR-SodFO2A_Vw-7l4nzzduqjX9bKJxOHDDeBB3RHF0OUlS3faq1hPoMqzulrHoVGPZOE32u0NIK8MiF9MZRtgNV4IhC6c3yqFPKvCsxQs3_0VDnfzf-CPg
> >
> > This only covers adding RW pages. I haven't even tackled permission
> > changes yet. Is that understanding correct? If not, please provide an
> > alternative sequence diagram to explain how you expect this to be
> > used.
>
> Please find my attempt linked below:
>
> https://mermaid.live/edit#pako:eNqFUsFqAjEQ_ZWQUwsK7XUPgthQeqiUVang9jAkoxu6m2yzWVsR_72J2WTbKnSOb97MvPeSI-VaIM1oix8dKo4PEnYG6kIRVw0YK7lsQFlSghGfYPCy845GYXWJm05ZWV8ZaEt55QB-IS9UwOfaItF7NGc0I3UNzU3-ekvaQ8uhqiLPd8l4PJnEYxmZsvXm7i20e5B4QlA5rAqMgJJfG9Ixg21X2ctVXn9GGJsvWb65729FSZXWDdlqpxx46Qzu-gB8-cHzhhim2zKdzdjLcuAAt3IPzv6Qkq84EdxGM3492UJS-cdSpLHp6nEgCPz3RjI5NPvAlRisJjspOsbWT8sUyc_MwjuynC1Wzyw9EB3RGk0NUrgvePRYQW2J7tNQd5sKDN5ooU6O2jXCiWZCWm1otoWqxRGFzurFQXGaWdNhJPXfuGedvgFejOuH
>
> The changes include:
> * Move mmap() to occur before attempting EACCEPT on the addresses. This is
>   required for EACCEPT (as well as any subsequent access from within the enclave)
>   to be able to access the pages.
> * Remove AEX[1] to the runtime within the loop. After EAUG returns execution
>   will return to the instruction pointer that triggered the #PF, EACCEPT,
>   this will cause the EACCEPT to be run again, this time succeeding.
>
> This is based on the implementation within this series. When supporting
> the new ioctl() requested by Jarkko there will be an additional ioctl()
> required before the loop.

https://mermaid.live/edit/#pako:eNp1U9FqgzAU_ZWQpw1a2F6FFaQLYw8ro7asUPeQmWsNNYlL4rZS-u-LRmut1ie953jvOecmR5woBjjABr5LkAk8c7rTVMQSuYeWVslSfIH23wXVlie8oNKijGr2SzUMkT1oCfmwrktpuRj5wWRcDKvwB0ksfX2hLCD1A7quBkgIWtwtP-6ROZiE5nnLq1A0nc5m7bAAhWSzffj0cFNEFaEaGiBCFiuy3D42hKp4gWZUshy6ISOUL6X2e4CCy10rQhUW8dR52QESivGUJ9RyJQ2SAAyYZ_V6ndUSsnldneVca_bJdvY7lkf6vc4haTBlbsdbDmLoaLlSBUqVy5wmWW2nw3rq26Pg-oTzOXlf9Xkt7BfTeqjjSWlP2JWTlkrC9cutlmcLlUlxoRBkE3T9Mrq7KArd0UBPqFDGTpstI2OphSv-jf1cBukPJlmSaP1GXFs8wQK0oJy523Ws-DG2GTiJOHCvDLx3HMuTo5YFc1MJ41ZpHKQ0NzDB1fWLDjLBgdUltKTmhjas0z-kWy8L

My comments below correspond to the arrow numbers in the diagram.

2. When the runtime receives the AEX, it doesn't have enough knowledge
to know whether or not to ask the kernel for an mmap(). So it has to
reenter the shim.

3. The shim has to handle the syscall instruction routing it to the
enclave's memory management subsystem.

4. The shim has to do bookkeeping and decide if additional pages are
even needed. If pages are already allocated, for example, it can skip
directly to step 13. However, if modifications are needed, it will go
to steps 5-12.

5-12. This is the part that represents new code from the kernel's
perspective for SGX2. It is also in a performance critical path and
should be evaluated with greater scrutiny. The number of context
switches is O(2N + 4) for each new allocated block, where N is the
number of pages: a context switch occurs at step 5, 6, 7,  8, 9/10 and
12. However, this can be reduced to O(4) for each new allocated block
with a simple modification:

https://mermaid.live/edit/#pako:eNqNk11rwyAUhv-KeLVBC9ttYIXQydjFymhaVmh24fSkkUbN1Gwrpf99pvlsk8G80nMez3l91SNmmgMOsIXPAhSDR0F3hspYIT9o4bQq5AeYap1T4wQTOVUOpdTwb2pgmNmDUZAN46ZQTsiRDTYVchiFH2CxquIL7QDpLzDnaICkpPnN8u0W2YNlNMsarsyi6XQ2a5oFKCSb7d17la6DqATKpgEiZLEiy-19DZTBXjalimfQNRlBPrTe7wFyoXaNCJ07JBJ_lh0gqblIBKNOaGWRAuDAK-qiVquWkM3zqpVzrblytjt-R2Va5yjR3h_K0nPrLleOaudFERKunzoIVE9Xj26VtZYbsEXmxgUOTP2_witfSTifk9fViMDzZPQuoij0V40eUK6tm9a3hqyjDq74P_zuH6V6aGRJovUL8WXxBEswkgruf8ux5GPsUvDvGQd-yiGhpS04ViePFjn3XQkXThscJDSzMMHld4oOiuHAmQIaqP5xNXX6BeBJIEk

The interesting thing about this pattern is that this can be done for
all page modification types except EMODT. For example, here's the same
process for changing a mapping from RW to RX:

https://mermaid.live/edit/#pako:eNqNk11rwyAUhv-KeLVBC9ttYIVCvdhFu5F0UGh24fSkkUbN1Gwrpf995jttMphXes7jOa-vesZMc8ABtvBZgGKwEvRgqIwV8oMWTqtCfoCp1zk1TjCRU-VQSg3_pgbGmSMYBdk4bgrlhJzYYFMhx1H4ARarOr7RDpD-AlNFAyQlze_C3T2yJ8tolrVcmUXz-WLRNgvQkuz2D-91ugmiEiibBoiQzZaE-8cGKIODbEoVz6BvMoF8aH08AuRCHVoROndIJP4sB0BSc5EIRp3QyiIFwIHX1FWtTi0hu-dtJ-dWc-1sf_yeyrTOUaK9P5SlVes-V45651URsn5ZvYY9BmqgbMB32jrTDdgic9MSR7b-X-ONs5U-MqGvmkxeRhQt_V2jJ5Rr6-bNtSHrqIMb_g_DhyepXxoJSfS2Jr4snmEJRlLB_Xc5l3yMXQr-QePATzkktHQFx-ri0SLnvivhwmmDg4RmFma4_E_RSTEcOFNACzVfrqEuvytQILY

My point in this thread has always been that it is an anti-feature to
presume that there is a need to treat EPC and VLA permissions
separately. This is a performance sink and it optimizes for a use case
which doesn't exist. Nobody actually wants there to be a mismatch
between EPC and VLA permissions.

So, besides EMODT, the only userspace interface we need is
mmap()/mprotect()/munmap(). The kernel should either succeed the
mmap()/mprotect()/munmap() syscall if the EPC permissions can be made
compatible or should fail otherwise.

Another interesting property arises from this flow. Since the EPC and
VLA permissions are always synchronized from the perspective of
userspace, in cases where the memory state between the kernel and the
exec layer is roughly synchronous, bookkeeping in the shim can be
implemented without any persistent memory between syscall handling
events. So, for example, the shim can implement brk() and
mmap()/munmap()/mprotect() with just two pointers: one to the break
position and one to the lowest mmap().

It is true that this basically commits enclave authors to doing all
EACCEPT calls immediately after modifications. But I suspect everyone
will do this anyway since there is no efficient (read: performant) way
for shims to handle page faults. So trying to do this lazily will just
result in a huge decrease in performance.

Reinette Chatre March 3, 2022, 5:49 p.m. UTC | #8

Hi Nathaniel,

On 3/2/2022 5:13 PM, Nathaniel McCallum wrote:
> On Wed, Mar 2, 2022 at 4:20 PM Reinette Chatre
> <reinette.chatre@intel.com> wrote:
>>
>> Hi Nathaniel,
>>
>> On 3/2/2022 8:57 AM, Nathaniel McCallum wrote:
>>> Perhaps it would be better for us to have a shared understanding on
>>> how the patches as posted are supposed to work in the most common
>>> cases? I'm thinking here of projects such as Enarx, Gramine and
>>> Occulum, which all have a similar process. Namely they execute an
>>> executable (called exec in the below chart) which has things like
>>> syscalls handled by a shim. These two components (shim and exec) are
>>> supported by a non-enclave userspace runtime. Given this common
>>> architectural pattern, this is how I understand adding pages via an
>>> exec call to mmap() to work.
>>>
>>> https://mermaid.live/edit#pako:eNp1k81qwzAQhF9F6NRCAu1Vh0BIRemhoeSHBuIettYmFpElVZZLQ8i7144sJ8aOT2bmY3d2vT7R1AikjBb4U6JO8UXC3kGeaFI9FpyXqbSgPTmg06j6uiu1lzn2jSKTA2XwD9NEB31uPBLzi-6iMpLnYB8Wn4-kOBYpKBW52iXj8WQSmzEy5Zvt01ewG5HUQN2UEc7nK77YPjdALd64GWih8NpkALGwR_JtzOGAaKXexyTKGEt2pgoMaXahgj5Qgk9nM_6xGvDDJpsmOyiVv0LB62B8un4dBDrLiLPeWciCL9fvvKVQizhSG6stFz9Df7sxUpcYitR-SodFO2A_Vw-7l4nzzduqjX9bKJxOHDDeBB3RHF0OUlS3faq1hPoMqzulrHoVGPZOE32u0NIK8MiF9MZRtgNV4IhC6c3yqFPKvCsxQs3_0VDnfzf-CPg
>>>
>>> This only covers adding RW pages. I haven't even tackled permission
>>> changes yet. Is that understanding correct? If not, please provide an
>>> alternative sequence diagram to explain how you expect this to be
>>> used.
>>
>> Please find my attempt linked below:
>>
>> https://mermaid.live/edit#pako:eNqFUsFqAjEQ_ZWQUwsK7XUPgthQeqiUVang9jAkoxu6m2yzWVsR_72J2WTbKnSOb97MvPeSI-VaIM1oix8dKo4PEnYG6kIRVw0YK7lsQFlSghGfYPCy845GYXWJm05ZWV8ZaEt55QB-IS9UwOfaItF7NGc0I3UNzU3-ekvaQ8uhqiLPd8l4PJnEYxmZsvXm7i20e5B4QlA5rAqMgJJfG9Ixg21X2ctVXn9GGJsvWb65729FSZXWDdlqpxx46Qzu-gB8-cHzhhim2zKdzdjLcuAAt3IPzv6Qkq84EdxGM3492UJS-cdSpLHp6nEgCPz3RjI5NPvAlRisJjspOsbWT8sUyc_MwjuynC1Wzyw9EB3RGk0NUrgvePRYQW2J7tNQd5sKDN5ooU6O2jXCiWZCWm1otoWqxRGFzurFQXGaWdNhJPXfuGedvgFejOuH
>>
>> The changes include:
>> * Move mmap() to occur before attempting EACCEPT on the addresses. This is
>>   required for EACCEPT (as well as any subsequent access from within the enclave)
>>   to be able to access the pages.
>> * Remove AEX[1] to the runtime within the loop. After EAUG returns execution
>>   will return to the instruction pointer that triggered the #PF, EACCEPT,
>>   this will cause the EACCEPT to be run again, this time succeeding.
>>
>> This is based on the implementation within this series. When supporting
>> the new ioctl() requested by Jarkko there will be an additional ioctl()
>> required before the loop.
> 
> https://mermaid.live/edit/#pako:eNp1U9FqgzAU_ZWQpw1a2F6FFaQLYw8ro7asUPeQmWsNNYlL4rZS-u-LRmut1ie953jvOecmR5woBjjABr5LkAk8c7rTVMQSuYeWVslSfIH23wXVlie8oNKijGr2SzUMkT1oCfmwrktpuRj5wWRcDKvwB0ksfX2hLCD1A7quBkgIWtwtP-6ROZiE5nnLq1A0nc5m7bAAhWSzffj0cFNEFaEaGiBCFiuy3D42hKp4gWZUshy6ISOUL6X2e4CCy10rQhUW8dR52QESivGUJ9RyJQ2SAAyYZ_V6ndUSsnldneVca_bJdvY7lkf6vc4haTBlbsdbDmLoaLlSBUqVy5wmWW2nw3rq26Pg-oTzOXlf9Xkt7BfTeqjjSWlP2JWTlkrC9cutlmcLlUlxoRBkE3T9Mrq7KArd0UBPqFDGTpstI2OphSv-jf1cBukPJlmSaP1GXFs8wQK0oJy523Ws-DG2GTiJOHCvDLx3HMuTo5YFc1MJ41ZpHKQ0NzDB1fWLDjLBgdUltKTmhjas0z-kWy8L
> 
> My comments below correspond to the arrow numbers in the diagram.
> 
> 2. When the runtime receives the AEX, it doesn't have enough knowledge
> to know whether or not to ask the kernel for an mmap(). So it has to
> reenter the shim.
> 
> 3. The shim has to handle the syscall instruction routing it to the
> enclave's memory management subsystem.
> 
> 4. The shim has to do bookkeeping and decide if additional pages are
> even needed. If pages are already allocated, for example, it can skip
> directly to step 13. However, if modifications are needed, it will go
> to steps 5-12.
> 
> 5-12. This is the part that represents new code from the kernel's
> perspective for SGX2. It is also in a performance critical path and
> should be evaluated with greater scrutiny. The number of context
> switches is O(2N + 4) for each new allocated block, where N is the
> number of pages: a context switch occurs at step 5, 6, 7,  8, 9/10 and
> 12. However, this can be reduced to O(4) for each new allocated block
> with a simple modification:
> 
> https://mermaid.live/edit/#pako:eNqNk11rwyAUhv-KeLVBC9ttYIXQydjFymhaVmh24fSkkUbN1Gwrpf99pvlsk8G80nMez3l91SNmmgMOsIXPAhSDR0F3hspYIT9o4bQq5AeYap1T4wQTOVUOpdTwb2pgmNmDUZAN46ZQTsiRDTYVchiFH2CxquIL7QDpLzDnaICkpPnN8u0W2YNlNMsarsyi6XQ2a5oFKCSb7d17la6DqATKpgEiZLEiy-19DZTBXjalimfQNRlBPrTe7wFyoXaNCJ07JBJ_lh0gqblIBKNOaGWRAuDAK-qiVquWkM3zqpVzrblytjt-R2Va5yjR3h_K0nPrLleOaudFERKunzoIVE9Xj26VtZYbsEXmxgUOTP2_witfSTifk9fViMDzZPQuoij0V40eUK6tm9a3hqyjDq74P_zuH6V6aGRJovUL8WXxBEswkgruf8ux5GPsUvDvGQd-yiGhpS04ViePFjn3XQkXThscJDSzMMHld4oOiuHAmQIaqP5xNXX6BeBJIEk

Your optimized proposal is possible in the current implementation as
follows:

https://mermaid.live/edit#pako:eNp1k11vgjAUhv_KSa-2RJPtlmQmxvViFzOLuMxEdlHbgzTSlrVlmzH-9xUBUWFclfc8nI-X0wPhRiCJiMOvEjXHZ8m2lqlEQ3hY6Y0u1QZt_V4w6yWXBdMeMmbFD7PYj-zQasz7ui21l2rgA5dJ1VfxF3mia31uPIL5RntSI1CKFXeLj3twe8dZnrdcFYXxeDJpi0Uwpav1w2cdbkSogKpoBJTOl3SxfmyASryIZkyLHLsiA8jGmN0OsZB62zZhCg8yDbNsEZQRMpWceWm0A40oUNTUVa5zt5SuXpbndm57rp3txu-oOnKd62ySRVfmfjhlz4YOy40pIDXBc8az0zhdbMAJOp3N6NuyY1A3o54Og-7F8TT8HHiCwjg_bnwG55nHG_4fhy5HqVeDLmj8_kpDWjIiCq1iUoT9PlR8QnyGYQNJFI4CU1bZQhJ9DGhZiFCVCumNJVHKcocjUl2AeK85ibwtsYWaO9JQxz-gBQs-

You can think of that EACCEPT instruction similar to a current (SGX1)
enclave memory read or write when the enclave page is not currently in
the EPC, for example, if the enclave memory being accessed is swapped
out and need to be decrypted and loaded back. Instead of ENCLS[ELDU]
incorporated to load the enclave page back into EPC, ENCLS[EAUG] is
incorporated to create a new EPC page.

You can find an example of such a flow involving EACCEPT in the
"augment_via_eaccept" test found in "[PATCH V2 21/32] selftests/sgx: Test
two different SGX2 EAUG flows"


> The interesting thing about this pattern is that this can be done for
> all page modification types except EMODT. For example, here's the same
> process for changing a mapping from RW to RX:
> 
> https://mermaid.live/edit/#pako:eNqNk11rwyAUhv-KeLVBC9ttYIVCvdhFu5F0UGh24fSkkUbN1Gwrpf995jttMphXes7jOa-vesZMc8ABtvBZgGKwEvRgqIwV8oMWTqtCfoCp1zk1TjCRU-VQSg3_pgbGmSMYBdk4bgrlhJzYYFMhx1H4ARarOr7RDpD-AlNFAyQlze_C3T2yJ8tolrVcmUXz-WLRNgvQkuz2D-91ugmiEiibBoiQzZaE-8cGKIODbEoVz6BvMoF8aH08AuRCHVoROndIJP4sB0BSc5EIRp3QyiIFwIHX1FWtTi0hu-dtJ-dWc-1sf_yeyrTOUaK9P5SlVes-V45651URsn5ZvYY9BmqgbMB32jrTDdgic9MSR7b-X-ONs5U-MqGvmkxeRhQt_V2jJ5Rr6-bNtSHrqIMb_g_DhyepXxoJSfS2Jr4snmEJRlLB_Xc5l3yMXQr-QePATzkktHQFx-ri0SLnvivhwmmDg4RmFma4_E_RSTEcOFNACzVfrqEuvytQILY
> 
> My point in this thread has always been that it is an anti-feature to
> presume that there is a need to treat EPC and VLA permissions
> separately. This is a performance sink and it optimizes for a use case
> which doesn't exist. Nobody actually wants there to be a mismatch
> between EPC and VLA permissions.

I assume you mean VMA permissions. It is hard for me to trust the statement
that nobody wants there to be a mismatch since VMA permissions being separate
from EPC permissions is an intentional (as documented) and integral part of the
current SGX ABI. Current SGX implementation explicitly checks for and supports
VMA mappings with permissions different from EPC permissions.

This SGX2 implementation follows and respects the current ABI and changing ABI
cannot be taken lightly.
 
Reinette

Jarkko Sakkinen March 4, 2022, 12:57 a.m. UTC | #9

On Wed, Mar 02, 2022 at 08:13:55PM -0500, Nathaniel McCallum wrote:
> On Wed, Mar 2, 2022 at 4:20 PM Reinette Chatre
> <reinette.chatre@intel.com> wrote:
> >
> > Hi Nathaniel,
> >
> > On 3/2/2022 8:57 AM, Nathaniel McCallum wrote:
> > > Perhaps it would be better for us to have a shared understanding on
> > > how the patches as posted are supposed to work in the most common
> > > cases? I'm thinking here of projects such as Enarx, Gramine and
> > > Occulum, which all have a similar process. Namely they execute an
> > > executable (called exec in the below chart) which has things like
> > > syscalls handled by a shim. These two components (shim and exec) are
> > > supported by a non-enclave userspace runtime. Given this common
> > > architectural pattern, this is how I understand adding pages via an
> > > exec call to mmap() to work.
> > >
> > > https://mermaid.live/edit#pako:eNp1k81qwzAQhF9F6NRCAu1Vh0BIRemhoeSHBuIettYmFpElVZZLQ8i7144sJ8aOT2bmY3d2vT7R1AikjBb4U6JO8UXC3kGeaFI9FpyXqbSgPTmg06j6uiu1lzn2jSKTA2XwD9NEB31uPBLzi-6iMpLnYB8Wn4-kOBYpKBW52iXj8WQSmzEy5Zvt01ewG5HUQN2UEc7nK77YPjdALd64GWih8NpkALGwR_JtzOGAaKXexyTKGEt2pgoMaXahgj5Qgk9nM_6xGvDDJpsmOyiVv0LB62B8un4dBDrLiLPeWciCL9fvvKVQizhSG6stFz9Df7sxUpcYitR-SodFO2A_Vw-7l4nzzduqjX9bKJxOHDDeBB3RHF0OUlS3faq1hPoMqzulrHoVGPZOE32u0NIK8MiF9MZRtgNV4IhC6c3yqFPKvCsxQs3_0VDnfzf-CPg
> > >
> > > This only covers adding RW pages. I haven't even tackled permission
> > > changes yet. Is that understanding correct? If not, please provide an
> > > alternative sequence diagram to explain how you expect this to be
> > > used.
> >
> > Please find my attempt linked below:
> >
> > https://mermaid.live/edit#pako:eNqFUsFqAjEQ_ZWQUwsK7XUPgthQeqiUVang9jAkoxu6m2yzWVsR_72J2WTbKnSOb97MvPeSI-VaIM1oix8dKo4PEnYG6kIRVw0YK7lsQFlSghGfYPCy845GYXWJm05ZWV8ZaEt55QB-IS9UwOfaItF7NGc0I3UNzU3-ekvaQ8uhqiLPd8l4PJnEYxmZsvXm7i20e5B4QlA5rAqMgJJfG9Ixg21X2ctVXn9GGJsvWb65729FSZXWDdlqpxx46Qzu-gB8-cHzhhim2zKdzdjLcuAAt3IPzv6Qkq84EdxGM3492UJS-cdSpLHp6nEgCPz3RjI5NPvAlRisJjspOsbWT8sUyc_MwjuynC1Wzyw9EB3RGk0NUrgvePRYQW2J7tNQd5sKDN5ooU6O2jXCiWZCWm1otoWqxRGFzurFQXGaWdNhJPXfuGedvgFejOuH
> >
> > The changes include:
> > * Move mmap() to occur before attempting EACCEPT on the addresses. This is
> >   required for EACCEPT (as well as any subsequent access from within the enclave)
> >   to be able to access the pages.
> > * Remove AEX[1] to the runtime within the loop. After EAUG returns execution
> >   will return to the instruction pointer that triggered the #PF, EACCEPT,
> >   this will cause the EACCEPT to be run again, this time succeeding.
> >
> > This is based on the implementation within this series. When supporting
> > the new ioctl() requested by Jarkko there will be an additional ioctl()
> > required before the loop.
> 
> https://mermaid.live/edit/#pako:eNp1U9FqgzAU_ZWQpw1a2F6FFaQLYw8ro7asUPeQmWsNNYlL4rZS-u-LRmut1ie953jvOecmR5woBjjABr5LkAk8c7rTVMQSuYeWVslSfIH23wXVlie8oNKijGr2SzUMkT1oCfmwrktpuRj5wWRcDKvwB0ksfX2hLCD1A7quBkgIWtwtP-6ROZiE5nnLq1A0nc5m7bAAhWSzffj0cFNEFaEaGiBCFiuy3D42hKp4gWZUshy6ISOUL6X2e4CCy10rQhUW8dR52QESivGUJ9RyJQ2SAAyYZ_V6ndUSsnldneVca_bJdvY7lkf6vc4haTBlbsdbDmLoaLlSBUqVy5wmWW2nw3rq26Pg-oTzOXlf9Xkt7BfTeqjjSWlP2JWTlkrC9cutlmcLlUlxoRBkE3T9Mrq7KArd0UBPqFDGTpstI2OphSv-jf1cBukPJlmSaP1GXFs8wQK0oJy523Ws-DG2GTiJOHCvDLx3HMuTo5YFc1MJ41ZpHKQ0NzDB1fWLDjLBgdUltKTmhjas0z-kWy8L
> 
> My comments below correspond to the arrow numbers in the diagram.
> 
> 2. When the runtime receives the AEX, it doesn't have enough knowledge
> to know whether or not to ask the kernel for an mmap(). So it has to
> reenter the shim.
> 
> 3. The shim has to handle the syscall instruction routing it to the
> enclave's memory management subsystem.
> 
> 4. The shim has to do bookkeeping and decide if additional pages are
> even needed. If pages are already allocated, for example, it can skip
> directly to step 13. However, if modifications are needed, it will go
> to steps 5-12.
> 
> 5-12. This is the part that represents new code from the kernel's
> perspective for SGX2. It is also in a performance critical path and
> should be evaluated with greater scrutiny. The number of context
> switches is O(2N + 4) for each new allocated block, where N is the
> number of pages: a context switch occurs at step 5, 6, 7,  8, 9/10 and
> 12. However, this can be reduced to O(4) for each new allocated block
> with a simple modification:
> 
> https://mermaid.live/edit/#pako:eNqNk11rwyAUhv-KeLVBC9ttYIXQydjFymhaVmh24fSkkUbN1Gwrpf99pvlsk8G80nMez3l91SNmmgMOsIXPAhSDR0F3hspYIT9o4bQq5AeYap1T4wQTOVUOpdTwb2pgmNmDUZAN46ZQTsiRDTYVchiFH2CxquIL7QDpLzDnaICkpPnN8u0W2YNlNMsarsyi6XQ2a5oFKCSb7d17la6DqATKpgEiZLEiy-19DZTBXjalimfQNRlBPrTe7wFyoXaNCJ07JBJ_lh0gqblIBKNOaGWRAuDAK-qiVquWkM3zqpVzrblytjt-R2Va5yjR3h_K0nPrLleOaudFERKunzoIVE9Xj26VtZYbsEXmxgUOTP2_witfSTifk9fViMDzZPQuoij0V40eUK6tm9a3hqyjDq74P_zuH6V6aGRJovUL8WXxBEswkgruf8ux5GPsUvDvGQd-yiGhpS04ViePFjn3XQkXThscJDSzMMHld4oOiuHAmQIaqP5xNXX6BeBJIEk
> 
> The interesting thing about this pattern is that this can be done for
> all page modification types except EMODT. For example, here's the same
> process for changing a mapping from RW to RX:
> 
> https://mermaid.live/edit/#pako:eNqNk11rwyAUhv-KeLVBC9ttYIVCvdhFu5F0UGh24fSkkUbN1Gwrpf995jttMphXes7jOa-vesZMc8ABtvBZgGKwEvRgqIwV8oMWTqtCfoCp1zk1TjCRU-VQSg3_pgbGmSMYBdk4bgrlhJzYYFMhx1H4ARarOr7RDpD-AlNFAyQlze_C3T2yJ8tolrVcmUXz-WLRNgvQkuz2D-91ugmiEiibBoiQzZaE-8cGKIODbEoVz6BvMoF8aH08AuRCHVoROndIJP4sB0BSc5EIRp3QyiIFwIHX1FWtTi0hu-dtJ-dWc-1sf_yeyrTOUaK9P5SlVes-V45651URsn5ZvYY9BmqgbMB32jrTDdgic9MSR7b-X-ONs5U-MqGvmkxeRhQt_V2jJ5Rr6-bNtSHrqIMb_g_DhyepXxoJSfS2Jr4snmEJRlLB_Xc5l3yMXQr-QePATzkktHQFx-ri0SLnvivhwmmDg4RmFma4_E_RSTEcOFNACzVfrqEuvytQILY
> 
> My point in this thread has always been that it is an anti-feature to
> presume that there is a need to treat EPC and VLA permissions
> separately. This is a performance sink and it optimizes for a use case
> which doesn't exist. Nobody actually wants there to be a mismatch
> between EPC and VLA permissions.

I would not touch pre-initialization, EADD'd pages. The reason is
backwards compatibility.

For post-initialization, options are still open.

> So, besides EMODT, the only userspace interface we need is
> mmap()/mprotect()/munmap(). The kernel should either succeed the
> mmap()/mprotect()/munmap() syscall if the EPC permissions can be made
> compatible or should fail otherwise.

For mmap() it is the enclave who sets the permissions, not kernel, i.e. you
get a half-broken mmap() implementation. Kernel does EAUG, enclave does
EACCEPTCOPY.

I think what you're asking is too simple to be true, and even if we could
do it, it might limit possibilities to optimize user space, e.g. because
there is two ENCLU leaf functions (EMODPE, EACCEPTCOPY) and one ENCLS
leaf function (EMODPR), which can modify permissions.

A kernel syscall is essentially something that can be fully serviced
by the kernel. This is not such situation. The work is split.

BR, Jarkko

[V2,00/32] x86/sgx and selftests/sgx: Support SGX2

Message

Comments