
[V2,06/32] x86/sgx: Support VMA permissions more relaxed than enclave permissions

Message ID 0555a4b4a5e8879eb8f879ab3d9908302000f11c.1644274683.git.reinette.chatre@intel.com (mailing list archive)
State New, archived
Series x86/sgx and selftests/sgx: Support SGX2

Commit Message

Reinette Chatre Feb. 8, 2022, 12:45 a.m. UTC
=== Summary ===

An SGX VMA can only be created if its permissions are the same or
weaker than the Enclave Page Cache Map (EPCM) permissions. After VMA
creation this same rule is again enforced by the page fault handler:
faulted enclave pages are required to have equal or more relaxed
EPCM permissions than the VMA permissions.

On SGX1 systems the additional enforcement in the page fault handler
is redundant and on SGX2 systems it incorrectly prevents access.
On SGX1 systems it is unnecessary to repeat the enforcement of the
permission rule. The rule used during original VMA creation will
ensure that any access attempt will use correct permissions.
With SGX2 the EPCM permissions of a page can change after VMA
creation resulting in the VMA permissions potentially being more
relaxed than the EPCM permissions and the page fault handler
incorrectly blocking valid access attempts.

Enable the VMA's pages to remain accessible while ensuring that
the PTEs are installed to match the EPCM permissions, without being
more relaxed than the VMA permissions.

=== Full Changelog ===

An SGX enclave is an area of memory where parts of an application
can reside. First an enclave is created and loaded (from
non-enclave memory) with the code and data of an application,
then user space can map (mmap()) the enclave memory to
be able to enter the enclave at its defined entry points for
execution within it.

The hardware maintains a secure structure, the Enclave Page Cache Map
(EPCM), that tracks the contents of the enclave. Of interest here is
its tracking of the enclave page permissions. When a page is loaded
into the enclave its permissions are specified and recorded in the
EPCM. In parallel the kernel maintains permissions within the
page table entries (PTEs) and the rule is that PTE permissions
are not allowed to be more relaxed than the EPCM permissions.

A new mapping (mmap()) of enclave memory can only succeed if the
mapping has the same or weaker permissions than the permissions that
were vetted during enclave creation. This is enforced by
sgx_encl_may_map() that is called on the mmap() as well as mprotect()
paths. This rule remains.

One feature of SGX2 is support for modifying EPCM permissions after
enclave initialization. Enclave pages may thus already be part of a
VMA at the time their EPCM permissions are changed, resulting in the
VMA's permissions potentially being more relaxed than the EPCM
permissions.

Allow permissions of existing VMAs to be more relaxed than EPCM
permissions in preparation for dynamic EPCM permission changes
made possible in SGX2.  New VMAs that attempt to have more relaxed
permissions than EPCM permissions continue to be unsupported.

Reasons why permissions of existing VMAs are allowed to be more relaxed
than EPCM permissions instead of dynamically changing VMA permissions
when EPCM permissions change are:
1) Changing VMA permissions involves splitting VMAs, an operation
   that can fail. Additionally, changing the EPCM permissions of a
   range of pages could fail on any of the pages involved. Handling
   these error cases causes problems. For example, if an EPCM
   permission change fails after the VMA has already been split, it
   is possible neither to undo the VMA split nor to undo the EPCM
   permission changes that succeeded before the failure.
2) The kernel has little insight into the user space from which EPCM
   permissions are controlled. For example, a RW page may be made RO
   just before it is made RX; splitting VMAs to track permissions
   that may change again soon is unnecessary.

Remove the extra permission check performed on a page fault
(vm_operations_struct->fault) or during debugging
(vm_operations_struct->access) when loading the enclave page from
swap, which ensures that the VMA permissions are not more relaxed
than the EPCM permissions. Since a VMA can only exist if it passed
the original permission checks during mmap(), and a VMA may indeed
have more relaxed permissions than the EPCM permissions, this extra
permission check is no longer appropriate.

With the permission check removed, ensure that PTEs do not blindly
inherit the VMA permissions but instead carry the permissions on
which the VMA and EPCM agree. PTEs for writable pages (writable from
both the VMA and enclave perspective) are installed with the writable
bit set, limiting the need for this additional flow to the permission
mismatch cases handled next.

Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---
Changes since V1:
- Reword commit message (Jarkko).
- Use "relax" instead of "exceed" when referring to permissions (Dave).
- Add snippet to Documentation/x86/sgx.rst that highlights the
  relationship between VMA, EPCM, and PTE permissions on SGX
  systems (Andy).

 Documentation/x86/sgx.rst      | 10 +++++++++
 arch/x86/kernel/cpu/sgx/encl.c | 38 ++++++++++++++++++----------------
 2 files changed, 30 insertions(+), 18 deletions(-)

Comments

Jarkko Sakkinen March 7, 2022, 5:10 p.m. UTC | #1
On Mon, Feb 07, 2022 at 04:45:28PM -0800, Reinette Chatre wrote:
> [...]

If you unconditionally set vm_max_prot_bits to RWX for dynamically created
pages, you would not need to do this.

These patches could then be safely dropped:

- [PATCH V2 06/32] x86/sgx: Support VMA permissions more relaxed than enclave permissions 
- [PATCH V2 08/32] x86/sgx: x86/sgx: Add sgx_encl_page->vm_run_prot_bits for dynamic permission changes
- [PATCH V2 15/32] x86/sgx: Support relaxing of enclave page permissions

And that would also keep full ABI compatibility without exceptions to the
existing mainline code.

BR, Jarkko
Reinette Chatre March 7, 2022, 5:36 p.m. UTC | #2
Hi Jarkko,

On 3/7/2022 9:10 AM, Jarkko Sakkinen wrote:
> On Mon, Feb 07, 2022 at 04:45:28PM -0800, Reinette Chatre wrote:
>> [...]
> 
> If you unconditionally set vm_max_prot_bits to RWX for dynamically created
> pages, you would not need to do this.
> 
> These patches could then be safely dropped:
> 
> - [PATCH V2 06/32] x86/sgx: Support VMA permissions more relaxed than enclave permissions 
> - [PATCH V2 08/32] x86/sgx: x86/sgx: Add sgx_encl_page->vm_run_prot_bits for dynamic permission changes
> - [PATCH V2 15/32] x86/sgx: Support relaxing of enclave page permissions
> 
> And that would also keep full ABI compatibility without exceptions to the
> existing mainline code.
> 

Dropping these patches does not just impact dynamically created pages.
It would result in EPCM page permission restriction being supported for
all pages, those added before enclave initialization as well as
dynamically added pages, but their PTEs would not be impacted.

For example, if a RW enclave page is added via SGX_IOC_ENCLAVE_ADD_PAGES and
then later made read-only via SGX_IOC_ENCLAVE_RESTRICT_PERMISSIONS, then Linux
would keep allowing and installing RW PTEs for this page.

Allowing this goes against something explicitly disallowed from the beginning
of SGX as per:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/x86/sgx.rst#n74

"EPCM permissions are separate from the normal page tables.  This prevents the
kernel from, for instance, allowing writes to data which an enclave wishes to
remain read-only."

Reinette
Jarkko Sakkinen March 8, 2022, 8:14 a.m. UTC | #3
On Mon, Mar 07, 2022 at 09:36:36AM -0800, Reinette Chatre wrote:
> Hi Jarkko,
> 
> On 3/7/2022 9:10 AM, Jarkko Sakkinen wrote:
> > On Mon, Feb 07, 2022 at 04:45:28PM -0800, Reinette Chatre wrote:
> >> [...]
> >> +
> >>  Application interface
> >>  =====================
> >>  
> >> diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
> >> index 48afe96ae0f0..b6105d9e7c46 100644
> >> --- a/arch/x86/kernel/cpu/sgx/encl.c
> >> +++ b/arch/x86/kernel/cpu/sgx/encl.c
> >> @@ -91,10 +91,8 @@ static struct sgx_epc_page *sgx_encl_eldu(struct sgx_encl_page *encl_page,
> >>  }
> >>  
> >>  static struct sgx_encl_page *sgx_encl_load_page(struct sgx_encl *encl,
> >> -						unsigned long addr,
> >> -						unsigned long vm_flags)
> >> +						unsigned long addr)
> >>  {
> >> -	unsigned long vm_prot_bits = vm_flags & (VM_READ | VM_WRITE | VM_EXEC);
> >>  	struct sgx_epc_page *epc_page;
> >>  	struct sgx_encl_page *entry;
> >>  
> >> @@ -102,14 +100,6 @@ static struct sgx_encl_page *sgx_encl_load_page(struct sgx_encl *encl,
> >>  	if (!entry)
> >>  		return ERR_PTR(-EFAULT);
> >>  
> >> -	/*
> >> -	 * Verify that the faulted page has equal or higher build time
> >> -	 * permissions than the VMA permissions (i.e. the subset of {VM_READ,
> >> -	 * VM_WRITE, VM_EXECUTE} in vma->vm_flags).
> >> -	 */
> >> -	if ((entry->vm_max_prot_bits & vm_prot_bits) != vm_prot_bits)
> >> -		return ERR_PTR(-EFAULT);
> >> -
> >>  	/* Entry successfully located. */
> >>  	if (entry->epc_page) {
> >>  		if (entry->desc & SGX_ENCL_PAGE_BEING_RECLAIMED)
> >> @@ -138,7 +128,9 @@ static vm_fault_t sgx_vma_fault(struct vm_fault *vmf)
> >>  {
> >>  	unsigned long addr = (unsigned long)vmf->address;
> >>  	struct vm_area_struct *vma = vmf->vma;
> >> +	unsigned long page_prot_bits;
> >>  	struct sgx_encl_page *entry;
> >> +	unsigned long vm_prot_bits;
> >>  	unsigned long phys_addr;
> >>  	struct sgx_encl *encl;
> >>  	vm_fault_t ret;
> >> @@ -155,7 +147,7 @@ static vm_fault_t sgx_vma_fault(struct vm_fault *vmf)
> >>  
> >>  	mutex_lock(&encl->lock);
> >>  
> >> -	entry = sgx_encl_load_page(encl, addr, vma->vm_flags);
> >> +	entry = sgx_encl_load_page(encl, addr);
> >>  	if (IS_ERR(entry)) {
> >>  		mutex_unlock(&encl->lock);
> >   
> >> @@ -167,7 +159,19 @@ static vm_fault_t sgx_vma_fault(struct vm_fault *vmf)
> >>  
> >>  	phys_addr = sgx_get_epc_phys_addr(entry->epc_page);
> >>  
> >> -	ret = vmf_insert_pfn(vma, addr, PFN_DOWN(phys_addr));
> >> +	/*
> >> +	 * Insert PTE to match the EPCM page permissions ensured to not
> >> +	 * exceed the VMA permissions.
> >> +	 */
> >> +	vm_prot_bits = vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC);
> >> +	page_prot_bits = entry->vm_max_prot_bits & vm_prot_bits;
> >> +	/*
> >> +	 * Add VM_SHARED so that PTE is made writable right away if VMA
> >> +	 * and EPCM are writable (no COW in SGX).
> >> +	 */
> >> +	page_prot_bits |= (vma->vm_flags & VM_SHARED);
> >> +	ret = vmf_insert_pfn_prot(vma, addr, PFN_DOWN(phys_addr),
> >> +				  vm_get_page_prot(page_prot_bits));
> >>  	if (ret != VM_FAULT_NOPAGE) {
> >>  		mutex_unlock(&encl->lock);
> >>  
> >> @@ -295,15 +299,14 @@ static int sgx_encl_debug_write(struct sgx_encl *encl, struct sgx_encl_page *pag
> >>   * Load an enclave page to EPC if required, and take encl->lock.
> >>   */
> >>  static struct sgx_encl_page *sgx_encl_reserve_page(struct sgx_encl *encl,
> >> -						   unsigned long addr,
> >> -						   unsigned long vm_flags)
> >> +						   unsigned long addr)
> >>  {
> >>  	struct sgx_encl_page *entry;
> >>  
> >>  	for ( ; ; ) {
> >>  		mutex_lock(&encl->lock);
> >>  
> >> -		entry = sgx_encl_load_page(encl, addr, vm_flags);
> >> +		entry = sgx_encl_load_page(encl, addr);
> >>  		if (PTR_ERR(entry) != -EBUSY)
> >>  			break;
> >>  
> >> @@ -339,8 +342,7 @@ static int sgx_vma_access(struct vm_area_struct *vma, unsigned long addr,
> >>  		return -EFAULT;
> >>  
> >>  	for (i = 0; i < len; i += cnt) {
> >> -		entry = sgx_encl_reserve_page(encl, (addr + i) & PAGE_MASK,
> >> -					      vma->vm_flags);
> >> +		entry = sgx_encl_reserve_page(encl, (addr + i) & PAGE_MASK);
> >>  		if (IS_ERR(entry)) {
> >>  			ret = PTR_ERR(entry);
> >>  			break;
> >> -- 
> >> 2.25.1
> >>
> > 
> > If you unconditionally set vm_max_prot_bits to RWX for dynamically created
> > pages, you would not need to do this.
> > 
> > These patches could then be safely dropped:
> > 
> > - [PATCH V2 06/32] x86/sgx: Support VMA permissions more relaxed than enclave permissions 
> > - [PATCH V2 08/32] x86/sgx: x86/sgx: Add sgx_encl_page->vm_run_prot_bits for dynamic permission changes
> > - [PATCH V2 15/32] x86/sgx: Support relaxing of enclave page permissions
> > 
> > And that would also keep full ABI compatibility without exceptions to the
> > existing mainline code.
> > 
> 
> Dropping these changes does not just impact dynamically created pages. Dropping
> these patches would result in EPCM page permission restriction being supported
> for all pages, those added before enclave initialization as well as dynamically
> added pages, but their PTEs would not be impacted.
> 
> For example, if a RW enclave page is added via SGX_IOC_ENCLAVE_ADD_PAGES and
> then later made read-only via SGX_IOC_ENCLAVE_RESTRICT_PERMISSIONS then Linux
> would keep allowing and installing RW PTEs to this page.

I think that would be perfectly fine, if someone wants to do that. There is
no collateral damage in doing that. The kernel does not get messed up because
of that. It's a use case that does not make sense in the first place, so it'd
be stupid to build anything extensive around it in the kernel.

Shooting yourself in the foot is something that the kernel does not, and
should not, protect user space from unless there is a risk of messing up the
state of the kernel itself.

Much worse is that we have, e.g., the completely artificial ioctl
SGX_IOC_ENCLAVE_RELAX_PERMISSIONS to support this scheme, which could
cause extra roundtrips for a simple EMODPE.

This also means not having to include 06/32, which keeps 100% backwards
compatibility in run-time behaviour with the mainline while not restricting
dynamically created pages at all. And we get rid of the complex bookkeeping
of vm_run_prot_bits.

And generally the whole model is then very easy to understand and explain.
If I had to give a presentation on the current mess in the patch set at a
conference, I can honestly say that I would be in serious trouble. It's
not a clean and clear security model, which is a risk by itself.

BR, Jarkko
Jarkko Sakkinen March 8, 2022, 9:06 a.m. UTC | #4
On Tue, Mar 08, 2022 at 10:14:42AM +0200, Jarkko Sakkinen wrote:
> 
> I think that would be perfectly fine, if someone wants to do that. There is
> no collateral damage in doing that. The kernel does not get messed up because
> of that. It's a use case that does not make sense in the first place, so it'd
> be stupid to build anything extensive around it in the kernel.
> 
> Shooting yourself in the foot is something that the kernel does not, and
> should not, protect user space from unless there is a risk of messing up the
> state of the kernel itself.
> 
> Much worse is that we have, e.g., the completely artificial ioctl
> SGX_IOC_ENCLAVE_RELAX_PERMISSIONS to support this scheme, which could
> cause extra roundtrips for a simple EMODPE.
> 
> This also means not having to include 06/32, which keeps 100% backwards
> compatibility in run-time behaviour with the mainline while not restricting
> dynamically created pages at all. And we get rid of the complex bookkeeping
> of vm_run_prot_bits.
> 
> And generally the whole model is then very easy to understand and explain.
> If I had to give a presentation on the current mess in the patch set at a
> conference, I can honestly say that I would be in serious trouble. It's
> not a clean and clear security model, which is a risk by itself.

I.e.

1. For EADD'd pages: stick to what has been the invariant for 1.5 years now.
   Do not change it by any means (e.g. 06/32).
2. For EAUG'd pages: set vm_max_prot_bits to RWX, which essentially means do
   whatever you want with PTEs and the EPCM.

It's a clear and understandable model that does nothing bad to the kernel,
and a run-time developer can surely find a way to get things going. For
user space, the most important thing is clarity in kernel behaviour, and
this does deliver that clarity. It's not perfect, but it does do the job
and anyone can get it.

BR, Jarkko
Jarkko Sakkinen March 8, 2022, 9:12 a.m. UTC | #5
On Tue, Mar 08, 2022 at 11:06:46AM +0200, Jarkko Sakkinen wrote:
> On Tue, Mar 08, 2022 at 10:14:42AM +0200, Jarkko Sakkinen wrote:
> > On Mon, Mar 07, 2022 at 09:36:36AM -0800, Reinette Chatre wrote:
> > > Hi Jarkko,
> > > 
> > > On 3/7/2022 9:10 AM, Jarkko Sakkinen wrote:
> > > > On Mon, Feb 07, 2022 at 04:45:28PM -0800, Reinette Chatre wrote:
> > > >>  		if (IS_ERR(entry)) {
> > > >>  			ret = PTR_ERR(entry);
> > > >>  			break;
> > > >> -- 
> > > >> 2.25.1
> > > >>
> > > > 
> > > > If you unconditionally set vm_max_prot_bits to RWX for dynamically created
> > > > pages, you would not need to do this.
> > > > 
> > > > These patches could be then safely dropped then:
> > > > 
> > > > - [PATCH V2 06/32] x86/sgx: Support VMA permissions more relaxed than enclave permissions 
> > > > - [PATCH V2 08/32] x86/sgx: x86/sgx: Add sgx_encl_page->vm_run_prot_bits for dynamic permission changes
> > > > - [PATCH V2 15/32] x86/sgx: Support relaxing of enclave page permissions
> > > > 
> > > > And that would also keep full ABI compatibility without exceptions to the
> > > > existing mainline code.
> > > > 
> > > 
> > > Dropping these changes does not just impact dynamically created pages. Dropping
> > > these patches would result in EPCM page permission restriction being supported
> > > for all pages, those added before enclave initialization as well as dynamically
> > > added pages, but their PTEs will not be impacted.
> > > 
> > > For example, if a RW enclave page is added via SGX_IOC_ENCLAVE_ADD_PAGES and
> > > then later made read-only via SGX_IOC_ENCLAVE_RESTRICT_PERMISSIONS then Linux
> > > would keep allowing and installing RW PTEs to this page.
> > 
> > I think that would be perfectly fine, if someone wants to do that. There is
> > no collateral damage in doing that. The kernel does not get messed up because
> > of that. It's a use case that does not make sense in the first place, so it'd
> > be stupid to build anything extensive around it in the kernel.
> > 
> > Shooting yourself in the foot is something that the kernel does not, and
> > should not, protect user space from unless there is a risk of messing up the
> > state of the kernel itself.
> > 
> > Much worse is that we have e.g. a completely artificial ioctl
> > SGX_IOC_ENCLAVE_RELAX_PERMISSIONS to support this scheme, which could
> > e.g. cause extra roundtrips for a simple EMODPE.
> > 
> > Also this means not having to include 06/32, which keeps 100% backwards
> > compatibility in run-time behaviour with the mainline while not restricting
> > dynamically created pages at all. And we get rid of the complex bookkeeping
> > of vm_run_prot_bits.
> > 
> > And generally the whole model is then very easy to understand and explain.
> > If I had to give a presentation on the current mess in the patch set at a
> > conference, I can honestly say that I would be in serious trouble. It's
> > not a clean and clear security model, which is a risk by itself.
> 
> I.e.
> 
> 1. For EADD'd pages: stick with what has been the invariant for 1.5 years
>    now. Do not change it by any means (e.g. 06/32).
> 2. For EAUG'd pages: set vm_max_prot_bits to RWX, which essentially means do
>    whatever you want with PTEs and the EPCM.
> 
> It's a clear and understandable model that does nothing bad to the kernel,
> and a run-time developer can surely find a way to get things going. For
> user space, the most important thing is clarity in kernel behaviour,
> and this does deliver that clarity. It's not perfect but it does do the
> job and anyone can get it.

Also a quantitative argument for this is that by simplifying the security
model this way it is one ioctl less, which must be considered a plus. We do
not want to add new ioctls unless it is something we absolutely cannot live
without. We absolutely can live without SGX_IOC_ENCLAVE_RELAX_PERMISSIONS.
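
The resulting model is small enough to sketch in a few lines of C. This is
only an illustration with made-up bit values; the kernel uses the real
VM_READ/VM_WRITE/VM_EXEC flags and enforces the rule in sgx_encl_may_map():

```c
#include <assert.h>

/* Made-up permission bits standing in for VM_READ/VM_WRITE/VM_EXEC. */
#define PROT_R 0x1UL
#define PROT_W 0x2UL
#define PROT_X 0x4UL

/* Rule 1: an EADD'd page keeps the permissions vetted at build time. */
static unsigned long max_prot_eadd(unsigned long vetted_prot)
{
	return vetted_prot;
}

/* Rule 2: an EAUG'd page gets RWX, so the runtime may do whatever it
 * wants with the PTEs and the EPCM afterwards. */
static unsigned long max_prot_eaug(void)
{
	return PROT_R | PROT_W | PROT_X;
}

/* The unchanged mmap()/mprotect() invariant: a VMA may not be more
 * relaxed than the page's vm_max_prot_bits. */
static int may_map(unsigned long vm_prot_bits, unsigned long max_prot_bits)
{
	return (vm_prot_bits & ~max_prot_bits) == 0;
}
```

E.g. an RWX mapping of an EADD'd RW page is still refused, while an EAUG'd
page can be mapped with any permissions.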

BR, Jarkko
Reinette Chatre March 8, 2022, 4:04 p.m. UTC | #6
Hi Jarkko,

On 3/8/2022 1:12 AM, Jarkko Sakkinen wrote:
> On Tue, Mar 08, 2022 at 11:06:46AM +0200, Jarkko Sakkinen wrote:
>> On Tue, Mar 08, 2022 at 10:14:42AM +0200, Jarkko Sakkinen wrote:
>>> On Mon, Mar 07, 2022 at 09:36:36AM -0800, Reinette Chatre wrote:
>>>> Hi Jarkko,
>>>>
>>>> On 3/7/2022 9:10 AM, Jarkko Sakkinen wrote:
>>>>> On Mon, Feb 07, 2022 at 04:45:28PM -0800, Reinette Chatre wrote:
>>>>>> === Summary ===
>>>>>>
>>>>>> An SGX VMA can only be created if its permissions are the same or
>>>>>> weaker than the Enclave Page Cache Map (EPCM) permissions. After VMA
>>>>>> creation this same rule is again enforced by the page fault handler:
>>>>>> faulted enclave pages are required to have equal or more relaxed
>>>>>> EPCM permissions than the VMA permissions.
>>>>>>
>>>>>> On SGX1 systems the additional enforcement in the page fault handler
>>>>>> is redundant and on SGX2 systems it incorrectly prevents access.
>>>>>> On SGX1 systems it is unnecessary to repeat the enforcement of the
>>>>>> permission rule. The rule used during original VMA creation will
>>>>>> ensure that any access attempt will use correct permissions.
>>>>>> With SGX2 the EPCM permissions of a page can change after VMA
>>>>>> creation resulting in the VMA permissions potentially being more
>>>>>> relaxed than the EPCM permissions and the page fault handler
>>>>>> incorrectly blocking valid access attempts.
>>>>>>
>>>>>> Enable the VMA's pages to remain accessible while ensuring that
>>>>>> the PTEs are installed to match the EPCM permissions but not be
>>>>>> more relaxed than the VMA permissions.
>>>>>>
>>>>>> === Full Changelog ===
>>>>>>
>>>>>> An SGX enclave is an area of memory where parts of an application
>>>>>> can reside. First an enclave is created and loaded (from
>>>>>> non-enclave memory) with the code and data of an application,
>>>>>> then user space can map (mmap()) the enclave memory to
>>>>>> be able to enter the enclave at its defined entry points for
>>>>>> execution within it.
>>>>>>
>>>>>> The hardware maintains a secure structure, the Enclave Page Cache Map
>>>>>> (EPCM), that tracks the contents of the enclave. Of interest here is
>>>>>> its tracking of the enclave page permissions. When a page is loaded
>>>>>> into the enclave its permissions are specified and recorded in the
>>>>>> EPCM. In parallel the kernel maintains permissions within the
>>>>>> page table entries (PTEs) and the rule is that PTE permissions
>>>>>> are not allowed to be more relaxed than the EPCM permissions.
>>>>>>
>>>>>> A new mapping (mmap()) of enclave memory can only succeed if the
>>>>>> mapping has the same or weaker permissions than the permissions that
>>>>>> were vetted during enclave creation. This is enforced by
>>>>>> sgx_encl_may_map() that is called on the mmap() as well as mprotect()
>>>>>> paths. This rule remains.
>>>>>>
>>>>>> One feature of SGX2 is to support the modification of EPCM permissions
>>>>>> after enclave initialization. Enclave pages may thus already be part
>>>>>> of a VMA at the time their EPCM permissions are changed resulting
>>>>>> in the VMA's permissions potentially being more relaxed than the EPCM
>>>>>> permissions.
>>>>>>
>>>>>> Allow permissions of existing VMAs to be more relaxed than EPCM
>>>>>> permissions in preparation for dynamic EPCM permission changes
>>>>>> made possible in SGX2.  New VMAs that attempt to have more relaxed
>>>>>> permissions than EPCM permissions continue to be unsupported.
>>>>>>
>>>>>> Reasons why permissions of existing VMAs are allowed to be more relaxed
>>>>>> than EPCM permissions instead of dynamically changing VMA permissions
>>>>>> when EPCM permissions change are:
>>>>>> 1) Changing VMA permissions involve splitting VMAs which is an
>>>>>>    operation that can fail. Additionally changing EPCM permissions of
>>>>>>    a range of pages could also fail on any of the pages involved.
>>>>>>    Handling these error cases causes problems. For example, if an
>>>>>>    EPCM permission change fails and the VMA has already been split
>>>>>>    then it is not possible to undo the VMA split nor possible to
>>>>>>    undo the EPCM permission changes that did succeed before the
>>>>>>    failure.
>>>>>> 2) The kernel has little insight into the user space where EPCM
>>>>>>    permissions are controlled from. For example, a RW page may
>>>>>>    be made RO just before it is made RX and splitting the VMAs
>>>>>>    while the VMAs may change soon is unnecessary.
>>>>>>
>>>>>> Remove the extra permission check called on a page fault
>>>>>> (vm_operations_struct->fault) or during debugging
>>>>>> (vm_operations_struct->access) when loading the enclave page from swap
>>>>>> that ensures that the VMA permissions are not more relaxed than the
>>>>>> EPCM permissions. Since a VMA could only exist if it passed the
>>>>>> original permission checks during mmap() and a VMA may indeed
>>>>>> have more relaxed permissions than the EPCM permissions this extra
>>>>>> permission check is no longer appropriate.
>>>>>>
>>>>>> With the permission check removed, ensure that PTEs do
>>>>>> not blindly inherit the VMA permissions but instead the permissions
>>>>>> that the VMA and EPCM agree on. PTEs for writable pages (from VMA
>>>>>> and enclave perspective) are installed with the writable bit set,
>>>>>> reducing the need for this additional flow to the permission mismatch
>>>>>> cases handled next.
>>>>>>
>>>>>> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
>>>>>> ---
>>>>>> Changes since V1:
>>>>>> - Reword commit message (Jarkko).
>>>>>> - Use "relax" instead of "exceed" when referring to permissions (Dave).
>>>>>> - Add snippet to Documentation/x86/sgx.rst that highlights the
>>>>>>   relationship between VMA, EPCM, and PTE permissions on SGX
>>>>>>   systems (Andy).
>>>>>>
>>>>>>  Documentation/x86/sgx.rst      | 10 +++++++++
>>>>>>  arch/x86/kernel/cpu/sgx/encl.c | 38 ++++++++++++++++++----------------
>>>>>>  2 files changed, 30 insertions(+), 18 deletions(-)
>>>>>>
>>>>>> diff --git a/Documentation/x86/sgx.rst b/Documentation/x86/sgx.rst
>>>>>> index 89ff924b1480..5659932728a5 100644
>>>>>> --- a/Documentation/x86/sgx.rst
>>>>>> +++ b/Documentation/x86/sgx.rst
>>>>>> @@ -99,6 +99,16 @@ The relationships between the different permission masks are:
>>>>>>  * PTEs are installed to match the EPCM permissions, but not be more
>>>>>>    relaxed than the VMA permissions.
>>>>>>  
>>>>>> +On systems supporting SGX2 EPCM permissions may change while the
>>>>>> +enclave page belongs to a VMA without impacting the VMA permissions.
>>>>>> +This means that a running VMA may appear to allow access to an enclave
>>>>>> +page that is not allowed by its EPCM permissions. For example, when an
>>>>>> +enclave page with RW EPCM permissions is mapped by a RW VMA but is
>>>>>> +subsequently changed to have read-only EPCM permissions. The kernel
>>>>>> +continues to maintain correct access to the enclave page through the
>>>>>> +PTE that will ensure that only access allowed by both the VMA
>>>>>> +and EPCM permissions are permitted.
>>>>>> +
>>>>>>  Application interface
>>>>>>  =====================
>>>>>>  
>>>>>> [arch/x86/kernel/cpu/sgx/encl.c hunks and the earlier discussion trimmed; quoted in full above]
>>
>> I.e.
>>
>> 1. For EADD'd pages: stick with what has been the invariant for 1.5 years
>>    now. Do not change it by any means (e.g. 06/32).
>> 2. For EAUG'd pages: set vm_max_prot_bits to RWX, which essentially means do
>>    whatever you want with PTEs and the EPCM.
>>
>> It's a clear and understandable model that does nothing bad to the kernel,
>> and a run-time developer can surely find a way to get things going. For
>> user space, the most important thing is clarity in kernel behaviour,
>> and this does deliver that clarity. It's not perfect but it does do the
>> job and anyone can get it.
> 
> Also a quantitative argument for this is that by simplifying the security
> model this way it is one ioctl less, which must be considered a plus. We do
> not want to add new ioctls unless it is something we absolutely cannot live
> without. We absolutely can live without SGX_IOC_ENCLAVE_RELAX_PERMISSIONS.
> 

Ok, with the implications understood and accepted I will proceed with a new
series that separates the EPCM permissions from the PTEs and makes RWX PTEs
possible by default for EAUG'd pages. This has broader impact than just
removing the three patches you list. "[PATCH 07/32] x86/sgx: Add
pfn_mkwrite() handler for present PTEs" is also no longer needed and there
is no longer a need to flush PTEs after restricting permissions. New changes
also need to be considered - at least to the current documentation. I'll
rework the series.
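
For reference, the PTE derivation that patch 06/32 adds to sgx_vma_fault(),
and that the reworked series will drop, boils down to the following. The
flag values here are simplified stand-ins, not the kernel's real vm_flags
bits:

```c
#include <assert.h>

/* Simplified stand-ins for the kernel's vm_flags bits. */
#define VM_READ   0x1UL
#define VM_WRITE  0x2UL
#define VM_EXEC   0x4UL
#define VM_SHARED 0x8UL

/* PTE protection bits: the EPCM permissions (vm_max_prot_bits) capped
 * by the VMA permissions, as in the sgx_vma_fault() hunk. */
static unsigned long sgx_fault_prot_bits(unsigned long vm_flags,
					 unsigned long vm_max_prot_bits)
{
	unsigned long vm_prot_bits = vm_flags & (VM_READ | VM_WRITE | VM_EXEC);
	unsigned long page_prot_bits = vm_max_prot_bits & vm_prot_bits;

	/* VM_SHARED makes a writable PTE writable right away (no COW). */
	page_prot_bits |= vm_flags & VM_SHARED;
	return page_prot_bits;
}
```

A read-only EPCM page mapped by an RW shared VMA thus gets a read-only PTE,
while an RW page in the same VMA gets an immediately-writable one.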

Reinette
Jarkko Sakkinen March 8, 2022, 5 p.m. UTC | #7
On Tue, Mar 08, 2022 at 08:04:33AM -0800, Reinette Chatre wrote:
> Hi Jarkko,
> 
> On 3/8/2022 1:12 AM, Jarkko Sakkinen wrote:
> > On Tue, Mar 08, 2022 at 11:06:46AM +0200, Jarkko Sakkinen wrote:
> >> On Tue, Mar 08, 2022 at 10:14:42AM +0200, Jarkko Sakkinen wrote:
> >>> On Mon, Mar 07, 2022 at 09:36:36AM -0800, Reinette Chatre wrote:
> >>>> Hi Jarkko,
> >>>>
> >>>> On 3/7/2022 9:10 AM, Jarkko Sakkinen wrote:
> >>>>> On Mon, Feb 07, 2022 at 04:45:28PM -0800, Reinette Chatre wrote:
> >>>>>> === Summary ===
> >>>>>>
> >>>>>> An SGX VMA can only be created if its permissions are the same or
> >>>>>> weaker than the Enclave Page Cache Map (EPCM) permissions. After VMA
> >>>>>> creation this same rule is again enforced by the page fault handler:
> >>>>>> faulted enclave pages are required to have equal or more relaxed
> >>>>>> EPCM permissions than the VMA permissions.
> >>>>>>
> >>>>>> On SGX1 systems the additional enforcement in the page fault handler
> >>>>>> is redundant and on SGX2 systems it incorrectly prevents access.
> >>>>>> On SGX1 systems it is unnecessary to repeat the enforcement of the
> >>>>>> permission rule. The rule used during original VMA creation will
> >>>>>> ensure that any access attempt will use correct permissions.
> >>>>>> With SGX2 the EPCM permissions of a page can change after VMA
> >>>>>> creation resulting in the VMA permissions potentially being more
> >>>>>> relaxed than the EPCM permissions and the page fault handler
> >>>>>> incorrectly blocking valid access attempts.
> >>>>>>
> >>>>>> Enable the VMA's pages to remain accessible while ensuring that
> >>>>>> the PTEs are installed to match the EPCM permissions but not be
> >>>>>> more relaxed than the VMA permissions.
> >>>>>>
> >>>>>> === Full Changelog ===
> >>>>>>
> >>>>>> An SGX enclave is an area of memory where parts of an application
> >>>>>> can reside. First an enclave is created and loaded (from
> >>>>>> non-enclave memory) with the code and data of an application,
> >>>>>> then user space can map (mmap()) the enclave memory to
> >>>>>> be able to enter the enclave at its defined entry points for
> >>>>>> execution within it.
> >>>>>>
> >>>>>> The hardware maintains a secure structure, the Enclave Page Cache Map
> >>>>>> (EPCM), that tracks the contents of the enclave. Of interest here is
> >>>>>> its tracking of the enclave page permissions. When a page is loaded
> >>>>>> into the enclave its permissions are specified and recorded in the
> >>>>>> EPCM. In parallel the kernel maintains permissions within the
> >>>>>> page table entries (PTEs) and the rule is that PTE permissions
> >>>>>> are not allowed to be more relaxed than the EPCM permissions.
> >>>>>>
> >>>>>> A new mapping (mmap()) of enclave memory can only succeed if the
> >>>>>> mapping has the same or weaker permissions than the permissions that
> >>>>>> were vetted during enclave creation. This is enforced by
> >>>>>> sgx_encl_may_map() that is called on the mmap() as well as mprotect()
> >>>>>> paths. This rule remains.
> >>>>>>
> >>>>>> One feature of SGX2 is to support the modification of EPCM permissions
> >>>>>> after enclave initialization. Enclave pages may thus already be part
> >>>>>> of a VMA at the time their EPCM permissions are changed resulting
> >>>>>> in the VMA's permissions potentially being more relaxed than the EPCM
> >>>>>> permissions.
> >>>>>>
> >>>>>> Allow permissions of existing VMAs to be more relaxed than EPCM
> >>>>>> permissions in preparation for dynamic EPCM permission changes
> >>>>>> made possible in SGX2.  New VMAs that attempt to have more relaxed
> >>>>>> permissions than EPCM permissions continue to be unsupported.
> >>>>>>
> >>>>>> Reasons why permissions of existing VMAs are allowed to be more relaxed
> >>>>>> than EPCM permissions instead of dynamically changing VMA permissions
> >>>>>> when EPCM permissions change are:
> >>>>>> 1) Changing VMA permissions involve splitting VMAs which is an
> >>>>>>    operation that can fail. Additionally changing EPCM permissions of
> >>>>>>    a range of pages could also fail on any of the pages involved.
> >>>>>>    Handling these error cases causes problems. For example, if an
> >>>>>>    EPCM permission change fails and the VMA has already been split
> >>>>>>    then it is not possible to undo the VMA split nor possible to
> >>>>>>    undo the EPCM permission changes that did succeed before the
> >>>>>>    failure.
> >>>>>> 2) The kernel has little insight into the user space where EPCM
> >>>>>>    permissions are controlled from. For example, a RW page may
> >>>>>>    be made RO just before it is made RX and splitting the VMAs
> >>>>>>    while the VMAs may change soon is unnecessary.
> >>>>>>
> >>>>>> Remove the extra permission check called on a page fault
> >>>>>> (vm_operations_struct->fault) or during debugging
> >>>>>> (vm_operations_struct->access) when loading the enclave page from swap
> >>>>>> that ensures that the VMA permissions are not more relaxed than the
> >>>>>> EPCM permissions. Since a VMA could only exist if it passed the
> >>>>>> original permission checks during mmap() and a VMA may indeed
> >>>>>> have more relaxed permissions than the EPCM permissions this extra
> >>>>>> permission check is no longer appropriate.
> >>>>>>
> >>>>>> With the permission check removed, ensure that PTEs do
> >>>>>> not blindly inherit the VMA permissions but instead the permissions
> >>>>>> that the VMA and EPCM agree on. PTEs for writable pages (from VMA
> >>>>>> and enclave perspective) are installed with the writable bit set,
> >>>>>> reducing the need for this additional flow to the permission mismatch
> >>>>>> cases handled next.
> >>>>>>
> >>>>>> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
> >>>>>> ---
> >>>>>> Changes since V1:
> >>>>>> - Reword commit message (Jarkko).
> >>>>>> - Use "relax" instead of "exceed" when referring to permissions (Dave).
> >>>>>> - Add snippet to Documentation/x86/sgx.rst that highlights the
> >>>>>>   relationship between VMA, EPCM, and PTE permissions on SGX
> >>>>>>   systems (Andy).
> >>>>>>
> >>>>>>  Documentation/x86/sgx.rst      | 10 +++++++++
> >>>>>>  arch/x86/kernel/cpu/sgx/encl.c | 38 ++++++++++++++++++----------------
> >>>>>>  2 files changed, 30 insertions(+), 18 deletions(-)
> >>>>>>
> >>>>>> diff --git a/Documentation/x86/sgx.rst b/Documentation/x86/sgx.rst
> >>>>>> index 89ff924b1480..5659932728a5 100644
> >>>>>> --- a/Documentation/x86/sgx.rst
> >>>>>> +++ b/Documentation/x86/sgx.rst
> >>>>>> @@ -99,6 +99,16 @@ The relationships between the different permission masks are:
> >>>>>>  * PTEs are installed to match the EPCM permissions, but not be more
> >>>>>>    relaxed than the VMA permissions.
> >>>>>>  
> >>>>>> +On systems supporting SGX2 EPCM permissions may change while the
> >>>>>> +enclave page belongs to a VMA without impacting the VMA permissions.
> >>>>>> +This means that a running VMA may appear to allow access to an enclave
> >>>>>> +page that is not allowed by its EPCM permissions. For example, when an
> >>>>>> +enclave page with RW EPCM permissions is mapped by a RW VMA but is
> >>>>>> +subsequently changed to have read-only EPCM permissions. The kernel
> >>>>>> +continues to maintain correct access to the enclave page through the
> >>>>>> +PTE that will ensure that only access allowed by both the VMA
> >>>>>> +and EPCM permissions are permitted.
> >>>>>> +
> >>>>>>  Application interface
> >>>>>>  =====================
> >>>>>>  
> >>>>>> diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
> >>>>>> index 48afe96ae0f0..b6105d9e7c46 100644
> >>>>>> --- a/arch/x86/kernel/cpu/sgx/encl.c
> >>>>>> +++ b/arch/x86/kernel/cpu/sgx/encl.c
> >>>>>> @@ -91,10 +91,8 @@ static struct sgx_epc_page *sgx_encl_eldu(struct sgx_encl_page *encl_page,
> >>>>>>  }
> >>>>>>  
> >>>>>>  static struct sgx_encl_page *sgx_encl_load_page(struct sgx_encl *encl,
> >>>>>> -						unsigned long addr,
> >>>>>> -						unsigned long vm_flags)
> >>>>>> +						unsigned long addr)
> >>>>>>  {
> >>>>>> -	unsigned long vm_prot_bits = vm_flags & (VM_READ | VM_WRITE | VM_EXEC);
> >>>>>>  	struct sgx_epc_page *epc_page;
> >>>>>>  	struct sgx_encl_page *entry;
> >>>>>>  
> >>>>>> @@ -102,14 +100,6 @@ static struct sgx_encl_page *sgx_encl_load_page(struct sgx_encl *encl,
> >>>>>>  	if (!entry)
> >>>>>>  		return ERR_PTR(-EFAULT);
> >>>>>>  
> >>>>>> -	/*
> >>>>>> -	 * Verify that the faulted page has equal or higher build time
> >>>>>> -	 * permissions than the VMA permissions (i.e. the subset of {VM_READ,
> >>>>>> -	 * VM_WRITE, VM_EXECUTE} in vma->vm_flags).
> >>>>>> -	 */
> >>>>>> -	if ((entry->vm_max_prot_bits & vm_prot_bits) != vm_prot_bits)
> >>>>>> -		return ERR_PTR(-EFAULT);
> >>>>>> -
> >>>>>>  	/* Entry successfully located. */
> >>>>>>  	if (entry->epc_page) {
> >>>>>>  		if (entry->desc & SGX_ENCL_PAGE_BEING_RECLAIMED)
> >>>>>> @@ -138,7 +128,9 @@ static vm_fault_t sgx_vma_fault(struct vm_fault *vmf)
> >>>>>>  {
> >>>>>>  	unsigned long addr = (unsigned long)vmf->address;
> >>>>>>  	struct vm_area_struct *vma = vmf->vma;
> >>>>>> +	unsigned long page_prot_bits;
> >>>>>>  	struct sgx_encl_page *entry;
> >>>>>> +	unsigned long vm_prot_bits;
> >>>>>>  	unsigned long phys_addr;
> >>>>>>  	struct sgx_encl *encl;
> >>>>>>  	vm_fault_t ret;
> >>>>>> @@ -155,7 +147,7 @@ static vm_fault_t sgx_vma_fault(struct vm_fault *vmf)
> >>>>>>  
> >>>>>>  	mutex_lock(&encl->lock);
> >>>>>>  
> >>>>>> -	entry = sgx_encl_load_page(encl, addr, vma->vm_flags);
> >>>>>> +	entry = sgx_encl_load_page(encl, addr);
> >>>>>>  	if (IS_ERR(entry)) {
> >>>>>>  		mutex_unlock(&encl->lock);
> >>>>>   
> >>>>>> @@ -167,7 +159,19 @@ static vm_fault_t sgx_vma_fault(struct vm_fault *vmf)
> >>>>>>  
> >>>>>>  	phys_addr = sgx_get_epc_phys_addr(entry->epc_page);
> >>>>>>  
> >>>>>> -	ret = vmf_insert_pfn(vma, addr, PFN_DOWN(phys_addr));
> >>>>>> +	/*
> >>>>>> +	 * Insert PTE to match the EPCM page permissions ensured to not
> >>>>>> +	 * exceed the VMA permissions.
> >>>>>> +	 */
> >>>>>> +	vm_prot_bits = vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC);
> >>>>>> +	page_prot_bits = entry->vm_max_prot_bits & vm_prot_bits;
> >>>>>> +	/*
> >>>>>> +	 * Add VM_SHARED so that PTE is made writable right away if VMA
> >>>>>> +	 * and EPCM are writable (no COW in SGX).
> >>>>>> +	 */
> >>>>>> +	page_prot_bits |= (vma->vm_flags & VM_SHARED);
> >>>>>> +	ret = vmf_insert_pfn_prot(vma, addr, PFN_DOWN(phys_addr),
> >>>>>> +				  vm_get_page_prot(page_prot_bits));
> >>>>>>  	if (ret != VM_FAULT_NOPAGE) {
> >>>>>>  		mutex_unlock(&encl->lock);
> >>>>>>  
> >>>>>> @@ -295,15 +299,14 @@ static int sgx_encl_debug_write(struct sgx_encl *encl, struct sgx_encl_page *pag
> >>>>>>   * Load an enclave page to EPC if required, and take encl->lock.
> >>>>>>   */
> >>>>>>  static struct sgx_encl_page *sgx_encl_reserve_page(struct sgx_encl *encl,
> >>>>>> -						   unsigned long addr,
> >>>>>> -						   unsigned long vm_flags)
> >>>>>> +						   unsigned long addr)
> >>>>>>  {
> >>>>>>  	struct sgx_encl_page *entry;
> >>>>>>  
> >>>>>>  	for ( ; ; ) {
> >>>>>>  		mutex_lock(&encl->lock);
> >>>>>>  
> >>>>>> -		entry = sgx_encl_load_page(encl, addr, vm_flags);
> >>>>>> +		entry = sgx_encl_load_page(encl, addr);
> >>>>>>  		if (PTR_ERR(entry) != -EBUSY)
> >>>>>>  			break;
> >>>>>>  
> >>>>>> @@ -339,8 +342,7 @@ static int sgx_vma_access(struct vm_area_struct *vma, unsigned long addr,
> >>>>>>  		return -EFAULT;
> >>>>>>  
> >>>>>>  	for (i = 0; i < len; i += cnt) {
> >>>>>> -		entry = sgx_encl_reserve_page(encl, (addr + i) & PAGE_MASK,
> >>>>>> -					      vma->vm_flags);
> >>>>>> +		entry = sgx_encl_reserve_page(encl, (addr + i) & PAGE_MASK);
> >>>>>>  		if (IS_ERR(entry)) {
> >>>>>>  			ret = PTR_ERR(entry);
> >>>>>>  			break;
> >>>>>> -- 
> >>>>>> 2.25.1
> >>>>>>
> >>>>>
> >>>>> If you unconditionally set vm_max_prot_bits to RWX for dynamically created
> >>>>> pages, you would not need to do this.
> >>>>>
> >>>>> These patches could be then safely dropped then:
> >>>>>
> >>>>> - [PATCH V2 06/32] x86/sgx: Support VMA permissions more relaxed than enclave permissions 
> >>>>> - [PATCH V2 08/32] x86/sgx: x86/sgx: Add sgx_encl_page->vm_run_prot_bits for dynamic permission changes
> >>>>> - [PATCH V2 15/32] x86/sgx: Support relaxing of enclave page permissions
> >>>>>
> >>>>> And that would also keep full ABI compatibility without exceptions to the
> >>>>> existing mainline code.
> >>>>>
> >>>>
> >>>> Dropping these changes does not just impact dynamically created pages. Dropping
> >>>> these patches would result in EPCM page permission restriction being supported
> >>>> for all pages, those added before enclave initialization as well as dynamically
> >>>> added pages, but their PTEs will not be impacted.
> >>>>
> >>>> For example, if a RW enclave page is added via SGX_IOC_ENCLAVE_ADD_PAGES and
> >>>> then later made read-only via SGX_IOC_ENCLAVE_RESTRICT_PERMISSIONS then Linux
> >>>> would keep allowing and installing RW PTEs to this page.
> >>>
> >>> I think that would be perfectly fine, if someone wants to do that. There is
> >>> no collateral damage in doing that. The kernel does not get messed up
> >>> because of that. It's a use case that does not make sense in the first
> >>> place, so it'd be stupid to build anything extensive around it in the kernel.
> >>>
> >>> Shooting yourself in the foot is something that the kernel does not and
> >>> should not protect user space from, unless there is a risk of messing up
> >>> the state of the kernel itself.
> >>>
> >>> Much worse is that we have e.g. the completely artificial ioctl
> >>> SGX_IOC_ENCLAVE_RELAX_PERMISSIONS to support this scheme, which could e.g.
> >>> cause extra roundtrips for a simple EMODPE.
> >>>
> >>> Also this means not having to include 06/32, which keeps 100% backwards
> >>> compatibility in run-time behaviour with the mainline while not restricting
> >>> dynamically created pages at all. And we get rid of the complex bookkeeping
> >>> of vm_run_prot_bits.
> >>>
> >>> And generally the whole model is then very easy to understand and explain.
> >>> If I had to give a presentation on the current mess in the patch set at a
> >>> conference, I can honestly say that I would be in serious trouble. It's
> >>> not a clean and clear security model, which is a risk by itself.
> >>
> >> I.e.
> >>
> >> 1. For EADD'd pages: stick to what has been the invariant for 1.5 years
> >>    now. Do not change it by any means (e.g. 06/32).
> >> 2. For EAUG'd pages: set vm_max_prot_bits to RWX, which essentially means
> >>    do whatever you want with PTEs and the EPCM.
> >>
> >> It's a clear and understandable model that does nothing bad to the kernel,
> >> and a run-time developer can surely find a way to get things going. For
> >> user space, the most important thing is clarity in kernel behaviour,
> >> and this does deliver that clarity. It's not perfect, but it does do the
> >> job and anyone can get it.
> > 
> > Also a quantitative argument for this is that by simplifying the security
> > model this way there is one ioctl less, which must be considered a +1. We do
> > not want to add new ioctls unless it is something we absolutely cannot live
> > without. We absolutely can live without SGX_IOC_ENCLAVE_RELAX_PERMISSIONS.
> > 
> 
> OK, with the implications understood and accepted I will proceed with a new
> series that separates EPCM permissions from PTEs and makes RWX PTEs possible
> by default for EAUG pages. This has broader impact than just removing
> the three patches you list. "[PATCH 07/32] x86/sgx: Add pfn_mkwrite() handler
> for present PTEs" is also no longer needed and there is no longer a need
> to flush PTEs after restricting permissions. New changes also need to
> be considered - at least the current documentation. I'll rework the series.

Yes, I really think it is a solid plan. Any possible LSM hooks would most
likely attach to the build product, not the dynamic behaviour.

As far as the page fault handler goes, Haitao is correct after all the
discussions that it makes sense. The purpose of the MAP_POPULATE series is
not to replace it but to complement it. Just wanted to clear this
up, as I said otherwise earlier this week.

Thank you.

BR, Jarkko
Reinette Chatre March 8, 2022, 5:49 p.m. UTC | #8
Hi Jarkko,

On 3/8/2022 9:00 AM, Jarkko Sakkinen wrote:
> On Tue, Mar 08, 2022 at 08:04:33AM -0800, Reinette Chatre wrote:
>> Hi Jarkko,
>>
>> On 3/8/2022 1:12 AM, Jarkko Sakkinen wrote:
>>> On Tue, Mar 08, 2022 at 11:06:46AM +0200, Jarkko Sakkinen wrote:
>>>> On Tue, Mar 08, 2022 at 10:14:42AM +0200, Jarkko Sakkinen wrote:
>>>>> On Mon, Mar 07, 2022 at 09:36:36AM -0800, Reinette Chatre wrote:
>>>>>> Hi Jarkko,
>>>>>>
>>>>>> On 3/7/2022 9:10 AM, Jarkko Sakkinen wrote:
>>>>>>> On Mon, Feb 07, 2022 at 04:45:28PM -0800, Reinette Chatre wrote:
>>>>>>>> === Summary ===
>>>>>>>>
>>>>>>>> An SGX VMA can only be created if its permissions are the same or
>>>>>>>> weaker than the Enclave Page Cache Map (EPCM) permissions. After VMA
>>>>>>>> creation this same rule is again enforced by the page fault handler:
>>>>>>>> faulted enclave pages are required to have equal or more relaxed
>>>>>>>> EPCM permissions than the VMA permissions.
>>>>>>>>
>>>>>>>> On SGX1 systems the additional enforcement in the page fault handler
>>>>>>>> is redundant and on SGX2 systems it incorrectly prevents access.
>>>>>>>> On SGX1 systems it is unnecessary to repeat the enforcement of the
>>>>>>>> permission rule. The rule used during original VMA creation will
>>>>>>>> ensure that any access attempt will use correct permissions.
>>>>>>>> With SGX2 the EPCM permissions of a page can change after VMA
>>>>>>>> creation resulting in the VMA permissions potentially being more
>>>>>>>> relaxed than the EPCM permissions and the page fault handler
>>>>>>>> incorrectly blocking valid access attempts.
>>>>>>>>
>>>>>>>> Enable the VMA's pages to remain accessible while ensuring that
>>>>>>>> the PTEs are installed to match the EPCM permissions but not be
>>>>>>>> more relaxed than the VMA permissions.
>>>>>>>>
>>>>>>>> === Full Changelog ===
>>>>>>>>
>>>>>>>> An SGX enclave is an area of memory where parts of an application
>>>>>>>> can reside. First an enclave is created and loaded (from
>>>>>>>> non-enclave memory) with the code and data of an application,
>>>>>>>> then user space can map (mmap()) the enclave memory to
>>>>>>>> be able to enter the enclave at its defined entry points for
>>>>>>>> execution within it.
>>>>>>>>
>>>>>>>> The hardware maintains a secure structure, the Enclave Page Cache Map
>>>>>>>> (EPCM), that tracks the contents of the enclave. Of interest here is
>>>>>>>> its tracking of the enclave page permissions. When a page is loaded
>>>>>>>> into the enclave its permissions are specified and recorded in the
>>>>>>>> EPCM. In parallel the kernel maintains permissions within the
>>>>>>>> page table entries (PTEs) and the rule is that PTE permissions
>>>>>>>> are not allowed to be more relaxed than the EPCM permissions.
>>>>>>>>
>>>>>>>> A new mapping (mmap()) of enclave memory can only succeed if the
>>>>>>>> mapping has the same or weaker permissions than the permissions that
>>>>>>>> were vetted during enclave creation. This is enforced by
>>>>>>>> sgx_encl_may_map() that is called on the mmap() as well as mprotect()
>>>>>>>> paths. This rule remains.
>>>>>>>>
>>>>>>>> One feature of SGX2 is to support the modification of EPCM permissions
>>>>>>>> after enclave initialization. Enclave pages may thus already be part
>>>>>>>> of a VMA at the time their EPCM permissions are changed resulting
>>>>>>>> in the VMA's permissions potentially being more relaxed than the EPCM
>>>>>>>> permissions.
>>>>>>>>
>>>>>>>> Allow permissions of existing VMAs to be more relaxed than EPCM
>>>>>>>> permissions in preparation for dynamic EPCM permission changes
>>>>>>>> made possible in SGX2.  New VMAs that attempt to have more relaxed
>>>>>>>> permissions than EPCM permissions continue to be unsupported.
>>>>>>>>
>>>>>>>> Reasons why permissions of existing VMAs are allowed to be more relaxed
>>>>>>>> than EPCM permissions instead of dynamically changing VMA permissions
>>>>>>>> when EPCM permissions change are:
>>>>>>>> 1) Changing VMA permissions involves splitting VMAs, which is an
>>>>>>>>    operation that can fail. Additionally changing EPCM permissions of
>>>>>>>>    a range of pages could also fail on any of the pages involved.
>>>>>>>>    Handling these error cases causes problems. For example, if an
>>>>>>>>    EPCM permission change fails and the VMA has already been split
>>>>>>>>    then it is not possible to undo the VMA split nor possible to
>>>>>>>>    undo the EPCM permission changes that did succeed before the
>>>>>>>>    failure.
>>>>>>>> 2) The kernel has little insight into the user space where EPCM
>>>>>>>>    permissions are controlled from. For example, a RW page may
>>>>>>>>    be made RO just before it is made RX and splitting the VMAs
>>>>>>>>    while the VMAs may change soon is unnecessary.
>>>>>>>>
>>>>>>>> Remove the extra permission check called on a page fault
>>>>>>>> (vm_operations_struct->fault) or during debugging
>>>>>>>> (vm_operations_struct->access) when loading the enclave page from swap
>>>>>>>> that ensures that the VMA permissions are not more relaxed than the
>>>>>>>> EPCM permissions. Since a VMA could only exist if it passed the
>>>>>>>> original permission checks during mmap() and a VMA may indeed
>>>>>>>> have more relaxed permissions than the EPCM permissions this extra
>>>>>>>> permission check is no longer appropriate.
>>>>>>>>
>>>>>>>> With the permission check removed, ensure that PTEs do
>>>>>>>> not blindly inherit the VMA permissions but instead the permissions
>>>>>>>> that the VMA and EPCM agree on. PTEs for writable pages (from VMA
>>>>>>>> and enclave perspective) are installed with the writable bit set,
>>>>>>>> reducing the need for this additional flow to the permission mismatch
>>>>>>>> cases handled next.
>>>>>>>>
>>>>>>>> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
>>>>>>>> ---
>>>>>>>> Changes since V1:
>>>>>>>> - Reword commit message (Jarkko).
>>>>>>>> - Use "relax" instead of "exceed" when referring to permissions (Dave).
>>>>>>>> - Add snippet to Documentation/x86/sgx.rst that highlights the
>>>>>>>>   relationship between VMA, EPCM, and PTE permissions on SGX
>>>>>>>>   systems (Andy).
>>>>>>>>
>>>>>>>>  Documentation/x86/sgx.rst      | 10 +++++++++
>>>>>>>>  arch/x86/kernel/cpu/sgx/encl.c | 38 ++++++++++++++++++----------------
>>>>>>>>  2 files changed, 30 insertions(+), 18 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/Documentation/x86/sgx.rst b/Documentation/x86/sgx.rst
>>>>>>>> index 89ff924b1480..5659932728a5 100644
>>>>>>>> --- a/Documentation/x86/sgx.rst
>>>>>>>> +++ b/Documentation/x86/sgx.rst
>>>>>>>> @@ -99,6 +99,16 @@ The relationships between the different permission masks are:
>>>>>>>>  * PTEs are installed to match the EPCM permissions, but not be more
>>>>>>>>    relaxed than the VMA permissions.
>>>>>>>>  
>>>>>>>> +On systems supporting SGX2, EPCM permissions may change while the
>>>>>>>> +enclave page belongs to a VMA without impacting the VMA permissions.
>>>>>>>> +This means that a running VMA may appear to allow access to an enclave
>>>>>>>> +page that is not allowed by its EPCM permissions. For example, when an
>>>>>>>> +enclave page with RW EPCM permissions is mapped by a RW VMA but is
>>>>>>>> +subsequently changed to have read-only EPCM permissions. The kernel
>>>>>>>> +continues to maintain correct access to the enclave page through the
>>>>>>>> +PTE that will ensure that only access allowed by both the VMA
>>>>>>>> +and EPCM permissions is permitted.
>>>>>>>> +
>>>>>>>>  Application interface
>>>>>>>>  =====================
>>>>>>>>  
>>>>>>>> diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
>>>>>>>> index 48afe96ae0f0..b6105d9e7c46 100644
>>>>>>>> --- a/arch/x86/kernel/cpu/sgx/encl.c
>>>>>>>> +++ b/arch/x86/kernel/cpu/sgx/encl.c
>>>>>>>> @@ -91,10 +91,8 @@ static struct sgx_epc_page *sgx_encl_eldu(struct sgx_encl_page *encl_page,
>>>>>>>>  }
>>>>>>>>  
>>>>>>>>  static struct sgx_encl_page *sgx_encl_load_page(struct sgx_encl *encl,
>>>>>>>> -						unsigned long addr,
>>>>>>>> -						unsigned long vm_flags)
>>>>>>>> +						unsigned long addr)
>>>>>>>>  {
>>>>>>>> -	unsigned long vm_prot_bits = vm_flags & (VM_READ | VM_WRITE | VM_EXEC);
>>>>>>>>  	struct sgx_epc_page *epc_page;
>>>>>>>>  	struct sgx_encl_page *entry;
>>>>>>>>  
>>>>>>>> @@ -102,14 +100,6 @@ static struct sgx_encl_page *sgx_encl_load_page(struct sgx_encl *encl,
>>>>>>>>  	if (!entry)
>>>>>>>>  		return ERR_PTR(-EFAULT);
>>>>>>>>  
>>>>>>>> -	/*
>>>>>>>> -	 * Verify that the faulted page has equal or higher build time
>>>>>>>> -	 * permissions than the VMA permissions (i.e. the subset of {VM_READ,
>>>>>>>> -	 * VM_WRITE, VM_EXECUTE} in vma->vm_flags).
>>>>>>>> -	 */
>>>>>>>> -	if ((entry->vm_max_prot_bits & vm_prot_bits) != vm_prot_bits)
>>>>>>>> -		return ERR_PTR(-EFAULT);
>>>>>>>> -
>>>>>>>>  	/* Entry successfully located. */
>>>>>>>>  	if (entry->epc_page) {
>>>>>>>>  		if (entry->desc & SGX_ENCL_PAGE_BEING_RECLAIMED)
>>>>>>>> @@ -138,7 +128,9 @@ static vm_fault_t sgx_vma_fault(struct vm_fault *vmf)
>>>>>>>>  {
>>>>>>>>  	unsigned long addr = (unsigned long)vmf->address;
>>>>>>>>  	struct vm_area_struct *vma = vmf->vma;
>>>>>>>> +	unsigned long page_prot_bits;
>>>>>>>>  	struct sgx_encl_page *entry;
>>>>>>>> +	unsigned long vm_prot_bits;
>>>>>>>>  	unsigned long phys_addr;
>>>>>>>>  	struct sgx_encl *encl;
>>>>>>>>  	vm_fault_t ret;
>>>>>>>> @@ -155,7 +147,7 @@ static vm_fault_t sgx_vma_fault(struct vm_fault *vmf)
>>>>>>>>  
>>>>>>>>  	mutex_lock(&encl->lock);
>>>>>>>>  
>>>>>>>> -	entry = sgx_encl_load_page(encl, addr, vma->vm_flags);
>>>>>>>> +	entry = sgx_encl_load_page(encl, addr);
>>>>>>>>  	if (IS_ERR(entry)) {
>>>>>>>>  		mutex_unlock(&encl->lock);
>>>>>>>   
>>>>>>>> @@ -167,7 +159,19 @@ static vm_fault_t sgx_vma_fault(struct vm_fault *vmf)
>>>>>>>>  
>>>>>>>>  	phys_addr = sgx_get_epc_phys_addr(entry->epc_page);
>>>>>>>>  
>>>>>>>> -	ret = vmf_insert_pfn(vma, addr, PFN_DOWN(phys_addr));
>>>>>>>> +	/*
>>>>>>>> +	 * Insert PTE to match the EPCM page permissions ensured to not
>>>>>>>> +	 * exceed the VMA permissions.
>>>>>>>> +	 */
>>>>>>>> +	vm_prot_bits = vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC);
>>>>>>>> +	page_prot_bits = entry->vm_max_prot_bits & vm_prot_bits;
>>>>>>>> +	/*
>>>>>>>> +	 * Add VM_SHARED so that PTE is made writable right away if VMA
>>>>>>>> +	 * and EPCM are writable (no COW in SGX).
>>>>>>>> +	 */
>>>>>>>> +	page_prot_bits |= (vma->vm_flags & VM_SHARED);
>>>>>>>> +	ret = vmf_insert_pfn_prot(vma, addr, PFN_DOWN(phys_addr),
>>>>>>>> +				  vm_get_page_prot(page_prot_bits));
>>>>>>>>  	if (ret != VM_FAULT_NOPAGE) {
>>>>>>>>  		mutex_unlock(&encl->lock);
>>>>>>>>  
>>>>>>>> @@ -295,15 +299,14 @@ static int sgx_encl_debug_write(struct sgx_encl *encl, struct sgx_encl_page *pag
>>>>>>>>   * Load an enclave page to EPC if required, and take encl->lock.
>>>>>>>>   */
>>>>>>>>  static struct sgx_encl_page *sgx_encl_reserve_page(struct sgx_encl *encl,
>>>>>>>> -						   unsigned long addr,
>>>>>>>> -						   unsigned long vm_flags)
>>>>>>>> +						   unsigned long addr)
>>>>>>>>  {
>>>>>>>>  	struct sgx_encl_page *entry;
>>>>>>>>  
>>>>>>>>  	for ( ; ; ) {
>>>>>>>>  		mutex_lock(&encl->lock);
>>>>>>>>  
>>>>>>>> -		entry = sgx_encl_load_page(encl, addr, vm_flags);
>>>>>>>> +		entry = sgx_encl_load_page(encl, addr);
>>>>>>>>  		if (PTR_ERR(entry) != -EBUSY)
>>>>>>>>  			break;
>>>>>>>>  
>>>>>>>> @@ -339,8 +342,7 @@ static int sgx_vma_access(struct vm_area_struct *vma, unsigned long addr,
>>>>>>>>  		return -EFAULT;
>>>>>>>>  
>>>>>>>>  	for (i = 0; i < len; i += cnt) {
>>>>>>>> -		entry = sgx_encl_reserve_page(encl, (addr + i) & PAGE_MASK,
>>>>>>>> -					      vma->vm_flags);
>>>>>>>> +		entry = sgx_encl_reserve_page(encl, (addr + i) & PAGE_MASK);
>>>>>>>>  		if (IS_ERR(entry)) {
>>>>>>>>  			ret = PTR_ERR(entry);
>>>>>>>>  			break;
>>>>>>>> -- 
>>>>>>>> 2.25.1
>>>>>>>>
>>>>>>>
>>>>>>> If you unconditionally set vm_max_prot_bits to RWX for dynamically created
> >>>>>>> pages, you would not need to do this.
>>>>>>>
>>>>>>> These patches could be then safely dropped then:
>>>>>>>
>>>>>>> - [PATCH V2 06/32] x86/sgx: Support VMA permissions more relaxed than enclave permissions 
>>>>>>> - [PATCH V2 08/32] x86/sgx: x86/sgx: Add sgx_encl_page->vm_run_prot_bits for dynamic permission changes
>>>>>>> - [PATCH V2 15/32] x86/sgx: Support relaxing of enclave page permissions
>>>>>>>
>>>>>>> And that would also keep full ABI compatibility without exceptions to the
>>>>>>> existing mainline code.
>>>>>>>
>>>>>>
> >>>>>> Dropping these changes does not just impact dynamically created pages. Dropping
>>>>>> these patches would result in EPCM page permission restriction being supported
>>>>>> for all pages, those added before enclave initialization as well as dynamically
>>>>>> added pages, but their PTEs will not be impacted.
>>>>>>
>>>>>> For example, if a RW enclave page is added via SGX_IOC_ENCLAVE_ADD_PAGES and
>>>>>> then later made read-only via SGX_IOC_ENCLAVE_RESTRICT_PERMISSIONS then Linux
>>>>>> would keep allowing and installing RW PTEs to this page.
>>>>>
>>>>> I think that would be perfectly fine, if someone wants to do that. There is
> >>>>> no collateral damage in doing that. The kernel does not get messed up
> >>>>> because of that. It's a use case that does not make sense in the first
> >>>>> place, so it'd be stupid to build anything extensive around it in the kernel.
>>>>>
> >>>>> Shooting yourself in the foot is something that the kernel does not and
> >>>>> should not protect user space from, unless there is a risk of messing up
> >>>>> the state of the kernel itself.
>>>>>
> >>>>> Much worse is that we have e.g. the completely artificial ioctl
> >>>>> SGX_IOC_ENCLAVE_RELAX_PERMISSIONS to support this scheme, which could e.g.
> >>>>> cause extra roundtrips for a simple EMODPE.
>>>>>
> >>>>> Also this means not having to include 06/32, which keeps 100% backwards
> >>>>> compatibility in run-time behaviour with the mainline while not restricting
> >>>>> dynamically created pages at all. And we get rid of the complex bookkeeping
> >>>>> of vm_run_prot_bits.
>>>>>
>>>>> And generally the whole model is then very easy to understand and explain.
> >>>>> If I had to give a presentation on the current mess in the patch set at a
> >>>>> conference, I can honestly say that I would be in serious trouble. It's
> >>>>> not a clean and clear security model, which is a risk by itself.
>>>>
>>>> I.e.
>>>>
> >>>> 1. For EADD'd pages: stick to what has been the invariant for 1.5 years
> >>>>    now. Do not change it by any means (e.g. 06/32).
> >>>> 2. For EAUG'd pages: set vm_max_prot_bits to RWX, which essentially means
> >>>>    do whatever you want with PTEs and the EPCM.
>>>>
> >>>> It's a clear and understandable model that does nothing bad to the kernel,
> >>>> and a run-time developer can surely find a way to get things going. For
> >>>> user space, the most important thing is clarity in kernel behaviour,
> >>>> and this does deliver that clarity. It's not perfect, but it does do the
> >>>> job and anyone can get it.
>>>
> >>> Also a quantitative argument for this is that by simplifying the security
> >>> model this way there is one ioctl less, which must be considered a +1. We do
> >>> not want to add new ioctls unless it is something we absolutely cannot live
> >>> without. We absolutely can live without SGX_IOC_ENCLAVE_RELAX_PERMISSIONS.
>>>
>>
> >> OK, with the implications understood and accepted I will proceed with a new
> >> series that separates EPCM permissions from PTEs and makes RWX PTEs possible
> >> by default for EAUG pages. This has broader impact than just removing
>> the three patches you list. "[PATCH 07/32] x86/sgx: Add pfn_mkwrite() handler
>> for present PTEs" is also no longer needed and there is no longer a need
>> to flush PTEs after restricting permissions. New changes also need to
>> be considered - at least the current documentation. I'll rework the series.
> 
> Yes, I really think it is a solid plan. Any possible LSM hooks would most
> likely attach to the build product, not the dynamic behaviour.
> 
> As far as the page fault handler goes, Haitao is correct after all the
> discussions that it makes sense. The purpose of the MAP_POPULATE series is
> not to replace it but to complement it. Just wanted to clear this
> up, as I said otherwise earlier this week.
> 

Understood. I will keep the implementation where EAUG is done in the page fault
handler. I do plan to pick up your patch "x86/sgx: Export sgx_encl_page_alloc()"
since a consequence of the other changes is that this can now be shared.

Reinette
Jarkko Sakkinen March 8, 2022, 6:46 p.m. UTC | #9
On Tue, Mar 08, 2022 at 09:49:01AM -0800, Reinette Chatre wrote:
> Hi Jarkko,
> 
> On 3/8/2022 9:00 AM, Jarkko Sakkinen wrote:
> > On Tue, Mar 08, 2022 at 08:04:33AM -0800, Reinette Chatre wrote:
> >> Hi Jarkko,
> >>
> >> On 3/8/2022 1:12 AM, Jarkko Sakkinen wrote:
> >>> On Tue, Mar 08, 2022 at 11:06:46AM +0200, Jarkko Sakkinen wrote:
> >>>> On Tue, Mar 08, 2022 at 10:14:42AM +0200, Jarkko Sakkinen wrote:
> >>>>> On Mon, Mar 07, 2022 at 09:36:36AM -0800, Reinette Chatre wrote:
> >>>>>> Hi Jarkko,
> >>>>>>
> >>>>>> On 3/7/2022 9:10 AM, Jarkko Sakkinen wrote:
> >>>>>>> On Mon, Feb 07, 2022 at 04:45:28PM -0800, Reinette Chatre wrote:
> >>>>>>>> === Summary ===
> >>>>>>>>
> >>>>>>>> An SGX VMA can only be created if its permissions are the same or
> >>>>>>>> weaker than the Enclave Page Cache Map (EPCM) permissions. After VMA
> >>>>>>>> creation this same rule is again enforced by the page fault handler:
> >>>>>>>> faulted enclave pages are required to have equal or more relaxed
> >>>>>>>> EPCM permissions than the VMA permissions.
> >>>>>>>>
> >>>>>>>> On SGX1 systems the additional enforcement in the page fault handler
> >>>>>>>> is redundant and on SGX2 systems it incorrectly prevents access.
> >>>>>>>> On SGX1 systems it is unnecessary to repeat the enforcement of the
> >>>>>>>> permission rule. The rule used during original VMA creation will
> >>>>>>>> ensure that any access attempt will use correct permissions.
> >>>>>>>> With SGX2 the EPCM permissions of a page can change after VMA
> >>>>>>>> creation resulting in the VMA permissions potentially being more
> >>>>>>>> relaxed than the EPCM permissions and the page fault handler
> >>>>>>>> incorrectly blocking valid access attempts.
> >>>>>>>>
> >>>>>>>> Enable the VMA's pages to remain accessible while ensuring that
> >>>>>>>> the PTEs are installed to match the EPCM permissions but not be
> >>>>>>>> more relaxed than the VMA permissions.
> >>>>>>>>
> >>>>>>>> === Full Changelog ===
> >>>>>>>>
> >>>>>>>> An SGX enclave is an area of memory where parts of an application
> >>>>>>>> can reside. First an enclave is created and loaded (from
> >>>>>>>> non-enclave memory) with the code and data of an application,
> >>>>>>>> then user space can map (mmap()) the enclave memory to
> >>>>>>>> be able to enter the enclave at its defined entry points for
> >>>>>>>> execution within it.
> >>>>>>>>
> >>>>>>>> The hardware maintains a secure structure, the Enclave Page Cache Map
> >>>>>>>> (EPCM), that tracks the contents of the enclave. Of interest here is
> >>>>>>>> its tracking of the enclave page permissions. When a page is loaded
> >>>>>>>> into the enclave its permissions are specified and recorded in the
> >>>>>>>> EPCM. In parallel the kernel maintains permissions within the
> >>>>>>>> page table entries (PTEs) and the rule is that PTE permissions
> >>>>>>>> are not allowed to be more relaxed than the EPCM permissions.
> >>>>>>>>
> >>>>>>>> A new mapping (mmap()) of enclave memory can only succeed if the
> >>>>>>>> mapping has the same or weaker permissions than the permissions that
> >>>>>>>> were vetted during enclave creation. This is enforced by
> >>>>>>>> sgx_encl_may_map() that is called on the mmap() as well as mprotect()
> >>>>>>>> paths. This rule remains.
> >>>>>>>>
> >>>>>>>> One feature of SGX2 is to support the modification of EPCM permissions
> >>>>>>>> after enclave initialization. Enclave pages may thus already be part
> >>>>>>>> of a VMA at the time their EPCM permissions are changed resulting
> >>>>>>>> in the VMA's permissions potentially being more relaxed than the EPCM
> >>>>>>>> permissions.
> >>>>>>>>
> >>>>>>>> Allow permissions of existing VMAs to be more relaxed than EPCM
> >>>>>>>> permissions in preparation for dynamic EPCM permission changes
> >>>>>>>> made possible in SGX2.  New VMAs that attempt to have more relaxed
> >>>>>>>> permissions than EPCM permissions continue to be unsupported.
> >>>>>>>>
> >>>>>>>> Reasons why permissions of existing VMAs are allowed to be more relaxed
> >>>>>>>> than EPCM permissions instead of dynamically changing VMA permissions
> >>>>>>>> when EPCM permissions change are:
> >>>>>>>> 1) Changing VMA permissions involves splitting VMAs, which is an
> >>>>>>>>    operation that can fail. Additionally changing EPCM permissions of
> >>>>>>>>    a range of pages could also fail on any of the pages involved.
> >>>>>>>>    Handling these error cases causes problems. For example, if an
> >>>>>>>>    EPCM permission change fails and the VMA has already been split
> >>>>>>>>    then it is not possible to undo the VMA split nor possible to
> >>>>>>>>    undo the EPCM permission changes that did succeed before the
> >>>>>>>>    failure.
> >>>>>>>> 2) The kernel has little insight into the user space where EPCM
> >>>>>>>>    permissions are controlled from. For example, a RW page may
> >>>>>>>>    be made RO just before it is made RX and splitting the VMAs
> >>>>>>>>    while the VMAs may change soon is unnecessary.
> >>>>>>>>
> >>>>>>>> Remove the extra permission check called on a page fault
> >>>>>>>> (vm_operations_struct->fault) or during debugging
> >>>>>>>> (vm_operations_struct->access) when loading the enclave page from swap
> >>>>>>>> that ensures that the VMA permissions are not more relaxed than the
> >>>>>>>> EPCM permissions. Since a VMA could only exist if it passed the
> >>>>>>>> original permission checks during mmap() and a VMA may indeed
> >>>>>>>> have more relaxed permissions than the EPCM permissions this extra
> >>>>>>>> permission check is no longer appropriate.
> >>>>>>>>
> >>>>>>>> With the permission check removed, ensure that PTEs do
> >>>>>>>> not blindly inherit the VMA permissions but instead the permissions
> >>>>>>>> that the VMA and EPCM agree on. PTEs for writable pages (from VMA
> >>>>>>>> and enclave perspective) are installed with the writable bit set,
> >>>>>>>> reducing the need for this additional flow to the permission mismatch
> >>>>>>>> cases handled next.
> >>>>>>>>
> >>>>>>>> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
> >>>>>>>> ---
> >>>>>>>> Changes since V1:
> >>>>>>>> - Reword commit message (Jarkko).
> >>>>>>>> - Use "relax" instead of "exceed" when referring to permissions (Dave).
> >>>>>>>> - Add snippet to Documentation/x86/sgx.rst that highlights the
> >>>>>>>>   relationship between VMA, EPCM, and PTE permissions on SGX
> >>>>>>>>   systems (Andy).
> >>>>>>>>
> >>>>>>>>  Documentation/x86/sgx.rst      | 10 +++++++++
> >>>>>>>>  arch/x86/kernel/cpu/sgx/encl.c | 38 ++++++++++++++++++----------------
> >>>>>>>>  2 files changed, 30 insertions(+), 18 deletions(-)
> >>>>>>>>
> >>>>>>>> diff --git a/Documentation/x86/sgx.rst b/Documentation/x86/sgx.rst
> >>>>>>>> index 89ff924b1480..5659932728a5 100644
> >>>>>>>> --- a/Documentation/x86/sgx.rst
> >>>>>>>> +++ b/Documentation/x86/sgx.rst
> >>>>>>>> @@ -99,6 +99,16 @@ The relationships between the different permission masks are:
> >>>>>>>>  * PTEs are installed to match the EPCM permissions, but not be more
> >>>>>>>>    relaxed than the VMA permissions.
> >>>>>>>>  
> >>>>>>>> +On systems supporting SGX2, EPCM permissions may change while the
> >>>>>>>> +enclave page belongs to a VMA without impacting the VMA permissions.
> >>>>>>>> +This means that a running VMA may appear to allow access to an enclave
> >>>>>>>> +page that is not allowed by its EPCM permissions. For example, when an
> >>>>>>>> +enclave page with RW EPCM permissions is mapped by a RW VMA but is
> >>>>>>>> +subsequently changed to have read-only EPCM permissions. The kernel
> >>>>>>>> +continues to maintain correct access to the enclave page through the
> >>>>>>>> +PTE that will ensure that only access allowed by both the VMA
> >>>>>>>> +and EPCM permissions is permitted.
> >>>>>>>> +
> >>>>>>>>  Application interface
> >>>>>>>>  =====================
> >>>>>>>>  
> >>>>>>>> diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
> >>>>>>>> index 48afe96ae0f0..b6105d9e7c46 100644
> >>>>>>>> --- a/arch/x86/kernel/cpu/sgx/encl.c
> >>>>>>>> +++ b/arch/x86/kernel/cpu/sgx/encl.c
> >>>>>>>> @@ -91,10 +91,8 @@ static struct sgx_epc_page *sgx_encl_eldu(struct sgx_encl_page *encl_page,
> >>>>>>>>  }
> >>>>>>>>  
> >>>>>>>>  static struct sgx_encl_page *sgx_encl_load_page(struct sgx_encl *encl,
> >>>>>>>> -						unsigned long addr,
> >>>>>>>> -						unsigned long vm_flags)
> >>>>>>>> +						unsigned long addr)
> >>>>>>>>  {
> >>>>>>>> -	unsigned long vm_prot_bits = vm_flags & (VM_READ | VM_WRITE | VM_EXEC);
> >>>>>>>>  	struct sgx_epc_page *epc_page;
> >>>>>>>>  	struct sgx_encl_page *entry;
> >>>>>>>>  
> >>>>>>>> @@ -102,14 +100,6 @@ static struct sgx_encl_page *sgx_encl_load_page(struct sgx_encl *encl,
> >>>>>>>>  	if (!entry)
> >>>>>>>>  		return ERR_PTR(-EFAULT);
> >>>>>>>>  
> >>>>>>>> -	/*
> >>>>>>>> -	 * Verify that the faulted page has equal or higher build time
> >>>>>>>> -	 * permissions than the VMA permissions (i.e. the subset of {VM_READ,
> >>>>>>>> -	 * VM_WRITE, VM_EXECUTE} in vma->vm_flags).
> >>>>>>>> -	 */
> >>>>>>>> -	if ((entry->vm_max_prot_bits & vm_prot_bits) != vm_prot_bits)
> >>>>>>>> -		return ERR_PTR(-EFAULT);
> >>>>>>>> -
> >>>>>>>>  	/* Entry successfully located. */
> >>>>>>>>  	if (entry->epc_page) {
> >>>>>>>>  		if (entry->desc & SGX_ENCL_PAGE_BEING_RECLAIMED)
> >>>>>>>> @@ -138,7 +128,9 @@ static vm_fault_t sgx_vma_fault(struct vm_fault *vmf)
> >>>>>>>>  {
> >>>>>>>>  	unsigned long addr = (unsigned long)vmf->address;
> >>>>>>>>  	struct vm_area_struct *vma = vmf->vma;
> >>>>>>>> +	unsigned long page_prot_bits;
> >>>>>>>>  	struct sgx_encl_page *entry;
> >>>>>>>> +	unsigned long vm_prot_bits;
> >>>>>>>>  	unsigned long phys_addr;
> >>>>>>>>  	struct sgx_encl *encl;
> >>>>>>>>  	vm_fault_t ret;
> >>>>>>>> @@ -155,7 +147,7 @@ static vm_fault_t sgx_vma_fault(struct vm_fault *vmf)
> >>>>>>>>  
> >>>>>>>>  	mutex_lock(&encl->lock);
> >>>>>>>>  
> >>>>>>>> -	entry = sgx_encl_load_page(encl, addr, vma->vm_flags);
> >>>>>>>> +	entry = sgx_encl_load_page(encl, addr);
> >>>>>>>>  	if (IS_ERR(entry)) {
> >>>>>>>>  		mutex_unlock(&encl->lock);
> >>>>>>>   
> >>>>>>>> @@ -167,7 +159,19 @@ static vm_fault_t sgx_vma_fault(struct vm_fault *vmf)
> >>>>>>>>  
> >>>>>>>>  	phys_addr = sgx_get_epc_phys_addr(entry->epc_page);
> >>>>>>>>  
> >>>>>>>> -	ret = vmf_insert_pfn(vma, addr, PFN_DOWN(phys_addr));
> >>>>>>>> +	/*
> >>>>>>>> +	 * Insert PTE to match the EPCM page permissions ensured to not
> >>>>>>>> +	 * exceed the VMA permissions.
> >>>>>>>> +	 */
> >>>>>>>> +	vm_prot_bits = vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC);
> >>>>>>>> +	page_prot_bits = entry->vm_max_prot_bits & vm_prot_bits;
> >>>>>>>> +	/*
> >>>>>>>> +	 * Add VM_SHARED so that PTE is made writable right away if VMA
> >>>>>>>> +	 * and EPCM are writable (no COW in SGX).
> >>>>>>>> +	 */
> >>>>>>>> +	page_prot_bits |= (vma->vm_flags & VM_SHARED);
> >>>>>>>> +	ret = vmf_insert_pfn_prot(vma, addr, PFN_DOWN(phys_addr),
> >>>>>>>> +				  vm_get_page_prot(page_prot_bits));
> >>>>>>>>  	if (ret != VM_FAULT_NOPAGE) {
> >>>>>>>>  		mutex_unlock(&encl->lock);
> >>>>>>>>  
> >>>>>>>> @@ -295,15 +299,14 @@ static int sgx_encl_debug_write(struct sgx_encl *encl, struct sgx_encl_page *pag
> >>>>>>>>   * Load an enclave page to EPC if required, and take encl->lock.
> >>>>>>>>   */
> >>>>>>>>  static struct sgx_encl_page *sgx_encl_reserve_page(struct sgx_encl *encl,
> >>>>>>>> -						   unsigned long addr,
> >>>>>>>> -						   unsigned long vm_flags)
> >>>>>>>> +						   unsigned long addr)
> >>>>>>>>  {
> >>>>>>>>  	struct sgx_encl_page *entry;
> >>>>>>>>  
> >>>>>>>>  	for ( ; ; ) {
> >>>>>>>>  		mutex_lock(&encl->lock);
> >>>>>>>>  
> >>>>>>>> -		entry = sgx_encl_load_page(encl, addr, vm_flags);
> >>>>>>>> +		entry = sgx_encl_load_page(encl, addr);
> >>>>>>>>  		if (PTR_ERR(entry) != -EBUSY)
> >>>>>>>>  			break;
> >>>>>>>>  
> >>>>>>>> @@ -339,8 +342,7 @@ static int sgx_vma_access(struct vm_area_struct *vma, unsigned long addr,
> >>>>>>>>  		return -EFAULT;
> >>>>>>>>  
> >>>>>>>>  	for (i = 0; i < len; i += cnt) {
> >>>>>>>> -		entry = sgx_encl_reserve_page(encl, (addr + i) & PAGE_MASK,
> >>>>>>>> -					      vma->vm_flags);
> >>>>>>>> +		entry = sgx_encl_reserve_page(encl, (addr + i) & PAGE_MASK);
> >>>>>>>>  		if (IS_ERR(entry)) {
> >>>>>>>>  			ret = PTR_ERR(entry);
> >>>>>>>>  			break;
> >>>>>>>> -- 
> >>>>>>>> 2.25.1
> >>>>>>>>
> >>>>>>>
> >>>>>>> If you unconditionally set vm_max_prot_bits to RWX for dynamically created
> >>>>>>> pages, you would not need to do this.
> >>>>>>>
> >>>>>>> These patches could then be safely dropped:
> >>>>>>>
> >>>>>>> - [PATCH V2 06/32] x86/sgx: Support VMA permissions more relaxed than enclave permissions 
> >>>>>>> - [PATCH V2 08/32] x86/sgx: x86/sgx: Add sgx_encl_page->vm_run_prot_bits for dynamic permission changes
> >>>>>>> - [PATCH V2 15/32] x86/sgx: Support relaxing of enclave page permissions
> >>>>>>>
> >>>>>>> And that would also keep full ABI compatibility without exceptions to the
> >>>>>>> existing mainline code.
> >>>>>>>
> >>>>>>
> >>>>>> Dropping these changes does not just impact dynamically created pages. Dropping
> >>>>>> these patches would result in EPCM page permission restriction being supported
> >>>>>> for all pages, those added before enclave initialization as well as dynamically
> >>>>>> added pages, but their PTEs will not be impacted.
> >>>>>>
> >>>>>> For example, if a RW enclave page is added via SGX_IOC_ENCLAVE_ADD_PAGES and
> >>>>>> then later made read-only via SGX_IOC_ENCLAVE_RESTRICT_PERMISSIONS then Linux
> >>>>>> would keep allowing and installing RW PTEs to this page.
> >>>>>
> >>>>> I think that would be perfectly fine, if someone wants to do that. There is
> >>>>> no collateral damage in doing that. The kernel does not get messed up because
> >>>>> of that. It's a use case that does not make sense in the first place, so it'd
> >>>>> be stupid to build anything extensive around it in the kernel.
> >>>>>
> >>>>> Shooting yourself in the foot is something that the kernel does not and
> >>>>> should not protect user space from, unless there is a risk of messing up
> >>>>> the state of the kernel itself.
> >>>>>
> >>>>> Much worse is that we have e.g. a completely artificial ioctl,
> >>>>> SGX_IOC_ENCLAVE_RELAX_PERMISSIONS, to support this scheme, which could
> >>>>> e.g. cause extra round-trips for a simple EMODPE.
> >>>>>
> >>>>> Also this means not having to include 06/32, which keeps 100% backwards
> >>>>> compatibility in run-time behaviour with the mainline while not restricting
> >>>>> dynamically created pages at all. And we get rid of the complex bookkeeping
> >>>>> of vm_run_prot_bits.
> >>>>>
> >>>>> And generally the whole model is then very easy to understand and explain.
> >>>>> If I had to give a presentation of the current mess in the patch set at a
> >>>>> conference, I can honestly say that I would be in serious trouble. It's
> >>>>> not a clean and clear security model, which is a risk by itself.
> >>>>
> >>>> I.e.
> >>>>
> >>>> 1. For EADD'd pages: stick to what has been the invariant for 1.5 years
> >>>>    now. Do not change it by any means (e.g. 06/32).
> >>>> 2. For EAUG'd pages: set vm_max_prot_bits RWX, which essentially means do
> >>>>    what ever you want with PTE's and EPCM.
> >>>>
> >>>> It's a clear and understandable model that does nothing bad to the kernel,
> >>>> and a run-time developer can surely find a way to get things going. For
> >>>> user space, the most important thing is clarity in kernel behaviour,
> >>>> and this does deliver that clarity. It's not perfect but it does do the
> >>>> job and anyone can get it.
> >>>
> >>> Also a quantitative argument for this is that by simplifying the security
> >>> model this way it is one ioctl less, which must be considered a +1. We do not
> >>> want to add new ioctls unless it is something we absolutely cannot live
> >>> without. We absolutely can live without SGX_IOC_ENCLAVE_RELAX_PERMISSIONS.
> >>>
> >>
> >> ok, with the implications understood and accepted I will proceed with a new
> >> series that separates the EPCM from PTEs and makes RWX PTEs possible by default
> >> for EAUG pages. This has broader impact than just removing
> >> the three patches you list. "[PATCH 07/32] x86/sgx: Add pfn_mkwrite() handler
> >> for present PTEs" is also no longer needed and there is no longer a need
> >> to flush PTEs after restricting permissions. New changes also need to
> >> be considered - at least the current documentation. I'll rework the series.
> > 
> > Yes, I really think it is a solid plan. Any possible LSM hooks would most
> > likely attach to the build product, not the dynamic behaviour.
> > 
> > As far as the page fault handler goes, Haitao is correct after all the
> > discussions that it makes sense. The purpose of the MAP_POPULATE series is
> > not to replace it but instead complement it. Just wanted to clear this
> > up as I said otherwise earlier this week.
> > 
> 
> Understood. I will keep the implementation where EAUG is done in page fault
> handler. I do plan to pick up your patch "x86/sgx: Export sgx_encl_page_alloc()"
> since a consequence of the other changes is that this can now be shared.

Yeah, I think we might be able to get this polished for v5.19. I'd expect
a revision or a few more for polishing the corners, but other than that
this looks to be on the right track now.

> Reinette

BR, Jarkko
Dr. Greg March 11, 2022, 11:06 a.m. UTC | #10
On Tue, Mar 08, 2022 at 07:00:03PM +0200, Jarkko Sakkinen wrote:

Good morning, I hope this note finds the week ending well for
everyone.

Based on previous experiences, I wasn't going to respond to this
conversation.  However, after seeing Thomas' response to Cathy's
microcode patch series, and reflecting on things a bit, I thought it
would be useful to do so in order to stimulate further scholarly discussion.

For the record, Thomas was spot-on with concerns about how much sense
it makes to persist enclaves through a microcode update.  Everyone
despises downtime, but if you are serious about security, idling a
machine out to fully 'reset' it in the face of a microcode update seems
to be the only prudent action.

However, with all due respect, the assertion that Linux is all about
the highest standards in technical honesty and soundness of
engineering is specious, at least in the context of the development
of this driver.

I see that Jarkko noted his conversations with Mark Shanahan on SGX
micro-architectural details.  For the record, I have been fortunate
enough to have engaged with Shanahan, Rozas, Johnson and a number of
the other engineers behind this technology.  I've also had the
opportunity to engage with and provide recommendations to the SGX
engineering team in Israel, before the demise of the Platform Security
Group and the exile of SGX to servers only.

I believe that Simon Johnson received a Principal Engineer nomination
for his work on SGX.  He called me one morning here at the lake and we
had a rather lengthy and entertaining discussion about the issues
surrounding doing packet inspection from inside of an enclave, I
believe at 40 GBPS line rates, maybe 10, I can't remember.  In any
event, way faster then what could be accomplished in the face of the
~60,000 cycle overhead of untrusted<->trusted context switches, which
in turn motivated the 'switchless' architecture.

So with the length and girth discussions at bay, reflections on the
technical and security issues at hand follow below, for whatever they
are worth.

> On Tue, Mar 08, 2022 at 08:04:33AM -0800, Reinette Chatre wrote:
> > ok, with the implications understood and accepted I will proceed
> > with a new series that separates the EPCM from PTEs and makes RWX PTEs
> > possible by default for EAUG pages. This has broader impact than
> > just removing the three patches you list. "[PATCH 07/32] x86/sgx:
> > Add pfn_mkwrite() handler for present PTEs" is also no longer
> > needed and there is no longer a need to flush PTEs after
> > restricting permissions. New changes also need to be considered -
> > at least the current documentation. I'll rework the series.

> Yes, I really think it is a solid plan. Any possible LSM hooks would
> most likely attach to the build product, not the dynamic behaviour.

I assume everyone remembers (if not, the kernel archives have full
details) that we had a very lively discussion about these issues
starting well over two years ago.

Jarkko's rather dogmatic assertion that it should simply be 'The
Wild West', with respect to PTE memory permissions on dynamically
allocated enclave memory, suggests that all of the hand wringing and
proselytizing about SGX being a way to circumvent LSM controls on
executable memory were political and technical grandstanding,
amounting to nothing but security theater.

Apologies if this is perceived to be a bit strident, it had been a
long week already on Wednesday morning.

I made the point at the time, which remained unacknowledged, that none
of the machinations involved had any practical security value with
respect to where everyone wanted this technology to go, ie. a driver
with full Enclave Dynamic Memory Management (EDMM) support, which is
the precipice on which we now stand.

I noted that the only valid security controls for this technology were
reputational controls based on cryptographic identities.  In fact, we
developed, and posted, a rather complete implementation of such an
infrastructure.  Here is a URL to the last patch that we had time to
fuss with putting up:

ftp://ftp.enjellic.com/pub/sgx/kernel/SFLC-5.12.patch

I think we have a 5.13, if not a 5.14 patch laying around as well.

The response, from one of the luminaries in these discussions, was:
"I dare you to post the patches so I can immediately NACK them".

I guess that pretty much covers the question of why there may be
perceived reluctance in some quarters about spending time trying to
upstream kernel functionality.

So, it was with some interest that I noted Reinette Chatre's recent
e-mail, which indicated that, in the face of the EDMM driver and the
security implications it presents, a proof-of-concept implementation
for reputational security controls had been developed.  That
implementation is based on MRENCLAVE and/or MRSIGNER values, both
cryptographically based identities, as was ours.

Although we didn't bother with MRENCLAVE values, for largely the same
reason why SGX_KEYPOLICY_MRENCLAVE isn't considered useful for
symmetric key generation inside of an enclave, with perhaps the
exception of shrouding keys to defeat speculation attacks.

So, to assist the conversation, and for the 'Lore' record.

In an EDMM environment, anyone with adverse intent is going to simply
ship an enclave that amounts to nothing more than a bootloader.  Said
enclave will set up a network connection to an external code
repository, which will verify that it is only talking to a known
enclave through remote attestation, and then download whatever code,
via a cryptographically secured connection, that they actually want to
run in the enclave.

How do I know that?

I know that because we were paid to develop those types of systems by
customers who wanted to run proprietary and/or confidential code in
the 'cloud'.  It was interesting to see the number of groups that
looked at SGX as a means to protect their 'secret sauce'.

That the kernel, per Jarkko's comment above, is going to simply ignore
this threat scenario seems vaguely unwise.

So, perhaps the best way to advance a profitable discussion, is for
the involved kernel developers to state, for the benefit of those of
us less enlightened, how effective LSM's are going to be developed for
the EDMM threat model.

I can offer up a few strawman approaches:

- Refuse any socket connections to an application that maps an enclave.

- Refuse to allow an application mapping an enclave to access any
files, particularly if it looks like they contain encrypted content.

- Count the number of page mappings requested and decide that NN pages
are OK but NN+ are not, MAP_POPULATE returns E2BIG??

- Implement seccomp like controls to analyze and interpret OCALL
behavior.

The first three seem a bit like show-stoppers when it comes to doing
anything useful with SGX, seccomp has a reputation of being hard to
get 'right', that would seem to be particularly the case here.

Perhaps something more exotic.

How about a BPF-enabled LSM that monitors enclave page access
patterns, so a concerned system administrator can build custom
variants of the directed page side-channel attack that was
demonstrated against Haven at Oakland.  The authors, in their
conclusions, described their approach as 'devastating' to the notion
that SGX could prevent an adversarial OS from knowing what is going on
inside of an enclave.

The conundrum is pretty simple and straight forward.  Either the
technology works, as advertised, which means, by definition, that the
operating system has no effective insight into what an enclave is
doing.

Or some of the kernel developers are right, and there is a way for the
OS to have effective insight, and thus control, over what an enclave is
trying to do.  A fact that effectively implies that Fortanix, Asylo,
Enarx, Gramine, Occlum et al. are peddling the equivalent of Security
Snake Oil when it comes to SGX-enabled 'confidential' computing.

The above shouldn't be considered pejorative to those products,
companies or initiatives.  I have an acquaintance in the IT business
that tells me I worry too much about whether things work and how, because
he has made a ton of money selling people stuff that he knows doesn't
work.

For the record, I'm completely ambivalent with respect to how any of
this gets done or what the PTE permissions for dynamic content are.
For those who may not be ambivalent about whether Linux gets security
'right', let me leave the following thoughts.

It hasn't been my experience that good engineers design things or
processes for the sake of designing them.  The architecture for
EDMM/SGX was proposed and presented in the form of two papers at
HASP-2016.

Let's see, the operating system 'flow' model counts as an author none
other than Mark Shanahan himself, with Bin Xing and Rebekah Hurd.

The SGX instructions and architecture paper was done by Frank McKeen,
Ilya Alexandrovich, Ittai Anati, Dror Caspi, Simon Johnson, Rebekah
Hurd and Carlos Rozas.

The interactions I've had left me feeling like these were really smart
people.  Maybe they were on psychedelics when they designed this and
wrote the papers?  It would seem to be helpful for these discussions
to know if this was the case.

Or maybe, perhaps, there are some subtleties, hidden 'gotchas',
micro-architectural peculiarities and/or security considerations that
influenced the documented EDMM flow.  Seems like secure kernel
development would benefit from knowing that, rather than concluding
that this method is slow, and perhaps hard to implement, so Linux is
going to ignore these potential issues.

The EDMM papers, if anyone should happen to read them, indicate in the
acknowledgments that the design was done in collaboration with OS
designers.  I'm presuming that was Peinado's group at Microsoft
Research, given that is where Haven came from, maybe they were
dabbling with psychedelics.

I believe they did manage to document the early micro-architectural
errata about certain page access patterns causing processor faults.
That leads one to believe they couldn't have been too confused about
what was going on.

When I brought all of this up last time I was told I was trying to
present a 'scary boogeyman'.  That could be, I will leave that for
others to judge.

My pragmatic response to that accusation is: why would we argue to have
systemd and the distros change system behaviors to accommodate the
ability to apply an LSM to a 500K 'bootloader', when that 'bootloader'
can turn around and potentially pull down gigabytes of code that we
appear to have decided need no controls applied to them whatsoever?

Caution would seem to suggest the need to understand the implications
of these issues a bit better than we currently do.

> Thank you.
> 
> BR, Jarkko

Best wishes for a pleasant weekend to everyone.

Dr. Greg

As always,
Dr. Greg Wettstein, Ph.D, Worker      Autonomously self-defensive
Enjellic Systems Development, LLC     IOT platforms and edge devices.
4206 N. 19th Ave.
Fargo, ND  58102
PH: 701-281-1686                      EMAIL: dg@enjellic.com
------------------------------------------------------------------------------
"Thinking implies disagreement; and disagreement implies
 non-conformity; and non-conformity implies heresy; and heresy implies
 disloyalty -- so obviously thinking must be stopped"
                                -- [Call to Greatness, 1954]
                                   Adlai Stevenson

Patch

diff --git a/Documentation/x86/sgx.rst b/Documentation/x86/sgx.rst
index 89ff924b1480..5659932728a5 100644
--- a/Documentation/x86/sgx.rst
+++ b/Documentation/x86/sgx.rst
@@ -99,6 +99,16 @@  The relationships between the different permission masks are:
 * PTEs are installed to match the EPCM permissions, but not be more
   relaxed than the VMA permissions.
 
+On systems supporting SGX2 EPCM permissions may change while the
+enclave page belongs to a VMA without impacting the VMA permissions.
+This means that a running VMA may appear to allow access to an enclave
+page that is not allowed by its EPCM permissions. For example, when an
+enclave page with RW EPCM permissions is mapped by a RW VMA but is
+subsequently changed to have read-only EPCM permissions. The kernel
+continues to maintain correct access to the enclave page through the
+PTE that will ensure that only access allowed by both the VMA
+and EPCM permissions are permitted.
+
 Application interface
 =====================
 
diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index 48afe96ae0f0..b6105d9e7c46 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -91,10 +91,8 @@  static struct sgx_epc_page *sgx_encl_eldu(struct sgx_encl_page *encl_page,
 }
 
 static struct sgx_encl_page *sgx_encl_load_page(struct sgx_encl *encl,
-						unsigned long addr,
-						unsigned long vm_flags)
+						unsigned long addr)
 {
-	unsigned long vm_prot_bits = vm_flags & (VM_READ | VM_WRITE | VM_EXEC);
 	struct sgx_epc_page *epc_page;
 	struct sgx_encl_page *entry;
 
@@ -102,14 +100,6 @@  static struct sgx_encl_page *sgx_encl_load_page(struct sgx_encl *encl,
 	if (!entry)
 		return ERR_PTR(-EFAULT);
 
-	/*
-	 * Verify that the faulted page has equal or higher build time
-	 * permissions than the VMA permissions (i.e. the subset of {VM_READ,
-	 * VM_WRITE, VM_EXECUTE} in vma->vm_flags).
-	 */
-	if ((entry->vm_max_prot_bits & vm_prot_bits) != vm_prot_bits)
-		return ERR_PTR(-EFAULT);
-
 	/* Entry successfully located. */
 	if (entry->epc_page) {
 		if (entry->desc & SGX_ENCL_PAGE_BEING_RECLAIMED)
@@ -138,7 +128,9 @@  static vm_fault_t sgx_vma_fault(struct vm_fault *vmf)
 {
 	unsigned long addr = (unsigned long)vmf->address;
 	struct vm_area_struct *vma = vmf->vma;
+	unsigned long page_prot_bits;
 	struct sgx_encl_page *entry;
+	unsigned long vm_prot_bits;
 	unsigned long phys_addr;
 	struct sgx_encl *encl;
 	vm_fault_t ret;
@@ -155,7 +147,7 @@  static vm_fault_t sgx_vma_fault(struct vm_fault *vmf)
 
 	mutex_lock(&encl->lock);
 
-	entry = sgx_encl_load_page(encl, addr, vma->vm_flags);
+	entry = sgx_encl_load_page(encl, addr);
 	if (IS_ERR(entry)) {
 		mutex_unlock(&encl->lock);
 
@@ -167,7 +159,19 @@  static vm_fault_t sgx_vma_fault(struct vm_fault *vmf)
 
 	phys_addr = sgx_get_epc_phys_addr(entry->epc_page);
 
-	ret = vmf_insert_pfn(vma, addr, PFN_DOWN(phys_addr));
+	/*
+	 * Insert PTE to match the EPCM page permissions ensured to not
+	 * exceed the VMA permissions.
+	 */
+	vm_prot_bits = vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC);
+	page_prot_bits = entry->vm_max_prot_bits & vm_prot_bits;
+	/*
+	 * Add VM_SHARED so that PTE is made writable right away if VMA
+	 * and EPCM are writable (no COW in SGX).
+	 */
+	page_prot_bits |= (vma->vm_flags & VM_SHARED);
+	ret = vmf_insert_pfn_prot(vma, addr, PFN_DOWN(phys_addr),
+				  vm_get_page_prot(page_prot_bits));
 	if (ret != VM_FAULT_NOPAGE) {
 		mutex_unlock(&encl->lock);
 
@@ -295,15 +299,14 @@  static int sgx_encl_debug_write(struct sgx_encl *encl, struct sgx_encl_page *pag
  * Load an enclave page to EPC if required, and take encl->lock.
  */
 static struct sgx_encl_page *sgx_encl_reserve_page(struct sgx_encl *encl,
-						   unsigned long addr,
-						   unsigned long vm_flags)
+						   unsigned long addr)
 {
 	struct sgx_encl_page *entry;
 
 	for ( ; ; ) {
 		mutex_lock(&encl->lock);
 
-		entry = sgx_encl_load_page(encl, addr, vm_flags);
+		entry = sgx_encl_load_page(encl, addr);
 		if (PTR_ERR(entry) != -EBUSY)
 			break;
 
@@ -339,8 +342,7 @@  static int sgx_vma_access(struct vm_area_struct *vma, unsigned long addr,
 		return -EFAULT;
 
 	for (i = 0; i < len; i += cnt) {
-		entry = sgx_encl_reserve_page(encl, (addr + i) & PAGE_MASK,
-					      vma->vm_flags);
+		entry = sgx_encl_reserve_page(encl, (addr + i) & PAGE_MASK);
 		if (IS_ERR(entry)) {
 			ret = PTR_ERR(entry);
 			break;