Message ID | 20201112220135.165028-11-jarkko@kernel.org (mailing list archive) |
---|---|
State | New, archived |
On Fri, Nov 13, 2020 at 12:01:21AM +0200, Jarkko Sakkinen wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
>
> Background
> ==========
>
> 1. SGX enclave pages are populated with data by copying from normal memory
>    via ioctl() (SGX_IOC_ENCLAVE_ADD_PAGES), which will be added later in
>    this series.
> 2. It is desirable to be able to restrict those normal memory data sources.
>    For instance, to ensure that the source data is executable before
>    copying data to an executable enclave page.
> 3. Enclave page permissions are dynamic (just like normal permissions) and
>    can be adjusted at runtime with mprotect().
>
> This creates a problem because the original data source may have long since
> vanished at the time when enclave page permissions are established (mmap()
> or mprotect()).
>
> The solution (elsewhere in this series) is to force enclave creators to
> declare their paging permission *intent* up front to the ioctl(). This
> intent can be immediately compared to the source data's mapping and
> rejected if necessary.
>
> The "intent" is also stashed off for later comparison with enclave
> PTEs. This ensures that any future mmap()/mprotect() operations
> performed by the enclave creator or done on behalf of the enclave
> can be compared with the earlier declared permissions.
>
> Problem
> =======
>
> There is an existing mmap() hook which allows SGX to perform this
> permission comparison at mmap() time. However, there is no corresponding
> ->mprotect() hook.
>
> Solution
> ========
>
> Add a vm_ops->mprotect() hook so that mprotect() operations which are
> inconsistent with any page's stashed intent can be rejected by the driver.
>
> Cc: linux-mm@kvack.org
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Jethro Beekman <jethro@fortanix.com> # v40
> Acked-by: Dave Hansen <dave.hansen@intel.com> # v40
> # Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Co-developed-by: Jarkko Sakkinen <jarkko@kernel.org>
> Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>

Acked-by: Mel Gorman <mgorman@techsingularity.net>
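The "stashed intent" comparison described in the commit message can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `struct model_page`, `model_may_map()` and the `PROT_R`/`PROT_W`/`PROT_X` bits are hypothetical names invented for illustration, not kernel symbols. The bit test mirrors the `~page->vm_max_prot_bits & vm_prot_bits` check that appears later in this thread.

```c
#include <errno.h>
#include <stddef.h>

/* Illustrative permission bits (assumed values, not the kernel's VM_* flags). */
#define PROT_R 0x1UL
#define PROT_W 0x2UL
#define PROT_X 0x4UL

/* Hypothetical stand-in for an enclave page: only the declared maximum permissions. */
struct model_page {
	unsigned long max_prot_bits;
};

/*
 * Return 0 if every page in pages[first..last] (inclusive) was declared with
 * at least the requested permissions, -EACCES otherwise.
 */
int model_may_map(const struct model_page *pages, size_t first, size_t last,
		  unsigned long requested)
{
	for (size_t i = first; i <= last; i++) {
		/* Any requested bit missing from the declared intent is refused. */
		if (~pages[i].max_prot_bits & requested)
			return -EACCES;
	}
	return 0;
}
```

In this model, a request for read and write access over a range that includes a page declared read and execute only is refused, while a read-only request over the same range succeeds.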
On Fri, Nov 13, 2020 at 12:01:21AM +0200, Jarkko Sakkinen wrote:

Good morning, I hope the weekend is going well for everyone.

> From: Sean Christopherson <sean.j.christopherson@intel.com>

We wish Sean well in whatever new avocation he has chosen.

> Background
> ==========
>
> 1. SGX enclave pages are populated with data by copying from normal memory
>    via ioctl() (SGX_IOC_ENCLAVE_ADD_PAGES), which will be added later in
>    this series.
> 2. It is desirable to be able to restrict those normal memory data sources.
>    For instance, to ensure that the source data is executable before
>    copying data to an executable enclave page.
> 3. Enclave page permissions are dynamic (just like normal permissions) and
>    can be adjusted at runtime with mprotect().
>
> This creates a problem because the original data source may have long since
> vanished at the time when enclave page permissions are established (mmap()
> or mprotect()).
>
> The solution (elsewhere in this series) is to force enclave creators to
> declare their paging permission *intent* up front to the ioctl(). This
> intent can be immediately compared to the source data's mapping and
> rejected if necessary.
>
> The "intent" is also stashed off for later comparison with enclave
> PTEs. This ensures that any future mmap()/mprotect() operations
> performed by the enclave creator or done on behalf of the enclave
> can be compared with the earlier declared permissions.

The new mprotect hook in vm_operations_struct is indeed useful, as I
will demonstrate in a subsequent patch for consideration.

However, the officially stated intent of this version of the driver is
to implement SGX1 semantics even on hardware (SGX2) that implements
the instructions needed for Enclave Dynamic Memory Management (EDMM).
As a result, at this stage of the driver's development, the
implementation that walks the page permission intents can be
substituted with a simple denial of mmap and mprotect on an
initialized enclave.

With this prohibition in place, the hardware itself will enforce the
page permission intents that were encoded when the enclave was loaded,
thus making the subsequent scan irrelevant.

The following patch implements this functionality.

Dr. Greg

---------------------------------------------------------------------------
Subject: [PATCH] Unconditionally block permission changes on enclave memory.

In SGX there are two levels of memory protection, the classic page
table mechanism and SGX hardware based page protections that are
codified in the Enclave Page Cache Metadata. A successful memory
access requires that both mechanisms agree that the access is
permitted.

In the initial implementation of SGX (SGX1), the page permissions are
immutable after the enclave is initialized. Even if classic page
protections are modified via mprotect, any attempt to access enclave
memory with alternative permissions will be blocked.

One of the architectural changes implemented in the second generation
of SGX (SGX2) is the ability for page access permissions to be
dynamically manipulated after the enclave is initialized. This
requires coordination between trusted code running in the enclave and
untrusted code using mprotect and special ring-0 instructions.

One of the security threats associated with SGX2 hardware is that
enclave based code can conspire with its untrusted runtime to make
executable enclave memory writable. This provides the opportunity for
completely anonymous code execution that the operating system has no
visibility into.

All that is needed to, simply, close this vulnerability is to
eliminate the ability to call mprotect or mmap against the virtual
memory range of an enclave after it is initialized. Any permission
changes made prior to initialization that are inconsistent with the
permissions codified in the enclave will cause initialization or
execution of the enclave to fail.

Tested-by: Dr. Greg <greg@enjellic.com>
Signed-off-by: Dr. Greg <greg@enjellic.com>
---
 arch/x86/kernel/cpu/sgx/encl.c | 50 +++++++++-------------------------
 1 file changed, 13 insertions(+), 37 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index 5551c7d36483..3bd770fbfc32 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -212,27 +212,25 @@ static void sgx_vma_open(struct vm_area_struct *vma)
  * @end:		upper bound of the address range, exclusive
  * @vm_flags:		VMA flags
  *
- * Iterate through the enclave pages contained within [@start, @end) to verify
- * that the permissions requested by a subset of {VM_READ, VM_WRITE, VM_EXEC}
- * does not contain any permissions that are not contained in the build time
- * permissions of any of the enclave pages within the given address range.
+ * This function provides a method for determining whether or not mmap
+ * or mprotect can be invoked on any pages in the virtual
+ * address range of an enclave.
  *
- * An enclave creator must declare the strongest permissions that will be
- * needed for each enclave page This ensures that mappings have the identical
- * or weaker permissions that the earlier declared permissions.
+ * Since this driver is not designed to support Enclave Dynamic Memory
+ * Management (EDMM), any attempts to modify the enclave memory map after
+ * the enclave is initialized are simply blocked.
+ *
+ * The function signature is left intact since future versions of the
+ * driver may implement verifications that the requested permission
+ * changes are consistent with the desire of the enclave author.
  *
  * Return: 0 on success, -EACCES otherwise
  */
 int sgx_encl_may_map(struct sgx_encl *encl, unsigned long start,
 		     unsigned long end, unsigned long vm_flags)
 {
-	unsigned long vm_prot_bits = vm_flags & (VM_READ | VM_WRITE | VM_EXEC);
-	struct sgx_encl_page *page;
-	unsigned long count = 0;
 	int ret = 0;
 
-	XA_STATE(xas, &encl->page_array, PFN_DOWN(start));
-
 	/*
 	 * Disallow READ_IMPLIES_EXEC tasks as their VMA permissions might
 	 * conflict with the enclave page permissions.
@@ -240,31 +238,9 @@ int sgx_encl_may_map(struct sgx_encl *encl, unsigned long start,
 	if (current->personality & READ_IMPLIES_EXEC)
 		return -EACCES;
 
-	mutex_lock(&encl->lock);
-	xas_lock(&xas);
-	xas_for_each(&xas, page, PFN_DOWN(end - 1)) {
-		if (!page)
-			break;
-
-		if (~page->vm_max_prot_bits & vm_prot_bits) {
-			ret = -EACCES;
-			break;
-		}
-
-		/* Reschedule on every XA_CHECK_SCHED iteration. */
-		if (!(++count % XA_CHECK_SCHED)) {
-			xas_pause(&xas);
-			xas_unlock(&xas);
-			mutex_unlock(&encl->lock);
-
-			cond_resched();
-
-			mutex_lock(&encl->lock);
-			xas_lock(&xas);
-		}
-	}
-	xas_unlock(&xas);
-	mutex_unlock(&encl->lock);
+	/* Disallow mapping on an initialized enclave. */
+	if (test_bit(SGX_ENCL_INITIALIZED, &encl->flags))
+		ret = -EACCES;
 
 	return ret;
 }
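The argument in the patch description above rests on the rule that an access succeeds only when the page tables and the Enclave Page Cache Metadata both permit it. A minimal userspace sketch of that rule follows. The function `model_access_allowed()` and the `ACC_*` bits are hypothetical names chosen for illustration, and the model deliberately ignores everything else the hardware checks.

```c
#include <stdbool.h>

/* Illustrative access bits (assumed values, not hardware encodings). */
#define ACC_R 0x1u
#define ACC_W 0x2u
#define ACC_X 0x4u

/*
 * Model of "both mechanisms must agree": the access is allowed only if
 * every requested bit is granted by the page-table permissions AND by the
 * EPCM permissions recorded for the page.
 */
bool model_access_allowed(unsigned int pte_perms, unsigned int epcm_perms,
			  unsigned int access)
{
	return (pte_perms & epcm_perms & access) == access;
}
```

Under this model, widening the page-table permissions with mprotect() to include write access changes nothing for a page whose EPCM permissions are read and execute only, which is the SGX1 behaviour the description relies on.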
On Fri, Nov 13, 2020 at 12:01:21AM +0200, Jarkko Sakkinen wrote:
> +++ b/include/linux/mm.h
> @@ -559,6 +559,13 @@ struct vm_operations_struct {
>  	void (*close)(struct vm_area_struct * area);
>  	int (*split)(struct vm_area_struct * area, unsigned long addr);
>  	int (*mremap)(struct vm_area_struct * area);
> +	/*
> +	 * Called by mprotect() to make driver-specific permission
> +	 * checks before mprotect() is finalised. The VMA must not
> +	 * be modified. Returns 0 if mprotect() can proceed.
> +	 */

This is the wrong place for this documentation, and it's absurdly
specific to your implementation. It should be in
Documentation/filesystems/locking.rst.

> +	int (*mprotect)(struct vm_area_struct *vma, unsigned long start,
> +			unsigned long end, unsigned long newflags);

> +
> +		if (vma->vm_ops && vma->vm_ops->mprotect)
> +			error = vma->vm_ops->mprotect(vma, nstart, tmp, newflags);
> +		if (error)
> +			goto out;
> +
>  		error = mprotect_fixup(vma, &prev, nstart, tmp, newflags);
>  		if (error)
>  			goto out;
> +
>  		nstart = tmp;

Spurious newline added.
On 11/15/20 9:32 AM, Matthew Wilcox wrote:
> On Fri, Nov 13, 2020 at 12:01:21AM +0200, Jarkko Sakkinen wrote:
>> +++ b/include/linux/mm.h
>> @@ -559,6 +559,13 @@ struct vm_operations_struct {
>>  	void (*close)(struct vm_area_struct * area);
>>  	int (*split)(struct vm_area_struct * area, unsigned long addr);
>>  	int (*mremap)(struct vm_area_struct * area);
>> +	/*
>> +	 * Called by mprotect() to make driver-specific permission
>> +	 * checks before mprotect() is finalised. The VMA must not
>> +	 * be modified. Returns 0 if mprotect() can proceed.
>> +	 */
>
> This is the wrong place for this documentation, and it's absurdly
> specific to your implementation. It should be in
> Documentation/filesystems/locking.rst.

I'll let you and Mel duke that one out:

> https://lore.kernel.org/linux-mm/20201106100409.GD3371@techsingularity.net/

As for 'locking.rst', I didn't even know that was there. It's also a
_bit_ silly that we would depend on folks finding that for code like SGX
that has nothing to do with filesystems. If we expect folks to update
that, we need a comment spelling that out in the struct.

The comment also _looks_ reasonably generic to me. None of the
parameters can be modified, so the impact on the core mm code must only
be from the result of the return code, even if the handler does some
other magic in addition to plain checks.

For one thing, I guess we could take the literal mention of "driver"
out. vm_operations are certainly supplied by code that isn't a "driver"
per se.

>> +	int (*mprotect)(struct vm_area_struct *vma, unsigned long start,
>> +			unsigned long end, unsigned long newflags);
>
>> +
>> +		if (vma->vm_ops && vma->vm_ops->mprotect)
>> +			error = vma->vm_ops->mprotect(vma, nstart, tmp, newflags);
>> +		if (error)
>> +			goto out;
>> +
>>  		error = mprotect_fixup(vma, &prev, nstart, tmp, newflags);
>>  		if (error)
>>  			goto out;
>> +
>>  		nstart = tmp;
>
> Spurious newline added.

I don't actually think this is spurious. It ends up grouping the code
into the logical chunks with the ->mprotect() and goto in one chunk and
mprotect_fixup() and its goto in another. Without this newline, the two
logical parts don't look separate.

It's kinda hard to see with just the diff context, though.
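The two logical chunks discussed above, an optional hook call followed by the actual permission change, can be modelled in userspace as follows. This is a sketch only: every `model_*` name is hypothetical, the "fixup" step just stores the new flags, and `model_deny_write()` is an invented example hook that refuses an assumed write bit (0x2).

```c
#include <errno.h>
#include <stddef.h>

struct model_vma;

/* Hypothetical analogue of vm_operations_struct with only the new hook. */
struct model_vm_ops {
	int (*mprotect)(struct model_vma *vma, unsigned long start,
			unsigned long end, unsigned long newflags);
};

struct model_vma {
	const struct model_vm_ops *vm_ops;
	unsigned long flags;
};

/* Stand-in for mprotect_fixup(): simply applies the new flags. */
static int model_fixup(struct model_vma *vma, unsigned long newflags)
{
	vma->flags = newflags;
	return 0;
}

/* Example hook: refuse any request containing the assumed write bit 0x2. */
int model_deny_write(struct model_vma *vma, unsigned long start,
		     unsigned long end, unsigned long newflags)
{
	(void)vma;
	(void)start;
	(void)end;
	return (newflags & 0x2UL) ? -EACCES : 0;
}

int model_do_mprotect(struct model_vma *vma, unsigned long start,
		      unsigned long end, unsigned long newflags)
{
	int error = 0;

	/* Chunk 1: ask the optional hook; it may only veto, not modify. */
	if (vma->vm_ops && vma->vm_ops->mprotect)
		error = vma->vm_ops->mprotect(vma, start, end, newflags);
	if (error)
		return error;

	/* Chunk 2: perform the permission change itself. */
	return model_fixup(vma, newflags);
}
```

A mapping without `vm_ops`, or without an `mprotect` member, skips the first chunk entirely, so in this model existing users are unaffected by the hook.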
On Sun, Nov 15, 2020 at 10:36:51AM -0800, Dave Hansen wrote:
> On 11/15/20 9:32 AM, Matthew Wilcox wrote:
> > On Fri, Nov 13, 2020 at 12:01:21AM +0200, Jarkko Sakkinen wrote:
> >> +++ b/include/linux/mm.h
> >> @@ -559,6 +559,13 @@ struct vm_operations_struct {
> >>  	void (*close)(struct vm_area_struct * area);
> >>  	int (*split)(struct vm_area_struct * area, unsigned long addr);
> >>  	int (*mremap)(struct vm_area_struct * area);
> >> +	/*
> >> +	 * Called by mprotect() to make driver-specific permission
> >> +	 * checks before mprotect() is finalised. The VMA must not
> >> +	 * be modified. Returns 0 if mprotect() can proceed.
> >> +	 */
> >
> > This is the wrong place for this documentation, and it's absurdly
> > specific to your implementation. It should be in
> > Documentation/filesystems/locking.rst.
>
> I'll let you and Mel duke that one out:
>

I suggested placing the comment there to make it clear what the expected
semantics of the hook were, to reduce the chances of abuse or surprises. The
hook does not affect locking so Documentation/filesystems/locking.rst
didn't appear appropriate other than maybe adding a note there
that it doesn't affect locks. The hook also is not expecting any
filesystems-specific action that I am aware of but a note could be added to
the effect that filesystems should not need to take special action for it.
Protections on the filesystem level are for the inode, I can't imagine what
a filesystem would do with a protection change on the page table level
but maybe I'm not particularly imaginative today.
On Fri, Nov 13, 2020 at 10:25:43AM +0000, Mel Gorman wrote:
> On Fri, Nov 13, 2020 at 12:01:21AM +0200, Jarkko Sakkinen wrote:
> > From: Sean Christopherson <sean.j.christopherson@intel.com>
> >
> > Background
> > ==========
> >
> > 1. SGX enclave pages are populated with data by copying from normal memory
> >    via ioctl() (SGX_IOC_ENCLAVE_ADD_PAGES), which will be added later in
> >    this series.
> > 2. It is desirable to be able to restrict those normal memory data sources.
> >    For instance, to ensure that the source data is executable before
> >    copying data to an executable enclave page.
> > 3. Enclave page permissions are dynamic (just like normal permissions) and
> >    can be adjusted at runtime with mprotect().
> >
> > This creates a problem because the original data source may have long since
> > vanished at the time when enclave page permissions are established (mmap()
> > or mprotect()).
> >
> > The solution (elsewhere in this series) is to force enclave creators to
> > declare their paging permission *intent* up front to the ioctl(). This
> > intent can be immediately compared to the source data's mapping and
> > rejected if necessary.
> >
> > The "intent" is also stashed off for later comparison with enclave
> > PTEs. This ensures that any future mmap()/mprotect() operations
> > performed by the enclave creator or done on behalf of the enclave
> > can be compared with the earlier declared permissions.
> >
> > Problem
> > =======
> >
> > There is an existing mmap() hook which allows SGX to perform this
> > permission comparison at mmap() time. However, there is no corresponding
> > ->mprotect() hook.
> >
> > Solution
> > ========
> >
> > Add a vm_ops->mprotect() hook so that mprotect() operations which are
> > inconsistent with any page's stashed intent can be rejected by the driver.
> >
> > Cc: linux-mm@kvack.org
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Matthew Wilcox <willy@infradead.org>
> > Cc: Mel Gorman <mgorman@techsingularity.net>
> > Acked-by: Jethro Beekman <jethro@fortanix.com> # v40
> > Acked-by: Dave Hansen <dave.hansen@intel.com> # v40
> > # Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> > Co-developed-by: Jarkko Sakkinen <jarkko@kernel.org>
> > Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
>
> Acked-by: Mel Gorman <mgorman@techsingularity.net>

Thank you.

/Jarkko
On Mon, Nov 16, 2020 at 10:09:57AM +0000, Mel Gorman wrote:
> On Sun, Nov 15, 2020 at 10:36:51AM -0800, Dave Hansen wrote:
> > On 11/15/20 9:32 AM, Matthew Wilcox wrote:
> > > On Fri, Nov 13, 2020 at 12:01:21AM +0200, Jarkko Sakkinen wrote:
> > >> +++ b/include/linux/mm.h
> > >> @@ -559,6 +559,13 @@ struct vm_operations_struct {
> > >>  	void (*close)(struct vm_area_struct * area);
> > >>  	int (*split)(struct vm_area_struct * area, unsigned long addr);
> > >>  	int (*mremap)(struct vm_area_struct * area);
> > >> +	/*
> > >> +	 * Called by mprotect() to make driver-specific permission
> > >> +	 * checks before mprotect() is finalised. The VMA must not
> > >> +	 * be modified. Returns 0 if mprotect() can proceed.
> > >> +	 */

Wonder if this should also document the negative case for the return
value, i.e. -EACCES is returned otherwise.

> > >
> > > This is the wrong place for this documentation, and it's absurdly
> > > specific to your implementation. It should be in
> > > Documentation/filesystems/locking.rst.
> >
> > I'll let you and Mel duke that one out:
> >
>
> I suggested placing the comment there to make it clear what the expected
> semantics of the hook was to reduce the chances of abuse or surprises. The
> hook does not affect locking so Documentation/filesystems/locking.rst
> didn't appear appropriate other than maybe adding a note there
> that it doesn't affect locks. The hook also is not expecting any
> filesystems-specific action that I am aware of but a note could be added to
> the effect that filesystems should not need to take special action for it.
> Protections on the filesystem level are for the inode, I can't imagine what
> a filesystem would do with a protection change on the page table level
> but maybe I'm not particularly imaginative today.

I try to decipher this in a generic context.

In a permission check of a filesystem, truncated pages should be
encapsulated into the permission decision. It's just a query.

So maybe I'll add something like:

"This callback does only a permission query, and thus never returns
locked pages."

> --
> Mel Gorman
> SUSE Labs

/Jarkko
diff --git a/include/linux/mm.h b/include/linux/mm.h
index db6ae4d3fb4e..1813fa86b981 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -559,6 +559,13 @@ struct vm_operations_struct {
 	void (*close)(struct vm_area_struct * area);
 	int (*split)(struct vm_area_struct * area, unsigned long addr);
 	int (*mremap)(struct vm_area_struct * area);
+	/*
+	 * Called by mprotect() to make driver-specific permission
+	 * checks before mprotect() is finalised. The VMA must not
+	 * be modified. Returns 0 if mprotect() can proceed.
+	 */
+	int (*mprotect)(struct vm_area_struct *vma, unsigned long start,
+			unsigned long end, unsigned long newflags);
 	vm_fault_t (*fault)(struct vm_fault *vmf);
 	vm_fault_t (*huge_fault)(struct vm_fault *vmf,
 			enum page_entry_size pe_size);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 56c02beb6041..ab709023e9aa 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -616,9 +616,16 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
 			tmp = vma->vm_end;
 		if (tmp > end)
 			tmp = end;
+
+		if (vma->vm_ops && vma->vm_ops->mprotect)
+			error = vma->vm_ops->mprotect(vma, nstart, tmp, newflags);
+		if (error)
+			goto out;
+
 		error = mprotect_fixup(vma, &prev, nstart, tmp, newflags);
 		if (error)
 			goto out;
+
 		nstart = tmp;
 
 		if (nstart < prev->vm_end)