diff mbox series

[v39,13/24] x86/sgx: Add SGX_IOC_ENCLAVE_ADD_PAGES

Message ID 20201003045059.665934-14-jarkko.sakkinen@linux.intel.com (mailing list archive)
State New, archived
Headers show
Series Intel SGX foundations | expand

Commit Message

Jarkko Sakkinen Oct. 3, 2020, 4:50 a.m. UTC
Add an ioctl which performs ENCLS[EADD], adding a new visible page to an
enclave, and optionally ENCLS[EEXTEND] operations that hash the page into
the enclave measurement. By visible we mean a page that can be mapped into
the address range of an enclave.

Acked-by: Jethro Beekman <jethro@fortanix.com>
Tested-by: Jethro Beekman <jethro@fortanix.com>
Tested-by: Haitao Huang <haitao.huang@linux.intel.com>
Tested-by: Chunyang Hui <sanqian.hcy@antfin.com>
Tested-by: Jordan Hand <jorhand@linux.microsoft.com>
Tested-by: Nathaniel McCallum <npmccallum@redhat.com>
Tested-by: Seth Moore <sethmo@google.com>
Tested-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
---
 arch/x86/include/uapi/asm/sgx.h |  30 ++++
 arch/x86/kernel/cpu/sgx/ioctl.c | 299 ++++++++++++++++++++++++++++++++
 arch/x86/kernel/cpu/sgx/sgx.h   |   1 +
 3 files changed, 330 insertions(+)

Comments

Dave Hansen Oct. 16, 2020, 9:25 p.m. UTC | #1
> +/**
> + * struct sgx_enclave_add_pages - parameter structure for the
> + *                                %SGX_IOC_ENCLAVE_ADD_PAGES ioctl
> + * @src:	start address for the page data
> + * @offset:	starting page offset

Is this the offset *within* the page?  Might be nice to say that.

> + * @length:	length of the data (multiple of the page size)
> + * @secinfo:	address for the SECINFO data
> + * @flags:	page control flags
> + * @count:	number of bytes added (multiple of the page size)
> + */
> +struct sgx_enclave_add_pages {
> +	__u64 src;
> +	__u64 offset;
> +	__u64 length;
> +	__u64 secinfo;
> +	__u64 flags;
> +	__u64 count;
> +};
> +
>  #endif /* _UAPI_ASM_X86_SGX_H */
> diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
> index 9bb4694e57c1..e13e04737683 100644
> --- a/arch/x86/kernel/cpu/sgx/ioctl.c
> +++ b/arch/x86/kernel/cpu/sgx/ioctl.c
> @@ -194,6 +194,302 @@ static long sgx_ioc_enclave_create(struct sgx_encl *encl, void __user *arg)
>  	return ret;
>  }
>  
> +static struct sgx_encl_page *sgx_encl_page_alloc(struct sgx_encl *encl,
> +						 unsigned long offset,
> +						 u64 secinfo_flags)
> +{
> +	struct sgx_encl_page *encl_page;
> +	unsigned long prot;
> +
> +	encl_page = kzalloc(sizeof(*encl_page), GFP_KERNEL);
> +	if (!encl_page)
> +		return ERR_PTR(-ENOMEM);
> +
> +	encl_page->desc = encl->base + offset;
> +	encl_page->encl = encl;

Somewhere, we need an explanation of why we have 'sgx_epc_page' and
'sgx_encl_page'.  I think they're 1:1 at least after
sgx_encl_page_alloc(), so I'm wondering why we need two.

> +	prot = _calc_vm_trans(secinfo_flags, SGX_SECINFO_R, PROT_READ)  |
> +	       _calc_vm_trans(secinfo_flags, SGX_SECINFO_W, PROT_WRITE) |
> +	       _calc_vm_trans(secinfo_flags, SGX_SECINFO_X, PROT_EXEC);
> +
> +	/*
> +	 * TCS pages must always have RW set for CPU access, while the SECINFO
> +	 * permissions are *always* zero - the CPU ignores the user provided
> +	 * values and silently overwrites them with zero permissions.
> +	 */
> +	if ((secinfo_flags & SGX_SECINFO_PAGE_TYPE_MASK) == SGX_SECINFO_TCS)
> +		prot |= PROT_READ | PROT_WRITE;
> +
> +	/* Calculate maximum of the VM flags for the page. */
> +	encl_page->vm_max_prot_bits = calc_vm_prot_bits(prot, 0);
> +
> +	return encl_page;
> +}
> +
> +static int sgx_validate_secinfo(struct sgx_secinfo *secinfo)
> +{
> +	u64 perm = secinfo->flags & SGX_SECINFO_PERMISSION_MASK;
> +	u64 pt = secinfo->flags & SGX_SECINFO_PAGE_TYPE_MASK;

I'd align the ='s up there ^^

> +
> +	if (pt != SGX_SECINFO_REG && pt != SGX_SECINFO_TCS)
> +		return -EINVAL;
> +
> +	if ((perm & SGX_SECINFO_W) && !(perm & SGX_SECINFO_R))
> +		return -EINVAL;
> +
> +	/*
> +	 * The CPU will silently overwrite the permissions with zero, which
> +	 * means that we need to validate them ourselves.
> +	 */
> +	if (pt == SGX_SECINFO_TCS && perm)
> +		return -EINVAL;
> +
> +	if (secinfo->flags & SGX_SECINFO_RESERVED_MASK)
> +		return -EINVAL;
> +
> +	if (memchr_inv(secinfo->reserved, 0, sizeof(secinfo->reserved)))
> +		return -EINVAL;
> +
> +	return 0;
> +}
> +
> +static int __sgx_encl_add_page(struct sgx_encl *encl,
> +			       struct sgx_encl_page *encl_page,
> +			       struct sgx_epc_page *epc_page,
> +			       struct sgx_secinfo *secinfo, unsigned long src)
> +{
> +	struct sgx_pageinfo pginfo;
> +	struct vm_area_struct *vma;
> +	struct page *src_page;
> +	int ret;
> +
> +	/* Deny noexec. */
> +	vma = find_vma(current->mm, src);
> +	if (!vma)
> +		return -EFAULT;
> +
> +	if (!(vma->vm_flags & VM_MAYEXEC))
> +		return -EACCES;
> +
> +	ret = get_user_pages(src, 1, 0, &src_page, NULL);
> +	if (ret < 1)
> +		return -EFAULT;
> +
> +	pginfo.secs = (unsigned long)sgx_get_epc_addr(encl->secs.epc_page);
> +	pginfo.addr = SGX_ENCL_PAGE_ADDR(encl_page);
> +	pginfo.metadata = (unsigned long)secinfo;
> +	pginfo.contents = (unsigned long)kmap_atomic(src_page);
> +
> +	ret = __eadd(&pginfo, sgx_get_epc_addr(epc_page));

Could you convince me that EADD is not going to fault and make the
kmap_atomic() mad?

> +	kunmap_atomic((void *)pginfo.contents);

All the casting is kinda nasty, but I guess you do it to ensure you can
use __u64 in the hardware structs.

> +	put_page(src_page);
> +
> +	return ret ? -EIO : 0;
> +}
> +
> +/*
> + * If the caller requires measurement of the page as proof of its contents,
> + * use EEXTEND to add a measurement for 256 bytes of the page. Repeat this
> + * operation until the entire page is measured.
> + */
> +static int __sgx_encl_extend(struct sgx_encl *encl,
> +			     struct sgx_epc_page *epc_page)
> +{
> +	int ret;
> +	int i;
> +
> +	for (i = 0; i < 16; i++) {

No magic numbers please.

#define SGX_EEXTEND_NR_BYTES 16 ??

> +		ret = __eextend(sgx_get_epc_addr(encl->secs.epc_page),
> +				sgx_get_epc_addr(epc_page) + (i * 0x100));

What's the 0x100 for?

> +		if (ret) {
> +			if (encls_failed(ret))
> +				ENCLS_WARN(ret, "EEXTEND");
> +			return -EIO;

How frequent should we expect these to be?  Can users cause them?  You
should *probably* call it ENCLS_WARN_ONCE() if it's implemented that way.

> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +static int sgx_encl_add_page(struct sgx_encl *encl, unsigned long src,
> +			     unsigned long offset, struct sgx_secinfo *secinfo,
> +			     unsigned long flags)
> +{
> +	struct sgx_encl_page *encl_page;
> +	struct sgx_epc_page *epc_page;
> +	int ret;
> +
> +	encl_page = sgx_encl_page_alloc(encl, offset, secinfo->flags);
> +	if (IS_ERR(encl_page))
> +		return PTR_ERR(encl_page);
> +
> +	epc_page = __sgx_alloc_epc_page();
> +	if (IS_ERR(epc_page)) {
> +		kfree(encl_page);
> +		return PTR_ERR(epc_page);
> +	}

Looking at these, I'm forgetting why we need to both allocate an
encl_page and an epc_page.  Comments might remind me.  So would better
names.

> +	mmap_read_lock(current->mm);
> +	mutex_lock(&encl->lock);
> +
> +	/*
> +	 * Insert prior to EADD in case of OOM.

I wouldn't say OOM.  Maybe:

	xa_insert() and EADD can both fail.  But xa_insert() is easier
	to unwind so do it first.

>                                              EADD modifies MRENCLAVE, i.e.

What is MRENCLAVE?

> +	 * can't be gracefully unwound, while failure on EADD/EXTEND is limited
> +	 * to userspace errors (or kernel/hardware bugs).
> +	 */
> +	ret = xa_insert(&encl->page_array, PFN_DOWN(encl_page->desc),
> +			encl_page, GFP_KERNEL);
> +	if (ret)
> +		goto err_out_unlock;
> +
> +	ret = __sgx_encl_add_page(encl, encl_page, epc_page, secinfo,
> +				  src);
> +	if (ret)
> +		goto err_out;
> +
> +	/*
> +	 * Complete the "add" before doing the "extend" so that the "add"
> +	 * isn't in a half-baked state in the extremely unlikely scenario
> +	 * the enclave will be destroyed in response to EEXTEND failure.
> +	 */
> +	encl_page->encl = encl;
> +	encl_page->epc_page = epc_page;
> +	encl->secs_child_cnt++;
> +
> +	if (flags & SGX_PAGE_MEASURE) {
> +		ret = __sgx_encl_extend(encl, epc_page);
> +		if (ret)
> +			goto err_out;
> +	}

Why would we ever *not* measure an added page?

> +	mutex_unlock(&encl->lock);
> +	mmap_read_unlock(current->mm);
> +	return ret;
> +
> +err_out:
> +	xa_erase(&encl->page_array, PFN_DOWN(encl_page->desc));
> +
> +err_out_unlock:
> +	mutex_unlock(&encl->lock);
> +	mmap_read_unlock(current->mm);
> +
> +	sgx_free_epc_page(epc_page);
> +	kfree(encl_page);
> +
> +	return ret;
> +}
> +
> +/**
> + * sgx_ioc_enclave_add_pages() - The handler for %SGX_IOC_ENCLAVE_ADD_PAGES
> + * @encl:       an enclave pointer
> + * @arg:	a user pointer to a struct sgx_enclave_add_pages instance
> + *
> + * Add one or more pages to an uninitialized enclave, and optionally extend the
> + * measurement with the contents of the page. The SECINFO and measurement mask
> + * are applied to all pages.
> + *
> + * A SECINFO for a TCS is required to always contain zero permissions because
> + * CPU silently zeros them. Allowing anything else would cause a mismatch in
> + * the measurement.
> + *
> + * mmap()'s protection bits are capped by the page permissions. For each page
> + * address, the maximum protection bits are computed with the following
> + * heuristics:
> + *
> + * 1. A regular page: PROT_R, PROT_W and PROT_X match the SECINFO permissions.
> + * 2. A TCS page: PROT_R | PROT_W.
> + *
> + * mmap() is not allowed to surpass the minimum of the maximum protection bits
> + * within the given address range.
> + *
> + * The function deinitializes kernel data structures for enclave and returns
> + * -EIO in any of the following conditions:
> + *
> + * - Enclave Page Cache (EPC), the physical memory holding enclaves, has
> + *   been invalidated. This will cause EADD and EEXTEND to fail.
> + * - If the source address is corrupted somehow when executing EADD.
> + *
> + * Return:
> + *   length of the data processed on success,
> + *   -EACCES if an executable source page is located in a noexec partition,
> + *   -ENOMEM if the system is out of EPC pages,
> + *   -EINTR if the call was interrupted before any data was processed,
> + *   -EIO if the enclave was lost
> + *   -errno otherwise
> + */
> +static long sgx_ioc_enclave_add_pages(struct sgx_encl *encl, void __user *arg)
> +{
> +	struct sgx_enclave_add_pages addp;
> +	struct sgx_secinfo secinfo;
> +	unsigned long c;
> +	int ret;
> +
> +	if ((atomic_read(&encl->flags) & SGX_ENCL_INITIALIZED) ||
> +	    !(atomic_read(&encl->flags) & SGX_ENCL_CREATED))
> +		return -EINVAL;

There should be a nice state machine documented somewhere.  Is there?

> +	if (copy_from_user(&addp, arg, sizeof(addp)))
> +		return -EFAULT;
> +
> +	if (!IS_ALIGNED(addp.offset, PAGE_SIZE) ||
> +	    !IS_ALIGNED(addp.src, PAGE_SIZE))
> +		return -EINVAL;
> +
> +	if (!(access_ok(addp.src, PAGE_SIZE)))
> +		return -EFAULT;

This worries me.  You're doing an access_ok() check on addp.src because
you evidently don't trust it.  But, below, it looks to be accessed
directly with an offset, bound by addp.length, which I think can be
>PAGE_SIZE.

I'd feel a lot better if addp.src's value was being passed around as a
__user pointer.

> +	if (addp.length & (PAGE_SIZE - 1))
> +		return -EINVAL;
> +
> +	if (addp.offset + addp.length - PAGE_SIZE >= encl->size)
> +		return -EINVAL;
> +
> +	if (copy_from_user(&secinfo, (void __user *)addp.secinfo,
> +			   sizeof(secinfo)))
> +		return -EFAULT;
> +
> +	if (sgx_validate_secinfo(&secinfo))
> +		return -EINVAL;
> +
> +	for (c = 0 ; c < addp.length; c += PAGE_SIZE) {
> +		if (signal_pending(current)) {
> +			if (!c)
> +				ret = -ERESTARTSYS;
> +
> +			break;
> +		}
> +
> +		if (c == SGX_MAX_ADD_PAGES_LENGTH)
> +			break;
> +
> +		if (need_resched())
> +			cond_resched();
> +
> +		ret = sgx_encl_add_page(encl, addp.src + c, addp.offset + c,
> +					&secinfo, addp.flags);

Yeah...  Don't we need to do another access_ok() check here, if we
needed one above since we are moving away from addp.src?

> +		if (ret)
> +			break;
> +	}
> +
> +	addp.count = c;
> +
> +	if (copy_to_user(arg, &addp, sizeof(addp)))
> +		return -EFAULT;
> +
> +	/*
> +	 * If the enclave was lost, deinitialize the internal data structures
> +	 * for the enclave.
> +	 */
> +	if (ret == -EIO) {
> +		mutex_lock(&encl->lock);
> +		sgx_encl_destroy(encl);
> +		mutex_unlock(&encl->lock);
> +	}
> +
> +	return ret;
> +}
> +
>  long sgx_ioctl(struct file *filep, unsigned int cmd, unsigned long arg)
>  {
>  	struct sgx_encl *encl = filep->private_data;
> @@ -212,6 +508,9 @@ long sgx_ioctl(struct file *filep, unsigned int cmd, unsigned long arg)
>  	case SGX_IOC_ENCLAVE_CREATE:
>  		ret = sgx_ioc_enclave_create(encl, (void __user *)arg);
>  		break;
> +	case SGX_IOC_ENCLAVE_ADD_PAGES:
> +		ret = sgx_ioc_enclave_add_pages(encl, (void __user *)arg);
> +		break;
>  	default:
>  		ret = -ENOIOCTLCMD;
>  		break;
> diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
> index fce756c3434b..8d126070db1e 100644
> --- a/arch/x86/kernel/cpu/sgx/sgx.h
> +++ b/arch/x86/kernel/cpu/sgx/sgx.h
> @@ -34,6 +34,7 @@ struct sgx_epc_section {
>  
>  #define SGX_EPC_SECTION_MASK		GENMASK(7, 0)
>  #define SGX_MAX_EPC_SECTIONS		(SGX_EPC_SECTION_MASK + 1)
> +#define SGX_MAX_ADD_PAGES_LENGTH	0x100000
>  
>  extern struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];
>  
>
Jarkko Sakkinen Oct. 18, 2020, 5:03 a.m. UTC | #2
On Fri, Oct 16, 2020 at 02:25:50PM -0700, Dave Hansen wrote:
> 
> > +/**
> > + * struct sgx_enclave_add_pages - parameter structure for the
> > + *                                %SGX_IOC_ENCLAVE_ADD_PAGES ioctl
> > + * @src:	start address for the page data
> > + * @offset:	starting page offset
> 
> Is this the offset *within* the page?  Might be nice to say that.

It's the offset in the enclave address range where the page is to be added.

> > + * @length:	length of the data (multiple of the page size)
> > + * @secinfo:	address for the SECINFO data
> > + * @flags:	page control flags
> > + * @count:	number of bytes added (multiple of the page size)
> > + */
> > +struct sgx_enclave_add_pages {
> > +	__u64 src;
> > +	__u64 offset;
> > +	__u64 length;
> > +	__u64 secinfo;
> > +	__u64 flags;
> > +	__u64 count;
> > +};
> > +
> >  #endif /* _UAPI_ASM_X86_SGX_H */
> > diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
> > index 9bb4694e57c1..e13e04737683 100644
> > --- a/arch/x86/kernel/cpu/sgx/ioctl.c
> > +++ b/arch/x86/kernel/cpu/sgx/ioctl.c
> > @@ -194,6 +194,302 @@ static long sgx_ioc_enclave_create(struct sgx_encl *encl, void __user *arg)
> >  	return ret;
> >  }
> >  
> > +static struct sgx_encl_page *sgx_encl_page_alloc(struct sgx_encl *encl,
> > +						 unsigned long offset,
> > +						 u64 secinfo_flags)
> > +{
> > +	struct sgx_encl_page *encl_page;
> > +	unsigned long prot;
> > +
> > +	encl_page = kzalloc(sizeof(*encl_page), GFP_KERNEL);
> > +	if (!encl_page)
> > +		return ERR_PTR(-ENOMEM);
> > +
> > +	encl_page->desc = encl->base + offset;
> > +	encl_page->encl = encl;
> 
> Somewhere, we need an explanation of why we have 'sgx_epc_page' and
> 'sgx_encl_page'.  I think they're 1:1 at least after
> sgx_encl_page_alloc(), so I'm wondering why we need two.

You need sgx_encl_page to hold data that exists whether or not there is
an associated EPC page.

Essentially sgx_encl_page contains the data needed for a virtual page,
and sgx_epc_page contains what is needed to represent a physical page.

None of the data contained in sgx_encl_page makes sense for sgx_epc_page.
They don't contain intersecting or redundant data.
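
For reference, the rough shape of sgx_encl_page as used in this patch
(field types are approximated from how the fields are used above; the
actual definition lives in encl.h):

	struct sgx_encl_page {
		unsigned long desc;		/* encl->base + offset */
		unsigned long vm_max_prot_bits;	/* from calc_vm_prot_bits() */
		struct sgx_epc_page *epc_page;	/* backing EPC page, when present */
		struct sgx_encl *encl;
	};

sgx_epc_page, in contrast, only tracks the physical EPC page itself.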

> > +	prot = _calc_vm_trans(secinfo_flags, SGX_SECINFO_R, PROT_READ)  |
> > +	       _calc_vm_trans(secinfo_flags, SGX_SECINFO_W, PROT_WRITE) |
> > +	       _calc_vm_trans(secinfo_flags, SGX_SECINFO_X, PROT_EXEC);
> > +
> > +	/*
> > +	 * TCS pages must always have RW set for CPU access, while the SECINFO
> > +	 * permissions are *always* zero - the CPU ignores the user provided
> > +	 * values and silently overwrites them with zero permissions.
> > +	 */
> > +	if ((secinfo_flags & SGX_SECINFO_PAGE_TYPE_MASK) == SGX_SECINFO_TCS)
> > +		prot |= PROT_READ | PROT_WRITE;
> > +
> > +	/* Calculate maximum of the VM flags for the page. */
> > +	encl_page->vm_max_prot_bits = calc_vm_prot_bits(prot, 0);
> > +
> > +	return encl_page;
> > +}
> > +
> > +static int sgx_validate_secinfo(struct sgx_secinfo *secinfo)
> > +{
> > +	u64 perm = secinfo->flags & SGX_SECINFO_PERMISSION_MASK;
> > +	u64 pt = secinfo->flags & SGX_SECINFO_PAGE_TYPE_MASK;
> 
> I'd align the ='s up there ^^

Thanks, I updated this.

> > +
> > +	if (pt != SGX_SECINFO_REG && pt != SGX_SECINFO_TCS)
> > +		return -EINVAL;
> > +
> > +	if ((perm & SGX_SECINFO_W) && !(perm & SGX_SECINFO_R))
> > +		return -EINVAL;
> > +
> > +	/*
> > +	 * The CPU will silently overwrite the permissions with zero, which
> > +	 * means that we need to validate them ourselves.
> > +	 */
> > +	if (pt == SGX_SECINFO_TCS && perm)
> > +		return -EINVAL;
> > +
> > +	if (secinfo->flags & SGX_SECINFO_RESERVED_MASK)
> > +		return -EINVAL;
> > +
> > +	if (memchr_inv(secinfo->reserved, 0, sizeof(secinfo->reserved)))
> > +		return -EINVAL;
> > +
> > +	return 0;
> > +}
> > +
> > +static int __sgx_encl_add_page(struct sgx_encl *encl,
> > +			       struct sgx_encl_page *encl_page,
> > +			       struct sgx_epc_page *epc_page,
> > +			       struct sgx_secinfo *secinfo, unsigned long src)
> > +{
> > +	struct sgx_pageinfo pginfo;
> > +	struct vm_area_struct *vma;
> > +	struct page *src_page;
> > +	int ret;
> > +
> > +	/* Deny noexec. */
> > +	vma = find_vma(current->mm, src);
> > +	if (!vma)
> > +		return -EFAULT;
> > +
> > +	if (!(vma->vm_flags & VM_MAYEXEC))
> > +		return -EACCES;
> > +
> > +	ret = get_user_pages(src, 1, 0, &src_page, NULL);
> > +	if (ret < 1)
> > +		return -EFAULT;
> > +
> > +	pginfo.secs = (unsigned long)sgx_get_epc_addr(encl->secs.epc_page);
> > +	pginfo.addr = SGX_ENCL_PAGE_ADDR(encl_page);
> > +	pginfo.metadata = (unsigned long)secinfo;
> > +	pginfo.contents = (unsigned long)kmap_atomic(src_page);
> > +
> > +	ret = __eadd(&pginfo, sgx_get_epc_addr(epc_page));
> 
> Could you convince me that EADD is not going to fault and make the
> kmap_atomic() mad?

It can legitimately fail when a power cycle happens.

That's why the inline assembly catches faults and returns an error code.
This code has been field tested a lot. I have fairly good trust in it.

> > +	kunmap_atomic((void *)pginfo.contents);
> 
> All the casting is kinda nasty, but I gues you do it to ensure you can
> use __u64 in the hardware structs.

Yup.

> 
> > +	put_page(src_page);
> > +
> > +	return ret ? -EIO : 0;
> > +}
> > +
> > +/*
> > + * If the caller requires measurement of the page as a proof for the content,
> > + * use EEXTEND to add a measurement for 256 bytes of the page. Repeat this
> > + * operation until the entire page is measured."
> > + */
> > +static int __sgx_encl_extend(struct sgx_encl *encl,
> > +			     struct sgx_epc_page *epc_page)
> > +{
> > +	int ret;
> > +	int i;
> > +
> > +	for (i = 0; i < 16; i++) {
> 
> No magic numbers please.
> 
> #define SGX_EEXTEND_NR_BYTES 16 ??
> 
> > +		ret = __eextend(sgx_get_epc_addr(encl->secs.epc_page),
> > +				sgx_get_epc_addr(epc_page) + (i * 0x100));
> 
> What's the 0x100 for?

It's the block size for this operation. I will define constants.
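
E.g. something along these lines (a sketch; the constant name is just a
placeholder):

	/* In sgx.h; EEXTEND measures 256-byte chunks of an EPC page: */
	#define SGX_EEXTEND_BLOCK_SIZE	256

	/* ...and the loop in __sgx_encl_extend() becomes: */
	for (i = 0; i < PAGE_SIZE / SGX_EEXTEND_BLOCK_SIZE; i++) {
		ret = __eextend(sgx_get_epc_addr(encl->secs.epc_page),
				sgx_get_epc_addr(epc_page) +
				(i * SGX_EEXTEND_BLOCK_SIZE));
		if (ret) {
			if (encls_failed(ret))
				ENCLS_WARN(ret, "EEXTEND");
			return -EIO;
		}
	}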

> > +		if (ret) {
> > +			if (encls_failed(ret))
> > +				ENCLS_WARN(ret, "EEXTEND");
> > +			return -EIO;
> 
> How frequent should we expect these to be?  Can users cause them?  You
> should *probably* call it ENCLS_WARN_ONCE() if it's implemented that way.

If a power cycle happens.

> > +		}
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +static int sgx_encl_add_page(struct sgx_encl *encl, unsigned long src,
> > +			     unsigned long offset, struct sgx_secinfo *secinfo,
> > +			     unsigned long flags)
> > +{
> > +	struct sgx_encl_page *encl_page;
> > +	struct sgx_epc_page *epc_page;
> > +	int ret;
> > +
> > +	encl_page = sgx_encl_page_alloc(encl, offset, secinfo->flags);
> > +	if (IS_ERR(encl_page))
> > +		return PTR_ERR(encl_page);
> > +
> > +	epc_page = __sgx_alloc_epc_page();
> > +	if (IS_ERR(epc_page)) {
> > +		kfree(encl_page);
> > +		return PTR_ERR(epc_page);
> > +	}
> 
> Looking at these, I'm forgetting why we need to both allocate an
> encl_page and an epc_page.  Comments might remind me.  So would better
> names.

Should the struct names be renamed?

Like sgx_phys_epc_page and sgx_virt_epc_page?

> 
> > +	mmap_read_lock(current->mm);
> > +	mutex_lock(&encl->lock);
> > +
> > +	/*
> > +	 * Insert prior to EADD in case of OOM.
> 
> I wouldn't say OOM.  Maybe:
> 
> 	xa_insert() and EADD can both fail.  But xa_insert() is easier
> 	to unwind so do it first.
> 
> >                                              EADD modifies MRENCLAVE, i.e.
> 
> What is MRENCLAVE?

The measurement stored in SECS. I'm wondering, with xarray, is it
possible to preallocate an entry without inserting anything?

Then we could get rid of this unwind and also would not need to
take encl->lock in sgx_encl_may_map().

> > +	 * can't be gracefully unwound, while failure on EADD/EXTEND is limited
> > +	 * to userspace errors (or kernel/hardware bugs).
> > +	 */
> > +	ret = xa_insert(&encl->page_array, PFN_DOWN(encl_page->desc),
> > +			encl_page, GFP_KERNEL);
> > +	if (ret)
> > +		goto err_out_unlock;
> > +
> > +	ret = __sgx_encl_add_page(encl, encl_page, epc_page, secinfo,
> > +				  src);
> > +	if (ret)
> > +		goto err_out;
> > +
> > +	/*
> > +	 * Complete the "add" before doing the "extend" so that the "add"
> > +	 * isn't in a half-baked state in the extremely unlikely scenario
> > +	 * the enclave will be destroyed in response to EEXTEND failure.
> > +	 */
> > +	encl_page->encl = encl;
> > +	encl_page->epc_page = epc_page;
> > +	encl->secs_child_cnt++;
> > +
> > +	if (flags & SGX_PAGE_MEASURE) {
> > +		ret = __sgx_encl_extend(encl, epc_page);
> > +		if (ret)
> > +			goto err_out;
> > +	}
> 
> Why would we ever *not* measure an added page?

You might add Thread Control Structure pages or data areas without
measuring them. There are reasons for user space not to have everything
measured.
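
For example, a userspace loader might do something like this to add an
unmeasured TCS page (hypothetical variable names, error handling omitted):

	struct sgx_enclave_add_pages addp = {
		.src     = (__u64)(uintptr_t)tcs_src,	/* page-aligned source buffer */
		.offset  = tcs_offset,			/* offset in the enclave range */
		.length  = 4096,
		.secinfo = (__u64)(uintptr_t)&tcs_secinfo, /* page type TCS, perms 0 */
		.flags   = 0,				/* no SGX_PAGE_MEASURE */
	};

	ioctl(enclave_fd, SGX_IOC_ENCLAVE_ADD_PAGES, &addp);

Code and data that must be reflected in the measurement would pass
.flags = SGX_PAGE_MEASURE instead.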

> 
> > +	mutex_unlock(&encl->lock);
> > +	mmap_read_unlock(current->mm);
> > +	return ret;
> > +
> > +err_out:
> > +	xa_erase(&encl->page_array, PFN_DOWN(encl_page->desc));
> > +
> > +err_out_unlock:
> > +	mutex_unlock(&encl->lock);
> > +	mmap_read_unlock(current->mm);
> > +
> > +	sgx_free_epc_page(epc_page);
> > +	kfree(encl_page);
> > +
> > +	return ret;
> > +}
> > +
> > +/**
> > + * sgx_ioc_enclave_add_pages() - The handler for %SGX_IOC_ENCLAVE_ADD_PAGES
> > + * @encl:       an enclave pointer
> > + * @arg:	a user pointer to a struct sgx_enclave_add_pages instance
> > + *
> > + * Add one or more pages to an uninitialized enclave, and optionally extend the
> > + * measurement with the contents of the page. The SECINFO and measurement mask
> > + * are applied to all pages.
> > + *
> > + * A SECINFO for a TCS is required to always contain zero permissions because
> > + * CPU silently zeros them. Allowing anything else would cause a mismatch in
> > + * the measurement.
> > + *
> > + * mmap()'s protection bits are capped by the page permissions. For each page
> > + * address, the maximum protection bits are computed with the following
> > + * heuristics:
> > + *
> > + * 1. A regular page: PROT_R, PROT_W and PROT_X match the SECINFO permissions.
> > + * 2. A TCS page: PROT_R | PROT_W.
> > + *
> > + * mmap() is not allowed to surpass the minimum of the maximum protection bits
> > + * within the given address range.
> > + *
> > + * The function deinitializes kernel data structures for enclave and returns
> > + * -EIO in any of the following conditions:
> > + *
> > + * - Enclave Page Cache (EPC), the physical memory holding enclaves, has
> > + *   been invalidated. This will cause EADD and EEXTEND to fail.
> > + * - If the source address is corrupted somehow when executing EADD.
> > + *
> > + * Return:
> > + *   length of the data processed on success,
> > + *   -EACCES if an executable source page is located in a noexec partition,
> > + *   -ENOMEM if the system is out of EPC pages,
> > + *   -EINTR if the call was interrupted before any data was processed,
> > + *   -EIO if the enclave was lost
> > + *   -errno otherwise
> > + */
> > +static long sgx_ioc_enclave_add_pages(struct sgx_encl *encl, void __user *arg)
> > +{
> > +	struct sgx_enclave_add_pages addp;
> > +	struct sgx_secinfo secinfo;
> > +	unsigned long c;
> > +	int ret;
> > +
> > +	if ((atomic_read(&encl->flags) & SGX_ENCL_INITIALIZED) ||
> > +	    !(atomic_read(&encl->flags) & SGX_ENCL_CREATED))
> > +		return -EINVAL;
> 
> There should be a nice state machine documented somewhere.  Is there?

So should I document them in encl.h, where they are declared, to start with?

> 
> > +	if (copy_from_user(&addp, arg, sizeof(addp)))
> > +		return -EFAULT;
> > +
> > +	if (!IS_ALIGNED(addp.offset, PAGE_SIZE) ||
> > +	    !IS_ALIGNED(addp.src, PAGE_SIZE))
> > +		return -EINVAL;
> > +
> > +	if (!(access_ok(addp.src, PAGE_SIZE)))
> > +		return -EFAULT;
> 
> This worries me.  You're doing an access_ok() check on addp.src because
> you evidently don't trust it.  But, below, it looks to be accessed
> directly with an offset, bound by addp.length, which I think can be
> >PAGE_SIZE.
> 
> I'd feel a lot better if addp.src's value was being passed around as a
> __user pointer.

I'm not sure if that call is even needed. Each page is pinned with
get_user_pages(). AFAIK, it should be enough. This must be legacy cruft.

> > +	if (addp.length & (PAGE_SIZE - 1))
> > +		return -EINVAL;
> > +
> > +	if (addp.offset + addp.length - PAGE_SIZE >= encl->size)
> > +		return -EINVAL;
> > +
> > +	if (copy_from_user(&secinfo, (void __user *)addp.secinfo,
> > +			   sizeof(secinfo)))
> > +		return -EFAULT;
> > +
> > +	if (sgx_validate_secinfo(&secinfo))
> > +		return -EINVAL;
> > +
> > +	for (c = 0 ; c < addp.length; c += PAGE_SIZE) {
> > +		if (signal_pending(current)) {
> > +			if (!c)
> > +				ret = -ERESTARTSYS;
> > +
> > +			break;
> > +		}
> > +
> > +		if (c == SGX_MAX_ADD_PAGES_LENGTH)
> > +			break;
> > +
> > +		if (need_resched())
> > +			cond_resched();
> > +
> > +		ret = sgx_encl_add_page(encl, addp.src + c, addp.offset + c,
> > +					&secinfo, addp.flags);
> 
> Yeah...  Don't we need to do another access_ok() check here, if we
> needed one above since we are moving away from addp.src?

I don't think so because the page is pinned with get_user_pages().

> 
> > +		if (ret)
> > +			break;
> > +	}
> > +
> > +	addp.count = c;
> > +
> > +	if (copy_to_user(arg, &addp, sizeof(addp)))
> > +		return -EFAULT;
> > +
> > +	/*
> > +	 * If the enclave was lost, deinitialize the internal data structures
> > +	 * for the enclave.
> > +	 */
> > +	if (ret == -EIO) {
> > +		mutex_lock(&encl->lock);
> > +		sgx_encl_destroy(encl);
> > +		mutex_unlock(&encl->lock);
> > +	}
> > +
> > +	return ret;
> > +}
> > +
> >  long sgx_ioctl(struct file *filep, unsigned int cmd, unsigned long arg)
> >  {
> >  	struct sgx_encl *encl = filep->private_data;
> > @@ -212,6 +508,9 @@ long sgx_ioctl(struct file *filep, unsigned int cmd, unsigned long arg)
> >  	case SGX_IOC_ENCLAVE_CREATE:
> >  		ret = sgx_ioc_enclave_create(encl, (void __user *)arg);
> >  		break;
> > +	case SGX_IOC_ENCLAVE_ADD_PAGES:
> > +		ret = sgx_ioc_enclave_add_pages(encl, (void __user *)arg);
> > +		break;
> >  	default:
> >  		ret = -ENOIOCTLCMD;
> >  		break;
> > diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
> > index fce756c3434b..8d126070db1e 100644
> > --- a/arch/x86/kernel/cpu/sgx/sgx.h
> > +++ b/arch/x86/kernel/cpu/sgx/sgx.h
> > @@ -34,6 +34,7 @@ struct sgx_epc_section {
> >  
> >  #define SGX_EPC_SECTION_MASK		GENMASK(7, 0)
> >  #define SGX_MAX_EPC_SECTIONS		(SGX_EPC_SECTION_MASK + 1)
> > +#define SGX_MAX_ADD_PAGES_LENGTH	0x100000
> >  
> >  extern struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];
> >  
> > 
> 

/Jarkko
Jarkko Sakkinen Oct. 19, 2020, 7:03 a.m. UTC | #3
On Sun, Oct 18, 2020 at 08:03:11AM +0300, Jarkko Sakkinen wrote:
> > > +	mmap_read_lock(current->mm);
> > > +	mutex_lock(&encl->lock);
> > > +
> > > +	/*
> > > +	 * Insert prior to EADD in case of OOM.
> > 
> > I wouldn't say OOM.  Maybe:
> > 
> > 	xa_insert() and EADD can both fail.  But xa_insert() is easier
> > 	to unwind so do it first.
> > 
> > >                                              EADD modifies MRENCLAVE, i.e.
> > 
> > What is MRENCLAVE?
> 
> The measurement stored in SECS. I'm wondering, with xarray, is it
> possible to preallocate an entry without inserting anything?
> 
> Then we could get rid of this unwind and also would not need to
> take encl->lock in sgx_encl_may_map().

I'm still a bit confused by the unfamiliar Xarray API, but I think I
got it:

1. xa_insert() with a NULL entry reserves the index and, more importantly,
   does the memory allocation.
2. xa_cmpxchg() with the enclave page, if EADD and the EEXTENDs succeed.
3. xa_release() otherwise.

This way sgx_encl_may_map() will never see a stale enclave page when it
does the permission check, even if encl->lock is not taken.

I mean right now I have to take both the xas lock and the enclave lock,
which is terrible, but this will take care of it.
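
Roughly, in sgx_encl_add_page() (a condensed and untested sketch; assumes
reserved entries behave as described in Documentation/core-api/xarray.rst,
and the secs_child_cnt etc. bookkeeping stays as is):

	/* Reserve the index and do the allocation before touching HW state. */
	ret = xa_insert(&encl->page_array, PFN_DOWN(encl_page->desc),
			NULL, GFP_KERNEL);
	if (ret)
		goto err_free;	/* free epc_page + encl_page as today */

	ret = __sgx_encl_add_page(encl, encl_page, epc_page, secinfo, src);
	if (!ret && (flags & SGX_PAGE_MEASURE))
		ret = __sgx_encl_extend(encl, epc_page);

	if (ret) {
		/* Nothing was published; just drop the reservation. */
		xa_release(&encl->page_array, PFN_DOWN(encl_page->desc));
		goto err_free;
	}

	/* Publish the page only after EADD/EEXTEND have succeeded. */
	xa_cmpxchg(&encl->page_array, PFN_DOWN(encl_page->desc), NULL,
		   encl_page, GFP_KERNEL);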

I will rewrite the comment to something more reasonable, once I've done
this code change.

The reason for doing the insert first is that, if we get -ENOMEM after
successful EADD and EEXTENDs, we have a legit microarchitectural state,
but you cannot roll back a hash (MRENCLAVE), so the game is over anyway
because your data structures are not in sync.

If -ENOMEM comes before, everything is still in sync and we don't have
to invalidate the enclave.

/Jarkko
Dave Hansen Oct. 19, 2020, 8:48 p.m. UTC | #4
On 10/17/20 10:03 PM, Jarkko Sakkinen wrote:
> On Fri, Oct 16, 2020 at 02:25:50PM -0700, Dave Hansen wrote:
>>> +/**
>>> + * struct sgx_enclave_add_pages - parameter structure for the
>>> + *                                %SGX_IOC_ENCLAVE_ADD_PAGES ioctl
>>> + * @src:	start address for the page data
>>> + * @offset:	starting page offset
>>
>> Is this the offset *within* the page?  Might be nice to say that.
> 
> It's the offset in the enclave address range where the page is to be added.

Yikes, comment improvement needed, badly.

>>> +static struct sgx_encl_page *sgx_encl_page_alloc(struct sgx_encl *encl,
>>> +						 unsigned long offset,
>>> +						 u64 secinfo_flags)
>>> +{
>>> +	struct sgx_encl_page *encl_page;
>>> +	unsigned long prot;
>>> +
>>> +	encl_page = kzalloc(sizeof(*encl_page), GFP_KERNEL);
>>> +	if (!encl_page)
>>> +		return ERR_PTR(-ENOMEM);
>>> +
>>> +	encl_page->desc = encl->base + offset;
>>> +	encl_page->encl = encl;
>>
>> Somewhere, we need an explanation of why we have 'sgx_epc_page' and
>> 'sgx_encl_page'.  I think they're 1:1 at least after
>> sgx_encl_page_alloc(), so I'm wondering why we need two.
> 
> You need sgx_encl_page to hold data that exists whether or not there is
> an associated EPC page.

Except they're currently tightly bound:

>         encl_page = sgx_encl_page_alloc(encl, offset, secinfo->flags);
>         if (IS_ERR(encl_page))
>                 return PTR_ERR(encl_page);
>  
> -       epc_page = __sgx_alloc_epc_page();
> +       epc_page = sgx_alloc_epc_page(encl_page, true);
>         if (IS_ERR(epc_page)) {
>                 kfree(encl_page);
>                 return PTR_ERR(epc_page);
>         }

So, is this because 'sgx_encl_page' continues to exist even if
'sgx_epc_page' is reclaimed?

> Essentially sgx_encl_page contains the data needed for a virtual page,
> and sgx_epc_page contains what is needed to represent a physical page.

So, grumble grumble, that's horribly inefficient for sparse mappings.
There's a reason VMAs cover ranges instead of being allocated
per-virtual-page.

> None of the data contained in sgx_encl_page makes sense for sgx_epc_page.
> They don't contain intersecting or redundant data.

Yeah, except they point to each other, so if one isn't necessary, we can
get rid of that pointer.

>>> +static int __sgx_encl_add_page(struct sgx_encl *encl,
>>> +			       struct sgx_encl_page *encl_page,
>>> +			       struct sgx_epc_page *epc_page,
>>> +			       struct sgx_secinfo *secinfo, unsigned long src)
>>> +{
>>> +	struct sgx_pageinfo pginfo;
>>> +	struct vm_area_struct *vma;
>>> +	struct page *src_page;
>>> +	int ret;
>>> +
>>> +	/* Deny noexec. */
>>> +	vma = find_vma(current->mm, src);
>>> +	if (!vma)
>>> +		return -EFAULT;
>>> +
>>> +	if (!(vma->vm_flags & VM_MAYEXEC))
>>> +		return -EACCES;
>>> +
>>> +	ret = get_user_pages(src, 1, 0, &src_page, NULL);
>>> +	if (ret < 1)
>>> +		return -EFAULT;
>>> +
>>> +	pginfo.secs = (unsigned long)sgx_get_epc_addr(encl->secs.epc_page);
>>> +	pginfo.addr = SGX_ENCL_PAGE_ADDR(encl_page);
>>> +	pginfo.metadata = (unsigned long)secinfo;
>>> +	pginfo.contents = (unsigned long)kmap_atomic(src_page);
>>> +
>>> +	ret = __eadd(&pginfo, sgx_get_epc_addr(epc_page));
>>
>> Could you convince me that EADD is not going to fault and make the
>> kmap_atomic() mad?
> 
> It can legitimately fail when a power cycle happens.
> 
> That's why the inline assembly catches faults and returns an error code.
> This code has been field tested a lot. I have fairly good trust in it.

OK, so it can fault, but not *sleep*.

Can you comment it to that effect, please?

>>> +		if (ret) {
>>> +			if (encls_failed(ret))
>>> +				ENCLS_WARN(ret, "EEXTEND");
>>> +			return -EIO;
>>
>> How frequent should we expect these to be?  Can users cause them?  You
>> should *probably* call it ENCLS_WARN_ONCE() if it's implemented that way.
> 
> If a power cycle happens.

So, we get one warning per power cycle?  Practically, do you mean a
suspend/resume cycle, or is this more like hibernation-to-disk-resume?

In any case, if this is normal system operation (which closing my laptop
lid qualifies as), it should produce zero warnings.


>>> +static int sgx_encl_add_page(struct sgx_encl *encl, unsigned long src,
>>> +			     unsigned long offset, struct sgx_secinfo *secinfo,
>>> +			     unsigned long flags)
>>> +{
>>> +	struct sgx_encl_page *encl_page;
>>> +	struct sgx_epc_page *epc_page;
>>> +	int ret;
>>> +
>>> +	encl_page = sgx_encl_page_alloc(encl, offset, secinfo->flags);
>>> +	if (IS_ERR(encl_page))
>>> +		return PTR_ERR(encl_page);
>>> +
>>> +	epc_page = __sgx_alloc_epc_page();
>>> +	if (IS_ERR(epc_page)) {
>>> +		kfree(encl_page);
>>> +		return PTR_ERR(epc_page);
>>> +	}
>>
>> Looking at these, I'm forgetting why we need to both allocate an
>> encl_page and an epc_page.  Comments might remind me.  So would better
>> names.
> 
> Should the struct names be renamed?
> 
> Like sgx_phys_epc_page and sgx_virt_epc_page?

"epc" is additional acronym nonsense and redundant with "sgx" and "page"
anyway.

I'd probably call then 'sgx_phys_page' and 'sgx_virt_slot' or something.

>>> +	mmap_read_lock(current->mm);
>>> +	mutex_lock(&encl->lock);
>>> +
>>> +	/*
>>> +	 * Insert prior to EADD in case of OOM.
>>
>> I wouldn't say OOM.  Maybe:
>>
>> 	xa_insert() and EADD can both fail.  But xa_insert() is easier
>> 	to unwind so do it first.
>>
>>>                                              EADD modifies MRENCLAVE, i.e.
>>
>> What is MRENCLAVE?
> 
> The measurement stored in SECS. I'm wondering, with xarray, is it
> possible to preallocate an entry without inserting anything?

Let's use plain English here.  I don't care what the implementation
does, I just care about what it means to the kernel.

> Then we could get rid of this unwind and also would not need to
> take encl->lock in sgx_encl_may_map().

There was one for radix trees, IIRC.

>>> +	 * can't be gracefully unwound, while failure on EADD/EXTEND is limited
>>> +	 * to userspace errors (or kernel/hardware bugs).
>>> +	 */
>>> +	ret = xa_insert(&encl->page_array, PFN_DOWN(encl_page->desc),
>>> +			encl_page, GFP_KERNEL);
>>> +	if (ret)
>>> +		goto err_out_unlock;
>>> +
>>> +	ret = __sgx_encl_add_page(encl, encl_page, epc_page, secinfo,
>>> +				  src);
>>> +	if (ret)
>>> +		goto err_out;
>>> +
>>> +	/*
>>> +	 * Complete the "add" before doing the "extend" so that the "add"
>>> +	 * isn't in a half-baked state in the extremely unlikely scenario
>>> +	 * the enclave will be destroyed in response to EEXTEND failure.
>>> +	 */
>>> +	encl_page->encl = encl;
>>> +	encl_page->epc_page = epc_page;
>>> +	encl->secs_child_cnt++;
>>> +
>>> +	if (flags & SGX_PAGE_MEASURE) {
>>> +		ret = __sgx_encl_extend(encl, epc_page);
>>> +		if (ret)
>>> +			goto err_out;
>>> +	}
>>
>> Why would we ever *not* measure an added page?
> 
> You might add Thread Control Structure pages or data areas without
> measuring them. There are reasons for user space not to have everything
> measured.

This is also good comment fodder.

>>> +static long sgx_ioc_enclave_add_pages(struct sgx_encl *encl, void __user *arg)
>>> +{
>>> +	struct sgx_enclave_add_pages addp;
>>> +	struct sgx_secinfo secinfo;
>>> +	unsigned long c;
>>> +	int ret;
>>> +
>>> +	if ((atomic_read(&encl->flags) & SGX_ENCL_INITIALIZED) ||
>>> +	    !(atomic_read(&encl->flags) & SGX_ENCL_CREATED))
>>> +		return -EINVAL;
>>
>> There should be a nice state machine documented somewhere.  Is there?
> 
> So should I document them in encl.h, where they are declared, to start with?

I think it's better placed in the Documentation/.
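
Something along these lines, as enforced by the flag check above (the
enclave init ioctl arrives later in the series):

  SGX_IOC_ENCLAVE_CREATE     sets SGX_ENCL_CREATED
  SGX_IOC_ENCLAVE_ADD_PAGES  allowed only while CREATED && !INITIALIZED
  enclave init               sets SGX_ENCL_INITIALIZED; no further adds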

>>> +	if (copy_from_user(&addp, arg, sizeof(addp)))
>>> +		return -EFAULT;
>>> +
>>> +	if (!IS_ALIGNED(addp.offset, PAGE_SIZE) ||
>>> +	    !IS_ALIGNED(addp.src, PAGE_SIZE))
>>> +		return -EINVAL;
>>> +
>>> +	if (!(access_ok(addp.src, PAGE_SIZE)))
>>> +		return -EFAULT;
>>
>> This worries me.  You're doing an access_ok() check on addp.src because
>> you evidently don't trust it.  But, below, it looks to be accessed
>> directly with an offset, bound by addp.length, which I think can be
>>> PAGE_SIZE.
>>
>> I'd feel a lot better if addp.src's value was being passed around as a
>> __user pointer.
> 
> I'm not sure if that call is even needed. Each page is pinned with
> get_user_pages(). AFAIK, it should be enough. This must be legacy cruft.

get_user_pages() and access_ok() do *very* different things.  Even if
the pages are pinned, you might still be tricked into referencing off
the end of the page, or up into the kernel address space.

>>> +	if (addp.length & (PAGE_SIZE - 1))
>>> +		return -EINVAL;
>>> +
>>> +	if (addp.offset + addp.length - PAGE_SIZE >= encl->size)
>>> +		return -EINVAL;
>>> +
>>> +	if (copy_from_user(&secinfo, (void __user *)addp.secinfo,
>>> +			   sizeof(secinfo)))
>>> +		return -EFAULT;
>>> +
>>> +	if (sgx_validate_secinfo(&secinfo))
>>> +		return -EINVAL;
>>> +
>>> +	for (c = 0 ; c < addp.length; c += PAGE_SIZE) {
>>> +		if (signal_pending(current)) {
>>> +			if (!c)
>>> +				ret = -ERESTARTSYS;
>>> +
>>> +			break;
>>> +		}
>>> +
>>> +		if (c == SGX_MAX_ADD_PAGES_LENGTH)
>>> +			break;
>>> +
>>> +		if (need_resched())
>>> +			cond_resched();
>>> +
>>> +		ret = sgx_encl_add_page(encl, addp.src + c, addp.offset + c,
>>> +					&secinfo, addp.flags);
>>
>> Yeah...  Don't we need to do another access_ok() check here, if we
>> needed one above since we are moving away from addp.src?
> 
> I don't think so because the page is pinned with get_user_pages().

No, get_user_pages() is orthogonal.

Looking at this again, you _might_ be OK since you validated addp.length
against encl->size.  But, it's all very convoluted and doesn't look very
organized or obviously right.

So, this begs the question: What, exactly are the guarantees you are
expecting out of get_user_pages() here?

I also think it's an absolute requirement that, if you're passing around
userspace pointers, you tag them as __user, not pass them around as
unsigned longs.
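
I.e. something like (sketch):

	static int sgx_encl_add_page(struct sgx_encl *encl, void __user *src,
				     unsigned long offset,
				     struct sgx_secinfo *secinfo,
				     unsigned long flags)

and the same for __sgx_encl_add_page(), with the cast back to unsigned
long happening only at the get_user_pages() call.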
Sean Christopherson Oct. 19, 2020, 9:15 p.m. UTC | #5
On Mon, Oct 19, 2020 at 01:48:32PM -0700, Dave Hansen wrote:
> On 10/17/20 10:03 PM, Jarkko Sakkinen wrote:
> >>> +		if (ret) {
> >>> +			if (encls_failed(ret))
> >>> +				ENCLS_WARN(ret, "EEXTEND");
> >>> +			return -EIO;
> >>
> >> How frequent should we expect these to be?  Can users cause them?  You
> >> should *probably* call it ENCLS_WARN_ONCE() if it's implemented that way.

Ya, it's implemented using WARN_ONCE.  It doesn't append _ONCE mostly to avoid
unnecessary verbosity, e.g. there's no existing SGX code that uses vanilla
WARN, nor does it seem likely that there will ever be a case where using WARN
is justified.

> > If a power cycle happens.
> 
> So, we get one warning per power cycle?  Practically, do you mean a
> suspend/resume cycle, or is this more like hibernation-to-disk-resume?
> 
> In any case, if this is normal system operation (which closing my laptop
> lid qualifies as), it should produce zero warnings.

encls_failed() filters out EPCM faults (which, by the by, is why the kernel
cares about that #GP vs. #PF erratum).  So what you describe is the implemented
behavior, i.e. WARNs are triggerable if and only if there is a hardware or
kernel bug.

FWIW, I prefer burying the encls_failed() logic in ENCLS_WARN.  encls_failed()
is only used to gate ENCLS_WARN, and ENCLS_WARN is always wrapped with
encls_failed() except for EREMOVE.  The only downside to applying the logic
to EREMOVE is that it could theoretically suppress kernel bugs on CPUs that are
subject to the #GP instead of #PF errata.
 
> >>> +static int sgx_encl_add_page(struct sgx_encl *encl, unsigned long src,
> >>> +			     unsigned long offset, struct sgx_secinfo *secinfo,
> >>> +			     unsigned long flags)
> >>> +{
> >>> +	struct sgx_encl_page *encl_page;
> >>> +	struct sgx_epc_page *epc_page;
> >>> +	int ret;
> >>> +
> >>> +	encl_page = sgx_encl_page_alloc(encl, offset, secinfo->flags);
> >>> +	if (IS_ERR(encl_page))
> >>> +		return PTR_ERR(encl_page);
> >>> +
> >>> +	epc_page = __sgx_alloc_epc_page();
> >>> +	if (IS_ERR(epc_page)) {
> >>> +		kfree(encl_page);
> >>> +		return PTR_ERR(epc_page);
> >>> +	}
> >>
> >> Looking at these, I'm forgetting why we need to both allocate an
> >> encl_page and an epc_page.  Comments might remind me.  So would better
> >> names.
> > 
> > Should the struct names be renamed?
> > 
> > Like sgx_phys_epc_page and sgx_virt_epc_page?
> 
> "epc" is additional acronym nonsense and redundant with "sgx" and "page"
> anyway.
> 
> I'd probably call then 'sgx_phys_page' and 'sgx_virt_slot' or something.

I don't care too deeply about whether or not sgx_encl_page is renamed, but I
would very strongly prefer keeping sgx_epc_page.  Nearly all of the SGX
literature refers to the physical pages residing in the EPC as "EPC pages".
IMO, the "sgx" is the somewhat superfluous part that is tacked on to add
namespacing in case of collisions with "epc".

> >>> +	mmap_read_lock(current->mm);
> >>> +	mutex_lock(&encl->lock);
> >>> +
> >>> +	/*
> >>> +	 * Insert prior to EADD in case of OOM.
> >>> +	if (copy_from_user(&addp, arg, sizeof(addp)))
> >>> +		return -EFAULT;
> >>> +
> >>> +	if (!IS_ALIGNED(addp.offset, PAGE_SIZE) ||
> >>> +	    !IS_ALIGNED(addp.src, PAGE_SIZE))
> >>> +		return -EINVAL;
> >>> +
> >>> +	if (!(access_ok(addp.src, PAGE_SIZE)))
> >>> +		return -EFAULT;
> >>
> >> This worries me.  You're doing an access_ok() check on addp.src because
> >> you evidently don't trust it.  But, below, it looks to be accessed
> >> directly with an offset, bound by addp.length, which I think can be
> >>> PAGE_SIZE.
> >>
> >> I'd feel a lot better if addp.src's value was being passed around as a
> >> __user pointer.
> > 
> > I'm not sure if that call is even needed. Each page is pinned with
> > get_user_pages(). AFAIK, it should be enough. This must be legacy cruft.
> 
> get_user_pages() and access_ok() do *very* different things.  Even if
> the pages are pinned, you might still be tricked into referencing off
> the end of the page, or up into the kernel address space.
> 
> >>> +	if (addp.length & (PAGE_SIZE - 1))
> >>> +		return -EINVAL;
> >>> +
> >>> +	if (addp.offset + addp.length - PAGE_SIZE >= encl->size)
> >>> +		return -EINVAL;
> >>> +
> >>> +	if (copy_from_user(&secinfo, (void __user *)addp.secinfo,
> >>> +			   sizeof(secinfo)))
> >>> +		return -EFAULT;
> >>> +
> >>> +	if (sgx_validate_secinfo(&secinfo))
> >>> +		return -EINVAL;
> >>> +
> >>> +	for (c = 0 ; c < addp.length; c += PAGE_SIZE) {
> >>> +		if (signal_pending(current)) {
> >>> +			if (!c)
> >>> +				ret = -ERESTARTSYS;
> >>> +
> >>> +			break;
> >>> +		}
> >>> +
> >>> +		if (c == SGX_MAX_ADD_PAGES_LENGTH)
> >>> +			break;
> >>> +
> >>> +		if (need_resched())
> >>> +			cond_resched();
> >>> +
> >>> +		ret = sgx_encl_add_page(encl, addp.src + c, addp.offset + c,
> >>> +					&secinfo, addp.flags);
> >>
> >> Yeah...  Don't we need to do another access_ok() check here, if we
> >> needed one above since we are moving away from addrp.src?
> > 
> > I don't think so because the page is pinned with get_user_pages().
> 
> No, get_user_pages() is orthogonal.
> 
> Looking at this again, you _might_ be OK since you validated addp.length
> against encl->size.  But, it's all very convoluted and doesn't look very
> organized or obviously right.

The easiest fix would be to have the existing access_ok() check the entire
range, no?  Or am I missing something obvious?
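
E.g. (sketch):

	if (!access_ok((void __user *)addp.src, addp.length))
		return -EFAULT;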
Dave Hansen Oct. 19, 2020, 9:44 p.m. UTC | #6
On 10/19/20 2:15 PM, Sean Christopherson wrote:
>>>> Yeah...  Don't we need to do another access_ok() check here, if we
>>>> needed one above since we are moving away from addp.src?
>>> I don't think so because the page is pinned with get_user_pages().
>> No, get_user_pages() is orthogonal.
>>
>> Looking at this again, you _might_ be OK since you validated addp.length
>> against encl->size.  But, it's all very convoluted and doesn't look very
>> organized or obviously right.
> The easiest fix would be to have the existing access_ok() check the entire
> range, no?  Or am I missing something obvious?

In general, I want the actual userspace access to be as close as
possible and 1:1 with the access_ok() checks.  That way, it's blatantly
obvious that the pointers have been checked.

*But* get_user_pages() has access_ok() checks inside of its
implementation, which makes sense.  *But*, that begs the question of
what the top-level one was doing in the first place.  Maybe it was just
superfluous.

Either way, it still doesn't explain what this is doing:

> +       ret = get_user_pages(src, 1, 0, &src_page, NULL);
> +       if (ret < 1)
> +               return -EFAULT;
> +
> +       pginfo.secs = (unsigned long)sgx_get_epc_addr(encl->secs.epc_page);
> +       pginfo.addr = SGX_ENCL_PAGE_ADDR(encl_page);
> +       pginfo.metadata = (unsigned long)secinfo;
> +       pginfo.contents = (unsigned long)kmap_atomic(src_page);
> +
> +       ret = __eadd(&pginfo, sgx_get_epc_addr(epc_page));
> +
> +       kunmap_atomic((void *)pginfo.contents);

I think the point is to create a stable kernel alias address for
'src_page' so that any mucking with the userspace mapping doesn't screw
up the __eadd() and any failures aren't due to reclaim or MADV_DONTNEED.

If this isn't even touching the userspace mapping, it didn't need
access_ok() in the first place.
Jarkko Sakkinen Oct. 23, 2020, 10:11 a.m. UTC | #7
On Mon, Oct 19, 2020 at 02:44:19PM -0700, Dave Hansen wrote:
> On 10/19/20 2:15 PM, Sean Christopherson wrote:
> >>>> Yeah...  Don't we need to do another access_ok() check here, if we
> >>>> needed one above since we are moving away from addp.src?
> >>> I don't think so because the page is pinned with get_user_pages().
> >> No, get_user_pages() is orthogonal.
> >>
> >> Looking at this again, you _might_ be OK since you validated addp.length
> >> against encl->size.  But, it's all very convoluted and doesn't look very
> >> organized or obviously right.
> > The easiest fix would be to have the existing access_ok() check the entire
> > range, no?  Or am I missing something obvious?
> 
> In general, I want the actual userspace access to be as close as
> possible and 1:1 with the access_ok() checks.  That way, it's blatantly
> obvious that the pointers have been checked.
> 
> *But* get_user_pages() has access_ok() checks inside of its
> implementation, which makes sense.  *But*, that begs the question of
> what the top-level one was doing in the first place.  Maybe it was just
> superfluous.
> 
> Either way, it still doesn't explain what this is doing:

I guess it is just history. It used to be a single-page ioctl.

> > +       ret = get_user_pages(src, 1, 0, &src_page, NULL);
> > +       if (ret < 1)
> > +               return -EFAULT;
> > +
> > +       pginfo.secs = (unsigned long)sgx_get_epc_addr(encl->secs.epc_page);
> > +       pginfo.addr = SGX_ENCL_PAGE_ADDR(encl_page);
> > +       pginfo.metadata = (unsigned long)secinfo;
> > +       pginfo.contents = (unsigned long)kmap_atomic(src_page);
> > +
> > +       ret = __eadd(&pginfo, sgx_get_epc_addr(epc_page));
> > +
> > +       kunmap_atomic((void *)pginfo.contents);
> 
> I think the point is to create a stable kernel alias address for
> 'src_page' so that any mucking with the userspace mapping doesn't screw
> up the __eadd() and any failures aren't due to reclaim or MADV_DONTNEED.
> 
> If this isn't even touching the userspace mapping, it didn't need
> access_ok() in the first place.

The whole access_ok() check is just evolutionary cruft. I will remove
it.

/Jarkko
diff mbox series

Patch

diff --git a/arch/x86/include/uapi/asm/sgx.h b/arch/x86/include/uapi/asm/sgx.h
index c75b375f3770..10cd48d06318 100644
--- a/arch/x86/include/uapi/asm/sgx.h
+++ b/arch/x86/include/uapi/asm/sgx.h
@@ -8,10 +8,21 @@ 
 #include <linux/types.h>
 #include <linux/ioctl.h>
 
+/**
+ * enum sgx_page_flags - page control flags
+ * %SGX_PAGE_MEASURE:	Measure the page contents with a sequence of
+ *			ENCLS[EEXTEND] operations.
+ */
+enum sgx_page_flags {
+	SGX_PAGE_MEASURE	= 0x01,
+};
+
 #define SGX_MAGIC 0xA4
 
 #define SGX_IOC_ENCLAVE_CREATE \
 	_IOW(SGX_MAGIC, 0x00, struct sgx_enclave_create)
+#define SGX_IOC_ENCLAVE_ADD_PAGES \
+	_IOWR(SGX_MAGIC, 0x01, struct sgx_enclave_add_pages)
 
 /**
  * struct sgx_enclave_create - parameter structure for the
@@ -22,4 +33,23 @@  struct sgx_enclave_create  {
 	__u64	src;
 };
 
+/**
+ * struct sgx_enclave_add_pages - parameter structure for the
+ *                                %SGX_IOC_ENCLAVE_ADD_PAGES ioctl
+ * @src:	start address for the page data
+ * @offset:	starting offset in the enclave address range where the pages are added
+ * @length:	length of the data (multiple of the page size)
+ * @secinfo:	address for the SECINFO data
+ * @flags:	page control flags
+ * @count:	number of bytes added (multiple of the page size)
+ */
+struct sgx_enclave_add_pages {
+	__u64 src;
+	__u64 offset;
+	__u64 length;
+	__u64 secinfo;
+	__u64 flags;
+	__u64 count;
+};
+
 #endif /* _UAPI_ASM_X86_SGX_H */
diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
index 9bb4694e57c1..e13e04737683 100644
--- a/arch/x86/kernel/cpu/sgx/ioctl.c
+++ b/arch/x86/kernel/cpu/sgx/ioctl.c
@@ -194,6 +194,302 @@  static long sgx_ioc_enclave_create(struct sgx_encl *encl, void __user *arg)
 	return ret;
 }
 
+static struct sgx_encl_page *sgx_encl_page_alloc(struct sgx_encl *encl,
+						 unsigned long offset,
+						 u64 secinfo_flags)
+{
+	struct sgx_encl_page *encl_page;
+	unsigned long prot;
+
+	encl_page = kzalloc(sizeof(*encl_page), GFP_KERNEL);
+	if (!encl_page)
+		return ERR_PTR(-ENOMEM);
+
+	encl_page->desc = encl->base + offset;
+	encl_page->encl = encl;
+
+	prot = _calc_vm_trans(secinfo_flags, SGX_SECINFO_R, PROT_READ)  |
+	       _calc_vm_trans(secinfo_flags, SGX_SECINFO_W, PROT_WRITE) |
+	       _calc_vm_trans(secinfo_flags, SGX_SECINFO_X, PROT_EXEC);
+
+	/*
+	 * TCS pages must always have RW set for CPU access, while the SECINFO
+	 * permissions are *always* zero - the CPU ignores the user provided
+	 * values and silently overwrites them with zero permissions.
+	 */
+	if ((secinfo_flags & SGX_SECINFO_PAGE_TYPE_MASK) == SGX_SECINFO_TCS)
+		prot |= PROT_READ | PROT_WRITE;
+
+	/* Calculate maximum of the VM flags for the page. */
+	encl_page->vm_max_prot_bits = calc_vm_prot_bits(prot, 0);
+
+	return encl_page;
+}
+
+static int sgx_validate_secinfo(struct sgx_secinfo *secinfo)
+{
+	u64 perm = secinfo->flags & SGX_SECINFO_PERMISSION_MASK;
+	u64 pt = secinfo->flags & SGX_SECINFO_PAGE_TYPE_MASK;
+
+	if (pt != SGX_SECINFO_REG && pt != SGX_SECINFO_TCS)
+		return -EINVAL;
+
+	if ((perm & SGX_SECINFO_W) && !(perm & SGX_SECINFO_R))
+		return -EINVAL;
+
+	/*
+	 * The CPU will silently overwrite the permissions with zero, which
+	 * means that we need to validate them ourselves.
+	 */
+	if (pt == SGX_SECINFO_TCS && perm)
+		return -EINVAL;
+
+	if (secinfo->flags & SGX_SECINFO_RESERVED_MASK)
+		return -EINVAL;
+
+	if (memchr_inv(secinfo->reserved, 0, sizeof(secinfo->reserved)))
+		return -EINVAL;
+
+	return 0;
+}
+
+static int __sgx_encl_add_page(struct sgx_encl *encl,
+			       struct sgx_encl_page *encl_page,
+			       struct sgx_epc_page *epc_page,
+			       struct sgx_secinfo *secinfo, unsigned long src)
+{
+	struct sgx_pageinfo pginfo;
+	struct vm_area_struct *vma;
+	struct page *src_page;
+	int ret;
+
+	/* Deny noexec. */
+	vma = find_vma(current->mm, src);
+	if (!vma)
+		return -EFAULT;
+
+	if (!(vma->vm_flags & VM_MAYEXEC))
+		return -EACCES;
+
+	ret = get_user_pages(src, 1, 0, &src_page, NULL);
+	if (ret < 1)
+		return -EFAULT;
+
+	pginfo.secs = (unsigned long)sgx_get_epc_addr(encl->secs.epc_page);
+	pginfo.addr = SGX_ENCL_PAGE_ADDR(encl_page);
+	pginfo.metadata = (unsigned long)secinfo;
+	pginfo.contents = (unsigned long)kmap_atomic(src_page);
+
+	ret = __eadd(&pginfo, sgx_get_epc_addr(epc_page));
+
+	kunmap_atomic((void *)pginfo.contents);
+	put_page(src_page);
+
+	return ret ? -EIO : 0;
+}
+
+/*
+ * If the caller requires measurement of the page as proof of its contents,
+ * use EEXTEND to add a measurement for 256 bytes of the page. Repeat this
+ * operation until the entire page is measured.
+ */
+static int __sgx_encl_extend(struct sgx_encl *encl,
+			     struct sgx_epc_page *epc_page)
+{
+	int ret;
+	int i;
+
+	for (i = 0; i < 16; i++) {
+		ret = __eextend(sgx_get_epc_addr(encl->secs.epc_page),
+				sgx_get_epc_addr(epc_page) + (i * 0x100));
+		if (ret) {
+			if (encls_failed(ret))
+				ENCLS_WARN(ret, "EEXTEND");
+			return -EIO;
+		}
+	}
+
+	return 0;
+}
+
+static int sgx_encl_add_page(struct sgx_encl *encl, unsigned long src,
+			     unsigned long offset, struct sgx_secinfo *secinfo,
+			     unsigned long flags)
+{
+	struct sgx_encl_page *encl_page;
+	struct sgx_epc_page *epc_page;
+	int ret;
+
+	encl_page = sgx_encl_page_alloc(encl, offset, secinfo->flags);
+	if (IS_ERR(encl_page))
+		return PTR_ERR(encl_page);
+
+	epc_page = __sgx_alloc_epc_page();
+	if (IS_ERR(epc_page)) {
+		kfree(encl_page);
+		return PTR_ERR(epc_page);
+	}
+
+	mmap_read_lock(current->mm);
+	mutex_lock(&encl->lock);
+
+	/*
+	 * Insert prior to EADD in case of OOM.  EADD modifies MRENCLAVE, i.e.
+	 * can't be gracefully unwound, while failure on EADD/EXTEND is limited
+	 * to userspace errors (or kernel/hardware bugs).
+	 */
+	ret = xa_insert(&encl->page_array, PFN_DOWN(encl_page->desc),
+			encl_page, GFP_KERNEL);
+	if (ret)
+		goto err_out_unlock;
+
+	ret = __sgx_encl_add_page(encl, encl_page, epc_page, secinfo,
+				  src);
+	if (ret)
+		goto err_out;
+
+	/*
+	 * Complete the "add" before doing the "extend" so that the "add"
+	 * isn't in a half-baked state in the extremely unlikely scenario
+	 * the enclave will be destroyed in response to EEXTEND failure.
+	 */
+	encl_page->encl = encl;
+	encl_page->epc_page = epc_page;
+	encl->secs_child_cnt++;
+
+	if (flags & SGX_PAGE_MEASURE) {
+		ret = __sgx_encl_extend(encl, epc_page);
+		if (ret)
+			goto err_out;
+	}
+
+	mutex_unlock(&encl->lock);
+	mmap_read_unlock(current->mm);
+	return ret;
+
+err_out:
+	xa_erase(&encl->page_array, PFN_DOWN(encl_page->desc));
+
+err_out_unlock:
+	mutex_unlock(&encl->lock);
+	mmap_read_unlock(current->mm);
+
+	sgx_free_epc_page(epc_page);
+	kfree(encl_page);
+
+	return ret;
+}
+
+/**
+ * sgx_ioc_enclave_add_pages() - The handler for %SGX_IOC_ENCLAVE_ADD_PAGES
+ * @encl:       an enclave pointer
+ * @arg:	a user pointer to a struct sgx_enclave_add_pages instance
+ *
+ * Add one or more pages to an uninitialized enclave, and optionally extend the
+ * measurement with the contents of the page. The SECINFO and measurement mask
+ * are applied to all pages.
+ *
+ * A SECINFO for a TCS is required to always contain zero permissions because
+ * the CPU silently zeros them. Allowing anything else would cause a mismatch in
+ * the measurement.
+ *
+ * mmap()'s protection bits are capped by the page permissions. For each page
+ * address, the maximum protection bits are computed with the following
+ * heuristics:
+ *
+ * 1. A regular page: PROT_R, PROT_W and PROT_X match the SECINFO permissions.
+ * 2. A TCS page: PROT_R | PROT_W.
+ *
+ * mmap() is not allowed to surpass the minimum of the maximum protection bits
+ * within the given address range.
+ *
+ * The function deinitializes kernel data structures for the enclave and returns
+ * -EIO in any of the following conditions:
+ *
+ * - Enclave Page Cache (EPC), the physical memory holding enclaves, has
+ *   been invalidated. This will cause EADD and EEXTEND to fail.
+ * - If the source address is corrupted somehow when executing EADD.
+ *
+ * Return:
+ *   length of the data processed on success,
+ *   -EACCES if an executable source page is located in a noexec partition,
+ *   -ENOMEM if the system is out of EPC pages,
+ *   -EINTR if the call was interrupted before any data was processed,
+ *   -EIO if the enclave was lost
+ *   -errno otherwise
+ */
+static long sgx_ioc_enclave_add_pages(struct sgx_encl *encl, void __user *arg)
+{
+	struct sgx_enclave_add_pages addp;
+	struct sgx_secinfo secinfo;
+	unsigned long c;
+	int ret;
+
+	if ((atomic_read(&encl->flags) & SGX_ENCL_INITIALIZED) ||
+	    !(atomic_read(&encl->flags) & SGX_ENCL_CREATED))
+		return -EINVAL;
+
+	if (copy_from_user(&addp, arg, sizeof(addp)))
+		return -EFAULT;
+
+	if (!IS_ALIGNED(addp.offset, PAGE_SIZE) ||
+	    !IS_ALIGNED(addp.src, PAGE_SIZE))
+		return -EINVAL;
+
+	if (!(access_ok(addp.src, PAGE_SIZE)))
+		return -EFAULT;
+
+	if (addp.length & (PAGE_SIZE - 1))
+		return -EINVAL;
+
+	if (addp.offset + addp.length - PAGE_SIZE >= encl->size)
+		return -EINVAL;
+
+	if (copy_from_user(&secinfo, (void __user *)addp.secinfo,
+			   sizeof(secinfo)))
+		return -EFAULT;
+
+	if (sgx_validate_secinfo(&secinfo))
+		return -EINVAL;
+
+	for (c = 0 ; c < addp.length; c += PAGE_SIZE) {
+		if (signal_pending(current)) {
+			if (!c)
+				ret = -ERESTARTSYS;
+
+			break;
+		}
+
+		if (c == SGX_MAX_ADD_PAGES_LENGTH)
+			break;
+
+		if (need_resched())
+			cond_resched();
+
+		ret = sgx_encl_add_page(encl, addp.src + c, addp.offset + c,
+					&secinfo, addp.flags);
+		if (ret)
+			break;
+	}
+
+	addp.count = c;
+
+	if (copy_to_user(arg, &addp, sizeof(addp)))
+		return -EFAULT;
+
+	/*
+	 * If the enclave was lost, deinitialize the internal data structures
+	 * for the enclave.
+	 */
+	if (ret == -EIO) {
+		mutex_lock(&encl->lock);
+		sgx_encl_destroy(encl);
+		mutex_unlock(&encl->lock);
+	}
+
+	return ret;
+}
+
 long sgx_ioctl(struct file *filep, unsigned int cmd, unsigned long arg)
 {
 	struct sgx_encl *encl = filep->private_data;
@@ -212,6 +508,9 @@  long sgx_ioctl(struct file *filep, unsigned int cmd, unsigned long arg)
 	case SGX_IOC_ENCLAVE_CREATE:
 		ret = sgx_ioc_enclave_create(encl, (void __user *)arg);
 		break;
+	case SGX_IOC_ENCLAVE_ADD_PAGES:
+		ret = sgx_ioc_enclave_add_pages(encl, (void __user *)arg);
+		break;
 	default:
 		ret = -ENOIOCTLCMD;
 		break;
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index fce756c3434b..8d126070db1e 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -34,6 +34,7 @@  struct sgx_epc_section {
 
 #define SGX_EPC_SECTION_MASK		GENMASK(7, 0)
 #define SGX_MAX_EPC_SECTIONS		(SGX_EPC_SECTION_MASK + 1)
+#define SGX_MAX_ADD_PAGES_LENGTH	0x100000
 
 extern struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];