From patchwork Thu Sep 24 13:29:01 2020
X-Patchwork-Submitter: Mike Rapoport
X-Patchwork-Id: 11797303
From: Mike Rapoport
To: Andrew Morton
Cc: Alexander Viro,
 Andy Lutomirski, Arnd Bergmann, Borislav Petkov, Catalin Marinas,
 Christopher Lameter, Dan Williams, Dave Hansen, David Hildenbrand,
 Elena Reshetova, "H. Peter Anvin", Idan Yaniv, Ingo Molnar,
 James Bottomley, "Kirill A. Shutemov", Matthew Wilcox, Mark Rutland,
 Mike Rapoport, Michael Kerrisk, Palmer Dabbelt, Paul Walmsley,
 Peter Zijlstra, Thomas Gleixner, Shuah Khan, Tycho Andersen, Will Deacon,
 linux-api@vger.kernel.org, linux-arch@vger.kernel.org,
 linux-arm-kernel@lists.infradead.org, linux-fsdevel@vger.kernel.org,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 linux-kselftest@vger.kernel.org, linux-nvdimm@lists.01.org,
 linux-riscv@lists.infradead.org, x86@kernel.org
Subject: [PATCH v6 3/6] mm: introduce memfd_secret system call to create "secret" memory areas
Date: Thu, 24 Sep 2020 16:29:01 +0300
Message-Id: <20200924132904.1391-4-rppt@kernel.org>
X-Mailer: git-send-email 2.28.0
In-Reply-To: <20200924132904.1391-1-rppt@kernel.org>
References: <20200924132904.1391-1-rppt@kernel.org>
MIME-Version: 1.0
Precedence: bulk
X-Mailing-List: linux-kselftest@vger.kernel.org

From: Mike Rapoport

Introduce the "memfd_secret" system call, which creates memory areas that
are visible only in the context of the owning process: they are mapped
neither into other processes nor into the kernel page tables.

The user creates a file descriptor with the memfd_secret() system call. The
flags passed to the call define the desired protection mode for the memory
associated with that file descriptor. Currently there are two protection
modes:

* exclusive - the memory area is unmapped from the kernel direct map and it
              is present only in the page tables of the owning mm.
* uncached  - the memory area is present only in the page tables of the
              owning mm and it is mapped there as uncached.

The "exclusive" mode is enabled implicitly and it is the default mode for
memfd_secret().

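As an illustration of how the flags select a mode, the following userspace
sketch mirrors the flag check that SYSCALL_DEFINE1(memfd_secret) performs in
mm/secretmem.c in this patch. It is not kernel code. The helper names
secretmem_check_flags() and secretmem_is_uncached() are hypothetical, and the
sketch assumes an architecture that selects HAVE_SECRETMEM_UNCACHED, so that
the mode mask contains the uncached bit:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>	/* O_CLOEXEC */

/* Value taken from include/uapi/linux/secretmem.h as added by this patch. */
#define SECRETMEM_UNCACHED 0x1UL

/*
 * Assumption: HAVE_SECRETMEM_UNCACHED is selected. Without it the patch
 * defines the mode mask as 0x0 and the uncached bit is rejected.
 */
#define SECRETMEM_MODE_MASK SECRETMEM_UNCACHED

/* Same intent as the BUILD_BUG_ON() in the patch: mode bits and O_CLOEXEC
 * must not overlap, because both travel in the same flags word. */
static_assert((SECRETMEM_MODE_MASK & O_CLOEXEC) == 0,
	      "secretmem mode bits overlap O_CLOEXEC");

/* Hypothetical helper: returns 0 if the flags would be accepted, else -EINVAL. */
static int secretmem_check_flags(unsigned long flags)
{
	if (flags & ~(SECRETMEM_MODE_MASK | (unsigned long)O_CLOEXEC))
		return -EINVAL;
	return 0;
}

/* Hypothetical helper: "exclusive" is always on; this reports whether the
 * optional "uncached" mode is requested on top of it. */
static int secretmem_is_uncached(unsigned long flags)
{
	return (flags & SECRETMEM_MODE_MASK & SECRETMEM_UNCACHED) != 0;
}
```

With these definitions, a flags value of 0 yields the default exclusive mode,
SECRETMEM_UNCACHED optionally combined with O_CLOEXEC yields an uncached area,
and any other bit is refused with -EINVAL.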
The "uncached" mode requires architecture support, and an architecture must
opt in to this mode by selecting the HAVE_SECRETMEM_UNCACHED configuration
option.

For instance, the following example will create an uncached mapping (error
handling is omitted):

	fd = memfd_secret(SECRETMEM_UNCACHED);
	ftruncate(fd, MAP_SIZE);
	ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

Signed-off-by: Mike Rapoport
---
 arch/Kconfig                   |   7 +
 arch/x86/Kconfig               |   1 +
 include/uapi/linux/magic.h     |   1 +
 include/uapi/linux/secretmem.h |   8 +
 kernel/sys_ni.c                |   2 +
 mm/Kconfig                     |   4 +
 mm/Makefile                    |   1 +
 mm/secretmem.c                 | 264 +++++++++++++++++++++++++++++++++
 8 files changed, 288 insertions(+)
 create mode 100644 include/uapi/linux/secretmem.h
 create mode 100644 mm/secretmem.c

diff --git a/arch/Kconfig b/arch/Kconfig
index af14a567b493..d3f11b2d03e8 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -975,6 +975,13 @@ config HAVE_SPARSE_SYSCALL_NR
 config ARCH_HAS_VDSO_DATA
 	bool
 
+config HAVE_SECRETMEM_UNCACHED
+	bool
+	help
+	  An architecture can select this if its semantics of non-cached
+	  mappings can be used to prevent speculative loads and it is
+	  useful for secret protection.
+
 source "kernel/gcov/Kconfig"
 
 source "scripts/gcc-plugins/Kconfig"
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 7101ac64bb20..38ead8bd9909 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -220,6 +220,7 @@ config X86
 	select HAVE_UNSTABLE_SCHED_CLOCK
 	select HAVE_USER_RETURN_NOTIFIER
 	select HAVE_GENERIC_VDSO
+	select HAVE_SECRETMEM_UNCACHED
 	select HOTPLUG_SMT if SMP
 	select IRQ_FORCED_THREADING
 	select NEED_SG_DMA_LENGTH
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index f3956fc11de6..35687dcb1a42 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -97,5 +97,6 @@
 #define DEVMEM_MAGIC 0x454d444d /* "DMEM" */
 #define Z3FOLD_MAGIC 0x33
 #define PPC_CMM_MAGIC 0xc7571590
+#define SECRETMEM_MAGIC 0x5345434d /* "SECM" */
 
 #endif /* __LINUX_MAGIC_H__ */
diff --git a/include/uapi/linux/secretmem.h b/include/uapi/linux/secretmem.h
new file mode 100644
index 000000000000..2b9675f5dea9
--- /dev/null
+++ b/include/uapi/linux/secretmem.h
@@ -0,0 +1,8 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _UAPI_LINUX_SECRERTMEM_H
+#define _UAPI_LINUX_SECRERTMEM_H
+
+/* secretmem operation modes */
+#define SECRETMEM_UNCACHED 0x1
+
+#endif /* _UAPI_LINUX_SECRERTMEM_H */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 4d59775ea79c..8ae8d0c2d381 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -349,6 +349,8 @@ COND_SYSCALL(pkey_mprotect);
 COND_SYSCALL(pkey_alloc);
 COND_SYSCALL(pkey_free);
 
+/* memfd_secret */
+COND_SYSCALL(memfd_secret);
 
 /*
  * Architecture specific weak syscall entries.
diff --git a/mm/Kconfig b/mm/Kconfig
index 6c974888f86f..d2fc73ccc183 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -868,4 +868,8 @@ config ARCH_HAS_HUGEPD
 config MAPPING_DIRTY_HELPERS
 	bool
 
+config SECRETMEM
+	def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED
+	select GENERIC_ALLOCATOR
+
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index d5649f1c12c0..cae063dc8298 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -121,3 +121,4 @@ obj-$(CONFIG_MEMFD_CREATE) += memfd.o
 obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o
 obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
 obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
+obj-$(CONFIG_SECRETMEM) += secretmem.o
diff --git a/mm/secretmem.c b/mm/secretmem.c
new file mode 100644
index 000000000000..3293f761076e
--- /dev/null
+++ b/mm/secretmem.c
@@ -0,0 +1,264 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright IBM Corporation, 2020
+ *
+ * Author: Mike Rapoport
+ */
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#include
+#include
+
+#include
+
+#include "internal.h"
+
+#undef pr_fmt
+#define pr_fmt(fmt) "secretmem: " fmt
+
+/*
+ * Secret memory areas are always exclusive to owning mm and they are
+ * removed from the direct map.
+ */
+#ifdef CONFIG_HAVE_SECRETMEM_UNCACHED
+#define SECRETMEM_MODE_MASK (SECRETMEM_UNCACHED)
+#else
+#define SECRETMEM_MODE_MASK (0x0)
+#endif
+
+#define SECRETMEM_FLAGS_MASK SECRETMEM_MODE_MASK
+
+struct secretmem_ctx {
+	unsigned int mode;
+};
+
+static struct page *secretmem_alloc_page(gfp_t gfp)
+{
+	/*
+	 * FIXME: use a cache of large pages to reduce the direct map
+	 * fragmentation
+	 */
+	return alloc_page(gfp);
+}
+
+static vm_fault_t secretmem_fault(struct vm_fault *vmf)
+{
+	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
+	struct inode *inode = file_inode(vmf->vma->vm_file);
+	pgoff_t offset = vmf->pgoff;
+	unsigned long addr;
+	struct page *page;
+	int ret = 0;
+
+	if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
+		return vmf_error(-EINVAL);
+
+	page = find_get_entry(mapping, offset);
+	if (!page) {
+		page = secretmem_alloc_page(vmf->gfp_mask);
+		if (!page)
+			return vmf_error(-ENOMEM);
+
+		ret = add_to_page_cache(page, mapping, offset, vmf->gfp_mask);
+		if (unlikely(ret))
+			goto err_put_page;
+
+		ret = set_direct_map_invalid_noflush(page);
+		if (ret)
+			goto err_del_page_cache;
+
+		addr = (unsigned long)page_address(page);
+		flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+
+		__SetPageUptodate(page);
+
+		ret = VM_FAULT_LOCKED;
+	}
+
+	vmf->page = page;
+	return ret;
+
+err_del_page_cache:
+	delete_from_page_cache(page);
+err_put_page:
+	put_page(page);
+	return vmf_error(ret);
+}
+
+static const struct vm_operations_struct secretmem_vm_ops = {
+	.fault = secretmem_fault,
+};
+
+static int secretmem_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct secretmem_ctx *ctx = file->private_data;
+	unsigned long len = vma->vm_end - vma->vm_start;
+
+	if ((vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) == 0)
+		return -EINVAL;
+
+	if (mlock_future_check(vma->vm_mm, vma->vm_flags | VM_LOCKED, len))
+		return -EAGAIN;
+
+	if (ctx->mode & SECRETMEM_UNCACHED)
+		vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+
+	vma->vm_ops = &secretmem_vm_ops;
+	vma->vm_flags |= VM_LOCKED;
+
+	return 0;
+}
+
+const struct file_operations secretmem_fops = {
+	.mmap = secretmem_mmap,
+};
+
+static bool secretmem_isolate_page(struct page *page, isolate_mode_t mode)
+{
+	return false;
+}
+
+static int secretmem_migratepage(struct address_space *mapping,
+				 struct page *newpage, struct page *page,
+				 enum migrate_mode mode)
+{
+	return -EBUSY;
+}
+
+static void secretmem_freepage(struct page *page)
+{
+	set_direct_map_default_noflush(page);
+}
+
+static const struct address_space_operations secretmem_aops = {
+	.freepage = secretmem_freepage,
+	.migratepage = secretmem_migratepage,
+	.isolate_page = secretmem_isolate_page,
+};
+
+static struct vfsmount *secretmem_mnt;
+
+static struct file *secretmem_file_create(unsigned long flags)
+{
+	struct file *file = ERR_PTR(-ENOMEM);
+	struct secretmem_ctx *ctx;
+	struct inode *inode;
+
+	inode = alloc_anon_inode(secretmem_mnt->mnt_sb);
+	if (IS_ERR(inode))
+		return ERR_CAST(inode);
+
+	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+	if (!ctx)
+		goto err_free_inode;
+
+	file = alloc_file_pseudo(inode, secretmem_mnt, "secretmem",
+				 O_RDWR, &secretmem_fops);
+	if (IS_ERR(file))
+		goto err_free_ctx;
+
+	mapping_set_unevictable(inode->i_mapping);
+
+	inode->i_mapping->private_data = ctx;
+	inode->i_mapping->a_ops = &secretmem_aops;
+
+	/* pretend we are a normal file with zero size */
+	inode->i_mode |= S_IFREG;
+	inode->i_size = 0;
+
+	file->private_data = ctx;
+
+	ctx->mode = flags & SECRETMEM_MODE_MASK;
+
+	return file;
+
+err_free_ctx:
+	kfree(ctx);
+err_free_inode:
+	iput(inode);
+	return file;
+}
+
+SYSCALL_DEFINE1(memfd_secret, unsigned long, flags)
+{
+	struct file *file;
+	int fd, err;
+
+	/* make sure local flags do not conflict with global fcntl.h */
+	BUILD_BUG_ON(SECRETMEM_FLAGS_MASK & O_CLOEXEC);
+
+	if (flags & ~(SECRETMEM_FLAGS_MASK | O_CLOEXEC))
+		return -EINVAL;
+
+	fd = get_unused_fd_flags(flags & O_CLOEXEC);
+	if (fd < 0)
+		return fd;
+
+	file = secretmem_file_create(flags);
+	if (IS_ERR(file)) {
+		err = PTR_ERR(file);
+		goto err_put_fd;
+	}
+
+	file->f_flags |= O_LARGEFILE;
+
+	fd_install(fd, file);
+	return fd;
+
+err_put_fd:
+	put_unused_fd(fd);
+	return err;
+}
+
+static void secretmem_evict_inode(struct inode *inode)
+{
+	struct secretmem_ctx *ctx = inode->i_private;
+
+	truncate_inode_pages_final(&inode->i_data);
+	clear_inode(inode);
+	kfree(ctx);
+}
+
+static const struct super_operations secretmem_super_ops = {
+	.evict_inode = secretmem_evict_inode,
+};
+
+static int secretmem_init_fs_context(struct fs_context *fc)
+{
+	struct pseudo_fs_context *ctx = init_pseudo(fc, SECRETMEM_MAGIC);
+
+	if (!ctx)
+		return -ENOMEM;
+	ctx->ops = &secretmem_super_ops;
+
+	return 0;
+}
+
+static struct file_system_type secretmem_fs = {
+	.name = "secretmem",
+	.init_fs_context = secretmem_init_fs_context,
+	.kill_sb = kill_anon_super,
+};
+
+static int secretmem_init(void)
+{
+	int ret = 0;
+
+	secretmem_mnt = kern_mount(&secretmem_fs);
+	if (IS_ERR(secretmem_mnt))
+		ret = PTR_ERR(secretmem_mnt);
+
+	return ret;
+}
+fs_initcall(secretmem_init);
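
For readers who want to exercise the interface from userspace, here is a
minimal sketch of the commit message example with error handling added. It
rests on several assumptions. This patch does not assign a syscall number, so
the sketch uses __NR_memfd_secret only if the installed headers define it and
otherwise fails with ENOSYS. The helper names memfd_secret_call() and
secret_area_map() are hypothetical. SECRETMEM_UNCACHED carries the value from
the uapi header added above.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Value from include/uapi/linux/secretmem.h as added by this patch. */
#define SECRETMEM_UNCACHED 0x1UL

/*
 * Hypothetical wrapper. Assumption: the syscall number comes from the
 * installed kernel headers; when it is absent we report ENOSYS ourselves.
 */
static int memfd_secret_call(unsigned long flags)
{
#ifdef __NR_memfd_secret
	return (int)syscall(__NR_memfd_secret, flags);
#else
	(void)flags;
	errno = ENOSYS;
	return -1;
#endif
}

/*
 * Hypothetical helper: create a secret area of the given size and map it.
 * Returns the mapping, or NULL with errno set on failure.
 */
static void *secret_area_map(unsigned long flags, size_t size)
{
	void *ptr;
	int saved;
	int fd = memfd_secret_call(flags);

	if (fd < 0)
		return NULL;

	/* The new file starts with zero size, so it must be sized first. */
	if (ftruncate(fd, (off_t)size) < 0) {
		saved = errno;
		close(fd);
		errno = saved;
		return NULL;
	}

	/* secretmem_mmap() rejects private mappings, hence MAP_SHARED. */
	ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	saved = errno;
	close(fd);	/* an established mapping stays valid after close() */
	errno = saved;

	return ptr == MAP_FAILED ? NULL : ptr;
}
```

A caller would pass 0 for the default exclusive mode or SECRETMEM_UNCACHED for
an uncached area, and release the area with munmap() when done. Because
secretmem_mmap() marks the VMA as VM_LOCKED, a request larger than the
caller's RLIMIT_MEMLOCK should be expected to fail at mmap() time.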