From patchwork Wed Sep 11 14:34:01 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Fares Mehanna <faresx@amazon.de>
X-Patchwork-Id: 13800675
From: Fares Mehanna <faresx@amazon.de>
To:
Cc: Fares Mehanna, Roman Kagan, Marc Zyngier, Oliver Upton, James Morse,
    Suzuki K Poulose, Zenghui Yu, Catalin Marinas, Will Deacon,
    Andrew Morton, Kemeng Shi, Pierre-Clément Tosi, Ard Biesheuvel,
    Mark Rutland, Javier Martinez Canillas, Arnd Bergmann, Fuad Tabba,
    Mark Brown, Joey Gouly, Kristina Martsenko, Randy Dunlap,
    Bjorn Helgaas, Jean-Philippe Brucker, Mike Rapoport (IBM),
    David Hildenbrand,
    moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
    open list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64),
    open list, open list:MEMORY MANAGEMENT
Subject: [RFC PATCH 2/7] mm/secretmem: implement mm-local kernel allocations
Date: Wed, 11 Sep 2024 14:34:01 +0000
Message-ID: <20240911143421.85612-3-faresx@amazon.de>
X-Mailer: git-send-email 2.40.1
In-Reply-To: <20240911143421.85612-1-faresx@amazon.de>
References: <20240911143421.85612-1-faresx@amazon.de>
MIME-Version: 1.0
In order to be resilient against cross-process speculation-based
attacks, it makes sense to store certain (secret) items in kernel
memory local to mm.

Implement such allocations on top of the secretmem infrastructure.
Specifically, on allocation:

1. Create a secretmem file.
2. To distinguish it from the conventional memfd_secret()-created one,
   and to maintain the associated mm-local allocation context, put the
   latter on ->private_data of the file.
3. Create a virtual mapping in the user virtual address space using
   mmap().
4. Seal the virtual mapping to disallow the user from affecting it in
   any way.
5. Fault the pages in, effectively calling the secretmem fault handler
   to remove the pages from the kernel linear address space and make
   them local to the process mm.
6. Change the PTEs from user mode to kernel mode; any access from
   userspace will result in a segmentation fault. The kernel can access
   this virtual address now.
7. Return the secure area as a struct containing the pointer to the
   actual memory and providing the context for the release function
   later.

On release:

- if called while the mm is still in use, remove the mapping;
- otherwise, if performed at mm teardown, no unmapping is necessary.

The rest is taken care of by the secretmem file cleanup, including
returning the pages to the kernel direct map.

Signed-off-by: Fares Mehanna
Signed-off-by: Roman Kagan
---
 include/linux/secretmem.h |  29 ++++++
 mm/Kconfig                |  10 ++
 mm/gup.c                  |   4 +-
 mm/secretmem.c            | 213 ++++++++++++++++++++++++++++++++++++++
 4 files changed, 254 insertions(+), 2 deletions(-)

diff --git a/include/linux/secretmem.h b/include/linux/secretmem.h
index e918f96881f5..39cc73a0e4bd 100644
--- a/include/linux/secretmem.h
+++ b/include/linux/secretmem.h
@@ -2,6 +2,10 @@
 #ifndef _LINUX_SECRETMEM_H
 #define _LINUX_SECRETMEM_H
 
+struct secretmem_area {
+	void *ptr;
+};
+
 #ifdef CONFIG_SECRETMEM
 
 extern const struct address_space_operations secretmem_aops;
@@ -33,4 +37,29 @@ static inline bool secretmem_active(void)
 
 #endif /* CONFIG_SECRETMEM */
 
+#ifdef CONFIG_KERNEL_SECRETMEM
+
+bool can_access_secretmem_vma(struct vm_area_struct *vma);
+struct secretmem_area *secretmem_allocate_pages(unsigned int order);
+void secretmem_release_pages(struct secretmem_area *data);
+
+#else
+
+static inline bool can_access_secretmem_vma(struct vm_area_struct *vma)
+{
+	return true;
+}
+
+static inline struct secretmem_area *secretmem_allocate_pages(unsigned int order)
+{
+	return NULL;
+}
+
+static inline void secretmem_release_pages(struct secretmem_area *data)
+{
+	WARN_ONCE(1, "Called secret memory release page without support\n");
+}
+
+#endif /* CONFIG_KERNEL_SECRETMEM */
+
 #endif /* _LINUX_SECRETMEM_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index b72e7d040f78..a327d8def179 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1168,6 +1168,16 @@ config SECRETMEM
 	  memory areas visible only in the context of the owning process and
 	  not mapped to other processes and other kernel page tables.
 
+config KERNEL_SECRETMEM
+	default y
+	bool "Enable kernel usage of memfd_secret()" if EXPERT
+	depends on SECRETMEM
+	depends on MMU
+	help
+	  Enable the kernel usage of memfd_secret() for kernel memory
+	  allocations. The allocated memory is visible only to the kernel
+	  in the context of the owning process.
+
 config ANON_VMA_NAME
 	bool "Anonymous VMA name support"
 	depends on PROC_FS && ADVISE_SYSCALLS && MMU
diff --git a/mm/gup.c b/mm/gup.c
index 54d0dc3831fb..6c2c6a0cbe2a 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1076,7 +1076,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 	struct follow_page_context ctx = { NULL };
 	struct page *page;
 
-	if (vma_is_secretmem(vma))
+	if (!can_access_secretmem_vma(vma))
 		return NULL;
 
 	if (WARN_ON_ONCE(foll_flags & FOLL_PIN))
@@ -1281,7 +1281,7 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
 	if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma))
 		return -EOPNOTSUPP;
 
-	if (vma_is_secretmem(vma))
+	if (!can_access_secretmem_vma(vma))
 		return -EFAULT;
 
 	if (write) {
diff --git a/mm/secretmem.c b/mm/secretmem.c
index 3afb5ad701e1..86afedc65889 100644
--- a/mm/secretmem.c
+++ b/mm/secretmem.c
@@ -13,13 +13,17 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
+#include 
 #include 
@@ -42,6 +46,16 @@ MODULE_PARM_DESC(secretmem_enable,
 
 static atomic_t secretmem_users;
 
+/* secretmem file private context */
+struct secretmem_ctx {
+	struct secretmem_area _area;
+	struct page **_pages;
+	unsigned long _nr_pages;
+	struct file *_file;
+	struct mm_struct *_mm;
+};
+
+
 bool secretmem_active(void)
 {
 	return !!atomic_read(&secretmem_users);
@@ -116,6 +130,7 @@ static const struct vm_operations_struct secretmem_vm_ops = {
 
 static int secretmem_release(struct inode *inode, struct file *file)
 {
+	kfree(file->private_data);
 	atomic_dec(&secretmem_users);
 	return 0;
 }
@@ -123,13 +138,23 @@ static int secretmem_release(struct inode *inode, struct file *file)
 static int secretmem_mmap(struct file *file, struct vm_area_struct *vma)
 {
 	unsigned long len = vma->vm_end - vma->vm_start;
+	struct secretmem_ctx *ctx = file->private_data;
+	unsigned long kernel_no_permissions;
+
+	kernel_no_permissions = (VM_READ | VM_WRITE | VM_EXEC | VM_MAYEXEC);
 
 	if ((vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) == 0)
 		return -EINVAL;
 
+	if (ctx && (vma->vm_flags & kernel_no_permissions))
+		return -EINVAL;
+
 	if (!mlock_future_ok(vma->vm_mm, vma->vm_flags | VM_LOCKED, len))
 		return -EAGAIN;
 
+	if (ctx)
+		vm_flags_set(vma, VM_MIXEDMAP);
+
 	vm_flags_set(vma, VM_LOCKED | VM_DONTDUMP);
 	vma->vm_ops = &secretmem_vm_ops;
 
@@ -230,6 +255,194 @@ static struct file *secretmem_file_create(unsigned long flags)
 	return file;
 }
 
+#ifdef CONFIG_KERNEL_SECRETMEM
+
+struct secretmem_area *secretmem_allocate_pages(unsigned int order)
+{
+	unsigned long uvaddr, uvaddr_inc, unused, nr_pages, bytes_length;
+	struct file *kernel_secfile;
+	struct vm_area_struct *vma;
+	struct secretmem_ctx *ctx;
+	struct page **sec_pages;
+	struct mm_struct *mm;
+	long nr_pinned_pages;
+	pte_t pte, old_pte;
+	spinlock_t *ptl;
+	pte_t *upte;
+	int rc;
+
+	nr_pages = (1 << order);
+	bytes_length = nr_pages * PAGE_SIZE;
+	mm = current->mm;
+
+	if (!mm || !mmget_not_zero(mm))
+		return NULL;
+
+	/* Create secret memory file / truncate it */
+	kernel_secfile = secretmem_file_create(0);
+	if (IS_ERR(kernel_secfile))
+		goto put_mm;
+
+	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+	if (!ctx)
+		goto close_secfile;
+	kernel_secfile->private_data = ctx;
+
+	rc = do_truncate(file_mnt_idmap(kernel_secfile),
+			 file_dentry(kernel_secfile), bytes_length, 0, NULL);
+	if (rc)
+		goto close_secfile;
+
+	if (mmap_write_lock_killable(mm))
+		goto close_secfile;
+
+	/* Map pages to the secretmem file */
+	uvaddr = do_mmap(kernel_secfile, 0, bytes_length, PROT_NONE,
+			 MAP_SHARED, 0, 0, &unused, NULL);
+	if (IS_ERR_VALUE(uvaddr))
+		goto unlock_mmap;
+
+	/* mseal() the VMA to make sure it won't change */
+	rc = do_mseal(uvaddr, uvaddr + bytes_length, true);
+	if (rc)
+		goto unmap_pages;
+
+	/* Make sure VMA is there, and is kernel-secure */
+	vma = find_vma(current->mm, uvaddr);
+	if (!vma)
+		goto unseal_vma;
+
+	if (!vma_is_secretmem(vma) ||
+	    !can_access_secretmem_vma(vma))
+		goto unseal_vma;
+
+	/* Pin user pages; fault them in */
+	sec_pages = kzalloc(sizeof(struct page *) * nr_pages, GFP_KERNEL);
+	if (!sec_pages)
+		goto unseal_vma;
+
+	nr_pinned_pages = pin_user_pages(uvaddr, nr_pages, FOLL_FORCE | FOLL_LONGTERM, sec_pages);
+	if (nr_pinned_pages < 0)
+		goto free_sec_pages;
+	if (nr_pinned_pages != nr_pages)
+		goto unpin_pages;
+
+	/* Modify the existing mapping to be kernel accessible, local to this process mm */
+	uvaddr_inc = uvaddr;
+	while (uvaddr_inc < uvaddr + bytes_length) {
+		upte = get_locked_pte(mm, uvaddr_inc, &ptl);
+		if (!upte)
+			goto unpin_pages;
+		old_pte = ptep_modify_prot_start(vma, uvaddr_inc, upte);
+		pte = pte_modify(old_pte, PAGE_KERNEL);
+		ptep_modify_prot_commit(vma, uvaddr_inc, upte, old_pte, pte);
+		pte_unmap_unlock(upte, ptl);
+		uvaddr_inc += PAGE_SIZE;
+	}
+	flush_tlb_range(vma, uvaddr, uvaddr + bytes_length);
+
+	/* Return data */
+	mmgrab(mm);
+	ctx->_area.ptr = (void *) uvaddr;
+	ctx->_pages = sec_pages;
+	ctx->_nr_pages = nr_pages;
+	ctx->_mm = mm;
+	ctx->_file = kernel_secfile;
+
+	mmap_write_unlock(mm);
+	mmput(mm);
+
+	return &ctx->_area;
+
+unpin_pages:
+	unpin_user_pages(sec_pages, nr_pinned_pages);
+free_sec_pages:
+	kfree(sec_pages);
+unseal_vma:
+	rc = do_mseal(uvaddr, uvaddr + bytes_length, false);
+	if (rc)
+		BUG();
+unmap_pages:
+	rc = do_munmap(mm, uvaddr, bytes_length, NULL);
+	if (rc)
+		BUG();
+unlock_mmap:
+	mmap_write_unlock(mm);
+close_secfile:
+	fput(kernel_secfile);
+put_mm:
+	mmput(mm);
+	return NULL;
+}
+
+void secretmem_release_pages(struct secretmem_area *data)
+{
+	unsigned long uvaddr, bytes_length;
+	struct secretmem_ctx *ctx;
+	int rc;
+
+	if (!data || !data->ptr)
+		BUG();
+
+	ctx = container_of(data, struct secretmem_ctx, _area);
+	if (!ctx || !ctx->_file || !ctx->_pages || !ctx->_mm)
+		BUG();
+
+	bytes_length = ctx->_nr_pages * PAGE_SIZE;
+	uvaddr = (unsigned long) data->ptr;
+
+	/*
+	 * Remove the mapping if mm is still in use.
+	 * Not secure to continue if unmapping failed.
+	 */
+	if (mmget_not_zero(ctx->_mm)) {
+		mmap_write_lock(ctx->_mm);
+		rc = do_mseal(uvaddr, uvaddr + bytes_length, false);
+		if (rc) {
+			mmap_write_unlock(ctx->_mm);
+			BUG();
+		}
+		rc = do_munmap(ctx->_mm, uvaddr, bytes_length, NULL);
+		if (rc) {
+			mmap_write_unlock(ctx->_mm);
+			BUG();
+		}
+		mmap_write_unlock(ctx->_mm);
+		mmput(ctx->_mm);
+	}
+
+	mmdrop(ctx->_mm);
+	unpin_user_pages(ctx->_pages, ctx->_nr_pages);
+	fput(ctx->_file);
+	kfree(ctx->_pages);
+
+	ctx->_nr_pages = 0;
+	ctx->_pages = NULL;
+	ctx->_file = NULL;
+	ctx->_mm = NULL;
+	ctx->_area.ptr = NULL;
+}
+
+bool can_access_secretmem_vma(struct vm_area_struct *vma)
+{
+	struct secretmem_ctx *ctx;
+
+	if (!vma_is_secretmem(vma))
+		return true;
+
+	/*
+	 * If VMA is owned by running process, and marked for kernel
+	 * usage, then allow access.
+	 */
+	ctx = vma->vm_file->private_data;
+	if (ctx && current->mm == vma->vm_mm)
+		return true;
+
+	return false;
+}
+
+#endif /* CONFIG_KERNEL_SECRETMEM */
+
 SYSCALL_DEFINE1(memfd_secret, unsigned int, flags)
 {
 	struct file *file;
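
---

For illustration only, not part of the patch: a minimal sketch of how an
in-kernel consumer of this series might use the new API, assuming process
context and exactly the interface declared above. The function
store_secret_locally() and its parameters are invented for the example;
error codes and the memzero_explicit() wipe are the example's choices, not
requirements of the API.

#include <linux/mm.h>
#include <linux/string.h>
#include <linux/secretmem.h>

/*
 * Hypothetical consumer: keep a secret in memory reachable only through
 * this process's mm. Must run in process context, because
 * secretmem_allocate_pages() maps into current->mm.
 */
static int store_secret_locally(const void *secret, size_t len)
{
	struct secretmem_area *area;

	if (len > PAGE_SIZE)
		return -EINVAL;

	/* order 0: one page, removed from the kernel direct map */
	area = secretmem_allocate_pages(0);
	if (!area)
		return -ENOMEM;

	memcpy(area->ptr, secret, len);

	/* ... use the secret via area->ptr, within this mm ... */

	/* wipe before handing the page back */
	memzero_explicit(area->ptr, len);
	secretmem_release_pages(area);
	return 0;
}

Note that area->ptr is only dereferenceable while the owning mm is active;
the release must be paired with the allocation, as the series relies on
secretmem file cleanup to return the pages to the direct map.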