From patchwork Wed Aug 4 04:32:14 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ira Weiny X-Patchwork-Id: 12417765 Received: from mga05.intel.com (mga05.intel.com [192.55.52.43]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 3F5C970 for ; Wed, 4 Aug 2021 04:32:36 +0000 (UTC) X-IronPort-AV: E=McAfee;i="6200,9189,10065"; a="299433045" X-IronPort-AV: E=Sophos;i="5.84,293,1620716400"; d="scan'208";a="299433045" Received: from fmsmga003.fm.intel.com ([10.253.24.29]) by fmsmga105.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Aug 2021 21:32:35 -0700 X-IronPort-AV: E=Sophos;i="5.84,293,1620716400"; d="scan'208";a="511702656" Received: from iweiny-desk2.sc.intel.com (HELO localhost) ([10.3.52.147]) by fmsmga003-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Aug 2021 21:32:35 -0700 From: ira.weiny@intel.com To: Dave Hansen , Dan Williams Cc: Ira Weiny , Thomas Gleixner , Ingo Molnar , Borislav Petkov , Peter Zijlstra , Andy Lutomirski , "H. Peter Anvin" , Fenghua Yu , Rick Edgecombe , x86@kernel.org, linux-kernel@vger.kernel.org, nvdimm@lists.linux.dev, linux-mm@kvack.org Subject: [PATCH V7 01/18] x86/pkeys: Create pkeys_common.h Date: Tue, 3 Aug 2021 21:32:14 -0700 Message-Id: <20210804043231.2655537-2-ira.weiny@intel.com> X-Mailer: git-send-email 2.28.0.rc0.12.gb6a658bd00c9 In-Reply-To: <20210804043231.2655537-1-ira.weiny@intel.com> References: <20210804043231.2655537-1-ira.weiny@intel.com> Precedence: bulk X-Mailing-List: nvdimm@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: Ira Weiny Protection Keys User (PKU) and Protection Keys Supervisor (PKS) work in similar fashions and can share common defines. Specifically PKS and PKU each have: 1. A single control register 2. The same number of keys 3. The same number of bits in the register per key 4. Access and Write disable in the same bit locations Given the above, share all the macros that synthesize and manipulate register values between the two features. Share these defines by moving them into a new header, change their names to reflect the common use, and include the header where needed. Also while editing the code remove the use of 'we' from comments being touched. 
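For illustration only (this standalone user-space sketch is not part of the patch): the shared macros synthesize a register value purely with shifts and ORs, so the default PKRU value built in arch/x86/mm/pkeys.c can be reproduced outside the kernel as follows::

    #include <stdio.h>

    /* Mirrors the defines moved into asm/pkeys_common.h */
    #define PKR_AD_BIT		0x1
    #define PKR_WD_BIT		0x2
    #define PKR_BITS_PER_PKEY	2
    #define PKR_AD_KEY(pkey)	(PKR_AD_BIT << ((pkey) * PKR_BITS_PER_PKEY))

    int main(void)
    {
    	unsigned int val = 0;
    	int pkey;

    	/* Access-disable every key except pkey 0, as init_pkru_value does */
    	for (pkey = 1; pkey < 16; pkey++)
    		val |= PKR_AD_KEY(pkey);

    	printf("init_pkru_value = 0x%08x\n", val);	/* 0x55555554 */
    	return 0;
    }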
Signed-off-by: Ira Weiny --- arch/x86/include/asm/pkeys_common.h | 11 +++++++++++ arch/x86/include/asm/pkru.h | 18 ++++++------------ arch/x86/kernel/fpu/xstate.c | 8 ++++---- arch/x86/mm/pkeys.c | 14 ++++++-------- 4 files changed, 27 insertions(+), 24 deletions(-) create mode 100644 arch/x86/include/asm/pkeys_common.h diff --git a/arch/x86/include/asm/pkeys_common.h b/arch/x86/include/asm/pkeys_common.h new file mode 100644 index 000000000000..f3277717faeb --- /dev/null +++ b/arch/x86/include/asm/pkeys_common.h @@ -0,0 +1,11 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _ASM_X86_PKEYS_COMMON_H +#define _ASM_X86_PKEYS_COMMON_H + +#define PKR_AD_BIT 0x1 +#define PKR_WD_BIT 0x2 +#define PKR_BITS_PER_PKEY 2 + +#define PKR_AD_KEY(pkey) (PKR_AD_BIT << ((pkey) * PKR_BITS_PER_PKEY)) + +#endif /*_ASM_X86_PKEYS_COMMON_H */ diff --git a/arch/x86/include/asm/pkru.h b/arch/x86/include/asm/pkru.h index ccc539faa5bb..a74325b0d1df 100644 --- a/arch/x86/include/asm/pkru.h +++ b/arch/x86/include/asm/pkru.h @@ -3,10 +3,7 @@ #define _ASM_X86_PKRU_H #include - -#define PKRU_AD_BIT 0x1 -#define PKRU_WD_BIT 0x2 -#define PKRU_BITS_PER_PKEY 2 +#include #ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS extern u32 init_pkru_value; @@ -18,18 +15,15 @@ extern u32 init_pkru_value; static inline bool __pkru_allows_read(u32 pkru, u16 pkey) { - int pkru_pkey_bits = pkey * PKRU_BITS_PER_PKEY; - return !(pkru & (PKRU_AD_BIT << pkru_pkey_bits)); + int pkru_pkey_bits = pkey * PKR_BITS_PER_PKEY; + return !(pkru & (PKR_AD_BIT << pkru_pkey_bits)); } static inline bool __pkru_allows_write(u32 pkru, u16 pkey) { - int pkru_pkey_bits = pkey * PKRU_BITS_PER_PKEY; - /* - * Access-disable disables writes too so we need to check - * both bits here. - */ - return !(pkru & ((PKRU_AD_BIT|PKRU_WD_BIT) << pkru_pkey_bits)); + int pkru_pkey_bits = pkey * PKR_BITS_PER_PKEY; + /* Access-disable disables writes too so check both bits here. */ + return !(pkru & ((PKR_AD_BIT|PKR_WD_BIT) << pkru_pkey_bits)); } static inline u32 read_pkru(void) diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c index c8def1b7f8fb..6af0c80ad425 100644 --- a/arch/x86/kernel/fpu/xstate.c +++ b/arch/x86/kernel/fpu/xstate.c @@ -933,11 +933,11 @@ int arch_set_user_pkey_access(struct task_struct *tsk, int pkey, if (WARN_ON_ONCE(pkey >= arch_max_pkey())) return -EINVAL; - /* Set the bits we need in PKRU: */ + /* Set the bits needed in PKRU: */ if (init_val & PKEY_DISABLE_ACCESS) - new_pkru_bits |= PKRU_AD_BIT; + new_pkru_bits |= PKR_AD_BIT; if (init_val & PKEY_DISABLE_WRITE) - new_pkru_bits |= PKRU_WD_BIT; + new_pkru_bits |= PKR_WD_BIT; /* Shift the bits in to the correct place in PKRU for pkey: */ pkey_shift = pkey * PKRU_BITS_PER_PKEY; @@ -945,7 +945,7 @@ int arch_set_user_pkey_access(struct task_struct *tsk, int pkey, /* Get old PKRU and mask off any old bits in place: */ old_pkru = read_pkru(); - old_pkru &= ~((PKRU_AD_BIT|PKRU_WD_BIT) << pkey_shift); + old_pkru &= ~((PKR_AD_BIT|PKR_WD_BIT) << pkey_shift); /* Write old part along with new part: */ write_pkru(old_pkru | new_pkru_bits); diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c index e44e938885b7..aa7042f272fb 100644 --- a/arch/x86/mm/pkeys.c +++ b/arch/x86/mm/pkeys.c @@ -110,19 +110,17 @@ int __arch_override_mprotect_pkey(struct vm_area_struct *vma, int prot, int pkey return vma_pkey(vma); } -#define PKRU_AD_KEY(pkey) (PKRU_AD_BIT << ((pkey) * PKRU_BITS_PER_PKEY)) - /* * Make the default PKRU value (at execve() time) as restrictive * as possible. 
This ensures that any threads clone()'d early * in the process's lifetime will not accidentally get access * to data which is pkey-protected later on. */ -u32 init_pkru_value = PKRU_AD_KEY( 1) | PKRU_AD_KEY( 2) | PKRU_AD_KEY( 3) | - PKRU_AD_KEY( 4) | PKRU_AD_KEY( 5) | PKRU_AD_KEY( 6) | - PKRU_AD_KEY( 7) | PKRU_AD_KEY( 8) | PKRU_AD_KEY( 9) | - PKRU_AD_KEY(10) | PKRU_AD_KEY(11) | PKRU_AD_KEY(12) | - PKRU_AD_KEY(13) | PKRU_AD_KEY(14) | PKRU_AD_KEY(15); +u32 init_pkru_value = PKR_AD_KEY( 1) | PKR_AD_KEY( 2) | PKR_AD_KEY( 3) | + PKR_AD_KEY( 4) | PKR_AD_KEY( 5) | PKR_AD_KEY( 6) | + PKR_AD_KEY( 7) | PKR_AD_KEY( 8) | PKR_AD_KEY( 9) | + PKR_AD_KEY(10) | PKR_AD_KEY(11) | PKR_AD_KEY(12) | + PKR_AD_KEY(13) | PKR_AD_KEY(14) | PKR_AD_KEY(15); static ssize_t init_pkru_read_file(struct file *file, char __user *user_buf, size_t count, loff_t *ppos) @@ -155,7 +153,7 @@ static ssize_t init_pkru_write_file(struct file *file, * up immediately if someone attempts to disable access * or writes to pkey 0. */ - if (new_init_pkru & (PKRU_AD_BIT|PKRU_WD_BIT)) + if (new_init_pkru & (PKR_AD_BIT|PKR_WD_BIT)) return -EINVAL; WRITE_ONCE(init_pkru_value, new_init_pkru); From patchwork Wed Aug 4 04:32:15 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ira Weiny X-Patchwork-Id: 12417763 Received: from mga05.intel.com (mga05.intel.com [192.55.52.43]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 071B13481 for ; Wed, 4 Aug 2021 04:32:36 +0000 (UTC) X-IronPort-AV: E=McAfee;i="6200,9189,10065"; a="299433047" X-IronPort-AV: E=Sophos;i="5.84,293,1620716400"; d="scan'208";a="299433047" Received: from fmsmga003.fm.intel.com ([10.253.24.29]) by fmsmga105.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Aug 2021 21:32:35 -0700 X-IronPort-AV: E=Sophos;i="5.84,293,1620716400"; d="scan'208";a="511702659" Received: from iweiny-desk2.sc.intel.com (HELO localhost) ([10.3.52.147]) by fmsmga003-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Aug 2021 21:32:35 -0700 From: ira.weiny@intel.com To: Dave Hansen , Dan Williams Cc: Ira Weiny , Peter Zijlstra , Thomas Gleixner , Ingo Molnar , Borislav Petkov , Andy Lutomirski , "H. Peter Anvin" , Fenghua Yu , Rick Edgecombe , x86@kernel.org, linux-kernel@vger.kernel.org, nvdimm@lists.linux.dev, linux-mm@kvack.org Subject: [PATCH V7 02/18] x86/fpu: Refactor arch_set_user_pkey_access() Date: Tue, 3 Aug 2021 21:32:15 -0700 Message-Id: <20210804043231.2655537-3-ira.weiny@intel.com> X-Mailer: git-send-email 2.28.0.rc0.12.gb6a658bd00c9 In-Reply-To: <20210804043231.2655537-1-ira.weiny@intel.com> References: <20210804043231.2655537-1-ira.weiny@intel.com> Precedence: bulk X-Mailing-List: nvdimm@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: Ira Weiny Both PKU and PKS update their register values in the same way. They can therefore share the update code. Define a helper, update_pkey_val(), which will be used to support both Protection Key User (PKU) and the new Protection Key for Supervisor (PKS) in subsequent patches. Use that helper in arch_set_user_pkey_access(). 
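For illustration only (not part of the patch): the standalone harness below mirrors the update_pkey_val() helper introduced here so the bit manipulation can be checked in user space; the PKEY_DISABLE_* constants are stand-ins for the existing uapi flag values::

    #include <stdio.h>

    #define PKR_AD_BIT		0x1
    #define PKR_WD_BIT		0x2
    #define PKR_BITS_PER_PKEY	2

    /* Stand-ins for the uapi PKEY_DISABLE_* flags */
    #define PKEY_DISABLE_ACCESS	0x1
    #define PKEY_DISABLE_WRITE	0x2

    /* Mirrors the update_pkey_val() helper added by this patch */
    static unsigned int update_pkey_val(unsigned int pk_reg, int pkey,
    				    unsigned int flags)
    {
    	int pkey_shift = pkey * PKR_BITS_PER_PKEY;

    	/* Mask out old bit values */
    	pk_reg &= ~(((1 << PKR_BITS_PER_PKEY) - 1) << pkey_shift);

    	/* Or in new values */
    	if (flags & PKEY_DISABLE_ACCESS)
    		pk_reg |= PKR_AD_BIT << pkey_shift;
    	if (flags & PKEY_DISABLE_WRITE)
    		pk_reg |= PKR_WD_BIT << pkey_shift;

    	return pk_reg;
    }

    int main(void)
    {
    	unsigned int pkru = 0x55555554;	/* default: AD on keys 1-15 */

    	/* Make pkey 3 read-only: clear AD, set WD */
    	pkru = update_pkey_val(pkru, 3, PKEY_DISABLE_WRITE);
    	printf("pkru = 0x%08x\n", pkru);	/* 0x55555594 */
    	return 0;
    }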
Co-developed-by: Peter Zijlstra Signed-off-by: Peter Zijlstra Signed-off-by: Ira Weiny --- arch/x86/include/asm/pkeys.h | 2 ++ arch/x86/kernel/fpu/xstate.c | 22 ++++------------------ arch/x86/mm/pkeys.c | 23 +++++++++++++++++++++++ 3 files changed, 29 insertions(+), 18 deletions(-) diff --git a/arch/x86/include/asm/pkeys.h b/arch/x86/include/asm/pkeys.h index 5c7bcaa79623..597f19e4525b 100644 --- a/arch/x86/include/asm/pkeys.h +++ b/arch/x86/include/asm/pkeys.h @@ -133,4 +133,6 @@ static inline int vma_pkey(struct vm_area_struct *vma) return (vma->vm_flags & vma_pkey_mask) >> VM_PKEY_SHIFT; } +u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags); + #endif /*_ASM_X86_PKEYS_H */ diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c index 6af0c80ad425..4f95ab38a23c 100644 --- a/arch/x86/kernel/fpu/xstate.c +++ b/arch/x86/kernel/fpu/xstate.c @@ -915,8 +915,7 @@ EXPORT_SYMBOL_GPL(get_xsave_addr); int arch_set_user_pkey_access(struct task_struct *tsk, int pkey, unsigned long init_val) { - u32 old_pkru, new_pkru_bits = 0; - int pkey_shift; + u32 pkru; /* * This check implies XSAVE support. OSPKE only gets @@ -933,22 +932,9 @@ int arch_set_user_pkey_access(struct task_struct *tsk, int pkey, if (WARN_ON_ONCE(pkey >= arch_max_pkey())) return -EINVAL; - /* Set the bits needed in PKRU: */ - if (init_val & PKEY_DISABLE_ACCESS) - new_pkru_bits |= PKR_AD_BIT; - if (init_val & PKEY_DISABLE_WRITE) - new_pkru_bits |= PKR_WD_BIT; - - /* Shift the bits in to the correct place in PKRU for pkey: */ - pkey_shift = pkey * PKRU_BITS_PER_PKEY; - new_pkru_bits <<= pkey_shift; - - /* Get old PKRU and mask off any old bits in place: */ - old_pkru = read_pkru(); - old_pkru &= ~((PKR_AD_BIT|PKR_WD_BIT) << pkey_shift); - - /* Write old part along with new part: */ - write_pkru(old_pkru | new_pkru_bits); + pkru = read_pkru(); + pkru = update_pkey_val(pkru, pkey, init_val); + write_pkru(pkru); return 0; } diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c index aa7042f272fb..ca2e20b18645 100644 --- a/arch/x86/mm/pkeys.c +++ b/arch/x86/mm/pkeys.c @@ -190,3 +190,26 @@ static __init int setup_init_pkru(char *opt) return 1; } __setup("init_pkru=", setup_init_pkru); + +/* + * Replace disable bits for @pkey with values from @flags + * + * Kernel users use the same flags as user space: + * PKEY_DISABLE_ACCESS + * PKEY_DISABLE_WRITE + */ +u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags) +{ + int pkey_shift = pkey * PKR_BITS_PER_PKEY; + + /* Mask out old bit values */ + pk_reg &= ~(((1 << PKR_BITS_PER_PKEY) - 1) << pkey_shift); + + /* Or in new values */ + if (flags & PKEY_DISABLE_ACCESS) + pk_reg |= PKR_AD_BIT << pkey_shift; + if (flags & PKEY_DISABLE_WRITE) + pk_reg |= PKR_WD_BIT << pkey_shift; + + return pk_reg; +} From patchwork Wed Aug 4 04:32:16 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ira Weiny X-Patchwork-Id: 12417767 Received: from mga05.intel.com (mga05.intel.com [192.55.52.43]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 1419129CA for ; Wed, 4 Aug 2021 04:32:38 +0000 (UTC) X-IronPort-AV: E=McAfee;i="6200,9189,10065"; a="299433050" X-IronPort-AV: E=Sophos;i="5.84,293,1620716400"; d="scan'208";a="299433050" Received: from fmsmga003.fm.intel.com ([10.253.24.29]) by fmsmga105.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Aug 2021 21:32:35 -0700 X-IronPort-AV: 
E=Sophos;i="5.84,293,1620716400"; d="scan'208";a="511702663" Received: from iweiny-desk2.sc.intel.com (HELO localhost) ([10.3.52.147]) by fmsmga003-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Aug 2021 21:32:35 -0700 From: ira.weiny@intel.com To: Dave Hansen , Dan Williams Cc: Ira Weiny , Thomas Gleixner , Ingo Molnar , Borislav Petkov , Peter Zijlstra , Andy Lutomirski , "H. Peter Anvin" , Fenghua Yu , Rick Edgecombe , x86@kernel.org, linux-kernel@vger.kernel.org, nvdimm@lists.linux.dev, linux-mm@kvack.org Subject: [PATCH V7 03/18] x86/pks: Add additional PKEY helper macros Date: Tue, 3 Aug 2021 21:32:16 -0700 Message-Id: <20210804043231.2655537-4-ira.weiny@intel.com> X-Mailer: git-send-email 2.28.0.rc0.12.gb6a658bd00c9 In-Reply-To: <20210804043231.2655537-1-ira.weiny@intel.com> References: <20210804043231.2655537-1-ira.weiny@intel.com> Precedence: bulk X-Mailing-List: nvdimm@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: Ira Weiny Avoid open coding shift and mask operations by defining and using helper macros for PKey operations. Signed-off-by: Ira Weiny --- arch/x86/include/asm/pkeys_common.h | 6 +++++- arch/x86/include/asm/pkru.h | 6 ++---- arch/x86/mm/pkeys.c | 8 +++----- 3 files changed, 10 insertions(+), 10 deletions(-) diff --git a/arch/x86/include/asm/pkeys_common.h b/arch/x86/include/asm/pkeys_common.h index f3277717faeb..8a3c6d2e6a8a 100644 --- a/arch/x86/include/asm/pkeys_common.h +++ b/arch/x86/include/asm/pkeys_common.h @@ -6,6 +6,10 @@ #define PKR_WD_BIT 0x2 #define PKR_BITS_PER_PKEY 2 -#define PKR_AD_KEY(pkey) (PKR_AD_BIT << ((pkey) * PKR_BITS_PER_PKEY)) +#define PKR_PKEY_SHIFT(pkey) (pkey * PKR_BITS_PER_PKEY) +#define PKR_PKEY_MASK(pkey) (((1 << PKR_BITS_PER_PKEY) - 1) << PKR_PKEY_SHIFT(pkey)) + +#define PKR_AD_KEY(pkey) (PKR_AD_BIT << PKR_PKEY_SHIFT(pkey)) +#define PKR_WD_KEY(pkey) (PKR_WD_BIT << PKR_PKEY_SHIFT(pkey)) #endif /*_ASM_X86_PKEYS_COMMON_H */ diff --git a/arch/x86/include/asm/pkru.h b/arch/x86/include/asm/pkru.h index a74325b0d1df..fb44ff542028 100644 --- a/arch/x86/include/asm/pkru.h +++ b/arch/x86/include/asm/pkru.h @@ -15,15 +15,13 @@ extern u32 init_pkru_value; static inline bool __pkru_allows_read(u32 pkru, u16 pkey) { - int pkru_pkey_bits = pkey * PKR_BITS_PER_PKEY; - return !(pkru & (PKR_AD_BIT << pkru_pkey_bits)); + return !(pkru & PKR_AD_KEY(pkey)); } static inline bool __pkru_allows_write(u32 pkru, u16 pkey) { - int pkru_pkey_bits = pkey * PKR_BITS_PER_PKEY; /* Access-disable disables writes too so check both bits here. 
*/ - return !(pkru & ((PKR_AD_BIT|PKR_WD_BIT) << pkru_pkey_bits)); + return !(pkru & (PKR_AD_KEY(pkey) | PKR_WD_KEY(pkey))); } static inline u32 read_pkru(void) diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c index ca2e20b18645..75437aa8fc56 100644 --- a/arch/x86/mm/pkeys.c +++ b/arch/x86/mm/pkeys.c @@ -200,16 +200,14 @@ __setup("init_pkru=", setup_init_pkru); */ u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags) { - int pkey_shift = pkey * PKR_BITS_PER_PKEY; - /* Mask out old bit values */ - pk_reg &= ~(((1 << PKR_BITS_PER_PKEY) - 1) << pkey_shift); + pk_reg &= ~PKR_PKEY_MASK(pkey); /* Or in new values */ if (flags & PKEY_DISABLE_ACCESS) - pk_reg |= PKR_AD_BIT << pkey_shift; + pk_reg |= PKR_AD_KEY(pkey); if (flags & PKEY_DISABLE_WRITE) - pk_reg |= PKR_WD_BIT << pkey_shift; + pk_reg |= PKR_WD_KEY(pkey); return pk_reg; } From patchwork Wed Aug 4 04:32:17 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ira Weiny X-Patchwork-Id: 12417769 Received: from mga05.intel.com (mga05.intel.com [192.55.52.43]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id B444E2FB7 for ; Wed, 4 Aug 2021 04:32:38 +0000 (UTC) X-IronPort-AV: E=McAfee;i="6200,9189,10065"; a="299433052" X-IronPort-AV: E=Sophos;i="5.84,293,1620716400"; d="scan'208";a="299433052" Received: from fmsmga003.fm.intel.com ([10.253.24.29]) by fmsmga105.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Aug 2021 21:32:36 -0700 X-IronPort-AV: E=Sophos;i="5.84,293,1620716400"; d="scan'208";a="511702667" Received: from iweiny-desk2.sc.intel.com (HELO localhost) ([10.3.52.147]) by fmsmga003-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Aug 2021 21:32:35 -0700 From: ira.weiny@intel.com To: Dave Hansen , Dan Williams Cc: Ira Weiny , Fenghua Yu , Thomas Gleixner , Ingo Molnar , Borislav Petkov , Peter Zijlstra , Andy Lutomirski , "H. Peter Anvin" , Rick Edgecombe , x86@kernel.org, linux-kernel@vger.kernel.org, nvdimm@lists.linux.dev, linux-mm@kvack.org Subject: [PATCH V7 04/18] x86/pks: Add PKS defines and Kconfig options Date: Tue, 3 Aug 2021 21:32:17 -0700 Message-Id: <20210804043231.2655537-5-ira.weiny@intel.com> X-Mailer: git-send-email 2.28.0.rc0.12.gb6a658bd00c9 In-Reply-To: <20210804043231.2655537-1-ira.weiny@intel.com> References: <20210804043231.2655537-1-ira.weiny@intel.com> Precedence: bulk X-Mailing-List: nvdimm@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: Ira Weiny Protection Keys for Supervisor pages (PKS) enables fast, hardware thread specific, manipulation of permission restrictions on supervisor page mappings. It uses the same mechanism of Protection Keys as those on User mappings but applies that mechanism to supervisor mappings using a supervisor specific MSR. Define the PKS CPU feature bits. Add the Kconfig ARCH_HAS_SUPERVISOR_PKEYS to indicate to kernel consumers that an architecture supports pkeys. Introduce ARCH_ENABLE_SUPERVISOR_PKEYS to allow architectures to avoid PKS code unless a kernel consumers is configured. ARCH_ENABLE_SUPERVISOR_PKEYS remains off until the first kernel use case sets it. 
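For illustration only (not part of the patch): X86_FEATURE_PKS lands in cpufeatures word 16, bit 31, and when ARCH_ENABLE_SUPERVISOR_PKEYS is not selected that bit is folded into DISABLED_MASK16 so cpu_feature_enabled(X86_FEATURE_PKS) evaluates to false at compile time. A standalone sketch of the mask arithmetic (1U is used here only to keep the shift well-defined outside the kernel)::

    #include <stdio.h>

    /* Mirrors the new cpufeatures.h entry: word 16, bit 31 */
    #define X86_FEATURE_PKS		(16*32 + 31)

    /* Mirrors disabled-features.h: PKS stays masked off without a consumer */
    #ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
    # define DISABLE_PKS	0U
    #else
    # define DISABLE_PKS	(1U << (X86_FEATURE_PKS & 31))
    #endif

    int main(void)
    {
    	/* This bit is ORed into DISABLED_MASK16 */
    	printf("DISABLE_PKS = 0x%08x\n", DISABLE_PKS);	/* 0x80000000 */
    	return 0;
    }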
Reviewed-by: Dan Williams Co-developed-by: Fenghua Yu Signed-off-by: Fenghua Yu Signed-off-by: Ira Weiny --- arch/x86/Kconfig | 1 + arch/x86/include/asm/cpufeatures.h | 1 + arch/x86/include/asm/disabled-features.h | 8 +++++++- arch/x86/include/uapi/asm/processor-flags.h | 2 ++ mm/Kconfig | 4 ++++ 5 files changed, 15 insertions(+), 1 deletion(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 49270655e827..d0a7d19aa245 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -1837,6 +1837,7 @@ config X86_INTEL_MEMORY_PROTECTION_KEYS depends on X86_64 && (CPU_SUP_INTEL || CPU_SUP_AMD) select ARCH_USES_HIGH_VMA_FLAGS select ARCH_HAS_PKEYS + select ARCH_HAS_SUPERVISOR_PKEYS help Memory Protection Keys provides a mechanism for enforcing page-based protections, but without requiring modification of the diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h index d0ce5cfd3ac1..80c357f638fd 100644 --- a/arch/x86/include/asm/cpufeatures.h +++ b/arch/x86/include/asm/cpufeatures.h @@ -365,6 +365,7 @@ #define X86_FEATURE_MOVDIR64B (16*32+28) /* MOVDIR64B instruction */ #define X86_FEATURE_ENQCMD (16*32+29) /* ENQCMD and ENQCMDS instructions */ #define X86_FEATURE_SGX_LC (16*32+30) /* Software Guard Extensions Launch Control */ +#define X86_FEATURE_PKS (16*32+31) /* Protection Keys for Supervisor pages */ /* AMD-defined CPU features, CPUID level 0x80000007 (EBX), word 17 */ #define X86_FEATURE_OVERFLOW_RECOV (17*32+ 0) /* MCA overflow recovery support */ diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h index 8f28fafa98b3..66fdad8f3941 100644 --- a/arch/x86/include/asm/disabled-features.h +++ b/arch/x86/include/asm/disabled-features.h @@ -44,6 +44,12 @@ # define DISABLE_OSPKE (1<<(X86_FEATURE_OSPKE & 31)) #endif /* CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS */ +#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS +# define DISABLE_PKS 0 +#else +# define DISABLE_PKS (1<<(X86_FEATURE_PKS & 31)) +#endif + #ifdef CONFIG_X86_5LEVEL # define DISABLE_LA57 0 #else @@ -85,7 +91,7 @@ #define DISABLED_MASK14 0 #define DISABLED_MASK15 0 #define DISABLED_MASK16 (DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP| \ - DISABLE_ENQCMD) + DISABLE_ENQCMD|DISABLE_PKS) #define DISABLED_MASK17 0 #define DISABLED_MASK18 0 #define DISABLED_MASK19 0 diff --git a/arch/x86/include/uapi/asm/processor-flags.h b/arch/x86/include/uapi/asm/processor-flags.h index bcba3c643e63..191c574b2390 100644 --- a/arch/x86/include/uapi/asm/processor-flags.h +++ b/arch/x86/include/uapi/asm/processor-flags.h @@ -130,6 +130,8 @@ #define X86_CR4_SMAP _BITUL(X86_CR4_SMAP_BIT) #define X86_CR4_PKE_BIT 22 /* enable Protection Keys support */ #define X86_CR4_PKE _BITUL(X86_CR4_PKE_BIT) +#define X86_CR4_PKS_BIT 24 /* enable Protection Keys for Supervisor */ +#define X86_CR4_PKS _BITUL(X86_CR4_PKS_BIT) /* * x86-64 Task Priority Register, CR8 diff --git a/mm/Kconfig b/mm/Kconfig index 40a9bfcd5062..e0d29c655ade 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -818,6 +818,10 @@ config ARCH_USES_HIGH_VMA_FLAGS bool config ARCH_HAS_PKEYS bool +config ARCH_HAS_SUPERVISOR_PKEYS + bool +config ARCH_ENABLE_SUPERVISOR_PKEYS + bool config PERCPU_STATS bool "Collect percpu memory statistics" From patchwork Wed Aug 4 04:32:18 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ira Weiny X-Patchwork-Id: 12417771 Received: from mga05.intel.com (mga05.intel.com [192.55.52.43]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 
(256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id B442F70 for ; Wed, 4 Aug 2021 04:32:38 +0000 (UTC) X-IronPort-AV: E=McAfee;i="6200,9189,10065"; a="299433054" X-IronPort-AV: E=Sophos;i="5.84,293,1620716400"; d="scan'208";a="299433054" Received: from fmsmga003.fm.intel.com ([10.253.24.29]) by fmsmga105.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Aug 2021 21:32:36 -0700 X-IronPort-AV: E=Sophos;i="5.84,293,1620716400"; d="scan'208";a="511702673" Received: from iweiny-desk2.sc.intel.com (HELO localhost) ([10.3.52.147]) by fmsmga003-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Aug 2021 21:32:36 -0700 From: ira.weiny@intel.com To: Dave Hansen , Dan Williams Cc: Ira Weiny , Peter Zijlstra , Fenghua Yu , "Hansen, Dave" , Thomas Gleixner , Ingo Molnar , Borislav Petkov , Andy Lutomirski , "H. Peter Anvin" , Rick Edgecombe , x86@kernel.org, linux-kernel@vger.kernel.org, nvdimm@lists.linux.dev, linux-mm@kvack.org Subject: [PATCH V7 05/18] x86/pks: Add PKS setup code Date: Tue, 3 Aug 2021 21:32:18 -0700 Message-Id: <20210804043231.2655537-6-ira.weiny@intel.com> X-Mailer: git-send-email 2.28.0.rc0.12.gb6a658bd00c9 In-Reply-To: <20210804043231.2655537-1-ira.weiny@intel.com> References: <20210804043231.2655537-1-ira.weiny@intel.com> Precedence: bulk X-Mailing-List: nvdimm@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: Ira Weiny Protection Keys for Supervisor pages (PKS) enables fast, hardware thread specific, manipulation of permission restrictions on supervisor page mappings. It uses the same mechanism of Protection Keys as those on User mappings but applies that mechanism to supervisor mappings using a supervisor specific MSR. Add setup code and the lowest level of PKS MSR write support. Pkeys values are allocated statically via the pks_pkey_consumers enumeration. create_initial_pkrs_value() builds the initial protection values for each pkey. Users who need a default value other than Access Disabled should update consumer_defaults[]. The PKRS value is cached per-cpu to avoid the overhead of the MSR write if the value has not changed. That said, it should be noted that the underlying WRMSR(MSR_IA32_PKRS) is not serializing but still maintains ordering properties similar to WRPKRU. The current SDM section on PKRS needs updating but should be the same as that of WRPKRU. So to quote from the WRPKRU text: WRPKRU will never execute transiently. Memory accesses affected by PKRU register will not execute (even transiently) until all prior executions of WRPKRU have completed execution and updated the PKRU register. write_pkrs() contributed by Peter Zijlstra. create_initial_pkrs_value() contributed by Dave Hansen setup_pks() is an internal x86 function call. Introduce asm/pks.h to declare functions and internal structures such as this. Reviewed-by: Dan Williams Co-developed-by: Peter Zijlstra Signed-off-by: Peter Zijlstra Co-developed-by: Fenghua Yu Signed-off-by: Fenghua Yu Co-developed-by: "Hansen, Dave" Signed-off-by: "Hansen, Dave" Signed-off-by: Ira Weiny --- Changes for V7 Create a dynamic pkrs_initial_value in early init code. 
Clean up comments Add comment to macro guard --- arch/x86/include/asm/msr-index.h | 1 + arch/x86/include/asm/pkeys_common.h | 4 ++ arch/x86/include/asm/pks.h | 15 ++++++ arch/x86/kernel/cpu/common.c | 2 + arch/x86/mm/pkeys.c | 75 +++++++++++++++++++++++++++++ include/linux/pkeys.h | 8 +++ 6 files changed, 105 insertions(+) create mode 100644 arch/x86/include/asm/pks.h diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h index a7c413432b33..c986eb1f36a9 100644 --- a/arch/x86/include/asm/msr-index.h +++ b/arch/x86/include/asm/msr-index.h @@ -767,6 +767,7 @@ #define MSR_IA32_TSC_DEADLINE 0x000006E0 +#define MSR_IA32_PKRS 0x000006E1 #define MSR_TSX_FORCE_ABORT 0x0000010F diff --git a/arch/x86/include/asm/pkeys_common.h b/arch/x86/include/asm/pkeys_common.h index 8a3c6d2e6a8a..079a8be9686b 100644 --- a/arch/x86/include/asm/pkeys_common.h +++ b/arch/x86/include/asm/pkeys_common.h @@ -2,14 +2,18 @@ #ifndef _ASM_X86_PKEYS_COMMON_H #define _ASM_X86_PKEYS_COMMON_H +#define PKR_RW_BIT 0x0 #define PKR_AD_BIT 0x1 #define PKR_WD_BIT 0x2 #define PKR_BITS_PER_PKEY 2 +#define PKS_NUM_PKEYS 16 + #define PKR_PKEY_SHIFT(pkey) (pkey * PKR_BITS_PER_PKEY) #define PKR_PKEY_MASK(pkey) (((1 << PKR_BITS_PER_PKEY) - 1) << PKR_PKEY_SHIFT(pkey)) #define PKR_AD_KEY(pkey) (PKR_AD_BIT << PKR_PKEY_SHIFT(pkey)) #define PKR_WD_KEY(pkey) (PKR_WD_BIT << PKR_PKEY_SHIFT(pkey)) +#define PKR_VALUE(pkey, val) (val << PKR_PKEY_SHIFT(pkey)) #endif /*_ASM_X86_PKEYS_COMMON_H */ diff --git a/arch/x86/include/asm/pks.h b/arch/x86/include/asm/pks.h new file mode 100644 index 000000000000..5d7067ada8fb --- /dev/null +++ b/arch/x86/include/asm/pks.h @@ -0,0 +1,15 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _ASM_X86_PKS_H +#define _ASM_X86_PKS_H + +#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS + +void setup_pks(void); + +#else /* !CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */ + +static inline void setup_pks(void) { } + +#endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */ + +#endif /* _ASM_X86_PKS_H */ diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index 64b805bd6a54..abb32bd32f53 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -59,6 +59,7 @@ #include #include #include +#include #include "cpu.h" @@ -1590,6 +1591,7 @@ static void identify_cpu(struct cpuinfo_x86 *c) x86_init_rdrand(c); setup_pku(c); + setup_pks(); /* * Clear/Set all flags overridden by options, need do it diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c index 75437aa8fc56..fbffbced81b5 100644 --- a/arch/x86/mm/pkeys.c +++ b/arch/x86/mm/pkeys.c @@ -211,3 +211,78 @@ u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags) return pk_reg; } + +#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS + +static DEFINE_PER_CPU(u32, pkrs_cache); +u32 __read_mostly pkrs_init_value; + +/* + * write_pkrs() optimizes MSR writes by maintaining a per cpu cache which can + * be checked quickly. + * + * It should also be noted that the underlying WRMSR(MSR_IA32_PKRS) is not + * serializing but still maintains ordering properties similar to WRPKRU. + * The current SDM section on PKRS needs updating but should be the same as + * that of WRPKRU. So to quote from the WRPKRU text: + * + * WRPKRU will never execute transiently. Memory accesses + * affected by PKRU register will not execute (even transiently) + * until all prior executions of WRPKRU have completed execution + * and updated the PKRU register. 
+ */ +void write_pkrs(u32 new_pkrs) +{ + u32 *pkrs; + + if (!static_cpu_has(X86_FEATURE_PKS)) + return; + + pkrs = get_cpu_ptr(&pkrs_cache); + if (*pkrs != new_pkrs) { + *pkrs = new_pkrs; + wrmsrl(MSR_IA32_PKRS, new_pkrs); + } + put_cpu_ptr(pkrs); +} + +/* + * Build a default PKRS value from the array specified by consumers + */ +static int __init create_initial_pkrs_value(void) +{ + /* All users get Access Disabled unless changed below */ + u8 consumer_defaults[PKS_NUM_PKEYS] = { + [0 ... PKS_NUM_PKEYS-1] = PKR_AD_BIT + }; + int i; + + consumer_defaults[PKS_KEY_DEFAULT] = PKR_RW_BIT; + + /* Ensure the number of consumers is less than the number of keys */ + BUILD_BUG_ON(PKS_KEY_NR_CONSUMERS > PKS_NUM_PKEYS); + + pkrs_init_value = 0; + + /* Fill the defaults for the consumers */ + for (i = 0; i < PKS_NUM_PKEYS; i++) + pkrs_init_value |= PKR_VALUE(i, consumer_defaults[i]); + + return 0; +} +early_initcall(create_initial_pkrs_value); + +/* + * PKS is independent of PKU and either or both may be supported on a CPU. + * Configure PKS if the CPU supports the feature. + */ +void setup_pks(void) +{ + if (!cpu_feature_enabled(X86_FEATURE_PKS)) + return; + + write_pkrs(pkrs_init_value); + cr4_set_bits(X86_CR4_PKS); +} + +#endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */ diff --git a/include/linux/pkeys.h b/include/linux/pkeys.h index 6beb26b7151d..580238388f0c 100644 --- a/include/linux/pkeys.h +++ b/include/linux/pkeys.h @@ -46,4 +46,12 @@ static inline bool arch_pkeys_enabled(void) #endif /* ! CONFIG_ARCH_HAS_PKEYS */ +#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS +enum pks_pkey_consumers { + PKS_KEY_DEFAULT = 0, /* Must be 0 for default PTE values */ + PKS_KEY_NR_CONSUMERS +}; +extern u32 pkrs_init_value; +#endif + #endif /* _LINUX_PKEYS_H */ From patchwork Wed Aug 4 04:32:19 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ira Weiny X-Patchwork-Id: 12417773 Received: from mga05.intel.com (mga05.intel.com [192.55.52.43]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id C37683481 for ; Wed, 4 Aug 2021 04:32:39 +0000 (UTC) X-IronPort-AV: E=McAfee;i="6200,9189,10065"; a="299433057" X-IronPort-AV: E=Sophos;i="5.84,293,1620716400"; d="scan'208";a="299433057" Received: from fmsmga003.fm.intel.com ([10.253.24.29]) by fmsmga105.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Aug 2021 21:32:36 -0700 X-IronPort-AV: E=Sophos;i="5.84,293,1620716400"; d="scan'208";a="511702676" Received: from iweiny-desk2.sc.intel.com (HELO localhost) ([10.3.52.147]) by fmsmga003-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Aug 2021 21:32:36 -0700 From: ira.weiny@intel.com To: Dave Hansen , Dan Williams Cc: Ira Weiny , Sean Christopherson , Thomas Gleixner , Ingo Molnar , Borislav Petkov , Peter Zijlstra , Andy Lutomirski , "H. 
Peter Anvin" , Fenghua Yu , Rick Edgecombe , x86@kernel.org, linux-kernel@vger.kernel.org, nvdimm@lists.linux.dev, linux-mm@kvack.org Subject: [PATCH V7 06/18] x86/fault: Adjust WARN_ON for PKey fault Date: Tue, 3 Aug 2021 21:32:19 -0700 Message-Id: <20210804043231.2655537-7-ira.weiny@intel.com> X-Mailer: git-send-email 2.28.0.rc0.12.gb6a658bd00c9 In-Reply-To: <20210804043231.2655537-1-ira.weiny@intel.com> References: <20210804043231.2655537-1-ira.weiny@intel.com> Precedence: bulk X-Mailing-List: nvdimm@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: Ira Weiny Previously if a Protection key fault occurred it indicated something very wrong because user page mappings are not supposed to be in the kernel address space. Now PKey faults may happen on kernel mappings if the feature is enabled. Remove the warning in the fault path and allow the oops to occur without extra debugging if PKS is enabled. Cc: Sean Christopherson Cc: Dan Williams Signed-off-by: Ira Weiny --- arch/x86/mm/fault.c | 12 ++++++++---- 1 file changed, 8 insertions(+), 4 deletions(-) diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index b2eefdefc108..e133c0ed72a0 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -1141,11 +1141,15 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code, unsigned long address) { /* - * Protection keys exceptions only happen on user pages. We - * have no user pages in the kernel portion of the address - * space, so do not expect them here. + * X86_PF_PK (Protection key exceptions) may occur on kernel addresses + * when PKS (PKeys Supervisor) is enabled. + * + * However, if PKS is not enabled WARN if this exception is seen + * because there are no user pages in the kernel portion of the address + * space. */ - WARN_ON_ONCE(hw_error_code & X86_PF_PK); + WARN_ON_ONCE(!cpu_feature_enabled(X86_FEATURE_PKS) && + (hw_error_code & X86_PF_PK)); #ifdef CONFIG_X86_32 /* From patchwork Wed Aug 4 04:32:20 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ira Weiny X-Patchwork-Id: 12417783 Received: from mga05.intel.com (mga05.intel.com [192.55.52.43]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id F34E43486 for ; Wed, 4 Aug 2021 04:32:39 +0000 (UTC) X-IronPort-AV: E=McAfee;i="6200,9189,10065"; a="299433059" X-IronPort-AV: E=Sophos;i="5.84,293,1620716400"; d="scan'208";a="299433059" Received: from fmsmga003.fm.intel.com ([10.253.24.29]) by fmsmga105.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Aug 2021 21:32:36 -0700 X-IronPort-AV: E=Sophos;i="5.84,293,1620716400"; d="scan'208";a="511702681" Received: from iweiny-desk2.sc.intel.com (HELO localhost) ([10.3.52.147]) by fmsmga003-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Aug 2021 21:32:36 -0700 From: ira.weiny@intel.com To: Dave Hansen , Dan Williams Cc: Ira Weiny , Fenghua Yu , Thomas Gleixner , Ingo Molnar , Borislav Petkov , Peter Zijlstra , Andy Lutomirski , "H. 
Peter Anvin" , Rick Edgecombe , x86@kernel.org, linux-kernel@vger.kernel.org, nvdimm@lists.linux.dev, linux-mm@kvack.org Subject: [PATCH V7 07/18] x86/pks: Preserve the PKRS MSR on context switch Date: Tue, 3 Aug 2021 21:32:20 -0700 Message-Id: <20210804043231.2655537-8-ira.weiny@intel.com> X-Mailer: git-send-email 2.28.0.rc0.12.gb6a658bd00c9 In-Reply-To: <20210804043231.2655537-1-ira.weiny@intel.com> References: <20210804043231.2655537-1-ira.weiny@intel.com> Precedence: bulk X-Mailing-List: nvdimm@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: Ira Weiny The PKRS MSR is defined as a per-logical-processor register. This isolates memory access by logical CPU. Unfortunately, the MSR is not managed by XSAVE. Therefore, tasks must save/restore the MSR value on context switch. Define a saved PKRS value in the task struct. Initialize all tasks with the INIT_PKRS_VALUE and call pkrs_write_current() to set the MSR to the saved task value on schedule in. Co-developed-by: Fenghua Yu Signed-off-by: Fenghua Yu Signed-off-by: Ira Weiny --- Changes for V7 Move definitions from asm/processor.h to asm/pks.h s/INIT_PKRS_VALUE/pkrs_init_value Change pks_init_task()/pks_sched_in() to functions s/pks_sched_in/pks_write_current to be used more generically later in the series --- arch/x86/include/asm/pks.h | 4 ++++ arch/x86/include/asm/processor.h | 19 ++++++++++++++++++- arch/x86/kernel/process.c | 3 +++ arch/x86/kernel/process_64.c | 3 +++ arch/x86/mm/pkeys.c | 16 ++++++++++++++++ 5 files changed, 44 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/pks.h b/arch/x86/include/asm/pks.h index 5d7067ada8fb..e7727086cec2 100644 --- a/arch/x86/include/asm/pks.h +++ b/arch/x86/include/asm/pks.h @@ -5,10 +5,14 @@ #ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS void setup_pks(void); +void pkrs_write_current(void); +void pks_init_task(struct task_struct *task); #else /* !CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */ static inline void setup_pks(void) { } +static inline void pkrs_write_current(void) { } +static inline void pks_init_task(struct task_struct *task) { } #endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */ diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h index f3020c54e2cb..a6cb7d152c62 100644 --- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@ -502,6 +502,12 @@ struct thread_struct { unsigned long cr2; unsigned long trap_nr; unsigned long error_code; + +#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS + /* Saved Protection key register for supervisor mappings */ + u32 saved_pkrs; +#endif + #ifdef CONFIG_VM86 /* Virtual 86 mode info */ struct vm86 *vm86; @@ -768,7 +774,18 @@ static inline void spin_lock_prefetch(const void *x) #define KSTK_ESP(task) (task_pt_regs(task)->sp) #else -#define INIT_THREAD { } + +#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS +/* + * Early task gets full permissions, the restrictive value is set in + * pks_init_task() + */ +#define INIT_THREAD { \ + .saved_pkrs = 0, \ +} +#else +#define INIT_THREAD { } +#endif extern unsigned long KSTK_ESP(struct task_struct *task); diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c index 1d9463e3096b..c792ac5f33a2 100644 --- a/arch/x86/kernel/process.c +++ b/arch/x86/kernel/process.c @@ -43,6 +43,7 @@ #include #include #include +#include #include "process.h" @@ -223,6 +224,8 @@ void flush_thread(void) fpu_flush_thread(); pkru_flush_thread(); + + pks_init_task(tsk); } void disable_TSC(void) diff --git a/arch/x86/kernel/process_64.c 
b/arch/x86/kernel/process_64.c index ec0d836a13b1..8bd1f039e5bf 100644 --- a/arch/x86/kernel/process_64.c +++ b/arch/x86/kernel/process_64.c @@ -59,6 +59,7 @@ /* Not included via unistd.h */ #include #endif +#include #include "process.h" @@ -658,6 +659,8 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p) /* Load the Intel cache allocation PQR MSR. */ resctrl_sched_in(); + pkrs_write_current(); + return prev_p; } diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c index fbffbced81b5..eca01dc8d7ac 100644 --- a/arch/x86/mm/pkeys.c +++ b/arch/x86/mm/pkeys.c @@ -284,5 +284,21 @@ void setup_pks(void) write_pkrs(pkrs_init_value); cr4_set_bits(X86_CR4_PKS); } +; + +/* + * PKRS is only temporarily changed during specific code paths. Only a + * preemption during these windows away from the default value would + * require updating the MSR. write_pkrs() handles this optimization. + */ +void pkrs_write_current(void) +{ + write_pkrs(current->thread.saved_pkrs); +} + +void pks_init_task(struct task_struct *task) +{ + task->thread.saved_pkrs = pkrs_init_value; +} #endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */ From patchwork Wed Aug 4 04:32:21 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ira Weiny X-Patchwork-Id: 12417777 Received: from mga05.intel.com (mga05.intel.com [192.55.52.43]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 16AC8348A for ; Wed, 4 Aug 2021 04:32:40 +0000 (UTC) X-IronPort-AV: E=McAfee;i="6200,9189,10065"; a="299433061" X-IronPort-AV: E=Sophos;i="5.84,293,1620716400"; d="scan'208";a="299433061" Received: from fmsmga003.fm.intel.com ([10.253.24.29]) by fmsmga105.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Aug 2021 21:32:37 -0700 X-IronPort-AV: E=Sophos;i="5.84,293,1620716400"; d="scan'208";a="511702686" Received: from iweiny-desk2.sc.intel.com (HELO localhost) ([10.3.52.147]) by fmsmga003-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Aug 2021 21:32:36 -0700 From: ira.weiny@intel.com To: Dave Hansen , Dan Williams Cc: Ira Weiny , Peter Zijlstra , Thomas Gleixner , Andy Lutomirski , Ingo Molnar , Borislav Petkov , "H. Peter Anvin" , Fenghua Yu , Rick Edgecombe , x86@kernel.org, linux-kernel@vger.kernel.org, nvdimm@lists.linux.dev, linux-mm@kvack.org Subject: [PATCH V7 08/18] x86/entry: Preserve PKRS MSR across exceptions Date: Tue, 3 Aug 2021 21:32:21 -0700 Message-Id: <20210804043231.2655537-9-ira.weiny@intel.com> X-Mailer: git-send-email 2.28.0.rc0.12.gb6a658bd00c9 In-Reply-To: <20210804043231.2655537-1-ira.weiny@intel.com> References: <20210804043231.2655537-1-ira.weiny@intel.com> Precedence: bulk X-Mailing-List: nvdimm@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: Ira Weiny The PKRS MSR is not managed by XSAVE. It is preserved through a context switch but this support leaves exception handling code open to memory accesses during exceptions. 
2 possible places for preserving this state were considered, irqentry_state_t or pt_regs.[1] pt_regs was much more complicated and was potentially fraught with unintended consequences.[2] However, Andy came up with a way to hide additional values on the stack which could be accessed as "extended_pt_regs".[3] This method allows for; any place which has struct pt_regs can get access to the extra information; no extra information is added to irq_state; and pt_regs is left intact for compatibility with outside tools like BPF. To simplify, the assembly code only adds space on the stack. The setting or use of any needed values are left to the C code. While some entry points may not use this space it is still added where ever pt_regs is passed to the C code for consistency. Each nested exception gets another copy of this extended space allowing for any number of levels of exception handling. In the assembly, a macro is defined to allow a central place to add space for other uses should the need arise. Finally export pkrs_{save|restore}_irq to the common code to allow it to preserve the current task's PKRS in the new extended pt_regs if enabled. Peter, Thomas, Andy, Dave, and Dan all suggested parts of the patch or aided in the development of the patch.. [1] https://lore.kernel.org/lkml/CALCETrVe1i5JdyzD_BcctxQJn+ZE3T38EFPgjxN1F577M36g+w@mail.gmail.com/ [2] https://lore.kernel.org/lkml/874kpxx4jf.fsf@nanos.tec.linutronix.de/#t [3] https://lore.kernel.org/lkml/CALCETrUHwZPic89oExMMe-WyDY8-O3W68NcZvse3=PGW+iW5=w@mail.gmail.com/ Cc: Dave Hansen Cc: Dan Williams Suggested-by: Dave Hansen Suggested-by: Dan Williams Suggested-by: Peter Zijlstra Suggested-by: Thomas Gleixner Suggested-by: Andy Lutomirski Signed-off-by: Ira Weiny --- Changes for V7: Rebased to 5.14 entry code declare write_pkrs() in pks.h s/INIT_PKRS_VALUE/pkrs_init_value Remove unnecessary INIT_PKRS_VALUE def s/pkrs_save_set_irq/pkrs_save_irq/ The inital value for exceptions is best managed completely within the pkey code. --- arch/x86/entry/calling.h | 26 +++++++++++++ arch/x86/entry/common.c | 54 ++++++++++++++++++++++++++ arch/x86/entry/entry_64.S | 22 ++++++----- arch/x86/entry/entry_64_compat.S | 6 +-- arch/x86/include/asm/pks.h | 18 +++++++++ arch/x86/include/asm/processor-flags.h | 2 + arch/x86/kernel/head_64.S | 7 ++-- arch/x86/mm/fault.c | 3 ++ include/linux/pkeys.h | 11 +++++- kernel/entry/common.c | 14 ++++++- 10 files changed, 143 insertions(+), 20 deletions(-) diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h index a4c061fb7c6e..a2f94677c3fd 100644 --- a/arch/x86/entry/calling.h +++ b/arch/x86/entry/calling.h @@ -63,6 +63,32 @@ For 32-bit we have the following conventions - kernel is built with * for assembly code: */ +/* + * __call_ext_ptregs - Helper macro to call into C with extended pt_regs + * @cfunc: C function to be called + * + * This will ensure that extended_ptregs is added and removed as needed during + * a call into C code. 
+ */ +.macro __call_ext_ptregs cfunc annotate_retpoline_safe:req +#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS + /* add space for extended_pt_regs */ + subq $EXTENDED_PT_REGS_SIZE, %rsp +#endif + .if \annotate_retpoline_safe == 1 + ANNOTATE_RETPOLINE_SAFE + .endif + call \cfunc +#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS + /* remove space for extended_pt_regs */ + addq $EXTENDED_PT_REGS_SIZE, %rsp +#endif +.endm + +.macro call_ext_ptregs cfunc + __call_ext_ptregs \cfunc, annotate_retpoline_safe=0 +.endm + .macro PUSH_REGS rdx=%rdx rax=%rax save_ret=0 .if \save_ret pushq %rsi /* pt_regs->si */ diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c index 6c2826417b33..a0d1d5519dba 100644 --- a/arch/x86/entry/common.c +++ b/arch/x86/entry/common.c @@ -19,6 +19,7 @@ #include #include #include +#include #ifdef CONFIG_XEN_PV #include @@ -34,6 +35,7 @@ #include #include #include +#include #ifdef CONFIG_X86_64 @@ -252,6 +254,56 @@ SYSCALL_DEFINE0(ni_syscall) return -ENOSYS; } +#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS + +void show_extended_regs_oops(struct pt_regs *regs, unsigned long error_code) +{ + struct extended_pt_regs *ept_regs = extended_pt_regs(regs); + + if (cpu_feature_enabled(X86_FEATURE_PKS) && (error_code & X86_PF_PK)) + pr_alert("PKRS: 0x%x\n", ept_regs->thread_pkrs); +} + +/* + * PKRS is a per-logical-processor MSR which overlays additional protection for + * pages which have been mapped with a protection key. + * + * Context switches save the MSR in the task struct thus taking that value to + * other processors if necessary. + * + * To protect against exceptions having access to this memory save the current + * thread value and set the PKRS value to be used during the exception. + */ +void pkrs_save_irq(struct pt_regs *regs) +{ + struct extended_pt_regs *ept_regs; + + BUILD_BUG_ON(sizeof(struct extended_pt_regs) + != EXTENDED_PT_REGS_SIZE + + sizeof(struct pt_regs)); + + if (!cpu_feature_enabled(X86_FEATURE_PKS)) + return; + + ept_regs = extended_pt_regs(regs); + ept_regs->thread_pkrs = current->thread.saved_pkrs; + write_pkrs(pkrs_init_value); +} + +void pkrs_restore_irq(struct pt_regs *regs) +{ + struct extended_pt_regs *ept_regs; + + if (!cpu_feature_enabled(X86_FEATURE_PKS)) + return; + + ept_regs = extended_pt_regs(regs); + write_pkrs(ept_regs->thread_pkrs); + current->thread.saved_pkrs = ept_regs->thread_pkrs; +} + +#endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */ + #ifdef CONFIG_XEN_PV #ifndef CONFIG_PREEMPTION /* @@ -309,6 +361,8 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs) inhcall = get_and_clear_inhcall(); if (inhcall && !WARN_ON_ONCE(state.exit_rcu)) { + /* Normally called by irqentry_exit, restore pkrs here */ + pkrs_restore_irq(regs); irqentry_exit_cond_resched(); instrumentation_end(); restore_inhcall(inhcall); diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index e38a4cf795d9..1c390975a3de 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -332,7 +332,7 @@ SYM_CODE_END(ret_from_fork) movq $-1, ORIG_RAX(%rsp) /* no syscall to restart */ .endif - call \cfunc + call_ext_ptregs \cfunc jmp error_return .endm @@ -435,7 +435,7 @@ SYM_CODE_START(\asmsym) movq %rsp, %rdi /* pt_regs pointer */ - call \cfunc + call_ext_ptregs \cfunc jmp paranoid_exit @@ -496,7 +496,7 @@ SYM_CODE_START(\asmsym) * stack. 
*/ movq %rsp, %rdi /* pt_regs pointer */ - call vc_switch_off_ist + call_ext_ptregs vc_switch_off_ist movq %rax, %rsp /* Switch to new stack */ UNWIND_HINT_REGS @@ -507,7 +507,7 @@ SYM_CODE_START(\asmsym) movq %rsp, %rdi /* pt_regs pointer */ - call kernel_\cfunc + call_ext_ptregs kernel_\cfunc /* * No need to switch back to the IST stack. The current stack is either @@ -542,7 +542,7 @@ SYM_CODE_START(\asmsym) movq %rsp, %rdi /* pt_regs pointer into first argument */ movq ORIG_RAX(%rsp), %rsi /* get error code into 2nd argument*/ movq $-1, ORIG_RAX(%rsp) /* no syscall to restart */ - call \cfunc + call_ext_ptregs \cfunc jmp paranoid_exit @@ -781,7 +781,7 @@ SYM_CODE_START_LOCAL(exc_xen_hypervisor_callback) movq %rdi, %rsp /* we don't return, adjust the stack frame */ UNWIND_HINT_REGS - call xen_pv_evtchn_do_upcall + call_ext_ptregs xen_pv_evtchn_do_upcall jmp error_return SYM_CODE_END(exc_xen_hypervisor_callback) @@ -987,7 +987,7 @@ SYM_CODE_START_LOCAL(error_entry) /* Put us onto the real thread stack. */ popq %r12 /* save return addr in %12 */ movq %rsp, %rdi /* arg0 = pt_regs pointer */ - call sync_regs + call_ext_ptregs sync_regs movq %rax, %rsp /* switch stack */ ENCODE_FRAME_POINTER pushq %r12 @@ -1042,7 +1042,7 @@ SYM_CODE_START_LOCAL(error_entry) * as if we faulted immediately after IRET. */ mov %rsp, %rdi - call fixup_bad_iret + call_ext_ptregs fixup_bad_iret mov %rax, %rsp jmp .Lerror_entry_from_usermode_after_swapgs SYM_CODE_END(error_entry) @@ -1148,7 +1148,7 @@ SYM_CODE_START(asm_exc_nmi) movq %rsp, %rdi movq $-1, %rsi - call exc_nmi + call_ext_ptregs exc_nmi /* * Return back to user mode. We must *not* do the normal exit @@ -1184,6 +1184,8 @@ SYM_CODE_START(asm_exc_nmi) * +---------------------------------------------------------+ * | pt_regs | * +---------------------------------------------------------+ + * | (Optionally) extended_pt_regs | + * +---------------------------------------------------------+ * * The "original" frame is used by hardware. 
Before re-enabling * NMIs, we need to be done with it, and we need to leave enough @@ -1360,7 +1362,7 @@ end_repeat_nmi: movq %rsp, %rdi movq $-1, %rsi - call exc_nmi + call_ext_ptregs exc_nmi /* Always restore stashed CR3 value (see paranoid_entry) */ RESTORE_CR3 scratch_reg=%r15 save_reg=%r14 diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S index 0051cf5c792d..53254d29d5c7 100644 --- a/arch/x86/entry/entry_64_compat.S +++ b/arch/x86/entry/entry_64_compat.S @@ -136,7 +136,7 @@ SYM_INNER_LABEL(entry_SYSENTER_compat_after_hwframe, SYM_L_GLOBAL) .Lsysenter_flags_fixed: movq %rsp, %rdi - call do_SYSENTER_32 + call_ext_ptregs do_SYSENTER_32 /* XEN PV guests always use IRET path */ ALTERNATIVE "testl %eax, %eax; jz swapgs_restore_regs_and_return_to_usermode", \ "jmp swapgs_restore_regs_and_return_to_usermode", X86_FEATURE_XENPV @@ -253,7 +253,7 @@ SYM_INNER_LABEL(entry_SYSCALL_compat_after_hwframe, SYM_L_GLOBAL) UNWIND_HINT_REGS movq %rsp, %rdi - call do_fast_syscall_32 + call_ext_ptregs do_fast_syscall_32 /* XEN PV guests always use IRET path */ ALTERNATIVE "testl %eax, %eax; jz swapgs_restore_regs_and_return_to_usermode", \ "jmp swapgs_restore_regs_and_return_to_usermode", X86_FEATURE_XENPV @@ -410,6 +410,6 @@ SYM_CODE_START(entry_INT80_compat) cld movq %rsp, %rdi - call do_int80_syscall_32 + call_ext_ptregs do_int80_syscall_32 jmp swapgs_restore_regs_and_return_to_usermode SYM_CODE_END(entry_INT80_compat) diff --git a/arch/x86/include/asm/pks.h b/arch/x86/include/asm/pks.h index e7727086cec2..76960ec71b4b 100644 --- a/arch/x86/include/asm/pks.h +++ b/arch/x86/include/asm/pks.h @@ -4,15 +4,33 @@ #ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS +struct extended_pt_regs { + u32 thread_pkrs; + /* Keep stack 8 byte aligned */ + u32 pad; + struct pt_regs pt_regs; +}; + void setup_pks(void); void pkrs_write_current(void); void pks_init_task(struct task_struct *task); +void write_pkrs(u32 new_pkrs); + +static inline struct extended_pt_regs *extended_pt_regs(struct pt_regs *regs) +{ + return container_of(regs, struct extended_pt_regs, pt_regs); +} + +void show_extended_regs_oops(struct pt_regs *regs, unsigned long error_code); #else /* !CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */ static inline void setup_pks(void) { } static inline void pkrs_write_current(void) { } static inline void pks_init_task(struct task_struct *task) { } +static inline void write_pkrs(u32 new_pkrs) { } +static inline void show_extended_regs_oops(struct pt_regs *regs, + unsigned long error_code) { } #endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */ diff --git a/arch/x86/include/asm/processor-flags.h b/arch/x86/include/asm/processor-flags.h index 02c2cbda4a74..4a41fc4cf028 100644 --- a/arch/x86/include/asm/processor-flags.h +++ b/arch/x86/include/asm/processor-flags.h @@ -53,4 +53,6 @@ # define X86_CR3_PTI_PCID_USER_BIT 11 #endif +#define EXTENDED_PT_REGS_SIZE 8 + #endif /* _ASM_X86_PROCESSOR_FLAGS_H */ diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S index d8b3ebd2bb85..90e76178b6b4 100644 --- a/arch/x86/kernel/head_64.S +++ b/arch/x86/kernel/head_64.S @@ -319,8 +319,7 @@ SYM_CODE_START_NOALIGN(vc_boot_ghcb) movq %rsp, %rdi movq ORIG_RAX(%rsp), %rsi movq initial_vc_handler(%rip), %rax - ANNOTATE_RETPOLINE_SAFE - call *%rax + __call_ext_ptregs *%rax, annotate_retpoline_safe=1 /* Unwind pt_regs */ POP_REGS @@ -397,7 +396,7 @@ SYM_CODE_START_LOCAL(early_idt_handler_common) UNWIND_HINT_REGS movq %rsp,%rdi /* RDI = pt_regs; RSI is already trapnr */ - call do_early_exception + call_ext_ptregs 
do_early_exception decl early_recursion_flag(%rip) jmp restore_regs_and_return_to_kernel @@ -421,7 +420,7 @@ SYM_CODE_START_NOALIGN(vc_no_ghcb) /* Call C handler */ movq %rsp, %rdi movq ORIG_RAX(%rsp), %rsi - call do_vc_no_ghcb + call_ext_ptregs do_vc_no_ghcb /* Unwind pt_regs */ POP_REGS diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index e133c0ed72a0..a4ce7cef0260 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -32,6 +32,7 @@ #include /* VMALLOC_START, ... */ #include /* kvm_handle_async_pf */ #include /* fixup_vdso_exception() */ +#include #define CREATE_TRACE_POINTS #include @@ -547,6 +548,8 @@ show_fault_oops(struct pt_regs *regs, unsigned long error_code, unsigned long ad (error_code & X86_PF_PK) ? "protection keys violation" : "permissions violation"); + show_extended_regs_oops(regs, error_code); + if (!(error_code & X86_PF_USER) && user_mode(regs)) { struct desc_ptr idt, gdt; u16 ldtr, tr; diff --git a/include/linux/pkeys.h b/include/linux/pkeys.h index 580238388f0c..76eb19a37942 100644 --- a/include/linux/pkeys.h +++ b/include/linux/pkeys.h @@ -52,6 +52,15 @@ enum pks_pkey_consumers { PKS_KEY_NR_CONSUMERS }; extern u32 pkrs_init_value; -#endif + +void pkrs_save_irq(struct pt_regs *regs); +void pkrs_restore_irq(struct pt_regs *regs); + +#else /* !CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */ + +static inline void pkrs_save_irq(struct pt_regs *regs) { } +static inline void pkrs_restore_irq(struct pt_regs *regs) { } + +#endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */ #endif /* _LINUX_PKEYS_H */ diff --git a/kernel/entry/common.c b/kernel/entry/common.c index bf16395b9e13..aa0b1e8dd742 100644 --- a/kernel/entry/common.c +++ b/kernel/entry/common.c @@ -6,6 +6,7 @@ #include #include #include +#include #include "common.h" @@ -364,7 +365,7 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs) instrumentation_end(); ret.exit_rcu = true; - return ret; + goto done; } /* @@ -379,6 +380,8 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs) trace_hardirqs_off_finish(); instrumentation_end(); +done: + pkrs_save_irq(regs); return ret; } @@ -404,7 +407,12 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state) /* Check whether this returns to user mode */ if (user_mode(regs)) { irqentry_exit_to_user_mode(regs); - } else if (!regs_irqs_disabled(regs)) { + return; + } + + pkrs_restore_irq(regs); + + if (!regs_irqs_disabled(regs)) { /* * If RCU was not watching on entry this needs to be done * carefully and needs the same ordering of lockdep/tracing @@ -458,11 +466,13 @@ irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs) ftrace_nmi_enter(); instrumentation_end(); + pkrs_save_irq(regs); return irq_state; } void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t irq_state) { + pkrs_restore_irq(regs); instrumentation_begin(); ftrace_nmi_exit(); if (irq_state.lockdep) { From patchwork Wed Aug 4 04:32:22 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ira Weiny X-Patchwork-Id: 12417779 Received: from mga05.intel.com (mga05.intel.com [192.55.52.43]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id CAA84348C for ; Wed, 4 Aug 2021 04:32:40 +0000 (UTC) X-IronPort-AV: E=McAfee;i="6200,9189,10065"; a="299433063" X-IronPort-AV: E=Sophos;i="5.84,293,1620716400"; d="scan'208";a="299433063" Received: from fmsmga003.fm.intel.com ([10.253.24.29]) by 
fmsmga105.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Aug 2021 21:32:37 -0700 X-IronPort-AV: E=Sophos;i="5.84,293,1620716400"; d="scan'208";a="511702690" Received: from iweiny-desk2.sc.intel.com (HELO localhost) ([10.3.52.147]) by fmsmga003-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Aug 2021 21:32:37 -0700 From: ira.weiny@intel.com To: Dave Hansen , Dan Williams Cc: Fenghua Yu , Sean Christopherson , Ira Weiny , Thomas Gleixner , Ingo Molnar , Borislav Petkov , Peter Zijlstra , Andy Lutomirski , "H. Peter Anvin" , Rick Edgecombe , x86@kernel.org, linux-kernel@vger.kernel.org, nvdimm@lists.linux.dev, linux-mm@kvack.org Subject: [PATCH V7 09/18] x86/pks: Add PKS kernel API Date: Tue, 3 Aug 2021 21:32:22 -0700 Message-Id: <20210804043231.2655537-10-ira.weiny@intel.com> X-Mailer: git-send-email 2.28.0.rc0.12.gb6a658bd00c9 In-Reply-To: <20210804043231.2655537-1-ira.weiny@intel.com> References: <20210804043231.2655537-1-ira.weiny@intel.com> Precedence: bulk X-Mailing-List: nvdimm@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: Fenghua Yu PKS allows kernel users to define domains of page mappings which have additional protections beyond the paging protections. Violating those protections creates a fault which by default will oops. Each kernel user defines a PKS_KEY_* key value which identifies a PKS domain to be used exclusively by that kernel user. This API is then used to control which pages are part of that domain and the current threads protection of those pages. 4 new functions are added pks_enabled(), pks_mk_noaccess(), pks_mk_readonly(), and pks_mk_readwrite(). 2 new macros are added PAGE_KERNEL_PKEY(key) and _PAGE_PKEY(pkey). Update the protection key documentation to cover pkeys on supervisor pages. This includes how to reserve a key and set the default permissions on that key. Cc: Sean Christopherson Cc: Dan Williams Co-developed-by: Ira Weiny Signed-off-by: Ira Weiny Signed-off-by: Fenghua Yu --- Change for V7 Add pks_enabled() to allow users more dynamic choice on PKS use. Update documentation for key allocation Remove dynamic key allocation, keys will be allocated statically now. Add expected CPU generation support to documentation --- Documentation/core-api/protection-keys.rst | 121 ++++++++++++++++++--- arch/x86/include/asm/pgtable_types.h | 12 ++ arch/x86/mm/pkeys.c | 66 +++++++++++ include/linux/pgtable.h | 4 + include/linux/pkeys.h | 14 +++ 5 files changed, 199 insertions(+), 18 deletions(-) diff --git a/Documentation/core-api/protection-keys.rst b/Documentation/core-api/protection-keys.rst index ec575e72d0b2..6420a60666fc 100644 --- a/Documentation/core-api/protection-keys.rst +++ b/Documentation/core-api/protection-keys.rst @@ -4,25 +4,30 @@ Memory Protection Keys ====================== -Memory Protection Keys for Userspace (PKU aka PKEYs) is a feature -which is found on Intel's Skylake (and later) "Scalable Processor" -Server CPUs. It will be available in future non-server Intel parts -and future AMD processors. +Memory Protection Keys provide a mechanism for enforcing page-based +protections, but without requiring modification of the page tables +when an application changes protection domains. -For anyone wishing to test or use this feature, it is available in -Amazon's EC2 C5 instances and is known to work there using an Ubuntu -17.04 image. +PKeys Userspace (PKU) is a feature which is found on Intel's Skylake "Scalable +Processor" Server CPUs and later. 
And it will be available in future +non-server Intel parts and future AMD processors. -Memory Protection Keys provides a mechanism for enforcing page-based -protections, but without requiring modification of the page tables -when an application changes protection domains. It works by -dedicating 4 previously ignored bits in each page table entry to a -"protection key", giving 16 possible keys. +Protection Keys for Supervisor pages (PKS) is available in the SDM since May +2020. + +pkeys work by dedicating 4 previously Reserved bits in each page table entry to +a "protection key", giving 16 possible keys. User and Supervisor pages are +treated separately. + +Protections for each page are controlled with per-CPU registers for each type +of page User and Supervisor. Each of these 32-bit register stores two separate +bits (Access Disable and Write Disable) for each key. -There is also a new user-accessible register (PKRU) with two separate -bits (Access Disable and Write Disable) for each key. Being a CPU -register, PKRU is inherently thread-local, potentially giving each -thread a different set of protections from every other thread. +For Userspace the register is user-accessible (rdpkru/wrpkru). For +Supervisor, the register (MSR_IA32_PKRS) is accessible only to the kernel. + +Being a CPU register, pkeys are inherently thread-local, potentially giving +each thread an independent set of protections from every other thread. There are two new instructions (RDPKRU/WRPKRU) for reading and writing to the new register. The feature is only available in 64-bit mode, @@ -30,8 +35,11 @@ even though there is theoretically space in the PAE PTEs. These permissions are enforced on data access only and have no effect on instruction fetches. -Syscalls -======== +For kernel space rdmsr/wrmsr are used to access the kernel MSRs. + + +Syscalls for user space keys +============================ There are 3 system calls which directly interact with pkeys:: @@ -98,3 +106,80 @@ with a read():: The kernel will send a SIGSEGV in both cases, but si_code will be set to SEGV_PKERR when violating protection keys versus SEGV_ACCERR when the plain mprotect() permissions are violated. + + +Kernel API for PKS support +========================== + +Similar to user space pkeys, supervisor pkeys allow additional protections to +be defined for a supervisor mappings. Unlike user space pkeys, violations of +these protections result in a a kernel oops. + +Supervisor Memory Protection Keys (PKS) is a feature which is found on Intel's +Sapphire Rapids (and later) "Scalable Processor" Server CPUs. It will also be +available in future non-server Intel parts. + +Also qemu has some support as well: https://www.qemu.org/2021/04/30/qemu-6-0-0/ + +Kernel users intending to use PKS support should depend on +ARCH_HAS_SUPERVISOR_PKEYS, and add their config to ARCH_ENABLE_SUPERVISOR_PKEYS +to turn on this support within the core. + +Users reserve a key value by adding an entry to the enum pks_pkey_consumers and +defining the initial protections in the consumer_defaults[] array. + +For example to configure a key for 'MY_FEATURE' with a default of Write +Disabled. + +:: + + enum pks_pkey_consumers + { + PKS_KEY_DEFAULT, + PKS_KEY_MY_FEATURE, + PKS_KEY_NR_CONSUMERS + } + + ... + consumer_defaults[PKS_KEY_DEFAULT] = 0; + consumer_defaults[PKS_KEY_MY_FEATURE] = PKR_DISABLE_WRITE; + ... + +The following interface is used to manipulate the 'protection domain' defined +by a pkey within the kernel. 
Setting a pkey value in a supervisor PTE adds +this additional protection to the page. + +:: + + #define PAGE_KERNEL_PKEY(pkey) + #define _PAGE_KEY(pkey) + bool pks_enabled(void); + void pks_mk_noaccess(int pkey); + void pks_mk_readonly(int pkey); + void pks_mk_readwrite(int pkey); + +pks_enabled() allows users to know if PKS is configured and available on the +current running system. + +Kernel users must set the pkey in the page table entries for the mappings they +want to protect. This can be done with PAGE_KERNEL_PKEY() or _PAGE_KEY(). + +The pks_mk*() family of calls allow indinvidual threads to change the +protections for the domain identified by the pkey parameter. 3 states are +available: pks_mk_noaccess(), pks_mk_readonly(), and pks_mk_readwrite() which +set the access to none, read, and read/write respectively. + +The interface sets (Access Disabled (AD=1)) for all keys not in use. + +It should be noted that the underlying WRMSR(MSR_IA32_PKRS) is not serializing +but still maintains ordering properties similar to WRPKRU. Thus it is safe to +immediately use a mapping when the pks_mk*() functions return. + +Older versions of the SDM on PKRS may be wrong with regard to this +serialization. The text should be the same as that of WRPKRU. From the WRPKRU +text: + + WRPKRU will never execute transiently. Memory accesses + affected by PKRU register will not execute (even transiently) + until all prior executions of WRPKRU have completed execution + and updated the PKRU register. diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h index 40497a9020c6..3f866e730456 100644 --- a/arch/x86/include/asm/pgtable_types.h +++ b/arch/x86/include/asm/pgtable_types.h @@ -71,6 +71,12 @@ _PAGE_PKEY_BIT2 | \ _PAGE_PKEY_BIT3) +#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS +#define _PAGE_PKEY(pkey) (_AT(pteval_t, pkey) << _PAGE_BIT_PKEY_BIT0) +#else +#define _PAGE_PKEY(pkey) (_AT(pteval_t, 0)) +#endif + #if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE) #define _PAGE_KNL_ERRATUM_MASK (_PAGE_DIRTY | _PAGE_ACCESSED) #else @@ -226,6 +232,12 @@ enum page_cache_mode { #define PAGE_KERNEL_IO __pgprot_mask(__PAGE_KERNEL_IO) #define PAGE_KERNEL_IO_NOCACHE __pgprot_mask(__PAGE_KERNEL_IO_NOCACHE) +#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS +#define PAGE_KERNEL_PKEY(pkey) __pgprot_mask(__PAGE_KERNEL | _PAGE_PKEY(pkey)) +#else +#define PAGE_KERNEL_PKEY(pkey) PAGE_KERNEL +#endif + #endif /* __ASSEMBLY__ */ /* xwr */ diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c index eca01dc8d7ac..146a665d1bf3 100644 --- a/arch/x86/mm/pkeys.c +++ b/arch/x86/mm/pkeys.c @@ -3,6 +3,9 @@ * Intel Memory Protection Keys management * Copyright (c) 2015, Intel Corporation. */ +#undef pr_fmt +#define pr_fmt(fmt) "x86/pkeys: " fmt + #include /* debugfs_create_u32() */ #include /* mm_struct, vma, etc... */ #include /* PKEY_* */ @@ -10,6 +13,7 @@ #include /* boot_cpu_has, ... */ #include /* vma_pkey() */ +#include int __execute_only_pkey(struct mm_struct *mm) { @@ -301,4 +305,66 @@ void pks_init_task(struct task_struct *task) task->thread.saved_pkrs = pkrs_init_value; } +bool pks_enabled(void) +{ + return cpu_feature_enabled(X86_FEATURE_PKS); +} + +/* + * Do not call this directly, see pks_mk*() below. 
+ * + * @pkey: Key for the domain to change + * @protection: protection bits to be used + * + * Protection utilizes the same protection bits specified for User pkeys + * PKEY_DISABLE_ACCESS + * PKEY_DISABLE_WRITE + * + */ +static inline void pks_update_protection(int pkey, unsigned long protection) +{ + current->thread.saved_pkrs = update_pkey_val(current->thread.saved_pkrs, + pkey, protection); + pkrs_write_current(); +} + +/** + * pks_mk_noaccess() - Disable all access to the domain + * @pkey the pkey for which the access should change. + * + * Disable all access to the domain specified by pkey. This is not a global + * update and only affects the current running thread. + */ +void pks_mk_noaccess(int pkey) +{ + pks_update_protection(pkey, PKEY_DISABLE_ACCESS); +} +EXPORT_SYMBOL_GPL(pks_mk_noaccess); + +/** + * pks_mk_readonly() - Make the domain Read only + * @pkey the pkey for which the access should change. + * + * Allow read access to the domain specified by pkey. This is not a global + * update and only affects the current running thread. + */ +void pks_mk_readonly(int pkey) +{ + pks_update_protection(pkey, PKEY_DISABLE_WRITE); +} +EXPORT_SYMBOL_GPL(pks_mk_readonly); + +/** + * pks_mk_readwrite() - Make the domain Read/Write + * @pkey the pkey for which the access should change. + * + * Allow all access, read and write, to the domain specified by pkey. This is + * not a global update and only affects the current running thread. + */ +void pks_mk_readwrite(int pkey) +{ + pks_update_protection(pkey, 0); +} +EXPORT_SYMBOL_GPL(pks_mk_readwrite); + #endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */ diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index d147480cdefc..eba1a9f9d124 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -1526,6 +1526,10 @@ static inline bool arch_has_pfn_modify_check(void) # define PAGE_KERNEL_EXEC PAGE_KERNEL #endif +#ifndef PAGE_KERNEL_PKEY +#define PAGE_KERNEL_PKEY(pkey) PAGE_KERNEL +#endif + /* * Page Table Modification bits for pgtbl_mod_mask. 
* diff --git a/include/linux/pkeys.h b/include/linux/pkeys.h index 76eb19a37942..b9919ed4d300 100644 --- a/include/linux/pkeys.h +++ b/include/linux/pkeys.h @@ -56,11 +56,25 @@ extern u32 pkrs_init_value; void pkrs_save_irq(struct pt_regs *regs); void pkrs_restore_irq(struct pt_regs *regs); +bool pks_enabled(void); +void pks_mk_noaccess(int pkey); +void pks_mk_readonly(int pkey); +void pks_mk_readwrite(int pkey); + #else /* !CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */ static inline void pkrs_save_irq(struct pt_regs *regs) { } static inline void pkrs_restore_irq(struct pt_regs *regs) { } +static inline bool pks_enabled(void) +{ + return false; +} + +static inline void pks_mk_noaccess(int pkey) {} +static inline void pks_mk_readonly(int pkey) {} +static inline void pks_mk_readwrite(int pkey) {} + #endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */ #endif /* _LINUX_PKEYS_H */ From patchwork Wed Aug 4 04:32:23 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ira Weiny X-Patchwork-Id: 12417781 Received: from mga05.intel.com (mga05.intel.com [192.55.52.43]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 843833492 for ; Wed, 4 Aug 2021 04:32:41 +0000 (UTC) X-IronPort-AV: E=McAfee;i="6200,9189,10065"; a="299433065" X-IronPort-AV: E=Sophos;i="5.84,293,1620716400"; d="scan'208";a="299433065" Received: from fmsmga003.fm.intel.com ([10.253.24.29]) by fmsmga105.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Aug 2021 21:32:37 -0700 X-IronPort-AV: E=Sophos;i="5.84,293,1620716400"; d="scan'208";a="511702693" Received: from iweiny-desk2.sc.intel.com (HELO localhost) ([10.3.52.147]) by fmsmga003-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Aug 2021 21:32:37 -0700 From: ira.weiny@intel.com To: Dave Hansen , Dan Williams Cc: Ira Weiny , Thomas Gleixner , Ingo Molnar , Borislav Petkov , Peter Zijlstra , Andy Lutomirski , "H. Peter Anvin" , Fenghua Yu , Rick Edgecombe , x86@kernel.org, linux-kernel@vger.kernel.org, nvdimm@lists.linux.dev, linux-mm@kvack.org Subject: [PATCH V7 10/18] x86/pks: Introduce pks_abandon_protections() Date: Tue, 3 Aug 2021 21:32:23 -0700 Message-Id: <20210804043231.2655537-11-ira.weiny@intel.com> X-Mailer: git-send-email 2.28.0.rc0.12.gb6a658bd00c9 In-Reply-To: <20210804043231.2655537-1-ira.weiny@intel.com> References: <20210804043231.2655537-1-ira.weiny@intel.com> Precedence: bulk X-Mailing-List: nvdimm@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: Ira Weiny Unanticipated access to PMEM by otherwise working kernel code would be very disruptive to otherwise working systems. Such access could be through valid uses such as kmap(). In this use case PMEM protections will require the ability to abandon all protections of a pkey on all threads system wide. Introduce pks_abandon_protections() to allow a user to mask off protection values. This will filter through all the threads of the system as they are scheduled in and in the immediate case override the value should running threads PKS fault. Update pkrs_write_current(), pks_init_task(), and pkrs_{save|restore}_irq() to account for pkrs_pkey_allowed_mask. Add handle_abandoned_pks_value() to adjust any already running threads which may fault on an abandoned pkey. 
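For illustration only (not something this patch adds), a consumer of the API might use this relief valve roughly as sketched below. PKS_KEY_MY_FEATURE is the placeholder key name from the documentation and the my_feature_*() helpers are hypothetical::

    /* Hypothetical sketch of a consumer using the abandon relief valve */
    static bool my_feature_abandoned;

    static void my_feature_write_begin(void)
    {
            if (!pks_enabled() || my_feature_abandoned)
                    return;
            /* Open a write window for the current thread only */
            pks_mk_readwrite(PKS_KEY_MY_FEATURE);
    }

    static void my_feature_write_end(void)
    {
            if (!pks_enabled() || my_feature_abandoned)
                    return;
            /* Close the window again; other threads were never affected */
            pks_mk_readonly(PKS_KEY_MY_FEATURE);
    }

    static void my_feature_give_up_protections(void)
    {
            /*
             * Relief valve: drop the key's protections system wide.
             * There is no way to re-arm them afterwards.
             */
            my_feature_abandoned = true;
            pks_abandon_protections(PKS_KEY_MY_FEATURE);
    }
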
Signed-off-by: Ira Weiny --- Changes for V7 New patch Significant internal review from Dan Williams and Rick Edgecombe --- Documentation/core-api/protection-keys.rst | 7 +++- arch/x86/entry/common.c | 6 ++- arch/x86/include/asm/pks.h | 5 +++ arch/x86/mm/fault.c | 24 ++++++----- arch/x86/mm/pkeys.c | 49 ++++++++++++++++++++++ include/linux/pkeys.h | 2 + 6 files changed, 80 insertions(+), 13 deletions(-) diff --git a/Documentation/core-api/protection-keys.rst b/Documentation/core-api/protection-keys.rst index 6420a60666fc..202088634fa3 100644 --- a/Documentation/core-api/protection-keys.rst +++ b/Documentation/core-api/protection-keys.rst @@ -157,6 +157,7 @@ this additional protection to the page. void pks_mk_noaccess(int pkey); void pks_mk_readonly(int pkey); void pks_mk_readwrite(int pkey); + void pks_abandon_protections(int pkey); pks_enabled() allows users to know if PKS is configured and available on the current running system. @@ -169,7 +170,11 @@ protections for the domain identified by the pkey parameter. 3 states are available: pks_mk_noaccess(), pks_mk_readonly(), and pks_mk_readwrite() which set the access to none, read, and read/write respectively. -The interface sets (Access Disabled (AD=1)) for all keys not in use. +The interface sets Access Disabled for all keys not in use. The +pks_abandon_protections() call reduces the protections for the specified key to +be fully accessible thus abandoning the protections of the key. There is no +way to reverse this. As such pks_abandon_protections() is intended to provide +a 'relief valve' if the PKS protections should prove too restrictive. It should be noted that the underlying WRMSR(MSR_IA32_PKRS) is not serializing but still maintains ordering properties similar to WRPKRU. Thus it is safe to diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c index a0d1d5519dba..717091910ebc 100644 --- a/arch/x86/entry/common.c +++ b/arch/x86/entry/common.c @@ -37,6 +37,8 @@ #include #include +extern u32 pkrs_pkey_allowed_mask; + #ifdef CONFIG_X86_64 static __always_inline bool do_syscall_x64(struct pt_regs *regs, int nr) @@ -287,7 +289,7 @@ void pkrs_save_irq(struct pt_regs *regs) ept_regs = extended_pt_regs(regs); ept_regs->thread_pkrs = current->thread.saved_pkrs; - write_pkrs(pkrs_init_value); + write_pkrs(pkrs_init_value & pkrs_pkey_allowed_mask); } void pkrs_restore_irq(struct pt_regs *regs) @@ -298,8 +300,8 @@ void pkrs_restore_irq(struct pt_regs *regs) return; ept_regs = extended_pt_regs(regs); - write_pkrs(ept_regs->thread_pkrs); current->thread.saved_pkrs = ept_regs->thread_pkrs; + pkrs_write_current(); } #endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */ diff --git a/arch/x86/include/asm/pks.h b/arch/x86/include/asm/pks.h index 76960ec71b4b..ed293ef4509e 100644 --- a/arch/x86/include/asm/pks.h +++ b/arch/x86/include/asm/pks.h @@ -22,6 +22,7 @@ static inline struct extended_pt_regs *extended_pt_regs(struct pt_regs *regs) } void show_extended_regs_oops(struct pt_regs *regs, unsigned long error_code); +int handle_abandoned_pks_value(struct pt_regs *regs); #else /* !CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */ @@ -31,6 +32,10 @@ static inline void pks_init_task(struct task_struct *task) { } static inline void write_pkrs(u32 new_pkrs) { } static inline void show_extended_regs_oops(struct pt_regs *regs, unsigned long error_code) { } +static inline int handle_abandoned_pks_value(struct pt_regs *regs) +{ + return 0; +} #endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */ diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 
a4ce7cef0260..bf3353d8e011 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -1143,16 +1143,20 @@ static void do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code, unsigned long address) { - /* - * X86_PF_PK (Protection key exceptions) may occur on kernel addresses - * when PKS (PKeys Supervisor) is enabled. - * - * However, if PKS is not enabled WARN if this exception is seen - * because there are no user pages in the kernel portion of the address - * space. - */ - WARN_ON_ONCE(!cpu_feature_enabled(X86_FEATURE_PKS) && - (hw_error_code & X86_PF_PK)); + if (hw_error_code & X86_PF_PK) { + /* + * X86_PF_PK (Protection key exceptions) may occur on kernel + * addresses when PKS (PKeys Supervisor) is enabled. + * + * However, if PKS is not enabled WARN if this exception is + * seen because there are no user pages in the kernel portion + * of the address space. + */ + WARN_ON_ONCE(!cpu_feature_enabled(X86_FEATURE_PKS)); + + if (handle_abandoned_pks_value(regs)) + return; + } #ifdef CONFIG_X86_32 /* diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c index 146a665d1bf3..56d37840186b 100644 --- a/arch/x86/mm/pkeys.c +++ b/arch/x86/mm/pkeys.c @@ -221,6 +221,26 @@ u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags) static DEFINE_PER_CPU(u32, pkrs_cache); u32 __read_mostly pkrs_init_value; +/* + * Define a mask of pkeys which are allowed, ie have not been abandoned. + * Default is all keys are allowed. + */ +#define PKRS_ALLOWED_MASK_DEFAULT 0xffffffff +u32 __read_mostly pkrs_pkey_allowed_mask; + +int handle_abandoned_pks_value(struct pt_regs *regs) +{ + struct extended_pt_regs *ept_regs; + u32 old; + + ept_regs = extended_pt_regs(regs); + old = ept_regs->thread_pkrs; + ept_regs->thread_pkrs &= pkrs_pkey_allowed_mask; + + /* If something changed retry the fault */ + return (ept_regs->thread_pkrs != old); +} + /* * write_pkrs() optimizes MSR writes by maintaining a per cpu cache which can * be checked quickly. @@ -267,6 +287,7 @@ static int __init create_initial_pkrs_value(void) BUILD_BUG_ON(PKS_KEY_NR_CONSUMERS > PKS_NUM_PKEYS); pkrs_init_value = 0; + pkrs_pkey_allowed_mask = PKRS_ALLOWED_MASK_DEFAULT; /* Fill the defaults for the consumers */ for (i = 0; i < PKS_NUM_PKEYS; i++) @@ -297,12 +318,14 @@ void setup_pks(void) */ void pkrs_write_current(void) { + current->thread.saved_pkrs &= pkrs_pkey_allowed_mask; write_pkrs(current->thread.saved_pkrs); } void pks_init_task(struct task_struct *task) { task->thread.saved_pkrs = pkrs_init_value; + task->thread.saved_pkrs &= pkrs_pkey_allowed_mask; } bool pks_enabled(void) @@ -367,4 +390,30 @@ void pks_mk_readwrite(int pkey) } EXPORT_SYMBOL_GPL(pks_mk_readwrite); +/** + * pks_abandon_protections() - Force readwrite (no protections) for the + * specified pkey + * @pkey The pkey to force + * + * Force the value of the pkey to readwrite (no protections) thus abandoning + * protections for this key. This is a permanent change and has no + * coresponding reversal function. + * + * This also updates the current running thread. + */ +void pks_abandon_protections(int pkey) +{ + u32 old_mask, new_mask; + + do { + old_mask = READ_ONCE(pkrs_pkey_allowed_mask); + new_mask = update_pkey_val(old_mask, pkey, 0); + } while (unlikely( + cmpxchg(&pkrs_pkey_allowed_mask, old_mask, new_mask) != old_mask)); + + /* Update the local thread as well. 
*/ + pks_update_protection(pkey, 0); +} +EXPORT_SYMBOL_GPL(pks_abandon_protections); + #endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */ diff --git a/include/linux/pkeys.h b/include/linux/pkeys.h index b9919ed4d300..4d22ccd971fc 100644 --- a/include/linux/pkeys.h +++ b/include/linux/pkeys.h @@ -60,6 +60,7 @@ bool pks_enabled(void); void pks_mk_noaccess(int pkey); void pks_mk_readonly(int pkey); void pks_mk_readwrite(int pkey); +void pks_abandon_protections(int pkey); #else /* !CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */ @@ -74,6 +75,7 @@ static inline bool pks_enabled(void) static inline void pks_mk_noaccess(int pkey) {} static inline void pks_mk_readonly(int pkey) {} static inline void pks_mk_readwrite(int pkey) {} +static inline void pks_abandon_protections(int pkey) {} #endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */ From patchwork Wed Aug 4 04:32:24 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ira Weiny X-Patchwork-Id: 12417787 Received: from mga05.intel.com (mga05.intel.com [192.55.52.43]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 1A0113496 for ; Wed, 4 Aug 2021 04:32:42 +0000 (UTC) X-IronPort-AV: E=McAfee;i="6200,9189,10065"; a="299433068" X-IronPort-AV: E=Sophos;i="5.84,293,1620716400"; d="scan'208";a="299433068" Received: from fmsmga003.fm.intel.com ([10.253.24.29]) by fmsmga105.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Aug 2021 21:32:37 -0700 X-IronPort-AV: E=Sophos;i="5.84,293,1620716400"; d="scan'208";a="511702698" Received: from iweiny-desk2.sc.intel.com (HELO localhost) ([10.3.52.147]) by fmsmga003-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Aug 2021 21:32:37 -0700 From: ira.weiny@intel.com To: Dave Hansen , Dan Williams Cc: Ira Weiny , Fenghua Yu , Thomas Gleixner , Ingo Molnar , Borislav Petkov , Peter Zijlstra , Andy Lutomirski , "H. Peter Anvin" , Rick Edgecombe , x86@kernel.org, linux-kernel@vger.kernel.org, nvdimm@lists.linux.dev, linux-mm@kvack.org Subject: [PATCH V7 11/18] x86/pks: Add PKS Test code Date: Tue, 3 Aug 2021 21:32:24 -0700 Message-Id: <20210804043231.2655537-12-ira.weiny@intel.com> X-Mailer: git-send-email 2.28.0.rc0.12.gb6a658bd00c9 In-Reply-To: <20210804043231.2655537-1-ira.weiny@intel.com> References: <20210804043231.2655537-1-ira.weiny@intel.com> Precedence: bulk X-Mailing-List: nvdimm@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: Ira Weiny The core PKS functionality provides an interface for kernel users to reserve a key and set up page tables with that key. Define test code under CONFIG_PKS_TEST which exercises the core functionality of PKS via a debugfs entry. Basic checks can be triggered on boot with a kernel command line option while both basic and preemption checks can be triggered with separate debugfs values. [See the comment at the top of pks_test.c for details on the values which can be used and what tests they run.] CONFIG_PKS_TEST enables ARCH_ENABLE_SUPERVISOR_PKEYS but can not co-exist with any GENERAL_PKS_USER. This is because the test code iterates through all the keys and is pretty much not useful in general kernel configs. General PKS users should select GENERAL_PKS_USER which will disable PKS_TEST as well as enable ARCH_ENABLE_SUPERVISOR_PKEYS. A PKey is not reserved for this test and the test code defines its own PKS_KEY_PKS_TEST. 
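For reference, the core access pattern the test exercises with that key looks roughly like the sketch below; the helper functions are illustrative and are not part of the test code itself::

    /* Illustrative only: map a page with a pkey and toggle thread access */
    static void *map_protected_page(void)
    {
            return __vmalloc_node_range(PAGE_SIZE, 1, VMALLOC_START, VMALLOC_END,
                                        GFP_KERNEL,
                                        PAGE_KERNEL_PKEY(PKS_KEY_PKS_TEST),
                                        0, NUMA_NO_NODE,
                                        __builtin_return_address(0));
    }

    static void exercise_key(void *ptr)
    {
            pks_mk_noaccess(PKS_KEY_PKS_TEST);
            /* any dereference of ptr by this thread would now fault */

            pks_mk_readwrite(PKS_KEY_PKS_TEST);
            memset(ptr, 0, PAGE_SIZE);      /* allowed again */
    }
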
To test pks_abandon_protections() each test requires the thread to be re-run after resetting the abandoned mask value. Do this by allowing the test code access to the abandoned mask value. Co-developed-by: Fenghua Yu Signed-off-by: Fenghua Yu Signed-off-by: Ira Weiny --- Changes for V7 Add testing for pks_abandon_protections() Adjust pkrs_init_value Adjust for new defines Clean up comments Adjust test for static allocation of pkeys Use lookup_address() instead of follow_pte() follow_pte only works on IO and raw PFN mappings, use lookup_address() instead. lookup_address() is constrained to architectures which support it. --- Documentation/core-api/protection-keys.rst | 6 +- arch/x86/include/asm/pks.h | 18 + arch/x86/mm/fault.c | 8 + arch/x86/mm/pkeys.c | 18 +- lib/Kconfig.debug | 13 + lib/Makefile | 3 + lib/pks/Makefile | 3 + lib/pks/pks_test.c | 864 +++++++++++++++++++++ mm/Kconfig | 5 +- tools/testing/selftests/x86/Makefile | 2 +- tools/testing/selftests/x86/test_pks.c | 157 ++++ 11 files changed, 1092 insertions(+), 5 deletions(-) create mode 100644 lib/pks/Makefile create mode 100644 lib/pks/pks_test.c create mode 100644 tools/testing/selftests/x86/test_pks.c diff --git a/Documentation/core-api/protection-keys.rst b/Documentation/core-api/protection-keys.rst index 202088634fa3..8cf7eaaed3e5 100644 --- a/Documentation/core-api/protection-keys.rst +++ b/Documentation/core-api/protection-keys.rst @@ -122,8 +122,8 @@ available in future non-server Intel parts. Also qemu has some support as well: https://www.qemu.org/2021/04/30/qemu-6-0-0/ Kernel users intending to use PKS support should depend on -ARCH_HAS_SUPERVISOR_PKEYS, and add their config to ARCH_ENABLE_SUPERVISOR_PKEYS -to turn on this support within the core. +ARCH_HAS_SUPERVISOR_PKEYS, and add their config to GENERAL_PKS_USER to turn on +this support within the core. Users reserve a key value by adding an entry to the enum pks_pkey_consumers and defining the initial protections in the consumer_defaults[] array. @@ -188,3 +188,5 @@ text: affected by PKRU register will not execute (even transiently) until all prior executions of WRPKRU have completed execution and updated the PKRU register. + +Example code can be found in lib/pks/pks_test.c diff --git a/arch/x86/include/asm/pks.h b/arch/x86/include/asm/pks.h index ed293ef4509e..e28413cc410d 100644 --- a/arch/x86/include/asm/pks.h +++ b/arch/x86/include/asm/pks.h @@ -39,4 +39,22 @@ static inline int handle_abandoned_pks_value(struct pt_regs *regs) #endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */ + +#ifdef CONFIG_PKS_TEST + +#define __static_or_pks_test + +bool pks_test_callback(struct pt_regs *regs); + +#else /* !CONFIG_PKS_TEST */ + +#define __static_or_pks_test static + +static inline bool pks_test_callback(struct pt_regs *regs) +{ + return false; +} + +#endif /* CONFIG_PKS_TEST */ + #endif /* _ASM_X86_PKS_H */ diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index bf3353d8e011..3780ed0f9597 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -1154,6 +1154,14 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code, */ WARN_ON_ONCE(!cpu_feature_enabled(X86_FEATURE_PKS)); + /* + * If a protection key exception occurs it could be because a PKS test + * is running. If so, pks_test_callback() will clear the protection + * mechanism and return true to indicate the fault was handled. 
+ */ + if (pks_test_callback(regs)) + return; + if (handle_abandoned_pks_value(regs)) return; } diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c index 56d37840186b..c7358662ec07 100644 --- a/arch/x86/mm/pkeys.c +++ b/arch/x86/mm/pkeys.c @@ -218,7 +218,7 @@ u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags) #ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS -static DEFINE_PER_CPU(u32, pkrs_cache); +__static_or_pks_test DEFINE_PER_CPU(u32, pkrs_cache); u32 __read_mostly pkrs_init_value; /* @@ -289,6 +289,22 @@ static int __init create_initial_pkrs_value(void) pkrs_init_value = 0; pkrs_pkey_allowed_mask = PKRS_ALLOWED_MASK_DEFAULT; + /* + * PKS_TEST is mutually exclusive to any real users of PKS so define a PKS_TEST + * appropriate value. + * + * NOTE: PKey 0 must still be fully permissive for normal kernel mappings to + * work correctly. + */ + if (IS_ENABLED(CONFIG_PKS_TEST)) { + pkrs_init_value = (PKR_AD_KEY(1) | PKR_AD_KEY(2) | PKR_AD_KEY(3) | \ + PKR_AD_KEY(4) | PKR_AD_KEY(5) | PKR_AD_KEY(6) | \ + PKR_AD_KEY(7) | PKR_AD_KEY(8) | PKR_AD_KEY(9) | \ + PKR_AD_KEY(10) | PKR_AD_KEY(11) | PKR_AD_KEY(12) | \ + PKR_AD_KEY(13) | PKR_AD_KEY(14) | PKR_AD_KEY(15)); + return 0; + } + /* Fill the defaults for the consumers */ for (i = 0; i < PKS_NUM_PKEYS; i++) pkrs_init_value |= PKR_VALUE(i, consumer_defaults[i]); diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index 831212722924..28579084649d 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -2650,6 +2650,19 @@ config HYPERV_TESTING help Select this option to enable Hyper-V vmbus testing. +config PKS_TEST + bool "PKey (S)upervisor testing" + depends on ARCH_HAS_SUPERVISOR_PKEYS + depends on !GENERAL_PKS_USER + help + Select this option to enable testing of PKS core software and + hardware. The PKS core provides a mechanism to allocate keys as well + as maintain the protection settings across context switches. + + Answer N if you don't know what supervisor keys are. + + If unsure, say N. + endmenu # "Kernel Testing and Coverage" source "Documentation/Kconfig" diff --git a/lib/Makefile b/lib/Makefile index 5efd1b435a37..fc31f2d6d8e4 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -360,3 +360,6 @@ obj-$(CONFIG_CMDLINE_KUNIT_TEST) += cmdline_kunit.o obj-$(CONFIG_SLUB_KUNIT_TEST) += slub_kunit.o obj-$(CONFIG_GENERIC_LIB_DEVMEM_IS_ALLOWED) += devmem_is_allowed.o + +# PKS test +obj-y += pks/ diff --git a/lib/pks/Makefile b/lib/pks/Makefile new file mode 100644 index 000000000000..9daccba4f7c4 --- /dev/null +++ b/lib/pks/Makefile @@ -0,0 +1,3 @@ +# SPDX-License-Identifier: GPL-2.0 + +obj-$(CONFIG_PKS_TEST) += pks_test.o diff --git a/lib/pks/pks_test.c b/lib/pks/pks_test.c new file mode 100644 index 000000000000..679edd487360 --- /dev/null +++ b/lib/pks/pks_test.c @@ -0,0 +1,864 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright(c) 2020 Intel Corporation. All rights reserved. + * + * Implement PKS testing + * Access to run this test can be with a command line parameter + * ("pks-test-on-boot") or more detailed tests can be triggered through: + * + * /sys/kernel/debug/x86/run_pks + * + * debugfs controls are: + * + * '0' -- Run access tests with a single pkey + * '1' -- Set up the pkey register with no access for the pkey allocated to + * this fd + * '2' -- Check that the pkey register updated in '1' is still the same. + * (To be used after a forced context switch.) + * '3' -- Allocate all pkeys possible and run tests on each pkey allocated. + * DEFAULT when run at boot. 
+ * '4' -- The same as '0' with additional kernel debugging + * '5' -- The same as '3' with additional kernel debugging + * '6' -- Test abandoning a pkey + * '9' -- Set up and fault on a PKS protected page. This will crash the + * kernel and requires the option to be specified 2 times in a row. + * + * Closing the fd will cleanup and release the pkey, to exercise context + * switch testing a user space program is provided in: + * + * .../tools/testing/selftests/x86/test_pks.c + * + */ + +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include /* for struct pt_regs */ +#include +#include +#include + +/* + * PKS testing uses all pkeys but define 1 key to use for some tests. Any + * value from [1-PKS_NUM_PKEYS) will work. + */ +#define PKS_KEY_PKS_TEST 1 +#define PKS_TEST_MEM_SIZE (PAGE_SIZE) + +#define RUN_ALLOCATE "0" +#define ARM_CTX_SWITCH "1" +#define CHECK_CTX_SWITCH "2" +#define RUN_ALLOCATE_ALL "3" +#define RUN_ALLOCATE_DEBUG "4" +#define RUN_ALLOCATE_ALL_DEBUG "5" +#define RUN_DISABLE_TEST "6" +#define RUN_CRASH_TEST "9" + +/* The testing needs some knowledge of the internals */ +DECLARE_PER_CPU(u32, pkrs_cache); +extern u32 pkrs_pkey_allowed_mask; + +/* + * run_on_boot default '= false' which checkpatch complains about initializing; + * so don't + */ +static bool run_on_boot; +static struct dentry *pks_test_dentry; +static bool run_9; + +/* + * The following globals must be protected for brief periods while the fault + * handler checks/updates them. + */ +static DEFINE_SPINLOCK(test_lock); +static int test_armed_key; +static unsigned long prev_cnt; +static unsigned long fault_cnt; + +struct pks_test_ctx { + bool pass; + bool pks_cpu_enabled; + bool debug; + int pkey; + char data[64]; +}; +static struct pks_test_ctx *test_exception_ctx; + +static bool check_pkey_val(u32 pk_reg, int pkey, u32 expected) +{ + pk_reg = (pk_reg & PKR_PKEY_MASK(pkey)) >> PKR_PKEY_SHIFT(pkey); + return (pk_reg == expected); +} + +/* + * Check if the register @pkey value matches @expected value + * + * Both the cached and actual MSR must match. + */ +static bool check_pkrs(int pkey, u32 expected) +{ + bool ret = true; + u64 pkrs; + u32 *tmp_cache; + + tmp_cache = get_cpu_ptr(&pkrs_cache); + if (!check_pkey_val(*tmp_cache, pkey, expected)) + ret = false; + put_cpu_ptr(tmp_cache); + + rdmsrl(MSR_IA32_PKRS, pkrs); + if (!check_pkey_val(pkrs, pkey, expected)) + ret = false; + + return ret; +} + +static void check_exception(u32 thread_pkrs) +{ + /* Check the thread saved state */ + if (!check_pkey_val(thread_pkrs, test_armed_key, PKEY_DISABLE_WRITE)) { + pr_err(" FAIL: checking ept_regs->thread_pkrs\n"); + test_exception_ctx->pass = false; + } + + /* Check the exception state */ + if (!check_pkrs(test_armed_key, PKEY_DISABLE_ACCESS)) { + pr_err(" FAIL: PKRS cache and MSR\n"); + test_exception_ctx->pass = false; + } + + /* + * Ensure an update can occur during exception without affecting the + * interrupted thread. The interrupted thread is checked after + * exception... 
+ */ + pks_mk_readwrite(test_armed_key); + if (!check_pkrs(test_armed_key, 0)) { + pr_err(" FAIL: exception did not change register to 0\n"); + test_exception_ctx->pass = false; + } + pks_mk_noaccess(test_armed_key); + if (!check_pkrs(test_armed_key, PKEY_DISABLE_ACCESS)) { + pr_err(" FAIL: exception did not change register to 0x%x\n", + PKEY_DISABLE_ACCESS); + test_exception_ctx->pass = false; + } +} + +/** + * pks_test_callback() is exported so that the fault handler can detect + * and report back status of intentional faults. + * + * NOTE: It clears the protection key from the page such that the fault handler + * will not re-trigger. + */ +bool pks_test_callback(struct pt_regs *regs) +{ + struct extended_pt_regs *ept_regs = extended_pt_regs(regs); + bool armed = (test_armed_key != 0); + + if (test_exception_ctx) { + check_exception(ept_regs->thread_pkrs); + /* + * Stop this check directly within the exception because the + * fault handler clean up code will call again while checking + * the PMD entry and there is no need to check this again. + */ + test_exception_ctx = NULL; + } + + if (armed) { + /* Enable read and write to stop faults */ + ept_regs->thread_pkrs = update_pkey_val(ept_regs->thread_pkrs, + test_armed_key, 0); + fault_cnt++; + } + + return armed; +} + +static bool exception_caught(void) +{ + bool ret = (fault_cnt != prev_cnt); + + prev_cnt = fault_cnt; + return ret; +} + +static void report_pkey_settings(void *info) +{ + u8 pkey; + unsigned long long msr = 0; + unsigned int cpu = smp_processor_id(); + struct pks_test_ctx *ctx = info; + + rdmsrl(MSR_IA32_PKRS, msr); + + pr_info("for CPU %d : 0x%llx\n", cpu, msr); + + if (ctx->debug) { + for (pkey = 0; pkey < PKS_NUM_PKEYS; pkey++) { + int ad, wd; + + ad = (msr >> PKR_PKEY_SHIFT(pkey)) & PKEY_DISABLE_ACCESS; + wd = (msr >> PKR_PKEY_SHIFT(pkey)) & PKEY_DISABLE_WRITE; + pr_info(" %u: A:%d W:%d\n", pkey, ad, wd); + } + } +} + +enum pks_access_mode { + PKS_TEST_NO_ACCESS, + PKS_TEST_RDWR, + PKS_TEST_RDONLY +}; + +static char *get_mode_str(enum pks_access_mode mode) +{ + switch (mode) { + case PKS_TEST_NO_ACCESS: + return "No Access"; + case PKS_TEST_RDWR: + return "Read Write"; + case PKS_TEST_RDONLY: + return "Read Only"; + default: + pr_err("BUG in test invalid mode\n"); + break; + } + + return ""; +} + +struct pks_access_test { + enum pks_access_mode mode; + bool write; + bool exception; +}; + +static struct pks_access_test pkey_test_ary[] = { + /* disable both */ + { PKS_TEST_NO_ACCESS, true, true }, + { PKS_TEST_NO_ACCESS, false, true }, + + /* enable both */ + { PKS_TEST_RDWR, true, false }, + { PKS_TEST_RDWR, false, false }, + + /* enable read only */ + { PKS_TEST_RDONLY, true, true }, + { PKS_TEST_RDONLY, false, false }, +}; + +static int test_it(struct pks_test_ctx *ctx, struct pks_access_test *test, + void *ptr, bool forced_sched) +{ + bool exception; + int ret = 0; + + spin_lock(&test_lock); + WRITE_ONCE(test_armed_key, ctx->pkey); + + if (test->write) + memcpy(ptr, ctx->data, 8); + else + memcpy(ctx->data, ptr, 8); + + exception = exception_caught(); + + WRITE_ONCE(test_armed_key, 0); + spin_unlock(&test_lock); + + /* + * After a forced schedule the allowed mask should be applied on + * sched_in and therefore no exception should ever be seen. + */ + if (forced_sched && exception) { + pr_err("pkey test FAILED: mode %s; write %s; exception %s != %s; sched TRUE\n", + get_mode_str(test->mode), + test->write ? "TRUE" : "FALSE", + test->exception ? "TRUE" : "FALSE", + exception ? 
"TRUE" : "FALSE"); + ret = -EFAULT; + } else if (test->exception != exception) { + pr_err("pkey test FAILED: mode %s; write %s; exception %s != %s\n", + get_mode_str(test->mode), + test->write ? "TRUE" : "FALSE", + test->exception ? "TRUE" : "FALSE", + exception ? "TRUE" : "FALSE"); + ret = -EFAULT; + } + + return ret; +} + +static int run_access_test(struct pks_test_ctx *ctx, + struct pks_access_test *test, + void *ptr, + bool forced_sched) +{ + switch (test->mode) { + case PKS_TEST_NO_ACCESS: + pks_mk_noaccess(ctx->pkey); + break; + case PKS_TEST_RDWR: + pks_mk_readwrite(ctx->pkey); + break; + case PKS_TEST_RDONLY: + pks_mk_readonly(ctx->pkey); + break; + default: + pr_err("BUG in test invalid mode\n"); + break; + } + + return test_it(ctx, test, ptr, forced_sched); +} + +static void *alloc_test_page(int pkey) +{ + return __vmalloc_node_range(PKS_TEST_MEM_SIZE, 1, VMALLOC_START, VMALLOC_END, + GFP_KERNEL, PAGE_KERNEL_PKEY(pkey), 0, + NUMA_NO_NODE, __builtin_return_address(0)); +} + +static void test_mem_access(struct pks_test_ctx *ctx) +{ + int i, rc; + u8 pkey; + void *ptr = NULL; + pte_t *ptep = NULL; + unsigned int level; + + ptr = alloc_test_page(ctx->pkey); + if (!ptr) { + pr_err("Failed to vmalloc page???\n"); + ctx->pass = false; + return; + } + + ptep = lookup_address((unsigned long)ptr, &level); + if (!ptep) { + pr_err("Failed to lookup address???\n"); + ctx->pass = false; + goto done; + } + + pr_info("lookup address ptr %p ptep %p\n", + ptr, ptep); + + pkey = pte_flags_pkey(ptep->pte); + pr_info("ptep flags 0x%lx pkey %u\n", + (unsigned long)ptep->pte, pkey); + + if (pkey != ctx->pkey) { + pr_err("invalid pkey found: %u, test_pkey: %u\n", + pkey, ctx->pkey); + ctx->pass = false; + goto done; + } + + if (!ctx->pks_cpu_enabled) { + pr_err("not CPU enabled; skipping access tests...\n"); + ctx->pass = true; + goto done; + } + + for (i = 0; i < ARRAY_SIZE(pkey_test_ary); i++) { + rc = run_access_test(ctx, &pkey_test_ary[i], ptr, false); + + /* only save last error is fine */ + if (rc) + ctx->pass = false; + } + +done: + vfree(ptr); +} + +static void pks_run_test(struct pks_test_ctx *ctx) +{ + ctx->pass = true; + + pr_info("\n"); + pr_info("\n"); + pr_info(" ***** BEGIN: Testing (CPU enabled : %s) *****\n", + ctx->pks_cpu_enabled ? "TRUE" : "FALSE"); + + if (ctx->pks_cpu_enabled) + on_each_cpu(report_pkey_settings, ctx, 1); + + pr_info(" BEGIN: pkey %d Testing\n", ctx->pkey); + test_mem_access(ctx); + pr_info(" END: PAGE_KERNEL_PKEY Testing : %s\n", + ctx->pass ? "PASS" : "FAIL"); + + pr_info(" ***** END: Testing *****\n"); + pr_info("\n"); + pr_info("\n"); +} + +static ssize_t pks_read_file(struct file *file, char __user *user_buf, + size_t count, loff_t *ppos) +{ + struct pks_test_ctx *ctx = file->private_data; + char buf[32]; + unsigned int len; + + if (!ctx) + len = sprintf(buf, "not run\n"); + else + len = sprintf(buf, "%s\n", ctx->pass ? 
"PASS" : "FAIL"); + + return simple_read_from_buffer(user_buf, count, ppos, buf, len); +} + +static struct pks_test_ctx *alloc_ctx(u8 pkey) +{ + struct pks_test_ctx *ctx = kzalloc(sizeof(*ctx), GFP_KERNEL); + + if (!ctx) { + pr_err("Failed to allocate memory for test context\n"); + return ERR_PTR(-ENOMEM); + } + + ctx->pkey = pkey; + ctx->pks_cpu_enabled = cpu_feature_enabled(X86_FEATURE_PKS); + sprintf(ctx->data, "%s", "DEADBEEF"); + return ctx; +} + +static void free_ctx(struct pks_test_ctx *ctx) +{ + kfree(ctx); +} + +static void run_exception_test(void) +{ + void *ptr = NULL; + bool pass = true; + struct pks_test_ctx *ctx; + + pr_info(" ***** BEGIN: exception checking\n"); + + ctx = alloc_ctx(PKS_KEY_PKS_TEST); + if (IS_ERR(ctx)) { + pr_err(" FAIL: no context\n"); + pass = false; + goto result; + } + ctx->pass = true; + + ptr = alloc_test_page(ctx->pkey); + if (!ptr) { + pr_err(" FAIL: no vmalloc page\n"); + pass = false; + goto free_context; + } + + pks_mk_readonly(ctx->pkey); + + spin_lock(&test_lock); + WRITE_ONCE(test_exception_ctx, ctx); + WRITE_ONCE(test_armed_key, ctx->pkey); + + memcpy(ptr, ctx->data, 8); + + if (!exception_caught()) { + pr_err(" FAIL: did not get an exception\n"); + pass = false; + } + + /* + * NOTE The exception code has to enable access (b00) to keep the fault + * from looping forever. Therefore full access is seen here rather + * than write disabled. + * + * Furthermore, check_exception() disabled access during the exception + * so this is testing that the thread value was restored back to the + * thread value. + */ + if (!check_pkrs(test_armed_key, 0)) { + pr_err(" FAIL: PKRS not restored\n"); + pass = false; + } + + if (!ctx->pass) + pass = false; + + WRITE_ONCE(test_armed_key, 0); + spin_unlock(&test_lock); + + vfree(ptr); +free_context: + free_ctx(ctx); +result: + pr_info(" ***** END: exception checking : %s\n", + pass ? "PASS" : "FAIL"); +} + +static struct pks_access_test abandon_test_ary[] = { + /* disable both */ + { PKS_TEST_NO_ACCESS, true, false }, + { PKS_TEST_NO_ACCESS, false, false }, + + /* enable both */ + { PKS_TEST_RDWR, true, false }, + { PKS_TEST_RDWR, false, false }, + + /* enable read only */ + { PKS_TEST_RDONLY, true, false }, + { PKS_TEST_RDONLY, false, false }, +}; + +static DEFINE_SPINLOCK(abandoned_test_lock); +struct shared_data { + struct pks_test_ctx *ctx; + void *kmap_addr; + struct pks_access_test *test; + bool thread_running; + bool sched_thread; +}; + +static int abandoned_test_main(void *d) +{ + struct shared_data *data = d; + struct pks_test_ctx *ctx = data->ctx; + + spin_lock(&abandoned_test_lock); + data->thread_running = true; + spin_unlock(&abandoned_test_lock); + + while (!kthread_should_stop()) { + spin_lock(&abandoned_test_lock); + if (data->kmap_addr) { + pr_info(" Thread ->saved_pkrs Before 0x%x (%d)\n", + current->thread.saved_pkrs, ctx->pkey); + if (data->sched_thread) + msleep(20); + if (run_access_test(ctx, data->test, data->kmap_addr, + data->sched_thread)) + ctx->pass = false; + pr_info(" Thread Remote ->saved_pkrs After 0x%x (%d)\n", + current->thread.saved_pkrs, ctx->pkey); + data->kmap_addr = NULL; + } + spin_unlock(&abandoned_test_lock); + } + + return 0; +} + +static void run_abandon_pkey_test(struct pks_test_ctx *ctx, + struct pks_access_test *test, + void *ptr, + bool sched_thread) +{ + struct task_struct *other_task; + struct shared_data data; + bool running = false; + + pr_info("checking... mode %s; write %s\n", + get_mode_str(test->mode), test->write ? 
"TRUE" : "FALSE"); + + pkrs_pkey_allowed_mask = 0xffffffff; + + memset(&data, 0, sizeof(data)); + data.ctx = ctx; + data.thread_running = false; + data.sched_thread = sched_thread; + other_task = kthread_run(abandoned_test_main, &data, "PKRS abandoned test"); + if (IS_ERR(other_task)) { + pr_err(" FAIL: Failed to start thread\n"); + ctx->pass = false; + return; + } + + while (!running) { + spin_lock(&abandoned_test_lock); + running = data.thread_running; + spin_unlock(&abandoned_test_lock); + } + + spin_lock(&abandoned_test_lock); + pr_info("Local ->saved_pkrs Before 0x%x (%d)\n", + current->thread.saved_pkrs, ctx->pkey); + pks_abandon_protections(ctx->pkey); + data.test = test; + data.kmap_addr = ptr; + spin_unlock(&abandoned_test_lock); + + while (data.kmap_addr) + msleep(20); + + pr_info("Local ->saved_pkrs After 0x%x (%d)\n", + current->thread.saved_pkrs, ctx->pkey); + + kthread_stop(other_task); +} + +static void run_abandoned_test(void) +{ + struct pks_test_ctx *ctx; + bool pass = true; + void *ptr; + int i; + + pr_info(" ***** BEGIN: abandoned pkey checking\n"); + + ctx = alloc_ctx(PKS_KEY_PKS_TEST); + if (IS_ERR(ctx)) { + pr_err(" FAIL: no context\n"); + pass = false; + goto result; + } + + ptr = alloc_test_page(ctx->pkey); + if (!ptr) { + pr_err(" FAIL: no vmalloc page\n"); + pass = false; + goto free_context; + } + + for (i = 0; i < ARRAY_SIZE(abandon_test_ary); i++) { + ctx->pass = true; + run_abandon_pkey_test(ctx, &abandon_test_ary[i], ptr, false); + /* sticky fail */ + if (!ctx->pass) + pass = ctx->pass; + + ctx->pass = true; + run_abandon_pkey_test(ctx, &abandon_test_ary[i], ptr, true); + /* sticky fail */ + if (!ctx->pass) + pass = ctx->pass; + } + + /* Force re-enable all keys */ + pkrs_pkey_allowed_mask = 0xffffffff; + + vfree(ptr); +free_context: + free_ctx(ctx); +result: + pr_info(" ***** END: abandoned pkey checking : %s\n", + pass ? "PASS" : "FAIL"); +} + +static void run_all(bool debug) +{ + struct pks_test_ctx *ctx[PKS_NUM_PKEYS]; + static char name[PKS_NUM_PKEYS][64]; + int i; + + for (i = 1; i < PKS_NUM_PKEYS; i++) { + sprintf(name[i], "pks ctx %d", i); + ctx[i] = alloc_ctx(i); + if (!IS_ERR(ctx[i])) + ctx[i]->debug = debug; + } + + for (i = 1; i < PKS_NUM_PKEYS; i++) { + if (!IS_ERR(ctx[i])) + pks_run_test(ctx[i]); + } + + for (i = 1; i < PKS_NUM_PKEYS; i++) { + if (!IS_ERR(ctx[i])) + free_ctx(ctx[i]); + } + + run_exception_test(); + + run_abandoned_test(); +} + +static void crash_it(void) +{ + struct pks_test_ctx *ctx; + void *ptr; + + pr_warn(" ***** BEGIN: Unhandled fault test *****\n"); + + ctx = alloc_ctx(PKS_KEY_PKS_TEST); + if (IS_ERR(ctx)) { + pr_err("Failed to allocate context???\n"); + return; + } + + ptr = alloc_test_page(ctx->pkey); + if (!ptr) { + pr_err("Failed to vmalloc page???\n"); + ctx->pass = false; + return; + } + + pks_mk_noaccess(ctx->pkey); + + spin_lock(&test_lock); + WRITE_ONCE(test_armed_key, 0); + /* This purposely faults */ + memcpy(ptr, ctx->data, 8); + spin_unlock(&test_lock); + + vfree(ptr); + free_ctx(ctx); +} + +static ssize_t pks_write_file(struct file *file, const char __user *user_buf, + size_t count, loff_t *ppos) +{ + char buf[2]; + struct pks_test_ctx *ctx = file->private_data; + + if (copy_from_user(buf, user_buf, 1)) + return -EFAULT; + buf[1] = '\0'; + + /* + * WARNING: Test "9" will crash the kernel. + * + * Arm the test and print a warning. A second "9" will run the test. 
+ */ + if (!strcmp(buf, RUN_CRASH_TEST)) { + if (run_9) { + crash_it(); + run_9 = false; + } else { + pr_warn("CAUTION: Test 9 will crash in the kernel.\n"); + pr_warn(" Specify 9 a second time to run\n"); + pr_warn(" run any other test to clear\n"); + run_9 = true; + } + } else { + run_9 = false; + } + + /* + * Test "3" will test allocating all keys. Do it first without + * using "ctx". + */ + if (!strcmp(buf, RUN_ALLOCATE_ALL)) + run_all(false); + if (!strcmp(buf, RUN_ALLOCATE_ALL_DEBUG)) + run_all(true); + + if (!strcmp(buf, RUN_DISABLE_TEST)) + run_abandoned_test(); + + /* + * This context is only required if the file is held open for the below + * tests. Otherwise the context just get's freed in pks_release_file. + */ + if (!ctx) { + ctx = alloc_ctx(PKS_KEY_PKS_TEST); + if (IS_ERR(ctx)) + return -ENOMEM; + file->private_data = ctx; + } + + if (!strcmp(buf, RUN_ALLOCATE)) { + ctx->debug = false; + pks_run_test(ctx); + } + if (!strcmp(buf, RUN_ALLOCATE_DEBUG)) { + ctx->debug = true; + pks_run_test(ctx); + } + + /* start of context switch test */ + if (!strcmp(buf, ARM_CTX_SWITCH)) { + unsigned long reg_pkrs; + int access; + + /* Ensure a known state to test context switch */ + pks_mk_readwrite(ctx->pkey); + + rdmsrl(MSR_IA32_PKRS, reg_pkrs); + + access = (reg_pkrs >> PKR_PKEY_SHIFT(ctx->pkey)) & + PKEY_ACCESS_MASK; + pr_info("Context switch armed : pkey %d: 0x%x reg: 0x%lx\n", + ctx->pkey, access, reg_pkrs); + } + + /* After context switch msr should be restored */ + if (!strcmp(buf, CHECK_CTX_SWITCH) && ctx->pks_cpu_enabled) { + unsigned long reg_pkrs; + int access; + + rdmsrl(MSR_IA32_PKRS, reg_pkrs); + + access = (reg_pkrs >> PKR_PKEY_SHIFT(ctx->pkey)) & + PKEY_ACCESS_MASK; + if (access != 0) { + ctx->pass = false; + pr_err("Context switch check failed: pkey %d: 0x%x reg: 0x%lx\n", + ctx->pkey, access, reg_pkrs); + } else { + pr_err("Context switch check passed: pkey %d: 0x%x reg: 0x%lx\n", + ctx->pkey, access, reg_pkrs); + } + } + + return count; +} + +static int pks_release_file(struct inode *inode, struct file *file) +{ + struct pks_test_ctx *ctx = file->private_data; + + if (!ctx) + return 0; + + free_ctx(ctx); + return 0; +} + +static const struct file_operations fops_init_pks = { + .read = pks_read_file, + .write = pks_write_file, + .llseek = default_llseek, + .release = pks_release_file, +}; + +static int __init parse_pks_test_options(char *str) +{ + run_on_boot = true; + + return 0; +} +early_param("pks-test-on-boot", parse_pks_test_options); + +static int __init pks_test_init(void) +{ + if (cpu_feature_enabled(X86_FEATURE_PKS)) { + if (run_on_boot) + run_all(true); + + pks_test_dentry = debugfs_create_file("run_pks", 0600, arch_debugfs_dir, + NULL, &fops_init_pks); + } + + return 0; +} +late_initcall(pks_test_init); + +static void __exit pks_test_exit(void) +{ + debugfs_remove(pks_test_dentry); + pr_info("test exit\n"); +} diff --git a/mm/Kconfig b/mm/Kconfig index e0d29c655ade..ea6ffee69f55 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -820,8 +820,11 @@ config ARCH_HAS_PKEYS bool config ARCH_HAS_SUPERVISOR_PKEYS bool +config GENERAL_PKS_USER + def_bool n config ARCH_ENABLE_SUPERVISOR_PKEYS - bool + def_bool y + depends on PKS_TEST || GENERAL_PKS_USER config PERCPU_STATS bool "Collect percpu memory statistics" diff --git a/tools/testing/selftests/x86/Makefile b/tools/testing/selftests/x86/Makefile index b4142cd1c5c2..b2f852f0e7e1 100644 --- a/tools/testing/selftests/x86/Makefile +++ b/tools/testing/selftests/x86/Makefile @@ -13,7 +13,7 @@ CAN_BUILD_WITH_NOPIE := $(shell 
./check_cc.sh $(CC) trivial_program.c -no-pie) TARGETS_C_BOTHBITS := single_step_syscall sysret_ss_attrs syscall_nt test_mremap_vdso \ check_initial_reg_state sigreturn iopl ioperm \ test_vsyscall mov_ss_trap \ - syscall_arg_fault fsgsbase_restore sigaltstack + syscall_arg_fault fsgsbase_restore sigaltstack test_pks TARGETS_C_32BIT_ONLY := entry_from_vm86 test_syscall_vdso unwind_vdso \ test_FCMOV test_FCOMI test_FISTTP \ vdso_restorer diff --git a/tools/testing/selftests/x86/test_pks.c b/tools/testing/selftests/x86/test_pks.c new file mode 100644 index 000000000000..c12b38760c9c --- /dev/null +++ b/tools/testing/selftests/x86/test_pks.c @@ -0,0 +1,157 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright(c) 2020 Intel Corporation. All rights reserved. + * + * User space tool to test PKS operations. Accesses test code through + * /x86/run_pks when CONFIG_PKS_TEST is enabled. + * + */ + +#define _GNU_SOURCE +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#define PKS_TEST_FILE "/sys/kernel/debug/x86/run_pks" + +#define RUN_ALLOCATE "0" +#define SETUP_CTX_SWITCH "1" +#define CHECK_CTX_SWITCH "2" +#define RUN_ALLOCATE_ALL "3" +#define RUN_ALLOCATE_DEBUG "4" +#define RUN_ALLOCATE_ALL_DEBUG "5" +#define RUN_DISABLE_TEST "6" +#define RUN_CRASH_TEST "9" + +int main(int argc, char *argv[]) +{ + cpu_set_t cpuset; + char result[32]; + pid_t pid; + int fd; + int setup_done[2]; + int switch_done[2]; + int cpu = 0; + int rc = 0; + int c; + bool debug = false; + + while (1) { + int option_index = 0; + static struct option long_options[] = { + {"debug", no_argument, 0, 0 }, + {0, 0, 0, 0 } + }; + + c = getopt_long(argc, argv, "", long_options, &option_index); + if (c == -1) + break; + + switch (c) { + case 0: + debug = true; + break; + } + } + + if (optind < argc) + cpu = strtoul(argv[optind], NULL, 0); + + if (cpu >= sysconf(_SC_NPROCESSORS_ONLN)) { + printf("CPU %d is invalid\n", cpu); + cpu = sysconf(_SC_NPROCESSORS_ONLN) - 1; + printf(" running on max CPU: %d\n", cpu); + } + + CPU_ZERO(&cpuset); + CPU_SET(cpu, &cpuset); + /* Two processes run on CPU 0 so that they go through context switch. */ + sched_setaffinity(getpid(), sizeof(cpu_set_t), &cpuset); + + if (pipe(setup_done)) + printf("Failed to create pipe\n"); + if (pipe(switch_done)) + printf("Failed to create pipe\n"); + + pid = fork(); + if (pid == 0) { + char done = 'y'; + + fd = open(PKS_TEST_FILE, O_RDWR); + if (fd < 0) { + printf("cannot open %s\n", PKS_TEST_FILE); + return -1; + } + + cpu = sched_getcpu(); + printf("Child running on cpu %d...\n", cpu); + + /* Allocate test_pkey1 and run test. */ + if (debug) + write(fd, RUN_ALLOCATE_DEBUG, 1); + else + write(fd, RUN_ALLOCATE, 1); + + /* Arm for context switch test */ + write(fd, SETUP_CTX_SWITCH, 1); + + printf(" tell parent to go\n"); + write(setup_done[1], &done, sizeof(done)); + + /* Context switch out... 
*/ + printf(" Waiting for parent...\n"); + read(switch_done[0], &done, sizeof(done)); + + /* Check msr restored */ + printf("Checking result\n"); + write(fd, CHECK_CTX_SWITCH, 1); + + read(fd, result, 10); + printf(" #PF, context switch, pkey allocation and free tests: %s\n", result); + if (!strncmp(result, "PASS", 10)) { + rc = -1; + done = 'F'; + } + + /* Signal result */ + write(setup_done[1], &done, sizeof(done)); + } else { + char done = 'y'; + + read(setup_done[0], &done, sizeof(done)); + cpu = sched_getcpu(); + printf("Parent running on cpu %d\n", cpu); + + fd = open(PKS_TEST_FILE, O_RDWR); + if (fd < 0) { + printf("cannot open %s\n", PKS_TEST_FILE); + return -1; + } + + /* run test with alternate pkey */ + if (debug) + write(fd, RUN_ALLOCATE_DEBUG, 1); + else + write(fd, RUN_ALLOCATE, 1); + + printf(" Signaling child.\n"); + write(switch_done[1], &done, sizeof(done)); + + /* Wait for result */ + read(setup_done[0], &done, sizeof(done)); + if (done == 'F') + rc = -1; + } + + close(fd); + + return rc; +} From patchwork Wed Aug 4 04:32:25 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ira Weiny X-Patchwork-Id: 12417785 Received: from mga05.intel.com (mga05.intel.com [192.55.52.43]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id C02F63499 for ; Wed, 4 Aug 2021 04:32:42 +0000 (UTC) X-IronPort-AV: E=McAfee;i="6200,9189,10065"; a="299433071" X-IronPort-AV: E=Sophos;i="5.84,293,1620716400"; d="scan'208";a="299433071" Received: from fmsmga003.fm.intel.com ([10.253.24.29]) by fmsmga105.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Aug 2021 21:32:38 -0700 X-IronPort-AV: E=Sophos;i="5.84,293,1620716400"; d="scan'208";a="511702703" Received: from iweiny-desk2.sc.intel.com (HELO localhost) ([10.3.52.147]) by fmsmga003-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Aug 2021 21:32:37 -0700 From: ira.weiny@intel.com To: Dave Hansen , Dan Williams Cc: Rick Edgecombe , Ira Weiny , Thomas Gleixner , Ingo Molnar , Borislav Petkov , Peter Zijlstra , Andy Lutomirski , "H. Peter Anvin" , Fenghua Yu , x86@kernel.org, linux-kernel@vger.kernel.org, nvdimm@lists.linux.dev, linux-mm@kvack.org Subject: [PATCH V7 12/18] x86/pks: Add PKS fault callbacks Date: Tue, 3 Aug 2021 21:32:25 -0700 Message-Id: <20210804043231.2655537-13-ira.weiny@intel.com> X-Mailer: git-send-email 2.28.0.rc0.12.gb6a658bd00c9 In-Reply-To: <20210804043231.2655537-1-ira.weiny@intel.com> References: <20210804043231.2655537-1-ira.weiny@intel.com> Precedence: bulk X-Mailing-List: nvdimm@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: Rick Edgecombe Some PKS keys will want special handling on accesses that violate their permissions. One of these is PMEM which will want to have a mode that logs the access violation, disables protection, and continues rather than oops the machine. Since PKS faults do not provide the actual key that faulted, this information needs to be recovered by walking the page tables and extracting it from the leaf entry. This infrastructure could be used to implement abandoned pkeys, but adds support in a separate call such that abandoned pkeys are handled more quickly by skipping the page table walk. In pkeys.c, define a new api for setting callbacks for individual pkeys. 
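As a rough illustration of the PMEM-style policy described above (log the violation, drop the protection, continue), a callback registered through this API might look like the following sketch; the my_feature names reuse the hypothetical consumer from the documentation::

    /* Hypothetical callback: log, abandon the key, and continue running */
    static bool my_feature_pks_fault_callback(unsigned long address, bool write)
    {
            pr_err("unexpected %s to protected page at 0x%lx\n",
                   write ? "write" : "read", address);

            /* Drop the key's protections rather than oops the machine */
            pks_abandon_protections(PKS_KEY_MY_FEATURE);

            /* Report the fault as handled so the access is retried */
            return true;
    }
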
Co-developed-by: Ira Weiny Signed-off-by: Ira Weiny Signed-off-by: Rick Edgecombe --- Changes for V7: New patch --- Documentation/core-api/protection-keys.rst | 27 +++++++++++- arch/x86/include/asm/pks.h | 7 +++ arch/x86/mm/fault.c | 51 ++++++++++++++++++++++ arch/x86/mm/pkeys.c | 13 ++++++ include/linux/pkeys.h | 2 + 5 files changed, 99 insertions(+), 1 deletion(-) diff --git a/Documentation/core-api/protection-keys.rst b/Documentation/core-api/protection-keys.rst index 8cf7eaaed3e5..bbf81b12e67d 100644 --- a/Documentation/core-api/protection-keys.rst +++ b/Documentation/core-api/protection-keys.rst @@ -113,7 +113,8 @@ Kernel API for PKS support Similar to user space pkeys, supervisor pkeys allow additional protections to be defined for a supervisor mappings. Unlike user space pkeys, violations of -these protections result in a a kernel oops. +these protections result in a kernel oops unless a PKS fault handler is +provided which handles the fault. Supervisor Memory Protection Keys (PKS) is a feature which is found on Intel's Sapphire Rapids (and later) "Scalable Processor" Server CPUs. It will also be @@ -145,6 +146,30 @@ Disabled. consumer_defaults[PKS_KEY_MY_FEATURE] = PKR_DISABLE_WRITE; ... + +Users may also provide a fault handler which can handle a fault differently +than an oops. Continuing the example from above, if 'MY_FEATURE' wanted to +define a handler it could do so by adding the corresponding entry to the +pks_key_callbacks array. + +:: + + #ifdef CONFIG_MY_FEATURE + bool my_feature_pks_fault_callback(unsigned long address, bool write) + { + if (my_feature_fault_is_ok) + return true; + return false; + } + #endif + + static const pks_key_callback pks_key_callbacks[PKS_KEY_NR_CONSUMERS] = { + [PKS_KEY_DEFAULT] = NULL, + #ifdef CONFIG_MY_FEATURE + [PKS_KEY_MY_FEATURE] = my_feature_pks_fault_callback, + #endif + }; + The following interface is used to manipulate the 'protection domain' defined by a pkey within the kernel. Setting a pkey value in a supervisor PTE adds this additional protection to the page. 
diff --git a/arch/x86/include/asm/pks.h b/arch/x86/include/asm/pks.h index e28413cc410d..3de5089d379d 100644 --- a/arch/x86/include/asm/pks.h +++ b/arch/x86/include/asm/pks.h @@ -23,6 +23,7 @@ static inline struct extended_pt_regs *extended_pt_regs(struct pt_regs *regs) void show_extended_regs_oops(struct pt_regs *regs, unsigned long error_code); int handle_abandoned_pks_value(struct pt_regs *regs); +bool handle_pks_key_callback(unsigned long address, bool write, u16 key); #else /* !CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */ @@ -36,6 +37,12 @@ static inline int handle_abandoned_pks_value(struct pt_regs *regs) { return 0; } +static inline bool handle_pks_key_fault(struct pt_regs *regs, + unsigned long hw_error_code, + unsigned long address) +{ + return false; +} #endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */ diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 3780ed0f9597..7a8c807006c7 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -1134,6 +1134,54 @@ bool fault_in_kernel_space(unsigned long address) return address >= TASK_SIZE_MAX; } +#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS +bool handle_pks_key_fault(struct pt_regs *regs, unsigned long hw_error_code, + unsigned long address) +{ + bool write = (hw_error_code & X86_PF_WRITE); + pgd_t pgd; + p4d_t p4d; + pud_t pud; + pmd_t pmd; + pte_t pte; + + pgd = READ_ONCE(*(init_mm.pgd + pgd_index(address))); + if (!pgd_present(pgd)) + return false; + + p4d = READ_ONCE(*p4d_offset(&pgd, address)); + if (!p4d_present(p4d)) + return false; + + if (p4d_large(p4d)) + return handle_pks_key_callback(address, write, + pte_flags_pkey(p4d_val(p4d))); + + pud = READ_ONCE(*pud_offset(&p4d, address)); + if (!pud_present(pud)) + return false; + + if (pud_large(pud)) + return handle_pks_key_callback(address, write, + pte_flags_pkey(pud_val(pud))); + + pmd = READ_ONCE(*pmd_offset(&pud, address)); + if (!pmd_present(pmd)) + return false; + + if (pmd_large(pmd)) + return handle_pks_key_callback(address, write, + pte_flags_pkey(pmd_val(pmd))); + + pte = READ_ONCE(*pte_offset_kernel(&pmd, address)); + if (!pte_present(pte)) + return false; + + return handle_pks_key_callback(address, write, + pte_flags_pkey(pte_val(pte))); +} +#endif + /* * Called for all faults where 'address' is part of the kernel address * space. Might get called for faults that originate from *code* that @@ -1164,6 +1212,9 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code, if (handle_abandoned_pks_value(regs)) return; + + if (handle_pks_key_fault(regs, hw_error_code, address)) + return; } #ifdef CONFIG_X86_32 diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c index c7358662ec07..f0166725a128 100644 --- a/arch/x86/mm/pkeys.c +++ b/arch/x86/mm/pkeys.c @@ -241,6 +241,19 @@ int handle_abandoned_pks_value(struct pt_regs *regs) return (ept_regs->thread_pkrs != old); } +static const pks_key_callback pks_key_callbacks[PKS_KEY_NR_CONSUMERS] = { 0 }; + +bool handle_pks_key_callback(unsigned long address, bool write, u16 key) +{ + if (key > PKS_KEY_NR_CONSUMERS) + return false; + + if (pks_key_callbacks[key]) + return pks_key_callbacks[key](address, write); + + return false; +} + /* * write_pkrs() optimizes MSR writes by maintaining a per cpu cache which can * be checked quickly. 
diff --git a/include/linux/pkeys.h b/include/linux/pkeys.h index 4d22ccd971fc..549fa01d7da3 100644 --- a/include/linux/pkeys.h +++ b/include/linux/pkeys.h @@ -62,6 +62,8 @@ void pks_mk_readonly(int pkey); void pks_mk_readwrite(int pkey); void pks_abandon_protections(int pkey); +typedef bool (*pks_key_callback)(unsigned long address, bool write); + #else /* !CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */ static inline void pkrs_save_irq(struct pt_regs *regs) { } From patchwork Wed Aug 4 04:32:26 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ira Weiny X-Patchwork-Id: 12417789 Received: from mga05.intel.com (mga05.intel.com [192.55.52.43]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 91DDE349D for ; Wed, 4 Aug 2021 04:32:43 +0000 (UTC) X-IronPort-AV: E=McAfee;i="6200,9189,10065"; a="299433073" X-IronPort-AV: E=Sophos;i="5.84,293,1620716400"; d="scan'208";a="299433073" Received: from fmsmga003.fm.intel.com ([10.253.24.29]) by fmsmga105.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Aug 2021 21:32:38 -0700 X-IronPort-AV: E=Sophos;i="5.84,293,1620716400"; d="scan'208";a="511702708" Received: from iweiny-desk2.sc.intel.com (HELO localhost) ([10.3.52.147]) by fmsmga003-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Aug 2021 21:32:38 -0700 From: ira.weiny@intel.com To: Dave Hansen , Dan Williams Cc: Ira Weiny , Thomas Gleixner , Ingo Molnar , Borislav Petkov , Peter Zijlstra , Andy Lutomirski , "H. Peter Anvin" , Fenghua Yu , Rick Edgecombe , x86@kernel.org, linux-kernel@vger.kernel.org, nvdimm@lists.linux.dev, linux-mm@kvack.org Subject: [PATCH V7 13/18] memremap_pages: Add access protection via supervisor Protection Keys (PKS) Date: Tue, 3 Aug 2021 21:32:26 -0700 Message-Id: <20210804043231.2655537-14-ira.weiny@intel.com> X-Mailer: git-send-email 2.28.0.rc0.12.gb6a658bd00c9 In-Reply-To: <20210804043231.2655537-1-ira.weiny@intel.com> References: <20210804043231.2655537-1-ira.weiny@intel.com> Precedence: bulk X-Mailing-List: nvdimm@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: Ira Weiny The persistent memory (PMEM) driver uses the memremap_pages facility to provide 'struct page' metadata (vmemmap) for PMEM. Given that PMEM capacity maybe orders of magnitude higher capacity than System RAM it presents a large vulnerability surface to stray writes. Unlike stray writes to System RAM, which may result in a crash or other undesirable behavior, stray writes to PMEM additionally are more likely to result in permanent data loss. Reboot is not a remediation for PMEM corruption like it is for System RAM. Given that PMEM access from the kernel is limited to a constrained set of locations (PMEM driver, Filesystem-DAX, and direct-I/O to a DAX page), it is amenable to supervisor pkey protection. Set up an infrastructure for thread local access protection. Then implement the protection using the new Protection Keys Supervisor (PKS) on architectures that support it. To enable this extra protection memremap_pages users should check for protection support via pgmap_protection_enabled() and if enabled specify (PGMAP_PROTECTION) in (struct dev_pagemap)->flags to request that access protection. NOTE: The name of pgmap_protection_enable() and PGMAP_PROTECTION were specifically chosen to isolate the implementation of the protection from higher level users. 
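To make the opt-in concrete, here is a minimal sketch of what a memremap_pages() caller does with these two pieces (the pgmap/dev variables and error handling are illustrative; the same pattern appears in the pmem and device-dax patches later in this series):

	/* Request stray-access protection only when the core can provide it. */
	if (pgmap_protection_enabled())
		pgmap->flags |= PGMAP_PROTECTION;

	addr = devm_memremap_pages(dev, pgmap);
	if (IS_ERR(addr))
		return PTR_ERR(addr);
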
Kernel code intending to access this memory can do so through 4 new calls. pgmap_mk_{readwrite,noaccess}() and __pgmap_mk_{readwrite,noaccess}() calls. The pgmap_mk_*() take a page parameter and the __pgmap_mk_*() calls directly take the dev_pagemap objects. pgmap_mk_*() take care of checking if the page is a page map managed page and are safe to any user who has a reference on the page. All changes in the protections must be through the above calls. They abstract the protection implementation (currently the PKS api) from the upper layer users. Furthermore, the calls are nestable by the use of a per task reference count. This ensures that the first call to re-enable protection does not 'break' the last access of the device memory. NOTE: There are no code paths which directly nest these calls. For this reason multiple reviewers, including Dan and Thomas, asked why this reference counting was needed at this level rather than in a higher level call such as kmap_{atomic,local_page}(). The reason is that pmgmap_mk_read_write() can nest with kmap_{atomic,local_page}(). Therefore push this reference counting to the lower level. Access to device memory during exceptions (#PF) is expected only from user faults. Therefore there is no need to maintain the reference count when entering or exiting exceptions. However, reference counting will occur during the exception. Recall that protection is automatically enabled during exceptions by the PKS core.[1] A default of (NVDIMM_PFN && ARCH_HAS_SUPERVISOR_PKEYS) was suggested but logically that is the same as saying default 'yes' because both NVDIMM_PFN and ARCH_HAS_SUPERVISOR_PKEYS are required. Therefore a default of 'yes' was used. Signed-off-by: Ira Weiny [1] https://lore.kernel.org/lkml/20210401225833.566238-9-ira.weiny@intel.com/ --- Changes for V7 Add __pgmap_mk_*() calls to allow users who have a dev_pagemap to call directly into that layer of the API Add pgmap_protection_enabled() and fail memremap_pages() if protection is requested and pgmap_protection_enabled() is false s/PGMAP_PKEY_PROTECT/PGMAP_PROTECTION This helps to isolate the implementation details of the protection from the higher layers. 
s/dev_page_access_ref/pgmap_prot_count s/DEV_PAGEMAP_PROTECTION/DEVMAP_ACCESS_PROTECTION Adjust Kconfig dependency and default Address feedback from Dan Williams Add requirement comment to devmap_protected Make pgmap_mk_* static inline Change to devmap_protected Change config to DEV_PAGEMAP_PROTECTION Remove dynamic key use from memremap This greatly simplifies turning on PKS when requested by the remapping code #define a static key for pmem use --- arch/x86/mm/pkeys.c | 3 +- include/linux/memremap.h | 1 + include/linux/mm.h | 62 ++++++++++++++++++++++++++++++++++ include/linux/pkeys.h | 1 + include/linux/sched.h | 7 ++++ init/init_task.c | 3 ++ kernel/fork.c | 3 ++ mm/Kconfig | 18 ++++++++++ mm/memremap.c | 73 ++++++++++++++++++++++++++++++++++++++++ 9 files changed, 170 insertions(+), 1 deletion(-) diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c index f0166725a128..cdebc2018888 100644 --- a/arch/x86/mm/pkeys.c +++ b/arch/x86/mm/pkeys.c @@ -294,7 +294,8 @@ static int __init create_initial_pkrs_value(void) }; int i; - consumer_defaults[PKS_KEY_DEFAULT] = PKR_RW_BIT; + consumer_defaults[PKS_KEY_DEFAULT] = PKR_RW_BIT; + consumer_defaults[PKS_KEY_PGMAP_PROTECTION] = PKR_AD_BIT; /* Ensure the number of consumers is less than the number of keys */ BUILD_BUG_ON(PKS_KEY_NR_CONSUMERS > PKS_NUM_PKEYS); diff --git a/include/linux/memremap.h b/include/linux/memremap.h index c0e9d35889e8..53dc97823418 100644 --- a/include/linux/memremap.h +++ b/include/linux/memremap.h @@ -90,6 +90,7 @@ struct dev_pagemap_ops { }; #define PGMAP_ALTMAP_VALID (1 << 0) +#define PGMAP_PROTECTION (1 << 1) /** * struct dev_pagemap - metadata for ZONE_DEVICE mappings diff --git a/include/linux/mm.h b/include/linux/mm.h index 7ca22e6e694a..d3c1a3ecca87 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1198,6 +1198,68 @@ static inline bool is_pci_p2pdma_page(const struct page *page) page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA; } +#ifdef CONFIG_DEVMAP_ACCESS_PROTECTION +DECLARE_STATIC_KEY_FALSE(dev_pgmap_protection_static_key); + +/* + * devmap_protected() requires a reference on the page to ensure there is no + * races with dev_pagemap tear down. 
+ */ +static inline bool devmap_protected(struct page *page) +{ + if (!static_branch_unlikely(&dev_pgmap_protection_static_key)) + return false; + if (!is_zone_device_page(page)) + return false; + if (page->pgmap->flags & PGMAP_PROTECTION) + return true; + return false; +} + +void __pgmap_mk_readwrite(struct dev_pagemap *pgmap); +void __pgmap_mk_noaccess(struct dev_pagemap *pgmap); + +static inline bool pgmap_check_pgmap_prot(struct page *page) +{ + if (!devmap_protected(page)) + return false; + + /* + * There is no known use case to change permissions in an irq for pgmap + * pages + */ + lockdep_assert_in_irq(); + return true; +} + +static inline void pgmap_mk_readwrite(struct page *page) +{ + if (!pgmap_check_pgmap_prot(page)) + return; + __pgmap_mk_readwrite(page->pgmap); +} +static inline void pgmap_mk_noaccess(struct page *page) +{ + if (!pgmap_check_pgmap_prot(page)) + return; + __pgmap_mk_noaccess(page->pgmap); +} + +bool pgmap_protection_enabled(void); + +#else + +static inline void __pgmap_mk_readwrite(struct dev_pagemap *pgmap) { } +static inline void __pgmap_mk_noaccess(struct dev_pagemap *pgmap) { } +static inline void pgmap_mk_readwrite(struct page *page) { } +static inline void pgmap_mk_noaccess(struct page *page) { } +static inline bool pgmap_protection_enabled(void) +{ + return false; +} + +#endif /* CONFIG_DEVMAP_ACCESS_PROTECTION */ + /* 127: arbitrary random number, small enough to assemble well */ #define page_ref_zero_or_close_to_overflow(page) \ ((unsigned int) page_ref_count(page) + 127u <= 127u) diff --git a/include/linux/pkeys.h b/include/linux/pkeys.h index 549fa01d7da3..c06b47264c5d 100644 --- a/include/linux/pkeys.h +++ b/include/linux/pkeys.h @@ -49,6 +49,7 @@ static inline bool arch_pkeys_enabled(void) #ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS enum pks_pkey_consumers { PKS_KEY_DEFAULT = 0, /* Must be 0 for default PTE values */ + PKS_KEY_PGMAP_PROTECTION, PKS_KEY_NR_CONSUMERS }; extern u32 pkrs_init_value; diff --git a/include/linux/sched.h b/include/linux/sched.h index ec8d07d88641..2d035d9981b5 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1400,6 +1400,13 @@ struct task_struct { struct llist_head kretprobe_instances; #endif +#ifdef CONFIG_DEVMAP_ACCESS_PROTECTION + /* + * NOTE: pgmap_prot_count is modified within a single thread of + * execution. So it does not need to be atomic_t. + */ + u32 pgmap_prot_count; +#endif /* * New fields for task_struct should be added above here, so that * they are included in the randomized portion of task_struct. diff --git a/init/init_task.c b/init/init_task.c index 562f2ef8d157..f628ad552ee3 100644 --- a/init/init_task.c +++ b/init/init_task.c @@ -213,6 +213,9 @@ struct task_struct init_task #ifdef CONFIG_SECCOMP_FILTER .seccomp = { .filter_count = ATOMIC_INIT(0) }, #endif +#ifdef CONFIG_DEVMAP_ACCESS_PROTECTION + .pgmap_prot_count = 0, +#endif }; EXPORT_SYMBOL(init_task); diff --git a/kernel/fork.c b/kernel/fork.c index bc94b2cc5995..7f7b946f4f2e 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -956,6 +956,9 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node) #ifdef CONFIG_MEMCG tsk->active_memcg = NULL; +#endif +#ifdef CONFIG_DEVMAP_ACCESS_PROTECTION + tsk->pgmap_prot_count = 0; #endif return tsk; diff --git a/mm/Kconfig b/mm/Kconfig index ea6ffee69f55..201d41269a36 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -790,6 +790,24 @@ config ZONE_DEVICE If FS_DAX is enabled, then say Y. 
+config DEVMAP_ACCESS_PROTECTION + bool "Access protection for memremap_pages()" + depends on NVDIMM_PFN + depends on ARCH_HAS_SUPERVISOR_PKEYS + select GENERAL_PKS_USER + default y + + help + Enable extra protections on device memory. This protects against + unintended access to devices such as a stray writes. This feature is + particularly useful to protect against corruption of persistent + memory. + + This depends on architecture support of supervisor PKeys and has no + overhead if the architecture does not support them. + + If you have persistent memory say 'Y'. + config DEV_PAGEMAP_OPS bool diff --git a/mm/memremap.c b/mm/memremap.c index 15a074ffb8d7..a05de8714916 100644 --- a/mm/memremap.c +++ b/mm/memremap.c @@ -6,6 +6,7 @@ #include #include #include +#include #include #include #include @@ -63,6 +64,68 @@ static void devmap_managed_enable_put(struct dev_pagemap *pgmap) } #endif /* CONFIG_DEV_PAGEMAP_OPS */ +#ifdef CONFIG_DEVMAP_ACCESS_PROTECTION +/* + * Note; all devices which have asked for protections share the same key. The + * key may, or may not, have been provided by the core. If not, protection + * will be disabled. The key acquisition is attempted when the first ZONE + * DEVICE requests it and freed when all zones have been unmapped. + * + * Also this must be EXPORT_SYMBOL rather than EXPORT_SYMBOL_GPL because it is + * intended to be used in the kmap API. + */ +DEFINE_STATIC_KEY_FALSE(dev_pgmap_protection_static_key); +EXPORT_SYMBOL(dev_pgmap_protection_static_key); + +static void devmap_protection_enable(void) +{ + static_branch_inc(&dev_pgmap_protection_static_key); +} + +static pgprot_t devmap_protection_adjust_pgprot(pgprot_t prot) +{ + pgprotval_t val; + + val = pgprot_val(prot); + return __pgprot(val | _PAGE_PKEY(PKS_KEY_PGMAP_PROTECTION)); +} + +static void devmap_protection_disable(void) +{ + static_branch_dec(&dev_pgmap_protection_static_key); +} + +void __pgmap_mk_readwrite(struct dev_pagemap *pgmap) +{ + if (!current->pgmap_prot_count++) + pks_mk_readwrite(PKS_KEY_PGMAP_PROTECTION); +} +EXPORT_SYMBOL_GPL(__pgmap_mk_readwrite); + +void __pgmap_mk_noaccess(struct dev_pagemap *pgmap) +{ + if (!--current->pgmap_prot_count) + pks_mk_noaccess(PKS_KEY_PGMAP_PROTECTION); +} +EXPORT_SYMBOL_GPL(__pgmap_mk_noaccess); + +bool pgmap_protection_enabled(void) +{ + return pks_enabled(); +} +EXPORT_SYMBOL_GPL(pgmap_protection_enabled); + +#else /* !CONFIG_DEVMAP_ACCESS_PROTECTION */ + +static void devmap_protection_enable(void) { } +static void devmap_protection_disable(void) { } + +static pgprot_t devmap_protection_adjust_pgprot(pgprot_t prot) +{ + return prot; +} +#endif /* CONFIG_DEVMAP_ACCESS_PROTECTION */ + static void pgmap_array_delete(struct range *range) { xa_store_range(&pgmap_array, PHYS_PFN(range->start), PHYS_PFN(range->end), @@ -181,6 +244,9 @@ void memunmap_pages(struct dev_pagemap *pgmap) WARN_ONCE(pgmap->altmap.alloc, "failed to free all reserved pages\n"); devmap_managed_enable_put(pgmap); + + if (pgmap->flags & PGMAP_PROTECTION) + devmap_protection_disable(); } EXPORT_SYMBOL_GPL(memunmap_pages); @@ -329,6 +395,13 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid) if (WARN_ONCE(!nr_range, "nr_range must be specified\n")) return ERR_PTR(-EINVAL); + if (pgmap->flags & PGMAP_PROTECTION) { + if (!pgmap_protection_enabled()) + return ERR_PTR(-EINVAL); + devmap_protection_enable(); + params.pgprot = devmap_protection_adjust_pgprot(params.pgprot); + } + switch (pgmap->type) { case MEMORY_DEVICE_PRIVATE: if (!IS_ENABLED(CONFIG_DEVICE_PRIVATE)) { From 
patchwork Wed Aug 4 04:32:27 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ira Weiny X-Patchwork-Id: 12417793 Received: from mga05.intel.com (mga05.intel.com [192.55.52.43]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id E4B17349F for ; Wed, 4 Aug 2021 04:32:43 +0000 (UTC) X-IronPort-AV: E=McAfee;i="6200,9189,10065"; a="299433075" X-IronPort-AV: E=Sophos;i="5.84,293,1620716400"; d="scan'208";a="299433075" Received: from fmsmga003.fm.intel.com ([10.253.24.29]) by fmsmga105.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Aug 2021 21:32:38 -0700 X-IronPort-AV: E=Sophos;i="5.84,293,1620716400"; d="scan'208";a="511702711" Received: from iweiny-desk2.sc.intel.com (HELO localhost) ([10.3.52.147]) by fmsmga003-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Aug 2021 21:32:38 -0700 From: ira.weiny@intel.com To: Dave Hansen , Dan Williams Cc: Ira Weiny , Thomas Gleixner , Ingo Molnar , Borislav Petkov , Peter Zijlstra , Andy Lutomirski , "H. Peter Anvin" , Fenghua Yu , Rick Edgecombe , x86@kernel.org, linux-kernel@vger.kernel.org, nvdimm@lists.linux.dev, linux-mm@kvack.org Subject: [PATCH V7 14/18] memremap_pages: Add memremap.pks_fault_mode Date: Tue, 3 Aug 2021 21:32:27 -0700 Message-Id: <20210804043231.2655537-15-ira.weiny@intel.com> X-Mailer: git-send-email 2.28.0.rc0.12.gb6a658bd00c9 In-Reply-To: <20210804043231.2655537-1-ira.weiny@intel.com> References: <20210804043231.2655537-1-ira.weiny@intel.com> Precedence: bulk X-Mailing-List: nvdimm@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: Ira Weiny Some systems may be using pmem in unanticipated ways. As such it is possible a code path may violation the restrictions of the PMEM PKS protections. In order to provide a more seamless integration of the PMEM PKS feature provide a pks_fault_mode that allows for a relaxed mode should a previously working feature start to fault on PKS protected PMEM. 2 modes are available: 'relaxed' (default) -- WARN_ONCE, abandon the protections, and continuing to operate. 'strict' -- BUG_ON/or fault indicating the error. This is the most protective of the PMEM memory but may be undesirable in some configurations. NOTE: There was some debate about if a 3rd mode called 'silent' should be available. 'silent' would be the same as 'relaxed' but not print any output. While 'silent' is nice for admins to reduce console/log output it would result in less motivation to fix invalid access to the protected pmem pages. Therefore, 'silent' is left out. In addition, kmap() is known to not work with this protection. Provide a new call; pgmap_protection_flag_invalid(). This gives better debugging for missed kmap() users. This call also respects the pks_fault_mode settings. 
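For reference, the intended administrator interface looks as follows (the boot parameter form comes from the kernel-parameters.txt hunk below; the sysfs path is an assumption based on the 0644 module_param() permissions and is not spelled out in the patch):

	# On the kernel command line:
	memremap.pks_fault_mode=strict

	# At runtime (assumed path for the built-in module parameter):
	cat /sys/module/memremap/parameters/pks_fault_mode
	echo relaxed > /sys/module/memremap/parameters/pks_fault_mode
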
Signed-off-by: Ira Weiny --- Changes for V7 Leverage Rick Edgecombe's fault callback infrastructure to relax invalid uses and prevent crashes From Dan Williams Use sysfs_* calls for parameter Make pgmap_disable_protection inline Remove pfn from warn output Remove silent parameter option --- .../admin-guide/kernel-parameters.txt | 14 +++ arch/x86/mm/pkeys.c | 8 +- include/linux/mm.h | 26 ++++++ mm/memremap.c | 85 +++++++++++++++++++ 4 files changed, 132 insertions(+), 1 deletion(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index bdb22006f713..7902fce7f1da 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -4081,6 +4081,20 @@ pirq= [SMP,APIC] Manual mp-table setup See Documentation/x86/i386/IO-APIC.rst. + memremap.pks_fault_mode= [X86] Control the behavior of page map + protection violations. Violations may not be an actual + use of the memory but simply an attempt to map it in an + incompatible way. + (depends on CONFIG_DEVMAP_ACCESS_PROTECTION + + Format: { relaxed | strict } + + relaxed - Print a warning, disable the protection and + continue execution. + strict - Stop kernel execution via BUG_ON or fault + + default: relaxed + plip= [PPT,NET] Parallel port network link Format: { parport | timid | 0 } See also Documentation/admin-guide/parport.rst. diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c index cdebc2018888..201004586c2b 100644 --- a/arch/x86/mm/pkeys.c +++ b/arch/x86/mm/pkeys.c @@ -9,6 +9,7 @@ #include /* debugfs_create_u32() */ #include /* mm_struct, vma, etc... */ #include /* PKEY_* */ +#include /* fault callback */ #include #include /* boot_cpu_has, ... */ @@ -241,7 +242,12 @@ int handle_abandoned_pks_value(struct pt_regs *regs) return (ept_regs->thread_pkrs != old); } -static const pks_key_callback pks_key_callbacks[PKS_KEY_NR_CONSUMERS] = { 0 }; +static const pks_key_callback pks_key_callbacks[PKS_KEY_NR_CONSUMERS] = { + [PKS_KEY_DEFAULT] = NULL, +#ifdef CONFIG_DEVMAP_ACCESS_PROTECTION + [PKS_KEY_PGMAP_PROTECTION] = pgmap_pks_fault_callback, +#endif +}; bool handle_pks_key_callback(unsigned long address, bool write, u16 key) { diff --git a/include/linux/mm.h b/include/linux/mm.h index d3c1a3ecca87..c13c7af7cad3 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1216,6 +1216,7 @@ static inline bool devmap_protected(struct page *page) return false; } +void __pgmap_protection_flag_invalid(struct dev_pagemap *pgmap); void __pgmap_mk_readwrite(struct dev_pagemap *pgmap); void __pgmap_mk_noaccess(struct dev_pagemap *pgmap); @@ -1232,6 +1233,27 @@ static inline bool pgmap_check_pgmap_prot(struct page *page) return true; } +/* + * pgmap_protection_flag_invalid - Check and flag an invalid use of a pgmap + * protected page + * + * There are code paths which are known to not be compatible with pgmap + * protections. pgmap_protection_flag_invalid() is provided as a 'relief + * valve' to be used in those functions which are known to be incompatible. + * + * Thus an invalid code path can be flag more precisely what code contains the + * bug vs just flagging a fault. Like the fault handler code this abandons the + * use of the PKS key and optionally allows the calling code path to continue + * based on the configuration of the memremap.pks_fault_mode command line + * (and/or sysfs) option. 
+ */ +static inline void pgmap_protection_flag_invalid(struct page *page) +{ + if (!pgmap_check_pgmap_prot(page)) + return; + __pgmap_protection_flag_invalid(page->pgmap); +} + static inline void pgmap_mk_readwrite(struct page *page) { if (!pgmap_check_pgmap_prot(page)) @@ -1247,10 +1269,14 @@ static inline void pgmap_mk_noaccess(struct page *page) bool pgmap_protection_enabled(void); +bool pgmap_pks_fault_callback(unsigned long address, bool write); + #else static inline void __pgmap_mk_readwrite(struct dev_pagemap *pgmap) { } static inline void __pgmap_mk_noaccess(struct dev_pagemap *pgmap) { } + +static inline void pgmap_protection_flag_invalid(struct page *page) { } static inline void pgmap_mk_readwrite(struct page *page) { } static inline void pgmap_mk_noaccess(struct page *page) { } static inline bool pgmap_protection_enabled(void) diff --git a/mm/memremap.c b/mm/memremap.c index a05de8714916..930b360bad86 100644 --- a/mm/memremap.c +++ b/mm/memremap.c @@ -95,6 +95,91 @@ static void devmap_protection_disable(void) static_branch_dec(&dev_pgmap_protection_static_key); } +/* + * Ignore the checkpatch warning because the typedef allows + * param_check_pks_fault_modes to automatically check the passed value. + */ +typedef enum { + PKS_MODE_STRICT = 0, + PKS_MODE_RELAXED = 1, +} pks_fault_modes; + +pks_fault_modes pks_fault_mode = PKS_MODE_RELAXED; + +static int param_set_pks_fault_mode(const char *val, const struct kernel_param *kp) +{ + int ret = -EINVAL; + + if (!sysfs_streq(val, "relaxed")) { + pks_fault_mode = PKS_MODE_RELAXED; + ret = 0; + } else if (!sysfs_streq(val, "strict")) { + pks_fault_mode = PKS_MODE_STRICT; + ret = 0; + } + + return ret; +} + +static int param_get_pks_fault_mode(char *buffer, const struct kernel_param *kp) +{ + int ret = 0; + + switch (pks_fault_mode) { + case PKS_MODE_STRICT: + ret = sysfs_emit(buffer, "strict\n"); + break; + case PKS_MODE_RELAXED: + ret = sysfs_emit(buffer, "relaxed\n"); + break; + default: + ret = sysfs_emit(buffer, "\n"); + break; + } + + return ret; +} + +static const struct kernel_param_ops param_ops_pks_fault_modes = { + .set = param_set_pks_fault_mode, + .get = param_get_pks_fault_mode, +}; + +#define param_check_pks_fault_modes(name, p) \ + __param_check(name, p, pks_fault_modes) +module_param(pks_fault_mode, pks_fault_modes, 0644); + +static void pgmap_abandon_protection(void) +{ + static bool protections_abandoned = false; + + if (!protections_abandoned) { + protections_abandoned = true; + pks_abandon_protections(PKS_KEY_PGMAP_PROTECTION); + } +} + +void __pgmap_protection_flag_invalid(struct dev_pagemap *pgmap) +{ + BUG_ON(pks_fault_mode == PKS_MODE_STRICT); + + WARN_ONCE(1, "Page map protection disabled"); + pgmap_abandon_protection(); +} +EXPORT_SYMBOL_GPL(__pgmap_protection_flag_invalid); + +bool pgmap_pks_fault_callback(unsigned long address, bool write) +{ + /* In strict mode just let the fault handler oops */ + if (pks_fault_mode == PKS_MODE_STRICT) + return false; + + WARN_ONCE(1, "Page map protection disabled"); + pgmap_abandon_protection(); + return true; +} +EXPORT_SYMBOL_GPL(pgmap_pks_fault_callback); + void __pgmap_mk_readwrite(struct dev_pagemap *pgmap) { if (!current->pgmap_prot_count++) From patchwork Wed Aug 4 04:32:28 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ira Weiny X-Patchwork-Id: 12417791 Received: from mga05.intel.com (mga05.intel.com [192.55.52.43]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client 
certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 18F2534A1 for ; Wed, 4 Aug 2021 04:32:44 +0000 (UTC) X-IronPort-AV: E=McAfee;i="6200,9189,10065"; a="299433077" X-IronPort-AV: E=Sophos;i="5.84,293,1620716400"; d="scan'208";a="299433077" Received: from fmsmga003.fm.intel.com ([10.253.24.29]) by fmsmga105.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Aug 2021 21:32:38 -0700 X-IronPort-AV: E=Sophos;i="5.84,293,1620716400"; d="scan'208";a="511702717" Received: from iweiny-desk2.sc.intel.com (HELO localhost) ([10.3.52.147]) by fmsmga003-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Aug 2021 21:32:38 -0700 From: ira.weiny@intel.com To: Dave Hansen , Dan Williams Cc: Ira Weiny , Dave Hansen , Thomas Gleixner , Ingo Molnar , Borislav Petkov , Peter Zijlstra , Andy Lutomirski , "H. Peter Anvin" , Fenghua Yu , Rick Edgecombe , x86@kernel.org, linux-kernel@vger.kernel.org, nvdimm@lists.linux.dev, linux-mm@kvack.org Subject: [PATCH V7 15/18] kmap: Add stray access protection for devmap pages Date: Tue, 3 Aug 2021 21:32:28 -0700 Message-Id: <20210804043231.2655537-16-ira.weiny@intel.com> X-Mailer: git-send-email 2.28.0.rc0.12.gb6a658bd00c9 In-Reply-To: <20210804043231.2655537-1-ira.weiny@intel.com> References: <20210804043231.2655537-1-ira.weiny@intel.com> Precedence: bulk X-Mailing-List: nvdimm@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: Ira Weiny Enable PKS protection for devmap pages. The devmap protection facility wants to co-opt kmap_{local_page,atomic}() to mediate access to PKS protected pages. kmap() allows for global mappings to be established, while the PKS facility depends on thread-local access. For this reason kmap() is not supported, but it leaves a policy decision for what to do when kmap() is attempted on a protected devmap page. Neither of the 2 current DAX-capable filesystems (ext4 and xfs) perform such global mappings. The bulk of device drivers that would handle devmap pages are not using kmap(). Any future filesystems that gain DAX support, or device drivers wanting to support devmap protected pages will need to move to kmap_local_page(). In the meantime to handle these kmap() users call pgmap_protection_flag_invalid() to flag and invalid use of any potentially protected pages. This allows better debugging of invalided uses vs catching faults later on when the address is used. Direct-map exposure is already mitigated by default on HIGHMEM systems because by definition HIGHMEM systems do not have large capacities of memory in the direct map. Therefore, to reduce complexity HIGHMEM systems are not supported. 
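A minimal sketch of the access pattern this asks devmap users to adopt (page and buffer are illustrative; on protected pages kmap_local_page()/kunmap_local() apply pgmap_mk_readwrite()/pgmap_mk_noaccess() implicitly, while kmap() only flags the invalid use):

	/* Preferred: thread-local mapping, protection relaxed only here. */
	void *addr = kmap_local_page(page);
	memcpy(buffer, addr, PAGE_SIZE);
	kunmap_local(addr);

	/* Not supported: kmap() of a protected devmap page only calls
	 * pgmap_protection_flag_invalid() to flag the offending caller. */
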
Cc: Dan Williams Cc: Dave Hansen Signed-off-by: Ira Weiny --- include/linux/highmem-internal.h | 5 +++++ mm/Kconfig | 1 + 2 files changed, 6 insertions(+) diff --git a/include/linux/highmem-internal.h b/include/linux/highmem-internal.h index 7902c7d8b55f..f88bc14a643b 100644 --- a/include/linux/highmem-internal.h +++ b/include/linux/highmem-internal.h @@ -142,6 +142,7 @@ static inline struct page *kmap_to_page(void *addr) static inline void *kmap(struct page *page) { might_sleep(); + pgmap_protection_flag_invalid(page); return page_address(page); } @@ -157,6 +158,7 @@ static inline void kunmap(struct page *page) static inline void *kmap_local_page(struct page *page) { + pgmap_mk_readwrite(page); return page_address(page); } @@ -175,12 +177,14 @@ static inline void __kunmap_local(void *addr) #ifdef ARCH_HAS_FLUSH_ON_KUNMAP kunmap_flush_on_unmap(addr); #endif + pgmap_mk_noaccess(kmap_to_page(addr)); } static inline void *kmap_atomic(struct page *page) { preempt_disable(); pagefault_disable(); + pgmap_mk_readwrite(page); return page_address(page); } @@ -199,6 +203,7 @@ static inline void __kunmap_atomic(void *addr) #ifdef ARCH_HAS_FLUSH_ON_KUNMAP kunmap_flush_on_unmap(addr); #endif + pgmap_mk_noaccess(kmap_to_page(addr)); pagefault_enable(); preempt_enable(); } diff --git a/mm/Kconfig b/mm/Kconfig index 201d41269a36..4184d0a7531d 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -794,6 +794,7 @@ config DEVMAP_ACCESS_PROTECTION bool "Access protection for memremap_pages()" depends on NVDIMM_PFN depends on ARCH_HAS_SUPERVISOR_PKEYS + depends on !HIGHMEM select GENERAL_PKS_USER default y From patchwork Wed Aug 4 04:32:29 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ira Weiny X-Patchwork-Id: 12417795 Received: from mga05.intel.com (mga05.intel.com [192.55.52.43]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id DBA3E34A5 for ; Wed, 4 Aug 2021 04:32:44 +0000 (UTC) X-IronPort-AV: E=McAfee;i="6200,9189,10065"; a="299433081" X-IronPort-AV: E=Sophos;i="5.84,293,1620716400"; d="scan'208";a="299433081" Received: from fmsmga003.fm.intel.com ([10.253.24.29]) by fmsmga105.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Aug 2021 21:32:39 -0700 X-IronPort-AV: E=Sophos;i="5.84,293,1620716400"; d="scan'208";a="511702721" Received: from iweiny-desk2.sc.intel.com (HELO localhost) ([10.3.52.147]) by fmsmga003-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Aug 2021 21:32:39 -0700 From: ira.weiny@intel.com To: Dave Hansen , Dan Williams Cc: Ira Weiny , Thomas Gleixner , Ingo Molnar , Borislav Petkov , Peter Zijlstra , Andy Lutomirski , "H. Peter Anvin" , Fenghua Yu , Rick Edgecombe , x86@kernel.org, linux-kernel@vger.kernel.org, nvdimm@lists.linux.dev, linux-mm@kvack.org Subject: [PATCH V7 16/18] dax: Stray access protection for dax_direct_access() Date: Tue, 3 Aug 2021 21:32:29 -0700 Message-Id: <20210804043231.2655537-17-ira.weiny@intel.com> X-Mailer: git-send-email 2.28.0.rc0.12.gb6a658bd00c9 In-Reply-To: <20210804043231.2655537-1-ira.weiny@intel.com> References: <20210804043231.2655537-1-ira.weiny@intel.com> Precedence: bulk X-Mailing-List: nvdimm@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: Ira Weiny dax_direct_access() provides a way to obtain the direct map address of PMEM memory. Coordinate PKS protection with dax_direct_access() of protected devmap pages. 
Introduce 3 new calls dax_{protected,mk_readwrite,mk_noaccess}() These 3 calls do not have to be implemented by the dax provider if no protection is implemented. Single threads of execution can use dax_mk_{readwrite,noaccess}() to relax the protection of the dax device and allow direct use of the kaddr returned from dax_direct_access(). dax_mk_{readwrite,noaccess}() must be used within the dax_read_[un]lock() protected region. And they only need to be used to guard actual access to the memory pointed to. Other uses of dax_direct_access() do not need to use these guards. For users who require a permanent address to the dax device such as the DM write cache. dax_protected() indicates that the dax device has additional protections. In this case the user choses to create it's own mapping of the memory. Signed-off-by: Ira Weiny --- Changes for V7 Rework cover letter. Do not include a FS_DAX_LIMITED restriction for dcss. It will simply not implement the protection and there is no need to special case this. Clean up commit message because I did not originally understand the nuance of the s390 device. Introduce dax_{protected,mk_readwrite,mk_noaccess}() From Dan Williams Remove old clean up cruft from previous versions Remove map_protected Remove 'global' parameters all calls --- drivers/dax/super.c | 54 ++++++++++++++++++++++++++++++++++++++ drivers/md/dm-writecache.c | 8 +++++- fs/dax.c | 8 ++++++ fs/fuse/virtio_fs.c | 2 ++ include/linux/dax.h | 8 ++++++ 5 files changed, 79 insertions(+), 1 deletion(-) diff --git a/drivers/dax/super.c b/drivers/dax/super.c index 44736cbd446e..dc05c89102d0 100644 --- a/drivers/dax/super.c +++ b/drivers/dax/super.c @@ -296,6 +296,8 @@ EXPORT_SYMBOL_GPL(dax_attribute_group); * @pgoff: offset in pages from the start of the device to translate * @nr_pages: number of consecutive pages caller can handle relative to @pfn * @kaddr: output parameter that returns a virtual address mapping of pfn + * Direct access through this pointer must be guarded by calls to + * dax_mk_{readwrite,noaccess}() * @pfn: output parameter that returns an absolute pfn translation of @pgoff * * Return: negative errno if an error occurs, otherwise the number of @@ -389,6 +391,58 @@ void dax_flush(struct dax_device *dax_dev, void *addr, size_t size) #endif EXPORT_SYMBOL_GPL(dax_flush); +bool dax_map_protected(struct dax_device *dax_dev) +{ + if (!dax_alive(dax_dev)) + return false; + + if (dax_dev->ops->map_protected) + return dax_dev->ops->map_protected(dax_dev); + return false; +} +EXPORT_SYMBOL_GPL(dax_map_protected); + +/** + * dax_mk_readwrite() - make protected dax devices read/write + * @dax_dev: the dax device representing the memory to access + * + * Any access of the kaddr memory returned from dax_direct_access() must be + * guarded by dax_mk_readwrite() and dax_mk_noaccess(). This ensures that any + * dax devices which have additional protections are allowed to relax those + * protections for the thread using this memory. + * + * NOTE these calls must be contained within a single thread of execution and + * both must be guarded by dax_read_lock() Which is also a requirement for + * dax_direct_access() anyway. 
+ */ +void dax_mk_readwrite(struct dax_device *dax_dev) +{ + if (!dax_alive(dax_dev)) + return; + + if (dax_dev->ops->mk_readwrite) + dax_dev->ops->mk_readwrite(dax_dev); +} +EXPORT_SYMBOL_GPL(dax_mk_readwrite); + +/** + * dax_mk_noaccess() - restore protection to dax devices if needed + * @dax_dev: the dax device representing the memory to access + * + * See dax_direct_access() and dax_mk_readwrite() + * + * NOTE Must be called prior to dax_read_unlock() + */ +void dax_mk_noaccess(struct dax_device *dax_dev) +{ + if (!dax_alive(dax_dev)) + return; + + if (dax_dev->ops->mk_noaccess) + dax_dev->ops->mk_noaccess(dax_dev); +} +EXPORT_SYMBOL_GPL(dax_mk_noaccess); + void dax_write_cache(struct dax_device *dax_dev, bool wc) { if (wc) diff --git a/drivers/md/dm-writecache.c b/drivers/md/dm-writecache.c index e21e29e81bbf..27671300ad50 100644 --- a/drivers/md/dm-writecache.c +++ b/drivers/md/dm-writecache.c @@ -284,7 +284,13 @@ static int persistent_memory_claim(struct dm_writecache *wc) r = -EOPNOTSUPP; goto err2; } - if (da != p) { + + /* + * Force the write cache to map the pages directly if the dax device + * mapping is protected or if the number of pages returned was not what + * was requested. + */ + if (dax_map_protected(wc->ssd_dev->dax_dev) || da != p) { long i; wc->memory_map = NULL; pages = kvmalloc_array(p, sizeof(struct page *), GFP_KERNEL); diff --git a/fs/dax.c b/fs/dax.c index 99b4e78d888f..9dfb93b39754 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -728,7 +728,9 @@ static int copy_cow_page_dax(struct block_device *bdev, struct dax_device *dax_d return rc; } vto = kmap_atomic(to); + dax_mk_readwrite(dax_dev); copy_user_page(vto, (void __force *)kaddr, vaddr, to); + dax_mk_noaccess(dax_dev); kunmap_atomic(vto); dax_read_unlock(id); return 0; @@ -1096,8 +1098,10 @@ s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap) } if (!page_aligned) { + dax_mk_readwrite(iomap->dax_dev); memset(kaddr + offset, 0, size); dax_flush(iomap->dax_dev, kaddr + offset, size); + dax_mk_noaccess(iomap->dax_dev); } dax_read_unlock(id); return size; @@ -1169,6 +1173,8 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data, if (map_len > end - pos) map_len = end - pos; + dax_mk_readwrite(dax_dev); + /* * The userspace address for the memory copy has already been * validated via access_ok() in either vfs_read() or @@ -1181,6 +1187,8 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data, xfer = dax_copy_to_iter(dax_dev, pgoff, kaddr, map_len, iter); + dax_mk_noaccess(dax_dev); + pos += xfer; length -= xfer; done += xfer; diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c index 8f52cdaa8445..3dfb053b1c4d 100644 --- a/fs/fuse/virtio_fs.c +++ b/fs/fuse/virtio_fs.c @@ -776,8 +776,10 @@ static int virtio_fs_zero_page_range(struct dax_device *dax_dev, rc = dax_direct_access(dax_dev, pgoff, nr_pages, &kaddr, NULL); if (rc < 0) return rc; + dax_mk_readwrite(dax_dev); memset(kaddr, 0, nr_pages << PAGE_SHIFT); dax_flush(dax_dev, kaddr, nr_pages << PAGE_SHIFT); + dax_mk_noaccess(dax_dev); return 0; } diff --git a/include/linux/dax.h b/include/linux/dax.h index b52f084aa643..8ad4839705ca 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -36,6 +36,10 @@ struct dax_operations { struct iov_iter *); /* zero_page_range: required operation. 
Zero page range */ int (*zero_page_range)(struct dax_device *, pgoff_t, size_t); + + bool (*map_protected)(struct dax_device *dax_dev); + void (*mk_readwrite)(struct dax_device *dax_dev); + void (*mk_noaccess)(struct dax_device *dax_dev); }; extern struct attribute_group dax_attribute_group; @@ -228,6 +232,10 @@ int dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff, size_t nr_pages); void dax_flush(struct dax_device *dax_dev, void *addr, size_t size); +bool dax_map_protected(struct dax_device *dax_dev); +void dax_mk_readwrite(struct dax_device *dax_dev); +void dax_mk_noaccess(struct dax_device *dax_dev); + ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter, const struct iomap_ops *ops); vm_fault_t dax_iomap_fault(struct vm_fault *vmf, enum page_entry_size pe_size, From patchwork Wed Aug 4 04:32:30 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ira Weiny X-Patchwork-Id: 12417799 Received: from mga05.intel.com (mga05.intel.com [192.55.52.43]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id D41AF34AB for ; Wed, 4 Aug 2021 04:32:45 +0000 (UTC) X-IronPort-AV: E=McAfee;i="6200,9189,10065"; a="299433083" X-IronPort-AV: E=Sophos;i="5.84,293,1620716400"; d="scan'208";a="299433083" Received: from fmsmga003.fm.intel.com ([10.253.24.29]) by fmsmga105.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Aug 2021 21:32:39 -0700 X-IronPort-AV: E=Sophos;i="5.84,293,1620716400"; d="scan'208";a="511702726" Received: from iweiny-desk2.sc.intel.com (HELO localhost) ([10.3.52.147]) by fmsmga003-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Aug 2021 21:32:39 -0700 From: ira.weiny@intel.com To: Dave Hansen , Dan Williams Cc: Ira Weiny , Thomas Gleixner , Ingo Molnar , Borislav Petkov , Peter Zijlstra , Andy Lutomirski , "H. Peter Anvin" , Fenghua Yu , Rick Edgecombe , x86@kernel.org, linux-kernel@vger.kernel.org, nvdimm@lists.linux.dev, linux-mm@kvack.org Subject: [PATCH V7 17/18] nvdimm/pmem: Enable stray access protection Date: Tue, 3 Aug 2021 21:32:30 -0700 Message-Id: <20210804043231.2655537-18-ira.weiny@intel.com> X-Mailer: git-send-email 2.28.0.rc0.12.gb6a658bd00c9 In-Reply-To: <20210804043231.2655537-1-ira.weiny@intel.com> References: <20210804043231.2655537-1-ira.weiny@intel.com> Precedence: bulk X-Mailing-List: nvdimm@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: Ira Weiny Now that all potential / valid kernel initiated access' to PMEM have been annotated with {__}pgmap_mk_{readwrite,noaccess}(), turn on PGMAP_PROTECTION. Implement the dax_protected which communicates this memory has extra protection. Also implement pmem_mk_{readwrite,noaccess}() to relax those protections for valid users. Internally, the pmem driver uses a cached virtual address, pmem->virt_addr (pmem_addr). Call __pgmap_mk_{readwrite,noaccess}() directly when PGMAP_PROTECTION is active on the device. Signed-off-by: Ira Weiny --- Changes for V7 Remove global param Add internal structure which uses the pmem device and pgmap device directly in the *_mk_*() calls. 
Add pmem dax ops callbacks Use pgmap_protection_enabled() s/PGMAP_PKEY_PROTECT/PGMAP_PROTECTION --- drivers/nvdimm/pmem.c | 55 ++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 54 insertions(+), 1 deletion(-) diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c index 1e0615b8565e..6e924b907264 100644 --- a/drivers/nvdimm/pmem.c +++ b/drivers/nvdimm/pmem.c @@ -138,6 +138,18 @@ static blk_status_t read_pmem(struct page *page, unsigned int off, return BLK_STS_OK; } +static void __pmem_mk_readwrite(struct pmem_device *pmem) +{ + if (pmem->pgmap.flags & PGMAP_PROTECTION) + __pgmap_mk_readwrite(&pmem->pgmap); +} + +static void __pmem_mk_noaccess(struct pmem_device *pmem) +{ + if (pmem->pgmap.flags & PGMAP_PROTECTION) + __pgmap_mk_noaccess(&pmem->pgmap); +} + static blk_status_t pmem_do_read(struct pmem_device *pmem, struct page *page, unsigned int page_off, sector_t sector, unsigned int len) @@ -149,7 +161,10 @@ static blk_status_t pmem_do_read(struct pmem_device *pmem, if (unlikely(is_bad_pmem(&pmem->bb, sector, len))) return BLK_STS_IOERR; + __pmem_mk_readwrite(pmem); rc = read_pmem(page, page_off, pmem_addr, len); + __pmem_mk_noaccess(pmem); + flush_dcache_page(page); return rc; } @@ -181,11 +196,14 @@ static blk_status_t pmem_do_write(struct pmem_device *pmem, * after clear poison. */ flush_dcache_page(page); + + __pmem_mk_readwrite(pmem); write_pmem(pmem_addr, page, page_off, len); if (unlikely(bad_pmem)) { rc = pmem_clear_poison(pmem, pmem_off, len); write_pmem(pmem_addr, page, page_off, len); } + __pmem_mk_noaccess(pmem); return rc; } @@ -320,6 +338,23 @@ static size_t pmem_copy_to_iter(struct dax_device *dax_dev, pgoff_t pgoff, return _copy_mc_to_iter(addr, bytes, i); } +static bool pmem_map_protected(struct dax_device *dax_dev) +{ + struct pmem_device *pmem = dax_get_private(dax_dev); + + return (pmem->pgmap.flags & PGMAP_PROTECTION); +} + +static void pmem_mk_readwrite(struct dax_device *dax_dev) +{ + __pmem_mk_readwrite(dax_get_private(dax_dev)); +} + +static void pmem_mk_noaccess(struct dax_device *dax_dev) +{ + __pmem_mk_noaccess(dax_get_private(dax_dev)); +} + static const struct dax_operations pmem_dax_ops = { .direct_access = pmem_dax_direct_access, .dax_supported = generic_fsdax_supported, @@ -328,6 +363,17 @@ static const struct dax_operations pmem_dax_ops = { .zero_page_range = pmem_dax_zero_page_range, }; +static const struct dax_operations pmem_protected_dax_ops = { + .direct_access = pmem_dax_direct_access, + .dax_supported = generic_fsdax_supported, + .copy_from_iter = pmem_copy_from_iter, + .copy_to_iter = pmem_copy_to_iter, + .zero_page_range = pmem_dax_zero_page_range, + .map_protected = pmem_map_protected, + .mk_readwrite = pmem_mk_readwrite, + .mk_noaccess = pmem_mk_noaccess, +}; + static const struct attribute_group *pmem_attribute_groups[] = { &dax_attribute_group, NULL, @@ -432,6 +478,8 @@ static int pmem_attach_disk(struct device *dev, if (is_nd_pfn(dev)) { pmem->pgmap.type = MEMORY_DEVICE_FS_DAX; pmem->pgmap.ops = &fsdax_pagemap_ops; + if (pgmap_protection_enabled()) + pmem->pgmap.flags |= PGMAP_PROTECTION; addr = devm_memremap_pages(dev, &pmem->pgmap); pfn_sb = nd_pfn->pfn_sb; pmem->data_offset = le64_to_cpu(pfn_sb->dataoff); @@ -446,6 +494,8 @@ static int pmem_attach_disk(struct device *dev, pmem->pgmap.nr_range = 1; pmem->pgmap.type = MEMORY_DEVICE_FS_DAX; pmem->pgmap.ops = &fsdax_pagemap_ops; + if (pgmap_protection_enabled()) + pmem->pgmap.flags |= PGMAP_PROTECTION; addr = devm_memremap_pages(dev, &pmem->pgmap); pmem->pfn_flags |= PFN_MAP; 
bb_range = pmem->pgmap.range; @@ -483,7 +533,10 @@ static int pmem_attach_disk(struct device *dev, if (is_nvdimm_sync(nd_region)) flags = DAXDEV_F_SYNC; - dax_dev = alloc_dax(pmem, disk->disk_name, &pmem_dax_ops, flags); + if (pmem->pgmap.flags & PGMAP_PROTECTION) + dax_dev = alloc_dax(pmem, disk->disk_name, &pmem_protected_dax_ops, flags); + else + dax_dev = alloc_dax(pmem, disk->disk_name, &pmem_dax_ops, flags); if (IS_ERR(dax_dev)) { return PTR_ERR(dax_dev); } From patchwork Wed Aug 4 04:32:31 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ira Weiny X-Patchwork-Id: 12417797 Received: from mga05.intel.com (mga05.intel.com [192.55.52.43]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 3E9DB34A1 for ; Wed, 4 Aug 2021 04:32:45 +0000 (UTC) X-IronPort-AV: E=McAfee;i="6200,9189,10065"; a="299433085" X-IronPort-AV: E=Sophos;i="5.84,293,1620716400"; d="scan'208";a="299433085" Received: from fmsmga003.fm.intel.com ([10.253.24.29]) by fmsmga105.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Aug 2021 21:32:39 -0700 X-IronPort-AV: E=Sophos;i="5.84,293,1620716400"; d="scan'208";a="511702729" Received: from iweiny-desk2.sc.intel.com (HELO localhost) ([10.3.52.147]) by fmsmga003-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Aug 2021 21:32:39 -0700 From: ira.weiny@intel.com To: Dave Hansen , Dan Williams Cc: Ira Weiny , Thomas Gleixner , Ingo Molnar , Borislav Petkov , Peter Zijlstra , Andy Lutomirski , "H. Peter Anvin" , Fenghua Yu , Rick Edgecombe , x86@kernel.org, linux-kernel@vger.kernel.org, nvdimm@lists.linux.dev, linux-mm@kvack.org Subject: [PATCH V7 18/18] devdax: Enable stray access protection Date: Tue, 3 Aug 2021 21:32:31 -0700 Message-Id: <20210804043231.2655537-19-ira.weiny@intel.com> X-Mailer: git-send-email 2.28.0.rc0.12.gb6a658bd00c9 In-Reply-To: <20210804043231.2655537-1-ira.weiny@intel.com> References: <20210804043231.2655537-1-ira.weiny@intel.com> Precedence: bulk X-Mailing-List: nvdimm@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: Ira Weiny Device dax is primarily accessed through user space. Kernel access is controlled through the kmap interfaces. Now that all valid kernel initiated access to dax devices have been accounted for with pgmap_mk_{readwrite,noaccess}() through kmap, turn on PGMAP_PKEYS_PROTECT for device dax. Signed-off-by: Ira Weiny --- Changes for V7 Use pgmap_protetion_enabled() s/PGMAP_PKEYS_PROTECT/PGMAP_PROTECTION/ --- drivers/dax/device.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/dax/device.c b/drivers/dax/device.c index dd8222a42808..cdf6ef4c1edb 100644 --- a/drivers/dax/device.c +++ b/drivers/dax/device.c @@ -426,6 +426,8 @@ int dev_dax_probe(struct dev_dax *dev_dax) } pgmap->type = MEMORY_DEVICE_GENERIC; + if (pgmap_protection_enabled()) + pgmap->flags |= PGMAP_PROTECTION; addr = devm_memremap_pages(dev, pgmap); if (IS_ERR(addr)) return PTR_ERR(addr);
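Taken together, patches 16-18 establish the following guard pattern around dax_direct_access(); a minimal sketch (variable names and the single memcpy() are illustrative, while the locking and ordering requirements come from the dax_mk_readwrite()/dax_mk_noaccess() kernel-doc above):

	id = dax_read_lock();
	rc = dax_direct_access(dax_dev, pgoff, nr_pages, &kaddr, &pfn);
	if (rc < 0) {
		dax_read_unlock(id);
		return rc;
	}

	/* Relax PKS protection only around the actual access to kaddr. */
	dax_mk_readwrite(dax_dev);
	memcpy(buf, kaddr, len);
	dax_mk_noaccess(dax_dev);

	dax_read_unlock(id);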