
[V4,09/10] x86/pks: Add PKS kernel API

Message ID 20210322053020.2287058-10-ira.weiny@intel.com
State New
Series PKS Add Protection Key Supervisor support

Commit Message

Ira Weiny March 22, 2021, 5:30 a.m. UTC
From: Fenghua Yu <fenghua.yu@intel.com>

PKS allows kernel users to define domains of page mappings which have
additional protections beyond the paging protections.

Add an API to allocate, use, and free a protection key which identifies
such a domain.  Export 5 new symbols: pks_key_alloc(), pks_mk_noaccess(),
pks_mk_readonly(), pks_mk_readwrite(), and pks_key_free().  Add 2 new
macros: PAGE_KERNEL_PKEY(pkey) and _PAGE_PKEY(pkey).

Update the protection key documentation to cover pkeys on supervisor
pages.

Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Co-developed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>

---
Changes from V3:
	From Dan Williams
		Remove flags from pks_key_alloc()
		Convert to ARCH_ENABLE_SUPERVISOR_PKEYS
		remove export of update_pkey_val()
		Update documentation
		change __clear_bit to clear_bit_unlock
		remove cpu_feature_enabled from pks_key_free
		remove pr_err stubs when CONFIG_HAS_SUPERVISOR_PKEYS=n
		clarify pks_key_alloc flags parameter with enum
	Update documentation for ARCH_ENABLE_SUPERVISOR_PKEYS
	No need to export write_pkrs
	Correct Kernel Doc for API functions
	From Randy Dunlap:
		Fix grammatical errors in doc

Changes from V2
	From Greg KH
		Replace all WARN_ON_ONCE() uses with pr_err()
	From Dan Williams
		Add __must_check to pks_key_alloc() to help ensure users
		are using the API correctly

Changes from V1
	Per Dave Hansen
		Add flags to pks_key_alloc() to help future proof the
		interface if/when the key space is exhausted.

Changes from RFC V3
	Per Dave Hansen
		Put WARN_ON_ONCE in pks_key_free()
		s/pks_mknoaccess/pks_mk_noaccess/
		s/pks_mkread/pks_mk_readonly/
		s/pks_mkrdwr/pks_mk_readwrite/
		Change return pks_key_alloc() to EOPNOTSUPP when not
			supported or configured
	Per Peter Zijlstra
		Remove unneeded preempt disable/enable
---
 Documentation/core-api/protection-keys.rst | 108 +++++++++++++---
 arch/x86/include/asm/pgtable_types.h       |  12 ++
 arch/x86/include/asm/pks.h                 |   4 +
 arch/x86/mm/pkeys.c                        | 137 ++++++++++++++++++++-
 include/linux/pgtable.h                    |   4 +
 include/linux/pkeys.h                      |  17 +++
 6 files changed, 263 insertions(+), 19 deletions(-)

Patch

diff --git a/Documentation/core-api/protection-keys.rst b/Documentation/core-api/protection-keys.rst
index ec575e72d0b2..6d6c4f25080c 100644
--- a/Documentation/core-api/protection-keys.rst
+++ b/Documentation/core-api/protection-keys.rst
@@ -4,25 +4,30 @@ 
 Memory Protection Keys
 ======================
 
-Memory Protection Keys for Userspace (PKU aka PKEYs) is a feature
-which is found on Intel's Skylake (and later) "Scalable Processor"
-Server CPUs. It will be available in future non-server Intel parts
-and future AMD processors.
+Memory Protection Keys provide a mechanism for enforcing page-based
+protections, but without requiring modification of the page tables
+when an application changes protection domains.
 
-For anyone wishing to test or use this feature, it is available in
-Amazon's EC2 C5 instances and is known to work there using an Ubuntu
-17.04 image.
+Protection Keys for Userspace (PKU) is a feature found on Intel's Skylake
+"Scalable Processor" Server CPUs and later.  It will also be available in
+future non-server Intel parts and future AMD processors.
 
-Memory Protection Keys provides a mechanism for enforcing page-based
-protections, but without requiring modification of the page tables
-when an application changes protection domains.  It works by
-dedicating 4 previously ignored bits in each page table entry to a
-"protection key", giving 16 possible keys.
+Protection Keys for Supervisor pages (PKS) has been documented in the SDM
+since the May 2020 edition.
+
+pkeys work by dedicating 4 previously Reserved bits in each page table entry to
+a "protection key", giving 16 possible keys.  User and Supervisor pages are
+treated separately.
 
-There is also a new user-accessible register (PKRU) with two separate
-bits (Access Disable and Write Disable) for each key.  Being a CPU
-register, PKRU is inherently thread-local, potentially giving each
-thread a different set of protections from every other thread.
+Protections for each page are controlled with per-CPU registers, one for each
+type of page (User and Supervisor).  Each of these 32-bit registers stores two
+separate bits (Access Disable and Write Disable) for each key.
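+
+Within each register, the Access Disable bit for key 'i' is bit 2*i and the
+Write Disable bit is bit 2*i+1.  Users do not normally compute these bits by
+hand; the interfaces described below handle it.  Purely as an illustration of
+the layout (``val`` here is a hypothetical register value)::
+
+        /* Disable all access (AD and WD) for key 5 */
+        val |= (PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE) << (5 * 2);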
+
+For Userspace the register is user-accessible (rdpkru/wrpkru).  For
+Supervisor, the register (MSR_IA32_PKRS) is accessible only to the kernel.
+
+Being CPU registers, the pkey protections are inherently thread-local,
+potentially giving each thread an independent set of protections from every
+other thread.
 
 There are two new instructions (RDPKRU/WRPKRU) for reading and writing
 to the new register.  The feature is only available in 64-bit mode,
@@ -30,8 +35,11 @@  even though there is theoretically space in the PAE PTEs.  These
 permissions are enforced on data access only and have no effect on
 instruction fetches.
 
-Syscalls
-========
+For kernel space, rdmsr/wrmsr are used to access the supervisor register
+(MSR_IA32_PKRS).
+
+
+Syscalls for user space keys
+============================
 
 There are 3 system calls which directly interact with pkeys::
 
@@ -98,3 +106,67 @@  with a read()::
 The kernel will send a SIGSEGV in both cases, but si_code will be set
 to SEGV_PKERR when violating protection keys versus SEGV_ACCERR when
 the plain mprotect() permissions are violated.
+
+
+Kernel API for PKS support
+==========================
+
+Similar to user space pkeys, supervisor pkeys allow additional protections to
+be defined for supervisor mappings.
+
+The following interface is used to allocate, use, and free a pkey which defines
+a 'protection domain' within the kernel.  Setting a pkey value in a supervisor
+PTE adds this additional protection to the page.
+
+Kernel users intending to use PKS support should depend on
+ARCH_HAS_SUPERVISOR_PKEYS, and add their config option to
+ARCH_ENABLE_SUPERVISOR_PKEYS to turn on this support within the core.
+
+        int pks_key_alloc(const char * const pkey_user);
+        #define PAGE_KERNEL_PKEY(pkey)
+        #define _PAGE_PKEY(pkey)
+        void pks_mk_noaccess(int pkey);
+        void pks_mk_readonly(int pkey);
+        void pks_mk_readwrite(int pkey);
+        void pks_key_free(int pkey);
+
+pks_key_alloc() allocates keys dynamically to allow better use of the limited
+key space.
+
+Callers of pks_key_alloc() _must_ be prepared for it to fail and take
+appropriate action.  This is mainly because PKS may not be available on all
+architectures.  Using the rest of the API without a valid key returned from
+pks_key_alloc() is undefined.
+
+Keys are allocated with 'No Access' permissions.  If other permissions are
+required before the pkey is used, the pks_mk*() family of calls, documented
+below, can be used prior to setting the pkey within the page table entries.
+
+Kernel users must set the pkey in the page table entries for the mappings they
+want to protect.  This can be done with PAGE_KERNEL_PKEY() or _PAGE_PKEY().
+
+The pks_mk*() family of calls allows kernel users to change the protections
+for the domain identified by the pkey parameter.  Three states are available:
+pks_mk_noaccess(), pks_mk_readonly(), and pks_mk_readwrite(), which set the
+access to none, read-only, and read/write respectively.
+
+Finally, pks_key_free() allows a user to return the key to the allocator for
+use by others.
+
+The interface maintains pks_mk_noaccess() (Access Disabled (AD=1)) for all
+keys not currently allocated.  Therefore, the user can depend on access being
+disabled when pks_key_alloc() returns a key.  The user should remove mappings
+from the domain (remove the pkey from the PTEs) prior to calling
+pks_key_free().
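+
+Putting it all together, a kernel user might use the API roughly as follows.
+This is an illustrative sketch only: the mapping is established here with
+vmap(), and names such as ``pages``, ``nr_pages``, ``data`` and ``len`` are
+hypothetical (error handling is also elided)::
+
+        int pkey;
+        void *addr;
+
+        pkey = pks_key_alloc("my driver");
+        if (pkey < 0)
+                return pkey;    /* PKS not available or keys exhausted */
+
+        /* Tag the mapping with the key; the domain starts as 'No Access' */
+        addr = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL_PKEY(pkey));
+
+        /* Open the domain for this thread, touch the data, close it again */
+        pks_mk_readwrite(pkey);
+        memcpy(addr, data, len);
+        pks_mk_noaccess(pkey);
+
+        /* Remove the mapping from the domain before returning the key */
+        vunmap(addr);
+        pks_key_free(pkey);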
+
+It should be noted that the underlying WRMSR(MSR_IA32_PKRS) is not serializing
+but still maintains ordering properties similar to WRPKRU.  Thus it is safe to
+immediately use a mapping when the pks_mk*() functions return.
+
+Older versions of the SDM on PKRS may be wrong with regard to this
+serialization.  The text should be the same as that of WRPKRU.  From the WRPKRU
+text:
+
+	WRPKRU will never execute transiently. Memory accesses
+	affected by PKRU register will not execute (even transiently)
+	until all prior executions of WRPKRU have completed execution
+	and updated the PKRU register.
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index f24d7ef8fffa..a3cb274351d9 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -73,6 +73,12 @@ 
 			 _PAGE_PKEY_BIT2 | \
 			 _PAGE_PKEY_BIT3)
 
+#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
+#define _PAGE_PKEY(pkey)	(_AT(pteval_t, pkey) << _PAGE_BIT_PKEY_BIT0)
+#else
+#define _PAGE_PKEY(pkey)	(_AT(pteval_t, 0))
+#endif
+
 #if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
 #define _PAGE_KNL_ERRATUM_MASK (_PAGE_DIRTY | _PAGE_ACCESSED)
 #else
@@ -228,6 +234,12 @@  enum page_cache_mode {
 #define PAGE_KERNEL_IO		__pgprot_mask(__PAGE_KERNEL_IO)
 #define PAGE_KERNEL_IO_NOCACHE	__pgprot_mask(__PAGE_KERNEL_IO_NOCACHE)
 
+#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
+#define PAGE_KERNEL_PKEY(pkey)	__pgprot_mask(__PAGE_KERNEL | _PAGE_PKEY(pkey))
+#else
+#define PAGE_KERNEL_PKEY(pkey) PAGE_KERNEL
+#endif
+
 #endif	/* __ASSEMBLY__ */
 
 /*         xwr */
diff --git a/arch/x86/include/asm/pks.h b/arch/x86/include/asm/pks.h
index bfa638e17620..4891c9aa8fc7 100644
--- a/arch/x86/include/asm/pks.h
+++ b/arch/x86/include/asm/pks.h
@@ -4,6 +4,10 @@ 
 
 #ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
 
+/*  PKS supports 16 keys. Key 0 is reserved for the kernel. */
+#define        PKS_KERN_DEFAULT_KEY    0
+#define        PKS_NUM_KEYS            16
+
 struct extended_pt_regs {
 	u32 thread_pkrs;
 	/* Keep stack 8 byte aligned */
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index f6a3a54b8d7d..47d29707ac39 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -3,6 +3,9 @@ 
  * Intel Memory Protection Keys management
  * Copyright (c) 2015, Intel Corporation.
  */
+#undef pr_fmt
+#define pr_fmt(fmt) "x86/pkeys: " fmt
+
 #include <linux/debugfs.h>		/* debugfs_create_u32()		*/
 #include <linux/mm_types.h>             /* mm_struct, vma, etc...       */
 #include <linux/pkeys.h>                /* PKEY_*                       */
@@ -11,6 +14,7 @@ 
 #include <asm/cpufeature.h>             /* boot_cpu_has, ...            */
 #include <asm/mmu_context.h>            /* vma_pkey()                   */
 #include <asm/fpu/internal.h>		/* init_fpstate			*/
+#include <asm/pks.h>
 
 int __execute_only_pkey(struct mm_struct *mm)
 {
@@ -276,4 +280,135 @@  void setup_pks(void)
 	cr4_set_bits(X86_CR4_PKS);
 }
 
-#endif
+/*
+ * Do not call this directly, see pks_mk*() below.
+ *
+ * @pkey: Key for the domain to change
+ * @protection: protection bits to be used
+ *
+ * Protection utilizes the same protection bits specified for User pkeys
+ *     PKEY_DISABLE_ACCESS
+ *     PKEY_DISABLE_WRITE
+ *
+ */
+static inline void pks_update_protection(int pkey, unsigned long protection)
+{
+	current->thread.saved_pkrs = update_pkey_val(current->thread.saved_pkrs,
+						     pkey, protection);
+	write_pkrs(current->thread.saved_pkrs);
+}
+
+/**
+ * pks_mk_noaccess() - Disable all access to the domain
+ * @pkey: the pkey for which the access should change.
+ *
+ * Disable all access to the domain specified by pkey.  This is a global
+ * update and only affects the currently running thread.
+ *
+ * It is a bug for users to call this without a valid pkey returned from
+ * pks_key_alloc()
+ */
+void pks_mk_noaccess(int pkey)
+{
+	pks_update_protection(pkey, PKEY_DISABLE_ACCESS);
+}
+EXPORT_SYMBOL_GPL(pks_mk_noaccess);
+
+/**
+ * pks_mk_readonly() - Make the domain Read only
+ * @pkey: the pkey for which the access should change.
+ *
+ * Allow read access to the domain specified by pkey.  This is a global update
+ * and only affects the currently running thread.
+ *
+ * It is a bug for users to call this without a valid pkey returned from
+ * pks_key_alloc()
+ */
+void pks_mk_readonly(int pkey)
+{
+	pks_update_protection(pkey, PKEY_DISABLE_WRITE);
+}
+EXPORT_SYMBOL_GPL(pks_mk_readonly);
+
+/**
+ * pks_mk_readwrite() - Make the domain Read/Write
+ * @pkey: the pkey for which the access should change.
+ *
+ * Allow all access, read and write, to the domain specified by pkey.  This is
+ * a global update and only affects the currently running thread.
+ *
+ * It is a bug for users to call this without a valid pkey returned from
+ * pks_key_alloc()
+ */
+void pks_mk_readwrite(int pkey)
+{
+	pks_update_protection(pkey, 0);
+}
+EXPORT_SYMBOL_GPL(pks_mk_readwrite);
+
+static const char pks_key_user0[] = "kernel";
+
+/* Store names of allocated keys for debug.  Key 0 is reserved for the kernel.  */
+static const char *pks_key_users[PKS_NUM_KEYS] = {
+	pks_key_user0
+};
+
+/*
+ * Each key is represented by a bit.  Bit 0 is set for key 0 and reserved for
+ * its use.  We use ulong for the bit operations but only 16 bits are used.
+ */
+static unsigned long pks_key_allocation_map = 1 << PKS_KERN_DEFAULT_KEY;
+
+/**
+ * pks_key_alloc() - Allocate a PKS key
+ * @pkey_user: String stored for debugging of key exhaustion.  The caller is
+ *             responsible for maintaining this memory until pks_key_free().
+ *
+ * Return: pkey on success
+ *         -EOPNOTSUPP if pks is not supported or not enabled
+ *         -ENOSPC if no keys are available
+ */
+__must_check int pks_key_alloc(const char * const pkey_user)
+{
+	int nr;
+
+	if (!cpu_feature_enabled(X86_FEATURE_PKS))
+		return -EOPNOTSUPP;
+
+	while (1) {
+		nr = find_first_zero_bit(&pks_key_allocation_map, PKS_NUM_KEYS);
+		if (nr >= PKS_NUM_KEYS) {
+			pr_info("Cannot allocate supervisor key for %s.\n",
+				pkey_user);
+			return -ENOSPC;
+		}
+		if (!test_and_set_bit_lock(nr, &pks_key_allocation_map))
+			break;
+	}
+
+	/* for debugging key exhaustion */
+	pks_key_users[nr] = pkey_user;
+
+	return nr;
+}
+EXPORT_SYMBOL_GPL(pks_key_alloc);
+
+/**
+ * pks_key_free() - Free a previously allocated PKS key
+ * @pkey: Key to be freed
+ */
+void pks_key_free(int pkey)
+{
+	if (pkey >= PKS_NUM_KEYS || pkey <= PKS_KERN_DEFAULT_KEY) {
+		pr_err("Invalid PKey value: %d\n", pkey);
+		return;
+	}
+
+	/* Restore to default of no access */
+	pks_mk_noaccess(pkey);
+	pks_key_users[pkey] = NULL;
+	clear_bit_unlock(pkey, &pks_key_allocation_map);
+}
+EXPORT_SYMBOL_GPL(pks_key_free);
+
+#endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index cdfc4e9f253e..1e5f4a253e82 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1460,6 +1460,10 @@  static inline bool arch_has_pfn_modify_check(void)
 # define PAGE_KERNEL_EXEC PAGE_KERNEL
 #endif
 
+#ifndef PAGE_KERNEL_PKEY
+#define PAGE_KERNEL_PKEY(pkey) PAGE_KERNEL
+#endif
+
 /*
  * Page Table Modification bits for pgtbl_mod_mask.
  *
diff --git a/include/linux/pkeys.h b/include/linux/pkeys.h
index a3d17a8e4e81..6659404af876 100644
--- a/include/linux/pkeys.h
+++ b/include/linux/pkeys.h
@@ -56,6 +56,13 @@  static inline void copy_init_pkru_to_fpregs(void)
 void pkrs_save_set_irq(struct pt_regs *regs, u32 val);
 void pkrs_restore_irq(struct pt_regs *regs);
 
+__must_check int pks_key_alloc(const char *const pkey_user);
+void pks_key_free(int pkey);
+
+void pks_mk_noaccess(int pkey);
+void pks_mk_readonly(int pkey);
+void pks_mk_readwrite(int pkey);
+
 #else /* !CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
 
 #ifndef INIT_PKRS_VALUE
@@ -65,6 +72,16 @@  void pkrs_restore_irq(struct pt_regs *regs);
 static inline void pkrs_save_set_irq(struct pt_regs *regs, u32 val) { }
 static inline void pkrs_restore_irq(struct pt_regs *regs) { }
 
+static inline __must_check int pks_key_alloc(const char * const pkey_user)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline void pks_key_free(int pkey) {}
+static inline void pks_mk_noaccess(int pkey) {}
+static inline void pks_mk_readonly(int pkey) {}
+static inline void pks_mk_readwrite(int pkey) {}
+
 #endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
 
 #endif /* _LINUX_PKEYS_H */