From patchwork Fri Sep 18 19:21:06 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Yu-cheng Yu X-Patchwork-Id: 11785795 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 5D227746 for ; Fri, 18 Sep 2020 19:22:33 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 0D95C23119 for ; Fri, 18 Sep 2020 19:22:33 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 0D95C23119 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=intel.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id C8AB66B006C; Fri, 18 Sep 2020 15:22:20 -0400 (EDT) Delivered-To: linux-mm-outgoing@kvack.org Received: by kanga.kvack.org (Postfix, from userid 40) id BBF226B006E; Fri, 18 Sep 2020 15:22:20 -0400 (EDT) X-Original-To: int-list-linux-mm@kvack.org X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 8DA248E0001; Fri, 18 Sep 2020 15:22:20 -0400 (EDT) X-Original-To: linux-mm@kvack.org X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0149.hostedemail.com [216.40.44.149]) by kanga.kvack.org (Postfix) with ESMTP id 71A546B0068 for ; Fri, 18 Sep 2020 15:22:20 -0400 (EDT) Received: from smtpin02.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay01.hostedemail.com (Postfix) with ESMTP id 3EC69180AD806 for ; Fri, 18 Sep 2020 19:22:20 +0000 (UTC) X-FDA: 77277153240.02.death64_59163382712d Received: from filter.hostedemail.com (10.5.16.251.rfc1918.com [10.5.16.251]) by smtpin02.hostedemail.com (Postfix) with ESMTP id 21379100519FA for ; Fri, 18 Sep 2020 19:22:20 +0000 (UTC) X-Spam-Summary: 1,0,0,,d41d8cd98f00b204,yu-cheng.yu@intel.com,,RULES_HIT:4423:30045:30051:30054:30056:30064:30070:30079:30091,0,RBL:134.134.136.20:@intel.com:.lbl8.mailshell.net-62.50.0.100 64.95.201.95;04yrwxunjm8s5a7xysued9kx6sc39ycgprfso47on515nc9hi5qpejgzj7xgjg8.huqyd474d4wb6qiosjixutd1quru58yrpwimmj6bnysecdo1hi1rqartemoyni5.6-lbl8.mailshell.net-223.238.255.100,CacheIP:none,Bayesian:0.5,0.5,0.5,Netcheck:none,DomainCache:0,MSF:not bulk,SPF:ft,MSBL:0,DNSBL:neutral,Custom_rules:0:0:0,LFtime:25,LUA_SUMMARY:none X-HE-Tag: death64_59163382712d X-Filterd-Recvd-Size: 15371 Received: from mga02.intel.com (mga02.intel.com [134.134.136.20]) by imf10.hostedemail.com (Postfix) with ESMTP for ; Fri, 18 Sep 2020 19:22:19 +0000 (UTC) IronPort-SDR: hmlKJlHbFle2+XG/yUFCqCXiwzvwvd8vXI6397enPPe2/unw/fJMgXEyOQbDL2i08BUxCO+Hjm izpKaRwNUF4Q== X-IronPort-AV: E=McAfee;i="6000,8403,9748"; a="147696128" X-IronPort-AV: E=Sophos;i="5.77,274,1596524400"; d="scan'208";a="147696128" X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from orsmga005.jf.intel.com ([10.7.209.41]) by orsmga101.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 18 Sep 2020 12:22:16 -0700 IronPort-SDR: bL7QBOpuu9UTMMuYWZ7wPYaRFajWMFfYa7yE54MI7kyJaVorzyprZBhDbFUfZIA4v0AcAAYVVz ugiZ2asIU5jA== X-IronPort-AV: E=Sophos;i="5.77,274,1596524400"; d="scan'208";a="484331830" Received: from yyu32-desk.sc.intel.com ([143.183.136.146]) by orsmga005-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 18 Sep 2020 12:22:15 -0700 From: Yu-cheng Yu To: x86@kernel.org, "H. Peter Anvin" , Thomas Gleixner , Ingo Molnar , linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-mm@kvack.org, linux-arch@vger.kernel.org, linux-api@vger.kernel.org, Arnd Bergmann , Andy Lutomirski , Balbir Singh , Borislav Petkov , Cyrill Gorcunov , Dave Hansen , Eugene Syromiatnikov , Florian Weimer , "H.J. Lu" , Jann Horn , Jonathan Corbet , Kees Cook , Mike Kravetz , Nadav Amit , Oleg Nesterov , Pavel Machek , Peter Zijlstra , Randy Dunlap , "Ravi V. Shankar" , Vedvyas Shanbhogue , Dave Martin , Weijiang Yang Cc: Yu-cheng Yu Subject: [PATCH v12 08/26] x86/mm: Introduce _PAGE_COW Date: Fri, 18 Sep 2020 12:21:06 -0700 Message-Id: <20200918192125.25473-9-yu-cheng.yu@intel.com> X-Mailer: git-send-email 2.21.0 In-Reply-To: <20200918192125.25473-1-yu-cheng.yu@intel.com> References: <20200918192125.25473-1-yu-cheng.yu@intel.com> MIME-Version: 1.0 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: There is essentially no room left in the x86 hardware PTEs on some OSes (not Linux). That left the hardware architects looking for a way to represent a new memory type (shadow stack) within the existing bits. They chose to repurpose a lightly-used state: Write=0,Dirty=1. The reason it's lightly used is that Dirty=1 is normally set by hardware and cannot normally be set by hardware on a Write=0 PTE. Software must normally be involved to create one of these PTEs, so software can simply opt to not create them. But that leaves us with a Linux problem: we need to ensure we never create Write=0,Dirty=1 PTEs. In places where we do create them, we need to find an alternative way to represent them _without_ using the same hardware bit combination. Thus, enter _PAGE_COW. This results in the following: (a) A modified, copy-on-write (COW) page: (R/O + _PAGE_COW) (b) A R/O page that has been COW'ed: (R/O + _PAGE_COW) The user page is in a R/O VMA, and get_user_pages() needs a writable copy. The page fault handler creates a copy of the page and sets the new copy's PTE as R/O and _PAGE_COW. (c) A shadow stack PTE: (R/O + _PAGE_DIRTY_HW) (d) A shared shadow stack PTE: (R/O + _PAGE_COW) When a shadow stack page is being shared among processes (this happens at fork()), its PTE is cleared of _PAGE_DIRTY_HW, so the next shadow stack access causes a fault, and the page is duplicated and _PAGE_DIRTY_HW is set again. This is the COW equivalent for shadow stack pages, even though it's copy-on-access rather than copy-on-write. (e) A page where the processor observed a Write=1 PTE, started a write, set Dirty=1, but then observed a Write=0 PTE. That's possible today, but will not happen on processors that support shadow stack. Use _PAGE_COW in pte_wrprotect() and _PAGE_DIRTY_HW in pte_mkwrite(). Apply the same changes to pmd and pud. When this patch is applied, there are six free bits left in the 64-bit PTE. There are no more free bits in the 32-bit PTE (except for PAE) and shadow stack is not implemented for the 32-bit kernel. Signed-off-by: Yu-cheng Yu Reviewed-by: Kees Cook --- v10: - Change _PAGE_BIT_DIRTY_SW to _PAGE_BIT_COW, as it is used for copy-on- write PTEs. - Update pte_write() and treat shadow stack as writable. - Change *_mkdirty_shstk() to *_mkwrite_shstk() as these make shadow stack pages writable. - Use bit test & shift to move _PAGE_BIT_DIRTY_HW to _PAGE_BIT_COW. - Change static_cpu_has() to cpu_feature_enabled(). - Revise commit log. v9: - Remove pte_move_flags() etc. and put the logic directly in pte_wrprotect()/pte_mkwrite() etc. - Change compile-time conditionals to run-time checks. - Split out pte_modify()/pmd_modify() to a new patch. - Update comments. arch/x86/include/asm/pgtable.h | 120 ++++++++++++++++++++++++--- arch/x86/include/asm/pgtable_types.h | 41 ++++++++- 2 files changed, 150 insertions(+), 11 deletions(-) diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index 86b7acd221c1..ac4ed814be96 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -122,9 +122,9 @@ extern pmdval_t early_pmd_flags; * The following only work if pte_present() is true. * Undefined behaviour if not.. */ -static inline int pte_dirty(pte_t pte) +static inline bool pte_dirty(pte_t pte) { - return pte_flags(pte) & _PAGE_DIRTY_HW; + return pte_flags(pte) & _PAGE_DIRTY_BITS; } @@ -161,9 +161,9 @@ static inline int pte_young(pte_t pte) return pte_flags(pte) & _PAGE_ACCESSED; } -static inline int pmd_dirty(pmd_t pmd) +static inline bool pmd_dirty(pmd_t pmd) { - return pmd_flags(pmd) & _PAGE_DIRTY_HW; + return pmd_flags(pmd) & _PAGE_DIRTY_BITS; } static inline int pmd_young(pmd_t pmd) @@ -171,9 +171,9 @@ static inline int pmd_young(pmd_t pmd) return pmd_flags(pmd) & _PAGE_ACCESSED; } -static inline int pud_dirty(pud_t pud) +static inline bool pud_dirty(pud_t pud) { - return pud_flags(pud) & _PAGE_DIRTY_HW; + return pud_flags(pud) & _PAGE_DIRTY_BITS; } static inline int pud_young(pud_t pud) @@ -183,6 +183,12 @@ static inline int pud_young(pud_t pud) static inline int pte_write(pte_t pte) { + /* + * If _PAGE_DIRTY_HW is set, the PTE must either have + * _PAGE_RW or be a shadow stack PTE, which is logically writable. + */ + if (cpu_feature_enabled(X86_FEATURE_SHSTK)) + return pte_flags(pte) & (_PAGE_RW | _PAGE_DIRTY_HW); return pte_flags(pte) & _PAGE_RW; } @@ -334,7 +340,7 @@ static inline pte_t pte_clear_uffd_wp(pte_t pte) static inline pte_t pte_mkclean(pte_t pte) { - return pte_clear_flags(pte, _PAGE_DIRTY_HW); + return pte_clear_flags(pte, _PAGE_DIRTY_BITS); } static inline pte_t pte_mkold(pte_t pte) @@ -344,6 +350,17 @@ static inline pte_t pte_mkold(pte_t pte) static inline pte_t pte_wrprotect(pte_t pte) { + /* + * Blindly clearing _PAGE_RW might accidentally create + * a shadow stack PTE (RW=0,Dirty=1). Move the hardware + * dirty value to the software bit. + */ + if (cpu_feature_enabled(X86_FEATURE_SHSTK)) { + pte.pte |= (pte.pte & _PAGE_DIRTY_HW) >> + _PAGE_BIT_DIRTY_HW << _PAGE_BIT_COW; + pte = pte_clear_flags(pte, _PAGE_DIRTY_HW); + } + return pte_clear_flags(pte, _PAGE_RW); } @@ -354,6 +371,18 @@ static inline pte_t pte_mkexec(pte_t pte) static inline pte_t pte_mkdirty(pte_t pte) { + pteval_t dirty = _PAGE_DIRTY_HW; + + /* Avoid creating (HW)Dirty=1,Write=0 PTEs */ + if (cpu_feature_enabled(X86_FEATURE_SHSTK) && !pte_write(pte)) + dirty = _PAGE_COW; + + return pte_set_flags(pte, dirty | _PAGE_SOFT_DIRTY); +} + +static inline pte_t pte_mkwrite_shstk(pte_t pte) +{ + pte = pte_clear_flags(pte, _PAGE_COW); return pte_set_flags(pte, _PAGE_DIRTY_HW | _PAGE_SOFT_DIRTY); } @@ -364,6 +393,13 @@ static inline pte_t pte_mkyoung(pte_t pte) static inline pte_t pte_mkwrite(pte_t pte) { + if (cpu_feature_enabled(X86_FEATURE_SHSTK)) { + if (pte_flags(pte) & _PAGE_COW) { + pte = pte_clear_flags(pte, _PAGE_COW); + pte = pte_set_flags(pte, _PAGE_DIRTY_HW); + } + } + return pte_set_flags(pte, _PAGE_RW); } @@ -435,16 +471,41 @@ static inline pmd_t pmd_mkold(pmd_t pmd) static inline pmd_t pmd_mkclean(pmd_t pmd) { - return pmd_clear_flags(pmd, _PAGE_DIRTY_HW); + return pmd_clear_flags(pmd, _PAGE_DIRTY_BITS); } static inline pmd_t pmd_wrprotect(pmd_t pmd) { + /* + * Blindly clearing _PAGE_RW might accidentally create + * a shadow stack PMD (RW=0,Dirty=1). Move the hardware + * dirty value to the software bit. + */ + if (cpu_feature_enabled(X86_FEATURE_SHSTK)) { + pmdval_t v = native_pmd_val(pmd); + + v |= (v & _PAGE_DIRTY_HW) >> _PAGE_BIT_DIRTY_HW << + _PAGE_BIT_COW; + pmd = pmd_clear_flags(__pmd(v), _PAGE_DIRTY_HW); + } + return pmd_clear_flags(pmd, _PAGE_RW); } static inline pmd_t pmd_mkdirty(pmd_t pmd) { + pmdval_t dirty = _PAGE_DIRTY_HW; + + /* Avoid creating (HW)Dirty=1,Write=0 PMDs */ + if (cpu_feature_enabled(X86_FEATURE_SHSTK) && !(pmd_flags(pmd) & _PAGE_RW)) + dirty = _PAGE_COW; + + return pmd_set_flags(pmd, dirty | _PAGE_SOFT_DIRTY); +} + +static inline pmd_t pmd_mkwrite_shstk(pmd_t pmd) +{ + pmd = pmd_clear_flags(pmd, _PAGE_COW); return pmd_set_flags(pmd, _PAGE_DIRTY_HW | _PAGE_SOFT_DIRTY); } @@ -465,6 +526,13 @@ static inline pmd_t pmd_mkyoung(pmd_t pmd) static inline pmd_t pmd_mkwrite(pmd_t pmd) { + if (cpu_feature_enabled(X86_FEATURE_SHSTK)) { + if (pmd_flags(pmd) & _PAGE_COW) { + pmd = pmd_clear_flags(pmd, _PAGE_COW); + pmd = pmd_set_flags(pmd, _PAGE_DIRTY_HW); + } + } + return pmd_set_flags(pmd, _PAGE_RW); } @@ -489,17 +557,36 @@ static inline pud_t pud_mkold(pud_t pud) static inline pud_t pud_mkclean(pud_t pud) { - return pud_clear_flags(pud, _PAGE_DIRTY_HW); + return pud_clear_flags(pud, _PAGE_DIRTY_BITS); } static inline pud_t pud_wrprotect(pud_t pud) { + /* + * Blindly clearing _PAGE_RW might accidentally create + * a shadow stack PUD (RW=0,Dirty=1). Move the hardware + * dirty value to the software bit. + */ + if (cpu_feature_enabled(X86_FEATURE_SHSTK)) { + pudval_t v = native_pud_val(pud); + + v |= (v & _PAGE_DIRTY_HW) >> _PAGE_BIT_DIRTY_HW << + _PAGE_BIT_COW; + pud = pud_clear_flags(__pud(v), _PAGE_DIRTY_HW); + } + return pud_clear_flags(pud, _PAGE_RW); } static inline pud_t pud_mkdirty(pud_t pud) { - return pud_set_flags(pud, _PAGE_DIRTY_HW | _PAGE_SOFT_DIRTY); + pudval_t dirty = _PAGE_DIRTY_HW; + + /* Avoid creating (HW)Dirty=1,Write=0 PUDs */ + if (cpu_feature_enabled(X86_FEATURE_SHSTK) && !(pud_flags(pud) & _PAGE_RW)) + dirty = _PAGE_COW; + + return pud_set_flags(pud, dirty | _PAGE_SOFT_DIRTY); } static inline pud_t pud_mkdevmap(pud_t pud) @@ -519,6 +606,13 @@ static inline pud_t pud_mkyoung(pud_t pud) static inline pud_t pud_mkwrite(pud_t pud) { + if (cpu_feature_enabled(X86_FEATURE_SHSTK)) { + if (pud_flags(pud) & _PAGE_COW) { + pud = pud_clear_flags(pud, _PAGE_COW); + pud = pud_set_flags(pud, _PAGE_DIRTY_HW); + } + } + return pud_set_flags(pud, _PAGE_RW); } @@ -1132,6 +1226,12 @@ extern int pmdp_clear_flush_young(struct vm_area_struct *vma, #define pmd_write pmd_write static inline int pmd_write(pmd_t pmd) { + /* + * If _PAGE_DIRTY_HW is set, then the PMD must either have + * _PAGE_RW or be a shadow stack PMD, which is logically writable. + */ + if (cpu_feature_enabled(X86_FEATURE_SHSTK)) + return pmd_flags(pmd) & (_PAGE_RW | _PAGE_DIRTY_HW); return pmd_flags(pmd) & _PAGE_RW; } diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h index 5f31f1c407b9..b57483567b8b 100644 --- a/arch/x86/include/asm/pgtable_types.h +++ b/arch/x86/include/asm/pgtable_types.h @@ -23,7 +23,8 @@ #define _PAGE_BIT_SOFTW2 10 /* " */ #define _PAGE_BIT_SOFTW3 11 /* " */ #define _PAGE_BIT_PAT_LARGE 12 /* On 2MB or 1GB pages */ -#define _PAGE_BIT_SOFTW4 58 /* available for programmer */ +#define _PAGE_BIT_SOFTW4 57 /* available for programmer */ +#define _PAGE_BIT_SOFTW5 58 /* available for programmer */ #define _PAGE_BIT_PKEY_BIT0 59 /* Protection Keys, bit 1/4 */ #define _PAGE_BIT_PKEY_BIT1 60 /* Protection Keys, bit 2/4 */ #define _PAGE_BIT_PKEY_BIT2 61 /* Protection Keys, bit 3/4 */ @@ -36,6 +37,16 @@ #define _PAGE_BIT_SOFT_DIRTY _PAGE_BIT_SOFTW3 /* software dirty tracking */ #define _PAGE_BIT_DEVMAP _PAGE_BIT_SOFTW4 +/* + * This bit indicates a copy-on-write page, and is different from + * _PAGE_BIT_SOFT_DIRTY, which tracks which pages a task writes to. + */ +#ifdef CONFIG_X86_64 +#define _PAGE_BIT_COW _PAGE_BIT_SOFTW5 /* copy-on-write */ +#else +#define _PAGE_BIT_COW 0 +#endif + /* If _PAGE_BIT_PRESENT is clear, we use these: */ /* - if the user mapped it with PROT_NONE; pte_present gives true */ #define _PAGE_BIT_PROTNONE _PAGE_BIT_GLOBAL @@ -117,6 +128,34 @@ #define _PAGE_DEVMAP (_AT(pteval_t, 0)) #endif +/* + * _PAGE_COW is used to separate R/O and copy-on-write PTEs created by + * software from the shadow stack PTE setting required by the hardware: + * (a) A modified, copy-on-write (COW) page: (R/O + _PAGE_COW) + * (b) A R/O page that has been COW'ed: (R/O +_PAGE_COW) + * The user page is in a R/O VMA, and get_user_pages() needs a + * writable copy. The page fault handler creates a copy of the page + * and sets the new copy's PTE as R/O and _PAGE_COW. + * (c) A shadow stack PTE: (R/O + _PAGE_DIRTY_HW) + * (d) A shared (copy-on-access) shadow stack PTE: (R/O + _PAGE_COW) + * When a shadow stack page is being shared among processes (this + * happens at fork()), its PTE is cleared of _PAGE_DIRTY_HW, so the + * next shadow stack access causes a fault, and the page is duplicated + * and _PAGE_DIRTY_HW is set again. This is the COW equivalent for + * shadow stack pages, even though it's copy-on-access rather than + * copy-on-write. + * (e) A page where the processor observed a Write=1 PTE, started a write, + * set Dirty=1, but then observed a Write=0 PTE. That's possible + * today, but will not happen on processors that support shadow stack. + */ +#ifdef CONFIG_X86_INTEL_SHADOW_STACK_USER +#define _PAGE_COW (_AT(pteval_t, 1) << _PAGE_BIT_COW) +#else +#define _PAGE_COW (_AT(pteval_t, 0)) +#endif + +#define _PAGE_DIRTY_BITS (_PAGE_DIRTY_HW | _PAGE_COW) + #define _PAGE_PROTNONE (_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE) /*