Message ID | 20210203225547.32221-9-yu-cheng.yu@intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Control-flow Enforcement: Shadow Stack | expand |
On Wed, Feb 03, 2021 at 02:55:30PM -0800, Yu-cheng Yu wrote: > There is essentially no room left in the x86 hardware PTEs on some OSes > (not Linux). That left the hardware architects looking for a way to > represent a new memory type (shadow stack) within the existing bits. > They chose to repurpose a lightly-used state: Write=0, Dirty=1. > > The reason it's lightly used is that Dirty=1 is normally set by hardware > and cannot normally be set by hardware on a Write=0 PTE. Software must > normally be involved to create one of these PTEs, so software can simply > opt to not create them. > > In places where Linux normally creates Write=0, Dirty=1, it can use the > software-defined _PAGE_COW in place of the hardware _PAGE_DIRTY. In other > words, whenever Linux needs to create Write=0, Dirty=1, it instead creates > Write=0, Cow=1, except for shadow stack, which is Write=0, Dirty=1. This > clearly separates shadow stack from other data, and results in the > following: > > (a) A modified, copy-on-write (COW) page: (Write=0, Cow=1) > (b) A R/O page that has been COW'ed: (Write=0, Cow=1) > The user page is in a R/O VMA, and get_user_pages() needs a writable > copy. The page fault handler creates a copy of the page and sets > the new copy's PTE as Write=0 and Cow=1. > (c) A shadow stack PTE: (Write=0, Dirty=1) > (d) A shared shadow stack PTE: (Write=0, Cow=1) > When a shadow stack page is being shared among processes (this happens > at fork()), its PTE is made Dirty=0, so the next shadow stack access > causes a fault, and the page is duplicated and Dirty=1 is set again. > This is the COW equivalent for shadow stack pages, even though it's > copy-on-access rather than copy-on-write. > (e) A page where the processor observed a Write=1 PTE, started a write, set > Dirty=1, but then observed a Write=0 PTE. That's possible today, but > will not happen on processors that support shadow stack. What happens for "e" with/without CET? It sounds like direct writes to such pages will be (correctly) rejected by the MMU? > > Define _PAGE_COW and update pte_*() helpers and apply the same changes to > pmd and pud. > > After this, there are six free bits left in the 64-bit PTE, and no more > free bits in the 32-bit PTE (except for PAE) and Shadow Stack is not > implemented for the 32-bit kernel. Are there selftests to validate this change? I think it might be useful to more clearly describe what is considered "dirty" and "writeable" in comments above the pte_helpers. -Kees > > Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com> > --- > arch/x86/include/asm/pgtable.h | 125 ++++++++++++++++++++++++--- > arch/x86/include/asm/pgtable_types.h | 42 ++++++++- > 2 files changed, 154 insertions(+), 13 deletions(-) > > diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h > index a02c67291cfc..4b0ec61510dc 100644 > --- a/arch/x86/include/asm/pgtable.h > +++ b/arch/x86/include/asm/pgtable.h > @@ -121,9 +121,9 @@ extern pmdval_t early_pmd_flags; > * The following only work if pte_present() is true. > * Undefined behaviour if not.. > */ > -static inline int pte_dirty(pte_t pte) > +static inline bool pte_dirty(pte_t pte) > { > - return pte_flags(pte) & _PAGE_DIRTY; > + return pte_flags(pte) & _PAGE_DIRTY_BITS; > } > > > @@ -160,9 +160,9 @@ static inline int pte_young(pte_t pte) > return pte_flags(pte) & _PAGE_ACCESSED; > } > > -static inline int pmd_dirty(pmd_t pmd) > +static inline bool pmd_dirty(pmd_t pmd) > { > - return pmd_flags(pmd) & _PAGE_DIRTY; > + return pmd_flags(pmd) & _PAGE_DIRTY_BITS; > } > > static inline int pmd_young(pmd_t pmd) > @@ -170,9 +170,9 @@ static inline int pmd_young(pmd_t pmd) > return pmd_flags(pmd) & _PAGE_ACCESSED; > } > > -static inline int pud_dirty(pud_t pud) > +static inline bool pud_dirty(pud_t pud) > { > - return pud_flags(pud) & _PAGE_DIRTY; > + return pud_flags(pud) & _PAGE_DIRTY_BITS; > } > > static inline int pud_young(pud_t pud) > @@ -182,7 +182,14 @@ static inline int pud_young(pud_t pud) > > static inline int pte_write(pte_t pte) > { > - return pte_flags(pte) & _PAGE_RW; > + /* > + * If _PAGE_DIRTY is set, the PTE must either have _PAGE_RW or be > + * a shadow stack PTE, which is logically writable. > + */ > + if (cpu_feature_enabled(X86_FEATURE_SHSTK)) > + return pte_flags(pte) & (_PAGE_RW | _PAGE_DIRTY); > + else > + return pte_flags(pte) & _PAGE_RW; > } > > static inline int pte_huge(pte_t pte) > @@ -333,7 +340,7 @@ static inline pte_t pte_clear_uffd_wp(pte_t pte) > > static inline pte_t pte_mkclean(pte_t pte) > { > - return pte_clear_flags(pte, _PAGE_DIRTY); > + return pte_clear_flags(pte, _PAGE_DIRTY_BITS); > } > > static inline pte_t pte_mkold(pte_t pte) > @@ -343,6 +350,18 @@ static inline pte_t pte_mkold(pte_t pte) > > static inline pte_t pte_wrprotect(pte_t pte) > { > + /* > + * Blindly clearing _PAGE_RW might accidentally create > + * a shadow stack PTE (RW=0, Dirty=1). Move the hardware > + * dirty value to the software bit. > + */ > + if (cpu_feature_enabled(X86_FEATURE_SHSTK)) { > + if (pte_flags(pte) & _PAGE_DIRTY) { > + pte = pte_clear_flags(pte, _PAGE_DIRTY); > + pte = pte_set_flags(pte, _PAGE_COW); > + } > + } > + > return pte_clear_flags(pte, _PAGE_RW); > } > > @@ -353,6 +372,18 @@ static inline pte_t pte_mkexec(pte_t pte) > > static inline pte_t pte_mkdirty(pte_t pte) > { > + pteval_t dirty = _PAGE_DIRTY; > + > + /* Avoid creating (HW)Dirty=1, Write=0 PTEs */ > + if (cpu_feature_enabled(X86_FEATURE_SHSTK) && !pte_write(pte)) > + dirty = _PAGE_COW; > + > + return pte_set_flags(pte, dirty | _PAGE_SOFT_DIRTY); > +} > + > +static inline pte_t pte_mkwrite_shstk(pte_t pte) > +{ > + pte = pte_clear_flags(pte, _PAGE_COW); > return pte_set_flags(pte, _PAGE_DIRTY | _PAGE_SOFT_DIRTY); > } > > @@ -363,6 +394,13 @@ static inline pte_t pte_mkyoung(pte_t pte) > > static inline pte_t pte_mkwrite(pte_t pte) > { > + if (cpu_feature_enabled(X86_FEATURE_SHSTK)) { > + if (pte_flags(pte) & _PAGE_COW) { > + pte = pte_clear_flags(pte, _PAGE_COW); > + pte = pte_set_flags(pte, _PAGE_DIRTY); > + } > + } > + > return pte_set_flags(pte, _PAGE_RW); > } > > @@ -434,16 +472,40 @@ static inline pmd_t pmd_mkold(pmd_t pmd) > > static inline pmd_t pmd_mkclean(pmd_t pmd) > { > - return pmd_clear_flags(pmd, _PAGE_DIRTY); > + return pmd_clear_flags(pmd, _PAGE_DIRTY_BITS); > } > > static inline pmd_t pmd_wrprotect(pmd_t pmd) > { > + /* > + * Blindly clearing _PAGE_RW might accidentally create > + * a shadow stack PMD (RW=0, Dirty=1). Move the hardware > + * dirty value to the software bit. > + */ > + if (cpu_feature_enabled(X86_FEATURE_SHSTK)) { > + if (pmd_flags(pmd) & _PAGE_DIRTY) { > + pmd = pmd_clear_flags(pmd, _PAGE_DIRTY); > + pmd = pmd_set_flags(pmd, _PAGE_COW); > + } > + } > + > return pmd_clear_flags(pmd, _PAGE_RW); > } > > static inline pmd_t pmd_mkdirty(pmd_t pmd) > { > + pmdval_t dirty = _PAGE_DIRTY; > + > + /* Avoid creating (HW)Dirty=1, Write=0 PMDs */ > + if (cpu_feature_enabled(X86_FEATURE_SHSTK) && !(pmd_flags(pmd) & _PAGE_RW)) > + dirty = _PAGE_COW; > + > + return pmd_set_flags(pmd, dirty | _PAGE_SOFT_DIRTY); > +} > + > +static inline pmd_t pmd_mkwrite_shstk(pmd_t pmd) > +{ > + pmd = pmd_clear_flags(pmd, _PAGE_COW); > return pmd_set_flags(pmd, _PAGE_DIRTY | _PAGE_SOFT_DIRTY); > } > > @@ -464,6 +526,13 @@ static inline pmd_t pmd_mkyoung(pmd_t pmd) > > static inline pmd_t pmd_mkwrite(pmd_t pmd) > { > + if (cpu_feature_enabled(X86_FEATURE_SHSTK)) { > + if (pmd_flags(pmd) & _PAGE_COW) { > + pmd = pmd_clear_flags(pmd, _PAGE_COW); > + pmd = pmd_set_flags(pmd, _PAGE_DIRTY); > + } > + } > + > return pmd_set_flags(pmd, _PAGE_RW); > } > > @@ -488,17 +557,35 @@ static inline pud_t pud_mkold(pud_t pud) > > static inline pud_t pud_mkclean(pud_t pud) > { > - return pud_clear_flags(pud, _PAGE_DIRTY); > + return pud_clear_flags(pud, _PAGE_DIRTY_BITS); > } > > static inline pud_t pud_wrprotect(pud_t pud) > { > + /* > + * Blindly clearing _PAGE_RW might accidentally create > + * a shadow stack PUD (RW=0, Dirty=1). Move the hardware > + * dirty value to the software bit. > + */ > + if (cpu_feature_enabled(X86_FEATURE_SHSTK)) { > + if (pud_flags(pud) & _PAGE_DIRTY) { > + pud = pud_clear_flags(pud, _PAGE_DIRTY); > + pud = pud_set_flags(pud, _PAGE_COW); > + } > + } > + > return pud_clear_flags(pud, _PAGE_RW); > } > > static inline pud_t pud_mkdirty(pud_t pud) > { > - return pud_set_flags(pud, _PAGE_DIRTY | _PAGE_SOFT_DIRTY); > + pudval_t dirty = _PAGE_DIRTY; > + > + /* Avoid creating (HW)Dirty=1, Write=0 PUDs */ > + if (cpu_feature_enabled(X86_FEATURE_SHSTK) && !(pud_flags(pud) & _PAGE_RW)) > + dirty = _PAGE_COW; > + > + return pud_set_flags(pud, dirty | _PAGE_SOFT_DIRTY); > } > > static inline pud_t pud_mkdevmap(pud_t pud) > @@ -518,6 +605,13 @@ static inline pud_t pud_mkyoung(pud_t pud) > > static inline pud_t pud_mkwrite(pud_t pud) > { > + if (cpu_feature_enabled(X86_FEATURE_SHSTK)) { > + if (pud_flags(pud) & _PAGE_COW) { > + pud = pud_clear_flags(pud, _PAGE_COW); > + pud = pud_set_flags(pud, _PAGE_DIRTY); > + } > + } > + > return pud_set_flags(pud, _PAGE_RW); > } > > @@ -1131,7 +1225,14 @@ extern int pmdp_clear_flush_young(struct vm_area_struct *vma, > #define pmd_write pmd_write > static inline int pmd_write(pmd_t pmd) > { > - return pmd_flags(pmd) & _PAGE_RW; > + /* > + * If _PAGE_DIRTY is set, then the PMD must either have _PAGE_RW or > + * be a shadow stack PMD, which is logically writable. > + */ > + if (cpu_feature_enabled(X86_FEATURE_SHSTK)) > + return pmd_flags(pmd) & (_PAGE_RW | _PAGE_DIRTY); > + else > + return pmd_flags(pmd) & _PAGE_RW; > } > > #define __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR > diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h > index b8b79d618bbc..437d7ff0ae80 100644 > --- a/arch/x86/include/asm/pgtable_types.h > +++ b/arch/x86/include/asm/pgtable_types.h > @@ -23,7 +23,8 @@ > #define _PAGE_BIT_SOFTW2 10 /* " */ > #define _PAGE_BIT_SOFTW3 11 /* " */ > #define _PAGE_BIT_PAT_LARGE 12 /* On 2MB or 1GB pages */ > -#define _PAGE_BIT_SOFTW4 58 /* available for programmer */ > +#define _PAGE_BIT_SOFTW4 57 /* available for programmer */ > +#define _PAGE_BIT_SOFTW5 58 /* available for programmer */ > #define _PAGE_BIT_PKEY_BIT0 59 /* Protection Keys, bit 1/4 */ > #define _PAGE_BIT_PKEY_BIT1 60 /* Protection Keys, bit 2/4 */ > #define _PAGE_BIT_PKEY_BIT2 61 /* Protection Keys, bit 3/4 */ > @@ -36,6 +37,15 @@ > #define _PAGE_BIT_SOFT_DIRTY _PAGE_BIT_SOFTW3 /* software dirty tracking */ > #define _PAGE_BIT_DEVMAP _PAGE_BIT_SOFTW4 > > +/* > + * Indicates a copy-on-write page. > + */ > +#ifdef CONFIG_X86_CET > +#define _PAGE_BIT_COW _PAGE_BIT_SOFTW5 /* copy-on-write */ > +#else > +#define _PAGE_BIT_COW 0 > +#endif > + > /* If _PAGE_BIT_PRESENT is clear, we use these: */ > /* - if the user mapped it with PROT_NONE; pte_present gives true */ > #define _PAGE_BIT_PROTNONE _PAGE_BIT_GLOBAL > @@ -117,6 +127,36 @@ > #define _PAGE_DEVMAP (_AT(pteval_t, 0)) > #endif > > +/* > + * The hardware requires shadow stack to be read-only and Dirty. > + * _PAGE_COW is a software-only bit used to separate copy-on-write PTEs > + * from shadow stack PTEs: > + * (a) A modified, copy-on-write (COW) page: (Write=0, Cow=1) > + * (b) A R/O page that has been COW'ed: (Write=0, Cow=1) > + * The user page is in a R/O VMA, and get_user_pages() needs a > + * writable copy. The page fault handler creates a copy of the page > + * and sets the new copy's PTE as Write=0, Cow=1. > + * (c) A shadow stack PTE: (Write=0, Dirty=1) > + * (d) A shared (copy-on-access) shadow stack PTE: (Write=0, Cow=1) > + * When a shadow stack page is being shared among processes (this > + * happens at fork()), its PTE is cleared of _PAGE_DIRTY, so the next > + * shadow stack access causes a fault, and the page is duplicated and > + * _PAGE_DIRTY is set again. This is the COW equivalent for shadow > + * stack pages, even though it's copy-on-access rather than > + * copy-on-write. > + * (e) A page where the processor observed a Write=1 PTE, started a write, > + * set Dirty=1, but then observed a Write=0 PTE (changed by another > + * thread). That's possible today, but will not happen on processors > + * that support shadow stack. > + */ > +#ifdef CONFIG_X86_CET > +#define _PAGE_COW (_AT(pteval_t, 1) << _PAGE_BIT_COW) > +#else > +#define _PAGE_COW (_AT(pteval_t, 0)) > +#endif > + > +#define _PAGE_DIRTY_BITS (_PAGE_DIRTY | _PAGE_COW) > + > #define _PAGE_PROTNONE (_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE) > > /* > -- > 2.21.0 >
On 2/4/21 12:19 PM, Kees Cook wrote: >> (e) A page where the processor observed a Write=1 PTE, started a write, set >> Dirty=1, but then observed a Write=0 PTE. That's possible today, but >> will not happen on processors that support shadow stack. > What happens for "e" with/without CET? It sounds like direct writes to > such pages will be (correctly) rejected by the MMU? A page fault would be generated regardless of CET support. If CET were not around, the fault would be reported as a present, write fault. If this happened and CET were around (which shouldn't happen in practice, it means we have a hardware issue) a page fault exception is generated. Yu-cheng, I'm not sure there's enough debugging around to tell us if this happens. Would we even notice?
On 2/4/2021 12:27 PM, Dave Hansen wrote: > On 2/4/21 12:19 PM, Kees Cook wrote: >>> (e) A page where the processor observed a Write=1 PTE, started a write, set >>> Dirty=1, but then observed a Write=0 PTE. That's possible today, but >>> will not happen on processors that support shadow stack. >> What happens for "e" with/without CET? It sounds like direct writes to >> such pages will be (correctly) rejected by the MMU? > > A page fault would be generated regardless of CET support. > > If CET were not around, the fault would be reported as a present, write > fault. > > If this happened and CET were around (which shouldn't happen in > practice, it means we have a hardware issue) a page fault exception is > generated. Thanks for the clarification. With or without CET, direct write to Write=0, Dirty=1 PTE triggers page fault. > Yu-cheng, I'm not sure there's enough debugging around to > tell us if this happens. Would we even notice? That potential hardware issue is, on a CET-capable system, a processor writes to a Write=1, Dirty=0 page, and then observes the PTE is Write=0, Dirty=1. Let me think about it... Thanks! -- Yu-cheng
On 2/4/2021 12:19 PM, Kees Cook wrote: > On Wed, Feb 03, 2021 at 02:55:30PM -0800, Yu-cheng Yu wrote: >> There is essentially no room left in the x86 hardware PTEs on some OSes >> (not Linux). That left the hardware architects looking for a way to >> represent a new memory type (shadow stack) within the existing bits. >> They chose to repurpose a lightly-used state: Write=0, Dirty=1. >> >> The reason it's lightly used is that Dirty=1 is normally set by hardware >> and cannot normally be set by hardware on a Write=0 PTE. Software must >> normally be involved to create one of these PTEs, so software can simply >> opt to not create them. >> >> In places where Linux normally creates Write=0, Dirty=1, it can use the >> software-defined _PAGE_COW in place of the hardware _PAGE_DIRTY. In other >> words, whenever Linux needs to create Write=0, Dirty=1, it instead creates >> Write=0, Cow=1, except for shadow stack, which is Write=0, Dirty=1. This >> clearly separates shadow stack from other data, and results in the >> following: >> >> (a) A modified, copy-on-write (COW) page: (Write=0, Cow=1) >> (b) A R/O page that has been COW'ed: (Write=0, Cow=1) >> The user page is in a R/O VMA, and get_user_pages() needs a writable >> copy. The page fault handler creates a copy of the page and sets >> the new copy's PTE as Write=0 and Cow=1. >> (c) A shadow stack PTE: (Write=0, Dirty=1) >> (d) A shared shadow stack PTE: (Write=0, Cow=1) >> When a shadow stack page is being shared among processes (this happens >> at fork()), its PTE is made Dirty=0, so the next shadow stack access >> causes a fault, and the page is duplicated and Dirty=1 is set again. >> This is the COW equivalent for shadow stack pages, even though it's >> copy-on-access rather than copy-on-write. >> (e) A page where the processor observed a Write=1 PTE, started a write, set >> Dirty=1, but then observed a Write=0 PTE. That's possible today, but >> will not happen on processors that support shadow stack. > > What happens for "e" with/without CET? It sounds like direct writes to > such pages will be (correctly) rejected by the MMU? > >> >> Define _PAGE_COW and update pte_*() helpers and apply the same changes to >> pmd and pud. >> >> After this, there are six free bits left in the 64-bit PTE, and no more >> free bits in the 32-bit PTE (except for PAE) and Shadow Stack is not >> implemented for the 32-bit kernel. > > Are there selftests to validate this change? > I have some tests to verify, for example, - After clone(), shadow stack pages are indeed copy-on-write, - Shadow stack pages (i.e. Write=0, Dirty=1) cannot be directly written to, - Shadow stack guard pages exist. These tests are now on github, but kind of messy. I can gradually clean up them and submit as selftests separately. If you are asking for the detection of the potential hardware issue (that Dave Hansen talked about), then maybe we need to detect it from the kernel. > I think it might be useful to more clearly describe what is considered > "dirty" and "writeable" in comments above the pte_helpers. > Yes, I will update it. Thanks! [...]
On 2/5/2021 10:41 AM, Yu, Yu-cheng wrote: > On 2/4/2021 12:27 PM, Dave Hansen wrote: >> On 2/4/21 12:19 PM, Kees Cook wrote: >>>> (e) A page where the processor observed a Write=1 PTE, started a >>>> write, set >>>> Dirty=1, but then observed a Write=0 PTE. That's possible >>>> today, but >>>> will not happen on processors that support shadow stack. >>> What happens for "e" with/without CET? It sounds like direct writes to >>> such pages will be (correctly) rejected by the MMU? >> >> A page fault would be generated regardless of CET support. >> >> If CET were not around, the fault would be reported as a present, write >> fault. >> >> If this happened and CET were around (which shouldn't happen in >> practice, it means we have a hardware issue) a page fault exception is >> generated. > > Thanks for the clarification. With or without CET, direct write to > Write=0, Dirty=1 PTE triggers page fault. > >> Yu-cheng, I'm not sure there's enough debugging around to >> tell us if this happens. Would we even notice? > > That potential hardware issue is, on a CET-capable system, a processor > writes to a Write=1, Dirty=0 page, and then observes the PTE is Write=0, > Dirty=1. Let me think about it... > One way to detect the potential issue is adding a check when a non-shadow stack page's PTE goes from RW=0 to RW=1, like the following... diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index 29aa6f07e3c9..241b94a0fa77 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -405,6 +405,8 @@ static inline pte_t pte_mkyoung(pte_t pte) static inline pte_t pte_mkwrite(pte_t pte) { if (cpu_feature_enabled(X86_FEATURE_SHSTK)) { + WARN_ONCE((pte_flags(pte) & (_PAGE_RW | _PAGE_DIRTY)) == + _PAGE_DIRTY, "Found transient shadow stack PTE\n"); if (pte_flags(pte) & _PAGE_COW) { pte = pte_clear_flags(pte, _PAGE_COW); pte = pte_set_flags(pte, _PAGE_DIRTY); I run all my routine stress tests with the changes, and do not see any warning triggered. If this change is desirable, we can probably add #ifdef CONFIG_DEBUG_VM around it and make it a separate patch. -- Yu-cheng
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index a02c67291cfc..4b0ec61510dc 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -121,9 +121,9 @@ extern pmdval_t early_pmd_flags; * The following only work if pte_present() is true. * Undefined behaviour if not.. */ -static inline int pte_dirty(pte_t pte) +static inline bool pte_dirty(pte_t pte) { - return pte_flags(pte) & _PAGE_DIRTY; + return pte_flags(pte) & _PAGE_DIRTY_BITS; } @@ -160,9 +160,9 @@ static inline int pte_young(pte_t pte) return pte_flags(pte) & _PAGE_ACCESSED; } -static inline int pmd_dirty(pmd_t pmd) +static inline bool pmd_dirty(pmd_t pmd) { - return pmd_flags(pmd) & _PAGE_DIRTY; + return pmd_flags(pmd) & _PAGE_DIRTY_BITS; } static inline int pmd_young(pmd_t pmd) @@ -170,9 +170,9 @@ static inline int pmd_young(pmd_t pmd) return pmd_flags(pmd) & _PAGE_ACCESSED; } -static inline int pud_dirty(pud_t pud) +static inline bool pud_dirty(pud_t pud) { - return pud_flags(pud) & _PAGE_DIRTY; + return pud_flags(pud) & _PAGE_DIRTY_BITS; } static inline int pud_young(pud_t pud) @@ -182,7 +182,14 @@ static inline int pud_young(pud_t pud) static inline int pte_write(pte_t pte) { - return pte_flags(pte) & _PAGE_RW; + /* + * If _PAGE_DIRTY is set, the PTE must either have _PAGE_RW or be + * a shadow stack PTE, which is logically writable. + */ + if (cpu_feature_enabled(X86_FEATURE_SHSTK)) + return pte_flags(pte) & (_PAGE_RW | _PAGE_DIRTY); + else + return pte_flags(pte) & _PAGE_RW; } static inline int pte_huge(pte_t pte) @@ -333,7 +340,7 @@ static inline pte_t pte_clear_uffd_wp(pte_t pte) static inline pte_t pte_mkclean(pte_t pte) { - return pte_clear_flags(pte, _PAGE_DIRTY); + return pte_clear_flags(pte, _PAGE_DIRTY_BITS); } static inline pte_t pte_mkold(pte_t pte) @@ -343,6 +350,18 @@ static inline pte_t pte_mkold(pte_t pte) static inline pte_t pte_wrprotect(pte_t pte) { + /* + * Blindly clearing _PAGE_RW might accidentally create + * a shadow stack PTE (RW=0, Dirty=1). Move the hardware + * dirty value to the software bit. + */ + if (cpu_feature_enabled(X86_FEATURE_SHSTK)) { + if (pte_flags(pte) & _PAGE_DIRTY) { + pte = pte_clear_flags(pte, _PAGE_DIRTY); + pte = pte_set_flags(pte, _PAGE_COW); + } + } + return pte_clear_flags(pte, _PAGE_RW); } @@ -353,6 +372,18 @@ static inline pte_t pte_mkexec(pte_t pte) static inline pte_t pte_mkdirty(pte_t pte) { + pteval_t dirty = _PAGE_DIRTY; + + /* Avoid creating (HW)Dirty=1, Write=0 PTEs */ + if (cpu_feature_enabled(X86_FEATURE_SHSTK) && !pte_write(pte)) + dirty = _PAGE_COW; + + return pte_set_flags(pte, dirty | _PAGE_SOFT_DIRTY); +} + +static inline pte_t pte_mkwrite_shstk(pte_t pte) +{ + pte = pte_clear_flags(pte, _PAGE_COW); return pte_set_flags(pte, _PAGE_DIRTY | _PAGE_SOFT_DIRTY); } @@ -363,6 +394,13 @@ static inline pte_t pte_mkyoung(pte_t pte) static inline pte_t pte_mkwrite(pte_t pte) { + if (cpu_feature_enabled(X86_FEATURE_SHSTK)) { + if (pte_flags(pte) & _PAGE_COW) { + pte = pte_clear_flags(pte, _PAGE_COW); + pte = pte_set_flags(pte, _PAGE_DIRTY); + } + } + return pte_set_flags(pte, _PAGE_RW); } @@ -434,16 +472,40 @@ static inline pmd_t pmd_mkold(pmd_t pmd) static inline pmd_t pmd_mkclean(pmd_t pmd) { - return pmd_clear_flags(pmd, _PAGE_DIRTY); + return pmd_clear_flags(pmd, _PAGE_DIRTY_BITS); } static inline pmd_t pmd_wrprotect(pmd_t pmd) { + /* + * Blindly clearing _PAGE_RW might accidentally create + * a shadow stack PMD (RW=0, Dirty=1). Move the hardware + * dirty value to the software bit. + */ + if (cpu_feature_enabled(X86_FEATURE_SHSTK)) { + if (pmd_flags(pmd) & _PAGE_DIRTY) { + pmd = pmd_clear_flags(pmd, _PAGE_DIRTY); + pmd = pmd_set_flags(pmd, _PAGE_COW); + } + } + return pmd_clear_flags(pmd, _PAGE_RW); } static inline pmd_t pmd_mkdirty(pmd_t pmd) { + pmdval_t dirty = _PAGE_DIRTY; + + /* Avoid creating (HW)Dirty=1, Write=0 PMDs */ + if (cpu_feature_enabled(X86_FEATURE_SHSTK) && !(pmd_flags(pmd) & _PAGE_RW)) + dirty = _PAGE_COW; + + return pmd_set_flags(pmd, dirty | _PAGE_SOFT_DIRTY); +} + +static inline pmd_t pmd_mkwrite_shstk(pmd_t pmd) +{ + pmd = pmd_clear_flags(pmd, _PAGE_COW); return pmd_set_flags(pmd, _PAGE_DIRTY | _PAGE_SOFT_DIRTY); } @@ -464,6 +526,13 @@ static inline pmd_t pmd_mkyoung(pmd_t pmd) static inline pmd_t pmd_mkwrite(pmd_t pmd) { + if (cpu_feature_enabled(X86_FEATURE_SHSTK)) { + if (pmd_flags(pmd) & _PAGE_COW) { + pmd = pmd_clear_flags(pmd, _PAGE_COW); + pmd = pmd_set_flags(pmd, _PAGE_DIRTY); + } + } + return pmd_set_flags(pmd, _PAGE_RW); } @@ -488,17 +557,35 @@ static inline pud_t pud_mkold(pud_t pud) static inline pud_t pud_mkclean(pud_t pud) { - return pud_clear_flags(pud, _PAGE_DIRTY); + return pud_clear_flags(pud, _PAGE_DIRTY_BITS); } static inline pud_t pud_wrprotect(pud_t pud) { + /* + * Blindly clearing _PAGE_RW might accidentally create + * a shadow stack PUD (RW=0, Dirty=1). Move the hardware + * dirty value to the software bit. + */ + if (cpu_feature_enabled(X86_FEATURE_SHSTK)) { + if (pud_flags(pud) & _PAGE_DIRTY) { + pud = pud_clear_flags(pud, _PAGE_DIRTY); + pud = pud_set_flags(pud, _PAGE_COW); + } + } + return pud_clear_flags(pud, _PAGE_RW); } static inline pud_t pud_mkdirty(pud_t pud) { - return pud_set_flags(pud, _PAGE_DIRTY | _PAGE_SOFT_DIRTY); + pudval_t dirty = _PAGE_DIRTY; + + /* Avoid creating (HW)Dirty=1, Write=0 PUDs */ + if (cpu_feature_enabled(X86_FEATURE_SHSTK) && !(pud_flags(pud) & _PAGE_RW)) + dirty = _PAGE_COW; + + return pud_set_flags(pud, dirty | _PAGE_SOFT_DIRTY); } static inline pud_t pud_mkdevmap(pud_t pud) @@ -518,6 +605,13 @@ static inline pud_t pud_mkyoung(pud_t pud) static inline pud_t pud_mkwrite(pud_t pud) { + if (cpu_feature_enabled(X86_FEATURE_SHSTK)) { + if (pud_flags(pud) & _PAGE_COW) { + pud = pud_clear_flags(pud, _PAGE_COW); + pud = pud_set_flags(pud, _PAGE_DIRTY); + } + } + return pud_set_flags(pud, _PAGE_RW); } @@ -1131,7 +1225,14 @@ extern int pmdp_clear_flush_young(struct vm_area_struct *vma, #define pmd_write pmd_write static inline int pmd_write(pmd_t pmd) { - return pmd_flags(pmd) & _PAGE_RW; + /* + * If _PAGE_DIRTY is set, then the PMD must either have _PAGE_RW or + * be a shadow stack PMD, which is logically writable. + */ + if (cpu_feature_enabled(X86_FEATURE_SHSTK)) + return pmd_flags(pmd) & (_PAGE_RW | _PAGE_DIRTY); + else + return pmd_flags(pmd) & _PAGE_RW; } #define __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h index b8b79d618bbc..437d7ff0ae80 100644 --- a/arch/x86/include/asm/pgtable_types.h +++ b/arch/x86/include/asm/pgtable_types.h @@ -23,7 +23,8 @@ #define _PAGE_BIT_SOFTW2 10 /* " */ #define _PAGE_BIT_SOFTW3 11 /* " */ #define _PAGE_BIT_PAT_LARGE 12 /* On 2MB or 1GB pages */ -#define _PAGE_BIT_SOFTW4 58 /* available for programmer */ +#define _PAGE_BIT_SOFTW4 57 /* available for programmer */ +#define _PAGE_BIT_SOFTW5 58 /* available for programmer */ #define _PAGE_BIT_PKEY_BIT0 59 /* Protection Keys, bit 1/4 */ #define _PAGE_BIT_PKEY_BIT1 60 /* Protection Keys, bit 2/4 */ #define _PAGE_BIT_PKEY_BIT2 61 /* Protection Keys, bit 3/4 */ @@ -36,6 +37,15 @@ #define _PAGE_BIT_SOFT_DIRTY _PAGE_BIT_SOFTW3 /* software dirty tracking */ #define _PAGE_BIT_DEVMAP _PAGE_BIT_SOFTW4 +/* + * Indicates a copy-on-write page. + */ +#ifdef CONFIG_X86_CET +#define _PAGE_BIT_COW _PAGE_BIT_SOFTW5 /* copy-on-write */ +#else +#define _PAGE_BIT_COW 0 +#endif + /* If _PAGE_BIT_PRESENT is clear, we use these: */ /* - if the user mapped it with PROT_NONE; pte_present gives true */ #define _PAGE_BIT_PROTNONE _PAGE_BIT_GLOBAL @@ -117,6 +127,36 @@ #define _PAGE_DEVMAP (_AT(pteval_t, 0)) #endif +/* + * The hardware requires shadow stack to be read-only and Dirty. + * _PAGE_COW is a software-only bit used to separate copy-on-write PTEs + * from shadow stack PTEs: + * (a) A modified, copy-on-write (COW) page: (Write=0, Cow=1) + * (b) A R/O page that has been COW'ed: (Write=0, Cow=1) + * The user page is in a R/O VMA, and get_user_pages() needs a + * writable copy. The page fault handler creates a copy of the page + * and sets the new copy's PTE as Write=0, Cow=1. + * (c) A shadow stack PTE: (Write=0, Dirty=1) + * (d) A shared (copy-on-access) shadow stack PTE: (Write=0, Cow=1) + * When a shadow stack page is being shared among processes (this + * happens at fork()), its PTE is cleared of _PAGE_DIRTY, so the next + * shadow stack access causes a fault, and the page is duplicated and + * _PAGE_DIRTY is set again. This is the COW equivalent for shadow + * stack pages, even though it's copy-on-access rather than + * copy-on-write. + * (e) A page where the processor observed a Write=1 PTE, started a write, + * set Dirty=1, but then observed a Write=0 PTE (changed by another + * thread). That's possible today, but will not happen on processors + * that support shadow stack. + */ +#ifdef CONFIG_X86_CET +#define _PAGE_COW (_AT(pteval_t, 1) << _PAGE_BIT_COW) +#else +#define _PAGE_COW (_AT(pteval_t, 0)) +#endif + +#define _PAGE_DIRTY_BITS (_PAGE_DIRTY | _PAGE_COW) + #define _PAGE_PROTNONE (_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE) /*
There is essentially no room left in the x86 hardware PTEs on some OSes (not Linux). That left the hardware architects looking for a way to represent a new memory type (shadow stack) within the existing bits. They chose to repurpose a lightly-used state: Write=0, Dirty=1. The reason it's lightly used is that Dirty=1 is normally set by hardware and cannot normally be set by hardware on a Write=0 PTE. Software must normally be involved to create one of these PTEs, so software can simply opt to not create them. In places where Linux normally creates Write=0, Dirty=1, it can use the software-defined _PAGE_COW in place of the hardware _PAGE_DIRTY. In other words, whenever Linux needs to create Write=0, Dirty=1, it instead creates Write=0, Cow=1, except for shadow stack, which is Write=0, Dirty=1. This clearly separates shadow stack from other data, and results in the following: (a) A modified, copy-on-write (COW) page: (Write=0, Cow=1) (b) A R/O page that has been COW'ed: (Write=0, Cow=1) The user page is in a R/O VMA, and get_user_pages() needs a writable copy. The page fault handler creates a copy of the page and sets the new copy's PTE as Write=0 and Cow=1. (c) A shadow stack PTE: (Write=0, Dirty=1) (d) A shared shadow stack PTE: (Write=0, Cow=1) When a shadow stack page is being shared among processes (this happens at fork()), its PTE is made Dirty=0, so the next shadow stack access causes a fault, and the page is duplicated and Dirty=1 is set again. This is the COW equivalent for shadow stack pages, even though it's copy-on-access rather than copy-on-write. (e) A page where the processor observed a Write=1 PTE, started a write, set Dirty=1, but then observed a Write=0 PTE. That's possible today, but will not happen on processors that support shadow stack. Define _PAGE_COW and update pte_*() helpers and apply the same changes to pmd and pud. After this, there are six free bits left in the 64-bit PTE, and no more free bits in the 32-bit PTE (except for PAE) and Shadow Stack is not implemented for the 32-bit kernel. Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com> --- arch/x86/include/asm/pgtable.h | 125 ++++++++++++++++++++++++--- arch/x86/include/asm/pgtable_types.h | 42 ++++++++- 2 files changed, 154 insertions(+), 13 deletions(-)