[v28,09/32] x86/mm: Introduce _PAGE_COW

Message ID 20210722205219.7934-10-yu-cheng.yu@intel.com
State New
Series Control-flow Enforcement: Shadow Stack

Commit Message

Yu-cheng Yu July 22, 2021, 8:51 p.m. UTC
There is essentially no room left in the x86 hardware PTEs on some OSes
(not Linux).  That left the hardware architects looking for a way to
represent a new memory type (shadow stack) within the existing bits.
They chose to repurpose a lightly-used state: Write=0, Dirty=1.

The state is lightly used because Dirty=1 is normally set by hardware, and
hardware cannot normally set it on a Write=0 PTE.  Software must normally
be involved to create one of these PTEs, so software can simply opt not to
create them.

In places where Linux normally creates Write=0, Dirty=1, it can use the
software-defined _PAGE_COW in place of the hardware _PAGE_DIRTY.  In other
words, whenever Linux needs to create Write=0, Dirty=1, it instead creates
Write=0, Cow=1, except for shadow stack, which is Write=0, Dirty=1.  This
clearly separates shadow stack from other data, and results in the
following:

(a) A modified, copy-on-write (COW) page: (Write=0, Cow=1)
(b) A R/O page that has been COW'ed: (Write=0, Cow=1)
    The user page is in a R/O VMA, and get_user_pages() needs a writable
    copy.  The page fault handler creates a copy of the page and sets
    the new copy's PTE as Write=0 and Cow=1.
(c) A shadow stack PTE: (Write=0, Dirty=1)
(d) A shared shadow stack PTE: (Write=0, Cow=1)
    When a shadow stack page is being shared among processes (this happens
    at fork()), its PTE is made Dirty=0, so the next shadow stack access
    causes a fault, and the page is duplicated and Dirty=1 is set again.
    This is the COW equivalent for shadow stack pages, even though it's
    copy-on-access rather than copy-on-write.
(e) A page where the processor observed a Write=1 PTE, started a write, set
    Dirty=1, but then observed a Write=0 PTE.  That's possible today, but
    will not happen on processors that support shadow stack.
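
The encodings in (a)-(d) can be modelled in user space.  The following is
a minimal sketch, not kernel code: it assumes shadow stack is enabled, uses
the hardware bit positions for Write (bit 1) and Dirty (bit 6) and the Cow
position chosen by this patch (bit 58), and the names PAGE_*, enum pte_kind
and classify() are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Write and Dirty are hardware bits; Cow is the software bit from this patch. */
#define PAGE_RW    (UINT64_C(1) << 1)
#define PAGE_DIRTY (UINT64_C(1) << 6)
#define PAGE_COW   (UINT64_C(1) << 58)

/* Hypothetical labels for the states listed in the commit message. */
enum pte_kind {
	PTE_CLEAN_RO,	/* Write=0, Dirty=0, Cow=0 */
	PTE_COW_DIRTY,	/* Write=0, Cow=1: cases (a), (b), (d) */
	PTE_SHSTK,	/* Write=0, Dirty=1: case (c) */
	PTE_WRITABLE,	/* Write=1, regardless of Dirty */
};

static inline enum pte_kind classify(uint64_t flags)
{
	if (flags & PAGE_RW)
		return PTE_WRITABLE;
	/* Same test as pte_shstk(): only Write and Dirty are inspected. */
	if (flags & PAGE_DIRTY)
		return PTE_SHSTK;
	if (flags & PAGE_COW)
		return PTE_COW_DIRTY;
	return PTE_CLEAN_RO;
}

/* Usage example: one flag value per state. */
void classify_demo(void)
{
	assert(classify(PAGE_DIRTY) == PTE_SHSTK);
	assert(classify(PAGE_COW) == PTE_COW_DIRTY);
	assert(classify(PAGE_RW | PAGE_DIRTY) == PTE_WRITABLE);
	assert(classify(0) == PTE_CLEAN_RO);
}
```

Case (e) is not modelled, since it is stated not to occur on processors
that support shadow stack.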

Define _PAGE_COW, update the pte_*() helpers, and apply the same changes
to pmd and pud.
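
How the updated helpers move the dirty state between the hardware and
software bits can be sketched on plain 64-bit flag words.  This is a
simplified user-space model under stated assumptions: shadow stack is taken
to be enabled, _PAGE_SOFT_DIRTY is ignored, and the model_*() names are
hypothetical stand-ins for the pte_*() helpers in the diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_RW         (UINT64_C(1) << 1)
#define PAGE_DIRTY      (UINT64_C(1) << 6)
#define PAGE_COW        (UINT64_C(1) << 58)
#define PAGE_DIRTY_BITS (PAGE_DIRTY | PAGE_COW)

/* Dirty in either the hardware or the software bit counts as dirty. */
static inline bool model_dirty(uint64_t f)
{
	return (f & PAGE_DIRTY_BITS) != 0;
}

/* Move the dirty state from the hardware bit to the software bit. */
static inline uint64_t model_mkcow(uint64_t f)
{
	return (f & ~PAGE_DIRTY) | PAGE_COW;
}

/* Move the dirty state from the software bit back to the hardware bit. */
static inline uint64_t model_clear_cow(uint64_t f)
{
	return (f & ~PAGE_COW) | PAGE_DIRTY;
}

/* Clear Write; a dirty entry must not end up as Write=0, Dirty=1. */
static inline uint64_t model_wrprotect(uint64_t f)
{
	f &= ~PAGE_RW;
	if (model_dirty(f))
		f = model_mkcow(f);
	return f;
}

/* Set Write; a dirty entry gets its hardware Dirty bit back. */
static inline uint64_t model_mkwrite(uint64_t f)
{
	f |= PAGE_RW;
	if (model_dirty(f))
		f = model_clear_cow(f);
	return f;
}

/* Usage example: write-protect a dirty page, then make it writable again. */
void helpers_demo(void)
{
	uint64_t f = PAGE_RW | PAGE_DIRTY;

	f = model_wrprotect(f);
	assert(f == PAGE_COW);			/* Write=0, Cow=1 */
	f = model_mkwrite(f);
	assert(f == (PAGE_RW | PAGE_DIRTY));	/* Write=1, Dirty=1 */
}
```

In this model, write-protecting a shadow stack entry (Write=0, Dirty=1)
also yields Write=0, Cow=1, which matches case (d).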

After this, there are six free bits left in the 64-bit PTE and no free
bits left in the 32-bit PTE (except for PAE).  Shadow Stack is not
implemented for the 32-bit kernel.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/include/asm/pgtable.h       | 196 ++++++++++++++++++++++++---
 arch/x86/include/asm/pgtable_types.h |  42 +++++-
 2 files changed, 217 insertions(+), 21 deletions(-)

Comments

Borislav Petkov Aug. 16, 2021, 10:43 a.m. UTC | #1
On Thu, Jul 22, 2021 at 01:51:56PM -0700, Yu-cheng Yu wrote:
> @@ -153,13 +178,23 @@ static inline int pud_young(pud_t pud)
>  
>  static inline int pte_write(pte_t pte)
>  {
> -	return pte_flags(pte) & _PAGE_RW;
> +	/*
> +	 * Shadow stack pages are always writable - but not by normal
> +	 * instructions, and only by shadow stack operations.  Therefore,
> +	 * the W=0,D=1 test with pte_shstk().
> +	 */
> +	return (pte_flags(pte) & _PAGE_RW) || pte_shstk(pte);

Well, this is weird: if some kernel code queries a shstk page and this
here function says it is writable but then goes and tries to write into
it and that write fails, then it'll confuse the user.

IOW, from where I'm standing, that should be:

	return (pte_flags(pte) & _PAGE_RW) && !pte_shstk(pte);

as in, a writable page is one which has _PAGE_RW and it is *not* a
shadow stack page because latter is special and not really writable.

Hmmm?
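
The two candidate return expressions differ only for the Write=0, Dirty=1
encoding.  A minimal user-space sketch of that difference, assuming shadow
stack is enabled and using hypothetical names (is_shstk(),
write_as_posted(), write_as_suggested()) on raw flag words:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_RW    (UINT64_C(1) << 1)
#define PAGE_DIRTY (UINT64_C(1) << 6)

/* Mirrors the pte_shstk() test from the patch: Write=0 and Dirty=1. */
static inline bool is_shstk(uint64_t f)
{
	return (f & (PAGE_RW | PAGE_DIRTY)) == PAGE_DIRTY;
}

/* As posted: a shadow stack entry counts as writable. */
static inline bool write_as_posted(uint64_t f)
{
	return (f & PAGE_RW) || is_shstk(f);
}

/* As suggested in review: a shadow stack entry is never writable. */
static inline bool write_as_suggested(uint64_t f)
{
	return (f & PAGE_RW) && !is_shstk(f);
}

/* Usage example: the two forms disagree only on Write=0, Dirty=1. */
void pte_write_demo(void)
{
	assert(write_as_posted(PAGE_DIRTY));
	assert(!write_as_suggested(PAGE_DIRTY));
	assert(write_as_posted(PAGE_RW) == write_as_suggested(PAGE_RW));
	assert(write_as_posted(0) == write_as_suggested(0));
}
```

Because is_shstk() already implies Write=0, the suggested form gives the
same result as testing PAGE_RW alone in this model.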
Yu-cheng Yu Aug. 17, 2021, 6:24 p.m. UTC | #2
On 8/16/2021 3:43 AM, Borislav Petkov wrote:
> On Thu, Jul 22, 2021 at 01:51:56PM -0700, Yu-cheng Yu wrote:
>> @@ -153,13 +178,23 @@ static inline int pud_young(pud_t pud)
>>   
>>   static inline int pte_write(pte_t pte)
>>   {
>> -	return pte_flags(pte) & _PAGE_RW;
>> +	/*
>> +	 * Shadow stack pages are always writable - but not by normal
>> +	 * instructions, and only by shadow stack operations.  Therefore,
>> +	 * the W=0,D=1 test with pte_shstk().
>> +	 */
>> +	return (pte_flags(pte) & _PAGE_RW) || pte_shstk(pte);
> 
> Well, this is weird: if some kernel code queries a shstk page and this
> here function says it is writable but then goes and tries to write into
> it and that write fails, then it'll confuse the user.
> 
> IOW, from where I'm standing, that should be:
> 
> 	return (pte_flags(pte) & _PAGE_RW) && !pte_shstk(pte);
> 
> as in, a writable page is one which has _PAGE_RW and it is *not* a
> shadow stack page because latter is special and not really writable.
> 
> Hmmm?
> 

Indeed, this can be looked at in a few ways.  We can visualize 
pte_write() as 'CPU can write to it with MOV' or 'CPU can write to it 
with any opcodes'.  Depending on whatever pte_write() is, copy-on-write 
code can be adjusted accordingly.

Yu-cheng
Borislav Petkov Aug. 17, 2021, 7:54 p.m. UTC | #3
On Tue, Aug 17, 2021 at 11:24:29AM -0700, Yu, Yu-cheng wrote:
> Indeed, this can be looked at in a few ways.  We can visualize pte_write()
> as 'CPU can write to it with MOV' or 'CPU can write to it with any opcodes'.
> Depending on whatever pte_write() is, copy-on-write code can be adjusted
> accordingly.

Can be?

I think you should exclude shadow stack pages from being writable
and treat them as read-only. How the CPU writes them is immaterial -
pte/pmd_write() is used by normal kernel code to query whether the page
is writable or not by any instruction - not by the CPU.

And since normal kernel code cannot write shadow stack pages, then for
that code those pages are read-only.

If special kernel code using shadow stack management insns needs
to modify a shadow stack, then it can check whether a page is
pte/pmd_shstk() but that code is special anyway.

Hell, a shadow stack page is (Write=0, Dirty=1) so calling it writable
			      ^^^^^^^
is simply wrong.

Thx.
Andy Lutomirski Aug. 17, 2021, 8:13 p.m. UTC | #4
> On Aug 17, 2021, at 12:53 PM, Borislav Petkov <bp@alien8.de> wrote:
> 
> On Tue, Aug 17, 2021 at 11:24:29AM -0700, Yu, Yu-cheng wrote:
>> Indeed, this can be looked at in a few ways.  We can visualize pte_write()
>> as 'CPU can write to it with MOV' or 'CPU can write to it with any opcodes'.
>> Depending on whatever pte_write() is, copy-on-write code can be adjusted
>> accordingly.
> 
> Can be?
> 
> I think you should exclude shadow stack pages from being writable
> and treat them as read-only. How the CPU writes them is immaterial -
> pte/pmd_write() is used by normal kernel code to query whether the page
> is writable or not by any instruction - not by the CPU.
> 
> And since normal kernel code cannot write shadow stack pages, then for
> that code those pages are read-only.
> 
> If special kernel code using shadow stack management insns needs
> to modify a shadow stack, then it can check whether a page is
> pte/pmd_shstk() but that code is special anyway.
> 
> Hell, a shadow stack page is (Write=0, Dirty=1) so calling it writable
>                  ^^^^^^^
> is simply wrong.

But it *is* writable using WRUSS, and it’s also writable by CALL, WRSS, etc.

Now if the mm code tries to write protect it and expects sensible semantics, the results could be interesting. At the very least, someone would need to validate that RET reading a read only shadow stack page does the right thing.

> 
> Thx.
> 
> -- 
> Regards/Gruss,
>    Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette
Borislav Petkov Aug. 17, 2021, 8:24 p.m. UTC | #5
On Tue, Aug 17, 2021 at 01:13:09PM -0700, Andy Lutomirski wrote:
> > If special kernel code using shadow stack management insns needs
> > to modify a shadow stack, then it can check whether a page is
> > pte/pmd_shstk() but that code is special anyway.
> > 
> > Hell, a shadow stack page is (Write=0, Dirty=1) so calling it writable
> >                  ^^^^^^^
> > is simply wrong.
> 
> But it *is* writable using WRUSS, and it’s also writable by CALL,

Well, if we have to be precise, CALL doesn't write it directly - it
causes for shadow stack to be written as part of CALL's execution. Yeah
yeah, potato potato.

> WRSS, etc.

Thus the "special kernel code" thing above. I've left it in instead of
snipping it.

> Now if the mm code tries to write protect it and expects sensible
> semantics, the results could be interesting. At the very least,
> someone would need to validate that RET reading a read only shadow
> stack page does the right thing.

Huh?

A shadow stack page is RO (W=0).
Andy Lutomirski Aug. 17, 2021, 8:51 p.m. UTC | #6
On Tue, Aug 17, 2021, at 1:24 PM, Borislav Petkov wrote:
> On Tue, Aug 17, 2021 at 01:13:09PM -0700, Andy Lutomirski wrote:
> > > If special kernel code using shadow stack management insns needs
> > > to modify a shadow stack, then it can check whether a page is
> > > pte/pmd_shstk() but that code is special anyway.
> > > 
> > > Hell, a shadow stack page is (Write=0, Dirty=1) so calling it writable
> > >                  ^^^^^^^
> > > is simply wrong.
> > 
> > But it *is* writable using WRUSS, and it’s also writable by CALL,
> 
> Well, if we have to be precise, CALL doesn't write it directly - it
> causes for shadow stack to be written as part of CALL's execution. Yeah
> yeah, potato potato.

Potahto.

> 
> > WRSS, etc.
> 
> Thus the "special kernel code" thing above. I've left it in instead of
> snipping it.
> 

WRSS can be used from user mode depending on the configuration.

> > Now if the mm code tries to write protect it and expects sensible
> > semantics, the results could be interesting. At the very least,
> > someone would need to validate that RET reading a read only shadow
> > stack page does the right thing.
> 
> Huh?
> 
> A shadow stack page is RO (W=0).

Double-you shmouble-you.  You can't write it with MOV, but you can write it from user code and from kernel code.  As far as the mm is concerned, I think it should be considered writable.

Although... anyone who tries to copy_to_user() it is going to be a bit surprised.  Hmm.

> 
> -- 
> Regards/Gruss,
>     Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette
>
Borislav Petkov Aug. 17, 2021, 9:01 p.m. UTC | #7
On Tue, Aug 17, 2021 at 01:51:52PM -0700, Andy Lutomirski wrote:
> WRSS can be used from user mode depending on the configuration.

My point being, if you're going to do shadow stack management
operations, you should check whether the target you're writing to is a
shadow stack page. Clearly userspace can't do that but userspace will
get notified of that pretty timely.

> Double-you shmouble-you. You can't write it with MOV, but you can
> write it from user code and from kernel code. As far as the mm is
> concerned, I think it should be considered writable.

Because?

> Although... anyone who tries to copy_to_user() it is going to be a bit
> surprised. Hmm.

Ok, so you see the confusion.

In any case, I don't think you can simply look at a shadow stack page as
simple writable page. There are cases where it is going to be fun.

So why are we even saying that a shadow stack page is writable? Why
can't we simply say that a shadow stack page is, well, something
special?
Yu-cheng Yu Aug. 18, 2021, 4:38 p.m. UTC | #8
On 8/17/2021 2:01 PM, Borislav Petkov wrote:
> On Tue, Aug 17, 2021 at 01:51:52PM -0700, Andy Lutomirski wrote:
>> WRSS can be used from user mode depending on the configuration.
> 
> My point being, if you're going to do shadow stack management
> operations, you should check whether the target you're writing to is a
> shadow stack page. Clearly userspace can't do that but userspace will
> get notified of that pretty timely.
> 
>> Double-you shmouble-you. You can't write it with MOV, but you can
>> write it from user code and from kernel code. As far as the mm is
>> concerned, I think it should be considered writable.
> 
> Because?
> 
>> Although... anyone who tries to copy_to_user() it is going to be a bit
>> surprised. Hmm.
> 
> Ok, so you see the confusion.
> 

copy_to_user() can run into normal read-only areas too.  The caller can 
handle that just fine.

> In any case, I don't think you can simply look at a shadow stack page as
> simple writable page. There are cases where it is going to be fun.
> 
> So why are we even saying that a shadow stack page is writable? Why
> can't we simply say that a shadow stack page is, well, something
> special?
> 

We can visualize the type of a mm area by looking at vma->vm_flags, e.g. 
maybe_mkwrite(), and PTE macros as lower-level operatives.  These two 
have some relations but not one-to-one.  Note that a PTE in a writable 
area is not always pte_write().

I have considered and implemented a shadow stack PTE either pte_write() 
or not.  Making shadow stack as pte_write() results in less arch_* 
macros and less confusion in copy-on-write code.  That is one more thing 
to consider.

Thanks,
Yu-cheng
Borislav Petkov Aug. 21, 2021, 4:27 p.m. UTC | #9
On Wed, Aug 18, 2021 at 09:38:30AM -0700, Yu, Yu-cheng wrote:
> We can visualize the type of a mm area by looking at vma->vm_flags, e.g.

visualize?

> maybe_mkwrite(), and PTE macros as lower-level operatives.  These two have
> some relations but not one-to-one.  Note that a PTE in a writable area is
> not always pte_write().
> 
> I have considered and implemented a shadow stack PTE either pte_write() or
> not.  Making shadow stack as pte_write() results in less arch_* macros and
> less confusion in copy-on-write code.  That is one more thing to consider.

Ok, even though I'm still not 100% convinced by both amluto's and your
arguments. Let's try it and see what happens...
Patch

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 0ddeda0bc0c0..9f1ba76ed79a 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -121,9 +121,20 @@  extern pmdval_t early_pmd_flags;
  * The following only work if pte_present() is true.
  * Undefined behaviour if not..
  */
-static inline int pte_dirty(pte_t pte)
+static inline bool pte_dirty(pte_t pte)
 {
-	return pte_flags(pte) & _PAGE_DIRTY;
+	/*
+	 * A dirty PTE has Dirty=1 or Cow=1.
+	 */
+	return pte_flags(pte) & _PAGE_DIRTY_BITS;
+}
+
+static inline bool pte_shstk(pte_t pte)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+		return false;
+
+	return (pte_flags(pte) & (_PAGE_RW | _PAGE_DIRTY)) == _PAGE_DIRTY;
 }
 
 static inline int pte_young(pte_t pte)
@@ -131,9 +142,20 @@  static inline int pte_young(pte_t pte)
 	return pte_flags(pte) & _PAGE_ACCESSED;
 }
 
-static inline int pmd_dirty(pmd_t pmd)
+static inline bool pmd_dirty(pmd_t pmd)
+{
+	/*
+	 * A dirty PMD has Dirty=1 or Cow=1.
+	 */
+	return pmd_flags(pmd) & _PAGE_DIRTY_BITS;
+}
+
+static inline bool pmd_shstk(pmd_t pmd)
 {
-	return pmd_flags(pmd) & _PAGE_DIRTY;
+	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+		return false;
+
+	return (pmd_flags(pmd) & (_PAGE_RW | _PAGE_DIRTY)) == _PAGE_DIRTY;
 }
 
 static inline int pmd_young(pmd_t pmd)
@@ -141,9 +163,12 @@  static inline int pmd_young(pmd_t pmd)
 	return pmd_flags(pmd) & _PAGE_ACCESSED;
 }
 
-static inline int pud_dirty(pud_t pud)
+static inline bool pud_dirty(pud_t pud)
 {
-	return pud_flags(pud) & _PAGE_DIRTY;
+	/*
+	 * A dirty PUD has Dirty=1 or Cow=1.
+	 */
+	return pud_flags(pud) & _PAGE_DIRTY_BITS;
 }
 
 static inline int pud_young(pud_t pud)
@@ -153,13 +178,23 @@  static inline int pud_young(pud_t pud)
 
 static inline int pte_write(pte_t pte)
 {
-	return pte_flags(pte) & _PAGE_RW;
+	/*
+	 * Shadow stack pages are always writable - but not by normal
+	 * instructions, and only by shadow stack operations.  Therefore,
+	 * the W=0,D=1 test with pte_shstk().
+	 */
+	return (pte_flags(pte) & _PAGE_RW) || pte_shstk(pte);
 }
 
 #define pmd_write pmd_write
 static inline int pmd_write(pmd_t pmd)
 {
-	return pmd_flags(pmd) & _PAGE_RW;
+	/*
+	 * Shadow stack pages are always writable - but not by normal
+	 * instructions, and only by shadow stack operations.  Therefore,
+	 * the W=0,D=1 test with pmd_shstk().
+	 */
+	return (pmd_flags(pmd) & _PAGE_RW) || pmd_shstk(pmd);
 }
 
 #define pud_write pud_write
@@ -297,6 +332,24 @@  static inline pte_t pte_clear_flags(pte_t pte, pteval_t clear)
 	return native_make_pte(v & ~clear);
 }
 
+static inline pte_t pte_mkcow(pte_t pte)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+		return pte;
+
+	pte = pte_clear_flags(pte, _PAGE_DIRTY);
+	return pte_set_flags(pte, _PAGE_COW);
+}
+
+static inline pte_t pte_clear_cow(pte_t pte)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+		return pte;
+
+	pte = pte_set_flags(pte, _PAGE_DIRTY);
+	return pte_clear_flags(pte, _PAGE_COW);
+}
+
 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
 static inline int pte_uffd_wp(pte_t pte)
 {
@@ -316,7 +369,7 @@  static inline pte_t pte_clear_uffd_wp(pte_t pte)
 
 static inline pte_t pte_mkclean(pte_t pte)
 {
-	return pte_clear_flags(pte, _PAGE_DIRTY);
+	return pte_clear_flags(pte, _PAGE_DIRTY_BITS);
 }
 
 static inline pte_t pte_mkold(pte_t pte)
@@ -326,7 +379,16 @@  static inline pte_t pte_mkold(pte_t pte)
 
 static inline pte_t pte_wrprotect(pte_t pte)
 {
-	return pte_clear_flags(pte, _PAGE_RW);
+	pte = pte_clear_flags(pte, _PAGE_RW);
+
+	/*
+	 * Blindly clearing _PAGE_RW might accidentally create
+	 * a shadow stack PTE (RW=0, Dirty=1).  Move the hardware
+	 * dirty value to the software bit.
+	 */
+	if (pte_dirty(pte))
+		pte = pte_mkcow(pte);
+	return pte;
 }
 
 static inline pte_t pte_mkexec(pte_t pte)
@@ -336,7 +398,18 @@  static inline pte_t pte_mkexec(pte_t pte)
 
 static inline pte_t pte_mkdirty(pte_t pte)
 {
-	return pte_set_flags(pte, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
+	pteval_t dirty = _PAGE_DIRTY;
+
+	/* Avoid creating (HW)Dirty=1, Write=0 PTEs */
+	if (cpu_feature_enabled(X86_FEATURE_SHSTK) && !pte_write(pte))
+		dirty = _PAGE_COW;
+
+	return pte_set_flags(pte, dirty | _PAGE_SOFT_DIRTY);
+}
+
+static inline pte_t pte_mkwrite_shstk(pte_t pte)
+{
+	return pte_clear_cow(pte);
 }
 
 static inline pte_t pte_mkyoung(pte_t pte)
@@ -346,7 +419,12 @@  static inline pte_t pte_mkyoung(pte_t pte)
 
 static inline pte_t pte_mkwrite(pte_t pte)
 {
-	return pte_set_flags(pte, _PAGE_RW);
+	pte = pte_set_flags(pte, _PAGE_RW);
+
+	if (pte_dirty(pte))
+		pte = pte_clear_cow(pte);
+
+	return pte;
 }
 
 static inline pte_t pte_mkhuge(pte_t pte)
@@ -393,6 +471,24 @@  static inline pmd_t pmd_clear_flags(pmd_t pmd, pmdval_t clear)
 	return native_make_pmd(v & ~clear);
 }
 
+static inline pmd_t pmd_mkcow(pmd_t pmd)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+		return pmd;
+
+	pmd = pmd_clear_flags(pmd, _PAGE_DIRTY);
+	return pmd_set_flags(pmd, _PAGE_COW);
+}
+
+static inline pmd_t pmd_clear_cow(pmd_t pmd)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+		return pmd;
+
+	pmd = pmd_set_flags(pmd, _PAGE_DIRTY);
+	return pmd_clear_flags(pmd, _PAGE_COW);
+}
+
 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
 static inline int pmd_uffd_wp(pmd_t pmd)
 {
@@ -417,17 +513,36 @@  static inline pmd_t pmd_mkold(pmd_t pmd)
 
 static inline pmd_t pmd_mkclean(pmd_t pmd)
 {
-	return pmd_clear_flags(pmd, _PAGE_DIRTY);
+	return pmd_clear_flags(pmd, _PAGE_DIRTY_BITS);
 }
 
 static inline pmd_t pmd_wrprotect(pmd_t pmd)
 {
-	return pmd_clear_flags(pmd, _PAGE_RW);
+	pmd = pmd_clear_flags(pmd, _PAGE_RW);
+	/*
+	 * Blindly clearing _PAGE_RW might accidentally create
+	 * a shadow stack PMD (RW=0, Dirty=1).  Move the hardware
+	 * dirty value to the software bit.
+	 */
+	if (pmd_dirty(pmd))
+		pmd = pmd_mkcow(pmd);
+	return pmd;
 }
 
 static inline pmd_t pmd_mkdirty(pmd_t pmd)
 {
-	return pmd_set_flags(pmd, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
+	pmdval_t dirty = _PAGE_DIRTY;
+
+	/* Avoid creating (HW)Dirty=1, Write=0 PMDs */
+	if (cpu_feature_enabled(X86_FEATURE_SHSTK) && !pmd_write(pmd))
+		dirty = _PAGE_COW;
+
+	return pmd_set_flags(pmd, dirty | _PAGE_SOFT_DIRTY);
+}
+
+static inline pmd_t pmd_mkwrite_shstk(pmd_t pmd)
+{
+	return pmd_clear_cow(pmd);
 }
 
 static inline pmd_t pmd_mkdevmap(pmd_t pmd)
@@ -447,7 +562,11 @@  static inline pmd_t pmd_mkyoung(pmd_t pmd)
 
 static inline pmd_t pmd_mkwrite(pmd_t pmd)
 {
-	return pmd_set_flags(pmd, _PAGE_RW);
+	pmd = pmd_set_flags(pmd, _PAGE_RW);
+
+	if (pmd_dirty(pmd))
+		pmd = pmd_clear_cow(pmd);
+	return pmd;
 }
 
 static inline pud_t pud_set_flags(pud_t pud, pudval_t set)
@@ -464,6 +583,24 @@  static inline pud_t pud_clear_flags(pud_t pud, pudval_t clear)
 	return native_make_pud(v & ~clear);
 }
 
+static inline pud_t pud_mkcow(pud_t pud)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+		return pud;
+
+	pud = pud_clear_flags(pud, _PAGE_DIRTY);
+	return pud_set_flags(pud, _PAGE_COW);
+}
+
+static inline pud_t pud_clear_cow(pud_t pud)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+		return pud;
+
+	pud = pud_set_flags(pud, _PAGE_DIRTY);
+	return pud_clear_flags(pud, _PAGE_COW);
+}
+
 static inline pud_t pud_mkold(pud_t pud)
 {
 	return pud_clear_flags(pud, _PAGE_ACCESSED);
@@ -471,17 +608,32 @@  static inline pud_t pud_mkold(pud_t pud)
 
 static inline pud_t pud_mkclean(pud_t pud)
 {
-	return pud_clear_flags(pud, _PAGE_DIRTY);
+	return pud_clear_flags(pud, _PAGE_DIRTY_BITS);
 }
 
 static inline pud_t pud_wrprotect(pud_t pud)
 {
-	return pud_clear_flags(pud, _PAGE_RW);
+	pud = pud_clear_flags(pud, _PAGE_RW);
+
+	/*
+	 * Blindly clearing _PAGE_RW might accidentally create
+	 * a shadow stack PUD (RW=0, Dirty=1).  Move the hardware
+	 * dirty value to the software bit.
+	 */
+	if (pud_dirty(pud))
+		pud = pud_mkcow(pud);
+	return pud;
 }
 
 static inline pud_t pud_mkdirty(pud_t pud)
 {
-	return pud_set_flags(pud, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
+	pudval_t dirty = _PAGE_DIRTY;
+
+	/* Avoid creating (HW)Dirty=1, Write=0 PUDs */
+	if (cpu_feature_enabled(X86_FEATURE_SHSTK) && !pud_write(pud))
+		dirty = _PAGE_COW;
+
+	return pud_set_flags(pud, dirty | _PAGE_SOFT_DIRTY);
 }
 
 static inline pud_t pud_mkdevmap(pud_t pud)
@@ -501,7 +653,11 @@  static inline pud_t pud_mkyoung(pud_t pud)
 
 static inline pud_t pud_mkwrite(pud_t pud)
 {
-	return pud_set_flags(pud, _PAGE_RW);
+	pud = pud_set_flags(pud, _PAGE_RW);
+
+	if (pud_dirty(pud))
+		pud = pud_clear_cow(pud);
+	return pud;
 }
 
 #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 3781a79b6388..1bfab70ff9ac 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -21,7 +21,8 @@ 
 #define _PAGE_BIT_SOFTW2	10	/* " */
 #define _PAGE_BIT_SOFTW3	11	/* " */
 #define _PAGE_BIT_PAT_LARGE	12	/* On 2MB or 1GB pages */
-#define _PAGE_BIT_SOFTW4	58	/* available for programmer */
+#define _PAGE_BIT_SOFTW4	57	/* available for programmer */
+#define _PAGE_BIT_SOFTW5	58	/* available for programmer */
 #define _PAGE_BIT_PKEY_BIT0	59	/* Protection Keys, bit 1/4 */
 #define _PAGE_BIT_PKEY_BIT1	60	/* Protection Keys, bit 2/4 */
 #define _PAGE_BIT_PKEY_BIT2	61	/* Protection Keys, bit 3/4 */
@@ -34,6 +35,15 @@ 
 #define _PAGE_BIT_SOFT_DIRTY	_PAGE_BIT_SOFTW3 /* software dirty tracking */
 #define _PAGE_BIT_DEVMAP	_PAGE_BIT_SOFTW4
 
+/*
+ * Indicates a copy-on-write page.
+ */
+#ifdef CONFIG_X86_SHADOW_STACK
+#define _PAGE_BIT_COW		_PAGE_BIT_SOFTW5 /* copy-on-write */
+#else
+#define _PAGE_BIT_COW		0
+#endif
+
 /* If _PAGE_BIT_PRESENT is clear, we use these: */
 /* - if the user mapped it with PROT_NONE; pte_present gives true */
 #define _PAGE_BIT_PROTNONE	_PAGE_BIT_GLOBAL
@@ -115,6 +125,36 @@ 
 #define _PAGE_DEVMAP	(_AT(pteval_t, 0))
 #endif
 
+/*
+ * The hardware requires shadow stack to be read-only and Dirty.
+ * _PAGE_COW is a software-only bit used to separate copy-on-write PTEs
+ * from shadow stack PTEs:
+ * (a) A modified, copy-on-write (COW) page: (Write=0, Cow=1)
+ * (b) A R/O page that has been COW'ed: (Write=0, Cow=1)
+ *     The user page is in a R/O VMA, and get_user_pages() needs a
+ *     writable copy.  The page fault handler creates a copy of the page
+ *     and sets the new copy's PTE as Write=0, Cow=1.
+ * (c) A shadow stack PTE: (Write=0, Dirty=1)
+ * (d) A shared (copy-on-access) shadow stack PTE: (Write=0, Cow=1)
+ *     When a shadow stack page is being shared among processes (this
+ *     happens at fork()), its PTE is cleared of _PAGE_DIRTY, so the next
+ *     shadow stack access causes a fault, and the page is duplicated and
+ *     _PAGE_DIRTY is set again.  This is the COW equivalent for shadow
+ *     stack pages, even though it's copy-on-access rather than
+ *     copy-on-write.
+ * (e) A page where the processor observed a Write=1 PTE, started a write,
+ *     set Dirty=1, but then observed a Write=0 PTE (changed by another
+ *     thread).  That's possible today, but will not happen on processors
+ *     that support shadow stack.
+ */
+#ifdef CONFIG_X86_SHADOW_STACK
+#define _PAGE_COW	(_AT(pteval_t, 1) << _PAGE_BIT_COW)
+#else
+#define _PAGE_COW	(_AT(pteval_t, 0))
+#endif
+
+#define _PAGE_DIRTY_BITS (_PAGE_DIRTY | _PAGE_COW)
+
 #define _PAGE_PROTNONE	(_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE)
 
 /*