diff mbox series

[v17,11/26] x86/mm: Update ptep_set_wrprotect() and pmdp_set_wrprotect() for transition from _PAGE_DIRTY to _PAGE_COW

Message ID 20201229213053.16395-12-yu-cheng.yu@intel.com (mailing list archive)
State New, archived
Headers show
Series Control-flow Enforcement: Shadow Stack | expand

Commit Message

Yu-cheng Yu Dec. 29, 2020, 9:30 p.m. UTC
When Shadow Stack is introduced, [R/O + _PAGE_DIRTY] PTE is reserved for
shadow stack.  Copy-on-write PTEs have [R/O + _PAGE_COW].

When a PTE goes from [R/W + _PAGE_DIRTY] to [R/O + _PAGE_COW], it could
become a transient shadow stack PTE in two cases:

The first case is that some processors can start a write but end up seeing
a read-only PTE by the time they get to the Dirty bit, creating a transient
shadow stack PTE.  However, this will not occur on processors supporting
Shadow Stack, therefore we don't need a TLB flush here.

The second case is that when the software, without atomic, tests & replaces
_PAGE_DIRTY with _PAGE_COW, a transient shadow stack PTE can exist.
This is prevented with cmpxchg.

Dave Hansen, Jann Horn, Andy Lutomirski, and Peter Zijlstra provided many
insights to the issue.  Jann Horn provided the cmpxchg solution.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
---
 arch/x86/include/asm/pgtable.h | 52 ++++++++++++++++++++++++++++++++++
 1 file changed, 52 insertions(+)

Comments

Borislav Petkov Jan. 25, 2021, 6:27 p.m. UTC | #1
On Tue, Dec 29, 2020 at 01:30:38PM -0800, Yu-cheng Yu wrote:
> When Shadow Stack is introduced, [R/O + _PAGE_DIRTY] PTE is reserved for
> shadow stack.  Copy-on-write PTEs have [R/O + _PAGE_COW].
> 
> When a PTE goes from [R/W + _PAGE_DIRTY] to [R/O + _PAGE_COW], it could
> become a transient shadow stack PTE in two cases:
> 
> The first case is that some processors can start a write but end up seeing
> a read-only PTE by the time they get to the Dirty bit, creating a transient
> shadow stack PTE.  However, this will not occur on processors supporting
> Shadow Stack, therefore we don't need a TLB flush here.

Who's "we"?

> The second case is that when the software, without atomic, tests & replaces

"... when _PAGE_DIRTY is replaced with _PAGE_COW non-atomically, a transient
shadow stack PTE can be created, as a result."

> _PAGE_DIRTY with _PAGE_COW, a transient shadow stack PTE can exist.
> This is prevented with cmpxchg.
> 
> Dave Hansen, Jann Horn, Andy Lutomirski, and Peter Zijlstra provided many
> insights to the issue.  Jann Horn provided the cmpxchg solution.
> 
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> Reviewed-by: Kees Cook <keescook@chromium.org>
> ---
>  arch/x86/include/asm/pgtable.h | 52 ++++++++++++++++++++++++++++++++++
>  1 file changed, 52 insertions(+)
> 
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index 666c25ab9564..1c84f1ba32b9 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -1226,6 +1226,32 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
>  static inline void ptep_set_wrprotect(struct mm_struct *mm,
>  				      unsigned long addr, pte_t *ptep)
>  {
> +	/*
> +	 * Some processors can start a write, but end up seeing a read-only
> +	 * PTE by the time they get to the Dirty bit.  In this case, they
> +	 * will set the Dirty bit, leaving a read-only, Dirty PTE which
> +	 * looks like a shadow stack PTE.
> +	 *
> +	 * However, this behavior has been improved

Improved how?

> and will not occur on
> +	 * processors supporting Shadow Stack.  Without this guarantee, a

Which guarantee? That it won't happen on CPUs which support SHSTK?

> +	 * transition to a non-present PTE and flush the TLB would be

s/flush the TLB/TLB flush/

> +	 * needed.
> +	 *
> +	 * When changing a writable PTE to read-only and if the PTE has
> +	 * _PAGE_DIRTY set, move that bit to _PAGE_COW so that the PTE is
> +	 * not a shadow stack PTE.
> +	 */

This sentence doesn't belong here as it refers to what pte_wrprotect()
does. You could expand the comment in pte_wrprotect() with this here as
it is better.

> +	if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
> +		pte_t old_pte, new_pte;
> +
> +		do {
> +			old_pte = READ_ONCE(*ptep);
> +			new_pte = pte_wrprotect(old_pte);

Maybe I'm missing something but those two can happen outside of the
loop, no? Or is *ptep somehow changing concurrently while the loop is
doing the CMPXCHG and you need to recreate it each time?

IOW, you can generate upfront and do the empty loop...

> +
> +		} while (!try_cmpxchg(&ptep->pte, &old_pte.pte, new_pte.pte));
> +
> +		return;
> +	}
>  	clear_bit(_PAGE_BIT_RW, (unsigned long *)&ptep->pte);
>  }
>  
> @@ -1282,6 +1308,32 @@ static inline pud_t pudp_huge_get_and_clear(struct mm_struct *mm,
>  static inline void pmdp_set_wrprotect(struct mm_struct *mm,
>  				      unsigned long addr, pmd_t *pmdp)
>  {
> +	/*
> +	 * Some processors can start a write, but end up seeing a read-only
> +	 * PMD by the time they get to the Dirty bit.  In this case, they
> +	 * will set the Dirty bit, leaving a read-only, Dirty PMD which
> +	 * looks like a Shadow Stack PMD.
> +	 *
> +	 * However, this behavior has been improved and will not occur on
> +	 * processors supporting Shadow Stack.  Without this guarantee, a
> +	 * transition to a non-present PMD and flush the TLB would be
> +	 * needed.
> +	 *
> +	 * When changing a writable PMD to read-only and if the PMD has
> +	 * _PAGE_DIRTY set, move that bit to _PAGE_COW so that the PMD is
> +	 * not a shadow stack PMD.
> +	 */

Same comments as above.

> +	if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
> +		pmd_t old_pmd, new_pmd;
> +
> +		do {
> +			old_pmd = READ_ONCE(*pmdp);
> +			new_pmd = pmd_wrprotect(old_pmd);
> +
> +		} while (!try_cmpxchg((pmdval_t *)pmdp, (pmdval_t *)&old_pmd, pmd_val(new_pmd)));
> +
> +		return;
> +	}
>  	clear_bit(_PAGE_BIT_RW, (unsigned long *)pmdp);
>  }
>  
> -- 
> 2.21.0
>
Yu-cheng Yu Jan. 25, 2021, 9:27 p.m. UTC | #2
On 1/25/2021 10:27 AM, Borislav Petkov wrote:
> On Tue, Dec 29, 2020 at 01:30:38PM -0800, Yu-cheng Yu wrote:

[...]

>> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
>> index 666c25ab9564..1c84f1ba32b9 100644
>> --- a/arch/x86/include/asm/pgtable.h
>> +++ b/arch/x86/include/asm/pgtable.h
>> @@ -1226,6 +1226,32 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
>>   static inline void ptep_set_wrprotect(struct mm_struct *mm,
>>   				      unsigned long addr, pte_t *ptep)
>>   {
>> +	/*
>> +	 * Some processors can start a write, but end up seeing a read-only
>> +	 * PTE by the time they get to the Dirty bit.  In this case, they
>> +	 * will set the Dirty bit, leaving a read-only, Dirty PTE which
>> +	 * looks like a shadow stack PTE.
>> +	 *
>> +	 * However, this behavior has been improved
> 
> Improved how?

Processors supporting Shadow Stack will not set a read-only PTE's dirty 
bit.  I will revise the comments.

>> and will not occur on
>> +	 * processors supporting Shadow Stack.  Without this guarantee, a
> 
> Which guarantee? That it won't happen on CPUs which support SHSTK?
> 

Yes.

>> +	 * transition to a non-present PTE and flush the TLB would be
> 
> s/flush the TLB/TLB flush/
> 
>> +	 * needed.
>> +	 *
>> +	 * When changing a writable PTE to read-only and if the PTE has
>> +	 * _PAGE_DIRTY set, move that bit to _PAGE_COW so that the PTE is
>> +	 * not a shadow stack PTE.
>> +	 */
> 
> This sentence doesn't belong here as it refers to what pte_wrprotect()
> does. You could expand the comment in pte_wrprotect() with this here as
> it is better.

I will move this paragraph to pte_wrprotect().

> 
>> +	if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
>> +		pte_t old_pte, new_pte;
>> +
>> +		do {
>> +			old_pte = READ_ONCE(*ptep);
>> +			new_pte = pte_wrprotect(old_pte);
> 
> Maybe I'm missing something but those two can happen outside of the
> loop, no? Or is *ptep somehow changing concurrently while the loop is
> doing the CMPXCHG and you need to recreate it each time?
> 
> IOW, you can generate upfront and do the empty loop...

*ptep can change concurrently.

> 
>> +
>> +		} while (!try_cmpxchg(&ptep->pte, &old_pte.pte, new_pte.pte));
>> +
>> +		return;
>> +	}
>>   	clear_bit(_PAGE_BIT_RW, (unsigned long *)&ptep->pte);
>>   }
>>   

[...]
Borislav Petkov Jan. 25, 2021, 9:55 p.m. UTC | #3
On Mon, Jan 25, 2021 at 01:27:51PM -0800, Yu, Yu-cheng wrote:
> > Maybe I'm missing something but those two can happen outside of the
> > loop, no? Or is *ptep somehow changing concurrently while the loop is
> > doing the CMPXCHG and you need to recreate it each time?
> > 
> > IOW, you can generate upfront and do the empty loop...
> 
> *ptep can change concurrently.

Care to elaborate?
Yu-cheng Yu Jan. 25, 2021, 10:18 p.m. UTC | #4
On 1/25/2021 1:55 PM, Borislav Petkov wrote:
> On Mon, Jan 25, 2021 at 01:27:51PM -0800, Yu, Yu-cheng wrote:
>>> Maybe I'm missing something but those two can happen outside of the
>>> loop, no? Or is *ptep somehow changing concurrently while the loop is
>>> doing the CMPXCHG and you need to recreate it each time?
>>>
>>> IOW, you can generate upfront and do the empty loop...
>>
>> *ptep can change concurrently.
> 
> Care to elaborate?
> 

For example, when a thread reads a W=1, D=0 PTE and before changing it 
to W=0,D=0, another thread could have written to the page and the PTE is 
W=1, D=1 now.  When try_cmpxchg() detects the difference, old_pte is 
read again.
Peter Zijlstra Jan. 26, 2021, 8:46 a.m. UTC | #5
On Mon, Jan 25, 2021 at 07:27:09PM +0100, Borislav Petkov wrote:

> > +		pte_t old_pte, new_pte;
> > +
> > +		do {
> > +			old_pte = READ_ONCE(*ptep);
> > +			new_pte = pte_wrprotect(old_pte);
> 
> Maybe I'm missing something but those two can happen outside of the
> loop, no? Or is *ptep somehow changing concurrently while the loop is
> doing the CMPXCHG and you need to recreate it each time?
> 
> IOW, you can generate upfront and do the empty loop...
> 
> > +
> > +		} while (!try_cmpxchg(&ptep->pte, &old_pte.pte, new_pte.pte));
> > +
> > +		return;
> > +	}

Empty loop would be wrong, but that wants to be written like:

	old_pte = READ_ONCE(*ptep);
	do {
		new_pte = pte_wrprotect(old_pte);
	} while (try_cmpxchg(&ptep->pte, &old_pte.pte, new_pte.pte));

Since try_cmpxchg() will update old_pte on failure.
Peter Zijlstra Jan. 26, 2021, 9:40 a.m. UTC | #6
On Tue, Jan 26, 2021 at 09:46:36AM +0100, Peter Zijlstra wrote:
> On Mon, Jan 25, 2021 at 07:27:09PM +0100, Borislav Petkov wrote:
> 
> > > +		pte_t old_pte, new_pte;
> > > +
> > > +		do {
> > > +			old_pte = READ_ONCE(*ptep);
> > > +			new_pte = pte_wrprotect(old_pte);
> > 
> > Maybe I'm missing something but those two can happen outside of the
> > loop, no? Or is *ptep somehow changing concurrently while the loop is
> > doing the CMPXCHG and you need to recreate it each time?
> > 
> > IOW, you can generate upfront and do the empty loop...
> > 
> > > +
> > > +		} while (!try_cmpxchg(&ptep->pte, &old_pte.pte, new_pte.pte));
> > > +
> > > +		return;
> > > +	}
> 
> Empty loop would be wrong, but that wants to be written like:
> 
> 	old_pte = READ_ONCE(*ptep);
> 	do {
> 		new_pte = pte_wrprotect(old_pte);
> 	} while (try_cmpxchg(&ptep->pte, &old_pte.pte, new_pte.pte));

! went missing, too early, moar wake-up juice.

> Since try_cmpxchg() will update old_pte on failure.
Borislav Petkov Jan. 26, 2021, 10:24 a.m. UTC | #7
On Mon, Jan 25, 2021 at 02:18:37PM -0800, Yu, Yu-cheng wrote:
> For example, when a thread reads a W=1, D=0 PTE and before changing it to
> W=0,D=0, another thread could have written to the page and the PTE is W=1,
> D=1 now.  When try_cmpxchg() detects the difference, old_pte is read again.

None of that is mentioned in the comment above it and if anything,
*that* is what should be explained there - not some guarantee about some
processors which doesn't even apply here.

Also, add the fact that try_cmpxchg() will update old_pte with any
modified bits - D=1 for example - when it fails. As Peter just explained
to me on IRC.

Thx.
Yu-cheng Yu Jan. 26, 2021, 4:43 p.m. UTC | #8
On 1/26/2021 1:40 AM, Peter Zijlstra wrote:
> On Tue, Jan 26, 2021 at 09:46:36AM +0100, Peter Zijlstra wrote:
>> On Mon, Jan 25, 2021 at 07:27:09PM +0100, Borislav Petkov wrote:
>>
>>>> +		pte_t old_pte, new_pte;
>>>> +
>>>> +		do {
>>>> +			old_pte = READ_ONCE(*ptep);
>>>> +			new_pte = pte_wrprotect(old_pte);
>>>
>>> Maybe I'm missing something but those two can happen outside of the
>>> loop, no? Or is *ptep somehow changing concurrently while the loop is
>>> doing the CMPXCHG and you need to recreate it each time?
>>>
>>> IOW, you can generate upfront and do the empty loop...
>>>
>>>> +
>>>> +		} while (!try_cmpxchg(&ptep->pte, &old_pte.pte, new_pte.pte));
>>>> +
>>>> +		return;
>>>> +	}
>>
>> Empty loop would be wrong, but that wants to be written like:
>>
>> 	old_pte = READ_ONCE(*ptep);
>> 	do {
>> 		new_pte = pte_wrprotect(old_pte);
>> 	} while (try_cmpxchg(&ptep->pte, &old_pte.pte, new_pte.pte));
> 
> ! went missing, too early, moar wake-up juice.
> 
>> Since try_cmpxchg() will update old_pte on failure.

Thanks Peter!  I will fix that.

--
Yu-cheng
Yu-cheng Yu Jan. 26, 2021, 4:45 p.m. UTC | #9
On 1/26/2021 2:24 AM, Borislav Petkov wrote:
> On Mon, Jan 25, 2021 at 02:18:37PM -0800, Yu, Yu-cheng wrote:
>> For example, when a thread reads a W=1, D=0 PTE and before changing it to
>> W=0,D=0, another thread could have written to the page and the PTE is W=1,
>> D=1 now.  When try_cmpxchg() detects the difference, old_pte is read again.
> 
> None of that is mentioned in the comment above it and if anything,
> *that* is what should be explained there - not some guarantee about some
> processors which doesn't even apply here.
> 
> Also, add the fact that try_cmpxchg() will update old_pte with any
> modified bits - D=1 for example - when it fails. As Peter just explained
> to me on IRC.
> 
> Thx.
> 

Yes, I will fix it.  Thanks!
diff mbox series

Patch

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 666c25ab9564..1c84f1ba32b9 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1226,6 +1226,32 @@  static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
 static inline void ptep_set_wrprotect(struct mm_struct *mm,
 				      unsigned long addr, pte_t *ptep)
 {
+	/*
+	 * Some processors can start a write, but end up seeing a read-only
+	 * PTE by the time they get to the Dirty bit.  In this case, they
+	 * will set the Dirty bit, leaving a read-only, Dirty PTE which
+	 * looks like a shadow stack PTE.
+	 *
+	 * However, this behavior has been improved and will not occur on
+	 * processors supporting Shadow Stack.  Without this guarantee, a
+	 * transition to a non-present PTE and flush the TLB would be
+	 * needed.
+	 *
+	 * When changing a writable PTE to read-only and if the PTE has
+	 * _PAGE_DIRTY set, move that bit to _PAGE_COW so that the PTE is
+	 * not a shadow stack PTE.
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
+		pte_t old_pte, new_pte;
+
+		do {
+			old_pte = READ_ONCE(*ptep);
+			new_pte = pte_wrprotect(old_pte);
+
+		} while (!try_cmpxchg(&ptep->pte, &old_pte.pte, new_pte.pte));
+
+		return;
+	}
 	clear_bit(_PAGE_BIT_RW, (unsigned long *)&ptep->pte);
 }
 
@@ -1282,6 +1308,32 @@  static inline pud_t pudp_huge_get_and_clear(struct mm_struct *mm,
 static inline void pmdp_set_wrprotect(struct mm_struct *mm,
 				      unsigned long addr, pmd_t *pmdp)
 {
+	/*
+	 * Some processors can start a write, but end up seeing a read-only
+	 * PMD by the time they get to the Dirty bit.  In this case, they
+	 * will set the Dirty bit, leaving a read-only, Dirty PMD which
+	 * looks like a Shadow Stack PMD.
+	 *
+	 * However, this behavior has been improved and will not occur on
+	 * processors supporting Shadow Stack.  Without this guarantee, a
+	 * transition to a non-present PMD and flush the TLB would be
+	 * needed.
+	 *
+	 * When changing a writable PMD to read-only and if the PMD has
+	 * _PAGE_DIRTY set, move that bit to _PAGE_COW so that the PMD is
+	 * not a shadow stack PMD.
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
+		pmd_t old_pmd, new_pmd;
+
+		do {
+			old_pmd = READ_ONCE(*pmdp);
+			new_pmd = pmd_wrprotect(old_pmd);
+
+		} while (!try_cmpxchg((pmdval_t *)pmdp, (pmdval_t *)&old_pmd, pmd_val(new_pmd)));
+
+		return;
+	}
 	clear_bit(_PAGE_BIT_RW, (unsigned long *)pmdp);
 }