[RFC,v3,12/24] x86/mm: Modify ptep_set_wrprotect and pmdp_set_wrprotect for _PAGE_DIRTY_SW

Message ID 20180830143904.3168-13-yu-cheng.yu@intel.com (mailing list archive)
State New, archived
Series Control Flow Enforcement: Shadow Stack

Commit Message

Yu-cheng Yu Aug. 30, 2018, 2:38 p.m. UTC
When Shadow Stack is enabled, the read-only and PAGE_DIRTY_HW PTE
setting is reserved only for the Shadow Stack.  To track dirty of
non-Shadow Stack read-only PTEs, we use PAGE_DIRTY_SW.

Update ptep_set_wrprotect() and pmdp_set_wrprotect().

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
---
 arch/x86/include/asm/pgtable.h | 42 ++++++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)

Comments

Jann Horn Aug. 30, 2018, 3:49 p.m. UTC | #1
On Thu, Aug 30, 2018 at 4:43 PM Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
>
> When Shadow Stack is enabled, the read-only and PAGE_DIRTY_HW PTE
> setting is reserved only for the Shadow Stack.  To track dirty of
> non-Shadow Stack read-only PTEs, we use PAGE_DIRTY_SW.
>
> Update ptep_set_wrprotect() and pmdp_set_wrprotect().
>
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> ---
>  arch/x86/include/asm/pgtable.h | 42 ++++++++++++++++++++++++++++++++++
>  1 file changed, 42 insertions(+)
>
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index 4d50de77ea96..556ef258eeff 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -1203,7 +1203,28 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
>  static inline void ptep_set_wrprotect(struct mm_struct *mm,
>                                       unsigned long addr, pte_t *ptep)
>  {
> +       pte_t pte;
> +
>         clear_bit(_PAGE_BIT_RW, (unsigned long *)&ptep->pte);
> +       pte = *ptep;
> +
> +       /*
> +        * Some processors can start a write, but ending up seeing
> +        * a read-only PTE by the time they get to the Dirty bit.
> +        * In this case, they will set the Dirty bit, leaving a
> +        * read-only, Dirty PTE which looks like a Shadow Stack PTE.
> +        *
> +        * However, this behavior has been improved and will not occur
> +        * on processors supporting Shadow Stacks.  Without this
> +        * guarantee, a transition to a non-present PTE and flush the
> +        * TLB would be needed.
> +        *
> +        * When change a writable PTE to read-only and if the PTE has
> +        * _PAGE_DIRTY_HW set, we move that bit to _PAGE_DIRTY_SW so
> +        * that the PTE is not a valid Shadow Stack PTE.
> +        */
> +       pte = pte_move_flags(pte, _PAGE_DIRTY_HW, _PAGE_DIRTY_SW);
> +       set_pte_at(mm, addr, ptep, pte);
>  }

I don't understand why it's okay that you first atomically clear the
RW bit, then atomically switch from DIRTY_HW to DIRTY_SW. Doesn't that
mean that between the two atomic writes, another core can incorrectly
see a shadow stack?
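
For reference, the v3 sequence under discussion, with the transient state
annotated (the comments here are illustrative, not part of the patch):

	clear_bit(_PAGE_BIT_RW, (unsigned long *)&ptep->pte);
	/* PTE is now Write=0 but still Dirty=1: to another CPU this is
	 * indistinguishable from a valid shadow stack PTE */
	pte = *ptep;
	pte = pte_move_flags(pte, _PAGE_DIRTY_HW, _PAGE_DIRTY_SW);
	set_pte_at(mm, addr, ptep, pte);
	/* only here does the PTE become read-only + _PAGE_DIRTY_SW,
	 * which is no longer a valid shadow stack PTE */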
Yu-cheng Yu Aug. 30, 2018, 4:02 p.m. UTC | #2
On Thu, 2018-08-30 at 17:49 +0200, Jann Horn wrote:
> On Thu, Aug 30, 2018 at 4:43 PM Yu-cheng Yu <yu-cheng.yu@intel.com>
> wrote:
> > 
> > 
> > When Shadow Stack is enabled, the read-only and PAGE_DIRTY_HW PTE
> > setting is reserved only for the Shadow Stack.  To track dirty of
> > non-Shadow Stack read-only PTEs, we use PAGE_DIRTY_SW.
> > 
> > Update ptep_set_wrprotect() and pmdp_set_wrprotect().
> > 
> > Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> > ---
> >  arch/x86/include/asm/pgtable.h | 42
> > ++++++++++++++++++++++++++++++++++
> >  1 file changed, 42 insertions(+)
> > 
> > diff --git a/arch/x86/include/asm/pgtable.h
> > b/arch/x86/include/asm/pgtable.h
> > index 4d50de77ea96..556ef258eeff 100644
> > --- a/arch/x86/include/asm/pgtable.h
> > +++ b/arch/x86/include/asm/pgtable.h
> > @@ -1203,7 +1203,28 @@ static inline pte_t
> > ptep_get_and_clear_full(struct mm_struct *mm,
> >  static inline void ptep_set_wrprotect(struct mm_struct *mm,
> >                                       unsigned long addr, pte_t
> > *ptep)
> >  {
> > +       pte_t pte;
> > +
> >         clear_bit(_PAGE_BIT_RW, (unsigned long *)&ptep->pte);
> > +       pte = *ptep;
> > +
> > +       /*
> > +        * Some processors can start a write, but ending up seeing
> > +        * a read-only PTE by the time they get to the Dirty bit.
> > +        * In this case, they will set the Dirty bit, leaving a
> > +        * read-only, Dirty PTE which looks like a Shadow Stack
> > PTE.
> > +        *
> > +        * However, this behavior has been improved and will not
> > occur
> > +        * on processors supporting Shadow Stacks.  Without this
> > +        * guarantee, a transition to a non-present PTE and flush
> > the
> > +        * TLB would be needed.
> > +        *
> > +        * When change a writable PTE to read-only and if the PTE
> > has
> > +        * _PAGE_DIRTY_HW set, we move that bit to _PAGE_DIRTY_SW
> > so
> > +        * that the PTE is not a valid Shadow Stack PTE.
> > +        */
> > +       pte = pte_move_flags(pte, _PAGE_DIRTY_HW, _PAGE_DIRTY_SW);
> > +       set_pte_at(mm, addr, ptep, pte);
> >  }
> I don't understand why it's okay that you first atomically clear the
> RW bit, then atomically switch from DIRTY_HW to DIRTY_SW. Doesn't
> that
> mean that between the two atomic writes, another core can
> incorrectly
> see a shadow stack?

Yes, we had that concern earlier and checked.
On processors supporting Shadow Stacks, that will not happen.

Yu-cheng
Dave Hansen Aug. 30, 2018, 4:08 p.m. UTC | #3
On 08/30/2018 08:49 AM, Jann Horn wrote:
>> @@ -1203,7 +1203,28 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
>>  static inline void ptep_set_wrprotect(struct mm_struct *mm,
>>                                       unsigned long addr, pte_t *ptep)
>>  {
>> +       pte_t pte;
>> +
>>         clear_bit(_PAGE_BIT_RW, (unsigned long *)&ptep->pte);
>> +       pte = *ptep;
>> +
>> +       /*
>> +        * Some processors can start a write, but ending up seeing
>> +        * a read-only PTE by the time they get to the Dirty bit.
>> +        * In this case, they will set the Dirty bit, leaving a
>> +        * read-only, Dirty PTE which looks like a Shadow Stack PTE.
>> +        *
>> +        * However, this behavior has been improved and will not occur
>> +        * on processors supporting Shadow Stacks.  Without this
>> +        * guarantee, a transition to a non-present PTE and flush the
>> +        * TLB would be needed.
>> +        *
>> +        * When change a writable PTE to read-only and if the PTE has
>> +        * _PAGE_DIRTY_HW set, we move that bit to _PAGE_DIRTY_SW so
>> +        * that the PTE is not a valid Shadow Stack PTE.
>> +        */
>> +       pte = pte_move_flags(pte, _PAGE_DIRTY_HW, _PAGE_DIRTY_SW);
>> +       set_pte_at(mm, addr, ptep, pte);
>>  }
> I don't understand why it's okay that you first atomically clear the
> RW bit, then atomically switch from DIRTY_HW to DIRTY_SW. Doesn't that
> mean that between the two atomic writes, another core can incorrectly
> see a shadow stack?

Good point.

This could result in a spurious shadow-stack fault, or allow a
shadow-stack write to the page in the transient state.

But, the shadow-stack permissions are more restrictive than what could
be in the TLB at this point, so I don't think there's a real security
implication here.

The only trouble is handling the spurious shadow-stack fault.  The
alternative is to go !Present for a bit, which we would probably just
handle fine in the existing page fault code.
Jann Horn Aug. 30, 2018, 4:23 p.m. UTC | #4
On Thu, Aug 30, 2018 at 6:09 PM Dave Hansen <dave.hansen@linux.intel.com> wrote:
>
> On 08/30/2018 08:49 AM, Jann Horn wrote:
> >> @@ -1203,7 +1203,28 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
> >>  static inline void ptep_set_wrprotect(struct mm_struct *mm,
> >>                                       unsigned long addr, pte_t *ptep)
> >>  {
> >> +       pte_t pte;
> >> +
> >>         clear_bit(_PAGE_BIT_RW, (unsigned long *)&ptep->pte);
> >> +       pte = *ptep;
> >> +
> >> +       /*
> >> +        * Some processors can start a write, but ending up seeing
> >> +        * a read-only PTE by the time they get to the Dirty bit.
> >> +        * In this case, they will set the Dirty bit, leaving a
> >> +        * read-only, Dirty PTE which looks like a Shadow Stack PTE.
> >> +        *
> >> +        * However, this behavior has been improved and will not occur
> >> +        * on processors supporting Shadow Stacks.  Without this
> >> +        * guarantee, a transition to a non-present PTE and flush the
> >> +        * TLB would be needed.
> >> +        *
> >> +        * When change a writable PTE to read-only and if the PTE has
> >> +        * _PAGE_DIRTY_HW set, we move that bit to _PAGE_DIRTY_SW so
> >> +        * that the PTE is not a valid Shadow Stack PTE.
> >> +        */
> >> +       pte = pte_move_flags(pte, _PAGE_DIRTY_HW, _PAGE_DIRTY_SW);
> >> +       set_pte_at(mm, addr, ptep, pte);
> >>  }
> > I don't understand why it's okay that you first atomically clear the
> > RW bit, then atomically switch from DIRTY_HW to DIRTY_SW. Doesn't that
> > mean that between the two atomic writes, another core can incorrectly
> > see a shadow stack?
>
> Good point.
>
> This could result in a spurious shadow-stack fault, or allow a
> shadow-stack write to the page in the transient state.
>
> But, the shadow-stack permissions are more restrictive than what could
> be in the TLB at this point, so I don't think there's a real security
> implication here.

How about this:

Three threads (A, B, C) run with the same CR3.

1. a dirty+writable PTE is placed directly in front of B's shadow stack.
   (this can happen, right? or is there a guard page?)
2. C's TLB caches the dirty+writable PTE.
3. A performs some syscall that triggers ptep_set_wrprotect().
4. A's syscall calls clear_bit().
5. B's TLB caches the transient shadow stack.
[now C has write access to B's transiently-extended shadow stack]
6. B recurses into the transiently-extended shadow stack
7. C overwrites the transiently-extended shadow stack area.
8. B returns through the transiently-extended shadow stack, giving
    the attacker instruction pointer control in B.
9. A's syscall broadcasts a TLB flush.

Sure, it's not exactly an easy race and probably requires at least
some black timing magic to exploit, if it's exploitable at all - but
still. This seems suboptimal.

> The only trouble is handling the spurious shadow-stack fault.  The
> alternative is to go !Present for a bit, which we would probably just
> handle fine in the existing page fault code.
Dave Hansen Aug. 30, 2018, 5:19 p.m. UTC | #5
On 08/30/2018 09:23 AM, Jann Horn wrote:
> Three threads (A, B, C) run with the same CR3.
> 
> 1. a dirty+writable PTE is placed directly in front of B's shadow stack.
>    (this can happen, right? or is there a guard page?)
> 2. C's TLB caches the dirty+writable PTE.
> 3. A performs some syscall that triggers ptep_set_wrprotect().
> 4. A's syscall calls clear_bit().
> 5. B's TLB caches the transient shadow stack.
> [now C has write access to B's transiently-extended shadow stack]
> 6. B recurses into the transiently-extended shadow stack
> 7. C overwrites the transiently-extended shadow stack area.
> 8. B returns through the transiently-extended shadow stack, giving
>     the attacker instruction pointer control in B.
> 9. A's syscall broadcasts a TLB flush.

Heh, that's a good point.  The shadow stack permissions are *not*
strictly reduced because a page getting marked as shadow-stack has
*increased* permissions when being used as a shadow stack.  Fun.

For general hardening, it seems like we want to ensure that there's a
guard page at the bottom of the shadow stack.  Yu-cheng, do we have a
guard page?

But, to keep B's TLB from picking up the entry, I think we can just make
it !Present for a moment.  No TLB can cache it, and I believe the same
"don't set Dirty on a !Writable entry" logic also holds for !Present
(modulo a weird erratum or two).

If we do that, we just need to make sure that the fault handler knows it
can get spurious faults from it, and might even run into the !Present
PTE for a moment.  It might be a bit confusing because it won't be a
PROT_NONE, migration, or swap PTE, but will be !Present.  We'll also
have to make sure that we're doing this in a way which is friendly to
the L1TF PTE handling.
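
A minimal sketch of the transient-!Present approach Dave describes, reusing
the patch's pte_move_flags() helper (an illustration under those assumptions,
not the posted patch; spurious-fault and L1TF details omitted):

	static inline void ptep_set_wrprotect(struct mm_struct *mm,
					      unsigned long addr, pte_t *ptep)
	{
		/* atomically read and clear the PTE: it is now !Present,
		 * so no TLB can cache a transient shadow-stack state */
		pte_t pte = ptep_get_and_clear(mm, addr, ptep);

		pte = pte_wrprotect(pte);
		pte = pte_move_flags(pte, _PAGE_DIRTY_HW, _PAGE_DIRTY_SW);
		/* republish the read-only, software-dirty PTE */
		set_pte_at(mm, addr, ptep, pte);
	}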
Yu-cheng Yu Aug. 30, 2018, 5:26 p.m. UTC | #6
On Thu, 2018-08-30 at 10:19 -0700, Dave Hansen wrote:
> On 08/30/2018 09:23 AM, Jann Horn wrote:
> > 
> > Three threads (A, B, C) run with the same CR3.
> > 
> > 1. a dirty+writable PTE is placed directly in front of B's shadow
> > stack.
> >    (this can happen, right? or is there a guard page?)
> > 2. C's TLB caches the dirty+writable PTE.
> > 3. A performs some syscall that triggers ptep_set_wrprotect().
> > 4. A's syscall calls clear_bit().
> > 5. B's TLB caches the transient shadow stack.
> > [now C has write access to B's transiently-extended shadow stack]
> > 6. B recurses into the transiently-extended shadow stack
> > 7. C overwrites the transiently-extended shadow stack area.
> > 8. B returns through the transiently-extended shadow stack, giving
> >     the attacker instruction pointer control in B.
> > 9. A's syscall broadcasts a TLB flush.
> Heh, that's a good point.  The shadow stack permissions are *not*
> strictly reduced because a page getting marked as shadow-stack has
> *increased* permissions when being used as a shadow stack.  Fun.
> 
> For general hardening, it seems like we want to ensure that there's
> a
> guard page at the bottom of the shadow stack.  Yu-cheng, do we have
> a
> guard page?

We don't have the guard page now, but there is a shadow stack token
there, which cannot be used as a return address.

Yu-cheng
Dave Hansen Aug. 30, 2018, 5:33 p.m. UTC | #7
On 08/30/2018 10:26 AM, Yu-cheng Yu wrote:
> We don't have the guard page now, but there is a shadow stack token
> there, which cannot be used as a return address.

The overall concern is that we could overflow into a page that we did
not intend.  Either another actual shadow stack or a page that the
attacker constructed, like the transient scenario Jann described.
Andy Lutomirski Aug. 30, 2018, 5:34 p.m. UTC | #8
> On Aug 30, 2018, at 10:19 AM, Dave Hansen <dave.hansen@linux.intel.com> wrote:
> 
>> On 08/30/2018 09:23 AM, Jann Horn wrote:
>> Three threads (A, B, C) run with the same CR3.
>> 
>> 1. a dirty+writable PTE is placed directly in front of B's shadow stack.
>>   (this can happen, right? or is there a guard page?)
>> 2. C's TLB caches the dirty+writable PTE.
>> 3. A performs some syscall that triggers ptep_set_wrprotect().
>> 4. A's syscall calls clear_bit().
>> 5. B's TLB caches the transient shadow stack.
>> [now C has write access to B's transiently-extended shadow stack]
>> 6. B recurses into the transiently-extended shadow stack
>> 7. C overwrites the transiently-extended shadow stack area.
>> 8. B returns through the transiently-extended shadow stack, giving
>>    the attacker instruction pointer control in B.
>> 9. A's syscall broadcasts a TLB flush.
> 
> Heh, that's a good point.  The shadow stack permissions are *not*
> strictly reduced because a page getting marked as shadow-stack has
> *increased* permissions when being used as a shadow stack.  Fun.
> 
> For general hardening, it seems like we want to ensure that there's a
> guard page at the bottom of the shadow stack.  Yu-cheng, do we have a
> guard page?
> 
> But, to keep B's TLB from picking up the entry, I think we can just make
> it !Present for a moment.  No TLB can cache it, and I believe the same
> "don't set Dirty on a !Writable entry" logic also holds for !Present
> (modulo a weird erratum or two).

Can we get documentation?  Pretty please?
Yu-cheng Yu Aug. 30, 2018, 5:54 p.m. UTC | #9
On Thu, 2018-08-30 at 10:33 -0700, Dave Hansen wrote:
> On 08/30/2018 10:26 AM, Yu-cheng Yu wrote:
> > 
> > We don't have the guard page now, but there is a shadow stack
> > token
> > there, which cannot be used as a return address.
> The overall concern is that we could overflow into a page that we did
> not intend.  Either another actual shadow stack or a page that the
> attacker constructed, like the transient scenario Jann described.
> 

A task could go beyond the bottom of its shadow stack by doing either
'ret' or 'incssp'.  If it is the 'ret' case, the token prevents it.
 If it is the 'incssp' case, a guard page cannot prevent it entirely,
right?

Yu-cheng
Jann Horn Aug. 30, 2018, 5:59 p.m. UTC | #10
On Thu, Aug 30, 2018 at 7:58 PM Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
>
> On Thu, 2018-08-30 at 10:33 -0700, Dave Hansen wrote:
> > On 08/30/2018 10:26 AM, Yu-cheng Yu wrote:
> > >
> > > We don't have the guard page now, but there is a shadow stack
> > > token
> > > there, which cannot be used as a return address.
> > The overall concern is that we could overflow into a page that we
> > did not intend.  Either another actual shadow stack or a page that
> > the attacker constructed, like the transient scenario Jann described.
> >
>
> A task could go beyond the bottom of its shadow stack by doing either
> 'ret' or 'incssp'.  If it is the 'ret' case, the token prevents it.
>  If it is the 'incssp' case, a guard page cannot prevent it entirely,
> right?

I mean the other direction, on "call".
Dave Hansen Aug. 30, 2018, 6:55 p.m. UTC | #11
On 08/30/2018 10:34 AM, Andy Lutomirski wrote:
>> But, to keep B's TLB from picking up the entry, I think we can just make
>> it !Present for a moment.  No TLB can cache it, and I believe the same
>> "don't set Dirty on a !Writable entry" logic also holds for !Present
>> (modulo a weird erratum or two).
> Can we get documentation?  Pretty please?

The accessed bit description in the SDM looks pretty good to me today:

> Whenever the processor uses a paging-structure entry as part of
> linear-address translation, it sets the accessed flag in that entry
> (if it is not already set).
If it's !Present, it can't be used as part of a translation, so the flag
can't be set.  I think that covers the thing I was unsure about.

But, Dirty is a bit, er, muddier, but mostly because it only gets set on
leaf entries:

> Whenever there is a write to a linear address, the processor sets the
> dirty flag (if it is not already set) in the paging- structure entry
> that identifies the final physical address for the linear address
> (either a PTE or a paging-structure entry in which the PS flag is
> 1).

That little hunk will definitely need to get updated with something like:

	On processors enumerating support for CET, the processor will only
	set the dirty flag on paging structure entries in which the W
	flag is 1.
Randy Dunlap Aug. 30, 2018, 7:59 p.m. UTC | #12
On 08/30/2018 07:38 AM, Yu-cheng Yu wrote:
> When Shadow Stack is enabled, the read-only and PAGE_DIRTY_HW PTE
> setting is reserved only for the Shadow Stack.  To track dirty of
> non-Shadow Stack read-only PTEs, we use PAGE_DIRTY_SW.
> 
> Update ptep_set_wrprotect() and pmdp_set_wrprotect().
> 
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> ---
>  arch/x86/include/asm/pgtable.h | 42 ++++++++++++++++++++++++++++++++++
>  1 file changed, 42 insertions(+)
> 
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index 4d50de77ea96..556ef258eeff 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -1203,7 +1203,28 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
>  static inline void ptep_set_wrprotect(struct mm_struct *mm,
>  				      unsigned long addr, pte_t *ptep)
>  {
> +	pte_t pte;
> +
>  	clear_bit(_PAGE_BIT_RW, (unsigned long *)&ptep->pte);
> +	pte = *ptep;
> +
> +	/*
> +	 * Some processors can start a write, but ending up seeing

	                                      but end up seeing

> +	 * a read-only PTE by the time they get to the Dirty bit.
> +	 * In this case, they will set the Dirty bit, leaving a
> +	 * read-only, Dirty PTE which looks like a Shadow Stack PTE.
> +	 *
> +	 * However, this behavior has been improved and will not occur
> +	 * on processors supporting Shadow Stacks.  Without this
> +	 * guarantee, a transition to a non-present PTE and flush the
> +	 * TLB would be needed.
> +	 *
> +	 * When change a writable PTE to read-only and if the PTE has

	        changing

> +	 * _PAGE_DIRTY_HW set, we move that bit to _PAGE_DIRTY_SW so
> +	 * that the PTE is not a valid Shadow Stack PTE.
> +	 */
> +	pte = pte_move_flags(pte, _PAGE_DIRTY_HW, _PAGE_DIRTY_SW);
> +	set_pte_at(mm, addr, ptep, pte);
>  }
>  
>  #define flush_tlb_fix_spurious_fault(vma, address) do { } while (0)
> @@ -1266,7 +1287,28 @@ static inline pud_t pudp_huge_get_and_clear(struct mm_struct *mm,
>  static inline void pmdp_set_wrprotect(struct mm_struct *mm,
>  				      unsigned long addr, pmd_t *pmdp)
>  {
> +	pmd_t pmd;
> +
>  	clear_bit(_PAGE_BIT_RW, (unsigned long *)pmdp);
> +	pmd = *pmdp;
> +
> +	/*
> +	 * Some processors can start a write, but ending up seeing

	                                      but end up seeing

> +	 * a read-only PTE by the time they get to the Dirty bit.
> +	 * In this case, they will set the Dirty bit, leaving a
> +	 * read-only, Dirty PTE which looks like a Shadow Stack PTE.
> +	 *
> +	 * However, this behavior has been improved and will not occur
> +	 * on processors supporting Shadow Stacks.  Without this
> +	 * guarantee, a transition to a non-present PTE and flush the
> +	 * TLB would be needed.
> +	 *
> +	 * When change a writable PTE to read-only and if the PTE has

	        changing

> +	 * _PAGE_DIRTY_HW set, we move that bit to _PAGE_DIRTY_SW so
> +	 * that the PTE is not a valid Shadow Stack PTE.
> +	 */
> +	pmd = pmd_move_flags(pmd, _PAGE_DIRTY_HW, _PAGE_DIRTY_SW);
> +	set_pmd_at(mm, addr, pmdp, pmd);
>  }
>  
>  #define pud_write pud_write
>
Yu-cheng Yu Aug. 30, 2018, 8:21 p.m. UTC | #13
On Thu, 2018-08-30 at 19:59 +0200, Jann Horn wrote:
> On Thu, Aug 30, 2018 at 7:58 PM Yu-cheng Yu <yu-cheng.yu@intel.com>
> wrote:
> > 
> > 
> > On Thu, 2018-08-30 at 10:33 -0700, Dave Hansen wrote:
> > > 
> > > On 08/30/2018 10:26 AM, Yu-cheng Yu wrote:
> > > > 
> > > > 
> > > > We don't have the guard page now, but there is a shadow stack
> > > > token
> > > > there, which cannot be used as a return address.
> > > The overall concern is that we could overflow into a page that we
> > > did not intend.  Either another actual shadow stack or a page that
> > > the attacker constructed, like the transient scenario Jann
> > > described.
> > > 
> > A task could go beyond the bottom of its shadow stack by doing
> > either
> > 'ret' or 'incssp'.  If it is the 'ret' case, the token prevents
> > it.
> >  If it is the 'incssp' case, a guard page cannot prevent it
> > entirely,
> > right?
> I mean the other direction, on "call".

In the flow you described, if C writes to the overflow page before B
gets in with a 'call', the return address is still correct for B.  To
make an attack, C needs to write again before the TLB flush.  I agree
that is possible.

Assume we have a guard page, can someone in the short window do
recursive calls in B, move ssp to the end of the guard page, and
trigger the same again?  He can simply take the incssp route.
Yu-cheng Yu Aug. 30, 2018, 8:23 p.m. UTC | #14
On Thu, 2018-08-30 at 12:59 -0700, Randy Dunlap wrote:
> On 08/30/2018 07:38 AM, Yu-cheng Yu wrote:
> > 
> > When Shadow Stack is enabled, the read-only and PAGE_DIRTY_HW PTE
> > setting is reserved only for the Shadow Stack.  To track dirty of
> > non-Shadow Stack read-only PTEs, we use PAGE_DIRTY_SW.
> > 
> > Update ptep_set_wrprotect() and pmdp_set_wrprotect().
> > 
> > Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> > ---
> >  arch/x86/include/asm/pgtable.h | 42
> > ++++++++++++++++++++++++++++++++++
> >  1 file changed, 42 insertions(+)
> > 
> > diff --git a/arch/x86/include/asm/pgtable.h
> > b/arch/x86/include/asm/pgtable.h
> > index 4d50de77ea96..556ef258eeff 100644
> > --- a/arch/x86/include/asm/pgtable.h
> > +++ b/arch/x86/include/asm/pgtable.h
> > @@ -1203,7 +1203,28 @@ static inline pte_t
> > ptep_get_and_clear_full(struct mm_struct *mm,
> >  static inline void ptep_set_wrprotect(struct mm_struct *mm,
> >  				      unsigned long addr, pte_t
> > *ptep)
> >  {
> > +	pte_t pte;
> > +
> >  	clear_bit(_PAGE_BIT_RW, (unsigned long *)&ptep->pte);
> > +	pte = *ptep;
> > +
> > +	/*
> > +	 * Some processors can start a write, but ending up
> > seeing
> 	                                      but end up seeing
> 
> > 
> > +	 * a read-only PTE by the time they get to the Dirty bit.
> > +	 * In this case, they will set the Dirty bit, leaving a
> > +	 * read-only, Dirty PTE which looks like a Shadow Stack
> > PTE.
> > +	 *
> > +	 * However, this behavior has been improved and will not
> > occur
> > +	 * on processors supporting Shadow Stacks.  Without this
> > +	 * guarantee, a transition to a non-present PTE and flush
> > the
> > +	 * TLB would be needed.
> > +	 *
> > +	 * When change a writable PTE to read-only and if the PTE
> > has
> 	        changing
> 
> > 
> > +	 * _PAGE_DIRTY_HW set, we move that bit to _PAGE_DIRTY_SW
> > so
> > +	 * that the PTE is not a valid Shadow Stack PTE.
> > +	 */
> > +	pte = pte_move_flags(pte, _PAGE_DIRTY_HW,
> > _PAGE_DIRTY_SW);
> > +	set_pte_at(mm, addr, ptep, pte);
> >  }
> >  
> >  #define flush_tlb_fix_spurious_fault(vma, address) do { } while
> > (0)
> > @@ -1266,7 +1287,28 @@ static inline pud_t
> > pudp_huge_get_and_clear(struct mm_struct *mm,
> >  static inline void pmdp_set_wrprotect(struct mm_struct *mm,
> >  				      unsigned long addr, pmd_t
> > *pmdp)
> >  {
> > +	pmd_t pmd;
> > +
> >  	clear_bit(_PAGE_BIT_RW, (unsigned long *)pmdp);
> > +	pmd = *pmdp;
> > +
> > +	/*
> > +	 * Some processors can start a write, but ending up
> > seeing
> 	                                      but end up seeing
> 
> > 
> > +	 * a read-only PTE by the time they get to the Dirty bit.
> > +	 * In this case, they will set the Dirty bit, leaving a
> > +	 * read-only, Dirty PTE which looks like a Shadow Stack
> > PTE.
> > +	 *
> > +	 * However, this behavior has been improved and will not
> > occur
> > +	 * on processors supporting Shadow Stacks.  Without this
> > +	 * guarantee, a transition to a non-present PTE and flush
> > the
> > +	 * TLB would be needed.
> > +	 *
> > +	 * When change a writable PTE to read-only and if the PTE
> > has
> 	        changing
> 
> > 
> > +	 * _PAGE_DIRTY_HW set, we move that bit to _PAGE_DIRTY_SW
> > so
> > +	 * that the PTE is not a valid Shadow Stack PTE.
> > +	 */
> > +	pmd = pmd_move_flags(pmd, _PAGE_DIRTY_HW,
> > _PAGE_DIRTY_SW);
> > +	set_pmd_at(mm, addr, pmdp, pmd);
> >  }
> >  
> >  #define pud_write pud_write
> > 
> 

Thanks, I will fix it!
Jann Horn Aug. 30, 2018, 8:44 p.m. UTC | #15
On Thu, Aug 30, 2018 at 10:25 PM Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
>
> On Thu, 2018-08-30 at 19:59 +0200, Jann Horn wrote:
> > On Thu, Aug 30, 2018 at 7:58 PM Yu-cheng Yu <yu-cheng.yu@intel.com>
> > wrote:
> > >
> > >
> > > On Thu, 2018-08-30 at 10:33 -0700, Dave Hansen wrote:
> > > >
> > > > On 08/30/2018 10:26 AM, Yu-cheng Yu wrote:
> > > > >
> > > > >
> > > > > We don't have the guard page now, but there is a shadow stack
> > > > > token
> > > > > there, which cannot be used as a return address.
> > > > The overall concern is that we could overflow into a page that
> > > > we did not intend.  Either another actual shadow stack or a page
> > > > that the attacker constructed, like the transient scenario Jann
> > > > described.
> > > >
> > > A task could go beyond the bottom of its shadow stack by doing
> > > either
> > > 'ret' or 'incssp'.  If it is the 'ret' case, the token prevents
> > > it.
> > >  If it is the 'incssp' case, a guard page cannot prevent it
> > > entirely,
> > > right?
> > I mean the other direction, on "call".
>
> In the flow you described, if C writes to the overflow page before B
> gets in with a 'call', the return address is still correct for B.  To
> make an attack, C needs to write again before the TLB flush.  I agree
> that is possible.
>
> Assume we have a guard page, can someone in the short window do
> recursive calls in B, move ssp to the end of the guard page, and
> trigger the same again?  He can simply take the incssp route.

I don't understand what you're saying. If the shadow stack is between
guard pages, you should never be able to move SSP past that area's
guard pages without an appropriate shadow stack token (not even with
INCSSP, since that has a maximum range of PAGE_SIZE/2), and therefore,
it shouldn't matter whether memory outside that range is incorrectly
marked as shadow stack. Am I missing something?
Yu-cheng Yu Aug. 30, 2018, 8:52 p.m. UTC | #16
On Thu, 2018-08-30 at 22:44 +0200, Jann Horn wrote:
> On Thu, Aug 30, 2018 at 10:25 PM Yu-cheng Yu <yu-cheng.yu@intel.com>
> wrote:
...
> > In the flow you described, if C writes to the overflow page before
> > B
> > gets in with a 'call', the return address is still correct for
> > B.  To
> > make an attack, C needs to write again before the TLB flush.  I
> > agree
> > that is possible.
> > 
> > Assume we have a guard page, can someone in the short window do
> > recursive calls in B, move ssp to the end of the guard page, and
> > trigger the same again?  He can simply take the incssp route.
> I don't understand what you're saying. If the shadow stack is
> between
> guard pages, you should never be able to move SSP past that area's
> guard pages without an appropriate shadow stack token (not even with
> INCSSP, since that has a maximum range of PAGE_SIZE/2), and
> therefore,
> it shouldn't matter whether memory outside that range is incorrectly
> marked as shadow stack. Am I missing something?

INCSSP has a range of 256, but we can do multiples of that.
But I realize the key is not to have the transient SHSTK page at all.
The guard page is !pte_write() and even if we have flaws in
ptep_set_wrprotect(), there will not be any transient SHSTK pages.  I
will add guard pages to both ends.

Still thinking about how to fix ptep_set_wrprotect().

Yu-cheng
Jann Horn Aug. 30, 2018, 9:01 p.m. UTC | #17
On Thu, Aug 30, 2018 at 10:57 PM Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
>
> On Thu, 2018-08-30 at 22:44 +0200, Jann Horn wrote:
> > On Thu, Aug 30, 2018 at 10:25 PM Yu-cheng Yu <yu-cheng.yu@intel.com>
> > wrote:
> ...
> > > In the flow you described, if C writes to the overflow page before
> > > B
> > > gets in with a 'call', the return address is still correct for
> > > B.  To
> > > make an attack, C needs to write again before the TLB flush.  I
> > > agree
> > > that is possible.
> > >
> > > Assume we have a guard page, can someone in the short window do
> > > recursive calls in B, move ssp to the end of the guard page, and
> > > trigger the same again?  He can simply take the incssp route.
> > I don't understand what you're saying. If the shadow stack is
> > between
> > guard pages, you should never be able to move SSP past that area's
> > guard pages without an appropriate shadow stack token (not even with
> > INCSSP, since that has a maximum range of PAGE_SIZE/2), and
> > therefore,
> > it shouldn't matter whether memory outside that range is incorrectly
> > marked as shadow stack. Am I missing something?
>
> INCSSP has a range of 256, but we can do multiples of that.
> But I realize the key is not to have the transient SHSTK page at all.
> The guard page is !pte_write() and even if we have flaws in
> ptep_set_wrprotect(), there will not be any transient SHSTK pages.  I
> will add guard pages to both ends.
>
> Still thinking about how to fix ptep_set_wrprotect().

cmpxchg loop? Or is that slow?
Jann Horn Aug. 30, 2018, 9:47 p.m. UTC | #18
On Thu, Aug 30, 2018 at 11:01 PM Jann Horn <jannh@google.com> wrote:
>
> On Thu, Aug 30, 2018 at 10:57 PM Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
> >
> > On Thu, 2018-08-30 at 22:44 +0200, Jann Horn wrote:
> > > On Thu, Aug 30, 2018 at 10:25 PM Yu-cheng Yu <yu-cheng.yu@intel.com>
> > > wrote:
> > ...
> > > > In the flow you described, if C writes to the overflow page before
> > > > B
> > > > gets in with a 'call', the return address is still correct for
> > > > B.  To
> > > > make an attack, C needs to write again before the TLB flush.  I
> > > > agree
> > > > that is possible.
> > > >
> > > > Assume we have a guard page, can someone in the short window do
> > > > recursive calls in B, move ssp to the end of the guard page, and
> > > > trigger the same again?  He can simply take the incssp route.
> > > I don't understand what you're saying. If the shadow stack is
> > > between
> > > guard pages, you should never be able to move SSP past that area's
> > > guard pages without an appropriate shadow stack token (not even with
> > > INCSSP, since that has a maximum range of PAGE_SIZE/2), and
> > > therefore,
> > > it shouldn't matter whether memory outside that range is incorrectly
> > > marked as shadow stack. Am I missing something?
> >
> > INCSSP has a range of 256, but we can do multiples of that.
> > But I realize the key is not to have the transient SHSTK page at all.
> > The guard page is !pte_write() and even if we have flaws in
> > ptep_set_wrprotect(), there will not be any transient SHSTK pages.  I
> > will add guard pages to both ends.
> >
> > Still thinking about how to fix ptep_set_wrprotect().
>
> cmpxchg loop? Or is that slow?

Something like this:

static inline void ptep_set_wrprotect(struct mm_struct *mm,
                                      unsigned long addr, pte_t *ptep)
{
        pte_t pte = READ_ONCE(*ptep), new_pte;
        pteval_t old;

        /* ... your comment about not needing a TLB shootdown here ... */
        do {
                old = pte.pte;
                new_pte = pte_wrprotect(pte);
                /* note: relies on _PAGE_DIRTY_HW < _PAGE_DIRTY_SW */
                /* dirty direct bit-twiddling; you can probably write
                 * this in a nicer way */
                new_pte.pte |= (new_pte.pte & _PAGE_DIRTY_HW) >>
                               _PAGE_BIT_DIRTY_HW << _PAGE_BIT_DIRTY_SW;
                new_pte.pte &= ~_PAGE_DIRTY_HW;
                /* cmpxchg returns the old value; retry if we lost a race */
                pte.pte = cmpxchg(&ptep->pte, old, new_pte.pte);
        } while (pte.pte != old);
}

I think this has the advantage of not generating weird spurious pagefaults.
It's not compatible with Xen PV, but I'm guessing that this whole
feature isn't going to support Xen PV anyway? So you could switch
between two implementations of ptep_set_wrprotect using the pvop
mechanism or so - one for environments that support shadow stacks, one
for all other environments.
Or is there some arcane reason why cmpxchg doesn't work here the way I
think it should?
Andy Lutomirski Aug. 31, 2018, 1:23 a.m. UTC | #19
> On Aug 30, 2018, at 10:59 AM, Jann Horn <jannh@google.com> wrote:
> 
>> On Thu, Aug 30, 2018 at 7:58 PM Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
>> 
>>> On Thu, 2018-08-30 at 10:33 -0700, Dave Hansen wrote:
>>>> On 08/30/2018 10:26 AM, Yu-cheng Yu wrote:
>>>> 
>>>> We don't have the guard page now, but there is a shadow stack
>>>> token
>>>> there, which cannot be used as a return address.
>>> The overall concern is that we could overflow into a page that we
>>> did not intend.  Either another actual shadow stack or a page that
>>> the attacker constructed, like the transient scenario Jann described.
>>> 
>> 
>> A task could go beyond the bottom of its shadow stack by doing either
>> 'ret' or 'incssp'.  If it is the 'ret' case, the token prevents it.
>> If it is the 'incssp' case, a guard page cannot prevent it entirely,
>> right?
> 
> I mean the other direction, on "call".

I still think that shadow stacks should work just like mmap and that mmap should learn to add guard pages for all non-MAP_FIXED allocations.
Peter Zijlstra Aug. 31, 2018, 9:53 a.m. UTC | #20
On Thu, Aug 30, 2018 at 11:47:16PM +0200, Jann Horn wrote:
>         do {
>                 old = pte.pte;
>                 new_pte = pte_wrprotect(pte);
>                 /* note: relies on _PAGE_DIRTY_HW < _PAGE_DIRTY_SW */
>                 /* dirty direct bit-twiddling; you can probably write
>                  * this in a nicer way */
>                 new_pte.pte |= (new_pte.pte & _PAGE_DIRTY_HW) >>
>                                _PAGE_BIT_DIRTY_HW << _PAGE_BIT_DIRTY_SW;
>                 new_pte.pte &= ~_PAGE_DIRTY_HW;
>                 pte.pte = cmpxchg(&ptep->pte, old, new_pte.pte);
>         } while (pte.pte != old);

Please use the form:

	pte_t new_pte, pte = READ_ONCE(*ptep);
	do {
		new_pte = /* ... */;
	} while (!try_cmpxchg(ptep, &pte, new_pte));

Also, this will fail to build on i386-PAE, but I suspect this code will
be under some CONFIG option specific to x86_64 anyway.
Yu-cheng Yu Aug. 31, 2018, 2:33 p.m. UTC | #21
On Fri, 2018-08-31 at 11:53 +0200, Peter Zijlstra wrote:
> On Thu, Aug 30, 2018 at 11:47:16PM +0200, Jann Horn wrote:
> > 
> >         do {
> >                 old = pte.pte;
> >                 new_pte = pte_wrprotect(pte);
> >                 /* note: relies on _PAGE_DIRTY_HW < _PAGE_DIRTY_SW */
> >                 /* dirty direct bit-twiddling; you can probably write
> >                  * this in a nicer way */
> >                 new_pte.pte |= (new_pte.pte & _PAGE_DIRTY_HW) >>
> >                                _PAGE_BIT_DIRTY_HW << _PAGE_BIT_DIRTY_SW;
> >                 new_pte.pte &= ~_PAGE_DIRTY_HW;
> >                 pte.pte = cmpxchg(&ptep->pte, old, new_pte.pte);
> >         } while (pte.pte != old);
> Please use the form:
> 
> 	pte_t new_pte, pte = READ_ONCE(*ptep);
> 	do {
> 		new_pte = /* ... */;
> 	} while (!try_cmpxchg(ptep, &pte, new_pte));
> 
> Also, this will fail to build on i386-PAE, but I suspect this code
> will
> be under some CONFIG option specific to x86_64 anyway.

Thanks!  I will work on it.

Yu-cheng
Dave Hansen Aug. 31, 2018, 2:47 p.m. UTC | #22
On 08/31/2018 07:33 AM, Yu-cheng Yu wrote:
> Please use the form:
> 
> 	pte_t new_pte, pte = READ_ONCE(*ptep);
> 	do {
> 		new_pte = /* ... */;
> 	} while (!try_cmpxchg(ptep, &pte, new_pte));

It's probably also worth doing some testing to see if you can detect the
cost of the cmpxchg.  It's definitely more than the old code.

A loop that does mprotect(PROT_READ) followed by
mprotect(PROT_READ|PROT_WRITE) should do it.
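
For concreteness, the loop Dave suggests could look like this as a
stand-alone user-space program (a hypothetical sketch; the region size and
iteration count are arbitrary):

	#include <stdio.h>
	#include <sys/mman.h>
	#include <time.h>

	int main(void)
	{
		size_t len = 64 * 4096;	/* arbitrary: 64 pages */
		char *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
				  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		struct timespec t0, t1;
		size_t off;
		int i;

		if (addr == MAP_FAILED)
			return 1;
		for (off = 0; off < len; off += 4096)
			addr[off] = 1;	/* populate and dirty the PTEs */

		clock_gettime(CLOCK_MONOTONIC, &t0);
		for (i = 0; i < 10000; i++) {
			mprotect(addr, len, PROT_READ);
			mprotect(addr, len, PROT_READ | PROT_WRITE);
		}
		clock_gettime(CLOCK_MONOTONIC, &t1);
		printf("%.6f s\n", (t1.tv_sec - t0.tv_sec) +
				   (t1.tv_nsec - t0.tv_nsec) / 1e9);
		return 0;
	}

(As the thread later notes, mprotect() does not actually go through
ptep_set_wrprotect(), so this loop exercises change_pte_range() rather than
the fork-time path.)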
Yu-cheng Yu Aug. 31, 2018, 3:48 p.m. UTC | #23
On Fri, 2018-08-31 at 07:47 -0700, Dave Hansen wrote:
> On 08/31/2018 07:33 AM, Yu-cheng Yu wrote:
> > 
> > Please use the form:
> > 
> > 	pte_t new_pte, pte = READ_ONCE(*ptep);
> > 	do {
> > 		new_pte = /* ... */;
> > 	} while (!try_cmpxchg(ptep, &pte, new_pte));
> It's probably also worth doing some testing to see if you can detect
> the
> cost of the cmpxchg.  It's definitely more than the old code.
> 
> A loop that does mprotect(PROT_READ) followed by
> mprotect(PROT_READ|PROT_WRITE) should do it.

I created the test,

https://github.com/yyu168/cet-smoke-test/blob/quick/quick/mprotect_bench.c

then realized this won't work.

To trigger a race in ptep_set_wrprotect(), we need to fork from one of
three pthread siblings.

Or do we measure only how much this affects fork?
If there is no racing, the effect should be minimal.

Yu-cheng
Dave Hansen Aug. 31, 2018, 3:58 p.m. UTC | #24
On 08/31/2018 08:48 AM, Yu-cheng Yu wrote:
> To trigger a race in ptep_set_wrprotect(), we need to fork from one of
> three pthread siblings.
> 
> Or do we measure only how much this affects fork?
> If there is no racing, the effect should be minimal.

We don't need a race.

I think the cmpxchg will be slower, even without a race, than the code
that was there before.  The cmpxchg is a simple, straightforward
solution, but we're putting it in place of a plain memory write, which
is suboptimal.

But, before I nitpick the performance, I wanted to see if we could even
detect a delta.
Peter Zijlstra Aug. 31, 2018, 4:29 p.m. UTC | #25
On Fri, Aug 31, 2018 at 08:58:39AM -0700, Dave Hansen wrote:
> On 08/31/2018 08:48 AM, Yu-cheng Yu wrote:
> > To trigger a race in ptep_set_wrprotect(), we need to fork from one of
> > three pthread siblings.
> > 
> > Or do we measure only how much this affects fork?
> > If there is no racing, the effect should be minimal.
> 
> We don't need a race.
> 
> I think the cmpxchg will be slower, even without a race, than the code
> that was there before.  The cmpxchg is a simple, straightforward
> solution, but we're putting it in place of a plain memory write, which
> is suboptimal.

Not quite, the clear_bit() is LOCK prefixed.
Dave Hansen Aug. 31, 2018, 4:29 p.m. UTC | #26
On 08/30/2018 07:38 AM, Yu-cheng Yu wrote:
> +	 * Some processors can start a write, but ending up seeing
> +	 * a read-only PTE by the time they get to the Dirty bit.
> +	 * In this case, they will set the Dirty bit, leaving a
> +	 * read-only, Dirty PTE which looks like a Shadow Stack PTE.
> +	 *
> +	 * However, this behavior has been improved and will not occur
> +	 * on processors supporting Shadow Stacks.  Without this
> +	 * guarantee, a transition to a non-present PTE and flush the
> +	 * TLB would be needed.

Did we publicly document this behavior anywhere?  I can't seem to find it.
Andy Lutomirski Aug. 31, 2018, 5:46 p.m. UTC | #27
On Thu, Aug 30, 2018 at 11:55 AM, Dave Hansen
<dave.hansen@linux.intel.com> wrote:
> On 08/30/2018 10:34 AM, Andy Lutomirski wrote:
>>> But, to keep B's TLB from picking up the entry, I think we can just make
>>> it !Present for a moment.  No TLB can cache it, and I believe the same
>>> "don't set Dirty on a !Writable entry" logic also holds for !Present
>>> (modulo a weird erratum or two).
>> Can we get documentation?  Pretty please?
>
> The accessed bit description in the SDM looks pretty good to me today:
>
>> Whenever the processor uses a paging-structure entry as part of
>> linear-address translation, it sets the accessed flag in that entry
>> (if it is not already set).
> If it's !Present, it can't be used as part of a translation, so the flag
> can't be set.  I think that covers the thing I was unsure about.
>
> But, Dirty is a bit, er, muddier, but mostly because it only gets set on
> leaf entries:
>
>> Whenever there is a write to a linear address, the processor sets the
>> dirty flag (if it is not already set) in the paging- structure entry
>> that identifies the final physical address for the linear address
>> (either a PTE or a paging-structure entry in which the PS flag is
>> 1).
>
> That little hunk will definitely need to get updated with something like:
>
>         On processors enumerating support for CET, the processor will
>         only set the dirty flag on paging structure entries in which
>         the W flag is 1.

Can we get something much stronger, perhaps?  Like this:

On processors enumerating support for CET, the processor will write to
the accessed and/or dirty flags atomically, as if using the LOCK
CMPXCHG instruction.  The memory access, any cached entries in any
paging-structure caches, and the values in the paging-structure entry
before and after writing the A and/or D bits will all be consistent.

I'm sure this could be worded better.  The point is that the CPU
should, atomically, load the PTE, check if it allows the access, set A
and/or D appropriately, write the new value to the TLB, and use that
value for the access.  This is clearly a little bit slower than what
old CPUs could do when writing to an already-in-TLB writable non-dirty
entry, but new CPUs are going to have to atomically check the W bit.
(I assume that even old CPUs will *atomically* set the D bit as if by
LOCK BTS, but this is all very vague in the SDM IIRC.)
Dave Hansen Aug. 31, 2018, 5:52 p.m. UTC | #28
On 08/31/2018 10:46 AM, Andy Lutomirski wrote:
> On Thu, Aug 30, 2018 at 11:55 AM, Dave Hansen
>> That little hunk will definitely need to get updated with something like:
>>
>>         On processors enumerating support for CET, the processor will
>>         only set the dirty flag on paging structure entries in which
>>         the W flag is 1.
> 
> Can we get something much stronger, perhaps?  Like this:
> 
> On processors enumerating support for CET, the processor will write to
> the accessed and/or dirty flags atomically, as if using the LOCK
> CMPXCHG instruction.  The memory access, any cached entries in any
> paging-structure caches, and the values in the paging-structure entry
> before and after writing the A and/or D bits will all be consistent.

There's some talk of this already in: 8.1.2.1 Automatic Locking:

> When updating page-directory and page-table entries — When updating 
> page-directory and page-table entries, the processor uses locked 
> cycles to set the accessed and dirty flag in the page-directory and
> page-table entries.
As for the A/D consistency, I'll see if I can share that before it hits
the SDM for real and see if it's sufficient for everybody.
Yu-cheng Yu Sept. 14, 2018, 8:39 p.m. UTC | #29
On Fri, 2018-08-31 at 18:29 +0200, Peter Zijlstra wrote:
> On Fri, Aug 31, 2018 at 08:58:39AM -0700, Dave Hansen wrote:
> > 
> > On 08/31/2018 08:48 AM, Yu-cheng Yu wrote:
> > > 
> > > To trigger a race in ptep_set_wrprotect(), we need to fork from one of
> > > three pthread siblings.
> > > 
> > > Or do we measure only how much this affects fork?
> > > If there is no racing, the effect should be minimal.
> > We don't need a race.
> > 
> > I think the cmpxchg will be slower, even without a race, than the code
> > that was there before.  The cmpxchg is a simple, straightforward
> > solution, but we're putting it in place of a plain memory write, which
> > is suboptimal.
> Not quite, the clear_bit() is LOCK prefixed.

With the updated ptep_set_wrprotect() below, I did MADV_WILLNEED to a shadow
stack of 8 MB, then 10,000 fork()'s, but could not prove it is more or less
efficient than the other.  So can we say this is probably fine in terms of
efficiency?

Yu-cheng




--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1203,7 +1203,36 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
 static inline void ptep_set_wrprotect(struct mm_struct *mm,
 				      unsigned long addr, pte_t *ptep)
 {
+#ifdef CONFIG_X86_INTEL_SHADOW_STACK_USER
+	pte_t new_pte, pte = READ_ONCE(*ptep);
+
+	/*
+	 * Some processors can start a write, but end up
+	 * seeing a read-only PTE by the time they get
+	 * to the Dirty bit.  In this case, they will
+	 * set the Dirty bit, leaving a read-only, Dirty
+	 * PTE which looks like a Shadow Stack PTE.
+	 *
+	 * However, this behavior has been improved and
+	 * will not occur on processors supporting
+	 * Shadow Stacks.  Without this guarantee, a
+	 * transition to a non-present PTE and flush the
+	 * TLB would be needed.
+	 *
+	 * When changing a writable PTE to read-only and
+	 * if the PTE has _PAGE_DIRTY_HW set, we move
+	 * that bit to _PAGE_DIRTY_SW so that the PTE is
+	 * not a valid Shadow Stack PTE.
+	 */
+	do {
+		new_pte = pte_wrprotect(pte);
+		new_pte.pte |= (new_pte.pte & _PAGE_DIRTY_HW) >>
+				_PAGE_BIT_DIRTY_HW << _PAGE_BIT_DIRTY_SW;
+		new_pte.pte &= ~_PAGE_DIRTY_HW;
+	} while (!try_cmpxchg(ptep, &pte, new_pte));
+#else
 	clear_bit(_PAGE_BIT_RW, (unsigned long *)&ptep->pte);
+#endif
 }
Dave Hansen Sept. 14, 2018, 8:46 p.m. UTC | #30
On 09/14/2018 01:39 PM, Yu-cheng Yu wrote:
> With the updated ptep_set_wrprotect() below, I did MADV_WILLNEED to a shadow
> stack of 8 MB, then 10,000 fork()'s, but could not prove it is more or less
> efficient than the other.  So can we say this is probably fine in terms of
> efficiency?

Well, the first fork() will do all the hard work.  I don't think
subsequent fork()s will be affected.

Did you do something to ensure this code was being run?

I would guess that a loop like this:

	for (i = 0; i < 10000; i++) {
		mprotect(addr, len, PROT_READ);
		mprotect(addr, len, PROT_READ|PROT_WRITE);
	}

might show it better.
Yu-cheng Yu Sept. 14, 2018, 9:08 p.m. UTC | #31
On Fri, 2018-09-14 at 13:46 -0700, Dave Hansen wrote:
> On 09/14/2018 01:39 PM, Yu-cheng Yu wrote:
> > 
> > With the updated ptep_set_wrprotect() below, I did MADV_WILLNEED to a shadow
> > stack of 8 MB, then 10,000 fork()'s, but could not prove it is more or less
> > efficient than the other.  So can we say this is probably fine in terms of
> > efficiency?
> Well, the first fork() will do all the hard work.  I don't think
> subsequent fork()s will be affected.

Are you talking about a recent commit:

    1b2de5d0 mm/cow: don't bother write protecting already write-protected pages

With that, subsequent fork()s will not do all the hard work.
However, I have not done that for shadow stack PTEs (do we want to do that?).
I think the additional benefit for shadow stack is small?

> 
> Did you do something to ensure this code was being run?
> 
> I would guess that a loop like this:
> 
> 	for (i = 0; i < 10000; i++) {
> 		mprotect(addr, len, PROT_READ);
> 		mprotect(addr, len, PROT_READ|PROT_WRITE);
> 	}
> 
> might show it better.

Would mprotect() do copy_one_pte()?  Otherwise it will not go through
ptep_set_wrprotect()?
Dave Hansen Sept. 14, 2018, 9:33 p.m. UTC | #32
On 09/14/2018 02:08 PM, Yu-cheng Yu wrote:
> On Fri, 2018-09-14 at 13:46 -0700, Dave Hansen wrote:
>> On 09/14/2018 01:39 PM, Yu-cheng Yu wrote:
>>>
>>> With the updated ptep_set_wrprotect() below, I did MADV_WILLNEED to a shadow
>>> stack of 8 MB, then 10,000 fork()'s, but could not prove it is more or less
>>> efficient than the other.  So can we say this is probably fine in terms of
>>> efficiency?

BTW, I wasn't particularly concerned about shadow stacks.  Plain old
memory is affected by this change too.  Right?

>> Well, the first fork() will do all the hard work.  I don't think
>> subsequent fork()s will be affected.
> 
> Are you talking about a recent commit:
> 
>     1b2de5d0 mm/cow: don't bother write protecting already write-protected pages
> 
> With that, subsequent fork()s will not do all the hard work.
> However, I have not done that for shadow stack PTEs (do we want to do that?).
> I think the additional benefit for shadow stack is small?

You're right.  mprotect() doesn't use this path.

But, that reminds me, can you take a quick look at change_pte_range()
and double-check that it's not affected by this issue?
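
Since ptep_set_wrprotect() is exercised via fork()'s copy-on-write setup, a
fork()-based measurement along the lines Yu-cheng describes might look like
this (a hypothetical stand-alone sketch; the 8 MB size mirrors his test, and
the per-iteration re-dirtying follows Dave's point that already
write-protected pages make subsequent fork()s cheap):

	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <sys/wait.h>
	#include <time.h>
	#include <unistd.h>

	int main(void)
	{
		size_t len = 8 << 20;	/* 8 MB, as in Yu-cheng's test */
		char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		struct timespec t0, t1;
		size_t off;
		int i;

		if (buf == MAP_FAILED)
			return 1;
		memset(buf, 1, len);	/* populate the PTEs */

		clock_gettime(CLOCK_MONOTONIC, &t0);
		for (i = 0; i < 10000; i++) {
			pid_t pid;

			/* re-dirty each page so every fork() has to
			 * write-protect the PTEs again */
			for (off = 0; off < len; off += 4096)
				buf[off] = (char)i;

			pid = fork();
			if (pid == 0)
				_exit(0);	/* child exits immediately */
			waitpid(pid, NULL, 0);
		}
		clock_gettime(CLOCK_MONOTONIC, &t1);
		printf("%.3f s for 10000 fork()s\n",
		       (t1.tv_sec - t0.tv_sec) +
		       (t1.tv_nsec - t0.tv_nsec) / 1e9);
		return 0;
	}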

Patch

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 4d50de77ea96..556ef258eeff 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1203,7 +1203,28 @@  static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
 static inline void ptep_set_wrprotect(struct mm_struct *mm,
 				      unsigned long addr, pte_t *ptep)
 {
+	pte_t pte;
+
 	clear_bit(_PAGE_BIT_RW, (unsigned long *)&ptep->pte);
+	pte = *ptep;
+
+	/*
+	 * Some processors can start a write, but ending up seeing
+	 * a read-only PTE by the time they get to the Dirty bit.
+	 * In this case, they will set the Dirty bit, leaving a
+	 * read-only, Dirty PTE which looks like a Shadow Stack PTE.
+	 *
+	 * However, this behavior has been improved and will not occur
+	 * on processors supporting Shadow Stacks.  Without this
+	 * guarantee, a transition to a non-present PTE and flush the
+	 * TLB would be needed.
+	 *
+	 * When change a writable PTE to read-only and if the PTE has
+	 * _PAGE_DIRTY_HW set, we move that bit to _PAGE_DIRTY_SW so
+	 * that the PTE is not a valid Shadow Stack PTE.
+	 */
+	pte = pte_move_flags(pte, _PAGE_DIRTY_HW, _PAGE_DIRTY_SW);
+	set_pte_at(mm, addr, ptep, pte);
 }
 
 #define flush_tlb_fix_spurious_fault(vma, address) do { } while (0)
@@ -1266,7 +1287,28 @@  static inline pud_t pudp_huge_get_and_clear(struct mm_struct *mm,
 static inline void pmdp_set_wrprotect(struct mm_struct *mm,
 				      unsigned long addr, pmd_t *pmdp)
 {
+	pmd_t pmd;
+
 	clear_bit(_PAGE_BIT_RW, (unsigned long *)pmdp);
+	pmd = *pmdp;
+
+	/*
+	 * Some processors can start a write, but ending up seeing
+	 * a read-only PTE by the time they get to the Dirty bit.
+	 * In this case, they will set the Dirty bit, leaving a
+	 * read-only, Dirty PTE which looks like a Shadow Stack PTE.
+	 *
+	 * However, this behavior has been improved and will not occur
+	 * on processors supporting Shadow Stacks.  Without this
+	 * guarantee, a transition to a non-present PTE and flush the
+	 * TLB would be needed.
+	 *
+	 * When change a writable PTE to read-only and if the PTE has
+	 * _PAGE_DIRTY_HW set, we move that bit to _PAGE_DIRTY_SW so
+	 * that the PTE is not a valid Shadow Stack PTE.
+	 */
+	pmd = pmd_move_flags(pmd, _PAGE_DIRTY_HW, _PAGE_DIRTY_SW);
+	set_pmd_at(mm, addr, pmdp, pmd);
 }
 
 #define pud_write pud_write