diff mbox

[RFC,v2,18/27] x86/cet/shstk: Introduce WRUSS instruction

Message ID 20180710222639.8241-19-yu-cheng.yu@intel.com (mailing list archive)
State New, archived
Headers show

Commit Message

Yu, Yu-cheng July 10, 2018, 10:26 p.m. UTC
WRUSS is a new kernel-mode instruction but writes directly
to user shadow stack memory.  This is used to construct
a return address on the shadow stack for the signal
handler.

This instruction can fault if the user shadow stack is
invalid shadow stack memory.  In that case, the kernel does
fixup.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
---
 arch/x86/include/asm/special_insns.h          | 45 +++++++++++++++++++
 arch/x86/lib/x86-opcode-map.txt               |  2 +-
 arch/x86/mm/fault.c                           | 13 +++++-
 tools/objtool/arch/x86/lib/x86-opcode-map.txt |  2 +-
 4 files changed, 59 insertions(+), 3 deletions(-)

Comments

Dave Hansen July 10, 2018, 11:48 p.m. UTC | #1
> +/*
> + * WRUSS is a kernel instrcution and but writes to user
> + * shadow stack memory.  When a fault occurs, both
> + * X86_PF_USER and X86_PF_SHSTK are set.
> + */
> +static int is_wruss(struct pt_regs *regs, unsigned long error_code)
> +{
> +	return (((error_code & (X86_PF_USER | X86_PF_SHSTK)) ==
> +		(X86_PF_USER | X86_PF_SHSTK)) && !user_mode(regs));
> +}

I thought X86_PF_USER was set based on the mode in which the fault
occurred.  Does this mean that the architecture of this bit is different
now?

That seems like something we need to call out if so.  It also means we
need to update the SDM because some of the text is wrong.

>  static void
>  show_fault_oops(struct pt_regs *regs, unsigned long error_code,
>  		unsigned long address)
> @@ -848,7 +859,7 @@ __bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
>  	struct task_struct *tsk = current;
>  
>  	/* User mode accesses just cause a SIGSEGV */
> -	if (error_code & X86_PF_USER) {
> +	if ((error_code & X86_PF_USER) && !is_wruss(regs, error_code)) {
>  		/*
>  		 * It's possible to have interrupts off here:
>  		 */

This needs commenting about why is_wruss() is special.
Peter Zijlstra July 11, 2018, 9:44 a.m. UTC | #2
On Tue, Jul 10, 2018 at 03:26:30PM -0700, Yu-cheng Yu wrote:
> WRUSS is a new kernel-mode instruction but writes directly
> to user shadow stack memory.  This is used to construct
> a return address on the shadow stack for the signal
> handler.
> 
> This instruction can fault if the user shadow stack is
> invalid shadow stack memory.  In that case, the kernel does
> fixup.
> 

> +static inline int write_user_shstk_64(unsigned long addr, unsigned long val)
> +{
> +	int err = 0;
> +
> +	asm volatile("1: wrussq %[val], (%[addr])\n"
> +		     "xor %[err], %[err]\n"

this XOR is superfluous, you already cleared @err above.

> +		     "2:\n"
> +		     ".section .fixup,\"ax\"\n"
> +		     "3: mov $-1, %[err]; jmp 2b\n"
> +		     ".previous\n"
> +		     _ASM_EXTABLE(1b, 3b)
> +		     : [err] "=a" (err)
> +		     : [val] "S" (val), [addr] "D" (addr));
> +
> +	return err;
> +}
> +#endif /* CONFIG_X86_INTEL_CET */
> +
>  #define nop() asm volatile ("nop")

What happened to:

  https://lkml.kernel.org/r/1528729376.4526.0.camel@2b52.sc.intel.com
Peter Zijlstra July 11, 2018, 9:45 a.m. UTC | #3
On Tue, Jul 10, 2018 at 03:26:30PM -0700, Yu-cheng Yu wrote:
> diff --git a/arch/x86/lib/x86-opcode-map.txt b/arch/x86/lib/x86-opcode-map.txt
> index e0b85930dd77..72bb7c48a7df 100644
> --- a/arch/x86/lib/x86-opcode-map.txt
> +++ b/arch/x86/lib/x86-opcode-map.txt
> @@ -789,7 +789,7 @@ f0: MOVBE Gy,My | MOVBE Gw,Mw (66) | CRC32 Gd,Eb (F2) | CRC32 Gd,Eb (66&F2)
>  f1: MOVBE My,Gy | MOVBE Mw,Gw (66) | CRC32 Gd,Ey (F2) | CRC32 Gd,Ew (66&F2)
>  f2: ANDN Gy,By,Ey (v)
>  f3: Grp17 (1A)
> -f5: BZHI Gy,Ey,By (v) | PEXT Gy,By,Ey (F3),(v) | PDEP Gy,By,Ey (F2),(v)
> +f5: BZHI Gy,Ey,By (v) | PEXT Gy,By,Ey (F3),(v) | PDEP Gy,By,Ey (F2),(v) | WRUSS Pq,Qq (66),REX.W
>  f6: ADCX Gy,Ey (66) | ADOX Gy,Ey (F3) | MULX By,Gy,rDX,Ey (F2),(v)
>  f7: BEXTR Gy,Ey,By (v) | SHLX Gy,Ey,By (66),(v) | SARX Gy,Ey,By (F3),(v) | SHRX Gy,Ey,By (F2),(v)
>  EndTable

Where are all the other instructions? ISTR that documentation patch
listing a whole bunch of new instructions, not just wuss.
Yu, Yu-cheng July 11, 2018, 2:58 p.m. UTC | #4
On Wed, 2018-07-11 at 11:45 +0200, Peter Zijlstra wrote:
> On Tue, Jul 10, 2018 at 03:26:30PM -0700, Yu-cheng Yu wrote:
> > 
> > diff --git a/arch/x86/lib/x86-opcode-map.txt b/arch/x86/lib/x86-
> > opcode-map.txt
> > index e0b85930dd77..72bb7c48a7df 100644
> > --- a/arch/x86/lib/x86-opcode-map.txt
> > +++ b/arch/x86/lib/x86-opcode-map.txt
> > @@ -789,7 +789,7 @@ f0: MOVBE Gy,My | MOVBE Gw,Mw (66) | CRC32
> > Gd,Eb (F2) | CRC32 Gd,Eb (66&F2)
> >  f1: MOVBE My,Gy | MOVBE Mw,Gw (66) | CRC32 Gd,Ey (F2) | CRC32
> > Gd,Ew (66&F2)
> >  f2: ANDN Gy,By,Ey (v)
> >  f3: Grp17 (1A)
> > -f5: BZHI Gy,Ey,By (v) | PEXT Gy,By,Ey (F3),(v) | PDEP Gy,By,Ey
> > (F2),(v)
> > +f5: BZHI Gy,Ey,By (v) | PEXT Gy,By,Ey (F3),(v) | PDEP Gy,By,Ey
> > (F2),(v) | WRUSS Pq,Qq (66),REX.W
> >  f6: ADCX Gy,Ey (66) | ADOX Gy,Ey (F3) | MULX By,Gy,rDX,Ey (F2),(v)
> >  f7: BEXTR Gy,Ey,By (v) | SHLX Gy,Ey,By (66),(v) | SARX Gy,Ey,By
> > (F3),(v) | SHRX Gy,Ey,By (F2),(v)
> >  EndTable
> Where are all the other instructions? ISTR that documentation patch
> listing a whole bunch of new instructions, not just wuss.

Currently we only use WRUSS in the kernel code.  Do we want to add all
instructions here?

Yu-cheng
Yu, Yu-cheng July 11, 2018, 3:06 p.m. UTC | #5
On Wed, 2018-07-11 at 11:44 +0200, Peter Zijlstra wrote:
> On Tue, Jul 10, 2018 at 03:26:30PM -0700, Yu-cheng Yu wrote:
> > 
> > WRUSS is a new kernel-mode instruction but writes directly
> > to user shadow stack memory.  This is used to construct
> > a return address on the shadow stack for the signal
> > handler.
> > 
> > This instruction can fault if the user shadow stack is
> > invalid shadow stack memory.  In that case, the kernel does
> > fixup.
> > 
> > 
> > +static inline int write_user_shstk_64(unsigned long addr, unsigned
> > long val)
> > +{
> > +	int err = 0;
> > +
> > +	asm volatile("1: wrussq %[val], (%[addr])\n"
> > +		     "xor %[err], %[err]\n"
> this XOR is superfluous, you already cleared @err above.

I will fix it.

> 
> > 
> > +		     "2:\n"
> > +		     ".section .fixup,\"ax\"\n"
> > +		     "3: mov $-1, %[err]; jmp 2b\n"
> > +		     ".previous\n"
> > +		     _ASM_EXTABLE(1b, 3b)
> > +		     : [err] "=a" (err)
> > +		     : [val] "S" (val), [addr] "D" (addr));
> > +
> > +	return err;
> > +}
> > +#endif /* CONFIG_X86_INTEL_CET */
> > +
> >  #define nop() asm volatile ("nop")
> What happened to:
> 
>   https://lkml.kernel.org/r/1528729376.4526.0.camel@2b52.sc.intel.com

Yes, I put that in once and realized we only need to skip the
instruction and return err.  Do you think we still need a handler for
that?

Yu-cheng
Peter Zijlstra July 11, 2018, 3:27 p.m. UTC | #6
On Wed, Jul 11, 2018 at 07:58:09AM -0700, Yu-cheng Yu wrote:
> On Wed, 2018-07-11 at 11:45 +0200, Peter Zijlstra wrote:
> > On Tue, Jul 10, 2018 at 03:26:30PM -0700, Yu-cheng Yu wrote:
> > > 
> > > diff --git a/arch/x86/lib/x86-opcode-map.txt b/arch/x86/lib/x86-
> > > opcode-map.txt
> > > index e0b85930dd77..72bb7c48a7df 100644
> > > --- a/arch/x86/lib/x86-opcode-map.txt
> > > +++ b/arch/x86/lib/x86-opcode-map.txt
> > > @@ -789,7 +789,7 @@ f0: MOVBE Gy,My | MOVBE Gw,Mw (66) | CRC32
> > > Gd,Eb (F2) | CRC32 Gd,Eb (66&F2)
> > >  f1: MOVBE My,Gy | MOVBE Mw,Gw (66) | CRC32 Gd,Ey (F2) | CRC32
> > > Gd,Ew (66&F2)
> > >  f2: ANDN Gy,By,Ey (v)
> > >  f3: Grp17 (1A)
> > > -f5: BZHI Gy,Ey,By (v) | PEXT Gy,By,Ey (F3),(v) | PDEP Gy,By,Ey
> > > (F2),(v)
> > > +f5: BZHI Gy,Ey,By (v) | PEXT Gy,By,Ey (F3),(v) | PDEP Gy,By,Ey
> > > (F2),(v) | WRUSS Pq,Qq (66),REX.W
> > >  f6: ADCX Gy,Ey (66) | ADOX Gy,Ey (F3) | MULX By,Gy,rDX,Ey (F2),(v)
> > >  f7: BEXTR Gy,Ey,By (v) | SHLX Gy,Ey,By (66),(v) | SARX Gy,Ey,By
> > > (F3),(v) | SHRX Gy,Ey,By (F2),(v)
> > >  EndTable
> > Where are all the other instructions? ISTR that documentation patch
> > listing a whole bunch of new instructions, not just wuss.
> 
> Currently we only use WRUSS in the kernel code.  Do we want to add all
> instructions here?

Yes, since we also use the in-kernel decoder to decode random userspace
code.
Peter Zijlstra July 11, 2018, 3:30 p.m. UTC | #7
On Wed, Jul 11, 2018 at 08:06:55AM -0700, Yu-cheng Yu wrote:
> On Wed, 2018-07-11 at 11:44 +0200, Peter Zijlstra wrote:

> > What happened to:
> > 
> >   https://lkml.kernel.org/r/1528729376.4526.0.camel@2b52.sc.intel.com
> 
> Yes, I put that in once and realized we only need to skip the
> instruction and return err.  Do you think we still need a handler for
> that?

I find that other form more readable, but then there's Nadav doing asm
macros to shrink inline asm thingies so maybe he has another suggestion.
Yu, Yu-cheng July 11, 2018, 3:41 p.m. UTC | #8
On Wed, 2018-07-11 at 17:27 +0200, Peter Zijlstra wrote:
> On Wed, Jul 11, 2018 at 07:58:09AM -0700, Yu-cheng Yu wrote:
> > 
> > On Wed, 2018-07-11 at 11:45 +0200, Peter Zijlstra wrote:
> > > 
> > > On Tue, Jul 10, 2018 at 03:26:30PM -0700, Yu-cheng Yu wrote:
> > > > 
> > > > 
> > > > diff --git a/arch/x86/lib/x86-opcode-map.txt
> > > > b/arch/x86/lib/x86-
> > > > opcode-map.txt
> > > > index e0b85930dd77..72bb7c48a7df 100644
> > > > --- a/arch/x86/lib/x86-opcode-map.txt
> > > > +++ b/arch/x86/lib/x86-opcode-map.txt
> > > > @@ -789,7 +789,7 @@ f0: MOVBE Gy,My | MOVBE Gw,Mw (66) | CRC32
> > > > Gd,Eb (F2) | CRC32 Gd,Eb (66&F2)
> > > >  f1: MOVBE My,Gy | MOVBE Mw,Gw (66) | CRC32 Gd,Ey (F2) | CRC32
> > > > Gd,Ew (66&F2)
> > > >  f2: ANDN Gy,By,Ey (v)
> > > >  f3: Grp17 (1A)
> > > > -f5: BZHI Gy,Ey,By (v) | PEXT Gy,By,Ey (F3),(v) | PDEP Gy,By,Ey
> > > > (F2),(v)
> > > > +f5: BZHI Gy,Ey,By (v) | PEXT Gy,By,Ey (F3),(v) | PDEP Gy,By,Ey
> > > > (F2),(v) | WRUSS Pq,Qq (66),REX.W
> > > >  f6: ADCX Gy,Ey (66) | ADOX Gy,Ey (F3) | MULX By,Gy,rDX,Ey
> > > > (F2),(v)
> > > >  f7: BEXTR Gy,Ey,By (v) | SHLX Gy,Ey,By (66),(v) | SARX
> > > > Gy,Ey,By
> > > > (F3),(v) | SHRX Gy,Ey,By (F2),(v)
> > > >  EndTable
> > > Where are all the other instructions? ISTR that documentation
> > > patch
> > > listing a whole bunch of new instructions, not just wuss.
> > Currently we only use WRUSS in the kernel code.  Do we want to add
> > all
> > instructions here?
> Yes, since we also use the in-kernel decoder to decode random
> userspace
> code.

I will add other instructions.

Yu-cheng
Yu, Yu-cheng July 12, 2018, 10:59 p.m. UTC | #9
On Tue, 2018-07-10 at 16:48 -0700, Dave Hansen wrote:
> > 
> > +/*
> > + * WRUSS is a kernel instrcution and but writes to user
> > + * shadow stack memory.  When a fault occurs, both
> > + * X86_PF_USER and X86_PF_SHSTK are set.
> > + */
> > +static int is_wruss(struct pt_regs *regs, unsigned long error_code)
> > +{
> > +	return (((error_code & (X86_PF_USER | X86_PF_SHSTK)) ==
> > +		(X86_PF_USER | X86_PF_SHSTK)) && !user_mode(regs));
> > +}
> I thought X86_PF_USER was set based on the mode in which the fault
> occurred.  Does this mean that the architecture of this bit is different
> now?

Yes.

> That seems like something we need to call out if so.  It also means we
> need to update the SDM because some of the text is wrong.

It needs to mention the WRUSS case.

> 
> > 
> >  static void
> >  show_fault_oops(struct pt_regs *regs, unsigned long error_code,
> >  		unsigned long address)
> > @@ -848,7 +859,7 @@ __bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
> >  	struct task_struct *tsk = current;
> >  
> >  	/* User mode accesses just cause a SIGSEGV */
> > -	if (error_code & X86_PF_USER) {
> > +	if ((error_code & X86_PF_USER) && !is_wruss(regs, error_code)) {
> >  		/*
> >  		 * It's possible to have interrupts off here:
> >  		 */
> This needs commenting about why is_wruss() is special.

Ok.
Dave Hansen July 12, 2018, 11:49 p.m. UTC | #10
On 07/12/2018 03:59 PM, Yu-cheng Yu wrote:
> On Tue, 2018-07-10 at 16:48 -0700, Dave Hansen wrote:
>>>
>>> +/*
>>> + * WRUSS is a kernel instrcution and but writes to user
>>> + * shadow stack memory.  When a fault occurs, both
>>> + * X86_PF_USER and X86_PF_SHSTK are set.
>>> + */
>>> +static int is_wruss(struct pt_regs *regs, unsigned long error_code)
>>> +{
>>> +	return (((error_code & (X86_PF_USER | X86_PF_SHSTK)) ==
>>> +		(X86_PF_USER | X86_PF_SHSTK)) && !user_mode(regs));
>>> +}
>> I thought X86_PF_USER was set based on the mode in which the fault
>> occurred.  Does this mean that the architecture of this bit is different
>> now?
> 
> Yes.
> 
>> That seems like something we need to call out if so.  It also means we
>> need to update the SDM because some of the text is wrong.
> 
> It needs to mention the WRUSS case.

Ugh.  The documentation for this is not pretty.  But, I guess this is
not fundamentally different from access to U=1 pages when SMAP is in
place and we've set EFLAGS.AC=1.

But, sheesh, we need to call this out really explicitly and make it
crystal clear what is going on.

We need to go through the page fault code very carefully and audit all
the X86_PF_USER spots and make sure there's no impact to those.  SMAP
should mean that we already dealt with these, but we still need an audit.

The docs[1] are clear as mud on this though: "Page entry has user
privilege (U=1) for a supervisor-level shadow-stack-load,
shadow-stack-store-intent or shadow-stack-store access except those that
originate from the WRUSS instruction."

Or, in short:

	"Page has U=1 ... except those that originate from the WRUSS 	
	instruction."

Which is backwards from what you said.  I really wish those docs had
reused the established SDM language instead of reinventing their own way
of saying things.

1.
https://software.intel.com/sites/default/files/managed/4d/2a/control-flow-enforcement-technology-preview.pdf
Dave Hansen July 13, 2018, 1:50 a.m. UTC | #11
On 07/12/2018 04:49 PM, Dave Hansen wrote:
>>> That seems like something we need to call out if so.  It also means we
>>> need to update the SDM because some of the text is wrong.
>> It needs to mention the WRUSS case.
> Ugh.  The documentation for this is not pretty.  But, I guess this is
> not fundamentally different from access to U=1 pages when SMAP is in
> place and we've set EFLAGS.AC=1.

I was wrong and misread the docs.  We do not get X86_PF_USER set when
EFLAGS.AC=1.

But, we *do* get X86_PF_USER (otherwise defined to be set when in ring3)
when running in ring0 with the WRUSS instruction and some other various
shadow-stack-access-related things.  I'm sure folks had a good reason
for this architecture, but it is a pretty fundamentally *new*
architecture that we have to account for.

This new architecture is also not spelled out or accounted for in the
SDM as of yet.  It's only called out here as far as I know:
https://software.intel.com/sites/default/files/managed/4d/2a/control-flow-enforcement-technology-preview.pdf

Which reminds me:  Yu-cheng, do you have a link to the docs anywhere in
your set?  If not, you really should.
Andy Lutomirski July 13, 2018, 2:21 a.m. UTC | #12
> On Jul 12, 2018, at 6:50 PM, Dave Hansen <dave.hansen@intel.com> wrote:
> 
> On 07/12/2018 04:49 PM, Dave Hansen wrote:
>>>> That seems like something we need to call out if so.  It also means we
>>>> need to update the SDM because some of the text is wrong.
>>> It needs to mention the WRUSS case.
>> Ugh.  The documentation for this is not pretty.  But, I guess this is
>> not fundamentally different from access to U=1 pages when SMAP is in
>> place and we've set EFLAGS.AC=1.
> 
> I was wrong and misread the docs.  We do not get X86_PF_USER set when
> EFLAGS.AC=1.
> 
> But, we *do* get X86_PF_USER (otherwise defined to be set when in ring3)
> when running in ring0 with the WRUSS instruction and some other various
> shadow-stack-access-related things.  I'm sure folks had a good reason
> for this architecture, but it is a pretty fundamentally *new*
> architecture that we have to account for.

I think it makes (some) sense. The USER bit is set for a page fault that was done with user privilege. So a descriptor table fault at CPL 3 has USER clear (regardless of the cause of the fault) and WRUSS has USER set.

> 
> This new architecture is also not spelled out or accounted for in the
> SDM as of yet.  It's only called out here as far as I know:
> https://software.intel.com/sites/default/files/managed/4d/2a/control-flow-enforcement-technology-preview.pdf
> 
> Which reminds me:  Yu-cheng, do you have a link to the docs anywhere in
> your set?  If not, you really should.

I am tempted to suggest that the whole series not be merged until there are actual docs. It’s not a fantastic precedent.
Dave Hansen July 13, 2018, 4:16 a.m. UTC | #13
On 07/12/2018 07:21 PM, Andy Lutomirski wrote:
> I am tempted to suggest that the whole series not be merged until
> there are actual docs. It’s not a fantastic precedent.

Do you mean Documentation or manpages, or are you talking about hardware
documentation?
https://software.intel.com/sites/default/files/managed/4d/2a/control-flow-enforcement-technology-preview.pdf
Dave Hansen July 13, 2018, 4:18 a.m. UTC | #14
On 07/12/2018 09:16 PM, Dave Hansen wrote:
> On 07/12/2018 07:21 PM, Andy Lutomirski wrote:
>> I am tempted to suggest that the whole series not be merged until
>> there are actual docs. It’s not a fantastic precedent.
> 
> Do you mean Documentation or manpages, or are you talking about hardware
> documentation?
> https://software.intel.com/sites/default/files/managed/4d/2a/control-flow-enforcement-technology-preview.pdf

Hit send too soon...

We do need manpages a well.  If I had to do it for protection keys,
everyone else has to suffer too. :)

Yu-cheng, I really do think selftests are a necessity before this gets
merged.
Andy Lutomirski July 13, 2018, 5:55 a.m. UTC | #15
> On Jul 12, 2018, at 9:16 PM, Dave Hansen <dave.hansen@intel.com> wrote:
> 
>> On 07/12/2018 07:21 PM, Andy Lutomirski wrote:
>> I am tempted to suggest that the whole series not be merged until
>> there are actual docs. It’s not a fantastic precedent.
> 
> Do you mean Documentation or manpages, or are you talking about hardware
> documentation?
> https://software.intel.com/sites/default/files/managed/4d/2a/control-flow-enforcement-technology-preview.pdf

I mean hardware docs. The “preview” is a little bit dubious IMO.
Dave Hansen July 13, 2018, 12:12 p.m. UTC | #16
On 07/10/2018 03:26 PM, Yu-cheng Yu wrote:
> +static int is_wruss(struct pt_regs *regs, unsigned long error_code)
> +{
> +	return (((error_code & (X86_PF_USER | X86_PF_SHSTK)) ==
> +		(X86_PF_USER | X86_PF_SHSTK)) && !user_mode(regs));
> +}
> +
>  static void
>  show_fault_oops(struct pt_regs *regs, unsigned long error_code,
>  		unsigned long address)
> @@ -848,7 +859,7 @@ __bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
>  	struct task_struct *tsk = current;
>  
>  	/* User mode accesses just cause a SIGSEGV */
> -	if (error_code & X86_PF_USER) {
> +	if ((error_code & X86_PF_USER) && !is_wruss(regs, error_code)) {
>  		/*
>  		 * It's possible to have interrupts off here:
>  		 */

Please don't do it this way.

We have two styles of page fault:
1. User page faults: find a VMA, try to handle (allocate memory et al.),
   kill process if we can't handle.
2. Kernel page faults: search for a *discrete* set of conditions that
   can be handled, including faults in instructions marked in exception
   tables.

X86_PF_USER *means*: do user page fault handling.  In the places where
the hardware doesn't set it, but we still want user page fault handling,
we manually set it, like this where we "downgrade" an implicit
supervisor access to a user access:

        if (user_mode(regs)) {
                local_irq_enable();
                error_code |= X86_PF_USER;
                flags |= FAULT_FLAG_USER;

So, just please *clear* X86_PF_USER if !user_mode(regs) and X86_PF_SS.
We do not want user page fault handling, thus we should not keep the bit
set.
Yu, Yu-cheng July 13, 2018, 5:37 p.m. UTC | #17
On Fri, 2018-07-13 at 05:12 -0700, Dave Hansen wrote:
> On 07/10/2018 03:26 PM, Yu-cheng Yu wrote:
> > 
> > +static int is_wruss(struct pt_regs *regs, unsigned long error_code)
> > +{
> > +	return (((error_code & (X86_PF_USER | X86_PF_SHSTK)) ==
> > +		(X86_PF_USER | X86_PF_SHSTK)) && !user_mode(regs));
> > +}
> > +
> >  static void
> >  show_fault_oops(struct pt_regs *regs, unsigned long error_code,
> >  		unsigned long address)
> > @@ -848,7 +859,7 @@ __bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
> >  	struct task_struct *tsk = current;
> >  
> >  	/* User mode accesses just cause a SIGSEGV */
> > -	if (error_code & X86_PF_USER) {
> > +	if ((error_code & X86_PF_USER) && !is_wruss(regs, error_code)) {
> >  		/*
> >  		 * It's possible to have interrupts off here:
> >  		 */
> Please don't do it this way.
> 
> We have two styles of page fault:
> 1. User page faults: find a VMA, try to handle (allocate memory et al.),
>    kill process if we can't handle.
> 2. Kernel page faults: search for a *discrete* set of conditions that
>    can be handled, including faults in instructions marked in exception
>    tables.
> 
> X86_PF_USER *means*: do user page fault handling.  In the places where
> the hardware doesn't set it, but we still want user page fault handling,
> we manually set it, like this where we "downgrade" an implicit
> supervisor access to a user access:
> 
>         if (user_mode(regs)) {
>                 local_irq_enable();
>                 error_code |= X86_PF_USER;
>                 flags |= FAULT_FLAG_USER;
> 
> So, just please *clear* X86_PF_USER if !user_mode(regs) and X86_PF_SS.
> We do not want user page fault handling, thus we should not keep the bit
> set.

Agree.  I will change that.

Yu-cheng
Yu, Yu-cheng July 13, 2018, 5:39 p.m. UTC | #18
On Thu, 2018-07-12 at 21:18 -0700, Dave Hansen wrote:
> On 07/12/2018 09:16 PM, Dave Hansen wrote:
> > 
> > On 07/12/2018 07:21 PM, Andy Lutomirski wrote:
> > > 
> > > I am tempted to suggest that the whole series not be merged until
> > > there are actual docs. It’s not a fantastic precedent.
> > Do you mean Documentation or manpages, or are you talking about hardware
> > documentation?
> > https://software.intel.com/sites/default/files/managed/4d/2a/control-flow-enforcement-technology-preview.pdf
> Hit send too soon...
> 
> We do need manpages a well.  If I had to do it for protection keys,
> everyone else has to suffer too. :)
> 
> Yu-cheng, I really do think selftests are a necessity before this gets
> merged.
> 

We already have some.  I will put those in patches.

Yu-cheng
diff mbox

Patch

diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h
index 317fc59b512c..c69d8d6b457f 100644
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -237,6 +237,51 @@  static inline void clwb(volatile void *__p)
 		: [pax] "a" (p));
 }
 
+#ifdef CONFIG_X86_INTEL_CET
+
+#if defined(CONFIG_IA32_EMULATION) || defined(CONFIG_X86_X32)
+static inline int write_user_shstk_32(unsigned long addr, unsigned int val)
+{
+	int err;
+
+	asm volatile("1: wrussd %[val], (%[addr])\n"
+		     "xor %[err], %[err]\n"
+		     "2:\n"
+		     ".section .fixup,\"ax\"\n"
+		     "3: mov $-1, %[err]; jmp 2b\n"
+		     ".previous\n"
+		     _ASM_EXTABLE(1b, 3b)
+		     : [err] "=a" (err)
+		     : [val] "S" (val), [addr] "D" (addr));
+
+	return err;
+}
+#else
+static inline int write_user_shstk_32(unsigned long addr, unsigned int val)
+{
+	BUG();
+	return 0;
+}
+#endif
+
+static inline int write_user_shstk_64(unsigned long addr, unsigned long val)
+{
+	int err = 0;
+
+	asm volatile("1: wrussq %[val], (%[addr])\n"
+		     "xor %[err], %[err]\n"
+		     "2:\n"
+		     ".section .fixup,\"ax\"\n"
+		     "3: mov $-1, %[err]; jmp 2b\n"
+		     ".previous\n"
+		     _ASM_EXTABLE(1b, 3b)
+		     : [err] "=a" (err)
+		     : [val] "S" (val), [addr] "D" (addr));
+
+	return err;
+}
+#endif /* CONFIG_X86_INTEL_CET */
+
 #define nop() asm volatile ("nop")
 
 
diff --git a/arch/x86/lib/x86-opcode-map.txt b/arch/x86/lib/x86-opcode-map.txt
index e0b85930dd77..72bb7c48a7df 100644
--- a/arch/x86/lib/x86-opcode-map.txt
+++ b/arch/x86/lib/x86-opcode-map.txt
@@ -789,7 +789,7 @@  f0: MOVBE Gy,My | MOVBE Gw,Mw (66) | CRC32 Gd,Eb (F2) | CRC32 Gd,Eb (66&F2)
 f1: MOVBE My,Gy | MOVBE Mw,Gw (66) | CRC32 Gd,Ey (F2) | CRC32 Gd,Ew (66&F2)
 f2: ANDN Gy,By,Ey (v)
 f3: Grp17 (1A)
-f5: BZHI Gy,Ey,By (v) | PEXT Gy,By,Ey (F3),(v) | PDEP Gy,By,Ey (F2),(v)
+f5: BZHI Gy,Ey,By (v) | PEXT Gy,By,Ey (F3),(v) | PDEP Gy,By,Ey (F2),(v) | WRUSS Pq,Qq (66),REX.W
 f6: ADCX Gy,Ey (66) | ADOX Gy,Ey (F3) | MULX By,Gy,rDX,Ey (F2),(v)
 f7: BEXTR Gy,Ey,By (v) | SHLX Gy,Ey,By (66),(v) | SARX Gy,Ey,By (F3),(v) | SHRX Gy,Ey,By (F2),(v)
 EndTable
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index fcd5739151f9..92f178b8b598 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -641,6 +641,17 @@  static int is_f00f_bug(struct pt_regs *regs, unsigned long address)
 	return 0;
 }
 
+/*
+ * WRUSS is a kernel instrcution and but writes to user
+ * shadow stack memory.  When a fault occurs, both
+ * X86_PF_USER and X86_PF_SHSTK are set.
+ */
+static int is_wruss(struct pt_regs *regs, unsigned long error_code)
+{
+	return (((error_code & (X86_PF_USER | X86_PF_SHSTK)) ==
+		(X86_PF_USER | X86_PF_SHSTK)) && !user_mode(regs));
+}
+
 static void
 show_fault_oops(struct pt_regs *regs, unsigned long error_code,
 		unsigned long address)
@@ -848,7 +859,7 @@  __bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
 	struct task_struct *tsk = current;
 
 	/* User mode accesses just cause a SIGSEGV */
-	if (error_code & X86_PF_USER) {
+	if ((error_code & X86_PF_USER) && !is_wruss(regs, error_code)) {
 		/*
 		 * It's possible to have interrupts off here:
 		 */
diff --git a/tools/objtool/arch/x86/lib/x86-opcode-map.txt b/tools/objtool/arch/x86/lib/x86-opcode-map.txt
index e0b85930dd77..72bb7c48a7df 100644
--- a/tools/objtool/arch/x86/lib/x86-opcode-map.txt
+++ b/tools/objtool/arch/x86/lib/x86-opcode-map.txt
@@ -789,7 +789,7 @@  f0: MOVBE Gy,My | MOVBE Gw,Mw (66) | CRC32 Gd,Eb (F2) | CRC32 Gd,Eb (66&F2)
 f1: MOVBE My,Gy | MOVBE Mw,Gw (66) | CRC32 Gd,Ey (F2) | CRC32 Gd,Ew (66&F2)
 f2: ANDN Gy,By,Ey (v)
 f3: Grp17 (1A)
-f5: BZHI Gy,Ey,By (v) | PEXT Gy,By,Ey (F3),(v) | PDEP Gy,By,Ey (F2),(v)
+f5: BZHI Gy,Ey,By (v) | PEXT Gy,By,Ey (F3),(v) | PDEP Gy,By,Ey (F2),(v) | WRUSS Pq,Qq (66),REX.W
 f6: ADCX Gy,Ey (66) | ADOX Gy,Ey (F3) | MULX By,Gy,rDX,Ey (F2),(v)
 f7: BEXTR Gy,Ey,By (v) | SHLX Gy,Ey,By (66),(v) | SARX Gy,Ey,By (F3),(v) | SHRX Gy,Ey,By (F2),(v)
 EndTable