
[7/7] x86emul: support SYSRET

Message ID 78b62646-6fd4-e5b3-bc09-783bb017eaaa@suse.com (mailing list archive)
State Superseded
Series x86emul: (mainly) vendor specific behavior adjustments

Commit Message

Jan Beulich March 24, 2020, 4:29 p.m. UTC
This is to augment SYSCALL, which has been supported for quite some
time.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

Comments

Andrew Cooper March 25, 2020, 10 a.m. UTC | #1
On 24/03/2020 16:29, Jan Beulich wrote:
> This is to augment SYSCALL, which has been supported for quite some
> time.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

I've compared this to the in-progress version I have in my XSA-204
follow-on series.  I'm afraid the behaviour has far more vendor specific
quirks than this.

>
> --- a/xen/arch/x86/x86_emulate/x86_emulate.c
> +++ b/xen/arch/x86/x86_emulate/x86_emulate.c
> @@ -5975,6 +5975,60 @@ x86_emulate(
>              goto done;
>          break;
>  
> +    case X86EMUL_OPC(0x0f, 0x07): /* sysret */
> +        vcpu_must_have(syscall);
> +        /* Inject #UD if syscall/sysret are disabled. */
> +        fail_if(!ops->read_msr);
> +        if ( (rc = ops->read_msr(MSR_EFER, &msr_val, ctxt)) != X86EMUL_OKAY )
> +            goto done;
> +        generate_exception_if((msr_val & EFER_SCE) == 0, EXC_UD);

(as with the SYSCALL side), no need for the vcpu_must_have(syscall) as
well as this check.

> +        generate_exception_if(!amd_like(ctxt) && !mode_64bit(), EXC_UD);
> +        generate_exception_if(!mode_ring0(), EXC_GP, 0);
> +        generate_exception_if(!in_protmode(ctxt, ops), EXC_GP, 0);
> +

The Intel SYSRET vulnerability checks regs->rcx for canonicity here, and
raises #GP here.

I see you've got it below, but this is where the Intel pseudocode puts
it, before MSR_STAR gets read, and logically it should be grouped with
the other exceptions.

> +        if ( (rc = ops->read_msr(MSR_STAR, &msr_val, ctxt)) != X86EMUL_OKAY )
> +            goto done;
> +        sreg.sel = ((msr_val >> 48) + 8) | 3; /* SELECTOR_RPL_MASK */

This would be the logical behaviour...

AMD CPUs |3 into %cs.sel, but don't make an equivalent adjustment for
%ss.sel, and simply take MSR_STAR.SYSRET_CS + 8.

If you aren't careful with MSR_STAR, SYSRET will return to userspace
with mismatching RPL/DPL and userspace can really find itself with an
%ss with an RPL of 0.  (Of course, when you take an interrupt and
attempt to IRET back to this context, things fall apart).

I discovered this entirely by accident in XTF, but it is confirmed by
careful reading of the AMD SYSRET pseudocode.
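
A minimal standalone sketch of the selector derivation being described, assuming MSR_STAR[63:48] as the SYSRET base selector and a plain "amd_like" flag standing in for the emulator's vendor check (illustrative only, not the patch's code):

#include <stdbool.h>
#include <stdint.h>

/* Derive the SYSRET %cs/%ss selectors from MSR_STAR, per vendor behaviour. */
static void sysret_selectors(uint64_t msr_star, bool return_to_64bit,
                             bool amd_like,
                             uint16_t *cs_sel, uint16_t *ss_sel)
{
    uint16_t base = msr_star >> 48;                  /* MSR_STAR.SYSRET_CS */

    /* %cs gets RPL 3 on both vendors; +16 when returning to 64-bit mode. */
    *cs_sel = (return_to_64bit ? base + 16 : base) | 3;

    /* AMD skips the RPL fixup on %ss.sel; Intel ORs in RPL 3. */
    *ss_sel = amd_like ? base + 8 : (base + 8) | 3;
}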

> +        cs.sel = op_bytes == 8 ? sreg.sel + 8 : sreg.sel - 8;
> +
> +        cs.base = sreg.base = 0; /* flat segment */
> +        cs.limit = sreg.limit = ~0u; /* 4GB limit */
> +        cs.attr = 0xcfb; /* G+DB+P+DPL3+S+Code */
> +        sreg.attr = 0xcf3; /* G+DB+P+DPL3+S+Data */

Again, that would be the logical behaviour...

AMD CPUs don't update anything but %ss.sel, and even comment the fact
in pseudocode now.

This was discovered by Andy Luto, who found that taking an interrupt
(which unconditionally sets %ss to NUL) followed by an opportunistic sysret
back to 32bit userspace lets userspace see a sane %ss value, but with
the attrs still empty, and the stack unusable.

> +
> +#ifdef __x86_64__
> +        if ( mode_64bit() )
> +        {
> +            if ( op_bytes == 8 )
> +            {
> +                cs.attr = 0xafb; /* L+DB+P+DPL3+S+Code */
> +                generate_exception_if(!is_canonical_address(_regs.rcx) &&
> +                                      !amd_like(ctxt), EXC_GP, 0);

Wherever this ends up living, I think it needs calling out with a
comment /* CVE-xxx, Intel privilege escalation hole */, as it is a very
subtle piece of vendor specific behaviour.

Do we have a Centaur/other CPU to try with?  I'd err on the side of
going with == Intel rather than !AMD to avoid introducing known
vulnerabilities into models which stand half a chance of not being affected.
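
As a hedged illustration of that suggestion (a sketch in the quoted patch's idiom, not the patch as posted), the canonicity check could be keyed on Intel explicitly, reusing the vendor test from the SYSCALL delta further down the thread, so that unknown vendors default to the non-faulting AMD-style behaviour:

if ( op_bytes == 8 )
{
    cs.attr = 0xafb; /* L+DB+P+DPL3+S+Code */
    /*
     * Intel-only: #GP(0), still in ring 0, on a non-canonical return RIP
     * (the XSA-7 class of privilege escalation hole).
     */
    generate_exception_if((ctxt->cpuid->x86_vendor & X86_VENDOR_INTEL) &&
                          !is_canonical_address(_regs.rcx), EXC_GP, 0);
    _regs.rip = _regs.rcx;
}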

> +                _regs.rip = _regs.rcx;
> +            }
> +            else
> +                _regs.rip = _regs.ecx;
> +
> +            _regs.eflags = _regs.r11 & ~(X86_EFLAGS_RF | X86_EFLAGS_VM);
> +        }
> +        else
> +#endif
> +        {
> +            _regs.r(ip) = _regs.ecx;
> +            _regs.eflags |= X86_EFLAGS_IF;
> +        }
> +
> +        fail_if(!ops->write_segment);
> +        if ( (rc = ops->write_segment(x86_seg_cs, &cs, ctxt)) != X86EMUL_OKAY ||
> +             (!amd_like(ctxt) &&
> +              (rc = ops->write_segment(x86_seg_ss, &sreg,
> +                                       ctxt)) != X86EMUL_OKAY) )

Oh - here is the AMD behaviour with %ss, but it's not quite correct.

AFAICT, the correct behaviour is to read the old %ss on AMD-like, set
flat attributes on Intel, and write back normally, because %ss.sel does
get updated.
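
A hedged sketch of that corrected flow, again in the quoted patch's idiom (the read_segment hook is assumed to be available and is guarded like the other hooks):

/*
 * %ss.sel is updated on both vendors; AMD-like CPUs keep the previously
 * cached base/limit/attributes, while Intel loads a flat ring-3 segment.
 */
if ( !amd_like(ctxt) )
{
    sreg.base = 0;
    sreg.limit = ~0u;
    sreg.attr = 0xcf3; /* G+DB+P+DPL3+S+Data */
}
else
{
    uint16_t new_sel = sreg.sel;

    fail_if(!ops->read_segment);
    if ( (rc = ops->read_segment(x86_seg_ss, &sreg, ctxt)) != X86EMUL_OKAY )
        goto done;
    sreg.sel = new_sel; /* only the selector changes on AMD-like CPUs */
}

fail_if(!ops->write_segment);
if ( (rc = ops->write_segment(x86_seg_cs, &cs, ctxt)) != X86EMUL_OKAY ||
     (rc = ops->write_segment(x86_seg_ss, &sreg, ctxt)) != X86EMUL_OKAY )
    goto done;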

~Andrew

> +            goto done;
> +
> +        singlestep = _regs.eflags & X86_EFLAGS_TF;
> +        break;
> +
>      case X86EMUL_OPC(0x0f, 0x08): /* invd */
>      case X86EMUL_OPC(0x0f, 0x09): /* wbinvd / wbnoinvd */
>          generate_exception_if(!mode_ring0(), EXC_GP, 0);
>
Jan Beulich March 25, 2020, 10:19 a.m. UTC | #2
On 25.03.2020 11:00, Andrew Cooper wrote:
> On 24/03/2020 16:29, Jan Beulich wrote:
>> --- a/xen/arch/x86/x86_emulate/x86_emulate.c
>> +++ b/xen/arch/x86/x86_emulate/x86_emulate.c
>> @@ -5975,6 +5975,60 @@ x86_emulate(
>>              goto done;
>>          break;
>>  
>> +    case X86EMUL_OPC(0x0f, 0x07): /* sysret */
>> +        vcpu_must_have(syscall);
>> +        /* Inject #UD if syscall/sysret are disabled. */
>> +        fail_if(!ops->read_msr);
>> +        if ( (rc = ops->read_msr(MSR_EFER, &msr_val, ctxt)) != X86EMUL_OKAY )
>> +            goto done;
>> +        generate_exception_if((msr_val & EFER_SCE) == 0, EXC_UD);
> 
> (as with the SYSCALL side), no need for the vcpu_must_have(syscall) as
> well as this check.

Hmm, yes, we do so elsewhere too, so I'll adjust this there and here.

>> +        generate_exception_if(!amd_like(ctxt) && !mode_64bit(), EXC_UD);
>> +        generate_exception_if(!mode_ring0(), EXC_GP, 0);
>> +        generate_exception_if(!in_protmode(ctxt, ops), EXC_GP, 0);
>> +
> 
> The Intel SYSRET vulnerability checks regs->rcx for canonicity here, and
> raises #GP here.
> 
> I see you've got it below, but this is where the Intel pseudocode puts
> it, before MSR_STAR gets read, and logically it should be grouped with
> the other exceptions.

I had it here first, then moved it down to avoid yet another mode_64bit()
instance. I didn't see why the ordering would matter for the overall
result, on the basis that the STAR read ought not to fail under normal
circumstances. I'll move it back where it was since you ask for it.

>> +        if ( (rc = ops->read_msr(MSR_STAR, &msr_val, ctxt)) != X86EMUL_OKAY )
>> +            goto done;
>> +        sreg.sel = ((msr_val >> 48) + 8) | 3; /* SELECTOR_RPL_MASK */
> 
> This would be the logical behaviour...
> 
> AMD CPUs |3 into %cs.sel, but don't make an equivalent adjustment for
> %ss.sel, and simply take MSR_STAR.SYSRET_CS + 8.
> 
> If you aren't careful with MSR_STAR, SYSRET will return to userspace
> with mismatching RPL/DPL and userspace can really find itself with an
> %ss with an RPL of 0.  (Of course, when you take an interrupt and
> attempt to IRET back to this context, things fall apart).
> 
> I discovered this entirely by accident in XTF, but it is confirmed by
> careful reading of the AMD SYSRET pseudocode.

I did notice this in their pseudocode, but it looked too wrong to
be true. Will change.

>> +        cs.sel = op_bytes == 8 ? sreg.sel + 8 : sreg.sel - 8;
>> +
>> +        cs.base = sreg.base = 0; /* flat segment */
>> +        cs.limit = sreg.limit = ~0u; /* 4GB limit */
>> +        cs.attr = 0xcfb; /* G+DB+P+DPL3+S+Code */
>> +        sreg.attr = 0xcf3; /* G+DB+P+DPL3+S+Data */
> 
> Again, that would be the logical behaviour...
> 
> AMD CPUs don't update anything but %ss.sel, and even comment the fact
> in pseudocode now.
> 
> This was discovered by Andy Luto, who found that taking an interrupt
> (which unconditionally sets %ss to NUL) followed by an opportunistic sysret
> back to 32bit userspace lets userspace see a sane %ss value, but with
> the attrs still empty, and the stack unusable.
> 
>> +
>> +#ifdef __x86_64__
>> +        if ( mode_64bit() )
>> +        {
>> +            if ( op_bytes == 8 )
>> +            {
>> +                cs.attr = 0xafb; /* L+DB+P+DPL3+S+Code */
>> +                generate_exception_if(!is_canonical_address(_regs.rcx) &&
>> +                                      !amd_like(ctxt), EXC_GP, 0);
> 
> Wherever this ends up living, I think it needs calling out with a
> comment /* CVE-xxx, Intel privilege escalation hole */, as it is a very
> subtle piece of vendor specific behaviour.
> 
> Do we have a Centaur/other CPU to try with?  I'd err on the side of
> going with == Intel rather than !AMD to avoid introducing known
> vulnerabilities into models which stand half a chance of not being affected.

I'd rather not - this exception behavior is spelled out by the
SDM, and hence imo pretty likely to be followed by clones.
While I do have a VIA box somewhere, it's not stable enough to
run for more than a couple of minutes.

>> +                _regs.rip = _regs.rcx;
>> +            }
>> +            else
>> +                _regs.rip = _regs.ecx;
>> +
>> +            _regs.eflags = _regs.r11 & ~(X86_EFLAGS_RF | X86_EFLAGS_VM);
>> +        }
>> +        else
>> +#endif
>> +        {
>> +            _regs.r(ip) = _regs.ecx;
>> +            _regs.eflags |= X86_EFLAGS_IF;
>> +        }
>> +
>> +        fail_if(!ops->write_segment);
>> +        if ( (rc = ops->write_segment(x86_seg_cs, &cs, ctxt)) != X86EMUL_OKAY ||
>> +             (!amd_like(ctxt) &&
>> +              (rc = ops->write_segment(x86_seg_ss, &sreg,
>> +                                       ctxt)) != X86EMUL_OKAY) )
> 
> Oh - here is the AMD behaviour with %ss, but its not quite correct.
> 
> AFAICT, the correct behaviour is to read the old %ss on AMD-like, set
> flat attributes on Intel, and write back normally, because %ss.sel does
> get updated.

Oh, of course - I meant to, got distracted, and then forgot. Will fix.

Jan
Andrew Cooper March 25, 2020, 10:47 a.m. UTC | #3
On 25/03/2020 10:19, Jan Beulich wrote:
> On 25.03.2020 11:00, Andrew Cooper wrote:
>> On 24/03/2020 16:29, Jan Beulich wrote:
>>> --- a/xen/arch/x86/x86_emulate/x86_emulate.c
>>> +++ b/xen/arch/x86/x86_emulate/x86_emulate.c
>>> @@ -5975,6 +5975,60 @@ x86_emulate(
>>>              goto done;
>>>          break;
>>>  
>>> +    case X86EMUL_OPC(0x0f, 0x07): /* sysret */
>>> +        vcpu_must_have(syscall);
>>> +        /* Inject #UD if syscall/sysret are disabled. */
>>> +        fail_if(!ops->read_msr);
>>> +        if ( (rc = ops->read_msr(MSR_EFER, &msr_val, ctxt)) != X86EMUL_OKAY )
>>> +            goto done;
>>> +        generate_exception_if((msr_val & EFER_SCE) == 0, EXC_UD);
>> (as with the SYSCALL side), no need for the vcpu_must_have(syscall) as
>> well as this check.
> Hmm, yes, we do so elsewhere too, so I'll adjust this there and here.

In theory, the SEP checks for SYSENTER/SYSEXIT could be similarly
dropped, once the MSR logic is updated to perform proper availability
checks.

>>> +        if ( (rc = ops->read_msr(MSR_STAR, &msr_val, ctxt)) != X86EMUL_OKAY )
>>> +            goto done;
>>> +        sreg.sel = ((msr_val >> 48) + 8) | 3; /* SELECTOR_RPL_MASK */
>> This would be the logical behaviour...
>>
>> AMD CPUs |3 into %cs.sel, but don't make an equivalent adjustment for
>> %ss.sel, and simply take MSR_STAR.SYSRET_CS + 8.
>>
>> If you aren't careful with MSR_STAR, SYSRET will return to userspace
>> with mismatching RPL/DPL and userspace can really find itself with an
>> %ss with an RPL of 0.  (Of course, when you take an interrupt and
>> attempt to IRET back to this context, things fall apart).
>>
>> I discovered this entirely by accident in XTF, but it is confirmed by
>> careful reading of the AMD SYSRET pseudocode.
> I did notice this in their pseudocode, but it looked too wrong to
> be true. Will change.

The main reason my XSA-204 follow-on series is still pending is that I
never got around to completing an XTF test for all of these corner cases.

I'm happy to drop my series to Xen in light of this series of yours, but
I'd still like to complete the XTF side of things at some point.

>>> +
>>> +#ifdef __x86_64__
>>> +        if ( mode_64bit() )
>>> +        {
>>> +            if ( op_bytes == 8 )
>>> +            {
>>> +                cs.attr = 0xafb; /* L+DB+P+DPL3+S+Code */
>>> +                generate_exception_if(!is_canonical_address(_regs.rcx) &&
>>> +                                      !amd_like(ctxt), EXC_GP, 0);
>> Wherever this ends up living, I think it needs calling out with a
>> comment /* CVE-xxx, Intel privilege escalation hole */, as it is a very
>> subtle piece of vendor specific behaviour.
>>
>> Do we have a Centaur/other CPU to try with?  I'd err on the side of
>> going with == Intel rather than !AMD to avoid introducing known
>> vulnerabilities into models which stand half a chance of not being affected.
> I'd rather not - this exception behavior is spelled out by the
> SDM, and hence imo pretty likely to be followed by clones.

In pseudocode which certainly used to state somewhere "for reference
only, and not to be taken as a precise specification of behaviour". 
(And yes - that statement was still at the beginning of Vol2 when Intel
also claimed that "SYSRET was working according to the spec" in the
embargo period of XSA-7, because I called them out on it).

And anyway - it is a part of the AMD64 spec, not the Intel32 spec.  A
3rd party implementing it for 64bit support is more likely to go with
AMD's writings of how it behaves.

> While I do have a VIA box somewhere, it's not stable enough to
> run for more than a couple of minutes.

Fundamentally, it boils down to this.

Intel behaviour leaves a privilege escalation vulnerability available to
userspace.

Assuming AMD behaviour for unknown parts is the safer course of action,
because we don't need to issue an XSA/CVE to fix the emulator when it
turns out that we're wrong.

~Andrew
Jan Beulich March 25, 2020, 11:55 a.m. UTC | #4
On 25.03.2020 11:00, Andrew Cooper wrote:
> On 24/03/2020 16:29, Jan Beulich wrote:
>> --- a/xen/arch/x86/x86_emulate/x86_emulate.c
>> +++ b/xen/arch/x86/x86_emulate/x86_emulate.c
>> @@ -5975,6 +5975,60 @@ x86_emulate(
>>              goto done;
>>          break;
>>  
>> +    case X86EMUL_OPC(0x0f, 0x07): /* sysret */
>> +        vcpu_must_have(syscall);
>> +        /* Inject #UD if syscall/sysret are disabled. */
>> +        fail_if(!ops->read_msr);
>> +        if ( (rc = ops->read_msr(MSR_EFER, &msr_val, ctxt)) != X86EMUL_OKAY )
>> +            goto done;
>> +        generate_exception_if((msr_val & EFER_SCE) == 0, EXC_UD);
> 
> (as with the SYSCALL side), no need for the vcpu_must_have(syscall) as
> well as this check.

Upon re-reading I'm now confused - are you suggesting to also drop
the EFER.SCE check? That's not what you said in reply to 6/7. If
so, what's your thinking behind saying so? If I'm to guess, this
may go along the lines of you suggesting to drop the explicit CPUID
checks from SYSENTER/SYSEXIT as well, but I'm not seeing there
either why you would think this way (albeit there it's also a
little vague what exact changes you're thinking of at the MSR
handling side).

Jan
Andrew Cooper March 25, 2020, 12:25 p.m. UTC | #5
On 25/03/2020 11:55, Jan Beulich wrote:
> On 25.03.2020 11:00, Andrew Cooper wrote:
>> On 24/03/2020 16:29, Jan Beulich wrote:
>>> --- a/xen/arch/x86/x86_emulate/x86_emulate.c
>>> +++ b/xen/arch/x86/x86_emulate/x86_emulate.c
>>> @@ -5975,6 +5975,60 @@ x86_emulate(
>>>              goto done;
>>>          break;
>>>  
>>> +    case X86EMUL_OPC(0x0f, 0x07): /* sysret */
>>> +        vcpu_must_have(syscall);
>>> +        /* Inject #UD if syscall/sysret are disabled. */
>>> +        fail_if(!ops->read_msr);
>>> +        if ( (rc = ops->read_msr(MSR_EFER, &msr_val, ctxt)) != X86EMUL_OKAY )
>>> +            goto done;
>>> +        generate_exception_if((msr_val & EFER_SCE) == 0, EXC_UD);
>> (as with the SYSCALL side), no need for the vcpu_must_have(syscall) as
>> well as this check.
> Upon re-reading I'm now confused - are you suggesting to also drop
> the EFER.SCE check?

No.  The SCE check is critical and needs to remain.

The exact delta I had put together was:

diff --git a/xen/arch/x86/x86_emulate/x86_emulate.c b/xen/arch/x86/x86_emulate/x86_emulate.c
index c730511ebe..57ce7e00be 100644
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -5883,9 +5883,11 @@ x86_emulate(
 
 #ifdef __XEN__
     case X86EMUL_OPC(0x0f, 0x05): /* syscall */
-        generate_exception_if(!in_protmode(ctxt, ops), EXC_UD);
+        if ( !in_protmode(ctxt, ops) ||
+             ((ctxt->cpuid->x86_vendor & X86_VENDOR_INTEL) && !mode_64bit()) )
+            generate_exception(EXC_UD);
 
-        /* Inject #UD if syscall/sysret are disabled. */
+        /* Inject #UD if SCE is disabled.  Subsumes the SYSCALL CPUID check. */
         fail_if(ops->read_msr == NULL);
         if ( (rc = ops->read_msr(MSR_EFER, &msr_val, ctxt)) != X86EMUL_OKAY )
             goto done;


(Looking at the commit date, Mon Dec 19 13:32:11 2016 is quite a long
time ago...)
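
Applied to the SYSRET case under review, the same simplification would presumably read along these lines (sketch only, mirroring the hunk above rather than either posted patch):

    case X86EMUL_OPC(0x0f, 0x07): /* sysret */
        /* Inject #UD if SCE is disabled.  Subsumes the SYSCALL CPUID check. */
        fail_if(!ops->read_msr);
        if ( (rc = ops->read_msr(MSR_EFER, &msr_val, ctxt)) != X86EMUL_OKAY )
            goto done;
        generate_exception_if(!(msr_val & EFER_SCE), EXC_UD);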

~Andrew

Patch

--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -5975,6 +5975,60 @@  x86_emulate(
             goto done;
         break;
 
+    case X86EMUL_OPC(0x0f, 0x07): /* sysret */
+        vcpu_must_have(syscall);
+        /* Inject #UD if syscall/sysret are disabled. */
+        fail_if(!ops->read_msr);
+        if ( (rc = ops->read_msr(MSR_EFER, &msr_val, ctxt)) != X86EMUL_OKAY )
+            goto done;
+        generate_exception_if((msr_val & EFER_SCE) == 0, EXC_UD);
+        generate_exception_if(!amd_like(ctxt) && !mode_64bit(), EXC_UD);
+        generate_exception_if(!mode_ring0(), EXC_GP, 0);
+        generate_exception_if(!in_protmode(ctxt, ops), EXC_GP, 0);
+
+        if ( (rc = ops->read_msr(MSR_STAR, &msr_val, ctxt)) != X86EMUL_OKAY )
+            goto done;
+
+        sreg.sel = ((msr_val >> 48) + 8) | 3; /* SELECTOR_RPL_MASK */
+        cs.sel = op_bytes == 8 ? sreg.sel + 8 : sreg.sel - 8;
+
+        cs.base = sreg.base = 0; /* flat segment */
+        cs.limit = sreg.limit = ~0u; /* 4GB limit */
+        cs.attr = 0xcfb; /* G+DB+P+DPL3+S+Code */
+        sreg.attr = 0xcf3; /* G+DB+P+DPL3+S+Data */
+
+#ifdef __x86_64__
+        if ( mode_64bit() )
+        {
+            if ( op_bytes == 8 )
+            {
+                cs.attr = 0xafb; /* L+DB+P+DPL3+S+Code */
+                generate_exception_if(!is_canonical_address(_regs.rcx) &&
+                                      !amd_like(ctxt), EXC_GP, 0);
+                _regs.rip = _regs.rcx;
+            }
+            else
+                _regs.rip = _regs.ecx;
+
+            _regs.eflags = _regs.r11 & ~(X86_EFLAGS_RF | X86_EFLAGS_VM);
+        }
+        else
+#endif
+        {
+            _regs.r(ip) = _regs.ecx;
+            _regs.eflags |= X86_EFLAGS_IF;
+        }
+
+        fail_if(!ops->write_segment);
+        if ( (rc = ops->write_segment(x86_seg_cs, &cs, ctxt)) != X86EMUL_OKAY ||
+             (!amd_like(ctxt) &&
+              (rc = ops->write_segment(x86_seg_ss, &sreg,
+                                       ctxt)) != X86EMUL_OKAY) )
+            goto done;
+
+        singlestep = _regs.eflags & X86_EFLAGS_TF;
+        break;
+
     case X86EMUL_OPC(0x0f, 0x08): /* invd */
     case X86EMUL_OPC(0x0f, 0x09): /* wbinvd / wbnoinvd */
         generate_exception_if(!mode_ring0(), EXC_GP, 0);