
x86/boot: Clean up the trampoline transition into Long mode

Message ID 20200102145953.6503-1-andrew.cooper3@citrix.com (mailing list archive)
State New, archived
Series x86/boot: Clean up the trampoline transition into Long mode

Commit Message

Andrew Cooper Jan. 2, 2020, 2:59 p.m. UTC
The jmp after setting %cr0 is redundant with the following ljmp.

The CPUID to protect the jump to higher mappings was inserted due to an
abundance of caution/paranoia before Spectre was public.  There is not a
matching protection in the S3 resume path, and there is nothing
interesting in memory at this point.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Wei Liu <wl@xen.org>
CC: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/boot/trampoline.S | 22 ----------------------
 1 file changed, 22 deletions(-)
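
For reference, a commented sketch of how the transition reads once both
hunks are applied (labels and constants as in the quoted hunks below;
the comments are illustrative rather than the literal file contents):

        /* Enable paging.  EFER.LME is already set and %cr3 already
         * points at the boot page tables, so setting CR0.PG drops the
         * CPU into compatibility mode. */
        mov     $(X86_CR0_PG | X86_CR0_AM | X86_CR0_WP | X86_CR0_NE |\
                  X86_CR0_ET | X86_CR0_MP | X86_CR0_PE), %eax
        mov     %eax,%cr0

        /* No "jmp 1f; 1:" needed here: the far jump below already
         * reloads %cs and discards stale prefetch state, which is all
         * the historical jmp-after-%cr0-write idiom provided. */
        ljmp    $BOOT_CS64,$bootsym_rel(start64,6)

start64:
        /* Jump to high mappings, now without the CPUID speculation
         * barrier in between. */
        movabs  $__high_start, %rdi
        jmpq    *%rdi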

Comments

Wei Liu Jan. 2, 2020, 4:55 p.m. UTC | #1
On Thu, Jan 02, 2020 at 02:59:53PM +0000, Andrew Cooper wrote:
> The jmp after setting %cr0 is redundant with the following ljmp.
> 
> The CPUID to protect the jump to higher mappings was inserted due to an
> abundance of caution/paranoia before Spectre was public.  There is not a
> matching protection in the S3 resume path, and there is nothing
> interesting in memory at this point.

What do you mean by "there is nothing interesting in memory" here?

As far as I can tell the idle page table has been loaded.  During AP
bring-up it contains runtime data, no?

Wei.
Andrew Cooper Jan. 2, 2020, 5:20 p.m. UTC | #2
On 02/01/2020 16:55, Wei Liu wrote:
> On Thu, Jan 02, 2020 at 02:59:53PM +0000, Andrew Cooper wrote:
>> The jmp after setting %cr0 is redundant with the following ljmp.
>>
>> The CPUID to protect the jump to higher mappings was inserted due to an
>> abundance of caution/paranoia before Spectre was public.  There is not a
>> matching protection in the S3 resume path, and there is nothing
>> interesting in memory at this point.
> What do you mean by "there is nothing interesting in memory" here?
>
> As far as I can tell the idle page table has been loaded.  During AP
> bring-up it contains runtime data, no?

We haven't even decompressed the dom0 kernel at this point.  What data
are you concerned by?

This protection is only meaningful for virtualised guests, and is
ultimately incomplete.  If another VM can use Spectre v2 against this
VM, it can also use Spectre v1 and have a far more interesting time.

In the time since writing this code, it has become substantially more
apparent that VMs must trust their hypervisor to provide adequate
isolation, because there is literally nothing the VM can do itself.

~Andrew
Wei Liu Jan. 2, 2020, 6:45 p.m. UTC | #3
On Thu, Jan 02, 2020 at 05:20:12PM +0000, Andrew Cooper wrote:
> On 02/01/2020 16:55, Wei Liu wrote:
> > On Thu, Jan 02, 2020 at 02:59:53PM +0000, Andrew Cooper wrote:
> >> The jmp after setting %cr0 is redundant with the following ljmp.
> >>
> >> The CPUID to protect the jump to higher mappings was inserted due to an
> >> abundance of caution/paranoia before Spectre was public.  There is not a
> >> matching protection in the S3 resume path, and there is nothing
> >> interesting in memory at this point.
> > What do you mean by "there is nothing interesting in memory" here?
> >
> > As far as I can tell the idle page table has been loaded.  During AP
> > bring-up it contains runtime data, no?
> 
> We haven't even decompressed the dom0 kernel at this point.  What data
> are you concerned by?

As the original text implied, CPU hotplug should also be considered.

If that's not relevant now, can you please note that in the commit
message?

Wei.

> 
> This protection is only meaningful for virtualised guests, and is
> ultimately incomplete.  If another VM can use Spectre v2 against this
> VM, it can also use Spectre v1 and have a far more interesting time.
> 
> In the time since writing this code, it has become substantially more
> apparent that VMs must trust their hypervisor to provide adequate
> isolation, because there is literally nothing the VM can do itself.
> 
> ~Andrew
Jan Beulich Jan. 3, 2020, 1:36 p.m. UTC | #4
On 02.01.2020 15:59, Andrew Cooper wrote:
> @@ -111,26 +109,6 @@ trampoline_protmode_entry:
>  start64:
>          /* Jump to high mappings. */
>          movabs  $__high_start, %rdi
> -
> -#ifdef CONFIG_INDIRECT_THUNK
> -        /*
> -         * If booting virtualised, or hot-onlining a CPU, sibling threads can
> -         * attempt Branch Target Injection against this jmp.
> -         *
> -         * We've got no usable stack so can't use a RETPOLINE thunk, and are
> -         * further than disp32 from the high mappings so couldn't use
> -         * JUMP_THUNK even if it was a non-RETPOLINE thunk.  Furthermore, an
> -         * LFENCE isn't necessarily safe to use at this point.
> -         *
> -         * As this isn't a hotpath, use a fully serialising event to reduce
> -         * the speculation window as much as possible.  %ebx needs preserving
> -         * for __high_start.
> -         */
> -        mov     %ebx, %esi
> -        cpuid
> -        mov     %esi, %ebx
> -#endif
> -
>          jmpq    *%rdi

I can see this being unneeded when running virtualized, as you said
in reply to Wei. However, for hot-onlining (when other CPUs may run
random vCPU-s) I don't see how this can safely be dropped. There's
no similar concern for S3 resume, as thaw_domains() happens only
after enable_nonboot_cpus().

Jan
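
For context, the RETPOLINE thunk the deleted comment refers to is built
around a call/ret pair, which is why it needs a usable stack.  A minimal
sketch of the classic construction, for illustration only (the label is
hypothetical and this is not Xen's actual thunk):

        /* Dispatch an indirect jump to *%rdi without executing an
         * indirect branch that the BTB could have been trained on. */
thunk_rdi:
        call    2f              /* push address of 1: and skip ahead   */
1:      pause                   /* speculative "returns" spin here...  */
        lfence                  /* ...behind a speculation barrier     */
        jmp     1b
2:      mov     %rdi, (%rsp)    /* overwrite return address with *%rdi */
        ret                     /* architecturally jumps to *%rdi      */

Both the call and the ret dereference %rsp, which the trampoline cannot
rely on at this point.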
Andrew Cooper Jan. 3, 2020, 1:44 p.m. UTC | #5
On 03/01/2020 13:36, Jan Beulich wrote:
> On 02.01.2020 15:59, Andrew Cooper wrote:
>> @@ -111,26 +109,6 @@ trampoline_protmode_entry:
>>  start64:
>>          /* Jump to high mappings. */
>>          movabs  $__high_start, %rdi
>> -
>> -#ifdef CONFIG_INDIRECT_THUNK
>> -        /*
>> -         * If booting virtualised, or hot-onlining a CPU, sibling threads can
>> -         * attempt Branch Target Injection against this jmp.
>> -         *
>> -         * We've got no usable stack so can't use a RETPOLINE thunk, and are
>> -         * further than disp32 from the high mappings so couldn't use
>> -         * JUMP_THUNK even if it was a non-RETPOLINE thunk.  Furthermore, an
>> -         * LFENCE isn't necessarily safe to use at this point.
>> -         *
>> -         * As this isn't a hotpath, use a fully serialising event to reduce
>> -         * the speculation window as much as possible.  %ebx needs preserving
>> -         * for __high_start.
>> -         */
>> -        mov     %ebx, %esi
>> -        cpuid
>> -        mov     %esi, %ebx
>> -#endif
>> -
>>          jmpq    *%rdi
> I can see this being unneeded when running virtualized, as you said
> in reply to Wei. However, for hot-onlining (when other CPUs may run
> random vCPU-s) I don't see how this can safely be dropped. There's
> no similar concern for S3 resume, as thaw_domains() happens only
> after enable_nonboot_cpus().

I covered that in the same reply.  Any guest which can use branch target
injection against this jmp can also poison the regular branch predictor
and get at data that way.

Once again, we get to CPU Hotplug being an unused feature in practice,
which is completely evident now with Intel MCE behaviour.

A guest can't control/guess when a hotplug event might occur, or where
exactly this branch is in memory (after all - it is variable based on
the position of the trampoline), and core scheduling mitigates the risk
entirely.

~Andrew
Jan Beulich Jan. 3, 2020, 1:52 p.m. UTC | #6
On 03.01.2020 14:44, Andrew Cooper wrote:
> On 03/01/2020 13:36, Jan Beulich wrote:
>> On 02.01.2020 15:59, Andrew Cooper wrote:
>>> @@ -111,26 +109,6 @@ trampoline_protmode_entry:
>>>  start64:
>>>          /* Jump to high mappings. */
>>>          movabs  $__high_start, %rdi
>>> -
>>> -#ifdef CONFIG_INDIRECT_THUNK
>>> -        /*
>>> -         * If booting virtualised, or hot-onlining a CPU, sibling threads can
>>> -         * attempt Branch Target Injection against this jmp.
>>> -         *
>>> -         * We've got no usable stack so can't use a RETPOLINE thunk, and are
>>> -         * further than disp32 from the high mappings so couldn't use
>>> -         * JUMP_THUNK even if it was a non-RETPOLINE thunk.  Furthermore, an
>>> -         * LFENCE isn't necessarily safe to use at this point.
>>> -         *
>>> -         * As this isn't a hotpath, use a fully serialising event to reduce
>>> -         * the speculation window as much as possible.  %ebx needs preserving
>>> -         * for __high_start.
>>> -         */
>>> -        mov     %ebx, %esi
>>> -        cpuid
>>> -        mov     %esi, %ebx
>>> -#endif
>>> -
>>>          jmpq    *%rdi
>> I can see this being unneeded when running virtualized, as you said
>> in reply to Wei. However, for hot-onlining (when other CPUs may run
>> random vCPU-s) I don't see how this can safely be dropped. There's
>> no similar concern for S3 resume, as thaw_domains() happens only
>> after enable_nonboot_cpus().
> 
> I covered that in the same reply.  Any guest which can use branch target
> injection against this jmp can also poison the regular branch predictor
> and get at data that way.

Aren't you implying then that retpolines could also be dropped?

> Once again, we get to CPU Hotplug being an unused feature in practice,
> which is completely evident now with Intel MCE behaviour.

What does Intel's MCE behavior have to do with whether CPU hotplug
(or hot-onlining) is (un)used in practice?

> A guest can't control/guess when a hotplug event might occur, or where
> exactly this branch is in memory (after all - it is variable based on
> the position of the trampoline), and core scheduling mitigates the risk
> entirely.

"... will mitigate ..." - it's experimental up to now, isn't it?

Jan
Andrew Cooper Jan. 3, 2020, 2:25 p.m. UTC | #7
On 03/01/2020 13:52, Jan Beulich wrote:
> On 03.01.2020 14:44, Andrew Cooper wrote:
>> On 03/01/2020 13:36, Jan Beulich wrote:
>>> On 02.01.2020 15:59, Andrew Cooper wrote:
>>>> @@ -111,26 +109,6 @@ trampoline_protmode_entry:
>>>>  start64:
>>>>          /* Jump to high mappings. */
>>>>          movabs  $__high_start, %rdi
>>>> -
>>>> -#ifdef CONFIG_INDIRECT_THUNK
>>>> -        /*
>>>> -         * If booting virtualised, or hot-onlining a CPU, sibling threads can
>>>> -         * attempt Branch Target Injection against this jmp.
>>>> -         *
>>>> -         * We've got no usable stack so can't use a RETPOLINE thunk, and are
>>>> -         * further than disp32 from the high mappings so couldn't use
>>>> -         * JUMP_THUNK even if it was a non-RETPOLINE thunk.  Furthermore, an
>>>> -         * LFENCE isn't necessarily safe to use at this point.
>>>> -         *
>>>> -         * As this isn't a hotpath, use a fully serialising event to reduce
>>>> -         * the speculation window as much as possible.  %ebx needs preserving
>>>> -         * for __high_start.
>>>> -         */
>>>> -        mov     %ebx, %esi
>>>> -        cpuid
>>>> -        mov     %esi, %ebx
>>>> -#endif
>>>> -
>>>>          jmpq    *%rdi
>>> I can see this being unneeded when running virtualized, as you said
>>> in reply to Wei. However, for hot-onlining (when other CPUs may run
>>> random vCPU-s) I don't see how this can safely be dropped. There's
>>> no similar concern for S3 resume, as thaw_domains() happens only
>>> after enable_nonboot_cpus().
>> I covered that in the same reply.  Any guest which can use branch target
>> injection against this jmp can also poison the regular branch predictor
>> and get at data that way.
> Aren't you implying then that retpolines could also be dropped?

No.  It is a simple risk vs complexity tradeoff.

Guests running on a sibling *can already* attack this branch with BTI,
because CPUID isn't a fix to bad BTB speculation, and the leakage gadget
need only be a single instruction.

Such a guest can also attack Xen in general with Spectre v1.

As I said - this was introduced because of paranoia, back while the few
people who knew about the issues (only several hundred at the time) were
attempting to figure out what exactly a speculative attack looked like,
and were applying duct tape to everything suspicious because we had 0
time to rewrite several core pieces of system handling.

>> Once again, we get to CPU Hotplug being an unused feature in practice,
>> which is completely evident now with Intel MCE behaviour.
> What does Intel's MCE behavior have to do with whether CPU hotplug
> (or hot-onlining) is (un)used in practice?

The logical consequence of hotplug breaking MCEs.

If hotplug had been used in practice, the MCE behaviour would have come
to light much sooner, when MCEs didn't work in practice.

Given that MCEs really did work in practice even before the L1TF days,
hotplug wasn't in common-enough use for anyone to notice the MCE behaviour.

>> A guest can't control/guess when a hotplug event might occur, or where
>> exactly this branch is in memory (after all - it is variable based on
>> the position of the trampoline), and core scheduling mitigates the risk
>> entirely.
> "... will mitigate ..." - it's experimental up to now, isn't it?

Core scheduling ought to prevent the problem entirely.  The current code
is not safe in the absence of core scheduling.

~Andrew
Jan Beulich Jan. 3, 2020, 2:34 p.m. UTC | #8
On 03.01.2020 15:25, Andrew Cooper wrote:
> On 03/01/2020 13:52, Jan Beulich wrote:
>> On 03.01.2020 14:44, Andrew Cooper wrote:
>>> On 03/01/2020 13:36, Jan Beulich wrote:
>>>> On 02.01.2020 15:59, Andrew Cooper wrote:
>>>>> @@ -111,26 +109,6 @@ trampoline_protmode_entry:
>>>>>  start64:
>>>>>          /* Jump to high mappings. */
>>>>>          movabs  $__high_start, %rdi
>>>>> -
>>>>> -#ifdef CONFIG_INDIRECT_THUNK
>>>>> -        /*
>>>>> -         * If booting virtualised, or hot-onlining a CPU, sibling threads can
>>>>> -         * attempt Branch Target Injection against this jmp.
>>>>> -         *
>>>>> -         * We've got no usable stack so can't use a RETPOLINE thunk, and are
>>>>> -         * further than disp32 from the high mappings so couldn't use
>>>>> -         * JUMP_THUNK even if it was a non-RETPOLINE thunk.  Furthermore, an
>>>>> -         * LFENCE isn't necessarily safe to use at this point.
>>>>> -         *
>>>>> -         * As this isn't a hotpath, use a fully serialising event to reduce
>>>>> -         * the speculation window as much as possible.  %ebx needs preserving
>>>>> -         * for __high_start.
>>>>> -         */
>>>>> -        mov     %ebx, %esi
>>>>> -        cpuid
>>>>> -        mov     %esi, %ebx
>>>>> -#endif
>>>>> -
>>>>>          jmpq    *%rdi
>>>> I can see this being unneeded when running virtualized, as you said
>>>> in reply to Wei. However, for hot-onlining (when other CPUs may run
>>>> random vCPU-s) I don't see how this can safely be dropped. There's
>>>> no similar concern for S3 resume, as thaw_domains() happens only
>>>> after enable_nonboot_cpus().
>>> I covered that in the same reply.  Any guest which can use branch target
>>> injection against this jmp can also poison the regular branch predictor
>>> and get at data that way.
>> Aren't you implying then that retpolines could also be dropped?
> 
> No.  It is a simple risk vs complexity tradeoff.
> 
> Guests running on a sibling *can already* attack this branch with BTI,
> because CPUID isn't a fix to bad BTB speculation, and the leakage gadget
> need only be a single instruction.
> 
> Such a guest can also attack Xen in general with Spectre v1.
> 
> As I said - this was introduced because of paranoia, back while the few
> people who knew about the issues (only several hundred at the time) were
> attempting to figure out what exactly a speculative attack looked like,
> and were applying duct tape to everything suspicious because we had 0
> time to rewrite several core pieces of system handling.

Well, okay then:
Acked-by: Jan Beulich <jbeulich@suse.com>

>>> Once again, we get to CPU Hotplug being an unused feature in practice,
>>> which is completely evident now with Intel MCE behaviour.
>> What does Intel's MCE behavior have to do with whether CPU hotplug
>> (or hot-onlining) is (un)used in practice?
> 
> The logical consequence of hotplug breaking MCEs.
> 
> If hotplug had been used in practice, the MCE behaviour would have come
> to light much sooner, when MCEs didn't work in practice.
> 
> Given that MCEs really did work in practice even before the L1TF days,
> hotplug wasn't in common-enough use for anyone to notice the MCE behaviour.

Or systems where CPU hotplug was actually used were of good
enough quality to never surface #MC (personally I don't think
I've seen more than a handful of non-reproducible #MC instances)?
Or people having run into the bad behavior simply didn't have the
resources to investigate why their system shut down silently
(perhaps giving entirely random appearance of the behavior)?

Jan
Andrew Cooper Jan. 3, 2020, 6:55 p.m. UTC | #9
On 03/01/2020 14:34, Jan Beulich wrote:
> On 03.01.2020 15:25, Andrew Cooper wrote:
>> On 03/01/2020 13:52, Jan Beulich wrote:
>>> On 03.01.2020 14:44, Andrew Cooper wrote:
>>>> On 03/01/2020 13:36, Jan Beulich wrote:
>>>>> On 02.01.2020 15:59, Andrew Cooper wrote:
>>>>>> @@ -111,26 +109,6 @@ trampoline_protmode_entry:
>>>>>>  start64:
>>>>>>          /* Jump to high mappings. */
>>>>>>          movabs  $__high_start, %rdi
>>>>>> -
>>>>>> -#ifdef CONFIG_INDIRECT_THUNK
>>>>>> -        /*
>>>>>> -         * If booting virtualised, or hot-onlining a CPU, sibling threads can
>>>>>> -         * attempt Branch Target Injection against this jmp.
>>>>>> -         *
>>>>>> -         * We've got no usable stack so can't use a RETPOLINE thunk, and are
>>>>>> -         * further than disp32 from the high mappings so couldn't use
>>>>>> -         * JUMP_THUNK even if it was a non-RETPOLINE thunk.  Furthermore, an
>>>>>> -         * LFENCE isn't necessarily safe to use at this point.
>>>>>> -         *
>>>>>> -         * As this isn't a hotpath, use a fully serialising event to reduce
>>>>>> -         * the speculation window as much as possible.  %ebx needs preserving
>>>>>> -         * for __high_start.
>>>>>> -         */
>>>>>> -        mov     %ebx, %esi
>>>>>> -        cpuid
>>>>>> -        mov     %esi, %ebx
>>>>>> -#endif
>>>>>> -
>>>>>>          jmpq    *%rdi
>>>>> I can see this being unneeded when running virtualized, as you said
>>>>> in reply to Wei. However, for hot-onlining (when other CPUs may run
>>>>> random vCPU-s) I don't see how this can safely be dropped. There's
>>>>> no similar concern for S3 resume, as thaw_domains() happens only
>>>>> after enable_nonboot_cpus().
>>>> I covered that in the same reply.  Any guest which can use branch target
>>>> injection against this jmp can also poison the regular branch predictor
>>>> and get at data that way.
>>> Aren't you implying then that retpolines could also be dropped?
>> No.  It is a simple risk vs complexity tradeoff.
>>
>> Guests running on a sibling *can already* attack this branch with BTI,
>> because CPUID isn't a fix to bad BTB speculation, and the leakage gadget
>> need only be a single instruction.
>>
>> Such a guest can also attack Xen in general with Spectre v1.
>>
>> As I said - this was introduced because of paranoia, back while the few
>> people who knew about the issues (only several hundred at the time) were
>> attempting to figure out what exactly a speculative attack looked like,
>> and were applying duct tape to everything suspicious because we had 0
>> time to rewrite several core pieces of system handling.
> Well, okay then:
> Acked-by: Jan Beulich <jbeulich@suse.com>

Thanks.  I've adjusted the commit message in light of this conversation.

>
>>>> Once again, we get to CPU Hotplug being an unused feature in practice,
>>>> which is completely evident now with Intel MCE behaviour.
>>> What does Intel's MCE behavior have to do with whether CPU hotplug
>>> (or hot-onlining) is (un)used in practice?
>> The logical consequence of hotplug breaking MCEs.
>>
>> If hotplug had been used in practice, the MCE behaviour would have come
>> to light much sooner, when MCEs didn't work in practice.
>>
>> Given that MCEs really did work in practice even before the L1TF days,
>> hotplug wasn't in common-enough use for anyone to notice the MCE behaviour.
> Or systems where CPU hotplug was actually used were of good
> enough quality to never surface #MC

Suffice it to say that there is plenty of evidence to the contrary here.

Without going into details for obvious reasons, there have been a number
of #MC conditions (both preexisting and regressions) in recent
microcode, discovered in the field because everyone needs to take
microcode updates proactively these days.

> (personally I don't think
> I've seen more than a handful of non-reproducible #MC instances)?

You don't run a "cloud scale" number of systems.

Even XenServer's test system of a few hundred machines sees a concerning
(but ultimately background-noise) rate of #MCs, some of which are
definite hardware failures (and kept around for error testing purposes),
and others are in need of investigation.

> Or people having run into the bad behavior simply didn't have the
> resources to investigate why their system shut down silently
> (perhaps giving entirely random appearance of the behavior)?

Customers don't tolerate their hosts randomly crashing, especially if it
happens consistently.

Yes - technically speaking all of these are options, but the balance of
probability is vastly on the side of CPU hot-plug not actually being
used at any scale in practice.  (Not least because there are still
interrupt handling bugs present in Xen's implementation.)

~Andrew

Patch

diff --git a/xen/arch/x86/boot/trampoline.S b/xen/arch/x86/boot/trampoline.S
index c60ebb3f00..574d1bd8f4 100644
--- a/xen/arch/x86/boot/trampoline.S
+++ b/xen/arch/x86/boot/trampoline.S
@@ -101,8 +101,6 @@  trampoline_protmode_entry:
         mov     $(X86_CR0_PG | X86_CR0_AM | X86_CR0_WP | X86_CR0_NE |\
                   X86_CR0_ET | X86_CR0_MP | X86_CR0_PE), %eax
         mov     %eax,%cr0
-        jmp     1f
-1:
 
         /* Now in compatibility mode. Long-jump into 64-bit mode. */
         ljmp    $BOOT_CS64,$bootsym_rel(start64,6)
@@ -111,26 +109,6 @@  trampoline_protmode_entry:
 start64:
         /* Jump to high mappings. */
         movabs  $__high_start, %rdi
-
-#ifdef CONFIG_INDIRECT_THUNK
-        /*
-         * If booting virtualised, or hot-onlining a CPU, sibling threads can
-         * attempt Branch Target Injection against this jmp.
-         *
-         * We've got no usable stack so can't use a RETPOLINE thunk, and are
-         * further than disp32 from the high mappings so couldn't use
-         * JUMP_THUNK even if it was a non-RETPOLINE thunk.  Furthermore, an
-         * LFENCE isn't necessarily safe to use at this point.
-         *
-         * As this isn't a hotpath, use a fully serialising event to reduce
-         * the speculation window as much as possible.  %ebx needs preserving
-         * for __high_start.
-         */
-        mov     %ebx, %esi
-        cpuid
-        mov     %esi, %ebx
-#endif
-
         jmpq    *%rdi
 
 #include "video.h"