[1/3] x86/boot: Remove cached CPUID data from the trampoline
diff mbox series

Message ID 20191101202502.31750-2-andrew.cooper3@citrix.com
State New
Headers show
Series
  • Fix PV shim ballooning problems
Related show

Commit Message

Andrew Cooper Nov. 1, 2019, 8:25 p.m. UTC
We have a cached cpuid_ext_features in the trampoline which is kept in sync by
various pieces of boot logic.  This is complicated, and all it is actually
used for is to derive whether NX is safe to use.

Replace it with a canned value to load into EFER.

trampoline_setup() and efi_arch_cpu() now tweak trampoline_efer at the point
that they are stashing the main copy of CPUID data.  Similarly,
early_init_intel() needs to tweak if it has re-enabled the use of NX.

This simplifies the AP boot and S3 resume paths by using trampoline_efer
directly, rather than locally turning FEATURE_NX into EFER_NX.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Wei Liu <wl@xen.org>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Sergey Dyasli <sergey.dyasli@citrix.com>
CC: Juergen Gross <jgross@suse.com>
---
 xen/arch/x86/boot/head.S        |  9 +++++++--
 xen/arch/x86/boot/trampoline.S  | 13 +++++--------
 xen/arch/x86/boot/wakeup.S      | 13 ++-----------
 xen/arch/x86/cpu/common.c       |  3 ---
 xen/arch/x86/cpu/intel.c        |  1 +
 xen/arch/x86/efi/efi-boot.h     |  8 +++++---
 xen/include/asm-x86/processor.h |  2 +-
 7 files changed, 21 insertions(+), 28 deletions(-)

Comments

Jan Beulich Nov. 4, 2019, 1:25 p.m. UTC | #1
On 01.11.2019 21:25, Andrew Cooper wrote:
> --- a/xen/arch/x86/cpu/intel.c
> +++ b/xen/arch/x86/cpu/intel.c
> @@ -270,6 +270,7 @@ static void early_init_intel(struct cpuinfo_x86 *c)
>  	if (disable) {
>  		wrmsrl(MSR_IA32_MISC_ENABLE, misc_enable & ~disable);
>  		bootsym(trampoline_misc_enable_off) |= disable;
> +		bootsym(trampoline_efer) |= EFER_NX;
>  	}

I'm fine with all other changes here, just this one concerns me:
Before your change we latch a value into trampoline_misc_enable_off
just to be used for subsequent IA32_MISC_ENABLE writes we do. This
means that, on a hypervisor (like Xen itself) simply discarding
guest writes to the MSR (which is "fine" with the described usage
of ours up to now), with your change we'd now end up trying to set
EFER.NX when the feature may not actually be enabled in
IA32_MISC_ENABLE. Wouldn't such an EFER write be liable to #GP(0)?
I.e. don't we need to read back the MSR value here, and verify
we actually managed to clear the bit before actually OR-ing in
EFER_NX?

Jan
Andrew Cooper Nov. 4, 2019, 2:59 p.m. UTC | #2
On 04/11/2019 13:25, Jan Beulich wrote:
> On 01.11.2019 21:25, Andrew Cooper wrote:
>> --- a/xen/arch/x86/cpu/intel.c
>> +++ b/xen/arch/x86/cpu/intel.c
>> @@ -270,6 +270,7 @@ static void early_init_intel(struct cpuinfo_x86 *c)
>>  	if (disable) {
>>  		wrmsrl(MSR_IA32_MISC_ENABLE, misc_enable & ~disable);
>>  		bootsym(trampoline_misc_enable_off) |= disable;
>> +		bootsym(trampoline_efer) |= EFER_NX;
>>  	}
> I'm fine with all other changes here, just this one concerns me:
> Before your change we latch a value into trampoline_misc_enable_off
> just to be used for subsequent IA32_MISC_ENABLE writes we do. This
> means that, on a hypervisor (like Xen itself) simply discarding
> guest writes to the MSR (which is "fine" with the described usage
> of ours up to now), with your change we'd now end up trying to set
> EFER.NX when the feature may not actually be enabled in
> IA32_MISC_ENABLE. Wouldn't such an EFER write be liable to #GP(0)?
> I.e. don't we need to read back the MSR value here, and verify
> we actually managed to clear the bit before actually OR-ing in
> EFER_NX?

If this is a problem in practice, execution won't continue beyond the
next if() condition just out of context, which set EFER.NX on the BSP
with an unguarded WRMSR.

~Andrew
Jan Beulich Nov. 4, 2019, 3:03 p.m. UTC | #3
On 04.11.2019 15:59, Andrew Cooper wrote:
> On 04/11/2019 13:25, Jan Beulich wrote:
>> On 01.11.2019 21:25, Andrew Cooper wrote:
>>> --- a/xen/arch/x86/cpu/intel.c
>>> +++ b/xen/arch/x86/cpu/intel.c
>>> @@ -270,6 +270,7 @@ static void early_init_intel(struct cpuinfo_x86 *c)
>>>  	if (disable) {
>>>  		wrmsrl(MSR_IA32_MISC_ENABLE, misc_enable & ~disable);
>>>  		bootsym(trampoline_misc_enable_off) |= disable;
>>> +		bootsym(trampoline_efer) |= EFER_NX;
>>>  	}
>> I'm fine with all other changes here, just this one concerns me:
>> Before your change we latch a value into trampoline_misc_enable_off
>> just to be used for subsequent IA32_MISC_ENABLE writes we do. This
>> means that, on a hypervisor (like Xen itself) simply discarding
>> guest writes to the MSR (which is "fine" with the described usage
>> of ours up to now), with your change we'd now end up trying to set
>> EFER.NX when the feature may not actually be enabled in
>> IA32_MISC_ENABLE. Wouldn't such an EFER write be liable to #GP(0)?
>> I.e. don't we need to read back the MSR value here, and verify
>> we actually managed to clear the bit before actually OR-ing in
>> EFER_NX?
> 
> If this is a problem in practice, execution won't continue beyond the
> next if() condition just out of context, which set EFER.NX on the BSP
> with an unguarded WRMSR.

And how is this any good? This kind of crash is exactly what I'm
asking to avoid.

Jan
Andrew Cooper Nov. 4, 2019, 3:22 p.m. UTC | #4
On 04/11/2019 15:03, Jan Beulich wrote:
> On 04.11.2019 15:59, Andrew Cooper wrote:
>> On 04/11/2019 13:25, Jan Beulich wrote:
>>> On 01.11.2019 21:25, Andrew Cooper wrote:
>>>> --- a/xen/arch/x86/cpu/intel.c
>>>> +++ b/xen/arch/x86/cpu/intel.c
>>>> @@ -270,6 +270,7 @@ static void early_init_intel(struct cpuinfo_x86 *c)
>>>>  	if (disable) {
>>>>  		wrmsrl(MSR_IA32_MISC_ENABLE, misc_enable & ~disable);
>>>>  		bootsym(trampoline_misc_enable_off) |= disable;
>>>> +		bootsym(trampoline_efer) |= EFER_NX;
>>>>  	}
>>> I'm fine with all other changes here, just this one concerns me:
>>> Before your change we latch a value into trampoline_misc_enable_off
>>> just to be used for subsequent IA32_MISC_ENABLE writes we do. This
>>> means that, on a hypervisor (like Xen itself) simply discarding
>>> guest writes to the MSR (which is "fine" with the described usage
>>> of ours up to now), with your change we'd now end up trying to set
>>> EFER.NX when the feature may not actually be enabled in
>>> IA32_MISC_ENABLE. Wouldn't such an EFER write be liable to #GP(0)?
>>> I.e. don't we need to read back the MSR value here, and verify
>>> we actually managed to clear the bit before actually OR-ing in
>>> EFER_NX?
>> If this is a problem in practice, execution won't continue beyond the
>> next if() condition just out of context, which set EFER.NX on the BSP
>> with an unguarded WRMSR.
> And how is this any good? This kind of crash is exactly what I'm
> asking to avoid.

What is the point of working around a theoretical edge case of broken
nesting under Xen which demonstrably doesn't exist in practice?

~Andrew
Jan Beulich Nov. 4, 2019, 3:31 p.m. UTC | #5
On 04.11.2019 16:22, Andrew Cooper wrote:
> On 04/11/2019 15:03, Jan Beulich wrote:
>> On 04.11.2019 15:59, Andrew Cooper wrote:
>>> On 04/11/2019 13:25, Jan Beulich wrote:
>>>> On 01.11.2019 21:25, Andrew Cooper wrote:
>>>>> --- a/xen/arch/x86/cpu/intel.c
>>>>> +++ b/xen/arch/x86/cpu/intel.c
>>>>> @@ -270,6 +270,7 @@ static void early_init_intel(struct cpuinfo_x86 *c)
>>>>>  	if (disable) {
>>>>>  		wrmsrl(MSR_IA32_MISC_ENABLE, misc_enable & ~disable);
>>>>>  		bootsym(trampoline_misc_enable_off) |= disable;
>>>>> +		bootsym(trampoline_efer) |= EFER_NX;
>>>>>  	}
>>>> I'm fine with all other changes here, just this one concerns me:
>>>> Before your change we latch a value into trampoline_misc_enable_off
>>>> just to be used for subsequent IA32_MISC_ENABLE writes we do. This
>>>> means that, on a hypervisor (like Xen itself) simply discarding
>>>> guest writes to the MSR (which is "fine" with the described usage
>>>> of ours up to now), with your change we'd now end up trying to set
>>>> EFER.NX when the feature may not actually be enabled in
>>>> IA32_MISC_ENABLE. Wouldn't such an EFER write be liable to #GP(0)?
>>>> I.e. don't we need to read back the MSR value here, and verify
>>>> we actually managed to clear the bit before actually OR-ing in
>>>> EFER_NX?
>>> If this is a problem in practice, execution won't continue beyond the
>>> next if() condition just out of context, which set EFER.NX on the BSP
>>> with an unguarded WRMSR.
>> And how is this any good? This kind of crash is exactly what I'm
>> asking to avoid.
> 
> What is the point of working around a theoretical edge case of broken
> nesting under Xen which demonstrably doesn't exist in practice?

Well, so far nothing was said about this not being an actual problem.
I simply don't know whether hardware would refuse such an EFER write.
If it does, it would be appropriate for hypervisors to also refuse
it. I.e. if we don't do so right now, correcting the behavior would
trip the code here.

Jan
Andrew Cooper Nov. 12, 2019, 4:09 p.m. UTC | #6
On 04/11/2019 15:31, Jan Beulich wrote:
> On 04.11.2019 16:22, Andrew Cooper wrote:
>> On 04/11/2019 15:03, Jan Beulich wrote:
>>> On 04.11.2019 15:59, Andrew Cooper wrote:
>>>> On 04/11/2019 13:25, Jan Beulich wrote:
>>>>> On 01.11.2019 21:25, Andrew Cooper wrote:
>>>>>> --- a/xen/arch/x86/cpu/intel.c
>>>>>> +++ b/xen/arch/x86/cpu/intel.c
>>>>>> @@ -270,6 +270,7 @@ static void early_init_intel(struct cpuinfo_x86 *c)
>>>>>>  	if (disable) {
>>>>>>  		wrmsrl(MSR_IA32_MISC_ENABLE, misc_enable & ~disable);
>>>>>>  		bootsym(trampoline_misc_enable_off) |= disable;
>>>>>> +		bootsym(trampoline_efer) |= EFER_NX;
>>>>>>  	}
>>>>> I'm fine with all other changes here, just this one concerns me:
>>>>> Before your change we latch a value into trampoline_misc_enable_off
>>>>> just to be used for subsequent IA32_MISC_ENABLE writes we do. This
>>>>> means that, on a hypervisor (like Xen itself) simply discarding
>>>>> guest writes to the MSR (which is "fine" with the described usage
>>>>> of ours up to now), with your change we'd now end up trying to set
>>>>> EFER.NX when the feature may not actually be enabled in
>>>>> IA32_MISC_ENABLE. Wouldn't such an EFER write be liable to #GP(0)?
>>>>> I.e. don't we need to read back the MSR value here, and verify
>>>>> we actually managed to clear the bit before actually OR-ing in
>>>>> EFER_NX?
>>>> If this is a problem in practice, execution won't continue beyond the
>>>> next if() condition just out of context, which set EFER.NX on the BSP
>>>> with an unguarded WRMSR.
>>> And how is this any good? This kind of crash is exactly what I'm
>>> asking to avoid.
>> What is the point of working around a theoretical edge case of broken
>> nesting under Xen which demonstrably doesn't exist in practice?
> Well, so far nothing was said about this not being an actual problem.

Its not an actual problem.  If it were, we would have had crash reports.

> I simply don't know whether hardware would refuse such an EFER write.

I've just experimented - writing EFER.NX takes a #GP fault when
MISC_ENABLE.XD is set.

> If it does, it would be appropriate for hypervisors to also refuse
> it. I.e. if we don't do so right now, correcting the behavior would
> trip the code here.

MISC_ENABLES.XD is architectural on any Intel system which enumerates
NX, and if the bit is set, it can be cleared.  (Although the semantics
described in the SDM are broken.  It is only available if NX is
enumerated, which is obfuscated by setting XD).

However, no hypervisor is going to bother virtualising this
functionality.  Either configure the VM with NX or without.  (KVM for
example doesn't virtualise MISC_ENABLES at all.)

There is one corner case on out-of-support versions of Xen (which don't
clear XD themselves) where XD would leak through and be ignored, after
which Xen will take a #GP fault trying to set EFER.NX, but I am still
firmly of the opinion that it is not worth putting in a workaround for
an obsolete issue which doesn't exist in practice.

~Andrew
Jan Beulich Nov. 12, 2019, 5:15 p.m. UTC | #7
On 12.11.2019 17:09, Andrew Cooper wrote:
> On 04/11/2019 15:31, Jan Beulich wrote:
>> On 04.11.2019 16:22, Andrew Cooper wrote:
>>> On 04/11/2019 15:03, Jan Beulich wrote:
>>>> On 04.11.2019 15:59, Andrew Cooper wrote:
>>>>> On 04/11/2019 13:25, Jan Beulich wrote:
>>>>>> On 01.11.2019 21:25, Andrew Cooper wrote:
>>>>>>> --- a/xen/arch/x86/cpu/intel.c
>>>>>>> +++ b/xen/arch/x86/cpu/intel.c
>>>>>>> @@ -270,6 +270,7 @@ static void early_init_intel(struct cpuinfo_x86 *c)
>>>>>>>  	if (disable) {
>>>>>>>  		wrmsrl(MSR_IA32_MISC_ENABLE, misc_enable & ~disable);
>>>>>>>  		bootsym(trampoline_misc_enable_off) |= disable;
>>>>>>> +		bootsym(trampoline_efer) |= EFER_NX;
>>>>>>>  	}
>>>>>> I'm fine with all other changes here, just this one concerns me:
>>>>>> Before your change we latch a value into trampoline_misc_enable_off
>>>>>> just to be used for subsequent IA32_MISC_ENABLE writes we do. This
>>>>>> means that, on a hypervisor (like Xen itself) simply discarding
>>>>>> guest writes to the MSR (which is "fine" with the described usage
>>>>>> of ours up to now), with your change we'd now end up trying to set
>>>>>> EFER.NX when the feature may not actually be enabled in
>>>>>> IA32_MISC_ENABLE. Wouldn't such an EFER write be liable to #GP(0)?
>>>>>> I.e. don't we need to read back the MSR value here, and verify
>>>>>> we actually managed to clear the bit before actually OR-ing in
>>>>>> EFER_NX?
>>>>> If this is a problem in practice, execution won't continue beyond the
>>>>> next if() condition just out of context, which set EFER.NX on the BSP
>>>>> with an unguarded WRMSR.
>>>> And how is this any good? This kind of crash is exactly what I'm
>>>> asking to avoid.
>>> What is the point of working around a theoretical edge case of broken
>>> nesting under Xen which demonstrably doesn't exist in practice?
>> Well, so far nothing was said about this not being an actual problem.
> 
> Its not an actual problem.  If it were, we would have had crash reports.
> 
>> I simply don't know whether hardware would refuse such an EFER write.
> 
> I've just experimented - writing EFER.NX takes a #GP fault when
> MISC_ENABLE.XD is set.
> 
>> If it does, it would be appropriate for hypervisors to also refuse
>> it. I.e. if we don't do so right now, correcting the behavior would
>> trip the code here.
> 
> MISC_ENABLES.XD is architectural on any Intel system which enumerates
> NX, and if the bit is set, it can be cleared.  (Although the semantics
> described in the SDM are broken.  It is only available if NX is
> enumerated, which is obfuscated by setting XD).
> 
> However, no hypervisor is going to bother virtualising this
> functionality.  Either configure the VM with NX or without.  (KVM for
> example doesn't virtualise MISC_ENABLES at all.)

I'm sorry, but I still don't follow: You say "if the bit is set, it
can be cleared", which is clearly not in line with our current guest
MSR write handling. It just so happens that we have no command line
option allowing to suppress the clearing of XD. If we had, according
to your findings above we'd run into a #GP upon trying to set NX.
How can you easily exclude another hypervisor actually doing so (and
nobody having run into the issue simply because the option is rarely
used)?

Btw - all would be fine if the code in question was guarded by an
NX feature check, but as you say that's not possible because XD set
forces NX clear. However, our setting of EFER.NX could be guarded
this way, as we _expect_ XD to be clear at that point.

Jan
Andrew Cooper Nov. 13, 2019, 1:22 p.m. UTC | #8
On 12/11/2019 17:15, Jan Beulich wrote:
> On 12.11.2019 17:09, Andrew Cooper wrote:
>> On 04/11/2019 15:31, Jan Beulich wrote:
>>> On 04.11.2019 16:22, Andrew Cooper wrote:
>>>> On 04/11/2019 15:03, Jan Beulich wrote:
>>>>> On 04.11.2019 15:59, Andrew Cooper wrote:
>>>>>> On 04/11/2019 13:25, Jan Beulich wrote:
>>>>>>> On 01.11.2019 21:25, Andrew Cooper wrote:
>>>>>>>> --- a/xen/arch/x86/cpu/intel.c
>>>>>>>> +++ b/xen/arch/x86/cpu/intel.c
>>>>>>>> @@ -270,6 +270,7 @@ static void early_init_intel(struct cpuinfo_x86 *c)
>>>>>>>>  	if (disable) {
>>>>>>>>  		wrmsrl(MSR_IA32_MISC_ENABLE, misc_enable & ~disable);
>>>>>>>>  		bootsym(trampoline_misc_enable_off) |= disable;
>>>>>>>> +		bootsym(trampoline_efer) |= EFER_NX;
>>>>>>>>  	}
>>>>>>> I'm fine with all other changes here, just this one concerns me:
>>>>>>> Before your change we latch a value into trampoline_misc_enable_off
>>>>>>> just to be used for subsequent IA32_MISC_ENABLE writes we do. This
>>>>>>> means that, on a hypervisor (like Xen itself) simply discarding
>>>>>>> guest writes to the MSR (which is "fine" with the described usage
>>>>>>> of ours up to now), with your change we'd now end up trying to set
>>>>>>> EFER.NX when the feature may not actually be enabled in
>>>>>>> IA32_MISC_ENABLE. Wouldn't such an EFER write be liable to #GP(0)?
>>>>>>> I.e. don't we need to read back the MSR value here, and verify
>>>>>>> we actually managed to clear the bit before actually OR-ing in
>>>>>>> EFER_NX?
>>>>>> If this is a problem in practice, execution won't continue beyond the
>>>>>> next if() condition just out of context, which set EFER.NX on the BSP
>>>>>> with an unguarded WRMSR.
>>>>> And how is this any good? This kind of crash is exactly what I'm
>>>>> asking to avoid.
>>>> What is the point of working around a theoretical edge case of broken
>>>> nesting under Xen which demonstrably doesn't exist in practice?
>>> Well, so far nothing was said about this not being an actual problem.
>> Its not an actual problem.  If it were, we would have had crash reports.
>>
>>> I simply don't know whether hardware would refuse such an EFER write.
>> I've just experimented - writing EFER.NX takes a #GP fault when
>> MISC_ENABLE.XD is set.
>>
>>> If it does, it would be appropriate for hypervisors to also refuse
>>> it. I.e. if we don't do so right now, correcting the behavior would
>>> trip the code here.
>> MISC_ENABLES.XD is architectural on any Intel system which enumerates
>> NX, and if the bit is set, it can be cleared.  (Although the semantics
>> described in the SDM are broken.  It is only available if NX is
>> enumerated, which is obfuscated by setting XD).
>>
>> However, no hypervisor is going to bother virtualising this
>> functionality.  Either configure the VM with NX or without.  (KVM for
>> example doesn't virtualise MISC_ENABLES at all.)
> I'm sorry, but I still don't follow: You say "if the bit is set, it
> can be cleared", which is clearly not in line with our current guest
> MSR write handling.

Yes - Xen's MSR handing is broken, but you snipped that part of my reply.

> It just so happens that we have no command line
> option allowing to suppress the clearing of XD.

Nor does Linux.  As to the other hypervisors...

> If we had, according
> to your findings above we'd run into a #GP upon trying to set NX.

Yes.

> How can you easily exclude another hypervisor actually doing so (and
> nobody having run into the issue simply because the option is rarely
> used)?

... observe that they require NX support as a prerequisite to install. 
You will not find a system with XD set these days.

> Btw - all would be fine if the code in question was guarded by an
> NX feature check, but as you say that's not possible because XD set
> forces NX clear. However, our setting of EFER.NX could be guarded
> this way, as we _expect_ XD to be clear at that point.

XD was clearly never designed for the OS to find and turn off, but NX
functionality is simply too important to let misconfigured firmware get
in the way of using.


The long and the short of it is that this patch does not change Xen's
behaviour WRT poorly virtualised XD.

I am not convinced the behaviour is worth changing, and I don't have
time for this scope creep.  If you want to submit a fix then go ahead,
but patch 3 of this is important to get in for 4.13

~Andrew
Jan Beulich Nov. 13, 2019, 1:29 p.m. UTC | #9
On 13.11.2019 14:22, Andrew Cooper wrote:
> I am not convinced the behaviour is worth changing, and I don't have
> time for this scope creep.

There's no scope creep here at all. All I'm asking for is that you
don't blindly OR in EFER_NX into trampoline_efer, but rather check
that it will be possible to successfully set it after the
MISC_ENABLE write (by reading back the value, or by reading
CPUID[0x80000001].NX again).

Jan
Andrew Cooper Nov. 19, 2019, 3:15 p.m. UTC | #10
On 13/11/2019 13:29, Jan Beulich wrote:
> On 13.11.2019 14:22, Andrew Cooper wrote:
>> I am not convinced the behaviour is worth changing, and I don't have
>> time for this scope creep.
> There's no scope creep here at all.

Yes - it really is scope creep.

This patch does not change the behaviour of Xen in the case of poor
virtualisation of the bit.  Xen will still crash either way.

I have explained, repeatedly now, why I am not inclined to fix this. It
is a bug which doesn't exist in practice.

You are welcome to fix this yourself, in separate change to match the
separate scope, when you are not blocking a 4.13 fix with your request.

~Andrew
Jan Beulich Nov. 19, 2019, 4:44 p.m. UTC | #11
On 19.11.2019 16:15, Andrew Cooper wrote:
> On 13/11/2019 13:29, Jan Beulich wrote:
>> On 13.11.2019 14:22, Andrew Cooper wrote:
>>> I am not convinced the behaviour is worth changing, and I don't have
>>> time for this scope creep.
>> There's no scope creep here at all.
> 
> Yes - it really is scope creep.
> 
> This patch does not change the behaviour of Xen in the case of poor
> virtualisation of the bit.  Xen will still crash either way.

So I have to apologize. What I didn't notice is

	if (disable & MSR_IA32_MISC_ENABLE_XD_DISABLE) {
		write_efer(read_efer() | EFER_NX);
		printk(KERN_INFO
		       "re-enabled NX (Execute Disable) protection\n");
	}

in early_init_intel(). I simply didn't expect we'd already have
such a blind EFER write. I therefore agree now that this is a
pre-existing bug that you don't make any worse.

Reviewed-by: Jan Beulich <jbeulich@suse.com>

> I have explained, repeatedly now, why I am not inclined to fix this. It
> is a bug which doesn't exist in practice.

I should have been looking more closely; the lack of sufficient
context did misguide me.

Jan
Andrew Cooper Nov. 21, 2019, 3:23 p.m. UTC | #12
On 19/11/2019 16:44, Jan Beulich wrote:
> On 19.11.2019 16:15, Andrew Cooper wrote:
>> On 13/11/2019 13:29, Jan Beulich wrote:
>>> On 13.11.2019 14:22, Andrew Cooper wrote:
>>>> I am not convinced the behaviour is worth changing, and I don't have
>>>> time for this scope creep.
>>> There's no scope creep here at all.
>> Yes - it really is scope creep.
>>
>> This patch does not change the behaviour of Xen in the case of poor
>> virtualisation of the bit.  Xen will still crash either way.
> So I have to apologize. What I didn't notice is
>
> 	if (disable & MSR_IA32_MISC_ENABLE_XD_DISABLE) {
> 		write_efer(read_efer() | EFER_NX);
> 		printk(KERN_INFO
> 		       "re-enabled NX (Execute Disable) protection\n");
> 	}
>
> in early_init_intel(). I simply didn't expect we'd already have
> such a blind EFER write. I therefore agree now that this is a
> pre-existing bug that you don't make any worse.
>
> Reviewed-by: Jan Beulich <jbeulich@suse.com>

Thankyou.  I'll get this series in now, along with some other straggling
4.13 content.

~Andrew

Patch
diff mbox series

diff --git a/xen/arch/x86/boot/head.S b/xen/arch/x86/boot/head.S
index a1564b520b..77309e3c82 100644
--- a/xen/arch/x86/boot/head.S
+++ b/xen/arch/x86/boot/head.S
@@ -640,8 +640,13 @@  trampoline_setup:
         jbe     1f
         mov     $0x80000001,%eax
         cpuid
-1:      mov     %edx,sym_fs(cpuid_ext_features)
-        mov     %edx,sym_fs(boot_cpu_data)+CPUINFO_FEATURE_OFFSET(X86_FEATURE_LM)
+1:      mov     %edx, sym_fs(boot_cpu_data) + CPUINFO_FEATURE_OFFSET(X86_FEATURE_LM)
+
+        /* Check for NX. Adjust EFER setting if available. */
+        bt      $cpufeat_bit(X86_FEATURE_NX), %edx
+        jnc     1f
+        orb     $EFER_NX >> 8, 1 + sym_esi(trampoline_efer)
+1:
 
         /* Check for availability of long mode. */
         bt      $cpufeat_bit(X86_FEATURE_LM),%edx
diff --git a/xen/arch/x86/boot/trampoline.S b/xen/arch/x86/boot/trampoline.S
index 870ec79a2d..26584493bb 100644
--- a/xen/arch/x86/boot/trampoline.S
+++ b/xen/arch/x86/boot/trampoline.S
@@ -88,8 +88,9 @@  trampoline_gdt:
 GLOBAL(trampoline_misc_enable_off)
         .quad   0
 
-GLOBAL(cpuid_ext_features)
-        .long   0
+/* EFER OR-mask for boot paths.  This gets adjusted with NX when available. */
+GLOBAL(trampoline_efer)
+        .long   EFER_LME | EFER_SCE
 
 GLOBAL(trampoline_xen_phys_start)
         .long   0
@@ -132,14 +133,10 @@  trampoline_protmode_entry:
 1:
 
         /* Set up EFER (Extended Feature Enable Register). */
-        mov     bootsym_rel(cpuid_ext_features,4,%edi)
         movl    $MSR_EFER,%ecx
         rdmsr
-        or      $EFER_LME|EFER_SCE,%eax   /* Long Mode + SYSCALL/SYSRET */
-        bt      $cpufeat_bit(X86_FEATURE_NX),%edi /* No Execute? */
-        jnc     1f
-        btsl    $_EFER_NX,%eax  /* No Execute     */
-1:      wrmsr
+        or      bootsym_rel(trampoline_efer, 4, %eax)
+        wrmsr
 
         mov     $(X86_CR0_PG | X86_CR0_AM | X86_CR0_WP | X86_CR0_NE |\
                   X86_CR0_ET | X86_CR0_MP | X86_CR0_PE), %eax
diff --git a/xen/arch/x86/boot/wakeup.S b/xen/arch/x86/boot/wakeup.S
index 25ec2fa32b..fc47721f43 100644
--- a/xen/arch/x86/boot/wakeup.S
+++ b/xen/arch/x86/boot/wakeup.S
@@ -131,20 +131,11 @@  wakeup_32:
         wrmsr
 1:
 
-        /* Will cpuid feature change after resume? */
         /* Set up EFER (Extended Feature Enable Register). */
-        mov     bootsym_rel(cpuid_ext_features,4,%edi)
-        test    $0x20100800,%edi /* SYSCALL/SYSRET, No Execute, Long Mode? */
-        jz      .Lskip_eferw
         movl    $MSR_EFER,%ecx
         rdmsr
-        btsl    $_EFER_LME,%eax /* Long Mode      */
-        btsl    $_EFER_SCE,%eax /* SYSCALL/SYSRET */
-        btl     $20,%edi        /* No Execute?    */
-        jnc     1f
-        btsl    $_EFER_NX,%eax  /* No Execute     */
-1:      wrmsr
-.Lskip_eferw:
+        or      bootsym_rel(trampoline_efer, 4, %eax)
+        wrmsr
 
         wbinvd
 
diff --git a/xen/arch/x86/cpu/common.c b/xen/arch/x86/cpu/common.c
index 6c6bd63301..e5ad17d8d9 100644
--- a/xen/arch/x86/cpu/common.c
+++ b/xen/arch/x86/cpu/common.c
@@ -391,9 +391,6 @@  static void generic_identify(struct cpuinfo_x86 *c)
 		cpuid(0x80000001, &tmp, &tmp,
 		      &c->x86_capability[cpufeat_word(X86_FEATURE_LAHF_LM)],
 		      &c->x86_capability[cpufeat_word(X86_FEATURE_SYSCALL)]);
-	if (c == &boot_cpu_data)
-		bootsym(cpuid_ext_features) =
-			c->x86_capability[cpufeat_word(X86_FEATURE_NX)];
 
 	if (c->extended_cpuid_level >= 0x80000004)
 		get_model_name(c); /* Default name */
diff --git a/xen/arch/x86/cpu/intel.c b/xen/arch/x86/cpu/intel.c
index 5356a6ae10..4d7324e4d0 100644
--- a/xen/arch/x86/cpu/intel.c
+++ b/xen/arch/x86/cpu/intel.c
@@ -270,6 +270,7 @@  static void early_init_intel(struct cpuinfo_x86 *c)
 	if (disable) {
 		wrmsrl(MSR_IA32_MISC_ENABLE, misc_enable & ~disable);
 		bootsym(trampoline_misc_enable_off) |= disable;
+		bootsym(trampoline_efer) |= EFER_NX;
 	}
 
 	if (disable & MSR_IA32_MISC_ENABLE_LIMIT_CPUID)
diff --git a/xen/arch/x86/efi/efi-boot.h b/xen/arch/x86/efi/efi-boot.h
index 940ce12706..cde193a771 100644
--- a/xen/arch/x86/efi/efi-boot.h
+++ b/xen/arch/x86/efi/efi-boot.h
@@ -238,7 +238,7 @@  static void __init noreturn efi_arch_post_exit_boot(void)
     asm volatile("pushq $0\n\tpopfq");
     rdmsrl(MSR_EFER, efer);
     efer |= EFER_SCE;
-    if ( cpuid_ext_features & cpufeat_mask(X86_FEATURE_NX) )
+    if ( cpu_has_nx )
         efer |= EFER_NX;
     wrmsrl(MSR_EFER, efer);
     write_cr0(X86_CR0_PE | X86_CR0_MP | X86_CR0_ET | X86_CR0_NE | X86_CR0_WP |
@@ -640,9 +640,11 @@  static void __init efi_arch_cpu(void)
 
     if ( (eax >> 16) == 0x8000 && eax > 0x80000000 )
     {
-        cpuid_ext_features = cpuid_edx(0x80000001);
         boot_cpu_data.x86_capability[cpufeat_word(X86_FEATURE_SYSCALL)]
-            = cpuid_ext_features;
+            = cpuid_edx(0x80000001);
+
+        if ( cpu_has_nx )
+            trampoline_efer |= EFER_NX;
     }
 }
 
diff --git a/xen/include/asm-x86/processor.h b/xen/include/asm-x86/processor.h
index b686156ea0..45d8f5117e 100644
--- a/xen/include/asm-x86/processor.h
+++ b/xen/include/asm-x86/processor.h
@@ -151,7 +151,7 @@  extern void ctxt_switch_levelling(const struct vcpu *next);
 extern void (*ctxt_switch_masking)(const struct vcpu *next);
 
 extern bool_t opt_cpu_info;
-extern u32 cpuid_ext_features;
+extern u32 trampoline_efer;
 extern u64 trampoline_misc_enable_off;
 
 /* Maximum width of physical addresses supported by the hardware. */