diff mbox series

[v3] x86: do not enable global pages when virtualized on AMD hardware

Message ID 20191204151208.37076-1-roger.pau@citrix.com (mailing list archive)
State Superseded
Headers show
Series [v3] x86: do not enable global pages when virtualized on AMD hardware | expand

Commit Message

Roger Pau Monné Dec. 4, 2019, 3:12 p.m. UTC
When using global pages a full tlb flush can only be performed by
toggling the PGE bit in CR4, which is usually quite expensive in terms
of performance when running virtualized. This is specially relevant on
AMD hardware, which doesn't have the ability to do selective CR4
trapping, but can also be relevant on Intel if the underlying
hypervisor also traps accesses to the PGE CR4 bit.

In order to avoid this performance penalty, do not use global pages
when running virtualized on AMD hardware. A command line option
'global-pages' is provided in order to allow the user to select
whether global pages will be enabled for PV guests.

The above figures are from a PV shim running on AMD hardware with
32 vCPUs:

PGE enabled, x2APIC mode:

(XEN) Global lock flush_lock: addr=ffff82d0804b01c0, lockval=1adb1adb, not locked
(XEN)   lock:1841883(1375128998543), block:1658716(10193054890781)

Average lock time:   746588ns
Average block time: 6145147ns

PGE disabled, x2APIC mode:

(XEN) Global lock flush_lock: addr=ffff82d0804af1c0, lockval=a8bfa8bf, not locked
(XEN)   lock:2730175(657505389886), block:2039716(2963768247738)

Average lock time:   240829ns
Average block time: 1453029ns

As seen from the above figures the lock and block time of the flush
lock is reduced to approximately 1/3 of the original value.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Changes since v2:
 - Set the default value at init if not specified by the user.
 - Use int8_t and read_mostly for opt_global_pages.

Changes since v1:
 - Provide command line option to enable/disable PGE.
 - Only disable PGE on AMD hardware when virtualized.
 - Document the global-pages option.
---
 docs/misc/xen-command-line.pandoc | 13 +++++++++++++
 xen/arch/x86/pv/domain.c          | 15 ++++++++++++++-
 2 files changed, 27 insertions(+), 1 deletion(-)

Comments

Jan Beulich Dec. 4, 2019, 4:11 p.m. UTC | #1
On 04.12.2019 16:12, Roger Pau Monne wrote:
> --- a/xen/arch/x86/pv/domain.c
> +++ b/xen/arch/x86/pv/domain.c
> @@ -118,6 +118,19 @@ unsigned long pv_fixup_guest_cr4(const struct vcpu *v, unsigned long cr4)
>              (mmu_cr4_features & PV_CR4_GUEST_VISIBLE_MASK));
>  }
>  
> +static int8_t __read_mostly opt_global_pages = -1;
> +boolean_runtime_param("global-pages", opt_global_pages);
> +
> +static int __init pge_init(void)
> +{
> +    if ( opt_global_pages == -1 )
> +        opt_global_pages = !cpu_has_hypervisor ||
> +                           boot_cpu_data.x86_vendor != X86_VENDOR_AMD;
> +
> +    return 0;
> +}
> +__initcall(pge_init);
> +
>  unsigned long pv_make_cr4(const struct vcpu *v)
>  {
>      const struct domain *d = v->domain;
> @@ -130,7 +143,7 @@ unsigned long pv_make_cr4(const struct vcpu *v)
>       */
>      if ( d->arch.pv.pcid )
>          cr4 |= X86_CR4_PCIDE;
> -    else if ( !d->arch.pv.xpti )
> +    else if ( !d->arch.pv.xpti && opt_global_pages )
>          cr4 |= X86_CR4_PGE;

I'm sorry for noticing this only now, but what about XEN_MINIMAL_CR4,
which includes X86_CR4_PGE?

I also have to admit I'm still feeling uneasy about the runtime
adjustment you permit. I can't point at anything that's wrong in this
regard, but anyway.

Jan
Roger Pau Monné Dec. 4, 2019, 4:18 p.m. UTC | #2
On Wed, Dec 04, 2019 at 05:11:42PM +0100, Jan Beulich wrote:
> On 04.12.2019 16:12, Roger Pau Monne wrote:
> > --- a/xen/arch/x86/pv/domain.c
> > +++ b/xen/arch/x86/pv/domain.c
> > @@ -118,6 +118,19 @@ unsigned long pv_fixup_guest_cr4(const struct vcpu *v, unsigned long cr4)
> >              (mmu_cr4_features & PV_CR4_GUEST_VISIBLE_MASK));
> >  }
> >  
> > +static int8_t __read_mostly opt_global_pages = -1;
> > +boolean_runtime_param("global-pages", opt_global_pages);
> > +
> > +static int __init pge_init(void)
> > +{
> > +    if ( opt_global_pages == -1 )
> > +        opt_global_pages = !cpu_has_hypervisor ||
> > +                           boot_cpu_data.x86_vendor != X86_VENDOR_AMD;
> > +
> > +    return 0;
> > +}
> > +__initcall(pge_init);
> > +
> >  unsigned long pv_make_cr4(const struct vcpu *v)
> >  {
> >      const struct domain *d = v->domain;
> > @@ -130,7 +143,7 @@ unsigned long pv_make_cr4(const struct vcpu *v)
> >       */
> >      if ( d->arch.pv.pcid )
> >          cr4 |= X86_CR4_PCIDE;
> > -    else if ( !d->arch.pv.xpti )
> > +    else if ( !d->arch.pv.xpti && opt_global_pages )
> >          cr4 |= X86_CR4_PGE;
> 
> I'm sorry for noticing this only now, but what about XEN_MINIMAL_CR4,
> which includes X86_CR4_PGE?

I've tried removing PGE from XEN_MINIMAL_CR4 but it made no noticeable
performance difference, so I left it as-is.

> 
> I also have to admit I'm still feeling uneasy about the runtime
> adjustment you permit. I can't point at anything that's wrong in this
> regard, but anyway.

Feel free to change boolean_runtime_param into boolean_param. I think
allowing runtime changes is fine because the flush takes into account
the current CR4, but maybe there are corner cases I'm not aware of.

Thanks, Roger.
Jan Beulich Dec. 4, 2019, 5:05 p.m. UTC | #3
On 04.12.2019 17:18, Roger Pau Monné wrote:
> On Wed, Dec 04, 2019 at 05:11:42PM +0100, Jan Beulich wrote:
>> On 04.12.2019 16:12, Roger Pau Monne wrote:
>>> @@ -130,7 +143,7 @@ unsigned long pv_make_cr4(const struct vcpu *v)
>>>       */
>>>      if ( d->arch.pv.pcid )
>>>          cr4 |= X86_CR4_PCIDE;
>>> -    else if ( !d->arch.pv.xpti )
>>> +    else if ( !d->arch.pv.xpti && opt_global_pages )
>>>          cr4 |= X86_CR4_PGE;
>>
>> I'm sorry for noticing this only now, but what about XEN_MINIMAL_CR4,
>> which includes X86_CR4_PGE?
> 
> I've tried removing PGE from XEN_MINIMAL_CR4 but it made no noticeable
> performance difference, so I left it as-is.

My concern isn't about performance, but correctness. I admit I
forgot for a moment that we now always write CR4 (unless the
cached value matches the intended new one). Yet
mmu_cr4_features (starting out as XEN_MINIMAL_CR4) is still of
concern.

I think this at least requires extending the description to
discuss the correctness.

Jan
Roger Pau Monné Dec. 9, 2019, 10:20 a.m. UTC | #4
On Wed, Dec 04, 2019 at 06:05:11PM +0100, Jan Beulich wrote:
> On 04.12.2019 17:18, Roger Pau Monné wrote:
> > On Wed, Dec 04, 2019 at 05:11:42PM +0100, Jan Beulich wrote:
> >> On 04.12.2019 16:12, Roger Pau Monne wrote:
> >>> @@ -130,7 +143,7 @@ unsigned long pv_make_cr4(const struct vcpu *v)
> >>>       */
> >>>      if ( d->arch.pv.pcid )
> >>>          cr4 |= X86_CR4_PCIDE;
> >>> -    else if ( !d->arch.pv.xpti )
> >>> +    else if ( !d->arch.pv.xpti && opt_global_pages )
> >>>          cr4 |= X86_CR4_PGE;
> >>
> >> I'm sorry for noticing this only now, but what about XEN_MINIMAL_CR4,
> >> which includes X86_CR4_PGE?
> > 
> > I've tried removing PGE from XEN_MINIMAL_CR4 but it made no noticeable
> > performance difference, so I left it as-is.
> 
> My concern isn't about performance, but correctness. I admit I
> forgot for a moment that we now always write CR4 (unless the
> cached value matches the intended new one). Yet
> mmu_cr4_features (starting out as XEN_MINIMAL_CR4) is still of
> concern.
> 
> I think this at least requires extending the description to
> discuss the correctness.

Would you be fine with adding the following at the end of the commit
message.

"Note that XEN_MINIMAL_CR4 is not modified, and thus global pages are
left enabled for the hypervisor. This is not an issue because the code
to switch the control registers (cr3 and cr4) already takes into
account such situation and performs the necessary flushes. The same
already happens when using XPTI or PCIDE, as the guest cr4 doesn't
have global pages enabled in that case either."

Thanks, Roger.
Jan Beulich Dec. 9, 2019, 2:21 p.m. UTC | #5
On 09.12.2019 11:20, Roger Pau Monné wrote:
> On Wed, Dec 04, 2019 at 06:05:11PM +0100, Jan Beulich wrote:
>> On 04.12.2019 17:18, Roger Pau Monné wrote:
>>> On Wed, Dec 04, 2019 at 05:11:42PM +0100, Jan Beulich wrote:
>>>> On 04.12.2019 16:12, Roger Pau Monne wrote:
>>>>> @@ -130,7 +143,7 @@ unsigned long pv_make_cr4(const struct vcpu *v)
>>>>>       */
>>>>>      if ( d->arch.pv.pcid )
>>>>>          cr4 |= X86_CR4_PCIDE;
>>>>> -    else if ( !d->arch.pv.xpti )
>>>>> +    else if ( !d->arch.pv.xpti && opt_global_pages )
>>>>>          cr4 |= X86_CR4_PGE;
>>>>
>>>> I'm sorry for noticing this only now, but what about XEN_MINIMAL_CR4,
>>>> which includes X86_CR4_PGE?
>>>
>>> I've tried removing PGE from XEN_MINIMAL_CR4 but it made no noticeable
>>> performance difference, so I left it as-is.
>>
>> My concern isn't about performance, but correctness. I admit I
>> forgot for a moment that we now always write CR4 (unless the
>> cached value matches the intended new one). Yet
>> mmu_cr4_features (starting out as XEN_MINIMAL_CR4) is still of
>> concern.
>>
>> I think this at least requires extending the description to
>> discuss the correctness.
> 
> Would you be fine with adding the following at the end of the commit
> message.
> 
> "Note that XEN_MINIMAL_CR4 is not modified, and thus global pages are
> left enabled for the hypervisor. This is not an issue because the code
> to switch the control registers (cr3 and cr4) already takes into
> account such situation and performs the necessary flushes. The same
> already happens when using XPTI or PCIDE, as the guest cr4 doesn't
> have global pages enabled in that case either."

Yes, this is good for XEN_MINIMAL_CR4. But I think mmu_cr4_features
needs discussing (or at least mentioning, if the same arguments
apply) as well.

Jan
Roger Pau Monné Dec. 9, 2019, 2:46 p.m. UTC | #6
On Mon, Dec 09, 2019 at 03:21:28PM +0100, Jan Beulich wrote:
> On 09.12.2019 11:20, Roger Pau Monné wrote:
> > On Wed, Dec 04, 2019 at 06:05:11PM +0100, Jan Beulich wrote:
> >> On 04.12.2019 17:18, Roger Pau Monné wrote:
> >>> On Wed, Dec 04, 2019 at 05:11:42PM +0100, Jan Beulich wrote:
> >>>> On 04.12.2019 16:12, Roger Pau Monne wrote:
> >>>>> @@ -130,7 +143,7 @@ unsigned long pv_make_cr4(const struct vcpu *v)
> >>>>>       */
> >>>>>      if ( d->arch.pv.pcid )
> >>>>>          cr4 |= X86_CR4_PCIDE;
> >>>>> -    else if ( !d->arch.pv.xpti )
> >>>>> +    else if ( !d->arch.pv.xpti && opt_global_pages )
> >>>>>          cr4 |= X86_CR4_PGE;
> >>>>
> >>>> I'm sorry for noticing this only now, but what about XEN_MINIMAL_CR4,
> >>>> which includes X86_CR4_PGE?
> >>>
> >>> I've tried removing PGE from XEN_MINIMAL_CR4 but it made no noticeable
> >>> performance difference, so I left it as-is.
> >>
> >> My concern isn't about performance, but correctness. I admit I
> >> forgot for a moment that we now always write CR4 (unless the
> >> cached value matches the intended new one). Yet
> >> mmu_cr4_features (starting out as XEN_MINIMAL_CR4) is still of
> >> concern.
> >>
> >> I think this at least requires extending the description to
> >> discuss the correctness.
> > 
> > Would you be fine with adding the following at the end of the commit
> > message.
> > 
> > "Note that XEN_MINIMAL_CR4 is not modified, and thus global pages are
> > left enabled for the hypervisor. This is not an issue because the code
> > to switch the control registers (cr3 and cr4) already takes into
> > account such situation and performs the necessary flushes. The same
> > already happens when using XPTI or PCIDE, as the guest cr4 doesn't
> > have global pages enabled in that case either."
> 
> Yes, this is good for XEN_MINIMAL_CR4. But I think mmu_cr4_features
> needs discussing (or at least mentioning, if the same arguments
> apply) as well.

The same applies to mmu_cr4_features, it's fine for the hypervisor to
use a different set of cr4 features (especially PGE) than PV guests:
this is already the case when using XPTI or PCIDE when Xen cr4 will
have PGE and guests cr4 won't, and switch_cr3_cr4 DTRT.

So I would s/XEN_MINIMAL_CR4/XEN_MINIMAL_CR4 and mmu_cr4_features/ in
the above proposed paragraph.

Thanks, Roger.
Jan Beulich Dec. 9, 2019, 3:04 p.m. UTC | #7
On 09.12.2019 15:46, Roger Pau Monné wrote:
> On Mon, Dec 09, 2019 at 03:21:28PM +0100, Jan Beulich wrote:
>> On 09.12.2019 11:20, Roger Pau Monné wrote:
>>> On Wed, Dec 04, 2019 at 06:05:11PM +0100, Jan Beulich wrote:
>>>> On 04.12.2019 17:18, Roger Pau Monné wrote:
>>>>> On Wed, Dec 04, 2019 at 05:11:42PM +0100, Jan Beulich wrote:
>>>>>> On 04.12.2019 16:12, Roger Pau Monne wrote:
>>>>>>> @@ -130,7 +143,7 @@ unsigned long pv_make_cr4(const struct vcpu *v)
>>>>>>>       */
>>>>>>>      if ( d->arch.pv.pcid )
>>>>>>>          cr4 |= X86_CR4_PCIDE;
>>>>>>> -    else if ( !d->arch.pv.xpti )
>>>>>>> +    else if ( !d->arch.pv.xpti && opt_global_pages )
>>>>>>>          cr4 |= X86_CR4_PGE;
>>>>>>
>>>>>> I'm sorry for noticing this only now, but what about XEN_MINIMAL_CR4,
>>>>>> which includes X86_CR4_PGE?
>>>>>
>>>>> I've tried removing PGE from XEN_MINIMAL_CR4 but it made no noticeable
>>>>> performance difference, so I left it as-is.
>>>>
>>>> My concern isn't about performance, but correctness. I admit I
>>>> forgot for a moment that we now always write CR4 (unless the
>>>> cached value matches the intended new one). Yet
>>>> mmu_cr4_features (starting out as XEN_MINIMAL_CR4) is still of
>>>> concern.
>>>>
>>>> I think this at least requires extending the description to
>>>> discuss the correctness.
>>>
>>> Would you be fine with adding the following at the end of the commit
>>> message.
>>>
>>> "Note that XEN_MINIMAL_CR4 is not modified, and thus global pages are
>>> left enabled for the hypervisor. This is not an issue because the code
>>> to switch the control registers (cr3 and cr4) already takes into
>>> account such situation and performs the necessary flushes. The same
>>> already happens when using XPTI or PCIDE, as the guest cr4 doesn't
>>> have global pages enabled in that case either."
>>
>> Yes, this is good for XEN_MINIMAL_CR4. But I think mmu_cr4_features
>> needs discussing (or at least mentioning, if the same arguments
>> apply) as well.
> 
> The same applies to mmu_cr4_features, it's fine for the hypervisor to
> use a different set of cr4 features (especially PGE) than PV guests:
> this is already the case when using XPTI or PCIDE when Xen cr4 will
> have PGE and guests cr4 won't, and switch_cr3_cr4 DTRT.
> 
> So I would s/XEN_MINIMAL_CR4/XEN_MINIMAL_CR4 and mmu_cr4_features/ in
> the above proposed paragraph.

I'm afraid it's more complicated, up to and including you making a
possible pre-existing bug worse: The S3 resume path loads CR4 from
mmu_cr4_features, but doesn't update the in-memory cache of the
register. Hence some subsequent CR4 writes may wrongly be skipped,
until hitting one where an actual write is necessary. Now that you
play (more) with PGE, the situation is only going to get worse. Of
course I may well simply not have managed to spot the sync-ing of
the cached value. (VMX, otoh, takes special care to keep CPU loaded
value and cache in sync - see the bottom of vmx_do_resume().)

Jan
Roger Pau Monné Dec. 9, 2019, 3:36 p.m. UTC | #8
On Mon, Dec 09, 2019 at 04:04:51PM +0100, Jan Beulich wrote:
> On 09.12.2019 15:46, Roger Pau Monné wrote:
> > On Mon, Dec 09, 2019 at 03:21:28PM +0100, Jan Beulich wrote:
> >> On 09.12.2019 11:20, Roger Pau Monné wrote:
> >>> On Wed, Dec 04, 2019 at 06:05:11PM +0100, Jan Beulich wrote:
> >>>> On 04.12.2019 17:18, Roger Pau Monné wrote:
> >>>>> On Wed, Dec 04, 2019 at 05:11:42PM +0100, Jan Beulich wrote:
> >>>>>> On 04.12.2019 16:12, Roger Pau Monne wrote:
> >>>>>>> @@ -130,7 +143,7 @@ unsigned long pv_make_cr4(const struct vcpu *v)
> >>>>>>>       */
> >>>>>>>      if ( d->arch.pv.pcid )
> >>>>>>>          cr4 |= X86_CR4_PCIDE;
> >>>>>>> -    else if ( !d->arch.pv.xpti )
> >>>>>>> +    else if ( !d->arch.pv.xpti && opt_global_pages )
> >>>>>>>          cr4 |= X86_CR4_PGE;
> >>>>>>
> >>>>>> I'm sorry for noticing this only now, but what about XEN_MINIMAL_CR4,
> >>>>>> which includes X86_CR4_PGE?
> >>>>>
> >>>>> I've tried removing PGE from XEN_MINIMAL_CR4 but it made no noticeable
> >>>>> performance difference, so I left it as-is.
> >>>>
> >>>> My concern isn't about performance, but correctness. I admit I
> >>>> forgot for a moment that we now always write CR4 (unless the
> >>>> cached value matches the intended new one). Yet
> >>>> mmu_cr4_features (starting out as XEN_MINIMAL_CR4) is still of
> >>>> concern.
> >>>>
> >>>> I think this at least requires extending the description to
> >>>> discuss the correctness.
> >>>
> >>> Would you be fine with adding the following at the end of the commit
> >>> message.
> >>>
> >>> "Note that XEN_MINIMAL_CR4 is not modified, and thus global pages are
> >>> left enabled for the hypervisor. This is not an issue because the code
> >>> to switch the control registers (cr3 and cr4) already takes into
> >>> account such situation and performs the necessary flushes. The same
> >>> already happens when using XPTI or PCIDE, as the guest cr4 doesn't
> >>> have global pages enabled in that case either."
> >>
> >> Yes, this is good for XEN_MINIMAL_CR4. But I think mmu_cr4_features
> >> needs discussing (or at least mentioning, if the same arguments
> >> apply) as well.
> > 
> > The same applies to mmu_cr4_features, it's fine for the hypervisor to
> > use a different set of cr4 features (especially PGE) than PV guests:
> > this is already the case when using XPTI or PCIDE when Xen cr4 will
> > have PGE and guests cr4 won't, and switch_cr3_cr4 DTRT.
> > 
> > So I would s/XEN_MINIMAL_CR4/XEN_MINIMAL_CR4 and mmu_cr4_features/ in
> > the above proposed paragraph.
> 
> I'm afraid it's more complicated, up to and including you making a
> possible pre-existing bug worse: The S3 resume path loads CR4 from
> mmu_cr4_features, but doesn't update the in-memory cache of the
> register.

All domains are paused and the scheduler is disabled when doing S3
suspend/resume, and the actual suspend and resume code is run by a
tasklet which is executed in the idle vCPU context (since all domains
are paused), and hence the write of CR4 with mmu_cr4_features is fine
as entering guest context after resume is going to involve a call to
switch_cr3_cr4 in order to switch out of the idle vCPU.

It might be clearer to just save cr4 in do_suspend_lowlevel like it's
done with the rest of the control registers.

Thanks, Roger.
Jan Beulich Dec. 9, 2019, 3:39 p.m. UTC | #9
On 09.12.2019 16:36, Roger Pau Monné wrote:
> On Mon, Dec 09, 2019 at 04:04:51PM +0100, Jan Beulich wrote:
>> On 09.12.2019 15:46, Roger Pau Monné wrote:
>>> On Mon, Dec 09, 2019 at 03:21:28PM +0100, Jan Beulich wrote:
>>>> On 09.12.2019 11:20, Roger Pau Monné wrote:
>>>>> On Wed, Dec 04, 2019 at 06:05:11PM +0100, Jan Beulich wrote:
>>>>>> On 04.12.2019 17:18, Roger Pau Monné wrote:
>>>>>>> On Wed, Dec 04, 2019 at 05:11:42PM +0100, Jan Beulich wrote:
>>>>>>>> On 04.12.2019 16:12, Roger Pau Monne wrote:
>>>>>>>>> @@ -130,7 +143,7 @@ unsigned long pv_make_cr4(const struct vcpu *v)
>>>>>>>>>       */
>>>>>>>>>      if ( d->arch.pv.pcid )
>>>>>>>>>          cr4 |= X86_CR4_PCIDE;
>>>>>>>>> -    else if ( !d->arch.pv.xpti )
>>>>>>>>> +    else if ( !d->arch.pv.xpti && opt_global_pages )
>>>>>>>>>          cr4 |= X86_CR4_PGE;
>>>>>>>>
>>>>>>>> I'm sorry for noticing this only now, but what about XEN_MINIMAL_CR4,
>>>>>>>> which includes X86_CR4_PGE?
>>>>>>>
>>>>>>> I've tried removing PGE from XEN_MINIMAL_CR4 but it made no noticeable
>>>>>>> performance difference, so I left it as-is.
>>>>>>
>>>>>> My concern isn't about performance, but correctness. I admit I
>>>>>> forgot for a moment that we now always write CR4 (unless the
>>>>>> cached value matches the intended new one). Yet
>>>>>> mmu_cr4_features (starting out as XEN_MINIMAL_CR4) is still of
>>>>>> concern.
>>>>>>
>>>>>> I think this at least requires extending the description to
>>>>>> discuss the correctness.
>>>>>
>>>>> Would you be fine with adding the following at the end of the commit
>>>>> message.
>>>>>
>>>>> "Note that XEN_MINIMAL_CR4 is not modified, and thus global pages are
>>>>> left enabled for the hypervisor. This is not an issue because the code
>>>>> to switch the control registers (cr3 and cr4) already takes into
>>>>> account such situation and performs the necessary flushes. The same
>>>>> already happens when using XPTI or PCIDE, as the guest cr4 doesn't
>>>>> have global pages enabled in that case either."
>>>>
>>>> Yes, this is good for XEN_MINIMAL_CR4. But I think mmu_cr4_features
>>>> needs discussing (or at least mentioning, if the same arguments
>>>> apply) as well.
>>>
>>> The same applies to mmu_cr4_features, it's fine for the hypervisor to
>>> use a different set of cr4 features (especially PGE) than PV guests:
>>> this is already the case when using XPTI or PCIDE when Xen cr4 will
>>> have PGE and guests cr4 won't, and switch_cr3_cr4 DTRT.
>>>
>>> So I would s/XEN_MINIMAL_CR4/XEN_MINIMAL_CR4 and mmu_cr4_features/ in
>>> the above proposed paragraph.
>>
>> I'm afraid it's more complicated, up to and including you making a
>> possible pre-existing bug worse: The S3 resume path loads CR4 from
>> mmu_cr4_features, but doesn't update the in-memory cache of the
>> register.
> 
> All domains are paused and the scheduler is disabled when doing S3
> suspend/resume, and the actual suspend and resume code is run by a
> tasklet which is executed in the idle vCPU context (since all domains
> are paused), and hence the write of CR4 with mmu_cr4_features is fine
> as entering guest context after resume is going to involve a call to
> switch_cr3_cr4 in order to switch out of the idle vCPU.

And switch_cr3_cr4() has

    if ( old_cr4 != cr4 )
        write_cr4(cr4);

with old_cr4 read from the cache.

> It might be clearer to just save cr4 in do_suspend_lowlevel like it's
> done with the rest of the control registers.

Not just more clear, but also more reliable.

Jan
Roger Pau Monné Dec. 9, 2019, 3:52 p.m. UTC | #10
On Mon, Dec 09, 2019 at 04:39:58PM +0100, Jan Beulich wrote:
> On 09.12.2019 16:36, Roger Pau Monné wrote:
> > On Mon, Dec 09, 2019 at 04:04:51PM +0100, Jan Beulich wrote:
> >> On 09.12.2019 15:46, Roger Pau Monné wrote:
> >>> On Mon, Dec 09, 2019 at 03:21:28PM +0100, Jan Beulich wrote:
> >>>> On 09.12.2019 11:20, Roger Pau Monné wrote:
> >>>>> On Wed, Dec 04, 2019 at 06:05:11PM +0100, Jan Beulich wrote:
> >>>>>> On 04.12.2019 17:18, Roger Pau Monné wrote:
> >>>>>>> On Wed, Dec 04, 2019 at 05:11:42PM +0100, Jan Beulich wrote:
> >>>>>>>> On 04.12.2019 16:12, Roger Pau Monne wrote:
> >>>>>>>>> @@ -130,7 +143,7 @@ unsigned long pv_make_cr4(const struct vcpu *v)
> >>>>>>>>>       */
> >>>>>>>>>      if ( d->arch.pv.pcid )
> >>>>>>>>>          cr4 |= X86_CR4_PCIDE;
> >>>>>>>>> -    else if ( !d->arch.pv.xpti )
> >>>>>>>>> +    else if ( !d->arch.pv.xpti && opt_global_pages )
> >>>>>>>>>          cr4 |= X86_CR4_PGE;
> >>>>>>>>
> >>>>>>>> I'm sorry for noticing this only now, but what about XEN_MINIMAL_CR4,
> >>>>>>>> which includes X86_CR4_PGE?
> >>>>>>>
> >>>>>>> I've tried removing PGE from XEN_MINIMAL_CR4 but it made no noticeable
> >>>>>>> performance difference, so I left it as-is.
> >>>>>>
> >>>>>> My concern isn't about performance, but correctness. I admit I
> >>>>>> forgot for a moment that we now always write CR4 (unless the
> >>>>>> cached value matches the intended new one). Yet
> >>>>>> mmu_cr4_features (starting out as XEN_MINIMAL_CR4) is still of
> >>>>>> concern.
> >>>>>>
> >>>>>> I think this at least requires extending the description to
> >>>>>> discuss the correctness.
> >>>>>
> >>>>> Would you be fine with adding the following at the end of the commit
> >>>>> message.
> >>>>>
> >>>>> "Note that XEN_MINIMAL_CR4 is not modified, and thus global pages are
> >>>>> left enabled for the hypervisor. This is not an issue because the code
> >>>>> to switch the control registers (cr3 and cr4) already takes into
> >>>>> account such situation and performs the necessary flushes. The same
> >>>>> already happens when using XPTI or PCIDE, as the guest cr4 doesn't
> >>>>> have global pages enabled in that case either."
> >>>>
> >>>> Yes, this is good for XEN_MINIMAL_CR4. But I think mmu_cr4_features
> >>>> needs discussing (or at least mentioning, if the same arguments
> >>>> apply) as well.
> >>>
> >>> The same applies to mmu_cr4_features, it's fine for the hypervisor to
> >>> use a different set of cr4 features (especially PGE) than PV guests:
> >>> this is already the case when using XPTI or PCIDE when Xen cr4 will
> >>> have PGE and guests cr4 won't, and switch_cr3_cr4 DTRT.
> >>>
> >>> So I would s/XEN_MINIMAL_CR4/XEN_MINIMAL_CR4 and mmu_cr4_features/ in
> >>> the above proposed paragraph.
> >>
> >> I'm afraid it's more complicated, up to and including you making a
> >> possible pre-existing bug worse: The S3 resume path loads CR4 from
> >> mmu_cr4_features, but doesn't update the in-memory cache of the
> >> register.
> > 
> > All domains are paused and the scheduler is disabled when doing S3
> > suspend/resume, and the actual suspend and resume code is run by a
> > tasklet which is executed in the idle vCPU context (since all domains
> > are paused), and hence the write of CR4 with mmu_cr4_features is fine
> > as entering guest context after resume is going to involve a call to
> > switch_cr3_cr4 in order to switch out of the idle vCPU.
> 
> And switch_cr3_cr4() has
> 
>     if ( old_cr4 != cr4 )
>         write_cr4(cr4);
> 
> with old_cr4 read from the cache.

That read from the cache is fine. The idle vCPU cr4 is
mmu_cr4_features (see write_ptbase), and hence the write done in
__ret_point should match what's in the cache.

> > It might be clearer to just save cr4 in do_suspend_lowlevel like it's
> > done with the rest of the control registers.
> 
> Not just more clear, but also more reliable.

Let me prepare a patch to improve this, but I think the current patch
at hand is correct and shouldn't be blocked by this suspend/resume
improvement.

Thanks, Roger.
Jan Beulich Dec. 9, 2019, 4:08 p.m. UTC | #11
On 09.12.2019 16:52, Roger Pau Monné wrote:
> On Mon, Dec 09, 2019 at 04:39:58PM +0100, Jan Beulich wrote:
>> On 09.12.2019 16:36, Roger Pau Monné wrote:
>>> On Mon, Dec 09, 2019 at 04:04:51PM +0100, Jan Beulich wrote:
>>>> On 09.12.2019 15:46, Roger Pau Monné wrote:
>>>>> On Mon, Dec 09, 2019 at 03:21:28PM +0100, Jan Beulich wrote:
>>>>>> On 09.12.2019 11:20, Roger Pau Monné wrote:
>>>>>>> On Wed, Dec 04, 2019 at 06:05:11PM +0100, Jan Beulich wrote:
>>>>>>>> On 04.12.2019 17:18, Roger Pau Monné wrote:
>>>>>>>>> On Wed, Dec 04, 2019 at 05:11:42PM +0100, Jan Beulich wrote:
>>>>>>>>>> On 04.12.2019 16:12, Roger Pau Monne wrote:
>>>>>>>>>>> @@ -130,7 +143,7 @@ unsigned long pv_make_cr4(const struct vcpu *v)
>>>>>>>>>>>       */
>>>>>>>>>>>      if ( d->arch.pv.pcid )
>>>>>>>>>>>          cr4 |= X86_CR4_PCIDE;
>>>>>>>>>>> -    else if ( !d->arch.pv.xpti )
>>>>>>>>>>> +    else if ( !d->arch.pv.xpti && opt_global_pages )
>>>>>>>>>>>          cr4 |= X86_CR4_PGE;
>>>>>>>>>>
>>>>>>>>>> I'm sorry for noticing this only now, but what about XEN_MINIMAL_CR4,
>>>>>>>>>> which includes X86_CR4_PGE?
>>>>>>>>>
>>>>>>>>> I've tried removing PGE from XEN_MINIMAL_CR4 but it made no noticeable
>>>>>>>>> performance difference, so I left it as-is.
>>>>>>>>
>>>>>>>> My concern isn't about performance, but correctness. I admit I
>>>>>>>> forgot for a moment that we now always write CR4 (unless the
>>>>>>>> cached value matches the intended new one). Yet
>>>>>>>> mmu_cr4_features (starting out as XEN_MINIMAL_CR4) is still of
>>>>>>>> concern.
>>>>>>>>
>>>>>>>> I think this at least requires extending the description to
>>>>>>>> discuss the correctness.
>>>>>>>
>>>>>>> Would you be fine with adding the following at the end of the commit
>>>>>>> message.
>>>>>>>
>>>>>>> "Note that XEN_MINIMAL_CR4 is not modified, and thus global pages are
>>>>>>> left enabled for the hypervisor. This is not an issue because the code
>>>>>>> to switch the control registers (cr3 and cr4) already takes into
>>>>>>> account such situation and performs the necessary flushes. The same
>>>>>>> already happens when using XPTI or PCIDE, as the guest cr4 doesn't
>>>>>>> have global pages enabled in that case either."
>>>>>>
>>>>>> Yes, this is good for XEN_MINIMAL_CR4. But I think mmu_cr4_features
>>>>>> needs discussing (or at least mentioning, if the same arguments
>>>>>> apply) as well.
>>>>>
>>>>> The same applies to mmu_cr4_features, it's fine for the hypervisor to
>>>>> use a different set of cr4 features (especially PGE) than PV guests:
>>>>> this is already the case when using XPTI or PCIDE when Xen cr4 will
>>>>> have PGE and guests cr4 won't, and switch_cr3_cr4 DTRT.
>>>>>
>>>>> So I would s/XEN_MINIMAL_CR4/XEN_MINIMAL_CR4 and mmu_cr4_features/ in
>>>>> the above proposed paragraph.
>>>>
>>>> I'm afraid it's more complicated, up to and including you making a
>>>> possible pre-existing bug worse: The S3 resume path loads CR4 from
>>>> mmu_cr4_features, but doesn't update the in-memory cache of the
>>>> register.
>>>
>>> All domains are paused and the scheduler is disabled when doing S3
>>> suspend/resume, and the actual suspend and resume code is run by a
>>> tasklet which is executed in the idle vCPU context (since all domains
>>> are paused), and hence the write of CR4 with mmu_cr4_features is fine
>>> as entering guest context after resume is going to involve a call to
>>> switch_cr3_cr4 in order to switch out of the idle vCPU.
>>
>> And switch_cr3_cr4() has
>>
>>     if ( old_cr4 != cr4 )
>>         write_cr4(cr4);
>>
>> with old_cr4 read from the cache.
> 
> That read from the cache is fine. The idle vCPU cr4 is
> mmu_cr4_features (see write_ptbase), and hence the write done in
> __ret_point should match what's in the cache.

Ah yes.

>>> It might be clearer to just save cr4 in do_suspend_lowlevel like it's
>>> done with the rest of the control registers.
>>
>> Not just more clear, but also more reliable.
> 
> Let me prepare a patch to improve this, but I think the current patch
> at hand is correct and shouldn't be blocked by this suspend/resume
> improvement.

Fine with me, but please re-post with the enhanced description.

Jan
diff mbox series

Patch

diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
index d9495ef6b9..7be30f2766 100644
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -1087,6 +1087,19 @@  value settable via Xen tools.
 
 Dom0 is using this value for sizing its maptrack table.
 
+### global-pages (x86)
+> `= <boolean>`
+
+> Default: `true` unless running virtualized on AMD hardware
+
+Set whether the PGE bit in CR4 will be enabled for PV guests. This controls the
+usage of global pages, and thus the need to perform tlb flushes by writing to
+CR4.
+
+Note it's disabled by default when running virtualized on AMD hardware since
+AMD SVM doesn't support selective trapping of CR4, so global pages are not
+enabled in order to reduce the overhead of tlb flushes.
+
 ### guest_loglvl
 > `= <level>[/<rate-limited level>]` where level is `none | error | warning | info | debug | all`
 
diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
index 4b6f48dea2..8ff733f56b 100644
--- a/xen/arch/x86/pv/domain.c
+++ b/xen/arch/x86/pv/domain.c
@@ -118,6 +118,19 @@  unsigned long pv_fixup_guest_cr4(const struct vcpu *v, unsigned long cr4)
             (mmu_cr4_features & PV_CR4_GUEST_VISIBLE_MASK));
 }
 
+static int8_t __read_mostly opt_global_pages = -1;
+boolean_runtime_param("global-pages", opt_global_pages);
+
+static int __init pge_init(void)
+{
+    if ( opt_global_pages == -1 )
+        opt_global_pages = !cpu_has_hypervisor ||
+                           boot_cpu_data.x86_vendor != X86_VENDOR_AMD;
+
+    return 0;
+}
+__initcall(pge_init);
+
 unsigned long pv_make_cr4(const struct vcpu *v)
 {
     const struct domain *d = v->domain;
@@ -130,7 +143,7 @@  unsigned long pv_make_cr4(const struct vcpu *v)
      */
     if ( d->arch.pv.pcid )
         cr4 |= X86_CR4_PCIDE;
-    else if ( !d->arch.pv.xpti )
+    else if ( !d->arch.pv.xpti && opt_global_pages )
         cr4 |= X86_CR4_PGE;
 
     /*