x86/intel: Expose MSR_ARCH_CAPS to dom0

Message ID 20200827193713.4962-1-andrew.cooper3@citrix.com (mailing list archive)
State New, archived
Series x86/intel: Expose MSR_ARCH_CAPS to dom0

Commit Message

Andrew Cooper Aug. 27, 2020, 7:37 p.m. UTC
The overhead of (the lack of) MDS_NO alone has been measured at 30% on some
workloads.  While we're not in a position yet to offer MSR_ARCH_CAPS generally
to guests, dom0 doesn't migrate, so we can pass a subset of hardware values
straight through.

This will cause PVH dom0s not to use KPTI by default, and all dom0s not to
use VERW flushing by default, and to use eIBRS in preference to retpoline on
recent Intel CPUs.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>
---
 xen/arch/x86/cpuid.c |  8 ++++++++
 xen/arch/x86/msr.c   | 16 ++++++++++++++++
 2 files changed, 24 insertions(+)

Comments

Jan Beulich Aug. 28, 2020, 8:41 a.m. UTC | #1
On 27.08.2020 21:37, Andrew Cooper wrote:
> The overhead of (the lack of) MDS_NO alone has been measured at 30% on some
> workloads.  While we're not in a position yet to offer MSR_ARCH_CAPS generally
> to guests, dom0 doesn't migrate, so we can pass a subset of hardware values
> straight through.
> 
> This will cause PVH dom0s not to use KPTI by default, and all dom0s not to
> use VERW flushing by default,

To avoid VERW, shouldn't you also expose SKIP_L1DFL?

Jan
Andrew Cooper Aug. 28, 2020, 10:23 a.m. UTC | #2
On 28/08/2020 09:41, Jan Beulich wrote:
> On 27.08.2020 21:37, Andrew Cooper wrote:
>> The overhead of (the lack of) MDS_NO alone has been measured at 30% on some
>> workloads.  While we're not in a position yet to offer MSR_ARCH_CAPS generally
>> to guests, dom0 doesn't migrate, so we can pass a subset of hardware values
>> straight through.
>>
>> This will cause PVH dom0s not to use KPTI by default, and all dom0s not to
>> use VERW flushing by default,
> To avoid VERW, shouldn't you also expose SKIP_L1DFL?

SKIP_L1DFL is a software-only bit, specifically for nested virt.

It is for Xen to tell an L1 hypervisor "you don't need to flush on
vmentry because I'm taking care of it".

Longterm, it wants to be nestedhvm_enabled() && opt_l1d_flush, but
getting this working cleanly involves moving nested-virt to being a
domain creation flag so it isn't 0 during early domain construction.
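
A minimal sketch of what that eventual check might look like (hypothetical
only - nothing like this works today, for the reasons above):

    /* Hypothetical: expose SKIP_L1DFL only when the guest can see nested
     * virt, and Xen is already flushing the L1D on its behalf. */
    if ( nestedhvm_enabled(d) && opt_l1d_flush )
        mp->arch_caps.raw |= ARCH_CAPS_SKIP_L1DFL;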

~Andrew
Jan Beulich Aug. 28, 2020, 3:42 p.m. UTC | #3
On 28.08.2020 12:23, Andrew Cooper wrote:
> On 28/08/2020 09:41, Jan Beulich wrote:
>> On 27.08.2020 21:37, Andrew Cooper wrote:
>>> The overhead of (the lack of) MDS_NO alone has been measured at 30% on some
>>> workloads.  While we're not in a position yet to offer MSR_ARCH_CAPS generally
>>> to guests, dom0 doesn't migrate, so we can pass a subset of hardware values
>>> straight through.
>>>
>>> This will cause PVH dom0s not to use KPTI by default, and all dom0s not to
>>> use VERW flushing by default,
>> To avoid VERW, shouldn't you also expose SKIP_L1DFL?
> 
> SKIP_L1DFL is a software-only bit, specifically for nested virt.
> 
> It is for Xen to tell an L1 hypervisor "you don't need to flush on
> vmentry because I'm taking care of it".

Or for a hypervisor underneath us to tell us, which we could then
hand on to Dom0?

Jan
Andrew Cooper Aug. 28, 2020, 4:09 p.m. UTC | #4
On 28/08/2020 16:42, Jan Beulich wrote:
> On 28.08.2020 12:23, Andrew Cooper wrote:
>> On 28/08/2020 09:41, Jan Beulich wrote:
>>> On 27.08.2020 21:37, Andrew Cooper wrote:
>>>> The overhead of (the lack of) MDS_NO alone has been measured at 30% on some
>>>> workloads.  While we're not in a position yet to offer MSR_ARCH_CAPS generally
>>>> to guests, dom0 doesn't migrate, so we can pass a subset of hardware values
>>>> straight through.
>>>>
>>>> This will cause PVH dom0s not to use KPTI by default, and all dom0s not to
>>>> use VERW flushing by default,
>>> To avoid VERW, shouldn't you also expose SKIP_L1DFL?
>> SKIP_L1DFL is a software-only bit, specifically for nested virt.
>>
>> It is for Xen to tell an L1 hypervisor "you don't need to flush on
>> vmentry because I'm taking care of it".
> Or for a hypervisor underneath us to tell us, which we could then
> hand on to Dom0?

For dom0 to do what with?

PV guests can't use the VMLAUNCH/VMRESUME instructions at all, and it is
not currently possible to configure nested virt for a PVH dom0 to use.

In a future world where nested virt is working and usable by a PVH dom0,
then yes - it does need handling in some form or another, but we're a
long way away from having suitable infrastructure to do that correctly.

~Andrew
Jan Beulich Aug. 28, 2020, 4:17 p.m. UTC | #5
On 28.08.2020 18:09, Andrew Cooper wrote:
> On 28/08/2020 16:42, Jan Beulich wrote:
>> On 28.08.2020 12:23, Andrew Cooper wrote:
>>> On 28/08/2020 09:41, Jan Beulich wrote:
>>>> On 27.08.2020 21:37, Andrew Cooper wrote:
>>>>> The overhead of (the lack of) MDS_NO alone has been measured at 30% on some
>>>>> workloads.  While we're not in a position yet to offer MSR_ARCH_CAPS generally
>>>>> to guests, dom0 doesn't migrate, so we can pass a subset of hardware values
>>>>> straight through.
>>>>>
>>>>> This will cause PVH dom0s not to use KPTI by default, and all dom0s not to
>>>>> use VERW flushing by default,
>>>> To avoid VERW, shouldn't you also expose SKIP_L1DFL?
>>> SKIP_L1DFL is a software-only bit, specifically for nested virt.
>>>
>>> It is for Xen to tell an L1 hypervisor "you don't need to flush on
>>> vmentry because I'm taking care of it".
>> Or for a hypervisor underneath us to tell us, which we could then
>> hand on to Dom0?
> 
> For dom0 to do what with?
> 
> PV guests can't use the VMLAUNCH/VMRESUME instructions at all, and it is
> not currently possible to configure nested virt for a PVH dom0 to use.

Aren't they also using this on the exit-to-user-mode path, like we
do on exit-to-PV? And in certain cases when idle?

Jan
Andrew Cooper Aug. 28, 2020, 4:38 p.m. UTC | #6
On 28/08/2020 17:17, Jan Beulich wrote:
> On 28.08.2020 18:09, Andrew Cooper wrote:
>> On 28/08/2020 16:42, Jan Beulich wrote:
>>> On 28.08.2020 12:23, Andrew Cooper wrote:
>>>> On 28/08/2020 09:41, Jan Beulich wrote:
>>>>> On 27.08.2020 21:37, Andrew Cooper wrote:
>>>>>> The overhead of (the lack of) MDS_NO alone has been measured at 30% on some
>>>>>> workloads.  While we're not in a position yet to offer MSR_ARCH_CAPS generally
>>>>>> to guests, dom0 doesn't migrate, so we can pass a subset of hardware values
>>>>>> straight through.
>>>>>>
>>>>>> This will cause PVH dom0s not to use KPTI by default, and all dom0s not to
>>>>>> use VERW flushing by default,
>>>>> To avoid VERW, shouldn't you also expose SKIP_L1DFL?
>>>> SKIP_L1DFL is a software-only bit, specifically for nested virt.
>>>>
>>>> It is for Xen to tell an L1 hypervisor "you don't need to flush on
>>>> vmentry because I'm taking care of it".
>>> Or for a hypervisor underneath us to tell us, which we could then
>>> hand on to Dom0?
>> For dom0 to do what with?
>>
>> PV guests can't use the VMLAUNCH/VMRESUME instructions at all, and it is
>> not currently possible to configure nested virt for a PVH dom0 to use.
> Aren't they also using this on the exit-to-user-mode path, like we
> do on exit-to-PV? And in certain cases when idle?

MSR_FLUSH_CMD is used for VMEntry.  This flushes the L1D cache, and was
introduced to combat L1TF.  Native systems don't flush the L1D at all, and
invert PTEs instead as a *far* lower overhead mitigation.

Then MDS came along.  VERW is used to flush the uarch buffers.  This
needs doing in all return-to-guest contexts.

As VMEntry needs both, MSR_FLUSH_CMD's behaviour was extended to cover
both the L1D cache and uarch buffers, so software didn't have to arrange
for both.

Therefore, the overall mitigations are VERW on exit-to-PV, and
MSR_FLUSH_CMD on exit-to-HVM.
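
In code terms (an illustrative sketch; "sel" stands in for any valid data
segment selector):

    /* exit-to-PV: scrub the uarch buffers with VERW. */
    asm volatile ( "verw %0" :: "m" (sel) );

    /* exit-to-HVM (VMEntry): flush the L1D cache, which on MDS-affected
     * hardware scrubs the uarch buffers too. */
    wrmsrl(MSR_FLUSH_CMD, FLUSH_CMD_L1D);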


There is no current need for native setups to use MSR_FLUSH_CMD.  The
only reason we expose the MSR to HVM guests is for nested-virt.

~Andrew
Jan Beulich Aug. 31, 2020, 6:59 a.m. UTC | #7
On 28.08.2020 18:38, Andrew Cooper wrote:
> On 28/08/2020 17:17, Jan Beulich wrote:
>> On 28.08.2020 18:09, Andrew Cooper wrote:
>>> On 28/08/2020 16:42, Jan Beulich wrote:
>>>> On 28.08.2020 12:23, Andrew Cooper wrote:
>>>>> On 28/08/2020 09:41, Jan Beulich wrote:
>>>>>> On 27.08.2020 21:37, Andrew Cooper wrote:
>>>>>>> The overhead of (the lack of) MDS_NO alone has been measured at 30% on some
>>>>>>> workloads.  While we're not in a position yet to offer MSR_ARCH_CAPS generally
>>>>>>> to guests, dom0 doesn't migrate, so we can pass a subset of hardware values
>>>>>>> straight through.
>>>>>>>
>>>>>>> This will cause PVH dom0s not to use KPTI by default, and all dom0s not to
>>>>>>> use VERW flushing by default,
>>>>>> To avoid VERW, shouldn't you also expose SKIP_L1DFL?
>>>>> SKIP_L1DFL is a software-only bit, specifically for nested virt.
>>>>>
>>>>> It is for Xen to tell an L1 hypervisor "you don't need to flush on
>>>>> vmentry because I'm taking care of it".
>>>> Or for a hypervisor underneath us to tell us, which we could then
>>>> hand on to Dom0?
>>> For dom0 to do what with?
>>>
>>> PV guests can't use the VMLAUNCH/VMRESUME instructions at all, and it is
>>> not currently possible to configure nested virt for a PVH dom0 to use.
>> Aren't they also using this on the exit-to-user-mode path, like we
>> do on exit-to-PV? And in certain cases when idle?
> 
> MSR_FLUSH_CMD is used for VMEntry.  This flushes the L1D cache, and was
> introduced to combat L1TF.  Native systems don't flush the L1D at all, and
> invert PTEs instead as a *far* lower overhead mitigation.
> 
> Then MDS came along.  VERW is used to flush the uarch buffers.  This
> needs doing in all return-to-guest contexts.
> 
> As VMEntry needs both, MSR_FLUSH_CMD's behaviour was extended to cover
> both the L1D cache and uarch buffers, so software didn't have to arrange
> for both.
> 
> Therefore, the overall mitigations are VERW on exit-to-PV, and
> MSR_FLUSH_CMD on exit-to-HVM.
> 
> 
> There is no current need for native setups to use MSR_FLUSH_CMD.  The
> only reason we expose the MSR to HVM guests is for nested-virt.

But the question was about the use of VERW on exit-to-user paths in
a PV kernel, which I had apparently wrongly understood SKIP_L1DFL to
also mark as unnecessary. I'm sorry for the confusion. So
Reviewed-by: Jan Beulich <jbeulich@suse.com>

Jan

Patch

diff --git a/xen/arch/x86/cpuid.c b/xen/arch/x86/cpuid.c
index 63a03ef1e5..6c608cc00b 100644
--- a/xen/arch/x86/cpuid.c
+++ b/xen/arch/x86/cpuid.c
@@ -719,6 +719,14 @@ int init_domain_cpuid_policy(struct domain *d)
     if ( d->disable_migrate )
         p->extd.itsc = cpu_has_itsc;
 
+    /*
+     * Expose the "hardware speculation behaviour" bits of ARCH_CAPS to dom0,
+     * so dom0 can turn off workarounds as appropriate.  Temporary, until the
+     * domain policy logic gains a better understanding of MSRs.
+     */
+    if ( is_hardware_domain(d) && boot_cpu_has(X86_FEATURE_ARCH_CAPS) )
+        p->feat.arch_caps = true;
+
     d->arch.cpuid = p;
 
     recalculate_cpuid_policy(d);
diff --git a/xen/arch/x86/msr.c b/xen/arch/x86/msr.c
index c3862033eb..420a55f165 100644
--- a/xen/arch/x86/msr.c
+++ b/xen/arch/x86/msr.c
@@ -130,6 +130,22 @@ int init_domain_msr_policy(struct domain *d)
     if ( !opt_dom0_cpuid_faulting && is_control_domain(d) && is_pv_domain(d) )
         mp->platform_info.cpuid_faulting = false;
 
+    /*
+     * Expose the "hardware speculation behaviour" bits of ARCH_CAPS to dom0,
+     * so dom0 can turn off workarounds as appropriate.  Temporary, until the
+     * domain policy logic gains a better understanding of MSRs.
+     */
+    if ( is_hardware_domain(d) && boot_cpu_has(X86_FEATURE_ARCH_CAPS) )
+    {
+        uint64_t val;
+
+        rdmsrl(MSR_ARCH_CAPABILITIES, val);
+
+        mp->arch_caps.raw = val &
+            (ARCH_CAPS_RDCL_NO | ARCH_CAPS_IBRS_ALL | ARCH_CAPS_RSBA |
+             ARCH_CAPS_SSB_NO | ARCH_CAPS_MDS_NO | ARCH_CAPS_TAA_NO);
+    }
+
     d->arch.msr = mp;
 
     return 0;
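
For completeness, the consumer side (a hypothetical sketch in Xen style; a
real dom0 kernel has its own equivalents of these helpers) would act on the
exposed bits along these lines:

    uint64_t caps = 0;

    if ( boot_cpu_has(X86_FEATURE_ARCH_CAPS) )
        rdmsrl(MSR_ARCH_CAPABILITIES, caps);

    /* MDS_NO set => no uarch buffer flushing needed on return-to-user. */
    if ( !(caps & ARCH_CAPS_MDS_NO) )
        enable_verw_flushing();    /* hypothetical helper */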