x86/intel: Expose MSR_ARCH_CAPS to dom0

Message ID	20200827193713.4962-1-andrew.cooper3@citrix.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=HZRB=CF=lists.xenproject.org=xen-devel-bounces@kernel.org> DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org C9E552087E IronPort-SDR: tmn1WcmJDk1niVoHYQb1zeAMt7Gvkv6cO8tMOCCgF6RKcBNKiRWREKewE65qdCO5MDOPoH7g2a lpOWe9n1gbTmwSu1s+RQDnvIKiGv9jXaBv+SD5NuTdk0ExwhPo6KjUztPO7+f6ClTBsIOHrD2y EZEN+2VP7vE4YG+2BpO7acQutGiVSebvtwgy6cd9tDj0lU6igyabWlfm8VhhvlplKe3oYIvQmZ t4ZZjkopgv1g8wwRLJ2iQBrrdIgVuzc0TW7ifI4HZ+HqD4zY77tedutj7pNWaExbrK0CEPzXUs ies= From: Andrew Cooper <andrew.cooper3@citrix.com> To: Xen-devel <xen-devel@lists.xenproject.org> CC: Andrew Cooper <amc96@cam.ac.uk>, Andrew Cooper <andrew.cooper3@citrix.com>, Jan Beulich <JBeulich@suse.com>, =?utf-8?q?Roger_Pau_Monn=C3=A9?= <roger.pau@citrix.com>, Wei Liu <wl@xen.org> Subject: [PATCH] x86/intel: Expose MSR_ARCH_CAPS to dom0 Date: Thu, 27 Aug 2020 20:37:13 +0100 Message-ID: <20200827193713.4962-1-andrew.cooper3@citrix.com> MIME-Version: 1.0 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: 8bit Precedence: list Errors-To: xen-devel-bounces@lists.xenproject.org Sender: "Xen-devel" <xen-devel-bounces@lists.xenproject.org>
Series	x86/intel: Expose MSR_ARCH_CAPS to dom0 \| expand x86/intel: Expose MSR_ARCH_CAPS to dom0

Andrew Cooper Aug. 27, 2020, 7:37 p.m. UTC

The overhead of (the lack of) MDS_NO alone has been measured at 30% on some
workloads.  While we're not in a position yet to offer MSR_ARCH_CAPS generally
to guests, dom0 doesn't migrate, so we can pass a subset of hardware values
straight through.

This will cause PVH dom0's not to use KPTI by default, and all dom0's not to
use VERW flushing by default, and to use eIBRS in preference to retpoline on
recent Intel CPUs.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>
---
 xen/arch/x86/cpuid.c |  8 ++++++++
 xen/arch/x86/msr.c   | 16 ++++++++++++++++
 2 files changed, 24 insertions(+)

Jan Beulich Aug. 28, 2020, 8:41 a.m. UTC | #1

On 27.08.2020 21:37, Andrew Cooper wrote:
> The overhead of (the lack of) MDS_NO alone has been measured at 30% on some
> workloads.  While we're not in a position yet to offer MSR_ARCH_CAPS generally
> to guests, dom0 doesn't migrate, so we can pass a subset of hardware values
> straight through.
> 
> This will cause PVH dom0's not to use KPTI by default, and all dom0's not to
> use VERW flushing by default,

To avoid VERW, shouldn't you also expose SKIP_L1DFL?

Jan

Andrew Cooper Aug. 28, 2020, 10:23 a.m. UTC | #2

On 28/08/2020 09:41, Jan Beulich wrote:
> On 27.08.2020 21:37, Andrew Cooper wrote:
>> The overhead of (the lack of) MDS_NO alone has been measured at 30% on some
>> workloads.  While we're not in a position yet to offer MSR_ARCH_CAPS generally
>> to guests, dom0 doesn't migrate, so we can pass a subset of hardware values
>> straight through.
>>
>> This will cause PVH dom0's not to use KPTI by default, and all dom0's not to
>> use VERW flushing by default,
> To avoid VERW, shouldn't you also expose SKIP_L1DFL?

SKIP_L1DFL is a software-only bit, specifically for nested virt.

It is for Xen to tell an L1 hypervisor "you don't need to flush on
vmentry because I'm taking care of it".

Longterm, it wants to be nestedhvm_enabled() && opt_l1d_flush, but
getting this working cleanly involves moving nested-virt to being a
domain creation flat so it isn't 0 during early domain construction.

~Andrew

Jan Beulich Aug. 28, 2020, 3:42 p.m. UTC | #3

On 28.08.2020 12:23, Andrew Cooper wrote:
> On 28/08/2020 09:41, Jan Beulich wrote:
>> On 27.08.2020 21:37, Andrew Cooper wrote:
>>> The overhead of (the lack of) MDS_NO alone has been measured at 30% on some
>>> workloads.  While we're not in a position yet to offer MSR_ARCH_CAPS generally
>>> to guests, dom0 doesn't migrate, so we can pass a subset of hardware values
>>> straight through.
>>>
>>> This will cause PVH dom0's not to use KPTI by default, and all dom0's not to
>>> use VERW flushing by default,
>> To avoid VERW, shouldn't you also expose SKIP_L1DFL?
> 
> SKIP_L1DFL is a software-only bit, specifically for nested virt.
> 
> It is for Xen to tell an L1 hypervisor "you don't need to flush on
> vmentry because I'm taking care of it".

Or for a hypervisor underneath us to tell us, which we could then
hand on to Dom0?

Jan

Andrew Cooper Aug. 28, 2020, 4:09 p.m. UTC | #4

On 28/08/2020 16:42, Jan Beulich wrote:
> On 28.08.2020 12:23, Andrew Cooper wrote:
>> On 28/08/2020 09:41, Jan Beulich wrote:
>>> On 27.08.2020 21:37, Andrew Cooper wrote:
>>>> The overhead of (the lack of) MDS_NO alone has been measured at 30% on some
>>>> workloads.  While we're not in a position yet to offer MSR_ARCH_CAPS generally
>>>> to guests, dom0 doesn't migrate, so we can pass a subset of hardware values
>>>> straight through.
>>>>
>>>> This will cause PVH dom0's not to use KPTI by default, and all dom0's not to
>>>> use VERW flushing by default,
>>> To avoid VERW, shouldn't you also expose SKIP_L1DFL?
>> SKIP_L1DFL is a software-only bit, specifically for nested virt.
>>
>> It is for Xen to tell an L1 hypervisor "you don't need to flush on
>> vmentry because I'm taking care of it".
> Or for a hypervisor underneath us to tell us, which we could then
> hand on to Dom0?

For dom0 to do what with?

PV guests can't use the VMLAUNCH/VMRESUME instruction at all, and it is
not currently possible to configure nested virt for a PVH dom0 to use.

In a future world where nested virt is working and usable by a PVH dom0,
then yes - it does need handling in some form or another, but we're a
long way away from having suitable infrastructure to do that correctly.

~Andrew

Jan Beulich Aug. 28, 2020, 4:17 p.m. UTC | #5

On 28.08.2020 18:09, Andrew Cooper wrote:
> On 28/08/2020 16:42, Jan Beulich wrote:
>> On 28.08.2020 12:23, Andrew Cooper wrote:
>>> On 28/08/2020 09:41, Jan Beulich wrote:
>>>> On 27.08.2020 21:37, Andrew Cooper wrote:
>>>>> The overhead of (the lack of) MDS_NO alone has been measured at 30% on some
>>>>> workloads.  While we're not in a position yet to offer MSR_ARCH_CAPS generally
>>>>> to guests, dom0 doesn't migrate, so we can pass a subset of hardware values
>>>>> straight through.
>>>>>
>>>>> This will cause PVH dom0's not to use KPTI by default, and all dom0's not to
>>>>> use VERW flushing by default,
>>>> To avoid VERW, shouldn't you also expose SKIP_L1DFL?
>>> SKIP_L1DFL is a software-only bit, specifically for nested virt.
>>>
>>> It is for Xen to tell an L1 hypervisor "you don't need to flush on
>>> vmentry because I'm taking care of it".
>> Or for a hypervisor underneath us to tell us, which we could then
>> hand on to Dom0?
> 
> For dom0 to do what with?
> 
> PV guests can't use the VMLAUNCH/VMRESUME instruction at all, and it is
> not currently possible to configure nested virt for a PVH dom0 to use.

Aren't they also using this on the exit-to-user-mode path, like we
do on exit-to-PV? And in certain cases when idle?

Jan

Andrew Cooper Aug. 28, 2020, 4:38 p.m. UTC | #6

On 28/08/2020 17:17, Jan Beulich wrote:
> On 28.08.2020 18:09, Andrew Cooper wrote:
>> On 28/08/2020 16:42, Jan Beulich wrote:
>>> On 28.08.2020 12:23, Andrew Cooper wrote:
>>>> On 28/08/2020 09:41, Jan Beulich wrote:
>>>>> On 27.08.2020 21:37, Andrew Cooper wrote:
>>>>>> The overhead of (the lack of) MDS_NO alone has been measured at 30% on some
>>>>>> workloads.  While we're not in a position yet to offer MSR_ARCH_CAPS generally
>>>>>> to guests, dom0 doesn't migrate, so we can pass a subset of hardware values
>>>>>> straight through.
>>>>>>
>>>>>> This will cause PVH dom0's not to use KPTI by default, and all dom0's not to
>>>>>> use VERW flushing by default,
>>>>> To avoid VERW, shouldn't you also expose SKIP_L1DFL?
>>>> SKIP_L1DFL is a software-only bit, specifically for nested virt.
>>>>
>>>> It is for Xen to tell an L1 hypervisor "you don't need to flush on
>>>> vmentry because I'm taking care of it".
>>> Or for a hypervisor underneath us to tell us, which we could then
>>> hand on to Dom0?
>> For dom0 to do what with?
>>
>> PV guests can't use the VMLAUNCH/VMRESUME instruction at all, and it is
>> not currently possible to configure nested virt for a PVH dom0 to use.
> Aren't they also using this on the exit-to-user-mode path, like we
> do on exit-to-PV? And in certain cases when idle?

MSR_FLUSH_CMD is used used for VMEntry.  This flushes the L1D cache, and
was to combat L1TF.  Native systems don't flush the L1D at all, and
invert PTEs instead as a *far* lower overhead mitigation.

Then MDS came along.  VERW is used to flush the uarch buffers.  This
needs doing in all return-to-guest contexts.

As VMEntry needs both, MSR_FLUSH_CMD's behaviour was extended to cover
both the L1D cache and uarch buffers, so software didn't have to arrange
for both.

Therefore, the overall mitigations are VERW on exit-to-PV, and
MSR_FLUSH_CMD on exit-to-HVM.

There is no current need for native setups to use MSR_FLUSH_CMD.  The
only reason we expose the MSR to HVM guests is for nested-virt.

~Andrew

Jan Beulich Aug. 31, 2020, 6:59 a.m. UTC | #7

On 28.08.2020 18:38, Andrew Cooper wrote:
> On 28/08/2020 17:17, Jan Beulich wrote:
>> On 28.08.2020 18:09, Andrew Cooper wrote:
>>> On 28/08/2020 16:42, Jan Beulich wrote:
>>>> On 28.08.2020 12:23, Andrew Cooper wrote:
>>>>> On 28/08/2020 09:41, Jan Beulich wrote:
>>>>>> On 27.08.2020 21:37, Andrew Cooper wrote:
>>>>>>> The overhead of (the lack of) MDS_NO alone has been measured at 30% on some
>>>>>>> workloads.  While we're not in a position yet to offer MSR_ARCH_CAPS generally
>>>>>>> to guests, dom0 doesn't migrate, so we can pass a subset of hardware values
>>>>>>> straight through.
>>>>>>>
>>>>>>> This will cause PVH dom0's not to use KPTI by default, and all dom0's not to
>>>>>>> use VERW flushing by default,
>>>>>> To avoid VERW, shouldn't you also expose SKIP_L1DFL?
>>>>> SKIP_L1DFL is a software-only bit, specifically for nested virt.
>>>>>
>>>>> It is for Xen to tell an L1 hypervisor "you don't need to flush on
>>>>> vmentry because I'm taking care of it".
>>>> Or for a hypervisor underneath us to tell us, which we could then
>>>> hand on to Dom0?
>>> For dom0 to do what with?
>>>
>>> PV guests can't use the VMLAUNCH/VMRESUME instruction at all, and it is
>>> not currently possible to configure nested virt for a PVH dom0 to use.
>> Aren't they also using this on the exit-to-user-mode path, like we
>> do on exit-to-PV? And in certain cases when idle?
> 
> MSR_FLUSH_CMD is used used for VMEntry.  This flushes the L1D cache, and
> was to combat L1TF.  Native systems don't flush the L1D at all, and
> invert PTEs instead as a *far* lower overhead mitigation.
> 
> Then MDS came along.  VERW is used to flush the uarch buffers.  This
> needs doing in all return-to-guest contexts.
> 
> As VMEntry needs both, MSR_FLUSH_CMD's behaviour was extended to cover
> both the L1D cache and uarch buffers, so software didn't have to arrange
> for both.
> 
> Therefore, the overall mitigations are VERW on exit-to-PV, and
> MSR_FLUSH_CMD on exit-to-HVM.
> 
> 
> There is no current need for native setups to use MSR_FLUSH_CMD.  The
> only reason we expose the MSR to HVM guests is for nested-virt.

But the question was about the use of VERW on exit-to-user paths in
a PV kernel, which I apparently wrongly understood SKIP_L1DFL also
indicates to be unnecessary. I'm sorry for the confusion. So
Reviewed-by: Jan Beulich <jbeulich@suse.com>

Jan

x86/intel: Expose MSR_ARCH_CAPS to dom0

Commit Message

Comments

Patch