x86/cpu: Add a VMX flag to enumerate 5-level EPT support to userspace

Message ID 20240110002340.485595-1-seanjc@google.com (mailing list archive)
State New, archived

Commit Message

Sean Christopherson Jan. 10, 2024, 12:23 a.m. UTC
Add a VMX flag in /proc/cpuinfo, ept_5level, so that userspace can query
whether or not the CPU supports 5-level EPT paging.  EPT capabilities are
enumerated via MSR, i.e. aren't accessible to userspace without help from
the kernel, and knowing whether or not 5-level EPT is supported is sadly
necessary for userspace to correctly configure KVM VMs.

When EPT is enabled, bits 51:49 of guest physical addresses are consumed
if and only if 5-level EPT is enabled.  For CPUs with MAXPHYADDR > 48, KVM
*can't* map all legal guest memory if 5-level EPT is unsupported, e.g.
creating a VM with RAM (or anything that gets stuffed into KVM's memslots)
above bit 48 will be completely broken.

Having KVM enumerate guest.MAXPHYADDR=48 in this scenario doesn't work
either, as architecturally guest accesses to illegal addresses generate
RSVD #PF, i.e. advertising guest.MAXPHYADDR < host.MAXPHYADDR when EPT is
enabled would also result in broken guests.  KVM does provide a knob,
allow_smaller_maxphyaddr, to let userspace opt-in to such setups, but
that support is firmly best-effort, i.e. not something KVM wants to force
upon userspace.

While it's decidedly odd for a CPU to support a 52-bit MAXPHYADDR but not
5-level EPT, the combination is architecturally legal and such CPUs do
exist (and can easily be "created" with nested virtualization).

Reported-by: Yi Lai <yi1.lai@intel.com>
Cc: Tao Su <tao1.su@linux.intel.com>
Cc: Xudong Hao <xudong.hao@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---

tip-tree folks, this is obviously not technically KVM code, but I'd like to
take this through the KVM tree so that we can use the information to fix
KVM selftests (hopefully this cycle).

 arch/x86/include/asm/vmxfeatures.h | 1 +
 arch/x86/kernel/cpu/feat_ctl.c     | 2 ++
 2 files changed, 3 insertions(+)


base-commit: 1c6d984f523f67ecfad1083bb04c55d91977bb15

Comments

Tao Su Jan. 10, 2024, 2:20 a.m. UTC | #1
On Tue, Jan 09, 2024 at 04:23:40PM -0800, Sean Christopherson wrote:
> Add a VMX flag in /proc/cpuinfo, ept_5level, so that userspace can query
> whether or not the CPU supports 5-level EPT paging.  EPT capabilities are
> enumerated via MSR, i.e. aren't accessible to userspace without help from
> the kernel, and knowing whether or not 5-level EPT is supported is sadly
> necessary for userspace to correctly configure KVM VMs.
> 
> When EPT is enabled, bits 51:49 of guest physical addresses are consumed

nit: s/49/48

Thanks,
Tao

> if and only if 5-level EPT is enabled.  For CPUs with MAXPHYADDR > 48, KVM
> *can't* map all legal guest memory if 5-level EPT is unsupported, e.g.
> creating a VM with RAM (or anything that gets stuffed into KVM's memslots)
> above bit 48 will be completely broken.
> 
> Having KVM enumerate guest.MAXPHYADDR=48 in this scenario doesn't work
> either, as architecturally guest accesses to illegal addresses generate
> RSVD #PF, i.e. advertising guest.MAXPHYADDR < host.MAXPHYADDR when EPT is
> enabled would also result in broken guests.  KVM does provide a knob,
> allow_smaller_maxphyaddr, to let userspace opt-in to such setups, but
> that support is firmly best-effort, i.e. not something KVM wants to force
> upon userspace.
> 
> While it's decidedly odd for a CPU to support a 52-bit MAXPHYADDR but not
> 5-level EPT, the combination is architecturally legal and such CPUs do
> exist (and can easily be "created" with nested virtualization).
> 
> Reported-by: Yi Lai <yi1.lai@intel.com>
> Cc: Tao Su <tao1.su@linux.intel.com>
> Cc: Xudong Hao <xudong.hao@intel.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
> 
> tip-tree folks, this is obviously not technically KVM code, but I'd like to
> take this through the KVM tree so that we can use the information to fix
> KVM selftests (hopefully this cycle).
> 
>  arch/x86/include/asm/vmxfeatures.h | 1 +
>  arch/x86/kernel/cpu/feat_ctl.c     | 2 ++
>  2 files changed, 3 insertions(+)
> 
> diff --git a/arch/x86/include/asm/vmxfeatures.h b/arch/x86/include/asm/vmxfeatures.h
> index c6a7eed03914..266daf5b5b84 100644
> --- a/arch/x86/include/asm/vmxfeatures.h
> +++ b/arch/x86/include/asm/vmxfeatures.h
> @@ -25,6 +25,7 @@
>  #define VMX_FEATURE_EPT_EXECUTE_ONLY	( 0*32+ 17) /* "ept_x_only" EPT entries can be execute only */
>  #define VMX_FEATURE_EPT_AD		( 0*32+ 18) /* EPT Accessed/Dirty bits */
>  #define VMX_FEATURE_EPT_1GB		( 0*32+ 19) /* 1GB EPT pages */
> +#define VMX_FEATURE_EPT_5LEVEL		( 0*32+ 20) /* 5-level EPT paging */
>  
>  /* Aggregated APIC features 24-27 */
>  #define VMX_FEATURE_FLEXPRIORITY	( 0*32+ 24) /* TPR shadow + virt APIC */
> diff --git a/arch/x86/kernel/cpu/feat_ctl.c b/arch/x86/kernel/cpu/feat_ctl.c
> index 03851240c3e3..1640ae76548f 100644
> --- a/arch/x86/kernel/cpu/feat_ctl.c
> +++ b/arch/x86/kernel/cpu/feat_ctl.c
> @@ -72,6 +72,8 @@ static void init_vmx_capabilities(struct cpuinfo_x86 *c)
>  		c->vmx_capability[MISC_FEATURES] |= VMX_F(EPT_AD);
>  	if (ept & VMX_EPT_1GB_PAGE_BIT)
>  		c->vmx_capability[MISC_FEATURES] |= VMX_F(EPT_1GB);
> +	if (ept & VMX_EPT_PAGE_WALK_5_BIT)
> +		c->vmx_capability[MISC_FEATURES] |= VMX_F(EPT_5LEVEL);
>  
>  	/* Synthetic APIC features that are aggregates of multiple features. */
>  	if ((c->vmx_capability[PRIMARY_CTLS] & VMX_F(VIRTUAL_TPR)) &&
> 
> base-commit: 1c6d984f523f67ecfad1083bb04c55d91977bb15
> -- 
> 2.43.0.472.g3155946c3a-goog
>
Chao Gao Jan. 10, 2024, 6:16 a.m. UTC | #2
On Tue, Jan 09, 2024 at 04:23:40PM -0800, Sean Christopherson wrote:
>Add a VMX flag in /proc/cpuinfo, ept_5level, so that userspace can query
>whether or not the CPU supports 5-level EPT paging.  EPT capabilities are
>enumerated via MSR, i.e. aren't accessible to userspace without help from
>the kernel, and knowing whether or not 5-level EPT is supported is sadly
>necessary for userspace to correctly configure KVM VMs.

This assumes procfs is enabled in Kconfig and userspace has permission to
access /proc/cpuinfo, but that isn't always true. So I think it is better to
advertise the max addressable GPA via KVM ioctls.

Sean Christopherson Jan. 10, 2024, 1:59 p.m. UTC | #3
On Wed, Jan 10, 2024, Tao Su wrote:
> On Tue, Jan 09, 2024 at 04:23:40PM -0800, Sean Christopherson wrote:
> > Add a VMX flag in /proc/cpuinfo, ept_5level, so that userspace can query
> > whether or not the CPU supports 5-level EPT paging.  EPT capabilities are
> > enumerated via MSR, i.e. aren't accessible to userspace without help from
> > the kernel, and knowing whether or not 5-level EPT is supported is sadly
> > necessary for userspace to correctly configure KVM VMs.
> > 
> > When EPT is enabled, bits 51:49 of guest physical addresses are consumed
> 
> nit: s/49/48

Argh, you even pointed that out before too.  Thanks!
Sean Christopherson Jan. 10, 2024, 4:26 p.m. UTC | #4
On Wed, Jan 10, 2024, Chao Gao wrote:
> On Tue, Jan 09, 2024 at 04:23:40PM -0800, Sean Christopherson wrote:
> >Add a VMX flag in /proc/cpuinfo, ept_5level, so that userspace can query
> >whether or not the CPU supports 5-level EPT paging.  EPT capabilities are
> >enumerated via MSR, i.e. aren't accessible to userspace without help from
> >the kernel, and knowing whether or not 5-level EPT is supported is sadly
> >necessary for userspace to correctly configure KVM VMs.
> 
> This assumes procfs is enabled in Kconfig and userspace has permission to
> access /proc/cpuinfo. But it isn't always true. So, I think it is better to
> advertise max addressable GPA via KVM ioctls.

Hrm, so the help for PROC_FS says:

  Several programs depend on this, so everyone should say Y here.

Given that this is working around something that is borderline an erratum, I'm
inclined to say that userspace shouldn't simply assume the worst if /proc isn't
available.  Practically speaking, I don't think a "real" VM is likely to be
affected; AFAIK, there's no reason for QEMU or any other VMM to _need_ to expose
a memslot at GPA[51:48] unless the VM really has however much memory that is
(hundreds of terabytes?).  And if someone is trying to run such a massive VM on
such a goofy CPU...

I don't think it's unreasonable for KVM selftests to require access to
/proc/cpuinfo.  Or actually, they can probably do the same thing and self-limit
to 48-bit addresses if /proc/cpuinfo isn't available.
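
As a rough illustration of that fallback, here is a minimal userspace sketch.
The helper names (cpuinfo_has_vmx_flag, usable_gpa_bits) are hypothetical and
not taken from KVM selftests; the sketch assumes the flag shows up on the
"vmx flags" line of /proc/cpuinfo, and it treats a missing file and a missing
flag the same way, i.e. it conservatively caps at 48 bits.

```c
#include <stdio.h>
#include <string.h>

/* Return 1 if `flag` appears as a whole word on a "vmx flags" line, else 0. */
static int cpuinfo_has_vmx_flag(FILE *cpuinfo, const char *flag)
{
	char line[4096];

	while (fgets(line, sizeof(line), cpuinfo)) {
		char *colon, *tok;

		if (strncmp(line, "vmx flags", 9) != 0)
			continue;

		colon = strchr(line, ':');
		if (!colon)
			continue;

		/* Flags are whitespace separated after the colon. */
		for (tok = strtok(colon + 1, " \t\n"); tok;
		     tok = strtok(NULL, " \t\n")) {
			if (strcmp(tok, flag) == 0)
				return 1;
		}
	}
	return 0;
}

/*
 * Pick how many GPA bits a test should use.  Passing cpuinfo == NULL models
 * "/proc/cpuinfo could not be opened"; in that case, or if ept_5level is
 * absent, self-limit to 48 bits (or fewer if MAXPHYADDR is smaller).
 */
static int usable_gpa_bits(FILE *cpuinfo, int maxphyaddr)
{
	if (cpuinfo && cpuinfo_has_vmx_flag(cpuinfo, "ept_5level"))
		return maxphyaddr;

	return maxphyaddr < 48 ? maxphyaddr : 48;
}
```

A caller would pass the result of fopen("/proc/cpuinfo", "r") directly, since
a NULL return from fopen() maps onto the conservative path.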

I'm not totally opposed to adding a more programmatic way for userspace to query
5-level EPT support, it just seems unnecessary.  E.g. unlike CPUID, userspace
can't directly influence whether or not KVM uses 5-level EPT.  Even in hindsight,
I'm not entirely sure KVM should expose such a knob, as it raises questions around
interactions between guest.MAXPHYADDR and memslots that I would rather avoid.

And even if we do add such uAPI, enumerating 5-level EPT in /proc/cpuinfo is
definitely worthwhile, the only thing that would need to be tweaked is the
justification in the changelog.

One thing we can do irrespective of feature enumeration is have kvm_mmu_page_fault()
exit to userspace with an explicit error if the guest faults on a GPA that KVM
knows it can't map, i.e. exit with KVM_EXIT_INTERNAL_ERROR or maybe even
KVM_EXIT_MEMORY_FAULT instead of looping indefinitely.
Tao Su Jan. 11, 2024, 2:52 a.m. UTC | #5
On Wed, Jan 10, 2024 at 08:26:25AM -0800, Sean Christopherson wrote:
> On Wed, Jan 10, 2024, Chao Gao wrote:
> > On Tue, Jan 09, 2024 at 04:23:40PM -0800, Sean Christopherson wrote:
> > >Add a VMX flag in /proc/cpuinfo, ept_5level, so that userspace can query
> > >whether or not the CPU supports 5-level EPT paging.  EPT capabilities are
> > >enumerated via MSR, i.e. aren't accessible to userspace without help from
> > >the kernel, and knowing whether or not 5-level EPT is supported is sadly
> > >necessary for userspace to correctly configure KVM VMs.
> > 
> > This assumes procfs is enabled in Kconfig and userspace has permission to
> > access /proc/cpuinfo. But it isn't always true. So, I think it is better to
> > advertise max addressable GPA via KVM ioctls.
> 
> Hrm, so the help for PROC_FS says:
> 
>   Several programs depend on this, so everyone should say Y here.
> 
> Given that this is working around something that is borderline an erratum, I'm
> inclined to say that userspace shouldn't simply assume the worst if /proc isn't
> available.  Practically speaking, I don't think a "real" VM is likely to be
> affected; AFAIK, there's no reason for QEMU or any other VMM to _need_ to expose
> a memslot at GPA[51:48] unless the VM really has however much memory that is
> (hundreds of terabytes?).  And if someone is trying to run such a massive VM on
> such a goofy CPU...

Assigning that much RAM to a guest is unusual, but device passthrough can also trigger
this issue, and we have hit it in practice: the memslot allocated for a 64-bit BAR can
have bits[51:48] set. The BIOS controls the BAR address, e.g. SeaBIOS moved the 64-bit
PCI window to the end of the address space based on the advertised physical bits[1].

[1] https://gitlab.com/qemu-project/seabios/-/commit/bcfed7e270776ab5595cafc6f1794bea0cae1c6c

> 
> I don't think it's unreasonable for KVM selftests to require access to
> /proc/cpuinfo.  Or actually, they can probably do the same thing and self-limit
> to 48-bit addresses if /proc/cpuinfo isn't available.
> 
> I'm not totally opposed to adding a more programmatic way for userspace to query
> 5-level EPT support, it just seems unnecessary.  E.g. unlike CPUID, userspace
> can't directly influence whether or not KVM uses 5-level EPT.  Even in hindsight,
> I'm not entirely sure KVM should expose such a knob, as it raises questions around
> interactions between guest.MAXPHYADDR and memslots that I would rather avoid.
> 
> And even if we do add such uAPI, enumerating 5-level EPT in /proc/cpuinfo is
> definitely worthwhile, the only thing that would need to be tweaked is the
> justification in the changelog.
> 
> One thing we can do irrespective of feature enumeration is have kvm_mmu_page_fault()
> exit to userspace with an explicit error if the guest faults on a GPA that KVM
> knows it can't map, i.e. exit with KVM_EXIT_INTERNAL_ERROR or maybe even
> KVM_EXIT_MEMORY_FAULT instead of looping indefinitely.

If KVM does report guest.MAXPHYADDR=host.MAXPHYADDR, it is not reasonable to kill the
guest directly. And merely reporting in /proc/cpuinfo that 5-level EPT is unsupported
will make it hard for users to realize that physical-bits needs to be forcibly limited
on the command line. That said, advertising the max addressable GPA via an ioctl does
not conflict with this patch.

Thanks,
Tao
Paolo Bonzini Jan. 11, 2024, 10:13 a.m. UTC | #6
On Wed, Jan 10, 2024 at 1:23 AM Sean Christopherson <seanjc@google.com> wrote:
> Add a VMX flag in /proc/cpuinfo, ept_5level, so that userspace can query
> whether or not the CPU supports 5-level EPT paging.

I think this is a good idea independent of the selftests issue.

For selftests, we could get similar info from the feature MSR
mechanism, i.e. KVM_GET_MSRS(MSR_IA32_VMX_EPT_VPID_CAP), but only on
Intel and only if nested virtualization is enabled, so that's
inferior.
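
For reference, decoding the capability MSR value is straightforward once
userspace has it in hand. The sketch below is only illustrative: the macro and
function names are local to the example, and it assumes the Intel SDM bit
positions for IA32_VMX_EPT_VPID_CAP (bit 6 = page-walk length 4, bit 7 =
page-walk length 5), which match the kernel's VMX_EPT_PAGE_WALK_{4,5}_BIT.
Fetching the value via KVM_GET_MSRS is not shown.

```c
#include <stdint.h>

/* Local names for the page-walk-length bits of IA32_VMX_EPT_VPID_CAP. */
#define EPT_CAP_PAGE_WALK_4	(UINT64_C(1) << 6)
#define EPT_CAP_PAGE_WALK_5	(UINT64_C(1) << 7)

/*
 * Return the deepest EPT page walk advertised by the capability MSR value,
 * or 0 if neither 4-level nor 5-level walks are reported.
 */
static int ept_max_levels(uint64_t ept_vpid_cap)
{
	if (ept_vpid_cap & EPT_CAP_PAGE_WALK_5)
		return 5;
	if (ept_vpid_cap & EPT_CAP_PAGE_WALK_4)
		return 4;
	return 0;
}
```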

A better idea for selftests is to add a new KVM_CAP_PHYS_ADDR_SIZE,
which could be implemented by all architectures and especially by both
x86 vendors. However, I am not sure for example if different VM types
(read: TDX?) could have different maximum physical addresses, and that
would have to be taken into consideration when designing the API.

Paolo

Sean Christopherson Jan. 11, 2024, 4:17 p.m. UTC | #7
On Thu, Jan 11, 2024, Paolo Bonzini wrote:
> On Wed, Jan 10, 2024 at 1:23 AM Sean Christopherson <seanjc@google.com> wrote:
> > Add a VMX flag in /proc/cpuinfo, ept_5level, so that userspace can query
> > whether or not the CPU supports 5-level EPT paging.
> 
> I think this is a good idea independent of the selftests issue.
> 
> For selftests, we could get similar info from the feature MSR
> mechanism, i.e. KVM_GET_MSRS(MSR_IA32_VMX_EPT_VPID_CAP), but only on
> Intel and only if nested virtualization is enabled, so that's
> inferior.
> 
> A better idea for selftests is to add a new KVM_CAP_PHYS_ADDR_SIZE,
> which could be implemented by all architectures and especially by both
> x86 vendors.

Doh.  I was thinking this wouldn't be a problem on AMD, but a guest can generate
52-bit GPAs even without LA57.

> However, I am not sure for example if different VM types (read: TDX?) could
> have different maximum physical addresses, and that would have to be taken
> into consideration when designing the API.
Sean Christopherson Jan. 11, 2024, 4:25 p.m. UTC | #8
On Thu, Jan 11, 2024, Tao Su wrote:
> On Wed, Jan 10, 2024 at 08:26:25AM -0800, Sean Christopherson wrote:
> > On Wed, Jan 10, 2024, Chao Gao wrote:
> > > On Tue, Jan 09, 2024 at 04:23:40PM -0800, Sean Christopherson wrote:
> > > >Add a VMX flag in /proc/cpuinfo, ept_5level, so that userspace can query
> > > >whether or not the CPU supports 5-level EPT paging.  EPT capabilities are
> > > >enumerated via MSR, i.e. aren't accessible to userspace without help from
> > > >the kernel, and knowing whether or not 5-level EPT is supported is sadly
> > > >necessary for userspace to correctly configure KVM VMs.
> > > 
> > > This assumes procfs is enabled in Kconfig and userspace has permission to
> > > access /proc/cpuinfo. But it isn't always true. So, I think it is better to
> > > advertise max addressable GPA via KVM ioctls.
> > 
> > Hrm, so the help for PROC_FS says:
> > 
> >   Several programs depend on this, so everyone should say Y here.
> > 
> > Given that this is working around something that is borderline an erratum, I'm
> > inclined to say that userspace shouldn't simply assume the worst if /proc isn't
> > available.  Practically speaking, I don't think a "real" VM is likely to be
> > affected; AFAIK, there's no reason for QEMU or any other VMM to _need_ to expose
> > a memslot at GPA[51:48] unless the VM really has however much memory that is
> > > (hundreds of terabytes?).  And if someone is trying to run such a massive VM on
> > such a goofy CPU...
> 
> It is unusual to assign a huge RAM to guest, but passthrough a device also may trigger
> this issue which we have met, i.e. alloc memslot for the 64bit BAR which can set
> bits[51:48]. BIOS can control the BAR address, e.g. seabios moved 64bit pci window
> to end of address space by using advertised physical bits[1].

Drat.  Do you know if these CPUs are going to be productized?  We'll still need
something in KVM either way, but whether or not the problems are more or less
limited to funky software setups might influence how we address this.
Paolo Bonzini Jan. 11, 2024, 8:02 p.m. UTC | #9
On Thu, Jan 11, 2024 at 5:25 PM Sean Christopherson <seanjc@google.com> wrote:
> > It is unusual to assign a huge RAM to guest, but passthrough a device also may trigger
> > this issue which we have met, i.e. alloc memslot for the 64bit BAR which can set
> > bits[51:48]. BIOS can control the BAR address, e.g. seabios moved 64bit pci window
> > to end of address space by using advertised physical bits[1].
>
> Drat.  Do you know if these CPUs are going to be productized?  We'll still need
> something in KVM either way, but whether or not the problems are more or less
> limited to funky software setups might influence how we address this.

Wait, we do have an API for guest physical address size. It's
KVM_GET_SUPPORTED_CPUID2: the # of bits is in leaf 0x80000008, bits
0:7 of EAX. In fact that leaf is what firmware uses to place the BARs.
So it just needs to be adjusted for VMX in __do_cpuid_func, and looked
up in selftests.
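
To make the field layout concrete, a small sketch of pulling the address
widths out of that leaf's EAX value follows. The function names are
hypothetical; in a selftest the EAX value would presumably come from the
0x80000008 entry returned by KVM_GET_SUPPORTED_CPUID, which is not shown here.

```c
#include <stdint.h>

/* CPUID.0x80000008:EAX[7:0] is the number of physical address bits. */
static unsigned int phys_addr_bits_from_eax(uint32_t eax)
{
	return eax & 0xff;
}

/* CPUID.0x80000008:EAX[15:8] is the number of linear address bits. */
static unsigned int linear_addr_bits_from_eax(uint32_t eax)
{
	return (eax >> 8) & 0xff;
}
```

For example, an EAX value of 0x3934 decodes to 52 physical bits (0x34) and
57 linear bits (0x39).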

Paolo
Jim Mattson Jan. 11, 2024, 9:12 p.m. UTC | #10
On Thu, Jan 11, 2024 at 12:02 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On Thu, Jan 11, 2024 at 5:25 PM Sean Christopherson <seanjc@google.com> wrote:
> > > It is unusual to assign a huge RAM to guest, but passthrough a device also may trigger
> > > this issue which we have met, i.e. alloc memslot for the 64bit BAR which can set
> > > bits[51:48]. BIOS can control the BAR address, e.g. seabios moved 64bit pci window
> > > to end of address space by using advertised physical bits[1].
> >
> > Drat.  Do you know if these CPUs are going to be productized?  We'll still need
> > something in KVM either way, but whether or not the problems are more or less
> > limited to funky software setups might influence how we address this.
>
> Wait, we do have an API for guest physical address size. It's
> KVM_GET_SUPPORTED_CPUID2: the # of bits is in leaf 0x80000008, bits
> 0:7 of EAX. In fact that leaf is what firmware uses to place the BARs.
> So it just needs to be adjusted for VMX in __do_cpuid_func, and looked
> up in selftests.

We've discussed this. The only *supported* value for guest.MAXPHYADDR
is host.MAXPHYADDR.

If EPT doesn't also support that value, then the only *supported*
configuration is shadow paging.

If someone wants to run with scissors, that's fine, but don't abuse
KVM_GET_SUPPORTED_CPUID2 to return an *unsupported* configuration.
Tao Su Jan. 12, 2024, 1:08 a.m. UTC | #11
On Thu, Jan 11, 2024 at 08:25:01AM -0800, Sean Christopherson wrote:
> On Thu, Jan 11, 2024, Tao Su wrote:
> > On Wed, Jan 10, 2024 at 08:26:25AM -0800, Sean Christopherson wrote:
> > > On Wed, Jan 10, 2024, Chao Gao wrote:
> > > > On Tue, Jan 09, 2024 at 04:23:40PM -0800, Sean Christopherson wrote:
> > > > >Add a VMX flag in /proc/cpuinfo, ept_5level, so that userspace can query
> > > > >whether or not the CPU supports 5-level EPT paging.  EPT capabilities are
> > > > >enumerated via MSR, i.e. aren't accessible to userspace without help from
> > > > >the kernel, and knowing whether or not 5-level EPT is supported is sadly
> > > > >necessary for userspace to correctly configure KVM VMs.
> > > > 
> > > > This assumes procfs is enabled in Kconfig and userspace has permission to
> > > > access /proc/cpuinfo. But it isn't always true. So, I think it is better to
> > > > advertise max addressable GPA via KVM ioctls.
> > > 
> > > Hrm, so the help for PROC_FS says:
> > > 
> > >   Several programs depend on this, so everyone should say Y here.
> > > 
> > > Given that this is working around something that is borderline an erratum, I'm
> > > inclined to say that userspace shouldn't simply assume the worst if /proc isn't
> > > available.  Practically speaking, I don't think a "real" VM is likely to be
> > > affected; AFAIK, there's no reason for QEMU or any other VMM to _need_ to expose
> > > a memslot at GPA[51:48] unless the VM really has however much memory that is
> > > > (hundreds of terabytes?).  And if someone is trying to run such a massive VM on
> > > such a goofy CPU...
> > 
> > It is unusual to assign a huge RAM to guest, but passthrough a device also may trigger
> > this issue which we have met, i.e. alloc memslot for the 64bit BAR which can set
> > bits[51:48]. BIOS can control the BAR address, e.g. seabios moved 64bit pci window
> > to end of address space by using advertised physical bits[1].
> 
> Drat.  Do you know if these CPUs are going to be productized?  We'll still need
> something in KVM either way, but whether or not the problems are more or less
> limited to funky software setups might influence how we address this.

Yes, please see the CPU model I submitted[1].

[1] https://lore.kernel.org/all/20231206131923.1192066-1-tao1.su@linux.intel.com/

Thanks,
Tao

>
Sean Christopherson Feb. 23, 2024, 1:35 a.m. UTC | #12
On Tue, 09 Jan 2024 16:23:40 -0800, Sean Christopherson wrote:
> Add a VMX flag in /proc/cpuinfo, ept_5level, so that userspace can query
> whether or not the CPU supports 5-level EPT paging.  EPT capabilities are
> enumerated via MSR, i.e. aren't accessible to userspace without help from
> the kernel, and knowing whether or not 5-level EPT is supported is sadly
> necessary for userspace to correctly configure KVM VMs.
> 
> When EPT is enabled, bits 51:49 of guest physical addresses are consumed
> if and only if 5-level EPT is enabled.  For CPUs with MAXPHYADDR > 48, KVM
> *can't* map all legal guest memory if 5-level EPT is unsupported, e.g.
> creating a VM with RAM (or anything that gets stuffed into KVM's memslots)
> above bit 48 will be completely broken.
> 
> [...]

Applied to kvm-x86 vmx, with a massaged changelog to avoid presenting this as a
bug fix (and finally fixed the 51:49=>51:48 goof):

    Add a VMX flag in /proc/cpuinfo, ept_5level, so that userspace can query
    whether or not the CPU supports 5-level EPT paging.  EPT capabilities are
    enumerated via MSR, i.e. aren't accessible to userspace without help from
    the kernel, and knowing whether or not 5-level EPT is supported is useful
    for debug, triage, testing, etc.
    
    For example, when EPT is enabled, bits 51:48 of guest physical addresses
    are consumed by the CPU if and only if 5-level EPT is enabled.  For CPUs
    with MAXPHYADDR > 48, KVM *can't* map all legal guest memory if 5-level
    EPT is unsupported, making it more or less necessary to know whether or
    not 5-level EPT is supported.

[1/1] x86/cpu: Add a VMX flag to enumerate 5-level EPT support to userspace
      https://github.com/kvm-x86/linux/commit/b1a3c366cbc7
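
The bits 51:48 arithmetic in the changelog can be sketched as follows. This is
only an illustration with hypothetical helper names, assuming 4KiB pages and
9 GPA bits translated per EPT level; it deliberately ignores MAXPHYADDR, which
further caps legal GPAs (e.g. at 52 bits) even with 5-level EPT.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Each EPT level translates 9 GPA bits; the low 12 bits are the page offset. */
static unsigned int ept_gpa_bits(unsigned int levels)
{
	assert(levels == 4 || levels == 5);
	return 12 + 9 * levels;
}

/* A GPA is reachable by the walk only if no bit above the walk's width is set. */
static bool gpa_is_mappable(uint64_t gpa, unsigned int levels)
{
	return (gpa >> ept_gpa_bits(levels)) == 0;
}
```

With 4 levels the walk covers 48 bits, so a GPA with bit 48 set is not
mappable; with 5 levels the walk covers 57 bits and the same GPA is fine.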

--
https://github.com/kvm-x86/linux/tree/next
Xiaoyao Li Feb. 26, 2024, 1:30 a.m. UTC | #13
On 2/23/2024 9:35 AM, Sean Christopherson wrote:
> On Tue, 09 Jan 2024 16:23:40 -0800, Sean Christopherson wrote:
>> Add a VMX flag in /proc/cpuinfo, ept_5level, so that userspace can query
>> whether or not the CPU supports 5-level EPT paging.  EPT capabilities are
>> enumerated via MSR, i.e. aren't accessible to userspace without help from
>> the kernel, and knowing whether or not 5-level EPT is supported is sadly
>> necessary for userspace to correctly configure KVM VMs.
>>
>> When EPT is enabled, bits 51:49 of guest physical addresses are consumed
>> if and only if 5-level EPT is enabled.  For CPUs with MAXPHYADDR > 48, KVM
>> *can't* map all legal guest memory if 5-level EPT is unsupported, e.g.
>> creating a VM with RAM (or anything that gets stuffed into KVM's memslots)
>> above bit 48 will be completely broken.
>>
>> [...]
> 
> Applied to kvm-x86 vmx, with a massaged changelog to avoid presenting this as a
> bug fix (and finally fixed the 51:49=>51:48 goof):
> 
>      Add a VMX flag in /proc/cpuinfo, ept_5level, so that userspace can query
>      whether or not the CPU supports 5-level EPT paging.  EPT capabilities are
>      enumerated via MSR, i.e. aren't accessible to userspace without help from
>      the kernel, and knowing whether or not 5-level EPT is supported is useful
>      for debug, triage, testing, etc.
>      
>      For example, when EPT is enabled, bits 51:48 of guest physical addresses
>      are consumed by the CPU if and only if 5-level EPT is enabled.  For CPUs
>      with MAXPHYADDR > 48, KVM *can't* map all legal guest memory if 5-level
>      EPT is unsupported, making it more or less necessary to know whether or
>      not 5-level EPT is supported.
> 
> [1/1] x86/cpu: Add a VMX flag to enumerate 5-level EPT support to userspace
>        https://github.com/kvm-x86/linux/commit/b1a3c366cbc7

Do we need a new KVM CAP for this? That determines how userspace interacts
with an old kernel that lacks this patch. On such a kernel there is no
ept_5level in /proc/cpuinfo, so what should userspace do in its absence?
Treat the CPU as supporting only 4-level EPT?



> --
> https://github.com/kvm-x86/linux/tree/next
>
Tao Su Feb. 26, 2024, 7:11 a.m. UTC | #14
On Mon, Feb 26, 2024 at 09:30:33AM +0800, Xiaoyao Li wrote:
> On 2/23/2024 9:35 AM, Sean Christopherson wrote:
> > On Tue, 09 Jan 2024 16:23:40 -0800, Sean Christopherson wrote:
> > > Add a VMX flag in /proc/cpuinfo, ept_5level, so that userspace can query
> > > whether or not the CPU supports 5-level EPT paging.  EPT capabilities are
> > > enumerated via MSR, i.e. aren't accessible to userspace without help from
> > > the kernel, and knowing whether or not 5-level EPT is supported is sadly
> > > necessary for userspace to correctly configure KVM VMs.
> > > 
> > > When EPT is enabled, bits 51:49 of guest physical addresses are consumed
> > > if and only if 5-level EPT is enabled.  For CPUs with MAXPHYADDR > 48, KVM
> > > *can't* map all legal guest memory if 5-level EPT is unsupported, e.g.
> > > creating a VM with RAM (or anything that gets stuffed into KVM's memslots)
> > > above bit 48 will be completely broken.
> > > 
> > > [...]
> > 
> > Applied to kvm-x86 vmx, with a massaged changelog to avoid presenting this as a
> > bug fix (and finally fixed the 51:49=>51:48 goof):
> > 
> >      Add a VMX flag in /proc/cpuinfo, ept_5level, so that userspace can query
> >      whether or not the CPU supports 5-level EPT paging.  EPT capabilities are
> >      enumerated via MSR, i.e. aren't accessible to userspace without help from
> >      the kernel, and knowing whether or not 5-level EPT is supported is useful
> >      for debug, triage, testing, etc.
> >      For example, when EPT is enabled, bits 51:48 of guest physical addresses
> >      are consumed by the CPU if and only if 5-level EPT is enabled.  For CPUs
> >      with MAXPHYADDR > 48, KVM *can't* map all legal guest memory if 5-level
> >      EPT is unsupported, making it more or less necessary to know whether or
> >      not 5-level EPT is supported.
> > 
> > [1/1] x86/cpu: Add a VMX flag to enumerate 5-level EPT support to userspace
> >        https://github.com/kvm-x86/linux/commit/b1a3c366cbc7
> 
> Do we need a new KVM CAP for this? It determines how userspace interacts
> with an old kernel that lacks this patch. In that case there is no
> ept_5level in /proc/cpuinfo, so what should we do in its absence? Treat
> the CPU as supporting only 4-level EPT?

Maybe adding a flag for 4-level EPT as well could be an option. If
userspace finds neither the 4-level nor the 5-level flag in /proc/cpuinfo,
it can treat the kernel as old.

Thanks,
Tao

> 
> 
> 
> > --
> > https://github.com/kvm-x86/linux/tree/next
> > 
>
Sean Christopherson Feb. 26, 2024, 3:27 p.m. UTC | #15
On Mon, Feb 26, 2024, Tao Su wrote:
> On Mon, Feb 26, 2024 at 09:30:33AM +0800, Xiaoyao Li wrote:
> > On 2/23/2024 9:35 AM, Sean Christopherson wrote:
> > > On Tue, 09 Jan 2024 16:23:40 -0800, Sean Christopherson wrote:
> > > > Add a VMX flag in /proc/cpuinfo, ept_5level, so that userspace can query
> > > > whether or not the CPU supports 5-level EPT paging.  EPT capabilities are
> > > > enumerated via MSR, i.e. aren't accessible to userspace without help from
> > > > the kernel, and knowing whether or not 5-level EPT is supported is sadly
> > > > necessary for userspace to correctly configure KVM VMs.
> > > > 
> > > > When EPT is enabled, bits 51:49 of guest physical addresses are consumed
> > > > if and only if 5-level EPT is enabled.  For CPUs with MAXPHYADDR > 48, KVM
> > > > *can't* map all legal guest memory if 5-level EPT is unsupported, e.g.
> > > > creating a VM with RAM (or anything that gets stuffed into KVM's memslots)
> > > > above bit 48 will be completely broken.
> > > > 
> > > > [...]
> > > 
> > > Applied to kvm-x86 vmx, with a massaged changelog to avoid presenting this as a
> > > bug fix (and finally fixed the 51:49=>51:48 goof):
> > > 
> > >      Add a VMX flag in /proc/cpuinfo, ept_5level, so that userspace can query
> > >      whether or not the CPU supports 5-level EPT paging.  EPT capabilities are
> > >      enumerated via MSR, i.e. aren't accessible to userspace without help from
> > >      the kernel, and knowing whether or not 5-level EPT is supported is useful
> > >      for debug, triage, testing, etc.
> > >      For example, when EPT is enabled, bits 51:48 of guest physical addresses
> > >      are consumed by the CPU if and only if 5-level EPT is enabled.  For CPUs
> > >      with MAXPHYADDR > 48, KVM *can't* map all legal guest memory if 5-level
> > >      EPT is unsupported, making it more or less necessary to know whether or
> > >      not 5-level EPT is supported.
> > > 
> > > [1/1] x86/cpu: Add a VMX flag to enumerate 5-level EPT support to userspace
> > >        https://github.com/kvm-x86/linux/commit/b1a3c366cbc7
> > 
> > Do we need a new KVM CAP for this? It determines how userspace interacts
> > with an old kernel that lacks this patch. In that case there is no
> > ept_5level in /proc/cpuinfo, so what should we do in its absence? Treat
> > the CPU as supporting only 4-level EPT?
> 
> Maybe adding a flag for 4-level EPT as well could be an option. If
> userspace finds neither the 4-level nor the 5-level flag in /proc/cpuinfo,
> it can treat the kernel as old.

The intent is that this is informational only, not something that userspace can
or should use to make decisions about how to configure KVM guests.  As pointed
out elsewhere in the thread, simply restricting guest.MAXPHYADDR to 48 doesn't
actually create an architecturally viable VM.  At the very least, KVM needs to
be configured with allow_smaller_maxphyaddr=1, and aside from the gaping holes
in KVM related to that knob, AIUI allow_smaller_maxphyaddr=1 isn't an option in
this case due to other quirks/flaws with the CPU in question.

I don't think there's been an on-list summary posted, but the plan is to figure
out a way to inform guest firmware of the max _usable_ physical address, so that
firmware doesn't create BARs and whatnot in memory that KVM can't map.  And then
have KVM relay the usable guest.MAXPHYADDR to userspace.  That way userspace
doesn't need to infer the effective guest.MAXPHYADDR from EPT knobs.
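
To make the arithmetic behind "max _usable_ physical address" concrete, here
is a minimal sketch. It is not KVM code and not the interface the plan above
refers to; the function names are hypothetical. It assumes only that each EPT
level translates 9 address bits on top of a 12-bit page offset, so a 4-level
walk covers 48 bits and a 5-level walk covers 57.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper: number of guest physical address bits that an EPT
 * walk of the given depth can translate.  Each level indexes 9 bits and
 * the 4KiB page offset supplies the low 12 bits.
 */
static unsigned int ept_gpa_bits(unsigned int levels)
{
	assert(levels == 4 || levels == 5);
	return levels * 9 + 12;
}

/*
 * Hypothetical helper: widest guest physical address that can actually be
 * mapped, i.e. the smaller of the host's MAXPHYADDR and the EPT reach.
 */
static unsigned int max_usable_gpa_bits(unsigned int host_maxphyaddr,
					bool ept_5level)
{
	unsigned int ept_bits = ept_gpa_bits(ept_5level ? 5 : 4);

	return host_maxphyaddr < ept_bits ? host_maxphyaddr : ept_bits;
}
```

With a 52-bit MAXPHYADDR and no 5-level EPT, max_usable_gpa_bits(52, false)
yields 48, which is the gap discussed in this thread: bits 51:48 are legal to
the guest but cannot be mapped.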

Patch

diff --git a/arch/x86/include/asm/vmxfeatures.h b/arch/x86/include/asm/vmxfeatures.h
index c6a7eed03914..266daf5b5b84 100644
--- a/arch/x86/include/asm/vmxfeatures.h
+++ b/arch/x86/include/asm/vmxfeatures.h
@@ -25,6 +25,7 @@ 
 #define VMX_FEATURE_EPT_EXECUTE_ONLY	( 0*32+ 17) /* "ept_x_only" EPT entries can be execute only */
 #define VMX_FEATURE_EPT_AD		( 0*32+ 18) /* EPT Accessed/Dirty bits */
 #define VMX_FEATURE_EPT_1GB		( 0*32+ 19) /* 1GB EPT pages */
+#define VMX_FEATURE_EPT_5LEVEL		( 0*32+ 20) /* 5-level EPT paging */
 
 /* Aggregated APIC features 24-27 */
 #define VMX_FEATURE_FLEXPRIORITY	( 0*32+ 24) /* TPR shadow + virt APIC */
diff --git a/arch/x86/kernel/cpu/feat_ctl.c b/arch/x86/kernel/cpu/feat_ctl.c
index 03851240c3e3..1640ae76548f 100644
--- a/arch/x86/kernel/cpu/feat_ctl.c
+++ b/arch/x86/kernel/cpu/feat_ctl.c
@@ -72,6 +72,8 @@  static void init_vmx_capabilities(struct cpuinfo_x86 *c)
 		c->vmx_capability[MISC_FEATURES] |= VMX_F(EPT_AD);
 	if (ept & VMX_EPT_1GB_PAGE_BIT)
 		c->vmx_capability[MISC_FEATURES] |= VMX_F(EPT_1GB);
+	if (ept & VMX_EPT_PAGE_WALK_5_BIT)
+		c->vmx_capability[MISC_FEATURES] |= VMX_F(EPT_5LEVEL);
 
 	/* Synthetic APIC features that are aggregates of multiple features. */
 	if ((c->vmx_capability[PRIMARY_CTLS] & VMX_F(VIRTUAL_TPR)) &&
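
For debug and triage, a userspace tool might look for the new flag along the
lines of the sketch below. This is an illustration only, not part of the
patch: the function name is hypothetical, and the caller is assumed to have
already read the full text of /proc/cpuinfo into a NUL-terminated buffer. It
searches the "vmx flags" lines for a whole-token match.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: return true if `flag` appears as a whole token on a
 * "vmx flags" line of `cpuinfo` (the text of /proc/cpuinfo).
 */
static bool cpuinfo_has_vmx_flag(const char *cpuinfo, const char *flag)
{
	static const char key[] = "vmx flags";
	size_t flag_len;
	const char *line;

	assert(cpuinfo && flag);
	flag_len = strlen(flag);

	for (line = cpuinfo; line && *line; ) {
		const char *eol = strchr(line, '\n');
		const char *end = eol ? eol : line + strlen(line);

		if (!strncmp(line, key, sizeof(key) - 1)) {
			/* Tokens start after the ':' separator. */
			const char *p = memchr(line, ':', (size_t)(end - line));

			p = p ? p + 1 : end;
			while (p < end) {
				const char *tok;

				/* Skip separators, then measure one token. */
				while (p < end && (*p == ' ' || *p == '\t'))
					p++;
				tok = p;
				while (p < end && *p != ' ' && *p != '\t')
					p++;
				if (flag_len && (size_t)(p - tok) == flag_len &&
				    !memcmp(tok, flag, flag_len))
					return true;
			}
		}
		line = eol ? eol + 1 : NULL;
	}
	return false;
}
```

A false result is ambiguous: it can mean the CPU lacks 5-level EPT or that
the kernel predates the flag. As noted in the thread, the flag is meant to be
informational and not an input for configuring KVM guests.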