[v5,3/5] KVM: x86: Virtualize CR3.LAM_{U48,U57}

Message ID 20230227084547.404871-4-robert.hu@linux.intel.com (mailing list archive)
State New, archived
Series Linear Address Masking (LAM) KVM Enabling

Commit Message

Robert Hoo Feb. 27, 2023, 8:45 a.m. UTC
The LAM feature uses 2 high bits in CR3 (bit 62 for LAM_U48 and bit 61 for
LAM_U57) to enable/configure LAM for user-mode addresses. The LAM
masking is done before the legacy canonical checks.

To virtualize the LAM CR3 bits, this patch:
1. Doesn't reserve these 2 bits when LAM is enabled on the vCPU. Previously,
when validating CR3, KVM used kvm_vcpu_is_legal_gpa(); now define
kvm_vcpu_is_valid_cr3(), which is kvm_vcpu_is_legal_gpa() plus CR3.LAM
bits validation. Substitute kvm_vcpu_is_legal/illegal_gpa() with
kvm_vcpu_is_valid_cr3() in call sites that validate CR3 rather than a
pure GPA.
2. mmu::get_guest_pgd() is implemented by get_cr3(), which returns the
whole guest CR3 value. Strip the LAM bits in call sites that need a pure
PGD value, e.g. mmu_alloc_shadow_roots(), FNAME(walk_addr_generic)().
3. When forming a new guest CR3 (vmx_load_mmu_pgd()), fold in the LAM bits
(kvm_get_active_lam()).
4. When the guest sets CR3, identify the cases where ONLY the LAM bits
toggle; there it is unnecessary to make a new pgd, just request a PGD load
and the new CR3.LAM configuration will be folded in (point 3 above). To be
conservative, this case still does a TLB flush.
5. For nested VM entry, allow the 2 CR3 bits to be set in the corresponding
VMCS host/guest fields.

Signed-off-by: Robert Hoo <robert.hu@linux.intel.com>
---
 arch/x86/kvm/mmu.h             |  5 +++++
 arch/x86/kvm/mmu/mmu.c         |  9 ++++++++-
 arch/x86/kvm/mmu/paging_tmpl.h |  2 +-
 arch/x86/kvm/vmx/nested.c      |  4 ++--
 arch/x86/kvm/vmx/vmx.c         |  3 ++-
 arch/x86/kvm/x86.c             | 33 ++++++++++++++++++++++++++++-----
 arch/x86/kvm/x86.h             |  1 +
 7 files changed, 47 insertions(+), 10 deletions(-)

Comments

Chao Gao March 3, 2023, 6:21 a.m. UTC | #1
On Mon, Feb 27, 2023 at 04:45:45PM +0800, Robert Hoo wrote:
>LAM feature uses 2 high bits in CR3 (bit 62 for LAM_U48 and bit 61 for
>LAM_U57) to enable/config LAM feature for user mode addresses. The LAM
>masking is done before legacy canonical checks.
>
>To virtualize LAM CR3 bits usage, this patch:
>1. Don't reserve these 2 bits when LAM is enable on the vCPU. Previously
>when validate CR3, kvm uses kvm_vcpu_is_legal_gpa(), now define
>kvm_vcpu_is_valid_cr3() which is actually kvm_vcpu_is_legal_gpa()
>+ CR3.LAM bits validation. Substitutes kvm_vcpu_is_legal/illegal_gpa()
>with kvm_vcpu_is_valid_cr3() in call sites where is validating CR3 rather
>than pure GPA.
>2. mmu::get_guest_pgd(), its implementation is get_cr3() which returns
>whole guest CR3 value. Strip LAM bits in those call sites that need pure
>PGD value, e.g. mmu_alloc_shadow_roots(), FNAME(walk_addr_generic)().
>3. When form a new guest CR3 (vmx_load_mmu_pgd()), melt in LAM bit
>(kvm_get_active_lam()).
>4. When guest sets CR3, identify ONLY-LAM-bits-toggling cases, where it is
>unnecessary to make new pgd, but just make request of load pgd, then new
>CR3.LAM bits configuration will be melt in (above point 3). To be
>conservative, this case still do TLB flush.

>5. For nested VM entry, allow the 2 CR3 bits set in corresponding
>VMCS host/guest fields.

isn't this already covered by item #1 above?

>
>Signed-off-by: Robert Hoo <robert.hu@linux.intel.com>
>---
> arch/x86/kvm/mmu.h             |  5 +++++
> arch/x86/kvm/mmu/mmu.c         |  9 ++++++++-
> arch/x86/kvm/mmu/paging_tmpl.h |  2 +-
> arch/x86/kvm/vmx/nested.c      |  4 ++--
> arch/x86/kvm/vmx/vmx.c         |  3 ++-
> arch/x86/kvm/x86.c             | 33 ++++++++++++++++++++++++++++-----
> arch/x86/kvm/x86.h             |  1 +
> 7 files changed, 47 insertions(+), 10 deletions(-)
>
>diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
>index 6bdaacb6faa0..866f2b7cb509 100644
>--- a/arch/x86/kvm/mmu.h
>+++ b/arch/x86/kvm/mmu.h
>@@ -142,6 +142,11 @@ static inline unsigned long kvm_get_active_pcid(struct kvm_vcpu *vcpu)
> 	return kvm_get_pcid(vcpu, kvm_read_cr3(vcpu));
> }
> 
>+static inline u64 kvm_get_active_lam(struct kvm_vcpu *vcpu)
>+{
>+	return kvm_read_cr3(vcpu) & (X86_CR3_LAM_U48 | X86_CR3_LAM_U57);
>+}

I think it is better to define a mask (like reserved_gpa_bits):

kvm_vcpu_arch {
	...

	/*
	 * Bits in CR3 used to enable certain features. These bits don't
	 * participate in page table walking. They should be masked to
	 * get the base address of page table. When shadow paging is
	 * used, these bits should be kept as is in the shadow CR3.
	 */
	u64 cr3_control_bits;

and initialize the mask in kvm_vcpu_after_set_cpuid():

	if (guest_cpuid_has(X86_FEATURE_LAM))
		vcpu->arch.cr3_control_bits = X86_CR3_LAM_U48 | X86_CR3_LAM_U57;

then add helpers to extract/mask control bits.

It is cleaner and can avoid looking up guest CPUID at runtime. And if
AMD has a similar feature (e.g., some CR3 bits are used as control bits),
it is easy to support that feature.
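
A minimal sketch of what such helpers might look like, assuming the
cr3_control_bits field suggested above (the helper names here are
hypothetical, not taken from the posted patch):

static inline u64 kvm_get_active_cr3_ctrl_bits(struct kvm_vcpu *vcpu)
{
	/* Control bits (e.g. LAM_U48/LAM_U57) currently set in guest CR3. */
	return kvm_read_cr3(vcpu) & vcpu->arch.cr3_control_bits;
}

static inline u64 kvm_vcpu_strip_cr3_ctrl_bits(struct kvm_vcpu *vcpu, u64 cr3)
{
	/* Mask off the control bits to get the value used for table walks. */
	return cr3 & ~vcpu->arch.cr3_control_bits;
}
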
Robert Hoo March 3, 2023, 2:23 p.m. UTC | #2
On Fri, 2023-03-03 at 14:21 +0800, Chao Gao wrote:
> On Mon, Feb 27, 2023 at 04:45:45PM +0800, Robert Hoo wrote:
> > LAM feature uses 2 high bits in CR3 (bit 62 for LAM_U48 and bit 61
> > for
> > LAM_U57) to enable/config LAM feature for user mode addresses. The
> > LAM
> > masking is done before legacy canonical checks.
> > 
> > To virtualize LAM CR3 bits usage, this patch:
> > 1. Don't reserve these 2 bits when LAM is enable on the vCPU.
> > Previously
> > when validate CR3, kvm uses kvm_vcpu_is_legal_gpa(), now define
> > kvm_vcpu_is_valid_cr3() which is actually kvm_vcpu_is_legal_gpa()
> > + CR3.LAM bits validation. Substitutes
> > kvm_vcpu_is_legal/illegal_gpa()
> > with kvm_vcpu_is_valid_cr3() in call sites where is validating CR3
> > rather
> > than pure GPA.
> > 2. mmu::get_guest_pgd(), its implementation is get_cr3() which
> > returns
> > whole guest CR3 value. Strip LAM bits in those call sites that need
> > pure
> > PGD value, e.g. mmu_alloc_shadow_roots(),
> > FNAME(walk_addr_generic)().
> > 3. When form a new guest CR3 (vmx_load_mmu_pgd()), melt in LAM bit
> > (kvm_get_active_lam()).
> > 4. When guest sets CR3, identify ONLY-LAM-bits-toggling cases,
> > where it is
> > unnecessary to make new pgd, but just make request of load pgd,
> > then new
> > CR3.LAM bits configuration will be melt in (above point 3). To be
> > conservative, this case still do TLB flush.
> > 5. For nested VM entry, allow the 2 CR3 bits set in corresponding
> > VMCS host/guest fields.
> 
> isn't this already covered by item #1 above?

Ah, it is there to address your comments on the last version. Repeating and
emphasizing it again doesn't harm, does it? ;)
> 
(...)
> > 
> > +static inline u64 kvm_get_active_lam(struct kvm_vcpu *vcpu)
> > +{
> > +	return kvm_read_cr3(vcpu) & (X86_CR3_LAM_U48 |
> > X86_CR3_LAM_U57);
> > +}
> 
> I think it is better to define a mask (like reserved_gpa_bits):
> 
> kvm_vcpu_arch {
> 	...
> 
> 	/*
> 	 * Bits in CR3 used to enable certain features. These bits
> don't
> 	 * participate in page table walking. They should be masked to
> 	 * get the base address of page table. When shadow paging is
> 	 * used, these bits should be kept as is in the shadow CR3.
> 	 */
> 	u64 cr3_control_bits;
> 

I don't strongly object to this. But per the SDM, CR3.bit[63:MAXPHYADDR] are
reserved, and MAXPHYADDR is at most 52 [1]. So can we assert that and simply
define the mask as bit[63:52]? (I did this in v3 and prior.)

[1] CPUID.80000008H:EAX[7:0] reports the physical-address width
supported by the processor. (For processors
that do not support CPUID function 80000008H, the width is generally 36
if CPUID.01H:EDX.PAE [bit 6] = 1
and 32 otherwise.) This width is referred to as MAXPHYADDR. MAXPHYADDR
is at most 52. (SDM 4.1.4 Enumeration of Paging Features by CPUID)

> and initialize the mask in kvm_vcpu_after_set_cpuid():
> 
> 	if (guest_cpuid_has(X86_FEATURE_LAM))
> 		vcpu->arch.cr3_control_bits = X86_CR3_LAM_U48 |
> X86_CR3_LAM_U57;
> 
> then add helpers to extract/mask control bits.
> 
> It is cleaner and can avoid looking up guest CPUID at runtime.
>  And if
> AMD has a similar feature (e.g., some CR3 bits are used as control
> bits),
> it is easy to support that feature.
Chao Gao March 3, 2023, 3:53 p.m. UTC | #3
On Fri, Mar 03, 2023 at 10:23:50PM +0800, Robert Hoo wrote:
>On Fri, 2023-03-03 at 14:21 +0800, Chao Gao wrote:
>> On Mon, Feb 27, 2023 at 04:45:45PM +0800, Robert Hoo wrote:
>> > LAM feature uses 2 high bits in CR3 (bit 62 for LAM_U48 and bit 61
>> > for
>> > LAM_U57) to enable/config LAM feature for user mode addresses. The
>> > LAM
>> > masking is done before legacy canonical checks.
>> > 
>> > To virtualize LAM CR3 bits usage, this patch:
>> > 1. Don't reserve these 2 bits when LAM is enable on the vCPU.
>> > Previously
>> > when validate CR3, kvm uses kvm_vcpu_is_legal_gpa(), now define
>> > kvm_vcpu_is_valid_cr3() which is actually kvm_vcpu_is_legal_gpa()
>> > + CR3.LAM bits validation. Substitutes
>> > kvm_vcpu_is_legal/illegal_gpa()
>> > with kvm_vcpu_is_valid_cr3() in call sites where is validating CR3
>> > rather
>> > than pure GPA.
>> > 2. mmu::get_guest_pgd(), its implementation is get_cr3() which
>> > returns
>> > whole guest CR3 value. Strip LAM bits in those call sites that need
>> > pure
>> > PGD value, e.g. mmu_alloc_shadow_roots(),
>> > FNAME(walk_addr_generic)().
>> > 3. When form a new guest CR3 (vmx_load_mmu_pgd()), melt in LAM bit
>> > (kvm_get_active_lam()).
>> > 4. When guest sets CR3, identify ONLY-LAM-bits-toggling cases,
>> > where it is
>> > unnecessary to make new pgd, but just make request of load pgd,
>> > then new
>> > CR3.LAM bits configuration will be melt in (above point 3). To be
>> > conservative, this case still do TLB flush.
>> > 5. For nested VM entry, allow the 2 CR3 bits set in corresponding
>> > VMCS host/guest fields.
>> 
>> isn't this already covered by item #1 above?
>
>Ah, it is to address your comments on last version. To repeat/emphasize
>again, doesn't harm, does it?;) 

It is confusing. Trying to merge #5 into #1:

If LAM is supported, bits 62:61 (LAM_U48 and LAM_U57) are not reserved
in CR3. VM entry also allows the two bits to be set in CR3 field in
guest-state and host-state area of the VMCS. Previously ...

>> 
>(...)
>> > 
>> > +static inline u64 kvm_get_active_lam(struct kvm_vcpu *vcpu)
>> > +{
>> > +	return kvm_read_cr3(vcpu) & (X86_CR3_LAM_U48 |
>> > X86_CR3_LAM_U57);
>> > +}
>> 
>> I think it is better to define a mask (like reserved_gpa_bits):
>> 
>> kvm_vcpu_arch {
>> 	...
>> 
>> 	/*
>> 	 * Bits in CR3 used to enable certain features. These bits
>> don't
>> 	 * participate in page table walking. They should be masked to
>> 	 * get the base address of page table. When shadow paging is
>> 	 * used, these bits should be kept as is in the shadow CR3.
>> 	 */
>> 	u64 cr3_control_bits;
>> 
>
>I don't strongly object this. But per SDM, CR3.bit[63:MAXPHYADDR] are
>reserved; and MAXPHYADDR is at most 52 [1]. So can we assert and simply
>define the MASK bit[63:52]? (I did this in v3 and prior)

No. Setting any bit in 60:52 should be rejected. And setting bit 62 or
61 should be allowed if LAM is supported by the vCPU. I don't see how
your proposal can distinguish these two cases.
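
To make that concrete, here is a sketch of a check that does distinguish the
two cases, a variant of the posted kvm_vcpu_is_valid_cr3() reusing the
cr3_control_bits idea from earlier in the thread (illustrative only, not the
posted code):

bool kvm_vcpu_is_valid_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
{
	/*
	 * Strip only the enumerated control bits (62:61 when LAM is
	 * supported); anything still set in bits 63:MAXPHYADDR, including
	 * bits 60:52, is then rejected by the reserved-GPA check.
	 */
	cr3 &= ~vcpu->arch.cr3_control_bits;
	return kvm_vcpu_is_legal_gpa(vcpu, cr3);
}
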
Robert Hoo March 5, 2023, 1:31 a.m. UTC | #4
On Fri, 2023-03-03 at 23:53 +0800, Chao Gao wrote:
> On Fri, Mar 03, 2023 at 10:23:50PM +0800, Robert Hoo wrote:
> > On Fri, 2023-03-03 at 14:21 +0800, Chao Gao wrote:
> > > On Mon, Feb 27, 2023 at 04:45:45PM +0800, Robert Hoo wrote:
> > > > LAM feature uses 2 high bits in CR3 (bit 62 for LAM_U48 and bit
> > > > 61
> > > > for
> > > > LAM_U57) to enable/config LAM feature for user mode addresses.
> > > > The
> > > > LAM
> > > > masking is done before legacy canonical checks.
> > > > 
> > > > To virtualize LAM CR3 bits usage, this patch:
> > > > 1. Don't reserve these 2 bits when LAM is enable on the vCPU.
> > > > Previously
> > > > when validate CR3, kvm uses kvm_vcpu_is_legal_gpa(), now define
> > > > kvm_vcpu_is_valid_cr3() which is actually
> > > > kvm_vcpu_is_legal_gpa()
> > > > + CR3.LAM bits validation. Substitutes
> > > > kvm_vcpu_is_legal/illegal_gpa()
> > > > with kvm_vcpu_is_valid_cr3() in call sites where is validating
> > > > CR3
> > > > rather
> > > > than pure GPA.
> > > > 2. mmu::get_guest_pgd(), its implementation is get_cr3() which
> > > > returns
> > > > whole guest CR3 value. Strip LAM bits in those call sites that
> > > > need
> > > > pure
> > > > PGD value, e.g. mmu_alloc_shadow_roots(),
> > > > FNAME(walk_addr_generic)().
> > > > 3. When form a new guest CR3 (vmx_load_mmu_pgd()), melt in LAM
> > > > bit
> > > > (kvm_get_active_lam()).
> > > > 4. When guest sets CR3, identify ONLY-LAM-bits-toggling cases,
> > > > where it is
> > > > unnecessary to make new pgd, but just make request of load pgd,
> > > > then new
> > > > CR3.LAM bits configuration will be melt in (above point 3). To
> > > > be
> > > > conservative, this case still do TLB flush.
> > > > 5. For nested VM entry, allow the 2 CR3 bits set in
> > > > corresponding
> > > > VMCS host/guest fields.
> > > 
> > > isn't this already covered by item #1 above?
> > 
> > Ah, it is to address your comments on last version. To
> > repeat/emphasize
> > again, doesn't harm, does it?;) 
> 
> It is confusing. Trying to merge #5 to #1:

Well, this is kind of subjective. I don't have a strong opinion on this.
> 
> If LAM is supported, bits 62:61 (LAM_U48 and LAM_U57) are not
> reserved
> in CR3. VM entry also allows the two bits to be set in CR3 field in
> guest-state and host-state area of the VMCS. Previously ...
> 
> > > 
> > 
> > (...)
> > > > 
> > > > +static inline u64 kvm_get_active_lam(struct kvm_vcpu *vcpu)
> > > > +{
> > > > +	return kvm_read_cr3(vcpu) & (X86_CR3_LAM_U48 |
> > > > X86_CR3_LAM_U57);
> > > > +}
> > > 
> > > I think it is better to define a mask (like reserved_gpa_bits):
> > > 
> > > kvm_vcpu_arch {
> > > 	...
> > > 
> > > 	/*
> > > 	 * Bits in CR3 used to enable certain features. These bits
> > > don't
> > > 	 * participate in page table walking. They should be masked to
> > > 	 * get the base address of page table. When shadow paging is
> > > 	 * used, these bits should be kept as is in the shadow CR3.
> > > 	 */
> > > 	u64 cr3_control_bits;
> > > 
> > 
> > I don't strongly object this. But per SDM, CR3.bit[63:MAXPHYADDR]
> > are
> > reserved; and MAXPHYADDR is at most 52 [1]. So can we assert and
> > simply
> > define the MASK bit[63:52]? (I did this in v3 and prior)
> 
> No. Setting any bit in 60:52 should be rejected. And setting bit 62
> or
> 61 should be allowed if LAM is supported by the vCPU. I don't see how
> your proposal can distinguish these two cases.

No, you didn't get my point.
Perhaps you can take a look at the v3 patch and prior:
https://lore.kernel.org/kvm/20221209044557.1496580-4-robert.hu@linux.intel.com/

It defines CR3_HIGH_RSVD_MASK, given that "MAXPHYADDR is at most 52" is
stated in the SDM.
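
Spelled out, the bit[63:52] mask described here would look something like the
following (illustrative only; see the linked v3 posting for the actual
definition):

/* CR3 bits 63:52, reserved given that MAXPHYADDR is at most 52. */
#define CR3_HIGH_RSVD_MASK	GENMASK_ULL(63, 52)
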
Sean Christopherson March 10, 2023, 8:12 p.m. UTC | #5
On Mon, Feb 27, 2023, Robert Hoo wrote:
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 835426254e76..3efec7f8d8c6 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3699,7 +3699,14 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
>  	int quadrant, i, r;
>  	hpa_t root;
>  
> -	root_pgd = mmu->get_guest_pgd(vcpu);
> +	/*
> +	 * Omit guest_cpuid_has(X86_FEATURE_LAM) check but can unconditionally
> +	 * strip CR3 LAM bits because they resides in high reserved bits,
> +	 * with LAM or not, those high bits should be striped anyway when
> +	 * interpreted to pgd.
> +	 */

This misses the most important part: why it's safe to ignore LAM bits when reusing
a root.

> +	root_pgd = mmu->get_guest_pgd(vcpu) &
> +		   ~(X86_CR3_LAM_U48 | X86_CR3_LAM_U57);

Unconditionally clearing LAM bits is unsafe.  At some point the EPTP may define
bits in the same area that must NOT be omitted from the root cache, e.g. the PWL
bits in the EPTP _need_ to be incorporated into is_root_usable().

For simplicity, I'm very, very tempted to say we should just leave the LAM bits
in root.pgd, i.e. force a new root for a CR3+LAM combination.  First and foremost,
that only matters for shadow paging.  Second, unless a guest kernel allows per-thread
LAM settings, the extra checks will be benign.  And AIUI, the proposed kernel
implementation is to apply LAM on a per-MM basis.
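
Concretely, if the LAM bits are left in root.pgd, the CR3 path needs no
special casing at all; the pre-patch logic in kvm_set_cr3() is sufficient
(sketch, assuming vmx_load_mmu_pgd() still ORs the active LAM bits into the
CR3 that is loaded):

	if (cr3 != kvm_read_cr3(vcpu))
		kvm_mmu_new_pgd(vcpu, cr3);	/* a LAM toggle simply selects a new/cached root */
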

And I would much prefer to solve the GFN calculation generically.  E.g. it really
should be something like this

	root_pgd = mmu->get_guest_pgd(vcpu);
	root_gfn = mmu->gpte_to_gfn(root_pgd);

but having to set gpte_to_gfn() in the MMU is quite unfortunate, and gpte_to_gfn()
is technically insufficient for PAE since it relies on previous checks to prevent
consuming a 64-bit CR3.

I was going to suggest extracting the maximal base addr mask and use that, e.g.

	#define __PT_BASE_ADDR_MASK (((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE-1))

Maybe this?

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index c8ebe542c565..8b2d2a6081b3 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3732,7 +3732,12 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
        hpa_t root;
 
        root_pgd = mmu->get_guest_pgd(vcpu);
-       root_gfn = root_pgd >> PAGE_SHIFT;
+
+       /*
+        * The guest PGD has already been checked for validity, unconditionally
+        * strip non-address bits when computing the GFN.
+        */
+       root_gfn = (root_pgd & __PT_BASE_ADDR_MASK) >> PAGE_SHIFT;
 
        if (mmu_check_root(vcpu, root_gfn))
                return 1;
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index cc58631e2336..c0479cbc2ca3 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -21,6 +21,7 @@ extern bool dbg;
 #endif
 
 /* Page table builder macros common to shadow (host) PTEs and guest PTEs. */
+#define __PT_BASE_ADDR_MASK (((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE-1))
 #define __PT_LEVEL_SHIFT(level, bits_per_level)        \
        (PAGE_SHIFT + ((level) - 1) * (bits_per_level))
 #define __PT_INDEX(address, level, bits_per_level) \
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 57f0b75c80f9..0583bfce3b52 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -62,7 +62,7 @@
 #endif
 
 /* Common logic, but per-type values.  These also need to be undefined. */
-#define PT_BASE_ADDR_MASK      ((pt_element_t)(((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE-1)))
+#define PT_BASE_ADDR_MASK      ((pt_element_t)__PT_BASE_ADDR_MASK)
 #define PT_LVL_ADDR_MASK(lvl)  __PT_LVL_ADDR_MASK(PT_BASE_ADDR_MASK, lvl, PT_LEVEL_BITS)
 #define PT_LVL_OFFSET_MASK(lvl)        __PT_LVL_OFFSET_MASK(PT_BASE_ADDR_MASK, lvl, PT_LEVEL_BITS)
 #define PT_INDEX(addr, lvl)    __PT_INDEX(addr, lvl, PT_LEVEL_BITS)
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 1279db2eab44..777f7d443e3b 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -36,7 +36,7 @@ static_assert(SPTE_TDP_AD_ENABLED == 0);
 #ifdef CONFIG_DYNAMIC_PHYSICAL_MASK
 #define SPTE_BASE_ADDR_MASK (physical_mask & ~(u64)(PAGE_SIZE-1))
 #else
-#define SPTE_BASE_ADDR_MASK (((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE-1))
+#define SPTE_BASE_ADDR_MASK __PT_BASE_ADDR_MASK
 #endif
 
 #define SPTE_PERM_MASK (PT_PRESENT_MASK | PT_WRITABLE_MASK | shadow_user_mask \

>  	root_gfn = root_pgd >> PAGE_SHIFT;
>  
>  	if (mmu_check_root(vcpu, root_gfn))
> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index 0f6455072055..57f39c7492ed 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -324,7 +324,7 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
>  	trace_kvm_mmu_pagetable_walk(addr, access);
>  retry_walk:
>  	walker->level = mmu->cpu_role.base.level;
> -	pte           = mmu->get_guest_pgd(vcpu);
> +	pte           = mmu->get_guest_pgd(vcpu) & ~(X86_CR3_LAM_U48 | X86_CR3_LAM_U57);

This should be unnecessary, gpte_to_gfn() is supposed to strip non-address bits.

>  	have_ad       = PT_HAVE_ACCESSED_DIRTY(mmu);
>  
>  #ifdef CONFIG_X86_64
>  	bool pcid_enabled = !!kvm_read_cr4_bits(vcpu, X86_CR4_PCIDE);
>  
> @@ -1254,14 +1265,26 @@ int kvm_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
>  	 * stuff CR3, e.g. for RSM emulation, and there is no guarantee that
>  	 * the current vCPU mode is accurate.
>  	 */
> -	if (kvm_vcpu_is_illegal_gpa(vcpu, cr3))
> +	if (!kvm_vcpu_is_valid_cr3(vcpu, cr3))
>  		return 1;
>  
>  	if (is_pae_paging(vcpu) && !load_pdptrs(vcpu, cr3))
>  		return 1;
>  
> -	if (cr3 != kvm_read_cr3(vcpu))
> -		kvm_mmu_new_pgd(vcpu, cr3);
> +	old_cr3 = kvm_read_cr3(vcpu);
> +	if (cr3 != old_cr3) {
> +		if ((cr3 ^ old_cr3) & ~(X86_CR3_LAM_U48 | X86_CR3_LAM_U57)) {
> +			kvm_mmu_new_pgd(vcpu, cr3 & ~(X86_CR3_LAM_U48 |
> +					X86_CR3_LAM_U57));

As above, no change is needed here if LAM is tracked in the PGD.
Binbin Wu March 20, 2023, 6:57 a.m. UTC | #6
On 3/11/2023 4:12 AM, Sean Christopherson wrote:
> On Mon, Feb 27, 2023, Robert Hoo wrote:
>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>> index 835426254e76..3efec7f8d8c6 100644
>> --- a/arch/x86/kvm/mmu/mmu.c
>> +++ b/arch/x86/kvm/mmu/mmu.c
>> @@ -3699,7 +3699,14 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
>>   	int quadrant, i, r;
>>   	hpa_t root;
>>   
>> -	root_pgd = mmu->get_guest_pgd(vcpu);
>> +	/*
>> +	 * Omit guest_cpuid_has(X86_FEATURE_LAM) check but can unconditionally
>> +	 * strip CR3 LAM bits because they resides in high reserved bits,
>> +	 * with LAM or not, those high bits should be striped anyway when
>> +	 * interpreted to pgd.
>> +	 */
> This misses the most important part: why it's safe to ignore LAM bits when reusing
> a root.
>
>> +	root_pgd = mmu->get_guest_pgd(vcpu) &
>> +		   ~(X86_CR3_LAM_U48 | X86_CR3_LAM_U57);
> Unconditionally clearing LAM bits is unsafe.  At some point the EPTP may define
> bits in the same area that must NOT be omitted from the root cache, e.g. the PWL
> bits in the EPTP _need_ to be incorporated into is_root_usable().

Sean, sorry that I missed the mail when I cooked & sent out the v6 patch of
the series.

You are right that the mmu->get_guest_pgd() could be EPTP, and I do agree it
is not reasonable to unconditionally ignore LAM bits when the value is EPTP,
although PWL bits are not in the high reserved bits.


>
> For simplicity, I'm very, very tempted to say we should just leave the LAM bits
> in root.pgd, i.e. force a new root for a CR3+LAM combination.
Thanks for the suggestion.

If we force a new root for a CR3+LAM combination, one concern is that when
the LAM bits are part of root.pgd, a toggle of the CR3 LAM bit(s) will
trigger a KVM MMU reload. That means for a process created without the LAM
bits set, after it calls prctl to enable LAM, KVM will have to do the MMU
load twice. Do you think that would be a problem?


> First and foremost,
> that only matters for shadow paging.  Second, unless a guest kernel allows per-thread
> LAM settings, KVM the extra checks will be benign.  And AIUI, the proposed kernel
> implementation is to apply LAM on a per-MM basis.

Yes, the proposed Linux kernel implementation is to apply LAM on a per-MM
basis, and currently no interface is provided to disable LAM after enabling
it. However, for KVM, guests are not limited to Linux, and it is not clear
how the LAM feature is enabled in other guest OSes.


>
> And I would much prefer to solve the GFN calculation generically.  E.g. it really
> should be something like this
>
> 	root_pgd = mmu->get_guest_pgd(vcpu);
> 	root_gfn = mmu->gpte_to_gfn(root_pgd);
>
> but having to set gpte_to_gfn() in the MMU is quite unfortunate, and gpte_to_gfn()
> is technically insufficient for PAE since it relies on previous checks to prevent
> consuming a 64-bit CR3.
>
> I was going to suggest extracting the maximal base addr mask and use that, e.g.
>
> 	#define __PT_BASE_ADDR_MASK (((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE-1))
>
> Maybe this?
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index c8ebe542c565..8b2d2a6081b3 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3732,7 +3732,12 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
>          hpa_t root;
>   
>          root_pgd = mmu->get_guest_pgd(vcpu);
> -       root_gfn = root_pgd >> PAGE_SHIFT;
> +
> +       /*
> +        * The guest PGD has already been checked for validity, unconditionally
> +        * strip non-address bits when computing the GFN.
> +        */
> +       root_gfn = (root_pgd & __PT_BASE_ADDR_MASK) >> PAGE_SHIFT;
>   
>          if (mmu_check_root(vcpu, root_gfn))
>                  return 1;
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index cc58631e2336..c0479cbc2ca3 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -21,6 +21,7 @@ extern bool dbg;
>   #endif
>   
>   /* Page table builder macros common to shadow (host) PTEs and guest PTEs. */
> +#define __PT_BASE_ADDR_MASK (((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE-1))
>   #define __PT_LEVEL_SHIFT(level, bits_per_level)        \
>          (PAGE_SHIFT + ((level) - 1) * (bits_per_level))
>   #define __PT_INDEX(address, level, bits_per_level) \
> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index 57f0b75c80f9..0583bfce3b52 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -62,7 +62,7 @@
>   #endif
>   
>   /* Common logic, but per-type values.  These also need to be undefined. */
> -#define PT_BASE_ADDR_MASK      ((pt_element_t)(((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE-1)))
> +#define PT_BASE_ADDR_MASK      ((pt_element_t)__PT_BASE_ADDR_MASK)
>   #define PT_LVL_ADDR_MASK(lvl)  __PT_LVL_ADDR_MASK(PT_BASE_ADDR_MASK, lvl, PT_LEVEL_BITS)
>   #define PT_LVL_OFFSET_MASK(lvl)        __PT_LVL_OFFSET_MASK(PT_BASE_ADDR_MASK, lvl, PT_LEVEL_BITS)
>   #define PT_INDEX(addr, lvl)    __PT_INDEX(addr, lvl, PT_LEVEL_BITS)
> diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> index 1279db2eab44..777f7d443e3b 100644
> --- a/arch/x86/kvm/mmu/spte.h
> +++ b/arch/x86/kvm/mmu/spte.h
> @@ -36,7 +36,7 @@ static_assert(SPTE_TDP_AD_ENABLED == 0);
>   #ifdef CONFIG_DYNAMIC_PHYSICAL_MASK
>   #define SPTE_BASE_ADDR_MASK (physical_mask & ~(u64)(PAGE_SIZE-1))
>   #else
> -#define SPTE_BASE_ADDR_MASK (((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE-1))
> +#define SPTE_BASE_ADDR_MASK __PT_BASE_ADDR_MASK
>   #endif
>   
>   #define SPTE_PERM_MASK (PT_PRESENT_MASK | PT_WRITABLE_MASK | shadow_user_mask \
>
>>   	root_gfn = root_pgd >> PAGE_SHIFT;
>>   
>>   	if (mmu_check_root(vcpu, root_gfn))
>> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
>> index 0f6455072055..57f39c7492ed 100644
>> --- a/arch/x86/kvm/mmu/paging_tmpl.h
>> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
>> @@ -324,7 +324,7 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
>>   	trace_kvm_mmu_pagetable_walk(addr, access);
>>   retry_walk:
>>   	walker->level = mmu->cpu_role.base.level;
>> -	pte           = mmu->get_guest_pgd(vcpu);
>> +	pte           = mmu->get_guest_pgd(vcpu) & ~(X86_CR3_LAM_U48 | X86_CR3_LAM_U57);
> This should be unnecessary, gpte_to_gfn() is supposed to strip non-address bits.
>
>>   	have_ad       = PT_HAVE_ACCESSED_DIRTY(mmu);
>>   
>>   #ifdef CONFIG_X86_64
>>   	bool pcid_enabled = !!kvm_read_cr4_bits(vcpu, X86_CR4_PCIDE);
>>   
>> @@ -1254,14 +1265,26 @@ int kvm_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
>>   	 * stuff CR3, e.g. for RSM emulation, and there is no guarantee that
>>   	 * the current vCPU mode is accurate.
>>   	 */
>> -	if (kvm_vcpu_is_illegal_gpa(vcpu, cr3))
>> +	if (!kvm_vcpu_is_valid_cr3(vcpu, cr3))
>>   		return 1;
>>   
>>   	if (is_pae_paging(vcpu) && !load_pdptrs(vcpu, cr3))
>>   		return 1;
>>   
>> -	if (cr3 != kvm_read_cr3(vcpu))
>> -		kvm_mmu_new_pgd(vcpu, cr3);
>> +	old_cr3 = kvm_read_cr3(vcpu);
>> +	if (cr3 != old_cr3) {
>> +		if ((cr3 ^ old_cr3) & ~(X86_CR3_LAM_U48 | X86_CR3_LAM_U57)) {
>> +			kvm_mmu_new_pgd(vcpu, cr3 & ~(X86_CR3_LAM_U48 |
>> +					X86_CR3_LAM_U57));
> As above, no change is needed here if LAM is tracked in the PGD.

Patch

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 6bdaacb6faa0..866f2b7cb509 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -142,6 +142,11 @@  static inline unsigned long kvm_get_active_pcid(struct kvm_vcpu *vcpu)
 	return kvm_get_pcid(vcpu, kvm_read_cr3(vcpu));
 }
 
+static inline u64 kvm_get_active_lam(struct kvm_vcpu *vcpu)
+{
+	return kvm_read_cr3(vcpu) & (X86_CR3_LAM_U48 | X86_CR3_LAM_U57);
+}
+
 static inline void kvm_mmu_load_pgd(struct kvm_vcpu *vcpu)
 {
 	u64 root_hpa = vcpu->arch.mmu->root.hpa;
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 835426254e76..3efec7f8d8c6 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3699,7 +3699,14 @@  static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
 	int quadrant, i, r;
 	hpa_t root;
 
-	root_pgd = mmu->get_guest_pgd(vcpu);
+	/*
+	 * Omit guest_cpuid_has(X86_FEATURE_LAM) check but can unconditionally
+	 * strip CR3 LAM bits because they resides in high reserved bits,
+	 * with LAM or not, those high bits should be striped anyway when
+	 * interpreted to pgd.
+	 */
+	root_pgd = mmu->get_guest_pgd(vcpu) &
+		   ~(X86_CR3_LAM_U48 | X86_CR3_LAM_U57);
 	root_gfn = root_pgd >> PAGE_SHIFT;
 
 	if (mmu_check_root(vcpu, root_gfn))
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 0f6455072055..57f39c7492ed 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -324,7 +324,7 @@  static int FNAME(walk_addr_generic)(struct guest_walker *walker,
 	trace_kvm_mmu_pagetable_walk(addr, access);
 retry_walk:
 	walker->level = mmu->cpu_role.base.level;
-	pte           = mmu->get_guest_pgd(vcpu);
+	pte           = mmu->get_guest_pgd(vcpu) & ~(X86_CR3_LAM_U48 | X86_CR3_LAM_U57);
 	have_ad       = PT_HAVE_ACCESSED_DIRTY(mmu);
 
 #if PTTYPE == 64
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index b6f4411b613e..301912155dd1 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -1078,7 +1078,7 @@  static int nested_vmx_load_cr3(struct kvm_vcpu *vcpu, unsigned long cr3,
 			       bool nested_ept, bool reload_pdptrs,
 			       enum vm_entry_failure_code *entry_failure_code)
 {
-	if (CC(kvm_vcpu_is_illegal_gpa(vcpu, cr3))) {
+	if (CC(!kvm_vcpu_is_valid_cr3(vcpu, cr3))) {
 		*entry_failure_code = ENTRY_FAIL_DEFAULT;
 		return -EINVAL;
 	}
@@ -2906,7 +2906,7 @@  static int nested_vmx_check_host_state(struct kvm_vcpu *vcpu,
 
 	if (CC(!nested_host_cr0_valid(vcpu, vmcs12->host_cr0)) ||
 	    CC(!nested_host_cr4_valid(vcpu, vmcs12->host_cr4)) ||
-	    CC(kvm_vcpu_is_illegal_gpa(vcpu, vmcs12->host_cr3)))
+	    CC(!kvm_vcpu_is_valid_cr3(vcpu, vmcs12->host_cr3)))
 		return -EINVAL;
 
 	if (CC(is_noncanonical_address(vmcs12->host_ia32_sysenter_esp, vcpu)) ||
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index fe5615fd8295..66edd091f145 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3289,7 +3289,8 @@  static void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
 			update_guest_cr3 = false;
 		vmx_ept_load_pdptrs(vcpu);
 	} else {
-		guest_cr3 = root_hpa | kvm_get_active_pcid(vcpu);
+		guest_cr3 = root_hpa | kvm_get_active_pcid(vcpu) |
+			    kvm_get_active_lam(vcpu);
 	}
 
 	if (update_guest_cr3)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b9611690561d..be6c7c986f0f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1231,10 +1231,21 @@  static void kvm_invalidate_pcid(struct kvm_vcpu *vcpu, unsigned long pcid)
 	kvm_mmu_free_roots(vcpu->kvm, mmu, roots_to_free);
 }
 
+bool kvm_vcpu_is_valid_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
+{
+	if (guest_cpuid_has(vcpu, X86_FEATURE_LAM))
+		cr3 &= ~(X86_CR3_LAM_U48 | X86_CR3_LAM_U57);
+	else if (cr3 & (X86_CR3_LAM_U48 | X86_CR3_LAM_U57))
+		return false;
+
+	return kvm_vcpu_is_legal_gpa(vcpu, cr3);
+}
+EXPORT_SYMBOL_GPL(kvm_vcpu_is_valid_cr3);
+
 int kvm_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
 {
 	bool skip_tlb_flush = false;
-	unsigned long pcid = 0;
+	unsigned long pcid = 0, old_cr3;
 #ifdef CONFIG_X86_64
 	bool pcid_enabled = !!kvm_read_cr4_bits(vcpu, X86_CR4_PCIDE);
 
@@ -1254,14 +1265,26 @@  int kvm_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
 	 * stuff CR3, e.g. for RSM emulation, and there is no guarantee that
 	 * the current vCPU mode is accurate.
 	 */
-	if (kvm_vcpu_is_illegal_gpa(vcpu, cr3))
+	if (!kvm_vcpu_is_valid_cr3(vcpu, cr3))
 		return 1;
 
 	if (is_pae_paging(vcpu) && !load_pdptrs(vcpu, cr3))
 		return 1;
 
-	if (cr3 != kvm_read_cr3(vcpu))
-		kvm_mmu_new_pgd(vcpu, cr3);
+	old_cr3 = kvm_read_cr3(vcpu);
+	if (cr3 != old_cr3) {
+		if ((cr3 ^ old_cr3) & ~(X86_CR3_LAM_U48 | X86_CR3_LAM_U57)) {
+			kvm_mmu_new_pgd(vcpu, cr3 & ~(X86_CR3_LAM_U48 |
+					X86_CR3_LAM_U57));
+		} else {
+			/*
+			 * Though only LAM bits change, mark the
+			 * request in order to force an update on guest CR3
+			 * because the LAM bits of old CR3 is stale
+			 */
+			kvm_make_request(KVM_REQ_LOAD_MMU_PGD, vcpu);
+		}
+	}
 
 	vcpu->arch.cr3 = cr3;
 	kvm_register_mark_dirty(vcpu, VCPU_EXREG_CR3);
@@ -11185,7 +11208,7 @@  static bool kvm_is_valid_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
 		 */
 		if (!(sregs->cr4 & X86_CR4_PAE) || !(sregs->efer & EFER_LMA))
 			return false;
-		if (kvm_vcpu_is_illegal_gpa(vcpu, sregs->cr3))
+		if (!kvm_vcpu_is_valid_cr3(vcpu, sregs->cr3))
 			return false;
 	} else {
 		/*
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 8ec5cc983062..6b6bfddc84e0 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -440,6 +440,7 @@  void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu);
 void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu);
 int kvm_spec_ctrl_test_value(u64 value);
 bool __kvm_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4);
+bool kvm_vcpu_is_valid_cr3(struct kvm_vcpu *vcpu, unsigned long cr3);
 int kvm_handle_memory_failure(struct kvm_vcpu *vcpu, int r,
 			      struct x86_exception *e);
 int kvm_handle_invpcid(struct kvm_vcpu *vcpu, unsigned long type, gva_t gva);