diff mbox series

[v2,5/6] KVM: x86: Keep a per-VM MTRR state

Message ID 20230509135300.1855-1-yan.y.zhao@intel.com (mailing list archive)
State New, archived
Headers show
Series KVM: x86/mmu: refine memtype related mmu zap | expand

Commit Message

Yan Zhao May 9, 2023, 1:53 p.m. UTC
Keep a per-VM MTRR state and point it to the MTRR state of vCPU 0.

This is a preparation patch for KVM to reference a per-VM guest MTRR
to decide memory type of EPT leaf entries when noncoherent DMA is present.

Though each vCPU has its own MTRR state, MTRR states should be
consistent across each VM, which is demanded as in Intel's SDM
"In a multiprocessor system using a processor in the P6 family or a more
recent family, each processor MUST use the identical MTRR memory map so
that software will have a consistent view of memory."

Therefore, when memory type of EPT leaf entry needs to honor guest MTRR,
a per-VM version of guest MTRR can be referenced.

Each vCPU still has its own MTRR state field to keep guest rdmsr()
returning the right value when there's lag of MTRR update for each vCPU.

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/include/asm/kvm_host.h |  3 +++
 arch/x86/kvm/mtrr.c             | 22 ++++++++++++++++++++++
 arch/x86/kvm/x86.c              |  2 ++
 arch/x86/kvm/x86.h              |  2 ++
 4 files changed, 29 insertions(+)

Comments

David Matlack May 10, 2023, 5:23 p.m. UTC | #1
On Tue, May 09, 2023 at 09:53:00PM +0800, Yan Zhao wrote:
> +void kvm_mtrr_init(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm *kvm = vcpu->kvm;
> +
> +	if (vcpu->vcpu_id)
> +		return;

vcpu_id is provided by userspace so I don't think there's any guarantee
that a vCPU with vcpu_id == 0 exists.
Robert Hoo May 21, 2023, 3:44 a.m. UTC | #2
On 5/9/2023 9:53 PM, Yan Zhao wrote:
> Keep a per-VM MTRR state and point it to the MTRR state of vCPU 0.
> 
> This is a preparation patch for KVM to reference a per-VM guest MTRR
> to decide memory type of EPT leaf entries when noncoherent DMA is present.
> 
> Though each vCPU has its own MTRR state, MTRR states should be
> consistent across each VM, which is demanded as in Intel's SDM
> "In a multiprocessor system using a processor in the P6 family or a more
> recent family, each processor MUST use the identical MTRR memory map so
> that software will have a consistent view of memory."
> 
> Therefore, when memory type of EPT leaf entry needs to honor guest MTRR,
> a per-VM version of guest MTRR can be referenced.
> 
> Each vCPU still has its own MTRR state field to keep guest rdmsr()
> returning the right value when there's lag of MTRR update for each vCPU.
> 
Can we get rid of per-vCPU MTRR state copies and just have this per-VM 
state only? therefore can simplify implementation and avoid hazard of 
inconsistency among per-VPU MTRR states.

I see in SDM, it notes:
"In multiple processor systems, the operating system must maintain MTRR 
consistency between all the processors in the system (that is, all 
processors must use the same MTRR values). The P6 and more recent processor 
families provide no hardware support for maintaining this consistency."

leaving each vCPU's MTRR is just to fully mimic HW?
Yan Zhao May 23, 2023, 6:21 a.m. UTC | #3
On Sun, May 21, 2023 at 11:44:36AM +0800, Robert Hoo wrote:
> On 5/9/2023 9:53 PM, Yan Zhao wrote:
> > Keep a per-VM MTRR state and point it to the MTRR state of vCPU 0.
> > 
> > This is a preparation patch for KVM to reference a per-VM guest MTRR
> > to decide memory type of EPT leaf entries when noncoherent DMA is present.
> > 
> > Though each vCPU has its own MTRR state, MTRR states should be
> > consistent across each VM, which is demanded as in Intel's SDM
> > "In a multiprocessor system using a processor in the P6 family or a more
> > recent family, each processor MUST use the identical MTRR memory map so
> > that software will have a consistent view of memory."
> > 
> > Therefore, when memory type of EPT leaf entry needs to honor guest MTRR,
> > a per-VM version of guest MTRR can be referenced.
> > 
> > Each vCPU still has its own MTRR state field to keep guest rdmsr()
> > returning the right value when there's lag of MTRR update for each vCPU.
> > 
> Can we get rid of per-vCPU MTRR state copies and just have this per-VM state
> only? therefore can simplify implementation and avoid hazard of
> inconsistency among per-VPU MTRR states.
> 
> I see in SDM, it notes:
> "In multiple processor systems, the operating system must maintain MTRR
> consistency between all the processors in the system (that is, all
> processors must use the same MTRR values). The P6 and more recent processor
> families provide no hardware support for maintaining this consistency."
> 
> leaving each vCPU's MTRR is just to fully mimic HW?
>
Yes, leaving each vCPU's MTRR to mimic HW.

As also suggested in SDM, the guest OS manipulates MTRRs in this way:

for each online CPUs {
	disable MTRR
	update fixed/var MTRR ranges
	enable MTRR
}
Guest OS needs to access memory only after this full pattern.

So, I think there should not have "hazard of inconsistency among per-VPU MTRR
states".

I want to have per-VM MTRR state is because I want to reduce unnessary EPT
zap, which costs quite a lot cpu cycles even when the EPT is empty.

In this patch, per-VM MTRR pointer is used to point to vCPU 0's MTRR state,
so that it can save some memory to keep the MTRR state.
But I found out that it would only work when vCPU 0 (boot processor) is
always online (which is not true for x86 under some configration).

I'll try to find out lowest online vCPU and keep a per-VM copy of MTRR state
in next version.

Thanks!
Sean Christopherson May 24, 2023, 12:13 a.m. UTC | #4
On Tue, May 23, 2023, Yan Zhao wrote:
> On Sun, May 21, 2023 at 11:44:36AM +0800, Robert Hoo wrote:
> > On 5/9/2023 9:53 PM, Yan Zhao wrote:
> > > Keep a per-VM MTRR state and point it to the MTRR state of vCPU 0.
> > > 
> > > This is a preparation patch for KVM to reference a per-VM guest MTRR
> > > to decide memory type of EPT leaf entries when noncoherent DMA is present.
> > > 
> > > Though each vCPU has its own MTRR state, MTRR states should be
> > > consistent across each VM, which is demanded as in Intel's SDM
> > > "In a multiprocessor system using a processor in the P6 family or a more
> > > recent family, each processor MUST use the identical MTRR memory map so
> > > that software will have a consistent view of memory."
> > > 
> > > Therefore, when memory type of EPT leaf entry needs to honor guest MTRR,
> > > a per-VM version of guest MTRR can be referenced.
> > > 
> > > Each vCPU still has its own MTRR state field to keep guest rdmsr()
> > > returning the right value when there's lag of MTRR update for each vCPU.
> > > 
> > Can we get rid of per-vCPU MTRR state copies and just have this per-VM state
> > only? therefore can simplify implementation and avoid hazard of
> > inconsistency among per-VPU MTRR states.
> > 
> > I see in SDM, it notes:
> > "In multiple processor systems, the operating system must maintain MTRR
> > consistency between all the processors in the system (that is, all
> > processors must use the same MTRR values). The P6 and more recent processor
> > families provide no hardware support for maintaining this consistency."
> > 
> > leaving each vCPU's MTRR is just to fully mimic HW?
> >
> Yes, leaving each vCPU's MTRR to mimic HW.
> 
> As also suggested in SDM, the guest OS manipulates MTRRs in this way:
> 
> for each online CPUs {
> 	disable MTRR
> 	update fixed/var MTRR ranges
> 	enable MTRR
> }
> Guest OS needs to access memory only after this full pattern.

FWIW, that Linux doesn't use that approach.  Linux instead only puts the cache
into no-fill mode (CR0.CD=1) when modifying MTRRs.  OVMF does both (and apparently
doesn't optimize for self-snoop?).

> So, I think there should not have "hazard of inconsistency among per-VPU MTRR
> states".

Probably not, but despite the SDM's assertion that software "MUST" keep things
consistent, that's not actually reality.  E.g. a careful kernel that partitions
memory doesn't need to strictly keep MTRRs consistent.  Ditto for scenarios where
CPUs are offline (for a variety of definitions of "offline"), in which case only
online CPUs need to have consistent MTRRs, e.g. before APs are brought up, MTRRs
are obviously inconsistent.

> I want to have per-VM MTRR state is because I want to reduce unnessary EPT
> zap, which costs quite a lot cpu cycles even when the EPT is empty.
> 
> In this patch, per-VM MTRR pointer is used to point to vCPU 0's MTRR state,
> so that it can save some memory to keep the MTRR state.
> But I found out that it would only work when vCPU 0 (boot processor) is
> always online (which is not true for x86 under some configration).
> 
> I'll try to find out lowest online vCPU

Picking a single vCPU will always be subject to edge cases.  E.g. I can see something
like kexec() "offlining" KVM's chosen vCPU and then having problems because KVM
ignores MTRR updates from other vCPUs in the new kernel.

One idea would be to let all vCPUs write the per-VM state, and zap if and only if
the MTRRs are actually different.  But as above, I'm on the fence about diverging
from what hardware actually does with respect to MTRRs.

Argh, stupid quirks.  I was going to suggest we take advantage of the guest needing
to either disable MTRRs or put the cache into no-fill mode to avoid zapping SPTEs,
i.e. skip the zap if CR0.CD=1, but KVM's implementation of the quirk is all kinds
of messed up.  KVM ignores MTRRs and guest PAT when CR0.CD=1, but then doesn't zap
when CR0.CD is cleared:

	if (((cr0 ^ old_cr0) & X86_CR0_CD) &&
	    kvm_mmu_honors_guest_mtrrs(vcpu->kvm) &&
	    !kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
		kvm_zap_gfn_range(vcpu->kvm, 0, ~0ULL);

which means that either KVM is relying on later MTRR changes to zap the bogus
WB SPTEs, or more likely things "work" because KVM doesn't install SPTEs for the
ranges that end up being DMA'd while CR0.CD is set.

*sigh*  What an absolutely asinine quirk.  KVM should never have taken it on to
workaround OVMF's stupidity.  Good gravy, that thing was even Cc'd stable@.  Yeesh.

Aha!  Idea.  There's should be no need to ignore guest PAT when CR0.CD=1 and the
quirk is enabled.  Even OVMF must be smart enough to map its code as WB in its page
tables.  And there is no need to zap SPTEs when CR0.CD is _set_, because SPTEs
created while CR0.CD=0 would have honored MTRRs.  Lastly, DMA when CR0.CD=1 and
the quirk is enabled simply can't work since KVM historically ignores everything
from the guest and maps all memory WB.

So, if KVM zaps SPTEs when CR0.CD is cleared (even when the quirk is enabled),
then KVM can skip the MTRR zaps when CR0.CD=1 because KVM is ignoring the MTRRs
and will zap when CR0.CD is cleared.  And to avoid regressing the CR0.CD case,
if KVM honors guest PAT for the bizarro CR0.CD quirk, then KVM only needs to
zap non-WB MTRR ranges when CR0.CD is cleared.  Since WB is weak, looking for
non-WB MTRR ranges doesn't need to actually resolve the memtype, i.e. can be simple
and just
process MTRRs one by one.

Did that make sense?  Minus the code to identify non-WB MTRR ranges (and much
needed comments), the below is what I'm thinking.  If more intelligent zapping
provides the desired performance improvements, then I think/hope we avoid trying
to play games with MTRRs.

---
 arch/x86/kvm/mtrr.c    | 19 +++++++++++++++++++
 arch/x86/kvm/vmx/vmx.c |  8 ++------
 arch/x86/kvm/x86.c     |  6 ++----
 3 files changed, 23 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
index a67c28a56417..e700c230268b 100644
--- a/arch/x86/kvm/mtrr.c
+++ b/arch/x86/kvm/mtrr.c
@@ -323,6 +323,9 @@ static void update_mtrr(struct kvm_vcpu *vcpu, u32 msr)
 	if (!kvm_mmu_honors_guest_mtrrs(vcpu->kvm))
 		return;
 
+	if (kvm_is_cr0_bit_set(vcpu, X86_CR0_CD))
+		return;
+
 	if (!mtrr_is_enabled(mtrr_state) && msr != MSR_MTRRdefType)
 		return;
 
@@ -375,6 +378,22 @@ static void set_var_mtrr_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 	}
 }
 
+void kvm_mtrr_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
+{
+	if (cr0 & X86_CR0_CD)
+		return;
+
+	if (!kvm_mmu_honors_guest_mtrrs(vcpu->kvm))
+		return;
+
+	if (!kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED)) {
+		kvm_zap_gfn_range(vcpu->kvm, 0, ~0ULL);
+		return;
+	}
+
+	<zap non-WB memory>;
+}
+
 int kvm_mtrr_set_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 {
 	int index;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 2d9d155691a7..962abc17afc0 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7502,8 +7502,6 @@ static int vmx_vm_init(struct kvm *kvm)
 
 static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
 {
-	u8 cache;
-
 	/* We wanted to honor guest CD/MTRR/PAT, but doing so could result in
 	 * memory aliases with conflicting memory types and sometimes MCEs.
 	 * We have to be careful as to what are honored and when.
@@ -7530,11 +7528,9 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
 
 	if (kvm_read_cr0_bits(vcpu, X86_CR0_CD)) {
 		if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
-			cache = MTRR_TYPE_WRBACK;
+			return MTRR_TYPE_WRBACK;
 		else
-			cache = MTRR_TYPE_UNCACHABLE;
-
-		return (cache << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
+			return (MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
 	}
 
 	return kvm_mtrr_get_guest_memory_type(vcpu, gfn) << VMX_EPT_MT_EPTE_SHIFT;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 977dceb7ba7e..3c085b6f9e3c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -941,10 +941,8 @@ void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned lon
 	if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS)
 		kvm_mmu_reset_context(vcpu);
 
-	if (((cr0 ^ old_cr0) & X86_CR0_CD) &&
-	    kvm_mmu_honors_guest_mtrrs(vcpu->kvm) &&
-	    !kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
-		kvm_zap_gfn_range(vcpu->kvm, 0, ~0ULL);
+	if ((cr0 ^ old_cr0) & X86_CR0_CD))
+		kvm_mtrr_post_set_cr0(vcpu, cr0);
 }
 EXPORT_SYMBOL_GPL(kvm_post_set_cr0);
 

base-commit: 7e998f35c57cb0a2c35f8545b54ab5f4f5a83841
--
Yan Zhao May 24, 2023, 11:03 a.m. UTC | #5
On Tue, May 23, 2023 at 05:13:38PM -0700, Sean Christopherson wrote:
> On Tue, May 23, 2023, Yan Zhao wrote:
> > On Sun, May 21, 2023 at 11:44:36AM +0800, Robert Hoo wrote:
> > > On 5/9/2023 9:53 PM, Yan Zhao wrote:
> > > > Keep a per-VM MTRR state and point it to the MTRR state of vCPU 0.
> > > > 
> > > > This is a preparation patch for KVM to reference a per-VM guest MTRR
> > > > to decide memory type of EPT leaf entries when noncoherent DMA is present.
> > > > 
> > > > Though each vCPU has its own MTRR state, MTRR states should be
> > > > consistent across each VM, which is demanded as in Intel's SDM
> > > > "In a multiprocessor system using a processor in the P6 family or a more
> > > > recent family, each processor MUST use the identical MTRR memory map so
> > > > that software will have a consistent view of memory."
> > > > 
> > > > Therefore, when memory type of EPT leaf entry needs to honor guest MTRR,
> > > > a per-VM version of guest MTRR can be referenced.
> > > > 
> > > > Each vCPU still has its own MTRR state field to keep guest rdmsr()
> > > > returning the right value when there's lag of MTRR update for each vCPU.
> > > > 
> > > Can we get rid of per-vCPU MTRR state copies and just have this per-VM state
> > > only? therefore can simplify implementation and avoid hazard of
> > > inconsistency among per-VPU MTRR states.
> > > 
> > > I see in SDM, it notes:
> > > "In multiple processor systems, the operating system must maintain MTRR
> > > consistency between all the processors in the system (that is, all
> > > processors must use the same MTRR values). The P6 and more recent processor
> > > families provide no hardware support for maintaining this consistency."
> > > 
> > > leaving each vCPU's MTRR is just to fully mimic HW?
> > >
> > Yes, leaving each vCPU's MTRR to mimic HW.
> > 
> > As also suggested in SDM, the guest OS manipulates MTRRs in this way:
> > 
> > for each online CPUs {
> > 	disable MTRR
> > 	update fixed/var MTRR ranges
> > 	enable MTRR
> > }
> > Guest OS needs to access memory only after this full pattern.
> 
> FWIW, that Linux doesn't use that approach.  Linux instead only puts the cache
> into no-fill mode (CR0.CD=1) when modifying MTRRs.  OVMF does both (and apparently
> doesn't optimize for self-snoop?).
I think Linux also follows this patten.
This is the call trace I found out in my environment.
cache_cpu_init
    cache_disable
        write_cr0 to CD=1, NW=0
        mtrr_disable
    mtrr_generic_set_state
        mtrr_wrmsr to fixed/var ranges
    cache_enable
        mtrr_enable
        write_cr0(read_cr0() & ~X86_CR0_CD);

For PAT not enabled in guest,
arch_phys_wc_add
    mtrr_add
        mtrr_add_page
            set_mtrr_cpuslocked
                mtrr_rendezvous_handler on each online cpu
                    generic_set_mtrr
                        cache_disable
                            write_cr0 to CD=1, NW=0
                            mtrr_disable
                        mtrr_wrmsr
                        cache_enable


For OVMF and Seabios, I dumped in host to observe their MTRR behaviors:
1. OVMF updates MTRR in below sequence
(1) vCPU 0
    CR0.CD=1
    MTRR disable
    MTRR update
    MTRR enable
    CR0.CD=0
(2) vCPU 1-n
    CR0.CD=1
    MTRR disable
(3) vCPU 1-n    
    MTRR update
(4) vCPU 1-n
    MTRR enable
    CR0.CD=0

2. Seabios can update MTRRs both when CR0.CD=0 and when CR0.CD=1 and follows below
sequence in each CPU
    MTRR disable
    MTRR update
    MTRR enable
I found it can update MTRRs even when CR0.CD=0 to below values:
MSR_MTRRfix16K_80000(0x258): 0x606060606060606
MSR_MTRRfix4K_C0000(0x268): 0x505050505050505
...
MSR_MTRRfix4K_F8000(0x26f): 0x505050505050505
MTRRphysBase_MSR(0): 0xc0000000
MTRRphysMask_MSR(0): 0x3fffc0000800

> 
> > So, I think there should not have "hazard of inconsistency among per-VPU MTRR
> > states".
> 
> Probably not, but despite the SDM's assertion that software "MUST" keep things
> consistent, that's not actually reality.  E.g. a careful kernel that partitions
> memory doesn't need to strictly keep MTRRs consistent.  Ditto for scenarios where
> CPUs are offline (for a variety of definitions of "offline"), in which case only
> online CPUs need to have consistent MTRRs, e.g. before APs are brought up, MTRRs
> are obviously inconsistent.
>
Yes. By "no inconsistency", I mean online vCPUs(who trigger active memory accesses)
need to all in a state that their MTRRs are consistent.
Offline vCPUs, as they will not access memory, it doesn't matter.
But once they are back to online, their MTRRs need to be in sync with online vCPUs.

> > I want to have per-VM MTRR state is because I want to reduce unnessary EPT
> > zap, which costs quite a lot cpu cycles even when the EPT is empty.
> > 
> > In this patch, per-VM MTRR pointer is used to point to vCPU 0's MTRR state,
> > so that it can save some memory to keep the MTRR state.
> > But I found out that it would only work when vCPU 0 (boot processor) is
> > always online (which is not true for x86 under some configration).
> > 
> > I'll try to find out lowest online vCPU
> 
> Picking a single vCPU will always be subject to edge cases.  E.g. I can see something
> like kexec() "offlining" KVM's chosen vCPU and then having problems because KVM
> ignores MTRR updates from other vCPUs in the new kernel.
>
Not familiar with kexec().
I wanted to trap APIC_SPIV and finding the lowest online vCPU id by checking
apic->sw_enabled status. Then only MTRR state of vCPU whose id equals to
the lowest online vCPU id can be written to per-VM MTRR state.
Will that work for kexec()?

> One idea would be to let all vCPUs write the per-VM state, and zap if and only if
> the MTRRs are actually different.  But as above, I'm on the fence about diverging
> from what hardware actually does with respect to MTRRs.
> 
> Argh, stupid quirks.  I was going to suggest we take advantage of the guest needing
> to either disable MTRRs or put the cache into no-fill mode to avoid zapping SPTEs,
> i.e. skip the zap if CR0.CD=1, but KVM's implementation of the quirk is all kinds
> of messed up.  KVM ignores MTRRs and guest PAT when CR0.CD=1, but then doesn't zap
> when CR0.CD is cleared:
> 
> 	if (((cr0 ^ old_cr0) & X86_CR0_CD) &&
> 	    kvm_mmu_honors_guest_mtrrs(vcpu->kvm) &&
> 	    !kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
> 		kvm_zap_gfn_range(vcpu->kvm, 0, ~0ULL);
> 
> which means that either KVM is relying on later MTRR changes to zap the bogus
> WB SPTEs, or more likely things "work" because KVM doesn't install SPTEs for the
> ranges that end up being DMA'd while CR0.CD is set.
> 
> *sigh*  What an absolutely asinine quirk.  KVM should never have taken it on to
> workaround OVMF's stupidity.  Good gravy, that thing was even Cc'd stable@.  Yeesh.
> 
> Aha!  Idea.  There's should be no need to ignore guest PAT when CR0.CD=1 and the
> quirk is enabled.  Even OVMF must be smart enough to map its code as WB in its page
> tables.  And there is no need to zap SPTEs when CR0.CD is _set_, because SPTEs
> created while CR0.CD=0 would have honored MTRRs.  Lastly, DMA when CR0.CD=1 and
> the quirk is enabled simply can't work since KVM historically ignores everything
> from the guest and maps all memory WB.
> 
> So, if KVM zaps SPTEs when CR0.CD is cleared (even when the quirk is enabled),
> then KVM can skip the MTRR zaps when CR0.CD=1 because KVM is ignoring the MTRRs
> and will zap when CR0.CD is cleared.  And to avoid regressing the CR0.CD case,
> if KVM honors guest PAT for the bizarro CR0.CD quirk, then KVM only needs to
> zap non-WB MTRR ranges when CR0.CD is cleared.  Since WB is weak, looking for
> non-WB MTRR ranges doesn't need to actually resolve the memtype, i.e. can be simple
> and just
> process MTRRs one by one.
> 
> Did that make sense?  Minus the code to identify non-WB MTRR ranges (and much
> needed comments), the below is what I'm thinking.  If more intelligent zapping
> provides the desired performance improvements, then I think/hope we avoid trying
> to play games with MTRRs.
> 
> ---
>  arch/x86/kvm/mtrr.c    | 19 +++++++++++++++++++
>  arch/x86/kvm/vmx/vmx.c |  8 ++------
>  arch/x86/kvm/x86.c     |  6 ++----
>  3 files changed, 23 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
> index a67c28a56417..e700c230268b 100644
> --- a/arch/x86/kvm/mtrr.c
> +++ b/arch/x86/kvm/mtrr.c
> @@ -323,6 +323,9 @@ static void update_mtrr(struct kvm_vcpu *vcpu, u32 msr)
>  	if (!kvm_mmu_honors_guest_mtrrs(vcpu->kvm))
>  		return;
>  
> +	if (kvm_is_cr0_bit_set(vcpu, X86_CR0_CD))
> +		return;
This will always make update_mtrr() return here for Linux and OVMF. 
> +
>  	if (!mtrr_is_enabled(mtrr_state) && msr != MSR_MTRRdefType)
>  		return;
>  
> @@ -375,6 +378,22 @@ static void set_var_mtrr_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data)
>  	}
>  }
>  
> +void kvm_mtrr_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
> +{
> +	if (cr0 & X86_CR0_CD)
> +		return;
> +
> +	if (!kvm_mmu_honors_guest_mtrrs(vcpu->kvm))
> +		return;
> +
> +	if (!kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED)) {
> +		kvm_zap_gfn_range(vcpu->kvm, 0, ~0ULL);
> +		return;
> +	}
> +
> +	<zap non-WB memory>;
This zap looks will happen on each vCPU. Then only half of zaps are
saved compared to the original count of zaps in update_mtrr().
And pieces of no-WB memory might produce more kvm_zap_gfn_range()?


> +}
> +
>  int kvm_mtrr_set_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data)
>  {
>  	int index;
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 2d9d155691a7..962abc17afc0 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7502,8 +7502,6 @@ static int vmx_vm_init(struct kvm *kvm)
>  
>  static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
>  {
> -	u8 cache;
> -
>  	/* We wanted to honor guest CD/MTRR/PAT, but doing so could result in
>  	 * memory aliases with conflicting memory types and sometimes MCEs.
>  	 * We have to be careful as to what are honored and when.
> @@ -7530,11 +7528,9 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
>  
>  	if (kvm_read_cr0_bits(vcpu, X86_CR0_CD)) {
>  		if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
> -			cache = MTRR_TYPE_WRBACK;
> +			return MTRR_TYPE_WRBACK;
>  		else
> -			cache = MTRR_TYPE_UNCACHABLE;
> -
> -		return (cache << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
> +			return (MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
>  	}
>  
>  	return kvm_mtrr_get_guest_memory_type(vcpu, gfn) << VMX_EPT_MT_EPTE_SHIFT;
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 977dceb7ba7e..3c085b6f9e3c 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -941,10 +941,8 @@ void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned lon
>  	if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS)
>  		kvm_mmu_reset_context(vcpu);
>  
> -	if (((cr0 ^ old_cr0) & X86_CR0_CD) &&
> -	    kvm_mmu_honors_guest_mtrrs(vcpu->kvm) &&
> -	    !kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
> -		kvm_zap_gfn_range(vcpu->kvm, 0, ~0ULL);
> +	if ((cr0 ^ old_cr0) & X86_CR0_CD))
> +		kvm_mtrr_post_set_cr0(vcpu, cr0);
>  }
>  EXPORT_SYMBOL_GPL(kvm_post_set_cr0);
>  
> 
> base-commit: 7e998f35c57cb0a2c35f8545b54ab5f4f5a83841
> -- 
>
Sean Christopherson May 24, 2023, 6:21 p.m. UTC | #6
On Wed, May 24, 2023, Yan Zhao wrote:
> On Tue, May 23, 2023 at 05:13:38PM -0700, Sean Christopherson wrote:
> > On Tue, May 23, 2023, Yan Zhao wrote:
> > > As also suggested in SDM, the guest OS manipulates MTRRs in this way:
> > > 
> > > for each online CPUs {
> > > 	disable MTRR
> > > 	update fixed/var MTRR ranges
> > > 	enable MTRR
> > > }
> > > Guest OS needs to access memory only after this full pattern.
> > 
> > FWIW, that Linux doesn't use that approach.  Linux instead only puts the cache
> > into no-fill mode (CR0.CD=1) when modifying MTRRs.  OVMF does both (and apparently
> > doesn't optimize for self-snoop?).
> I think Linux also follows this patten.
> This is the call trace I found out in my environment.
> cache_cpu_init
>     cache_disable
>         write_cr0 to CD=1, NW=0
>         mtrr_disable

Huh, I somehow missed this call to mtrr_disable().  I distinctly remember looking
at this helper, no idea how I missed that.

>     mtrr_generic_set_state
>         mtrr_wrmsr to fixed/var ranges
>     cache_enable
>         mtrr_enable
>         write_cr0(read_cr0() & ~X86_CR0_CD);
> 

...

> > Picking a single vCPU will always be subject to edge cases.  E.g. I can see something
> > like kexec() "offlining" KVM's chosen vCPU and then having problems because KVM
> > ignores MTRR updates from other vCPUs in the new kernel.
> >
> Not familiar with kexec().
> I wanted to trap APIC_SPIV and finding the lowest online vCPU id by checking
> apic->sw_enabled status. Then only MTRR state of vCPU whose id equals to
> the lowest online vCPU id can be written to per-VM MTRR state.
> Will that work for kexec()?

kexec() allows "booting" into a new kernel without transitioning through BIOS
and without doing a full reboot.  The scenario I'm worried about is if the new
kernel offlines KVM's chosen CPU for whatever reason and also modifies MTRRs.  I
think it's an extremely unlikely scenario, but my point is that selecting a single
vCPU to control the MTRRs works if and only if that vCPU stays online for the
entire lifetime of the VM, which KVM can't guarantee.

> > One idea would be to let all vCPUs write the per-VM state, and zap if and only if
> > the MTRRs are actually different.  But as above, I'm on the fence about diverging
> > from what hardware actually does with respect to MTRRs.

...

> > So, if KVM zaps SPTEs when CR0.CD is cleared (even when the quirk is enabled),
> > then KVM can skip the MTRR zaps when CR0.CD=1 because KVM is ignoring the MTRRs
> > and will zap when CR0.CD is cleared.  And to avoid regressing the CR0.CD case,
> > if KVM honors guest PAT for the bizarro CR0.CD quirk, then KVM only needs to
> > zap non-WB MTRR ranges when CR0.CD is cleared.  Since WB is weak, looking for
> > non-WB MTRR ranges doesn't need to actually resolve the memtype, i.e. can be simple
> > and just
> > process MTRRs one by one.
> > 
> > Did that make sense?  Minus the code to identify non-WB MTRR ranges (and much
> > needed comments), the below is what I'm thinking.  If more intelligent zapping
> > provides the desired performance improvements, then I think/hope we avoid trying
> > to play games with MTRRs.
> > 
> > ---
> >  arch/x86/kvm/mtrr.c    | 19 +++++++++++++++++++
> >  arch/x86/kvm/vmx/vmx.c |  8 ++------
> >  arch/x86/kvm/x86.c     |  6 ++----
> >  3 files changed, 23 insertions(+), 10 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
> > index a67c28a56417..e700c230268b 100644
> > --- a/arch/x86/kvm/mtrr.c
> > +++ b/arch/x86/kvm/mtrr.c
> > @@ -323,6 +323,9 @@ static void update_mtrr(struct kvm_vcpu *vcpu, u32 msr)
> >  	if (!kvm_mmu_honors_guest_mtrrs(vcpu->kvm))
> >  		return;
> >  
> > +	if (kvm_is_cr0_bit_set(vcpu, X86_CR0_CD))
> > +		return;
> This will always make update_mtrr() return here for Linux and OVMF. 

Yes, that's the intent.  If the CR0.CD quirk is _disabled_, then all SPTEs are
UC, i.e. there's no need to zap.  If the quirk is enabled (the common case,
unfortunately), then KVM isn't honoring MTRRs, i.e. non-coherent DMA won't function
properly, and so zapping SPTEs is pointless.  And in both cases, KVM will zap
relevant SPTEs when CR0.CD is cleared.

And this is actually already the behavior for all MTRRs execpt for MSR_MTRRdefType
due to the !mtrr_is_enabled() clause below.

> >  	if (!mtrr_is_enabled(mtrr_state) && msr != MSR_MTRRdefType)
> >  		return;
> >  
> > @@ -375,6 +378,22 @@ static void set_var_mtrr_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data)
> >  	}
> >  }
> >  
> > +void kvm_mtrr_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
> > +{
> > +	if (cr0 & X86_CR0_CD)
> > +		return;
> > +
> > +	if (!kvm_mmu_honors_guest_mtrrs(vcpu->kvm))
> > +		return;
> > +
> > +	if (!kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED)) {
> > +		kvm_zap_gfn_range(vcpu->kvm, 0, ~0ULL);
> > +		return;
> > +	}
> > +
> > +	<zap non-WB memory>;
> This zap looks will happen on each vCPU.

Yes, on CR0.CD 1=>0 transition.

> Then only half of zaps are saved compared to the original count of zaps in
> update_mtrr().

I don't think the current code is functionally correct though, even if we define
"correct" to only apply when CR0.CD=0 (because the CR0.CD=1 case when the quirk
can't possibly be correct).  Keeping stale WB+IGMT entries that were installed
while CR0.CD=1 is broken.  As mentioned in my previous mail, I suspect it works
only because KVM doesn't install SPTEs for DMA'd ranges.  FWIW, the WB memtype
is unlikely to be a problem, but setting IGMT definitely is.  Though of course
we can and should fix the IGMT thing separately.

> And pieces of no-WB memory might produce more kvm_zap_gfn_range()?

Yes, but they'll be much, much more precise.  And the bajillion fixed MTRRs can
be batched into a single zap by sacrificing a tiny amount of precision, i.e. zap
from the lowest non-WB fixed MTRR to the highest.  Crucially, assuming BIOS and
the kernel aren't doing something bizarre, that should preserve the SPTEs for the
code the guest is executing from (0 - VGA hole should be WB).

And if we want to squeeze more performance out of this path, there are other
optimizations we can make.  E.g. I'm guessing one of the problems, perhaps even
_the_ problem, is that there's contention on mmu_lock in kvm_zap_gfn_range() when
bringing up APs, which is likely why you observe slow downs even when there are
no SPTEs to zap.  A thought for handling that would be to do something akin to
kvm_recalculate_apic_map().  E.g. instead of having every vCPU zap, track (a)
if a zap is in-progress and (b) the ranges being zapped.  There will still be
lock contention to add ranges, but it should be fairly short in duration compared
to zapping all possible SPTEs (current behavior).

Something like

	struct tbd *range = NULL;
	bool do_zap = false;

	<gather non-wb ranges>

	for_each_non_wb_range(...) {
		if (!range)
			range = kmalloc(...);

		spin_lock(&kvm->arch.mtrr_zap_lock);
		if (<range not in list>) {
			list_add_tail(&range->list, &kvm->arch.mtrr_zap_list);
			range = NULL;

			if (!kvm->arch.mtrr_zapping) {
				do_zap = true;
				kvm->arch.mtrr_zapping = true;
			}
		}
		spin_unlock(&kvm->arch.mtrr_zap_lock);
	}

	kfree(zap_entry);

	if (do_zap) {
		spin_lock(&kvm->arch.mtrr_zap_lock);

		while (!list_empty(&kvm->arch.mtrr_zap_list)) {
			struct tbd *range;

			range = list_first_entry(&kvm->arch.mtrr_zap_list);
			list_del(range->list);

			spin_unlock(&kvm->arch.mtrr_zap_lock);

			kvm_zap_gfn_range(..., range->start, range->end);
			kfree(range);
		
			spin_lock(&kvm->arch.mtrr_zap_lock);
		}

		/* Clear under lock to ensure new entries are processed. */
		kvm->arch.mtrr_zapping = false;

		spin_unlock(&kvm->arch.mtrr_zap_lock);
	}

	/* Wait for the zap to complete. */
	while (READ_ONCE(kvm->arch.mtrr_zapping))
		cpu_relax();

and I'm sure someone that's better at lockless programming could optimize that
even further if necessary, e.g. by checking for "range in list" outside of the
spinlock.

E.g. in my OVMF based VM, the precise zap would target only two ranges, the VGA
hole and the 32-bit PCI:

  kvm: vCPU0 default type = 6
  kvm: vCPU0 fixed MTRR range 0 - 15 == 6
  kvm: vCPU0 fixed MTRR range 16 - 87 == 0
  kvm: vCPU0 variable MTRR range 80000000 - 100000000  = 0

That bumps up to three ranges for the kernel, which adds what I suspect is the
64-bit PCI hole:

  kvm: vCPU0 default type = 6                                                   
  kvm: vCPU0 fixed MTRR range 0 - 15 == 6                                       
  kvm: vCPU0 fixed MTRR range 16 - 87 == 0
  kvm: vCPU0 variable MTRR range 800000000 - 1000000000  = 0
  kvm: vCPU0 variable MTRR range 80000000 - 100000000  = 0

The above is distilled information from this hack-a-print:

diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
index 9fac1ec03463..6259c7a4bcd3 100644
--- a/arch/x86/kvm/mtrr.c
+++ b/arch/x86/kvm/mtrr.c
@@ -304,12 +304,42 @@ static void var_mtrr_range(struct kvm_mtrr_range *range, u64 *start, u64 *end)
        *end = (*start | ~mask) + 1;
 }
 
+
+static bool var_mtrr_range_is_valid(struct kvm_mtrr_range *range)
+{
+       return (range->mask & (1 << 11)) != 0;
+}
+
 static void update_mtrr(struct kvm_vcpu *vcpu, u32 msr)
 {
        struct kvm_mtrr *mtrr_state = &vcpu->arch.mtrr_state;
        gfn_t start, end;
        int index;
 
+       if (mtrr_is_enabled(mtrr_state) && msr == MSR_MTRRdefType) {
+               struct kvm_mtrr_range *range;
+               int i;
+
+               pr_warn("vCPU%u default type = %u\n", vcpu->vcpu_idx, (u8)(mtrr_state->deftype & 0xff));
+
+               if (!fixed_mtrr_is_enabled(mtrr_state)) {
+                       pr_warn("vCPU%u fixed MTRRs disabled\n", vcpu->vcpu_idx);
+               } else {
+                       for (i = 0; i < ARRAY_SIZE(mtrr_state->fixed_ranges); i++)
+                               pr_warn("vCPU%u fixed MTRR range %u == %u\n",
+                                       vcpu->vcpu_idx, i, mtrr_state->fixed_ranges[i]);
+               }
+
+               list_for_each_entry(range, &mtrr_state->head, node) {
+                       if (!var_mtrr_range_is_valid(range))
+                               continue;
+
+                       var_mtrr_range(range, &start, &end);
+                       pr_warn("vCPU%d variable MTRR range %llx - %llx  = %u\n",
+                               vcpu->vcpu_idx, start, end, (u8)(range->base & 0xff));
+               }
+       }
+
        if (msr == MSR_IA32_CR_PAT || !tdp_enabled ||
              !kvm_arch_has_noncoherent_dma(vcpu->kvm))
                return;
@@ -333,11 +363,6 @@ static void update_mtrr(struct kvm_vcpu *vcpu, u32 msr)
        kvm_zap_gfn_range(vcpu->kvm, gpa_to_gfn(start), gpa_to_gfn(end));
 }
 
-static bool var_mtrr_range_is_valid(struct kvm_mtrr_range *range)
-{
-       return (range->mask & (1 << 11)) != 0;
-}
-
 static void set_var_mtrr_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 {
        struct kvm_mtrr *mtrr_state = &vcpu->arch.mtrr_state;
Robert Hoo May 25, 2023, 7:21 a.m. UTC | #7
On 5/23/2023 2:21 PM, Yan Zhao wrote:
[...]
> Yes, leaving each vCPU's MTRR to mimic HW.
> 
> As also suggested in SDM, the guest OS manipulates MTRRs in this way:
> 
> for each online CPUs {
> 	disable MTRR
> 	update fixed/var MTRR ranges
> 	enable MTRR
> }
> Guest OS needs to access memory only after this full pattern.
> 
Thanks for confirmation and clarification.

> So, I think there should not have "hazard of inconsistency among per-VPU MTRR
> states".
> 
> I want to have per-VM MTRR state is because I want to reduce unnessary EPT
> zap, which costs quite a lot cpu cycles even when the EPT is empty.
> 
> In this patch, per-VM MTRR pointer is used to point to vCPU 0's MTRR state,
> so that it can save some memory to keep the MTRR state.
> But I found out that it would only work when vCPU 0 (boot processor) is
> always online (which is not true for x86 under some configration).
> 
> I'll try to find out lowest online vCPU and keep a per-VM copy of MTRR state
> in next version.
> 
> Thanks!
> 

IIUC, your saving comes from skips the intermediate state during boot, when 
APs goes through setting MTRR, which would cause SPTE zap before your this 
patch set.

MHO was, now that your ignores other vCPU's MTRR settings (unless it is 
different from BP's MTRR?), why not let each vCPU's MTRR set/update 
directly set to the per-VM MTRR states (if differs from current value). 
It's guest OS/BIOS's responsibility to keep the consistency anyway. And 
even if the malfunction caused by the inconsistency might differ from that 
of native, SDM doesn't clearly state how the malfunction should be, does it?
that's to say, anyone knows, when inconsistency happens, does it cause that 
logical processor malfunction or in fact it impacts the global MTRR 
settings? If the latter, I think leaving only the per-VM MTRR state aligns 
with native.

BTW, with regard to KVM_X86_QUIRK_CD_NW_CLEARED, I see svm honors it while 
vmx doesn't before it clear CR0.CD/NW.

svm_set_cr0():

	if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
		hcr0 &= ~(X86_CR0_CD | X86_CR0_NW);


vmx_set_cr0():

	hw_cr0 = (cr0 & ~KVM_VM_CR0_ALWAYS_OFF);

Perhaps vmx side can be fixed passingly?
Yan Zhao May 25, 2023, 10:09 a.m. UTC | #8
On Wed, May 24, 2023 at 11:21:09AM -0700, Sean Christopherson wrote:

...

> > > Picking a single vCPU will always be subject to edge cases.  E.g. I can see something
> > > like kexec() "offlining" KVM's chosen vCPU and then having problems because KVM
> > > ignores MTRR updates from other vCPUs in the new kernel.
> > >
> > Not familiar with kexec().
> > I wanted to trap APIC_SPIV and finding the lowest online vCPU id by checking
> > apic->sw_enabled status. Then only MTRR state of vCPU whose id equals to
> > the lowest online vCPU id can be written to per-VM MTRR state.
> > Will that work for kexec()?
> 
> kexec() allows "booting" into a new kernel without transitioning through BIOS
> and without doing a full reboot.  The scenario I'm worried about is if the new
> kernel offlines KVM's chosen CPU for whatever reason and also modifies MTRRs.  I
> think it's an extremely unlikely scenario, but my point is that selecting a single
> vCPU to control the MTRRs works if and only if that vCPU stays online for the
> entire lifetime of the VM, which KVM can't guarantee.
> 
> > > One idea would be to let all vCPUs write the per-VM state, and zap if and only if
> > > the MTRRs are actually different.  But as above, I'm on the fence about diverging
> > > from what hardware actually does with respect to MTRRs.
> 
> ...
> 
> > > So, if KVM zaps SPTEs when CR0.CD is cleared (even when the quirk is enabled),
> > > then KVM can skip the MTRR zaps when CR0.CD=1 because KVM is ignoring the MTRRs
> > > and will zap when CR0.CD is cleared.  And to avoid regressing the CR0.CD case,
> > > if KVM honors guest PAT for the bizarro CR0.CD quirk, then KVM only needs to
> > > zap non-WB MTRR ranges when CR0.CD is cleared.  Since WB is weak, looking for
Not only non-WB ranges are required to be zapped.
Think about a scenario:
(1) one fixed MTRR range is UC initially. Its EPT entry memtype is UC, too.
(2) after CR0.CD=1, without zap, its EPT entry memtype is still UC.
(3) then guest performs MTRR disable, no zap too.
(4) guest updates this fixed MTRR range to WB, and performs MTRR enable.
(5) CR0.CD=0. we need to zap this fixed range to update the EPT memtype to WB.

Does it make sense?

> > > non-WB MTRR ranges doesn't need to actually resolve the memtype, i.e. can be simple
> > > and just
> > > process MTRRs one by one.
> > > 
> > > Did that make sense?  Minus the code to identify non-WB MTRR ranges (and much
> > > needed comments), the below is what I'm thinking.  If more intelligent zapping
> > > provides the desired performance improvements, then I think/hope we avoid trying
> > > to play games with MTRRs.
> > > 
> > > ---
> > >  arch/x86/kvm/mtrr.c    | 19 +++++++++++++++++++
> > >  arch/x86/kvm/vmx/vmx.c |  8 ++------
> > >  arch/x86/kvm/x86.c     |  6 ++----
> > >  3 files changed, 23 insertions(+), 10 deletions(-)
> > > 
> > > diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
> > > index a67c28a56417..e700c230268b 100644
> > > --- a/arch/x86/kvm/mtrr.c
> > > +++ b/arch/x86/kvm/mtrr.c
> > > @@ -323,6 +323,9 @@ static void update_mtrr(struct kvm_vcpu *vcpu, u32 msr)
> > >  	if (!kvm_mmu_honors_guest_mtrrs(vcpu->kvm))
> > >  		return;
> > >  
> > > +	if (kvm_is_cr0_bit_set(vcpu, X86_CR0_CD))
> > > +		return;
> > This will always make update_mtrr() return here for Linux and OVMF. 
> 
> Yes, that's the intent.  If the CR0.CD quirk is _disabled_, then all SPTEs are
> UC, i.e. there's no need to zap.  If the quirk is enabled (the common case,
> unfortunately), then KVM isn't honoring MTRRs, i.e. non-coherent DMA won't function
> properly, and so zapping SPTEs is pointless.  And in both cases, KVM will zap
> relevant SPTEs when CR0.CD is cleared.
My worry is that if the quirk is enabled,
(1) when CR0.CD=1, the EPT memtype is WB.
(2) when MTRR is further disabled,
    a. with EPT zap, the new EPT memtype is UC, effective memtype is UC
    b. without EPT zap, some EPT memtype is still WB, effective memtype is guest PAT type
However, in guest driver's point of view, the memtype is UC, because its MTRR is disabled.

So, if we don't zap EPT when guest disables MTRR, and if there's
non-coherent DMA on-going which requires guest driver to flush caches to
ensure cache coherency (though I don't think this scenario will happen in reality), guest
driver may not do the right thing as it thinks the memory is UC which
has no need to flush cache while the effective memtype is WB.

Previously, (i.e. in current upstream implementation),  
(1) when CR0.CD=1, the EPT memtype is WB + ignore guest PAT.
(2) when MTRR is further disabled,
    EPT is zapped, new EPT memtype is UC, effective memtype is UC.
The guest device driver can flush cache well for non-coherent DMA.

So, though the quirk will make current upstream code malfunction when
CR0.CD is set alone, it still functions well if MTRR is also disabled.

But I doubt if we need to care about this extreme condition of
non-coherent DMA that happens when CR0.CD=1 and MTRR is disabled.


> 
> And this is actually already the behavior for all MTRRs execpt for MSR_MTRRdefType
> due to the !mtrr_is_enabled() clause below.
> 
> > >  	if (!mtrr_is_enabled(mtrr_state) && msr != MSR_MTRRdefType)
> > >  		return;
> > >  
> > > @@ -375,6 +378,22 @@ static void set_var_mtrr_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data)
> > >  	}
> > >  }
> > >  
> > > +void kvm_mtrr_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
> > > +{
> > > +	if (cr0 & X86_CR0_CD)
> > > +		return;
> > > +
> > > +	if (!kvm_mmu_honors_guest_mtrrs(vcpu->kvm))
> > > +		return;
> > > +
> > > +	if (!kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED)) {
> > > +		kvm_zap_gfn_range(vcpu->kvm, 0, ~0ULL);
> > > +		return;
> > > +	}
> > > +
> > > +	<zap non-WB memory>;
> > This zap looks will happen on each vCPU.
> 
> Yes, on CR0.CD 1=>0 transition.
> 
> > Then only half of zaps are saved compared to the original count of zaps in
> > update_mtrr().
> 
> I don't think the current code is functionally correct though, even if we define
> "correct" to only apply when CR0.CD=0 (because the CR0.CD=1 case when the quirk
> can't possibly be correct).  Keeping stale WB+IGMT entries that were installed
> while CR0.CD=1 is broken.  As mentioned in my previous mail, I suspect it works
> only because KVM doesn't install SPTEs for DMA'd ranges.  FWIW, the WB memtype
> is unlikely to be a problem, but setting IGMT definitely is.  Though of course
> we can and should fix the IGMT thing separately.
Current code may also not function well for non-coherent DMA when guest vCPU has
no X86_FEATURE_MTRR, as it returns WB (without IPAT) as EPT memtype.
Then the effective memtype is guest PAT, but from guest's point of view, it's all
UC. (see pat_x_mtrr_type() in Linux's code).
But I guess we don't need to mind this case?


> 
> > And pieces of no-WB memory might produce more kvm_zap_gfn_range()?
> 
> Yes, but they'll be much, much more precise.  And the bajillion fixed MTRRs can
Could we instead keep a per-VM copy of MTRR?
Then, if a vCPU modifies an MTRR entry and we find it's different from that
in the per-VM copy, we mark that entry dirty. When CD0=0, we just zap
those dirty ranges. (or just zap all if there are dirty entries)
In theory, only one vCPU will trigger this zap in each round of MTRR update.

> be batched into a single zap by sacrificing a tiny amount of precision, i.e. zap
> from the lowest non-WB fixed MTRR to the highest.  Crucially, assuming BIOS and
> the kernel aren't doing something bizarre, that should preserve the SPTEs for the
> code the guest is executing from (0 - VGA hole should be WB).
> 
> And if we want to squeeze more performance out of this path, there are other
> optimizations we can make.  E.g. I'm guessing one of the problems, perhaps even
> _the_ problem, is that there's contention on mmu_lock in kvm_zap_gfn_range() when
> bringing up APs, which is likely why you observe slow downs even when there are
> no SPTEs to zap.  A thought for handling that would be to do something akin to
> kvm_recalculate_apic_map().  E.g. instead of having every vCPU zap, track (a)
> if a zap is in-progress and (b) the ranges being zapped.  There will still be
> lock contention to add ranges, but it should be fairly short in duration compared
> to zapping all possible SPTEs (current behavior).
Yes, I guess this one is more performant, too :)

> 
> Something like
> 
> 	struct tbd *range = NULL;
> 	bool do_zap = false;
> 
> 	<gather non-wb ranges>
> 
> 	for_each_non_wb_range(...) {
> 		if (!range)
> 			range = kmalloc(...);
> 
> 		spin_lock(&kvm->arch.mtrr_zap_lock);
> 		if (<range not in list>) {
> 			list_add_tail(&range->list, &kvm->arch.mtrr_zap_list);
> 			range = NULL;
> 
> 			if (!kvm->arch.mtrr_zapping) {
> 				do_zap = true;
> 				kvm->arch.mtrr_zapping = true;
> 			}
> 		}
> 		spin_unlock(&kvm->arch.mtrr_zap_lock);
> 	}
> 
> 	kfree(zap_entry);
> 
> 	if (do_zap) {
> 		spin_lock(&kvm->arch.mtrr_zap_lock);
> 
> 		while (!list_empty(&kvm->arch.mtrr_zap_list)) {
> 			struct tbd *range;
> 
> 			range = list_first_entry(&kvm->arch.mtrr_zap_list);
> 			list_del(range->list);
> 
> 			spin_unlock(&kvm->arch.mtrr_zap_lock);
> 
> 			kvm_zap_gfn_range(..., range->start, range->end);
> 			kfree(range);
> 		
> 			spin_lock(&kvm->arch.mtrr_zap_lock);
> 		}
> 
> 		/* Clear under lock to ensure new entries are processed. */
> 		kvm->arch.mtrr_zapping = false;
> 
> 		spin_unlock(&kvm->arch.mtrr_zap_lock);
> 	}
> 
> 	/* Wait for the zap to complete. */
> 	while (READ_ONCE(kvm->arch.mtrr_zapping))
> 		cpu_relax();
> 
> and I'm sure someone that's better at lockless programming could optimize that
> even further if necessary, e.g. by checking for "range in list" outside of the
> spinlock.
> 
> E.g. in my OVMF based VM, the precise zap would target only two ranges, the VGA
> hole and the 32-bit PCI:
> 
>   kvm: vCPU0 default type = 6
>   kvm: vCPU0 fixed MTRR range 0 - 15 == 6
>   kvm: vCPU0 fixed MTRR range 16 - 87 == 0
>   kvm: vCPU0 variable MTRR range 80000000 - 100000000  = 0
> 
> That bumps up to three ranges for the kernel, which adds what I suspect is the
> 64-bit PCI hole:
> 
>   kvm: vCPU0 default type = 6                                                   
>   kvm: vCPU0 fixed MTRR range 0 - 15 == 6                                       
>   kvm: vCPU0 fixed MTRR range 16 - 87 == 0
>   kvm: vCPU0 variable MTRR range 800000000 - 1000000000  = 0
>   kvm: vCPU0 variable MTRR range 80000000 - 100000000  = 0
> 
> The above is distilled information from this hack-a-print:
> 
> diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
> index 9fac1ec03463..6259c7a4bcd3 100644
> --- a/arch/x86/kvm/mtrr.c
> +++ b/arch/x86/kvm/mtrr.c
> @@ -304,12 +304,42 @@ static void var_mtrr_range(struct kvm_mtrr_range *range, u64 *start, u64 *end)
>         *end = (*start | ~mask) + 1;
>  }
>  
> +
> +static bool var_mtrr_range_is_valid(struct kvm_mtrr_range *range)
> +{
> +       return (range->mask & (1 << 11)) != 0;
> +}
> +
>  static void update_mtrr(struct kvm_vcpu *vcpu, u32 msr)
>  {
>         struct kvm_mtrr *mtrr_state = &vcpu->arch.mtrr_state;
>         gfn_t start, end;
>         int index;
>  
> +       if (mtrr_is_enabled(mtrr_state) && msr == MSR_MTRRdefType) {
> +               struct kvm_mtrr_range *range;
> +               int i;
> +
> +               pr_warn("vCPU%u default type = %u\n", vcpu->vcpu_idx, (u8)(mtrr_state->deftype & 0xff));
> +
> +               if (!fixed_mtrr_is_enabled(mtrr_state)) {
> +                       pr_warn("vCPU%u fixed MTRRs disabled\n", vcpu->vcpu_idx);
> +               } else {
> +                       for (i = 0; i < ARRAY_SIZE(mtrr_state->fixed_ranges); i++)
> +                               pr_warn("vCPU%u fixed MTRR range %u == %u\n",
> +                                       vcpu->vcpu_idx, i, mtrr_state->fixed_ranges[i]);
> +               }
> +
> +               list_for_each_entry(range, &mtrr_state->head, node) {
> +                       if (!var_mtrr_range_is_valid(range))
> +                               continue;
> +
> +                       var_mtrr_range(range, &start, &end);
> +                       pr_warn("vCPU%d variable MTRR range %llx - %llx  = %u\n",
> +                               vcpu->vcpu_idx, start, end, (u8)(range->base & 0xff));
> +               }
> +       }
> +
>         if (msr == MSR_IA32_CR_PAT || !tdp_enabled ||
>               !kvm_arch_has_noncoherent_dma(vcpu->kvm))
>                 return;
> @@ -333,11 +363,6 @@ static void update_mtrr(struct kvm_vcpu *vcpu, u32 msr)
>         kvm_zap_gfn_range(vcpu->kvm, gpa_to_gfn(start), gpa_to_gfn(end));
>  }
>  
> -static bool var_mtrr_range_is_valid(struct kvm_mtrr_range *range)
> -{
> -       return (range->mask & (1 << 11)) != 0;
> -}
> -
>  static void set_var_mtrr_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data)
>  {
>         struct kvm_mtrr *mtrr_state = &vcpu->arch.mtrr_state;
>
Sean Christopherson May 25, 2023, 2:53 p.m. UTC | #9
On Thu, May 25, 2023, Yan Zhao wrote:
> On Wed, May 24, 2023 at 11:21:09AM -0700, Sean Christopherson wrote:
> 
> ...
> 
> > > > Picking a single vCPU will always be subject to edge cases.  E.g. I can see something
> > > > like kexec() "offlining" KVM's chosen vCPU and then having problems because KVM
> > > > ignores MTRR updates from other vCPUs in the new kernel.
> > > >
> > > Not familiar with kexec().
> > > I wanted to trap APIC_SPIV and finding the lowest online vCPU id by checking
> > > apic->sw_enabled status. Then only MTRR state of vCPU whose id equals to
> > > the lowest online vCPU id can be written to per-VM MTRR state.
> > > Will that work for kexec()?
> > 
> > kexec() allows "booting" into a new kernel without transitioning through BIOS
> > and without doing a full reboot.  The scenario I'm worried about is if the new
> > kernel offlines KVM's chosen CPU for whatever reason and also modifies MTRRs.  I
> > think it's an extremely unlikely scenario, but my point is that selecting a single
> > vCPU to control the MTRRs works if and only if that vCPU stays online for the
> > entire lifetime of the VM, which KVM can't guarantee.
> > 
> > > > One idea would be to let all vCPUs write the per-VM state, and zap if and only if
> > > > the MTRRs are actually different.  But as above, I'm on the fence about diverging
> > > > from what hardware actually does with respect to MTRRs.
> > 
> > ...
> > 
> > > > So, if KVM zaps SPTEs when CR0.CD is cleared (even when the quirk is enabled),
> > > > then KVM can skip the MTRR zaps when CR0.CD=1 because KVM is ignoring the MTRRs
> > > > and will zap when CR0.CD is cleared.  And to avoid regressing the CR0.CD case,
> > > > if KVM honors guest PAT for the bizarro CR0.CD quirk, then KVM only needs to
> > > > zap non-WB MTRR ranges when CR0.CD is cleared.  Since WB is weak, looking for
> Not only non-WB ranges are required to be zapped.
> Think about a scenario:
> (1) one fixed MTRR range is UC initially. Its EPT entry memtype is UC, too.
> (2) after CR0.CD=1, without zap, its EPT entry memtype is still UC.
> (3) then guest performs MTRR disable, no zap too.
> (4) guest updates this fixed MTRR range to WB, and performs MTRR enable.
> (5) CR0.CD=0. we need to zap this fixed range to update the EPT memtype to WB.

KVM installs WB memtype when the quirk is enabled, thus no need to zap.  KVM
currently zaps everything when the quirk is disabled, and I'm not proposing that
be changed.

> Does it make sense?
> 
> > > > non-WB MTRR ranges doesn't need to actually resolve the memtype, i.e. can be simple
> > > > and just
> > > > process MTRRs one by one.
> > > > 
> > > > Did that make sense?  Minus the code to identify non-WB MTRR ranges (and much
> > > > needed comments), the below is what I'm thinking.  If more intelligent zapping
> > > > provides the desired performance improvements, then I think/hope we avoid trying
> > > > to play games with MTRRs.
> > > > 
> > > > ---
> > > >  arch/x86/kvm/mtrr.c    | 19 +++++++++++++++++++
> > > >  arch/x86/kvm/vmx/vmx.c |  8 ++------
> > > >  arch/x86/kvm/x86.c     |  6 ++----
> > > >  3 files changed, 23 insertions(+), 10 deletions(-)
> > > > 
> > > > diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
> > > > index a67c28a56417..e700c230268b 100644
> > > > --- a/arch/x86/kvm/mtrr.c
> > > > +++ b/arch/x86/kvm/mtrr.c
> > > > @@ -323,6 +323,9 @@ static void update_mtrr(struct kvm_vcpu *vcpu, u32 msr)
> > > >  	if (!kvm_mmu_honors_guest_mtrrs(vcpu->kvm))
> > > >  		return;
> > > >  
> > > > +	if (kvm_is_cr0_bit_set(vcpu, X86_CR0_CD))
> > > > +		return;
> > > This will always make update_mtrr() return here for Linux and OVMF. 
> > 
> > Yes, that's the intent.  If the CR0.CD quirk is _disabled_, then all SPTEs are
> > UC, i.e. there's no need to zap.  If the quirk is enabled (the common case,
> > unfortunately), then KVM isn't honoring MTRRs, i.e. non-coherent DMA won't function
> > properly, and so zapping SPTEs is pointless.  And in both cases, KVM will zap
> > relevant SPTEs when CR0.CD is cleared.
> My worry is that if the quirk is enabled,
> (1) when CR0.CD=1, the EPT memtype is WB.
> (2) when MTRR is further disabled,
>     a. with EPT zap, the new EPT memtype is UC, effective memtype is UC
>     b. without EPT zap, some EPT memtype is still WB, effective memtype is guest PAT type
> However, in guest driver's point of view, the memtype is UC, because its MTRR is disabled.

The quirk throws all of that out the window.  As the quirk is implemented in KVM,
i.e. what guests have been running with for years, the MTRRs are completely
ignored when CR0.CD.

> So, if we don't zap EPT when guest disables MTRR, and if there's
> non-coherent DMA on-going which requires guest driver to flush caches to
> ensure cache coherency (though I don't think this scenario will happen in
> reality), guest driver may not do the right thing as it thinks the memory is
> UC which has no need to flush cache while the effective memtype is WB.
> 
> Previously, (i.e. in current upstream implementation),  
> (1) when CR0.CD=1, the EPT memtype is WB + ignore guest PAT.
> (2) when MTRR is further disabled,
>     EPT is zapped, new EPT memtype is UC, effective memtype is UC.

No, because after zapping, CR0.CD=1, and thus when KVM re-installs SPTEs the guest
will get WB memory for _everything_ once again.

> The guest device driver can flush cache well for non-coherent DMA.
>
> So, though the quirk will make current upstream code malfunction when
> CR0.CD is set alone, it still functions well if MTRR is also disabled.

Not unless I'm missing something.  I don't see any way for the guest to get UC
memory when CR0.CD=1 and the quirk is enabled.  The behavior is nonsensical IMO,
but that's been KVM's quirky behavior for years.

	if (is_mmio)
		return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;

	if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
		return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;

	if (kvm_read_cr0_bits(vcpu, X86_CR0_CD)) {
		if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
			cache = MTRR_TYPE_WRBACK;  <===== Overrides MTRR settings
		else
			cache = MTRR_TYPE_UNCACHABLE;

		return (cache << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
	}

	return kvm_mtrr_get_guest_memory_type(vcpu, gfn) << VMX_EPT_MT_EPTE_SHIFT;

> But I doubt if we need to care about this extreme condition of
> non-coherent DMA that happens when CR0.CD=1 and MTRR is disabled.

It doesn't actually require non-coherent DMA within the CR0.CD=1 window though.
A SPTE installed while CR0.CD=1 could overlap a *future* non-coherent DMA range,
i.e. not zapping when CR0.CD is cleared would leave a stale WB SPTE.  This is
especially likely with hugepages, e.g. if KVM can create 1GiB mappings.

> > > This zap looks will happen on each vCPU.
> > 
> > Yes, on CR0.CD 1=>0 transition.
> > 
> > > Then only half of zaps are saved compared to the original count of zaps in
> > > update_mtrr().
> > 
> > I don't think the current code is functionally correct though, even if we define
> > "correct" to only apply when CR0.CD=0 (because the CR0.CD=1 case when the quirk
> > can't possibly be correct).  Keeping stale WB+IGMT entries that were installed
> > while CR0.CD=1 is broken.  As mentioned in my previous mail, I suspect it works
> > only because KVM doesn't install SPTEs for DMA'd ranges.  FWIW, the WB memtype
> > is unlikely to be a problem, but setting IGMT definitely is.  Though of course
> > we can and should fix the IGMT thing separately.
> Current code may also not function well for non-coherent DMA when guest vCPU has
> no X86_FEATURE_MTRR, as it returns WB (without IPAT) as EPT memtype.
> Then the effective memtype is guest PAT, but from guest's point of view, it's all
> UC. (see pat_x_mtrr_type() in Linux's code).
> But I guess we don't need to mind this case?

Yes, we can ignore !X86_FEATURE_MTRR guests.  If someone wants to run such a
setup with non-coherent DMA, they're welcome to send patches and write a lengthy
explanation of why they want to run such insanity :-)

> > > And pieces of no-WB memory might produce more kvm_zap_gfn_range()?
> > 
> > Yes, but they'll be much, much more precise.  And the bajillion fixed MTRRs can
> Could we instead keep a per-VM copy of MTRR?
> Then, if a vCPU modifies an MTRR entry and we find it's different from that
> in the per-VM copy, we mark that entry dirty. When CD0=0, we just zap
> those dirty ranges. (or just zap all if there are dirty entries)
> In theory, only one vCPU will trigger this zap in each round of MTRR update.

I don't think tracking "dirty" MTRRs releative to a per-VM copy will work because
of the weird behavior of the quirk.  E.g. if the below happens, vCPU1's WB SPTE
won't be zapped because MTRRx wasn't "dirty".

     vCPU0         vCPU1
     ----------    ----------
1.   MTRRx = UC 
2.   CR0.CD = 1    CR0.CD=1
3.                 SPTE = WB
4.                 MTRRx = UC
5.                 CR0.CD=0

Another problem is that while only one vCPU will trigger the zap, KVM needs to
stall all vCPUs that dirtied MTRRs *relative to the original per-VM state*.  E.g.
if the per-VM state for MTRRx is WB and both vCPU1 and vCPU2 dirty an MTRR and
both clear CR0.CD, then KVM needs to stall vCPU1 and vCPU2 regardless of which
vCPU actually does the zapping.

And when handling the transition from dirty=>clean, KVM would need to prevent
writes to MTRRs while grabbing the snapshot of dirty MTRRs, i.e. to-be-zapped
ranges, in order to provide stable view of the clean per-VM copy.

In other words, the net effect will essentially be what I sketched out with the
precise zapping, but with the added downside of serializing writes to MTRRs while
transitioning the per-VM state from dirty=>clean.
Sean Christopherson May 25, 2023, 3 p.m. UTC | #10
On Thu, May 25, 2023, Robert Hoo wrote:
> On 5/23/2023 2:21 PM, Yan Zhao wrote:
> IIUC, your saving comes from skips the intermediate state during boot, when
> APs goes through setting MTRR, which would cause SPTE zap before your this
> patch set.
> 
> MHO was, now that your ignores other vCPU's MTRR settings (unless it is
> different from BP's MTRR?), why not let each vCPU's MTRR set/update directly
> set to the per-VM MTRR states (if differs from current value). It's guest
> OS/BIOS's responsibility to keep the consistency anyway. And even if the
> malfunction caused by the inconsistency might differ from that of native,
> SDM doesn't clearly state how the malfunction should be, does it?
> that's to say, anyone knows, when inconsistency happens, does it cause that
> logical processor malfunction or in fact it impacts the global MTRR
> settings? If the latter, I think leaving only the per-VM MTRR state aligns
> with native.

The MTRRs are not system wide or per-package though, they are per logical CPU.
Yes, they "need" to be consistent with respect to one another, but only when the
CPU is actually accessing memory.  This is a big reason why trying to track MTRRs
as a per-VM asset in KVM is so difficult/messy.  Software doesn't rendezvous all
CPUs and then do the write on just one CPU, each CPU does its own writes more or
less independently.

> BTW, with regard to KVM_X86_QUIRK_CD_NW_CLEARED, I see svm honors it while
> vmx doesn't before it clear CR0.CD/NW.
> 
> svm_set_cr0():
> 
> 	if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
> 		hcr0 &= ~(X86_CR0_CD | X86_CR0_NW);
> 
> 
> vmx_set_cr0():
> 
> 	hw_cr0 = (cr0 & ~KVM_VM_CR0_ALWAYS_OFF);
> 
> Perhaps vmx side can be fixed passingly?

Sadly, no.  SVM and VMX manage guest memtype completely differently.  VMX doesn't
allow CR0.CD=1 when VMX is enabled, and so KVM needs to emulate CR0.CD via the EPT
memtype.
Robert Hoo May 26, 2023, 1:49 a.m. UTC | #11
On 5/25/2023 11:00 PM, Sean Christopherson wrote:
[...]
> The MTRRs are not system wide or per-package though, they are per logical CPU.
> Yes, they "need" to be consistent with respect to one another, but only when the
> CPU is actually accessing memory.  This is a big reason why trying to track MTRRs
> as a per-VM asset in KVM is so difficult/messy.  Software doesn't rendezvous all
> CPUs and then do the write on just one CPU, each CPU does its own writes more or
> less independently.

Ah, got it, thanks!

(Some things of each logical processor seems just a shadow of an 
consolidate global one, e.g. CR0.CD. Some things are really separated 
setting of each logical processor, e.g. MTRR above. Really unfathomable 
Yan Zhao May 26, 2023, 7:54 a.m. UTC | #12
On Thu, May 25, 2023 at 07:53:41AM -0700, Sean Christopherson wrote:
> > > > > So, if KVM zaps SPTEs when CR0.CD is cleared (even when the quirk is enabled),
> > > > > then KVM can skip the MTRR zaps when CR0.CD=1 because KVM is ignoring the MTRRs
> > > > > and will zap when CR0.CD is cleared.  And to avoid regressing the CR0.CD case,
> > > > > if KVM honors guest PAT for the bizarro CR0.CD quirk, then KVM only needs to
> > > > > zap non-WB MTRR ranges when CR0.CD is cleared.  Since WB is weak, looking for
> > Not only non-WB ranges are required to be zapped.
> > Think about a scenario:
> > (1) one fixed MTRR range is UC initially. Its EPT entry memtype is UC, too.
> > (2) after CR0.CD=1, without zap, its EPT entry memtype is still UC.
> > (3) then guest performs MTRR disable, no zap too.
> > (4) guest updates this fixed MTRR range to WB, and performs MTRR enable.
> > (5) CR0.CD=0. we need to zap this fixed range to update the EPT memtype to WB.
> 
> KVM installs WB memtype when the quirk is enabled, thus no need to zap.  KVM
> currently zaps everything when the quirk is disabled, and I'm not proposing that
> be changed.
I didn't explain it clearly.

(1) one fixed MTRR range is UC initially. Its EPT entry memtype is UC, too. ==> EPT entry has been created here
(2) after CR0.CD=1, because of the quirk, no zap, the created EPT entry memtype is still UC.
(3) then guest performs MTRR disable, no zap too, according to our change.
(4) guest updates this fixed MTRR range to WB, and performs MTRR enable.
(5) CR0.CD=0. we need to zap this fixed range to update the EPT memtype to WB.==>we also need to zap WB range.


> It doesn't actually require non-coherent DMA within the CR0.CD=1 window though.
If we don't need to mind non-coherent DMA within the window CR0.CD=1 to CR0.CD=0,
do you think it's a good idea to do in this way...

(1) when CR0.CD=1, return guest mtrr type. 

--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7532,12 +7532,13 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
                return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;

        if (kvm_read_cr0_bits(vcpu, X86_CR0_CD)) {
-               if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
-                       cache = MTRR_TYPE_WRBACK;
-               else
+               if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED)) {
+                       cache = kvm_mtrr_get_guest_memory_type(vcpu, gfn);
+                       return cache << VMX_EPT_MT_EPTE_SHIFT;
+               } else {
                        cache = MTRR_TYPE_UNCACHABLE;
-
-               return (cache << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
+                       return (cache << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
+               }
        }


(2) return MTRR_TYPE_WRBACK if guest mtrr has not been enabled for once
u8 kvm_mtrr_get_guest_memory_type(struct kvm_vcpu *vcpu, gfn_t gfn)
@@ -628,13 +635,23 @@ u8 kvm_mtrr_get_guest_memory_type(struct kvm_vcpu *vcpu, gfn_t gfn)
        struct mtrr_iter iter;
        u64 start, end;
        int type = -1;
        const int wt_wb_mask = (1 << MTRR_TYPE_WRBACK)
                               | (1 << MTRR_TYPE_WRTHROUGH);
 
+       if (!mtrr_state->enabled_once)
+               return MTRR_TYPE_WRBACK;
+


(3) when CR0.CD = 1 and the quirk is on, return MTRR type as if MTRR is enabled
+       ignore_disable = kvm_read_cr0_bits(vcpu, X86_CR0_CD) &&
+                        kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED);

-static void mtrr_lookup_start(struct mtrr_iter *iter)
+static void mtrr_lookup_start(struct mtrr_iter *iter, bool ignore_disable)
 {
-       if (!mtrr_is_enabled(iter->mtrr_state)) {
+       if (!ignore_disable && !mtrr_is_enabled(iter->mtrr_state)) {
                iter->mtrr_disabled = true;
                return;
        }

(4) Then we only need to do EPT zap when MTRR is enabled for the first time
and when MTRR fixed/var entries are changed at enabling MTRR or at CR0.CD=0
(I prefer at enabling MTRR, as seabios may do the MTRR disable/update/enable when
CR0.CD=0)


Besides, accoding to the commit message introducing KVM_QUIRK_CD_NW_CLEARED,
we can return MTRR_TYPE_WRBACK for CR0.CD=1 only when MTRR is not enbled for
once. (And I tried, it works!)

commit fb279950ba02e3210a16b11ecfa8871f3ee0ca49
Author: Xiao Guangrong <guangrong.xiao@intel.com>
Date:   Thu Jul 16 03:25:56 2015 +0800

    KVM: vmx: obey KVM_QUIRK_CD_NW_CLEARED

    OVMF depends on WB to boot fast, because it only clears caches after
    it has set up MTRRs---which is too late.

    Let's do writeback if CR0.CD is set to make it happy, similar to what
    SVM is already doing.
 
> Yes, we can ignore !X86_FEATURE_MTRR guests.  If someone wants to run such a
> setup with non-coherent DMA, they're welcome to send patches and write a lengthy
> explanation of why they want to run such insanity :-)
great :)

> 
> > > > And pieces of no-WB memory might produce more kvm_zap_gfn_range()?
> > > 
> > > Yes, but they'll be much, much more precise.  And the bajillion fixed MTRRs can
> > Could we instead keep a per-VM copy of MTRR?
> > Then, if a vCPU modifies an MTRR entry and we find it's different from that
> > in the per-VM copy, we mark that entry dirty. When CD0=0, we just zap
> > those dirty ranges. (or just zap all if there are dirty entries)
> > In theory, only one vCPU will trigger this zap in each round of MTRR update.
> 
> I don't think tracking "dirty" MTRRs releative to a per-VM copy will work because
> of the weird behavior of the quirk.  E.g. if the below happens, vCPU1's WB SPTE
> won't be zapped because MTRRx wasn't "dirty".
>
By returning MTRR type (as if MTRR is enabled) for CR0.CD=1, we can avoid to zap
entries in this window. Agree?

>      vCPU0         vCPU1
>      ----------    ----------
> 1.   MTRRx = UC 
> 2.   CR0.CD = 1    CR0.CD=1
> 3.                 SPTE = WB
> 4.                 MTRRx = UC
> 5.                 CR0.CD=0
> 
> Another problem is that while only one vCPU will trigger the zap, KVM needs to
> stall all vCPUs that dirtied MTRRs *relative to the original per-VM state*.  E.g.
> if the per-VM state for MTRRx is WB and both vCPU1 and vCPU2 dirty an MTRR and
> both clear CR0.CD, then KVM needs to stall vCPU1 and vCPU2 regardless of which
> vCPU actually does the zapping.
> 
> And when handling the transition from dirty=>clean, KVM would need to prevent
> writes to MTRRs while grabbing the snapshot of dirty MTRRs, i.e. to-be-zapped
> ranges, in order to provide stable view of the clean per-VM copy.
> 
> In other words, the net effect will essentially be what I sketched out with the
> precise zapping, but with the added downside of serializing writes to MTRRs while
> transitioning the per-VM state from dirty=>clean.
Yes, the net effect will essentially be what you sketched out in last mail.

I can try to figure out which way is more efficient before sending out next version
if you agree with the above direction, i.e.
(1) Do not consider non-coherent DMA when CR0.CD=1
(2) return MTRR_TYPE_WRBACK for CR0.CD=1 only when MTRR is not enabled for once.
(3) return MTRR type as if MTRR is enabled for CR0.CD=1
Sean Christopherson May 26, 2023, 4:09 p.m. UTC | #13
On Fri, May 26, 2023, Yan Zhao wrote:
> On Thu, May 25, 2023 at 07:53:41AM -0700, Sean Christopherson wrote:
> > > > > > So, if KVM zaps SPTEs when CR0.CD is cleared (even when the quirk is enabled),
> > > > > > then KVM can skip the MTRR zaps when CR0.CD=1 because KVM is ignoring the MTRRs
> > > > > > and will zap when CR0.CD is cleared.  And to avoid regressing the CR0.CD case,
> > > > > > if KVM honors guest PAT for the bizarro CR0.CD quirk, then KVM only needs to
> > > > > > zap non-WB MTRR ranges when CR0.CD is cleared.  Since WB is weak, looking for
> > > Not only non-WB ranges are required to be zapped.
> > > Think about a scenario:
> > > (1) one fixed MTRR range is UC initially. Its EPT entry memtype is UC, too.
> > > (2) after CR0.CD=1, without zap, its EPT entry memtype is still UC.
> > > (3) then guest performs MTRR disable, no zap too.
> > > (4) guest updates this fixed MTRR range to WB, and performs MTRR enable.
> > > (5) CR0.CD=0. we need to zap this fixed range to update the EPT memtype to WB.
> > 
> > KVM installs WB memtype when the quirk is enabled, thus no need to zap.  KVM
> > currently zaps everything when the quirk is disabled, and I'm not proposing that
> > be changed.
> I didn't explain it clearly.
> 
> (1) one fixed MTRR range is UC initially. Its EPT entry memtype is UC, too. ==> EPT entry has been created here
> (2) after CR0.CD=1, because of the quirk, no zap, the created EPT entry memtype is still UC.
> (3) then guest performs MTRR disable, no zap too, according to our change.
> (4) guest updates this fixed MTRR range to WB, and performs MTRR enable.
> (5) CR0.CD=0. we need to zap this fixed range to update the EPT memtype to WB.==>we also need to zap WB range.

Ugh, right.  But that case can be handled by zapping non-WB ranges on CR0.CD being
set.  Hmm, and KVM would need to zap everything if CR0.CD is toggled with MTRRs
disabled, but I don't think OVMF ever does that.  Zapping on CR0.CD toggling would
would likely introduce a small performance hit for SeaBIOS due to SeaBIOS clearing
CR0.CD immediately after SIPI, i.e. with MTRRs disabled, but that's arguably a
correctness fix since the quirk means KVM incorrectly runs the vCPU with WB SPTEs
until MTRRs are programmed.

With precise+batched zapping, zapping non-WB ranges even when CR0.CD is set should
still be a healthy performance boost for OVMF.

> > It doesn't actually require non-coherent DMA within the CR0.CD=1 window though.
> If we don't need to mind non-coherent DMA within the window CR0.CD=1 to CR0.CD=0,
> do you think it's a good idea to do in this way...
> 
> (1) when CR0.CD=1, return guest mtrr type. 
> 
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7532,12 +7532,13 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
>                 return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
> 
>         if (kvm_read_cr0_bits(vcpu, X86_CR0_CD)) {
> -               if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
> -                       cache = MTRR_TYPE_WRBACK;
> -               else
> +               if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED)) {
> +                       cache = kvm_mtrr_get_guest_memory_type(vcpu, gfn);
> +                       return cache << VMX_EPT_MT_EPTE_SHIFT;
> +               } else {
>                         cache = MTRR_TYPE_UNCACHABLE;
> -
> -               return (cache << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
> +                       return (cache << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
> +               }
>         }
> 
> 
> (2) return MTRR_TYPE_WRBACK if guest mtrr has not been enabled for once
> u8 kvm_mtrr_get_guest_memory_type(struct kvm_vcpu *vcpu, gfn_t gfn)
> @@ -628,13 +635,23 @@ u8 kvm_mtrr_get_guest_memory_type(struct kvm_vcpu *vcpu, gfn_t gfn)
>         struct mtrr_iter iter;
>         u64 start, end;
>         int type = -1;
>         const int wt_wb_mask = (1 << MTRR_TYPE_WRBACK)
>                                | (1 << MTRR_TYPE_WRTHROUGH);
>  
> +       if (!mtrr_state->enabled_once)
> +               return MTRR_TYPE_WRBACK;

No, because this assumes too many things about the guest, and completely falls
apart if the guest reboots.

> (3) when CR0.CD = 1 and the quirk is on, return MTRR type as if MTRR is enabled
> +       ignore_disable = kvm_read_cr0_bits(vcpu, X86_CR0_CD) &&
> +                        kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED);
> 
> -static void mtrr_lookup_start(struct mtrr_iter *iter)
> +static void mtrr_lookup_start(struct mtrr_iter *iter, bool ignore_disable)
>  {
> -       if (!mtrr_is_enabled(iter->mtrr_state)) {
> +       if (!ignore_disable && !mtrr_is_enabled(iter->mtrr_state)) {
>                 iter->mtrr_disabled = true;
>                 return;
>         }
> 
> (4) Then we only need to do EPT zap when MTRR is enabled for the first time
> and when MTRR fixed/var entries are changed at enabling MTRR or at CR0.CD=0
> (I prefer at enabling MTRR, as seabios may do the MTRR disable/update/enable when
> CR0.CD=0)
> 
> 
> Besides, accoding to the commit message introducing KVM_QUIRK_CD_NW_CLEARED,

I am not willing to lean on the historical commit messages for the quirk to
justify any change.  I'm not at all convinced that there was any meaningful thought
put into ensuring correctness.

> we can return MTRR_TYPE_WRBACK for CR0.CD=1 only when MTRR is not enbled for
> once. (And I tried, it works!)

On your system, for your setup.  The quirk terrifies me because it likely affects
every KVM-based VM out there (I can't find a single VMM that disables the quirk).
These changes are limited to VMs with noncoherent DMA, but still.

In short, I am steadfastly against any change that piles more arbitrary behavior
functional behavior on top of the quirk, especially when that behavior relies on
heuristics to try and guess what the guest is doing.
Yan Zhao May 30, 2023, 1:19 a.m. UTC | #14
On Fri, May 26, 2023 at 09:09:08AM -0700, Sean Christopherson wrote:
> On Fri, May 26, 2023, Yan Zhao wrote:
> > On Thu, May 25, 2023 at 07:53:41AM -0700, Sean Christopherson wrote:
> > > > > > > So, if KVM zaps SPTEs when CR0.CD is cleared (even when the quirk is enabled),
> > > > > > > then KVM can skip the MTRR zaps when CR0.CD=1 because KVM is ignoring the MTRRs
> > > > > > > and will zap when CR0.CD is cleared.  And to avoid regressing the CR0.CD case,
> > > > > > > if KVM honors guest PAT for the bizarro CR0.CD quirk, then KVM only needs to
> > > > > > > zap non-WB MTRR ranges when CR0.CD is cleared.  Since WB is weak, looking for
> > > > Not only non-WB ranges are required to be zapped.
> > > > Think about a scenario:
> > > > (1) one fixed MTRR range is UC initially. Its EPT entry memtype is UC, too.
> > > > (2) after CR0.CD=1, without zap, its EPT entry memtype is still UC.
> > > > (3) then guest performs MTRR disable, no zap too.
> > > > (4) guest updates this fixed MTRR range to WB, and performs MTRR enable.
> > > > (5) CR0.CD=0. we need to zap this fixed range to update the EPT memtype to WB.
> > > 
> > > KVM installs WB memtype when the quirk is enabled, thus no need to zap.  KVM
> > > currently zaps everything when the quirk is disabled, and I'm not proposing that
> > > be changed.
> > I didn't explain it clearly.
> > 
> > (1) one fixed MTRR range is UC initially. Its EPT entry memtype is UC, too. ==> EPT entry has been created here
> > (2) after CR0.CD=1, because of the quirk, no zap, the created EPT entry memtype is still UC.
> > (3) then guest performs MTRR disable, no zap too, according to our change.
> > (4) guest updates this fixed MTRR range to WB, and performs MTRR enable.
> > (5) CR0.CD=0. we need to zap this fixed range to update the EPT memtype to WB.==>we also need to zap WB range.
> 
> Ugh, right.  But that case can be handled by zapping non-WB ranges on CR0.CD being
> set.  Hmm, and KVM would need to zap everything if CR0.CD is toggled with MTRRs
Ok. Actually even with below "zap everything always", I didn't observe any performance
downgrade on OVMF in my side regarding to this change as we skipped EPT zap in
update_mtrr() when CR0.CD=1.
(even 3 less zaps for vCPU 0 and 1 less zap for APs as initially there are several MTRRs
disabled when CR0.CD=1 without accompanying CR0.CD toggle).

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index dadf80fb85e9..ad3c4dc9d7bf 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -941,9 +941,9 @@ void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned lon

        if (((cr0 ^ old_cr0) & X86_CR0_CD) &&
-           kvm_mmu_honors_guest_mtrr(vcpu->kvm) &&
-           !kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED)
+           kvm_mmu_honors_guest_mtrr(vcpu->kvm)
                kvm_zap_gfn_range(vcpu->kvm, 0, ~0ULL);
 }
 EXPORT_SYMBOL_GPL(kvm_post_set_cr0);



> disabled, but I don't think OVMF ever does that.  Zapping on CR0.CD toggling would
> would likely introduce a small performance hit for SeaBIOS due to SeaBIOS clearing
> CR0.CD immediately after SIPI, i.e. with MTRRs disabled, but that's arguably a
For SeaBIOS, I also observed 1 less of EPT zap for each vCPU, because
there are 3 more MTRR toggles than CR0.CD toggles for each vCPU, and only 2 MTRR
toggles happen when CR0.CD=0.

> correctness fix since the quirk means KVM incorrectly runs the vCPU with WB SPTEs
> until MTRRs are programmed.

> With precise+batched zapping, zapping non-WB ranges even when CR0.CD is set should
> still be a healthy performance boost for OVMF.
Ok. I'll try to implement in this way in the next version.


> > (2) return MTRR_TYPE_WRBACK if guest mtrr has not been enabled for once
> > u8 kvm_mtrr_get_guest_memory_type(struct kvm_vcpu *vcpu, gfn_t gfn)
> > @@ -628,13 +635,23 @@ u8 kvm_mtrr_get_guest_memory_type(struct kvm_vcpu *vcpu, gfn_t gfn)
> >         struct mtrr_iter iter;
> >         u64 start, end;
> >         int type = -1;
> >         const int wt_wb_mask = (1 << MTRR_TYPE_WRBACK)
> >                                | (1 << MTRR_TYPE_WRTHROUGH);
> >  
> > +       if (!mtrr_state->enabled_once)
> > +               return MTRR_TYPE_WRBACK;
> 
> No, because this assumes too many things about the guest, and completely falls
> apart if the guest reboots.
Ah, right.

> I am not willing to lean on the historical commit messages for the quirk to
> justify any change.  I'm not at all convinced that there was any meaningful thought
> put into ensuring correctness.
> 
> > we can return MTRR_TYPE_WRBACK for CR0.CD=1 only when MTRR is not enbled for
> > once. (And I tried, it works!)
> 
> On your system, for your setup.  The quirk terrifies me because it likely affects
> every KVM-based VM out there (I can't find a single VMM that disables the quirk).
> These changes are limited to VMs with noncoherent DMA, but still.
> 
> In short, I am steadfastly against any change that piles more arbitrary behavior
> functional behavior on top of the quirk, especially when that behavior relies on
> heuristics to try and guess what the guest is doing.

Ok, yes, this is the right insistence.
diff mbox series

Patch

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 2865c3cb3501..a2b6b1e1548f 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1444,6 +1444,9 @@  struct kvm_arch {
 	 */
 #define SPLIT_DESC_CACHE_MIN_NR_OBJECTS (SPTE_ENT_PER_PAGE + 1)
 	struct kvm_mmu_memory_cache split_desc_cache;
+
+	struct kvm_mtrr *mtrr_state;
+	bool has_mtrr;
 };
 
 struct kvm_vm_stat {
diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
index 62ebb9978156..1ae80c756797 100644
--- a/arch/x86/kvm/mtrr.c
+++ b/arch/x86/kvm/mtrr.c
@@ -438,6 +438,28 @@  void kvm_vcpu_mtrr_init(struct kvm_vcpu *vcpu)
 	INIT_LIST_HEAD(&vcpu->arch.mtrr_state.head);
 }
 
+void kvm_mtrr_init(struct kvm_vcpu *vcpu)
+{
+	struct kvm *kvm = vcpu->kvm;
+
+	if (vcpu->vcpu_id)
+		return;
+
+	rcu_assign_pointer(kvm->arch.mtrr_state, &vcpu->arch.mtrr_state);
+	kvm->arch.has_mtrr = guest_cpuid_has(vcpu, X86_FEATURE_MTRR);
+}
+
+void kvm_mtrr_destroy(struct kvm_vcpu *vcpu)
+{
+	struct kvm *kvm = vcpu->kvm;
+
+	if (vcpu->vcpu_id)
+		return;
+
+	rcu_assign_pointer(kvm->arch.mtrr_state, NULL);
+	synchronize_srcu_expedited(&kvm->srcu);
+}
+
 struct mtrr_iter {
 	/* input fields. */
 	struct kvm_mtrr *mtrr_state;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 48b683a305b3..b8aa18031877 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11879,6 +11879,7 @@  int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
 	vcpu->arch.msr_platform_info = MSR_PLATFORM_INFO_CPUID_FAULT;
 	kvm_xen_init_vcpu(vcpu);
 	kvm_vcpu_mtrr_init(vcpu);
+	kvm_mtrr_init(vcpu);
 	vcpu_load(vcpu);
 	kvm_set_tsc_khz(vcpu, vcpu->kvm->arch.default_tsc_khz);
 	kvm_vcpu_reset(vcpu, false);
@@ -11948,6 +11949,7 @@  void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
 	kvfree(vcpu->arch.cpuid_entries);
 	if (!lapic_in_kernel(vcpu))
 		static_branch_dec(&kvm_has_noapic_vcpu);
+	kvm_mtrr_destroy(vcpu);
 }
 
 void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index c544602d07a3..d0a7e50de739 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -308,6 +308,8 @@  void kvm_deliver_exception_payload(struct kvm_vcpu *vcpu,
 				   struct kvm_queued_exception *ex);
 
 void kvm_vcpu_mtrr_init(struct kvm_vcpu *vcpu);
+void kvm_mtrr_init(struct kvm_vcpu *vcpu);
+void kvm_mtrr_destroy(struct kvm_vcpu *vcpu);
 u8 kvm_mtrr_get_guest_memory_type(struct kvm_vcpu *vcpu, gfn_t gfn);
 bool kvm_mtrr_valid(struct kvm_vcpu *vcpu, u32 msr, u64 data);
 int kvm_mtrr_set_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data);