[11/15] KVM: x86/MMU: Refactor vmx_get_mt_mask

Message ID 20211115234603.2908381-12-bgardon@google.com (mailing list archive)
State New, archived
Series: Currently disabling dirty logging with the TDP MMU is extremely slow. On a 96 vCPU / 96G VM it takes ~45 seconds to disable dirty logging with the TDP MMU, as opposed to ~3.5 seconds with the legacy MMU. This series optimizes TLB flushes and intro…

Commit Message

Ben Gardon Nov. 15, 2021, 11:45 p.m. UTC
Remove the gotos from vmx_get_mt_mask to make it easier to separate out
the parts which do not depend on vcpu state.

No functional change intended.


Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/vmx/vmx.c | 23 +++++++----------------
 1 file changed, 7 insertions(+), 16 deletions(-)

Comments

Paolo Bonzini Nov. 18, 2021, 8:30 a.m. UTC | #1
On 11/16/21 00:45, Ben Gardon wrote:
> Remove the gotos from vmx_get_mt_mask to make it easier to separate out
> the parts which do not depend on vcpu state.
> 
> No functional change intended.
> 
> 
> Signed-off-by: Ben Gardon <bgardon@google.com>

Queued, thanks (with a slightly edited commit message; the patch is a 
simplification anyway).

Paolo

> ---
>   arch/x86/kvm/vmx/vmx.c | 23 +++++++----------------
>   1 file changed, 7 insertions(+), 16 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 71f54d85f104..77f45c005f28 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -6987,7 +6987,6 @@ static int __init vmx_check_processor_compat(void)
>   static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
>   {
>   	u8 cache;
> -	u64 ipat = 0;
>   
>   	/* We wanted to honor guest CD/MTRR/PAT, but doing so could result in
>   	 * memory aliases with conflicting memory types and sometimes MCEs.
> @@ -7007,30 +7006,22 @@ static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
>   	 * EPT memory type is used to emulate guest CD/MTRR.
>   	 */
>   
> -	if (is_mmio) {
> -		cache = MTRR_TYPE_UNCACHABLE;
> -		goto exit;
> -	}
> +	if (is_mmio)
> +		return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;
>   
> -	if (!kvm_arch_has_noncoherent_dma(vcpu->kvm)) {
> -		ipat = VMX_EPT_IPAT_BIT;
> -		cache = MTRR_TYPE_WRBACK;
> -		goto exit;
> -	}
> +	if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
> +		return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
>   
>   	if (kvm_read_cr0(vcpu) & X86_CR0_CD) {
> -		ipat = VMX_EPT_IPAT_BIT;
>   		if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
>   			cache = MTRR_TYPE_WRBACK;
>   		else
>   			cache = MTRR_TYPE_UNCACHABLE;
> -		goto exit;
> -	}
>   
> -	cache = kvm_mtrr_get_guest_memory_type(vcpu, gfn);
> +		return (cache << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
> +	}
>   
> -exit:
> -	return (cache << VMX_EPT_MT_EPTE_SHIFT) | ipat;
> +	return kvm_mtrr_get_guest_memory_type(vcpu, gfn) << VMX_EPT_MT_EPTE_SHIFT;
>   }
>   
>   static void vmcs_set_secondary_exec_control(struct vcpu_vmx *vmx, u32 new_ctl)
>
Sean Christopherson Nov. 18, 2021, 3:30 p.m. UTC | #2
On Thu, Nov 18, 2021, Paolo Bonzini wrote:
> On 11/16/21 00:45, Ben Gardon wrote:
> > Remove the gotos from vmx_get_mt_mask to make it easier to separate out
> > the parts which do not depend on vcpu state.
> > 
> > No functional change intended.
> > 
> > 
> > Signed-off-by: Ben Gardon <bgardon@google.com>
> 
> Queued, thanks (with a slightly edited commit message; the patch is a
> simplification anyway).

Don't know what message you've queued, but just in case you kept some of the original,
can you further edit it to remove any snippets that mention separating out the parts
that don't depend on vCPU state?

IMO, we should not separate vmx_get_mt_mask() into per-VM and per-vCPU variants,
because the per-vCPU variant is a lie.  The memtype of a SPTE is not tracked anywhere,
which means that if the guest has non-uniform CR0.CD/NW or MTRR settings, KVM will
happily let the guest consume SPTEs with the incorrect memtype.  In practice, this
isn't an issue because no sane BIOS or kernel uses per-CPU MTRRs, nor do they have
DMA operations running while the cacheability state is in flux.
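
For reference, the bits in question: in an EPT PTE the memory type occupies bits 5:3 and the ignore-PAT flag is bit 6, matching the kernel's VMX_EPT_MT_EPTE_SHIFT and VMX_EPT_IPAT_BIT. A minimal illustrative helper (not kernel code) to pull those fields back out of a SPTE:

/* Illustrative only: decompose the EPT memtype fields that
 * vmx_get_mt_mask() returns and that the fault path bakes into a SPTE.
 * Once set, nothing records which vCPU's CR0.CD/MTRR state produced them. */
#define VMX_EPT_MT_EPTE_SHIFT	3		/* memtype: EPT PTE bits 5:3 */
#define VMX_EPT_IPAT_BIT	(1ull << 6)	/* ignore guest PAT */

static inline unsigned int ept_spte_memtype(unsigned long long spte)
{
	return (spte >> VMX_EPT_MT_EPTE_SHIFT) & 0x7;	/* an MTRR_TYPE_* value */
}

static inline int ept_spte_ignores_pat(unsigned long long spte)
{
	return (spte & VMX_EPT_IPAT_BIT) != 0;
}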

If we really want to make this state per-vCPU, KVM would need to incorporate the
CR0.CD and MTRR settings in kvm_mmu_page_role.  For MTRRs in particular, the worst
case scenario is that every vCPU has different MTRR settings, which means that
kvm_mmu_page_role would need to be expanded by 10 bits in order to track every
possible vcpu_idx (currently capped at 1024).
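
Concretely, 2^10 = 1024, so covering every possible vcpu_idx would look something like the sketch below. The struct and field names are hypothetical; kvm_mmu_page_role is in reality a packed 32-bit union, which is why spending 10 bits on this is so costly:

/* Hypothetical sketch only, not a proposed change: tracking the owning
 * vCPU in the page role.  Worst case, every vCPU gets its own set of
 * shadow pages. */
struct hypothetical_mmu_page_role {
	/* ... existing role bits (level, access, guest paging mode, ...) */
	unsigned int vcpu_idx : 10;	/* 0..1023 covers the 1024-vCPU cap */
};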

So unless we want to massively complicate kvm_mmu_page_role and gfn_track for a
scenario no one cares about, I would strongly prefer to acknowledge that KVM assumes
memtypes are a per-VM property, e.g. on top:

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 77f45c005f28..8a84d30f1dbd 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6984,8 +6984,9 @@ static int __init vmx_check_processor_compat(void)
        return 0;
 }

-static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
+static u64 vmx_get_mt_mask(struct kvm *kvm, gfn_t gfn, bool is_mmio)
 {
+       struct kvm_vcpu *vcpu;
        u8 cache;

        /* We wanted to honor guest CD/MTRR/PAT, but doing so could result in
@@ -7009,11 +7010,15 @@ static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
        if (is_mmio)
                return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;

-       if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
+       if (!kvm_arch_has_noncoherent_dma(kvm))
                return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;

+       vcpu = kvm_get_vcpu_by_id(kvm, 0);
+       if (KVM_BUG_ON(!vcpu, kvm))
+               return 0;
+
        if (kvm_read_cr0(vcpu) & X86_CR0_CD) {
-               if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
+               if (kvm_check_has_quirk(kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
                        cache = MTRR_TYPE_WRBACK;
                else
                        cache = MTRR_TYPE_UNCACHABLE;
Paolo Bonzini Nov. 19, 2021, 9:02 a.m. UTC | #3
On 11/18/21 16:30, Sean Christopherson wrote:
> On Thu, Nov 18, 2021, Paolo Bonzini wrote:
>> On 11/16/21 00:45, Ben Gardon wrote:
>>> Remove the gotos from vmx_get_mt_mask to make it easier to separate out
>>> the parts which do not depend on vcpu state.
>>>
>>> No functional change intended.
>>>
>>>
>>> Signed-off-by: Ben Gardon <bgardon@google.com>
>>
>> Queued, thanks (with a slightly edited commit message; the patch is a
>> simplification anyway).
> 
> Don't know what message you've queued, but just in case you kept some of the original,
> can you further edit it to remove any snippets that mention separating out the parts
> that don't depend on vCPU state?

Indeed I did keep some:

commit b7297e02826857e068d03f844c8336ce48077d78
Author: Ben Gardon <bgardon@google.com>
Date:   Mon Nov 15 15:45:59 2021 -0800

     KVM: x86/MMU: Simplify flow of vmx_get_mt_mask
     
     Remove the gotos from vmx_get_mt_mask.  This may later make it easier
     to separate out the parts which do not depend on vcpu state, but it also
     simplifies the code in general.
     
     No functional change intended.

i.e. keeping it conditional but I can edit it further, like

     Remove the gotos from vmx_get_mt_mask.  It's easier to build the whole
     memory type at once than it is to combine separate cacheability and ipat
     fields.

Paolo

> IMO, we should not separate vmx_get_mt_mask() into per-VM and per-vCPU variants,
> because the per-vCPU variant is a lie.  The memtype of a SPTE is not tracked anywhere,
> which means that if the guest has non-uniform CR0.CD/NW or MTRR settings, KVM will
> happily let the guest consume SPTEs with the incorrect memtype.  In practice, this
> isn't an issue because no sane BIOS or kernel uses per-CPU MTRRs, nor do they have
> DMA operations running while the cacheability state is in flux.
> 
> If we really want to make this state per-vCPU, KVM would need to incorporate the
> CR0.CD and MTRR settings in kvm_mmu_page_role.  For MTRRs in particular, the worst
> case scenario is that every vCPU has different MTRR settings, which means that
> kvm_mmu_page_role would need to be expanded by 10 bits in order to track every
> possible vcpu_idx (currently capped at 1024).

Yes, that's insanity.  I was also a bit skeptical about Ben's try_get_mt_mask callback,
but this would be much much worse.

Paolo

> So unless we want to massively complicate kvm_mmu_page_role and gfn_track for a
> scenario no one cares about, I would strongly prefer to acknowledge that KVM assumes
> memtypes are a per-VM property, e.g. on top:
> 
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 77f45c005f28..8a84d30f1dbd 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -6984,8 +6984,9 @@ static int __init vmx_check_processor_compat(void)
>          return 0;
>   }
> 
> -static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> +static u64 vmx_get_mt_mask(struct kvm *kvm, gfn_t gfn, bool is_mmio)
>   {
> +       struct kvm_vcpu *vcpu;
>          u8 cache;
> 
>          /* We wanted to honor guest CD/MTRR/PAT, but doing so could result in
> @@ -7009,11 +7010,15 @@ static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
>          if (is_mmio)
>                  return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;
> 
> -       if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
> +       if (!kvm_arch_has_noncoherent_dma(kvm))
>                  return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
> 
> +       vcpu = kvm_get_vcpu_by_id(kvm, 0);
> +       if (KVM_BUG_ON(!vcpu, kvm))
> +               return 0;
> +
>          if (kvm_read_cr0(vcpu) & X86_CR0_CD) {
> -               if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
> +               if (kvm_check_has_quirk(kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
>                          cache = MTRR_TYPE_WRBACK;
>                  else
>                          cache = MTRR_TYPE_UNCACHABLE;
>
Ben Gardon Nov. 22, 2021, 6:11 p.m. UTC | #4
On Fri, Nov 19, 2021 at 1:03 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 11/18/21 16:30, Sean Christopherson wrote:
> > On Thu, Nov 18, 2021, Paolo Bonzini wrote:
> >> On 11/16/21 00:45, Ben Gardon wrote:
> >>> Remove the gotos from vmx_get_mt_mask to make it easier to separate out
> >>> the parts which do not depend on vcpu state.
> >>>
> >>> No functional change intended.
> >>>
> >>>
> >>> Signed-off-by: Ben Gardon <bgardon@google.com>
> >>
> >> Queued, thanks (with a slightly edited commit message; the patch is a
> >> simplification anyway).
> >
> > Don't know what message you've queued, but just in case you kept some of the original,
> > can you further edit it to remove any snippets that mention separating out the parts
> > that don't depend on vCPU state?
>
> Indeed I did keep some:
>
> commit b7297e02826857e068d03f844c8336ce48077d78
> Author: Ben Gardon <bgardon@google.com>
> Date:   Mon Nov 15 15:45:59 2021 -0800
>
>      KVM: x86/MMU: Simplify flow of vmx_get_mt_mask
>
>      Remove the gotos from vmx_get_mt_mask.  This may later make it easier
>      to separate out the parts which do not depend on vcpu state, but it also
>      simplifies the code in general.
>
>      No functional change intended.
>
> i.e. keeping it conditional but I can edit it further, like
>
>      Remove the gotos from vmx_get_mt_mask.  It's easier to build the whole
>      memory type at once than it is to combine separate cacheability and ipat
>      fields.
>
> Paolo
>
> > IMO, we should not separate vmx_get_mt_mask() into per-VM and per-vCPU variants,
> > because the per-vCPU variant is a lie.  The memtype of a SPTE is not tracked anywhere,
> > which means that if the guest has non-uniform CR0.CD/NW or MTRR settings, KVM will
> > happily let the guest consume SPTEs with the incorrect memtype.  In practice, this
> > isn't an issue because no sane BIOS or kernel uses per-CPU MTRRs, nor do they have
> > DMA operations running while the cacheability state is in flux.
> >
> > If we really want to make this state per-vCPU, KVM would need to incorporate the
> > CR0.CD and MTRR settings in kvm_mmu_page_role.  For MTRRs in particular, the worst
> > case scenario is that every vCPU has different MTRR settings, which means that
> > kvm_mmu_page_role would need to be expanded by 10 bits in order to track every
> > possible vcpu_idx (currently capped at 1024).
>
> Yes, that's insanity.  I was also a bit skeptical about Ben's try_get_mt_mask callback,
> but this would be much much worse.

Yeah, the implementation of that felt a bit kludgy to me too, but
refactoring the handling of all those CR bits was way more complex
than I wanted to handle in this patch set.
I'd love to see some of those CR0 / MTRR settings be set on a VM basis
and enforced as uniform across vCPUs.
Looking up vCPU 0 and basing things on that feels extra hacky though,
especially if we're still not asserting uniformity of settings across
vCPUs.
If we need to track that state to accurately virtualize the hardware
though, that would be unfortunate.

>
> Paolo
>
> > So unless we want to massively complicate kvm_mmu_page_role and gfn_track for a
> > scenario no one cares about, I would strongly prefer to acknowledge that KVM assumes
> > memtypes are a per-VM property, e.g. on top:
> >
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index 77f45c005f28..8a84d30f1dbd 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -6984,8 +6984,9 @@ static int __init vmx_check_processor_compat(void)
> >          return 0;
> >   }
> >
> > -static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> > +static u64 vmx_get_mt_mask(struct kvm *kvm, gfn_t gfn, bool is_mmio)
> >   {
> > +       struct kvm_vcpu *vcpu;
> >          u8 cache;
> >
> >          /* We wanted to honor guest CD/MTRR/PAT, but doing so could result in
> > @@ -7009,11 +7010,15 @@ static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> >          if (is_mmio)
> >                  return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;
> >
> > -       if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
> > +       if (!kvm_arch_has_noncoherent_dma(kvm))
> >                  return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
> >
> > +       vcpu = kvm_get_vcpu_by_id(kvm, 0);
> > +       if (KVM_BUG_ON(!vcpu, kvm))
> > +               return 0;
> > +
> >          if (kvm_read_cr0(vcpu) & X86_CR0_CD) {
> > -               if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
> > +               if (kvm_check_has_quirk(kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
> >                          cache = MTRR_TYPE_WRBACK;
> >                  else
> >                          cache = MTRR_TYPE_UNCACHABLE;
> >
>
Sean Christopherson Nov. 22, 2021, 6:46 p.m. UTC | #5
On Mon, Nov 22, 2021, Ben Gardon wrote:
> On Fri, Nov 19, 2021 at 1:03 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
> >
> > On 11/18/21 16:30, Sean Christopherson wrote:
> > > If we really want to make this state per-vCPU, KVM would need to incorporate the
> > > CR0.CD and MTRR settings in kvm_mmu_page_role.  For MTRRs in particular, the worst
> > > case scenario is that every vCPU has different MTRR settings, which means that
> > > kvm_mmu_page_role would need to be expanded by 10 bits in order to track every
> > > possible vcpu_idx (currently capped at 1024).
> >
> > Yes, that's insanity.  I was also a bit skeptical about Ben's try_get_mt_mask callback,
> > but this would be much much worse.
> 
> Yeah, the implementation of that felt a bit kludgy to me too, but
> refactoring the handling of all those CR bits was way more complex
> than I wanted to handle in this patch set.
> I'd love to see some of those CR0 / MTRR settings be set on a VM basis
> and enforced as uniform across vCPUs.

Architecturally, we can't do that.  Even a perfectly well-behaved guest will have
(small) periods where the BSP has different settings than APs.  And it's technically
legal to have non-uniform MTRR and CR0.CD/NW configurations, even though no modern
BIOS/kernel does that.  Except for non-coherent DMA, it's a moot point because KVM
can simply ignore guest cacheability settings.

> Looking up vCPU 0 and basing things on that feels extra hacky though,
> especially if we're still not asserting uniformity of settings across
> vCPUs.

IMO, it's marginally less hacky than what KVM has today as it allows KVM's behavior
to be clearly and sanely stated, e.g. KVM uses vCPU0's cacheability settings when
mapping non-coherent DMA.  Compare that with today's behavior where the cacheability
settings depend on which vCPU first faulted in the address for a given MMU role and
instance of the associated root, and whether other vCPUs share an MMU role/root.

> If we need to track that state to accurately virtualize the hardware
> though, that would be unfortunate.
Patch

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 71f54d85f104..77f45c005f28 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6987,7 +6987,6 @@ static int __init vmx_check_processor_compat(void)
 static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
 {
 	u8 cache;
-	u64 ipat = 0;
 
 	/* We wanted to honor guest CD/MTRR/PAT, but doing so could result in
 	 * memory aliases with conflicting memory types and sometimes MCEs.
@@ -7007,30 +7006,22 @@ static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
 	 * EPT memory type is used to emulate guest CD/MTRR.
 	 */
 
-	if (is_mmio) {
-		cache = MTRR_TYPE_UNCACHABLE;
-		goto exit;
-	}
+	if (is_mmio)
+		return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;
 
-	if (!kvm_arch_has_noncoherent_dma(vcpu->kvm)) {
-		ipat = VMX_EPT_IPAT_BIT;
-		cache = MTRR_TYPE_WRBACK;
-		goto exit;
-	}
+	if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
+		return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
 
 	if (kvm_read_cr0(vcpu) & X86_CR0_CD) {
-		ipat = VMX_EPT_IPAT_BIT;
 		if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
 			cache = MTRR_TYPE_WRBACK;
 		else
 			cache = MTRR_TYPE_UNCACHABLE;
-		goto exit;
-	}
 
-	cache = kvm_mtrr_get_guest_memory_type(vcpu, gfn);
+		return (cache << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
+	}
 
-exit:
-	return (cache << VMX_EPT_MT_EPTE_SHIFT) | ipat;
+	return kvm_mtrr_get_guest_memory_type(vcpu, gfn) << VMX_EPT_MT_EPTE_SHIFT;
 }
 
 static void vmcs_set_secondary_exec_control(struct vcpu_vmx *vmx, u32 new_ctl)
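
As a quick check of the "no functional change intended" claim, the standalone sketch below (not kernel code; the vcpu/kvm state is reduced to booleans and kvm_mtrr_get_guest_memory_type() to a passed-in type) exhaustively compares the old goto-based flow against the refactored early-return flow:

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Constants mirror the kernel's values. */
#define MTRR_TYPE_UNCACHABLE	0
#define MTRR_TYPE_WRBACK	6
#define VMX_EPT_MT_EPTE_SHIFT	3
#define VMX_EPT_IPAT_BIT	(1ull << 6)

/* Old goto-based flow, with the vcpu/kvm queries modeled as booleans
 * and the guest MTRR lookup modeled as a passed-in type. */
static uint64_t old_flow(bool is_mmio, bool noncoherent_dma, bool cr0_cd,
			 bool cd_nw_quirk, uint8_t mtrr_type)
{
	uint8_t cache;
	uint64_t ipat = 0;

	if (is_mmio) {
		cache = MTRR_TYPE_UNCACHABLE;
		goto exit;
	}

	if (!noncoherent_dma) {
		ipat = VMX_EPT_IPAT_BIT;
		cache = MTRR_TYPE_WRBACK;
		goto exit;
	}

	if (cr0_cd) {
		ipat = VMX_EPT_IPAT_BIT;
		cache = cd_nw_quirk ? MTRR_TYPE_WRBACK : MTRR_TYPE_UNCACHABLE;
		goto exit;
	}

	cache = mtrr_type;
exit:
	return ((uint64_t)cache << VMX_EPT_MT_EPTE_SHIFT) | ipat;
}

/* Refactored early-return flow from the patch. */
static uint64_t new_flow(bool is_mmio, bool noncoherent_dma, bool cr0_cd,
			 bool cd_nw_quirk, uint8_t mtrr_type)
{
	uint8_t cache;

	if (is_mmio)
		return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;

	if (!noncoherent_dma)
		return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) |
		       VMX_EPT_IPAT_BIT;

	if (cr0_cd) {
		cache = cd_nw_quirk ? MTRR_TYPE_WRBACK : MTRR_TYPE_UNCACHABLE;
		return ((uint64_t)cache << VMX_EPT_MT_EPTE_SHIFT) |
		       VMX_EPT_IPAT_BIT;
	}

	return (uint64_t)mtrr_type << VMX_EPT_MT_EPTE_SHIFT;
}

int main(void)
{
	const uint8_t mtrr_types[] = { MTRR_TYPE_UNCACHABLE, MTRR_TYPE_WRBACK };

	/* All 16 boolean combinations, times both stubbed MTRR types. */
	for (int i = 0; i < 16; i++) {
		for (unsigned int t = 0; t < 2; t++) {
			bool mmio = i & 1, dma = i & 2, cd = i & 4, q = i & 8;

			assert(old_flow(mmio, dma, cd, q, mtrr_types[t]) ==
			       new_flow(mmio, dma, cd, q, mtrr_types[t]));
		}
	}
	printf("old and new flows agree on all tested combinations\n");
	return 0;
}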