Message ID: 20220204115718.14934-1-pbonzini@redhat.com (mailing list archive)
Series: KVM: MMU: MMU role refactoring
On Fri, Feb 04, 2022 at 06:56:55AM -0500, Paolo Bonzini wrote:
> The TDP MMU has a performance regression compared to the legacy
> MMU when CR0 changes often. This was reported for the grsecurity
> kernel, which uses CR0.WP to implement kernel W^X. In that case,
> each change to CR0.WP unloads the MMU and causes a lot of unnecessary
> work. When running nested, this can even cause the L1 to hardly
> make progress, as the L0 hypervisor is overwhelmed by the amount
> of MMU work that is needed.
>
> The root cause of the issue is that the "MMU role" in KVM is a mess
> that mixes the CPU setup (CR0/CR4/EFER, SMM, guest mode, etc.)
> and the shadow page table format. Whenever something is different
> between the MMU and the CPU, it is stored as an extra field in struct
> kvm_mmu---and for extra bonus complication, sometimes the same thing
> is stored in both the role and an extra field.
>
> So, this is the "no functional change intended" part of the changes
> required to fix the performance regression. It separates neatly
> the shadow page table format ("MMU role") from the guest page table
> format ("CPU role"), and removes the duplicate fields.

What do you think about calling this the guest_role instead of cpu_role?
There is a bit of a precedent for using "guest" instead of "cpu" already
for this type of concept (e.g. guest_walker), and I find it more
intuitive.

> The next
> step then is to avoid unloading the MMU as long as the MMU role
> stays the same.
>
> Please review!
> 
> Paolo
> 
> Paolo Bonzini (23):
>   KVM: MMU: pass uses_nx directly to reset_shadow_zero_bits_mask
>   KVM: MMU: nested EPT cannot be used in SMM
>   KVM: MMU: remove valid from extended role
>   KVM: MMU: constify uses of struct kvm_mmu_role_regs
>   KVM: MMU: pull computation of kvm_mmu_role_regs to kvm_init_mmu
>   KVM: MMU: load new PGD once nested two-dimensional paging is
>     initialized
>   KVM: MMU: remove kvm_mmu_calc_root_page_role
>   KVM: MMU: rephrase unclear comment
>   KVM: MMU: remove "bool base_only" arguments
>   KVM: MMU: split cpu_role from mmu_role
>   KVM: MMU: do not recompute root level from kvm_mmu_role_regs
>   KVM: MMU: remove ept_ad field
>   KVM: MMU: remove kvm_calc_shadow_root_page_role_common
>   KVM: MMU: cleanup computation of MMU roles for two-dimensional paging
>   KVM: MMU: cleanup computation of MMU roles for shadow paging
>   KVM: MMU: remove extended bits from mmu_role
>   KVM: MMU: remove redundant bits from extended role
>   KVM: MMU: fetch shadow EFER.NX from MMU role
>   KVM: MMU: simplify and/or inline computation of shadow MMU roles
>   KVM: MMU: pull CPU role computation to kvm_init_mmu
>   KVM: MMU: store shadow_root_level into mmu_role
>   KVM: MMU: use cpu_role for root_level
>   KVM: MMU: replace direct_map with mmu_role.direct
> 
>  arch/x86/include/asm/kvm_host.h |  13 +-
>  arch/x86/kvm/mmu.h              |   2 +-
>  arch/x86/kvm/mmu/mmu.c          | 408 ++++++++++++--------------------
>  arch/x86/kvm/mmu/mmu_audit.c    |   6 +-
>  arch/x86/kvm/mmu/paging_tmpl.h  |  12 +-
>  arch/x86/kvm/mmu/tdp_mmu.c      |   4 +-
>  arch/x86/kvm/svm/svm.c          |   2 +-
>  arch/x86/kvm/vmx/vmx.c          |   2 +-
>  arch/x86/kvm/x86.c              |  12 +-
>  10 files changed, 178 insertions(+), 284 deletions(-)
> 
> -- 
> 2.31.1
On Mon, Feb 07, 2022, David Matlack wrote:
> On Fri, Feb 04, 2022 at 06:56:55AM -0500, Paolo Bonzini wrote:
> > The TDP MMU has a performance regression compared to the legacy
> > MMU when CR0 changes often. This was reported for the grsecurity
> > kernel, which uses CR0.WP to implement kernel W^X. In that case,
> > each change to CR0.WP unloads the MMU and causes a lot of unnecessary
> > work. When running nested, this can even cause the L1 to hardly
> > make progress, as the L0 hypervisor is overwhelmed by the amount
> > of MMU work that is needed.
> >
> > The root cause of the issue is that the "MMU role" in KVM is a mess
> > that mixes the CPU setup (CR0/CR4/EFER, SMM, guest mode, etc.)
> > and the shadow page table format. Whenever something is different
> > between the MMU and the CPU, it is stored as an extra field in struct
> > kvm_mmu---and for extra bonus complication, sometimes the same thing
> > is stored in both the role and an extra field.
> >
> > So, this is the "no functional change intended" part of the changes
> > required to fix the performance regression. It separates neatly
> > the shadow page table format ("MMU role") from the guest page table
> > format ("CPU role"), and removes the duplicate fields.
>
> What do you think about calling this the guest_role instead of cpu_role?
> There is a bit of a precedent for using "guest" instead of "cpu" already
> for this type of concept (e.g. guest_walker), and I find it more
> intuitive.

Haven't looked at the series yet, but I'd prefer not to use guest_role, it's
too similar to is_guest_mode() and kvm_mmu_role.guest_mode.  E.g. we'd end
up with

  static union kvm_mmu_role kvm_calc_guest_role(struct kvm_vcpu *vcpu,
                                                const struct kvm_mmu_role_regs *regs)
  {
          union kvm_mmu_role role = {0};

          role.base.access = ACC_ALL;
          role.base.smm = is_smm(vcpu);
          role.base.guest_mode = is_guest_mode(vcpu);
          role.base.direct = !____is_cr0_pg(regs);

          ...
  }

and possibly

  if (guest_role.guest_mode)
          ...

which would be quite messy.

Maybe vcpu_role if cpu_role isn't intuitive?
On Mon, Feb 7, 2022 at 3:27 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Feb 07, 2022, David Matlack wrote:
> > On Fri, Feb 04, 2022 at 06:56:55AM -0500, Paolo Bonzini wrote:
> > > The TDP MMU has a performance regression compared to the legacy
> > > MMU when CR0 changes often. This was reported for the grsecurity
> > > kernel, which uses CR0.WP to implement kernel W^X. In that case,
> > > each change to CR0.WP unloads the MMU and causes a lot of unnecessary
> > > work. When running nested, this can even cause the L1 to hardly
> > > make progress, as the L0 hypervisor is overwhelmed by the amount
> > > of MMU work that is needed.
> > >
> > > The root cause of the issue is that the "MMU role" in KVM is a mess
> > > that mixes the CPU setup (CR0/CR4/EFER, SMM, guest mode, etc.)
> > > and the shadow page table format. Whenever something is different
> > > between the MMU and the CPU, it is stored as an extra field in struct
> > > kvm_mmu---and for extra bonus complication, sometimes the same thing
> > > is stored in both the role and an extra field.
> > >
> > > So, this is the "no functional change intended" part of the changes
> > > required to fix the performance regression. It separates neatly
> > > the shadow page table format ("MMU role") from the guest page table
> > > format ("CPU role"), and removes the duplicate fields.
> >
> > What do you think about calling this the guest_role instead of cpu_role?
> > There is a bit of a precedent for using "guest" instead of "cpu" already
> > for this type of concept (e.g. guest_walker), and I find it more
> > intuitive.
>
> Haven't looked at the series yet, but I'd prefer not to use guest_role, it's
> too similar to is_guest_mode() and kvm_mmu_role.guest_mode.  E.g. we'd end
> up with
>
>   static union kvm_mmu_role kvm_calc_guest_role(struct kvm_vcpu *vcpu,
>                                                 const struct kvm_mmu_role_regs *regs)
>   {
>           union kvm_mmu_role role = {0};
>
>           role.base.access = ACC_ALL;
>           role.base.smm = is_smm(vcpu);
>           role.base.guest_mode = is_guest_mode(vcpu);
>           role.base.direct = !____is_cr0_pg(regs);
>
>           ...
>   }
>
> and possibly
>
>   if (guest_role.guest_mode)
>           ...
>
> which would be quite messy.  Maybe vcpu_role if cpu_role isn't intuitive?

I agree it's a little odd. But actually it's somewhat intuitive (the
guest is in guest-mode, i.e. we're running a nested guest).

Ok I'm stretching a little bit :). But if the trade-off is just
"guest_role.guest_mode" requires a clarifying comment, but the rest of
the code gets more readable (cpu_role is used a lot more than
role.guest_mode), it still might be worth it.
On Fri, Feb 04, 2022, Paolo Bonzini wrote:
> Paolo Bonzini (23):
>   KVM: MMU: pass uses_nx directly to reset_shadow_zero_bits_mask
>   KVM: MMU: nested EPT cannot be used in SMM
>   KVM: MMU: remove valid from extended role
>   KVM: MMU: constify uses of struct kvm_mmu_role_regs
>   KVM: MMU: pull computation of kvm_mmu_role_regs to kvm_init_mmu
>   KVM: MMU: load new PGD once nested two-dimensional paging is
>     initialized
>   KVM: MMU: remove kvm_mmu_calc_root_page_role
>   KVM: MMU: rephrase unclear comment
>   KVM: MMU: remove "bool base_only" arguments
>   KVM: MMU: split cpu_role from mmu_role
>   KVM: MMU: do not recompute root level from kvm_mmu_role_regs
>   KVM: MMU: remove ept_ad field
>   KVM: MMU: remove kvm_calc_shadow_root_page_role_common
>   KVM: MMU: cleanup computation of MMU roles for two-dimensional paging
>   KVM: MMU: cleanup computation of MMU roles for shadow paging
>   KVM: MMU: remove extended bits from mmu_role
>   KVM: MMU: remove redundant bits from extended role
>   KVM: MMU: fetch shadow EFER.NX from MMU role
>   KVM: MMU: simplify and/or inline computation of shadow MMU roles
>   KVM: MMU: pull CPU role computation to kvm_init_mmu
>   KVM: MMU: store shadow_root_level into mmu_role
>   KVM: MMU: use cpu_role for root_level
>   KVM: MMU: replace direct_map with mmu_role.direct

Heresy! Everyone knows the one true way is "KVM: x86/mmu:"

  $ glo | grep "KVM: MMU:" | wc -l
  740
  $ glo | grep "KVM: x86/mmu:" | wc -l
  403

Dammit, I'm the heathen...

I do think we should use x86/mmu though. VMX and SVM (and nVMX and nSVM) are ok
because they're unlikely to collide with other architectures, but every arch has
an MMU...
On Mon, Feb 07, 2022, David Matlack wrote:
> On Mon, Feb 7, 2022 at 3:27 PM Sean Christopherson <seanjc@google.com> wrote:
> > > What do you think about calling this the guest_role instead of cpu_role?
> > > There is a bit of a precedent for using "guest" instead of "cpu" already
> > > for this type of concept (e.g. guest_walker), and I find it more
> > > intuitive.
> >
> > Haven't looked at the series yet, but I'd prefer not to use guest_role, it's
> > too similar to is_guest_mode() and kvm_mmu_role.guest_mode.  E.g. we'd end
> > up with
> >
> >   static union kvm_mmu_role kvm_calc_guest_role(struct kvm_vcpu *vcpu,
> >                                                 const struct kvm_mmu_role_regs *regs)
> >   {
> >           union kvm_mmu_role role = {0};
> >
> >           role.base.access = ACC_ALL;
> >           role.base.smm = is_smm(vcpu);
> >           role.base.guest_mode = is_guest_mode(vcpu);
> >           role.base.direct = !____is_cr0_pg(regs);
> >
> >           ...
> >   }
> >
> > and possibly
> >
> >   if (guest_role.guest_mode)
> >           ...
> >
> > which would be quite messy.  Maybe vcpu_role if cpu_role isn't intuitive?
>
> I agree it's a little odd. But actually it's somewhat intuitive (the
> guest is in guest-mode, i.e. we're running a nested guest).
>
> Ok I'm stretching a little bit :). But if the trade-off is just
> "guest_role.guest_mode" requires a clarifying comment, but the rest of
> the code gets more readable (cpu_role is used a lot more than
> role.guest_mode), it still might be worth it.

It's not just guest_mode, we also have guest_mmu, e.g. we'd end up with

  vcpu->arch.root_mmu.guest_role.base.level
  vcpu->arch.guest_mmu.guest_role.base.level
  vcpu->arch.nested_mmu.guest_role.base.level

In a vacuum, I 100% agree that guest_role is better than cpu_role or vcpu_role,
but the term "guest" has already been claimed for "L2" in far too many places.

While we're behind the bikeshed... the resulting:

  union kvm_mmu_role cpu_role;
  union kvm_mmu_page_role mmu_role;

is a mess.  Again, I really like "mmu_role" in a vacuum, but juxtaposed with

  union kvm_mmu_role cpu_role;

it's super confusing, e.g. I expected

  union kvm_mmu_role mmu_role;

Nested EPT is a good example of complete confusion, because we compute a
kvm_mmu_role, compare it to cpu_role, then shove it into both cpu_role and
mmu_role.  It makes sense once you reason about what it's doing, but on the
surface it's confusing.

  struct kvm_mmu *context = &vcpu->arch.guest_mmu;
  u8 level = vmx_eptp_page_walk_level(new_eptp);
  union kvm_mmu_role new_role =
          kvm_calc_shadow_ept_root_page_role(vcpu, accessed_dirty,
                                             execonly, level);

  if (new_role.as_u64 != context->cpu_role.as_u64) {
          /* EPT, and thus nested EPT, does not consume CR0, CR4, nor EFER. */
          context->cpu_role.as_u64 = new_role.as_u64;
          context->mmu_role.word = new_role.base.word;

Maybe this?

  union kvm_mmu_vcpu_role vcpu_role;
  union kvm_mmu_page_role mmu_role;

and some sample usage?

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index d25f8cb2e62b..9f9b97c88738 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4836,13 +4836,16 @@ void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
 {
 	struct kvm_mmu *context = &vcpu->arch.guest_mmu;
 	u8 level = vmx_eptp_page_walk_level(new_eptp);
-	union kvm_mmu_role new_role =
+	union kvm_mmu_vcpu_role new_role =
 		kvm_calc_shadow_ept_root_page_role(vcpu, accessed_dirty,
 						   execonly, level);
 
-	if (new_role.as_u64 != context->cpu_role.as_u64) {
-		/* EPT, and thus nested EPT, does not consume CR0, CR4, nor EFER. */
-		context->cpu_role.as_u64 = new_role.as_u64;
+	if (new_role.as_u64 != context->vcpu_role.as_u64) {
+		/*
+		 * EPT, and thus nested EPT, does not consume CR0, CR4, nor
+		 * EFER, so the mmu_role is a strict subset of the vcpu_role.
+		 */
+		context->vcpu_role.as_u64 = new_role.as_u64;
 		context->mmu_role.word = new_role.base.word;
 
 		context->page_fault = ept_page_fault;

And while I'm on a soapbox.... am I the only one that absolutely detests the
use of "context" and "g_context"?  I'd be all in favor of renaming those to
"mmu" throughout the code as a prep to this series.

I also think we should move the initializing of guest_mmu => mmu into the MMU
helpers.  Pulling the mmu from guest_mmu but then relying on the caller to
wire up guest_mmu => mmu so that e.g. kvm_mmu_new_pgd() works is gross and
confused the heck out of me.  E.g.

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index d25f8cb2e62b..4e7fe9758ce8 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4794,7 +4794,7 @@ static void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu,
 void kvm_init_shadow_npt_mmu(struct kvm_vcpu *vcpu, unsigned long cr0,
 			     unsigned long cr4, u64 efer, gpa_t nested_cr3)
 {
-	struct kvm_mmu *context = &vcpu->arch.guest_mmu;
+	struct kvm_mmu *mmu = &vcpu->arch.guest_mmu;
 	struct kvm_mmu_role_regs regs = {
 		.cr0 = cr0,
 		.cr4 = cr4 & ~X86_CR4_PKE,
@@ -4806,6 +4806,8 @@ void kvm_init_shadow_npt_mmu(struct kvm_vcpu *vcpu, unsigned long cr0,
 	mmu_role = cpu_role.base;
 	mmu_role.level = kvm_mmu_get_tdp_level(vcpu);
 
+	vcpu->arch.mmu = &vcpu->arch.guest_mmu;
+
 	shadow_mmu_init_context(vcpu, context, cpu_role, mmu_role);
 	kvm_mmu_new_pgd(vcpu, nested_cr3);
 }
@@ -4834,12 +4836,14 @@ void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
 			     int huge_page_level, bool accessed_dirty,
 			     gpa_t new_eptp)
 {
-	struct kvm_mmu *context = &vcpu->arch.guest_mmu;
+	struct kvm_mmu *mmu = &vcpu->arch.guest_mmu;
 	u8 level = vmx_eptp_page_walk_level(new_eptp);
 	union kvm_mmu_role new_role =
 		kvm_calc_shadow_ept_root_page_role(vcpu, accessed_dirty,
 						   execonly, level);
 
+	vcpu->arch.mmu = mmu;
+
 	if (new_role.as_u64 != context->cpu_role.as_u64) {
 		/* EPT, and thus nested EPT, does not consume CR0, CR4, nor EFER. */
 		context->cpu_role.as_u64 = new_role.as_u64;
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 1218b5a342fc..d0f8eddb32be 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -98,8 +98,6 @@ static void nested_svm_init_mmu_context(struct kvm_vcpu *vcpu)
 
 	WARN_ON(mmu_is_nested(vcpu));
 
-	vcpu->arch.mmu = &vcpu->arch.guest_mmu;
-
 	/*
 	 * The NPT format depends on L1's CR4 and EFER, which is in vmcb01.  Note,
 	 * when called via KVM_SET_NESTED_STATE, that state may _not_ match current
On 2/9/22 23:31, Sean Christopherson wrote:
> Heresy! Everyone knows the one true way is "KVM: x86/mmu:"
>
>   $ glo | grep "KVM: MMU:" | wc -l
>   740
>   $ glo | grep "KVM: x86/mmu:" | wc -l
>   403
>
> Dammit, I'm the heathen...
>
> I do think we should use x86/mmu though. VMX and SVM (and nVMX and nSVM) are ok
> because they're unlikely to collide with other architectures, but every arch has
> an MMU...

Sure, I can adjust my habits.

Paolo
On 2/10/22 02:11, Sean Christopherson wrote:
> In a vacuum, I 100% agree that guest_role is better than cpu_role or vcpu_role,
> but the term "guest" has already been claimed for "L2" in far too many places.
>
> While we're behind the bikeshed... the resulting:
>
>   union kvm_mmu_role cpu_role;
>   union kvm_mmu_page_role mmu_role;
>
> is a mess.  Again, I really like "mmu_role" in a vacuum, but juxtaposed with
>
>   union kvm_mmu_role cpu_role;
>
> it's super confusing, e.g. I expected
>
>   union kvm_mmu_role mmu_role;

What about

  union kvm_mmu_page_role root_role;
  union kvm_mmu_paging_mode cpu_mode;

?  I already have to remove ".base" from all accesses to mmu_role, so it's
not much extra churn.

Paolo
On Thu, Feb 10, 2022, Paolo Bonzini wrote:
> On 2/10/22 02:11, Sean Christopherson wrote:
> > In a vacuum, I 100% agree that guest_role is better than cpu_role or vcpu_role,
> > but the term "guest" has already been claimed for "L2" in far too many places.
> >
> > While we're behind the bikeshed... the resulting:
> >
> >   union kvm_mmu_role cpu_role;
> >   union kvm_mmu_page_role mmu_role;
> >
> > is a mess.  Again, I really like "mmu_role" in a vacuum, but juxtaposed with
> >
> >   union kvm_mmu_role cpu_role;
> >
> > it's super confusing, e.g. I expected
> >
> >   union kvm_mmu_role mmu_role;
>
> What about
>
>   union kvm_mmu_page_role root_role;
>   union kvm_mmu_paging_mode cpu_mode;
>
> ?  I already have to remove ".base" from all accesses to mmu_role, so it's
> not much extra churn.

I'd prefer not to use "paging mode", the SDM uses that terminology to refer to
the four paging modes.  My expectation given the name is that the union would
track only CR0.PG, EFER.LME, CR4.PAE, and CR4.PSE[*].

I'm out of ideas at the moment, I'll keep chewing on this while reviewing...

[*] Someone at Intel rewrote the SDM and eliminated Mode B, a.k.a. PSE 36-bit
physical paging, it's now just part of "32-bit paging".  But 5-level paging is
considered its own paging mode?!?!  Lame.  I guess they really want to have
exactly four paging modes...
On 2/10/22 17:55, Sean Christopherson wrote:
> > union kvm_mmu_page_role root_role;
> > union kvm_mmu_paging_mode cpu_mode;
>
> I'd prefer not to use "paging mode", the SDM uses that terminology to refer to
> the four paging modes.  My expectation given the name is that the union would
> track only CR0.PG, EFER.LME, CR4.PAE, and CR4.PSE[*].

Yeah, I had started with kvm_mmu_paging_flags, but cpu_flags was an even
worse name than kvm_mmu_paging_mode.

Anyway, now that I have done _some_ replacement, it's a matter of sed -i on
the patch files once you or someone else come up with a good moniker.

I take it that "root_role" passed your filter successfully.

Paolo

> I'm out of ideas at the moment, I'll keep chewing on this while reviewing...
>
> [*] Someone at Intel rewrote the SDM and eliminated Mode B, a.k.a. PSE 36-bit
> physical paging, it's now just part of "32-bit paging".  But 5-level paging is
> considered its own paging mode?!?!  Lame.  I guess they really want to have
> exactly four paging modes...
On Thu, Feb 10, 2022, Paolo Bonzini wrote:
> On 2/10/22 17:55, Sean Christopherson wrote:
> > > union kvm_mmu_page_role root_role;
> > > union kvm_mmu_paging_mode cpu_mode;
> >
> > I'd prefer not to use "paging mode", the SDM uses that terminology to refer to
> > the four paging modes.  My expectation given the name is that the union would
> > track only CR0.PG, EFER.LME, CR4.PAE, and CR4.PSE[*].
>
> Yeah, I had started with kvm_mmu_paging_flags, but cpu_flags was an even
> worse name than kvm_mmu_paging_mode.

We could always do s/is_guest_mode/is_nested_mode or something to that effect.
It would take some retraining, but I feel like we've been fighting the whole
"guest mode" thing over and over.

> Anyway, now that I have done _some_ replacement, it's a matter of sed -i on
> the patch files once you or someone else come up with a good moniker.
>
> I take it that "root_role" passed your filter successfully.

Yep, works for me.  I almost suggested it, too, but decided I liked mmu_role
marginally better.  I like root_role because it ties in with root_hpa and
root_pgd.
On Wed, Feb 9, 2022 at 2:31 PM Sean Christopherson <seanjc@google.com> wrote:
> On Fri, Feb 04, 2022, Paolo Bonzini wrote:
> > KVM: MMU: replace direct_map with mmu_role.direct
>
> Heresy! Everyone knows the one true way is "KVM: x86/mmu:"
>
>   $ glo | grep "KVM: MMU:" | wc -l
>   740
>   $ glo | grep "KVM: x86/mmu:" | wc -l
>   403
>
> Dammit, I'm the heathen...
>
> I do think we should use x86/mmu though. VMX and SVM (and nVMX and nSVM) are ok
> because they're unlikely to collide with other architectures, but every arch has
> an MMU...

Can you document these rules/preferences somewhere? Even better if we can
enforce them with checkpatch :)