Message ID | 20220330132152.4568-3-jiangshanlai@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Series | KVM: X86: Add and use shadow page with level expanded or acting as pae_root |

On Wed, Mar 30, 2022 at 9:21 PM Lai Jiangshan <jiangshanlai@gmail.com> wrote:

> @@ -1713,7 +1721,7 @@ static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, gfn_t gfn,
>
>         sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
>         sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
> -       if (!role.direct)
> +       if (role.glevel == role.level)
>                 sp->gfns = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_gfn_array_cache);
>         set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
>
> @@ -2054,6 +2062,8 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>                 quadrant &= (1 << ((PT32_PT_BITS - PT64_PT_BITS) * level)) - 1;
>                 role.quadrant = quadrant;
>         }
> +       if (level < role.glevel)
> +               role.glevel = level;

missing:

        if (direct)
                role.glevel = 0;

>
>         sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
>         for_each_valid_sp(vcpu->kvm, sp, sp_list) {
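
(For illustration only: a minimal, self-contained sketch of the role canonicalization the comment above is asking for. The union is trimmed to the fields discussed in this thread, and the helper name is made up for the sketch, not taken from the patch.)

    #include <stdbool.h>

    /* Trimmed-down stand-in for kvm_mmu_page_role with only the fields that
     * matter for this discussion. */
    union mmu_page_role {
            unsigned int word;
            struct {
                    unsigned int level:4;
                    unsigned int direct:1;
                    unsigned int glevel:4;
            };
    };

    /* Hypothetical helper: make the role canonical before it is used as the
     * hash key for shadow-page lookup.  Clamp glevel to the SP's own level
     * for indirect pages and, per the review comment, zero it for direct
     * pages, which have no corresponding guest page table. */
    static union mmu_page_role canonicalize_role(union mmu_page_role role,
                                                 unsigned int level, bool direct)
    {
            role.level = level;
            if (level < role.glevel)
                    role.glevel = level;
            if (direct)
                    role.glevel = 0;
            return role;
    }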

On Wed, Mar 30, 2022, Lai Jiangshan wrote:
> +  role.glevel:
> +    The level in guest pagetable if the sp is indirect.  Is 0 if the sp
> +    is direct without corresponding guest pagetable, like TDP or !CR0.PG.
> +    When role.level > guest paging level, indirect sp is created on the
> +    top with role.glevel = guest paging level and acks as passthrough sp

s/acks/acts

> +    and its contents are specially installed rather than the translations
> +    of the corresponding guest pagetable.
>    gfn:
>      Either the guest page table containing the translations shadowed by this
>      page, or the base page frame for linear translations. See role.direct.
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 9694dd5e6ccc..67e1bccaf472 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -314,7 +314,7 @@ struct kvm_kernel_irq_routing_entry;
>   * cr0_wp=0, therefore these three bits only give rise to 5 possibilities.
>   *
>   * Therefore, the maximum number of possible upper-level shadow pages for a
> - * single gfn is a bit less than 2^13.
> + * single gfn is a bit less than 2^15.
>   */
>  union kvm_mmu_page_role {
>  	u32 word;
> @@ -331,7 +331,8 @@ union kvm_mmu_page_role {
>  		unsigned smap_andnot_wp:1;
>  		unsigned ad_disabled:1;
>  		unsigned guest_mode:1;
> -		unsigned :6;
> +		unsigned glevel:4;

We don't need 4 bits for this.  Crossing our fingers that we never had to shadow
a 2-level guest with a 6-level host, we can do:

	unsigned passthrough_delta:2;

Where the field is ignored if direct=1, '0' for non-passthrough, and 1-3 to handle
shadow_root_level - guest_root_level.  Basically the same idea as Paolo's smushing
of direct+passthrough into mapping_level, just dressed up differently.

Side topic, we should steal a bit back from "level", or at least document that
we can steal a bit if necessary.

> +		unsigned :2;
>
>  		/*
>  		 * This is left at the top of the word so that
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 02eae110cbe1..d53037df8177 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -737,8 +737,12 @@ static void mmu_free_pte_list_desc(struct pte_list_desc *pte_list_desc)
>
>  static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index)
>  {
> -	if (!sp->role.direct)
> +	if (!sp->role.direct) {
> +		if (unlikely(sp->role.glevel < sp->role.level))

Regardless of whatever magic we end up using, there should be an is_passthrough_sp()
helper to wrap the magic.

> +			return sp->gfn;
> +
>  		return sp->gfns[index];
> +	}
>
>  	return sp->gfn + (index << ((sp->role.level - 1) * PT64_LEVEL_BITS));
>  }
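
(For illustration, a self-contained sketch of the passthrough_delta encoding together with the requested is_passthrough_sp() wrapper. The field and helper names come from the review above; the trimmed union and struct are stand-ins for KVM's real definitions, not actual kernel code.)

    #include <stdbool.h>

    /* Trimmed stand-ins for KVM's structures.  passthrough_delta is ignored
     * when direct=1, 0 means non-passthrough, and 1-3 encode
     * shadow_root_level - guest_root_level as described above. */
    union mmu_page_role {
            unsigned int word;
            struct {
                    unsigned int level:4;
                    unsigned int direct:1;
                    unsigned int passthrough_delta:2;
            };
    };

    struct mmu_page {
            union mmu_page_role role;
    };

    /* Wrap the "magic" so callers never open-code the encoding. */
    static bool is_passthrough_sp(const struct mmu_page *sp)
    {
            return !sp->role.direct && sp->role.passthrough_delta;
    }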

On Wed, Apr 13, 2022 at 5:31 AM Sean Christopherson <seanjc@google.com> wrote:
> >  union kvm_mmu_page_role {
> >  	u32 word;
> > @@ -331,7 +331,8 @@ union kvm_mmu_page_role {
> >  		unsigned smap_andnot_wp:1;
> >  		unsigned ad_disabled:1;
> >  		unsigned guest_mode:1;
> > -		unsigned :6;
> > +		unsigned glevel:4;
>
> We don't need 4 bits for this.  Crossing our fingers that we never had to shadow
> a 2-level guest with a 6-level host, we can do:
>
> 	unsigned passthrough_delta:2;

We can save the bits in the future when we need more bits so I didn't
hesitate to use 4 bits since glevel gives simple code. ^_^

I think the name passthrough_delta is more informative and glevel is
used only for comparison so passthrough_delta can also be simple.

I will apply your suggestion.

Thanks
Lai

On 4/12/22 23:31, Sean Christopherson wrote:
>> +		unsigned glevel:4;
> We don't need 4 bits for this.  Crossing our fingers that we never had to shadow
> a 2-level guest with a 6-level host, we can do:
>
> 	unsigned passthrough_delta:2;
>
> Where the field is ignored if direct=1, '0' for non-passthrough, and 1-3 to handle
> shadow_root_level - guest_root_level.  Basically the same idea as Paolo's smushing
> of direct+passthrough into mapping_level, just dressed up differently.

Basically, your passthrough_delta is level - glevel in Jiangshan's patches.
You'll need 3 bits anyway when we remove direct later (that would be
passthrough_delta == level).

Regarding the naming:

* If we keep Jiangshan's logic, I don't like the glevel name very much, any
  of mapping_level, target_level or direct_level would be clearer?

* If we go with yours, I would call the field "passthrough_levels".

Paolo

On Wed, Apr 13, 2022, Paolo Bonzini wrote:
> On 4/12/22 23:31, Sean Christopherson wrote:
> > > +		unsigned glevel:4;
> > We don't need 4 bits for this.  Crossing our fingers that we never had to shadow
> > a 2-level guest with a 6-level host, we can do:
> >
> > 	unsigned passthrough_delta:2;
> >
> > Where the field is ignored if direct=1, '0' for non-passthrough, and 1-3 to handle
> > shadow_root_level - guest_root_level.  Basically the same idea as Paolo's smushing
> > of direct+passthrough into mapping_level, just dressed up differently.
>
> Basically, your passthrough_delta is level - glevel in Jiangshan's patches.
> You'll need 3 bits anyway when we remove direct later (that would be
> passthrough_delta == level).

Are we planning on removing direct?

> Regarding the naming:
>
> * If we keep Jiangshan's logic, I don't like the glevel name very much, any
> of mapping_level, target_level or direct_level would be clearer?

I don't love any of these names, especially glevel, because the field doesn't
strictly track the guest/mapping/target/direct level.  That could obviously be
remedied by making it valid at all times, but then the role would truly need 3
bits (on top of direct) to track 5-level guest paging.

> * If we go with yours, I would call the field "passthrough_levels".

Hmm, it's not a raw level though.

On 4/13/22 16:42, Sean Christopherson wrote:
> On Wed, Apr 13, 2022, Paolo Bonzini wrote:
>> On 4/12/22 23:31, Sean Christopherson wrote:
>>> We don't need 4 bits for this.  Crossing our fingers that we never had to shadow
>>> a 2-level guest with a 6-level host, we can do:
>>>
>>> 	unsigned passthrough_delta:2;
>>>
>> Basically, your passthrough_delta is level - glevel in Jiangshan's patches.
>> You'll need 3 bits anyway when we remove direct later (that would be
>> passthrough_delta == level).
>
> Are we planning on removing direct?

I think so, it's redundant and the code almost always checks
direct||passthrough (which would be passthrough_delta > 0 with your scheme).

>> Regarding the naming:
>>
>> * If we keep Jiangshan's logic, I don't like the glevel name very much, any
>> of mapping_level, target_level or direct_level would be clearer?
> I don't love any of these names, especially glevel, because the field doesn't
> strictly track the guest/mapping/target/direct level.  That could obviously be
> remedied by making it valid at all times, but then the role would truly need 3
> bits (on top of direct) to track 5-level guest paging.

Yes, it would need 3 bits but direct can be removed.

>> * If we go with yours, I would call the field "passthrough_levels".
> Hmm, it's not a raw level though.

Hence the plural. :)

Paolo

On Wed, Apr 13, 2022, Paolo Bonzini wrote:
> On 4/13/22 16:42, Sean Christopherson wrote:
> > On Wed, Apr 13, 2022, Paolo Bonzini wrote:
> > > On 4/12/22 23:31, Sean Christopherson wrote:
> > > > We don't need 4 bits for this.  Crossing our fingers that we never had to shadow
> > > > a 2-level guest with a 6-level host, we can do:
> > > >
> > > > 	unsigned passthrough_delta:2;
> > > >
> > > Basically, your passthrough_delta is level - glevel in Jiangshan's patches.
> > > You'll need 3 bits anyway when we remove direct later (that would be
> > > passthrough_delta == level).
> >
> > Are we planning on removing direct?
>
> I think so, it's redundant and the code almost always checks
> direct||passthrough (which would be passthrough_delta > 0 with your scheme).

It's not redundant, just split out.  E.g. if 3 bits are used for the target_level,
a special value is needed to indicate "direct", otherwise KVM couldn't differentiate
between indirect and direct.  Violent agreement and all that :-)

I'm ok dropping direct and rolling it into target_level, just so long as we add
helpers, e.g. IIUC they would be

  static inline bool is_sp_direct(...)
  {
	return !sp->role.target_level;
  }

  static inline bool is_sp_direct_or_passthrough(...)
  {
	return sp->role.target_level != sp->role.level;
  }

> > > Regarding the naming:
> > > * If we keep Jiangshan's logic, I don't like the glevel name very much, any
> > > of mapping_level, target_level or direct_level would be clearer?
> > I don't love any of these names, especially glevel, because the field doesn't
> > strictly track the guest/mapping/target/direct level.  That could obviously be
> > remedied by making it valid at all times, but then the role would truly need 3
> > bits (on top of direct) to track 5-level guest paging.
>
> Yes, it would need 3 bits but direct can be removed.
>
> > > * If we go with yours, I would call the field "passthrough_levels".
> >
> > Hmm, it's not a raw level though.
>
> Hence the plural. :)

LOL, I honestly thought that was a typo.  Making it plural sounds like it's passing
through to multiple levels.

On 4/13/22 17:32, Sean Christopherson wrote:
>>> Are we planning on removing direct?
>>
>> I think so, it's redundant and the code almost always checks
>> direct||passthrough (which would be passthrough_delta > 0 with your scheme).
>
> I'm ok dropping direct and rolling it into target_level, just so long as we add
> helpers, e.g. IIUC they would be
>
>   static inline bool is_sp_direct(...)
>   {
> 	return !sp->role.target_level;
>   }
>
>   static inline bool is_sp_direct_or_passthrough(...)
>   {
> 	return sp->role.target_level != sp->role.level;
>   }

Yes of course.  Or respectively:

	return sp->role.passthrough_levels == s->role.level;

	return sp->role.passthrough_levels > 0;

I'm not sure about a more concise name for the latter.  Maybe
sp_has_gpt(...) but I haven't thought it through very much.

>>> Hmm, it's not a raw level though.
>>
>> Hence the plural. :)
>
> LOL, I honestly thought that was a typo.  Making it plural sounds like it's passing
> through to multiple levels.

I meant it as number of levels being passed through.  I'll leave that to
Jiangshan, either target_level or passthrough_levels will do for me.

Paolo

On Wed, Apr 13, 2022, Paolo Bonzini wrote:
> On 4/13/22 17:32, Sean Christopherson wrote:
> > > > Are we planning on removing direct?
> > >
> > > I think so, it's redundant and the code almost always checks
> > > direct||passthrough (which would be passthrough_delta > 0 with your scheme).
> >
> > I'm ok dropping direct and rolling it into target_level, just so long as we add
> > helpers, e.g. IIUC they would be
> >
> >   static inline bool is_sp_direct(...)
> >   {
> > 	return !sp->role.target_level;
> >   }
> >
> >   static inline bool is_sp_direct_or_passthrough(...)
> >   {
> > 	return sp->role.target_level != sp->role.level;
> >   }
>
> Yes of course.  Or respectively:
>
> 	return sp->role.passthrough_levels == s->role.level;
>
> 	return sp->role.passthrough_levels > 0;
>
> I'm not sure about a more concise name for the latter.  Maybe
> sp_has_gpt(...) but I haven't thought it through very much.
>
> > > > Hmm, it's not a raw level though.
> > >
> > > Hence the plural. :)
> >
> > LOL, I honestly thought that was a typo.  Making it plural sounds like it's passing
> > through to multiple levels.
>
> I meant it as number of levels being passed through.  I'll leave that to
> Jiangshan, either target_level or passthrough_levels will do for me.

It took me until like 9pm last night to finally understand what you meant by
"passthrough level".  Now that I actually have my head wrapped around this...

Stepping back, "glevel" and any of its derivations is actually just a combination
of CR0.PG, CR4.PAE, EFER.LMA, and CR4.LA57.  And "has_4_byte_gpte" is CR0.PG && !CR4.PAE.
When running with !tdp_enabled, CR0.PG is tracked by "direct".  And with TDP enabled,
CR0.PG is either a don't care (L1 or nested EPT), or is guaranteed to be '1' (nested NPT).

So, rather than add yet more synthetic information to the role, what about using
the info we already have?  I don't think it changes the number of bits that need to
be stored, but I think the result would be easier for people to understand, at
least superficially, e.g. "oh, the mode matters, got it".  We'd need a beefy comment
to explain the whole "passthrough levels" thing, but I think the code would be
more approachable for most people.

If we move efer_lma and cr4_la57 from kvm_mmu_extended_role to kvm_mmu_page_role,
and rename has_4_byte_gpte to cr4_pae, then we don't need passthrough_levels.
If needed for performance, we could still have a "passthrough" bit, but otherwise
detecting a passthrough SP would be

  static inline bool is_passthrough_sp(struct kvm_mmu_page *sp)
  {
	return !sp->role.direct && sp->role.level > role_to_root_level(sp->role);
  }

where role_to_root_level() is extracted from kvm_calc_cpu_role() in Paolo's series.

Or, if we wanted to optimize "is passthrough", then cr4_la57 could be left in the
extended role, because passthrough is guaranteed to be '0' if CR4.LA57=1.  That
would prevent reusing shadow pages between 64-bit paging and PAE paging, but in
practice no sane guest is going to reuse page tables between those modes, so who
cares.
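
(To make the "use the mode bits we already have" idea concrete, here is a sketch of the role_to_root_level() mapping under the assumption that cr4_pae, efer_lma and cr4_la57 all live in the base role, which is exactly the change being proposed above. The union is a stand-in for KVM's real role; the level numbers follow the standard x86 paging depths.)

    #include <stdbool.h>

    /* Hypothetical role layout with the paging-mode bits folded in. */
    union mmu_page_role {
            unsigned int word;
            struct {
                    unsigned int level:4;
                    unsigned int direct:1;
                    unsigned int cr4_pae:1;
                    unsigned int efer_lma:1;
                    unsigned int cr4_la57:1;
            };
    };

    /* Map the paging mode tracked in the role to the guest's root level:
     * 2 = 32-bit (4-byte GPTEs), 3 = PAE, 4 = 4-level, 5 = 5-level.
     * Callers are expected to have filtered out direct SPs already. */
    static int role_to_root_level(union mmu_page_role role)
    {
            if (!role.cr4_pae)
                    return 2;
            if (!role.efer_lma)
                    return 3;
            return role.cr4_la57 ? 5 : 4;
    }

    /* Passthrough detection as proposed above: an indirect SP whose level
     * sits above the guest's own root level. */
    static bool is_passthrough_sp(union mmu_page_role role)
    {
            return !role.direct && role.level > role_to_root_level(role);
    }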

On Thu, Apr 14, 2022 at 11:51 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, Apr 13, 2022, Paolo Bonzini wrote:
> > On 4/13/22 17:32, Sean Christopherson wrote:
> > > > > Are we planning on removing direct?
> > > >
> > > > I think so, it's redundant and the code almost always checks
> > > > direct||passthrough (which would be passthrough_delta > 0 with your scheme).
> > >
> > > I'm ok dropping direct and rolling it into target_level, just so long as we add
> > > helpers, e.g. IIUC they would be
> > >
> > >   static inline bool is_sp_direct(...)
> > >   {
> > > 	return !sp->role.target_level;
> > >   }
> > >
> > >   static inline bool is_sp_direct_or_passthrough(...)
> > >   {
> > > 	return sp->role.target_level != sp->role.level;
> > >   }
> >
> > Yes of course.  Or respectively:
> >
> > 	return sp->role.passthrough_levels == s->role.level;
> >
> > 	return sp->role.passthrough_levels > 0;
> >
> > I'm not sure about a more concise name for the latter.  Maybe
> > sp_has_gpt(...) but I haven't thought it through very much.
> >
> > > > > Hmm, it's not a raw level though.
> > > >
> > > > Hence the plural. :)
> > >
> > > LOL, I honestly thought that was a typo.  Making it plural sounds like it's passing
> > > through to multiple levels.
> >
> > I meant it as number of levels being passed through.  I'll leave that to
> > Jiangshan, either target_level or passthrough_levels will do for me.
>
> It took me until like 9pm last night to finally understand what you meant by
> "passthrough level".  Now that I actually have my head wrapped around this...
>
> Stepping back, "glevel" and any of its derivations is actually just a combination
> of CR0.PG, CR4.PAE, EFER.LMA, and CR4.LA57.  And "has_4_byte_gpte" is CR0.PG && !CR4.PAE.
> When running with !tdp_enabled, CR0.PG is tracked by "direct".  And with TDP enabled,
> CR0.PG is either a don't care (L1 or nested EPT), or is guaranteed to be '1' (nested NPT).

glevel in the patchset is the page level of the corresponding page table in
the guest, not the level of the guest *root* page table.  role.glevel is
initialized as the guest root level, and changed to the level of the guest
page in kvm_mmu_get_page().

role.glevel is a bad name and is not sufficient to handle other purposes like
role.pae_root, role.guest_pae_root.  role.root_level is much better.
role.root_level is a combination of CR0.PG, CR4.PAE, EFER.LMA, and CR4.LA57
as you said.

>
> So, rather than add yet more synthetic information to the role, what about using
> the info we already have?  I don't think it changes the number of bits that need to
> be stored, but I think the result would be easier for people to understand, at
> least superficially, e.g. "oh, the mode matters, got it".  We'd need a beefy comment
> to explain the whole "passthrough levels" thing, but I think the code would be
> more approachable for most people.
>
> If we move efer_lma and cr4_la57 from kvm_mmu_extended_role to kvm_mmu_page_role,
> and rename has_4_byte_gpte to cr4_pae, then we don't need passthrough_levels.
> If needed for performance, we could still have a "passthrough" bit, but otherwise
> detecting a passthrough SP would be
>
>   static inline bool is_passthrough_sp(struct kvm_mmu_page *sp)
>   {
> 	return !sp->role.direct && sp->role.level > role_to_root_level(sp->role);
>   }
>
> where role_to_root_level() is extracted from kvm_calc_cpu_role() in Paolo's series.

I'm going to add

  static inline bool sp_has_gpte(struct kvm_mmu_page *sp)
  {
	return !sp->role.direct && (
		/* passthrough */
		sp->role.level > role_to_root_level(sp->role) ||
		/* guest pae root */
		(sp->role.level == 3 && role_to_root_level(sp->role) == 3)
		);
  }

And rename for_each_gfn_indirect_valid_sp() to for_each_gfn_valid_sp_has_gpte(),
which uses sp_has_gpte() instead.

I'm not objecting to using efer_lma and cr4_la57.  But I think role.root_level
is more convenient than role_to_root_level(sp->role).

cr4_pae, efer_lma and cr4_la57 are more natural than has_4_byte_gpte and
root_level.

diff --git a/Documentation/virt/kvm/mmu.rst b/Documentation/virt/kvm/mmu.rst
index 5b1ebad24c77..dee0e96d694a 100644
--- a/Documentation/virt/kvm/mmu.rst
+++ b/Documentation/virt/kvm/mmu.rst
@@ -202,6 +202,13 @@ Shadow pages contain the following information:
     Is 1 if the MMU instance cannot use A/D bits.  EPT did not have A/D
     bits before Haswell; shadow EPT page tables also cannot use A/D bits
     if the L1 hypervisor does not enable them.
+  role.glevel:
+    The level in guest pagetable if the sp is indirect.  Is 0 if the sp
+    is direct without corresponding guest pagetable, like TDP or !CR0.PG.
+    When role.level > guest paging level, indirect sp is created on the
+    top with role.glevel = guest paging level and acks as passthrough sp
+    and its contents are specially installed rather than the translations
+    of the corresponding guest pagetable.
   gfn:
     Either the guest page table containing the translations shadowed by this
     page, or the base page frame for linear translations. See role.direct.
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 9694dd5e6ccc..67e1bccaf472 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -314,7 +314,7 @@ struct kvm_kernel_irq_routing_entry;
  * cr0_wp=0, therefore these three bits only give rise to 5 possibilities.
  *
  * Therefore, the maximum number of possible upper-level shadow pages for a
- * single gfn is a bit less than 2^13.
+ * single gfn is a bit less than 2^15.
  */
 union kvm_mmu_page_role {
 	u32 word;
@@ -331,7 +331,8 @@ union kvm_mmu_page_role {
 		unsigned smap_andnot_wp:1;
 		unsigned ad_disabled:1;
 		unsigned guest_mode:1;
-		unsigned :6;
+		unsigned glevel:4;
+		unsigned :2;
 
 		/*
 		 * This is left at the top of the word so that
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 02eae110cbe1..d53037df8177 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -737,8 +737,12 @@ static void mmu_free_pte_list_desc(struct pte_list_desc *pte_list_desc)
 
 static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index)
 {
-	if (!sp->role.direct)
+	if (!sp->role.direct) {
+		if (unlikely(sp->role.glevel < sp->role.level))
+			return sp->gfn;
+
 		return sp->gfns[index];
+	}
 
 	return sp->gfn + (index << ((sp->role.level - 1) * PT64_LEVEL_BITS));
 }
@@ -746,6 +750,11 @@ static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index)
 static void kvm_mmu_page_set_gfn(struct kvm_mmu_page *sp, int index, gfn_t gfn)
 {
 	if (!sp->role.direct) {
+		if (unlikely(sp->role.glevel < sp->role.level)) {
+			WARN_ON_ONCE(gfn != sp->gfn);
+			return;
+		}
+
 		sp->gfns[index] = gfn;
 		return;
 	}
@@ -1674,8 +1683,7 @@ static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
 	hlist_del(&sp->hash_link);
 	list_del(&sp->link);
 	free_page((unsigned long)sp->spt);
-	if (!sp->role.direct)
-		free_page((unsigned long)sp->gfns);
+	free_page((unsigned long)sp->gfns);
 	kmem_cache_free(mmu_page_header_cache, sp);
 }
 
@@ -1713,7 +1721,7 @@ static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, gfn_t gfn,
 
 	sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
 	sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
-	if (!role.direct)
+	if (role.glevel == role.level)
 		sp->gfns = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_gfn_array_cache);
 	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
 
@@ -2054,6 +2062,8 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 		quadrant &= (1 << ((PT32_PT_BITS - PT64_PT_BITS) * level)) - 1;
 		role.quadrant = quadrant;
 	}
+	if (level < role.glevel)
+		role.glevel = level;
 
 	sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
 	for_each_valid_sp(vcpu->kvm, sp, sp_list) {
@@ -4817,6 +4827,7 @@ kvm_calc_shadow_root_page_role_common(struct kvm_vcpu *vcpu,
 	role.base.smep_andnot_wp = role.ext.cr4_smep && !____is_cr0_wp(regs);
 	role.base.smap_andnot_wp = role.ext.cr4_smap && !____is_cr0_wp(regs);
 	role.base.has_4_byte_gpte = ____is_cr0_pg(regs) && !____is_cr4_pae(regs);
+	role.base.glevel = role_regs_to_root_level(regs);
 
 	return role;
 }
@@ -5312,6 +5323,8 @@ static void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 
 	++vcpu->kvm->stat.mmu_pte_write;
 	for_each_gfn_indirect_valid_sp(vcpu->kvm, sp, gfn) {
+		if (sp->role.glevel < sp->role.level)
+			continue;
 		if (detect_write_misaligned(sp, gpa, bytes) ||
 		      detect_write_flooding(sp)) {
 			kvm_mmu_prepare_zap_page(vcpu->kvm, sp, &invalid_list);
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 8621188b46df..67489a060eba 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -1042,6 +1042,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 		.level = 0xf,
 		.access = 0x7,
 		.quadrant = 0x3,
+		.glevel = 0xf,
 	};
 
 	/*