Message ID: 20220401175554.1931568-1-dmatlack@google.com
Series: KVM: Extend Eager Page Splitting to the shadow MMU
On Fri, Apr 01, 2022, David Matlack wrote:
> This series extends KVM's Eager Page Splitting to also split huge pages
> mapped by the shadow MMU, i.e. huge pages present in the memslot rmaps.
> This will be useful for configurations that use Nested Virtualization,
> disable the TDP MMU, or disable/lack TDP hardware support.
>
> For background on Eager Page Splitting, see:
>  - Proposal: https://lore.kernel.org/kvm/CALzav=dV_U4r1K9oDq4esb4mpBQDQ2ROQ5zH5wV3KpOaZrRW-A@mail.gmail.com/
>  - TDP MMU support: https://lore.kernel.org/kvm/20220119230739.2234394-1-dmatlack@google.com/
>
> Splitting huge pages mapped by the shadow MMU is more complicated than
> the TDP MMU, but it is also more important for performance as the shadow
> MMU handles huge page write-protection faults under the write lock. See
> the Performance section for more details.
>
> The extra complexity of splitting huge pages mapped by the shadow MMU
> comes from a few places:

I think we should restrict eager page splitting to the TDP MMU being enabled,
i.e. restrict shadow MMU support to nested MMUs.

A decent chunk of the churn and complexity in this series comes from having to
deal with the intersection of things no one cares about in practice (!TDP
shadow paging), and/or things we should be putting into maintenance-only mode
(legacy MMU with TDP enabled).  I see zero reason to support this for legacy
shadow paging without a very concrete, very performance sensitive use case, and
the legacy MMU with TDP should be a hard "no".

With those out of the way, unsync support can also be jettisoned, because
barring use cases I don't know about, hypervisors don't modify TDP entries in
the same way that kernels modify native page tables, i.e. don't benefit from
allowing SPTEs to go unsync.

The other feature that I think we should deprecate (which I'm pretty sure
someone on our team, maybe even you, is planning on proposing upstream) is
support for zapping KVM shadow pages for the shrinker.  In hindsight, we should
have done that a few years ago instead of fixing the bug that made KVM's
support meaningful (see commit ebdb292dac79 ("KVM: x86/mmu: Batch zap MMU
pages when shrinking the slab")).  Doing that for nested MMUs only (or at
least first) should be less controversial.

The other thing we want to do sooner rather than later is improve the
scalability of the nested MMU.  A relatively simple way to pick some juicy low
hanging fruit, if we drop the aforementioned features we don't actually need
for nested MMUs, would be to turn all of the tracking structures needed for
handling a page fault into per-root lists/structures, e.g. active_mmu_pages
and mmu_page_hash.  Unless L1 is doing something funky, there is unlikely to
be overlap between nested TDP page tables, i.e. per-root tracking shouldn't
cause a memory explosion.

At that point, as a first step/stopgap toward a more scalable nested MMU
implementation, nested TDP page faults, zapping of obsolete pages (memslot
updates), and eager page splitting (I think) can take mmu_lock for read and
then take a per-root spinlock.

At a bare minimum, taking mmu_lock for read would prevent a nested vCPU from
blocking the TDP MMU, which in itself should be a big win.  Zapping after a
memslot update would not interfere at all with re-faulting memory since
zapping the obsolete roots would never get a lock conflict.  And for use cases
that spin up a large number of small L2 VMs, per-root locking will allow KVM
to handle page faults for each L2 in parallel, which could be a huge
performance boost for select use cases.

Circling back to eager page splitting, this series could be reworked to take
the first step of forking FNAME(page_fault), FNAME(fetch) and
kvm_mmu_get_page() in order to provide the necessary path for reworking nested
MMU page faults.  Then it can remove unsync and shrinker support for nested
MMUs.  With those gone, dissecting the nested MMU variant of
kvm_mmu_get_page() should be simpler/cleaner than dealing with the existing
kvm_mmu_get_page(), i.e. should eliminate at least some of the
complexity/churn.

> Performance
> -----------
>
> To measure the performance impact of Eager Page Splitting I ran
> dirty_log_perf_test with tdp_mmu=N, various virtual CPU counts, 1GiB per
> vCPU, and backed by 1GiB HugeTLB memory. The amount of memory that was
> written to versus read was controlled with the -f option.
>
> To measure the impact on customer performance, we can look at the time
> it takes all vCPUs to dirty memory after dirty logging has been enabled.
> Without Eager Page Splitting enabled, such dirtying must take faults to
> split huge pages and bottleneck on the MMU lock.
>
> | Config: ept=Y, tdp_mmu=N, 100% writes |
> | Config: ept=Y, tdp_mmu=N, 100% writes |
> | Config: ept=Y, tdp_mmu=N, 100% writes initially-all-set |
> | Config: ept=Y, tdp_mmu=N, 100% writes initially-all-set |
> | Config: ept=N, tdp_mmu=Y, 100% writes |
> | Config: ept=N, tdp_mmu=Y, 50% writes |
> | Config: ept=N, tdp_mmu=Y, 5% writes |

IMO, to justify this there needs to be performance numbers for ept=Y,
tdp_mmu=Y, i.e. for the use case we actually care about.  I don't expect the
outcome to be any different, but it really should be explicitly tested.
On Mon, Apr 11, 2022 at 10:12 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Apr 01, 2022, David Matlack wrote:
> > This series extends KVM's Eager Page Splitting to also split huge pages
> > mapped by the shadow MMU, i.e. huge pages present in the memslot rmaps.
> > [...]
>
> I think we should restrict eager page splitting to the TDP MMU being
> enabled, i.e. restrict shadow MMU support to nested MMUs.
>
> [...]
>
> Circling back to eager page splitting, this series could be reworked to take
> the first step of forking FNAME(page_fault), FNAME(fetch) and
> kvm_mmu_get_page() in order to provide the necessary path for reworking
> nested MMU page faults.  Then it can remove unsync and shrinker support for
> nested MMUs.  With those gone, dissecting the nested MMU variant of
> kvm_mmu_get_page() should be simpler/cleaner than dealing with the existing
> kvm_mmu_get_page(), i.e. should eliminate at least some of the
> complexity/churn.

These sound like useful improvements but I am not really seeing the
value of sequencing them before this series:

- IMO the "churn" in patches 1-14 is a net improvement to the
existing code. They improve readability by decomposing the shadow page
creation path into smaller functions with better names, reduce the
amount of redundant calculations, and reduce the dependence on struct
kvm_vcpu where it is not needed. Even if eager page splitting is
completely dropped I think they would be useful to merge.

- Patches 15-21 are necessary complexity to support eager page
splitting, but wouldn't change at all if this splitting was specific
to nested MMUs.

- Outside of patches 1-14, unsync really doesn't play a role other
than to skip splitting if sp->unsync is true. But as you pointed out
in patch 22, that check can already be dropped since SPs with a level
above 4K are never marked unsync.

I'd be fine with limiting eager page splitting to tdp_mmu=Y since
nested is the primary use-case for Google and I agree TDP with the
shadow MMU should be phased out. This would be an artificial
limitation in the short term, but I imagine it would make all those
improvements easier to make down the road.

> > Performance
> > -----------
> >
> > [...]
>
> IMO, to justify this there needs to be performance numbers for ept=Y,
> tdp_mmu=Y, i.e. for the use case we actually care about.  I don't expect
> the outcome to be any different, but it really should be explicitly tested.

That's a fair request I guess. There should be no difference in
performance from the ept=N results but it requires a lot more effort
to rig up, which is why I tested this way. I'll look into collecting
some results with nested MMUs. On the plus side, better selftests
support for nested MMUs will be useful as the various improvements you
suggested are implemented.
On Mon, Apr 11, 2022, David Matlack wrote:
> On Mon, Apr 11, 2022 at 10:12 AM Sean Christopherson <seanjc@google.com> wrote:
> > Circling back to eager page splitting, this series could be reworked to
> > take the first step of forking FNAME(page_fault), FNAME(fetch) and
> > kvm_mmu_get_page() in order to provide the necessary path for reworking
> > nested MMU page faults.
> > [...]
>
> These sound like useful improvements but I am not really seeing the
> value of sequencing them before this series:
>
> - IMO the "churn" in patches 1-14 is a net improvement to the
> existing code. They improve readability by decomposing the shadow page
> creation path into smaller functions with better names, reduce the
> amount of redundant calculations, and reduce the dependence on struct
> kvm_vcpu where it is not needed. Even if eager page splitting is
> completely dropped I think they would be useful to merge.

I definitely like some of patches 1-14, probably most after a few read
throughs.  But there are key parts that I do not like that are motivated
almost entirely by the desire to support page splitting.  Specifically, I
don't like splitting the logic of finding a page, and I don't like having a
separate alloc vs. initializer (though I'm guessing this will be needed
somewhere to split huge pages for nested MMUs).

E.g. I'd prefer the "get" flow look like the below (completely untested, for
discussion purposes only).  There's still churn, but the core loop is almost
entirely unchanged.  And it's not just this series, I don't want future
improvements to nested TDP to have to deal with the legacy baggage.

Waaaay off topic, why do we still bother with
stat.max_mmu_page_hash_collisions?  I assume it was originally added to tune
the hashing logic?  At this point is it anything but wasted cycles?

static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm_vcpu *vcpu,
						     gfn_t gfn,
						     unsigned int gfn_hash,
						     union kvm_mmu_page_role role)
{
	struct kvm *kvm = vcpu->kvm;
	struct hlist_head *sp_list = &kvm->arch.mmu_page_hash[gfn_hash];
	struct kvm_mmu_page *sp;
	LIST_HEAD(invalid_list);
	int collisions = 0;

	for_each_valid_sp(kvm, sp, sp_list) {
		if (sp->gfn != gfn) {
			collisions++;
			continue;
		}

		if (sp->role.word != role.word) {
			/*
			 * If the guest is creating an upper-level page, zap
			 * unsync pages for the same gfn.  While it's possible
			 * the guest is using recursive page tables, in all
			 * likelihood the guest has stopped using the unsync
			 * page and is installing a completely unrelated page.
			 * Unsync pages must not be left as is, because the new
			 * upper-level page will be write-protected.
			 */
			if (role.level > PG_LEVEL_4K && sp->unsync)
				kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
			continue;
		}

		/* unsync and write-flooding only apply to indirect SPs. */
		if (sp->role.direct)
			goto out;

		if (sp->unsync) {
			/*
			 * The page is good, but is stale.  kvm_sync_page does
			 * get the latest guest state, but (unlike mmu_unsync_children)
			 * it doesn't write-protect the page or mark it synchronized!
			 * This way the validity of the mapping is ensured, but the
			 * overhead of write protection is not incurred until the
			 * guest invalidates the TLB mapping.  This allows multiple
			 * SPs for a single gfn to be unsync.
			 *
			 * If the sync fails, the page is zapped.  If so, break
			 * in order to rebuild it.
			 */
			if (!kvm_sync_page(vcpu, sp, &invalid_list))
				break;

			WARN_ON(!list_empty(&invalid_list));
			kvm_flush_remote_tlbs(kvm);
		}

		__clear_sp_write_flooding_count(sp);
		goto out;
	}

	sp = NULL;

out:
	if (collisions > kvm->stat.max_mmu_page_hash_collisions)
		kvm->stat.max_mmu_page_hash_collisions = collisions;

	kvm_mmu_commit_zap_page(kvm, &invalid_list);
	return sp;
}

static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu,
						      gfn_t gfn,
						      unsigned int gfn_hash,
						      union kvm_mmu_page_role role)
{
	struct kvm_mmu_page *sp = __kvm_mmu_alloc_shadow_page(vcpu, role.direct);
	struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
	struct kvm *kvm = vcpu->kvm;
	struct hlist_head *sp_list = &kvm->arch.mmu_page_hash[gfn_hash];

	++kvm->stat.mmu_cache_miss;

	sp->gfn = gfn;
	sp->role = role;
	sp->mmu_valid_gen = kvm->arch.mmu_valid_gen;

	/*
	 * active_mmu_pages must be a FIFO list, as kvm_zap_obsolete_pages()
	 * depends on valid pages being added to the head of the list.  See
	 * comments in kvm_zap_obsolete_pages().
	 */
	list_add(&sp->link, &kvm->arch.active_mmu_pages);
	kvm_mod_used_mmu_pages(kvm, 1);

	hlist_add_head(&sp->hash_link, sp_list);

	if (!role.direct)
		account_shadowed(kvm, slot, sp);

	return sp;
}

static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
						    gfn_t gfn,
						    union kvm_mmu_page_role role)
{
	unsigned int gfn_hash = kvm_page_table_hashfn(gfn);
	struct kvm_mmu_page *sp;
	bool created = false;

	sp = kvm_mmu_find_shadow_page(vcpu, gfn, gfn_hash, role);
	if (!sp) {
		created = true;
		sp = kvm_mmu_alloc_shadow_page(vcpu, gfn, gfn_hash, role);
	}

	trace_kvm_mmu_get_page(sp, created);
	return sp;
}
On Mon, Apr 11, 2022 at 1:12 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Apr 11, 2022, David Matlack wrote:
> > These sound like useful improvements but I am not really seeing the
> > value of sequencing them before this series:
> > [...]
>
> I definitely like some of patches 1-14, probably most after a few read
> throughs.  But there are key parts that I do not like that are motivated
> almost entirely by the desire to support page splitting.  Specifically, I
> don't like splitting the logic of finding a page, and I don't like having a
> separate alloc vs. initializer (though I'm guessing this will be needed
> somewhere to split huge pages for nested MMUs).
>
> E.g. I'd prefer the "get" flow look like the below (completely untested, for
> discussion purposes only).  There's still churn, but the core loop is almost
> entirely unchanged.
>
> And it's not just this series, I don't want future improvements to nested
> TDP to have to deal with the legacy baggage.

One thing that would be helpful is if you could explain a bit more
specifically what you'd like to see. Part of the reason why I prefer
to sequence your proposal after eager page splitting is that I do not
fully understand what you're proposing, and how complex it would be.
e.g. Forking FNAME(fetch), FNAME(page_fault), and kvm_mmu_get_page()
for nested MMUs does not sound like less churn.

From my perspective, this series is a net improvement to the
readability and maintainability of existing code, while adding a
performance improvement (eager page splitting). All of the changes you
are proposing can still be implemented on top if and when they become
a priority (including merging {,__}kvm_find_shadow_page()). And if we
limit eager page splitting to nested MMUs, we don't have to worry
about maintaining eager page splitting with the TDP shadow MMU or
legacy shadow paging over time.

> Waaaay off topic, why do we still bother with
> stat.max_mmu_page_hash_collisions?  I assume it was originally added to
> tune the hashing logic?  At this point is it anything but wasted cycles?
>
> [proposed code snipped]
On Mon, Apr 11, 2022, David Matlack wrote:
> On Mon, Apr 11, 2022 at 1:12 PM Sean Christopherson <seanjc@google.com> wrote:
> > [...]
>
> One thing that would be helpful is if you could explain a bit more
> specifically what you'd like to see. Part of the reason why I prefer
> to sequence your proposal after eager page splitting is that I do not
> fully understand what you're proposing, and how complex it would be.
> e.g. Forking FNAME(fetch), FNAME(page_fault), and kvm_mmu_get_page()
> for nested MMUs does not sound like less churn.

Oh, it's most definitely not less code, and probably more churn.  But, it's
churn that pushes us in a more favorable direction and that is desirable long
term.  I don't mind churning code, but I want the churn to make future life
easier, not harder.  Details below.

> From my perspective, this series is a net improvement to the
> readability and maintainability of existing code, while adding a
> performance improvement (eager page splitting). All of the changes you
> are proposing can still be implemented on top if

They can be implemented on top, but I want to avoid inheriting complexity we
don't actually want/need, unsync support being the most notable.

What I mean by "fork" is that after the cleanups that make sense irrespective
of eager page splitting, we make a copy of FNAME(page_fault) and add
FNAME(get_shadow_page), extracting common logic where we can and probably
doing something fancy to avoid having multiple copies of
FNAME(get_shadow_page).  Looking again at the code, it's probably best to
keep FNAME(fetch), at least for now, as it's only the single unsync check
that we can purge at this point.

That gives us e.g. FNAME(nested_page_fault) that supports EPT and 64-bit NPT,
and a nested TDP specific get_shadow_page().

Then we rip out the unsync stuff for nested MMUs, which is quite clean because
we can key off of tdp_enabled.  It'll leave dead code for 32-bit hosts running
nested VMs, but I highly doubt anyone will notice the perf hit.

At that point, dissect kvm_nested_mmu_get_page() for eager page splitting and
continue on.

It's not drastically different than what you have now, but it avoids the
nastiness around unsync pages, e.g. I'm pretty sure
kvm_mmu_alloc_shadow_page() can be reused as I proposed and the "find"
becomes something like:

static struct kvm_mmu_page *kvm_mmu_nested_tdp_find_sp(struct kvm_vcpu *vcpu,
						       gfn_t gfn,
						       unsigned int gfn_hash,
						       union kvm_mmu_page_role role)
{
	struct kvm *kvm = vcpu->kvm;
	struct hlist_head *sp_list = &kvm->arch.mmu_page_hash[gfn_hash];
	struct kvm_mmu_page *sp;

	for_each_valid_sp(kvm, sp, sp_list) {
		if (sp->gfn != gfn || sp->role.word != role.word)
			continue;

		__clear_sp_write_flooding_count(sp);
		return sp;
	}

	return NULL;
}

Having the separate page fault and get_shadow_page(), without the baggage of
unsync in particular, sets us up for switching to taking mmu_lock for read,
and in the distant future, implementing whatever new scheme someone concocts
for shadowing nested TDP.
On Mon, Apr 11, 2022 at 5:39 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Apr 11, 2022, David Matlack wrote:
> > One thing that would be helpful is if you could explain a bit more
> > specifically what you'd like to see. Part of the reason why I prefer
> > to sequence your proposal after eager page splitting is that I do not
> > fully understand what you're proposing, and how complex it would be.
> > [...]
>
> Oh, it's most definitely not less code, and probably more churn.  But, it's
> churn that pushes us in a more favorable direction and that is desirable
> long term.  I don't mind churning code, but I want the churn to make future
> life easier, not harder.  Details below.

Of course. Let's make sure we're on the same page about which churn
introduced by this series will make future life harder that we hope to
avoid. If I understand you correctly, it's the following 2 changes:

(a.) Using separate functions to allocate SPs and initialize SPs.
(b.) Separating kvm_mmu_find_shadow_page() from __kvm_mmu_find_shadow_page().

(a.) stems from the fact that SP allocation during eager page
splitting is done directly rather than through kvm_mmu_memory_caches,
which was what you pushed for in the TDP MMU implementation. We could
instead use kvm_mmu_memory_caches for the shadow MMU eager page
splitting to eliminate (a.). But otherwise (a.) is necessary
complexity of eager page splitting because it needs to allocate SPs
differently from the vCPU fault path.

As for (b.), see below...

> They can be implemented on top, but I want to avoid inheriting complexity we
> don't actually want/need, unsync support being the most notable.
>
> What I mean by "fork" is that after the cleanups that make sense
> irrespective of eager page splitting, we make a copy of FNAME(page_fault)
> and add FNAME(get_shadow_page), extracting common logic where we can and
> probably doing something fancy to avoid having multiple copies of
> FNAME(get_shadow_page).
> [...]
> It's not drastically different than what you have now, but it avoids the
> nastiness around unsync pages, e.g. I'm pretty sure
> kvm_mmu_alloc_shadow_page() can be reused as I proposed and the "find"
> becomes something like:
>
> [proposed kvm_mmu_nested_tdp_find_sp() snipped]

IIUC all of this would be to avoid separating kvm_mmu_find_shadow_page()
from __kvm_mmu_find_shadow_page(), correct? i.e. Nested MMUs would have
their own "find" function, which is called by eager page splitting, and
thus no separate __kvm_mmu_find_shadow_page().

But __kvm_mmu_find_shadow_page(), as implemented in this series, is
about 90% similar to what you proposed for kvm_mmu_nested_tdp_find_sp().
And in fact it would work correctly to use __kvm_mmu_find_shadow_page()
for nested MMUs, since we know the sp->unsync condition would just be
skipped.

So even if we did everything you proposed (which seems like an awful
lot just to avoid __kvm_mmu_find_shadow_page()), there's a chance we
would still end up with the exact same code. i.e.
kvm_mmu_nested_tdp_find_sp() would be implemented by calling
__kvm_mmu_find_shadow_page(), because it would be a waste to
re-implement an almost identical function?

> Having the separate page fault and get_shadow_page(), without the baggage
> of unsync in particular, sets us up for switching to taking mmu_lock for
> read, and in the distant future, implementing whatever new scheme someone
> concocts for shadowing nested TDP.

Taking MMU read lock with per-root spinlocks for nested MMUs is a great
idea btw. I think it would be a great improvement.
On Tue, Apr 12, 2022, David Matlack wrote: > On Mon, Apr 11, 2022 at 5:39 PM Sean Christopherson <seanjc@google.com> wrote: > > > > On Mon, Apr 11, 2022, David Matlack wrote: > > > > > > One thing that would be helpful is if you can explain in a bit more > > > specifically what you'd like to see. Part of the reason why I prefer > > > to sequence your proposal after eager page splitting is that I do not > > > fully understand what you're proposing, and how complex it would be. > > > e.g. Forking FNAME(fetch), FNAME(page_fault), and kvm_mmu_get_page() > > > for nested MMUs does not sound like less churn. > > > > Oh, it's most definitely not less code, and probably more churn. But, it's churn > > that pushes us in a more favorable direction and that is desirable long term. I > > don't mind churning code, but I want the churn to make future life easier, not > > harder. Details below. > > Of course. Let's make sure we're on the same page about what churn > introduced by this series will make future life harder that we hope to > avoid. If I understand you correctly, it's the following 2 changes: > > (a.) Using separate functions to allocate SPs and initialize SPs. > (b.) Separating kvm_mmu_find_shadow_page() from __kvm_mmu_find_shadow_page(). > > (a.) stems from the fact that SP allocation during eager page > splitting is made directly rather than through kvm_mmu_memory_caches, > which was what you pushed for in the TDP MMU implementation. We could > instead use kvm_mmu_memory_caches for the shadow MMU eager page ... > So even if we did everything you proposed (which seems like an awful > lot just to avoid __kvm_mmu_find_shadow_page()), there's a chance we > would still end up with the exact same code. i.e. > kvm_mmu_nested_tdp_find_sp() would be implemented by calling > __kvm_mmu_find_shadow_page(), because it would be a waste to > re-implement an almost identical function? 
I went far enough down this path to know that my idea isn't completely awful, and wouldn't actually need to fork FNAME(page_fault) at this time, but sadly I still dislike the end result. Your assessment that we'd still end up with very similar (if not quite exact) code is spot on. Ditto for your other assertion in (a) about using the caches. My vote for this series is to go the cache route, e.g. wrap kvm_mmu_memory_caches in a struct and pass that into kvm_mmu_get_page(). I still think it was the right call to ignore the caches for the TDP MMU; it gives the TDP MMU more flexibility, and it was trivial to bypass the caches since the TDP MMU was doing its own thing anyways. But for the shadow MMU, IMO the cons outweigh the pros. E.g. in addition to ending up with two similar but subtly different "get page" flows, passing around "struct kvm_mmu_page **spp" is a bit unpleasant. Ditto for having a partially initialized kvm_mmu_page. The split code also ends up in a weird state where it uses the caches for the pte_list, but not the other allocations. There will be one wart due to unsync pages needing @vcpu, but we can pass in NULL for the split case and assert that @vcpu is non-null since all of the children should be direct.

	if (sp->unsync) {
		if (WARN_ON_ONCE(!vcpu)) {
			kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
			continue;
		}

		/*
		 * The page is good, but is stale.  kvm_sync_page does
		 * get the latest guest state, but (unlike mmu_unsync_children)
		 * it doesn't write-protect the page or mark it synchronized!
		 * This way the validity of the mapping is ensured, but the
		 * overhead of write protection is not incurred until the
		 * guest invalidates the TLB mapping.  This allows multiple
		 * SPs for a single gfn to be unsync.
		 *
		 * If the sync fails, the page is zapped.  If so, break
		 * in order to rebuild it.
		 */
		if (!kvm_sync_page(vcpu, sp, &invalid_list))
			break;

		WARN_ON(!list_empty(&invalid_list));
		kvm_flush_remote_tlbs(kvm);
	}
On Wed, Apr 13, 2022 at 01:02:51AM +0000, Sean Christopherson wrote: > On Tue, Apr 12, 2022, David Matlack wrote: > > On Mon, Apr 11, 2022 at 5:39 PM Sean Christopherson <seanjc@google.com> wrote: > > > > > > On Mon, Apr 11, 2022, David Matlack wrote: > > > > > > > > One thing that would be helpful is if you can explain in a bit more > > > > specifically what you'd like to see. Part of the reason why I prefer > > > > to sequence your proposal after eager page splitting is that I do not > > > > fully understand what you're proposing, and how complex it would be. > > > > e.g. Forking FNAME(fetch), FNAME(page_fault), and kvm_mmu_get_page() > > > > for nested MMUs does not sound like less churn. > > > > > > Oh, it's most definitely not less code, and probably more churn. But, it's churn > > > that pushes us in a more favorable direction and that is desirable long term. I > > > don't mind churning code, but I want the churn to make future life easier, not > > > harder. Details below. > > > > Of course. Let's make sure we're on the same page about what churn > > introduced by this series will make future life harder that we hope to > > avoid. If I understand you correctly, it's the following 2 changes: > > > > (a.) Using separate functions to allocate SPs and initialize SPs. > > (b.) Separating kvm_mmu_find_shadow_page() from __kvm_mmu_find_shadow_page(). > > > > (a.) stems from the fact that SP allocation during eager page > > splitting is made directly rather than through kvm_mmu_memory_caches, > > which was what you pushed for in the TDP MMU implementation. We could > > instead use kvm_mmu_memory_caches for the shadow MMU eager page > > ... > > > So even if we did everything you proposed (which seems like an awful > > lot just to avoid __kvm_mmu_find_shadow_page()), there's a chance we > > would still end up with the exact same code. i.e. 
> > kvm_mmu_nested_tdp_find_sp() would be implemented by calling > > __kvm_mmu_find_shadow_page(), because it would be a waste to > > re-implement an almost identical function? > > I went far enough down this path to know that my idea isn't completely awful, > and wouldn't actually need to fork FNAME(page_fault) at this time, but sadly I > still dislike the end result. Thanks for looking into it so quickly so we could figure out a path forward. > > Your assessment that we'd still end up with very similar (if not quite exact) > code is spot on. Ditto for your other assertion in (a) about using the caches. > > My vote for this series is to go the cache route, e.g. wrap kvm_mmu_memory_caches > in a struct and pass that into kvm_mmu_get_page(). I still think it was the right > call to ignore the caches for the TDP MMU, it gives the TDP MMU more flexibility > and it was trivial to bypass the caches since the TDP MMU was doing its own thing > anyways. > > But for the shadow MMU, IMO the cons outweigh the pros. E.g. in addition to > ending up with two similar but subtly different "get page" flows, passing around > "struct kvm_mmu_page **spp" is a bit unpleasant. Ditto for having a partially > initialized kvm_mmu_page. The split code also ends up in a weird state where it > uses the caches for the pte_list, but not the other allocations. Sounds good. I will rework the series to use kvm_mmu_memory_cache structs for the SP allocation during eager page splitting. That will eliminate the separate allocation and initialization, which will be a nice cleanup. And it will be great to get rid of the spp crud. And per your earlier feedback, I will also limit eager page splitting to nested MMUs. > > There will be one wart due to unsync pages needing @vcpu, but we can pass in NULL > for the split case and assert that @vcpu is non-null since all of the children > should be direct.
The NULL vcpu check will be a little gross, but it should never trigger in practice since eager page splitting always requests direct SPs. My preference has been to enforce that in code by splitting out __kvm_mmu_find_shadow_page(), but I can see the advantage of your proposal is that eager page splitting and faults will go through the exact same code path to get a kvm_mmu_page. > > if (sp->unsync) { > if (WARN_ON_ONCE(!vcpu)) { > kvm_mmu_prepare_zap_page(kvm, sp, > &invalid_list); > continue; > } > > /* > * The page is good, but is stale. kvm_sync_page does > * get the latest guest state, but (unlike mmu_unsync_children) > * it doesn't write-protect the page or mark it synchronized! > * This way the validity of the mapping is ensured, but the > * overhead of write protection is not incurred until the > * guest invalidates the TLB mapping. This allows multiple > * SPs for a single gfn to be unsync. > * > * If the sync fails, the page is zapped. If so, break > * in order to rebuild it. > */ > if (!kvm_sync_page(vcpu, sp, &invalid_list)) > break; > > WARN_ON(!list_empty(&invalid_list)); > kvm_flush_remote_tlbs(kvm); > }
On Wed, Apr 13, 2022, David Matlack wrote: > On Wed, Apr 13, 2022 at 01:02:51AM +0000, Sean Christopherson wrote: > > There will be one wart due to unsync pages needing @vcpu, but we can pass in NULL > > for the split case and assert that @vcpu is non-null since all of the children > > should be direct. > > The NULL vcpu check will be a little gross, Yeah, I would even call it a lot gross :-) > but it should never trigger in practice since eager page splitting always > requests direct SPs. My preference has been to enforce that in code by > splitting out It still is enforced in code, just at different points. The split version WARNs and continues after finding a page; the below WARNs and rejects _while_ finding the page. Speaking of WARNs, that reminds me... it might be worth adding a WARN in kvm_mmu_get_child_sp() to document (and detect, but more to document) that @direct should never encounter a page with unsync or unsync_children, e.g.

	union kvm_mmu_page_role role;
	struct kvm_mmu_page *sp;

	role = kvm_mmu_child_role(sptep, direct, access);
	sp = kvm_mmu_get_page(vcpu, gfn, role);

	/* Comment goes here about direct pages in shadow MMUs? */
	WARN_ON(direct && (sp->unsync || sp->unsync_children));
	return sp;

The indirect walk of FNAME(fetch)() handles unsync_children, but none of the other callers do. Obviously shouldn't happen, but especially in the huge page split case it took me a second to understand exactly why it can't happen. > __kvm_mmu_find_shadow_page(), but I can see the advantage of your > proposal is that eager page splitting and faults will go through the > exact same code path to get a kvm_mmu_page. > > > > if (sp->unsync) { > if (WARN_ON_ONCE(!vcpu)) { > kvm_mmu_prepare_zap_page(kvm, sp, > &invalid_list); > continue; > } > > /* > * The page is good, but is stale. 
kvm_sync_page does > > * get the latest guest state, but (unlike mmu_unsync_children) > > * it doesn't write-protect the page or mark it synchronized! > > * This way the validity of the mapping is ensured, but the > > * overhead of write protection is not incurred until the > > * guest invalidates the TLB mapping. This allows multiple > > * SPs for a single gfn to be unsync. > > * > > * If the sync fails, the page is zapped. If so, break > > * in order to rebuild it. > > */ > > if (!kvm_sync_page(vcpu, sp, &invalid_list)) > > break; > > > > WARN_ON(!list_empty(&invalid_list)); > > kvm_flush_remote_tlbs(kvm); > > }
On Wed, Apr 13, 2022 at 11:28 AM Sean Christopherson <seanjc@google.com> wrote: > > On Wed, Apr 13, 2022, David Matlack wrote: > > On Wed, Apr 13, 2022 at 01:02:51AM +0000, Sean Christopherson wrote: > > > There will be one wart due to unsync pages needing @vcpu, but we can pass in NULL > > > for the split case and assert that @vcpu is non-null since all of the children > > > should be direct. > > > > The NULL vcpu check will be a little gross, > > Yeah, I would even call it a lot gross :-) > > > but it should never trigger in practice since eager page splitting always > > requests direct SPs. My preference has been to enforce that in code by > > splitting out > > It still is enforced in code, just at different points. The split version WARNs > and continues after finding a page; the below WARNs and rejects _while_ finding > the page. > > Speaking of WARNs, that reminds me... it might be worth adding a WARN in > kvm_mmu_get_child_sp() to document (and detect, but more to document) that @direct > should never encounter a page with unsync or unsync_children, e.g. > > union kvm_mmu_page_role role; > struct kvm_mmu_page *sp; > > role = kvm_mmu_child_role(sptep, direct, access); > sp = kvm_mmu_get_page(vcpu, gfn, role); > > /* Comment goes here about direct pages in shadow MMUs? */ > WARN_ON(direct && (sp->unsync || sp->unsync_children)); > return sp; > > The indirect walk of FNAME(fetch)() handles unsync_children, but none of the other > callers do. Obviously shouldn't happen, but especially in the huge page split > case it took me a second to understand exactly why it can't happen. Will do. > > __kvm_mmu_find_shadow_page(), but I can see the advantage of your > > proposal is that eager page splitting and faults will go through the > > exact same code path to get a kvm_mmu_page.
> > > > > > > > if (sp->unsync) { > > > if (WARN_ON_ONCE(!vcpu)) { > > > kvm_mmu_prepare_zap_page(kvm, sp, > > > &invalid_list); > > > continue; > > > } > > > > > > /* > > > * The page is good, but is stale. kvm_sync_page does > > > * get the latest guest state, but (unlike mmu_unsync_children) > > > * it doesn't write-protect the page or mark it synchronized! > > > * This way the validity of the mapping is ensured, but the > > > * overhead of write protection is not incurred until the > > > * guest invalidates the TLB mapping. This allows multiple > > > * SPs for a single gfn to be unsync. > > > * > > > * If the sync fails, the page is zapped. If so, break > > > * in order to rebuild it. > > > */ > > > if (!kvm_sync_page(vcpu, sp, &invalid_list)) > > > break; > > > > > > WARN_ON(!list_empty(&invalid_list)); > > > kvm_flush_remote_tlbs(kvm); > > > }