| Message ID | 20230526234435.662652-5-yuzhao@google.com |
|---|---|
| State | Handled Elsewhere |
| Series | mm/kvm: locklessly clear the accessed bit |
Yu,

On Fri, May 26, 2023 at 05:44:29PM -0600, Yu Zhao wrote:
> Stage2 page tables are currently not RCU safe against unmapping or VM
> destruction. The previous mmu_notifier_ops members rely on
> kvm->mmu_lock to synchronize with those operations.
>
> However, the new mmu_notifier_ops member test_clear_young() provides
> a fast path that does not take kvm->mmu_lock. To implement
> kvm_arch_test_clear_young() for that path, unmapped page tables need
> to be freed by RCU and kvm_free_stage2_pgd() needs to be after
> mmu_notifier_unregister().
>
> Remapping, specifically stage2_free_removed_table(), is already RCU
> safe.
>
> Signed-off-by: Yu Zhao <yuzhao@google.com>
> ---
>  arch/arm64/include/asm/kvm_pgtable.h |  2 ++
>  arch/arm64/kvm/arm.c                 |  1 +
>  arch/arm64/kvm/hyp/pgtable.c         |  8 ++++++--
>  arch/arm64/kvm/mmu.c                 | 17 ++++++++++++++++-
>  4 files changed, 25 insertions(+), 3 deletions(-)
>
> diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> index ff520598b62c..5cab52e3a35f 100644
> --- a/arch/arm64/include/asm/kvm_pgtable.h
> +++ b/arch/arm64/include/asm/kvm_pgtable.h
> @@ -153,6 +153,7 @@ static inline bool kvm_level_supports_block_mapping(u32 level)
>   * @put_page:			Decrement the refcount on a page. When the
>   *				refcount reaches 0 the page is automatically
>   *				freed.
> + * @put_page_rcu:		RCU variant of the above.

You don't need to add yet another hook to implement this. I was working
on lock-free walks in a separate context and arrived at the following:

commit f82d264a37745e07ee28e116c336f139f681fd7f
Author: Oliver Upton <oliver.upton@linux.dev>
Date:   Mon May 1 08:53:37 2023 +0000

    KVM: arm64: Consistently use free_removed_table() for stage-2

    free_removed_table() is essential to the RCU-protected parallel walking
    scheme, as behind the scenes the cleanup is deferred until an RCU grace
    period. Nonetheless, the stage-2 unmap path calls put_page() directly,
    which leads to table memory being freed inline with the table walk.

    This is safe for the time being, as the stage-2 unmap walker is called
    while holding the write lock. A future change to KVM will further relax
    the locking mechanics around the stage-2 page tables to allow lock-free
    walkers protected only by RCU. As such, switch to the RCU-safe mechanism
    for freeing table memory.

    Signed-off-by: Oliver Upton <oliver.upton@linux.dev>

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 3d61bd3e591d..bfbebdcb4ef0 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -1019,7 +1019,7 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
 					       kvm_granule_size(ctx->level));
 
 	if (childp)
-		mm_ops->put_page(childp);
+		mm_ops->free_removed_table(childp, ctx->level);
 
 	return 0;
 }
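For readers following along: the deferred free that Oliver's commit message relies on is the standard kernel call_rcu() pattern. Below is a minimal, illustrative sketch of that updater side; the struct and helper names are hypothetical, not KVM's actual code.

```c
#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/types.h>

/*
 * Illustrative only: hypothetical names, not KVM's actual helpers.
 * A removed table is unlinked immediately, but its memory is only
 * reclaimed after a grace period, so lock-free walkers that raced
 * with the removal can keep reading it safely until then.
 */
struct s2_table {
	struct rcu_head rcu;
	u64 ptes[512];
};

static void s2_table_free_rcu(struct rcu_head *head)
{
	/* Runs only after all pre-existing RCU readers have finished. */
	kfree(container_of(head, struct s2_table, rcu));
}

static void s2_table_remove(struct s2_table *table)
{
	/* Caller has already unlinked @table from the page-table tree. */
	call_rcu(&table->rcu, s2_table_free_rcu);
}
```

Once the grace period expires, no walker that observed the old table pointer can still be inside its read-side critical section, which is what makes the free safe without taking the MMU lock.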
On Sat, May 27, 2023 at 12:08 PM Oliver Upton <oliver.upton@linux.dev> wrote:
>
> Yu,
>
> On Fri, May 26, 2023 at 05:44:29PM -0600, Yu Zhao wrote:
> > Stage2 page tables are currently not RCU safe against unmapping or VM
> > destruction. The previous mmu_notifier_ops members rely on
> > kvm->mmu_lock to synchronize with those operations.
> >
> > However, the new mmu_notifier_ops member test_clear_young() provides
> > a fast path that does not take kvm->mmu_lock. To implement
> > kvm_arch_test_clear_young() for that path, unmapped page tables need
> > to be freed by RCU and kvm_free_stage2_pgd() needs to be after
> > mmu_notifier_unregister().
> >
> > Remapping, specifically stage2_free_removed_table(), is already RCU
> > safe.
> >
> > Signed-off-by: Yu Zhao <yuzhao@google.com>
> > ---
> >  arch/arm64/include/asm/kvm_pgtable.h |  2 ++
> >  arch/arm64/kvm/arm.c                 |  1 +
> >  arch/arm64/kvm/hyp/pgtable.c         |  8 ++++++--
> >  arch/arm64/kvm/mmu.c                 | 17 ++++++++++++++++-
> >  4 files changed, 25 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> > index ff520598b62c..5cab52e3a35f 100644
> > --- a/arch/arm64/include/asm/kvm_pgtable.h
> > +++ b/arch/arm64/include/asm/kvm_pgtable.h
> > @@ -153,6 +153,7 @@ static inline bool kvm_level_supports_block_mapping(u32 level)
> >   * @put_page:			Decrement the refcount on a page. When the
> >   *				refcount reaches 0 the page is automatically
> >   *				freed.
> > + * @put_page_rcu:		RCU variant of the above.
>
> You don't need to add yet another hook to implement this. I was working
> on lock-free walks in a separate context and arrived at the following:
>
> commit f82d264a37745e07ee28e116c336f139f681fd7f
> Author: Oliver Upton <oliver.upton@linux.dev>
> Date:   Mon May 1 08:53:37 2023 +0000
>
>     KVM: arm64: Consistently use free_removed_table() for stage-2
>
>     free_removed_table() is essential to the RCU-protected parallel walking
>     scheme, as behind the scenes the cleanup is deferred until an RCU grace
>     period. Nonetheless, the stage-2 unmap path calls put_page() directly,
>     which leads to table memory being freed inline with the table walk.
>
>     This is safe for the time being, as the stage-2 unmap walker is called
>     while holding the write lock. A future change to KVM will further relax
>     the locking mechanics around the stage-2 page tables to allow lock-free
>     walkers protected only by RCU. As such, switch to the RCU-safe mechanism
>     for freeing table memory.
>
>     Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
>
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index 3d61bd3e591d..bfbebdcb4ef0 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -1019,7 +1019,7 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
>  					       kvm_granule_size(ctx->level));
>
>  	if (childp)
> -		mm_ops->put_page(childp);
> +		mm_ops->free_removed_table(childp, ctx->level);

Thanks, Oliver.

A couple of things I haven't had the chance to verify -- I'm hoping
you could help clarify:

1. For unmapping, with free_removed_table(), wouldn't we have to look
into the table we know it's empty unnecessarily?

2. For remapping and unmapping, how does free_removed_table() put the
final refcnt on the table passed in? (Previously we had
put_page(childp) in stage2_map_walk_table_post(). So I'm assuming we'd
have to do something equivalent with free_removed_table().)
Hi Yu,

On Sat, May 27, 2023 at 02:13:07PM -0600, Yu Zhao wrote:
> On Sat, May 27, 2023 at 12:08 PM Oliver Upton <oliver.upton@linux.dev> wrote:
> > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > index 3d61bd3e591d..bfbebdcb4ef0 100644
> > --- a/arch/arm64/kvm/hyp/pgtable.c
> > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > @@ -1019,7 +1019,7 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
> >  					       kvm_granule_size(ctx->level));
> >
> >  	if (childp)
> > -		mm_ops->put_page(childp);
> > +		mm_ops->free_removed_table(childp, ctx->level);
>
> Thanks, Oliver.
>
> A couple of things I haven't had the chance to verify -- I'm hoping
> you could help clarify:
>
> 1. For unmapping, with free_removed_table(), wouldn't we have to look
> into the table we know it's empty unnecessarily?

As it is currently implemented, yes. But, there's potential to fast-path
the implementation by checking page_count() before starting the walk.

> 2. For remapping and unmapping, how does free_removed_table() put the
> final refcnt on the table passed in? (Previously we had
> put_page(childp) in stage2_map_walk_table_post(). So I'm assuming we'd
> have to do something equivalent with free_removed_table().)

Heh, that's a bug, and an embarrassing one at that!

Sent out a fix for that, since it would appear we leak memory on
table->block transitions. PTAL if you have a chance.

https://lore.kernel.org/all/20230530193213.1663411-1-oliver.upton@linux.dev/
On Tue, May 30, 2023 at 1:37 PM Oliver Upton <oliver.upton@linux.dev> wrote:
>
> Hi Yu,
>
> On Sat, May 27, 2023 at 02:13:07PM -0600, Yu Zhao wrote:
> > On Sat, May 27, 2023 at 12:08 PM Oliver Upton <oliver.upton@linux.dev> wrote:
> > > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > > index 3d61bd3e591d..bfbebdcb4ef0 100644
> > > --- a/arch/arm64/kvm/hyp/pgtable.c
> > > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > > @@ -1019,7 +1019,7 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
> > >  					       kvm_granule_size(ctx->level));
> > >
> > >  	if (childp)
> > > -		mm_ops->put_page(childp);
> > > +		mm_ops->free_removed_table(childp, ctx->level);
> >
> > Thanks, Oliver.
> >
> > A couple of things I haven't had the chance to verify -- I'm hoping
> > you could help clarify:
> >
> > 1. For unmapping, with free_removed_table(), wouldn't we have to look
> > into the table we know it's empty unnecessarily?
>
> As it is currently implemented, yes. But, there's potential to fast-path
> the implementation by checking page_count() before starting the walk.

Do you mind posting another patch? I'd be happy to ack it, as well as
the one you suggested above.

> > 2. For remapping and unmapping, how does free_removed_table() put the
> > final refcnt on the table passed in? (Previously we had
> > put_page(childp) in stage2_map_walk_table_post(). So I'm assuming we'd
> > have to do something equivalent with free_removed_table().)
>
> Heh, that's a bug, and an embarrassing one at that!
>
> Sent out a fix for that, since it would appear we leak memory on
> table->block transitions. PTAL if you have a chance.
>
> https://lore.kernel.org/all/20230530193213.1663411-1-oliver.upton@linux.dev/

Awesome.
On Wed, May 31, 2023 at 1:28 PM Oliver Upton <oliver.upton@linux.dev> wrote:
>
> On Tue, May 30, 2023 at 02:06:55PM -0600, Yu Zhao wrote:
> > On Tue, May 30, 2023 at 1:37 PM Oliver Upton <oliver.upton@linux.dev> wrote:
> > >
> > > Hi Yu,
> > >
> > > On Sat, May 27, 2023 at 02:13:07PM -0600, Yu Zhao wrote:
> > > > On Sat, May 27, 2023 at 12:08 PM Oliver Upton <oliver.upton@linux.dev> wrote:
> > > > > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > > > > index 3d61bd3e591d..bfbebdcb4ef0 100644
> > > > > --- a/arch/arm64/kvm/hyp/pgtable.c
> > > > > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > > > > @@ -1019,7 +1019,7 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
> > > > >  					       kvm_granule_size(ctx->level));
> > > > >
> > > > >  	if (childp)
> > > > > -		mm_ops->put_page(childp);
> > > > > +		mm_ops->free_removed_table(childp, ctx->level);
> > > >
> > > > Thanks, Oliver.
> > > >
> > > > A couple of things I haven't had the chance to verify -- I'm hoping
> > > > you could help clarify:
> > > >
> > > > 1. For unmapping, with free_removed_table(), wouldn't we have to look
> > > > into the table we know it's empty unnecessarily?
> > >
> > > As it is currently implemented, yes. But, there's potential to fast-path
> > > the implementation by checking page_count() before starting the walk.
> >
> > Do you mind posting another patch? I'd be happy to ack it, as well as
> > the one you suggested above.
>
> I'd rather not take such a patch independent of the test_clear_young
> series if you're OK with that. Do you mind implementing something
> similar to the above patch w/ the proposed optimization if you need it?

No worries. I can take the above together with the following, which
would form a new series with its own merits, since apparently you
think the !AF case is important.

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 26a8d955b49c..6ce73ce9f146 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1453,10 +1453,10 @@ static void handle_access_fault(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa)
 
 	trace_kvm_access_fault(fault_ipa);
 
-	read_lock(&vcpu->kvm->mmu_lock);
+	rcu_read_lock();
 	mmu = vcpu->arch.hw_mmu;
 	pte = kvm_pgtable_stage2_mkyoung(mmu->pgt, fault_ipa);
-	read_unlock(&vcpu->kvm->mmu_lock);
+	rcu_read_unlock();
 
 	if (kvm_pte_valid(pte))
 		kvm_set_pfn_accessed(kvm_pte_to_pfn(pte));
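The swap to rcu_read_lock() above leans on the reader/updater contract sketched earlier: table memory unlinked by a concurrent unmap cannot be freed until every reader already inside its critical section has left it. As a rough reader-side sketch, much simpler than the real pgtable walker and using hypothetical names:

```c
#include <linux/rcupdate.h>
#include <linux/types.h>

/* Illustrative only: hypothetical names, not KVM's actual walker. */
struct s2_table {
	u64 ptes[512];
};

static struct s2_table __rcu *root_table;

static u64 lockless_read_pte(unsigned int idx)
{
	struct s2_table *table;
	u64 pte = 0;

	rcu_read_lock();
	/*
	 * A concurrent unmap may unlink the table at any point, but
	 * its memory cannot be freed before rcu_read_unlock().
	 */
	table = rcu_dereference(root_table);
	if (table)
		pte = READ_ONCE(table->ptes[idx]);
	rcu_read_unlock();

	return pte;
}
```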
On Wed, May 31, 2023 at 05:10:52PM -0600, Yu Zhao wrote:
> On Wed, May 31, 2023 at 1:28 PM Oliver Upton <oliver.upton@linux.dev> wrote:
> > On Tue, May 30, 2023 at 02:06:55PM -0600, Yu Zhao wrote:
> > > On Tue, May 30, 2023 at 1:37 PM Oliver Upton <oliver.upton@linux.dev> wrote:
> > > > As it is currently implemented, yes. But, there's potential to fast-path
> > > > the implementation by checking page_count() before starting the walk.
> > >
> > > Do you mind posting another patch? I'd be happy to ack it, as well as
> > > the one you suggested above.
> >
> > I'd rather not take such a patch independent of the test_clear_young
> > series if you're OK with that. Do you mind implementing something
> > similar to the above patch w/ the proposed optimization if you need it?
>
> No worries. I can take the above together with the following, which
> would form a new series with its own merits, since apparently you
> think the !AF case is important.

Sorry if my suggestion was unclear.

I thought we were talking about ->free_removed_table() being called from
the stage-2 unmap path, in which case we wind up unnecessarily visiting
PTEs on a table known to be empty. You could fast-path that by only
initiating a walk if page_count() > 1:

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 95dae02ccc2e..766563dc465c 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -1331,7 +1331,8 @@ void kvm_pgtable_stage2_free_removed(struct kvm_pgtable_mm_ops *mm_ops, void *pg
 		.end	= kvm_granule_size(level),
 	};
 
-	WARN_ON(__kvm_pgtable_walk(&data, mm_ops, ptep, level + 1));
+	if (mm_ops->page_count(pgtable) > 1)
+		WARN_ON(__kvm_pgtable_walk(&data, mm_ops, ptep, level + 1));
 
 	WARN_ON(mm_ops->page_count(pgtable) != 1);
 	mm_ops->put_page(pgtable);

A lock-free access fault walker is interesting, but in my testing it hasn't
led to any significant improvements over acquiring the MMU lock for
read. Because of that I hadn't bothered with posting the series upstream.
On Wed, May 31, 2023 at 5:23 PM Oliver Upton <oliver.upton@linux.dev> wrote:
>
> On Wed, May 31, 2023 at 05:10:52PM -0600, Yu Zhao wrote:
> > On Wed, May 31, 2023 at 1:28 PM Oliver Upton <oliver.upton@linux.dev> wrote:
> > > On Tue, May 30, 2023 at 02:06:55PM -0600, Yu Zhao wrote:
> > > > On Tue, May 30, 2023 at 1:37 PM Oliver Upton <oliver.upton@linux.dev> wrote:
> > > > > As it is currently implemented, yes. But, there's potential to fast-path
> > > > > the implementation by checking page_count() before starting the walk.
> > > >
> > > > Do you mind posting another patch? I'd be happy to ack it, as well as
> > > > the one you suggested above.
> > >
> > > I'd rather not take such a patch independent of the test_clear_young
> > > series if you're OK with that. Do you mind implementing something
> > > similar to the above patch w/ the proposed optimization if you need it?
> >
> > No worries. I can take the above together with the following, which
> > would form a new series with its own merits, since apparently you
> > think the !AF case is important.
>
> Sorry if my suggestion was unclear.
>
> I thought we were talking about ->free_removed_table() being called from
> the stage-2 unmap path

Yes, we were, or in general, about how to make KVM PTs RCU safe for ARM.

So I'm thinking about taking 1) your patch above, 2) what I just
suggested and 3) what you suggested below to form a mini series, which
could land independently and would make my job here easier.

> in which case we wind up unnecessarily visiting
> PTEs on a table known to be empty. You could fast-path that by only
> initiating a walk if page_count() > 1:

Yes, this is what I meant.

> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index 95dae02ccc2e..766563dc465c 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -1331,7 +1331,8 @@ void kvm_pgtable_stage2_free_removed(struct kvm_pgtable_mm_ops *mm_ops, void *pg
>  		.end	= kvm_granule_size(level),
>  	};
>
> -	WARN_ON(__kvm_pgtable_walk(&data, mm_ops, ptep, level + 1));
> +	if (mm_ops->page_count(pgtable) > 1)
> +		WARN_ON(__kvm_pgtable_walk(&data, mm_ops, ptep, level + 1));
>
>  	WARN_ON(mm_ops->page_count(pgtable) != 1);
>  	mm_ops->put_page(pgtable);
>
> A lock-free access fault walker is interesting, but in my testing it hasn't
> led to any significant improvements over acquiring the MMU lock for
> read. Because of that I hadn't bothered with posting the series upstream.

It's hard to measure but we have perf benchmarks on ChromeOS which
should help.
diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index ff520598b62c..5cab52e3a35f 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -153,6 +153,7 @@ static inline bool kvm_level_supports_block_mapping(u32 level)
  * @put_page:			Decrement the refcount on a page. When the
  *				refcount reaches 0 the page is automatically
  *				freed.
+ * @put_page_rcu:		RCU variant of the above.
  * @page_count:			Return the refcount of a page.
  * @phys_to_virt:		Convert a physical address into a virtual
  *				address mapped in the current context.
@@ -170,6 +171,7 @@ struct kvm_pgtable_mm_ops {
 	void		(*free_removed_table)(void *addr, u32 level);
 	void		(*get_page)(void *addr);
 	void		(*put_page)(void *addr);
+	void		(*put_page_rcu)(void *addr);
 	int		(*page_count)(void *addr);
 	void*		(*phys_to_virt)(phys_addr_t phys);
 	phys_addr_t	(*virt_to_phys)(void *addr);
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 14391826241c..ee93271035d9 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -191,6 +191,7 @@ vm_fault_t kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf)
  */
 void kvm_arch_destroy_vm(struct kvm *kvm)
 {
+	kvm_free_stage2_pgd(&kvm->arch.mmu);
 	bitmap_free(kvm->arch.pmu_filter);
 	free_cpumask_var(kvm->arch.supported_cpus);
 
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 24678ccba76a..dbace4c6a841 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -988,8 +988,12 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
 		mm_ops->dcache_clean_inval_poc(kvm_pte_follow(ctx->old, mm_ops),
 					       kvm_granule_size(ctx->level));
 
-	if (childp)
-		mm_ops->put_page(childp);
+	if (childp) {
+		if (mm_ops->put_page_rcu)
+			mm_ops->put_page_rcu(childp);
+		else
+			mm_ops->put_page(childp);
+	}
 
 	return 0;
 }
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 3b9d4d24c361..c3b3e2afe26f 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -172,6 +172,21 @@ static int kvm_host_page_count(void *addr)
 	return page_count(virt_to_page(addr));
 }
 
+static void kvm_s2_rcu_put_page(struct rcu_head *head)
+{
+	put_page(container_of(head, struct page, rcu_head));
+}
+
+static void kvm_s2_put_page_rcu(void *addr)
+{
+	struct page *page = virt_to_page(addr);
+
+	if (kvm_host_page_count(addr) == 1)
+		kvm_account_pgtable_pages(addr, -1);
+
+	call_rcu(&page->rcu_head, kvm_s2_rcu_put_page);
+}
+
 static phys_addr_t kvm_host_pa(void *addr)
 {
 	return __pa(addr);
@@ -704,6 +719,7 @@ static struct kvm_pgtable_mm_ops kvm_s2_mm_ops = {
 	.free_removed_table	= stage2_free_removed_table,
 	.get_page		= kvm_host_get_page,
 	.put_page		= kvm_s2_put_page,
+	.put_page_rcu		= kvm_s2_put_page_rcu,
 	.page_count		= kvm_host_page_count,
 	.phys_to_virt		= kvm_host_va,
 	.virt_to_phys		= kvm_host_pa,
@@ -1877,7 +1893,6 @@ void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen)
 
 void kvm_arch_flush_shadow_all(struct kvm *kvm)
 {
-	kvm_free_stage2_pgd(&kvm->arch.mmu);
 }
 
 void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
Stage2 page tables are currently not RCU safe against unmapping or VM
destruction. The previous mmu_notifier_ops members rely on
kvm->mmu_lock to synchronize with those operations.

However, the new mmu_notifier_ops member test_clear_young() provides
a fast path that does not take kvm->mmu_lock. To implement
kvm_arch_test_clear_young() for that path, unmapped page tables need
to be freed by RCU and kvm_free_stage2_pgd() needs to be after
mmu_notifier_unregister().

Remapping, specifically stage2_free_removed_table(), is already RCU
safe.

Signed-off-by: Yu Zhao <yuzhao@google.com>
---
 arch/arm64/include/asm/kvm_pgtable.h |  2 ++
 arch/arm64/kvm/arm.c                 |  1 +
 arch/arm64/kvm/hyp/pgtable.c         |  8 ++++++--
 arch/arm64/kvm/mmu.c                 | 17 ++++++++++++++++-
 4 files changed, 25 insertions(+), 3 deletions(-)
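One detail of the patch worth noting: kvm_s2_put_page_rcu() needs no extra allocation because struct page already embeds an rcu_head that is free for reuse once the table page is unlinked. In isolation, the pattern looks roughly like this; a sketch of the same idiom with hypothetical function names, not a drop-in replacement for the code above.

```c
#include <linux/mm.h>
#include <linux/rcupdate.h>

/* Illustrative sketch of the deferred put used by the patch above. */
static void deferred_put_page(struct rcu_head *head)
{
	/* The final put, delayed until after a grace period. */
	put_page(container_of(head, struct page, rcu_head));
}

static void put_page_after_grace_period(void *addr)
{
	struct page *page = virt_to_page(addr);

	/*
	 * struct page's rcu_head lives in a union with fields that are
	 * unused once the page is off the page-table tree, so it can
	 * carry the callback without a separate allocation.
	 */
	call_rcu(&page->rcu_head, deferred_put_page);
}
```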