
[6/9] KVM: arm64: Split huge pages when dirty logging is enabled

Message ID 20230113035000.480021-7-ricarkol@google.com (mailing list archive)
State New, archived
Series KVM: arm64: Eager Huge-page splitting for dirty-logging

Commit Message

Ricardo Koller Jan. 13, 2023, 3:49 a.m. UTC
Split huge pages eagerly when enabling dirty logging. The goal is to
avoid doing it while faulting on write-protected pages, which
negatively impacts guest performance.

A memslot marked for dirty logging is split in 1GB pieces at a time.
This is in order to release the mmu_lock and give other kernel threads
the opportunity to run, and also in order to allocate enough pages to
split a 1GB range worth of huge pages (or a single 1GB huge page).
Note that these page allocations can fail, so eager page splitting is
best-effort.  This is not a correctness issue though, as huge pages
can still be split on write-faults.

The benefits of eager page splitting are the same as in x86, added
with commit a3fe5dbda0a4 ("KVM: x86/mmu: Split huge pages mapped by
the TDP MMU when dirty logging is enabled"). For example, when running
dirty_log_perf_test with 64 virtual CPUs (Ampere Altra), 1GB per vCPU,
50% reads, and 2MB HugeTLB memory, the time it takes vCPUs to access
all of their memory after dirty logging is enabled decreased by 44%
from 2.58s to 1.42s.

Signed-off-by: Ricardo Koller <ricarkol@google.com>
---
 arch/arm64/include/asm/kvm_host.h |  30 ++++++++
 arch/arm64/kvm/mmu.c              | 110 +++++++++++++++++++++++++++++-
 2 files changed, 138 insertions(+), 2 deletions(-)

Comments

Ben Gardon Jan. 24, 2023, 5:52 p.m. UTC | #1
On Thu, Jan 12, 2023 at 7:50 PM Ricardo Koller <ricarkol@google.com> wrote:
>
> Split huge pages eagerly when enabling dirty logging. The goal is to
> avoid doing it while faulting on write-protected pages, which
> negatively impacts guest performance.
>
> A memslot marked for dirty logging is split in 1GB pieces at a time.
> This is in order to release the mmu_lock and give other kernel threads
> the opportunity to run, and also in order to allocate enough pages to
> split a 1GB range worth of huge pages (or a single 1GB huge page).
> Note that these page allocations can fail, so eager page splitting is
> best-effort.  This is not a correctness issue though, as huge pages
> can still be split on write-faults.
>
> The benefits of eager page splitting are the same as in x86, added
> with commit a3fe5dbda0a4 ("KVM: x86/mmu: Split huge pages mapped by
> the TDP MMU when dirty logging is enabled"). For example, when running
> dirty_log_perf_test with 64 virtual CPUs (Ampere Altra), 1GB per vCPU,
> 50% reads, and 2MB HugeTLB memory, the time it takes vCPUs to access
> all of their memory after dirty logging is enabled decreased by 44%
> from 2.58s to 1.42s.
>
> Signed-off-by: Ricardo Koller <ricarkol@google.com>
> ---
>  arch/arm64/include/asm/kvm_host.h |  30 ++++++++
>  arch/arm64/kvm/mmu.c              | 110 +++++++++++++++++++++++++++++-
>  2 files changed, 138 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index 35a159d131b5..6ab37209b1d1 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -153,6 +153,36 @@ struct kvm_s2_mmu {
>         /* The last vcpu id that ran on each physical CPU */
>         int __percpu *last_vcpu_ran;
>
> +       /*
> +        * Memory cache used to split EAGER_PAGE_SPLIT_CHUNK_SIZE worth of huge
> +        * pages. It is used to allocate stage2 page tables while splitting
> +        * huge pages. Its capacity should be EAGER_PAGE_SPLIT_CACHE_CAPACITY.
> +        * Note that the choice of EAGER_PAGE_SPLIT_CHUNK_SIZE influences both
> +        * the capacity of the split page cache (CACHE_CAPACITY), and how often
> +        * KVM reschedules. Be wary of raising CHUNK_SIZE too high.
> +        *
> +        * A good heuristic to pick CHUNK_SIZE is that it should be larger than
> +        * all the available huge-page sizes, and be a multiple of all the
> +        * other ones; for example, 1GB when all the available huge-page sizes
> +        * are (1GB, 2MB, 32MB, 512MB).

This feels a little fragile to link scheduling decisions into the
batch size. (I'm not saying we don't do the same thing on x86, but
it's a mistake in either case.) I'd prefer if we could yield more
granularly in a way that's not tied to splitting a whole
EAGER_PAGE_SPLIT_CHUNK_SIZE worth of pages.
Tuning the chunk size to balance memory overhead with batch
performance makes sense, but it becomes more difficult to also balance
with the needs of the scheduler. Could we add some yield point in
kvm_pgtable_stage2_split or give it an early return path or something?

> +        *
> +        * CACHE_CAPACITY should have enough pages to cover CHUNK_SIZE; for
> +        * example, 1GB requires the following number of PAGE_SIZE-pages:
> +        * - 512 when using 2MB hugepages with 4KB granules (1GB / 2MB).
> +        * - 513 when using 1GB hugepages with 4KB granules (1 + (1GB / 2MB)).
> +        * - 32 when using 32MB hugepages with 16KB granule (1GB / 32MB).
> +        * - 2 when using 512MB hugepages with 64KB granules (1GB / 512MB).
> +        * CACHE_CAPACITY below assumes the worst case: 1GB hugepages with 4KB
> +        * granules.
> +        *
> +        * Protected by kvm->slots_lock.
> +        */
> +#define EAGER_PAGE_SPLIT_CHUNK_SIZE                   SZ_1G
> +#define EAGER_PAGE_SPLIT_CACHE_CAPACITY                                        \
> +       (DIV_ROUND_UP_ULL(EAGER_PAGE_SPLIT_CHUNK_SIZE, SZ_1G) +         \
> +        DIV_ROUND_UP_ULL(EAGER_PAGE_SPLIT_CHUNK_SIZE, SZ_2M))
> +       struct kvm_mmu_memory_cache split_page_cache;
> +
>         struct kvm_arch *arch;
>  };
>
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 700c5774b50d..41ee330edae3 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -31,14 +31,24 @@ static phys_addr_t hyp_idmap_vector;
>
>  static unsigned long io_map_base;
>
> -static phys_addr_t stage2_range_addr_end(phys_addr_t addr, phys_addr_t end)
> +bool __read_mostly eager_page_split = true;
> +module_param(eager_page_split, bool, 0644);
> +
> +static phys_addr_t __stage2_range_addr_end(phys_addr_t addr, phys_addr_t end,
> +                                          phys_addr_t size)
>  {
> -       phys_addr_t size = kvm_granule_size(KVM_PGTABLE_MIN_BLOCK_LEVEL);
>         phys_addr_t boundary = ALIGN_DOWN(addr + size, size);
>
>         return (boundary - 1 < end - 1) ? boundary : end;
>  }
>
> +static phys_addr_t stage2_range_addr_end(phys_addr_t addr, phys_addr_t end)
> +{
> +       phys_addr_t size = kvm_granule_size(KVM_PGTABLE_MIN_BLOCK_LEVEL);
> +
> +       return __stage2_range_addr_end(addr, end, size);
> +}
> +
>  /*
>   * Release kvm_mmu_lock periodically if the memory region is large. Otherwise,
>   * we may see kernel panics with CONFIG_DETECT_HUNG_TASK,
> @@ -71,6 +81,64 @@ static int stage2_apply_range(struct kvm *kvm, phys_addr_t addr,
>         return ret;
>  }
>
> +static inline bool need_topup(struct kvm_mmu_memory_cache *cache, int min)
> +{
> +       return kvm_mmu_memory_cache_nr_free_objects(cache) < min;
> +}
> +
> +static bool need_topup_split_page_cache_or_resched(struct kvm *kvm)
> +{
> +       struct kvm_mmu_memory_cache *cache;
> +
> +       if (need_resched() || rwlock_needbreak(&kvm->mmu_lock))
> +               return true;
> +
> +       cache = &kvm->arch.mmu.split_page_cache;
> +       return need_topup(cache, EAGER_PAGE_SPLIT_CACHE_CAPACITY);
> +}
> +
> +static int kvm_mmu_split_huge_pages(struct kvm *kvm, phys_addr_t addr,
> +                             phys_addr_t end)
> +{
> +       struct kvm_mmu_memory_cache *cache;
> +       struct kvm_pgtable *pgt;
> +       int ret;
> +       u64 next;
> +       int cache_capacity = EAGER_PAGE_SPLIT_CACHE_CAPACITY;
> +
> +       lockdep_assert_held_write(&kvm->mmu_lock);
> +
> +       lockdep_assert_held(&kvm->slots_lock);
> +
> +       cache = &kvm->arch.mmu.split_page_cache;
> +
> +       do {
> +               if (need_topup_split_page_cache_or_resched(kvm)) {
> +                       write_unlock(&kvm->mmu_lock);
> +                       cond_resched();
> +                       /* Eager page splitting is best-effort. */
> +                       ret = __kvm_mmu_topup_memory_cache(cache,
> +                                                          cache_capacity,
> +                                                          cache_capacity);
> +                       write_lock(&kvm->mmu_lock);
> +                       if (ret)
> +                               break;
> +               }
> +
> +               pgt = kvm->arch.mmu.pgt;
> +               if (!pgt)
> +                       return -EINVAL;
> +
> +               next = __stage2_range_addr_end(addr, end,
> +                                              EAGER_PAGE_SPLIT_CHUNK_SIZE);
> +               ret = kvm_pgtable_stage2_split(pgt, addr, next - addr, cache);
> +               if (ret)
> +                       break;
> +       } while (addr = next, addr != end);
> +
> +       return ret;
> +}
> +
>  #define stage2_apply_range_resched(kvm, addr, end, fn)                 \
>         stage2_apply_range(kvm, addr, end, fn, true)
>
> @@ -755,6 +823,8 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t
>         for_each_possible_cpu(cpu)
>                 *per_cpu_ptr(mmu->last_vcpu_ran, cpu) = -1;
>
> +       mmu->split_page_cache.gfp_zero = __GFP_ZERO;
> +
>         mmu->pgt = pgt;
>         mmu->pgd_phys = __pa(pgt->pgd);
>         return 0;
> @@ -769,6 +839,7 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t
>  void kvm_uninit_stage2_mmu(struct kvm *kvm)
>  {
>         kvm_free_stage2_pgd(&kvm->arch.mmu);
> +       kvm_mmu_free_memory_cache(&kvm->arch.mmu.split_page_cache);
>  }
>
>  static void stage2_unmap_memslot(struct kvm *kvm,
> @@ -996,6 +1067,29 @@ static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
>         stage2_wp_range(&kvm->arch.mmu, start, end);
>  }
>
> +/**
> + * kvm_mmu_split_memory_region() - split the stage 2 blocks into PAGE_SIZE
> + *                                pages for memory slot
> + * @kvm:       The KVM pointer
> + * @slot:      The memory slot to split
> + *
> + * Acquires kvm->mmu_lock. Called with kvm->slots_lock mutex acquired,
> + * serializing operations for VM memory regions.
> + */
> +static void kvm_mmu_split_memory_region(struct kvm *kvm, int slot)
> +{
> +       struct kvm_memslots *slots = kvm_memslots(kvm);
> +       struct kvm_memory_slot *memslot = id_to_memslot(slots, slot);
> +       phys_addr_t start, end;
> +
> +       start = memslot->base_gfn << PAGE_SHIFT;
> +       end = (memslot->base_gfn + memslot->npages) << PAGE_SHIFT;
> +
> +       write_lock(&kvm->mmu_lock);
> +       kvm_mmu_split_huge_pages(kvm, start, end);
> +       write_unlock(&kvm->mmu_lock);
> +}
> +
>  /*
>   * kvm_arch_mmu_enable_log_dirty_pt_masked - enable dirty logging for selected
>   * dirty pages.
> @@ -1783,7 +1877,19 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
>                 if (kvm_dirty_log_manual_protect_and_init_set(kvm))
>                         return;
>
> +               if (READ_ONCE(eager_page_split))
> +                       kvm_mmu_split_memory_region(kvm, new->id);
> +
>                 kvm_mmu_wp_memory_region(kvm, new->id);
> +       } else {
> +               /*
> +                * Free any leftovers from the eager page splitting cache. Do
> +                * this when deleting, moving, disabling dirty logging, or
> +                * creating the memslot (a nop). Doing it for deletes makes
> +                * sure we don't leak memory, and there's no need to keep the
> +                * cache around for any of the other cases.
> +                */
> +               kvm_mmu_free_memory_cache(&kvm->arch.mmu.split_page_cache);
>         }
>  }
>
> --
> 2.39.0.314.g84b9a713c41-goog
>
Oliver Upton Jan. 24, 2023, 10:19 p.m. UTC | #2
On Tue, Jan 24, 2023 at 09:52:49AM -0800, Ben Gardon wrote:
> On Thu, Jan 12, 2023 at 7:50 PM Ricardo Koller <ricarkol@google.com> wrote:
> >
> > Split huge pages eagerly when enabling dirty logging. The goal is to
> > avoid doing it while faulting on write-protected pages, which
> > negatively impacts guest performance.
> >
> > A memslot marked for dirty logging is split in 1GB pieces at a time.
> > This is in order to release the mmu_lock and give other kernel threads
> > the opportunity to run, and also in order to allocate enough pages to
> > split a 1GB range worth of huge pages (or a single 1GB huge page).
> > Note that these page allocations can fail, so eager page splitting is
> > best-effort.  This is not a correctness issue though, as huge pages
> > can still be split on write-faults.
> >
> > The benefits of eager page splitting are the same as in x86, added
> > with commit a3fe5dbda0a4 ("KVM: x86/mmu: Split huge pages mapped by
> > the TDP MMU when dirty logging is enabled"). For example, when running
> > dirty_log_perf_test with 64 virtual CPUs (Ampere Altra), 1GB per vCPU,
> > 50% reads, and 2MB HugeTLB memory, the time it takes vCPUs to access
> > all of their memory after dirty logging is enabled decreased by 44%
> > from 2.58s to 1.42s.
> >
> > Signed-off-by: Ricardo Koller <ricarkol@google.com>
> > ---
> >  arch/arm64/include/asm/kvm_host.h |  30 ++++++++
> >  arch/arm64/kvm/mmu.c              | 110 +++++++++++++++++++++++++++++-
> >  2 files changed, 138 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> > index 35a159d131b5..6ab37209b1d1 100644
> > --- a/arch/arm64/include/asm/kvm_host.h
> > +++ b/arch/arm64/include/asm/kvm_host.h
> > @@ -153,6 +153,36 @@ struct kvm_s2_mmu {
> >         /* The last vcpu id that ran on each physical CPU */
> >         int __percpu *last_vcpu_ran;
> >
> > +       /*
> > +        * Memory cache used to split EAGER_PAGE_SPLIT_CHUNK_SIZE worth of huge
> > +        * pages. It is used to allocate stage2 page tables while splitting
> > +        * huge pages. Its capacity should be EAGER_PAGE_SPLIT_CACHE_CAPACITY.
> > +        * Note that the choice of EAGER_PAGE_SPLIT_CHUNK_SIZE influences both
> > +        * the capacity of the split page cache (CACHE_CAPACITY), and how often
> > +        * KVM reschedules. Be wary of raising CHUNK_SIZE too high.
> > +        *
> > +        * A good heuristic to pick CHUNK_SIZE is that it should be larger than
> > +        * all the available huge-page sizes, and be a multiple of all the
> > +        * other ones; for example, 1GB when all the available huge-page sizes
> > +        * are (1GB, 2MB, 32MB, 512MB).
> 
> This feels a little fragile to link scheduling decisions into the
> batch size.

Completely agree.

> (I'm not saying we don't do the same thing on x86, but
> it's a mistake in either case.) I'd prefer if we could yield more
> granularly in a way that's not tied to splitting a whole
> EAGER_PAGE_SPLIT_CHUNK_SIZE worth of pages.
> Tuning the chunk size to balance memory overhead with batch
> performance makes sense, but it becomes more difficult to also balance
> with the needs of the scheduler. Could we add some yield point in
> kvm_pgtable_stage2_split or give it an early return path or something?

We definitely need to move in the direction of early returns, but I'd
rather not build it purely for the sake of eager page splitting. A better
approach would likely be to add it to the core page table walker code
such that all walkers that operate on a swath of memory can use it.

In the interest of keeping this series manageable, I'd prefer yielding
walkers be done in a separate series.

--
Thanks,
Oliver
Oliver Upton Jan. 24, 2023, 10:45 p.m. UTC | #3
Hi Ricardo,

On Fri, Jan 13, 2023 at 03:49:57AM +0000, Ricardo Koller wrote:
> Split huge pages eagerly when enabling dirty logging. The goal is to
> avoid doing it while faulting on write-protected pages, which
> negatively impacts guest performance.
> 
> A memslot marked for dirty logging is split in 1GB pieces at a time.
> This is in order to release the mmu_lock and give other kernel threads
> the opportunity to run, and also in order to allocate enough pages to
> split a 1GB range worth of huge pages (or a single 1GB huge page).
> Note that these page allocations can fail, so eager page splitting is
> best-effort.  This is not a correctness issue though, as huge pages
> can still be split on write-faults.
> 
> The benefits of eager page splitting are the same as in x86, added
> with commit a3fe5dbda0a4 ("KVM: x86/mmu: Split huge pages mapped by
> the TDP MMU when dirty logging is enabled"). For example, when running
> dirty_log_perf_test with 64 virtual CPUs (Ampere Altra), 1GB per vCPU,
> 50% reads, and 2MB HugeTLB memory, the time it takes vCPUs to access
> all of their memory after dirty logging is enabled decreased by 44%
> from 2.58s to 1.42s.
> 
> Signed-off-by: Ricardo Koller <ricarkol@google.com>
> ---
>  arch/arm64/include/asm/kvm_host.h |  30 ++++++++
>  arch/arm64/kvm/mmu.c              | 110 +++++++++++++++++++++++++++++-
>  2 files changed, 138 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index 35a159d131b5..6ab37209b1d1 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -153,6 +153,36 @@ struct kvm_s2_mmu {
>  	/* The last vcpu id that ran on each physical CPU */
>  	int __percpu *last_vcpu_ran;
>  
> +	/*
> +	 * Memory cache used to split EAGER_PAGE_SPLIT_CHUNK_SIZE worth of huge
> +	 * pages. It is used to allocate stage2 page tables while splitting
> +	 * huge pages. Its capacity should be EAGER_PAGE_SPLIT_CACHE_CAPACITY.
> +	 * Note that the choice of EAGER_PAGE_SPLIT_CHUNK_SIZE influences both
> +	 * the capacity of the split page cache (CACHE_CAPACITY), and how often
> +	 * KVM reschedules. Be wary of raising CHUNK_SIZE too high.
> +	 *
> +	 * A good heuristic to pick CHUNK_SIZE is that it should be larger than
> +	 * all the available huge-page sizes, and be a multiple of all the
> +	 * other ones; for example, 1GB when all the available huge-page sizes
> +	 * are (1GB, 2MB, 32MB, 512MB).
> +	 *
> +	 * CACHE_CAPACITY should have enough pages to cover CHUNK_SIZE; for
> +	 * example, 1GB requires the following number of PAGE_SIZE-pages:
> +	 * - 512 when using 2MB hugepages with 4KB granules (1GB / 2MB).
> +	 * - 513 when using 1GB hugepages with 4KB granules (1 + (1GB / 2MB)).
> +	 * - 32 when using 32MB hugepages with 16KB granule (1GB / 32MB).
> +	 * - 2 when using 512MB hugepages with 64KB granules (1GB / 512MB).
> +	 * CACHE_CAPACITY below assumes the worst case: 1GB hugepages with 4KB
> +	 * granules.
> +	 *
> +	 * Protected by kvm->slots_lock.
> +	 */
> +#define EAGER_PAGE_SPLIT_CHUNK_SIZE		       SZ_1G
> +#define EAGER_PAGE_SPLIT_CACHE_CAPACITY					\
> +	(DIV_ROUND_UP_ULL(EAGER_PAGE_SPLIT_CHUNK_SIZE, SZ_1G) +		\
> +	 DIV_ROUND_UP_ULL(EAGER_PAGE_SPLIT_CHUNK_SIZE, SZ_2M))

Could you instead make use of the existing KVM_PGTABLE_MIN_BLOCK_LEVEL
as the batch size? 513 pages across all page sizes is a non-negligible
amount of memory that goes largely unused when PAGE_SIZE != 4K.

With that change it is a lot easier to correctly match the cache
capacity to the selected page size. Additionally, we continue to have a
single set of batching logic that we can improve later on.
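
(For illustration, a sketch of how the cache capacity could follow the
configured granule instead of hard-coding the 4K/1GB worst case. The helper
below is made up for this note, but the counts it produces match the
513/32/2 examples in the comment being discussed.)

/*
 * Illustrative only: number of PAGE_SIZE page-table pages needed to split
 * @range down to PAGE_SIZE mappings with the current granule.
 */
static int nr_pages_to_split_range(u64 range)
{
	int n = 0;

	/* A PMD-level table per 1GB block, only if the granule has 1GB blocks. */
	if (KVM_PGTABLE_MIN_BLOCK_LEVEL < 2)
		n += DIV_ROUND_UP(range, PUD_SIZE);

	/* A PTE-level table per PMD-sized block in the range. */
	n += DIV_ROUND_UP(range, PMD_SIZE);

	return n;
}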

> +	struct kvm_mmu_memory_cache split_page_cache;
> +
>  	struct kvm_arch *arch;
>  };
>  
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 700c5774b50d..41ee330edae3 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -31,14 +31,24 @@ static phys_addr_t hyp_idmap_vector;
>  
>  static unsigned long io_map_base;
>  
> -static phys_addr_t stage2_range_addr_end(phys_addr_t addr, phys_addr_t end)
> +bool __read_mostly eager_page_split = true;
> +module_param(eager_page_split, bool, 0644);
> +

Unless someone is really begging for it I'd prefer we not add a module
parameter for this.

> +static phys_addr_t __stage2_range_addr_end(phys_addr_t addr, phys_addr_t end,
> +					   phys_addr_t size)
>  {
> -	phys_addr_t size = kvm_granule_size(KVM_PGTABLE_MIN_BLOCK_LEVEL);
>  	phys_addr_t boundary = ALIGN_DOWN(addr + size, size);
>  
>  	return (boundary - 1 < end - 1) ? boundary : end;
>  }
>  
> +static phys_addr_t stage2_range_addr_end(phys_addr_t addr, phys_addr_t end)
> +{
> +	phys_addr_t size = kvm_granule_size(KVM_PGTABLE_MIN_BLOCK_LEVEL);
> +
> +	return __stage2_range_addr_end(addr, end, size);
> +}
> +
>  /*
>   * Release kvm_mmu_lock periodically if the memory region is large. Otherwise,
>   * we may see kernel panics with CONFIG_DETECT_HUNG_TASK,
> @@ -71,6 +81,64 @@ static int stage2_apply_range(struct kvm *kvm, phys_addr_t addr,
>  	return ret;
>  }
>  
> +static inline bool need_topup(struct kvm_mmu_memory_cache *cache, int min)
> +{
> +	return kvm_mmu_memory_cache_nr_free_objects(cache) < min;
> +}

I don't think the helper is adding too much here.

> +static bool need_topup_split_page_cache_or_resched(struct kvm *kvm)
> +{
> +	struct kvm_mmu_memory_cache *cache;
> +
> +	if (need_resched() || rwlock_needbreak(&kvm->mmu_lock))
> +		return true;
> +
> +	cache = &kvm->arch.mmu.split_page_cache;
> +	return need_topup(cache, EAGER_PAGE_SPLIT_CACHE_CAPACITY);
> +}
> +
> +static int kvm_mmu_split_huge_pages(struct kvm *kvm, phys_addr_t addr,
> +			      phys_addr_t end)
> +{
> +	struct kvm_mmu_memory_cache *cache;
> +	struct kvm_pgtable *pgt;
> +	int ret;
> +	u64 next;
> +	int cache_capacity = EAGER_PAGE_SPLIT_CACHE_CAPACITY;
> +
> +	lockdep_assert_held_write(&kvm->mmu_lock);

Rather than having the caller acquire the lock, can you instead do it
here? It would appear that the entire critical section is enclosed
within this function.

> +	lockdep_assert_held(&kvm->slots_lock);

This function doesn't depend on anything guarded by the slots_lock, can
you move this to kvm_mmu_split_memory_region()?

> +	cache = &kvm->arch.mmu.split_page_cache;
> +
> +	do {
> +		if (need_topup_split_page_cache_or_resched(kvm)) {
> +			write_unlock(&kvm->mmu_lock);
> +			cond_resched();
> +			/* Eager page splitting is best-effort. */
> +			ret = __kvm_mmu_topup_memory_cache(cache,
> +							   cache_capacity,
> +							   cache_capacity);
> +			write_lock(&kvm->mmu_lock);
> +			if (ret)
> +				break;
> +		}
> +
> +		pgt = kvm->arch.mmu.pgt;
> +		if (!pgt)
> +			return -EINVAL;
> +
> +		next = __stage2_range_addr_end(addr, end,
> +					       EAGER_PAGE_SPLIT_CHUNK_SIZE);
> +		ret = kvm_pgtable_stage2_split(pgt, addr, next - addr, cache);
> +		if (ret)
> +			break;
> +	} while (addr = next, addr != end);
> +
> +	return ret;
> +}

--
Thanks,
Oliver
Ricardo Koller Jan. 26, 2023, 6:45 p.m. UTC | #4
On Tue, Jan 24, 2023 at 2:45 PM Oliver Upton <oliver.upton@linux.dev> wrote:
>
> Hi Ricardo,
>
> On Fri, Jan 13, 2023 at 03:49:57AM +0000, Ricardo Koller wrote:
> > Split huge pages eagerly when enabling dirty logging. The goal is to
> > avoid doing it while faulting on write-protected pages, which
> > negatively impacts guest performance.
> >
> > A memslot marked for dirty logging is split in 1GB pieces at a time.
> > This is in order to release the mmu_lock and give other kernel threads
> > the opportunity to run, and also in order to allocate enough pages to
> > split a 1GB range worth of huge pages (or a single 1GB huge page).
> > Note that these page allocations can fail, so eager page splitting is
> > best-effort.  This is not a correctness issue though, as huge pages
> > can still be split on write-faults.
> >
> > The benefits of eager page splitting are the same as in x86, added
> > with commit a3fe5dbda0a4 ("KVM: x86/mmu: Split huge pages mapped by
> > the TDP MMU when dirty logging is enabled"). For example, when running
> > dirty_log_perf_test with 64 virtual CPUs (Ampere Altra), 1GB per vCPU,
> > 50% reads, and 2MB HugeTLB memory, the time it takes vCPUs to access
> > all of their memory after dirty logging is enabled decreased by 44%
> > from 2.58s to 1.42s.
> >
> > Signed-off-by: Ricardo Koller <ricarkol@google.com>
> > ---
> >  arch/arm64/include/asm/kvm_host.h |  30 ++++++++
> >  arch/arm64/kvm/mmu.c              | 110 +++++++++++++++++++++++++++++-
> >  2 files changed, 138 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> > index 35a159d131b5..6ab37209b1d1 100644
> > --- a/arch/arm64/include/asm/kvm_host.h
> > +++ b/arch/arm64/include/asm/kvm_host.h
> > @@ -153,6 +153,36 @@ struct kvm_s2_mmu {
> >       /* The last vcpu id that ran on each physical CPU */
> >       int __percpu *last_vcpu_ran;
> >
> > +     /*
> > +      * Memory cache used to split EAGER_PAGE_SPLIT_CHUNK_SIZE worth of huge
> > +      * pages. It is used to allocate stage2 page tables while splitting
> > +      * huge pages. Its capacity should be EAGER_PAGE_SPLIT_CACHE_CAPACITY.
> > +      * Note that the choice of EAGER_PAGE_SPLIT_CHUNK_SIZE influences both
> > +      * the capacity of the split page cache (CACHE_CAPACITY), and how often
> > +      * KVM reschedules. Be wary of raising CHUNK_SIZE too high.
> > +      *
> > +      * A good heuristic to pick CHUNK_SIZE is that it should be larger than
> > +      * all the available huge-page sizes, and be a multiple of all the
> > +      * other ones; for example, 1GB when all the available huge-page sizes
> > +      * are (1GB, 2MB, 32MB, 512MB).
> > +      *
> > +      * CACHE_CAPACITY should have enough pages to cover CHUNK_SIZE; for
> > +      * example, 1GB requires the following number of PAGE_SIZE-pages:
> > +      * - 512 when using 2MB hugepages with 4KB granules (1GB / 2MB).
> > +      * - 513 when using 1GB hugepages with 4KB granules (1 + (1GB / 2MB)).
> > +      * - 32 when using 32MB hugepages with 16KB granule (1GB / 32MB).
> > +      * - 2 when using 512MB hugepages with 64KB granules (1GB / 512MB).
> > +      * CACHE_CAPACITY below assumes the worst case: 1GB hugepages with 4KB
> > +      * granules.
> > +      *
> > +      * Protected by kvm->slots_lock.
> > +      */
> > +#define EAGER_PAGE_SPLIT_CHUNK_SIZE                 SZ_1G
> > +#define EAGER_PAGE_SPLIT_CACHE_CAPACITY                                      \
> > +     (DIV_ROUND_UP_ULL(EAGER_PAGE_SPLIT_CHUNK_SIZE, SZ_1G) +         \
> > +      DIV_ROUND_UP_ULL(EAGER_PAGE_SPLIT_CHUNK_SIZE, SZ_2M))
>
> Could you instead make use of the existing KVM_PGTABLE_MIN_BLOCK_LEVEL
> as the batch size? 513 pages across all page sizes is a non-negligible
> amount of memory that goes largely unused when PAGE_SIZE != 4K.
>

Sounds good, will refine this for v2.

> With that change it is a lot easier to correctly match the cache
> capacity to the selected page size. Additionally, we continue to have a
> single set of batching logic that we can improve later on.
>
> > +     struct kvm_mmu_memory_cache split_page_cache;
> > +
> >       struct kvm_arch *arch;
> >  };
> >
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index 700c5774b50d..41ee330edae3 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
> > @@ -31,14 +31,24 @@ static phys_addr_t hyp_idmap_vector;
> >
> >  static unsigned long io_map_base;
> >
> > -static phys_addr_t stage2_range_addr_end(phys_addr_t addr, phys_addr_t end)
> > +bool __read_mostly eager_page_split = true;
> > +module_param(eager_page_split, bool, 0644);
> > +
>
> Unless someone is really begging for it I'd prefer we not add a module
> parameter for this.

It was mainly to match x86 and because it makes perf testing a bit
simpler. What do others think?

>
> > +static phys_addr_t __stage2_range_addr_end(phys_addr_t addr, phys_addr_t end,
> > +                                        phys_addr_t size)
> >  {
> > -     phys_addr_t size = kvm_granule_size(KVM_PGTABLE_MIN_BLOCK_LEVEL);
> >       phys_addr_t boundary = ALIGN_DOWN(addr + size, size);
> >
> >       return (boundary - 1 < end - 1) ? boundary : end;
> >  }
> >
> > +static phys_addr_t stage2_range_addr_end(phys_addr_t addr, phys_addr_t end)
> > +{
> > +     phys_addr_t size = kvm_granule_size(KVM_PGTABLE_MIN_BLOCK_LEVEL);
> > +
> > +     return __stage2_range_addr_end(addr, end, size);
> > +}
> > +
> >  /*
> >   * Release kvm_mmu_lock periodically if the memory region is large. Otherwise,
> >   * we may see kernel panics with CONFIG_DETECT_HUNG_TASK,
> > @@ -71,6 +81,64 @@ static int stage2_apply_range(struct kvm *kvm, phys_addr_t addr,
> >       return ret;
> >  }
> >
> > +static inline bool need_topup(struct kvm_mmu_memory_cache *cache, int min)
> > +{
> > +     return kvm_mmu_memory_cache_nr_free_objects(cache) < min;
> > +}
>
> I don't think the helper is adding too much here.

Will try how it looks without.

>
> > +static bool need_topup_split_page_cache_or_resched(struct kvm *kvm)
> > +{
> > +     struct kvm_mmu_memory_cache *cache;
> > +
> > +     if (need_resched() || rwlock_needbreak(&kvm->mmu_lock))
> > +             return true;
> > +
> > +     cache = &kvm->arch.mmu.split_page_cache;
> > +     return need_topup(cache, EAGER_PAGE_SPLIT_CACHE_CAPACITY);
> > +}
> > +
> > +static int kvm_mmu_split_huge_pages(struct kvm *kvm, phys_addr_t addr,
> > +                           phys_addr_t end)
> > +{
> > +     struct kvm_mmu_memory_cache *cache;
> > +     struct kvm_pgtable *pgt;
> > +     int ret;
> > +     u64 next;
> > +     int cache_capacity = EAGER_PAGE_SPLIT_CACHE_CAPACITY;
> > +
> > +     lockdep_assert_held_write(&kvm->mmu_lock);
>
> Rather than having the caller acquire the lock, can you instead do it
> here? It would appear that the entire critical section is enclosed
> within this function.

Sure. I will first double check things related to perf and correctness
just in case.
I'm not sure if the increase of acquire/releases makes any difference in perf.
Also, not sure if there's a correctness issue because of releasing the lock
between WP and split (I think it should be fine, but not 100% sure).

>
> > +     lockdep_assert_held(&kvm->slots_lock);
>
> This function doesn't depend on anything guarded by the slots_lock, can
> you move this to kvm_mmu_split_memory_region()?

kvm_mmu_split_memory_region() takes a memslot.
That works in this case, eager splitting when enabling dirty logging, but won't
work in the next commit when splitting on the CLEAR ioctl.

>
> > +     cache = &kvm->arch.mmu.split_page_cache;
> > +
> > +     do {
> > +             if (need_topup_split_page_cache_or_resched(kvm)) {
> > +                     write_unlock(&kvm->mmu_lock);
> > +                     cond_resched();
> > +                     /* Eager page splitting is best-effort. */
> > +                     ret = __kvm_mmu_topup_memory_cache(cache,
> > +                                                        cache_capacity,
> > +                                                        cache_capacity);
> > +                     write_lock(&kvm->mmu_lock);
> > +                     if (ret)
> > +                             break;
> > +             }
> > +
> > +             pgt = kvm->arch.mmu.pgt;
> > +             if (!pgt)
> > +                     return -EINVAL;
> > +
> > +             next = __stage2_range_addr_end(addr, end,
> > +                                            EAGER_PAGE_SPLIT_CHUNK_SIZE);
> > +             ret = kvm_pgtable_stage2_split(pgt, addr, next - addr, cache);
> > +             if (ret)
> > +                     break;
> > +     } while (addr = next, addr != end);
> > +
> > +     return ret;
> > +}
>
> --
> Thanks,
> Oliver
Ricardo Koller Jan. 26, 2023, 7:25 p.m. UTC | #5
On Thu, Jan 26, 2023 at 10:45 AM Ricardo Koller <ricarkol@google.com> wrote:
>
> On Tue, Jan 24, 2023 at 2:45 PM Oliver Upton <oliver.upton@linux.dev> wrote:
> >
> > Hi Ricardo,
> >
> > On Fri, Jan 13, 2023 at 03:49:57AM +0000, Ricardo Koller wrote:
> > > Split huge pages eagerly when enabling dirty logging. The goal is to
> > > avoid doing it while faulting on write-protected pages, which
> > > negatively impacts guest performance.
> > >
> > > A memslot marked for dirty logging is split in 1GB pieces at a time.
> > > This is in order to release the mmu_lock and give other kernel threads
> > > the opportunity to run, and also in order to allocate enough pages to
> > > split a 1GB range worth of huge pages (or a single 1GB huge page).
> > > Note that these page allocations can fail, so eager page splitting is
> > > best-effort.  This is not a correctness issue though, as huge pages
> > > can still be split on write-faults.
> > >
> > > The benefits of eager page splitting are the same as in x86, added
> > > with commit a3fe5dbda0a4 ("KVM: x86/mmu: Split huge pages mapped by
> > > the TDP MMU when dirty logging is enabled"). For example, when running
> > > dirty_log_perf_test with 64 virtual CPUs (Ampere Altra), 1GB per vCPU,
> > > 50% reads, and 2MB HugeTLB memory, the time it takes vCPUs to access
> > > all of their memory after dirty logging is enabled decreased by 44%
> > > from 2.58s to 1.42s.
> > >
> > > Signed-off-by: Ricardo Koller <ricarkol@google.com>
> > > ---
> > >  arch/arm64/include/asm/kvm_host.h |  30 ++++++++
> > >  arch/arm64/kvm/mmu.c              | 110 +++++++++++++++++++++++++++++-
> > >  2 files changed, 138 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> > > index 35a159d131b5..6ab37209b1d1 100644
> > > --- a/arch/arm64/include/asm/kvm_host.h
> > > +++ b/arch/arm64/include/asm/kvm_host.h
> > > @@ -153,6 +153,36 @@ struct kvm_s2_mmu {
> > >       /* The last vcpu id that ran on each physical CPU */
> > >       int __percpu *last_vcpu_ran;
> > >
> > > +     /*
> > > +      * Memory cache used to split EAGER_PAGE_SPLIT_CHUNK_SIZE worth of huge
> > > +      * pages. It is used to allocate stage2 page tables while splitting
> > > +      * huge pages. Its capacity should be EAGER_PAGE_SPLIT_CACHE_CAPACITY.
> > > +      * Note that the choice of EAGER_PAGE_SPLIT_CHUNK_SIZE influences both
> > > +      * the capacity of the split page cache (CACHE_CAPACITY), and how often
> > > +      * KVM reschedules. Be wary of raising CHUNK_SIZE too high.
> > > +      *
> > > +      * A good heuristic to pick CHUNK_SIZE is that it should be larger than
> > > +      * all the available huge-page sizes, and be a multiple of all the
> > > +      * other ones; for example, 1GB when all the available huge-page sizes
> > > +      * are (1GB, 2MB, 32MB, 512MB).
> > > +      *
> > > +      * CACHE_CAPACITY should have enough pages to cover CHUNK_SIZE; for
> > > +      * example, 1GB requires the following number of PAGE_SIZE-pages:
> > > +      * - 512 when using 2MB hugepages with 4KB granules (1GB / 2MB).
> > > +      * - 513 when using 1GB hugepages with 4KB granules (1 + (1GB / 2MB)).
> > > +      * - 32 when using 32MB hugepages with 16KB granule (1GB / 32MB).
> > > +      * - 2 when using 512MB hugepages with 64KB granules (1GB / 512MB).
> > > +      * CACHE_CAPACITY below assumes the worst case: 1GB hugepages with 4KB
> > > +      * granules.
> > > +      *
> > > +      * Protected by kvm->slots_lock.
> > > +      */
> > > +#define EAGER_PAGE_SPLIT_CHUNK_SIZE                 SZ_1G
> > > +#define EAGER_PAGE_SPLIT_CACHE_CAPACITY                                      \
> > > +     (DIV_ROUND_UP_ULL(EAGER_PAGE_SPLIT_CHUNK_SIZE, SZ_1G) +         \
> > > +      DIV_ROUND_UP_ULL(EAGER_PAGE_SPLIT_CHUNK_SIZE, SZ_2M))
> >
> > Could you instead make use of the existing KVM_PGTABLE_MIN_BLOCK_LEVEL
> > as the batch size? 513 pages across all page sizes is a non-negligible
> > amount of memory that goes largely unused when PAGE_SIZE != 4K.
> >
>
> Sounds good, will refine this for v2.
>
> > With that change it is a lot easier to correctly match the cache
> > capacity to the selected page size. Additionally, we continue to have a
> > single set of batching logic that we can improve later on.
> >
> > > +     struct kvm_mmu_memory_cache split_page_cache;
> > > +
> > >       struct kvm_arch *arch;
> > >  };
> > >
> > > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > > index 700c5774b50d..41ee330edae3 100644
> > > --- a/arch/arm64/kvm/mmu.c
> > > +++ b/arch/arm64/kvm/mmu.c
> > > @@ -31,14 +31,24 @@ static phys_addr_t hyp_idmap_vector;
> > >
> > >  static unsigned long io_map_base;
> > >
> > > -static phys_addr_t stage2_range_addr_end(phys_addr_t addr, phys_addr_t end)
> > > +bool __read_mostly eager_page_split = true;
> > > +module_param(eager_page_split, bool, 0644);
> > > +
> >
> > Unless someone is really begging for it I'd prefer we not add a module
> > parameter for this.
>
> It was mainly to match x86 and because it makes perf testing a bit
> simpler. What do others think?
>
> >
> > > +static phys_addr_t __stage2_range_addr_end(phys_addr_t addr, phys_addr_t end,
> > > +                                        phys_addr_t size)
> > >  {
> > > -     phys_addr_t size = kvm_granule_size(KVM_PGTABLE_MIN_BLOCK_LEVEL);
> > >       phys_addr_t boundary = ALIGN_DOWN(addr + size, size);
> > >
> > >       return (boundary - 1 < end - 1) ? boundary : end;
> > >  }
> > >
> > > +static phys_addr_t stage2_range_addr_end(phys_addr_t addr, phys_addr_t end)
> > > +{
> > > +     phys_addr_t size = kvm_granule_size(KVM_PGTABLE_MIN_BLOCK_LEVEL);
> > > +
> > > +     return __stage2_range_addr_end(addr, end, size);
> > > +}
> > > +
> > >  /*
> > >   * Release kvm_mmu_lock periodically if the memory region is large. Otherwise,
> > >   * we may see kernel panics with CONFIG_DETECT_HUNG_TASK,
> > > @@ -71,6 +81,64 @@ static int stage2_apply_range(struct kvm *kvm, phys_addr_t addr,
> > >       return ret;
> > >  }
> > >
> > > +static inline bool need_topup(struct kvm_mmu_memory_cache *cache, int min)
> > > +{
> > > +     return kvm_mmu_memory_cache_nr_free_objects(cache) < min;
> > > +}
> >
> > I don't think the helper is adding too much here.
>
> Will try how it looks without.
>
> >
> > > +static bool need_topup_split_page_cache_or_resched(struct kvm *kvm)
> > > +{
> > > +     struct kvm_mmu_memory_cache *cache;
> > > +
> > > +     if (need_resched() || rwlock_needbreak(&kvm->mmu_lock))
> > > +             return true;
> > > +
> > > +     cache = &kvm->arch.mmu.split_page_cache;
> > > +     return need_topup(cache, EAGER_PAGE_SPLIT_CACHE_CAPACITY);
> > > +}
> > > +
> > > +static int kvm_mmu_split_huge_pages(struct kvm *kvm, phys_addr_t addr,
> > > +                           phys_addr_t end)
> > > +{
> > > +     struct kvm_mmu_memory_cache *cache;
> > > +     struct kvm_pgtable *pgt;
> > > +     int ret;
> > > +     u64 next;
> > > +     int cache_capacity = EAGER_PAGE_SPLIT_CACHE_CAPACITY;
> > > +
> > > +     lockdep_assert_held_write(&kvm->mmu_lock);
> >
> > Rather than having the caller acquire the lock, can you instead do it
> > here? It would appear that the entire critical section is enclosed
> > within this function.
>
> Sure. I will first double check things related to perf and correctness
> just in case.
> I'm not sure if the increase of acquire/releases makes any difference in perf.
> Also, not sure if there's a correctness issue because of releasing the lock
> between WP and split (I think it should be fine, but not 100% sure).
>
> >
> > > +     lockdep_assert_held(&kvm->slots_lock);
> >
> > This function doesn't depend on anything guarded by the slots_lock, can
> > you move this to kvm_mmu_split_memory_region()?
>
> kvm_mmu_split_memory_region() takes a memslot.
> That works in this case, eager splitting when enabling dirty logging, but won't
> work in the next commit when splitting on the CLEAR ioctl.
>

Ahh, you meant just the "lockdep" line. Yes, that makes sense. Will do.

> >
> > > +     cache = &kvm->arch.mmu.split_page_cache;
> > > +
> > > +     do {
> > > +             if (need_topup_split_page_cache_or_resched(kvm)) {
> > > +                     write_unlock(&kvm->mmu_lock);
> > > +                     cond_resched();
> > > +                     /* Eager page splitting is best-effort. */
> > > +                     ret = __kvm_mmu_topup_memory_cache(cache,
> > > +                                                        cache_capacity,
> > > +                                                        cache_capacity);
> > > +                     write_lock(&kvm->mmu_lock);
> > > +                     if (ret)
> > > +                             break;
> > > +             }
> > > +
> > > +             pgt = kvm->arch.mmu.pgt;
> > > +             if (!pgt)
> > > +                     return -EINVAL;
> > > +
> > > +             next = __stage2_range_addr_end(addr, end,
> > > +                                            EAGER_PAGE_SPLIT_CHUNK_SIZE);
> > > +             ret = kvm_pgtable_stage2_split(pgt, addr, next - addr, cache);
> > > +             if (ret)
> > > +                     break;
> > > +     } while (addr = next, addr != end);
> > > +
> > > +     return ret;
> > > +}
> >
> > --
> > Thanks,
> > Oliver
Marc Zyngier Jan. 26, 2023, 8:10 p.m. UTC | #6
On Thu, 26 Jan 2023 18:45:43 +0000,
Ricardo Koller <ricarkol@google.com> wrote:
> 
> On Tue, Jan 24, 2023 at 2:45 PM Oliver Upton <oliver.upton@linux.dev> wrote:
> >
> > Hi Ricardo,
> >
> > On Fri, Jan 13, 2023 at 03:49:57AM +0000, Ricardo Koller wrote:
> > > Split huge pages eagerly when enabling dirty logging. The goal is to
> > > avoid doing it while faulting on write-protected pages, which
> > > negatively impacts guest performance.
> > >
> > > A memslot marked for dirty logging is split in 1GB pieces at a time.
> > > This is in order to release the mmu_lock and give other kernel threads
> > > the opportunity to run, and also in order to allocate enough pages to
> > > split a 1GB range worth of huge pages (or a single 1GB huge page).
> > > Note that these page allocations can fail, so eager page splitting is
> > > best-effort.  This is not a correctness issue though, as huge pages
> > > can still be split on write-faults.
> > >
> > > The benefits of eager page splitting are the same as in x86, added
> > > with commit a3fe5dbda0a4 ("KVM: x86/mmu: Split huge pages mapped by
> > > the TDP MMU when dirty logging is enabled"). For example, when running
> > > dirty_log_perf_test with 64 virtual CPUs (Ampere Altra), 1GB per vCPU,
> > > 50% reads, and 2MB HugeTLB memory, the time it takes vCPUs to access
> > > all of their memory after dirty logging is enabled decreased by 44%
> > > from 2.58s to 1.42s.
> > >
> > > Signed-off-by: Ricardo Koller <ricarkol@google.com>
> > > ---
> > >  arch/arm64/include/asm/kvm_host.h |  30 ++++++++
> > >  arch/arm64/kvm/mmu.c              | 110 +++++++++++++++++++++++++++++-
> > >  2 files changed, 138 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> > > index 35a159d131b5..6ab37209b1d1 100644
> > > --- a/arch/arm64/include/asm/kvm_host.h
> > > +++ b/arch/arm64/include/asm/kvm_host.h
> > > @@ -153,6 +153,36 @@ struct kvm_s2_mmu {
> > >       /* The last vcpu id that ran on each physical CPU */
> > >       int __percpu *last_vcpu_ran;
> > >
> > > +     /*
> > > +      * Memory cache used to split EAGER_PAGE_SPLIT_CHUNK_SIZE worth of huge
> > > +      * pages. It is used to allocate stage2 page tables while splitting
> > > +      * huge pages. Its capacity should be EAGER_PAGE_SPLIT_CACHE_CAPACITY.
> > > +      * Note that the choice of EAGER_PAGE_SPLIT_CHUNK_SIZE influences both
> > > +      * the capacity of the split page cache (CACHE_CAPACITY), and how often
> > > +      * KVM reschedules. Be wary of raising CHUNK_SIZE too high.
> > > +      *
> > > +      * A good heuristic to pick CHUNK_SIZE is that it should be larger than
> > > +      * all the available huge-page sizes, and be a multiple of all the
> > > +      * other ones; for example, 1GB when all the available huge-page sizes
> > > +      * are (1GB, 2MB, 32MB, 512MB).
> > > +      *
> > > +      * CACHE_CAPACITY should have enough pages to cover CHUNK_SIZE; for
> > > +      * example, 1GB requires the following number of PAGE_SIZE-pages:
> > > +      * - 512 when using 2MB hugepages with 4KB granules (1GB / 2MB).
> > > +      * - 513 when using 1GB hugepages with 4KB granules (1 + (1GB / 2MB)).
> > > +      * - 32 when using 32MB hugepages with 16KB granule (1GB / 32MB).
> > > +      * - 2 when using 512MB hugepages with 64KB granules (1GB / 512MB).
> > > +      * CACHE_CAPACITY below assumes the worst case: 1GB hugepages with 4KB
> > > +      * granules.
> > > +      *
> > > +      * Protected by kvm->slots_lock.
> > > +      */
> > > +#define EAGER_PAGE_SPLIT_CHUNK_SIZE                 SZ_1G
> > > +#define EAGER_PAGE_SPLIT_CACHE_CAPACITY                                      \
> > > +     (DIV_ROUND_UP_ULL(EAGER_PAGE_SPLIT_CHUNK_SIZE, SZ_1G) +         \
> > > +      DIV_ROUND_UP_ULL(EAGER_PAGE_SPLIT_CHUNK_SIZE, SZ_2M))
> >
> > Could you instead make use of the existing KVM_PGTABLE_MIN_BLOCK_LEVEL
> > as the batch size? 513 pages across all page sizes is a non-negligible
> > amount of memory that goes largely unused when PAGE_SIZE != 4K.
> >
> 
> Sounds good, will refine this for v2.
> 
> > With that change it is a lot easier to correctly match the cache
> > capacity to the selected page size. Additionally, we continue to have a
> > single set of batching logic that we can improve later on.
> >
> > > +     struct kvm_mmu_memory_cache split_page_cache;
> > > +
> > >       struct kvm_arch *arch;
> > >  };
> > >
> > > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > > index 700c5774b50d..41ee330edae3 100644
> > > --- a/arch/arm64/kvm/mmu.c
> > > +++ b/arch/arm64/kvm/mmu.c
> > > @@ -31,14 +31,24 @@ static phys_addr_t hyp_idmap_vector;
> > >
> > >  static unsigned long io_map_base;
> > >
> > > -static phys_addr_t stage2_range_addr_end(phys_addr_t addr, phys_addr_t end)
> > > +bool __read_mostly eager_page_split = true;
> > > +module_param(eager_page_split, bool, 0644);
> > > +
> >
> > Unless someone is really begging for it I'd prefer we not add a module
> > parameter for this.
> 
> It was mainly to match x86 and because it makes perf testing a bit
> simpler. What do others think?

From my PoV this is a no.

If you have a flag because this is an experimental feature (like NV),
then this is a kernel option, and you taint the kernel when it is set.

If you have a flag because this is a modal option that makes different
use of the HW which cannot be exposed to userspace (like GICv4), then
this also is a kernel option.

This is neither.

The one thing that would convince me to make it an option is the
amount of memory this thing consumes. 512+ pages is a huge amount, and
I'm not overly happy about that. Why can't this be a userspace visible
option, selectable on a per VM (or memslot) basis?

Thanks,

	M.
Ricardo Koller Jan. 27, 2023, 3:45 p.m. UTC | #7
Hi Marc,

On Thu, Jan 26, 2023 at 12:10 PM Marc Zyngier <maz@kernel.org> wrote:
>
> On Thu, 26 Jan 2023 18:45:43 +0000,
> Ricardo Koller <ricarkol@google.com> wrote:
> >
> > On Tue, Jan 24, 2023 at 2:45 PM Oliver Upton <oliver.upton@linux.dev> wrote:
> > >
> > > Hi Ricardo,
> > >
> > > On Fri, Jan 13, 2023 at 03:49:57AM +0000, Ricardo Koller wrote:
> > > > Split huge pages eagerly when enabling dirty logging. The goal is to
> > > > avoid doing it while faulting on write-protected pages, which
> > > > negatively impacts guest performance.
> > > >
> > > > A memslot marked for dirty logging is split in 1GB pieces at a time.
> > > > This is in order to release the mmu_lock and give other kernel threads
> > > > the opportunity to run, and also in order to allocate enough pages to
> > > > split a 1GB range worth of huge pages (or a single 1GB huge page).
> > > > Note that these page allocations can fail, so eager page splitting is
> > > > best-effort.  This is not a correctness issue though, as huge pages
> > > > can still be split on write-faults.
> > > >
> > > > The benefits of eager page splitting are the same as in x86, added
> > > > with commit a3fe5dbda0a4 ("KVM: x86/mmu: Split huge pages mapped by
> > > > the TDP MMU when dirty logging is enabled"). For example, when running
> > > > dirty_log_perf_test with 64 virtual CPUs (Ampere Altra), 1GB per vCPU,
> > > > 50% reads, and 2MB HugeTLB memory, the time it takes vCPUs to access
> > > > all of their memory after dirty logging is enabled decreased by 44%
> > > > from 2.58s to 1.42s.
> > > >
> > > > Signed-off-by: Ricardo Koller <ricarkol@google.com>
> > > > ---
> > > >  arch/arm64/include/asm/kvm_host.h |  30 ++++++++
> > > >  arch/arm64/kvm/mmu.c              | 110 +++++++++++++++++++++++++++++-
> > > >  2 files changed, 138 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> > > > index 35a159d131b5..6ab37209b1d1 100644
> > > > --- a/arch/arm64/include/asm/kvm_host.h
> > > > +++ b/arch/arm64/include/asm/kvm_host.h
> > > > @@ -153,6 +153,36 @@ struct kvm_s2_mmu {
> > > >       /* The last vcpu id that ran on each physical CPU */
> > > >       int __percpu *last_vcpu_ran;
> > > >
> > > > +     /*
> > > > +      * Memory cache used to split EAGER_PAGE_SPLIT_CHUNK_SIZE worth of huge
> > > > +      * pages. It is used to allocate stage2 page tables while splitting
> > > > +      * huge pages. Its capacity should be EAGER_PAGE_SPLIT_CACHE_CAPACITY.
> > > > +      * Note that the choice of EAGER_PAGE_SPLIT_CHUNK_SIZE influences both
> > > > +      * the capacity of the split page cache (CACHE_CAPACITY), and how often
> > > > +      * KVM reschedules. Be wary of raising CHUNK_SIZE too high.
> > > > +      *
> > > > +      * A good heuristic to pick CHUNK_SIZE is that it should be larger than
> > > > +      * all the available huge-page sizes, and be a multiple of all the
> > > > +      * other ones; for example, 1GB when all the available huge-page sizes
> > > > +      * are (1GB, 2MB, 32MB, 512MB).
> > > > +      *
> > > > +      * CACHE_CAPACITY should have enough pages to cover CHUNK_SIZE; for
> > > > +      * example, 1GB requires the following number of PAGE_SIZE-pages:
> > > > +      * - 512 when using 2MB hugepages with 4KB granules (1GB / 2MB).
> > > > +      * - 513 when using 1GB hugepages with 4KB granules (1 + (1GB / 2MB)).
> > > > +      * - 32 when using 32MB hugepages with 16KB granule (1GB / 32MB).
> > > > +      * - 2 when using 512MB hugepages with 64KB granules (1GB / 512MB).
> > > > +      * CACHE_CAPACITY below assumes the worst case: 1GB hugepages with 4KB
> > > > +      * granules.
> > > > +      *
> > > > +      * Protected by kvm->slots_lock.
> > > > +      */
> > > > +#define EAGER_PAGE_SPLIT_CHUNK_SIZE                 SZ_1G
> > > > +#define EAGER_PAGE_SPLIT_CACHE_CAPACITY                                      \
> > > > +     (DIV_ROUND_UP_ULL(EAGER_PAGE_SPLIT_CHUNK_SIZE, SZ_1G) +         \
> > > > +      DIV_ROUND_UP_ULL(EAGER_PAGE_SPLIT_CHUNK_SIZE, SZ_2M))
> > >
> > > Could you instead make use of the existing KVM_PGTABLE_MIN_BLOCK_LEVEL
> > > as the batch size? 513 pages across all page sizes is a non-negligible
> > > amount of memory that goes largely unused when PAGE_SIZE != 4K.
> > >
> >
> > Sounds good, will refine this for v2.
> >
> > > With that change it is a lot easier to correctly match the cache
> > > capacity to the selected page size. Additionally, we continue to have a
> > > single set of batching logic that we can improve later on.
> > >
> > > > +     struct kvm_mmu_memory_cache split_page_cache;
> > > > +
> > > >       struct kvm_arch *arch;
> > > >  };
> > > >
> > > > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > > > index 700c5774b50d..41ee330edae3 100644
> > > > --- a/arch/arm64/kvm/mmu.c
> > > > +++ b/arch/arm64/kvm/mmu.c
> > > > @@ -31,14 +31,24 @@ static phys_addr_t hyp_idmap_vector;
> > > >
> > > >  static unsigned long io_map_base;
> > > >
> > > > -static phys_addr_t stage2_range_addr_end(phys_addr_t addr, phys_addr_t end)
> > > > +bool __read_mostly eager_page_split = true;
> > > > +module_param(eager_page_split, bool, 0644);
> > > > +
> > >
> > > Unless someone is really begging for it I'd prefer we not add a module
> > > parameter for this.
> >
> > It was mainly to match x86 and because it makes perf testing a bit
> > simpler. What do others think?
>
> From my PoV this is a no.
>
> If you have a flag because this is an experimental feature (like NV),
> then this is a kernel option, and you taint the kernel when it is set.
>
> If you have a flag because this is a modal option that makes different
> use of the HW which cannot be exposed to userspace (like GICv4), then
> this also is a kernel option.
>
> This is neither.

Ah, I see. Thanks for the explanation.

>
> The one thing that would convince me to make it an option is the
> amount of memory this thing consumes. 512+ pages is a huge amount, and
> I'm not overly happy about that. Why can't this be a userspace visible
> option, selectable on a per VM (or memslot) basis?
>

It should be possible.  I am exploring a couple of ideas that could
help when the hugepages are not 1G (e.g., 2M).  However, they add
complexity and I'm not sure they help much.

(will be using PAGE_SIZE=4K to make things simpler)

This feature pre-allocates 513 pages before splitting every 1G range.
For example, it converts 1G block PTEs into trees made of 513 pages.
When not using this feature, the same 513 pages would be allocated,
but lazily over a longer period of time.

Eager-splitting pre-allocates those pages in order to split huge-pages
into fully populated trees, which is needed in order to use FEAT_BBM
and skip the expensive TLBI broadcasts.  513 is just the number of
pages needed to break a 1G huge-page.
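
(Spelling out where the 513 comes from with 4KB granules: splitting one 1GB
block takes 1 table page to replace the block with 512 x 2MB entries, plus
512 table pages to break each 2MB entry into 512 x 4KB PTEs, i.e.
1 + (1GB / 2MB) = 513 PAGE_SIZE pages, as in the patch's comment.)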

We could optimize for smaller huge-pages, like 2M by splitting 1
huge-page at a time: only preallocate one 4K page at a time.  The
trick is how to know that we are splitting 2M huge-pages.  We could
either get the vma pagesize or use hints from userspace.  I'm not sure
that this is worth it though.  The user will most likely want to split
big ranges of memory (>1G), so optimizing for smaller huge-pages only
converts the left into the right:

alloc 1 page            |    |  alloc 512 pages
split 2M huge-page      |    |  split 2M huge-page
alloc 1 page            |    |  split 2M huge-page
split 2M huge-page      | => |  split 2M huge-page
                        ...
alloc 1 page            |    |  split 2M huge-page
split 2M huge-page      |    |  split 2M huge-page

Still thinking of what else to do.

Thanks!
Ricardo

> Thanks,
>
>         M.
>
> --
> Without deviation from the norm, progress is not possible.
Oliver Upton Jan. 30, 2023, 9:18 p.m. UTC | #8
On Fri, Jan 27, 2023 at 07:45:15AM -0800, Ricardo Koller wrote:
> Hi Marc,
> 
> On Thu, Jan 26, 2023 at 12:10 PM Marc Zyngier <maz@kernel.org> wrote:

[...]

> >
> > The one thing that would convince me to make it an option is the
> > amount of memory this thing consumes. 512+ pages is a huge amount, and
> > I'm not overly happy about that. Why can't this be a userspace visible
> > option, selectable on a per VM (or memslot) basis?
> >
> 
> It should be possible.  I am exploring a couple of ideas that could
> help when the hugepages are not 1G (e.g., 2M).  However, they add
> complexity and I'm not sure they help much.
> 
> (will be using PAGE_SIZE=4K to make things simpler)
> 
> This feature pre-allocates 513 pages before splitting every 1G range.
> For example, it converts 1G block PTEs into trees made of 513 pages.
> When not using this feature, the same 513 pages would be allocated,
> but lazily over a longer period of time.
> 
> Eager-splitting pre-allocates those pages in order to split huge-pages
> into fully populated trees, which is needed in order to use FEAT_BBM
> and skip the expensive TLBI broadcasts.  513 is just the number of
> pages needed to break a 1G huge-page.
> 
> We could optimize for smaller huge-pages, like 2M by splitting 1
> huge-page at a time: only preallocate one 4K page at a time.  The
> trick is how to know that we are splitting 2M huge-pages.  We could
> either get the vma pagesize or use hints from userspace.  I'm not sure
> that this is worth it though.  The user will most likely want to split
> big ranges of memory (>1G), so optimizing for smaller huge-pages only
> converts the left into the right:
> 
> alloc 1 page            |    |  alloc 512 pages
> split 2M huge-page      |    |  split 2M huge-page
> alloc 1 page            |    |  split 2M huge-page
> split 2M huge-page      | => |  split 2M huge-page
>                         ...
> alloc 1 page            |    |  split 2M huge-page
> split 2M huge-page      |    |  split 2M huge-page
> 
> Still thinking of what else to do.

I think that Marc's suggestion of having userspace configure this is
sound. After all, userspace _should_ know the granularity of the backing
source it chose for guest memory.

We could also interpret a cache size of 0 to signal that userspace wants
to disable eager page split for a VM altogether. It is entirely possible
that the user will want a differing QoS between slice-of-hardware and
overcommitted VMs.
Sean Christopherson Jan. 31, 2023, 1:18 a.m. UTC | #9
On Mon, Jan 30, 2023, Oliver Upton wrote:
> I think that Marc's suggestion of having userspace configure this is
> sound. After all, userspace _should_ know the granularity of the backing
> source it chose for guest memory.
> 
> We could also interpret a cache size of 0 to signal that userspace wants
> to disable eager page split for a VM altogether. It is entirely possible that
> the user will want a differing QoS between slice-of-hardware and
> overcommitted VMs.

Maybe.  It's also entirely possible that QoS is never factored in, e.g. if QoS
guarantees for all VMs on a system are better met by enabling eager splitting
across the board.

There are other reasons to use module/kernel params beyond what Marc listed, e.g.
to let the user opt out even when something is on by default.  x86's TDP MMU has
benefited greatly from downstream users being able to do A/B performance testing
this way.  I suspect x86's eager_page_split knob was added largely for this
reason, e.g. to easily see how a specific workload is affected by eager splitting.
That seems like a reasonable fit on the ARM side as well.

IMO, we should try to avoid new uAPI without a proven need, especially if we're
considering throwing something into memslots.  AFAIK, module params, at least on
the x86 side of things, aren't considered immutable ABI (or maybe it's just that
we haven't modified/removed one recently that folks care about).
Marc Zyngier Jan. 31, 2023, 10:28 a.m. UTC | #10
On Fri, 27 Jan 2023 15:45:15 +0000,
Ricardo Koller <ricarkol@google.com> wrote:
> 
> > The one thing that would convince me to make it an option is the
> > amount of memory this thing consumes. 512+ pages is a huge amount, and
> > I'm not overly happy about that. Why can't this be a userspace visible
> > option, selectable on a per VM (or memslot) basis?
> >
> 
> It should be possible.  I am exploring a couple of ideas that could
> help when the hugepages are not 1G (e.g., 2M).  However, they add
> complexity and I'm not sure they help much.
> 
> (will be using PAGE_SIZE=4K to make things simpler)
> 
> This feature pre-allocates 513 pages before splitting every 1G range.
> For example, it converts 1G block PTEs into trees made of 513 pages.
> When not using this feature, the same 513 pages would be allocated,
> but lazily over a longer period of time.

This is an important difference. It avoids the upfront allocation
"thermal shock", giving time to the kernel to reclaim memory from
somewhere else. Doing it upfront means you *must* have 2MB+ of
immediately available memory for each GB of RAM your guest uses.
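
(Concretely, assuming 4K pages: 513 pages * 4KiB = 2052KiB, i.e. a bit
over 2MiB of page-table memory per 1GiB chunk being split.)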

> 
> Eager-splitting pre-allocates those pages in order to split huge-pages
> into fully populated trees.  Which is needed in order to use FEAT_BBM
> and skipping the expensive TLBI broadcasts.  513 is just the number of
> pages needed to break a 1G huge-page.

I understand that. But it is also clear that 1GB huge pages are unlikely
to be THPs, and I wonder if we should treat the two differently. Using
HugeTLBFS pages is significant here.

> 
> We could optimize for smaller huge-pages, like 2M by splitting 1
> huge-page at a time: only preallocate one 4K page at a time.  The
> trick is how to know that we are splitting 2M huge-pages.  We could
> either get the vma pagesize or use hints from userspace.  I'm not sure
> that this is worth it though.  The user will most likely want to split
> big ranges of memory (>1G), so optimizing for smaller huge-pages only
> converts the left into the right:
> 
> alloc 1 page            |    |  alloc 512 pages
> split 2M huge-page      |    |  split 2M huge-page
> alloc 1 page            |    |  split 2M huge-page
> split 2M huge-page      | => |  split 2M huge-page
>                         ...
> alloc 1 page            |    |  split 2M huge-page
> split 2M huge-page      |    |  split 2M huge-page
> 
> Still thinking of what else to do.

I think the 1G case fits your own use case, but I doubt this covers
the majority of the users. Most people rely on the kernel's ability to
use THPs, which are capped at the first level of block mapping.

2MB (and 32MB for 16kB base pages) are the most likely mappings in my
experience (512MB with 64kB pages are vanishingly rare).

Having to pay an upfront cost for HugeTLBFS doesn't shock me, and it
fits the model. For THPs, where everything is opportunistic and the
user is not involved, this is a lot more debatable.

This is why I'd like this behaviour to be a buy-in, either directly (a
first class userspace API) or indirectly (the provenance of the
memory).

Thanks,

	M.
Marc Zyngier Jan. 31, 2023, 10:31 a.m. UTC | #11
On Mon, 30 Jan 2023 21:18:32 +0000,
Oliver Upton <oliver.upton@linux.dev> wrote:
> 
> On Fri, Jan 27, 2023 at 07:45:15AM -0800, Ricardo Koller wrote:
> > Hi Marc,
> > 
> > On Thu, Jan 26, 2023 at 12:10 PM Marc Zyngier <maz@kernel.org> wrote:
> 
> [...]
> 
> > >
> > > The one thing that would convince me to make it an option is the
> > > amount of memory this thing consumes. 512+ pages is a huge amount, and
> > > I'm not overly happy about that. Why can't this be a userspace visible
> > > option, selectable on a per VM (or memslot) basis?
> > >
> > 
> > It should be possible.  I am exploring a couple of ideas that could
> > help when the hugepages are not 1G (e.g., 2M).  However, they add
> > complexity and I'm not sure they help much.
> > 
> > (will be using PAGE_SIZE=4K to make things simpler)
> > 
> > This feature pre-allocates 513 pages before splitting every 1G range.
> > For example, it converts 1G block PTEs into trees made of 513 pages.
> > When not using this feature, the same 513 pages would be allocated,
> > but lazily over a longer period of time.
> > 
> > Eager-splitting pre-allocates those pages in order to split huge-pages
> > into fully populated trees.  Which is needed in order to use FEAT_BBM
> > and skipping the expensive TLBI broadcasts.  513 is just the number of
> > pages needed to break a 1G huge-page.
> > 
> > We could optimize for smaller huge-pages, like 2M by splitting 1
> > huge-page at a time: only preallocate one 4K page at a time.  The
> > trick is how to know that we are splitting 2M huge-pages.  We could
> > either get the vma pagesize or use hints from userspace.  I'm not sure
> > that this is worth it though.  The user will most likely want to split
> > big ranges of memory (>1G), so optimizing for smaller huge-pages only
> > converts the left into the right:
> > 
> > alloc 1 page            |    |  alloc 512 pages
> > split 2M huge-page      |    |  split 2M huge-page
> > alloc 1 page            |    |  split 2M huge-page
> > split 2M huge-page      | => |  split 2M huge-page
> >                         ...
> > alloc 1 page            |    |  split 2M huge-page
> > split 2M huge-page      |    |  split 2M huge-page
> > 
> > Still thinking of what else to do.
> 
> I think that Marc's suggestion of having userspace configure this is
> sound. After all, userspace _should_ know the granularity of the backing
> source it chose for guest memory.

Only if it is not using anonymous memory. That's the important distinction.

> 
> We could also interpret a cache size of 0 to signal that userspace wants
> to disable eager page split for a VM altogether. It is entirely possible
> that the user will want a differing QoS between slice-of-hardware and
> overcommitted VMs.

Absolutely. The overcommitted case would suffer from the upfront
allocation (these systems are usually very densely packed).

Thanks,

	M.
Oliver Upton Jan. 31, 2023, 5:45 p.m. UTC | #12
On Tue, Jan 31, 2023 at 01:18:15AM +0000, Sean Christopherson wrote:
> On Mon, Jan 30, 2023, Oliver Upton wrote:
> > I think that Marc's suggestion of having userspace configure this is
> > sound. After all, userspace _should_ know the granularity of the backing
> > source it chose for guest memory.
> > 
> > We could also interpret a cache size of 0 to signal that userspace wants
> > to disable eager page split for a VM altogether. It is entirely possible that
> > the user will want a differing QoS between slice-of-hardware and
> > overcommitted VMs.
> 
> Maybe.  It's also entirely possible that QoS is never factored in, e.g. if QoS
> guarantees for all VMs on a system are better met by enabling eager splitting
> across the board.
> 
> There are other reasons to use module/kernel params beyond what Marc listed, e.g.
> to let the user opt out even when something is on by default.  x86's TDP MMU has
> benefited greatly from downstream users being able to do A/B performance testing
> this way.  I suspect x86's eager_page_split knob was added largely for this
> reason, e.g. to easily see how a specific workload is affected by eager splitting.
> That seems like a reasonable fit on the ARM side as well.

There's a rather important distinction here in that we'd allow userspace
to select the page split cache size, which should be correctly sized for
the backing memory source. Considering the break-before-make rules of
the architecture, the only way eager split is performant on arm64 is by
replacing a block entry with a fully populated table hierarchy in one
operation. AFAICT, you don't have this problem on x86, as the
architecture generally permits a direct valid->valid transformation
without an intermediate invalidation. Well, ignoring iTLB multihit :)
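
Roughly, the single-replacement sequence looks like this (pseudocode
only; the helpers below are made up, not the pgtable API posted in the
series):

  /*
   * Split one 1G block under break-before-make: the replacement tree
   * is fully built from the pre-allocated cache first, so the live
   * mapping is invalidated exactly once.
   */
  new_table = build_fully_populated_tree(split_page_cache);	/* 1 + 512 pages */
  WRITE_ONCE(*pudp, 0);				/* break: zap the 1G block entry  */
  flush_stage2_tlb_range(mmu, addr, SZ_1G);	/* one round of TLBI for the range */
  WRITE_ONCE(*pudp, new_table);			/* make: install the table entry  */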

So, the largest transformation we need to do right now is on a PUD w/
PAGE_SIZE=4K, leading to 513 pages as proposed in the series. Exposing
that configuration option in a module parameter presumes that all
VMs on a host use the exact same memory configuration, which doesn't
feel right to me.

--
Thanks,
Oliver
Sean Christopherson Jan. 31, 2023, 5:54 p.m. UTC | #13
On Tue, Jan 31, 2023, Oliver Upton wrote:
> On Tue, Jan 31, 2023 at 01:18:15AM +0000, Sean Christopherson wrote:
> > On Mon, Jan 30, 2023, Oliver Upton wrote:
> > > I think that Marc's suggestion of having userspace configure this is
> > > sound. After all, userspace _should_ know the granularity of the backing
> > > source it chose for guest memory.
> > > 
> > > We could also interpret a cache size of 0 to signal that userspace wants
> > > to disable eager page split for a VM altogether. It is entirely possible that
> > > the user will want a differing QoS between slice-of-hardware and
> > > overcommitted VMs.
> > 
> > Maybe.  It's also entirely possible that QoS is never factored in, e.g. if QoS
> > guarantees for all VMs on a system are better met by enabling eager splitting
> > across the board.
> > 
> > There are other reasons to use module/kernel params beyond what Marc listed, e.g.
> > to let the user opt out even when something is on by default.  x86's TDP MMU has
> > benefited greatly from downstream users being able to do A/B performance testing
> > this way.  I suspect x86's eager_page_split knob was added largely for this
> > reason, e.g. to easily see how a specific workload is affected by eager splitting.
> > That seems like a reasonable fit on the ARM side as well.
> 
> There's a rather important distinction here in that we'd allow userspace
> to select the page split cache size, which should be correctly sized for
> the backing memory source. Considering the break-before-make rules of
> the architecture, the only way eager split is performant on arm64 is by
> replacing a block entry with a fully populated table hierarchy in one
> operation. AFAICT, you don't have this problem on x86, as the
> architecture generally permits a direct valid->valid transformation
> without an intermediate invalidation. Well, ignoring iTLB multihit :)
> 
> So, the largest transformation we need to do right now is on a PUD w/
> PAGE_SIZE=4K, leading to 513 pages as proposed in the series. Exposing
> that configuration option in a module parameter is presumptive that all
> VMs on a host use the exact same memory configuration, which doesn't
> feel right to me.

Can you elaborate on the cache size needing to be tied to the backing source?
Do the issues arise if you get to a point where KVM can have PGD-sized hugepages
with PAGE_SIZE=4KiB?  Or do you want to let userspace optimize _now_ for PMD+4KiB?
David Matlack Jan. 31, 2023, 6:01 p.m. UTC | #14
On Tue, Jan 31, 2023 at 9:46 AM Oliver Upton <oliver.upton@linux.dev> wrote:
>
> On Tue, Jan 31, 2023 at 01:18:15AM +0000, Sean Christopherson wrote:
> > On Mon, Jan 30, 2023, Oliver Upton wrote:
> > > I think that Marc's suggestion of having userspace configure this is
> > > sound. After all, userspace _should_ know the granularity of the backing
> > > source it chose for guest memory.
> > >
> > > We could also interpret a cache size of 0 to signal that userspace wants
> > > to disable eager page split for a VM altogether. It is entirely possible that
> > > the user will want a differing QoS between slice-of-hardware and
> > > overcommitted VMs.
> >
> > Maybe.  It's also entirely possible that QoS is never factored in, e.g. if QoS
> > guarantees for all VMs on a system are better met by enabling eager splitting
> > across the board.
> >
> > There are other reasons to use module/kernel params beyond what Marc listed, e.g.
> > to let the user opt out even when something is on by default.  x86's TDP MMU has
> > benefited greatly from downstream users being able to do A/B performance testing
> > this way.  I suspect x86's eager_page_split knob was added largely for this
> > reason, e.g. to easily see how a specific workload is affected by eager splitting.
> > That seems like a reasonable fit on the ARM side as well.
>
> There's a rather important distinction here in that we'd allow userspace
> to select the page split cache size, which should be correctly sized for
> the backing memory source. Considering the break-before-make rules of
> the architecture, the only way eager split is performant on arm64 is by
> replacing a block entry with a fully populated table hierarchy in one
> operation.

I don't see how this can be true if we are able to tolerate splitting
2M pages. Splitting 2M pages inherently means 512 Break-Before-Make
operations per 1GiB region of guest physical memory.

If we had a cache size of 1 and were splitting a 1GiB region, we would
then need to do 512+1 BBM per 1GiB region. If we can tolerate 512 per
1GiB, why not 513?

It seems more like the 513 cache size is there to optimize splitting
1GiB pages. I agree it can turn those 513 into 1, but future versions
of the architecture also elide BBM requirements, which is another way
to optimize 1GiB pages.
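
(Putting numbers on it, assuming 4K granules: a 1GiB range backed by 2M
blocks is 512 break-before-makes either way; a single 1G block split
step-wise is 1 + 512 = 513, versus exactly 1 with the pre-built
513-page tree.)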


> AFAICT, you don't have this problem on x86, as the
> architecture generally permits a direct valid->valid transformation
> without an intermediate invalidation. Well, ignoring iTLB multihit :)
>
> So, the largest transformation we need to do right now is on a PUD w/
> PAGE_SIZE=4K, leading to 513 pages as proposed in the series. Exposing
> that configuration option in a module parameter is presumptive that all
> VMs on a host use the exact same memory configuration, which doesn't
> feel right to me.
>
> --
> Thanks,
> Oliver
Ricardo Koller Jan. 31, 2023, 6:19 p.m. UTC | #15
On Tue, Jan 31, 2023 at 10:02 AM David Matlack <dmatlack@google.com> wrote:
>
> On Tue, Jan 31, 2023 at 9:46 AM Oliver Upton <oliver.upton@linux.dev> wrote:
> >
> > On Tue, Jan 31, 2023 at 01:18:15AM +0000, Sean Christopherson wrote:
> > > On Mon, Jan 30, 2023, Oliver Upton wrote:
> > > > I think that Marc's suggestion of having userspace configure this is
> > > > sound. After all, userspace _should_ know the granularity of the backing
> > > > source it chose for guest memory.
> > > >
> > > > We could also interpret a cache size of 0 to signal that userspace wants
> > > > to disable eager page split for a VM altogether. It is entirely possible that
> > > > the user will want a differing QoS between slice-of-hardware and
> > > > overcommitted VMs.
> > >
> > > Maybe.  It's also entirely possible that QoS is never factored in, e.g. if QoS
> > > guarantees for all VMs on a system are better met by enabling eager splitting
> > > across the board.
> > >
> > > There are other reasons to use module/kernel params beyond what Marc listed, e.g.
> > > to let the user opt out even when something is on by default.  x86's TDP MMU has
> > > benefited greatly from downstream users being able to do A/B performance testing
> > > this way.  I suspect x86's eager_page_split knob was added largely for this
> > > reason, e.g. to easily see how a specific workload is affected by eager splitting.
> > > That seems like a reasonable fit on the ARM side as well.
> >
> > There's a rather important distinction here in that we'd allow userspace
> > to select the page split cache size, which should be correctly sized for
> > the backing memory source. Considering the break-before-make rules of
> > the architecture, the only way eager split is performant on arm64 is by
> > replacing a block entry with a fully populated table hierarchy in one
> > operation.
>
> I don't see how this can be true if we are able to tolerate splitting
> 2M pages. Splitting 2M pages inherently means 512 Break-Before-Make
> operations per 1GiB region of guest physical memory.
>
> If we had a cache size of 1 and were splitting a 1GiB region, we would
> then need to do 512+1 BBM per 1GiB region. If we can tolerate 512 per
> 1GiB, why not 513?
>
> It seems more like the 513 cache size is more to optimize splitting
> 1GiB pages.

That's correct, although there's also a slight benefit for the 2M huge-pages
case, as the 512 huge-pages can be split without having to walk down
the tree from the root every time.

> I agree it can turn those 513 int 1, but future versions
> of the architecture also elide BBM requirements which is another way
> to optimize 1GiB pages.

There's also the CPU cost needed to walk down the tree from the root
513 times (same as above).

>
>
> > AFAICT, you don't have this problem on x86, as the
> > architecture generally permits a direct valid->valid transformation
> > without an intermediate invalidation. Well, ignoring iTLB multihit :)
> >
> > So, the largest transformation we need to do right now is on a PUD w/
> > PAGE_SIZE=4K, leading to 513 pages as proposed in the series. Exposing
> > that configuration option in a module parameter is presumptive that all
> > VMs on a host use the exact same memory configuration, which doesn't
> > feel right to me.
> >
> > --
> > Thanks,
> > Oliver
Oliver Upton Jan. 31, 2023, 6:35 p.m. UTC | #16
On Tue, Jan 31, 2023 at 10:01:45AM -0800, David Matlack wrote:
> On Tue, Jan 31, 2023 at 9:46 AM Oliver Upton <oliver.upton@linux.dev> wrote:
> >
> > On Tue, Jan 31, 2023 at 01:18:15AM +0000, Sean Christopherson wrote:
> > > On Mon, Jan 30, 2023, Oliver Upton wrote:
> > > > I think that Marc's suggestion of having userspace configure this is
> > > > sound. After all, userspace _should_ know the granularity of the backing
> > > > source it chose for guest memory.
> > > >
> > > > We could also interpret a cache size of 0 to signal that userspace wants
> > > > to disable eager page split for a VM altogether. It is entirely possible that
> > > > the user will want a differing QoS between slice-of-hardware and
> > > > overcommitted VMs.
> > >
> > > Maybe.  It's also entirely possible that QoS is never factored in, e.g. if QoS
> > > guarantees for all VMs on a system are better met by enabling eager splitting
> > > across the board.
> > >
> > > There are other reasons to use module/kernel params beyond what Marc listed, e.g.
> > > to let the user opt out even when something is on by default.  x86's TDP MMU has
> > > benefited greatly from downstream users being able to do A/B performance testing
> > > this way.  I suspect x86's eager_page_split knob was added largely for this
> > > reason, e.g. to easily see how a specific workload is affected by eager splitting.
> > > That seems like a reasonable fit on the ARM side as well.
> >
> > There's a rather important distinction here in that we'd allow userspace
> > to select the page split cache size, which should be correctly sized for
> > the backing memory source. Considering the break-before-make rules of
> > the architecture, the only way eager split is performant on arm64 is by
> > replacing a block entry with a fully populated table hierarchy in one
> > operation.
> 
> I don't see how this can be true if we are able to tolerate splitting
> 2M pages. Splitting 2M pages inherently means 512 Break-Before-Make
> operations per 1GiB region of guest physical memory.

'Only' was definitely not the right word here.

I fear that there is a rather limited degree of portability for any
observation that we derive from a particular system. Assuming the
absolute worst of hardware, TLBIs + serialization will become even more
of a bounding issue as the number of affected entities on the
interconnect scales.

So in that vein I don't believe it is easy to glean what may or may
not be tolerable.

> It seems more like the 513 cache size is more to optimize splitting
> 1GiB pages.

I don't believe anyone is particularly jazzed about this detail, so I
would be surprised if we accepted this as the default configuration.

> I agree it can turn those 513 int 1, but future versions
> of the architecture also elide BBM requirements which is another way
> to optimize 1GiB pages.

Level 2 break-before-make behavior is helpful but certainly not a
panacea. Most worrying is that implementations may generate TLB
conflict aborts, at which point the only remediation is to invalidate
the entire affected context.
Oliver Upton Jan. 31, 2023, 7:06 p.m. UTC | #17
On Tue, Jan 31, 2023 at 05:54:45PM +0000, Sean Christopherson wrote:
> On Tue, Jan 31, 2023, Oliver Upton wrote:
> > On Tue, Jan 31, 2023 at 01:18:15AM +0000, Sean Christopherson wrote:
> > > On Mon, Jan 30, 2023, Oliver Upton wrote:
> > > > I think that Marc's suggestion of having userspace configure this is
> > > > sound. After all, userspace _should_ know the granularity of the backing
> > > > source it chose for guest memory.
> > > > 
> > > > We could also interpret a cache size of 0 to signal that userspace wants
> > > > to disable eager page split for a VM altogether. It is entirely possible that
> > > > the user will want a differing QoS between slice-of-hardware and
> > > > overcommitted VMs.
> > > 
> > > Maybe.  It's also entirely possible that QoS is never factored in, e.g. if QoS
> > > guarantees for all VMs on a system are better met by enabling eager splitting
> > > across the board.
> > > 
> > > There are other reasons to use module/kernel params beyond what Marc listed, e.g.
> > > to let the user opt out even when something is on by default.  x86's TDP MMU has
> > > benefited greatly from downstream users being able to do A/B performance testing
> > > this way.  I suspect x86's eager_page_split knob was added largely for this
> > > reason, e.g. to easily see how a specific workload is affected by eager splitting.
> > > That seems like a reasonable fit on the ARM side as well.
> > 
> > There's a rather important distinction here in that we'd allow userspace
> > to select the page split cache size, which should be correctly sized for
> > the backing memory source. Considering the break-before-make rules of
> > the architecture, the only way eager split is performant on arm64 is by
> > replacing a block entry with a fully populated table hierarchy in one
> > operation. AFAICT, you don't have this problem on x86, as the
> > architecture generally permits a direct valid->valid transformation
> > without an intermediate invalidation. Well, ignoring iTLB multihit :)
> > 
> > So, the largest transformation we need to do right now is on a PUD w/
> > PAGE_SIZE=4K, leading to 513 pages as proposed in the series. Exposing
> > that configuration option in a module parameter is presumptive that all
> > VMs on a host use the exact same memory configuration, which doesn't
> > feel right to me.
> 
> Can you elaborate on the cache size needing to be tied to the backing source?

The proposed eager split mechanism attempts to replace a block with
a fully populated page table hierarchy (i.e. mapped at PTE granularity)
in order to avoid successive break-before-make invalidations. The cache
size must be >= the number of pages required to build out that fully
mapped page table hierarchy.

> Do the issues arise if you get to a point where KVM can have PGD-sized hugepages
> with PAGE_SIZE=4KiB?

Those problems arise when splitting any hugepage larger than a PMD. It just
so happens that the only configuration that supports larger mappings is
4K at the moment.

If we were to take the step-down approach to eager page splitting, there
would be a lot of knock-on break-before-make operations as we go
PUD -> PMD -> PTE.

> Or do you want to let userspace optimize _now_ for PMD+4KiB?

The default cache value should probably optimize for PMD splitting and
give userspace the option to scale that up for PUD or greater if it sees
fit.
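
Something along these lines (hypothetical helper, 4K granule assumed,
not code from the series):

  #include <linux/sizes.h>	/* SZ_2M, SZ_1G */

  /*
   * Minimum split-cache capacity, in pages, to replace one block of
   * @block_size with a fully populated tree in a single
   * break-before-make.
   */
  static unsigned long split_cache_min_pages(unsigned long block_size)
  {
  	if (block_size == SZ_2M)
  		return 1;			/* one PTE table              */
  	if (block_size == SZ_1G)
  		return 1 + SZ_1G / SZ_2M;	/* PMD table + 512 PTE tables */
  	return 0;				/* no block mapping this size */
  }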
Ricardo Koller Feb. 6, 2023, 4:35 p.m. UTC | #18
Hi Marc,

On Tue, Jan 31, 2023 at 2:28 AM Marc Zyngier <maz@kernel.org> wrote:
>
> On Fri, 27 Jan 2023 15:45:15 +0000,
> Ricardo Koller <ricarkol@google.com> wrote:
> >
> > > The one thing that would convince me to make it an option is the
> > > amount of memory this thing consumes. 512+ pages is a huge amount, and
> > > I'm not overly happy about that. Why can't this be a userspace visible
> > > option, selectable on a per VM (or memslot) basis?
> > >
> >
> > It should be possible.  I am exploring a couple of ideas that could
> > help when the hugepages are not 1G (e.g., 2M).  However, they add
> > complexity and I'm not sure they help much.
> >
> > (will be using PAGE_SIZE=4K to make things simpler)
> >
> > This feature pre-allocates 513 pages before splitting every 1G range.
> > For example, it converts 1G block PTEs into trees made of 513 pages.
> > When not using this feature, the same 513 pages would be allocated,
> > but lazily over a longer period of time.
>
> This is an important difference. It avoids the upfront allocation
> "thermal shock", giving time to the kernel to reclaim memory from
> somewhere else. Doing it upfront means you *must* have 2MB+ of
> immediately available memory for each GB of RAM you guest uses.
>
> >
> > Eager-splitting pre-allocates those pages in order to split huge-pages
> > into fully populated trees.  Which is needed in order to use FEAT_BBM
> > and skipping the expensive TLBI broadcasts.  513 is just the number of
> > pages needed to break a 1G huge-page.
>
> I understand that. But it also clear that 1GB huge pages are unlikely
> to be THPs, and I wonder if we should treat the two differently. Using
> HugeTLBFS pages is significant here.
>
> >
> > We could optimize for smaller huge-pages, like 2M by splitting 1
> > huge-page at a time: only preallocate one 4K page at a time.  The
> > trick is how to know that we are splitting 2M huge-pages.  We could
> > either get the vma pagesize or use hints from userspace.  I'm not sure
> > that this is worth it though.  The user will most likely want to split
> > big ranges of memory (>1G), so optimizing for smaller huge-pages only
> > converts the left into the right:
> >
> > alloc 1 page            |    |  alloc 512 pages
> > split 2M huge-page      |    |  split 2M huge-page
> > alloc 1 page            |    |  split 2M huge-page
> > split 2M huge-page      | => |  split 2M huge-page
> >                         ...
> > alloc 1 page            |    |  split 2M huge-page
> > split 2M huge-page      |    |  split 2M huge-page
> >
> > Still thinking of what else to do.
>
> I think the 1G case fits your own use case, but I doubt this covers
> the majority of the users. Most people rely on the kernel ability to
> use THPs, which are capped at the first level of block mapping.
>
> 2MB (and 32MB for 16kB base pages) are the most likely mappings in my
> experience (512MB with 64kB pages are vanishingly rare).
>
> Having to pay an upfront cost for HugeTLBFS doesn't shock me, and it
> fits the model. For THPs, where everything is opportunistic and the
> user not involved, this is a lot more debatable.
>
> This is why I'd like this behaviour to be a buy-in, either directly (a
> first class userspace API) or indirectly (the provenance of the
> memory).

This all makes sense, thanks for the explanation. I decided to implement
something for both cases: small caches (~1 page) where the PUDs are
split one PMD at a time, and bigger caches (>513) where the PUDs can
be split with a single replacement. The user specifies the size of the cache
via a capability, and a size of 0 implies no eager splitting (the feature is off).
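
Roughly, the buy-in could look something like this from userspace (the
capability name and number below are placeholders, nothing here is
defined yet):

  #include <linux/kvm.h>
  #include <sys/ioctl.h>

  #define KVM_CAP_ARM_EAGER_SPLIT_CACHE_SIZE	240	/* placeholder */

  /*
   * Hypothetical buy-in: a 513-page cache lets 1G blocks be split with
   * a single replacement; 0 would turn eager splitting off.
   */
  static void enable_eager_split(int vm_fd, unsigned long cache_pages)
  {
  	struct kvm_enable_cap cap = {
  		.cap  = KVM_CAP_ARM_EAGER_SPLIT_CACHE_SIZE,
  		.args = { cache_pages },
  	};

  	ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
  }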

Thanks,
Ricardo

>
> Thanks,
>
>         M.
>
> --
> Without deviation from the norm, progress is not possible.
diff mbox series

Patch

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 35a159d131b5..6ab37209b1d1 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -153,6 +153,36 @@  struct kvm_s2_mmu {
 	/* The last vcpu id that ran on each physical CPU */
 	int __percpu *last_vcpu_ran;
 
+	/*
+	 * Memory cache used to split EAGER_PAGE_SPLIT_CHUNK_SIZE worth of huge
+	 * pages. It is used to allocate stage2 page tables while splitting
+	 * huge pages. Its capacity should be EAGER_PAGE_SPLIT_CACHE_CAPACITY.
+	 * Note that the choice of EAGER_PAGE_SPLIT_CHUNK_SIZE influences both
+	 * the capacity of the split page cache (CACHE_CAPACITY), and how often
+	 * KVM reschedules. Be wary of raising CHUNK_SIZE too high.
+	 *
+	 * A good heuristic to pick CHUNK_SIZE is that it should be larger than
+	 * all the available huge-page sizes, and be a multiple of all the
+	 * other ones; for example, 1GB when all the available huge-page sizes
+	 * are (1GB, 2MB, 32MB, 512MB).
+	 *
+	 * CACHE_CAPACITY should have enough pages to cover CHUNK_SIZE; for
+	 * example, 1GB requires the following number of PAGE_SIZE-pages:
+	 * - 512 when using 2MB hugepages with 4KB granules (1GB / 2MB).
+	 * - 513 when using 1GB hugepages with 4KB granules (1 + (1GB / 2MB)).
+	 * - 32 when using 32MB hugepages with 16KB granule (1GB / 32MB).
+	 * - 2 when using 512MB hugepages with 64KB granules (1GB / 512MB).
+	 * CACHE_CAPACITY below assumes the worst case: 1GB hugepages with 4KB
+	 * granules.
+	 *
+	 * Protected by kvm->slots_lock.
+	 */
+#define EAGER_PAGE_SPLIT_CHUNK_SIZE		       SZ_1G
+#define EAGER_PAGE_SPLIT_CACHE_CAPACITY					\
+	(DIV_ROUND_UP_ULL(EAGER_PAGE_SPLIT_CHUNK_SIZE, SZ_1G) +		\
+	 DIV_ROUND_UP_ULL(EAGER_PAGE_SPLIT_CHUNK_SIZE, SZ_2M))
+	struct kvm_mmu_memory_cache split_page_cache;
+
 	struct kvm_arch *arch;
 };
 
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 700c5774b50d..41ee330edae3 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -31,14 +31,24 @@  static phys_addr_t hyp_idmap_vector;
 
 static unsigned long io_map_base;
 
-static phys_addr_t stage2_range_addr_end(phys_addr_t addr, phys_addr_t end)
+bool __read_mostly eager_page_split = true;
+module_param(eager_page_split, bool, 0644);
+
+static phys_addr_t __stage2_range_addr_end(phys_addr_t addr, phys_addr_t end,
+					   phys_addr_t size)
 {
-	phys_addr_t size = kvm_granule_size(KVM_PGTABLE_MIN_BLOCK_LEVEL);
 	phys_addr_t boundary = ALIGN_DOWN(addr + size, size);
 
 	return (boundary - 1 < end - 1) ? boundary : end;
 }
 
+static phys_addr_t stage2_range_addr_end(phys_addr_t addr, phys_addr_t end)
+{
+	phys_addr_t size = kvm_granule_size(KVM_PGTABLE_MIN_BLOCK_LEVEL);
+
+	return __stage2_range_addr_end(addr, end, size);
+}
+
 /*
  * Release kvm_mmu_lock periodically if the memory region is large. Otherwise,
  * we may see kernel panics with CONFIG_DETECT_HUNG_TASK,
@@ -71,6 +81,64 @@  static int stage2_apply_range(struct kvm *kvm, phys_addr_t addr,
 	return ret;
 }
 
+static inline bool need_topup(struct kvm_mmu_memory_cache *cache, int min)
+{
+	return kvm_mmu_memory_cache_nr_free_objects(cache) < min;
+}
+
+static bool need_topup_split_page_cache_or_resched(struct kvm *kvm)
+{
+	struct kvm_mmu_memory_cache *cache;
+
+	if (need_resched() || rwlock_needbreak(&kvm->mmu_lock))
+		return true;
+
+	cache = &kvm->arch.mmu.split_page_cache;
+	return need_topup(cache, EAGER_PAGE_SPLIT_CACHE_CAPACITY);
+}
+
+static int kvm_mmu_split_huge_pages(struct kvm *kvm, phys_addr_t addr,
+			      phys_addr_t end)
+{
+	struct kvm_mmu_memory_cache *cache;
+	struct kvm_pgtable *pgt;
+	int ret;
+	u64 next;
+	int cache_capacity = EAGER_PAGE_SPLIT_CACHE_CAPACITY;
+
+	lockdep_assert_held_write(&kvm->mmu_lock);
+
+	lockdep_assert_held(&kvm->slots_lock);
+
+	cache = &kvm->arch.mmu.split_page_cache;
+
+	do {
+		if (need_topup_split_page_cache_or_resched(kvm)) {
+			write_unlock(&kvm->mmu_lock);
+			cond_resched();
+			/* Eager page splitting is best-effort. */
+			ret = __kvm_mmu_topup_memory_cache(cache,
+							   cache_capacity,
+							   cache_capacity);
+			write_lock(&kvm->mmu_lock);
+			if (ret)
+				break;
+		}
+
+		pgt = kvm->arch.mmu.pgt;
+		if (!pgt)
+			return -EINVAL;
+
+		next = __stage2_range_addr_end(addr, end,
+					       EAGER_PAGE_SPLIT_CHUNK_SIZE);
+		ret = kvm_pgtable_stage2_split(pgt, addr, next - addr, cache);
+		if (ret)
+			break;
+	} while (addr = next, addr != end);
+
+	return ret;
+}
+
 #define stage2_apply_range_resched(kvm, addr, end, fn)			\
 	stage2_apply_range(kvm, addr, end, fn, true)
 
@@ -755,6 +823,8 @@  int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t
 	for_each_possible_cpu(cpu)
 		*per_cpu_ptr(mmu->last_vcpu_ran, cpu) = -1;
 
+	mmu->split_page_cache.gfp_zero = __GFP_ZERO;
+
 	mmu->pgt = pgt;
 	mmu->pgd_phys = __pa(pgt->pgd);
 	return 0;
@@ -769,6 +839,7 @@  int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t
 void kvm_uninit_stage2_mmu(struct kvm *kvm)
 {
 	kvm_free_stage2_pgd(&kvm->arch.mmu);
+	kvm_mmu_free_memory_cache(&kvm->arch.mmu.split_page_cache);
 }
 
 static void stage2_unmap_memslot(struct kvm *kvm,
@@ -996,6 +1067,29 @@  static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
 	stage2_wp_range(&kvm->arch.mmu, start, end);
 }
 
+/**
+ * kvm_mmu_split_memory_region() - split the stage 2 blocks into PAGE_SIZE
+ *				   pages for memory slot
+ * @kvm:	The KVM pointer
+ * @slot:	The memory slot to split
+ *
+ * Acquires kvm->mmu_lock. Called with kvm->slots_lock mutex acquired,
+ * serializing operations for VM memory regions.
+ */
+static void kvm_mmu_split_memory_region(struct kvm *kvm, int slot)
+{
+	struct kvm_memslots *slots = kvm_memslots(kvm);
+	struct kvm_memory_slot *memslot = id_to_memslot(slots, slot);
+	phys_addr_t start, end;
+
+	start = memslot->base_gfn << PAGE_SHIFT;
+	end = (memslot->base_gfn + memslot->npages) << PAGE_SHIFT;
+
+	write_lock(&kvm->mmu_lock);
+	kvm_mmu_split_huge_pages(kvm, start, end);
+	write_unlock(&kvm->mmu_lock);
+}
+
 /*
  * kvm_arch_mmu_enable_log_dirty_pt_masked - enable dirty logging for selected
  * dirty pages.
@@ -1783,7 +1877,19 @@  void kvm_arch_commit_memory_region(struct kvm *kvm,
 		if (kvm_dirty_log_manual_protect_and_init_set(kvm))
 			return;
 
+		if (READ_ONCE(eager_page_split))
+			kvm_mmu_split_memory_region(kvm, new->id);
+
 		kvm_mmu_wp_memory_region(kvm, new->id);
+	} else {
+		/*
+		 * Free any leftovers from the eager page splitting cache. Do
+		 * this when deleting, moving, disabling dirty logging, or
+		 * creating the memslot (a nop). Doing it for deletes makes
+		 * sure we don't leak memory, and there's no need to keep the
+		 * cache around for any of the other cases.
+		 */
+		kvm_mmu_free_memory_cache(&kvm->arch.mmu.split_page_cache);
 	}
 }